Jump to content

string similarity in G


ensegre

Recommended Posts

Posted

Out of curiosity, does anyone know of an available labview implementation of a string similarity algorithm? I might have a use case for it in a coming project.

I've seen that out there there is a lot, so at worst I might end up toying with reimplementing one.

e.g. https://stackoverflow.com/questions/653157/a-better-similarity-ranking-algorithm-for-variable-length-strings?noredirect=1&lq=1

https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Dice's_coefficient

 

Posted

I do have a very quick and dirty solution that I wrote many years ago without consulting the internet or what would be the right way to do this.  That being said I just ran it on the Stack Overflow test and for: Robert, and Amy Robertson my code had a 44% match, and for Robert and Richard it had a 0% match.  That being said if you can come up with anything better or more standard that would probably be better than this.

Compare String Confidence.vi

Posted (edited)

I see. So your one seems pretty similar to the Strike a Match, IIUC, with the differences that 1) it strips away from the strings all chars not in [0-9,a-z] (what, no uppercase?) (good for literal strings, but otherwise not always) (could be probably done more synthetically with a regexp), and that it uses the arithmetic sum of two ascii codes instead of a true digram, which can create more false matches. A starting point anyway, thanks.

Edited by ensegre
Posted

Yeah a lot of the decisions made there seem pretty dumb looking at it now.  But it was a personal project and really just needed something that could tell me if a string pulled from a webpage, had a similar enough string to a file name on disk.  Good luck.

Posted

You can check out my code I used for a run time spell checker.  It is used in a Spell Check QControl I wrote but I always meant to publish the engine standalone.  It is useable standalone.  It can be found here: https://gitlab.com/QSI_Shared_Code/SharedQControls/SpellcheckString.

It uses the Damerau–Levenshtein distance to calculate the "distance" between two words by number of operations to transform one word into the other word.  It counts operations as insertions, deletions or substitutions of a single character, or transposition of two adjacent characters.

Posted (edited)

Thanks, I'll check your one too. I've seen out on the internet that the most popular method suggested is the Dice (strike a match) metric, followed by the Levenshtein, beyond that it looks as it is mostly for theoreticians. Since the metrics have different merits, I have yet to figure out which may be better for what I have in mind.

In the meantime, however, here is my quick implementation of Dice. LV2021, based on sets, hence not so backportable (down to 2019, perhaps).

StringDigrams.vi StrikeAMatch.vi

Edited by ensegre
Posted

Thanks for the 2019 VIs. I compared the three algorithms.

Levenshtein works different. For better comparison, I normalized the value with:  1 - LevenshteinDistance / (Length(a)+Length(b)

Dice: Sørensen–Dice coefficient
CSC: Compare String Confidence.vi
Lev.: Levenshtein

snip.PNG.8425304390d0d27f6ba46680824ea68c.PNG

 

Posted (edited)

well, they are different, they measure different things. The question would be what is more meaningful in one or another use case. When I started thinking at it I had in mind an input corrector (which I'm not sure I'll really pursue) which warned the user about possible misspellings and likely corrections, but then one might consider that compared to "Digital input D1", "digital input D2" may be more right than "diggital input D1". Heuristics.

Edited by ensegre
Posted
15 hours ago, ensegre said:

When I started thinking at it I had in mind an input corrector (which I'm not sure I'll really pursue) which warned the user about possible misspellings and likely corrections,

I've never done somethig like this. I used the LevenshteinDistance to compare a lot of smaller strings (between 8 and 12 characters long). Because each of the strings was more or less significant, I used an additional factor for each of the strings and finally generated some king of an overall coefficient.

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Unfortunately, your content contains terms that we do not allow. Please edit your content to remove the highlighted words below.
Reply to this topic...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...

Important Information

By using this site, you agree to our Terms of Use.