string similarity in G

ensegre · November 17, 2021

Out of curiosity, does anyone know of an available labview implementation of a string similarity algorithm? I might have a use case for it in a coming project.

I've seen that out there there is a lot, so at worst I might end up toying with reimplementing one.

e.g. https://stackoverflow.com/questions/653157/a-better-similarity-ranking-algorithm-for-variable-length-strings?noredirect=1&lq=1

https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Dice's_coefficient

hooovahh · November 17, 2021

I do have a very quick and dirty solution that I wrote many years ago without consulting the internet or what would be the right way to do this. That being said I just ran it on the Stack Overflow test and for: Robert, and Amy Robertson my code had a 44% match, and for Robert and Richard it had a 0% match. That being said if you can come up with anything better or more standard that would probably be better than this.

Compare String Confidence.vi

ensegre · November 17, 2021

I see. So your one seems pretty similar to the Strike a Match, IIUC, with the differences that 1) it strips away from the strings all chars not in [0-9,a-z] (what, no uppercase?) (good for literal strings, but otherwise not always) (could be probably done more synthetically with a regexp), and that it uses the arithmetic sum of two ascii codes instead of a true digram, which can create more false matches. A starting point anyway, thanks.

Edited November 17, 2021 by ensegre

hooovahh · November 17, 2021

Yeah a lot of the decisions made there seem pretty dumb looking at it now. But it was a personal project and really just needed something that could tell me if a string pulled from a webpage, had a similar enough string to a file name on disk. Good luck.

The Q · November 17, 2021

You can check out my code I used for a run time spell checker. It is used in a Spell Check QControl I wrote but I always meant to publish the engine standalone. It is useable standalone. It can be found here: https://gitlab.com/QSI_Shared_Code/SharedQControls/SpellcheckString.

It uses the Damerau–Levenshtein distance to calculate the "distance" between two words by number of operations to transform one word into the other word. It counts operations as insertions, deletions or substitutions of a single character, or transposition of two adjacent characters.

ensegre · November 17, 2021

Thanks, I'll check your one too. I've seen out on the internet that the most popular method suggested is the Dice (strike a match) metric, followed by the Levenshtein, beyond that it looks as it is mostly for theoreticians. Since the metrics have different merits, I have yet to figure out which may be better for what I have in mind.

In the meantime, however, here is my quick implementation of Dice. LV2021, based on sets, hence not so backportable (down to 2019, perhaps).

StringDigrams.vi StrikeAMatch.vi

Edited November 17, 2021 by ensegre

mhenz · November 21, 2021

Regarding Levenshtein: Wladimir Levenshtein developed 1995 an algorithm for this. It is called the Levenshtein Distance.

Some years ago I developed a VI to calculate the Levenshtein Distance. Here it is (LabVIEW 2016).

Can you post your VIs in LV2020 or 2019, please.

Levenshtein Distance.vi

Edited November 21, 2021 by mhenz

ensegre · November 21, 2021

Here they come for 2019. For the little these require sets, i.e. only to find the common elements among two character arrays, I think that it would be pretty easy to reimplement them without, for earlier versions.

StrikeAMatch.vi StringDigrams.vi

mhenz · November 21, 2021

Thanks for the 2019 VIs. I compared the three algorithms.

Levenshtein works different. For better comparison, I normalized the value with: 1 - LevenshteinDistance / (Length(a)+Length(b)

Dice: Sørensen–Dice coefficient
CSC: Compare String Confidence.vi
Lev.: Levenshtein

ensegre · November 21, 2021

well, they are different, they measure different things. The question would be what is more meaningful in one or another use case. When I started thinking at it I had in mind an input corrector (which I'm not sure I'll really pursue) which warned the user about possible misspellings and likely corrections, but then one might consider that compared to "Digital input D1", "digital input D2" may be more right than "diggital input D1". Heuristics.

Edited November 21, 2021 by ensegre

mhenz · November 22, 2021

15 hours ago, ensegre said:

When I started thinking at it I had in mind an input corrector (which I'm not sure I'll really pursue) which warned the user about possible misspellings and likely corrections,

I've never done somethig like this. I used the LevenshteinDistance to compare a lot of smaller strings (between 8 and 12 characters long). Because each of the strings was more or less significant, I used an additional factor for each of the strings and finally generated some king of an overall coefficient.

Sign In

string similarity in G

Recommended Posts

ensegre

hooovahh

ensegre

hooovahh

The Q

ensegre

mhenz

ensegre

mhenz

ensegre

mhenz

Join the conversation

Browse

Activity

Important Information