ensegre Posted November 17, 2021 Report Posted November 17, 2021 Out of curiosity, does anyone know of an available labview implementation of a string similarity algorithm? I might have a use case for it in a coming project. I've seen that out there there is a lot, so at worst I might end up toying with reimplementing one. e.g. https://stackoverflow.com/questions/653157/a-better-similarity-ranking-algorithm-for-variable-length-strings?noredirect=1&lq=1 https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Dice's_coefficient Quote
hooovahh Posted November 17, 2021 Report Posted November 17, 2021 I do have a very quick and dirty solution that I wrote many years ago without consulting the internet or what would be the right way to do this. That being said I just ran it on the Stack Overflow test and for: Robert, and Amy Robertson my code had a 44% match, and for Robert and Richard it had a 0% match. That being said if you can come up with anything better or more standard that would probably be better than this. Compare String Confidence.vi Quote
ensegre Posted November 17, 2021 Author Report Posted November 17, 2021 (edited) I see. So your one seems pretty similar to the Strike a Match, IIUC, with the differences that 1) it strips away from the strings all chars not in [0-9,a-z] (what, no uppercase?) (good for literal strings, but otherwise not always) (could be probably done more synthetically with a regexp), and that it uses the arithmetic sum of two ascii codes instead of a true digram, which can create more false matches. A starting point anyway, thanks. Edited November 17, 2021 by ensegre Quote
hooovahh Posted November 17, 2021 Report Posted November 17, 2021 Yeah a lot of the decisions made there seem pretty dumb looking at it now. But it was a personal project and really just needed something that could tell me if a string pulled from a webpage, had a similar enough string to a file name on disk. Good luck. Quote
The Q Posted November 17, 2021 Report Posted November 17, 2021 You can check out my code I used for a run time spell checker. It is used in a Spell Check QControl I wrote but I always meant to publish the engine standalone. It is useable standalone. It can be found here: https://gitlab.com/QSI_Shared_Code/SharedQControls/SpellcheckString. It uses the Damerau–Levenshtein distance to calculate the "distance" between two words by number of operations to transform one word into the other word. It counts operations as insertions, deletions or substitutions of a single character, or transposition of two adjacent characters. Quote
ensegre Posted November 17, 2021 Author Report Posted November 17, 2021 (edited) Thanks, I'll check your one too. I've seen out on the internet that the most popular method suggested is the Dice (strike a match) metric, followed by the Levenshtein, beyond that it looks as it is mostly for theoreticians. Since the metrics have different merits, I have yet to figure out which may be better for what I have in mind. In the meantime, however, here is my quick implementation of Dice. LV2021, based on sets, hence not so backportable (down to 2019, perhaps). StringDigrams.vi StrikeAMatch.vi Edited November 17, 2021 by ensegre Quote
mhenz Posted November 21, 2021 Report Posted November 21, 2021 (edited) Regarding Levenshtein: Wladimir Levenshtein developed 1995 an algorithm for this. It is called the Levenshtein Distance. Some years ago I developed a VI to calculate the Levenshtein Distance. Here it is (LabVIEW 2016). Can you post your VIs in LV2020 or 2019, please. Levenshtein Distance.vi Edited November 21, 2021 by mhenz 1 Quote
ensegre Posted November 21, 2021 Author Report Posted November 21, 2021 Here they come for 2019. For the little these require sets, i.e. only to find the common elements among two character arrays, I think that it would be pretty easy to reimplement them without, for earlier versions. StrikeAMatch.vi StringDigrams.vi 1 Quote
mhenz Posted November 21, 2021 Report Posted November 21, 2021 Thanks for the 2019 VIs. I compared the three algorithms. Levenshtein works different. For better comparison, I normalized the value with: 1 - LevenshteinDistance / (Length(a)+Length(b) Dice: Sørensen–Dice coefficient CSC: Compare String Confidence.vi Lev.: Levenshtein Quote
ensegre Posted November 21, 2021 Author Report Posted November 21, 2021 (edited) well, they are different, they measure different things. The question would be what is more meaningful in one or another use case. When I started thinking at it I had in mind an input corrector (which I'm not sure I'll really pursue) which warned the user about possible misspellings and likely corrections, but then one might consider that compared to "Digital input D1", "digital input D2" may be more right than "diggital input D1". Heuristics. Edited November 21, 2021 by ensegre Quote
mhenz Posted November 22, 2021 Report Posted November 22, 2021 15 hours ago, ensegre said: When I started thinking at it I had in mind an input corrector (which I'm not sure I'll really pursue) which warned the user about possible misspellings and likely corrections, I've never done somethig like this. I used the LevenshteinDistance to compare a lot of smaller strings (between 8 and 12 characters long). Because each of the strings was more or less significant, I used an additional factor for each of the strings and finally generated some king of an overall coefficient. Quote
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.