I’ve been studying up on fuzzy string matching (matching strings despite misspellings, typos, dyslexia, etc.), and I found a few articles discussing various approaches, such as:
Levenshtein distance
Damerau–Levenshtein distance
n-gram
Soundex
Jaro-Winkler distance
Jaccard index
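To make the first metric on that list concrete, here is a minimal sketch of my own (not taken from any of the articles) of Levenshtein distance computed with dynamic programming, keeping only one row of the table at a time:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    # prev[j] holds the distance between a[:i-1] and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

Damerau–Levenshtein is the same idea plus a fourth case that counts an adjacent-character transposition (e.g. "hte" → "the") as a single edit.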
I found this video in which two presenters took a process for checking whether a name was on a terrorist watch list, which originally took 14 days to compute, down to 5 minutes:
What’s in a Name? Fast Fuzzy String Matching – Seth Verrinder & Kyle Putnam – Midwest.io 2015
Below are my notes from watching the fuzzy string matching video (it is ~40 minutes long but very interesting):
1) Throw more hardware at it
2) Use another variable/field (zip code, country, etc.)
3) n-grams
4) Metric trees (example metric: Levenshtein distance)
5) Brute force (Jaro–Winkler is pretty fast already) (5x speedup, down to 70 hrs)
6) Filtering: estimate similarity first, then filter (7x speedup, down to 50 hrs; 18 minutes into the video)
· Length of strings (name lengths are often not normally distributed, so this doesn’t rule out much; you probably still have to look at ~70% of pairs)
· 26-character filter: search for a character that isn’t shared between the two names. This dropped out quite a bit but was slow (300x speedup, down to 65 minutes)
o Bitmap filter: use bitwise operations to get the unmatched-character count; very fast! (340x speedup, down to 60 minutes; 20 minutes into the video)
o 64-character filter (used all 64 bits): also checked for multiple occurrences of a given letter
7) Minimize recalculation (4,000x speedup, down to 5 minutes; 28 minutes into the video)
· Sort names and group them into segments
· by common length and first character
· Used WolframAlpha to help show the formula
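The bitmap filter from item 6 can be sketched like this. The 26-bit layout and the cutoff threshold are my assumptions for illustration, not the speakers’ exact code:

```python
def char_bitmap(name: str) -> int:
    """26-bit bitmap: bit i is set if the letter chr(ord('a') + i)
    occurs anywhere in name (case-insensitive)."""
    bits = 0
    for ch in name.lower():
        if 'a' <= ch <= 'z':
            bits |= 1 << (ord(ch) - ord('a'))
    return bits

def unmatched_count(a: str, b: str) -> int:
    """Number of distinct letters that appear in exactly one of the
    two names, computed with a single XOR plus a popcount."""
    return bin(char_bitmap(a) ^ char_bitmap(b)).count("1")

def passes_filter(a: str, b: str, max_unmatched: int = 4) -> bool:
    # Cheap pre-check: only run the expensive Jaro-Winkler comparison
    # on pairs that share most of their letters.
    return unmatched_count(a, b) <= max_unmatched

print(unmatched_count("smith", "smyth"))  # 2 ('i' and 'y')
```

The 64-bit variant mentioned above extends this by spending the extra bits on repeated letters, so a name with two t’s is distinguished from a name with one.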
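And a sketch of the “minimize recalculation” idea from item 7: pre-sort the names into segments keyed by length and first character, so each query is only compared against plausible candidates. The exact segment key and the length tolerance here are my assumptions:

```python
from collections import defaultdict

def build_segments(names):
    """Group names into buckets keyed by (length, first character)."""
    segments = defaultdict(list)
    for name in names:
        if name:
            segments[(len(name), name[0].lower())].append(name)
    return segments

def candidate_names(query, segments, max_len_diff=1):
    """Yield only names from segments that could plausibly match:
    same first character, length within max_len_diff of the query."""
    for delta in range(-max_len_diff, max_len_diff + 1):
        yield from segments.get((len(query) + delta, query[0].lower()), [])

watch_list = ["smith", "smyth", "jones", "smithe"]
segs = build_segments(watch_list)
print(sorted(candidate_names("smith", segs)))  # ['smith', 'smithe', 'smyth']
```

Requiring a matching first character fits Jaro–Winkler in particular, since it boosts scores for a shared prefix, but it would miss typos in the first letter; that trade-off is the kind of thing the talk measures rather than assumes.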
Learnings from the Fuzzy String Matching process
· Measure performance and focus on the bottleneck
· Order of magnitude (big-O) doesn’t always tell you about actual performance
· Favor simplicity
Approximate string matching – Wikipedia
kiyoka/fuzzy-string-match · GitHub
fuzzy-string-match | RubyGems.org