## Various fuzzy string match algorithms & Excellent video review

I’ve been studying up on fuzzy string matching as a way to control for misspellings, typos, dyslexia, etc., and I found a few articles discussing various approaches (linked at the end of this post).

I also found this video from two guys who took a process for checking whether a name was on a terrorist watch list, which originally took 14 days to compute, down to 5 minutes:
What’s in a Name? Fast Fuzzy String Matching – Seth Verrinder & Kyle Putnam – Midwest.io 2015

Below are my notes from watching the fuzzy string match video (it is ~40 minutes long but very interesting)

1) throw more hardware
2) use another variable/field (zip code / country etc.)
3) n-grams
4) metric trees (example: Levenshtein distance)
5) Brute force (Jaro-Winkler is pretty fast already) (5x, down to 70 hrs)
6) Filtering- estimate similarity first, then filter (7x, down to 50 hrs; 18 minutes into the video)
· Length of strings (name length often is not normally distributed, so it doesn’t rule out too much; you probably still have to look at 70%)
· 26 character filter- search for characters that aren’t shared- this dropped out quite a bit but was slow (300x, down to 65 minutes)
o Bitmap filter- use bitwise operations to get the unmatched count- very fast! (340x, down to 60 minutes; 20 minutes into the video)
o 64 character filter (used all bits)- checked for multiple occurrences of a given letter

7) Minimize recalculation (4,000x, down to 5 minutes; 28 minutes into the video)
· sort names and group them into segments
· common length and first character
· used WolframAlpha to help show formula

## Learnings from Fuzzy String Match process

· Measure performance and focus on bottleneck
· Order of magnitude doesn’t always tell you about actual performance
· Favor simplicity

Approximate string matching – Wikipedia, the free encyclopedia

In computer science, approximate string matching (often colloquially referred to as fuzzy string searching) is the technique of finding strings that match a pattern …

kiyoka/fuzzy-text-match · GitHub

fuzzy string match library for ruby. Contribute to text-match development by creating an account on GitHub.

text-match | RubyGems.org | your community gem host

text-match 0.9.7. calculate Jaro Winkler distance. Versions: 0.9.7 – December 21, 2013 (13.5 KB); 0.9.6 – December 21, 2013 (13.5 KB); 0.9.5 – March 26, …

## AHK Studio is an impressive IDE / editor for AutoHotkey

AHK Studio is an amazing and impressive IDE / Editor for AutoHotkey

I had an in-depth Hangout with Chad Wilson (Maestrith on the forum), the Author and Designer of AHK Studio.   Check out my AHK Studio tutorials here

While I’ve been, and still am, a very satisfied SciTE4AHK user, I was very impressed with many aspects of the tool.  It is very intuitive to use and offers some great features that will simplify coding.  Not surprisingly, AHK Studio is loaded with hotkeys that, once you familiarize yourself with them, will be awesome!  While advanced AutoHotkey programmers will love the advanced functionality, noobs will enjoy its simplicity.

Here are links where you can download it from the AHK forum or from GitHub.  Please keep in mind it is still in development.  (This is both good and bad.  It is good because Chad is very active and open to tweaks/fixes/improvements, bad because “kinks” are never fun.)

Here are a few videos on AHK Studio  and below is the nearly 2-hour video demonstrating some of the configuration settings and functionality.

## Get text from a list box; use AutoHotKey to grab items from programs

Sometimes I’d like to be able to programmatically extract values listed inside a program.  Unfortunately, many programs I use do not provide a way to get text from a list box.

One of AHK’s great strengths is how well it “hooks” into Windows.  I wrote an AutoHotkey script which allows me to copy and paste a list of the items selected in the window.  There are lots of options: retrieve all items, retrieve only those selected, or obtain the count of either.  Once you have all the items, you can send instructions back to the list box and specify which ones you want selected (thus, if you frequently go back and select the same items, it can automate the process).

```
loop {
ControlGet, Sel_CT, List, Count Selected, SysListView321, A ;Gets count of items selected from last active window
ToolTip % Sel_CT
sleep, 300
}

#IfWinActive ahk_class #32770 ;Only run below if in Specific window type
Browser_Back::
ControlGet, Selected_Items,List,Selected      ,SysListView321, A ;gets Selected Items in last active window
ControlGet, Selected_CT   ,List,Selected Count,SysListView321, A ;gets count of selected items in last active window

ControlGet, All_Items,List,      ,SysListView321, A ;gets list of all items in last active window
ControlGet, All_CT   ,List, Count,SysListView321, A ;gets count of all items in last active window

Clipboard:=Selected_Items
MsgBox % "Number of Items selected: " Selected_CT "`r`r" Selected_Items
MsgBox % "Total number of Items: " All_CT "`r`r" All_Items
return
#IfWinActive
```

## SPSS missing values

The syntax around SPSS variables with missing values is unintuitive, confusing, and poorly documented!

I know of 3 different types of commands and knowing which one to use when is not clear.  Setting SPSS missing values is a great way to simplify your analysis.  It is also a user-friendly way to remove (hide) outliers.  This video gives a short demo of how to use the three that I use frequently.

If you want to declare a value in a cell as missing the following syntax will give you a good start.

```
if (Var1=1) Var1=$sysmis.
exe.
```

If you want to remove the values that are in a variable (define them as missing) the following syntax will be what you need.

` MISSING VALUES Var1 to Var10 (99).`

SPSS missing values Macros

Below are two macros to help with missing data.  The first one is for when you want to check that a given value is present in another variable before recoding the missing values to zero.  The second one recodes missing values to zero in all of the listed variables.

```
*///////////////.
DEFINE !Rep_Miss (Beg = !TOKENS (1) /Prez = !TOKENS (1) /End = !TOKENS (1))
Do if !PREZ>0.
do repeat v=!BEG to !END.
if missing (v) v=0.
end repeat.
end if.
exe.
!ENDDEFINE.
*///////////////.

!Rep_Miss Prez=presentvariable Beg=v11 End=v19.

*///////////////.
DEFINE !Rep_Miss2 (Beg = !TOKENS (1) /End = !TOKENS (1))
do repeat v=!BEG to !END.
if missing (v) v=0.
end repeat.
exe.
!ENDDEFINE.
*///////////////.
/*!Rep_Miss2 Beg=v11 End=v19.
```