Moving data from one column to many; Parsing a variable using an SPSS macro

one column to many

Older online vendor tools and databases would frequently put multi-select questions into one column having a pipe,tab,semicolon or comma delimiter (what was real fun is when they would use a comma for a delimiter in a CSV file).

This can be very problematic in nearly any tool. In this video I demonstrate how easy it can be to move data from one column to many with an SPSS macro.

Move one column to many

Here is the SPSS macro demonstrated in the video:

 
*///////////////.
DEFINE !Parse (Var !TOKENS (1)  / Stem !TOKENS (1) /Del !TOKENS (1))
STRING #(A1000).
VECTOR !Stem(25A500).
COMPUTE #=CONCAT(RTRIM(!Var),!Del).
COMPUTE #cnt=1.
LOOP IF INDEX(#,!Del)>0.
COMPUTE !Stem(#cnt)=SUBSTR(#,1,INDEX(#,!Del)-1).
COMPUTE #cnt=#cnt + 1.
COMPUTE #=SUBSTR(#,INDEX(#,!Del)+1).
Var width !Concat(!Stem,1) to !Concat(!Stem,25) (10).
END LOOP.
EXECUTE.
!ENDDEFINE.
*///////////////.
/* !Parse Var=NAIC_All Stem=NAIC  Del=";".

 

SPSS Macros for replacing missing data

SPSS macro

SPSS Macros for replacing missing data

A lot of online vendor tools provide a data value on a multi-select question however they leave the value missing if the respondent didn’t select it. Most of the analysis I want to do needs to have a zero (or some other value) present in order to calculate the statistics correctly.

I wrote a few SPSS Macros that make it very easy to replace the missing values on your data.  Here are the two macros

 
*///////////////.
DEFINE !Rep_Miss  (Beg !TOKENS (1) /Prez !TOKENS (1)  /End !TOKENS (1))
Do if !PREZ>0.
do repeat v=!BEG to !END.
if missing (v) v=0.
end repeat.
end if.
exe.
!ENDDEFINE.
*///////////////.

/*!Rep_Miss  Prez=presentvariable  Beg=v11  End=v19.

*///////////////.
DEFINE !Rep_Miss2  (Beg !TOKENS (1)   /End !TOKENS (1))
do repeat v=!BEG to !END.
if missing (v) v=0.
end repeat.
exe.
!ENDDEFINE.
*///////////////.
/*!Rep_Miss2   Beg=v11  End=v19.

 

Various fuzzy string match algorithms & Excellent video review

Fuzzy String Match

I’ve been studying up on fuzzy string match after controlling for misspellings, typos, dyslexia etc.  and I found a few articles discussing various approaches like:

Levenstein distance
Damerau–Levenshtein distance
n-gram
Soundex
Jaro-Winkler distance
Jaccard index

I found this video from two guys which took a process of checking to see if a name was on a terrorist watch lists which originally took 14 days to compute down to 5 minutes
What’s in a Name? Fast Fuzzy String Matching – Seth Verrinder & Kyle Putnam – Midwest.io 2015

Below are my notes from watching the fuzzy string match video (it is ~40 minutes long but very interesting)

1) throw more hardware
2) use another variable/field (zip code / country etc.)
3) n-grams
4) metric trees (example: Lowenstein distance)
5) Brute force (Jaro Winkler is pretty fast already) (5X down to 70hrs )
6) Filtering- estimate similarity first then filter (7x down to 50 hrs 18 minutes in video)
· Length of strings (name length often is not normally distributed so doesn’t rule out too much) Probably still look at 70%
· 26 Character filter- search for character that isn’t shared- This dropped out quite a bit but was slow (300x down to 65 minutes)
o Bitmap filter- use bitwise operations to get unmatched count- very fast! (340X down to 60 minutes 20 minutes in video)
o 64 character filter (used all bits)- checked for multiple occurrences of a given letter

7) Minimize recalculation (4,000x down to 5 minutes – 28 minutes in video)
· sort names and groups into segments
· common length and first character
· used WolframAlpha to help show formula

Learnings from Fuzzy String Match process

· Measure performance and focus on bottleneck
· Order of magnitude doesn’t always tell you about actual performance
· Favor simplicity

Approximate text matching – Wikipedia, the free encyclopedia

In computer science, approximate string matching (often colloquially referred to as fuzzy string searching) is the technique of finding strings that match a pattern …

kiyoka/fuzzy-text-match · GitHub

fuzzy string match library for ruby. Contribute to text-match development by creating an account on GitHub.

text-match | RubyGems.org | your community gem host

text-match 0.9.7. calculate Jaro Winkler distance. Versions: 0.9.7 – December 21, 2013 (13.5 KB); 0.9.6 – December 21, 2013 (13.5 KB); 0.9.5 – March 26, …

SPSS missing values & corresponding “missing” syntax to solve your needs

SPSS missing values

SPSS missing values

The syntax around SPSS variables with missing values is not intuitive, confusing, and poorly documented!

I know of 3 different types of commands and knowing which one to use when is not clear.  Setting SPSS missing values is a great way to simplify your analysis.  It is also a user-friendly way to remove (hide) outliers.  This video gives a short demo of how to use the three that I use frequently.

 

If you want to declare a value in a cell as missing the following syntax will give you a good start.

 if   Var1=1  Var1=$sysmis.
exe.

 

If you want to remove the values that are in a variable (define them as missing) the following syntax will be what you need.

 MISSING VALUES Var1 to Var10 (99).

 

SPSS missing values Macros

Below are two macros to help with missing data.  The first one is used when you first want to see if there is a given value present in another variable before declaring it recoding the missing a zero.  The second one will recode all variables with missing values a zero.

 
*///////////////.
DEFINE !Rep_Miss (Beg !TOKENS (1) /Prez !TOKENS (1) /End !TOKENS (1))
Do if !PREZ>0.
do repeat v=!BEG to !END.
if missing (v) v=0.
end repeat.
end if.
exe.
!ENDDEFINE.
*///////////////.

!Rep_Miss Prez=presentvariable Beg=v11 End=v19.

*///////////////.
DEFINE !Rep_Miss2 (Beg !TOKENS (1) /End !TOKENS (1))
do repeat v=!BEG to !END.
if missing (v) v=0.
end repeat.
exe.
!ENDDEFINE.
*///////////////.
/*!Rep_Miss2 Beg=v11 End=v19.