Fingers Are Fat
Fat-fingering is a very common problem. A lot of the data that flows into analytics and reporting originates from manual data entry, which means the data contains typos. Lots of them.
For example, people with unusual, long names often show up in multiple data tables under different spellings because some data entry clerk somewhere just couldn't get the name right.
Same or Kind of Same?
To deal with this problem, there are algorithms that calculate how different two strings are. Levenshtein distance is one such algorithm: it counts how many edits it takes to turn one string into another.
This works well on typos where you mistype a couple of letters, or even accidentally add or delete a couple of characters. Levenshtein tells you exactly how many edits were made, and you can apply a threshold to decide what is and isn't a typo.
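To make that concrete, here is a minimal sketch in Python. The dynamic-programming routine, the helper names, and the threshold of 2 are illustrative choices, not a prescription.

```python
# A minimal sketch of Levenshtein distance in pure Python (no libraries needed).
# The helper names and the threshold of 2 are illustrative, not a recommendation.

def levenshtein(a, b):
    """Count the single-character inserts, deletes, and substitutions
    needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a                        # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))     # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,              # delete ca
                               current[j - 1] + 1,           # insert cb
                               previous[j - 1] + (ca != cb)  # substitute ca -> cb
                               ))
        previous = current
    return previous[-1]

def looks_like_typo(entered, expected, threshold=2):
    return levenshtein(entered, expected) <= threshold

print(levenshtein("Johansson", "Johanson"))      # 1: a single 's' was dropped
print(looks_like_typo("Johansson", "Johanson"))  # True under a threshold of 2
```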
Unspeakable Typos
Levenshtein works well on regular words, but what about codes? Codes such as part numbers or model numbers are frequently mistyped because they are nonsensical to humans.
Considering how Levenshtein works, it seems that it should work equally well on mistyped codes. But what about the nature of the mistakes themselves? People might mistype a letter or two in a word, but do they make the same kind of mistakes when typing alphanumeric codes? Do they make a couple of characters' worth of mistakes, or do they mangle the whole field? Would Levenshtein still work in those cases?
VIN Problem
I recently ran into such a problem with the Vehicle Identification Number (VIN). A VIN is a unique number assigned to every car sold in the US since 1954. Today it is used to identify almost any vehicle – almost, because cars sold in some countries (surprisingly, Japan) do not have VINs.
What I had was two sets of data containing VINs. They were supposedly the same dataset with the same VINs, but many VINs did not match between the two.
The reason for the mismatch was that the VINs had been corrected. The first table had the original entries, which contained mistakes. The second table had the mistakes corrected.
I needed data from both tables, and the VIN was really the only way to uniquely identify a vehicle. What we had was a two-table join, except that the field used to join had been edited in one table. Consequently, I ended up with far fewer records after the join: the edited records were dropped because the VINs didn't match exactly.
Levenshtein to the Rescue
So we figured that if we calculated the Levenshtein distance between two VINs and it was below a certain threshold, they were likely the same VIN.
We tested this by taking some rows from the first set and blindly calculating the Levenshtein distance of each VIN against each and every VIN in the other set.
It turns out the distances were either greater than 13 (entirely different) or less than 5 (pretty close). There was nothing in between. With this, I was able to match every single VIN to the other table.
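In spirit, the test looked something like the sketch below, reusing the levenshtein() helper from the earlier sketch. The sample VINs and the threshold of 5 are placeholders that mirror the numbers above, not our actual data or code.

```python
# A rough sketch of the blind all-pairs test: every VIN from the first set
# against every VIN from the second, reusing levenshtein() from above.
# The VIN lists below are made-up placeholders.

original_vins = ["1HGBH41JXMN1O9186"]                         # entered with a typo (letter O for zero)
corrected_vins = ["1HGBH41JXMN109186", "5YJSA1E26HF000337"]   # the cleaned-up set

matches = {}
for vin_a in original_vins:
    distances = [(levenshtein(vin_a, vin_b), vin_b) for vin_b in corrected_vins]
    best_distance, best_vin = min(distances)
    if best_distance < 5:             # in our data everything was either <5 or >13
        matches[vin_a] = best_vin     # close enough: treat as the same VIN
    # anything else is "entirely different" and stays unmatched

print(matches)   # {'1HGBH41JXMN1O9186': '1HGBH41JXMN109186'}
```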
Fat Fingering VIN
We then analyzed the fat-finger patterns of these VINs closely. As it turns out, people entering these codes are pretty careful. The most common pattern was missing just one character, usually a digit. People often misread 0, 8, 9, and 6, because apparently they look similar. The worst distance came from a VIN containing five zeros and nines that were entered as nines and zeros instead.
The thing to note is that the nature of the mistakes and of the data was such that the mistakes didn't result in multiple proximity. If you have a lot of similar codes close to each other, a mistyped code can have a similar Levenshtein distance to many of them, and you would need another way, or an additional qualifier, to make the match, as in the sketch below.
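A hedged sketch of what such a guard could look like: if more than one candidate falls under the threshold, flag the match as ambiguous instead of guessing. The function name and threshold are, again, just placeholders.

```python
# Sketch of a multiple-proximity guard, reusing levenshtein() from the first sketch.
# If several candidates fall within the threshold, the match is ambiguous and needs
# an additional qualifier (model year, make, etc.) to resolve.

def match_with_ambiguity_check(entered, candidates, threshold=5):
    close = [c for c in candidates if levenshtein(entered, c) < threshold]
    if len(close) == 1:
        return close[0]   # exactly one close candidate: accept it
    return None           # zero or several close candidates: don't guess
```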
Still, Levenshtein or a similar algorithm should be evaluated for cleansing manually entered code fields like VINs.
Applying MapReduce
In this case, we calculated Levenshtein distances between every entry of one large set and every entry of the other, because we were concerned about multiple proximity. Running such a nested-loop comparison can be time consuming when the data sets are large.
For this study, we coded a quick MapReduce job on Hadoop that calculates the Levenshtein distance of each entry against every entry in the other set, so the comparison runs in an embarrassingly parallel manner.
This is basically a join where the join condition is (Levenshtein distance < threshold). As awkward as joins are in Hadoop MapReduce, joining on an arbitrary calculated value is quite powerful.
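One way to sketch such a job, assuming Hadoop Streaming with Python (not necessarily how ours was written): each map task reads a slice of the first VIN set from stdin and compares every line against the full second set, which is shipped alongside the job as a side file. The file name corrected_vins.txt is a made-up placeholder.

```python
#!/usr/bin/env python
# mapper.py -- a hedged Hadoop Streaming sketch, not the exact job we ran.
# Reads VINs from stdin (the first set, split across map tasks), compares each
# against the full second set loaded from "corrected_vins.txt" (a placeholder
# side file distributed with the job), and emits candidate matches as
# original_vin <TAB> matched_vin <TAB> distance for distances under 5.

import sys

def levenshtein(a, b):
    # same dynamic-programming routine as in the earlier sketch
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]

THRESHOLD = 5
with open("corrected_vins.txt") as f:
    corrected = [line.strip() for line in f if line.strip()]

for line in sys.stdin:
    vin = line.strip()
    if not vin:
        continue
    for candidate in corrected:
        d = levenshtein(vin, candidate)
        if d < THRESHOLD:
            print("%s\t%s\t%d" % (vin, candidate, d))
```

Such a mapper would be submitted as a map-only job (zero reducers) with the streaming jar, something along the lines of hadoop jar hadoop-streaming.jar -files corrected_vins.txt -mapper mapper.py -input original_vins -output matches -numReduceTasks 0, where the jar name and paths are placeholders. Each map task then becomes an independent inner loop.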
And the performance was actually quite good. Even though we had only 10 map tasks running in parallel on our petite Hadoop cluster, that was the same as executing 10 inner loops in parallel.