Levenshtein on VIN

Fingers Are Fat

Fat-fingering is a very common problem. A lot of the data that flows into analytics and reporting originates from manual data entry, which means the data contains typos. Lots of them.

For example, people with unusual, long names often show up in multiple data tables under different spellings, because some data entry clerk somewhere just couldn't get the name right.

Same or Kind of Same?

To address this problem, there are algorithms that calculate how different two strings are. Levenshtein distance is one such algorithm. It counts the minimum number of single-character edits (insertions, deletions, and substitutions) it takes to turn one string into another.

This works well on typos where you mistype a couple of letters, or even accidentally add or delete a couple of characters. Levenshtein correctly counts the edits, and you can apply a threshold to decide what is and isn't a typo.
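As a concrete illustration, here is a minimal Levenshtein implementation in Python. This is just a sketch of the textbook dynamic-programming version; in practice you would likely reach for an existing library implementation.

    # Levenshtein distance: the minimum number of single-character
    # insertions, deletions, and substitutions to turn a into b.
    def levenshtein(a, b):
        prev = list(range(len(b) + 1))  # distances for the empty prefix of a
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    # A threshold then decides what counts as "same word, just mistyped":
    print(levenshtein("MISSISSIPPI", "MISSISIPPI"))  # 1 -- almost certainly a typo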

Unspeakable Typos

Levenshtein works well on regular words, but what about codes? Codes, such as part numbers or model numbers, are frequently mistyped because they are nonsensical to humans.

Considering how Levenshtein works, it seems it should work equally well on mistyped codes. But what about the nature of the mistakes themselves? People might mistype a letter or two in a word, but are the mistakes the same kind when people type in alphanumeric codes? Do they get a couple of characters wrong, or do they mangle the whole field? Would Levenshtein still work in these cases?

VIN Problem

I recently ran into such a problem with Vehicle Identification Numbers (VINs). A VIN is a unique number assigned to every car sold in the US since 1954. Today it is used to identify any vehicle – well, almost any vehicle, since cars sold in some countries do not have VINs (Japan, surprisingly).

What I had was two sets of data containing VINs. They were supposedly the same dataset with the same VINs, but many VINs did not match between the two.

The reason for the mismatch was that VINs had been corrected. The first table had the original entries, which contained mistakes. The second table had the mistakes corrected.

I needed data from both tables, and VIN was really the only way to uniquely identify a vehicle. What we had was a two-table join, except that the field used to join had been edited in one table. Consequently, I ended up with a lot fewer records after the join. The edited records were dropped because the VINs didn't match exactly.

Levenshtein to the Rescue

So we figured that if the Levenshtein distance between two VINs was below a certain threshold, they were likely the same VIN.

We tested this out by taking some rows from the first set and blindly calculating the Levenshtein distance of each VIN against every single VIN in the other set.
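In code, that blind cross-comparison looks roughly like this sketch, where vins_a and vins_b are hypothetical lists of VIN strings and levenshtein() is the function sketched earlier:

    THRESHOLD = 5  # justified by the distance distribution described below

    matches = {}
    for a in vins_a:
        # find the closest VIN in the other set
        best = min(vins_b, key=lambda b: levenshtein(a, b))
        if levenshtein(a, best) < THRESHOLD:
            matches[a] = best  # close enough to call it the same VIN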

It turned out the Levenshtein distances were either greater than 13 (entirely different) or less than 5 (pretty close). There was nothing in between. With this, I was able to match every single VIN to the other table.

Fat Fingering VIN

We then analyzed the fat-finger patterns of these VINs closely. As it turns out, people entering these codes are pretty careful. The most common pattern was missing just one character, usually a digit. People often misread 0, 8, 9, and 6, apparently because they look similar. The worst distance was caused by a VIN with five zeros and nines in which the zeros were typed as nines and the nines as zeros.

The thing to note is that the nature of the mistakes and of the data itself was such that the mistakes didn't result in multiple proximity. If you have a lot of similar codes that are close to each other, a mistyped code can have a similar Levenshtein distance to many of them. In that case, we would have to find another way, or an additional qualifier, to match.

Still, Levenshtein or a similar algorithm should be evaluated for cleansing manually entered code fields like VINs.

Applying MapReduce

In this case, we calculated Levenshtein distances between every entry of one large set and every entry of the other, because we were concerned about multiple proximity. Running such a nested-loop comparison can be time consuming if the data sets are large.

For this study, we coded a quick MapReduce job on Hadoop that calculates the Levenshtein distance of an entry against every entry in the other set, so the whole comparison runs in an embarrassingly parallel manner.

This is basically a join where the join operator is (Levenshtein distance < threshold). As awkward as joins are in Hadoop MapReduce, joining on an arbitrary calculated value is quite powerful.
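Our job was written in Java, but a Hadoop Streaming mapper in Python shows the shape of it. This is only a sketch with hypothetical file names; it assumes the second, smaller set is shipped to every task as a side file (e.g. via the streaming jar's -files option):

    #!/usr/bin/env python
    # mapper.py -- sketch of a Hadoop Streaming mapper. Each task reads a
    # slice of the first set from stdin and compares it against the whole
    # second set, so the nested loop runs embarrassingly parallel.
    import sys

    THRESHOLD = 5

    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                               prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    # vins_b.txt (hypothetical name) is assumed to be shipped to each task
    with open("vins_b.txt") as f:
        vins_b = [line.strip() for line in f if line.strip()]

    for line in sys.stdin:
        vin_a = line.strip()
        if not vin_a:
            continue
        for vin_b in vins_b:
            d = levenshtein(vin_a, vin_b)
            if d < THRESHOLD:  # the "join operator"
                print("%s\t%s\t%d" % (vin_a, vin_b, d))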

And performance was actually quite good. Even though we only had 10 map tasks running in parallel on our petite Hadoop cluster, that was the same as executing 10 inner loops in parallel.

Reverse a string

Reversing a string is a pretty common interview question.

I can’t recall ever having to reverse a single damn string in my programming career, so it’s probably safe to say string reversal is very specific to job interviews.

That may make it seem useless, but it is a realistic use case that any upwardly mobile programmer like yourself will eventually run into.

So let’s think about this a little bit.

How would we go about figuring out how to reverse strings in preparation for an interview?   Well, it goes something like this for most people:

  • In C –  Google for an answer
  • In C++ –  Google for an answer
  • In Java – Google for an answer

Seriously, this really is what people do; Google Trends search data backs it up.
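For the record, the answer everyone looks up usually boils down to something like this Python sketch (one of many equally fine ways to do it):

    # The one-liner everybody finds first:
    def reverse_slice(s):
        return s[::-1]  # slice the whole string with a negative step

    # The "show your work" version: swap characters in from both ends.
    def reverse_in_place(s):
        chars = list(s)  # strings are immutable, so copy into a list
        i, j = 0, len(chars) - 1
        while i < j:
            chars[i], chars[j] = chars[j], chars[i]
            i += 1
            j -= 1
        return "".join(chars)

    print(reverse_slice("interview"))     # weivretni
    print(reverse_in_place("interview"))  # weivretni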

Now, before coding purists denounce this whole string reversal exercise as stupid and a waste of time, let's dig into this a little further.

The string reversal question takes on a different context given that it’s an interview question. I mean, everyone comes to the interview after looking up the same damn answer. So why ask this question?

I, for one, don't care if a candidate looked up the answer on the Internet. Actually, I don't care if you know how to reverse a string at all.

If I ask the string reversal question, it's because I am looking for traits. Is this person a copy-paste monkey who just looks things up and tries them all until something works? Is there any logical thinking, adaptation, or creativity involved?

After all, the whole purpose is to see whether the person is capable of turning knowledge into skill and expertise that we, the team, can rely on.

So if I ever asked a string reversal question, it’s not because I want to know if you know how. It’s really because I want to know what you are made of.

Now try googling that.

Cheers.

One liner aggregation

Aggregation Problem

Getting a count of occurrences is a common task. A real-world example is an Apache httpd log. We often want a list of all IP addresses and the number of times each occurs in the log, which tells us how many accesses each IP address made.

This kind of aggregation comes up often enough that we decided to find the most efficient way to do it.

We took a 3GB httpd log with 12 million lines from a real production web server and processed it several ways.

RDBMS vs Hadoop

We first loaded the data into an RDBMS. Well, we tried to, anyway. It took practically forever despite loading 10,000 rows per transaction; it wasn't done after 90 minutes, so we cancelled it.

Then we tried it with Hadoop. We set up a 3-node Hadoop cluster and ran a simple MapReduce job written in Java. It took 2 minutes to copy the 3GB file into HDFS, and another 3.5 minutes for the MapReduce job to produce the output we wanted.
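The job logic itself is the classic counting pattern. A Hadoop Streaming equivalent in Python, assuming the client IP is the first whitespace-delimited field of each log line, would look roughly like this:

    #!/usr/bin/env python
    # map_ip.py -- emit "ip<TAB>1" for every log line
    import sys

    for line in sys.stdin:
        fields = line.split()
        if fields:
            print("%s\t1" % fields[0])  # field 1 is the client IP

    #!/usr/bin/env python
    # reduce_ip.py -- sum the counts per IP. Hadoop sorts mapper output
    # by key, so all lines for a given IP arrive consecutively.
    import sys

    current, count = None, 0
    for line in sys.stdin:
        ip, n = line.rstrip("\n").split("\t")
        if ip != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = ip, 0
        count += int(n)
    if current is not None:
        print("%s\t%d" % (current, count))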

To be fair, with a smaller data set, an RDBMS can deliver this query result in seconds. But that's only after the data is loaded and indexes are built, which is a lot of work for an ad-hoc query. End-to-end time is much quicker with Hadoop.

Then There’s Awk

With all that said and done, we don't use an RDBMS or Hadoop when we want a simple column aggregation on a delimited text file of less than 10GB. We use this one-liner from a shell command line:


awk '{x[$1]++} END{for (i in x) print x[i], i}' logfile | sort -n > ip.txt

Here, $1 is the first whitespace-delimited field of each log line (the client IP in a standard access log); the awk array x counts occurrences, and sort -n orders the output by count. This gives exactly the results we are looking for, in 67 seconds flat.