Aggregation Problem
Counting occurrences is a common task. A real-world example is an Apache access log: we often want a list of every client IP address in the log and the number of times it appears, which tells us how many requests each IP address made.
This kind of aggregation comes up often enough that we decided to find the most efficient way to do it.
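For concreteness, here is what the input and the desired output look like. The log line and the count below are made up for illustration; the point is just that the client IP is the first whitespace-delimited field of each line.

# One line of an Apache access log (illustrative):
203.0.113.7 - - [10/Oct/2023:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326

# Desired summary: occurrence count followed by the IP address
4021 203.0.113.7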
We took a 3GB httpd log with 12 million lines from a real production web server and processed it several ways.
RDBMS vs Hadoop
We first loaded the data into an RDBMS. Well, we tried to, anyway. Even batching 10,000 rows per transaction, the load was still running after 90 minutes, so we cancelled it.
Then we tried it with Hadoop. We set up a 3-node Hadoop cluster and ran a simple MapReduce job written in Java. It took 2 minutes to copy the 3GB file into HDFS and another 3.5 minutes for the MapReduce job to produce the output we wanted.
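The exact commands depend on the Hadoop version, and the jar and class names below are placeholders rather than our actual job, but the workflow was roughly this:

# Copy the 3GB log file into HDFS (this was the 2-minute step)
hadoop fs -put httpd.log /logs/httpd.log

# Run the MapReduce job: the mapper emits (IP, 1) for each line and the
# reducer sums the counts per IP (ip-count.jar / IpCount are hypothetical names)
hadoop jar ip-count.jar IpCount /logs/httpd.log /logs/ip-counts

# Inspect the aggregated (count per IP) output
hadoop fs -cat /logs/ip-counts/part-r-00000 | head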
To be fair, with a smaller data set an RDBMS can deliver this query result in seconds, but only after the data has been loaded and indexes have been built. That is a lot of work for an ad-hoc query; end to end, Hadoop was much quicker.
Then There’s Awk
With all that said and done, we don't use an RDBMS or Hadoop when we want a simple column aggregation on a delimited text file under 10GB. We use this one-liner from the shell:
awk '{x[$1]++} END {for (i in x) print x[i], i}' logfile | sort -n > ip.txt
This gives exactly the results we are looking for in 67 seconds flat.
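For readers who don't speak awk, here is the same pipeline spread out with comments; the logic is identical to the one-liner above:

# x[$1]++ : use the first whitespace-delimited field (the client IP in an
#           Apache access log) as an array key and count each occurrence
# END     : once the whole file has been read, print "count ip" for every key
# sort -n : sort the summary numerically by the count column
awk '{ x[$1]++ }
     END { for (ip in x) print x[ip], ip }' logfile | sort -n > ip.txt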