So a week or so ago somebody at AOL made a big old snafu and released the search logs for over 20 million search queries from like 600,000 AOL users who were searching from March to June of this year.
This has, obviously, sparked some privacy concerns, etc, etc. But for me, it was just something else to tinker with. The very night I heard about this I downloaded said data dump and started loading it into a MySQL database I have running on a Linux test box that I use for all manner of ‘tinkering with toys’ type stuff when I find something I want to play with.
This by itself took a little while, as the search data for over half a million users over a couple of months adds up pretty quick. The download was right around 500 megs, but after uncompressing everything I found myself staring at a buttload of data to import into MySQL…
Then I realized that in order to do some of the things I had in mind, I might need to siphon off some data into other tables. For instance, I’m not that interested in WHO was searching for any particular thing, but I’m extremely interested in little things like WHAT people searched for, and what websites they ultimately ended up at.
So I created a keywords table. Simple: it just has a list of all the actual searches (each one listed in this table only once, regardless of how many times it was actually searched), and a counter that shows how many times said search was performed. (Yes, I could have gleaned the same type of data using SQL queries on the original table, but do you have any idea how long it takes to scan an entire table of 36 MILLION searches to do that kind of simple add-up on the fly? By taking the time to do ALL the math at once, I saved myself much more time down the road.)
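For the curious, here is a rough sketch of that do-the-math-once idea. It is not the actual script: it uses Python and an in-memory SQLite database instead of PHP and MySQL, the raw table name `searches` and its columns `user_id` and `query` are made up for the example, and the handful of rows are toy data. Only the `keywords` table with its `keyword` and `cnt` columns matches what is described above.

```python
import sqlite3


def build_keywords(conn):
    """Collapse the raw log into one row per distinct search, plus a counter."""
    conn.execute("DROP TABLE IF EXISTS keywords")
    conn.execute(
        "CREATE TABLE keywords (keyword TEXT PRIMARY KEY, cnt INTEGER NOT NULL)"
    )
    # One pass over the big table; after this, lookups hit the small table only.
    conn.execute(
        "INSERT INTO keywords (keyword, cnt) "
        "SELECT query, COUNT(*) FROM searches GROUP BY query"
    )
    conn.commit()


def top_keywords(conn, limit):
    """Return the most frequent searches, busiest first."""
    return conn.execute(
        "SELECT keyword, cnt FROM keywords ORDER BY cnt DESC, keyword LIMIT ?",
        (limit,),
    ).fetchall()


def demo_keywords():
    conn = sqlite3.connect(":memory:")
    # Hypothetical raw log table; the real dump has more columns than this.
    conn.execute("CREATE TABLE searches (user_id INTEGER, query TEXT)")
    toy_rows = [
        (1, "google"), (2, "google"), (3, "ebay"),
        (1, "google"), (2, "mapquest"), (3, "ebay"),
    ]
    conn.executemany("INSERT INTO searches VALUES (?, ?)", toy_rows)
    build_keywords(conn)
    return top_keywords(conn, 2)


if __name__ == "__main__":
    print(demo_keywords())
```

The trade-off is the usual one: the summary table costs one long pass up front and goes stale if the raw table changes, but since this data dump is never going to change, that hardly matters here.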
While one quickly hacked PHP script was doing the build of my custom keyword table, I threw together another table, same concept, but for URLs that were clicked on, how often they were clicked on, and whereabouts they were in the search results when they were clicked on. A second PHP hack was run to do the math and tally all of those up for me.
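The URL table can be sketched the same way. Again, this is a guess at the shape of it and not the real PHP: the column names `item_rank` and `click_url`, the `urls` table layout, and the choice to store an average and a best position per URL are all assumptions made for the example, and the rows are toy data.

```python
import sqlite3


def build_urls(conn):
    """Summarize clicks per URL: how often, and where in the results."""
    conn.execute("DROP TABLE IF EXISTS urls")
    conn.execute(
        "CREATE TABLE urls ("
        "url TEXT PRIMARY KEY, clicks INTEGER NOT NULL, "
        "avg_rank REAL, best_rank INTEGER)"
    )
    # Rows with no click (NULL URL) are searches where nothing was clicked.
    conn.execute(
        "INSERT INTO urls (url, clicks, avg_rank, best_rank) "
        "SELECT click_url, COUNT(*), AVG(item_rank), MIN(item_rank) "
        "FROM searches WHERE click_url IS NOT NULL GROUP BY click_url"
    )
    conn.commit()


def demo_urls():
    conn = sqlite3.connect(":memory:")
    # Hypothetical raw log table with just the columns this sketch needs.
    conn.execute(
        "CREATE TABLE searches (query TEXT, item_rank INTEGER, click_url TEXT)"
    )
    toy_rows = [
        ("google", 1, "http://www.google.com"),
        ("google.com", 1, "http://www.google.com"),
        ("search engine", 3, "http://www.google.com"),
        ("ebay", 2, "http://www.ebay.com"),
        ("internet", None, None),  # a search with no click
    ]
    conn.executemany("INSERT INTO searches VALUES (?, ?, ?)", toy_rows)
    build_urls(conn)
    return conn.execute(
        "SELECT url, clicks, avg_rank, best_rank FROM urls "
        "ORDER BY clicks DESC, url"
    ).fetchall()


if __name__ == "__main__":
    for row in demo_urls():
        print(row)
```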
So while all this was running (a process that took the better part of 20 hours on the itty bitty little Pentium-733 box with a single slow IDE drive and 512 megs of RAM that serves as my ‘testy box’) I was telling the wife about it, and I guess my “woohoo” factor was showing through, because she looked at me a minute and said:
“Okay, so, um, exactly what are you going to do with all this stuff when it’s done whatever it is it’s doing?”
And that’s when I realized… I really hadn’t planned on doing ANYTHING with it as of yet. Sure, I could do like everyone else and their brother and throw a PHP interface around it and slap it up on the web for anyone who wants to go digging through it.. but that’s been done a dozen times already, by people with much faster hardware to spare than I have at my disposal, so why bother?
So anyway, I’m not entirely sure *what* I’m going to do with the data.. (I have had a couple of small ideas, but they’re just research for other projects that I’m working on).. but I can at least rest easy, because once and for all, I have confirmed that the average internet user is as big of an idiot as I thought they were… I present to you, the top 12 searches from the AOL search database:
mysql> select keyword,cnt from keywords order by cnt desc limit 12;
+----------------------+--------+
| keyword              | cnt    |
+----------------------+--------+
| google               | 332192 |
| ebay                 | 139207 |
| yahoo                | 130538 |
| yahoo.com            |  97518 |
| mapquest             |  88279 |
| google.com           |  79991 |
| myspace.com          |  77211 |
| myspace              |  74365 |
| www.yahoo.com        |  43038 |
| www.google.com       |  42597 |
| internet             |  39622 |
| http                 |  30125 |
+----------------------+--------+
Why does this prove to me everyone on the web is an idiot? Because you don’t have to search for “google” to get to Google, people, you just type GOOGLE into the little address bar at the top of your screen.. the same goes for almost everything else on this list.
A search engine is for when you don’t know the name of a site you’re looking for.. or even on what site you may find what you’re looking for…. when you’re looking for “h0t animal butt-secks” you put that into a search.. if you’re trying to bring up “buttsecks.com”, you just put that in the address bar and be done with it… ~shrug~
And don’t even get me started on the number of people searching for “http” or “internet”…. wtf.