Monday, September 1, 2008

Needle In A Haystack

For a long time, I kept all my important documents in a single box. That would be stuff like my passport, car title, insurance information, and so on. After my wife and I bought a home, we had enough papers to file that our simple box wasn't sufficient. We bought large filing cabinet and it's served our needs ever since.

It's not that we needed to look up stuff any more frequently than before. But when we went from 100 pieces of paper to 1000, we didn't want it to take 10 times longer to find anything. To find stuff in the box, I riffled through it until I found the paper I needed, sometimes searching the entire thing. More stuff means the search takes longer.

A filing cabinet solves this problem by providing a natural structure for organizing the papers. For example, all the papers on the home can go in one file folder, all the car information in another, and all the identification documents like passports in a third. Then the lookup process becomes simpler, as I can ignore everything that's not part of the folder that keeps whatever I'm looking for.

This is a real savings in time. Imagine you have 100 different pieces of paper. You can either stuff them all in one box or you can organize them into 10 folders, each containing 10 papers. What's the average number of papers you'll have to look at to find any given piece, assuming you always know what folder should contain that piece of paper?

When looking through the box, you could look at anywhere between 1 and 100 pieces with equal probability. So you'll average (1 + 100) / 2 = 50.5 pieces before you find it. When looking through the folders, you have to find the correct folder from 10 options and then the correct paper of the 10 pieces in that folder. It will take between 1 and 10 tries to find the folder, or 5.5 on average. And similarly it takes on average 5.5 tries to find the paper in the folder. That's 11 tries total for the folder compared to 50.5 for the box. The filing system can be searched roughly five times faster!

Or is it really five times faster? What happens if you have 1000 pieces of paper instead of 100? When you have this much information, you could create folders and subfolders to further organize the papers. If you have 10 main folders, each of which contain 10 subfolders, and each of those contain 10 papers, then it will take 5.5 + 5.5 + 5.5 = 16.5 tries to find any one of the 1000 papers. It would take 500.5 tries if they were in a box. So in this case the filing system is about 30 times faster.

The more data there is to store, the more efficient the filing system is compared to the box. That's because the filing system is a whole order of magnitude more efficient. In mathematical terms, searching through the box takes linear time. When you add ten times as much stuff, it takes ten times as long. But searching the filing system takes logarithmic time. Adding ten times as much stuff just means it takes one more search step. Very roughly speaking, searching is proportionate to the number of zeros on the size of the data. Searching 100 things takes twice as long as 10 things, and searching 1000 things takes three times long. Want to search 1,000,000 things? That's six times longer than searching 10 things.

You might be wondering what all of this has to do with artificial intelligence. At the heart of almost all AI is data, and data doesn't mean anything if you can't find it when you need it. That means efficient data storage is the required foundation for artificial intelligence. An algorithm is only as good as the data it has access to, and good AI needs a lot of data. If all the data were just shoved in a box, there's no way the computer could run the AI's thought process fast enough. Indeed in almost all computer science, the way in which data is organized is just as important as the actual data being stored. After all, what good is information if you can't find it? Next week I'll give a concrete example of how BrainWorks puts this concept to use.

No comments: