Inktomi Harvests Private Web Data

by Ed Sawicki - President
Accelerated Learning Center
Tailored Computers

Inktomi, the Internet search engine company, appears to be ignoring the instructions in web server robots.txt files and harvesting private data from web sites. A robots.txt file placed in the top-level directory of a web site is supposed to let us control what robots may do when they scan our sites. We can instruct a robot NOT to index the content in certain directories. This is useful when you place content on your site for employees and customers and don't want the whole world to know about it.

In my case, I've created a substantial amount of information on my web site that is used for research. It is intended for internal use and, occasionally, for customer use. I don't want to make this information available to the general public for numerous reasons. There is no link on our home page to the information. You must know the directory exists.

To be certain that no robot would index the information, I added the directory to my robots.txt file with the following statement.

Disallow: /standards

Up until now, robots have respected this. When I checked my web server access log recently, I noticed that Inktomi's robot indexed the entire /standards tree. Now, a lot of activity on my site is hitting this directory.
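For comparison, here is a minimal sketch of how a compliant robot would be expected to read that rule, using the robots.txt parser in Python's standard library. The "User-agent: *" line, the host www.example.com, the robot name ExampleBot, and the helper name allowed are assumptions made for the example; only the Disallow line comes from my file.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt contents. Only the Disallow line is from the article;
# the "User-agent: *" line is added because a rule must belong to an agent group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /standards
"""


def allowed(path, agent="ExampleBot"):
    """Return True if a compliant robot named `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    # www.example.com is a placeholder host, not the real site.
    return parser.can_fetch(agent, "http://www.example.com" + path)


if __name__ == "__main__":
    for p in ("/index.html", "/standards", "/standards/some-page.html"):
        print(p, "allowed" if allowed(p) else "disallowed")
```

A robot that honors the file would skip everything under /standards. Nothing in the file can force a robot to behave this way, which is exactly the problem.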

Now what? The first thing I did was to remove all of the Disallow statements from my robots.txt file. It makes no sense to tell robots about the location of information you don't want them to know about.

Next, I disabled access to the /standards directory to everyone but those on our internal network. Fortunately, the Apache web server makes this easy. I just added a statement to the Apache configuration file giving access to our local subnet and denying access to all other addresses. This took less than one minute (no GUIs to slow things down). I can add customer addresses easily should they need access.
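The rule itself is just an address check. The sketch below models that logic in Python; it is not the Apache configuration I used. The subnet 192.168.1.0/24, the list EXTRA_ALLOWED, and the function name may_access are all made up for illustration.

```python
import ipaddress

# Hypothetical internal subnet; the real one is not given in this article.
INTERNAL_NET = ipaddress.ip_network("192.168.1.0/24")

# Hypothetical list of customer networks that could be added later.
EXTRA_ALLOWED = []


def may_access(client_ip, path):
    """Allow /standards only to internal (or explicitly listed) addresses."""
    protected = path == "/standards" or path.startswith("/standards/")
    if not protected:
        return True  # everything else on the site stays public
    addr = ipaddress.ip_address(client_ip)
    return addr in INTERNAL_NET or any(addr in net for net in EXTRA_ALLOWED)
```

Granting a customer access would then mean appending one network, for example ipaddress.ip_network("203.0.113.0/24"), to EXTRA_ALLOWED.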

Then it occurred to me that Inktomi's sin could work to my benefit. Most people who run web sites want more traffic. More traffic may mean more customers. Why not modify things so that when people come to my site from a link on the Inktomi site, I redirect them to my home page rather than denying them access?

This took about five minutes to engineer because I had to consult the Apache documentation. People going to my /standards directory are now brought to my home page. The stuff that was in /standards is now elsewhere.
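The resulting behavior can be modeled in a few lines of Python. This is only a sketch of the decision, not the Apache directives themselves; the function name respond and the choice of a 302 (temporary) redirect are assumptions for the example.

```python
def respond(path):
    """Return (status, location) for a request path.

    Requests for the old /standards tree are sent to the home page;
    all other requests are served normally.
    """
    if path == "/standards" or path.startswith("/standards/"):
        # 302 is an assumed status code; the article does not say which was used.
        return (302, "/")
    return (200, path)
```

A visitor who follows a stale search result such as /standards/some-page.html lands on the home page instead of seeing an error.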

The Inktomi robots are not following the standards - they're not playing fair. Perhaps Inktomi needs to be taught a lesson. We could all add Disallow statements to our robots.txt files that point to directories that contain provocative content. Then after the Inktomi robot indexes the directories, we can redirect the directory elsewhere. Eventually, people using the Inktomi search engine would tire of the "broken" links and use another engine.