Ask Jeeves/Teoma has been visiting one of my sites all week, a site where all user agents are disallowed because it is under construction. It seems to have focused on the specifically disallowed spider trap directory. As a result, our "Confidential Email Lists - For internal use only. Any public use is prohibited..." files, containing many thousands of bogus email addresses, are indexed in their database.
Google visits regularly and honors the robots.txt, as do the other major search engines. The file validates so there seems to be no logical reason for the indexing.
(My spider links page will be generating another 50 or 100 thousand bogus URLs to follow later today. Should keep some errant bot busy for a few extra minutes.)
Jeevesguy: Is there a current email address for reporting a problem with your robot? I sent a message via your site's "feedback" form. (Again.)
What are the odds of getting all references to a particular site removed from your index?
>has your site ever been a live site, not under construction and heavily interlinked?
It was briefly 'live' as a would-be shopping site and was linked, again briefly, through only a couple of shared files from several of my content sites. It's more "on hold" now than "under construction."
The spider trap directory was always disallowed, and for some time now the whole site has been. (Ever since Ask Jeeves first spidered the trap directory last April.)
I'm not one to suspect Jeeves has any motive here. I have the same directory on my other sites and have never seen them fall into the trap on any of them. That's why I looked again at my robots.txt file.
I believe I know what might have caused the problem. (Me!) At the beginning of the file was:
    User-agent: *
    Disallow: /
    Disallow: /log/
    Disallow: /catcher/
When I used the robots.txt validation tool I found that later in the file there was a second User-agent: * line with its own set of disallow entries (different from the ones above). A cut and paste screw-up, I guess. The file validated, but the tool reported duplicates. In effect, I was rewriting the file halfway through, and the bot seems to have started over at the point it saw the second User-agent: *, dropping the earlier rules.
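For anyone who wants to check their own files for the same mistake, here is a minimal Python sketch that flags any user-agent token appearing on more than one User-agent line. The function name find_duplicate_groups and the sample file are made up for illustration; in particular, the second group's /cgi-bin/ entry is a placeholder, not what was actually in my file. It only reports duplicates and says nothing about how any particular bot resolves them.

```python
from typing import Dict, List


def find_duplicate_groups(robots_txt: str) -> Dict[str, List[int]]:
    """Map each user-agent token that appears on more than one
    User-agent line to the 1-based line numbers where it appears."""
    seen: Dict[str, List[int]] = {}
    for lineno, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue  # blank or malformed line
        field, value = (part.strip() for part in line.split(":", 1))
        if field.lower() == "user-agent":
            seen.setdefault(value.lower(), []).append(lineno)
    return {agent: lines for agent, lines in seen.items() if len(lines) > 1}


# Hypothetical file: the first group is the one quoted in this thread,
# the second group is an invented stand-in for the pasted-in duplicate.
SAMPLE = """User-agent: *
Disallow: /
Disallow: /log/
Disallow: /catcher/

User-agent: *
Disallow: /cgi-bin/
"""

if __name__ == "__main__":
    for agent, lines in find_duplicate_groups(SAMPLE).items():
        print(f"User-agent {agent!r} starts groups on lines {lines}")
```

If it prints anything, merge the duplicate groups into one so there is only a single User-agent line per token.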
Checked all my other robots.txt files to make sure they're ok.
Thanks to JeevesGuy for contacting me via stickymail.