AltaVista's spider Scooter = SPAM!?

Spider accessing site generates hundreds of 404 reports daily!

Joao

12:53 am on Jan 19, 2001 (gmt 0)

I have a website with a CGI script in place that catches 404 errors and notifies me about them. At this point I am receiving HUNDREDS of messages a day, ALL triggered by AltaVista's SCOOTER.
I don't know what business AltaVista has indexing my site, because I NEVER submitted it to AltaVista, nor do I want it indexed by AltaVista or any other search engine. I don't know how to solve this problem, or whether it is solvable on AltaVista's side, but I am certainly not thrilled with this experience, which amounts to something worse, FAR WORSE, than spamming!
How can I stop this?
Any suggestions? Help would be much appreciated.
Thank you.
Joao

BoneHeadicus

1:12 am on Jan 19, 2001 (gmt 0)

Welcome to WmW

Are you referring to the Apache Guardian script and lots of robots.txt 404's?

tedres

1:49 am on Jan 19, 2001 (gmt 0)

Hi Joao,

Check out:
[doc.altavista.com...]

The page explains how to prevent their robot from visiting your site by using robots.txt and/or robots meta tags. Hope this helps.

Joao

2:16 am on Jan 19, 2001 (gmt 0)

THANKS a million! Much appreciated. I have a txt file there, but I am probably not using it right. I will also check the meta tags solution. One extra question: are there spiders that unethically ignore these restrictions and go crawling and following links anyway?

And there is another problem:

The original page is not there anymore, so Scooter looks for it, maybe to update its links/indexes (?), and since the page is not found, my CGI script reports a 404-someone-was-looking-for-a-page-that-doesn't-exist-any-longer.

Should the txt exclusion file prevent Scooter (and other spiders/robots) from even starting to search for the page that was once there?

Again, thanks!

Joao

mivox

2:39 am on Jan 19, 2001 (gmt 0)

If you don't want any robots indexing ANYTHING on your site at all, just use the robots.txt file to disallow all robots from your root directory completely.

Save a plain text file named "robots.txt" in your site's root directory with this content:

User-agent: *
Disallow: /

All robots that follow the robots.txt protocol should then leave your site entirely alone.
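If you want to double-check that a robots.txt like the one above really shuts out a well-behaved robot, here is a rough sketch using Python's standard urllib.robotparser module. The helper name is_allowed and the path "/old-page.html" are made up for illustration, and "Scooter" is used only as the robot name mentioned in this thread. It shows how a compliant parser reads the rules; it says nothing about what any particular spider actually does.

```python
from urllib.robotparser import RobotFileParser

# The two-line robots.txt from the post above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt, user_agent, path):
    """Return True if a compliant robot may fetch `path` (hypothetical helper)."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # "/old-page.html" is an invented example path.
    for path in ("/", "/old-page.html"):
        print(path, is_allowed(ROBOTS_TXT, "Scooter", path))
```

Both paths should come back as not allowed, since "Disallow: /" covers everything under the root.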

But yes, there are robots that ignore proper manners and never even look at the robots.txt file. In those cases, looking up where the robot is coming from (which can take a bit of detective work, tracking IP numbers & whatnot) and mailing its administrators (or their upstream providers' administrators) to complain about the behavior has worked for me in the past.
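For the detective work, a starting point is to tally which IP numbers and user agents are causing the 404s. The following is only a sketch: it assumes your server writes the common "combined" access log format, and the sample lines, the IP numbers (taken from the reserved documentation range) and the bot name "ExampleBot/1.0" are all invented. The function name count_404s is hypothetical too.

```python
import re
from collections import Counter

# Assumes the "combined" access log format:
# ip ident user [date] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def count_404s(lines):
    """Count 404 responses per (IP, user agent) pair (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("status") == "404":
            counts[(match.group("ip"), match.group("agent"))] += 1
    return counts


# Invented sample data, not taken from any real log.
SAMPLE = [
    '192.0.2.10 - - [19/Jan/2001:00:53:00 +0000] "GET /old.html HTTP/1.0" 404 209 "-" "ExampleBot/1.0"',
    '192.0.2.10 - - [19/Jan/2001:00:53:05 +0000] "GET /gone.html HTTP/1.0" 404 209 "-" "ExampleBot/1.0"',
    '192.0.2.77 - - [19/Jan/2001:00:54:00 +0000] "GET /index.html HTTP/1.0" 200 5120 "-" "Mozilla/4.0"',
]

if __name__ == "__main__":
    for (ip, agent), hits in count_404s(SAMPLE).most_common():
        print(hits, ip, agent)
```

Once you have the worst offenders, a reverse DNS or whois lookup on the IP number usually tells you whom to write to.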