Please Make Alta Vista Aware...

of its spider abuse


msgraph

1:43 pm on Oct 16, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've been playing e-mail tag with AV regarding their spider abuse. Supposedly they are investigating the issue, but I'm not sure how far they are taking it. I'm sure others have already contacted them and received replies similar to the one I got, but maybe a few more of you can do the same. If they receive more and more mail making them aware that their spider is not following the one-page-per-second rule they claim to follow, then they will really look into it.

If Alta Vista's robots are abusing your server, please contact them at:

Crawl Support <crawl-support@av.com>

...if you get no reply, then send to: corporate@support.altavista.com

Make sure to really let them know what is happening on your server. The more details you give them about their spider's activity, the more likely they are to reply.

- Snip your log files (renaming the files, of course)
- Let them know if they are not obeying the robots.txt file
- Give them your server IP (THAT IS IF YOU HAVE NOTHING TO HIDE) :)

If all goes well you should receive a reply like:

Thanks for contacting us about the machine at IP address xxx.xxx.xxx.xx.
We have forwarded your message to the crawl engineers, and have started to
process the investigation.
We would like to confirm with you if your web site is domainname.com.
Would you please help?

For not being crawled by AltaVista crawlers, you may set up robots.txt as -
User-agent: scooter # AltaVista web page search
Disallow: /

For the further robots.txt information, you may check the websites at -
[help.altavista.com...]
[info.webcrawler.com...]

In addition, here is some general information about our crawler and how it
should normally behave.
The crawler should not be consuming a large percentage of your server's
capacity. In the past, we have limited crawlers so that they receive only
one page per second, but most modern servers are capable of serving far more
than this, and some webmasters complained that not all of their pages were
being indexed. However, the crawler will still limit itself to one request
at a time, and it should wait for a request to finish before starting the
next one. For example, if your server can process a single request in a
tenth of a second, you may get as many as 10 requests per second from the
crawler, and this should be within the load that your server can routinely
bear.

This wastes our resources and yours, since a given page should appear
in the index only once. However, some URLs are different enough to confuse
the crawler and make it think the pages are unique. For example, the crawler
knows that URLs like this tend to be similar:
[my.site.com...]
[my.site.com...]
However, some websites have URLs that change without using a question mark,
such as
[my.site.com...]
[my.site.com...]
This may confuse the crawler and cause it to request the same script over
and over, using slightly different URLs each time. In this case, the best
approach is usually to prevent the crawler from accessing the script (or,
alternatively, from accessing the entire script directory).

Again, we have started to process the investigation.
Looking forward to your confirmation.
Thanks.

sincerely,
AltaVista Crawl Support
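
Before telling them the spider ignores your robots.txt, it is worth confirming that the file really does block it. Below is a minimal sketch using Python's standard `urllib.robotparser` to test the two-line rule from the reply above. The helper name `is_blocked`, the agent strings, and the example.com URL are made up for illustration; they are not taken from AltaVista's documentation.

```python
from urllib.robotparser import RobotFileParser

# The rule quoted in the AltaVista reply, including its inline comment.
SCOOTER_RULES = [
    "User-agent: scooter # AltaVista web page search",
    "Disallow: /",
]

def is_blocked(rules, agent, url):
    """Return True if the given robots.txt lines forbid `agent` from fetching `url`.

    `rules` is a list of robots.txt lines. The agent string is matched
    case-insensitively on the part before any "/" version suffix.
    """
    parser = RobotFileParser()
    parser.parse(rules)
    return not parser.can_fetch(agent, url)

if __name__ == "__main__":
    # Hypothetical agent strings and URL, for illustration only.
    for agent in ("Scooter/1.0", "SomeOtherBot"):
        print(agent, "blocked:", is_blocked(SCOOTER_RULES, agent, "http://example.com/page.html"))
```

If the parser says the spider is blocked and your logs still show it fetching pages, you have something concrete to include in your mail.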

Ove

1:48 pm on Oct 16, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks for the info

/Ove

bufferzone

1:53 pm on Oct 16, 2001 (gmt 0)

10+ Year Member



What do you mean by abuse? Is it more than not obeying the robots.txt? I see Scooter regularly and it behaves itself (I think).

msgraph

2:03 pm on Oct 16, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



1. They haven't been obeying my robots.txt, and it is set up according to the standard.

2. They have been grabbing 5+ static pages per second per domain on one server. Mix that with a high level of traffic and you have a server on its knees.
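
For anyone who wants numbers to send to crawl support, here is a rough sketch of how the pages-per-second figure could be pulled out of an access log. It assumes the Apache "combined" log format and matches the crawler by a case-insensitive substring of the user-agent field; the function names and the default threshold are my own choices for illustration, so adjust the pattern to whatever your server actually writes.

```python
import re
from collections import Counter

# Assumption: Apache "combined" format, e.g.
# host - - [16/Oct/2001:13:43:01 +0000] "GET /a.html HTTP/1.0" 200 1234 "-" "agent"
LOG_LINE = re.compile(
    r'\[(?P<ts>[^\]]+)\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def requests_per_second(lines, agent_substring="scooter"):
    """Count requests per log timestamp (one-second resolution) for one crawler."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and agent_substring in match.group("agent").lower():
            counts[match.group("ts")] += 1
    return counts

def busy_seconds(counts, threshold=1):
    """Return (timestamp, hits) pairs where the crawler exceeded `threshold` hits."""
    return sorted((ts, n) for ts, n in counts.items() if n > threshold)

if __name__ == "__main__":
    import sys
    # Usage: python crawl_rate.py access.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for ts, hits in busy_seconds(requests_per_second(log)):
            print(ts, hits)
```

Running it over a log snippet lists every second in which the spider made more than one request, which is the kind of detail they are more likely to act on.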