
yahoo crawlers hammering my site

         

jawjbww

9:35 am on Jun 22, 2004 (gmt 0)



My site is being crawled by about 20 Yahoo spiders, all hitting the same pages, and they are now using about 30% of my bandwidth (30,000 pages today). Can anyone explain what is going on, or has anyone had a similar experience?

I don't even get more than a handful of hits from Yahoo generally, so I don't know what they are doing.

The IP addresses used are 66.163.170.170 and others in the same range (.165, .172, etc.), which are part of yahoo.com when I do an nslookup.

The agent is
Yahoo-VerticalCrawler-FormerWebCrawler/3.9 crawler at trd dot overture dot com; [alltheweb.com...]

Span

8:51 am on Jun 23, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I have not seen it hammering, but it looks like a buggy crawler. I've seen it replacing dashes with slashes, requesting pages with a trailing slash (page.html/), and changing 9-digit filenames into 8 digits.
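If the mangled-URL requests become a bandwidth problem, one defensive option is to 301 them back to the real page at the server. A sketch for Apache's mod_rewrite (assuming .htaccess rewrites are enabled; the trailing-slash case is the easy one to catch, the dash/slash swaps less so):

```apache
RewriteEngine On
# Strip a spurious trailing slash from .html requests: /page.html/ -> /page.html
RewriteRule ^(.+\.html)/$ /$1 [R=301,L]
```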

dhatz

9:38 am on Jun 23, 2004 (gmt 0)

10+ Year Member



"Buggy crawler"?

I really don't get any of this.

I get lots of crawler hits from "amateurish" local (Greek) crawlers, and they're all WELL BEHAVED, i.e. they follow robots.txt, limit requests to e.g. one every 5 seconds, support Last-Modified headers so they can get an HTTP 304, follow 301/302 redirects, etc.
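For what it's worth, the rate-limiting part can be requested of Slurp and msnbot via the non-standard Crawl-delay directive in robots.txt (both honoured it; Googlebot ignored it). A sketch, with an illustrative 5-second delay:

```text
# robots.txt -- ask Yahoo's and MSN's crawlers to pause between requests
User-agent: Slurp
Crawl-delay: 5

User-agent: msnbot
Crawl-delay: 5
```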

On the other hand, from the big boys, ONLY GOOGLEBOT is well behaved!

One complaint I have with Yahoo is their HTTP 404 error testing, or whatever that is. Plus the handling of 301/302 redirects.

I have a site where I chose to name the files using numbers, or a letter and numbers, e.g. b1562.html. On that particular site, 15% of all Slurp requests are 404s, as it keeps requesting non-existent files.

Inktomi was much worse; at one point it was responsible for 10% of the site's bandwidth. Also, Inktomi/Slurp never used to get a 304 on my sites, just 200s. Recently I see several 304s, so it's getting better.
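Those 304s come from conditional GETs: the crawler sends the page's previous Last-Modified value back in an If-Modified-Since header, and the server answers 304 with no body if nothing changed. A minimal sketch of the server-side decision (function and variable names are my own, not anything Yahoo publishes):

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Optional

def conditional_status(last_modified: datetime, if_modified_since: Optional[str]) -> int:
    """Return 304 if the client's cached copy is still fresh, else 200."""
    if if_modified_since:
        try:
            cached = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            return 200  # unparseable header: send the full page
        # Compare at whole-second resolution; HTTP dates have no sub-second part
        if last_modified.replace(microsecond=0) <= cached:
            return 304  # not modified: empty response, almost no bandwidth
    return 200  # no (valid) header, or the page changed: full response

page_mtime = datetime(2004, 6, 1, 12, 0, 0, tzinfo=timezone.utc)
print(conditional_status(page_mtime, format_datetime(page_mtime)))  # 304
print(conditional_status(page_mtime, None))                         # 200
```

A crawler that never sends If-Modified-Since always falls through to 200 and re-downloads everything, which is exactly the old Inktomi behaviour described above.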

I noticed that Ink/Slurp will try to view the directory listing, i.e. if I have a URL like

[mysite.tld...]

it'll send a request for

[mysite.tld...]

and I've explicitly allowed directory browsing for Slurp, to HELP it quickly determine which files have a new timestamp. Downside: it followed the links from the directory listing and picked up some "orphan" files (i.e. not linked from anywhere).

MSNBOT is the most annoying so far. I blocked it after it had generated 25,000 page hits on a 4,000-page site in just 10 days. 99% of the pages had not changed during that time, yet it downloaded all of them with an HTTP 200 code.
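Blocking it can be done in robots.txt (msnbot does honour it), or, if you'd rather not rely on the bot's good manners, at the server. A hedged Apache sketch of the latter, matching on the User-Agent string:

```apache
# Deny any request whose User-Agent contains "msnbot"
SetEnvIfNoCase User-Agent "msnbot" block_bot
Order Allow,Deny
Allow from all
Deny from env=block_bot
```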

Unbelievable!

dhatz

11:29 am on Jun 23, 2004 (gmt 0)

10+ Year Member



Just re-checked this: out of 1,075 page hits by Slurp, 427 were requests for an invalid URL. On a site with fewer than 4,000 files!

Addition: I tend to think it's a "flag" that gets set when Slurp suspects it could be up against a big machine-generated junk site or something.

I have another, much bigger site, where I also name files using numbers and a few letters, and my stats show 3,210 page hits by Slurp with NONE of them looking for non-existent files (unless following broken links, of course).

This is, IMO, a leftover from Inktomi in Slurp, which exhibited this "bug" on that site and that site only. On other sites it does well.

[edited by: dhatz at 11:44 am (utc) on June 23, 2004]

trillianjedi

11:36 am on Jun 23, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



My site is being crawled by about 20 yahoo spiders and they are all hitting the same pages and they are now using about 30% of my bandwidth.

Span's post above may well be the answer, but it's also worth checking that you're not serving the Yahoo bot any kind of session ID.

TJ
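A common fix for the session-ID problem is simply not to create a session for requests whose User-Agent looks like a crawler, so the bot only ever sees clean, stable URLs instead of endless ?sid=... variants. A minimal sketch (the signature list is illustrative, not exhaustive):

```python
# Substrings seen in major crawler User-Agent strings (illustrative list)
BOT_SIGNATURES = ("slurp", "googlebot", "msnbot", "crawler", "spider")

def should_issue_session(user_agent: str) -> bool:
    """Return False for requests that look like crawlers, so no session ID is appended."""
    ua = user_agent.lower()
    return not any(sig in ua for sig in BOT_SIGNATURES)

print(should_issue_session("Mozilla/4.0 (compatible; MSIE 6.0)"))  # True: real browser
print(should_issue_session(
    "Yahoo-VerticalCrawler-FormerWebCrawler/3.9 crawler at trd dot overture dot com"))  # False
```

Without a check like this, each crawler visit can mint a fresh session ID, making every page look new and multiplying the crawl exactly as described at the top of the thread.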