I generally don't get more than a handful of hits from Yahoo, so I don't know what they are doing.
The IP addresses used are 66.163.170.170, .165, .172, etc.,
which come back as part of yahoo.com when I do an nslookup.
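For anyone who wants to script that nslookup check, here is a minimal sketch of a forward-confirmed reverse DNS lookup. The function name `is_claimed_crawler` and the `.yahoo.com` suffix list are my own assumptions for illustration, not anything Yahoo documents; the resolver arguments are only there so the logic can be tried without touching the network.

```python
import socket

def is_claimed_crawler(ip, allowed_suffixes=(".yahoo.com",),
                       reverse=socket.gethostbyaddr,
                       forward=socket.gethostbyname_ex):
    """Return True if `ip` reverse-resolves to an allowed domain AND that
    hostname resolves back to the same IP (forward-confirmed reverse DNS).

    `allowed_suffixes` is an assumption; adjust it to the crawler you expect.
    """
    try:
        hostname = reverse(ip)[0]          # (hostname, aliases, addresses)
    except OSError:                        # covers socket.herror / gaierror
        return False
    if not hostname.lower().rstrip(".").endswith(tuple(allowed_suffixes)):
        return False
    try:
        addresses = forward(hostname)[2]   # (hostname, aliases, addresses)
    except OSError:
        return False
    # A spoofed PTR record fails here because the forward lookup won't match.
    return ip in addresses
```

Calling `is_claimed_crawler("66.163.170.170")` does two live DNS lookups, so the result depends on whatever the DNS says at the time you run it.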
The agent is
Yahoo-VerticalCrawler-FormerWebCrawler/3.9 crawler at trd dot overture dot com; [alltheweb.com...]
I really don't get any of this.
I get lots of crawler hits from "amateurish" local (Greek) crawlers and they're all WELL BEHAVED, i.e. they follow robots.txt, limit requests to e.g. one every 5 seconds, honor Last-Modified headers so they get HTTP 304 responses, follow 301/302 redirects, etc.
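To make the 304 point concrete, here is a rough sketch of the decision a server makes when a crawler sends an If-Modified-Since header. The function name `status_for_request` is made up for illustration, and a real web server does more than this (ETags, ranges and so on), so treat it as a simplification.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def status_for_request(last_modified, if_modified_since):
    """Pick 304 or 200 for a GET.

    last_modified: timezone-aware datetime of the file on disk.
    if_modified_since: raw header value sent by the crawler, or None.
    """
    if not if_modified_since:
        return 200                      # unconditional request: full download
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):     # unparsable header: ignore it
        return 200
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates only have one-second resolution, so drop microseconds.
    unchanged = last_modified.replace(microsecond=0) <= since
    return 304 if unchanged else 200
```

A crawler that never sends the header always lands in the first branch, which is why every one of its requests shows up as a 200 in the logs.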
On the other hand, from the big boys, ONLY GOOGLEBOT is well behaved!
A complaint I have with Yahoo is their HTTP 404 error testing, or whatever that is, plus their handling of 301/302 redirects.
I have a site where I chose to name the files using numbers, or a letter and numbers, e.g. b1562.html. On that particular site, 15% of all Slurp requests are 404s, as it keeps requesting non-existent files.
Inktomi was much worse; at one point it was responsible for 10% of the site's bandwidth. Also, Inktomi/Slurp never used to get a 304 on my sites, just 200. Recently I've seen several 304s, so it's getting better.
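If you want to get percentages like these out of your own logs, something along these lines should work. It is only a sketch and assumes the Apache "combined" log format; the regex and the function name `status_share` are mine, so adjust them if your server logs differently.

```python
import re
from collections import Counter

# Assumes the "combined" format: ... "REQUEST" STATUS BYTES "REFERER" "AGENT"
LOG_LINE = re.compile(
    r'"\S+ \S+ [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def status_share(lines, agent_substring):
    """Return {status: fraction of requests} for user agents that contain
    `agent_substring` (case-insensitive). Empty dict if nothing matched."""
    counts = Counter()
    needle = agent_substring.lower()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and needle in match.group("agent").lower():
            counts[match.group("status")] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {status: n / total for status, n in counts.items()}
```

Usage would be something like `status_share(open("access.log"), "slurp")`, where the file name is just a placeholder. A high "404" share, or a "304" share stuck at zero, is what shows up as the behaviour described above.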
I noticed that Ink/Slurp will try to view the directory listing, i.e. if I have a URL like
[mysite.tld...]
it'll send a request for
[mysite.tld...]
and I've explicitly allowed directory browsing for Slurp, to HELP it quickly determine which files have a new timestamp. Downside: it followed the links from the directory listing and picked up some "orphan" files (i.e. files not linked from anywhere).
MSNBOT is the most annoying so far. I blocked it after it had generated 25,000 page hits on a 4,000-page site in just 10 days. 99% of the pages had not changed during that time, yet it downloaded all of them with an HTTP 200 code.
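One way to do such a block is via robots.txt (I'm not saying that is the only way; a server-level deny rule works too). Below is a small sketch that checks a set of rules with Python's standard `urllib.robotparser` before you deploy them. The rules themselves are only an example: the `msnbot` token and the 5-second `Crawl-delay` are assumptions, and not every crawler honors `Crawl-delay`.

```python
from urllib.robotparser import RobotFileParser

# Example rules (assumed, not taken from any real site):
# shut out msnbot entirely, ask everyone else to wait 5 seconds per request.
ROBOTS_TXT = """\
User-agent: msnbot
Disallow: /

User-agent: *
Crawl-delay: 5
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

msnbot_allowed = parser.can_fetch("msnbot", "/b1562.html")  # blocked by "Disallow: /"
slurp_allowed = parser.can_fetch("Slurp", "/b1562.html")    # falls under "*"
slurp_delay = parser.crawl_delay("Slurp")                   # seconds, from "*"

print(msnbot_allowed, slurp_allowed, slurp_delay)
```

This only tells you what a crawler that obeys robots.txt would do; a bot that ignores the file has to be blocked at the server.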
Unbelievable!
Addition: I tend to think the 404 probing is triggered by a "flag", set when Slurp suspects it could be up against a big machine-generated junk site or something.
I have another, much bigger site, where I also name files using numbers and a few letters, and my stats show 3210 page hits by Slurp, and NONE of them is looking for non-existent files (unless following broken links, of course).
This is IMO a leftover from Inktomi in Slurp, which exhibited this "bug" on that site and that site only. On other sites, it does well.
[edited by: dhatz at 11:44 am (utc) on June 23, 2004]