I've searched the forum but cannot seem to find any answer - please forgive me if the topic has been covered before.
Our logs show that Yahoo crawls folders. For example, if a link such as www.example.com/files/pdf/excellent_document.pdf exists on our site, the Yahoo crawler also requests www.example.com/files/ and www.example.com/files/pdf/, which generate either 403 or 404 responses.
Will Yahoo think our site is lousy because it generates so many errors?
We're on IIS, so a .htaccess solution is not possible. I'm also not sure how to use robots.txt to ban Yahoo from /files/ and /files/pdf/ while still letting it through to excellent_document.pdf (which feels like a shabby solution anyway, because I'd need to update robots.txt whenever we add files with new folder structures).
Any advice would be greatly appreciated.
Cheers and thanks.
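One way to approach the robots.txt question above, sketched under a big assumption: that the crawler honors the `*` wildcard and the `$` end-of-URL anchor in Disallow lines. Those are extensions beyond the original robots.txt convention, so check the crawler's own documentation before relying on them. If they are honored, `Disallow: /files/$` and `Disallow: /files/*/$` would match only the bare folder URLs and leave the documents inside them crawlable, with no update needed when new subfolders appear. The `pattern_to_regex` and `path_blocked` helpers are hypothetical, written only to show how such patterns would match; they are not any crawler's actual code.

```python
import re

# Hypothetical rules, assuming the crawler supports the `*` and `$` extensions.
DISALLOW_PATTERNS = ["/files/$", "/files/*/$"]


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    `*` matches any run of characters. A trailing `$` anchors the end of
    the URL path; without it the pattern is a plain prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def path_blocked(path, patterns=DISALLOW_PATTERNS):
    """Return True if any Disallow pattern matches the URL path."""
    return any(pattern_to_regex(p).match(path) for p in patterns)


for candidate in ("/files/", "/files/pdf/", "/files/pdf/excellent_document.pdf"):
    print(candidate, "blocked" if path_blocked(candidate) else "allowed")
```

Under these assumptions the two folder URLs come out blocked and the PDF stays allowed.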
At some point, Yahoo! may realize that it is "very bad form" to crawl unlinked URLs... They ought to realize it, anyway.
Jim
I added
User-agent: Slurp
Disallow: /
to get rid of Yahoo for a while.
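For anyone who wants to sanity-check a rule like this before deploying it, here is a minimal sketch using Python's standard `urllib.robotparser`. It only shows how a standards-following parser reads the two lines; whether Slurp itself obeys them is a separate question. The example path and the "Googlebot" comparison are illustrative choices, not something from the post.

```python
from urllib.robotparser import RobotFileParser

# The two-line rule from the post above.
ROBOTS_TXT = """\
User-agent: Slurp
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Pass the bare robot token; this parser matches on the product name only.
path = "/files/pdf/excellent_document.pdf"
slurp_allowed = parser.can_fetch("Slurp", path)
google_allowed = parser.can_fetch("Googlebot", path)

print("Slurp allowed:", slurp_allowed)
print("Googlebot allowed:", google_allowed)
```

The rule shuts out Slurp only; agents with no matching record are allowed by default.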
Slurp is also using 10 times as much bandwidth as Google!
Yahoo also blocks email from our shared mail server with:
X-YahooFilteredBulk: xx.xx.xx.xx, but that is another nightmare story...
It makes me feel a *little* better that others are being subjected to these meaningless, bandwidth-sucking automated queries :)
I guess I'll head on over to the robots.txt forum and see if I can get some good oil on blocking Slurp.
I note though that GeneB says that Slurp is naughty so we'll see what happens.
Thanks and Cheers
dj101005.crawl.yahoo.net [74.6.131.201]
rz502640.inktomisearch.com [74.6.17.10]
rz502480.inktomisearch.com [74.6.18.10]
rz502320.inktomisearch.com [74.6.19.10]
lj512083.crawl.yahoo.net [74.6.20.103]
rz502240.inktomisearch.com [74.6.21.100]
lj511965.crawl.yahoo.net [74.6.22.15]
lj511880.crawl.yahoo.net [74.6.23.10]
lj511681.crawl.yahoo.net [74.6.24.101]
lj511560.crawl.yahoo.net [74.6.25.10]
lj511400.crawl.yahoo.net [74.6.26.10]
lj511202.crawl.yahoo.net [74.6.27.102]
lj511121.crawl.yahoo.net [74.6.28.101]
lj511000.crawl.yahoo.net [74.6.29.10]
lj612596.crawl.yahoo.net [74.6.65.202]
lj612273.crawl.yahoo.net [74.6.66.27]
lj612202.crawl.yahoo.net [74.6.67.100]
lj612055.crawl.yahoo.net [74.6.68.100]
lj611926.crawl.yahoo.net [74.6.69.10]
lj611681.crawl.yahoo.net [74.6.70.118]
lj611547.crawl.yahoo.net [74.6.71.118]
lj611375.crawl.yahoo.net [74.6.72.118]
lj611203.crawl.yahoo.net [74.6.73.120]
lj611068.crawl.yahoo.net [74.6.74.117]
lj612537.crawl.yahoo.net [74.6.75.10]
dj501008.crawl.yahoo.net [74.6.76.12]
ct501297.inktomisearch.com [74.6.85.134]
ct501168.inktomisearch.com [74.6.86.100]
ct501040.inktomisearch.com [74.6.87.10]
If you added bots within the same Class C, the list would be a lot larger. I don't think any of these bots talk to each other. I've seen the same IP address make as many as three fetches per second. Now if you have a couple dozen IP addresses working you over simultaneously, you can imagine what sort of load you start seeing. A snapshot of my Linux process table can show over 400 Yahoo fetches at once.
The good thing is that almost no traffic reaches my site from the 74.6.*.* Class B block apart from Yahoo. So I just stuck in a kernel block for the entire Class B range. That ought to make Yahoo get the message.
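Before blocking a range that wide, it is worth checking which addresses actually fall inside it. A small sketch with Python's standard `ipaddress` module, assuming the 74.6.*.* block is written as 74.6.0.0/16 in CIDR notation; the `in_blocked_range` helper is hypothetical, and 192.0.2.1 is just a documentation-range address used as a counterexample.

```python
import ipaddress

# The 74.6.*.* range from the post, expressed in CIDR notation.
YAHOO_BLOCK = ipaddress.ip_network("74.6.0.0/16")


def in_blocked_range(ip, network=YAHOO_BLOCK):
    """Return True if the dotted-quad address falls inside the network."""
    return ipaddress.ip_address(ip) in network


# Two addresses from the crawler list above, plus one unrelated address.
for ip in ("74.6.131.201", "74.6.17.10", "192.0.2.1"):
    print(ip, "blocked" if in_blocked_range(ip) else "not blocked")
```

The same check can be run over the client addresses in an access log to see who else would be caught by the block.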
Four days after I installed this Class B block, I had occasion to reboot. In the three minutes it took me to reinstall my blocks following the reboot, Yahoo made 102 fetch requests. Let me put it differently: Yahoo's bots were hanging on my block for four days straight, and they still didn't get the message!
On another server, I've had about 1,000 files that I took down six weeks ago. Today I checked who's been asking for these 404 files. Believe it or not, 81 percent of all requests for these files are from the Yahoo bot. There aren't enough files on that server to get overloaded by Yahoo's rogue bots, fortunately.
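A figure like that 81 percent can be pulled out of an access log with a few lines of script. The sketch below assumes the log is in Common Log Format and that matching on the 74.6. address prefix is a good enough stand-in for "the Yahoo bot"; both are assumptions, and the `share_of_404s` helper is hypothetical rather than the method actually used for the number above.

```python
import re

# Assumes Common Log Format: ip ident user [time] "request" status size
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "[^"]*" (?P<status>\d{3}) '
)


def share_of_404s(lines, ip_prefix="74.6."):
    """Return the fraction of 404 responses whose client IP starts with ip_prefix."""
    total = 0
    matched = 0
    for line in lines:
        m = LOG_LINE.match(line)
        if not m or m.group("status") != "404":
            continue  # skip unparseable lines and non-404 responses
        total += 1
        if m.group("ip").startswith(ip_prefix):
            matched += 1
    return matched / total if total else 0.0
```

Usage would be along the lines of `share_of_404s(open("access.log"))`, where the file name is a placeholder for wherever the server writes its log.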
P.S.: I don't like robots.txt much, so I don't use it.
In fact, when we lost all our rankings (which we had held for ages), our pages were still indexed but with no cache. I don't know what that means, as I'm not a Yahoo! SEO expert; maybe someone can help me understand it.
Now the cache seems to have returned for the URLs I checked at random (I don't know if it came back for all of them; there are tens of thousands, and I only spot-checked).
A very "primitive" interpretation of what I have experienced (although we still get no traffic from Y!) is that they "lost" our cache data and are quickly trying to rebuild their knowledge of our sites by crawling all our URLs again.
Indeed, they crawled all our URLs. It's as if they had kept the URL data in one place but not the HTML that goes with it (the cache), and now have to rebuild their database of cached pages, maybe because they want to store them in a different way. I don't understand why: if you change your algo, just take the same HTML and apply different logic to it. There is no point in trashing our old (ahem, current) HTML just because you have changed the glasses through which you look at it.
I have no other explanation for this. Again, I have never seen such huge Y! spider activity as in these past days (after the "black Thursday" of last week)...