Yahoo crawls folders - 404s and 403s everywhere - (deprecated) Yahoo SE and Directory forum at WebmasterWorld

Yahoo crawls folders - 404s and 403s everywhere

Will this impact on our Yahoo standing?

pavlovapete

2:20 am on Jun 12, 2007 (gmt 0)

Hi all,

I've searched the forum but cannot seem to find any answer - please forgive me if the topic has been covered before.

Our logs show that Yahoo crawls folders. For example, if a link such as www.example.com/files/pdf/excellent_document.pdf exists on our site, the Yahoo crawler also requests www.example.com/files/ and www.example.com/files/pdf/, which generate either 403 or 404 responses.

Will Yahoo think our site is lousy because it generates so many errors?

We're on IIS so a .htaccess solution is not possible. And I'm not sure how to use robots.txt to ban Yahoo from /files/ and /files/pdf/ and still let it through to excellent_document.pdf (which I feel is a shabby solution because I'll need to update robots.txt whenever we add files with new folder structures).

Any advice would be greatly appreciated.

Cheers and thanks.
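
On the robots.txt worry in the post above: under the original robots.txt convention a Disallow line is a plain path prefix, so banning the folder also bans everything inside it. The sketch below checks that with Python's standard urllib.robotparser. The rules and the helper name `allowed` are made up for illustration, and the parser only shows how a standards-following crawler would read the file, not how Slurp actually behaves.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_lines, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, url)

# Hypothetical rules: try to keep Slurp out of the folder listing only.
rules = [
    "User-agent: Slurp",
    "Disallow: /files/",
]

folder = "http://www.example.com/files/pdf/"
pdf = "http://www.example.com/files/pdf/excellent_document.pdf"

print(allowed(rules, "Slurp", folder))   # False: the folder URL is blocked
print(allowed(rules, "Slurp", pdf))      # False: the PDF is blocked too (prefix match)
print(allowed(rules, "Googlebot", pdf))  # True: the rule names Slurp only
```

So a prefix rule cannot separate /files/pdf/ from the documents underneath it. Pattern extensions such as an end-of-URL `$` anchor are crawler-specific, so check the crawler's own documentation before relying on one.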

martinibuster

4:46 pm on Jun 12, 2007 (gmt 0)

How many requests for phantom files is it making?

I think Yahoo sometimes makes a request for phantom files in order to check for an autogenerated response.

pavlovapete

11:02 pm on Jun 12, 2007 (gmt 0)

Thanks martinibuster,

We got about 60 requests that returned 403 or 404 yesterday. I'm still uploading the logs into a DB so I can run multi-day queries, so I can't yet say for sure how long this has been going on.

Cheers
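
For the multi-day count mentioned above, a database is not strictly needed. Here is a minimal sketch that tallies 403/404 responses per day straight from the log lines. It assumes the IIS logs are in W3C extended format with the date, cs(User-Agent) and sc-status fields switched on; the function name `count_errors` and the sample lines are invented for the example.

```python
from collections import Counter

def count_errors(lines, agent_substring="slurp", statuses=("403", "404")):
    """Count error responses per day for one crawler in W3C extended log lines."""
    fields = []
    per_day = Counter()
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            # The #Fields directive names the columns of the rows that follow.
            fields = line.split()[1:]
            continue
        if not line or line.startswith("#") or not fields:
            continue
        row = dict(zip(fields, line.split(" ")))
        # IIS writes spaces in the user agent as '+', so a substring test is enough.
        agent = row.get("cs(User-Agent)", "").lower()
        if agent_substring in agent and row.get("sc-status") in statuses:
            per_day[row.get("date", "unknown")] += 1
    return per_day

# Invented sample lines, for illustration only.
sample = [
    "#Fields: date time cs-method cs-uri-stem c-ip cs(User-Agent) sc-status",
    "2007-06-11 02:20:01 GET /files/ 74.6.20.103 Mozilla/5.0+(compatible;+Yahoo!+Slurp) 403",
    "2007-06-11 02:20:05 GET /files/pdf/ 74.6.20.103 Mozilla/5.0+(compatible;+Yahoo!+Slurp) 404",
    "2007-06-11 02:21:00 GET /files/pdf/excellent_document.pdf 74.6.20.103 Mozilla/5.0+(compatible;+Yahoo!+Slurp) 200",
    "2007-06-12 09:00:00 GET /missing.htm 192.0.2.7 Mozilla/4.0+(compatible;+MSIE+6.0) 404",
]

print(count_errors(sample))
```

Feeding it the lines of several daily log files in turn gives one total per date, which is enough to see when the folder requests started.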

jdMorgan

11:37 pm on Jun 12, 2007 (gmt 0)

I've seen the same thing on Apache for several months -- Since there are no links to these "folder" URLs, and since I do not wish to provide a "table of contents"-style directory index listing, I just let them 403 and ignore them. I haven't seen any damage to listings or rankings that I can ascribe to this.

At some point, Yahoo! may realize that it is "very bad form" to crawl unlinked URLs... They ought to realize it, anyway.

Jim

Gene_B

2:18 am on Jun 13, 2007 (gmt 0)

I thought I was the only one with that problem. I have Apache Guardian sending me an email for every 40x error. They are all caused by Yahoo Slurp, mostly looking for invalid directories. For example, I have example.org/folder1/page1.htm, and Slurp keeps looking for example.org/folder1/folder1/page1.htm. I filed a complaint, and the reply was that somebody had a link to that URL. WRONG! They are the only ones. They said Slurp obeys robots.txt. WRONG! I added
User-agent: *
Disallow: /folder1/folder1/
and slurp is still giving me errors.

I added
User-agent: Slurp
Disallow: /
to get rid of yahoo for a while.
Slurp is also using 10 times as much bandwidth as Google!

Yahoo also blocks email from our shared mail server with:
X-YahooFilteredBulk: xx.xx.xx.xx but that is another nightmare story ..

pavlovapete

6:32 am on Jun 13, 2007 (gmt 0)

Thanks for your replies jdMorgan and Gene_B,

it makes me feel a *little* better that others are being subjected to these meaningless, bandwidth-sucking automated queries :)

I guess I'll head on over to the robots.txt forum and see if I can get some good oil on blocking Slurp.

I note, though, that Gene_B says Slurp is naughty, so we'll see what happens.

Thanks and Cheers

blend27

8:30 pm on Jun 13, 2007 (gmt 0)

Last night (2 am) I created a page:

www.site.tld/dir1/sub1/sub2/iblah.thml

there are no links to

site.tld/dir1/sub1/sub2/iblax.thml
site.tld/dir1/sub1/
www.site.tld/dir1/sub1/sub2/
www.site.tld/dir1/sub1/

Within a few hours it was trying to get into those.

WHY, WHY WHY?
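
Nobody in this thread knows how Slurp picks these URLs, but the folder requests reported here all look like the page URL with trailing path segments chopped off. If you want to know which folder URLs to watch for in your logs after publishing a page, a rough sketch along those lines follows. The function name `parent_folders` is hypothetical, and this is a guess at the pattern, not Yahoo's actual method.

```python
from urllib.parse import urlsplit, urlunsplit

def parent_folders(url):
    """List the folder URLs at or above `url`, deepest first."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    # Drop the final segment when it names a file rather than a folder.
    if not parts.path.endswith("/"):
        segments = segments[:-1]
    folders = []
    while segments:
        path = "/" + "/".join(segments) + "/"
        folders.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
        segments = segments[:-1]
    return folders

for folder in parent_folders("http://www.site.tld/dir1/sub1/sub2/iblah.thml"):
    print(folder)
# http://www.site.tld/dir1/sub1/sub2/
# http://www.site.tld/dir1/sub1/
# http://www.site.tld/dir1/
```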

Scarecrow

3:49 pm on Jun 21, 2007 (gmt 0)

Yahoo is doing the worst crawling I have ever seen. It uses multiple IP addresses. These are merely the bots on different Class C blocks, as of June 6:

dj101005.crawl.yahoo.net [74.6.131.201]
rz502640.inktomisearch.com [74.6.17.10]
rz502480.inktomisearch.com [74.6.18.10]
rz502320.inktomisearch.com [74.6.19.10]
lj512083.crawl.yahoo.net [74.6.20.103]
rz502240.inktomisearch.com [74.6.21.100]
lj511965.crawl.yahoo.net [74.6.22.15]
lj511880.crawl.yahoo.net [74.6.23.10]
lj511681.crawl.yahoo.net [74.6.24.101]
lj511560.crawl.yahoo.net [74.6.25.10]
lj511400.crawl.yahoo.net [74.6.26.10]
lj511202.crawl.yahoo.net [74.6.27.102]
lj511121.crawl.yahoo.net [74.6.28.101]
lj511000.crawl.yahoo.net [74.6.29.10]
lj612596.crawl.yahoo.net [74.6.65.202]
lj612273.crawl.yahoo.net [74.6.66.27]
lj612202.crawl.yahoo.net [74.6.67.100]
lj612055.crawl.yahoo.net [74.6.68.100]
lj611926.crawl.yahoo.net [74.6.69.10]
lj611681.crawl.yahoo.net [74.6.70.118]
lj611547.crawl.yahoo.net [74.6.71.118]
lj611375.crawl.yahoo.net [74.6.72.118]
lj611203.crawl.yahoo.net [74.6.73.120]
lj611068.crawl.yahoo.net [74.6.74.117]
lj612537.crawl.yahoo.net [74.6.75.10]
dj501008.crawl.yahoo.net [74.6.76.12]
ct501297.inktomisearch.com [74.6.85.134]
ct501168.inktomisearch.com [74.6.86.100]
ct501040.inktomisearch.com [74.6.87.10]

If you added bots within the same Class C, the list would be a lot larger. I don't think any of these bots talk to each other. I've seen the same IP address make as many as three fetches per second. Now if you have a couple dozen IP addresses working you over simultaneously, you can imagine what sort of load you start seeing. A snapshot of my Linux process table can show over 400 Yahoo fetches in the table at once.

The good thing is that there's almost zero traffic to my site that is coming in from the 74.6.*.* Class B block, apart from Yahoo. So I just stuck in a kernel block for the entire Class B range. That ought to make Yahoo get the message.

Four days after I installed this Class B block, I had occasion to reboot. In the three minutes it took me to reinstall my blocks following the reboot, Yahoo made 102 fetch requests. Let me put it differently: Yahoo's bots were hanging on my block for four days straight, and they still didn't get the message!

On another server, I've had about 1,000 files that I took down six weeks ago. Today I checked who's been asking for these 404 files. Believe it or not, 81 percent of all requests for these files are from the Yahoo bot. There aren't enough files on that server to get overloaded by Yahoo's rogue bots, fortunately.

P.S.: I don't like robots.txt much, so I don't use it.
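
In CIDR terms, the "Class B" block above is 74.6.0.0/16 and each "Class C" is a /24. Before blocking a range that wide, it is worth checking which of your logged client addresses actually fall inside it. A small sketch with Python's standard ipaddress module follows. The helper names are made up, the sample addresses are taken from the list in the post, and whether the whole range belongs to Yahoo is the poster's observation rather than something verified here.

```python
import ipaddress

# The 74.6.*.* range from the post, written as a CIDR network.
BLOCKED_RANGE = ipaddress.ip_network("74.6.0.0/16")

def in_blocked_range(ip, network=BLOCKED_RANGE):
    """Return True if the address falls inside the blocked network."""
    return ipaddress.ip_address(ip) in network

def distinct_slash24(ips):
    """Return the set of /24 ("Class C") networks the addresses belong to."""
    return {ipaddress.ip_network(ip + "/24", strict=False) for ip in ips}

# A few addresses copied from the list above.
sample = ["74.6.131.201", "74.6.17.10", "74.6.18.10", "74.6.20.103"]

print(all(in_blocked_range(ip) for ip in sample))  # True
print(in_blocked_range("192.0.2.1"))               # False
print(len(distinct_slash24(sample)))               # 4
```

Running the same check over every client IP in a day's log shows how much non-crawler traffic a block on the whole range would cut off.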

giuliorapetti

9:31 pm on Jun 21, 2007 (gmt 0)

Well, if they are using crawlers so extensively these days, it is surely due to something that has happened since last Thursday (see the other post).

In fact, when we lost all our rankings (which we had held for ages), our pages were still indexed but had no cache. (I don't know what that means, as I'm not a Yahoo! SEO expert; maybe someone can help me understand it.)

Now the cache seems to have returned for the URLs I checked (I don't know whether it came back for all of them; there are tens of thousands, and I only checked a random sample).

A very "primitive" interpretation of what I have experienced (although we still get no traffic from Y!) is that they "lost" our cache data and are quickly trying to rebuild their knowledge of our sites by crawling all our URLs again.

Indeed, they crawled all our URLs. It really is as if they had kept the URL data in one place but not the HTML related to it (the cache), and now have to rebuild their databases of cached pages... maybe because they want to store them in a different way. (I don't understand why: if you change your algo, just take the same HTML and apply different logic to it. There is no point in "trashing" our old, ahem, current HTML just because you have changed the glasses you look at it through.)

I have no other explanation for this. Again, I have never seen such huge Y! spider activity as in these days (after the "black Thursday" of last week)...