
Yahoo

wilderness

9:32 pm on Aug 4, 2008 (gmt 0)

They're joking! Right?

216.252.116.nnn - - [04/Aug/2008:17:45:13 +0100] "GET /robots.txt HTTP/1.0" 403 998 "-" "Mozilla/5.0"
216.252.116.nnn - - [04/Aug/2008:17:45:13 +0100] "GET / HTTP/1.0" 403 1012 "-" "Mozilla/5.0"

[edited by: incrediBILL at 9:41 pm (utc) on Aug. 4, 2008]
[edit reason] Obscured IPs [/edit]

jdMorgan

10:17 pm on Aug 4, 2008 (gmt 0)

From what I can tell, that's not a standard crawler address range, so I will certainly treat it as a 'bad UA' for now...

"Yahoo! -- All kinds of spiders, all the time!"

Five times the page fetches of anyone else, an unidentified BonEcho browser UA, and now this. They really need to get their crawling policies under control...

Jim

Samizdata

10:49 pm on Aug 4, 2008 (gmt 0)

All kinds of spiders, all the time!

I have never understood why they have to be so bad in this department.

I am not suggesting cause-and-effect, but the behaviour of Slurp and MSN recorded in my logs compares very poorly with the efficiency of Googlebot, just as the resulting SERPs do.

There are many reasons to criticise and be concerned about the Google behemoth, but in my experience they show a lot more respect to webmasters than any of their supposed competitors.

Yahoo's attitude (and that of others) seems contemptuous in comparison.

I find them very easy to dislike, despite their underdog status.

...

Lord Majestic

11:36 pm on Aug 4, 2008 (gmt 0)

Generally speaking, the first 403 response can legitimately be treated as the lack of a robots.txt; here is the relevant "standard" text:

"Specific behaviors for other server responses are not required by this specification, though the following behaviours are recommended:
- On server response indicating access restrictions (HTTP Status Code 401 or 403) a robot should regard access to the site completely restricted."

Our bot (we are not Yahoo), though, follows this recommendation.

It is really a good idea to always allow crawling of robots.txt - it may save unnecessary aggro.

Key_Master

11:43 pm on Aug 4, 2008 (gmt 0)

Generally speaking, the first 403 response can legitimately be treated as the lack of a robots.txt ... It is really a good idea to always allow crawling of robots.txt - it may save unnecessary aggro.

My thoughts exactly.

wilderness

11:58 pm on Aug 4, 2008 (gmt 0)

Generally speaking, the first 403 response can legitimately be treated as the lack of a robots.txt; here is the relevant "standard" text:

Many thanks.
Back in March I moved many of the UAs over to rewrites (which allow access to robots.txt), rather than SetEnvIf.

However, for this vague UA "Mozilla/5.0" I'd never consider allowing any access, even to robots.txt.
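
For anyone unfamiliar with the technique, the general shape is something like this - a minimal, untested sketch, where "BadBot" is an illustrative pattern and not one of my actual rules:

# Deny a bad UA everywhere except robots.txt
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} BadBot [NC]
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule .* - [F]

# For the bare "Mozilla/5.0" UA, anchor the pattern so that real
# browsers (whose UA string merely begins with "Mozilla/5.0 (") are
# not caught, and skip the robots.txt exception entirely:
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0$
RewriteRule .* - [F]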

incrediBILL

12:35 am on Aug 5, 2008 (gmt 0)

They really need to get their crawling policies under control...

I have a simple server policy that solves the problem.

Slurp is allowed from crawl.yahoo.net; everything else from that range bounces.

Anything else from Yahoo is treated like any other IP address and goes through the rest of my tests; "Mozilla/5.0" doesn't pass the "is it a browser?" filter, so it would also get kicked to the curb.
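
The Slurp half of that policy comes out roughly like this in mod_rewrite - a sketch, not my actual rules, and it assumes HostnameLookups is On (or better, Double, so the lookup is forward-confirmed); otherwise %{REMOTE_HOST} won't hold a hostname:

# Bounce anything claiming to be Slurp that doesn't resolve to crawl.yahoo.net
RewriteCond %{HTTP_USER_AGENT} Slurp [NC]
RewriteCond %{REMOTE_HOST} !crawl\.yahoo\.net$ [NC]
RewriteRule .* - [F]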

Lord Majestic

1:12 am on Aug 5, 2008 (gmt 0)

However, for this vague UA "Mozilla/5.0" I'd never consider allowing any access, even to robots.txt.

Usually, whatever requests robots.txt is probably trying to comply with it, so given that robots.txt is a very small static file, it seems a good idea to allow anybody to fetch it.

Small-time bots would probably not follow the recommended course of action, so not allowing them to access robots.txt kind of puts you in the wrong - no need for that, really.

This does not mean you should not give a 403 to that weird user-agent for any other request, though. Perhaps you could auto-generate a "Disallow: /" robots.txt for any weird user-agent, so that they can't say they tried to check it but your server did not allow them to get it.
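
In Apache terms that could be sketched like this - illustrative only: "deny-all.txt" is a made-up filename containing just "User-agent: *" and "Disallow: /", and the bare "Mozilla/5.0" pattern stands in for whatever weird user-agent you are tracking:

# Serve the blanket-disallow file as robots.txt to the weird UA...
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0$
RewriteRule ^robots\.txt$ /deny-all.txt [L]
# ...and 403 everything else it requests
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0$
RewriteRule !^(robots\.txt|deny-all\.txt)$ - [F]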

Samizdata

1:56 am on Aug 5, 2008 (gmt 0)

Perhaps you could auto-generate a "Disallow: /" robots.txt for any weird user-agent

I would suggest a cloaked (restrictive) version of robots.txt for anything not white-listed.

I know it is all "voluntary", but I have never understood what Yahoo hope to gain by indexing content that has been specifically disallowed. Mistakes are inevitable, but they are regular offenders, and they also list and cache content marked with "noindex" meta tags - quite deliberately.

It doesn't benefit their users (or their bank balance) and it only makes me despise them.

Businesses usually cultivate good relations with their suppliers.

Google (and Lord Majestic) play nice, and it costs nothing.

...

wilderness

2:13 am on Aug 5, 2008 (gmt 0)

so not allowing them to access robots.txt kind of puts you in the wrong - no need for that, really.

Except!
That I chose simply to be pig-headed and inflexible about Yahoo's lack of UA conformity!

Which came first, the chicken or the egg? ;)

I would suggest a cloaked (restrictive) version of robots.txt for anything not white-listed.

The variety of requests to our pages has grown so diverse that we as webmasters are required to maintain custom configurations of if, when, how, what-colour, etc.

Personally, I'm simply not interested in maintaining another in-depth list of who or what requires a custom robots.txt, with indifference to all other visitors.

Samizdata

2:40 am on Aug 5, 2008 (gmt 0)

I know the feeling, Don, but white-lists tend to be very short.

RewriteCond %{HTTP_USER_AGENT} !(Google|Yahoo|msnbot|Teoma|whoever) [NC]
RewriteRule ^robots\.txt$ /denied.txt [L]

Contents of denied.txt:

User-agent: *
Disallow: /

I bet your white-list would be shorter than mine, anyway.

...

Samizdata

3:16 am on Aug 5, 2008 (gmt 0)

I should add this for any readers who are unfamiliar:

# List of
# unwanted
# nuisance
# conditions
# goes here
RewriteCond %{REQUEST_URI} !^/(robots|denied)
RewriteRule .* - [F]

Both versions of the file (robots.txt and denied.txt) must be excluded.

...

Staffa

6:48 am on Aug 5, 2008 (gmt 0)

Google (and Lord Majestic) play nice, and it costs nothing.

Just in the last half hour I caught Google reading robots.txt and then accessing files in a disallowed directory.
Banned the IP number to let it cool off a while; then it can come back.

incrediBILL

7:09 am on Aug 5, 2008 (gmt 0)

Usually, whatever requests robots.txt is probably trying to comply with it

That would be incorrect.

Many scrapers look at robots.txt to avoid spider traps which are disallowed for SEs.

Samizdata

12:03 pm on Aug 5, 2008 (gmt 0)

Just in the last half hour I caught Google

I have never known Googlebot to be disobedient myself, but vigilance is always required.

For balance, here is an example of their willingness to be as bad as the others:

[mattcutts.com...]

...

Lord Majestic

12:32 pm on Aug 5, 2008 (gmt 0)

Just in the last half hour I caught Google reading robots.txt and accessing files in a disallowed directory.

I don't speak for Google, but I think their implementation of robots.txt is somewhat decoupled from crawling: robots.txt gets fetched by a process other than the crawler itself, and is then applied to outstanding URLs to clean them up. There is therefore a time lag between fetching robots.txt and the actual crawling, which can lead to the situation you described above.

Our implementation is different in this respect: for each batch of URLs (400 or fewer), each crawler checks robots.txt prior to fetching them (unless the robots.txt for that batch was cached within the last 24 hours - usually the cache is not used). This way our bot does not have to wait for robots.txt changes to feed through the system. Recently we also started additional cleaning-up of URLs for fairly large sites; this seems to work well both for us and for the sites that have disallowed bots.

Many scrapers look at robots.txt to avoid spider traps which are disallowed for SEs.

But surely those bots can't read selectively, and would also avoid crawling all the other URLs disallowed by robots.txt, thus fulfilling the purpose of robots.txt? I suppose I am just not an evil enough person to appreciate what bad can be done using robots.txt :)