Yahoo now violating robots.txt - Crawler, Spider, and User Agent ID forum at WebmasterWorld - WebmasterWorld

Yahoo now violating robots.txt

... and heading for a complete ban from our sites

Mokita

3:17 am on Dec 28, 2007 (gmt 0)

10+ Year Member

Yahoo has been hammering all our sites over the last few days, but always only asking for the html pages, no supporting files - as it should if respecting our robots.txt.

But today I find this in the logs of one site (local file details obfuscated):

rz502454.crawl.yahoo.net - - [28/Dec/2007:11:22:49 +1000] "GET /productpage.htm HTTP/1.1" 200 2372 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; [help.yahoo.com...]
rz502454.crawl.yahoo.net - - [28/Dec/2007:11:22:49 +1000] "GET /style.css HTTP/1.1" 200 2724 "http://www.example.com/productpage.htm" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; [help.yahoo.com...]
rz502454.crawl.yahoo.net - - [28/Dec/2007:11:22:50 +1000] "GET /images/product-pic.jpg HTTP/1.1" 200 6517 "http://www.example.com/productpage.htm" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; [help.yahoo.com...]
rz502454.crawl.yahoo.net - - [28/Dec/2007:11:22:51 +1000] "GET /images/image1.jpg HTTP/1.1" 200 12077 "http://www.example.com/productpage.htm" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; [help.yahoo.com...]
rz502454.crawl.yahoo.net - - [28/Dec/2007:11:22:51 +1000] "GET /images/image2.gif HTTP/1.1" 200 45 "http://www.example.com/style.css" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; [help.yahoo.com...]
rz502454.crawl.yahoo.net - - [28/Dec/2007:11:22:51 +1000] "GET /images/image3.jpg HTTP/1.1" 200 6942 "http://www.example.com/style.css" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; [help.yahoo.com...]

I don't know what they think they are up to, but if they try that again they will be banned from all our sites. The traffic we get from Yahoo is negligible anyway.

Has anyone else seen this behaviour?

volatilegx

6:52 pm on Dec 28, 2007 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

but always only asking for the html pages, no supporting files - as it should if respecting our robots.txt.

What do you mean? You can't require them to ask for files via robots.txt.

thetrasher

9:00 pm on Dec 28, 2007 (gmt 0)

10+ Year Member

Has anyone else seen this behaviour?

Yes and no. Slurp/3.0 asks for non-HTML files (even imported CSS) with correct referrers, just like a Gecko browser. But I can't confirm a violation of robots.txt (nothing disallowed here).

Seems to be a quality check. Over and over again.

Mokita

11:37 pm on Dec 28, 2007 (gmt 0)

10+ Year Member

volatilegx wrote:

What do you mean? You can't require them to ask for files via robots.txt.

Sorry, I see now that my sentence is somewhat ambiguous.

What I meant to say, is that normally Yahoo Slurp correctly and properly obeys our robots.txt, which disallows all images and style sheets.

On the one occasion shown in the logs posted above, Slurp requested all supporting files, which it should not have if respecting robots.txt - as Yahoo claims it does.
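For anyone who wants to check their own logs the same way, here is a minimal Python sketch. The robots.txt rules below are stand-ins (the real file was not posted in this thread), and the helper name `disallowed_hits` is made up for illustration. It uses the standard library's `urllib.robotparser`, which matches on a robot's name token such as "Slurp" rather than on the full user-agent string.

```python
import re
from urllib.robotparser import RobotFileParser

# Stand-in rules: the actual robots.txt was not posted, so these are assumptions.
ROBOTS_TXT = """\
User-agent: *
Disallow: /images/
Disallow: /style.css
"""

# Pulls the request path and the user-agent out of a combined-format log line.
LOG_LINE = re.compile(
    r'"GET (?P<path>\S+) HTTP/[\d.]+" \d{3} \d+ "[^"]*" "(?P<agent>[^"]*)'
)

def disallowed_hits(log_lines, robots_txt=ROBOTS_TXT, token="Slurp"):
    """Return the paths requested by `token` that robots.txt disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    hits = []
    for line in log_lines:
        match = LOG_LINE.search(line)
        if match is None or token not in match.group("agent"):
            continue  # unparsable line, or a different visitor
        if not parser.can_fetch(token, match.group("path")):
            hits.append(match.group("path"))
    return hits

# First three log lines from the post above, copied as they were posted.
sample = [
    'rz502454.crawl.yahoo.net - - [28/Dec/2007:11:22:49 +1000] "GET /productpage.htm HTTP/1.1" 200 2372 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; [help.yahoo.com...]',
    'rz502454.crawl.yahoo.net - - [28/Dec/2007:11:22:49 +1000] "GET /style.css HTTP/1.1" 200 2724 "http://www.example.com/productpage.htm" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; [help.yahoo.com...]',
    'rz502454.crawl.yahoo.net - - [28/Dec/2007:11:22:49 +1000] "GET /images/product-pic.jpg HTTP/1.1" 200 6517 "http://www.example.com/productpage.htm" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; [help.yahoo.com...]',
]
print(disallowed_hits(sample))  # ['/style.css', '/images/product-pic.jpg']
```

The HTML page passes and the two supporting files are flagged. Note that `urllib.robotparser` only does prefix matching, so wildcard rules such as `*.css` would need a different parser.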

wilderness

11:55 pm on Dec 28, 2007 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Mokita,
Even the major bots blow a gasket from time to time.

During 2006 the Google image bot began spidering all my images, which are contained in a folder excluded in robots.txt.
After a day or two I added a denial (which I still have intact) and then I contacted Google. They apologized and the spidering ceased immediately. At least at that time.
A few months later it began again. The second time, I didn't even contact them.

Don

jdMorgan

2:38 am on Dec 29, 2007 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

This one appears to be intentional, fetching robots.txt and then switching user-agents (Dear Yahoo!, Switching user-agents in mid-stream is a sure way to hit my anti-abuse traps).

74.6.22.120 - - [28/Dec/2007:18:12:00 -0500] "GET /robots.txt HTTP/1.0" 200 3192 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]
74.6.22.120 - - [28/Dec/2007:18:12:00 -0500] "GET /widget.html HTTP/1.0" 403 666 "-" "Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.4) Gecko/20071214 BonEcho/2.0.0.4"
74.6.22.120 - - [28/Dec/2007:18:12:03 -0500] "GET /histyle.css HTTP/1.0" 403 666 "http://www.example.com/widget.html" "Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.4) Gecko/20071214 BonEcho/2.0.0.4"
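The actual anti-abuse trap is not shown, so the following is only a guess at the general idea: remember which user-agent fetched robots.txt from each IP, and flag later requests from the same IP that arrive shortly afterwards under a different user-agent. The function name `find_agent_switches` and the five-minute window are assumptions.

```python
from datetime import datetime, timedelta

def find_agent_switches(requests, window=timedelta(minutes=5)):
    """requests: iterable of (ip, timestamp, path, user_agent) in time order.

    Returns (ip, path, user_agent) for every request that follows a
    /robots.txt fetch from the same IP within `window` but carries a
    different user-agent string.
    """
    robots_fetch = {}  # ip -> (time, user_agent) of the latest robots.txt fetch
    flagged = []
    for ip, when, path, agent in requests:
        if path == "/robots.txt":
            robots_fetch[ip] = (when, agent)
            continue
        seen = robots_fetch.get(ip)
        if seen and when - seen[0] <= window and agent != seen[1]:
            flagged.append((ip, path, agent))
    return flagged

SLURP = "Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]"
BONECHO = ("Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.4) "
           "Gecko/20071214 BonEcho/2.0.0.4")

# The three requests from the log excerpt above.
log = [
    ("74.6.22.120", datetime(2007, 12, 28, 18, 12, 0), "/robots.txt", SLURP),
    ("74.6.22.120", datetime(2007, 12, 28, 18, 12, 0), "/widget.html", BONECHO),
    ("74.6.22.120", datetime(2007, 12, 28, 18, 12, 3), "/histyle.css", BONECHO),
]
for ip, path, agent in find_agent_switches(log):
    print(ip, path)
```

Run against the three log lines above, this flags both the /widget.html and the /histyle.css requests.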

If you want to know if I'm cloaking, I am. But it's to properly support mobile devices and keep some pages out of your index by forbidding silliness like the above. And I say so plainly with the "Vary: User-agent" response header on every page. :)

I'm already very annoyed with MSN's failure to recognize robots.txt records addressed to each of their various and sundry robots, but it looks like Yahoo! is about to join the "abusive and annoying robots" club... :(

Jim

Marcia

2:58 am on Dec 29, 2007 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

>>I'm already very annoyed with MSN's failure to recognize robots.txt records addressed to each of their various and sundry robots

Isn't it supposed to be for detecting and thwarting cloaking?

Mokita

3:16 am on Dec 29, 2007 (gmt 0)

10+ Year Member

Marcia - I think Jim is referring to MSNbot failing to follow individual directions in robots.txt for its multiple bots - like msnbot-media, msnbot-news, msnbot-products etc. If you try to disallow those from crawling your site, it also has the effect of disallowing the generic msnbot and your site disappears from their index.
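As a point of comparison, here is a small sketch of what a webmaster would expect from per-bot records, using Python's standard `urllib.robotparser`. The robots.txt content and the `/index.html` path are hypothetical, and this only shows how that one parser reads the file. It says nothing about how msnbot itself behaved.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out the media bot, leave everything else alone.
ROBOTS_TXT = """\
User-agent: msnbot-media
Disallow: /

User-agent: *
Disallow:
"""

def allowed(token, path="/index.html", robots_txt=ROBOTS_TXT):
    """True if the robot named `token` may fetch `path` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(token, path)

print(allowed("msnbot-media"))  # False
print(allowed("msnbot"))        # True
```

Under this reading only the media bot is excluded and the generic msnbot falls through to the catch-all record, which is the behaviour the complaint says MSN did not deliver.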

See this thread for more info:
[webmasterworld.com...]

Trying to restrict individual bots using robots.txt is not cloaking.

Marcia

3:46 am on Dec 29, 2007 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

I believe it was something I read that was written by Nathan Buggia on their blog, possibly at their in-house forum, too.

OK, I re-read the post by MSNdude. They are too! indexing pages with a robots noindex, nofollow. Also grabbing links off pages and indexing those LINKS instead of the page itself.

Is all that confused, rude or accidental?

This is both MSN and Yahoo, tons of them

[webmasterworld.com...]

[edited by: Marcia at 3:51 am (utc) on Dec. 29, 2007]

youfoundjake

4:20 am on Dec 29, 2007 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

I have 2 sites, one about 2 years old, the other a month old. Same exact robots.txt file on both sites. On the older site, Slurp obeys. On the new site, Slurp plowed right through, hitting a bot trap that was disallowed in robots.txt, and got banned. Since it's Yahoo, I did unban, but just this once. If they do it again, out they stay!

wilderness

2:53 pm on Dec 29, 2007 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Has anybody used:

SetEnvIf User-Agent "Slurp/3.0;" keep_out
or

RewriteCond %{HTTP_USER_AGENT} "Slurp/3.0;" [NC]
RewriteRule .* - [F]

and seen any change in either their Yahoo page listings or crawls from the other Yahoo bots?
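To see which of the user-agent strings posted in this thread that pattern would catch, here is a quick Python check. It is only an approximation of Apache's matching (Python's `re` stands in for the regex engine, and `would_match` is a made-up helper). `SetEnvIf` is case-sensitive, while the `[NC]` flag makes the `RewriteCond` case-insensitive.

```python
import re

# Same pattern as in the directives above. The unescaped "." matches any
# character, so this is slightly looser than a literal "Slurp/3.0;".
PATTERN = "Slurp/3.0;"

def would_match(user_agent, ignore_case):
    """Approximate the match: ignore_case=True mimics the [NC] flag."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.search(PATTERN, user_agent, flags) is not None

slurp_3 = "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; [help.yahoo.com...]"
slurp_plain = "Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]"

print(would_match(slurp_3, ignore_case=False))      # True
print(would_match(slurp_plain, ignore_case=False))  # False
print(would_match(slurp_3.lower(), ignore_case=False))  # False
print(would_match(slurp_3.lower(), ignore_case=True))   # True
```

So the rule only touches requests that announce themselves as Slurp/3.0. The plain "Yahoo! Slurp" string seen elsewhere in this thread would pass through.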

keyplyr

1:01 am on Dec 31, 2007 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

[webmasterworld.com...]

wilderness

1:14 am on Dec 31, 2007 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

keyplyr,
I've added Slurp/3.0 to denies and will thus see how it affects my listings and other crawls.

keyplyr

3:20 am on Dec 31, 2007 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

keyplyr,
I've added Slurp/3.0 to denies and will thus see how it affects my listings and other crawls - wilderness

You're a better man than I am, Gunga Din.

incrediBILL

1:16 am on Jan 4, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

You don't need to deny Slurp, silly boys.

Robots.txt is just the polite way of telling the bot what to do.

htaccess is how you enforce it if they decide to color outside the lines.

No need to completely ban, sheesh.

BTW, I suspect you'll see more bizarre behavior as everyone starts to add thumbnails and also attempts to stop SE cloaking so brace for impact.
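The difference between asking and enforcing can be sketched in a few lines of Python. This is a stand-in for an .htaccess rule, not a replacement for one: the prefixes are hypothetical and simply mirror what a robots.txt might disallow, and `status_for` is an invented name.

```python
# Hypothetical list mirroring the Disallow lines of a robots.txt.
DISALLOWED_PREFIXES = ("/images/", "/style.css")

def status_for(user_agent, path, bot_token="Slurp"):
    """Return 403 only when the named robot asks for something it was told
    to skip; everything else is served normally (200)."""
    is_bot = bot_token.lower() in user_agent.lower()
    if is_bot and path.startswith(DISALLOWED_PREFIXES):
        return 403
    return 200

slurp = "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; [help.yahoo.com...]"
print(status_for(slurp, "/productpage.htm"))    # 200
print(status_for(slurp, "/images/image1.jpg"))  # 403
```

The robot keeps access to the pages it is allowed to index and is refused only where it strays, which is the point being made about not needing a complete ban.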

wilderness

4:03 am on Jan 4, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

You don't need to deny Slurp, silly boys.
Robots.txt is just the polite way of telling the bot what to do.

Are you suggesting that I may add

User-agent: Slurp/3.0
Disallow: /

to my robots.txt and Slurp/3.0 will comply and all the other Yahoo bots will continue?

Don

wilderness

4:40 pm on Jan 12, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

keyplyr,
I've added Slurp/3.0 to denies and will thus see how it affects my listings and other crawls.

Just a follow up on this.

This denial has NOT hindered the adding of some recent additions (new pages in the past ten days) of my websites to Yahoo listings.

Don

wilderness

4:47 am on Feb 19, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

I slipped this in another topic:
[webmasterworld.com...]

even though that thread was regarding Java. (off topic).

It's technically off the subject line (Yahoo and robots.txt) here as well, however more focused.

Early Friday I contacted Yahoo regarding the appearance of a Yahoo bot from a non-traditional IP range.
An automated confirmation of receipt arrived instantly.

Twelve hours later a response arrived, apologizing for excess crawling, pointing me towards their basic crawl FAQ and suggesting "delay techniques".

I immediately responded that my initial inquiry had been misunderstood and that "delay techniques" were not the reason for my inquiry.
Rather, my attention was focused on the appearance of the new IP range and Yahoo's persistence in spidering the same two pages, four times daily, for four consecutive days.
In addition, I provided the referrer links from my visitor logs which drew Yahoo's attention to the aforementioned two pages, along with a supplemental explanation that a similar result would take place in a few days because of recent focused referrals from similar searches.
I quoted the Yahoo reference number provided.

Twelve hours later, I received a 2nd response (likely from their employee on Mars) suggesting that I had failed to provide enough information (also addressing me as the Yahoo employee from the 1st response), providing a link to the Yahoo main site page, and indicating that my inquiry was a "Search or Directory" issue.

Furthermore, within minutes of the 2nd reply,
Yahoo began crawling many pages on my sites from a single Class C within the non-traditional (new) Class A.

Anybody else seeing Yahoo crawls from 67.195.zz.zzz
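One way to check whether a visitor from an unfamiliar range really belongs to Yahoo is a reverse DNS lookup followed by a forward lookup. The sketch below assumes genuine Slurp hosts resolve under `.crawl.yahoo.net`, as the host name in the first log excerpt of this thread does. Whether the new 67.195 range follows that pattern is not confirmed here. The function name and the injectable resolver parameters are illustrative.

```python
import socket

def is_yahoo_crawler(ip, suffix=".crawl.yahoo.net",
                     reverse=socket.gethostbyaddr,
                     forward=socket.gethostbyname_ex):
    """Reverse-resolve `ip`, check the host name suffix (an assumption),
    then confirm the host name resolves back to the same address."""
    try:
        host = reverse(ip)[0]
        if not host.lower().endswith(suffix):
            return False
        return ip in forward(host)[2]
    except OSError:  # covers socket.herror and socket.gaierror
        return False

# Offline demonstration with fake resolvers and a documentation-range address.
fake_reverse = lambda ip: ("rz502454.crawl.yahoo.net", [], [ip])
fake_forward = lambda host: (host, [], ["192.0.2.10"])
print(is_yahoo_crawler("192.0.2.10", reverse=fake_reverse, forward=fake_forward))
print(is_yahoo_crawler("192.0.2.11", reverse=fake_reverse, forward=fake_forward))
```

The forward lookup matters because anyone can set a reverse DNS record that claims to be a Yahoo host. The second call fails precisely because the name does not resolve back to the address that made the request.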

volatilegx

2:45 am on Feb 21, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Anybody else seeing Yahoo crawls from 67.195.zz.zzz

Yes, I can confirm that.

wilderness

3:54 am on Feb 21, 2008 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Many thanks Dan.

Now have the Yahoo bot hitting from a 3rd Class C of the same Class A.

Don