Forum Moderators: Ocean10000 & incrediBILL

Yahoo now violating robots.txt

... and heading for a complete ban from our sites

     
3:17 am on Dec 28, 2007 (gmt 0)

Preferred Member

10+ Year Member

joined:Sept 21, 2005
posts: 379
votes: 0


Yahoo has been hammering all our sites over the last few days, but always only asking for the html pages, no supporting files - as it should if respecting our robots.txt.

But today I find this in the logs of one site (local file details obfuscated):

rz502454.crawl.yahoo.net - - [28/Dec/2007:11:22:49 +1000] "GET /productpage.htm HTTP/1.1" 200 2372 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; [help.yahoo.com...]
rz502454.crawl.yahoo.net - - [28/Dec/2007:11:22:49 +1000] "GET /style.css HTTP/1.1" 200 2724 "http://www.example.com/productpage.htm" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; [help.yahoo.com...]
rz502454.crawl.yahoo.net - - [28/Dec/2007:11:22:50 +1000] "GET /images/product-pic.jpg HTTP/1.1" 200 6517 "http://www.example.com/productpage.htm" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; [help.yahoo.com...]
rz502454.crawl.yahoo.net - - [28/Dec/2007:11:22:51 +1000] "GET /images/image1.jpg HTTP/1.1" 200 12077 "http://www.example.com/productpage.htm" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; [help.yahoo.com...]
rz502454.crawl.yahoo.net - - [28/Dec/2007:11:22:51 +1000] "GET /images/image2.gif HTTP/1.1" 200 45 "http://www.example.com/style.css" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; [help.yahoo.com...]
rz502454.crawl.yahoo.net - - [28/Dec/2007:11:22:51 +1000] "GET /images/image3.jpg HTTP/1.1" 200 6942 "http://www.example.com/style.css" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; [help.yahoo.com...]

I don't know what they think they are up to, but if they try that again they will be banned from all our sites. The traffic we get from Yahoo is negligible anyway.

Has anyone else seen this behaviour?

6:52 pm on Dec 28, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 22, 2001
posts:2450
votes: 0


but always only asking for the html pages, no supporting files - as it should if respecting our robots.txt.

What do you mean? You can't require them to ask for files via robots.txt.

9:00 pm on Dec 28, 2007 (gmt 0)

Junior Member

10+ Year Member

joined:June 25, 2005
posts:179
votes: 1


Has anyone else seen this behaviour?
Yes and no. Slurp/3.0 asks for non-HTML files (even imported CSS) with correct referrers, just like a Gecko browser. But I can't confirm a violation of robots.txt (nothing disallowed here).

Seems to be a quality check. Over and over again.

11:37 pm on Dec 28, 2007 (gmt 0)

Preferred Member

10+ Year Member

joined:Sept 21, 2005
posts: 379
votes: 0


volatilegx wrote:
What do you mean? You can't require them to ask for files via robots.txt.

Sorry, I see now that my sentence is somewhat ambiguous.

What I meant to say is that Yahoo Slurp normally obeys our robots.txt, which disallows all images and style sheets.

On the one occasion shown in the logs posted above, Slurp requested all supporting files, which it should not have if respecting robots.txt - as Yahoo claims it does.
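To make the complaint concrete, here is a rough Python sketch that checks the paths from the log excerpt against a robots.txt. The rules in it are a stand-in, since the actual file was not posted in this thread; they simply assume that everything under /images/ and the style sheet are disallowed for all robots. The function name is made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Assumed rules for illustration only; the actual robots.txt was not posted.
ASSUMED_ROBOTS_TXT = """\
User-agent: *
Disallow: /images/
Disallow: /style.css
"""

# Paths requested by Slurp in the log excerpt from the first post.
REQUESTED_PATHS = [
    "/productpage.htm",
    "/style.css",
    "/images/product-pic.jpg",
    "/images/image1.jpg",
    "/images/image2.gif",
    "/images/image3.jpg",
]


def disallowed_requests(robots_txt, user_agent, paths):
    """Return the paths that robots_txt does not allow user_agent to fetch."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return [p for p in paths if not rules.can_fetch(user_agent, p)]


violations = disallowed_requests(ASSUMED_ROBOTS_TXT, "Slurp", REQUESTED_PATHS)
print(violations)
```

Under those assumed rules only /productpage.htm is permitted, so the other five requests would count as violations.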

11:55 pm on Dec 28, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5408
votes: 2


Mokita,
Even the major bots blow a gasket from time to time.

During 2006 the Google image bot began spidering all my images, which are contained in a folder that is disallowed in robots.txt.
After a day or two I added a denial (which I still have intact) and then I contacted Google. They apologized and the spidering ceased immediately. At least at that time.
A few months later it began again. The second time, I didn't even contact them.

Don

2:38 am on Dec 29, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


This one appears to be intentional, fetching robots.txt and then switching user-agents (Dear Yahoo!, Switching user-agents in mid-stream is a sure way to hit my anti-abuse traps).

74.6.22.120 - - [28/Dec/2007:18:12:00 -0500] "GET /robots.txt HTTP/1.0" 200 3192 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]
74.6.22.120 - - [28/Dec/2007:18:12:00 -0500] "GET /widget.html HTTP/1.0" 403 666 "-" "Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.4) Gecko/20071214 BonEcho/2.0.0.4"
74.6.22.120 - - [28/Dec/2007:18:12:03 -0500] "GET /histyle.css HTTP/1.0" 403 666 "http://www.example.com/widget.html" "Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.4) Gecko/20071214 BonEcho/2.0.0.4"

If you want to know if I'm cloaking, I am. But it's to properly support mobile devices and keep some pages out of your index by forbidding silliness like the above. And I say so plainly with the "Vary: User-agent" response header on every page. :)
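For anyone who wants to spot this kind of mid-stream switch in their own logs, a minimal Python sketch follows. It assumes the Apache "combined" log format and simply groups user-agent strings by client host. The sample lines, the 192.0.2.x addresses and the agent names are invented for the example, and so are the function names.

```python
import re
from collections import defaultdict

# Apache "combined" log format: host, identity, user, [time], "request",
# status, bytes, "referer", "user-agent".
COMBINED = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def agents_per_host(lines):
    """Map each client host to the set of user-agent strings it sent."""
    seen = defaultdict(set)
    for line in lines:
        match = COMBINED.match(line)
        if match:
            seen[match.group("host")].add(match.group("agent"))
    return seen


def agent_switchers(lines):
    """Return only the hosts that presented more than one user-agent."""
    return {h: a for h, a in agents_per_host(lines).items() if len(a) > 1}


# Made-up sample lines (documentation IP range, invented agent names).
SAMPLE = [
    '192.0.2.10 - - [28/Dec/2007:18:12:00 -0500] "GET /robots.txt HTTP/1.0" 200 3192 "-" "ExampleBot/1.0"',
    '192.0.2.10 - - [28/Dec/2007:18:12:00 -0500] "GET /widget.html HTTP/1.0" 403 666 "-" "ExampleBrowser/2.0"',
    '192.0.2.20 - - [28/Dec/2007:18:12:05 -0500] "GET /widget.html HTTP/1.0" 200 5120 "-" "ExampleBrowser/2.0"',
]

print(agent_switchers(SAMPLE))
```

Treat a hit as a hint rather than proof: proxies and shared gateways can legitimately send several user-agents from one address.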

I'm already very annoyed with MSN's failure to recognize robots.txt records addressed to each of their various and sundry robots, but it looks like Yahoo! is about to join the "abusive and annoying robots" club... :(

Jim

2:58 am on Dec 29, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member marcia is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Sept 29, 2000
posts:12095
votes: 0


>>I'm already very annoyed with MSN's failure to recognize robots.txt records addressed to each of their various and sundry robots

Isn't it supposed to be for detecting and thwarting cloaking?

3:16 am on Dec 29, 2007 (gmt 0)

Preferred Member

10+ Year Member

joined:Sept 21, 2005
posts: 379
votes: 0


Marcia - I think Jim is referring to MSNbot failing to follow individual directions in robots.txt for its multiple bots - like msnbot-media, msnbot-news, msnbot-products etc. If you try to disallow those from crawling your site, it also has the effect of disallowing the generic msnbot and your site disappears from their index.

See this thread for more info:
[webmasterworld.com...]

Trying to restrict individual bots using robots.txt is not cloaking.
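As a rough illustration of what per-robot records are supposed to do, the Python sketch below feeds a hypothetical robots.txt to the standard-library parser. It says nothing about how msnbot itself parses the file; it only shows the behaviour one would expect from a parser that matches records by robot name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that tries to exclude only the media crawler.
ROBOTS_TXT = """\
User-agent: msnbot-media
Disallow: /
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

# The record names msnbot-media, so only that robot should be excluded.
media_allowed = rules.can_fetch("msnbot-media", "/page.html")
generic_allowed = rules.can_fetch("msnbot", "/page.html")
print(media_allowed, generic_allowed)
```

Here the media crawler is refused and the generic msnbot is still allowed, which is the outcome the complaint says MSN does not deliver.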

3:46 am on Dec 29, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member marcia is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Sept 29, 2000
posts:12095
votes: 0


I believe it was something I read that was written by Nathan Buggia on their blog, possibly at their in-house forum, too.

OK, I re-read the post by MSNdude. They are too! indexing pages with a robots noindex, nofollow. Also grabbing links off pages and indexing those LINKS instead of the page itself.

Is all of that confused, rude, or accidental?

This is both MSN and Yahoo, tons of them

[webmasterworld.com...]

[edited by: Marcia at 3:51 am (utc) on Dec. 29, 2007]

4:20 am on Dec 29, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:Apr 28, 2006
posts:1043
votes: 1


I have 2 sites, one about 2 years old, the other a month old. Exact same robots.txt file on both sites. On the older site, Slurp obeys. On the new site, Slurp plowed right through and hit a bot trap that was disallowed in robots.txt, so it got banned. Since it's Yahoo, I did unban, but just this once. If they do it again, out they stay!
2:53 pm on Dec 29, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5408
votes: 2


Has anybody used:

SetEnvIf User-Agent "Slurp/3.0;" keep_out
or

RewriteCond %{HTTP_USER_AGENT} "Slurp/3.0;" [NC]
RewriteRule .* - [F]

and seen any change in either their Yahoo page listings or crawls from the other Yahoo bots?

1:01 am on Dec 31, 2007 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:5805
votes: 64

1:14 am on Dec 31, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5408
votes: 2


keyplyr,
I've added Slurp/3.0 to denies and will thus see how it affects my listings and other crawls.
3:20 am on Dec 31, 2007 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:5805
votes: 64


keyplyr,
I've added Slurp/3.0 to denies and will thus see how it affects my listings and other crawls. - wilderness

You're a better man than I am, Gunga Din.
1:16 am on Jan 4, 2008 (gmt 0)

Administrator from US 

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14622
votes: 88


You don't need to deny Slurp, silly boys.

Robots.txt is just the polite way of telling the bot what to do.

htaccess is how you enforce it if they decide to color outside the lines.

No need to completely ban, sheesh.
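The same "ask politely, then enforce" idea can be sketched outside .htaccess too. The example below is not an Apache rule; it swaps in a small Python WSGI middleware that refuses a named robot only on the paths its robots.txt rules disallow and lets everything else through. The class name, the stand-in application and the rules are all assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser


class EnforceRobots:
    """WSGI middleware: 403 for a named robot on paths its rules disallow."""

    def __init__(self, app, robots_txt, robot_token):
        self.app = app
        self.robot_token = robot_token
        self.rules = RobotFileParser()
        self.rules.parse(robots_txt.splitlines())

    def __call__(self, environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "")
        path = environ.get("PATH_INFO") or "/"
        is_robot = self.robot_token.lower() in agent.lower()
        if is_robot and not self.rules.can_fetch(self.robot_token, path):
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Everyone else, and the robot on permitted paths, passes through.
        return self.app(environ, start_response)


def site(environ, start_response):
    """Stand-in application that serves every path."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok\n"]


# Assumed rules; substitute whatever your own robots.txt disallows.
guarded = EnforceRobots(site, "User-agent: *\nDisallow: /images/\n", "Slurp")
```

With this in place the robot still gets the pages it is allowed to index, and only the off-limits requests are refused.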

BTW, I suspect you'll see more bizarre behavior as everyone starts to add thumbnails and also attempts to stop SE cloaking so brace for impact.

4:03 am on Jan 4, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5408
votes: 2


You don't need to deny Slurp, silly boys.

Robots.txt is just the polite way of telling the bot what to do.

Are you suggesting that I may add

User-agent: Slurp/3.0
Disallow: /

to my robots.txt and Slurp/3.0 will comply and all the other Yahoo bots will continue?

Don

4:40 pm on Jan 12, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5408
votes: 2


keyplyr,
I've added Slurp/3.0 to denies and will thus see how it affects my listings and other crawls.

Just a follow up on this.

This denial has NOT hindered the adding of some recent additions (new pages in the past ten days) of my websites to Yahoo listings.

Don

4:47 am on Feb 19, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5408
votes: 2


I slipped this in another topic:
[webmasterworld.com...]

even though that thread was regarding Java. (off topic).

It's technically off the subject line (Yahoo and robots.txt) here as well, however more focused.

Early Friday I contacted Yahoo regarding the appearance of a Yahoo bot from a non-traditional IP range.
An automated acknowledgement of receipt arrived instantly.

Twelve hours later a response arrived, apologizing for excess crawling, pointing me towards their basic crawl FAQ, and suggesting "delay techniques".

I immediately responded that my initial inquiry had been misunderstood and that "delay techniques" were not the reason for my inquiry.
Rather, my attention was focused on the appearance of the new IP range and Yahoo's persistence in spidering the same two pages, four times daily, for four consecutive days.
In addition, I provided the referrer links from my visitor logs which had drawn Yahoo's attention to the aforementioned two pages, along with a supplemental explanation that a similar result would take place in a few days because of recent focused referrals from similar searches.
I quoted the Yahoo reference number provided.

Twelve hours later, I received a second response (likely from their employee on Mars) suggesting that I had failed to provide enough information (and addressing me as the Yahoo employee from the first response), providing a link to the Yahoo main site page, and indicating that my inquiry was a "Search or Directory" issue.

Furthermore, within minutes of the second reply, Yahoo began crawling many pages on my sites from a single Class C of the non-traditional (new) Class A.

Anybody else seeing Yahoo crawls from 67.195.zz.zzz
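One way to check whether hits from an unfamiliar range really come from Yahoo is a reverse DNS lookup on the address followed by a forward lookup on the name it returns. A minimal Python sketch is below. The .crawl.yahoo.net suffix is taken from the hostname in the log excerpt in the first post and is only an assumption about how their crawler hosts are named; the function name is made up, and the resolver arguments exist so the check can be exercised without touching the network.

```python
import socket


def is_verified_crawler(ip, suffix=".crawl.yahoo.net",
                        reverse=socket.gethostbyaddr,
                        forward=socket.gethostbyname_ex):
    """Reverse-resolve ip, check the hostname suffix, then confirm that the
    hostname resolves forward to the same ip."""
    try:
        hostname = reverse(ip)[0]
        if not hostname.lower().endswith(suffix):
            return False
        # The forward lookup guards against forged reverse (PTR) records.
        return ip in forward(hostname)[2]
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        return False
```

Call it with the full address from your own logs. A failed lookup is treated as "not verified", which errs on the side of caution.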

2:45 am on Feb 21, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 22, 2001
posts:2450
votes: 0


Anybody else seeing Yahoo crawls from 67.195.zz.zzz

Yes, I can confirm that.

3:54 am on Feb 21, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5408
votes: 2


Many thanks Dan.

I now have the Yahoo bot hitting from a third Class C of the same A.

Don