
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
Yahoo now violating robots.txt
... and heading for a complete ban from our sites
Mokita




msg:3536189
 3:17 am on Dec 28, 2007 (gmt 0)

Yahoo has been hammering all our sites over the last few days, but always only asking for the html pages, no supporting files - as it should if respecting our robots.txt.

But today I find this in the logs of one site (local file details obfuscated):

rz502454.crawl.yahoo.net - - [28/Dec/2007:11:22:49 +1000] "GET /productpage.htm HTTP/1.1" 200 2372 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; [help.yahoo.com...]
rz502454.crawl.yahoo.net - - [28/Dec/2007:11:22:49 +1000] "GET /style.css HTTP/1.1" 200 2724 "http://www.example.com/productpage.htm" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; [help.yahoo.com...]
rz502454.crawl.yahoo.net - - [28/Dec/2007:11:22:50 +1000] "GET /images/product-pic.jpg HTTP/1.1" 200 6517 "http://www.example.com/productpage.htm" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; [help.yahoo.com...]
rz502454.crawl.yahoo.net - - [28/Dec/2007:11:22:51 +1000] "GET /images/image1.jpg HTTP/1.1" 200 12077 "http://www.example.com/productpage.htm" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; [help.yahoo.com...]
rz502454.crawl.yahoo.net - - [28/Dec/2007:11:22:51 +1000] "GET /images/image2.gif HTTP/1.1" 200 45 "http://www.example.com/style.css" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; [help.yahoo.com...]
rz502454.crawl.yahoo.net - - [28/Dec/2007:11:22:51 +1000] "GET /images/image3.jpg HTTP/1.1" 200 6942 "http://www.example.com/style.css" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; [help.yahoo.com...]

I don't know what they think they are up to, but if they try that again they will be banned from all our sites. The traffic we get from Yahoo is negligible anyway.

Has anyone else seen this behaviour?

 

volatilegx




msg:3536542
 6:52 pm on Dec 28, 2007 (gmt 0)

but always only asking for the html pages, no supporting files - as it should if respecting our robots.txt.

What do you mean? You can't require them to ask for files via robots.txt.

thetrasher




msg:3536618
 9:00 pm on Dec 28, 2007 (gmt 0)

Has anyone else seen this behaviour?
Yes and no. Slurp/3.0 asks for non-HTML files (even imported CSS) with correct referrers, just like a Gecko browser. But I can't confirm a violation of robots.txt (nothing disallowed here).

Seems to be a quality check. Over and over again.

Mokita




msg:3536741
 11:37 pm on Dec 28, 2007 (gmt 0)

volatilegx wrote:
What do you mean? You can't require them to ask for files via robots.txt.

Sorry, I see now that my sentence is somewhat ambiguous.

What I meant to say, is that normally Yahoo Slurp correctly and properly obeys our robots.txt, which disallows all images and style sheets.

On the one occasion shown in the logs posted above, Slurp requested all supporting files, which it should not have if respecting robots.txt - as Yahoo claims it does.
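
For anyone wanting to audit their own logs the same way, here is a minimal Python sketch using the standard library's urllib.robotparser. The robots.txt rules below are hypothetical (the actual file was not posted in this thread); the requested paths are the ones from the log excerpt above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, an assumption for illustration only:
# it disallows images and the style sheet, as described in the post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /images/
Disallow: /style.css
"""

# Paths requested by Slurp in the log excerpt posted earlier in the thread.
REQUESTED = [
    "/productpage.htm",
    "/style.css",
    "/images/product-pic.jpg",
    "/images/image1.jpg",
    "/images/image2.gif",
    "/images/image3.jpg",
]


def find_violations(robots_txt, user_agent, paths):
    """Return the paths that the given robots.txt disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(user_agent, p)]


violations = find_violations(ROBOTS_TXT, "Slurp", REQUESTED)
print(violations)
```

Under those assumed rules, only the HTML page is allowed and every supporting file is reported as a violation.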

wilderness




msg:3536751
 11:55 pm on Dec 28, 2007 (gmt 0)

Mokita,
Even the major bots blow a gasket from time to time.

During 2006 the Google-image bot began spidering all my images, which are contained in a folder excluded in robots.txt.
After a day or two I added a denial (which I still have intact) and then I contacted Google. They apologized and the spidering ceased immediately. At least at that time.
A few months later, it began again. The second time, I didn't even contact them.

Don

jdMorgan




msg:3536817
 2:38 am on Dec 29, 2007 (gmt 0)

This one appears to be intentional, fetching robots.txt and then switching user-agents (Dear Yahoo!, Switching user-agents in mid-stream is a sure way to hit my anti-abuse traps).

74.6.22.120 - - [28/Dec/2007:18:12:00 -0500] "GET /robots.txt HTTP/1.0" 200 3192 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]
74.6.22.120 - - [28/Dec/2007:18:12:00 -0500] "GET /widget.html HTTP/1.0" 403 666 "-" "Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.4) Gecko/20071214 BonEcho/2.0.0.4"
74.6.22.120 - - [28/Dec/2007:18:12:03 -0500] "GET /histyle.css HTTP/1.0" 403 666 "http://www.example.com/widget.html" "Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.4) Gecko/20071214 BonEcho/2.0.0.4"

If you want to know if I'm cloaking, I am. But it's to properly support mobile devices and keep some pages out of your index by forbidding silliness like the above. And I say so plainly with the "Vary: User-agent" response header on every page. :)

I'm already very annoyed with MSN's failure to recognize robots.txt records addressed to each of their various and sundry robots, but it looks like Yahoo! is about to join the "abusive and annoying robots" club... :(

Jim
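
The mid-stream user-agent switch described in the post above can be flagged from log data with something like the following Python sketch. It is only an illustration, not the poster's actual anti-abuse trap: the hit tuples are modeled on the excerpt, the user-agent strings are shortened, and the second IP (from a documentation range) is a hypothetical ordinary visitor.

```python
from collections import defaultdict

# (ip, path, user_agent) tuples modeled on the log excerpt above.
# User-agent strings are shortened; 192.0.2.10 is a hypothetical visitor.
HITS = [
    ("74.6.22.120", "/robots.txt", "Yahoo! Slurp"),
    ("74.6.22.120", "/widget.html", "Gecko/20071214 BonEcho/2.0.0.4"),
    ("74.6.22.120", "/histyle.css", "Gecko/20071214 BonEcho/2.0.0.4"),
    ("192.0.2.10", "/widget.html", "Gecko/20071214 BonEcho/2.0.0.4"),
]


def agent_switchers(hits):
    """Return IPs that fetched /robots.txt and also used a different user-agent."""
    agents = defaultdict(set)         # every user-agent seen per IP
    robots_agents = defaultdict(set)  # user-agents used to fetch robots.txt
    for ip, path, user_agent in hits:
        agents[ip].add(user_agent)
        if path == "/robots.txt":
            robots_agents[ip].add(user_agent)
    return sorted(ip for ip in robots_agents if agents[ip] - robots_agents[ip])


flagged = agent_switchers(HITS)
print(flagged)
```

A real trap would also want a time window, since many visitors can share one address over a longer period.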

Marcia




msg:3536824
 2:58 am on Dec 29, 2007 (gmt 0)

>>I'm already very annoyed with MSN's failure to recognize robots.txt records addressed to each of their various and sundry robots

Isn't it supposed to be for detecting and thwarting cloaking?

Mokita




msg:3536827
 3:16 am on Dec 29, 2007 (gmt 0)

Marcia - I think Jim is referring to MSNbot failing to follow individual directions in robots.txt for its multiple bots - like msnbot-media, msnbot-news, msnbot-products etc. If you try to disallow those from crawling your site, it also has the effect of disallowing the generic msnbot and your site disappears from their index.

See this thread for more info:
[webmasterworld.com...]

Trying to restrict individual bots using robots.txt is not cloaking.
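
As a reference point, this is how one parser that follows the robots.txt convention treats separate records for separate bots, again using Python's standard urllib.robotparser. The bot names come from the post above; the file itself is hypothetical, and the sketch shows only the intended behaviour, not what MSN's crawlers actually did.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block only the media bot, leave the generic bot alone.
# The more specific record is listed first because this parser matches
# user-agent names by substring and uses the first record that applies.
ROBOTS_TXT = """\
User-agent: msnbot-media
Disallow: /

User-agent: msnbot
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

media_allowed = parser.can_fetch("msnbot-media", "/page.html")
generic_allowed = parser.can_fetch("msnbot", "/page.html")

print("msnbot-media allowed:", media_allowed)
print("msnbot allowed:", generic_allowed)
```

With this parser the media bot is shut out while the generic bot keeps crawling, which is the outcome the individual records are meant to produce.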

Marcia




msg:3536838
 3:46 am on Dec 29, 2007 (gmt 0)

I believe it was something I read that was written by Nathan Buggia on their blog, possibly at their in-house forum, too.

OK, I re-read the post by MSNdude. They are too! indexing pages with a robots noindex, nofollow. Also grabbing links off pages and indexing those LINKS instead of the page itself.

Is all that confused, rude or accidental?

This is MSN and Yahoo both, tons of them

[webmasterworld.com...]

[edited by: Marcia at 3:51 am (utc) on Dec. 29, 2007]

youfoundjake




msg:3536843
 4:20 am on Dec 29, 2007 (gmt 0)

I have 2 sites, one about 2 years old, the other a month. Same exact robots.txt file on both sites. The older site, Slurp obeys. The new site, Slurp plowed right through my site hitting a bot trap that was disallowed in robots.txt, they got banned. Since it's Yahoo, I did unban, but just this once. If they do it again, out they stay!

wilderness




msg:3537026
 2:53 pm on Dec 29, 2007 (gmt 0)

Has anybody used:

SetEnvIf User-Agent "Slurp/3.0;" keep_out
or

RewriteCond %{HTTP_USER_AGENT} "Slurp/3.0;" [NC]
RewriteRule .* - [F]

and seen any change in either their Yahoo page listings or crawls from the other Yahoo bots?
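
To see which user-agents a pattern like that would catch, here is a small Python sketch. It assumes Python's re module behaves like Apache's regex engine for a pattern this simple, with re.IGNORECASE standing in for the [NC] flag; the user-agent strings are shortened from the log excerpts in this thread.

```python
import re

# Same pattern text as in the directives above. Note the unescaped dot
# matches any single character, so it is slightly looser than it looks.
PATTERN = re.compile(r"Slurp/3.0;", re.IGNORECASE)

# User-agent strings shortened from the log excerpts in this thread.
AGENTS = {
    "slurp3": "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; ...",
    "slurp": "Mozilla/5.0 (compatible; Yahoo! Slurp; ...",
    "bonecho": "Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.4) "
               "Gecko/20071214 BonEcho/2.0.0.4",
}

# Map each label to True if the pattern would match that user-agent.
matched = {name: bool(PATTERN.search(ua)) for name, ua in AGENTS.items()}
print(matched)
```

Only the Slurp/3.0 string matches; the plain "Yahoo! Slurp" string and the browser-style string pass through untouched.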

keyplyr




msg:3537649
 1:01 am on Dec 31, 2007 (gmt 0)

[webmasterworld.com...]

wilderness




msg:3537652
 1:14 am on Dec 31, 2007 (gmt 0)

keyplyr,
I've added Slurp/3.0 to denies and will thus see how it affects my listings and other crawls.

keyplyr




msg:3537707
 3:20 am on Dec 31, 2007 (gmt 0)

keyplyr,
I've added Slurp/3.0 to denies and will thus see how it affects my listings and other crawls - wilderness

You're a better man than I Gunga Din.

incrediBILL




msg:3540076
 1:16 am on Jan 4, 2008 (gmt 0)

You don't need to deny Slurp, silly boys.

Robots.txt is just the polite way of telling the bot what to do.

htaccess is how you enforce it if they decide to color outside the lines.

No need to completely ban, sheesh.

BTW, I suspect you'll see more bizarre behavior as everyone starts to add thumbnails and also attempts to stop SE cloaking so brace for impact.
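
The "enforce it without a complete ban" idea boils down to a decision like the one sketched below in Python. This is only the logic, not an .htaccess rule set, and the extension list is a hypothetical policy standing in for whatever robots.txt already disallows on a given site.

```python
# Hypothetical policy: file types that robots.txt already asks crawlers to skip.
BLOCKED_EXTENSIONS = (".css", ".jpg", ".gif", ".png")


def status_for(user_agent, path):
    """Return 403 for Slurp requests to supporting files, otherwise 200."""
    is_slurp = "slurp" in user_agent.lower()
    is_supporting_file = path.lower().endswith(BLOCKED_EXTENSIONS)
    return 403 if is_slurp and is_supporting_file else 200


print(status_for("Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; ...", "/style.css"))
print(status_for("Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; ...", "/productpage.htm"))
```

The crawler still gets the HTML pages, so listings are not put at risk, while the disallowed files are refused.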

wilderness




msg:3540141
 4:03 am on Jan 4, 2008 (gmt 0)

You don't need to deny Slurp, silly boys.

Robots.txt is just the polite way of telling the bot what to do.

Are you suggesting that I may add

User-agent: Slurp/3.0
Disallow: /

to my robots.txt and Slurp/3.0 will comply and all the other Yahoo bots will continue?

Don

wilderness




msg:3546648
 4:40 pm on Jan 12, 2008 (gmt 0)

keyplyr,
I've added Slurp/3.0 to denies and will thus see how it affects my listings and other crawls.

Just a follow up on this.

This denial has NOT hindered the adding of some recent additions (new pages in the past ten days) of my websites to Yahoo listings.

Don

wilderness




msg:3578517
 4:47 am on Feb 19, 2008 (gmt 0)

I slipped this in another topic:
[webmasterworld.com...]

even though that thread was regarding Java. (off topic).

It's technically off the subject line (Yahoo and robots.txt) here as well, however more focused.

Early Friday I contacted Yahoo regarding the appearance of a Yahoo bot from a non-traditional IP range.
An automated reply of receipt arrived instantly.

Twelve hours later a response arrived, apologizing for excess crawling, pointing me towards their basic crawl FAQ and suggesting "delay techniques".

I immediately responded that my initial inquiry had been misunderstood and "delay techniques" were not the reason for my inquiry.
Rather, my attention was focused on the appearance of the new IP range and Yahoo's persistence in spidering the same two pages, four times daily, for four consecutive days.
In addition, I provided my visitor logs' Referrer links which drew Yahoo's attention to the aforementioned two pages, and provided a supplemental explanation that a similar result would take place in a few days over recent focused referrals of similar searches.
Utilizing the Yahoo reference number provided.

Twelve hours later, I received a 2d response (likely from their employee on Mars) suggesting that I failed to provide enough information (also addressing me as the Yahoo employee from the 1st response), providing a link to the Yahoo main site page, and indicating that my inquiry was a "Search or Directory" issue.

Furthermore, and within minutes of the 2d reply?
Yahoo began crawling many pages on my sites with a focused Class C of the non-traditional (new) Class A.

Anybody else seeing Yahoo crawls from 67.195.zz.zzz
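
For tracking which Class C blocks a crawler is coming from, a quick tally per /24 can be done with Python's ipaddress module. The addresses below are hypothetical ones from documentation ranges; the real crawler IPs were obfuscated in the post and are not reproduced here.

```python
import ipaddress
from collections import Counter

# Hypothetical addresses from documentation ranges, for illustration only.
CRAWLER_IPS = ["192.0.2.5", "192.0.2.77", "198.51.100.9", "192.0.2.200"]


def hits_per_class_c(ips):
    """Count hits per /24 network (the old "Class C" size)."""
    counts = Counter()
    for ip in ips:
        # strict=False lets a host address stand in for its enclosing network.
        network = ipaddress.ip_network(f"{ip}/24", strict=False)
        counts[str(network)] += 1
    return counts


counts = hits_per_class_c(CRAWLER_IPS)
for network, hits in counts.most_common():
    print(network, hits)
```

Fed with the IP column of a real access log, this makes a new block showing up stand out quickly.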

volatilegx




msg:3580683
 2:45 am on Feb 21, 2008 (gmt 0)

Anybody else seeing Yahoo crawls from 67.195.zz.zzz

Yes, I can confirm that.

wilderness




msg:3580746
 3:54 am on Feb 21, 2008 (gmt 0)

Many thanks Dan.

Now have the Yahoo bot hitting on a 3d Class C of the same A.

Don


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved