But today I find this in the logs of one site (local file details obfuscated):
rz502454.crawl.yahoo.net - - [28/Dec/2007:11:22:49 +1000] "GET /productpage.htm HTTP/1.1" 200 2372 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; [help.yahoo.com...]
rz502454.crawl.yahoo.net - - [28/Dec/2007:11:22:49 +1000] "GET /style.css HTTP/1.1" 200 2724 "http://www.example.com/productpage.htm" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; [help.yahoo.com...]
rz502454.crawl.yahoo.net - - [28/Dec/2007:11:22:50 +1000] "GET /images/product-pic.jpg HTTP/1.1" 200 6517 "http://www.example.com/productpage.htm" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; [help.yahoo.com...]
rz502454.crawl.yahoo.net - - [28/Dec/2007:11:22:51 +1000] "GET /images/image1.jpg HTTP/1.1" 200 12077 "http://www.example.com/productpage.htm" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; [help.yahoo.com...]
rz502454.crawl.yahoo.net - - [28/Dec/2007:11:22:51 +1000] "GET /images/image2.gif HTTP/1.1" 200 45 "http://www.example.com/style.css" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; [help.yahoo.com...]
rz502454.crawl.yahoo.net - - [28/Dec/2007:11:22:51 +1000] "GET /images/image3.jpg HTTP/1.1" 200 6942 "http://www.example.com/style.css" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; [help.yahoo.com...]
I don't know what they think they are up to, but if they try that again they will be banned from all our sites. The traffic we get from Yahoo is negligible anyway.
Has anyone else seen this behaviour?
What do you mean? You can't require them to ask for files via robots.txt.
Sorry, I see now that my sentence is somewhat ambiguous.
What I meant to say is that Yahoo Slurp normally obeys our robots.txt, which disallows all images and style sheets.
On the one occasion shown in the logs posted above, Slurp requested all the supporting files, which it should not have done if it were respecting robots.txt, as Yahoo claims it does.
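For anyone who wants to check their own logs for the same thing, here is a minimal Python sketch. It assumes a robots.txt that disallows `/images/` and `/style.css` for every agent (the real file was not posted, so those rules are a guess) and uses the standard library's `urllib.robotparser`, which does plain prefix matching and does not understand wildcard patterns. The function name `disallowed_requests` and the agent token `Slurp` are made up for illustration.

```python
import re
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the poster's actual file is not shown in the thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /images/
Disallow: /style.css
"""

# Pulls the requested path out of an Apache-style access-log line.
REQUEST_RE = re.compile(r'"GET (?P<path>\S+) HTTP/')


def disallowed_requests(log_lines, robots_txt=ROBOTS_TXT, agent="Slurp"):
    """Return the requested paths that robots.txt forbids for `agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    flagged = []
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match is None:
            continue  # not a GET request line, skip it
        path = match.group("path")
        if not parser.can_fetch(agent, path):
            flagged.append(path)
    return flagged
```

Under those assumed rules, feeding it the six log lines posted above would flag every request except the one for `/productpage.htm`.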
During 2006 the Google image bot began spidering all my images, which are contained in a folder excluded by robots.txt.
After a day or two I added a denial (which I still have intact) and then I contacted Google. They apologized and the spidering ceased immediately. At least at that time.
A few months later it began again. The second time, I didn't even contact them.
Don
74.6.22.120 - - [28/Dec/2007:18:12:00 -0500] "GET /robots.txt HTTP/1.0" 200 3192 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]
74.6.22.120 - - [28/Dec/2007:18:12:00 -0500] "GET /widget.html HTTP/1.0" 403 666 "-" "Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.4) Gecko/20071214 BonEcho/2.0.0.4"
74.6.22.120 - - [28/Dec/2007:18:12:03 -0500] "GET /histyle.css HTTP/1.0" 403 666 "http://www.example.com/widget.html" "Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.4) Gecko/20071214 BonEcho/2.0.0.4"
If you want to know if I'm cloaking, I am. But it's to properly support mobile devices and keep some pages out of your index by forbidding silliness like the above. And I say so plainly with the "Vary: User-agent" response header on every page. :)
I'm already very annoyed with MSN's failure to recognize robots.txt records addressed to each of their various and sundry robots, but it looks like Yahoo! is about to join the "abusive and annoying robots" club... :(
Jim
See this thread for more info:
[webmasterworld.com...]
Trying to restrict individual bots using robots.txt is not cloaking.
OK, I re-read the post by MSNdude. They are too indexing pages marked with a robots noindex, nofollow! They are also grabbing links off pages and indexing those LINKS instead of the page itself.
Is all that confused, rude or accidental?
This is MSN and Yahoo both, tons of them
[webmasterworld.com...]
[edited by: Marcia at 3:51 am (utc) on Dec. 29, 2007]
[webmasterworld.com...]
Robots.txt is just the polite way of telling the bot what to do.
htaccess is how you enforce it if they decide to color outside the lines.
No need to completely ban, sheesh.
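To illustrate the "polite request versus enforcement" point for sites served through Python rather than Apache, here is a rough WSGI sketch. It is not .htaccess and not anyone's actual setup in this thread, just the same idea: a known crawler asking for a disallowed path gets a 403, and everything else passes through. The path prefixes, the crawler tokens and the names `should_block` and `enforce_robots` are all assumptions.

```python
# Hypothetical path prefixes, mirroring a robots.txt that disallows
# images and style sheets.
BLOCKED_PREFIXES = ("/images/", "/style.css")

# Hypothetical lowercase substrings used to spot crawler user agents.
CRAWLER_TOKENS = ("slurp", "msnbot", "googlebot")


def should_block(path, user_agent):
    """Return True when a known crawler asks for a disallowed path."""
    agent = user_agent.lower()
    is_crawler = any(token in agent for token in CRAWLER_TOKENS)
    return is_crawler and path.startswith(BLOCKED_PREFIXES)


def enforce_robots(app):
    """Wrap a WSGI app so disallowed crawler requests get a 403."""
    def middleware(environ, start_response):
        path = environ.get("PATH_INFO", "")
        agent = environ.get("HTTP_USER_AGENT", "")
        if should_block(path, agent):
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Ordinary visitors and allowed crawler requests are untouched.
        return app(environ, start_response)
    return middleware
```

This blocks only the disallowed files for the named crawlers, so the bot can still fetch the pages themselves and there is no outright ban. It does nothing about a crawler that sends a browser-style user agent.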
BTW, I suspect you'll see more bizarre behavior as everyone starts to add thumbnails and also attempts to stop SE cloaking so brace for impact.
even though that thread was regarding Java. (off topic).
It's technically off the subject line (Yahoo and robots.txt) here as well, however more focused.
Early Friday I contacted Yahoo regarding the appearance of a Yahoo bot from a non-traditional IP range.
An automated reply of receipt arrived instantly.
Twelve hours later a response arrived apologizing for excess crawling, pointing me towards their basic crawl FAQ and suggesting "delay techniques".
I immediately responded that my initial inquiry had been misunderstood and that "delay techniques" were not the reason for my inquiry.
Rather, my attention was focused on the appearance of the new IP range and Yahoo's persistence to spider the same two pages, four times daily, for four consecutive days.
In addition, I provided the referrer links from my visitor logs that had drawn Yahoo's attention to the aforementioned two pages, along with a supplemental explanation that a similar result would occur in a few days because of recent focused referrals from similar searches.
I did so using the Yahoo reference number provided.
Twelve hours later, I received a second response (likely from their employee on Mars) suggesting that I had failed to provide enough information (and addressing me as the Yahoo employee from the first response), providing a link to the Yahoo main site page, and indicating that my inquiry was a "Search or Directory" issue.
Furthermore, within minutes of the second reply, Yahoo began crawling many pages on my sites from a focused Class C within the non-traditional (new) Class A.
Anybody else seeing Yahoo crawls from 67.195.zz.zzz?