
Search Engine Spider and User Agent Identification Forum

Naughty Yahoo User Agents
Please post them here
GaryK

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 3276 posted 11:12 pm on Jun 7, 2006 (gmt 0)

I want to appeal to all of you who have reported problems with Yahoo! user agents that don't respect robots.txt to post those user agents here.

Through a side project of mine I have a contact at Yahoo! Engineering, whom I contacted yesterday. He forwarded my e-mail to someone in search ops, and that person asked me to send him a list of user agents that aren't respecting robots.txt.

To me this is a unique opportunity to see if Yahoo! is serious about addressing this increasingly annoying issue. And thanks to Dan I have permission to deviate from our usual format to compile this list.

Thanks in advance for your help.

 

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3276 posted 6:01 pm on Jun 13, 2006 (gmt 0)

Gary, not to fret. You're trying to help Yahoo, and us, solve problems. That's a breath of fresh air because it's always easier to sit around and complain. And if it turns out Yahoo's folks dismiss detailed, debugging-oriented data from Web professionals? C'est la vie.

But hey, kick back and give things time. I know you're eager, but they've got channels upon channels. (You were the first respondent in this, your own thread, because you thought we wouldn't reply, or weren't replying quickly enough. Heck, between mod-approval time and work schedules and such, I hadn't even seen your initial post until after you'd replied to it!)

Regardless of outcome, thank you for stepping up to the plate. Now get back to work:)

GaryK

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 3276 posted 9:23 pm on Jun 13, 2006 (gmt 0)

Thanks for your support and understanding. One of these days I promise to grow up. :)

GaryK

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 3276 posted 4:12 am on Jun 14, 2006 (gmt 0)

Sorry to double post. I just wanted to let you all know that I heard from Warren: he got the last batch of messages I sent him, and he's working on the robots.txt user-agent problem with Slurp China.

Bill, if you see this I've been trying to get in touch with you but your mailbox here always says it's full.

incrediBILL

WebmasterWorld Administrator, WebmasterWorld Top Contributor of All Time, 5+ Year Member, Top Contributors of the Month



 
Msg#: 3276 posted 4:45 am on Jun 14, 2006 (gmt 0)

Bill, if you see this I've been trying to get in touch with you but your mailbox here always says it's full.

LOL - sorry, I dumped a bunch of stickies the other night; will try killing more, let me know ;)

Back to Yahoo...

Umbra

10+ Year Member



 
Msg#: 3276 posted 12:48 pm on Jun 16, 2006 (gmt 0)

(Hope this belongs in this thread)

I can't figure out why we've been seeing this in our logs:

68.142.249.51 "GET /mod_ssl:error:HTTP-request HTTP/1.0" 404 316 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]

Also from 72.30.111.87 and 72.30.129.59

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3276 posted 11:48 pm on Jun 29, 2006 (gmt 0)

Anyone else seeing this Slurpy sloppiness?

access_log (re the last entry, below)

wj500040.inktomisearch.com - - [29/Jun/2006:12:48:11 -0700]
"GET /SlurpConfirm404/letters/magasin/BasicTabbedPaneUI.TabSelectionHandler.htm HTTP/1.0" 404 2336 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]

error_log

[Thu Jun 29 12:41:08 2006] [error] [client 72.30.215.21] File does not exist:
/SlurpConfirm404/linkto.htm

[Thu Jun 29 12:41:38 2006] [error] [client 72.30.215.84] File does not exist:
/SlurpConfirm404/Sampler/ppv/Heartach.htm

[Thu Jun 29 12:42:44 2006] [error] [client 72.30.215.103] File does not exist:
/SlurpConfirm404.htm

[Thu Jun 29 12:43:14 2006] [error] [client 72.30.215.82] File does not exist:
/SlurpConfirm404/graph/mlm.htm

[Thu Jun 29 12:43:44 2006] [error] [client 72.30.215.103] File does not exist:
/SlurpConfirm404/exempt/PersonInfo.htm

[Thu Jun 29 12:44:14 2006] [error] [client 72.30.215.10] File does not exist:
/SlurpConfirm404/dotdon/southparkmain/holiday.htm

[Thu Jun 29 12:44:47 2006] [error] [client 72.30.215.88] File does not exist:
/SlurpConfirm404/linux/marc_d.htm

[Thu Jun 29 12:45:41 2006] [error] [client 72.30.215.18] File does not exist:
/SlurpConfirm404/mahfouad.htm

[Thu Jun 29 12:47:42 2006] [error] [client 72.30.215.80] File does not exist:
/SlurpConfirm404/livstand.htm

[Thu Jun 29 12:48:11 2006] [error] [client 72.30.215.15] File does not exist:
/SlurpConfirm404/letters/magasin/BasicTabbedPaneUI.TabSelectionHandler.htm

I thought it was a new set of exploits until I verified one of the IPs as Inktomi's:

IP address: 72.30.215.15
Reverse DNS: wj500040.inktomisearch.com
Reverse DNS authenticity: [Verified]

I can see doing one 404 test (well, not really, but I know some SEs do a one-file check). But 10? And from 10 IPs in under 10 minutes? Gimme a break. Besides, Inktomi already asks for robots.txt about 50 times a day. So wow, why the sudden 404 assault?
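As an aside, the verification step above (confirming that a claimed crawler IP really belongs to inktomisearch.com) can be automated with a forward-confirmed reverse DNS check. Here's a minimal sketch using Python's standard socket module; the sample IP and domain are taken from the log entries above:

```python
import socket

def verify_crawler_ip(ip, expected_suffix):
    """Forward-confirmed reverse DNS: look up the PTR name for the IP,
    check that it ends with the expected crawler domain, then resolve
    that name forward and confirm the original IP is among its addresses."""
    try:
        host, _aliases, _addrs = socket.gethostbyaddr(ip)  # reverse (PTR) lookup
    except OSError:
        return False  # no PTR record at all
    if not host.rstrip(".").endswith(expected_suffix):
        return False  # PTR points somewhere other than the crawler's domain
    try:
        _name, _aliases, addrs = socket.gethostbyname_ex(host)  # forward lookup
    except OSError:
        return False
    return ip in addrs  # forward resolution must confirm the original IP

# e.g. verify_crawler_ip("72.30.215.15", ".inktomisearch.com")
```

The forward step matters because anyone who controls reverse DNS for their own IP block can publish a fake `*.inktomisearch.com` PTR record; only the crawler's operator controls the forward zone.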

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3276 posted 4:38 am on Jun 30, 2006 (gmt 0)

Oh, man. At it again as I type --

[Thu Jun 29 21:19:35 2006] [error] [client 72.30.215.105] File does not exist:
/SlurpConfirm404/veronika.htm

[Thu Jun 29 21:20:47 2006] [error] [client 72.30.215.85] File does not exist:
/SlurpConfirm404/mjavary/adg.htm

[Thu Jun 29 21:21:17 2006] [error] [client 72.30.215.92] File does not exist:
/SlurpConfirm404/JenniferLopez.htm

[Thu Jun 29 21:23:30 2006] [error] [client 72.30.215.10] File does not exist:
/SlurpConfirm404/SkiNLP/MeridieShireTrollfen/infmslist.htm

[Thu Jun 29 21:24:00 2006] [error] [client 72.30.215.101] File does not exist:
/SlurpConfirm404/Constitution/ReviewQ.htm

[Thu Jun 29 21:24:30 2006] [error] [client 72.30.215.17] File does not exist:
/SlurpConfirm404/solution/somewhere/beukema.htm

[Thu Jun 29 21:25:00 2006] [error] [client 72.30.215.19] File does not exist:
/SlurpConfirm404/montages/tree.draw.Tree.htm

[Thu Jun 29 21:26:53 2006] [error] [client 72.30.215.94] File does not exist:
/SlurpConfirm404.htm

[Thu Jun 29 21:28:05 2006] [error] [client 72.30.215.108] File does not exist:
/SlurpConfirm404/ibento.htm

No one else is seeing this?

GaryK

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 3276 posted 7:05 am on Jun 30, 2006 (gmt 0)

It's on one of my sites right now:

/SlurpConfirm404/Noid2K/TclCmd/komaba.htm
/SlurpConfirm404.htm
/SlurpConfirm404/stage4_options.htm
/SlurpConfirm404/table19f/john.humphries.htm

...and the list goes on and on. None of these files has ever existed on any of my websites.

72.30.215.9
72.30.215.12
72.30.215.84
72.30.215.85
72.30.215.105
72.30.215.106

...and the list goes on and on. They all belong to Inktomi.

I'll take a chance and forward this to Warren at Inktomi when I wake up.

thetrasher

5+ Year Member



 
Msg#: 3276 posted 1:46 pm on Jun 30, 2006 (gmt 0)

Slurp just checks for a 404 response.

The official FAQ may help:
[help.yahoo.com ]
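In other words, the made-up /SlurpConfirm404/... URLs are soft-404 probes: the crawler fetches a path that should not exist and checks that the server really answers 404, rather than serving a 200 "not found" page that would pollute the index. You can check your own server's behavior the same way; here's a minimal sketch using Python's urllib (the hostname is a placeholder for your own):

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError

def status_for(url):
    """Return the HTTP status code for a URL. urllib raises HTTPError
    for 4xx/5xx responses, so catch it and report the code."""
    req = Request(url, headers={"User-Agent": "soft404-check"})
    try:
        return urlopen(req, timeout=10).status
    except HTTPError as e:
        return e.code

# A nonsense path should yield a real 404, not a 200 "error page", e.g.:
# status_for("http://example.com/SlurpConfirm404/no-such-page.htm")
```

If that call comes back 200, your server is serving soft 404s, and a crawler has no reliable way to tell real pages from error pages.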

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3276 posted 2:39 pm on Jun 30, 2006 (gmt 0)

Thanks for the link, thetrasher. People have talked about deliberate 404s but I didn't know Slurp might request up to 10 URLs at once. Usually people ask about one or maybe two oddities.

Apparently the testing is not as "rare" as stated by that page -- unless yesterday was my lucky day. Shoot. Now I find out! :)

jdMorgan

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 10+ Year Member



 
Msg#: 3276 posted 12:19 am on Jul 13, 2006 (gmt 0)

After a long absence, "Yahoo! Slurp China;" has returned to my sites, and now seems to heed robots.txt in this format:

User-agent: Slurp China
Disallow: /

User-agent: Slurp
Crawl-delay: 3
Disallow: /cgi-bin
Disallow: /widget-scripts
Disallow: /styles-nn4.css
Disallow: /styles.css


I can't vouch for whether it will obey specific directory or file Disallows, or whether it will obey
"User-agent: *"
or any other variants.

However, it does seem to recognize that it should go away when it sees the code above, rather than accepting the
User-agent: Slurp
record and subsequently hitting my user-agent blocking code in .htaccess.

Now if I can just get "Yahoo! Slurp;" to quit listing my .css files in SERPs... Grumble, grumble... I've never seen this done by any other search engine before, but I had to add the Disallows for my .css files so that they wouldn't show up when search terms coincided with the terms I used in my .css file comments...
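The user-agent blocking code in .htaccess mentioned above typically takes a form like the following. This is a generic mod_rewrite sketch, not the actual rules from this post; adjust the pattern to whichever bots you block:

```apache
# Return 403 Forbidden to any request whose User-Agent
# contains "Slurp China" (case-insensitive match).
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} "Slurp China" [NC]
RewriteRule .* - [F,L]
```

The point of the robots.txt record above is that a well-behaved bot never reaches these rules: it reads the Disallow and leaves, instead of being bounced request-by-request with 403s.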

Jim

[edited by: jdMorgan at 12:20 am (utc) on July 13, 2006]

incrediBILL

WebmasterWorld Administrator, WebmasterWorld Top Contributor of All Time, 5+ Year Member, Top Contributors of the Month



 
Msg#: 3276 posted 4:57 pm on Jul 13, 2006 (gmt 0)

Don't think I've been crawled by this strain of Slurp before, or if I was it was a long time ago but it's baaaaack:

74.6.131.201 "Mozilla/5.0 (compatible; Yahoo! DE Slurp; [help.yahoo.com...]

Why in the heck can't Yahoo just crawl pages from one place and let everyone share the pages?

I already block Yahoo China, don't make me block more...

jdMorgan

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 10+ Year Member



 
Msg#: 3276 posted 5:23 pm on Jul 13, 2006 (gmt 0)

Slurp DE is their Yahoo Directory engine, according to GaryK's earlier post (# 400194 above). I certainly don't think I'd want to block it, since I've got several 'grandfathered' free listings in their directory, and nowadays you have to pay to get in (and pay again annually to stay in)... Blocking Slurp DE could cost me thousands!

Jim

Yahoo_Mike

10+ Year Member



 
Msg#: 3276 posted 3:02 pm on Jul 14, 2006 (gmt 0)

Thanks for all your comments and suggestions on this thread.

I have posted a response from Yahoo! Search on a new thread started on this forum.

Please check the information in the thread entitled Yahoo! Crawlers - A response from Yahoo! Search at
[webmasterworld.com...]

Thanks.

GaryK

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 3276 posted 5:04 pm on Jul 14, 2006 (gmt 0)

Thanks for your reply, Mike. Thanks also to Warren and, of course, Mason. Without Mason our concerns never would have made it this far; he was initially my only contact at Yahoo!

