
Home / Forums Index / Search Engines / Search Engine Spider and User Agent Identification
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

This 45-message thread spans 2 pages; this is page 2.
Naughty Yahoo User Agents
Please post them here
GaryK
msg:400179
11:12 pm on Jun 7, 2006 (gmt 0)

I want to appeal to all of you who have reported problems with Yahoo! user agents that don't respect robots.txt to post those user agents here.

Through a side project of mine I have a contact at Yahoo! Engineering whom I contacted yesterday. He forwarded my e-mail to someone in search ops. That person requested I send him a list of user agents that aren't respecting robots.txt.

To me this is a unique opportunity to see if Yahoo! is serious about addressing this increasingly annoying issue. And thanks to Dan I have permission to deviate from our usual format to compile this list.

Thanks in advance for your help.

 

Pfui
msg:400209
6:01 pm on Jun 13, 2006 (gmt 0)

Gary, not to fret. You're trying to help Yahoo, and us, solve problems. That's a breath of fresh air because it's always easier to sit around and complain. And if it turns out Yahoo's folks dismiss detailed, debugging-oriented data from Web professionals? C'est la vie.

But hey, kick back and give things time. I know you're eager but they've got channels upon channels. (You were the first respondent in this, your own thread, because you didn't think we'd reply or that we weren't replying quickly enough. Heck, with mod-approval time and work skeds and such, I hadn't even seen your initial post until after you'd replied to it!)

Regardless of outcome, thank you for stepping up to the plate. Now get back to work:)

GaryK
msg:400210
9:23 pm on Jun 13, 2006 (gmt 0)

Thanks for your support and understanding. One of these days I promise to grow up. :)

GaryK
msg:400211
4:12 am on Jun 14, 2006 (gmt 0)

Sorry to double post. I just wanted to let you all know I heard from Warren and he told me he got the last batch of messages I sent him and he's working on the robots.txt user agent problem with Slurp China.

Bill, if you see this I've been trying to get in touch with you but your mailbox here always says it's full.

incrediBILL
msg:400212
4:45 am on Jun 14, 2006 (gmt 0)

Bill, if you see this I've been trying to get in touch with you but your mailbox here always says it's full.

LOL - sorry, I dumped a bunch of stickies the other night, will try killing more, let me know ;)

Back to Yahoo...

Umbra
msg:400213
12:48 pm on Jun 16, 2006 (gmt 0)

(Hope this belongs in this thread)

I can't figure out why we've been seeing this in our logs:

68.142.249.51 "GET /mod_ssl:error:HTTP-request HTTP/1.0" 404 316 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]

Also from 72.30.111.87 and 72.30.129.59

Pfui
msg:400214
11:48 pm on Jun 29, 2006 (gmt 0)

Anyone else seeing this Slurpy sloppiness?

access_log (re the last entry, below)

wj500040.inktomisearch.com - - [29/Jun/2006:12:48:11 -0700]
"GET /SlurpConfirm404/letters/magasin/BasicTabbedPaneUI.TabSelectionHandler.htm HTTP/1.0" 404 2336 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]

error_log

[Thu Jun 29 12:41:08 2006] [error] [client 72.30.215.21] File does not exist:
/SlurpConfirm404/linkto.htm

[Thu Jun 29 12:41:38 2006] [error] [client 72.30.215.84] File does not exist:
/SlurpConfirm404/Sampler/ppv/Heartach.htm

[Thu Jun 29 12:42:44 2006] [error] [client 72.30.215.103] File does not exist:
/SlurpConfirm404.htm

[Thu Jun 29 12:43:14 2006] [error] [client 72.30.215.82] File does not exist:
/SlurpConfirm404/graph/mlm.htm

[Thu Jun 29 12:43:44 2006] [error] [client 72.30.215.103] File does not exist:
/SlurpConfirm404/exempt/PersonInfo.htm

[Thu Jun 29 12:44:14 2006] [error] [client 72.30.215.10] File does not exist:
/SlurpConfirm404/dotdon/southparkmain/holiday.htm

[Thu Jun 29 12:44:47 2006] [error] [client 72.30.215.88] File does not exist:
/SlurpConfirm404/linux/marc_d.htm

[Thu Jun 29 12:45:41 2006] [error] [client 72.30.215.18] File does not exist:
/SlurpConfirm404/mahfouad.htm

[Thu Jun 29 12:47:42 2006] [error] [client 72.30.215.80] File does not exist:
/SlurpConfirm404/livstand.htm

[Thu Jun 29 12:48:11 2006] [error] [client 72.30.215.15] File does not exist:
/SlurpConfirm404/letters/magasin/BasicTabbedPaneUI.TabSelectionHandler.htm

I thought it was a new set of exploits until I verified one of the IPs as Inktomi's:

IP address: 72.30.215.15
Reverse DNS: wj500040.inktomisearch.com
Reverse DNS authenticity: [Verified]

I can see doing one 404 test (well, not really, but I know some SEs do a one-file check). But 10? And from 10 IPs in under 10 minutes? Gimme a break. Besides, inktomi already asks for robots.txt about 50 times a day. So wow, why the sudden 404 assault?
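(The IP check Pfui did by hand above can be automated with forward-confirmed reverse DNS. A minimal Python sketch; the helper name and the suffix argument are illustrative, not any standard API:)

```python
import socket

def verified_crawler(ip, expected_suffix):
    # Forward-confirmed reverse DNS: look up the PTR name for the IP,
    # check it ends with the crawler's domain, then resolve that name
    # forward and require the original IP among the returned addresses.
    # A spoofed PTR record fails the final step.
    try:
        host = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False
    if not host.lower().endswith(expected_suffix):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(host)[2]
    except OSError:
        return False
    return ip in forward_ips

# e.g. verified_crawler("72.30.215.15", ".inktomisearch.com")
```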

Pfui
msg:400215
4:38 am on Jun 30, 2006 (gmt 0)

Oh, man. At it again as I type --

[Thu Jun 29 21:19:35 2006] [error] [client 72.30.215.105] File does not exist:
/SlurpConfirm404/veronika.htm

[Thu Jun 29 21:20:47 2006] [error] [client 72.30.215.85] File does not exist:
/SlurpConfirm404/mjavary/adg.htm

[Thu Jun 29 21:21:17 2006] [error] [client 72.30.215.92] File does not exist:
/SlurpConfirm404/JenniferLopez.htm

[Thu Jun 29 21:23:30 2006] [error] [client 72.30.215.10] File does not exist:
/SlurpConfirm404/SkiNLP/MeridieShireTrollfen/infmslist.htm

[Thu Jun 29 21:24:00 2006] [error] [client 72.30.215.101] File does not exist:
/SlurpConfirm404/Constitution/ReviewQ.htm

[Thu Jun 29 21:24:30 2006] [error] [client 72.30.215.17] File does not exist:
/SlurpConfirm404/solution/somewhere/beukema.htm

[Thu Jun 29 21:25:00 2006] [error] [client 72.30.215.19] File does not exist:
/SlurpConfirm404/montages/tree.draw.Tree.htm

[Thu Jun 29 21:26:53 2006] [error] [client 72.30.215.94] File does not exist:
/SlurpConfirm404.htm

[Thu Jun 29 21:28:05 2006] [error] [client 72.30.215.108] File does not exist:
/SlurpConfirm404/ibento.htm

No one else is seeing this?

GaryK
msg:400216
7:05 am on Jun 30, 2006 (gmt 0)

It's on one of my sites right now:

/SlurpConfirm404/Noid2K/TclCmd/komaba.htm
/SlurpConfirm404.htm
/SlurpConfirm404/stage4_options.htm
/SlurpConfirm404/table19f/john.humphries.htm

...and the list goes on and on. None of these files has ever existed on any of my websites.

72.30.215.9
72.30.215.12
72.30.215.84
72.30.215.85
72.30.215.105
72.30.215.106

...and the list goes on and on. They all belong to Inktomi.

I'll take a chance and forward this to Warren at Inktomi when I wake up.

thetrasher
msg:400217
1:46 pm on Jun 30, 2006 (gmt 0)

Slurp just checks for a 404 response.

Official FAQ may help:
[help.yahoo.com ]
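(From the webmaster's side, the point of those probes is whether a made-up path gets a real 404 rather than a "soft" 200. A quick way to test your own server; this is a hypothetical helper, not Yahoo's actual probe:)

```python
import urllib.error
import urllib.request
import uuid

def returns_real_404(host):
    # Request a path that cannot exist; a well-behaved server answers 404.
    # Getting a 2xx/3xx here means the server serves "soft 404" pages,
    # which is what these Slurp probes appear designed to detect.
    url = f"http://{host}/SlurpConfirm404/{uuid.uuid4().hex}.htm"
    try:
        urllib.request.urlopen(url, timeout=10)
    except urllib.error.HTTPError as err:
        return err.code == 404
    return False  # success response for a bogus path: soft 404
```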

Pfui
msg:400218
2:39 pm on Jun 30, 2006 (gmt 0)

Thanks for the link, thetrasher. People have talked about deliberate 404s but I didn't know Slurp might request up to 10 URLs at once. Usually people ask about one or maybe two oddities.

Apparently the testing is not as "rare" as stated by that page -- unless yesterday was my lucky day. Shoot. Now I find out! :)

jdMorgan
msg:3004857
12:19 am on Jul 13, 2006 (gmt 0)

After a long absence, "Yahoo! Slurp China;" has returned to my sites, and now seems to heed robots.txt in this format:

User-agent: Slurp China
Disallow: /

User-agent: Slurp
Crawl-delay: 3
Disallow: /cgi-bin
Disallow: /widget-scripts
Disallow: /styles-nn4.css
Disallow: /styles.css


I can't vouch for whether it will obey specific directory or file Disallows, or whether it will obey "User-agent: *" or any other variants.

However, it does seem to recognize that it should go away when it sees the code above, rather than accepting the "User-agent: Slurp" record and subsequently hitting my user-agent blocking code in .htaccess.
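(Jim's point about which record the bot picks can be tried offline. Python's stdlib robots.txt parser is only a stand-in for Slurp's, but it uses the same kind of substring user-agent matching, so it shows why listing the "Slurp China" record first shuts that bot out while plain Slurp falls through to the second record. Rules trimmed to the essentials:)

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: Slurp China
Disallow: /

User-agent: Slurp
Crawl-delay: 3
Disallow: /cgi-bin
"""

rp = RobotFileParser()
rp.parse(RULES.splitlines())

# "Yahoo! Slurp China" matches the more specific first record and is
# denied everything; "Yahoo! Slurp" matches the second record instead.
print(rp.can_fetch("Yahoo! Slurp China", "/index.html"))  # False
print(rp.can_fetch("Yahoo! Slurp", "/index.html"))        # True
print(rp.can_fetch("Yahoo! Slurp", "/cgi-bin/form.pl"))   # False
print(rp.crawl_delay("Yahoo! Slurp"))                     # 3
```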

Now if I can just get "Yahoo! Slurp;" to quit listing my .css files in SERPs... Grumble, grumble... I've never seen this done by any other search engine before, but I had to add the Disallows for my .css files so that they wouldn't show up when search terms coincided with the terms I used in my .css file comments...

Jim

[edited by: jdMorgan at 12:20 am (utc) on July 13, 2006]

incrediBILL
msg:3005811
4:57 pm on Jul 13, 2006 (gmt 0)

Don't think I've been crawled by this strain of Slurp before, or if I was it was a long time ago, but it's baaaaack:

74.6.131.201 "Mozilla/5.0 (compatible; Yahoo! DE Slurp; [help.yahoo.com...]

Why in the heck can't Yahoo just crawl pages from one place and let everyone share the pages?

I already block Yahoo China, don't make me block more...

jdMorgan
msg:3005859
5:23 pm on Jul 13, 2006 (gmt 0)

Slurp DE is their Yahoo Directory engine, according to GaryK's earlier post (# 400194 above). I certainly don't think I'd want to block it, since I've got several 'grandfathered' free listings in their directory, and you have to pay to get in (and pay again annually to stay in) now... Blocking Slurp DE could cost me thousands!

Jim

Yahoo_Mike
msg:3007385
3:02 pm on Jul 14, 2006 (gmt 0)

Thanks for all your comments and suggestions on this thread.

I have posted a response from Yahoo! Search on a new thread started on this forum.

Please check the information in the thread entitled Yahoo! Crawlers - A response from Yahoo! Search at
[webmasterworld.com...]

Thanks.

GaryK
msg:3007857
5:04 pm on Jul 14, 2006 (gmt 0)

Thanks for your reply, Mike. Thanks also to Warren and of course Mason. Without Mason our concerns never would have made it this far, because he was initially my only contact at Yahoo!

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved