Home / Forums Index / Search Engines / Search Engine Spider and User Agent Identification
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

Yahoo! Slurp China

 4:00 pm on Nov 15, 2005 (gmt 0)

Some heavy spidering from: Mozilla/5.0 (compatible; Yahoo! Slurp China; [misc.yahoo.com.cn...]

Comes from various IPs beginning with 202.160.178. and 202.160.179. and resolving to *.inktomisearch.com.
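For anyone who wants to check log entries against the ranges reported above, here is a minimal Python sketch. It assumes the two prefixes together make up the single block 202.160.178.0/23, and it only inspects strings you already have (an IP and a reverse-DNS hostname taken from your logs); it does no DNS lookups itself. The function names are made up for illustration.

```python
import ipaddress

# Assumption: the prefixes 202.160.178. and 202.160.179. from the post
# are treated here as one contiguous block, 202.160.178.0/23.
REPORTED_BLOCK = ipaddress.ip_network("202.160.178.0/23")

# Suffix taken from the post: hosts resolve to *.inktomisearch.com
REPORTED_DOMAIN_SUFFIX = ".inktomisearch.com"


def in_reported_block(ip: str) -> bool:
    """Return True if the IP falls inside the reported address block."""
    return ipaddress.ip_address(ip) in REPORTED_BLOCK


def hostname_matches(hostname: str) -> bool:
    """Return True if a reverse-DNS hostname is under inktomisearch.com."""
    return hostname.lower().rstrip(".").endswith(REPORTED_DOMAIN_SUFFIX)
```

Checking both the address and the resolved hostname is stricter than checking either alone, since a user agent string by itself is easy to fake.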



 5:57 pm on Nov 15, 2005 (gmt 0)



 1:38 pm on Nov 20, 2005 (gmt 0)

Mozilla/5.0 (compatible; Yahoo! Slurp China; [misc.yahoo.com.cn...]

It's been visiting my sites every 2-3 days to retrieve the same file.

I really don't want to discriminate against China, and yet if Inktomi isn't going to at least read robots.txt they really leave me no choice but to ban them.


 6:16 pm on Nov 20, 2005 (gmt 0)

My sites have nothing in particular to offer to China.

Viewing the help page linked in the UA shows, obviously, a page in a language that my browser cannot display. Where it mentions robots.txt, though, the exclusion example is set for Slurp, so following it would ban all Y! bots.

I've banned them by full user agent.


 10:23 am on Nov 29, 2005 (gmt 0)

I've banned them by full user agent.

What is the full user agent in this case?



 12:41 pm on Nov 29, 2005 (gmt 0)

Mozilla/5.0 (compatible; Yahoo! Slurp China; [misc.yahoo.com.cn...]

is what I have; the + + + in your string may have been added by your logfile.


 8:35 pm on Nov 29, 2005 (gmt 0)

Slurp treats

User-agent: slurp
User-agent: slurp china

as the same thing, so you can't use robots.txt to disallow Slurp China without also Disallowing the "U.S." slurp.

So, for now, I've had to block Slurp China with a 403 in .htaccess. :(
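The poster's actual .htaccess rules are not shown, so the following is only a Python stand-in for the same decision: answer 403 when the user agent contains the Slurp China token, and let everything else through, including the regular Slurp. The function name and constant are hypothetical.

```python
# Token taken from the user agent quoted in this thread; matching is
# done case-insensitively on a substring, which is an assumption about
# how strict you want the test to be.
BLOCKED_UA_TOKEN = "yahoo! slurp china"


def status_for(user_agent: str) -> int:
    """Return the HTTP status to send for a request with this user agent."""
    if BLOCKED_UA_TOKEN in user_agent.lower():
        return 403  # Forbidden: Slurp China only
    return 200  # everything else, including plain "Yahoo! Slurp"
```

Matching on "yahoo! slurp china" rather than just "slurp" is what keeps the regular Slurp unaffected.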

Another approach that some Webmasters can use is to serve an alternate robots.txt to slurp china: you can internally rewrite slurp china's requests to a secondary robots.txt file that Disallows it. That's not an option for me, since my host -- for some reason -- grabs robots.txt requests and diverts them to a script which then serves the site's robots.txt file before my .htaccess processing can have any effect.
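The alternate-robots.txt idea described above can be sketched in Python as a file-selection step. This is not the rewrite rule itself, just the logic behind it; the secondary file name is hypothetical.

```python
# Hypothetical name for the secondary file; pick whatever suits your site.
ALTERNATE_ROBOTS_FILE = "robots-slurp-china.txt"
DEFAULT_ROBOTS_FILE = "robots.txt"

# Contents of the secondary file. Because only Slurp China ever receives
# it, a blanket rule is enough and the "slurp" name never has to appear.
ALTERNATE_ROBOTS_BODY = "User-agent: *\nDisallow: /\n"


def robots_file_for(user_agent: str) -> str:
    """Pick which robots file to serve for a /robots.txt request."""
    if "slurp china" in user_agent.lower():
        return ALTERNATE_ROBOTS_FILE
    return DEFAULT_ROBOTS_FILE
```

The point of the design is that the distinction is made on the request's user agent header, not inside robots.txt, where the two names are reportedly treated as the same.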

It strikes me as odd that the big search engine players put so little thought into making life easier for Webmasters who target only domestic markets... not to mention all the other problems we've seen -- like redirect handling -- with the fundamental function of search engines -- the robots themselves.

slurp china, tsai chien! (goodbye)



 6:35 pm on Nov 30, 2005 (gmt 0)

It wouldn't surprise me a bit if the decision to use essentially one robot name was intentional. This way we're forced into an all or nothing dilemma. Either accept everything from Slurp or ban everything. Unless you're willing to use more sophisticated means to stop the China bot, and I'd venture a guess that most webmasters aren't able or willing to do that.


 9:26 pm on Nov 30, 2005 (gmt 0)

GaryK, guess again ;o)


 1:00 am on Dec 1, 2005 (gmt 0)

I knew I should have been more specific. ;)

To me it was a given that the folks here would know how to do that. I was referring to the rest of the webmasters in the world.


 2:50 am on Dec 1, 2005 (gmt 0)

I have only one heartburn with continuously serving up 403s to Slurp China, and it's not at all unique to that one bot.

What possible benefit is there for any bot to come back and hit a 403 multiple times a day/week/month/etc.? It's like every bot has a line of code that says never give up, never surrender.


 3:51 am on Dec 1, 2005 (gmt 0)

"Never give up, never surrender..."

Yes, that's the "Churchill" subroutine... ;)

I have a special place for those... I rewrite their requests to a subdirectory where all access is blocked except for a custom 403 page. And that custom 403 page is two bytes long... It contains only "no". So, this at least minimizes the bandwidth they waste. Of course, I'd rather block them at the router, but alas, I haven't purchased my own data center yet.
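The two-byte 403 page can be illustrated with a tiny WSGI app in Python. This is a sketch of the idea only, not the poster's setup (which uses a rewrite plus a static custom error page); the app name is made up.

```python
def tiny_forbidden_app(environ, start_response):
    """WSGI app that answers every request with 403 and a two-byte body."""
    body = b"no"  # the whole response body, as in the post
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    start_response("403 Forbidden", headers)
    return [body]
```

The response headers still cost more bytes than the body, so the saving is relative to a full-size error page rather than absolute.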



All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved