
Yahoo! Slurp China

     
4:00 pm on Nov 15, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:Apr 12, 2004
posts:45
votes: 0


Some heavy spidering from: Mozilla/5.0 (compatible; Yahoo! Slurp China; [misc.yahoo.com.cn...]

Comes from various IPs beginning with 202.160.178. and 202.160.179. and resolving to *.inktomisearch.com.

5:57 pm on Nov 15, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time, 10+ Year Member, Top Contributors Of The Month

joined:Nov 11, 2001
posts:5408
votes: 2

1:38 pm on Nov 20, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 17, 2002
posts:2251
votes: 0


Mozilla/5.0 (compatible; Yahoo! Slurp China; [misc.yahoo.com.cn...]
202.160.180.9

It's been visiting my sites every 2-3 days to retrieve the same file.

I really don't want to discriminate against China, and yet if Inktomi isn't going to at least read robots.txt they really leave me no choice but to ban them.

6:16 pm on Nov 20, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 24, 2002
posts:894
votes: 0


My sites have nothing in particular to offer to China.

Viewing the help page linked in the UA shows, predictably, a page in a language my browser cannot render; where it mentions robots.txt, though, the exclusion is given for "Slurp", so following that would ban all Y! bots.

I've banned them by full user agent.

10:23 am on Nov 29, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 29, 2003
posts:790
votes: 0


I've banned them by full user agent.

What is the full user agent in this case?
Mozilla/5.0+(compatible;+Yahoo!+Slurp+China;+http://misc.yahoo.com.cn/help.html)?

nerd

12:41 pm on Nov 29, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 24, 2002
posts:894
votes: 0


Mozilla/5.0 (compatible; Yahoo! Slurp China; [misc.yahoo.com.cn...]

is what I have; the "+" signs in your string were probably added by your logfile.
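
For anyone who wants to do the same, something along these lines in .htaccess ought to work (an untested sketch, assuming Apache with mod_setenvif; it matches on the "Slurp China" token rather than the full string, and the environment-variable name is just a placeholder):

# flag requests whose User-Agent contains "Slurp China" (case-insensitive)
SetEnvIfNoCase User-Agent "Slurp China" ban_slurp_china
# then deny only the flagged requests
Order Allow,Deny
Allow from all
Deny from env=ban_slurp_china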

8:35 pm on Nov 29, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time, 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


Slurp treats

User-agent: slurp
and
User-agent: slurp china

as the same thing, so you can't use robots.txt to disallow Slurp China without also Disallowing the "U.S." slurp.

So, for now, I've had to block Slurp China with a 403 in .htaccess. :(
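
Something along these lines should return the 403 (untested sketch; assumes Apache with mod_rewrite enabled for .htaccess, and matches the "Slurp China" token rather than the full UA):

RewriteEngine On
# forbid (403) any request whose User-Agent contains "Slurp China"
RewriteCond %{HTTP_USER_AGENT} "Slurp China" [NC]
RewriteRule .* - [F]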

Another approach that some Webmasters can use is to serve an alternate robots.txt to Slurp China: you can internally rewrite Slurp China's requests to a secondary robots.txt file that Disallows it. That's not an option for me, since my host -- for some reason -- grabs robots.txt requests and diverts them to a script that serves the site's robots.txt file before my .htaccess processing can have any effect.
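
For those whose hosts do allow it, the rewrite might look roughly like this (untested; the secondary filename is just a placeholder):

RewriteEngine On
# hand Slurp China a different robots.txt; everyone else gets the normal one
RewriteCond %{HTTP_USER_AGENT} "Slurp China" [NC]
RewriteRule ^robots\.txt$ robots-slurp-china.txt [L]

with robots-slurp-china.txt containing nothing but:

User-agent: *
Disallow: /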

It strikes me as odd that the big search engine players put so little thought into making life easier for Webmasters who target only domestic markets... not to mention all the other problems we've seen -- like redirect handling -- with the fundamental function of search engines -- the robots themselves.

slurp china, tsai chien! (goodbye)

Jim

6:35 pm on Nov 30, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 17, 2002
posts:2251
votes: 0


It wouldn't surprise me a bit if the decision to use essentially one robot name was intentional. This way we're forced into an all-or-nothing dilemma: either accept everything from Slurp or ban everything, unless you're willing to use more sophisticated means to stop the China bot, and I'd venture a guess that most webmasters aren't able or willing to do that.

9:26 pm on Nov 30, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 24, 2002
posts:894
votes: 0


GaryK, guess again ;o)

1:00 am on Dec 1, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 17, 2002
posts:2251
votes: 0


I knew I should have been more specific. ;)

To me it was a given that the folks here would know how to do that. I was referring to the rest of the webmasters in the world.

2:50 am on Dec 1, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 2, 2002
posts:1167
votes: 0


I have only one heartburn about continuously serving up 403s to Slurp China, and it's not at all unique to that one bot.

What possible benefit is there for any bot to come back and hit a 403 multiple times a day/week/month/etc.? It's like every bot has a line of code that says never give up, never surrender.

3:51 am on Dec 1, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time, 10+ Year Member

joined:Mar 31, 2002
posts:25430
votes: 0


"Never give up, never surrender..."

Yes, that's the "Churchill" subroutine... ;)

I have a special place for those... I rewrite their requests to a subdirectory where all access is blocked except for a custom 403 page. And that custom 403 page is two bytes long... It contains only "no". So this at least minimizes the bandwidth they waste. Of course, I'd rather block them at the router, but alas, I haven't purchased my own data center yet.
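
Roughly like this, if anyone wants to try it (an untested sketch; the directory and file names are made up, and the UA pattern is only an example). In the site-root .htaccess:

RewriteEngine On
# send the offending bot's requests into the locked-down directory
RewriteCond %{HTTP_USER_AGENT} "Slurp China" [NC]
RewriteCond %{REQUEST_URI} !^/blocked/
RewriteRule .* /blocked/ [L]

and in /blocked/.htaccess:

# deny everything in here except the tiny custom 403 page itself
ErrorDocument 403 /blocked/no.txt
Order Deny,Allow
Deny from all
<Files "no.txt">
Order Allow,Deny
Allow from all
</Files>

with /blocked/no.txt holding just the two bytes: no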

Jim