
Sosospider

how to stop it?

         

smallcompany

2:53 am on Oct 19, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I've been issuing 403s to Sosospider for a long time, but that thing keeps coming back, making request after request.

Is there a way of "bombing" such rude spiders back?

Thanks

incrediBILL

3:22 am on Oct 19, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Write a script that just keeps giving them huge page after page of junk to index, each page with more new links, links that exist only because your script creates them: an endless time sink for the spider.

Just put a delay in so you don't let them overload the server or exceed your bandwidth as you fill their index with gigabytes of garbage.
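A minimal sketch of that kind of tarpit, using only the Python standard library. The word list, the /trap/ link pattern, and the two-second delay are invented for illustration; you'd route the bad bot's requests to it with whatever rewrite or proxy rule fits your setup:

```python
import random
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

TARPIT_DELAY = 2.0  # seconds between responses, so the trap never strains your own server

def fake_page(path, n_links=20, n_words=300):
    """Build a deterministic junk page for `path`, full of links that
    exist only inside the trap."""
    rng = random.Random(path)  # same path -> same page, so it looks consistent to a crawler
    words = ["widget", "gadget", "sprocket", "flange", "gizmo", "doohickey"]
    parts = ["<html><body><p>", " ".join(rng.choices(words, k=n_words)), "</p>"]
    for _ in range(n_links):
        # Every page advertises more pages that exist only inside the trap
        link = "/trap/%08x.html" % rng.getrandbits(32)
        parts.append('<a href="%s">%s</a>' % (link, link))
    parts.append("</body></html>")
    return "".join(parts)

class TarpitHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(TARPIT_DELAY)  # the delay the post above recommends
        payload = fake_page(self.path).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

def serve(port=8080):
    """Run the tarpit; point blocked bots here."""
    HTTPServer(("", port), TarpitHandler).serve_forever()
```

Seeding the generator with the request path means a revisited URL serves the same junk, which keeps the trap looking like a real (if dull) site.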

However, that can backfire if something like Google eventually gets those links and bombards your server with requests for those pages and totally blows your WMT control panel to hell.

In other words, yes you can mess with them, but it can come back and bite you even worse.

caribguy

3:36 am on Oct 19, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



301 redirect to localhost or drop 'em in the firewall...
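In Apache terms the first option might look like this (an untested sketch, assuming mod_rewrite is enabled and the UA string contains "Sosospider"):

```apache
# Match the bot's UA and bounce its requests back to its own loopback address
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Sosospider [NC]
RewriteRule .* http://127.0.0.1/ [R=301,L]
```

The firewall variant would instead be a DROP rule on the offending IP range at the packet-filter level, so the requests never reach the web server at all.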

Pfui

4:25 am on Oct 19, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



1.) I rewrite the worst of the worst to 127.0.0.1 via 301.

2.) For IPs and related UAs: Sosospider [webmasterworld.com...]

Dijkgraaf

8:54 pm on Oct 19, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Have you actually added Sosospider to your robots.txt file? I did and after a few days all requests except for robots.txt stopped.

Note: I add a disallow for both Sosospider and sosospider
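Most robots.txt implementations are supposed to match the user-agent token case-insensitively, but doubling up costs nothing, and both spellings can even share a single rule group:

```text
User-agent: Sosospider
User-agent: sosospider
Disallow: /
```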

Staffa

9:20 pm on Oct 19, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I have this same problem on just one of my sites. On all my other sites it reads robots.txt, where it is disallowed, and goes away. On that one particular site it requests robots.txt without the www., gets redirected to www., but doesn't hang around long enough for the redirect to finish, so it never gets to read the robots.txt file; then it comes back for pages with www. In the last fortnight it has come some 1,500 times to that one site. Even though the IP range is blocked and it gets nothing, it keeps coming back again and again. Annoying.

tangor

11:11 pm on Oct 19, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The legitimate Sosospider has (all this past year, at any rate) honored a blanket disallow in robots.txt. I whitelist only what I accept and reject all the rest.

HOWEVER, from an IP NOT usual for Sosospider, but bearing a UA that "looked similar", I did have one (1) page taken. That IP was in the middle of AWS, and it just helped me refine my block of THAT band of bots. :)

Make sure you allow ALL user-agents to get robots.txt. If it can't be found, most (all?) bots take that as permission to hammer the site.
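One way to express that last point in Apache (a hedged sketch; the UA pattern is just this thread's example bot) is a block that forbids everything except robots.txt:

```apache
# 403 the bot everywhere, but always let it fetch robots.txt, so a
# missing file is never read as an invitation to crawl everything
RewriteEngine On
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteCond %{HTTP_USER_AGENT} Sosospider [NC]
RewriteRule .* - [F]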

Staffa

12:22 pm on Oct 20, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



With respect to bots and scrapers, all my sites are set up in the same way, so there is no reason for Soso to make an exception for one site.
On that particular site it has requested robots.txt 433 times without www. over the past year, all from IP 124.115.6.nn and with the full Soso UA.

I hope Soso reads here and cleans up its act ;o)

jdMorgan

1:41 pm on Oct 20, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Just a note: I recommend that you do not redirect non-canonical requests for robots.txt or for your custom 403 error page. There's also no use redirecting to canonicalize unwelcome requests from any user-agent -- it's just a waste of time, bandwidth, and CPU resources.
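That exception can be written directly into a canonicalization rule. A hedged sketch, assuming an Apache setup with `example.com` standing in for the real hostname and `/errors/403.html` as a hypothetical custom error page:

```apache
# Canonicalize to www, but never redirect robots.txt or the custom
# 403 page; serve those directly on whichever hostname was requested.
# (example.com and /errors/403.html are placeholders.)
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteCond %{REQUEST_URI} !^/errors/403\.html$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
```

This would also have avoided the situation described above, where a bot abandons the robots.txt redirect and never learns it is disallowed.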

An effective ploy to use against malicious 'bots in certain authoritarian countries is to return content that is forbidden in those countries for political or other reasons. I'll leave the details to your imagination, but some countries offer big opportunities for the use of this method...

Jim

Staffa

4:34 pm on Oct 20, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Jim, I love your idea for "certain" countries. I know exactly what to write about, and it will be up in a couple of hours; then I'll wait and see how long it takes for Soso to back off.

Thank you :o)

blend27

9:16 pm on Oct 21, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'll leave the details to your imagination


It actually works. I started similar tests on a few sites a couple of years back, and traffic from those countries disappeared altogether within a month.

Staffa

9:53 pm on Oct 24, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I hope Soso reads here

I think it does....
At the 440th request it finally asks for www., gets the robots.txt, reads that it's disallowed, and goes away. Same for 441 and 442 .... finally!

blend27

1:00 am on Oct 25, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Staffa, you could've stopped at #403 ;)

Staffa

7:41 pm on Oct 25, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



True, or Soso could have stopped at 404, either one would have done ;)

Anyway, keeping my fingers crossed; it's still on its best behaviour ... for how long?

I also noticed that the swarm of its satellites which always accompanied 124.115.n.nn has disappeared [webmasterworld.com ]

no more visits from 114.80.93.nn, 58.61.164.nn and 61.135.167.nn either.

I don't know what I'm going to do with myself now ;o)

AndrewCF

5:49 am on Oct 29, 2010 (gmt 0)

10+ Year Member



Sosospider is the spider of a Chinese search engine (Soso). Try disallowing it in your robots.txt:


User-agent: Sosospider
Disallow: /


About Sosospider: [httpuseragent.org ]