
Discobot crawler

How to stop this horrible bot?


grandma genie

4:33 am on Dec 6, 2010 (gmt 0)

10+ Year Member



Hello,
This miserable bot has been attempting to index my site for about a month. I have it blocked in robots.txt and htaccess and have also tried to stop it using the osCommerce spiders.txt file, but nothing works. It is fed nothing but 403s and continues to hammer my site. Today I tried sending an e-mail to them. Let's hope that works. Every time they index a page they include a session ID number, which would cause havoc with my site if anyone were to follow their links. I notice in GWT there are thousands of links to my site that include a session ID. Here is a sample from my logs of a discobot visit (notice the osCsid number):

38.101.148.nnn - - [05/Dec/2010:04:14:19 -0500] "GET /osc/index.php?cPath=945&osCsid=380c578840de9e8f9084f20d36361363 HTTP/1.1" 403 304 "-" "Mozilla/5.0 (compatible; discobot/1.1; +h**p://discoveryengine.com/discobot.html)"

Bing used to do the same thing, but an e-mail to them stopped it.

If the e-mail to discoveryengine works, I will let everyone here know.

wilderness

11:44 am on Dec 6, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This bot complies with robots.txt requests.
Try removing the denial (403) and give them a chance to comply with robots.txt.
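
For example, once the bot can actually fetch robots.txt, a block like this should stop it (I'm using "discobot" as the user-agent token here; check their discobot.html page for the exact token they honor):

User-agent: discobot
Disallow: /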

The line you provided is a 403 (access denied).
Denial of access does NOT prevent the request from appearing in your raw logs.

Have you applied a robots.txt "exception" to your 403 denial? If not, this bot will never see robots.txt, and it is stuck in a loop because it believes the pages still exist.
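
For instance, if the denial is done with Apache 2.2-style Order/Deny directives, a sketch of such an exception might look like this (the 38.0.0.0/8 range is just an illustration; on Apache 2.4 the equivalent inside the Files section would be "Require all granted"):

# Deny the range...
Order Allow,Deny
Allow from all
Deny from 38.0.0.0/8

# ...but always let robots.txt through, overriding the denial above
<Files "robots.txt">
Order Allow,Deny
Allow from all
</Files>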

jdMorgan

2:04 pm on Dec 6, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



robots.txt and your custom 403 error page (if any) must always be excluded from all rules or directive sections which can invoke 403 errors.

Otherwise, the 'bot cannot read robots.txt, and most 'bots will take this as carte blanche to fetch all URLs from your site. And any denied access attempt will result in a 403, causing the server to attempt to serve your custom 403 error page. But if no exception is granted to serve that page, then a second 403 error will be invoked... and a third, and a fourth, etc.

In both cases, you can expect repeated requests -- and so have effectively created a "self-inflicted denial of service attack."

So... robots.txt and your custom 403 error page (if any) must always be excluded from all rules or directive sections which can invoke 403 errors.

It is also a good idea to exclude all 500-series custom error documents (if any) from all access controls.
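
A sketch of what those exceptions might look like with mod_rewrite, assuming hypothetical error pages under /errors/ (substitute whatever paths your ErrorDocument directives actually use):

# Custom error pages (example paths, not necessarily yours)
ErrorDocument 403 /errors/403.html
ErrorDocument 500 /errors/500.html

# Exceptions first, so robots.txt and the error pages are always reachable
RewriteEngine On
RewriteRule ^robots\.txt$ - [L]
RewriteRule ^errors/ - [L]

# ...blocking rules (RewriteCond/RewriteRule ... [F]) go below this point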

Jim

grandma genie

4:48 pm on Dec 6, 2010 (gmt 0)

10+ Year Member



I think the problem is I have this IP range blocked in htaccess because I was getting so many questionable requests from that range (38.0.0.0/8). I am assuming that the Discobot IP 38.101.148.nnn is included in that range, so it is getting blocked. Should I add an entry to let the discobot IP range through? And how do you exclude robots.txt in htaccess? Do you add this to the end of the htaccess file: RewriteRule ^/robots\.txt$ - [L]

tangor

4:57 pm on Dec 6, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



One way of doing it is found here:

[webmasterworld.com...] See mine for a sample .htaccess and how to deal with robots.txt.

wilderness

6:51 pm on Dec 6, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I think the problem is I have this IP range blocked


Precisely what was explained to you by two different people.

Do you add this to the end of the htaccess file:


Jim wrote:

robots.txt and your custom 403 error page (if any) must always be excluded from all rules or directive sections which can invoke 403 errors.


Yet you replied two hours and forty-four minutes after he did, asking the same question he'd already answered.

grandma genie

12:33 am on Dec 8, 2010 (gmt 0)

10+ Year Member



Sorry, Wilderness. It is very hard for me to understand what I am reading on this forum. I understand you are trying to help. I spend hours on this forum and online trying to find answers to my questions. Most of the replies make my head spin. Yes, I've been told many times that "robots.txt must always be excluded from all rules or directive sections which can invoke 403 errors." But I don't know how to DO that! That is what I've been trying to figure out. Even the Apache site says their coding is difficult to understand, so please don't yell at me just because I don't understand it.

I prefer to use the mod_rewrite definition, which I think is:
RewriteRule ^robots\.txt$ - [L]

And, according to jdMorgan, that is supposed to go "above all the other RewriteRules in your .htaccess file."
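
So, putting it together, the top of my .htaccess would look something like this sketch (assuming the 38.0.0.0/8 block is done with mod_rewrite; if it is a "Deny from" block instead, the exception has to be made with a Files section as shown earlier in the thread):

RewriteEngine On

# Exception first: everyone may read robots.txt
RewriteRule ^robots\.txt$ - [L]

# Then the blanket block on the 38.0.0.0/8 range
RewriteCond %{REMOTE_ADDR} ^38\.
RewriteRule .* - [F]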

By the way, Oscar e-mailed me from Discovery and he was very helpful. So, the discobot is not as bad as I thought at first. It was my confusion with robots.txt and htaccess that caused all the trouble.

wilderness

2:08 am on Dec 8, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"must always be excluded from all rules or directive sections which can invoke 403 errors."