Forum Moderators: phranque

New bot to ban?

New scraper to ban

         

walrus

6:48 am on Nov 4, 2007 (gmt 0)

10+ Year Member



It may be too late for anyone without a bot trap, but here's an aggressive new scraper you might want to block with .htaccess:

38.100.41.112

stapel

4:35 pm on Nov 4, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



walrus said: ...here's an aggressive new scraper you might want to block with htaccess: 38.100.41.112

Okay, but "38.100.41.112" is just an IP address, one of many owned by Cogent Communications. (CogentCo appears to own the entire 38.*.*.* block.)

What was the bot in question? What user-agent was declared?

Thank you.

Eliz.

jdMorgan

4:56 pm on Nov 4, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



A lot of folks ban the entire 38 range, while some make exceptions for a few subnets in that range --for example, for Gigabot-- and possibly for a few known 'corporate' subnets as well.

Jim
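
A minimal sketch of the "ban the range, exempt a few subnets" approach described above, in Apache 2.2-style .htaccess syntax. It assumes the host permits the Limit override. The exempted subnet is a made-up placeholder, not Gigabot's or anyone else's real range; look up and verify any range before allowing it.

```apache
# With Order Deny,Allow, Deny directives are evaluated first and
# Allow directives can then override them; unmatched requests are allowed.
Order Deny,Allow
# A partial address matches every IP that begins with 38.
Deny from 38.
# HYPOTHETICAL placeholder subnet, for illustration only.
Allow from 38.0.0.0/24
```

With this ordering, a request from inside the exempted subnet matches both directives and is let through, while the rest of 38.*.*.* gets a 403.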

walrus

5:02 pm on Nov 4, 2007 (gmt 0)

10+ Year Member



I'm not sure if it's a bot or not, so I Googled ".htaccess" and the IP, and it has been tagged as a mail harvester, but the user agent doesn't reveal much:

Mozilla/4.0 (compatible; MSIE 6.0; Windows XP

jdMorgan

5:13 pm on Nov 4, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If that is the entire user-agent string, then the good news is that it is invalid and can easily be caught.

Jim

wilderness

5:58 pm on Nov 4, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Mozilla/4.0 (compatible; MSIE 6.0; Windows XP

Is the UA you've provided incomplete as a result of a copy-and-paste error, or is it EXACTLY as shown?

If it is exact and the trailing ) is missing, then that is a fake UA.

I seem to recall that at one time some folks were using an "ends with XP" denial; however, my recollection could be playing tricks on me.

stapel

6:50 pm on Nov 4, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



jdMorgan said: A lot of folks ban the entire 38 range....

<alert type="stupid question"> 
Um... why do they do that...?
</alert>

Eliz.

wilderness

8:14 pm on Nov 4, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



why do they do that

1) they do not want their active web pages harvested by anything less than the major bots, regardless of criteria
2) they resent either the explanation offered or the failure to follow the accepted internet convention that bots comply with robots.txt
3) they do NOT want any association with the person, IP range or company behind the spidering.

There are a multitude of other reasons which I have omitted.
There's an old thread in forum #11 in which Jim and others provided some valid explanations.

jdMorgan

8:22 pm on Nov 4, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Those points, plus the fact that the legitimate-visitor to bad-bot ratio in that range is very low (for some sites).

Jim

walrus

6:21 pm on Nov 5, 2007 (gmt 0)

10+ Year Member



I forgot to check the thread since my last post, and JD must have posted while I was still writing, because I never saw that.
To clarify, here is the whole string, but I guess JD has confirmed it's a bad block anyway.
Thanks again, everybody.

38.100.41.112 - - [02/Nov/2007:23:51:29 -0400] "GET / HTTP/1.1" 200 11312 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows XP)"

jdMorgan

9:28 pm on Nov 5, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



OK, but even with the closing parenthesis, that user-agent string is still invalid, so you can go ahead and block it by UA if you don't want to block that whole address range.

Jim
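
One reason the string is invalid even with the parenthesis: a genuine MSIE 6 on Windows XP reports its platform as "Windows NT 5.1", never as "Windows XP". A minimal sketch of a UA block, assuming mod_rewrite is available and enabled for .htaccess use on the host (not a given on shared hosting):

```apache
# Per-directory rewriting also requires FollowSymLinks (or SymLinksIfOwnerMatch)
# to be enabled, e.g.: Options +FollowSymLinks
RewriteEngine On
# Match the exact user-agent string from the log entry.
# Literal dots and parentheses are escaped; the pattern is quoted because it contains spaces.
RewriteCond %{HTTP_USER_AGENT} "^Mozilla/4\.0 \(compatible; MSIE 6\.0; Windows XP\)$"
# No substitution (-); answer every matching request with 403 Forbidden.
RewriteRule ^ - [F]
```

Anchoring the pattern at both ends keeps the rule from catching real browsers whose user-agent merely contains a similar fragment.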

walrus

4:39 am on Nov 6, 2007 (gmt 0)

10+ Year Member



My .htaccess skills are limited, but from what I know, on a shared host I can't get access to httpd.conf to allow rewrite, so I guess I can't block by UA. Thanks for confirming it's invalid. I think I will opt for blocking the whole range, as your point about the low ratio of legit visitors to bad bots makes a lot of sense.
Thanks again Jim!
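
For what it's worth, blocking by user-agent does not strictly need mod_rewrite or access to the main server configuration. A minimal sketch using mod_setenvif together with the usual allow/deny directives, assuming the host loads that module and permits the FileInfo and Limit overrides in .htaccess (none of which is confirmed in this thread). The variable name bad_ua is arbitrary:

```apache
# Flag any request whose User-Agent is exactly the string from the log entry.
SetEnvIf User-Agent "^Mozilla/4\.0 \(compatible; MSIE 6\.0; Windows XP\)$" bad_ua
# Allow is evaluated first, then Deny; a request matching both is denied.
Order Allow,Deny
Allow from all
# Refuse flagged requests with 403 Forbidden.
Deny from env=bad_ua
```

A `Deny from 38.` line can sit in the same block if the whole range is to go as well.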

mlewitz

7:39 am on Nov 6, 2007 (gmt 0)

10+ Year Member



Just to make sure I'm gonna do this right (still learning my Apache)...

To block the entire 38. IP range, the specific lines should read:

order allow,deny
deny from 38.
allow from all

Is this correct?

Thanks.
~Mike

wilderness

7:52 am on Nov 6, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



order allow,deny
deny from 38.
allow from all

Is this correct?

Yes and perhaps ;)
It depends on whether you have an existing file and you're just adding those lines.
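
One reading of the "perhaps": an .htaccess file should end up with a single coherent allow/deny block. A minimal sketch of the complete block in Apache 2.2-style syntax, assuming no other Order, Allow, or Deny lines already exist in the file; if they do, fold the new Deny line into the existing block rather than pasting in a second Order line:

```apache
# Allow directives are evaluated first, then Deny directives;
# a request that matches both is denied, and one that matches neither is denied.
Order Allow,Deny
Allow from all
# Partial address: matches every IP beginning with 38. (same as 38.0.0.0/8)
Deny from 38.
```

The sequence of the Allow and Deny lines within the file does not change the result; only the Order directive decides which set wins. On Apache 2.4 and later, the same policy is normally written with Require directives instead.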

Here are some links to assorted explanations and examples.

Some old threads:
[webmasterworld.com...]
[webmasterworld.com...]
[webmasterworld.com...]

<snip>

[edited by: jdMorgan at 12:39 am (utc) on Nov. 7, 2007]
[edit reason] Removed URLs per TOS. [/edit]