New bot to ban?

Forum Moderators: phranque

New scraper to ban

walrus

6:48 am on Nov 4, 2007 (gmt 0)

It may be too late for anyone without a bot trap, but here's an aggressive new scraper you might want to block with .htaccess:

38.100.41.112
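
For anyone who only wants to shut out this one address, a minimal .htaccess sketch follows. It assumes the host lets .htaccess files use the access-control directives (AllowOverride Limit) and that no other Order/Allow/Deny lines are already in the file.

```apache
# Let everyone in by default, then refuse the single scraper address.
Order Allow,Deny
Allow from all
Deny from 38.100.41.112
```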

stapel

4:35 pm on Nov 4, 2007 (gmt 0)

walrus said: ...here's an aggressive new scraper you might want to block with htaccess: 38.100.41.112

Okay, but "38.100.41.112" is just an IP address, one of many owned by Cogent Communications. (CogentCo appears to own the entire 38.*.*.* block.)

What was the bot in question? What user-agent was declared?

Thank you.

Eliz.

jdMorgan

4:56 pm on Nov 4, 2007 (gmt 0)

A lot of folks ban the entire 38 range, while some make exceptions for a few subnets in that range (for Gigabot, for example) and possibly for a few known 'corporate' subnets as well.
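
A rough sketch of the "ban the range, but make exceptions" approach might look like this. The subnet on the Allow line is a made-up placeholder, not Gigabot's real range or anyone else's; substitute whatever range you have verified yourself.

```apache
# With Order Deny,Allow the Deny lines are checked first and a matching
# Allow overrides them; requests that match neither are permitted.
Order Deny,Allow
Deny from 38.
# Placeholder subnet for illustration only; replace with a verified range.
Allow from 38.0.2.0/24
```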

Jim

walrus

5:02 pm on Nov 4, 2007 (gmt 0)

I'm not sure if it's a bot or not, so I Googled .htaccess and the IP,
and it has been tagged as a mail harvester, but the user agent doesn't reveal much.

Mozilla/4.0 (compatible; MSIE 6.0; Windows XP

jdMorgan

5:13 pm on Nov 4, 2007 (gmt 0)

If that is the entire user-agent string, then the good news is that it is invalid and can easily be caught.

Jim

wilderness

5:58 pm on Nov 4, 2007 (gmt 0)

Mozilla/4.0 (compatible; MSIE 6.0; Windows XP

Is the UA you've provided incomplete as a result of a copy-and-paste error?
OR
Is the UA EXACTLY as logged?

If it is exact, then the trailing ) is missing and that is a fake UA.

I seem to recall that at one time some folks were using an "ends with XP" denial, however my recollection could be playing tricks on me.

stapel

6:50 pm on Nov 4, 2007 (gmt 0)

jdMorgan said: A lot of folks ban the entire 38 range....

<alert type="stupid question">
Um... why do they do that...?
</alert>

Eliz.

wilderness

8:14 pm on Nov 4, 2007 (gmt 0)

why do they do that

1) they do not want their active web pages harvested by less-than-major bots, regardless of criteria
2) they resent either the bot's explanation or its failure to follow the accepted internet protocol of complying with robots.txt
3) they do NOT want any association with the person, IP range or company behind the spidering.

There are a multitude of other reasons which I have omitted.
There's an old thread in forum #11 in which Jim and others provided some valid explanations.

jdMorgan

8:22 pm on Nov 4, 2007 (gmt 0)

Those points, plus the fact that the legitimate-visitor to bad-bot ratio in that range is very low (for some sites).

Jim

walrus

6:21 pm on Nov 5, 2007 (gmt 0)

I forgot to check the thread since my last post, and JD must have posted while I was still writing, because I never saw that.
To clarify, here is the whole string, but I guess JD has confirmed it's a bad block anyway.
Thanks again, everybody.

38.100.41.112 - - [02/Nov/2007:23:51:29 -0400] "GET / HTTP/1.1" 200 11312 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows XP)"

jdMorgan

9:28 pm on Nov 5, 2007 (gmt 0)

OK, but even with the closing parenthesis, that user-agent string is still invalid, so you can go ahead and block it by UA if you don't want to block that whole address range.
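
The string is a giveaway because a genuine IE 6 on Windows XP reports "Windows NT 5.1" rather than "Windows XP". As a minimal sketch, assuming mod_setenvif is available and .htaccess overrides are permitted on the host, blocking by UA without mod_rewrite could look like this. The variable name bad_ua is just a label chosen for this example.

```apache
# Flag any request whose User-Agent is exactly the string from the log line.
SetEnvIf User-Agent "^Mozilla/4\.0 \(compatible; MSIE 6\.0; Windows XP\)$" bad_ua

# Allow everyone, then refuse requests carrying the flag.
Order Allow,Deny
Allow from all
Deny from env=bad_ua
```

Loosening the pattern (for example, matching only the "Windows XP)" ending) would catch more variants, at the risk of hitting something legitimate.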

Jim

walrus

4:39 am on Nov 6, 2007 (gmt 0)

My .htaccess skills are limited, but from what I know, on a shared host I can't get access to httpd.conf to allow rewrite, so I guess I can't block by UA. Thanks for confirming it's invalid. I think I will opt for blocking the whole range, as your point about the low ratio of legit visitors to bad bots makes a lot of sense.
Thanks again Jim!

mlewitz

7:39 am on Nov 6, 2007 (gmt 0)

Just to make sure I'm gonna do this right (still learning my Apache)...

To block the entire 38. IP range, the specific lines should read:

order allow,deny
deny from 38.
allow from all

Is this correct?

Thanks.
~Mike

wilderness

7:52 am on Nov 6, 2007 (gmt 0)

order allow,deny
deny from 38.
allow from all
Is this correct?

Yes, and perhaps ;)
The "perhaps" applies if you have an existing file and you're just adding those lines.

Here are some links to assorted explanations and examples.
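
With Order Allow,Deny, Apache evaluates all the Allow lines first and then all the Deny lines, and a matching Deny wins, so the position of the new line within the block does not matter. What matters in an existing file is keeping a single Order/Allow pair and adding only the new Deny line. A sketch, assuming the file already denies one address (the 192.0.2.10 entry is a documentation-range placeholder, not a real offender):

```apache
# Keep one Order directive and one "Allow from all" for the whole file.
Order Allow,Deny
Allow from all
# Entry assumed to be in the file already (placeholder address).
Deny from 192.0.2.10
# New entry: every address beginning with 38.
Deny from 38.
```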

Some old threads:
[webmasterworld.com...]
[webmasterworld.com...]
[webmasterworld.com...]

<snip>

[edited by: jdMorgan at 12:39 am (utc) on Nov. 7, 2007]
[edit reason] Removed URLs per TOS. [/edit]
