net-sweeper

Ignored Robots.txt

         

frontpage

7:14 pm on Mar 15, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This UA ignored robots.txt and fell into my spider trap.

66.207.120.227 - - [2003-03-14 (Fri) 15:45:22] "GET /trap/ HTTP/1.0" Mozilla/5.0

Resolves to:

host227.net-sweeper.com

carfac

9:21 pm on Mar 15, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



frontpage:

net-sweeper is a service that "filters" web pages for clients. Kind of like a Net Nanny or something, I guess. My understanding is that when you request a URL, they grab the page first (if it is not already in their database) and check it for "offensive" words. Like I said, that is my understanding...

In practice, I have seen this all over my sites. As you note, it does NOT respect robots.txt. (My guess is that it is trying to anticipate their clients' next click, so it d/l's all the links from the first page it goes to, regardless of robots.txt.) I have the original IP address banned because of this practice. I am not sure how this affects their clients... but I frankly do not care.

dave

frontpage

11:16 pm on Mar 15, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



What is the IP you use to ban this robots.txt disrespecter? Is it the same as the one I posted?

Thanks!

wilderness

1:27 am on Mar 16, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



frontpage,
the IP range is 66.207.120.224 - 66.207.120.255

You can use mod_rewrite:
RewriteCond %{REMOTE_ADDR} ^66\.207\.120\.(22[4-9]|25[0-5])\. [OR]

BTW there was a recent discussion in which the two backbones, Fibre Wired/Hamilton Hydro and Guelph Hydro were mentioned.
[webmasterworld.com...]
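
For anyone who wants to double-check the range quoted above before writing a rule, here is a minimal Python sketch. It only uses the standard-library `ipaddress` module; the helper name `in_range` is made up for illustration and is not anything Apache or the posters use.

```python
import ipaddress

# The range quoted in the thread: 66.207.120.224 - 66.207.120.255
first = ipaddress.IPv4Address("66.207.120.224")
last = ipaddress.IPv4Address("66.207.120.255")

# Collapse the first/last pair into the smallest set of CIDR blocks.
blocks = list(ipaddress.summarize_address_range(first, last))

def in_range(ip: str) -> bool:
    """Hypothetical helper: True if ip falls inside the quoted range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in block for block in blocks)

if __name__ == "__main__":
    print(blocks)
    print(in_range("66.207.120.227"))  # the address from the trap log
```

The range is exactly 32 addresses starting on a multiple of 32, so it collapses to a single CIDR block, which is handy if you would rather block by netmask than by regex.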

jdMorgan

1:37 am on Mar 16, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



wilderness,

One too many slash-dots in that regex...
This should work:
RewriteCond %{REMOTE_ADDR} ^66\.207\.120\.(22[4-9]|25[0-5])$ [OR]

This is a weird case, 'cause it's not quite a browser, and not quite a robot, either. Gotta keep an eye on this one, I think.

Jim

jazzguy

1:38 am on Mar 16, 2003 (gmt 0)

10+ Year Member



After they hit a couple of my sites, I banned their entire netblock:
Netsweeper FWGH-NETSWEEPER-1 (NET-66-207-120-224-1) 66.207.120.224 - 66.207.120.255 or "^66\.207\.120\.(22[4-9]|2[3-4][0-9]|25[0-5])$"

I also banned by Remote_Host "net-sweeper\.com" and "netsweeper\.fibrewired\.on\.ca"
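
A quick way to see what each of the patterns in this thread actually covers is to run them over every possible last octet. This is only a sketch: it uses Python's `re` module rather than Apache's regex engine (the syntax used here behaves the same in both, as far as I know), the alternation is written with a plain `|`, and the names `TWO_BRANCH`, `THREE_BRANCH` and `matched_octets` are made up for illustration.

```python
import re

# Pattern with two alternatives, as posted earlier in the thread.
TWO_BRANCH = re.compile(r"^66\.207\.120\.(22[4-9]|25[0-5])$")
# Pattern with the extra 2[3-4][0-9] alternative, as posted above.
THREE_BRANCH = re.compile(r"^66\.207\.120\.(22[4-9]|2[3-4][0-9]|25[0-5])$")

def matched_octets(pattern):
    """Return every last octet (0-255) whose full address matches pattern."""
    return [n for n in range(256) if pattern.match(f"66.207.120.{n}")]

two = matched_octets(TWO_BRANCH)
three = matched_octets(THREE_BRANCH)

# Octets in the 224-255 netblock that the two-branch pattern lets through.
missed = sorted(set(range(224, 256)) - set(two))

if __name__ == "__main__":
    print(len(two), len(three))
    print(missed)
```

The two-alternative form matches 224-229 and 250-255 but skips 230-249, so it does catch .227 yet does not cover the whole netblock. The three-alternative form matches all 32 addresses.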

carfac

5:20 pm on Mar 16, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi:

On this one, I am not sure if I would ban the entire net block...

Based on what this service says it does, and on my summary above (if accurate), 66.207.120.227 is probably all that needs to be banned.

I THINK that 66.207.120.227 is their "preview" bot, which will d/l the page and look for flagged words or phrases. The rest of the IP range COULD be legitimate users.

On this one, I would urge caution in banning it all... I think you can be fine with JUST banning the single IP.

dave

wilderness

7:55 pm on Mar 16, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Dave,
<snip>I THINK that 66.207.120.227 is their "preview" bot</snip>

I can't tell you how many times my laxity in IP denies has permitted some pest to return and grab way too many pages.
My zealousness may not be appropriate for everybody, but it fits my situation.

I've been getting hit by more than a few of Fibre Wired/Hamilton Hydro's users for nearly two years. It would be nice if I could deny on a UA and not close the door to all their users.
ONLY Hamilton Hydro can solve that dilemma, and like most IPs they hardly have their hearing aid on today or any other day when it comes to webmasters' issues :(

Don

Thanks for the consideration and thought, though.

carfac

11:59 pm on Mar 16, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Don:

Not a problem! You are a bit more, well, reactive! than I am. I think I am a bit strong sometimes, too. (I am really hard on anyone stealing my images. But that is my business (as in my product!))

But to each his or her own... I am just glad we share info like this, so we can all make up our own minds how far to go!

<aside>Sometimes, I really wish I COULD just cut off all of APNIC like you. But I do get some sales from there. Not many in number, but usually all high dollar amounts... so, again, we differ. ;) </aside>

dave

jmccormac

5:15 pm on Apr 7, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Just had a DoS on my main website (~64K webpages) by these people. The spider they are using is badly written in that it does not interpret external links properly. (The site affected is the main .ie website/domain directory.) The net-sweeper bot seems to interpret external links as being subdirectories. The useragent aspect was missing but the host name was host227.net-sweeper.com

Initially I put a 403 block into effect but the badly designed bot just kept hammering away. In the end I had to block them at the IP level. I'm going to send these characters an invoice for the problems they have caused.

Regards...jmc