discobot

discoveryengine

         

Hobbs

11:20 pm on Apr 9, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Mozilla/5.0 (compatible; discobot/1.0; +http://discoveryengine.com/discobot.html)

208.96.54.zz
208.96.0.0/18 ServePath

Yet Another Spider disco/Nutch-1.0-dev
[webmasterworld.com...]

wilderness

1:54 am on Apr 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You can save yourself many pests by blocking on the IPs (all from the same backbone):

RewriteCond %{REMOTE_ADDR} ^208\.96\.([0-9]|[1-5][0-9]|6[0-3]|8[0-9]|9[0-6])\. [OR]
RewriteCond %{REMOTE_ADDR} ^216\.93\.1([6-8][0-9]|9[01])\. [OR]
RewriteCond %{REMOTE_ADDR} ^64\.151\.(6[4-9]|[7-9][0-9]|1[01][0-9]|12[0-7])\. [OR]
RewriteCond %{REMOTE_ADDR} ^69\.59\.(12[8-9]|1[3-8][0-9]|19[01])\.
# Closing rule added to complete the fragment: return 403 Forbidden for any match above
RewriteRule .* - [F]

Hobbs

10:31 am on Apr 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



wilderness,
64.151.64.0 - 64.151.127.255 is ServePath ok
64.151.64.0/17

69.59.128.0 - 69.59.191.255 is ServePath ok
69.59.128.0/18

but 216.93.0.0 - 216.93.127.255 is core.com ISP

wilderness

1:08 pm on Apr 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hobbs,
You've missed the 1 outside the parentheses, which makes the range begin at 160.
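One way to check which third-octet values the 216.93 condition actually covers is to run the pattern over every possible value. This is only a minimal Python sketch: it assumes Python's `re` module reads this simple pattern the same way mod_rewrite does, it writes the alternation with ASCII `|`, and the helper name `matched_octets` is made up for illustration.

```python
import re

# The 216.93 condition from the rewrite lines, with "|" as the alternation character.
PATTERN = re.compile(r"^216\.93\.1([6-8][0-9]|9[01])\.")

def matched_octets(pattern):
    """Return every third-octet value (0-255) that the pattern accepts."""
    return [n for n in range(256) if pattern.match(f"216.93.{n}.1")]

octets = matched_octets(PATTERN)
print(min(octets), max(octets), len(octets))  # prints: 160 191 32
```

The 32 matching values run from 160 to 191 with no gaps, which is the range the leading 1 outside the parentheses produces.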

Hobbs

1:53 pm on Apr 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



thanks, regexp is pig latin to me :-)

(in mortal human language)
216.93.160.0 - 216.93.191.255 >> 216.93.160.0/19

So just for the record here are all ranges put together:

64.151.64.0/17
69.59.128.0/18
208.96.0.0/18
216.93.160.0/19

How do you guys find the different IP ranges for the same hosting company? Arin.net shows only one of them.

wilderness

1:59 pm on Apr 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Just get the name from an ARIN IP Whois.

Then copy and paste that name into a new ARIN WHOIS.

BTW, the following is real "pig Latin"

64.151.64.0/17
69.59.128.0/18
208.96.0.0/18
216.93.160.0/19

Hobbs

2:27 pm on Apr 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks, you've just opened my eyes to a new world of easy blocking. Who would have thought: enter the company name and down come all the IP ranges!

To me, telling my firewall apf -d 208.96.0.0/18 is clean and simple: fewer things to go wrong than with all your [ and ], and God forbid you forget a \. It is also much easier to read later on. But as I said, I'm a mere mortal.

wilderness

3:00 pm on Apr 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hobbs,
With ARIN we may also do subnet range searches.
The method does require some "tinkering".

As an example, enter the following, minus the quotes:

"> 209.206.128"

then scroll the page and view the results.

The problem with many of these searches is a result limit of 256-something.
I've had some results fill multiple pages and take many minutes, while others cut short at a predetermined limit.

As an aside, I've never worked out how to do these searches at RIPE, although the help there does say that such queries are an option.

edited by wilderness!

Hope you're not on dialup ;)

"> 67.128."

incrediBILL

6:59 pm on Apr 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



the following is real "pig Latin"

64.151.64.0/17
69.59.128.0/18
208.96.0.0/18
216.93.160.0/19

Correction: the ServePath range is 64.151.64.0/18. You don't want to block the /17, or you might whack a big bunch of legit browsers by accident.

OrgName: ServePath, LLC
NetRange: 64.151.64.0 - 64.151.127.255
CIDR: 64.151.64.0/18
NetName: SERVEPATH-BLK4
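The difference between the /18 and the /17 can be checked with Python's standard `ipaddress` module. This is a rough sketch and not part of the ARIN record; how a particular firewall treats a misaligned CIDR such as 64.151.64.0/17 is an assumption here (the `strict=False` call shows one common interpretation, masking down to the enclosing block).

```python
import ipaddress

first = ipaddress.ip_address("64.151.64.0")
last = ipaddress.ip_address("64.151.127.255")

# summarize_address_range yields the smallest set of CIDR blocks
# that covers exactly first..last.
blocks = list(ipaddress.summarize_address_range(first, last))
print(blocks)  # [IPv4Network('64.151.64.0/18')]

# A /17 starting at 64.151.64.0 is not a valid network: strict parsing
# rejects it because host bits are set.
try:
    ipaddress.ip_network("64.151.64.0/17")
    strict_ok = True
except ValueError:
    strict_ok = False

# With strict=False the address is masked down to the enclosing /17,
# which starts at 64.151.0.0 and is twice the size of the /18.
wide = ipaddress.ip_network("64.151.64.0/17", strict=False)
print(strict_ok, wide, wide.num_addresses)  # False 64.151.0.0/17 32768
```

Under that interpretation the /17 would also cover 64.151.0.0 - 64.151.63.255, which lies outside the NetRange shown in the record.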

Besides, it's not igpay atinlay, it's a binary bitmask for a CIDR (Classless Inter-Domain Routing) [en.wikipedia.org] and quite easy to understand!

Since many people have difficulty with these, here's a little primer on CIDRs that hopefully simplifies the concept.

Each part of the IP address is represented by 8 bits (one byte), which gives a value of 0-255.

Think of the CIDR like this in terms of the binary bitmask: bits 1-8.9-16.17-24.25-32. Therefore you know that something ending in /18 fixes all 16 bits of the A and B blocks plus the first 2 bits of the C block as the start of every address assigned to that range.

Some examples:

1.0.0.0/8 means the first 8 bits are constant, so you're referring to a specific A block.

Likewise, 1.1.0.0/16 and 1.1.1.0/24 would refer to a specific B or C block respectively, and 1.1.1.1/32 means the entire IP address is used and not a portion.

1.0.0.0/8 represents the range 1.0.0.0-1.255.255.255, 1.1.0.0/16 represents 1.1.0.0-1.1.255.255, etc.

So 64.151.64.0/18 means that the range is 64.151.64.0-64.151.127.255, which is easier to see in binary:
01000000.10010111.01000000.00000000
so /18 is
01000000.10010111.01nnnnnn.nnnnnnnn

Note that the first 18 bits of the CIDR are fixed and everything after that point is variable. The C block value has a base of 64, meaning that you can only add values of 0-63 to the base part of the C block address, making the maximum 127. The D block in the example can be any value from 0-255.
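The fixed-bits and variable-bits idea can be written out as plain integer arithmetic. This is an illustrative Python sketch, not how any particular firewall implements it; the function name `cidr_range` is made up for the example.

```python
import ipaddress

def cidr_range(cidr):
    """Return (first, last) dotted-quad addresses for a CIDR block,
    computed with plain integer bit operations."""
    base, prefix = cidr.split("/")
    prefix = int(prefix)
    addr = int(ipaddress.IPv4Address(base))
    # Mask with the top `prefix` bits set, e.g. /18 -> 18 ones then 14 zeros.
    mask = (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF
    first = addr & mask                  # fixed bits kept, variable bits all 0
    last = first | (~mask & 0xFFFFFFFF)  # fixed bits kept, variable bits all 1
    return str(ipaddress.IPv4Address(first)), str(ipaddress.IPv4Address(last))

print(cidr_range("64.151.64.0/18"))  # ('64.151.64.0', '64.151.127.255')
print(cidr_range("1.0.0.0/8"))       # ('1.0.0.0', '1.255.255.255')
```

Setting every variable bit to 0 gives the first address of the block and setting every variable bit to 1 gives the last, which is why the C block tops out at 64 + 63 = 127.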

Hope that takes some of the igpay out of the atinlay for those who find CIDRs hard to deal with. The scientific mode of the calculator that comes with Windows provides binary-to-decimal conversions to make it all easier to see, and there are some CIDR calculators online that make it even simpler.

IMO, cutting and pasting the exact CIDR from the ARIN record into a firewall is a lot safer than risking the mistakes that are easily made in rewrite rules. I'm a big stickler for spot-on accuracy, so I use it "as-is" from ARIN or the other respective internet number registries just to avoid potential disasters.

wilderness

7:14 pm on Apr 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Bill,
As I said "pig latin" ;)

From my own perspective, the rewrite lines are much easier to understand.

As far as Hobbs' mention of syntax errors goes:
These are going to occur in any method (i. e., Bill's 17-18 reference).

All this merely shows us that there's more than one way to skin an htaccess cat ;)

Don

Hobbs

7:53 pm on Apr 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Here's the simplest way I could come up with to decode it without losing the rest of my hair: A text file on my desktop containing:

/32 1/256 C 1 D
/31 1/128 C 2 D
/30 1/64 C 4 D
/29 1/32 C 8 D
/28 1/16 C 16 D
/27 1/8 C 32 D
/26 1/4 C 64 D
/25 1/2 C 128 D
/24 1 C 256 D
/23 2 C
/22 4 C
/21 8 C
/20 16 C
/19 32 C
/18 64 C
/17 128 C
/16 256 C, 1 B
/15 512 C, 2 B
/14 1024 C, 4 B
/13 2048 C, 8 B
/12 4096 C, 16 B
/11 8192 C, 32 B
/10 16384 C, 64 B
/9 32768 C, 128 B
/8 65536 C, 256 B, 1 A
/7 131072 C, 512 B, 2 A
/6 262144 C, 1024 B, 4 A
/5 524288 C, 2048 B, 8 A
/4 1048576 C, 4096 B, 16 A
/3 2097152 C, 8192 B, 32 A
/2 4194304 C, 16384 B, 64 A
/1 8388608 C, 32768 B, 128 A
/0 16777216 C, 65536 B, 256 A

So if I'm blocking 64 class C's, all I need to do is look it up, and it is /18.
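The same lookup can be computed instead of read from a text file. This is a small Python sketch of the arithmetic behind the table; the function names `block_sizes` and `prefix_for_class_c` are made up for illustration.

```python
def block_sizes(prefix):
    """Return (addresses, class_c_equivalents) for a prefix length 0-32."""
    addresses = 2 ** (32 - prefix)
    # A class C (/24) holds 256 addresses; below that size use a fraction.
    if prefix <= 24:
        class_c = str(2 ** (24 - prefix))
    else:
        class_c = f"1/{2 ** (prefix - 24)}"
    return addresses, class_c

def prefix_for_class_c(count):
    """Return the prefix length covering `count` class C blocks
    (count must be a power of two)."""
    if count < 1 or count & (count - 1):
        raise ValueError("count must be a power of two")
    return 24 - (count.bit_length() - 1)

print(prefix_for_class_c(64))  # 18
print(block_sizes(18))         # (16384, '64')
print(block_sizes(25))         # (128, '1/2')
```

Each step down in prefix length doubles the block, so 64 class C's sit six steps below /24.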

True, mistakes do happen (thankfully, we've got Bill to save the day), but I'm holding on to my remaining gray cells for much more trivial matters (the futile effort to remain married and sane).

incrediBILL

10:06 pm on Apr 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



These are going to occur in any method (i. e., Bill's 17-18 reference).

Not true if you cut & paste direct from the source.

100% satisfaction guaranteed ;)

keyplyr

7:22 am on Apr 11, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



discobot is a nutch variant. Just ban "nutch"

incrediBILL

8:31 am on Apr 11, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



See the Nutch robots.txt [lucene.apache.org] info:

Different installations of the Nutch software may specify different agent names, but all should respond to the agent name "Nutch". Thus to ban all Nutch-based crawlers from your site, place the following in your robots.txt file:

User-agent: Nutch
Disallow: /

Assuming they still use Nutch and didn't alter its behavior, that should work.
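To see what this rule means to a parser that matches on the crawler's name token, here is a sketch using Python's standard `urllib.robotparser`. It only shows how one compliant parser reads the file. Whether a given Nutch-based crawler also checks the name "Nutch" is up to that crawler, and this code cannot tell you that. The example.com URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Nutch
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A crawler that asks under the name "Nutch" is told to stay out.
nutch_allowed = parser.can_fetch("Nutch", "http://example.com/page.html")

# A crawler that only asks under its own name is not covered by the rule.
discobot_allowed = parser.can_fetch("discobot", "http://example.com/page.html")

print(nutch_allowed, discobot_allowed)
```

In other words, the rule only helps if the crawler looks itself up under "Nutch" as well as under its own agent name.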

keyplyr

9:18 am on Apr 11, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Actually Bill, the variants do not obey this:

User-agent: Nutch
Disallow: /

I tried that at first a couple years ago. Despite what it says at the Nutch web site, the only one that will follow that disallow directive is Nutch itself.

This is one of the things that irritate me about all these start-up bots. We must disallow each one by name, and half of them do not obey it even then. Been there, done that.

So when I say "just ban 'Nutch'" I am referring to alternate methods, i.e. mod_rewrite, mod_setenvif, etc.

incrediBILL

4:16 pm on Apr 11, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



So when I say "just ban 'Nutch'" I am referring to alternate methods, i.e. mod_rewrite, mod_setenvif, etc.

Ah, so you can't "just ban Nutch" then, you have to ban the individual names, what a joke.

See, I wouldn't know about this problem since I whitelist and they're all banned by default so this is good to know for giving advice to the blacklisters.

[edited by: incrediBILL at 4:17 pm (utc) on April 11, 2008]

wilderness

5:38 pm on Apr 11, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Bill,
There are some very, very old threads on this.
One of the creators/administrators of Nutch came into this forum at one time and had some discussions with Jim (as well as others).

In the end, nobody (except of course Nutch) was amused with the explanations offered.

Don

incrediBILL

7:19 pm on Apr 11, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Don,

I vaguely remember some of those old threads, but the Nutch code base gets updated all the time and evolves, so it would be nice to think that they fix some of these things.

That's why I went to their site to see what they had to say about blocking Nutch in general since things change.

However, with that said, some people still run old Nutch versions. Even if the new versions honored the generic Nutch label in robots.txt, that wouldn't change the behavior of old Nutch implementations already in use, which could cause conflicting views about whether Nutch does or doesn't do certain things.

It's just a mess no matter what.

keyplyr

10:29 pm on Apr 11, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Ah, so you can't "just ban Nutch" then, you have to ban the individual names - incrediBILL

That's not what I said.

You can't just "disallow in robots.txt" them all with "nutch."

Since robots.txt does not "ban" per se, banning refers to alternative methods. 99% of these clones keep "nutch" in the UA string, so using a firewall, http config, htaccess, etc. is effective for "nutch".

Sorry to say, these methods also stop mother Nutch unless you add some other allow filters.
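The substring test that a mod_rewrite or mod_setenvif rule on "nutch" performs can be sketched in Python. This is only an illustration of the matching logic, not server configuration; the function name `is_nutch_variant` is made up, and the two sample strings are the ones quoted in the first post of this thread.

```python
def is_nutch_variant(user_agent):
    """Case-insensitive check for "nutch" anywhere in the User-Agent header."""
    return "nutch" in user_agent.lower()

old_ua = "disco/Nutch-1.0-dev"
new_ua = ("Mozilla/5.0 (compatible; discobot/1.0; "
          "+http://discoveryengine.com/discobot.html)")

print(is_nutch_variant(old_ua))  # True
print(is_nutch_variant(new_ua))  # False
```

The second result shows the limit of the approach: a clone that drops "nutch" from its UA string needs its own pattern or an IP block.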

clyde4210

12:38 pm on May 6, 2008 (gmt 0)

10+ Year Member



How about sending them an email and asking to be taken off the crawl list? That usually works, 99% of the time. They ban themselves daily from my site for trying to crawl folders I have blocked, but that doesn't bother me because my site has good security.