TekSavvy

wilderness

12:04 pm on Feb 5, 2008 (gmt 0)

This name sticks in my craw, with some nagging recollection that I no longer have the old documentation for it.
Anybody recall?

In late October the following:
206.248.137.zz - - [28/Oct/2007:05:37:13 -0500] "GET /robots.txt HTTP/1.0" 200 4514 "-" "canasasearchbot(http://canadasearch.no-ip.info)"

Yesterday an unidentified crawl of 40 pages:
206.248.167.zzz - - [04/Feb/2008:02:38:55 -0600] "GET / HTTP/1.0" 200 6774 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 1.1.4322)"

Now about once an hour:
64.41.145.zzz - - [04/Feb/2008:20:26:20 -0600] "GET / HTTP/1.1" 301 313 "-" "Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.2a) Gecko/20021021"
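
For anyone who wants to tally hits like these instead of eyeballing them, here is a minimal Python sketch that splits a combined-format log line into fields and counts requests per address. The names LOG_RE, parse_line and hits_per_ip are made up for illustration, and it assumes the whole entry sits on one line, as Apache writes it.

```python
import re
from collections import Counter

# Combined log format: ip ident user [time] "request" status size "referer" "agent"
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_RE.match(line.strip())
    return match.groupdict() if match else None


def hits_per_ip(lines):
    """Count requests per client address, skipping lines that do not parse."""
    counts = Counter()
    for line in lines:
        fields = parse_line(line)
        if fields:
            counts[fields["ip"]] += 1
    return counts


sample = (
    '64.41.145.zzz - - [04/Feb/2008:20:26:20 -0600] "GET / HTTP/1.1" 301 313 "-" '
    '"Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.2a) Gecko/20021021"'
)
print(parse_line(sample))
```

The masked last octet (zzz) is kept as posted; a real log would have a number there, and the pattern accepts either.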

NotNeYzer

8:46 pm on Feb 28, 2008 (gmt 0)

They hit me for 0.5 Gig overnight.

Savvis controls 64.41.128.0 - 64.41.255.255.

Attributor Corporation has the offending IP Range 64.41.145.0 - 64.41.145.255

They are a content management company that looks for duplicate content for their paying customers. And they provide this service by running huge scrapes. Yet another company raking in the bucks off my bandwidth.

My solution:

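# Match only addresses in 64.41.145.0 - 64.41.145.255 (anchored so that e.g. 164.41.145.x is not caught)
RewriteCond %{REMOTE_ADDR} ^64\.41\.145\.
RewriteRule ^.*$ IP_Range_Banned.php [L]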

[edited by: NotNeYzer at 9:07 pm (utc) on Feb. 28, 2008]
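
A side note on matching that range: a pattern such as 64\.41\.145 with no anchors matches anywhere in the address string, so it can also catch unrelated addresses. The Python sketch below compares an unanchored pattern, an anchored one, and a plain CIDR check for 64.41.145.0/24. The names BLOCKED, LOOSE, ANCHORED and is_blocked are hypothetical, and this only illustrates the matching logic, not how Apache evaluates the rule.

```python
import ipaddress
import re

# 64.41.145.0 - 64.41.145.255 expressed as a CIDR block
BLOCKED = ipaddress.ip_network("64.41.145.0/24")

# Unanchored: matches the digits anywhere in the string
LOOSE = re.compile(r"64\.41\.145")
# Anchored at the start and closed with a dot: matches only the intended /24
ANCHORED = re.compile(r"^64\.41\.145\.")


def is_blocked(addr):
    """True if addr is a valid IP address inside the blocked range."""
    try:
        return ipaddress.ip_address(addr) in BLOCKED
    except ValueError:
        return False  # not a parseable address


for addr in ("64.41.145.7", "164.41.145.7", "10.64.41.145"):
    print(addr, bool(LOOSE.search(addr)), bool(ANCHORED.search(addr)), is_blocked(addr))
```

Only the first address is inside the range, yet the unanchored pattern matches all three.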

thetrasher

3:28 am on Mar 1, 2008 (gmt 0)

Yesterday an unidentified crawl of 40 pages:
206.248.167.zzz - - [04/Feb/2008:02:38:55 -0600] "GET / HTTP/1.0" 200 6774 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 1.1.4322)"

Let me guess! IP is in the list "someone-is-scraping-me.html"?

wilderness

4:21 am on Mar 1, 2008 (gmt 0)

Actually, if it's the same Canadian bot that visited in October, then both my websites contain widgets of interest to Canadians.

Unfortunately, when the bot changes to a standard browser UA and does not read robots.txt, any influence their audience might have on my web pages is simply negated by that lack of protocol.

Don

blend27

11:37 am on Mar 5, 2008 (gmt 0)

-- Let me guess! IP is in the list "someone-is-scraping-me.html"? --

The IPs from that list are a botnet I've been trying to deal with. Lots of those IPs are from known colos and hosting ranges.

Here is something that is the same for every request:

REQUEST HEADERS:

HTTP Request item: Value
------------------------
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*
Connection: Close
Content-Length: 0
UA-CPU: x86
Host: www.domaininquestion.com
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 1.1.4322)
Accept-Language: en-us

------------------------
request_method: GET
server_protocol: HTTP/1.0
http_content:
------------------------

The more I look at it, the less I understand the purpose of this. It looks like referrer spam, a scrape, and a botnet all at the same time. All the sites that have been visited are linked to each other in some way. A lot of times, this bot will choke on BASEHREF.
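
Since the headers are the same on every request, they can serve as a fingerprint. Here is a minimal Python sketch of that idea, assuming the request arrives as a method, a protocol string, and a plain dict of headers. FINGERPRINT and matches_fingerprint are hypothetical names, and which headers to require is a judgment call, not something taken from the dump itself.

```python
# Header values copied from the dump above; names are compared case-insensitively.
FINGERPRINT = {
    "connection": "close",
    "content-length": "0",
    "ua-cpu": "x86",
    "accept-language": "en-us",
    "user-agent": (
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; "
        ".NET CLR 2.0.50727; .NET CLR 1.1.4322)"
    ),
}


def matches_fingerprint(method, protocol, headers):
    """True if a request looks like the bot: GET over HTTP/1.0 with every listed header."""
    if method != "GET" or protocol != "HTTP/1.0":
        return False
    lowered = {name.lower(): value.strip().lower() for name, value in headers.items()}
    return all(
        lowered.get(name) == value.lower() for name, value in FINGERPRINT.items()
    )


bot_headers = {
    "Connection": "Close",
    "Content-Length": "0",
    "UA-CPU": "x86",
    "Host": "www.domaininquestion.com",
    "User-Agent": FINGERPRINT["user-agent"],
    "Accept-Language": "en-us",
}
print(matches_fingerprint("GET", "HTTP/1.0", bot_headers))
```

A real MSIE 7 visitor sends the same User-Agent, so the combination of HTTP/1.0, Connection: Close and Content-Length: 0 on a GET is what does the work here; treat a match as a hint rather than proof.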

wilderness

2:09 pm on Mar 5, 2008 (gmt 0)

-- Let me guess! IP is in the list "someone-is-scraping-me.html"? --

This is a pretty broad term from a webmaster's perspective.
There are plenty of reasons that a bot may be grabbing materials, most of which are unknown to the webmaster of the site(s) being spidered.
I get spidering from many universities, and "they" would have us believe they are entitled to an "open door" merely because "they are who they are", rather than providing a link to a detailed explanation of both their particular project and their "intent" for the materials.

The IPs from that list are a botnet I've been trying to deal with. Lots of those IPs are from known colos and hosting ranges.

I deny all colos and selling-hosts, as spidering traffic from unidentified websites is NOT beneficial to my sites.

As an aside, I've long believed that the mere mention of a colo's or host's name in this forum merely provides them with free advertising.

A lot of times, this bot will choke on BASEHREF.

My sites are rather simple (KISS) and I had to do a google on "BASEHREF".
I've used relative links for an eternity and many bots choke on these as well (to me, this choking is a sure sign that the particular bot needs adding to the denial list. The same goes for bots that get 404s due to case errors).

Could somebody possibly explain to me any benefit of, or difference between, "BASEHREF" and "relative"?

Don

incrediBILL

11:06 pm on Mar 19, 2008 (gmt 0)

Don,

The only time I used BASEHREF was on a shared SSL cert where the files were in a different path than the default root path. Kind of hairy, but typically not needed.

For instance, the main URI is [mydomain.com...], the shared SSL server is [ssl2.myhost.com...], and my site on the shared SSL server is then [ssl2.myhost.com...]. All the relative files therefore tend to default to [ssl2.myhost.com...] instead of the full path, so you specify the exact path in the BASEHREF to fix the problem.
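
To make the difference between a plain relative link and one resolved through BASEHREF concrete, here is a small Python sketch using urljoin from the standard library, which follows the same resolution rules a browser or well-behaved bot applies. The URLs are placeholders (the real ones are elided above), and resolve is a hypothetical helper name.

```python
from urllib.parse import urljoin


def resolve(page_url, link, base_href=None):
    """Resolve a link the way a client would, honoring an optional <base href>."""
    # Without a base href, links resolve against the URL of the page itself.
    base = urljoin(page_url, base_href) if base_href else page_url
    return urljoin(base, link)


# Placeholder addresses, not the real hosts from the thread.
page = "https://ssl2.example.test/checkout.html"
site_path = "https://ssl2.example.test/mysite/"

print(resolve(page, "images/logo.gif"))
print(resolve(page, "images/logo.gif", base_href=site_path))
print(resolve(page, "/images/logo.gif", base_href=site_path))
```

The first call resolves against the page's own location, the second against the path named in the base href. The third shows that a link starting with a slash still lands on the host root, so a base href only helps links written without the leading slash.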

Hope that clears it up.

wilderness

12:20 pm on Mar 20, 2008 (gmt 0)

Many thanks Bill.
