Forum Moderators: Ocean10000 & incrediBILL

TekSavvy

   
12:04 pm on Feb 5, 2008 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



This name sticks in my craw, with some nagging recollection that I don't have OLD documentation on it.
Anybody recall?

In late October the following:
206.248.137.zz - - [28/Oct/2007:05:37:13 -0500] "GET /robots.txt HTTP/1.0" 200 4514 "-" "canasasearchbot(http://canadasearch.no-ip.info)"

Yesterday an unidentified crawl of 40 pages:
206.248.167.zzz - - [04/Feb/2008:02:38:55 -0600] "GET / HTTP/1.0" 200 6774 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 1.1.4322)"

Now about once an hour:
64.41.145.zzz - - [04/Feb/2008:20:26:20 -0600] "GET / HTTP/1.1" 301 313 "-" "Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.2a) Gecko/20021021"
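
For anyone wanting to tally visits like these from a raw access log, here is a minimal Python sketch. It assumes the log is in Apache's combined format with each entry on a single line; the names `parse_line` and `hits_by_agent` are made up for illustration and are not from any tool mentioned in this thread.

```python
import re
from collections import Counter

# Combined Log Format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line: str):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LINE_RE.match(line)
    return match.groupdict() if match else None

def hits_by_agent(lines):
    """Count requests per user-agent string, skipping lines that do not parse."""
    counts = Counter()
    for line in lines:
        fields = parse_line(line)
        if fields:
            counts[fields["agent"]] += 1
    return counts
```

Feeding `hits_by_agent` the lines of a day's log gives a quick view of which user agents are doing the crawling, including ones that pose as ordinary browsers.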

8:46 pm on Feb 28, 2008 (gmt 0)

5+ Year Member



They hit me for 0.5 Gig overnight.

Savvis controls 64.41.128.0 - 64.41.255.255.

Attributor Corporation has the offending IP Range 64.41.145.0 - 64.41.145.255

They are a content management company that looks for duplicate content for their paying customers. And they provide this service by running huge scrapes. Yet another company raking in the bucks off my bandwidth.

My solution:

# Anchored so that only 64.41.145.0 - 64.41.145.255 matches (not e.g. 164.41.145.x)
RewriteCond %{REMOTE_ADDR} ^64\.41\.145\.
RewriteRule ^.*$ IP_Range_Banned.php [L]
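
To sanity-check a range before banning it, a small Python sketch like the following may help. It is separate from the .htaccess rule: it tests whether an address falls inside 64.41.145.0 - 64.41.145.255 and shows how an unanchored pattern can also catch unrelated addresses. The helper name `in_banned_range` and the sample addresses are made up for illustration.

```python
import ipaddress
import re

# The /24 that covers 64.41.145.0 - 64.41.145.255 (the range quoted in this post).
BANNED = ipaddress.ip_network("64.41.145.0/24")

def in_banned_range(addr: str) -> bool:
    """Return True if addr lies inside the banned /24 (hypothetical helper)."""
    return ipaddress.ip_address(addr) in BANNED

# An unanchored pattern matches any address that merely contains these digits.
unanchored = re.compile(r"64\.41\.145")
# Anchoring at the start and ending on a dot restricts it to the intended range.
anchored = re.compile(r"^64\.41\.145\.")

if __name__ == "__main__":
    for addr in ("64.41.145.7", "164.41.145.7", "64.41.146.7"):
        print(addr, in_banned_range(addr),
              bool(unanchored.search(addr)), bool(anchored.search(addr)))
```

The second sample address is the interesting one: it is outside the range, yet the unanchored pattern still matches it.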

[edited by: NotNeYzer at 9:07 pm (utc) on Feb. 28, 2008]

3:28 am on Mar 1, 2008 (gmt 0)

5+ Year Member



Yesterday an unidentified crawl of 40 pages:
206.248.167.zzz - - [04/Feb/2008:02:38:55 -0600] "GET / HTTP/1.0" 200 6774 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 1.1.4322)"
Let me guess! IP is in the list "someone-is-scraping-me.html"?

4:21 am on Mar 1, 2008 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Actually, if it's the same Canadian bot that visited in October, then both my websites contain widgets of interest to Canadians.

Unfortunately, when the bot changes to a standard browser UA and does not read robots.txt, any influence their audience might have on my web pages is simply negated by that lack of protocol.

Don

11:37 am on Mar 5, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



-- Let me guess! IP is in the list "someone-is-scraping-me.html"? --

The IPs from that list are a botnet I've been trying to deal with. Lots of those IPs are from known colos and hosting ranges.

Here is something that is the same for every request:

REQUEST HEADERS:

HTTP Request item: Value
------------------------
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*
Connection: Close
Content-Length: 0
UA-CPU: x86
Host: www.domaininquestion.com
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 1.1.4322)
Accept-Language: en-us

------------------------
request_method: GET
server_protocol: HTTP/1.0
http_content:
------------------------

The more I look at it, the less I understand the purpose of this. It looks like "referrer spam", a scrape and a botnet at the same time. All the sites that have been visited are linked to each other in some way. A lot of times, this bot will choke on BASEHREF.
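
Since every request carries the same set of headers, one way to spot them is to compare incoming requests against that dump. The Python sketch below does this. Treating these particular values as the bot's signature is an assumption drawn only from the dump in this post, and the names `FINGERPRINT` and `matches_fingerprint` are made up for illustration.

```python
# Header values copied from the dump in this post. Using them as a signature
# is an assumption, not a confirmed identification of the bot.
FINGERPRINT = {
    "user-agent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; "
                  ".NET CLR 2.0.50727; .NET CLR 1.1.4322)",
    "connection": "Close",
    "content-length": "0",
    "ua-cpu": "x86",
    "accept-language": "en-us",
}

def matches_fingerprint(method: str, protocol: str, headers: dict) -> bool:
    """True when a request looks like the dump: GET over HTTP/1.0, same headers."""
    if method.upper() != "GET" or protocol.upper() != "HTTP/1.0":
        return False
    # Header names are case-insensitive; values are compared case-insensitively too.
    seen = {name.lower(): value.strip().lower() for name, value in headers.items()}
    return all(seen.get(name) == expected.lower()
               for name, expected in FINGERPRINT.items())
```

A match is only a hint. A real visitor could send the same headers, so it seems safer to log matches and review them alongside the IP ranges than to block on this alone.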

2:09 pm on Mar 5, 2008 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



-- Let me guess! IP is in the list "someone-is-scraping-me.html"? --

This is a pretty broad term from a webmaster's perspective.
There are plenty of reasons that a bot may be grabbing materials, most of which are unknown to the webmaster of the site(s) being spidered.
I get spidering from many universities, and "they" would have us believe they are entitled to an "open door" merely because "they are who they are", rather than providing a link to a detailed explanation of both their particular project and their "intent" for the materials.

-- The IPs from that list are a botnet I've been trying to deal with. Lots of those IPs are from known colos and hosting ranges. --

I deny all colos and selling-hosts, as spidering traffic from unidentified websites is NOT beneficial to my sites.

As an aside, I've long believed that the mere mention of a colo's or host's name in this forum merely provides them with free advertising.

-- A lot of times, this bot will choke on BASEHREF. --

My sites are rather simple (KISS) and I had to do a google on "BASEHREF".
I've used relative links for an eternity and many bots choke on these as well. To me, this choking is a sure sign that the particular bot needs adding to the denial list. The same goes for bots that get 404s due to case errors.

Could somebody possibly explain to me any benefit or difference between "BASEHREF" and "relative"?

Don

11:06 pm on Mar 19, 2008 (gmt 0)

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Don,

The only time I used BASEHREF was on a shared SSL cert where the files were in a different path than the default root path. Kind of hairy, but typically not needed.

For instance, the main URI is [mydomain.com...], the shared SSL server is [ssl2.myhost.com...], and my site on the shared SSL server is then [ssl2.myhost.com...]. All the relative files therefore tend to default to [ssl2.myhost.com...] instead of the full path, so you specify the exact path in the BASEHREF to fix the problem.
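
To make the difference concrete, here is a small Python sketch of how a relative link resolves with and without a base href. The URLs are hypothetical stand-ins, since the real ones are elided in this post, and `urljoin` is used only to imitate the resolution a browser performs.

```python
from urllib.parse import urljoin

# Hypothetical URLs for illustration; the real ones are elided in this post.
page_url = "https://ssl.example-host.com/checkout.html"  # page as seen on the shared SSL host
base_href = "https://ssl.example-host.com/~mysite/"      # value placed in <base href="...">

relative_link = "images/logo.gif"

# Without a base element, the link resolves against the page URL.
without_base = urljoin(page_url, relative_link)
# With a base element, the link resolves against the base href instead.
with_base = urljoin(base_href, relative_link)

print(without_base)  # https://ssl.example-host.com/images/logo.gif
print(with_base)     # https://ssl.example-host.com/~mysite/images/logo.gif
```

The relative link itself never changes; only the starting point it is resolved against does. A bot that ignores the base element would request the first URL and get a 404.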

Hope that clears it up.

12:20 pm on Mar 20, 2008 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Many thanks Bill.
 
