Forum Moderators: Ocean10000 & incrediBILL

TekSavvy

     
12:04 pm on Feb 5, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5408
votes: 2


This name sticks in my craw, with some nagging recollection that I don't have OLD documentation on it.
Anybody recall?

In late October the following:
206.248.137.zz - - [28/Oct/2007:05:37:13 -0500] "GET /robots.txt HTTP/1.0"
200 4514 "-" "canasasearchbot(http://canadasearch.no-ip.info)"

Yesterday an unidentified crawl of 40 pages:
206.248.167.zzz - - [04/Feb/2008:02:38:55 -0600] "GET / HTTP/1.0" 200 6774 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 1.1.4322)"

Now about once an hour:
64.41.145.zzz - - [04/Feb/2008:20:26:20 -0600] "GET / HTTP/1.1" 301 313 "-"
"Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.2a) Gecko/20021021"

8:46 pm on Feb 28, 2008 (gmt 0)

New User

5+ Year Member

joined:Mar 28, 2006
posts: 11
votes: 0


They hit me for 0.5 Gig overnight.

Savvis controls 64.41.128.0 - 64.41.255.255.

Attributor Corporation has the offending IP Range 64.41.145.0 - 64.41.145.255

They are a content management company that looks for duplicate content for their paying customers. And they provide this service by running huge scrapes. Yet another company raking in the bucks off my bandwidth.

My solution:

# Anchored so that only 64.41.145.0 - 64.41.145.255 matches (not, say, 164.41.145.x)
RewriteCond %{REMOTE_ADDR} ^64\.41\.145\.
RewriteRule ^.*$ IP_Range_Banned.php [L]

[edited by: NotNeYzer at 9:07 pm (utc) on Feb. 28, 2008]
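
For anyone who would rather check the range in a script than in a rewrite rule, here is a minimal sketch using Python's standard `ipaddress` module. The two ranges are the ones quoted in the post above; the CIDR notation, the constant names, and the `is_banned` helper are illustrative assumptions, not anything the poster used.

```python
import ipaddress

# Ranges quoted in the post above, rewritten as CIDR blocks (an assumption of this sketch).
SAVVIS_BLOCK = ipaddress.ip_network("64.41.128.0/17")      # 64.41.128.0 - 64.41.255.255
ATTRIBUTOR_BLOCK = ipaddress.ip_network("64.41.145.0/24")  # 64.41.145.0 - 64.41.145.255


def is_banned(remote_addr: str) -> bool:
    """Return True if remote_addr lies inside the /24 named in the post."""
    return ipaddress.ip_address(remote_addr) in ATTRIBUTOR_BLOCK
```

Unlike an unanchored regex on the dotted string, this compares real addresses, so something like 164.41.145.7 is not caught by accident.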

3:28 am on Mar 1, 2008 (gmt 0)

Junior Member

10+ Year Member

joined:June 25, 2005
posts:179
votes: 1


Yesterday an unidentified crawl of 40 pages:
206.248.167.zzz - - [04/Feb/2008:02:38:55 -0600] "GET / HTTP/1.0" 200 6774 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 1.1.4322)"
Let me guess! IP is in the list "someone-is-scraping-me.html"?
4:21 am on Mar 1, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5408
votes: 2


Actually, if it's the same Canadian bot that visited in October,
then both my websites contain widgets of interest to Canadians.

Unfortunately, when the bot changes to a standard browser UA and does not read robots.txt,
any influence their audience might have in my web pages is simply negated by that lack of protocol.

Don

11:37 am on Mar 5, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2004
posts:1667
votes: 36


-- Let me guess! IP is in the list "someone-is-scraping-me.html"? --

The IPs from that list are a botnet I've been trying to deal with. Lots of those IPs are from known COLOs and Hosting Ranges.

Here is something that is the same for every request:

REQUEST HEADERS:

HTTP Request item: Value
------------------------
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*
Connection: Close
Content-Length: 0
UA-CPU: x86
Host: www.domaininquestion.com
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 1.1.4322)
Accept-Language: en-us

------------------------
request_method: GET
server_protocol: HTTP/1.0
http_content:
------------------------

The more I look at it, the more I don't understand the purpose of this. It looks like "referrer spam", a scrape, and a botnet all at the same time. All the sites that have been visited are linked to each other in some way. A lot of times, this bot will choke on BASEHREF.
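
Since the headers are reported to be the same on every request, one way to flag this traffic is to compare incoming requests against that fingerprint. The following is only a rough sketch: the header values are copied from the dump above, but the choice of which headers to match, the `FINGERPRINT` name, and the `matches_fingerprint` helper are assumptions made for illustration.

```python
# Header values copied from the dump above, lowercased for comparison.
# Which headers to include in the fingerprint is an assumption of this sketch.
FINGERPRINT = {
    "connection": "close",
    "content-length": "0",  # a Content-Length on a body-less GET is unusual
    "ua-cpu": "x86",
    "accept-language": "en-us",
    "user-agent": (
        "mozilla/4.0 (compatible; msie 7.0; windows nt 5.1; "
        ".net clr 2.0.50727; .net clr 1.1.4322)"
    ),
}


def matches_fingerprint(method: str, protocol: str, headers: dict) -> bool:
    """Return True if the request looks like the one in the dump above."""
    if method != "GET" or protocol != "HTTP/1.0":
        return False
    # Header names are case-insensitive; values are compared case-insensitively too.
    normalized = {k.lower(): v.strip().lower() for k, v in headers.items()}
    return all(normalized.get(name) == value for name, value in FINGERPRINT.items())
```

A real MSIE 7 visitor would normally speak HTTP/1.1 and would not send `Content-Length: 0` on a GET, so the combination is a reasonable tell, though a match is a hint and not proof.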

2:09 pm on Mar 5, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5408
votes: 2


-- Let me guess! IP is in the list "someone-is-scraping-me.html"? --

This is a pretty broad term from a webmaster's perspective.
There are plenty of reasons that a bot may be grabbing materials, most of which are unknown to the webmaster of the site(s) being spidered.
I get spidering from many universities, and "they" would have us believe they are entitled to an "open door" merely because "they are who they are", rather than providing a link to a detailed explanation of both their particular project and their "intent" for the materials.

The IPs from that list are a botnet I've been trying to deal with. Lots of those IPs are from known COLOs and Hosting Ranges.

I deny all colos and selling-hosts, as spidering traffic from unidentified websites is NOT beneficial to my sites.

As an aside: I've long believed that the mere mention of a colo's or host's name in this forum simply provides them with free advertising.

A lot of times, this bot will choke on BASEHREF.

My sites are rather simple (KISS) and I had to do a google on "BASEHREF".
I've used relative links for an eternity and many bots choke on these as well (to me, this choking is a sure sign that the particular bot needs adding to the denial list. Same goes for bots that get 404s due to case errors).

Could somebody possibly explain to me any benefit of, or difference between, "BASEHREF" and "relative"?

Don

11:06 pm on Mar 19, 2008 (gmt 0)

Administrator from US 

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14624
votes: 88


Don,

The only time I used BASEHREF was on a shared SSL cert where the files were in a different path than the default root path. Kind of hairy, but typically not needed.

For instance, the main URI is [mydomain.com...] and the shared SSL server is [ssl2.myhost.com...], so my site on the shared SSL is [ssl2.myhost.com...]. All the relative files then tend to default to [ssl2.myhost.com...] instead of the full path, and you specify the exact path in the BASEHREF to fix the problem.

Hope that clears it up.
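
To make the difference concrete, here is a small sketch of how a relative link resolves with and without a base href, using `urljoin` from Python's standard library. The URLs in the post above are elided, so the `ssl.example.com` addresses, the `/mysite/` path, and the `resolve` helper are all hypothetical stand-ins.

```python
from urllib.parse import urljoin

# Hypothetical URLs; the real ones are elided in the post above.
PAGE_URL = "https://ssl.example.com/index.html"  # address the page is served from
BASE_HREF = "https://ssl.example.com/mysite/"    # value you would put in <base href="...">


def resolve(link, base_href=None):
    """Resolve a link the way a browser would: against the base href if one
    is declared, otherwise against the URL of the page itself."""
    return urljoin(base_href or PAGE_URL, link)


without_base = resolve("images/logo.gif")
with_base = resolve("images/logo.gif", BASE_HREF)
```

Here `without_base` comes out as `https://ssl.example.com/images/logo.gif`, while `with_base` comes out as `https://ssl.example.com/mysite/images/logo.gif`. The relative link itself never changes; the base href only changes what it is resolved against, which is also why a bot that ignores the base tag ends up requesting the wrong paths.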

12:20 pm on Mar 20, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5408
votes: 2


Many thanks Bill.