Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
TekSavvy
wilderness




msg:3566600
 12:04 pm on Feb 5, 2008 (gmt 0)

This name sticks in my craw: I have a nagging recollection of it, but no OLD documentation on it.
Anybody recall?

In late October the following:
206.248.137.zz - - [28/Oct/2007:05:37:13 -0500] "GET /robots.txt HTTP/1.0" 200 4514 "-" "canasasearchbot(http://canadasearch.no-ip.info)"

Yesterday an unidentified crawl of 40 pages:
206.248.167.zzz - - [04/Feb/2008:02:38:55 -0600] "GET / HTTP/1.0" 200 6774 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 1.1.4322)"

Now about once an hour:
64.41.145.zzz - - [04/Feb/2008:20:26:20 -0600] "GET / HTTP/1.1" 301 313 "-" "Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.2a) Gecko/20021021"

 

NotNeYzer




msg:3587467
 8:46 pm on Feb 28, 2008 (gmt 0)

They hit me for 0.5 Gig overnight.

Savvis controls 64.41.128.0 - 64.41.255.255.

Attributor Corporation has the offending IP Range 64.41.145.0 - 64.41.145.255

They are a content management company that looks for duplicate content for their paying customers. And they provide this service by running huge scrapes. Yet another company raking in the bucks off my bandwidth.

My solution:

# Anchored so that only 64.41.145.x matches (not, say, 164.41.145.x)
RewriteCond %{REMOTE_ADDR} ^64\.41\.145\.
RewriteRule ^.*$ IP_Range_Banned.php [L]
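
A range check like this is easy to get subtly wrong with a regex, since a pattern with no ^ anchor matches anywhere in the address. As a rough cross-check, here is a minimal Python sketch that tests membership with the standard ipaddress module instead. The names BANNED_NET, SAVVIS_NET, is_banned and unanchored_match are made up for illustration; the /24 and /17 are just the two ranges quoted above written in CIDR form.

```python
import ipaddress
import re

# 64.41.145.0 - 64.41.145.255 from the post, written as a /24.
BANNED_NET = ipaddress.ip_network("64.41.145.0/24")
# 64.41.128.0 - 64.41.255.255 from the post, written as a /17.
SAVVIS_NET = ipaddress.ip_network("64.41.128.0/17")


def is_banned(addr: str) -> bool:
    """Return True if addr falls inside the banned /24."""
    return ipaddress.ip_address(addr) in BANNED_NET


# An unanchored pattern, kept only to show what it over-matches.
UNANCHORED = re.compile(r"64\.41\.145")


def unanchored_match(addr: str) -> bool:
    """Return True if the unanchored pattern occurs anywhere in addr."""
    return UNANCHORED.search(addr) is not None


if __name__ == "__main__":
    for addr in ("64.41.145.17", "64.41.146.1", "164.41.145.9"):
        print(addr, is_banned(addr), unanchored_match(addr))
```

The last address is outside the range but still contains the text 64.41.145, which is why the regex form wants the leading ^ and a trailing \. on the third octet.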

[edited by: NotNeYzer at 9:07 pm (utc) on Feb. 28, 2008]

thetrasher




msg:3588590
 3:28 am on Mar 1, 2008 (gmt 0)

Yesterday an unidentified crawl of 40 pages:
206.248.167.zzz - - [04/Feb/2008:02:38:55 -0600] "GET / HTTP/1.0" 200 6774 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 1.1.4322)"
Let me guess! IP is in the list "someone-is-scraping-me.html"?
wilderness




msg:3588614
 4:21 am on Mar 1, 2008 (gmt 0)

Actually, if it's the same Canadian bot that visited in October, then both my websites contain widgets of interest to Canadians.

Unfortunately, when the bot changes to a standard browser UA and does not read robots.txt, any influence their audience might have on my web pages is simply negated by that lack of protocol.
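
For anyone who wants to spot this pattern in their own logs, here is a rough Python sketch that lists clients which fetched a number of pages without ever asking for /robots.txt. It assumes the combined log format seen in the lines quoted earlier in this thread; LOG_RE, clients_skipping_robots and the min_pages threshold are illustrative names and values, not anything from a particular tool.

```python
import re
from collections import defaultdict

# Combined log format: ip ident user [time] "request" status size "referer" "ua"
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)


def clients_skipping_robots(lines, min_pages=10):
    """Return {ip: page_count} for clients that requested at least
    min_pages URLs and never requested /robots.txt."""
    pages = defaultdict(int)
    asked_robots = set()
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip lines that are not in the expected format
        ip = m["ip"]
        path = m["path"].split("?", 1)[0]  # ignore any query string
        if path == "/robots.txt":
            asked_robots.add(ip)
        else:
            pages[ip] += 1
    return {
        ip: count
        for ip, count in pages.items()
        if count >= min_pages and ip not in asked_robots
    }
```

Feed it the lines of an access log (for example an open file object) and review the result by hand; ordinary visitors never request robots.txt either, so the page count and the UA still need a human look.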

Don

blend27




msg:3591808
 11:37 am on Mar 5, 2008 (gmt 0)

-- Let me guess! IP is in the list "someone-is-scraping-me.html"? --

The IPs from that list are a botnet I've been trying to deal with. Lots of those IPs are from known colos and hosting ranges.

Here is something that is the same for every request:

REQUEST HEADERS:

HTTP Request item: Value
------------------------
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*
Connection: Close
Content-Length: 0
UA-CPU: x86
Host: www.domaininquestion.com
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 1.1.4322)
Accept-Language: en-us

------------------------
request_method: GET
server_protocol: HTTP/1.0
http_content:
------------------------

The more I look at it, the more I don't understand the purpose of this. It looks like "referrer spam", a scrape and a botnet at the same time. All the sites that have been visited are linked to each other in some way. A lot of times, this bot will choke on BASEHREF.
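
Since those headers are identical on every request, one way to act on them is a simple fingerprint check. The following is only a sketch in Python: the header values are copied from the dump above, but treating them as a signature is an assumption, and FINGERPRINT and matches_fingerprint are illustrative names.

```python
# Header values copied from the dump above. Treating this combination
# as a bot signature is an assumption, not a confirmed rule.
FINGERPRINT = {
    "connection": "close",
    "content-length": "0",
    "ua-cpu": "x86",
    "accept-language": "en-us",
    "user-agent": (
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; "
        ".NET CLR 2.0.50727; .NET CLR 1.1.4322)"
    ),
}


def matches_fingerprint(method, protocol, headers):
    """Return True if a request looks like the one in the dump:
    an HTTP/1.0 GET carrying every fingerprint header value."""
    if method != "GET" or protocol != "HTTP/1.0":
        return False
    # Header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    for name, expected in FINGERPRINT.items():
        actual = lowered.get(name)
        if actual is None or actual.lower() != expected.lower():
            return False
    return True
```

The odd combination is a GET over HTTP/1.0 that still sends Content-Length: 0 together with an MSIE 7 user agent. A match is a hint for closer inspection rather than proof.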

wilderness




msg:3591894
 2:09 pm on Mar 5, 2008 (gmt 0)

-- Let me guess! IP is in the list "someone-is-scraping-me.html"? --

This is a pretty broad term from a webmaster's perspective.
There are plenty of reasons that a bot may be grabbing materials, most of which are unknown to the webmaster of the site(s) being spidered.
I get spidering from many universities, and "they" would have us believe they are entitled to an "open door" merely because "they are who they are", rather than providing a link to a detailed explanation of both their particular project and their "intent" for the materials.

The IPs from that list are a botnet I've been trying to deal with. Lots of those IPs are from known colos and hosting ranges.

I deny all colos and selling-hosts, as spidering traffic from unidentified websites is NOT beneficial to my sites.

As an aside, I've long believed that the mere mention of a colo's or host's name in this forum merely provides them with free advertising.

A lot of times, this bot will choke on BASEHREF.

My sites are rather simple (KISS) and I had to do a google on "BASEHREF".
I've used relative links for an eternity and many bots choke on these as well (to me, this choking is a sure sign that the particular bot needs adding to the denial list. The same goes for bots that get 404s due to case errors).

Could somebody possibly explain to me any benefit of, or difference between, "BASEHREF" and "relative"?

Don

incrediBILL




msg:3605850
 11:06 pm on Mar 19, 2008 (gmt 0)

Don,

The only time I used BASEHREF was on a shared SSL cert, where the files were in a different path than the default root path. Kind of hairy, but typically not needed.

For instance, the main URI is [mydomain.com...] and the shared SSL server is [ssl2.myhost.com...], so my site on the shared SSL is [ssl2.myhost.com...]. All the relative files therefore tend to default to [ssl2.myhost.com...] instead of the full path, and you specify the exact path in the BASEHREF to fix the problem.
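
To make the difference concrete, here is a small Python sketch of how a relative link resolves with and without a base href, using urljoin from the standard library. The ssl.example.com host and the ~mysite path are hypothetical stand-ins, not the elided URLs above, and resolve_link is just an illustrative helper.

```python
from urllib.parse import urljoin


def resolve_link(page_url, href, base_href=None):
    """Resolve href roughly the way a browser would: against the page
    URL, or against the <base href> value if the page declares one."""
    base = urljoin(page_url, base_href) if base_href else page_url
    return urljoin(base, href)


if __name__ == "__main__":
    # Hypothetical page served from the root of a shared SSL host.
    page = "https://ssl.example.com/index.html"

    # Without a base href the relative link lands in the host root.
    print(resolve_link(page, "images/logo.gif"))

    # With a base href it lands under the site's own directory.
    print(resolve_link(page, "images/logo.gif",
                       "https://ssl.example.com/~mysite/"))
```

So "relative" only says how the link is written, while the base href changes what it is relative to. A bot that ignores the base href ends up requesting the first form and collecting 404s.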

Hope that clears it up.

wilderness




msg:3606279
 12:20 pm on Mar 20, 2008 (gmt 0)

Many thanks Bill.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved