
Search Engine Spider and User Agent Identification Forum

    
New bot Java/1.5.0_06 grabs all pages
grabbed all pages from 2 different domains
privacyman

10+ Year Member



 
Msg#: 3146 posted 12:58 am on Feb 3, 2006 (gmt 0)

Managing my own domains plus several domains which are independent of my own does give an advantage for spotting new, questionable, or bad bots.

Recently I found that my site and several other (isolated and independent) domains had the same log entries for IP number and user agent. On each site, this bot grabbed every page.

My research of the IP and its group (and provider) revealed what I consider a "concealed identity": the registrar did not give owner names, and a lookup by address did not give any company or individual name. I also went to the domain name(s) associated with the IP and found a Flash page with no alternative content (I deliberately have Flash uninstalled... never use it, for many reasons).

Because of the lack of information on the owner of the IP's CIDR group (the provider of service to the bot's IP), no reverse DNS on the individual IP, and not much else, plus the fact that it grabbed all pages, I blocked the entire CIDR group plus the user agent.

The IP number was 69.85.234.27 and UserAgent was Java/1.5.0_06

For the user agent, G and other SEs showed it to be a plugin for some browsers.

The CIDR range 69.85.192.0/18 (69.85.192.0 - 69.85.255.255) belongs to slfiber.com in Alabama. A search of G by address shows Harbor Communications LLC in Mobile, AL, and since I have found that a huge amount of spam originates from southern states, I would sooner block the entire group (16k). It could be a valid new SE, but I did not submit to them and would sooner protect my site and those I manage.
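
One way to write such a block in .htaccess, as a minimal sketch only. It assumes Apache with mod_setenvif and the older Order/Allow/Deny access directives (on Apache 2.4 these need mod_access_compat), and that AllowOverride permits them. The variable name block_java is made up for illustration, and nothing here is confirmed as the setup actually used on these sites.

```apache
# Flag any request whose User-Agent starts with "Java/" (case-insensitive).
# "block_java" is an arbitrary variable name chosen for this sketch.
SetEnvIfNoCase User-Agent "^Java/" block_java

# Allow everyone by default, then deny the /18 and the flagged user agent.
Order Allow,Deny
Allow from all
Deny from 69.85.192.0/18
Deny from env=block_java
```

With Order Allow,Deny, a request that matches both an Allow and a Deny line is denied, so the two Deny lines override the blanket Allow.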

Every page grabbed from multiple independent domains is NOT right.

Just a heads-up to watch for the IP and UA.

 

thetrasher

5+ Year Member



 
Msg#: 3146 posted 4:38 pm on Feb 3, 2006 (gmt 0)

An FTP server replies to HTTP requests?! No Flash.

"M4cub3x (c) FTP Server (Version 6.5/OpenBSD) server ID."

Do you need visits from a server?

wilderness

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



 
Msg#: 3146 posted 5:26 pm on Feb 3, 2006 (gmt 0)

Almost everybody's UA deny list, and every compiled one, includes a denial of Java in all forms.

You may skip the Flash intro at their site; the page then reads:

"we deliver cutting edge services to carriers, business and goverment entities."

Not a mention of private internet services.
A sub page under "services" also offers co-location.

As a result, I agree with your decision to deny the entire range.

RewriteCond %{REMOTE_ADDR} ^69\.85\.(19[2-9]|2[0-5][0-9])\. [OR]
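
The line above ends in [OR], so it is one condition in a longer list. A minimal self-contained sketch of how it might sit in .htaccess, assuming mod_rewrite is enabled. Pairing it with a Java user-agent condition is an assumption based on the first post, not the poster's actual rule set.

```apache
RewriteEngine On
# Third octet 192-255, which covers 69.85.192.0/18
RewriteCond %{REMOTE_ADDR} ^69\.85\.(19[2-9]|2[0-5][0-9])\. [OR]
# Any user agent beginning with "Java/" (case-insensitive)
RewriteCond %{HTTP_USER_AGENT} ^Java/ [NC]
# Return 403 Forbidden when either condition matches
RewriteRule .* - [F]
```

The last condition in an [OR] chain must not carry the [OR] flag itself, otherwise it would be OR'ed with nothing and the rule would not behave as intended.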


jdMorgan

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



 
Msg#: 3146 posted 5:01 pm on Feb 5, 2006 (gmt 0)

... As well as adding

# Block Java and Python urllib except from Google and Yahoo
RewriteCond %{HTTP_USER_AGENT} ^(Python[-.]?urllib|Java/?[1-9]\.[0-9]) [NC]
RewriteCond %{REMOTE_ADDR} !^207\.126\.2(2[4-9]|3[0-9])\.
RewriteCond %{REMOTE_ADDR} !^216\.239\.(3[2-9]|[45][0-9]|6[0-3])\.
RewriteRule .* - [F]

so your sites can't get raided again by Java- or Python-based scrapers.

(Note that the IP ranges may need some expansion/updating - I haven't checked this in a while.)

Jim

pocpocpoc

5+ Year Member



 
Msg#: 3146 posted 7:04 am on Feb 9, 2006 (gmt 0)

I've also encountered this bot. It first arrived on January 22, and my server gave it the 403 treatment. It keeps coming back from different IPs, presumably because of the 403.

I've logged 130 visits now, from 92 unique IP addresses, all with different user agents that look like Java versions.

I saw it from 69.85.234.38 (close to your .27 sighting) on January 29. Most of the source IPs have generic or missing reverse DNS. A few are servers, all of which so far appear to be running Windows.
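
To keep counts like these, the Java-agent hits can be split into their own log. A minimal sketch, assuming access to the main server or virtual host config (CustomLog is not allowed in .htaccess) and that a "combined" LogFormat is already defined there. The file name and the variable name java_ua are hypothetical.

```apache
# Flag requests whose User-Agent begins with "Java/" (case-insensitive)
SetEnvIfNoCase User-Agent "^Java/" java_ua
# Write only the flagged requests to a separate log file
CustomLog logs/java_bot.log combined env=java_ua
```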

Yesterday, I started giving this bot a 301 to a nonexistent site. We'll see if that has any effect.
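
A redirect like that could be sketched as follows. This assumes mod_rewrite is enabled. The hostname is a placeholder under the reserved .invalid TLD, not the target actually used, and matching on the Java user agent is likewise an assumption.

```apache
RewriteEngine On
# Match user agents that begin with "Java/" (case-insensitive)
RewriteCond %{HTTP_USER_AGENT} ^Java/ [NC]
# Send a permanent redirect to a host name that will never resolve
RewriteRule .* http://nonexistent.invalid/ [R=301,L]
```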
