
Forum Moderators: Ocean10000 & incrediBILL & keyplyr


New bot Java/1.5.0_06 grabs all pages

grabbed all pages from 2 different domains

12:58 am on Feb 3, 2006 (gmt 0)

New User

10+ Year Member

joined:Apr 28, 2003
votes: 0

Managing my own domains plus several that are independent of mine gives me an advantage in spotting new, questionable, or bad bots.

Recently I found that my site and several other (isolated and independent) domains had log entries with the same IP number and user agent. On each site, this bot grabbed every page.

My research into the IP, its group, and its provider revealed what I consider a "concealed identity": the registrar did not give owner names, and a lookup by address did not give any company or individual name. I also went to the domain name(s) associated with the IP, and it had a Flash page with no alternative content (I deliberately have Flash uninstalled... never use it, for many reasons).

Because of the "lack of information" on the owner of the IP's CIDR group (the provider of service to the bot's IP), with no reverse DNS on the individual IP and not much else to go on, and because it grabbed all pages, I blocked the entire CIDR group plus the user agent.

The IP number was and UserAgent was Java/1.5.0_06

For the UserAgent, G and other SEs showed it was a plugin for some browsers.

The cidr range of - belongs to
slfiber.com in Alabama. A search of G by address shows Harbor Communications LLC in Mobile, AL, and since I have found that a huge amount of spam originates from southern states, I would sooner block the entire group (16k). It could be a valid new SE, but I did not submit to them and would sooner protect my site and those I manage.

Every page grabbed from multiple independent domains is NOT right.

Just a heads-up to watch for the IP and UA.

4:38 pm on Feb 3, 2006 (gmt 0)

Junior Member from DE 

10+ Year Member

joined:June 25, 2005
votes: 1

An FTP server replies to HTTP requests?! No Flash.

"M4cub3x (c) FTP Server (Version 6.5/OpenBSD) server ID."

Do you need visits from a server?

5:26 pm on Feb 3, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
votes: 3

Most everybody, and every UA deny list compiled, includes denial of Java in all forms.

You may skip the Flash intro at their site, and it reads as follows:

"we deliver cutting edge services to carriers, business and goverment entities."

Not a mention of private internet services.
A sub page under "services" also offers co-location.

As a result, I agree with your decision to deny the entire range.

RewriteCond %{REMOTE_ADDR} ^69\.85\.(19[2-9]|2[0-5][0-9])\. [OR]
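
The condition above ends in [OR], so it is one line taken from a longer deny list. As a minimal sketch of how it might sit in a complete rule set, assuming mod_rewrite is available in .htaccess, the range can be paired with the user agent reported in the first post. The pairing is an illustration, not a rule anyone in this thread posted. The last condition before the RewriteRule is written without [OR].

```apache
# Sketch only: deny the 69.85.192-255 range OR the reported user agent.
RewriteEngine On
# Third octet 192-255 of 69.85.x.x (64 x 256 = 16,384 addresses, the "16k" group)
RewriteCond %{REMOTE_ADDR} ^69\.85\.(19[2-9]|2[0-5][0-9])\. [OR]
# User agent string as given in the first post
RewriteCond %{HTTP_USER_AGENT} ^Java/1\.5\.0_06$
# Answer any requested URL with 403 Forbidden
RewriteRule .* - [F]
```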

5:01 pm on Feb 5, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 31, 2002
votes: 0

... As well as adding

# Block Java and Python URLlib except from Google and Yahoo
RewriteCond %{HTTP_USER_AGENT} ^(Python[-.]?urllib|Java/?[1-9]\.[0-9]) [NC]
RewriteCond %{REMOTE_ADDR} !^207\.126\.2(2[4-9]|3[0-9])\.
RewriteCond %{REMOTE_ADDR} !^216\.239\.(3[2-9]|[45][0-9]|6[0-3])\.
RewriteRule .* - [F]

so your sites can't get raided again by Java- or Python-based scrapers.

(Note that IP ranges may need some expansion/updating - I haven't checked this in a while.)


7:04 am on Feb 9, 2006 (gmt 0)

New User

10+ Year Member

joined:Sept 12, 2005
votes: 0

I've also encountered this bot. It first arrived on January 22, and my server gave it the 403 treatment. It keeps coming back from different IPs, presumably because of the 403.

I've logged 130 visits now, from 92 unique IP addresses, all with different user agents that look like Java versions.

I saw it from (close to your .27 sighting) on January 29. Most of the source IPs have generic or missing reverse DNS. A few are servers, all of which so far appear to be running Windows.

Yesterday, I started giving this bot a 301 to a nonexistent site. We'll see if that has any effect.
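
For anyone wanting to try the same thing, here is a rough sketch of a 301 to a nonexistent site, assuming mod_rewrite in .htaccess. The user-agent pattern is borrowed from the Java part of the rule posted earlier in this thread, and the host name nonexistent.invalid is a placeholder of my own, not the target actually used. The .invalid top-level domain is reserved and never resolves.

```apache
# Sketch only: send Java user agents a permanent redirect instead of a 403.
RewriteEngine On
# Matches "Java/1.5.0_06" and other Java version strings, case-insensitively
RewriteCond %{HTTP_USER_AGENT} ^Java/?[1-9]\.[0-9] [NC]
# Placeholder target (hypothetical); R=301 sends the redirect, L stops further rules
RewriteRule .* http://nonexistent.invalid/ [R=301,L]
```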