Forum Library, Charter, Moderators: Receptional & mademetop

Website Analytics - Tracking and Logging Forum

    
Strange hits from random bots?
Logs show multiple 20 hit sessions with java browser
Duskrider




msg:889340
 6:12 am on Jun 7, 2006 (gmt 0)

Hello,

I've got a site that's fairly new, it's only been online about a month or so. Lately, just in the last week, my logs have been showing access to the site by several random bots from different locations all over the world. I've been webmaster of a few sites, but none have ever really accumulated much traffic, so I'm not sure if this is normal or strange.

Basically, between 10 and 20 times a day, a bot comes and hits exactly 20 pages of my site. More often, maybe 30 times a day, a bot of the same type will hit 5 pages and stop. It always lists as a Java Browser, either 1.5.0 or 1.4.1 and can come from anywhere in the world (I've seen Romania, China, Korea, USA, Germany, etc. Majority are from the Netherlands).

When they visit, they crawl through the site very quickly, hitting each of my first 20 (or 5) pages for just a couple of seconds. Normally this wouldn't worry me, but when I look at my logs, way more than half my traffic comes from these hits.

About the same time as this started happening, I was going around to various directories and free stuff sites (my site has free widgets for download), and submitting my site to everything I could find... trying to get links. I'm assuming these hits are due to my submitting my URL to someone during that time.

Is this harmful or just annoying? Can I block them, should I?

Thanks for any information.

 

oxbaker




msg:889341
 11:14 pm on Jun 7, 2006 (gmt 0)

These bots are typical of popular sites and search engine portals. Basically there are two types of bots: good and bad. Google uses Googlebot (you can read it in the User-Agent field), which spiders pages to be indexed for its databases. Many hackers and script kiddies make bots to search for porn, music, etc., or for more malicious reasons. You can't really do much to prevent them. A robots.txt file is a good idea for ANY website, but again, only the "good" bots really adhere to it. This site (webmasterworld.com) had some serious issues with bots, and I suggest you find the thread about it. Basically you can't stop them from coming in without LOADS of work, but if you ever have to report on them, make sure you filter out things like Freedom or internal-zero-knowledge agent, as these are NOT real people but bots, and you will report inaccurate visits/hits to your site if you leave them in.

Linear regression analysis is a popular method for filtering out bots during reporting, but to prevent them you can really only add a robots.txt file and perhaps join the IAB, get their list of "BAD BOT ADDRESSES", and block requests from those IPs (keep in mind that list costs something like $10,000 US to buy, though).

hth,
mcm

Pfui




msg:889342
 7:21 pm on Jun 10, 2006 (gmt 0)

Sounds like you've opened yourself up to troublesome visitors from bad bots and/or bad neighbors, and/or because you offer free widgets. Bummer. Here are some recommendations/solutions:

1.) Seeing as how the hits you described appear automatic, many of the countries you named are notorious for generating trouble of ALL kinds (ranging from e-mail and log spam to hacking and intrusion attempts), and the visitors don't look like they're there for what you have to offer, I'd err on the block 'em side.

2.) For example, I've banned all permutations of the User-agent (UA) "Java" (FYI [webmasterworld.com]) on all of my sites for years and with no ill effects. Here's one of many quick ways to send a 403 (Forbidden) if you've got access to .htaccess and your server's equipped with mod_rewrite:

RewriteEngine on

## If a UA begins with "Java" (regardless of version number) 
RewriteCond %{HTTP_USER_AGENT} ^Java
## Tell it No Way
RewriteRule ^.*$ - [F]

3.) You can handle huge numbers of Bad Guys of all kinds (ditto country-specific IPs, and hosts, etc.) in many ways once you get the hang of it. Check out the "Search Engine Spider Identification [webmasterworld.com]" forum, the "robots.txt [webmasterworld.com]" and the "Apache Web Server [webmasterworld.com]" forum for info and how-to.

(The latter forum is where you'll also learn how to make sure only visitors actually on your site can download widgets from your site.)

4.) Touching .htaccess for any reason can be extremely effective -- and can also result in your locking yourself (and ALL visitors) out. Plus this site's programming tends to turn pipes (straight up and down lines) into broken lines, and lop off spaces before exclamation marks. So take care, go _s_l_o_w_l_y_, and get the basics down. And then go nuts;)

5.) If you already know your way around .htaccess and/or mod_rewrite, you can dive into the deep end, and follow-up in the aforementioned areas as need be:

A Close to perfect .htaccess ban list
[webmasterworld.com...]

6.) When you want to look up a specific UA, check out the terrific info from psychedelix.com:

List of User-Agents (Spiders, Robots, Crawler, Browser)
[psychedelix.com...]

Note: That many-pages list can be pretty daunting, even depressing, and definitely free-time-consuming. That's why last year, a couple of us decided to block every UA except for those including the word "Mozilla" and also kick the bots, etc., spoofing/abusing that. (I mention this so you'll know you can block a single Bad Guy, or thousands in one swell foop:)
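For anyone curious what the "Mozilla only" approach might look like, here is a minimal sketch in the same mod_rewrite style as the Java rule above. It is an illustration only, not the exact rules we run: the robots.txt exemption is an assumption added here, and the rule does nothing about agents that spoof "Mozilla", which need their own conditions.

```apache
RewriteEngine on

## Sketch only: forbid any request whose User-Agent does not contain "Mozilla"
RewriteCond %{HTTP_USER_AGENT} !Mozilla
## Assumption: keep robots.txt readable for everyone, whatever the UA
RewriteCond %{REQUEST_URI} !^/robots\.txt$
## Everything else gets a 403 (Forbidden)
RewriteRule ^ - [F]
```

A rule this broad also turns away any crawler or tool whose UA string lacks that word, so check your own logs for agents you want to keep before trying it.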

Good luck!

rfontaine




msg:889343
 7:33 pm on Jun 10, 2006 (gmt 0)

My recent post about a similar situation:
[webmasterworld.com...]

blend27




msg:889344
 12:07 pm on Jun 11, 2006 (gmt 0)

Every one of these

Java/1.4.1_01
Java/1.4.1_02
Java/1.4.1_04
Java/1.4.2
Java/1.4.2_01
Java/1.4.2_03
Java/1.4.2_04
Java/1.4.2_05
Java/1.4.2_06
Java/1.4.2_08
Java/1.4.2_09
Java/1.4.2_10
Java/1.5.0
Java/1.5.0_01
Java/1.5.0_02
Java/1.5.0_03
Java/1.5.0_04
Java/1.5.0_05
Java/1.5.0_06
Java/1.6.0-beta
Java1.3.1_03
Java1.4.0_01

goes into the bot trap, no exceptions, as soon as it arrives.

I did do some experiments with these small "animals", like letting the Java bot see a page with a limited number of URIs on it. Most of the time it looks like the list of URIs gets extracted, sorted in alphabetical order, and revisited. As for where they come from: every known ISP so far.

Duskrider




msg:889345
 8:52 am on Jun 12, 2006 (gmt 0)

Thanks much for all the information everyone, I appreciate it.

Pfui - I've just now begun to experiment with .htaccess - luckily my host allows me full access to the file and all its goodies. I actually found that ban list about a day after I posted this. While doing a search for bad bots on G that thread came up. The list is online now in my .htaccess, but I'm still getting my hits.

Now they're hitting 22 pages every time instead of 20. I have far more pages than that, but it seems the more I add, the more they add. Weird. I just wonder what the heck they're doing. /shrug

I guess I'll just have to take the route of Pfui and blend27 - hack all java agents off at the knees. They don't seem to be doing me any good anyway.

Thanks again.

Pfui




msg:889346
 9:24 am on Jun 12, 2006 (gmt 0)

Duskrider, I'm glad all of our replies didn't set your head to spinnin' too much:)

One concern...

If you copy-pasted any Rewrite (or SetEnv, etc.) list from anywhere, even from WW, there's a really good chance it's not going to work properly because lists can still be incomplete snippets, even if they're really long.

For example, you say, 'The list is online now in my .htaccess, but I'm still getting my hits' -- however if Java is to be 403'd (Forbidden), visitors using that UA should be stopped on the spot.

If such is the case, if Java is to be rewritten [F] but still getting in, something's not working properly. Check out the aforementioned Apache forum for help because even a misplaced comma or an extra [OR] can prevent mod_rewrite from working its magic.
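To illustrate the [OR] point, here is a small sketch of how two User-Agent conditions are usually chained in front of one rule. "ExampleBadBot" is a made-up name used only for illustration; it is not taken from any real ban list.

```apache
RewriteEngine on

## Every condition except the LAST one carries [OR]
RewriteCond %{HTTP_USER_AGENT} ^Java [NC,OR]
## Last condition in the chain: no [OR] here ("ExampleBadBot" is hypothetical)
RewriteCond %{HTTP_USER_AGENT} ^ExampleBadBot [NC]
RewriteRule ^ - [F]
```

When snippets get pasted together from several posts, the last line of one snippet often still ends in [OR], or the middle lines lose theirs. Either way the chain no longer means what the list's author intended, so that is the first thing to check.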

Romeo




msg:889347
 10:55 am on Jun 12, 2006 (gmt 0)

## Tell it No Way
RewriteRule ^.*$ - [F]

... or REWRITE to a valid alternate page with some broken content like
"<html><body>This is my pages about widgets"

A few bytes sent with a friendly 200 code look better and less suspicious to the bot than a plain 403, which could alert the bot owner to investigate, adapt, and change, making the bot harder for us to spot.

No need to show them that we got them ...
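A rough sketch of that idea, assuming the .htaccess sits in the document root and that a small file called decoy.html (a hypothetical name) exists there:

```apache
RewriteEngine on

## Same test as before: UA begins with "Java"
RewriteCond %{HTTP_USER_AGENT} ^Java
## Do not rewrite the decoy itself, or the rule would loop
RewriteCond %{REQUEST_URI} !^/decoy\.html$
## Internal rewrite, no redirect: the bot gets the decoy with a normal 200
RewriteRule ^ /decoy.html [L]
```

Because this is an internal rewrite rather than a redirect, the bot sees a normal response for whatever URL it asked for and never learns the decoy's address.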

Kind regards,
R.

Duskrider




msg:889348
 1:24 pm on Jun 12, 2006 (gmt 0)

I actually did just copy and paste the code from WW into my .htaccess file. Everything looks ok at first glance, but to be honest I'm not really sure exactly what it's doing.

There doesn't seem to be anything to do with java at all in that list, at least not the one I got, so I wouldn't expect any java agents to be blocked at this point.

I suppose my best bet is to start learning about .htaccess and what the rewrite is, how it works, and what it does exactly. I'll be better off that way anyway - rather than just blindly copy/pasting code and crossing my fingers. I know screwing up the .htaccess file can really mess a site up, so I figure I better learn sooner or later regardless. :)

Thanks again for the heads up!

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved