For example, say I have these directories and files which run off the home directory:
dir1/subdir1a/file1.htm
dir2/file2.htm
I'll see requests for something like this in my 404 stats:
dir1/dir2/file2.htm
dir1/subdir1a/file1.htm/file2.htm
Some of them are very obviously not a typo because there are two filenames in the URL, not just smushing directory names together.
Usually there has been no referring URL showing in my stats (Awstats). However, increasingly I'll see a referring address that is another one of these compound "addresses" (often the same "address" for over a dozen requests), or it might be a legitimate address in my site.
This has got to be a bot throwing deliberate 404s at my site, but I have no idea who it would be or what they're looking for when they do it. I've wondered if it might be hack attempts, like someone trying to get directory access by deliberately requesting a path that doesn't exist. But when I try it manually I just get the standard 404 page. I believe Inktomi makes deliberately false requests, although I'm not sure what it gets out of doing that. I don't think this is Inktomi though, because I've been seeing what I think is them for a long time and this stuff only started around April.
Between these compound URLs and the truncated URLs, I have over 12 screens of 404s so far this month! I can't tell if I have any real 404s because all this other garbage is in the way.
I don't know if it's a legitimate error (unlikely) or someone's robot run amok or a virus making the rounds. Does anyone else see this in their stats? Does anyone have any idea what is causing these things? Thanks for any help with this.
Starhugger
Awstats and all the other stats packages that hosts provide are nice tools from the host's perspective.
From an individual webmaster's perspective (someone who wants to analyze visitors and stats), nothing replaces genuine, full logs.
There are software packages that analyze full log-line data.
Don
Log example:
65.54.188.149 - - [21/May/2006:05:20:02 -0700] "GET /myfolder/mypage.html HTTP/1.0" 404 - "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
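For anyone who wants to pick a raw line like that apart without a stats package, here is a minimal Python sketch. It assumes the log is in Apache's "combined" format (host, ident, user, time, request, status, bytes sent, referer, user agent); the names LOG_PATTERN and parse_line are made up for this illustration and are not part of any log tool.

```python
import re

# Assumed layout (Apache "combined" format):
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # The field after the status code is the response size in bytes;
    # "-" means no body was sent (typical for a 404 logged this way).
    fields["size"] = None if fields["size"] == "-" else int(fields["size"])
    return fields

sample = (
    '65.54.188.149 - - [21/May/2006:05:20:02 -0700] '
    '"GET /myfolder/mypage.html HTTP/1.0" 404 - "-" '
    '"msnbot/1.0 (+http://search.msn.com/msnbot.htm)"'
)

entry = parse_line(sample)
print(entry["host"], entry["status"], entry["size"], entry["referer"], entry["agent"])
```

The referer and user-agent fields are the ones that summary stats tend to hide, which is why the full line is worth keeping.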
Here's a brief sample of the kind of exploration this thing does (the dir and file names have been changed), starting with reading robots.txt (commas separate fields):
209.249.86.4, -, -, [01/May/2006:00:09:43, -0700], GET /robots.txt HTTP/1.0, 200, 384, -, -
209.249.86.4, -, -, [01/May/2006:00:10:03, -0700], GET /dir1/subdir1a/file1.htm HTTP/1.0, 200, 30091, -, -
209.249.86.4, -, -, [01/May/2006:00:10:13, -0700], GET /dir1/subdir1a/file1.htm HTTP/1.0, 200, 30091, -, -
28 more 200's of various pages, many repeats
209.249.86.4, -, -, [01/May/2006:00:15:22, -0700], GET /dir1/subdir1a/file2.htm/file3.htm HTTP/1.0, 404, -, -, -
2 more the same
209.249.86.4, -, -, [01/May/2006:00:15:52, -0700], GET /dir1/subdir1a/file2.htm/file1.htm HTTP/1.0, 404, -, -, -
209.249.86.4, -, -, [01/May/2006:00:16:02, -0700], GET /dir1/subdir1a/file2.htm HTTP/1.0, 200, 32307, -, -
It continues in that kind of fashion, with a lot of repeating the exact same request. It also crawls through various system dirs and files, triggering various 301 codes. I'm not sure what the field is immediately after the 200, 404, 301, etc. field, but that second field sometimes has various 3-digit codes and sometimes nothing. The 200 requests typically have various 4 or 5 digit codes in that second field.
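One way to separate these compound requests from real 404s is to flag any path where something that looks like a filename appears before the last segment. This is only a rough Python sketch: the extension list and the function name is_compound_path are assumptions for illustration, and it presumes none of your real directory names end in those extensions.

```python
import re

# Assumption: a path segment ending in one of these extensions is a file,
# not a directory. Adjust the list to match your own site.
FILE_LIKE = re.compile(r'\.(?:html?|php|asp|cgi)$', re.IGNORECASE)

def is_compound_path(path):
    """True if a file-like segment appears anywhere before the final segment."""
    path = path.split("?", 1)[0]                      # ignore any query string
    segments = [s for s in path.split("/") if s]      # drop empty pieces
    return any(FILE_LIKE.search(s) for s in segments[:-1])

# Paths taken from the sample requests above.
requests = [
    "/robots.txt",
    "/dir1/subdir1a/file1.htm",
    "/dir1/subdir1a/file2.htm/file3.htm",
    "/dir1/subdir1a/file2.htm/file1.htm",
]

suspicious = [p for p in requests if is_compound_path(p)]
for path in suspicious:
    print(path)
```

This only catches the "two filenames in one URL" kind. A request like dir1/dir2/file2.htm, where directory names are run together, would need a check against the site's actual directory tree.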
In the stats I could get at (it's a HUGE file) I didn't see any referring URL, although I sometimes see them in my Awstats, so I guess those instances are in another part of the file that didn't load into Excel.
I did a WhoIs lookup for 209.249.86.4 and here's what I found in ARIN:
Abovenet Communications, Inc ABOVENET-4 (NET-209-249-0-0-1) 209.249.0.0 - 209.249.255.255
Kavam MFN-T595-209-249-86-0-24 (NET-209-249-86-0-1) 209.249.86.0 - 209.249.86.255
I also found this, which is very interesting considering the "first seen" date and these 404s started sometime in April:
209.249.86.0/24 More Specific
Current-Status: Announced
First-Seen: 2200h 13 Apr 2006 UTC
Last-Seen: Current
Origin_AS: AS36737 -- KAVAM - Kavam, Inc
First_Hop_AS: AS6461 -- MFNX MFN - Metromedia Fiber Network
So does anyone know anything about these guys? I couldn't find a phone number for them in the WhoIs info I found in a couple of places, although there is one for Abovenet. I'm unclear whether Abovenet owns Kavam or whether they are perhaps just their webhost.
Any help deciphering this is much appreciated! :)
Starhugger
And what about your "compound URL" hitter? Shoot. Just nuke it!
Given the IP [dnsstuff.com] you provided, plus the prevalence of its CustName "Kavam" via a quick search [google.com], I'm going to add these to my htaccess:
SetEnvIfNoCase User-Agent "kavam" keep_out
SetEnvIfNoCase Remote_Addr "209\.249\.86\.[0-9]+" keep_out
(Your code may vary.)
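Before relying on an address pattern like the Remote_Addr one, it can help to check which addresses it actually catches. Here is a small Python sketch that tests the same regular expression next to the 209.249.86.0/24 range from the WhoIs output; the helper names are invented for this example, and it uses an unanchored search on the assumption that SetEnvIf matches the pattern anywhere in the value.

```python
import ipaddress
import re

# Same pattern as in the Remote_Addr line, searched unanchored.
ADDR_PATTERN = re.compile(r"209\.249\.86\.[0-9]+")

# The range reported for Kavam: 209.249.86.0 - 209.249.86.255
BLOCKED_NET = ipaddress.ip_network("209.249.86.0/24")

def regex_blocks(addr):
    """True if the address string matches the pattern."""
    return ADDR_PATTERN.search(addr) is not None

def network_blocks(addr):
    """True if the address falls inside the /24 range."""
    return ipaddress.ip_address(addr) in BLOCKED_NET

for addr in ("209.249.86.4", "209.249.86.255", "209.249.87.4", "65.54.188.149"):
    print(addr, regex_blocks(addr), network_blocks(addr))
```

For these sample addresses the two checks agree, which suggests the pattern covers the same block as the "deny from 209.249.86." form.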
Thanks for the specific heads-up!
The code you gave here: Is that supposed to be one line or two? I know .htaccess lines often wrap, making them look like two lines, but they won't work if they are entered that way.
Also, in the other thread about this, where this same bot is called Charlotte, Wilderness gives this code:
Options -Indexes
<Limit GET>
SetEnvIf User-Agent Charlotte keep_out
order allow,deny
deny from 209.249.86.
allow from all
deny from env=keep_out
</Limit>
What is the functional difference between your code and this other code?
And last: Could you direct me to a thread or site that gives detailed instructions on which ones and how to block the obvious nasties?
Thanks for the help. :)
Starhugger
Could you direct me to a thread or site that gives detailed instructions on which ones and how to block the obvious nasties?
There is no such thing as "obvious"!
Each webmaster MUST make their own decision on what is beneficial or detrimental to their own site(s).
With the above in mind:
Copying and pasting lines that others have suggested can easily disable your website (500 errors) for all visitors (yourself included) because of a simple syntax error.
[webmasterworld.com...]
[webmasterworld.com...]
[webmasterworld.com...]
[webmasterworld.com...]
[joseluis.pellicer.org...]
[botspot.com...]
[psychedelix.com...]
[spiderhunter.com...]
[projecthoneypot.org...]
I just thought I'd update some info about this bot. The last time it visited my site was May 1st, and it hasn't been back since. The last visit in April was the 30th. So (trying on the benefit of the doubt), I'm thinking maybe part of the problem was bad programming, and they discovered the problem and took the bot offline until they fixed it. Or...(hurling benefit of the doubt out the window) maybe they had their fun or achieved their mission and left. Maybe they'll be back when June turns the page. Who knows. I've blocked them in .htaccess anyway.
Starhugger