Forum Moderators: open
Thank you.
Dan
My rule of thumb is: if I haven't heard of it before, it's of no use to me and not going to give me anything in return.
1.2.3.4 - - [23/Nov/2002:19:24:44 +0100] "GET /x.gif HTTP/1.1" 200 963 www.domain.net "referer here" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" "-"
where "Mozilla..." is the UA.
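To illustrate, here's a minimal sketch of pulling the UA out of a combined-format line like the one above. The field order is an assumption based on that sample line; your LogFormat may differ.

```python
import re

# The combined-format sample line from above.
LINE = ('1.2.3.4 - - [23/Nov/2002:19:24:44 +0100] "GET /x.gif HTTP/1.1" '
        '200 963 www.domain.net "referer here" '
        '"Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" "-"')

# Grab every double-quoted field; in this format the third one is the UA.
quoted = re.findall(r'"([^"]*)"', LINE)
request, referer, user_agent = quoted[0], quoted[1], quoted[2]

print(user_agent)  # Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)
```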
You apparently have the so-called "common log format", which does not include the UA. If there is a file named httpd.conf in the root directory of your site, it most likely contains these lines somewhere:
LogFormat "%h %l %u %t \"%r\" %>s %b" common
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
If there is also a line beginning with "CustomLog" and ending with "common", you might want to change that "common" to "combined".
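In other words, something like this (the log file path here is a guess; yours will differ):

```apache
# Hypothetical httpd.conf excerpt.
# Before -- common format, no referer or User-agent logged:
#   CustomLog /var/log/apache/access_log common
# After -- switch to the combined format defined by the LogFormat line above:
CustomLog /var/log/apache/access_log combined
```

Restart Apache after the change so the new format takes effect.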
64.140.49.68 ICG NetAhead, Englewood Co.
68.3.115.137 Customer of Cox Communications
66.77.73.145 Fast Search
165.166.181.156 Customer of Rock Hill Telephone Co, Rock Hill, SC
216.167.97.169 Customer of Verio, Inc.
With the exception of the Fast Search engine spider, you've got a mixed bag, here.
You have to look these guys up - in ARIN [arin.net], for example - and then take into account what you can find out about:
1) Who they are
2) What User-agent they use
3) What they "do" on your site
1a) For the first IP, look these guys up and decide whether their stated business purposes justify deep visits to your site. Are they offering anything to you or your customers/visitors? Sometimes, you'll find "web information miners" that offer specialized web-gathered information for sale. They use your bandwidth to sell information to their customers. You decide if this is useful to you, or just theft of bandwidth.
1b) For the ones marked "Customer of xyz", you can't track them any further based on just the IP address. They may be dial-up users who are assigned a "random" IP address out of their ISP's pool of addresses each time they connect, for example. So, you go on to:
2) What user-agent do they use? Is it "Mozilla/x.xx (compatible; )" or "Opera/xx.xx"? This would be a normal IE, Netscape, or Opera browser. If not, is it a known-abusive web site harvester like "Indy Library"? Does it appear in the Close to perfect htaccess ban list [webmasterworld.com] posted here on WebmasterWorld? If so, it's bad news. If not, a search through this forum may turn up something. For total unknowns, we go on to:
3) What do they do on your site? Download your whole site in 15 seconds, bringing your server to its knees? Poke around looking for formmail.pl? Dig into your robots.txt file and then go for disallowed files? Do they load just your html pages, and ignore images and scripts?
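Step 3 can be partly mechanized. Here's a minimal sketch of flagging "whole site in seconds" downloads and fetches of robots.txt-disallowed paths, assuming you've already parsed the log into (ip, time, path) tuples. The thresholds and paths are made-up numbers; tune them to your own traffic.

```python
from collections import defaultdict

# Hypothetical thresholds -- adjust to your own site's traffic patterns.
MAX_HITS = 30          # more than this many hits...
WINDOW = 15            # ...within this many seconds looks like a bulk download
DISALLOWED = ("/private/", "/cgi-bin/formmail.pl")  # from your robots.txt

def suspicious_ips(hits):
    """hits: list of (ip, unix_time, path) tuples parsed from the log."""
    by_ip = defaultdict(list)
    for ip, when, path in hits:
        by_ip[ip].append((when, path))
    flagged = set()
    for ip, events in by_ip.items():
        events.sort()
        times = [t for t, _ in events]
        # Sliding window: any MAX_HITS requests inside WINDOW seconds?
        for i in range(len(times) - MAX_HITS):
            if times[i + MAX_HITS] - times[i] <= WINDOW:
                flagged.add(ip)
                break
        # Fetching paths your robots.txt disallows is another red flag.
        if any(path.startswith(DISALLOWED) for _, path in events):
            flagged.add(ip)
    return flagged
```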
You have to analyze the big picture to figure this out, and the only really helpful thing is experience. Also, everyone's focus and attitude is a little different. From "ignorance is bliss" and "anything goes" to "ban everything that does not benefit my site or my visitors or customers."
Having decided, you can block by IP address or IP address range for known-static intruders. You can block by User-agent for those who use well-known e-mail address harvesters, site downloaders, etc. You can block by referer for sites which "borrow" your images. There are also several pesky 'bots that use legitimate User-agents and legitimate-but-inappropriate referers (e.g. iaea.org) to "make it look good" while they download your whole site.
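For reference, the three blocking methods might look something like this in an .htaccess file. The addresses, User-agent, and referer below are only placeholders, and the mod_rewrite lines assume your host has that module enabled:

```apache
# Hypothetical .htaccess excerpts -- adjust names and addresses to taste.

# Block by IP address or range (known-static intruders):
<Limit GET POST HEAD>
  order allow,deny
  deny from 64.140.49.68
  deny from 216.167.97.
  allow from all
</Limit>

# Block by User-agent or by referer (requires mod_rewrite):
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} "Indy Library" [NC,OR]
RewriteCond %{HTTP_REFERER} ^http://(www\.)?iaea\.org [NC]
RewriteRule .* - [F]
```

The [F] flag returns a 403 Forbidden to anything matching either condition.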
Sometimes, you just can't get a handle on a fixed IP address, User-agent, etc. In that case, you can set up some spider traps on your pages, and have them call a cgi script to dynamically add the offending IP address to your ban list. I am working on this now, using a modified version of this spider-trap script [webmasterworld.com] posted here on WebmasterWorld by member Key_Master. I can tell you it's effective, but potentially dangerous if you make a mistake - if you're not very careful, you could easily block Googlebot, for example. Once I get more experience with it, I may post again with what I've learned about doing this safely.
The key issues are:
1) How to name the spider trap files to make them attractive to bad 'bots.
2) How to name them so that Google doesn't list links to them in SERPs (without cloaking).
3) How to protect them so that Google doesn't fetch them (there is a conflict here with #2)
4) How to hide them so that normal users can't find them, click on them, and get banned.
Right now, I know just a little about these four points, and would be uncomfortable advising anyone.
HTH,
Jim
Thanks very much
Dan
Actually, I had an ulterior motive:
How many members here use a spider trap script approach, rather than spending hours each day looking at raw log files and adding bans line-by-line? As I said, I've begun playing with this approach, and I see that it can be quite effective against "brute force" site attacks. It's not a "final answer", it's just another tool in the toolbox. But I'm just an apprentice in the craft, and hoping to learn to use it better...
Thanks,
Jim
Most folks think I'm quite overbearing. However, I tend to let most visitors enter at will, and respond only once they begin acting maliciously.
In most instances there is some type of initial warning.
A lone inquiry into robots.txt, or a HEAD-only entry - these are the most common.
In some instances a bot will just begin hitting hard.
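Those warning signs can be spotted mechanically too. Here's a minimal sketch, assuming you've parsed the log into (ip, method, path) tuples; the two rules below are just the patterns described above:

```python
from collections import defaultdict

def early_warnings(hits):
    """hits: list of (ip, method, path); returns IPs showing warning signs."""
    by_ip = defaultdict(list)
    for ip, method, path in hits:
        by_ip[ip].append((method, path))
    flagged = {}
    for ip, events in by_ip.items():
        paths = {p for _, p in events}
        # A visitor whose only request was robots.txt is casing the site.
        if paths == {"/robots.txt"}:
            flagged[ip] = "lone robots.txt inquiry"
        # A visitor issuing nothing but HEAD requests is probing, not browsing.
        elif all(m == "HEAD" for m, _ in events):
            flagged[ip] = "HEAD-only entry"
    return flagged
```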
In the past week I've had two hits from AT&T users. This, added to an AT&T intrusion back in January, prompted me to expand my AT&T denies. I notified AT&T of the expanded denials; the only response was automated.
I've had something similar with MSN and Verizon recently.
Many folks are having Verio problems as well.
I think these unidentified private intrusions are going to increase as backbones are moved around in the failing economy.
The end result may well require extensive use of traps.
It's too bad that most service providers don't accept these intrusions as intrusions even when they are provided with TOS links from their own websites. :-(