First of all, this forum really is a cool place; I never knew there was such a keen interest in this, so I was wondering if anyone already has something like this running before I go and reinvent the wheel.
I go through my logs once a day, when I can that is, to look for abnormal/strange/new activity, and in the past I've simply banned IP blocks from the whole server whenever anything cropped up in any of the websites' logs.
It got to the point that, for amusement, spammers looking for an open sendmail.pl on an NT box are basically just adding entries to a database I keep. (If you've ever seen those GET requests: the ignorant phusks actually attempt to email a freemail account with the website URL so they can just check their mail later for open, spam-ready servers.)
Now, after getting slammed by some copyright bots that actually degraded performance and had the router guy calling me up thinking it might be a DoS, I want to set up an automatic honeypot that will temporarily redirect, based on session time and frequency of GET requests, anything that ignores the robots.txt file or doesn't present a user agent from my list of valid ones.
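The robots.txt side of it is the easy part. A minimal sketch, assuming the trap directory is called /trap/ (a placeholder name, not what I'd actually use):

# exclude every agent from the trap directory; only rule-breakers will ever go in
User-agent: *
Disallow: /trap/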
Well, I read all the discussions here before posting, and I already have a list (I think so, anyway) of ALL the normal browser agents, from Mac to Lynx. I even did lookups on most of the IP addresses listed; most are legit, and for the rest I don't really care if they view my sites if they're mistakes. People here seem experienced, and no one's going to pollute this forum with bogus IPs to scare newbies away.
Well, these are IIS4 and IIS5 machines; I want to experiment on them first and then apply it globally.
All I'm thinking about doing is planting a link, behind an invisible GIF spacer I already have in place, to a directory that is denied by the robots.txt file. The linked page is just an ASP page with another link to a page in another denied dir, and I just want to track that session across 4 subsequent pages in denied directories. If there are 4 GETs within x time, it locks the session into a constant redirect to a page of my choice. If the user-agent string rotates within x time or y GET requests, it redirects as well. I'm going to store this in a simple system DSN to keep track of its effectiveness; for testing I'll just have it email me so I can stop and check the logs to see what triggered it.
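Roughly, each trap page would look something like this. It's only a sketch: the page names, the 60-second window, the /blocked.asp target, and the session variable names are placeholders for whatever I end up using.

<%@ Language=VBScript %>
<%
Option Explicit
' Sketch of one trap page (e.g. /trap/page1.asp). Counts trap hits per
' session, watches for a rotating user-agent string, and locks the
' session into a redirect once it trips.
Dim hits, firstHit, ua

hits = Session("TrapHits")
If hits = "" Then hits = 0
hits = hits + 1
Session("TrapHits") = hits

' Remember when the first trap page was requested.
If Session("TrapFirstHit") = "" Then Session("TrapFirstHit") = Now()
firstHit = Session("TrapFirstHit")

' Flag the session if the user-agent string changes mid-session.
ua = Request.ServerVariables("HTTP_USER_AGENT")
If Session("TrapUA") = "" Then
    Session("TrapUA") = ua
ElseIf Session("TrapUA") <> ua Then
    Session("TrapFlagged") = True
End If

' Four trap hits inside the time window (60 seconds here) trips it too.
If hits >= 4 And DateDiff("s", firstHit, Now()) <= 60 Then
    Session("TrapFlagged") = True
End If

If Session("TrapFlagged") = True Then
    Response.Redirect "/blocked.asp"
End If
%>
<html>
<body>
<!-- invisible link on to the next trap page in another denied directory -->
<a href="/trap2/page2.asp"><img src="/images/spacer.gif" width="1" height="1" border="0"></a>
</body>
</html>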
What I'd -really- like to do, once this is working right, is globally ban the suckers right at the IP filter by writing to it directly, and then reset it after keeping track of who's doing what... but I don't know how to write something like that, and these are production machines, so the global.asa will have to do for now, I guess.
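In the meantime, a global.asa along these lines should be enough to hold flagged visitors at the door. Again just a sketch under my own assumptions: "FlaggedIPs" and /blocked.asp are placeholder names, and the trap pages would append the offending IP plus a comma to the list inside Application.Lock/Unlock.

<SCRIPT LANGUAGE="VBScript" RUNAT="Server">
Sub Application_OnStart
    ' Comma-delimited list of flagged IPs; trap pages append to it
    ' inside Application.Lock / Application.Unlock.
    Application("FlaggedIPs") = ","
End Sub

Sub Session_OnStart
    ' Response is available in Session_OnStart, so a flagged IP can be
    ' sent straight to the holding page before it sees anything else.
    Dim ip
    ip = Request.ServerVariables("REMOTE_ADDR")
    If InStr(Application("FlaggedIPs"), "," & ip & ",") > 0 Then
        Response.Redirect "/blocked.asp"
    End If
End Sub
</SCRIPT>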
This concept will, I understand, also trap some pre-caching proxies, but that's my choice. I can no longer have a malicious spider drag a whole server down because one website happens to get scanned by one of these things that blatantly ignores a robots.txt file that expressly bans all agents from a particular dir because of the dynamic load it puts on the database.
I know that most scanners rotate their IP addresses and their browser types as well, so it's down to an AI now to track and stop this crap, before they get really sneaky and change tactics to just grabbing a page an hour, which I can't understand why they didn't do in the first place.
Since I posted, I've already set up nested dummy ASP pages in a robots.txt-restricted dir that record an IP as 'flagged' if it hits 4 pages (I store the IP address, the user agent, the hit count from that IP through the dummy pages, and a 'flagged' flag).
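The recording part is just plain ADO against the system DSN. Something like this is roughly what the dummy pages run (sketch only; the "BotTrap" DSN and the SpiderLog table and column names are placeholders for whatever I actually set up):

<%
' Log the current hit: one row per IP, bump the counter,
' and set the flagged bit once it reaches four trap pages.
Dim conn, rs, ip, ua
ip = Request.ServerVariables("REMOTE_ADDR")
ua = Replace(Request.ServerVariables("HTTP_USER_AGENT"), "'", "''")

Set conn = Server.CreateObject("ADODB.Connection")
conn.Open "DSN=BotTrap"

Set rs = conn.Execute("SELECT hits FROM SpiderLog WHERE ip = '" & ip & "'")
If rs.EOF Then
    conn.Execute "INSERT INTO SpiderLog (ip, agent, hits, flagged) " & _
                 "VALUES ('" & ip & "', '" & ua & "', 1, 0)"
Else
    conn.Execute "UPDATE SpiderLog SET hits = hits + 1, agent = '" & ua & "' " & _
                 "WHERE ip = '" & ip & "'"
    conn.Execute "UPDATE SpiderLog SET flagged = 1 " & _
                 "WHERE ip = '" & ip & "' AND hits >= 4"
End If
rs.Close
conn.Close
%>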
I was wondering if it's possible, in the long run, to hook that into the actual filtering DLL or something so it works on a global scale. I'll share my results from the different websites, though. Shouldn't take long to populate it, I would think.
They are only linked from those ASP pages themselves, and they sit in different directories, all of which have robots.txt exclusions. If one is hit by a spider that doesn't follow the robots exclusion list... then it's logged.
I don't know how else to go about it; if the user agent can be forged, the only way to tell whether it's a rogue spider is to see whether it follows the robots.txt guidelines, checked against the list of acceptable and known spiders.
This means that even robots spoofing their UA could get caught, which is even more reason to track.