Lately a lot of spiders have been crawling my site like crazy. I have an idea that I'm not using yet because it seems too risky, so I'd like your views.
If the User-Agent claims to be IE/Firefox/Opera,
and the request carries Connection: close in its headers,
and the web server's default setting is keep-alive,
then I think it's a spider. What do you think?
Connection: close is a valid header that can be sent by Firefox/Opera/IE. I usually only see it when the request comes through a proxy server or similar service. So I wouldn't block a client on this alone, but you can use it as a tip-off to do more checks.
A better header to check is the "Accept" header, which the major web browsers send, though again some proxy servers will remove it. But it's easier to check for common proxy server headers. If the Accept header is missing, it's more than likely a bot or a client coming through a proxy server. FYI, mobile browsers are a crapshoot when it comes to including the Accept header.
I use the Accept Header method to remove a lot of spoofing bot traffic on my site. It keeps my bandwidth bill light and scrapers moving on to easier targets.
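A rough sketch of that kind of header check, in Python. The function name, the list of browser tokens, and the choice of Via and X-Forwarded-For as the "common proxy server headers" are my own assumptions for illustration, not anyone's production rules. It only collects reasons for suspicion; it does not block anything by itself.

```python
# Hypothetical helper: names and header lists are illustrative assumptions.
BROWSER_TOKENS = ("MSIE", "Firefox", "Opera")
PROXY_HEADERS = ("via", "x-forwarded-for")  # assumed examples of proxy headers


def header_suspicion(headers):
    """Return a list of reasons why a browser-claiming request looks odd."""
    # HTTP header names are case-insensitive, so normalise them first.
    h = {name.lower(): value for name, value in headers.items()}
    user_agent = h.get("user-agent", "")
    claims_browser = any(token in user_agent for token in BROWSER_TOKENS)
    behind_proxy = any(name in h for name in PROXY_HEADERS)

    reasons = []
    # Proxies may strip Accept or force Connection: close, so skip those cases.
    if claims_browser and not behind_proxy:
        if "accept" not in h:
            reasons.append("missing Accept")
        if h.get("connection", "").lower() == "close":
            reasons.append("Connection: close")
    return reasons
```

Anything that comes back non-empty would be a tip-off for further checks rather than grounds for an outright block.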
[edited by: Ocean10000 at 3:22 pm (utc) on Mar. 16, 2008]
I have a problem with something spoofing IE7.
Here are the headers:
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 1.1.4322)
Looks pretty legitimate, right? The problem is that my robot trap catches these headers from lots of different IPs within a short period of time. The strange thing is that no matter where the IP is from, the Accept-Language is en-us, the Accept is identical (which means they must all have the same software installed), and the User-Agent is identical too (which means they even have the same patches)!
Too odd not to think it's a faked IE7. Any ideas?
So I think I will block an IP if:
Connection = Close
and Accept = image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*
and Accept-Language = en-us
and User-Agent = Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 1.1.4322)
What do you think?
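One way to turn the "same headers from lots of different IPs in a short time" observation into a check is sketched below in Python. The class name, the 10-minute window, and the threshold of 20 IPs are assumptions picked for illustration. Identical headers can also come from genuine visitors who share a common default install, so the result is better treated as one signal among several than as proof of a fake.

```python
from collections import defaultdict, deque

# Headers that make up the fingerprint (the four from the rule above).
FINGERPRINT_HEADERS = ("user-agent", "accept", "accept-language", "connection")


def fingerprint(headers):
    """Reduce a request's headers to a comparable tuple."""
    h = {name.lower(): value.strip() for name, value in headers.items()}
    return tuple(h.get(name, "") for name in FINGERPRINT_HEADERS)


class FingerprintTracker:
    """Hypothetical tracker: counts distinct IPs per fingerprint in a window."""

    def __init__(self, window_seconds=600.0, max_ips=20):
        self.window_seconds = window_seconds  # assumed window length
        self.max_ips = max_ips                # assumed threshold
        self._hits = defaultdict(deque)       # fingerprint -> (time, ip) pairs

    def record(self, ip, headers, now):
        """Log one request; return True if the fingerprint looks shared."""
        hits = self._hits[fingerprint(headers)]
        hits.append((now, ip))
        # Drop entries that have aged out of the sliding window.
        while hits and now - hits[0][0] > self.window_seconds:
            hits.popleft()
        distinct_ips = {seen_ip for _, seen_ip in hits}
        return len(distinct_ips) > self.max_ips
```

The `now` argument is a timestamp in seconds, for example from `time.time()`.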
I do block based on some invalid Accept and Connection headers, but typically I profile the visitor in real time and block bot-like behavior such as:
- Falls into my spider trap
- Uses invalid SGML/HTML in URIs
- Trips speed traps that catch accesses too fast for a human
(make sure you disable pre-fetch in htaccess first)
- Asks for the robots.txt file
- Doesn't ask for CSS, .js, .gif and .jpg files
- and so on.
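A few of the checks in that list (speed trap, robots.txt requests, no CSS/JS/image requests) could be sketched roughly like this in Python. The class, the thresholds, and the flag wording are all assumptions for illustration; real limits would have to be tuned against actual traffic.

```python
from collections import deque

ASSET_SUFFIXES = (".css", ".js", ".gif", ".jpg")


class VisitorProfile:
    """Hypothetical per-visitor profile; all thresholds are illustrative."""

    def __init__(self, max_pages=10, per_seconds=10.0, grace_pages=5):
        self.max_pages = max_pages      # pages allowed inside the window
        self.per_seconds = per_seconds  # length of the speed-trap window
        self.grace_pages = grace_pages  # pages seen before judging assets
        self.page_times = deque()       # timestamps of recent page requests
        self.page_count = 0
        self.asset_hits = 0
        self.asked_robots = False

    def observe(self, path, now):
        """Record one request for `path` at time `now` (in seconds)."""
        path = path.lower().split("?", 1)[0]
        if path == "/robots.txt":
            self.asked_robots = True
        elif path.endswith(ASSET_SUFFIXES):
            self.asset_hits += 1
        else:
            self.page_count += 1
            self.page_times.append(now)
            # Keep only page requests inside the sliding window.
            while now - self.page_times[0] > self.per_seconds:
                self.page_times.popleft()

    def flags(self):
        """Return the bot-like behaviors seen so far."""
        reasons = []
        if self.asked_robots:
            reasons.append("asked for robots.txt")
        if len(self.page_times) > self.max_pages:
            reasons.append("too fast for a human")
        if self.page_count >= self.grace_pages and self.asset_hits == 0:
            reasons.append("no CSS/JS/image requests")
        return reasons
```

A visitor with cached assets can legitimately skip CSS and images, so the asset flag is probably safer as a contributing signal than as a trigger on its own.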
FWIW, some of the bots think they're clever and ask for all the files on the first page they access then take off screaming through the site and get immediately busted for speed.
Others are a bit smarter and go a bit slower and ask for hundreds of pages over days.
However, the only clear-cut way I've found so far to block the stealth bots is to test for a human at the controls once certain odd behavior shows up, such as a page-view count skewed well away from the average.
My question is: if, for example, Googlebot fell into this trap, would Google delete the page already in its database? Or would it keep the old version and try to fetch a fresh copy next time?
(I guess they would keep the old one until they get another HTTP 200. What's your view, or is there any official statement?)
Basically my security goes like this:
1. search engines get a free pass into the site, no further scrutiny
2. anything that's not MSIE/Firefox/Opera gets bounced, almost all other unwanted junk bounces here
3. anything claiming to be MSIE/Firefox/Opera gets further scrutiny, such as speed traps, spider traps, bad headers, bad SGML/HTML, etc.
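As a rough Python sketch, the three tiers above might be dispatched like this. The function name, the return labels, and the list of search-engine tokens are assumptions for illustration. Matching on the User-Agent string alone is easy to spoof, so a real setup would presumably confirm the search engines some other way (for example by IP or reverse DNS) before giving them the free pass.

```python
# Illustrative token lists; a real deployment would maintain its own.
SEARCH_ENGINE_TOKENS = ("Googlebot", "Slurp", "msnbot")
BROWSER_TOKENS = ("MSIE", "Firefox", "Opera")


def triage(user_agent):
    """Map a User-Agent string onto one of the three tiers."""
    # Tier 1: search engines get a free pass.
    if any(token in user_agent for token in SEARCH_ENGINE_TOKENS):
        return "allow"
    # Tier 3: claimed browsers go on to traps and header checks.
    if any(token in user_agent for token in BROWSER_TOKENS):
        return "scrutinize"
    # Tier 2: everything else gets bounced.
    return "deny"
```

The search-engine check runs first because crawler User-Agents often start with "Mozilla" and could otherwise collide with broader browser patterns.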