
Stopping spoofing spiders?


JAB Creations

6:07 am on Jun 3, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



An interesting idea popped into my head just a moment ago: even when spiders spoof common UAs, their behavior should still be distinguishable from a human's. So would there be a way to detect when an IP requests a page but fails to request the files referenced on that page, such as images? And would it be possible to write a server-side script that blocks such IPs?
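A minimal offline sketch of this idea in Python, run against an access log after the fact rather than live. The log path access.log, the combined log format, the list of image extensions, and the three-page threshold are all assumptions for illustration:

import re
from collections import defaultdict

# Matches the start of a combined-format log line: IP, then the request path.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|HEAD) (\S+)')
IMAGE_EXT = ('.gif', '.jpg', '.jpeg', '.png')

pages = defaultdict(int)   # IP -> number of HTML page requests
fetched_image = set()      # IPs that fetched at least one image

with open('access.log') as log:
    for line in log:
        m = LOG_LINE.match(line)
        if not m:
            continue
        ip, path = m.groups()
        path = path.split('?', 1)[0].lower()
        if path.endswith(IMAGE_EXT):
            fetched_image.add(ip)
        elif path.endswith(('.html', '.htm', '/')):
            pages[ip] += 1

# Suspects: several page views, zero image requests.
for ip, count in sorted(pages.items()):
    if count >= 3 and ip not in fetched_image:
        print(f'{ip}: {count} pages, no images fetched')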

keyplyr

9:50 am on Jun 3, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



One way would be to put together a DNScache.txt and have your script check whether the requesting IP falls within the ranges you have recorded for each UA.
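A minimal sketch of that check in Python. The UA-to-range map stands in for the contents of the DNScache.txt; the CIDR blocks below are examples only, not verified ranges:

import ipaddress

# Hypothetical cache; real ranges would come from your own DNS lookups.
KNOWN_RANGES = {
    'Googlebot': ['66.249.64.0/19'],
    'msnbot':    ['65.54.0.0/16'],
}

def is_spoofed(user_agent: str, ip: str) -> bool:
    """True if the UA claims a known bot but the IP is outside its ranges."""
    addr = ipaddress.ip_address(ip)
    for bot, ranges in KNOWN_RANGES.items():
        if bot in user_agent:
            return not any(addr in ipaddress.ip_network(net)
                           for net in ranges)
    return False  # UA claims no known bot; nothing to verify here

print(is_spoofed('Googlebot/2.1', '192.0.2.10'))  # True: outside the range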

jdMorgan

3:54 pm on Jun 3, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This is a complex application, since "behaviour over time" enters into it. That is, you need to track a specific client over time and over multiple HTTP requests. This requires a database to be searched and updated for each incoming HTTP request. Only if a particular client from a particular IP address fails to fetch (in your example) a non-cacheable image over some period of time (say 20 seconds, or after a certain number of additional requests from that same IP) would you want to consider blocking it.

This will likely require a database containing the IP address (perhaps indexed by an MD5 hash of the IPs to speed lookups), the time of the last request, a list of URLs that will require the non-cacheable image to be fetched, and a counter that indicates the remaining time or remaining number of requests after that URL is fetched before you will consider that client from that IP to be malicious.
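A minimal in-memory sketch of the scheme Jim describes, in Python. The 20-second grace period, the beacon path /beacon.gif, and treating every non-beacon request as a page request are assumptions for brevity; a production version would use a shared database, as he notes:

import time

GRACE_SECONDS = 20
pending = {}    # IP -> time of first page request still awaiting its beacon
blocked = set()

def on_request(ip: str, path: str) -> bool:
    """Record one request; return True if the client should be blocked."""
    now = time.time()
    if ip in blocked:
        return True
    if path == '/beacon.gif':
        pending.pop(ip, None)   # image was fetched: client looks human
        return False
    first_seen = pending.get(ip)
    if first_seen is not None and now - first_seen > GRACE_SECONDS:
        blocked.add(ip)         # pages served, image never fetched
        return True
    pending.setdefault(ip, now) # start the clock on the first page request
    return False

Calling on_request('192.0.2.1', '/index.html') twice, more than 20 seconds apart with no beacon fetch in between, would block that IP.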

The main drawback is server load: the script will need to do a database lookup and update for every HTTP request arriving at your server. This is not a good approach for a busy site.

This approach also suffers from the "closing the barn door after the horse has already left" flaw... By the time you declare a client to be malicious, it may already have collected what it wanted. Set the parameters too strictly and you risk blocking legitimate users; too loosely and the damage is done before you act.

Personally, I prefer the simpler methods described in two "bad-bot" scripts published here on WebmasterWorld -- the first by Key_Master [webmasterworld.com] and the second by xlcus [webmasterworld.com], with later versions of Key_Master's script "enhanced" by myself and others [webmasterworld.com], and of xlcus's script enhanced by AlexK [webmasterworld.com].

However, if you decide to pursue development of a new behaviour-based blocking script, some of the ideas in those threads may be helpful.

Jim

Matt Probert

6:02 pm on Jun 3, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



"So would there be a way to detect when an IP requests a page but fails to request the files referenced on that page, such as images?"

You are assuming that all humans can see, use a GUI browser, and want to download images!

Be careful, you could be throwing the baby out with the bathwater.

Matt