Forum Moderators: DixonJones

Message Too Old, No Replies

Unknown crawlers with dynamic IPs

Crawlers using dynamic IPs

         

rsmarsha

9:52 am on Mar 16, 2007 (gmt 0)

10+ Year Member



We have had the following crawlers hitting the site at the same time every day for a while now:

87117

16 Mar, 07:40:53, 84-12-#*$!-184.dyn.example.co.uk, 282 pages
Windows 2003, Explorer 6.0

87119

16 Mar, 07:40:53, 84-12-#*$!-194.dyn.example.co.uk, 230 pages
Windows 2003, Explorer 6.0

87116

16 Mar, 07:40:52, host-84-9-XX-50.example.com, 262 pages
Windows 2003, Explorer 6.0

They use common user agents and are coming from residential ISP IP ranges. The providers in question have either denied that the crawlers are using their service, or said there is nothing they can do about it.

Any ideas on how we can stop the offending crawlers?

[edited by: Receptional at 10:56 am (utc) on Mar. 16, 2007]
[edit reason] Anonymised /examplified the IPs [/edit]

Receptional

10:59 am on Mar 16, 2007 (gmt 0)



I'm no programmer, but presumably they are loading these pages in very quick succession? Wouldn't it be possible to block any IP number in that ISP's range that loads (say) 10 pages in a minute for a period of (say) 10 minutes, stopping the crawler in its tracks without overly hampering a normal user on the same ISP's IP number?
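The rate-limit idea could be sketched roughly as follows. This is an illustrative Python sketch of the logic (the thread talks about PHP and Apache, but the mechanism is language-independent); the thresholds and function names are assumptions, not anything from the thread:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60    # look-back window for counting hits
MAX_HITS = 10          # more than this many hits per window triggers a block
BLOCK_SECONDS = 600    # how long an offending IP stays blocked (10 minutes)

hits = defaultdict(deque)   # ip -> recent request timestamps
blocked_until = {}          # ip -> unix time when its block expires

def allow_request(ip, now=None):
    """Return True if this request should be served, False if blocked."""
    now = time.time() if now is None else now
    # Still inside a block period?
    if blocked_until.get(ip, 0) > now:
        return False
    # Drop timestamps that have fallen out of the window.
    window = hits[ip]
    while window and window[0] <= now - WINDOW_SECONDS:
        window.popleft()
    window.append(now)
    if len(window) > MAX_HITS:
        blocked_until[ip] = now + BLOCK_SECONDS
        return False
    return True
```

A normal user on the same dynamic range is only affected if they happen to inherit a blocked IP during its 10-minute cool-down, which is the trade-off Receptional describes.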

rsmarsha

11:25 am on Mar 16, 2007 (gmt 0)

10+ Year Member



That's a nice idea, I'll have to look into how I'd go about doing that. :)

eelixduppy

11:39 am on Mar 16, 2007 (gmt 0)



Make sure you don't block Google, Yahoo, etc... :)

rsmarsha

12:51 pm on Mar 16, 2007 (gmt 0)

10+ Year Member



Yeah, it wouldn't be a problem; I'd look up the ranges for the offenders' ISP and work the block around those.

Not sure how to go about it at present. No idea if you can do something like what was suggested above with Apache's httpd.conf, or if it would have to be a PHP method.

The stats program that gave me the full details (BBClone) only records full details for the last 100 or so visitors. When I get in on Monday I'll take a full printout of which pages they accessed that day and for how long, so I can work out the pages-per-second block as Receptional suggested. :)
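For the httpd.conf route, one existing Apache module that implements roughly this rate-limit-and-block behaviour is mod_evasive (not mentioned in the thread, so treat this as a hedged suggestion); the values below are illustrative, not tuned:

```apache
# Hedged sketch: mod_evasive blocks IPs that exceed per-page or
# per-site request thresholds within a short interval.
<IfModule mod_evasive20.c>
    DOSPageCount        10     # hits on the same page...
    DOSPageInterval     1      # ...within this many seconds
    DOSSiteCount        50     # hits on the whole site...
    DOSSiteInterval     1      # ...within this many seconds
    DOSBlockingPeriod   600    # block the IP for 10 minutes
</IfModule>
```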

motorhaven

11:04 pm on Mar 16, 2007 (gmt 0)

10+ Year Member Top Contributors Of The Month



Set up honey pots that trap them and auto-ban them. Use hidden links that users can't see but the crawlers will. Exclude "valid" crawlers from the honey pot(s) via robots.txt.

Our honey pots trap at least one of these daily.
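The honey-pot pattern motorhaven describes can be sketched like this. A hedged Python illustration: the `/trap/` path, file names, and handler shape are all made up for the example; the real point is that the trap URL is invisible to humans and disallowed in robots.txt, so only misbehaving crawlers ever fetch it:

```python
# Honey-pot sketch: /trap/ is linked invisibly in the page and
# disallowed in robots.txt, so well-behaved crawlers never fetch it.
# Anything that does gets added to a ban list.

BANNED = set()

ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
"""

# Hidden link that users can't see but naive crawlers will follow.
HIDDEN_LINK = '<a href="/trap/page.html" style="display:none">do not follow</a>'

def handle_request(ip, path):
    """Return a response label; ban any IP that hits the trap."""
    if ip in BANNED:
        return "403 Forbidden"
    if path.startswith("/trap/"):
        BANNED.add(ip)  # fell into the honey pot
        return "403 Forbidden"
    return "200 OK"
```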

rsmarsha

8:38 am on Mar 19, 2007 (gmt 0)

10+ Year Member



I already have one set up. The problem with these crawlers is that they are not following the links, just hitting a fixed set of category and product pages. The same ones every day.

motorhaven

11:57 pm on Mar 19, 2007 (gmt 0)

10+ Year Member Top Contributors Of The Month



Are they loading images as well? I have code in place that detects browsers that should load images but don't. Very few legitimate surfers these days turn images off.
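The image check could be done offline against the access log. A hedged Python sketch (the log shape, extension list, and threshold are assumptions for illustration, not motorhaven's actual code): flag any IP that fetches many pages but zero images while its user agent claims to be a normal browser:

```python
# Sketch of the "browsers that should load images but don't" check.
# log_entries is assumed to be an iterable of (ip, path) tuples
# already filtered to browser-like user agents.

IMAGE_EXTS = (".gif", ".jpg", ".jpeg", ".png")

def suspicious_ips(log_entries, min_pages=20):
    """Return IPs with at least min_pages page hits and no image hits."""
    pages, images = {}, {}
    for ip, path in log_entries:
        if path.lower().endswith(IMAGE_EXTS):
            images[ip] = images.get(ip, 0) + 1
        else:
            pages[ip] = pages.get(ip, 0) + 1
    return [ip for ip, n in pages.items()
            if n >= min_pages and images.get(ip, 0) == 0]
```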

rsmarsha

9:18 am on Mar 20, 2007 (gmt 0)

10+ Year Member



Not sure, to be honest; I just get a list of the pages they have visited.

Tastatura

9:25 am on Mar 20, 2007 (gmt 0)

10+ Year Member



The problem with these crawlers is that they are not following the links, just hitting a fixed set of category and product pages. The same ones every day.

This *might* be a competitor keeping tabs on your prices (or someone else interested in your products/prices for whatever reason).

rsmarsha

3:29 pm on Mar 20, 2007 (gmt 0)

10+ Year Member



That's what I'm thinking. Just need to work out a way to stop them from doing it.

Tastatura

4:08 pm on Mar 20, 2007 (gmt 0)

10+ Year Member



That's what I'm thinking. Just need to work out a way to stop them from doing it.

If that is the case, and you can positively identify them (by the particular pages they grab, frequency, etc.), you could have some "fun" with them, such as serving different-than-actual prices for products.
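Once requests are positively identified as the scraper (however that identification is done), the decoy idea reduces to a conditional in the price rendering. A hedged sketch; the function, the identification flag, and the markup range are all illustrative assumptions:

```python
# Sketch of Tastatura's decoy-price idea: serve the real price to
# everyone, but feed an identified scraper a randomly wrong one.
import random

def price_for(request_is_scraper, real_price):
    if request_is_scraper:
        # Randomly skew the price so the scraped data is useless.
        return round(real_price * random.uniform(0.8, 1.3), 2)
    return real_price
```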

rsmarsha

4:45 pm on Mar 20, 2007 (gmt 0)

10+ Year Member



Hehe, I'm liking that idea. :)

I just can't think of a way to determine that it's them, as they are using dynamic IPs.

If I track their pages over a few days I can see whether they are looking at the same pages daily. Not sure how to detect the frequency of page loads (with PHP, or something in the Apache conf, preferably).
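The "same pages daily" tracking idea can work even with dynamic IPs, because the page set itself is the fingerprint. A hedged Python sketch (the similarity threshold and data shapes are assumptions): compare each day's per-visitor page set against the known crawler's page set and flag high overlap:

```python
# Sketch: recognise a scraper across changing IPs by the fixed set of
# category/product pages it fetches every day.

def overlap(a, b):
    """Jaccard similarity of two page sets (0.0 to 1.0)."""
    return len(a & b) / len(a | b) if a or b else 0.0

def likely_same_crawler(today_pages, known_pages, threshold=0.8):
    """True if today's page set largely matches the known crawler's."""
    return overlap(set(today_pages), set(known_pages)) >= threshold
```

A visitor whose daily page set overlaps the known crawler's set above the threshold can then be fed into whatever block or decoy is chosen, regardless of which IP it arrived from that day.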