
How to Stop Crawlers in IIS


dhaliwal

10:49 pm on May 2, 2007 (gmt 0)

10+ Year Member



There are many programs that do not obey the robots.txt file. Only well-behaved crawlers will read this file and stop crawling your website.

What can be done on IIS-based web servers to stop users from running crawling software, which can drastically affect the performance of the web server?

Thanks in advance
Dhaliwal

LifeinAsia

11:01 pm on May 2, 2007 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



You can block by IP. Unfortunately, a lot of bots use many different IPs.
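
For example, a crude version in ASP.NET might look something like this (just an illustrative sketch; the addresses are placeholders, and a real ban list would live in a database or config rather than in code):

// Global.asax.cs -- rejects requests from a hard-coded ban list.
using System;
using System.Collections.Generic;
using System.Web;

public class Global : HttpApplication
{
    // Placeholder addresses; load the real list from storage.
    private static readonly Dictionary<string, bool> BannedIps =
        new Dictionary<string, bool>();

    static Global()
    {
        BannedIps["192.0.2.10"] = true;
        BannedIps["198.51.100.25"] = true;
    }

    protected void Application_BeginRequest(object sender, EventArgs e)
    {
        string ip = Request.UserHostAddress;
        if (ip != null && BannedIps.ContainsKey(ip))
        {
            Response.StatusCode = 403; // Forbidden
            Response.End();
        }
    }
}

(You can also do this in the IIS console itself, under the site's Directory Security / IP address restrictions.)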

dataguy

3:10 pm on May 3, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



We've used a system of counting page views by IP address; if an IP exceeded certain limits over a period of time, it was added to a ban list and the scraper got a friendly message asking it to "please slow down," just in case of false positives.
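
Something along these lines, though this is only a rough sketch (the module name, thresholds, and in-memory dictionary are made up, not our production code; a real version would also need to prune old entries):

// HTTP module that throttles by IP: too many hits in a time window
// gets a "please slow down" response instead of a hard ban.
using System;
using System.Collections.Generic;
using System.Web;

public class ThrottleModule : IHttpModule
{
    private class Counter
    {
        public int Hits;
        public DateTime WindowStart;
    }

    private static readonly Dictionary<string, Counter> Counters =
        new Dictionary<string, Counter>();
    private static readonly object Sync = new object();

    private const int MaxHitsPerWindow = 120;                 // assumed limit
    private static readonly TimeSpan Window = TimeSpan.FromMinutes(1);

    public void Init(HttpApplication app)
    {
        app.BeginRequest += new EventHandler(OnBeginRequest);
    }

    private static void OnBeginRequest(object sender, EventArgs e)
    {
        HttpApplication app = (HttpApplication)sender;
        string ip = app.Request.UserHostAddress;
        if (ip == null) return;

        bool over;
        lock (Sync)
        {
            Counter c;
            if (!Counters.TryGetValue(ip, out c) ||
                DateTime.UtcNow - c.WindowStart > Window)
            {
                c = new Counter();
                c.WindowStart = DateTime.UtcNow;
                Counters[ip] = c;
            }
            c.Hits++;
            over = c.Hits > MaxHitsPerWindow;
        }

        if (over)
        {
            // Soft response rather than a hard ban, in case of false positives.
            app.Response.StatusCode = 503;
            app.Response.Write("Please slow down.");
            app.Response.End();
        }
    }

    public void Dispose() { }
}

The module would be registered under <httpModules> in web.config.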

This system proved very effective, though we decided it was more worthwhile to allow the scrapers in because we benefit from the links they provide. (sounds like a dumb reason, right?)

I believe WebmasterWorld uses a similar system. Some ISPs offer spider traps at the firewall level.

Ocean10000

3:54 pm on May 3, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



On the sites I run, I use a dynamic robots.txt along with a white list of acceptable bots. This filters about 90% of the ones I see on a daily basis: if a bot is not on the white list, it is blocked. Also, by downloading robots.txt they are marked as bots and treated as such. Of the remainder, the roughly 5% that don't grab robots.txt are blocked by the known IP ranges of hosting companies. The roughly 5% that slip through may be caught by aggressive page scraping (grabbing too many pages in a short time) and blocked. Slow scrapers using unknown hosting ranges, spoofing user agents, and never touching the robots.txt file will sneak by for a while, but this is how I eliminated the main trouble-making bots from my sites.

I have done this with ASP.NET 1.0 and ASP.NET 2.0. The sites I am referring to are completely dynamic; static files are filtered through special handlers so they are protected as well.
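
A stripped-down sketch of the robots.txt part might look like this (the handler, whitelist entries, and cache-based flagging here are illustrative, not my actual code; IIS also has to be configured to route robots.txt through ASP.NET):

// Handler mapped to robots.txt via <httpHandlers> in web.config.
using System;
using System.Web;

public class RobotsHandler : IHttpHandler
{
    // Hypothetical whitelist of acceptable bot user-agent fragments.
    private static readonly string[] WhiteList = { "Googlebot", "Slurp", "msnbot" };

    public void ProcessRequest(HttpContext context)
    {
        string ua = context.Request.UserAgent;
        bool allowed = false;
        if (ua != null)
        {
            foreach (string bot in WhiteList)
            {
                if (ua.IndexOf(bot, StringComparison.OrdinalIgnoreCase) >= 0)
                {
                    allowed = true;
                    break;
                }
            }
        }

        // Anyone fetching robots.txt gets flagged as a bot from here on;
        // a real site might record the IP in a database instead.
        context.Cache.Insert("bot:" + context.Request.UserHostAddress, true);

        context.Response.ContentType = "text/plain";
        if (allowed)
            context.Response.Write("User-agent: *\nDisallow:\n");   // full access
        else
            context.Response.Write("User-agent: *\nDisallow: /\n"); // keep out
    }

    public bool IsReusable
    {
        get { return true; }
    }
}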