
Analyzing requests on every pageview vs. once a minute

Pros & cons...


MichaelBluejay

9:43 am on Dec 13, 2006 (gmt 0)

Can I ask why people run their bot-checking scripts on every page request rather than once a minute via a cron job? A one-minute window seems short enough to catch bots before they do much damage, and running your script on every pageview means more server load when the site is busy: if you're getting a page request every second, your anti-bot script runs 60 times a minute versus just once with a cron job. I ask because I'm deciding which to implement, and I'm attracted to the performance advantages of the cron-job method and don't see a significant downside to skipping the per-pageview check.

Or is this too personal a question?
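For concreteness, here is a minimal sketch of the cron-job approach being asked about; the log path, threshold, and blocking mechanism are all illustrative assumptions, not anything prescribed in this thread:

    #!/usr/bin/env python3
    # Scheduled from cron, e.g.:  * * * * * /usr/local/bin/scan_log.py
    import re
    from collections import Counter

    LOG_PATH = "/var/log/apache2/access.log"  # assumed log location
    THRESHOLD = 120                           # assumed requests-per-run limit

    IP_RE = re.compile(r"^(\S+)\s")  # first field of a combined-format line

    def heavy_hitters(lines):
        counts = Counter()
        for line in lines:
            m = IP_RE.match(line)
            if m:
                counts[m.group(1)] += 1
        return [ip for ip, n in counts.items() if n > THRESHOLD]

    if __name__ == "__main__":
        # Naive full-file scan; a real job would remember the previous
        # file offset so each run reads only the last minute of entries.
        with open(LOG_PATH, errors="replace") as f:
            for ip in heavy_hitters(f):
                print("would block", ip)  # e.g. append a deny rule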

incrediBILL

8:55 pm on Dec 13, 2006 (gmt 0)

It's called PROACTIVE vs REACTIVE responses.

I run dynamic sites, and I've had hundreds of page requests in a single minute; by then my server is bogged down and unresponsive until the queue clears out. By being PROACTIVE and checking on every page, I've shut their speedy butts down within a few pages and my server keeps serving. If I only checked once a minute, being REACTIVE, they would have my site at a standstill for quite a few minutes.
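A minimal sketch of that proactive per-request check, assuming a single long-running process; the window and limit are made-up numbers, and a real site would keep the counters in shared storage such as a database or memcached:

    import time
    from collections import defaultdict, deque

    WINDOW = 60    # seconds (assumed)
    MAX_HITS = 30  # allowed requests per window (assumed)

    _recent = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow_request(ip):
        """Call at the top of every page request. Returns False as soon
        as an IP exceeds the limit, so a fast scraper is cut off within
        a few pages instead of after a full minute of damage."""
        now = time.time()
        q = _recent[ip]
        q.append(now)
        while q and now - q[0] > WINDOW:
            q.popleft()
        return len(q) <= MAX_HITS

A page handler would refuse service (or escalate to a CAPTCHA, as discussed further down) whenever allow_request() returns False.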

Besides, there are certain things you can only do in real-time, which you cannot detect in a log file, at least not the standard log files anyway.


MichaelBluejay

9:30 pm on Dec 13, 2006 (gmt 0)

Okay, that's a good explanation, thanks.

Could you elaborate on the things you can do in real-time that you can't do with log files, or would that give too much info to bot-makers?

incrediBILL

10:05 pm on Dec 13, 2006 (gmt 0)

Well, here's a couple of examples...

Most log files don't record whether a request came through a proxy server. Checking in real time lets me track the individuals reaching the site through things like Google's translator, which is an occasional source of scraping, instead of blocking the translator entirely because of too much access.
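A minimal sketch of that kind of real-time proxy check; Via and X-Forwarded-For are standard proxy headers and Proxy-Connection a common nonstandard one, but which of them a given CGI/PHP proxy actually adds varies, so treat the list as illustrative:

    PROXY_HEADERS = ("Via", "X-Forwarded-For", "Proxy-Connection")

    def looks_proxied(headers):
        # headers: a dict-like of request headers with case-insensitive
        # keys, as most web frameworks provide.
        return any(h in headers for h in PROXY_HEADERS)

None of these headers appear in a standard combined-format log, which is the point: the clue exists only at request time.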

There are also some subtle things in the request headers that help me determine whether it's a real mobile device or someone spoofing a cell phone, a Treo, or some such device, and whether it's a CGI/PHP proxy server: lots of little clues that are lost in the log file.
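One illustrative heuristic for the mobile-spoofing case (an assumption about the era's handsets, not necessarily the poster's actual checks): a real phone of the time usually sent a UAProf URL in X-Wap-Profile or Profile, or WAP media types in Accept, so a "phone" user-agent without any of them deserves suspicion:

    MOBILE_TOKENS = ("treo", "blackberry", "symbian", "windows ce")

    def claims_mobile(user_agent):
        ua = user_agent.lower()
        return any(tok in ua for tok in MOBILE_TOKENS)

    def plausibly_mobile(headers):
        # UAProf URL or WAP media types suggest real handset hardware.
        if "X-Wap-Profile" in headers or "Profile" in headers:
            return True
        return "vnd.wap" in headers.get("Accept", "")

    def mobile_spoof_suspected(headers):
        ua = headers.get("User-Agent", "")
        return claims_mobile(ua) and not plausibly_mobile(headers)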

Lastly, with a post-mortem logfile review you can't easily challenge a visitor with a CAPTCHA; by then they've already hit and run. In real time I can see whether the visitor just kept asking for additional pages while CAPTCHAs were displayed (a bot), or answered the CAPTCHA within a couple of tries.
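A sketch of that challenge flow, with in-memory state and made-up thresholds for illustration; the classification mirrors the post: a visitor who keeps requesting content pages while a CAPTCHA is pending behaves like a bot, one who answers within a few tries behaves like a human:

    _state = {}  # ip -> {"pending": bool, "tries": int, "ignored": int}

    def on_request(ip, is_captcha_page, suspicious):
        s = _state.setdefault(ip, {"pending": False, "tries": 0, "ignored": 0})
        if s["pending"] and not is_captcha_page:
            s["ignored"] += 1  # kept crawling instead of answering: bot-like
            return "block" if s["ignored"] > 3 else "challenge"
        if suspicious:
            s["pending"] = True
            return "challenge"
        return "serve"

    def on_captcha_answer(ip, correct):
        s = _state.setdefault(ip, {"pending": True, "tries": 0, "ignored": 0})
        s["tries"] += 1
        if correct:  # answered within a couple of tries: treat as human
            _state.pop(ip, None)
            return "serve"
        return "block" if s["tries"] >= 5 else "challenge"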

incrediBILL

10:09 pm on Dec 13, 2006 (gmt 0)

Oh yes, I also do "REVERSE CLOAKING" and put breadcrumbs into each page served to anyone other than a search engine. The search engine gets a clean copy of the page, but everyone else gets breadcrumbs in the text, cloaked with CSS so you don't see them; they blend into the background of the page. However, scrapers rip out all the HTML and links, so the breadcrumbs are plainly visible in their pages and also show up in the search engines.
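A minimal sketch of the breadcrumb insertion; white-on-white stands in for "blends into the background," and deciding whether a visitor really is a search engine crawler (e.g. by reverse DNS) is its own problem, outside this sketch:

    def add_breadcrumb(html, marker, is_search_engine):
        # Search engines get the clean copy; everyone else gets a
        # CSS-hidden marker. A scraper that strips the HTML leaves
        # the marker text plainly visible in its copies.
        if is_search_engine:
            return html
        hidden = ('<span style="color:#ffffff;background:#ffffff">'
                  + marker + '</span>')
        return html.replace("</body>", hidden + "</body>", 1)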

This makes it easy to link the scraper to the online scrapings, and it's trivial to find them since the search engines point them out to me.

Can't do that reviewing a log file ;)

MichaelBluejay

12:00 am on Dec 14, 2006 (gmt 0)

Interesting. That gives me the idea that one could go a step further and put this invisible text into the page:

Stolen on [date] by [IP/User-Agent] from [mysite]

incrediBILL

12:52 am on Dec 14, 2006 (gmt 0)

You catch on quick.

However, you need a completely unique keyword for locating your stuff; otherwise it's more work to find than it's worth.
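Putting the two ideas together, a sketch of a marker that combines MichaelBluejay's stolen-by text with a fixed unique keyword; the token and site name are illustrative values you would replace with your own:

    import time

    # Generated once and kept constant, so a single search query
    # locates every scraped copy with essentially no false positives.
    SITE_TOKEN = "zq7xvr-4471-marker"  # illustrative, made-up value
    SITE = "example.com"               # assumed site name

    def stolen_marker(ip, user_agent):
        date = time.strftime("%Y-%m-%d")
        return (SITE_TOKEN + " Stolen on " + date + " by " + ip
                + " (" + user_agent + ") from " + SITE)

The returned string is exactly the sort of text the add_breadcrumb() sketch above would hide in the page.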

Tastatura

2:34 am on Dec 14, 2006 (gmt 0)

Cron job vs. real time: there's fairly recent and interesting info on the subject in this thread:
Spider Traps and Honey Pots - Design Considerations [webmasterworld.com]