Forum Moderators: phranque

Blocking Scrapers - How?

old_expat

9:17 am on Feb 25, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yesterday I got 4,000+ hits in 12 hours from what I believe is a scraper, identifying itself with InfoPath.1 and InfoPath.2 in the user-agent. Same thing today.

Here is part of the request in my access log.

"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; GTB5; InfoPath.2; .NET CLR 2.0.50727; OfficeLiveConnector.1.3; OfficeLivePatch.0.0)"

The requests are coming from multiple IPs.

I tried blocking this with a redirect in my .htaccess. I copied the code from an old WW thread.

Did I make a mistake by not using 'InfoPath.1' and 'InfoPath.2' .. ?

or

.. am I taking the wrong approach?

Any suggestions that I can understand will be appreciated.


RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
...
RewriteCond %{HTTP_USER_AGENT} ^InfoPath [OR]
...
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule ^.*$ ['URL'...] [L,R]

[edited by: jdMorgan at 12:52 pm (utc) on Feb. 25, 2009]
[edit reason] Fixed formatting and trimmed [/edit]

jdMorgan

1:07 pm on Feb 25, 2009 (gmt 0)



InfoPath and OfficeLiveConnector are support modules for business collaboration and the new Microsoft "cloud computing" initiative. They are used, for example, so that you can use Microsoft Word running on a computer at Microsoft to edit documents on your own PC. They are being installed automatically by MS Update, and many legitimate MSIE users are probably not aware of this. I cannot recommend blocking user-agents with these strings in them. I would recommend re-analyzing your data to be sure that there are other reasons to consider these user-agents to be scrapers.

The problem is that you have specified in your regular expression that the UA must *start* with InfoPath, and the string does not start with "InfoPath", it starts with "Mozilla". If you need help with regular expressions, then see the tutorial cited in our Forum Charter.
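To illustrate the difference: dropping the `^` anchor makes the pattern match the token anywhere in the user-agent string, not only at its start. The token "BadScraper" below is a placeholder, not a recommendation — as noted above, blocking on InfoPath itself would catch legitimate MSIE users.

```apache
# Matches "BadScraper" anywhere in the User-Agent header (no ^ anchor),
# case-insensitively. "BadScraper" is a placeholder token -- substitute
# a string you have verified belongs only to unwanted clients.
RewriteCond %{HTTP_USER_AGENT} BadScraper [NC]
RewriteRule .* - [F]
```

The `[F]` flag returns a 403 Forbidden response instead of redirecting, which wastes less of your bandwidth on the unwanted client.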

If you still think these are scrapers, then look into checking *all* the request headers they send, including the ones that don't show up in your standard log files. Luckily, many scrapers make mistakes when spoofing user-agents, which makes it easy to block some of them.
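As one sketch of the kind of mistake meant here: a genuine MSIE browser sends an Accept header with every request, while some scrapers that merely paste an MSIE user-agent string omit it. This particular header check is only an example — verify it against your own captured headers before deploying, since some proxies strip headers from real browsers too.

```apache
# Illustrative only: deny requests that claim to be MSIE but send
# no Accept header at all. Confirm against your own header logs
# before using -- some proxies strip headers from real browsers.
RewriteCond %{HTTP_USER_AGENT} MSIE
RewriteCond %{HTTP:Accept} ^$
RewriteRule .* - [F]
```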

Jim

old_expat

2:39 am on Feb 26, 2009 (gmt 0)



Hi Jim,

I see what you mean. I'm certainly inept when it comes to regular expressions.

"If you still think these are scrapers, then look into checking *all* the request headers they send, including the ones that don't show up in your standard log files. Luckily, many scrapers make mistakes when spoofing user-agents, which makes it easy to block some of them."

I haven't a clue how to even start looking into these 2 issues .. especially #2. Can you give me a "jump start"?

jdMorgan

1:50 pm on Feb 26, 2009 (gmt 0)



I use a script to log requests that look suspicious to me, and record all the standard headers sent by the client, many of which are not saved in typical server logs. Then I can use that data to be sure that the requesting user-agent is in fact acting like the user-agent it claims to be. Data collected in this manner can be used to build firewall filter rules and mod_rewrite access-control rules.
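If you don't want to write a script, one way to start capturing headers that standard logs omit is an extra Apache access log with a custom format (this goes in httpd.conf or a vhost, not .htaccess). A sketch — the header names and log path here are illustrative choices, not a complete list:

```apache
# Log extra request headers for later analysis (server config, not .htaccess).
# %{Name}i inserts the value of the named request header.
LogFormat "%h %t \"%r\" UA=\"%{User-Agent}i\" Acc=\"%{Accept}i\" Lang=\"%{Accept-Language}i\" Enc=\"%{Accept-Encoding}i\" Conn=\"%{Connection}i\"" headerdump
CustomLog logs/headers.log headerdump
```

Comparing these extra fields against what a real copy of the claimed browser sends is then a matter of reading the log side by side.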

Jim