HTTP 1.0 and MSIE

Scrapers

Peter

8:58 pm on Jan 28, 2007 (gmt 0)

10+ Year Member



Forgive me if this is a stupid question, but is there any legitimate reason for a user agent containing MSIE - for instance "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT)" - to show in the Apache logs as HTTP/1.0 (and not HTTP/1.1)?

I've noticed that someone who is trying to scrape my content signals himself this way, and I was hoping to block him very easily as a result, but a closer examination of the logs shows that a few apparently legitimate visitors seem to do this as well. Are they in fact legitimate?

Thank you for your comments.

Peter.

jdMorgan

9:46 pm on Jan 28, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If your site is on name-based shared web hosting, then it is by definition inaccessible via pure HTTP/1.0, because such a request carries no Host header to tell the server which site is wanted.
However, some legitimate user-agents, such as Googlebot, advertise as HTTP/1.0 for compatibility reasons. They are actually using 'extended' HTTP/1.0, which does send a Host header, and can therefore access name-based virtual hosts.
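
To make the distinction concrete, here is a minimal Python sketch that labels a raw request as plain HTTP/1.0 or as HTTP/1.0 carrying a Host header. The function name `describe_request` and the example hostname are hypothetical, chosen only for illustration, and the parsing is deliberately simplistic.

```python
def describe_request(raw_request: str) -> str:
    """Label a raw HTTP request by protocol version and Host header presence."""
    # Only the header section (before the first blank line) matters here.
    head = raw_request.split("\r\n\r\n", 1)[0]
    lines = head.split("\r\n")

    # The request line looks like "GET /path HTTP/1.0"; take the last token.
    version = lines[0].rsplit(" ", 1)[-1]

    # Collect header names in lowercase, since header names are case-insensitive.
    headers = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()

    if version == "HTTP/1.0" and "host" in headers:
        return "HTTP/1.0 with Host header"
    if version == "HTTP/1.0":
        return "plain HTTP/1.0"
    return version


# Hypothetical requests for illustration only.
extended = "GET / HTTP/1.0\r\nHost: www.example.com\r\n\r\n"
plain = "GET / HTTP/1.0\r\n\r\n"
print(describe_request(extended))
print(describe_request(plain))
```

Only the first kind of request can be routed to a name-based virtual host; the second gives the server nothing to choose a site by, so it falls through to the default site.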

So, a better way to reject these scrapers is to validate the Windows version. Since "Windows NT" by itself and not followed by a version number is invalid, that's an obvious way to catch them without blocking search engine robots.
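
As a rough illustration of that check, the Python sketch below flags user-agent strings in which "Windows NT" is not followed by a version number. The function name `has_invalid_windows_version` and the exact pattern are assumptions made for this example, not the rule actually deployed on any server; in practice the same test would more likely live in the web server's own configuration.

```python
import re

# Assumption: a well-formed token looks like "Windows NT 5.1".
# A bare "Windows NT" followed by anything other than a version number
# (for example ")" or ";") is treated as invalid, per the rule in the post.
BARE_WINDOWS_NT = re.compile(r"Windows NT(?!\s*\d)")


def has_invalid_windows_version(user_agent: str) -> bool:
    """Return True if the user-agent names "Windows NT" without a version."""
    return bool(BARE_WINDOWS_NT.search(user_agent))


suspect = "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT)"
normal = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
print(has_invalid_windows_version(suspect))  # True
print(has_invalid_windows_version(normal))   # False
```

Because the test looks only at the Windows token, user-agents that do not mention Windows at all, such as search engine robots, are never matched by it.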

Jim

Peter

11:26 pm on Jan 28, 2007 (gmt 0)

10+ Year Member



Thanks for your reply. Yes, I should have said that the site is name-based and is not the default server site.

I seem to get a fair number of legitimate visitors with slightly unlikely user-agents, and I wouldn't normally want to refuse them just for that. What surprised me was to find a few **apparently** legitimate visitors (in addition to my scraping friend) presenting a combination of fake MSIE and HTTP/1.0.

Peter.

jdMorgan

12:14 am on Jan 29, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Your legitimate users may be connecting through HTTP/1.0 proxies -- old equipment at their work, for example.

I suggest you use the invalid user-agent screening as described above. It has worked well for me on dozens of servers for 9+ years, and I have never seen an example of a legitimate client with an invalid Windows version.

Jim

Peter

12:49 am on Jan 29, 2007 (gmt 0)

10+ Year Member



Yes, the use of proxies would explain the HTTP/1.0. For the invalid Windows user-agents, I'll follow your suggestion - as I have so many times in the past! Thanks again.

Peter.