
Search Engine Spider and User Agent Identification Forum

    
Request Protocol
wilderness
msg:4568002 · 5:34 pm on Apr 25, 2013 (gmt 0)

There are a couple of old threads on this topic.
Key_Master suggested this syntax and Jim provided the example.

lucy has mentioned the limited use of HTTP/1.0 more than a few times.

SetEnvIf Request_Protocol HTTP/1\.0$ Bad_Req

Has anybody used it?
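
On its own that SetEnvIf line only sets an environment variable, so it needs a matching access-control stanza to actually refuse the request. A minimal .htaccess sketch, assuming the Apache 2.2-style Order/Deny directives that were current at the time (the Bad_Req name is taken from the line above):

# Flag any request made with the HTTP/1.0 protocol
SetEnvIf Request_Protocol HTTP/1\.0$ Bad_Req

# Refuse flagged requests (Apache 2.2 access-control syntax;
# on 2.4 this needs mod_access_compat or the Require form instead)
Order Allow,Deny
Allow from all
Deny from env=Bad_Req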

 

wilderness
msg:4568251 · 12:37 pm on Apr 26, 2013 (gmt 0)

I didn't document the submission time, but it took more than eight hours to get this new thread approved. DISMAL!

FWIW, I went through a very large log from yesterday, and EVERY HTTP/1.0 request was for ranges either previously denied or since added as harvesters (for lack of a better word).

I've added the line to the site with the large log, and at least it didn't generate a 500 ;)

Will take until tomorrow to determine if it functions as desired.

Don

dstiles
msg:4568305 · 7:19 pm on Apr 26, 2013 (gmt 0)

See my thread on Synapse, which mentions HTTP/1.0.

I am now inclined to the view that Synapse IS a bad bot and should be blocked. It uses HTTP/1.0. I think there are a few smaller but beneficial bots that still use HTTP/1.0 - one of the UK services still used Nutch as a verifier until recently and had to be accommodated.

I do not block HTTP/1.0 specifically, but it is implicit in several of my other blocking methods. Perhaps I should revisit this.

wilderness
msg:4568310 · 7:51 pm on Apr 26, 2013 (gmt 0)

Many of the server-farm IPs are using this protocol for their crawls.

BTW, it does function as intended.
I removed the Class A denial on a bot that was appearing hourly, and the Request_Protocol rule denied its subsequent visits.

Key_Master
msg:4569020 · 9:05 pm on Apr 29, 2013 (gmt 0)

I use a slightly different rule now that enforces strict observance of HTTP/1.1:

SetEnvIf Request_Protocol ^(?!HTTP/1\.1$) ban
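
On servers running Apache 2.4, the corresponding access control could be written with mod_authz_core's Require syntax instead of Order/Deny; a sketch only, reusing Key_Master's "ban" variable:

SetEnvIf Request_Protocol ^(?!HTTP/1\.1$) ban

# Apache 2.4 style: refuse anything flagged above
<RequireAll>
    Require all granted
    Require not env ban
</RequireAll>
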
wilderness
msg:4574847 · 12:32 am on May 17, 2013 (gmt 0)

Just an update after a few weeks.

Adding this line has cut my involvement with my friend's SMF forum down to a few minutes daily. Nearly all of his access issues were related to these protocol requests.

blend27
msg:4575258 · 2:15 pm on May 18, 2013 (gmt 0)

I started writing a detailed response to this thread back in April. Unfortunately it was on one of the Parallels Workstation virtual machine images that got corrupted the next day :( ... anyway....

Most of the open-source scraping packages use HTTP/1.0. Most of the spam bots do. Some proxy servers at large corporations 'still' do as well :(.

In my book, unless the request is sent with the 'full head of headers' it is 99% blockable. On its own, HTTP/1.0 is just an indicator, a strong one, that something is not up to par.

But then again:

ip: 157.55.32.111
remote host: msnbot-157-55-32-111.search.msn.com (0)
method: GET
protocol:HTTP/1.0
User-Agent: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Connection: Keep-Alive
From: bingbot(at)microsoft.com
URI: /robots.txt
Accept: */*
Cache-Control: no-cache
-----------------------------------------------------------------------------------
ip: 83.149.126.98
remote host: 83.149.126.98 (-4)
method: GET
protocol: HTTP/1.0
User-Agent: Mozilla/5.0 (compatible; MJ12bot/v1.4.3; http:// www.majestic12.co.uk/bot.php?+) (UA altered on purpose)
Connection: close
URI: /robots.txt
Accept: */*
Accept-Language: en
-----------------------------------------------------------------------------------

Live visitor via Squid Proxy

ip: 61.90.11.XXX
remote host: ppp-61-90-11-XXX.revip.asianet.co.th (0)
method: GET
protocol: HTTP/1.0
Accept-Encoding: gzip,deflate,sdch
User-Agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17
Via: 1.0 PROXY, 1.0 efw-60.greyhound.co.th:8080 (squid/2.6.STABLE22)
Connection: Keep-Alive
Accept-Charset: windows-874,utf-8;q=0.7,*;q=0.3
Referer: http:// www.google.co.th/imgres?.... blah bla blah
URI: /some.html
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
X-Chrome-Variations: .... blah bla blah==
Cache-Control: max-age=259200
Accept-Language: th-TH,th;q=0.8

-----------------------------

Comment spammer from OVH

ip: 46.105.122.108
remote host: ns384327.ovh.net (0)
time: {ts '2013-05-17 21:34:17'}
method: GET
protocol: HTTP/1.0
host: forum.example.com <----- this site never had a forum
user-agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/22.0.1229.94 Safari/537.4
URI: /
referer: root of the site
accept: image/gif, image/jpeg, image/pjpeg, application/x-ms-application, application/vnd.ms-xpsdocument, application/xaml+xml, application/x-ms-xbap, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, application/x-shockwave-flash, */*
cookie: CFID=5111132; CFTOKEN=830f5b80

Notice that this one also sent cookie information; unfortunately for them, that cookie was not meant for them :)
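
One way to express the 'full head of headers' idea above in mod_rewrite, as a sketch only (it assumes mod_rewrite is enabled, and which headers to insist on is a judgement call - note that the bingbot entry above sends no Accept-Language and would be caught by this):

# Block HTTP/1.0 requests that also arrive without basic browser headers
RewriteCond %{THE_REQUEST} HTTP/1\.0$
RewriteCond %{HTTP_ACCEPT} ^$ [OR]
RewriteCond %{HTTP:Accept-Language} ^$
RewriteRule .* - [F]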

MickeyRoush
msg:4576472 · 11:43 am on May 22, 2013 (gmt 0)

I prefer to use this:

RewriteCond %{THE_REQUEST} !HTTP/1\.1$ [NC]
RewriteRule .* - [F]
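
Given the robots.txt fetches over HTTP/1.0 in blend27's log excerpts, one possible refinement (a sketch only) is an extra condition that leaves robots.txt reachable, so legitimate crawlers can still read the rules file:

RewriteCond %{THE_REQUEST} !HTTP/1\.1$ [NC]
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule .* - [F]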

dstiles
msg:4576645 · 6:46 pm on May 22, 2013 (gmt 0)

Starting last weekend, I rather belatedly began blocking HTTP/1.0 (actually, everything that wasn't 1.1).

Most of the blocks I'm seeing due to this are valid and would in any case have been caught by other means. Those that are not valid catches seem to be proxies, especially for UK education (mainly schools). Annoying, because at least one of my sites is used by schools. Some (but not all) of the proxies also send faulty header field combinations, for which I'd previously had to make exceptions; I've now had to add HTTP/1.0 to the exceptions for these. Doubly annoying because it was the teachers I was trapping, but pupils use the same proxies in some cases and I'm concerned about them abusing the loopholes. Ah, well.

lucy24
msg:4576693 · 8:56 pm on May 22, 2013 (gmt 0)

Do you also get robots using proxies, or are these all legitimate humans? You may be able to poke a hole in the 1.0 block by adding a look at the X-Forwarded-For header.

:: detour here to figure out why randomly chosen request has three separate URLs in this slot ::
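
A sketch of that X-Forwarded-For idea, using SetEnvIf's ability to unset a variable (it assumes the proxies in question actually forward that header):

# Flag HTTP/1.0 requests...
SetEnvIf Request_Protocol HTTP/1\.0$ Bad_Req
# ...but clear the flag when the request carries an X-Forwarded-For
# header, i.e. it came through a proxy identifying the real client
SetEnvIf X-Forwarded-For "." !Bad_Req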

dstiles
msg:4577161 · 6:40 pm on May 23, 2013 (gmt 0)

They're schools. With kids aka juvenile hackers! :)

The proxies can vary within the IP group depending on the individual school's setup, though they are usually different versions of the same basic proxy server.

The proxies have to be genuine, known ones for me to whitelist them, but what goes on behind them I have no idea and little chance of finding out.

I did once have a long email conversation with one of the providers and we ironed out a few of his problems, but on the whole I now have to rely on backup blocking techniques (Oh! A Google bot. How novel and unexpected!).

dstiles
msg:4583988 · 9:28 pm on Jun 13, 2013 (gmt 0)

Further to my recent blocking of HTTP/1.0 requests:

I've had to skip this test when a proxy is involved. Far too many proxies use the HTTP/1.0 protocol, and a customer was complaining (see the note re: schools above). :(

Also, some search engines use HTTP/1.0 at least some of the time. I've been trying to resolve blocking issues with bingbot (rare but noticeable) and the mail.ru bot (always). It looks as if allowing both protocols has solved the problem.

All of which is a shame, because I've had to gradually move the test further down the list, which has slowed it slightly. When I first added the test it sped up the whole checking process. Ah, well, it's not a major time change. :(
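
For reference, exceptions of the kind described here can be expressed as additional SetEnvIf lines that clear the flag. A sketch only - the header tests and user-agent strings are illustrative, not dstiles' actual rules, and any whitelisted agent should still be verified elsewhere (rDNS etc.):

SetEnvIf Request_Protocol ^(?!HTTP/1\.1$) ban
# Let proxied requests through (school and corporate proxies often still speak HTTP/1.0)
SetEnvIf Via "." !ban
SetEnvIf X-Forwarded-For "." !ban
# Let named search-engine agents through
SetEnvIf User-Agent "(bingbot|Mail\.RU_Bot)" !ban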
