Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

This 40 message thread spans 2 pages: 1 [2]
How to ban (compatible ; type requests
Note space between compatible and semicolon
larryhatch




msg:402942
 10:18 am on Jun 23, 2006 (gmt 0)

My access_log files have long had bogus requests designed to mimic organic traffic.

Many of these give themselves away by:
a) Requesting only .html pages, never the associated images.
b) A fast rate of requests, more like a spider than a human.
c) Deliberately obscure alterations to the usual user agents.

A good example of c) is "Mozilla/4.0 (compatible ; etc. "
==> Note the space between 'compatible' and the semicolon ';'.

I want to disallow 'compatible ;', with the strangely placed space -BUT- I have to be careful!

If .htaccess ignores the space as 'whitespace', I will throw away 2/3 of my organic traffic!

1) Does anybody have a known good bullet-proof way to do this?
2) Am I disallowing by USER_AGENT like this?
RewriteCond %{HTTP_USER_AGENT} Java/1 [NC,OR] ..
or is it {HTTP_SOMETHING_ELSE}?

Help much appreciated! -Larry

 

wilderness




msg:3010140
 4:32 am on Jul 16, 2006 (gmt 0)

Doing some follow-up #:3006416

And found the following in old records.

131.107.3.91 - - [13/Sep/2003:19:54:13 -0700] "GET /myfolder/ HTTP/1.0" 200 13010 "-" "Mozilla/4.0 (compatible; MSIE 4.0; Windows NT; ....../1.0 )"
207.46.225.251 - - [13/Sep/2003:19:54:20 -0700] "GET /myfolder HTTP/1.0" 301 310 "-" "Mozilla/4.0 (compatible; MSIE 4.0; Windows NT; ....../1.0 )"

Interesting IP ranges; these crawls were the subject of extensive conversation at that time.

The recent IPs on my site were not in MSN ranges.

wilderness




msg:3010142
 4:36 am on Jul 16, 2006 (gmt 0)

I thought we needed to escape SetEnvIf User-Agent details, a la:

SetEnvIf User-Agent "compatible\ \;" keep_out
SetEnvIf User-Agent "\(compatible\;\ MSIE\ 5\.0\)" keep_out
SetEnvIf User-Agent "\(compatible\;\ MSIE\ 6\.0\)" keep_out
SetEnvIf User-Agent "\(compatible\;\ MSIE\ 7\.0\)" keep_out

Anne,
using quotes around the expression and then attempting to escape the blank spaces is redundant. I'm not sure how it's handled.
It's assuredly NOT necessary.

Also, I've noticed that you use quotes on every single line.
Of approx. 1,600 lines of rewrites, I have fewer than a dozen that use quotes.
You may find reason for more than I do; however, it's definitely NOT required on every line.

Don
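
For reference, a minimal sketch of the two ways a literal space can be carried into a pattern, using the `compatible ;` string from the top of the thread. The `keep_out` variable name follows the examples above; the `Order`/`Allow`/`Deny` lines are an assumption about how that variable is consumed (Apache 1.3/2.x-era access control) and are not shown anywhere in the thread. A space inside the pattern matches only a literal space, so the normal `compatible;` form is not caught.

```apache
# Option A: SetEnvIf. The quotes keep the space inside one regex argument.
SetEnvIf User-Agent "compatible ;" keep_out

# Assumed consumer of keep_out (not shown in the thread).
Order Allow,Deny
Allow from all
Deny from env=keep_out

# Option B: mod_rewrite. The space is backslash-escaped, so no quotes are needed.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} compatible\ ;
RewriteRule .* - [F]
```

Either option alone is enough; they are shown together only for comparison.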

Pfui




msg:3010844
 11:40 pm on Jul 16, 2006 (gmt 0)

I'm intrigued about your no-quotes, Don, because I must have a million of them. Am not sure if I simply kept replicating some snippet I found somewhere, or misinterpreted this to mean quote-and-escape everything or what:

Example:

SetEnvIf Request_URI "\.gif$" object_is_image=gif
SetEnvIf Request_URI "\.jpg$" object_is_image=jpg
SetEnvIf Request_URI "\.xbm$" object_is_image=xbm

[httpd.apache.org...]

Or maybe I picked it up from referers and/or arrays?

SetEnvIfNoCase User-Agent "(curl|libcurl|libcurl-agent)" keep_out

SetEnvIf Request_URI "exec" keep_out

SetEnvIfNoCase Referer "^http://(www\.)?example\." keep_out
SetEnvIfNoCase Referer "localhost|server|example|robots" keep_out

[webmasterworld.com...]

## NAMEPROTECT.COM BOT: 12.175.0.32 - 12.175.0.47
SetEnvIf Remote_Addr "12\.175\.0\.[0-9]+" keep_out

Beats heck out of me. But at least I can attest that quotes don't prevent anything from happening:)

You don't use them with any SetEnv? Not even any of the above?

[edited by: Pfui at 11:44 pm (utc) on July 16, 2006]

wilderness




msg:3010876
 12:20 am on Jul 17, 2006 (gmt 0)

SetEnvIf Request_URI "\.gif$" object_is_image=gif
SetEnvIf Request_URI "\.jpg$" object_is_image=jpg
SetEnvIf Request_URI "\.xbm$" object_is_image=xbm

This would all function as intended without the use of quotes.
SetEnvIf Request_URI \.gif$ object_is_image=gif
SetEnvIf Request_URI \.jpg$ object_is_image=jpg
SetEnvIf Request_URI \.xbm$ object_is_image=xbm

SetEnvIfNoCase User-Agent "(curl|libcurl|libcurl-agent)" keep_out

same here
SetEnvIfNoCase User-Agent (curl|libcurl|libcurl-agent) keep_out

SetEnvIf Remote_Addr "12\.175\.0\.[0-9]+" keep_out

I don't understand this line (however, I'm most positive the quotes are redundant here as well).
What's the ending plus sign for?
An example of a CIDR range is provided on the DNS Stuff Box as follows:
192.168.112.0/24
end of quote.
I'm not at all sure you may use the [0-9] expression in SetEnvIf. I don't.

I use a Rewrite for NameProtect
RewriteCond %{REMOTE_ADDR} ^12\.175\.0\.(3[2-9]|4[0-7])$ [OR]
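
As a side note on the plus-sign question above: in a regular expression, `+` means "one or more of the preceding item", so `[0-9]+` matches any run of digits, and the quoted SetEnvIf pattern covers every 12.175.0.x address, not just .32 to .47. Character classes such as `[0-9]` are ordinary regex syntax and are accepted by SetEnvIf. Below is a sketch of two narrower alternatives, assuming the range in the comment (12.175.0.32 - 12.175.0.47) is the intended target; that range is exactly the CIDR block 12.175.0.32/28.

```apache
# SetEnvIf form of the anchored pattern used in the RewriteCond above.
SetEnvIf Remote_Addr ^12\.175\.0\.(3[2-9]|4[0-7])$ keep_out

# Or skip the regex and deny the CIDR block directly
# (Apache 1.3/2.x access-control syntax).
Deny from 12.175.0.32/28
```

Only one of the two is needed; the `keep_out` variable still has to be acted on elsewhere in the configuration.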

You don't use them with any SetEnv?

of 378 lines of SetEnvIf;
I have a mere three instances where I use quotes.
And, one was recently added (the subject that I added to this thread).

Beats heck out of me. But at least I can attest that quotes don't prevent anything from happening


The extra quotes are redundant and really shouldn't stop anything from functioning; however, adding them unnecessarily is bad practice.
There may come a time when you have a syntax error in your htaccess and you're required to go through it character by character and line by line to locate the error.
I can tell you that the more crap (in this instance, redundancy) you have, the more of your hair you will be pulling out!
Of course, all that will be after you have banged your head into the wall a couple of times for allowing yourself to make the syntax error in the first place.

Don

Pfui




msg:3012244
 9:58 pm on Jul 17, 2006 (gmt 0)

Don, thanks for extended reply. As always, rewrite/regex details take me a while to digest:) I'll look around for where I got that ending plus sign (was a thread here).

Jim, over the weekend I used this (from #:3007302), with pipes and spaces fixed --

# Missing Windows NT version number
RewriteCond %{HTTP_USER_AGENT} Windows\ NT
RewriteCond %{HTTP_USER_AGENT} !Windows\ NT\ (4\.0|5\.[0-2])(\)|;\ [^)])
RewriteRule .* - [F]

-- and blocked a poor AOL'r using this:

Mozilla/4.0 (compatible; MSIE 5.01; AOL 5.0; Windows NT; {BBF3CA51-22C0-11D9-B66A-00B0D0C36340})

FWIW

[edited by: Pfui at 10:00 pm (utc) on July 17, 2006]

incrediBILL




msg:3012350
 11:18 pm on Jul 17, 2006 (gmt 0)

You sure it was a legit AOL'er?

The MSIE user agent ALWAYS has a version # after "Windows NT" so "Windows NT;" should be invalid.

Just because they are on AOL doesn't mean they aren't using automated tools, or that the AOL part of the user agent isn't spoofed as that's the nature of stealth, remaining hidden.

jdMorgan




msg:3012361
 11:31 pm on Jul 17, 2006 (gmt 0)

That's a spoof:

> Mozilla/4.0 (compatible; MSIE 5.01; AOL 5.0; Windows NT; {BBF3CA51-22C0-11D9-B66A-00B0D0C36340})

If your visitors are hi-tech, you might want to add a further mod to allow Windows Vista Beta testers:

# Missing Windows NT version number
RewriteCond %{HTTP_USER_AGENT} Windows\ NT
RewriteCond %{HTTP_USER_AGENT} !Windows\ NT\ (4\.0|5\.[0-2]|6\.0)(\)|;\ [^)])
RewriteRule .* - [F]
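
Read together, the two conditions say: the user agent mentions `Windows NT`, but not followed by a recognised version number and then either `)` or `; ` plus more text. Below is a commented sketch of how a few strings fare. The first two sample strings are illustrative assumptions; the third is the one reported above.

```apache
# Allowed: version present, then ")"
#   Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
# Allowed: version present, then "; " and more tokens
#   Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
# Blocked: "Windows NT" followed directly by ";" (no version number)
#   Mozilla/4.0 (compatible; MSIE 5.01; AOL 5.0; Windows NT; {BBF3CA51-22C0-11D9-B66A-00B0D0C36340})
RewriteCond %{HTTP_USER_AGENT} Windows\ NT
RewriteCond %{HTTP_USER_AGENT} !Windows\ NT\ (4\.0|5\.[0-2]|6\.0)(\)|;\ [^)])
RewriteRule .* - [F]
```

Consecutive RewriteCond lines without `[OR]` are ANDed, so the rule fires only when both hold.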

Pfui




msg:3012410
 12:24 am on Jul 18, 2006 (gmt 0)

By spoof do you mean the visitor fabricated that UA? Um, not so sure about that.

The UA first 403'd, so he wrote me -- from an AOL account, from mx.aol.com, with: "X-Mailer: AOL 5.0 for Windows sub 108" -- about how he suddenly couldn't get in.

So I sent him to a private page off-site where a script details three Environment Variables to the browser (I forget the nick phrase for that), which he then e-mailed back to me --

HOST: cache-dtc-ae10.proxy.aol.com
ADDR: 205.188.117.14

APPL: Mozilla/4.0 (compatible; MSIE 5.01; AOL 5.0; Windows NT;
{BBF3CA51-22C0-11D9-B66A-00B0D0C36340})

-- and this is how another script logged that same access:

[17/Jul/2006:14:29:08]
- /index.html
- GET
- 205.188.117.14
- cache-dtc-ae10.proxy.aol.com
- [H_REF]
- Mozilla/4.0 (compatible; MSIE 5.01; AOL 5.0; Windows NT; {BBF3CA51-22C0-11D9-B66A-00B0D0C36340})

The fellow doesn't sound like a geek such that he'd dream up that UA. But goodness knows what he has on board with what looks like a torturously long registration number.

If I have a chance, I'll ask him if he knows what a "UA string" is... (Might be a few days, tho', sorry. It's gonna be a heckuva week.)

(Dan: Details not obfuscated because it's an AOL server.)

incrediBILL




msg:3012453
 1:19 am on Jul 18, 2006 (gmt 0)

Perhaps he was annoyed you stopped his scraping?

I get a TON of AOL users, and that UA is spoofed or modified somehow: it is NOT a legit UA, or I'd be blocking tons of AOLers, and I'm not. As a matter of fact, the combination of "AOL 5.0; Windows NT;" doesn't even show up in my archive going back almost a year.

Nobody said a scraper had to be hi-tech either, they get some script and it crawls and spits out websites, no brains required.

wilderness




msg:3012590
 3:06 am on Jul 18, 2006 (gmt 0)

Nobody said a scraper had to be hi-tech either, they get some script and it crawls and spits out websites, no brains required.

For a long time I had quite a difficulty in accepting any credibility from anybody who would use AOL as their internet provider. It's still difficult to digest the reasons why a user would accept the restricted internet of AOL's tunnel vision.

Some folks just remain committed to the provider for reasons that the majority cannot comprehend.
I have a friend whose wife handles the majority of the internet management, although my friend has progressed far beyond any capacity I thought he would.
These folks have a hi-speed cable connection and then connect to AOL ;)
Their reasoning is that many of the family members use AOL as their provider and it affords them all a community of interaction that they enjoy. (Try as I may, I just cannot understand the logic, especially with the many alternatives.)

In spite of all this both my friend and his Mrs. are quite intelligent people. I wouldn't call the Mrs. a geek, however when she sets her mind to it, she has no trouble finding a computer method to accomplish what she desires.

If the majority of scrapers were hi-tech,
we as webmasters would have a more difficult time stopping them in their tracks with otherwise simple procedures.
