Many of these give themselves away by:
a) Requesting only .html pages, never the associated images ..
b) Fast rate of requests, more like a spider than a human ..
c) Deliberately obscure alterations to the usual user agents
A good example of c) is "Mozilla/4.0 (compatible ; etc. "
==> Note the space between 'compatible' and the semicolon ';'.
I want to disallow 'compatible ;', with the strangely placed space -BUT- I have to be careful!
If .htaccess ignores the space as 'whitespace', I will throw away 2/3 of my organic traffic!
1) Does anybody have a known good bullet-proof way to do this?
2) Am I disallowing by USER_AGENT like this?
RewriteCond %{HTTP_USER_AGENT} Java/1 [NC,OR] ..
or is it {HTTP_SOMETHING_ELSE}?
Help much appreciated! -Larry
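A minimal sketch of one way to do this with mod_rewrite, assuming mod_rewrite is enabled and the rules live in .htaccess. %{HTTP_USER_AGENT} is the right variable for the User-Agent header. A space in a regular expression is matched literally rather than skipped as whitespace; the catch is only that Apache splits directive arguments on spaces, so the pattern is quoted here. Untested against this particular site, so watch the logs before relying on it.

```apache
RewriteEngine On
# Match only the odd form "compatible ;" (space before the semicolon).
# The usual "compatible;" has no space there, so it does not match.
RewriteCond %{HTTP_USER_AGENT} "compatible ;"
# Answer any matching request with 403 Forbidden.
RewriteRule .* - [F]
```

No [NC] flag is used, so the match stays case-sensitive and as narrow as possible.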
And found the following in old records.
131.107.3.91 - - [13/Sep/2003:19:54:13 -0700] "GET /myfolder/ HTTP/1.0" 200 13010 "-" "Mozilla/4.0 (compatible; MSIE 4.0; Windows NT; ....../1.0 )"
207.46.225.251 - - [13/Sep/2003:19:54:20 -0700] "GET /myfolder HTTP/1.0" 301 310 "-" "Mozilla/4.0 (compatible; MSIE 4.0; Windows NT; ....../1.0 )"
Interesting IP ranges; these crawls were the subject of extensive conversation at that time.
The recent IPs on my site were not in MSN ranges.
I thought we needed to escape SetEnvIf User-Agent details, a la:
SetEnvIf User-Agent "compatible\ \;" keep_out
SetEnvIf User-Agent "\(compatible\;\ MSIE\ 5\.0\)" keep_out
SetEnvIf User-Agent "\(compatible\;\ MSIE\ 6\.0\)" keep_out
SetEnvIf User-Agent "\(compatible\;\ MSIE\ 7\.0\)" keep_out
Anne,
Using quotes around the expression and then attempting to escape the blank spaces is redundant. I'm not sure how it's handled.
It's assuredly NOT necessary.
Also I've noticed that you use quotes on every single line.
From approx. 1,600 lines of rewrites, I have fewer than a dozen that use quotes.
You may find reason for more than I; however, it's definitely NOT required on every line.
Don
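For what it's worth, here is a rough sketch of how a keep_out flag like the ones above might actually be put to use. It assumes the Apache 2.0/2.2 access directives (Order/Allow/Deny) that were current when this thread was written; Apache 2.4 replaced them with Require. The pattern is quoted and the extra backslashes before the spaces and semicolon are dropped, since inside double quotes those characters are already literal. I've kept the quotes because the pattern contains spaces, and the quoted form is the one I'd be confident about in that case. Treat it as an illustration, not a tested drop-in.

```apache
# Flag a bare "(compatible; MSIE 5.0)" User-Agent with no OS details.
# The parentheses and the dot are escaped because they are regex operators;
# the spaces and the semicolon need no escaping inside the quotes.
SetEnvIf User-Agent "\(compatible; MSIE 5\.0\)" keep_out

# Apache 2.0/2.2 style: allow everyone, then deny flagged requests.
Order Allow,Deny
Allow from all
Deny from env=keep_out
```

With Order Allow,Deny the Deny line wins whenever both match, so a flagged request gets a 403 even though "Allow from all" also applies to it.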
Example:
SetEnvIf Request_URI "\.gif$" object_is_image=gif
SetEnvIf Request_URI "\.jpg$" object_is_image=jpg
SetEnvIf Request_URI "\.xbm$" object_is_image=xbm
[httpd.apache.org...]
Or maybe I picked it up from referers and/or arrays?
SetEnvIfNoCase User-Agent "(curl|libcurl|libcurl-agent)" keep_out
SetEnvIf Request_URI "exec" keep_out
SetEnvIfNoCase Referer "^http://(www\.)?example\." keep_out
SetEnvIfNoCase Referer "localhost|server|example|robots" keep_out
## NAMEPROTECT.COM BOT: 12.175.0.32 - 12.175.0.47
SetEnvIf Remote_Addr "12\.175\.0\.[0-9]+" keep_out
Beats heck out of me. But at least I can attest that quotes don't prevent anything from happening:)
You don't use them with any SetEnv? Not even any of the above?
[edited by: Pfui at 11:44 pm (utc) on July 16, 2006]
SetEnvIf Request_URI "\.gif$" object_is_image=gif
SetEnvIf Request_URI "\.jpg$" object_is_image=jpg
SetEnvIf Request_URI "\.xbm$" object_is_image=xbm
This would all function as intended without the use of quotes.
SetEnvIf Request_URI \.gif$ object_is_image=gif
SetEnvIf Request_URI \.jpg$ object_is_image=jpg
SetEnvIf Request_URI \.xbm$ object_is_image=xbm
SetEnvIfNoCase User-Agent "(curl|libcurl|libcurl-agent)" keep_out
same here
SetEnvIfNoCase User-Agent (curl|libcurl|libcurl-agent) keep_out
SetEnvIf Remote_Addr "12\.175\.0\.[0-9]+" keep_out
I don't understand this line (however, I'm almost positive the quotes are redundant as well).
What's the ending plus sign for?
An example of a CIDR range is provided on the DNS Stuff Box as follows:
192.168.112.0/24)
end of quote.
I'm not at all sure you may use the [0-9] expression in SetEnvIf. I don't.
I use a Rewrite for NameProtect
RewriteCond %{REMOTE_ADDR} ^12\.175\.0\.(3[2-9]|4[0-7])$ [OR]
You don't use them with any SetEnv?
Of 378 lines of SetEnvIf, I have a mere three instances where I use quotes.
And, one was recently added (the subject that I added to this thread).
Beats heck out of me. But at least I can attest that quotes don't prevent anything from happening
Don
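On the plus sign and the range: in a regular expression, [0-9]+ means "one or more digits", so the unanchored pattern 12\.175\.0\.[0-9]+ matches every address in 12.175.0.x, and also any longer address that merely contains that text, such as 112.175.0.5. Below is a sketch of two narrower ways to cover only 12.175.0.32 - 12.175.0.47, assuming Apache 2.0/2.2-era directives. The /28 figure is my own arithmetic (a block of 16 addresses starting at .32), not something taken from the posts above.

```apache
# Regex form: anchored at both ends, last octet limited to 32-47.
SetEnvIf Remote_Addr "^12\.175\.0\.(3[2-9]|4[0-7])$" keep_out

# CIDR form: 12.175.0.32/28 spans 12.175.0.32 through 12.175.0.47.
# Takes effect under whatever Order/Allow/Deny setup is already in place.
Deny from 12.175.0.32/28
```

Either form should be enough on its own; the SetEnvIf line only sets the flag, so it still needs a "Deny from env=keep_out" (or equivalent) somewhere to do any blocking.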
Jim, over the weekend I used this (from #:3007302), with pipes and spaces fixed --
# Missing Windows NT version number
RewriteCond %{HTTP_USER_AGENT} Windows\ NT
RewriteCond %{HTTP_USER_AGENT} !Windows\ NT\ (4\.0|5\.[0-2])(\)|;\ [^)])
RewriteRule .* - [F]
-- and blocked a poor AOL'r using this:
Mozilla/4.0 (compatible; MSIE 5.01; AOL 5.0; Windows NT; {BBF3CA51-22C0-11D9-B66A-00B0D0C36340})
FWIW
[edited by: Pfui at 10:00 pm (utc) on July 17, 2006]
The MSIE user agent ALWAYS has a version # after "Windows NT" so "Windows NT;" should be invalid.
Just because they are on AOL doesn't mean they aren't using automated tools, or that the AOL part of the user agent isn't spoofed. That's the nature of stealth: remaining hidden.
> Mozilla/4.0 (compatible; MSIE 5.01; AOL 5.0; Windows NT; {BBF3CA51-22C0-11D9-B66A-00B0D0C36340})
If your visitors are hi-tech, you might want to add a further mod to allow Windows Vista Beta testers:
# Missing Windows NT version number
RewriteCond %{HTTP_USER_AGENT} Windows\ NT
RewriteCond %{HTTP_USER_AGENT} !Windows\ NT\ (4\.0|5\.[0-2]|6\.0)(\)|;\ [^)])
RewriteRule .* - [F]
The UA first 403'd, so he wrote me -- from an AOL account, from mx.aol.com, with: "X-Mailer: AOL 5.0 for Windows sub 108" -- about how he suddenly couldn't get in.
So I sent him to a private page off-site where a script details three Environment Variables to the browser (I forget the nick phrase for that), which he then e-mailed back to me --
HOST: cache-dtc-ae10.proxy.aol.com
ADDR: 205.188.117.14
APPL: Mozilla/4.0 (compatible; MSIE 5.01; AOL 5.0; Windows NT;
{BBF3CA51-22C0-11D9-B66A-00B0D0C36340})
-- and this is how another script logged that same access:
[17/Jul/2006:14:29:08]
- /index.html
- GET
- 205.188.117.14
- cache-dtc-ae10.proxy.aol.com
- [H_REF]
- Mozilla/4.0 (compatible; MSIE 5.01; AOL 5.0; Windows NT; {BBF3CA51-22C0-11D9-B66A-00B0D0C36340})
The fellow doesn't sound like a geek such that he'd dream up that UA. But goodness knows what he has on board with what looks like a torturously long registration number.
If I have a chance, I'll ask him if he knows what a "UA string" is... (Might be a few days, tho', sorry. It's gonna be a heckuva week.)
(Dan: Details not obfuscated because it's an AOL server.)
I get a TON of AOL users, and it's spoofed or modified somehow, as that is NOT a legit UA or I'd be blocking tons of AOLers and I'm not. As a matter of fact, the combination of "AOL 5.0; Windows NT;" doesn't even show up in my archive going back almost a year.
Nobody said a scraper had to be hi-tech either, they get some script and it crawls and spits out websites, no brains required.
> Nobody said a scraper had to be hi-tech either, they get some script and it crawls and spits out websites, no brains required.
For a long time I had quite a difficulty in accepting any credibility from anybody who would use AOL as their internet provider. It's still difficult to digest the reasons why a user would accept the restricted internet of AOL's tunnel vision.
Some folks just remain committed to the provider for reasons that the majority cannot comprehend.
I have a friend whose wife handles the majority of the internet management, although my friend has progressed far beyond any capacity I thought he would.
These folks have a hi-speed cable connection and then connect to AOL ;)
Their reasoning is that many of the family members use AOL as their provider and it affords them all a community of interaction that they enjoy. (Try as I may, I just cannot understand the logic, especially with the many alternatives.)
In spite of all this both my friend and his Mrs. are quite intelligent people. I wouldn't call the Mrs. a geek, however when she sets her mind to it, she has no trouble finding a computer method to accomplish what she desires.
If the majority of scrapers were hi-tech, we as webmasters would have a more difficult time stopping them in their tracks with otherwise simple procedures.