What's wrong with this simple .htaccess file?

Works for one phony spider, not another ..

         

larryhatch

2:53 am on Dec 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Below is my exemplified .htaccess file.

It appears to disallow all the Missigua hits, but Larbin sails right through ..

RewriteEngine On
RewriteCond %{HTTP_HOST} !^www\.mysite\.net [NC]
RewriteRule ^(.*)$ [mysite.net...] [R=301,L]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^Missigua [NC]
RewriteRule .* - [F]

- - -

Am I missing a wildcard symbol or something like that?
Is 'larbin' case-sensitive? How do I get around that if so?
Are both Larbin and Missigua considered USER AGENTS, or am I specifying the wrong field?

Can somebody do a quick and dirty fix?
I'm weak at this, and afraid to change things without advice.

If you copy the entire short contents back with fixes it will be far easier for me.
Then I can go ahead and add some other 'jewels' to my sh** list. -Larry

GaryK

4:04 pm on Dec 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Larry,

I don't know a thing about .htaccess files. I can tell you about user agents.

Larbin does not always identify itself as larbin. In the last three months I've seen such variations as:

larbin_2.6.3
Mozilla/5.0 (larbin@unspecified.mail)
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Larbin/2.6.3
larbin_extended

Hopefully whoever winds up providing you with a fix will know about the variations on larbin and take them into account.

kevinpate

4:17 pm on Dec 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Based on my own file, try this approach:
CHANGE: ^larbin TO READ: "larbin"
Leave the rest of it as you have it.

Making the change should block all variations of larbin in a UA.

Span

4:47 pm on Dec 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



"Making the change should block all variations of larbin in a UA." - kevinpate

When you take the start anchor out of "^larbin", the pattern that gets a 403 is still only a lowercase "larbin". To match "larbin" or "Larbin" anywhere in a UA string you have to use the NC (No Case) flag at the end of the RewriteCond.


RewriteCond %{HTTP_USER_AGENT} larbin [NC,OR]
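
As a rough sketch (not tested here), this is how the unanchored, case-insensitive condition would sit in a complete rule. The User-Agent strings in the comments are the variations GaryK listed above; each contains "larbin" in some mix of case, so each should get a 403.

```apache
# Sketch only: ban any User-Agent containing "larbin", in any case.
# Expected to match all of the variations reported in this thread:
#   larbin_2.6.3
#   Mozilla/5.0 (larbin@unspecified.mail)
#   Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Larbin/2.6.3
#   larbin_extended
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} larbin [NC]
RewriteRule .* - [F]
```

With the original ^larbin and no NC flag, only the first and last of those four strings would have matched, since the other two start with "Mozilla" and one of them spells it "Larbin".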

larryhatch

9:29 pm on Dec 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks much guys!

OK, let's say I use the two lines

RewriteCond %{HTTP_USER_AGENT} "larbin" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Missigua [NC]

Note that I changed ^larbin to "larbin" in quotes,
does that ban ALL larbins?
regardless of position in long complex string, and even AFTER Mozilla X.X yadda yadda?
Even the larbin@nobodyhome phony email part?
I want every kind of larbin/Larbin out of here.

I also added the NC in [NC,OR] to cover upper and lower case.

Much appreciated! -Larry

Span

9:50 pm on Dec 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Okay Larry. Who said use quotes? Get rid of them. A pattern with no spaces in it does not need them, and as far as I know Apache just strips them when it reads the line, so they add nothing. No quotes, Larry!

RewriteCond %{HTTP_USER_AGENT} larbin [NC,OR]

That line will ban every UA string with the word Larbin or larbin anywhere in it. Even if that string is somethinlikelarbinvisitingLarry.

jdMorgan

10:02 pm on Dec 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'd also argue in favor of changing the order of your two rulesets. After all, why waste bandwidth redirecting bad user-agents? -- If they come in using the 'wrong' domain, you are telling them to re-request the resource from your server using the correct domain, which -- if they comply -- will result in a second request to your server, and only then will you return a 403.

I'd recommend 403ing them and then only redirect the 'good' user-agents that pass your access control restrictions.

Jim

larryhatch

10:20 pm on Dec 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



OK, let's try this on for size

RewriteEngine On

# disallow weenies
RewriteCond %{HTTP_USER_AGENT} larbin [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Missigua [NC]
RewriteRule .* - [F]

# redirect non-www to www.
RewriteCond %{HTTP_HOST} !^www\.mysite\.net [NC]
RewriteRule ^(.*)$ [mysite.net...] [R=301,L]

Note: Removed quote marks around "larbin" (thanks Span)
Re-ordered rewrite rules so larbin and Missigua, case-insensitive, are disallowed BEFORE redirecting to www.
Added # comments for clarity. Is this OK as written?

Are blank lines for easy reading OK also? Or should those have the # character too? -Larry

larryhatch

10:22 pm on Dec 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'm off to work now. Back in 12 hours. Thanks again! -Larry

keyplyr

8:52 am on Dec 8, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



...changing the order of your two rulesets. After all, why waste bandwidth redirecting bad user-agents? - jdMorgan

Amazing, Jim has (once again) caused some clarity within the confines of what I affectionately refer to as my mind. So obvious, yet after years of staring at this file, it never occurred to me to filter out undesirables 'before' redirecting them. My daily logs just decreased by 10%.

larryhatch

11:04 am on Dec 8, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Heck yes! Excellent suggestion. Ban first, sort / redirect later if at all.

Anyhow, I went ahead and implemented the new .htaccess.
Soon I will see if I canned the various 'larbin' critters.

Next comes Java/1.4.1_04, which sucked down half my site in seconds.
That's on the boards as an email spammer fishing for addresses or the like.

I'm not ready to ban Jakarta-Commons yet, that supposedly has legitimate uses.

New question! One highly suspicious agent comes in as 'Microsoft URL Control' or some such.
Just spaces between the words, no dashes or anything like that.

I'm told NOT to put UAs in quotes. Does that mean a ban would look like this?

RewriteCond %{HTTP_USER_AGENT} Microsoft URL Control [NC,OR]

-- or should I go ahead with the quotes like this?

RewriteCond %{HTTP_USER_AGENT} "Microsoft URL Control" [NC,OR]

One guy out there noticed lots of phony hits with legitimate looking UAs except for one thing:
Instead of Mozilla 4.0 compatible; , he was seeing Mozilla 4.0 compatible ;
[ NOTE the added space before the semicolon ; !]

Would THAT be worth banning, or is it too risky?

I don't want to ban legitimate traffic, I want visitors to see my stuff. -Larry

Sorry for all the questions. -Larry

Span

11:27 am on Dec 8, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Microsoft URL Control is a spambot. Ban it like this:


RewriteCond %{HTTP_USER_AGENT} ^Microsoft\ URL\ Control [NC,OR]

There is a start anchor ^ because this string always looks like this, and you have to escape the spaces with backslashes.
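
The reason for the backslashes, as a hedged sketch: mod_rewrite splits a RewriteCond line on whitespace, so an unescaped space ends the pattern early. The first (commented-out) form below would most likely be read as the pattern "Microsoft" followed by "URL" where the flags should be, and rejected as a syntax error. The second keeps the whole string as one pattern. The fragment assumes RewriteEngine On appears earlier in the file.

```apache
# Broken: the pattern ends at the first space, and "URL" lands where the flags belong.
# RewriteCond %{HTTP_USER_AGENT} Microsoft URL Control [NC]

# Working sketch: each space is escaped, so the three words stay one pattern.
RewriteCond %{HTTP_USER_AGENT} ^Microsoft\ URL\ Control [NC]
RewriteRule .* - [F]
```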

You can learn a lot by looking up UA strings in Google. Or browse through Andreas Staeding's database of UAs: [psychedelix.com ]

larryhatch

11:32 am on Dec 8, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Wonderful! Thanks Span, you're a prince.

Microsoft SPAM control is banned as of now.

I sent the file up as ASCII rather than in Binary on advice from others.
Is this correct? -Larry

Span

12:03 pm on Dec 8, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



ASCII is ok.

And yes, there are a lot of strange Mozilla UAs out there. The example from your post is definitely not a normal user. But you really should only ban UAs that you've actually seen on your site. Don't just copy someone else's list, or you end up banning UAs that will never visit your site or that no one uses anymore.


RewriteCond %{HTTP_USER_AGENT} ^Mozilla/3\.0\(compatible\)$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/3\.0\ \(compatible\)$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla\(IE\ Compatible\)$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^mozilla\ 4\.0$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/4\.0$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/4\.0\ \(compatible;\ MSIE\ 4\.00;\)$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/4\.0\ \(compatible\)$ [NC,OR]
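
Those lines read as an excerpt from a longer list, since every one ends in OR. A minimal sketch of how such a list is usually closed off, assuming these three are the only entries: the last RewriteCond drops the OR flag and the RewriteRule follows it directly. As far as I know, leaving OR on the final condition makes the rule fire even when none of the conditions match, which here would 403 every visitor.

```apache
# Sketch: exact-match bans (note the ^ and $ anchors), OR'ed together.
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/4\.0$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/4\.0\ \(compatible;\ MSIE\ 4\.00;\)$ [NC,OR]
# Last condition in the group: no OR.
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/4\.0\ \(compatible\)$ [NC]
RewriteRule .* - [F]
```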

larryhatch

5:15 pm on Dec 17, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hello Span (and others)

Earlier you wrote:

" Microsoft URL Control is a spambot. Ban it like this:
RewriteCond %{HTTP_USER_AGENT} ^Microsoft\ URL\ Control [NC,OR]

"There is a start anchor ^ because this string always looks like this and you have to escape the spaces with the backwards slashes.

- - -

Now Span, I have another ugly critter.

This one comes into my logs as 'forum.XYZ.nl'

XYZ (exemplified) hotlinks several images. Their particular forum page
is a million lines long, mostly scraped stuff from my and other sites.
I'm sure G and Y have sent them to 'supplemental' purgatory long ago,
but they have so many visitors it's driving me nuts.

Here's the issue: I tried the following line to ban the whole site:

RewriteCond %{HTTP_USER_AGENT} forum.XYZ.nl [NC,OR] # 13DEC05

.. and it didn't work. The Dutchmen come bombing right through.

So now, I am wondering about your mention of the \ (escape) character.

Is it the dots (.) which are screwing this up?

The XYZ part is sufficiently unlikely, something like 'FOK'
but I prefer to ban very selectively.

In short, do I need to 'escape' DOTS just like you said I should do with spaces?

Drunken and Dizzy in Redwood City aka -Larry

jdMorgan

5:22 pm on Dec 17, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You need to use %{HTTP_REFERER} if you want to block a specific referrer rather than a user-agent...

# 13DEC05
RewriteCond %{HTTP_REFERER} forum\.XYZ\.nl [NC,OR]

Also, as shown, do not put comments on the same line as code -- mod_rewrite doesn't like this, and will slow down your server generating 'Warnings' -- whether you see these warnings in your error log or not.

Jim
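
A minimal sketch of the referrer ban as a standalone rule, assuming it is the last (or only) condition in its group, so the OR flag is dropped. XYZ is the placeholder used in this thread, not a real name. The date sits on its own comment line. Blank lines and lines starting with # are ignored by Apache, so both are safe to use for readability.

```apache
# 13DEC05
# Sketch: refuse requests whose Referer contains forum.XYZ.nl (placeholder name).
RewriteCond %{HTTP_REFERER} forum\.XYZ\.nl [NC]

RewriteRule .* - [F]
```

An unescaped dot matches any single character, a literal dot included, so the dots were not why the original line failed; the wrong variable was. Escaping them just makes the match stricter.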

larryhatch

12:57 am on Dec 19, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks very much JDM!
That's exactly the info I needed and then some.
I have moved dates to separate lines too. -Larry