Home / Forums Index / Code, Content, and Presentation / Apache Web Server
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL & phranque

Apache Web Server Forum

Question about alternative to blocking by IP address
Patrick Taylor

 11:22 pm on Feb 3, 2013 (gmt 0)

Would this work (as an alternative to putting a large number of IP blocks in .htaccess)?

RewriteCond %{HTTP_REFERER} \.ru [NC]
RewriteRule .* - [F]

.ru is just an example, but there are a lot of spam .ru referrers. The links already exist, so the objective would be to prevent googlebot etc. from giving the incoming links any credence.

I have seen this thread: [webmasterworld.com...] but am not sure what the conclusion was.



 1:53 am on Feb 4, 2013 (gmt 0)

Sure. In fact I've got a similar rule myself. It goes

:: shuffling papers ::

RewriteCond %{HTTP_REFERER} \.(ru|ua)(/|$) [NC]
RewriteCond %{HTTP_REFERER} !(google|yandex)\.
RewriteRule (\.html|/)$ - [F]

There's a separate exclusion earlier in htaccess for fake yandex referers; this is the leftover.

But it doesn't have anything to do with googlebot. These are forged referers used by robots, so they'd never show up as Links To Your Site anyway. And even if they did, a link leading to a 403 is still a link. Disavowing links is a completely separate process.
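For anyone comparing the two conditions posted so far, here is a rough side-by-side sketch. The referer named in the comments is a made-up example, and in practice you would use one variant or the other, not both.

```apache
RewriteEngine On

# Variant A (from the question): ".ru" may appear anywhere in the referer,
# so a hypothetical referer like http://www.rubber.example.com/ is caught too.
RewriteCond %{HTTP_REFERER} \.ru [NC]
RewriteRule .* - [F]

# Variant B: ".ru" or ".ua" must be followed by a slash or by the end of
# the referer string, so the hypothetical referer above is not matched.
RewriteCond %{HTTP_REFERER} \.(ru|ua)(/|$) [NC]
RewriteRule .* - [F]
```

The trailing (/|$) is what stops the pattern from matching in the middle of a hostname.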


 1:54 am on Feb 4, 2013 (gmt 0)

Would this work (as an alternative to putting a large number of IP blocks in .htaccess)?

I have seen this thread [webmasterworld.com] but am not sure what the conclusion was.

SevenCubed's response [webmasterworld.com] in that thread is your answer.

Patrick Taylor

 9:11 am on Feb 4, 2013 (gmt 0)


In lucy24's suggestion, why are the bolded bits (the second RewriteCond and the (\.html|/)$ pattern) required?

RewriteCond %{HTTP_REFERER} \.(ru|ua)(/|$) [NC]
RewriteCond %{HTTP_REFERER} !(google|yandex)\.
RewriteRule (\.html|/)$ - [F]

I think I follow SevenCubed:

RewriteEngine On
RewriteCond %{HTTP_REFERER} \.(ru|ua)(/|$) [NC]
RewriteRule .* - [F]

ErrorDocument 403 "Access Denied"
ErrorDocument 404 "Page Not Found"

However, my ErrorDocument lines are above the rewrite rules, not below. Is this an issue? My server produces the standard 403 error page.


 9:47 am on Feb 4, 2013 (gmt 0)

The bold parts aren't required. They're exclusions for legitimate Russian or Ukrainian referers. That's assuming for the sake of discussion that you might get real human traffic from google dot ru or yandex. If you don't, you don't need the exemption.

In htaccess, each module is an island. The only thing that matters is what order you put things within that module, like mod_rewrite or mod_authz_host. But you really should get in the habit of putting the basic things at the top.

:: shuffling papers ::

Mine currently starts with

Options -Indexes +Includes

SSIErrorMsg "<!-- SSI error -->"

AddType text/html .html
AddOutputFilter INCLUDES .html

ExpiresActive On
ExpiresDefault "access plus 1 month"
ExpiresByType text/html "access plus 7 days"

ErrorDocument 403 /boilerplate/forbidden.html
ErrorDocument 404 /boilerplate/missing.html
ErrorDocument 410 /boilerplate/missing.html
ErrorDocument 500 /boilerplate/internal_error.html

RewriteEngine On

...and then on into the redirects and rewrites.

Obviously ymmv on the details. In particular, I've got all mod_authz_host directives (Allow, Deny) along with SetEnvIf in a separate htaccess that's shared by all domains in my userspace. Otherwise they'd come before the mod_rewrite stuff.
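As a rough illustration of the SetEnvIf plus Allow/Deny approach mentioned above, the same referer test could be sketched like this. It assumes Apache 2.2 syntax (mod_setenvif and mod_authz_host), and the variable name bad_referer is invented for the example; it is not taken from anyone's actual htaccess.

```apache
# Flag requests whose referer ends its hostname in .ru or .ua (case-insensitive).
# "bad_referer" is a made-up variable name.
SetEnvIfNoCase Referer "\.(ru|ua)(/|$)" bad_referer

# Clear the flag again for google and yandex referers.
SetEnvIf Referer "(google|yandex)\." !bad_referer

# Apache 2.2 access control: allow everyone, then deny flagged requests.
Order Allow,Deny
Allow from all
Deny from env=bad_referer
```

Unlike the RewriteRule version, this is not limited to page requests; it applies to every request the htaccess covers.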

Formulations like

ErrorDocument 403 "Access Denied"

--assuming you meant that literally-- are not pointers to documents. They are text which is sent as the response body and displayed by the user's browser. If you're on shared hosting, they may have an override so it only goes to an ErrorDocument if you've named a physical document. (Not sure how they can do this, since ErrorDocument falls in the same AllowOverride category, FileInfo, as mod_rewrite-- which you're using-- but who knows.)

Oh, and always always remember
Microsoft Internet Explorer (MSIE) will by default ignore server-generated error messages when they are "too small" and substitute its own "friendly" error messages. The size threshold varies depending on the type of error, but in general, if you make your error document greater than 512 bytes, then MSIE will show the server-generated error rather than masking it.

It's good to bring that up every six months or so ;) Note Apache's sarcastic quotes on "friendly". If I remember rightly, MSIE's version of the 403 message is brazenly and gratuitously inaccurate, making a 403 sound like a 401.

But I digress.

Edit: Oops, overlooked the second bolded piece. But that's OK because I think I did explain that. You need to keep in mind that mod_rewrite-- and everything else in your htaccess-- looks at every single request. Not just the named page that your human or robot asked for, but also the stylesheets and scripts and images and sound files that the browser asks for on the user's behalf. Which is why it's called a User Agent. The server can distinguish between internal and external requests, but it has no way of knowing which pieces were explicitly typed/clicked by a human.

Most robots look only at pages, so you can save the server a lot of time and work by constraining your rules to page requests.
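Putting the pieces of this thread together, a commented sketch might look like the following. It assumes the htaccess sits in the document root and that the 403 page lives at /boilerplate/forbidden.html as in the listing above. The pass-through rule for the error page is an addition not shown in the original posts: as far as I know, without it the .html-constrained block would also catch the internal request for the 403 document itself.

```apache
ErrorDocument 403 /boilerplate/forbidden.html

RewriteEngine On

# Let the 403 page itself through, so the block below cannot forbid it.
RewriteRule ^boilerplate/forbidden\.html$ - [L]

# Referer hostname ends in .ru or .ua ...
RewriteCond %{HTTP_REFERER} \.(ru|ua)(/|$) [NC]
# ... but the referer is not a google or yandex address.
RewriteCond %{HTTP_REFERER} !(google|yandex)\.
# Only paths ending in .html or a slash are tested, so requests for
# stylesheets, scripts and images skip the referer checks entirely.
RewriteRule (\.html|/)$ - [F]
```

Because mod_rewrite tests the RewriteRule pattern before it evaluates the conditions, a narrow pattern means the referer checks never run for non-page requests.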

Patrick Taylor

 2:06 pm on Feb 4, 2013 (gmt 0)

it doesn't have anything to do with googlebot

Thanks. I didn't realise that.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved