Apache Web Server Forum

    
htaccess questions about pattern matching etc.
Alikris




msg:4612945
 2:25 pm on Sep 26, 2013 (gmt 0)

Hia. Firstly, although newly registered, I have been using WebmasterWorld for quite a while and have found it very helpful. Thank you.

I know there are lots of threads asking how to write htaccess files, but I need some clarification on a few things, so please forgive me for starting another thread!

Say a user agent is called anythingspideranything

And I want to block it based on the fact that the phrase spider is included, which of the following should I use?

RewriteCond %{HTTP_USER_AGENT} spider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^.*spider.*$ [NC,OR]


Likewise, if a referer contains a URL with the country code .ru (as a lot of referer spam does), which of the following should I use:

RewriteCond %{HTTP_REFERER} \.ru [NC,OR]
RewriteCond %{HTTP_REFERER} ^.*\.ru.*$ [NC,OR]


Would either of the above catch the .ru both in a referrer using it within a URL and in a referrer using it at the end of a URL?

If I wanted to catch something using the RewriteCond %{REMOTE_HOST} would the above syntax also apply? For example if I wanted to catch amazonaws, would I write:

RewriteCond %{REMOTE_HOST} ^.*amazonaws.*$ [NC,OR]

My next question regards the rewriterule and which is best.

I'm stuck between using one of the following:

RewriteRule .? - [F,L]
RewriteRule /* http://www.google.com [G,L]


Basically is it best to use the F or G flag? And is it a good idea to redirect to something like the Google search page? Fail would take them to a 404 but if I didn't want them to get a 404 would the redirect to google and the G flag do exactly that?

Lastly, does the following actually work?

deny from .ru
or
deny from *@*.ru

or is it just a waste of time?

I think that's all my questions for now on htaccess, but I do have another related question. Please excuse me if this is the wrong place to ask. Maybe I should start a separate thread about this?

In order to stop domain squatters, I have registered a few domain names which are very similar to my own. At the moment they direct to a holding page, but what I would like to do is somehow make the domains non-resolvable. In other words, if someone types in one of the domain names, their browser will return a 'this domain doesn't exist' or 'this domain isn't loading' message. I have noticed that some domains that definitely have been registered have no associated DNS. How do I make this possible?

Many thanks,
Ali.

 

Alikris




msg:4612950
 2:30 pm on Sep 26, 2013 (gmt 0)

My apologies for using the Google URL in the above post. I've just realised, but I can't edit my post, sorry.

lucy24




msg:4613012
 10:16 pm on Sep 26, 2013 (gmt 0)

Goodness, what a lot of questions :)

which of the following should I use?

If you're not capturing, you never need .* or .+ at either end of the pattern. A simple unanchored
spider
is all you need if you're looking for "any user-agent that contains the string 'spider' anywhere". Here [NC] is OK, because presumably you don't care if your unattractive robot calls itself Spider or SPIDER or even spIder instead.
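
Put together, a complete block along those lines might look something like the sketch below. It assumes mod_rewrite is available and that the rules live in an htaccess file; the bare .? pattern and the [F] response are just one reasonable pairing, not the only one.

```apache
RewriteEngine On

# Unanchored, case-insensitive match: true for any User-Agent
# that contains "spider" anywhere (Spider, SPIDER, spIder ...)
RewriteCond %{HTTP_USER_AGENT} spider [NC]

# .? matches every request, "-" means "no substitution",
# and [F] answers with 403 Forbidden
RewriteRule .? - [F]
```

With only one condition there is no [OR]. That flag belongs on every condition in a list except the last one.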

which contains the country code .ru (as a lot of referer spam does)

Yes indeed. Matter of fact

:: shuffling papers ::

RewriteCond %{HTTP_REFERER} \.(ru|ua)(/|$) [NC]
RewriteCond %{HTTP_REFERER} !(google|yandex|mail)\.
RewriteRule (^|\.html|/)$ - [F]

That's my current version. I constrain most rules to requests for pages, so the server doesn't have to slow down and evaluate everything. If there's a request for an image, it was either referred by a page-- which has already been authorized-- or it's a hotlink-- which gets its own set of rules.

I have registered a few domain names which are very similar to my own. <snip> I have noticed that some domains that definitely have been registered have no associated dns. How do I make this possible?

I think you're conflating two different things. One's the domain name; the other is the domain's physical location. It is perfectly OK to have a registered domain but no site. The user's browser then puts up the error message that says "Although it appears to be a valid name, no server could be found". The only reason all those domains belonging to domain-name dragons resolve to a parking page is that they want users to know the name is for sale.

But really, if it's a legitimate typo domain, wouldn't you be better off with either a sitewide page-for-page redirect or a single human-readable page that says

:: shuffling papers again ::

Are you looking for Widget World? The site with the transcoders and morphological analyzer and other good things too numerous to list? You want <a href = "http://www.example.ca/">dot ca</a>. This is dot com.

(Since it's an ARIN range, the link doesn't give robots any information they didn't already have. If it were RIPE I wouldn't have used a live link.)

I'm stuck between using one of the following:

RewriteRule .? - [F,L]
RewriteRule /* http://www.example.com [G,L]

Basically is it best to use the F or G flag? And is it a good idea to redirect to something like the Google search page?

Both [F] and [G] carry an implied [L] flag. It does no harm, but isn't needed.

Is /* a typo? As written, it means "the request might contain a directory slash". A simple .? is enough if you're leading up to an unqualified [F].

NEVER redirect to some innocent third-party site. Some robots go away faster if you redirect either to 127.0.0.1 or to their own originating IP-- and it's a teeny bit less work for the server, since all it sends back is the redirect header. But really it's more about emotional satisfaction.
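
If you do want to try the loopback redirect, a rough sketch would be something like this. The "spider" condition is only a stand-in for whatever test you are really using, and the 301 status is my assumption; any 300-class code produces a real redirect.

```apache
RewriteEngine On

# Placeholder condition: swap in your own user-agent or referer test
RewriteCond %{HTTP_USER_AGENT} spider [NC]

# An absolute URL as the target plus a 300-class [R] status makes the
# server send back a Location header pointing at the client's own loopback
RewriteRule .? http://127.0.0.1/ [R=301,L]
```

Note that the status has to be in the 300 range. With a code outside that range the target is dropped and the server just returns that status.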

The form
any-target-here [G]
or
any-target-here [F]
is meaningless. It won't kill the server, but a "target" with any 400-class flag is simply ignored. As with redirects, some robots might go away faster if you lie and say the page doesn't exist at all. But after a while it becomes easier just to say [F] when that's what you mean.
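
In other words, the two forms that make sense are the ones with a bare hyphen as the target. A small sketch, with conditions made up purely for illustration:

```apache
RewriteEngine On

# 403 Forbidden for any user-agent containing "spider"
RewriteCond %{HTTP_USER_AGENT} spider [NC]
RewriteRule .? - [F]

# 410 Gone for referers on a .ru hostname
RewriteCond %{HTTP_REFERER} \.ru(/|$) [NC]
RewriteRule .? - [G]
```

Each RewriteCond applies only to the RewriteRule that immediately follows it, which is why the two blocks stay independent.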

Fail would take them to a 404 but if I didn't want them to get a 404 would the redirect to google and the G flag do exactly that?

Fail isn't 404, it's 403. Either way, it's very unlikely that the robot will look at your actual 403/404 page, or even the generic server-generated one. Coincidentally this subject came up in another thread very recently. I have a stylesheet that's used only by my error documents. So nobody knows it exists unless they've seen the requesting page. Until a couple of days ago, the googlebot had never requested this stylesheet. So you have to assume that even a major search engine doesn't actually read your error documents. (afaik, a "noindex" tag on the page doesn't in-and-of-itself prevent the search engine from asking for supporting documents.)

Lastly, does the following actually work?

deny from .ru

It might, but don't do it. Someone else will explain the technicalia. In essence, if your Allow/Deny lines contain anything other than a normal IP address in CIDR form, the logs turn into an unreadable mess. And behind that mess lies extra work for your server.
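
For comparison, an IP-based block in CIDR form looks roughly like the sketch below. The range shown is a reserved documentation range used as a placeholder, not a real offender, and the syntax assumes Apache 2.2 (on 2.4 these directives need mod_access_compat).

```apache
# Evaluate Allow lines first, then Deny lines; Deny wins on a conflict
Order Allow,Deny
Allow from all

# Placeholder range in CIDR form: replace with the range you want to block
Deny from 192.0.2.0/24
```

A hostname form such as "Deny from .ru" makes the server do a double reverse DNS lookup on each client address before it can decide, which is where the extra work comes from.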

Alikris




msg:4613127
 10:11 am on Sep 27, 2013 (gmt 0)

Thank you Lucy, very helpful :)

So, just to confirm, to redirect to 127.0.0.1 I'd use:

RewriteRule (^|\.html|\.php|/)$ http://127.0.0.1/$1 [R=301,L]

I do have another problem which I don't understand. In order to prevent hotlinking I have:

RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)example.co.uk/.*$ [NC]
RewriteRule \.(gif|jpg|jpeg|bmp|avi|rar|mp3|flv|swf|xml|png|pdf)$ - [F]


But it blocks me from seeing my php forum style images. Referring URL is, for example, http://www.example.co.uk/phpbb/portal.php

Many thanks for your help with this. It's all giving me quite a headache :(

Ali.

lucy24




msg:4613160
 12:30 pm on Sep 27, 2013 (gmt 0)

Do you mean that the images themselves have a php extension (rare but it can happen), or that they're used by a php forum? You don't have a mix of http and https do you? Or a subdomain you forgot to code for?

Do you really use all those different extensions? Yowk. And what's pdf doing on the list? Generally you'd treat requests for pdfs the same as requests for pages. Sometimes the browser has to call in outside assistance, so the referer may not even be the original page. Same goes for xml.

And, finally,
jpg|jpeg = jpe?g

For hotlinking, it's actually less work for the server if you rewrite. A one-pixel gif should only set you back a few hundred bytes, while a 403 is the full size of your 403 page. Even if the requesting site isn't able to display it, the server still sends it. Another popular remedy is to rewrite to something ugly and/or offensive. I have a garish NO HOTLINKS graphic in an eye-catching combination of black, green and magenta. It's about 2K-- fractionally bigger than my error pages, but still way smaller than the nice jpg's they were hoping for.
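
A rewrite version of a hotlink rule could be sketched like this. The path /images/nohotlink.gif is a made-up name for the replacement graphic, and the hostname is the one from your example; adjust both to suit.

```apache
RewriteEngine On

# Let through requests that carry no referer at all
RewriteCond %{HTTP_REFERER} !^$

# Let through requests referred by the site's own canonical hostname
RewriteCond %{HTTP_REFERER} !^http://www\.example\.co\.uk/ [NC]

# Never rewrite the replacement image itself (avoids a rewrite loop)
RewriteCond %{REQUEST_URI} !^/images/nohotlink\.gif$

# Serve the small replacement graphic instead of the requested image
RewriteRule \.(gif|jpe?g|png)$ /images/nohotlink.gif [L]
```

This is an internal rewrite rather than a redirect, so the hotlinking page gets the small graphic in a single request.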

!^http://(www\.)example.co.uk/.*$

The forms
.*$
^.*
are always meaningless if you're not capturing. Does
(www\.)
mean
(www\.)?
with question mark meaning "optional www"? You actually don't want this in a hotlinking routine. Presumably your own site has a single canonical hostname. So you give only that form. If someone comes in claiming "example.com sent me" when everyone knows the site as "www.example.com" you know they're lying. In fact I recently added that to my general referer blocks:

RewriteCond %{HTTP_REFERER} http://example\.com [NC,OR]
RewriteCond %{HTTP_REFERER} \.(su|mobi|biz)(/|$) [NC,OR]

et cetera. It has also gotten rid of some auto-referers who would otherwise have been redirected.

Alikris




msg:4613766
 12:11 pm on Sep 30, 2013 (gmt 0)

Thank you for your help and clarifications on this Lucy :)

Ali.

phranque




msg:4613823
 6:03 pm on Sep 30, 2013 (gmt 0)

welcome to WebmasterWorld, Ali!

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved