Goodness, what a lot of questions :)
which of the following should I use?
If you're not capturing, you never need .* or .+. A simple unanchored

spider

is all you need if you're looking for "any user-agent that contains the string 'spider' anywhere". Here [NC] is OK, because presumably you don't care whether your unattractive robot calls itself Spider or SPIDER or even spIder.
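In context, the whole thing is just a pair of lines, something like this (a sketch; the [F] is illustrative, use whatever action you've settled on):

RewriteCond %{HTTP_USER_AGENT} spider [NC]
# unanchored and uncaptured: any UA containing "spider" gets the boot
RewriteRule .? - [F]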
which contains the country code .ru (as a lot of referer spam does)
Yes indeed. Matter of fact
:: shuffling papers ::
# referer ends in .ru or .ua, with or without a trailing slash
RewriteCond %{HTTP_REFERER} \.(ru|ua)(/|$) [NC]
# ...but let the legitimate search engines and mail services through
RewriteCond %{HTTP_REFERER} !(google|yandex|mail)\.
# apply only to page requests: a directory, a .html file, or the bare root
RewriteRule (^|\.html|/)$ - [F]
That's my current version. I constrain most rules to requests for pages, so the server doesn't have to slow down and evaluate everything. If there's a request for an image, it was either referred by a page-- which has already been authorized-- or it's a hotlink-- which gets its own set of rules.
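The hotlink side is the familiar pattern, along these lines (a sketch only; example.com stands in for the real domain, and empty referers are allowed through so direct requests still work):

# no referer at all is fine; some privacy setups strip it
RewriteCond %{HTTP_REFERER} !^$
# anything that isn't my own site
RewriteCond %{HTTP_REFERER} !example\.com [NC]
# block image requests from elsewhere
RewriteRule \.(gif|jpe?g|png)$ - [F,NC]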
I have registered a few domain names which are very similar to my own. <snip> I have noticed that some domains that definitely have been registered have no associated dns. How do I make this possible?
I think you're conflating two different things. One's the domain name; the other is the domain's physical location. It is perfectly OK to have a registered domain but no site. The user's browser then puts up the error message that says "Although it appears to be a valid name, no server could be found". The only reason all those domains belonging to domain-name dragons resolve to a parking page is that they want users to know the name is for sale.
But really, if it's a legitimate typo domain, wouldn't you be better off with either a sitewide page-for-page redirect or a single human-readable page that says
:: shuffling papers again ::
Are you looking for Widget World? The site with the transcoders and morphological analyzer and other good things too numerous to list? You want <a href = "http://www.example.ca/">dot ca</a>. This is dot com.
(Since it's an ARIN range, the link doesn't give robots any information they didn't already have. If it were RIPE I wouldn't have used a live link.)
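If you go the redirect route instead, it's a one-liner in either module (a sketch, assuming dot com is the typo domain and everything should land on the same path at dot ca):

# mod_alias version: prefix match, the rest of the path carries over
Redirect 301 / http://www.example.ca/

# mod_rewrite version, if the rest of your rules live there anyway
RewriteRule (.*) http://www.example.ca/$1 [R=301,L]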
I'm stuck between using one of the following:
RewriteRule .? - [F,L]
RewriteRule /* http://www.example.com [G,L]
Basically is it best to use the F or G flag? And is it a good idea to redirect to something like the Google search page?
Both [F] and [G] carry an implied [L] flag. It does no harm, but isn't needed.
Is /* a typo? As written, it means "the request might contain a directory slash". A simple .? is enough if you're leading up to an unqualified [F].
NEVER redirect to some innocent third-party site. Some robots go away faster if you redirect either to 127.0.0.1 or to their own originating IP-- and it's a teeny bit less work for the server, since all it sends back is the redirect header. But really it's more about emotional satisfaction.
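If you like the originating-IP trick, it's nothing more than this (a sketch; "badrobot" is a placeholder pattern, not a real UA):

RewriteCond %{HTTP_USER_AGENT} badrobot [NC]
# bounce the request straight back where it came from
RewriteRule .? http://%{REMOTE_ADDR}/ [R,L]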
The form
any-target-here [G]
or
any-target-here [F]
is meaningless. It won't kill the server, but a "target" with any 400-class flag is simply ignored. As with redirects, some robots might go away faster if you lie and say the page doesn't exist at all. But after a while it becomes easier just to say [F] when that's what you mean.
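So the meaningful form is the one with the hyphen, which says "no substitution, just act on the flag":

RewriteRule .? - [G]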
Fail would take them to a 404 but if I didn't want them to get a 404 would the redirect to google and the G flag do exactly that?
Fail isn't 404, it's 403. Either way, it's very unlikely that the robot will look at your actual 403/404 page, or even the generic server-generated one. Coincidentally this subject came up in another thread very recently. I have a stylesheet that's used only by my error documents, so nobody knows it exists unless they've seen a page that requests it. Until a couple of days ago, the googlebot had never requested this stylesheet. So you have to assume that even a major search engine doesn't actually read your error documents. (afaik, a "noindex" tag on the page doesn't in-and-of-itself prevent the search engine from asking for supporting documents.)
Lastly, does the following actually work?
deny from .ru
It might, but don't do it. Someone else will explain the technicalia. In essence, if your Allow/Deny lines contain anything other than a normal IP address in CIDR form, the server has to stop and do a DNS lookup on every request to see whether the hostname matches. The logs turn into an unreadable mess, and behind that mess lies all that extra work for your server.
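Stick to numbers. Something like this, where 192.0.2.0/24 is just a documentation range standing in for whatever you actually want to block:

order allow,deny
allow from all
# a plain CIDR range: no DNS lookups, clean logs
deny from 192.0.2.0/24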