
Can "Allow from all" Override "noindex?"

My First Attempt at .htaccess

KidTao

12:15 pm on Aug 21, 2012 (gmt 0)

10+ Year Member



Hi,

We have a site that will be used as a demo, and we need to hide it from search engines for now. Setting up robots.txt wasn't too bad, but being no programmer material, I've struggled with .htaccess a little. Below is what I've come up with so far, with some questions underneath. If you could take a look and help me out, I'd appreciate it:


-------------------------------------------------------------------------------

# Controls site-level indexing
Header set X-Robots-Tag "noindex, nofollow, noimageindex"

# Controls file-level indexing
<Files ~ "\.xml$">
Header set X-Robots-Tag "noindex, nofollow, noimageindex"
</Files>
<Files ~ "\.(png|jpe?g|gif)$">
Header set X-Robots-Tag "noindex"
</Files>
<Files ~ "\.pdf$">
Header set X-Robots-Tag "noindex, nofollow"
</Files>

# Protects the htaccess file
<files .htaccess>
order allow,deny
deny from all
</files>

# Protects against bad bots
<limit get="" post="" head="">
SetEnvIfNoCase user-Agent "^FrontPage" bad_bot [NC,OR]
SetEnvIfNoCase user-Agent "^Java.*" bad_bot [NC,OR]
SetEnvIfNoCase user-Agent "^Microsoft.URL" bad_bot [NC,OR]
SetEnvIfNoCase user-Agent "^MSFrontPage" bad_bot [NC,OR]
SetEnvIfNoCase user-Agent "^Offline.Explorer" bad_bot [NC,OR]
SetEnvIfNoCase user-Agent "^[Ww]eb[Bb]andit" bad_bot [NC,OR]
SetEnvIfNoCase user-Agent "^Zeus" bad_bot [NC,OR]
SetEnvIfNoCase user-Agent "^Yandex" bad_bot [NC,OR]
SetEnvIfNoCase user-Agent "^moget" bad_bot [NC,OR]
SetEnvIfNoCase user-Agent "^ichiro" bad_bot [NC,OR]
SetEnvIfNoCase user-Agent "^NaverBot" bad_bot [NC,OR]
SetEnvIfNoCase user-Agent "^Yeti" bad_bot [NC,OR]
SetEnvIfNoCase user-Agent "^Baiduspider" bad_bot [NC,OR]
SetEnvIfNoCase user-Agent "^sogou spider" bad_bot [NC,OR]
SetEnvIfNoCase user-Agent "^YoudaoBot" bad_bot [NC,OR]
SetEnvIfNoCase user-Agent "^Daumoa" bad_bot [NC,OR]
SetEnvIf Remote_Addr "212\.100\.254\.105" bad_bot [NC]
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</limit>

# Disables directory listing
Options -Indexes

# Prevents hotlinking
Options +FollowSymLinks
RewriteEngine On
RewriteCond %{HTTP_REFERER} !^https?://([a-z0-9-]+\.)?example\.com [NC]
RewriteCond %{HTTP_REFERER} !^$
RewriteRule .*\.(gif|png|jpe?g|swf|flv)$ /images/nohotlink.jpg [L]

-------------------------------------------------------------------------------


1) Do I need "Allow from all" in the bad bots part? What worries me a little is: what if allowing all the rest overrides the noindex/nofollow set above, which would defeat the purpose of the whole "Header set" business?

2) With "SetEnvIf Remote_Addr..." I'm trying to block the infamous Copyscape by its IP address. Is its syntax correct? I just want to make sure I'm NOT setting it so that I'm allowing Copyscape only.

3) Do I actually need the <limit> tag? Some of the snippets of the same topic weren't using it.

4) Is "https?://([a-z0-9-]+\.)?" in the last part correct? I basically want to include all the possible subdomains and subdirectories, http or https.

5) Is the last parameter [L] correct? I have seen different ones, like [R,NC].

Thank you in advance,

lucy24

9:00 pm on Aug 21, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Let me just answer the question posed in the subject line.

"Allow from all" and "noindex" have absolutely nothing to do with each other.

"Allow" and "deny" are server directives that determine who can physically get to the site. Nobody in the world-- not even google-- can disregard an htaccess or config file. Unless your host is grossly remiss, filenames with leading dot (in practice, .htaccess and .htpasswd) are blocked at the gate. That is: visitors have to obey them, but they can't read them.

"noindex" is aimed at search engines. They may or may not choose to follow it. Note an irritating but important quirk: if a search engine cannot reach the page, it cannot read the "noindex" directive. So it may list the page in its index even if all it knows is the URL and possibly some link text.




Someone will eventually come along and deal with the specific questions.

wilderness

7:05 am on Aug 22, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



1) Rather than using allow,deny it is more effective to use deny,allow. The reason being that the latter permits ErrorDocuments and custom 403's to function.


Order Deny,Allow
Deny from 212.100.254.105
Allow from all
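With that order, a custom error page can actually be served to the blocked visitor, e.g. (a sketch; /custom403.html is a hypothetical path):

ErrorDocument 403 /custom403.html
Order Deny,Allow
Deny from 212.100.254.105
Allow from all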

Your Bad Bots section has numerous errors; it may not even be functioning at all in denying the UAs you desired (it would not surprise me if some of the errors caused 500 errors and took down the whole server). I suggest you scrap the entire section until you comprehend the proper syntax and the use of anchors (begins, ends and contains) as applied to User Agents.

Here's an example of that lack of comprehension:
SetEnvIfNoCase user-Agent "^[Ww]eb[Bb]andit" bad_bot [NC,OR] 


1) It's User-Agent, not the lower-case user-Agent you have in error on every line.
2) You have multiple case designations for web and bandit, while at the end you're using the NC flag in duplication, on top of the leading SetEnvIfNoCase; all three do the same job. Use one of them, not all three (see the corrected sketch below). This misuse is simply a result of a lack of comprehension of what each one does.
3) Note: this is controversial, however remove the quotes surrounding the UAs.
4) You have other UAs listed with a begins-with anchor where the string will never appear at the beginning of the UA in your logs, rather somewhere in the middle ("contains").
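For instance, that example line might be corrected to (a sketch, keeping only the NoCase mechanism; no quotes, no flags):

SetEnvIfNoCase User-Agent webbandit bad_bot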


As to your SetEnvIf Remote_Addr: I simply use (see above)
Deny from 212.100.254.105

I've used the Limit container for more than a decade. (Some months back I tested the Files container, and the result was a change in the format of my raw log output.)

I use (which is overkill):
<Limit GET POST PUT HEAD>

No idea where you acquired the trailing characters from.

I've no clue about the headers; I've used page meta-tags for nofollow, noindex effectively.

I'm trying to block the infamous Copyscape


You severely limit the effectiveness by restricting your denial to one precise IP address, and in addition by not denying such a harvesting tool by UA.

lucy24

11:22 am on Aug 22, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



5) [OR]

In SetEnvIf, each line is an island. You don't need to link them up. In fact you can't. Like wilderness, I am surprised all those flags didn't make the server explode. Maybe it thinks [NC,OR] is just your made-up name for another environmental variable, and it's waiting for you to do something further with it. Brr.

Incidentally you can use the shorthand BrowserMatch-- or BrowserMatchNoCase-- if you're matching against user-agents. Saves a few bytes in every line ;)
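That is, these two lines should be interchangeable (a sketch; the UA is illustrative):

SetEnvIfNoCase User-Agent ^Zeus bad_bot
BrowserMatchNoCase ^Zeus bad_bot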

Rather than using allow,deny it is more effective to use deny,allow. The reason being that the latter permits ErrorDocuments and custom 403's to function.

You can put your error documents-- also robots.txt-- inside Files or FilesMatch envelopes that are separately flagged as
Order Allow,Deny
Allow from all
... and that's all. Then everyone can see them.
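Something like this (a sketch; the filenames are illustrative):

<FilesMatch "^(robots\.txt|403\.html)$">
Order Allow,Deny
Allow from all
</FilesMatch>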

Do I actually need the <limit> tag? Some of the snippets of the same topic weren't using it.

No. Unless you're taking the position that evil Ukrainian robots are welcome to do X, Y and Z so long as they don't try to do V or W.

But right now you may not need any of this stuff. Set a blanket "Deny from all" and then add "Allow from" lines for the specific people who have permission to see the site while it's under construction. Either by IP address or by BrowserMatch depending on what's most likely to be unique.
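For example (a sketch; the IP is a placeholder for whoever needs access):

Order Deny,Allow
Deny from all
Allow from 192.0.2.25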

Or there's the "Satisfy/Require" set of directives...
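In Apache 2.2 terms that might look roughly like this (a sketch; the realm, path and IP are placeholders):

AuthType Basic
AuthName "Demo site"
AuthUserFile /path/to/.htpasswd
Require valid-user
Order Deny,Allow
Deny from all
Allow from 192.0.2.25
Satisfy Any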

wilderness

2:29 pm on Aug 22, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



5) [OR]

In SetEnvIf, each line is an island. You don't need to link them up. In fact you can't. Like wilderness, I am surprised all those flags didn't make the server explode. Maybe it thinks [NC,OR] is just your made-up name for another environmental variable, and it's waiting for you to do something further with it. Brr.


Thanks lucy.
Not sure how I missed that. Perhaps I was just focused on scrapping the entire section.

The [OR] flag is not used on SetEnvIf lines; rather, it belongs on mod_rewrite RewriteCond lines.
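For example, this is where [OR] legitimately lives (a sketch; the UAs are illustrative):

RewriteCond %{HTTP_USER_AGENT} ^Zeus [NC,OR]
RewriteCond %{HTTP_USER_AGENT} webbandit [NC]
RewriteRule .* - [F]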

lucy24

8:59 pm on Aug 22, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Perhaps I was just focused on scrapping the entire section.

That would probably be my first choice too. Start from scratch and build up a list of current, active UAs with only the necessary anchors or other necessary text. For example

BrowserMatch "MSIE [1-4]\." keep_out


rather than simply

BrowserMatch "MSIE [1-4]" keep_out


I once inadvertently locked out an MSIE 10 user. It took me forever to figure out why they were drawing a 403.

I use quotation marks in BrowserMatch statements when they contain a literal space. Otherwise you have to escape the space.
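That is, these two should behave identically (a sketch):

BrowserMatch "sogou spider" keep_out
BrowserMatch sogou\ spider keep_out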

And gosh, what a relief to find an occurrence of "scrapping" that isn't a typo for "scraping" ;)

KidTao

1:09 am on Aug 23, 2012 (gmt 0)

10+ Year Member



You guys are awesome. Thank you very much for all the trouble.

I looked into all the advice discussed above and tried to streamline the code as follows:

---------------------------------------------------------------------------

# Controls site-level indexing
Header set X-Robots-Tag "noindex, nofollow, noimageindex"

# Controls file-level indexing
<Files ~ "\.xml$">
Header set X-Robots-Tag "noindex, nofollow, noimageindex"
</Files>
<Files ~ "\.(png|jpe?g|gif)$">
Header set X-Robots-Tag "noindex"
</Files>
<Files ~ "\.pdf$">
Header set X-Robots-Tag "noindex, nofollow"
</Files>

# Protects the htaccess file
<files .htaccess>
order deny,allow
deny from all
</files>

# Blocks bad bots
BrowserMatchNoCase yandex bad_bot
BrowserMatchNoCase ichiro bad_bot
BrowserMatchNoCase moget bad_bot
BrowserMatchNoCase mogimogi bad_bot
BrowserMatchNoCase cowbot bad_bot
BrowserMatchNoCase naverrobot bad_bot
BrowserMatchNoCase naverbot bad_bot
BrowserMatchNoCase nabot bad_bot
BrowserMatchNoCase yeti bad_bot
BrowserMatchNoCase daum bad_bot
BrowserMatchNoCase daumoa bad_bot
BrowserMatchNoCase rabot bad_bot
BrowserMatchNoCase baiduspider bad_bot
BrowserMatchNoCase baiduimagespider bad_bot
BrowserMatchNoCase sogou bad_bot
BrowserMatchNoCase sohu bad_bot
BrowserMatchNoCase youdaobot bad_bot
order deny,allow
deny from env=bad_bot
deny from 212.100.254.105

# Prevents hot-linking
RewriteEngine On
RewriteCond %{HTTP_REFERER} !^http(s)?://([^.]+\.)*example\.com [NC]
RewriteRule .*\.(jpe?g|gif|png|js|css|swf|flv)$ - [F,NC,L]

---------------------------------------------------------------------------

I have fewer worries now, although I'm still not too sure about a couple of things.

1) Following BrowserMatch, does a user agent need to be in a full word? I have a feeling DAUMOA is redundant since I already have DAUM. This applies with Naver as well: NaverRobot, NaverBot.

2) As for HTTP_REFERER, do I even need RegEx for subdomain and protocol? If "containing somewhere in the string" is defined by NOT using either ^ or $, wouldn't be the domain name only sufficient? For now, I'm following jdMorgan's protocol, but I'm open to your idea as well.

Thank you again for your time, guys.

wilderness

1:30 am on Aug 23, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The answer is YES for both your 1 & 2 questions.

These may also simply be replaced with spider and you'll catch some other pests as well:

baiduspider

lucy24

4:29 am on Aug 23, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hm. wilderness, when you said Yes did you mean No? Your example makes it look that way.

does a user agent need to be in a full word? I have a feeling DAUMOA is redundant since I already have DAUM. This applies with Naver as well: NaverRobot, NaverBot

If you don't use an operator such as \b (a word boundary, here serving as "end of word") it will match the string anywhere. Hence my example about "MSIE [1-4]" matching MSIE 10 :(
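If you did want only the whole word, \b would do it (a sketch): this matches "Daum/1.0" but not "Daumoa":

BrowserMatchNoCase daum\b bad_bot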

As for HTTP_REFERER, do I even need RegEx for subdomain and protocol?

Only if your site uses both http and https. Remember, what you're matching here is actual referrals from your own site, not forged referers, which may get it a little bit wrong. In particular, your real site either uses www or it doesn't, so the other form could only come from a forged referer.

Hotlinking rules should include an exception for null referers:

^-?$

Look in your logs and you'll see that there's almost never a complete blank; usually there's a single dash, "-". You except them for two reasons. One is search engines-- unless of course you don't want a single one of your images ever included in any index anywhere. The other is human browsers that simply don't send a referer. They're not common, but they can be perfectly legitimate.

Oh, and you don't need [L] with [F]. It won't do anything bad, but [F] is one of very few flags that carries an implied [L]. Others are [G] and [P] -- but not [R]. (This seems counterintuitive to me.)
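Putting those notes together, a hotlink block might read (a sketch, assuming a www-only, http-only site):

RewriteCond %{HTTP_REFERER} !^-?$
RewriteCond %{HTTP_REFERER} !^http://www\.example\.com [NC]
RewriteRule \.(jpe?g|gif|png)$ - [F]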

wilderness

5:14 am on Aug 23, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If "containing somewhere in the string" is defined by NOT using either ^ or $, wouldn't be the domain name only sufficient?


The answer is YES for both your 1 & 2 questions.


Hm. wilderness, when you said Yes did you mean No? Your example makes it look that way.


For the benefit of clarification, KidTao and I were on the same channel, while you're on another set ;)

lucy24

8:22 am on Aug 23, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This is one of those dialectal things, isn't it? I'm still trying to decipher "in a full word".

Q 1): "does a user agent need to be in a full word?"
A#1: Yes.
A#2: No.
The "pattern" string doesn't have to be a full word. In the "target" it doesn't have to appear as a full word. It can even be part of one word and part of the next.

Q 2a): "As for HTTP_REFERER, do I even need RegEx for subdomain and protocol?"
A#1: Yes.
A#2: No.
I took the question to mean "Do I need to include those variables?", meaning http/https and with/without www. Hence the No. But if you're working on the premise that you do need to allow for the variables, and you take the question as "Do I need Regular Expressions to show the options?" then it's a Yes.

Q 2b): "If "containing somewhere in the string" is defined by NOT using either ^ or $, wouldn't be the domain name only sufficient?"
A#1: Yes.
A#2: No.
At least this one isn't a linguistic quibble. It's about, uhm, degrees of paranoia. Say "example" alone, rather than the full "http://www.example.com", and you've safely let in all images called by files on your own site. But you've also let in anything requested by an imperfect referer-faker, and the robot living at www.badexample.net. This is not a problem until, well, the day it becomes a problem. (Do burglars try your house door every single day? Have you decided it's not worth locking?)
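In pattern terms, the lax and the anchored versions (a sketch):

# lax: waves through anything whose referer merely contains "example"
RewriteCond %{HTTP_REFERER} !example [NC]
# anchored: waves through only referers that actually begin with your host
RewriteCond %{HTTP_REFERER} !^http://www\.example\.com [NC]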

KidTao

12:16 pm on Aug 23, 2012 (gmt 0)

10+ Year Member



Hotlinking rules should include an exception for null referers:

^-?$

Lucy,

I'm guessing what you are referring to is:
RewriteCond %{HTTP_REFERER} !^-?$

I'm a little confused about "-?". Since "?" is either 0 or 1, wouldn't it make no practical difference from not citing it?

phranque

6:02 pm on Aug 23, 2012 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



welcome to WebmasterWorld, KidTao!

!^-?$ means: not NULL and not "exactly a dash".

Otherwise stated, the condition fails for a NULL referrer or if the referrer is '-'.
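Traced through with sample values (illustrative):

# Referer ""           -> matches ^-?$ -> !^-?$ fails -> rule is skipped
# Referer "-"          -> matches ^-?$ -> !^-?$ fails -> rule is skipped
# Referer "http://..." -> no match     -> !^-?$ holds -> rule applies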

lucy24

9:19 pm on Aug 23, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Since "?" is either 0 or 1, wouldn't it make no practical difference from not citing it?

Do you mean the whole line, or just the question mark? The anchors are essential here, because they make the complete line mean "exactly one hyphen or exactly nothing". Same principle as the with/without www redirect, where the same punctuation in the same order means "exactly 'www.example.com' or exactly nothing".

I don't know where the hyphen originates, but in my logs, a genuine null referer
""
is rare. It generally comes through as
"-"
In particular, that's what legitimate search engines look like.

Conversely, anyone whose UA is
^-?$
is blocked. Law-abiding robots-- always excepting google's faviconbot-- wear clothes.

KidTao

12:37 am on Aug 24, 2012 (gmt 0)

10+ Year Member



I see. One thing I'm still not quite sure about: since a null value or a "-" there can also be a user agent, not a URL, shouldn't I be using %{HTTP_USER_AGENT} instead of %{HTTP_REFERER}? I'm open to your ideas regardless.

That brings my revision to:

-----------------------------------------------------------------------------------

# Controls site-level indexing
Header set X-Robots-Tag "noindex, nofollow, noimageindex"

# Controls file-level indexing
<Files ~ "\.xml$">
Header set X-Robots-Tag "noindex, nofollow, noimageindex"
</Files>
<Files ~ "\.(png|jpe?g|gif)$">
Header set X-Robots-Tag "noindex"
</Files>
<Files ~ "\.pdf$">
Header set X-Robots-Tag "noindex, nofollow"
</Files>

# Protects the htaccess file
<files .htaccess>
order deny,allow
deny from all
</files>

# Blocks bad bots
BrowserMatchNoCase yandex bad_bot
BrowserMatchNoCase ichiro bad_bot
BrowserMatchNoCase moget bad_bot
BrowserMatchNoCase mogimogi bad_bot
BrowserMatchNoCase cowbot bad_bot
BrowserMatchNoCase naver bad_bot
BrowserMatchNoCase nabot bad_bot
BrowserMatchNoCase yeti bad_bot
BrowserMatchNoCase daum bad_bot
BrowserMatchNoCase rabot bad_bot
BrowserMatchNoCase baidu bad_bot
BrowserMatchNoCase sogou bad_bot
BrowserMatchNoCase sohu bad_bot
BrowserMatchNoCase youdaobot bad_bot
order deny,allow
deny from env=bad_bot
deny from 212.100.254.105

# Prevents hot-linking
RewriteEngine On
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http(s)?://([^.]+\.)*example\.com [NC]
RewriteRule .*\.(jpe?g|gif|png|js|css|swf|flv)$ - [F,NC]

# Blocks both blank UA and UA that contains "-"
RewriteCond %{HTTP_USER_AGENT} ^-?$
RewriteRule .* - [F]

-----------------------------------------------------------------------------------

Please let me know what you think if you would.

wilderness

1:11 am on Aug 24, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Blank referers are quite common these days and perfectly acceptable.

Blank UA's should never be allowed.