|Can the Googlebot read the blocking rules inside .htaccess?|
| 1:24 pm on Apr 14, 2014 (gmt 0)|
I would like to block all backlink checkers and other annoying crawlers in my .htaccess file, but I'd like to ask first whether the Googlebot can read the .htaccess file and see what is being blocked in it.
I do know that the Googlebot reads robots.txt and the like, but I am getting flooded with all these crawlers and want them out (hence I want to use .htaccess). However, I am not sure how the Googlebot treats the .htaccess file: can it read the file, and can it see anything inside it other than rules that block the Googlebot itself? I'd assume the Googlebot would be able to recognize if a rule in .htaccess was blocking it, but that surely would not mean that it can read the rest of what's inside the .htaccess file (i.e. the other crawlers being blocked).
Would appreciate any answers, many thanks!
| 3:02 pm on Apr 14, 2014 (gmt 0)|
Most servers (shared hosting and otherwise) deny access to .htaccess by default (I've not had an htaccess file accessed since I began my sites in 1999). However, if you're not comfortable with that, somebody will be along shortly to supply the lines.
FWIW (and somewhat related), if you have Rewrites that are not in proper order, then your paths could be exposed to Google, other bots and even regular visitors.
| 3:58 pm on Apr 14, 2014 (gmt 0)|
Nobody can see htaccess. Unless you have the worst host in the world, your server's config file contains a block that applies to any file whose name starts with .ht and says something like
Deny from all
The condition could easily say something broader like "^\."
but .ht is conventional as it covers both htaccess and htpasswd.
You can easily test this by requesting ".htaccess" in your browser, same as you'd request robots.txt or sitemap.xml or any other file.
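For reference, here is a sketch of roughly what that default block looks like in the sample httpd.conf that ships with Apache. Treat it as an illustration, not a copy of your host's file: the exact wording varies by version and by host, and the 2.4 form is shown only as a commented-out alternative.

```apache
# Apache 2.2-style default: refuse any request for a file
# whose name starts with ".ht" (.htaccess, .htpasswd, ...)
<FilesMatch "^\.ht">
    Order allow,deny
    Deny from all
    Satisfy All
</FilesMatch>

# Apache 2.4-style equivalent (use one form or the other, not both):
# <Files ".ht*">
#     Require all denied
# </Files>
```

With either form in place, a request for /.htaccess should come back as a 403 no matter who is asking.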
Matter of fact, this block is so universal that I can't remember ever meeting a malign robot even asking for htaccess. (In my case it wouldn't help them a lot, since the IP blocks are in an extra htaccess file in my userspace, which you can't browse to.)
The crucial difference between robots.txt and htaccess is just this. Visitors can ask to see robots.txt and can then choose to follow its directives. Visitors have to obey htaccess whether they want to or not.
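To make that concrete, here is a minimal sketch of a user-agent block in htaccess, written in the same 2.2-style syntax as the "Deny from all" line above. The robot names and the variable name bad_bot are only examples (assumptions, not a vetted list); put in whatever actually shows up in your own logs. It also assumes your host lets htaccess use these directives.

```apache
# Flag requests whose User-Agent contains one of these strings.
# "bad_bot" is an arbitrary variable name chosen for this example.
SetEnvIfNoCase User-Agent "rogerbot"  bad_bot
SetEnvIfNoCase User-Agent "AhrefsBot" bad_bot
SetEnvIfNoCase User-Agent "MJ12bot"   bad_bot

# Let everyone in except the flagged requests, which get a 403.
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```

A blocked robot receives the 403 and nothing more. Any other visitor, the googlebot included, gets its normal response and has no way of telling that these lines exist.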
:: detour to check something ::
Thought so. The boilerplate config file that comes with MAMP has the built-in line
Deny from all
-- and that's for a pseudo-server that would only ever be used on someone's local HD. Someone else (phranque?) can say what you get when you download the Apache software; I'm sure it isn't a blank piece of paper that you have to fill in from scratch.
| 5:24 pm on Apr 14, 2014 (gmt 0)|
Well, we have a solid webhost and we have a good VPS, so there are no concerns there. We also have other sites in shared hosting accounts, but they are usually resellers of bigger companies like Hostgator, so that should be OK too.
So, assuming the host is professional and has the default protections you describe in place, Google would not be able to read the rules in the .htaccess file that block other bots. Correct?
Sorry for asking for confirmation, but I am not technical enough and I will admit that I got a bit lost with your answers. I wasn't sure if you were talking of regular visitors not being able to see the .htaccess rules or Google not being able to see the .htaccess rules :-D In essence, all we want is to block a lot of known bots (e.g. MOZ and their annoying OpenSiteExplorer bot and many more known crawlers/spiders) in the .htaccess file, and to do this without Google knowing that we have blocked these crawlers in .htaccess. I know you replied excellently to my question, but I didn't get it fully (sorry, my fault, still a noob when it comes to this - but learning fast!).
Many Thanks again!
| 8:55 pm on Apr 14, 2014 (gmt 0)|
|I wasn't sure if you were talking of regular visitors not being able to see the .htaccess rules or Google not being able to see the .htaccess rules|
There's no difference. Server directives-- whether in htaccess or the config file-- are for everyone.
All the visitor sees is a response status: 200 "sure, go on in" vs. 301 "go around the back" vs. 403 "nuh-uh, nothing for you" vs. 404 "sorry, I'd love to help but can't find it" vs. ... et cetera. The visitor has no way of knowing how or why the response originated: one response fits all, or Because I Don't Like Your Face, or complicated activity in a secret php file, or ... et cetera again.
The one concrete difference between a human and a robot is that a human user-agent-- what we call a browser-- automatically follows a redirect, without consulting the human. A robot makes note of the redirect and stashes the information for later. It may or may not follow it right away. But a robot-- even the googlebot-- cannot ignore the 301 and continue to the originally requested file.
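As a rough illustration of how two of those responses can originate in htaccess, here is a short sketch. The file name, the directory name and example.com are all made up for the example, and it assumes mod_alias and mod_rewrite are available.

```apache
# 301 "go around the back": anyone asking for the old URL is sent to the new one
Redirect 301 /old-page.html https://www.example.com/new-page.html

# 403 "nothing for you": refuse every request under a hypothetical private/ directory
RewriteEngine On
RewriteRule ^private/ - [F]
```

In both cases the visitor, human or robot, sees only the status code (plus the new location for the 301), never the rule that produced it.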