Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
Can the Googlebot read the blocking rules inside .htaccess?
Andy500




msg:4662874
 1:24 pm on Apr 14, 2014 (gmt 0)

Hi there

I would like to block all backlink checkers and other annoying crawlers in my .htaccess file, but first I'd like to ask: can the Googlebot read the .htaccess file and see what is being blocked in it?

I do know about the Googlebot reading robots.txt and the like, but I am getting flooded with all these crawlers and want them out (hence I want to use .htaccess). However, I am not sure how the Googlebot handles the .htaccess file: can it read the file, and can it read anything inside it other than rules blocking the Googlebot itself? I'd assume the Googlebot would be able to recognize if any rule in .htaccess was blocking it, but that surely would not mean that it can read the rest of what's inside the .htaccess file (i.e. other crawlers blocked).

Would appreciate any answers, many thanks!

Andy

 

wilderness




msg:4662901
 3:02 pm on Apr 14, 2014 (gmt 0)

NO.

Most servers (shared hosting and otherwise) deny access to htaccess by default; I've not had an htaccess file accessed since I began my sites in 1999. However, if you're not comfortable with that, somebody will be along shortly to supply the lines.

FWIW (and somewhat related), if you have Rewrites that are not in the proper order, then your paths could be exposed to Google, other bots and even regular visitors.

lucy24




msg:4662918
 3:58 pm on Apr 14, 2014 (gmt 0)

Nobody can see htaccess. Unless you have the worst host in the world, your server's config file contains a line that says something like
<FilesMatch "^\.ht">
Order Allow,Deny
Deny from all
</FilesMatch>
The condition could easily say something broader like "^\."
but .ht is conventional as it covers both htaccess and htpasswd.

You can easily test this by requesting ".htaccess" in your browser, same as you'd request robots.txt or sitemap.xml or any other file.

Matter of fact, this block is so universal that I can't remember ever meeting a malign robot even asking for htaccess. (In my case it wouldn't help them a lot, since the IP blocks are in an extra htaccess file in my userspace, which you can't browse to.)

The crucial difference between robots.txt and htaccess is just this. Visitors can ask to see robots.txt and can then choose to follow its directives. Visitors have to obey htaccess whether they want to or not.

:: detour to check something ::

Thought so. The boilerplate config file that comes with MAMP has the built-in line
<FilesMatch "^\.ht">
Order allow,deny
Deny from all
Satisfy All
</FilesMatch>

-- and that's for a pseudo-server that would only ever be used on someone's local HD. Someone else (phranque?) can say what you get when you download the Apache software; I'm sure it isn't a blank piece of paper that you have to fill in from scratch.
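For comparison, here is a rough sketch of how the same block is usually written in Apache 2.4 syntax, where `Require` replaces the older `Order`/`Deny` pair. This assumes a server with mod_authz_core loaded (standard in 2.4) and is not copied from any particular host's config; the stock file shipped with a given build may phrase it differently.

```apache
# Apache 2.4 style: refuse every request for files whose names start with ".ht"
# (covers .htaccess and .htpasswd). Assumes mod_authz_core is available.
<FilesMatch "^\.ht">
    Require all denied
</FilesMatch>
```

Either way the effect is the same: a request for .htaccess gets a 403 no matter who asks.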

Andy500




msg:4662933
 5:24 pm on Apr 14, 2014 (gmt 0)

Thanks!

Well, we have a solid webhost and we have a good VPS, so there are no concerns there. We also have other sites in shared hosting accounts, but they are usually resellers of bigger companies like Hostgator, so that should be OK too.

So, assuming the host is professional and has the kind of setup you describe in your replies, Google would not be able to read the rules in the .htaccess file that block other bots. Correct?

Sorry for asking for confirmation, but I am not technical enough and I will admit that I got a bit lost with your answers. I wasn't sure if you were talking of regular visitors not being able to see the .htaccess rules or Google not being able to see the .htaccess rules :-D In essence, all we want is to block a lot of known bots (e.g. MOZ and their annoying OpenSiteExplorer bot and many more known crawlers/spiders) in the .htaccess file, but we want to do this without Google knowing that we have blocked these crawlers in .htaccess. I know you replied excellently to my question, but I didn't get it fully (sorry, my fault, still a noob when it comes to this - but learning fast!).

Many Thanks again!

lucy24




msg:4662977
 8:55 pm on Apr 14, 2014 (gmt 0)

I wasn't sure if you were talking of regular visitors not being able to see the .htaccess rules or Google not being able to see the .htaccess rules

There's no difference. Server directives-- whether in htaccess or the config file-- are for everyone.

All the visitor sees is a response status code: 200 "sure, go on in" vs. 301 "go around the back" vs. 403 "nuh-uh, nothing for you" vs. 404 "sorry, I'd love to help but can't find it" vs. ... et cetera. The visitor has no way of knowing how or why the response originated: one response fits all, or Because I Don't Like Your Face, or complicated activity in a secret php file, or ... et cetera again.

The one concrete difference between a human and a robot is that a human user-agent-- what we call a browser-- automatically follows a redirect, without consulting the human. A robot makes note of the redirect and stashes the information for later. They may or may not follow it right away. But a robot-- even the googlebot-- cannot ignore the 301 and continue to the originally requested file.
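
To tie this back to the original question, a minimal sketch of a user-agent block in .htaccess might look like the following. It assumes mod_rewrite is enabled and that the host allows rewrite directives in .htaccess; `ExampleBot` and `AnotherCrawler` are made-up placeholder names, not real crawlers, so substitute the user-agent strings from your own logs.

```apache
# Hypothetical example: "ExampleBot" and "AnotherCrawler" are placeholders.
RewriteEngine On

# Match the User-Agent header case-insensitively against the unwanted names.
RewriteCond %{HTTP_USER_AGENT} (ExampleBot|AnotherCrawler) [NC]

# Leave the URL untouched ("-") and answer with 403 Forbidden.
RewriteRule ^ - [F]
```

A crawler matching the pattern gets a 403 for every request. The Googlebot, whose user-agent does not match, gets the normal response and has no way of seeing that the rule exists.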

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved