Forum Moderators: goodroi
and this in robots.txt:
User-agent: *
Disallow: /about/
Disallow: /press/
Disallow: /products/
Would pages in /about/, /press/ and /products/ get crawled and indexed?
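To see what a rule-abiding crawler should do with that file, here is a minimal sketch using Python's standard-library robots.txt parser. The example.com URLs and file names are hypothetical, used only to exercise the three Disallow rules quoted above.

```python
# Sketch: feeding the robots.txt quoted above to Python's stdlib parser.
# The URLs checked below are hypothetical examples.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /about/
Disallow: /press/
Disallow: /products/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Any compliant crawler is blocked from the three directories...
print(parser.can_fetch("Googlebot", "http://example.com/about/team.html"))  # False
print(parser.can_fetch("Googlebot", "http://example.com/press/index.html"))  # False
# ...but the rest of the site, including the home page, stays crawlable.
print(parser.can_fetch("Googlebot", "http://example.com/"))  # True
```

So a compliant bot may not fetch those pages at all, which is a separate question from whether their URLs end up listed, as discussed below.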
I've got a site that I'm going to work on and they set things up this way. The site has a PR7 on the home page and 0 on any of the pages that are in these 3 directories.
I also just read a post today on the Google Webmaster Blog from Vanessa Fox, dated 3/5/07, titled "Using the robots meta tag", and it turns out that ALL is not even an accepted value for the robots meta tag, from Google's perspective. The post indicated that the following were valid values (case-insensitive):
NOINDEX
NOFOLLOW
NOARCHIVE
NOSNIPPET
NOODP
NONE
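For reference, those values go in a robots meta tag in the page head and can be combined with commas. A hypothetical example using two of the values listed above:

```html
<!-- Hypothetical example: values are comma-separated and case-insensitive -->
<meta name="robots" content="noindex,noarchive">
```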
You sure about that encyclo? AFAIK, Google blatantly crawls all over those directories, reads and stores information from there, but just doesn't include it in the SERPs.
That would mean that gbot is not obeying robots.txt, and that would be news all over the webmaster/SEM forums. I haven't seen that as yet. While there have been isolated cases reported where Googlebot appeared not to obey robots.txt, these usually involved a very recently changed robots.txt file, with the bot apparently working from its most recently cached copy.
Pages blocked by robots.txt can, and will, be indexed if G finds links pointing to them. As Encyclo says, the page must be exposed to the bot so it can read the NOINDEX.
So, if you have 100 links on an indexed page to 100 pages residing in a blocked directory, then those 100 pages can still be listed in the SERPs.
Occasionally, a URL which is blocked by robots.txt but is listed in the DMOZ directory will display with the DMOZ title and data rather than URL only. I am not aware of Google ever using any other data (eg. backlink text) to add a title or description.
In my experience, Googlebot (or any other major search engine bot) has never violated the robots.txt directives. If you find that it does on a particular site, then the most likely explanation is that there is a problem with your robots.txt syntax.
On the other hand, Slurp is notorious for getting itself banned. It has crawled non-linked directories that were simply mentioned in robots.txt as a test of the bot trap.
MSNBot was caught a couple of times, and later even had a cached copy of the page in the index after getting a real 403 on a page that had been disallowed by robots.txt for a couple of days. That page is still linked/displayed when site:mydomain.tld is used. If I set it to show 50 results per page, it shows links to 2 bot-trap pages in the top 20 at this time. Those pages have
<meta name="robots" content="noindex,nofollow"> in the head of the document, but are linked to from the footer on almost every page.
:)
Google has publicly stated (I think) that not all their bots respect robots.txt. The AdWords bot is an example. Claims exist in threads like this one [webmasterworld.com]. But this does not relate to the main Googlebot itself.
Various confirmed reports - by members like g1smd (whose posts I greatly respect) - of Google violations involved the robots meta tag [webmasterworld.com], not robots.txt.
I suppose I was misled by the many anecdotal accounts here about Google ignoring robots.txt itself.