
Sitemaps, Meta Data, and robots.txt Forum

    
When Robots.txt And Meta Robots Collide
What happens?
Easy_Coder

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 3274266 posted 6:19 pm on Mar 7, 2007 (gmt 0)

What would be the expected behavior if a site has this in the <head>
<meta name="robots" content="ALL" />

and this in Robots.txt

User-agent: *
Disallow: /about/
Disallow: /press/
Disallow: /products/

Would pages in /about/, /press/ and /products/ get crawled and indexed?

I've got a site that I'm going to work on and they set things up this way. The site has a PR7 on the home page and 0 on any of the pages that are in these 3 directories.

 

encyclo

WebmasterWorld Senior Member WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 3274266 posted 6:27 pm on Mar 7, 2007 (gmt 0)

No, because the spider reads the robots.txt first and will never fetch the pages within those directories. So robots.txt takes precedence over the meta element.
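A minimal sketch of that ordering, assuming Python's standard `urllib.robotparser` and the rules quoted in the first post. `example.com` and the `may_fetch` helper are placeholders for illustration, and real crawlers may differ in detail:

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the robots.txt in the original question.
ROBOTS_TXT = """\
User-agent: *
Disallow: /about/
Disallow: /press/
Disallow: /products/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(url: str, user_agent: str = "*") -> bool:
    """Return True if a compliant crawler would be allowed to request the URL."""
    return parser.can_fetch(user_agent, url)


# example.com is a placeholder host, not the site discussed in this thread.
for path in ("/", "/about/team.html", "/press/", "/contact.html"):
    url = "http://example.com" + path
    status = "fetch" if may_fetch(url) else "blocked, meta tag never seen"
    print(path, "->", status)
```

A compliant spider that gets a "blocked" answer never requests the page, so whatever sits in its <head> is irrelevant.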

Easy_Coder

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 3274266 posted 6:31 pm on Mar 7, 2007 (gmt 0)

Thanks encyclo...

I also just read a post from Vanessa Fox dated 3/5/07, subject "Using the robots meta tag", on the Google Webmaster Blog today, and it turns out that ALL is not even an accepted value for the robots meta tag from Google's perspective. The post indicated that the following were valid values (case insensitive):

NOINDEX
NOFOLLOW
NOARCHIVE
NOSNIPPET
NOODP
NONE

encyclo

WebmasterWorld Senior Member WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 3274266 posted 6:51 pm on Mar 7, 2007 (gmt 0)

"ALL" is acceptable, but it does nothing but confirm what is already the default. The same goes for "INDEX" and "FOLLOW". As none of these (valid) values has any modifying effect on Googlebot, it has no need to read them. Using "ALL" does no harm, but it is unnecessary.
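To illustrate the point, here is a rough sketch that pulls the values out of a robots meta tag and drops the ones that only restate the default. It uses Python's standard `html.parser`; the class name `RobotsMetaParser` and the helper `restrictive_values` are made up for this example and are not part of any crawler:

```python
from html.parser import HTMLParser

# Values that only restate the default behaviour, per the post above.
DEFAULT_VALUES = {"all", "index", "follow"}


class RobotsMetaParser(HTMLParser):
    """Collect the comma-separated values of <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.values = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute names arrive lowercased; values may be None.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for value in attr.get("content", "").split(","):
                if value.strip():
                    self.values.add(value.strip().lower())


def restrictive_values(html):
    """Return only the robots meta values that change the default behaviour."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.values - DEFAULT_VALUES


print(restrictive_values('<meta name="robots" content="ALL" />'))
print(restrictive_values('<meta name="robots" content="noindex,nofollow">'))
```

With content="ALL" nothing restrictive is left over, which is why the tag can simply be omitted.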

oddsod

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3274266 posted 8:33 pm on Mar 8, 2007 (gmt 0)

No, because the spider reads the robots.txt first and will never fetch the pages within those directories.

You sure about that encyclo? AFAIK, Google blatantly crawls all over those directories, reads and stores information from there, but just doesn't include it in the SERPs.

jimbeetle

WebmasterWorld Senior Member WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 3274266 posted 8:46 pm on Mar 8, 2007 (gmt 0)

You sure about that encyclo? AFAIK, Google blatantly crawls all over those directories, reads and stores information from there, but just doesn't include it in the SERPs.

That would mean that gbot is not obeying robots.txt, and that would be news all over the webmaster/sem forums. Haven't seen that as yet. While there have been isolated cases reported where googlebot appeared not to obey robots.txt, these usually involved a very recently changed robots.txt file, with the bot apparently working from its most recently cached copy.

Pages blocked by robots.txt can, and will, be indexed if G finds links pointing to them. As Encyclo says, the page must be exposed to the bot so it can read the NOINDEX.

mcavic

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 3274266 posted 9:33 pm on Mar 8, 2007 (gmt 0)

spider reads the robots.txt first and will never fetch the pages

Correct, a denial in robots.txt means that the spider isn't allowed to look at the page to see whether or not it has a meta tag.

encyclo

WebmasterWorld Senior Member WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 3274266 posted 10:02 pm on Mar 8, 2007 (gmt 0)

A directory or page can be blocked by robots.txt but still be present within Google, but shown as URL only with no description - as Googlebot won't actually visit the URL in question but it will take account of links to it.

So, if you have 100 links on an indexed page to 100 pages residing in a blocked directory, then those 100 pages can still be listed in the SERPs.
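The outcomes described in this thread can be summed up in a small decision function. This is only a simplified model of what posters here report, not Google's documented algorithm; the function name, parameters and return labels are all invented for illustration, and an unblocked page is assumed to have been discovered and crawled:

```python
def expected_listing(blocked_by_robots_txt, has_inbound_links,
                     meta_values=frozenset()):
    """Simplified model of the listing outcomes described in this thread."""
    if blocked_by_robots_txt:
        # The page is never fetched, so any robots meta tag on it is never
        # read. Links pointing at it can still get the bare URL listed.
        return "url-only" if has_inbound_links else "not listed"
    # The page is crawlable, so its meta tag is seen and honoured.
    # "none" is shorthand for "noindex, nofollow".
    if "noindex" in meta_values or "none" in meta_values:
        return "not listed"
    return "full listing"


# Blocked directory with links pointing in: URL-only, even with NOINDEX set.
print(expected_listing(True, True, frozenset({"noindex"})))
# Crawlable page carrying NOINDEX: kept out of the results.
print(expected_listing(False, True, frozenset({"noindex"})))
```

The practical upshot is that a page has to stay crawlable if you want its NOINDEX to be seen.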

Occasionally, a URL which is blocked by robots.txt but is listed in the DMOZ directory will display with the DMOZ title and data rather than URL only. I am not aware of Google ever using any other data (e.g. backlink text) to add a title or description.

In my experience, Googlebot (or any other major search engine bot) has never violated the robots.txt directives. If you find that it does on a particular site, then the most likely explanation is that there is a problem with your robots.txt syntax.

blend27

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 3274266 posted 10:47 pm on Mar 8, 2007 (gmt 0)

Once in a blue moon GoogleBot will wander off to a disallowed directory or page.

On the other hand, SLURP is notorious for getting itself banned: it requests non-linked directories that were simply mentioned in robots.txt to test the bot trap.

MSNBot was caught a couple of times and later even had a cached copy of the page in the index, after getting a real 403 on a page that was disallowed by robots.txt, for a couple of days. That page is still displayed as a link when site:mydomain.tld is used. If I set 50 results per page, it shows links to 2 bot-trap pages in the top 20 at this time. Those pages have
<meta name="robots" content="noindex,nofollow"> in the head of the document but are linked to from the footer on almost every page.

:)

oddsod

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 3274266 posted 11:09 am on Mar 9, 2007 (gmt 0)

Thanks for the confirmation, encyclo.

Google has publicly stated (I think) that not all their bots respect robots.txt. The Adwords bot is an example. Claims exist in threads like this one [webmasterworld.com]. But this does not relate to the main googlebot itself.

Various confirmed reports - by members like g1smd (whose posts I greatly respect) - on Google violations involved violation of robots meta tags [webmasterworld.com], not robots.txt.

I suppose I was misled by the many anecdotal accounts here about Google ignoring robots.txt itself.

Need More Hits

10+ Year Member



 
Msg#: 3274266 posted 1:19 am on Mar 10, 2007 (gmt 0)

I have seen in the past that Goog completely ignores robots.txt.
I had many pages from my tracking script indexed in the supplemental index even though I had the proper language in my robots.txt, and the same goes for CSS and Java.
Brad

mcavic

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 3274266 posted 1:29 am on Mar 10, 2007 (gmt 0)

many pages in the supplemental index

With or without snippets? Googlebot does include disallowed urls in the index, but only the urls, not the content.

Need More Hits

10+ Year Member



 
Msg#: 3274266 posted 1:57 am on Mar 10, 2007 (gmt 0)

With or without snippets?
Please explain?

If you are talking about the SE text snippets, I really don't recall what was there. I wish I had paid more attention.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved