
Forum Moderators: goodroi


When Robots.txt And Meta Robots Collide

What happens?

   
6:19 pm on Mar 7, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



What would be the expected behavior if a site has this in the <head>
<meta name="robots" content="ALL" />

and this in Robots.txt

User-agent: *
Disallow: /about/
Disallow: /press/
Disallow: /products/

Would pages in /about/, /press/ and /products/ get crawled and indexed?

I've got a site that I'm going to work on and they set things up this way. The site has a PR7 on the home page and 0 on any of the pages that are in these 3 directories.

6:27 pm on Mar 7, 2007 (gmt 0)

WebmasterWorld Senior Member encyclo is a WebmasterWorld Top Contributor of All Time 10+ Year Member



No, because the spider reads the robots.txt first and will never fetch the pages within those directories. So robots.txt takes precedence over the meta element.
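
For anyone who wants to check this locally, here is a minimal sketch using Python's standard `urllib.robotparser`. It only shows how a compliant crawler would interpret the rules quoted in the first post; the sample paths and the `Googlebot` agent string are assumptions chosen for illustration, and this is not how any particular search engine is implemented.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted in the original question.
ROBOTS_TXT = """\
User-agent: *
Disallow: /about/
Disallow: /press/
Disallow: /products/
"""

def allowed_paths(robots_txt, paths, agent="Googlebot"):
    """Return {path: True/False} for whether a compliant bot may fetch each path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {path: parser.can_fetch(agent, path) for path in paths}

# Hypothetical paths, used only to exercise the rules.
result = allowed_paths(
    ROBOTS_TXT,
    ["/", "/about/team.html", "/press/", "/products/widget.html"],
)
for path, ok in result.items():
    print(path, "fetch allowed" if ok else "blocked by robots.txt")
```

A blocked path is never requested, so whatever meta element the page carries is never seen by the bot.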
6:31 pm on Mar 7, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks encyclo...

I also just read a post on the Google Webmaster Blog from Vanessa Fox, dated 3/5/07 and titled "Using the robots meta tag". It turns out that ALL is not even an accepted value for the robots meta tag, from Google's perspective. The post indicated that the following were valid values (case insensitive):

NOINDEX
NOFOLLOW
NOARCHIVE
NOSNIPPET
NOODP
NONE
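
As a rough illustration of how the values in this list combine, the sketch below pulls the robots meta element out of a page with Python's standard `html.parser` and reduces it to index/follow flags. Two assumptions are mine, not the blog post's: NONE is treated as shorthand for NOINDEX plus NOFOLLOW, and any token outside the list (ALL, INDEX, FOLLOW and so on) is treated as a no-op that leaves the default in place. The class and function names are hypothetical.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the comma-separated tokens of every <meta name="robots"> element."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Values are case insensitive, so normalise to lower case.
            self.directives |= {
                token.strip().lower()
                for token in content.split(",")
                if token.strip()
            }

def effective_policy(html):
    """Return index/follow flags; unrecognised tokens leave the defaults alone."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives
    return {
        "index": not ({"noindex", "none"} & found),
        "follow": not ({"nofollow", "none"} & found),
    }

print(effective_policy('<meta name="robots" content="ALL" />'))
print(effective_policy('<meta name="robots" content="noindex,nofollow">'))
```

Under these assumptions the ALL element from the first post yields the same flags as having no robots meta element at all.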

6:51 pm on Mar 7, 2007 (gmt 0)

WebmasterWorld Senior Member encyclo is a WebmasterWorld Top Contributor of All Time 10+ Year Member



"ALL" is acceptable, but it does nothing but confirm what is already the default. Same goes for "INDEX" and "FOLLOW". As none of these (valid) values have any modifying effect on Googlebot, they have no need to read them. Using "ALL" does no harm, but is unnecessary.
8:33 pm on Mar 8, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



No, because the spider reads the robots.txt first and will never fetch the pages within those directories.

You sure about that encyclo? AFAIK, Google blatantly crawls all over those directories, reads and stores information from there, but just doesn't include it in the SERPs.

8:46 pm on Mar 8, 2007 (gmt 0)

WebmasterWorld Senior Member jimbeetle is a WebmasterWorld Top Contributor of All Time 10+ Year Member



You sure about that encyclo? AFAIK, Google blatantly crawls all over those directories, reads and stores information from there, but just doesn't include it in the SERPs.

That would mean that gbot is not obeying robots.txt, and that would be news all over the webmaster/sem forums. Haven't seen that as yet. While there have been isolated reports of Googlebot appearing not to obey robots.txt, these usually involved a very recently changed robots.txt file, with the bot apparently working from its most recently cached copy.

Pages blocked by robots.txt can, and will, be indexed if G finds links pointing to them. As Encyclo says, the page must be exposed to the bot so it can read the NOINDEX.

9:33 pm on Mar 8, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



spider reads the robots.txt first and will never fetch the pages

Correct, a denial in robots.txt means that the spider isn't allowed to look at the page to see whether or not it has a meta tag.
10:02 pm on Mar 8, 2007 (gmt 0)

WebmasterWorld Senior Member encyclo is a WebmasterWorld Top Contributor of All Time 10+ Year Member



A directory or page can be blocked by robots.txt but still be present within Google, but shown as URL only with no description - as Googlebot won't actually visit the URL in question but it will take account of links to it.

So, if you have 100 links on an indexed page to 100 pages residing in a blocked directory, then those 100 pages can still be listed in the SERPs.

Occasionally, a URL which is blocked by robots.txt but is listed in the DMOZ directory will display with the DMOZ title and data rather than URL only. I am not aware of Google ever using any other data (eg. backlink text) to add a title or description.

In my experience, Googlebot (or any other major search engine bot) has never violated the robots.txt directives. If you find that it does on a particular site, then the most likely explanation is that there is a problem with your robots.txt syntax.
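
The difference between being crawled and being listed, as described in this thread, can be summarised in a small decision sketch. This is a simplified model of the behaviour posters report here, not Google's actual logic; the function name, its three boolean inputs and the outcome labels are all hypothetical.

```python
def listing_outcome(blocked_by_robots_txt, has_inbound_links, meta_noindex):
    """Simplified model of how a URL might appear in results, per this thread."""
    if blocked_by_robots_txt:
        # The page is never fetched, so any robots meta element is never read.
        # Links pointing at the URL can still get it listed as URL only.
        return "url-only" if has_inbound_links else "not listed"
    if meta_noindex:
        # The page was fetched, so the bot could read and honour NOINDEX.
        return "not listed"
    return "full listing"

# The /about/ pages from the first post: disallowed, but linked from the site.
print(listing_outcome(True, True, False))
```

Note that in this model `meta_noindex` has no effect once `blocked_by_robots_txt` is true, which is the collision the thread title asks about.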

10:47 pm on Mar 8, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Once in a blue moon Googlebot will wander off into a disallowed directory or page.

On the other hand, Slurp is notorious for getting itself banned: it requests unlinked directories that are only mentioned in robots.txt, where I list them to test the bot trap.

MSNBot was caught a couple of times, and later even had a cached copy of the page in the index, after getting a real 403 for a couple of days on a page that was disallowed by robots.txt. That page is still listed when site:mydomain.tld is used. If I set it to show 50 results per page, it shows links to 2 bot-trap pages in the top 20 at this time. Those pages have
<meta name="robots" content="noindex,nofollow"> in the head of the document but are linked to from the footer on almost every page.

:)

11:09 am on Mar 9, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks for the confirmation, encyclo.

Google has publicly stated (I think) that not all their bots respect robots.txt. The AdWords bot is an example. Claims exist in threads like this one [webmasterworld.com]. But this does not relate to the main Googlebot itself.

Various confirmed reports - by members like g1smd (whose posts I greatly respect) - on Google violations involved violation of robots meta tags [webmasterworld.com], not robots.txt.

I suppose I was misled by the many anecdotal accounts here about Google ignoring robots.txt itself.

1:19 am on Mar 10, 2007 (gmt 0)

10+ Year Member



I have seen in the past that Google completely ignores robots.txt.
Many pages from my tracking script were indexed in the supplemental index even though I had the proper language in my robots.txt, and the same goes for CSS and Java.
Brad
1:29 am on Mar 10, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



many pages in the supplemental index

With or without snippets? Googlebot does include disallowed urls in the index, but only the urls, not the content.
1:57 am on Mar 10, 2007 (gmt 0)

10+ Year Member



With or without snippets?
Please explain?

If you are talking about the SE text snippets, I really don't recall what was there. I wish I had paid more attention.

 
