Forum Library, Charter, Moderators: Robert Charlton & aakk9999 & brotherhood of lan & goodroi

Google SEO News and Discussion Forum

    
Robots.txt vs Google Sitemap: who wins?
smithaa02

5+ Year Member



 
Msg#: 4599472 posted 2:07 pm on Aug 5, 2013 (gmt 0)

There are some thin pages I don't want google to index.

I've added a robots.txt to block these pages.

However, I still submit these pages via a sitemap. Why? Because I want to keep track of how many are being indexed this way.

I checked WMT and the first error I got was:

"When we tested a sample of the URLs from your Sitemap, we found that the site's robots.txt file was blocking access to some of the URLs. If you don't intend to block some of the URLs contained in the Sitemap, please use our robots.txt analysis tool to verify that the URLs you submitted in your Sitemap are accessible by Googlebot. All accessible URLs will still be submitted."

What has me worried is the last sentence, which to me indicates these URLs will still be submitted (it's not clear, though).

Another error that popped up was: "Sitemap contains urls which are blocked by robots.txt"

Does anybody know which is more powerful? Can I tell Google not to index pages while still having them in a sitemap? The Submitted/Indexed ratio is still the same, which to me says Google is not obeying the robots.txt directive, but perhaps there is a delay?

 

lucy24

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, Top Contributors Of The Month



 
Msg#: 4599472 posted 4:18 pm on Aug 5, 2013 (gmt 0)

"The Submitted/Index ratio is still the same which to me says google is not obeying the robots.txt directive"

The rest of your post seems to say the opposite: that g### IS obeying robots.txt and therefore not crawling the blocked pages.

Crawling and indexing are different things.

This is an interesting post because others have suggested that the sitemap overrides robots.txt, such that anything in a sitemap will be crawled even if roboted-out. Your experience seems to say otherwise.

Anyway, the solution is straightforward. If you want a page to be neither crawled nor indexed, don't include it in a sitemap. If you want it to be crawled but not indexed, give each page a noindex robots meta tag. If Google knows that a resource exists, it will index it unless it has been explicitly told not to.

pippo



 
Msg#: 4599472 posted 4:27 pm on Aug 5, 2013 (gmt 0)

"it will index it unless it has been explicitly told not to."

...adding, robots.txt does NOT work as a method of saying "do not index". OP seems to be confused about this.

If I link to your robots.txt-excluded pages, Google will probably add your URLs to the index and rank them based on what it can know about them without crawling them directly. So not even robots.txt plus leaving them out of the sitemap will do what you want - you have to use a NOINDEX directive somewhere.

Amusingly, in the above example, a NOINDEX wouldn't work because Google has been instructed (by you) not to crawl the page, so it can't know what the META ROBOTS directives are. I.e., robots.txt actually works against you here.

smithaa02

5+ Year Member



 
Msg#: 4599472 posted 5:45 pm on Aug 5, 2013 (gmt 0)

There are a lot of pages to noindex! Is there a way to do this with a .htaccess file?

netmeg

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 5+ Year Member, Top Contributors Of The Month



 
Msg#: 4599472 posted 5:49 pm on Aug 5, 2013 (gmt 0)

Look into the X-Robots-Tag header.

https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag
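
On Apache with mod_headers enabled, this is usually done with a `Header set X-Robots-Tag "noindex"` line scoped by something like `<FilesMatch>`. For anyone whose pages come from an application rather than static files, here is a rough Python sketch of the same idea as a WSGI wrapper. The `/thin/` prefix, `demo_app`, and `noindex_middleware` are made-up names for illustration, not anything from the thread or from Google's documentation.

```python
# Hypothetical prefix for the thin pages; adjust to your own URL layout.
NOINDEX_PREFIXES = ("/thin/",)


def noindex_middleware(app, prefixes=NOINDEX_PREFIXES):
    """Wrap a WSGI app so matching paths get an X-Robots-Tag: noindex header."""

    def wrapped(environ, start_response):
        path = environ.get("PATH_INFO", "")

        def patched_start_response(status, headers, exc_info=None):
            if path.startswith(tuple(prefixes)):
                # Drop any existing X-Robots-Tag so only one is sent.
                headers = [(k, v) for k, v in headers
                           if k.lower() != "x-robots-tag"]
                headers.append(("X-Robots-Tag", "noindex"))
            return start_response(status, headers, exc_info)

        return app(environ, patched_start_response)

    return wrapped


def demo_app(environ, start_response):
    """Stand-in application that returns the same page for every path."""
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [b"<html><body>page</body></html>"]


application = noindex_middleware(demo_app)
```

The header only helps if Googlebot is allowed to request the page, so the matching paths must not be disallowed in robots.txt.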

smithaa02

5+ Year Member



 
Msg#: 4599472 posted 5:59 pm on Aug 5, 2013 (gmt 0)

Sounds good...will be curious to see if robots.txt + x-robots can defeat sitemap.xml!

If it doesn't, I guess I'll just give up and delete those entries out of sitemap.xml.

netmeg

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 5+ Year Member, Top Contributors Of The Month



 
Msg#: 4599472 posted 7:39 pm on Aug 5, 2013 (gmt 0)

(For what it's worth, I would never put them IN the sitemap to begin with. Ever.)
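
If the sitemap is generated automatically, a quick cross-check can catch blocked URLs before they are submitted. Below is a minimal sketch using only the Python standard library. The sample robots.txt and sitemap are invented, `blocked_sitemap_urls` is a hypothetical helper, and `urllib.robotparser` may not match Googlebot's own parsing in every case (wildcard patterns, for example), so treat the result as a hint rather than a verdict.

```python
import urllib.robotparser
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def blocked_sitemap_urls(sitemap_xml, robots_txt, user_agent="Googlebot"):
    """Return the sitemap URLs that robots.txt disallows for user_agent."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())

    root = ET.fromstring(sitemap_xml)
    urls = [loc.text.strip()
            for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
            if loc.text]
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Made-up sample data for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /thin/
"""

SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/thin/page-1</loc></url>
  <url><loc>https://www.example.com/articles/guide</loc></url>
</urlset>
"""

for url in blocked_sitemap_urls(SITEMAP_XML, ROBOTS_TXT):
    print("blocked by robots.txt:", url)
```

Any URL this reports is one that would trigger the "Sitemap contains urls which are blocked by robots.txt" warning quoted in the first post.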

lucy24

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, Top Contributors Of The Month



 
Msg#: 4599472 posted 9:09 pm on Aug 5, 2013 (gmt 0)

"will be curious to see if robots.txt + x-robots can defeat sitemap.xml"

Remove the robots.txt block.

If the googlebot can't crawl your pages, it will never see the "noindex" header-- whether it's an X-Robots response header or a meta within the individual page.

JD_Toims

WebmasterWorld Senior Member Top Contributors Of The Month



 
Msg#: 4599472 posted 9:18 pm on Aug 5, 2013 (gmt 0)

"Sounds good...will be curious to see if robots.txt + x-robots can defeat sitemap.xml!"

What Lucy24 said reiterated: You *cannot* block a page from being crawled in robots.txt *and* noindex the page to have it removed, because if Google finds links to the page it usually *will* be indexed based on the information in and surrounding the links to it even though they cannot crawl the page itself.

If you want a page to be removed from the index you *must* allow GoogleBot to crawl it and either have noindex on the page, noindex in an X-Robots-Tag header for the page *or* serve GoogleBot an error code such as 403 when the page is requested.
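
To make those three options concrete, here is a small Python sketch that looks at a response the way a crawler that is allowed to fetch it would, and reports which removal signal, if any, it carries. `removal_signal` and its return strings are hypothetical names chosen for this example; it says nothing about how Google actually processes pages, and it assumes the URL is not blocked in robots.txt.

```python
from html.parser import HTMLParser


class _RobotsMetaParser(HTMLParser):
    """Collect the content of <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {k.lower(): (v or "") for k, v in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            self.directives.append(attr.get("content", "").lower())


def removal_signal(status, headers, html):
    """Return a short label for the first removal signal found, else None.

    status  -- HTTP status code of the response (int)
    headers -- mapping of response header names to values
    html    -- response body as text
    """
    if status >= 400:
        return f"error status {status}"
    for name, value in headers.items():
        if name.lower() == "x-robots-tag" and "noindex" in value.lower():
            return "X-Robots-Tag noindex"
    parser = _RobotsMetaParser()
    parser.feed(html)
    if any("noindex" in directive for directive in parser.directives):
        return "meta noindex"
    return None
```

A page that returns `None` here while also being absent from robots.txt is the case where nothing tells the crawler to keep it out of the index.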

Also, I'm really not sure I understand why you're explicitly telling GoogleBot how to find pages you don't want indexed. It seems to me you're sending "conflicting signals", unless you're just trying to get them crawled so the noindex or 403 error is seen, after which they are going to be removed from the XML Sitemap.

The "indexing system" is just that, a system [not a person], so I think it's always best to send the clearest message you can.
