Forum Moderators: Robert Charlton & goodroi
use a noindex robots meta tag instead of robots.txt rules
so it helps to "nofollow" links to pages you do not want crawled
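A minimal sketch of the two mechanisms being discussed (the page path and link target below are hypothetical):

```html
<!-- On the page itself: ask engines not to index it.
     The page must be crawlable for this tag to ever be read. -->
<meta name="robots" content="noindex">

<!-- On pages linking to it: hint that the link should not be followed -->
<a href="/private-page.html" rel="nofollow">private page</a>
```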
Example: I have a website without a sitemap. I have a directory that is disallowed in robots.txt, all links to the pages in that directory are nofollowed, and there are no external links to those pages.
Yet, one of them made its way into the index.
Google then constructs a title and snippet for the URL from references alone rather than by crawling the page directly. But it doesn't look like a snippet; it just looks the way other pages are displayed in Google's search results. I have come across this snippet behaviour on other websites, though.
And remember to change your robots.txt file so you now ALLOW Googlebot to crawl the page. Unless they crawl, they won't ever read the robots meta tag.

Oops, I wouldn't have done that if you hadn't informed me. Thanks a ton!
If the URL is in your sitemap, the page will be crawled.

Are you sure that even though I may block a web page using noindex meta tags, the page will still be indexed if the URL has been included in the sitemap? I have never heard of this before. Can you give me some references or share your personal experiences? Thanks
1. If I block a page as 'do not crawl', how can the spiders still index it? If they don't crawl a page, how can they index it? Crawling is the very first step to indexing, right?
2. Do the SE spiders actually care about what is in robots.txt?
There is no problem with my robots.txt, though!
It starts accumulating PageRank, and all the other externally defined factors that exist in Google's world
Could you please explain this to me?
Also: Once Googlebot finds its name in robots.txt, it ignores all other sections. So if you want to block some areas from Googlebot, and some areas from all robots, you'll have to say those parts twice.

I have seen this on many websites and had wondered why they repeat all the rules for the different spiders, such as Googlebot, Yahoo's, Alexa's, Ask's, etc. So, from my robots.txt above, I am just going to remove the 'Noindex' section, which, as many of you have told me, is of no use. If I remove that section, then the 'User-agent: Googlebot' line will also be removed and there will be only one section for all the crawlers, 'User-agent: *'. That is enough, right?
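A sketch of what "saying those parts twice" looks like in practice (the paths are hypothetical):

```
# Googlebot reads only its own section and ignores the rest,
# so shared rules must be repeated here.
User-agent: Googlebot
Disallow: /private/
Disallow: /google-only-block/

# Every other crawler falls through to this section.
User-agent: *
Disallow: /private/
```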
Still, if it shows up in your sitemap they may index it anyway. That is because, if you read about the purpose of the sitemap, it is to list the pages you want to have indexed. I found out the hard way a long time ago that you should only have pages in the sitemap that you do want indexed, because a noindex metatag on the page gets ignored when they find the URL in the sitemap.

This is what makes me learn more about SEO. Thanks for letting me know of that, bud!
I am reminded of it again whenever I try to do away with an old page and forget to remove it from the sitemap after I put a noindex metatag on the page.

Out of curiosity, why don't you put a redirect in place?
I submit new sitemaps and still see 404s from pages that have not existed for two years and are not in any current sitemap. I appreciate that I can now mark them as "Fixed", but I know they will be back.

I have the same problem. My website has more than 600,000 pages and I am getting 18k server errors in the GWT crawl errors section. It shows pages that never existed on my website, and whenever I mark them as fixed they show up again. I am fed up with the 'mark as fixed' process.
Thanks for that! That snippet - yes, it has been long discussed. I don't know why G should even index the URL-only version when we have blocked it. Only G knows!
If you use both a robots.txt file and robots meta tags
If the robots.txt and meta tag instructions for a page conflict, Googlebot follows the most restrictive. More specifically:
• If you block a page with robots.txt, Googlebot will never crawl the page and will never read any meta tags on the page.
• If you allow a page with robots.txt but block it from being indexed using a meta tag, Googlebot will access the page, read the meta tag, and subsequently not index it.
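The first bullet is the crux of the whole thread, and it can be demonstrated with Python's standard-library robots.txt parser. This is only a sketch: the rules and URLs are made up, and it models what a robots.txt-honoring crawler would do, not Googlebot itself.

```python
import urllib.robotparser

# Parse hypothetical rules directly, so the example is self-contained
# and does not fetch anything over the network.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# A compliant crawler never downloads a disallowed page, so any
# <meta name="robots" content="noindex"> on it is never seen --
# which is why the URL can still get indexed from links alone.
print(rp.can_fetch("Googlebot", "https://example.com/private/page.html"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/public.html"))        # True
```

Note that `can_fetch()` answers only "may I download this URL?"; indexing a URL from external references requires no download at all.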
You need a robots.txt file only if your site includes content that you don't want search engines to index. If you want search engines to index everything in your site, you don't need a robots.txt file, not even an empty one.
To remove a page or image, you must do one of the following:
* Make sure the content is no longer live on the web. Requests for the page must return an HTTP 404 (not found) or 410 status code.
* Block the content using a robots.txt file.
* Block the content using a meta noindex tag.
To remove a directory and its contents, or your whole site, you must ensure that the pages you want to remove have been blocked using a robots.txt file. Returning a 404 isn't enough, because it's possible for a directory to return a 404 status code, but still serve out files underneath it. Using robots.txt to block a directory ensures that all of its children are disallowed as well.
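A sketch of such a directory-level block, using a hypothetical /old-section/ directory:

```
User-agent: *
# Disallows /old-section/ and everything underneath it,
# e.g. /old-section/page.html and /old-section/images/logo.png,
# regardless of what status codes those URLs return.
Disallow: /old-section/
```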
<snip>
Content removed with this tool will be excluded from the Google index for a minimum of 90 days.
So, I can easily exclude various parameters that might lead googlebot into a major duplicate content area, such as:
Disallow: /category?sort=
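For reference, Googlebot also supports the `*` wildcard in robots.txt paths, which makes parameter exclusions like this easier to generalize (the /category path is hypothetical):

```
User-agent: Googlebot
# Block the specific category listing sorted views...
Disallow: /category?sort=
# ...or block a sort= parameter at the start of any query string.
Disallow: /*?sort=
```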
Would not a person of ordinary intelligence interpret this to mean that a file in a roboted-out directory will stay out of the index, once removed?