The idea is that it gives Google a chance to make sure that I have nothing to hide, so that no filter will be applied to my site.
Is this really so, or am I wrong? Do you have any evidence for or against?
On the other hand, if Google does not index a page with noindex, does it still use this page to determine the topic of the site? Does anyone know anything about this?
I have a personal blog where I write about my daily life: about one new post each day. One day I noticed that the front page and the date pages (for example www.example.com/2005/03/29) were present in the Google SERPs, but the real articles were not. Because I write about one post per day, the content of the day list is almost identical to the content of the real article. Google applied the dupe content filter and decided that the date pages were more important than the real articles because of the higher number of incoming links.
I didn't like Google's decision, because now pages were indexed in Google without a proper path name and title.
Therefore I changed the weblog software in such a way that "noindex,follow" is added to all date pages, front pages, category lists etc. Now all date pages, aggregate pages etc. have disappeared from the SERPs and they are replaced by the articles with proper titles.
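For anyone wanting to try the same change, here is a minimal sketch of the idea in Python. The page kinds and the function name are made up for illustration and do not come from any particular weblog package; only the meta tag values are taken from the post above.

```python
# Hypothetical page kinds; these names are illustrative assumptions,
# not taken from any real weblog software.
AGGREGATE_KINDS = {"front", "date", "category"}


def robots_meta(page_kind: str) -> str:
    """Return the robots meta tag to emit in the <head> of a page."""
    if page_kind in AGGREGATE_KINDS:
        # Listing pages: keep them out of the index, but let the
        # crawler follow their links to the individual articles.
        return '<meta name="robots" content="noindex,follow">'
    # Individual articles stay indexable.
    return '<meta name="robots" content="index,follow">'


if __name__ == "__main__":
    for kind in ("front", "date", "category", "article"):
        print(kind, robots_meta(kind))
```

The template that renders each page would call something like this once per request, so new date pages pick up the tag automatically without any list of URLs to maintain.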
It would have been very difficult to do this with robots.txt because the content of the site changes on a daily basis. Furthermore, robots.txt stops spidering, so denying the weblog root would probably cause the whole weblog to disappear from the SERPs. The "noindex,follow" only stops indexing; deeper pages are still accessible and Google indexes them without problems.
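To illustrate why a Disallow on the weblog root is so blunt, here is a small sketch using Python's standard urllib.robotparser. The /weblog/ path and the example URLs are assumptions made for the demo; the check only shows what a crawler that obeys the rule is allowed to fetch, not how Google treats the URLs afterwards.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_lines, url, agent="Googlebot"):
    """Return True if the given robots.txt lines let the agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, url)


# Assumed robots.txt that denies the weblog root (the path is hypothetical).
rules = [
    "User-agent: *",
    "Disallow: /weblog/",
]

if __name__ == "__main__":
    # A deep article under the weblog root is blocked along with the listings.
    print(allowed(rules, "http://www.example.com/weblog/2005/03/29/my-post.html"))
    # A page outside the weblog root is still fetchable.
    print(allowed(rules, "http://www.example.com/about.html"))
```

Because the rule matches by path prefix, every article below the root is blocked together with the date pages, which is the effect the meta tag approach avoids.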
Noindex, otoh, removes the page from the index.
Sort of. I have a few 'dead' pages that I overwrote with a simple 'this page has moved to [new url]'* and added a robots 'noindex,follow' meta tag. This change was made on 14th Dec 2004. Today Google still includes the URL in a site:www.mydomain.com search, and its title and snippet are from the OLD page content. The cache link returns a 'Your search did not match any documents'. It looks like Google is obeying the 'noindex' for the current content, but still has the URL indexed and retains the old, pre-noindex content for titles and snippets.
*Don't ask why I didn't just 301 it, there was a reason but I don't recall now. Also, this was just a simple link, no on-page redirect.
> It would have been very difficult to do this with robots.txt because the content of the site changes on a daily basis. Furthermore, robots.txt stops spidering, so denying the weblog root would probably cause the whole weblog to disappear from the SERPs.
Wouldn't that be easier?
It would be no problem if I posted many times a day, but with one post each day the content of the day-thread and the post itself differ only in the filename and the title. The dupe content filter sees that many pages have the calendar in the margin, so there are many links to the day-thread, but only one or two links point to each individual post. Therefore the individual post is marked as a duplicate and the day-thread is indexed.
If I noindex /page.html and remove the noindex one month later, will G index it again, with no problems or penalties because of the previous noindex?
I am replacing or moving some content to a new domain and need something like this.
If Google doesn't spider the page how would they really know anyway?
With noindex *follow*, Google probably at least reads the content of the page in order to follow the links. Google naturally doesn't use the content for its main index, but it may still use it to determine the topic of the page and maybe the topic of the site.
Google may also use the content to check that there is no cloaking.
The question is: does Google really do this?
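Whatever Google actually does with the text, one thing is mechanical: a crawler cannot see a robots meta tag without fetching and parsing the page, and at that point the links are in hand as well. The sketch below shows that step with Python's standard html.parser. The class and function names are invented for the example, and it says nothing about Google's real pipeline.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect robots meta directives and link targets from one page."""

    def __init__(self):
        super().__init__()
        self.directives = set()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            # Directives are comma separated, e.g. "noindex,follow".
            for part in content.split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])


def inspect(html):
    """Return (indexable, links_to_follow) for a fetched page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    indexable = "noindex" not in parser.directives
    links = [] if "nofollow" in parser.directives else parser.links
    return indexable, links


page = (
    '<html><head><meta name="robots" content="noindex,follow"></head>'
    '<body><a href="/2005/03/29/my-post.html">post</a></body></html>'
)

if __name__ == "__main__":
    print(inspect(page))
```

So with "noindex,follow" the page body has necessarily been downloaded; whether it is then used for topic detection or cloaking checks is the part nobody outside Google can confirm.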
This works well but I'm concerned about what will happen if Googlebot visits while I'm doing this. Can anyone tell me if this might cause Google to stop crawling my site?
Sorry if this is a bit off topic.
I think the bottom line answer is that there is no functional difference in the real world.
From what we have seen here talking with so many people, Google treats the robots.txt settings as a NO INDEX setting. They do not believe that "disallow" means they cannot legitimately spider that data and use it. That is, they feel that spidering and using any data from your site that they wish, but not listing it, is OK.
We've seen a lot of comments over the years from people saying that Gbot does not obey robots.txt, because Google will not index, but will still spider and visit, pages listed in robots.txt.
I have never seen a page that is "delisted" any differently because of a robots.txt entry or because of a NO INDEX tag added to a page. I will seek clarification on this point.
> remove noindex
Yes, I recently did that on an entire site, and Google picked right up on it and started listing the site within a couple of weeks.