Welcome to WebmasterWorld Guest from 18.104.22.168
Oh yeah! There are thousands of them. Now that is a programming bug.
Obviously when you place the <meta name="robots" content="noindex"> tag on a page, you expect that the page will be spidered, but not indexed. But if Google keeps no data about that URL, they will never "remember" anything about that URL, and will return every few minutes to "discover" the content, again and again.
However, when you think about it, Google must keep a copy of that page internally so that they can tell when the content has changed, and so that they know where that page links to, and so that they know the index/noindex status of the page.
Such a page should never appear in search results, ever. Well, now they do, and in very large numbers; all (so far) marked as Supplemental Results, and with cache dates from a year ago.
It appears that something in their system is forgetting to check the index/noindex status of the pages in their database and is showing them all in the SERPs whatever their status.
I first noticed this yesterday; but found it on some searches that I have not done for several months. I have no idea how long this bug has been showing up... it could be several months.
Traditionally, a robots.txt disallow instruction for a URL simply led to the page appearing as a URL-only entry in Google's index if they ever saw a link to the disallowed page. The meta noindex tag ensured that nothing appeared for the page at all. That is now completely broken.
For Yahoo, when they find the meta noindex tag, they do include the page as a URL-only entry, but often they also try to build an entry for the page, by using the anchor text of one of the links pointing to the page (from an external site) as the SERPs entry title!
Anyhow, while we have your attention, I would like to suggest a robot.txt protocol that utilizes server headers. For example, the following example could be used to prevent Googlebot from indexing any jpg file on an Apache server.
<Files ~ "\.jpg$">
Header append Robots "noindex"
This way, we can block specific files types on the server side and keep the main robots.txt file compliant for bots that choke on a lot of the proprietary directives or special characters that some bots use. Besides, it's a much more flexible solution than a standard robots.txt file or meta tag is and it would save bandwidth for both parties.
The bot asking for URLs that do not exist, is trying to see what your real "404 error page" really looks like. Many brain dead webmasters serve a 302 redirect for "page not found" and their bot gets confused by it. They now test for that condition.
After playing with the robots.txt analysis tool on google sitemaps stats page i came across the following
in robots.txt the following line
now if I use the "Test URLs against your robots.txt file" with http://www.site.com/dirabc comes up with "Blocked by line 63: Disallow: /dirabc/"
But if I try it with http://site.com/dirabc then I get "Allowed Not in domain"
Guess its time to go play with .htacess to get the http://site.com domains to goto www.site.com, still I dont think that the above should be happening
[edited by: lawman at 11:04 pm (utc) on June 15, 2006]
Disallow, does not disallow folders, or files, but URLs that start with the partial-URL mentioned in the exclusion. The disallow is on a per-domain basis, with the robots.txt file being found in the same root folder of the domain as the files that it can exclude.
I put the tag:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
on each page thinking that would be enough to stop google indexing it and penalising me for duplicate content.
Today I notice my site is disapearing intermitantly from the google listings for its most popular phrase. However, when I check out its rankings at the Big Daddy Dance Tool site I see that my site is still ranking number 3 for each datacentre.
Would it be possible that google is now penalising me for duplicate content? I don't have a robots.txt file, should I have one of these, will it help? Or should I just remove the duplicate content all together. When I say duplicate, its not really, but its similar. Kind of like a summary.
Is anyone else noticing anything like this? Or is this intermitant dropping in and out business just normal and related to regular index movements.
my site is more older than year but just one page was crawled that is home page and that was indexed in top SERPs, for few key word it was on top rank but yesterday my all pages got crawled and now i am no where in SERPs can any body suggest what is that i mean what must be reason behind that
Amazon-owned Alexa shows MSN-crawled SERPs WITH code-forbidden Caches
NOARCHIVE and robots.txt instructions meaningless. Three strikes?
Over time, G* forgot about the month-old CACHED copies and started using copies from a year or more ago and called them SUPPLEMENTAL. Those obsolete pages seem to be (seem to be) going away a few at a time. Google now thinks it has under 9,000 of my pages. That's still too many but it keeps getting better.
My guess (just a guess) is that the 19,000 counted multiple archived copies of my pages from different dates. Now each crawl of a page with a <robots follow noindex noarchive> tag seems to kill off a few offending obsolete copies.
[NOTE: I am forcing one specific server to remove that as a variable.]