Forum Moderators: Robert Charlton & goodroi
Oh yeah! There are thousands of them. Now that is a programming bug.
Obviously when you place the <meta name="robots" content="noindex"> tag on a page, you expect that the page will be spidered, but not indexed. But if Google keeps no data about that URL, they will never "remember" anything about that URL, and will return every few minutes to "discover" the content, again and again.
However, when you think about it, Google must keep a copy of that page internally so that they can tell when the content has changed, and so that they know where that page links to, and so that they know the index/noindex status of the page.
Such a page should never appear in search results, ever. Well, now they do, and in very large numbers; all (so far) marked as Supplemental Results, and with cache dates from a year ago.
It appears that something in their system is forgetting to check the index/noindex status of the pages in their database and is showing them all in the SERPs whatever their status.
I first noticed this yesterday; but found it on some searches that I have not done for several months. I have no idea how long this bug has been showing up... it could be several months.
Looks like the only sure way to stop Google indexing a page is to serve Googlebot a 403 error for that page.
I found a scraper directory with noindex tags on its "details" pages. Each details page copies a site's description, title and keywords six times, both in the page text and in the alt tags of transparent images, and puts multiple keywords in the URL as well (which means it can easily outrank any site in the directory). They do this to every site in the directory.
These scrapers will show up in the "intitle:yourdomain.com" search in Google even though the noindex tag is in place.
This site is from a foreign country and hosts its own site, so contacting the host is useless. I reported the site to Google AdSense and Google Spam, and I have also been writing to all the web design sites listed in the directory asking them to do the same.
Some of the more cynical of us might notice that we had to sign up for a solution to a problem they created in the first place, but since they are the web, when Google ain't happy, ain't nobody happy.
My original observation is that pages containing the <meta name="robots" content="noindex"> tag (for the last three years) are now showing in SERPs with a full title and snippet, and have a link to a cached page from 2005 July, and that the cached page clearly shows the meta noindex tag too.
What with Google using a "crawl caching proxy," shared by various bots, including the Adwords bot and Googlebot, and the Adwords bot ignoring robots.txt, I've been anticipating that problems might happen.
Some questions/thoughts...
While Google has talked publicly about the Adwords bot ignoring traditional robots.txt prohibitions, it hasn't mentioned anything about the robots meta tag. For a number of reasons, it would be nice to have an official word on how the Adwords bot regards this tag. I'm also wondering whether the Adwords bot ever goes beyond specified landing pages.
Additionally, with regard to g1smd's situation, and that of others whose <meta name="robots" content="noindex"> tagged pages are being indexed, I'm wondering if those affected might simultaneously be using robots.txt to disallow bots to these pages.
I'm also wondering whether anyone who'd disallowed the Adwords bot specifically has seen their blocked pages in the index.
I'd then give some nice instructions that were NOT written by a programmer but by someone who could channel Denzel (explain this to me like I'm a six year old) while doing it.
<offtopic> graywolf - I've long felt that this should apply to all software documentation and interfaces... even pitched it to some companies... but the idea hasn't caught on.</offtopic>
One thing I've noticed is that with the toolbar it tries to index those pages. My site is half designed, with a robots.txt on it and no links pointing to it, and it has been indexed by Google... it has had the robots.txt since the day I bought the domain.
>>>That would be MS rather than G. <<<
Nope..that's G.
MS thinks they are all things pc.
Google automatically takes a "snapshot" of each page it crawls and archives it.
(...)
Users can access the cached version by choosing the "Cached" link on the search results page.
(...)
Note: this tag only removes the "Cached" link for the page. Google will continue to index the page and display a snippet.
Metatags only prevent displaying "Cached" links or content on SERPs (as long as the display flag checking program runs perfectly).
Traditionally, a robots.txt disallow instruction for a URL simply led to the page appearing as a URL-only entry in Google's index if they ever saw a link to the disallowed page. The meta noindex tag ensured that nothing appeared for the page at all. That is now completely broken.
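To make the "traditional" behaviour described above concrete, here is a minimal sketch of the expected outcome for each combination. The function name and the labels ("full", "url-only", "none") are made up for illustration; this models what posters in this thread say used to happen, not how Google actually implements it.

```python
def expected_listing(disallowed_in_robots_txt, has_meta_noindex, has_inbound_link=True):
    """Return the listing a page would traditionally get (illustrative labels only)."""
    if disallowed_in_robots_txt:
        # The page is never fetched, so a meta noindex tag on it is never seen.
        # A link pointing at the URL is enough for a URL-only entry to appear.
        return "url-only" if has_inbound_link else "none"
    if has_meta_noindex:
        # The page is fetched, the tag is read, and nothing should be shown.
        return "none"
    return "full"


print(expected_listing(disallowed_in_robots_txt=False, has_meta_noindex=True))
print(expected_listing(disallowed_in_robots_txt=True, has_meta_noindex=True))
```

Note the second case: if a page is both disallowed in robots.txt and tagged noindex, the tag cannot take effect because the bot is not supposed to fetch the page at all.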
For Yahoo, when they find the meta noindex tag, they do include the page as a URL-only entry, but often they also try to build an entry for the page, by using the anchor text of one of the links pointing to the page (from an external site) as the SERPs entry title!
Anyhow, while we have your attention, I would like to suggest a robots.txt protocol that utilizes server headers. For example, the following could be used to prevent Googlebot from indexing any jpg file on an Apache server.
<Files ~ "\.jpg$">
Header append Robots "noindex"
</Files>
This way, we can block specific file types on the server side and keep the main robots.txt file compliant for bots that choke on a lot of the proprietary directives or special characters that some bots use. Besides, it's a much more flexible solution than a standard robots.txt file or meta tag, and it would save bandwidth for both parties.
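For the bot side of that suggestion, a rough sketch of how a crawler might honour such a header. The header name "Robots" is taken from the proposal above and is an assumption, as is the function name; it is a suggestion in this thread, not something any bot is known to support.

```python
def may_index(response_headers):
    """Return False if the proposed "Robots" response header contains noindex."""
    # HTTP header names are case-insensitive, so normalise them before lookup.
    normalised = {name.lower(): value for name, value in response_headers.items()}
    value = normalised.get("robots", "")
    # Allow a comma-separated list such as "noindex, nofollow".
    directives = {part.strip().lower() for part in value.split(",")}
    return "noindex" not in directives


print(may_index({"Content-Type": "image/jpeg", "Robots": "noindex"}))
print(may_index({"Content-Type": "image/jpeg"}))
```

Because the header arrives with the response headers, a bot could check it with a HEAD request and skip the download, which is where the bandwidth saving would come from.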
[robotstxt.org...]
The bot asking for URLs that do not exist is trying to see what your real "404 error page" looks like. Many brain dead webmasters serve a 302 redirect for "page not found" and the bot gets confused by it. They now test for that condition.
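A minimal sketch of that kind of test, assuming the bot requests a made-up URL that cannot exist and then looks only at the status code it gets back. The function name and the labels are hypothetical; what Google's bot actually does with the result is not documented in this thread.

```python
def classify_missing_page_probe(status):
    """Classify the response to a request for a URL that should not exist."""
    if status in (404, 410):
        # The server correctly reports that the page is missing.
        return "hard-404"
    if status in (301, 302, 303, 307):
        # "Page not found" is being served as a redirect to somewhere else.
        return "redirect"
    if status == 200:
        # An error page served with a success code.
        return "soft-404"
    return "other"


for code in (404, 302, 200):
    print(code, classify_missing_page_probe(code))
```

A site that answers the probe with anything other than a real 404 or 410 makes it hard for the bot to tell genuine pages from missing ones.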
[EDIT]
After playing with the robots.txt analysis tool on the Google Sitemaps stats page, I came across the following.
In robots.txt I have the following line:
Disallow: /dirabc/
Now if I use the "Test URLs against your robots.txt file" option with http://www.site.com/dirabc, it comes up with "Blocked by line 63: Disallow: /dirabc/"
But if I try it with http://site.com/dirabc then I get "Allowed Not in domain"
Guess it's time to go play with .htaccess to get the http://site.com URLs to go to www.site.com. Still, I don't think that the above should be happening.
[edited by: lawman at 11:04 pm (utc) on June 15, 2006]
Disallow does not disallow folders or files, but URLs that start with the partial URL mentioned in the exclusion. The disallow is on a per-domain basis, with the robots.txt file being found in the same root folder of the domain as the files that it can exclude.
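Here is a small sketch of those two points: matching is by URL-path prefix, and each hostname has its own robots.txt. The rules table and the function name are made up for illustration, using the hostnames from the post above, and the matching is strict prefix matching only (no wildcards, no Allow lines).

```python
from urllib.parse import urlsplit

# Hypothetical rules, keyed by hostname. Only www.site.com has a robots.txt here.
RULES_BY_HOST = {
    "www.site.com": ["/dirabc/"],
}


def is_blocked(url, rules_by_host=RULES_BY_HOST):
    """Return True/False for a known host, or None if the host has no rules."""
    parts = urlsplit(url)
    prefixes = rules_by_host.get(parts.netloc.lower())
    if prefixes is None:
        # A robots.txt on www.site.com says nothing about site.com.
        return None
    path = parts.path or "/"
    # Disallow matches any URL path that starts with the given partial URL.
    return any(path.startswith(prefix) for prefix in prefixes)


print(is_blocked("http://www.site.com/dirabc/page.html"))
print(is_blocked("http://www.site.com/dirabcd.html"))
print(is_blocked("http://site.com/dirabc/page.html"))
```

Under strict prefix matching, http://www.site.com/dirabc (no trailing slash) would not match "/dirabc/". The Sitemaps tool reported it as blocked in the post above, so Google's handling of that case evidently differs from this simplified sketch.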
I put the tag:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
on each page, thinking that would be enough to stop Google indexing it and penalising me for duplicate content.
Today I notice my site is disappearing intermittently from the Google listings for its most popular phrase. However, when I check its rankings at the Big Daddy Dance Tool site, I see that my site is still ranking number 3 on each datacentre.
Would it be possible that Google is now penalising me for duplicate content? I don't have a robots.txt file. Should I have one of these, and will it help? Or should I just remove the duplicate content altogether? When I say duplicate, it's not really, but it's similar. Kind of like a summary.
Is anyone else noticing anything like this? Or is this intermittent dropping in and out business just normal and related to regular index movements?
Amazon-owned Alexa shows MSN-crawled SERPs WITH code-forbidden Caches
NOARCHIVE and robots.txt instructions meaningless. Three strikes?
[webmasterworld.com...]
Over time, G* forgot about the month-old CACHED copies and started using copies from a year or more ago and called them SUPPLEMENTAL. Those obsolete pages seem to be (seem to be) going away a few at a time. Google now thinks it has under 9,000 of my pages. That's still too many but it keeps getting better.
My guess (just a guess) is that the 19,000 counted multiple archived copies of my pages from different dates. Now each crawl of a page with a <robots follow noindex noarchive> tag seems to kill off a few offending obsolete copies.
[NOTE: I am forcing one specific server to remove that as a variable.]