Another interesting thing happening here: some results that have a description displayed against them also show a cached link, even though I see a meta noarchive tag used on those pages!
indyank
5:15 pm on Jan 23, 2011 (gmt 0)
I would like to add that I got this link through a tweet in my timeline... From the link, it is obvious that something is broken, as these are URLs with parameters at the end. My guess was that these would have been blocked via robots.txt, but Google still seems to show the URLs!
I wanted to test this and did a site: search for one other site. It showed 1,600 results, but when I clicked through to the 53rd page (10 results per page), Google stopped, stating the rest were duplicates! When I clicked "repeat the search with the omitted results included" and went back to page 53, I started seeing URLs blocked by robots.txt.
So the site: command does behave strangely. Has anyone experienced this?
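(For reference, blocking parameterized URLs of the kind described above would typically be done with a wildcard rule like the following. The pattern here is hypothetical, not taken from the site in question; Google honors the * and $ wildcards, but not all crawlers do.)

```
# Hypothetical example: block any URL containing a query string
User-agent: *
Disallow: /*?
```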
goodroi
12:31 pm on Jan 24, 2011 (gmt 0)
the site: command has many issues with accuracy. it has been like this for some time.
it looks like the nytimes is trying to do some special handling for syndicated content that is on their site like the ap news feed.
indyank
4:49 pm on Jan 24, 2011 (gmt 0)
But why does Google show the cached link for some pages when they have specified the noarchive meta tag on them?! Yes, Google does cache all pages anyway, but I thought noarchive would ensure the pages aren't available via the cached link in the SERPs!
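(For reference, the directive being discussed is the robots meta tag with a noarchive value, which tells Google not to show a "Cached" link for the page; the equivalent can also be sent as an HTTP header. These are standard forms, not copied from the site in question.)

```html
<!-- In the page <head>: suppress the "Cached" link in Google's results -->
<meta name="robots" content="noarchive">
```

The same directive can be delivered server-side as `X-Robots-Tag: noarchive`, which is useful for non-HTML files.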
Robert Charlton
1:44 am on Jan 25, 2011 (gmt 0)
But why does Google show the cached link for some pages when they have specified the noarchive meta tag on them?!
I did see the noarchive meta in the url-only page source, but I didn't see any url-only results with a cache link in the serps, so I'm not sure what you're seeing. There may have been some with cache links, but I didn't spot them.
If some of the pages were cached, though, note that caching is done independently of indexing. I believe there was a discussion here a year or two or three ago (these things get fuzzy) where there was a noarchive meta reported on the live page, but none yet on the cached page. In other words, an older version of the page was cached, and the caching hadn't caught up with the live version with the noarchive meta. The example was also a prominent newspaper, but not the NYT.
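(A quick way to check the point above — whether the live page source actually carries a noarchive directive — is a small script like the sketch below. This is my own illustration, not anything the posters ran; the function names are hypothetical.)

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if attrs.get("name", "").lower() == "robots":
            content = attrs.get("content", "") or ""
            # Directives are comma-separated, e.g. "noindex, noarchive"
            self.directives += [d.strip().lower() for d in content.split(",")]


def has_noarchive(html: str) -> bool:
    """Return True if the HTML declares a robots noarchive directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noarchive" in parser.directives
```

Feeding it the fetched page source of a live URL and of Google's cached copy would show the mismatch described: the live page reports noarchive while the older cached snapshot does not.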