This 49 message thread spans 2 pages.
The Google site: operator seems broken - is this intentional?
In many discussions around this forum, members are noting that Google's site: operator currently returns very limited results. The total numbers are much lower than the special operator showed just a few months ago.
For example, in a recent thread, g1smd makes this observation:
|In WMT I see 950 URLs listed for one site. The site: search lists between 260 to 320 depending on the day. |
It certainly doesn't give much away like it used to. The website is a bit more than six months old.
While there were always some oddities in the site: results, the current situation is quite frustrating to many webmasters. Some who depend on the site: operator to understand how deeply Google is indexing their site are becoming concerned that they now have some kind of penalty, or at least a technical problem with their website or server.
Is this change an artifact of the new Caffeine infrastructure? That is, will the site: results eventually become more accurate again? Or is this a new and intentional situation, a limit on the site: operator something like what Google has always done with the link: operator?
In past years it often happened that Google would make back-end changes to upgrade their core search results, and various special operator reports would be disrupted for a short period. So I currently lean toward the idea of an unintended Caffeine side effect.
But these newly uninformative site: results have now been with us for many months and in the last few weeks the distortion seems to be intensifying. It is heartening that Webmaster Tools reports higher numbers in many cases - but does this mean Google won't be showing accurate numbers to anyone but those verified as responsible for the website?
The site: operator seems intended to be used in combination with a keyword - and sometimes that does seem to improve the results. For example, one site I've been working with for fourteen years currently shows:
site:example.com - 329 results
site:example.com keyword - 816 results
In the absence of any official word from Google, we can only guess what's happening. I'm hoping that it's a temporary disruption, but I wonder how others see this.
I'm definitely seeing less accurate results in site: and it also appears that some pages that really should be indexed are not being indexed at all.
I'm also seeing recently changed pages appearing in SERPs for keywords that do not appear in the cached version, not sure if that's related.
My guess is this is a resource limitation and probably temporary.
There's hope! One site I manage has gone up on the site: operator count. After dropping from 1800 to 349, it rebounded to 1100. The total number of URLs on the server is about 2100.
I have been seeing this for close to two years now, though I am now seeing it on far more sites. I even asked about it a while back, about the possibility of a hidden index.
I suspect that G has brought back a supplemental index, like old school. I refer to it as the sub-supplemental. There, but not called on unless they really, really need to, like when you do a site:example.com keyword search.
I see it most with sites that have incomplete sitemap.xml. About 60% of the time, creating a complete sitemap pulls the URLs out of what I suspect is a sub-supplemental index.
In almost all cases, pages in this area are weakly linked, low content.
I guess, what I am trying to say is this is not new and I doubt it is going away.
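To act on the "complete sitemap" theory above, you first need to know which URLs Googlebot can reach through internal links but which are missing from your sitemap.xml. Here is a minimal sketch of that comparison; the sitemap text and crawled URL list are illustrative stand-ins, not from any real site, and in practice the crawled set would come from your own link crawler.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(sitemap_xml):
    """Extract every <loc> value from a sitemap.xml document."""
    root = ET.fromstring(sitemap_xml)
    return {loc.text.strip() for loc in root.iter(NS + "loc")}

def missing_from_sitemap(crawled, sitemap_xml):
    """URLs reachable by internal links but absent from the sitemap."""
    return set(crawled) - sitemap_urls(sitemap_xml)

# Illustrative data only - not from a real site.
sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/widgets</loc></url>
</urlset>"""

crawled = [
    "http://example.com/",
    "http://example.com/widgets",
    "http://example.com/widgets/product-1",  # deep, weakly linked page
]

# Prints the deep product page that the sitemap is missing.
print(missing_from_sitemap(crawled, sitemap))
```

The URLs this reports are exactly the weakly linked, low-content pages described above - the candidates for adding to a "complete" sitemap.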
My site has had a [6 month] internal duplicate content penalty since Dec 2009 [5 months already passed :)] and the current fluctuation seems normal to me.
However, I have tried to keep an eye on some big sites' numbers with the site: operator.
I found that site:techcrunch.com is losing approx 10K pages weekly!
Matt said in a Webmaster Help video that if you are a really high authority site, then Google will crawl every possible URL from your sitemaps. Now, if TechCrunch is losing pages like us, then what else can we do?
|But it's affecting all types of sites, including one of a kind major corporate sites that are not showing duplicated or scraped results in Google. |
I see it the most with ecom sites - original content or not. Like I said, weakly linked, deeper pages with shallow content.
If the site does not rely heavily on long tail searches, there is no blip to the traffic, but if it does, the site loses big time on traffic. That is why I suspect a sub-supplemental index. It can affect traffic. If it was just a "broken" operator, this would not happen.
It also explains why WMT's count is different from the site: count. WMT shows a count of all indexed pages; site: would show only "viable" pages.
I also wonder if some new kind of data partitioning isn't kicking in. Google engineers were talking about changes in that area last summer when Caffeine was first announced, and they've also got a patent about multiple database partitions.
|I see it most with sites that have incomplete sitemap.xml. About 60% of the time, creating a complete sitemap pulls the URLs out of what I suspect is a sub-supplemental index. |
Would you elaborate on what you call an "incomplete" sitemap.xml? I am very curious if your experience has to do with pagination of content pages and inclusion (or not) of pages other than #1 in the sitemap.xml
Also, I don't suppose you are talking about including 100% of the possible URLs that Gbot can come across on a site - tag pages, category pages, navigational pages, etc., everything leading up to the content but not the content itself - in the sitemap.xml?
As far as the site: operator goes, I find that site:example.com example.com brings the count close to what might actually have been indexed. I do have sites that don't do well in Google lately - mostly long tail issues - and those sites show the biggest difference between just site:example.com and site:example.com example.com, so you may be onto something with a (good old or brand new) supplemental index.
I seem to be seeing the opposite of what most people here are seeing. The number of results for my site with the site: command has actually gone up significantly since the recent SERP changes (which also seem to have benefited us fairly significantly). Now, we already block Google from indexing a lot of duplicate or very similar pages which have little value. Could these types of pages - crawled on one hand, but not actually included in the index - be accounting for the discrepancies when looking at page totals in different locations? I'm not sure whether my site is acting differently because we have 000's of pages, not tens or hundreds of 000's. Does this seem to be happening more with sites with large numbers of dynamic pages?
They may have no sitemap.xml, a sitemap.xml that only has "important" pages or a sitemap.xml that only has higher level pages and they are relying on the bots to find deeper pages from there e.g. an ecom site that has sub-cat pages listed but not product pages because links to all product pages are on the sitemap listed sub-cat pages.
From what I have seen, it seems as though a URL being on the sitemap.xml plays a part (but not the whole part) in it getting shuffled out. I would swear it looks like G is saying: if you don't have enough time/inclination to bother with a URL, neither do we. ;)
|They may have no sitemap.xml, a sitemap.xml that only has "important" pages or a sitemap.xml that only has higher level pages and they are relying on the bots to find deeper pages from there e.g. an ecom site that has sub-cat pages listed but not product pages because links to all product pages are on the sitemap listed sub-cat pages. |
OK, I see... In my case I have the exact opposite of such an "incomplete" sitemap - I have all the content pages but, looking at the sitemap.xml, you would not have guessed exactly how you arrive at those pages, because all the intermediate navigation steps (one or two, depending on how old the content is) are missing from the sitemap.xml. The thing that got me worried is that since, as you point out, sitemap.xml is not the only signal used for discovery/ranking, by excluding intermediate navigational URLs from the sitemap I am saying that these navigational URLs are not important - and yet they may be important from the standpoint of ranking the content pages they link to.
So the question then becomes: do you make your sitemap.xml smaller by leaving only essential content pages in it, hoping that a higher percentage of the sitemap URLs will get crawled, or do you cram as many URLs into the sitemap as possible, aiming to have maybe a lower number of content URLs crawled but the supporting navigational structure crawled as well?
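If you take the "cram everything in" option, one way to keep the two kinds of URL distinct is to list both but give navigational pages a lower priority hint. The sketch below does that; the URLs and priority values are illustrative assumptions, not a recommendation from Google, and the priority field is only a hint that crawlers may ignore.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(content_urls, nav_urls):
    """Return sitemap XML covering both content and navigation pages."""
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element("{%s}urlset" % SITEMAP_NS)
    # Content pages get a higher priority hint than navigational pages.
    entries = [(u, "0.8") for u in content_urls] + [(u, "0.3") for u in nav_urls]
    for url, priority in entries:
        entry = ET.SubElement(urlset, "{%s}url" % SITEMAP_NS)
        ET.SubElement(entry, "{%s}loc" % SITEMAP_NS).text = url
        ET.SubElement(entry, "{%s}priority" % SITEMAP_NS).text = priority
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical URLs for illustration only.
xml_out = build_sitemap(
    ["http://example.com/articles/widget-review"],   # content page
    ["http://example.com/articles/",                 # navigation pages
     "http://example.com/articles/page-2"],
)
print(xml_out)
```

Whether this actually pulls deep pages out of any sub-supplemental index is exactly the open question in this thread; the script just makes the "include the supporting navigational structure" choice concrete.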
Have there been any reports of crawl rates dropping that might be linked to this?
|I'm also seeing recently changed pages appearing in SERPs for keywords that do not appear in the cached version, not sure if that's related. |
This is serving results from the Supplemental index. Google keeps old copies of pages going back months, maybe as long as two years, and the URLs will continue to appear in SERPs for any words currently or previously on those pages, whether or not those words are still on the page now. This has been normal Google behaviour for many years. I noticed it after updating some contact details on a site. All the pages were re-cached and re-indexed and appeared in searches for the new details, however, even six months later searching for the old details still brought up the URLs for these pages in SERPs even though the old details were not on the real page nor in the visible Google cache. In fact, in the cache view the terms were highlighted as "the following terms only appear in links to this page" which isn't really the right message because the terms don't appear anywhere on the live web and certainly weren't in any anchor text pointing at those pages.
|Links are a part... but the squatty part. It is, and always has been, the content. |
I know of many SERPs where content is not the reason a site is ranking in the top five. One example is a site ranking for a two word phrase despite returning 404 error pages for the linked content. It is ranking even though there is no content. The reason it is ranking is LINKS. The site has many links pointing to content that no longer exists and now returns 404 error pages.
From my experience, as far as Google is concerned, links play a large role, with content playing a smaller role. This is one of the reasons why webmasters see a difference between Yahoo SERPs and Google SERPs. Links are counted differently. That is not just my opinion, it's a statement based on my experience and observation.
|Google keeps old copies of pages going back months, maybe as long as two years, and the URLs will continue to appear in SERPs for any words currently or previously on those pages |
g1smd, yeah, I have seen that sort of thing before, but I wasn't being clear. What I was talking about is the inverse of what you are describing. The page has been changed and new words have been added that were not there before. The cache still shows the old version of the page, but the page shows up for the new words. So it's not that G is showing results based on an older version that is no longer cached. It is showing results for a new version that has not been cached yet.
It is possible that this was going on before as well, and I just never noticed before. My experience has typically been that a new page starts showing up for new words when the new version is cached.
The cached version does say that the words only appear in links to the page.
I just did a major revision of my site, including significant changes to nav structure, and... I don't know, it just feels different from when I've done that before. Usually within a day or two the new versions of pages start to show up in the cache, and at the same time they start ranking for new keywords. At the same time, new pages also begin to show up in the cache, and start ranking for new keywords.
Now it just seems very different. Not a single new version has been cached, but if you search for new keywords you can see snippets of the new version of the page. None of the new URLs have been indexed at all. It's like G is sort of holding the entire site in some sort of limbo while it sorts out what I've done to change it. That didn't happen last time I made major changes to a site.
Here's something else I haven't seen before: I have a bunch of pages that have changed URL. Of course I have 301s in place. Some of the new URLs show up in a site: search. There is a "cached" link below the result. However, if you click the "cached" link, you get this message:
|Your search - cache:c9OPXTNBRoUJ:www.example.com/new-url site:example.com - did not match any documents. |
|The cache still shows the old version of the page, |
The publicly available "cached" page is not the same as the version(s) of a page that reside in Google's back-end cache. You're seeing the signs of this. Thinking about it for a while, we know things must be this way - or else a noarchive meta tag would also remove the URL from rankings altogether.
|That is not just my opinion, it's a statement based on my experience and observation. |
I tend to think content wins overall, but there's no doubt you can't find content without links... and you can't find links without links. Personally, I tend to think that links to links have limited appeal and links to content fare better.
What is broken (or appears to be broken) is the active part of site: which used to give more info than it does in recent days. As to why that is happening there's only speculation, no answers.
Has G bitten off more than it can chew? Has it invested way too much in personalizing ads, and in chasing the long tail to do the same, and is it running out of go juice to get it done? Merely a personal observation from my experience with Google over the last ten years.
However, I do freely admit that my attempt at humor with "squatty parts" might not have been appreciated in the same manner as intended. I will keep that humor in check henceforth! I certainly have no intention of misleading folks.
|The page has been changed and new words have been added that were not there before. The cache still shows the old version of the page, but the page shows up for the new words. |
Oh, that's normal too. There are separate databases that feed:
- the displayed title
- the displayed snippet
- the public cache page
- the "ranking"
and each of those updates on a different cycle. There's also the Supplemental databases that return results based on "historical" content - content that used to be on the page but no longer is, or for URLs that are now redirecting or are returning 404.
You'll see this in action in WMT where some reports update daily and others at longer intervals.
|There is a "cached" link below the result. However, if you click the "cached" link, you get this message |
This is an indication of the different cycles getting out of step. They have assigned a DOC ID and linked to it before they have actually made available what they have already spidered. That doesn't usually last more than a few days.
When there is a cache page on view, it's also interesting to see the cache date update daily, and then pause for a few days now and then, and occasionally drop back to an older date several days or weeks ago, before reverting to a more recent date and then continuing to increment.
OK, so this asynchronization of databases isn't new, and it makes sense that they would be out of step sometimes. It might be a clue as to the general weirdness we're seeing here though. Perhaps they're more out of step than usual.