| This 49 message thread spans 2 pages |
|The Google site: operator seems broken - is this intentional?|
In many discussions around this forum, members are noting that Google's site: operator currently returns very limited results. The total numbers are much lower than the special operator showed just a few months ago.
For example, in a recent thread, g1smd makes this observation:
|In WMT I see 950 URLs listed for one site. The site: search lists between 260 and 320, depending on the day. |
It certainly doesn't give much away like it used to. The website is a bit more than six months old.
While there were always some oddities in the site: results, the current situation is quite frustrating to many webmasters. Some who depend on the site: operator to understand how deeply Google is indexing their site are becoming concerned that they now have some kind of penalty, or at least a technical problem with their website or server.
Is this change an artifact of the new Caffeine infrastructure? That is, will the site: results eventually become more accurate again? Or is this a new and intentional situation, a limit on the site: operator, something like what Google has always done with the link: operator?
In past years it often happened that Google would make back-end changes to upgrade their core search results, and various special operator reports would be disrupted for a short period. So, I currently lean toward the idea of an unintended Caffeine side effect.
But these newly uninformative site: results have now been with us for many months and in the last few weeks the distortion seems to be intensifying. It is heartening that Webmaster Tools reports higher numbers in many cases - but does this mean Google won't be showing accurate numbers to anyone but those verified as responsible for the website?
The site: operator seems intended to be used in combination with a keyword - and sometimes that does seem to improve the results. For example, one site I've been working with for fourteen years currently shows:
site:example.com - 329 results
site:example.com keyword - 816 results
In the absence of any official word from Google, we can only guess what's happening. I'm hoping that it's a temporary disruption, but I wonder how others see this.
When I use site: on subdirectories, I am more likely to get a number that makes sense, say 100,000 indexed pages. Sometimes, a subdirectory site: will screw up, and come back with, say, 145, or so. I would say a subdirectory site: result is correct about 70% of the time.
However, site: on the top level domain is really screwed up, and rarely gives back a result that makes sense. Clearly site: should be close to the sum of the site: for all the subdirectories. As mentioned above, on a couple of my subdirectories, I get counts greater than 100,000, which is close to reality. On site: main domain, I have been getting counts of around 4000 lately.
Curiously, though, for one whole day last week, Thursday I think it was, site: for my main domain worked! It was around 400,000! This is the first time in about 6 months that I have had that count.
Alas, site: on the main domain is now back to its usual broken behavior, reporting around 5000 results.
So, for me, site: is a waste of time, and has been for months. I watch it because... well, I don't know, out of habit I guess. Or maybe I watch it as a simple harbinger of when Google may have finally sorted themselves out.
I did find it curious though that for 1 day last week for the first time in months and months, it worked OK.
P.S. WMT sitemap numbers are way off too!
There's more! This concerns another site, one that has been online since last Autumn.
The pages were originally using non-www URLs, and all internal links pointed to the non-www versions. At that time there was no canonical www->non-www 301 redirect in place.
At the end of the year, the internal links were all changed to www, and a canonical non-www->www 301 redirect was added at the same time. This was done because, although a large number of the non-www pages of the site were indexed:
- the non-www root was not indexed,
- the www root was indexed,
- all external incoming links pointed to the www version anyway.
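The canonical redirect rule described above can be sketched as a small pure function - a minimal illustration, not the site's actual server configuration, and `www.example.com` stands in for the real canonical host:

```python
from urllib.parse import urlsplit, urlunsplit

def canonical_redirect(url, canonical_host="www.example.com"):
    """Return the 301 target for a non-canonical URL, or None if the
    URL is already on the canonical host (so content is served directly)."""
    parts = urlsplit(url)
    if parts.netloc == canonical_host:
        return None  # already canonical - no redirect needed
    # Redirect the non-www (or any other host variant) to the www host,
    # preserving path and query string.
    return urlunsplit((parts.scheme, canonical_host, parts.path,
                       parts.query, parts.fragment))
```

In practice this logic would live in the web server's rewrite rules, but the function makes the mapping explicit: every non-www URL has exactly one www counterpart that returns content.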
Now, four months later, the situation shows even more odd results:
site:example.com - ~420 results (all URLs listed are www)
site:www.example.com - ~420 www results
site:example.com -inurl:www - ~850-900 non-www results (most without a cache link).
WMT reports show Google pulling both non-www and www URLs every day (but mostly www). Internal link reports show a large number of www->www links, with the number growing quite fast, and a much smaller number of non-www->non-www links, with the number shrinking slowly.
The number of www->www internal links now listed by Google is much higher than the highest number ever listed for non-www->non-www internal links.
The interesting point is that while the -inurl:www site: search returns ~900 results, the non-www WMT report lists fewer than 400 internal links concerning 350 non-www URLs. So, Google 'knows' that the non-www URLs don't link out to anywhere (because they are now redirects) and knows that the URLs redirect, yet it shows three times more non-www URLs in a site: search than www URLs, which do return content.
The other point is that the www site: search lists less than a quarter of the number of URLs listed in the "internal links" WMT report.
So, I'll guess that WMT results look only at the 'main' results and not the stuff in Supplemental (whatever that means these days), so numbers in the site: search can be higher than WMT in that case; and that not everything listed in WMT will always appear in a site: search, so the numbers can be lower than WMT in that case.
The disposition of the URLs in question is of key importance. For a very dynamic site with ever-changing content both factors might come into play. So, even if the numbers 'look right' there might still be 'issues'.
I noticed on a site I was appraising this week that the numbers were the reverse of what I'd expect.
I'd expect the initial site: query to return a number of URLs. If there were 'Supplemental' results, then at the bottom of the last page of results you'd have the option to see more URLs. That second list of URLs would be bigger.
However, on this site this week I was not presented with that option once (even though I know the site has loads of dupe content that would be 'Supplemental'). The initial number of URLs was big - and once I'd got as far as Google was prepared to take me through the results, it would fall - sometimes by a large percentage.
Try manually adding &filter=0 to the end of the google.com search URL, and see where it gets you.
Also, make a decision to use 10 results per page or 100 results per page and then stick with it. You'll get different results if you use a different number of results per page.
&num=100 is a quick way to force 100 results per page.
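The &filter=0 and &num tweaks above can be rolled into a small helper that builds the search URL - a minimal sketch assuming the classic google.com/search URL format, with `example.com` as a stand-in domain:

```python
from urllib.parse import urlencode

def site_query_url(domain, keyword=None, filter_off=True, per_page=100):
    """Build a Google search URL for a site: query.
    filter=0 disables the duplicate-results filter, and num sets the
    results-per-page count, as described in this thread."""
    query = f"site:{domain}"
    if keyword:
        query += f" {keyword}"
    params = {"q": query, "num": per_page}
    if filter_off:
        params["filter"] = 0
    return "http://www.google.com/search?" + urlencode(params)
```

Sticking with one generated URL format also honors the advice above about using a consistent results-per-page setting from query to query.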
My guess would be this is simply Google freeing up computer power for real user searches?
It's a possibility, mack, but I find it unlikely. The long range effect would be to query Google one URL at a time to see if each one is indexed or not. That uses up more of their resources, not less. It's especially so if you're researching a group of competitor sites.
adding &filter=0 does make a significant difference - making the number of pages spidered accurate.
Will have to keep using this!
How many more easter eggs like this one are inside of Google?
I use "site:domain.tld" to get page totals. Then I switch to 100 entries per page and add "&start=900" to the search URL string.
It also makes it easier to spot duplicate titles.
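Walking a site: result set page by page with &start offsets, as described above, can be sketched like this - a hypothetical helper, assuming Google's historical cap of roughly 1000 viewable results:

```python
def paginated_urls(base_url, total=1000, per_page=100):
    """Yield search URLs with increasing &start offsets to page through
    a site: result set (offsets 0, 100, ..., up to the ~1000-result cap)."""
    for start in range(0, total, per_page):
        yield f"{base_url}&start={start}"
```

So with num=100, the last reachable page is start=900 - exactly the offset mentioned above.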
|I use "site:domain.tld" to get page totals |
However, the accuracy of those numbers is the reason I started this thread. For many sites those numbers are falling very rapidly - like to 10% of what they were. But other indicators (especially server logs) seem to show that those missing URLs are still in the Google index and getting search traffic.
The more you drill down into directories, the more accurate it gets. If you add the number you get for a regular site: search to the number you get for a supplemental search, you get a much more accurate result. It isn't simply a matter of A + B = C though, unfortunately.
Interesting. Yes, I have been following this, and I feel it may be the implementation of some other factors that determine crawl depth and are affecting the site:www.yourweb.com command. These would be applied retrospectively, which would explain why established sites still see existing pages indexed.
These days I think Google is skim reading, i.e. the surface of a site rather than the depth. That results in quicker returns for searches, and it only goes deep when an exact URI applies. To tell the truth, I'm dang surprised that G, B or Y can actually return any results, considering several hundred million webmasters are churning out several trillion pages year on year... and I am equally surprised that the webmasters are miffed that their "golden content" has been replicated hundreds of thousands of times, even if they created it brand new. After all, how many ways can you describe the same thing?
At some point in the algo there will be a decision of "enough is enough" and no benefit (to our ads showing) in displaying more... and, more importantly, how many consumers actually even know about the site: operator and would use it? Geeky webmasters need not apply...
In literature there are seven stories (recently upgraded to ten) and every story involves those thematics. Websites fall into equally limited categories (obviously not as many as 10)... and the language used on those pages does as well.
Rather than attempting to list 100,000 pages in the SEs I'd rather have 10 solid landing pages really high in the SERPs... with compelling content sufficient to get the visitor to continue clicking ON MY SITE. I keep them by interest, not by search alone...
Pretty sure the SEs are of the same opinion as in: "These pages work, but more of the same is not better, so show this many and no more. Move along, nothing to see here..."
Rambling, I know, apologies extended.
|Rather than attempting to list 100,000 pages in the SEs I'd rather have 10 solid landing pages really high in the SERPs |
I would prefer 10 solid landing pages and 100,000 long-tail searches.
If this only affected lots of "look alike" websites, I'd buy what you're saying completely. But it's affecting all types of sites, including one of a kind major corporate sites that are not showing duplicated or scraped results in Google.
The site:example.com/directory/ search has been a long-time factor - but right now (the last couple of months) I'm seeing something new, some other factor that is depressing the site:example.com numbers in an inaccurate way.
I don't want to start automating queries one URL at a time to see how well a site is indexed, but right now don't see an "approved" way to get even a ballpark idea. The situation is making straightforward webmastering even harder than it has been.
Well, tedster, isn't the approved way Webmaster Tools?
I'm not sure what is causing the problem. All I can offer is Mrs. Crabtree, my 5th grade teacher who, in later years, became a librarian at the local library (remember those... that's where Books were kept). Smart lady, getting a bit long in the tooth, could find a few things when asked (and most of those we already knew) but was under pressure to find the facts. Just had too much info to deal with.
Not saying G is broken. Not saying that at all. I am suggesting that the returns Caffeine can return are enormous and the DISPLAY side has limitations. Just like Mrs. Crabtree.
Edit: Just realized this message count (1949) is my birth year. Ye gollies, I need to shut up! :)
Happy Birthday, tangor!
Ahh I like Mrs Crabtree.... Bet she returned only a few authority books and kept you away from the crap.
|kept you away from the crap |
I, kinda sorta not quite saying I told you so, rest my case.
The amount of "info" on the web is gigantic. It takes Titans to deal with it. At present we have three: B, G, Y. The Titans have their ways of dealing with humans.
What I am more interested in is WHICH of my pages made the grade in site: and what do I need to do to make the other pages list?
That, kiddies, is where the work really begins.
(and thanks for the b-day wish, I'm glad to be here after a bout with cancer...)
Tangor, I think that's the golden question: will links determine page quality, or originality? It will be one of the two.
Links are a part... but the squatty part. It is, and always has been, the content. Make that sing and rise to the top.
Ermm I have played by the rules and still got burned and lost everything. I will try again.....
I have been seeing odd results for both the site: operator and in WMC. The timing is pretty close to when they released the new SERP placement and CTR reports.
If it's not fixed (and it's been inaccurate for years), then it's Google's intention to leave it broken. Webmasters have been complaining about it for about the same amount of time, and it hasn't moved anyone at G to do anything about it.
It hasn't been broken this way for years - this is a new level of broken.
|brotherhood of LAN|
With this kind of Google behaviour and the merging of ATW/Yahoo... site/link type queries are becoming quite limited.
Yes, I'm getting broken results too. I published 5K pages a few months ago and Google was slowly indexing them. The site: operator was returning ever-increasing numbers, into the thousands.
Today it returns 179 pages. Note that the rest of the 4800 pages are in fact original, well-written text, so if it's a quality issue, it's only based on backlinks (the 5K pages don't have many deep links).
I also checked another site I set up years ago that had 3 to 5 thousand pages on it, the pages have been indexed for years. Same thing, the site operator returns 229 pages.
One site I manage used to return 9+ million pages back in '08. Currently it's returning between 500k and 600k. I'm glad to see more people having issues with this, as I thought I was being penalized.