| 6:48 pm on Apr 25, 2010 (gmt 0)|
When I use site: on subdirectories, I am more likely to get a number that makes sense - say, 100,000 indexed pages. Sometimes a subdirectory site: will screw up and come back with, say, 145 or so. I would say a subdirectory site: result is correct about 70% of the time.
However, site: on the top-level domain is really screwed up, and rarely gives back a result that makes sense. Clearly the top-level site: count should be close to the sum of the site: counts for all the subdirectories. As mentioned above, on a couple of my subdirectories I get counts greater than 100,000, which is close to reality. On site: for the main domain, I have been getting counts of around 4,000 lately.
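To put numbers on that expectation, here's a minimal Python sketch of the sanity check (all counts are invented for illustration):

```python
# Minimal sketch: the top-level site: count should be close to the
# sum of the per-subdirectory site: counts. All numbers are invented.
subdir_counts = {
    "site:example.com/dir-a/": 100_000,
    "site:example.com/dir-b/": 45_000,
    "site:example.com/dir-c/": 12_000,
}

expected = sum(subdir_counts.values())   # what site:example.com *should* show
reported = 4_000                         # what it actually shows lately

print(f"sum of subdirectory counts: {expected}")
print(f"top-level site: count:      {reported}")
print(f"top level shows only {reported / expected:.1%} of the expected total")
```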
Curiously, though, for one whole day last week, Thursday I think it was, site: for my main domain worked! It was around 400,000! This is the first time in about 6 months that I have had that count.
Alas, site: on the main domain is now back to its usual broken self, reporting around 5000 results.
So, for me, site: is a waste of time, and has been for months. I watch it because... well, I don't know, out of habit I guess. Or maybe I watch it simply as a harbinger of when Google may finally have sorted themselves out.
I did find it curious, though, that for one day last week, for the first time in months and months, it worked OK.
| 6:50 pm on Apr 25, 2010 (gmt 0)|
P.S. WMT sitemap numbers are way off too!
| 7:53 pm on Apr 25, 2010 (gmt 0)|
There's more! This concerns another site, one that has been online since last Autumn.
The pages were originally using non-www, and all were internally linking to the non-www. At that time there was no canonical www->non-www 301 redirect in place.
At the end of the year, the internal links were all changed to www, and a canonical non-www->www 301 redirect was added at the same time.
This was done because, although a large number of the non-www pages of the site were indexed, the non-www root was not indexed, the www root was indexed, and all external incoming links pointed to the www version anyway.
Now, four months later, the situation shows even more odd results:
site:example.com - ~420 results (all URLs listed are www)
site:www.example.com - ~420 www results
site:example.com -inurl:www - ~850-900 non-www results (most without a cache link)
WMT reports show Google pulling both non-www and www URLs every day (but mostly www). Internal link reports show a large number of www->www links, with the number growing quite fast, and a much smaller number of non-www->non-www links, with the number shrinking slowly. The number of www->www internal links now listed by Google is much higher than the highest number ever listed for non-www->non-www internal links.
The interesting point is that while the "-inurl:www" site: search returns ~900 results, the non-www WMT report lists fewer than 400 internal links concerning 350 non-www URLs. So, Google 'knows' that the non-www URLs don't link out to anywhere (because they are now redirects) and knows that the URLs redirect, yet has three times more non-www URLs showing in a site: search than www URLs, which do return content.
The other point is that the www site: search lists less than a quarter of the number of URLs listed in the "internal links" WMT report.
So, I'll guess that WMT results look only at the 'main' results and not the stuff in Supplemental (whatever that means these days), so numbers in the site: search can be higher than WMT in that case; and that not everything listed in WMT will always appear in a site: search, so the numbers can be lower than WMT in that case.
The disposition of the URLs in question is of key importance. For a very dynamic site with ever-changing content both factors might come into play. So, even if the numbers 'look right' there might still be 'issues'.
| 8:50 pm on Apr 25, 2010 (gmt 0)|
I noticed on a site I was appraising this week that the numbers were the reverse of what I'd expect.
I'd expect the initial site: query to return a number of URLs. If there were 'Supplemental' results, then at the bottom of the last page of results you'd have the option to see more URLs. That second list of URLs would be bigger.
However, on this site this week I was not presented with that option once (even though I know the site has loads of dupe content that would be 'Supplemental'). The initial number of URLs was big - and once I'd got as far through the results as Google was prepared to take me, it would fall, sometimes by a large percentage.
| 9:19 pm on Apr 25, 2010 (gmt 0)|
Try manually adding &filter=0 to the end of the google.com search URL, and see where it gets you.
Also, make a decision to use 10 results per page or 100 results per page, and then stick with it. You'll get different results if you use a different number of results per page. &num=100 is a quick way to force 100 results per page.
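If you'd rather not hand-edit the URL each time, here's a minimal Python sketch that builds it. The parameter names (q, num, filter) are the ones mentioned above; the function name and defaults are my own:

```python
from urllib.parse import urlencode

def site_query_url(domain, per_page=100, unfiltered=True):
    """Build a google.com search URL for a site: query.

    per_page   -- &num=100 forces 100 results per page (per the post above)
    unfiltered -- &filter=0 turns off the similar-results filtering
    """
    params = {"q": f"site:{domain}", "num": per_page}
    if unfiltered:
        params["filter"] = 0
    return "https://www.google.com/search?" + urlencode(params)

print(site_query_url("example.com"))
# https://www.google.com/search?q=site%3Aexample.com&num=100&filter=0
```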
| 9:27 pm on Apr 25, 2010 (gmt 0)|
My guess would be that this is simply Google freeing up computing power for real user searches?
| 9:40 pm on Apr 25, 2010 (gmt 0)|
It's a possibility, mack, but I find it unlikely. The long-range effect would be that we'd have to query Google one URL at a time to see whether each one is indexed or not. That uses up more of their resources, not less - especially if you're researching a group of competitor sites.
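To make concrete why that approach is heavier, here's a minimal sketch of what one-URL-at-a-time checking would look like. It only builds the query URLs (fetching Google result pages programmatically is against their terms of service), and the URL list is invented:

```python
from urllib.parse import urlencode

# An invented list of URLs you'd want to verify, one by one.
urls_to_check = [
    "www.example.com/",
    "www.example.com/widgets/",
    "www.example.com/widgets/blue.html",
]

for url in urls_to_check:
    # One full search request per URL, just to learn "indexed or not" --
    # far more load on Google than a single site: query for the domain.
    print("https://www.google.com/search?" + urlencode({"q": f"site:{url}"}))
```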
| 9:43 pm on Apr 25, 2010 (gmt 0)|
Adding &filter=0 does make a significant difference - it makes the reported number of indexed pages accurate.
Will have to keep using this!
How many more Easter eggs like this one are inside Google?
| 10:31 pm on Apr 25, 2010 (gmt 0)|
I use "site:domain.tld" to get page totals
Then switch to [100 entries per page]:
then add "&start=900" to the search url string.
And it's also easier to spot Duplicate Titles
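A minimal Python sketch of that workflow - the q/num/start parameters are the ones mentioned in this thread, the title-counting part assumes you've pasted in titles copied off the result pages, and all sample values are invented:

```python
from collections import Counter
from urllib.parse import urlencode

def page_url(domain, start, per_page=100):
    # &num sets results per page; &start jumps deep into the result set.
    params = {"q": f"site:{domain}", "num": per_page, "start": start}
    return "https://www.google.com/search?" + urlencode(params)

# &start=900 lands on the last page Google will display (it caps the
# viewable results at roughly 1,000).
print(page_url("domain.tld", start=900))

# Counting repeats in the titles copied off those result pages makes
# duplicate titles easy to spot. Sample titles are invented.
titles = [
    "Blue Widgets - domain.tld",
    "Blue Widgets - domain.tld",
    "Contact - domain.tld",
]
for title, n in Counter(titles).items():
    if n > 1:
        print(f"{n}x duplicated: {title}")
```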
| 10:38 pm on Apr 25, 2010 (gmt 0)|
|I use "site:domain.tld" to get page totals |
However, the accuracy of those numbers is the reason I started this thread. For many sites those numbers are falling very rapidly - like to 10% of what they were. But other indicators (especially server logs) seem to show that those missing URLs are still in the Google index and getting search traffic.
| 10:59 pm on Apr 25, 2010 (gmt 0)|
The more you drill down into directories, the more accurate it gets. If you add the number you get for a regular site: search to the number you get for a supplemental search, you get a much more accurate result. It isn't simply a matter of A + B = C, though, unfortunately.
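As a rough illustration of that A + B tally - assuming the &filter=0 variant mentioned earlier stands in for the 'supplemental' search, and with all counts invented:

```python
# Sketch of the A + B idea above. As the post says, A + B is only a
# rough guide, not an exact total.
regular = 420       # site:example.com/widgets/ as normally displayed
supplemental = 900  # the same query with &filter=0 appended

print(f"regular:        {regular}")
print(f"supplemental:   {supplemental}")
print(f"rough estimate: {regular + supplemental} (not exactly A + B = C)")
```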
| 11:35 pm on Apr 25, 2010 (gmt 0)|
Interesting - yes, I have been following this, and I feel it may be the implementation of some other factors that determine crawl depth and are affecting the site:www.yourweb.com command. These would have been brought in retrospectively, which would explain why established sites still see existing pages indexed.
| 11:35 pm on Apr 25, 2010 (gmt 0)|
These days I think Google is skim reading... i.e. the surface of a site rather than the depth. That makes for quicker returns on searches, and it only goes deep when an exact URI applies. To tell the truth, I'm dang surprised that G, B or Y can actually return any results, considering several hundred million webmasters are churning out several trillion pages year on year... and I am equally surprised that webmasters are miffed that their "golden content" has been replicated hundreds of thousands of times, even if they created it brand new. After all, how many ways can you describe the same thing?
At some point in the algo there will be a decision of "enough is enough" - no benefit to (our ads showing) displaying more... And, more importantly, how many consumers actually even know about the site: operator, let alone use it? Geeky webmasters need not apply...
In literature there are seven stories (recently upgraded to ten), and every story draws on those same themes. Websites fall into equally limited categories (obviously not as many as 10)... and the language used on those pages does as well.
Rather than attempting to list 100,000 pages in the SEs I'd rather have 10 solid landing pages really high in the SERPs... with compelling content sufficient to get the visitor to continue clicking ON MY SITE. I keep them by interest, not by search alone...
Pretty sure the SEs are of the same opinion as in: "These pages work, but more of the same is not better, so show this many and no more. Move along, nothing to see here..."
Rambling, I know, apologies extended.
| 11:41 pm on Apr 25, 2010 (gmt 0)|
|Rather than attempting to list 100,000 pages in the SEs I'd rather have 10 solid landing pages really high in the SERPs |
I would prefer 10 solid landing pages and 100,000 long-tail searches.
| 11:48 pm on Apr 25, 2010 (gmt 0)|
If this only affected lots of "look alike" websites, I'd buy what you're saying completely. But it's affecting all types of sites, including one of a kind major corporate sites that are not showing duplicated or scraped results in Google.
The site:example.com/directory/ search has been a factor for a long time - but right now (the last couple of months) I'm seeing something new, some other factor that is depressing the site:example.com numbers in an inaccurate way.
I don't want to start automating queries one URL at a time to see how well a site is indexed, but right now I don't see an "approved" way to get even a ballpark idea. The situation is making straightforward webmastering even harder than it has been.
| 11:52 pm on Apr 25, 2010 (gmt 0)|
Well, tedster, isn't the approved way in Webmaster Tools?
| 11:59 pm on Apr 25, 2010 (gmt 0)|
I'm not sure what is causing the problem. All I can offer is Mrs. Crabtree, my 5th grade teacher who, in later years, became a librarian at the local library (remember those... that's where Books were kept). Smart lady, getting a bit long in the tooth, could find a few things when asked (and most of those we already knew) but was under pressure to find the facts. Just had too much info to deal with.
Not saying G is broken. Not saying that at all. I am suggesting that the results Caffeine can return are enormous, and the DISPLAY side has limitations. Just like Mrs. Crabtree.
Edit: Just realized this message count (1949) is my birth year. Ye gollies, I need to shut up! :)
| 12:05 am on Apr 26, 2010 (gmt 0)|
Happy Birthday, tangor!
| 12:05 am on Apr 26, 2010 (gmt 0)|
Ahh I like Mrs Crabtree.... Bet she returned only a few authority books and kept you away from the crap.
| 12:16 am on Apr 26, 2010 (gmt 0)|
|kept you away from the crap |
I, kinda sorta not quite saying I told you so, rest my case.
The amount of "info" on the web is gigantic. It takes Titans to deal with it. At present we have three: B, G, Y. The Titans have their ways of dealing with humans.
What I am more interested in is WHICH of my pages made the grade in site:, and what I need to do to get the other pages listed.
That, kiddies, is where the work really begins.
(and thanks for the b-day wish, I'm glad to be here after a bout with cancer...)
| 12:27 am on Apr 26, 2010 (gmt 0)|
Tangor, I think that's the golden question. Will it be links or originality that determines page quality? It will be one of the two.
| 12:34 am on Apr 26, 2010 (gmt 0)|
Links are a part... but the squatty part. It is, and always has been, the content. Make that sing and rise to the top.
| 12:38 am on Apr 26, 2010 (gmt 0)|
Ermm I have played by the rules and still got burned and lost everything. I will try again.....
| 1:10 am on Apr 26, 2010 (gmt 0)|
I have been seeing odd results both for site: and in WMC. The timing is pretty close to when they released the new SERP placement and CTR reports.
| 5:09 am on Apr 26, 2010 (gmt 0)|
If it's not fixed (and it's been inaccurate for years), then it's Google's intention to leave it broken. Webmasters have been complaining about it for about the same amount of time, and it hasn't moved anyone at G to do anything about it.
| 6:01 am on Apr 26, 2010 (gmt 0)|
It hasn't been broken this way for years - this is a new level of broken.
|brotherhood of LAN|
| 10:53 am on Apr 26, 2010 (gmt 0)|
With this kind of Google behaviour and the merging of ATW/Yahoo... site/link type queries are becoming quite limited.
| 2:10 pm on Apr 26, 2010 (gmt 0)|
Yes, I'm seeing the breakage too. I published 5K pages a few months ago and Google was slowly indexing them. The site: operator was returning ever-increasing numbers, into the thousands.
Today it returns 179 pages. Note that the rest of the 4,800 pages are in fact original, well-written text, so if it's a quality issue, it's only based on backlinks (the 5K pages don't have many deep links).
I also checked another site I set up years ago that had 3 to 5 thousand pages on it; the pages have been indexed for years. Same thing - the site: operator returns 229 pages.
| 2:49 pm on Apr 26, 2010 (gmt 0)|
One site I manage used to return 9+ million pages back in '08. Currently it's returning between 500k and 600k. I'm glad to see more people having issues with this, as I thought I was being penalized.