Forum Moderators: Robert Charlton & goodroi
My experience suggests that there is a formula, related to PR, for what is in and out of the main index, and that there is an upper limit of 1,000 or so. It seems that you can only make small changes to this ratio without an extreme amount of work. Has anyone else found this?
site:example.com
1,000 results (the limit for any search)
only #1000 is supplemental
total results about 300,000
site:example.com/directory-a
1,000 results
only #1000 is supplemental
total results about 19,000
site:example.com/directory-b
1,000 results
only #1000 is supplemental
total results about 13,000
See what I mean? The first set of results seems to say they've only got 999 results in the main index. But the second and third searches alone seem to indicate 1,998 main-index results just from those two directories. If I do site: searches for all their directories and add up the number of non-supplementals, the total becomes very large.
As for 2), think of total PageRank X (the sum of all inbound PageRank to your domain) split between Y pages. Roughly speaking, a bigger page count means lower average PageRank per page (depending on your site structure). We know that a page with PageRank below a minimum threshold goes supplemental. With an excessively high page count, the average falls too low, and you'll end up with many pages in the supplemental index. By reducing the number of pages, you slightly increase the average PageRank per URL. That can result in several supplemental pages popping back into the main index.
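The arithmetic above can be sketched in a few lines. Everything here is illustrative: the Zipf-like split of PageRank across pages and the supplemental threshold are made-up assumptions, since Google publishes neither.

```python
# Toy model of "total PageRank split across N pages". The Zipf-like
# share and the threshold are assumptions for illustration only.

def supplemental_split(total_pr, n_pages, threshold):
    # Give page i a Zipf-like share of the total inbound PageRank:
    # a few strong pages, a long weak tail (real sites vary widely).
    weights = [1.0 / (i + 1) for i in range(n_pages)]
    scale = total_pr / sum(weights)
    in_main = sum(1 for w in weights if w * scale >= threshold)
    return in_main, n_pages - in_main

# Same total PageRank, growing page count: the supplemental share rises.
for n in (100, 300, 600):
    in_main, supp = supplemental_split(total_pr=1.0, n_pages=n, threshold=0.002)
    print(n, in_main, supp)
```

With a fixed total and a fixed threshold, adding pages can only push more of the tail under the line, which is the "reduce page count to pull pages back into the main index" effect described above.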
3) A minor point, but if a TBPR 5 page has 100 links and 90% of those are outbound, you are giving 90% of that page's juice to other sites instead of to your internal pages. Add more internal links to that page, lower your outbound percentage, and you have a little more juice to play with.
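The split in point 3 is simple division. A sketch, using the classic 0.85 damping factor from the original PageRank formulation and a made-up PageRank value for the page (toolbar numbers aren't real internal PageRank):

```python
# A page's passable PageRank is split evenly across its links, so
# outbound links take a share that could have gone to internal pages.
# page_pr here is a made-up illustrative value, not a real TBPR.

def internal_juice(page_pr, total_links, outbound_links, damping=0.85):
    passable = page_pr * damping          # PageRank the page can pass on
    per_link = passable / total_links     # even split across all links
    internal = total_links - outbound_links
    return per_link * internal            # share kept on your own site

# 100 links, 90 outbound: only 10% of the passable juice stays internal.
print(internal_juice(page_pr=5.0, total_links=100, outbound_links=90))
# Swap 40 of those for internal links and you keep 50%.
print(internal_juice(page_pr=5.0, total_links=100, outbound_links=50))
```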
Both are minor tweaks compared to gaining trusted (non-paid, non-reciprocal) IBLs. No matter what you do, I don't see a domain with a TBPR 2 root getting 1,000 pages in the main index.
Shouldn't pages be evaluated on the uniqueness and useful content they provide rather than how many links they have?
What about those who advocate content is king?
This is not my experience. I have a PR 4 site with 600 pages. Most used to be in the main index; about two months ago it started to go supplemental and stopped at 10% of pages in the main index and 90% out. There is no consistent difference between the ones in and out in terms of PR, links, etc. I have good IBLs and have been adding links steadily. I have dealt with the canonical issues and there is no duplication across the pages, each of which has 100+ unique words. The only issue remaining is deep linking to the individual pages with IBLs, which is hard to do.

Looking at similar sites with similar PR and 100+ pages, they all seem to have similar ratios of 10% in and 90% out. It is my impression that these ratios are fixed and that it's very hard to shift a site out of its PR-related ratio. I agree that sites with a PR of 2 will mostly be supplemental, but it appears that even PR 4 and 5 sites will have a large percentage of pages supplemental in spite of all the listed causes being addressed. As I mentioned before, I can move some pages into the main index, but shortly thereafter other pages will go supplemental.

Is this Google's answer to the billions of websites: only index 50%, 30% or less of the pages for each domain? Is the era of many pages on a site over, despite their value and benefit to users? Is the era of specific information, provided on many specific pages that pop up in specific searches, now over, due to the spammers?
You cannot see the PageRank of individual URLs. TBPR 4 is on the low side anyway, so it's not surprising to see fluctuations.
There is no correlation between TBPR and % supplemental. If you owned a TBPR 5 with 10 pages, you'll always have 100% of your site in the main index.
As for Google basing indexing on PageRank, it's a flawed concept but I don't think they have a better alternative, since Googlebot doesn't understand what's written on a page.
[edited by: Halfdeck at 5:07 pm (utc) on May 29, 2007]
BUT
"No matter what you do, I don't see a domain with a TBPR 2 root getting 1,000 pages in the main index."
AND
"If you owned a TBPR 5 with 10 pages, you'll always have 100% of your site in the main index."
I don't want to labour the point, but I still think that there is a correlation and that the % supplemental is driven by a formula with terms for PageRank and number of pages. You can change things a little, but in general terms, if you have more than 10 pages or so, the % supplemental will be related to PR. If you have a PR 4 site with 100 pages, say 80% may be supplemental; if it's PR 5, say 50% may be supplemental; if it's PR 2, then 99% may be supplemental. Perhaps it's related to diluting the PR between pages, but it's hard to shift out of the range driven by the formula.
In my case I have a travel site with 600 pages, currently PR 4 - one page for each city. My strategy was to have my keywords linked with the town name so that searches would find the individual pages in Google results. Obviously, having only 20% of the pages indexed makes a mess of this strategy. Most of my competitors are in a similar situation: only small to moderate percentages of their sites are indexed, and the limit is 1,000.

If I want 95% in the main index, what should I do? Work on deep linking to each of the pages? Try to get the PR lifted to 5 or 6? Register 100 URLs and keep the number of pages to 10? Develop my own travel search engine? Is it possible to have 95% of 100 pages indexed for a PR 4 or PR 5 site? Can you really control exactly what's in or out?

The supplemental index has changed things - perhaps Google is heading towards only indexing the index page of each domain. The times they are a-changin'! It's so hard working in the dark without knowing what will work and when.
PR   Pages   Main   Main%
3      132     15     11
3      286    127     44
3       39     20     51
3      111     47     42
3      132     16     12
3       56     29     52     (PR 3: mean 34%, max 52%)
----------------------------------------------------------------
4      625    165     26
4       87     30     34
4      138    130     94
4      189     50     26
4      853    680     80
4      230     99     43
4      287    200     70
4      202    130     64
4      447    285     64     (PR 4: mean 59% - 50% excluding the 94% outlier - max 94%)
----------------------------------------------------------------
5      409    350     86
5      574    230     40
5      711    370     52
5      523    445     85     (PR 5: mean 66%, max 86%)
----------------------------------------------------------------
Very large sites - main index capped at 1,000?
6    28400   1000
4     3350    542
4     3070   1000
4     4560   1000
Say your TBPR 4 remains constant but you add more and more pages. As you do so, the % supplemental will tend to increase. As you remove pages, the % supplemental will decrease (it's like sharing a pizza with fewer people - each person gets a bigger slice). Similarly, if you pull links to your site and lower the home page to TBPR 1, % supplemental will increase. If you increase TBPR to 7, % supplemental will decrease (you're sharing your pizza with the same number of people, but you ordered three large pizzas instead of one small one, so everyone gets a bigger share).
There is no arbitrary formula that says a site with PageRank X can have no more than Y% of its pages in the main index. Suppose for a second that a TBPR 2 site were allowed to have 2% of its pages in the main index. A spammer could then force Google to index 40,000 pages by creating a 2,000,000-page site.
However, through deep inbound links, it is possible for an internal page to have a higher PR than the domain root. That simple fact is a good reason not to fear publishing new pages that are good and could attract links on their own merit. In other words, there is truth in Halfdeck's analysis -- if you only have links to your home page.
The way PageRank flows through a site is unique for every site, and how PageRank flows into a site is also unique for every site (a blog might have multiple entry points, as people link to different posts, while a commercial site with thin product pages might have only one entry point - the home page). There is still a tendency for PageRank to gravitate upwards (e.g. PageRank will gravitate toward blog category/archive pages, as they're often linked to from every page, though again, it depends on your blog setup).
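The "PageRank gravitates to pages linked from everywhere" point can be seen with a minimal power iteration over a toy site graph. The four-page graph and the 0.85 damping factor are assumptions for illustration only:

```python
# Minimal PageRank power iteration over a toy 4-page site. A page
# linked from every other page (like a blog archive page) accumulates
# the most PageRank. Graph and damping factor are illustrative.

def pagerank(links, damping=0.85, iters=50):
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            share = damping * pr[p] / len(outs)  # split evenly over outlinks
            for q in outs:
                new[q] += share
        pr = new
    return pr

# home, post1 and post2 all link to "archive"; archive links back home.
links = {
    "home": ["post1", "post2", "archive"],
    "post1": ["archive"],
    "post2": ["archive"],
    "archive": ["home"],
}
pr = pagerank(links)
print(sorted(pr, key=pr.get, reverse=True))  # "archive" comes out on top
```

Changing which pages link to the archive page (or removing it from the site-wide navigation) shifts the ranking, which is the "depends on your blog setup" caveat above.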
Halfdeck says
“If your site is largely supplemental, it means 1) not enough quality inbound links to your site 2) you have too many pages 3) you link out too much 4) Google may think your IBLs are artificial 5) Canonical issues are causing PageRank to split.”
Sure, you can translate these 'site issues' into an impact on the TBPR of each page, but there are still references to the 'duplicate filter', low trust for the site, internal link structure, and other site-wide penalties or issues.
The frustrating thing is that if I look at the pages that are in the main index and compare them with ones that have gone supplemental, there is no apparent difference between them: they are in a similar position in the hierarchy, they have unique content, they have the same number of unique words, they have been cached recently, and so on. Perhaps someone has linked to the individual pages that are in the main index - but I don't think so.

The decision of what's in and out appears to be arbitrary: if 50 pages are considered worthy of inclusion in the main index, why not include another 300 that should have essentially the same TBPR in terms of link structure? In my case it's a travel site, and each page for a town has a town description, holiday activities and lists of properties. Each page has unique meta information. I can't see what the difference is, and so I've nothing to work on. This was the reason for my impression that there was an arbitrary allocation - why include some but not others, when their features are identical? There's no common feature for the 'in' group which would identify why they are 'in'.

Perhaps it's a penalty thing - but I've addressed all the issues I am aware of that may be causing this - no links to link farms, etc. Perhaps it's a timing issue - maybe I should wait a few months and see what happens. Maybe the delays in response from Google, combined with my fiddling with things, are making it impossible to work out what's happening. Or else I should focus on getting inbound links, including deep links, to lift the PR of the pages above the threshold for inclusion in the main index.

It reminds me of trying to catch fish by 'thinking like a fish' - what bait, where, when. The Google system is a similar mystery: 'thinking like the Googlebot' is another of life's frustrations! Particularly when the Google brain and system are forever changing - at least the fish brains and thoughts are constant!
We have an ecommerce site with about 40k pages. We seem to have 10% of pages in the main index and 90% in supplemental, with no means of controlling what is in and what is out. Pages with loads of content can go supplemental as well as those with little.
We only have a home page PR of 4 and realise that we need to build relevant, quality inbound links. This is by no means easy, since we are asking ourselves why anyone would want to link to us. We are sorting this with fresh content, but realise it will take a long time to get the links. We have also been through and sorted any canonical issues that were present, although we didn't have much of a problem here.
My thoughts are to try to stabilize things by reducing the number of pages we have, and working out how to spread the PR evenly across the pages we want in the index.
My question is, has anyone had any success in altering their internal linking structure to improve their spread of PR?
I would feel much happier if we could keep the good pages in, whilst we continue to build the inbound links.
see
but someone else may have experience with this
Isn't Google's policy of putting pages with low PageRank in the supplemental index flawed? Doesn't it force a webmaster to go after links rather than create pages that are of value, with good content?
Shouldn't pages be evaluated on the uniqueness and useful content they provide rather than how many links they have? What about those who advocate content is king?
Not exactly. Assuming all the PageRanks of web pages on the Internet add up to 1, the average PageRank is minuscule, something like 0.000000100204023934. Two pages (12+ months old) might both display 0 in the toolbar, but the internal PageRanks for them may be dramatically different.
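A common assumption (never confirmed by Google) is that the toolbar value is roughly a logarithmic bucketing of internal PageRank. Under that assumption, pages several times apart internally can still show the same toolbar number. The base 8 and the floor value in this sketch are made up:

```python
import math

# Hypothetical log-scale mapping from tiny internal PageRank values
# to 0-10 toolbar buckets. Base and floor are illustrative guesses.

def toolbar_pr(internal_pr, base=8, floor=1e-9):
    """Map an internal PageRank onto a 0-10 toolbar bucket."""
    bucket = math.log(max(internal_pr, floor) / floor, base)
    return min(10, max(0, int(bucket)))

# Roughly 7x apart internally, yet both display toolbar 0:
print(toolbar_pr(1.1e-9), toolbar_pr(7.9e-9))
```

This is why two toolbar-0 pages can sit on opposite sides of an internal threshold: the toolbar simply doesn't have the resolution to show the difference.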
In TBPR terms it might be 0.5, say - if it's 0.4 it goes to hell; if it's 0.6 it goes to heaven.
Given a lot of issues and exceptions (including all the delays with the toolbar, spidering, etc.), what you are saying is that no page with a TBPR above 2 or 3 should be in hell, that the average TBPR in hell should be well below that in heaven, and that for a given site there should be a clear break between PR values at the boundary.
It may be time to do some research before I go fishing!
I am in the same position as you. My aim is to keep at least the first page in each category in Google's main index. But I too have been wrestling with the linking structure.
I also notice, having verified the site as mine, that Google's count of my internal links is incorrect and very out of date - so don't expect any quick results!
If you got 60,000 pages, and you only got "this much" PageRank, and you divide it [...he mumbles], some of them are going to be in the supplemental index. Given "this many people" who link to you, we're willing to include "this many" pages in the main index.
Basically what I posted before.