|Are URLs that fall out of the index still in Google's web graph?|
This idea just occurred to me - and although I can't think of a way to test it, maybe someone else can.
We've all seen URLs "fall out" of the index at times. But does Google still use those URLs as part of their web graph when they iterate link juice? Or is their web graph confined only to those URLs that can be retrieved as search results?
I'm sure that once a URL is crawled, the data is not just gone if the URL is no longer in the visible index - Google wouldn't throw away data, right ;) So I wonder if the publicly visible pages that can be shown in the SERPs at any one moment are only a subset of the full web graph that Google uses for PageRank calculations - and other calculations, too.
What do you think?
One of my sites has about 50% of its pages carrying the "noindex,follow" robots attribute. They are not visible in the SERPs, but my experience is that link juice still happily flows. That may not be fully representative of links which "fall out" of the index, but Google's juice graph definitely consists of more than just visible URLs.
Thanks for that, lammert - it certainly does support my idea.
I started thinking in this direction because of baberjaved's question in another thread: Removing Low Ranked, Un-useful Pages - worth it? [webmasterworld.com] and Matt Cutts' frequent advice to keep googlebot crawling and indexing as open as you possibly can.
Add that to the difficulty in getting a stable number for indexed pages [webmasterworld.com], and it seems to me that what goes on with Google is a lot more than what we see, or even what we've guessed.
URLs that are in Google's back end but not publicly visible could be Google's own version of dark matter [en.wikipedia.org].
He, he, he. This is what you're thinking about at 2:05 in the AM!?!
As part of that dark matter I still firmly believe that the supplemental index still exists and that a URL that "falls out" of the main index has simply "gone supplemental" as we used to say. And following on lammert's comment I'll even posit a NOINDEX index. And a 404 index. And then a...whatever. Yeah, plenty of stuff we can't see.
Bottom line though, I think Google would have to consider all of the URLs it knows about that are capable of passing link juice in its iterations, simply because if it did not, it would be creating innumerable dead ends where the juice couldn't flow back into the system. I ain't no math guy, but after umpty-ump iterations wouldn't the PR of the entire web then be effectively reduced to zero?
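The dead-end worry above is easy to demonstrate with a toy model. Below is a minimal, hypothetical PageRank sketch (the four-page graph and the choice to *not* redistribute dangling-node mass are my assumptions, not anything Google has confirmed): when a known page like "D" is treated as a dead end, the rank mass flowing into it leaks out of the system on every pass, so the total never sums back to 1.

```python
# Hypothetical 4-page web graph; page D is a dead end (no outlinks).
# This naive power iteration deliberately does NOT redistribute the
# mass that flows into dangling pages, to show where the juice goes.

DAMPING = 0.85  # standard damping factor from the original PageRank paper

def pagerank(links, n_iter=50):
    """Naive PageRank over {page: [outlinks]}; dangling mass is lost."""
    pages = sorted(links)
    pr = {p: 1.0 / len(pages) for p in pages}
    for _ in range(n_iter):
        new = {p: (1 - DAMPING) / len(pages) for p in pages}
        for p, outs in links.items():
            for q in outs:
                if q in new:  # ignore links pointing outside the known graph
                    new[q] += DAMPING * pr[p] / len(outs)
        pr = new
    return pr

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A", "D"], "D": []}

pr = pagerank(graph)
print(sum(pr.values()))  # noticeably less than 1.0: D drains juice each pass
```

Real implementations (including the published PageRank formulation) patch this leak by redistributing a dangling page's mass across the whole graph, which is consistent with the poster's intuition that Google must account for every URL it knows about rather than letting juice vanish.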
Just yesterday I reread Saul Hansell's 2007 NYT article Google Keeps Tweaking Its Search Engine [nytimes.com]. I'm still quite astounded that search professionals and academics were astounded at what Google was doing two years ago. Guess my hope isn't actually understanding G, but simply surviving in spite of not understanding it.
Even though such pages aren't in the SERPs index, they could still be part of the underlying set of data that is fed into Google's algorithm to calculate the SERPs. This underlying database may have even expanded after the transition to Caffeine.
Google compartmentalizes everything. It all runs together but is stored separately, so I imagine pagerank is no different. In fact I would be willing to bet that Google iterates several different versions of pagerank. Is your site's social media value on fire via Twitter/Facebook (all nofollow links), but besides that you have no inbound links? No problem - influence has its own pagerank.
If more people's browsers visit your site, and report to Google that they visited and stayed a while, your pagerank will likely increase even with zero inbound links (outside social sites using nofollow). Can't test it, but I know it's true. Popularity has value that may not equal the number/quality of inbound links; Google knows this, and it's weighed and measured accordingly.
One other thing to consider: Google SERPs fluctuate quite a bit on our end, but they may not on theirs. The various data centers our SERPs come from likely draw data from a main data center, but at different intervals. If that's the case, we can't see the current status of our sites' web graphs, at least not in real time.
|If more people's browsers visit your site, and report to Google that they visited and stayed a while, your pagerank will likely increase even with zero inbound links (outside social sites using nofollow). Can't test it, but I know it's true. Popularity has value that may not equal the number/quality of inbound links; Google knows this, and it's weighed and measured accordingly.|
I don't think we should start muddling terms here.
Yes, there are different components of the algorithm and they all have different ways of assessing a document's value, but we shouldn't refer to them all as "PageRank," especially since Ted's question refers specifically to Google's web graph and how links are valued or not.
I believe they do not fall out of the web graph.
We have pages that you can only get to from certain pages that have disappeared from Google, yet those deeper pages still retain PR and rankings without any other external links.
So those pages that have been "ghosted" as we call them around here do still pass some form of juice and are being stored someplace inside G.
>One of my sites has about 50% of the pages with the "noindex,follow" robots attribute.
Google has said time and time again that "noindex" means they will not show that page in any search result, but that it will still be used in PR and rank calculations. Matt Cutts said as much circa 2003. Of course links on pages - even pages NOT in the index - count. If Googlebot can download it, then it is going into the algo calcs. The only way you are keeping it out of the index is to not let Googlebot see it.
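The claim above - that a hidden page's links still count as long as it stays in the link graph - can be sketched with the same toy PageRank idea. The three-page graph here is entirely hypothetical ("home", a noindexed page "N", and a target "T"): keeping N in the graph lets rank flow through it to T, while pruning N leaves T with nothing but the baseline teleport share.

```python
# Hypothetical demo: a 'noindex,follow' page N is hidden from SERPs,
# but if it remains in the link graph its outbound link still feeds T.

DAMPING = 0.85

def rank(links, n_iter=50):
    """Tiny power-iteration PageRank over {page: [outlinks]}."""
    pages = sorted(links)
    pr = {p: 1.0 / len(pages) for p in pages}
    for _ in range(n_iter):
        new = {p: (1 - DAMPING) / len(pages) for p in pages}
        for p, outs in links.items():
            for q in outs:
                if q in new:  # links to pages outside the graph go nowhere
                    new[q] += DAMPING * pr[p] / len(outs)
        pr = new
    return pr

graph  = {"home": ["N"], "N": ["T"], "T": ["home"]}  # N kept in the graph
pruned = {"home": ["N"], "T": ["home"]}              # N dropped entirely

print(rank(graph)["T"], rank(pruned)["T"])  # T ranks far higher with N kept
```

If Google really did discard noindexed pages from the graph, pages like T (reachable only through them) would collapse to the teleport baseline - which is not what lammert and others in this thread report observing.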
A 404 link is an entirely different discussion.
I tested recently and found that anchor text from 'noindex,follow' pages is certainly credited within the same site.