Forum Moderators: open
They don't show as backlinks, and content doesn't show up in serps. There may be exceptions.
Well, this was over a year ago, but Google followed every link in my 280k site map and every page ended up with PageRank. Some of the pages had no other links (that I knew of, anyway) at the time. It makes sense that Google would do this, because they want to be able to find every page on the web. They are addicted to links.
Backlinks that show in the results are not necessarily directly related to what is used in the PR calculations. Showing up there may require that the link be in the cached part of the page.
Then again, I may be remembering it wrong, or missed some other obvious explanation. Either way, it isn't worth it to me to worry about. My monster site map is gone, and I have no desire to set up a test.
I was never concerned about having a page show up for content that is after 100k, because a page should have some sort of content in the first few k letting you know what the page is really about. There might be some exceptions, but is it really worth the processing power to find them?
I never made any claims about whether that part was indexed, I just said that it followed the links and assigned PageRank. They are different things.
Have you done experiments that show that those links are not followed when earlier links are followed? I would be willing to accept those results, but basing that statement on backlinks showing up, or not being in the index is a non sequitur.
Like I said, I do not have recent proof. If you have done this specific experiment, then say so and I will accept it. Of course I would also accept it if GoogleGuy is more specific, but that would take some of the fun out of it;)
This does become slightly important on some directory or link pages. Sites at the beginning of the alphabet get PageRank passed and credit for the link, while those at the end of the alphabet get zip.
That is interesting and something I had not considered before. Although it is not that common, it is an unfair inconsistency on Google's part.
What part of "indexing" is unclear? The text isn't indexed. The words cannot be found via search. The backlinks don't show. Those page sections don't exist as far as the Google index is concerned. Of course there could be some sooper secret database of info that Google chooses only to use in determining PR, but that is pretty darn silly to suspect.
Building a webgraph is totally unrelated to building an index that can show search results. To build a searchable index, you need to keep a copy of the pages for longer quoted searches and to show the snippets.
On the other hand, building a list of outbound links from a page is only a one-time operation. In fact there is some anecdotal evidence that this processing is done immediately at the time of crawling, even before any index building is begun.
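To make that "one-time operation" concrete, here is a minimal sketch of pulling the outbound links out of a page as it streams in. This is just standard incremental HTML parsing in Python; the chunk size and sample page are made up, and it is not a claim about how Google's crawlers are actually written:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags as the document streams in."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

page = '<html><body><a href="/one">1</a><a href="/two">2</a></body></html>'
extractor = LinkExtractor()
for i in range(0, len(page), 16):    # feed the page in small chunks,
    extractor.feed(page[i:i + 16])   # the way a crawler receives it off the wire
print(extractor.links)               # ['/one', '/two']
```

The point is simply that the full link list is available the moment the download finishes, with no dependence on whatever the indexers do later.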
Let's go back to the old days of the deep crawl and dance. Over the course of a week, Google went out and crawled as many pages as they could.
These pages were pages that were pointed to by pages that they had already found in the current crawl. That means that as a page is crawled, the most important information to immediately extract from it is the links.
The links are added to the list of additional pages to crawl, and then the file is passed off to a different server to build the index, which would take several weeks for all those files that were crawled in that one week.
It is also suggested that computing PR can take a significant amount of processing time. Since the links have already been extracted from the document, there is no reason to have the indexing servers supply the list of links; the crawlers can just pass off the links and their corresponding index numbers to the PR-calculating systems.
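For anyone who has not seen it, the calculation those PR systems would be running is roughly the standard iterative PageRank formula. The graph, damping factor, and iteration count below are purely illustrative, and the handling of pages with no outlinks is simplified; this is a sketch of the math, not of Google's implementation:

```python
def pagerank(outlinks, damping=0.85, iterations=50):
    """Iterative PageRank over an adjacency list {page: [pages it links to]}.
    Simplification: rank flowing into dangling pages (no outlinks) is dropped."""
    pages = list(outlinks)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p, targets in outlinks.items():
            if targets:
                share = damping * rank[p] / len(targets)   # each link gets an equal cut
                for t in targets:
                    if t in new_rank:
                        new_rank[t] += share
        rank = new_rank
    return rank

graph = {"sitemap": ["a", "b", "c"], "a": ["sitemap"], "b": [], "c": ["a"]}
print(pagerank(graph))
```

Notice that the only input is the webgraph itself: nothing in the calculation needs the page text, which is why it can run off the crawlers' link lists rather than the index.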
It is the indexing servers where space is an issue. The crawlers can easily process the file stream as it is coming in. Searching for <a in a stream is pretty damn trivial.
Then once they get the whole file, they send the first 100k to the indexers, send the list of URLs to the PR calculator, and enter any new URLs into the list of pages to be crawled.
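Putting those pieces together, the split I am describing would look something like this. Every name here (the exact 100k limit, the state dictionary, the function itself) is my own illustration of the idea, not anything confirmed about Google's pipeline:

```python
INDEX_LIMIT = 100 * 1024   # assumed ~100k cap on what the indexer keeps

def handle_fetched_page(url, body, links, state):
    """Illustrative handling of a freshly fetched page:
    - the indexer only ever sees the first 100k of text,
    - the webgraph/PR side and the crawl queue see every extracted link."""
    state["index"][url] = body[:INDEX_LIMIT]   # truncated copy for search/snippets
    state["webgraph"][url] = links             # full outlink list for PR
    for link in links:
        if link not in state["seen"]:
            state["seen"].add(link)
            state["queue"].append(link)        # schedule for a later crawl

state = {"index": {}, "webgraph": {}, "seen": set(), "queue": []}
handle_fetched_page("http://example.com/sitemap",
                    "x" * 200_000,                          # a 200k page
                    ["/page-%d" % i for i in range(1000)],  # 1000 outbound links
                    state)
print(len(state["index"]["http://example.com/sitemap"]))    # 102400 (truncated)
print(len(state["queue"]))                                  # 1000 (nothing dropped)
```

In that model the 100k limit only constrains what can be searched and cached, not what gets crawled or counted for PR.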
So, given Google's goal of "indexing the web", and the fact that they download the entire file, not just 100k, and the fact that they obviously extract the URLs at a much earlier point in the process than they build the index, it seems very obvious to me that they should pay attention to ALL the links for both crawling and for calculating PR.
I have no proof that they do it this way, but from a programmer's standpoint, it is the obvious way to do it. Before suggesting that I do not understand, please take some time to consider that I may actually have a very good understanding of the different stages, and consider that there is a good chance that the system is actually a little more complex than everything in the process being "indexing".
For Google, every link should count, but for caching and building the search index, there is little point in messing around with the text beyond 100k. It is equivalent to the reason that they will only show 1000 results. If the keyword is buried 50 screens down on a page, no one will ever dig that far looking for it.
So once again, if you have any evidence that google does not follow links from the end of a large file, or consider those links in PR calculations, please share. I have taken the time to explain why I think they do, please show me the same courtesy, instead of intimating that I don't understand what "indexing" means.
You can postulate that they calculate all this data and then not bother to show it like they show parallel data (and offer no explanation for why they *choose* not to show this information). It's not something someone outside the plex can prove, but evidence is around. It's also off-topic for this thread and fairly trivial, unless your link is at the bottom of a key, overly large page and you are wondering why you aren't getting a PR boost.
There are actually a couple of problems with your suggestion of checking DMOZ. First, I'm having a lot of difficulty finding any of their pages that are over 50k, much less 100k.
Then there is the problem that for a page to get to 100k it has to be built up out of a huge number of links, at least several hundred. Over-full categories like that tend to be low PR, and all the pages in DMOZ have other incoming links, so there is just a small amount of PR being distributed to each of them from that page.
Then there is the mirror in the Google Directory, where the listings are ordered by PageRank, so they will end up in a different order there, which would mean different links getting the PageRank if it were limited to the first 100k.
DMOZ is simply a bad example.
Let me reiterate the situation I had before.
- I had a 280k sitemap pointing to around 1000 files.
- I had pages that were only pointed to by the sitemap and no other files in the index.
- The sitemap was PR4.
- After crawling the sitemap, Google crawled ALL the pages linked from the sitemap, and they all got PR in the next update.
Like you said, no one outside the plex can give a definitive answer, but my single, somewhat controlled experience seems to me less like "postulating" than what you are doing.
I mostly attribute this to the huge number of links on those pages. 1000 links will turn a PR5 page into an insignificant vote for each one.
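A rough back-of-envelope, using the usual "damping times rank divided by outlink count" rule of thumb (and ignoring that toolbar PR is on a logarithmic scale), shows how fast the per-link share shrinks. The numbers are illustrative only:

```python
# Back-of-envelope PR dilution: the rank a single link passes shrinks in
# proportion to how many links share the page.
damping = 0.85        # commonly cited damping factor
source_rank = 1.0     # raw rank of the linking page, arbitrary units

for outlinks in (10, 100, 1000):
    per_link = damping * source_rank / outlinks
    print(f"{outlinks:>5} links on the page -> each link passes {per_link:.4f}")
```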
One thing I did notice, and got a laugh out of, was how many web design promotion companies have PR0. But that is a totally different topic.