Page Size Limit

What does it take?

mikeD

1:43 pm on Dec 24, 2003 (gmt 0)

How big and slow does a page have to be before Google won't crawl and index it?

Elijah

2:47 pm on Dec 24, 2003 (gmt 0)

As far as I know, Google will only index the first 101KB of a page.

I think Google will still index the first 101KB even if the file size is really large.

I keep my pages a lot smaller than that, though.

Elijah

mquarles

2:51 pm on Dec 24, 2003 (gmt 0)

Do they crawl an entire PDF?

MQ

Essex_boy

3:08 pm on Dec 24, 2003 (gmt 0)

I have come across crawled PDF files; it's rare, but it happens. As to whether they crawl the whole thing, maybe it's the 101K thing again. Ask GoogleGuy.

mikeD

4:36 pm on Dec 24, 2003 (gmt 0)

What would you say the limit is for page load time?

GoogleGuy

8:41 pm on Dec 24, 2003 (gmt 0)

Um, PDFs can be bigger than 101K; 1MB is probably a safe guess. Slow is fine, but if a page actually times out because it's so slow, that can be bad. If a page takes over 30-60 seconds on a good net connection (I'm somewhat making up these timeout numbers, so take them with a grain of salt), I'd think about better hosting. If a page takes that long to load for a bot, it might not be the best page for a user either, though most of our timeouts are pretty forgiving. A slow-loading page by itself shouldn't be enough to knock out a page.
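
To make the timeout idea concrete, here is a minimal sketch of how a crawler might enforce a fetch deadline. It is illustrative only, not how Googlebot actually works, and the 30-second figure is just the guess above:

    import socket
    import urllib.error
    import urllib.request

    FETCH_TIMEOUT = 30  # seconds; assumed from the "30-60 seconds" guess above

    def fetch(url):
        try:
            with urllib.request.urlopen(url, timeout=FETCH_TIMEOUT) as resp:
                return resp.read()
        except (socket.timeout, urllib.error.URLError):
            # A page that times out may simply be skipped on this crawl pass.
            return None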

mikeD

9:32 pm on Dec 24, 2003 (gmt 0)

Thanks for the help, GG. Sometimes my feed load times can be a little slow: on a T1 (1.544Mbps) connection, 3.48 seconds at best, and maybe 9.60 seconds on a really bad day. I just hope Google can still crawl it on a really bad day.
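
For what it's worth, raw bandwidth is unlikely to be the bottleneck here. A quick back-of-the-envelope calculation, assuming the full T1 line rate of 1.544Mbps is available (real-world throughput will be lower):

    # Raw transfer time for a 100KB page over a T1 line.
    page_bits = 100 * 1024 * 8       # 100KB page, in bits
    line_bps = 1.544 * 1000 * 1000   # T1 line rate, bits per second

    print(page_bits / line_bps)      # ~0.53 seconds

So most of a 3.48-9.60 second load is probably server-side generation time rather than transfer time, which is exactly the part better hosting would help with.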

BigDave

10:17 pm on Dec 24, 2003 (gmt 0)

Google will download the whole file, even beyond 101k. They will also find links and distribute PageRank even when the links come after the 101k mark.

They only keep the first 101k in their cache, and I do not know whether they index keywords beyond the 101k limit.
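
In code terms, the handling I'm describing would look something like this two-track sketch; the 101k figure and the regex-based link extraction are assumptions for illustration only:

    import re

    CACHE_CAP = 101 * 1024  # assumed cache/index cap discussed in this thread

    def process_page(html):
        cached_portion = html[:CACHE_CAP]  # only this part is cached/indexed
        # ...but links are harvested from the full document, even past the cap
        links = re.findall(r'<a\s[^>]*href=["\']?([^"\'\s>]+)', html, re.I)
        return cached_portion, links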

gerwin

6:26 am on Dec 25, 2003 (gmt 0)

It would be kind of strange if Google were not indexing the file past 100KB. Isn't Google always saying that you should make websites for users and not for search bots? Users may well like a 100KB+ page if it has a good story on it... :)

GoogleGuy

6:49 am on Dec 25, 2003 (gmt 0)

gerwin, we did some tests, and I think we found that instead of indexing the tail end of a really long doc (which is pretty rare anyway), it was better to use that space for other pages. mikeD, that doesn't sound like any kind of problem at all as far as load time goes.

Merry Christmas,
GoogleGuy

steveb

9:07 am on Dec 25, 2003 (gmt 0)

"They will also find links and distribute PageRank if the links are after 101k."

They don't show as backlinks, and content doesn't show up in serps. There may be exceptions.

SEOtop10

9:26 am on Dec 25, 2003 (gmt 0)

Merry Christmas to all of you at Google, GoogleGuy.

And of course to you, my fellow member.

Arun

BigDave

5:17 pm on Dec 25, 2003 (gmt 0)

"They don't show as backlinks, and content doesn't show up in serps. There may be exceptions."

Well, this was over a year ago, but Google followed every link in my 280k site map, and every page ended up with PageRank. Some of the pages had no other links (that I knew of, anyway) at the time. It makes sense that Google would do this, because they want to be able to find every page on the web. They are addicted to links.

Backlinks that show in the results are not necessarily directly related to what is used in the PR calculations. It may require that links be in the cached part of the page.

Then again, I may be remembering it wrong, or missed some other obvious explanation. Either way, it isn't worth it to me to worry about. My monster site map is gone, and I have no desire to set up a test.

I was never concerned about having a page show up for content that is after 100k, because a page should have some sort of content in the first few k letting you know what the page is really about. There might be some exceptions, but is it really worth the processing power to find them?

steveb

12:08 am on Dec 26, 2003 (gmt 0)

See GoogleGuy's comment above: "... we found that instead of indexing the tail end..."

This does become slightly important on some directory or link pages. Sites at the beginning of the alphabet get pagerank passed and credit for the link, while those at the end of the alphabet get zip.

BigDave

12:43 am on Dec 26, 2003 (gmt 0)

I did see that, but you might note that he did not say anything about following the links and passing PageRank.

I never made any claims about whether that part was indexed; I just said that Google followed the links and assigned PageRank. They are different things.

Have you done experiments showing that those links are not followed when earlier links are followed? I would be willing to accept those results, but basing that statement on backlinks showing up, or on content not being in the index, is a non sequitur.

Like I said, I do not have recent proof. If you have done this specific experiment, then say so and I will accept it. Of course I would also accept it if GoogleGuy were more specific, but that would take some of the fun out of it. ;)

MS_Excel

1:12 am on Dec 26, 2003 (gmt 0)



"This does become slightly important on some directory or link pages. Sites at the beginning of the alphabet get pagerank passed and credit for the link, while those at the end of the alphabet get zip."

That is interesting and something I had not considered before. Although it is not that common, it is an unfair inconsistency on Google's part.

steveb

1:46 am on Dec 26, 2003 (gmt 0)

"he did not say anything about following the links"

What part of "indexing" is unclear? The text isn't indexed. The words cannot be found via search. The backlinks don't show. Those page sections don't exist as far as the Google index is concerned. Of course there could be some sooper-secret database of info that Google chooses only to use in determining PR, but that is pretty darn silly to suspect.

BigDave

2:52 am on Dec 26, 2003 (gmt 0)

Not at all.

Building a webgraph is totally unrelated to building an index that can show search results. To build a searchable index, you need to keep a copy of the pages for longer quoted searches and for showing the snippets.

On the other hand, building a list of outbound links from a page is a one-time operation. In fact, there is some anecdotal evidence that this processing is done immediately at crawl time, even before any index building has begun.

Let's go back to the old days of the deep crawl and dance. Over the course of a week, Google went out and crawled as many pages as they could.

These were pages pointed to by pages that had already been found in the current crawl. That means that as a page is crawled, the most important information to extract from it immediately is the links.

The links are added to the list of additional pages to crawl, and then the file is passed off to a different server to build the index, which would take several weeks for all those files that were crawled in that one week.

It is also suggested that computing PR can take a significant amount of processing time. Since the links have already been extracted from the document, there is no reason to have the indexing servers supply the list of links; the crawlers can just pass the links and their corresponding index numbers off to the PR-calculating systems.

It is on the indexing servers that space is an issue. The crawlers can easily process the file stream as it comes in. Searching for <a in a stream is pretty damn trivial.

Then, once they have the whole file, they send the first 100k to the indexers, send the list of URLs to the PR calculator, and enter any new URLs into the list of pages to be crawled.

So, given Google's goal of "indexing the web", the fact that they download the entire file and not just 100k, and the fact that they obviously extract the URLs much earlier in the process than they build the index, it seems very obvious to me that they should pay attention to ALL the links, both for crawling and for calculating PR.

I have no proof that they do it this way, but from a programmer's standpoint it is the obvious way to do it. Before suggesting that I do not understand, please consider that I may actually have a very good understanding of the different stages, and that there is a good chance the system is a little more complex than everything in the process being "indexing".

For Google, every link should count, but for caching and building the search index there is little point in messing around with the text beyond 100k. It is equivalent to the reason they only show 1000 results: if the keyword is buried 50 screens down on a page, no one will ever dig that far looking for it.

So once again, if you have any evidence that Google does not follow links at the end of a large file, or does not consider those links in PR calculations, please share it. I have taken the time to explain why I think they do; please show me the same courtesy, instead of intimating that I don't understand what "indexing" means.
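
To sketch what I mean about processing the stream: an incremental parser collects links chunk by chunk as the response arrives, so the link list exists before (and independently of) any index building. This is a hypothetical illustration, not Google's actual crawler:

    from html.parser import HTMLParser

    class LinkCollector(HTMLParser):
        """Collects href targets as the document streams in."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def links_from_stream(chunks):
        # Feed the parser incrementally; the full file never needs to be
        # buffered before the crawl queue / PR inputs can be updated.
        parser = LinkCollector()
        for chunk in chunks:
            parser.feed(chunk)
        parser.close()
        return parser.links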

steveb

3:33 am on Dec 26, 2003 (gmt 0)

Instead of postulating about what they could do, since we don't know one way or the other for sure, examine what they actually do. It isn't hard to find a 101k+ page with good PR, like a dmoz page. Examine the cache and see where they stopped. The backlink shows up for links in the cached portion and does not show up for links beyond it, and there is a corresponding PR boost for the cached/top pages with few other links, but no PR boost for the non-cached/bottom pages with few other links.

You can postulate that they calculate all this data and then don't bother to show it the way they show parallel data (while offering no explanation for why they would *choose* not to show it). It's not something anyone outside the plex can prove, but the evidence is around. It's also off-topic for this thread and fairly trivial, unless you are a link at the bottom of a key, overly large page and are wondering why you aren't getting a PR boost.
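
Anyone who wants to run this kind of check can start by finding where each link sits relative to the 101k mark. A hypothetical helper, using the 101k figure from this thread and a deliberately crude link regex:

    import re

    LIMIT = 101 * 1024  # the 101k figure discussed in this thread

    LINK_RE = re.compile(rb'<a\s[^>]*href=["\']?([^"\'\s>]+)', re.I)

    def links_past_limit(raw_html_bytes):
        """Yield (byte_offset, href, past_limit) for each link on the page."""
        for m in LINK_RE.finditer(raw_html_bytes):
            href = m.group(1).decode("ascii", "replace")
            yield m.start(), href, m.start() >= LIMIT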

BigDave

4:03 am on Dec 26, 2003 (gmt 0)

I am not randomly postulating, I am basing it on personal experience with a site that I was 100% in control of, as I stated in message #13.

There are actually a couple of problems with your suggestion of checking DMOZ. First, I'm having a lot of difficulty finding any of their pages that are over 50k, much less 100k.

Then there is the problem that a page built up out of links would need at least several hundred of them to reach 100k; over-full categories like that tend to be low PR, and all the pages in DMOZ have other incoming links. Therefore, only a small amount of PR is distributed to each of them from that page.

Then there is the mirror in the Google Directory, where entries are ordered by PageRank, so they end up in a different order there, which would mean different ones getting the PageRank if it were limited to the first 100k.

DMOZ is simply a bad example.

Let me reiterate the situation I had before.

- I had a 280k sitemap pointing to around 1000 files.
- I had pages that were only pointed to by the sitemap and no other files in the index.
- The sitemap was PR4.
- After crawling the sitemap, Google crawled ALL the pages linked from the sitemap, and they all got PR in the next update.

Like you said, no one outside the plex can give a definitive answer, but my single, somewhat controlled experience seems to me less like "postulating" than what you are doing.

BigDave

4:13 am on Dec 26, 2003 (gmt 0)

Okay, I have now checked 6 DMOZ categories that were over 101k. There was no notable difference in the PR of sites linked at the beginning and the end of the page.

I mostly attribute this to the huge number of links that are on those pages. 1000 links will turn a PR5 into an insignificant link.

One thing I did notice, and got a laugh out of, was how many web design promotion companies have PR0. But that is a totally different topic.

GoogleGuy

5:03 am on Dec 26, 2003 (gmt 0)

I wouldn't be surprised if we found/followed links after 101K, and I've heard people that I trust say we do it; I've just never verified it for myself. I think that there is a cap for indexing of the on-page text. As for the original question, I'd recommend keeping pages under 100K if you want to be absolutely safe. But people who run these sorts of experiments for themselves are the sorts of people who make good SEOs anyway; feel free to play around to see what works.
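
For anyone who wants to play with the "under 100K to be safe" rule, a small hypothetical checker (the URL is a placeholder):

    import urllib.request

    SAFE_LIMIT = 100 * 1024  # the "under 100K" rule of thumb above

    def check(url):
        with urllib.request.urlopen(url, timeout=30) as resp:
            size = len(resp.read())
        status = "OK" if size <= SAFE_LIMIT else "over the safe limit"
        print("%s: %.1fKB (%s)" % (url, size / 1024.0, status))

    check("http://www.example.com/")  # hypothetical URL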

steveb

6:01 am on Dec 26, 2003 (gmt 0)

The dmoz example is a good one because there are so few of these mega pages with enough PageRank to pass meaningful PR for the effect to be noticed: the megapage has to be PR6 or so, and then it creates PR3 pages for the links at the top but PR2 or less for those at the bottom. And yes, basically all sites have other links, so it's not easy to be sure what is benefiting from what.

However, regardless of what may or may not have happened last year, I see no reason today to doubt GoogleGuy. While exceptions may occur, it is prudent to keep pages under 101k, and if a 101k+ page absolutely has to exist (say, because it belongs to someone else), it's a good idea to do what you can to see that your links appear before the 101k mark. The reality is that while 101k+ pages that are all links are easier to study and speculate about, most 101k+ pages will not be all links, so it is important to have the important stuff, whatever the important stuff is to you, higher on the page.