Page rank and googlebot crawl schedule

What's your Page rank and when will googlebot deep crawl your site?

latimer

6:58 pm on Jun 5, 2002 (gmt 0)

Last month it seemed the theory that less page rank puts you lower on Googlebot's spidering schedule was confirmed (for existing sites; new sites seem to get special attention). Also, that higher page rank will result in more of a large site's pages being crawled. Anyone care to share this month's crawl dates for their sites, number of pages crawled, and page rank? We are currently sitting at a lowly page rank of 3 on the index page and 0 on other levels, and haven't seen Googlebot yet.

korkus2000

7:01 pm on Jun 5, 2002 (gmt 0)

PR 4
300 pages spidered - most of them dynamic urls
started 6/1
tapering off today

Grumpus

7:55 pm on Jun 5, 2002 (gmt 0)

I've got a PR4 this month (last was a 5, but it went down cuz credit for internal links back to my homepage was dropped).

The bot's been there for two days working on it a lot slower than usual, but still plugging along. I put up some fences to sort of corral the bot in the direction I wanted her to go (so that pages with popular search terms got indexed first) and that seems to be working.

Not sure how the crawl will continue. First month (when I was given PR5) about 40K pages were indexed. Last month (bringing me to a PR4) only 28K were indexed. I'm up to about 600 pages so far this month (and at a much slower rate than usual), but she's still plugging away. Will let ya know.

Doofus

8:07 pm on Jun 5, 2002 (gmt 0)

Excellent idea for a thread! Google has been slower to gear up this time, and had me worried for two days. Starting to pick up some serious speed today, however. The big question now is when will it decide to suddenly stop the crawl and start computing? This has always been an issue with "Domain2" below, which has 100,000 pages available but has never gotten more than 40,000 to 50,000 before the crawl stops cold for the month. My suspicion has been that if Domain2 was a PR8, for example, instead of a 6 (it's a high 6; some months it shows as 7), then the serious crawling would start earlier, and Google would get deeper. But I have no way to prove it.

All spidering began on June 1; I'm using New York City time.

Each of the five domains is a PR 6. Some are not mine, but I have log access.

The method for determining the total number of pages from each domain that made it into the index last time is this:
site:www.mydomain.com "www.mydomain.com"

Domain  / Pages crawled / Pages in index at last update / Status

Domain1 /  4,163 /  4,230 / Still busy
Domain2 /  8,158 / 40,200 / Still busy
Domain3 /  9,644 / 19,100 / Still busy
Domain4 /     39 /     35 / Finished at 2002-06-02 00:30
Domain5 /      3 /      3 / Finished at 2002-06-02 02:03

On domain2 and domain3, about half of the pages crawled thus far have been crawled in the last 14 hours alone. Starts slowly, speeds up gradually, goes crazy at the end -- classic Google.

Doofus

2:45 am on Jun 6, 2002 (gmt 0)

Update on previous post:

Domain1 4,319 crawled; occasional sniffing but probably finished
Domain2 31,510 crawled and still busy
Domain3 11,482 crawled and still busy

Been watching the Domain2 crawling behavior with tail -f access_log | grep "googlebot"

It spasms about every three minutes. During a spasm, it can fetch my little static HTML files at a rate of between 2 and 10 per second (yes, that's seconds!). Then it catches its breath for a few minutes, and comes back to do the same thing.
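For anyone who wants numbers rather than eyeballing the tail output, here is a minimal sketch that counts Googlebot fetches per second. It assumes the log is in Common Log Format with a bracketed timestamp; the function name and the sample lines are made up for illustration and are not taken from anyone's real logs.

```python
import re
from collections import Counter

# Matches the timestamp in a Common Log Format line,
# e.g. [06/Jun/2002:02:45:01 -0400]; the capture is second-level precision.
TIMESTAMP = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2}) [+-]\d{4}\]")

def googlebot_hits_per_second(lines):
    """Count log lines mentioning googlebot, grouped by the second they occurred in."""
    hits = Counter()
    for line in lines:
        if "googlebot" not in line.lower():
            continue
        match = TIMESTAMP.search(line)
        if match:
            hits[match.group(1)] += 1
    return hits

# Hypothetical sample lines, invented for illustration only.
sample = [
    'crawl1.googlebot.com - - [06/Jun/2002:02:45:01 -0400] "GET /a.html HTTP/1.0" 200 512',
    'crawl1.googlebot.com - - [06/Jun/2002:02:45:01 -0400] "GET /b.html HTTP/1.0" 200 498',
    'crawl2.googlebot.com - - [06/Jun/2002:02:45:02 -0400] "GET /c.html HTTP/1.0" 200 730',
    'dialup.example.net - - [06/Jun/2002:02:45:02 -0400] "GET /d.html HTTP/1.0" 200 611',
]
per_second = googlebot_hits_per_second(sample)
peak = max(per_second.values())
```

To run it against a real file, pass an open file object instead of the sample list. The peak value is the busiest single second seen in the log.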

For me, that forever answers the question of whether it was a good idea to dump nearly the entire database into static files. The CPU load difference was between 10 and 100 times greater for this level of spidering when the files were all dynamic. Now it hardly makes a dent in the load. (This is just CPU load; bandwidth is not a problem for us.)

Lighter load + no program execution = faster delivery. Google doesn't have to wait, and may get deeper (assuming that there's no predetermined cutoff on the crawl for a given PageRank).

It also looks to me like the crawling is in directory-depth order, approximately. This is a strong argument for designing a shallow site, particularly if you're struggling with a situation where you're trying to get more pages crawled. I believe that Google comes into your homepage early or late in the crawling cycle, based on your homepage PR. Then it "guesses" at all other links it finds on your site and assigns them a home page minus one PR based on directory depth, and queues them in the crawl accordingly. Therefore, a shallow site helps provide a better PR on this guess. If you need Google to go deep, this might be helpful.

Once it gets the internal page, and goes through the monthly calculation, then the "guessed" PR is replaced with a real PR.
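To make the guess concrete, here is a rough sketch of that queueing idea. It is only my reading of the theory, not anything Google has documented: the rule of losing one point of PR per directory level, the function names, and the example.com URLs are all assumptions for illustration.

```python
import heapq
from urllib.parse import urlparse

def directory_depth(url):
    """Number of directory levels below the site root (the file name is not counted)."""
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]
    # A trailing file name such as page.html is not a directory level.
    if segments and not path.endswith("/"):
        segments = segments[:-1]
    return len(segments)

def guessed_pr(homepage_pr, url):
    """Assumed rule: lose one point of PR per directory level, never below 0."""
    return max(homepage_pr - directory_depth(url), 0)

def crawl_order(homepage_pr, urls):
    """Return URLs ordered by guessed PR, highest first; ties keep input order."""
    heap = [(-guessed_pr(homepage_pr, u), i, u) for i, u in enumerate(urls)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]

# Hypothetical URLs for a site whose homepage is a PR6.
urls = [
    "http://www.example.com/a/b/deep.html",
    "http://www.example.com/top.html",
    "http://www.example.com/a/mid.html",
]
order = crawl_order(6, urls)
```

Under these assumptions the page in the root directory is queued first and the one two directories down is queued last, which is the argument for a shallow layout.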

billy_t9

7:06 am on Jun 6, 2002 (gmt 0)

PR5 (maybe a high 5); pages down to the 3rd level have PR5
crawl started 3rd of May and not finished yet
20172 pages spidered - most of them dynamic urls

Abrexa_UK

11:48 am on Jun 6, 2002 (gmt 0)

Our PR 5 and 4 sites started on the first day of the crawl for us, and have been going constantly with a couple of thousand pages per day being indexed.

A PR3 started yesterday, but Googlebot is pulling the pages at a fair rate to try and catch up :)

So I would say that PR does influence the order. Perhaps there is simply a threshold - PR 4 and above get spidered first. That would also match up with the links that get shown in the "backward links" pages.

Does Google have a set differentiation between sites of PR4+ and everyone else? Does this affect anything else within Google?

Doofus

3:56 pm on Jun 6, 2002 (gmt 0)

Google stopped crawling today around 06:00 New York City time.

My big site, Domain2, got 95,000 picked up by the end. This is twice the usual, and only slightly under the maximum possible of 103,000 pages.

I'm very happy. Now I have to figure out whether everyone is getting a deeper crawl this month, or whether it is due to improvements in my internal-page cross-linking on that site.

And, of course, I'm very interested to see if the inside pages end up with a passable PR once the next update kicks in. If the internal PR wasn't damaged, I should be in very good shape.

I've been trying to crack that barrier of 40,000 to 50,000 crawled for 18 months now!

latimer

4:42 pm on Jun 6, 2002 (gmt 0)

googlebot only sniffing here, no deep crawl yet. Last month it started around the 10th. Hoping she gets here soon and with the modifications we have made will be able to get all 15,000. From what the rest of you are seeing it sounds very encouraging. Apparently there is more to it than just page rank, with the other site of pr 3 already being spidered. Abrexa, was your site ever in the penalty box?

athinktank

4:51 pm on Jun 6, 2002 (gmt 0)

PR2 site, first crawled and listed in google serps last month. GB grabbed robots and / on June 3rd. Grabbed the 2nd level pages on the 4th. Started a slow deep crawl on the 5th. GB is still pulling pages and is up to 3000. Last month they pulled over 4500 pages out of a total of 33,000+.

taxpod

6:24 pm on Jun 6, 2002 (gmt 0)

PR6 (a few PR6's and several 5's). Googlebot has now been with me for the past four days and has crawled just under 100,000 pages.

To tell you the truth, I've never been crawled like this before. My inbound links took a pretty good increase this last dance. But I guess I won't know if Google will show all these pages until next dance.

I've had 50,000 pages crawled before but never more than 45,000 pages listed.

latimer

4:17 pm on Jun 7, 2002 (gmt 0)

googlebot finally showed up to deep crawl our PR3 site. So far pulled about 1000 pages in 8 hours. From the other posts here, it seems that page rank does influence the crawl schedule as well as the speed at which googlebot pulls the pages, and perhaps the number of pages that get pulled. For example, taxpod's pr6 site has had about 100,000 pages pulled in 4 days = 25,000 per day. billy_t9's pr5 site: 20,172 pages in 3 days = about 6,700 per day. Meanwhile athinktank's pr2 site is being crawled slowly at about a 3,000 per day pace, which is the same pace at which our pr3 site is being crawled. Also, these pr2 and pr3 sites have in common that last month only a small portion of the pages were crawled: athinktank's 4,500 out of 33,000 and our under 2,000 of over 15,000. However, some of the much larger sites also have not been completely spidered, but this seems to be improving this time around. We are hoping that googlebot picks up speed and completes the crawl this month, as she is here about 3 days earlier than last month. Has anyone else noticed that the crawl started earlier for them this month? Maybe google's scalability issues are improving, giving more hope for the next update.
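A quick check of that arithmetic, as a small sketch. The page counts and day counts are the ones quoted in this thread; the dictionary layout and the function name are just for illustration.

```python
# Figures quoted in this thread: (pages crawled, days of crawling).
crawls = {
    "taxpod (PR6)": (100_000, 4),
    "billy_t9 (PR5)": (20_172, 3),
}

def pages_per_day(pages, days):
    """Average daily crawl rate, rounded to the nearest whole page."""
    return round(pages / days)

rates = {site: pages_per_day(pages, days) for site, (pages, days) in crawls.items()}
for site, rate in rates.items():
    print(f"{site}: about {rate:,} pages per day")
```

These are averages only. Going by the other posts, the crawl comes in bursts, so the rate at any given hour can be far higher or lower.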

Doofus

6:11 pm on Jun 7, 2002 (gmt 0)

I think crawling based on PageRank should be ditched.

1) It's extremely uneven and erratic. While I'm happy that googlebot picked up 95,000 pages this time, I have a bad feeling that half of them won't show up in the index. After one of the above posters mentioned the fact that typically, more pages are crawled than indexed, I checked my logs. Sure enough, while I've been seeing 40,000 to 50,000 pages in the index in past months, the crawls for March, April, and May fetched 52,000, 64,000, and 68,000 respectively. So even with the new high of 95,000 pages crawled, I'm not very optimistic.

2) Google came on June 1, and finished with me early on June 6. About 87,000 of the 95,000 pages it crawled were fetched in the last 14-hour period. It happened in spurts, which means that as many as 10 GETs per second were occurring at the height of each spurt. This would be bad news if I was still feeding dynamic pages to Google. It's crazy to do it this way.

3) Some 7,000 of the most important pages on the site did not get crawled this time, while in past months all pages in this important category were crawled. Despite the fact that this category of 18,000 pages is more important than the other category, it only got 61 percent coverage this time. The tremendous increase in the crawling depth occurred on pages in a less-important category, which saw about 97 percent coverage. I cannot figure out why Google did it this way, as the internal linking and directory structure is about the same for both categories.

4) By way of comparison, a French bot called celsius.noos.net (I think it's a professional bot; I can't find out much about them) did some fairly broad manual surfing of my site two days ago, decided they liked it, checked the robots.txt once, figured out where the key doorway pages were, and crawled all allowed 104,000 pages in 32 hours at an even rate. More power to them. Also, fastsearch.net goes deep but spreads it out over months, because it is so slow. But at least it's consistently slow, and just keeps plugging away, so that by now it's gone deeper than Google.

I'm getting tired of PageRank; wish I could afford to leave behind my Google referrals.

The Contractor

6:56 pm on Jun 7, 2002 (gmt 0)

Doofus,

When I hear of 95,000 pages on a site being crawled I cannot help but wonder - is this site a dictionary with a small paragraph on each page :)
What kind of a site really has 95K pages of content?

I have to ask ;)

WebGuerrilla

7:04 pm on Jun 7, 2002 (gmt 0)

>>I think crawling based on PageRank should be ditched.

When you have a system where each page on the web influences the scores of other pages, and you have a limited amount of time to crawl them all, you have to give priority to the high ranking pages.

If you get to the end of the month and you haven't finished, which will have the least impact on the overall quality of the index? Dropping a bunch of PR3's or a bunch of PR8's?

The Contractor

7:08 pm on Jun 7, 2002 (gmt 0)

BTW - I agree with WG and the following statement that makes perfect sense:

<<If you get to the end of the month and you haven't finished, which will have the least impact on the overall quality of the index? dropping a bunch of PR3's or a bunch of PR8's?>>

And to those that have 100K pages of useful content on a site, please sticky mail me as I have yet to find a site that is not all links that has that many pages :)

vitaplease

8:06 pm on Jun 7, 2002 (gmt 0)

The_Contractor,

I am with you,

I am amazed by sites containing such a high number of pages (100K), not called Amazon or Microsoft, and still thinking they are adding something useful to the WWW. But then again, I would hate this forum not to be fully indexed, maybe I am just jealous.

I would hope Google gives preference to crawling new sites, above a PR5 site adding 10K pages to a 100K page site.

It would be normal if Google - as WG implies - crawls higher sites foremost.

I would also hope they put in a crawling criterion on page level.
That is, even if your index page has a PR of 8, this does not mean that your 98,897th page, with a pagerank of 1, deserves a full crawl every month over a newly submitted unranked site.

Doofus

9:14 pm on Jun 7, 2002 (gmt 0)

It's a nonprofit site; our tax-exempt nonprofit corporation, which was incorporated in 1989, exists to make the database available. The database has over 100,000 records of globally-unique information. Compilation of this data began in 1983.

We're not the only nonprofit that has useful data. There's an enormous amount of data on the Deep Web in the form of databases. Almost all of them keep spiders out. Perhaps they're a bit more protective of their data than we are, or it's a government or university site that doesn't care about increasing traffic, or they can't afford to let a spider go nuts on their bandwidth. And most spiders wouldn't do the data justice anyway. It has to be the sort of data that produces a record when you do keyword searches in a search box. That's not always the case with databases.

Google started this, not us. In October 2000, I noticed that Google had sneaked past my cgi-bin disallow while following a handful of external links. That was the first indication I had that this one spider, at least, was actually capable of crawling and indexing dynamic files. I thought, "What would happen if I lifted the cgi-bin disallow?"

After nine months of high CPU loads during the Google crawl, even though Google wasn't getting as far as I'd like every month, I also began to enjoy the Google referrals from this new indexing. I thought, "What if other spiders could do this too?" So I dumped the records into static files. No other bot had ever done serious dynamic-file indexing apart from Google. In the 12 months since I decided to add static files, that has pretty much continued to be true.
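For anyone wondering what "dumping records into static files" amounts to, here is a minimal sketch. It is not the script used on our site; the record layout, the file naming, and the function name are all assumptions made for illustration.

```python
import html
import tempfile
from pathlib import Path

def dump_records(records, out_dir):
    """Write one static HTML page per record and return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for record_id, (name, body) in records.items():
        # Escape the text so that characters like & and < stay valid HTML.
        page = (
            "<html><head><title>{0}</title></head>"
            "<body><h1>{0}</h1><p>{1}</p></body></html>\n"
        ).format(html.escape(name), html.escape(body))
        path = out_dir / f"{record_id}.html"
        path.write_text(page, encoding="utf-8")
        written.append(path)
    return written

# Hypothetical records standing in for rows pulled from the database.
records = {
    1: ("Example Person", "Record text & notes."),
    2: ("Example Group", "More text."),
}
out = Path(tempfile.mkdtemp())
paths = dump_records(records, out)
```

The point is that the database work happens once, at dump time, so the web server only has to hand out plain files when a spider hits hard.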

I did it in stages. It was a good idea -- other bots were indeed interested in the static files, and there was no load problem at all when getting hit at a fast rate, compared to the load problem the dynamic files presented. Six months ago I converted the last group of records over to static files, and re-instituted a disallow on all cgi-bin.
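A simple way to confirm that a disallow like this does what you expect is to run it through a robots.txt parser. The sketch below uses Python's standard urllib.robotparser; the robots.txt contents and the example.org URLs are hypothetical and are not our actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block every crawler from cgi-bin, allow the rest.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A dynamic URL under cgi-bin versus a static page outside it.
dynamic_ok = parser.can_fetch("Googlebot", "http://www.example.org/cgi-bin/lookup?id=123")
static_ok = parser.can_fetch("Googlebot", "http://www.example.org/records/123.html")

print("cgi-bin allowed:", dynamic_ok)
print("static allowed:", static_ok)
```

This only shows what a well-behaved bot should do with the file. It says nothing about a spider that follows external links past the disallow.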

Now here's the point I'm trying to make: Google may well be the world's biggest and best bot, but it's also true, in my opinion, that PageRank functions as a straitjacket that prevents them from going after the Deep Web. I have met them more than halfway, and remember, they started this. I've made some progress, and have also been hit with a partial penalty for a partial mirror site, but in the end the jury is still out on my Great Google Experiment.

Our database has been on the Internet since early 1995 (on telnet), and on the Web since early 1996. Google started sniffing around in 1999. If our site had not been well-established by the time Google was crawling, we'd probably be a PR4 today instead of a high PR6. How far do you think Google would get into our data if we had a PR4?

What if another site that has useful data decided to remove their password restrictions and their robots.txt disallow, and make it available to spiders all of a sudden? It could take a long time to build up their PR if they appeared suddenly on the Web with open access.

During this long period, their PR would be low and the depth to which Google crawls their data would be comparatively shallow. If this database site hasn't dumped to static files, they'd probably have load problems as well. I think they'd be discouraged, and I can't think why they'd want to start down that road at all, unless they already have a healthy PageRank and can expect significant traffic from the Google referrals.

A PageRank system doesn't make any sense for crawling an integrated database. A site that is database driven, or dumped its records to static files from a database, should be considered as a totality. Google is so enamored of doing everything automatically with algorithms, that they are inadvertently overlooking the down side. That goes for PR0 linking penalties, duplicate page penalties that can involve outright removal of these pages from the mother site, and it also goes for crawling sites such as ours.

I think PageRank is holding Google back. It's expensive for Google (many recursive calculations), and it creates the illusion of rationality, when in fact some of the consequences of basing your engine on PageRank look very much like a case of the tail wagging the dog.

The Contractor

9:34 pm on Jun 7, 2002 (gmt 0)

Doofus,

Your site may have unique info. But I still believe that google should "not" crawl 100's of thousands of pages of a site. A couple thousand yes - 10's of thousands no. You can get the idea of what a company offers from its database so there isn't any sense to crawl every page. I know that thought will get a lot of people upset at me but it is realistic. I could tie up the market for almost any type of business/sites by just dumping phonebook databases including company names into a database and then into a website and creating dynamic/static content from it. I would have millions of "pages" - should I expect to get every page crawled?

Ready To Roll

10:00 pm on Jun 7, 2002 (gmt 0)

I'm with you, Contractor. Limit any site to a couple hundred pages at most, unless there really is a need for more, and even then it should be a non-commercial site.
R2R

Jack_Straw

10:16 pm on Jun 7, 2002 (gmt 0)

What are you guys thinking? Suppose you have a store with thousands of products? Are you saying those product pages have no value?

I have recently been shopping for an MP3 player. I would like to see more pages listed (not fewer) when I search for specific items. If I put in a particular brand and model, the search engine should show me information pages about that brand and model and places where I can buy it.

I really don't get how you are thinking. Pages from a site with lots of good product pages with information about each product and a way to buy them are very valid search results.

dcheney

10:45 pm on Jun 7, 2002 (gmt 0)

Contractor,
I have to disagree. My site (non-commercial/ad free) currently has roughly 10k pages with another 15k coming on line in the next year. I do have "overhead" pages (well under 10%) which help someone navigate to the proper page if they come in the front gate. But otherwise, the vast majority is unique content about a specific person/organization. The vast majority of search engine hits are directly to these inside unique pages based on the proper names of the person/organization.
My site is static, built via a custom database.

The Contractor

12:56 am on Jun 8, 2002 (gmt 0)

Ok guys my thoughts:

There is absolutely no reason for google to cache every page of every site when that could be thousands of pages. You mention you have thousands of products. What if google crawled every page of amazon.com and buy.com - do you really think you would have a chance to compete?

If you cannot get google to know what your site is about in a couple thousand pages - the extra 90K pages are not going to help.

If I had a site of public records, do you think it is Google's duty to spider every record?
Sure it's all unique content but not really ;)

Jack_Straw

1:10 am on Jun 8, 2002 (gmt 0)

Duty? I don't know about duty.

But, public records are valuable information. If I am researching obscure stuff, I hope to find it there. I believe Google intends to be comprehensive. Isn't that why they call it Google?

You can't really be arguing that no one source can have more than "a couple thousand" pages of useful information.

The Contractor

1:15 am on Jun 8, 2002 (gmt 0)

Jack,

I am not saying at all that the site does not have useful information. What I am saying is that if Google cached every page of the public records that are on file for local governments, how could a site that has the same compete? Do a search for "Windows Information" on Google. Now what if Google cached every page of Microsoft. Who do you think would have the first 1000 pages of SERPS?

Would you really like to compete with this on a grand scale? Just because somebody dumps a database into a website does not mean that Google should crawl every page.

Doofus

2:05 am on Jun 8, 2002 (gmt 0)

I'm afraid Google disagrees with you on this one, Contractor. I'm not worried at all that Google will chop me off someday simply because my site is too big. On the contrary, I'm surprised I don't get special bonus points from Google. That's why I implied that PageRank is ultimately cheating Google, in that it does certain things poorly that could be done better.

Each of my records is about a specific person, group, or corporation. Many, many searchers put the name of a specific person, group, or corporation into the Google search box. Google might even spit out a white pages listing when it responds, if it detects that it's a candidate for a white pages scan. If you put in a ten-digit U.S. telephone number, it will do a reverse lookup in the U.S. white pages. They aren't using the best collection of white pages data, but it's still useful.

Have you ever heard the term, "I googled him and found out..."? I think it was in a New York Times article about two people on a first date, and it turned out that each had already "googled" the other. It was a lightweight piece, and Google loved the publicity.

Name searches are powerful on Google if you know how to search, particularly when the name is not so common. I think this is one of the more significant contributions search engines have made to our society. For investigative journalists, Google is the first port-of-call. Over 95 percent of my Google referrals are zeroing in on a specific name, so I try to optimize that page (one page per name) for the name itself.

Google has even bragged about their ability in this respect. I believe it was Sergey who said that the first thing any employer might want to do when looking at a promising resume, is to "google" the person.

Google becomes a verb, as happened many years ago with Xerox, as in "I'll xerox a copy for you." This is like living in heaven for any public relations department in an aggressive company.

fathom

2:13 am on Jun 8, 2002 (gmt 0)

Thanks Doofus

I never put the two references together until now. Google/Xerox!

It is excellent branding eh!

The Contractor

2:23 am on Jun 8, 2002 (gmt 0)

Doofus - you are missing the whole point from where this thread started :)

Jack_Straw

6:47 am on Jun 8, 2002 (gmt 0)

Contractor,

It seems to me that Doofus has the point correct. His point, if I may re-state it, is that Google's method of determining what pages to crawl and index is counter to their goal of indexing all quality relevant content because it sometimes causes large sites with good content to be under-indexed. To me, this seems to be a good and interesting point.

You countered with assertions that large sites should not be indexed simply because they have lots of pages. And your posts carry a strong suggestion that you think that any site having a large number of pages must be spam. That, it seems to me, misses the point.

I think the suggestion that any site with many pages must be spam is very wrong and ill-conceived.

Google and the search public both benefit from an algorithm that indexes all quality and relevant content, irrespective of whether it comes from a large or small site.

vitaplease

7:51 am on Jun 8, 2002 (gmt 0)

Hopefully this discussion will be irrelevant in a year or two, when computer and memory prices have dropped further and Google will have more than enough capacity to spider and index anything.

For the moment, this does not seem to be the case.

If you were Google, how would you choose what to spider frequently and index deeply and what not? Does favoring higher pagerank plus new unindexed sites not make sense?
