To tell you the truth, I've never been crawled like this before. My inbound links took a pretty good increase this last dance. But I guess I won't know if Google will show all these pages until next dance.
I've had 50,000 pages crawled before but never more than 45,000 pages listed.
1) It's extremely uneven and erratic. While I'm happy that googlebot picked up 95,000 pages this time, I have a bad feeling that half of them won't show up in the index. After one of the above posters mentioned the fact that typically, more pages are crawled than indexed, I checked my logs. Sure enough, while I've been seeing 40,000 to 50,000 pages in the index in past months, the crawls for March, April, and May fetched 52,000, 64,000, and 68,000 respectively. So even with the new high of 95,000 pages crawled, I'm not very optimistic.
2) Google came on June 1, and finished with me early on June 6. About 87,000 of the 95,000 pages it crawled were fetched in the final 14 hours. It happened in spurts, which means that as many as 10 GETs per second were occurring at the height of each spurt. This would be bad news if I were still feeding dynamic pages to Google. It's crazy to do it this way.
3) Some 7,000 of the most important pages on the site did not get crawled this time, while in past months all pages in this important category were crawled. Despite the fact that this category of 18,000 pages is more important than the other category, it only got 61 percent coverage this time. The tremendous increase in the crawling depth occurred on pages in a less-important category, which saw about 97 percent coverage. I cannot figure out why Google did it this way, as the internal linking and directory structure is about the same for both categories.
4) By way of comparison, a French bot called celsius.noos.net (I think it's a professional bot; I can't find out much about them) did some fairly broad manual surfing of my site two days ago, decided they liked it, checked the robots.txt once, figured out where the key doorway pages were, and crawled all allowed 104,000 pages in 32 hours at an even rate. More power to them. Also, fastsearch.net goes deep but spreads it out over months, because it is so slow. But at least it's consistently slow, and just keeps plugging away, so that by now it's gone deeper than Google.
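For anyone who wants to check the arithmetic in points 2 to 4, here is a rough sketch using only the figures quoted above. The function and variable names are mine, and these are averages over the whole window, so they say nothing about the shape of the spurts.

```python
def average_rate(pages: int, hours: float) -> float:
    """Average fetches per second over a crawl window of the given length."""
    return pages / (hours * 3600)


# Googlebot: about 87,000 pages in the final 14 hours of the crawl.
google_rate = average_rate(87_000, 14)

# celsius.noos.net: 104,000 pages in 32 hours at an even rate.
noos_rate = average_rate(104_000, 32)

# Important category: 18,000 pages, of which some 7,000 were not crawled.
important_coverage = (18_000 - 7_000) / 18_000

print(f"Googlebot average: {google_rate:.2f} GETs/sec")
print(f"noos.net average:  {noos_rate:.2f} GETs/sec")
print(f"Important-category coverage: {important_coverage:.0%}")
```

The Googlebot average works out to under 2 GETs per second, so peaks of 10 per second imply the bot was idle for much of that window, which is what "spurts" looks like from the server side.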
I'm getting tired of PageRank; wish I could afford to leave behind my Google referrals.
When you have a system where each page on the web influences the scores of other pages, and you have a limited amount of time to crawl them all, you have to give priority to the high ranking pages.
If you get to the end of the month and you haven't finished, which will have the least impact on the overall quality of the index? dropping a bunch of PR3's or a bunch of PR8's?
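To make the argument concrete, here is a minimal sketch of crawling under a fixed budget with the highest-scored pages taken first. It is only an illustration of the idea, not a description of how Google actually schedules its crawl; the URLs, scores, and the `crawl_order` function are all made up for the example.

```python
import heapq


def crawl_order(scores: dict[str, float], budget: int) -> list[str]:
    """Return at most `budget` URLs, highest score first.

    Anything that does not fit in the budget is simply never fetched,
    which is the "dropping the PR3's" case.
    """
    return heapq.nlargest(budget, scores, key=scores.get)


# Hypothetical pages with PageRank-like scores.
scores = {
    "/index.html": 8.0,
    "/category/a.html": 5.0,
    "/record/1001.html": 3.0,
    "/record/1002.html": 3.0,
}

# With time for only two fetches, the deep record pages are left out.
print(crawl_order(scores, budget=2))
```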
<<If you get to the end of the month and you haven't finished, which will have the least impact on the overall quality of the index? dropping a bunch of PR3's or a bunch of PR8's?>>
And to those that have 100K pages of useful content on a site, please sticky mail me as I have yet to find a site that is not all links that has that many pages :)
I am with you,
I am amazed by sites containing such a high number of pages (100K), not called Amazon or Microsoft, and still thinking they are adding something useful to the WWW. But then again, I would hate this forum not to be fully indexed, maybe I am just jealous.
I would hope Google gives preference to crawling new sites, above a PR5 site adding 10K pages to a 100K page site.
It would be normal if Google - as WG implies - crawls higher-ranked sites first.
I would also hope they put in a crawling criterion at the page level.
That is, even if your index page has a PR of 8, this does not mean that your 98,897th page, with a PageRank of 1, deserves a full crawl every month over a newly submitted unranked site.
We're not the only nonprofit that has useful data. There's an enormous amount of data on the Deep Web in the form of databases. Almost all of them keep spiders out. Perhaps they're a bit more protective of their data than we are, or it's a government or university site that doesn't care about increasing traffic, or they can't afford to let a spider go nuts on their bandwidth. And most spiders wouldn't do the data justice anyway. It has to be the sort of data that produces a record when you do keyword searches in a search box. That's not always the case with databases.
Google started this, not us. In October 2000, I noticed that Google had sneaked past my cgi-bin disallow while following a handful of external links. That was the first indication I had that this one spider, at least, was actually capable of crawling and indexing dynamic files. I thought, "What would happen if I lifted the cgi-bin disallow?"
After nine months of high CPU loads during the Google crawl, even though Google wasn't getting as far as I'd like every month, I also began to enjoy the Google referrals from this new indexing. I thought, "What if other spiders could do this too?" So I dumped the records into static files. No other bot had ever done serious dynamic-file indexing apart from Google. In the 12 months since I decided to add static files, that has pretty much continued to be true.
I did it in stages. It was a good idea -- other bots were indeed interested in the static files, and there was no load problem at all when getting hit at a fast rate, compared to the load problem the dynamic files presented. Six months ago I converted the last group of records over to static files, and re-instituted a disallow on all cgi-bin.
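For readers who have not done this themselves, the two steps described above (dump the records to static files, then disallow cgi-bin again) can be sketched roughly as follows. This is not the poster's actual setup: the record data, the `records/` layout, the example.org URLs, and the helper names are all assumptions made for illustration. The robots.txt check uses Python's standard `urllib.robotparser`.

```python
import html
import tempfile
from pathlib import Path
from urllib.robotparser import RobotFileParser


def dump_records(records: dict[str, str], out_dir: Path) -> list[Path]:
    """Write one static HTML file per database record (hypothetical layout)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for record_id, text in records.items():
        path = out_dir / f"{record_id}.html"
        body = html.escape(text)  # keep raw record text from breaking the markup
        path.write_text(f"<html><body><p>{body}</p></body></html>\n", encoding="utf-8")
        written.append(path)
    return written


# A robots.txt that keeps every spider out of the dynamic scripts.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
"""


def allowed(url: str, agent: str = "Googlebot") -> bool:
    """Check a URL against ROBOTS_TXT the way a well-behaved bot would."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


with tempfile.TemporaryDirectory() as tmp:
    files = dump_records({"1001": "Smith & Jones"}, Path(tmp) / "records")
    print([f.name for f in files])

print(allowed("http://example.org/cgi-bin/search?name=smith"))  # dynamic page
print(allowed("http://example.org/records/1001.html"))          # static page
```

A spider that honors robots.txt now sees only the static copies, so a fast crawl costs a file read per hit instead of a database query.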
Now here's the point I'm trying to make: Google may well be the world's biggest and best bot, but it's also true, in my opinion, that PageRank functions as a straitjacket that prevents them from going after the Deep Web. I have met them more than halfway, and remember, they started this. I've made some progress, and have also been hit with a partial penalty for a partial mirror site, but in the end the jury is still out on my Great Google Experiment.
Our database has been on the Internet since early 1995 (on telnet), and on the Web since early 1996. Google started sniffing around in 1999. If our site had not been well-established by the time Google was crawling, we'd probably be a PR4 today instead of a high PR6. How far do you think Google would get into our data if we had a PR4?
What if another site that has useful data decided to remove their password restrictions and their robots.txt disallow, and make it available to spiders all of a sudden? It could take a long time to build up their PR if they appeared suddenly on the Web with open access.
During this long period, their PR would be low and the depth to which Google crawls their data would be comparatively shallow. If this database site hasn't dumped to static files, they'd probably have load problems as well. I think they'd be discouraged, and I can't think why they'd want to start down that road at all, unless they already have a healthy PageRank and can expect significant traffic from the Google referrals.
A PageRank system doesn't make any sense for crawling an integrated database. A site that is database driven, or that has dumped its records from a database to static files, should be considered as a totality. Google is so enamored of doing everything automatically with algorithms that they are inadvertently overlooking the downside. That goes for PR0 linking penalties, duplicate page penalties that can involve outright removal of these pages from the mother site, and it also goes for crawling sites such as ours.
I think PageRank is holding Google back. It's expensive for Google (many recursive calculations), and it creates the illusion of rationality, when in fact some of the consequences of basing your engine on PageRank look very much like a case of the tail wagging the dog.
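As a rough picture of what the repeated calculations involve, here is the simplified textbook form of PageRank computed by repeated passes over a tiny link graph. This is a sketch of the published idea only, not Google's production system; the three-page graph, the damping value of 0.85, and the iteration count are assumptions chosen for the example, and every link target is assumed to appear as a key in the graph.

```python
def pagerank(links: dict[str, list[str]], damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    """Simplified PageRank: every pass recomputes each page's score
    from the scores of the pages that link to it."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            # A page with no outlinks spreads its score over all pages.
            targets = outlinks if outlinks else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical site: a home page linking to two record pages,
# each of which links only back to the home page.
graph = {
    "home": ["record1", "record2"],
    "record1": ["home"],
    "record2": ["home"],
}
ranks = pagerank(graph)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

Each pass touches every link once, and many passes are needed before the scores settle, which is where the expense comes from when the graph is the whole Web.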
Your site may have unique info. But I still believe that Google should "not" crawl 100's of thousands of pages of a site. A couple thousand yes - 10's of thousands no. You can get the idea of what a company offers from its database, so there isn't any sense in crawling every page. I know that thought will get a lot of people upset at me but it is realistic. I could tie up the market for almost any type of business/sites by just dumping phonebook databases including company names into a database and then into a website and creating dynamic/static content from it. I would have millions of "pages" - should I expect to get every page crawled?