The bot's been there for two days working on it a lot slower than usual, but still plugging along. I put up some fences to sort of corral the bot in the direction I wanted her to go (so that pages with popular search terms got indexed first) and that seems to be working.
Not sure how the crawl will continue. First month (when I was given PR5) about 40K pages were indexed. Last month (bringing me to a PR4) only 28K were indexed. I'm up to about 600 pages so far this month (and at a much slower rate than usual), but she's still plugging away. Will let ya know.
G.
All spidering began on June 1; I'm using New York City time.
Each of the five domains is a PR 6. Some are not mine, but I have log access.
The method for determining the total number of pages from each domain that made it into the index last time is this:
site:www.mydomain.com "www.mydomain.com"
Domain     Pages crawled   Pages at last update   Status
Domain1            4,163                  4,230   Still busy
Domain2            8,158                 40,200   Still busy
Domain3            9,644                 19,100   Still busy
Domain4               39                     35   Finished at 2002-06-02 00:30
Domain5                3                      3   Finished at 2002-06-02 02:03
On domain2 and domain3, about half of the pages crawled thus far have been crawled in the last 14 hours alone. Starts slowly, speeds up gradually, goes crazy at the end -- classic Google.
Domain1 4,319 crawled; occasional sniffing but probably finished
Domain2 31,510 crawled and still busy
Domain3 11,482 crawled and still busy
Been watching the Domain2 crawling behavior with tail -f access_log | grep "googlebot"
It spasms about every three minutes. During a spasm, it can fetch my little static HTML files at a rate of between 2 and 10 per second (yes, that's seconds!). Then it catches its breath for a few minutes, and comes back to do the same thing.
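For anyone who wants numbers rather than eyeballing the tail output, here is a minimal sketch that counts matching requests per second. It assumes the log is in Apache common/combined format; the sample lines and the function names are made up for illustration and are not from my real logs.

```python
import re
from collections import Counter

# Timestamp as written in common/combined log format,
# e.g. [06/Jun/2002:03:14:07 -0400]
TIMESTAMP = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2}) [+-]\d{4}\]")

# Hypothetical sample lines, for illustration only.
SAMPLE = [
    '192.0.2.1 - - [06/Jun/2002:03:14:07 -0400] "GET /a.html HTTP/1.0" 200 512 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '192.0.2.1 - - [06/Jun/2002:03:14:07 -0400] "GET /b.html HTTP/1.0" 200 498 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '192.0.2.1 - - [06/Jun/2002:03:14:07 -0400] "GET /c.html HTTP/1.0" 200 530 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '192.0.2.1 - - [06/Jun/2002:03:14:08 -0400] "GET /d.html HTTP/1.0" 200 475 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '198.51.100.7 - - [06/Jun/2002:03:14:08 -0400] "GET /a.html HTTP/1.0" 200 512 "-" "Mozilla/4.0"',
]

def hits_per_second(lines, agent="googlebot"):
    """Count requests per one-second bucket for lines that mention `agent`.

    The match is a case-insensitive substring test on the whole line,
    so it is slightly looser than a plain grep "googlebot".
    """
    counts = Counter()
    for line in lines:
        if agent not in line.lower():
            continue
        match = TIMESTAMP.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

def peak_rate(counts):
    """Return (timestamp, hits) for the busiest second, or None if empty."""
    if not counts:
        return None
    return counts.most_common(1)[0]
```

To run it on a real log, pass an open file instead of SAMPLE, since a file object iterates line by line.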
For me, that forever answers the question of whether it was a good idea to dump nearly the entire database into static files. The CPU load difference was between 10 and 100 times greater for this level of spidering when the files were all dynamic. Now it hardly makes a dent in the load. (This is just CPU load; bandwidth is not a problem for us.)
Lighter load + no program execution = faster delivery. Google doesn't have to wait, and may get deeper (assuming that there's no predetermined cutoff on the crawl for a given PageRank).
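A rough sketch of what "dumping the database into static files" can look like, one HTML file per record. The record layout (a slug mapped to a title and a body), the page template, and the function name are all assumptions for illustration; this is not the script used on the site.

```python
import html
from pathlib import Path

# Bare-bones page template; a real site would have its own layout.
PAGE = (
    "<html><head><title>{title}</title></head>"
    "<body><h1>{title}</h1><p>{body}</p></body></html>\n"
)

def dump_static(records, out_dir):
    """Write one static HTML file per record and return the paths written.

    `records` is assumed to map a URL-safe slug to a (title, body) pair.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for slug, (title, body) in records.items():
        path = out / f"{slug}.html"
        # Escape the text so record contents cannot break the markup.
        page = PAGE.format(title=html.escape(title), body=html.escape(body))
        path.write_text(page, encoding="utf-8")
        written.append(path)
    return written
```

Once the files exist, the web server hands them out directly and no program runs per request, which is where the load difference comes from.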
It also looks to me like the crawling is in directory-depth order, approximately. This is a strong argument for designing a shallow site, particularly if you're struggling with a situation where you're trying to get more pages crawled. I believe that Google comes into your homepage early or late in the crawling cycle, based on your homepage PR. Then it "guesses" at all other links it finds on your site and assigns them a home page minus one PR based on directory depth, and queues them in the crawl accordingly. Therefore, a shallow site helps provide a better PR on this guess. If you need Google to go deep, this might be helpful.
Once it gets the internal page, and goes through the monthly calculation, then the "guessed" PR is replaced with a real PR.
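To make the guess concrete, here is a small sketch of the queueing behavior as I imagine it. This is only my hypothesis written out, not anything Google has published. The rule used (home page PR minus one, minus one more per directory level) is one reading of the idea above, and the URLs and function names are hypothetical.

```python
import heapq
from urllib.parse import urlparse

def directory_depth(url):
    """Number of directory levels below the site root (file name not counted)."""
    path = urlparse(url).path
    parts = [p for p in path.split("/") if p]
    if parts and not path.endswith("/"):
        parts = parts[:-1]  # drop the file name
    return len(parts)

def guessed_pr(home_pr, url):
    """Assumed rule: home PR minus one, minus one per directory level, floor 0."""
    return max(home_pr - 1 - directory_depth(url), 0)

def crawl_order(home_pr, urls):
    """Return URLs in the order a highest-guessed-PR-first queue would pop them.

    Ties keep their original discovery order.
    """
    heap = [(-guessed_pr(home_pr, u), i, u) for i, u in enumerate(urls)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]
```

Under this rule a PR6 site's page four directories down would be queued with a guess of 1, which is why a shallow layout would help if the hypothesis holds.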
A PR3 started yesterday, but Googlebot is pulling the pages at a fair rate to try and catch up :)
So I would say that PR does influence the order. Perhaps there is simply a threshold - PR 4 and above get spidered first. That would also match up with the links that get shown in the "backward links" pages.
Does Google have a set differentiation between sites of PR4+ and everyone else? Does this affect anything else within Google?
My big site, Domain2, got 95,000 pages picked up by the end. This is twice the usual, and only slightly under the maximum possible of 103,000 pages.
I'm very happy. Now I have to figure out whether everyone is getting a deeper crawl this month, or whether it is due to improvements in my internal-page cross-linking on that site.
And, of course, I'm very interested to see if the inside pages end up with a passable PR once the next update kicks in. If the internal PR wasn't damaged, I should be in very good shape.
I've been trying to crack that barrier of 40,000 to 50,000 crawled for 18 months now!
To tell you the truth, I've never been crawled like this before. My inbound links took a pretty good increase this last dance. But I guess I won't know if Google will show all these pages until next dance.
I've had 50,000 pages crawled before but never more than 45,000 pages listed.
1) It's extremely uneven and erratic. While I'm happy that googlebot picked up 95,000 pages this time, I have a bad feeling that half of them won't show up in the index. After one of the above posters mentioned the fact that typically, more pages are crawled than indexed, I checked my logs. Sure enough, while I've been seeing 40,000 to 50,000 pages in the index in past months, the crawls for March, April, and May fetched 52,000, 64,000, and 68,000 respectively. So even with the new high of 95,000 pages crawled, I'm not very optimistic.
2) Google came on June 1, and finished with me early on June 6. About 87,000 of the 95,000 pages it crawled were fetched in the last 14-hour period. It happened in spurts, which means that as many as 10 GETs per second were occurring at the height of each spurt. This would be bad news if I was still feeding dynamic pages to Google. It's crazy to do it this way.
3) Some 7,000 of the most important pages on the site did not get crawled this time, while in past months all pages in this important category were crawled. Despite the fact that this category of 18,000 pages is more important than the other category, it only got 61 percent coverage this time. The tremendous increase in the crawling depth occurred on pages in a less-important category, which saw about 97 percent coverage. I cannot figure out why Google did it this way, as the internal linking and directory structure is about the same for both categories.
4) By way of comparison, a French bot called celsius.noos.net (I think it's a professional bot; I can't find out much about them) did some fairly broad manual surfing of my site two days ago, decided they liked it, checked the robots.txt once, figured out where the key doorway pages were, and crawled all allowed 104,000 pages in 32 hours at an even rate. More power to them. Also, fastsearch.net goes deep but spreads it out over months, because it is so slow. But at least it's consistently slow, and just keeps plugging away, so that by now it's gone deeper than Google.
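The figures in points 1 to 3 can be checked with a few lines of arithmetic. This is just a sketch using the numbers quoted above; the helper names are made up.

```python
def coverage(crawled, total):
    """Percentage of pages crawled out of the total available."""
    return 100.0 * crawled / total

def average_rate(fetches, hours):
    """Average fetches per second over a period given in hours."""
    return fetches / (hours * 3600.0)

# Important category: 18,000 pages, about 7,000 of them missed.
important = coverage(18_000 - 7_000, 18_000)   # about 61.1 percent

# Whole site: 95,000 crawled out of a possible 103,000.
overall = coverage(95_000, 103_000)            # about 92.2 percent

# Final burst: 87,000 fetches in the last 14 hours.
final_burst = average_rate(87_000, 14)         # about 1.73 per second
```

The 14-hour average of under 2 fetches per second is consistent with short spurts of up to 10 per second separated by pauses.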
I'm getting tired of PageRank; wish I could afford to leave behind my Google referrals.
When you have a system where each page on the web influences the scores of other pages, and you have a limited amount of time to crawl them all, you have to give priority to the high ranking pages.
If you get to the end of the month and you haven't finished, which will have the least impact on the overall quality of the index? Dropping a bunch of PR3's or a bunch of PR8's?
<<If you get to the end of the month and you haven't finished, which will have the least impact on the overall quality of the index? Dropping a bunch of PR3's or a bunch of PR8's?>>
And to those that have 100K pages of useful content on a site, please sticky mail me as I have yet to find a site that is not all links that has that many pages :)
I am with you,
I am amazed at sites that contain such a high number of pages (100K), are not called Amazon or Microsoft, and still think they are adding something useful to the WWW. But then again, I would hate this forum not to be fully indexed, so maybe I am just jealous.
I would hope Google gives preference to crawling new sites, above a PR5 site adding 10K pages to a 100K page site.
It would be normal if Google - as WG implies - crawls higher-ranked sites first.
I would also hope they put in a crawling criterion at the page level.
That is, even if your index page has a PR of 8, this does not mean that your 98,897th page, with a PageRank of 1, deserves a full crawl every month over a newly submitted unranked site.
We're not the only nonprofit that has useful data. There's an enormous amount of data on the Deep Web in the form of databases. Almost all of them keep spiders out. Perhaps they're a bit more protective of their data than we are, or it's a government or university site that doesn't care about increasing traffic, or they can't afford to let a spider go nuts on their bandwidth. And most spiders wouldn't do the data justice anyway. It has to be the sort of data that produces a record when you do keyword searches in a search box. That's not always the case with databases.
Google started this, not us. In October 2000, I noticed that Google had sneaked past my cgi-bin disallow while following a handful of external links. That was the first indication I had that this one spider, at least, was actually capable of crawling and indexing dynamic files. I thought, "What would happen if I lifted the cgi-bin disallow?"
After nine months of high CPU loads during the Google crawl, even though Google wasn't getting as far as I'd like every month, I also began to enjoy the Google referrals from this new indexing. I thought, "What if other spiders could do this too?" So I dumped the records into static files. No other bot had ever done serious dynamic-file indexing apart from Google. In the 12 months since I decided to add static files, that has pretty much continued to be true.
I did it in stages. It was a good idea -- other bots were indeed interested in the static files, and there was no load problem at all when getting hit at a fast rate, compared to the load problem the dynamic files presented. Six months ago I converted the last group of records over to static files, and re-instituted a disallow on all cgi-bin.
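For reference, a disallow on all cgi-bin is a two-line robots.txt rule, and Python's standard library can confirm what it blocks. This is a minimal sketch: the rules shown are the generic form rather than a copy of our actual robots.txt, and the example.com URLs and the helper name are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Generic rule set: keep every compliant robot out of /cgi-bin/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
"""

def is_allowed(robots_txt, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

# Dynamic lookups are blocked; the static copies stay open to spiders.
dynamic_ok = is_allowed(ROBOTS_TXT, "Googlebot", "http://www.example.com/cgi-bin/lookup?name=smith")
static_ok = is_allowed(ROBOTS_TXT, "Googlebot", "http://www.example.com/names/smith.html")
```

Keep in mind this only tells you what a well-behaved spider should do; it does not stop one that ignores the file or reaches a URL through an external link.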
Now here's the point I'm trying to make: Google may well be the world's biggest and best bot, but it's also true, in my opinion, that PageRank functions as a straitjacket that prevents them from going after the Deep Web. I have met them more than halfway, and remember, they started this. I've made some progress, and have also been hit with a partial penalty for a partial mirror site, but in the end the jury is still out on my Great Google Experiment.
Our database has been on the Internet since early 1995 (on telnet), and on the Web since early 1996. Google started sniffing around in 1999. If our site had not been well-established by the time Google was crawling, we'd probably be a PR4 today instead of a high PR6. How far do you think Google would get into our data if we had a PR4?
What if another site that has useful data decided to remove their password restrictions and their robots.txt disallow, and make it available to spiders all of a sudden? It could take a long time to build up their PR if they appeared suddenly on the Web with open access.
During this long period, their PR would be low and the depth to which Google crawls their data would be comparatively shallow. If this database site hasn't dumped to static files, they'd probably have load problems as well. I think they'd be discouraged, and I can't think why they'd want to start down that road at all, unless they already have a healthy PageRank and can expect significant traffic from the Google referrals.
A PageRank system doesn't make any sense for crawling an integrated database. A site that is database driven, or dumped its records to static files from a database, should be considered as a totality. Google is so enamored of doing everything automatically with algorithms, that they are inadvertently overlooking the down side. That goes for PR0 linking penalties, duplicate page penalties that can involve outright removal of these pages from the mother site, and it also goes for crawling sites such as ours.
I think PageRank is holding Google back. It's expensive for Google (many recursive calculations), and it creates the illusion of rationality, when in fact some of the consequences of basing your engine on PageRank look very much like a case of the tail wagging the dog.
Your site may have unique info. But I still believe that Google should "not" crawl hundreds of thousands of pages of a site. A couple thousand, yes; tens of thousands, no. You can get the idea of what a company offers from its database, so there isn't any sense in crawling every page. I know that thought will get a lot of people upset at me, but it is realistic. I could tie up the market for almost any type of business/sites by just dumping phonebook databases including company names into a database and then into a website and creating dynamic/static content from it. I would have millions of "pages" - should I expect to get every page crawled?
I have been recently shopping for an MP3 player. I would like to see more pages listed (not less) when I search for specific items. If I put in a particular brand and model, the search engine should show me information pages about that brand and model and place where I can buy it.
I really don't get how you are thinking. Pages from a site with lots of good product pages with information about each product and a way to buy them are very valid search results.
There is absolutely no reason for Google to cache every page of every site when that could be thousands of pages. You mention you have thousands of products. What if Google crawled every page of amazon.com and buy.com - do you really think you would have a chance to compete?
If you cannot get Google to know what your site is about in a couple thousand pages, the extra 90K pages are not going to help.
If I had a site of public records, do you think it is Google's duty to spider every record?
Sure it's all unique content but not really ;)
But public records are valuable information. If I am researching obscure stuff, I hope to find it there. I believe Google intends to be comprehensive. Isn't that why they call it Google?
You can't really be arguing that no one source can have more than "a couple thousand" pages of useful information.
I am not saying at all that the site does not have useful information. What I am asking is this: if Google cached every page of the public records that are on file for local governments, how could a site that has the same records compete? Do a search for "Windows Information" on Google. Now what if Google cached every page of Microsoft? Who do you think would have the first 1000 pages of SERPs?
Would you really like to compete with this on a grand scale? Just because somebody dumps a database into a website does not mean that Google should crawl every page.
Each of my records is about a specific person, group, or corporation. Many, many searchers put the name of a specific person, group, or corporation into the Google search box. Google might even spit out a white pages listing when it responds, if it detects that it's a candidate for a white pages scan. If you put in a ten-digit U.S. telephone number, it will do a reverse lookup in the U.S. white pages. They aren't using the best collection of white pages data, but it's still useful.
Have you ever heard the term, "I googled him and found out..."? I think it was in a New York Times article about two people on a first date, and it turned out that each had already "googled" the other. It was a lightweight piece, and Google loved the publicity.
Name searches are powerful on Google if you know how to search, particularly when the name is not so common. I think this is one of the more significant contributions search engines have made to our society. For investigative journalists, Google is the first port-of-call. Over 95 percent of my Google referrals are zeroing in on a specific name, so I try to optimize that page (one page per name) for the name itself.
Google has even bragged about their ability in this respect. I believe it was Sergey who said that the first thing any employer might want to do when looking at a promising resume, is to "google" the person.
Google becomes a verb, as happened many years ago with Xerox, as in "I'll xerox a copy for you." This is like living in heaven for any public relations department in an aggressive company.
It seems to me that Doofus has the point correct. His point, if I may re-state it, is that Google's method of determining what pages to crawl and index is counter to their goal of indexing all quality relevant content because it sometimes causes large sites with good content to be under-indexed. To me, this seems to be a good and interesting point.
You countered with assertions that large sites should not be indexed simply because they have lots of pages. And your posts carry a strong suggestion that you think that any site having a large number of pages must be spam. That, it seems to me, misses the point.
I think the suggestion that any site with many pages must be spam is very wrong and ill-conceived.
Google and the search public both benefit from an algorithm that indexes all quality and relevant content, irrespective of whether it comes from a large or small site.
For the moment, this does not seem to be the case.
If you were Google, how would you choose what to spider frequently and index deeply and what not? Does choosing for higher pagerank plus new unindexed sites not make sense?