Google update May 20, 2001


Everyman - 1:08 am on May 22, 2001 (gmt 0)


All pages are in the form of:

www.XXXXX.org/cgi-bin/YYYY.cgi?AAAA_BBBB_CCCC
where AAAA_BBBB_CCCC is a proper name.

The XXXXX is always the same.

The YYYY alternates between two cgi programs. I've locked Google out of one of these by returning a "Server too busy," because all the names are already covered by the other cgi program.

The AAAA_BBBB_CCCC is always changing.
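For anyone following along, here is a minimal Python sketch of how a URL in that form splits into the cgi program and the name. The function name split_page_url is made up for illustration, and the placeholder URL is the one from the post, not a real address.

```python
from urllib.parse import urlsplit

def split_page_url(url):
    """Return (cgi_program, name) for a URL in the form described above.

    Hypothetical helper for illustration; not part of any site or crawler.
    """
    # urlsplit needs a scheme to find the host, so add one if it is missing.
    parts = urlsplit(url if "://" in url else "http://" + url)
    program = parts.path.rsplit("/", 1)[-1]  # last path segment, e.g. "YYYY.cgi"
    name = parts.query                       # everything after "?", e.g. "AAAA_BBBB_CCCC"
    return program, name

program, name = split_page_url("www.XXXXX.org/cgi-bin/YYYY.cgi?AAAA_BBBB_CCCC")
print(program, name)
```

Since both cgi programs take the same names, the name part alone would be enough to tell whether two links point at the same person.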

Each page returned from the above link has from several to several hundred additional links on it in the same form, but with new names in the links.
Each of these also links to a page with from several to several hundred links in the same form. The page itself is usually less than 50K bytes.

And so on, and so on. That's deep.

It would be possible to run out of names after 115,000 pages if: 1) Google got that far, and if: 2) Google could detect on the fly whether it already had that name, and if: 3) Google stopped asking for that second cgi program, which repeats the name and always comes back "Server too busy" because I've locked them out of this search, which returns a Java applet.

As it stands now, Google would actually have to get 230,000 pages to run out of names, assuming it can detect and skip duplicates on the fly. Half of these would be "Server too busy."
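The arithmetic behind those two figures, as a quick Python check. The variable names are mine; the numbers are the ones given above.

```python
NAMES = 115_000        # distinct names, per the post
CGI_PROGRAMS = 2       # each name is reachable through both cgi programs

total_requests = NAMES * CGI_PROGRAMS       # every name, once per cgi program
busy_responses = NAMES                      # the locked-out program answers "Server too busy"
useful_pages = total_requests - busy_responses

print(total_requests)                       # 230000
print(busy_responses / total_requests)      # 0.5
print(useful_pages)                         # 115000
```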

With six crawlers working at once, I don't think Google can detect duplicates on the fly, because I don't think the crawlers are talking to each other much, if at all. So I suspect that it's getting the same name several times, and these get purged later into just one page for each name. Very inefficient.
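To show why that matters, here is a toy Python sketch comparing crawlers that share one "already fetched" set against crawlers that each keep their own. Everything in it is an assumption for illustration: the 12-name link graph is invented, the crawlers run one after another rather than in parallel, and nothing here reflects how Google's crawlers actually work.

```python
from collections import deque

def crawl(links, seeds, shared_seen):
    """Breadth-first crawl, one toy 'crawler' per seed; returns total page fetches.

    links: dict mapping a name to the names linked from that name's page.
    shared_seen: True  -> all crawlers consult one set of fetched names;
                 False -> each crawler only knows what it fetched itself.
    """
    fetches = 0
    common = set()
    for seed in seeds:
        seen = common if shared_seen else set()
        queue = deque([seed])
        while queue:
            name = queue.popleft()
            if name in seen:
                continue          # duplicate detected on the fly, skip it
            seen.add(name)
            fetches += 1
            queue.extend(links.get(name, []))
    return fetches

# Invented toy graph: 12 names, each page linking to the next two names.
names = [f"name_{i}" for i in range(12)]
links = {n: [names[(i + 1) % 12], names[(i + 2) % 12]] for i, n in enumerate(names)}
seeds = names[:6]                 # six crawlers, six starting points

independent = crawl(links, seeds, shared_seen=False)
shared = crawl(links, seeds, shared_seen=True)
print(independent, shared)        # 72 12
```

In this toy case the six crawlers that don't talk to each other fetch every name six times over, and the extra copies would have to be purged afterward.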

Usually I end up with from 20,000 to 40,000 useful pages in the index before it quits. By "useful," I mean a page that isn't merely "Server too busy." (Actually these "too busy" pages aren't entirely useless, because the name is in the link and folks hit on it. It's just Google's cache copy that's useless.)

In all, Google often ends up with lots of obscure names when they ought to be going after the least obscure names.

That's why I'd just like to send them a CD-ROM once a year, with the best data, all laid out and Linux-ready per their specs. No response from Larry Page on this, and it's been six months.


Thread source: http://www.webmasterworld.com/google_archive/858.htm