Now that's an impossible number to grasp. Even Google's one billion is way too big. A goal of "all the web, all the time" is a total pipe dream and will probably always be, no matter what the public misconceptions are -- or how much hype and spin gets tossed out to nourish those misconceptions.
I was speaking to the IT manager at a large e-commerce business that maintains nine well-known, interlinked sites. I learned that until a recent upgrade, building a full index for their database took five days. And one glitch meant starting over!
That's only maintaining a search engine for one business -- a minuscule slice of the total web. They've got (somewhat) predictable keywords, and tables with well-chosen fields and index keys. It still took five days. This gives me serious pause when I think about what a web search engine is trying to deal with.
Plus, a site search engine has a major advantage -- no one is actively trying to fool their indexing. I just don't see how a general web search can be an effective model for very much longer -- it must already take weeks to do a build for a major engine! Petabytes of data -- even "googol-bytes", or whatever the next level up is called.
I'm amazed that any SE can ever list any new pages in a week or two. Even more amazed when Alta Vista can list something within a day or two -- they certainly seem to have one of the best infrastructures going, no matter what happens in their algorithm struggles.
I have no brainstorm on better ways to help people find their way around the web. But I'm sure that those ways are needed, and that this represents a business opportunity for those who can build a better system.
The current ingredients in "themes" indexing -- term vectors, clustering, hubs, authorities, etc. -- were developed several years ago by the academics in Information Science. And these methods are just now coming into application, years later. I have some connections to academia, but from what I hear, there is no quantum advance on the drawing boards. And that's what is needed -- a total paradigm shift.
So I have much sympathy. Trying to offer a good search tool for the HUGE data pile that makes up today's web -- AND make a profit at the same time! It boggles my mind. In the long term, photonics may replace electronics and speed everything along. But that certainly won't be soon enough for the present challenges.
My best guess for the mid-range future is that general, one-size-fits-all search engines will recede from usefulness as the web continues to mushroom. They will be replaced by partial, targeted databases -- regional, topical, etc. There will be a few big companies where hundreds of separate databases are maintained and you drill down through a "directory" to the one you want.
This brings me to a wish-list item I've been nursing -- I'd love to find a search engine that ONLY deals with frequently refreshed sites. Or maybe a regular SE with an option to exclude old results.
Sorry Tedster, I'm already ahead of you on that, can't shake me that easily. Had a script written in C++ that adds a character to every html file, or if the character is already there, deletes it. Runs on the cron if necessary. Been doing it for almost two years now.
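For readers who want to see the idea, here is a minimal sketch of that kind of toggle in Python rather than C++. It is not the script mentioned above; the function names and the choice of a trailing space as the marker character are assumptions made for illustration only.

```python
from pathlib import Path

# Assumption: a single trailing space is the "character" being toggled.
MARKER = b" "


def toggle_marker(path, marker=MARKER):
    """Append the marker if the file does not end with it, else strip it.

    Either way the file is rewritten, so its modification time changes.
    Returns True if the marker is present after the call.
    """
    data = path.read_bytes()
    if data.endswith(marker):
        path.write_bytes(data[: -len(marker)])
        return False
    path.write_bytes(data + marker)
    return True


def toggle_tree(root):
    """Toggle the marker on every .html file under root; return the count."""
    count = 0
    for page in sorted(Path(root).rglob("*.html")):
        toggle_marker(page)
        count += 1
    return count
```

Calling toggle_tree on a document root from a scheduled job would touch every page on each run. Whether a spider treats a changed file date as "fresh" content is exactly the open question in this thread.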
If the spidering could keep up with the member sites' fresh content, it could become THE place to go for the very latest info and breaking news -- ever more so as the general SEs' freshness bogs down in the information flood.
I'm just not sure it's a viable business idea. Heck, I'm not convinced that ANY search engine or directory is a viable business idea.
Hey Rc jordan,
I have been doing this manually (gear head not a programmer :(). You wouldn't be interested in sharing this, would you?
Brian
They aren't ..but you are already going down that road in another thread.
I posted that script bit to show that practically any move an SE makes has a counter-move. Some, like this one, are deployed before some of the SEs even put it in the algo.
>you wouldn't be interested in sharing
GWJ, I would, but I don't have it "readily available". That's not a dodge; you'd have to understand how I work. I have scripts written, installed, and maintained by the staff supporting my dedicated server. Many, like this one, I never even bother to get a copy of the source for, because I tend to migrate through scripts at a fairly fast pace. The one mentioned here is already obsolete because I'm moving to writing static pages from a "build" script -- every time I run the build, the file date changes. Oh, Tedster, this runs on templates so that I can be sure the "recently updated" page is "significantly changed, however that was defined."
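A rough sketch of what a template-driven build like that might look like, again in Python. Everything here is hypothetical: the template text, the field names, and the build_site function are invented for illustration and are not the build script described above. The only point is that every run rewrites each page, so both the file date and the stamped content change.

```python
from datetime import datetime, timezone
from pathlib import Path
from string import Template

# Hypothetical page template; $built is stamped fresh on every build.
PAGE_TEMPLATE = Template(
    "<html><head><title>$title</title></head>\n"
    "<body>$body\n<p>Built: $built</p></body></html>\n"
)


def build_site(pages, out_dir, now=None):
    """Render every page from the template into out_dir.

    pages maps a file name to a dict with 'title' and 'body' keys.
    Returns the list of paths written.
    """
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y-%m-%d %H:%M UTC")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, fields in pages.items():
        html = PAGE_TEMPLATE.substitute(fields, built=stamp)
        target = out_dir / name
        target.write_text(html, encoding="utf-8")
        written.append(target)
    return written
```

A build timestamp alone may or may not count as a "significant" change to any given engine; the template is where larger rotating content would go.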
in the short run, the answer for Yahoo – the highest-volume directory and therefore the one under the most strain – lays with the oldest of economic tools: the price system... -- article here [thestandard.com] ... in fact the crucial breaking point came with search engines, which is what this summer's news indicates. And in retrospect, we should have seen it coming.
>we should have seen it coming.
But we DID see it coming... waaAaay before this July 2, 2000 thread, the ethics of a search return page [webmasterworld.com]. Everyone just keeps hoping that it's going to go away, but guess what...