
Sympathy for the Search Engines

     
8:49 pm on Aug 16, 2000 (gmt 0)

tedster, Senior Member


I've recently been pondering what the near and mid-range future may hold for web-wide search utilities. This was touched off when someone here posted a link to a study that indicates the total size of the web is currently nearing one trillion pages.

Now that's an impossible number to grasp. Even Google's one billion is way too big. A goal of "all the web, all the time" is a total pipe dream and will probably always be, no matter what the public misconceptions are -- or how much hype and spin gets tossed out to nourish those misconceptions.

I was speaking to the IT manager at a large e-commerce business that maintains nine well-known, interlinked sites. I learned that until a recent upgrade, building a full index for their database took five days. And one glitch meant starting over!

That's only maintaining a search engine for one business -- a minuscule slice of the total web. They've got (somewhat) predictable keywords. Tables with well-chosen fields and index keys. It took five days. This gives me serious pause when I think about what a web search engine is trying to deal with.

Plus, a site search engine has a major advantage -- no one is actively trying to fool its indexing. I just don't see how a general web search can be an effective model for very much longer -- it must already take weeks to do a build for a major engine! Petabytes of data -- even "googol-bytes", or whatever the next level up is called.

I'm amazed that any SE can ever list any new pages in a week or two. Even more amazed when Alta Vista can list something within a day or two -- they certainly seem to have one of the best infrastructures going, no matter what happens in their algorithm struggles.

I have no brainstorm on better ways to help people find their way around the web. But I'm sure that those ways are needed, and that this represents a business opportunity for those who can build a better system.

The current ingredients in "themes" indexing -- term vectors, clustering, hubs, authorities, etc -- these were developed several years ago by the academics in Information Science. And these methods are just now coming into application, years later. I have some connections to academia, but from what I hear, there is no quantum advance on the drawing boards. And that's what is needed -- a total paradigm shift.
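
To make one of those ingredients concrete: "term vectors" boil down to comparing weighted word counts between a page and a query. A toy sketch of the idea in C++ (my own illustration, assuming simple raw counts -- not any engine's actual code):

#include <cmath>
#include <iostream>
#include <map>
#include <string>

// Cosine similarity between two term-frequency vectors.
// Raw counts only -- a real engine weights terms (tf-idf and the
// like) and uses index structures that avoid touching every page.
double cosine(const std::map<std::string, double>& a,
              const std::map<std::string, double>& b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (const auto& [term, w] : a) {
        na += w * w;
        auto it = b.find(term);
        if (it != b.end()) dot += w * it->second;
    }
    for (const auto& [term, w] : b) nb += w * w;
    if (na == 0.0 || nb == 0.0) return 0.0;
    return dot / (std::sqrt(na) * std::sqrt(nb));
}

int main() {
    std::map<std::string, double> page  = {{"search", 3}, {"engine", 2}, {"web", 5}};
    std::map<std::string, double> query = {{"web", 1}, {"search", 1}};
    std::cout << cosine(page, query) << "\n";  // nearer 1 = more similar
}

The core comparison really is that simple -- the hard part is doing it across a billion pages without reading them all per query.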

So I have much sympathy. Trying to offer a good search tool for the HUGE data pile that makes up today's web -- AND make profit at the same time! It boggles my mind. In the long term, photonics may replace electronics and speed everything along. But that certainly won't be soon enough for the present challenges.

My best guess for the mid range future is that general, one-size-fits-all search engines will recede from usefulness as the web continues to mushroom. They will be replaced by partial, targeted databases -- regional, topical, etc. There will be a few big companies where hundreds of separate databases are maintained and you drill down through a "directory" to the one you want.

This brings me to a wish-list item I've been nursing -- I'd love to find a search engine that ONLY deals with frequently refreshed sites. Or maybe a regular SE with an option to exclude old results.

9:18 pm on Aug 16, 2000 (gmt 0)

rcjordan, Senior Member


>I'd love to find a search engine that ONLY deals with frequently refreshed sites. Or maybe a regular SE with an option to exclude old results.

Sorry Tedster, I'm already ahead of you on that -- can't shake me that easily. Had a script written in C++ that adds a character to every HTML file, or if the character is already there, deletes it. Runs from cron if necessary. Been doing it for almost two years now.
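
Roughly this shape, if you want the flavor of it -- from memory, not the actual source, and the marker and paths here are made up:

#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>

namespace fs = std::filesystem;

// Toggle a harmless marker at the end of every .html file so its
// size and modification date change on each run. Run from cron.
// (Reconstruction of the idea only; marker and path are invented.)
const std::string kMarker = "<!-- x -->\n";

int main() {
    for (const auto& entry : fs::recursive_directory_iterator("/var/www/html")) {
        if (!entry.is_regular_file() || entry.path().extension() != ".html")
            continue;

        std::ifstream in(entry.path());
        std::string body((std::istreambuf_iterator<char>(in)),
                         std::istreambuf_iterator<char>());
        in.close();

        if (body.size() >= kMarker.size() &&
            body.compare(body.size() - kMarker.size(), kMarker.size(), kMarker) == 0)
            body.erase(body.size() - kMarker.size());  // marker present: strip it
        else
            body += kMarker;                           // marker absent: append it

        std::ofstream out(entry.path(), std::ios::trunc);
        out << body;
    }
}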

11:26 pm on Aug 16, 2000 (gmt 0)

tedster, Senior Member


I'll bet that someone COULD create a search portal exclusively for frequently changed sites. It would need a trial period where newly admitted sites proved their bona fides, and then some ongoing human intervention for sites flagged by an algo as "not significantly changed", however that was defined.
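
However that flag ended up being defined, even a crude version is easy to imagine -- compare the words of two successive crawls and measure how much of the vocabulary actually differs. A toy illustration (mine alone, not any engine's method):

#include <iostream>
#include <set>
#include <sstream>
#include <string>

// Crude "significant change" test between two crawls of a page:
// split each snapshot into words and measure how much of the
// combined vocabulary differs. Purely illustrative -- a real portal
// would need markup stripping, shingling, and so on.
double changedFraction(const std::string& oldText, const std::string& newText) {
    std::set<std::string> a, b;
    std::istringstream sa(oldText), sb(newText);
    for (std::string w; sa >> w; ) a.insert(w);
    for (std::string w; sb >> w; ) b.insert(w);

    std::size_t common = 0;
    for (const auto& w : a)
        if (b.count(w)) ++common;

    std::size_t total = a.size() + b.size() - common;  // size of the union
    if (total == 0) return 0.0;
    return 1.0 - static_cast<double>(common) / total;  // 0 = identical, 1 = disjoint
}

int main() {
    double delta = changedFraction("old news story here", "fresh news story here today");
    std::cout << delta << "\n";
    if (delta < 0.05) std::cout << "flag: not significantly changed\n";
}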

If the spidering could keep up with the member sites' fresh content, it could become THE place to go for the very latest info and breaking news -- ever more so as the general SEs' freshness bogs down in the information flood.

I'm just not sure it's a viable business idea. Heck, I'm not convinced that ANY search engine or directory is a viable business idea.

12:10 pm on Aug 17, 2000 (gmt 0)

GWJ, Full Member


>Had a script written in C++ that adds a character to every HTML file, or if the character is already there, deletes it.

Hey rcjordan,

I have been doing this manually (gear head, not a programmer :(). You wouldn't be interested in sharing this, would you?

Brian

2:11 pm on Aug 17, 2000 (gmt 0)

rcjordan, Senior Member


>I'm just not sure it's a viable business idea. Heck, I'm not convinced that ANY search engine or directory is a viable business idea

They aren't... but you are already going down that road in another thread.

I posted that script bit to show that practically any move an SE makes has a counter-move. Some, like this one, are deployed before some of the SEs even put it in the algo.

>you wouldn't be interested in sharing
GWJ, I would, but I don't have it "readily available" -- that's not a dodge; you'd have to understand how I work. I have scripts written, installed, and maintained by the staff supporting my dedicated server. With many -- like this one -- I never even bother to get a copy of the source, because I tend to migrate through scripts at a fairly fast pace. The one mentioned here is already obsolete, because I'm moving to writing static pages from a "build" script -- every time I run the build, the file date changes. Oh, Tedster, this runs on templates, so I can be sure the "recently updated" page is "significantly changed, however that was defined."
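
The build step is nothing exotic -- roughly this shape (a sketch of the idea only, not my actual script; the paths and the {{UPDATED}} placeholder are invented):

#include <ctime>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

// Rebuild a static page from a template, stamping in fresh content
// so the output genuinely differs (and the file date updates) on
// every run. Sketch of the general shape only.
int main() {
    std::ifstream tmpl("templates/index.tmpl");
    std::stringstream buf;
    buf << tmpl.rdbuf();
    std::string page = buf.str();

    std::time_t now = std::time(nullptr);
    char stamp[64];
    std::strftime(stamp, sizeof stamp, "%Y-%m-%d %H:%M", std::localtime(&now));

    // Replace every occurrence of the placeholder with the timestamp.
    const std::string placeholder = "{{UPDATED}}";
    for (std::size_t pos; (pos = page.find(placeholder)) != std::string::npos; )
        page.replace(pos, placeholder.size(), stamp);

    std::ofstream out("htdocs/index.html", std::ios::trunc);
    out << page;
}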

9:13 pm on Sept 2, 2000 (gmt 0)

brett_tabke, Administrator


I have to agree entirely with the original thoughts Tedster put forth. I was asked earlier this year to build a search engine for 800 content-rich sub-domains (400k pages / 3 gigs). I got it done using a mix of SQL and Perl. However, we maxed the machine out to the point (numerous 1-gig databases) that it was too slow. Tried a faster, top-of-the-line server (700MHz P3 / 1 gig RAM / SCSI) and response time was still around 8 seconds per search (75% too slow). Add in the horror of maintenance on a daily basis, constant spidering and babysitting -- we scrapped the whole thing as it became clear it would be a full-time job for two people to maintain. We just couldn't see anything short of hiring a full-time db programmer as a solution.
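
In hindsight, the textbook fix for that per-search cost is an inverted index: build the term-to-page map once at index time, so a query becomes a lookup instead of a scan. A toy in-memory sketch (illustration only -- nothing like our actual SQL/Perl setup):

#include <iostream>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Toy inverted index: pay the cost once at build time, then each
// query is a cheap map lookup instead of rescanning every page.
int main() {
    std::vector<std::string> pages = {
        "cheap widget reviews",      // page 0
        "widget repair guide",       // page 1
        "travel guide and reviews",  // page 2
    };

    // Index build: term -> set of page ids containing it.
    std::map<std::string, std::set<int>> index;
    for (int id = 0; id < static_cast<int>(pages.size()); ++id) {
        std::istringstream words(pages[id]);
        for (std::string w; words >> w; )
            index[w].insert(id);
    }

    // Query time: one lookup per term.
    for (int id : index["widget"])
        std::cout << "match: page " << id << "\n";  // pages 0 and 1
}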

9:52 pm on Sept 2, 2000 (gmt 0)

rcjordan, Senior Member


I don't think this is what Tedster was wishing for, but it is the answer. (quotes reordered)
in the short run, the answer for Yahoo -- the highest-volume directory and therefore the one under the most strain -- lays with the oldest of economic tools: the price system...

...in fact the crucial breaking point came with search engines, which is what this summer's news indicates. And in retrospect, we should have seen it coming.

--article here [thestandard.com]

>we should have seen it coming.
But we DID see it coming... waaAaay before, in this July 2, 2000 thread: the ethics of a search return page [webmasterworld.com]. Everyone just keeps hoping that it's going to go away, but guess what...