Very, very interesting article. Nicely done and very understandable explanations by Matt Wells.
|I do not think Google's results are the best anymore, but the other engines really are not offering enough for searchers to switch. |
Fascinating article altogether.
<Added> I'll be happy for him when a big buyout or venture capital comes his way. If anyone deserves it...
[edited by: Freedom at 6:10 am (utc) on April 17, 2004]
Indeed a good article. Thanks for the link here.
Some excellent information in there.
Especially informative for people like me :)
Great article... fascinating at many levels, from nuts and bolts to visionary. It gives me added respect for Matt Wells, who was already kind of a hero.
I thought that this was a particularly intriguing observation...
|I would suppose that the amount of information stored on the Internet is around the level of the adult human brain. Now we just need some higher-order functionality to really take advantage of it. At one point we may even discover the protocol used in the brain and extend it with an interface to an Internet search engine. |
This brings to mind a related observation by Albert-Laszlo Barabasi [webmasterworld.com] that (paraphrasing) "the human body is an example of a dynamic network that has evolved over millions of years. The internet is less than 30 years old."
I thought the interview was rather amateurish, myself (Come on, "tell a little about your background"? That's the textbook example of a lame question.), but I saw one interesting thing: Matt's apparently changed his mind about XML. He used to think it was "bloated and ugly", but now he thinks it will replace SQL.
Anyway, if anyone cares, Slashdot found the interview, and many posters there think the interview is all hype.
The interview was amateurish because you have to look at who was doing the interview. He has never been known for his professional journalism skills. Rather, something else.
Great article - I hope Matt can get some deal arranged with someone and give Google a real competitor.
Interesting opinions on the Death of Page Rank - the problem of not using PR is that we go back to the keyword-laden domain names scenario... not pretty.
It's actually quite interesting to compare Gigablast with other 'up and coming' engines like Nutch.
|Our current goal is to create a good-sized public demo that can handle moderate traffic. Even this takes a fair amount of hardware and bandwidth. Fortunately, the Internet Archive has donated bandwidth, so all that we need now is hardware. We estimate that a two-hundred-million page demo system that can handle moderate traffic will require less than $200,000 in hardware. |
|Gigablast is a search engine that I've been working on for about the last three years. I wrote it entirely from scratch in C++. The only external tool or library I use is the zlib compression library. It runs on eight desktop machines, each with four 160-GB IDE hard drives, two gigs of RAM, and one 2.6-GHz Intel processor. It can hold up to 320 million Web pages (on 5 TB), handle about 40 queries per second and spider about eight million pages per day. Currently it serves half a million queries per day to various clients, including some meta search engines and some pay-per-click engines. |
and according to www.gigablast.com, Gigablast has
|273,661,136 pages indexed |
I know which one will have the lower overhead cost structure - Gigablast is already over 200 million pages running on 8 PCs!
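For anyone who wants to check the hardware figures from the interview quote, here is a rough back-of-the-envelope sketch. It only uses the numbers Matt gave (8 machines, four 160-GB drives each, up to 320 million pages); treating a gigabyte as 10^9 bytes is my assumption, and the variable names are just for illustration.

```python
# Figures quoted in the interview excerpt above.
machines = 8
drives_per_machine = 4
drive_gb = 160
max_pages = 320_000_000

# Raw disk across the whole cluster, in gigabytes.
total_gb = machines * drives_per_machine * drive_gb

# Average disk budget per page, assuming decimal gigabytes (10**9 bytes).
bytes_per_page = total_gb * 10**9 / max_pages

print(total_gb)               # 5120
print(round(bytes_per_page))  # 16000
```

So the "5 TB" in the quote lines up with 32 drives of 160 GB, and the stated maximum works out to roughly 16 KB of disk per page for everything stored about it.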
I'm not convinced this is a fair comparison.
The use of the word "moderate" on Nutch's site is ambiguous. I meant several million searches per day (~50/second peak). One can also use Nutch to build a 200M page web search engine for less than $10,000, but it probably wouldn't be able to handle more than a few queries per second. I'll clarify that on the website.
Collecting and searching 200M pages is not very expensive. What's expensive is handling lots of traffic.
I have no idea how much traffic Gigablast sees, but I'd be surprised if they're handling millions of searches per day over 200M pages on just $8,000 of hardware.
Hello cutting, welcome to WebmasterWorld.
|I have no idea how much traffic Gigablast sees, but I'd be surprised if they're handling millions of searches per day over 200M pages on just $8,000 of hardware. |
Well, most of the answers are in the message just before your posting
Handling 40 queries per second for a whole day would result in handling about 3.5 million queries a day. I'm not sure if Gigablast can really handle 40 queries per second for 24 hours, but even at 60% of that peak load it would still be able to handle more than 2 million queries per day. BTW, Gigablast is now searching 321 million pages, which is about the maximum on the 5 TB Matt Wells mentioned in the interview.
- Currently Gigablast serves half a million queries per day
- Gigablast can handle about 40 queries per second
500k queries a day is roughly 6 queries per second.. not too shabby.
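The conversions being thrown around in this thread are easy to check. A minimal sketch, assuming traffic is spread evenly over a 24-hour day (real traffic has peaks, so this is only a sanity check); the function names are made up for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def queries_per_day(qps, load=1.0):
    """Daily query volume at a sustained rate, scaled by a load fraction."""
    return qps * load * SECONDS_PER_DAY


def queries_per_second(per_day):
    """Average rate implied by a daily total, assuming even distribution."""
    return per_day / SECONDS_PER_DAY


full_load = queries_per_day(40)         # 40 q/s all day: about 3.5 million
sixty_pct = queries_per_day(40, 0.6)    # 60% of that: a bit over 2 million
current_rate = queries_per_second(500_000)  # half a million a day: just under 6 q/s

print(round(full_load), round(sixty_pct), round(current_rate, 2))
```

In other words, the half a million queries per day that Gigablast currently serves is only about a seventh of the 40 queries per second it is said to handle.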
6 queries per second and 250 million pages is comparable to Nutch/Lucene on 10 servers if you tweak the configuration well enough. You have to figure that a good percentage of the queries can be cached by the OS based on trends, as not every query is unique.
However, 40 queries per second seems a little high. I can squeeze that through Nutch running a benchmark program based on dictionary terms (as they would be cached, "warmed up", over time).
When you do 40 queries per second, is that single-term or with joins, an "and" or "or" type query?