There have been more discussions here, with some people being very sceptical.
I have to say it's a pretty risky path for LS to take. In theory this is a revolutionary approach to indexing the web.
In a way I just wish it had been an organisation like dmoz or some other non-commercial entity that came up with the idea.
Essentially there are two main factors that will decide whether it's going to be a flop or a hit:
- the volunteer-driven nature combined with the commercial intent
- the problem of producing a high-quality frontend (i.e. the ranking algo) for the mass of data collected
What is the real problem in running a search engine? It's the indexing, especially the part where the link structure of the web (or more precisely: the subgraph of the web that the engine has visited) comes into play. It isn't that easy to parallelize this task, and it's even more difficult to do it across a set of distributed computers.
IMHO, the Grub concept has parallelized the part of a search engine that doesn't need to be parallelized.
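To make that concrete, here is a minimal PageRank-style power iteration in Python - purely an illustrative sketch, not Grub's or LookSmart's actual algorithm. Note that every pass needs the whole link graph at hand, which is exactly the part that is hard to farm out to volunteer machines; the fetching that Grub distributes is the easy part.

# Minimal PageRank-style power iteration (illustrative sketch only).
# Every iteration walks the entire link graph, which is why this stage is
# far harder to distribute than the raw page fetching.

def pagerank(links, damping=0.85, iterations=20):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {dst for dsts in links.values() for dst in dsts}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for src, dsts in links.items():
            if not dsts:
                continue  # dangling pages simply leak rank in this sketch
            share = damping * rank[src] / len(dsts)
            for dst in dsts:
                new_rank[dst] += share
        rank = new_rank
    return rank

if __name__ == "__main__":
    demo = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    print(pagerank(demo))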
Most worrying, he said, would be the ability to hack the system in order to promote certain sites.
From the article - Danny Sullivan has a pretty good grasp of it (perhaps he's been following along here too? :) )
And, Andre Stechert is a member here. :) Nice to see fellow WebmasterWorld members get some good publicity.
How hard can it be to program a spider to read and comprehend this trivially straightforward file?
Cheers,
Andre/the Grub team
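For what it's worth, assuming the "trivially straightforward file" above is robots.txt, reading it really is trivial - Python's standard library even ships a parser. A quick sketch (hypothetical, not the actual Grub client code; the URL is just a placeholder):

from urllib import robotparser

# Check a host's robots.txt before fetching anything else from it.
rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

if rp.can_fetch("grub-client", "http://www.example.com/index.html"):
    print("allowed to fetch")
else:
    print("disallowed - skip this URL")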
If you dig through this forum, you'll see that Kord Campbell, the guy behind the project, even posted here regarding this issue some time ago.
Seems that failure to listen to feedback continues...though to be honest, it seems that they are trying -> listening is the first step.
Action, then, should be the next step. After all, we wouldn't want to see LookSmart getting bad press over all this - of the kind their move to PPC caused, which definitely created a 'rift' between them and the webmaster community as a whole.
This, then, could be 'their chance' to forge a relationship that might be to the betterment of everybody.
I also noticed that between noon Sunday and noon Monday, grub clients retrieved my index.html file exactly 2,300 times. Yow! I'll cut them some slack on this matter because they're probably experiencing some growing pains right now, but that's trying to keep things a little TOO fresh, if you ask me.
[Andre: The grub client has now been 403'ed from the site in my profile for the time being, but if you want to make a manual check, your comments on either of these issues would be welcome.]
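For reference, 2,300 fetches of one file in 24 hours works out to roughly one every 38 seconds. A polite crawler would normally enforce some per-host delay; here is a hypothetical sketch of that idea (not Grub's actual scheduler - and with thousands of independent clients, the throttling would have to be coordinated centrally to do any good):

import time

# Hypothetical per-host politeness throttle: never hit the same host more
# than once every `min_delay` seconds from this client.
class HostThrottle:
    def __init__(self, min_delay=30.0):
        self.min_delay = min_delay
        self.last_fetch = {}  # host -> timestamp of the last request

    def wait_and_mark(self, host):
        now = time.monotonic()
        last = self.last_fetch.get(host)
        if last is not None and now - last < self.min_delay:
            time.sleep(self.min_delay - (now - last))
        self.last_fetch[host] = time.monotonic()

throttle = HostThrottle(min_delay=30.0)
for host in ["example.com", "example.com"]:
    throttle.wait_and_mark(host)
    # ... fetch the next URL from this host here ...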
As I've said before, we're 100% behind sorting out any of these problems. I've taken a look at your user profile so I now know what site you're at and I'm checking into the problem. Will reply when your site has been updated.
Cheers,
Andre
p.s. Please forgive if it takes until tomorrow...new servers came online, but the new clients ate all the additional capacity...we had a bunch more boxes on order that were supposed to show up today, but !#$@ you know how that goes.
Google probably has the most efficient advertising and PR of any of the search engines. Its ranking algorithm gave a relevance to the results that was lacking in most of its competition. The massive resources it threw at the problem also turned out to be a more efficient way of tackling it, because of the lower startup costs of n thousand servers running an Open Source operating system. Altavista, on the other hand, used a closed source operating system and concentrated its servers. The difference in internet generations was critical: Altavista was advanced for its time, but it was also a product of that time, in that it used an approach that worked and it used in-house technology.

The one killer for any search engine is monetizing the results - making the search engine pay. Altavista's approach was to market search appliances in much the same way as Google does, and to try to sell advertising. Google's approach has been integrated at a far lower level, so that the advertising on the results page is related (often closely) to the actual results.
Speaking purely from a background in codebreaking, flinging loads of resources at a problem and hoping for a breakthrough is not the best approach. You have to use your resources efficiently and choose the right point of attack. The distributed Grub crawler is an interesting idea, but it would generate a lot of data. Hoping that analysis and a good engine will give that data relevance for the SE user is something else entirely.
Regards...jmcc
Not sure I agree with your analysis of Google's success, though. IMO it had little to do with advertising or public relations, and less to do with Open Source vs. proprietary technology. Google took over the world because
(1) they offer fast, bloat-free search results, and
(2) they have better search results than everybody else, probably due in large part to factoring in pagerank and link anchor text.
I don't think that our reasons for Google's success are too far apart, Jomaxx.
The Open Source vs closed source operating system and the PageRank ordering of the results facilitated 1 and 2 above. The fifty thousand or so servers that Google runs all use Red Hat Linux, apparently. The cost of running these on a closed source OS would be, to say the least, expensive. The cost of technology had also fallen dramatically in the period between the start of Altavista and the start of Google. However, advertising and public relations are very important aspects that are often overlooked. Running a search engine is running a business. The business has to make money. Under the nice happy-clappy exterior of Google lurk hardnosed business people. Most of the SEs that failed in the last few years failed because they did not monetize the results in the same effective fashion as Google. Google's fast, bloat-free results and the quality of those results gave it that critical edge.
The one thing that bothers me about the Looksmart/Grub approach, and indeed all crawler approaches, is that they all seem to follow the Christopher Columbus school of discovery. They don't know where they are going. They don't know where they are when they get there. And they don't know where they've been when they get back. Microsoft seemed to cop on to a very important aspect of the web: a lot of it does not change on a day to day (or even month to month) basis. If the Looksmart/Grub crawler managed to identify this fresh web core, then it would have a unique selling point.
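Spotting the part of the web that hasn't changed is cheap if the crawler keeps a checksum per URL (or honours HTTP conditional GET with If-Modified-Since). A rough sketch of the checksum idea - illustrative only, not anything Grub or Microsoft is known to do:

import hashlib
import urllib.request

# Keep a content hash per URL and only treat a page as "fresh" when the hash
# changes between crawls. (A real crawler would also use If-Modified-Since.)
seen_hashes = {}  # url -> hex digest from the previous crawl

def has_changed(url):
    with urllib.request.urlopen(url) as resp:
        body = resp.read()
    digest = hashlib.sha1(body).hexdigest()
    changed = seen_hashes.get(url) != digest
    seen_hashes[url] = digest
    return changed

Pages whose hash never changes can be recrawled far less often, which is one way of separating the fresh core from the static bulk.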
Regards...jmcc
To distribute the human-powered task of describing the pages on the web, a completely new paradigm is required, one which enables the web page owners to organize and describe the web themselves. A few hundred hired editors will not suffice for this immense task, something which management at Looksmart and other major directories has probably long realized.
Grub is certainly an interesting project but it will not solve the hard part of indexing the web. It is not primarily the data collection that needs to be distributed but rather the much more cumbersome task of describing the collected web pages in terms of their subject and geographical location.
This is the problem I have been working on for some time now. The scary thing is that some countries can have between 10% and 70% of their associated websites hosted outside their IP space. Thus for com/net/org/info, a significant number of each country's websites are excluded from the 'pages from $country' searches. The quick and nasty solution used by most of the bigger search engines has been to limit the search indices to the relevant country ccTLD (e.g. .ie, .uk, .de) and exclude the com/net/org domains. Recently, Google and some of the others have used a simplistic IP-based mapping to improve results. However, that still does not solve the problem. The only efficient way to solve it is to build a complex usage model for each individual country rather than trying for one huge index.
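To give a rough idea of the shape of such a model (a simplified sketch with made-up signals and weights, not the actual model): score each domain/country pair on several weak signals and attribute the domain to the country with the best score, rather than trusting the ccTLD or the hosting IP alone.

# Simplified country-attribution scorer. Weights are invented for illustration.
SIGNAL_WEIGHTS = {
    "cctld_matches": 0.35,         # e.g. .ie, .uk, .de in the domain name
    "hosting_ip_in_country": 0.20,
    "registrant_address": 0.25,    # whois/contact address in the country
    "inlinks_from_country": 0.20,  # links from sites already attributed there
}

def country_score(signals):
    """signals: dict of signal name -> bool for one domain/country pair."""
    return sum(w for name, w in SIGNAL_WEIGHTS.items() if signals.get(name))

# A .com hosted in the US but registered in Ireland and linked mostly from
# Irish sites still scores 0.45 for Ireland - exactly the case that a
# ccTLD-only or IP-only filter misses completely.
example = {
    "cctld_matches": False,
    "hosting_ip_in_country": False,
    "registrant_address": True,
    "inlinks_from_country": True,
}
print(country_score(example))  # 0.45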
Regards...jmcc
<meta name="ICBM" content="XX.XXXXX, XX.XXXXX">
<meta name="DC.title" content="THE NAME OF YOUR SITE">
or
<meta name="geo.position" content="latitude;longitude">
However, whether the information someone gave was valid or not could cause problems.
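For what it's worth, pulling those coordinates back out of a page is straightforward; a small sketch using Python's standard-library HTML parser (illustrative only, with example coordinates):

from html.parser import HTMLParser

# Extract ICBM / geo.position coordinates from a page's meta tags.
class GeoMetaParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.coords = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        content = attrs.get("content")
        if name in ("icbm", "geo.position") and content:
            # ICBM uses "lat, long"; geo.position uses "lat;long".
            parts = content.replace(";", ",").split(",")
            if len(parts) == 2:
                self.coords = (float(parts[0]), float(parts[1]))

parser = GeoMetaParser()
parser.feed('<meta name="ICBM" content="53.34981, -6.26031">')
print(parser.coords)  # (53.34981, -6.26031)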
Couldn't this be overcome to a certain extent by taking into account the ICBM coordinates for people that have registered .com's etc., rather than the relevant country ones? Is this what you mean by building a complex usage model?
Not exactly, Jrobbio.
The complex model tends to identify websites hosted outside of a particular country rather than relying on the webmaster to do it. With the ICBM method there is too high a margin of error, and even if it became a standard, a lot of webmasters would not include it anyway. The complex model uses a lot of sources and analysis to determine whether a domain 'belongs' to a specific country. The ICBM method would be a lot easier, but I'd hate to think what form the DoS attacks would take. ;)
Regards...jmcc