Forum Moderators: bakedjake


LookSmart Tries Again

"It will be the first comprehensive index (of the Net),"


Robino

1:53 pm on Apr 17, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



[wired.com...]

Mike12345

2:01 pm on Apr 17, 2003 (gmt 0)

10+ Year Member



Nice find, Robino.

Sounds like a novel concept; I wonder if it will take off?

Any thoughts, anyone? I quite like the idea of it, but I won't be participating!

:)

Macguru

2:04 pm on Apr 17, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



[webmasterworld.com...]

A site search for "grub" will show 45 other threads about it. ;)

heini

2:08 pm on Apr 17, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Actually, WebmasterWorld ran the original story on the Grub acquisition and the subsequent plans:
[webmasterworld.com...]

There have been more discussions here, with some people being very sceptical.

I have to say it's a pretty risky path for LS to take. In theory this is a revolutionary approach to indexing the web.
In a way I just wish it had been an organisation like dmoz or some other non-commercial entity that came up with the idea.

Essentially there are two main factors that will decide whether it's a flop or a hit:
- the volunteer-driven nature combined with the commercial intent
- the problem of producing a high-quality frontend (i.e. ranking algo) for the masses of data collected

volatilegx

2:19 pm on Apr 17, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The grub spider is wonderful if you want the same file spidered over and over and over and over...

The distributed spider concept is a good one but I'm not sure they've gotten it right.

Fischerlaender

3:26 pm on Apr 17, 2003 (gmt 0)

10+ Year Member



Why are there projects like Seti@Home? Because the combined power of millions of PCs out there gives a greater chance of solving a really big mathematical problem.

What is the real problem in running a search engine? It's the indexing, especially the part where the link structure of the web (or more precisely: the subgraph of the web that the engine has visited) comes into play. It isn't that easy to parallelize this task, and it's even more difficult to do it across a set of distributed computers.

IMHO, the Grub concept has parallelized the part of a search engine that doesn't need to be parallelized.
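Just to illustrate the point, here is a rough sketch (Python, with placeholder URLs -- obviously not Grub's code) of why the fetching side is trivially parallel while the link analysis is not:

# Hypothetical sketch: fetching is "embarrassingly parallel",
# link-graph analysis is not. URLs are placeholders.
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

seed_urls = ["http://example.com/", "http://example.org/"]

def fetch(url):
    # Each fetch is independent -- any volunteer machine could do it.
    with urlopen(url, timeout=10) as resp:
        return url, resp.read()

# Hand the URLs to any number of workers; no coordination is needed.
with ThreadPoolExecutor(max_workers=8) as pool:
    pages = dict(pool.map(fetch, seed_urls))

# By contrast, computing anything from the link structure (PageRank-style
# scores, for instance) needs the whole collected graph assembled in one
# place before iteration can even start -- that is the hard part to
# distribute.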

jeremy goodrich

4:03 pm on Apr 17, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Most worrying, he said, would be the ability to hack the system in order to promote certain sites.

From the article - Danny Sullivan has a pretty good grasp of it (perhaps he's been following along here too? :) )

And, Andre Stechert is a member here. :) Nice to see fellow WebmasterWorld members get some good publicity.

charge

1:34 am on Apr 18, 2003 (gmt 0)



Anthony Rowlston of Microsoft was quoted in a very recent article in New Scientist magazine on Grub as saying that, technically, he sees nothing wrong with the project. He sees the main drawback as getting people to become clients. That would be easily fixed if MSN did a promo, I would think.
Cheers, charge.

jeremy goodrich

1:59 am on Apr 18, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>>MSN do promo?

They just ramped up their own search team, internally. Why on earth would they do such a thing?

charge, we're discussing the Wired article. Please, if you would like to discuss the New Scientist article -> start a new thread. :)

Cheers.

charge

9:56 pm on Apr 18, 2003 (gmt 0)



1,310 clients and 69 million URLs crawled in the last 24 hours. Seems someone likes the project. Interesting that Google commented on Grub in the Wired article; they obviously find it of interest.
Cheers,
charge!

Fischerlaender

8:18 am on Apr 19, 2003 (gmt 0)

10+ Year Member



From the article:
To block attempts to spam or spoof the index, (...) the same work is given to several volunteers.

This means that Grub generates a multiple of the traffic that would otherwise be needed -- if each URL is handed to, say, three clients for verification, a site sees roughly three times the fetches. I'm not sure I'd like this robot to visit my sites on a daily basis.

jomaxx

7:06 pm on Apr 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Grub is another robot (like Microsoft's new prototype) that I have noticed is ignoring my robots.txt file. In fact, it spidered 7,000 of my pages over the past 2 days without a single GET of the robots.txt.

How hard can it be to program a spider to read and comprehend this trivially straightforward file?
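For the record, a minimal sketch (Python's standard robotparser module, purely illustrative with a placeholder domain and user agent -- not what any of these spiders actually run) of everything it takes:

# Minimal sketch: fetch robots.txt once and check URLs against it.
# The domain and user agent here are placeholders, not a real site.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()  # one GET of robots.txt

# Check each candidate URL against the site's rules before fetching it.
if rp.can_fetch("ExampleCrawler/1.0", "http://example.com/cgi-bin/app.exe"):
    print("allowed to fetch")
else:
    print("disallowed -- skip this URL")

One request, a few lines of parsing, and the disallowed directories stay untouched.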

stechert

8:28 pm on Apr 21, 2003 (gmt 0)

10+ Year Member



yo -- the grub robots.txt spidering is kicked off by us periodically and is a bit labor-intensive. We've been using URL lists that are pre-filtered for robots.txt (from a crawl about 2-3 weeks ago). It's entirely possible that robots.txt files have been updated in the meantime...so we're planning to kick off another robots.txt run (we've been busy with the number of clients growing like crazy -- now by a factor of 40 in the last 4 weeks), but we also have a mechanism in place to allow you to do a real-time update of our record of your robots.txt file. It's on the grub home page under Tools and is called "Robots refresh". Check it out, or private message me with your domain name(s) and I'll do it for you.

Cheers,
Andre/the Grub team

jomaxx

9:01 pm on Apr 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Andre, the directories where my executables reside have been disallowed for 4 years.

jeremy goodrich

9:12 pm on Apr 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



jomaxx, unfortunately, Grub has had these problems since inception -> way before Andre & LookSmart got involved.

If you dig through this forum, you'll see that Kord Campbell, the guy behind the project, even posted here regarding this issue some time ago.

Seems that failure to listen to feedback continues...though to be honest, it seems that they are trying -> listening is the first step.

Action, then, should be the next. After all, we wouldn't want to see LookSmart getting bad press over all this - of a kind with their move to PPC, which definitely caused a 'rift' between them and the webmaster community as a whole.

This, then, could be 'their chance' to forge a relationship that might be to the betterment of everybody.

jomaxx

10:07 pm on Apr 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks for the info, Jeremy. Out of curiosity I went as far back as March 31 without seeing any attempt by anything identifying itself as *grub* to retrieve my robots.txt file.

I also noticed that between noon Sunday and noon Monday, grub clients retrieved my index.html file exactly 2,300 times. Yow! I'll cut them some slack on this matter because they're probably experiencing some growing pains right now, but that's trying to keep things a little TOO fresh, if you ask me.

[Andre: The grub client has now been 403'ed from the site in my profile for the time being, but if you want to make a manual check, your comments on either of these issues would be welcome.]

stechert

12:32 am on Apr 22, 2003 (gmt 0)

10+ Year Member



Hi Jomaxx --

As I've said before, we're 100% behind sorting out any of these problems. I've taken a look at your user profile so I now know what site you're at and I'm checking into the problem. Will reply when your site has been updated.

Cheers,
Andre

p.s. Please forgive if it takes until tomorrow...new servers came online, but the new clients ate all the additional capacity...we had a bunch more boxes on order that were supposed to show up today, but!#$@ you know how that goes.

jrobbio

12:45 am on Apr 22, 2003 (gmt 0)

10+ Year Member



Stechert, following on from jomaxx's comment: what is the IP, reverse DNS and user agent of the machine that checks the robots.txt? And I don't mean the one that is used to yank it, unless of course it's the same. It certainly doesn't say grub anywhere on it.
Cheers.

jomaxx

12:50 am on Apr 22, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Andre, no problem. I'm just bug-reporting - don't fixate too much on my site in particular.

jmccormac

3:06 am on Apr 22, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I think that there is a fundamental flaw in this approach of crawling all available links. It is like sieving the Atlantic Ocean in the hope of catching a fish. The main problem, from what I can see at the moment, is that the bigger search engines produce too many results. In the Wired article, Peter Norvig of Google has it right. Some of the article quotes were worrying. This one in particular seemed to have all the wrong reasons for events:
"Google is also popular because it throws enormous resources at the problem, Stechert said. In fact, Stechert said, the success of search engine companies has been closely tied to the amount of computing power they use to index the Web." (Sorry if I am quoting out of context, Stechert.)

Google probably has the most efficient advertising and PR of any of the search engines. Its ranking algorithm gave the results a relevance that was lacking in most of its competition. The massive resources it threw at the problem also happened to be a more efficient way of tackling it, because of the lower startup costs of n thousand servers running an open source operating system. Altavista, on the other hand, used a closed source operating system and concentrated the servers. The difference in internet generations was critical - Altavista was advanced for its time, but it was also a product of that time in that it used an approach that worked and relied on in-house technology. The one killer for any search engine is monetizing the results - making the search engine pay. Altavista's approach was to market search appliances in much the same way as Google does and to try to sell advertising. Google's approach has been integrated at a far lower level, so that the advertising on the results page is related (often closely) to the actual results.

Speaking purely from a background in codebreaking: flinging loads of resources at a problem and hoping for a breakthrough is not the best approach. You have got to use your resources efficiently and choose the right point or approach. The distributed Grub crawler is an interesting idea, but it would generate a lot of data. Hoping that analysis and a good engine will give that data relevance for the SE user is something else indeed.

Regards...jmcc

jomaxx

5:10 pm on Apr 22, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I agree that the mechanics of crawling the Web is just one aspect -- and not the most interesting or important one -- of building a search engine. Grub will need a great ranking algorithm to survive.

Not sure I agree with your analysis of Google's success, though. IMO it had little to do with advertising or public relations, and less to do with Open Source vs. proprietary technology. Google took over the world because
(1) they offer fast, bloat-free search results, and
(2) they have better search results than everybody else, probably due in large part to factoring in pagerank and link anchor text.
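For anyone curious about (2), here is a toy sketch of the PageRank idea -- illustrative Python over a made-up four-page link graph, and of course nothing to do with Google's actual implementation:

# Toy PageRank power iteration on a hypothetical four-page link graph.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(links)
damping = 0.85
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):  # iterate until the scores settle
    new_rank = {p: (1 - damping) / len(pages) for p in pages}
    for page, outlinks in links.items():
        share = damping * rank[page] / len(outlinks)
        for target in outlinks:
            new_rank[target] += share
    rank = new_rank

print(rank)  # pages with more (and better-ranked) inbound links score higher

Pages that are linked to by well-linked pages float to the top; anchor text then tells the engine what those pages are about.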

jmccormac

2:01 am on Apr 23, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Not sure I agree with your analysis of Google's success, though. IMO it had little to do with advertising or public relations, and less to do with Open Source vs. proprietary technology. Google took over the world because
(1) they offer fast, bloat-free search results, and
(2) they have better search results than everybody else, probably due in large part to factoring in pagerank and link anchor text.

I don't think that our reasons for Google's success are too far apart, Jomaxx.
The open source vs closed source operating system and the PageRank ordering of the results facilitated (1) and (2) above. The fifty thousand or so servers that Google runs all use Red Hat Linux, apparently. The cost of running these on a closed source OS would be, to say the least, expensive. The cost of technology had also fallen dramatically in the period between the start of Altavista and the start of Google. However, advertising and public relations are very important aspects that are often overlooked. Running a search engine is running a business. The business has to make money. Under the nice happy-clappy exterior of Google lurk hard-nosed business people. Most of the SEs that failed in the last few years failed because they did not monetize their results in the same effective fashion as Google. Google's fast, bloat-free results and the quality of those results gave it that critical edge.

The one thing that bothers me about the Looksmart/Grub approach, and indeed all crawler approaches, is that they all seem to follow the Christopher Columbus school of discovery. They don't know where they are going. They don't know where they are when they get there. And they don't know where they've been when they get back. Microsoft seemed to cop on to a very important aspect of the web - a lot of it does not change on a day-to-day (or even month-to-month) basis. If the Looksmart/Grub crawler managed to identify this fresh web core, then it would have a unique selling point.

Regards...jmcc

Go2

4:33 pm on Apr 23, 2003 (gmt 0)

10+ Year Member



Grub is certainly an interesting project but it will not solve the hard part of indexing the web. It is not primarily the data collection that needs to be distributed but rather the much more cumbersome task of describing the collected web pages in terms of their subject and geographical location.

To distribute the human-powered task of describing the pages on the web, a completely new paradigm is required, one which enables the web page owners to organize and describe the web themselves. A few hundred hired editors will not suffice for this immense task, something which management at Looksmart and other major directories has probably long realized.

jmccormac

5:15 pm on Apr 23, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Grub is certainly an interesting project but it will not solve the hard part of indexing the web. It is not primarily the data collection that needs to be distributed but rather the much more cumbersome task of describing the collected web pages in terms of their subject and geographical location.

This is the problem I have been working on for some time now. The scary thing is that some countries can have between 10% and 70% of their associated websites hosted outside their IP space. Thus, for com/net/org/info, a significant proportion of each country's websites is excluded from the 'pages from $country' searches. The quick and nasty solution used by most of the bigger search engines has been to limit the search indices to the relevant country ccTLD (e.g. .ie, .uk, .de) and exclude the com/net/org domains. Recently, Google and some of the others have used a simplistic IP-based mapping to improve results. However, that still does not solve the problem. The only efficient way to solve it is to build a complex usage model for each individual country rather than try for one huge index.
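To make the gap concrete, a rough sketch (Python; the geo lookup table and example domain are made up for illustration -- a real system would use a proper GeoIP database and far more signals):

# Rough sketch: two naive ways of assigning a country to a site.
# The lookup table and example domain are hypothetical.
import socket

HYPOTHETICAL_GEOIP = {"203.0.113.10": "US"}  # stand-in for a real GeoIP database

def cctld_country(domain):
    # Assign by TLD -- misses every .com/.net/.org/.info site outright.
    tld = domain.rsplit(".", 1)[-1]
    return tld.upper() if len(tld) == 2 else None

def geoip_country(domain):
    # Assign by hosting IP -- puts sites hosted abroad in the wrong country.
    ip = socket.gethostbyname(domain)
    return HYPOTHETICAL_GEOIP.get(ip)

# An Irish business on a .com, hosted on a US server, is lost either way:
# cctld_country("dublin-widgets.com") -> None (not a ccTLD)
# geoip_country("dublin-widgets.com") -> "US" (wrong country)

A usage model for each country has to combine many more signals than either of those to catch cases like that.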

Regards...jmcc

jrobbio

9:40 pm on Apr 23, 2003 (gmt 0)

10+ Year Member



Couldn't this be overcome to a certain extent by taking into account the ICBM coordinates of people who have registered .coms etc., rather than the relevant country ones? Is this what you mean by building a complex usage model?

<meta name="ICBM" content="XX.XXXXX, XX.XXXXX">
<meta name="DC.title" content="THE NAME OF YOUR SITE">

or

<meta name="geo.position" content="latitude;longitude">

However, whether the information someone gave was valid or not could cause problems.
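Just as an illustration, a quick sketch (Python, over a made-up page -- purely hypothetical) of how an engine could pull the tag out and do a basic sanity check; as you can see, it only catches nonsense values, not lies:

# Sketch: extract a geo.position meta tag from a page and sanity-check it.
# The HTML string is a made-up example.
from html.parser import HTMLParser

class GeoMetaParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.position = None
    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "geo.position":
            self.position = attrs.get("content")

sample = '<html><head><meta name="geo.position" content="53.3498;-6.2603"></head></html>'
parser = GeoMetaParser()
parser.feed(sample)

if parser.position:
    lat, lon = (float(x) for x in parser.position.split(";"))
    # Basic range check -- but nothing stops a webmaster claiming any location.
    valid = -90 <= lat <= 90 and -180 <= lon <= 180
    print(parser.position, "valid range" if valid else "out of range")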

jmccormac

11:17 pm on Apr 23, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Couldn't this be overcome to a certain extent by taking into account the ICBM coordinates of people who have registered .coms etc., rather than the relevant country ones? Is this what you mean by building a complex usage model?

Not exactly, Jrobbio.
The complex model tends to identify websites hosted outside of a particular country rather than relying on the webmaster to do it. With the ICBM method there is too high a margin of error, and even if it became a standard, a lot of webmasters would not include it anyway. The complex model uses a lot of sources and analysis to determine whether a domain 'belongs' to a specific country. The ICBM method would be a lot easier, but I'd hate to think what form the DoS attacks could take. ;)

Regards...jmcc