
Forum Moderators: bakedjake


LookSmart has acquired Grub Inc

Open Source crawling

     
2:41 am on Mar 15, 2003 (gmt 0)



LOOK have purchased this company

[grub.org...]

It appears this will be able to not only refresh the host's index but also "crawl" the web and refresh the web index on a daily basis.

Very interesting process, much like Napster and SETI.

Certainly will give a powerful and cost-effective edge to WiseNut.

2:50 am on Mar 15, 2003 (gmt 0)

10+ Year Member



Did you read somewhere that LookSmart has purchased Grub? In this thread from January they say otherwise.
[webmasterworld.com...]
2:57 am on Mar 15, 2003 (gmt 0)



There is information in the LookSmart Annual Report, just released today, stating that LookSmart has acquired Grub. It's under the "community participation" section.

Page 4 of the Annual Report.

3:12 am on Mar 15, 2003 (gmt 0)

10+ Year Member



I see... very interesting. Good find.

"In January 2003, we acquired substantially all of the assets of Grub, Inc., a developer of distributed computing software which allows community participants to assist in the development and updating of a web search index. We believe that by incorporating a distributed computing solution into our systems and processes for updating our search index, we may be able to achieve substantial gains in the freshness of the index and cost savings over the long term."

4:41 am on Mar 15, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



<Groan>

Me no like grubs.

Pendanticist.

4:53 am on Mar 15, 2003 (gmt 0)

WebmasterWorld Administrator brett_tabke is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Is that report available online somewhere?
11:01 am on Mar 15, 2003 (gmt 0)

WebmasterWorld Administrator brett_tabke is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



direct link to the right one [shareholder.com] (large).
11:06 am on Mar 15, 2003 (gmt 0)

WebmasterWorld Administrator mack is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Makes me wonder how many Grub operators will think twice when they realize that all their unpaid work is going directly to LookSmart.
5:08 pm on Mar 15, 2003 (gmt 0)

10+ Year Member



This Grub thing is interesting. Anyone want to tell me why this client program may be bad, or is this a win-win for everyone?
6:31 pm on Mar 15, 2003 (gmt 0)

10+ Year Member



I tried it out; in fact, for the past few days I've been posting about it on my site. The client (so far) is pretty bad, mainly because it's still in its beta phase.

I had hoped, with it being independently owned, that it might make its index open source, and maybe even release an API to query it much like Google did, but I don't think that's going to happen.

I'd post a link to the comments about it on my site, but I'm pretty sure I read something about that in the rules section ;)

So I'll just put it this way: I refuse to run it now :) I've also heard complaints from other webmasters about the Grub crawler being rather aggressive, so we'll see what LookSmart does with it.

7:03 pm on Mar 15, 2003 (gmt 0)

10+ Year Member



I am trying it now. It is interesting; it seems to be on a "hang" at the moment. It has crawled 8,000 sites thus far on my end.

I am on a T1.

7:54 pm on Mar 15, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I let it run last night on my cable modem full speed, crawled 113,000 URLs...

But I can't get it to crawl my site... has anyone had success getting LOCAL URLs crawled? Does it automatically deep crawl and index your whole site, or do you have to submit a list to it? I'd rather not have to make a list.

I've got the site listed in my profile online, and I checked the box in the preferences under local crawl. If anyone is getting their own content crawled/indexed, please let me know what I'm doing wrong.

8:04 pm on Mar 15, 2003 (gmt 0)



I have tried it since yesterday and find it has some issues, but if it can be tweaked and the bandwidth usage issue addressed, this could be very interesting in making a search engine truly fresh.

I think the site canvasses who will and who will not want to use it, but any commercial site would have to think about using it, as it is able to index the host site every day and upload that information, which has to be a plus for anyone who wants to be found on the internet.

Subject to some minor tweaks, this looks quite interesting.

8:06 pm on Mar 15, 2003 (gmt 0)

10+ Year Member



Jillibert, I agree this could be big.
9:11 pm on Mar 15, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I actually banned it from my sites. It did not respect robots.txt, and went tearing through stuff like crazy. I contacted Grub at the time, and it did not seem like it would add anything to my site, or its searchability.

Maybe I will rethink this, but the client will HAVE to be better behaved or it will not get past the blocks I have set up!

dave

9:17 pm on Mar 15, 2003 (gmt 0)

10+ Year Member



I'm testing it out at the moment, although it is of very little benefit to me until they allow non-site-admins to choose the sites to be crawled first. I enquired about this, and apparently it is something they are looking at.
They are currently working out ways to work with Zeal and WiseNut too.
11:35 pm on Mar 15, 2003 (gmt 0)

WebmasterWorld Senior Member googleguy is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Interesting--I checked out Grub a long time ago back when they were using wget to fetch pages. I haven't looked at it lately though--how do they make sure that the crawling volunteers are returning the content that they're supposed to return?
12:14 am on Mar 16, 2003 (gmt 0)

10+ Year Member



Hmmm... maybe I'm completely off track here, but wouldn't it be enough if they acted as a browser plugin and just indexed the sites I'm visiting anyway? Of course without collecting personal data :-)

Sure, a lot of sites would be covered too well, but that should be solvable.

Together with an Alexa-like concept (I like the basic idea of Alexa), you could even get PR-like rankings.

12:20 am on Mar 16, 2003 (gmt 0)

WebmasterWorld Administrator mack is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



To follow on from what GoogleGuy asked: is it possible to corrupt the data you return to make it favourable to you?
12:22 am on Mar 16, 2003 (gmt 0)

10+ Year Member



Having people crawl the data for you does introduce a problem of how to verify the data stream integrity. It is an issue that we have been thinking about for 3 years now.

The long and short of it is there are a variety of ways it can be handled - use your imagination!

:)
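
For instance (purely a what-if, not a description of our actual system), you could hand the same URL to several independent clients and only trust the result when a majority of their hashes agree. A rough Python sketch of that idea, with every name in it made up:

import hashlib
from collections import Counter

def content_hash(page_bytes):
    # Hash the raw bytes a client claims it fetched.
    return hashlib.md5(page_bytes).hexdigest()

def accept_by_majority(reports, quorum=2):
    # reports: list of (client_id, hash) pairs for one URL.
    # Return the winning hash if enough clients agree, else None
    # (meaning: reassign the URL and crawl it again).
    if not reports:
        return None
    counts = Counter(h for _, h in reports)
    best_hash, votes = counts.most_common(1)[0]
    return best_hash if votes >= quorum else None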

Kord
Client ID #1

12:44 am on Mar 16, 2003 (gmt 0)

10+ Year Member



I'm not very convinced by Grub's concept. Getting fresh data from the web via distributed crawlers may be nice. But the main problems for such a search engine would be how to
- put all that data together into a very fast frontend database, which has to answer users' queries (a toy sketch of this below).
- come up with an effective ranking algorithm.
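
To make the first point concrete, here is a toy inverted index in Python - obviously nothing like what a production engine needs, just an illustration of the data structure that has to be built, merged, and served fast as crawl data streams in:

from collections import defaultdict

index = defaultdict(set)  # term -> set of document ids

def add_document(doc_id, text):
    # Index every whitespace-separated term of the document.
    for term in text.lower().split():
        index[term].add(doc_id)

def search(query):
    # Return ids of documents containing ALL query terms.
    terms = query.lower().split()
    if not terms:
        return set()
    results = set(index[terms[0]])
    for term in terms[1:]:
        results &= index[term]
    return results

add_document(1, "distributed crawling with volunteer clients")
add_document(2, "ranking crawled pages is the hard part")
print(search("crawling clients"))  # -> {1}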

And there is of course another, more "social" problem: It's one thing to support researchers with my idle cpu time; but to support a large company with my bandwidth so that they can make money out of it is something completely different.

12:47 am on Mar 16, 2003 (gmt 0)

10+ Year Member



Googleguy

From looking at the way this works, it appears they feed the URLs to you up front or on the fly, and then have your end check the packets (URLs). This is pretty wild; I have crawled about 300,000 already today myself.
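
Just guessing at how the client end works - something like this, I'd imagine (Python sketch; the report format is totally made up):

import hashlib
import urllib.request

def crawl_batch(urls):
    # Fetch each URL the server fed us and report back a checksum.
    reports = []
    for url in urls:
        try:
            page = urllib.request.urlopen(url, timeout=10).read()
            reports.append((url, hashlib.md5(page).hexdigest()))
        except OSError:
            reports.append((url, None))  # couldn't fetch it
    return reports

# The real client would presumably send these reports back to the
# coordinator and ask for the next batch.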

Me thinks Google needs to do this. (Option on the toolbar.) Really! What do you think, GG?

I think we should all push this "tang" and see how many pages we can update and if it can keep up!

Whata'ya say, boys and girls? I love breaking things myself, hahahaha.

~Hollywood.

12:50 am on Mar 16, 2003 (gmt 0)

10+ Year Member



Fisch...

"And there is of course another, more "social" problem: It's one thing to support researchers with my idle cpu time; but to support a large company with my bandwidth so that they can make money out of it is something completely different."

I agree with this statement. I look at this the same way as DMOZ: a lot of help for free. When the day comes that they sell it for millions, will they pay up? Hmmm (thinking of ERn-RON, can I say it that way?). I doubt they will sport the dough!

~Hollywood... out

1:01 am on Mar 16, 2003 (gmt 0)

WebmasterWorld Senior Member googleguy is a WebmasterWorld Top Contributor of All Time 10+ Year Member



kordless, glad you're thinking of all these issues. Kinda like SETI where someone could pass back a packet saying "we detected aliens on this frequency!" :) So are all of the Grub folk moving over to LookSmart?
1:23 am on Mar 16, 2003 (gmt 0)

10+ Year Member



I guess we'll see about those aliens in a few weeks or so eh? :) I think the guys at SETI@Home have some telescope time booked on the 18th of this month - I can't wait!

Kord
Client ID #1

3:36 am on Mar 16, 2003 (gmt 0)

10+ Year Member



I see a spike on the Grub home page for new URLs checked. I think that is all me; I am even pushing!

hahahah

But really, check it out! Am I allowed to post a link on here? Not sure, so I will not for now.

~Hollywood

3:50 am on Mar 16, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Kord, can I index all my competitors' sites? I'll be sure to send you a full and complete accounting of their sites so you can rank them properly ;)
4:00 am on Mar 16, 2003 (gmt 0)

10+ Year Member



Regarding user-corrupted data:

IIRC, Grub keeps checksums of all of these sites, and the client does nothing but calculate the checksum of the current site and send it to the Grub server. If it's different, the site gets shipped off for processing (well, into a big inaccessible database for now; nothing really gets processed yet :))

One could always claim the checksums are different, or better yet, add some kinda PHP (or even do it in ASP) in a comment and do a mktime(), so your site is always different on every visit, no matter whether you updated it or not :) If this ever takes off, it'd be interesting to see how many people do that.
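
For the curious, the trick looks roughly like this (a hypothetical Python stand-in for the PHP version):

import hashlib
import time

def page_checksum(html):
    # What the client ships back: just a digest of the page.
    return hashlib.md5(html.encode()).hexdigest()

base = "<html><body>same content as yesterday</body></html>"
stamped = base + "<!-- %d -->" % int(time.time())  # the mktime() trick
# page_checksum(stamped) differs on every fetch, so the page always
# looks "changed" and always gets shipped off for processing.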

Now, regarding robots.txt:

Grub checks the robots.txt every 30 days, if it even checks it at all; however, not every client seems to honor it when it's first reloaded. A few people have complained on the Grub forums, but the response from the Grub team so far seems to be "stfu," no matter what the suggestion.
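
Honoring robots.txt on every crawl isn't hard, either. In Python it's roughly this (host and user-agent string are just examples):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()  # re-read before crawling the host, not once a month

if rp.can_fetch("grub-client", "http://www.example.com/page.html"):
    print("ok to crawl")
else:
    print("the site said no - skip it")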

4:02 am on Mar 16, 2003 (gmt 0)

10+ Year Member



Hi Fischerlaender,

You wrote:

And there is of course another, more "social" problem: It's one thing to support researchers with my idle cpu time; but to support a large company with my bandwidth so that they can make money out of it is something completely different.
======================================================================

I am not sure I follow your rationale.

I would assume you are an SEO.
You have clients.
Your clients pay you to provide the best possible mechanism to enrich their websites for ROI. Word of mouth, and your clientele grows.

To have your clients' sites refreshed daily would, I would have thought, be in your and your clients' best interests.
I see it as a two-way street. True, you are providing bandwidth according to your limits and not "Grub's" appetite. It is entirely up to you.

One thing to remember: how much bandwidth is left idle and not fully used each month, given you have already paid for a certain allocated capacity? Why not maximise those resources by improving the freshness of your clients' websites?

I see this as an opportunity cost: you forgo a certain amount of bandwidth per month and gain freshness for your clients' sites every day. A win-win situation.

Am I missing your point?

Cheers Porkyoz
