
Alternative Search Engines Forum

This 47 message thread spans 2 pages.
LookSmart has acquired Grub Inc
Open Source crawling
Jillibert




msg:464113
 2:41 am on Mar 15, 2003 (gmt 0)

LOOK have purchased this company

[grub.org...]

It appears this will be able to not only refresh the host's index but also "crawl" the web and refresh the web index on a daily basis.

Very interesting process, much like Napster and SETI.

It will certainly give a powerful and cost-effective edge to Wisenut.

 

fiestagirl




msg:464114
 2:50 am on Mar 15, 2003 (gmt 0)

Did you read somewhere that Looksmart has purchased grub? In this thread from January they say otherwise.
[webmasterworld.com...]

Jillibert




msg:464115
 2:57 am on Mar 15, 2003 (gmt 0)

There is information in the Looksmart Annual Report, just released today, where it states that Looksmart has acquired Grub. It's under the section on "community participation".

Page 4 of the Annual report

fiestagirl




msg:464116
 3:12 am on Mar 15, 2003 (gmt 0)

I see... very interesting. Good find.

"In January 2003, we acquired substantially all of the assets of Grub, Inc., a developer of distributed computing software which allows community participants to assist in the development and updating of a web search index. We believe that by incorporating a distributed computing solution into our systems and processes for updating our search index, we may be able to achieve substantial gains in the freshness of the index and cost savings over the long term."

pendanticist




msg:464117
 4:41 am on Mar 15, 2003 (gmt 0)

<Groan>

Me no like grubs.

Pendanticist.

Brett_Tabke




msg:464118
 4:53 am on Mar 15, 2003 (gmt 0)

Is that report available online somewhere?

fiestagirl




msg:464119
 5:02 am on Mar 15, 2003 (gmt 0)

[shareholder.com...]

Brett_Tabke




msg:464120
 11:01 am on Mar 15, 2003 (gmt 0)

direct link to the right one [shareholder.com] (large).

mack




msg:464121
 11:06 am on Mar 15, 2003 (gmt 0)

Makes me wonder how many grub operators will think twice when they see that all their unpaid work is going directly to Looksmart.

Hollywood




msg:464122
 5:08 pm on Mar 15, 2003 (gmt 0)

This Grub thing is interesting. Anyone want to tell me why this client program may be bad, or is this a win-win for everyone?

cchooper




msg:464123
 6:31 pm on Mar 15, 2003 (gmt 0)

I tried it out. In fact, for the past few days I've been posting about it on my site. The client (so far) is pretty bad, mainly because it's still in its beta phase.

I had hoped that with it being individually owned it might make its index open source, and maybe even release an API to query it much like Google did, but I don't think that's going to happen.

I'd post a link to the comments about it on my site, but I'm pretty sure I read something about that in the rules section ;)

So I'll just put it this way: I refuse to run it now :) I've also heard complaints from other webmasters about the Grub crawler being rather aggressive, so we'll see what Looksmart does with it.

Hollywood




msg:464124
 7:03 pm on Mar 15, 2003 (gmt 0)

I am trying it now, it is interesting, seems to be on a "hang" at the moment, it has crawled 8000 sites thus far on my end.

I am on T1

born2drv




msg:464125
 7:54 pm on Mar 15, 2003 (gmt 0)

I let it run last night on my cable modem full speed, crawled 113,000 URLs...

But I can't get it to crawl my site... has anyone had success getting LOCAL URLs crawled? Does it automatically deep crawl and index your whole site, or do you have to submit a list to it? I'd rather not have to make a list.

I've got the site listed in my profile online, and I checked off the box in the preferences under local crawl. If anyone is getting their own content crawled/indexed, please let me know what I'm doing wrong.

Jillibert




msg:464126
 8:04 pm on Mar 15, 2003 (gmt 0)

I have been trying it since yesterday and find it has some issues, but if it can be tweaked and the bandwidth usage issue addressed, this could be very interesting in making a search engine truly fresh.

I think the site canvasses who will and who will not want to use it, but any commercial site would have to think about using it, as it is able to index the host site every day and upload that information, which has to be a plus for anyone who wants to be found on the internet.

Subject to some minor tweaks this looks quite interesting.

Hollywood




msg:464127
 8:06 pm on Mar 15, 2003 (gmt 0)

Jillibert I agree this could be big.

carfac




msg:464128
 9:11 pm on Mar 15, 2003 (gmt 0)

I actually banned it from my sites. It did not respect robots.txt, and went tearing through stuff like crazy. I contacted Grub at the time, and it did not seem like it would add anything to my site or its searchability.

Maybe I will rethink this, but the client will HAVE to be better behaved or it will not get past the blocks I have set up!

dave

jrobbio




msg:464129
 9:17 pm on Mar 15, 2003 (gmt 0)

I'm testing it out at the moment, although it is of very little benefit to me until they allow non-site admins to choose which sites get crawled first. I enquired about this and apparently it is something that they are looking at.
They are currently working out ways to work with Zeal and Wisenut too.

GoogleGuy




msg:464130
 11:35 pm on Mar 15, 2003 (gmt 0)

Interesting--I checked out Grub a long time ago back when they were using wget to fetch pages. I haven't looked at it lately though--how do they make sure that the crawling volunteers are returning the content that they're supposed to return?

matthias




msg:464131
 12:14 am on Mar 16, 2003 (gmt 0)

Hmmm... maybe I'm completely off track here, but wouldn't it be enough if they acted as a browser plugin and just indexed the sites I'm visiting anyway? Of course without collecting personal data :-)

Sure, a lot of sites would be covered too well, but that should be solvable.

Together with an Alexa-like concept (I like the basic idea of Alexa) you could even get PR-like rankings.

mack




msg:464132
 12:20 am on Mar 16, 2003 (gmt 0)

To follow on from what GoogleGuy asked: is it possible to corrupt the data you return to make it favourable to you?

kordless




msg:464133
 12:22 am on Mar 16, 2003 (gmt 0)

Having people crawl the data for you does introduce a problem of how to verify the data stream integrity. It is an issue that we have been thinking about for 3 years now.

The long and short of it is there are a variety of ways it can be handled - use your imagination!

:)

Kord
Client ID #1
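
Kord does not say how Grub handles this, so the following is only one way a reader might imagine it, not Grub's actual method: hand the same URL to several independent clients and accept a checksum only when enough of them agree. The function name `accepted_checksum`, the `quorum` parameter, and the client IDs are all hypothetical.

```python
from collections import Counter
from typing import Dict, Optional


def accepted_checksum(reports: Dict[str, str], quorum: int = 2) -> Optional[str]:
    """Return the checksum reported by at least `quorum` clients, else None.

    `reports` maps a (hypothetical) client ID to the checksum that client
    claims for one URL. This is an illustrative redundancy check, not
    anything the thread confirms Grub does.
    """
    if not reports:
        return None
    checksum, votes = Counter(reports.values()).most_common(1)[0]
    return checksum if votes >= quorum else None


# Two honest clients agree; one client returns something else.
reports = {"client_a": "abc123", "client_b": "abc123", "client_c": "ffffff"}
print(accepted_checksum(reports))

# No agreement at all: nothing is accepted, the URL would be re-queued.
print(accepted_checksum({"client_a": "111", "client_b": "222"}))
```

The obvious cost is that every URL is fetched more than once, which eats into the bandwidth savings that make distributed crawling attractive in the first place.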

Fischerlaender




msg:464134
 12:44 am on Mar 16, 2003 (gmt 0)

I'm not very convinced by Grub's concept. Getting fresh data from the web via distributed crawlers may be nice. But the main problems for such a search engine would be how to
- put all those data together into a very fast frontend database which has to answer the queries of the users.
- get an effective ranking algorithm.

And there is of course another, more "social" problem: It's one thing to support researchers with my idle cpu time; but to support a large company with my bandwidth so that they can make money out of it is something completely different.

Hollywood




msg:464135
 12:47 am on Mar 16, 2003 (gmt 0)

Googleguy

From looking at the way this works, it appears they feed the URLs to you first or on the fly and then have your end check the packets (URLs). This is pretty wild, I have crawled about 300,000 already today myself.

Me thinks Google needs to do this. (Option on toolbar) Really! What do you think GG?

I think we should all push this "tang" and see how many pages we can update and if it can keep up!

Whata'ya say boys and girls, I love breaking things myself, hahahaha.

~Hollywood.

Hollywood




msg:464136
 12:50 am on Mar 16, 2003 (gmt 0)

Fisch...

"And there is of course another, more "social" problem: It's one thing to support researchers with my idle cpu time; but to support a large company with my bandwidth so that they can make money out of it is something completely different."

I agree with this statement. I look at this the same way as DMOZ: a lot of help for free, and when the day comes that they sell it for millions, will they pay up? Hmmm (thinking of ERn-RON, can I say it that way?). I doubt they will sport the dough!

~Hollywood... out

GoogleGuy




msg:464137
 1:01 am on Mar 16, 2003 (gmt 0)

kordless, glad you're thinking of all these issues. Kinda like SETI where someone could pass back a packet saying "we detected aliens on this frequency!" :) So are all of the Grub folk moving over to LookSmart?

kordless




msg:464138
 1:23 am on Mar 16, 2003 (gmt 0)

I guess we'll see about those aliens in a few weeks or so eh? :) I think the guys at SETI@Home have some telescope time booked on the 18th of this month - I can't wait!

Kord
Client ID #1

Hollywood




msg:464139
 3:36 am on Mar 16, 2003 (gmt 0)

I see a spike on Grub home page for new urls checked, I think that is all me, I am even pushing!

hahahah

But really, check it out! Am I allowed to post a link on here? Not sure, so I will not for now.

~Hollywood

born2drv




msg:464140
 3:50 am on Mar 16, 2003 (gmt 0)

Kord, can I index all my competitors' sites? I'll be sure to send you a full and complete accounting of their sites so you can rank them properly ;)

cchooper




msg:464141
 4:00 am on Mar 16, 2003 (gmt 0)

Regarding user-corrupted data:

IIRC, Grub keeps checksums of all of these sites, and the client does nothing but calculate the checksum of the current site and send it to the Grub server. If it's different, the site gets shipped off to processing (well, a big inaccessible database for now, nothing really gets processed yet :))

One could always claim the checksums are different, or better yet, add some kind of PHP (or even do it in ASP) in a comment and do a mktime(), so your site is always different on every visit, no matter whether you updated or not :) If this ever takes off, it'd be interesting to see how many people do that.
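
A minimal sketch of the checksum comparison described above, and of why a timestamp in a comment defeats it. The thread does not say which checksum Grub uses, so SHA-256 is an assumption here, and the function names `page_checksum`, `needs_reindex`, and `render_with_timestamp` are made up for illustration.

```python
import hashlib


def page_checksum(body: bytes) -> str:
    """Checksum of a fetched page body (SHA-256 is an assumed choice)."""
    return hashlib.sha256(body).hexdigest()


def needs_reindex(body: bytes, stored_checksum: str) -> bool:
    """True when the freshly fetched page differs from the stored checksum."""
    return page_checksum(body) != stored_checksum


STATIC_PAGE = b"<html><body>Hello</body></html>"
stored = page_checksum(STATIC_PAGE)

# An unchanged page produces the same checksum, so nothing is re-sent.
print(needs_reindex(STATIC_PAGE, stored))


def render_with_timestamp(ts: int) -> bytes:
    """Same visible content, plus a comment holding the render time,
    standing in for the mktime()-in-a-comment trick from the post."""
    return STATIC_PAGE + f"<!-- {ts} -->".encode()


# Every render carries a different comment, so the page always looks changed.
print(needs_reindex(render_with_timestamp(1047787200), stored))
print(page_checksum(render_with_timestamp(1)) != page_checksum(render_with_timestamp(2)))
```

A server that wanted to resist this would have to checksum something narrower than the raw bytes, for example the page with comments stripped, but nothing in the thread says Grub does that.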

Now regarding robots.txt

Grub checks robots.txt every 30 days, if it checks it at all. However, not every client seems to honor it when it's first reloaded. A few people have complained on the Grub forums, but the response from the Grub team so far seems to be "stfu," no matter what the suggestion.
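
To make the 30-day point concrete, here is a rough sketch of a crawler that caches robots.txt for 30 days, using Python's standard `urllib.robotparser`. It is not Grub's code: the `CachedRobots` class, the `grub-client` user-agent string, and the example URL are assumptions, and the rules are passed in as lines so the example runs without touching the network.

```python
import time
from urllib.robotparser import RobotFileParser

MAX_AGE_SECONDS = 30 * 24 * 3600  # the 30-day refresh interval from the post


class CachedRobots:
    """Caches parsed robots.txt rules and refreshes them after max_age."""

    def __init__(self, fetch_lines, max_age=MAX_AGE_SECONDS):
        self._fetch_lines = fetch_lines  # callable returning robots.txt lines
        self._max_age = max_age
        self._parser = None
        self._fetched_at = None

    def allowed(self, user_agent, url, now=None):
        now = time.time() if now is None else now
        if self._parser is None or now - self._fetched_at >= self._max_age:
            parser = RobotFileParser()
            parser.parse(self._fetch_lines())
            self._parser, self._fetched_at = parser, now
        return self._parser.can_fetch(user_agent, url)


# The site's rules at first fetch: only /private/ is off limits.
current_rules = ["User-agent: *", "Disallow: /private/"]
robots = CachedRobots(lambda: list(current_rules))
url = "http://example.com/index.html"

first = robots.allowed("grub-client", url, now=0.0)

# The webmaster now bans all crawlers, but the cached copy is one day old.
current_rules[:] = ["User-agent: *", "Disallow: /"]
stale = robots.allowed("grub-client", url, now=86400.0)

# Only after the 30-day window does the crawler see the ban.
fresh = robots.allowed("grub-client", url, now=float(MAX_AGE_SECONDS))
print(first, stale, fresh)
```

With a cache this long, a webmaster who adds a ban can keep getting crawled for up to a month, which would explain complaints like the ones above even if the client technically reads robots.txt.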

porkyoz




msg:464142
 4:02 am on Mar 16, 2003 (gmt 0)

Hi Fischerlaender,

You wrote:

And there is of course another, more "social" problem: It's one thing to support researchers with my idle cpu time; but to support a large company with my bandwidth so that they can make money out of it is something completely different.
======================================================================

I am not sure I follow your rationale.

I would assume you are an SEO.
You have clients.
Your clients pay you to provide the best possible mechanism to enrich their websites for ROI. Through word of mouth, your clientele grows.

I would have thought that having your clients' sites refreshed daily would be in your and your clients' best interests.
I see it as a two-way street. True, you are providing bandwidth, but according to your limits and not Grub's appetite. It is entirely up to you.

One thing to remember: how much bandwidth is left idle and not fully used each month, given that you have already paid for a certain capacity? Why not maximise your resources by improving the freshness of your clients' websites?

I see this as an opportunity cost. You forgo a certain amount of bandwidth per month and gain freshness for your clients' sites every day. A win-win situation.

Am I missing your point?

Cheers Porkyoz

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved