|Grub Distributed Web Crawling Client|
Recently acquired by LookSmart, I understand...
I've got this client running on my computer right now, crawling the web in a distributed search engine system. It's called Grub and it's pretty neat. The screen saver is just fantastic to watch :)
Anyone interested in d/l'ing or finding out more can visit [grub.org....]
Supposedly the focus of grub is to tackle Google's "thousands of computers" with millions of distributed computers...a la SETI. Very interesting concept but possibly one that has its downfalls/limitations.
One downfall I've seen so far is that I can spoof DNS information and replace a web site's content with links to my site on the fly :) Making a proxy to do the same thing would be trivial--Imagine having a link to your site from Amazon! Wheee!
Sounds like they are still having some problems with that one :) It is an interesting idea though.
First mention of it was here: [webmasterworld.com...]
Imho, I just can't see letting somebody use my bandwidth at home unless I'm getting something out of it.
Google, Inktomi, Fast, Altavista, and even WiseNut have (so far...) been crawling & adding my sites to their databases just fine without my needing to 'volunteer' to help out.
When you mention the bit about setting up a proxy etc, and then getting a 'fake link' to your site -> can you explain in a bit more detail?
If I could have fun with it that way, I might even be tempted to download...1st link: Yahoo home page :)
not to burst your bubble or anything, but this kind of behavior won't have the effect you intend.
he he he, well - worth asking about, surely?
Thanks for chiming in, stechert.
Having never given the download a try, I have no idea how it works.
*If* there was some possibility of tweaking, I'm sure that would be the 1st thing LookSmart plugged up.
Grub exists to (in no particular order):
(a) deliver a fundamentally better kind of crawler
(b) give webmasters a way to participate directly in the update process
(c) give back to the community in the same way we do for Zeal (e.g., XML query interface is only the first step...remember, we've only been working with Grub for a few weeks).
(d) build a better index for the world to use
among other things. We'll continue to work with the community to find even more compelling applications of the technology, but the current effort must (and does) respect those principles.
What I mean by a proxy is that, instead of d/l from the yahoo.com site, you'd point grub at the proxy...but it would be a custom proxy that you develop yourself.
When the proxy d/l's yahoo's web site, it adds a link to whatever url you want at some place in the page before returning the results to grub.
Grub then passes the modified page on to the master server.
Of course, if grub has some sort of way of checking links via another user's results then you're sunk, but the technique lends itself well to collusion. And if grub has to check urls 3-4 or more times via different users to try to alleviate this technique, then grub will lose out to canonical SEs like Google, which only has to d/l yahoo.com once :)
There are other, more complex, ways of beating grub that simply could not be detected unless there was a central verification of web page contents...but I won't go into detail on these quite yet--because if grub gets *really* popular I'll want to use some of them ;)
Thanks for the explanation, Critter. :) Sounds like a bit more work than I'd be willing to do at the moment, for a very questionable amount of traffic.
Still, if it *does* get popular, perhaps you could drop me a sticky some time?
<----- Always interested in 'clever marketing techniques' that involve search engines.
I'm sure there are other folks (GG probably has to deal with this a lot) out there who share my sadness that, when there's so much opportunity to do something to make the world a better place, folks would continue to try to do this sort of thing.
Suffice it to say that we'll try to align incentives so that even though you could game the system, it'll be more painful to "cheat" than to use it as intended/designed. Consider, e.g., the way that zealots and staff collaborate at Zeal...you can, e.g., enter thousands of sites with good titles and descriptions and then decide to go break a high level listing, but we'd just detect it, fix it, and, unfortunately, suspend your privileges to contribute further to the directory. See, e.g.,
At the end of it, the scaled trust relationship led to a relationship that was mutually beneficial for zealots and the searching community and the good content will live on.
Long story short: I personally believe that people are fundamentally good. I've seen in practice that they are, at the very least, statistically good. When you throw in some lessons learned from the iterated prisoner's dilemma gedanken experiment, you can try to build systems that help people to come together to do something of mutual benefit that's bigger than the individual.
Looking forward to being a part of that with you...
he he he, well ya dude -> that's why we're all here, search engine reps & marketers alike.
But, um, we can make a buck & make the world a better place too, ya?
Hence, the need for me to feel some 'self interest' to get my appetite going & download the grub client.
On my side of the table, I'm thinking, how do I take advantage & make money, whilst giving the searcher what they want, and on your side of the table, it's about preventing me from taking 'special advantage' of your systems, Grub, WiseNut, etc.
Of course, doing things one way, eg, creating high quality sites, etc - is good for people searching & search engines.
I just want to make sure that *my* high quality sites that cater to the user's needs show up first.
So do the other people here. :)
Back to Grub, if ya'll figure out a way to get people excited about it, then it'll work out much better. Nail that, and perhaps WiseNut could rise from obscurity...
Of course, I'm not going to jeopardize any relationship I have with a search engine, because SEs are so beneficial to good sites.
However, even though the majority of people are trustworthy and good (and I believe this) the old saw "one bad apple spoils the bunch" rings true today like yesterday.
I'm merely pointing out the shortcomings of the technology, and it has glaring shortcomings. While you are under the radar it will be no problem; but once popularity sets in you'll simply have a big mess to contend with.
Google has a hard enough time dealing with spammy website techniques; Grub will need to deal with spammy website techniques and a plethora of individual problem crawlers. And add to this the fact that, by its design, grub depends on the crawler (under user control) for its authoritative information and you have a recipe for disaster. And once that happens you'll run into another reality: "never throw a technological solution at a social problem."
That will be the way things go if Grub becomes popular. Unfortunately, taking the view of "I'll design my technology around what I wish the world was like" does not work.
As far as cheating becoming too expensive, I don't believe that to be the case. You cannot design some sort of encryption algorithm for your protocol because the client accesses the web via regular http/html...you have no control over that part of the link. An automated cheat, once developed, will be extremely inexpensive and virtually undetectable.
I still think the screen saver is the coolest though.
Jeremy......be interested to hear your opinions about Grub after you download it and have a look at the concept?
Sorry -> I'm not downloading till I'm incentivized or the traffic from WiseNut & quality of results pushes me to jump in.
Though I love the idea / think it's got merit, etc. And, if it does take off, I'll be doing just what stechert said.
Well, after a fashion -> my writing really does deserve to be at the top. :) he he.
As soon as somebody posts what kind of real value I'm going to get out of it, like, monetary or otherwise, then I'll give it a whirl.
How's that sound? Fair enough, I think.
OK.....just sounds like you are making a judgement prior to using it.
Fair enough and that's your choice.
As to why someone would or should incentivize you, well, that probably means you are not the sort of person they are seeking to be a host.
And that's OK too.......some will use it, some will not.
I thought personally the reasoning as to why anyone would use it was pretty compelling, notwithstanding the inherent "geek" value.
Geez, if they have to give people a reason to use it they are in trouble.
Again i will prefer to wait to make a judgement until the whole thing is rolled out and there is a deeper understanding of how and where the results will be used.
But for now i am happy to give feedback and to read and listen to what people who are using it or have used it actually say.
Great! Me too.
From what stechert said before, it sounds like you all have lots of volunteers for your project.
And, since it's so neat, wonderful, and just plain geeky there should be plenty of people visiting WebmasterWorld that might be interested in doing such a download.
Shoot, sounds like ya'll ought to just give me a product demonstration right here?
What else does it do:
A) Cook me dinner?
B) Vacuum my silk rug?
C) Take my kids out for a walk?
he he he, sorry - couldn't resist.
There will be plenty of users posting their feedback, thoughts, and experiences - I'm 100% sure of it.
Nah.....don't think it will do any of the 3 although it appears Stechert is working on some surprises.
Shame though......i would have thought being a moderator you would have had a more open view.
And by the way i DO NOT work or have any association with LOOK.
I just like to see plurality in the market place........competition is great for business.
Whether or not my view is 'open' has nothing to do with my willingness to devote my computing resources to a project that doesn't pay my electric bill.
I happen to live in stechert's & GoogleGuy's part of the world, and electricity is *expensive here*, more so than in the other states I've lived in.
So, perhaps that's why, with a project that wants me to volunteer for it I want to see how I am going to benefit first.
If all it involves is visiting a web page, and searching for something, great! I'll give it a whirl, tell my family, and my friends, everybody. :)
All the people I know that aren't "geeky" use Google, because I told them it's the best and they agree with me.
What I'm looking for is the next best thing, and so far I would say Yuntis, WiseNut, and Gigablast are about equal on that score. And, Gigablast is run by 1 guy...
Trust me. Since I don't work for *any* search engines either, all I want is the 1 that is good and that will drive traffic to my sites.
Then, what I want is to be able to bend it to my way of viewing the web. Isn't that what every marketer wants?
imho, it is.
No one denies Google is the benchmark..in fact it is almost impossible to argue against their domination.
Hence my eagerness to see a competitor of real quality.
As to your electricity being expensive, quite frankly a trite comment like that has NO bearing whatsoever on the merits of whether Grub is or will be beneficial to LOOK/Wisenut.
As to who you work for i also could not care less..........it was your implication i was somehow involved with LOOK i was responding to.
But like i said........some will use it and some will not.
You are the nots.
Funny ......to know how to bend something or use it to your advantage one first has to know how to use it....doesn't one?
It's often said ignorance is bliss.
Personally i do not like bliss!
Well I have a very open view on this.
If I am to install software on my computer I need to be convinced that there is a benefit before so doing. I don't do things just because I am 'told' to - I need to know why - this seems a pretty open-minded approach to me. If I didn't take this stance I would spend all day reacting to TV adverts and spam e-mails, and not manage to get any work done.
In order to decide whether the benefit is 'real' or not, I will consider the product manufacturer's claims and track record, but give them somewhat less credence than any independent reports I can find.
In order to decide which reports are 'independent' I need to filter out reports or comments from anyone that can clearly be identified as consistently presenting a one-sided view supporting the product manufacturer.
In this case I have done all the above and have yet to be convinced that this will be of benefit to me.
But I have a very open mind on the issue.
Some like plurality and some like perspective.
From the most recent...
Grub crawler download [webmasterworld.com]
6:07 pm on Apr 3, 2003 (utc 0) Posted by: AmericanBulldog
LookSmart has acquired this Grub Inc.
Open Source crawling [webmasterworld.com]
2:41 am on Mar 15, 2003 (utc 0) Posted by: Jillibert
Are you running it? Would you ever consider running it? [webmasterworld.com]
10:20 am on Mar 16, 2003 (utc 0) Posted by: chris_f
spiders behavior on 301 redirect [webmasterworld.com]
9:51 am on Jan 22, 2003 (utc 0) Posted by: pardo
Looksmart uses "Grubs"....?!?
This gets a tad confusing. [webmasterworld.com] Note: Must be logged in to view this post.
1:43 am on Jan 9, 2003 (utc 0) Posted by: Pendanticist
htaccess doesn't stop this agent
i thought this would stop it [webmasterworld.com]
7:04 am on Jan 3, 2003 (utc 0) - Posted by: incywincy
Can't figure out why .htaccess is blocking. [webmasterworld.com]
8:49 pm on July 26, 2002 (utc 0) Posted by: bobothecat
3:58 pm on Mar 22, 2002 (utc 0) Posted by: volatilegx
Grub Crawler has new name [webmasterworld.com]
9:46 am on Feb 7, 2002 (utc 0) Posted by: Josk
visit's daily and ignores my robots.txt file [webmasterworld.com]
10:35 am on June 15, 2001 (utc 0) Posted by: jimbo_mac
...to one of the oldest.
Grub distributed crawling client [webmasterworld.com]
1:04 pm on May 13, 2001 (utc 0) Posted by: theperlyking
To reduce/eliminate bias, the only posts not included here were those posted by (now) expired accounts, were covered in cross posts or had no slant in content one way or the other.
(That's my disclaimer folks.)
Query terms used: 'Grubs', 'grub-client' and 'grub.org'
After reading thru almost everything ever posted here about this grub - it is my considered opinion that this crawler/spider/bot (et al.) be enabled with the ability to learn. Sound silly? Not to me. To learn that when a 301 code sends them off somewhere else - they need to go straight to the new location on the very next visit! To learn that when a 404 code tells them the file no longer exists - they need never go back for that old dead rotten file again...ever! And especially, to learn that when a 403 code tells them they aren't allowed - to never return!
As long as I pay the bandwidth fees for any crawler/spider/bot's usage of my domain (generally on a Surprise! basis at that), for whatever purposes, it is my choice to both throttle it by all/any means at my disposal and not have to take my time to have my url removed. That 403 I have set-up seems fairly self-explanatory to me....<so why not them?> <- Rhetorical Question.
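For anyone wondering what that kind of "learning" might look like, here is a minimal Python sketch. It is not how Grub works (nobody in this thread has seen its internals); the `CrawlMemory` class and its method names are made up for illustration. It just remembers what each URL answered last time and uses that to decide what to fetch on the next visit.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlMemory:
    """Hypothetical per-crawler memory of how each URL answered last time."""
    redirects: dict = field(default_factory=dict)  # old URL -> new URL (301)
    gone: set = field(default_factory=set)         # URLs that returned 404
    forbidden: set = field(default_factory=set)    # URLs that returned 403

    def record(self, url, status, location=None):
        """Store the lesson taught by one HTTP response."""
        if status == 301 and location:
            self.redirects[url] = location
        elif status == 404:
            self.gone.add(url)
        elif status == 403:
            self.forbidden.add(url)

    def next_target(self, url):
        """Return the URL to fetch on the next visit, or None to stay away."""
        seen = set()
        # Follow remembered 301s; the 'seen' set guards against redirect loops.
        while url in self.redirects and url not in seen:
            seen.add(url)
            url = self.redirects[url]
        if url in self.gone or url in self.forbidden:
            return None
        return url
```

With this, a URL that answered 301 is replaced by its new location on the next visit, and anything that answered 404 or 403 is simply never requested again. A real crawler would probably want to expire these entries eventually, since pages do come back.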
I think that Jeremy's unwittingly demonstrating another of Grub's shortfalls (hey believe me folks, I think it's neat...really). To wit: the folks that are incentivized to use Grub may be the spammers/cheaters that can benefit from its flaws.
I'm not calling Jeremy a spammer/cheater...what I'm saying is that the crowd that is eventually drawn to Grub will be the ones that can "get something out of it" if you know what I mean.
If they eventually decide to start paying people to use their crawler they will only add more problems to the ones they'll have--now people will be incentivized to use their crawler but the technology will still allow for the cheating.
And Jillbert, I think you're being a little hard on Jeremy. It may be true that you need to try something to see the benefits; but people don't go working at new jobs *for no pay* in order to feel them out. They want money whether the experience is good or bad. In fact, if the experience is/was bad the money is even more necessary :)
It's one thing to say "we're searching for life in space, help us out", it's totally another to say "hey, we're making skabillions of dollars! thanks saps!". I don't really foresee a bright future for Grub... :)
|As long as I pay the bandwidth fees for any crawler/spider/bot's usage of my domain (generally on a Surprise! basis at that), for whatever purposes, it is my choice to both throttle it by all/any means at my disposal and not have to take my time to have my url removed. |
i'll have to agree on that point. as a matter of fact - it's the grub client which brought me to this forum in the first place. their web site doesn't offer a lot of useful info on how to slow it down or stop it completely - so i came here looking for a .htaccess method. and since a few of the "hard core" computer enthusiasts ( the ones who rule most of the other distributed computing projects ;) ) are getting ready to jump into the "teams" grub is getting ready to authorise, we're going to see a HECK of a lot more traffic from this thing.
Yep, the .htaccess method (for the user agent) is the best way to get rid of the thing.
Welcome to WebmasterWorld, btw.
thanks for the welcome Jeremy!
now if we could just see a working .htaccess entry to get rid of this thing, i could quit making the "dent" in my desktop bigger ...
Check out this thread:
it has some pretty good info, and some links to other discussions on modifying .htaccess to prevent unauthorized bots from sucking down your bandwidth.
:) You also know, I'm guessing, that they distribute (purposefully) the same task to multiple clients in the Grub project?
And, that means you're going to get crawled multiple times per single indexing of your site.
Not only that, but if you dig deep into this forum (here) you will see that 2 years ago, Grub.org (when it was independent) had serious problems obeying robots.txt - Kord Campbell, the guy who was heading up the project, posted here about it a few times back then.
The problem, since they were bought by LookSmart, has still not been fixed.
So, it seems to me that perhaps they don't respect the webmasters who labor so much to generate the content they are trying hard to spider excessively (by their own admission).
Cool project. lol.
If Grub starts doing multiple crawls of our sites it is going to cost us a fair amount of money. We are already near the transfer limits on some sites, so any excess is going to cost real money. Can I see any benefit in letting Grub spider the sites? Not yet.
Jeremy: no need to guess. DC projects do send the same "tasks" to multiple clients. then verify the integrity of the data by comparing the results or checksums.
and Bobby: yes it will. the grub forum already has people complaining over there about bandwidth sucking which a lot of the people over there seem to have "little sympathy" for. maybe they should try paying for hosting instead of running a site on GeoCities or something and see what their feelings on the subject are then.
the grub board is one of about three or so boards i'll be posting the .htaccess entry to get rid of this thing on, if i can either cobble one together ( ... i'm like 0 for 7 on this one now ... ) or find a working one ... :)
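The "compare the results or checksums" step mentioned above for DC projects can be sketched in a few lines of Python. This is only an illustration under assumptions: the `verify` function, the client ids, and the strict-majority rule are hypothetical, and nothing here is taken from Grub's actual protocol.

```python
import hashlib
from collections import Counter


def checksum(page_bytes):
    """SHA-256 hex digest of one crawled page body."""
    return hashlib.sha256(page_bytes).hexdigest()


def verify(results):
    """Cross-check one URL's results from several clients.

    results maps a (hypothetical) client id to the page bytes it reported.
    Returns (accepted_checksum, dissenting_clients); accepted_checksum is
    None when no checksum is reported by a strict majority of clients.
    """
    sums = {client: checksum(body) for client, body in results.items()}
    if not sums:
        return None, []
    winner, votes = Counter(sums.values()).most_common(1)[0]
    if votes < len(sums) // 2 + 1:
        # No strict majority: nobody can be trusted over anybody else.
        return None, sorted(sums)
    dissenters = sorted(c for c, s in sums.items() if s != winner)
    return winner, dissenters
```

Two caveats, both already raised in this thread: every extra client assigned to a URL is another hit on the webmaster's bandwidth, and a majority vote does nothing against clients that collude. Pages that legitimately change between fetches would also trip a byte-for-byte checksum like this one.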
The grub clients have now started to crawl robots.txt files, which I must say is a good sign. Let's see if they start being obeyed.
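For reference, "obeying" robots.txt is not much work on the crawler side. Here is a small Python sketch using the standard library's `urllib.robotparser`. The sample rules, the `allowed` helper, and the `grub-client` user-agent token are assumptions for illustration; check your own logs for the string the client actually sends.

```python
from urllib.robotparser import RobotFileParser

# Sample rules a webmaster might publish. The "grub-client" token is an
# assumption, not a confirmed user-agent string.
ROBOTS_TXT = """\
User-agent: grub-client
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A crawler that fetched the file but then skipped a check like `allowed(...)` before each request would show up in the logs exactly the way described here: robots.txt gets requested, and the disallowed pages get requested anyway.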
I really hope they start showing some results soon, I'm starting to lose a little bit of interest as time goes by.