
grub and ZyBorg

what's going on?

         

claus

12:47 pm on Jun 19, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hello all,

I have consulted the various subjects on WebmasterWorld on grub. I have also consulted the grub forums as well as conducted a google search on the subject. I'm a bit puzzled.

As far as I can judge, the grub spider is some kind of client that users run to crawl pages, i.e. a "distributed spider". The crawled pages are apparently being used in some unspecified manner by the wisenut SE.

Yesterday I had three different versions of the grub UA/spider visiting one of my sites. From 11 different IPs, and fetching 11 different pages.

On the same day, I had the ZyBorg spider (from wisenut) visiting four different pages, one of which was also visited by grub 19 hours later.

None of the 11+4=15 pageviews had referrer info. Meaning: They went straight for the page off some list, grabbed it, and then left.

All requests were GET over HTTP/1.1 - meaning that the server delivered the whole page to the User-Agent, not just the headers (as it would for a HEAD request).
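Hits like these can be picked out of an Apache combined-format access log by filtering on a User-Agent substring and an empty referrer. A minimal sketch - the log lines and the "grub" UA token below are invented for illustration, so check your own logs for the real token:

```python
import re

# One line of Apache "combined" log format:
# ip - - [time] "request" status bytes "referrer" "user-agent"
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def crawler_hits(lines, agent_substring):
    """Return (ip, path) pairs for requests that match a user-agent
    substring and carry no referrer (i.e. fetched straight off a list)."""
    hits = []
    for line in lines:
        m = LOG_RE.match(line)
        if m is None:
            continue
        if agent_substring.lower() not in m['agent'].lower():
            continue
        if m['referrer'] in ('', '-'):   # "-" means no Referer header was sent
            parts = m['request'].split(' ')
            path = parts[1] if len(parts) > 1 else ''
            hits.append((m['ip'], path))
    return hits
```

Running this over a day's log with a handful of UA substrings ("grub", "ZyBorg") makes the pattern of list-driven fetches easy to see.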

It's not a lot of pageviews and hence not a lot of server load. Plus: I do not mind crawlers as long as they are well-behaved.

I have tried entering the site URL in the search box on grub.org, but no matches were returned.

Now, I'm thinking: What's going on?

Are the pages delivered to the grub UAs "real pageviews" in a browser, or are they just being fed into a database with no one actually looking at the pages?

Is the distributed spider "well behaved"? I have seen their declaration that they support robots.txt, but my concerns are not with this:

Rather: Is it a coincidence that 11 different IPs fetch 11 different pages? If so, why did ZyBorg request one of the same pages first?

Is there a risk that, as more people run the grub client, they will eventually flood the site?

Does anyone have any opinion about this? Experience? Or knowledge?

Thanks
/claus

wilderness

3:28 pm on Jun 19, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



For the site search?
Just use "grub" and you'll see plenty.

claus

12:48 pm on Jun 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



wilderness:

I have consulted the various subjects on WebmasterWorld on grub

...I did see plenty, just as you thought I would, and I read a lot, but I still don't get it. I've tried this site, Google, and the grub homepage and forums as well, but none of these sources has a good explanation of exactly what grub does, when it does so, and for what purpose.

Anyway, I will observe the client for some time before banning it - so far it has behaved nicely.

/claus

wilderness

6:05 pm on Jun 21, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Claus,
Grub is some kind of open-source bot. There are several versions, and the IPs belong to the individual users running the client rather than to the software's operator.
Many grub users abuse the tool.

Many folks have it denied.

As for your own choice? Keep in mind that each webmaster does what they believe is best for the operation of their own website, whether that means allowing or denying a bot that others would treat differently.

claus

7:41 pm on Jun 22, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks. I think it's good to apply some personal judgement instead of just blindly accepting and implementing what others think may be a good idea.

This sounds a bit odd to me, though. I think it might be due to early versions misbehaving:

Many grub users abuse the tool

- I've read a lot in the grub forum lately to get an understanding of what is going on, and the thing that strikes me most is that quite a few topics concern the same issue: "are these data being used or not - and if so, where and when?"

It seems to me that a lot of grub users have downloaded and installed the client, performed extensive data collecting with it, and seen the funny worm pictures... without knowing a thing about what the client actually does. They are able to spider their own sites, but as far as I can tell, they cannot decide which other sites are spidered, nor can they benefit in any way from the scan of their own site.

The company mentions the wisenut engine, but grub-users don't seem to have been able to track any of their efforts in this SE.

It has gone so far that users have looked at Wisenut, Looksmart, Zeal, and even MSN trying to make sense of this, but so far absolutely no results seem to have surfaced.

There are rumors of a major wisenut upgrade sometime this year, but it is all speculation, and the grub representatives refuse to answer specific questions about the use of the data - although they do participate when it comes to questions about the operation of the client.

Regarding possible abuse, grub seems to have done a few things: it respects robots.txt, and there is some mechanism that ensures no more than five pages are fetched from the same site in a row (see their FAQ), which leads me to believe that the bot is actually well behaved now.
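For anyone who would rather opt out, a crawler that honors robots.txt can be excluded with a stanza like this - note that the exact user-agent token grub announces is an assumption here, so verify it against your own logs before relying on it:

```
# robots.txt - exclude the grub client (UA token assumed; verify in your logs)
User-agent: grub-client
Disallow: /
```

This only works against well-behaved crawlers, of course; a client that ignores robots.txt has to be blocked at the server instead.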

I guess there's really not much more to say about it, apart from "let's wait and see". Future developments will most likely be off-topic in the "SE Spider ID" forum (and this thread is quickly getting there too; I apologize for that).

What bothers me, though, is not so much whether the robot is well behaved - rather, I would like to know where in the world the data collected from my sites end up, and whether this bot is actually spidering or really just ripping instead.

Well, nobody knows, I guess, so I'll just have to keep on researching and stop writing about it for now - otherwise it will end up as pure speculation.

Nuff said, I'll get back to work :)

WitchLars

9:38 pm on Jun 22, 2003 (gmt 0)

10+ Year Member



My primary concern with Grub (or any distributed spider, for that matter) is spoofing. With Google and friends, it is quite easy to verify their authenticity by comparing the USER_AGENT against the IP. With Grub there is no way to verify whether it is a legitimate crawl or someone trying to sneak under the radar. With that in mind, I've opted to block Grub in robots.txt and with mod_rewrite.
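A mod_rewrite block of that kind might look like the following .htaccess fragment - this is a sketch, and the case-insensitive "grub" substring match is an assumption about what the client actually sends in its User-Agent header:

```apache
# .htaccess - deny (403) any request whose User-Agent contains "grub"
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} grub [NC]
RewriteRule .* - [F]
```

Unlike a robots.txt entry, this is enforced by the server itself, so it also stops clients that ignore the robots exclusion protocol (at least until they change their User-Agent string).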

Just my two cents...

-Lars

claus

10:36 pm on Jun 30, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Just to follow up:

After my last post I decided to block the user agent grub.

Since doing so I've been visited by ZyBorg again:

14 visits - 816 files - 1503 hits.

The site in question has around 70-something pages, so this is a deep scan. From my logs I can see that redirects like this are being followed:

GET /somedirectory/lnk.cgi?name=IDtoLinkTo HTTP/1.1

/claus

claus

8:17 am on Jul 1, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I did a closer inspection of the logfiles this morning. Apparently I was wrong in the post above: I did not block grub until yesterday, when I posted the latest edition of my .htaccess.

Now I'm being visited by grub, ZyBorg, and Zealbot on the same days. Seems like the looksmart company has some kind of internal bot marathon going on.

/claus