grub-crawler.vistavdi.com - "Wget/1.6"
07 Jun -- 22:55:42 -- 10 simultaneous requests
07 Jun -- 22:55:43 -- 2 simultaneous requests
07 Jun -- 22:55:44 -- 9 simultaneous requests
After digging through my logs, I discovered that it's made multiple visits in the past, never requesting more than my root directory ( "/" ). I guess it decided not to be so shy anymore...
Wget/1.6
crawler.grub.org
206.30.142.100
Hi all,
I've seen a few reports from webmasters that grub is not honouring the robots.txt file, and that it may be requesting multiple pages either very quickly or simultaneously. Unfortunately, because of this behaviour, grub may be seen as a badly behaved crawler, and there is already some talk of banning the grub crawler from web sites.
It may help to keep webmasters on your side if you look at implementing robots.txt support and slowing down the crawling of individual sites a little.
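For what it's worth, Wget itself already ships with switches for this kind of politeness, so the fix may be as simple as changing how the client invokes it. A rough illustration only; check your Wget version's manual for exact flag support, and the URL here is just a placeholder:

    # pause between retrievals and honour robots.txt
    wget --wait=2 -e robots=on http://example.com/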
Thanks :)
We are posting to hopefully answer some questions about what we are doing and try to make things right for the people we have troubled with our crawlers. We are here to (hopefully) make the Internet a BETTER place, not to jack it up or make your life miserable. We apologize for any problems that our bot has caused and welcome your feedback on what we could do to make it better and more friendly to your site(s).
First off, here's a breakdown of what we are doing right now, and the changes that we have made in the past few months:
1. We currently use Wget/1.6 in the client to fetch pages and do so with permission from Wget's maintainer. More info about Wget is available from [gnu.org...]
2. We are planning on switching to the cURL libraries within a month or so from this post. When we do, we will start using grub-client-x.x-x as the UA identifier (there's a rough sketch of what that will look like after this list). More info on cURL is at [curl.haxx.se...]
3. We are currently scheduling/indexing sites taken from the DMOZ site. If you are listed on DMOZ, then we are definitely indexing you.
4. We have NOT been recursively crawling sites, nor have we been adding sites to our indexed URL database, other than user submitted sites. We will begin adding new URLs once we have completed yanking the robots.txt files from the servers listed in our database. This will probably occur toward the first of September.
5. Prior to the 15th of July, it was possible for us to assign multiple URLs for a particular host to a single crawler, which in turn would fire up multiple connections to the host to yank the pages. This behavior has been mostly eliminated by randomizing the order of the URLs in our database. We will continue to do this to minimize the impact of the crawler on sites with multiple URLs.
6. Prior to last week, we were NOT pulling back the robots.txt files from your servers. If we haven't yanked yours yet, we will within a week or so of this post. We know that this seems like strange behavior from a crawler, but then again we were not recursively crawling sites. Assuming that DMOZ's list contained URLs that wanted to be indexed *seemed* to be a good idea at the time, and in fact the few people that emailed us complaining about the crawler cited only the fact that we were crawling multiple URLs at a time as a problem. None of the URLs we were crawling appeared in their robots.txt files upon inspection. Regardless of this silliness on our part, the scheduler now honors the robots.txt file as it should. We will also be honoring all entries labeled "grub-client" in the robots.txt files, should you wish to limit our crawler on your site (see the example after this list).
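For the curious, here's roughly what a fetch will look like once we make the switch mentioned in item 2. This is a simplified sketch rather than our production code (assuming a reasonably current libcurl), and the exact version string in the UA is still hypothetical:

    #include <stdio.h>
    #include <curl/curl.h>

    int main(void)
    {
        CURL *curl;
        CURLcode res;

        curl_global_init(CURL_GLOBAL_ALL);
        curl = curl_easy_init();
        if (curl) {
            /* placeholder URL; the UA follows the naming described above */
            curl_easy_setopt(curl, CURLOPT_URL, "http://example.com/");
            curl_easy_setopt(curl, CURLOPT_USERAGENT, "grub-client-1.0-1");
            res = curl_easy_perform(curl);
            if (res != CURLE_OK)
                fprintf(stderr, "fetch failed: %s\n", curl_easy_strerror(res));
            curl_easy_cleanup(curl);
        }
        curl_global_cleanup();
        return 0;
    }

And per item 6, once the scheduler honors robots.txt, keeping our crawler off your site entirely is just the usual entry:

    User-agent: grub-client
    Disallow: /

or, to fence off a single directory:

    User-agent: grub-client
    Disallow: /cgi-bin/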
The main thrust of our project is to lower refresh time for crawlers and at the same time lower traffic to the sites being indexed, by allowing them to index their own content and submit it back to Grub and other search engines. During beta testing things don't always go as planned, and as such we've made a few of your lives a bit more interesting than they should have been.
If you would like more information, or simply want all of your URLs taken out of our index, please contact us at support@grub.org; we will be more than happy to help.
Kord Campbell
kord@grub.org
If both of those problems have indeed been fixed, I may remove the grub crawler from my htaccess list...
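For anyone else who wants to keep it blocked in the meantime, something along these lines should work in .htaccess. These are standard Apache directives, but the patterns are my own guesses based on the UA strings reported in this thread, and the Wget match will also catch legitimate Wget users, so adjust to taste:

    SetEnvIfNoCase User-Agent "^Wget" block_grub
    SetEnvIfNoCase User-Agent "grub-client" block_grub
    <Limit GET POST HEAD>
    Order Allow,Deny
    Allow from all
    Deny from env=block_grub
    </Limit>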