grub-crawler.vistavdi.com - "Wget/1.6"
07 Jun -- 22:55:42 -- 10 simultaneous requests
07 Jun -- 22:55:43 -- 2 simultaneous requests
07 Jun -- 22:55:44 -- 9 simultaneous requests
After digging through my logs, I discovered that it's made multiple visits in the past, never requesting more than my root directory ( "/" ). I guess it decided not to be so shy anymore...
Wget/1.6
crawler.grub.org
206.30.142.100
Hi all,
I've seen a few reports from webmasters that grub is not honouring the robots.txt file, and that it may be requesting multiple pages either very quickly or simultaneously. Unfortunately, because of this behaviour, grub may be seen as a badly behaved crawler, and there is already some talk of banning the grub crawler from web sites.
It may help to keep webmasters on your side if you look at implementing robots.txt support and slowing down the crawling of individual sites a little.
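For what it's worth, Wget itself already ships with switches for this kind of politeness, so the fix may be as simple as changing how the client invokes it. A rough illustration only; check your Wget version's manual for exact flag support, and the URL here is just a placeholder:

    # pause between retrievals and honour robots.txt
    wget --wait=2 -e robots=on http://example.com/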
Thanks :)
We are posting to hopefully answer some questions about what we are doing and try to make things right for the people we have troubled with our crawlers. We are here to (hopefully) make the Internet a BETTER place, not to jack it up or make your life miserable. We apologize for any problems that our bot has caused and welcome your feedback on what we could do to make it better and more friendly to your site(s).
First off, here's a breakdown of what we are doing right now, and the changes that we have made in the past few months:
1. We currently use Wget/1.6 in the client to fetch pages and do so with permission from Wget's maintainer. More info about Wget is available from [gnu.org...]
2. We are planning on switching to the cURL libraries within a month or so from this post. When we do, we will start using grub-client-x.x-x as the UA identifier (there's a rough sketch of what that will look like after this list). More info on cURL is at [curl.haxx.se...]
3. We are currently scheduling/indexing sites taken from the DMOZ site. If you are listed on DMOZ, then we are definitely indexing you.
4. We have NOT been recursively crawling sites, nor have we been adding sites to our indexed URL database, other than user submitted sites. We will begin adding new URLs once we have completed yanking the robots.txt files from the servers listed in our database. This will probably occur toward the first of September.
5. Prior to the 15th of July, it was possible for us to assign multiple URLs for a particular host to a single crawler, which in turn would fire up multiple connections to the host to yank the pages. This behavior has been mostly eliminated by randomizing the order of the URLs in our database. We will continue to do this to minimize the impact of the crawler on sites with multiple URLs.
6. Prior to last week, we were NOT pulling back the robots.txt files from your servers. If we haven't yanked yours yet, we will within a week or so of this post. We know that this seems like strange behavior from a crawler, but then again we were not recursively crawling sites. Assuming that DMOZ's list contained URLs that wanted to be indexed *seemed* to be a good idea at the time, and in fact the few people that emailed us complaining about the crawler cited only the fact that we were crawling multiple URLs at a time as a problem. None of the URLs we were crawling appeared in their robots.txt files upon inspection. Regardless of this silliness on our part, the scheduler now honors the robots.txt file as it should. We will also be honoring all entries labeled "grub-client" in the robots.txt files, should you wish to limit our crawler on your site (see the example after this list).
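For the curious, here's roughly what a fetch will look like once we make the switch mentioned in item 2. This is a simplified sketch rather than our production code (assuming a reasonably current libcurl), and the exact version string in the UA is still hypothetical:

    #include <stdio.h>
    #include <curl/curl.h>

    int main(void)
    {
        CURL *curl;
        CURLcode res;

        curl_global_init(CURL_GLOBAL_ALL);
        curl = curl_easy_init();
        if (curl) {
            /* placeholder URL; the UA follows the naming described above */
            curl_easy_setopt(curl, CURLOPT_URL, "http://example.com/");
            curl_easy_setopt(curl, CURLOPT_USERAGENT, "grub-client-1.0-1");
            res = curl_easy_perform(curl);
            if (res != CURLE_OK)
                fprintf(stderr, "fetch failed: %s\n", curl_easy_strerror(res));
            curl_easy_cleanup(curl);
        }
        curl_global_cleanup();
        return 0;
    }

And per item 6, once the scheduler honors robots.txt, keeping our crawler off your site entirely is just the usual entry:

    User-agent: grub-client
    Disallow: /

or, to fence off a single directory:

    User-agent: grub-client
    Disallow: /cgi-bin/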
The main thrust of our project is to lower refresh time for crawlers and at the same time lower traffic to the sites being indexed, by allowing them to index their own content and submit it back to Grub and other search engines. During beta testing things don't always go as planned, and as such we've made a few of your lives a bit more interesting than they should have been.
If you would like more information, or simply want all of your URLs taken out of our index, please contact us at support@grub.org; we will be more than happy to help.
Kord Campbell
kord@grub.org
If both of those problems have indeed been fixed, I may remove the grub crawler from my htaccess list...
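For anyone else who wants to keep it blocked in the meantime, something along these lines should work in .htaccess. These are standard Apache directives, but the patterns are my own guesses based on the UA strings reported in this thread, and the Wget match will also catch legitimate Wget users, so adjust to taste:

    SetEnvIfNoCase User-Agent "^Wget" block_grub
    SetEnvIfNoCase User-Agent "grub-client" block_grub
    <Limit GET POST HEAD>
    Order Allow,Deny
    Allow from all
    Deny from env=block_grub
    </Limit>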