Wget creators respond!

...

icehousedesigns

12:35 am on Jul 27, 2001 (gmt 0)



Chris,

I know that this reply is late. I had deleted the original
by accident and then discovered later that I had not answered
your question.

You are half right about the wget clients crawling your site. We
don't actually run wget recursively, so it never even looks at
robots.txt or the robots META tags; it just fetches what we tell
it to. However...

You see multiple connections because we store your URLs in the
database in order and hand them out to clients in that same order.
Each client crawls many URLs at once, so when a client receives a
run of your URLs it opens several simultaneous connections to your
server.
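
For illustration, here is a minimal sketch of the in-order handout described above; the table contents, batch size, and function names are made up for the example and are not Grub's actual code.

    # Hypothetical URL table: ids were assigned in insertion order, so one
    # site's pages sit next to each other in the database.
    urls = [
        (135690065, "http://site-a.example/page1.html"),
        (135690066, "http://site-a.example/page2.html"),
        (135690067, "http://site-a.example/page3.html"),
        (135757491, "http://site-b.example/index.html"),
    ]

    BATCH_SIZE = 3  # illustrative; the real handout size is not stated

    def next_batch(url_table, cursor):
        # Hand out the next BATCH_SIZE URLs in url_id order.
        return url_table[cursor:cursor + BATCH_SIZE], cursor + BATCH_SIZE

    batch, cursor = next_batch(sorted(urls), 0)
    # Every URL in this first batch points at site-a.example, so a client
    # fetching the batch in parallel opens three simultaneous connections
    # to that single host.
    print([url for _, url in batch])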

We had implemented a randomizing scheduler to fix this problem,
but the load on our scheduling server was too great to keep it
running and still schedule normally.

Regardless, I have just finished a full randomization of the URL
IDs in our database using a new utility that I wrote today.
Statistically, it should prevent your URLs from being crawled in
sequence again. I apologize that we had not done this sooner, but
we've had our hands full getting other aspects of the project
going, and I only now had the time to code up the fix.
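
A minimal sketch of the kind of one-time reshuffle described here, assuming a simple in-memory table; the real utility presumably worked directly against the Grub database.

    import random

    def randomize_url_ids(url_table):
        # Reassign the existing url_ids to URLs at random, so that reading
        # the table back in url_id order interleaves different sites.
        ids = [url_id for url_id, _ in url_table]
        random.shuffle(ids)
        return sorted(zip(ids, (url for _, url in url_table)))

    table = [
        (1, "http://site-a.example/1"), (2, "http://site-a.example/2"),
        (3, "http://site-b.example/1"), (4, "http://site-b.example/2"),
    ]
    # After the reshuffle, consecutive url_ids statistically mix hosts, so
    # no single site sees a burst of simultaneous requests.
    print(randomize_url_ids(table))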

I see that you have blocked wget from crawling your site; you may
wish to re-enable wget access so that we may continue indexing
your pages. I checked the database, and none of the URLs we hold
for your site match the exclusions in your robots.txt file, and we
will continue to honor your robots.txt file should we add any of
your other URLs to our database. For your reference, I have
included below the list of your URLs that are in our database.
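
A minimal sketch of the robots.txt check promised above, using Python's standard urllib.robotparser; the path and agent string here are examples, not Grub's actual values.

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("http://icehousedesigns.com/robots.txt")
    rp.read()  # fetch and parse the site's live robots.txt

    candidate = "http://icehousedesigns.com/private/page.html"  # hypothetical
    if rp.can_fetch("grub-client", candidate):
        print("OK to add to the database:", candidate)
    else:
        print("Excluded by robots.txt, skipping:", candidate)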

If you have any questions, please let me know.

Thanks,

Kord

--------------------------------------------------------------
Kord Campbell          Grub.Org Inc.
President              6051 N. Brookline #118
                       Oklahoma City, OK 73112
kord@grub.org          Voice: (405) 843-6336
[grub.org...]          Fax: (405) 848-5477
--------------------------------------------------------------
+-------------------------------------------------------------+-----------+
| the_url                                                     | url_id    |
+-------------------------------------------------------------+-----------+
| [icehousedesigns.com...]                                    | 137735336 |
| [icehousedesigns.com...]                                    | 136865414 |
| [icehousedesigns.com...]                                    | 135690065 |
| [icehousedesigns.com...]                                    | 137488299 |
| [icehousedesigns.com...]                                    | 137065677 |
| [icehousedesigns.com...]                                    | 137358587 |
| [icehousedesigns.com...]                                    | 135757491 |
| [icehousedesigns.com...]                                    | 138543593 |
| [icehousedesigns.com...]                                    | 137051291 |
| [icehousedesigns.com...]                                    | 136893366 |
| [icehousedesigns.com...]                                    | 138623902 |
| [icehousedesigns.com...]                                    | 138364801 |
| [icehousedesigns.com...]                                    | 138805982 |
| [icehousedesigns.com...]                                    | 136993962 |
| [icehousedesigns.com...]                                    | 138841029 |
| [icehousedesigns.com...]                                    | 137004924 |
| [icehousedesigns.com...]                                    | 136723798 |
| [icehousedesigns.com...]                                    | 137138806 |
| [icehousedesigns.com...]                                    | 138863830 |
| [icehousedesigns.com...]                                    | 136274889 |
| [icehousedesigns.com...]                                    | 137325407 |
| [icehousedesigns.com...]                                    | 136843068 |
| [icehousedesigns.com...]                                    | 137110390 |
| [icehousedesigns.com...]                                    | 135845369 |
| [icehousedesigns.com...]                                    | 136963876 |
| [icehousedesigns.com...]                                    | 138801774 |
| [icehousedesigns.com...]                                    | 138707739 |
+-------------------------------------------------------------+-----------+
---------- Forwarded message ----------
Date: Thu, 26 Jul 2001 10:41:49 -0500
From: Igor Stojanovski <ozra@grub.org>
To: Kord Campbell <kord@grub.org>
Subject: FW: Wget software

-----Original Message-----
From: Christopher Lover [mailto:chrisl@worldpath.net]
Sent: Saturday, June 30, 2001 7:10 AM
To: support@grub.org
Subject: Wget software

Greetings,

I have recently been forced to ban a large number of domains running
the Wget software from hitting my website (not knowing much about the
project at the time). I have since discovered what this project is
about, and decided to contact you directly.

It seems Wget is a very unforgiving piece of software that will
simultaneously request several HTTP documents from a specific domain
within the same second, sometimes from multiple IPs at the same time.
Several times over the past couple of weeks, within a two-minute
window, I have been hit by over 20 different IPs around the world
requesting multiple documents at the same time or within seconds of
each other, all running the Wget agent for the grub project.

As you all know, webmasters can get very tweaked out when any spider
or bot appears to misbehave, and it will get banned either at the
server level (.htaccess) or via robots.txt. (Wget seems to ignore the
robots exclusion protocol, so the latter is not an option.) So
unfortunately many of us webmasters have resorted to writing custom
rules that deny access to any Wget User-Agent instead of banning
individual IPs. We do not want to exclude our sites from being
crawled unless we have to; I would much rather see the spider be
friendlier and comply with the robots exclusion protocol (similar to
googlebot @ google.com).
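
For reference, one common way to apply the kind of server-level ban mentioned above is an Apache .htaccess rule like the following sketch (the environment-variable name is arbitrary, and these are not Christopher's actual rules):

    # .htaccess: deny any request whose User-Agent contains "Wget"
    # (requires mod_setenvif; classic Apache Order/Allow/Deny syntax)
    SetEnvIfNoCase User-Agent "Wget" block_wget
    Order Allow,Deny
    Allow from all
    Deny from env=block_wget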

Just thought I would bring this matter to your attention.

Christopher Lover
Webmaster, Icehouse Designs
[icehousedesigns.com...]

roscoepico

1:31 am on Jul 27, 2001 (gmt 0)




Thanks for posting the response. I noticed IceHouseDesigns is in Farmington, NH... I'm in Dover... small world...

icehousedesigns

2:03 pm on Jul 27, 2001 (gmt 0)



:)