
Legit Spider Usage?

Are there ways I can use spiders that don't cause problems for site owners?


OpeRabbit

6:45 pm on May 29, 2002 (gmt 0)



Warning - a long post follows, and it may all stem from my misunderstanding things!

Though I'm not a computer newbie, I am a newbie to these forums, so when I discovered reams of messages talking about problems with spiders, and how to block them, I wondered if:

1) My automatic web page downloader is what you people are calling a spider.

2) My use of such a program is causing problems for your site (which I definitely did not anticipate or want, though I feel I have a legitimate reason to use such a program).

Pardon me if I am missing the point entirely, but as I said, I'm new here. Let me start by explaining my use:

I have a computer station set up in a public museum to work as a standalone Internet kiosk. The focus of the station is allowing the public to access websites about insects, spiders (the OTHER kind! :)), and the like.

We had many reasons: speed, cost, ease of maintenance, security (we didn't want to give the general public access to our network), and content control (we didn't want sites unrelated to insects to be accessed, and we especially didn't want "inappropriate" sites on the screen when the next visitor sits down). So we decided to use a website download program to grab about 20 useful sites, as completely as was practical, and store those files on a local hard drive.

At the kiosk, the visitor is shown a list of these sites in a bookmark file and can explore them in as much depth as they like (limited, obviously, to the pages stored on the local drive). At times they come across a link that is not available and get a 404-type error, but since a typical visit to the station lasts about 5 minutes and we have about 135-150 MB of data stored, we don't think this is a major restriction.

These sites are not updated all that frequently, so about once a month we re-download them and transfer the pages to the kiosk. The program I am currently using for this task is "WebSite Extractor" by Asona (there's a bare-bones sketch of the kind of fetching I mean at the end of this post). Given that background information, here are my questions:

1) Is this the type of use that might cause problems for your site (assuming I were to visit it)? If so, how?

2) If this method does cause problems, are there things I can do to reduce or eliminate them? I obviously feel I have a legit reason for retrieving websites in this way, and while I don't want to cause problems for the site owners (I honestly never thought about it before visiting here), I can't see changing the kiosk to an open-access web station; cost, maintenance, and the other reasons above would prohibit that change.

3) If what I am doing is totally fine and others of you do similar things, I'd like your feedback on retrieval programs that you really like, and the reasons why. WebSite Extractor works pretty well for what I need, but I'm always open to suggestions for making things work even better.

Again, if I am totally off-base in my understanding of this, and my use of the site-downloading program is not a problem at all, then I apologize for wasting your time. In that case, ignore my first two questions and just let me know what you can about other good programs.

If my use DOES cause problems, then I'd love to know what can be done that helps both of us (I'm sure I'm not the only person in this situation). Thanks!
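P.S. In case it clarifies what I mean, here is a bare-bones Python sketch of the kind of fetching a program like mine does. It is only an illustration, not WebSite Extractor's actual code, and the site address, user-agent string, and delay below are all made up:

```python
import time
import urllib.request
import urllib.robotparser

# All values below are made up -- substitute your own site list,
# contact address, and delay between requests.
SITES = ["http://www.example.com/insects/"]   # one of the ~20 sites
USER_AGENT = "MuseumKioskGrabber/1.0 (+mailto:curator@example.org)"
DELAY_SECONDS = 2                             # pause between fetches

def polite_fetch(url):
    """Fetch one page, but only if the site's robots.txt allows it."""
    root = "/".join(url.split("/")[:3])       # scheme://host
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(root + "/robots.txt")
    rp.read()
    if not rp.can_fetch(USER_AGENT, url):
        print("robots.txt disallows", url)
        return None
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

for site in SITES:
    page = polite_fetch(site)
    if page is not None:
        print("fetched %d bytes from %s" % (len(page), site))
    time.sleep(DELAY_SECONDS)                 # be gentle on the server
```

A real grab would of course follow the links within each site; the point here is the identifying user-agent, the robots.txt check, and the pause between requests.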

[end of novella],

OpeRabbit

wilderness

8:40 pm on May 29, 2002 (gmt 0)




OpeRabbit
Welcome to WebmasterWorld, and thanks for your gracious and thoughtful inquiry.

I can't claim to speak for every webmaster on the web. Every website, yours included, has its own needs and purpose.

Yours is one of the rare cases of these software applications being used positively. I have a valued website visitor who uses a similar spider in a similar situation to yours, and I make allowances for him.

The major problem with these types of software is that the website being visited has NO clue about your intent or application, since the only UA (user-agent) specified is that of the software itself. The result is unnecessary expense and visits from commercial harvesting, research scraping, or just plain theft, against which some action or prevention must be taken.
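For instance, here is a bare Python sketch of the sort of check a webmaster might run to see which UAs are hitting a site (the log file name is made up, and it assumes Apache's "combined" log format, where the UA is the last quoted field on each line):

```python
# Tally the user-agents seen in a web server access log.
import collections
import re

counts = collections.Counter()
with open("access.log") as log:               # made-up path
    for line in log:
        # In the combined format the UA is the last quoted field.
        match = re.search(r'"([^"]*)"\s*$', line)
        if match:
            counts[match.group(1)] += 1

# Show the ten most common user-agents and their hit counts.
for ua, hits in counts.most_common(10):
    print(hits, ua)
```

An offline-browser UA in that list tells you a grab happened, but nothing about whether it came from a museum kiosk or a wholesale content thief.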

In your special instance?
I personally would take it as a compliment to be emailed PRIOR to your caching run: asking permission, explaining what software you are using, and perhaps giving the date and time at which you read the robots.txt. That would provide an IP number so that allowances could be made.
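For example (a minimal Python sketch; the address and user-agent string are made up, and any way of noting the time would do just as well):

```python
# Fetch a site's robots.txt and note exactly when it was read, so the
# webmaster can match the request (and your IP number) in their logs.
import urllib.request
from datetime import datetime, timezone

UA = "MuseumKioskGrabber/1.0 (+mailto:curator@example.org)"  # made up
req = urllib.request.Request("http://www.example.com/robots.txt",
                             headers={"User-Agent": UA})
fetched_at = datetime.now(timezone.utc)
with urllib.request.urlopen(req) as resp:
    rules = resp.read().decode("utf-8", errors="replace")

print("Read robots.txt at", fetched_at.isoformat())
print(rules)
```

Put that timestamp in your email, and the webmaster can find the matching line, complete with your IP number, in the access logs.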

Hope this helps.