how to gently spider

         

alexmg

9:53 am on Dec 5, 2007 (gmt 0)


Greetings,
I am very sorry if I'm posting in the wrong forum, or if the answer is available somewhere else, but I was unable to find any other resource.
Funny as it may seem, I am asking the webmaster community (which usually tries to limit how much of its resources gets spidered away) for help with a little spidering project of mine.

The problem: I have a LARGE list of books that I would like to index by author/title/ISBN/reviews...
As you can imagine, all this info is available on Amazon.
I read something about "Amazon Web Services", but they don't seem appropriate for this project.

So I checked their robots.txt (at the bottom of this message), and as far as I can tell it doesn't require me to stay away from the pages I need...
so I wrote a spider in Perl and I'm currently running it.

My question for you is about its fairness:
1) Am I allowed to do this?
2) Currently I run a search every 3+int(rand(30)) seconds:
do you think that delay is too short? Too long? Am I worrying too much?
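
For reference, the Perl expression 3+int(rand(30)) yields a whole number of seconds from 3 to 32, so about 17.5 seconds on average, or roughly 200 requests per hour. Below is a minimal sketch of the same throttling idea, written in Python rather than the Perl of the actual spider. The names polite_delay and crawl, and the fetch callable, are hypothetical and only illustrate the pattern.

```python
import random
import time


def polite_delay(base=3, spread=30):
    """Return a whole-second delay, mirroring Perl's base + int(rand(spread))."""
    # random.randrange(spread) gives an integer in 0..spread-1,
    # which matches int(rand(spread)) in Perl.
    return base + random.randrange(spread)


def crawl(urls, fetch, sleep=time.sleep):
    """Fetch each URL in turn, pausing a random interval between requests.

    `fetch` is a caller-supplied function (hypothetical here) that takes a
    URL and returns whatever the spider wants to keep from the response.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(polite_delay())
        results.append(fetch(url))
    return results


if __name__ == "__main__":
    # Show a few sample delays without actually sleeping.
    print([polite_delay() for _ in range(5)])
```

Passing the sleep function in as a parameter is only a convenience: it lets the loop be tried out without really waiting.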

Sorry again if it's the wrong question or the wrong place, but I found it hard to find info (apart from the good O'Reilly "Spidering Hacks", of course).

Thank you all!

Alessandro

--- Amazon's robots.txt

# Disallow all crawlers access to certain pages.

User-agent: *
Disallow: /exec/obidos/account-access-login
Disallow: /exec/obidos/change-style
Disallow: /exec/obidos/flex-sign-in
Disallow: /exec/obidos/handle-buy-box
Disallow: /exec/obidos/tg/cm/member
Disallow: /gp/cart
Disallow: /gp/flex
Disallow: /gp/product/e-mail-friend
Disallow: /gp/product/product-availability
Disallow: /gp/product/rate-this-item
Disallow: /gp/sign-in
Disallow: /gp/reader
Disallow: /gp/sitbv3/reader
Disallow: /gp/richpub/syltguides/create
Disallow: /gp/customer-media
Disallow: /gp/gfix
Disallow: /gp/associations/wizard.html
Disallow: /gp/dmusic/order
Disallow: /gp/legacy-handle-buy-box.html
Disallow: /gp/aws/ssop
Disallow: /gp/yourstore
Disallow: /gp/gift-central/organizer/add-wishlist
Disallow: /gp/gurupamacro
Disallow: /gp/vote
Disallow: /gp/music/wma-pop-up
Disallow: /gp/customer-images
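
Rules like the ones above can also be checked by the spider itself before each request, instead of by eye. Here is a minimal sketch using Python's standard urllib.robotparser module. It embeds only an excerpt of the quoted rules; the user-agent string "my-book-spider" and the second test URL are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Excerpt of the rules quoted above; the full list would be pasted the same way.
ROBOTS_TXT = """\
# Disallow all crawlers access to certain pages.

User-agent: *
Disallow: /exec/obidos/account-access-login
Disallow: /gp/cart
Disallow: /gp/sign-in
Disallow: /gp/reader
Disallow: /gp/vote
"""


def build_parser(robots_text):
    """Build a parser from robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def is_allowed(parser, url, user_agent="my-book-spider"):
    """Return True if the rules permit this user agent to fetch the URL."""
    # "my-book-spider" is a hypothetical name; the "User-agent: *" group
    # applies to any agent that has no more specific group of its own.
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    for url in (
        "http://www.amazon.com/gp/cart/view.html",
        "http://www.amazon.com/hypothetical-book-page",  # made-up path
    ):
        print(url, is_allowed(parser, url))
```

The matching is by path prefix, so anything under /gp/cart is refused while a path that starts with none of the listed prefixes is allowed.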

wilderness

12:17 am on Dec 8, 2007 (gmt 0)


I replied previously in alt.internet.search-engines.