
Fast using PycURL UA? Not fetching robots.txt


jazzguy

12:23 am on Feb 22, 2003 (gmt 0)

10+ Year Member



Hi everybody. This is my first post to the forums here, although I've been visiting for a while.

I found an entry in my logs from "ws.lon4.fastsearch.net" with a user-agent string of "PycURL". It fetched one of my pages without fetching either robots.txt or "/". It appears to be a bot because it only grabbed the HTML page, not the images, linked stylesheets, or linked JavaScript files.

Does anybody know if Fast is using a new robot or why they aren't fetching robots.txt? From a Google search, I found that "PycURL is a Python interface to libcurl. PycURL can be used to fetch objects identified by a URL from a Python program, similar to the urllib Python module."
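
For reference, here's a minimal sketch of what a PycURL fetch looks like (modern Python 3 syntax; the URL is just a placeholder, not something from my logs):

import pycurl
from io import BytesIO

buf = BytesIO()
c = pycurl.Curl()
c.setopt(pycurl.URL, "http://example.com/page.html")  # placeholder URL
c.setopt(pycurl.WRITEFUNCTION, buf.write)             # collect the response body
# Unless the script overrides it with pycurl.USERAGENT, PycURL reportedly sends
# its own default User-Agent header, which would explain the bare "PycURL" in my logs.
c.perform()
c.close()
print(buf.getvalue().decode("iso-8859-1", errors="replace"))

Note that nothing in libcurl fetches robots.txt on its own; that's left entirely to the script author, which fits what I saw.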

wilderness

2:46 am on Feb 22, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



jazzguy,
Welcome to WebmasterWorld.
Thanks for the heads up :)

Fast offers a variety of custom services.
http://www.fastsearch.net/products/

I had them denied at one time, but allowed them back in recently on a suggestion I read here. [Quite an admission for me :) Not that I read anything here, but that I allowed a bot on somebody's suggestion.]
I don't recall seeing (at least in my logs) any UA other than "FAST-WebCrawler/3.6 (atw-crawler at fast dot no; [fast.no...]"

I recall some brief reading on Python and its compact programming methods. As a result, we might expect to see more Python in the future.

Were I in your shoes, I'd just watch for revisits with some kind of regularity.
I've denied bots, UAs, and IP ranges on a solitary probe/visit just because I didn't like what I found on either their webpage or the capabilities of the software they were using.

In the end it MUST be your choice as to the sensitivity of your data and what you desire it to be used for.
If that sensitivity encompasses unknown use and that is not your desire (be it plagiarism, infringement, or research, all of which use your bandwidth to retrieve your data), then just do a SetEnv on the UA, something like the sketch below.
As long as you don't deny their regular bot, that shouldn't create any problems for the normal Fast spidering and listings.
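
Something like this in .htaccess would do it (a sketch only: assumes Apache with mod_setenvif and mod_access, and "bad_bot" is just a variable name I made up):

# Tag any request whose User-Agent contains "PycURL", case-insensitively,
# then deny only the tagged requests; everything else passes through.
SetEnvIfNoCase User-Agent "PycURL" bad_bot
<Limit GET POST HEAD>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>

Because the match is on the UA string only, FAST-WebCrawler itself is unaffected.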

Don

JuniorHarris

7:10 pm on Feb 23, 2003 (gmt 0)

10+ Year Member



I'm like wilderness: I've blocked bots, UAs, and IP ranges simply because I did not like their activity. And sometimes I've done it just because I'm in a bad mood! Seriously, if I don't have any idea who or what a bot is about, and it does not provide any indication, then it is blocked.

If a bot doesn't provide any identification or indication of its intent, then I must assume it cannot be of value. After all, there really are only a handful of bots that can offer any value, and they [for the most part] clearly identify the bot and its intent. Even the bandwidth-consuming brand bots do this much.

If a bot cannot respect robots.txt, then that should be an indication of how it will treat your content!

jazzguy

8:04 pm on Feb 23, 2003 (gmt 0)

10+ Year Member



Thanks, guys. I decided to go ahead and ban the UA "PycURL" yesterday. It seems like PycURL could be used for site ripping, and I already ban "curl", "wget", and any other downloaders I find out about.

I agree that any legitimate bot should identify and provide information about itself, and also fetch and respect robots.txt. If it had not come from fastsearch.net, where the ATW crawler also comes from, I probably would have banned the IP (or IP range) too.

My policy is similar to what JuniorHarris said: if a bot's a mystery or it acts suspiciously, it gets blocked until I discover something positive about it.