I found an entry in my logs from "ws.lon4.fastsearch.net" with a user-agent string of "PycURL". It fetched one of my pages without fetching either robots.txt or "/". It appears to be a bot because it only grabbed the HTML page, not the images, linked stylesheets, or linked JavaScript files.
Does anybody know if Fast is using a new robot or why they aren't fetching robots.txt? From a Google search, I found that "PycURL is a Python interface to libcurl. PycURL can be used to fetch objects identified by a URL from a Python program, similar to the urllib Python module."
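For reference, a minimal fetch with PycURL looks roughly like this (just a sketch; the URL is a placeholder and the UA is set explicitly here only for illustration):

  import pycurl
  from io import BytesIO

  buf = BytesIO()
  c = pycurl.Curl()
  c.setopt(pycurl.URL, "http://www.example.com/somepage.html")  # placeholder URL
  c.setopt(pycurl.USERAGENT, "PycURL")         # send the UA string seen in my logs
  c.setopt(pycurl.WRITEFUNCTION, buf.write)    # collect the response body
  c.perform()
  c.close()
  html = buf.getvalue()

Nothing in that kind of script touches robots.txt unless the author adds it, which would explain what I'm seeing.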
Fast offers a variety of custom services.
http://www.fastsearch.net/products/
I had them denied at one time, but allowed them back in recently on a suggestion I read here. [Quite an admission for me :) Not that I read anything here, but that I allowed a bot on somebody's suggestion.]
I don't recall seeing (at least in my logs) any UA other than "FAST-WebCrawler/3.6 (atw-crawler at fast dot no; [fast.no...]".
I recall some brief reading on Python and its compact programming style. As a result, we might expect to see more Python-based bots in the future.
Were I in your shoes, I'd just watch for revisits with some kind of regularity.
I've denied bots, UAs, and IP ranges on a solitary probe/visit just because I didn't like what I found on their webpage or in the capabilities of the software they were using.
In the end it MUST be your choice how sensitive your data is and what you want it used for.
If that sensitivity encompasses unknown use and that is not your desire (be it plagiarism, infringement, or research, all of which use your bandwidth to retrieve your data), then just do a SetEnvIf on the UA.
As long as you don't deny their regular bot, that shouldn't create any problems for normal Fast spidering and listings.
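For what it's worth, a rough .htaccess sketch of that approach (assuming Apache with mod_setenvif; the env variable name is just an example):

  SetEnvIfNoCase User-Agent "PycURL" block_bot
  Order Allow,Deny
  Allow from all
  Deny from env=block_bot

That only matches the bare PycURL UA, so the normal FAST-WebCrawler UA still gets through.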
Don
If a bot doesn't provide any identification or indication of its intent, then I must assume the visit cannot be of value to me. After all, there really are only a handful of bots that can offer any value, and they [for the most part] clearly identify the bot and its intent. Even the bandwidth-consuming brand bots do that much.
If a bot cannot respect robots.txt, that should be an indication of how it will treat your content!
I agree that any legitimate bot should identify itself, provide information about its purpose, and also fetch and respect robots.txt. If it had not come from fastsearch.net, where the ATW crawler also comes from, I probably would have banned the IP (or IP range) too.
My policy is similar to what JuniorHarris said: if a bot is a mystery or acts suspiciously, it gets blocked until I learn something positive about it.