Forum Moderators: open
I understand... I was once also of the mind that the more search engines indexed your site the better, etc. But you can take 90% of the bots out and lose maybe 2% of your traffic. There is *a lot* of click fraud, spyware redirects, email harvesters, etc. associated with many pseudo wannabe engines. Once you see your content cached and cloaked to the bots for domains that are selling casinos, adult content, and bogus male enlargement vitamins... well, it changed my perspective. You might wonder why I keep an eye on this thread, then... it's just to be proactive. ;)
Again, regarding the bot that is the topic of this thread... I don't know what their intentions are... They could intend to be the next Google for all I know... if so, maybe they could at least put up a single page saying what it is they intend to do.
Your point about blocking the next Google is harder for me to agree with. If I were starting out building a new search engine, would I be anxious to tip my hand to the world before it was ready to go live? I don't think so.
I run a regional directory site and in its early days I gathered data from a variety of sources in order to avoid starting with an empty directory. I did not alert all of the sources of my data to my intentions. All of the businesses that were quietly added to the directory (at no charge) are very happy to be there. Now the trick is to get them to start paying, but that's a topic for a different thread.
Here's an even bigger twist. I tried to get permission from a trade association to add their members to my directory and could never get a straight answer from them, either yes or no. They made their membership info public on the Internet, so that told me that I could use it. Google doesn't ask permission before adding new information to its site, so why should I?
Usually I visually scan my log file in WordPad, just scrolling down until I notice any new bots or unusual entries.
Upon finding a new bot, I look up its IP and check out any potential domain, whether contained in the UA or in the reverse DNS.
If I find a company with information that explains the bot (either a specific description, or the site itself is an engine), I then decide whether to allow or block it. If the odds are it won't generate any useful traffic, I block the bot to keep down bandwidth.
If there is no description or explanation of the bot, or if it is a private bot, then I presume it is for some sort of personal gain to them: email harvesting, code theft, or theft of pages or images (and I do consider my images copyrighted; they are my work, my property, and copyright is stated). Or the bot could be used to find the few legitimate complaints that I have posted against certain corporations. In those cases I block by UA and/or by IP (and I watch for the bot to sneak back via another IP).
Simply put, if they can't explain who they are and what their purpose is, then I consider that the bot is probably of no good to me and may not be in the best interest of the general public (suppression of freedom of speech, email spam, etc.). Consistent use of IPs by bots, with proper identification (of the corporation or individual) and a stated purpose, would eliminate many hassles and could simplify work for webmasters.
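For what it's worth, the screening routine described above (scan the log for unfamiliar user agents, then check each IP's reverse DNS) can be sketched in a few lines of Python. This is only a sketch under assumptions: it presumes an Apache/NGINX "combined" log format, and the function names, sample file path, and known-agent list are hypothetical, not from anyone's actual setup.

```python
import re
import socket

# Assumes the common Apache/NGINX "combined" log format; adjust the regex
# if your server logs in a different layout.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def scan_log(path, known_agents):
    """Collect the IPs used by any user agent not already in known_agents."""
    new_bots = {}
    with open(path, errors="replace") as log:
        for line in log:
            m = LOG_LINE.match(line)
            if not m:
                continue  # skip lines that don't fit the expected format
            ua = m.group("ua")
            if ua and ua not in known_agents:
                new_bots.setdefault(ua, set()).add(m.group("ip"))
    return new_bots

def reverse_dns(ip):
    """Look up the PTR record for an IP; many dubious bots have none."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return "(no reverse DNS)"
```

From there, the allow/block decision is manual, exactly as described: a UA whose reverse DNS points at an identifiable company with a bot page gets a pass; one with no PTR record and no explanation gets added to the block list.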
Standards were "somewhat" established for bots, e.g., the format of the UA and the usage of the robots.txt file; why couldn't there have been a requirement for proper identification of the owner/user of the bot and its intended purpose?
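For a sense of how much of that convention is already mechanized, Python's standard-library `urllib.robotparser` implements the UA-matching side of robots.txt. The rules below are a hypothetical robots.txt written for illustration, not anyone's real file; the bot names are made up too.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: one named bot is blocked outright,
# everyone else is only kept out of /private/.
rules = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("BadBot", "/index.html"))      # False: blocked by name
print(parser.can_fetch("Jetbot", "/index.html"))      # True: falls under *
print(parser.can_fetch("Jetbot", "/private/x.html"))  # False: * disallows it
```

Note what the standard does and doesn't cover: a bot can be named and restricted, but nothing in robots.txt obliges the bot's operator to say who they are or what the crawl is for, which is exactly the gap being complained about.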
Just my views.
[jeteye.com...]
Notice [cloud.he.net...] serving as the default error page. See the company involved.
Hollywood
After reviewing over 3 full months of logs from a half dozen sites, the only thing that Jetbot seems guilty of is excessively grabbing robots.txt. That was in August when it was still being run from Gigablast IPs; since moving to their own IPs (within a block owned by Hurricane Electric - think that might have anything to do with that 404 URL a few messages back?), requests for robots.txt have been more reasonable.
The first file request comes a maximum of 5 seconds after requesting robots.txt. Jetbot has not requested anything it shouldn't; that is, it completely respects robots.txt. Jetbot also has not hammered my servers. Jetbot has not followed dynamic links (even though it's welcome to). Jetbot does follow 301s. I have received no traffic from the (known) JetEye IP range, except for Jetbot, which has always identified itself.
Of course, none of this surprises me since, as I correctly suspected, Jetbot is rebranded, licensed technology from Gigablast (who has also been squeaky clean).
Hollywood, if you know something, do us all a favor and bloody-well spit it out. So far, your air of mystery is more of a stink of...