Found this in my logs today:
Dumbot(version 0.1 beta - dumbfind.com)
Verizon Internet Services
OrgID: VRIS
Address: 1880 Campus Commons Dr
City: Reston
StateProv: VA
PostalCode: 20191
Country: US
NetRange: 68.236.0.0 - 68.239.255.255
CIDR: 68.236.0.0/14
NetName: VIS-X-Y
NetHandle: NET-68-236-0-0-1
Parent: NET-68-0-0-0-0
NetType: Direct Allocation
NameServer: NSDC.BA-DSG.NET
NameServer: GTEPH.BA-DSG.NET
Comment:
Anyone else seeing this? The domain is a static page with just an email link.
Goober
You might try putting up a page to explain your project, and including that page's URL in your user-agent string. This is far more efficient than an e-mail address. If you want to be seen as trustworthy by the Webmaster community, that is the way to go.
I don't have time to research every user-agent I find in my logs and send an e-mail, and then coordinate responses with my access control efforts. A log entry like this, on the other hand, only takes a second to check:
Googlebot/2.1 (+http://www.google.com/bot.html)
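For the crawler side, a log entry like that comes from nothing more than a descriptive User-Agent header. A minimal sketch in Python (the bot name and URLs here are placeholders, not Dumbot's actual ones):

```python
# Sketch: a crawler request that identifies itself with a pointer to an
# explanatory page, Googlebot-style. Names/URLs are hypothetical examples.
import urllib.request

USER_AGENT = "Examplebot/0.1 (+http://www.example.com/bot.html)"

req = urllib.request.Request(
    "http://www.example.com/",
    headers={"User-Agent": USER_AGENT},
)
# The header is sent verbatim, so this exact string lands in server logs,
# where a Webmaster can check it in a second.
print(req.get_header("User-agent"))
```

Any HTTP client library offers the same knob; the point is simply that the string webmasters see is entirely under the bot operator's control.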
There is a lot of abuse on the 'net, and many of the participants here have battle fatigue. We will fire a 403 at the first sight of unauthorized intruders.
I look forward to seeing your explanatory Web page, so I can decide whether it would be prudent or beneficial to allow access. Until then, your project is a big unknown, and therefore, a risk by default. I say this sincerely and without any antagonism, but you will be more successful by building trust through full disclosure.
Jim
thanks for the input, I will do as you suggest. I really do fail to see the "risk" you refer to, though. It is trivial for someone to spoof a popular browser's user-agent string and hide their tracks from all but the most curious of webmasters. I am clearly not trying to hide in any way. But obviously you are correct, because it has in fact caused some confusion, as evidenced by the comments in these forums.
Thanks
Thanks so much.
The risks are plentiful:
The bad guys do all of these and more, so our troops are trigger-happy, sir...
Switching metaphors, sometimes a new robot knocks on the door, comes in quietly, looks around politely, and then leaves, appearing to be well-behaved. Then it comes back at 2:00 AM local time, tears the place apart, and steals everything. So, it's not good (in your robot's case) to be an anonymous, unknown visitor -- naturally, the occupants are wary.
Many of the above 'risks' can be handled by prudent security measures around a competently-hosted site. But many Webmasters have little control over things like firewalls, mod_throttle, and other server-level controls; they are left with denying access based on user-agent, IP address, and a few behavioural-pattern-detection solutions available as simple scripts.
Sometimes the biggest problems *are* the really unsophisticated harvesting robots, anyway -- the ones that *can* be taken care of by a simple user-agent ban. They're often the ones that go out of control, ignoring robots.txt and hitting sites with multiple, unrelenting simultaneous requests. Others (including UA spoofers) need stronger, but more CPU-intensive, behaviour-based measures. UA testing is a simple first-step filter, the trip-wire at the perimeter, as it were.
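That trip-wire amounts to a cheap substring check before any costlier behaviour-based logic runs. A sketch (the banned names below are illustrative, not anyone's actual ban list):

```python
# Sketch of the "trip-wire" first-pass filter: 403 any request whose
# User-Agent matches a ban list, pass everything else to later checks.
# The ban-list entries are illustrative examples only.
BANNED_UA_SUBSTRINGS = ["EmailSiphon", "WebZIP", "Offline Explorer"]

def first_pass_status(user_agent: str) -> int:
    """Return 403 for a banned User-Agent, else 200 (defer to next filter)."""
    ua = user_agent.lower()
    if any(bad.lower() in ua for bad in BANNED_UA_SUBSTRINGS):
        return 403
    return 200
```

The same idea is usually expressed as server config (e.g. SetEnvIf plus a deny rule on Apache) rather than application code, but the cost profile is the same: one string scan per request.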
HTH,
Jim
jdmorgan,
I am definitely not anonymous; the user-agent string points right to my website. It says on the website, "If you would like information about what will be the greatest search engine ever then HEY! email us.". It grabs robots.txt and doesn't disobey it, which should be evident from your logs. It seemed to me that the risks you might assess would be allayed by these facts, and thus the crawler would be treated about the same as one from another search engine. But I guess from your perspective you would consider Microsoft's new crawler a risk too. Good to know.
Some other people seem to be more worried about the risk that I am an evil bot, and that seems to me ridiculous given how easy it is to find my website, my email address, my home address, my phone number, and my shoe size (12). It is very easy for anyone to send me an email, call me on the phone, or send me a new pair of Nikes that fit properly. Actually, Nikes tend to run small for the shape of my foot, so please send size 13. thanks.
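For what "grabs robots.txt and doesn't disobey it" looks like in practice, here is a sketch of the standard check a compliant crawler performs (not Dumbot's actual code; the "dumbot" token and paths are hypothetical), using Python's stdlib parser:

```python
# Sketch: how a compliant crawler consults robots.txt before fetching.
# The user-agent token "dumbot" and the rules below are hypothetical.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# Normally: rp.set_url("http://www.example.com/robots.txt"); rp.read()
# Here we parse an in-memory rule set to stay self-contained.
rp.parse([
    "User-agent: dumbot",
    "Disallow: /private/",
])

# The crawler asks before every fetch; a False answer means skip the URL.
print(rp.can_fetch("dumbot", "http://www.example.com/private/page.html"))
print(rp.can_fetch("dumbot", "http://www.example.com/public.html"))
```

A well-behaved bot makes this check (and re-fetches robots.txt periodically) for every site it visits.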
fiestagirl,
>How can we keep it out of our websites if we choose to do so?
You can ask me to remove your sites. You can use the robots.txt file. You may be able to block it at the firewall, depending on what kind of firewall you have. I am sure there are other methods. If you need help on how to send email, I am sure there are forums somewhere on the net for that.
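For the robots.txt route, a record like the following would do it -- assuming "dumbot" is the token the crawler honours, which is a guess on my part; the operator's page should state the exact name:

```
# Hypothetical robots.txt record to exclude the crawler site-wide.
User-agent: dumbot
Disallow: /
```

Place it at the root of the site (e.g. /robots.txt); a compliant robot reads it before crawling and stays out.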
thanks
First of all, this is "business", so don't take it too personally. I have attempted to explain some of the problems that Webmasters face in general, and none of those risks listed above are to be construed as accusations leveled at you. Either your robot is well-behaved, follows robots.txt, and can handle multiple user-agents per robots.txt record, or it would have banned itself from my sites. Its name does not appear in my banned user-agent logs, so either it hasn't visited or it's well-behaved, and that's as far as I can go on dumbot-specific statements.
However, the Webmaster of a site may be faced with hundreds of questionable access "sessions" per day, so there is simply no time to research them all, e-mail the operators, and hope for responses -- and the boss said last week, "one more outside phone call, and you're fired!" Maybe the *only purpose* of the 'bot is to collect Webmaster e-mail addresses from responses to the user-agent log entries! So you're dealing with Webmasters who are overworked, harried by malicious user-agents *and* the boss, out of time, and yes, even lazy. That's why I suggested that you (and yes, MSN too) add the Web page link to the user-agent string.
As an aside, MSN is quite responsive, and fixed a major problem I reported to them very quickly. Alas, no change to msnbot's UA string yet.
In Web marketing, the more clicks a potential client has to do to find what he/she is looking for, the less chance that they will actually buy or stay on your site. The same applies here. It takes six minutes to research the domain name and send an e-mail, but only two minutes to simply block an unrecognized user-agent... The good old days of assuming that every academic-sounding robot from a .edu domain is legitimate are gone, unfortunately. The Web has been commercialized, and exploited, and abused. People are also increasingly (and understandably) wary about giving out their e-mail addresses these days. So the explanatory Web page URL in the user-agent string is the way to go.
In general, that page should include the purpose of the 'bot and the project, the standard boilerplate robots.txt exclusion and "meta name=robots" explanation, and the spider name to be used in robots.txt. Also, mainly black text/white or light background for us old-timers, please... ;)
Good luck with your project,
Jim
thanks again.
I am going as fast as I can! Please shoot me an email with your site address so I can add it and figure out why I missed it.
Nahh... can't do that, man. It has to be found fair and square.
they do have good advice on the crawler side, now if they could only tell me how to make a search engine...
The easiest way would be to have every search give the site in my profile as the first 100 serps, and then toss in a few other ones, pulled out of a hat, at the end. Just a suggestion... ;-)