New spider?

Kinda funny in a way

         

Goober

9:51 am on Jul 12, 2004 (gmt 0)

10+ Year Member



Howdy,

Found this in my logs today:

Dumbot(version 0.1 beta - dumbfind.com)

Verizon Internet Services
OrgID: VRIS
Address: 1880 Campus Commons Dr
City: Reston
StateProv: VA
PostalCode: 20191
Country: US

NetRange: 68.236.0.0 - 68.239.255.255
CIDR: 68.236.0.0/14
NetName: VIS-X-Y
NetHandle: NET-68-236-0-0-1
Parent: NET-68-0-0-0-0
NetType: Direct Allocation
NameServer: NSDC.BA-DSG.NET
NameServer: GTEPH.BA-DSG.NET
Comment:

Anyone else seeing this? The domain is a static page with just an email link.

Goober

volatilegx

2:40 pm on Jul 12, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Here's what I have on this one.

# Dumbfind
# UA "Dumbot(version 0.1 beta - dumbfind.com)"
68.167.196.88
68.239.122.138
151.200.115.249

fiestagirl

2:50 pm on Jul 12, 2004 (gmt 0)

10+ Year Member



The administrative contact listed resolves to Cyveillance FWIW.

dumbfounder

9:41 pm on Jul 14, 2004 (gmt 0)

10+ Year Member



FWIW, the administrative contact does not resolve to Cyveillance. It resolves to my actual name and address. FiestaGirl, please verify your information before spreading rumors. If anyone has any questions about dumbfind.com then feel free to email me at info@dumbfind.com. The website is left intentionally vague to encourage questions from inquisitive webmasters and others interested in the project. Thanks.

Romeo

10:04 pm on Jul 14, 2004 (gmt 0)

10+ Year Member



I have seen it, and it came from two different IP address pools: pool-138-88-****-xxx.esr.east.verizon.net and pool-68-239-xxx-xxx.res.east.verizon.net.
This looked strange to me at first, but the first request was for /robots.txt, which was then respected, so it is formally OK by me.

Regards,
R.

Romeo

10:11 pm on Jul 14, 2004 (gmt 0)

10+ Year Member



... another thread about dumbot found here:
[webmasterworld.com...]

Regards,
R.

jdMorgan

11:04 pm on Jul 14, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



dumbfounder,

You might try putting up a page to explain your project, and including that page's address in your user-agent string. This is far more efficient than an e-mail address. If you want to be seen as trustworthy by the Webmaster community, that is the way to go.

I don't have time to research every user-agent I find in my logs and send an e-mail, and then coordinate responses with my access control efforts. A log entry like this, on the other hand, only takes a second to check:

Googlebot/2.1 (+http://www.google.com/bot.html)

There is a lot of abuse on the 'net, and many of the participants here have battle fatigue. We will fire a 403 at the first sight of unauthorized intruders.

I look forward to seeing your explanatory Web page, so I can decide whether it would be prudent or beneficial to allow access. Until then, your project is a big unknown, and therefore, a risk by default. I say this sincerely and without any antagonism, but you will be more successful by building trust through full disclosure.

Jim

dumbfounder

11:28 pm on Jul 14, 2004 (gmt 0)

10+ Year Member



jdMorgan,

thanks for the input, I will do as you suggest. I really fail to see the "risk" you refer to, though. It would be trivial for someone to spoof a popular browser's user-agent string and hide their tracks from all but the most curious of webmasters. I am clearly not trying to hide in any way. But obviously you are correct, because it has in fact caused some confusion, as evidenced by the comments in these forums.

Thanks

fiestagirl

11:49 pm on Jul 14, 2004 (gmt 0)

10+ Year Member



Please, the very least you could do is let us know what you are doing and whether or not your robot obeys the robots exclusion protocol.
How can we keep it out of our websites if we choose to do so (short of giving it a 403)?
I notice that the robot doesn't ask for my robots.txt very often.

Thanks so much.

jdMorgan

12:08 am on Jul 15, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> I really do fail to see the "risk" you refer to though.

The risks are plentiful:

  • e-mail address harvesting
  • Download and duplication of an entire Web site on another domain in order to divert traffic
  • Uncontrolled downloading of "infinite" url-spaces on database-driven sites - bandwidth overage penalties
  • Request flooding, resulting in denial of service to customers
  • Annoying vulnerability probes, log file spamming, and log pollution
  • etc.

The bad guys do all of these and more, so our troops are trigger-happy, sir...

Switching metaphors, sometimes a new robot knocks on the door, comes in quietly, looks around politely, and then leaves, appearing to be well-behaved. Then it comes back at 2:00 AM local time, tears the place apart, and steals everything. So it's not good (in your robot's case) to be an anonymous, unknown visitor -- naturally, the occupants are wary.

Many of the above 'risks' can be handled by prudent security measures around a competently-hosted site. But many Webmasters have little control over things like firewalls, mod_throttle, and other server-level controls; they are left with denying access based on user-agent, IP address, and a few behavioural-pattern-detection solutions available in simple scripts.

Sometimes the biggest problems *are* the really unsophisticated harvesting robots, anyway -- the ones that *can* be taken care of by a simple user-agent ban. They're often the ones that go out of control, ignoring robots.txt and hitting sites with multiple, unrelenting simultaneous requests. Others (including UA spoofers) need stronger, but more cpu-intensive, behaviour-based measures. UA testing is a simple first-step filter, the trip-wire at the perimeter, as it were.
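As a concrete sketch of that first-step filter in Apache terms (assuming mod_rewrite is available; "BadBot" here is a placeholder token, not any particular robot):

```apache
# .htaccess sketch: deny (403) any request whose User-Agent contains "BadBot"
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} BadBot [NC]
RewriteRule .* - [F]
```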

HTH,
Jim

dumbfounder

1:16 am on Jul 15, 2004 (gmt 0)

10+ Year Member



here is my first attempt at an info page: [dumbfind.com...]

jdmorgan,

I am definitely not anonymous; the user-agent string points right to my website. It says on the website "If you would like information about what will be the greatest search engine ever then HEY! email us.". It grabs robots.txt, and doesn't disobey it, which should be evidenced by your logs. It seemed to me that the risks you might assess would be allayed by these facts, and thus the crawler would be treated about the same as one from another search engine. But I guess from your perspective you would consider Microsoft's new crawler a risk too. Good to know.

Some other people seem to be more worried about the risk that I am an evil bot, and that seems to me ridiculous given how easy it is to find my website, my email address, my home address, my phone number, and my shoe size (12). It is very easy for anyone to send me an email, call me on the phone, or send me a new pair of Nikes that fit properly. Actually, Nikes tend to run small for my shape of foot, so please send size 13. thanks.

fiestagirl,

>How can we keep it out of our websites if we choose to do so?

You can ask me to remove your sites. You can use the robots.txt file. You may be able to block it at the firewall, depending on what kind of firewall you have. I am sure there are other methods. If you need help on how to send email, I am sure there are forums somewhere on the net for that.

thanks

fiestagirl

1:23 am on Jul 15, 2004 (gmt 0)

10+ Year Member



So you are stating that this will work to keep the bot out?

User-agent: Dumbbot
Disallow: /

This was what I was trying to get out of you; the irony was lost, I guess. Of course I know how to keep robots out. Puhleese.

dumbfounder

1:38 am on Jul 15, 2004 (gmt 0)

10+ Year Member



fiestagirl,

yes, that will work excellently to keep out Dumbbot. However, if you would like to keep my bot from spidering your site, you need to change it to Dumbot.

(fyi, that's irony)

thanks
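As a sanity check on the exchange above, Python's standard-library robots.txt parser agrees that a rule naming "Dumbot" excludes the spider's full user-agent string, while other agents stay allowed (this is a minimal sketch; the example.com URL is purely illustrative):

```python
from urllib.robotparser import RobotFileParser

# The corrected rule from the exchange above.
rules = [
    "User-agent: Dumbot",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(rules)

# The parser matches the robots.txt token as a (lowercased) substring of the
# bot's user-agent string, so the real spider's full UA is caught...
blocked = not parser.can_fetch("Dumbot(version 0.1 beta - dumbfind.com)", "http://example.com/page")

# ...while unrelated agents (and misspellings like "Dumbbot") match nothing
# and fall back to "allowed".
allowed = parser.can_fetch("Googlebot", "http://example.com/page")

print(blocked, allowed)  # True True
```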

jdMorgan

1:53 am on Jul 15, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



dumbfounder,

First of all, this is "business", so don't take it too personally. I have attempted to explain some of the problems that Webmasters face in general, and none of the risks listed above are to be construed as accusations leveled at you. Either your robot is well-behaved, follows robots.txt, and can handle multiple user-agents per robots.txt record, or it would have banned itself from my sites. Its name does not appear in my banned user-agent logs, so either it hasn't visited or it's well-behaved, and that's as far as I can go on Dumbot-specific statements.

However, the Webmaster of a site may be faced with hundreds of questionable access "sessions" per day, so there is simply no time to research them all, e-mail the operators, and hope to get responses -- and the boss said last week, "one more outside phone call and you're fired!" Maybe the *only purpose* of the 'bot is to collect Webmaster e-mail addresses from responses to the user-agent log entries! So you're dealing with Webmasters who are overworked, harried by malicious user-agents *and* the boss, out of time, and yes, even lazy. That's why I suggested to you (and yes, also to MSN) that they add the Web page link to the user-agent string.

As an aside, MSN is quite responsive, and fixed a major problem I reported to them very quickly. Alas, no change to msnbot's UA string yet.

In Web marketing, the more clicks a potential client has to make to find what he/she is looking for, the less chance that they will actually buy or stay on your site. The same applies here. It takes six minutes to research the domain name and send an e-mail, but only two minutes to simply block an unrecognized user-agent... The good old days of assuming that every academic-sounding robot from a .edu domain is legitimate are gone, unfortunately. The Web has been commercialized, and exploited, and abused. People are also increasingly (and understandably) wary about giving out their e-mail addresses these days. So the explanatory Web page URL in the user-agent string is the way to go.

In general, that page should include the purpose of the 'bot and the project, the standard boilerplate robots.txt exclusion and "meta name=robots" explanation, and the spider name to be used in robots.txt. Also, mainly black text on a white or light background for us old-timers, please... ;)

Good luck with your project,
Jim

volatilegx

2:21 am on Jul 15, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



dumbfounder,

I just wanted to take the opportunity to thank you for posting about your spider here. We appreciate seeing posts from the owners of search engines/spiders.

Dan

dumbfounder

2:35 am on Jul 15, 2004 (gmt 0)

10+ Year Member



thanks jd, I appreciate your help and level-headed moderation of the discussion. I wasn't actually referring to you when speaking of the accusations of wrongdoing. Someone put a link earlier to another discussion that had those: [webmasterworld.com...] . It would have taken far less time to follow the link to my site and send me an email than to go to the website, do a whois, do a search for my name, find out I worked for 2wrongs, then find out that 2wrongs was acquired by Cyveillance, find out what Cyveillance does, and then post in the forum. That really pissed me off, and I guess I carried my attitude over to this discussion, so I apologize if I directed my aggression towards you or others undeserving of it.

thanks again.

dumbfounder

2:40 am on Jul 15, 2004 (gmt 0)

10+ Year Member



thanks Dan!

Stefan

3:05 am on Jul 15, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Geez... I haven't seen Dumbot in the logs. What gives, Dumbfounder? I'm in two cats in the ODP...

Anyway, the more SEs the better. If you take advantage of the good advice you've gotten here, you might have a future.

dumbfounder

3:27 am on Jul 15, 2004 (gmt 0)

10+ Year Member



I am going as fast as I can! Please shoot me an email with your site address so I can add it and figure out why I missed it.

they do have good advice on the crawler side, now if they could only tell me how to make a search engine...

Stefan

3:47 am on Jul 15, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> I am going as fast as I can! Please shoot me an email with your site address so I can add it and figure out why I missed it.

Nahh... can't do that, man. It has to be found fair and square.

> they do have good advice on the crawler side, now if they could only tell me how to make a search engine...

The easiest way would be to have every search give the site in my profile as the first 100 serps, and then toss in a few other ones, pulled out of a hat, at the end. Just a suggestion... ;-)

wilderness

2:31 pm on Jul 19, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I've had so many unidentified crawls from Verizon users over a period of time that I have a large range of Verizon subscribers denied as a result.
Verizon's automatic responses to reported violations of their own UAG only add kindling to the fire on my end.

fiestagirl

4:28 pm on Jul 19, 2004 (gmt 0)

10+ Year Member



Glad to see you posting again, wilderness. Thanks

wilderness

7:18 pm on Jul 19, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> Glad to see you posting again

fiestagirl,
this is the only forum at Webmaster World which held/holds my interest.
Occasionally I followed Jim's Apache forum while this one was down, however not with any regularity.

I see you've been keeping up on things :-)