Thanks for the heads-up about the 2wrongs HTTP 302 redirect to Cyveillance. Dumbfind now looks like a wolf in sheep's clothing. I will keep a close watch on their robot.
Nice find bcolflesh. :)
I always wondered about these scumbags. They usually seem to originate on US dial-up ISP connections, which is probably an attempt to evade detection.
I read the article referenced by bcolflesh.
|This transaction will allow Cyveillance to better respond to the increasing need for highly-customized data mining projects... |
Dumbot is now Deadbot. Squish!
Hi. dumbfind.com is not in any way affiliated with Cyveillance Inc. If I were trying to be sneaky I wouldn't put dumbfind.com in the user-agent string, and I certainly wouldn't put my name as the registration contact. I am actually making a search engine. I do not sell my data or let anyone else have access to it in any shape or form. I use DSL lines with dynamic IPs because that is the cheapest source of bandwidth available to me. It would have been nice to get an email from any of you before implying any ill intent on my part. I have an exclude list, so if any of you want me to remove your sites, please just send me a list of hosts you don't want spidered. If anyone has any questions, feel free to email me at firstname.lastname@example.org.
Forums are a great medium for communication, but posters need to be sure of their facts before jumping to (potentially libellous) conclusions.
(Moderator - this thread should probably be deleted.)
|I have an exclude list, so if any of you want me to remove your sites, please just send me a list of hosts you don't want spidered. If anyone has any questions, feel free to email me at email@example.com. |
I think you are getting backlash because any legitimate bot should follow the robots.txt exclusion protocol - if they don't, well, they are just dumb.
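For anyone following along, the standard way to exclude one bot sitewide under that protocol is a robots.txt record like the following (assuming the bot matches on the token "Dumbot" in the User-agent line, which the standard leaves up to each crawler):

```
User-agent: Dumbot
Disallow: /
```

A `Disallow` with a path instead of `/` blocks only that subtree, and a record for `User-agent: *` applies to any bot not matched by a more specific record.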
It does follow robots.txt, but I also allow people to send me sites they don't want spidered if they don't feel like dealing with it.
I put up a little info page that the user-agent now points to:
|"the greatest search engine in the history of everything or something" |
- Vasco da Gama, circa 1490
Why am I filled with less than excitement over this guy and his scheme? :)
|it does follow robots.txt |
And what is the exact syntax since it doesn't seem to follow either of the below:
I am suspicious of any crawler that hides behind dial-up IP addresses instead of actually coming from the site itself.
Think we can safely ignore this one. Banned from my sites, and the old early warning alarm system is on the lookout for him.
Because you do not share my sense of humor.
Please let me know what your web address is so I can track down the issue. It may just be a caching thing, since my spider doesn't visit very often and so doesn't grab robots.txt very often. I am probably going to purge the entire cache today so that the angry hordes of hatemongers congregating at this site don't come and kill me.
The reason I come from different IPs is that I have 7 DSL lines running into my house. Which is where I work. For myself. Alone. Why 7 DSL lines, you ask? Because it is the cheapest source of bandwidth available to me. 6 of the lines are Verizon lines with dynamic IPs; one is a Covad line with a static IP. Again, if I wanted to hide I WOULDN'T PUT MY WEBSITE ADDRESS IN THE USER-AGENT STRING.
I'm not accusing you of trying to hide, Dumbfounder - just of not being very professional.
|<!-- saved from url=(0040)http://www.donkeycake.com/gunk/dumbfind/ --> |
Eh? (Quote from Dumbot's homepage source code.)
Also, what's all that about the mailto: being hidden behind the "Hey!" on your frontpage instead of the "email us"?
However, I do agree with points 1, 2, 5, 6, 7, and 8 (although not 3 and 4) of your manifesto :)
|The Contractor, |
|please let me know what your web address is so I can track down the issue. |
Sorry, not trying to be an idiot, but I am not giving out any website addresses. You need to run your bot on your sites and see why they do not adhere to robots.txt - should be a simple issue if you are the developer.
You never did say which of my two examples is correct for robots.txt. I have tried one or the other on multiple sites and neither seems to work. I am not against anyone building a bot, but it needs to adhere to robots.txt to be legit - for the same reason you have DSL: why would I want to burn up gigabytes of bandwidth, and pay for it, on bots I don't want (no offense)?
Until you give the robots.txt syntax, get your bot to adhere to it, and explain it on your site to those that don't have a clue how to block it, you will never be legit, IMHO.
As a reader of this thread, I'd like to see it turn more in the direction of reporting problems and helping to get them fixed, rather than a piling-on. Since the phrase "professional" has been mentioned, let's try to live up to it here.
Spotting a new 'bot in the server logs triggers anxiety in some Webmasters because of the massive abuse prevalent on the 'net today. This anxiety is further heightened if a new robot has a bug that causes it to misinterpret robots.txt, or even when the robots.txt file itself is to blame. Anything that makes it more difficult for a Webmaster to find out the intent of the new 'bot only exacerbates this anxiety, until in the end some Webmasters adopt an attitude of "If I haven't heard of you, stay off my site." This is unfortunate... I keep thinking of the first time I saw "Googlebot" in my logs and thought, "What is that? It's sure an unappealing name..."
Can we please give Dumbfounder a break here, and treat him with the respect that every WebmasterWorld member is entitled to? Personal attacks are a violation of the WebmasterWorld terms of service, BTW.
Dumbfounder, please take a look at the pages Google has put up for Webmasters. Note the depth of their explanations and reassurances. They are completely forthright about their robot and their quality policies. Google didn't do this just for fun; this totally open approach is what is needed today to allay the fears you see expressed in this thread. Designing a robot and a ranking algorithm is one thing; managing "Webmaster relations" is something else -- and a subject not covered in the technical manuals or in CS classes.
Anyway, let's all be polite here, and not rush to judgement. Each Webmaster may decide what to do with each user-agent that visits his/her site, but I'd argue that such judgements should be made rationally based on observed behaviour, and not on fear and unreasonable doubt. Business decisions should be made calmly and coolly.
If I offended anyone, I didn't mean to. I agree with jdMorgan, and I am not out to attack anyone. I have been trying to block the Dumbot for a while now because all that was in the UA was a link to a basically blank page.
If you want to be taken seriously and are working on a project indexing sites, the more info you can give on who you are and why you are crawling, the better. There are enough site scrapers/downloaders out there that if someone doesn't say what they are doing and who they are, I block them instantly (at least I try, with robots.txt and .htaccess).
Either is fine. I do a case-insensitive search for "dumbot" anywhere on any line that starts with "user-agent" (again, case-insensitive). Any Disallow lines found after that, up until the next User-agent line, are attributed to Dumbot. Other than the pattern match, it is the exact same code I use for the user-agent "*". You are the only person ever to complain that I have not adhered to robots.txt, and I have been doing this for 8 months now, which leads me to think the problem is not a large one. The absolute best way to track down the problem is to use real examples from real websites; I cannot account for the infinite possibilities found out on the web. I think you are doing a disservice to others by refusing to give out your web address. You can email it to me if you wish: firstname.lastname@example.org. Or you can call me at 703-683-3077, or you can drive to my house and hand me a letter (just do a whois on dumbfind.com).
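For readers wanting to check their own robots.txt against this, here is a minimal Python sketch of the matching rules as described above (the function name and structure are my own, not the bot's actual code):

```python
def disallows_for(robots_txt, bot_name="dumbot"):
    """Collect Disallow paths for a bot, per the rules described:
    a case-insensitive search for the bot name anywhere in a
    User-agent line; subsequent Disallow lines are attributed to
    that bot until the next User-agent line."""
    rules = []
    collecting = False
    for line in robots_txt.splitlines():
        stripped = line.strip()
        lower = stripped.lower()
        if lower.startswith("user-agent"):
            # Start or stop collecting based on the bot-name match.
            collecting = bot_name.lower() in lower
        elif collecting and lower.startswith("disallow"):
            # Keep the path portion after the first colon.
            path = stripped.split(":", 1)[1].strip()
            if path:
                rules.append(path)
    return rules

example = """\
User-agent: Dumbot
Disallow: /private/

User-agent: *
Disallow: /cgi-bin/
"""
print(disallows_for(example))  # ['/private/']
```

Under these rules, a record for `User-agent: Dumbot` and one for `User-agent: *` are parsed the same way; only the pattern match differs, which matches what the post says.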
thanks jd, how about if I put a link to google's bot information page at the end of mine? I would feel better about that than simply plagiarizing their content.
>I'm not accusing you of trying to hide dumbfounder - just of not being very professional.
Ok, phew. I am definitely not professional, as evidenced by the fact that no one is paying me to do this.
><!-- saved from url=(0040)http://www.donkeycake.com/gunk/dumbfind/ -->
>Also, what's all that about the mailto: being hidden behind the "Hey!" on your frontpage instead of the "email us"?
My cousin designed the page. I try not to constrain his artistic nature, so I left it as-is. I will make it bigger just to appease you, though.
>However, I do agree with points 1,2,5,6,7,8 (although not 3 and 4) of your manifesto :)
Does that mean that if you think the developers are indeed funny, you won't support their idea? Every time you say that, a code-writing clown dies somewhere.