nym.ivegotafang.com
DuckDuckBot/1.0; (+http://duckduckgo.com/duckduckbot.html)
robots.txt? NO
That bot page says "It should respect your robots.txt file" but it didn't. Also, it wasn't the only UA to come from that Host (which states it's "partnered" with the UA's site). Another was:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)
robots.txt? NO
The other user-agent may have something to do with their "page preview" function in the SERPs, but I'm not sure -- it has been blocked on my site because of invalid request headers, and no preview is available for my site, so this is just an inference. It could also have been the first version of the DDBot before it was given a name.
I actually kind of like their search results... Very clean and simple presentation.
Jim
Jim
- They don't request, or respect, robots.txt.
- They don't have authorization to retrieve any file other than robots.txt, yet they try repeatedly.
- They don't use a self-identified UA -- they cloak their hits.
- They don't send any identifiable results-referred (or any) traffic.
To me, those add up to yet another bad bot.
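For comparison, here is a minimal sketch of the check a well-behaved bot would make before fetching anything else. It uses Python's standard urllib.robotparser. The robots.txt content and the example.com URLs are assumptions for illustration only; my actual file isn't shown in this thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that shuts out every crawler.
# This is an assumption for the example, not a file quoted from the thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # With the blanket Disallow above, the home page is off limits.
    print(may_fetch("DuckDuckBot/1.0", "http://example.com/"))
```

A bot that ran a check like this first would never have requested "/" under that robots.txt.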
One last bit about favicons:
On the two sites they hit, they've never requested favicons, only the base .html file. One hit, then gone. For example, here they are doing exactly what robots.txt disallows -- if they'd bothered to read/heed robots.txt:
tmbg.duckduckgo.com - - [16/Dec/2009:**:01:30 -0800] "GET / HTTP/1.1" 200 14761 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
Nothing says bad bot like having to add a special rule for a Host/IP saying, "Robots.txt or Bust."
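The kind of rule I mean can be sketched roughly like this, in Python rather than real server config. It parses a combined-format log line like the one above and flags any request from a restricted host that isn't for robots.txt. The names parse_line, should_block and RESTRICTED_SUFFIXES are made up for this example, and the suffix list is an assumption; a real rule would live in the server's access-control setup.

```python
import re

# Pattern for a combined-format access log line such as the one quoted above.
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# Hypothetical list of host suffixes held to "robots.txt or bust".
RESTRICTED_SUFFIXES = (".duckduckgo.com",)


def parse_line(line):
    """Return the log fields as a dict, or None if the line does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def should_block(host, path):
    """Restricted hosts may request /robots.txt and nothing else."""
    restricted = host.lower().endswith(RESTRICTED_SUFFIXES)
    return restricted and path != "/robots.txt"


if __name__ == "__main__":
    sample = (
        'tmbg.duckduckgo.com - - [16/Dec/2009:**:01:30 -0800] '
        '"GET / HTTP/1.1" 200 14761 "-" "Mozilla/4.0 (compatible; MSIE 6.0; '
        'Windows NT 5.1; SV1; .NET CLR 1.1.4322)"'
    )
    fields = parse_line(sample)
    print(fields["host"], fields["path"], should_block(fields["host"], fields["path"]))
```

Matching on the hostname assumes the server is logging resolved hostnames, as in the line above; matching on IP ranges would be the alternative.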
Based on my testing, this appears to be the reason that the 'home page' gets fetched. The root URL of the site is the standard place to declare your favicon files (which you may choose to change based on the user-agent's capabilities, IE being the 'lowest common denominator' in feature support).
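To illustrate why a favicon fetcher would want the home page at all, here is a rough sketch of favicon discovery using only Python's standard library. It is my guess at the general technique, not the Duck's actual script: it scans the page for link tags whose rel includes "icon" and falls back to /favicon.ico if none is declared. FaviconFinder and find_favicons are names invented for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FaviconFinder(HTMLParser):
    """Collect the href of every <link> tag whose rel includes 'icon'."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.icons = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        # rel may hold several tokens, e.g. "shortcut icon".
        rel = (attr.get("rel") or "").lower().split()
        href = attr.get("href")
        if "icon" in rel and href:
            # Resolve relative hrefs against the page URL.
            self.icons.append(urljoin(self.base_url, href))


def find_favicons(base_url, html):
    """Return declared favicon URLs, or the conventional default location."""
    finder = FaviconFinder(base_url)
    finder.feed(html)
    finder.close()
    return finder.icons or [urljoin(base_url, "/favicon.ico")]


if __name__ == "__main__":
    page = '<html><head><link rel="shortcut icon" href="/img/fav.ico"></head></html>'
    print(find_favicons("http://example.com/", page))
```

A fetcher built this way has to read the root page's markup before it knows which icon file to request, which would explain the single hit on "/".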
Now whether you want to allow this fetch is of course entirely up to you, but sometimes the misbehavior of user-agents is due to "bumbling" rather than malicious intent.
For example, I just got about 64 favicon.ico requests in a row from a different host -- an on-line 'link-sharing and synchronization' service that was too stupid to understand a 403. It either had a limit on the number of fetches it was willing to do to get the favicon, or it timed out after a few seconds... It was simply stupid, not malicious.
I'm not trying to talk you into or out of blocking the Duck, just trying to throw some light on what it's doing.
From what I can find, Duck is just 'a guy' who's put together what seems to be a fairly good search service based on Yahoo's "BOSS", which is an 'open hook' into Yahoo's search data. His company name as listed in WHOIS is what I'd term "unfortunate" in that it sounds slightly threatening if not read as an intentional misspelling of the word "thing" morphed to "thang."
He used to fetch with either a blank user-agent or a standard browser UA, but has now given two of his three user-agents a proper name and 'info link' -- If he fixes this last script to send "DuckDuckFavicon" then perhaps some will forgive the robots.txt-less index page fetch... Caching the fetched favicons on his server might also help.
Jim