DuckDuckBot

         

keyplyr

11:32 am on Nov 6, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



72.94.249.**
Verizon IP address assigned to private residence
UA: DuckDuckBot/1.1; (+http://duckduckgo.com/duckduckbot.html)

Requested robots.txt only

Pfui

7:54 pm on Nov 6, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



From last March (believe it or not, I don't report every bot I see, or I'd never get anything done :). The host is the rDNS for the OP's IP:

nym.ivegotafang.com
DuckDuckBot/1.0; (+http://duckduckgo.com/duckduckbot.html)

robots.txt? NO

That bot page says "It should respect your robots.txt file" but it didn't. Also, it wasn't the only UA to come from that host (which states it's "partnered" with the UA's site). Another was:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)

robots.txt? NO

jdMorgan

8:27 pm on Nov 6, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The changelog notes (admits) that early versions of DuckDuckBot were not correctly respecting robots.txt.

The other user-agent may have something to do with their "page preview" function in the SERPs, but I'm not sure. It has been blocked here because of invalid request headers, so no preview is available for my site and this is only an inference. It could also have been the first version of the DDBot before it was given a name.

I actually kind of like their search results... Very clean and simple presentation.

Jim

Pfui

9:23 pm on Nov 16, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Still no robots.txt and now cloaked, too:

tmbg.duckduckgo.com
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)

robots.txt? NO

duckduckgo.com = 72.94.249.36
(see OP; now registered to "Ive Got A Phang Inc"; ditto...)
ivegotafang.com = 72.94.249.38

keyplyr

11:17 pm on Nov 16, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I've Got A Fang:

72.94.249.32 - 72.94.249.39

- OR -

72.94.249.32/29
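
For anyone who wants to double-check that the two notations cover the same block, here is a minimal sketch using Python's standard `ipaddress` module. The variable names are only illustrative, and the third test address (.40) is one I picked to show something just outside the range:

```python
import ipaddress

first = ipaddress.ip_address("72.94.249.32")
last = ipaddress.ip_address("72.94.249.39")

# Collapse the start/end pair into the smallest list of CIDR blocks.
blocks = list(ipaddress.summarize_address_range(first, last))
print(blocks)  # [IPv4Network('72.94.249.32/29')]

# Check a few addresses against the block (.40 is just outside it).
network = blocks[0]
for host in ("72.94.249.36", "72.94.249.38", "72.94.249.40"):
    print(host, ipaddress.ip_address(host) in network)
```

The range collapses to a single /29, so either form can go into a block list.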

Pfui

7:19 am on Dec 12, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



tmbg.duckduckgo.com
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)

robots.txt? NO (again/still)

jdMorgan

5:26 am on Dec 18, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The "windows" UA is triggered when your site is in the search results and the DDG server tries to show your favicon to the left of your listing. The other DDG user-agents seem to work as advertised now, with DuckDuckBot doing some crawling, and DuckDuckPreview generating the 'page preview' when you hover over a listing in the results. I haven't had any problem with robots.txt violations, but I allow these crawlers, and I've made an exception for that brain-dead favicon-finder-function.

Jim

Pfui

7:59 am on Dec 18, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I feel a bit (okay, a lot:) less charitable toward these guys because here's what I continue to see:

- They don't request, or respect, robots.txt.
- They don't have authorization to retrieve any file other than robots.txt, yet they try repeatedly.
- They don't use a self-identified UA -- they cloak their hits.
- They don't send any identifiable results-referred (or any) traffic.

To me, those add up to yet another bad bot.
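
To illustrate why the cloaked UA matters, here is a rough sketch using Python's standard `urllib.robotparser`. The robots.txt content and the example.com URLs are made up for the illustration (they are not the real files from either site), and `may_fetch` is just a hypothetical helper name:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out DuckDuckBot by name, and keep
# everyone else out of /private/ only.
ROBOTS_TXT = """\
User-agent: DuckDuckBot
Disallow: /

User-agent: *
Disallow: /private/
"""

MSIE_UA = ("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; "
           ".NET CLR 1.1.4322)")

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(user_agent: str, url: str) -> bool:
    """What a robots.txt-respecting client would conclude for this UA and URL."""
    return parser.can_fetch(user_agent, url)


print(may_fetch("DuckDuckBot/1.1", "http://example.com/"))  # named rule applies
print(may_fetch(MSIE_UA, "http://example.com/"))            # only the * rule applies
```

A rule addressed to the bot by name never matches a request sent under a browser UA, which is one reason a client that neither identifies itself nor reads the file cannot be steered by robots.txt at all.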

One last bit about favicons:

On the two sites they hit, they've never requested favicons, only the base .html file. One hit, then gone. For example, here they are doing exactly what robots.txt disallows -- if they'd bothered to read/heed robots.txt:

tmbg.duckduckgo.com - - [16/Dec/2009:**:01:30 -0800] "GET / HTTP/1.1" 200 14761 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"

Nothing says bad bot like having to add a special rule for a Host/IP saying, "Robots.txt or Bust."
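
The logic of such a rule is simple enough to sketch in a few lines of Python. This is only an illustration of the idea, not the actual server config; `status_for` and `FANG_RANGE` are hypothetical names, and the range is the one quoted earlier in the thread:

```python
import ipaddress

# Range quoted earlier in the thread for this host.
FANG_RANGE = ipaddress.ip_network("72.94.249.32/29")


def status_for(remote_ip: str, path: str) -> int:
    """Return the HTTP status a 'robots.txt or bust' rule would send."""
    in_range = ipaddress.ip_address(remote_ip) in FANG_RANGE
    if in_range and path != "/robots.txt":
        return 403  # anything but robots.txt is refused for this range
    return 200      # everyone else, and robots.txt itself, passes through


print(status_for("72.94.249.36", "/"))            # 403
print(status_for("72.94.249.36", "/robots.txt"))  # 200
```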

jdMorgan

4:54 pm on Dec 18, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



To get the correct URL of your favicon, which is not necessarily "favicon.ico", a user-agent needs to fetch a page where it may be referenced in a '<link rel="shortcut icon" type="image/x-icon" href="images/example.ico">' element. Failing that, the next step is to try the default 'favicon.ico'.

Based on my testing, this appears to be the reason that the 'home page' gets fetched. The root URL of the site is the standard place to declare your favicon files (which you may choose to change based on the user-agent's capabilities, IE being the 'lowest common denominator' in feature support).
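
The lookup order described above (declared icon first, then the default location) can be sketched with Python's standard library. This is my own illustration of the general technique, not DDG's actual script; `IconLinkFinder` and `favicon_url` are hypothetical names, and the example.com URL is a placeholder:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IconLinkFinder(HTMLParser):
    """Records the href of the first <link> whose rel list includes 'icon'."""

    def __init__(self):
        super().__init__()
        self.icon_href = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.icon_href is not None:
            return
        attr = dict(attrs)
        # rel may hold several tokens, e.g. "shortcut icon".
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "icon" in rel_tokens and attr.get("href"):
            self.icon_href = attr["href"]


def favicon_url(page_url: str, html: str) -> str:
    """Resolve the favicon URL for an already-fetched page."""
    finder = IconLinkFinder()
    finder.feed(html)
    if finder.icon_href:
        return urljoin(page_url, finder.icon_href)
    # No declaration found: fall back to the conventional default location.
    return urljoin(page_url, "/favicon.ico")


page = ('<html><head><link rel="shortcut icon" type="image/x-icon" '
        'href="images/example.ico"></head></html>')
print(favicon_url("http://example.com/", page))
# http://example.com/images/example.ico
```

The function needs the page's HTML before it can do anything, which is why a favicon lookup shows up in the logs as a fetch of the index page.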

Now whether you want to allow this fetch is of course entirely up to you, but sometimes the misbehavior of user-agents is due to "bumbling" rather than malicious intent.

For example, I just got about 64 favicon.ico requests in a row from a different host -- an on-line 'link-sharing and synchronization' service that was too stupid to understand a 403. It either had a limit on the number of fetches it was willing to do to get the favicon, or it timed out after a few seconds... It was simply stupid, not malicious.

I'm not trying to talk you into or out of blocking the Duck, just trying to throw some light on what it's doing.

From what I can find, Duck is just 'a guy' who's put together what seems to be a fairly good search service based on Yahoo's "BOSS", which is an 'open hook' into Yahoo's search data. His company name as listed in WHOIS is what I'd term "unfortunate" in that it sounds slightly threatening if not read as an intentional misspelling of the word "thing" morphed to "thang."

He used to fetch with either a blank user-agent or a standard browser UA, but has now given two of his three user-agents a proper name and 'info link' -- If he fixes this last script to send "DuckDuckFavicon" then perhaps some will forgive the robots.txt-less index page fetch... Caching the fetched favicons on his server might also help.

Jim