Twitter's Real Time URL Fetcher: SpiderDuck

     
10:32 am on Nov 16, 2011 (gmt 0)

WebmasterWorld Administrator engine is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month Best Post Of The Month



Twitter's Real Time URL Fetcher: SpiderDuck [engineering.twitter.com]
SpiderDuck is a service at Twitter that fetches all URLs shared in Tweets in real-time, parses the downloaded content to extract metadata of interest and makes that metadata available for other Twitter services to consume within seconds.

Several teams at Twitter need to access the linked content, typically in real-time, to improve Twitter products. For example:

  • Search to index resolved URLs and improve relevance
  • Clients to display certain types of media, such as photos, next to the Tweet
  • Tweet Button to count how many times each URL has been shared on Twitter
  • Trust & Safety to aid in detecting malware and spam
  • Analytics to surface a variety of aggregated statistics about links shared on Twitter
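
As a rough illustration of the "parse the downloaded content to extract metadata" step described above, here is a minimal Python sketch. It is not Twitter's implementation; the class `MetadataParser`, the function `extract_metadata`, and the choice of fields (page title plus `<meta>` name/property pairs) are assumptions made for the example, and the HTML is passed in as a string rather than fetched.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Hypothetical helper: collects the <title> text and <meta> name/property pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Accept both <meta name="..."> and <meta property="..."> (e.g. Open Graph).
            key = attrs.get("name") or attrs.get("property")
            if key and attrs.get("content") is not None:
                self.meta[key] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(html):
    """Return a dict with the page title and any <meta> name/property values."""
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "meta": parser.meta}


if __name__ == "__main__":
    sample = (
        "<html><head><title> Example page </title>"
        '<meta name="description" content="A short summary">'
        '<meta property="og:image" content="http://example.com/a.png">'
        "</head><body>Hello</body></html>"
    )
    print(extract_metadata(sample))
```

A real fetcher would also have to handle redirects, character encodings, and rate limits; none of that is covered here.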
    11:42 am on Nov 16, 2011 (gmt 0)

    WebmasterWorld Senior Member 5+ Year Member



    I've seen Twitter's "spiderduck" subdomain/bot since at least the beginning of August. Here's what it looks like, with two different UAs from two different domains always hitting simultaneously on Nov. 12th --

    spiderduck01.dmz1.twitter.com [projecthoneypot.org...]
    Twitterbot/1.0

    09:57:34 /robots.txt

    -- BUT --

    User-agent: *
    Disallow: /

    -- is promptly ignored by its fellow traveler(s):

    r-199-59-149-10.twttr.com [projecthoneypot.org...]
    Twitterbot/0.1

    09:57:34 /filename.html
    10:34:57 /filename.html

    Three-plus months of hits show the exact same one-two punch pattern: Twitterbot/1.0 requests only robots.txt, while Twitterbot/0.1 never requests it (and always ignores it).
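
For anyone who wants to check what a compliant crawler should conclude from the rules quoted above, here is a small Python sketch using the standard library's `urllib.robotparser`. The helper name `is_allowed` is made up for this example, and the path is the placeholder `/filename.html` from the log lines. It only shows how the rules are meant to be read, not what either Twitterbot actually does.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted in the post: every user agent is disallowed everywhere.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt, user_agent, path):
    """Return True if robots_txt permits user_agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for ua in ("Twitterbot/1.0", "Twitterbot/0.1"):
        print(ua, "allowed:", is_allowed(ROBOTS_TXT, ua, "/filename.html"))
```

Under these rules both user agents come back as not allowed, so the /filename.html requests in the log are ones a robots.txt-respecting bot would not have made.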

    FWIW: I'm content to leave my Disallows and bot-blocks as-is because I've yet to see any benefit from Twitter crawling/extracting/whatevering my content "to improve Twitter products."
    11:56 am on Nov 16, 2011 (gmt 0)

    WebmasterWorld Administrator engine is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month Best Post Of The Month



    I guess that until we see how the data is used or presented, it's difficult to say whether it's worthwhile to allow access. I would have thought that a site such as WSJ or BBC would want to allow access to the public side of the site.
    11:29 pm on Nov 17, 2011 (gmt 0)

    WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



    Posting a URL to Twitter these days usually sees about ten different bot visits to the posted URL within 1 to 2 seconds of posting.
     
