
Forum Moderators: not2easy & rumbas


Twitter's Real Time URL Fetcher: SpiderDuck

     
10:32 am on Nov 16, 2011 (gmt 0)

Administrator "engine" from GB

joined: May 9, 2000
posts: 23245
votes: 357


Twitter's Real Time URL Fetcher: SpiderDuck [engineering.twitter.com]
SpiderDuck is a service at Twitter that fetches all URLs shared in Tweets in real-time, parses the downloaded content to extract metadata of interest and makes that metadata available for other Twitter services to consume within seconds.

Several teams at Twitter need to access the linked content, typically in real-time, to improve Twitter products. For example:

  • Search to index resolved URLs and improve relevance
  • Clients to display certain types of media, such as photos, next to the Tweet
  • Tweet Button to count how many times each URL has been shared on Twitter
  • Trust & Safety to aid in detecting malware and spam
  • Analytics to surface a variety of aggregated statistics about links shared on Twitter
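SpiderDuck's parser isn't public, but the extraction step described above (download a page, pull out items such as the title and any photo URLs) can be sketched with Python's standard library. The class and function names here are illustrative, not Twitter's:

```python
from html.parser import HTMLParser


class MetadataExtractor(HTMLParser):
    """Collects the <title> text and Open Graph <meta> tags from fetched HTML."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.og = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            d = dict(attrs)
            prop = d.get("property", "")
            if prop.startswith("og:"):
                self.og[prop] = d.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(html):
    """Return a dict of metadata a downstream service could consume."""
    parser = MetadataExtractor()
    parser.feed(html)
    return {"title": parser.title.strip(), **parser.og}


sample = ('<html><head><title>Example Page</title>'
          '<meta property="og:image" content="photo.jpg">'
          '</head><body></body></html>')
print(extract_metadata(sample))
```

Open Graph tags such as og:image are the usual source for the inline photos mentioned in the client bullet above.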

    11:42 am on Nov 16, 2011 (gmt 0)

    Senior Member

    joined: Nov 5, 2005
    posts: 2038
    votes: 1


    I've seen Twitter's "spiderduck" subdomain/bot since at least the beginning of August. Here's what it looks like, with two different UAs from two different domains always hitting simultaneously on Nov. 12th --

    spiderduck01.dmz1.twitter.com [projecthoneypot.org...]
    Twitterbot/1.0

    09:57:34 /robots.txt

    -- BUT --

    User-agent: *
    Disallow: /

    -- is promptly ignored by its fellow traveler(s):

    r-199-59-149-10.twttr.com [projecthoneypot.org...]
    Twitterbot/0.1

    09:57:34 /filename.html
    10:34:57 /filename.html

    Three-plus months of hits show the exact same one-two punch pattern: Twitterbot/1.0 only ever requests robots.txt, and Twitterbot/0.1 never does (& always ignores it).

    FWIW: I'm content to leave my Disallows and bot-blocks as-is because I've yet to see any benefit from Twitter crawling/extracting/whatevering my content "to improve Twitter products."
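For contrast, here is what a compliant fetcher's robots.txt check looks like, sketched with Python's standard library against the exact rules quoted above; example.com stands in for the blocked site:

```python
from urllib.robotparser import RobotFileParser

# The rules the site served on Nov. 12th:
rules = [
    "User-agent: *",
    "Disallow: /",
]

rp = RobotFileParser()
rp.parse(rules)

# A compliant crawler checks before every content fetch. Both UAs fall
# under the wildcard record, so the fetch should be refused.
for agent in ("Twitterbot/1.0", "Twitterbot/0.1"):
    print(agent, rp.can_fetch(agent, "http://example.com/filename.html"))
```

In the logs above, only Twitterbot/1.0 even downloads the rules, and Twitterbot/0.1 fetches the page regardless.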
    11:56 am on Nov 16, 2011 (gmt 0)

    Administrator "engine" from GB

    joined: May 9, 2000
    posts: 23245
    votes: 357


    I guess until we see how the data is used or presented, it's difficult to say whether it's worthwhile to allow access. I would have thought that if you're a site such as WSJ or the BBC you'd want to allow access to the public side of the site.
    11:29 pm on Nov 17, 2011 (gmt 0)

    Senior Member g1smd

    joined: July 3, 2002
    posts: 18903
    votes: 0


    Posting a URL to Twitter these days usually triggers about ten different bot visits to the posted URL within 1 to 2 seconds of posting.
     
