
Forum Moderators: martinibuster


Yahoo Webmap: Roughly 1 Trillion Links

Yahoo Implements Apache Hadoop To Process Webmap

     
2:33 pm on Feb 21, 2008 (gmt 0)

Administrator from GB 

WebmasterWorld Administrator engine is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month Best Post Of The Month

joined:May 9, 2000
posts:22305
votes: 239


On a very related note, we're announcing today that we implemented what we believe is the world's largest commercial application of Apache Hadoop. We are now using Hadoop to process the Webmap -- the application which produces the index from the billions of pages crawled by Yahoo! Search.

Yahoo Implements Apache Hadoop To Process Webmap [ysearchblog.com]

More about Hadoop running in production on the Yahoo! Search Webmap [developer.yahoo.net]

Some Webmap size data:

    Number of links between pages in the index: roughly 1 trillion links
    Size of output: over 300 TB, compressed!
    Number of cores used to run a single Map-Reduce job: over 10,000
    Raw disk used in the production cluster: over 5 Petabytes
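
For anyone unfamiliar with the Map-Reduce model mentioned above, here is a rough toy sketch of the kind of link aggregation such a job could perform: counting inbound links per page. This is not Yahoo's Webmap code and does not use the Hadoop API; the function names (`map_links`, `reduce_counts`, `count_inlinks`) and the `toy_crawl` data are made up for illustration, and everything runs in one process rather than across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_links(page_url, outlinks):
    """Map step: emit a (target, 1) pair for every outgoing link on a page."""
    for target in outlinks:
        yield target, 1


def reduce_counts(pairs):
    """Shuffle and reduce step: group pairs by key and sum their values."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def count_inlinks(crawl):
    """Run the map step over every crawled page, then reduce the results."""
    emitted = chain.from_iterable(
        map_links(url, links) for url, links in crawl.items()
    )
    return reduce_counts(emitted)


# Hypothetical three-page crawl: page URL -> list of pages it links to.
toy_crawl = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
}

inlinks = count_inlinks(toy_crawl)
print(inlinks)
```

In a real cluster the map calls would run in parallel on many machines and the framework would handle the grouping by key; the sketch only shows the shape of the computation.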
          3:59 pm on Feb 21, 2008 (gmt 0)

          Senior Member

          WebmasterWorld Senior Member jimbeetle is a WebmasterWorld Top Contributor of All Time 10+ Year Member

          joined:Oct 26, 2002
          posts:3292
          votes: 6


          There's a very interesting interview on that second linked page, by Jeremy Zawodny, with two of the Y! engineers (there since the Inktomi days) about the Y! search infrastructure.
          12:47 am on Feb 22, 2008 (gmt 0)

          Senior Member

          WebmasterWorld Senior Member 10+ Year Member

          joined:Feb 13, 2005
          posts:1077
          votes: 0


          1 Trillion links and the best Yahoo can return is:

          select top 10 *
          from links
          order by newId()

          lol

          2:03 pm on Feb 22, 2008 (gmt 0)

          Preferred Member

          5+ Year Member

          joined:Sept 28, 2007
          posts:487
          votes: 0


          Any time Yahoo boasts about how good it is, I go and search for some basic stuff, and I'm disappointed yet again with nothing but crap. Hey Yahoo guys, you are in trouble either way: on your own you are useless, and with Microsoft you are marrying another loser in the search business.
          9:58 pm on Feb 22, 2008 (gmt 0)

          Senior Member

          WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

          joined:Dec 27, 2004
          posts:1666
          votes: 35


          I think this would double the quality and split the Trillion in half though:

          select *
          from links
          where id not in (select id from links where link like '%.info%')

          .

          9:09 pm on Feb 23, 2008 (gmt 0)

          Junior Member

          5+ Year Member

          joined:Sept 24, 2007
          posts:51
          votes: 0


          Yes, and maybe now they can also clear the old Inktomi penalties, which they seem lost on how to go about removing.