
Forum Moderators: martinibuster

Yahoo Webmap: Roughly 1 Trillion Links

Yahoo Implements Apache Hadoop To Process Webmap

     
2:33 pm on Feb 21, 2008 (gmt 0)

WebmasterWorld Administrator engine is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month Best Post Of The Month



On a very related note, we're announcing today that we implemented what we believe is the world's largest commercial application of Apache Hadoop. We are now using Hadoop to process the Webmap -- the application which produces the index from the billions of pages crawled by Yahoo! Search.

Yahoo Implements Apache Hadoop To Process Webmap [ysearchblog.com]

More about Hadoop running in production on the Yahoo! Search Webmap [developer.yahoo.net]

Some Webmap size data:

    Number of links between pages in the index: roughly 1 trillion links
    Size of output: over 300 TB, compressed!
    Number of cores used to run a single Map-Reduce job: over 10,000
    Raw disk used in the production cluster: over 5 Petabytes
          3:59 pm on Feb 21, 2008 (gmt 0)

          WebmasterWorld Senior Member jimbeetle is a WebmasterWorld Top Contributor of All Time 10+ Year Member



          Very interesting interview by Jeremy Zawodny with two of the Y! engineers (around since the Inktomi days) about the Y! search infrastructure, on that second linked page.
          12:47 am on Feb 22, 2008 (gmt 0)

          WebmasterWorld Senior Member 10+ Year Member



          1 Trillion links and the best Yahoo can return is:

          select top 10 *
          from links
          where id = newId()

          lol

          2:03 pm on Feb 22, 2008 (gmt 0)

          5+ Year Member



          Any time Yahoo tries to boast about how good it is, I go and do a search for some basic stuff -- disappointed yet again with nothing but crap. Hey Yahoo guys, you are in trouble either way. All by yourself you are useless and with Microsoft you are marrying another loser in the search business.
          9:58 pm on Feb 22, 2008 (gmt 0)

          WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



          I think this would double the quality and split the Trillion in half though:

          select *
          from links
          where id not in (select id from links where link like '%.info%')

          .

          9:09 pm on Feb 23, 2008 (gmt 0)

          5+ Year Member



          Yes, maybe now they can also clear the old Inktomi penalties that they seem lost on how to remove.
           
