

Yahoo Search Engine and Directory Forum

Yahoo Webmap: Roughly 1 Trillion Links
Yahoo Implements Apache Hadoop To Process Webmap

 2:33 pm on Feb 21, 2008 (gmt 0)

On a very related note, we're announcing today that we implemented what we believe is the world's largest commercial application of Apache Hadoop. We are now using Hadoop to process the Webmap -- the application which produces the index from the billions of pages crawled by Yahoo! Search.

Yahoo Implements Apache Hadoop To Process Webmap [ysearchblog.com]

More about Hadoop running in production on the Yahoo! Search Webmap [developer.yahoo.net]

Some Webmap size data:

    Number of links between pages in the index: roughly 1 trillion links
    Size of output: over 300 TB, compressed!
    Number of cores used to run a single Map-Reduce job: over 10,000
    Raw disk used in the production cluster: over 5 Petabytes
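
For readers unfamiliar with the Map-Reduce model named in the figures above, here is a toy sketch of the general idea, applied to counting inbound links per page. It is not Yahoo's Webmap code and does not use the Hadoop API; the function names and the `toy_crawl` data are made up for illustration, and everything runs in a single process rather than across a cluster.

```python
from collections import defaultdict


def map_links(page, outlinks):
    """Map step: emit a (target, 1) pair for every outgoing link on a page."""
    for target in outlinks:
        yield target, 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each target page."""
    totals = defaultdict(int)
    for target, count in pairs:
        totals[target] += count
    return dict(totals)


def count_inbound_links(crawl):
    """Run the map step over every crawled page, then reduce the results."""
    pairs = []
    for page, outlinks in crawl.items():
        pairs.extend(map_links(page, outlinks))
    return reduce_counts(pairs)


# Hypothetical three-page crawl: page -> pages it links to.
toy_crawl = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
}

inbound = count_inbound_links(toy_crawl)
print(inbound)
```

In a real cluster the map calls would run in parallel on many machines and the framework would group the pairs by key before the reduce step; the sketch only shows the shape of the two functions.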



           3:59 pm on Feb 21, 2008 (gmt 0)

          Very interesting interview by Jeremy Zawodny with two of the Y! engineers (there since the Inktomi days) about the Y! search infrastructure, on that second linked page.


           12:47 am on Feb 22, 2008 (gmt 0)

          1 Trillion links and the best Yahoo can return is:

          select top 10 *
          from links
          order by newId()



           2:03 pm on Feb 22, 2008 (gmt 0)

          Any time Yahoo tries to boast about how good it is, I go and do a search for some basic stuff -- disappointed yet again with nothing but crap. Hey Yahoo guys, you are in trouble either way. All by yourself you are useless and with Microsoft you are marrying another loser in the search business.


           9:58 pm on Feb 22, 2008 (gmt 0)

          I think this would double the quality and split the Trillion in half though:

          select *
          from links
          where link not like '%.info%'



           9:09 pm on Feb 23, 2008 (gmt 0)

          Yes, maybe now they can also clear the old Inktomi penalties, which they seem lost on how to remove.

