Two of the gems I'm appreciating from that collection:
Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms by Monika Henzinger
This paper was not new to me - I believe Marcia pointed it out a while ago. It really opened my eyes to the challenge of attributing a document properly and filtering out the secondary versions. Some of the URLs that hide behind "omitted results" links owe their hiding place to this kind of logic.
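To give a flavor of that logic, here's a minimal sketch in the spirit of the shingling approach Henzinger evaluates. The shingle size, threshold, and sample text are my own illustrative choices, not values from the paper or from Google.

```python
# A minimal sketch of shingle-based near-duplicate detection.
# Shingle size and threshold here are illustrative, not from the paper.

def shingles(text, k=4):
    """Break a document into overlapping k-word shingles."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity between two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def near_duplicates(doc_a, doc_b, threshold=0.6):
    """Flag two documents as near-duplicates if their shingle overlap is high."""
    return jaccard(shingles(doc_a), shingles(doc_b)) >= threshold

# A scraped copy with a slightly reworded ending still overlaps heavily.
original = "Google announced a new ranking update affecting duplicate content today"
copy = "Google announced a new ranking update affecting duplicate content this week"
print(jaccard(shingles(original), shingles(copy)))  # prints the overlap score
print(near_duplicates(original, copy))
```

Pages whose shingle sets overlap above the threshold get treated as versions of the same document, and only one of them earns a place in the visible results.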
Structured Models for Fine-to-Coarse Sentiment Analysis by Ryan McDonald, et al.
Sentiment Analysis is a particular interest of mine. It's a kind of semantic processing that works to determine the "sentiment" of a document. That can mean many things, but two key areas for Google would be where a document falls on a positive to negative scale in its approach to a topic -- or where it falls on a spectrum of subjective (opinion-based) to objective (fact-based).
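To make the positive-to-negative side concrete, here's a toy sketch that scores each sentence and rolls those fine-grained scores up into a coarse document score. It's a deliberately crude stand-in: the paper's structured models learn sentence and document labels jointly, rather than counting words from a hand-made list, and the word lists below are my own placeholders.

```python
# A toy lexicon-based scorer: fine-grained (per-sentence) scores aggregated
# into a coarse (document-level) score. Word lists are placeholders.

POSITIVE = {"great", "love", "excellent", "reliable", "fast"}
NEGATIVE = {"terrible", "hate", "slow", "broken", "disappointing"}

def sentence_score(sentence):
    """Score one sentence on a positive (+) to negative (-) scale."""
    words = sentence.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def document_score(text):
    """Aggregate fine-grained sentence scores into a coarse document score."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    scores = [sentence_score(s) for s in sentences]
    return sum(scores), scores

doc = "The camera is excellent and fast. The battery life is disappointing."
total, per_sentence = document_score(doc)
print(per_sentence, total)  # [2, -1] and an overall score of 1
```

Even this crude version shows why the problem is interesting for ranking: the same page can be glowing about one aspect of a topic and sour on another.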
You can see how Google would be very interested in this kind of challenge. In fact, I thought I saw signs of Sentiment Analysis in first-page results last year. But when I asked some Google staff about it at PubCon, I was told it's not currently in use - and that it is definitely a "hard problem." If you just think about an algorithm trying to make sense of irony, you can quickly appreciate how hard the problem can be.
Those two words currently being tossed around by top staff - "diversity" and "serendipity" - certainly could incorporate some sentiment factors in the future.