I feel very strongly that until we have a good grasp on why it exists, it will be very hard to beat.
I don't buy the explanation that it's intended to be a method of stopping spam. Why? One, it is doing too much collateral damage. Two, if you accept the 80/20 principle (20% of spammers are doing 80% of the spamming), and you realize that those spammers already know multiple ways of beating the sandbox, the explanation doesn't make sense anymore.
So, why does the sandbox exist?
The most obvious effect of the sandbox is that it prevents new domains (not pages) from ranking for any relatively competitive term. So, start thinking like a search engine - what would be the benefit of this?
entry. Most likely the URL's identifier is the question mark portion.
I read this argument about a year ago. It wasn't impressive then, and it's not impressive now.
Where does it say anything about 32 bit integers in there?
Point is, even *if* there were some 32-bit value for document IDs (dubious), it would take *nothing* to assign a version number to the files, widen the ID to a longer value, and update the SE and crawl programs to recognize the file version as they serve pages and crawl.
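To make the versioning idea concrete, here is a minimal sketch in Python. The file layout (a one-byte version header followed by fixed-width docIDs) and all of the names are invented for illustration; nothing here is taken from Google's actual formats.

```python
import struct

# Hypothetical layout: 1-byte version header, then little-endian docIDs.
# Version 1 stores 4-byte IDs, version 2 stores 8-byte IDs.
DOCID_FORMAT = {1: "<I", 2: "<Q"}

def write_docids(version, docids):
    """Serialize docIDs using the width that belongs to this file version."""
    fmt = DOCID_FORMAT[version]
    body = b"".join(struct.pack(fmt, d) for d in docids)
    return struct.pack("<B", version) + body

def read_docids(blob):
    """Read the version byte first, then decode IDs at the matching width."""
    fmt = DOCID_FORMAT[blob[0]]
    width = struct.calcsize(fmt)
    return [struct.unpack_from(fmt, blob, offset)[0]
            for offset in range(1, len(blob), width)]

old_file = write_docids(1, [7, 4_294_967_295])   # largest 32-bit ID still fits
new_file = write_docids(2, [7, 4_294_967_296])   # one past the 32-bit limit

print(read_docids(old_file))
print(read_docids(new_file))
```

The same reader handles both files, which is the whole point: old data stays readable while new data gets the wider ID.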
When you are networking large numbers of computers, factors come into play that normally don't, and that many of us would not even think about (I can't speak for you, of course).
"Each chunk is identified by an immutable and globally unique 64 bit chunk handle assigned by the master at the time of chunk creation. Chunk servers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle and byte range."
"The Google File System," by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
4 bytes: 300 * 4 billion * 4 * 2 = 9.6 terabytes
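For anyone who wants to check the arithmetic, or see how it moves with a wider docID, here is a small Python sketch. The line above doesn't say what the 300 and the 2 stand for, so they are carried through as unnamed placeholder factors (per_doc_factor and extra_factor are my names, not the poster's); only the docID width is varied.

```python
TB = 10**12  # decimal terabyte, matching the 9.6 figure

def index_bytes(per_doc_factor, documents, docid_bytes, extra_factor):
    """Plain product of the four factors in the estimate."""
    return per_doc_factor * documents * docid_bytes * extra_factor

for width in (4, 5, 8):
    size = index_bytes(300, 4_000_000_000, width, 2)
    print(f"{width}-byte docID: {size / TB:.1f} TB")
```

With everything else held fixed, the estimate scales linearly with the ID width: 9.6 TB at 4 bytes, 12.0 TB at 5 bytes, and 19.2 TB at 8 bytes.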
The first thing that happens when a search is requested is a lookup in the inverted index. To distribute this load, multiple copies of this index probably exist in each data center. Multiply the above by some unknown number.
This is a lot of space, and that's just the space issue connected with the docID, whether it's all in memory or all on hard disk. Everything I've ever read about inverted indexes mentions the importance of compression. But no amount of compression gets you more than 4.29 billion unique combinations of ones and zeros out of 4 bytes (32 bits).
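The 4.29 billion ceiling is just 2 to the 32nd power. A quick Python check, with 5 and 8 bytes added for comparison (the function name is mine):

```python
def distinct_values(n_bytes):
    """Number of distinct bit patterns that fit in n_bytes bytes."""
    return 2 ** (8 * n_bytes)

for n in (4, 5, 8):
    print(f"{n} bytes: {distinct_values(n):,} distinct IDs")
```

Four bytes gives 4,294,967,296 IDs; a single extra byte multiplies that by 256.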
Now add the performance issue of the extra CPU cycles to fetch an expanded docID. (We're not even talking about calculating PageRank, because I think Google realized this was dead 18 months ago.)
Moving to 64-bit computing makes a lot of sense. They can define a new 5-byte integer type in the math library if they want to save space. But the point is, a 64-bit CPU could fetch this new type in one pass instead of two, and you don't take a performance hit.
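As a rough illustration of what a 5-byte ID would look like in storage, here is a Python sketch that packs IDs into 5 bytes each and reads them back by position. The helper names are made up, and this only shows the layout; Python says nothing about how many passes a CPU needs to fetch the value.

```python
ID_WIDTH = 5  # bytes per docID in this hypothetical packed layout

def pack_docid5(docid):
    """Encode one docID in exactly 5 bytes (raises OverflowError if too big)."""
    return docid.to_bytes(ID_WIDTH, "little")

def unpack_docid5(buf, index):
    """Decode the docID stored at position `index` in a packed buffer."""
    start = index * ID_WIDTH
    return int.from_bytes(buf[start:start + ID_WIDTH], "little")

ids = [0, 4_294_967_296, 2**40 - 1]   # the second no longer fits in 32 bits
packed = b"".join(pack_docid5(i) for i in ids)

print(len(packed))                                    # 5 bytes per ID
print([unpack_docid5(packed, i) for i in range(len(ids))])
```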
The lookups have to be painfully slow no matter which way you slice it, because of the numbers you get at the top of the results page (typical values fall in the 0.15 to 0.4 seconds range). The Google searches are so slow they *have* to distribute things around just to keep up with the requests. At 0.25 seconds average for a lookup and the 4,000 or so searches per second they probably get during peak periods, they'll need 1,000 machines just to handle the load.
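The 1,000-machine figure is arrival rate times time per lookup. Here is that calculation as a Python sketch, assuming (as the estimate implicitly does) that each machine works on one lookup at a time; that assumption and the parameter names are mine. Times are in milliseconds so the math stays in integers.

```python
def machines_needed(searches_per_second, lookup_ms, lookups_per_machine=1):
    """Lookups in flight at once, divided by what one machine can hold."""
    in_flight_x1000 = searches_per_second * lookup_ms
    capacity_x1000 = 1000 * lookups_per_machine
    return -(-in_flight_x1000 // capacity_x1000)  # integer ceiling division

for ms in (150, 250, 400):
    print(f"{ms} ms per lookup: {machines_needed(4000, ms)} machines")
```

At 250 ms this gives the 1,000 machines quoted above; across the 0.15 to 0.4 second range it runs from 600 to 1,600.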
In my view a reported 10,000 machines at the plex and elsewhere easily handles a distributed inverted index/repository/etc with a larger DocId.
Oh, critter, you really need to go back and reread the thing before making the types of comments you're making. Either your memory is playing tricks on you, or you just skimmed over this:
"Our compact encoding uses two bytes for every hit. There are two types of hits: fancy hits and plain hits."
Is there something about that sentence that is unclear? Two types of hits, each 2 bytes. That's 2 + 2 bytes, that's 4 bytes.