I feel very strongly that until we have a good grasp on why the sandbox exists, it will be very hard to beat.
I don't buy the explanation that it's intended to be a method of stopping spam. Why? One, it is doing too much collateral damage. Two, if you accept the 80/20 principle (20% of spammers are doing 80% of the spamming), and you realize that there are already multiple ways of beating the sandbox that all of those spammers are aware of, the spam explanation doesn't make sense anymore.
So, why does the sandbox exist?
The most obvious effect of the sandbox is that it prevents new domains (not pages) from ranking for any relatively competitive term. So, start thinking like a search engine - what would be the benefit of this?
Data mining is all about "trending" the data... looking for patterns and anomalies.
So anyone who uploads a large amount of new content, say 30% in one hit, is taking the risk that the new content will be viewed as anomalous.
If the new content is examined for LSI and is thought to look like puff and wind, it gets stuffed into the sandbox, yes?
We're all at it, aren't we? Creating new content to impress Google that we are an authority site, but not being prepared to commission work from known authorities. Instead we plagiarise other people's work, modify it, and chuck it on the web. I think Google has the measure of that little wheeze and has tightened up on LSI recently.
But what is PageRank in the larger scheme of ranking? It's a number that ranks the importance of the page, that is assigned without respect to the search terms that may be used to pull up the page. The key thing about this number is that it can be precomputed. Then the docIDs in the inverted indexes can be sorted by this number. That means you only have to scrape off the top of the docIDs for a search term -- just deep enough to satisfy the searcher's request for 10 to 100 links. You don't have to look at 99 percent of your index for most searches.
After you scrape off the top docIDs for a search, then you look at how each document relates to the search terms, using other algorithms. But this initial sort in the inverted indexes is probably the most crucial efficiency algorithm in Google's entire system.
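To make the "scrape off the top" idea concrete, here is a minimal sketch in Python. It is a toy, not Google's code: the documents, the `static_score` values, and the function names are all made up for illustration. The only point is that if each term's docID list is pre-sorted by a query-independent score, a search only needs the first few entries, and the query-dependent scoring runs on that short list.

```python
# Hypothetical precomputed, query-independent scores (docID -> score).
static_score = {1: 0.9, 2: 0.2, 3: 0.7, 4: 0.5, 5: 0.1}

# Toy documents (made up for this example).
docs = {
    1: "google sandbox theory",
    2: "sandbox toys for kids",
    3: "sandbox effect on new domains",
    4: "new domains and pagerank",
    5: "sandbox sandbox sandbox",
}


def build_index(docs, static_score):
    """Build term -> docID list, each list sorted by the precomputed score."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.split()):
            index.setdefault(term, []).append(doc_id)
    for postings in index.values():
        postings.sort(key=lambda d: static_score[d], reverse=True)
    return index


def top_candidates(index, term, depth):
    """Take only the first `depth` docIDs; the rest of the list is never read."""
    return index.get(term, [])[:depth]


def rerank(candidates, docs, term):
    """Second stage: a stand-in query-dependent score (plain term frequency)."""
    return sorted(candidates, key=lambda d: docs[d].split().count(term), reverse=True)


index = build_index(docs, static_score)
candidates = top_candidates(index, "sandbox", depth=2)
results = rerank(candidates, docs, "sandbox")
```

In this toy, doc 5 repeats "sandbox" three times but has the lowest precomputed score, so it sits at the bottom of the list and never reaches the second stage when `depth=2`. That is the kind of effect a low precomputed number would have on a page, whatever its on-page relevance.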
Now this initial "PageRank" number certainly does not have to be the pure link calculation it was originally. Links are an obvious indication of importance, but the calculation doesn't have to be pure or recursive. If you did a seat-of-your-pants link calculation, you might want to consider other factors also. Remember, all these factors would blend into a number that is precomputed -- before you even construct the inverted indexes for searching. The inverted indexes are sorted on this number.
One thing that comes to mind is some measurement of the quality of a page in the context of the site. The original PageRank never looked at the site as a whole. But the more you know about the site, the more you know about the quality of pages that make up the site. Is the site spammy? Is it a .gov, .edu, or .org where the spam problem is less? Is it a new site? If new, does it have thousands of pages already? Is the site commercial or informational? If commercial, is it an affiliate site?
What if Google started keeping information on the nature of sites, and used this to weight the "PageRank" of the pages on that site? This would probably be the best approach to fighting spam.
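Purely as a thought experiment, the weighting could be as simple as a per-site multiplier applied to the page number before the indexes are sorted. The sketch below assumes that; the domain names, the multipliers, and the default value are invented, and nothing here is known to reflect what Google actually does.

```python
from urllib.parse import urlparse

# Hypothetical site-level multipliers (e.g. established vs. brand-new site).
site_weight = {
    "old-nonprofit.example": 1.0,
    "brand-new.example": 0.3,
}
DEFAULT_SITE_WEIGHT = 0.5  # assumed value for sites with no stored profile


def site_of(url):
    """Use the host name as the 'site' a page belongs to."""
    return urlparse(url).netloc


def weighted_score(url, page_score):
    """Blend a page-level score with what is known about its site."""
    return page_score * site_weight.get(site_of(url), DEFAULT_SITE_WEIGHT)


new_page = weighted_score("http://brand-new.example/widgets.html", 0.8)
old_page = weighted_score("http://old-nonprofit.example/widgets.html", 0.6)
```

With these made-up numbers, the page on the new site starts with the better page-level score (0.8 vs. 0.6) but ends up with the lower blended number, which is roughly what a sandboxed domain looks like from the outside.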
In the Florida update, they tried to do something on the other end of the pipeline. Florida was an on-the-fly filter that was applied after the search terms were collected from the searcher. It didn't work too well. Maybe the semantic stuff was overrated internally at Google, by some engineers who had influence.
Now they may be working on the pre-computed part of the algorithm. I think they'll still call it "PageRank" (at least until all the lockups expire in five months and they all dump their stock), but it's going to be something more than PageRank. I suspect the logical direction is to evaluate the page as a member of a site. There are many fewer sites than there are pages, and it might be workable.
Something else I'll throw in here. My site, a 129,000 page nonprofit site, got a special crawl over the Labor Day weekend. It was special because it was manually dispatched. I know this because they grabbed all the pages, didn't ask for anything that was 404, and didn't ask for any of the sitemap pages. Every crawled page was sorted -- they crawled from the shortest URL to the longest URL. The only way they could have done a crawl this clean would be to either study my sitemap pages, or take my CSV dump of the deep page URLs, parse out that field, and re-sort it.
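For anyone wondering how little work the CSV route would take, here is a rough sketch. The column names and sample rows are invented (the real dump's layout isn't shown here); it only illustrates pulling one field out of a CSV and ordering it shortest URL first.

```python
import csv
import io

# Made-up sample standing in for a CSV dump of deep page URLs.
sample_csv = """id,url,title
1,http://example.org/a/deep/page.html,Deep
2,http://example.org/,Home
3,http://example.org/a/,Section
"""


def urls_shortest_first(csv_text, field="url"):
    """Parse out one column and sort it by URL length, then alphabetically."""
    reader = csv.DictReader(io.StringIO(csv_text))
    urls = [row[field] for row in reader]
    return sorted(urls, key=lambda u: (len(u), u))


crawl_order = urls_shortest_first(sample_csv)
```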
I've never seen a crawl like this in four years. They crawled for 36 hours. Only two IP addresses were used. About every 25 minutes, they'd hit the site for around 2 minutes only. It was very methodical. The peak fetch rate I recorded was 40 pages per second. Yes, per second -- even though almost all pages are very small, and are all static, this tripped my load alarms. I survived and let them do their thing.
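As a sanity check on those figures, the arithmetic works out. The inputs below are the rounded numbers from the post ("about every 25 minutes", "around 2 minutes"), so treat the results as ballpark only.

```python
# Approximate figures taken from the post above.
CRAWL_HOURS = 36
BURST_INTERVAL_MIN = 25      # one burst roughly every 25 minutes
BURST_LENGTH_SEC = 2 * 60    # each burst lasts around 2 minutes
PEAK_PAGES_PER_SEC = 40
SITE_PAGES = 129_000

bursts = (CRAWL_HOURS * 60) // BURST_INTERVAL_MIN        # whole bursts in 36 hours
active_seconds = bursts * BURST_LENGTH_SEC               # time actually spent fetching
max_pages_at_peak = active_seconds * PEAK_PAGES_PER_SEC  # ceiling if every burst ran at peak
avg_rate_needed = SITE_PAGES / active_seconds            # pages/sec needed to cover the site

print(bursts, active_seconds, max_pages_at_peak, avg_rate_needed)
```

That comes to 86 bursts and 10,320 seconds of fetching. Covering all 129,000 pages needs an average of 12.5 pages per second during the bursts, well under the 40 per second peak, so a complete crawl in that window is plausible.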
Why did I go off-topic to mention this? Because I'm not sure it's off-topic. I think it might be evidence that Google is no longer exclusively looking at the web as a bunch of pages, but as pages that belong to sites.
This could explain the sandbox effect.
So far, by the way, there is no evidence that this special crawl has kicked in.
If Google were to go into content analysis and site themes, how would searchers find the granular information they're looking for? And wouldn't everyone's response be to harmonize their site for their core keyword phrases, and be resistant to developing new content? That seems like a dead end.
Jake's original post was about why the sandbox exists, and there have been some interesting suppositions:
* new pages are held off to stabilize the database - meaning Google thinks it's weak
* adding too many new pages or site wide links from other web sites could result in longer term sandbox time
* that the sandbox has some sort of dynamic activity going on within it - pages disappear or increase in ranking even if they are way down in the SERPs
* some report that they had a new site show a PR
* that Google is spidering (panic mode) for a completely new index coming in 2005
* that Google is assessing algorithm criteria against one another
* and one controversial comment that the sandbox doesn't apply to internal links.
I'd say the best speculation is that Google has lost confidence in the index quality and is developing/testing a new algorithm on a fresh set of data.
With multidomain ownership, link buying, swapping, automated page generation and javascript linking they have to do something to gain control of their product.
I'd like to hear from someone whose pages have come out of the sandbox and are now doing well.
and one controversial comment that the sandbox doesn't apply to internal links
How is my comment controversial? Nobody else even acknowledged that I said it. I can add new spam pages all day to an existing site and they get ranked real fast. I have had over a thousand spam pages get ranked in one week. I put the same pages on a new site and they may never get indexed.
[edited by: ogletree at 9:43 pm (utc) on Sep. 29, 2004]
Here's why the "sandbox" word isn't liked: see post #22
[webmasterworld.com...]
(supporters forum)