Forum Moderators: Robert Charlton & goodroi

Is Google using old link data to rank sites?

         

kneukm03

7:50 pm on May 21, 2006 (gmt 0)

10+ Year Member



I'd like to start off by saying I didn't originate this theory; I think I read it from ClintFC or someone else as a tangent in one of the datacenter watch threads. But people have started posting panicky threads about what to do in response to the recent dropping of pages, along with some semi-smug "you just need to stop spamming" advice, so I think this theory deserves a more detailed look, because I think it explains what happened last week as well.

I started building sites in September, so all my personal observations are based on sites that are still in the sandbox to some degree - however, two of them came out of the sandbox on a set of datacenters until last week, and appear to be back in now. I did get to observe how Google was ranking on those sites, however. I also should note that every site I have could be classified as either participating in a link scheme, excessive reciprocal linking, bad outlinks, or none of those three but not many links to the site (the reasons cited by Matt Cutts).

So why don't I think that's a satisfactory explanation of why I, like lots of other people, lost a ton of pages this week? Because what seems to fit me better is the theory that Google is using old backlink data from roughly late October (the Jagger update).

On the datacenters where I was able to see my sites ranking, they appeared to rank exactly as I would expect based on a combination of on-page factors and the links I had in late October. Several individual pages ranked highly (top 10) for terms where the pages had deep links as of the beginning of Jagger. Other pages that targeted very similar terms, but were added later, could rank in the 30s, 50s, or 70s, but not in the top 10, even if they had the exact same kind of link added later on. I had two sites clearly de-sandboxed; one was started two weeks earlier than the other, so it had more links. Its new pages ranked obviously higher than pages on the second site, which had fewer links in October (the earlier site generally got pages in the 50s, the later site generally in the 70s, although it varied depending on the competitiveness of the terms). However, both sites ranked well for terms where they had links to a specific page prior to Jagger.

What does this have to do with Google dropping a bunch of pages? If Google was RANKING sites based on Jagger data, it is likely also CRAWLING sites based on that data. We know Google was testing the BigDaddy crawler in November or so, adding pages to the regular index. We also know that the basis for the BigDaddy index was data from the old crawler, and that new data was sort of grafted on. My working theory is that Google is currently recrawling its entire index from scratch using the new infrastructure. I think the "dropped pages" we saw last week were actually the replacement of information crawled using the old, amalgamated index with new data crawled since they got BigDaddy online.

This has hit tons of site owners, ranging from white hat to black hat to in between. It's possible that, as Matt Cutts said, a lot of this is because of different crawl priorities due to spam detection. However, if the "Jagger backlinks" theory is right, then much of it is also due to sites not being crawled according to the number of links they actually currently have. My sites certainly seem to correspond to this as well - sites that had links prior to Jagger lost pages, but still have some (generally stuff that is linked from the main page or that had links prior to Jagger). One site that may or may not be sandboxed was a directory, had links before Jagger, had the spammiest possible outbound links (because I accepted anything into it without checking), uses a link scheme, and uses reciprocals, but it's got the most pages of any of my sites currently indexed. Why? I think because it got a couple of good links just prior to Jagger. If my sites were crawled according to Jagger links, it would be the one that got the most activity - and it does.

The test of any theory is whether it can successfully predict outcomes, however, so I predict the following:

1) If, as I believe, Google is recrawling its entire index, then what we saw last week was the first wave of new information. Google will at some point update its backlink data and start to rank sites based on the May 16th or so crawl data, and it will also start to crawl sites according to that data. Webmasters will see a sudden change in rankings when this happens, and afterwards many of the sites that have lost pages this week will regain them (as they are crawled under priorities that more closely reflect their actual links). This may take an iteration or two to get current.

2) People who have lost lots of pages this week and who do not think their sites are completely spammy would currently have the crawl depth they could expect based on their links in early October (the period the Jagger data was based on). People who currently have only a couple of pages or just a homepage probably had few or no links then. People who lost lots of pages, but still have lots of pages, probably had new, better links added since October that increased their crawl priority, and the current crawl does not reflect this.

I apologize for this being long and rambling, but I'm hoping others can jump in with observations based on their sites, namely whether they think their sites fit this theory or not.

1984bb

9:16 pm on May 21, 2006 (gmt 0)

10+ Year Member



kneukm03, I believe the same theory, as I've posted elsewhere. Many fellow webmasters believe that PR has nothing to do with, or does not play a significant role in, the SERP position of a page. Even so, I expect massive changes in the current SERPs due to many factors, but mostly because the new IBL count will be lower for some and higher for others now that spam pages and directories have been dropped (probably forever).
Blog spamming comes to an end for Google; see "nofollow" [googleblog.blogspot.com...]
Pages with rich content will, I believe, get a high PR from just a few quality links and not from tons of spam link exchanges. All of this and more will make things interesting after the new BL/PR update, which will probably take some time (maybe late July?), until, as you mention, Google finishes the recrawl.