|Google Indexed 100k Pages with a Test IP instead of Domain Name|
We have a 10 year old ecommerce site that was well ranked in Google with some 130k plus pages returned from the site: command.
About three months ago we noticed that Google was indexing our pages by a test IP address, not by domain. The only way that Google could possibly have got the IP address was through the Google Toolbar, or by Google crawling random IP addresses.
We implemented a 301 redirect, and after a month the IP-indexed pages went down to around 85k, but the domain-indexed URLs went down to around 30k.
We decided to make a reinclusion request, explaining the problem in some detail. A week later we had exactly 5k pages indexed by domain from the site: command and lost around 10k pages a day on the ip indexed pages. [It seems a reinclusion request was probably not the best thing to have done at this stage]
After more than three months of this we see 5.5k IP-indexed pages and a very slow increase in domain-indexed pages to around 7k. It seems to be increasing, fingers crossed, but only by 500 pages every now and then.
No site navigation was changed prior to or during this time, only transient product content, which probably accounts for 10k pages added and subtracted. Webmaster Tools currently shows we have 90k pages in the index, but they are not returned by the site: command.
Have we done something wrong? Has something bad happened? Can we do anything to speed up the recovery if indeed there is going to be a recovery?
It is possible that your IP indexing problem merely coincided with the Mayday Algorithm Update [webmasterworld.com], and that the drop in your indexed page count is not the result of the canonicalization problems or the reinclusion request.
> Have we done something wrong?
The take-home message here is that if it is at all possible for a search engine to index a site by IP address or by non-canonical hostname, then it may indeed happen. I hesitate to say "it will happen," but when developing sites, that is exactly how I think of it -- The very first code that goes onto my new server is hostname canonicalization code, whether that server is for development, test, or production.
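As a rough illustration of what "hostname canonicalization code" might look like, here is a minimal Python sketch. It assumes a single canonical hostname (the placeholder www.example.com) and a hypothetical helper, canonical_redirect, that your server framework would call with the incoming Host header; neither name comes from the original poster's setup.

```python
# Assumption: placeholder canonical hostname, not the poster's real domain.
CANONICAL_HOST = "www.example.com"


def canonical_redirect(host_header, path_and_query, scheme="http"):
    """Return (301, location) for a non-canonical host, or None if canonical.

    host_header is the raw Host header; path_and_query is the requested
    path including any query string, e.g. "/product?id=1".
    """
    # Normalize: drop any port, surrounding whitespace, case, and trailing dot.
    host = (host_header or "").split(":")[0].strip().lower().rstrip(".")
    if host == CANONICAL_HOST:
        return None  # Already canonical: serve the page normally.
    # Anything else (bare IP, non-www variant, stray hostname) is sent
    # permanently to the same path on the canonical hostname.
    return 301, f"{scheme}://{CANONICAL_HOST}{path_and_query}"
```

A request arriving with a bare IP in the Host header, such as 203.0.113.10, would get a 301 to the same path on www.example.com, while requests that already use the canonical hostname pass through untouched.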
Development and test servers should always block spider access using one or more of the following: 301-redirection of all requests from non-tester IP address ranges back to the canonical hostname; login-required access; or inclusion of the <meta name="robots" content="noindex,nofollow"> tag on every page together with exclusion of non-page objects in robots.txt.
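One way to combine two of those measures on a test server is sketched below in Python. The tester network range, the hostname, and the staging_policy function are all hypothetical placeholders for illustration; the idea is simply that testers get the page with a noindex tag, and everyone else (spiders included) is 301-redirected to production.

```python
import ipaddress

# Assumptions: placeholder production hostname and tester address range.
CANONICAL_HOST = "www.example.com"
TESTER_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]
NOINDEX_META = '<meta name="robots" content="noindex,nofollow">'


def staging_policy(client_ip, path_and_query):
    """Decide how a test server should answer a request.

    Returns (status, location, meta_tag):
      - testers get 200 plus the noindex meta tag to embed in the page;
      - all other clients get a 301 to the canonical hostname.
    """
    addr = ipaddress.ip_address(client_ip)
    if any(addr in net for net in TESTER_NETWORKS):
        return 200, None, NOINDEX_META
    return 301, f"http://{CANONICAL_HOST}{path_and_query}", None
```

With these placeholder values, a tester at 192.0.2.15 sees the page with the noindex tag included, while a request from any other address is redirected to the same path on the production hostname.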
The problem in this case is that the only way to be sure that Googlebot sees your 301 redirects in a timely manner is to provide at least one link to each of the most-important 'bad' URLs... and I don't know that I'd recommend doing that from the production site itself. If I were you, I certainly wouldn't rush into doing that -- let's see if others here have had more experience with this first...
Not much consolation or help for now, but for next time, remember Murphy's law and Poor Richard's advice that an ounce of prevention is worth a pound of cure.
|We decided to make a reinclusion request, explaining the problem in some detail. A week later we had exactly 5k pages indexed by domain from the site: command and lost around 10k pages a day on the ip indexed pages. [It seems a reinclusion request was probably not the best thing to have done at this stage] |
It seems to me that a reconsideration request WAS the right thing - and I'd even suggest trying another request as soon as you are sure that everything on the site is technically sound. To begin with, I assume you have verified that the IP address redirect is working properly, and returning a true 301 status in the HTTP header that the server sends.
The situation you are describing sounds like there may be some technical challenge for Google, and that is making Googlebot's re-crawl very slow. So I highly suggest going over your Webmaster Tools account with a very detailed eye to any crawl problems.
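If you want to check the redirect yourself rather than rely on a browser (which silently follows redirects), something like the following Python sketch could help. It uses only the standard library; fetch_status and is_true_301 are hypothetical helper names, and the IP address and hostname in the usage note are placeholders.

```python
import http.client
from urllib.parse import urlsplit


def fetch_status(host, path="/", timeout=10):
    """Send one HEAD request to host and return (status, Location header).

    http.client does not follow redirects, so this reports the raw
    status code the server sends for the first request.
    """
    conn = http.client.HTTPConnection(host, timeout=timeout)
    try:
        conn.request("HEAD", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def is_true_301(status, location, canonical_host):
    """True only for a 301 whose Location points at the canonical hostname."""
    if status != 301 or not location:
        return False
    return urlsplit(location).hostname == canonical_host
```

For example, calling fetch_status("203.0.113.10", "/some-page") and passing the result to is_true_301 with "www.example.com" as the canonical host would flag a 302, a 200, or a 301 that points somewhere other than the canonical hostname. It is worth testing several deep URLs, not just the home page.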
In addition, try crawling your own site for yourself with a tool like Xenu and see how it goes for you. You might turn up some interesting troubles.
A ten-year-old e-commerce site that previously ranked well and was indexed properly should not take many months to recover from this kind of thing. At the pace you are describing, a full recovery would take years, so go into hyperdrive, nail down every technical issue you can, and then go back to Google. Something is still wrong, either on your end or theirs.