I started a thread a long time ago:
"Google refuses to spider site. It has been more than a year!"
and I had a ton of wonderful advice and suggestions. Unfortunately, the problem persisted, and I posed the question again in the thread:
"Google thinks old server = new server!" in early December.
A brief background: Google would not spider a site previously in its index that had been moved to a new server (different IP, different DNS). It was a huge site from an educational institute with a good prior PR, and lots of links from good PR sites (such as the parent university site).
You can get a good idea of the problem by looking up the original posts, but I wanted to post the solution if it can help anyone else.
Someone left me sticky mail suggesting that the original server be resurrected (with its original DNS). I took their advice and observed the traffic. Google began spidering it like crazy within minutes, even though the server had been gone for a year. It would seem that Google thought the old and new servers were the same: the presence of the new server kept Google coming back, but it would keep trying to hit the old one and, failing that, would never add pages to its index. That is, it would check that the new site was there by hitting the root index page, and then try to access pages on the old server.
Once we had a good robots page in place telling Google that the old server was dead and never to return, it finally straightened out the mess. It didn't matter that the old server hadn't been there for a year, or that there were a thousand links pointing to the correct server. It wasn't the JavaScript in the code. It wasn't the ASP pages. Google had to hit the old server one last time to figure it out.
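For anyone trying the same fix, here is a minimal sketch of what such a robots page might look like, assuming it was a plain robots.txt that disallows everything (the actual file is not shown in this thread). The helper name `is_blocked` and the example URL are made up for illustration; the check uses Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Assumed contents of robots.txt on the retired host:
# every crawler is told to stay away from every path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text forbids user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URL on the old host, used only to exercise the rule.
    print(is_blocked(ROBOTS_TXT, "Googlebot", "http://old.server.name.ca/index.asp"))
```

This only confirms that the file says what you intend; whether and when a crawler acts on it is up to the crawler.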
I want to thank all of the kind people who took the time to help me with this very frustrating problem. Without them - and this invaluable forum - I may never have gotten to the bottom of the issue.
Google refuses to spider site. It has been more than a year! - Google hits the index page and goes no further. [webmasterworld.com]
Google thinks old server = new server! Google is messed up... [webmasterworld.com]
I'll mention this to a crawl person the next time I run into one, so we can check if there's anything we can do at our end to make that work better if anyone else is in that situation.
Tell them to use DNS the way it was designed to be used. It has been quite obvious for some time that:
a/ the spiders rely too much on IP addresses stored in the indexes rather than on hostnames
b/ the interpretation of A and CNAME records is suspect.
Yes, I harp on this every chance I get in the hope that someone is actually going to do something about it. Especially point b.
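To illustrate point a/, here is a rough sketch of the check being asked for: look the hostname up again instead of trusting an address saved earlier. Nothing here reflects how any real spider is built; the function names and the idea of a single "stored IP" are assumptions made for the example.

```python
import socket
from typing import Callable, Set


def current_addresses(hostname: str) -> Set[str]:
    """Ask the system resolver for the IPv4 addresses the hostname maps to right now."""
    return set(socket.gethostbyname_ex(hostname)[2])


def ip_is_stale(hostname: str, stored_ip: str,
                resolve: Callable[[str], Set[str]] = current_addresses) -> bool:
    """Return True when the stored IP is no longer among the hostname's current addresses."""
    return stored_ip not in resolve(hostname)
```

A crawler that ran a check like this before each visit would follow the hostname to its new server rather than keep knocking on the old address. The `resolve` parameter is there only so the logic can be tried without a live DNS lookup.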
+++
I moved my (information) site from a personal hosting server to its own domain on a new server a year ago. I didn't have access to the old server's robots.txt file, so I used META redirects and links. When I understood that this might have incorrectly triggered a duplicate content penalty, I just removed the old site. Problem is, the old server does not properly serve 404 pages; it redirects to a generic "this page does not exist" page that returns status 200.
Google does spider the new site, but PR is 0 and no backlinks show (there are more than 200). Needless to say, the site is perfectly clean, no SEO tricks (I think long term and focus on quality content). It's been about a year now.
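The 404 problem described above (a missing page answered with status 200) is often called a "soft 404". A minimal sketch of how one might spot it, assuming you already have the status code and body of a response for a URL that should no longer exist; the function name and the marker phrases are made up for this example:

```python
from typing import Iterable


def looks_like_soft_404(status: int, body: str,
                        markers: Iterable[str] = ("this page does not exist",
                                                  "not found")) -> bool:
    """Return True when a response says 'missing' in its text but reports status 200."""
    if status != 200:
        # A real 404 or 410 tells the spider the truth, so it is not a soft 404.
        return False
    text = body.lower()
    return any(marker in text for marker in markers)
```

If this returns True for the removed pages, a spider has no reliable way of telling that the old site is gone, since every old URL still appears to answer successfully.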
Hi Ulkari,
For Google it's also possible to put a 'robots.txt' at a lower level than the root.
If you do not have access to the root level of your server, you may place a robots.txt file at the same level as the files you want to remove. Doing this and submitting via the automatic URL removal system will cause a temporary, 90 day removal of your site from the Google index. (Keeping the robots.txt file at the same level would require you to return to the URL removal system every 90 days to reissue the removal.)
Source: Remove Content from Google's Index [google.com]
Problem is, for me the "old IP address" is out of my control, because it was bought from a domain company.
Is there any other way round this? I have tried contacting the domain company, to no avail so far.
Therefore I cannot close this back door. Or can I? Do we have to go to google with this?
For me, the only way I could figure out what was going on was to resurrect the old server and watch the logs. I couldn't believe it when - a year after the server had been taken offline - Google crawled all over it within minutes of putting it back up.
Part of the problem in our case may have been in the change of names: we changed from old.server.name.ca to server.name.ca. The old.server.name.ca was actually a node of the zone server.name.ca, and as far as Google was concerned, the node answered for the zone. The zone then disappeared and the new site took the name server.name.ca - but it was now a node and not a zone. Google must have thought that server.name.ca was still a zone, and continued to try to contact old.server, which used to answer for the zone.
I'm sure I just explained this horribly! In a nutshell, I agree with plumsauce about Google's interpretation of A records.
In my case though, DNS has nothing to do with the problem, since the new server is in a different domain.
Google is not eternal, and since I have no short-term commercial pressure, I prefer to focus on building quality content for the users I get via links or other SEs, rather than trying too hard to understand and work around the beast's mistakes.
Now, this is a systematic error, and it should be looked into.
Google has had problems with "URLs that are in the index but can't be validated for some reason" for a long time. In this thread [webmasterworld.com] Yidaki made me aware of previous threads on the subject:
1) Indexed AlltheWeb pages causing Google duplicates - Aug 14, 2003 [webmasterworld.com]
2) click.fastsearch.com shows instead of my url? - Oct 8, 2002 [webmasterworld.com]
This might not appear to be the exact same situation, but from a "spider viewpoint" it is the same - some URLs are indexed and the spider is not able to go back and validate them, as the linking page is not spiderable. What happens then, is that these "Ghost URLs" remain in the index, and in some cases this leads to de-indexing of the "real" sites (the ones being linked to) somehow - aka. "slow death".
In this thread from December 2003 (msg #37) [webmasterworld.com] I dubbed it a "302 Google bug" for lack of better words. As shown by Crow_Song in this thread, it's more general than just 302 redirect links.
So, Gbot must be told to ignore (i.e. forget about) links and domains that it cannot spider (or isn't allowed to spider). These types of data should be removed from the index. Erroneous links and domains should not be allowed to corrupt other data.
Many times, when I face a problem, I search at webmasterworld and find some nice discussions which not only improve my knowledge but also let me solve problems fast. Thanks for your post. I just bookmarked your thread.