|Google refused to spider...FIXED!|
A resolution to a problem posted in two threads
For those interested in finding out how the soap opera was finally resolved, I thought I'd post my good news.
A thread was started a looong time ago:
"Google refuses to spider site. It has been more than a year!"
and I had a ton of wonderful advice and suggestions. Unfortunately, the problem persisted, and I posed the question again in the thread:
"Google thinks old server = new server!" in early December.
A brief background: Google would not spider a site previously in its index that had been moved to a new server (different IP, different DNS). It was a huge site from an educational institute with a good prior PR, and lots of links from good PR sites (such as the parent university site).
You can get a good idea of the problem by looking up the original posts, but I wanted to post the solution if it can help anyone else.
Someone left me sticky mail suggesting that the original server be resurrected (with its original DNS). I took their advice, and observed the traffic. Google began spidering it like crazy within minutes, even though the server had been gone for a year. It would seem that Google thought that the old and new servers were the same; the presence of the new server kept Google coming back, but it would continue to try to hit the old, and failing that, would never add pages to its index. I.e., it would look to see if the new site was there by hitting the root index page, and then try to access pages on the old server.
I want to thank all of the kind people who took the time to help me with this very frustrating problem. Without them - and this invaluable forum - I may never have gotten to the bottom of the issue.
Glad you found out the answer--I remember being stumped by this one. I'll mention this to a crawl person the next time I run into one, so we can check if there's anything we can do at our end to make that work better if anyone else is in that situation. Think you can hunt down links to the other two threads? In case I can point a crawl person here, it would help if they can see everything from one page.
|Mr Bo Jangles|
I see that at Googleplex they're known as 'crawl person' and 'crawl people' and not 'crawlers' - for obvious reasons *_*
wow, cool post Crow_Song!
Your method might have just answered my friend's question! I'll redirect him to this thread.
What would one do if it was impossible to re-create the old site, e.g. because of moving from the old ISP to a new one? I think this could be affecting me too, but I'm not able to do what you did [resurrect the old site]. I bet many sites end up in this situation over time.
Someone once suggested a page on the Google site that lets you request the forcible removal of indexed pages (you have to be the owner, of course). I can't find that link now, but you could possibly leverage it to remove your home page, restarting the entire process on the next crawl.
>> Think you can hunt down links to the other two threads?
Google refuses to spider site. It has been more than a year! - Google hits the index page and goes no further. [webmasterworld.com]
Google thinks old server = new server! Google is messed up... [webmasterworld.com]
I'll mention this to a crawl person the next time I run into one, so we can check if there's anything we can do at our end to make that work better if anyone else is in that situation.
Tell them to use DNS the way it was designed to be used. It has been quite obvious for some time that:
a/ the spiders rely too much on IP addresses stored in the indexes rather than hostnames
b/ the interpretation of ANAME and CNAME records is suspect
Yes, I harp on this every chance I get in the hope that someone is actually going to do something about it. Especially point b.
How do you put "a good robots page in place telling Google that the old server was dead"?
Do you put a Disallow rule in robots.txt?
Johnlim: that's exactly what I did. For a month or so after reviving the DNS name, I just observed the traffic on the box. Then I put redirects in place to the other (real) server. This didn't actually change the fact that Google was confused, so I put a robots file up that disallowed everything. It was almost exactly two months later that suddenly, one day, Google began spidering the real server like crazy. It hit the site about 90 times the first day, and 400 times the next! After that, we were golden. A couple of weeks later, pages began showing in the index.
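For reference, the "disallow everything" robots file described above is just two lines, served as /robots.txt from the old host's document root (nothing here is copied from the actual site):

```
User-agent: *
Disallow: /
```

The redirects that preceded it, on an Apache server of that era, could be a single mod_alias directive in the old host's config, e.g. `Redirect permanent / http://www.newserver.example/` (host name hypothetical).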
I've now moved a site to a new ISP (the IP has changed, and so has the DNS).
Do I need to put a Disallow rule in robots.txt on the old server at the old ISP?
Hi, I wonder if I have a similar problem.
I moved my (information) site from a personal hosting server to its own domain on a new server a year ago. I didn't have access to the old server's robots.txt file, so I used META redirects and links. When I realized that might have triggered a duplicate content penalty, I just removed the old site. Problem is, the old server does not properly serve 404s: it returns a generic "this page does not exist" page with a 200 status.
Google does spider the new site, but PR is 0 and no backlinks show (there are more than 200). Needless to say, the site is perfectly clean, no SEO tricks (I think long term and focus on quality content). It's been about a year now.
>> Problem is, the old server does not properly serve 404 pages, just a generic 200 redirect "this page does not exist".
For Google it's also possible to put a robots.txt at a lower level than the root.
Source: Remove Content from Google's Index [google.com]
|If you do not have access to the root level of your server, you may place a robots.txt file at the same level as the files you want to remove. Doing this and submitting via the automatic URL removal system will cause a temporary, 90 day removal of your site from the Google index. (Keeping the robots.txt file at the same level would require you to return to the URL removal system every 90 days to reissue the removal.) |
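So, for someone without root access, the quoted workaround would look something like this (the /~user/ path is a made-up example of a directory you do control):

```
# http://old-host.example/~user/robots.txt
# Same level as the pages to remove, since the server root is out of reach.
# Per the quote above, this must be re-submitted via the URL removal
# system every 90 days to keep the removal in effect.
User-agent: *
Disallow: /
```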
Yes, I did just that some months ago, but it did not seem to help. I still seem to have a penalty and I don't know why (no answer from Google to my mails). So my best guess is about this mess of changing server.
Thanks for the post; flagged it for future reference.
This whole place is full of great advice!
Wow, this is a symptom I have been puzzling over for a month or so. Fantastic post!
Problem is, for me the "old IP address" is out of my control, because it was bought from a domain company.
Is there any other way round this? I have tried contacting the domain company, to no avail so far.
Therefore I cannot close this back door. Or can I? Do we have to go to Google with this?
Can anyone take a guess why this doesn't happen consistently?
I just moved some sites to a different IP and DNS and I seem to have no problem with googlebot.
Great thread. It's one for the library.
Sounds like a problem with the DNS config, not G.
Ulkari: if the old server is still serving pages (even page not found errors) I think you may end up in the same boat I did. Google may still think the server exists, and continue to hit it.
For me, the only way I could figure out what was going on was to resurrect the old server and watch the logs. I couldn't believe it when - a year after the server had been taken offline - Google crawled all over it within minutes of putting it back up.
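If you can't resurrect the old box, a cheaper first check is to probe whatever is still answering on the old name: request a page that cannot possibly exist and look at the status code. A minimal sketch in Python (the function names and the probe scheme are my own, not from the thread):

```python
import urllib.error
import urllib.request
import uuid

def probe_missing_url(base_url):
    """Request a page that certainly does not exist on the host and
    return the HTTP status code the server answers with."""
    # A random path is essentially guaranteed not to exist.
    probe = base_url.rstrip("/") + "/" + uuid.uuid4().hex + ".html"
    try:
        with urllib.request.urlopen(probe) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        # urlopen raises on 4xx/5xx; the code is what we want.
        return e.code

def is_soft_404(status):
    """A healthy server answers 404 for a missing page. A 200 here
    means the server masks errors, so a crawler will keep treating
    the old host as alive and serving content."""
    return status == 200
```

If `probe_missing_url("http://old-host.example")` comes back 200, you are in the "generic 200 instead of 404" situation Ulkari describes, and Google has no signal that the old pages are gone.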
Part of the problem in our case may have been in the change of names: we changed from old.server.name.ca to server.name.ca. The old.server.name.ca was actually a node of the zone server.name.ca, and as far as Google was concerned, the node answered for the zone. The zone then disappeared and the new site took the name server.name.ca - but it was now a node and not a zone. Google must have thought that server.name.ca was still a zone, and continued to try to contact old.server, which used to answer for the zone.
I'm sure I just explained this horribly! In a nutshell, I agree with plumsauce about Google's interpretation of ANAME records.
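plumsauce's point about relying on hostnames can be checked directly: compare what the old and new names actually resolve to. A small Python sketch; the host names in the comment are the placeholders from the post above, not real ones:

```python
import socket

def resolve(host):
    """Return the sorted set of addresses a name resolves to, or an
    empty list if the name no longer resolves at all."""
    try:
        return sorted({info[4][0] for info in socket.getaddrinfo(host, 80)})
    except socket.gaierror:
        return []

# If old.server.name.ca and server.name.ca resolve to the same address
# (or the old name still resolves at all), a crawler that caches IPs
# could easily conflate the two hosts:
#   resolve("old.server.name.ca") vs resolve("server.name.ca")
```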
Thanks Crow_Song for sharing what you learned. I too believe I'm having a problem similar to yours, but unfortunately the old server is still serving dummy pages without an error code, and I have no control over it. I can't even look at its logs (it was a basic hosting service bundled with a dialup Internet offer).
In my case though, DNS has nothing to do with the problem, since the new server is in a different domain.
Google is not eternal, and since I have no short-term commercial pressure, I prefer to focus on building quality content for the users I get via links or other SEs, rather than trying too hard to understand and work around the beast's mistakes.
As far as I recall, Google had pages indexed from the old server. These ghost pages showed up in the SERPs as "headline = URL only", and clicking them gave a 404 error.
Now, this is a systematic error, and it should be looked into.
Google has had problems with "URLs that are in the index but can't be validated for some reason" - for a long time. In this thread [webmasterworld.com] Yidaki made me aware of previous threads on the subject:
1) Indexed AlltheWeb pages causing Google duplicates - Aug 14, 2003 [webmasterworld.com]
2) click.fastsearch.com shows instead of my url? - Oct 8, 2002 [webmasterworld.com]
This might not appear to be the exact same situation, but from a "spider viewpoint" it is the same - some URLs are indexed and the spider is not able to go back and validate them, as the linking page is not spiderable. What happens then is that these "Ghost URLs" remain in the index, and in some cases this somehow leads to de-indexing of the "real" sites (the ones being linked to) - aka "slow death".
In this thread from December 2003 (msg #37) [webmasterworld.com] I dubbed it a "302 Google bug" for lack of better words. As shown by Crow_Song in this thread, it's more general than just 302 redirect links.
So, Gbot must be told to ignore (ie. forget about) links and domains that it can not spider (or aren't allowed to spider). These types of data should be removed from the index. Erroneous links and domains should not be allowed to corrupt other data.
Crow_Song, it must have been a hard time for you, and I am glad you have shared all this info here; it may help many of us out of strange situations like these.
Many times, when I face a problem, I search webmasterworld and find discussions that not only upgrade my knowledge but also help me solve problems fast. Thanks for your post. I just bookmarked your thread.