
Google News Archive Forum

    
Google refused to spider...FIXED!
A resolution to a problem posted in two threads
Crow_Song

10+ Year Member



 
Msg#: 21480 posted 8:02 pm on Jan 27, 2004 (gmt 0)

For those interested in finding out how the soap opera was finally resolved, I thought I'd post my good news.

A thread was started a looong time ago:
"Google refuses to spider site. It has been more than a year!"
and I had a ton of wonderful advice and suggestions. Unfortunately, the problem persisted, and I posed the question again in the thread:
"Google thinks old server = new server!" in early December.

A brief background: Google would not spider a site previously in its index that had been moved to a new server (different IP, different DNS). It was a huge site from an educational institute with a good prior PR, and lots of links from good PR sites (such as the parent university site).

You can get a good idea of the problem by looking up the original posts, but I wanted to post the solution if it can help anyone else.

Someone left me sticky mail suggesting that the original server be resurrected (with its original DNS). I took their advice and observed the traffic. Google began spidering it like crazy within minutes, even though the server had been gone for a year. It would seem that Google thought the old and new servers were the same: the presence of the new server kept Google coming back, but it would keep trying to hit the old one, and failing that, would never add pages to its index. That is, it would check that the new site was there by hitting the root index page, and then try to access pages on the old server.

Once we had a good robots page in place telling Google that the old server was dead and never to return, it finally straightened out the mess. It didn't matter that the old server hadn't been there for a year, or that there were a thousand links pointing to the correct server. It wasn't the JavaScript in the code. It wasn't the ASP pages. Google had to hit the old server one last time to figure it out.
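(For the record, the robots file was nothing exotic - just a blanket disallow, something along these lines:

    User-agent: *
    Disallow: /

served as robots.txt from the root of the resurrected old server, so any crawler asking about the old hostname is told that nothing there may be fetched.)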

I want to thank all of the kind people who took the time to help me with this very frustrating problem. Without them - and this invaluable forum - I may never have gotten to the bottom of the issue.

 

GoogleGuy

WebmasterWorld Senior Member / WebmasterWorld Top Contributor of All Time / 10+ Year Member



 
Msg#: 21480 posted 4:54 am on Jan 28, 2004 (gmt 0)

Glad you found out the answer--I remember being stumped by this one. I'll mention this to a crawl person the next time I run into one, so we can check if there's anything we can do at our end to make that work better if anyone else is in that situation. Think you can hunt down links to the other two threads? In case I can point a crawl person here, it would help if they can see everything on one page.

Mr Bo Jangles

10+ Year Member



 
Msg#: 21480 posted 4:58 am on Jan 28, 2004 (gmt 0)

I see that at the Googleplex they're known as 'crawl person' and 'crawl people' and not 'crawlers' - for obvious reasons *_*

sidyadav

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 21480 posted 5:09 am on Jan 28, 2004 (gmt 0)

wow, cool post Crow_Song!

Your method might have just answered my friend's question! I'll redirect him to this thread.

Sid

hteeteepee

10+ Year Member



 
Msg#: 21480 posted 5:12 am on Jan 28, 2004 (gmt 0)

What would one do if it were impossible to re-create the old site, say because of moving from the old ISP to a new one? I think this could be affecting me as well, but I'm not able to do what you did [resurrect the old site]. I bet many sites end up in this situation over time.

Oaf357

10+ Year Member



 
Msg#: 21480 posted 8:47 am on Jan 28, 2004 (gmt 0)

Someone once suggested a page to me on the Google site where you could request that indexed pages be forcibly removed from the index (you had to be the owner, of course). I can't find that link now, but you could possibly leverage that to remove your home page, thus restarting the entire process on the next crawl.

takagi

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 21480 posted 9:20 am on Jan 28, 2004 (gmt 0)

>> Think you can hunt down links to the other two threads?

Google refuses to spider site. It has been more than a year! - Google hits the index page and goes no further. [webmasterworld.com]

Google thinks old server = new server! Google is messed up... [webmasterworld.com]

plumsauce

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 21480 posted 11:44 am on Jan 28, 2004 (gmt 0)


I'll mention this to a crawl person the next time I run into one, so we can check if there's anything we can do at our end to make that work better if anyone else is in that situation.

Tell them to use DNS the way it was designed to be used. It has been quite obvious for some time that:

a/ the spiders rely too much on IP addresses stored in the indexes rather than on hostnames

b/ the interpretation of ANAME and CNAME records is suspect.

Yes, I harp on this every chance I get in the hope that someone is actually going to do something about it. Especially point b.
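To make point a/ concrete: the crawler ought to re-resolve the hostname at fetch time instead of trusting an address it stored months earlier. A minimal sketch of that check in Python - the hostname and the "cached" address below are placeholders, not anything Google is known to do:

    import socket

    # Placeholder values: the host being crawled and the IP address
    # remembered from an earlier crawl.
    host = "www.example.com"
    cached_ip = "203.0.113.10"

    # Resolve the name fresh and compare it against the stored address.
    current_ips = set(info[4][0] for info in
                      socket.getaddrinfo(host, 80, proto=socket.IPPROTO_TCP))
    if cached_ip not in current_ips:
        print("%s has moved: cached %s, now %s"
              % (host, cached_ip, sorted(current_ips)))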

+++

johnlim

10+ Year Member



 
Msg#: 21480 posted 5:53 am on Jan 29, 2004 (gmt 0)

How to put "a good robots page in place telling Google that the old server was dead"?

put /disallow at robot.txt?

Crow_Song

10+ Year Member



 
Msg#: 21480 posted 4:25 pm on Jan 29, 2004 (gmt 0)

Johnlim: that's exactly what I did. For a month or so after reviving the DNS name, I just observed the traffic on the box. Then I put redirects in place to the other (real) server. This didn't actually change the fact that Google was confused, so I put a robots file up that disallowed everything. It was almost exactly two months later that suddenly, one day, Google began spidering the real server like crazy. It hit the site about 90 times the first day, and 400 times the next! After that, we were golden. A couple of weeks later, pages began showing in the index.
:)
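For anyone who wants to do the same kind of watching, a quick way is to count Googlebot requests per day straight out of the access log. A rough sketch in Python - the path and the Apache-style timestamp format are assumptions, so adjust both for your own server:

    import re
    from collections import Counter

    # Assumed location of a common/combined-format access log - adjust as needed.
    logfile = "/var/log/apache/access.log"

    hits_per_day = Counter()
    with open(logfile) as log:
        for line in log:
            if "Googlebot" in line:
                # Pull the date out of a timestamp like [28/Jan/2004:05:12:00 +0000]
                match = re.search(r"\[(\d{2}/\w{3}/\d{4})", line)
                if match:
                    hits_per_day[match.group(1)] += 1

    for day, count in sorted(hits_per_day.items()):
        print(day, count)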

johnlim

10+ Year Member



 
Msg#: 21480 posted 7:48 am on Jan 30, 2004 (gmt 0)

Now I'm moving a site to a new ISP (the IP is changed, and the DNS is also changed).
Do I then need to put a Disallow: / in robots.txt at the old IP (ISP)?

Ulkari

10+ Year Member



 
Msg#: 21480 posted 8:21 am on Jan 30, 2004 (gmt 0)

Hi, I wonder if I have a similar problem.

I moved my (information) site from a personal hosting server to its own domain on a new server a year ago. I didn't have access to the old server's robots.txt file, so I used META redirects and links. When I realized that might have incorrectly triggered a duplicate content penalty, I just removed the old site. Problem is, the old server does not properly serve 404 pages, just a generic 200 redirect "this page does not exist".

Google does spider the new site, but PR is 0 and no backlinks show (there are more than 200). Needless to say, the site is perfectly clean, no SEO tricks (I take the long-term view and focus on quality content). It's been about a year now.
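One way to confirm that kind of "soft 404" is to request a page that cannot possibly exist and look at the status code the old server actually sends back. A quick sketch - the URL below is a placeholder for a made-up path on the old host:

    import urllib.request
    import urllib.error

    # Placeholder: point this at a path on the old server that should not exist.
    url = "http://old-host.example/definitely-not-a-real-page"

    try:
        resp = urllib.request.urlopen(url)
        # A 200 here for a missing page means the server is masking errors,
        # so a crawler has no way to tell that the page is really gone.
        print("Status:", resp.status)
    except urllib.error.HTTPError as err:
        # A genuine 404 (or 410) lands here, which is what dead pages should return.
        print("Status:", err.code)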

takagi

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 21480 posted 9:34 am on Jan 30, 2004 (gmt 0)

>> Problem is, the old server does not properly serve 404 pages, just a generic 200 redirect "this page does not exist".

Hi Ulkari,

For Google it's also possible to put a 'robots.txt' at a lower level than the root.

If you do not have access to the root level of your server, you may place a robots.txt file at the same level as the files you want to remove. Doing this and submitting via the automatic URL removal system will cause a temporary, 90 day removal of your site from the Google index. (Keeping the robots.txt file at the same level would require you to return to the URL removal system every 90 days to reissue the removal.)
Source: Remove Content from Google's Index [google.com]
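In practice, for a site that lives under a personal directory on a shared host, that means putting the file alongside your own pages rather than at the server root - with hypothetical paths, something like:

    http://host.example/~user/robots.txt       <- same level as the pages to remove
    http://host.example/~user/index.html
    http://host.example/~user/oldpage.html

and then submitting through the URL removal system, repeating every 90 days as the quote says.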

Ulkari

10+ Year Member



 
Msg#: 21480 posted 11:11 am on Jan 30, 2004 (gmt 0)

Hi Takagi,

Yes, I did just that some months ago, but it did not seem to help. I still seem to have a penalty and I don't know why (no answer from Google to my emails). So my best guess is that it comes down to this mess of changing servers.

adfree

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 21480 posted 11:15 am on Jan 30, 2004 (gmt 0)

Thanks for the post - flagged it for future reference.
This whole place is full of great advice!
Jens

George

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 21480 posted 12:03 pm on Jan 30, 2004 (gmt 0)

Wow, this is a symptom I have been puzzling over for a month or so. Fantastic post!

Problem is, for me the "old IP address" is out of my control, because it was bought from a domain company.
Is there any other way round this? I have tried contacting the domain company, to no avail so far.

Therefore I cannot close this back door. Or can I? Do we have to go to Google with this?

MrSpeed

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 21480 posted 1:45 pm on Jan 30, 2004 (gmt 0)

Can anyone take a guess why this doesn't happen consistently?

I just moved some sites to a different IP and DNS and I seem to have no problem with googlebot.

Great thread. It's one for the library.

bignet

10+ Year Member



 
Msg#: 21480 posted 2:03 pm on Jan 30, 2004 (gmt 0)

problem with the DNS config, not G

Crow_Song

10+ Year Member



 
Msg#: 21480 posted 2:28 pm on Jan 30, 2004 (gmt 0)

Ulkari: if the old server is still serving pages (even page not found errors) I think you may end up in the same boat I did. Google may still think the server exists, and continue to hit it.

For me, the only way I could figure out what was going on was to resurrect the old server and watch the logs. I couldn't believe it when - a year after the server had been taken offline - Google crawled all over it within minutes of putting it back up.

Part of the problem in our case may have been in the change of names: we changed from old.server.name.ca to server.name.ca. The old.server.name.ca was actually a node of the zone server.name.ca, and as far as Google was concerned, the node answered for the zone. The zone then disappeared and the new site took the name server.name.ca - but it was now a node and not a zone. Google must have thought that server.name.ca was still a zone, and continued to try to contact old.server, which used to answer for the zone.

I'm sure I just explained this horribly! In a nutshell, I agree with plumsauce about Google's interpretation of ANAME records.
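Roughly, in record terms (these are illustrative entries with documentation addresses, not our real ones):

    ; before the move: server.name.ca was a delegated zone, and
    ; old.server.name.ca was the host inside it that answered for it
    old.server.name.ca.   IN  A   192.0.2.10

    ; after the move: the zone is gone, and server.name.ca is now just
    ; an ordinary host record pointing at the new box
    server.name.ca.       IN  A   198.51.100.20

Google seemed to carry on treating server.name.ca as though the old zone - and therefore old.server - were still behind it.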

Ulkari

10+ Year Member



 
Msg#: 21480 posted 7:43 pm on Jan 30, 2004 (gmt 0)

Thanks, Crow_Song, for sharing what you learned. I too believe I'm having a problem similar to yours, but unfortunately the old server is still serving dummy pages without an error code, and I have no control over it. I cannot even look at its logs (it was a basic hosting service bundled with a dialup Internet connection offer).

In my case though, DNS has nothing to do with the problem, since the new server is in a different domain.

Google is not eternal, and since I have no short-term commercial pressure, I prefer to focus on building quality content for the users I get via links or other SEs, rather than trying too hard to understand and work around the beast's mistakes.

claus

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 21480 posted 10:36 pm on Jan 30, 2004 (gmt 0)

As far as I recall, Google had pages indexed from the old server. These ghost pages showed up in the SERPs as "headline = URL only", and clicking them gave a 404 error.

Now, this is a systematic error, and it should be looked into.

Google has had problems with "URLs that are in the index but can't be validated for some reason" - for a long time. In this thread [webmasterworld.com] Yidaki made me aware of previous threads on the subject:

1) Indexed AlltheWeb pages causing Google duplicates - Aug 14, 2003 [webmasterworld.com]

2) click.fastsearch.com shows instead of my url? - Oct 8, 2002 [webmasterworld.com]

This might not appear to be the exact same situation, but from a "spider viewpoint" it is the same - some URLs are indexed and the spider is not able to go back and validate them, as the linking page is not spiderable. What happens then is that these "ghost URLs" remain in the index, and in some cases this somehow leads to de-indexing of the "real" sites (the ones being linked to) - a.k.a. "slow death".

In this thread from December 2003 (msg #37) [webmasterworld.com] I dubbed it a "302 Google bug" for lack of better words. As shown by Crow_Song in this thread, it's more general than just 302 redirect links.

So, Gbot must be told to ignore (i.e. forget about) links and domains that it cannot spider (or isn't allowed to spider). These types of data should be removed from the index. Erroneous links and domains should not be allowed to corrupt other data.

nakulgoyal

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 21480 posted 11:20 pm on Jan 30, 2004 (gmt 0)

Crow_Song, it must have been a hard time for you, and I am glad you have shared all this info here - it may help many of us out of strange situations like these.

Many times, when I face a problem I search WebmasterWorld and find some nice discussions, which not only upgrades my knowledge but also lets me solve problems fast. Thanks for your post. I just bookmarked your thread.
