

Best way to tell Googlebot a page doesn't exist anymore

     
11:45 am on Feb 1, 2011 (gmt 0)

5+ Year Member



What do you think is the best way to tell Googlebot that a page does not exist anymore? Simply delete it and let it 404 until Googlebot gets bored and stops trying, or return a 410 code every time it tries to download it?

Thanks for your opinion
12:55 pm on Feb 1, 2011 (gmt 0)

WebmasterWorld Administrator goodroi is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



It depends on the situation. If you don't care, have no link juice to lose, and aren't worried about your crawl budget, then you can 404 it. This also assumes you don't care about possibly flooding your 404 log file with new entries. If you are in that situation where you have nothing to lose, I doubt you would be visiting WebmasterWorld.

When a page of mine no longer exists, I would:
1) make sure all of my internal links pointing to it are changed to another URL or deleted.

2) contact all external sites linking to it and ask them to change the link to another one of my URLs

3) add a 301 redirect to take care of any external links that couldn't be updated (see the sketch after this list)

4) make sure the URL is not blocked by robots.txt or a noindex tag, so Google sees the 301 redirect

5) sit back and wait for Googlebot to crawl the page and notice that it went bye-bye.
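
For step 3, assuming an Apache server where .htaccess overrides are allowed, a minimal sketch would look something like this (the file names are made-up examples, not anything specific to this thread):

# 301 the removed page to its closest replacement
Redirect 301 /old-page.html /new-page.html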
1:03 pm on Feb 1, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Very few sites use the 410 response code. If I recall correctly, Google has said that they don't treat it any differently from a 404.
1:19 pm on Feb 1, 2011 (gmt 0)

WebmasterWorld Senior Member pageoneresults is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Very few sites use the 410 response code. If I recall correctly, Google has said that they don't treat it any differently from a 404.


I might have to disagree with that. Google appears to handle 410 Gone exactly as it says on the tin. I've implemented it recently and have seen pages "Gone" within 24-48 hours.

I'm sure there are others who will chime in with similar findings. If a document does not exist anymore and there is no viable replacement for a 301, then 410 Gone is the suggested server response. A 404 is too vague and Googlebot will continue to request the document forever as long as there are external links to it.

This is where finite error reporting comes into play. You should have few, if any, 404s. At some point, you'll capture those repetitive 404s and redirect them to an appropriate document. If no replacement exists, drop a 410 Gone in there.

I'd like to point out that a 410 Gone is probably your last resort. You'll want to preserve whatever equity may have been associated with the document that no longer exists, especially if there are inbound links that you have little to no control over. You'll of course 301 in this instance to the most appropriate document.

Note: I was surprised when I did the 410 implementation a little while back. Within 24-48 hours Google removed those pages from its index; it acted just like a URL Removal request without all the paperwork. :)
2:58 pm on Feb 1, 2011 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Yes - Google used to handle 404 and 410 the same way, but they changed that last year. There was even a public mention of the change from a Googler, I think John Mueller. I'll see if I can find the link.
3:13 pm on Feb 1, 2011 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



This wasn't the first mention of the change, but here, on 2010-04-15 JohnMu does confirm different handling:

If you are certain that the URLs will no longer have content on them, you could also use a 410 HTTP result code, to signal that they are "gone" forever. We may still crawl the URLs (especially when we find new links), but we generally see a 410 HTTP result code as being more permanent than a 404 HTTP result code (which can be transient by definition).

[google.com...]
3:50 pm on Feb 1, 2011 (gmt 0)

5+ Year Member



If I cannot control the server, how can I implement a 410?
Is there an .htaccess / rewrite statement?
4:54 pm on Feb 1, 2011 (gmt 0)

10+ Year Member



.htaccess for 410s:

# 410 Gone - permanently removed
Redirect gone /filename.ext
5:37 pm on Feb 1, 2011 (gmt 0)

5+ Year Member



A 301 redirect, yes, but to where? The page simply does not exist anymore.

In fact, this is about a section of a website that was recently removed. It had no link juice and was simply attracting all kinds of spam, so we removed it.

I guess 410 Gone is the more appropriate thing to do. However, I was wondering if any of you are aware of any negative impacts of using a 410 Gone on a website (from Googlebot's point of view, that is...)
5:55 pm on Feb 1, 2011 (gmt 0)

WebmasterWorld Senior Member themadscientist is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month



410 Gone

# The Mod_Rewrite Version
RewriteEngine on
RewriteRule ^thepage\.ext$ - [G]
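
Since the question above was about a whole section that was removed, a pattern can cover everything under it. A hedged sketch, assuming Apache mod_rewrite in an .htaccess file and a made-up directory name:

# Return 410 Gone for every URL under the removed section
RewriteEngine on
RewriteRule ^old-section/ - [G]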
6:40 pm on Feb 1, 2011 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



any negative impacts on using a 410 gone on a website

Only if you want to re-use the URL. Make sure it stays "gone".
7:59 pm on Feb 2, 2011 (gmt 0)

WebmasterWorld Administrator rogerd is a WebmasterWorld Top Contributor of All Time 10+ Year Member



OK, here's a variation on the original question. I'm working with a site that, due to a malfunction, had wrong URLs in place long enough that they got spidered. The malfunction was corrected, the correct URLs were put back in place, and the wrong ones were 301ed to the correct locations.

Oddly, though, Google is spidering the new pages but still has the old pages in its index. At one point, I even tried putting some prominent links to a portion of the old URLs, thinking that Google would follow the link, discover the 301, and drop the old URL. Hasn't happened, though, despite otherwise aggressive spidering. The old pages live on like zombies in Google's index.

Thoughts?
8:07 pm on Feb 2, 2011 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



I've got a similar situation right now, Roger, with a new site launch that went technically wrong for more than a week. In the past, after a number of weeks, I'd only see the wrong URLs with a site: operator and never in an ordinary SERP. I'll see if that still holds, now that I'm deep into a parallel situation.

The most critical thing is to remove ALL occurrences of the wrong URLs in the site. I would definitely not intentionally link to a URL that will redirect. That only compounds the chaos.
8:23 pm on Feb 2, 2011 (gmt 0)

WebmasterWorld Senior Member pageoneresults is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Google is spidering the new pages but still has the old pages in its index.


That's one of those times where I might say noarchive comes into play. There's something with that cache and redirects that noarchive "appears" to address. It's just a hunch. ;)
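
For reference, noarchive here means the standard robots meta tag placed in a page's head section (a generic example, not taken from any site in this thread):

<meta name="robots" content="noarchive">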
10:02 pm on Feb 2, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If I cannot control the server, how can I implement a 410?
Is there an .htaccess / rewrite statement?


You can also do it with PHP and other server-side scripting languages, which is very useful for dynamic content.

[php.net...]
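
A minimal sketch of the PHP approach (assuming the status is sent before any other output):

<?php
// Send a 410 Gone status instead of serving the old content
header("HTTP/1.1 410 Gone");
exit;
?>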
11:02 pm on Feb 2, 2011 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



That's one of those times where I might say noarchive comes into play.

I don't follow, p1r. If there's already a 301 for that URL, then Google would never see the noarchive. Or did you mean having noarchive there from the start?
2:40 am on Feb 3, 2011 (gmt 0)

WebmasterWorld Senior Member pageoneresults is a WebmasterWorld Top Contributor of All Time 10+ Year Member



If there's already a 301 for that URL, then Google would never see the noarchive.


Understood. It's just a hunch, tedster. Since implementing noarchive years ago across all sites that I manage, many of the issues discussed in the fora don't affect us, from scraping to all sorts of other things that people discuss about the cache.

Something is wrong somewhere in the process for Google to hold onto old URIs when there is a 301 in place. Apparently it is not seeing the 301? Or is not seeing it properly? I dunno. I just "think" that noarchive sends a different set of signals to Googlebot and causes things to happen faster and more efficiently. I may be totally off my SEO rocker too. ;)
2:57 am on Feb 3, 2011 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



If you can ever pull together a pile of data on that I'd be very interested.
4:34 am on Feb 3, 2011 (gmt 0)

WebmasterWorld Administrator 5+ Year Member Top Contributors Of The Month



I am quite interested in this area, so I have some questions.

@rogerd, can you see the old pages being spidered? I.e., has the 301 been discovered and the page still not dropped from the index, or is it a question of Google crawling the old pages, finding the 301, but still holding onto the page in its index? Also, if they are both in the index, which one is ranking more prominently, the new page or the old 301-ed page (or better, can you see the 301-ed page only with the site: command)?

@tedster, the same question with regard to spidering: from what you wrote above, I am presuming you are seeing both new and old URLs in regular SERPs? Are old 301-ed pages ranking better than the new ones, or is it a mix?

I am also wondering whether, with a brand new site, Google is even slower to drop 301-ed URLs - almost like "This is a new site, I am not sure if this is what you really want or whether you are still messing around with URLs..."

I have noticed that in the last 6 months or so (perhaps something to do with Caffeine going live), the whole process of dropping redirected URLs takes longer than before. It is almost as if "we have the capacity, so we can now hold onto the stuff in the index longer."

I could be wrong, but this is my observation.
5:22 am on Feb 3, 2011 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



@aakk9999 - no, I don't usually see the 301'd URLs in the regular keyword rankings, although sometimes they do seem to get stuck in the site: operator results for a while. So I was hoping to clarify what rogerd meant when he described "still has the old pages in its index".
9:34 am on Feb 3, 2011 (gmt 0)

5+ Year Member



The 301 works only to a certain point. I recently messed around with putting up a mobile version of my website. Because of my incompetence, Googlebot (and not Googlebot Mobile) got to spidering the mobile content, which resulted in my whole site getting duplicate content. This took less than 48 hours!

I pulled out the mobile content and 301ed all the pages to their desktop versions. After 24 hours, around 80% of the duplicate content was out of the index. This happened around 3 months ago, and there are still around 10% of those pages in the index.
1:07 pm on Feb 3, 2011 (gmt 0)

WebmasterWorld Senior Member pageoneresults is a WebmasterWorld Top Contributor of All Time 10+ Year Member



If you can ever pull together a pile of data on that I'd be very interested.


Both you and rogerd have the opportunity to put it to the test. :)
1:20 pm on Feb 3, 2011 (gmt 0)

WebmasterWorld Administrator rogerd is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Ted, the pages in question show up using the site: operator but not in any of the keyword searches I tried. The issue is complicated by the fact that during the period of site malfunction that created the bad URLs, there was also a spam link injection hack. Hence, these nonexistent pages have unrelated spammy links, which could also explain why they don't show up for keyword searches. That's also why I'm anxious to get them out of the index, even though they likely get no search traffic.

A little more investigation on one content page shows that neither the new version nor the old, bogus version of the page is showing up even in exact phrase searches. Multiple pages linking to the correct URL are in the index, so I'd expect the new link to be spidered readily.

Most of the site is indexed correctly and is ranking for relevant keywords. I have to think that if Googlebot would just visit the URL in its index and find the 301 to the new URL, it would be fixed.
1:31 pm on Feb 3, 2011 (gmt 0)

WebmasterWorld Administrator rogerd is a WebmasterWorld Top Contributor of All Time 10+ Year Member



I'm a believer in hunches based on experience, p1r. One experiment found that people playing a game with two decks of cards, one riskier and less profitable than the other, responded subconsciously (measured with biometrics) before they could consciously identify one deck as being worse than the other.

You are likely more in tune with your subconscious than the rest of us!
5:56 pm on Feb 3, 2011 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



One of the challenges is that Google is (and needs to be) complicated about how it deals with 301 redirects. They used to be a spammer haven, so all kinds of trust checking needs to occur.

Also technical errors are extremely common, so they can't just abandon a URL once they've verified that a 301 is in place - the website may easily change their mind. In other words, a "permanent" redirect is not truly permanent in the practical world of today's web.

So Google does have a challenge with 301s - but as long as the legacy URL is only showing in site: operator results, it is not the version that is ranking and it isn't sending actual search traffic, so it's not much of a practical problem.
11:13 pm on Feb 3, 2011 (gmt 0)

WebmasterWorld Administrator 5+ Year Member Top Contributors Of The Month



@rogerd, I have not had such an experience, but just a thought... you had spam links injected to the wrong URLs, then you redirected these wrong URLs to the correct ones. So perhaps the correct URLs would now "inherit" these spammy links via the 301 redirect? Could this be the reason why (if I understood correctly) the "good" URLs are not ranking either?

Normally a 301 is the solution for unwanted URLs that leaked out as a result of a technical error, but perhaps because of the spammy links it is not the best solution here? Mind you, I do not know if these wrong URLs gained other "good" links whilst they were exposed.

As tedster said above, legacy URLs that are 301-ed are not just abandoned after a while - which is what I have noticed even more so in the last 6-8 months.

E.g. I have a case where a large number of URLs were redirected 2 years ago. The redirection went really well and the old URLs were dropped from the index completely (no reports in the site: operator, no reference to them anywhere in WMT). But then another technical mistake was made in September and a small subset of the previously redirected URLs "lost" the redirect for a couple of weeks - even though they were NOT referenced from within the site and, as far as we could see, there were no links to them. Despite that, in my case they were back in the index pronto. Fortunately, WMT reported duplicate titles, which is how we found out about them and reinstated the redirect. They have now mostly disappeared again (although it took longer than the original redirect did), and there are a few that still hang around somewhere at the end of the list that the site: operator produces.

I have noticed that over the last 6-8 months Google has been exposing to us via WMT a much larger set of the legacy data it knows about. My opinion (which may be wrong) is that while it had this data all along, maybe the data was "archived" somewhere because the old infrastructure might not have supported easy access to such data volume, and the new infrastructure now allows this data to be included more readily. This is just speculation though.
10:10 pm on Feb 20, 2011 (gmt 0)



Block the URL in the robots.txt file and submit a removal request for it in Google Webmaster Tools.
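
For completeness, the robots.txt part is just a Disallow line (the path below is a made-up example); note that goodroi's step 4 above points out a URL must not be blocked if you want Google to see a 301 (or 410) on it:

User-agent: *
Disallow: /removed-page.html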
 
