
Google SEO News and Discussion Forum

    
Lots of indexed content that expires
Onders




msg:4328902
 6:16 pm on Jun 21, 2011 (gmt 0)

I'm currently looking into indexing a lot of unique content pages which have a high turnover. Every month we get a couple of hundred new ones, and a couple of hundred need to be archived / removed.

Wanted to share my experiences and what I've learnt so far, as well as see if you guys have any suggestions.

Some of our pages need to be archived after a month, some after longer. What I am going to do is set up an expiry meta tag so that Google knows how long the page should be live for.

When the page is archived on our site (should correspond to the meta), then I'm going to put a 410 on the page (I can't find a huge amount of information on implementing 410's but Google seems to suggest it for permanently removed content)

I hope to have the 410 page dynamically pull in data about the page that was archived so that it shows something relevant for the user (i.e. if the archived page was about blue widgets, the 410 will have something like dark blue widgets).

I'm not going to put all these in the htaccess as it would become huge, and I'm not going to submit removal requests to get the pages removed from the index.

I think I'm doing enough - and complying with the guidelines / suggestions... well, I hope I am!

Thanks!

 

tedster




msg:4328948
 8:14 pm on Jun 21, 2011 (gmt 0)

Yes, it sounds like you're doing enough. However, I don't think Google is going to pay attention to the "expires" meta tag.

And sometimes, if you are expiring a very popular bit of content, a 301 redirect to another relevant page is a good idea - or even just changing the content on the existing URL to be informative and letting it stay as a 200 OK page.

It all depends on what you are offering and staying in touch with how it is performing.

g1smd




msg:4328953
 8:29 pm on Jun 21, 2011 (gmt 0)

I would add the 410 status code to the real content page and continue to show that for a short while (a week or so) before removing the content. That should overcome the time lag that occurs between the 410 being added, and the page dropping out of the SERPs. That is, you should return real content while the page is still findable in search results, even if there's a "410 Gone" status in the page header.
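
A minimal sketch of that grace-period idea, written in Python purely for illustration (the snippets elsewhere in this thread are Apache and PHP). The `Page` fields, the seven-day `GRACE` window, and the message text are assumptions made for the example, not something the original poster specified.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Tuple

# Assumed grace period: keep showing content for a week after expiry.
GRACE = timedelta(days=7)


@dataclass
class Page:
    title: str
    body: str
    expires: date  # hypothetical field: the day the page is archived


def respond(page: Page, today: date) -> Tuple[int, str]:
    """Return (HTTP status code, body to show) for a request made today."""
    if today < page.expires:
        return 200, page.body
    if today < page.expires + GRACE:
        # The status already says "Gone", but visitors who still arrive
        # from search results see the real content plus a notice.
        return 410, "This listing has expired.\n\n" + page.body
    # After the grace period only a short message remains.
    return 410, "This listing has been removed."
```

The status code changes on the expiry date; the visible content only changes once the grace period has passed.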

lucy24




msg:4328989
 9:18 pm on Jun 21, 2011 (gmt 0)

When the page is archived on our site (should correspond to the meta), then I'm going to put a 410 on the page (I can't find a huge amount of information on implementing 410's but Google seems to suggest it for permanently removed content)

The 410 is done in a counter-intuitive way, by redirecting or rewriting to nowhere:
[httpd.apache.org...]
or
[httpd.apache.org...] (scroll down to "gone|G")

Here it sounds as if you want to send users to a different 410 page for each no-longer-used primary page, so a 301 may be safer.

Make sure search engines do not get one single 404 before you get the 410 working, or they will keep trying to crawl the no-longer-there page forever. If they get a 410 right away, they will learn pretty fast.

g1smd




msg:4329000
 9:28 pm on Jun 21, 2011 (gmt 0)

I'm going through something like this at the moment: implementing a way to safely and sanely remove discontinued catalog items from the site and from the SERPs, with as little inconvenience to site visitors as possible.

Most off the shelf systems seemingly make little or no provision for this.

emikey




msg:4329048
 10:53 pm on Jun 21, 2011 (gmt 0)

The one issue I see with putting in a status 410 is essentially eliminating any of the link juice your articles earned. If your articles provide a benefit, people will most likely link to them. The last thing you want to do is take those natural links and throw them in the trash. My advice is to keep an archive of your content unless you have server space limitations. At the very least you should redirect your old content to similar pages on your website.

Onders




msg:4329057
 11:17 pm on Jun 21, 2011 (gmt 0)

Thanks for all your input. tedster, you're right that it depends on what we are doing. Effectively we have a type of catalogue, with a high turnover of product! A 301 is not an option as all the content is different; I can't get away with saying it has moved.

I like the idea of keeping the content for a short while, or potentially forever, as long as there are related items and/or a search option above.

The link juice is something that could be annoying. If I have a couple of links on the 410 would they count.. I think not!

Robert Charlton




msg:4330146
 7:54 pm on Jun 23, 2011 (gmt 0)

The link juice is something that could be annoying. If I have a couple of links on the 410 would they count.. I think not!

Yes, that pesky link juice gets all over everything. ;)

The crux of the situation would be to determine which of the discontinued product pages have in fact attracted external inbound links. I'd use a combination of link databases (those that are fresh within 24 hours) along with your server logs, to determine roughly which pages have external inbounds and which have received outside traffic via search. These conditions are likely to identify the pages worth 301 redirects or archiving.
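
The server-log half of that check could be sketched roughly as follows, in Python for illustration. It assumes Apache "combined" format log lines; the function name, the host name, and the set of expired paths are hypothetical. It only counts visits that arrived with a referrer from another host, so it does not replace a link database.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Pulls the request path and the referrer out of an Apache "combined" log line.
LINE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "(?P<ref>[^"]*)"')


def external_hits(log_lines, own_host, expired_paths):
    """Count requests to expired paths whose referrer is another site."""
    hits = Counter()
    for line in log_lines:
        m = LINE.search(line)
        if not m or m.group("path") not in expired_paths:
            continue
        ref_host = urlparse(m.group("ref")).hostname
        # A "-" referrer has no hostname; internal referrers match own_host.
        if ref_host and ref_host != own_host:
            hits[m.group("path")] += 1
    return hits
```

Pages that show up in the result are candidates for a 301 or for archiving; pages that never appear are candidates for a plain 410.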

Queries for the urls of the remaining pages should return a 410 (Gone) status code... [w3.org...] ...meaning that: "The requested resource is no longer available at the server and no forwarding address is known".

With your site's fast turnover, you'd need to do this on an ongoing basis, which would probably require a custom system, and it might be costly to do regularly. You would also have to update nav links within your remaining content, along with your XML sitemaps.

It's going to take a chunk of overhead to do even this on a test basis, so initially I'd set up a test for a limited window of time... and I'd determine what percentage of your pages have links worth preserving, and also evaluate what kinds of links those are (eg, if they're blog links that are ultimately going to drop off the blog first page, they're likely to be of limited time value and may not be worth the overhead of preserving).

You'd then need to look at the tradeoffs of...

For pages that have no external inbounds...

- 410s, which would deliver a custom error page, not unlike a 404 custom error page, but more permanent. You would need to remove internal nav links to any pages you 410 or simply let go 404.

For pages that do have external inbounds...

- direct 301s to either related currently active pages, or perhaps to appropriate category pages...

or...

- archiving... either by keeping them in place... or by moving to a separate section, which would entail 301s because of the new location.

In either case, you'd need to modify the content of your old pages to make them useful to visitors who find them, thus maximizing benefit to your site. The onpage modifications to your archived pages, whether moved or in place, would entail, among other things, advising users of the status of the pages, removing links to pages that would also be changing, and adding links to new related content you want to send users to.
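
The trade-offs above could be reduced to a small decision rule along these lines. This is only a sketch in Python: the function name, its inputs, and the "any link or visit counts" threshold are assumptions, and a real site would tune them.

```python
def choose_action(external_links: int, search_visits: int, has_related_page: bool) -> str:
    """Pick a handling for an expired page from the options listed above."""
    if external_links == 0 and search_visits == 0:
        return "410"      # nothing worth preserving, so let the page go
    if has_related_page:
        return "301"      # forward links and visitors to a related or category page
    return "archive"      # keep the URL and rewrite the content in place
```

For example, `choose_action(3, 0, False)` selects archiving, because the page has inbound links but no suitable redirect target.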

Chances are that an archived page will only have a limited life unless it was immensely popular, in which case there may be replacement products coming. An archive might in effect be used as a buffer, to preserve inbound link juice until a more appropriate use of the page can be determined.

I don't understand what's meant by using a 410 for an archive. It doesn't make sense to me. When you 410 a page, it's gone.

Needless to say, the CMS requirements of the above aren't trivial. As g1smd points out, off-the-shelf systems generally don't address this.

g1smd




msg:4330169
 8:16 pm on Jun 23, 2011 (gmt 0)

I don't understand what's meant by using a 410 for an archive. It doesn't make sense to me. When you 410 a page, it's gone.

The "410 Gone" is a status code delivered in the HTTP header, and has no relationship to what is shown to humans or bots on the HTML page. You can deliver any 2xx, 4xx, or 5xx status code in the HTTP header before the HTML page is served.

In this case, the 410 status code is delivered at the originally requested URL. The 410 status code tells search engines to de-index the page. That will take days to weeks to happen. You will therefore still get visitors to that page from the SERPs until the page is de-indexed. You can continue showing content on that page to the visitors that continue to arrive. Once the page drops from the SERPs that page can be changed to deliver a standard error message.
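
To illustrate that the status line and the page body are independent, here is a minimal sketch using Python's standard WSGI interface (chosen only for illustration; the site in question is evidently PHP). The app name, the helper, and the HTML are made up for the example.

```python
from wsgiref.util import setup_testing_defaults


def expired_page_app(environ, start_response):
    """WSGI app: send a 410 status line, then an ordinary HTML body."""
    body = (b"<html><body><p>This listing has expired.</p>"
            b"<p>Original listing details</p></body></html>")
    start_response("410 Gone", [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def call_app(app, path="/"):
    """Invoke a WSGI app in-process and return (status, body) for inspection."""
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status

    environ = {"PATH_INFO": path}
    setup_testing_defaults(environ)  # fills in the remaining required WSGI keys
    body = b"".join(app(environ, start_response))
    return captured["status"], body
```

Calling `call_app(expired_page_app)` returns the `410 Gone` status together with the full HTML, which is what a visitor's browser would receive and render.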

Onders




msg:4330175
 8:26 pm on Jun 23, 2011 (gmt 0)

Could you actually then serve a 410 for a certain amount of time (enough for the page to be de-indexed by Google) and then, after let's say 3 weeks, put a 301 redirect on it? The page URL can ultimately still be there (it's just had a 410 status header, but the info etc. is still in the DB).

Links will still be pointing to that page, and their benefit may be reduced by the 410, but by the time you add the 301 the page should already be de-indexed, and you may then pass some benefit on to another URL.

Not sure if this logic follows - once again, thanks for the input and your thoughts!

g1smd




msg:4330183
 8:42 pm on Jun 23, 2011 (gmt 0)

You could do that but I would not advise it. Once you have served "410" don't change to some other status code.

If you are going to redirect, just set up the redirect immediately the page content is gone. The redirect will cause the URL to be de-indexed soon enough.

Onders




msg:4330185
 8:45 pm on Jun 23, 2011 (gmt 0)

Ok - so if we stick with 410 (which is what I'm intending) I think I may just have to cope with the lost links.

londrum




msg:4330190
 8:51 pm on Jun 23, 2011 (gmt 0)

i wouldn't remove them at all, because you might still get a bit of traffic out of them.

what i would do is give them a "noindex" tag after the date so google can remove them from the index, and then include a little header at the top of the page which points people to your new content.

i dont think that automatically redirecting people to new content when they are expecting the old content is a great idea. what if they are visiting the page just one day after it has "expired"? they might have visited the page a few times already and know what they want. just stick a header at the top explaining that its now out of date and point them to the new place.

Robert Charlton




msg:4330215
 9:28 pm on Jun 23, 2011 (gmt 0)

One dilemma with keeping the pages but noindexing them is that, particularly with a high turnover rate, noindexed pages will accumulate over time and ultimately dilute the link juice flowing to fresh content.

I'd either 410 them all and lose possible link credits, or take the trouble to identify which have inbound links and traffic worth preserving and 301 them.

I should add that in this type of site, it's likely that you will not lose many inbound links to these short-lived pages... but I couldn't be sure of that without checking the backlinks to those pages to get a sense of how many there are.

g1smd




msg:4330217
 9:32 pm on Jun 23, 2011 (gmt 0)

It is important to be aware of the wide range of options available, and then to compare and contrast them all before deciding on the correct solution.

The correct solution will vary from site to site.

lucy24




msg:4330248
 10:52 pm on Jun 23, 2011 (gmt 0)

The "410 Gone" is a status code delivered in the HTTP header, and has no relationship to what is shown to humans or bots on the HTML page. You can deliver any 2xx, 4xx, or 5xx status code in the HTTP header before the HTML page is served.

Are you saying that a 410 in the page header is entirely different from and unrelated to a 410 generated in the .htaccess? I hope so, because otherwise it becomes another of those "it's at the whim of your server" issues.

The OP did specifically say they didn't want to go the .htaccess route.

g1smd




msg:4330278
 11:41 pm on Jun 23, 2011 (gmt 0)

The 410 status code is sent inside the HTTP header that precedes the sending of any visible HTML content.

That 410 header can be generated by mod_rewrite in the .htaccess file, by a few lines of PHP on the server, or in one of several other ways. It matters not how the HTTP status code is generated. The end result looks exactly the same to the browser or bot that made the initial request. The Live HTTP Headers extension for Firefox can show you what is going on.

lucy24




msg:4330311
 1:18 am on Jun 24, 2011 (gmt 0)

The end result looks exactly the same to the browser or bot that made the initial request.

The 410 status code tells search engines to de-index the page.... You can continue showing content on that page to the visitors that continue to arrive.

OK, then we are in "at the whim of your server" territory, because once I've got a 410 in place, I can't see the page any more. Instead I got handed a server-default 410 that was even scarier than the server-default 404. (Scarier to ordinary humans, that is. Robots don't feel fear.) Past tense, because as soon as I discovered it, I physically rerouted them to my custom 404 page. Most people don't care about the niceties of whether a page used to exist or not, though the ones in the OP's situation might.

EvilSaint




msg:4330321
 2:21 am on Jun 24, 2011 (gmt 0)

I would be very apprehensive about showing a 410 on a content-rich page, especially if it's unique content and has inbound links coming to it.

If server space isn't an issue, would definitely advise you to keep the pages as a normal 200. If not, worst case scenario, a 301 redirect to a similar page to pass the PageRank & link juice value across would suffice.

g1smd




msg:4330370
 6:15 am on Jun 24, 2011 (gmt 0)

Instead I got handed a server-default 410 that was even scarier than the server-default 404.

If you're doing
RewriteRule ^thatpage\.html$ - [G]
in .htaccess then yes, you are forcing the server default 410 ErrorDocument to be served for that request. Processing never reaches the PHP script running the site.

If you're modifying the CMS to say
if ($pagestatus == 'expired') { header('Status: 410 Gone'); }
in the PHP code at some point before the HTML DOCTYPE is emitted, then you have complete control over what is served and what status code is served with it.

Both solutions put the same "410 Gone" status in the HTTP headers that precede the HTML page.

Onders




msg:4330446
 10:00 am on Jun 24, 2011 (gmt 0)

Yep - there is no need to have a server default 410 page, you can just easily retain the content on the page, whilst still telling search engines and bots that the page has expired.

We are going to go down the route of doing this on a page-by-page basis (not through .htaccess) and automatically display a message on these pages for a few weeks to let people know that the page has expired (they may still accidentally arrive from search engines that haven't de-indexed it, or from old direct links in emails etc.), along with a search option to take them to new content. Eventually we will remove the content completely (3 to 4 weeks later).

It's hard to be specific about our situation without just saying our domain, but the content we are 410'ing is generated by clients, who want it removed after a certain date anyway. That's one of the main reasons why I don't want to leave it as a 200. Serving a 410 and then potentially removing the content after a few weeks is the best trade-off.

Robert Charlton




msg:4330691
 6:11 pm on Jun 24, 2011 (gmt 0)

...then you have complete control over what is served and what status code is served with it.

Thanks for various clarifications, in spite of which I had assumed .htaccess behavior.

I'm still adjusting my thoughts to what feels like an inconsistency in the approach of continuing to display the page... though I now understand why you're wanting to do it that way. The approach does seem inconsistent with the section I'm highlighting in the first part of the w3c 410 specs below... [w3.org...] ... but at the same time, it appears to be in line with the intention of the spec in the second part, which I'm also quoting....

10.4.11 410 Gone
The requested resource is no longer available at the server and no forwarding address is known....

...The 410 response is primarily intended to assist the task of web maintenance by notifying the recipient that the resource is intentionally unavailable and that the server owners desire that remote links to that resource be removed. Such an event is common for limited-time, promotional services and for resources belonging to individuals no longer working at the server's site. It is not necessary to mark all permanently unavailable resources as "gone" or to keep the mark for any length of time -- that is left to the discretion of the server owner.

Onders - I assume your CMS will remove all internal navigation to these pages when the 410 is served. Ditto with regard to XML sitemaps. Is this to be done in discrete intervals on an ongoing basis? It seems to me it would be almost impossible to do continuously.

It's hard to be specific about our situation without just saying our domain, but the content we are 410'ing is generated by clients, who want it removed after a certain date anyway.

This also clarifies a bunch of things. It sounds like these may be classified ads or real estate listings, as opposed to, say, more traditional ecommerce (which is what I've been assuming) where you might control inventory or be able to offer similar or replacement products (and would thus have a user-based reason for forwarding inbound links, apart from just the link juice). In this case, it does sound like the 410 with the page displayed for a while would be best for the user.

If these are classified ads or something analogous, btw, and you have input form fields for contact information, removing that contact info when you serve the 410s might be helpful to your clients who posted the content.

g1smd




msg:4330936
 6:27 am on Jun 25, 2011 (gmt 0)

I haven't studied what eBay does in detail, but what I do see is that they leave the full details of an auction up for several months after it has finished (with links to similar live auctions, to the original seller's page, and so on) and then eBay eventually changes the page content to say the auction has been removed.

I would assume that at some point in that process, the HTTP header status code changes from "200 OK" to either 404 or 410. What we are discussing here is changing it to 410 a few weeks before the page content is changed in preparation for it to be dropped by search engines.

Browsers don't care what the status code is, they will display the HTML content whatever the 2xx, 4xx or 5xx code (with a 3xx code they will follow the redirect to the new URL, not display any content at the originally requested URL).

Search engines will obey the status code, and therefore not index the actual page content when a 4xx or 5xx code is delivered. They will also take steps to remove the page from their index once a 3xx or 4xx code is returned for that URL.

Onders




msg:4334747
 10:44 am on Jul 4, 2011 (gmt 0)

Robert - sorry, missed the last replies on the thread, but yes, we are doing the following

a) Retaining the actual content on the page indefinitely. I am currently measuring how many times the 410 pages are being served (distinguishing between actual users and bots). Depending on how many times actual users see the 410'ed pages, we'll decide if we want to serve different content (is the extra development time worth it?)

b) Internal navigation is removed as we 410.. so there is no way anyone can come to the page through the site - we still have links in emails, but again, am measuring how many times people are clicking and visiting the 410's.

c) We already remove all contact information when the page is 410'ed - totally agree. It is done on an ongoing basis. Effectively think of us having searches which generate results from a DB. Once the page has gone the result isn't generated.
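
The same "expired results are never generated" filter can be applied to the XML sitemap that Robert asked about earlier. A rough sketch in Python follows; the `(url, is_expired)` input shape and the function name are assumptions, since nothing in this thread describes the actual database schema.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from (url, is_expired) pairs, skipping expired pages."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, is_expired in pages:
        if is_expired:
            continue  # 410'ed pages are not advertised to crawlers
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="unicode")
```

Regenerating the sitemap from the same query that drives the on-site listings keeps the two in step without a separate clean-up job.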
