|What eventually happens to orphaned pages?|
One of the sites I run is a WordPress blog. Each time a new post is added, WordPress creates a link in the resulting new page to an RSS Feed page. I have about 32 of those indexed in Google, similar to www.domain.com/345/feed/. I have recently amended the .htaccess file to remove trailing slashes on all URLs (as WordPress pages work either way, with or without). While I was at it I also removed all the links to RSS Feeds, as for me they are useless.
I assume what will now happen is that the feed pages will eventually go supplemental because there are no links to them from anywhere. If they were real pages I would delete them but in this case I can't, as there is nothing to delete. Should I be doing something more? What ultimately happens to a page that is no longer linked to from anywhere but is never actually deleted?
32 pages like www.domain.com/345/feed/ - all XML RSS Feed files indexed by Google, making up about a quarter of the number of pages on the whole site. They have no PageRank. They cannot be deleted because there is nothing to delete.
I would prefer to unlink them but I read somewhere that orphaned pages are ill-advised, and as I can't delete them I have made a new page that only links to all those RSS Feed 'pages', purely for them not to be orphaned.
It would still be interesting to know what eventually happens to pages indexed by Google but that become orphaned.
|They cannot be deleted because there is nothing to delete. |
|...orphaned pages are ill-advised, and as I can't delete them... |
...uh... er... what? :)
I don't have a clue about xml and rss and all this but...
Aren't RSS feeds XML files?
Delete them. ( heh... what's YOUR favourite button on the keyboard? )
If there'd be no PAGE to delete, ( i'm trying to imagine these are... URLs that trigger some kind of server response or pages generated dynamically on the fly... correct me if i'm wrong ) how would unlinking this "trigger" lead to an orphaned PAGE?
I don't get it... maybe i'm a bit slow.
Whatever you'd like to deindex so badly that supplemental results and gradual dropout won't do ( which is what happens to orphaned pages, veeeerry slowly ), you just have to take care of the URLs that generate a 200 response. Meaning that when G requests the same URLs, even though the links to them are now gone from your site ( which it will keep doing for a long time btw, although less and less frequently ), the reply must NOT be a status 200. Not to mention that if someone links to them for fun, they'd be back anytime.
So there has to be a way to serve a 404 for these URLs otherwise they will not be deindexed. ( oh btw. why do you want to deindex them in the first place? Not that i didn't do the same with our phpbb2. )
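To make the "serve a 404 for these URLs" idea concrete, here is a minimal sketch written as a tiny Python WSGI app. It is not how WordPress does anything; `FEED_PATH`, `app` and `status_for` are hypothetical names, and the pattern assumes the feed URLs all look like /345/feed/ as described in the first post.

```python
import re

# Assumption: feed URLs look like /345/feed or /345/feed/ (numeric post ID).
FEED_PATH = re.compile(r"^/\d+/feed/?$")


def app(environ, start_response):
    """Answer 404 for feed URLs and 200 for everything else."""
    path = environ.get("PATH_INFO", "/")
    if FEED_PATH.match(path):
        start_response("404 Not Found",
                       [("Content-Type", "text/plain; charset=utf-8")])
        return [b"Not Found"]
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [b"<html><body>ordinary page</body></html>"]


def status_for(path):
    """Call the app directly (no server needed) and return the status line."""
    seen = []
    app({"PATH_INFO": path}, lambda status, headers: seen.append(status))
    return seen[0]
```

Calling `status_for("/345/feed/")` shows what a crawler would be told for a feed URL, while ordinary post URLs keep answering 200.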
I've also read that too many orphaned pages would lead to G thinking you're not maintaining your site well enough, but i've yet to see this cause anything other than trash in the index... for site: searches. Which no one sees.
( Uh... not that we have a single orphaned url on our site, but indexes we shoved into images-only directories got picked up without a single link to them... must be the sitemap at G so do pay attention to that too ;)
They are not actual files. They are content indexed by Google, from links to them. I can only delete the links, as the content is dynamically generated by the CMS. If the links had never existed, the XML RSS Feed "files" (content indexed by Google) would never have existed.
If I delete the links, what then? (in the long term)
If you delete the link to a page/URL that is already indexed, it will lose its PR ( even if you thought it didn't have any, it probably did, only <1 ) and become supplemental, but stay indexed. It will be crawled less and less frequently, defaulting to a regular check every six months or so, judging by the site i'm doing with a friend. If nothing links to it during that time, it may fall out within half a year to a year. Pretty much as if it had been deleted.
Others please feel free to correct me, as i have limited experience as a "webmaster" :P
IF it wasn't linked to by anyone that is, and that's including scrapers, others displaying your feed and so on... also it may come back anytime at a whim during data refreshes, rollbacks, etc.
You say this URL dynamically generates an XML feed, right?
It will generate it whenever G tries to access it.
It doesn't really matter if there's no link to it, the URL is already recorded at G.
But unless this is a rewritten URL...
there has to be a file, which has the script that generates it. Probably the index that defaults in this directory. If there's nothing else in there i'd delete it altogether although this might be completely wrong ;)
But at least it'd return a 404.
Also there has to be a setting in the CMS that would let you turn the rss feeds off. Then removing the link or not, it'd again... return a 404.
If it's a rewritten URL...
then i have no idea ;)
Do the links point to a directory like this?
Do you know what file defaults in there?
Ah but anyway... the point is that orphaned files/URLs are unlikely to hurt unless there are a lot of them, and as long as no one links to them. But if you want to get rid of them, you'll need to remove all links, and get a 404 response for the URL. And make sure there's no sitemap, residual links, archives, whatnot that'd keep G assuming it should be there.
Thanks for all of that, but it doesn't answer my question. I'm talking about content indexed by Google, and the only control I have is whether to continue to link to it or not. At present I am still linking, to avoid 32 orphaned URLs that may or may not cause a problem for the rest of the site in the future.
If I remove the links, the URLs will eventually go supplemental in Google but will not return a 404 because, as I said, there is nothing to delete and there is no way to prevent the CMS from continuing to serve up the content.
|the only control I have is whether to continue to link to it or not |
You can also use robots.txt to disallow Googlebot from those URLs.
Yeah, and that's the scenario i presented in the first paragraph... and my experiences of what happens to such orphaned pages.
...just ignore the second half.
... btw you could remove the links, and once you see the URLs becoming supplemental go to that URL removal page of Google and... have the URLs removed. That ends up much the same as doing nothing though, as it only stops the URLs from being shown or crawled for half a year. Then they might appear again.
But from experience if a URL is valid, and had been indexed by G at any given time, GBot will come back periodically to check it out, index it again, but if no links point to it, it will drop out over and over again. So it's gonna be a supplemental showing some old old cache dates, when in the index at all.
Orphaned pages are alive and well in G database :P
Doing cameos every once in a while on the index.
edit: ...yeah what tedster said :D
You could put up a disallow for all /feed/ directories.
I just don't like the "pages disallowed by robots.txt" column in webmaster tools.
|You can also use robots.txt to disallow Googlebot from those URLs. |
Well yes, but they are already indexed. What I'd like to know is what happens to an indexed URL that is 'unlinked' then goes into the supplemental index but can never actually be deleted. It becomes a 'particular type' of supplemental URL (to paraphrase g1smd) that presumably will remain so, or maybe it does eventually drop out of the index.
The URLs may remain in Supplemental for a while if you just remove links - and indefinitely if you don't get the links removed from everywhere on the web they occur. If your server is still giving a 200 OK response to the bot, then just creating an orphan is no guarantee of any change.
If you really want the URLs removed from the Google index, place the proper disallow rule in robots.txt and then use the Google automated url removal tool. There's one choice in the tool that forces a new fetch of your robots.txt and then takes action pretty quickly - in just a couple days.
But even without using the removal tool, robots.txt alone will handle it eventually - even if the URL was previously indexed.
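As a rough sketch of what such a disallow rule could look like, the Python below builds one explicit rule per feed URL and checks it with the standard library's `urllib.robotparser`. The post IDs other than 345 are made up, and `build_robots_txt` and `is_blocked` are hypothetical helper names. Googlebot also understands `*` wildcards in robots.txt, which would allow a single rule, but as far as I know the standard-library parser only does prefix matching, so explicit rules are used here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical post IDs; only 345 appears in the thread.
POST_IDS = [345, 346, 347]


def build_robots_txt(post_ids):
    """Return robots.txt lines that disallow each post's feed URL."""
    lines = ["User-agent: *"]
    # A prefix rule without the trailing slash covers /345/feed and /345/feed/.
    lines += [f"Disallow: /{post_id}/feed" for post_id in post_ids]
    return lines


def is_blocked(robots_lines, url, agent="Googlebot"):
    """Check a URL against the rules with the standard-library parser."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return not parser.can_fetch(agent, url)


rules = build_robots_txt(POST_IDS)
print("\n".join(rules))
```

Because the rules are prefixes, they would also catch any URL that merely starts with /345/feed, which is worth keeping in mind if other paths share that prefix.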
[edited by: tedster at 5:00 pm (utc) on Dec. 29, 2006]
If I'm not mistaken, orphan pages will eventually get purged. Since there are no links to them, Googlebot can't see them. Once Google goes through a few iterations of its index, those orphans will most likely get purged.
|I would prefer to unlink them but I read somewhere that orphaned pages are ill-advised, and as I can't delete them I have made a new page that only links to all those RSS Feed 'pages', purely for them not to be orphaned. |
And by doing so, you've provided an entry point for Googlebot to continue to index those orphan URIs which is what you don't want.
|I have recently amended the .htaccess file to remove trailing slashes on all URLs (as WordPress pages work either way, with or without). |
That concerns me. /file and /file/ are two different locations. Did you 301 the /file/ to /file? If not, you will most likely have some issues to deal with from a dup content standpoint.
I've never checked server headers on feed pages. What type of server headers are being returned when you check that URI...
Use robots.txt to get them out of the index.
It will take a long time for them to disappear, and some may even remain as URL-only entries for a very long time. However, by that time they will no longer rank for any keywords.
The /name vs. /name/ issue is also important for all of the other files of the site.
One URL should return a 301 to the other, preferably from /name to /name/ each time.
Thanks for the further advice. I will most likely remove the links, so that the URLs will become orphaned and one day drop from the index with no harm done.
With .htaccess I have ensured that URLs with a trailing slash redirect to ones without. A feature of WordPress is that all URLs work with or without, and even though there are no internal links to those with a trailing slash, for some time Googlebot has insisted on crawling a few of them with the trailing slash added. Strangely, the site: command lists one or two with the trailing slash (not the ones it insisted on crawling) but the SERPs always feature those without.
The RSS Feeds return an HTTP/1.1 200 OK.
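The redirect rule described here can be sketched in a few lines of Python. This is only a model of the decision, not the actual .htaccess rule; `canonical_redirect` and its `prefer_slash` parameter are hypothetical names. With the default it redirects the slash form to the non-slash form, and with `prefer_slash=True` it models the opposite direction suggested earlier in the thread.

```python
def canonical_redirect(path, prefer_slash=False):
    """Return (301, target) if path is the non-canonical form, else None."""
    if path == "/":
        return None  # the site root has only one form
    has_slash = path.endswith("/")
    if has_slash == prefer_slash:
        return None  # already canonical: serve the content directly
    if prefer_slash:
        target = path + "/"
    else:
        target = path.rstrip("/") or "/"
    return (301, target)
```

Whichever direction is chosen, the point is that only one of the two forms ever answers 200, and the other always answers with a 301 pointing at it.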
[edited by: Patrick_Taylor at 5:24 pm (utc) on Dec. 29, 2006]
A trailing / URL that returns content with "200 OK" indicates that you are returning the index file in a folder.
A URL without a trailing / is often assumed to be a filename. However if it has no extension (like .html or .jpg etc) you really do need to make sure that the correct MIME type is specified in the HTTP header for each type of file returned.
If the MIME type is missing, IE makes a guess as to what the content is, by examining the first few bytes of the file, but other browsers often just fail to display the content, or display garbage.
I am aware of a site where all images are shown like JFIF:&\4b&3;d7s[37skq6;@~5w3^")@;4fj.,( etc in Mozilla because of an incorrect or missing MIME type when the content is served.
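To illustrate why extensionless URLs need an explicit type, here is a small sketch using Python's standard `mimetypes` module, which guesses a type purely from the file extension. The helper name `guessed_type` is hypothetical, and the example paths are illustrative only.

```python
import mimetypes


def guessed_type(path):
    """Return the MIME type guessed from the extension, or None if unknown."""
    mime_type, _encoding = mimetypes.guess_type(path)
    return mime_type


for example in ("logo.jpg", "style.css", "/345/feed"):
    print(example, "->", guessed_type(example))
```

A URL such as /345/feed has no extension to guess from, so nothing can be inferred and the server has to state the type in the HTTP header itself.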
I use this on all pages on all sites:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
Look at the HTTP header; that is, the stuff that sits above the HTML page content.
Check what you get for images, CSS, JS, ZIP, and other files on your site too.
You mean with a typical online HTTP Header Viewer? I don't see a MIME type as such - only Content-Type: text/css.
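For what it's worth, the Content-Type header is where the MIME type is carried, so text/css in that output is the MIME type. A minimal Python sketch of pulling it apart, using the standard library's `email.message.Message` to parse the header value; `parse_content_type` is a hypothetical helper name.

```python
from email.message import Message


def parse_content_type(header_value):
    """Split a Content-Type header value into (MIME type, charset or None)."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_type(), msg.get_content_charset()


print(parse_content_type("text/css"))
print(parse_content_type("text/html; charset=UTF-8"))
```

The first element of each result is the MIME type; the second is the optional charset parameter, lower-cased, or None when the header does not give one.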