|Redirect subdirectory index pages?|
need best practices advice
| 6:42 am on Sep 17, 2012 (gmt 0)|
One of my sites started growing little by little and recently grew again by nearly double the number of pages to about 2,000. When I first started the site I stored images and includes for different areas/departments in their own subdirectories and tucked in a little noindexed index.html file to prevent server listing. When I had nearly twice as many pages all at once I decided to move these pages to the subdirectories where their images and includes had been and it all works fine - but my sitemap script adds the subdirectory folders to the sitemap as: http://www.example.com/subdirectory/ and if that URL is crawled, the dummy index file will show up. As I said, it is meta noindexed, just a useless placeholder page.
I can't prevent the sitemap script from listing the subdirectory without also preventing all pages there from being added to the sitemap. I have not seen any indication in GWT that there is an issue, but almost expect it to be a problem. Each subdirectory has this dummy index.html file, but it also now has an actual named page like http://www.example.com/subdirectory/what-is-here.html that serves the function of linking to the various pages in that subdirectory.
My question: Should I consider a redirect from the index file to the named page or will that get me duplicate content? I could rename the named page to index.html and fix the navigation links with find/replace and just let the named page return a 404. This page has only existed since late June. All these major changes were done since late June, the page URLs would not change, just the navigation links. There are 3 subdirectories which each contain 3 - 5 sub subdirectories in them. My brain hurts. I am sure that it's all been done before and hope someone who has the experience can share some ideas. I have built a few dozen sites but never one with this kind of structure and if I had had time to spare before doing it this way I would have simply named these pages "index.html" and let them do their job. It is easy to see now what I should have done, but what's the best way to fix it now or should I leave things alone that "aren't broken"?
| 8:14 am on Sep 17, 2012 (gmt 0)|
So does mine. Why do you have the dummy index file at all? Wouldn't it be more practical to switch off auto-indexing for the entire site?
Can't you tell your sitemap script to index only files with certain extensions? If there's nothing in there but images, there's no reason for it to be on the sitemap at all. G### will find the individual images; they're linked from pages.
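For reference, a minimal sketch of what switching off auto-indexing for the whole site could look like in a root .htaccess file. It assumes an Apache server whose configuration allows `Options` to be overridden from .htaccess; the thread mentions .htaccess but never shows the server configuration. This only disables the generated file listings: a directory that has an index.html still serves it, and one that has none typically returns 403 Forbidden instead of a listing.

```apache
# Root .htaccess: applies to this directory and everything beneath it.
# Assumes AllowOverride permits Options here.
Options -Indexes
```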
| 9:20 am on Sep 17, 2012 (gmt 0)|
the index for a directory should be served from the directory's url which ends with a trailing slash:
the next step is that you should configure your server to specify the default directory index document - say index.html - and any requests for the default directory index document, for example http://www.example.com/subdirectory/index.html, should be externally redirected with a 301 status code to the trailing slash url:
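A minimal .htaccess sketch of that setup, assuming Apache with mod_dir and mod_rewrite enabled and the file placed in the site root. None of it is taken from the poster's actual configuration. The `THE_REQUEST` condition limits the redirect to requests where the client itself asked for index.html, so the server's internal lookup of the default document does not cause a redirect loop.

```apache
# Serve index.html when a directory URL (ending in a slash) is requested.
DirectoryIndex index.html

RewriteEngine On
# Match only when the original request line names index.html explicitly.
RewriteCond %{THE_REQUEST} ^[A-Z]+\s/(?:[^/?\s]+/)*index\.html[?\s]
# 301 to the containing directory, e.g. /subdirectory/index.html -> /subdirectory/
RewriteRule ^(.*/)?index\.html$ /$1 [R=301,L]
```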
from your description it sounds like the content in the what-is-here.html file should actually be in the index.html file.
if http://www.example.com/subdirectory/what-is-here.html has been indexed or if you are getting any requests for that file, for example a returning visitor who has bookmarked that file, those requests should also get 301 redirected to the trailing slash url:
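One way this could be sketched in a root-level .htaccess, assuming Apache with mod_rewrite enabled and assuming the what-is-here.html content has already been moved into each directory's index.html. The file name is the example used in this thread; a real site would substitute its own page names.

```apache
RewriteEngine On
# /subdirectory/what-is-here.html -> /subdirectory/ (permanent redirect)
RewriteRule ^(.*/)?what-is-here\.html$ /$1 [R=301,L]
```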
regarding your meta noindexed content showing up in the index - that shouldn't happen unless you are excluding this content from crawling by disallowing in robots.txt.
in this case you would typically see this content in a search snippet with no description and in its place is text similar to this:
|A description for this result is not available because of this site's robots.txt – learn more |
in this case you should allow crawling so that the indexer can see the meta noindex element in the document.
the only way to noindex an image file or any other resource that is not an html document is to use the X-Robots-Tag HTTP Response header.
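As an illustration, a minimal .htaccess sketch that sends this header for image files, assuming Apache with mod_headers enabled. The list of extensions is a placeholder chosen for the example and is not taken from the site discussed here.

```apache
<IfModule mod_headers.c>
  # Ask search engines not to index matching image files.
  # The extension list below is illustrative only.
  <FilesMatch "\.(?:gif|jpe?g|png)$">
    Header set X-Robots-Tag "noindex"
  </FilesMatch>
</IfModule>
```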
your site map script may help a search engine discover urls on your site but the absence of a url does nothing to exclude that resource from crawling nor from indexing as an incomplete snippet if the url is discovered elsewhere.
| 3:03 pm on Sep 17, 2012 (gmt 0)|
|Wouldn't it be more practical to switch off auto-indexing |
Thank you, but I guess I was not very clear. The directory used to contain only images so the dummy indexes were put in because the rest of the site does use a structure that is set up to serve an index file when a directory is accessed. It now contains pages and images. If I use -Indexes I would need to make less preferred changes to the directories that have been using a regular index.html for years. Currently, as phranque said, when you access a directory an index.html file is served. For years I had the dummy index in there just so the server would not serve up a list of images. My only controls are via .htaccess and it seems more complicated than need be to turn off Indexes in some subdirectories only.
|from your description it sounds like the content in the what-is-here.html file should actually be in the index.html file |
That is exactly right, just wondering if I am better off redirecting the index.html file to the what-is-here.html page in each subdirectory or renaming what-is-here.html to index.html
|regarding your meta noindexed content showing up in the index |
I am not having this problem, but I can see where the confusion is. The index.html pages are not being indexed, they do not appear in the sitemap and I'm sure they are crawled and ignored. They are not blocked in robots.txt. My problem was that although the index.html files are not in my sitemap, the subdirectories are listed and if crawled, that URL would serve up a dummy index.html file. I can control what goes in the sitemap by file extension but folders/directories are only either on or off, I can't list every page in a subdirectory without listing the subdirectory as a URL too.
|Can't you tell your sitemap script to index only files with certain extensions? If there's nothing in there but images, there's no reason for it to be on the sitemap at all. G### will find the individual images; they're linked from pages. |
Yes, it is not images that concern me, but you just gave me an idea. If I rename the index.html pages to index.php they will not be giving the subdirectory URLs to the sitemap script because .php files are ignored and I would not need to do anything else. My htaccess is only redirecting requests for index.html to the subdirectory URL. If I rename them to index.php that should keep the subdirectories out of my sitemap.
Thank you lucy24 and phranque for the help, just needed some different ideas on this because I was not seeing what is in front of me.
| 4:03 pm on Sep 17, 2012 (gmt 0)|
Yes! It worked. Replacing all those dummy index.html pages with index.php keeps the subdirectory URLs out of the sitemap. I fixed it so that the index.php files will work the same way but because .php files are ignored by my sitemap script, this prevents those URLs from being returned to the script. The subdirectories that actually do use an index.html file were left as is and it all works fine. I ran a new sitemap and only the subdirectory URLs that are pages show up. I am most happy. Thank you again!
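The post does not say how the index.php files were made to "work the same way". One possible arrangement, assuming Apache with mod_dir and offered only as a guess at the setup, is to list both names so that directories with a real index.html keep serving it and the others fall back to the dummy index.php:

```apache
# Try index.html first; fall back to index.php where no index.html exists.
DirectoryIndex index.html index.php
```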
| 7:08 pm on Sep 17, 2012 (gmt 0)|
|renaming what-is-here.html to index.html |
that and a redirect is the correct solution.
what happens when you request urls such as these?
what happens when googlebot requests http://www.example.com/subdirectory/ in a directory that has one of your dummy index.php files?
(it's going to index the content served by the index.php)
| 7:30 pm on Sep 17, 2012 (gmt 0)|
http://www.example.com/subdirectory/index.html would give you a 404 since I replaced it with index.php
http://www.example.com/subdirectory/index.php would let you see a page with nothing on it but a nofollowed link to the right page. It has not been indexed, anyone coming to that page is playing with the URL.
Index.php to the subdirectory URL and both pages are/were noindexed.
If I were to rename the what-is-here.html pages I would need to rewrite all the navigation for the site and every link on the site would use a redirect.
Subdirectory pages only have links to the home page and the what-is-here.html page for that subdirectory but I prefer not to have every link require a redirect. Ideally I would have thought much more prior to making these changes, but I had 4 days' notice that all links had been changed and would cease to function when I rebuilt the site. All I wanted to do was to remove the subdirectory URLs from the sitemap, they were not indexed anyway, I imagine because #G found nothing to index at that URL. I do realize that your recommendation is Best Practice. I can redo the site a section at a time and do it without redirects, now that the sitemap no longer invites the unused URLs to be crawled.