Forum Moderators: phranque


root vs index.html vs index.htm

         

nmjudy

9:11 pm on Oct 28, 2008 (gmt 0)

10+ Year Member



ARrrrrGH! Google Webmaster Tools is reporting broken links 'from my site' 'to my site' that don't exist.
A couple of months ago, I added a rule to my .htaccess that redirects all index.html requests to their folder root. The code is below, and the live HTTP headers show that it is working.

RewriteEngine on
#
# Redirect requests for index.html in any directory to "/" in the same directory
#
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(.+/)?index\.html\ HTTP
RewriteRule ^(.+/)?index\.html$ http://www.example.com/$1 [R=301,L]
#

I only use .html for page extensions. It appears that Google Webmaster Tools is trying to find index.htm links. (?) It says that the source is from my own page - but there is no such link on the source page. Is there a way to force both an index.html AND index.htm redirect to the root folder with the rewrite code?

jdMorgan

11:42 pm on Oct 28, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Sure,
Just make the trailing "l" optional by following it with the "?" regular-expressions quantifier:

RewriteEngine on
#
# Redirect requests for index.html in any directory to "/" in the same directory
#
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.html?\ HTTP/
RewriteRule ^(([^/]+/)*)index\.html?$ http://www.example.com/$1 [R=301,L]
#

The "?" quantifier means, "match zero or one of the preceding character, character-group (such as [a-z]), or parenthesized sub-pattern." Just in case, I also modified your subdirectory matching to allow for "zero or more" subdirectory levels.
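To see exactly what the "?" buys you, here is Jim's RewriteCond pattern run through Python's re module (a sketch only; Apache's escaped spaces "\ " are written as plain spaces, and the sample request lines are made up):

```python
import re

# The RewriteCond pattern from above, translated to Python regex syntax
the_request = re.compile(r'^[A-Z]{3,9} /([^/]+/)*index\.html? HTTP/')

requests = [
    "GET /index.html HTTP/1.1",         # matches: index.html in the root
    "GET /index.htm HTTP/1.1",          # matches: "l?" makes the final l optional
    "GET /sub/dir/index.htm HTTP/1.1",  # matches: any subdirectory depth
    "GET /page.html HTTP/1.1",          # no match: not an index page
]
for r in requests:
    print(r, "->", bool(the_request.match(r)))
```

The last example is the important one: an ordinary page request falls through the rule untouched, so only the index filenames get redirected.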

Jim

g1smd

11:31 am on Oct 29, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I use similar code but I also get it to redirect for index.php and several other common names as well as the ones you mentioned. That way I don't care what name people use in their links; those links will still work, and can never cause duplicate content issues.
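That variant might look something like the sketch below, which simply widens the filename match in Jim's rules to an alternation (the exact list of filenames is an assumption, and example.com stands in for the real host):

```apache
RewriteEngine on
#
# Redirect requests for index.html, index.htm, or index.php in any
# directory to "/" in the same directory
#
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.(html?|php)\ HTTP/
RewriteRule ^(([^/]+/)*)index\.(html?|php)$ http://www.example.com/$1 [R=301,L]
```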

jdMorgan

1:53 pm on Oct 29, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The problem with that is that it makes those additional filenames appear to be valid. That is, Mr. bone-head linker puts up a link to your site using index.php --a URL that has never been used on your site, and does not exist-- and you accept it and redirect it to "/".

So, Mr. bone-head linker clicks on his bogus link to check it, sees a page, and goes away happy about his new link, because he failed to notice that his address bar changed to "/".

So now, all the search engines have to go through an extra step to credit PR/link-pop from that bogus link, and humans have to wait for the redirect every time they click that link.

If you'd let that bogus link 404, then Mr. bone-head *might* have noticed something wrong.

As a result, I usually hold off on "handling all possible cases" until I see a situation where the above behavior can be balanced against the worth of the link. If it's from some obscure blog or forum, I might ignore it, post a corrective blog comment, or drop the Webmaster an e-mail. On the other hand, if the link was from CNN's home page, you can bet I'd redirect it! :)

There are valid arguments both ways; as with blocking user-agents and "countries," this is one of those decisions that should be made in an informed manner by each Webmaster individually.

Jim

g1smd

2:09 pm on Oct 29, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



What I have found is that Mr. BoneHead doesn't actually check the link after publishing it. However, if they later start paying attention and decide to check all their outgoing links using something like Xenu's Link Sleuth, they'll see their link listed in the report as a 301 along with the correct URL they should have used. At that point they can either fix it, or not. If the link had been showing as a 404, they would be more likely to delete the link than to find some way to correct it. That's been my experience. YMMV of course.

nmjudy

2:09 pm on Oct 29, 2008 (gmt 0)

10+ Year Member



In my case, there is NO link from my own pages to any .htm pages on my site. However, Google Webmaster Tools says that's what it's following (tells me that pages on my site are linking to index.htm pages).

For a long time, I had my index.html pages indexed without a redirect. A few months ago, I implemented the redirect code to the root to avoid duplicate content. It appears that Googlebot is "making up" or "fishing for" the .htm extension when the links I have are specifically for just the root.

One thing I don't have on my site is a base href. Is this still being used? If so, what's the format and where do you put it?

g1smd

2:10 pm on Oct 29, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It goes within <head> ... </head> with code like
<base href="http://www.example.com/somepath">

nmjudy

2:53 pm on Oct 29, 2008 (gmt 0)

10+ Year Member



So if I'm in the process of changing all my relative links to root relative - would I just use <base href="http://www.example.com/"> ?
If I changed that now - how will it affect my relative links?

g1smd

2:55 pm on Oct 29, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Relative links are <base + link> so the base should be the URL of the page you are on.

That doesn't affect URLs that begin "/whatever" because the "/" says "root of site".
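That resolution rule can be checked with Python's standard-library urljoin, which follows the same RFC 3986 algorithm browsers use (the URLs here are made up for illustration):

```python
from urllib.parse import urljoin

# The base is the URL of the page the link appears on
base = "http://www.example.com/subdir/page.html"

# A page-relative link resolves against the page's directory
print(urljoin(base, "other.html"))   # http://www.example.com/subdir/other.html

# A root-relative link ignores the path portion of the base entirely
print(urljoin(base, "/other.html"))  # http://www.example.com/other.html
```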

nmjudy

3:51 am on Nov 4, 2008 (gmt 0)

10+ Year Member



I've been frantically trying to change all my links to root relative - and uploaded several folders I've been working on all day.

I haven't added the base href in the head just yet.

I just ran xml-sitemap generator to create my sitemap, and what used to work really well - just spun in circles. It was trying to spider a structure like this:
examplepage.html/http://www.example.com/reallylongpathnamerepeatingseveraltimes/http://www.example.com/reallylongpathnamerepeatingseveraltimes/http://www.example.com/reallylongpathnamerepeatingseveraltimes/

I use SSIs on my page that use root relative links. Will adding the base href in the head help or hurt this? Also...should the base be

http://www.example.com
or
http://www.example.com/ ?

I would think the first choice because if all other links begin with a slash (/), wouldn't using the domain with the slash (/) create a double slash in the URL?

jdMorgan

11:39 am on Nov 4, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It sounds to me like the sitemap generator is broken, then.

You should not have to "do anything extra" to use root-relative links on your site; it should just work. The <base href> stuff is not required.

If you do use <base href>, then as posted above, the value is the URL of the page you are adding the <base href> to -- i.e. the <base href> of this page is this page's URL.

I've only ever used <base href> once -- on the home page of one site, to help speed up recovery from a bad indexing problem I had accidentally created with a typo... :o

Jim

[edited by: jdMorgan at 11:40 am (utc) on Nov. 4, 2008]

nmjudy

12:51 pm on Nov 4, 2008 (gmt 0)

10+ Year Member



I've tried using a couple of different spider simulators to see if it was my link structure or sitemap generator.

Prior to making any linking changes, I linked to all my index.html pages. A thread in the Google forum said it was best to link to the root of folders(directories). So I made that change and created an .htaccess file to redirect.

Using document relative links within a site folder back to the root of the folder created the following link: href="./"

The simulators (and Google) appear to have a problem with ./

I've held a long-standing ranking in Google for several years, and on Sunday I noticed traffic cut in half because of what it thinks my linking structure is. I started these changes (using ./) about 3 weeks ago. Even though I had a sitemap, when it tries to follow the links internally it drops the folder name from the URL, so it obviously gets a 404.

The handful of folders where I changed the links to include the full path seem to be spidering fine in the simulators. I won't know whether it fixes the sitemap generator until all document-relative links are changed.

The frustrating thing is the site works perfectly in the browsers.
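For what it's worth, href="./" is a perfectly valid relative reference under RFC 3986, and a standards-compliant resolver agrees with the browsers (the page URL below is hypothetical):

```python
from urllib.parse import urljoin

# A link of href="./" on a page inside a folder should resolve to
# that folder's root -- which is exactly what browsers do
page = "http://www.example.com/somefolder/somepage.html"
print(urljoin(page, "./"))  # http://www.example.com/somefolder/
```

So any simulator that drops the folder name when following "./" is not resolving the reference correctly; the links themselves are fine.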

jdMorgan

3:24 pm on Nov 4, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I recommend the following forms:

Canonical home page link: <a href="http://www.example.com/">
Canonical page link: <a href="http://www.example.com/page.html">
Canonical subdirectory index link: <a href="http://www.example.com/subdir1/">
Canonical subdirectory page link: <a href="http://www.example.com/subdir1/page.html">

Server-relative home-page link: <a href="/">
Server-relative page link: <a href="/page.html">
Server-relative subdirectory index page link: <a href="/subdir1/">
Server-relative subdirectory page link: <a href="/subdir1/page.html">

Page-relative link: <a href="page-in-this-directory.html">
Page-relative link: <a href="../page-in-directory-above-this-subdirectory.html">
Page-relative link: <a href="../../page-in-directory-two-levels-above-this-subdirectory.html">
Page-relative link: <a href="subdir/page-in-subdirectory-below-this-directory.html">
Page-relative link: <a href="subdir1/subdir2/page-in-subdirectory-two-levels-below-this-directory.html">
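Each of the page-relative forms above can be sanity-checked with the same standard resolution algorithm (the current-page URL is hypothetical):

```python
from urllib.parse import urljoin

# The page the links appear on, two subdirectory levels deep
page = "http://www.example.com/subdir1/subdir2/current.html"

# Same directory
print(urljoin(page, "page-in-this-directory.html"))
# http://www.example.com/subdir1/subdir2/page-in-this-directory.html

# One level up
print(urljoin(page, "../page-in-directory-above.html"))
# http://www.example.com/subdir1/page-in-directory-above.html

# One level down
print(urljoin(page, "sub/page-below.html"))
# http://www.example.com/subdir1/subdir2/sub/page-below.html
```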

Also, check to be sure you don't have any mod_rewrite or mod_alias directives which are interfering, and be sure to completely flush your browser cache before testing any new code or newly-uploaded pages.

Jim