Apache Web Server Forum

    
document relative? root relative? absolute?
nmjudy




msg:3737761
 10:00 pm on Sep 3, 2008 (gmt 0)

I have a 7000+ page static site on a dedicated server. I recently figured out (thanks to this board) how to add lines to my .htaccess file that redirect every index.html page (the pages use SSIs) to the root of its directory or subdirectory, to avoid the Google duplicate content penalty.

Prior to posting the new .htaccess file "live", I want my internal linking to match. Currently, all internal links point to the index.html pages, so I know this has to change.

My question is: which would be better to use in my internal linking structure? Should I use root-relative links? Absolute links? Document-relative links like the rest of my site? How do you set up a document-relative link when it's an index page in the same folder? Would it be ./ ?

The bigger question is, if I have some of each on a page (absolute, document relative, and root relative) - does Google even know?

 

jdMorgan




msg:3737877
 12:09 am on Sep 4, 2008 (gmt 0)

Well, Google does "know," since it is the client (e.g. browser or robot) that resolves relative links. But Google doesn't really care if you mix-and-match them, since all of these forms are perfectly valid.

This often boils down to whether you review and test your site on your local non-server PC. If you click on a page-relative link to a local file, then your browser can find it if it's a named file. But the browser will show you a directory index (instead of the index file) when you request the relative link "./" because it doesn't know about DirectoryIndex.

If you click on a server-relative link, then since the OS doesn't know where DocumentRoot is, it can't find the file. On a PC, for example, you end up looking for the page in "file:///" (which resolves to C:\) instead of in your development directory.

And if you use canonical links (absolute URLs that include the domain), then any link you click on your development PC will take you to the page on your real server. So, it depends on which of these functional problems is the least bothersome to you.
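To make the three cases above concrete, here is a minimal Python sketch using the standard library's urljoin, which approximates the relative-reference resolution a browser performs. The local path C:/dev/site is a hypothetical development directory, and www.example.com stands in for the real server.

```python
from urllib.parse import urljoin

# Hypothetical local copy of a page, opened straight from disk (no web server).
local_page = "file:///C:/dev/site/Directory/page1.html"

# One example of each link style discussed above.
links = {
    "page-relative": "otherpage.html",
    "page-relative index": "./",
    "server-relative": "/images/logo.gif",
    "canonical": "http://www.example.com/images/logo.gif",
}

# Resolve every href against the local page, as the client would.
resolved = {name: urljoin(local_page, href) for name, href in links.items()}

for name, url in resolved.items():
    print(f"{name:20} -> {url}")
```

The page-relative links stay inside the development directory, the server-relative link lands at the top of the file system (file:///images/logo.gif), and the canonical link leaves the PC entirely.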

Personally, I use page-relative links (e.g. href="otherpage.html" or href="../images/logo.gif") for most page-to-page and included-object linking, server-relative for links on error pages (href="/images/logo.gif") and for links to the index page (href="/"). I occasionally use canonical links on pages that are popular with scrapers, just so they have to make the effort to edit them to keep their stolen visitors on their own spammy sites.

Jim

nmjudy




msg:3737898
 12:51 am on Sep 4, 2008 (gmt 0)

For years, I've always used the same kind of linking structure that you prefer. However, what I'm trying to do is to eliminate direct linking to any page named "index.html."

Say this is my file structure...
Root

Directory

DirectoryStuff1
    page1.html
    page2.html
    index.html

DirectoryStuff2
    page1.html
    page2.html
    index.html

index.html

How would I do a page-relative link from

DirectoryStuff1/page1.html to
DirectoryStuff1/index.html

without using the page name "index.html"? To avoid having the directory index shown, is my only choice then to use root-relative links in this case?

jdMorgan




msg:3737903
 1:10 am on Sep 4, 2008 (gmt 0)

Use
<a href="./">Stuff</a>
-or-
<a href="../Directory1/">Stuff</a>
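
As a rough check of how a client resolves these, here is a small Python sketch using the standard library's urljoin. The host www.example.com and the full path to page1.html are assumptions based on the file structure described above; DirectoryStuff2 is used as the sibling directory.

```python
from urllib.parse import urljoin

# Assumed URL of DirectoryStuff1/page1.html on the live server.
page = "http://www.example.com/Directory/DirectoryStuff1/page1.html"

# "./" points at the directory the current page lives in.
same_dir_index = urljoin(page, "./")

# "../Name/" climbs one level, then descends into a sibling directory.
sibling_index = urljoin(page, "../DirectoryStuff2/")

# "/" is the server-relative link to the site's home page.
site_root = urljoin(page, "/")

print(same_dir_index)
print(sibling_index)
print(site_root)
```

None of the resolved URLs contains "index.html"; the server's DirectoryIndex setting decides which file is served for each directory URL.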

Jim

nmjudy




msg:3781384
 5:01 pm on Nov 6, 2008 (gmt 0)

I made the change to all my internal links 2 months ago to:
<a href="./">Stuff</a>
-or-
<a href="../Directory1/">Stuff</a>

Everything worked great in the browser. I was able to generate my XML sitemaps without any problem. I tested in the Lynx Viewer and paths appeared to be picked up correctly. I also tested with the Live HTTP Headers extension for Firefox, and the redirects seemed to work perfectly. Traffic in October peaked.

Then November 2 hit and I watched my live stats get cut in half. I've read through the November penalty thread, and I think my situation is a bit unique.

According to the Web crawl "Not Found" errors in Google Webmaster Tools, Googlebot appears to be having trouble going from one directory on my site to another (using document-relative URLs), where I don't think it had that problem before. The document-relative paths are correct on the page and work perfectly in the browser.

For giggles, I tried a spider simulator and got totally different results from the Lynx Viewer. The spider simulator was also having a problem with document-relative paths.

I've been frantically going through each section of my site changing links to root relative. Googlebot is also looking for index.htm files that it says my own pages link to - but I've never used the .htm extension. Could something in my .htaccess file be causing the problem?

# Parse .html and .inc files for server-side includes
AddHandler server-parsed .html .inc
#
#
# Set up to enable mod_rewrite
Options +FollowSymlinks +Includes All -Indexes
RewriteEngine on
#
#
# Redirect requests for index.html in any directory to "/" in the same directory
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(.+/)?index\.html\ HTTP
RewriteRule ^(.+/)?index\.html$ http://www.example.com/$1 [R=301,L]
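
To illustrate what the last two lines do, here is a rough Python simulation. It is only a sketch: the function name simulate and its arguments are hypothetical, Python's re module stands in for the PCRE engine mod_rewrite uses, and the escaped spaces in the .htaccess patterns are written as plain spaces.

```python
import re

# Python translations of the RewriteCond and RewriteRule patterns above.
THE_REQUEST_COND = re.compile(r"^[A-Z]{3,9} /(.+/)?index\.html HTTP")
RULE_PATTERN = re.compile(r"^(.+/)?index\.html$")
TARGET_PREFIX = "http://www.example.com/"


def simulate(request_line, internal_path):
    """Return the 301 target, or None if the rule would not fire.

    request_line  -- the line the client literally sent (THE_REQUEST)
    internal_path -- the path tested by a RewriteRule in a root .htaccess,
                     which has no leading slash
    mod_rewrite checks the rule pattern before the condition; the order
    is swapped here for readability and the outcome is the same.
    """
    if not THE_REQUEST_COND.search(request_line):
        return None
    match = RULE_PATTERN.search(internal_path)
    if match is None:
        return None
    # $1 is empty when index.html sits in the site root.
    return TARGET_PREFIX + (match.group(1) or "")


# A client asks for index.html by name: redirected to the directory URL.
direct = simulate("GET /Directory/index.html HTTP/1.1", "Directory/index.html")
# The same for the home page.
home = simulate("GET /index.html HTTP/1.1", "index.html")
# A client asks for "/Directory/" and the server maps it to index.html
# internally: THE_REQUEST does not match, so there is no redirect loop.
internal = simulate("GET /Directory/ HTTP/1.1", "Directory/index.html")
# A request for index.htm is not touched by this rule at all.
htm = simulate("GET /Directory/index.htm HTTP/1.1", "Directory/index.htm")

print(direct, home, internal, htm)
```

Under these assumptions, only requests that name index.html explicitly are redirected, and nothing in the rule generates or matches an index.htm URL.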

jdMorgan




msg:3781398
 5:28 pm on Nov 6, 2008 (gmt 0)

There's nothing wrong with your code.

Be aware that the 404 errors in Google Webmaster Tools (GWT) can be weeks or even months old. Also, GWT has been quite buggy recently, even reporting 404s on pages which are in fact present.

As for the spider simulator, be aware that it is a simulator, and does not use the actual Google spider code (big mistake, Google). Therefore, it can't really be trusted.

In your situation, I would trust your browser and Lynx much more than any search engine's simulator.
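
If you want an independent check of whether any page really links to index.htm or index.html, one option is to scan the saved HTML yourself. The following is a minimal sketch using only Python's standard library; the names HrefCollector and find_index_links, the sample markup, and the page URL are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def find_index_links(page_url, html):
    """Return (href, resolved URL) pairs that end in index.htm or index.html."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    flagged = []
    for href in collector.hrefs:
        resolved = urljoin(page_url, href)
        # Last path segment of the resolved URL; empty for directory URLs.
        filename = urlsplit(resolved).path.rsplit("/", 1)[-1]
        if filename in ("index.htm", "index.html"):
            flagged.append((href, resolved))
    return flagged


# Hypothetical page content with two directory links and one old-style link.
sample = """
<a href="./">Stuff</a>
<a href="../DirectoryStuff2/">More stuff</a>
<a href="../DirectoryStuff2/index.html">Old-style link</a>
"""
page = "http://www.example.com/Directory/DirectoryStuff1/page1.html"
flagged = find_index_links(page, sample)
print(flagged)
```

Here only the old-style link is flagged. An empty result across the whole site would suggest the index.htm requests are coming from somewhere other than your own pages.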

Jim

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved