base href question about forward slash on paths

Patrick Taylor

9:25 pm on Dec 15, 2006 (gmt 0)

With <base href="http://www.domain.com/" /> in the document head, is there any reason why I shouldn't mix my paths from this document? Examples:

I could also use:
<a href="page.htm">link</a>
<img src="/images/image1.jpg" />

Technically, with a trailing slash on the base href does it matter whether or not there is a forward slash at the beginning of the path to another page or image? Either seems to work.

tedster

1:21 am on Dec 16, 2006 (gmt 0)

Potential for trouble comes up when the page is in an interior directory. I have definitely seen sites make errors here that generate spidering issues.

If the urls in a page's links are all root-relative [that is, they all begin with a slash] then plugging in just the domain name on every page for a base href seems to work well. The trouble comes with completely relative urls that occur within a sub-directory.

Let's look at what happens if you have a document at this address:
http://www.example.com/directory/page1.htm

If a link on that page points to page2.htm then the intended full address of such a target page is:
http://www.example.com/directory/page2.htm

But if you set the base href to be http://www.example.com/, then you are telling the spider to go to:
http://www.example.com/page2.htm

See what happened? The directory name got dropped from the url, and that means that the link breaks!
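For anyone who wants to check this, here is a minimal sketch using Python's standard urllib.parse.urljoin, which applies the standard relative-reference rules. It only shows what the rules say; it says nothing about how any particular spider is actually coded. The variable names are just for illustration.

```python
from urllib.parse import urljoin

page_url = "http://www.example.com/directory/page1.htm"
root_base = "http://www.example.com/"
link = "page2.htm"

# No base element: the link resolves against the page's own url.
without_base = urljoin(page_url, link)

# Base href set to the domain root: the directory name is lost.
with_root_base = urljoin(root_base, link)

print(without_base)    # http://www.example.com/directory/page2.htm
print(with_root_base)  # http://www.example.com/page2.htm
```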

For this reason, the best practice for a base element is to make the href point to the full absolute url of the page itself. The W3C examples show this explicitly.

[w3.org...]

Second best would be pointing the base element's href attribute to the absolute url of the directory where the page lives, rather than just the domain root.
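Under the standard resolution rules, whatever follows the last slash in the base path is discarded before a relative link is appended, so both choices should give the same result for relative links. A small sketch with Python's urllib.parse.urljoin (the variable names are illustrative, and a given crawler may or may not follow the rules this closely):

```python
from urllib.parse import urljoin

link = "page2.htm"
page_base = "http://www.example.com/directory/page1.htm"
directory_base = "http://www.example.com/directory/"

# The file name in the base is dropped, leaving the directory in both cases.
from_page = urljoin(page_base, link)
from_directory = urljoin(directory_base, link)

print(from_page)                    # http://www.example.com/directory/page2.htm
print(from_page == from_directory)  # True
```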

Patrick Taylor

10:39 am on Dec 16, 2006 (gmt 0)

Thanks for that.

My question was really about whether, when the base href has a trailing slash, it matters if paths to other resources (whether they be pages, images, stylesheets, etc) begin with a forward slash or not.

Relating to this, if everything works fine in the browser, does it follow that everything must be working fine for search engine crawlers?

kaled

1:11 am on Dec 17, 2006 (gmt 0)

Relating to this, if everything works fine in the browser, does it follow that everything must be working fine for search engine crawlers?

NO!

There is a safe and simple solution - include the page in the base url thus:-
<base href="http://www.domain.com/index.html" />

Kaled.

tedster

2:04 am on Dec 17, 2006 (gmt 0)

I completely agree, kaled.

I don't know if either browsers or spiders have any error routines around the base element -- but there's no reason to risk anything. And assuming that different user agents are working the same way in any area of code is a dangerous assumption.

Patrick Taylor

11:19 am on Dec 17, 2006 (gmt 0)

include the page in the base url thus

Well... I am - actually <base href="http://www.domain.com/" /> with a trailing slash on the URL - but I'd still like to understand if it makes any difference whether or not I use a forward slash at the start of the path to other resources.

Patrick Taylor

12:47 pm on Dec 17, 2006 (gmt 0)

To put it another way, if the internal links are written:

... beginning with a forward slash, would the correct use of base href be:

... with no trailing slash on the URL? Otherwise it seems to me that there would be two forward slashes in the path to page.htm. But what about an index page in a subfolder linking to another page:

<a href="/anotherfolder/anotherpage.htm">anotherpage</a>

from a page containing:

<base href="http://www.domain.com/folder/" /> (there has to be the trailing slash here)

... has the two forward slashes in the path.

kaled

4:03 am on Dec 18, 2006 (gmt 0)

In order for this to be ok
<a href="/page.htm">page</a>
The user agent simply needs to be able to determine the protocol and domain parts of the url - this should not be a problem. The remainder of the <base href> is irrelevant.

However, theoretical perfection is one thing; where there is scope for bugs, generally you should choose the safest option. Writing fault-tolerant code is always a good thing.

Kaled.

Patrick Taylor

11:26 am on Dec 18, 2006 (gmt 0)

generally you should choose the safest option

That's what I want to do, but I still don't understand this. If <base href="http://www.domain.com/" /> (with a trailing slash) defines the starting point for all linked resources and an example resource is <a href="/page.htm">page</a> (with a leading forward slash) why does this not lead to the presence of two forward slashes in the path? eg:

ht*p://www.domain.com//page.htm (one added to the other)

If the internal links have a leading forward slash, should the base href be <base href="http://www.domain.com" /> (with no trailing slash)?
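On the double-slash worry: under the standard rules, resolving a link is not simple string concatenation. A link that begins with a slash replaces the whole path of the base, so no second slash should appear. A minimal sketch with Python's urllib.parse.urljoin, assuming a user agent that follows those rules (the list of cases is made up from the examples in this thread):

```python
from urllib.parse import urljoin

# (base href, link) pairs taken from the examples discussed above.
cases = [
    ("http://www.domain.com/", "/page.htm"),
    ("http://www.domain.com", "/page.htm"),
    ("http://www.domain.com/folder/", "/anotherfolder/anotherpage.htm"),
]

resolved = [urljoin(base, link) for base, link in cases]

for (base, link), result in zip(cases, resolved):
    print(f"{base} + {link} -> {result}")

# The first two cases both give http://www.domain.com/page.htm,
# the third gives http://www.domain.com/anotherfolder/anotherpage.htm
```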

And if a user clicking a link from a page opens the correct new page in their browser (no matter which way the link is coded) why doesn't it follow that a search engine crawler will do the same?

kaled

2:53 pm on Dec 18, 2006 (gmt 0)

And if a user clicking a link from a page opens the correct new page in their browser (no matter which way the link is coded) why doesn't it follow that a search engine crawler will do the same?

Different programmers, often writing in different languages, may write very different code and consequently, that code may behave differently at the margins.

The <base href> should always be split into two parts by user agents at the first / after the domain name. Consequently, if all links begin with / then what follows in the <base href> should be ignored. For instance, if you used <base href="ht*p://www.domain.com/////"> everything should still work correctly if all links begin with /.
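The split described here can be sketched in a few lines of Python. The helper resolve_root_relative is hypothetical, written only to illustrate the idea, and the real http scheme is used in place of the obfuscated ht*p. The standard library's urljoin is printed alongside for comparison.

```python
from urllib.parse import urljoin, urlsplit


def resolve_root_relative(base_href: str, link: str) -> str:
    """Resolve a link beginning with '/' using only the scheme and host of the base."""
    if not link.startswith("/") or link.startswith("//"):
        raise ValueError("expected a root-relative link such as /page.htm")
    parts = urlsplit(base_href)
    # Everything after the host in the base href is ignored.
    return f"{parts.scheme}://{parts.netloc}{link}"


base = "http://www.domain.com/////"
by_hand = resolve_root_relative(base, "/page.htm")
by_library = urljoin(base, "/page.htm")

print(by_hand)     # http://www.domain.com/page.htm
print(by_library)  # http://www.domain.com/page.htm
```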

However, there is a difference between "should work correctly" and "will work correctly".

Kaled.

Patrick Taylor

5:05 pm on Dec 18, 2006 (gmt 0)

Kaled, thanks.

In Firefox I can use Tools -> User Agent Switcher and surf as Googlebot. Presumably that would verify whether a base href and link configuration is causing any crawling problems.

Patrick

kaled

5:55 pm on Dec 18, 2006 (gmt 0)

NO.

Switching the user agent in no way emulates the behaviour of the Google spiders. The user-agent string is just an item of data that is typically sent with each http request. Some sites may deliver content based partially on that data.
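To illustrate that the user-agent string is only data, here is a sketch that builds (but never sends) a request carrying a Googlebot-style string with Python's urllib.request. Nothing about how the response would be parsed or how links would be resolved changes. The url and variable names are illustrative.

```python
from urllib.request import Request

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

# Build a request that claims to be Googlebot. It is not sent anywhere.
request = Request("http://www.example.com/", headers={"User-Agent": GOOGLEBOT_UA})

# urllib stores header names in capitalized form, hence "User-agent".
claimed_agent = request.get_header("User-agent")
print(claimed_agent)
```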

Kaled.

tedster

5:58 pm on Dec 18, 2006 (gmt 0)

No, that won't prove it, Patrick.

If your server is using user agent delivery, then by spoofing you will get the googlebot version of the code from the server. But you will still be seeing Firefox's programming in the way that code is actually executed. Only googlebot has googlebot's programming.

<added>
I see kaled beat me - but we used different language so I'll let my post stand.

Patrick Taylor

8:01 pm on Dec 18, 2006 (gmt 0)

Okay... I can't test anything! But I can rely on good advice - thanks. I still don't really understand this exactly, but from what Kaled said, I'm assuming I'll be fine with:

together with

Patrick Taylor

9:41 pm on Dec 19, 2006 (gmt 0)

Sorry, but I'm still not clear on this. In another WW thread on the subject I read that "the base href element simply lets the search engines know the absolute URL for resolving any relative links." This suggests that the base href for a particular page in a particular directory should specify not the page itself but only the directory in which the page exists.

So any page in the root would have:

and any page in a subdirectory would have:

... not:

Does this seem correct? Because a relative link to:

<a href="subdirectory/page2.htm">page 2</a> or <a href="page2.htm">page 2</a> (from ht*p://www.domain.com/subdirectory/page1.htm)

should be resolved from:

ht*p://www.domain.com/subdirectory/
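For reference, this is how the two link forms above resolve against a directory base under the standard rules, sketched with Python's urllib.parse.urljoin (real http scheme used in place of ht*p; variable names are illustrative). It also shows why the trailing slash on a directory base matters.

```python
from urllib.parse import urljoin

directory_base = "http://www.domain.com/subdirectory/"

plain = urljoin(directory_base, "page2.htm")
prefixed = urljoin(directory_base, "subdirectory/page2.htm")

# Without the trailing slash, "subdirectory" is treated as a file name and dropped.
no_slash = urljoin("http://www.domain.com/subdirectory", "page2.htm")

print(plain)     # http://www.domain.com/subdirectory/page2.htm
print(prefixed)  # http://www.domain.com/subdirectory/subdirectory/page2.htm
print(no_slash)  # http://www.domain.com/page2.htm
```

Only the plain page2.htm form lands on the intended page when the base is the subdirectory.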

[edited by: Patrick_Taylor at 9:49 pm (utc) on Dec. 19, 2006]

kaled

10:45 pm on Dec 19, 2006 (gmt 0)

The base tag represents the true url of a page. For instance, take a look at how it is used by Google in cached pages.

As I said above, the best method is to always include the index.html part - this leaves no room for ambiguity.

Unless you're concerned about page-hijacking, there is usually no point using a <base> tag.

Kaled.

Patrick Taylor

10:56 pm on Dec 19, 2006 (gmt 0)

Kaled, I'm really just interested in understanding the logic of base href. I've noticed Google's use of it for cached pages, and as you say, they use the full URL as required by W3C. It still doesn't quite make sense (to me) to include the full URL when the 'base' directory seems to be what counts - not the actual page.

I had (and still have) similar problems grasping NULL and NOT NULL in the context of MySQL, but that's another subject.

kaled

3:59 am on Dec 20, 2006 (gmt 0)

Provided the <base href> contains, as a minimum, a trailing / on index pages, all should be ok. However, I would strongly recommend against omitting the page.html part for any url that is not an index page - browsers might be ok but search engines could become seriously confused (mistaking other content pages for the index page).

Kaled.

Patrick Taylor

11:21 am on Dec 20, 2006 (gmt 0)

The most unexpected aspect of all this is that a search engine crawler behaves differently to a normal browser. I always thought that Googlebot etc travelled the same route as humans.

Thanks for all.

Patrick

tedster

1:00 pm on Dec 20, 2006 (gmt 0)

a search engine crawler behaves differently to a normal browser

That's a critical insight that many webmasters don't get, Patrick. Certainly spiders want to detect what's happening to human visitors, but the first and biggest difference of all is this -- spiders don't need a visual representation of a page. So from the ground up, they can be coded quite differently.

Since even two different visual browsers can behave differently, and they are trying to do the exact same thing, we cannot afford to assume anything about search engine bots. Not only that, but I'll bet that spiders get code updates a lot more frequently than at least some of our visual browsers.

Hence our reliance on standards makes the most sense. That's the glue that can hold this whole new-fangled web thing together.