<a href="/page.htm">link</a>
<img src="images/image1.jpg" />
I could also use:
<a href="page.htm">link</a>
<img src="/images/image1.jpg" />
Technically, when the base href has a trailing slash, does it matter whether the path to another page or image begins with a forward slash? Either seems to work.
If the urls in a page's links are all root-relative (that is, they all begin with a slash), then using just the domain name as the base href on every page seems to work well. The trouble comes with completely relative urls on pages that live in a sub-directory.
Let's look at what happens if you have a document at this address:
http://www.example.com/directory/page1.htm
If a link on that page points to page2.htm then the intended full address of such a target page is:
http://www.example.com/directory/page2.htm
But if you set the base href to be http://www.example.com/, then you are telling the spider to go to:
http://www.example.com/page2.htm. See what happened? The directory name got dropped from the url, and that means that the link breaks!
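The breakage described above can be reproduced with Python's standard urljoin, which applies the generic relative-reference resolution rules. This is only a sketch of what the rules say should happen; it says nothing about how any particular spider is actually programmed.

```python
from urllib.parse import urljoin

page_url = "http://www.example.com/directory/page1.htm"
link = "page2.htm"

# No <base> element: the link resolves against the page's own address.
without_base = urljoin(page_url, link)

# <base href="http://www.example.com/">: the directory is dropped.
with_root_base = urljoin("http://www.example.com/", link)

print(without_base)
print(with_root_base)
```

The first result keeps /directory/ in the path and the second does not, which is the broken link.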
For this reason, the best practice for a base element is to make the href point to the full absolute url of the page itself. The W3C examples show this explicitly.
[w3.org...]
Second best would be pointing the base element's href attribute to the absolute url of the directory where the page lives, rather than just the domain root.
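For what it's worth, under the standard resolution rules both choices give the same result for a plain relative link, because everything after the last slash of the base is discarded before the link is appended. A quick check with Python's urljoin (a sketch of the rules, not of any crawler's behaviour):

```python
from urllib.parse import urljoin

link = "page2.htm"
page_base = "http://www.example.com/directory/page1.htm"  # base = the page itself
dir_base = "http://www.example.com/directory/"            # base = its directory

resolved_from_page = urljoin(page_base, link)
resolved_from_dir = urljoin(dir_base, link)

print(resolved_from_page)
print(resolved_from_dir)
```

Both print the same address inside /directory/, so the two forms differ only in how much room they leave for a buggy user agent.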
My question was really about whether, when the base href has a trailing slash, it matters if paths to other resources (whether they be pages, images, stylesheets, etc) begin with a forward slash or not.
Relating to this, if everything works fine in the browser, does it follow that everything must be working fine for search engine crawlers?
<a href="/page.htm">page</a>
... beginning with a forward slash, would the correct use of base href be:
<base href="http://www.domain.com" />
... with no trailing slash on the URL? Otherwise it seems to me that there would be two forward slashes in the path to page.htm. But what about an index page in a subfolder linking to another page:
<a href="/anotherfolder/anotherpage.htm">anotherpage</a>
from a page containing:
<base href="http://www.domain.com/folder/" /> (there has to be the trailing slash here)
... has the two forward slashes in the path.
However, theoretical perfection is one thing. Where there is scope for bugs, generally you should choose the safest option. Writing fault-tolerant code is always a good thing.
Kaled.
generally you should choose the safest option
That's what I want to do, but I still don't understand this. If <base href="http://www.domain.com/" /> (with a trailing slash) defines the starting point for all linked resources, and an example resource is <a href="/page.htm">page</a> (with a leading forward slash), why does this not lead to two forward slashes in the path? For example:
ht*p://www.domain.com//page.htm (one added to the other)
If the internal links have a leading forward slash, should the base href be <base href="http://www.domain.com" /> (with no trailing slash)?
And if a user clicking a link from a page opens the correct new page in their browser (no matter which way the link is coded), why doesn't it follow that a search engine crawler would do the same?
And if a user clicking a link from a page opens the correct new page in their browser (no matter which way the link is coded), why doesn't it follow that a search engine crawler would do the same?
The <base href> should always be split into two parts by user agents at the first / after the domain name. Consequently, if all links begin with / then what follows in the <base href> should be ignored. For instance, if you used <base href="ht*p://www.domain.com/////"> everything should still work correctly if all links begin with /.
However, there is a difference between "should work correctly" and "will work correctly".
Kaled.
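To illustrate the "should work" half of that: a link beginning with / replaces the whole path of the base, so the two are never simply glued together and no double slash appears. The sketch below uses Python's urljoin as a stand-in for a standards-following user agent; the base values are the ones discussed in this thread.

```python
from urllib.parse import urljoin

bases = [
    "http://www.domain.com",          # no trailing slash
    "http://www.domain.com/",         # trailing slash
    "http://www.domain.com/folder/",  # subfolder base
    "http://www.domain.com/////",     # deliberately odd base
]

# Resolve the same root-relative link against every base.
results = {base: urljoin(base, "/page.htm") for base in bases}

for base, resolved in results.items():
    print(base, "->", resolved)
```

Every base yields the same address with a single slash before page.htm. Whether a given crawler follows the same rules is a separate question.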
If your server is using user agent delivery, then by spoofing you will get the googlebot version of the code from the server. But you will still be seeing Firefox's programming in the way that code is actually executed. Only googlebot has googlebot's programming.
<added>
I see kaled beat me - but we used different language so I'll let my post stand.
So any page in the root would have:
<base href="http://www.domain.com/">
and any page in a subdirectory would have:
<base href="http://www.domain.com/subdirectory/">
... not:
<base href="http://www.domain.com/page.htm">
<base href="http://www.domain.com/subdirectory/page.htm">
Does this seem correct? Because a relative link to:
<a href="subdirectory/page2.htm">page 2</a> or <a href="page2.htm">page 2</a> (from ht*p://www.domain.com/subdirectory/page1.htm)
should be resolved from:
ht*p://www.domain.com/subdirectory/
[edited by: Patrick_Taylor at 9:49 pm (utc) on Dec. 19, 2006]
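One thing worth checking in that example is that the two hrefs are not interchangeable once the base is the subdirectory. A small sketch with Python's urljoin, assuming the standard resolution rules apply:

```python
from urllib.parse import urljoin

base = "http://www.domain.com/subdirectory/"

# href="page2.htm" stays inside the subdirectory.
same_dir = urljoin(base, "page2.htm")

# href="subdirectory/page2.htm" is appended to the base path,
# so the folder name appears twice.
nested = urljoin(base, "subdirectory/page2.htm")

print(same_dir)
print(nested)
```

Only the first form reaches page2.htm next to page1.htm; the second points one level deeper.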
As I said above, the best method is to always include the index.html part - this leaves no room for ambiguity.
Unless you're concerned about page-hijacking, there is usually no point using a <base> tag.
Kaled.
I had (and still have) similar problems grasping NULL and NOT NULL in the context of MySQL, but that's another subject.
Kaled.
a search engine crawler behaves differently to a normal browser
That's a critical insight that many webmasters don't get, Patrick. Certainly spiders want to detect what's happening to human visitors, but the first and biggest difference of all is this -- spiders don't need a visual representation of a page. So from the ground up, they can be coded quite differently.
Since even two different visual browsers can behave differently, and they are trying to do the exact same thing, we cannot afford to assume anything about search engine bots. Not only that, but I'll bet that spiders get code updates a lot more frequently than at least some of our visual browsers.
Hence our reliance on standards makes the most sense. That's the glue that can hold this whole new-fangled web thing together.