Forum Moderators: open
1. Have all links on the page like this:
href="/page.html"
href="/folder/page.html"
href="/images/image.gif"
with base tag in the header:
<head>
<base href="http://www.example.com">
</head>
2. Have all links on the page like this:
href="../page.html"
href="../folder/page.html"
href="../images/image.gif"
with no base tag.
3. Have all links on the page like this:
href="http://www.example.com/page.html"
href="http://www.example.com/folder/page.html"
href="http://www.example.com/images/image.gif"
Obviously No.1 would be best for file size as some pages have a lot of links/images and the code can get huge just on domain names alone. But we are trying to find out what errors can arise from search engine spidering and browser bugs for each and which is best overall.
Any help would be appreciated.
Thanks
The base element is intended to be the reference from which relative links in the document are calculated. Therefore, as best practice, the url in the base element should be the full, intended url of the document itself. At a minimum, the base href needs to include the path through the last subdirectory, but as I said, ideally it contains the full, absolute url of the page it appears on.
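To see that resolution in action, here's a minimal sketch using Python's standard urllib.parse.urljoin, which performs this merge (recent Python 3 follows RFC 3986; the example URLs are the ones from this thread):

from urllib.parse import urljoin

# Base = the full, absolute url of the document itself, as recommended above
base = "http://www.example.com/folder/page.html"

print(urljoin(base, "../page.html"))         # http://www.example.com/page.html
print(urljoin(base, "../images/image.gif"))  # http://www.example.com/images/image.gif
print(urljoin(base, "/page.html"))           # http://www.example.com/page.html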
Most of what I do is option 3. When people save things to their local system, I've caught quite a few who forgot to take out the URI references before posting their work. Lots of copycats out there. ;)
The base element is intended to be the reference from which relative links in the document are calculated.
Please go on. It seems as though I may have misunderstood the base tag. Does the base tag have anything to do with the links on the page?
And do you mean that on the page:
http://www.example.com/folder/page.html
you should have the base tag:
<base href="http://www.example.com/folder/page.html">
And do you mean that on the page:
http://www.example.com/folder/page.html
you should have the base tag:
<base href="http://www.example.com/folder/page.html">
Yes, that's it exactly. See Path information: the BASE element [w3.org] at the W3C website.
The base element has everything to do with how relative links in the document are understood by a user agent. (I avoid using the word "page" because it isn't technically exact - consider iframes, for example. So I would rather use words such as "url" or "document" instead of "page".)
2. Have all links on the page like this:
href="../page.html"
href="../folder/page.html"
href="../images/image.gif"
with base tag:
<base href="http://www.example.com/folder/page.html">
So if that's the standard why do no sites use it? And is that the standard that crawlers and browsers adhere to?
[ietf.org...]
I don't think there are many who use the base element because it can be somewhat confusing. Even after reading the spec many times, I still find some things a little confusing, which is why I don't use it.
I use a combination of 1 and 3.
I should have stated that I use a combination of 1 and 3 without the base element.
Check the head of Google's cached pages -- they also state paths for the base element this way.
The RFC that pageoneresults linked to ends this way:
The term "relative URL" implies that there exists some absolute "base
URL" against which the relative reference is applied. Indeed, the
base URL is necessary to define the semantics of any embedded
relative URLs; without it, a relative reference is meaningless. In
order for relative URLs to be usable within a document, the base URL
of that document must be known to the parser.
The main point being that if you use relative urls and don't explicitly state the base url, then search engines will assume that the url they requested actually IS the base. The base tag gives you a chance to change their mind!
With approach #3, you're not using relative urls, so the base element is superfluous. But with approach #1, you're still open to canonical troubles via "www" and "https:", as examples.
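To make the canonical point concrete, a quick sketch (the hostname variants here are hypothetical): with approach #1 and no base element, the resolved links simply inherit whatever scheme and host the crawler happened to request:

from urllib.parse import urljoin

link = "/images/image.gif"  # root-relative, approach #1

# With no base element, the base is whatever url was actually requested:
print(urljoin("http://example.com/page.html", link))
# -> http://example.com/images/image.gif  (non-www)
print(urljoin("https://www.example.com/page.html", link))
# -> https://www.example.com/images/image.gif  (www + https)
# Same document, two different link trees - unless a base tag or a
# redirect pins everything to one canonical host.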
I use option 1, usually without a base element (but using mod_rewrite to sort out www/non-www confusion). A site-wide base element is useful if you think your page is going to be ripped, as the internal links on a copy will resolve back to your original site.
But with approach #1, you're still open to canonical troubles via "www" and "https:", as examples.
But if using the base tag is what you are "supposed" to do, then surely all search engine crawlers should be set up to recognise this format. So using a base tag "SHOULD" be the same as using absolutes throughout the page, right?
I know of several sites that have sorted their "canonical problems" with Google by using the base tag as described in that link.
That is what I would expect, but as you can tell, not everyone is in agreement, which is why our office debate is getting us nowhere. Both docs that I have seen now lead me to believe that using:
<a href="../page.html"
with base tag:
<base href="http://www.example.com/folder/page.html"
is the correct way, and I have to assume that the crawler and browser developers have read those exact same documents. So why do people still disagree?
A site-wide base element is useful if you think your page is going to be ripped
I would have thought a ripper would simply change the base tag in the code automatically, though.
<head>
<base href="http://www.example.com/">
</head>
href="page.html"
href="folder/page.html"
href="images/image.gif"
(slightly different from the OP's #1).
Interesting note: the W3C page with information on the BASE tag (http://www.w3.org/TR/html4/struct/links.html) does NOT HAVE A BASE TAG!
The following (supposedly correct) method seems like double handling to me - "go to this folder, then jump back one and grab the doc from the next folder below you where you started from"
<base href="http://www.example.com/page1.html">
href="../page.html"
href="../folder/page.html"
href="../images/image.gif"
<a href="../page.html"
for all internal links on the page with base tag:
<base href="http://www.example.com/folder/page.html"
in the head tags. I've just ran a linkchecker on one of the pages and it says that it checked the links:
http://www.example.com/folder/../page.html
http://www.example.com/folder/../page2.html
http://www.example.com/folder/../page3.html
If you go to the URLs shown (including the dots) the page shows up fine. But if I use WW Sim Crawler it shows the links on the page to be
http://www.example.com/folder/page.html
http://www.example.com/folder/page2.html
http://www.example.com/folder/page3.html
So now I'm really, really confused. What on earth is going on and why isn't there a standard format? What will search engine crawlers such as Google/Yahoo see when they index the links on my page?
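If it helps untangle this: per RFC 3986 the resolution is mechanical, and a sketch with Python's urljoin (modern Python 3 follows the RFC) shows what a conforming client should produce:

from urllib.parse import urljoin

base = "http://www.example.com/folder/page.html"  # the base tag's href
print(urljoin(base, "../page.html"))
# -> http://www.example.com/page.html
# The linkchecker's "http://www.example.com/folder/../page.html" is the
# same address before dot-segment removal; servers treat the two alike.
# The sim crawler's "http://www.example.com/folder/page.html" is a
# *different* url, which suggests it mishandled the "../" or the base tag.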
Have you been to Webmonkey or About?
I've been all over the place, which is why I have this question. Some things are so clear cut, such as meta tags, validating code, robots.txt, etc. Why is this such a ridiculously debated subject? In this thread everyone has pitched in with "what they do", but still there is no answer to "what is right".
Obviously, unless we can all agree, we are always going to run into problems. Why is that the case? With crawling such a big, big part of all major search engines, why don't they list on their webmaster help pages EXACTLY what they expect to see?
The base element is just another variable for the UA to consider. Your tests are pretty conclusive - relative URLs can confuse spiders. The vast majority of sites link from the root URL, i.e.:
<a href="/directory/page.html">link</a>
You may include a base element, but it is not vital. This approach is the best for simplicity and consistency, with a slight decrease in portability.
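A quick sketch of why root-relative links are so consistent: they resolve identically from any directory depth (example URLs hypothetical, using Python's urljoin again):

from urllib.parse import urljoin

link = "/directory/page.html"  # root-relative, as above

for base in ("http://www.example.com/index.html",
             "http://www.example.com/folder/page.html",
             "http://www.example.com/folder/sub/deep.html"):
    print(urljoin(base, link))
# All three print http://www.example.com/directory/page.html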
There is no "right" way,
Why not? Aren't there several authoritative bodies that issue "rules"? Isn't this a big one that could use a single rule?
Your tests are pretty conclusive - relative URLs can confuse spiders.
Because there is no rule. If we had a rule, we wouldn't have the problem. Instead of threads discussing what we would like to see in HTML 5, how about we kick up a fuss until someone finally puts this to rest. IT IS A HUGE PROBLEM!
It is also useful when dealing with a file that is included in documents in nested folders and that itself calls other files. Let's say you make a header and put it in the root for all your pages to use. But you have some pages in a folder, and others in a subfolder. To make sure the header always calls the right files, without having to keep copies of it in each folder, simply add enough dots to reference the furthest subfolder used, e.g.:
2 folders deep = ../../file.txt
I found that if the header is called from the root, it will ignore the dots (as you can't go back beyond the domain itself) and include the file correctly. If it is called from a folder, it will still find the file. It may be a hack, but it works!
The only drawback I have found is when testing the files locally. Because my local server is within subfolders, the links break. Oh well.
Otherwise I have to make copies of the header for each folder, with the links changed to suit, and update each one when I make any changes. My way, you only need the one header file.
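The "ignore the extra dots" behaviour described above is, at least for URLs, actually specified in RFC 3986: dot segments that would climb above the root are simply discarded. A quick check with Python's urljoin (modern Python 3 follows the RFC):

from urllib.parse import urljoin

link = "../../file.txt"  # enough dots for the deepest subfolder

print(urljoin("http://www.example.com/folder/sub/page.html", link))
# -> http://www.example.com/file.txt  (climbs two folders, as intended)
print(urljoin("http://www.example.com/page.html", link))
# -> http://www.example.com/file.txt  (already at root; extra ".." discarded)

Note this applies to URL resolution; if the header is pulled in via a filesystem path rather than a URL, the extra dots may not be forgiven the same way, which could be why it breaks on a local setup.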
There is no "right" way,Why not?
... because in real life, sometimes "There's more than one way to do it." (TMTOWTDI, usually pronounced 'Tim Toady'), quoting one of the mottos of Perl here.
As Encyclo already said, there is no "right" way, but some ways may be harder (and thus more error-prone) than others.
It's your choice.
Because there is no rule. If we had a rule, we wouldn't have the problem.
Rulemaking does not solve third-party implementation problems per se. I have no hope that SE-spider programmers who can't get simple rules right (like handling a simple 301 or 404) would not screw up this one, too ...
Kind regards,
R.
There is no "right" way,
Why not?
... because in real life, sometimes "There's more than one way to do it."
Are you serious?! You're going to take the "this is real life" stance? You think I'm so out of touch with reality that I don't understand why there is a clear robots.txt rule but no URL rule? You can't be serious, surely? No-one could possibly reply "this is real life" to a topic that CAN have a clear-cut rule unless they were some complete ... well, TOS prohibits ...
This is computer programming, there can always be a rule; this isn't lawyer ethics counselling.
This is computer programming, there can always be a rule,
Yes, I know, and the 'Perl' I mentioned is in fact a computer programming language used in the web environment ... which people use "to get something done" (quoting another of its mottos).
Back to your first message in this thread:
... where we could find some authority on the subject of Relative over Absolute links. Here are the three options:
You think I'm out of touch with reality
IT IS A HUGE PROBLEM!
I don't see it like that, but YMMV.
this isn't lawyer ethics counselling.
No puns intended, and I don't want to escalate this further.
HAND and kind regards,
R.