Forum Moderators: Robert Charlton & goodroi
In what cases should it be used, and why?
As I understand it (having read many posts on the subject), you only need it if your internal links are relative. The best policy is absolute links, and a mod_rewrite in the htaccess forcing things to either www or non-www. If you do that, you don't need the base tag.
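For reference, the mod_rewrite rule mentioned above typically looks something like this (a sketch assuming Apache with mod_rewrite enabled; example.com is a placeholder domain):

```apache
RewriteEngine On
# Permanently (301) redirect any request for example.com to www.example.com
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```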
The base element is probably a good idea to use if you want your cached pages to show up nicely on Google, but I'd have to disagree that using absolute URLs is always the best policy (unless your page is hosted outside the domain it was originally intended for and you are linking to external HTML documents, images, external files, etc. - i.e. the Google cache).
I really have my doubts as to whether Google uses this element at all on a page. It might crawl the link but I doubt it utilizes it in the same fashion as a browser would. Googlebot already knows what host/page it's on, otherwise it wouldn't be there.
Unless there is a stupendously obvious bug in Googlebot (that could be identified by some simple webmaster tests using redirects and looking at error logs) setting the <base href> to the same page will have absolutely zero effect.
For slightly different reasons, using absolute urls will also have zero effect (unless there is a similar stupendously obvious bug).
Kaled.
Unless there is a stupendously obvious bug in Googlebot (that could be identified by some simple webmaster tests using redirects and looking at error logs) setting the <base href> to the same page will have absolutely zero effect.
If Googlebot uses the base element correctly, the example code given in the W3C documentation for the base element would seem to indicate otherwise.
[w3.org...] (Section 12.4)
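For anyone without the spec open, the W3C example boils down to something like this (paraphrasing from memory, with illustrative URLs):

```html
<head>
  <base href="http://www.acme.com/support/intro.html">
</head>
<body>
  <!-- resolves to http://www.acme.com/support/suppliers.html,
       regardless of which host actually served this page -->
  <a href="suppliers.html">Suppliers</a>
</body>
```

i.e. a user agent that honours `<base>` resolves every relative URL against the base href, not against the URL it fetched the page from.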
but I'd have to disagree that using absolute urls is always the best policy
I'm speaking purely in terms of avoiding canonicalization problems with G. This has been such a great problem for some people that it seems best to use absolute, rather than relative. It shouldn't be necessary, true, but better safe than sorry.
But the much better way is absolute URLs.
No, it's not better - pages become bigger in size and less portable (though some use it to annoy people who steal content).
Googlebot is just a crawler - it gets URLs and retrieves pages. In all probability it's not the job of the bot itself to parse for URLs and resolve them, which is where the BASE tag comes into play.
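The resolution step itself is mechanical - it's standard relative-reference resolution, which Python's `urllib.parse.urljoin` implements. A quick sketch using made-up URLs (including the thread's hypothetical thedomain.com example):

```python
from urllib.parse import urljoin

# With no <base> element, relative links resolve against the page's own URL
page_url = "http://www.example.com/support/intro.html"
print(urljoin(page_url, "suppliers.html"))
# http://www.example.com/support/suppliers.html

# With <base href="...">, a conforming agent resolves against the base instead
base_href = "http://www.thedomain.com/"
print(urljoin(base_href, "/this.page.html"))
# http://www.thedomain.com/this.page.html
```

So whether the freehost URL or the .com ends up in the index comes down to which of the two strings the indexer feeds in as the first argument.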
If the BASE tag were not supported then it would have been very obvious - the fact that it's not mentioned in the "documentation" is because you and I don't really need to know the details.
I remember the other day GoogleGuy said that it's best to use absolute URLs - rather a strange suggestion really; if a search engine can't resolve relative URLs then it really should get back to basics.
If it didn't, Googlebot really would be broken.
How? Most sites don't even use the base element.
<base href> should be read and understood by all user agents, bots and browsers alike.
Maybe in theory but it all depends on the application making the request.
If you arbitrarily change it, your page will break.
? :)
Lord Majestic makes a good point. If I were to use absolute URLs on my sites it would require an additional 500 MB of download bandwidth per month, versus only 50 MB per month if I were to use the base element.
If the base element were that important to Google, I would imagine they would get the word out and recommend using it, but it isn't listed on their webmaster pages. I would think such info would be just as important as their meta tag recommendations.
Even if Googlebot does utilize the base element on pages it crawls, I think it's a far stretch to recommend its usage in cases where it clearly wouldn't be needed. It's the user agent's responsibility to make a proper Host/GET request (per the HTTP/1.1 protocol).
A friend owns a ".com" domain, and the registrar has simply set up a 302 redirect to his webspace at a freehost, which is located at www.freehost.com/users/user.name/sub.folder/keyword/. Note that the content is actually stored on this freehost. The ".com" is not hosted at all, it just issues a 302 redirect.
By using the <base href="http://www.thedomain.com/"> tag on all of the pages of the site, and all internal links as "/" and "/this.page.html" and "/that.page.html" and "/images/the.image.jpg" and /stylesheet.css etc, the entire site appears in Google's index (and MSNs) using the .com domain name, and none of the pages are indexed under the Freehost URLs at all (some were to begin with, but they have all disappeared now).
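In other words, the head of every page carries something like the following (the domain is the poster's example; the page and file names are illustrative):

```html
<head>
  <base href="http://www.thedomain.com/">
  <link rel="stylesheet" href="/stylesheet.css">
</head>
<body>
  <!-- root-relative links now resolve against the base href,
       not against the freehost URL the page was actually served from -->
  <a href="/this.page.html">This page</a>
  <img src="/images/the.image.jpg" alt="">
</body>
```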
Yahoo still has a 50/50 split. The <base> tag was added after the site had already been fully indexed under the FreeHost URLs, and Yahoo is slowly dropping them in favour of the .com URL specified in the <base> tag.
Usually a 302 redirect is seen as a bad thing, but in this case the <base> tag is confirming which URL is actually required to be listed and, so far, search engines are obeying it (as expected).
[edited by: g1smd at 12:34 am (utc) on Jan. 9, 2006]
whether this is "www.freehost.com/users/user.name/sub.folder/keyword/" vs. "www.domain.com/"
or whether it is "domain.com/" vs. "www.domain.com/" is irrelevant
... then the <base> tag confirms the true base domain and/or true full URL for that page.
It also forces all the internal links out from that page to conform to the same base domain - and that is exactly what you need to happen if you want to get Google to drop non-www listings and list all pages as www.domain.com.
The 301 redirect from non-www to www helps achieve that too, but the <base> tag has a very large part to play as well. I have been testing this in a variety of ways for the last 4 or 5 months. It works as it should. Bots do obey it.
Are you sure it's not a case of Googlebot strictly following the HTTP protocol by continuing to use the Request-URI for future visits instead of the location of the 302 temporary URI?
[w3.org...]
The index page was already indexed as being at the .com address, but all of the internal pages were indexed under the long Freehost URL when I started the experiment. There was no <base> tag on any of those pages.
The <base> tag was then added, but only on new pages as they were added to the site, and as expected those only got indexed as being .com pages. The fact that they were really on a freehost was totally ignored by Google. They took the <base> tag as being the "true" location of the page.
Later on, and as soon as the <base> tag was added to the older pre-indexed pages, they then started appearing in the SERPs as .com pages too, but those didn't all drop out as Freehost URLs until the <base> tag was finally added to the root index page (missed doing it to that page for many weeks) - and then all of the freehost URLs were gone from the SERPs in a matter of just a few days.
Now the entire site appears only under .com URLs, and when you navigate round the site all the links show as .com destinations. After you have clicked the link, you are redirected and the browser URL bar shows the actual freehost address of the page. The links within the new page show only .com as the destination.
If that were the case, wouldn't it be really easy to make the spider believe it was somewhere it really wasn't? Seems like a really good way to get rid of those pesky competitors.
Additionally, note that absolute URLs in outgoing links to other pages do NOT confirm the "true" URL of the page that you are on right now.
Even if all the links on a page point to www.domain.com/page2.html and www.domain.com/ etc, that doesn't fully prove that you are on some page at www.domain.com right now.
If all the internal links use the same format (domain.com vs. www.domain.com, consistently), it does make it harder for a spider to get to the "wrong" ones; and a spider that does get to a wrong one - like otherdomain.com/page3.html, for example - is then forced by all the outgoing links from that page to get back on track, indexing the rest of the site (but not that page) using the correct URLs again.
The 301 redirect is one method to stop some of the "wrong" URLs being indexed, but still does not completely confirm the true identity of the right one to list - adding the <base> tag, on the page itself, as well as doing the redirect, can and does do that.