Forum Moderators: Robert Charlton & goodroi


Googlebot and <base href.*>

Does Googlebot use it?


Key_Master

10:43 pm on Dec 21, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Is this fact or speculation? If fact, does Google recommend using it and if so, in what cases should it be used and why?

I think if Googlebot needs to use a <base href.*> to figure out where in the hell it's already at, something is seriously wrong with their indexing software.

Stefan

4:00 am on Dec 22, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



in what cases should it be used and why?

As I understand it (having read many posts on the subject), you only need it if your internal links are relative. The best policy is absolute links, plus a mod_rewrite rule in .htaccess forcing everything to either www or non-www. If you do that, you don't need the base tag.
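
For illustration only (example.com is a placeholder, not any site discussed here), a rule along these lines in .htaccess 301s every request onto the www hostname:

  RewriteEngine On
  # send any request for the bare domain to the www hostname
  RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
  RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]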

Key_Master

4:50 am on Dec 22, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



HTTP/1.1 protocol supports both absolute and relative urls. Even Google uses relative urls.

The base element is probably a good idea if you want your cached pages to show up nicely on Google, but I'd have to disagree that using absolute urls is always the best policy (unless your page is hosted outside the domain it was originally intended for and you are linking to external html documents, images, other external files, etc. - i.e. the Google cache).

I really have my doubts as to whether Google uses this element at all on a page. It might crawl the link but I doubt it utilizes it in the same fashion as a browser would. Googlebot already knows what host/page it's on, otherwise it wouldn't be there.

kaled

11:56 am on Dec 22, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Whenever a robot visits a page, it's going to look at the <base href> so that it knows where to look for relatively-linked pages. If it finds no <base href> then it will use the url of the page being read.
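
A quick illustration with made-up URLs - suppose the page being read is http://www.example.com/docs/page1.html and it contains:

  <head>
    <!-- hypothetical base; the relative links below resolve against it -->
    <base href="http://www.example.com/docs/">
  </head>
  <body>
    <!-- resolves to http://www.example.com/docs/page2.html -->
    <a href="page2.html">Next page</a>
    <!-- resolves to http://www.example.com/page3.html -->
    <a href="/page3.html">Another page</a>
  </body>

Take the <base> element out and the same links resolve against the page's own URL instead, which in this case gives exactly the same result.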

Unless there is a stupendously obvious bug in Googlebot (that could be identified by some simple webmaster tests using redirects and looking at error logs) setting the <base href> to the same page will have absolutely zero effect.

For slightly different reasons, using absolute urls will also have zero effect (unless there is a similar stupendously obvious bug).

Kaled.

Key_Master

1:25 pm on Dec 22, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



So Kaled, not to put you on the spot, but show me some documentation that Googlebot uses the base element.

Unless there is a stupendously obvious bug in Googlebot (that could be identified by some simple webmaster tests using redirects and looking at error logs) setting the <base href> to the same page will have absolutely zero effect.

If Googlebot were to use the base element correctly, the example code given in the w3c documentation for the base element would seem to indicate otherwise.

[w3.org...] (Section 12.4)
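
The gist of that spec example (URLs below are invented for illustration, not copied from the spec) is that a base may legitimately point at a completely different host from the one actually serving the page:

  <!-- page actually served from http://host-a.example/support/intro.html -->
  <head>
    <base href="http://host-b.example/products/intro.html">
  </head>
  <body>
    <!-- resolves to http://host-b.example/new/bird.gif - nothing to do with host-a.example -->
    <a href="../new/bird.gif">Bird cages</a>
  </body>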

Stefan

2:41 pm on Dec 22, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



but I'd have to disagree that using absolute urls is always the best policy

I'm speaking purely in terms of avoiding canonicalization problems with G. This has been such a big problem for some people that it seems best to use absolute, rather than relative, links. It shouldn't be necessary, true, but better safe than sorry.
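
In practice that just means writing internal links in the first form below rather than the second (placeholder domain):

  <!-- absolute: the canonical hostname travels with the link -->
  <a href="http://www.example.com/widgets.html">Widgets</a>

  <!-- relative: the hostname is inherited from whichever URL the page was fetched under -->
  <a href="/widgets.html">Widgets</a>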

kaled

4:17 pm on Dec 22, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



So Kaled, not to put you on the spot, but show me some documentation that Googlebot uses the base element.

If it didn't, Googlebot really would be broken.

Kaled.

texasville

5:08 pm on Dec 22, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



As I understand it, using the base url in your document is to help the bots. But the much better way is absolute urls. Whether the base href actually helps is pure speculation.
However, absolute urls are supposed to be one safeguard against your document being hijacked.

Lord Majestic

5:30 pm on Dec 22, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



But the much better way is absolute urls.

No, it's not better - pages become bigger and less portable (though some use it to annoy people who steal content).

Googlebot is just a crawler - it gets urls and retrieves pages. In all probability it's not the job of the bot itself to parse pages for URLs and resolve them, which is where the BASE tag comes into play.

If the BASE tag were not supported, it would have been very obvious - the fact that it's not mentioned in "documentation" is because you or I don't really need to know the details.

I remember the other day GoogleGuy said that it's best to use absolute URLs - a rather strange suggestion really; if a search engine can't resolve relative urls then it really should get back to basics.

kaled

6:11 pm on Dec 22, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



As I understand it, using the base url in your document is to help the bots.

<base href> should be read and understood by all user agents, bots and browsers alike. If you arbitrarily change it, your page will break.

Kaled.

Key_Master

6:58 pm on Dec 22, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If it didn't, Googlebot really would be broken.

How? Most sites don't even use the base element.

<base href> should be read and understood by all user agents, bots and browsers alike.

Maybe in theory but it all depends on the application making the request.

If you arbitrarily change it, your page will break.

? :)

Lord Majestic makes a good point. If I were to use absolute urls on my sites it would require an additional 500mb of download bandwidth per month, versus an extra 50mb per month if I were to use the base element.

If the base element were that important to Google, I would imagine they would get the word out and recommend using it, but it isn't listed on their webmaster pages. I would think such info would be just as important as their meta tag recommendations.

Even if Googlebot does utilize the base element on pages it crawls, I think it's a stretch to recommend its usage in cases where it clearly isn't needed. It's the user agent's responsibility to make a proper host/get request (per the HTTP/1.1 protocol).
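
For reference, a "proper host/get request" under HTTP/1.1 is nothing more than this (placeholder host), so the crawler already knows exactly which host and path it asked for before it ever sees the page:

  GET /page1.html HTTP/1.1
  Host: www.example.com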

g1smd

12:22 am on Jan 9, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I have been involved in testing something for several months. The base tag was vital in its operation. I will make a longer post when I have finished this work. For now, just this:

A friend owns a ".com" domain, and the registrar has simply set up a 302 redirect to his webspace at a freehost, which is located at www.freehost.com/users/user.name/sub.folder/keyword/. Note that the content is actually stored on this freehost. The ".com" is not hosted at all; it just issues a 302 redirect.

By using the <base href="http://www.thedomain.com/"> tag on all of the pages of the site, and writing all the internal links as "/", "/this.page.html", "/that.page.html", "/images/the.image.jpg", "/stylesheet.css", etc., the entire site appears in Google's index (and MSN's) using the .com domain name, and none of the pages are indexed under the Freehost URLs at all (some were to begin with, but they have all disappeared now).
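
Spelled out, the head of every page sitting on the freehost carries the tag, and every internal link is root-relative (same placeholder names as above):

  <head>
    <!-- the .com the site should be indexed under, not the freehost URL it is actually served from -->
    <base href="http://www.thedomain.com/">
  </head>
  <body>
    <!-- resolves to http://www.thedomain.com/this.page.html -->
    <a href="/this.page.html">This page</a>
    <!-- resolves to http://www.thedomain.com/images/the.image.jpg -->
    <img src="/images/the.image.jpg" alt="">
  </body>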

Yahoo still has a 50/50 split. The <base> tag was added after the site had already been fully indexed under the FreeHost URLs, and Yahoo is slowly dropping them in favour of the .com URL specified in the <base> tag.

Usually a 302 redirect is seen as a bad thing, but in this case the <base> tag is confirming which URL is actually required to be listed and, so far, search engines are obeying it (as expected).

[edited by: g1smd at 12:34 am (utc) on Jan. 9, 2006]

g1smd

12:29 am on Jan 9, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



To reiterate and confirm: if a page of content can be reached by way of more than one URL (and all of those URLs return HTTP status "200 OK")...

whether this is "www.freehost.com/users/user.name/sub.folder/keyword/" vs. "www.domain.com/"

or whether it is "domain.com/" vs. "www.domain.com/" is irrelevant

... then the <base> tag confirms the true base domain and/or true full URL for that page.

.

It also forces all the internal links out from that page to conform to the same base domain - and that is exactly what you need to happen if you want to get Google to drop non-www listings and list all pages as www.domain.com.

The 301 redirect from non-www to www helps achieve that too, but the <base> tag has a very large part to play as well. I have been testing this in a variety of ways for the last 4 or 5 months. It works as it should. Bots do obey it.
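
The redirect half of that is the same sort of mod_rewrite rule sketched earlier in the thread; the <base> half is just a single line in the head of every page (domain.com here is a placeholder, as above):

  <!-- the one hostname you want every relative link, and the page itself, associated with -->
  <base href="http://www.domain.com/">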

Key_Master

12:50 am on Jan 9, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks g1smd for posting. Can of worms, eh. (I'm thinking of 302 hijack posts about now) :)

Are you sure it's not a case of Googlebot strictly following the HTTP protocol by continuing to use the Request-URI for future visits instead of the temporary URI given in the 302's Location header?

[w3.org...]
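
Put in protocol terms, the exchange at the registrar's end looks roughly like this (same placeholder names as the earlier post), and the question is whether the bot carries on indexing under the Request-URI or under the Location it was sent to:

  GET / HTTP/1.1
  Host: www.thedomain.com

  HTTP/1.1 302 Found
  Location: http://www.freehost.com/users/user.name/sub.folder/keyword/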

tedster

12:54 am on Jan 9, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



As I mentioned in another thread, Google adds a <base href=""> element to the top of the mark-up in the cached pages. I think that's a sure sign that Google makes pretty intensive use of it.
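
For anyone who hasn't looked: viewing the source of a cached result shows something along these lines near the top (illustrative only - not Google's exact mark-up, and example.com is a placeholder):

  <!-- added to the cached copy so the page's relative links and images still resolve to the original site -->
  <base href="http://www.example.com/page1.html">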

g1smd

1:02 am on Jan 9, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I didn't want to post this stuff yet, as I haven't finished messing...

The index page was already indexed as being at the .com address, but all of the internal pages were indexed under the long Freehost URL when I started the experiment. There was no <base> tag on any of those pages.

The <base> tag was then added, but only on new pages as they were added to the site, and as expected those only got indexed as being .com pages. The fact that they were really on a freehost was totally ignored by Google. They took the <base> tag as giving the "true" location of the page.

Later on, as soon as the <base> tag was added to the older pre-indexed pages, they started appearing in the SERPs as .com pages too. The Freehost URLs didn't all drop out, though, until the <base> tag was finally added to the root index page (I had missed doing that page for many weeks) - and then all of the Freehost URLs were gone from the SERPs within just a few days.

Now the entire site appears only under .com URLs, and when you navigate round the site all the links show as .com destinations. After you have clicked the link, you are redirected and the browser URL bar shows the actual freehost address of the page. The links within the new page show only .com as the destination....

Key_Master

1:11 am on Jan 9, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



As I mentioned in another thread, that's a pretty poor example tedster. It's one thing to use the base element in a cached copy of a page offsite from the original host (the page wouldn't work any other way), but it's a completely different thing for a search engine spider to replace the HTTP_HOST header with the url supplied by the base element.

If that were the case, wouldn't it be really easy to make the spider believe it was somewhere it really wasn't? Seems like a really good way to get rid of those pesky competitors.

g1smd

1:14 am on Jan 9, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



>> If that were the case, wouldn't it be really easy to make the spider believe it was somewhere it really wasn't? <<

Been there. Done that.

Used it to confirm which end of a 302 redirect I really wanted listed. See posts above.

Could also be used to repel a 302 hijack aimed at a site.

g1smd

2:33 pm on Jan 10, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hmm, I thought that posts 12, 13 and 16 would have provoked a flurry of posting... but instead the topic died.

Phil_Payne

2:40 pm on Jan 10, 2006 (gmt 0)

10+ Year Member



> However, absolute urls are supposed to be one safeguard against your document being hijacked.

There are several products that will relocate absolute URLs for you if you wish. Teleport Pro comes to mind.

g1smd

4:38 pm on Jan 10, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If you can get to a page through external or internal incoming links like:
domain.com/page1.php
www.domain.com/page1.php
otherdomain.com/page1.php
www.otherdomain.com/page1.php
(and don't forget that, for a site without any redirects, links like /page4.html cannot force the domain name at all - the domain name of the page such a link points to will be whatever the domain name of the page you started on was)
and all of those URLs serve a document with HTTP status "200 OK", then the document effectively has ALL of those URLs, and spiders are going to have a hard job working out which one is the "true" or "real" one.

.

Additionally, note that absolute URLs in outgoing links to other pages do NOT confirm the "true" URL of the page that you are on right now.

Even if all the links on a page point to www.domain.com/page2.html and www.domain.com/ etc, that doesn't fully prove that you are on some page at www.domain.com right now.

If all the internal links use the same format (domain.com vs. www.domain.com), it does make it harder for a spider to get to the "wrong" ones. Even if it has got to a wrong one - otherdomain.com/page3.html, for example - it is then forced by all the outgoing links from that page to get back on track and index the rest of the site (though not that page itself) under the correct format again.

The 301 redirect is one method to stop some of the "wrong" URLs being indexed, but still does not completely confirm the true identity of the right one to list - adding the <base> tag, on the page itself, as well as doing the redirect, can and does do that.
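
Spelled out at the protocol level (placeholder exchanges): with the 301 in place, only one of the hostnames still answers 200, and the <base> tag on that page points at the same choice.

  GET /page1.php HTTP/1.1
  Host: domain.com

  HTTP/1.1 301 Moved Permanently
  Location: http://www.domain.com/page1.php

  GET /page1.php HTTP/1.1
  Host: www.domain.com

  HTTP/1.1 200 OK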