Forum Library, Charter, Moderators: Robert Charlton & aakk9999 & brotherhood of lan & goodroi

Google SEO News and Discussion Forum

    
Canonical Issue from Joomla's <base> Tag - Google Bug? Browser Bug?
Regent




msg:4214024
 11:37 pm on Oct 8, 2010 (gmt 0)

I don't post much - I'd rather read. So be gentle - pls. I am not sure whether I found something critical or just an anomalous condition in GoogleLand, so I am asking the community for a second opinion and corroboration. If my observation is widespread, this could be a significant issue for many.

The Back-story:
Google Webmaster Tools reported several strange URLs as 404 error pages. A test with an HTTP header checker confirmed the 404 status. These pages happened to be critical SEO pages on a site. Further, Google was not reporting these pages with the correct path, which would explain the 404 errors. But after checking the source code for the links to these SEO pages and performing a Xenu crawl, I could not find how Google found paths/pages that did not exist. Furthermore, the correct paths to these SEO pages were in the sitemap.xml that had been registered with Google for months. Yet Google made no mention of the correct path/page in the Webmaster Tools crawl logs.

Google was reporting 404 errors for pages with paths that did not exist and was not reporting the correct path/page registered in the sitemap.xml file.

You should also know this is a Joomla website and all Joomla websites have a <base> tag that is dynamically created for each page and every page. The path and file name in the <base> tag is equal to the current page, which is the proper coding according to W3C.

The Observation:
The best way to convey what I saw is through an example.

Example:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html><head>
<base href=http://example.com/path1/path2/doc1234.html />
...
</head>
<body>
....
<a href="/seo_page.html">anchor</a>
</body>
</html>


FF and IE browsers resolved to http://example.com/seo_page.html.

The source code shows the following:
FF3.6 ==>> <a href="http://example.com/seo_page.html">
IE8 ==>> <a href="/seo_page.html">
Google cache page viewed in FF3.6 ==>> <a href="/seo_page.html">

But Google tried to find http://example.com/path1/path2/seo_page.html, which does not exist. Further, Google never found http://example.com/seo_page.html even though the URL was part of the sitemap and plenty of other site links to the target pages with correct path/page existed. Technically, Google was right - the browsers, both IE and FF, were wrong (but the code worked in the browser just fine). The code even worked in Google's cache page where the links reside. Even though there were 10 correctly coded links throughout the site for every 1 that had an incorrect path (based on the <base> tag) that Google saw, Google never reported the correct path/pages.

The problem:
With all of Google's canonical challenges, do we have a real issue? Are browsers rendering code differently than Google is? Do the browsers consider the preceding "/" in each href to indicate a path/file from the domain level (as so many webmasters think it does without regard to <base>)?

The Request:
Rather than pulling the trigger too fast and pronouncing a real problem, I am asking the WebmasterWorld community for corroboration. Has anyone else seen this issue - or were the stars aligned just right for me alone :-) ?

 

tedster




msg:4214048
 1:00 am on Oct 9, 2010 (gmt 0)

If the URL in the href attribute begins with a forward slash, then the rest of the URL simply follows on after the domain root. The <base href> is essentially nullified by root relative links and only would kick in with the ../ style of relative linking.

So it sounds to me like the browsers have it right. However, Google may be finding those messed up URLs somewhere else, somewhere that does use the ../ style.

jdMorgan




msg:4214065
 1:57 am on Oct 9, 2010 (gmt 0)

The source code shows the following:
FF3.6 ==>> <a href="http://example.com/seo_page.html">
IE8 ==>> <a href="/seo_page.html">
Google cache page viewed in FF3.6 ==>> <a href="/seo_page.html">

If you are saying that different browsers are showing you different "source code," then this means that your server is sending different code to different user-agents.

The <base href> value should be a full URL containing protocol, hostname, and URL-path, and this should NOT be changing based on the requesting user-agent. All user-agents should be receiving the same thing that you show for the Firefox case.

Jim

jdMorgan




msg:4214070
 2:34 am on Oct 9, 2010 (gmt 0)

And looking at this further, what I initially thought was a result of an inconsistent example now looks like an error. The <base href> should be referring to the published URL of the document, and not to the internal filepath of that URL. In short, it should refer to the same URL as that by which you refer to it in links -- the friendly URL.

So, based on your other examples, I believe it should read

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html><head>
<base href=http://example.com/seo_page.html />

Jim

lammert




msg:4214073
 2:57 am on Oct 9, 2010 (gmt 0)

The <base href> should be referring to the published URL of the document

No, this is not true. The base href should refer to the location from which you want absolute URLs to be built when the source code contains only a relative URL. It may or may not be the same as the URL of the document [w3.org...]. Both legitimate search engines and browsers parse the base href correctly if it differs from the URL of the document, but many scrapers crash on it. I therefore deliberately use a base href which differs from the main URL on almost all my sites. It helps prevent scraping and bad bots, without affecting legitimate visitors.

jdMorgan




msg:4214074
 2:59 am on Oct 9, 2010 (gmt 0)

Perhaps I could have worded it better, but the /path1/path2/ URL-path-part is wrong in either case, and that is what is throwing Googlebot off-track.

Jim

lammert




msg:4214084
 3:23 am on Oct 9, 2010 (gmt 0)

You can add the /path1/path2/ URL part and it shouldn't confuse Googlebot, as long as the internal URLs are all absolute. Based on my personal experience, I expect that it is not Googlebot that is off-track, but a bad bot sending a Googlebot user-agent string. In the logfiles it looks like Googlebot is making an error, but you have to check the IP address to be sure it is really a visit from Google and not a wolf in sheep's clothing.
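As an illustration of the IP check described above, here is a minimal Python sketch of a reverse-then-forward DNS verification. The function name is_real_googlebot, the injectable lookup parameters and the host-name suffixes are assumptions made for this example, not something taken from the thread; check the suffixes against Google's current documentation before relying on them.

```python
import socket

# Assumed host-name suffixes for genuine Google crawlers (verify before use).
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_real_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse-resolve the IP, check the host name, then forward-confirm it.

    The two lookup parameters are hypothetical hooks so the logic can be
    exercised without touching the network; by default real DNS is used.
    """
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip)
        if not host.endswith(GOOGLE_SUFFIXES):
            return False  # host name does not belong to an expected domain
        # The host name must resolve back to the same IP, otherwise the
        # reverse DNS record could simply have been faked.
        return ip in forward_lookup(host)
    except OSError:
        # DNS lookup failures (socket.herror, socket.gaierror) land here.
        return False
```

Run against the IP addresses in the server log, a check along these lines would separate real Googlebot requests from bots that only borrow the user-agent string.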

Another thing to consider is that this /seo_page.html may not show up for some other obvious reason, like being excluded in robots.txt, over-optimization, etc. Even if the link to the page is corrupted on some pages because of a wrong base href, it normally should have been found, based on the following statement of the OP:
Even though there were 10 correctly coded links throughout the site for every 1 that had an incorrect path (based on the <base> tag) that Google saw, Google never reported the correct path/pages.

enigma1




msg:4214530
 11:46 pm on Oct 9, 2010 (gmt 0)

IMO you should remove the self-closing slash from the base element. I know for sure it causes problems for some old browsers, and it may cause problems with SEs (it is stated in the HTML spec). Also quote the href attribute value as a string.

<base href="http://example.com/path1/path2/doc1234.html">

Looks like your application has problems with the head section, so check the document with the W3C validator.

encyclo




msg:4214583
 1:13 am on Oct 10, 2010 (gmt 0)

You should also know this is a Joomla website and all Joomla websites have a <base> tag that is dynamically created for each page and every page. The path and file name in the <base> tag is equal to the current page, which is the proper coding according to W3C.


I don't use Joomla, but if the above is true then there is no reason for having a base element on the page at all. As lammert mentioned above, the base element is a mechanism for overriding the base href for relative links when it differs from the current page URI.

In general, the base element is more trouble than it's worth, even with "correct" usage it is bad practice. Of course, CMS developers don't always produce the best HTML markup, often the nature of a CMS leads to bloated markup.

See if you can simply comment out the base element from the head section. Your site should work perfectly well without it, and it is one less element to confuse Googlebot.

Robert Charlton




msg:4214645
 7:37 am on Oct 10, 2010 (gmt 0)

FWIW, here's a link to another discussion that also gets confused about the <base> element in spots... but contains enough information along the way that it is a useful reference.

The <base> element and how to use it effectively
http://www.webmasterworld.com/html/3414338.htm

To minimize confusion when you read it, know that this is the conclusion that pageoneresults ultimately comes to....

Ah, so I was incorrect?

If you are using Root Relative links (/images/file.gif), then the use of the Base Element is negated and there is no need for it, correct?

If you are using Relative links (images/file.gif, ./images/file.gif, ../images/file.gif, etc.), then the use of the Base Element comes into play.

So now you have another unique element on each page and that is the full URI of that page which is now the base element for all relative paths. Not Root Relative but Relative.

The thread is a good argument for avoiding both the Base Element and Relative links.

enigma1




msg:4214654
 8:55 am on Oct 10, 2010 (gmt 0)

In general, the base element is more trouble than it's worth, even with "correct" usage it is bad practice. Of course, CMS developers don't always produce the best HTML markup, often the nature of a CMS leads to bloated markup.

Well no and no. To give you an example, web applications may have templates in sub-folders. So if the root of the domain is
www.example.com

and the template is in
www.example.com/design/templates/blue-template/

And inside template folder you have the stylesheet for the template which is very common, then without the base element, resources in the CSS will have to use a path based on the script where is loaded from, or a fully qualified path.

And that's a bad thing for the template now.
.box {
background: url(http://www.example.com/design/templates/blue-template/images/arrow.gif) no-repeat;
}

So you set up the base to point to the template folder, and the CSS becomes:

.box {
background: url(images/arrow.gif) no-repeat;
}

And so the template is easily portable because it doesn't need to know the parent folder references.

encyclo




msg:4214693
 12:12 pm on Oct 10, 2010 (gmt 0)

the template is in
www.example.com/design/templates/blue-template/

And inside template folder you have the stylesheet for the template which is very common, then without the base element, resources in the CSS will have to use a path based on the script where is loaded from, or a fully qualified path.


This is not the case - the paths for the images referenced in the CSS file are relative to that file, not to the parent document.

So the template can be in
www.example.com/design/templates/blue-template/ and you can use background: url(images/arrow.gif) safely with NO base element as the image will be loaded from www.example.com/design/templates/blue-template/images/arrow.gif.

The base element does not help in this situation. As long as the CSS file uses relative URIs then the location of the template directory is irrelevant. Any CMS which depends on the base element is badly-designed. I don't know about Joomla specifically, but from the initial description it does sound like the base element is merely superfluous.
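To make the point about CSS concrete, here is a small Python sketch that uses the standard urljoin function as a stand-in for the browser's resolution step. The page URL and the stylesheet file name (stylesheet.css inside the template folder) are hypothetical values chosen for the example.

```python
from urllib.parse import urljoin

# Hypothetical URLs: any page on the site, and a stylesheet in the template folder.
page_url = "http://www.example.com/some/deep/page.html"
stylesheet_url = "http://www.example.com/design/templates/blue-template/stylesheet.css"

# url(images/arrow.gif) inside the stylesheet is resolved against the
# stylesheet's own URL, so the page URL plays no part in the result.
image_url = urljoin(stylesheet_url, "images/arrow.gif")

print(image_url)
# http://www.example.com/design/templates/blue-template/images/arrow.gif
```

The page_url variable is deliberately unused in the resolution: wherever the page lives, the image is looked up next to the stylesheet.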

enigma1




msg:4214741
 4:29 pm on Oct 10, 2010 (gmt 0)

So the template can be in www.example.com/design/templates/blue-template/ and you can use background: url(images/arrow.gif) safely with NO base element as the image will be loaded from www.example.com/design/templates/blue-template/images/arrow.gif.

You are simplifying things too much.
Document base:
<base href="http://www.example.com/test-site/">


Stylesheet and under it various images:
<link rel="stylesheet" type="text/css" href="includes/template/stylesheet.css" />


Request using an SEO URL:
http://www.example.com/test-site/constructs/


This won't work without the base element set. The base becomes relevant here: without it, the browser takes the requested URL and appends the template path to it. It's a typical configuration on many sites. You cannot ignore it.
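The configuration described in this post can be sketched in Python, with urljoin standing in for the browser's URL resolution. The variable names are made up for the example; the URLs are the ones given above.

```python
from urllib.parse import urljoin

requested_url = "http://www.example.com/test-site/constructs/"
base_href = "http://www.example.com/test-site/"
stylesheet_href = "includes/template/stylesheet.css"

# Without a base element, the relative href is resolved against the requested URL.
without_base = urljoin(requested_url, stylesheet_href)

# With the base element, the same href is resolved against the base href instead.
with_base = urljoin(base_href, stylesheet_href)

print(without_base)
# http://www.example.com/test-site/constructs/includes/template/stylesheet.css
print(with_base)
# http://www.example.com/test-site/includes/template/stylesheet.css
```

Only the second URL points at the real stylesheet location, which is the situation this post is describing.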

encyclo




msg:4214744
 4:46 pm on Oct 10, 2010 (gmt 0)

I guess it's a question of preference, but in my mind if the CMS can generate this:

<base href="http://www.example.com/test-site/">

Then it can just as well generate this:

<link rel="stylesheet" type="text/css" href="/test-site/includes/template/stylesheet.css" />

If it can't, then it's a design choice by the authors rather than an obligation to use a base element. Using the base element is, in my opinion, a band-aid in lieu of doing it right.

But I fear we are digressing from the original poster's question ;)

enigma1




msg:4214756
 5:37 pm on Oct 10, 2010 (gmt 0)

Well, in this case you will have to set up absolute paths for js and css files, content images, etc., which is very complicated and unnecessary. You can avoid all this with just a single line in the header.

As for the original question, I see it as an application problem, not something generic. And I don't use Joomla either.

Regent




msg:4215384
 9:56 pm on Oct 11, 2010 (gmt 0)

Hi All,

First, thanks for all your input. Yet the thread is starting to look like so many others I have seen on this same subject - how to use <base> and what it does. After looking at what seemed like hundreds of posts, including debates in the Joomla forum, and reading W3C's explanation, it seems pretty clear that there is no consensus. And it is this very point that concerns me. By the way, the <base> tag is included in Joomla's header jdoc include. Joomla always produces the <base> as the path/page you are currently on. I could hack it but I think it is correct.

The real issue here is how is Google interpreting the <base> tag.

It was the strange reportings in Google’s Webmaster Tools (WMT) account that prompted this thread. Here are some other meaningful notes on my observations:

* The site was a re-launch of a domain that has been around for many years (7+ years).
* The SEO pages were additions - brand new content written specifically for the site. These SEO pages do not have dup or near-dup content of any kind.
* The re-launch took place in the May-June time frame (shortly after the May-Day update as I recall).
* The sitemap.xml file has had the correct SEO page path and file name since its relaunch. The sitemap.xml file was included in the site’s root and registered with Google.
* As of last week, WMT reported no SEO pages indexed, yet did report SEO pages under an incorrect path.
* Links to the SEO pages exist on every site page – in the footer as a SEF drop-up, 100% CSS menu. Links to the SEO pages also exist on one sub-menu.
* Links to the SEO pages are also on two other pages of the site. These produce the right URLs in browsers, but I suspect they are where Google got the incorrect path. These 2 pages are:
www.domain.com/component/user/reset.html
www.domain.com/component/user/remind.html

As you can see, these are pages for resetting a password or reminding a password for admin access. These are the only 2 pages on the site that could possibly produce what Google recorded in WMT. The base on each one of these pages is:
<base href="http://www.domain.com/component/user/reset.html" />
and
<base href="http://www.domain.com/component/user/remind.html" />
respectively.

* The SEO page URLs that Google reported in WMT look like this example: www.domain.com/component/user/SEO_PAGE1.html. The browser resolves to www.domain.com/SEO_PAGE1.html
* As Tedster rightfully points out, Google could have gotten the incorrect SEO page URLs from some other source. Perhaps in an email. But it is pretty unlikely that these pages would have gotten out into Google’s spidering range as they are not very popular.
* The correct URLs for the SEO pages did not appear in any of WMT internal links. Only the incorrect path pages did, and they produced a 500 error.
* As of this writing, all SEO pages with correct URLs respond properly to an “info:” query. All have cache pages. But none of the SEO pages have PR value. The home page of the site is a PR4.

To make a long story short, everything on the outside (info: query, browser resolving to the correct URL) looks right. But everything on the inside of WMT looks wrong (absence of the correct path/SEO_page URLs, reporting of wrong path/SEO_page URLs).

There are only a few possibilities:
1. Google WMT is mis-reporting
2. Google did find legitimate links to incorrect path/SEO_page URLs
3. Google does have a canonical issue in the way it handles <base> relative to the way browsers do.

[edited by: tedster at 9:59 pm (utc) on Oct 11, 2010]

Regent




msg:4215392
 10:13 pm on Oct 11, 2010 (gmt 0)

tedster,

If there is no consensus on how to use <base>, it may be worth it to use your influence and ask Google folks to chime in and straighten us all out :-)
dgj

tedster




msg:4215398
 10:28 pm on Oct 11, 2010 (gmt 0)

You think I have some special hotline to Mountain View? Sorry to disappoint, Eric Schmidt is not my BFF ;)

I don't think there's any technical mystery about the <base> tag - it's very cut and dried. In addition, it only comes into play if you use the ../ style of linking.

The question of what is going on with Google in your case is the mystery. But from this thread, I don't think it's the <base> tag. Also, realize that Google will intentionally experiment with oddly formed URLs, just to see how your server handles them. If your server says 404, then no problem.

If I were you I would put the <base> tag explanation aside and look at the problem with fresh eyes - once more again from the top.

Regent




msg:4215403
 11:06 pm on Oct 11, 2010 (gmt 0)

Tks Tedster,

As usual - good advice. So that someone with limited experience can understand, what is the answer to this example? How should the URIs resolve given:

<base href="http://www.example.com/path1/path2/page_A.html">
<a href="/page_B.html">page B</a>
and
<a href="page_C.html">page C</a> (without forward slash)

As I understand the W3C, and from what I saw in WMT, the answer is:

http://www.example.com/path1/path2/page_B.html
and
http://www.example.com/path1/path2/page_C.html

But IE, FF, at least one sitemap.xml tool and Xenu resolve them to:

http://www.example.com/page_B.html
and
http://www.example.com/page_C.html

Which is right?

tedster




msg:4215431
 12:56 am on Oct 12, 2010 (gmt 0)

/page_B.html
...that link points to http://www.example.com/page_B.html. The beginning slash means "start at the domain root". That's true no matter what the URL of the source page may be.

page_C.html
... this link resolves differently in different cases. Without the beginning slash, the full URL begins with the directory where the source page "lives".
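For anyone who wants to check the two cases mechanically, here is a minimal Python sketch. It uses the standard library's urljoin, which implements the generic relative-URL resolution rules, as a stand-in for a browser, and it treats the <base href> from the question as the base URL.

```python
from urllib.parse import urljoin

# The <base href> value from the question above.
base = "http://www.example.com/path1/path2/page_A.html"

# Root-relative: the leading slash replaces the whole path of the base.
page_b = urljoin(base, "/page_B.html")

# Document-relative: resolved against the directory part of the base URL.
page_c = urljoin(base, "page_C.html")

print(page_b)  # http://www.example.com/page_B.html
print(page_c)  # http://www.example.com/path1/path2/page_C.html
```

So under these rules only page_C.html is affected by the base; /page_B.html ends up at the domain root either way.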

Regent




msg:4215745
 3:27 pm on Oct 12, 2010 (gmt 0)

MYSTERY SOLVED

As Tedster has suspected, there is more to this story. After several hours of further research and experimentation, I believe I have figured out what has happened.

There are 2 conditions that have led to my confusion. For the sake of others that follow, I will outline them below.

First, Joomla is a CMS framework. Most of you know that but what may not be clear and was not obvious to me is that all URLs constructed by the framework are based on the paths and pages off the root domain. In other words, the framework creates output HTML with a link that looks like /path1/path2/page.html regardless of what page the link is on. Furthermore, in-content or navigation links are all relative (without a leading "/"). Joomla editors will produce a back-end link with "../page.html" or "page.html". The CMS engine converts either link to the proper front end output.

The condition is further complicated with a SEF engine. This engine uses re-write rules to convert URLs with parameters to nice looking SEF URLs.

To make a long story short(er), the <base> tag is a Joomla front-end output tag that has no effect on normal <a> tag links in Joomla, because the CMS engine takes care of creating clean SEF links that have a complete path. So what threw me off was placing an <a> tag link like this: href="SEO_Page1.html" on a page that had a URL that looked like /path1/path2/page.html with a <base href="www.example.com/path1/path2/page.html">.

Now, about the /path1/path2/SEO_pages.html URLs that Google picked up. I had to go back to some of the original code to figure out this one. Turns out the Joomla CMS engine figures out <a> tags just fine. And for the most part, it does a good job of making links SEF with some re-write rules in the .htaccess file. But what it does not do is clean up <option value="SEO_page.html"> links. Very early in the site's design, SEO page URLs were placed in a drop list in the footer. Well - you can see where this is going. The Joomla engine did not correct these <option> tag links, and the <base> tag OR current page path took over. Bingo, URLs that lead to pages that did not exist. I cannot explain how or why Google decided to crawl these obscure pages when it has not done a good job crawling sitemap.xml pages.
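The effect can be reproduced with a short Python sketch. It is only a guess at how a crawler that treats <option> values as URLs might resolve them, not a description of what Googlebot actually does. The class name OptionUrlCollector and the sample markup are invented for the example, and www.example.com stands in for the real domain.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class OptionUrlCollector(HTMLParser):
    """Hypothetical collector: resolves <option value> against the page's base."""

    def __init__(self, page_url):
        super().__init__()
        self.base = page_url  # falls back to the page URL if no <base> is seen
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and attrs.get("href"):
            self.base = urljoin(self.base, attrs["href"])
        elif tag == "option" and attrs.get("value"):
            self.urls.append(urljoin(self.base, attrs["value"]))


# Invented markup modelled on the reset-password page described earlier.
page_source = """
<html><head>
<base href="http://www.example.com/component/user/reset.html" />
</head><body>
<select><option value="SEO_PAGE1.html">SEO page 1</option></select>
</body></html>
"""

collector = OptionUrlCollector("http://www.example.com/component/user/reset.html")
collector.feed(page_source)
print(collector.urls)
# ['http://www.example.com/component/user/SEO_PAGE1.html']
```

The resolved URL has the same /component/user/ shape as the incorrect paths that showed up in WMT.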

Now I am writing "Tedster, you were right" over and over. :-)

The moral of this story is twofold. When dealing with a CMS, the back-end code gets manipulated by the CMS engine and re-write rules. Although there has been a long-standing debate on the pros and cons of using <base> tags, in Joomla's case, <a> tags are manipulated and re-written such that its <base> tags have no effect. I am sure there is good reason, as mentioned in earlier posts, that Joomla decided to add a <base> tag to their output. The second lesson is that a CMS may not handle <option> tags (or other link code) the same way as <a> tags. In Joomla's case, they are handled differently, which caused output issues depending on the page where the code was placed.

Case closed!

enigma1




msg:4216075
 9:00 am on Oct 13, 2010 (gmt 0)

But what it does not do is clean up <option value="SEO_page.html"> links

Search engines do not extract nor synthesize links from the option element of drop-down lists. And I believe your problems are still there with the way the links are created. Perhaps you had a relative link somewhere and that is what caused this problem, but the CMS you are using responds to whatever path you set up and changes the base element.

That's a bad thing. It means whatever path someone types in the server responds with 200 OK and sets up the base element.
eg:
http://example.com/path1/path2/garbage1/garbage2/garbage3/
and the server shows the page of
http://example.com/path1/path2/
with the base element set to whatever is in the url.
So now not only does the request poison the base element, but you also cannot set up a relative link.

With what I see about security I wouldn't use that framework. And it is not something general about CMSs, but specific to the one you are using.

tedster




msg:4216207
 2:13 pm on Oct 13, 2010 (gmt 0)

Search engines do not extract nor synthesize links from the option element of drop-down lists.

That certainly used to be the case. However, I have evidence that Google now can, and does. It comes from a site that foolishly has no conventional menu.

enigma1




msg:4216263
 3:30 pm on Oct 13, 2010 (gmt 0)

I have evidence that Google now can, and does.

Is that about GET forms where the bot tries to determine links? Because I heard about it but it's old. In general using POST forms is the standard way. I haven't seen anything strange with POST forms.

tedster




msg:4216286
 4:08 pm on Oct 13, 2010 (gmt 0)

It seems to me that Google discovers a URL and then establishes a virtual link in their webgraph whenever any URL is directly in the source code (rather than generated by a script) and whether the action is POST or GET.

In the case I'm talking about, this would have been the only way the site could have been indexed as completely as it was, especially within a week of its launch when backlinks were rare. The key, I believe, was that the full text string of the URL was right there, directly in the source code.

enigma1




msg:4216372
 6:12 pm on Oct 13, 2010 (gmt 0)

I haven't seen that, and I check the server logs of various sites. You would expect the bot to index the form's action attribute, which is going to be a valid link anyway. And I never saw that with POST forms.

Now if what you say is true, then it is going to open a new world of comment spam: injecting any kind of element/tag with a link, eg:
<font href="http://example.com" /> and having Google follow it. That's something I haven't tried.

tedster




msg:4216375
 6:16 pm on Oct 13, 2010 (gmt 0)

Your experiment sounds pretty interesting - really outside the box!

My guess would be that there's some kind of filtering before one of these virtual links gets set up - in fact I almost think there must be.

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved