Forum Moderators: Robert Charlton & goodroi
<base href="http://www.-----.com/" />
<a href="news/">News</a>
<a href="contact/">Contact</a>
Note that the slash comes after the TLD in the base href, and there is no slash before the link URLs, so they are relative rather than root-relative.
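For reference, here's a quick sketch of how a spec-compliant user agent should resolve those relative links against the base href. This uses Python's urllib.parse.urljoin, with example.com standing in for the obfuscated domain:

```python
from urllib.parse import urljoin

# example.com is a placeholder for the obfuscated domain above.
base = "http://www.example.com/"

# With the base href in effect, both relative links resolve against
# the domain root, no matter which directory the current page is in.
for link in ("news/", "contact/"):
    print(urljoin(base, link))
# http://www.example.com/news/
# http://www.example.com/contact/
```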
Googlebot ignores the base href and spiders the website like this:
[-----.com...] and so on: [-----.com...] etc.
Is anybody else seeing this problem? It can cause a lot of bandwidth usage on some occasions, especially on sites with dynamic URLs that are all handled by one file that returns a 200 OK for every address.
I know that last setup isn't ideal, but it shouldn't cause a problem either. Other spiders and browsers handle it fine.
For example, if the page is in the /news/ directory and it has a relative link to another page in the /news/ directory, the base tag above is telling the user agent not to use the /news/ directory in the file path, but to calculate it from the domain root. The correct value for the base href tag is the fully qualified absolute address of the page itself.
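A small sketch of that point, again using urljoin and example.com as a stand-in domain (the page URL and filenames here are hypothetical). With the page's own absolute URL as the base, relative links inside /news/ keep resolving into /news/; with a site-root base, the same relative link resolves against the domain root instead:

```python
from urllib.parse import urljoin

# Correct: base is the fully qualified address of the page itself,
# so a relative link stays inside the /news/ directory.
page_url = "http://www.example.com/news/index.html"
print(urljoin(page_url, "other-article.html"))
# http://www.example.com/news/other-article.html

# With a site-root base href instead, the same relative link is
# calculated from the domain root and points at a different URL
# (which may still answer 200 OK on a catch-all site).
print(urljoin("http://www.example.com/", "other-article.html"))
# http://www.example.com/other-article.html
```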
What kind of error recovery a bot might have at that point is only guesswork for us, but technically it would have an error to cope with.
Sometimes it can recurse through a site and index URLs like this: www.-----.com/news/contact/news/contact/news/contact/ etc.
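One way such loops can arise (a sketch of a plausible mechanism, not a claim about Googlebot's internals): if the crawler ignores the base href and resolves each relative link against the current page's path, and every fetched path serves the same page with the same relative links, each hop appends another segment:

```python
from urllib.parse import urljoin

# Hypothetical catch-all site: every URL returns the same page with
# the same relative links. Resolving against the current path (i.e.
# ignoring the base href) nests the directories deeper each time.
url = "http://www.example.com/"
for link in ("news/", "contact/", "news/", "contact/"):
    url = urljoin(url, link)
    print(url)
# Ends at http://www.example.com/news/contact/news/contact/
```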
Extra info: this is not something that has been broken for ages; it's a new bug that I'm seeing more and more.
[edited by: NedProf at 2:22 pm (utc) on Jan. 3, 2007]