Some dynamic sections of our site are structured like this:
This URL produces a page that has some content and some internal links. Some of the internal links
are intended to be as follows:
www.ourdomain.com/link-1.ext (.htm/.php - whatever)
While looking (closely) through the logs recently, I found out that GoogleBot was requesting
(and crawling, with a 200 response) URLs like this:
This caught my eye, because we're certainly not publishing/using such URLs anywhere on the site.
Heck, I didn't even know what output such URLs would produce so I tried a few in the browser.
To my horror, I found that they ALL produced a (more or less) duplicate copy of the URL:
I'm sure if left unchecked, this would get a site in a horrible 'duplicate content' mess.
(Well, actually, this site of ours is already in this mess, for different reasons, but that's
beside the point here.)
I was quite sure that GoogleBot was not finding such URLs via our internal links, so I went
investigating further and found out that recently, one webmaster had 'kindly' linked to our URL
I found that this URL (with the extra trailing /) produced our info.php page, with malformed links
in the format:
Done, I think: the root cause of the current problem has been found.
But just think about it: one little unwanted trailing / in an external inbound link has the potential to cause a major disaster!
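To make the mechanism concrete, here is a minimal sketch of how a browser or crawler resolves a relative link against the page URL. The example.com URLs are placeholders I made up for illustration (they are not our real URLs), and I'm assuming the internal links are written as bare relative paths like link-1.ext:

```python
from urllib.parse import urljoin

# Hypothetical URLs, for illustration only.
intended_page = "http://www.example.com/info.php"
linked_page = "http://www.example.com/info.php/"  # same page, extra trailing /

relative_link = "link-1.ext"        # relative link as written in the page
root_relative_link = "/link-1.ext"  # same link, anchored at the site root

# Relative links are resolved against the "directory" of the page URL.
# With the extra trailing /, info.php itself is treated as a directory.
resolved_intended = urljoin(intended_page, relative_link)
resolved_malformed = urljoin(linked_page, relative_link)

# A root-relative (or fully absolute) link resolves the same way either way.
resolved_safe = urljoin(linked_page, root_relative_link)

print(resolved_intended)   # http://www.example.com/link-1.ext
print(resolved_malformed)  # http://www.example.com/info.php/link-1.ext
print(resolved_safe)       # http://www.example.com/link-1.ext
```

If the server then answers the malformed URL with the info.php page again, every relative link on it spawns yet another crawlable duplicate.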
Now how to 'fix' this? I think the 'fix' would have to be (at least) a three-part fix:
1. Strengthen our scripts to do a strict validation of all arguments, to look for such unwanted
parameters and to deal with them in a consistent manner.
(I think this should be a good standard practice for all webmasters/developers, whether they've landed
in trouble with Google or not ;-) I know that this advice has been freely and frequently given out here before, but there's
nothing like 'self-discovery' to make one a true believer ;-))
2. Make all the internal links absolute, always - or use the <base href> tag in the <head> of all pages?
3. Ask the other webmaster to 'correct' the link.
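For point 1, something along these lines is roughly what I have in mind. It's only a sketch of the decision logic, in Python rather than in our actual scripts; the KNOWN_SCRIPTS set and the check_path function are names I invented for the example, and it only looks at the path, not at query-string arguments:

```python
from typing import Optional, Tuple

# Hypothetical list of script paths the site really publishes.
KNOWN_SCRIPTS = {"/info.php"}


def check_path(path: str) -> Tuple[int, Optional[str]]:
    """Decide how to answer a requested path.

    Returns (status, redirect_target):
      200 and None       -> serve the page normally
      301 and the target -> extra path info after a known script,
                            redirect to the canonical URL
      404 and None       -> unknown path
    """
    if path in KNOWN_SCRIPTS:
        return 200, None
    for script in KNOWN_SCRIPTS:
        # Anything after "script/" is unwanted extra path info.
        if path.startswith(script + "/"):
            return 301, script
    return 404, None


print(check_path("/info.php"))             # (200, None)
print(check_path("/info.php/"))            # (301, '/info.php')
print(check_path("/info.php/link-1.ext"))  # (301, '/info.php')
print(check_path("/nothing-here.php"))     # (404, None)
```

Whether the unwanted variants should get a 301 to the canonical URL or a plain 404 is a judgment call; the important part is that they never return a 200 with duplicate content.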
I don't quite know yet if .htaccess can also be deployed to help protect against such 'accidents'.
Perhaps tedster, g1smd, jdMorgan and other experts here would throw some light on all this.