The duplicate content factory (previously known as Google) has been working overtime on my most important site.
In the past few days I have encountered a wide range of directory and filename modifications: changing to all lowercase, changing to all uppercase, and selectively changing the first letter to uppercase (thought you were safe with all lowercase, huh!). I've also found directories that sit at the same level on my server but are nested one within the other in the DCF index! What a mess!
With a 3-tier directory structure, this adds up to a lot of duplicate content given the range of variations used. If the DCF thinks this requires a penalty, and I certainly do, it should start beating itself with a very big stick right now!
If you're going to claim to be a search engine and take content for free from other companies' websites, then the very least you can do is to ensure that you represent those companies' websites accurately in terms of content and structure.
Although time consuming (and at the expense of improving the site for visitors), the DCF's creative input to my directory and file names is relatively easily dealt with, but the real problem that I could do with some advice on is how to deal with filename repetition and filename appending in the DCF index.
This is the kind of thing I'm seeing:
plus it will append a querystring when it feels like it.
Neither cgi.script_name nor cgi.query_string will detect the trailing forward slash after the first filename. The ones with a query string I am able to nail, but I cannot see any way at present to deal with 2 filenames separated by a forward slash.
I'm on an NT/CFM server, and in all other cases I'm checking the script_name against what I know it should be and delivering a page with a robots meta tag when they don't match.
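To make the check concrete, here is a rough sketch of the logic, written in Python rather than CFML purely for illustration. The function names, the example paths, the ".cfm" extension and the "noindex,follow" tag content are all assumptions made for the sketch, not details of the real site. The second function only helps if the full requested path (including whatever follows the first filename) can be obtained from the server, which is exactly the part I'm missing.

```python
# Illustration only: the names, the ".cfm" extension and the meta tag content
# are assumptions, not taken from the real site.
NOINDEX_TAG = '<meta name="robots" content="noindex,follow">'


def robots_meta(requested: str, canonical: str) -> str:
    """Return a robots meta tag unless the request matches the canonical path.

    The comparison is case-sensitive, so /Widgets/Index.cfm does not match
    /widgets/index.cfm.
    """
    return "" if requested == canonical else NOINDEX_TAG


def script_segment_count(full_path: str, extension: str = ".cfm") -> int:
    """Count path segments that look like script filenames.

    A result above 1 means a second filename follows the first after a slash.
    """
    path = full_path.split("?", 1)[0]  # ignore any appended query string
    return sum(1 for seg in path.split("/") if seg.lower().endswith(extension))
```

With hypothetical paths, robots_meta("/Widgets/Index.cfm", "/widgets/index.cfm") returns the tag, and script_segment_count("/widgets/index.cfm/index.cfm") returns 2, which would flag the doubled filename.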
Any ideas on how to detect a consecutive filename combo?