I have a site that uses different servernames to access unique portions of the site (e.g., [careers.company.com...] goes to the careers section). Only pages related to careers should use this URL.
This works really well, but somewhere along the way Googlebot managed to index a number of pages with the wrong URL (e.g., a page in the news section was indexed as [careers.company.com...]).
From the user's standpoint, it doesn't matter: when they click a link in Google, they still land on the correct page. It's only for me that it's created a reporting nightmare.
I need to exclude certain portions of our site from some reporting. Up until now I could do this easily by using a combination of "Multi Homed Domain" and "Directory" filters.
However, these rogue URLs are now causing a problem.
Okay, so here's what I am trying to accomplish...
I need to EXCLUDE everything that goes to the root URL "http://resources.company.com/", but at the same time I need to INCLUDE the rogue URLs that come up as, for example, [resources.company.com...] (this literally could be any extended path).
If I exclude "resources.company.com" using the "Multi Homed Domain" filter, then I lose the rogue pages as well, which I need to keep. I've tried the "Full URL", "File Only", and "Entry Page" filters, but they have no effect when I try to filter out the path "http://resources.company.com/".
Am I making sense?
If I had the Analyzer Suite from WebTrends, this would be no problem, because it has a URL Search and Replace feature that would let me replace the rogue URL with the correct one (i.e., replace "resources.company.com" with "www.company.com"). But I only have the Log Analyzer, and it doesn't have this feature.
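To illustrate, the kind of search and replace I mean is nothing more than the following (a rough Python sketch; the script name, file handling, and the blanket hostname swap are just my illustration, not a WebTrends feature):

    # rewrite_hosts.py - hypothetical pre-processing sketch.
    # Rewrites the rogue hostname to the canonical one in each raw
    # log line before the file is handed to the analyzer. Assumes
    # the log lines actually record the hostname / full URL.
    import sys

    ROGUE = "resources.company.com"    # hostname Google indexed by mistake
    CANONICAL = "www.company.com"      # hostname the reports should use

    for line in sys.stdin:
        # A plain string replace is enough when the hostname only ever
        # appears as a hostname; tighten the match if it can occur elsewhere.
        sys.stdout.write(line.replace(ROGUE, CANONICAL))

You would run it as something like: python rewrite_hosts.py < access.log > access_clean.log, and point the analyzer at the cleaned file.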
Any suggestions on how I can get this to work would be greatly appreciated.
The best solution I have used in similar situations is to check on the server side which hostname (entry point) was requested for the page. You set the default you want (your third-level domain structure), and if the request does not match it, you dynamically insert a META robots noindex tag.
This way the spiders will only be able to index one entry point for each unique page, and you ensure consistent logging for your analysis. It also makes sure you don't spam by accident (having multiple entry points can look like multiple identical copies to a search engine, which might penalize you for it).
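A minimal sketch of the idea in Python (assuming a CGI-style environment where the requested hostname arrives as HTTP_HOST; your platform may differ, and the helper name and hostnames are only examples):

    def robots_meta(request_host, canonical_host):
        # Strip an optional :port and compare case-insensitively.
        host = request_host.split(":")[0].lower()
        if host != canonical_host.lower():
            # Reached under the wrong hostname: tell spiders
            # not to index this copy of the page.
            return '<meta name="robots" content="noindex">'
        return ""  # canonical entry point: let it be indexed

    if __name__ == "__main__":
        # A page requested under the wrong hostname gets the tag;
        # the same page under its canonical hostname does not.
        print(robots_meta("careers.company.com", "www.company.com"))
        print(robots_meta("www.company.com", "www.company.com"))

In a CGI setup you would call it with os.environ.get("HTTP_HOST", "") and write the result into each page's <head>, with each page knowing its own canonical hostname.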
Mikkel: Thanks for the tip. I will be sure to implement your suggestion going forward.