Forum Moderators: Robert Charlton & goodroi
I want www.example.com indexed
and www.example.com/?ad=xyz removed
Will this cause any trouble with the main url?
[Note: client uses Yahoo hosting so I can't edit robots.txt]
[edited by: tedster at 6:07 pm (utc) on Dec. 18, 2008]
[edit reason] switch to example.com - it can never be owned [/edit]
Added:
Yahoo hosting does not automatically rule out editing robots.txt as I understand it - only sites built using their "Store Editor" are restricted.
Some sites have taken to using the # mark. That's usually a fragment identifier, but these sites are using it for tracking various referrals. Search engines do not consider different fragment identifiers to be different URLs.
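To illustrate the idea: the fragment never reaches the server and the engines see /page#src=a and /page#src=b as the same URL, so the tracking value has to be read client-side. A minimal sketch (names here are made up for illustration, not from any of these sites' actual code):

```javascript
// Parse a tracking fragment like "#src=partner1&camp=holiday" into an
// object of name/value pairs. The fragment is client-side only, so the
// canonical URL the engines index is unchanged.
function parseTrackingFragment(hash) {
  var params = {};
  if (!hash || hash.charAt(0) !== '#') return params;
  var pairs = hash.substring(1).split('&');
  for (var i = 0; i < pairs.length; i++) {
    var kv = pairs[i].split('=');
    if (kv.length === 2) params[kv[0]] = kv[1];
  }
  return params;
}

// In a browser you would feed it location.hash, e.g.:
// var t = parseTrackingFragment(location.hash);
```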
301 redirect query string [google.com]
We not only had tracking codes to indicate which CPM display campaign or which CPA affiliate sent us traffic, we also had tracking codes that were being used to track things internally as well.
For example, the home page on our site might have multiple links to the same destination page (one in the header, one in the footer, one embedded in a text blurb, all on the same page) and business owners wanted to track which link consumers were clicking on the most. So our old (not so SEO savvy) developers would add something like placement=1, placement=2, and placement=3 to the header, footer, and contextual link, respectively, to track which link they clicked on. So the home page would have 3 links to the same destination page similar to the following:
<a href="/folder/somedestinationpage.asp?placement=1">linktext</a>
<a href="/folder/somedestinationpage.asp?placement=2">linktext</a>
<a href="/folder/somedestinationpage.asp?placement=3">linktext</a>
Between the internal tracking and external tracking, we had single pages with literally hundreds of thousands of possible combinations of tracking query string parameters. Our site has about 25,000 URLs, yet we were showing over 250,000 URLs indexed. Talk about a canonicalization nightmare!
First we created an XML file that listed all of the internal and external tracking codes we might see on the URL. For each tracking code the XML defined:
1) the name of the code as it appears in the URL (the query string name),
2) a flag to indicate whether or not it should be cookied, and if applicable, the name of the cookie where it should be stored,
3) how long it should be cookied (expiration),
4) whether the cookie should be logged for our web analytics program when the requested page eventually loads,
5) whether the cookie should be cleared by the page being requested when it eventually loads (after it has been logged for our web analytics program, if necessary),
and several other parameters to tell our system how to handle the query string parameters and cookied values.
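The post doesn't show the actual XML schema, but a hypothetical config along these lines (all element and attribute names here are invented for illustration) conveys the shape:

```xml
<!-- Hypothetical tracking-code config; the real element/attribute
     names from the original XML are not given in the post -->
<trackingCodes>
  <code name="affid" cookie="true" cookieName="AffiliateID"
        expireDays="30" logForAnalytics="true" clearAfterLog="false" />
  <code name="placement" cookie="true" cookieName="Placement"
        expireDays="0" logForAnalytics="true" clearAfterLog="true" />
  <!-- deprecated code: listed so it gets stripped from the URL, but not cookied -->
  <code name="ad" cookie="false" />
</trackingCodes>
```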
Next we wrote something to use the tracking XML to process incoming page requests before .NET got hold of the request. We are an IIS shop (not Apache), so our options for solving this are different. But we basically wrote an HTTP Module that becomes part of the IIS pipeline and gets executed on each page request before the page is handed over to the .NET handler that executes the actual ASPX pages. The HTTP Module does essentially the following for every page request (slightly simplified):
*** BEGIN HTTP MODULE LOGIC ***
First - Check for tracking codes in the URL:
Cycle through the tracking XML to see if any tracking codes listed in the tracking XML are present in the query string. If so then
1) cookie the tracking code (if applicable) as indicated by the tracking XML (it tells the name of cookie to be used, how long before expiring, etc)
2) remove the query string name=value pair from the query string, storing the new modified URL in a NewURL variable, and
3) set a flag (URLModified) that we have modified the URL.
Once we have processed the entire tracking XML against the requested URL, if URLModified = 'true' then 301 redirect to the NewURL which is free of tracking codes. This will pass on the link juice to the 'cleaned' tracking-free URL.
NOTE: This process only strips tracking codes listed in the XML. Other query string parameters not used for tracking purposes will remain in the URL. It should also be noted that all tracking codes listed in the XML get removed regardless of whether the XML says they should be cookied. I can deprecate a tracking code by listing it in the XML with a flag to NOT cookie its value. This will result in it simply getting stripped from the URL.
Second - If the request gets this far the URL is now free of tracking codes, so process the cookies:
Before handing the page request over to the .NET handler to render the page, the HTTP Module checks to see if it has some work to do with cookies (e.g. if the request was just redirected to a 'cleaned' URL as a result of the first step above). It performs the following:
Cycle through the XML file again to see if any cookie names listed for tracking codes exist for the consumer. For each cookie listed in the XML that exists for the consumer:
1) Check the XML to see if we need to log the tracking cookie value for our web analytics. If so log it.
2) Check the XML to see if the cookie should be cleared once it is logged. (This is for things like the internal tracking example above, used to figure out where consumers are clicking - these tracking codes are essentially only good for one click.)
*** END HTTP MODULE LOGIC ***
Once all of the URL processing and cookie processing is complete, the HTTP Module exits and IIS gives the request to the .NET handler to process the page request which now has a cleaned, tracking-free query string.
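The original module was written for IIS/.NET and its source isn't shown here, but the first step (strip tracking parameters, decide whether to 301) can be sketched language-neutrally - here in JavaScript, with all names invented for illustration:

```javascript
// Sketch of the module's first step. Given a requested URL and the
// tracking-code names from the XML, strip any tracking parameters,
// keep the rest, and report whether a 301 redirect is needed.
function stripTrackingCodes(url, trackingNames) {
  var parts = url.split('?');
  var path = parts[0];
  var removed = {};   // code name -> value, saved for the cookie step
  var kept = [];      // non-tracking parameters stay in the URL
  if (parts.length > 1 && parts[1].length > 0) {
    var pairs = parts[1].split('&');
    for (var i = 0; i < pairs.length; i++) {
      var kv = pairs[i].split('=');
      if (trackingNames.indexOf(kv[0]) !== -1) {
        removed[kv[0]] = kv[1];  // cookie it later per the XML flags
      } else {
        kept.push(pairs[i]);
      }
    }
  }
  var newUrl = kept.length ? path + '?' + kept.join('&') : path;
  // if modified is true, the module 301-redirects to newUrl
  return { newUrl: newUrl, removed: removed, modified: newUrl !== url };
}
```

For example, stripTrackingCodes('/folder/page.asp?placement=2&cat=5', ['placement']) leaves the non-tracking parameter in place and yields '/folder/page.asp?cat=5' as the redirect target.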
Note this code worked well for our internal tracking as well. In the example above where I indicated our home page might have 3 links to the same destination page, previously those links appeared similar to:
<a href="/folder/somedestinationpage.asp?placement=1">linktext</a>
<a href="/folder/somedestinationpage.asp?placement=2">linktext</a>
<a href="/folder/somedestinationpage.asp?placement=3">linktext</a>
I changed this so that now they look like:
<a href="/folder/somedestinationpage.asp" onclick="SetCookie('placement:1')">linktext</a>
<a href="/folder/somedestinationpage.asp" onclick="SetCookie('placement:2')">linktext</a>
<a href="/folder/somedestinationpage.asp" onclick="SetCookie('placement:3')">linktext</a>
A javascript function called SetCookie was written. It takes a pipe-delimited string of name:value pairs like 'cookiename1:value1|cookiename2:value2|...|cookienameN:valueN' and sets all of the cookies in the list to their corresponding values.
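The post doesn't include the SetCookie source, but a minimal sketch of such a function, assuming the pipe-delimited name:value format described above, might look like this:

```javascript
// Parse a pipe-delimited list like "placement:2|source:header"
// into an array of {name, value} pairs.
function parseCookiePairs(list) {
  var out = [];
  var items = list.split('|');
  for (var i = 0; i < items.length; i++) {
    var sep = items[i].indexOf(':');
    if (sep > 0) {
      out.push({ name: items[i].substring(0, sep),
                 value: items[i].substring(sep + 1) });
    }
  }
  return out;
}

// Write each pair as a session cookie; the server-side module then
// logs and (per the XML flags) clears it on the next page load.
function SetCookie(list) {
  var pairs = parseCookiePairs(list);
  for (var i = 0; i < pairs.length; i++) {
    document.cookie = pairs[i].name + '=' +
                      encodeURIComponent(pairs[i].value) + '; path=/';
  }
}
```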
The placement tracking code was set up in the tracking XML to get logged for our web traffic analyzer and then cleared on the next page load. So this cookie only has a value between when they click on a link and when the page being linked to actually loads, essentially long enough for us to log the click for our web traffic analyzer.
So now all of the links on our site are clean ("/folder/somedestinationpage.asp" w/ no query string) yet I can still track where they clicked if they have cookies and javascript turned on. The number of consumers w/ javascript and cookies enabled is more than sufficient to know which links people are clicking the most and the approximate distribution.
Any external sites linking to us w/ the old tracking codes in the URLs will result in the tracking codes being cookied, removed from the URL, and a 301 redirect to the clean URL.
This drastically reduced the number of URLs indexed by the engines and eliminated link equity being split over multiple URLs for the same page due to tracking codes.