google indexing ?CFID=CFTOKEN=

Forum Moderators: phranque

meelosh

2:27 pm on Jul 19, 2010 (gmt 0)

I am seeing in one of my sites that google is adding some coldfusion session id's to some urls and marking them as duplicate titles. why is this happening and what can i do to prevent it? or will it go away? thanks
?CFID=&CFTOKEN= (with a lot of numbers, of course)

phranque

12:09 am on Jul 20, 2010 (gmt 0)

the best technical solution for this situation is always a 301 redirect.

however it is also possible to suggest to google the parameters that should be ignored:
Parameter handling - Webmaster Tools Help [google.com]
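As a rough sketch of the redirect idea (not anyone's actual setup), the Python below strips CFID and CFTOKEN from a requested url and reports where a 301 should point. The function names and the SESSION_PARAMS set are made up for illustration, and hooking this into a real server or framework is left out.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption: these are the only parameters to drop (compared case-insensitively).
SESSION_PARAMS = {"cfid", "cftoken"}


def canonical_url(url):
    """Return url with the session parameters removed; other parameters keep their order."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in SESSION_PARAMS
    ]
    # urlencode re-encodes the remaining values, so unusual escaping may change.
    return urlunsplit(parts._replace(query=urlencode(kept)))


def redirect_target(url):
    """Return the url a 301 should point to, or None if url is already canonical."""
    target = canonical_url(url)
    return target if target != url else None


print(redirect_target("http://example.com/page?id=5&CFID=123&CFTOKEN=456"))
print(redirect_target("http://example.com/page?id=5"))
```

A request handler would call redirect_target() on the incoming url and answer with a 301 whenever it returns something other than None.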

meelosh

12:10 pm on Jul 20, 2010 (gmt 0)

thanks for getting back, phranque. i will do both. i had a look in the parameter handling settings and CFID is there to be added, so i will do that. i had kind of forgotten about the parameter handling setting. thanks again for the help.

enigma1

5:29 pm on Jul 26, 2010 (gmt 0)

I think the redirect is irrelevant as long as there are no links with private parameters exposed from within your domain when a spider crawls them.

Instead I would check the application on the server. Theoretically, web apps that personalize content should recognize requests from popular spiders and neither start sessions nor send cookies.

phranque

12:04 am on Jul 27, 2010 (gmt 0)

the redirect is relevant whenever the spider makes a request with non-canonical parameters, regardless of how such a url is discovered.
the problem also exists with external links that are cut-and-pasted from personalized or tracked pages.
when these links are followed then your server needs to do the redirect so that the url is depersonalized or the tracking parameters are removed.

enigma1

7:21 am on Jul 27, 2010 (gmt 0)

the redirect is relevant whenever the spider makes a request with non-canonical parameters, regardless of how such a url is discovered.

I don't think so. It's directly related to where the url is. If they are inside the domain then there is a problem. If they are outside the domain it is irrelevant.

To give you an example, consider a valid link in the domain:
http://example.com/?param1=1&param2=2

Someone posts on another site
http://example.com/?param1=1&param2=2&param4=1&param5=1

The spider may use the link to enter your domain but will not index the link unless it is found inside your domain.

And even if the parameters are in a different order, it won't be a problem unless they are inside the domain.

phranque

8:19 am on Jul 27, 2010 (gmt 0)

the search engine may index any url that returns a 200 OK status code.

if the url contains the same parameters and values in a different order that will be considered a different url, merely with a non-canonical parameter order.
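To illustrate the parameter-order point: compared as plain strings, the two urls below are different, and only an explicit normalization step makes them match. This is a small Python sketch; sort_params is a hypothetical helper, and sorting alphabetically is just one possible choice of canonical order.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def sort_params(url):
    """Return url with its query parameters sorted by name, then by value."""
    parts = urlsplit(url)
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit(parts._replace(query=urlencode(pairs)))


a = "http://example.com/?param1=1&param2=2"
b = "http://example.com/?param2=2&param1=1"

print(a == b)                            # different strings, so different urls
print(sort_params(a) == sort_params(b))  # same url once the order is normalized
```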

The spider may use the link to enter your domain but will not index the link unless it is found inside your domain.

i would be interested to read any authoritative reference you have to support that assertion.

here is one example of many WebmasterWorld threads that discuss various forms of external url discovery.
Google indexing large volumes of (unlinked?) dynamic pages:
http://www.webmasterworld.com/google/3490043.htm [webmasterworld.com]

enigma1

10:43 am on Jul 27, 2010 (gmt 0)

the search engine may index any url that returns a 200 OK status code.

I disagree, and I believe I can prove it.

You can create an invalid link to my site that returns a 200 OK status. The spiders will never show it in their index. So if you do that and then search site:www.example.com, you won't ever see the page. Otherwise the spider has a serious indexing problem.

Now if you ask me to do the opposite, I don't know the results, because I don't know the application you have on your site and its specifics. Lots of applications (carts, cms, forums etc.) incorrectly propagate parameters passed through the url, and then these parameters are exposed inside their pages. A prime example is WP, which has all kinds of issues with this.

So do this.
1. Go to any WP page that has comments posted.
2. Hover over the date of a posted comment; the url should look like example.com/page/#comment-1234
3. Copy the link to the address bar of the browser and change the url so it reads example.com/page/?param=invalid
4. Then hit enter to load the page with the invalid url.
5. Repeat step 2 and see how the urls now show in the page.

So that means if I post an invalid url for another site that has WP installed, that site now has problems, because a spider finds the link to my page and then it verifies its existence because WP propagates invalid parameters.

And that is what I understand is happening in this case, from the OP's comments. Either internally or externally, some parameters propagate. That problem is application specific.

phranque

10:56 am on Jul 27, 2010 (gmt 0)

You can create an invalid link to my site that returns a 200 OK status.

how do you define "an invalid link" to your site?
is this a non-canonical url that is nevertheless returning relevant content?
or is this a completely bogus (or perhaps obsolete) url that is getting a 200 OK response while being shown the home page or perhaps a "not found" page? (also known as a "soft 404")
or...?

So that means if I post an invalid url for another site that has WP installed, that site now has problems, because a spider finds the link to my page and then it verifies its existence because WP propagates invalid parameters.

yes, that's what it means - unless the webmaster of that site did some extra work to redirect requests for non-canonical urls containing fictitious parameters.

although in your specific example, those comment urls don't get indexed because they are essentially the same url as the page on which they reside and everything after the hash mark (#) is a fragment identifier.
google recently started indexing special fragment identifiers that start with '#!' and are intended to make ajaxian applications stateful, but other than these, google and the other SE's ignore document fragment specifications when indexing urls.
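A small Python sketch of that fragment rule, purely for illustration: index_key is a made-up name, and the '#!' special case simply mirrors the description above rather than any documented crawler behaviour.

```python
from urllib.parse import urldefrag


def index_key(url):
    """Return the url as an indexer ignoring ordinary fragments would see it."""
    base, fragment = urldefrag(url)
    # Assumption: '#!' fragments are kept, every other fragment is dropped.
    if fragment.startswith("!"):
        return url
    return base


print(index_key("http://example.com/page/#comment-1234"))
print(index_key("http://example.com/app#!view=1"))
```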

enigma1

11:21 am on Jul 27, 2010 (gmt 0)

how do you define "an invalid link" to your site?

Any link that does not exist in exactly the same way in my pages.

So urls with canonical problems are invalid urls. If you have example.com/page.php that is valid and someone puts example.com/page.php?id=5 in the address bar of their browser, your site doesn't need to redirect or anything. It can very well respond with 200 OK and it will not be indexed by spiders, because the link doesn't exist anywhere in your pages.

The fragment was to illustrate the problem. There are many other examples.

phranque

11:43 am on Jul 27, 2010 (gmt 0)

there must be hundreds of WebmasterWorld threads dealing with this specific problem.
here is a thread where one of our most prolific and respected contributors describes the results of his tests of the effects of non-canonical inbound links [webmasterworld.com].

enigma1

11:53 am on Jul 27, 2010 (gmt 0)

That post explains issues with different domains, not urls inside the same domain.

example.com and www.example.com and joe.example.com are all different domains. How they are mapped is again an application issue.

But when a spider crawls a site, it will index links found in the HTML source. If the links in the HTML source have problems then they will be indexed with their problems. But if an outsider posts an invalid link and the link doesn't exist in the HTML source of any page in the same domain, the spider won't index it. Otherwise anyone could flood the spider index with garbage.

phranque

12:30 pm on Jul 27, 2010 (gmt 0)

Otherwise anyone could flood the spider index with garbage

a SE will not index a garbage url if it is 301 redirected to a "good" url or if it gets a 404/410 or robots noindex response.

enigma1

1:05 pm on Jul 27, 2010 (gmt 0)

yes and it will not index a page that doesn't exist anywhere in the domain, even if the server response is 200 OK.

And every site that I checked will return 200 OK if you add some invalid parameters to a valid url. It's up to the application what it does with the parameters.

meelosh

2:13 pm on Jul 27, 2010 (gmt 0)

I have noticed that on 90% of the sites out there, if i add an invalid parameter to the url, it will show the page with the correct url and the invalid parameters attached. only a very small handful come up 404.
yesterday WMT acknowledged my parameter change. it seems to take a while. now i will see if it removes the so-called duplicate titles.

BillyS

3:04 pm on Jul 27, 2010 (gmt 0)

My approach would be...

1 - Try to stop the server from behaving this way
2 - Webmaster tools (ignore parameter)
3 - Modify robots.txt to stop those urls being crawled

phranque

9:40 pm on Jul 27, 2010 (gmt 0)

I am seeing in one of my sites that google is adding some coldfusion session id's to some urls

where are you seeing these urls?
i'm assuming it's not a CF-based site.
server access logs?
google index?

meelosh

10:00 pm on Jul 27, 2010 (gmt 0)

marking them as duplicate titles

phranque, they have appeared in my wmt under the html suggestions and are coming up as duplicate titles.
no, the site is not CF based, and i cannot find any inbound links with the exact url and parameter. i cannot figure out where they are coming from, but i have suggested that google ignore the parameters and will see what happens.
It is a very old domain that i took over a long time ago, and with the deep crawls that have been showing all of these thousands of new inbound links, it could be something from 10 years ago with cold fusion parameters that google stumbled across. i don't know.

phranque

10:07 pm on Jul 27, 2010 (gmt 0)

(i missed the "duplicate titles" when rereading your OP.)

have you looked for those urls in the index using a search such as:
site:example.com inurl:CFID
or have you looked for the inbound links using yahoo siteexplorer or some other backlink resource?

meelosh

10:27 pm on Jul 27, 2010 (gmt 0)

yip, have done all that, and nothing turns up. the thing is that in wmt there are thousands more links than show up in any backlink checker. i still don't quite get what is going on in there, as the links shown (to me) are about a third of the total number shown. i originally thought it was broken, as the internal link number is also way out, but it has been around for a while now and the number grows quite a bit every day or so, yet i only get to see about a third of it. sorry if i am not making sense, it has been a long day.

enigma1

8:13 am on Jul 28, 2010 (gmt 0)

i cannot find any inbound links with the exact url and parameter

How do you test this? From your wmt, take one of the invalid links, paste it in the browser, then extract the <a> tags from the generated html source code and check them. See if there are links with invalid parameters in them.
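The check described above could be scripted roughly like this in Python. It is only a sketch: LinkCollector and links_with_params are hypothetical names, the suspect parameter list is an assumption, and fetching the page source is left to whatever tool you prefer.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit, parse_qsl


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_with_params(html, suspect=("cfid", "cftoken")):
    """Return the hrefs in html whose query string carries a suspect parameter."""
    collector = LinkCollector()
    collector.feed(html)
    flagged = []
    for href in collector.hrefs:
        query = urlsplit(href).query
        names = {name.lower() for name, _ in parse_qsl(query, keep_blank_values=True)}
        if names & set(suspect):
            flagged.append(href)
    return flagged


# Hypothetical page source, standing in for the html the browser received.
page = '<a href="/page?CFID=1">x</a> <a href="/ok">y</a>'
print(links_with_params(page))
```

If the list comes back empty for a page requested with the invalid parameters, the application is not propagating them into its own links.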

In most cases (not always) what you see in the wmt are links found by the spider while crawling your site.