Forum Moderators: open
You can play with it on Microsoft's own site. Just add (A(abcdefg-1234567) between any two slashes. You can change it as much as you like to create lots of different URL's. The only rules are that the letter "A" has to be only one letter and it has to be a capital. The 2nd part can be just about anything. There are some characters like # or % that you can't use. You could get real fancy with it and add something like (I(id=1546973)). This could make Google think it is a session id.
[microsoft.com...]
I'm not sure what this is. I'm sure it is a "feature". Does anybody know what it is and how to turn off?
User-agent: *
Disallow: /(*
Personally I wouldn't use the robots.txt to control this. Anything you put in the robots.txt is going to become indexed with a URI only by Google. There is no need to let them know about the problem via robots.txt. It should be solved at the source and not with a bandaid workaround.
StuWhite
Did you test this entry robots text through the google webmaster tools to see if would disallow the indexing?
Yep, I tested it in webmaster console before implementing it.
The console also reports Googlebot attempts which are actually being blocked by the robots.txt so I know it is working.
There is no need to let them know about the problem via robots.txt
I don't think you need to classify it as a 'problem'. All you are doing is telling search bots not to index pages which include a session variable. I don't see it as advertising some huge security flaw in your system.
[edited by: StuWhite at 2:20 pm (utc) on Aug. 8, 2007]
User-agent: *
Disallow: /(
...is the valid (and presumably widely-supported) option I've chosen until something better arrives. There is apparently no need for a trailing *, owing to the way the robots syntax works.
Note that this is only a partial fix though as the bracketed part doesn't necessarily need to be right after the first backslash after the domain.
[edited by: Panic_Man at 8:53 am (utc) on Aug. 9, 2007]
In general the URLs with thedubious characters in are being handled as 404, but are returning a 200 in the headers.
Putting the following ASP code in the 404 Error Document will fix the behaviour.
***
Response.status = "404 File not found"
server.transfer "/404location.htm"
***
<%
Response.status = "404 File not found"
server.transfer "/404.htm"
%>
404.htm is the old custom 404 page I had that works fine when it is set up as the custom 404 in IIS. In IE I get a default 404 page and firefox trys to get me to down load a file. If I browse the file 404.asp it works as expected and brings up the contents in 404.htm.
When I check the 404 code with the header checker I get a 404 for each method. The problem is that the new method does not seem to work in a browser.
If you miss the forward slash it treats the address as being relative to the current folder, rather than being relative to site root - given that the problem URLs are effectively slipping the folder structure down a level yoou should really look at making the files relative to the site root.
Disallow: /(*
The * should be on the left as a wildcard.
.
The Disallow directive, matches URL that begin with whatever is disallowed.
So:
Disallow: /abcd
stops any URL that begins /abcd
and:
User-agent: Googlebot
Disallow: *(
disallows any URL that starts with anything and then includes a ( bracket and then continues with either anything or nothing else.
Wildcards should be aimed only at Googlebot.
If you want to turn off Cookieless State in Asp.net I have made example Configuration section to disable Cookieless Sessions. This will not stop all the unwanted behavior that has been noted. But it will stop the bots from getting new Cookieless urls handed to them automatically, does nothing for the old urls with the cookieless session keys in them.
Definition:
Cookieless. The cookieless option for ASP.NET is configured with this simple Boolean setting.
<configuration>
<system.web>
<sessionState cookieless="false" />
</system.web>
</configuration>
References:
ASP.NET Session State [msdn2.microsoft.com]
sessionState Element (ASP.NET Settings Schema) [msdn2.microsoft.com]
Underpinnings of the Session State Implementation in ASP.NET [msdn2.microsoft.com]
One of the things I like to do to a website is enforce URL's. Every page is aware of itself and will not allow access to itself unless the proper url is used. Otherwise it will send out a 301 to the proper URL.
This is the best way to handle it and a lesson I learned the hard way a few years ago.
BTW, it's not just a case-sensitivity issue. You can rearrange the parameters in the querystring of a URL and most bots will look at each combination as a different page with the same content. ie. somepage.asp?a=1&b=2 will render the same page as somepage.asp?b=2&a=1 but the URLs are different and are therefor two different pages with the exact same content. In order to prevent this situation you need to inspect the URL and 301 to the correct one if the values are out of proper order. It's not trivial but fairly straightforward once you figure it all out.