Forum Moderators: Robert Charlton & goodroi
I have somewhere around 500-600 legitimate pages. Google reports about 600, but when I start browsing the pagination it drops to about 450, so those are the only URLs I can inspect. A session ID doesn't appear in any of those 450, but what about the other 150?
Yahoo Site Explorer reports about 1200 pages of which about half are duplicated with a session ID. Those duplicated URLs must be very old because I'm sure the site is properly suppressing session IDs at this point.
How would you handle this?
If so, I'd say you have no concern at all. There are no "hidden" URLs affecting you; it's just an artifact of the complex way Google has to collect page-count information from data that is sharded in many different ways across their proprietary database format.
My site suppresses all session IDs from appearing in the URL if a robot user agent is detected, but I'm concerned that some may have slipped through anyway.
If detecting based on UA and not IP, I would think you are subject to a few issues.
Yahoo Site Explorer reports about 1200 pages of which about half are duplicated with a session ID.
I'd be concerned about those.
Those duplicated URLs must be very old because I'm sure the site is properly suppressing session IDs at this point.
Was there a point where things were not working properly? Take a sampling of those URIs with the indexed IDs and see if you can find references to them.
If detecting based on UA and not IP, I would think you are subject to a few issues.
So you think a high-profile robot like Yahoo or Google will sometimes spider a page without their usual UA?
Was there a point where things were not working properly? Take a sampling of those URIs with the indexed IDs and see if you can find references to them.
I've been in business for years and I wasn't always filtering session IDs. Do you mean I should see if Yahoo is currently trying to access those URIs?
So you think a high-profile robot like Yahoo or Google will sometimes spider a page without their usual UA?
No, not their "normal" bots. I tend to think they have "other bots" that perform "other functions" and some may not be easily identified, even by IP.
Do you mean I should see if Yahoo is currently trying to access those URIs?
I'd want to find out why they are, yes. I'd be performing advanced searches to see where and if there are any hard-coded references to those session IDs. Maybe not from your own site, but from someone else's?
I would investigate the list of indexed URLs and copy down the session-ID-string of all those indexed session IDs. If the number of indexed strings is fairly small (perhaps a few dozen, or so) I would set up a 301 redirect for any requests for URLs with those session IDs included.
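As a sketch of that approach, a mod_rewrite rule along these lines could match a short list of known indexed session IDs. The parameter name "session" and the IDs "abc123"/"def456" are placeholders, not taken from the thread:

```apache
RewriteEngine on
# Placeholder session IDs: substitute the actual strings copied
# down from the index, separated by "|"
RewriteCond %{QUERY_STRING} (^|&)session=(abc123|def456)(&|$)
# Redirect to the same path with the query string stripped (the trailing "?")
RewriteRule (.*) /$1? [R=301,L]
```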
Rather than detect bots and suppress session IDs for them, I would suppress session IDs for all users who have not logged in to the site. Bots can't log in, therefore they never see session IDs. This doesn't rely on any user-agent, IP, or referrer data and is much more reliable.
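A minimal sketch of that login-based approach, in Python for illustration (the function and parameter names are my own, not from the thread):

```python
# Sketch: only logged-in users ever get a session ID in the URL.
# Bots can't log in, so they always receive clean URLs.
def add_session_to_url(path, session_id, logged_in):
    """Append the session parameter only for authenticated users."""
    if not logged_in:
        return path
    return f"{path}?session={session_id}"

print(add_session_to_url("/catalog", "abc123", logged_in=False))  # /catalog
print(add_session_to_url("/catalog", "abc123", logged_in=True))   # /catalog?session=abc123
```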
If the number of indexed strings is fairly small (perhaps a few dozen, or so) I would set up a 301 redirect for any requests for URLs with those session IDs included.
Actually, none of those hundreds of session IDs should match. Which of these would you do:
1. 301 them all
2. set up Yahoo's "Dynamic URL Rewriting"
3. set up robots.txt to filter the session IDs
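For option 3, a robots.txt rule along these lines is one possibility, using the wildcard syntax that Googlebot and Slurp support as an extension to the original robots.txt standard (the parameter name "session" is a placeholder):

```
User-agent: *
Disallow: /*?session=
Disallow: /*&session=
```

One caveat: robots.txt blocks crawling but does not by itself remove URLs that are already indexed, so already-indexed session ID URLs may linger as URL-only references.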
Rather than detect bots and suppress session IDs for them, I would suppress session IDs for all users who have not logged in to the site.
The point of including the session ID in the URL for non-robots is to prevent a new session from being created upon each access from a user who doesn't accept cookies. Otherwise, that user can't use the site and the server has to work hard creating all those sessions. This is a really tricky balancing act that I've struggled with quite a bit.
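One common way to handle that balancing act, sketched here under assumed names rather than the poster's actual implementation, is to set a test cookie on the first hit and fall back to URL-based session IDs only for clients that fail to return it:

```python
# Sketch: decide whether to embed the session ID in URLs based on
# whether the client echoed back a previously set test cookie.
# All names here are illustrative.
def needs_url_session(request_cookies, is_first_visit):
    """URL-based sessions only for repeat visitors who returned no cookie."""
    if is_first_visit:
        # First hit: we haven't had a chance to test cookies yet,
        # so keep URLs clean and set a test cookie in the response.
        return False
    # Repeat hit with no cookie echoed back: client rejects cookies.
    return "test_cookie" not in request_cookies

print(needs_url_session({}, is_first_visit=True))                     # False
print(needs_url_session({"test_cookie": "1"}, is_first_visit=False))  # False
print(needs_url_session({}, is_first_visit=False))                    # True
```

Under this scheme, crawlers (which start every fetch as a "first visit" and never echo cookies back within a session) still see clean URLs on the initial hit, which is when indexing matters most.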
edit:
Is there a possibility Google has some session ID URLs in its index like Yahoo does, even though they do not show up with site:www.example.com?
[edited by: Tonearm at 5:34 pm (utc) on Aug. 27, 2007]
(... where "session" is whatever you use as the parameter name).
I recommend using the same detection method you are currently using.
IOW: If you use IP, use the REMOTE_ADDR code and if you are using user agent, use the HTTP_USER_AGENT code.
RewriteEngine on
# IP.NUM.NUM.NUM is a placeholder: substitute the bot's actual IP address
RewriteCond %{REMOTE_ADDR} ^IP\.NUM\.NUM\.NUM$
# Only act on requests that carry a query string
RewriteCond %{THE_REQUEST} \?
# Redirect to the same path with the query string stripped (the trailing "?")
RewriteRule (.*) /$1? [R=301,L]
OR (for user agent based)
RewriteEngine on
# "robotstring" is a placeholder: substitute the bot's actual user-agent string
RewriteCond %{HTTP_USER_AGENT} ^robotstring$
# Only act on requests that carry a query string
RewriteCond %{THE_REQUEST} \?
# Redirect to the same path with the query string stripped (the trailing "?")
RewriteRule (.*) /$1? [R=301,L]
If you need query strings preserved for some URLs requested by bots, you can match your session variable specifically in THE_REQUEST instead:
If 'sessionID' is the first variable:
RewriteCond %{THE_REQUEST} \?sessionID=
OR
If 'sessionID' is not the first variable, match the first variable's name, followed by '=', followed by one or more characters that are not an '&', followed by an '&', followed by 'sessionID':
RewriteCond %{THE_REQUEST} \?var=[^&]+&sessionID=
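For illustration, here are the same two patterns checked with Python's re module against sample request lines ('sessionID' and 'var' are the placeholder names used above, not real parameter names):

```python
import re

# Mirror the two RewriteCond patterns so you can check which
# THE_REQUEST lines they would match.
first_var = re.compile(r"\?sessionID=")
second_var = re.compile(r"\?var=[^&]+&sessionID=")

req_a = "GET /page.php?sessionID=abc123 HTTP/1.1"
req_b = "GET /page.php?var=x&sessionID=abc123 HTTP/1.1"
req_c = "GET /page.php?var=x HTTP/1.1"

print(bool(first_var.search(req_a)))   # True: sessionID is the first variable
print(bool(second_var.search(req_b)))  # True: sessionID follows another variable
print(bool(first_var.search(req_c)) or bool(second_var.search(req_c)))  # False
```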
See the Apache Forum [webmasterworld.com] for more details on setting this up correctly.
Justin
Isn't that what I did above?
I'm not sure because we are examplifying. I'd take the full URI string, everything, put it between quotes and perform the searches. What you will typically find are references to those strings in pages where they may not be linked. Or, they were linked at one point and are now no longer linked. I've seen all sorts of stuff going on when doing "quoted searches" with URI strings.
"www.example.com/?session="
That example above should include the full Session ID, all the way to the end of the string: not just part of it, but the entire string.
"www.example.com/?session=abc123"
as opposed to just:
"www.example.com/?session="
Is that right? Are you saying I should take the URLs Yahoo has which include a session ID and see if Google has them too? Then I can use that information to track down a link? Is that the idea?
Basically we are trying to find where and if there are any references to those URIs with the Session IDs. One sure way to do this is searching for those strings in quotes.
It's very difficult to determine what your particular issues are, as you state Yahoo! reports 1200 links and half contain a Session ID. They had to come from somewhere. And if they are still there and have "fresh dates", they are still being served. Now, where the heck are they being served from? A scraper site somewhere?
[edited by: pageoneresults at 6:43 pm (utc) on Aug. 28, 2007]
Google definitely doesn't have a session ID URL in a site: search or a link: search. I checked those manually too. Yahoo sure is loaded up with them though. I guess they must be from old crawls or "UA-disguised" crawls.
DUDE! Don't do a site: or link: search. Do the quoted string searches as I've specified above. The other two advanced searches are not going to give you what we are looking for, which are references to those URI strings whether they are linked or not.
The link: command is not going to show you all of the links out there that Google knows about; it has been that way for years. Yahoo!'s Site Explorer will give you a larger number of results when performing those advanced searches, and so will Live. Google throttled that back years ago.
If you perform a quoted string search in Google and cannot find references to those then you're probably okay there. Now, where is Yahoo! getting them from? Check the cache dates.
What does this search [google.com...] give you?
(... where "session" is whatever you use as the parameter name).
Are you saying I should do a quoted Google search for the 600 or so URIs Yahoo has which include a session ID? Where do I find the Yahoo cache date? The "Yahoo" portion of the cache display is obscured because I use CSS absolute positioning for source-ordered content. I can't find any such date by viewing the source code of the cache either.
g1smd,
I'm still getting nothing after de-examplifying that search, so I guess that's a good sign.
Are you saying I should do a quoted Google search for the 600 or so URIs Yahoo has which include a session ID?
No, just a sampling. I'd take about 10 of them, perform the quoted string searches, and see if and where the references to those URIs are. One of the first things you'll see when doing those "quoted string searches" is most likely bolded references to the string sitting on a page somewhere. In many instances, it's a scraper or some other bottom-feeding creature.
I've even seen instances, mind you they were very rare, where someone hard-coded a Session ID by mistake: a user of a WYSIWYG program did a cut and paste from a browser session into one of their Geocities pages. It happens.
Go to Copyscape and see if any of your content is sitting out there in the "muck".
So you think a high-profile robot like Yahoo or Google will sometimes spider a page without their usual UA?
If you do not find those Session IDs in Google or Live, or any other search engine besides Yahoo!, then that would lead me to believe there was/is a technical error in your UA detection that served Session IDs to Slurp.