
Apache Web Server Forum

mod_rewrite for removing session ids
I think this should work...

 8:08 pm on Dec 13, 2002 (gmt 0)

OK, I've finally decided to print out the Apache mod_rewrite docs and do something about my employer's shopping cart URLs. Now, I decided not to bother doing an across the board switch from the current variable-loaded URLs to pseudo-directory style links, because most major SE robots (afaik) will follow cgi/variable links these days... it's the session ids that really screw them up.

I read all through the forums here, and found plenty of posts about the full pseudo-directory URL rewrites (ie: mapping url.com/script/1/2/3/4/5/6 to url.com/script.cgi?1=2&3=4&5=6), but nothing about how to just get rid of the session id when a robot comes knocking... after a bit of head scratching, here is what I came up with...

My URLs look like so:

rewriteEngine on
rewriteBase /shop
rewriteCond %{HTTP_USER_AGENT} Googlebot.*
rewriteRule ^script\.cgi\?user_id=id&(.*)$ script\.cgi\?$1

If I am correct, that should just remove the user_id=id& out of the middle of the URL when Googlebot tries to follow the link, am I right? Then I can just add a new rewriteCond for each UA for whom I want the user_id variable removed.

Someone please let me know if there's a problem there (or tell me how to trick the server into thinking I'm Googlebot, so I can test it myself)... ;)

If I went the other way (modifying my link to remove the user_id variable, and then using mod_rewrite to reinsert it for everyone but the SE spiders), mod_rewrite would have to alter links for the majority of visitors (instead of only modifying them for the spiders), and have to parse the HTTP_REFERER to retrieve the session id to re-insert it for regular visitors, which seems like it would be a much larger drain on the server (and those who had referers turned off in their browser wouldn't be able to use the store).

I realize leaving all the variables in their ugly cgi form may not be ideal for spiders, but from what I've read, just getting rid of the session ids should at least allow those links to be crawled and indexed... Thoughts?



 9:00 pm on Dec 13, 2002 (gmt 0)

Lots of questions. I'll take the easy one and let someone else do the real work! :)

tell me how to trick the server into thinking I'm Googlebot
Here's [webmasterworld.com] a thread offering several solutions.


 9:13 pm on Dec 13, 2002 (gmt 0)

:) Unfortunately, being a Mac user, none of those suggestions apply... but I'll do some digging on that one.

I suppose I could edit the rewriteCond to read Opera, and go visit the site myself... then if it worked, I could switch it to Googlebot, and if it didn't I could delete it and start over. Really, my biggest question was whether the syntax looked OK. Hoping a mod_rewrite expert could give it a gander before I uploaded it (since I just started reading the Apache docs yesterday, and haven't really built up much confidence in my grasp of the material...).

Beyond that, I was just looking to start a discussion... with recent developments, it seems it's becoming more important than ever to ensure crawlability for online catalogs. ;)


 9:24 pm on Dec 13, 2002 (gmt 0)

I'm confident that one of our regular gurus won't be able to pass this up. And it is a timely discussion. Patience! :)

edit the rewriteCond to read Opera

For testing, you could do something like:
RewriteCond %{HTTP_USER_AGENT} Opera
RewriteCond %{HTTP_HOST} ^yourispdomainname
RewriteRule ^script\.cgi\?user_id=id&(.*)$ script\.cgi\?$1

[edited by: DaveAtIFG at 11:39 pm (utc) on Dec. 13, 2002]


 10:57 pm on Dec 13, 2002 (gmt 0)


The syntax looks OK, but won't work in .htaccess. It should work in httpd.conf, though.

You should add an [L] flag to the end of the RewriteRule, unless you know you need to continue with more rewrites on that URL.

By the time Apache gets to .htaccess, the query string has been stripped from the URL that RewriteRule matches against. It is available only to RewriteCond (for testing or for creating backreferences), or for direct substitution into the target URL as %{QUERY_STRING}.

If you are using mod_rewrite in .htaccess, LMK if you need more details.



 11:01 pm on Dec 13, 2002 (gmt 0)


To make your browser look like Googlebot, try WannaBrowser. It's web-based and platform-independent. Just copy the Googlebot (or other) UA string from your log file.



 11:08 pm on Dec 13, 2002 (gmt 0)

The syntax looks OK, but won't work in .htaccess.

Oh dear... what to do? I don't have access to httpd.conf. :(


 11:38 pm on Dec 13, 2002 (gmt 0)


Try something like this:

Options +FollowSymlinks
RewriteEngine on
RewriteBase /shop
RewriteCond %{HTTP_USER_AGENT} ^Googlebot
RewriteCond %{QUERY_STRING} user\_id=[^&]*&(.*)$
RewriteRule ^script\.cgi script.cgi?%1 [L]

I haven't tested it, though.



 12:45 am on Dec 14, 2002 (gmt 0)

Hmm... I changed Googlebot to Opera, uploaded it, and visited with Opera, and I was still given a user_id number. Doesn't seem to do anything that way. Thanks though! Gives me another angle to approach it from.

Back to the books for me then!


 1:23 am on Dec 14, 2002 (gmt 0)

I'm not sure whether the thing you are trying to achieve is possible using mod_rewrite, mivox.

Let's walk through a cycle of GoogleBot requesting a document.

- GoogleBot requests the URI domain.com/shop.html which is the start page of your shop.

- Your server serves this document which will contain URIs like this: domain.com/cgi-bin/shop/script.cgi?user_id=id&var_infinitum=val_infinitum.

- Google parses the page and finds this link. After indexing the current document the URI domain.com/cgi-bin/shop/script.cgi?user_id=id&var_infinitum=val_infinitum is requested.

- Now your RewriteRules kick in and you do an internal rewrite to the same URI sans the session id. This is totally transparent to GoogleBot. The page that GoogleBot receives will still be referred to by the URI of domain.com/cgi-bin/shop/script.cgi?user_id=id&var_infinitum=val_infinitum. An internal rewrite is transparent to the requesting UA. You could do an external rewrite sending a Moved Permanently status code. Then GoogleBot would request the page again using the session-id-less URI. But I am not sure whether GoogleBot likes getting the same old URIs on each dance and being told that they are old and to use the new ones.

Again, mod_rewrite is generally a one-way-street. Using some method you fake URIs and use mod_rewrite to turn them into the real ones internally. mod_rewrite does not parse your documents before they leave the server. It does not rewrite the URIs contained in those documents.
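
A rough, untested sketch of the external redirect mentioned above. It assumes the .htaccess file sits in the document root, that /cgi-bin/shop/script.cgi is served from beneath it, and that user_id is the first parameter in the query string:

```apache
RewriteEngine on
# Only act on requests whose User-Agent starts with "Googlebot"
RewriteCond %{HTTP_USER_AGENT} ^Googlebot
# Capture everything after the leading user_id parameter as %1
RewriteCond %{QUERY_STRING} ^user_id=[^&]*&(.*)$
# R=301 sends "Moved Permanently", so the robot re-requests the shorter URI
RewriteRule ^cgi-bin/shop/script\.cgi$ /cgi-bin/shop/script.cgi?%1 [R=301,L]
```

Unlike an internal rewrite, this makes the session-id-less URI visible to the requesting UA.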



 1:24 am on Dec 14, 2002 (gmt 0)


What may be confusing is the "direction" that the rewrite goes...

mod_rewrite takes the URL your browser requests and modifies it for use inside the server. It does not affect the URLs output back to your browser by your scripts and html pages.

So, this probably is not what you wanted to do.

There are several posts from the last few days addressing this.



 1:39 am on Dec 14, 2002 (gmt 0)

When Googlebot requests domain.com/cgi-bin/shop/script.cgi?id=id&var=val I would like the server to parse that request, and rewrite it to exclude the session id, leaving domain.com/cgi-bin/shop/script.cgi?var=val so the cgi returns a page without assigning a session id.

I don't expect that the actual html code being served would be changed at all... that would be silly. mod_rewrite is supposed to rewrite URLs, not rewrite html. ;)

However, I do not see why it should be able to take a link reading domain.com/script/var/val/var2/val2 and turn it into domain.com/script?var=val&var2=val2, which appears to be a simple character substitution, but not take a link reading domain.com/script?id=id&var=val and turn it into domain.com/script?var=val by being told to substitute id=id for nothing.
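
For comparison, the pseudo-directory mapping is usually written with backreferences taken from the path itself. A minimal sketch, assuming a .htaccess file in the document root and the six-segment URL layout from the opening post (the pattern is illustrative and untested):

```apache
RewriteEngine on
RewriteBase /
# /script/a/b/c/d/e/f  ->  /script.cgi?a=b&c=d&e=f
# $1..$6 are the path segments captured by the RewriteRule pattern
RewriteRule ^script/([^/]+)/([^/]+)/([^/]+)/([^/]+)/([^/]+)/([^/]+)/?$ script.cgi?$1=$2&$3=$4&$5=$6 [L]
```

The session-id case differs because the part to be changed lives in the query string, which a RewriteRule pattern does not see; it has to be tested with RewriteCond %{QUERY_STRING} instead.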


 1:48 am on Dec 14, 2002 (gmt 0)

The page that GoogleBot receives will still be referred to by the URI of domain.com/cgi-bin/shop/script.cgi?user_id=id&var_infinitum=val_infinitum

That is fine... The store links on my html templates read user_id=id. After you've followed one of those links, the script assigns you an ID number, and the links generated thereafter contain user_id=12345 (a random 5 digit number). If mod_rewrite could remove the user_id=id from the internal request, the script would not assign the session id, and would generate the succeeding links as the generic user_id=id.

So, if crawl10.googlebot comes along and requests a few store pages, they will appear as user_id=id&page=page.html. Then crawl2.googlebot comes along, requests the same pages, and gets user_id=id&page=page.html. The URLs are the same.

Currently, crawl10 will get something like user_id=54208&page=page.html, and crawl2 will get user_id=60234&page=page.html, giving the spider-scaring illusion that there are an infinite number of store pages to crawl, and supposedly causing them to throw those pages out (these pages get crawled every month, and never appear in the index).

It seems if they got the generic user_id=id every time, it would be apparent when they'd requested the same page, and would remove the infinite pages problem.

<added>In short, I am not trying to make user_id=id disappear from the robots' perspective. I am trying to stop user_id=id from being "delivered" to the store script when a robot follows the link, so the links are not changed to user_id=12345 on the next page generated/retrieved.</added>

[edited by: mivox at 1:54 am (utc) on Dec. 14, 2002]


 1:54 am on Dec 14, 2002 (gmt 0)

I wrote that before reading your post mivox.

However, I do not see why it should be able to take a link reading domain.com/script/var/val/var2/val2 and turn it into domain.com/script?var=val&var2=val2, which appears to be a simple character substitution, but not take a link reading domain.com/script?id=id&var=val and turn it into domain.com/script?var=val by being told to substitute id=id for nothing.

There is no question that mod_rewrite can do just that. But this might not be what you want. The fake URI is the one that Google will assign to the page. Since this URI does not exist on the server it is rewritten internally to the right URI. In your situation the URI containing the session id is the one Google will assign to your page. Internally the URI with session id is rewritten to a URI sans session id. If this is the only thing you really want, although I'm not sure how that would help, then using mod_rewrite will be ok.

It would be helpful if you could let us know why you think that the problems I mentioned above will not apply to your particular situation.


[edited by: andreasfriedrich at 1:56 am (utc) on Dec. 14, 2002]


 1:55 am on Dec 14, 2002 (gmt 0)

See my post that I just edited, above... I tried to make my goal a little clearer. :) (We must've been writing at the same time!)

<ROFL... and then we edited our posts at the same time...>

So... how do I do it? I've been tinkering with the {QUERY_STRING} idea for a while now, and I can't get it to stop assigning me a user_id number... grr.

[edited by: mivox at 1:58 am (utc) on Dec. 14, 2002]


 1:56 am on Dec 14, 2002 (gmt 0)

Yes, just edited my post ;)


 2:00 am on Dec 14, 2002 (gmt 0)

OK, mod_rewrite is the way to go. Now all you need to do is to get the rewriting to work. ;)

Sorry that I insisted on telling you what mod_rewrite does and what it doesn't do. I wasn't sure whether you expected things that it just wasn't designed for. A lot of people do. You didn't.

I'll have a look at the rules now. ;)



 2:01 am on Dec 14, 2002 (gmt 0)

This reminds me of the constitutional law class I took years ago. By the time you finished making an argument complex and circular enough to cover all contingencies and conditions under consideration, it was really hard to follow what you were exactly trying to say in the first place. hehe

But I got a good grade in that class, so I ought to be able to figure this out. I'm just finding a major lack of documentation on .htaccess URL manipulation, and I'm trying to learn regex at the same time as I'm figuring out mod_rewrite, so I'm not quite sure which end is up at the moment... ;)


 2:08 am on Dec 14, 2002 (gmt 0)

Here's what I've got now... but it doesn't work. ;) Just to update where I'm coming from:

Options +FollowSymlinks
RewriteEngine on
RewriteBase /shop
RewriteCond %{HTTP_USER_AGENT} ^Opera.*
RewriteCond %{QUERY_STRING} user_id=[0-9a-z]{2,5}&(.*)$
RewriteRule ^script\.cgi script.cgi?%1 [L]


 2:15 am on Dec 14, 2002 (gmt 0)

Constitutional law is great, isn't it ;)

major lack of documentation on .htaccess URL manipulation

There isn't that big a difference between using RewriteRules in httpd.conf and .htaccess files as far as the syntax is concerned. All you need to remember is that the directory prefix is removed prior to matching and added again later on. Performance-wise there is a BIG difference between the two.

Since user_id=id is static as you write, you could use a rule like this:

RewriteCond %{QUERY_STRING} user_id=id&(.*)$

When toying with mod_rewrite do so in a structured manner. Using a simple rewrite rule test whether rewriting works at all in the particular directory. If it works add more conditions piece by piece. But you probably do this anyway since it applies to almost all work one does.
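
One way to run that first check is a single rule with no conditions. A minimal sketch, where rewrite-test.html and target.html are hypothetical file names, target.html is assumed to exist next to the .htaccess file, and the directory is assumed to be served from under the document root:

```apache
Options +FollowSymlinks
RewriteEngine on
# Requesting rewrite-test.html should return the contents of target.html
RewriteRule ^rewrite-test\.html$ target.html [L]
```

If the test URL returns target.html, rewriting is active in that directory and the conditions can be added back one at a time.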



 2:18 am on Dec 14, 2002 (gmt 0)


Now that Andreas is here, you'll have all the help you need. <stage whisper>I'm glad he hasn't disappeared completely into law school...</stage whisper>

In which subdirectory is the .htaccess containing your RewriteRule? - I'd like to make sure RewriteBase is needed and correct.

If you liked constitutional law, you'll love regex and mod_rewrite - It's a lot of logic and precise language, and the details have to be absolutely correct. :)


<added after AF's last post>... And remember that you can test using a "dummy" URL on the left side of the RewriteRule, so as to avoid breaking your site for real visitors.</added>
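
A sketch of that dummy-URL approach, assuming the .htaccess file is in the document root with the script under cgi-bin/shop/ (the layout this thread eventually settles on) and that user_id is the first query parameter. The name script-test.cgi is made up; no real link should point to it:

```apache
Options +FollowSymlinks
RewriteEngine on
RewriteBase /
# Strip a leading user_id parameter and keep the rest of the query string as %1
RewriteCond %{QUERY_STRING} ^user_id=[^&]*&(.*)$
# Only the made-up test URL is rewritten, so real visitors are unaffected
RewriteRule ^cgi-bin/shop/script-test\.cgi$ cgi-bin/shop/script.cgi?%1 [L]
```

Once the test URL behaves as expected, the pattern on the left can be switched to the real script name.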


 2:21 am on Dec 14, 2002 (gmt 0)

OK... I have to catch my ride home now, so I'll dig through the last two posts as soon as I get home. (My brain could use a rest anyway. Trying to learn this in two days is a slight strain. ;) )

Thanks for everyone's help!


 2:44 am on Dec 14, 2002 (gmt 0)

Thanks Jim.

I tested those rules on my test server and they work great.

Options +FollowSymlinks
RewriteEngine on
# Make sure this RewriteBase is correct.
RewriteBase /shop
RewriteCond %{HTTP_USER_AGENT} ^Opera.*
RewriteCond %{QUERY_STRING} user_id=[0-9a-z]{2,5}&(.*)$
RewriteRule ^script\.cgi script.cgi?%1 [L]

As Jim wrote all you need to do is make sure that RewriteBase is correct and that the htaccess file is in the right directory: The one the URI domain.com/cgi-bin/shop resolves to.



 2:51 am on Dec 14, 2002 (gmt 0)

I'm off to bed now since it is 3:50 am already.


 8:22 am on Dec 14, 2002 (gmt 0)

:) You guys are great! I will post file path/RewriteBase specifics on Monday when I'm back in the office, as that is apparently where the rules are going wrong...

Hopefully, at least one other person with obnoxious session id URLs and dynamic shopping cart pages will be able to use it. ;)


 7:07 pm on Dec 16, 2002 (gmt 0)

OK. I wanted to keep it in my main .htaccess file, so I had to change it like so:

RewriteBase /

RewriteRule ^cgi-bin/shop/script\.cgi cgi-bin/shop/script.cgi?%1 [L]

Works like a charm now. Now I just need to add one more script name to the rewrite rules, and I'll be all ready to change the target UA to block SE spiders from getting user ids.
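
A possible shape for that final version, offered as an untested sketch. The second script name (other.cgi) and the second user agent (Slurp) are placeholders, and the rule assumes user_id is always the first query parameter:

```apache
Options +FollowSymlinks
RewriteEngine on
RewriteBase /

# Match any of several crawler user agents; [OR] chains the alternatives
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Slurp [NC]
# Capture everything after the leading user_id parameter as %1
RewriteCond %{QUERY_STRING} ^user_id=[^&]*&(.*)$
# $1 is the script name from the pattern; "other" is a hypothetical second script
RewriteRule ^cgi-bin/shop/(script|other)\.cgi$ cgi-bin/shop/$1.cgi?%1 [L]
```

The query-string condition is placed last so that %1 refers to its captured group rather than to one of the user-agent conditions.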


 8:10 pm on Dec 16, 2002 (gmt 0)


Sounds good - Please let us know how this works out for you over the long term - i.e., whether it solves your practical problem of spiders not liking userIDs.



 8:55 pm on Dec 16, 2002 (gmt 0)

I'll certainly keep an eye out... but I think knowing whether or not it helps with the SEs will be a wait-and-see-and-guess proposition for a while.

© Webmaster World 1996-2014 all rights reserved