Using mod_rewrite to remove session IDs

         

seasalt

6:06 am on Apr 8, 2004 (gmt 0)

10+ Year Member



Hi:

In order to be crawled better, I want to remove an automatic session ID and an additional tracking parameter - only when a bot visits my site.

(e.g. www.example.com/?sessionID=aa162314bDRa53872123&crea=52)

Does anyone know of any good examples or other resources showing mod_rewrite being used for this specific situation? I need something that is pretty clear and straightforward.

seasalt

jdMorgan

7:29 am on Apr 8, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



mod_rewrite can modify a URL once it has been requested and before content is served. It cannot modify a URL in a page being served to a client browser or spider.

So, you'll need to modify your script to suppress session IDs when the requesting user-agent is a known search engine robot.

An approach where mod_rewrite is useful is to put query string parameters into URL form. In other words, output URLs to all clients that look like plain static URLs, such as:

http://www.example.com/widgets/blue/parm23/userid10001

and then use mod_rewrite to convert that to a form that your script uses when the plain URL is requested:

http://www.example.com/index.php?product=widgets&color=blue&parm=23&id=10001

...Just a dumb example to illustrate the point. You will still need to suppress the session-related variables when a robot request is detected, even with this method. Otherwise, you'll get duplicate pages listed for each crawl, and that can cause big problems.
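
To make the URL mapping above a bit more concrete, here is a rough sketch of a rule that might do it. It assumes the rule sits in a .htaccess file in the site root and that the path always has the four segments shown in the example; the pattern itself is only an illustration, so adjust it to your real URL layout. In httpd.conf (server context) the pattern would need a leading slash.

```apache
# Assumption: .htaccess in the document root, mod_rewrite enabled.
RewriteEngine On
# Map /widgets/blue/parm23/userid10001 to the query-string form the script uses.
RewriteRule ^([^/]+)/([^/]+)/parm([0-9]+)/userid([0-9]+)$ /index.php?product=$1&color=$2&parm=$3&id=$4 [L]
```

Note that this only handles the incoming request. The links your script prints still have to be generated in the static-looking form.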

Jim

seasalt

3:28 pm on Apr 8, 2004 (gmt 0)

10+ Year Member



jdMorgan:

you'll need to modify your script to suppress session IDs

Do you know of any examples of this in mod_perl you could point me to?

Thanks

jdMorgan

3:55 pm on Apr 8, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Here's a code fragment from a routine that skips counting requests for known 'bots. Not exactly what you're doing, but it illustrates the concept:

$remaddr = $ENV{'REMOTE_ADDR'};
# ... (code snipped)
# Bypass counter for requests from Google
unless ($remaddr =~ /^216\.239\.45\./)
{
    open(COUNTER,">$ctrpath") || die $!;
    print COUNTER ($count);
    close(COUNTER);
}
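
Applying the same idea to link generation, a minimal sketch might look like the following. It assumes a CGI-style environment where the User-Agent is available in $ENV{'HTTP_USER_AGENT'}. The names is_robot and build_link and the list of robot patterns are made up for illustration, so substitute whatever your own script uses.

```perl
use strict;
use warnings;

# Hypothetical list of robot User-Agent substrings; extend as needed.
my @BOT_PATTERNS = (qr/Googlebot/i, qr/Slurp/i, qr/msnbot/i);

# Return 1 if the User-Agent string matches a known robot pattern, else 0.
sub is_robot {
    my ($ua) = @_;
    return 0 unless defined $ua;
    for my $re (@BOT_PATTERNS) {
        return 1 if $ua =~ $re;
    }
    return 0;
}

# Build a link, appending the session parameters only for non-robot visitors.
sub build_link {
    my ($path, $session_id, $crea, $ua) = @_;
    return $path if is_robot($ua);
    return "$path?sessionID=$session_id&crea=$crea";
}

# Example call using the User-Agent of the current request.
print build_link('/dir2/', 'aa162314bDRa53872123', 52, $ENV{'HTTP_USER_AGENT'}), "\n";
```

Every place the script prints a link would go through build_link, so robots only ever see the plain URLs.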

Jim

seasalt

12:48 am on Apr 9, 2004 (gmt 0)

10+ Year Member



Jim:

I failed to mention that the session ID is not generated on the initial page requested (whether index or interior page), but is generated in subsequent links from that requested page.

Example:

page requested:
www.example.com/dir1/ (and will appear as such to bots)

links on requested page appear as:
www.example.com/?sessionID=aa162314bDRa53872123&crea=52
www.example.com/dir2/?sessionID=aa162314bDRa53872123&crea=52
and so on....

Would mod_rewrite work in this instance? If so, any examples for a situation like that?

Thanks.

seasalt

jdMorgan

8:08 am on Apr 9, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It doesn't matter where the session ID links are generated -- you must suppress them wherever they are generated if the requesting visitor is a known search engine robot.

Unfortunately, mod_rewrite is of no help whatsoever in doing this. It's only good when you want to modify the URL that a browser is asking for, and point the request somewhere else or change its form, for example, from a static-appearing link to a dynamic link to be passed to your script. Mod_rewrite works on the "input" or "request" end of the transaction, not on the "output" or "response" end.

Jim