PHPSESSID - hiding from Google via HTACCESS

301'ing to the same URL without certain parameters

         

badbadmonkey

11:16 am on Oct 9, 2007 (gmt 0)

10+ Year Member



Sorry for the double post; I realized this would be the better category.

I have the problem where Google has indexed links containing the PHPSESSID parameter, and I need to solve that. Additionally, my site uses "regions", which are tracked by the session but initialized or changed via the URL at some point. The point is that I don't want Google indexing either of these parameters, or any others I may add in the future.

So, what I want to do:
- Detect SE user agent, then
- 301 to the same URL requested but with ...
1) &PHPSESSID=asdf
2) &region=hjkl
... stripped from it.

Now I found some code in another thread that goes like this:

RewriteCond %{HTTP_USER_AGENT} "Google" [NC,OR] 
RewriteCond %{HTTP_USER_AGENT} "Slurp" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "MSNBOT" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "teoma" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "ia_archiver" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Scooter" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Mercator" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "FAST" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "MantraAgent" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "Lycos" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "ZyBorg" [NC]
RewriteCond %{QUERY_STRING} PHPSESSID
RewriteRule ^(.*)$ $1? [L,R=301]

Before I start mucking about, is this right?
Also, how do I remove the "region" parameter as well - can I just add another line under the one that mentions PHPSESSID?

I don't know much about HTACCESS, so: help!

jdMorgan

12:32 pm on Oct 9, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Just to be clear, have you already taken steps to prevent the search engines from ever seeing URLs with session IDs and 'regions' parameters attached to them? If not, then that is step #1. Having done that, you can treat the redirection you're contemplating here as a "clean-up and future insurance" step.

The exact answer to your question depends on whether the session ID and region parameters are the only ones attached to these URLs; if you have other query string parameters that must be passed through intact, then that changes the requirements dramatically.

Jim

badbadmonkey

12:44 pm on Oct 9, 2007 (gmt 0)

10+ Year Member



Steps, you mean, like using robots.txt? No, because I don't want the SEs to ignore the pages - I just want them redirected to the correct URL. If I banned anything with PHPSESSID in it, wouldn't that hurt the effect of any inbound link which happens to contain it?

My plan is to implement this HTACCESS code and then wait for Google to re-index the pages; maybe I'll have to add the specific URLs already indexed to my sitemaps.xml.

As to the URLs, yes there are other parameters. A typical URL could look like this:

/main.php?section=sec&chapter=chap&page=1&region=uk

and I want Google to be 301'ed to

/main.php?section=sec&chapter=chap&page=1

etc.
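
To make the target behaviour concrete, here is a minimal Python sketch of the transformation being asked for: drop the unwanted parameters and pass everything else through in its original order. It only illustrates the intended result; it is not the mod_rewrite solution, and the function name strip_params is a hypothetical helper, not anything from the site's code.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameter names taken from the thread; anything else is passed through.
UNWANTED = {"PHPSESSID", "region"}

def strip_params(url, unwanted=UNWANTED):
    """Return url with the unwanted query parameters removed, order preserved."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key not in unwanted]
    # An empty query string is dropped entirely, so no trailing "?" remains.
    return urlunsplit(parts._replace(query=urlencode(kept)))

print(strip_params("/main.php?section=sec&chapter=chap&page=1&region=uk"))
# prints /main.php?section=sec&chapter=chap&page=1
```

A 301 would only be issued when the stripped URL differs from the requested one.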

jdMorgan

1:36 pm on Oct 9, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



OK, well you need to modify your script so that when a search engine fetches pages, the script will not assign a session ID or add a region parameter. Otherwise, your search results will be less than optimal, because the search engines will constantly receive a 301 on requests for the session/region URLs. You will be relying on them to process the 301 correctly and in a timely manner -- making your site dependent on the search engines getting it right.

I recommend that you view changing the script as the 'cure' and view the redirect as a 'fast search engine results cleanup' only. In fact, if the script is modified correctly, then the redirects are not even needed, except to speed up the process of purging the session/region URLs. Alone, the 301 approach is only a partial (and poor) fix.

Further, once the search engines no longer receive session/region parameters, the only reason you'd need to keep the redirects in place after the search engines remove the session/region URLs from their results would be if other sites continue to link to those session/region URLs.

Jim

badbadmonkey

1:47 pm on Oct 9, 2007 (gmt 0)

10+ Year Member



They do - people on forums etc.

Without turning off trans_sid, how do I do this then? The only solution I can think of is to use PHP to, ummn, 301 itself when it gets a SE. I mean the PHPSESSID stuff is handled by PHP, not my code, so I don't see how to do what you're suggesting.

Is there a way to use

php_value session.use_only_cookies 1 
php_value session.use_trans_sid 0

in HTACCESS but only for SEs?

Any help on the HTACCESS code above which is still going to be necessary?

badbadmonkey

1:51 pm on Oct 9, 2007 (gmt 0)

10+ Year Member



I guess it could simply not initiate a session for a SE.
But what about incoming links with the 'region' parameter set? If it displays a page without a 301 then isn't Google going to index the resulting page under that URL?

jdMorgan

2:34 pm on Oct 9, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You might want to ask about the sessionid thing in the php forum [webmasterworld.com] -- or try a site search, since that subject has already been discussed quite a bit.

The solution to stripping the query string parameters using mod_rewrite varies from almost easy to extremely difficult, depending on the following:
1) Are both PHPSESSID= and region= parameters always present in the URLs to be redirected?
2) If present, are they always in the same order?
3) If present and in the same order, are they always contiguous?
4) If not contiguous, how many other parameters might be present between them?

Coding a one-size-fits-all mod_rewrite solution may or may not even be possible, depending on the answers to the above questions, since mod_rewrite is not nearly as flexible as a general scripting language.

Jim

badbadmonkey

5:04 am on Oct 11, 2007 (gmt 0)

10+ Year Member



I'm thinking now that I need to:

1) Change PHP site-wide to detect SEs then skip initiating sessions. This will minimize 301s.

2) Change PHP to detect SEs and modify internal links to not pass parameters. This will further minimize 301s.

3) Use HTACCESS to 301 from the requested URL to the same page less the parameter(s) for SEs. Still necessary to "fix" currently indexed results and off-site inbound links containing the parameters.

I can do 1 and 2 easily, it's 3 that sees me here.

To answer the above questions concerning the parameters:

They're not always present, no, and could be either one or the other.
- If either one is present it is going to be at the very end of the URL
- and if both are present the PHPSESSID will be after 'region'. However there happens to be no case of this in Google at the present time.

So for whatever reason let's worry just about the first instance, where either PHPSESSID or 'region' is at the end of the URL. Should make it simple, eh?

The alternative is to header redirect (301) SEs from within the script, which again I can do without help, but that can't be done for PDFs etc.
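
Steps 1 and 2 above both hinge on recognising a crawler from its user-agent string. A minimal sketch of that check follows, written in Python purely for illustration (the site itself is PHP). The token list is borrowed from the rule quoted earlier in the thread and is an assumption, not a complete list, and both function names are hypothetical.

```python
# Crawler tokens borrowed from the rule quoted earlier in the thread.
# Illustrative only: the list is neither complete nor authoritative.
CRAWLER_TOKENS = ("googlebot", "slurp", "msnbot", "teoma", "ia_archiver")

def is_search_engine(user_agent):
    """Case-insensitive substring check, like a RewriteCond with [NC]."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in CRAWLER_TOKENS)

def should_start_session(user_agent):
    """Step 1: only ordinary visitors get a session."""
    return not is_search_engine(user_agent)

def should_append_region(user_agent):
    """Step 2: only ordinary visitors get 'region' added to internal links."""
    return not is_search_engine(user_agent)
```

User-agent strings can be spoofed, so this is suitable for deciding whether to start a session, not for anything security-related.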

jdMorgan

12:15 pm on Oct 11, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Sounds like you've figured out the PHP requirements; steps 1 and 2 sound correct to me (although I defer to the PHP forum group).

Having the extra stuff at the end simplifies the problem considerably -- the code to handle "either order, anywhere" is really big, slow, and ugly. With region and SESSID (if present) always at the end, and always in that order, something like this should work:


RewriteCond %{HTTP_USER_AGENT} Googlebot¦Slurp¦MSNBOT¦teoma¦ia_archiver¦Mercator¦FAST¦MantraAgent¦Lycos [NC]
RewriteCond %{QUERY_STRING} ^((([^&]+&)*([^&]+)+)&)?region=[^&]+&PHPSESSID= [OR]
RewriteCond %{QUERY_STRING} ^((([^&]+&)*([^&]+)+)&)?region= [OR]
RewriteCond %{QUERY_STRING} ^((([^&]+&)*([^&]+)+)&)?PHPSESSID=
RewriteRule (.*) http://www.example.com/$1?%2 [R=301,L]

Note that I removed Altavista's Scooter and WiseNut's Zyborg user-agents, because they are no longer active; AltaVista is now a Yahoo property using Slurp, and WiseNut was taken down [webmasterworld.com] last month. Also, be aware that FAST is now "corporate search-only" and as far as I know, they sold the public search to Overture, which was in turn purchased by Yahoo, so this also uses Slurp.

The first query-string RewriteCond, the one matching both parameters, is not strictly necessary. You can take it out and experiment with the order of the other two query-string RewriteConds to achieve correct operation if you like. Just be sure that the last RewriteCond in the rule does not have an [OR] flag on it, while every query-string RewriteCond before it does. The user-agent RewriteCond on the first line must not have an [OR] flag either, so that it is ANDed with the query-string group.

Jim
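
For anyone who wants to see what the %2 back-reference ends up holding, the three query-string patterns can be tried outside Apache. The sketch below uses Python's re module as a stand-in for mod_rewrite's regex engine, which is an assumption (the two engines are similar but not identical), and kept_query is a hypothetical helper name. Patterns are tried in the same order as the [OR] chain, and the first one that matches supplies group 2.

```python
import re

# Shared prefix: optionally capture everything before the unwanted parameter.
# Group 2 is that leading part without its trailing "&".
PREFIX = r"^((([^&]+&)*([^&]+)+)&)?"

# Same three patterns, in the same order as the [OR] chain in the rule.
PATTERNS = [
    re.compile(PREFIX + r"region=[^&]+&PHPSESSID="),
    re.compile(PREFIX + r"region="),
    re.compile(PREFIX + r"PHPSESSID="),
]

def kept_query(query_string):
    """Return what %2 would hold, or None when no pattern matches (no redirect)."""
    for pattern in PATTERNS:
        match = pattern.match(query_string)
        if match:
            # Group 2 is unset when the unwanted parameter comes first.
            return match.group(2) or ""
    return None

print(kept_query("section=sec&chapter=chap&page=1&region=uk"))
# prints section=sec&chapter=chap&page=1
```

A return value of None corresponds to the rule not firing at all, so URLs without either parameter are left alone.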

badbadmonkey

12:02 pm on Oct 28, 2007 (gmt 0)

10+ Year Member



Hmmmn I copied that code into HTACCESS but, using User Agent Switcher in Firefox to emulate Googlebot, it doesn't seem to work. Visiting any of the URLs indexed in Google just takes you right through to the site with the PHPSESSID still in the URL. Same for &region.

It should work okay with other parameters (ones that are fine to index) appearing in the URL before either of those, right?

Is it okay to post the Google search which shows the actual indexed URLs rather than working on hypothetical examples?

jdMorgan

1:48 pm on Oct 28, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Be sure to:

1) Replace all broken pipe "¦" characters with solid pipes before use; posting on this forum modifies the pipe characters.

2) Completely flush your browser cache before testing any change to code in .htaccess, httpd.conf, conf.d, etc.

Jim
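
A quick way to see why the broken pipes matter: with "¦" in place of "|", the user-agent pattern is no longer an alternation but one long literal string, which no real user-agent contains. The sketch below shows this with Python's re module standing in for Apache's regex engine (an assumption). The Googlebot user-agent string and the helper name ua_matches are illustrative.

```python
import re

# Illustrative Googlebot user-agent string.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

# Pattern as it appears after the forum has altered the pipe characters.
as_posted = "Googlebot¦Slurp¦MSNBOT"
# Restore the solid pipes so the pattern is an alternation again.
repaired = as_posted.replace("¦", "|")

def ua_matches(pattern, user_agent):
    """Case-insensitive search, like a RewriteCond with the [NC] flag."""
    return re.search(pattern, user_agent, re.IGNORECASE) is not None

print(ua_matches(as_posted, GOOGLEBOT_UA))  # prints False
print(ua_matches(repaired, GOOGLEBOT_UA))   # prints True
```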

badbadmonkey

2:21 pm on Oct 28, 2007 (gmt 0)

10+ Year Member



Duh, I thought the pipes looked strange!
Hahah.

Works fine. Thanks mate, truly appreciate it.