Forum Moderators: phranque
The e-commerce system we use on a site assigns a session ID, and for every website visitor this session ID is turned on. We do not want sessions turned on for any spiders, crawlers, bots, etc., because the session ID then turns up in search engines, and we can have the instance of two people visiting the site with the same session ID. It causes huge problems.
So we have modified the PHP code to look for the user-agent name (slurp, msnbot, googlebot, etc.), and if the agent name is recognised as a spider we don't allow sessions to be used. This modification works perfectly; however, some search engines are still revisiting the site and using 'old' referenced URLs with the session IDs in the URL.
We have absolutely no control over what people or spiders send in the 'GET', though, hence the problem. Hoping that mod_rewrite will help this situation, we now have mod_rewrite code that looks for the user agent, and if it is any of:
msnbot
slurp
googlebot
(there may be others?)
and they try to fetch a URL like this:
[example.com...]
the mod_rewrite will rewrite the url to be:
[example.com...]
Here is the mod_rewrite code:
RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^(msnbot|slurp|googlebot) [NC]
RewriteCond %{QUERY_STRING} ^(.*)\&?sessionID=[a-zA-Z0-9]+\&?(.*)$
RewriteRule ^(.*) $1?%1%2 [R=301,L]
which I am told will re-write the URL as shown above, and apparently cause a '301' (redirect).
Given the above example, if it does cause a 301 (we are about to start testing), will this stop the page from being indexed, or what will happen? What will spiders/bots like slurp and msnbot do if this happens?
Our objective is to stop the spiders and bots from adding URLs which contain the session IDs to their search engines. We do not want to affect the PR of the site in our attempt to 'force' the spiders to re-index.
Session IDs cannot be in any links to the site, or in any search engine results. Will the 301 do the trick?
Thanks,
Peter
Actually, it looks as though you could use a bit of mod_rewrite help...
The code you're using won't do exactly what you expect it to do, in that it won't match those spiders, and it will drop the ampersands, even if other parameters are present.
The user-agent pattern should not be start-anchored (remove the "^") or your user-agent condition will fail.
In order to prevent problems with missing or orphaned ampersands in the search engine listings, the cases of leading-parameters only, trailing-parameters only, and both leading and trailing parameters must be handled.
RewriteEngine on
RewriteBase /
#
# Skip the next two rewriterules if NOT a spider
RewriteCond %{HTTP_USER_AGENT} !(msnbot|slurp|googlebot) [NC]
RewriteRule .* - [S=2]
#
# case: leading and trailing parameters
RewriteCond %{QUERY_STRING} ^(.+)&sessionID=[0-9a-z]+&(.+)$ [NC]
RewriteRule (.*) $1?%1&%2 [R=301,L]
#
# case: leading or trailing parameters only
RewriteCond %{QUERY_STRING} ^(.+)&sessionID=[0-9a-z]$|^sessionID=[0-9a-z]+&(.+)$ [NC]
RewriteRule (.*) $1?%1 [R=301,L]
#
As to what the search engines will do, they'll see the 301-Moved Permanently response, re-fetch the page from the new (sessionID-less) URL given in that response, and -- after a while, update their database to use the new URL.
Jim
RewriteEngine on
RewriteBase /
#
# Skip the next two rewriterules if NOT a spider
RewriteCond %{HTTP_USER_AGENT} !(msnbot|slurp|googlebot) [NC]
RewriteRule .* - [S=2]
#
# case: leading and trailing parameters
RewriteCond %{QUERY_STRING} ^(.+)&sessionID=[0-9a-z]+&(.+)$ [NC]
RewriteRule (.*) $1?%1&%2 [R=301,L]
#
# case: leading-only, trailing-only or no additional parameters
RewriteCond %{QUERY_STRING} ^(.+)&sessionID=[0-9a-z]$|^sessionID=[0-9a-z]+&?(.*)$ [NC]
RewriteRule (.*) $1?%1 [R=301,L]
[edited by: jdMorgan at 12:54 am (utc) on Jan. 4, 2005]
Actually, it looks as though you could use a bit of mod_rewrite help... The code you're using won't do exactly what you expect it to do, in that it won't match those spiders, and it will drop the ampersands, even if other parameters are present.
The user-agent pattern should not be start-anchored (remove the "^") or your user-agent condition will fail.
In order to prevent problems with missing or orphaned ampersands in the search engine listings, the cases of leading-parameters only, trailing-parameters only, and both leading and trailing parameters must be handled.
Well, there are quite a few errors there then, and here I was thinking it was ready for testing. I only understood the point about the user-agent pattern: the string can be anywhere in the user agent, but I think the code was expecting it to be at the first position.
Thanks for posting the corrected code. :)
As to what the search engines will do, they'll see the 301-Moved Permanently response, re-fetch the page from the new (sessionID-less) URL given in that response, and -- after a while, update their database to use the new URL.
That is perfect, exactly what we want to happen; as you say it won't happen immediately, but the desired objective will be 'sessionID-less' URL's.
As far as testing goes, there is a log command I think. We have a test site, and if I add my browser agent name... the details in the logs are:
"-" "Mozilla/5.0 (Windows; U; Win95; en-US; rv:1.6) Gecko/20040113"
So I could just add 'mozilla' as a temporary agent name for testing purposes, use some of those URLs with the sessionID in them, and see what happens.
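A sketch of that testing idea, for what it's worth (the extra "mozilla" token is purely a temporary testing hack, not part of the production rules; since nearly every browser, and many bots, send "Mozilla/..." in their User-Agent, it must be removed once testing is done):

```apache
# TEMPORARY, for testing only: also treat ordinary Mozilla-based
# browsers as "spiders" so the redirect can be exercised by hand
# from a desktop browser. Remove the "mozilla" token afterwards;
# it matches almost every visitor, bots and humans alike.
RewriteCond %{HTTP_USER_AGENT} !(msnbot|slurp|googlebot|mozilla) [NC]
RewriteRule .* - [S=2]
```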
Thanks,
Peter
I forgot the case of no additional parameters!
By looking at the code I couldn't see any difference, so 'Beyond Compare' to the rescue, and it picked up 2 differences.
1.
RewriteCond %{HTTP_USER_AGENT}!(msnbot|slurp|googlebot) [NC]
no space before the "!"
2.
RewriteCond %{QUERY_STRING} ^(.+)&sessionID=[0-9a-z]$|^sessionID=[0-9a-z]+&?(.*)$ [NC]
The only changes I can see above are:
+&(.+)$ [NC]
TO .....
+&?(.*)$ [NC]
Now, one final question please; just a minor issue, is it okay/conventional to have the code like this?
RewriteEngine on
RewriteBase /
#
# Check for the following spiders
RewriteCond %{HTTP_USER_AGENT} (msnbot|slurp|googlebot) [NC]
# case: leading and trailing parameters
RewriteCond %{QUERY_STRING} ^(.+)&sessionID=[0-9a-z]+&(.+)$ [NC]
RewriteRule (.*) $1?%1&%2 [R=301,L]
#
# Check for the following spiders
RewriteCond %{HTTP_USER_AGENT} (msnbot|slurp|googlebot) [NC]
# case: leading-only, trailing-only or no additional parameters
RewriteCond %{QUERY_STRING} ^(.+)&sessionID=[0-9a-z]$|^sessionID=[0-9a-z]+&?(.*)$ [NC]
RewriteRule (.*) $1?%1 [R=301,L]
It's the same number of lines of code; it's just that, coming from a few decades of application programming languages background, we had a joke:
We are not to use NOT's!
I guess old habits die hard, and I have found in 2GLs, 3GLs and especially 4GLs that the use of an 'OR NOT' is sometimes a bit unstable. I don't know Apache _that_ well at all, so I don't have confidence in using the NOT, and although there is one line of redundant code this other way (testing for agents twice), I guess it just makes me feel 'warm and fuzzy'. :D
Thanks very much for all your help,
Peter
A space is always required between "}" and "!" but posting on this forum eats those spaces. Be sure to put the space back in. I'll go fix it in the post.
Jim
Well, after 4 days of having the mod_rewrite code on the site, decided to have a good look through the web server logs. Looked for strings 'msnbot', 'googlebot', 'slurp' , AND containing 'sessionID'. All 3 spiders had considerable activity during these 4 days.
msnbot - nothing found - good
googlebot - 3 days nothing, 1 day this entry:
66.249.65.226 - - [08/Jan/2005:01:54:05 -0600] "GET /www.example.com/shop/default.php?cPath=8&sessionID=94e6b5d9ddbd53e19616cda29beee477 HTTP/1.1" 200 31798 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
I was expecting a 301?
yahoo (slurp) - entries found on all 4 days, too many to list; here is one entry from one day.
66.196.90.60 - - [05/Jan/2005:21:25:37 -0600] "GET /www.example.com/shop/product_info.php?products_id=11&sessionID=d16ba91e57f43cdbe19e3fca9d9a5e40 HTTP/1.0" 200 42222 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]
Here is the code in .htaccess
# Set some options
Options -Indexes
Options FollowSymLinks
RewriteEngine on
RewriteBase /
#
# Skip the next two rewriterules if NOT a spider
RewriteCond %{HTTP_USER_AGENT} !(msnbot|slurp|googlebot) [NC]
RewriteRule .* - [S=2]
#
# case: leading and trailing parameters
RewriteCond %{QUERY_STRING} ^(.+)&sessionID=[0-9a-z]+&(.+)$ [NC]
RewriteRule (.*) $1?%1&%2 [R=301,L]
#
# case: leading-only, trailing-only or no additional parameters
RewriteCond %{QUERY_STRING} ^(.+)&sessionID=[0-9a-z]$|^sessionID=[0-9a-z]+&?(.*)$ [NC]
RewriteRule (.*) $1?%1 [R=301,L]
This .htaccess file of course is in the 'web root' path. Surely it doesn't matter that part of the URL path is "/shop", as my (limited) understanding of mod_rewrite is that whatever is placed in the web root will apply to all paths.
It would appear the 301 isn't working, as I would have expected to see a 301 in the web server logs, not a '200'?
The strange thing is, when I tested it and added my browser as an agent name, and then used a URL with the 'sessionID' in it, the URL rewrite did work; the sessionID was taken out of the URL in the browser bar.
Any clues?
Peter
Yes, missing a plus sign...
# case: leading-only, trailing-only or no additional parameters
RewriteCond %{QUERY_STRING} ^(.+)&sessionID=[0-9a-z]+$|^sessionID=[0-9a-z]+&?(.*)$ [NC]
RewriteRule (.*) $1?%1 [R=301,L]
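For reference, a consolidated sketch of the whole ruleset with the fixes discussed in this thread folded in (the space between "}" and "!", and the "+" added to the first alternative of the last condition):

```apache
RewriteEngine on
RewriteBase /
#
# Skip the next two RewriteRules if NOT a spider
# (note the required space between "}" and "!")
RewriteCond %{HTTP_USER_AGENT} !(msnbot|slurp|googlebot) [NC]
RewriteRule .* - [S=2]
#
# case: leading and trailing parameters
RewriteCond %{QUERY_STRING} ^(.+)&sessionID=[0-9a-z]+&(.+)$ [NC]
RewriteRule (.*) $1?%1&%2 [R=301,L]
#
# case: leading-only, trailing-only or no additional parameters
RewriteCond %{QUERY_STRING} ^(.+)&sessionID=[0-9a-z]+$|^sessionID=[0-9a-z]+&?(.*)$ [NC]
RewriteRule (.*) $1?%1 [R=301,L]
```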
Jim
Are you saying that %{HTTP_USER_AGENT} is wrong, and that it should be %{ HTTP_USER_AGENT }? (Not with literal quotes; I mean with spaces.)
And further, that "!abc" is also wrong, and that it should be "! abc"?
@jehoshua:
You aren't by any chance using zen-cart or osCommerce, are you? Anyway, would you mind explaining how you got the static links?
>>the mod_rewrite will rewrite the url to be:
>>http://www.example.com/shop/product_info.php/products_id/128
I'm trying to get static URLs (in zen-cart), with the rewrite rules mentioned in the following thread:
[webmasterworld.com...]
Yes, missing a plus sign...
Okay, thanks for the correction to the code, I will add that in. One other minor thing: I saw a post in another forum about a similar problem, and they talked about the need to have this:
session.use_trans_sid
set to off. I checked the site and it is on, but this site is an e-commerce one running osCommerce; I would need to check whether it should be off or on for osCommerce. We rely on sessions when people log in, add to carts, etc.
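If it does turn out that it should be off, one hedged way to set it from .htaccess (this assumes PHP is running as an Apache module and that php_flag is permitted there; under CGI/FastCGI these directives would instead cause a server error and php.ini would be needed):

```apache
# Stop PHP from automatically appending the session ID to URLs
php_flag session.use_trans_sid off
# Optionally, accept session IDs from cookies only, never from the URL
php_flag session.use_only_cookies on
```

Whether osCommerce's cart and login still work with these settings would need testing, since the shop may rely on URL-based sessions for visitors who refuse cookies.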
Thanks,
Peter
Are you saying that %{HTTP_USER_AGENT} is wrong, and that it should be %{ HTTP_USER_AGENT }? (Not with literal quotes; I mean with spaces.)
My understanding of what Jim meant was that it needs to be:
RewriteCond %{HTTP_USER_AGENT} ! (msnbot|slurp|googlebot) [NC]
and further, that "!abc" is also wrong, and that it should be "! abc"?
Yes, I think it would be as you say. It would be nice if spaces didn't get eaten and code posted was code displayed. :D
@jehoshua:
You aren't by any chance using zen-cart or osCommerce, are you?
Yes, osCommerce.
Anyway, would you mind explaining how you got the static links?
>>the mod_rewrite will rewrite the url to be:
>>http://www.example.com/shop/product_info.php/products_id/128
Do you mean the search engine friendly links? If so, osCommerce just does it; you set a switch in admin, that's all.
Regards,
Peter
Exactly as in this line:
RewriteCond %{HTTP_USER_AGENT} !(msnbot|slurp|googlebot) [NC]
The purpose of this forum function is to stop wasting database space with post replies like:
Me too !!!!!!!!!!!!!
It shortens that to
Me too!
Jim
Just out of curiosity, do the robots do anything with those 301s? I never had good experience with using redirects and bots.
As Jim stated in msg#3 in this thread:
As to what the search engines will do, they'll see the 301-Moved Permanently response, re-fetch the page from the new (sessionID-less) URL given in that response, and -- after a while, update their database to use the new URL.
Another issue. Maintaining a large list of bots seems quite impossible. Are there some general rules that will cover 80%+ of bots?
Something generic like:
* spider
* bot
* crawler
would get a lot, but miss a lot also. I think the best approach is to keep an eye on your web server logs, identify the bots that do keep using old URLs with the session IDs, add them into .htaccess, and keep monitoring the search engine results; when you see no more session IDs for _that_ search engine, you no longer need _that_ spider in the mod_rewrite.
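As a rough sketch of that generic approach (the token list is illustrative only, not a vetted one, and substring matches like these can misfire on any UA string that happens to contain one of the words):

```apache
# Skip the sessionID-stripping rules if the UA matches NONE of these
# generic tokens; extend the list from what your own logs show
RewriteCond %{HTTP_USER_AGENT} !(bot|crawl|spider|slurp) [NC]
RewriteRule .* - [S=2]
```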
One person was concerned about performance issues: will mod_rewrite cause some website performance degradation? I don't know.
Peter
Just out of curiosity, do the robots do anything with those 301s?
I'd like to comment on redirecting status codes;
There are currently 7 different redirecting status codes in use (300-305 + 307), and afaik 301 and 302 are the most commonly used. 301 [w3.org] (permanent) is used when you want people to use the new location, and 302 [w3.org] (found) is used when you want people to continue using the url they already have.
To learn more about status codes, read W3C's status code definitions [w3.org] for the full list.
I'd like to comment on redirecting status codes;
There are currently 7 different redirecting status codes in use (300-305 + 307), and afaik 301 and 302 are the most commonly used. 301 (permanent) is used when you want people to use the new location, and 302 (found) is used when you want people to continue using the url they already have. To learn more about status codes, read W3C's status code definitions for the full list.
Thanks. I'm familiar with those. The reason I am asking is that I have had some 301s in place for well over two years now, and bots and search engines are still referring to those pages. From recollection, I think it took AskJeeves/Teoma well over a year; Yahoo is still using the old URL (though it does go to the redirected page), and many other smaller engines either do not recognize 301s, or follow them but do not update their database :-(
My suggestion is that you find those sites that link to your old url, and ask them to update their links.
Nevertheless, a 301 redirect is usually the correct thing to do. If the search engine spiders do their part, everything works nicely. If either your site or the search spider does not follow the HTTP protocol, then there is no chance it will work. So, all we can do is to implement redirects in compliance with HTTP/1.1, and hope the spiders handle them properly. Or hope that those which don't will eventually get fixed.
Jim