Apache Web Server Forum

    
Will a redirect fix this spider problem?
Getting rid of session ID's
jehoshua




msg:1521342
 11:07 am on Jan 3, 2005 (gmt 0)

Hi,

The e-commerce system we use on a site assigns a session ID to every website visitor. We do not want sessions turned on for spiders, crawlers, bots, etc., because the session ID then turns up in search engine listings, and we can end up with two people visiting the site with the same session ID. It causes huge problems.

So, we have modified the PHP code to look for the "agent" name (slurp, msnbot, googlebot, etc.), and if the agent name is recognised as a spider, we don't allow sessions to be used. This mod to the code works perfectly; however, some search engines are still revisiting the site using old URLs that contain session IDs.

We have absolutely no control over what people or spiders send as the 'GET', though, hence the problem. Hoping that mod_rewrite will help this situation, we now have mod_rewrite code that looks for the "agent", and if it is any of:

msnbot
slurp
googlebot
(there may be others?)

and they try and do this:

[example.com...]

the mod_rewrite will rewrite the url to be:

[example.com...]

Here is the mod_rewrite code:


RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^(msnbot¦slurp¦googlebot) [NC]
RewriteCond %{QUERY_STRING} ^(.*)\&?sessionID=[a-zA-Z0-9]+\&?(.*)$
RewriteRule ^(.*) $1?%1%2 [R=301,L]

which I am told will re-write the URL as shown above, and apparently cause a '301' (redirect).

Given the above example, if it does cause a 301 (we are about to start testing), will this stop the page from being indexed, or what will happen? What will spiders/bots like slurp and msnbot do if this happens?

Our objective is to stop the spiders and bots from adding URLs that contain session IDs to their search engines. We do not want to affect the PR of the site in our attempt to 'force' the spiders to re-index.

Session IDs must not appear in any links to the site or in any search engine results. Will the 301 do the trick?

Thanks,

Peter

 

jehoshua




msg:1521343
 12:06 pm on Jan 3, 2005 (gmt 0)

Hi,

I don't need Apache or mod_rewrite help, but search engine help please.

Peter

jdMorgan




msg:1521344
 4:01 pm on Jan 3, 2005 (gmt 0)

Peter,

Actually, it looks as though you could use a bit of mod_rewrite help...

The code you're using won't do exactly what you expect it to do, in that it won't match those spiders, and it will drop the ampersands, even if other parameters are present.

The user-agent pattern should not be start-anchored (remove the "^") or your user-agent condition will fail.
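A minimal sketch of the anchoring point, using Python's `re` module as a stand-in for mod_rewrite's regex engine (an assumption; for a pattern this simple the two should behave alike). The user-agent string is the Googlebot one from the log entry posted later in this thread, and `re.IGNORECASE` plays the role of `[NC]`.

```python
import re

# Solid pipes are used here; the forum displays them as broken pipes.
ANCHORED = re.compile(r"^(msnbot|slurp|googlebot)", re.IGNORECASE)
UNANCHORED = re.compile(r"(msnbot|slurp|googlebot)", re.IGNORECASE)

# Googlebot's user-agent as it appears in the access log quoted in this thread.
googlebot_ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

# The anchored pattern needs the spider name at position 0, but the string
# starts with "Mozilla/5.0", so the condition never matches.
anchored_hit = ANCHORED.search(googlebot_ua) is not None
unanchored_hit = UNANCHORED.search(googlebot_ua) is not None

print("anchored:", anchored_hit)      # anchored: False
print("unanchored:", unanchored_hit)  # unanchored: True
```

Any spider whose user-agent begins with a generic token such as "Mozilla/5.0" is missed by the start-anchored form.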

In order to prevent problems with missing or orphaned ampersands in the search engine listings, the cases of leading-parameters only, trailing-parameters only, and both leading and trailing parameters must be handled.

RewriteEngine on
RewriteBase /
#
# Skip the next two rewriterules if NOT a spider
RewriteCond %{HTTP_USER_AGENT} !(msnbot¦slurp¦googlebot) [NC]
RewriteRule .* - [S=2]
#
# case: leading and trailing parameters
RewriteCond %{QUERY_STRING} ^(.+)&sessionID=[0-9a-z]+&(.+)$ [NC]
RewriteRule (.*) $1?%1&%2 [R=301,L]
#
# case: leading or trailing parameters only
RewriteCond %{QUERY_STRING} ^(.+)&sessionID=[0-9a-z]$¦^sessionID=[0-9a-z]+&(.+)$ [NC]
RewriteRule (.*) $1?%1 [R=301,L]
#

Note that you'll have to change any broken pipe "¦" characters above to solid pipes before use.

As to what the search engines will do, they'll see the 301-Moved Permanently response, re-fetch the page from the new (sessionID-less) URL given in that response, and -- after a while, update their database to use the new URL.

Jim

jdMorgan




msg:1521345
 4:34 pm on Jan 3, 2005 (gmt 0)

Doh! - I forgot the case of no additional parameters!

RewriteEngine on
RewriteBase /
#
# Skip the next two rewriterules if NOT a spider
RewriteCond %{HTTP_USER_AGENT} !(msnbot¦slurp¦googlebot) [NC]
RewriteRule .* - [S=2]
#
# case: leading and trailing parameters
RewriteCond %{QUERY_STRING} ^(.+)&sessionID=[0-9a-z]+&(.+)$ [NC]
RewriteRule (.*) $1?%1&%2 [R=301,L]
#
# case: leading-only, trailing-only or no additional parameters
RewriteCond %{QUERY_STRING} ^(.+)&sessionID=[0-9a-z]$¦^sessionID=[0-9a-z]+&?(.*)$ [NC]
RewriteRule (.*) $1?%1 [R=301,L]

Jim

[edited by: jdMorgan at 12:54 am (utc) on Jan. 4, 2005]

jehoshua




msg:1521346
 10:23 pm on Jan 3, 2005 (gmt 0)

Hi Jim,

> Actually, it looks as though you could use a bit of mod_rewrite help...

> The code you're using won't do exactly what you expect it to do, in that it won't match those spiders, and it will drop the ampersands, even if other parameters are present.

> The user-agent pattern should not be start-anchored (remove the "^") or your user-agent condition will fail.

> In order to prevent problems with missing or orphaned ampersands in the search engine listings, the cases of leading-parameters only, trailing-parameters only, and both leading and trailing parameters must be handled.

Well, there are quite a few errors there then, and here I was thinking it was ready for testing. I only understand the point about the user-agent pattern: the string can be anywhere, but I think the code was expecting it to be at the first position.

Thanks for posting the corrected code. :)

> As to what the search engines will do, they'll see the 301-Moved Permanently response, re-fetch the page from the new (sessionID-less) URL given in that response, and -- after a while, update their database to use the new URL.

That is perfect, exactly what we want to happen; as you say it won't happen immediately, but the desired objective will be 'sessionID-less' URL's.

As for testing, I think there is a log command. We have a test site, and my browser's agent name shows up in the logs as:


"-" "Mozilla/5.0 (Windows; U; Win95; en-US; rv:1.6) Gecko/20040113"

So I could just add mozilla as a temporary agent name for testing purposes, request some of those URLs with the sessionID in them, and see what happens.

Thanks,

Peter

jehoshua




msg:1521347
 10:47 pm on Jan 3, 2005 (gmt 0)

Hi Jim,

> I forgot the case of no additional parameters!

By looking at the code, I couldn't see any difference, so 'Beyond Compare' to the rescue, which picked up 2 differences.

1.

RewriteCond %{HTTP_USER_AGENT}!(msnbot¦slurp¦googlebot) [NC]

no space before the "!"

2.

RewriteCond %{QUERY_STRING} ^(.+)&sessionID=[0-9a-z]$¦^sessionID=[0-9a-z]+&?(.*)$ [NC]

the only changes I can see above are:


+&(.+)$ [NC]

TO .....


+&?(.*)$ [NC]

Now, one final question please; just a minor issue, is it okay/conventional to have the code like this?


RewriteEngine on
RewriteBase /
#
# Check for the following spiders
RewriteCond %{HTTP_USER_AGENT} (msnbot¦slurp¦googlebot) [NC]
# case: leading and trailing parameters
RewriteCond %{QUERY_STRING} ^(.+)&sessionID=[0-9a-z]+&(.+)$ [NC]
RewriteRule (.*) $1?%1&%2 [R=301,L]
#
# Check for the following spiders
RewriteCond %{HTTP_USER_AGENT} (msnbot¦slurp¦googlebot) [NC]
# case: leading-only, trailing-only or no additional parameters
RewriteCond %{QUERY_STRING} ^(.+)&sessionID=[0-9a-z]$¦^sessionID=[0-9a-z]+&?(.*)$ [NC]
RewriteRule (.*) $1?%1 [R=301,L]

It's the same number of lines of code; it's just that, coming from a few decades' background in application programming languages, we had a joke:


We are not to use NOT's!

I guess old habits die hard, and I have found that in 2GLs, 3GLs and especially 4GLs, the use of an 'OR NOT' is sometimes a bit unstable. I don't know Apache _that_ well, so I lack the confidence to use the NOT, and although this other way has one redundant line of code (testing for agents), I guess it just makes me feel 'warm and fuzzy'. :D

Thanks very much for all your help,

Peter

jdMorgan




msg:1521348
 12:53 am on Jan 4, 2005 (gmt 0)

I used NOT for another very good programming-related reason: Ease of maintenance. Using NOT allows you to have *one* list of user-agent names to maintain, instead of two. For this reason, I recommend using that negative construct.

A space is always required between "}" and "!" but posting on this forum eats those spaces. Be sure to put the space back in. I'll go fix it in the post.

Jim

jehoshua




msg:1521349
 1:50 am on Jan 4, 2005 (gmt 0)

Hi Jim,

Okay, if you say the one-liner with the 'NOT' does work, then I'll use it. I agree about ease of maintenance, etc.

Thanks,

Peter

jehoshua




msg:1521350
 12:53 am on Jan 10, 2005 (gmt 0)

Hi,

Well, after 4 days of having the mod_rewrite code on the site, I decided to have a good look through the web server logs. I looked for entries containing 'msnbot', 'googlebot' or 'slurp' AND 'sessionID'. All 3 spiders had considerable activity during these 4 days.

msnbot - nothing found - good
googlebot - 3 days nothing, 1 day this entry:


66.249.65.226 - - [08/Jan/2005:01:54:05 -0600] "GET /www.example.com/shop/default.php?cPath=8&sessionID=94e6b5d9ddbd53e19616cda29beee477 HTTP/1.1" 200 31798 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

I was expecting a 301?

yahoo (slurp) - entries found on all 4 days, too many to list; here is one entry from one day.


66.196.90.60 - - [05/Jan/2005:21:25:37 -0600] "GET /www.example.com/shop/product_info.php?products_id=11&sessionID=d16ba91e57f43cdbe19e3fca9d9a5e40 HTTP/1.0" 200 42222 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]

Here is the code in .htaccess


# Set some options
Options -Indexes
Options FollowSymLinks

RewriteEngine on
RewriteBase /
#
# Skip the next two rewriterules if NOT a spider
RewriteCond %{HTTP_USER_AGENT}!(msnbot¦slurp¦googlebot) [NC]
RewriteRule .* - [S=2]
#
# case: leading and trailing parameters
RewriteCond %{QUERY_STRING} ^(.+)&sessionID=[0-9a-z]+&(.+)$ [NC]
RewriteRule (.*) $1?%1&%2 [R=301,L]
#
# case: leading-only, trailing-only or no additional parameters
RewriteCond %{QUERY_STRING} ^(.+)&sessionID=[0-9a-z]$¦^sessionID=[0-9a-z]+&?(.*)$ [NC]
RewriteRule (.*) $1?%1 [R=301,L]

This .htaccess file is of course in the 'web root' path. Surely it doesn't matter that part of the URL path is "/shop", as my (limited) understanding of mod_rewrite is that whatever is placed in the web root will apply to all paths.

It would appear the 301 isn't working, as I would have expected to see a 301 in the web server logs, not a '200'?

Strange thing is, when I tested it by adding my browser as an agent name and then used a URL with the 'sessionID' in it, the URL rewrite did work: the sessionID was taken out of the URL in the browser bar.

Any clues?

Peter

jdMorgan




msg:1521351
 1:28 am on Jan 10, 2005 (gmt 0)

> Any clues?

Yes, missing a plus sign...

# case: leading-only, trailing-only or no additional parameters
RewriteCond %{QUERY_STRING} ^(.+)&sessionID=[0-9a-z]+$¦^sessionID=[0-9a-z]+&?(.*)$ [NC]
RewriteRule (.*) $1?%1 [R=301,L]

As it was, it would accept only a single-character sessionID in the form of the requests you posted, so that's why they didn't redirect.
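The effect of the missing plus sign can be reproduced with a short sketch. It uses Python's `re` module as a stand-in for mod_rewrite's regex engine (an assumption), solid pipes in place of the forum's broken pipes, and the two query strings from the log entries posted above. The `page=2` parameter in the last check is made up for illustration.

```python
import re

TRAILING_ALT = r"^sessionID=[0-9a-z]+&?(.*)$"
# Condition as first posted: no "+" after the character class.
BUGGY = re.compile(r"^(.+)&sessionID=[0-9a-z]$|" + TRAILING_ALT, re.IGNORECASE)
# Condition with the "+" restored.
FIXED = re.compile(r"^(.+)&sessionID=[0-9a-z]+$|" + TRAILING_ALT, re.IGNORECASE)

# Query strings taken from the Googlebot and Slurp log entries in this thread.
logged_queries = [
    "cPath=8&sessionID=94e6b5d9ddbd53e19616cda29beee477",
    "products_id=11&sessionID=d16ba91e57f43cdbe19e3fca9d9a5e40",
]

buggy_hits = [BUGGY.search(q) is not None for q in logged_queries]
fixed_groups = [FIXED.search(q).group(1) for q in logged_queries]

print(buggy_hits)    # [False, False]
print(fixed_groups)  # ['cPath=8', 'products_id=11']

# Hypothetical trailing-only request, to show which group captures what.
trailing = FIXED.search("sessionID=abc123&page=2")
print(trailing.group(1), trailing.group(2))  # None page=2
```

In Python's group numbering the trailing-only alternative captures into group 2, not group 1. It may be worth confirming on a test server that `%1` alone carries trailing parameters through in Apache.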

Jim

valder




msg:1521352
 1:44 am on Jan 10, 2005 (gmt 0)

jdMorgan said:
>> A space is always required between "}" and "!" but posting on this forum eats those spaces.

Are you saying that %{HTTP_USER_AGENT} is wrong, and that it should be %{ HTTP_USER_AGENT }? (well, not really " , but space.")

and further, that "!abc" is also wrong, and that it should be "! abc"?

@jehoshua:
You aren't by any chance using zen-cart or osCommerce, are you? Anyway, would you mind explaining how you got the static links?

>>the mod_rewrite will rewrite the url to be:
>>http://www.example.com/shop/product_info.php/products_id/128

I'm trying to get static URLs (in zen-cart), with the rewrite rules mentioned in the following thread:
[webmasterworld.com...]

jehoshua




msg:1521353
 2:04 am on Jan 10, 2005 (gmt 0)

Hi Jim,


> Yes, missing a plus sign...

Okay, thanks for the correction to the code, I will add that in. One other minor thing: I saw a post in another forum about a similar problem, and they talked about the need to have this:


session.use_trans_sid

set to off. I checked the site, and it is on, but this is an e-commerce site running osCommerce, so I would need to check whether it should be off or on for osCommerce. We rely on sessions when people log in, add to carts, etc.

Thanks,

Peter

jehoshua




msg:1521354
 2:11 am on Jan 10, 2005 (gmt 0)

Hi valder,


> Are you saying that %{HTTP_USER_AGENT} is wrong, and that it should be %{ HTTP_USER_AGENT }? (well, not really " , but space.")

My understanding of what Jim meant was that it needs to be:

RewriteCond %{HTTP_USER_AGENT} ! (msnbot¦slurp¦googlebot)  [NC]

> and further, that "!abc" is also wrong, and that it should be "! abc"?

Yes, I think it would be as you say. It would be nice if spaces didn't get eaten and code posted was code displayed. :D


> @jehoshua:
> You aren't by any chance using zen-cart or osCommerce, are you?

Yes, osCommerce.


> Anyway, would you mind explaining how you got the static links?

>>the mod_rewrite will rewrite the url to be:
>>http://www.example.com/shop/product_info.php/products_id/128

Do you mean the search engine friendly links? If so, osCommerce just does it; you set a switch in admin, that's all.

Regards,

Peter

jdMorgan




msg:1521355
 2:22 am on Jan 10, 2005 (gmt 0)

> required space between "}" and "!"

Exactly as in this line:

RewriteCond %{HTTP_USER_AGENT} !(msnbot¦slurp¦googlebot) [NC]

(You can put a [bold][/bold] bbCode tag pair in front of the "!" to stop the forum eating spaces. The italics tag works too.)

The purpose of this forum function is to stop wasting database space with post replies like:
Me too !!!!!!!!!!!!!
It shortens that to
Me too!

Jim

valder




msg:1521356
 2:34 am on Jan 10, 2005 (gmt 0)

Ah, I see, didn't know any of that.
Thanks again :)

valder




msg:1521357
 2:41 am on Jan 10, 2005 (gmt 0)

>> .. If so, osCommerce just does it, you set a switch in admin that all.

Is that the one that says "under development" (2.2 MS2)?
I never dared use it because of that line :) Anyway, switched to zen-cart now, I find it much better in fact. It's based on osCommerce, and in some ways very similar.

jehoshua




msg:1521358
 3:12 am on Jan 10, 2005 (gmt 0)

Hi Jim,


> Yes, missing a plus sign...

Thanks for that, I have updated .htaccess and it's uploaded to the site now.

Thanks,

Peter

jehoshua




msg:1521359
 3:15 am on Jan 10, 2005 (gmt 0)

Hi,


> Is that the one that says "under development" (2.2 MS2)?
> I never dared use it because of that line :)

I didn't turn it on for osC sites, for the same reason, but then I came upon an MS-1/MS-2 site (a snapshot _somewhere_ between the two?) which had it turned on, and it works 100%.

Peter

jehoshua




msg:1521360
 2:33 am on Jan 14, 2005 (gmt 0)

Hi Jim,

I have just finished going through the web server logs for the last 4 days, looking for these 3 spiders, and checking where they used the session ID's.

Every occurrence of a session ID in the URL returned the "301". :D
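The log check described above can be sketched roughly as follows. This assumes the Apache combined log format shown in the entries posted earlier; the function name is made up for illustration, the first sample line is the Googlebot entry from this thread (logged before the fix, hence the 200), and the second sample line is an invented browser request that should be ignored.

```python
import re

# Combined log format: host ident user [time] "request" status size "referer" "agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)
SPIDERS = re.compile(r"msnbot|slurp|googlebot", re.IGNORECASE)


def spider_session_hits(lines):
    """Yield (agent, status, request) for spider requests containing sessionID."""
    for line in lines:
        m = LOG_LINE.match(line)
        if m is None:
            continue  # skip lines that are not in the expected format
        agent, request = m.group("agent"), m.group("request")
        if SPIDERS.search(agent) and "sessionid=" in request.lower():
            yield agent, int(m.group("status")), request


sample = [
    # Entry quoted earlier in this thread (before the rule was corrected).
    '66.249.65.226 - - [08/Jan/2005:01:54:05 -0600] "GET /www.example.com/shop/default.php?cPath=8&sessionID=94e6b5d9ddbd53e19616cda29beee477 HTTP/1.1" 200 31798 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    # Invented browser request, included only to show that it is filtered out.
    '192.0.2.10 - - [08/Jan/2005:02:00:00 -0600] "GET /shop/default.php?cPath=8&sessionID=abc123 HTTP/1.1" 200 31798 "-" "Mozilla/5.0 (Windows; U; Win95; en-US; rv:1.6) Gecko/20040113"',
]

hits = list(spider_session_hits(sample))
for agent, status, request in hits:
    verdict = "redirected" if status == 301 else "NOT redirected"
    print(status, verdict, request)
```

On a real site the `sample` list would be replaced by the lines of the access log file; any line reported as not redirected points at a case the rules do not cover yet.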

Many thanks for all your help, it works perfectly.

Peter

Orange_XL




msg:1521361
 8:16 pm on Jan 17, 2005 (gmt 0)

Just out of curiosity, do the robots do anything with those 301s? I never had good experience with using redirects and bots.

Another issue. Maintaining a large list of bots seems quite impossible. Are there some general rules that will cover 80%+ of bots?

jehoshua




msg:1521362
 10:33 pm on Jan 17, 2005 (gmt 0)

Hi,

> Just out of curiosity, do the robots do anything with those 301s? I never had good experience with using redirects and bots.

As Jim stated in msg#3 in this thread:

> As to what the search engines will do, they'll see the 301-Moved Permanently response, re-fetch the page from the new (sessionID-less) URL given in that response, and -- after a while, update their database to use the new URL.

> Another issue. Maintaining a large list of bots seems quite impossible. Are there some general rules that will cover 80%+ of bots?

Something generic like:

* spider
* bot
* crawler

would get a lot, but miss a lot also. I think the best approach is to keep an eye on your web server logs, identify the ones that do keep using old URLs with the session IDs, add them into .htaccess, and keep monitoring the search engine results; when you see no more session IDs for _that_ search engine, you no longer need _that_ spider in the mod_rewrite.
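A quick sketch of what a generic pattern catches and misses, using Python's `re` module as a stand-in for the RewriteCond match (an assumption). The Googlebot and browser strings come from the logs in this thread; the msnbot and Slurp strings are shortened, illustrative values, not full real user-agents.

```python
import re

GENERIC = re.compile(r"spider|bot|crawler", re.IGNORECASE)

user_agents = {
    "googlebot": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "msnbot": "msnbot/1.0",                              # illustrative, shortened
    "slurp": "Mozilla/5.0 (compatible; Yahoo! Slurp)",   # illustrative, shortened
    "browser": "Mozilla/5.0 (Windows; U; Win95; en-US; rv:1.6) Gecko/20040113",
}

# True where the generic pattern would treat the visitor as a spider.
matched = {name: GENERIC.search(ua) is not None for name, ua in user_agents.items()}

for name, hit in matched.items():
    print(f"{name}: {hit}")
# googlebot: True
# msnbot: True
# slurp: False
# browser: False
```

Slurp slips through because its name contains none of the three generic words, which is why watching the logs and adding names as needed still matters.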

One person was concerned about performance: will mod_rewrite cause some website performance degradation? I don't know.

Peter

valder




msg:1521363
 6:06 pm on Jan 18, 2005 (gmt 0)

Orange_XL said:
> Just out of curiosity, do the robots do anything with those 301s?

I'd like to comment on redirecting status codes;
There are currently 7 different redirecting status codes in use (300-305 + 307), and afaik 301 and 302 are the most commonly used. 301 [w3.org] (permanent) is used when you want people to use the new location, and 302 [w3.org] (found) is used when you want people to continue using the url they already have.

To learn more about status codes, read W3C's status code definitions [w3.org] for the full list.

Orange_XL




msg:1521364
 9:46 pm on Jan 18, 2005 (gmt 0)


> I'd like to comment on redirecting status codes;
> There are currently 7 different redirecting status codes in use (300-305 + 307), and afaik 301 and 302 are the most commonly used. 301 (permanent) is used when you want people to use the new location, and 302 (found) is used when you want people to continue using the url they already have.

> To learn more about status codes, read W3C's status code definitions for the full list.

Thanks. I'm familiar with those. The reason I'm asking is that I have had some 301s in place for well over two years now, and bots and search engines are still referring to those pages. From recollection, I think it took AskJeeves/Teoma well over a year, Yahoo is still using the old URL (though it does go to the redirected page), and many other smaller engines either do not recognize 301s or follow them but do not update their database :-(

valder




msg:1521365
 1:22 am on Jan 19, 2005 (gmt 0)

My guess is that the bots follow links from other sites that don't know about your change of url. The search engines probably don't have a list of all 301 links, so they keep following the same old links. I'm only guessing here, but it seems logical to me.

My suggestion is that you find those sites that link to your old url, and ask them to update their links.

jdMorgan




msg:1521366
 1:34 am on Jan 19, 2005 (gmt 0)

I'd also suggest you use the Server Headers checker in the WebmasterWorld control panel to make sure your 301 redirect is in fact returning a 301. Many times, problems with search engines following redirects are caused by "mis-implementation" of the redirects. Admittedly though, I've seen Ask keep requesting 404'ed and 410'ed pages for up to a year, too.
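For those who prefer to check the response headers from a script, here is a minimal sketch using Python's standard library. Python's `http.client` does not follow redirects, so the raw status and Location header are visible. To keep the example self-contained it talks to a small local stand-in server (`DemoHandler`, made up for this sketch) that mimics the intended behaviour; against a real site, only `fetch_status` would be needed, pointed at the real host.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qsl, urlencode, urlsplit


class DemoHandler(BaseHTTPRequestHandler):
    """Stand-in for the real site: 301 if sessionID is in the query, else 200."""

    def do_GET(self):
        parts = urlsplit(self.path)
        params = parse_qsl(parts.query, keep_blank_values=True)
        kept = [(k, v) for k, v in params if k.lower() != "sessionid"]
        if len(kept) != len(params):
            target = parts.path + ("?" + urlencode(kept) if kept else "")
            self.send_response(301)
            self.send_header("Location", target)
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch_status(host, port, path, user_agent):
    """Return (status, Location header) without following any redirect."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", path, headers={"User-Agent": user_agent})
        resp = conn.getresponse()
        resp.read()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

result = fetch_status("127.0.0.1", port, "/shop/default.php?cPath=8&sessionID=abc123", "Googlebot/2.1")
print(result)  # (301, '/shop/default.php?cPath=8')

server.shutdown()
server.server_close()
```

Sending a spider's user-agent string matters when testing rules like the ones in this thread, since the redirect only fires for the listed agents. The stand-in server ignores the user-agent and returns a relative Location, which a real Apache redirect would not.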

Nevertheless, a 301 redirect is usually the correct thing to do. If the search engine spiders do their part, everything works nicely. If either your site or the search spider does not follow the HTTP protocol, then there is no chance it will work. So, all we can do is to implement redirects in compliance with HTTP/1.1, and hope the spiders handle them properly. Or hope that those which don't will eventually get fixed.

Jim
