Forum Moderators: phranque
I read all through the forums here and found plenty of posts about full pseudo-directory URL rewrites (i.e., mapping url.com/script/1/2/3/4/5/6 to url.com/script.cgi?1=2&3=4&5=6), but nothing about how to just get rid of the session id when a robot comes knocking... After a bit of head scratching, here is what I came up with...
My URLs look like so:
domain.com/cgi-bin/shop/script.cgi?user_id=id&var_infinitum=val_infinitum
rewriteEngine on
rewriteBase /shop
rewriteCond %{HTTP_USER_AGENT} Googlebot.*
rewriteRule ^script\.cgi\?user_id=id&(.*)$ script\.cgi\?$1
If I am correct, that should just remove the user_id=id& out of the middle of the URL when Googlebot tries to follow the link, am I right? Then I can just add a new rewriteCond for each UA for whom I want the user_id variable removed.
Someone please let me know if there's a problem there (or tell me how to trick the server into thinking I'm Googlebot, so I can test it myself)... ;)
If I went the other way (removing the user_id variable from my links, and then using mod_rewrite to reinsert it for everyone but the SE spiders), mod_rewrite would have to alter links for the majority of visitors, instead of only modifying them for the spiders. It would also have to parse the HTTP_REFERER to retrieve the session id for regular visitors, which seems like it would be a much larger drain on the server (and those who had referers turned off in their browsers wouldn't be able to use the store).
I realize leaving all the variables in their ugly cgi form may not be ideal for spiders, but from what I've read, just getting rid of the session ids should at least allow those links to be crawled and indexed... Thoughts?
tell me how to trick the server into thinking I'm Googlebot
Here's [webmasterworld.com] a thread offering several solutions.
I suppose I could edit the rewriteCond to read Opera, and go visit the site myself... then if it worked, I could switch it to Googlebot, and if it didn't I could delete it and start over. Really, my biggest question was whether the syntax looked OK. Hoping a mod_rewrite expert could give it a gander before I uploaded it (since I just started reading the Apache docs yesterday, and haven't really built up much confidence in my grasp of the material...).
Beyond that, I was just looking to start a discussion... with recent developments, it seems it's becoming more important than ever to ensure crawlability for online catalogs. ;)
edit the rewriteCond to read Opera
Exactly!
For test you could do something like:
RewriteCond %{HTTP_USER_AGENT} Opera
RewriteCond %{REMOTE_HOST} ^yourispdomainname
RewriteRule ^script\.cgi\?user_id=id&(.*)$ script\.cgi\?$1
[edited by: DaveAtIFG at 11:39 pm (utc) on Dec. 13, 2002]
The syntax parses, but it won't work: a RewriteRule pattern is matched against the URL-path only, so the query string can never be matched there.
You should add an [L] flag to the end of the RewriteRule, unless you know you need to continue with more rewrites on that URL.
The query string is not part of the URL that RewriteRule matches; it is available for testing or backreference creation only to RewriteCond, or for direct substitution into the target URL as %{QUERY_STRING}.
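For illustration, a minimal sketch of that approach applied to the rule posted above (the user_id=id value is taken from this thread; untested here):

```apache
RewriteEngine on
# The pattern below matches only the URL-path; the query string is
# examined separately, and %1 captures everything after user_id=id&.
RewriteCond %{HTTP_USER_AGENT} Googlebot
RewriteCond %{QUERY_STRING} ^user_id=id&(.*)$
RewriteRule ^script\.cgi$ script.cgi?%1 [L]
```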
If you are using mod_rewrite in .htaccess, LMK if you need more details.
Jim
Let's walk through a cycle of GoogleBot requesting a document.
- GoogleBot requests the URI domain.com/shop.html which is the start page of your shop.
- Your server serves this document which will contain URIs like this: domain.com/cgi-bin/shop/script.cgi?user_id=id&var_infinitum=val_infinitum.
- Google parses the page and finds this link. After indexing the current document the URI domain.com/cgi-bin/shop/script.cgi?user_id=id&var_infinitum=val_infinitum is requested.
- Now your RewriteRules kick in and you do an internal rewrite to the same URI sans the session id. This is totally transparent to GoogleBot. The page that GoogleBot receives will still be referred to by the URI of domain.com/cgi-bin/shop/script.cgi?user_id=id&var_infinitum=val_infinitum. An internal rewrite is transparent to the requesting UA. You could do an external rewrite instead, sending a Moved Permanently status code; then GoogleBot would request the page again using the session-id-less URI. But I am not sure whether GoogleBot likes getting the same old URIs on each dance and being told that they are old and to use the new ones.
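The internal/external distinction can be sketched like this (hypothetical file names):

```apache
# Internal rewrite: the UA still sees /old.html; Apache silently
# serves the content of /real.html instead.
RewriteRule ^old\.html$ /real.html [L]

# External rewrite: the UA receives a 301 Moved Permanently and
# re-requests the page under its new name, /real.html.
RewriteRule ^old\.html$ /real.html [R=301,L]
```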
Again, mod_rewrite is generally a one-way-street. Using some method you fake URIs and use mod_rewrite to turn them into the real ones internally. mod_rewrite does not parse your documents before they leave the server. It does not rewrite the URIs contained in those documents.
Andreas
What may be confusing is the "direction" that the rewrite goes...
mod_rewrite takes the URL your browser requests and modifies it for use inside the server. It does not affect the URLs output back to your browser by your scripts and html pages.
So, this probably is not what you wanted to do.
There are several posts from the last few days addressing this.
Jim
I don't expect that the actual html code being served would be changed at all... that would be silly. mod_rewrite is supposed to rewrite URLs, not rewrite html. ;)
However, I do not see why it should be able to take a link reading domain.com/script/var/val/var2/val2 and turn it into domain.com/script?var=val&var2=val2, which appears to be a simple character substitution, but not take a link reading domain.com/script?id=id&var=val and turn it into domain.com/script?var=val by being told to substitute id=id for nothing.
The page that GoogleBot receives will still be referred to by the URI of domain.com/cgi-bin/shop/script.cgi?user_id=id&var_infinitum=val_infinitum
That is fine... The store links on my html templates read user_id=id. After you've followed one of those links, the script assigns you an ID number, and the links generated thereafter contain user_id=12345 (a random 5 digit number). If mod_rewrite could remove the user_id=id from the internal request, the script would not assign the session id, and would generate the succeeding links as the generic user_id=id.
So, if crawl10.googlebot comes along and requests a few store pages, they will appear as user_id=id&page=page.html. Then crawl2.googlebot comes along, requests the same pages, and gets user_id=id&page=page.html. The URLs are the same.
Currently, crawl10 will get something like user_id=54208&page=page.html, and crawl2 will get user_id=60234&page=page.html, giving the spider-scaring illusion that there are an infinite number of store pages to crawl, and supposedly causing them to throw those pages out (these pages get crawled every month, and never appear in the index).
It seems if they got the generic user_id=id every time, it would be apparent when they'd requested the same page, and that would remove the infinite-pages problem.
<added>In short, I am not trying to make user_id=id disappear from the robots' perspective. I am trying to stop user_id=id from being "delivered" to the store script when a robot follows the link, so the links are not changed to user_id=12345 on the next page generated/retrieved.</added>
[edited by: mivox at 1:54 am (utc) on Dec. 14, 2002]
However, I do not see why it should be able to take a link reading domain.com/script/var/val/var2/val2 and turn it into domain.com/script?var=val&var2=val2, which appears to be a simple character substitution, but not take a link reading domain.com/script?id=id&var=val and turn it into domain.com/script?var=val by being told to substitute id=id for nothing.
There is no question that mod_rewrite can do just that. But this might not be what you want. The fake URI is the one that Google will assign to the page. Since this URI does not exist on the server it is rewritten internally to the right URI. In your situation the URI containing the session id is the one Google will assign to your page. Internally the URI with session id is rewritten to a URI sans session id. If this is the only thing you really want, although I'm not sure how that would help, then using mod_rewrite will be ok.
It would be helpful if you could let us know why you think that the problems I mentioned above will not apply to your particular situation.
Andreas
[edited by: andreasfriedrich at 1:56 am (utc) on Dec. 14, 2002]
<ROFL... and then we edited our posts at the same time...>
So... how do I do it? I've been tinkering with the {QUERY_STRING} idea for a while now, and I can't get it to stop assigning me a user_id number... grr.
[edited by: mivox at 1:58 am (utc) on Dec. 14, 2002]
Sorry that I insisted on telling you what mod_rewrite does and what it doesn't do. I wasn't sure whether you expected things that it just wasn't designed for. A lot of people do. You didn't.
I'll have a look at the rules now. ;)
Andreas
But I got a good grade in that class, so I ought to be able to figure this out. I'm just finding a major lack of documentation on .htaccess URL manipulation, and I'm trying to learn regex at the same time as I'm figuring out mod_rewrite, so I'm not quite sure which end is up at the moment... ;)
major lack of documentation on .htaccess URL manipulation
There isn't that big a difference between using RewriteRules in httpd.conf and in .htaccess files as far as the syntax is concerned. All you need to remember is that the directory prefix is removed prior to matching and added again later on. Performance-wise, there is a BIG difference between the two.
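As a sketch of that prefix handling (directory and file names assumed for illustration):

```apache
# Per-server context (httpd.conf): the pattern sees the full URL-path.
RewriteRule ^/shop/old\.cgi$ /shop/new.cgi [L]

# Per-directory context (/shop/.htaccess): the /shop/ prefix is
# stripped before matching and re-added afterwards.
RewriteRule ^old\.cgi$ new.cgi [L]
```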
Since user_id=id is static as you write, you could use a rule like this:
RewriteCond %{QUERY_STRING} user_id=id&(.*)$
When toying with mod_rewrite, do so in a structured manner. Using a simple rewrite rule, test whether rewriting works at all in the particular directory. If it works, add more conditions piece by piece. But you probably do this anyway, since it applies to almost all work one does.
Andreas
Now that Andreas is here, you'll have all the help you need. <stage whisper>I'm glad he hasn't disappeared completely into law school...</stage whisper>
In which subdirectory is the .htaccess containing your RewriteRule? - I'd like to make sure RewriteBase is needed and correct.
If you liked constitutional law, you'll love regex and mod_rewrite - It's a lot of logic and precise language, and the details have to be absolutely correct. :)
Jim
<added after AF's last post>... And remember that you can test using a "dummy" URL on the left side of the RewriteRule, so as to avoid breaking your site for real visitors.</added>
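For example, a hedged sketch of that dummy-URL trick (the test path is invented here):

```apache
# Only requests for the made-up name test-script.cgi are rewritten,
# so real visitors keep their session ids while you experiment.
RewriteCond %{QUERY_STRING} ^user_id=id&(.*)$
RewriteRule ^test-script\.cgi$ script.cgi?%1 [L]
```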
I tested those rules on my test server and they work great.
Options +FollowSymlinks
RewriteEngine on
RewriteBase /shop <--- make sure this is ok.
RewriteCond %{HTTP_USER_AGENT} ^Opera.*
RewriteCond %{QUERY_STRING} user_id=[0-9a-z]{2,5}&(.*)$
RewriteRule ^script\.cgi script.cgi?%1 [L]
As Jim wrote, all you need to do is make sure that RewriteBase is correct and that the .htaccess file is in the right directory: the one the URI domain.com/cgi-bin/shop resolves to.
Andreas
RewriteBase /
RewriteRule ^cgi-bin/shop/script\.cgi cgi-bin/shop/script.cgi?%1 [L]
Works like a charm now. Now I just need to add one more script name to the rewrite rules, and I'll be all ready to change the target UA to block SE spiders from getting user ids.
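One possible way to cover a second script in the same ruleset (the name script2.cgi is invented for illustration, and this variant is untested):

```apache
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^Opera.*
RewriteCond %{QUERY_STRING} user_id=[0-9a-z]{2,5}&(.*)$
# $1 holds whichever script name matched; %1 holds the query string
# remainder captured by the RewriteCond above.
RewriteRule ^cgi-bin/shop/(script|script2)\.cgi cgi-bin/shop/$1.cgi?%1 [L]
```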