Forum Moderators: phranque
I wanted a shorter URL which would redirect to the lengthy URL for a cgi script. I stumbled onto a solution completely by accident, but I want to make sure I understand why it works.
Here is the line in my .htaccess
RedirectMatch /fred/$ http://example.com/cgi-bin/subfolder/myScript.cgi$1
The result is that when I type
http://example.com/fred/?foo=bar
I get redirected to
http://example.com/cgi-bin/subfolder/myScript.cgi?foo=bar
...which is exactly what I wanted. What seems to be happening, based on trial and error, is that $X contains the query string, where X is 1 greater than the number of groups I put in parentheses in the regexp. In this case, I didn't put anything in parentheses, so $1 contains the query string. I also tried this with two groups in parentheses, and sure enough, $3 contained the query string.
The thing that puzzles me is that I haven't seen any mention of this on any of the Unix command web sites I've looked at (so far...). Is this predictable behavior, or something my ISP has arbitrarily enabled? I.e., would it work elsewhere?
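For what it's worth, here's roughly how I'd expect the documented form to look, with an explicit capture group in parentheses (a sketch only -- I haven't tested this exact line on my server):

# Explicit group: $1 holds whatever the parentheses captured from the path
RedirectMatch ^/fred/(.*)$ http://example.com/cgi-bin/subfolder/myScript.cgi$1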
Thanks,
Perry
[edited by: rogerd at 7:49 pm (utc) on July 14, 2004]
[edit reason] Example URLs [/edit]
However, I would not advise using this method, since it will tell search engines to always list your pages by the long cgi URL. It is better to use mod_rewrite to do a server-internal rewrite (as opposed to an external redirect) so that the long cgi URLs are never shown to users or robots -- The specified long path simply replaces the short requested path before the request is processed.
The mod_rewrite equivalent of your RedirectMatch, modified to use an internal rewrite, would be:
Options +FollowSymLinks
RewriteEngine on
RewriteCond %{QUERY_STRING} (.+)
RewriteRule ^fred/$ /cgi-bin/subfolder/myscript.cgi%1 [L]
Jim
Thanks for the reply. I will take a look at mod_rewrite.
You also bring up an interesting point that I had not considered. How DO search engines deal with dynamically generated content? Or are they able to at all? Unless there is a dummy page, in plain view of the robots, that somehow redirects to the script-generated page where the content is -- for each and every dynamically generated page -- how could a robot possibly know what query string arguments to pass to any kind of script in order to harvest and index the output?
Thanks!
Perry
Search engine spiders follow the links returned on the "pages" they fetch, regardless of whether those pages are static or dynamic. To an extent, they don't care about the form of those links - whether they say http://example.com/page_two.html or http://www.example.com/cart.php?product=widget&style=fuzzy&color=blue
However, many spiders are programmed to avoid infinite URL-spaces by ignoring session IDs, and by limiting their crawl depth based upon the number of query string parameters. For this reason, static URLs are "easier" for search engines to spider. Webmasters who wish to get best results in the search engines therefore use static URLs on their sites whenever possible.
The typical scenario for a spider-friendly site is this: the content is generated dynamically by scripts, but mod_rewrite translates the static-looking URLs that visitors and spiders request into the internal script calls, so the long dynamic URLs are never exposed.
One common point of confusion pertains to *when* a rewrite takes place. Mod_rewrite and ISAPI rewrite perform their functions after an HTTP request is received by the server, but before any content is delivered. So, they can modify the requested URL, but not a URL on a page being output to the browser or spider.
A further point of confusion is that mod_rewrite can perform either of two broad types of function when an HTTP request is received. It can do an external redirect, which returns a response telling the browser a new URL, which the browser must then use to generate a new HTTP request. Or it can perform a simple resource-name rewrite (URL-path substitution) internal to the server, within the current HTTP request.
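To make that concrete with the example from earlier in the thread (untested sketches -- adjust the paths for your own server):

# External redirect: the server returns a 301 with the new URL, and the
# browser makes a second request, so the long URL becomes visible
RewriteRule ^fred/$ http://example.com/cgi-bin/subfolder/myscript.cgi [R=301,L]

# Internal rewrite: the URL-path is substituted within the current request,
# so the browser never sees the long URL
RewriteRule ^fred/$ /cgi-bin/subfolder/myscript.cgi [L]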
Jim
Thanks for the search engine explanation. Re: your solution:
Options +FollowSymLinks
RewriteEngine on
RewriteCond %{QUERY_STRING} (.+)
RewriteRule ^fred/$ /cgi-bin/subfolder/myscript.cgi%1 [L]
I had to make just one little tweak:
RewriteRule ^fred/$ /cgi-bin/subfolder/myscript.cgi\?%1 [L]
...and it worked like a charm.
Thanks again for all your help!
Perry
P.S. FWIW, regarding the mysterious behavior of RedirectMatch in my first post: the strange part was that there WAS no back-reference defined. The RedirectMatch statement in question did not define one, nor did any earlier statement in the .htaccess file. Hence my puzzlement over why $1 had the query string in it.
Yet another question about spiders. Even though I now have "static" page URLs for these pages (thanks to mod_rewrite), requests for them still ultimately point to the CGI script in the cgi-bin directory. Isn't that directory off-limits to spiders, by definition? Or does the server allow the spiders access (to index the content) because the actual request points to a directory that is not off-limits?
Thanks!
Perry
> Isn't that directory off-limits to spiders, by definition?
By definition, no. By server configuration, possibly.
> Or does the server allow the spiders access (to index the content) because the actual request points to a directory that is not off-limits?
The server has no knowledge of spiders, so it does not behave any differently for them than it would for a browser. Only if you or your host add code (such as mod_rewrite) to check the user-agent name would the server "know" anything about spiders. There's no "magic" involved in this, just access-control code, passwords (if applicable) and file permissions at the operating system level.
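As an illustration, a user-agent check might look something like this (hypothetical rule with a made-up robot name; user-agent strings can be faked, so don't treat this as a security measure):

# Hypothetical: refuse requests whose User-Agent contains "ExampleBot"
RewriteCond %{HTTP_USER_AGENT} ExampleBot [NC]
RewriteRule ^ - [F]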
You could write access control code that would forbid direct requests for cgi-bin, but allow access via a rewritten path. Some server variables are updated by a rewrite, and others are not, so you can detect the originally-requested URL even if a rewrite takes place.
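As a sketch of that idea (untested): the THE_REQUEST server variable holds the original request line sent by the client and is not updated by rewrites, so it can be used to forbid direct requests while letting internally rewritten ones through:

# Forbid requests whose ORIGINAL request line named the script directly;
# requests rewritten internally from /fred/ are unaffected
RewriteCond %{THE_REQUEST} /cgi-bin/subfolder/myscript\.cgi
RewriteRule ^ - [F]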
Jim
RewriteRule ^fred/$ /cgi-bin/subfolder/myscript.cgi\?%1 [L]
This Forum is like a box of chocolates; ya never know what you're gonna get!
Wiz
Before copying that, be aware that the question mark in the substitution url-path does not need to be escaped, and that the %1 variable in the RewriteRule back-references the parenthesized sub-pattern in the preceding RewriteCond (which is not shown in the post immediately above this one).
Just to clarify, here it is again with everything mentioned:
Options +FollowSymLinks
RewriteEngine on
RewriteCond %{QUERY_STRING} (.+)
RewriteRule ^fred/$ /cgi-bin/subfolder/myscript.cgi?%1 [L]
Jim
When I first copied the second line I left in the \? and it worked anyway. I just removed the backslash and it still works. I also copied the preceding RewriteCond.
I probably should post this solution to another WebmasterWorld forum that deals with advertising links and ad-blocker workarounds. People have been searching for and discussing various solutions to NIS's default behaviour of blocking all affiliate ads. I was shown a cgi script that accomplishes the task, but the html call to it needed to be shortened (I use 2+ dozen text affiliate ad links). I had been asking and searching for a way to create a redirect/expand URI rule, and thanks to you I found it. I tested it last night on two of my numbered links and it worked like a charm.
Wiz