Forum Moderators: phranque

Query string success with RedirectMatch

         

doctormelodious

7:34 pm on Jul 14, 2004 (gmt 0)

10+ Year Member



Greetings,

I wanted a shorter URL which would redirect to the lengthy URL for a cgi script. I stumbled onto a solution completely by accident, but I want to make sure I understand why it works.

Here is the line in my .htaccess


RedirectMatch /fred/$ http://example.com/cgi-bin/subfolder/myScript.cgi$1

The result is that when I type


http://example.com/fred/?foo=bar

I get redirected to


http://example.com/cgi-bin/subfolder/myScript.cgi?foo=bar

...which is exactly what I wanted. What seems to be happening, based on trial and error, is that $X contains the query string, where X is one greater than the number of groups I put in parentheses in the regexp. In this case, I didn't put anything in parentheses, so $1 contains the query string. I also tried this with two groups in parentheses, and sure enough, $3 contained the query string.

The thing that puzzles me is that I haven't seen any mention of this on any of the Unix command web sites I've looked at (so far...). Is this predictable behavior, or something my ISP has arbitrarily enabled? In other words, would it work elsewhere?

Thanks,
Perry

[edited by: rogerd at 7:49 pm (utc) on July 14, 2004]
[edit reason] Example URLs [/edit]

jdMorgan

8:43 pm on Jul 14, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I don't know how that works, since you haven't defined the back-reference in the code above.

However, I would not advise using this method, since it will tell search engines to always list your pages by the long cgi URL. It is better to use mod_rewrite to do a server-internal rewrite (as opposed to an external redirect) so that the long cgi URLs are never shown to users or robots -- The specified long path simply replaces the short requested path before the request is processed.

The mod_rewrite equivalent of your RedirectMatch, modified to use an internal rewrite, would be:


Options +FollowSymLinks
RewriteEngine on
RewriteCond %{QUERY_STRING} (.+)
RewriteRule ^fred/$ /cgi-bin/subfolder/myscript.cgi%1 [L]

Apache mod_rewrite documentation [httpd.apache.org]
Apache URL Rewriting Guide [httpd.apache.org]
Regular Expressions Tutorial [etext.lib.virginia.edu]

Jim

doctormelodious

12:25 am on Jul 15, 2004 (gmt 0)

10+ Year Member



Hi JD,

Thanks for the reply. I will take a look at mod_rewrite.

You also bring up an interesting point that I had not considered. How DO search engines deal with dynamically generated content? Or are they able to at all? Unless there is a dummy page, in plain view of the robots, that somehow redirects to the script-generated page where the content is -- for each and every dynamically generated page -- how could a robot possibly know what query string arguments to pass to any kind of script in order to harvest and index the output?

Thanks!
Perry

jdMorgan

12:57 am on Jul 15, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



OK, here's a whirlwind tour:

Search engine spiders follow the links returned on the "pages" they fetch, regardless of whether those pages are static or dynamic. To an extent, they don't care about the form of those links - whether they say http://example.com/page_two.html or http://www.example.com/cart.php?product=widget&style=fuzzy&color=blue

However, many spiders are programmed to avoid infinite URL-spaces by ignoring session IDs, and by limiting their crawl depth based upon the number of query string parameters. For this reason, static URLs are "easier" for search engines to spider. Webmasters who wish to get best results in the search engines therefore use static URLs on their sites whenever possible.

The typical scenario for a spider-friendly site is this:

  1. All "pages" on the site output static URLs without query string parameters. These appear to be static pages to spiders and browser users alike. The search engines will then list these static URLs in search results.
  2. When a static URL is requested, mod_rewrite or ISAPI Rewrite is used to internally rewrite the requested static URL to the form needed by the application script.
  3. The modified request is then passed to the script, which outputs "pages" containing static link URLs needed to repeat the first step.
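
As a rough sketch of step 2, assuming a hypothetical site whose pages link to static-looking URLs such as /products/widget/fuzzy/blue.html and whose script is the cart.php from the example above (the "products" prefix and the three-segment layout are illustrative assumptions, not something from this thread), the .htaccess might look like this:

```apache
# Hypothetical mapping: /products/widget/fuzzy/blue.html
#   is served internally by /cart.php?product=widget&style=fuzzy&color=blue
Options +FollowSymLinks
RewriteEngine on
RewriteRule ^products/([^/]+)/([^/]+)/([^/]+)\.html$ /cart.php?product=$1&style=$2&color=$3 [L]
```

The rewritten path /cart.php does not match the pattern, so the rule does not loop. The script's own output would still need to link to the static form for step 3 to hold.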

External redirection (e.g. 301 or 302 redirect) does not play an active part in this type of set-up.

One common point of confusion pertains to *when* a rewrite takes place. Mod_rewrite and ISAPI Rewrite perform their functions after an HTTP request is received by the server, but before any content is delivered. So, they can modify the requested URL, but not a URL on a page being output to the browser or spider.

A further confusing point is that mod_rewrite can perform either of two broad types of function. When an HTTP request is received, mod_rewrite can do an external redirect -- which returns a message to the browser giving a new URL, which the browser must then use to generate a new HTTP request -- or it can perform a simple resource-name rewrite (URL-path substitution) internal to the server, and within the current HTTP request.
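
To make the two types concrete, here is a minimal side-by-side sketch. The /old-fred/ path is a made-up name for illustration; the /fred/ rule reuses the paths discussed earlier in this thread:

```apache
RewriteEngine on

# External redirect: the server answers with a 301 status and a Location
# header, and the browser then makes a second request for the new URL.
RewriteRule ^old-fred/$ http://example.com/fred/ [R=301,L]

# Internal rewrite: the same request is served from a different path.
# The address shown in the browser stays /fred/.
RewriteRule ^fred/$ /cgi-bin/subfolder/myscript.cgi [L]
```

The visible difference is the R flag (or a full http:// substitution pointing at another host): with it, the client sees the new URL; without it, the substitution happens only inside the server.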

Jim

doctormelodious

4:31 am on Jul 15, 2004 (gmt 0)

10+ Year Member



Hi JD,

Thanks for the search engine explanation. Re: your solution:


Options +FollowSymLinks
RewriteEngine on
RewriteCond %{QUERY_STRING} (.+)
RewriteRule ^fred/$ /cgi-bin/subfolder/myscript.cgi%1 [L]

I had to make just one little tweak:


RewriteRule ^fred/$ /cgi-bin/subfolder/myscript.cgi\?%1 [L]

...and it worked like a charm.

Thanks again for all your help!
Perry

P.S. FWIW regarding the mysterious behavior of RedirectMatch in my first post, the mysterious part was that there WAS no back-reference to define. The RedirectMatch statement in question did not define it, nor did any earlier statement in the .htaccess file. Hence my puzzlement over why $1 had the query string in it.

jdMorgan

5:28 am on Jul 15, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, it was strange, and your question about, "Is this a behaviour I can count on?" was spot-on. I don't think I'd want to count on it, myself... There might be, you know, an unchecked buffer in there or something... ;)

Sorry I forgot the "?" -- My eyes bug out after staring at code all day.

Jim

doctormelodious

10:01 pm on Jul 17, 2004 (gmt 0)

10+ Year Member



Hi again Jim,

Yet another question about spiders. Even though I now have "static" page URLs for these pages (thanks to mod_rewrite), requests for them still ultimately point to the CGI script in the cgi-bin directory. Isn't that directory off-limits to spiders, by definition? Or does the server allow the spiders access (to index the content) because the actual request points to a directory that is not off-limits?

Thanks!
Perry

jdMorgan

10:36 pm on Jul 17, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Perry,

> Isn't that directory off-limits to spiders, by definition?

By definition, no. By server configuration, possibly.

> Or does the server allow the spiders access (to index the content) because the actual request points to a directory that is not off-limits?

The server has no knowledge of spiders, so it does not behave any differently for them than it would for a browser. Only if you or your host add code (such as mod_rewrite) to check the user-agent name would the server "know" anything about spiders. There's no "magic" involved in this, just access-control code, passwords (if applicable) and file permissions at the operating system level.

You could write access control code that would forbid direct requests for cgi-bin, but allow access via a rewritten path. Some server variables are updated by a rewrite, and others are not, so you can detect the originally-requested URL even if a rewrite takes place.
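
One way this might look, as a sketch only: it assumes the .htaccess file sits in the document root and that cgi-bin is a real directory beneath it (with a ScriptAlias'd cgi-bin the rules would need to live elsewhere). It relies on THE_REQUEST, which holds the request line exactly as the client sent it and is not altered by an internal rewrite:

```apache
RewriteEngine on

# THE_REQUEST looks like "GET /fred/?foo=bar HTTP/1.1" and keeps the
# originally requested URL even after an internal rewrite.
# Forbid requests in which the client itself asked for the script path.
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /cgi-bin/subfolder/myscript\.cgi
RewriteRule ^cgi-bin/subfolder/myscript\.cgi$ - [F]

# Requests for the short URL are still rewritten internally as before.
RewriteRule ^fred/$ /cgi-bin/subfolder/myscript.cgi [L]
```

A request for /fred/ is rewritten to the script, but its THE_REQUEST still reads /fred/, so the forbid rule does not fire on it.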

Jim

doctormelodious

12:38 am on Jul 18, 2004 (gmt 0)

10+ Year Member




> The server has no knowledge of spiders, so it does not behave any differently for them than it would for a browser.

Ah, there's the part I was forgetting.

Thanks!
Perry

Wizcrafts

5:34 am on Jul 24, 2004 (gmt 0)

10+ Year Member



> RewriteRule ^fred/$ /cgi-bin/subfolder/myscript.cgi\?%1 [L]

Guys, this code is exactly what I have been looking for, to allow me to write short-form ad links in my html pages and have them expanded and redirected, and to pass the query string to a cgi script that displays the desired webpage for that product. I am going to use my own names and variables, of course.

This Forum is like a box of chocolates; ya never know what you're gonna get!

Wiz

jdMorgan

1:08 pm on Jul 24, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Ooops...

Before copying that, be aware that the question mark in the substitution url-path does not need to be escaped, and that the %1 variable in the RewriteRule back-references the parenthesized sub-pattern in the preceding RewriteCond (which is not shown in the post immediately above this one).

Just to clarify, here it is again with everything mentioned:


Options +FollowSymLinks
RewriteEngine on
RewriteCond %{QUERY_STRING} (.+)
RewriteRule ^fred/$ /cgi-bin/subfolder/myscript.cgi?%1 [L]
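
A possible variant, offered as a sketch and not as part of the solution above: when the substitution contains no "?" at all, mod_rewrite passes the original query string through unchanged, so the RewriteCond is only needed if requests without a query string should be left alone.

```apache
Options +FollowSymLinks
RewriteEngine on
# No RewriteCond and no "?" in the substitution: the original query string
# is passed through unchanged, so /fred/?foo=bar reaches
# /cgi-bin/subfolder/myscript.cgi?foo=bar.
# Unlike the version with the RewriteCond, this also rewrites /fred/ when
# no query string is present.
RewriteRule ^fred/$ /cgi-bin/subfolder/myscript.cgi [L]
```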

Jim

Wizcrafts

4:05 pm on Jul 24, 2004 (gmt 0)

10+ Year Member



> Before copying that, be aware that the question mark in the substitution url-path does not need to be escaped, and that the %1 variable in the RewriteRule back-references the parenthesized sub-pattern in the preceding RewriteCond (which is not shown in the post immediately above this one).

Jim,
When I first copied the second line I left in the \? and it worked anyway. I just removed the backslash and it still works. I did copy the preceding RewriteCond also.

I probably should post this solution to another WebmasterWorld forum that deals with advertising links and ad-blocker workarounds. People have been searching for and discussing various solutions to NIS's default behaviour of blocking all affiliate ads. I was shown a cgi script that accomplishes the task, but the html call to it needed to be shortened (I use 2+ dozen text affiliate ad links). I have been asking and searching for a means to create a redirect/expand URI rule, and thanks to you I found it. I tested it last night on two of my numbered links and it worked like a charm.

Wiz

Wizcrafts

6:05 am on Jul 27, 2004 (gmt 0)

10+ Year Member



As a followup to my previous reference as to how this thread helped me with a similar redirect problem, my full CGI + Mod_Rewrite solution is posted here: [webmasterworld.com...]

Wiz