Building SEO friendly URLs using PERL
building a cms with perl running in the background
I have a site which is growing day by day. All I do is generate HTML files using Perl scripts connected to a MySQL database. So if a small change is required, I have to change the Perl code and generate the HTML pages again. I know this is very foolish. So now I want to make my site dynamic, without changing the URLs.
e.g. I have a URL like http://example.com/topic1/subtopic3.html
Can I have something in the background like
http://example.com/cgi-bin/display.pl?topic=topic1&subtopic=subtopic3.html which is invisible to the users and the search engines?
All they will know is http://example.com/topic1/subtopic3.html
So basically, subtopic3.html will not physically exist on my web server. I need to get rid of my static HTML files, and at the same time I don't want to change my old URLs.
If this is possible, please help.
Thanks in advance
welcome to WebmasterWorld [webmasterworld.com], webtechi2010!
yes you can do exactly that and it should be relatively easy to implement if you are running on an Apache server and you can enable mod_rewrite [httpd.apache.org] which allows you to do "internal rewrites".
using regular expressions you can apply a pattern to the REQUEST_URI and extract the topic and subtopic, and rewrite to the internal url which will contain the topic and subtopic in the query string.
you will probably want to exclude media, images, script, style sheets and other such resources from being rewritten.
if you are in fact on Apache, take a stab at coding those rewrite rules and post any specific questions in the Apache Web Server forum [webmasterworld.com].
there should be many examples of what you are attempting posted in threads in that forum.
another way to do this (using mod_rewrite) is to internally rewrite any request (that doesn't look like an image, script, etc) to the cgi script like:
(note you don't want to include the domain in the rewrite rule or it will externally redirect instead of internally rewriting.)
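a minimal sketch of such a rule, assuming it lives in an .htaccess file in the document root, that the script is /cgi-bin/display.pl, and that it reads the requested path from a query parameter named request. the list of excluded extensions is only illustrative and should be adjusted to whatever static resources the site actually serves:

```apache
RewriteEngine On

# Leave images, scripts, style sheets and similar resources alone
# (illustrative list of extensions).
RewriteCond %{REQUEST_URI} !\.(?:gif|jpe?g|png|ico|css|js)$ [NC]
# Never rewrite the script itself, or the rule could loop.
RewriteCond %{REQUEST_URI} !^/cgi-bin/
# Internal rewrite: the substitution has no scheme or hostname,
# so the browser's address bar does not change.
RewriteRule ^(.+)$ /cgi-bin/display.pl?request=$1 [L]
```

with this pattern a request for the bare home page (empty path in .htaccess context) is not rewritten, so it would need its own rule or a real index file.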
and then the perl script can figure out what to do with the path information and respond with either:
- 200 OK and the requested document
- 301 Moved Permanently and a Location header if the url requested is non-canonical
- 404 Not Found if the request is junk
you should also verify that any direct request for http://example.com/cgi-bin/display.pl?request=topic1/subtopic3.html gets either a 301 redirect to the canonical url or a 404 Not Found;
you don't want those script urls to be crawled or indexed.
and it should go without saying that your internal navigation must use canonical urls, i.e. link to http://example.com/topic1/subtopic3.html and never to the cgi-bin url.
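a rough sketch of how the perl script might choose between those responses. everything specific here is an assumption for illustration: the %PAGES hash stands in for the mysql lookup, the "canonical" form is taken to be lower case with no doubled slashes, and the path is assumed to arrive in a request parameter. it also relies on Apache passing the originally requested url to CGI scripts in REQUEST_URI, and it skips url-decoding of the parameter:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the MySQL lookup: canonical path => page body.
my %PAGES = (
    'topic1/subtopic3.html' => '<h1>Subtopic 3</h1>',
);

# Decide how to answer a request.
#   $path         value of the "request" parameter
#   $original_uri what the client actually asked for (REQUEST_URI)
# Returns (status, extra header line or undef, body).
sub decide {
    my ($path, $original_uri) = @_;
    $path         = '' unless defined $path;
    $original_uri = '' unless defined $original_uri;

    # The script url itself must never be served as a page.
    return ('404 Not Found', undef, 'Not found')
        if $original_uri =~ m{^/cgi-bin/};

    # Assumed canonical form: lower case, single slashes.
    (my $canonical = lc $path) =~ s{/+}{/}g;

    # Junk request: nothing stored under that path.
    return ('404 Not Found', undef, 'Not found')
        unless exists $PAGES{$canonical};

    # Known page requested under a non-canonical spelling.
    return ('301 Moved Permanently',
            "Location: http://example.com/$canonical", '')
        if $canonical ne $path;

    return ('200 OK', undef, $PAGES{$canonical});
}

# CGI entry point; skipped when the file is loaded by another script.
unless (caller) {
    my ($path) = ($ENV{QUERY_STRING} || '') =~ /(?:^|&)request=([^&]*)/;
    my ($status, $header, $body) = decide($path, $ENV{REQUEST_URI});
    print "Status: $status\r\n";
    print "$header\r\n" if defined $header;
    print "Content-Type: text/html; charset=utf-8\r\n\r\n";
    print $body;
}
```

keeping the decision in one sub that takes plain arguments makes it easy to try out from the command line before wiring it to the rewrite rule.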
mod_rewrite's two main functions, alluded to above, are external URL-to-URL redirects and internal URL-to-filepath rewrites. You will need one of each, as phranque points out.
Internally rewrite any request for a URL-path ending in ".html" to your script's filepath, passing the requested page's URL-path to the script as an argument.
Then redirect any requests for the dynamic script path which come directly from the HTTP client (i.e. those which have NOT been rewritten by the rule just described above) back to the .html URL to which it corresponds, in order to prevent the same content being accessible using two different URLs.
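A sketch of that redirect, under the assumptions that the rules live in an .htaccess file in the document root, that the cgi-bin directory sits inside the document root (common on shared hosting, but not universal), and that the script takes the page path in a parameter named "request". It should be placed ahead of any internal rewrite rule:

```apache
RewriteEngine On

# THE_REQUEST holds the request line exactly as the client sent it and is
# not altered by internal rewrites, so this condition only matches when the
# client itself asked for the script URL.
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /cgi-bin/display\.pl\?request=([^&\ ]+\.html)
# External 301 back to the static-looking URL; the trailing "?" drops the
# original query string from the redirect target.
RewriteRule ^cgi-bin/display\.pl$ http://example.com/%1? [R=301,L]
```

Direct script requests that do not carry a ".html" path fall through this rule, so the script itself still needs to answer them with a 404.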
The "any requests for the dynamic script path which come directly from the HTTP client" clause above is really the only tricky part, and it's not all that tricky as long as you're aware that it's needed.
See this thread [webmasterworld.com] in our Apache Forum Library for a good start. The title addresses one application, but most of the thread content is precisely applicable to your requirements.
Wonderful! Thanks phranque and Jim for the informative posts. Seems like it's a bit tricky, but possible. I read the post at [webmasterworld.com...] and I am sure the 6079.htm file and the forum92 folder do not physically exist on webmasterworld.com's server.
Now, one thing that strikes me is that when everything is dynamic, the requested pages will generate a huge number of database requests. So how do I manage that? My site is hosted on a shared server, and every hosting account there has limits on database usage, such as the maximum number of connections.
If I have a max number of connections say 100, and 110 people are visiting my site at the same time, so will 10 people not be able to view my pages at that time ?
- if some or all of your pages or content on a page is relatively static, there are various types and levels of caching you can use to avoid querying the database for content that hasn't changed since the last query.
- there are various methods and technologies you can use to maintain and share/reuse persistent database connections.
each of these areas is a deep subject and whether you implement such a solution depends on your requirements and resources.
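as one small illustration of the caching idea, here is a sketch of a file-based page cache in perl. the cache directory, the ten-minute lifetime and the cached_page name are all assumptions, and the builder code ref stands in for whatever queries the database and renders the page:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use File::Spec;

# Assumed settings: cache location and lifetime in seconds.
my $CACHE_DIR = File::Spec->catdir(File::Spec->tmpdir, 'page_cache');
my $TTL       = 600;

# Return the page for $key, calling $build (a code ref that queries the
# database and returns the page as encoded bytes) only when no fresh
# copy exists on disk.
sub cached_page {
    my ($key, $build) = @_;
    mkdir $CACHE_DIR unless -d $CACHE_DIR;
    my $file = File::Spec->catfile($CACHE_DIR, md5_hex($key) . '.html');

    # Fresh copy on disk: serve it without touching the database.
    if (-f $file && time() - (stat($file))[9] < $TTL) {
        open my $in, '<', $file or die "cannot read $file: $!";
        local $/;                      # slurp the whole file
        my $html = <$in>;
        close $in;
        return $html;
    }

    # Missing or stale: rebuild, write to a temp file, then rename so
    # other requests never see a half-written page.
    my $html = $build->();
    my $tmp  = "$file.$$";
    open my $out, '>', $tmp or die "cannot write $tmp: $!";
    print {$out} $html;
    close $out or die "cannot close $tmp: $!";
    rename $tmp, $file or die "cannot rename $tmp: $!";
    return $html;
}
```

a time-based lifetime is the simplest policy; deleting the cached file whenever the underlying record is edited is another option if stale pages are not acceptable.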
> If I have a max number of connections say 100, and 110 people are visiting my site at the same time, so will 10 people not be able to view my pages at that time ?
Each request that needs the database from the extra ten clients at any given moment will have to wait (or, if the script does not retry, fail with a connection error) while the other 100 clients hold their database connections open. For this reason, it is recommended to close the database connection as soon as possible, and to write the database-access code so that it does not "fool around" while the db connection is open. In other words, pre-compute all variables needed for the db access, getting everything done ahead of time; then open the db, read the record(s) into a memory array, and close the db immediately. The only time you must leave the db connection open is when you need to do an atomic read-modify-write operation, something that's almost always handled automatically when you open a record for writing.
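A sketch of that open-read-close pattern in Perl. The fetch_rows name is made up for illustration, as are the table and column names in the usage comment. The handle is expected to behave like a DBI database handle created with RaiseError enabled, and it is supplied through a code ref so that nothing is connected until the last moment:

```perl
use strict;
use warnings;

# Run one query and hand back all rows as an array of hashes, keeping
# the connection open only for the duration of the query itself.
#   $connect  code ref returning a DBI-style handle (assumed RaiseError => 1)
#   $sql      statement with ? placeholders, prepared by the caller
#   @bind     values for the placeholders, computed before connecting
sub fetch_rows {
    my ($connect, $sql, @bind) = @_;

    my $dbh  = $connect->();           # open as late as possible
    my $rows = eval {
        $dbh->selectall_arrayref($sql, { Slice => {} }, @bind);
    };
    my $err = $@;
    $dbh->disconnect;                  # close even if the query failed
    die $err if $err;

    return $rows;                      # work with the in-memory copy
}

# Typical use with DBI (hypothetical schema and credentials):
#   use DBI;
#   my $rows = fetch_rows(
#       sub { DBI->connect($dsn, $user, $pass, { RaiseError => 1 }) },
#       'SELECT body FROM pages WHERE topic = ? AND subtopic = ?',
#       'topic1', 'subtopic3',
#   );
```

All template rendering and output then happens after the connection has been released, which is the point of the advice above.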
Because of the described behavior, I wouldn't worry so much about 110 users with a limit of 100, but I would worry about 500 users... Luckily, as long as your scripts are well-coded, the db accesses will be fast, while humans are comparatively very slow.
If you use a web framework like Catalyst, the means to create readable URLs may already be included.