Forum Moderators: phranque
One way to view this is that an internal rewrite performed by mod_rewrite is no different from the basic function of the Web server itself. For a simple example, a typical Web server takes a request for the URL http://www.example.com/widgets.php and translates it into a request for a filepath such as /var/users/example/public/html/widgets.php
This action is essentially equivalent to a rewrite, and of course, search engines are unaware of the "/var/users/example/public/html" path and will neither know nor care if you change that path to something else, say "/var/users/example/public/html/newdir".
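As a concrete sketch (the directory name /newdir and the .htaccess placement are just examples, not anything prescribed above), an internal rewrite that serves the public URL from a relocated directory might look like this:

```apache
RewriteEngine On
# Internally serve /widgets.php from the relocated directory.
# No redirect is issued, so the browser address bar (and search
# engines) still see only /widgets.php
RewriteRule ^widgets\.php$ /newdir/widgets.php [L]
```

Because the [L] rule is an internal rewrite rather than an external redirect, the client never learns that /newdir exists.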
The main "exposure" is that often, no steps are taken to prevent search engines from indexing the "real" URL-path (in this example, "http://www.example.com/newdir/widgets.php"). Adding the rewrite then creates a duplicate-content issue if that "real" URL is accidentally exposed during development, for example by an incorrect link or by use of the Google Toolbar.
This can easily be prevented or "fixed" by an additional snippet of code that redirects direct client requests for /newdir back to the root of the domain, so that the "real" URL-path is not directly accessible.
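A minimal sketch of such a snippet, assuming the hypothetical /newdir example and an .htaccess in the site root: the %{THE_REQUEST} variable holds the client's original request line, so the condition matches only when the client itself asked for /newdir, not when mod_rewrite reached it via an internal rewrite.

```apache
RewriteEngine On
# Redirect direct client requests for /newdir back to the clean
# URL-path, so the "real" path is never directly accessible.
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /newdir/
RewriteRule ^newdir/(.*)$ /$1 [R=301,L]
```

Testing against THE_REQUEST (rather than the rewritten URL-path) is what prevents this redirect from looping with the internal rewrite that maps the clean URL onto /newdir.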
The key to understanding these problems is to recognize and consider URL-spaces and filespaces as separate spaces, and to recognize that the basic function of a server is to map a "standard" URL-space to an arbitrary filespace. This allows browsers to request "resources" and "objects" using the standard URL "addressing system" of HTTP without regard to the hardware, operating system, or filesystem conventions of the server that hosts those resources.
Jim
I just wanted to ask about where you mentioned:
> This can easily be prevented or "fixed" by an additional snippet of code that redirects direct client requests for /newdir back to the root of the domain, so that the "real" URL-path is not directly accessible.
Where could I get this code please?
And also, could robots.txt be used instead to do the same job?
Thank you,
Try searching WebmasterWorld (link at top of page) for "redirect direct client requests" for several examples.
> And also, could robots.txt be used instead to do the same job?
Perhaps, depending on the nature of the URLs. However, if the "real" URL has ever been indexed by search engines, a 301 redirect is a better method, since it 'recovers' the traffic and the PageRank/Link-popularity of the URL.
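For comparison, a robots.txt rule can only ask compliant robots not to crawl the path; it does not remove an already-indexed URL or pass along its link value. A hypothetical sketch, again assuming the /newdir example:

```
User-agent: *
Disallow: /newdir/
```

A 301 redirect of /newdir requests back to the clean URL-path, by contrast, tells search engines the canonical location and consolidates traffic and link-popularity there, which is why it is the better method once the "real" URL has been indexed.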
Jim