Forum Moderators: phranque
Startup very kindly pointed out that Drupal has quite a few issues when dealing with URL duplication. Many of his observations are mentioned below.
Duplicate URLs are particularly evident when installing certain modules. The URL should be:
/widgets/
However, the page can also be reached at any of the following URLs:
/index.php
/widgets/index.php
/widgets/index.htm?=cid1234
/widgets/index.htm
/widgets&cid=1234
Also...
[site.com ]------------ real url
[site.com ] --------- dup content
[site.com ] -------- more dup content
I have excluded /node/ in robots.txt, but I was thinking that .htaccess would be a better solution for the above issues.
Here are parts of the .htaccess source. I have omitted the non-relevant parts.
# Protect files and directories from prying eyes.
<FilesMatch "\.(engine¦inc¦info¦install¦module¦profile¦po¦sh¦.*sql¦theme¦tpl(\.php)?¦xtmpl)$¦^(code-style\.pl¦Entries.*¦Repository¦Root¦Tag¦Template)$">
Order allow,deny
</FilesMatch>
#RewriteBase /
# Rewrite old-style URLs of the form 'node.php?id=x'.
#RewriteCond %{REQUEST_FILENAME} !-f
#RewriteCond %{REQUEST_FILENAME} !-d
#RewriteCond %{QUERY_STRING} ^id=([^&]+)$
#RewriteRule node.php index.php?q=node/view/%1 [L]
# Rewrite old-style URLs of the form 'module.php?mod=x'.
#RewriteCond %{REQUEST_FILENAME} !-f
#RewriteCond %{REQUEST_FILENAME} !-d
#RewriteCond %{QUERY_STRING} ^mod=([^&]+)$
#RewriteRule module.php index.php?q=%1 [L]
# Rewrite current-style URLs of the form 'index.php?q=x'.
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ index.php?q=$1 [L,QSA]
I would really appreciate some pointers as to how to avoid the potential spider storm.
thanks
dutchie
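One way to collapse the index.php and index.htm variants listed above is an external 301 redirect that fires only when the client itself asked for the index file. This is an untested sketch: the hostname www.example.com is a placeholder, and it assumes the block sits above Drupal's own index.php rewrite. Checking THE_REQUEST (the raw request line) rather than the rewritten URL keeps the rule from looping once Drupal's internal rewrite to index.php has run.

```apache
RewriteEngine On
# Only act on direct GET/HEAD requests for an index file, so form POSTs are left alone.
RewriteCond %{THE_REQUEST} ^(GET|HEAD)\s/([^?\s]*/)?index\.(php|html?)[?\s]
# $1 is the directory part (empty for the site root); any query string is carried over.
RewriteRule ^(.*/)?index\.(php|html?)$ http://www.example.com/$1 [R=301,L]
```

With this in place, a request for /widgets/index.php should be answered with a 301 to /widgets/, and /index.php with a 301 to the site root.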
GlobalRedirect is a simple module which…
1. Checks the current URL for an alias and does a 301 redirect to it if it is not being used.
2. Checks the current URL for a trailing slash, removes it if present and repeats check 1 with the new request.
3. Checks if the current URL is the same as the site_frontpage and redirects to the frontpage if there is a match.
4. Checks if the Clean URLs feature is enabled and then checks the current URL is being accessed using the clean method rather than the 'unclean' method. (Currently only in DEV, will be in 1.3 soon)
Also, you may want to fix up the default robots.txt [drupalzilla.com]. That author suggests getting rid of the trailing slashes on some paths in the default robots.txt file. I would actually duplicate them (i.e. both with and without slashes), but that's a minor detail.
Rewrite your URLs so that URLs with/without (your preference) 'www' get redirected.
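A minimal sketch of that host redirect, assuming www.example.com is the preferred form (swap the two hostnames for the opposite preference). The hostname is a placeholder, and the block would need to sit above the Drupal index.php rule so the redirect happens before the internal rewrite.

```apache
RewriteEngine On
# Redirect the bare domain to the www hostname, keeping the path and query string.
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```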
There is some discussion of Drupal duplicate content issues [groups.drupal.org] in the Drupal SEO Group [groups.drupal.org].
The one caveat is that some of the people there know Drupal better than SEO (though many are solid with both). I would tend to favor WebmasterWorld as the place to figure out *what* to do, while the Drupal SEO Group is a great place to figure out *how* to do it in Drupal and to identify Drupal-specific gotchas.
No need for the duplication.
Disallow: /folder
will disallow all of these paths:
/folder/
/folder/x
/folder/xyz
/folder
/folderx
/folderxyz
It does not disallow a specific URL path.
It disallows ANY and ALL URL paths that begin with exactly the given /pattern.
.
Disallow: /folder/
will disallow all of these URL paths:
/folder/
/folder/x
/folder/xyz
and will NOT disallow any of these paths:
/folder
/folderx
/folderxyz
because all of those in the latter group fail to match after the "r".
Disallow: / disallows all URL paths that begin with a "/", and that is every URL on the site.
I have previously read the posts you mention and they helped answer some of my questions. The GlobalRedirect module seems like a decent mod except for the fact that it strips out all trailing slashes.
As we are converting a static HTML site, we want to keep the trailing slashes so that the existing URLs do not have to be re-indexed.
If I were to use the GlobalRedirect module, how would you ensure that the trailing slash is not removed?
Just wanted to report back to say that Drupal's GlobalRedirect module should allow the option to choose between the two.
In the meantime I have a related issue.
Now that our URLs end in a trailing slash, I want to ensure that all URLs are forced to end in a trailing slash. In some cases I have noticed that, for a few internal pages, /page/ and /page both have PR. I need to redirect
/page to /page/
I have looked through many previous posts on WebmasterWorld, many of which Jim wrote, and found that this worked for me:
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_URI} !(\.[a-zA-Z0-9]{1,5}¦/)$
RewriteRule (.*)$ [widgets.com...] [R=301,L]
The problem now is that certain items in the Drupal admin don't work. For example, uploading images and activating/deactivating modules no longer work.
I came across the following, which is supposed to apply the above 301 rule on the front end but exclude certain users or pages on the back end. If I can stop this rule from running in the admin section, then I hope I can work inside admin without the redirect being a problem.
The only problem is that when I include the following, it takes down the site.
# Except for the admin, user, and node areas
RewriteCond %{REQUEST_FILENAME} !^\/(admin¦user¦node)
RewriteRule ^(.*)$ [%{HTTP_HOST}$1...] [R=301,L]
The directories I would like to exclude from this trailing slash rule are: /admin/, /user and /node/
any ideas?
Thanks in advance
[edited by: Pass_the_Dutchie at 3:44 pm (utc) on Sep. 8, 2008]
Also, remember to replace the broken pipe "¦" characters with solid pipes before use; posting on this forum modifies the pipe characters.
Jim
so now I have:
# Force Trailing Slash
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_URI} !(\.[a-zA-Z0-9]{1,5}¦/)$
RewriteCond %{REQUEST_FILENAME} !^\/(admin¦user¦node¦edit)
RewriteRule (.*)$ http://www.example.com/$1/ [R=301,L]
The trailing slash still works but I can't get the exclusion to work.
www.example.com/page/13/edit will still redirect to www.example.com/page/13/edit/
I must have done something wrong.
[edited by: jdMorgan at 3:44 pm (utc) on Sep. 9, 2008]
[edit reason] example.com [/edit]
RewriteCond %{REQUEST_URI} !^/(admin¦user¦node¦edit)
Change the broken pipe characters to solid pipes before use, as always.
Jim
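Put together with solid pipes, the rule set would look something like the sketch below. In .htaccess context REQUEST_FILENAME holds the full filesystem path, so a pattern anchored at ^/admin never matches it, while REQUEST_URI holds the URL path, which is what the exclusion needs to test. The last condition, for paths ending in /edit, is an assumption added here, because a prefix match on ^/(...) cannot catch a URL such as /page/13/edit. The hostname is the thread's example.com placeholder.

```apache
# Force Trailing Slash
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
# Skip URLs that already end in a slash or look like they have a file extension.
RewriteCond %{REQUEST_URI} !(\.[a-zA-Z0-9]{1,5}|/)$
# Skip the admin, user and node areas (tested against the URL path).
RewriteCond %{REQUEST_URI} !^/(admin|user|node)
# Assumption: also skip any path whose last segment is "edit".
RewriteCond %{REQUEST_URI} !/edit$
RewriteRule ^(.*)$ http://www.example.com/$1/ [R=301,L]
```

As with the other redirects, this would go above the Drupal rule that rewrites to index.php?q=$1.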
I need to force the / and I need the upload image function. I am stumped and my developer has run out of ideas.
Any advice would be very much appreciated.
Thanks again,
dutchie
We can then discuss adding an additional exclusion to the rule so that uploads are not interfered with.
Jim
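One exclusion that is commonly used for this, offered here as an assumption rather than as the fix Jim had in mind, is to skip the redirect for POST requests. Browsers typically follow a 301 on a POST with a GET, so the submitted form data (including any uploaded file) is lost along the way.

```apache
# Place directly above the RewriteRule of the trailing-slash block,
# alongside its other conditions: leave form submissions and uploads untouched.
RewriteCond %{REQUEST_METHOD} !^POST$
```

This only helps if the failing admin actions are in fact POST submissions, which would need to be confirmed on the site in question.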