Drupal Duplication

.htaccess workaround

         

Pass the Dutchie

8:57 am on Jul 6, 2008 (gmt 0)

10+ Year Member



Hi,

Startup very kindly pointed out that Drupal has quite a few issues with URL duplication. I am repeating many of his observations below.

Duplicate URLs are particularly evident when certain modules are installed. The URL should be:

/widgets/

However, the page can also be reached at any of the following URLs:

/index.php
/widgets/index.php
/widgets/index.htm?=cid1234
/widgets/index.htm
/widgets&cid=1234

Also...

[site.com ]------------ real url
[site.com ] --------- dup content
[site.com ] -------- more dup content

I have excluded /node/ in robots.txt, but I think .htaccess would be a better solution for the above issues.

Here are parts of the .htaccess source. I have omitted the non-relevant sections.

# Protect files and directories from prying eyes.
<FilesMatch "\.(engine¦inc¦info¦install¦module¦profile¦po¦sh¦.*sql¦theme¦tpl(\.php)?¦xtmpl)$¦^(code-style\.pl¦Entries.*¦Repository¦Root¦Tag¦Template)$">
Order allow,deny
</FilesMatch>

#RewriteBase /

# Rewrite old-style URLs of the form 'node.php?id=x'.
#RewriteCond %{REQUEST_FILENAME} !-f
#RewriteCond %{REQUEST_FILENAME} !-d
#RewriteCond %{QUERY_STRING} ^id=([^&]+)$
#RewriteRule node.php index.php?q=node/view/%1 [L]

# Rewrite old-style URLs of the form 'module.php?mod=x'.
#RewriteCond %{REQUEST_FILENAME} !-f
#RewriteCond %{REQUEST_FILENAME} !-d
#RewriteCond %{QUERY_STRING} ^mod=([^&]+)$
#RewriteRule module.php index.php?q=%1 [L]

# Rewrite current-style URLs of the form 'index.php?q=x'.
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ index.php?q=$1 [L,QSA]

I would really appreciate some pointers as to how to avoid the potential spider storm.

thanks

dutchie

ergophobe

4:14 pm on Jul 8, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You can solve a lot of this with the GlobalRedirect module [drupal.org] which does...


GlobalRedirect is a simple module which…

1. Checks the current URL for an alias and does a 301 redirect to it if it is not being used.
2. Checks the current URL for a trailing slash, removes it if present and repeats check 1 with the new request.
3. Checks if the current URL is the same as the site_frontpage and redirects to the frontpage if there is a match.
4. Checks if the Clean URLs feature is enabled and then checks the current URL is being accessed using the clean method rather than the 'unclean' method. (Currently only in DEV, will be in 1.3 soon)

Also, you may want to fix up the default robots.txt [drupalzilla.com]. That author suggests getting rid of the trailing slashes on some paths in the default robots.txt file. I would actually duplicate them (i.e. both with and without slashes), but that's a minor detail.

Also redirect requests so that only one hostname form is used, with or without 'www' (your preference).

There is some discussion of drupal duplicate content issues [groups.drupal.org] in the Drupal SEO Group [groups.drupal.org].

The one caveat is that some of the people there know Drupal better than SEO (though many are solid with both). I would tend to favor WebmasterWorld as the place to figure out *what* to do, while the Drupal SEO Group is a great place to figure out *how* to do it in Drupal and to identify Drupal-specific gotchas.

g1smd

4:53 pm on Jul 8, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



*** That author suggests getting rid of the trailing slashes on some paths in the default robots.txt file. I would actually duplicate them (i.e. both with and without slashes), but that's a minor detail. ***

No need for the duplication.

Disallow: /folder

will disallow all of these paths:

/folder/
/folder/x
/folder/xyz
/folder
/folderx
/folderxyz

It does not disallow a specific URL path.

It disallows ANY and ALL URL paths that begin with /pattern and exactly that.

.

Disallow: /folder/

will disallow all of these URL paths:

/folder/
/folder/x
/folder/xyz

and will NOT disallow any of these paths:

/folder
/folderx
/folderxyz

because all of those in the latter group fail to match after the "r".

Disallow: / disallows all URL paths that begin with a "/", and that is every URL on the site.
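The prefix matching described above can be sketched in a few lines of Python. This is only an illustration of plain prefix matching; the function name `is_disallowed` is made up for the example, and it ignores Allow rules and the wildcard extensions that some crawlers support.

```python
def is_disallowed(path: str, rule: str) -> bool:
    """Return True if the URL path begins with the Disallow value."""
    # An empty Disallow value blocks nothing.
    return bool(rule) and path.startswith(rule)


paths = ["/folder/", "/folder/x", "/folder/xyz", "/folder", "/folderx", "/folderxyz"]

for rule in ("/folder", "/folder/", "/"):
    blocked = [p for p in paths if is_disallowed(p, rule)]
    print(f"Disallow: {rule:<9} blocks {blocked}")
```

Running it shows that "/folder" blocks all six paths, "/folder/" blocks only the first three, and "/" blocks everything.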

ergophobe

5:02 pm on Jul 8, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



What was I thinking!

Actually, I wouldn't, and when I look at my Drupal site's robots.txt, I didn't.

g1smd thanks for catching that brilliant advice!

Pass the Dutchie

7:26 am on Jul 9, 2008 (gmt 0)

10+ Year Member



Thanks ergophobe and once again g1smd for the sound advice.

I have previously read the posts you mention and they helped answer some of my questions. The GlobalRedirect module seems like a decent mod except for the fact that it strips out all trailing slashes.

As we are converting a static html site we don't want to remove the trailing slash for the sake of re-indexing.

If I were to use GlobalRedirect module how would you ensure that the trailing slash is not removed?

ergophobe

4:12 pm on Jul 9, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hmmm... I don't know about that. Sorry.

But how many URLs actually have a trailing slash, and are they currently important (high-traffic, high-revenue, high in SERPs) pages? I guess it depends on your site structure and all.

If it's a small number of pages, you could just 301 them.

Pass the Dutchie

1:05 pm on Jul 10, 2008 (gmt 0)

10+ Year Member



Yes, all URLs have trailing slashes.

ergophobe

7:08 pm on Jul 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Right. So then that is a problem.

I don't know the innards of the GlobalRedirect module, but I bet it wouldn't be too hard to flip it to default to a trailing slash. That's just guessing, though. Sorry I can't help more.

Pass the Dutchie

3:40 pm on Sep 8, 2008 (gmt 0)

10+ Year Member



Hi,

Just wanted to report back to say that Drupal's GlobalRedirect module should allow the option to choose between the two.

In the meantime I have a related issue.

Now that our URLs end with a trailing slash, I want to force every URL to end in one. I have noticed that for a few internal pages both /page/ and /page have PR. I need to redirect /page to /page/.

I have looked through many previous posts on WebmasterWorld, many of which Jim wrote, and found that this worked for me:

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_URI} !(\.[a-zA-Z0-9]{1,5}¦/)$
RewriteRule (.*)$ [widgets.com...] [R=301,L]
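To see which URL-paths the third condition lets through, here is a rough Python sketch of just that pattern, written with solid pipes. It does not model the two file and directory checks, and the sample paths and the name `needs_trailing_slash` are made up for illustration.

```python
import re

# Same pattern as the third RewriteCond, with a solid pipe.
ENDS_WITH_EXTENSION_OR_SLASH = re.compile(r"(\.[a-zA-Z0-9]{1,5}|/)$")


def needs_trailing_slash(url_path: str) -> bool:
    """True when the path has no file extension and no trailing slash."""
    return ENDS_WITH_EXTENSION_OR_SLASH.search(url_path) is None


samples = ["/page", "/page/", "/widgets/index.php", "/images/logo.jpeg", "/admin/build/modules"]
for path in samples:
    print(path, needs_trailing_slash(path))
```

Note that a slash-less admin path such as the last sample is also selected for redirection. That may be why back-end forms stop working, since a redirected POST usually arrives without its form data.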

Problem now is that certain items in the admin of Drupal don't work. For example, uploading images and activating/deactivating modules no longer work.

I came across the following, which is supposed to apply the above 301 rule on the front end while excluding certain back-end users or pages. If I can stop the rule from running in the admin section, I hope I can work inside admin without the redirect being a problem.

The only problem is that when I include the following, it takes the site down.

# Except for the admin, user, and node areas
RewriteCond %{REQUEST_FILENAME} !^\/(admin¦user¦node)
RewriteRule ^(.*)$ [%{HTTP_HOST}$1...] [R=301,L]

The directories I would like to exclude from this trailing-slash rule are: /admin/, /user and /node/

any ideas?

Thanks in advance

[edited by: Pass_the_Dutchie at 3:44 pm (utc) on Sep. 8, 2008]

jdMorgan

11:49 pm on Sep 8, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Don't use that new rule -- It will loop, and it's not needed anyway. Just take the new RewriteCond from your new rule, and add it to the rule in your previous post. It will then exclude /admin or /user or /node from your trailing-slash rule.

Also, remember to replace the broken pipe "¦" characters with solid pipes before use; Posting on this forum modifies the pipe characters.

Jim

Pass the Dutchie

9:39 am on Sep 9, 2008 (gmt 0)

10+ Year Member



thanks Jim,

so now I have:

# Force Trailing Slash
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_URI} !(\.[a-zA-Z0-9]{1,5}¦/)$
RewriteCond %{REQUEST_FILENAME} !^\/(admin¦user¦node¦edit)
RewriteRule (.*)$ http://www.example.com/$1/ [R=301,L]

The trailing slash still works but I can't get the exclusion to work.

www.example.com/page/13/edit will still redirect to www.example.com/page/13/edit/

I must have done something wrong.

[edited by: jdMorgan at 3:44 pm (utc) on Sep. 9, 2008]
[edit reason] example.com [/edit]

jdMorgan

3:48 pm on Sep 9, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Wrong variable, no escape needed on "/" :

RewriteCond %{REQUEST_URI} !^/(admin¦user¦node¦edit)

The %{REQUEST_FILENAME} variable is a full server filepath (e.g. /var/users/<user>/www/html/your_admin_filepath_here), so a pattern anchored at "/admin" cannot match it: the filepath starts with the server directories, not with the requested URL-path. You need to test the requested URL-path, not the filepath.

Change the broken pipe characters to solid pipes before use, as always.
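As a rough illustration of the difference, the same anchored pattern can be tried in Python against a URL-path and against a filepath. The document root `/var/www/html` and the sample paths are assumptions made up for this sketch.

```python
import re

# The exclusion pattern from the corrected RewriteCond, with solid pipes.
EXCLUDED = re.compile(r"^/(admin|user|node|edit)")

url_path = "/admin/build/modules"          # what REQUEST_URI would hold
file_path = "/var/www/html" + url_path     # hypothetical REQUEST_FILENAME

print(bool(EXCLUDED.match(url_path)))          # matches: starts with /admin
print(bool(EXCLUDED.match(file_path)))         # no match: starts with /var
print(bool(EXCLUDED.match("/page/13/edit")))   # no match: "edit" is not at the start
```

The last line is worth noting: because the pattern is anchored at the start, it only covers URL-paths that begin with one of the listed names, so a path that merely ends in "edit" would not be excluded as written.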

Jim

Pass the Dutchie

7:59 pm on Sep 9, 2008 (gmt 0)

10+ Year Member



Jim you're right, that fixed the issue and disables rewrite on those directories. However, what I am trying to do is correct a problem when uploading an image via a module php upload script. This rewrite now disables the upload function.

I need to force the / and I need the upload image function. I am stumped and my developer has run out of ideas.

Any advice would be much appreciated.

Thanks again,

dutchie

jdMorgan

2:05 am on Sep 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Like mod_rewrite, we can only speak in terms of URLs here. So we need to know what URL(s) are used for "uploading an image." If not easily-findable, you can get this information by using the "Live HTTP Headers" add-on for Firefox/Mozilla browsers; Enable the "recording screen" then upload an image, and the add-on will capture all transactions between your browser and the server.

We can then discuss adding an additional exclusion to the rule so that uploads are not interfered with.

Jim

Pass the Dutchie

7:13 am on Sep 10, 2008 (gmt 0)

10+ Year Member



that was the magic ticket! Seeing the upload process allowed me to see the offending URL. In this case it was /filefield/

The solution was to add it to the exclusion condition:

RewriteCond %{REQUEST_URI} !^/(admin¦user¦node¦edit¦filefield)

(replaced ¦ with pipe)
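Putting the pieces together, the decision the finished rule makes can be approximated in Python as below. This is a sketch under assumptions: it tests the URL-path (REQUEST_URI) as Jim advised, skips the file and directory existence checks, and the function name and the sample upload path are hypothetical.

```python
import re
from typing import Optional

ENDS_WITH_EXTENSION_OR_SLASH = re.compile(r"(\.[a-zA-Z0-9]{1,5}|/)$")
EXCLUDED_PREFIX = re.compile(r"^/(admin|user|node|edit|filefield)")


def redirect_target(url_path: str) -> Optional[str]:
    """Return the path to 301 to, or None if the request is left alone."""
    if ENDS_WITH_EXTENSION_OR_SLASH.search(url_path):
        return None  # already ends in a slash or looks like a file
    if EXCLUDED_PREFIX.match(url_path):
        return None  # back-end area excluded from the rule
    return url_path + "/"


for path in ["/page", "/page/", "/admin/build/modules", "/filefield/upload"]:
    print(path, "->", redirect_target(path))
```

Only "/page" is redirected here; the other three requests pass through untouched.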

Jim, you are a star - thanks a lot :)

g1smd

8:24 am on Sep 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Live HTTP Headers wins again.

You can't hope to do serious work without it, or something similar.

It has caught a number of errors that weren't immediately visible in ordinary testing.

jdMorgan

1:40 pm on Sep 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Live Headers is definitely "essential kit" for testing rewriting and redirecting. Good for checking cache-control headers, cookies, MIME-types, and everything else related to client and server headers as well.

Jim