Forum Moderators: phranque


add htm extension to requesting URI

         

Jack1014

11:19 pm on Sep 14, 2017 (gmt 0)

5+ Year Member



Requests for URIs that have no extension get a 404 error; I would like to add ".htm" to them in the htaccess file. Our files are all ".htm" (it's an old site). We use "RedirectMatch 301" to change requests with an html extension to htm, but I can't figure out how to change no extension to htm.

whitespace

12:13 am on Sep 15, 2017 (gmt 0)

10+ Year Member Top Contributors Of The Month



Please show your current RedirectMatch directive and anything else you have in your .htaccess file.

Do any of your existing URLs contain a dot elsewhere in the URL?

keyplyr

12:23 am on Sep 15, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hello Jack1014 and welcome to WebmasterWorld

For your own security, please do not publish the contents of your entire htaccess file.

lucy24

2:04 am on Sep 15, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Wait, wait. Who is requesting the nonexistent files, and why?

If your URLs have never ended in .html, or you have never been extensionless, a 404 is a perfectly reasonable response. Are you by chance asking specifically about /index.html requests? Or are people on other sites linking to some specific pages, giving incorrect extensions?

Whitespace's question is leading up to “easy Regular Expression vs. complicated Regular Expression”. If your URLs never contain literal . periods (domain name and extension don't count) the pattern is pretty straightforward. If any of your URLs do contain literal periods--it's perfectly legal and sometimes even necessary--it gets more complicated.

RedirectMatch means mod_alias. You can only use this if your htaccess contains no RewriteRules (mod_rewrite), because the two mods don't play nice together. It is theoretically possible not to use mod_rewrite at all, but it's awfully uncommon.

Jack1014

2:47 am on Sep 15, 2017 (gmt 0)

5+ Year Member



Argh. Didn't know about Rewrite vs Redirect. Our htaccess begins:
RewriteEngine On
RewriteCond %{HTTPS} !=on
RewriteRule ^(.*)$ https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]

RedirectMatch 301 (.*).html$ https://sizes.com$1.htm
Our problem: we define units. If people request ourdomain.com/acre.htm, they get the page. If they request ourdomain.com/acre.html, they got a 404 error until we added the above RedirectMatch. If they request ourdomain.acre, they get a 404 error. It is the last item that I was trying to correct.

[edited by: bill at 4:10 am (utc) on Sep 15, 2017]
[edit reason] de-link code [/edit]

Jack1014

3:00 am on Sep 15, 2017 (gmt 0)

5+ Year Member



Just to be clear: It isn't links in other sites, like Wikipedia, that's the problem. Those links are generally right. Our problem is that what people type into search engines like google very often doesn't get the extension right.

lucy24

4:05 am on Sep 15, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



But if they're typing something into a search engine

:: pause here for obligatory rant about ordinary humans' inability to distinguish between search box and address bar, and browsers' intentional encouragement of the error ::

doesn't the search engine return the actual URL of the actual page that actually exists which they've actually crawled? No? Well, ### to the search engines.

I see you do already have a RewriteRule in place, meaning that you need to use RewriteRule instead of RedirectMatch for any other redirects. One option is
RewriteRule ^([^.]+\.htm)l https://www.example.com/$1 [R=301,L]
There are others; see above about literal . in an URL.

The https redirect always goes at the very end of all other redirects; the default ordering is from most specific to most general. But, as long as we're here, the existing form isn't optimal. You should merge it with the domain-name-canonicalization redirect, yielding (if it's a with-www site)
RewriteCond %{HTTPS} !on [OR]
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule (.+) https://www.example.com/$1 [R=301,L]
That's assuming you only have one hostname using this particular htaccess. If there are many, it gets more complicated.

Edit:
If they request ourdomain.acre, they get a 404 error. It is the last item that I was trying to correct.
wtf? Please tell me something got garbled here. If someone requests example.org, they won't end up on example.com.

Jack1014

9:44 pm on Sep 15, 2017 (gmt 0)

5+ Year Member



Apologies:
A) I was thinking browser and wrote search engine. As you realized, I meant the address in the browser's address bar.
B) I should have written ourdomain.com/acre

This is a VPS with a single domain on it. The only periods are dot com and dot htm (css,js, svg,jpg, etc). No periods before the dot in dot com. We shed the www a few years ago but other sites have links to us that do contain the www. Domain has a fixed IP.

As you can easily tell, I am in way over my head here, resorting to zombie, cut-and-paste coding. Cutting and pasting your suggestions seems to lead to something like this (or does it?):

RewriteEngine on

# remove www
RewriteCond %{HTTP_HOST} ^www\.example\.com [NC]
RewriteRule (.*) https://example.com/$1 [R=301,L]

# prevent google from indexing both IP and domain
RewriteCond %{HTTP_HOST} ^128.XXX.XXX.xxx
RewriteRule (.*) https://example.com/$1 [R=301,L]

# change dot html to dot htm
RewriteRule ^([^.]+\.htm)l https://example.com/$1 [R=301,L]

# change http:// to https://
RewriteCond %{HTTPS} !on [OR]
RewriteCond %{HTTP_HOST} !^(example\.com)?$
RewriteRule (.+) https://example.com/$1 [R=301,L]

I'm still in the dark about how one adds an extension to a REQUEST_URI that lacks one.
Thanks for your patience.


[edited by: not2easy at 10:45 pm (utc) on Sep 15, 2017]
[edit reason] unlinked/removed extra . (?) [/edit]

not2easy

10:52 pm on Sep 15, 2017 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Just a note, this section:
# change dot html to dot htm
RewriteRule ^([^.]+\.htm)l https://example.com/$1 [R=301,L]
had an extra '.' after the // and before the example.com part of the Rewrite target URL. Hopefully that was a typo?

I removed it to prevent the accidental linking it caused, to make it readable. Please clarify whether that is actually in place or an accidental addition here.

whitespace

11:59 pm on Sep 15, 2017 (gmt 0)

10+ Year Member Top Contributors Of The Month




# change http:// to https://
RewriteCond %{HTTPS} !on [OR]
RewriteCond %{HTTP_HOST} !^(example\.com)?$
RewriteRule (.+) https://example.com/$1 [R=301,L]


Note that if you use this then you don't need your previous "remove www" and "prevent google from indexing both IP and domain" directive blocks. This bit of code does both of those and more.

However, I'm not sure why lucy24 used (.+) instead of (.*) in the RewriteRule pattern? If you use + (1 or more) instead of * (0 or more) then it will miss the document root when used in a .htaccess context. (?)

The only periods are dot com and dot htm (css,js, svg,jpg, etc)


If the only dots in the URL-path appear before the file extension then you can simply check to see if there are no dots. If there are no dots in the URL-path then append the ".htm" extension. For example:


RewriteRule ^([^.]+)$ https://example.com/$1.htm [R=301,L]


Or maybe just check whether there is no file extension (of 2-4 letters) at the end of the URL, using a negated regex? (This assumes all your file extensions are lowercase and contain only letters.)


RewriteRule !\.[a-z]{2,4}$ https://example.com%{REQUEST_URI}.htm [R=301,L]
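
To illustrate the (.+) versus (.*) point above: in a .htaccess file at the document root, mod_rewrite strips the leading slash before matching, so a request for the home page is matched against an empty string. A minimal sketch, reusing the example.com placeholder from this thread:

```apache
# Request: http://example.com/   -> the RewriteRule pattern sees "" (empty string)

# (.+) needs at least one character, so the home page is never redirected:
# RewriteRule (.+) https://example.com/$1 [R=301,L]

# (.*) also matches the empty string, so the home page is covered:
RewriteCond %{HTTPS} !on [OR]
RewriteCond %{HTTP_HOST} !^(example\.com)?$
RewriteRule (.*) https://example.com/$1 [R=301,L]
```

This only applies in a per-directory (.htaccess) context; in the main server config the matched path keeps its leading slash.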

lucy24

1:31 am on Sep 16, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If you use + (1 or more) instead of * (0 or more) then it will miss the document root when used in a .htaccess context. (?)
Oh, right. I really should make a habit of copying-and-pasting from my own htaccess files--which by now should have it right--rather than just typ(o)ing from scratch every time.

Edit: OP, are you also getting requests for extensionless URLs? Honestly those may be simpler to leave as 404s, unless you are getting a whole lot for some reason and you don't want to risk losing them. The pattern for extensionless is
^([^.]+[^./])$
BUT you will notice that this is also the pattern for legitimate (physical) directories minus the final / slash--a pattern that search engines can and do ask for. Normally mod_dir handles it--that's one of its two jobs--but you sure as ### don't want to bother with a -d test every time.

Jack1014

3:29 am on Sep 16, 2017 (gmt 0)

5+ Year Member



not2easy -- It was a typo. Thanks for the edit.
whitespace -- It would never have occurred to me to count the dots. Thank you.
Lucy24-- It's a user problem. People come to the site looking for definitions of units. Uppermost in their mind is the name of the unit. So they type into the address bar "example.com/units/acre", and they get a 404 error. We get a lot of these.
I'm not sure how the effect of your pattern differs from that of whitespace's.
Here's my current version:
RewriteEngine on

# change dot html to dot htm
RewriteRule ^([^.]+\.htm)l https://example.com/$1 [R=301,L]

# add dot htm extension to requests lacking an extension
RewriteRule ^([^.]+)$ https://example.com/$1.htm [R=301,L]

# change http:// to https://
RewriteCond %{HTTPS} !on [OR]
RewriteCond %{HTTP_HOST} !^(example\.com)?$
RewriteRule (.*) https://example.com/$1 [R=301,L]

keyplyr

3:54 am on Sep 16, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



So they type into the address bar "example.com/units/acre", and they get a 404 error. We get a lot of these.
Jack1014 - do you have a Site Search?

After I installed a Site Search utility at the top of all my pages, 90% of that address bar activity stopped. Still get one or two per month, but most everyone gets to the right page using the Site Search now.

lucy24

5:37 am on Sep 16, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



RewriteRule ^([^.]+)$ https://example.com/$1.htm [R=301,L]
No, no, you can't do this--unless your site's URL structure is perfectly flat, with no directories whatsoever, which your example of “example.com/units/acre” indicates is not the case. Otherwise any request for
example.com/directory/
will be redirected to
example.com/directory/.htm
while requests for
example.com/directory
will be redirected to
example.com/directory.htm

At an absolute minimum, you will need a RewriteCond excluding requests for any and all actual directories. (If you don't have too many, it is easiest just to list them by name.) It will look something like
RewriteCond %{REQUEST_URI} !^(realdir|otherdir|thirddir|dir/subdir|dir/othersub)
RewriteRule ^([^.]+[^./])$ https://www.example.com/$1.htm [R=301,L]
but the exclusions in the RewriteCond should be set up to deal with requests that you actually get. And if the invalid requests are only for certain directories, the body of the rule should be constrained to something that looks like
RewriteRule ^(onedir|otherdir)/([^.]+[^./])$ https://www.example.com/$1/$2.htm [R=301,L]
Matter of fact, the ([^.]+[^./])$ part may well be reducible to (\w+)$. Again, this depends on which URLs on your site actually exist.

Further afield...

Some people at this point would argue that you are doing the whole thing ### backward and you ought to be using extensionless URLs, rewritten ([L] flag alone) to add the extension. Personally my reaction to an extensionless URL is to tell it to go back in the server and put some clothes on, but that's me.
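
For comparison, a rough sketch of what that extensionless approach might look like. It is limited to the /units/ directory from the example above and assumes every page there is a .htm file, so it cannot choose between .htm and .php on its own. The THE_REQUEST condition is an assumption added here to avoid a redirect loop; it is not something proposed in this thread.

```apache
# 1. Externally redirect direct requests for /units/name.htm to /units/name.
#    THE_REQUEST holds the original request line, so this condition fails
#    after the internal rewrite in step 2 and no loop occurs.
RewriteCond %{THE_REQUEST} ^[A-Z]+\s/units/[\w-]+\.htm[?\s]
RewriteRule ^(units/[\w-]+)\.htm$ https://example.com/$1 [R=301,L]

# 2. Internally rewrite /units/name to the real file. There is no R flag,
#    so browsers and search engines keep seeing the extensionless URL.
RewriteRule ^(units/[\w-]+)$ /$1.htm [L]
```

Any real directories directly under /units/ would still need to be excluded from step 2, for the same reason as in the redirect version.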

whitespace

8:44 am on Sep 16, 2017 (gmt 0)

10+ Year Member Top Contributors Of The Month



... this is also the pattern for legitimate (physical) directories


Ah yes - directories! (Hhmm I overlooked that!) If you have many directories then it may be easier to use a less efficient catch-all, for example:


RewriteCond %{REQUEST_FILENAME} !-d


But listing/constraining the directories, as lucy24 suggests, would be preferable.


RewriteCond %{REQUEST_URI} !^(realdir|otherdir|thirddir|dir/subdir|dir/othersub)


But note that the REQUEST_URI server variable starts with a slash, so the CondPattern would need to start !^/(realdir|other.....
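
Putting that together: in .htaccess the RewriteRule pattern is matched against the path without its leading slash, while REQUEST_URI keeps it. A sketch using lucy24's placeholder directory names (they are not real directories):

```apache
# REQUEST_URI is e.g. "/realdir/page", so the condition needs the leading slash
RewriteCond %{REQUEST_URI} !^/(realdir|otherdir|thirddir|dir/subdir|dir/othersub)
# In .htaccess the rule pattern sees "realdir/page" (no leading slash)
RewriteRule ^([^.]+[^./])$ https://example.com/$1.htm [R=301,L]
```

Without the slash the condition pattern can never match, so the negated condition is always true and the listed directories are not excluded at all.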

Jack1014

8:07 pm on Sep 16, 2017 (gmt 0)

5+ Year Member



keyplyr: We do have site search, using Wrensoft's Zoom engine (replacing Google's site search), and our experience was the same as yours, it cut down the number of 404 errors.

lucy24 + whitespace: Ooops. I counted the number of directories. There are 27 directories one level below the root directory. Most of those directories have at least two directories of their own. In other words, flat it isn't. Listing them all (and maintaining the list) would be a nightmare.

Frankly, I can't imagine why a real user would request one of these directories, or what they would get if they did, but I guess I am naive. Basically, it would be better if users could NOT request a directory; the directories are just a convenient way for us to organize subject matter.

Actually, because of scraping, robots (Chinese robots, lately) request every conceivable permutation of the urllist, including directories, sub-directories, sub-sub-directories. So there is someone looking at directories.

Extensionless URLS? But then, how would you know whether to add ".htm" or ".php"? Plus, wouldn't eliminating R mean the search engines wouldn't record the change?

Version with correction. I gather performance takes a big hit due to not listing the directories, but ....

RewriteEngine on

# change dot html to dot htm
RewriteRule ^([^.]+\.htm)l https://example.com/$1 [R=301,L]

# add dot htm extension to requests lacking an extension
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^([^.]+)$ https://example.com/$1.htm [R=301,L]

# change http:// to https://
RewriteCond %{HTTPS} !on [OR]
RewriteCond %{HTTP_HOST} !^(example\.com)?$
RewriteRule (.*) https://example.com/$1 [R=301,L]

keyplyr

8:20 pm on Sep 16, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Actually, because of scraping, robots (Chinese robots, lately) request every conceivable permutation of the urllist, including directories, sub-directories, sub-sub-directories. So there is someone looking at directories.
IMO pandering to bots is futile. Ignore the 404 errors caused by these UAs, they can be endless; unless there is a valid reason of course.

But personally, I wouldn't add rewrite rules just because a few visitors use the URL box instead of the Site Search.. or because of rogue UAs requesting unknown files from various directories. It's a waste of time chasing all these.

Jack1014

8:28 pm on Sep 16, 2017 (gmt 0)

5+ Year Member



I would like to ignore them but our ISP doesn't. Because ours is a VPS, huuuge loads on our site can affect the other VPS'es on the same server, and so when one of these hits we get emails from a robot at the ISP saying you're using too much memory (and we pay for an amount of memory which is 10 times the actual size of the site, which is almost all static pages).

keyplyr

9:16 pm on Sep 16, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If you are "using too much memory" serving a 404 you should consider serving a smaller 404 file.

lucy24

9:38 pm on Sep 16, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Jack, earlier you gave the example of “example.com/units/acre” where the erroneous request comes in the directory /units/. Presumably you have other directories that don't suffer from this problem, because type-ins are rare--for example, nobody would be typing in a string of numbers to an /info/ directory. So the question is: how many of each? And also, is it always the same depth? Here: example.com/directory/filename and-that's-all.

For present purposes, let's suppose all your file- and directory names consist only of alphanumerics, which can be expressed tidily as \w. This term includes lowlines, but excludes hyphens; if you do have hyphens it would be [\w-] instead.

If you have a lot of directories that result in spurious type-ins, and just a few exceptions, a possible ruleset is
RewriteCond %{REQUEST_URI} !^/(onedir|otherdir)
RewriteRule ^(\w+/\w+)$ https://example.com/$1.htm [R=301,L]
If, on the other hand, the spurious type-ins are limited to just a few directories, a possible rule--conditionless--is
RewriteRule ^((onedir|otherdir|thirddir)/\w+)$ https://example.com/$1.htm [R=301,L]

As you can see, this is not a case for One Size Fits All. You'll need a ruleset that is tailored to the particular circumstances of your site.

If you decide it's appropriate, you could even lump the two types of mistake into a single ruleset, with pattern
^((onedir|otherdir|thirddir)/\w+)(?:\.html)?$
(i.e. optional incorrect .html which is not captured) but this is probably not the most efficient way.

Edit: If your host is yapping a lot about memory usage, that's another reason to try very hard not to involve -f and -d tests, since it basically means the server is checking the same thing twice on every request (once while executing the RewriteRule, and then again a millisecond later when it has finished all the mods and is ready to serve up the requested file, if in fact it exists).

Jack1014

10:30 pm on Sep 16, 2017 (gmt 0)

5+ Year Member



Ahh. Okay, no -d. While the html instead of htm problem occurs sitewide, the missing extension problem is limited to the units directory.

The units directory contains files of the form acre.htm, yard.htm, sa-perunjong.htm, sana_lamjel.htm, etc.
It also contains the following subdirectories:
uimages: the images used in pages in the units directory
audio: as above, but sounds instead of pix.
about: htm files ordinarily reached by clicking on an "about" link in one of the pages in the units directory. Chronology of contributors.
symbol: htm files decoding abbreviations, with one php file that searches them
cheating: htm files describing incidents of cheating with wts and measures
charts: htm files containing graphical representations of systems of units. These are frequently accessed directly, i.e., not from a link on an htm page in the units directory.
country: htm files on historical metrology of a particular country/region. Currently weak and not much consulted.
general: htm files, catchall for pages on principles, conversion disasters, and so forth. Often accessed directly.

So, if I am following you correctly, I should be using something like:

RewriteRule ^((units)/\w-)$ https://example.com/$1.htm [R=301,L]

lucy24

2:48 am on Sep 17, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yes, except that if it's only /units/ you don't need parentheses around it:
^(units/\w-)$
Gosh, that suddenly got a whole lot simpler didn't it :)

But with all those subdirectories--darn! you just have to keep complicating things don't you--you'll need to add a Condition:
RewriteCond %{REQUEST_URI} !^/units/(uimages|audio|about|symbol|cheating|charts|country|general)$
It is standard practice for search engines to make requests like
example.com/units/subdir (no final slash)
just to verify that they get redirected to
example.com/units/subdir/ (with final slash)
This redirect is handled by mod_dir. You don't have to do anything about it. I've also met a few non-search-engine robots that habitually ask for directories without final / slash. They only do it to annoy, because they know it teases.

Edit:
sana_lamjel.htm
Ha! So you actually do use lowlines. Those are covered by the \w locution.

:: wandering off to learn what the heck a sana_lamjel is when it's at home ::

(Returning) Ooh, how interesting. Manipur? Is that somewhere near Assam? Yeah, probably.

Jack1014

9:46 pm on Sep 20, 2017 (gmt 0)

5+ Year Member



Well, this has taken up too much of your time, but I am deeply appreciative. Here's what I'm planning to upload. I sense there's some redundancy, but couldn't say where.

RewriteEngine on

# change dot html to dot htm
RewriteRule ^([^.]+\.htm)l https://example.com/$1 [R=301,L]

# add dot htm extension to requests lacking an extension
RewriteCond %{REQUEST_URI} !^/units/(uimages|audio|about|symbol|cheating|charts|country|general)$
RewriteRule ^(units/\w-)$ https://example.com/$1.htm [R=301,L]

# change http:// to https://
RewriteCond %{HTTPS} !on [OR]
RewriteCond %{HTTP_HOST} !^(example\.com)?$
RewriteRule (.*) https://example.com/$1 [R=301,L]

# fix old too-clever spelling of directory
RewriteRule ^materls/(.*)$ https://example.com/materials/$1 [R=301,L]

#Gzip
<ifmodule mod_deflate.c>
AddOutputFilter DEFLATE js css
AddOutputFilterByType DEFLATE text/text text/html text/plain text/xml application/javascript
#The following lines are to avoid bugs with some browsers
BrowserMatch ^Mozilla/4 gzip-only-text/html
BrowserMatch ^Mozilla/4\.0[678] no-gzip
BrowserMatch \bMSIE !no-gzip !gzip-only-text/html
</ifmodule>
#End Gzip

Thanks again.

lucy24

10:27 pm on Sep 20, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Looks good, except...
^(units/\w-)$

Whoops! You meant
^(units/[\w-]+)$


Edit: This mistake is partly my fault, because I overlooked it in an earlier copy-and-paste. (Scroll back 3 or 4 posts.)
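
To spell out the difference, here is the corrected pair as it might appear in the file, with the two patterns compared in comments. The filenames are the ones Jack listed for the units directory.

```apache
# \w-     : exactly one word character followed by a literal hyphen
#           (would match units/a- but not units/acre)
# [\w-]+  : one or more letters, digits, underscores or hyphens
#           (matches units/acre, units/sa-perunjong, units/sana_lamjel)
RewriteCond %{REQUEST_URI} !^/units/(uimages|audio|about|symbol|cheating|charts|country|general)$
RewriteRule ^(units/[\w-]+)$ https://example.com/$1.htm [R=301,L]
```

The condition is still needed, because a slashless request such as units/audio also matches [\w-]+ and would otherwise be redirected to units/audio.htm.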

phranque

11:11 pm on Sep 20, 2017 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



i would reorder the rulesets from most specific to most general.

like this:
# fix old too-clever spelling of directory
# add dot htm extension to requests lacking an extension
# change dot html to dot htm
# change http:// to https://



<ifmodule mod_deflate.c>

no ifmodule necessary - either you've got mod_deflate installed or you remove these directives.
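
For reference, a sketch of how the rewrite section might look with that ordering applied and the [\w-]+ correction from lucy24's last post. It is assembled from rules already posted in this thread. The only addition is the trailing $ on the html rule, which is an assumption here so that only URLs actually ending in .html are redirected. Untested on the real site.

```apache
RewriteEngine on

# fix old too-clever spelling of directory
RewriteRule ^materls/(.*)$ https://example.com/materials/$1 [R=301,L]

# add dot htm extension to requests lacking an extension (units directory only)
RewriteCond %{REQUEST_URI} !^/units/(uimages|audio|about|symbol|cheating|charts|country|general)$
RewriteRule ^(units/[\w-]+)$ https://example.com/$1.htm [R=301,L]

# change dot html to dot htm (trailing $ added as an assumption)
RewriteRule ^([^.]+\.htm)l$ https://example.com/$1 [R=301,L]

# change http:// to https:// and canonicalize the hostname
RewriteCond %{HTTPS} !on [OR]
RewriteCond %{HTTP_HOST} !^(example\.com)?$
RewriteRule (.*) https://example.com/$1 [R=301,L]
```

Because every specific rule already redirects to the full https://example.com/ address, the general rule at the end only fires for requests that none of the earlier rules touched.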