
Forum Moderators: Ocean10000 & phranque


add .htm extension to requested URIs

     
11:19 pm on Sep 14, 2017 (gmt 0)

New User

joined:Sept 14, 2017
posts: 9
votes: 0


Requested URIs that have no extension get a 404 error; I would like to add ".htm" to them in the .htaccess file. Our files are all ".htm" (it's an old site). We use "RedirectMatch 301" to change requests with an .html extension to .htm, but I can't figure out how to change no extension to .htm.
12:13 am on Sept 15, 2017 (gmt 0)

Full Member

Top Contributors Of The Month

joined:Apr 11, 2015
posts: 328
votes: 24


Please show your current RedirectMatch directive and anything else you have in your .htaccess file.

Do any of your existing URLs contain a dot elsewhere in the URL?
12:23 am on Sept 15, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12913
votes: 893


Hello Jack1014 and welcome to WebmasterWorld [webmasterworld.com]

For your own security, please do not publish the contents of your entire htaccess file.
2:04 am on Sept 15, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15893
votes: 876


Wait, wait. Who is requesting the nonexistent files, and why?

If your URLs have never ended in .html, or you have never been extensionless, a 404 is a perfectly reasonable response. Are you by chance asking specifically about /index.html requests? Or are people on other sites linking to some specific pages, giving incorrect extensions?

Whitespace's question is leading up to “easy Regular Expression vs. complicated Regular Expression”. If your URLs never contain literal . periods (domain name and extension don't count) the pattern is pretty straightforward. If any of your URLs do contain literal periods--it's perfectly legal and sometimes even necessary--it gets more complicated.

RedirectMatch means mod_alias. You can only use this if your htaccess contains no RewriteRules (mod_rewrite), because the two mods don't play nice together. It is theoretically possible not to use mod_rewrite at all, but it's awfully uncommon.
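For illustration, the usual way around the conflict is to express the mod_alias directive in mod_rewrite form; a minimal sketch, with example.com standing in for the real hostname:

```apache
# mod_rewrite equivalent of: RedirectMatch 301 (.*)\.html$ https://example.com$1.htm
# In .htaccess context the pattern matches the URL-path without its
# leading slash, so the slash is restored in the substitution.
RewriteRule ^(.*)\.html$ https://example.com/$1.htm [R=301,L]
```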
2:47 am on Sept 15, 2017 (gmt 0)

New User

joined:Sept 14, 2017
posts: 9
votes: 0


Argh. Didn't know about Rewrite vs Redirect. Our htaccess begins:
RewriteEngine On
RewriteCond %{HTTPS} !=on
RewriteRule ^(.*)$ https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]

RedirectMatch 301 (.*).html$ https://sizes.com$1.htm
Our problem: we define units. If people request ourdomain.com/acre.htm, they get the page. If they request ourdomain.com/acre.html, they got a 404 error until we added the above RedirectMatch. If they request ourdomain.acre, they get a 404 error. It is the last item that I was trying to correct.

[edited by: bill at 4:10 am (utc) on Sep 15, 2017]
[edit reason] de-link code [/edit]

3:00 am on Sept 15, 2017 (gmt 0)

New User

joined:Sept 14, 2017
posts: 9
votes: 0


Just to be clear: It isn't links in other sites, like Wikipedia, that's the problem. Those links are generally right. Our problem is that what people type into search engines like google very often doesn't get the extension right.
4:05 am on Sept 15, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15893
votes: 876


But if they're typing something into a search engine

:: pause here for obligatory rant about ordinary humans' inability to distinguish between search box and address bar, and browsers' intentional encouragement of the error ::

doesn't the search engine return the actual URL of the actual page that actually exists which they've actually crawled? No? Well, ### to the search engines.

I see you do already have a RewriteRule in place, meaning that you need to use RewriteRule instead of RedirectMatch for any other redirects. One option is
RewriteRule ^([^.]+\.htm)l https://www.example.com/$1 [R=301,L]
There are others; see above about literal . in an URL.

The https redirect always goes at the very end of all other redirects; the default ordering is from most specific to most general. But, as long as we're here, the existing form isn't optimal. You should merge it with the domain-name-canonicalization redirect, yielding (if it's a with-www site)
RewriteCond %{HTTPS} !on [OR]
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule (.+) https://www.example.com/$1 [R=301,L]
That's assuming you only have one hostname using this particular htaccess. If there are many, it gets more complicated.
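To make the merged ruleset concrete, here is how the four possible request forms would fare under it (hostnames are illustrative):

```apache
# Redirected to https://www.example.com/page :
#   http://example.com/page       (fails the HTTPS condition)
#   http://www.example.com/page   (fails the HTTPS condition)
#   https://example.com/page      (fails the HTTP_HOST condition)
# Left alone:
#   https://www.example.com/page
# The trailing ? in ^(www\.example\.com)?$ also exempts requests that
# arrive with an empty Host header (old HTTP/1.0 clients).
```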

Edit:
If they request ourdomain.acre, they get a 404 error. It is the last item that I was trying to correct.
wtf? Please tell me something got garbled here. If someone requests example.org, they won't end up on example.com.
9:44 pm on Sept 15, 2017 (gmt 0)

New User

joined:Sept 14, 2017
posts: 9
votes: 0


Apologies:
A) I was thinking browser and wrote search engine. As you realized, I meant the address in the browser's address bar.
B) I should have written ourdomain.com/acre

This is a VPS with a single domain on it. The only periods are dot com and dot htm (css, js, svg, jpg, etc.). No periods before the dot in dot com. We shed the www a few years ago, but other sites have links to us that do contain the www. The domain has a fixed IP.

As you can easily tell, I am in way over my head here, resorting to zombie, cut-and-paste coding. Cutting and pasting your suggestions seems to lead to something like this (or does it?):

RewriteEngine on

# remove www
RewriteCond %{HTTP_HOST} ^www\.example\.com [NC]
RewriteRule (.*) https://example.com/$1 [R=301,L]

# prevent google from indexing both IP and domain
RewriteCond %{HTTP_HOST} ^128.XXX.XXX.xxx
RewriteRule (.*) https://example.com/$1 [R=301,L]

# change dot html to dot htm
RewriteRule ^([^.]+\.htm)l https://example.com/$1 [R=301,L]

# change http:// to https://
RewriteCond %{HTTPS} !on [OR]
RewriteCond %{HTTP_HOST} !^(example\.com)?$
RewriteRule (.+) https://example.com/$1 [R=301,L]

I'm still in the dark about how one adds an extension to a REQUEST_URI that lacks one.
Thanks for your patience.


[edited by: not2easy at 10:45 pm (utc) on Sep 15, 2017]
[edit reason] unlinked/removed extra . (?) [/edit]

10:52 pm on Sept 15, 2017 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:4524
votes: 350


Just a note, this section:
# change dot html to dot htm
RewriteRule ^([^.]+\.htm)l https://example.com/$1 [R=301,L]
had an extra '.' after the // and before the example.com part of the Rewrite target URL. Hopefully that was a typo?

I removed it to prevent the accidental linking it caused, to make it readable. Please clarify whether that is actually in place or an accidental addition here.
11:59 pm on Sept 15, 2017 (gmt 0)

Full Member

Top Contributors Of The Month

joined:Apr 11, 2015
posts: 328
votes: 24



# change http:// to https://
RewriteCond %{HTTPS} !on [OR]
RewriteCond %{HTTP_HOST} !^(example\.com)?$
RewriteRule (.+) https://example.com/$1 [R=301,L]


Note that if you use this then you don't need your previous "remove www" and "prevent google from indexing both IP and domain" directive blocks. This bit of code does both of those and more.

However, I'm not sure why lucy24 used (.+) instead of (.*) in the RewriteRule pattern? If you use + (1 or more) instead of * (0 or more) then it will miss the document root when used in a .htaccess context. (?)

The only periods are dot com and dot htm (css,js, svg,jpg, etc)


If the only dots in the URL-path appear before the file extension then you can simply check to see if there are no dots. If there are no dots in the URL-path then append the ".htm" extension. For example:


RewriteRule ^([^.]+)$ https://example.com/$1.htm [R=301,L]


Or, maybe just check to see if there no file extension (of 2-4 letters) at the end of the URL (using a negated regex)? (Assumes all your file extensions are lowercase and only letters.)


RewriteRule !\.[a-z]{2,4}$ https://example.com%{REQUEST_URI}.htm [R=301,L]
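As an illustration of how the two suggested patterns differ on hypothetical request paths:

```apache
#   ^([^.]+)$          matches any URL-path containing no dot at all:
#     units/acre       -> redirected to units/acre.htm
#     units/acre.htm   -> contains a dot, left alone
#
#   !\.[a-z]{2,4}$     matches anything NOT ending in a lowercase
#                      two-to-four-letter extension:
#     units/acre       -> redirected to units/acre.htm
#     units/acre.HTM   -> uppercase, so the negated pattern still
#                         matches and it would be redirected as well
```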
1:31 am on Sept 16, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15893
votes: 876


If you use + (1 or more) instead of * (0 or more) then it will miss the document root when used in a .htaccess context. (?)
Oh, right. I really should make a habit of copying-and-pasting from my own htaccess files--which by now should have it right--rather than just typ(o)ing from scratch every time.

Edit: OP, are you also getting requests for extensionless URLs? Honestly those may be simpler to leave as 404s, unless you are getting a whole lot for some reason and you don't want to risk losing them. The pattern for extensionless is
^([^.]+[^./])$
BUT you will notice that this is also the pattern for legitimate (physical) directories minus the final / slash--a pattern that search engines can and do ask for. Normally mod_dir handles it--that's one of its two jobs--but you sure as ### don't want to bother with a -d test every time.
3:29 am on Sept 16, 2017 (gmt 0)

New User

joined:Sept 14, 2017
posts: 9
votes: 0


not2easy -- It was a typo. Thanks for the edit.
whitespace -- It would never have occurred to me to count the dots. Thank you.
lucy24 -- It's a user problem. People come to the site looking for definitions of units. Uppermost in their minds is the name of the unit. So they type "example.com/units/acre" into the address bar, and they get a 404 error. We get a lot of these.
I'm not sure how the effect of your pattern differs from that of whitespace's.
Here's my current version:
RewriteEngine on

# change dot html to dot htm
RewriteRule ^([^.]+\.htm)l https://example.com/$1 [R=301,L]

# add dot htm extension to requests lacking an extension
RewriteRule ^([^.]+)$ https://example.com/$1.htm [R=301,L]

# change http:// to https://
RewriteCond %{HTTPS} !on [OR]
RewriteCond %{HTTP_HOST} !^(example\.com)?$
RewriteRule (.*) https://example.com/$1 [R=301,L]
3:54 am on Sept 16, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12913
votes: 893


So they type into the address bar "example.com/units/acre", and they get a 404 error. We get a lot of these.
Jack1014 - do you have a Site Search?

After I installed a Site Search utility at the top of all my pages, 90% of that address bar activity stopped. Still get one or two per month, but most everyone gets to the right page using the Site Search now.
5:37 am on Sept 16, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15893
votes: 876


RewriteRule ^([^.]+)$ https://example.com/$1.htm [R=301,L]
No, no, you can't do this--unless your site's URL structure is perfectly flat, with no directories whatsoever, which your example of “example.com/units/acre” indicates is not the case. Otherwise any request for
example.com/directory/
will be redirected to
example.com/directory/.htm
while requests for
example.com/directory
will be redirected to
example.com/directory.htm

At an absolute minimum, you will need a RewriteCond excluding requests for any and all actual directories. (If you don't have too many, it is easiest just to list them by name.) It will look something like
RewriteCond %{REQUEST_URI} !^(realdir|otherdir|thirddir|dir/subdir|dir/othersub)
RewriteRule ^([^.]+[^./])$ https://www.example.com/$1.htm [R=301,L]
but the exclusions in the RewriteCond should be set up to deal with requests that you actually get. And if the invalid requests are only for certain directories, the body of the rule should be constrained to something that looks like
RewriteRule ^(onedir|otherdir)/([^.]+[^./])$ https://www.example.com/$1/$2.htm [R=301,L]
Matter of fact, the ([^.]+[^./])$ part may well be reducible to (\w+)$. Again, this depends on which URLs on your site actually exist.

Further afield...

Some people at this point would argue that you are doing the whole thing ### backward and you ought to be using extensionless URLs, rewritten ([L] flag alone) to add the extension. Personally my reaction to an extensionless URL is to tell it to go back in the server and put some clothes on, but that's me.
8:44 am on Sept 16, 2017 (gmt 0)

Full Member

Top Contributors Of The Month

joined:Apr 11, 2015
posts: 328
votes: 24


... this is also the pattern for legitimate (physical) directories


Ah yes - directories! (Hhmm I overlooked that!) If you have many directories then it may be easier to use a less efficient catch-all, for example:


RewriteCond %{REQUEST_FILENAME} !-d


But listing/constraining the directories, as lucy24 suggests, would be preferable.


RewriteCond %{REQUEST_URI} !^(realdir|otherdir|thirddir|dir/subdir|dir/othersub)


But note that the REQUEST_URI server variable starts with a slash, so the CondPattern would need to start !^/(realdir|other.....
8:07 pm on Sept 16, 2017 (gmt 0)

New User

joined:Sept 14, 2017
posts: 9
votes: 0


keyplyr: We do have site search, using Wrensoft's Zoom engine (replacing Google's site search), and our experience was the same as yours: it cut down the number of 404 errors.

lucy24 + whitespace: Ooops. I counted the number of directories. There are 27 directories one level below the root directory. Most of those directories have at least two directories of their own. In other words, flat it isn't. Listing them all (and maintaining the list) would be a nightmare.

Frankly, I can't imagine why a real user would request one of these directories, or what they would get if they did, but I guess I am naive. Basically, it would be better if users could NOT request a directory; the directories are just a convenient way for us to organize subject matter.

Actually, because of scraping, robots (Chinese robots, lately) request every conceivable permutation of the urllist, including directories, sub-directories, sub-sub-directories. So there is someone looking at directories.

Extensionless URLs? But then how would you know whether to add ".htm" or ".php"? Plus, wouldn't eliminating R mean the search engines wouldn't record the change?

Version with correction. I gather performance takes a big hit due to not listing the directories, but ....

RewriteEngine on

# change dot html to dot htm
RewriteRule ^([^.]+\.htm)l https://example.com/$1 [R=301,L]

# add dot htm extension to requests lacking an extension
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^([^.]+)$ https://example.com/$1.htm [R=301,L]

# change http:// to https://
RewriteCond %{HTTPS} !on [OR]
RewriteCond %{HTTP_HOST} !^(example\.com)?$
RewriteRule (.*) https://example.com/$1 [R=301,L]
8:20 pm on Sept 16, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12913
votes: 893


Actually, because of scraping, robots (Chinese robots, lately) request every conceivable permutation of the urllist, including directories, sub-directories, sub-sub-directories. So there is someone looking at directories.
IMO pandering to bots is futile. Ignore the 404 errors caused by these UAs; they can be endless, unless there is a valid reason of course.

But personally, I wouldn't add rewrite rules just because a few visitors use the URL box instead of the Site Search, or because of rogue UAs requesting unknown files from various directories. It's a waste of time chasing all these.
8:28 pm on Sept 16, 2017 (gmt 0)

New User

joined:Sept 14, 2017
posts: 9
votes: 0


I would like to ignore them, but our ISP doesn't. Because ours is a VPS, huuuge loads on our site can affect the other VPSes on the same server, and so when one of these hits we get emails from a robot at the ISP saying you're using too much memory (and we pay for an amount of memory which is 10 times the actual size of the site, which is almost all static pages).
9:16 pm on Sept 16, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12913
votes: 893


If you are "using too much memory" serving a 404 you should consider serving a smaller 404 file.
9:38 pm on Sept 16, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15893
votes: 876


Jack, earlier you gave the example of “example.com/units/acre” where the erroneous request comes in the directory /units/. Presumably you have other directories that don't suffer from this problem, because type-ins are rare--for example, nobody would be typing in a string of numbers to an /info/ directory. So the question is: how many of each? And also, is it always the same depth? Here: example.com/directory/filename and-that's-all.

For present purposes, let's suppose all your file- and directory names consist only of alphanumerics, which can be expressed tidily as \w. This term includes lowlines, but excludes hyphens; if you do have hyphens it would be [\w-] instead.

If you have a lot of directories that result in spurious type-ins, and just a few exceptions, a possible ruleset is
RewriteCond %{REQUEST_URI} !^/(onedir|otherdir)
RewriteRule ^(\w+/\w+)$ https://example.com/$1.htm [R=301,L]
If, on the other hand, the spurious type-ins are limited to just a few directories, a possible rule--conditionless--is
RewriteRule ^((onedir|otherdir|thirddir)/\w+)$ https://example.com/$1.htm [R=301,L]

As you can see, this is not a case for One Size Fits All. You'll need a ruleset that is tailored to the particular circumstances of your site.

If you decide it's appropriate, you could even lump the two types of mistake into a single ruleset, with pattern
^((onedir|otherdir|thirddir)/\w+)(?:\.html)?$
(i.e. optional incorrect .html which is not captured) but this is probably not the most efficient way.
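Written out as a complete rule (directory names hypothetical), the combined pattern would be:

```apache
# Redirect both /onedir/name and /onedir/name.html to /onedir/name.htm;
# (?:\.html)? makes the wrong extension optional and keeps it out of $1
RewriteRule ^((onedir|otherdir|thirddir)/\w+)(?:\.html)?$ https://example.com/$1.htm [R=301,L]
```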

Edit: If your host is yapping a lot about memory usage, that's another reason to try very hard not to involve -f and -d tests, since it basically means the server is checking the same thing twice on every request (once while executing the RewriteRule, and then again a millisecond later when it has finished all the mods and is ready to serve up the requested file, if in fact it exists).
10:30 pm on Sept 16, 2017 (gmt 0)

New User

joined:Sept 14, 2017
posts: 9
votes: 0


Ahh. Okay, no -d. While the html instead of htm problem occurs sitewide, the missing extension problem is limited to the units directory.

The units directory contains files of the form acre.htm, yard.htm, sa-perunjong.htm, sana_lamjel.htm, etc.
It also contains the following subdirectories:
uimages: the images used in pages in the units directory
audio: as above, but sounds instead of pix.
about: htm files ordinarily reached by clicking on an "about" link in one of the pages in the units directory. Chronology of contributors.
symbol: htm files decoding abbreviations, with one php file that searches them
cheating: htm files describing incidents of cheating with wts and measures
charts: htm files containing graphical representations of systems of units. These are frequently accessed directly, i.e., not from a link on an htm page in the units directory.
country: htm files on historical metrology of a particular country/region. Currently weak and not much consulted.
general: htm files, catchall for pages on principles, conversion disasters, and so forth. Often accessed directly.

So, if I am following you correctly, I should be using something like:

RewriteRule ^((units)/\w-)$ https://example.com/$1.htm [R=301,L]
2:48 am on Sept 17, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15893
votes: 876


Yes, except that if it's only /units/ you don't need parentheses around it:
^(units/\w-)$
Gosh, that suddenly got a whole lot simpler didn't it :)

But with all those subdirectories--darn! you just have to keep complicating things don't you--you'll need to add a Condition:
RewriteCond %{REQUEST_URI} !^/units/(uimages|audio|about|symbol|cheating|charts|country|general)$
It is standard practice for search engines to make requests like
example.com/units/subdir (no final slash)
just to verify that they get redirected to
example.com/units/subdir/ (with final slash)
This redirect is handled by mod_dir. You don't have to do anything about it. I've also met a few non-search-engine robots that habitually ask for directories without final / slash. They only do it to annoy, because they know it teases.
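For reference, the trailing-slash redirect described here is mod_dir's DirectorySlash behavior; it is on by default, so nothing needs to be added, though it can be stated explicitly:

```apache
# mod_dir redirects /units/subdir to /units/subdir/ automatically
DirectorySlash On
```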

Edit:
sana_lamjel.html
Ha! So you actually do use lowlines. Those are covered by the \w locution.

:: wandering off to learn what the heck a sana_lamjel is when it's at home ::

(Returning) Ooh, how interesting. Manipur? Is that somewhere near Assam? Yeah, probably.
9:46 pm on Sept 20, 2017 (gmt 0)

New User

joined:Sept 14, 2017
posts: 9
votes: 0


Well, this has taken up too much of your time, but I am deeply appreciative. Here's what I'm planning to upload. I sense there's some redundancy, but couldn't say where.

RewriteEngine on

# change dot html to dot htm
RewriteRule ^([^.]+\.htm)l https://example.com/$1 [R=301,L]

# add dot htm extension to requests lacking an extension
RewriteCond %{REQUEST_URI} !^/units/(uimages|audio|about|symbol|cheating|charts|country|general)$
RewriteRule ^(units/\w-)$ https://example.com/$1.htm [R=301,L]

# change http:// to https://
RewriteCond %{HTTPS} !on [OR]
RewriteCond %{HTTP_HOST} !^(example\.com)?$
RewriteRule (.*) https://example.com/$1 [R=301,L]

# fix old too-clever spelling of directory
RewriteRule ^materls/(.*)$ https://example.com/materials/$1 [R=301,L]

#Gzip
<ifmodule mod_deflate.c>
AddOutputFilter DEFLATE js css
AddOutputFilterByType DEFLATE text/text text/html text/plain text/xml application/javascript
#The following lines are to avoid bugs with some browsers
BrowserMatch ^Mozilla/4 gzip-only-text/html
BrowserMatch ^Mozilla/4\.0[678] no-gzip
BrowserMatch \bMSIE !no-gzip !gzip-only-text/html
</ifmodule>
#End Gzip

Thanks again.
10:27 pm on Sept 20, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15893
votes: 876


Looks good, except...
^(units/\w-)$

Whoops! You meant
^(units/[\w-]+)$


Edit: This mistake is partly my fault, because I overlooked it in an earlier copy-and-paste. (Scroll back 3 or 4 posts.)
11:11 pm on Sept 20, 2017 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11846
votes: 242


i would reorder the rulesets from most specific to most general.

like this:
# fix old too-clever spelling of directory
# add dot htm extension to requests lacking an extension
# change dot html to dot htm
# change http:// to https://



<ifmodule mod_deflate.c>

no ifmodule necessary - either you've got mod_deflate installed or you remove these directives.
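
Assembled in phranque's order (most specific first, https last), and with the subdirectory exclusion and corrected pattern from earlier posts, the redirect portion might read like this (a sketch, not a drop-in file):

```apache
RewriteEngine on

# fix old too-clever spelling of directory
RewriteRule ^materls/(.*)$ https://example.com/materials/$1 [R=301,L]

# add dot htm extension to requests lacking an extension
RewriteCond %{REQUEST_URI} !^/units/(uimages|audio|about|symbol|cheating|charts|country|general)$
RewriteRule ^(units/[\w-]+)$ https://example.com/$1.htm [R=301,L]

# change dot html to dot htm
RewriteRule ^([^.]+\.htm)l https://example.com/$1 [R=301,L]

# change http:// to https://
RewriteCond %{HTTPS} !on [OR]
RewriteCond %{HTTP_HOST} !^(example\.com)?$
RewriteRule (.*) https://example.com/$1 [R=301,L]
```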