special characters in html5


smallcompany

8:31 pm on Apr 30, 2014 (gmt 0)

I'm working on a redesign of a website. The old one had this for the markup:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />

The new one is this:

<!doctype html>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

The new site also has a language setting, as it's multilingual.

I noticed with the new site that, even after pasting from Notepad, all special characters go into the source code "as is", whereas the old site always encoded them. For example, I used to get &ndash; and &reg;, and non-English letters came out as &oslash; rather than ø, and so on.

Is this about HTML or charset encoding or both? I see that HTML5's standard is UTF-8.

Thanks

P.S.
I use Dreamweaver CS3.

lucy24

9:04 pm on Apr 30, 2014 (gmt 0)

Is this about HTML or charset encoding or both?

Both.

Part of your question really has to do with the specific behavior of Dreamweaver and/or Notepad, but I'm going to set that aside.

They Who Decide have decreed that in html5, the only entities are
&gt;
&lt;
&amp;
--that is, the three characters that have meaning within html itself. So no entities for characters like &mdash; &ndash; &nbsp; that are invisible to the naked eye.

Can you tell that I don't approve of this decision?* Fortunately there is still Backward Compatibility, so any entities you choose to use will be represented correctly for the foreseeable future.

In UTF8 encoding, visible characters don't need to be escaped just because they're non-ASCII. So it's absolutely appropriate that you're getting ø, é, ® and so on. You're saving bytes and also making your source code more readable.
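To put rough numbers on the byte saving, here is a minimal Python sketch. It is only an illustration, not anything Dreamweaver runs, and `fits_latin1` is a made-up helper name. It also shows one reason the old ISO-8859-1 pages needed entities: a character like the en dash cannot be stored in that charset at all.

```python
# Compare the byte cost of a literal character with its named entity.
literal = "ø"
entity = "&oslash;"

utf8_bytes = len(literal.encode("utf-8"))    # bytes used by the raw character
entity_bytes = len(entity.encode("utf-8"))   # bytes used by the entity text

print(utf8_bytes, entity_bytes)


def fits_latin1(ch: str) -> bool:
    """Return True if ch can be stored directly in ISO-8859-1 (hypothetical helper)."""
    try:
        ch.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


# ø and ® exist in ISO-8859-1; the en dash (U+2013) does not.
for ch in ("ø", "®", "\u2013"):
    print(repr(ch), fits_latin1(ch))
```

In UTF-8 every one of these characters can be written directly, which is why the new pages no longer need the entity form.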

* I also think that certain actions of the Unicode Consortium have ranged from silly to outrageous, but never mind that.

smallcompany

9:57 pm on Apr 30, 2014 (gmt 0)

Thank you.

So no entities for characters like &mdash; &ndash; &nbsp; that are invisible to the naked eye.

Should I not use it this way? I ask because how do I get an empty paragraph which I would do like <p>&nbsp;</p>?

Other than this, I'm totally fine to read source code as it looks in the browser - I like it that way.

BTW, I forgot to mention in my initial post that &, for example, gets written by Dreamweaver as &amp; (same for the other two), so it seems to be in line with the three entities you mentioned.

Thanks

lucy24

12:12 am on May 1, 2014 (gmt 0)

My two cents: Keep right on using any entities that you consider appropriate. The html5 validator may kick up a fuss, but nobody else will care.

how do I get an empty paragraph which I would do like <p>&nbsp;</p>?

Step next door to the css forum, where you will master tricks like

p.setoff {margin-top: 1.5em;}
...
<p class="setoff">blahblah</p>

graeme_p

5:38 am on May 1, 2014 (gmt 0)

I also think that certain actions of the Unicode Consortium have ranged from silly to outrageous, but never mind that.

Lucy, you cannot make a comment like that and say "never mind". PLEASE elaborate. Start another thread if you feel it's necessary, but I want to know what is silly about Unicode!

lucy24

7:10 am on May 1, 2014 (gmt 0)

Benign example: the out-of-whole-cloth invention of the word "caron" to denote the diacritic that literally everyone else on the planet knows as a hacek.

One item in my randomized set of e-mail signatures is "Save codepoint 1400!"

graeme_p

4:09 pm on May 1, 2014 (gmt 0)

The proportion of people on the planet who know what a hacek is called under either name is probably pretty small... I would probably have called it something like "an upside down circumflex like thingy" up till now.

mattur

2:46 pm on May 1, 2014 (gmt 0)

They Who Decide have decreed that in html5, the only entities are
&gt;
&lt;
&amp;
--that is, the three characters that have meaning within html itself. So no entities for characters like &mdash; &ndash; &nbsp; that are invisible to the naked eye.

I think you're getting mixed up with XML 1.0, where the only predefined entity references are &lt; &amp; &gt; &quot; &apos;

There are 2,231 named character references supported in HTML, including &mdash; &ndash; &nbsp; -
[whatwg.org...]
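If you want to check a particular name yourself, Python's standard library ships a copy of the HTML5 named character reference table. This is just a quick sketch for poking at it; the names looked up are the ones mentioned in this thread.

```python
import html
from html.entities import html5  # maps names such as "mdash;" to characters

# Look up a few of the named references discussed above.
for name in ("mdash;", "ndash;", "nbsp;", "oslash;", "reg;"):
    ch = html5[name]
    print(f"&{name} -> U+{ord(ch):04X}")

# html.unescape() resolves named references the way an HTML5 parser would.
decoded = html.unescape("&mdash; &ndash; &nbsp;")
print([f"U+{ord(c):04X}" for c in decoded])
```

All of these resolve, so they are ordinary named references in HTML5 and not leftovers from older versions.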

lucy24

6:01 pm on May 1, 2014 (gmt 0)

Heh, forgot about &quot; and &apos; -- but unlike &amp; &gt; &lt; * they have never been mandatory entities. Of course neither one should occur at all, except possibly in <code> sections, but that's a whole nother subject ;)

* If you are not concerned with validation, using > as-is is harmless. It's only the opening < that leads to calamity. I've also found by experiment that naked ampersands don't bother the validator; it's only when they're followed by another alphanumeric that there are problems. But most people probably don't spend a lot of time with ebooks using the "&c." form.

mattur

6:18 pm on May 3, 2014 (gmt 0)

* If you are not concerned with validation, using > as-is is harmless.

Unencoded ">" is not a validation error either ;)

A ">" character only has a special meaning when the HTML parser is in tag open state/end tag open state, so any ">" that appear outside tags are unambiguous - they're not closing a tag begun with "<" because there's no open start or end tag to close.

[whatwg.org...]

Unencoded ampersands are unambiguous (and not a validation error) if they're followed by a space, ">", "&" or EOF:

[whatwg.org...]

It's probably easier to just encode them :)
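For anyone generating markup from a script, "just encode them" can be a one-liner. A minimal Python sketch, assuming the text is headed for element content or an attribute value; `encode_text` and `encode_attr` are made-up helper names wrapping the standard `html.escape`.

```python
import html


def encode_text(text: str) -> str:
    """Escape &, < and > for element content (hypothetical helper)."""
    return html.escape(text, quote=False)


def encode_attr(value: str) -> str:
    """Also escape quote characters, for use inside an attribute value."""
    return html.escape(value, quote=True)


print(encode_text("Fish & chips, 3 < 5 > 2"))
print(encode_attr('say "hi"'))

# Decoding the escaped text gives back the original string.
round_trip = html.unescape(encode_text("a & b"))
print(round_trip)
```

Escaping everything is slightly more than the parser strictly needs, but it saves having to remember which unencoded cases are safe.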

lucy24

7:14 pm on May 3, 2014 (gmt 0)

Oh, I always encode ampersands. I have unencoded > when I'm quoting code, because it's a lot more readable if only the opening < is encoded. And, of course, saves three bytes a pop ;)

Unencoded ampersands are unambiguous (and not a validation error) if they're followed by a space

Wasn't that what I said?