HTML Forum

    
special characters in html5
smallcompany




msg:4667350
 8:31 pm on Apr 30, 2014 (gmt 0)

I'm working on a redesign of a website. The old one had this for the markup:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />

The new one is this:

<!doctype html>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

The new site also has a language setting, as it's multilingual.

I noticed that on the new site, even after pasting from Notepad, all special characters go into the source code "as is", whereas on the old site they were always encoded as entities, e.g. &ndash; &reg; and non-English letters like &oslash; instead of – ® ø and so on.

Is this about HTML or charset encoding or both? I see that HTML5's standard is UTF-8.

Thanks

P.S.
I use Dreamweaver CS3.

 

lucy24




msg:4667361
 9:04 pm on Apr 30, 2014 (gmt 0)

Is this about HTML or charset encoding or both?

Both.

Part of your question really has to do with the specific behavior of Dreamweaver and/or Notepad, but I'm going to set that aside.

They Who Decide have decreed that in html5, the only entities are
&gt;
&lt;
&amp;
--that is, the three characters that have meaning within html itself. So no entities for characters like &mdash; &ndash; &nbsp; that are invisible to the naked eye.

Can you tell that I don't approve of this decision?* Fortunately there is still Backward Compatibility, so any entities you choose to use will be represented correctly for the foreseeable future.

In UTF-8 encoding, visible characters don't need to be escaped just because they're non-ASCII. So it's absolutely appropriate that you're getting ø, ®, and so on as-is. You're saving bytes and also making your source code more readable.
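
As a rough illustration of the point about UTF-8, here is a minimal sketch of a page that mixes literal characters with their entity forms. It assumes the editor really saves the file as UTF-8, the title and sample text are made up, and it uses the short meta charset form that HTML5 allows in place of the longer http-equiv tag from the first post.

```html
<!doctype html>
<html lang="en">
<head>
  <!-- Short HTML5 charset declaration; the file must also be saved as UTF-8 -->
  <meta charset="utf-8">
  <title>Literal characters vs. entities (example)</title>
</head>
<body>
  <!-- Literal characters: the o-slash and registered sign take 2 bytes each in UTF-8, the en dash takes 3 -->
  <p>Smørrebrød® 2010–2014</p>
  <!-- Same text with entities: the o-slash entity is 8 bytes, the registered-sign entity 5, the en-dash entity 7 -->
  <p>Sm&oslash;rrebr&oslash;d&reg; 2010&ndash;2014</p>
</body>
</html>
```

Both paragraphs should render identically; only the source differs in size and readability.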


* I also think that certain actions of the Unicode Consortium have ranged from silly to outrageous, but never mind that.

smallcompany




msg:4667373
 9:57 pm on Apr 30, 2014 (gmt 0)

Thank you.

So no entities for characters like &mdash; &ndash; &nbsp; that are invisible to the naked eye.


So I should not use them this way? I ask because I'm wondering how to get an empty paragraph, which I would normally write as <p>&nbsp;</p>.

Other than this, I'm totally fine to read source code as it looks in the browser - I like it that way.

BTW, I forgot to mention in my initial post that &, for example, gets written by Dreamweaver as &amp; (same for the other two), so it seems to be in line with the three entities you mentioned.

Thanks

lucy24




msg:4667387
 12:12 am on May 1, 2014 (gmt 0)

My two cents: Keep right on using any entities that you consider appropriate. The html5 validator may kick up a fuss, but nobody else will care.

how do I get an empty paragraph which I would do like <p>&nbsp;</p>?

Step next door to the css forum, where you will master tricks like

p.setoff {margin-top: 1.5em;}
...
<p class = "setoff">blahblah</p>
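
Put together as one page, the CSS approach above might look like the sketch below. The class name setoff comes from the post; the 1.5em value and the sample text are only placeholders to adjust.

```html
<!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Spacing without an empty paragraph (example)</title>
  <style>
    /* Extra space above the paragraph replaces the empty-paragraph spacer */
    p.setoff { margin-top: 1.5em; }
  </style>
</head>
<body>
  <p>First paragraph.</p>
  <p class="setoff">Second paragraph, set off by its top margin.</p>
</body>
</html>
```

The spacing then lives in the stylesheet, so it can be changed in one place instead of by adding or removing spacer paragraphs.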

graeme_p




msg:4667437
 5:38 am on May 1, 2014 (gmt 0)

I also think that certain actions of the Unicode Consortium have ranged from silly to outrageous, but never mind that.


Lucy, you cannot make a comment like that and say "never mind". PLEASE elaborate. Start another thread if you feel it's necessary, but I want to know what is silly about Unicode!

lucy24




msg:4667472
 7:10 am on May 1, 2014 (gmt 0)

Benign example: the out-of-whole-cloth invention of the word "caron" to denote the diacritic that literally everyone else on the planet knows as a hacek.

One item in my randomized set of e-mail signatures is "Save codepoint 1400!"

graeme_p




msg:4667519
 4:09 pm on May 1, 2014 (gmt 0)

The proportion of people on the planet who know what a hacek is called under either name is probably pretty small... I would probably have called it something like "an upside down circumflex like thingy" up till now.

mattur




msg:4667589
 2:46 pm on May 1, 2014 (gmt 0)

They Who Decide have decreed that in html5, the only entities are
&gt;
&lt;
&amp;
--that is, the three characters that have meaning within html itself. So no entities for characters like &mdash; &ndash; &nbsp; that are invisible to the naked eye.

I think you're getting mixed up with XML 1.0, where the only predefined entity references are &lt; &amp; &gt; &quot; &apos;

There are 2,231 named character references supported in HTML, including &mdash; &ndash; &nbsp;:
[whatwg.org...]
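
For what it's worth, each named reference also has numeric equivalents, so the same character can be written three ways. A small sketch, where the code points are the standard Unicode ones and the page itself is only an example:

```html
<!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Named and numeric character references (example)</title>
</head>
<body>
  <!-- Each line shows the named, decimal, and hexadecimal form of one character -->
  <p>Em dash (U+2014): &mdash; &#8212; &#x2014;</p>
  <p>En dash (U+2013): &ndash; &#8211; &#x2013;</p>
  <!-- The no-break space is invisible, so it is shown between brackets -->
  <p>No-break space (U+00A0): [&nbsp;] [&#160;] [&#xA0;]</p>
</body>
</html>
```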

lucy24




msg:4667662
 6:01 pm on May 1, 2014 (gmt 0)

Heh, forgot about &quot; and &apos; -- but unlike &amp; &gt; &lt; * they have never been mandatory entities. Of course neither one should occur at all, except possibly in <code> sections, but that's a whole nother subject ;)


* If you are not concerned with validation, using > as-is is harmless. It's only the opening < that leads to calamity. I've also found by experiment that naked ampersands don't bother the validator; it's only when they're followed by another alphanumeric that there are problems. But most people probably don't spend a lot of time with ebooks using the "&c." form.

mattur




msg:4668264
 6:18 pm on May 3, 2014 (gmt 0)

* If you are not concerned with validation, using > as-is is harmless.

Unencoded ">" is not a validation error either ;)

A ">" character only has a special meaning when the HTML parser is in tag open state/end tag open state, so any ">" that appears outside tags is unambiguous - it's not closing a tag begun with "<" because there's no open start or end tag to close.

[whatwg.org...]

Unencoded ampersands are unambiguous (and not a validation error) if they're followed by a space, ">", "&" or EOF:

[whatwg.org...]

It's probably easier to just encode them :)
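
A short sketch of the cases discussed in this exchange, for anyone who wants to run them through a validator. The sample text is invented, and the comments only restate what the posts above say.

```html
<!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Ampersand and angle-bracket cases (example)</title>
</head>
<body>
  <!-- Described above as unambiguous when left unencoded -->
  <p>Fish & chips</p>   <!-- ampersand followed by a space -->
  <p>3 > 2</p>          <!-- ">" outside any tag -->

  <!-- Cases where encoding is needed or simply safer -->
  <p>2 &lt; 3</p>       <!-- "<" is the character that opens a tag, so it is encoded -->
  <p>rock &amp;c.</p>   <!-- ampersand directly followed by a letter -->
</body>
</html>
```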

lucy24




msg:4668271
 7:14 pm on May 3, 2014 (gmt 0)

Oh, I always encode ampersands. I have unencoded > when I'm quoting code, because it's a lot more readable if only the opening < is encoded. And, of course, saves three bytes a pop ;)

Unencoded ampersands are unambiguous (and not a validation error) if they're followed by a space

Wasn't that what I said?

WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved