
HTML Forum

    
Local UTF-8 pages coming out as plain text
ChanandlerBong

5+ Year Member



 
Msg#: 4496067 posted 12:52 am on Sep 17, 2012 (gmt 0)

What a nightmare weekend!

I have changed all the encoding in all my web pages to UTF-8, had enough of those pesky little black diamonds for unknown characters. Added AddDefaultCharset UTF-8 to htaccess too.

But now I have one final problem. When I view webpages locally, they're coming out as plain text. When I view "page info" in FF, it says:

Type: text/plain
Render Mode: quirks mode
Encoding: UTF-8

The very same page viewed on my server is fine: text/html, Standards compliance mode, UTF-8

why the difference? Because of the htaccess file? Is that not being applied locally? Will I need to add content-type meta tags to all my pages in order to be able to view them locally as webpages? When I added this meta tag to one page, it came out fine locally. Interestingly, it was the content="charset=UTF-8" that got it working, rather than the content="text/html;". So it seems that although it's now encoded in UTF-8, Firefox doesn't realise that unless it's told.

Why did the change to the UTF-8 encoding make this difference? I never before had problems viewing webpages locally but my text editor encoded them as "Dos/Windows" (whatever that is!)

I would prefer not to add content-type meta tags to 3k pages, just to be able to view them locally. Any ideas?

 

tedster

WebmasterWorld Senior Member tedster us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 4496067 posted 3:16 am on Sep 17, 2012 (gmt 0)

The "Type" field you are seeing is sent by the server - in the HTTP header that precedes the actual HTML file. So that information is not coming from the <head> section of your page's HTML or from its meta tags.

The server itself is set up this way (in error, IMO) and that is where the change needs to happen.

phranque

WebmasterWorld Administrator phranque us a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



 
Msg#: 4496067 posted 3:30 am on Sep 17, 2012 (gmt 0)

when you say "view locally" do you mean using file protocol or are you requesting a page using web protocol on http://localhost/...?

[edited by: phranque at 3:34 am (utc) on Sep 17, 2012]

ChanandlerBong

5+ Year Member



 
Msg#: 4496067 posted 3:34 am on Sep 17, 2012 (gmt 0)

yes, I do nothing in the htaccess for mime type - so you're saying that my server is sending a default mime type and that, technically, if I don't declare the type in a meta tag nor in the htaccess, that the webpage should always get seen as plain text (whether locally or via the server) unless I specifically declare the mime type?

at the moment, my pages are being served fine online, but having them come out in plain text locally is driving me crazy and the only way I can prevent that (now all pages are correctly encoded as UTF-8) is to add the content-type meta tag.

[edited by: ChanandlerBong at 3:38 am (utc) on Sep 17, 2012]

ChanandlerBong

5+ Year Member



 
Msg#: 4496067 posted 3:37 am on Sep 17, 2012 (gmt 0)

phranque, I mean "open in browser" from my text editor, NoteTab. It opens in my default browser, FF, and is plain text. Just off local disk, not using localhost/xampp or anything like that. When I copy and paste from the address bar and open in IE, same thing so this isn't a FF oddity.

g1smd

WebmasterWorld Senior Member g1smd us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 4496067 posted 6:11 am on Sep 17, 2012 (gmt 0)

"Open in browser" is not the same thing as "serving" the pages with a HTTP header.

For testing, you should "serve" them.
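One low-effort way to "serve" a folder for testing is Python's built-in http.server module. The sketch below is only an illustration: the names Utf8Handler and make_server are made up for this example, and mapping ".php" to text/html is an assumption for previewing the markup. This server does not run PHP, so includes will not be processed.

```python
import functools
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


class Utf8Handler(SimpleHTTPRequestHandler):
    """Static file handler that labels HTML-like files as UTF-8 text/html."""

    # Copy the stock extension table, then override the entries we care about.
    # Treating ".php" as HTML is an assumption for preview purposes only.
    extensions_map = {
        **SimpleHTTPRequestHandler.extensions_map,
        ".html": "text/html; charset=utf-8",
        ".htm": "text/html; charset=utf-8",
        ".php": "text/html; charset=utf-8",
    }


def make_server(directory=".", port=8000):
    """Build (but do not start) a localhost-only server for `directory`."""
    handler = functools.partial(Utf8Handler, directory=directory)
    return ThreadingHTTPServer(("127.0.0.1", port), handler)
```

Calling make_server("path/to/site").serve_forever() and then opening http://127.0.0.1:8000/ in the browser means every page arrives with a real Content-Type header, which a file opened straight off the disk never has.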

lucy24

WebmasterWorld Senior Member lucy24 us a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



 
Msg#: 4496067 posted 8:08 am on Sep 17, 2012 (gmt 0)

If it's a properly coded page it should open perfectly well from a local copy. The only difference is if your stylesheet uses a site-absolute link ("/stylesheet.css"). The page will come out styleless-- and similarly for images. But it should still be html. So there's some other problem. The reference to "quirks mode" makes me uneasy; it makes it sound as if something is missing from the dtd or elsewhere in the head.

That's assuming the pages don't have SSIs or php stuff that will only work if they're coming from a server.

Digression:
Thanks to MAMP, I've now got three count 'em three varieties of "local" viewing.
--Just-like-real via the pseudo-server.
--Almost-like real in a browser-- somewhere along the line they've all learned to show directory indexes, so links ending in "directory/" involve only a brief stopover to find the index.html file, not a terminal "Ain't no such file on this computer".
--Less accurate but instantly responsive using SubEthaEdit's www preview-- built with webkit, so it pretty much looks like Safari. Perfect for ebooks, where everything is either internal or relative.

phranque

WebmasterWorld Administrator phranque us a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



 
Msg#: 4496067 posted 9:47 am on Sep 17, 2012 (gmt 0)

Handling character encodings in HTML and CSS:
http://www.w3.org/International/tutorials/tutorial-char-enc/

Always declare the encoding of your document. Use the HTTP header if you can. Always use an in-document declaration too.

ChanandlerBong

5+ Year Member



 
Msg#: 4496067 posted 11:42 am on Sep 17, 2012 (gmt 0)

I do use an HTML 5 DTD, short and sweet:


<!DOCTYPE html>
<HTML>
<HEAD>


I do use php includes on the page, but that's not the issue. Have always done that and, no they don't get included (header, footer, menus, etc) but the rest of the page would come out fine, not as plain text.

Phranque, are you saying I do need to bite the bullet and put a content-type meta tag on each page declaring both the MIME type and charset?

I said before, off the server, the pages are fine, the MIME type is detected as text/html, the charset as utf-8. As a test last night, I took off the AddDefaultCharset from my htaccess and the pages are still served fine. None of the pages have a content-type meta tag, so at that point, what is the difference between locally served files and the pages from the server? As has been suggested already in this thread, is something automatically set on my server to do that?

phranque

WebmasterWorld Administrator phranque us a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



 
Msg#: 4496067 posted 1:34 pm on Sep 17, 2012 (gmt 0)

you should use something like the Live HTTP Headers add-on for Firefox.
this will show you the HTTP Response headers that are returned with the document by the server.
in this case the Content-Type header is relevant and you should look for a header that looks like:

Content-Type: text/html; charset=UTF-8
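For anyone who wants to pull that header apart in a script rather than read it by eye, here is a minimal sketch using Python's standard email.message module. The function name parse_content_type is made up for this example; it only splits a header value you already have and does not fetch anything.

```python
from email.message import Message


def parse_content_type(header_value):
    """Split a Content-Type header value into (mime_type, charset)."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when there is no charset parameter,
    # and lower-cases the value when there is one.
    return msg.get_content_type(), msg.get_content_charset()
```

The first element is the MIME type that decides whether the browser renders HTML or shows plain text; the second is the encoding, which is a separate question.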


as far as your offline problem, you need to change the default encoding in your browser.
it should be something like:
Tools/Options/Content/Fonts & Colors - Advanced.../Default Character Encoding/Unicode (UTF-8)

that will solve the problem for your browser but the general solution is the meta http-equiv content-type element.

ChanandlerBong

5+ Year Member



 
Msg#: 4496067 posted 2:08 pm on Sep 17, 2012 (gmt 0)

it seems FF is not seeing the doc as html 5.

This:

<meta charset="UTF-8">

locally, is getting the file opened as plain text. Page info gives "text/plain" as Type.

When I add either:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

and in fact just

<meta http-equiv="Content-Type" charset=utf-8">

it looks fine locally. FF Page Info then gives "text/html" as Type.

I altered the default character encoding in FF and it made no difference unfortunately.

lucy24

WebmasterWorld Senior Member lucy24 us a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



 
Msg#: 4496067 posted 4:46 pm on Sep 17, 2012 (gmt 0)

This:

<meta charset="UTF-8">

locally, is getting the file opened as plain text. Page info gives "text/plain" as Type.

Isn't happening over here. What FF version and OS are you on? html5 is supposed to be all lower case, but my FF doesn't seem to care.

Setting a default character encoding should have no effect. That's just to tell the browser what to use when
:: cough-cough ::
a page doesn't include a charset declaration.

<meta http-equiv="Content-Type" charset=utf-8">

Did something fall out of the cut & paste there? As printed, I would expect it to make things worse ;)

ChanandlerBong

5+ Year Member



 
Msg#: 4496067 posted 5:40 pm on Sep 17, 2012 (gmt 0)

yes, I know it would normally include the text/html part, but I was testing and it seemed to open OK with just that stub.

I'm on Vista. The only thing I've changed since last week when they were opening fine locally is the UTF-8 encoding. Even updated FF just in case. Same on IE too, so it's something in the file itself. If I add the old style content-type meta, it's great. Everything online, from the server, is great.

I know this will be something really dumb. :)

tedster

WebmasterWorld Senior Member tedster us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 4496067 posted 6:49 pm on Sep 17, 2012 (gmt 0)

So (if I have this right) the problem only occurs with local files.
  1. Do you use a local server, or just open the HTML file in a browser?
  2. What is the document's DTD?
  3. What is the file extension? For example, aspx or html or none or...?

ChanandlerBong

5+ Year Member



 
Msg#: 4496067 posted 7:27 pm on Sep 17, 2012 (gmt 0)

1. Just open file in browser. Whether I click "open in browser" from text editor or double click on file itself, same plain text result.

2. html 5 DTD:

<!DOCTYPE html>

3. All files are php extensions. Never caused a problem before (and there's tiny amount of actual php in there, few includes).

phranque

WebmasterWorld Administrator phranque us a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



 
Msg#: 4496067 posted 7:33 pm on Sep 17, 2012 (gmt 0)

the file extension shouldn't matter - it's an encoding issue not a rendering issue.
it's the charset that needs to be specified.

lucy24

WebmasterWorld Senior Member lucy24 us a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



 
Msg#: 4496067 posted 10:20 pm on Sep 17, 2012 (gmt 0)

But that would have no effect on whether the document opens as html or not. At worst, if the doc contains non-ASCII characters, they might display as garbage.

As far as I can tell, I ran up the identical document-- that is, identical dtd and head section with just the "meta charset" version-- and it worked fine in FF. I included a minimalist table to make sure.

Look again at this:
<meta http-equiv="Content-Type" charset=utf-8">

I wasn't referring to the absent "text/html" ;) when I asked about cut & paste. If that comes through as html, you'd think anything would. Matter of fact, it works perfectly well in Camino, which is essentially Firefox Lite.




OK, stop the presses. Have you looked at FF's Error Console?
Timestamp: 17/09/12 3:10:32 PM
Error: An unsupported character encoding was declared for the HTML document using a meta tag. The declaration was ignored.

That's the, ahem, incomplete version. The minimalist version with
<HTML>
<HEAD>

<meta charset="UTF-8">

didn't raise a peep.

Incidentally, I discovered through opening another random experimental doc that if you don't give a "charset" at all, the FF Error Console will scream
Error: The character encoding of the HTML document was not declared. The document will render with garbled text in some browser configurations if the document contains characters from outside the US-ASCII range. The character encoding of the page must to be declared in the document or in the transfer protocol.

This is pretty funny because it reminds me of the boilerplate I include at the beginning of every e-text for the benefit of readers with antiquated browsers :) There of course I do specify a charset; it's the browser and/or OS that may not be up to speed.

But even then, the doc will display as HTML.

ChanandlerBong

5+ Year Member



 
Msg#: 4496067 posted 12:15 am on Sep 18, 2012 (gmt 0)

FF error console no help. Nothing showing when I load the file/page.

I did what I usually do in these situations and began stripping the page down to bare essentials. I actually got all the way to this:

<!DOCTYPE html>
<html>
<head>
<title>Test Page</title>
<meta charset="UTF-8">

</head>
<body>

hello world.

</body>
</html>


this opens as plain text in FF 15. When I double click on the file and when I "open in browser" from notetab, my text editor. It's encoded in UTF-8, I know that. On the server, all is fine. Locally, it's a car wreck! :)

Now, where it gets bizarre.

I copy that entire page and paste into new empty document, create a new page in the same folder. "open in browser"...shows as html.

So then I restore the faulty page back up to its full size (still opens as plain text), copy and paste that into a new file, save as test.php, opens as html. Any new page I create with the exact same code works fine.

Is my PC haunted?

phranque

WebmasterWorld Administrator phranque us a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



 
Msg#: 4496067 posted 1:12 am on Sep 18, 2012 (gmt 0)

have you tried resaving the file?
maybe the file is corrupted or it's a BOM issue.

i would move the charset definition to immediately after the head and definitely before the title in case you have UTF-8 characters in your title.
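The HTML specification asks for the encoding declaration to sit entirely within the first 1024 bytes of the file, which is one more reason to put it early. A rough way to check a file for that, assuming a simple regular expression is good enough (it is not a real HTML parser, and the name declared_charset is made up for this sketch):

```python
import re

# Matches <meta charset="..."> as well as the older http-equiv form,
# because both contain "charset=" somewhere inside a <meta ...> tag.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def declared_charset(data, window=1024):
    """Return the charset named in the first `window` bytes, or None."""
    match = META_CHARSET.search(data[:window])
    return match.group(1).decode("ascii").lower() if match else None
```

Feed it the raw bytes of a page, for example open(path, "rb").read(). A result of None means no declaration was found near the top of the file.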

"it's an encoding issue not a rendering issue"

no clue what i was talking about there...
=8)

ChanandlerBong

5+ Year Member



 
Msg#: 4496067 posted 1:53 am on Sep 18, 2012 (gmt 0)

Yes, I had resaved it before (and ensuring it was being saved as UTF-8 too). NoteTab is a very good text editor and I don't think it's doing anything odd. When I re-save the test page (containing identical html code), also as UTF-8, it opens, as expected, as html.

Moved the charset declaration up to very top of head - no dice.

I still do think ultimately it's an encoding issue. For some reason FF is seeing it as text/plain. When I copy the code into new page and save, it's seeing that new page as text/html. There has to be something about the original page, some invisible byte or two, that's doing that.

here's one other among a basketful of oddities.

I explained earlier that when I put a html 4.01 content-type meta tag into one of these pages being rendered as plain text, they then showed as html. Then when I take out the content-type meta tag and put back in the html charset definition, it still renders fine as html. I presume that's caching? However, even after clearing cache, I still see that behaviour.

lucy24

WebmasterWorld Senior Member lucy24 us a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



 
Msg#: 4496067 posted 2:07 am on Sep 18, 2012 (gmt 0)

"Is my PC haunted?"

Could well be. I guess "the file is corrupted" is the grownup way of saying it ;)

Does your text editor have an option for showing invisible text, or searching for characters by code range? That does seem like the likeliest explanation at this point. I don't think I've ever had it happen in html, but FutureBasic used to pull that on me all the time. Add one absolutely innocuous line of code, and suddenly the compiled version starts crashing every time it hits some other, completely unrelated line. Sigh.

If there's a lurking BOM-- there really shouldn't be-- it will show up if you manually change the page's encoding to ISO-Latin-1. You'll then see

{iumlaut; right-pointing guillemet; inverted question mark}

at the very beginning of the file. That's %ef%bb%bf --it seems safer not to write it out, or there may be encoding disasters in this very post!
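If the editor will not show it, a few lines of script can look at the raw bytes instead. A minimal sketch, with a made-up function name, that reads only the first three bytes of a file and compares them with the UTF-8 BOM constant from Python's codecs module:

```python
import codecs


def starts_with_utf8_bom(path):
    """Return True if the file's first three bytes are EF BB BF."""
    with open(path, "rb") as handle:
        return handle.read(3) == codecs.BOM_UTF8
```

Opening the file in binary mode matters here: in text mode the decoder may swallow or translate the BOM before you get a chance to see it.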

Then again, you may prefer to just toss the old file and think no more about it. Heh.

"when I put a html 4.01 content-type meta tag into one of these pages being rendered as plain text, they then showed as html. Then when I take out the content-type meta tag and put back in the html charset definition, it still renders fine as html. I presume that's caching"

Or it could be that the guilty corrupt character isn't picked up when you cut and paste, so it has simply disappeared, never to return :)

ChanandlerBong

5+ Year Member



 
Msg#: 4496067 posted 2:23 am on Sep 18, 2012 (gmt 0)

I selected the "show invisible text" and nothing was there at all.

I chose the option to re-save the page in "ANSI", text editor toolbar at foot of page showed "Encoding: DOS/Windows" and the page displayed as html, I sort of knew it would as the old encoding for all these pages was 'DOS/Windows' and the problem emerged (on every page of my site...not just a few) only when I altered the encoding to UTF-8.

Caching issue again when I return that page from ANSI to UTF-8: it displays as html. I have 3000 pages in my site all with the same issue, but once I get any of them to display as html (either by adding meta content-type tag or by encoding it to ANSI) it continues to display as html even when I revert back to html 5 charset definition or utf-8 encoding. So then I need a new test page.

driving me crazy. :(

ChanandlerBong

5+ Year Member



 
Msg#: 4496067 posted 11:53 pm on Sep 22, 2012 (gmt 0)

I saved one of my web pages today on my mac, saved it as html file, opened in mac's text edit application and found the following characters at the very start of the file



this is apparently a BOM and I'm guessing is what is causing my browser to show it as plain text locally (still having NO problems online, off the server, only locally)

My text editor I use on my Win laptop doesn't show this at all, but Mac's Text Edit does.

1. Is this the problem?
2. How do I do a search and replace for a "character" my text editor doesn't see?
3. How did this BOM get there yet when I create and encode new pages in utf-8, I don't seem to have a problem?

This is continuing to frustrate the living daylights out of me, which is why I'm continuing to dig into it. This seems like progress.

this part of the BOM wiki page seems to suggest problems caused by its presence:

[en.wikipedia.org ]

ChanandlerBong

5+ Year Member



 
Msg#: 4496067 posted 11:57 pm on Sep 22, 2012 (gmt 0)

lucy, realised now you mentioned BOM. I had used "show invisible characters" and nothing showed in my text editor, on my mac now so will try what you suggested (altering encoding to see it) tomorrow back on my normal windows laptop where my site/text editor are. I have 3k files with this little so-and-so in it so need to search and replace with nothing to get rid of it.

I think I know how it got in there. I used an application (some cheesy name like "UTF-8 Convert") to do the initial conversion of all the pages to UTF-8, so I think that's the bad guy in this situation. I'll find that app's forum/site and see if others have reported this issue and see if I can re-convert without the BOM.

still unsure if this is 100% the culprit but seems a heavily smoking gun. :)

lucy24

WebmasterWorld Senior Member lucy24 us a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



 
Msg#: 4496067 posted 2:12 am on Sep 23, 2012 (gmt 0)

"found the following characters at the very start of the file ..."

Punch line: The very existence of this piece of text sent my browser into UTF-8 encoding, so all I saw was a blank line. Convert manually to Latin-1 and there they are.

I once found a www page that contained three separate BOMs at various places. But it also had an explicit "8859-1" declaration, so they came through as text.

There exist utilities that let you edit files without opening them one by one. (This is the sum total of my knowledge, so don't ask.)
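A short script can do the same job as such a utility. The sketch below is one possible approach, not a recommendation of any particular tool: the function names are made up, the "*.php" pattern is an assumption based on the file extension mentioned in this thread, and it rewrites files in place, so work on a backup copy first.

```python
import codecs
from pathlib import Path


def strip_bom(path):
    """Remove one leading UTF-8 BOM from `path`. Return True if changed."""
    data = path.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    # Everything after the three BOM bytes is written back untouched.
    path.write_bytes(data[len(codecs.BOM_UTF8):])
    return True


def strip_bom_in_tree(root, pattern="*.php"):
    """Strip BOMs under `root`; return the list of files that were changed."""
    changed = []
    for path in sorted(Path(root).rglob(pattern)):
        if path.is_file() and strip_bom(path):
            changed.append(path)
    return changed
```

Calling strip_bom_in_tree("path/to/site") returns the files it touched, which makes it easy to spot-check a few of them in the browser afterwards.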

Leosghost

WebmasterWorld Senior Member leosghost us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 4496067 posted 2:30 am on Sep 23, 2012 (gmt 0)

Editing files without opening them ..and in batches too..

You can try notepad++ for windows ( freeware )..and nautilus on ubuntu ( even freer ware :) and derivatives ( I don't know for sure if they will edit out BOM though )..and on mac? ..joss sticks and repeating "it just works" ? ;)

ChanandlerBong

5+ Year Member



 
Msg#: 4496067 posted 2:46 pm on Sep 23, 2012 (gmt 0)

lucy/leosghost, cheers, will get onto this later today.

lucy, when you talk about converting to 8859-1 and me seeing the BOM characters, I presume this is not the same as altering the charset encoding which is now utf-8. Do you mean just within the actual editor so I can see the characters? You don't want me to alter the utf-8 charset that all my pages are now?

I think my text editor is alright, saving to utf-8 fine without the BOM - hence why I was getting the issue I reported earlier in the thread that I was copying and pasting a "bad page" html into a "new file", saving with text editor and the page was saving fine and opening in a browser okay. So I definitely think the utf-8 converter tool I downloaded is the guilty party, it went and added BOM to 3k files....doh!

I will report back - would love to see this thread get the "closure" it deserves. :)

lucy24

WebmasterWorld Senior Member lucy24 us a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



 
Msg#: 4496067 posted 11:02 pm on Sep 23, 2012 (gmt 0)

There are two different things: convert and reinterpret. (These are the words my text editor uses. I don't know if there is a universally accepted standard usage.)

convert = keep the same visible characters, but change how they are stored behind the scenes.

reinterpret = keep the same behind-the-scenes data, but change what you see. Some programs can do this "on the fly"; others can only do it when you first open the file.

-- When the BOM comes through as {iumlaut; right-pointing guillemet; inverted question mark} you are reinterpreting from UTF-8 to Latin-1. The stored data was and remains EFBBBF. The "EF" element tells the text reader that the complete character will take up three bytes (16^2 three times).

-- Conversely, if you start out in UTF-8 and type {iumlaut; right-pointing guillemet; inverted question mark}, the data gets stored as C3AFC2BBC2BF. This time the "C2" or "C3" component says that the complete character will take up two bytes. (So does leading D. Leading F is four bytes.) If you reinterpret this as Latin-1-- either intentionally or by accident-- you end up with {Atilde; macron; Acircumflex; right-pointing guillemet; Acircumflex; inverted question mark} because each byte is a new character.

-- If you start out in Latin-1 and type {iumlaut} et cetera and then try to reinterpret to UTF-8, your text will disappear, because the BOM is a non-displaying character.

-- And if you type the same thing, only with spaces between the three letters, you will get an error message or a string of angry question marks within black diamonds-- the UTF-8 "I can't display this" character-- because the three individual bytes EF, BB and BF are not valid UTF-8 on their own.

You can keep converting and reinterpreting until the cows come home, blowing each original letter into more and more bytes. But you are probably starting to get a headache. I am, anyway ;)
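The convert/reinterpret distinction is easy to try out in Python, where decode() plays the part of "reinterpret these bytes as..." and encode() the part of "store these characters as...". A small sketch (the variable names are made up for this illustration):

```python
bom_bytes = bytes.fromhex("EFBBBF")

# Reinterpret: the same three bytes, read two different ways.
as_latin1 = bom_bytes.decode("latin-1")  # three visible characters
as_utf8 = bom_bytes.decode("utf-8")      # one invisible character, U+FEFF

# Convert: keep the three visible characters, but store them as UTF-8.
converted = as_latin1.encode("utf-8")    # now six bytes instead of three

# Reinterpret the converted bytes as Latin-1: every byte is its own character.
doubled = converted.decode("latin-1")

# The three BOM bytes separated by spaces: each stray byte is invalid UTF-8
# and is replaced by U+FFFD, the question mark in a black diamond.
spaced = b"\xef \xbb \xbf".decode("utf-8", errors="replace")

print(len(as_latin1), len(as_utf8), converted.hex().upper(), len(doubled))
print(repr(spaced))
```

Each round of converting and then reinterpreting grows the text, which is how a single mis-handled character can balloon into a string of accented junk.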

ChanandlerBong

5+ Year Member



 
Msg#: 4496067 posted 1:09 am on Sep 24, 2012 (gmt 0)

problem solved. Removed BOM from all my pages and they now open locally as html.

Still confused how modern browsers should trip over this, especially as I've read elsewhere "most modern browsers will not have problems with the existence of a BOM at the start of a file".

Anyway, thanks for help and tips from everyone.
