Forum Moderators: phranque

best practice for HTML entities vs. unaccented letters?

what should we use in title tags etc

         

httpwebwitch

5:12 am on Jun 3, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



We had a lively discussion today regarding what to do in the case of English words of foreign origin that use accented letters. For example, "café" vs. "cafe"

Google returns different results for searches using the proper accented form vs. the unaccented form of the word, so the two are certainly treated differently. Searches for other words like "jalapeño" and "über" confirmed it.

In English, I have to assume that most people type in "Hard Rock Cafe" (no accent), and I'd guess that a large portion of monolingual English users don't know how to type extended characters.

What is the right thing to do? I would like to use the proper accented form "café" on the page, but I would also like to show up when someone searches for the "flattened" version: "cafe". Top ranking sites for "cafe" and "café" seem to use a variety of forms on the page, some with the accent and some without.

We also discussed whether using HTML entities to represent the characters had an effect on search matching. Of course in browser rendering, &eacute; or &#233; will be displayed as "é". Is there a difference between "café" and "caf&eacute;" and "caf&#233;"? Will a search for "café" return my page if I consistently use the HTML entity and not the actual ISO 8859-1 character?
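
To make the three spellings concrete, here is a minimal test page. It is only a sketch: the charset declaration, the page title, and the paragraph text are my own assumptions for illustration, and it assumes the file is actually saved in the declared encoding. All three paragraphs should look the same in a browser; whether a search engine indexes them the same way is exactly the open question.

```html
<!DOCTYPE html>
<html lang="en">
<head>
<!-- Assumption: the file is saved as UTF-8 to match this declaration -->
<meta charset="utf-8">
<title>cafe test page</title>
</head>
<body>
<p>café</p>         <!-- literal character typed directly -->
<p>caf&eacute;</p>  <!-- named entity -->
<p>caf&#233;</p>    <!-- decimal numeric character reference -->
</body>
</html>
```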

On a related note,
I did recently notice that using <sup> tags in a title is BAD - where the title is E=mc<sup>2</sup>

For instance,
<title>E=mc<sup>2</sup></title>

shows up in the window and in Google SERPs as
"E=mc<sup>2</sup>", not "E=mc2"

Using <sup> with CSS within a display:block element also screws things up, since the superscript pushes the height of the line above the designated line-height property. The result is text that starts above the ascender, pushes the whole line down into misalignment, and the words get truncated across the bottom.

See for yourself: take this and look at how it renders in IE:


<style>
.t{
font-size:14px;
line-height:14px;
border:1px solid black;
background:#000;
color:#fff;
width:200px;
}
</style>
<div class="t">this text is <sup>not</sup> good</div>
<br>
<div class="t">this text is good</div>
<br>
<div class="t">this text is good this text is good this text is good this text is good this text is <sup>not</sup> good this text is good this text is good this text is good </div>

There may be a simple CSS hack to solve this problem - if there is I'd like to know what it is. Still, shame on the browser that can't render a superscript without mussing up my line spacing.
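
One workaround that is commonly used for this, offered as a sketch and not tested against every IE version: stop the superscript from taking part in the line-height calculation and raise it with a relative offset instead. The .t class is the one from the test case above; the specific values (75%, -0.5em) are assumptions you may need to tune for your font.

```html
<style>
.t {
  font-size: 14px;
  line-height: 14px;
  border: 1px solid black;
  background: #000;
  color: #fff;
  width: 200px;
}
.t sup {
  font-size: 75%;           /* shrink the raised text */
  line-height: 0;           /* keep the sup from enlarging the line box */
  position: relative;       /* raise it by offset, not by vertical-align */
  top: -0.5em;
  vertical-align: baseline; /* cancel the default vertical-align: super */
}
</style>
<div class="t">this text is <sup>now</sup> aligned</div>
```

The idea is that a relatively positioned element is shifted after layout, so the raised text no longer pushes the line taller than the 14px you asked for.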

Any thoughts?

httpwebwitch

1:16 am on Jun 7, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



bump

rjohara

1:23 am on Jun 7, 2005 (gmt 0)

10+ Year Member



I used to use some special punctuation (actually just correct punctuation) in my TITLE elements: curved apostrophes and accented characters, for example. I did, that is, until I noticed they were *not* being picked up in regular Google searches for the "plain" form. I reverted all TITLE elements to low ascii only, and referrals jumped way up.

Bottom line: Google *should* know that a curved apostrophe (’) and a plain apostrophe (') are the same, but it doesn't. I would not use anything but low ascii in TITLE elements for the time being.

Also: I didn't check the standard, but I'm pretty sure no other markup is permitted within the TITLE element, so I definitely wouldn't try that approach. There are superscript characters in Unicode, so it is possible to include such a character directly. It will display correctly, but I wouldn't count on it being found by a search.
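
As a rough illustration of the options for the E=mc2 title (pick one; a page has only one TITLE element, and these are shown together only for comparison):

```html
<!-- Avoid: markup inside TITLE is shown as literal text -->
<!-- <title>E=mc<sup>2</sup></title> -->

<!-- Option A: the superscript-two character (U+00B2) as a numeric reference.
     Displays as a raised 2, but may not match a search for the plain form. -->
<title>E=mc&#178;</title>

<!-- Option B: low ascii only, the form people are most likely to type -->
<title>E=mc2</title>
```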

iamlost

5:01 pm on Jun 7, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The different SEs differ in their ability to recognise/render the HTML character set, probably because the big three are American (English-speaking) companies and want to keep their bots as simple as possible. That is poor thinking in an "international" market, but it is the reality.

I have made a study of this over the years and suggest the following: enter each code, e.g. "&#178;" or "&sup2;", directly in each SE's search box and look at the results. You can quickly build up a grid or DB showing which codes each SE recognises/renders correctly.

For instance, "&#178;" is correctly recognised/rendered by G, M, and Y, but only Y recognises/renders "&sup2;".

It is sobering to see the apparent lack of test/check when looking at such SERPs. People obviously just "expect" everything to display as on their machine/browser. Very large sites make the same mistake(s). The SE bots are not browsers.

As for character codes in SERPs (e.g. &#69;, &#101;, &#200;, &#201;, &#202;, &#203;, &#232;, &#233;, &#234;, &#235;): are they recognised/rendered, and if so, are they ranked individually or lumped together? Again the SEs differ (surprise!) somewhat.

On some sites the issue (e.g. cafe vs. café) matters little and is easily accommodated. On others, multiple instances of many of these "higher" character codes are essential, and meeting that need is challenging.
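
For anyone who wants to check the codes listed above on their own machine first, a small reference table like this may help. It is just a sketch; the table layout is my own choice. The left column shows each code as typed (the ampersand is escaped so it is not interpreted) and the right column lets the browser render it.

```html
<table border="1">
<tr><th>Code as typed</th><th>Renders as</th></tr>
<tr><td>&amp;#69;</td><td>&#69;</td></tr>   <!-- E -->
<tr><td>&amp;#101;</td><td>&#101;</td></tr> <!-- e -->
<tr><td>&amp;#200;</td><td>&#200;</td></tr> <!-- E grave -->
<tr><td>&amp;#201;</td><td>&#201;</td></tr> <!-- E acute -->
<tr><td>&amp;#202;</td><td>&#202;</td></tr> <!-- E circumflex -->
<tr><td>&amp;#203;</td><td>&#203;</td></tr> <!-- E diaeresis -->
<tr><td>&amp;#232;</td><td>&#232;</td></tr> <!-- e grave -->
<tr><td>&amp;#233;</td><td>&#233;</td></tr> <!-- e acute -->
<tr><td>&amp;#234;</td><td>&#234;</td></tr> <!-- e circumflex -->
<tr><td>&amp;#235;</td><td>&#235;</td></tr> <!-- e diaeresis -->
</table>
```

The same list of codes can then be pasted one at a time into each SE's search box to fill in the grid.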

As long as title, description, or snippets containing HTML Characters may be variously displayed by bots and browsers the "best practice" is knowing and designing around the differences.

Yet another cross-browser cross-SE cross web developer issue.

httpwebwitch

8:16 pm on Jun 7, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Interesting. A G search for "café" yields the same results as "caf&#233;". Identical.

But you cannot search for "allintitle:caf&#233;"

So the point is made: "e" and "é" are not identical. Culinary Cafe ranks for "café" - yet nowhere does it actually use the accented character on the page (the accent is used in the graphic images).

The best practice (though not the most editorially sound) may be to use the safest option, the unaccented letter, on the page and encourage backlinks that use a mix of the two forms.

encourage backlinks with:
<a href="mysite">cafe</a>
<a href="mysite">café</a>

or even:
<a href="mysite">caf&#233;</a>

I'm guessing that enough people have linked to the Culinary Cafe using the accented letter that it has achieved ranking for a word that does not appear anywhere on the site.

Or maybe I'm seeing LSI at work in a hard-to-predict way?