Characters or bytes? (I think I've asked this before, but nobody seems to know.)
Actually, somebody knows. Short answer: it's always characters. But what does that mean?
As always with character encodings, it's not simple. Counting characters consistently is *very* hard in a low-level language without importing a library for doing so and deciding what basis you're going to use for character normalization.
Twitter uses the example of counting "café", where the é can be represented internally as either a single character (two bytes in UTF-8) or a three-byte decomposed sequence of two characters, depending on your Twitter client. So depending on how you're counting, the é alone could be:
- 1 character (visually it looks like that in all cases),
- 2 characters: an "e" plus a combining diacritical, which are separate Unicode characters. Some clients represent many accented characters this way, and every client has to represent some accented characters this way (the ones that have no single precomposed character assigned),
- finally, if they were counting bytes, which they aren't, it would be 2 or 3 bytes.
So depending on how you handle characters, café could be four characters (normalized character count), five characters (strict count of the characters, assuming a client that uses decomposed characters), or 5-6 "characters" (assuming a simple byte count).
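To make those numbers concrete, here is a minimal Python sketch. The `counts` helper and the variable names are made up for illustration, and the byte counts assume UTF-8; it only shows what a naive count sees for each representation, not anything Twitter runs.

```python
# Two ways of storing the same visible word.
composed = "caf\u00e9"       # é as one precomposed character (U+00E9)
decomposed = "cafe\u0301"    # "e" followed by a combining acute accent (U+0301)

def counts(text):
    """Return (number of Unicode characters, number of UTF-8 bytes) as stored."""
    return len(text), len(text.encode("utf-8"))

print(counts(composed))      # (4, 5)
print(counts(decomposed))    # (5, 6)
```

Both strings look identical on screen, but they do not compare equal and they do not have the same length until you normalize them.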
Twitter counts the length of a Tweet using the Normalization Form C (NFC) version of the text. This type of normalization favors the use of a fully combined character (0xC3 0xA9 from the café example) over the long-form version (0x65 0xCC 0x81).
[developer.twitter.com...]
So there's your answer: characters are characters within the limits of Normalization Form C.
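If you want to approximate that on your own side, a rough sketch in Python looks like this. `nfc_length` is a hypothetical helper name, and this covers only the NFC step described in the quote above, so treat it as an illustration rather than Twitter's actual counting code.

```python
import unicodedata

def nfc_length(text):
    """Count Unicode characters after normalizing the text to NFC."""
    return len(unicodedata.normalize("NFC", text))

print(nfc_length("caf\u00e9"))     # 4 (already precomposed)
print(nfc_length("cafe\u0301"))    # 4 (e + combining accent folds into é)
```

Either way the client sends it, café comes out as four characters once it is normalized.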