Julio Biason
5 years ago
4 changed files with 63 additions and 2 deletions
@@ -0,0 +1,55 @@
|
+++
title = "Things I Learnt The Hard Way - Always Use UTF-8 For Your Strings"
date = 2019-07-01

[taxonomies]
tags = ["en-au", "books", "things i learnt", "utf-8"]
+++
|
Long gone are the days when [ASCII](https://en.wikipedia.org/wiki/ASCII) was
enough for everyone. Long gone are the days when you could deal with strings
with no "weird" or "funny" characters.
|
<!-- more -->
|
I was born in a time when the only encoding we had was ASCII. You could encode
all strings as sequences of bytes, 'cause every character you could use was
encoded in a single byte, from 0 to 255 (well, ASCII proper only goes from 0
to 127, with the printable characters running from 32 [space] to 126 [tilde];
the few Latin accented characters in the higher positions came from 8-bit
extensions, and not all accents were there).
|
Today, accepting characters beyond that is not the exception, but the norm. To
cope with all that, we have things like
[Unicode](https://en.wikipedia.org/wiki/Unicode), which gives every character
a number, and [UTF-8](https://en.wikipedia.org/wiki/UTF-8) for encoding those
numbers in reasonable memory space (UTF-16 is also a good option here, but
that would depend on your language).
|
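To make "reasonable memory space" a bit more concrete, here is a minimal Python sketch that compares how many bytes the same text takes in UTF-8 and in UTF-16. The helper name `encoded_sizes` and the sample strings are made up for illustration; which encoding comes out smaller depends on the text, which is one reason the choice can depend on your language.

```python
def encoded_sizes(text: str) -> tuple:
    """Return (code points, UTF-8 bytes, UTF-16 bytes) for the given text."""
    return (
        len(text),
        len(text.encode("utf-8")),
        # "utf-16-le" is used so the 2-byte byte order mark is not counted.
        len(text.encode("utf-16-le")),
    )


# Illustrative samples, written with escapes so the code points are explicit.
samples = [
    "hello",                 # plain ASCII
    "a\u00e7\u00e3o",        # Portuguese word with two accented letters
    "\u65e5\u672c\u8a9e",    # three CJK characters
    "\U0001F642",            # one emoji outside the Basic Multilingual Plane
]

for sample in samples:
    print(sample, encoded_sizes(sample))
```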
So, as much as you want to keep your system simple, you will have to keep the
internal representation of your strings in UTF-8/UTF-16. Sure, you may not
receive the data as UTF-8/UTF-16, but you'll have to convert it on the way in
and keep passing it around as UTF-8/UTF-16 till you have to display it, at
which point you'll convert from UTF-8/UTF-16 to whatever your display supports
(maybe it even supports UTF-8/UTF-16 directly, so you're good already).
|
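The "convert on the way in, convert on the way out" flow could look something like the sketch below. The function names `read_input` and `write_output` and the sample bytes are hypothetical. Note that in Python the internal representation is the `str` type rather than raw UTF-8 bytes, so here UTF-8 shows up when the text is serialised for storage or transmission.

```python
def read_input(raw: bytes, source_encoding: str) -> str:
    """Decode bytes from the outside world into the internal text type."""
    return raw.decode(source_encoding)


def write_output(text: str, target_encoding: str = "utf-8") -> bytes:
    """Encode text only at the point where it leaves the program.

    Characters the target encoding cannot represent become "?" instead of
    raising an error; that is a choice made for this sketch, not a rule.
    """
    return text.encode(target_encoding, errors="replace")


# Hypothetical input: the word "cafe" with an accented "e", sent as Latin-1.
incoming = b"caf\xe9"

text = read_input(incoming, "latin-1")   # decode once, at the boundary
stored = write_output(text)              # UTF-8 for storage or transmission
legacy = write_output(text, "cp1252")    # a display that only speaks cp1252

print(stored)
print(legacy)
```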
At this point, I believe most languages do support UTF-8, which is great. You
may still have problems with inputs coming from other systems that are not
UTF-8 (old Windows versions, for example). The conversion itself is fairly
easy; the hard part is figuring out the input _encoding_. Also, most
developers tend to ignore this and only accept ASCII characters, or ignore
UTF-8/whatever-encoding and get a bunch of weird characters in their output,
'cause they completely skipped the conversion at the output point. That's why
I'm repeating the mantra of UTF-8: to remind you to always capture your input,
convert it to UTF-8 and _then_ convert again on output.
|
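Both problems, the "weird characters" and the guessing of the input encoding, can be reproduced in a few lines of Python. The helper `decode_with_fallback` and its candidate list are assumptions made for this sketch: trying encodings in order is only a heuristic, not real detection, since plenty of byte sequences are valid in more than one encoding.

```python
def decode_with_fallback(raw: bytes, candidates=("utf-8", "cp1252")) -> str:
    """Try each candidate encoding in order and return the first that decodes.

    UTF-8 goes first because non-UTF-8 input rarely happens to be valid UTF-8.
    If nothing fits, undecodable bytes become U+FFFD instead of raising.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")


# The word "cafe" with an accented "e", encoded as UTF-8 ...
utf8_bytes = "caf\u00e9".encode("utf-8")

# ... and then wrongly decoded as cp1252: the accent turns into two
# unrelated characters, which is the classic "weird characters" symptom.
mojibake = utf8_bytes.decode("cp1252")
print(mojibake)

# The same word as sent by an old Windows system is not valid UTF-8,
# so the helper falls through to cp1252.
print(decode_with_fallback(b"caf\xe9"))
```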
One thing to keep in mind is that UTF-8 is not a "cost free" encoding like
ASCII. In ASCII, to move to the 10th character, you'd just jump 10 bytes
from the start of the string. With UTF-8 you can't, due to some characters
being encoded as two or more bytes (you should read the Wikipedia page; the
encoding is pretty simple and makes a lot of sense). Because of this, you
can't simply jump 10 bytes, 'cause you may end up in the middle of a
multi-byte sequence that represents a single character. Reaching the 10th
character requires traversing the string character by character, instead of
simply jumping straight to the proper position. But that's a price worth
paying, in the long run.
|
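The cost of walking the string can be seen by working on the raw UTF-8 bytes. The sketch below is illustrative only: the function name `byte_offset_of_char` is made up, "character" here means a Unicode code point, and the input is assumed to be valid UTF-8. It relies on the fact that continuation bytes in UTF-8 always have the bit pattern `10xxxxxx`, so any other byte starts a new character.

```python
def byte_offset_of_char(data: bytes, index: int) -> int:
    """Return the byte offset where the index-th character (0-based) starts.

    Assumes data is valid UTF-8. The scan is linear: there is no way to
    know where character N starts without looking at the bytes before it.
    """
    seen = -1
    for offset, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; skip them.
        if (byte & 0b1100_0000) != 0b1000_0000:
            seen += 1
            if seen == index:
                return offset
    raise IndexError("character index out of range")


# Five characters, but seven bytes: two of the letters take two bytes each.
data = "a\u00e7\u00e3o!".encode("utf-8")

print(len(data))                      # number of bytes, not characters
print(byte_offset_of_char(data, 2))   # where the third character starts

# Jumping straight to byte 2 lands inside a two-byte character.
try:
    data[2:3].decode("utf-8")
except UnicodeDecodeError as error:
    print("not a character boundary:", error.reason)
```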
{{ chapters(prev_chapter_link="/books/things-i-learnt/use-timezones", prev_chapter_title="Always Use Timezones With Your Dates", next_chapter_link="/books/things-i-learnt/languages-are-more", next_chapter_title="A Language Is Much More Than A Language") }}