New Chapter: Use UTF8

5 years ago · a94980d60e
4 changed files with 63 additions and 2 deletions
--- a/content/books/things-i-learnt/_index.md
+++ b/content/books/things-i-learnt/_index.md
@ -40,6 +40,7 @@ template = "section-contentless.html"
 		* [Don't Mess With Things Outside Your Project](outside-project)
 		* [Resist The Temptation Of Easy](resist-easy)
 		* [Always Use Timezones With Your Dates](use-timezones)
+		* [Always Use UTF-8 For Your Strings](use-utf8)
 * Community/Teams
 	* [A Language Is Much More Than A Language](languages-are-more)
 	* [Understand And Stay Away From Cargo Cult](cargo-cult)
--- a/content/books/things-i-learnt/languages-are-more/index.md
+++ b/content/books/things-i-learnt/languages-are-more/index.md
@ -39,4 +39,4 @@ surface of what the whole of a language encapsulates and if you ignore the
 other elements in it, you may find yourself with a cute language in a
 community that is always fighting and never going forward.

-{{ chapters(prev_chapter_link="/books/things-i-learnt/use-timezones", prev_chapter_title="Always Use Timezones With Your Dates", next_chapter_link="/books/things-i-learnt/outside-project", next_chapter_title="Don't Mess With Things Outside Your Project") }}
+{{ chapters(prev_chapter_link="/books/things-i-learnt/use-utf8", prev_chapter_title="Always Use UTF-8 For Your Strings", next_chapter_link="/books/things-i-learnt/outside-project", next_chapter_title="Don't Mess With Things Outside Your Project") }}
--- a/content/books/things-i-learnt/use-timezones/index.md
+++ b/content/books/things-i-learnt/use-timezones/index.md
@ -32,4 +32,9 @@ timezone as soon as possible and carry it around in all operations.
 Modules/classes that don't support timezones for dates/times should, as soon
 as possible, removed from the system.

-{{ chapters(prev_chapter_link="/books/things-i-learnt/use-timezones", prev_chapter_title="Always Use Timezones With Your Dates", next_chapter_link="/books/things-i-learnt/languages-are-more", next_chapter_title="A Language Is Much More Than A Language") }}
+Developers a bit more seasoned -- and by "seasoned" I mean "had to deal with
+times before" -- will probably claim "Hey, this is _obvious_!" And I'd have to
+agree. But it's annoying how many times I got bitten by some stupid bug 'cause
+we decided that "well, everything is in the same timezone, so it's all good".
+
+{{ chapters(prev_chapter_link="/books/things-i-learnt/resist-easy", prev_chapter_title="Resist The Temptation Of Easy", next_chapter_link="/books/things-i-learnt/use-utf8", next_chapter_title="Always Use UTF-8 For Your Strings") }}
--- a/content/books/things-i-learnt/use-utf8/index.md
+++ b/content/books/things-i-learnt/use-utf8/index.md
@ -0,0 +1,55 @@
+++
+title = "Things I Learnt The Hard Way - Always Use UTF-8 For Your Strings"
+date = 2019-07-01
+
+[taxonomies]
+tags = ["en-au", "books", "things i learnt", "utf-8"]
+++
+
+Long gone are the days when [ASCII](https://en.wikipedia.org/wiki/ASCII) was
+enough for everyone. Long gone are the days when you could deal with strings
+with no "weird" or "funny" characters.
+
+<!-- more -->
+
+I was born in a time when the only encoding we had was ASCII. You could encode
+all strings as sequences of bytes, 'cause every character you could use fit in
+a single byte, from 0 to 255 (well, ASCII proper only goes up to 127, with the
+printable ones from 32 [space] to 126 [tilde]; the 8-bit extensions put a few
+Latin accented characters in the higher positions, although not all accents
+were there).
+
+Today, accepting characters beyond that is not the exception, but the norm. To
+cope with all that, we have things like
+[Unicode](https://en.wikipedia.org/wiki/Unicode) and
+[UTF-8](https://en.wikipedia.org/wiki/UTF-8) for encoding that in reasonable
+memory space (UTF-16 is also a good option here, but that would depend on your
+language).
+
+So, as much as you want to keep your system simple, you will have to keep the
+internal representation of your strings in UTF-8/UTF-16. Sure, you may not
+receive the data as UTF-8/UTF-16, but you'll have to convert it and keep
+passing it around as UTF-8/UTF-16 till you have to display it, at which
+point you'll convert from UTF-8/UTF-16 to whatever your display supports
+(maybe it even supports displaying UTF-8/UTF-16, so you're good already).
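
The "convert on the way in, convert on the way out" flow can be sketched in Python as below. This is only an illustration: the helper names `read_text` and `write_text` are made up for this example, and the incoming bytes are assumed to be in Windows-1252 (`cp1252`). Note that Python's `str` holds Unicode text and hides its internal byte layout, so UTF-8 only shows up at the edges here.

```python
def read_text(raw: bytes, source_encoding: str) -> str:
    """Decode bytes from the outside world into a Unicode string."""
    return raw.decode(source_encoding)


def write_text(text: str, target_encoding: str = "utf-8") -> bytes:
    """Encode a Unicode string into the bytes the output expects."""
    return text.encode(target_encoding)


# "café" as an old Windows system might send it (cp1252: é is the single byte 0xE9).
incoming = b"caf\xe9"

# Convert once, at the input boundary...
text = read_text(incoming, "cp1252")

# ...and convert again only at the output boundary (UTF-8: é is 0xC3 0xA9).
outgoing = write_text(text)
```

Everything between those two calls works on `text` and never has to care which encoding the data arrived in.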
+
+At this point, I believe most languages do support UTF-8, which is great. You
+may still have problems with inputs coming from other systems that are not
+UTF-8 (old Windows versions, for example). The conversion itself is fairly
+easy; the hard part is figuring out the input _encoding_. Also, a lot of
+developers tend to ignore this and only accept ASCII characters, or ignore
+UTF-8/whatever-encoding and get a bunch of weird characters in their output,
+'cause they completely ignored the conversion at the output point. That's why
+I'm repeating the mantra of UTF-8: To remind you to always capture your input,
+convert it to UTF-8 and _then_ convert again at the output.
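
To make the "weird characters" concrete, here is a small Python sketch of what happens when UTF-8 bytes are read with the wrong encoding, plus one naive way of guessing the input encoding. The function `decode_with_fallback` and its list of candidate encodings are assumptions for this example, not a reliable detection method: a guess like this can still pick the wrong encoding without raising any error.

```python
def decode_with_fallback(raw: bytes, candidates=("utf-8", "cp1252")) -> str:
    """Try each candidate encoding in order and return the first that decodes."""
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Nothing matched: keep going, but mark the undecodable bytes.
    return raw.decode("utf-8", errors="replace")


utf8_bytes = "café".encode("utf-8")

# Reading UTF-8 bytes as cp1252 "works", but yields the classic garbage.
garbled = utf8_bytes.decode("cp1252")

# Trying UTF-8 first recovers both kinds of input in this small case.
from_utf8 = decode_with_fallback(utf8_bytes)
from_cp1252 = decode_with_fallback(b"caf\xe9")
```

Trying UTF-8 first is a deliberate choice: random non-UTF-8 data rarely happens to be valid UTF-8, while almost any byte sequence decodes "successfully" as a single-byte encoding.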
+
+One thing to keep in mind is that UTF-8 is not a "cost free" encoding like
+ASCII: While in ASCII, to move to the 10th character, you'd just jump 10 bytes
+from the start of the string, with UTF-8 you can't, due to some characters
+being encoded as two or more bytes (you should read the Wikipedia page; the
+encoding is pretty simple and makes a lot of sense). Due to this, you can't
+simply jump 10 bytes, 'cause you may end up in the middle of a multi-byte
+sequence that represents a single character. Finding the 10th character
+requires traversing the string character by character, instead of simply
+jumping straight to the proper position. But that's a price worth paying in
+the long run.
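
The traversal cost can be sketched in Python by walking raw UTF-8 bytes by hand. The function name `byte_offset_of_char` is hypothetical, and "character" here is assumed to mean a Unicode code point (combined characters made of several code points are not handled).

```python
def byte_offset_of_char(data: bytes, index: int) -> int:
    """Return the byte offset where code point number `index` (0-based) starts."""
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; any other byte starts a character.
        if (byte & 0b1100_0000) != 0b1000_0000:
            if seen == index:
                return offset
            seen += 1
    raise IndexError("string has fewer characters than requested")


data = "naïve café".encode("utf-8")

# 10 characters, but 12 bytes: "ï" and "é" take two bytes each.
total_bytes = len(data)

# The 4th character ("v", index 3) starts at byte 4, not byte 3.
offset_of_v = byte_offset_of_char(data, 3)
```

The loop has to look at every byte before the target, which is exactly the price mentioned above: the lookup is linear in the length of the string instead of a single jump.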
+
+{{ chapters(prev_chapter_link="/books/things-i-learnt/use-timezones", prev_chapter_title="Always Use Timezones With Your Dates", next_chapter_link="/books/things-i-learnt/languages-are-more", next_chapter_title="A Language Is Much More Than A Language") }}