New Chapter: Use UTF8

5 years ago · a94980d60e
4 changed files with 63 additions and 2 deletions
--- a/content/books/things-i-learnt/_index.md
+++ b/content/books/things-i-learnt/_index.md
@ -40,6 +40,7 @@ template = "section-contentless.html"
 		* [Don't Mess With Things Outside Your Project](outside-project)
 		* [Resist The Temptation Of Easy](resist-easy)
 		* [Always Use Timezones With Your Dates](use-timezones)
 		* [Always Use UTF-8 For Your Strings](use-utf8)
 * Community/Teams
 	* [A Language Is Much More Than A Language](languages-are-more)
 	* [Understand And Stay Away From Cargo Cult](cargo-cult)
--- a/content/books/things-i-learnt/languages-are-more/index.md
+++ b/content/books/things-i-learnt/languages-are-more/index.md
@ -39,4 +39,4 @@ surface of what the whole of a language encapsulates and if you ignore the
 other elements in it, you may find yourself with a cute language in a
 community that is always fighting and never going forward.
-{{ chapters(prev_chapter_link="/books/things-i-learnt/use-timezones", prev_chapter_title="Always Use Timezones With Your Dates", next_chapter_link="/books/things-i-learnt/outside-project", next_chapter_title="Don't Mess With Things Outside Your Project") }}
+{{ chapters(prev_chapter_link="/books/things-i-learnt/use-utf8", prev_chapter_title="Always Use UTF-8 For Your Strings", next_chapter_link="/books/things-i-learnt/outside-project", next_chapter_title="Don't Mess With Things Outside Your Project") }}
--- a/content/books/things-i-learnt/use-timezones/index.md
+++ b/content/books/things-i-learnt/use-timezones/index.md
@ -32,4 +32,9 @@ timezone as soon as possible and carry it around in all operations.
 Modules/classes that don't support timezones for dates/times should be
 removed from the system as soon as possible.
-{{ chapters(prev_chapter_link="/books/things-i-learnt/use-timezones", prev_chapter_title="Always Use Timezones With Your Dates", next_chapter_link="/books/things-i-learnt/languages-are-more", next_chapter_title="A Language Is Much More Than A Language") }}
+Developers a bit more seasoned -- and by "seasoned" I mean "Had to deal with
 times before" -- will probably claim "Hey, this is _obvious_!" And I'd have to
 agree. But it's annoying how many times I got bitten by some stupid bug 'cause
 we decided that "well, everything is in the same timezone, so it's all good".
 {{ chapters(prev_chapter_link="/books/things-i-learnt/resist-easy", prev_chapter_title="Resist The Temptation Of Easy", next_chapter_link="/books/things-i-learnt/use-utf8", next_chapter_title="Always Use UTF-8 For Your Strings") }}
--- a/content/books/things-i-learnt/use-utf8/index.md
+++ b/content/books/things-i-learnt/use-utf8/index.md
@ -0,0 +1,55 @@
 +++
 title = "Things I Learnt The Hard Way - Always Use UTF-8 For Your Strings"
 date = 2019-07-01
 [taxonomies]
 tags = ["en-au", "books", "things i learnt", "utf-8"]
 +++
 Long gone are the days when [ASCII](https://en.wikipedia.org/wiki/ASCII) was
 enough for everyone. Long gone are the days when you could deal with strings
 with no "weird" or "funny" characters.
 <!-- more -->
 I was born in a time when the only encoding we had was ASCII. You could encode
 all strings as sequences of bytes, 'cause every character you could use was
 encoded from 0 to 127 (well, the printable ones go from 32 [space] to 126
 [tilde], and 8-bit extensions put a few Latin accented characters in the
 higher positions, up to 255, although not all accents were there).
 Today, accepting characters beyond that is not the exception, but the norm. To
 cope with all that, we have things like
 [Unicode](https://en.wikipedia.org/wiki/Unicode) and
 [UTF-8](https://en.wikipedia.org/wiki/UTF-8) for encoding that in reasonable
 memory space (UTF-16 is also a good option here, but that would depend on your
 language).
 So, as much as you want to keep your system simple, you will have to keep the
 internal representation of your strings in UTF-8/UTF-16. Sure, you may not
 receive the data as UTF-8/UTF-16, but you'll have to convert it on input and
 keep passing it around as UTF-8/UTF-16 till you have to display it, at which
 point you'll convert from UTF-8/UTF-16 to whatever your display supports
 (maybe it even supports displaying UTF-8/UTF-16, so you're good already).
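A minimal sketch of that "convert at the edges" flow, in Python. The helper names `read_text` and `write_text` and the sample bytes are made up for illustration, and it assumes you already know the input is Latin-1. Python's `str` hides the internal representation, so the UTF-8 part only shows up when the text leaves the program.

```python
def read_text(raw: bytes, source_encoding: str) -> str:
    """Decode incoming bytes into a Unicode string as early as possible."""
    return raw.decode(source_encoding)


def write_text(text: str, target_encoding: str = "utf-8") -> bytes:
    """Encode only at the output edge, in whatever the destination expects."""
    return text.encode(target_encoding)


# Hypothetical input: "São Paulo" as sent by a legacy Latin-1 system.
incoming = b"S\xe3o Paulo"

city = read_text(incoming, "latin-1")  # internal representation: Unicode text
outgoing = write_text(city)            # output representation: UTF-8 bytes

print(city)           # São Paulo
print(len(incoming))  # 9 bytes in Latin-1
print(len(outgoing))  # 10 bytes in UTF-8, since "ã" now takes two bytes
```

Everything between `read_text` and `write_text` only ever sees text, never bytes in some unknown encoding.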
 At this point, I believe most languages do support UTF-8, which is great. You
 may still have problems with inputs coming from other systems that are not
 UTF-8 (old Windows versions, for example). Converting those is fairly easy;
 the hard part is figuring out the input _encoding_. Also, many developers
 tend to ignore this and only accept ASCII characters, or ignore
 UTF-8/whatever-encoding and get a bunch of weird characters in their output,
 'cause they completely ignored the conversion at the output point. That's why
 I'm repeating the mantra of UTF-8: to remind you to always capture your input,
 convert it to UTF-8 and _then_ convert again on output.
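To show why figuring out the input encoding is the hard part, here is a rough Python sketch that tries a list of candidate encodings in order. The function name `decode_with_fallback` and the candidate list are assumptions for illustration. This is guessing, not detection: plenty of byte sequences are valid in more than one encoding, so a successful decode doesn't prove the guess was right.

```python
from typing import Sequence, Tuple


def decode_with_fallback(
    raw: bytes, candidates: Sequence[str] = ("utf-8", "cp1252")
) -> Tuple[str, str]:
    """Try each candidate encoding in order; return (text, encoding used)."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Latin-1 maps every possible byte to a character, so this never fails,
    # but the result may be garbage if the real encoding was something else.
    return raw.decode("latin-1"), "latin-1"


# The same word, "café", as sent by two different (hypothetical) systems.
print(decode_with_fallback(b"caf\xc3\xa9"))  # ('café', 'utf-8')
print(decode_with_fallback(b"caf\xe9"))      # ('café', 'cp1252')
```

Trying UTF-8 first is a deliberate choice: non-ASCII text in other encodings rarely happens to be valid UTF-8, while almost anything decodes "successfully" as a single-byte encoding.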
 One thing to keep in mind is that UTF-8 is not a "cost free" encoding like
 ASCII: while in ASCII, to move to the 10th character, you'd just jump 10 bytes
 from the start of the string, with UTF-8 you can't, due to some characters
 being encoded as two or more bytes (you should read the Wikipedia page; the
 encoding is pretty simple and makes a lot of sense). Jump 10 bytes and you
 may land in the middle of a multi-byte character. Getting to a given position
 requires traversing the string character by character, instead of simply
 jumping straight to the proper position. But that's a price worth paying, in
 the long run.
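The cost is easier to see on raw bytes. The Python sketch below uses a hypothetical helper, `byte_offset`, that walks UTF-8 bytes to find where a given character starts. It relies on one fact about the encoding: continuation bytes always have the bit pattern `10xxxxxx`, and every other byte starts a new character.

```python
def byte_offset(data: bytes, char_index: int) -> int:
    """Walk UTF-8 bytes and return the offset where character `char_index` starts."""
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; any other byte starts a character.
        if (byte & 0b1100_0000) != 0b1000_0000:
            if seen == char_index:
                return offset
            seen += 1
    raise IndexError("character index out of range")


text = "naïve café"
data = text.encode("utf-8")

print(len(text))             # 10 characters
print(len(data))             # 12 bytes: "ï" and "é" take two bytes each
print(text[3])               # v
print(byte_offset(data, 3))  # 4, not 3: byte 3 is the second half of "ï"
```

Most languages with UTF-8 strings either do this walk for you or hand you byte offsets and leave the character counting to you, so it's worth knowing which one yours does.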
 {{ chapters(prev_chapter_link="/books/things-i-learnt/use-timezones", prev_chapter_title="Always Use Timezones With Your Dates", next_chapter_link="/books/things-i-learnt/languages-are-more", next_chapter_title="A Language Is Much More Than A Language") }}