diff --git a/content/books/things-i-learnt/_index.md b/content/books/things-i-learnt/_index.md
index 8e15213..bda025c 100644
--- a/content/books/things-i-learnt/_index.md
+++ b/content/books/things-i-learnt/_index.md
@@ -40,6 +40,7 @@ template = "section-contentless.html"
  * [Don't Mess With Things Outside Your Project](outside-project)
  * [Resist The Temptation Of Easy](resist-easy)
  * [Always Use Timezones With Your Dates](use-timezones)
+ * [Always Use UTF-8 For Your Strings](use-utf8)
  * Community/Teams
  * [A Language Is Much More Than A Language](languages-are-more)
  * [Understand And Stay Away From Cargo Cult](cargo-cult)
diff --git a/content/books/things-i-learnt/languages-are-more/index.md b/content/books/things-i-learnt/languages-are-more/index.md
index f53fa5f..c0ac487 100644
--- a/content/books/things-i-learnt/languages-are-more/index.md
+++ b/content/books/things-i-learnt/languages-are-more/index.md
@@ -39,4 +39,4 @@ surface of what the whole of a language encapsulates and if you ignore the
 other elements in it, you may find yourself with a cute language in a community
 that is always fighting and never going forward.

-{{ chapters(prev_chapter_link="/books/things-i-learnt/use-timezones", prev_chapter_title="Always Use Timezones With Your Dates", next_chapter_link="/books/things-i-learnt/outside-project", next_chapter_title="Don't Mess With Things Outside Your Project") }}
+{{ chapters(prev_chapter_link="/books/things-i-learnt/use-utf8", prev_chapter_title="Always Use UTF-8 For Your Strings", next_chapter_link="/books/things-i-learnt/outside-project", next_chapter_title="Don't Mess With Things Outside Your Project") }}
diff --git a/content/books/things-i-learnt/use-timezones/index.md b/content/books/things-i-learnt/use-timezones/index.md
index 976fc33..89e4a91 100644
--- a/content/books/things-i-learnt/use-timezones/index.md
+++ b/content/books/things-i-learnt/use-timezones/index.md
@@ -32,4 +32,9 @@ timezone as soon as possible and carry it around in all operations.
 Modules/classes that don't support timezones for dates/times should, as soon as
 possible, removed from the system.

-{{ chapters(prev_chapter_link="/books/things-i-learnt/use-timezones", prev_chapter_title="Always Use Timezones With Your Dates", next_chapter_link="/books/things-i-learnt/languages-are-more", next_chapter_title="A Language Is Much More Than A Language") }}
+Developers a bit more seasoned -- and by "seasoned" I mean "Had to deal with
+times before" -- will probably claim "Hey, this is _obvious_!" And I'd have to
+agree. But it's annoying how many times I got bitten by some stupid bug 'cause
+we decided that "well, everything is in the same timezone, so it's all good".
+
+{{ chapters(prev_chapter_link="/books/things-i-learnt/resist-easy", prev_chapter_title="Resist The Temptation Of Easy", next_chapter_link="/books/things-i-learnt/use-utf8", next_chapter_title="Always Use UTF-8 For Your Strings") }}
diff --git a/content/books/things-i-learnt/use-utf8/index.md b/content/books/things-i-learnt/use-utf8/index.md
new file mode 100644
index 0000000..ef47acc
--- /dev/null
+++ b/content/books/things-i-learnt/use-utf8/index.md
@@ -0,0 +1,55 @@
++++
+title = "Things I Learnt The Hard Way - Always Use UTF-8 For Your Strings"
+date = 2019-07-01
+
+[taxonomies]
+tags = ["en-au", "books", "things i learnt", "utf-8"]
++++
+
+Long gone are the days when [ASCII](https://en.wikipedia.org/wiki/ASCII) was
+enough for everyone. Long gone are the days when you could deal with strings
+with no "weird" or "funny" characters.
+
+
+
+I was born in a time when the only encoding we had was ASCII. You could encode
+all strings in sequences of bytes, 'cause all characters you could use were
+encoded from 0 to 127 (well, the printable ones went from 32 [space] to 126
+[tilde], and the 8-bit extensions gave you a few Latin-accented characters in
+the higher positions, up to 255, although not all accents were there).
+
+Today, accepting characters beyond that is not the exception, but the norm. To
+cope with all that, we have things like
+[Unicode](https://en.wikipedia.org/wiki/Unicode) and
+[UTF-8](https://en.wikipedia.org/wiki/UTF-8) for encoding that in reasonable
+memory space (UTF-16 is also a good option here, but that would depend on your
+language).
+
+So, as much as you want to make your system simple, you will have to keep the
+internal representation of your strings in UTF-8/UTF-16. Sure, you may not
+receive the data as UTF-8/UTF-16, but you'll have to convert it and keep
+transmitting it around as UTF-8/UTF-16 till you have to display it, at which
+point you'll convert from UTF-8/UTF-16 to whatever your display supports
+(maybe it even supports displaying in UTF-8/UTF-16, so you're good already).
+
+At this point, I believe most languages do support UTF-8, which is great. You
+may still have problems with inputs coming from other systems that are not
+UTF-8 (old Windows versions, for example), but that's fairly easy to convert
+-- the hard part is figuring out the input _encoding_, though. Also, most
+developers tend to ignore this and only accept ASCII characters, or ignore
+UTF-8/whatever-encoding and get a bunch of weird characters in their output,
+'cause they completely ignored the conversion at the output point. That's why
+I'm repeating the mantra of UTF-8: To remind you to always capture your input,
+convert it to UTF-8 and _then_ convert it again at the output.
+
+One thing to keep in mind is that UTF-8 is not a "cost free" encoding like
+ASCII: While in ASCII to move to the 10th character, you'd just jump 10 bytes
+from the start of the string, with UTF-8 you can't, due to some characters being
+encoded as two or more bytes (you should read the Wikipedia page; the encoding
+is pretty simple and makes a lot of sense) and, due to this, you can't simply
+jump 10 bytes 'cause you may end up in the second byte of a sequence that
+represents a single character. Getting to a given character requires traversing
+the string character by character, instead of simply jumping straight to the
+proper position. But that's a price worth paying, in the long run.
+
+{{ chapters(prev_chapter_link="/books/things-i-learnt/use-timezones", prev_chapter_title="Always Use Timezones With Your Dates", next_chapter_link="/books/things-i-learnt/languages-are-more", next_chapter_title="A Language Is Much More Than A Language") }}
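
Note on the new `use-utf8` chapter: its last two paragraphs (convert the input to UTF-8, convert again only at the output, and don't jump around by byte count) can be sketched in a few lines of Python. This is only an illustration and not part of the commit. The sample string, the choice of Latin-1 as the incoming encoding, and the helper names `to_utf8` and `byte_offset` are all assumptions made up for the example.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Hypothetical helper: re-encode bytes from a known encoding as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def byte_offset(data: bytes, char_index: int) -> int:
    """Hypothetical helper: walk UTF-8 bytes to find where a character starts."""
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; any other byte starts a character.
        if byte & 0xC0 != 0x80:
            if seen == char_index:
                return offset
            seen += 1
    raise IndexError(char_index)


# Assumption: the other system sends Latin-1 (ISO-8859-1), one byte per character.
legacy = "café!".encode("latin-1")
internal = to_utf8(legacy, "latin-1")  # "é" now takes two bytes
text = internal.decode("utf-8")
print(len(legacy), len(internal), len(text))

# Jumping by bytes is not jumping by characters: the first four bytes of the
# UTF-8 data end after only the first byte of "é", so they do not decode.
try:
    internal[:4].decode("utf-8")
except UnicodeDecodeError as err:
    print("cut in the middle of a character:", err.reason)

# Finding the start of character number 4 ("!") means walking the bytes.
print(byte_offset(internal, 4))

# Convert only at the output point, to whatever the display supports
# (assumed here to be Latin-1 again).
output = text.encode("latin-1")
```

With this sample, the Latin-1 input is 5 bytes, the UTF-8 form is 6 bytes, and the `!` that sits at character index 4 starts at byte offset 5. Guessing the input encoding, which the chapter calls the hard part, is deliberately left out: the sketch assumes it is already known.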