<!DOCTYPE html>
<html lang="en">
<head>
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta http-equiv="content-type" content="text/html; charset=utf-8">

<!-- Enable responsiveness on mobile devices-->
<!-- viewport-fit=cover is to support iPhone X rounded corners and notch in landscape-->
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1, viewport-fit=cover">

<title>Julio Biason .Me 4.3</title>

<!-- CSS -->
<link rel="stylesheet" href="https://blog.juliobiason.me/print.css" media="print">
<link rel="stylesheet" href="https://blog.juliobiason.me/poole.css">
<link rel="stylesheet" href="https://blog.juliobiason.me/hyde.css">
<link rel="stylesheet" href="https://fonts.googleapis.com/css?family=PT+Sans:400,400italic,700|Abril+Fatface">
</head>

<body class=" ">

<div class="sidebar">
<div class="container sidebar-sticky">
<div class="sidebar-about">
<a href="https://blog.juliobiason.me"><h1>Julio Biason .Me 4.3</h1></a>
<p class="lead">Old school dev living in a 2.0 dev world</p>
</div>

<ul class="sidebar-nav">
<li class="sidebar-nav-item"><a href="/">English</a></li>
<li class="sidebar-nav-item"><a href="/pt">Português</a></li>
<li class="sidebar-nav-item"><a href="/tags">Tags (EN)</a></li>
<li class="sidebar-nav-item"><a href="/pt/tags">Tags (PT)</a></li>
</ul>
</div>
</div>

<div class="content container">

<div class="post">
<h1 class="post-title">Things I Learnt The Hard Way - Always Use UTF-8 For Your Strings</h1>
<span class="post-date">
2019-07-01
<a href="https://blog.juliobiason.me/tags/books/">#books</a>
<a href="https://blog.juliobiason.me/tags/things-i-learnt/">#things i learnt</a>
<a href="https://blog.juliobiason.me/tags/utf-8/">#utf-8</a>
</span>
<p>Long gone are the days when <a href="https://en.wikipedia.org/wiki/ASCII">ASCII</a> was
enough for everyone. Long gone are the days when you could deal with strings
with no "weird" or "funny" characters.</p>
<span id="continue-reading"></span>
<p>I became a developer in a time when the only encoding we had was ASCII. You
could encode all strings in sequences of bytes, 'cause every character you
could use fit in a single byte: ASCII itself only defines codes 0 to 127, with
the printable ones going from 32 [space] to 126 [tilde], and the 8-bit
extensions added a few latin-accented characters in the higher positions
(128 to 255), although not all accents were there.</p>
<p>Today, accepting characters beyond that is not the exception, but the norm. To
cope with all that, we have things like
<a href="https://en.wikipedia.org/wiki/Unicode">Unicode</a>, to describe all those characters, and
<a href="https://en.wikipedia.org/wiki/UTF-8">UTF-8</a>, to encode them in a reasonable
memory space (UTF-16 is also a good option here, but that would depend on your
language).</p>
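<p>To get a feeling for the "reasonable memory space" part, here is a small
sketch in Python (my choice of language for the examples here; the function name
<code>encoded_sizes</code> is made up for illustration). It just counts how
many bytes the same text takes in UTF-8 and in UTF-16.</p>

```python
def encoded_sizes(text):
    """Return the byte length of `text` in UTF-8 and in UTF-16.

    "utf-16-le" is used so the count does not include a byte order mark.
    """
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


samples = [
    "hello",               # plain ASCII
    "a\u00e7\u00e3o",      # Portuguese "acao" with cedilla and tilde
    "\u65e5\u672c\u8a9e",  # three Japanese characters
]

for text in samples:
    print(text, encoded_sizes(text))
```

<p>ASCII-only text stays at one byte per character in UTF-8 and doubles in
UTF-16, while the Japanese sample takes 9 bytes in UTF-8 and 6 in UTF-16, which
is why the better option depends on your language and your data.</p>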
<p>So, as much as you want to keep your system simple, you will have to keep the
internal representation of your strings in UTF-8/UTF-16. You may not receive
the data as UTF-8/UTF-16, but you'll have to convert it at the input and keep passing
it around as UTF-8/UTF-16 till you have to display it, at which point you'll
convert from UTF-8/UTF-16 to whatever your display supports (maybe it even
supports UTF-8/UTF-16 directly, so you're good already).</p>
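<p>A minimal sketch of that flow, assuming Python, where the <code>str</code>
type plays the role of the internal representation (it is Unicode, although
not literally UTF-8 in memory). The function names and the cp1252 input are
hypothetical, just to show the convert-at-the-edges idea.</p>

```python
def read_input(raw, source_encoding):
    """Decode incoming bytes into the internal Unicode string type."""
    return raw.decode(source_encoding)


def write_output(text, target_encoding="utf-8"):
    """Encode only at the output boundary.

    Characters the target encoding cannot represent become '?'.
    """
    return text.encode(target_encoding, errors="replace")


# Hypothetical input from an old Windows system using code page 1252.
raw = b"caf\xe9"
text = read_input(raw, "cp1252")         # internal string: 'café'
as_utf8 = write_output(text)             # b'caf\xc3\xa9'
as_ascii = write_output(text, "ascii")   # b'caf?'
```

<p>Everything between <code>read_input</code> and <code>write_output</code>
only ever sees the internal string, so the rest of the system doesn't need to
know what the input or the display uses.</p>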
<p>Today, I believe most languages do support UTF-8, which is great. You
may still have problems with inputs coming from other systems that are not
UTF-8 (old Windows versions, for example), but that's fairly easy to convert;
the hard part is figuring out the input <em>encoding</em>. Also, many
developers tend to ignore this and assume the input is in ASCII, or ignore the
input encoding and get a bunch of weird characters in their output,
'cause they completely skipped the conversion at the output point. That's why
I'm repeating the mantra of UTF-8: to remind you to always capture your input,
convert it to UTF-8 and <em>then</em> convert again at the output.</p>
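<p>When nobody tells you the input encoding, one common (and imperfect)
approach is to try a short list of candidates in order. The sketch below, in
Python, assumes UTF-8 first and cp1252 second; both the function name and the
candidate list are my assumptions, and a wrong guess can still decode without
errors and give you the wrong characters.</p>

```python
def decode_with_fallback(raw, candidates=("utf-8", "cp1252")):
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every possible byte to a character, so it never fails,
    # but the result may be wrong: treat it as a last-resort guess.
    return raw.decode("latin-1"), "latin-1"


text, used = decode_with_fallback(b"caf\xe9")
print(text, used)
```

<p>UTF-8 goes first because it is strict: bytes in some other 8-bit encoding
rarely happen to form valid UTF-8, so a successful UTF-8 decode is a fairly
good sign. If the source system can tell you its encoding, always prefer
that over guessing.</p>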
<p>One thing to keep in mind is that UTF-8 is not a "cost free" encoding like
ASCII: while in ASCII, to move to the 10th character, you'd just jump 10 bytes
from the start of the string, with UTF-8 you can't, due to some characters being
encoded as two or more bytes (you should read the Wikipedia page; the encoding
is pretty simple and makes a lot of sense). Due to this, you can't simply
jump 10 bytes, 'cause you may end up in the second byte of a multi-byte
character. Getting to a given character requires traversing
the string character by character, instead of simply jumping straight to the
proper position. But that's a price worth paying, in the long run.</p>
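<p>To make that traversal concrete, here is a rough sketch in Python that
works directly on UTF-8 bytes. The function name is made up, "character" here
means a Unicode code point, and it assumes the input is valid UTF-8. It relies
on one property of the encoding: continuation bytes are always in the range
0x80 to 0xBF, so any byte outside that range starts a new character.</p>

```python
def byte_offset_of_char(data, index):
    """Return the byte offset where character number `index` starts.

    `data` is assumed to be valid UTF-8. The string has to be walked from
    the start, counting only the bytes that begin a character.
    """
    seen = 0
    for offset, byte in enumerate(data):
        if byte not in range(0x80, 0xC0):  # not a continuation byte
            if seen == index:
                return offset
            seen += 1
    raise IndexError("character index out of range")


data = "a\u00e7\u00e3o!".encode("utf-8")  # 5 characters, 7 bytes
start = byte_offset_of_char(data, 2)      # third character starts at byte 3
print(start, data[start:start + 2].decode("utf-8"))
```

<p>Jumping straight to <code>data[2]</code> would land in the middle of the
second character; the walk is what finds the real start of the third one.</p>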

<div>
<div style="float:left">
&lt;&lt; <a href="/books/things-i-learnt/use-timezones">Always Use Timezones With Your Dates</a>
</div>

<div style="float:right">
<a href="/books/things-i-learnt/optimization">Optimization Is For Compilers</a> &gt;&gt;
</div>
</div>

</div>

</div>

</body>

</html>