What the “UTF”?

In this edition of my dissertation on ASCII I am going to delve into the background of Unicode. This is the second of two parts of my presentation on ASCII. ASCII data is something we encounter every single day, but most of us don’t really appreciate the complexity behind it or what’s happening behind the scenes. If you think this isn’t important, you’d be wrong. ASCII data moves a lot of important information on the factory floor, and that’s the reason our ASCII to PLC Gateway is so popular. Our customers also move a lot of ASCII data over EtherNet/IP™ and PROFINET IO. ASCII still rules, though software programmers may not want to hear that.

I’d venture that many software programmers read my blog. But I’d also guess that of the thousands or tens of thousands of programmers working in the industrial automation industry, just a small handful, if that, would be able to develop a good internationally capable automation application. Ask them to support Japanese, Malaysian or Indonesian and they’d be lost. The reason is that they don’t understand the Unicode character sets.

Last month I talked about code pages, which specified what the character values from 128 to 255 (the ones beyond standard 7-bit ASCII) looked like, and the mess that became of that. Everybody had a code page for their particular language implementation. I described how many of these code pages there were and how it sort of worked. For example, if your Greek document was never transported outside Greece, it would work, since everyone there used the same code page. But once the Internet happened and some of those Greek documents ended up in Holland, the text looked like something you wrote in Chinese in a drunken stupor.
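To make the code page mess concrete, here is a minimal Python sketch. It assumes the Greek document was saved under the Windows Greek code page (cp1253) and opened in Holland under the Western European one (cp1252); those two code pages and the sample word are my choices for illustration, not anything from a specific customer file.

```python
# One byte value, two code pages, two different letters.
raw = bytes([0xE1])

as_greek = raw.decode("cp1253")    # Windows Greek code page
as_western = raw.decode("cp1252")  # Windows Western European code page
print(as_greek, as_western)

# A whole Greek word written under cp1253 and read back under cp1252.
word = "\u03ba\u03b1\u03bb\u03b7\u03bc\u03ad\u03c1\u03b1"  # "good morning"
stored = word.encode("cp1253")       # the bytes that travel over the wire
garbled = stored.decode("cp1252")    # what the reader in Holland sees
restored = stored.decode("cp1253")   # what the writer in Greece intended
print(word, "->", garbled)
```

The bytes never change in transit. Only the table used to interpret them does, which is why the document reads fine at home and turns to nonsense abroad.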

Few people appreciate how hard encoding characters in a computer is. Linguists might, but they’d be the only ones. Do you know that in some scripts there are letters that change shape when they appear at the end of a word? (Older German typesetting did this with the letter s.) Tell me, is that two different letters? Or is it the same letter? In Arabic they consider that the same letter. In Hebrew, they consider it a new letter. [Maybe we’ve just stumbled on the cause of the entire centuries-old Arab/Israeli conflict? Might Unicode be the solution to world peace?]

Let’s look at the core of Unicode (www.unicode.org). In the Unicode representation, every symbol is represented by something called a code-point, written in the form U+xxxx where xxxx is a hexadecimal value of four to six digits. The English A has been assigned U+0041. The letter I can’t pronounce is assigned code-point U+0691, which puts it in the Arabic block (it is the letter rreh); Limbu letters live further up, from U+1900 to U+194F. The incredibly persistent people at the Unicode Consortium have tediously mapped the letters and symbols of the world’s writing systems into code-points for years now. They have tables and tables of mappings on their website. It’s fascinating to read, that is, if you’re a fan of the most boring movie of all time (An Affair to Remember with Cary Grant and Deborah Kerr).
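If you want to poke at those tables without reading them, Python ships a copy in its standard unicodedata module. The sketch below is only an illustration; the helper name code_point is my own invention, and the three characters are just samples.

```python
import unicodedata

def code_point(ch: str) -> str:
    """Format a single character as U+XXXX (at least four hex digits)."""
    return f"U+{ord(ch):04X}"

# Look up the number and the official name for a few characters.
for ch in ("A", "\u0691", "\u1900"):
    print(code_point(ch), unicodedata.name(ch))

# Going the other way: from the number back to the character.
letter = chr(0x0041)
print(letter)
```

Note that nothing here says how the character is stored. ord() and chr() deal only in the abstract number, which is exactly the point of a code-point.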

These Unicode code-points are mappings and only mappings. The tables don’t say anything about how these code-points are stored in memory. They don’t describe whether they are big-endian (MSB first) or little-endian (LSB first) or how many bytes they occupy. This is where things got really crazy. They invented a byte order mark, the code-point U+FEFF, to precede a string of code-points stored as 16-bit values. Leading bytes of “FE FF” indicate the string is big-endian. If you read a string with two leading bytes of “FF FE”, you would know it is little-endian.
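Here is a small Python sketch of the byte order problem, using the 16-bit encodings from the standard library. The function detect_byte_order is a hypothetical helper I made up for this column; it only looks at the first two bytes and assumes the data is 16-bit text, so treat it as an illustration rather than a robust detector.

```python
import codecs

text = "A"  # code point U+0041

big = text.encode("utf-16-be")     # most significant byte first: 00 41
little = text.encode("utf-16-le")  # least significant byte first: 41 00
print(big.hex(), little.hex())

# The byte order mark is code point U+FEFF written in the chosen byte order.
print(codecs.BOM_UTF16_BE.hex(), codecs.BOM_UTF16_LE.hex())

def detect_byte_order(data: bytes) -> str:
    """Hypothetical helper: report the byte order announced by a leading BOM."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "little-endian"
    return "unknown"

print(detect_byte_order(codecs.BOM_UTF16_LE + little))
```

The trick works because the reversed value, U+FFFE, is deliberately not a valid character, so a reader that sees “FF FE” knows the bytes are swapped rather than mistaking it for some other symbol.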

Great, huh?

Americans for the most part didn’t like it. All our strings were much longer but didn’t encode any additional information. No juice for the squeeze to implement these sophisticated code mappings, you might say. With two bytes per character, the string HELLO went from 5 bytes to 12 bytes: a zero byte padding each letter plus a 2-byte byte order mark up front, so we lost 7 bytes of memory. For a long time it was just ignored by American programmers.
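The arithmetic is easy to check. This Python sketch assumes the 16-bit big-endian form with a byte order mark in front, which is the layout that produces the 12-byte figure.

```python
import codecs

word = "HELLO"

ascii_bytes = word.encode("ascii")
# Two bytes per letter, preceded by the two-byte byte order mark.
utf16_bytes = codecs.BOM_UTF16_BE + word.encode("utf-16-be")

print(len(ascii_bytes), ascii_bytes.hex())
print(len(utf16_bytes), utf16_bytes.hex())
print("extra bytes:", len(utf16_bytes) - len(ascii_bytes))
```

In the hex dump of the second string you can see the zero byte sitting in front of every letter.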

But eventually American programmers looked at this and did what we always do. We made things easy (and hard on all those foreigners not smart enough to code every string in English). We created the UTF-8 standard where our most used characters (00 to 7F) would still be 8 bits, but all those funny characters at 80 and above would use two to four bytes. Surprise! Our standard 7-bit ASCII strings that we’ve used for the last forty years are identical to UTF-8 encoding. Only the funny characters at 80 and above have to change, which is a very small percentage of our text strings.
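A short Python sketch shows the variable width in action. The four sample characters are my own picks, chosen so that each one lands in a different size class.

```python
# One character from each UTF-8 size class.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, e-acute, euro sign, emoji

lengths = []
for ch in samples:
    encoded = ch.encode("utf-8")
    lengths.append(len(encoded))
    print(f"U+{ord(ch):04X}", len(encoded), encoded.hex())

# Plain ASCII text is byte-for-byte identical in UTF-8.
same = "HELLO".encode("utf-8") == "HELLO".encode("ascii")
print(same)
```

So an old ASCII file on the factory floor is already a valid UTF-8 file, and nothing needs converting until a character at 80 or above shows up.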

Worked great for us, but not so well for the rest of the universe. If you’re encoding Klingon letters you’ll have to use multiple bytes and work hard at the translation, but that’s not our problem.

There’s more to the UTF standards and I’ll talk about that in my next ASCII column.