What is UTF

We use the Unicode Transformation Format (UTF) every day, though most of us still call it ASCII. However, we really don't appreciate what it is or what's going on behind the scenes. ASCII data is the reason that our ASCII to PLC Gateway is so popular. There is a ton of bar code, printer label, and other ASCII data on the factory floor. And ASCII isn't going away.

ASCII is really one of computer science's great untold stories. After decades of angst, frustration, and computer printouts littered with "?????????" because the software didn't understand the character codes, there is finally a standard that can encode every written language, including Russian, Chinese, and even Klingon (via Unicode's private-use areas), plus all those graphics and special characters. It's truly a fascinating story, which I'm going to tell over several blogs in the coming weeks.

The story begins in the early days of computers. In those days, there wasn't any agreement on how to represent a character. In fact, there wasn't any agreement on how big a computer's character (element) size should be. Five bits seemed like enough to some of the early computer vendors. That's only 32 possible characters, and there are only 26 letters in the alphabet (a little shortsighted, weren't they?). At that time, computers were built with different element sizes: four, five, and six bits were all common. It took a while before everyone agreed on the 8-bit standard.
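If you want to see the arithmetic behind that shortsightedness, here's a quick Python sketch of how many distinct characters each element size can hold:

```python
# An n-bit code can distinguish 2**n characters.
for bits in (4, 5, 6, 8):
    print(f"{bits} bits -> {2 ** bits} possible characters")
# 5 bits gives just 32 codes -- barely room for the alphabet,
# let alone digits, punctuation, and lowercase letters.
```

Eight bits gives 256 codes, which is why the byte won out once people wanted upper- and lowercase letters, digits, punctuation, and control characters all at once.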

But once that happened, there wasn't any agreement on what each of those 256 codes meant. Everyone, more or less, agreed on the codes for the letters of the alphabet, but what did the rest of the codes mean? Agreement came slowly on the first 128 codes (0 through 127), including things like carriage return, linefeed, and control characters like ACK, until the federal government stepped in. In 1961, IBM's Bob Bemer proposed a single communications standard to the American Standards Association (the predecessor of today's ANSI). That effort led to ASCII – the American Standard Code for Information Interchange. Even after President Lyndon Johnson signed a memorandum in 1968 adopting ASCII as the standard for federal computers, it wasn't until Intel introduced 8-bit microprocessors that eight bits and the ASCII we know today became common.
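Those agreed-upon code assignments are still with us. Here's a small Python sketch showing a few of the ASCII codes mentioned above, both printable letters and control characters:

```python
# Printable ASCII: letters get fixed numeric codes.
print(ord("A"))   # 65 -- uppercase A
print(chr(97))    # 'a' -- code 97 is lowercase a

# Control characters: carriage return, linefeed, and ACK
# occupy the low end of the table (codes 0-31).
print(ord("\r"))      # 13 -- carriage return
print(ord("\n"))      # 10 -- linefeed
print(ord("\x06"))    # 6  -- ACK (acknowledge)
```

Every ASCII-speaking device on the factory floor, from bar code scanners to label printers, relies on exactly these numbers.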

You might think that this ended the matter, but nothing is ever that simple. Questions arose about other character sets. How do we encode Russian, Japanese, or Chinese? And what about the codes above 127? Some people used them to support other languages. Others used them for graphics characters like vertical and horizontal lines so they could draw silly pictures on their printers.

For a while we had things called code pages. These were character sets with identical character codes for the first 128 codes and a special set of related characters for the codes above 127. The IBM PCs of the 1980s had dozens and dozens of different code pages. You could have code pages for other languages (at least the simple ones), but that still didn't solve the problem. Japanese and Chinese have thousands of characters. They were never going to fit into eight bits. And what about languages where a letter changes shape at the end of a word, like the Greek sigma (σ in the middle of a word, ς at the end)? Is that the same letter or a different one? Yuck!
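Here's the code page problem in miniature: the very same byte means completely different things depending on which code page the reader assumes. A quick Python sketch, using two code pages that Python's standard codecs still support:

```python
# One byte above 127 -- its meaning depends entirely on the code page.
raw = b"\xe9"  # the single byte 0xE9 (decimal 233)

# IBM PC code page 437 (the original 1980s PC code page):
print(raw.decode("cp437"))   # 'Θ' -- Greek capital theta

# Windows code page 1252 (Western European):
print(raw.decode("cp1252"))  # 'é' -- e with acute accent
```

Send that byte from a machine assuming one code page to a machine assuming the other, and your text silently turns into the wrong characters. Multiply that by dozens of code pages and you can see why something better was needed.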

Needless to say, this has caused an endless amount of hassle for people over the years. It's why you might get an email where the "from," the "subject," or the "to" field looks like "?????????????" When you see that kind of garbage, it means the programmer is unaware of something I'll discuss in the next ASCII blog: Unicode, which, if you want to be prepared, you can read about on the Unicode web page (http://www.unicode.org/).
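Those question marks aren't random noise, by the way. A common way they arise is software forcing rich text through a 7-bit ASCII pipeline and substituting "?" for anything it can't represent. A hedged sketch of that failure mode in Python:

```python
text = "café – naïve"

# Squeezing this through ASCII destroys every character above
# code 127; errors='replace' swaps each one for a '?'.
mangled = text.encode("ascii", errors="replace")
print(mangled)  # b'caf? ? na?ve'
```

That's exactly the kind of damage Unicode was invented to prevent, as we'll see next time.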