Term of the Week: Unicode – The Language of Localization

What is it?

A character encoding standard that provides a cross-platform, uniform, and robust digital representation of the scripts for the world’s languages.

Why is it important?

Unicode has become the de facto standard for representing the characters of the world’s scripts on modern digital devices, which makes it a prerequisite for virtually all digital text.

Why does a business professional need to know this?

Unicode provides the foundation for anything related to text data. The Unicode standard is developed and maintained by the Unicode Consortium. As of Version 11.0, released on June 5, 2018, Unicode defines 137,374 characters (see Code Charts). Because every one of them is represented in the same uniform way, text stays interoperable and translatable in any text-related task that you might encounter, whether it involves multilingual user interface (UI) strings or translations of entire manuals.

Any implementation that handles text but does not support Unicode is a completely wasted effort, because its text data cannot easily interoperate with Unicode-based implementations that are now commonplace (see the Unicode FAQ).
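To make the interoperability point concrete, here is a minimal Python sketch. It assumes Shift_JIS as a stand-in for a legacy, non-Unicode encoding; the sample string and the helper name try_decode_utf8 are illustrative and are not part of any standard.

```python
from typing import Optional

# Sample text: the Japanese word for "Japanese language" (illustrative choice).
text = "\u65E5\u672C\u8A9E"

# Every character has one code point, independent of platform or vendor.
code_points = [f"U+{ord(ch):04X}" for ch in text]
print(code_points)

# The same text serialized two ways: a Unicode encoding and a legacy encoding.
utf8_bytes = text.encode("utf-8")
legacy_bytes = text.encode("shift_jis")


def try_decode_utf8(data: bytes) -> Optional[str]:
    """Hypothetical helper: return the decoded text, or None if the bytes are not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


# A Unicode-based consumer reads the UTF-8 bytes without trouble...
print(try_decode_utf8(utf8_bytes) == text)
# ...but cannot make sense of the legacy bytes unless it is told their encoding.
print(try_decode_utf8(legacy_bytes))
```

The legacy bytes are not lost, but every consumer must know out of band which encoding produced them, and that is the kind of friction the paragraph above describes.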

It is important to understand that Unicode is much more than a huge bucket of characters covering 146 scripts (including Egyptian hieroglyphs) that are used by an even larger number of the world’s languages:

Unicode defines several properties that determine how its characters are to behave.
The UCD (Unicode Character Database) is the primary source for these properties, which are documented in UAX #44. Some of the properties include line breaking, casing, bidirectionality, inherent width, and so on.
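A few of these properties can be inspected with Python's standard unicodedata module, which ships with a copy of the UCD. The sketch below is illustrative only: the helper name describe and the sample characters are assumptions, and the UCD version consulted depends on the Python release in use.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Hypothetical helper: collect a few UCD properties for one character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "general_category": unicodedata.category(ch),
        "bidi_class": unicodedata.bidirectional(ch),
        "east_asian_width": unicodedata.east_asian_width(ch),
    }


samples = [
    "A",           # Latin capital letter
    "\u05D0",      # Hebrew letter alef (right-to-left script)
    "\u6F22",      # CJK unified ideograph
    "\U00013000",  # an Egyptian hieroglyph
]

for ch in samples:
    print(describe(ch))

# Casing is also driven by character data rather than simple byte arithmetic:
# the German sharp s uppercases to two letters.
print("stra\u00DFe".upper())
```

The Hebrew letter reports a right-to-left bidirectional class and the ideograph reports a wide East Asian width, which is the sort of information that layout and line-breaking code relies on.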

Closely related to Unicode are the following two important and useful projects:

ICU (International Components for Unicode), which provides robust libraries that implement many functions for properly handling Unicode-based text data according to the UCD.
CLDR (Common Locale Data Repository), which provides an enormous amount of locale data used by an increasingly large number of OSes and apps.

Both projects are frequently updated, and Unicode itself is now on an annual release cycle.

References

[Unicode] The Unicode Standard. Unicode Consortium. The latest version of The Unicode Standard.
[Unicode TR44] The Unicode Standard Annex #44. Unicode Consortium. The documentation for the Unicode Character Database (UCD).
[Unicode FAQ] Unicode FAQ. Unicode Consortium. The Unicode FAQ covers a wide variety of topics that benefit both users and developers.
[Unicode Charts] Unicode Character Code Charts. Unicode Consortium. The latest Unicode Character Code Charts.
[Unicode CLDR] CLDR. Unicode Consortium. Common Locale Data Repository.
[Unicode ICU] ICU. International Components for Unicode.
[Unicode UCD] UCD. Unicode Consortium. The latest Unicode Character Database.
[Lunde 2009] CJKV Information Processing. Lunde, Ken. (2009). O’Reilly Media. Chinese, Japanese, Korean, and Vietnamese language processing.

About Ken Lunde

Dr. Ken Lunde has worked at Adobe for over 25 years, specializing in CJKV Type Development. He architected the open source Source Han Sans and Source Han Serif Pan-CJK typeface families that were released in 2014 and 2017, respectively. He is the author of CJKV Information Processing and also regularly publishes articles on Adobe's CJK Type Blog.

Ken received the 2018 Unicode Bulldog Award and became a Unicode Technical Director in the same year.

Term: Unicode

Email: lunde@adobe.com

Website: blogs.adobe.com/CCJKType/

Twitter: @ken_lunde

LinkedIn: linkedin.com/in/kenlunde/

Facebook: facebook.com/ken.lunde

2 thoughts on “Term of the Week: Unicode”

Dr. Ken Lunde December 20, 2018 at 9:05 am

In terms of the printed book, the essay for this particular term was written when Version 10.0 was the current version of Unicode. As reflected above, Version 11.0 was subsequently released on 2018-06-05, which added 684 new characters, bringing the total number to 137,374. Seven new scripts—Hanifi Rohingya, Old Sogdian, Sogdian, Dogra, Gunjala Gondi, Makasar, and Medefaidrin—were added, bringing the total number to 146. Unicode Version 12.0, which will be released in early March of 2019, will add four new scripts, bringing the total to an even 150.

Richard Hamilton December 20, 2018 at 5:50 pm

Thanks for the update. The page now has the updates, but I'm glad we have your comment, too, since it adds useful detail.
