Insurances.net

subject: Performing Unicode Comparison Through The Encodings

Unicode is very useful for anyone who works with foreign writing systems. A typical keyboard shows only the digits and letters used to write English. So what if you need to write characters from, say, Korean? Where would you get them? Unicode makes this possible. It lets people use many different writing systems, including the Korean, Chinese and Japanese scripts as well as the Latin script used for languages such as Spanish and German. Dealing with these languages is much easier today.

Unicode encodings can be compared with one another. Two situations have to be considered first: environments that are 8-bit clean, and environments that do not allow byte values with the high bit set. An example of the latter is the Simple Mail Transfer Protocol (SMTP). Before comparing the encodings, you need to know which character encodings are associated with Unicode. These include UTF-8, UTF-1, UTF-7, UTF-16, UTF-32, UTF-EBCDIC, BOCU-1, Punycode (IDN), CESU-8, GB 18030 and others.
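To make the high-bit restriction concrete, here is a minimal Python sketch. The helper name `is_seven_bit_safe` and the sample string are assumptions made for illustration; the `utf-8` and `utf-7` codecs are part of the Python standard library.

```python
def is_seven_bit_safe(data: bytes) -> bool:
    """Return True if no byte in data has its high bit (0x80) set."""
    return all(b < 0x80 for b in data)


# Sample text (an assumption for this sketch): two Korean syllables plus ASCII.
text = "\ud55c\uae00 test"

utf8_bytes = text.encode("utf-8")
utf7_bytes = text.encode("utf-7")

# UTF-8 uses bytes with the high bit set for non-ASCII characters,
# so it needs an 8-bit clean channel.
print(is_seven_bit_safe(utf8_bytes))

# UTF-7 was designed to stay within 7-bit byte values.
print(is_seven_bit_safe(utf7_bytes))

# Both encodings still decode back to the same text.
print(utf7_bytes.decode("utf-7") == text)
```

In this sketch the UTF-8 bytes fail the check while the UTF-7 bytes pass it, which is why an encoding like UTF-7 suits channels that only accept 7-bit data.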

A comparison of Unicode encodings has to cover several issues. The first is compatibility. In UTF-8, a file that contains only ASCII characters is identical to an ASCII file. Many programs written for 8-bit text can also handle UTF-8 files that contain non-ASCII characters. UTF-16 and UTF-32, on the other hand, are not compatible with ASCII files, so Unicode-aware programs are needed to display, print and manipulate them. Java and Windows are examples of systems that use UTF-16 internally. Even on these systems, text objects such as program code are usually stored in 8-bit encodings, that is ASCII or UTF-8, rather than in UTF-16. XML is usually encoded in UTF-8, although UTF-16 is sometimes used.
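The compatibility point can be checked directly. The following is a small Python sketch; the sample string and variable names are assumptions chosen for illustration, and the little-endian codec variants are used so that no byte order mark is added.

```python
# Sample ASCII-only text (an assumption for this sketch).
ascii_text = "Hello"

ascii_bytes = ascii_text.encode("ascii")
utf8_bytes = ascii_text.encode("utf-8")
utf16_bytes = ascii_text.encode("utf-16-le")
utf32_bytes = ascii_text.encode("utf-32-le")

# ASCII-only text is byte-for-byte identical in UTF-8.
utf8_matches_ascii = utf8_bytes == ascii_bytes

# UTF-16 and UTF-32 pad each ASCII character with zero bytes,
# so an ASCII-only program cannot read them as-is.
utf16_matches_ascii = utf16_bytes == ascii_bytes
utf32_matches_ascii = utf32_bytes == ascii_bytes

print(utf8_matches_ascii, utf16_matches_ascii, utf32_matches_ascii)
print(utf16_bytes)
```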

Another issue is size. UTF-32 needs four bytes to encode any character. UTF-16 uses two bytes for any character within the Basic Multilingual Plane (BMP) and four bytes for any character outside it. UTF-8 uses between one and four bytes per character. For ASCII characters, UTF-8 uses only one byte, which is half the space that UTF-16 needs. Latin letters outside ASCII, such as accented letters, take two bytes in UTF-8. Printable characters in UTF-EBCDIC take at least as many bytes as they do in UTF-8.
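The byte counts above can be verified with a short Python sketch. The sample characters and the helper name `encoded_sizes` are assumptions picked for illustration; the little-endian codecs are used so that no byte order mark inflates the counts.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")

# One sample character per category (assumed examples).
samples = {
    "ASCII letter": "A",                  # U+0041
    "accented Latin letter": "\u00e9",    # U+00E9
    "Korean syllable (BMP)": "\ud55c",    # U+D55C
    "emoji (outside BMP)": "\U0001F600",  # U+1F600
}


def encoded_sizes(char: str) -> dict:
    """Return the number of bytes char occupies in each encoding."""
    return {enc: len(char.encode(enc)) for enc in ENCODINGS}


for label, char in samples.items():
    print(label, encoded_sizes(char))
```

Note that the Korean syllable takes three bytes in UTF-8 but only two in UTF-16, so UTF-8 is not always the smaller encoding for non-Latin text.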

The last issue is processing. A format should be easy to search and truncate, and safe to process in general.
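As an illustration of what safe truncation means, here is a minimal Python sketch for UTF-8. It assumes the input is valid UTF-8, and `truncate_utf8` is a hypothetical helper name, not a standard library function. It relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx, so a character boundary can be found by stepping backwards.

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut valid UTF-8 data to at most max_bytes without splitting a character."""
    if max_bytes <= 0:
        return b""
    if max_bytes >= len(data):
        return data
    end = max_bytes
    # If the first byte after the cut is a continuation byte (10xxxxxx),
    # the cut falls inside a character, so step back to its lead byte.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


# Sample text (an assumption): "a" is 1 byte, U+00E9 is 2, U+D55C is 3.
encoded = "a\u00e9\ud55c".encode("utf-8")

for limit in range(len(encoded) + 1):
    # Every truncated result still decodes cleanly.
    print(limit, truncate_utf8(encoded, limit).decode("utf-8"))
```

A plain slice such as `encoded[:2]` would cut the two-byte character in half and fail to decode, which is the kind of problem a format that is easy to truncate helps to avoid.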

by: Willie Greg
