Yazılım Çorbası: UTF-16

Introduction
UTF-16 grew out of UCS-2, an older fixed-width (2-byte) encoding.

UTF-16 Falls Short

UTF-16 falls short because the Unicode Consortium did not anticipate how far the character set would have to grow. The explanation is as follows:

Cultural factors played a role. Western creators of Unicode expected that unifying East Asian ideographs would be similar to unifying Western fonts (roman, italics...) and that they get to decide where the line between modern and unusual characters lies. This proved to be too optimistic. For the unification, the devil was in the details, but also in the need of processing of multi-lingual documents and databases: variation selectors were already breaking the 16 bits per character principle, like UTF-16 later did.

In 2000, PRC published their GB 18030 standard as mandatory for all software applications marketed for China. At this point, at the latest, it became obvious that Unicode Consortium wasn't going to have a free hand in inclusion of characters if they want their standard to remain relevant world wide.

Byte Order Mark - BOM
UTF-16 files sometimes start with a BOM. The explanation is as follows:

UNICODE uses 2 bytes for one character, so it has big or little endian difference.

The BOM is a legacy of UCS-2. The explanation is as follows:

UCS-2 had big-endian and little-endian because it directly represented the codepoint as a 16-bit 'uint16_t' or 'short int' number, like in C and other programming languages. It's not so much an 'encoding' as a direct memory representation of the numeric values, and as an uint16_t can be either BE or LE on different machines, so is UCS-2. The later UTF-16 just inherited the same mess for compatibility.
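To make the byte-order difference concrete, here is a minimal sketch that writes a single 16-bit code unit out as two bytes in each order. The helper names to_big_endian and to_little_endian are hypothetical, chosen only for this illustration.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: most significant byte first.
constexpr std::array<std::uint8_t, 2> to_big_endian(std::uint16_t unit) noexcept {
    return {static_cast<std::uint8_t>(unit >> 8),
            static_cast<std::uint8_t>(unit & 0xFF)};
}

// Hypothetical helper: least significant byte first.
constexpr std::array<std::uint8_t, 2> to_little_endian(std::uint16_t unit) noexcept {
    return {static_cast<std::uint8_t>(unit & 0xFF),
            static_cast<std::uint8_t>(unit >> 8)};
}
```

For example, the code unit 0x20AC (the euro sign) becomes the bytes 20 AC in big-endian order and AC 20 in little-endian order.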

Another explanation is as follows:

So the people were forced to come up with the bizarre convention of storing a FE FF at the beginning of every Unicode string; this is called a Unicode Byte Order Mark and if you are swapping your high and low bytes it will look like a FF FE and the person reading your string will know that they have to swap every other byte. Phew. Not every Unicode string in the wild has a byte order mark at the beginning.
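A reader can check the first two bytes to guess the byte order. The following is only a sketch: ByteOrder and detect_utf16_bom are names made up for this example, and it assumes the buffer really holds UTF-16 (the bytes FF FE are also how a UTF-32 little-endian BOM begins, which this sketch does not try to tell apart).

```cpp
#include <cstddef>
#include <cstdint>

enum class ByteOrder { BigEndian, LittleEndian, Unknown };

// Hypothetical helper: look for a UTF-16 BOM in the first two bytes.
// Returns Unknown when the buffer is too short or has no BOM.
inline ByteOrder detect_utf16_bom(const std::uint8_t* data, std::size_t size) noexcept {
    if (size >= 2) {
        if (data[0] == 0xFE && data[1] == 0xFF) return ByteOrder::BigEndian;
        if (data[0] == 0xFF && data[1] == 0xFE) return ByteOrder::LittleEndian;
    }
    return ByteOrder::Unknown;
}
```

Since not every string carries a BOM, a caller still needs a fallback for the Unknown case, for instance a default agreed on by the protocol or file format.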

Converting from a Code Point

The explanation is as follows.

Unicode characters outside the Basic Multilingual Plane, that is characters with code above 0xFFFF, are encoded in UTF-16 by pairs of 16-bit code units called surrogate pairs, by the following scheme:

0x010000 is subtracted from the code point, leaving a 20-bit number in the range 0..0x0FFFFF;

the top ten bits (a number in the range 0..0x03FF) are added to 0xD800 to give the first code unit or high surrogate, which will be in the range 0xD800..0xDBFF;

the low ten bits (also in the range 0..0x03FF) are added to 0xDC00 to give the second code unit or low surrogate, which will be in the range 0xDC00..0xDFFF.

To convert a code point into two 16-bit values (a surrogate pair), we do the following.

#include <cstdint>

using codepoint = std::uint32_t;
using utf16 = std::uint16_t;

struct surrogate {
    utf16 high; // Leading
    utf16 low;  // Trailing
};

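// Precondition: in lies in 0x10000..0x10FFFF (outside the BMP).
// Smaller values would wrap around in the unsigned subtraction below.
constexpr surrogate split(codepoint const in) noexcept {
    auto const inMinus0x10000 = (in - 0x10000); // 20-bit offset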
    surrogate const r{
            static_cast<utf16>((inMinus0x10000 / 0x400) + 0xd800), // High
            static_cast<utf16>((inMinus0x10000 % 0x400) + 0xdc00)}; // Low
    return r;
}

To convert two 16-bit values (a surrogate pair) back into a code point, we do the following.

// Precondition: s.high is in 0xD800..0xDBFF and s.low is in 0xDC00..0xDFFF.
constexpr codepoint combine(surrogate const s) noexcept {
    return static_cast<codepoint>(
            ((s.high - 0xd800) * 0x400) + (s.low - 0xdc00) + 0x10000);
}
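The two functions above assume their input is already in the right range. As a sketch of how range checks could be added, the block below encodes any code point, including those inside the BMP. The names is_high_surrogate, is_low_surrogate and encode_utf16 are hypothetical, and returning an empty string for invalid input is just one possible convention picked for this example.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helpers: classify a single 16-bit code unit.
constexpr bool is_high_surrogate(char16_t u) noexcept { return u >= 0xD800 && u <= 0xDBFF; }
constexpr bool is_low_surrogate(char16_t u) noexcept { return u >= 0xDC00 && u <= 0xDFFF; }

// Hypothetical helper: encode one code point as one or two UTF-16 code units.
// Returns an empty string for values that are not valid code points to encode
// (above 0x10FFFF, or inside the surrogate range 0xD800..0xDFFF).
inline std::u16string encode_utf16(char32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return {};
    if (cp < 0x10000) return std::u16string(1, static_cast<char16_t>(cp));
    const char32_t offset = cp - 0x10000;  // 20-bit number
    std::u16string out;
    out.push_back(static_cast<char16_t>(0xD800 + (offset >> 10)));    // top ten bits
    out.push_back(static_cast<char16_t>(0xDC00 + (offset & 0x3FF)));  // low ten bits
    return out;
}
```

As a worked example, U+1F600 minus 0x10000 is 0xF600; its top ten bits are 0x3D and its low ten bits are 0x200, so the surrogate pair is 0xD83D 0xDE00.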


Monday, April 6, 2020
