GB 2312

GB2312 is the registered internet name for a key official character set of the People's Republic of China, used for simplified Chinese characters. GB abbreviates Guojia Biaozhun (国家标准), which means national standard in Chinese.

GB2312 (1980) has been superseded by GBK and GB18030, which include additional characters, but GB2312 is nonetheless still in widespread use.

While GB2312 covers 99.75% of the characters used for Chinese input, historical texts and many names remain out of scope. GB2312 includes 6,763 Chinese characters on two levels (the first arranged by reading, the second by radical and then number of strokes), along with symbols and punctuation, Japanese kana, the Greek and Cyrillic alphabets, Zhuyin, and a double-byte set of Pinyin letters with tone marks. As of July 2014, 1.5% of all web pages used GB2312.[1]

There is an analogous character set known as GB/T 12345, closely related to GB2312, but with traditional character forms replacing simplified forms. GB-encoded fonts often come in pairs, one with the GB 2312 (simplified) character set and the other with the GB/T 12345 (traditional) character set.


Characters

Characters in GB2312 are arranged in a 94x94 grid (as in ISO 2022), and the two-byte codepoint of each character is expressed in the kuten (or quwei) form, which specifies a row (ku or qu) and the position of the character within the row (ten or wei).

The rows (numbered from 1 to 94) contain characters as follows:

  • 01-09, comprising punctuation and other special characters; also Hiragana, Katakana, Greek, Cyrillic, Pinyin, Bopomofo
  • 16-55, the first level of Chinese characters, arranged according to Pinyin (3755 characters).
  • 56-87, the second level of Chinese characters, arranged according to radical and strokes (3008 characters).
  • 88-89, further Chinese characters. (103 characters). Defined only for GB/T 12345, not GB 2312.

The rows 10-15 and 90-94 are unassigned.
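
The row layout listed above can be sketched as a small lookup. This is only an illustration: the function name row_category and the category labels are made up for this example and are not part of the standard.

```python
def row_category(row: int) -> str:
    """Describe what a GB2312 row (1-94) contains, following the row list above.

    The labels returned here are illustrative names, not official terms.
    """
    if not 1 <= row <= 94:
        raise ValueError(f"row must be in 1..94, got {row}")
    if row <= 9:
        return "symbols"            # punctuation, kana, Greek, Cyrillic, Pinyin, Bopomofo
    if 16 <= row <= 55:
        return "level 1 hanzi"      # arranged by Pinyin
    if 56 <= row <= 87:
        return "level 2 hanzi"      # arranged by radical and strokes
    if 88 <= row <= 89:
        return "GB/T 12345 only"    # not defined in GB 2312 itself
    return "unassigned"             # rows 10-15 and 90-94


if __name__ == "__main__":
    for r in (3, 12, 45, 60, 88, 94):
        print(r, row_category(r))
```

The two hanzi levels together account for the 6,763 Chinese characters of GB2312 (3755 + 3008).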

Encodings of GB2312

EUC-CN

EUC-CN is often used as the character encoding (i.e. for external storage) in programs that deal with GB2312, because it maintains compatibility with ASCII. ASCII characters are stored as single bytes, and two bytes are used to represent every character not found in ASCII. The first byte ranges from 0xA1 to 0xF7 (161-247), and the second byte ranges from 0xA1 to 0xFE (161-254).

Compared to UTF-8, GB2312 (whether native or encoded in EUC-CN) is more storage efficient for Chinese text: each Chinese character takes two bytes rather than the three that UTF-8 needs. This is because no bits are reserved to indicate three- or four-byte sequences, and no bit is reserved for detecting trailing bytes.
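
The size difference can be checked with a short sketch using Python's built-in gb2312 codec (which produces EUC-CN bytes). The helper name encoded_sizes is hypothetical and exists only for this example.

```python
def encoded_sizes(text: str):
    """Return (size in GB2312/EUC-CN bytes, size in UTF-8 bytes) for text."""
    return len(text.encode("gb2312")), len(text.encode("utf-8"))


if __name__ == "__main__":
    for sample in ("A", "外", "GB外"):
        gb_size, utf8_size = encoded_sizes(sample)
        print(repr(sample), "gb2312:", gb_size, "utf-8:", utf8_size)
```

ASCII characters take one byte in both encodings, so the saving applies only to the non-ASCII part of a text.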

To map a code point to bytes, add 160 (0xA0) to the row number (the thousands and hundreds digits of the code point) to form the high byte, and add 160 (0xA0) to the position within the row (the tens and ones digits) to form the low byte.

For example, for the GB2312 code point 4566 ("外", which means foreign), the high byte comes from the row, 45, and the low byte comes from the position, 66. For the high byte, add 45 to 160, giving 205 or 0xCD. For the low byte, add 66 to 160, giving 226 or 0xE2. The full encoding is therefore 0xCDE2.
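
The arithmetic above can be written as a minimal sketch. The function names kuten_to_euc_cn and euc_cn_to_kuten are illustrative, and the code assumes the code point is given as a four-digit decimal number (row times 100 plus position), as in the example.

```python
def kuten_to_euc_cn(code_point: int) -> bytes:
    """Convert a decimal kuten/quwei code point (e.g. 4566) to its two EUC-CN bytes."""
    row, position = divmod(code_point, 100)
    if not (1 <= row <= 94 and 1 <= position <= 94):
        raise ValueError(f"not a valid 94x94 grid position: {code_point}")
    # Add 0xA0 (160) to the row and to the position, as described above.
    return bytes([row + 0xA0, position + 0xA0])


def euc_cn_to_kuten(data: bytes) -> int:
    """Inverse of kuten_to_euc_cn: two EUC-CN bytes back to a decimal code point."""
    if len(data) != 2 or not all(0xA1 <= b <= 0xFE for b in data):
        raise ValueError("expected two bytes in the range 0xA1-0xFE")
    return (data[0] - 0xA0) * 100 + (data[1] - 0xA0)


if __name__ == "__main__":
    encoded = kuten_to_euc_cn(4566)
    print(encoded.hex().upper())      # CDE2
    print(encoded.decode("gb2312"))   # decoded with Python's built-in codec
    print(euc_cn_to_kuten(encoded))   # 4566
```

These helpers check only that the value lies in the 94x94 grid, not that a character is actually assigned at that position.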

HZ

HZ is another encoding of GB2312 that is used mostly for Usenet postings.
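
As a rough illustration, Python ships an hz codec that can be compared with EUC-CN. HZ (specified in RFC 1843) is a 7-bit encoding that brackets GB2312 text between the escape sequences ~{ and ~} and stores each byte without its high bit. The helper name hz_and_euc below is hypothetical.

```python
def hz_and_euc(text: str):
    """Return the HZ and EUC-CN encodings of text, using Python's built-in codecs."""
    return text.encode("hz"), text.encode("gb2312")


if __name__ == "__main__":
    hz_bytes, euc_bytes = hz_and_euc("外")
    print(hz_bytes)          # 7-bit ASCII bytes, safe for 7-bit transports
    print(euc_bytes.hex())   # the same character in EUC-CN
    # Inside the ~{ ... ~} escapes, each HZ byte is the EUC-CN byte minus 0x80.
    print(bytes(b - 0x80 for b in euc_bytes))
```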

Two implementations of GB2312

There are two implementations of GB2312, which differ in a few code points.

Bytes   Implementation A        Implementation B
A1A4    U+00B7 MIDDLE DOT       U+30FB KATAKANA MIDDLE DOT
A1AA    U+2014 EM DASH          U+2015 HORIZONTAL BAR

Implementation A is compatible with GBK and GB18030, while Implementation B is not.

As of 2015, the Microsoft .NET Framework uses Implementation A, while iconv-1.14, php-5.6, ActivePerl-5.20, Java 1.7 and Python 3.4 use Implementation B.[2] Ruby 2.2 is compatible with both Implementation A and Implementation B: internally, it converts the conflicting characters to Implementation A.
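
Which mapping a given decoder follows can be probed by decoding the two conflicting byte pairs. The sketch below does this for Python codecs; the function names classify_implementation and probe_codec are invented for this example, and the result for any particular Python version is not assumed here.

```python
def classify_implementation(a1a4: str, a1aa: str) -> str:
    """Classify a decoder by what it returns for bytes A1A4 and A1AA."""
    if (a1a4, a1aa) == ("\u00b7", "\u2014"):   # MIDDLE DOT, EM DASH
        return "A"
    if (a1a4, a1aa) == ("\u30fb", "\u2015"):   # KATAKANA MIDDLE DOT, HORIZONTAL BAR
        return "B"
    return "mixed or unknown"


def probe_codec(name: str = "gb2312") -> str:
    """Decode the two conflicting byte pairs with the named Python codec."""
    return classify_implementation(
        b"\xa1\xa4".decode(name),
        b"\xa1\xaa".decode(name),
    )


if __name__ == "__main__":
    for codec in ("gb2312", "gbk"):
        print(codec, "->", probe_codec(codec))
```

Text converted between systems that follow different implementations may not round-trip for these two characters, so a probe like this can help when diagnosing mismatches.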
