Collate synonym

5/30/2023

The Unihan database is the repository for the Unicode Consortium’s collective knowledge regarding the CJK Unified Ideographs contained in the Unicode Standard. It contains mapping data to allow conversion to and from other coded character sets and additional information to help implement support for the various languages which use the Han ideographic script.

Formally, ideographs are defined within the Unicode Standard via their mappings. That is, the Unicode Standard does not formally define what the ideograph U+4E00 is; rather, it defines it as being the equivalent of, say, 0x523B in GB/T 2312, 0x14421 in CNS 11643, 0x306C in JIS X 0208, and so on.

In practice, implementation of ideographs requires large amounts of ancillary data. Input methods require information such as pronunciations, as do collation algorithms. Data in character sets not included in the world of international standards bodies needs to be converted. Relationships between ideographs need to be defined to allow for fuzzy string matching. Beyond all this, it’s important to track not only what properties a given ideograph has, but who claims it has those properties.

Unlike characters in Western scripts such as Latin and Greek, whose basic property is their sound, which stays largely constant across languages, the basic property for Han ideographs is their meaning. This isn’t to say that ideographs are truly ideographic, in that they represent abstract ideas; but they generally have one root meaning from which the others derive, and generally retain the bulk of their semantic content across linguistic boundaries. Most ideographs are divided into a determinative, which gives a vague sense of meaning, and a phonetic, which gives a vague sense of pronunciation. The Unihan database therefore includes structural analyses and definitions for ideographs.

This document is a guide to that data, describing the mechanics of the Unihan database, the nature of its contents, and the status of the various fields.

The database consists of a number of fields containing data for each Han ideograph in the Unicode Standard. The fields are all named, and the names consist entirely of ASCII letters and digits with no spaces or other punctuation except for underscore. For historical reasons, they all start with a lowercase “k”.

All data in the Unihan database is stored in UTF-8 using Normalization Form C (NFC). Note, however, that the "Syntax" descriptions below, used for validation of field values, operate on Normalization Form D (NFD), primarily because that makes the regular expressions simpler.

2.1.1 Extension of Unihan Properties to Non-Unihan Characters

Some characters which are not unified ideographs are considered equivalent to unified ideographs.
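To make the idea of "defined via mappings" concrete, here is a minimal sketch that checks two of the code values quoted for U+4E00 using Python's built-in codecs. It assumes that the `gb2312` and `euc_jp` codecs emit the two-byte GB/T 2312 or JIS X 0208 code with the high bit set on each byte (the EUC form), so clearing that bit recovers the value as the text writes it. The helper name `euc_to_code` is hypothetical, and CNS 11643 is left out of this sketch.

```python
def euc_to_code(ch: str, codec: str) -> int:
    """Return the 7-bit two-byte code of `ch` as seen through an EUC-style codec.

    Hypothetical helper for illustration; not part of any Unihan tooling.
    """
    raw = ch.encode(codec)
    if len(raw) != 2:
        raise ValueError(f"{ch!r} is not a two-byte character in {codec}")
    # EUC encodings set the high bit of each byte; clear it to get the
    # code value as it is usually quoted (e.g. 0x523B rather than 0xD2BB).
    return ((raw[0] & 0x7F) << 8) | (raw[1] & 0x7F)


IDEOGRAPH = "\u4e00"  # the ideograph discussed in the text

gb_code = euc_to_code(IDEOGRAPH, "gb2312")
jis_code = euc_to_code(IDEOGRAPH, "euc_jp")

print(f"GB/T 2312:  0x{gb_code:04X}")   # 0x523B
print(f"JIS X 0208: 0x{jis_code:04X}")  # 0x306C
```

Both values agree with the figures given for U+4E00, which is the sense in which the character is identified by its equivalents in other coded character sets.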
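The field-naming rule and the NFC/NFD split described in this guide can be sketched roughly as follows. This is an illustration under stated assumptions, not the database's own validation code: the name pattern assumes at least one character follows the leading "k", the field names used as samples (`kDefinition`, `kIRG_GSource`) are examples not taken from the text above, and `TONED_SYLLABLE` is a made-up pattern, not an actual Unihan "Syntax" expression.

```python
import re
import unicodedata

# Assumed reading of the naming rule: a lowercase "k" followed by one or
# more ASCII letters, digits, or underscores.
FIELD_NAME = re.compile(r"k[A-Za-z0-9_]+")

# Hypothetical syntax pattern: ASCII letters with an optional combining
# tone mark (grave, acute, macron, caron). Written against NFD text.
TONED_SYLLABLE = re.compile(r"[a-z]+[\u0300\u0301\u0304\u030C]?[a-z]*")


def is_valid_field_name(name: str) -> bool:
    """Check a field name against the assumed naming rule."""
    return FIELD_NAME.fullmatch(name) is not None


def to_validation_form(value: str) -> str:
    """Convert a stored (NFC) value to NFD before applying a syntax pattern."""
    return unicodedata.normalize("NFD", value)


# "y" + U+012B (i with macron): the precomposed NFC form, as stored.
stored = unicodedata.normalize("NFC", "y\u012b")
decomposed = to_validation_form(stored)

print(is_valid_field_name("kDefinition"))   # True
print(is_valid_field_name("Definition"))    # False: no leading "k"
print(len(stored), len(decomposed))         # 2 3
print(bool(TONED_SYLLABLE.fullmatch(stored)))      # False: precomposed vowel
print(bool(TONED_SYLLABLE.fullmatch(decomposed)))  # True: base letter + mark
```

The last two lines hint at why patterns are simpler in NFD: a single optional combining mark covers every toned vowel, whereas an NFC pattern would have to enumerate each precomposed letter.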