Collate synonym

5/30/2023

Proposed Update Unicode® Standard Annex #38 Unicode Han Database (Unihan) Version

This document describes the organization and content of the Unihan database.

This document may be updated, replaced, or superseded by other documents at any time. Publication does not imply endorsement by the Unicode Consortium. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.

A Unicode Standard Annex (UAX) forms an integral part of the Unicode Standard, but is published online as a separate document. The Unicode Standard may require conformance to normative content in a Unicode Standard Annex, if so specified in the Conformance chapter of that version of the Unicode Standard. The version number of a UAX document corresponds to the version of the Unicode Standard of which it forms a part.

Please submit corrigenda and other comments with the online reporting form. Related information that is useful in understanding this annex is found in Unicode Standard Annex #41, “Common References for Unicode Standard Annexes.”

2.1.1 Extension of Unihan Properties to Non-Unihan Characters
2.1.2 Sorting Algorithm Used by the Radical-Stroke Charts
3.7.1 Simplified and Traditional Chinese Variants
4.2 Listing by Date of Addition to the Unicode Standard
4.3 Listing by Location within Unihan.zip
4.4 Listing of Characters Covered by the Unihan Database
4.5 Listing of Additional Sources Used by the Unihan Database

The Unihan database is the repository for the Unicode Consortium’s collective knowledge regarding the CJK Unified Ideographs contained in the Unicode Standard. It contains mapping data to allow conversion to and from other coded character sets and additional information to help implement support for the various languages which use the Han ideographic script.

Formally, ideographs are defined within the Unicode Standard via their mappings. That is, the Unicode Standard does not formally define what the ideograph U+4E00 is; rather, it defines it as being the equivalent of, say, 0x523B in GB/T 2312, 0x14421 in CNS 11643, 0x306C in JIS X 0208, and so on.

In practice, implementation of ideographs requires large amounts of ancillary data. Input methods require information such as pronunciations, as do collation algorithms. Data in character sets not included in the world of international standards bodies needs to be converted. Relationships between ideographs need to be defined to allow for fuzzy string matching. Beyond all this, it’s important to track not only what properties a given ideograph has, but who claims it has those properties.

Unlike characters in Western scripts such as Latin and Greek, whose basic property is their sound, which stays largely constant across languages, the basic property for Han ideographs is their meaning. This isn’t to say that ideographs are truly ideographic, in that they represent abstract ideas, but they generally have one root meaning from which the others derive, and generally retain the bulk of their semantic content across linguistic boundaries. Most ideographs are divided into a determinative, which gives a vague sense of meaning, and a phonetic, which gives a vague sense of pronunciation. The Unihan database therefore includes structural analyses and definitions for ideographs.

This document is a guide to that data, describing the mechanics of the Unihan database, the nature of its contents, and the status of the various fields.

The database consists of a number of fields containing data for each Han ideograph in the Unicode Standard. The fields are all named, and the names consist entirely of ASCII letters and digits with no spaces or other punctuation except for underscore. For historical reasons, they all start with a lowercase “k”.

All data in the Unihan database is stored in UTF-8 using Normalization Form C (NFC). Note, however, that the "Syntax" descriptions below, used for validation of field values, operate on Normalization Form D (NFD), primarily because that makes the regular expressions simpler.

2.1.1 Extension of Unihan Properties to Non-Unihan Characters

Some characters which are not unified ideographs are considered equivalent to unified ideographs.
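The field-naming rule and the NFC-storage, NFD-validation split described above can be sketched in Python. This is only an illustration under stated assumptions: the helper names and the reading pattern are hypothetical and are not taken from the annex's actual "Syntax" entries; only the standard-library `re` and `unicodedata` modules are used.

```python
import re
import unicodedata

# Field names: ASCII letters, digits, and underscore only,
# starting with a lowercase "k" (as the text describes).
FIELD_NAME = re.compile(r"k[A-Za-z0-9_]+")

# HYPOTHETICAL syntax pattern for a romanized reading, written against NFD:
# ASCII lowercase letters plus combining marks in U+0300..U+036F.
# It is an illustration, not a pattern from the Unihan database itself.
READING_NFD = re.compile(r"[a-z\u0300-\u036f]+")


def is_valid_field_name(name: str) -> bool:
    """Check a field name against the naming rule."""
    return FIELD_NAME.fullmatch(name) is not None


def is_valid_reading(stored_value: str) -> bool:
    """Validate a stored (NFC) value by decomposing it to NFD first."""
    decomposed = unicodedata.normalize("NFD", stored_value)
    return READING_NFD.fullmatch(decomposed) is not None


if __name__ == "__main__":
    stored = unicodedata.normalize("NFC", "y\u012b")  # "yī" as stored
    print(is_valid_field_name("kDefinition"))
    print(is_valid_reading(stored))
    # The same pattern applied directly to the NFC form does not match,
    # because the precomposed letter is outside the simple character class.
    print(READING_NFD.fullmatch(stored) is not None)
```

Decomposing before matching keeps the pattern to a plain letter range plus a combining-mark range, which is the kind of simplification the text attributes to validating against NFD.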
