UNIC Unicode API

August 15, 2018

This document introduces UNIC's API for Unicode data and algorithms.

See Unicode and Rust for basic Unicode concepts and how they appear in Rust.

See UNIC API Checklist for common Rust API guidelines that UNIC tries to follow.

Unicode Character Properties

Unicode Character Properties are a major part of the Unicode Standard. They are defined in Section 3.5, Properties, of the Unicode Standard and explained further in Unicode Character Database (UCD) (UAX#44).

Some of these properties are now deprecated, meaning that they are no longer recommended for use. Some other properties are considered contributory properties. Neither of these groups of properties will be supported in UNIC, unless there is clear demand for them.

Other specifications published by the Unicode Consortium, such as Unicode IDNA Compatibility Processing (UTS#46) and Unicode Emoji (UTS#51), also define their own character properties. These properties are described in their respective Unicode Standard Annexes (UAX), Unicode Technical Standards (UTS), or other Unicode Technical Reports (UTR). Some of these specifications have been withdrawn, suspended, or superseded by other documents, and therefore will not be implemented by UNIC. See the Unicode Technical Reports page for a complete list of these specifications and their current status.

Naming Convention

The character properties defined in Unicode specifications follow a common naming convention. Each character property and (non-numeric) property value has a name and an abbreviation.

The UNIC API for character properties is based on this convention and tries to stay as close as possible to this naming scheme, which makes the library easier to use for anyone already familiar with the Unicode conventions.

NOTE: Since Rust does not support aliases for enum variants, only the long names are supported in UNIC components. Property abbreviations are listed in the documentation (for use as variable names and the like, if desired) and are also used in specific cases to prevent namespace collisions.

Example:

| | Unicode Name | UNIC Name |
|---|---|---|
| Property | General_Category (gc) | GeneralCategory |
| Property Value | Uppercase_Letter (Lu) | UppercaseLetter / is_uppercase_letter() |
| Property Value | Cased_Letter (LC) | is_cased_letter() |
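
The mapping in the table can be sketched in plain Rust. This is a minimal illustration, not UNIC's actual source: the enum below lists only a handful of General_Category values, and the method bodies are assumptions about how such predicates could be written. It shows why Uppercase_Letter (a single value) appears both as a variant and as a predicate, while Cased_Letter (a grouping of Lu, Ll, and Lt) appears only as a predicate.

```rust
/// Illustrative subset of General_Category (gc) values; the real property
/// has many more. Variant names are the Unicode long names with the
/// underscores removed.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum GeneralCategory {
    UppercaseLetter, // Lu
    LowercaseLetter, // Ll
    TitlecaseLetter, // Lt
    DecimalNumber,   // Nd
    SpaceSeparator,  // Zs
}

impl GeneralCategory {
    /// Uppercase_Letter (Lu) is a single value, so the predicate is a
    /// plain equality check against the matching variant.
    pub fn is_uppercase_letter(self) -> bool {
        self == GeneralCategory::UppercaseLetter
    }

    /// Cased_Letter (LC) groups Lu, Ll, and Lt, so it has no variant of
    /// its own and is exposed only as a predicate.
    pub fn is_cased_letter(self) -> bool {
        matches!(
            self,
            GeneralCategory::UppercaseLetter
                | GeneralCategory::LowercaseLetter
                | GeneralCategory::TitlecaseLetter
        )
    }
}
```

With these definitions, `GeneralCategory::TitlecaseLetter.is_cased_letter()` is true while `GeneralCategory::TitlecaseLetter.is_uppercase_letter()` is false.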

In UNIC, the common way to access a property value is the static function of() on the property type (an enum or a struct). For example, the Some_Example property of a character is available via SomeExample::of(ch).

For property types with numeric values, the number() method will return the numeric value. For example, CanonicalCombiningClass::of(ch).number() returns a u8 number.
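
The of() and number() pattern described above can be sketched without any dependencies. The type name and the two method names come from this document; the lookup itself is a toy assumption that knows only three combining marks and treats every other character as class 0, which is not correct for real text. UNIC's own implementation is backed by the full UCD data.

```rust
/// Sketch of a numeric property type: a newtype wrapping the raw `u8` value.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct CanonicalCombiningClass(u8);

impl CanonicalCombiningClass {
    /// Look up the property value for a character.
    ///
    /// ASSUMPTION: toy table with three entries only. Every other
    /// character falls through to 0 here, even if its real class differs.
    pub fn of(ch: char) -> CanonicalCombiningClass {
        let value = match ch {
            '\u{094D}' => 9,   // DEVANAGARI SIGN VIRAMA
            '\u{0323}' => 220, // COMBINING DOT BELOW
            '\u{0301}' => 230, // COMBINING ACUTE ACCENT
            _ => 0,            // placeholder for all remaining code points
        };
        CanonicalCombiningClass(value)
    }

    /// Return the numeric value of the property.
    pub fn number(&self) -> u8 {
        self.0
    }
}
```

Calling `CanonicalCombiningClass::of('\u{0301}').number()` on this sketch yields 230. Deriving `Ord` on the newtype lets values be compared directly, which is what a canonical ordering step would rely on.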

Unicode Character Database

The UCD defines various character properties for Unicode characters.

The following table shows their implementation status in UNIC.

| Property Name (abbr) | Property Type | UNIC Component | UNIC Implementation |
|---|---|---|---|
| General | | | |
| Age (age) | Catalog | unic-ucd-age | enum Age { Assigned(UnicodeVersion), Unassigned } |
| General_Category (gc) | Enumeration | unic-ucd-category | enum GeneralCategory {...} |
| Bidirectional | | | |
| Bidi_Class (bc) | Enumeration | unic-ucd-bidi | enum BidiClass {...} |
| Normalization | | | |
| Canonical_Combining_Class (ccc) | Enumeration | unic-ucd-normal | struct CanonicalCombiningClass(u8) |
| Decomposition_Type (dt) | Enumeration | unic-ucd-normal | enum DecompositionType {...} |
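
As an illustration of how a catalog property such as Age might be consumed, here is a small sketch built around the `enum Age { Assigned(UnicodeVersion), Unassigned }` shape from the table. The `UnicodeVersion` struct, its field names, and the `describe` and `is_available_in` helpers are all hypothetical stand-ins introduced for this example; they are not taken from UNIC.

```rust
/// Hypothetical stand-in for a Unicode version type; the field names are
/// assumptions made for this sketch.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct UnicodeVersion {
    pub major: u16,
    pub minor: u16,
    pub micro: u16,
}

/// Same shape as the `Age` enum listed in the table above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Age {
    Assigned(UnicodeVersion),
    Unassigned,
}

/// Hypothetical helper: render an `Age` value as human-readable text.
pub fn describe(age: Age) -> String {
    match age {
        Age::Assigned(v) => {
            format!("assigned in Unicode {}.{}.{}", v.major, v.minor, v.micro)
        }
        Age::Unassigned => String::from("unassigned"),
    }
}

/// Hypothetical helper: was the character already assigned as of `target`?
/// Relies on the derived ordering, which compares major, then minor, then micro.
pub fn is_available_in(age: Age, target: UnicodeVersion) -> bool {
    match age {
        Age::Assigned(v) => v <= target,
        Age::Unassigned => false,
    }
}
```

Modelling "unassigned" as its own variant, rather than as an optional version, forces callers to handle that case explicitly in every `match`.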

Named Unicode Algorithms

Unicode defines a set of Named Unicode Algorithms, which are specified in the Unicode Standard or in other standards published by the Unicode Consortium.

The following table shows their implementation status in UNIC.

| Name | Reference | UNIC Component |
|---|---|---|
| Canonical Ordering | Section 3.11 | unic-ucd-normal |
| Canonical Composition | Section 3.11 | unic-ucd-normal |
| Normalization | Section 3.11 | unic-ucd-normal |
| Hangul Syllable Composition | Section 3.12 | Not Public |
| Hangul Syllable Decomposition | Section 3.12 | Not Public |
| Hangul Syllable Name Generation | Section 3.12 | Not Implemented |
| Default Case Conversion | Section 3.13 | Not Implemented |
| Default Case Detection | Section 3.13 | Not Implemented |
| Default Caseless Matching | Section 3.13 | Not Implemented |
| Bidirectional Algorithm (UBA) | UAX #9 | unic-bidi |
| Line Breaking Algorithm | UAX #14 | Not Implemented |
| Character Segmentation | UAX #29 | Not Implemented |
| Word Segmentation | UAX #29 | Not Implemented |
| Sentence Segmentation | UAX #29 | Not Implemented |
| Hangul Syllable Boundary Determination | UAX #29 | Not Implemented |
| Standard Compression Scheme for Unicode (SCSU) | UTS #6 | Not Implemented |
| Unicode Collation Algorithm (UCA) | UTS #10 | Not Implemented |
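
To give a sense of what one of these named algorithms involves, the following is a standalone sketch of Hangul Syllable Decomposition using the arithmetic from Section 3.12 of the Unicode Standard. The function name and return type are hypothetical; this is not UNIC's internal (non-public) code, only an illustration of the algorithm.

```rust
// Constants from the Hangul syllable arithmetic in Section 3.12.
const S_BASE: u32 = 0xAC00; // first precomposed syllable
const L_BASE: u32 = 0x1100; // first leading consonant jamo
const V_BASE: u32 = 0x1161; // first vowel jamo
const T_BASE: u32 = 0x11A7; // one before the first trailing consonant jamo
const L_COUNT: u32 = 19;
const V_COUNT: u32 = 21;
const T_COUNT: u32 = 28;
const N_COUNT: u32 = V_COUNT * T_COUNT; // 588
const S_COUNT: u32 = L_COUNT * N_COUNT; // 11172

/// Hypothetical helper: fully decompose a precomposed Hangul syllable into
/// its jamo (L, V, and optionally T). Returns `None` for any other character.
pub fn decompose_hangul_syllable(ch: char) -> Option<Vec<char>> {
    let s = ch as u32;
    if s < S_BASE || s >= S_BASE + S_COUNT {
        return None;
    }
    let s_index = s - S_BASE;
    let l = L_BASE + s_index / N_COUNT;
    let v = V_BASE + (s_index % N_COUNT) / T_COUNT;
    let t = T_BASE + s_index % T_COUNT;

    let mut jamo = vec![std::char::from_u32(l)?, std::char::from_u32(v)?];
    // A trailing index of zero means the syllable has no final consonant.
    if t != T_BASE {
        jamo.push(std::char::from_u32(t)?);
    }
    Some(jamo)
}
```

For example, U+D4DB decomposes to U+1111, U+1171, U+11B6, while U+AC00 has no trailing consonant and decomposes to U+1100, U+1161.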