You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
1025 lines
41 KiB
1025 lines
41 KiB
<html>
|
|
|
|
<head>
|
|
<meta name="GENERATOR" content="Microsoft FrontPage 4.0">
|
|
<meta name="ProgId" content="FrontPage.Editor.Document">
|
|
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
|
|
<link rel="stylesheet" href="http://www.unicode.org/unicode.css" type="text/css">
|
|
<title>UnicodeData File Format</title>
|
|
</head>
|
|
|
|
<body>
|
|
|
|
<h1>UnicodeData File Format<br>
|
|
Version 3.0.1</h1>
|
|
<table border="1" cellspacing="2" cellpadding="0" height="87" width="100%">
|
|
<tr>
|
|
<td valign="TOP" width="144">Revision</td>
|
|
<td valign="TOP">3.0.1</td>
|
|
</tr>
|
|
<tr>
|
|
<td valign="TOP" width="144">Authors</td>
|
|
<td valign="TOP">Mark Davis and Ken Whistler</td>
|
|
</tr>
|
|
<tr>
|
|
<td valign="TOP" width="144">Date</td>
|
|
<td valign="TOP">2000-08-17</td>
|
|
</tr>
|
|
<tr>
|
|
<td valign="TOP" width="144">This Version</td>
|
|
<td valign="TOP"><a
|
|
href="http://www.unicode.org/Public/3.0-Update1/UnicodeData-3.0.1.html">http://www.unicode.org/Public/3.0-Update1/UnicodeData-3.0.1.html</a></td>
|
|
</tr>
|
|
<tr>
|
|
<td valign="TOP" width="144">Previous Version</td>
|
|
<td valign="TOP"><a
|
|
href="http://www.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html">http://www.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html</a></td>
|
|
</tr>
|
|
<tr>
|
|
<td valign="TOP" width="144">Latest Version</td>
|
|
<td valign="TOP"><a
|
|
href="http://www.unicode.org/Public/UNIDATA/UnicodeData.html">http://www.unicode.org/Public/UNIDATA/UnicodeData.html</a></td>
|
|
</tr>
|
|
</table>
|
|
<p align="center">Copyright © 1995-2000 Unicode, Inc. All Rights reserved.<br>
|
|
<i>For more information, including Disclamer and Limitations, see <a
|
|
href="UnicodeCharacterDatabase-3.0.1.html">UnicodeCharacterDatabase-3.0.1.html</a></i></p>
|
|
<p>This document describes the format of the UnicodeData.txt file, which is one
|
|
of the files in the Unicode Character Database. The document is divided into the
|
|
following sections:
|
|
<ul>
|
|
<li><a href="#Field Formats">Field Formats</a>
|
|
<ul>
|
|
<li><a href="#General Category">General Category</a></li>
|
|
<li><a href="#Bidirectional Category">Bidirectional Category</a></li>
|
|
<li><a href="#Character Decomposition">Character Decomposition Mapping</a></li>
|
|
<li><a href="#Canonical Combining Classes">Canonical Combining Classes</a></li>
|
|
<li><a href="#Decompositions and Normalization">Decompositions and
|
|
Normalization</a></li>
|
|
<li><a href="#Case Mappings">Case Mappings</a></li>
|
|
</ul>
|
|
</li>
|
|
<li><a href="#Property Invariants">Property Invariants</a></li>
|
|
<li><a href="#Modification History">Modification History</a></li>
|
|
</ul>
|
|
<p><b>Warning: </b>the information in this file does not completely describe the
|
|
use and interpretation of Unicode character properties and behavior. It must be
|
|
used in conjunction with the data in the other files in the <a
|
|
href="UnicodeCharacterDatabase-3.0.1.html">Unicode Character Database</a>, and
|
|
relies on the notation and definitions supplied in <i><a
|
|
href="http://www.unicode.org/unicode/standard/versions/Unicode3.0.html">The
|
|
Unicode Standard</a></i>. All chapter references are to Version 3.0 of the
|
|
standard.</p>
|
|
<h2><a name="Field Formats"></a>Field Formats</h2>
|
|
<p>The file consists of lines containing fields separated by semicolons. Each
|
|
line represents the data for one encoded character in the Unicode Standard.
|
|
Every encoded character has a data entry, with the exception of certain special
|
|
ranges, as detailed below.
|
|
<ul>
|
|
<li>There are nine special ranges of characters that are represented only by
|
|
their start and end characters, since the properties in the file are
|
|
uniform, except for code values (which are all sequential and assigned).</li>
|
|
<li>The names of CJK ideograph characters and the names and decompositions of
|
|
Hangul syllable characters are algorithmically derivable. (See the Unicode
|
|
Standard and <a href="http://www.unicode.org/unicode/reports/tr15/">Unicode
|
|
Standard Annex #15</a> for more information).</li>
|
|
<li>Surrogate code values and private use characters have no names.</li>
|
|
<li>The Private Use character outside of the BMP (U+F0000..U+FFFFD,
|
|
U+100000..U+10FFFD) are listed as distinct ranges. These correspond to surrogate pairs
|
|
where the first surrogate is in the High Surrogate Private Use section.</li>
|
|
</ul>
|
|
<p>The exact ranges represented by start and end characters are:
|
|
<ul>
|
|
<li>CJK Ideographs Extension A (U+3400 - U+4DB5)</li>
|
|
<li>CJK Ideographs (U+4E00 - U+9FA5)</li>
|
|
<li>Hangul Syllables (U+AC00 - U+D7A3)</li>
|
|
<li>Non-Private Use High Surrogates (U+D800 - U+DB7F)</li>
|
|
<li>Private Use High Surrogates (U+DB80 - U+DBFF)</li>
|
|
<li>Low Surrogates (U+DC00 - U+DFFF)</li>
|
|
<li>The Private Use Area (U+E000 - U+F8FF)</li>
|
|
<li>Plane 15 Private Use Area (U+F0000 - U+FFFFD)</li>
|
|
<li>Plane 16 Private Use Area (U+100000 - U+10FFFD)</li>
|
|
</ul>
|
|
<p>The following table describes the format and meaning of each field in a data
|
|
entry in the UnicodeData file. Fields which contain normative information are so
|
|
indicated.</p>
|
|
<table border="1" cellspacing="2" cellpadding="2">
|
|
<tr>
|
|
<th valign="top" align="LEFT">
|
|
<p align="LEFT">Field</th>
|
|
<th valign="top" align="LEFT">
|
|
<p align="LEFT">Name</th>
|
|
<th valign="top" align="LEFT">
|
|
<p align="LEFT">Status</th>
|
|
<th valign="top" align="LEFT">
|
|
<p align="LEFT">Explanation</th>
|
|
</tr>
|
|
<tr>
|
|
<th valign="top">0</th>
|
|
<td valign="top">Code value</td>
|
|
<td valign="top">normative</td>
|
|
<td valign="top">Code value. For characters in the range U+0000..U+FFFD
|
|
the code value uses a 4-digit hexadecimal format; for characters in the
|
|
range U+10000..U+FFFFD the code value uses a 5-digit hexadecimal format;
|
|
and for characters in the range U+100000..U+10FFFD the code value uses a
|
|
6-digit hexadecimal format.</td>
|
|
</tr>
|
|
<tr>
|
|
<th valign="top">1</th>
|
|
<td valign="top">Character name</td>
|
|
<td valign="top">normative</td>
|
|
<td valign="top">These names match exactly the names published in Chapter 14
|
|
of the Unicode Standard, Version 3.0.</td>
|
|
</tr>
|
|
<tr>
|
|
<th valign="top">2</th>
|
|
<td valign="top"><a href="#General Category">General Category</a></td>
|
|
<td valign="top">normative / informative<br>
|
|
(see below)</td>
|
|
<td valign="top">This is a useful breakdown into various "character
|
|
types" which can be used as a default categorization in
|
|
implementations. See below for a brief explanation.</td>
|
|
</tr>
|
|
<tr>
|
|
<th valign="top">3</th>
|
|
<td valign="top"><a href="#Canonical Combining Classes">Canonical Combining
|
|
Classes</a></td>
|
|
<td valign="top">normative</td>
|
|
<td valign="top">The classes used for the Canonical Ordering Algorithm in
|
|
the Unicode Standard. These classes are also printed in Chapter 4 of the
|
|
Unicode Standard.</td>
|
|
</tr>
|
|
<tr>
|
|
<th valign="top">4</th>
|
|
<td valign="top"><a href="#Bidirectional Category">Bidirectional Category</a></td>
|
|
<td valign="top">normative</td>
|
|
<td valign="top">See the list below for an explanation of the abbreviations
|
|
used in this field. These are the categories required by the Bidirectional
|
|
Behavior Algorithm in the Unicode Standard. These categories are
|
|
summarized in Chapter 3 of the Unicode Standard.</td>
|
|
</tr>
|
|
<tr>
|
|
<th valign="top">5</th>
|
|
<td valign="top"><a href="#Character Decomposition">Character Decomposition
|
|
Mapping</a></td>
|
|
<td valign="top">normative</td>
|
|
<td valign="top">In the Unicode Standard, not all of the mappings are full
|
|
(maximal) decompositions. Recursive application of look-up for
|
|
decompositions will, in all cases, lead to a maximal decomposition. The
|
|
decomposition mappings match exactly the decomposition mappings published
|
|
with the character names in the Unicode Standard.</td>
|
|
</tr>
|
|
<tr>
|
|
<th valign="top">6</th>
|
|
<td valign="top">Decimal digit value</td>
|
|
<td valign="top">normative</td>
|
|
<td valign="top">This is a numeric field. If the character has the decimal
|
|
digit property, as specified in Chapter 4 of the Unicode Standard, the
|
|
value of that digit is represented with an integer value in this field</td>
|
|
</tr>
|
|
<tr>
|
|
<th valign="top">7</th>
|
|
<td valign="top">Digit value</td>
|
|
<td valign="top">normative</td>
|
|
<td valign="top">This is a numeric field. If the character represents a
|
|
digit, not necessarily a decimal digit, the value is here. This covers
|
|
digits which do not form decimal radix forms, such as the compatibility
|
|
superscript digits</td>
|
|
</tr>
|
|
<tr>
|
|
<th valign="top">8</th>
|
|
<td valign="top">Numeric value</td>
|
|
<td valign="top">normative</td>
|
|
<td valign="top">This is a numeric field. If the character has the numeric
|
|
property, as specified in Chapter 4 of the Unicode Standard, the value of
|
|
that character is represented with an integer or rational number in this
|
|
field. This includes fractions as, e.g., "1/5" for U+2155 VULGAR
|
|
FRACTION ONE FIFTH Also included are numerical values for compatibility
|
|
characters such as circled numbers.</td>
|
|
</tr>
|
|
<tr>
|
|
<th valign="top">9</th>
|
|
<td valign="top">Mirrored</td>
|
|
<td valign="top">normative</td>
|
|
<td valign="top">If the character has been identified as a
|
|
"mirrored" character in bidirectional text, this field has the
|
|
value "Y"; otherwise "N". The list of mirrored
|
|
characters is also printed in Chapter 4 of the Unicode Standard.</td>
|
|
</tr>
|
|
<tr>
|
|
<th valign="top">10</th>
|
|
<td valign="top">Unicode 1.0 Name</td>
|
|
<td valign="top">informative</td>
|
|
<td valign="top">This is the old name as published in Unicode 1.0. This name
|
|
is only provided when it is significantly different from the Unicode 3.0
|
|
name for the character.</td>
|
|
</tr>
|
|
<tr>
|
|
<th valign="top">11</th>
|
|
<td valign="top">10646 comment field</td>
|
|
<td valign="top">informative</td>
|
|
<td valign="top">This is the ISO 10646 comment field. It appears in parentheses
|
|
in the 10646 names list, or contains an asterisk to mark an Annex P note.</td>
|
|
</tr>
|
|
<tr>
|
|
<th valign="top">12</th>
|
|
<td valign="top"><a href="#Case Mappings">Uppercase Mapping</a></td>
|
|
<td valign="top">informative</td>
|
|
<td valign="top">Upper case equivalent mapping. If a character is part of an
|
|
alphabet with case distinctions, and has an upper case equivalent, then
|
|
the upper case equivalent is in this field. See the explanation below on
|
|
case distinctions. These mappings are always one-to-one, not one-to-many
|
|
or many-to-one. This field is informative.</td>
|
|
</tr>
|
|
<tr>
|
|
<th valign="top">13</th>
|
|
<td valign="top"><a href="#Case Mappings">Lowercase Mapping</a></td>
|
|
<td valign="top">informative</td>
|
|
<td valign="top">Similar to Uppercase mapping</td>
|
|
</tr>
|
|
<tr>
|
|
<th valign="top">14</th>
|
|
<td valign="top"><a href="#Case Mappings">Titlecase Mapping</a></td>
|
|
<td valign="top">informative</td>
|
|
<td valign="top">Similar to Uppercase mapping</td>
|
|
</tr>
|
|
</table>
|
|
<h3><a name="General Category"></a>General Category</h3>
|
|
<p>The values in this field are abbreviations for the following. Some of the
|
|
values are normative, and some are informative. For more information, see the
|
|
Unicode Standard.</p>
|
|
<p><b>Note:</b> the standard does not assign information to control characters
|
|
(except for certain cases in the Bidirectional Algorithm). Implementations will
|
|
generally also assign categories to certain control characters, notably CR and
|
|
LF, according to platform conventions.</p>
|
|
<h4>Normative Categories</h4>
|
|
<table border="0" cellspacing="2" cellpadding="0">
|
|
<tr>
|
|
<th>
|
|
<p align="LEFT">Abbr.</th>
|
|
<th>
|
|
<p align="LEFT">Description</th>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER">Lu</td>
|
|
<td>Letter, Uppercase</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER">Ll</td>
|
|
<td>Letter, Lowercase</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER">Lt</td>
|
|
<td>Letter, Titlecase</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER">Mn</td>
|
|
<td>Mark, Non-Spacing</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER">Mc</td>
|
|
<td>Mark, Spacing Combining</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER">Me</td>
|
|
<td>Mark, Enclosing</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER">Nd</td>
|
|
<td>Number, Decimal Digit</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER">Nl</td>
|
|
<td>Number, Letter</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER">No</td>
|
|
<td>Number, Other</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER">Zs</td>
|
|
<td>Separator, Space</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER">Zl</td>
|
|
<td>Separator, Line</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER">Zp</td>
|
|
<td>Separator, Paragraph</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER">Cc</td>
|
|
<td>Other, Control</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER">Cf</td>
|
|
<td>Other, Format</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER">Cs</td>
|
|
<td>Other, Surrogate</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER">Co</td>
|
|
<td>Other, Private Use</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER">Cn</td>
|
|
<td>Other, Not Assigned (no characters in the file have this property)</td>
|
|
</tr>
|
|
</table>
|
|
<h4>Informative Categories</h4>
|
|
<table border="0" cellspacing="2" cellpadding="0">
|
|
<tr>
|
|
<th>
|
|
<p align="LEFT">Abbr.</th>
|
|
<th>
|
|
<p align="LEFT">Description</th>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER">Lm</td>
|
|
<td>Letter, Modifier</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER">Lo</td>
|
|
<td>Letter, Other</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER">Pc</td>
|
|
<td>Punctuation, Connector</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER">Pd</td>
|
|
<td>Punctuation, Dash</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER">Ps</td>
|
|
<td>Punctuation, Open</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER">Pe</td>
|
|
<td>Punctuation, Close</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER">Pi</td>
|
|
<td>Punctuation, Initial quote (may behave like Ps or Pe depending on usage)</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER">Pf</td>
|
|
<td>Punctuation, Final quote (may behave like Ps or Pe depending on usage)</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER">Po</td>
|
|
<td>Punctuation, Other</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER">Sm</td>
|
|
<td>Symbol, Math</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER">Sc</td>
|
|
<td>Symbol, Currency</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER">Sk</td>
|
|
<td>Symbol, Modifier</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER">So</td>
|
|
<td>Symbol, Other</td>
|
|
</tr>
|
|
</table>
|
|
<h3><a name="Bidirectional Category"></a>Bidirectional Category</h3>
|
|
<p>Please refer to Chapter 3 for an explanation of the algorithm for
|
|
Bidirectional Behavior and an explanation of the significance of these
|
|
categories. An up-to-date version can be found on <a
|
|
href="http://www.unicode.org/unicode/reports/tr9/">Unicode Standard Annex #9:
|
|
The Bidirectional Algorithm</a>. These values are normative.</p>
|
|
<table border="0" cellpadding="2">
|
|
<tr>
|
|
<th valign="TOP" align="LEFT">
|
|
<p align="LEFT">Type</th>
|
|
<th valign="TOP" align="LEFT">
|
|
<p align="LEFT">Description</th>
|
|
</tr>
|
|
<tr>
|
|
<td valign="TOP"><b>L</b></td>
|
|
<td valign="TOP">Left-to-Right</td>
|
|
</tr>
|
|
<tr>
|
|
<td valign="TOP"><b>LRE</b></td>
|
|
<td valign="TOP">Left-to-Right Embedding</td>
|
|
</tr>
|
|
<tr>
|
|
<td valign="TOP"><b>LRO</b></td>
|
|
<td valign="TOP">Left-to-Right Override</td>
|
|
</tr>
|
|
<tr>
|
|
<td valign="TOP"><b>R</b></td>
|
|
<td valign="TOP">Right-to-Left</td>
|
|
</tr>
|
|
<tr>
|
|
<td valign="TOP"><b>AL</b></td>
|
|
<td valign="TOP">Right-to-Left Arabic</td>
|
|
</tr>
|
|
<tr>
|
|
<td valign="TOP"><b>RLE</b></td>
|
|
<td valign="TOP">Right-to-Left Embedding</td>
|
|
</tr>
|
|
<tr>
|
|
<td valign="TOP"><b>RLO</b></td>
|
|
<td valign="TOP">Right-to-Left Override</td>
|
|
</tr>
|
|
<tr>
|
|
<td valign="TOP"><b>PDF</b></td>
|
|
<td valign="TOP">Pop Directional Format</td>
|
|
</tr>
|
|
<tr>
|
|
<td valign="TOP"><b>EN</b></td>
|
|
<td valign="TOP">European Number</td>
|
|
</tr>
|
|
<tr>
|
|
<td valign="TOP"><b>ES</b></td>
|
|
<td valign="TOP">European Number Separator</td>
|
|
</tr>
|
|
<tr>
|
|
<td valign="TOP"><b>ET</b></td>
|
|
<td valign="TOP">European Number Terminator</td>
|
|
</tr>
|
|
<tr>
|
|
<td valign="TOP"><b>AN</b></td>
|
|
<td valign="TOP">Arabic Number</td>
|
|
</tr>
|
|
<tr>
|
|
<td valign="TOP"><b>CS</b></td>
|
|
<td valign="TOP">Common Number Separator</td>
|
|
</tr>
|
|
<tr>
|
|
<td valign="TOP"><b>NSM</b></td>
|
|
<td valign="TOP">Non-Spacing Mark</td>
|
|
</tr>
|
|
<tr>
|
|
<td valign="TOP"><b>BN</b></td>
|
|
<td valign="TOP">Boundary Neutral</td>
|
|
</tr>
|
|
<tr>
|
|
<td valign="TOP"><b>B</b></td>
|
|
<td valign="TOP">Paragraph Separator</td>
|
|
</tr>
|
|
<tr>
|
|
<td valign="TOP"><b>S</b></td>
|
|
<td valign="TOP">Segment Separator</td>
|
|
</tr>
|
|
<tr>
|
|
<td valign="TOP"><b>WS</b></td>
|
|
<td valign="TOP">Whitespace</td>
|
|
</tr>
|
|
<tr>
|
|
<td valign="TOP"><b>ON</b></td>
|
|
<td valign="TOP">Other Neutrals</td>
|
|
</tr>
|
|
</table>
|
|
<h3><a name="Character Decomposition"></a>Character Decomposition Mapping</h3>
|
|
<p>The decomposition is a normative property of a character. The tags supplied
|
|
with certain decomposition mappings generally indicate formatting information.
|
|
Where no such tag is given, the mapping is designated as canonical. Conversely,
|
|
the presence of a formatting tag also indicates that the mapping is a
|
|
compatibility mapping and not a canonical mapping. In the absence of other
|
|
formatting information in a compatibility mapping, the tag is used to
|
|
distinguish it from canonical mappings.</p>
|
|
<p>In some instances a canonical mapping or a compatibility mapping may consist
|
|
of a single character. For a canonical mapping, this indicates that the
|
|
character is a canonical equivalent of another single character. For a
|
|
compatibility mapping, this indicates that the character is a compatibility
|
|
equivalent of another single character. The compatibility formatting tags used
|
|
are:</p>
|
|
<table border="0" cellspacing="2" cellpadding="0">
|
|
<tr>
|
|
<th>Tag</th>
|
|
<th>
|
|
<p align="LEFT">Description</th>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER"><font> </td>
|
|
<td>A font variant (e.g. a blackletter form).</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER"><noBreak> </td>
|
|
<td>A no-break version of a space or hyphen.</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER"><initial> </td>
|
|
<td>An initial presentation form (Arabic).</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER"><medial> </td>
|
|
<td>A medial presentation form (Arabic).</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER"><final> </td>
|
|
<td>A final presentation form (Arabic).</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER"><isolated> </td>
|
|
<td>An isolated presentation form (Arabic).</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER"><circle> </td>
|
|
<td>An encircled form.</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER"><super> </td>
|
|
<td>A superscript form.</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER"><sub> </td>
|
|
<td>A subscript form.</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER"><vertical> </td>
|
|
<td>A vertical layout presentation form.</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER"><wide> </td>
|
|
<td>A wide (or zenkaku) compatibility character.</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER"><narrow> </td>
|
|
<td>A narrow (or hankaku) compatibility character.</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER"><small> </td>
|
|
<td>A small variant form (CNS compatibility).</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER"><square> </td>
|
|
<td>A CJK squared font variant.</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER"><fraction> </td>
|
|
<td>A vulgar fraction form.</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="CENTER"><compat> </td>
|
|
<td>Otherwise unspecified compatibility character.</td>
|
|
</tr>
|
|
</table>
|
|
<p><b>Reminder: </b>There is a difference between decomposition and
|
|
decomposition mapping. The decomposition mappings are defined in the UnicodeData,
|
|
while the decomposition (also termed "full decomposition") is defined
|
|
in Chapter 3 to use those mappings <i>recursively.</i>
|
|
<ul>
|
|
<li>The canonical decomposition is formed by recursively applying the
|
|
canonical mappings, then applying the canonical reordering algorithm.</li>
|
|
<li>The compatibility decomposition is formed by recursively applying the
|
|
canonical <em>and</em> compatibility mappings, then applying the canonical
|
|
reordering algorithm.</li>
|
|
</ul>
|
|
<h3><a name="Canonical Combining Classes"></a>Canonical Combining Classes</h3>
|
|
<table border="0" cellspacing="2" cellpadding="0">
|
|
<tr>
|
|
<th>
|
|
<p align="LEFT">Value</th>
|
|
<th>
|
|
<p align="LEFT">Description</th>
|
|
</tr>
|
|
<tr>
|
|
<td align="RIGHT">0:</td>
|
|
<td>Spacing, split, enclosing, reordrant, and Tibetan subjoined</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="RIGHT">1:</td>
|
|
<td>Overlays and interior</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="RIGHT">7:</td>
|
|
<td>Nuktas</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="RIGHT">8:</td>
|
|
<td>Hiragana/Katakana voicing marks</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="RIGHT">9:</td>
|
|
<td>Viramas</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="RIGHT">10:</td>
|
|
<td>Start of fixed position classes</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="RIGHT">199:</td>
|
|
<td>End of fixed position classes</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="RIGHT">200:</td>
|
|
<td>Below left attached</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="RIGHT">202:</td>
|
|
<td>Below attached</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="RIGHT">204:</td>
|
|
<td>Below right attached</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="RIGHT">208:</td>
|
|
<td>Left attached (reordrant around single base character)</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="RIGHT">210:</td>
|
|
<td>Right attached</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="RIGHT">212:</td>
|
|
<td>Above left attached</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="RIGHT">214:</td>
|
|
<td>Above attached</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="RIGHT">216:</td>
|
|
<td>Above right attached</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="RIGHT">218:</td>
|
|
<td>Below left</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="RIGHT">220:</td>
|
|
<td>Below</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="RIGHT">222:</td>
|
|
<td>Below right</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="RIGHT">224:</td>
|
|
<td>Left (reordrant around single base character)</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="RIGHT">226:</td>
|
|
<td>Right</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="RIGHT">228:</td>
|
|
<td>Above left</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="RIGHT">230:</td>
|
|
<td>Above</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="RIGHT">232:</td>
|
|
<td>Above right</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="RIGHT">233:</td>
|
|
<td>Double below</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="RIGHT">234:</td>
|
|
<td>Double above</td>
|
|
</tr>
|
|
<tr>
|
|
<td align="RIGHT">240:</td>
|
|
<td>Below (iota subscript)</td>
|
|
</tr>
|
|
</table>
|
|
<p><strong>Note: </strong>some of the combining classes in this list do not
|
|
currently have members but are specified here for completeness.</p>
|
|
<h3><a name="Decompositions and Normalization"></a>Decompositions and
|
|
Normalization</h3>
|
|
<p>Decomposition is specified in Chapter 3. <a
|
|
href="http://www.unicode.org/unicode/reports/tr15/"><i>Unicode Standard Annex
|
|
#15: Unicode Normalization Forms</i></a> specifies the interaction between decomposition
|
|
and normalization. The most up-to-date version is found on <a
|
|
href="http://www.unicode.org/unicode/reports/tr15/">http://www.unicode.org/unicode/reports/tr15/</a>.
|
|
That report specifies how the decompositions defined in UnicodeData.txt are used
|
|
to derive normalized forms of Unicode text.</p>
|
|
<p>Note that as of the 2.1.9 update of the Unicode Character Database, the
|
|
decompositions in the UnicodeData.txt file can be used to recursively derive the
|
|
full decomposition in canonical order, without the need to separately apply
|
|
canonical reordering. However, canonical reordering of combining character
|
|
sequences must still be applied in decomposition when normalizing source text
|
|
which contains any combining marks.</p>
|
|
<h3><a name="Case Mappings"></a>Case Mappings</h3>
|
|
<p>The case mapping is an informative, default mapping. Case itself, on the
|
|
other hand, has normative status. Thus, for example, 0041 LATIN CAPITAL LETTER A
|
|
is normatively uppercase, but its lowercase mapping the 0061 LATIN SMALL LETTER
|
|
A is informative. The reason for this is that case can be considered to be an
|
|
inherent property of a particular character (and is usually, but not always,
|
|
derivable from the presence of the terms "CAPITAL" or
|
|
"SMALL" in the character name), but case mappings between characters
|
|
are occasionally influenced by local conventions. For example, certain
|
|
languages, such as Turkish, German, French, or Greek may have small deviations
|
|
from the default mappings listed in UnicodeData.</p>
|
|
<p>In addition to uppercase and lowercase, because of the inclusion of certain
|
|
composite characters for compatibility, such as 01F1 LATIN CAPITAL LETTER DZ,
|
|
there is a third case, called <i>titlecase</i>, which is used where the first
|
|
letter of a word is to be capitalized (e.g. UPPERCASE, Titlecase, lowercase). An
|
|
example of such a titlecase letter is 01F2 LATIN CAPITAL LETTER D WITH SMALL
|
|
LETTER Z.</p>
|
|
<p>The uppercase, titlecase and lowercase fields are only included for
|
|
characters that have a single corresponding character of that type. Composite
|
|
characters (such as "339D SQUARE CM") that do not have a single
|
|
corresponding character of that type can be cased by decomposition.</p>
|
|
<p>For compatibility with existing parsers, UnicodeData only contains case
|
|
mappings for characters where they are one-to-one mappings; it also omits
|
|
information about context-sensitive case mappings. Information about these
|
|
special cases can be found in a separate data file, <a
|
|
href="http://www.unicode.org/Public/3.0-Update1/">SpecialCasing.txt</a>, which
|
|
has been added starting with the 2.1.8 update to the Unicode data files.
|
|
SpecialCasing.txt contains additional informative case mappings that are either
|
|
not one-to-one or which are context-sensitive.</p>
|
|
<h2><a name="Property Invariants"></a>Property Invariants</h2>
|
|
<p>Values in UnicodeData.txt are subject to correction as errors are found;
|
|
however, some characteristics of the categories themselves can be considered
|
|
invariants. Applications may wish to take these invariants into account when
|
|
choosing how to implement character properties. The following is a partial list
|
|
of known invariants for the Unicode Character Database.</p>
|
|
<h4>Database Fields</h4>
|
|
<ul>
|
|
<li>The number of fields in UnicodeData.txt is fixed.</li>
|
|
<li>The order of the fields is also fixed.
|
|
<ul>
|
|
<li>Any additional information about character properties to be added in
|
|
the future will appear in separate data tables, rather than being added
|
|
on to the existing table or by subdivision or reinterpretation of
|
|
existing fields.</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<h4>General Category</h4>
|
|
<ul>
|
|
<li>There will never be more than 32 General Category values.
|
|
<ul>
|
|
<li>It is very unlikely that the Unicode Technical Committee will
|
|
subdivide the General Category partition any further, since that can
|
|
cause implementations to misbehave. Because the General Category is
|
|
limited to 32 values, 5 bits can be used to represent the information,
|
|
and a 32-bit integer can be used as a bitmask to represent arbitrary
|
|
sets of categories.</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<h4>Combining Classes</h4>
|
|
<ul>
|
|
<li>Combining classes are limited to the values 0 to 255.
|
|
<ul>
|
|
<li>In practice, there are far fewer than 256 values used. Implementations
|
|
may take advantage of this fact for compression, since only the ordering
|
|
of the non-zero values matters for the Canonical Reordering Algorithm.
|
|
It is possible for up to 256 values to be used in the future; however,
|
|
UTC decisions in the future may restrict the number of values to 128,
|
|
since this has implementation advantages. [Signed bytes can be used
|
|
without widening to ints in Java, for example.]</li>
|
|
</ul>
|
|
</li>
|
|
<li>All characters other than those of General Category M* have the combining
|
|
class 0.
|
|
<ul>
|
|
<li>Currently, all characters other than those of General Category Mn have
|
|
the value 0. However, some characters of General Category Me or Mc may
|
|
be given non-zero values in the future.</li>
|
|
<li>The precise values above the value 0 are not invariant--only the
|
|
relative ordering is considered normative. For example, it is not
|
|
guaranteed in future versions that the class of U+05B4 will be precisely
|
|
14.</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<h4>Case</h4>
|
|
<ul>
|
|
<li>Characters of type Lu, Lt, or Ll are called <i>cased</i>. All characters
|
|
with an Upper, Lower, or Titlecase mapping are cased characters.
|
|
<ul>
|
|
<li>However, characters with the General Categories of Lu, Ll, or Lt may
|
|
not always have case mappings, and case mappings may vary by locale.
|
|
(See http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt).</li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<h4>Canonical Decomposition</h4>
|
|
<ul>
|
|
<li>Canonical mappings are always in canonical order.</li>
|
|
<li>Canonical mappings have only the first of a pair possibly further
|
|
decomposing.</li>
|
|
<li>Canonical decompositions are "transparent" to other character
|
|
data:
|
|
<ul>
|
|
<li><tt>BIDI(a) = BIDI(principal(canonicalDecomposition(a))</tt></li>
|
|
<li><tt>Category(a) = Category(principal(canonicalDecomposition(a))</tt></li>
|
|
<li><tt>CombiningClass(a) =
|
|
CombiningClass(principal(canonicalDecomposition(a))</tt><br>
|
|
where principal(a) is the first character not of type Mn, or the first
|
|
character if all characters are of type Mn.</li>
|
|
</ul>
|
|
</li>
|
|
<li>However, because there are sometimes missing case pairs, and because of
|
|
some legacy characters, it is only generally true that:
|
|
<ul>
|
|
<li><tt>upper(canonicalDecomposition(a)) = canonicalDecomposition(upper(a))</tt></li>
|
|
<li><tt>lower(canonicalDecomposition(a)) = canonicalDecomposition(lower(a))</tt></li>
|
|
<li><tt>title(canonicalDecomposition(a)) = canonicalDecomposition(title(a))</tt></li>
|
|
</ul>
|
|
</li>
|
|
</ul>
|
|
<h2><a name="Modification History"></a>Modification History</h2>
|
|
<p>This section provides a summary of the changes between update versions of the
|
|
Unicode Standard.</p>
|
|
<h3><a
|
|
href="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 3.0.1">Unicode
|
|
3.0.1</a></h3>
|
|
<p>Modifications made for Version 3.0.1 of UnicodeData.txt include:
|
|
<ul>
|
|
<li>Added 5- and 6-digit representation of code points past U+FFFF.</li>
|
|
<li>Added Private Use range definitions for Planes 15 and 16.</li>
|
|
<li>Minor additions for the 10646 comment field.</li>
|
|
</ul>
|
|
<h3><a
|
|
href="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 3.0.0">Unicode
|
|
3.0.0</a></h3>
|
|
<p>Modifications made for Version 3.0.0 of UnicodeData.txt include many new
|
|
characters and a number of property changes. These are summarized in Appendex D
|
|
of <em>The Unicode Standard, Version 3.0.</em></p>
|
|
<h3><a
|
|
href="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 2.1.9">Unicode
|
|
2.1.9</a></h3>
|
|
<p>Modifications made for Version 2.1.9 of UnicodeData.txt include:
|
|
<ul>
|
|
<li>Corrected combining class for U+05AE HEBREW ACCENT ZINOR.</li>
|
|
<li>Corrected combining class for U+20E1 COMBINING LEFT RIGHT ARROW ABOVE</li>
|
|
<li>Corrected combining class for U+0F35 and U+0F37 to 220.</li>
|
|
<li>Corrected combining class for U+0F71 to 129.</li>
|
|
<li>Added a decomposition for U+0F0C TIBETAN MARK DELIMITER TSHEG BSTAR.</li>
|
|
<li>Added decompositions for several Greek symbol letters:
|
|
U+03D0..U+03D2, U+03D5, U+03D6, U+03F0..U+03F2.</li>
|
|
<li>Removed decompositions from the conjoining jamo block:
|
|
U+1100..U+11F8.</li>
|
|
<li>Changes to decomposition mappings for some Tibetan vowels for consistency
|
|
in normalization. (U+0F71, U+0F73, U+0F77, U+0F79, U+0F81)</li>
|
|
<li>Updated the decomposition mappings for several Vietnamese characters with
|
|
two diacritics (U+1EAC, U+1EAD, U+1EB6, U+1EB7, U+1EC6, U+1EC7, U+1ED8,
|
|
U+1ED9), so that the recursive decomposition can be generated directly in
|
|
canonically reordered form (not a normative change).</li>
|
|
<li>Updated the decomposition mappings for several Arabic compatibility
|
|
characters involving shadda (U+FC5E..U+FC62, U+FCF2..U+FCF4), and two Latin
|
|
characters (U+1E1C, U+1E1D), so that the decompositions are generated
|
|
directly in canonically reordered form (not a normative change).</li>
|
|
<li>Changed BIDI category for: U+00A0 NO-BREAK SPACE, U+2007 FIGURE SPACE,
|
|
U+2028 LINE SEPARATOR.</li>
|
|
<li>Changed BIDI category for extenders of General Category Lm: U+3005,
|
|
U+3021..U+3035, U+FF9E, U+FF9F.</li>
|
|
<li>Changed General Category and BIDI category for the Greek numeral signs:
|
|
U+0374, U+0375.</li>
|
|
<li>Corrected General Category for U+FFE8 HALFWIDTH FORMS LIGHT VERTICAL.</li>
|
|
<li>Added Unicode 1.0 names for many Tibetan characters (informative).</li>
|
|
</ul>
|
|
<h3><a
|
|
href="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 2.1.8">Unicode
|
|
2.1.8</a></h3>
|
|
<p>Modifications made for Version 2.1.8 of UnicodeData.txt include:
|
|
<ul>
|
|
<li>Added combining class 240 for U+0345 COMBINING GREEK YPOGEGRAMMENI so that
|
|
decompositions involving iota subscript are derivable directly in
|
|
canonically reordered form; this also has a bearing on simplification of
|
|
casing of polytonic Greek.</li>
|
|
<li>Changes in decompositions related to Greek tonos. These result from the
|
|
clarification that monotonic Greek "tonos" should be equated with
|
|
U+0301 COMBINING ACUTE, rather than with U+030D COMBINING VERTICAL LINE
|
|
ABOVE. (All Greek characters in the Greek block involving "tonos";
|
|
some Greek characters in the polytonic Greek in the 1FXX block.)</li>
|
|
<li>Changed decompositions involving dialytika tonos. (U+0390, U+03B0)</li>
|
|
<li>Changed ternary decompositions to binary. (U+0CCB, U+FB2C, U+FB2D) These
|
|
changes simplify normalization.</li>
|
|
<li>Removed canonical decomposition for Latin Candrabindu. (U+0310)</li>
|
|
<li>Corrected error in canonical decomposition for U+1FF4.</li>
|
|
<li>Added compatibility decompositions to clarify collation tables. (U+2100,
|
|
U+2101, U+2105, U+2106, U+1E9A)</li>
|
|
<li>A series of general category changes to assist the convergence of of
|
|
Unicode definition of identifier with ISO TR 10176:
|
|
<ul>
|
|
<li>So > Lo: U+0950, U+0AD0, U+0F00, U+0F88..U+0F8B</li>
|
|
<li>Po > Lo: U+0E2F, U+0EAF, U+3006</li>
|
|
<li>Lm > Sk: U+309B, U+309C</li>
|
|
<li>Po > Pc: U+30FB, U+FF65</li>
|
|
<li>Ps/Pe > Mn: U+0F3E, U+0F3F</li>
|
|
</ul>
|
|
</li>
|
|
<li>A series of bidi property changes for consistency.
|
|
<ul>
|
|
<li>L > ET: U+09F2, U+09F3</li>
|
|
<li>ON > L: U+3007</li>
|
|
<li>L > ON: U+0F3A..U+0F3D, U+037E, U+0387</li>
|
|
</ul>
|
|
</li>
|
|
<li>Add case mapping: U+01A6 <-> U+0280</li>
|
|
<li>Updated symmetric swapping value for guillemets: U+00AB, U+00BB, U+2039,
|
|
U+203A.</li>
|
|
<li>Changes to combining class values. Most Indic fixed position class
|
|
non-spacing marks were changed to combining class 0. This fixes some
|
|
inconsistencies in how canonical reordering would apply to Indic scripts,
|
|
including Tibetan. Indic interacting top/bottom fixed position classes were
|
|
merged into single (non-zero) classes as part of this change. Tibetan
|
|
subjoined consonants are changed from combining class 6 to combining class
|
|
0. Thai pinthu (U+0E3A) moved to combining class 9. Moved two Devanagari
|
|
stress marks into generic above and below combining classes (U+0951,
|
|
U+0952).</li>
|
|
<li>Corrected placement of semicolon near symmetric swapping field. (U+FA0E,
|
|
etc., scattered positions to U+FA29)</li>
|
|
</ul>
|
|
<h3>Version 2.1.7</h3>
|
|
<p><i>This version was for internal change tracking only, and never publicly
|
|
released.</i></p>
|
|
<h3>Version 2.1.6</h3>
|
|
<p><i>This version was for internal change tracking only, and never publicly
|
|
released.</i></p>
|
|
<h3><a
|
|
href="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 2.1.5">Unicode
|
|
2.1.5</a></h3>
|
|
<p>Modifications made for Version 2.1.5 of UnicodeData.txt include:
|
|
<ul>
|
|
<li>Changed decomposition for U+FF9E and U+FF9F so that correct collation
|
|
weighting will automatically result from the canonical equivalences.</li>
|
|
<li>Removed canonical decompositions for U+04D4, U+04D5, U+04D8, U+04D9,
|
|
U+04E0, U+04E1, U+04E8, U+04E9 (the implication being that no canonical
|
|
equivalence is claimed between these 8 characters and similar Latin
|
|
letters), and updated 4 canonical decompositions for U+04DB, U+04DC, U+04EA,
|
|
U+04EB to reflect the implied difference in the base character.</li>
|
|
<li>Added Pi, and Pf categories and assigned the relevant quotation marks to
|
|
those categories, based on the Unicode Technical Corrigendum on Quotation
|
|
Characters.</li>
|
|
<li>Updating of many bidi properties, following the advice of the ad hoc
|
|
committee on bidi, and to make the bidi properties of compatibility
|
|
characters more consistent.</li>
|
|
<li>Changed category of several Tibetan characters: U+0F3E, U+0F3F,
|
|
U+0F88..U+0F8B to make them non-combining, reflecting the combined opinion
|
|
of Tibetan experts.</li>
|
|
<li>Added case mapping for U+03F2.</li>
|
|
<li>Corrected case mapping for U+0275.</li>
|
|
<li>Added titlecase mappings for U+03D0, U+03D1, U+03D5, U+03D6, U+03F0..
|
|
U+03F2.</li>
|
|
<li>Corrected compatibility label for U+2121.</li>
|
|
<li>Add specific entries for all the CJK compatibility ideographs,
|
|
U+F900..U+FA2D, so the canonical decomposition for each (the URO character
|
|
it is equivalent to) can be carried in the database.</li>
|
|
</ul>
|
|
<h3>Version 2.1.4</h3>
|
|
<p><i>This version was for internal change tracking only, and never publicly
|
|
released.</i></p>
|
|
<h3>Version 2.1.3</h3>
|
|
<p><i>This version was for internal change tracking only, and never publicly
|
|
released.</i></p>
|
|
<h3><a
|
|
href="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 2.1.2">Unicode
|
|
2.1.2</a></h3>
|
|
<p>Modifications made in updating UnicodeData.txt to Version 2.1.2 for the
|
|
Unicode Standard, Version 2.1 (from Version 2.0) include:
|
|
<ul>
|
|
<li>Added two characters (U+20AC and U+FFFC).</li>
|
|
<li>Amended bidi properties for U+0026, U+002E, U+0040, U+2007.</li>
|
|
<li>Corrected case mappings for U+018E, U+019F, U+01DD, U+0258, U+0275,
|
|
U+03C2, U+1E9B.</li>
|
|
<li>Changed combining order class for U+0F71.</li>
|
|
<li>Corrected canonical decompositions for U+0F73, U+1FBE.</li>
|
|
<li>Changed decomposition for U+FB1F from compatibility to canonical.</li>
|
|
<li>Added compatibility decompositions for U+FBE8, U+FBE9, U+FBF9..U+FBFB.</li>
|
|
<li>Corrected compatibility decompositions for U+2469, U+246A, U+3358.</li>
|
|
</ul>
|
|
<h3>Version 2.1.1</h3>
|
|
<p><i>This version was for internal change tracking only, and never publicly
|
|
released.</i></p>
|
|
<h3><a
|
|
href="http://www.unicode.org/unicode/standard/versions/enumeratedversions.html#Unicode 2.0.0">Unicode
|
|
2.0.0</a></h3>
|
|
<p>The modifications made in updating UnicodeData.txt for the Unicode Standard,
|
|
Version 2.0 include:
|
|
<ul>
|
|
<li>Fixed decompositions with TONOS to use correct NSM: 030D.</li>
|
|
<li>Removed old Hangul Syllables; mapping to new characters are in a separate
|
|
table.</li>
|
|
<li>Marked compatibility decompositions with additional tags.</li>
|
|
<li>Changed old tag names for clarity.</li>
|
|
<li>Revision of decompositions to use first-level decomposition, instead of
|
|
maximal decomposition.</li>
|
|
<li>Correction of all known errors in decompositions from earlier versions.</li>
|
|
<li>Added control code names (as old Unicode names).</li>
|
|
<li>Added Hangul Jamo decompositions.</li>
|
|
<li>Added Number category to match properties list in book.</li>
|
|
<li>Fixed categories of Koranic Arabic marks.</li>
|
|
<li>Fixed categories of precomposed characters to match decomposition where
|
|
possible.</li>
|
|
<li>Added Hebrew cantillation marks and the Tibetan script.</li>
|
|
<li>Added place holders for ranges such as CJK Ideographic Area and the
|
|
Private Use Area.</li>
|
|
<li>Added categories Me, Sk, Pc, Nl, Cs, Cf, and rectified a number of
|
|
mistakes in the database.</li>
|
|
</ul>
|
|
|
|
</body>
|
|
|
|
</html>
|