rfc9839v1.txt                                                         | rfc9839.txt
skipping to change at line 102                                        | skipping to change at line 102
subset. The intended use is to serve as a convenient target for       | subset. The intended use is to serve as a convenient target for
cross-reference from other specifications whose authors wish to       | cross-reference from other specifications whose authors wish to
exclude problematic code points from the data format or protocol      | exclude problematic code points from the data format or protocol
being specified.                                                      | being specified.
Note that this document only provides guidance on avoiding the use of | Note that this document only provides guidance on avoiding the use of
code points that cannot be used for interoperable interchange of      | code points that cannot be used for interoperable interchange of
Unicode textual data. Dealing with strings, particularly in the       | Unicode textual data. Dealing with strings, particularly in the
context of user interfaces, requires addressing language, text        | context of user interfaces, requires addressing language, text
rendering direction, alternate representations of the same abstract   | rendering direction, alternate representations of the same abstract
character, and so on. These issues, among many others, led to many    | character, and so on. These issues, among many others, led to
efforts by the Unicode Consortium, efforts by the IETF such as [IDN]  | efforts by the Unicode Consortium, efforts by the IETF such as [IDN]
and [PRECIS], and internationalization efforts by W3C such as         | and [PRECIS], and internationalization efforts by W3C such as
[W3C-CHAR]. The results of these efforts should be consulted by       | [W3C-CHAR]. The results of these efforts should be consulted by
anyone engaging in such work.                                         | anyone engaging in such work.
1.1. Notation                                                         | 1.1. Notation
In this document, the numeric values assigned to Unicode characters   | In this document, the numeric values assigned to Unicode characters
are provided in hexadecimal. This document uses Unicode's standard    | are provided in hexadecimal. This document uses Unicode's standard
notation of "U+" followed by four or more hexadecimal digits. For     | notation of "U+" followed by four or more hexadecimal digits. For
skipping to change at line 143                                        | skipping to change at line 143
storage systems and to specify allowed subsets in specifications.     | storage systems and to specify allowed subsets in specifications.
There are 1,114,112 (17 * 2^16) code points; as of Unicode 16.0       | There are 1,114,112 (17 * 2^16) code points; as of Unicode 16.0
(2024), about 155,000 have been assigned to characters. Since         | (2024), about 155,000 have been assigned to characters. Since
unassigned code points regularly become assigned when new characters  | unassigned code points regularly become assigned when new characters
are added to Unicode, it is usually not a good practice to specify    | are added to Unicode, it is usually not a good practice to specify
that unassigned code points should be avoided.                        | that unassigned code points should be avoided.
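As a rough check of the arithmetic quoted above, the following Python sketch recomputes the size of the code point space and prints a code point in the "U+" notation with at least four hexadecimal digits. The constant names and the helper u_plus are hypothetical, introduced only for this illustration; the figure of 17 is taken here to be the number of 2^16-sized planes, which is standard Unicode terminology but is not spelled out in the text.

```python
import sys

# Assumption: the "17" in "17 * 2^16" counts Unicode planes of 2^16 code points each.
PLANES = 17
CODE_POINTS_PER_PLANE = 2 ** 16

TOTAL_CODE_POINTS = PLANES * CODE_POINTS_PER_PLANE
HIGHEST_CODE_POINT = TOTAL_CODE_POINTS - 1  # code points are numbered from 0


def u_plus(code_point: int) -> str:
    """Format a code point as 'U+' followed by four or more hex digits."""
    return f"U+{code_point:04X}"


print(f"{TOTAL_CODE_POINTS:,}")      # 1,114,112
print(u_plus(HIGHEST_CODE_POINT))    # U+10FFFF
print(u_plus(0x41))                  # U+0041
# CPython (3.3 and later) reports the same upper bound.
print(sys.maxunicode == HIGHEST_CODE_POINT)  # True
```

The count of assigned characters (about 155,000 as of Unicode 16.0) is deliberately not computed here, since it changes with each Unicode version.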
2.1. Encoding Forms                                                   | 2.1. Encoding Forms
Unicode describes a variety of encoding forms, ways to marshal code   | Unicode describes a variety of encoding forms that can be used to
points into byte sequences. A survey of these is beyond the scope of  | marshal code points into byte sequences. A survey of these is beyond
this document. However, it is useful to note that "UTF-16"            | the scope of this document. However, it is useful to note that "UTF-
represents each code point with one or two 16-bit chunks, while "UTF- | 16" represents each code point with one or two 16-bit chunks, while
8" uses variable-length byte sequences [RFC3629].                     | "UTF-8" uses variable-length byte sequences [RFC3629].
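The difference between the two encoding forms mentioned above can be observed with a minimal Python sketch. The helper names utf16_units and utf8_bytes and the sample characters are assumptions chosen for illustration; the "utf-16-le" codec is used only so that no byte order mark is counted.

```python
def utf16_units(ch: str) -> int:
    """Number of 16-bit chunks UTF-16 needs for one code point."""
    # "utf-16-le" emits no byte order mark, so bytes / 2 is the chunk count.
    return len(ch.encode("utf-16-le")) // 2


def utf8_bytes(ch: str) -> int:
    """Number of bytes UTF-8 needs for one code point."""
    return len(ch.encode("utf-8"))


# Sample code points: U+0041, U+00E9, U+20AC, U+1F600.
SAMPLES = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in SAMPLES:
    print(f"U+{ord(ch):04X}: UTF-16 chunks={utf16_units(ch)}, "
          f"UTF-8 bytes={utf8_bytes(ch)}")
```

For these samples, UTF-16 uses one chunk for the first three code points and two for U+1F600, while UTF-8 uses one, two, three, and four bytes respectively.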
The "IETF Policy on Character Sets and Languages", BCP 18 [RFC2277],  | The "IETF Policy on Character Sets and Languages", BCP 18 [RFC2277],
says "Protocols MUST be able to use the UTF-8 charset", which becomes | says "Protocols MUST be able to use the UTF-8 charset", which becomes
a mandate to use UTF-8 for any protocol or data format that specifies | a mandate to use UTF-8 for any protocol or data format that specifies
a single encoding form. UTF-8 is widely used for interoperable data   | a single encoding form. UTF-8 is widely used for interoperable data
formats such as JSON, YAML, CBOR, and XML.                            | formats such as JSON, YAML, CBOR, and XML.
2.2. Problematic Code Points                                          | 2.2. Problematic Code Points
This section classifies all the code points that can never represent  | This section classifies all the code points that can never represent
End of changes. 2 change blocks.
6 lines changed or deleted                                            | 6 lines changed or added

This html diff was produced by rfcdiff 1.48.