rfc9839v1.txt | rfc9839.txt | |||
---|---|---|---|---|
skipping to change at line 102 ¶ | skipping to change at line 102 ¶ | |||
subset. The intended use is to serve as a convenient target for | subset. The intended use is to serve as a convenient target for | |||
cross-reference from other specifications whose authors wish to | cross-reference from other specifications whose authors wish to | |||
exclude problematic code points from the data format or protocol | exclude problematic code points from the data format or protocol | |||
being specified. | being specified. | |||
Note that this document only provides guidance on avoiding the use of | Note that this document only provides guidance on avoiding the use of | |||
code points that cannot be used for interoperable interchange of | code points that cannot be used for interoperable interchange of | |||
Unicode textual data. Dealing with strings, particularly in the | Unicode textual data. Dealing with strings, particularly in the | |||
context of user interfaces, requires addressing language, text | context of user interfaces, requires addressing language, text | |||
rendering direction, alternate representations of the same abstract | rendering direction, alternate representations of the same abstract | |||
character, and so on. These issues, among many others, led to many | character, and so on. These issues, among many others, led to | |||
efforts by the Unicode Consortium, efforts by the IETF such as [IDN] | efforts by the Unicode Consortium, efforts by the IETF such as [IDN] | |||
and [PRECIS], and internationalization efforts by W3C such as | and [PRECIS], and internationalization efforts by W3C such as | |||
[W3C-CHAR]. The results of these efforts should be consulted by | [W3C-CHAR]. The results of these efforts should be consulted by | |||
anyone engaging in such work. | anyone engaging in such work. | |||
1.1. Notation | 1.1. Notation | |||
In this document, the numeric values assigned to Unicode characters | In this document, the numeric values assigned to Unicode characters | |||
are provided in hexadecimal. This document uses Unicode's standard | are provided in hexadecimal. This document uses Unicode's standard | |||
notation of "U+" followed by four or more hexadecimal digits. For | notation of "U+" followed by four or more hexadecimal digits. For | |||
skipping to change at line 143 ¶ | skipping to change at line 143 ¶ | |||
storage systems and to specify allowed subsets in specifications. | storage systems and to specify allowed subsets in specifications. | |||
There are 1,114,112 (17 * 2^16) code points; as of Unicode 16.0 | There are 1,114,112 (17 * 2^16) code points; as of Unicode 16.0 | |||
(2024), about 155,000 have been assigned to characters. Since | (2024), about 155,000 have been assigned to characters. Since | |||
unassigned code points regularly become assigned when new characters | unassigned code points regularly become assigned when new characters | |||
are added to Unicode, it is usually not a good practice to specify | are added to Unicode, it is usually not a good practice to specify | |||
that unassigned code points should be avoided. | that unassigned code points should be avoided. | |||
2.1. Encoding Forms | 2.1. Encoding Forms | |||
Unicode describes a variety of encoding forms, ways to marshal code | Unicode describes a variety of encoding forms that can be used to | |||
points into byte sequences. A survey of these is beyond the scope of | marshal code points into byte sequences. A survey of these is beyond | |||
this document. However, it is useful to note that "UTF-16" | the scope of this document. However, it is useful to note that "UTF- | |||
represents each code point with one or two 16-bit chunks, while "UTF- | 16" represents each code point with one or two 16-bit chunks, while | |||
8" uses variable-length byte sequences [RFC3629]. | "UTF-8" uses variable-length byte sequences [RFC3629]. | |||
The "IETF Policy on Character Sets and Languages", BCP 18 [RFC2277], | The "IETF Policy on Character Sets and Languages", BCP 18 [RFC2277], | |||
says "Protocols MUST be able to use the UTF-8 charset", which becomes | says "Protocols MUST be able to use the UTF-8 charset", which becomes | |||
a mandate to use UTF-8 for any protocol or data format that specifies | a mandate to use UTF-8 for any protocol or data format that specifies | |||
a single encoding form. UTF-8 is widely used for interoperable data | a single encoding form. UTF-8 is widely used for interoperable data | |||
formats such as JSON, YAML, CBOR, and XML. | formats such as JSON, YAML, CBOR, and XML. | |||
2.2. Problematic Code Points | 2.2. Problematic Code Points | |||
This section classifies all the code points that can never represent | This section classifies all the code points that can never represent | |||
End of changes. 2 change blocks. | ||||
6 lines changed or deleted | 6 lines changed or added | |||
This html diff was produced by rfcdiff 1.48. |
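
As an aside to the unchanged text in Sections 2 and 2.1, the code point count and the per-encoding sizes it mentions can be checked with a short Python sketch. This is an illustration only, not part of either draft; the names `TOTAL_CODE_POINTS`, `utf16_units`, and `utf8_bytes` are hypothetical and were chosen for this example.

```python
# Illustrative sketch only; names here are hypothetical, not from the RFC.

# 17 planes of 2^16 code points each, as stated in Section 2.
TOTAL_CODE_POINTS = 17 * 2**16


def utf16_units(ch: str) -> int:
    """Return how many 16-bit code units UTF-16 uses for one code point."""
    # "utf-16-be" is used so that no byte order mark is prepended.
    return len(ch.encode("utf-16-be")) // 2


def utf8_bytes(ch: str) -> int:
    """Return how many bytes UTF-8 uses for one code point."""
    return len(ch.encode("utf-8"))


if __name__ == "__main__":
    print(TOTAL_CODE_POINTS)
    # One sample each from the 1-, 2-, 3-, and 4-byte UTF-8 ranges.
    for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
        print(f"U+{ord(ch):04X}: UTF-16 units={utf16_units(ch)}, "
              f"UTF-8 bytes={utf8_bytes(ch)}")
```

Code points at or below U+FFFF take one UTF-16 unit and those above it take two (a surrogate pair), while UTF-8 lengths range from one to four bytes, which matches the "one or two 16-bit chunks" and "variable-length byte sequences" wording in Section 2.1.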