Network Working Group                                            T. Bray
Internet-Draft                                       Textuality Services
Intended status: Standards Track                              P. Hoffman
Expires: 18 March 2024                                             ICANN
                                                       15 September 2023


            Specifying Unicode Character Repertoires in RFCs
                         draft-bray-unichars-04

Abstract

   This document describes four subsets of Unicode characters and their
   use in protocols and data formats.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 18 March 2024.

Copyright Notice

   Copyright (c) 2023 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.






Bray & Hoffman            Expires 18 March 2024                 [Page 1]

Internet-Draft             Specifying Unicode             September 2023


Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
     1.1.  Notation  . . . . . . . . . . . . . . . . . . . . . . . .   3
   2.  Character Concepts  . . . . . . . . . . . . . . . . . . . . .   3
     2.1.  Transformation Formats  . . . . . . . . . . . . . . . . .   3
     2.2.  Problematic Code Point Types  . . . . . . . . . . . . . .   4
       2.2.1.  Surrogates  . . . . . . . . . . . . . . . . . . . . .   4
       2.2.2.  Control Codes . . . . . . . . . . . . . . . . . . . .   4
       2.2.3.  Noncharacters . . . . . . . . . . . . . . . . . . . .   5
   3.  Subsets Defined in the Unicode Standard . . . . . . . . . . .   5
     3.1.  Unicode Code Points . . . . . . . . . . . . . . . . . . .   5
     3.2.  Unicode Scalar Values . . . . . . . . . . . . . . . . . .   5
   4.  Other Subsets . . . . . . . . . . . . . . . . . . . . . . . .   6
     4.1.  XML Characters  . . . . . . . . . . . . . . . . . . . . .   6
     4.2.  Useful Assignables  . . . . . . . . . . . . . . . . . . .   6
   5.  Dealing With Problematic Code Points  . . . . . . . . . . . .   7
   6.  Restricting Character Repertoires . . . . . . . . . . . . . .   8
   7.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .   9
   8.  Security Considerations . . . . . . . . . . . . . . . . . . .   9
   9.  Normative References  . . . . . . . . . . . . . . . . . . . .   9
   10. Informative References  . . . . . . . . . . . . . . . . . . .   9
   Acknowledgements  . . . . . . . . . . . . . . . . . . . . . . . .  10
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  10

1.  Introduction

   When a protocol or data format has text fields, that text is normally
   composed of Unicode [UNICODE] characters, to support use by speakers
   of many languages.  Because of the way the Unicode Standard defines
   the term "Unicode character", the "set of all Unicode characters" is
   not always useful for technical specifications.  Instead, subsets
   such as those defined in this document are typically used.

   Protocols and data formats usually need to describe exactly which
   selection of the available Unicode characters is to be used.  The
   term "character repertoire" is a well-understood concept when applied
   to an encoding standard; in this document it describes selected
   subsets of the Unicode characters.  Authors should have a way to
   concisely and exactly reference a stable specification that
   identifies a protocol or data format's character repertoire.

   This document describes and names several subsets that have been
   popular choices in specification character repertoires, and suggests
   one new subset.  The goal is to provide a convenient target for
   cross-reference from other specifications which discuss character
   repertoires.

1.1.  Notation

   In this document, the numeric values assigned to Unicode characters
   are provided in hexadecimal.  In the text, Unicode's standard "U+"
   notation, zero-padded to four places [RFC5137], is used.  For
   example, "A", decimal 65, would be expressed as U+0041, and "😉"
   (Winking Face), decimal 128,521, would be U+1F609.
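
   As an illustration only, the "U+" notation can be produced by a
   short Python helper.  The function name u_plus is hypothetical and
   is not defined by [UNICODE] or [RFC5137].

```python
def u_plus(cp: int) -> str:
    """Format a code point in "U+" notation (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"{cp} is outside the Unicode codespace")
    # At least four hexadecimal digits, more when the value needs them.
    return f"U+{cp:04X}"


print(u_plus(ord("A")))   # U+0041
print(u_plus(128521))     # U+1F609
```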

   Groups of numeric values described in Section 3 and Section 4 are
   given in ABNF [RFC5234].  In ABNF, the hexadecimal values for
   characters are preceded by "%x" rather than "U+".

   All the numeric ranges in this document are inclusive.

2.  Character Concepts

   The Unicode Standard's definition of "Unicode character" is
   conceptual.  However, each Unicode character is assigned an integer
   identifier in the range U+0000-U+10FFFF.  These numbers are used to
   represent the characters in computer memory and storage systems and,
   in specifications, to specify the allowed repertoires of Unicode
   characters.

   The numbers assigned to Unicode characters are called "code points";
   there are potentially 1,114,112 of them.  As of 2023, fewer than
   150,000 characters have had code points assigned.  It is difficult to
   specify that unassigned code points should be avoided, because they
   regularly become assigned as new characters are added to Unicode.

2.1.  Transformation Formats

   Unicode describes a variety of "transformation formats", ways to
   encode code points in bytes of computer memory.  A survey of
   transformation formats is beyond the scope of this document.
   However, it is useful to note that the "UTF-16" format represents
   each code point with one or two 16-bit chunks, and the "UTF-8" format
   uses variable-length byte sequences.
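
   The difference between the two formats can be seen in a small Python
   sketch that relies on the standard "utf-8" and "utf-16-be" codecs.
   The helper name encoded_sizes and the sample code points are chosen
   here for illustration.

```python
def encoded_sizes(ch: str) -> tuple[int, int]:
    """Return (UTF-8 bytes, UTF-16 16-bit units) for one character."""
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")  # big-endian, no byte order mark
    return len(utf8), len(utf16) // 2


for cp in (0x41, 0xE9, 0x20AC, 0x1F609):
    n8, n16 = encoded_sizes(chr(cp))
    print(f"U+{cp:04X}: {n8} UTF-8 byte(s), {n16} UTF-16 unit(s)")
```

   Code points up to U+FFFF occupy a single 16-bit unit in UTF-16,
   while U+1F609 needs two; in UTF-8 the same four code points occupy
   one, two, three, and four bytes respectively.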

   The "IETF Policy on Character Sets and Languages", BCP 18 [RFC2277],
   says "Protocols MUST be able to use the UTF-8 charset", which becomes
   a mandate to use UTF-8 for any protocol or data format that specifies
   a single transformation format.  UTF-8 is widely used for
   interoperable data formats such as JSON, YAML, and XML.

2.2.  Problematic Code Point Types

   Definition D10a in section 3.4 of [UNICODE] defines seven code point
   types.  Three types of code points are assigned to constructs which
   are not actually characters or whose value as Unicode characters in
   text fields is questionable: "Control", "Surrogate", and
   "Noncharacter".

2.2.1.  Surrogates

   A total of 2,048 code points, in the range U+D800-U+DFFF, are divided
   into two blocks called "high surrogates" and "low surrogates";
   collectively the 2,048 code points are referred to as "surrogates".
   Surrogates may only be used in Unicode texts encoded in UTF-16, where
   a high-surrogate/low-surrogate pair represents a code point greater
   than U+FFFF.

   A surrogate which occurs in text encoded in any transformation format
   other than UTF-16 has no meaning and may cause malfunction in
   software that encounters it.  In particular, it is impossible to
   represent a surrogate in well-formed UTF-8.
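
   The following Python sketch shows the arithmetic by which UTF-16
   maps a code point greater than U+FFFF to a surrogate pair, and shows
   that Python's UTF-8 codec refuses to encode a lone surrogate.  The
   function name to_surrogate_pair is hypothetical.

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use a pair")
    offset = cp - 0x10000              # a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F609)
print(f"U+{high:04X} U+{low:04X}")     # U+D83D U+DE09

try:
    chr(0xD800).encode("utf-8")
except UnicodeEncodeError:
    print("a lone surrogate cannot be encoded as UTF-8")
```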

2.2.2.  Control Codes

   Section 23.1 of [UNICODE] introduces the "Control Codes", for
   compatibility with legacy pre-Unicode standards.  They comprise 65
   code points in the ranges U+0000-U+001F ("C0 Controls") and
   U+0080-U+009F ("C1 Controls"), plus U+007F, "DEL".

2.2.2.1.  Useful Controls

   The C0 Controls include newline (U+000A), carriage return (U+000D),
   and tab (U+0009); this document refers to these three characters as
   the "useful controls".

2.2.2.2.  Legacy Controls

   Aside from the useful controls, the control codes are mostly obsolete
   and generally lack interoperable semantics.  This document uses the
   phrase "legacy controls" to describe control codes that are not
   useful controls.

   Since the code points for C0 Controls include the 32 smallest
   integers including zero, they are likely to occur in data as a result
   of programming errors.
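
   The distinction between useful and legacy controls might be sketched
   in Python as follows.  The names USEFUL_CONTROLS, is_control, and
   is_legacy_control are illustrative and are not defined elsewhere.

```python
# Tab, newline, and carriage return: the "useful controls".
USEFUL_CONTROLS = frozenset({0x09, 0x0A, 0x0D})


def is_control(cp: int) -> bool:
    """True for C0 Controls, DEL, and C1 Controls."""
    return 0x00 <= cp <= 0x1F or 0x7F <= cp <= 0x9F


def is_legacy_control(cp: int) -> bool:
    """True for control codes that are not useful controls."""
    return is_control(cp) and cp not in USEFUL_CONTROLS


controls = [cp for cp in range(0x110000) if is_control(cp)]
legacy = [cp for cp in controls if is_legacy_control(cp)]
print(len(controls), len(legacy))   # 65 62
```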

2.2.3.  Noncharacters

   Certain code points are classified as "noncharacters", and [UNICODE]
   asserts repeatedly that they are not designed or used for open
   interchange.

   Code points are organized into 17 "planes", each containing 2^16 code
   points.  The last two code points in each plane are noncharacters:
   U+00FFFE, U+00FFFF, U+01FFFE, U+01FFFF, U+02FFFE, U+02FFFF, and so
   on, up to U+10FFFE, U+10FFFF.

   The code points in the range U+FDD0-U+FDEF are noncharacters.
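
   Both groups of noncharacters can be recognized with a short test,
   sketched here in Python.  The function name is_noncharacter is
   hypothetical; the mask relies on the fact that the last two code
   points of every plane end in FFFE or FFFF.

```python
def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0-U+FDEF and the last two code points per plane."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    # (cp & 0xFFFE) == 0xFFFE matches low 16 bits of FFFE or FFFF.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


count = sum(is_noncharacter(cp) for cp in range(0x110000))
print(count)   # 66: 32 in U+FDD0-U+FDEF plus 2 in each of 17 planes
```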

3.  Subsets Defined in the Unicode Standard

   This section describes subsets of the code points that are defined in
   [UNICODE].  Specifications can refer to these repertoires by the
   names "Unicode Code Points" and "Unicode Scalar Values".

3.1.  Unicode Code Points

   Definition D9 in section 3.4 of [UNICODE] defines the term "Unicode
   codespace" as "a range of integers from 0 to 10FFFF_16".  Definition
   D10 defines the term "Code point" as "Any value in the Unicode
   codespace".

   The "Unicode Code Points" subset can be expressed as an ABNF
   production:

   unicode-code-points =
      %x0-10FFFF

   This subset is notable for including all possible code points,
   including those of the problematic types discussed above.  It is the
   default repertoire of JSON [RFC8259].

3.2.  Unicode Scalar Values

   Definition D76 in section 3.9 of [UNICODE] defines the term "Unicode
   scalar value" as "Any Unicode code point except high-surrogate and
   low-surrogate code points."

   The "Unicode Scalar Values" subset can be expressed as an ABNF
   production:

   unicode-scalar-values =
      %x0-D7FF / %xE000-10FFFF  ; exclude surrogates

   This subset is the default character repertoire for I-JSON [RFC7493]
   and CBOR [RFC8949], and has the advantage of excluding surrogates.
   However, it includes legacy controls and noncharacters.
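
   A membership test for this subset is small.  The Python sketch below
   uses hypothetical function names; note that a Python str is a
   sequence of code points and can therefore hold a surrogate, which is
   why the check is not redundant.

```python
def is_scalar_value(cp: int) -> bool:
    """True for any code point that is not a surrogate."""
    return 0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


def all_scalar_values(text: str) -> bool:
    """True if every code point in text is a Unicode scalar value."""
    return all(is_scalar_value(ord(ch)) for ch in text)


print(all_scalar_values("abc\x00\ufffe"))   # True: controls, nonchars pass
print(all_scalar_values("a\ud800"))         # False: lone surrogate
```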

4.  Other Subsets

   This section describes other ways to specify subsets of the code
   points beyond those provided by the Unicode Standard itself.
   Specifications can refer to these repertoires by the names "XML
   Characters" and "Useful Assignables".

4.1.  XML Characters

   The XML 1.0 Specification [XML], in its grammar production labeled
   "Char", specifies a subset of Unicode code points that excludes
   surrogates, legacy C0 Controls, and the noncharacters U+FFFE and
   U+FFFF.

   The "XML Characters" subset can be expressed as an ABNF production:

   xml-chars =
      %x9 / %xA / %xD /   ; useful controls
      %x20-D7FF /         ; exclude surrogates
      %xE000-FFFD /       ; exclude FFFE and FFFF nonchars
      %x10000-10FFFF

   While this subset does not exclude all the problematic code points,
   the C1 Controls are less likely than the C0 Controls to appear
   erroneously in data, and have not been observed to be a frequent
   source of problems.  Also, the noncharacters greater in value than
   U+FFFF are rarely encountered.
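
   As a non-normative illustration, the "Char" production of XML 1.0
   can be transcribed into a Python range table.  The names
   XML_CHAR_RANGES and is_xml_char are hypothetical.

```python
# Inclusive ranges from the "Char" production of XML 1.0.
XML_CHAR_RANGES = (
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),   # useful controls
    (0x20, 0xD7FF),                       # exclude surrogates
    (0xE000, 0xFFFD),                     # exclude FFFE and FFFF
    (0x10000, 0x10FFFF),
)


def is_xml_char(cp: int) -> bool:
    """True if cp is permitted by the XML 1.0 Char production."""
    return any(lo <= cp <= hi for lo, hi in XML_CHAR_RANGES)


print(is_xml_char(0x00))      # False: legacy C0 Control
print(is_xml_char(0x85))      # True: C1 Controls are permitted
print(is_xml_char(0x1FFFE))   # True: noncharacter above U+FFFF
```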

4.2.  Useful Assignables

   For convenience, this document defines the "Useful Assignables"
   subset as the Unicode code points, excluding the legacy controls,
   surrogates, and noncharacters.  This comprises all code points that
   are currently assigned, or might in future be assigned, to characters
   that are not legacy control codes, plus the useful controls.

   Useful Assignables can be expressed as an ABNF production:

   useful-assignables =
      %x9 / %xA / %xD /               ; useful controls
      %x20-7E /                       ; exclude C1 Controls and DEL
      %xA0-D7FF /                     ; exclude surrogates
      %xE000-FDCF /                   ; exclude FDD0 nonchars
      %xFDF0-FFFD /                   ; exclude FFFE and FFFF nonchars
      %x10000-1FFFD / %x20000-2FFFD / ; (repeat per plane)
      %x30000-3FFFD / %x40000-4FFFD /
      %x50000-5FFFD / %x60000-6FFFD /
      %x70000-7FFFD / %x80000-8FFFD /
      %x90000-9FFFD / %xA0000-AFFFD /
      %xB0000-BFFFD / %xC0000-CFFFD /
      %xD0000-DFFFD / %xE0000-EFFFD /
      %xF0000-FFFFD / %x100000-10FFFD

   This subset excludes all code points whose types are identified as
   problematic above.
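
   The production can be transcribed into a Python range table, which
   also makes it easy to count the members of the subset.  This is a
   sketch with hypothetical names (useful_assignable_ranges, RANGES,
   is_useful_assignable), not a reference implementation.

```python
def useful_assignable_ranges() -> list[tuple[int, int]]:
    """Inclusive ranges transcribed from the ABNF production."""
    ranges = [
        (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),  # useful controls
        (0x20, 0x7E),                        # stop before DEL and C1
        (0xA0, 0xD7FF),                      # exclude surrogates
        (0xE000, 0xFDCF),                    # exclude FDD0 nonchars
        (0xFDF0, 0xFFFD),                    # exclude FFFE and FFFF
    ]
    for plane in range(1, 17):               # planes 1 through 16
        base = plane << 16
        ranges.append((base, base + 0xFFFD))
    return ranges


RANGES = useful_assignable_ranges()


def is_useful_assignable(cp: int) -> bool:
    """True if cp is in the Useful Assignables subset."""
    return any(lo <= cp <= hi for lo, hi in RANGES)


total = sum(hi - lo + 1 for lo, hi in RANGES)
print(f"{total:,} code points")   # 1,111,936 code points
```

   The total equals the 1,114,112 code points less 2,048 surrogates, 66
   noncharacters, and 62 legacy controls.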

5.  Dealing With Problematic Code Points

   Noncharacters and legacy controls are unlikely to cause software
   failures, but they cannot usefully be displayed to humans, and can be
   used in attacks that rely on misleading human readers of text that
   contains them [TR36].

   Surrogates have been observed to cause software failures.  The
   behavior of software which encounters them is unpredictable and
   differs among programming-language implementations, even between
   different API calls in the same language.

   Section 3.9 of [UNICODE] makes it clear that a UTF-8 byte sequence
   which would map to a surrogate is ill-formed.  Thus, in theory, if a
   specification requires that input data be encoded with UTF-8,
   implementors should never have to concern themselves with surrogates.

   Unfortunately, industry experience teaches that problematic code
   points, including surrogates, can and do occur in program input where
   the source of input data is not controlled by the implementor.  For
   example, the following is a legal JSON document:

   {"example": "\u0000\u0089\uDEAD\uD9BF\uDFFF"}

   The value of the "example" field contains the C0 Control NUL, the C1
   Control "CHARACTER TABULATION WITH JUSTIFICATION", an unpaired
   surrogate, and the noncharacter U+7FFFF encoded per JSON rules as two
   escaped UTF-16 surrogate code points.  It is unlikely to be useful as
   the value of a text field.  It cannot be serialized into well-formed
   UTF-8, but the behavior of libraries asked to parse the sample is
   unpredictable; some will silently parse this and generate an ill-
   formed UTF-8 string.
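
   As one data point, the sketch below feeds the example to the json
   module of the Python standard library.  It reflects the behavior of
   CPython only; other parsers may reject the input or handle it
   differently.

```python
import json

# The example document; JSON escapes use a lowercase "u".
doc = r'{"example": "\u0000\u0089\uDEAD\uD9BF\uDFFF"}'

value = json.loads(doc)["example"]   # parses without complaint

# The surrogate pair is combined; the unpaired surrogate is kept.
print([f"U+{ord(ch):04X}" for ch in value])
# ['U+0000', 'U+0089', 'U+DEAD', 'U+7FFFF']

try:
    value.encode("utf-8")
except UnicodeEncodeError:
    print("the parsed string cannot be encoded as UTF-8")
```

   Here the parser accepts the document silently, and the problem only
   surfaces later, when the string is serialized.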

   Implementors who follow the guidance of [RFC9413], "Maintaining
   Robust Protocols", will need to deal with problematic code points.  A
   variety of options are reasonable.  RFC 9413 recommends, by default,
   discarding ill-formed data silently without returning an error
   message, unless this is required by the specification; and further,
   that error messages, if specified, should be explicit.  [UNICODE]
   section 3.2 recommends dealing with ill-formed byte sequences by
   signaling an error, or replacing problematic code points with �
   (U+FFFD, REPLACEMENT CHARACTER).

   The discussion of error-handling options in [RFC9413] is thorough and
   very helpful in choosing a strategy for dealing with problematic code
   points.

6.  Restricting Character Repertoires

   Many IETF specifications rely on well-known data formats such as
   JSON, I-JSON, CBOR, YAML, and XML.  These formats have default
   character repertoires.  For example, JSON allows object member names
   and string values to include any Unicode code points, including all
   the problematic types.

   It is unlikely that anyone specifying a new data format would choose
   to allow the Unicode Code Points character repertoire.

   A protocol based on JSON can be made more robust and implementor-
   friendly by restricting the contents of object member names and
   string values to Useful Assignables (see Section 4.2).  An equivalent
   restriction is possible for other packaging formats such as I-JSON,
   XML, YAML, and CBOR.

   Note that escaping techniques such as those in the JSON example above
   cannot be used to circumvent this sort of character-repertoire
   restriction, which applies to data content, not textual
   representation in packaging formats.  If a specification restricted a
   JSON field value to the Useful Assignables, the example would remain
   a legal JSON Text but the data it represents would not constitute
   Useful Assignable code points.
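
   Because the restriction applies to data content, a check is most
   naturally run after parsing, when escapes have already been
   resolved.  The Python sketch below walks a parsed JSON value and
   checks every member name and string value; the names
   is_useful_assignable and check_strings are hypothetical.

```python
import json


def is_useful_assignable(cp: int) -> bool:
    """True if cp is in the Useful Assignables subset."""
    if cp in (0x09, 0x0A, 0x0D):                 # useful controls
        return True
    if cp < 0x20 or 0x7F <= cp <= 0x9F:          # legacy controls
        return False
    if 0xD800 <= cp <= 0xDFFF:                   # surrogates
        return False
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return False                             # noncharacters
    return True


def check_strings(value, path: str = "$") -> None:
    """Raise ValueError if any string holds a disallowed code point."""
    if isinstance(value, str):
        for ch in value:
            if not is_useful_assignable(ord(ch)):
                raise ValueError(f"U+{ord(ch):04X} not allowed at {path}")
    elif isinstance(value, list):
        for index, item in enumerate(value):
            check_strings(item, f"{path}[{index}]")
    elif isinstance(value, dict):
        for name, item in value.items():
            check_strings(name, f"{path} (member name)")
            check_strings(item, f"{path}.{name}")


# Escaped in the JSON text, but still NUL in the data content.
parsed = json.loads(r'{"example": "\u0000"}')
try:
    check_strings(parsed)
except ValueError as exc:
    print(exc)   # U+0000 not allowed at $.example
```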

7.  IANA Considerations

   This document makes no requests of IANA.

8.  Security Considerations

   Unicode Security Considerations [TR36] is a wide-ranging survey of
   the issues implementors should consider while writing software to
   process Unicode text.  Many of the exploits it discusses are aimed at
   deceiving human readers, but vulnerabilities involving issues such as
   surrogates and noncharacters are also covered, and in fact can
   contribute to human-deceiving exploits.

   Note that the Unicode-character subsets specified in this document
   include a successively-decreasing number of surrogates and
   noncharacters, and thus should be less and less susceptible to
   vulnerabilities.  The Section 4.2 subset, "Useful Assignables",
   excludes all of them.

9.  Normative References

   [RFC5234]  Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
              Specifications: ABNF", STD 68, RFC 5234,
              DOI 10.17487/RFC5234, January 2008,
              <https://www.rfc-editor.org/info/rfc5234>.

   [TR36]     The Unicode Consortium, "Unicode Security Considerations",
              <https://www.unicode.org/reports/tr36/>.  Note that this
              reference is to the latest version of this document,
              rather than to a specific release.  It is not expected
              that future updates will affect the referenced
              discussions.

   [UNICODE]  The Unicode Consortium, "The Unicode Standard",
              <http://www.unicode.org/versions/latest/>.  Note that this
              reference is to the latest version of Unicode, rather than
              to a specific release.  It is not expected that future
              changes in the Unicode Standard will affect the referenced
              definitions.

10.  Informative References

   [RFC2277]  Alvestrand, H., "IETF Policy on Character Sets and
              Languages", BCP 18, RFC 2277, DOI 10.17487/RFC2277,
              January 1998, <https://www.rfc-editor.org/info/rfc2277>.

   [RFC5137]  Klensin, J., "ASCII Escaping of Unicode Characters",
              BCP 137, RFC 5137, DOI 10.17487/RFC5137, February 2008,
              <https://www.rfc-editor.org/info/rfc5137>.

   [RFC7493]  Bray, T., Ed., "The I-JSON Message Format", RFC 7493,
              DOI 10.17487/RFC7493, March 2015,
              <https://www.rfc-editor.org/info/rfc7493>.

   [RFC8259]  Bray, T., Ed., "The JavaScript Object Notation (JSON) Data
              Interchange Format", STD 90, RFC 8259,
              DOI 10.17487/RFC8259, December 2017,
              <https://www.rfc-editor.org/info/rfc8259>.

   [RFC8949]  Bormann, C. and P. Hoffman, "Concise Binary Object
              Representation (CBOR)", STD 94, RFC 8949,
              DOI 10.17487/RFC8949, December 2020,
              <https://www.rfc-editor.org/info/rfc8949>.

   [RFC9413]  Thomson, M. and D. Schinazi, "Maintaining Robust
              Protocols", RFC 9413, DOI 10.17487/RFC9413, June 2023,
              <https://www.rfc-editor.org/info/rfc9413>.

   [XML]      Bray, T., Paoli, J., McQueen, C.M., Maler, E., and F.
              Yergeau, "Extensible Markup Language (XML) 1.0 (Fifth
              Edition)", 26 November 2008,
              <http://www.w3.org/TR/2008/REC-xml-20081126/>.  Note that
              this reference is to a specific release, based on a
              history of previous "Edition" releases having changed this
              production.

Acknowledgements

   Thanks are due to Guillaume Fortin-Debigaré, who filed an Errata
   Report against RFC 8259, The JavaScript Object Notation, noting
   frequent references to "Unicode characters", when in fact the RFC
   formally specifies the use of Unicode Code Points.

   Thanks also to Asmus Freytag for careful review and many constructive
   suggestions aimed at making the language more consistent with the
   structure of the Unicode Standard.

   Thanks also to James Manger for the correctness of the ABNF and JSON
   samples.

Authors' Addresses

   Tim Bray
   Textuality Services
   Email: tbray@textuality.com


   Paul Hoffman
   ICANN
   Email: paul.hoffman@icann.org