Internet Draft                                               S. Wenger 
 Document: draft-ietf-avt-rtp-h264-01.txt               M.M. Hannuksela 
 Expires: August 2003                                    T. Stockhammer 
                                                          February 2003 
                                                
  
  
  
                    RTP Payload Format for JVT Video 
  
  
  
 Status of this Memo 
     
 This document is an Internet-Draft and is in full conformance with 
 all provisions of Section 10 of RFC2026.  Internet-Drafts are working 
 documents of the Internet Engineering Task Force (IETF), its areas, 
 and its working groups.  Note that other groups may also distribute 
 working documents as Internet-Drafts. 
  
 Internet-Drafts are draft documents valid for a maximum of six months 
 and may be updated, replaced, or obsoleted by other documents at any 
 time.  It is inappropriate to use Internet-Drafts as reference 
 material or to cite them other than as "work in progress." 
  
 The list of current Internet-Drafts can be accessed at 
 http://www.ietf.org/1id-abstracts.txt 
  
 The list of Internet-Draft Shadow Directories can be accessed at 
 http://www.ietf.org/shadow.html 
     
     
     
 Abstract 
     
    This memo describes an RTP Payload format for the ITU-T 
    Recommendation H.264 video codec.  The most up-to-date draft of the 
    video codec was specified in February 2003, is due for final 
    approval at the committee level late March 2003, and is available 
    for public review [1].  This codec was designed as a joint project 
    of the Video Coding Experts Group (VCEG) of ITU-T and the Moving 
    Picture Experts Group (MPEG) of ISO/IEC.  ISO/IEC International 
    Standard 14496-10 will be technically identical to ITU-T 
    Recommendation H.264. 
     
 Wenger et. al.      Expires August 2003            [Page 1] 

 Internet Draft                                          01 March, 2003 
     
 Table of Contents 
     
 1. Introduction
  1.1. The JVT codec
  1.2. Parameter Set Concept
  1.3. Network Abstraction Layer Packet (NALU) Types
 2. Conventions
 3. Changes relative to draft-ietf-avt-rtp-h264-00.txt
  3.1. Status of the JVT standardization, and recent changes to JVT
  3.2. Changes relative to draft-ietf-avt-rtp-h264-00.txt
 4. Scope
 5. Definitions
 6. RTP Payload Format
  6.1. RTP Header Usage
  6.2. Simple Packet
  6.3. Aggregation Packets
  6.4. Fragmentation Units
 7. Packetization Rules
  7.1. Unrestricted Mode (Multiple Picture Model)
  7.2. Restricted Mode (Single Picture Model)
 8. De-Packetization Process
 9. MIME Considerations
  9.1. MIME Registration
  9.2. SDP Parameters
 10. Security Considerations
 11. Informative Appendix: Application Examples
  11.1. Video Telephony, no Data Partitioning, no packet aggregation
  11.2. Video Telephony, Interleaved Packetization using Packet
        Aggregation
  11.3. Video Telephony, with Data Partitioning
  11.4. Low-Bit-Rate Streaming
  11.5. Robust Packet Scheduling in Video Streaming
 12. Open Issues
 13. Full Copyright Statement
 14. Intellectual Property Notice
 15. References
  15.1. Normative References
  15.2. Informative References
     
     
     
 1.    Introduction 
  
 1.1.      The JVT codec 
  
    This memo specifies an RTP payload format for a new video codec
    that is currently under development by the Joint Video Team (JVT),
    which is formed of video coding experts of MPEG and the ITU-T.
    After the likely approval by the two parent bodies, the codec
    specification will have the status of ITU-T Recommendation H.264
    and become part of the MPEG-4 specification (ISO/IEC 14496 Part
    10).  According to the current timeline of the JVT project, a
    technically frozen specification has existed since February 2003
    (pending bug fixes).  It is believed that very few, if any,
    technical details that directly affect this draft will change in
    the future.

    Before JVT was formed in late 2001, this project used the ITU-T
    project name H.26L, and the JVT project inherited all the technical
    concepts of the H.26L project.
  
    The JVT video codec has a very broad application range that covers
    all forms of digital compressed video, from low bit rate Internet
    streaming applications to HDTV broadcast and Digital Cinema
    applications with near-lossless coding.  Most, if not all, relevant
    companies in all of these fields (including video conferencing,
    streaming, TV broadcast, and Digital Cinema) have participated in
    the standardization, which gives hope that this wide application
    range is more than an illusion and may materialize, probably in a
    relatively short time frame.  The overall performance of the JVT
    codec is such that bit rate savings of 50% or more, compared to the
    current state of technology, are reported.  Digital Satellite TV
    quality, for example, was reported to be achievable at 1.5 Mbit/s,
    compared to the current operation point of MPEG-2 video at around
    3.5 Mbit/s [5].
     
    The codec specification [1] itself distinguishes conceptually
    between a video coding layer (VCL) and a network abstraction layer
    (NAL).  The VCL contains the signal processing functionality of the
    codec, such as transform, quantization, motion search/compensation,
    and the loop filter.  It follows the general concept of most of
    today's video codecs: a macroblock-based coder that uses inter
    picture prediction with motion compensation and transform coding of
    the residual signal.  The output of the VCL is a sequence of
    slices.  Each slice is a bit string that contains the macroblock
    data of an integer number of macroblocks and the information of the
    slice header (containing the spatial address of the first
    macroblock in the slice, the initial quantization parameter, and
    similar).  Macroblocks in slices are ordered in scan order unless a
    different macroblock allocation is specified using the so-called
    Flexible Macroblock Ordering syntax.  In-picture prediction is used
    only within a slice.
     
    The NAL encapsulates the slice output of the VCL into Network
    Abstraction Layer Units (NALUs), which are suitable for
    transmission over packet networks or for use in packet-oriented
    multiplex environments.  JVT's Annex B defines an encapsulation
    process to transmit such NALUs over byte-stream oriented networks.
    Annex B is not relevant within the scope of this memo.
     
    Neither the VCL nor the NAL is claimed to be media or network
    independent: the VCL needs to know transmission characteristics in
    order to select the error resilience strength, slice size, etc.
    appropriately, whereas the NAL needs information such as the
    importance of a bit string provided by the VCL to select the
    appropriate application layer protection.
     
    Internally, the NAL uses NAL Units or NALUs.  A NALU consists of a 
    one-byte header and the payload byte string.  The header co-serves 
    as the RTP payload header and indicates the type of the NALU, the 
    (potential) presence of bit errors in the NALU payload, and 
    information regarding the relative importance of the NALU for the 
    decoding process.  This RTP payload specification is designed to be 
    unaware of the bit string in the NALU payload. 
     
    One of the main properties of the JVT codec is the complete
    decoupling of the transmission time, the decoding time, and the
    sampling or presentation time of slices and pictures.  The codec
    itself is unaware of time and does not carry information such as
    the number of skipped frames (commonly conveyed as the Temporal
    Reference in earlier video compression standards).  Also, there are
    NAL units that affect many pictures and are, hence, inherently
    timeless.  For this reason, the handling of the RTP timestamp
    requires special consideration for those NALUs for which the
    sampling or presentation time is not defined or, at transmission
    time, unknown.
     
     
 1.2.      Parameter Set Concept 
     
    One very fundamental design concept of the JVT codec is to generate
    self-contained packets, making mechanisms such as the header
    duplication of RFC2429 [6] or MPEG-4's HEC [7] unnecessary.  This
    is achieved by decoupling information that is relevant to more than
    one slice from the media stream.  This higher layer meta
    information should be sent reliably, asynchronously, and in advance
    of the RTP packet stream that contains the slice packets.
    (Provisions for sending this information in-band are also available
    for applications that do not have an out-of-band transport channel
    appropriate for the purpose.)  The combination of the higher level
    parameters is called a Parameter Set.  The Parameter Set contains
    information such as
     
      o picture size, 
      o display window, 
      o optional coding modes employed, 
      o macroblock allocation map, 
      o and others. 
       
    In order to be able to change picture parameters (such as the
    picture size) without needing to transmit Parameter Set updates
    synchronously to the slice packet stream, the encoder and decoder
    can maintain a list of more than one Parameter Set.  Each slice
    header contains a codeword that indicates the Parameter Set to be
    used.
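
    As a rough illustration of the list of Parameter Sets described
    above, the following sketch keeps several Parameter Sets in a table
    and selects one by the identifier found in a slice header.  The
    names ParameterSet, ParameterSetTable, store, and lookup, as well
    as the two example fields, are assumptions made for this sketch
    only and are not taken from [1].

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ParameterSet:
    # Hypothetical subset of the information a Parameter Set carries.
    width: int
    height: int


class ParameterSetTable:
    """Holds several Parameter Sets so that slices can select one by identifier."""

    def __init__(self) -> None:
        self._sets: Dict[int, ParameterSet] = {}

    def store(self, set_id: int, params: ParameterSet) -> None:
        # May be called at any time, e.g. when an out-of-band update arrives.
        self._sets[set_id] = params

    def lookup(self, set_id: int) -> ParameterSet:
        # set_id stands for the codeword carried in each slice header.
        try:
            return self._sets[set_id]
        except KeyError:
            raise LookupError(f"Parameter Set {set_id} has not been received") from None


table = ParameterSetTable()
table.store(0, ParameterSet(width=176, height=144))
table.store(1, ParameterSet(width=352, height=288))
active = table.lookup(1)  # a slice header that refers to Parameter Set 1
```

    Because slices only carry the identifier, a change of picture size
    in this sketch amounts to referring to a different table entry that
    was stored earlier.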
     
    This mechanism makes it possible to decouple the transmission of
    the Parameter Sets from the packet stream and to transmit them by
    external means, e.g. as a side effect of the capability exchange,
    or through a (reliable or unreliable) control protocol.  It is even
    possible that they are never transmitted but are fixed by an
    application design specification.
     
    Although, conceptually, the Parameter Set updates are not designed 
    to be sent in the synchronous packet stream, this memo contains 
    means to convey them in the RTP packet stream.   
     
     
 1.3.      Network Abstraction Layer Packet (NALU) Types 
  
    Tutorial information on the NAL design can be found in [8], [9],
    and [10].  For the precise definition of the NAL, refer to [1].

    All NALUs start with a single NALU type octet, which also serves
    as the payload header.  The payload of a NALU follows immediately.
     
    The NALU type octet has the following format: 
     
    +---------------+ 
    |0|1|2|3|4|5|6|7| 
    +-+-+-+-+-+-+-+-+ 
    |F|NRI|  Type   | 
    +---------------+ 
     
    F: 1 bit 
       The Forbidden bit, when zero, indicates a bit error free NAL 
       unit.  The JVT specification declares a value of 1 as a syntax 
       violation.  Hence, when set, the decoder is advised that bit 
       errors may be present in the payload or in the NALU type octet.  
       A prudent reaction of decoders that are incapable of handling 
       bit errors is to discard such packets. 
        
    NRI: 2 bits 
       NAL Reference IDC.  A value of 00 indicates that the content of 
       the NALU is not used to reconstruct stored pictures (that can be 
       used for future reference).  Such NALUs can be discarded without 
       risking the integrity of the reference pictures.  Values above 
       00 indicate that the decoding of the NALU is required to 
       maintain the integrity of the reference pictures.  Furthermore, 
       values above 00 indicate the relative transport priority, as 
       determined by the encoder.  Intelligent network elements can use
       this information to protect more important NALUs better than less
       important NALUs.  11 is the highest transport priority, followed
       by 10, then by 01 and, finally, 00 is the lowest.
 
    Type: 5 bits 
       The NAL Unit payload type as defined in table 7.1 of [1], and 
       later within this memo.  Note that the NAL unit types defined in 
       this memo are marked as reserved for external use in [1]. 
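
    The field layout above can be sketched in code as follows.  This is
    a minimal illustration, assuming the usual convention that bit 0 in
    the diagram is the most significant bit of the octet.  The names
    NaluHeader, parse_nalu_header, and is_discardable are chosen for
    this sketch and do not appear in [1].

```python
from typing import NamedTuple


class NaluHeader(NamedTuple):
    forbidden: int  # F: 1 bit, a value of 1 signals possible bit errors
    nri: int        # NRI: 2 bits, 0 means not needed for reference pictures
    nal_type: int   # Type: 5 bits, NAL unit payload type


def parse_nalu_header(octet: int) -> NaluHeader:
    """Split the NALU type octet into its F, NRI, and Type fields."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("NALU type octet must fit in one byte")
    return NaluHeader(
        forbidden=(octet >> 7) & 0x01,
        nri=(octet >> 5) & 0x03,
        nal_type=octet & 0x1F,
    )


def is_discardable(header: NaluHeader) -> bool:
    """NRI of 00: the NALU can be dropped without damaging reference pictures."""
    return header.nri == 0


header = parse_nalu_header(0x78)  # bit pattern 0 11 11000
```

    For the octet 0x78 the sketch yields F = 0, NRI = 3 (binary 11),
    and Type = 24.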
     
    For a reference of all currently defined NALU types and their 
    semantics please refer to section 7.4.1 in [1].  In particular, 
    note that VCL NAL units refer to coded slice and data partition NAL 
    units as well as filler data NAL units. 
     
     
 2.    Conventions 
  
    The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 
    "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in 
    this document are to be interpreted as described in RFC 2119 [2]. 
     
  
 3.    Changes relative to draft-ietf-avt-rtp-h264-00.txt 
  
    [This section will be removed in a future version of this draft.] 
     
     
 3.1.     Status of the JVT standardization, and recent changes to JVT 
  
    None that affect this draft. 
     
     
 3.2.      Changes relative to draft-ietf-avt-rtp-h264-00.txt 
     
    This memo contains the following technical changes relative to the 
    previous I-D: 
    o The MTAPs with timestamp offset lengths of 8 and 32 bits are 
      removed, as discussed in Atlanta.   
    o The remarks about application layer protection are aligned with
      the current thinking of the AVT group regarding congestion
      control.
    o A fragmentation NALU has been introduced to allow fragmenting 
      long NALUs into several RTP packets. 
    o Recovering the decoding order of NALUs carried in MTAPs is 
      clarified and transmission of NALUs out of decoding order is 
      allowed, which can be used for robust packet scheduling in 
      streaming systems (see section 11.5 for further details). 
    o The rule of assigning the RTP timestamp to non-slice NALUs has 
      been changed: The RTP timestamp is set to the RTP timestamp of 
      the primary coded picture to which the NALU is associated 
      according to section 7.4.1.2 of [1]. 
    o MIME type registration and SDP usage have been specified. 
     
     
 4.    Scope 
  
    This payload specification can only be used to carry the "naked"
    JVT NALU stream over RTP.  The first applications of a Standards
    Track RFC resulting from this draft will likely be in the
    conversational multimedia field, such as video telephony or video
    conferencing.  The draft is not intended for use in conjunction
    with the Byte Stream format of Annex B of the JVT working draft.
     
     
 5.    Definitions 
     
    This document uses the definitions of [1]. In addition, the 
    following definitions apply: 
     
    NAL unit decoding order: A NAL unit order that conforms to the 
    constraints on NAL unit order given in section 7.4.1.1 in [1].   
     
    Transmission order: The order of packets in ascending RTP sequence 
    number order (in modulo arithmetic).  Within an Aggregation Packet, 
    the NAL unit transmission order is the same as the order of 
    appearance of NAL units in the packet. 
     
     
 6.    RTP Payload Format 
     
 6.1.      RTP Header Usage 
  
    The format of the RTP header is specified in RFC 1889 [3] and 
    reprinted in Figure XXXX for convenience.  This payload format uses 
    the fields of the header in a manner consistent with that 
    specification. 
     
    When encapsulating one NALU per RTP packet, the RECOMMENDED RTP 
    payload is specified in section 6.2.  The RTP payload (and the 
    settings for some RTP header bits) for aggregation packets and 
    fragmentation units are specified in sections 6.3 and 6.4, 
    respectively.   
     
    0                   1                   2                   3 
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    |V=2|P|X|  CC   |M|     PT      |       sequence number         | 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    |                           timestamp                           | 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    |           synchronization source (SSRC) identifier            | 
    +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ 
    |            contributing source (CSRC) identifiers             | 
    |                             ....                              | 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
     
    Figure xxxx: RTP header according to RFC 1889. 
     
    The RTP header information is set as follows:  
     
    Version (V): 2 bits 
       Set to 2 according to RFC 1889. 
     
    Padding (P): 1 bit 
       Used according to RFC 1889. 
     
    Extension (X): 1 bit 
       Specified in the RTP profile in use. 
     
    CSRC count (CC): 4 bits 
       Used according to RFC 1889. 
     
    Marker bit (M): 1 bit
       Set for the very last packet of the picture indicated by the RTP
       timestamp, in line with the normal use of the M bit and to allow
       efficient playout buffer handling.  Decoders MAY use this bit
       as an early indication of the last packet of a coded picture,
       but MUST NOT rely on this property because the last packet of
       the picture may get lost, and because the use of MTAPs does not
       always preserve the M bit.
     
    Payload type (PT): 7 bits 
       The assignment of an RTP payload type for this new packet format 
       is outside the scope of this document, and will not be specified 
       here.  It is expected that the RTP profile under which this 
       payload format is being used will assign a payload type for this 
       encoding or specify that the payload type is to be bound 
       dynamically. 
     
    Sequence number (SN): 16 bits
       Increased by one for each sent packet.  Set to a random value
       during startup as per RFC 1889.
     
    Timestamp: 32 bits
       The RTP timestamp is set to the sampling timestamp of the
       content.  If the NALU has no timing properties of its own (e.g.
       parameter set and SEI NAL units), the RTP timestamp is set to
       the RTP timestamp of the primary coded picture with which the
       NALU is associated according to section 7.4.1.2 of [1].  The
       setting of the RTP timestamp for MTAPs is defined in section
       6.3.2.
     
    Synchronization source (SSRC) identifier: 32 bits 
       Used according to RFC 1889. 
     
    Contributing source (CSRC) identifiers: 0 to 15 items, 32 bits each 
       Used according to RFC 1889. 
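
    A receiver might extract the header fields listed above along the
    following lines.  This is only a sketch: the names RtpHeader and
    parse_rtp_header are assumptions, and an RTP header extension
    (X bit set) and RTP padding are reported but not processed.

```python
import struct
from typing import NamedTuple, Tuple


class RtpHeader(NamedTuple):
    version: int
    padding: bool
    extension: bool
    marker: bool
    payload_type: int
    sequence_number: int
    timestamp: int
    ssrc: int
    csrc: Tuple[int, ...]


def parse_rtp_header(packet: bytes) -> Tuple[RtpHeader, int]:
    """Return the header and the offset of the first byte after the CSRC list."""
    if len(packet) < 12:
        raise ValueError("packet shorter than the fixed RTP header")
    # Fixed header: two flag octets, sequence number, timestamp, SSRC.
    b0, b1, seq, timestamp, ssrc = struct.unpack_from("!BBHII", packet, 0)
    version = b0 >> 6
    if version != 2:
        raise ValueError("unsupported RTP version")
    csrc_count = b0 & 0x0F
    end = 12 + 4 * csrc_count
    if len(packet) < end:
        raise ValueError("packet shorter than its CSRC list")
    csrc = struct.unpack_from("!%dI" % csrc_count, packet, 12)
    header = RtpHeader(
        version=version,
        padding=bool(b0 & 0x20),
        extension=bool(b0 & 0x10),
        marker=bool(b1 & 0x80),
        payload_type=b1 & 0x7F,
        sequence_number=seq,
        timestamp=timestamp,
        ssrc=ssrc,
        csrc=csrc,
    )
    return header, end


sample = (
    bytes([0x80, 0xE0, 0x12, 0x34])    # V=2, no padding/extension, M=1, PT=96
    + (90000).to_bytes(4, "big")       # timestamp
    + (0xDEADBEEF).to_bytes(4, "big")  # SSRC
    + b"\x78"                          # first payload octet
)
rtp_header, payload_offset = parse_rtp_header(sample)
```

    The payload type value 96 in the sample is an arbitrary dynamic
    value used for illustration; this memo does not assign one.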
     
     
 6.2.      Simple Packet 
     
    The RTP payload of a Simple Packet according to this specification 
    consists of one NALU as depicted in Figure xxxx.  The type of the 
    NALU MUST be specified in [1], i.e., the NALU MUST NOT be an 
    aggregation packet or a fragmentation unit.  A NAL unit stream 
    composed by decapsulating Simple Packets in RTP sequence number 
    order MUST conform to the NAL unit decoding order. 
     
    0                   1                   2                   3 
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    |                                                               | 
    |                              NALU                             | 
    |                                                               | 
    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    |                               :...OPTIONAL RTP padding        | 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
     
    Figure xxxx. RTP payload format for Simple Packet. 
     
  
 6.3.      Aggregation Packets 
     
    Aggregation packets are the packet aggregation scheme of this 
    payload specification.  The scheme is introduced to reflect the 
    dramatically different MTU sizes of two key target networks -- 
    wireline IP networks (with an MTU size that is often limited by the 
    Ethernet MTU size -- roughly 1500 bytes), and IP or non-IP (e.g. 
    H.324/M) based wireless networks with preferred transmission unit 
    sizes of 254 bytes or less.  In order to prevent media transcoding 
    between the two worlds, and to avoid undesirable packetization 
    overhead, a packet aggregation scheme is introduced. 
     
    Two types of aggregation packets are defined by this specification:

    o Single-Time Aggregation Packets (STAPs) aggregate NALUs with
      identical NALU-time.
    o Multi-Time Aggregation Packets (MTAPs) aggregate NALUs with
      potentially differing NALU-time.  Two different MTAPs are defined
      that differ in the length of the NALU timestamp offset.

    The term NALU-time is defined as the value the RTP timestamp would
    have if that NALU were transported in its own RTP packet.
     
    The structure of the RTP payload format for aggregation packets is 
    presented in Figure xxxx. 
     
    0                   1                   2                   3 
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    |F|NRI|  type   |                                               | 
    +-+-+-+-+-+-+-+-+                                               | 
    |                                                               | 
    |                        NALU payload                           | 
    |                                                               | 
    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    |                               :...OPTIONAL RTP padding        | 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
     
    Figure xxxx. RTP payload format for aggregation packets. 
     
    MTAPs and STAPs share the following packetization rules:  The RTP
    timestamp MUST be set to the minimum of the NALU-times of all the
    NALUs to be aggregated.  The Type field of the NALU type octet MUST
    be set to the appropriate value as indicated in table xxx.  The F
    bit MUST be cleared if all F bits of the aggregated NALUs are zero,
    otherwise it MUST be set.
     
    Table xxx: Type field for STAP and MTAPs

    Type   Packet    Timestamp offset field length (in bits)
    --------------------------------------------------------
    0x18   STAP      0
    0x19   MTAP16    16
    0x1A   MTAP24    24
     
    The Marker bit in the RTP header MUST be set to the value the 
    marker bit of the last NALU of the aggregated packet would have if 
    it were transported in its own RTP packet. 
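
    The shared rules for the RTP timestamp, the F bit, and the Marker
    bit can be sketched as follows.  The names PendingNalu,
    AggregateFields, and aggregate_fields are assumptions of this
    sketch.  It further assumes that the NALU-times within one packet
    do not wrap around the 32-bit timestamp range, and that the "last"
    NALU is the last one in order of appearance in the packet.

```python
from typing import NamedTuple, Sequence


class PendingNalu(NamedTuple):
    # Values the NALU would carry if it were sent in its own RTP packet.
    f_bit: int
    nalu_time: int
    marker: bool


class AggregateFields(NamedTuple):
    rtp_timestamp: int
    f_bit: int
    marker: bool


def aggregate_fields(nalus: Sequence[PendingNalu]) -> AggregateFields:
    """Derive the header fields of an aggregation packet from its NALUs."""
    if not nalus:
        raise ValueError("an aggregation packet carries at least one NALU")
    return AggregateFields(
        # Minimum of the NALU-times of all aggregated NALUs.
        rtp_timestamp=min(n.nalu_time for n in nalus),
        # Cleared only if every aggregated NALU has F equal to zero.
        f_bit=1 if any(n.f_bit for n in nalus) else 0,
        # Marker value of the last NALU in the packet.
        marker=nalus[-1].marker,
    )


fields = aggregate_fields([
    PendingNalu(f_bit=0, nalu_time=93000, marker=False),
    PendingNalu(f_bit=1, nalu_time=90000, marker=False),
    PendingNalu(f_bit=0, nalu_time=96000, marker=True),
])
```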
     
    The NALU payload of an aggregation packet consists of one or more
    aggregation units.  See sections 6.3.1 and 6.3.2 for the two
    different types of aggregation units.  An aggregation packet can
    carry as many aggregation units as necessary.  However, the total
    amount of data in an aggregation packet MUST fit into an IP packet,
    and the size SHOULD be chosen such that the resulting IP packet is
    smaller than the MTU size.  An aggregation packet MUST NOT contain
    fragmentation units specified in section 6.4.
  
    A NAL unit stream composed by decapsulating Aggregation Packets in 
    RTP sequence number order is NOT REQUIRED to conform to the NAL 
    unit decoding order.  Requirements on the NAL unit transmission 
    order are specified in section 7 and means to recover the NAL unit 
    decoding order are given in section 8.   
  
  
 6.3.1.        Single-Time Aggregation Packet 
  
    A Single-Time Aggregation Packet (STAP) SHOULD be used whenever
    aggregating NALUs that share the same NALU-time.  The NALU payload
    of an STAP consists of a 16-bit unsigned decoding order number 
    (DON) followed by at least one Single-Picture Aggregation Unit as 
    presented in Figure XXXX. 
     
    0                   1                   2                   3 
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
                    :  decoding order number (DON)  |               | 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               | 
    |                                                               | 
    |             single-picture aggregation units                  | 
    |                                                               | 
    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    |                               : 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
     
    Figure xxxx. NALU payload format for STAP. 
     
    DON indicates the NAL unit decoding order specified in this 
    document.  The DON of the first NALU in transmission order MAY be 
    set to any value.  Let DON of one NAL unit be D1 and DON of another 
    NAL unit be D2.  If D1 < D2 and D2 - D1 < 32768, or if D1 > D2 and 
    D1 - D2 >= 32768, then the NAL unit having DON equal to D1 precedes
    the NAL unit having DON equal to D2 in NAL unit decoding order.  If 
    D1 < D2 and D2 - D1 >= 32768, or if D1 > D2 and D1 - D2 < 32768, 
    then the NAL unit having DON equal to D2 precedes the NAL unit 
    having DON equal to D1 in NAL unit decoding order.  NAL units 
    associated with different primary coded pictures according to 
    subclause 7.4.1.2 of [1] MUST NOT have the same value of DON.  NAL 
    units associated with the same primary coded picture according to 
    subclause 7.4.1.2 of [1] MAY have the same value of DON.  If all 
    NAL units of a primary coded picture have the same value of DON, 
    NAL units of a redundant coded picture associated with the primary 
    coded picture SHOULD have the same value of DON as the NAL units of 
    the primary coded picture.  The NAL unit decoding order of NAL 
    units that have the same value of DON is the following: 
    1. Picture delimiter NAL unit, if any 
    2. Sequence parameter set NAL units, if any 
    3. Picture parameter set NAL units, if any 
    4. SEI NAL units, if any 
    5. Coded slice and slice data partition NAL units of the primary 
       coded picture, if any 
    6. Coded slice and slice data partition NAL units of the redundant 
       coded pictures, if any 
    7. Filler data NAL units, if any 
    8. End of sequence NAL unit, if any 
    9. End of stream NAL unit, if any 
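
    The wrap-around comparison of two DON values described above can be
    written out directly.  The function name don_precedes is an
    assumption of this sketch.  Equal DON values return False here,
    because their relative order is given by the list above rather than
    by the DON itself.

```python
def don_precedes(d1: int, d2: int) -> bool:
    """Return True if the NAL unit with DON d1 precedes the one with DON d2."""
    for value in (d1, d2):
        if not 0 <= value <= 0xFFFF:
            raise ValueError("DON is a 16-bit unsigned value")
    if d1 == d2:
        return False  # order among equal DONs is decided by NAL unit kind
    if d1 < d2:
        # Plain ascending case, unless the distance indicates a wrap-around.
        return d2 - d1 < 32768
    # d1 > d2: d1 precedes only if the counter wrapped between d1 and d2.
    return d1 - d2 >= 32768


wrapped = don_precedes(65535, 0)  # 65535 comes just before the wrap to 0
```

    With this rule the value 65535 precedes 0, so a receiver can follow
    the decoding order across the wrap of the 16-bit counter.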
      
    A Single-Picture Aggregation Unit consists of 16-bit unsigned size 
    information that indicates the size of the following NALU in bytes 
    (excluding these two octets, but including the NALU type octet of 
    the NALU), followed by the NALU itself including its NALU type  
    byte.  A Single-Picture Aggregation Unit is byte-aligned within the 
    RTP payload but it may not be aligned on a 32-bit word boundary.  
    Figure xxxx presents the structure of the Single-Picture 
    Aggregation Unit. 
     
    0                   1                   2                   3 
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
                    :           NALU size           |               | 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               | 
    |                                                               | 
    |                              NALU                             | 
    |                                                               | 
    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    |                               : 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
     
    Figure xxxx. Structure for single-picture aggregation unit. 
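
    Taken together, the STAP payload layout and the Single-Picture
    Aggregation Unit structure suggest a parser along the following
    lines.  This is a sketch under stated assumptions: the input is the
    NALU payload of the STAP (the bytes after the NALU type octet) with
    any RTP padding already removed, and the 16-bit fields are in
    network byte order.  The function name parse_stap_payload is
    hypothetical.

```python
import struct
from typing import List, Tuple


def parse_stap_payload(payload: bytes) -> Tuple[int, List[bytes]]:
    """Return the DON and the NALUs carried in an STAP payload."""
    if len(payload) < 2:
        raise ValueError("STAP payload too short for the DON field")
    (don,) = struct.unpack_from("!H", payload, 0)
    nalus: List[bytes] = []
    pos = 2
    while pos < len(payload):
        if pos + 2 > len(payload):
            raise ValueError("truncated NALU size field")
        (size,) = struct.unpack_from("!H", payload, pos)
        pos += 2
        # The size covers the NALU including its type octet, so it is >= 1.
        if size == 0 or pos + size > len(payload):
            raise ValueError("invalid NALU size")
        nalus.append(payload[pos:pos + size])
        pos += size
    if not nalus:
        raise ValueError("an STAP carries at least one aggregation unit")
    return don, nalus


example = (
    b"\x00\x07"                # DON = 7
    + b"\x00\x03\x21\xaa\xbb"  # size 3, then a 3-byte NALU
    + b"\x00\x02\x21\xcc"      # size 2, then a 2-byte NALU
)
stap_don, stap_nalus = parse_stap_payload(example)
```

    The NALU bytes in the example are arbitrary placeholders and do not
    represent valid coded data.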
     
     
 6.3.2.        Multi-Time Aggregation Packets (MTAPs) 
     
    The NALU payload of MTAPs consists of a 16-bit unsigned decoding
    order number base (DONB) and one or more Multi-Picture Aggregation
    Units as presented in Figure xxxx.  DONB MUST contain the smallest
    value of DON among the NAL units of the MTAP.  The choice between
    the different MTAP types is application dependent: the larger the
    timestamp offset field, the higher the flexibility of the MTAP, but
    also the higher the overhead.
     
    0                   1                   2                   3 
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
                    :  decoding order number base   |               | 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               | 
    |                                                               | 
    |              multi-picture aggregation units                  | 
    |                                                               | 
    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    |                               : 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
     
    Figure xxxx. NALU payload format for MTAPs. 
     
    Two different Multi-Time Aggregation Units are defined in this
    specification.  Both of them consist of a 16-bit unsigned size
    field for the following NALU, an 8-bit unsigned decoding order
    number delta (DOND), and n bits of timing information for this
    NALU, where n is 16 or 24.  The structures of the Multi-Time
    Aggregation Units for MTAP16 and MTAP24 are presented in figures
    XXXX and XXXX respectively.  Note that the starting or ending
    position of an aggregation unit within a packet is NOT REQUIRED to
    be on a 32-bit word boundary.  The DON of the following NALU is
    equal to DONB + DOND and MUST NOT be larger than 65535.  This memo
    does not specify how the NALUs within an MTAP are ordered, but, in
    most cases, NAL unit decoding order, i.e., ascending order of
    DONDs, SHOULD be used.  The timing information field MUST be set
    so that the RTP timestamp of an RTP packet of each NALU in the
    MTAP (the NALU-time) can be generated by adding the timing
    information to the RTP timestamp of the MTAP.
     
    0                   1                   2                   3 
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    :           NALU size           |      DOND     |  timing info  | 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    |  timing info  |                                               | 
    +-+-+-+-+-+-+-+-+              NALU                             | 
    |                                                               | 
    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    |                               : 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
     
    Figure xxxx: Multi-Time Aggregation Unit for MTAP16 
     
    0                   1                   2                   3 
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    :           NALU size           |      DOND     |  timing info  | 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    |         timing info           |                               | 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               | 
    |                             NALU                              | 
    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    |                               : 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
     
    Figure xxxx: Multi-Time Aggregation Unit for MTAP24 
     
    For the "earliest" multi-picture Aggregation Unit in an MTAP the 
    timing offset MUST be zero.  Hence, the RTP timestamp of the MTAP 
    itself is identical to the earliest NALU-time. 
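
    The MTAP layout described above can be illustrated with a short,
    non-normative Python sketch.  The function name, its arguments and
    the returned tuple layout are illustrative assumptions and not
    part of this specification.  The sketch also assumes that the
    input starts at the DONB field (the NALU type octet has already
    been removed) and that the NALU size field counts only the octets
    of the NALU itself.

```python
import struct


def parse_mtap_payload(payload: bytes, rtp_timestamp: int, ts_bits: int = 16):
    """Split an MTAP NALU payload into (DON, NALU-time, NALU) tuples.

    Assumptions (not stated by the draft): `payload` begins at DONB and
    the 16-bit size field covers only the NALU octets.
    `ts_bits` is 16 for MTAP16 and 24 for MTAP24.
    """
    if ts_bits not in (16, 24):
        raise ValueError("ts_bits must be 16 or 24")
    if len(payload) < 2:
        raise ValueError("payload too short for DONB")
    (donb,) = struct.unpack_from("!H", payload, 0)
    ts_len = ts_bits // 8
    pos = 2
    units = []
    while pos < len(payload):
        # Per-unit header: 16-bit size, 8-bit DOND, n-bit timing information.
        if pos + 3 + ts_len > len(payload):
            raise ValueError("truncated aggregation unit header")
        size, dond = struct.unpack_from("!HB", payload, pos)
        pos += 3
        ts_offset = int.from_bytes(payload[pos:pos + ts_len], "big")
        pos += ts_len
        if pos + size > len(payload):
            raise ValueError("truncated NALU")
        don = donb + dond
        if don > 65535:
            raise ValueError("DONB + DOND MUST NOT be larger than 65535")
        # NALU-time = RTP timestamp of the MTAP + timing information,
        # using 32-bit RTP timestamp arithmetic.
        nalu_time = (rtp_timestamp + ts_offset) & 0xFFFFFFFF
        units.append((don, nalu_time, payload[pos:pos + size]))
        pos += size
    return units
```

    Aggregation units are not aligned to 32-bit boundaries, so the
    sketch walks the payload octet by octet instead of word by word.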
     
     
 6.4.      Fragmentation Units 
     
    Fragmentation units (FUs) are the packet fragmentation scheme of
    this payload specification.  Among other things, the scheme is
    introduced to complement the aggregation unit scheme introduced in
    section 6.3 and to deliver pre-encoded packetized video over
    networks with limited MTU size.  An FU contains a fragment of
    exactly one NALU, which is referred to as the fragmented NALU.
    STAPs and MTAPs MUST NOT be fragmented.
     
    0                   1                   2                   3 
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    |NALU type octet|   FU header   |                               | 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               | 
    |                                                               | 
    |                         FU payload                            | 
    |                                                               | 
    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    |                               :...OPTIONAL RTP padding        | 
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
     
    Figure xxxx. RTP payload format for Fragmentation Unit. 
     
    The NALU type octet of a fragmentation unit is indicated by the 
    type definition 0x21 and has the following format: 
     
    +---------------+ 
    |0|1|2|3|4|5|6|7| 
    +-+-+-+-+-+-+-+-+ 
    |F|NRI|  0x21   | 
    +---------------+ 
  
    The NALU payload of a fragmentation unit consists of a one-octet
    fragmentation unit header and a fragmentation unit payload.  The FU
    header has the following format:
     
    +---------------+ 
    |0|1|2|3|4|5|6|7| 
    +-+-+-+-+-+-+-+-+ 
    |S|E|R|  Type   | 
    +---------------+ 
     
    S: 1 bit 
       The Start bit, when one, indicates the start of a fragmented 
       NALU. Otherwise, when the following payload is not the start of 
       an NALU payload, the Start bit is set to zero. 
        
    E: 1 bit 
       The End bit, when one, indicates the end of a fragmented NALU, 
       i.e. the last byte of the payload is also the last byte of the 
       fragmented NALU. Otherwise, when the following payload is not 
       the last FU of a fragmented NALU, the End bit is set to zero. 
        
    R: 1 bit 
       The Reserved bit MUST be 0. 
        
    Type: 5 bits 
       The NAL Unit payload type as defined in table 7.1 of [1]. 
  
    The FU payload consists of fragments of the payload of the 
    fragmented NALU such that if the fragmentation unit payloads of 
    consecutive FUs are sequentially concatenated, the payload of the 
    fragmented NALU is reconstructed.  Note that the NALU type octet of 
    the fragmented NALU is not included as such in the fragmentation 
    unit payload, but rather the information of the NALU type octet of 
    the fragmented NALU is conveyed in F and NRI fields of the NALU 
    type octet of the fragmentation unit and in the type field of the 
    FU header.  An FU payload can have any number of octets and can be
    empty.
  
    The following rules apply to fields in the RTP header and in the 
    NALU type octet of the RTP payload:  The RTP timestamp is set to 
    the NALU time of the fragmented NALU.  The F bit MUST be set 
    according to the F bit of the fragmented NALU.  The value of NRI 
    field MUST be set according to the value of the NRI field in the 
    fragmented NALU. 
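
    As a non-normative illustration of the FU format, the following
    Python sketch parses FU payloads and rebuilds the fragmented NALU.
    The function names are illustrative assumptions.  The sketch
    assumes that bit 0 in the diagrams above is the most significant
    bit of each octet, that the FUs are supplied complete and in
    transmission order, and it does not check the type value carried
    in the NALU type octet of the FU.

```python
def parse_fu(fu: bytes):
    """Return (F, NRI, start, end, type, fragment) for one FU payload."""
    if len(fu) < 2:
        raise ValueError("an FU needs a NALU type octet and an FU header")
    type_octet, header = fu[0], fu[1]
    f = type_octet >> 7              # F bit of the fragmented NALU
    nri = (type_octet >> 5) & 0x03   # NRI of the fragmented NALU
    start = bool(header & 0x80)      # S bit
    end = bool(header & 0x40)        # E bit
    if header & 0x20:                # R bit
        raise ValueError("the Reserved bit MUST be 0")
    nal_type = header & 0x1F         # type of the fragmented NALU
    return f, nri, start, end, nal_type, fu[2:]


def reassemble_nalu(fus):
    """Rebuild the fragmented NALU (type octet + payload) from its FUs."""
    if len(fus) < 2:
        raise ValueError("a fragmented NALU spans at least two FUs")
    parsed = [parse_fu(fu) for fu in fus]
    f, nri, _, _, nal_type, _ = parsed[0]
    last = len(parsed) - 1
    for i, (_, _, start, end, this_type, _) in enumerate(parsed):
        # S only on the first FU, E only on the last, same type throughout.
        if start != (i == 0) or end != (i == last) or this_type != nal_type:
            raise ValueError("inconsistent FU sequence")
    # The original NALU type octet is rebuilt from F, NRI and Type.
    type_octet = (f << 7) | (nri << 5) | nal_type
    return bytes([type_octet]) + b"".join(p[5] for p in parsed)
```

    Detecting a lost FU (for example from a gap in RTP sequence
    numbers) is left to the caller in this sketch.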
  
  
 7.    Packetization Rules 
  
    Two packetization modes are distinguished by whether NALUs
    belonging to more than one picture may be placed into a single
    aggregation packet (using STAPs or MTAPs).
     
     
 7.1.      Unrestricted Mode (Multiple Picture Model) 
  
    This mode MAY be supported by some receivers.  Usually, the 
    capability of a receiver to support this mode is implied by the 
    application, or indicated by external control protocol means.  The 
    use of this mode MUST be signaled with the optional aggregation-
    mode MIME or SDP parameter, if MIME or SDP signaling is in use.  
    The following packetization rules MUST be enforced by the sender: 
     
    o Single slice NALUs or Data Partition NALUs belonging to the same
      picture (and hence sharing the same RTP timestamp value) MAY be
      sent in any order permitted by the applicable profile defined in
      [1], although, for delay-critical systems, they SHOULD be sent in
      their original coding order to minimize the delay.  Note that the
      coding order is not necessarily the scan order, but the order in
      which the NAL packets become available to the RTP stack.
     
    o The transmission order of NALUs MUST conform to the NAL unit
      decoding order unless signaled otherwise with the optional
      num-reorder-VCL-NAL-units MIME parameter or by other means.  Some
      receivers might not support a transmission order that does not
      conform to the NAL unit decoding order.
     
    o Both MTAPs and STAPs MAY be used. 
     
    o FUs MAY be used. If an NALU is transmitted as a fragmented NALU, 
      the following rules apply. For the first FU of a fragmented NALU 
      the Start bit is set to one, the End bit is set to zero, and any 
      number of initial bytes of the fragmented NALU payload are 
      transported in this FU. Any number of additional FUs belonging to 
      this fragmented NALU may be transmitted with Start bit set to 
      zero and End bit set to zero. If the FU contains the last byte of 
      the fragmented NALU, the End bit is set to one. A fragmented NALU 
      MUST NOT be transmitted in one FU, i.e., Start bit and End bit 
      MUST NOT both be set to one in the same FU header.  A 
      Fragmentation Unit MUST NOT contain an aggregation packet.  
      Fragmentation units of a NALU MUST be sent in consecutive 
      packets. 
     
    o SEI packets MAY be sent anytime. 
     
    o Parameter set NALUs MUST NOT be sent in an RTP session whose 
      Parameter Sets were already changed by control protocol messages 
      during the lifetime of the RTP session.  If parameter set NALUs 
      are allowed by this condition, they MAY be sent at any time. 
     
    o An MTAP or a STAP MUST NOT contain an FU. 
     
    o An Aggregation Packet MUST succeed a Simple Packet in 
      transmission order if the NAL units in the Simple Packet precede 
      the NAL units in the Aggregation Packet in NAL unit decoding 
      order.  An Aggregation Packet MUST precede a Simple Packet in 
      transmission order if the NAL units in the Simple Packet succeed 
      the NAL units in the Aggregation Packet in NAL unit decoding 
      order.   
     
    o An Aggregation Packet MUST succeed a Fragmentation Unit in
      transmission order if the NAL unit conveyed in the Fragmentation
      Unit precedes the NAL units in the Aggregation Packet in NAL unit
      decoding order.  An Aggregation Packet MUST precede a
      Fragmentation Unit in transmission order if the NAL unit conveyed
      in the Fragmentation Unit succeeds the NAL units in the
      Aggregation Packet in NAL unit decoding order.
     
    o All NALU types MAY be mixed freely, provided that the above rules
      are obeyed.  In particular, it is allowed to mix slices in
      data-partitioned and single-slice mode.
     
    o Network elements MAY convert multiple RTP packets carrying 
      individual NALUs into one aggregated RTP packet, convert an 
      aggregated RTP packet into several RTP packets carrying 
      individual NALUs, or mix both concepts.  However, when doing so 
      they SHOULD take into account at least the following parameters: 
      path MTU size, unequal protection mechanisms (e.g. through 
      packet-based FEC according to RFC2733, carried by RFC2198,
      especially for parameter set NALUs and Type A Data Partitioning 
      NALUs), bearable latency of the system, and buffering 
      capabilities of the receiver. 
     
    o NALUs of all types except for FUs MAY be conveyed as aggregation 
      units of an STAP or MTAP rather than individual RTP packets.  
      Special care SHOULD be taken (particularly in gateways) to avoid 
      more than a single copy of identical NALUs in a single STAP/MTAP 
      in order to avoid unnecessary data transfers without any 
      improvements of QoS. 
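
    The FU rules in the list above (Start bit only on the first FU,
    End bit only on the last FU, never both in the same FU header) can
    be illustrated with the following non-normative Python sketch.
    The function name, the max_fragment parameter and the returned
    tuple layout are illustrative assumptions.  The sketch returns
    only the F and NRI values and the FU header octet; the type value
    of the NALU type octet of the FU is left to the caller.

```python
def fragment_nalu(nalu: bytes, max_fragment: int):
    """Split one NALU into a list of (F, NRI, FU header octet, FU payload).

    `nalu` is the complete NALU including its type octet; `max_fragment`
    is the largest FU payload the caller wants to send (hypothetical knob,
    e.g. derived from the path MTU).
    """
    if max_fragment < 1:
        raise ValueError("max_fragment must be at least 1")
    if not nalu:
        raise ValueError("empty NALU")
    type_octet, payload = nalu[0], nalu[1:]
    if len(payload) <= max_fragment:
        # One FU would need S=1 and E=1 together, which is forbidden;
        # such a NALU can be sent as a Simple Packet instead.
        raise ValueError("NALU fits in a single packet; do not fragment")
    f = type_octet >> 7
    nri = (type_octet >> 5) & 0x03
    nal_type = type_octet & 0x1F
    chunks = [payload[i:i + max_fragment]
              for i in range(0, len(payload), max_fragment)]
    fus = []
    for i, chunk in enumerate(chunks):
        start = 0x80 if i == 0 else 0x00               # S bit
        end = 0x40 if i == len(chunks) - 1 else 0x00   # E bit
        fus.append((f, nri, start | end | nal_type, chunk))
    return fus
```

    The resulting FUs are intended to be sent in consecutive RTP
    packets, in list order.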
     
     
 7.2.      Restricted Mode (Single Picture Model) 
     
    This mode MUST be supported by all receivers.  It is primarily 
    intended for low delay applications.  Its main difference from the 
    Unrestricted Mode is to forbid the packetization of data belonging 
    to more than one picture in a single RTP packet.  Hence, MTAPs MUST 
    NOT be used.  The following packetization rules MUST be enforced by 
    the sender: 
     
    o All rules of the Unrestricted Mode above apply, with the
      following addition:

    o Only STAPs MAY be used; MTAPs MUST NOT be used.  This implies
      that aggregated packets MUST NOT include slices or data
      partitions belonging to different pictures.
     
     
 8.    De-Packetization Process 
  
    The de-packetization process is implementation dependent.  Hence, 
    the following description should be seen as an example of a 
    suitable implementation.  Other schemes MAY be used as well.  
    Optimizations relative to the described algorithms are likely to be
    possible.
     
    The general concept behind these de-packetization rules is to 
    reorder NALUs from transmission order to the NAL unit decoding 
    order.  All fragmentation units of a NALU are collected and the 
    resulting NALU is processed as if it were received as a Simple 
    Packet.  Aggregation packets are handled by unloading their payload 
    into individual RTP packets carrying NALUs.  Those NALUs are 
    processed as if they were received in separate RTP packets, in the 
    order they were arranged in the Aggregation Packet. 
     
    Hereinafter, let N be the value of the optional num-reorder-VCL-
    NAL-units MIME type parameter (see section 9.1).  When the RTP 
    session is initialized, the receiver buffers at least N VCL NAL 
    units before passing any packet to the decoder. 
     
    For each NAL unit stored in the buffer, the RTP sequence number of 
    the packet that contained the NAL unit is stored and associated 
    with the stored NAL unit.  Moreover, the packet type (Simple Packet 
    or Aggregation Packet) that contained the NAL unit is stored and 
    associated with each stored NAL unit.  Furthermore, for NAL units 
    carried in aggregation packets, decoding order number (DON) is 
    calculated and stored. 
     
    If the receiver buffer contains at least N VCL NAL units, NAL units 
    are removed from the receiver buffer and passed to the decoder in 
    the order specified below until the buffer contains N-1 VCL NAL 
    units.   
     
    Hereinafter, let PDON be the DON of the previous NAL unit of an 
    aggregation packet in NAL unit decoding order.  If no previous NAL 
    unit of an aggregation packet in NAL unit decoding order exists, 
    PDON is 0.   
     
    The order that NAL units are passed to the decoder is specified as 
    follows: 
     
    o If the oldest RTP sequence number in the buffer corresponds to a 
      Simple Packet, the NALU in the Simple Packet is the next NALU in 
      the NAL unit decoding order. 
     
    o If the oldest RTP sequence number in the buffer corresponds to an 
      Aggregation Packet, the NAL unit decoding order is recovered 
      among the NALUs conveyed in Aggregation Packets in RTP sequence 
      number order until the next Simple Packet or FU (exclusive).  
      This set of NALUs is hereinafter referred to as the candidate 
      NALUs.  If no NALUs conveyed in Simple Packets or FUs reside in 
      the buffer, all NALUs belong to candidate NALUs. 
     
    o For each NAL unit among the candidate NALUs, a DON distance is 
      calculated as follows.  If the DON of the NAL unit is larger than 
      PDON, the DON distance is equal to DON - PDON.  Otherwise, the 
      DON distance is equal to 65535 - PDON + DON + 1.  NAL units are 
      delivered to the decoder in ascending order of DON distance.  
     
    o If several NAL units share the same DON distance, the order to 
      pass them to the decoder is the following:  
     
        1. Picture delimiter NAL unit, if any 
        2. Sequence parameter set NAL units, if any 
        3. Picture parameter set NAL units, if any 
        4. SEI NAL units, if any 
        5. Coded slice and slice data partition NAL units of  
           the primary coded picture, if any 
        6. Coded slice and slice data partition NAL units of  
           the redundant coded pictures, if any 
        7. Filler data NAL units, if any 
        8. End of sequence NAL unit, if any 
        9. End of stream NAL unit, if any 
     
    o If the video decoder in use does not support Arbitrary Slice
      Ordering, slices and A data partitions are passed to the decoder
      in ascending order of the first_mb_in_slice syntax element in
      the slice header.  Moreover, B and C data partitions immediately
      follow the corresponding A data partition in decoding order.
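
    The DON distance rule in the list above can be sketched in Python
    as follows.  This is a non-normative illustration; the function
    names and the (DON, NALU) tuple layout are assumptions.  The sketch
    covers only the ordering by DON distance; candidates with equal
    DON distance keep their input order, and the tie-breaking rules by
    NAL unit type are not implemented.

```python
def don_distance(don: int, pdon: int) -> int:
    """DON distance of a NAL unit relative to the previous DON (PDON)."""
    if don > pdon:
        return don - pdon
    # DON has wrapped around the 16-bit range (or equals PDON).
    return 65535 - pdon + don + 1


def order_candidates(candidates, pdon: int = 0):
    """Sort (DON, NALU) candidates in ascending order of DON distance.

    sorted() is stable, so candidates with the same DON distance stay in
    the order in which they were supplied.
    """
    return sorted(candidates, key=lambda c: don_distance(c[0], pdon))
```

    For example, with PDON equal to 65534, candidates with DON values
    1, 65535 and 0 have DON distances 3, 1 and 2, and are therefore
    delivered in the order 65535, 0, 1.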
     
    The following additional de-packetization rules MAY be used to 
    implement an operational JVT de-packetizer: 
     
    o Intelligent RTP receivers (e.g. in Gateways) MAY identify lost  
      DPAs. If a lost DPA is found, the Gateway MAY decide not to send 
      the DPB and DPC partitions, as their information is meaningless 
      for the JVT Decoder.  In this way a network element can reduce 
      network load by discarding useless packets, without parsing a 
      complex bit stream.
     
    o Intelligent receivers MAY discard all packets that have a NAL 
      Reference Idc of 0.  However, they SHOULD process those packets 
      if possible, because the user experience may suffer if the 
      packets are discarded. 
       
    o If a Fragmentation Unit is lost, all Fragmentation Units 
      corresponding to the same NALU SHOULD be discarded. 
     
     
 9.    Payload Format Parameters 
  
    This section specifies the parameters that MAY be used to select 
    optional features of the payload format.  The parameters are 
    specified here as part of the MIME subtype registration for the 
    ITU-T H.264 | ISO/IEC 14496-10 codec.  A mapping of the parameters 
    into the Session Description Protocol (SDP) [4] is also provided 
    for those applications that use SDP.  Equivalent parameters could 
    be defined elsewhere for use with control protocols that do not use 
    MIME or SDP. 
     
     
 9.1.      MIME Registration 
     
    The MIME subtype for the ITU-T H.264 | ISO/IEC 14496-10 codec is 
    allocated from the IETF tree.   
     
    Any unspecified parameter MUST be ignored by the receiver. 
     
    Media Type name:     video 
     
    Media subtype name:  H264 
     
    Required parameters: none 
     
    Optional parameters: 
       profile-level-id: A profile-level element used in specifying the 
                         value of this parameter is generated by 
                         forming a string of hexadecimal 
                         representations of the following two bytes in 
                         the sequence parameter set NAL unit specified 
                         in [1]: 1) profile_idc and 2) level_idc. The 
                         value of profile-level-id is a sequence of 
                         profile-level elements. If this parameter is 
                         used for indicating properties of a NAL unit 
                         stream, it indicates the profiles that are in 
                         use in the stream and the highest level that 
                         is in use for each signaled profile.  If this 
                         parameter is used for capability exchange or 
                         session setup procedure, it indicates the 
                         profiles that the codec supports and the 
                         highest level that is supported for each 
                         signaled profile.  For example, if a codec 
                         supports the Baseline Profile at level 3 and 
                         below and the Main Profile at level 2.1 and 
                         below, the profile-level-id becomes 421E4D15.  
                         If no profile-level-id is present, the 
                         Baseline Profile at Level 1 MUST be implied. 
        
       profile-interoperability: This parameter MAY be used to signal 
                         the properties of a NAL unit stream.  It MUST 
                         NOT be used to signal the capabilities of a 
                         codec implementation.  The parameter indicates 
                         which ones of the coding tools that are 
                         included in the Baseline Profile but are not 
                         included in the Main Profile are in use in the 
                         NAL unit stream.  The value of the parameter 
                         is a 3-character string of "1"s and "0"s 
                         indicating the values of 
                         more_than_one_slice_group_allowed_flag, 
                         arbitrary_slice_order_allowed_flag, and 
                         redundant_pictures_allowed_flag (respectively) 
                         of the sequence parameter set NAL units that 
                         are in use in the NAL unit stream.  If the 
                         value of any one of the flags in any of the 
                         sequence parameter sets used in the NAL unit 
                         stream changes, a value of "1" MUST be 
                         indicated in the value of the corresponding 
                         flag in the profile-interoperability 
                         parameter.  If no profile-interoperability is 
                         present, its value is undefined. 
        
       parameter-sets:   This parameter MAY be used to convey such 
                         parameter set NAL units, herein referred to as 
                         the initial parameter set NAL units, that MUST 
                         precede any other NAL units in decoding  
                         order.  The parameter MUST NOT be used to 
                         indicate codec capability in any capability 
                         exchange procedure.  The value of the 
                         parameter is the hexadecimal representation of 
                         the initial parameter set NAL units as 
                         specified in sections 7.3.2.1 and 7.3.2.2 of 
                         [1].  The parameter sets are conveyed in 
                         decoding order and no framing of the parameter 
                         set NAL units takes place.  Note that the 
                         number of bytes in a parameter set NAL unit is 
                         typically less than 10 bytes, but a picture 
                         parameter set NALU can contain even several 
                         hundreds of bytes. 
        
       num-reorder-VCL-NAL-units: This parameter MAY be used to signal 
                         the properties of a NAL unit stream or the 
                         capabilities of a transmitter or receiver 
                         implementation.  The parameter specifies the 
                         maximum number of VCL NAL units that precede
                         any VCL NAL unit in the NAL unit stream in NAL 
                         unit decoding order and follow the VCL NAL 
                         unit in RTP sequence number order or in the 
                         composition order of the aggregation packet 
                         containing the VCL NAL unit.  If the parameter 
                         is not present, num-reorder-VCL-NAL-units 
                         equal to 0 MUST be implied.  The value of num-
                         reorder-VCL-NAL-units MUST be an integer in 
                         the range from 0 to 32767, inclusive. 
  
       aggregation-mode: Permissible values are 0 and 1.  If 0 or not 
                         present, STAPs MAY be present and MTAPs MUST 
                         NOT be present in the NAL unit stream. If 1, 
                         both STAPs and MTAPs MAY be present in the NAL 
                         unit stream. 
  
    Encoding considerations: 
                         This type is defined for transfer via RTP (RFC 
                         1889). 
     
    Security considerations: 
                         See section 10 of RFC XXXX. 
     
    Public specification: 
                         Please refer to RFC XXXX and its section 15. 
     
    Additional information: 
                         None 
     
    File extensions:     none 
    Macintosh file type code: none 
    Object identifier or OID: none 
     
    Person & email address to contact for further information: 
                         stewe@cs.tu-berlin.de 
     
    Intended usage:      COMMON. 
     
    Author/Change controller: 
                         stewe@cs.tu-berlin.de 
                         IETF Audio/Video transport working group 
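
    The construction of the profile-level-id value described in the
    registration above can be illustrated with the following
    non-normative Python sketch.  The function names are illustrative.
    The numeric values used in the usage note (profile_idc 66 for the
    Baseline Profile and 77 for the Main Profile, level_idc equal to
    ten times the level number) are assumptions taken from [1]; they
    are consistent with the 421E4D15 example given for the parameter.

```python
def encode_profile_level_id(elements):
    """Build a profile-level-id string from (profile_idc, level_idc) pairs.

    Each pair becomes one profile-level element: two octets written as
    four hexadecimal digits.
    """
    out = []
    for profile_idc, level_idc in elements:
        if not (0 <= profile_idc <= 255 and 0 <= level_idc <= 255):
            raise ValueError("profile_idc and level_idc are single octets")
        out.append(f"{profile_idc:02X}{level_idc:02X}")
    return "".join(out)


def decode_profile_level_id(value: str):
    """Split a profile-level-id string into (profile_idc, level_idc) pairs."""
    raw = bytes.fromhex(value)
    if len(raw) % 2 != 0:
        raise ValueError("each profile-level element is exactly two octets")
    return [(raw[i], raw[i + 1]) for i in range(0, len(raw), 2)]
```

    Under the assumptions above, Baseline at level 3 and Main at
    level 2.1 are written as the pairs (66, 30) and (77, 21), which
    encode to the string 421E4D15.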
     
     
 9.2.      SDP Parameters 
     
    The MIME media type video/H264 string is mapped to fields in the 
    Session Description Protocol (SDP) [4] as follows: 
     
    o The media name in the "m=" line of SDP MUST be video. 
     
    o The encoding name in the "a=rtpmap" line of SDP MUST be H264 (the 
      MIME subtype). 
     
    o The "a=fmtp" line of SDP MUST contain the optional parameters
      "profile-level-id", "profile-interoperability", "parameter-sets",
      "num-reorder-VCL-NAL-units", and "aggregation-mode", if any, to
      indicate the coder capability and configuration, respectively.
      These parameters are expressed as a MIME media type string, in
      the form of a semicolon-separated list of parameter=value
      pairs.
     
    An example of media representation in SDP is as follows (Baseline 
    Profile, Level 3.0, more than one slice group, arbitrary slice 
    ordering, and redundant slices are in use): 
     
    m=video 49170/2 RTP/AVP 98 
    a=rtpmap:98 H264/90000 
    a=fmtp:98 profile-level-id=421E;profile-interoperability=111 
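
    A receiver could read such an "a=fmtp" line along the lines of the
    following non-normative Python sketch.  The function name and the
    DEFAULTS table are illustrative assumptions; the two default
    values are the ones section 9.1 implies when the corresponding
    parameter is absent.  Unknown parameters are kept in the result so
    that the caller can ignore them.

```python
# Values implied by section 9.1 when a parameter is not present.
DEFAULTS = {
    "num-reorder-VCL-NAL-units": "0",
    "aggregation-mode": "0",
}


def parse_fmtp(line: str):
    """Parse 'a=fmtp:<pt> name=value;name=value' into (pt, parameters)."""
    prefix = "a=fmtp:"
    if not line.startswith(prefix):
        raise ValueError("not an a=fmtp line")
    pt_text, _, param_text = line[len(prefix):].partition(" ")
    params = dict(DEFAULTS)
    for item in param_text.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing semicolon
        name, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"malformed parameter: {item!r}")
        params[name.strip()] = value.strip()
    return int(pt_text), params
```

    Applied to the example above, the sketch yields payload type 98,
    profile-level-id 421E, profile-interoperability 111, and the
    implied value 0 for aggregation-mode.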
     
     
 10.     Security Considerations 
  
    So far, no security considerations beyond those of RFC1889 have 
    been identified. 
     
    Currently, the JVT CD does not allow carrying any type of active 
    payload.  However, the inclusion of a "user data" mechanism is 
    under consideration, which could potentially be used for mechanisms 
    such as remote software updates of the video decoder and similar 
    tasks.  
     
     
 11.     Informative Appendix: Application Examples 
  
    This payload specification is very flexible in its use, to cover 
    the extremely wide application space that is anticipated for the 
    JVT codec.  However, such a great flexibility also makes it 
    difficult for an implementer to decide on a reasonable 
    packetization scheme.  Some information on how to apply this
    specification to real-world scenarios is likely to appear in the
    form of academic publications and a Test Model in the near future.
    However, some preliminary usage scenarios are described here as
    well.
     
     
 11.1.     Video Telephony, no Data Partitioning, no packet aggregation 
  
    The RTP part of this scheme is implemented and tested (though not 
    the control-protocol part, see below). 
     
    In most real-world video telephony applications, the picture 
    parameters such as picture size or optional modes never change 
    during the lifetime of a connection.  Hence, all necessary 
    Parameter Sets (usually only one) are sent as a side effect of the 
    capability exchange/announcement process e.g. according to the SDP 
    syntax specified in section 9.2 of this document.  Since all 
    necessary Parameter Set information is established before the RTP 
    session starts, there is no need for sending any parameter set 
    NALUs.  Data Partitioning is not used either.  Hence, the RTP 
    packet stream consists basically of NALUs that carry single slices 
    of video information. 
     
    The size of those single-slice NALUs is chosen by the encoder such 
    that they offer the best performance.  Often, this is done by 
    adapting the coded slice size to the MTU size of the IP network.  
    For small picture sizes this may result in a one-picture-per-one-
    packet strategy.  The loss of packets and the resulting drift-
    related artifacts are cleaned up by Intra refresh algorithms. 
     
     
 11.2.       Video Telephony, Interleaved Packetization using Packet 
 Aggregation 
  
    This scheme allows better error concealment and is widely used in
    H.263-based designs using RFC2429 packetization.  It has also been
    implemented, and good results were reported [8].
     
    The source picture is coded by the VCL such that all MBs of one MB 
    line are assigned to one slice.  All slices with even MB row 
    addresses are combined into one STAP, and all slices with odd MB 
    row addresses into another STAP.  Those STAPs are transmitted as 
    RTP packets.  The establishment of the Parameter Sets is performed 
    as discussed above. 
     
    Note that the use of STAPs is essential here, because the high 
    number of individual slices (18 for a CIF picture) would lead to 
    unacceptably high IP/UDP/RTP header overhead (unless the source 
    coding tool FMO is used, which is not assumed in this scenario).  
    Furthermore, some wireless video transmission systems, such as 
    H.324M and the IP-based video telephony specified in 3GPP, are 
    likely to use relatively small transport packet size.  For example, 
    a typical MTU size of H.223 AL3 SDU is around 100 bytes [11].  
    Coding individual slices according to this packetization scheme 
    provides a further advantage in communication between wired and 
    wireless networks, as individual slices are likely to be smaller 
    than the preferred maximum packet size of wireless systems.  
    Consequently, a gateway can convert the STAPs used in a wired 
    network to several RTP packets with only one NALU that are 
    preferred in a wireless network and vice versa.  
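
    The even/odd interleaving described in this section can be
    sketched in Python as follows.  This is a non-normative
    illustration: the function names are assumptions, each aggregation
    unit is written as a 16-bit NALU size followed by the NALU, and
    the NALU type octet of the STAP and the RTP header are omitted.

```python
import struct


def aggregate(nalus):
    """Concatenate NALUs as aggregation units (16-bit size + NALU)."""
    out = bytearray()
    for nalu in nalus:
        if len(nalu) > 0xFFFF:
            raise ValueError("NALU too large for a 16-bit size field")
        out += struct.pack("!H", len(nalu))
        out += nalu
    return bytes(out)


def interleave_rows(row_slices):
    """Return (even_units, odd_units) for a list of per-MB-row slice NALUs.

    row_slices[i] is assumed to be the slice NALU covering MB row i.
    """
    return aggregate(row_slices[0::2]), aggregate(row_slices[1::2])
```

    For a CIF picture with 18 MB rows, this yields two aggregates of
    nine slices each, so the loss of one packet still leaves every
    other MB row available for concealment.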
     
     
 11.3.       Video Telephony, with Data Partitioning 
  
    This scheme is implemented and was shown to offer good performance,
    especially at higher packet loss rates [8].

    Data Partitioning is known to be useful only when some form of
    unequal error protection is available.  Normally, in single-session
    RTP environments, even error characteristics are assumed, i.e.,
    statistically, the packet loss probability of all packets of the
    session is the same.  However, there are means to reduce the packet
    loss probability of individual packets in an RTP session.  RFC 2198
    [12], for example, allows carrying a redundant copy of an essential
    packet in the next RTP packet.  Packet-based Forward Error
    Correction [13] carried in RFC2198 is also an appropriate means to
    protect high priority information.
     
    In all cases, the incurred overhead is substantial, but of the same 
    order of magnitude as the number of bits that would otherwise have 
    to be spent on intra information.  However, this mechanism does not 
    add any delay to the system.   
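
    The redundancy idea described above can be sketched in Python as 
    follows.  This is only an illustration of piggybacking a copy of an 
    essential unit onto the next packet; it does not reproduce the RFC 
    2198 wire format.  The names Unit, Packet, add_redundancy, and 
    received_units are hypothetical, and treating a unit as 
    "essential" (for example, a high-priority data partition) is an 
    assumption of this sketch. 

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Unit:
    seq: int
    essential: bool  # assumption: marks high-priority data


@dataclass
class Packet:
    primary: Unit
    redundant: Optional[Unit] = None  # copy of the previous essential unit


def add_redundancy(units: List[Unit]) -> List[Packet]:
    """Attach a copy of each essential unit to the following packet."""
    packets = []
    previous = None
    for unit in units:
        copy = previous if previous is not None and previous.essential else None
        packets.append(Packet(primary=unit, redundant=copy))
        previous = unit
    return packets


def received_units(packets: List[Packet], lost: Set[int]) -> Set[int]:
    """Return the sequence numbers recoverable when the packets at the
    given indices are lost."""
    got = set()
    for index, packet in enumerate(packets):
        if index in lost:
            continue
        got.add(packet.primary.seq)
        if packet.redundant is not None:
            got.add(packet.redundant.seq)
    return got
```

    In this sketch, an essential unit is lost only if both its own 
    packet and the following packet are lost, whereas a non-essential 
    unit is lost whenever its own packet is lost. 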
     
    Again, the complete Parameter Set establishment is performed 
    through control protocol means. 
     
     
 11.4.       Low-Bit-Rate Streaming 
  
    This scheme has been implemented with H.263 and gave good results 
    [14].  There is no technical reason why similarly good results 
    could not be achieved using the JVT codec.  
     
    In today's Internet streaming, some of the offered bit-rates are 
    relatively low in order to allow terminals with dial-up modems to 
    access the content.  In wired IP networks, relatively large 
    packets, say 500 - 1500 bytes, are preferred to smaller and more 
    frequently occurring packets in order to reduce network  
    congestion.  Moreover, use of large packets decreases the amount of 
    RTP/UDP/IP header overhead.  For low-bit-rate video, the use of 
    large packets means that sometimes up to a few pictures should be 
    encapsulated in one packet.  
     
    However, loss of such a packet would have drastic consequences for 
    visual quality, as there is practically no other way to conceal the 
    loss of an entire picture than to repeat the previous one.  One way 
    to construct relatively large packets and maintain possibilities 
    for successful loss concealment is to construct MTAPs that contain 
    slices from several pictures in an interleaved manner.  An MTAP 
    should not contain spatially adjacent slices from the same picture 
    or spatially overlapping slices from any picture.  If a packet is 
    lost, it is likely that a lost slice is surrounded by spatially 
    adjacent slices of the same picture and spatially corresponding 
    slices of the temporally previous and succeeding pictures. 
    Consequently, concealment of the lost slice is likely to succeed 
    relatively well. 
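
    One possible interleaving pattern that satisfies the constraints 
    above is sketched below in Python.  It is an illustration only, not 
    the scheme evaluated in [14].  The sketch assumes that every 
    picture is divided into the same number of slices, that slices with 
    the same index in different pictures cover the same spatial area, 
    and that the number of pictures does not exceed the number of 
    slices per picture.  The function name interleave_slices is 
    hypothetical. 

```python
def interleave_slices(num_pictures, slices_per_picture):
    """Distribute slices over packets in a diagonal pattern.

    Returns a list of packets; each packet is a list of
    (picture_index, slice_index) pairs.  Every packet carries exactly
    one slice from each picture, and no two slices in a packet share
    the same slice index.
    """
    if not 1 <= num_pictures <= slices_per_picture:
        raise ValueError("need 1 <= num_pictures <= slices_per_picture")

    # Offset between the slice indices taken from consecutive pictures.
    stride = slices_per_picture // num_pictures

    packets = []
    for j in range(slices_per_picture):
        packet = [
            (p, (j + p * stride) % slices_per_picture)
            for p in range(num_pictures)
        ]
        packets.append(packet)
    return packets
```

    Because a packet holds only one slice per picture, it cannot hold 
    spatially adjacent slices of the same picture, and because the 
    slice indices within a packet are distinct, it cannot hold 
    spatially overlapping slices of different pictures. 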
     
     
 11.5.       Robust Packet Scheduling in Video Streaming 
     
    This scheme has been implemented with MPEG-4 Part 2 and simulated 
    in a wireless streaming environment [15].  There is no technical 
    reason why similar or better results could not be achieved using 
    the JVT codec. 
     
    Streaming clients typically have a receiver buffer that is capable 
    of storing a relatively large amount of data.  Initially, when a 
    streaming session is established, a client does not start playing 
    the stream back immediately, but rather it typically buffers the 
    incoming data for a few seconds.  This buffering helps to maintain 
    continuous playback, because, in case of occasional increased 
    transmission delays or network throughput drops, the client can 
    decode and play buffered data.  Otherwise, without initial 
    buffering, the client has to freeze the display, stop decoding, and 
    wait for incoming data.  The buffering is also necessary for either 
    automatic or selective retransmission at any protocol level.  If 
    any part of a picture is lost, a retransmission mechanism may be 
    used to resend the lost data.  If the retransmitted data is 
    received before its scheduled decoding or playback time, the loss 
    is perfectly recovered.  Coded pictures can be ranked according to 
    their importance in the subjective quality of the decoded  
    sequence.  For example, non-reference pictures, such as 
    conventional B pictures, are subjectively least important, because 
    their absence does not affect decoding of any other pictures.  In 
    addition to non-reference pictures, the ITU-T H.264 | ISO/IEC 
    14496-10 standard includes a temporal scalability method called 
    sub-sequences [16].  Subjective ranking can also be made on data 
    partition or slice group basis.  Coded slices and data partitions 
    that are subjectively the most important can be sent earlier than 
    their decoding order indicates, whereas coded slices and data 
    partitions that are subjectively the least important can be sent 
    later than their decoding order indicates.  Consequently, any 
    retransmitted parts of the most important slices and data 
    partitions are more likely to be received before their scheduled 
    decoding or playback time than those of the least important slices 
    and data partitions. 
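
    The reordering described above can be sketched in Python as 
    follows.  This is a simplified illustration and not the algorithm 
    of [15].  The CodedUnit type, the numeric priority scale (0 is the 
    most important), and the per-priority lead times are assumptions 
    introduced for this example. 

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CodedUnit:
    name: str
    decode_time: float  # seconds, relative to the start of the stream
    priority: int       # assumption: 0 = most important


def transmission_order(units: List[CodedUnit],
                       lead: Dict[int, float]) -> List[CodedUnit]:
    """Order units by target send time.

    lead maps a priority to the number of seconds by which units of
    that priority are sent ahead of their decoding time.  Only the
    resulting order matters here; negative target times simply mean
    that the unit is sent during initial buffering.
    """
    def send_time(unit: CodedUnit) -> float:
        return unit.decode_time - lead[unit.priority]

    # sorted() is stable, so units with equal target send times keep
    # their decoding order.
    return sorted(units, key=send_time)
```

    A larger lead time for important units leaves more room for a 
    retransmission to arrive before the scheduled decoding time. 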
     
     
 12.     Open Issues 
     
    There may be an issue when using this I-D to transport interlaced 
    content.  It seems that the draft has a problem when one picture 
    has more than one timestamp.  The authors will try to come to a 
    conclusion during the Pattaya meeting of JVT (in the week before 
    the San Francisco IETF), and report in the AVT session whether a 
    problem exists and, if time permits, present a possible solution. 
     
     
 13.     Full Copyright Statement 
     
    Copyright (C) The Internet Society (2002). All Rights Reserved. 
     
    This document and translations of it may be copied and furnished to 
    others, and derivative works that comment on or otherwise explain 
    it or assist in its implementation may be prepared, copied, 
    published and distributed, in whole or in part, without restriction 
    of any kind, provided that the above copyright notice and this 
    paragraph are included on all such copies and derivative works. 
     
    However, this document itself may not be modified in any way, such 
    as by removing the copyright notice or references to the Internet 
    Society or other Internet organizations, except as needed for the 
    purpose of developing Internet standards in which case the 
    procedures for copyrights defined in the Internet Standards process 
    must be followed, or as required to translate it into languages 
    other than English. 
     
    The limited permissions granted above are perpetual and will not be 
    revoked by the Internet Society or its successors or assigns. 
     
    This document and the information contained herein is provided on 
    an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET 
    ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR 
    IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF 
    THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED 
    WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. 
     
     
 14.     Intellectual Property Notice 
     
    The IETF has been notified of intellectual property rights claimed 
    in regard to some or all of the specification contained in this 
    document.  For more information consult the online list of claimed 
    rights at http://www.ietf.org/ipr. 
     
     
 15.     References 
     
 15.1.       Normative References 
  
    [1]  "Study of Final Committee Draft of Joint Video Specification 
          (ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC)", available from 
          ftp://ftp.imtc-files.org/jvt-experts/2002_12_Awaji/JVT-F100.zip, 
          February 2003. 
    [2]  S. Bradner, "Key words for use in RFCs to Indicate Requirement 
          Levels", BCP 14, RFC 2119, March 1997. 
    [3]  H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson, 
          "RTP: A Transport Protocol for Real-Time Applications", RFC 
          1889, January 1996. 
    [4]  M. Handley and V. Jacobson, "SDP: Session Description 
          Protocol", RFC 2327, April 1998. 
  
 15.2.       Informative References 
  
    [5]  P. Borgwardt, "Handling Interlaced Video in H.26L", 
          VCEG-N57r2, available from 
          ftp://standard.pictel.com/video-site/0109_San/VCEG-N57r2.doc, 
          September 2001. 
    [6]  C. Bormann et al., "RTP Payload Format for the 1998 Version 
          of ITU-T Rec. H.263 Video (H.263+)", RFC 2429, October 1998. 
    [7]  ISO/IEC IS 14496-2. 
    [8]  S. Wenger, "H.26L over IP", IEEE Transactions on Circuits and 
          Systems for Video Technology, to appear (April 2002). 
    [9]  S. Wenger, "H.26L over IP: The IP Network Adaptation Layer", 
          Proceedings Packet Video Workshop 02, April 2002, to appear. 
    [10] T. Stockhammer, M.M. Hannuksela, and S. Wenger, "H.26L/JVT 
          Coding Network Abstraction Layer and IP-based Transport" in 
          Proc. ICIP 2002, Rochester, NY, September 2002. 
    [11] ITU-T Recommendation H.223 (1999). 
    [12] C. Perkins et al., "RTP Payload for Redundant Audio Data", 
          RFC 2198, September 1997. 
    [13] J. Rosenberg, H. Schulzrinne, "An RTP Payload Format for 
          Generic Forward Error Correction", RFC 2733, December 1999. 
    [14] V. Varsa, M. Karczewicz, "Slice interleaving in compressed 
          video packetization", Packet Video Workshop 2000. 
    [15] S.H. Kang and A. Zakhor, "Packet scheduling algorithm for 
          wireless video streaming", International Packet Video 
          Workshop 2002, available from http://www.pv2002.org. 
    [16] M.M. Hannuksela, "Enhanced concept of GOP", JVT-B042, 
          available from 
          ftp://standard.pictel.com/video-site/0201_Gen/JVT-B042.doc, 
          January 2002. 
     
     
    Authors' Addresses 
     
    Stephan Wenger                    Phone: +49-172-300-0813 
    TU Berlin / Teles AG              Email: stewe@cs.tu-berlin.de 
    Franklinstr. 28-29 
    D-10587 Berlin 
    Germany 
     
    Thomas Stockhammer                Phone: +49-89-28923474 
    Institute for Communications Eng. Email: stockhammer@ei.tum.de 
    Munich University of Technology 
    D-80290 Munich 
    Germany 
     
    Miska M. Hannuksela               Phone: +358 40 5212845 
    Nokia Corporation                 Email: miska.hannuksela@nokia.com 
    P.O. Box 68 
    33721 Tampere 
    Finland   
     