Internet Engineering Task Force                    Avaro-France Telecom 
Internet Draft                                               Basso-AT&T 
                                                   Casner-Packet Design 
                                                          Civanlar-AT&T 
                                                        Gentric-Philips 
                                                         Herpel-Thomson 
                                                      Lifshitz-Optibase 
                                                            Lim-mp4cast 
                                                            Perkins-ISI 
                                                   van der Meer-Philips 
                                                              July 2001 
                                                      Expires Jan. 2002 
Document: draft-ietf-avt-mpeg4-multisl-01.txt                           
 
 
                 RTP Payload Format for MPEG-4 Streams 
 
    
Status of this Memo 
    
   This document is an Internet-Draft and is in full conformance with 
   all provisions of Section 10 of RFC2026. 
    
   Internet-Drafts are working documents of the Internet Engineering 
   Task Force (IETF), its areas, and its working groups. Note that 
   other groups may also distribute working documents as Internet-
   Drafts. Internet-Drafts are draft documents valid for a maximum of 
   six months and may be updated, replaced, or obsoleted by other 
   documents at any time. It is inappropriate to use Internet-Drafts 
   as reference material or to cite them other than as "work in 
   progress." 
    
   This specification is a product of the Audio/Video Transport working 
   group within the Internet Engineering Task Force and ISO/IEC MPEG-4 
   ad hoc group on MPEG-4 over Internet. Comments are solicited and 
   should be addressed to the working group's mailing list at 
   avt@ietf.org and/or the authors. 
    
   The list of current Internet-Drafts can be accessed at 
   http://www.ietf.org/ietf/1id-abstracts.txt 
   The list of Internet-Draft Shadow Directories can be accessed at 
   http://www.ietf.org/shadow.html. 
    
   This document contains a MIME type registration form that is 
   intended to be taken as-is and therefore makes reference to this 
   document, using the temporary placeholder: <self-reference-to-this>. 
    
Abstract 
    
   This document describes a payload format for transporting MPEG-4 
   encoded data using RTP. MPEG-4 is a recent standard from ISO/IEC for 
   the coding of natural and synthetic audio-visual data. Several 
   services provided by RTP are beneficial for MPEG-4 encoded data 

  
Gentric et al.           Expires January 2002                        1 

                RTP Payload Format for MPEG-4 Streams       July 2001 
 
 
   transport over the Internet. Additionally, the use of RTP makes it 
   possible to synchronize MPEG-4 data with other real-time data types. 
    
1. Introduction 
    
   MPEG-4 is a recent standard from ISO/IEC for the coding of natural 
   and synthetic audio-visual data in the form of audiovisual objects 
   that are arranged into an audiovisual scene by means of a scene 
   description [1][2][3][4]. This draft specifies an RTP [5] payload 
   format for transporting MPEG-4 encoded data streams. 
    
   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 
   "SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and "OPTIONAL" in 
   this document are to be interpreted as described in RFC 2119 [6]. 
    
   The benefits of using RTP for MPEG-4 data stream transport include: 
    
   i. Ability to synchronize MPEG-4 streams with other RTP payloads 
    
   ii. Monitoring MPEG-4 delivery performance through RTCP 
    
   iii. Combining MPEG-4 and other real-time data streams received from 
   multiple end-systems into a set of consolidated streams through RTP 
   mixers 
    
   iv. Converting data types, etc. through the use of RTP translators. 
    
1.1 Overview of MPEG-4 End-System Architecture 
    
   Fig. 1 below shows the layered architecture of a terminal which 
   implements the complete MPEG-4 systems model. The Compression Layer 
   processes individual audio-visual media streams. The MPEG-4 
   compression schemes are defined in the ISO/IEC specifications 14496-
   2 [2] and 14496-3 [3]. The compression schemes in MPEG-4 achieve 
   efficient encoding over a bandwidth ranging from several kbps to 
   many Mbps. The audio-visual content compressed by this layer is 
   organized into Elementary Streams (ESs). 
   The MPEG-4 standard specifies MPEG-4 compliant streams. Within the 
   constraint of this compliance the compression layer is unaware of a 
   specific delivery technology, but it can be made to react to the 
   characteristics of a particular delivery layer such as the path-MTU 
   or loss characteristics. Also, some compressors can be designed to 
   be delivery specific for implementation efficiency. In such cases 
   the compressor may work in a non-optimal fashion with delivery 
   technologies other than the one it was specifically designed to 
   operate with. 
    
   The hierarchical relations, location and properties of ESs in a 
   presentation are described by a dynamic set of Object Descriptors 
   (ODs). Each OD groups one or more ES Descriptors referring to a 
   single content item (audio-visual object). Hence, multiple 
   alternative or hierarchical representations of each content item are 
   possible. 
  
   ODs are themselves conveyed through one or more ESs. A complete set 
   of ODs can be seen as an MPEG-4 resource or session description at a 
   stream level. The resource description may itself be hierarchical, 
   i.e. an ES conveying an OD may describe other ESs conveying other 
   ODs. 
    
   The session description is accompanied by a dynamic scene 
   description, Binary Format for Scene (BIFS), again conveyed through 
   one or more ESs. At this level, content is identified in terms of 
   audio-visual objects. The spatio-temporal location of each object is 
   defined by BIFS. The audio-visual content of those objects that are 
   synthetic and static is also described by BIFS. Natural and 
   animated synthetic objects may refer to an OD that points to one or 
   more ESs that carry the coded representation of the object or its 
   animation data. 
    
   By conveying the session (or resource) description as well as the 
   scene (or content composition) description through their own ESs, it 
   becomes possible to change portions of the content composition and 
   the number and properties of the media streams that carry the 
   audio-visual content separately and dynamically at well-known 
   instants in time. 
    
   One or more initial Scene Description streams and the corresponding 
   OD stream are pointed to by an initial object descriptor (IOD). In 
   this context the IOD needs to be made available to the receivers 
   through some out-of-band means that are out of scope of this payload 
   specification. However in the context of transport on IP networks it 
   is defined in a separate document [9]. Note that for applications 
   that only use audio and/or video this payload format can also be 
   used without IOD and OD streams (decoder configuration is then 
   transported as MIME parameters, see section 4.1). 
    
   The Compression Layer organizes the ESs in Access Units (AUs), the 
   smallest elements that can be attributed individual timestamps. The 
   Access Unit concept defines the boundary between media-specific 
   processing and delivery-specific processing. That is to say, 
   transport should not depend on the nature of the media data but only 
   on AU properties. 
    
   The Sync Layer (SL), which primarily provides synchronization 
   between streams, defines a homogeneous encapsulation of ESs carrying 
   media or control data (ODs, BIFS). Integer or fractional AUs are 
   encapsulated in SL packets. In the following, this payload format is 
   described as transporting SL packets, although in many cases SL 
   packet payloads are actually entire Access Units, i.e. encoded media 
   frames. All consecutive data from one stream is called an 
   SL-packetized stream at this layer. The interface between the 
   compression layer and the SL is called the Elementary Stream 
   Interface (ESI). The ESI is informative, i.e. it is extremely useful 
   for defining concepts and mechanisms but does not have to be 
   implemented. For the same reason this draft describes 
   the transport of SL packets i.e. Access Units or fragments thereof. 
   It is important to note however that an SL stream can be configured 
   so that SL packets are reduced to the (compressed) media data; in 
   that case implementations do not need to be aware of the SL at all. 
    
   The Delivery Layer in MPEG-4 consists of the Delivery Multimedia 
   Integration Framework defined in ISO/IEC 14496-6 [4]. This layer is 
   media unaware but delivery technology aware. It provides transparent 
   access to and delivery of content irrespective of the technologies 
   used.  The interface between the SL and DMIF is called the DMIF 
   Application Interface (DAI). It offers content location independent 
   procedures for establishing MPEG-4 sessions and access to transport 
   channels. The specification of this payload format is considered as 
   a part of the MPEG-4 Delivery Layer. 
    
   media aware        +-----------------------------------------+ 
   delivery unaware   |           COMPRESSION LAYER             | 
   14496-2 Visual     |streams from as low as Kbps to multi-Mbps| 
   14496-3 Audio      +-----------------------------------------+ 
    
                                                      Elementary 
                                                      Stream 
   ===================================================Interface 
                                                      (ESI) 
                     +-------------------------------------------+ 
   media and         |              SYNC LAYER                   | 
   delivery unaware  | manages elementary streams, their synch-  | 
   14496-1 Systems   | ronization and hierarchical relations     | 
                     +-------------------------------------------+ 
                  
                                                       DMIF 
                                                       Application 
   ====================================================Interface 
                                                       (DAI) 
                     +-------------------------------------------+ 
   delivery aware    |               DELIVERY LAYER              | 
   media  unaware    |provides transparent access to and delivery| 
   14496-6 DMIF      | of content irrespective of delivery       | 
                     |                technology                 | 
                     +-------------------------------------------+ 
    
   Figure 1: Conceptual MPEG-4 terminal architecture 
    
    
1.2 MPEG-4 Elementary Stream Data Packetization 
    
   The ESs from the encoders are fed into the SL with indications of AU 
   boundaries, random access points, desired composition time and the 
   current time. 
    

   The Sync Layer fragments the ESs into SL packets, each containing a 
   header that encodes information conveyed through the ESI. If the AU 
   is larger than an SL packet, subsequent packets containing the 
   remaining parts of the AU are generated with subset headers until 
   the complete AU is packetized. 
    
   The syntax of the Sync Layer is configurable and can be adapted to 
   the needs of the stream to be transported. This includes the 
   possibility to select the presence or absence of individual syntax 
   elements as well as configuration of their length in bits. The 
   configuration for each individual stream is conveyed in a 
   SLConfigDescriptor, which is an integral part of the ES Descriptor 
   for this stream. The MPEG-4 SLConfigDescriptor, being configuration 
   information, is not carried by the media stream itself but is rather 
   transported via an ObjectDescriptor Stream encoded using the MPEG-4 
   Object Description framework. This can be done in a separate stream 
   using this payload format (see section 5.2 for details). The 
   SLConfigDescriptor MAY also be transported by other means (for 
   example as a parameter, see section 4.1). Finally, streams for which 
   the SL packet headers are completely empty (or fully map into the 
   RTP headers) can also be transported using this payload format; in 
   these cases the Sync Layer can be seen as a purely conceptual 
   construction that does not have to be implemented at all. Since only 
   the decoder configuration is then needed, it MAY also be transported 
   as a parameter, as described in section 4.1. 
    
    
2. Analysis of the carriage of MPEG-4 over IP 
    
   When transporting MPEG-4 audio and video, applications may or may 
   not require the use of MPEG-4 systems. To achieve the highest level 
   of interoperability between all MPEG-4 applications, it is desirable 
   that (a) in both cases the same MPEG-4 transport format can be used 
   and that (b) receivers that have no MPEG-4 system knowledge can 
   easily skip the MPEG-4 system specific information, if any. 
    
   RTP is perfectly suitable to transport MPEG-4 audio and MPEG-4 
   video, but when using MPEG-4 systems a problem arises from the fact 
   that both RTP and MPEG-4 systems contain a synchronization layer. 
   In particular, the RTP header duplicates some of the information 
   provided in SL packet headers such as the composition timestamps 
   (CTSs) and the marker bit that signals the end of access units. 
    
   To avoid unnecessary overhead and potential interoperability risks 
   when transporting MPEG-4 systems, it is desirable to remove the 
   redundancy between the SL packet header and the RTP packet header. 
   To be independent of the use of MPEG-4 systems, synchronization can 
   rely on the parameters provided in the RTP header. 
    
   In case SL headers are used, the redundant fields are removed from 
   the SL header, producing "reduced SL headers". 
   The remaining information from the SL header, if any, is contained 
   inside the RTP packet payload, together with the SL packet payload. 
  
   The combination of RTP packet headers and reduced SL packet headers 
   can be used to logically map the RTP packets to complete SL packets. 
    
   Some of the information contained in the reduced SL headers is also 
   useful for transport over RTP when MPEG-4 systems is not used. 
    
   For that reason the information in the "reduced" SL headers is split 
   into "general useful information" and "MPEG-4 systems only 
   information". 
    
   The "general useful information" hereinafter called Mapped SL Packet 
   Header (MSLH) is carried by a number of fields configurable using 
   parameters defined in section 4.1; all receivers MUST parse these 
   fields. 
    
   The "MPEG-4 systems only information", if any, is contained in a 
   reduced SL header, hereinafter called Remaining SL Packet Header 
   (RSLH), also configured using parameters (see section 4.1) and 
   preceded by a length field, so that non-MPEG-4-system devices MAY 
   skip this information. 
    
   This is depicted in figure 2. 
    
    
                            <----------SL Packet--------> 
    
                            +---------------------------+ 
                            |   SL Packet   | SL Packet | 
                            |    Header     | Payload   | 
                            +---------------------------+ 
                                  |                | 
                                  |                | 
         +-------------+----------+---+            |    
         |             |              |            | 
         V             V              V            V  
   +-----------+ +-----------+ +-------------+ +-----------+ 
   |RTP Packet | | Mapped SL | | Remaining SL| | SL Packet | 
   |  Header   | |  Header   | |    Header   | | Payload   | 
   +-----------+ +-----------+ +-------------+ +-----------+ 
    
                 <----RTP Packet Payload-------------------> 
    
    
   Figure 2: Mapping of SL Packet into RTP packet 
    
   When the configuration is such that SL packet headers map directly 
   to RTP headers this process of mapping SL packet headers is purely 
   conceptual. For example this RTP payload format has been designed so 
   that it is by default configured to be identical to RFC 3016 for the 
   recommended MPEG-4 video configurations (see section 5.5). Hence 
   receivers that comply with this payload specification can decode 
   such RTP payloads without knowledge of the Sync Layer (see the 
   example in Appendix.1). In a similar fashion MPEG-4 audio (see 
   Appendix for examples) can be transported without explicit use of 
   the Sync Layer. 
    
3. Payload Format 
    
   The RTP Payload corresponds to an integer number of SL packets. 
    
   If multiple SL packets are transported in each RTP packet, they MUST 
   be in decoding order, i.e.: 
   i)   decodingTimeStamp order, if present 
   ii)  packetSequenceNumber order, if present 
   iii) Implicit decoding order in all other cases. 
    
   The SL Packet Headers are transformed into RSLH with some fields 
   extracted to be mapped in the RTP header and others extracted to be 
   mapped in the corresponding MSLH. The SL Packet Payload is 
   unchanged. 
    
   This payload format has two modes. In the Single-SL mode a single 
   SL packet is transported per RTP packet. In the Multiple-SL mode 
   more than one SL packet may be transported per RTP packet. The 
   default mode is the Single-SL mode. The mode can be set to 
   Multiple-SL by adding a non-zero ConstantSize or SizeLength 
   parameter (see section 4.1). 
    
   RTP packets SHOULD be sent in the SL stream order (as defined 
   above). In case of interleaving, the first SL packet of each RTP 
   packet is used as the reference, as in the following examples of RTP 
   packets containing interleaved SL packets: 
   This sequence is correct: [0,2,4][1,3,5] 
   This sequence is correct: [0,3,6][1,2][4,5] 
   This sequence is correct: [0,3,6][1,4][2,5] 
   This sequence is prohibited: [0,4,2][1,5,3] 
   This sequence is prohibited: [1,3,5][0,2,4] 
   This sequence is prohibited: [0,3,6][2,5][1,4] 
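    
   The ordering rules illustrated by these sequences can be sketched as 
   a small check. This is only an illustration, not part of the payload 
   format: the function name is hypothetical, each RTP packet is 
   modeled as a list of SL packet indices, and wrap-around of the 
   index space is ignored. 

```python
def is_valid_interleaving(rtp_packets):
    """Check a sequence of RTP packets, each a list of SL packet indices.

    Assumption: indices are plain integers that grow in decoding order,
    with no wrap-around of the sequence number space.
    """
    previous_first = None
    for sl_indices in rtp_packets:
        if not sl_indices:
            return False  # an RTP packet carries at least one SL packet
        # Inside one RTP packet, SL packets must be in decoding order.
        if any(a >= b for a, b in zip(sl_indices, sl_indices[1:])):
            return False
        # Across RTP packets, the first SL packet is the reference.
        if previous_first is not None and sl_indices[0] <= previous_first:
            return False
        previous_first = sl_indices[0]
    return True
```

   Applied to the six sequences above, this check accepts the first 
   three and rejects the last three. 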
    
   The size (or number) of the SL packet(s) SHOULD be adjusted such 
   that the resulting RTP packet is not larger than the path-MTU. To 
   handle larger packets, this payload format relies on lower layers 
   for fragmentation, which may not be desirable. 
    
3.1 RTP Header Fields Usage 
    
   Payload Type (PT): The assignment of an RTP payload type for this 
   new packet format is outside the scope of this document, and will 
   not be specified here. It is expected that the RTP profile for a 
   particular class of applications will assign a payload type for this 
   encoding, or if that is not done then a payload type in the dynamic 
   range shall be chosen. 
    
   Marker (M) bit: The M bit is set to 1 when all SL packets in the RTP 
   packet are Access Unit ends, i.e. the M bit maps to the Sync Layer 
   accessUnitEndFlag. 
  
   Specifically the M bit is set to 0 when the RTP packet contains one 
   or more Access Unit fragments that are not Access Unit ends, and the 
   M bit is set to 1 for RTP packets that contain either: 
   . A single complete Access Unit 
   . The last fragment of an Access Unit 
   . Several complete Access Units 
   . Several last fragments of Access Units 
   . A mix of complete Access Units and last fragments of Access Units 
    
   Therefore for streams where all SL packets are complete Access Units 
   the M bit is 1 for all RTP packets. 
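    
   A minimal sketch of the marker bit rule follows. The function name 
   and the input representation (one accessUnitEndFlag value per SL 
   packet carried in the RTP packet) are assumptions made for this 
   illustration. 

```python
def marker_bit(au_end_flags):
    """Return the RTP M bit for one RTP packet.

    au_end_flags holds one boolean per SL packet in the RTP packet:
    True if that SL packet ends an Access Unit (a complete Access Unit
    or the last fragment of one), False otherwise.
    """
    if not au_end_flags:
        raise ValueError("an RTP packet carries at least one SL packet")
    # M is 1 only when every SL packet in the packet ends an Access Unit.
    return 1 if all(au_end_flags) else 0
```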
    
   Extension (X) bit: Defined by the RTP profile used. 
    
   Sequence Number: The RTP sequence number should be generated by the 
   sender with a constant random offset and does not have to be 
   correlated to any (optional) MPEG-4 SL sequence numbers. 
    
   Timestamp: Set to the value in the compositionTimeStamp field of the 
   first SL packet in the RTP packet, if present. If 
   compositionTimeStamp is shorter than 32 bits, the MSBs of the 
   timestamp MUST be set to zero. 
    
   Although it is available from the SL configuration data, the 
   resolution of the timestamp may need to be conveyed explicitly 
   through some out-of-band means to be used by network elements that 
   are not MPEG-4 aware. 
    
   If compositionTimeStamp is longer than 32 bits, this payload 
   format cannot be used. 
    
   In all cases, the sender SHALL always make sure that RTP time stamps 
   are identical only for RTP packets transporting fragments of the 
   same Access Unit. 
    
   If compositionTimeStamp is not present in the current SL packet but 
   was present in a previous SL packet, the current SL packet is a 
   further fragment of the same Access Unit; therefore the same 
   timestamp value MUST be used as the RTP timestamp. 
    
   If compositionTimeStamp is never present in SL packets for this 
   stream, the RTP packetizer SHOULD convey a reading of a local clock 
   at the time the RTP packet is created. 
    
   According to RFC 1889 [5, Section 5.1] timestamps are recommended to 
   start at a random value for security reasons. With such an offset, 
   however, a receiver is in general not able to reconstruct the 
   original MPEG-4 time stamps (CTS, DTS, OCR), which can be of use for 
   applications where streams from multiple sources are to be 
   synchronized. Therefore the usage of such a random offset SHOULD be 
   avoided. 
    
  
   Note that since RTP devices may re-stamp the stream, all time stamps 
   inside the RTP payload (CTS and DTS in MSLH, OCR in RSLH) MUST be 
   expressed as a difference from the RTP time stamp. Since this 
   subtraction may lead to negative values, the offset MUST be encoded 
   as a two's complement signed integer in network byte order. Note 
   that these offsets (deltas) typically require far fewer bits to 
   encode than the original time stamps, which is another justification. 
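    
   The offset encoding can be illustrated as follows. This is a sketch 
   under stated assumptions: the function names are hypothetical, the 
   field value is handled as an unsigned integer of length_bits bits 
   (to be written most significant bit first), and wrap-around of the 
   32-bit RTP time stamp is ignored. 

```python
def encode_delta(timestamp, rtp_timestamp, length_bits):
    """Encode (timestamp - rtp_timestamp) as a two's complement field."""
    delta = timestamp - rtp_timestamp
    low = -(1 << (length_bits - 1))
    high = (1 << (length_bits - 1)) - 1
    if not low <= delta <= high:
        raise ValueError("delta does not fit in the configured length")
    # Masking maps a negative delta to its two's complement bit pattern.
    return delta & ((1 << length_bits) - 1)


def decode_delta(field, rtp_timestamp, length_bits):
    """Recover the original time stamp from a two's complement field."""
    if field & (1 << (length_bits - 1)):  # sign bit set: negative delta
        field -= 1 << length_bits
    return rtp_timestamp + field
```

   For example, with an 8-bit field and an RTP time stamp of 1000, a 
   time stamp of 900 is carried as the field value 156 and decoded 
   back to 900. 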
    
   When startCompositionTimeStamp is signaled in the SLConfigDescriptor 
   the RTP time stamps MUST start with this value. 
 
   SSRC, CC and CSRC fields are used as described in RFC 1889 [5]. 
    
   RTCP SHOULD be used as defined in RFC 1889 [5]. 
    
   RTP timestamps in RTCP SR packets: according to the RTP timing 
   model, the RTP timestamp that is carried in an RTCP SR packet is 
   the same as the compositionTimeStamp that would be applied to an RTP 
   packet for data that was sampled at the instant the SR packet is 
   being generated and sent. The RTP timestamp value is calculated from 
   the NTP timestamp for the current time, which also goes in the RTCP 
   SR packet. To perform that calculation, an implementation needs to 
   periodically establish a correspondence between the CTS value of a 
   data packet and the NTP time at which that data was sampled. 
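    
   One possible way to perform this calculation is a linear mapping 
   from a stored reference pair. The sketch below is illustrative 
   only: the function and parameter names are hypothetical, NTP time 
   is simplified to seconds as a floating point number, and 
   clock_rate stands for the time stamp resolution in ticks per 
   second. 

```python
def rtp_timestamp_for_sr(ntp_now, ref_ntp, ref_cts, clock_rate):
    """Map the current NTP time (in seconds) to an RTP timestamp.

    ref_ntp and ref_cts form a previously established correspondence:
    the NTP time at which some data was sampled and the CTS value of
    that data. clock_rate is the number of timestamp ticks per second.
    """
    elapsed_ticks = round((ntp_now - ref_ntp) * clock_rate)
    # The RTP timestamp is a 32-bit value, so wrap modulo 2**32.
    return (ref_cts + elapsed_ticks) % (1 << 32)
```

   An implementation would refresh the reference pair periodically so 
   that clock drift does not accumulate in the mapping. 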
 
3.2 RTP payload structure 
    
   The packet payload structure consists of 3 byte-aligned sections. 
    
   The first section is the MSLHSection and contains Mapped SL Packet 
   Headers (MSLH). The MSLH structure is described in 3.3. In the 
   Single-SL mode this section is empty by default. 
    
   The second section is the RSLHSection and contains Remaining SL 
   Headers (RSLH). The RSLH structure is described in 3.5. By default 
   this section is empty. 
    
   The last section (SLPPSection) contains the SL packet payloads. This 
   section is never empty. 
    
   The Nth MSLH in the MSLHSection, the Nth RSLH in the RSLHSection and 
   the Nth SL packet payload in the SLPPSection correspond to the Nth 
   SL packet transported by the RTP packet. 
    
    
   0                   1                   2                   3 
   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
   |V=2|P|X|  CC   |M|     PT      |       sequence number         | 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
   |                           timestamp                           | 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
   |           synchronization source (SSRC) identifier            | 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
   :            contributing source (CSRC) identifiers             : 
   +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ 
   |                                                               | 
   |                MSLHSection (byte aligned)                     | 
   |                                                               | 
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
   |                               |                               | 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               | 
   |                                                               | 
   |                RSLHSection (byte aligned)                     | 
   |                                                               | 
   |               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
   |               |                                               | 
   +-+-+-+-+-+-+-+-+                                               | 
   |                                                               | 
   |                SLPPSection (byte aligned)                     | 
   |                                                               | 
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
   |                               :...OPTIONAL RTP padding        | 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    
    
   Figure 3: An RTP packet for MPEG-4 
 
3.3 MSLHSection structure 
    
   If the MSLHSection consumes a non-integer number of bytes, up to 7 
   zero-valued padding bits MUST be inserted at the end in order to 
   achieve byte-alignment. 
    
   In the Single-SL mode the MSLHSection consists of a single MSLH.  
         
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
   | MSLH (x bits )  : padding bits| 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    
   Figure 4: MSLHSection structure in Single-SL mode 
    
   In the Multiple-SL mode this section consists of a 2-byte field 
   giving the size in bits (in network byte order) of the block of 
   bit-wise concatenated MSLHs, followed by that block. 
    
   This size field is absent in the Single-SL mode not because it is 
   not needed (which would be a minor gain) but for compatibility with 
   RFC 3016. 
    
   This size field is also absent when its value would always be zero 
   because the MSLH is always empty, which may happen when a constant 
   size is signaled using ConstantSize. 
    
    
   0                   1                   2                   3 
   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
   | MSLH section size in bits     | MSLH        |         etc     |         
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                 | 
   | as many bit-wise concatenated MSLHs                           | 
   | as SL packets in this RTP packet                              | 
   |                                 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
   |                                 : padding bits| 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    
   Figure 5: MSLHSection structure in Multiple-SL mode 
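    
   As an illustration of Figure 5, a receiver could locate the end of 
   the MSLHSection as sketched below. The function name is 
   hypothetical, and the sketch assumes that the size field is present 
   and that it counts only the concatenated MSLH bits, i.e. neither 
   the size field itself nor the padding bits. 

```python
def split_mslh_section(payload):
    """Split an RTP payload into (mslh_bits, rest) in Multiple-SL mode.

    mslh_bits is a string of '0'/'1' characters holding the bit-wise
    concatenated MSLHs; rest is the payload after the MSLHSection.
    """
    if len(payload) < 2:
        raise ValueError("payload too short for the MSLH size field")
    size_bits = int.from_bytes(payload[:2], "big")  # network byte order
    size_bytes = (size_bits + 7) // 8  # rounds up over 0..7 padding bits
    end = 2 + size_bytes
    if len(payload) < end:
        raise ValueError("payload shorter than the announced MSLH block")
    block = payload[2:end]
    bits = "".join(f"{byte:08b}" for byte in block)[:size_bits]
    return bits, payload[end:]
```

   The remaining bytes start with the RSLHSection, if any, followed by 
   the SLPPSection. 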
 
3.4 MSLH structure 
    
   The Mapped SL Packet Header content depends on parameters (as 
   described in section 4.1); by default it is empty for the Single-SL 
   mode and, except when ConstantSize is signaled, contains at least 
   the PayloadSize field in the Multiple-SL mode. 
    
   When all options are used the MSLH structure is given in figure 6. 
    
   +============================+ 
   |PayloadSize                 | 
   +----------------------------+ 
   |Index or IndexDelta         | 
   +----------------------------+ 
   |CTSFlag                     | 
   +----------------------------+ 
   |CTSDelta                    | 
   +----------------------------+ 
   |DTSFlag                     | 
   +----------------------------+ 
   |DTSDelta                    | 
   +============================+ 
    
   Figure 6: Mapped SL Packet Header (MSLH) structure 
    
   In the general case a receiver can only discover the size of an MSLH 
   by parsing it, since for example the presence of CTSDelta is 
   signaled by the value of CTSFlag. 
 
3.4.1 Fields of MSLH 
    
   PayloadSize: Indicates the size in bytes of the associated SL Packet 
   Payload, which can be found in the SLPPSection of the RTP packet. 
   The length in bits of this field is signaled by the SizeLength 
   parameter (see section 4.1). 
    
   There is one exception: when the RTP packet contains a single SL 
   packet, the PayloadSize field SHALL contain the size of the entire 
   corresponding Access Unit, for two reasons. Firstly, the size of 
   the fragment is not needed when there is only one fragment. 
   Secondly, this is useful in order to detect that a full Access Unit 
   has been received after the loss of a packet carrying the M bit set 
   to 1. 
    
   Index, IndexDelta: Encodes the packetSequenceNumber (serial number) 
   of the SL Packet. When making streams specifically for transport 
   with this payload format IndexDelta is useful for interleaving (see 
   section 3.8). Since a mapping of packetSequenceNumber to RTP 
   sequence number is not possible in the Multiple-SL mode there is no 
   requirement for a correspondence. 
    
   Index is optional and, if present, appears for the first SL packet 
   in an RTP packet. 
    
   The length in bits of the Index field is defined by the IndexLength 
   parameter (see section 4.1). 
    
   IndexDelta is optional and, if present, appears for subsequent 
   (non-first) SL packets in an RTP packet. 
    
   The length in bits of the IndexDelta field is defined by the 
   IndexDeltaLength parameter (see section 4.1). 
    
   If the parameter IndexDeltaLength is defined, non-first SL packets
   inside an RTP packet have their packetSequenceNumber encoded as a
   difference (thus the name IndexDelta). This difference is relative
   to the previous SL packet in the RTP packet according to (with
   i>=0):

      packetSequenceNumber(0)   = Index(0)
      packetSequenceNumber(i+1) = packetSequenceNumber(i)
                                  + IndexDelta(i+1) + 1

   If the parameter IndexDeltaLength is not defined, its default value
   is zero and the IndexDelta field is not present for non-first SL
   packets. Receivers SHALL nevertheless apply the above formula with
   IndexDelta equal to zero. In other words, by default
   packetSequenceNumber is incremented by 1 for each SL packet in one
   RTP packet.
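
   The recurrence above can be sketched in a few lines of Python. This
   is only an illustration: the function name and argument layout are
   assumptions made for the example, and wrap-around of
   packetSequenceNumber is not modeled.

```python
def packet_sequence_numbers(index, index_deltas):
    """Return packetSequenceNumber for every SL packet in one RTP packet.

    index        -- value of the Index field of the first MSLH
    index_deltas -- IndexDelta values of the following MSLHs, in order;
                    pass 0 for each of them when IndexDeltaLength is
                    not defined (hypothetical calling convention)
    """
    numbers = [index]
    for delta in index_deltas:
        # packetSequenceNumber(i+1) =
        #     packetSequenceNumber(i) + IndexDelta(i+1) + 1
        numbers.append(numbers[-1] + delta + 1)
    return numbers
```

   With all IndexDelta values equal to zero the numbers are
   consecutive; non-zero values leave gaps that SL packets carried in
   other RTP packets fill in when interleaving is used.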
 
   CTSFlag (1 bit): Indicates whether the CTSDelta field is present. A 
   value of 1 indicates that the CTSDelta field is present, a value of 
   0 that it is not present. 
    
   If CTSDeltaLength is not zero, CTSFlag is present in all MSLH 
   regardless of whether the SL packet is an Access Unit start or not. 
    
   CTSDelta (CTSDeltaLength bits): Specifies the value of the CTS as a
   two's-complement offset (delta) from the timestamp in the RTP header
   of the RTP packet. The length in bits of each CTSDelta field is
   specified by the CTSDeltaLength parameter (see section 4.1).
    
   The CTSDelta field is present if CTSFlag is 1. 
    

   For the first MSLH of each RTP packet CTSFlag is always 0, since the 
   composition time stamp of the first SL packet in the RTP packet is 
   mapped to the RTP time stamp. In all cases the sender MUST remove 
   the compositionTimeStamp from the RSLH. 
 
   DTSFlag (1 bit): Indicates whether the DTSDelta field is present. A 
   value of 1 indicates that DTSDelta is present, a value of 0 that it 
   is not present. 
    
   If DTSDeltaLength is not zero, DTSFlag is present in all MSLH 
   regardless of whether the SL packet is an Access Unit start or not; 
   the receiver needs this flag in order to reconstruct the 
   decodingTimeStampFlag of SL Headers. 
    
   DTSDelta (DTSDeltaLength bits): Encodes (compositionTimeStamp -
   decodingTimeStamp) for the same SL packet (always positive). The
   length in bits of each DTSDelta field is specified by the
   DTSDeltaLength parameter (see section 4.1).
    
   The DTSDelta field appears when DTSFlag is 1. The sender MUST always 
   remove the decodingTimeStamp from the RSLH. 
    
   If DTSDelta is zero i.e. if decodingTimeStamp equals 
   compositionTimeStamp then DTSFlag MUST be set to 0 and no DTSDelta 
   field SHALL be present. 
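
   As an illustration of how a receiver might turn these fields back
   into time stamps, the following Python sketch sign-extends a raw
   CTSDelta value and derives the CTS and DTS. The function names are
   hypothetical, and reducing the results modulo 2^32 (the width of
   the RTP time stamp) is an assumption of this sketch rather than a
   rule stated in this document.

```python
RTP_TS_MODULUS = 1 << 32  # RTP time stamps are 32-bit values


def sign_extend(raw, nbits):
    """Interpret an unsigned nbits-wide field as two's complement."""
    sign_bit = 1 << (nbits - 1)
    return (raw ^ sign_bit) - sign_bit


def composition_time(rtp_timestamp, cts_delta_raw, cts_delta_length):
    """CTS = RTP time stamp + signed CTSDelta."""
    delta = sign_extend(cts_delta_raw, cts_delta_length)
    return (rtp_timestamp + delta) % RTP_TS_MODULUS


def decoding_time(cts, dts_delta):
    """DTSDelta encodes (CTS - DTS), so DTS = CTS - DTSDelta."""
    return (cts - dts_delta) % RTP_TS_MODULUS
```

   For the first MSLH of an RTP packet, where CTSFlag is 0, the CTS is
   simply the RTP time stamp.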
 
3.4.2 Relationship between sizes of MSLH fields and parameters 
    
   The relationship between a Mapped SL Packet Header and the related 
   parameters is as follows: 
    
   +===========================+=================================+ 
   | Fields of MSLH            | Number of bits (parameters)     | 
   +===========================+=================================+ 
   | PayloadSize               | SizeLength                      | 
   +---------------------------+---------------------------------+ 
   | Index                     | IndexLength                     | 
   +---------------------------+---------------------------------+ 
   | IndexDelta                | IndexDeltaLength                | 
   +---------------------------+---------------------------------+ 
   | CTSFlag                   | 1      If (CTSDeltaLength > 0)  | 
   +---------------------------+---------------------------------+ 
   | CTSDelta                  | CTSDeltaLength If (CTSFlag==1)  | 
   +---------------------------+---------------------------------+ 
   | DTSFlag                   | 1      If (DTSDeltaLength > 0)  | 
   +---------------------------+---------------------------------+ 
   | DTSDelta                  | DTSDeltaLength If (DTSFlag==1)  | 
   +---------------------------+---------------------------------+ 
    
   Table 1: Relationship between MSLH field size and parameters 
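
   The following Python sketch shows one way a receiver could parse an
   MSLH using the rules of Table 1. It is a minimal illustration, not
   a reference implementation: the BitReader class, the parse_mslh
   function and the dictionary-based parameter passing are assumptions
   of this example, and the field order is taken to be the row order
   of Table 1.

```python
class BitReader:
    """Reads unsigned integers MSB-first from a bytes object."""

    def __init__(self, data):
        self.data = data
        self.pos = 0  # current position, in bits

    def read(self, nbits):
        value = 0
        for _ in range(nbits):
            byte = self.data[self.pos // 8]
            bit = (byte >> (7 - self.pos % 8)) & 1
            value = (value << 1) | bit
            self.pos += 1
        return value


def parse_mslh(reader, params, first):
    """Parse one MSLH; `params` maps parameter names to bit lengths."""

    def length(name):
        return params.get(name, 0)  # an absent parameter defaults to 0

    header = {}
    if length("SizeLength"):
        header["PayloadSize"] = reader.read(length("SizeLength"))
    if length("IndexLength"):
        if first:
            header["Index"] = reader.read(length("IndexLength"))
        elif length("IndexDeltaLength"):
            header["IndexDelta"] = reader.read(length("IndexDeltaLength"))
    if length("CTSDeltaLength"):
        header["CTSFlag"] = reader.read(1)
        if header["CTSFlag"]:
            header["CTSDelta"] = reader.read(length("CTSDeltaLength"))
    if length("DTSDeltaLength"):
        header["DTSFlag"] = reader.read(1)
        if header["DTSFlag"]:
            header["DTSDelta"] = reader.read(length("DTSDeltaLength"))
    return header
```

   Because CTSDelta and DTSDelta are conditional on their flags, the
   position of the reader after each call is the only reliable way to
   find the start of the next MSLH.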
    
3.5 RSLHSection structure 
    
  
   This section consists of a field (RSLHSectionSize) giving the size 
   in bits of the following block of bit-wise concatenated RSLHs. 
         
   If the section consumes a non-integer number of bytes, up to 7 zero 
   padding bits MUST be inserted at the end in order to achieve byte-
   alignment. 
    
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
   | RSLHSectionSize (RSLHSectionSizeLength bits)| RSLH (variable  | 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                 |              
   | number of bits)                                               | 
   |                                                               | 
   |         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
   |         | RSLH (variable number of bits)                      | 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
   | etc                                                           | 
   | as many bit-wise concatenated RSLHs                           | 
   | as SL Packets in this RTP packet                              | 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
   | RSLH (variable number of bits)                                | 
   |                                                 +-+-+-+-+-+-+-+ 
   |                                                 : padding bits|   
   |-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
                                     
   Figure 7: RSLHSection structure 
 
   The length in bits of the RSLHSectionSize field is 
   RSLHSectionSizeLength and is specified with a default value of zero 
   indicating that the whole RSLHSection is absent. Compatibility with 
   RFC 3016 requires that the RSLHSection should be empty, including 
   the RSLHSectionSize field. This is the reason why there is such a 
   variable length with a default value indicating absence of the 
   RSLHSectionSize field. 
    
   +=================================+===============================+ 
   | Fields of RSLHSection           |         Number of bits        | 
   +=================================+===============================+ 
   | RSLHSectionSize                 |       RSLHSectionSizeLength   | 
   +---------------------------------+-------------------------------+ 
   | all bit-wise concatenated RSLHs |       RSLHSectionSize         | 
   +---------------------------------+-------------------------------+ 
    
   Table 2: Sizes in bits inside RSLHSection 
 
   Parsing of the bit-wise concatenated RSLHs requires MPEG-4 system
   awareness; specifically, it requires understanding the MPEG-4
   Synchronization Layer (SL) syntax and the modifications to this
   syntax described in the next section.
    
   However, thanks to the RSLHSectionSize field, non-MPEG-4-system
   receivers MAY skip this part by rounding up RSLHSectionSize/8 to the
   next integer number of bytes.
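
   A receiver that only wants to skip the RSLHSection could compute
   its length roughly as follows. This Python sketch rests on an
   assumption that the text does not spell out: the section starts on
   a byte boundary and the padding bits align the section as a whole,
   i.e. the RSLHSectionSize field together with the concatenated
   RSLHs. The function name is hypothetical.

```python
def rslh_section_bytes(rslh_section_size, rslh_section_size_length):
    """Number of bytes occupied by the RSLHSection, padding included.

    rslh_section_size        -- value read from the RSLHSectionSize field
    rslh_section_size_length -- RSLHSectionSizeLength parameter, in bits
    """
    if rslh_section_size_length == 0:
        return 0  # default value: the whole section is absent
    # Assumption: padding is computed over size field plus RSLHs.
    total_bits = rslh_section_size_length + rslh_section_size
    return (total_bits + 7) // 8  # round up to whole bytes
```

   When RSLHSectionSizeLength is a multiple of 8 the result equals the
   size of the field plus RSLHSectionSize/8 rounded up.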
       
  
3.6 RSLH structure 
    
   A Remaining SL Packet Header (RSLH) is what remains of an SL header 
   after modifications for mapping into this payload format.  
    
   The following modifications of the SL packet header MUST be applied. 
   The other fields of the SL packet header MUST remain unchanged but 
   are bit-shifted to fill in the gaps left by the operations specified 
   below. 
    
3.6.1 Removal of fields 
    
   The following SL Packet Header fields -if present- are removed since 
   they are mapped either in the RTP header or in the corresponding 
   MSLH: 
   . compositionTimeStampFlag 
   . compositionTimeStamp 
   . decodingTimeStampFlag 
   . decodingTimeStamp 
   . packetSequenceNumber 
   . AccessUnitEndFlag (in Single-SL mode only) 
    
   The AccessUnitEndFlag, when present for a given stream, MUST be 
   removed from every RSLH when using the Single-SL mode since it has 
   the same meaning as the Marker bit (and for compatibility with RFC 
   3016). However when using the Multiple-SL mode, AccessUnitEndFlag 
   MUST NOT be removed since it is useful to signal individual AU ends. 
 
3.6.2 Mapping of OCR 
    
   Furthermore, if the SL Packet header contains an OCR, then this
   field is encoded in the RSLH as a two's-complement difference
   (delta), exactly like a compositionTimeStamp or a decodingTimeStamp
   in the MSLH. The length in bits of this difference is indicated by
   the OCRDeltaLength parameter (see section 4.1).
    
   With this payload format OCRs MUST have the same clock resolution as 
   Time Stamps. 
    
   If compositionTimeStamp is not present for an SL packet that has an
   OCR, then the OCR SHALL be encoded as a difference to the RTP time
   stamp.
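
   On the sender side, writing such a difference into a field of
   OCRDeltaLength bits could look like the Python sketch below. The
   function name is hypothetical, the `reference` argument stands for
   the time stamp the difference is taken against (the RTP time stamp
   in the case just described), and rejecting out-of-range values with
   an exception is a choice made for this example only.

```python
def encode_ocr_delta(ocr, reference, ocr_delta_length):
    """Return the raw two's-complement field for (ocr - reference)."""
    delta = ocr - reference
    lowest = -(1 << (ocr_delta_length - 1))
    highest = (1 << (ocr_delta_length - 1)) - 1
    if not lowest <= delta <= highest:
        raise ValueError("difference does not fit in OCRDeltaLength bits")
    # Masking maps a negative difference to its two's-complement pattern.
    return delta & ((1 << ocr_delta_length) - 1)
```

   The range check makes visible that OCRDeltaLength has to be chosen
   large enough for the largest distance between an OCR and its
   reference time stamp.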
    
3.6.3 Degradation Priority 
    
   For streams that use the optional degradationPriority field in the 
   SL Packet Headers, only SL packets with the same degradation 
   priority SHALL be transported by one RTP packet so that components 
   may dispatch the RTP packets according to appropriate QOS or 
   protection schemes. Furthermore only the first RSLH of one RTP 
   packet SHALL contain the degradationPriority field since it would be 
   otherwise redundant.  
 
3.7 SLPPSection structure 
  
   The SLPPSection (SL Packet Payload Section) contains the 
   concatenated SL Packet Payloads. By definition SL Packet Payloads 
   are byte aligned. 
    
   For efficiency SL packets do not carry their own payload size. This 
   is not an issue for RTP packets that contain a single SL Packet. 
    
   However in the Multiple-SL mode the size of each SL packet payload 
   MUST be available to the receiver. 
    
   If the SL packet payload size is constant for a stream, the size 
   information SHOULD NOT be transported in the RTP packet. However in 
   that case it MUST be signaled using the ConstantSize parameter (see 
   section 4.1). 
    
   If the SL packet payload size is variable then the size of each SL 
   packet payload MUST be indicated in the corresponding MSLH. In order 
   to do so the MSLH MUST contain a PayloadSize field. The number of 
   bits on which this PayloadSize field is encoded MUST be indicated 
   using the SizeLength parameter (see section 4.1). 
    
   The absence of both ConstantSize and SizeLength indicates the
   Single-SL mode, i.e. that a single SL packet is transported in each
   RTP packet for that stream.
    
    
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
   | SLPP (variable number of bytes)                           | 
   |                                                           | 
   |                                                           | 
   |                         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
   |                         | SLPP (variable number of bytes) | 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+                                 | 
   |                                                           | 
   |                                                           | 
   |                                                           | 
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
   | etc                                                       | 
   | as many byte-wise concatenated SLPPs                      | 
   | as SL Packets in this RTP packet                          | 
   |-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
    
   Figure 8: SLPPSection structure 
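
   The rules of this section can be sketched as a small Python
   function that splits an SLPPSection into SL Packet Payloads. The
   function name and arguments are assumptions of this example. It
   also assumes that with ConstantSize the section is a whole number
   of payloads, and it applies the PayloadSize exception of section
   3.4.1 (a single SL packet occupies the whole section).

```python
def split_slpp_section(section, payload_sizes=None, constant_size=0):
    """Split the SLPPSection (bytes) into SL Packet Payloads.

    payload_sizes -- PayloadSize values from the MSLHs, or None when
                     SizeLength is not signaled
    constant_size -- ConstantSize parameter, 0 when not signaled
    """
    if payload_sizes is not None and constant_size:
        raise ValueError("SizeLength and ConstantSize are mutually exclusive")
    if constant_size:
        if len(section) % constant_size:
            raise ValueError("section is not a whole number of payloads")
        payload_sizes = [constant_size] * (len(section) // constant_size)
    if payload_sizes is None or len(payload_sizes) == 1:
        # Single-SL mode, or a single SL packet whose PayloadSize holds
        # the size of the entire Access Unit: the section is one payload.
        return [section]
    if sum(payload_sizes) != len(section):
        raise ValueError("sizes do not add up to the section length")
    payloads, offset = [], 0
    for size in payload_sizes:
        payloads.append(section[offset:offset + size])
        offset += size
    return payloads
```

   Since SL Packet Payloads are byte aligned, plain byte slicing is
   sufficient here; no bit-level reader is needed.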
    
3.8 Interleaving 
    
   SL Packets MAY be interleaved. Senders MAY perform interleaving. 
   Receivers MUST support interleaving. 
    
   When interleaving of SL packets is used it SHALL be implemented 
   using the Index and IndexDelta fields of MSLH. 
    
  
   The conjunction of RTP sequence number and Index, IndexDelta can 
   produce a quasi-unique identifier for each SL packet so that a 
   receiver can unambiguously reconstruct the original order even in 
   case of out-of-order packets, packet loss or duplication. 
    
   However, implementors of receivers must take care: when IndexLength
   is small, Index will roll over often. For that reason timestamps
   SHOULD be used as the basis for de-interleaving, i.e. the reordering
   algorithm should consider timestamps and IndexDelta first and use
   Index only when CTS are not available. Symmetrically, senders MUST
   either use suitably large values for IndexLength or use small values
   only when CTS are either present in MSLH or can otherwise be
   unambiguously computed for each SL packet (for example audio streams
   as in Appendix.5).
    
   The AUSequenceNumber field of the SL header MUST NOT be used for 
   interleaving since firstly it may collide with the Scene Description 
   Carousel usage described in section 4.1 and secondly it is not 
   visible to non-MPEG-4 system receivers. 
    
3.9 Fragmentation Rules 
    
   This section specifies rules for senders in order to prevent media 
   decoding difficulties at the receiver end. 
    
   MPEG-4 Access Units are the default fragments for MPEG-4 bitstreams 
   and SHOULD be mapped directly into RTP packets of this format with 
   two exceptions: 
   - Access Units larger than the MTU 
   - When using interleaving for better packet loss resilience. 
    
   In all cases Access Unit start MUST be aligned with SL packet start. 
    
   This section gives rules to apply when performing Access Unit 
   fragmentation. 
    
   Some MPEG-4 codecs define optional syntax for Access Units sub-
   entities (fragments) that are independently decodable for error 
   resilience purposes. Examples are Video Packets for video and Error 
   Sensitivity Categories (ESC) for audio. This always corresponds to 
   specific bitstream syntax, which is signaled in the 
   DecoderSpecificInfo inside the DecoderConfig in SLConfig, and/or 
   using the corresponding parameters as described in section 4.1. 
   Therefore encoders and decoders are both aware whether they are 
   operating in such a mode or not (however since this codec 
   configuration is an opaque data block this is not explicitly 
   signaled by this payload format). 
    
   If not operating in such a mode, the decoder has to skip packets
   after a loss until an Access Unit start is received. Similarly,
   decoder implementations that do not implement robust decoding of
   Access Unit fragments have to discard all packets after a packet
   loss until an Access Unit start is received. In the same way,
   decoder implementations that do not implement re-synchronization at
   any Access Unit start have to discard all packets after a packet
   loss until a Random Access Point Access Unit is received. A good
   implementation is expected to behave this way in any case.
    
   However, serious problems arise for decoder implementations that
   try to restart decoding after a packet loss if independently
   decodable fragments are signaled (in the decoder configuration) but
   the fragments actually received are not independently decodable.
   This happens when the RTP sender has made RTP packets on boundaries
   different from those of the fragments provided by the encoder, so
   the issue concerns both the interface between the encoder and the
   RTP sender and the RTP sender component itself. The decoder has in
   general no way to detect such a faulty fragment.
    
   For this reason the following rules must apply to SL streams that 
   are specifically made for transport with this payload format: 
    
   SL packets SHOULD be codec-semantic entities in the spirit of ALF,
   i.e. either complete Access Units or fragments of Access Units that
   are independently decodable. Specifically, when a given codec has an
   optional syntax for independently decodable Access Unit fragments,
   this option SHOULD be used.
    
   Furthermore when streams are generated using independently decodable 
   Access Units fragments these Access Units fragments MUST be mapped 
   one-to-one into SL packets. Consequently independently decodable 
   Access Units fragments MUST NOT be split across several SL packets 
   and therefore MUST NOT be split across several RTP packets. 
    
   For example an MPEG-4 audio stream encoded using the ESC syntax MUST 
   NOT split one ESC across 2 RTP packets. 
    
   This rule is relaxed when using MPEG-4 Video Packets for two 
   reasons: firstly Video Packets can be much larger than typical MTU 
   and secondly all Video Packets start with a specific 
   resynchronization marker that can be unambiguously detected. 
   Therefore for video streams using the Video Packet syntax Video 
   Packets MAY be split across several SL packets although it is 
   strongly RECOMMENDED to always adapt the Video Packet size to fit 
   the MTU. A Video Packet start MUST always be aligned with an SL
   packet start, except when a GOV is present, in which case the GOV
   and the first Video Packet of the following VOP MUST be included in
   the same SL packet.
 
4. Types and Names 
    
   This section describes the MIME types and names associated with this 
   payload format. Section 4.1 is intended for registration with IANA 
   as in RFC 2048. 
    
   This format may require additional information about the mapping to 
   be made available to the receiver. This is done using parameters
   described in the next section. The absence of any of these fields is
   equivalent to a field set to the default value, which is always 
   zero. The absence of any such parameters resolves into a default 
   "basic" configuration compatible with RFC3016 for MPEG-4 video. 
    
   In the MPEG-4 framework the SL stream configuration information is 
   carried using the Object Descriptor. For compatibility with 
   receivers that do not implement the full MPEG-4 system specification 
   this information MAY also be signaled using parameters described 
   here. When such information is present both in an Object Descriptor 
   and as a parameter of this payload format it MUST be exactly the 
   same. 
    
   For transport of MPEG-4 audio and video without the use of MPEG-4 
   systems, as well as to support non-MPEG-4 system receivers, it is 
   also possible to transport information on the profile and level of 
   the stream and on the decoder configuration. This is also described 
   in the next section. 
    
   Finally this MIME type also defines a mode parameter and a profile 
   parameter that are intended for future derivations of this payload 
   format. 
    
4.1 MIME type registration 
 
   MIME media type name:  "video" or "audio" or "application" 
    
   "video" SHOULD be used for MPEG-4 Visual streams (i.e. video as 
   defined in ISO/IEC 14496-2 [2] and/or graphics as defined in ISO/IEC 
   14496-1 [1]) or MPEG-4 Systems streams that convey information 
   needed for an audio/visual presentation. 
    
   "audio" SHOULD be used for MPEG-4 Audio streams (ISO/IEC 14496-3) or 
   MPEG-4 Systems streams that convey information needed for an audio 
   only presentation. 
    
   "application" SHOULD be used for MPEG-4 Systems streams 
   (ISO/IEC14496-1) that serve other purposes than audio/visual 
   presentation, e.g. in some cases when MPEG-J streams are 
   transmitted. 
    
   MIME subtype name: mpeg4-generic 
    
   Required parameters: none 
    
   Optional parameters: 
    
   Mode: 
   The mode in which this specification is used. This specification 
   itself defines only the default mode (Mode=default). When the mode 
   parameter is not present the default mode SHALL be assumed. In the 
   default mode all parameters are optional and as defined here. Other 
   modes may be defined as needed in other RFCs. A mode MUST be a
   subset of this specification. Specifically when defining a mode care
   MUST be taken that an implementation of this specification can 
   decode the payload format corresponding to this new mode. For this 
   reason a mode MUST NOT specify new default values for MIME 
   parameters and MIME parameters MUST be present (unless they have the 
   default value) even if it is redundant in case the mode assigns 
   fixed values. A mode may define additionally that some MIME 
   parameters are required instead of optional, that some MIME 
   parameters have fixed values (or ranges), and that there are rules 
   restricting the usage (for example forbidding the carriage of 
   multiple AU fragments in the same RTP packet). 
    
   Profile: 
   The meaning of this parameter may be defined by a mode. This is 
   meant to be used in order to define sub-configurations of a given 
   mode, for example the maximum delay (and therefore the size of 
   buffers) induced by the usage of interleaving. Implementations of 
   this specification can ignore this parameter. 
    
   DTSDeltaLength: 
   The number of bits on which the DTSDelta field is encoded in MSLH. 
   The default value is zero and indicates the absence of DTSFlag and 
   DTSDelta in MSLH (the stream does not transport decodingTimeStamps). 
   A value larger than zero indicates that there is a DTSFlag in each 
   MSLH. Since decodingTimeStamp -if present- must be encoded as a 
   difference to the RTP time stamp, the DTSDeltaLength parameter MUST 
   be present in order to transport decodingTimeStamps with this 
   payload format. 
         
   CTSDeltaLength: 
   The number of bits on which the CTSDelta field is encoded in (non-
   first) MSLH. The default value is zero and indicates the absence of 
   the CTSFlag and CTSDelta fields in MSLH. Non-zero values MUST NOT be 
   signaled in the Single-SL mode. Since compositionTimeStamps -if
   present- must be encoded as a difference to the RTP time stamp, the
   CTSDeltaLength parameter MUST be present in order to transport 
   compositionTimeStamps using this payload format (in the Multiple-SL 
   mode). However CTSDeltaLength SHOULD be set to zero (or not 
   signaled) for streams that have a constant Access Unit duration 
   (which can be explicitly signaled using the DurationFlag and 
   AccessUnitDuration field of SLConfigDescriptor). 
         
   OCRDeltaLength: 
   The number of bits on which the OCRDelta field is encoded in RSLH. 
   The default value is zero and indicates the absence of OCR for this 
   stream. Since objectClockReference -if present- must be encoded as a 
   difference to the RTP time stamp, the OCRDeltaLength parameter MUST 
   be present in order to transport objectClockReferences with this 
   payload format. 
    
   SizeLength: 
   The number of bits on which the PayloadSize field of MSLH is 
   encoded. The default value is zero and indicates the Single-SL mode
   (unless ConstantSize is present). Simultaneous presence of this
   parameter and ConstantSize is illegal. Either the SizeLength or 
   ConstantSize parameter MUST be present in order to signal the 
   Multiple-SL mode of this payload format. 
    
   ConstantSize: 
   The constant size in bytes of each SL Packet Payload for this 
   stream. The default value is zero and indicates variable SL Packet 
   Payload size (or the Single-SL mode if SizeLength is absent). 
   Simultaneous presence of this parameter and SizeLength is illegal. 
   Either the SizeLength or ConstantSize parameter MUST be present in 
   order to signal the Multiple-SL mode of this payload format. When 
   ConstantSize is present the PayloadSize of MSLH in the RTP packets 
   MUST NOT be present. 
    
   IndexLength: 
   The number of bits on which the Index is encoded in the first MSLH. 
   The default value is zero and indicates the absence of Index and 
   IndexDelta for all MSLHs. Since packetSequenceNumber -if present- 
   must be mapped in MSLH, the IndexLength parameter MUST be present in 
   order to transport packetSequenceNumber with this payload format. 
 
   IndexDeltaLength: 
   The number of bits on which the IndexDelta field is encoded in any
   non-first MSLH. The default value is zero and indicates that
   packetSequenceNumber MUST be incremented by one for each SL packet
   in the RTP packet (see section 3.4.1). The IndexDeltaLength
   parameter MUST be present when using interleaving with this payload
   format.
    
   RSLHSectionSizeLength: 
   The number of bits that is used to encode the RSLHSectionSize field. 
   The default value is zero and indicates the absence of the whole 
   RSLHSection for all RTP packets of this stream.  
    
   SLConfigDescriptor: 
   A base-64 encoding of the SLConfigDescriptor. This SHALL be the 
   original SLConfigDescriptor and it SHALL be the same as the one 
   transported by the OD framework, if any. 
    
   profile-level-id: 
   A decimal representation of the MPEG-4 Profile Level indication 
   value. For audio this parameter indicates which MPEG-4 Audio tool 
   subsets are applied to encode the audio stream and is defined in 
   ISO/IEC 14496-1 [1]. For video this parameter indicates which MPEG-4 
   Visual tool subsets are applied to encode the video stream and is 
   defined in Table G-1 of ISO/IEC 14496-2 [2]. This parameter MAY be 
   used in the capability exchange or session setup procedure to 
   indicate MPEG-4 Profile and Level combination of which the relevant 
   MPEG-4 media codec is capable. If this parameter is not specified 
   its default value is 1 (Simple Profile/Level 1) for video (for 
   compatibility with RFC 3016) and otherwise 0xFE (defined in ISO/IEC 
   14496-1 [1] as being the generic default value). 
    
  
   Config: 
   A hexadecimal representation of an octet string that expresses the 
   media payload configuration. Configuration data is mapped onto the 
   octet string in an MSB-first basis. The first bit of the 
   configuration data SHALL be located at the MSB of the first octet. 
   In the last octet, zero-valued padding bits, if necessary, shall 
   follow the configuration data. For audio streams, config is the 
   audio object type specific decoder configuration data 
   AudioSpecificConfig() as defined in ISO/IEC 14496-3 [3]. For video 
   this expresses the MPEG-4 Visual configuration information, as 
   defined in subclause 6.2.1 Start codes of ISO/IEC14496-2 [2] and the 
   configuration information indicated by this parameter SHALL be the 
   same as the configuration information in the corresponding MPEG-4 
   Visual stream, except for first-half-vbv-occupancy and latter-half-
   vbv-occupancy, if it exists, which may vary in the repeated 
   configuration information inside an MPEG-4 Visual stream (See 6.2.1 
   Start codes of ISO/IEC14496-2). 
    
   StreamType: 
   The integer value that indicates the type of MPEG-4 stream that is 
   carried; its coding corresponds to the values of the streamType as 
   defined for the DecoderConfigDescriptor in ISO/IEC 14496-1. 
    
   Encoding considerations: 
   System bitstreams MUST be generated according to MPEG-4 System
   specifications (ISO/IEC 14496-1). Video bitstreams MUST be generated
   according to MPEG-4 Visual specifications (ISO/IEC 14496-2). Audio
   bitstreams MUST be generated according to MPEG-4 Audio
   specifications (ISO/IEC 14496-3). All SL streams MUST be generated
   according to MPEG-4 Sync Layer specifications (ISO/IEC 14496-1
   section 10); in order to read this format the SLConfigDescriptor may
   be required. These bitstreams are binary data and MUST be encoded
   for non-binary transport (for Email, the Base64 encoding is
   sufficient). This type is also defined for transfer via RTP. The RTP
   packets MUST be packetized according to the RTP payload format
   defined in RFC <self-reference-to-this>.
    
   Security considerations: 
   As in RFC <self-reference-to-this>. 
    
   Interoperability considerations: 
   MPEG-4 provides a large and rich set of tools for the coding of 
   visual objects.  For effective implementation of the standard, 
   subsets of the MPEG-4 tool sets have been provided for use in 
   specific applications. These subsets, called 'Profiles', limit the 
   size of the tool set a decoder is required to implement. In order to 
   restrict computational complexity, one or more 'Levels' are set for 
   each Profile. A Profile@Level combination allows: 
   . a codec builder to implement only the subset of the standard he 
   needs, while maintaining interoperability with other MPEG-4 devices 
   included in the same combination, and 
   . checking whether MPEG-4 devices comply with the standard 
   ('conformance testing'). 
  
   A stream SHALL be compliant with the MPEG-4 Profile@Level specified 
   by the parameter "profile-level-id". Interoperability between a 
   sender and a receiver may be achieved by specifying the parameter 
   "profile-level-id" in MIME content, or by arranging in the 
   capability exchange/announcement procedure to set this parameter 
   mutually to the same value. 
    
   Published specification: 
   The specifications for MPEG-4 streams are presented in ISO/IEC
   14496-1, 14496-2, and 14496-3.  The RTP payload format is described
   in RFC <self-reference-to-this>.
    
   Applications that use this media type: 
   Multimedia streaming and conferencing tools, Internet messaging and 
   Email applications. Also trans-galactic supra-relativistic 
   elementary particle hyperspace tunneling communication devices :-) 
    
   Additional information: none 
    
   Magic number(s): none 
    
   File extension(s): 
   None. A file format with the extension .mp4 has been defined for
   MPEG-4 content but is not directly correlated with this MIME type,
   whose sole purpose is RTP transport.
    
   Macintosh File Type Code(s): none 
    
   Person & email address to contact for further information: 
   Authors of RFC <self-reference-to-this>. 
    
   Intended usage: COMMON 
    
   Author/Change controller: 
   Authors of RFC <self-reference-to-this>. 
    
4.2 Concatenation of parameters 
    
   Multiple parameters SHOULD be expressed as a MIME media type string, 
   in the form of a semicolon-separated list of parameter=value pairs 
   (see examples in Appendix). 
    
4.3 Usage of SDP 
    
4.3.1 The a=fmtp keyword 
    
   It is assumed that one typical way to transport the above-described
   parameters associated with this payload format is via an SDP [10]
   message, for example transported to the client in reply to an RTSP
   [13] DESCRIBE message or via SAP [14]. In that case the (a=fmtp)
   keyword MUST be used as described in RFC 2327 [10, section 6]. The
   syntax is then:
    
  
   a=fmtp:<format> <parameter name>=<value> 
    
4.3.2 SDP example 
    
   The following is an example of SDP syntax for the description of a 
   session containing one MPEG-4 audio stream, one MPEG-4 video stream 
   and three MPEG-4 system streams: the first one BIFS, the second one 
   OD and the third one IPMP. All are transported using this format 
   and the AVP profile [12]. Note that in this example the DTSDelta 
   values of the video stream are encoded on 4 bits. See the Appendix 
   for more examples. 
    
   o= .... 
   i= .... 
   c=IN IP4 123.234.71.112 
    
   m=video 1034 RTP/AVP 97 
   a=fmtp:97 StreamType=4;DTSDeltaLength=4 
   a=rtpmap:97 mpeg4-sl 
    
   m=audio 810  RTP/AVP 98 
   a=fmtp:98 StreamType=5; profile-level-id=1; config=7866E7E6EF 
   a=rtpmap:98 mpeg4-sl 
    
   m=application 1234  RTP/AVP 99 
   a=rtpmap:99 mpeg4-sl 
   a=fmtp:99 StreamType=3;  
    
   m=application 1236  RTP/AVP 99 
   a=rtpmap:99 mpeg4-sl 
   a=fmtp:99 StreamType=1;  
    
   m=application 1238  RTP/AVP 99 
   a=rtpmap:99 mpeg4-sl 
   a=fmtp:99 StreamType=7;  
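
   A receiver has to split such a=fmtp lines back into individual 
   parameters. The following Python sketch shows one possible way to 
   do so; the function name parse_fmtp is hypothetical, and the 
   tolerance for spaces after semicolons and for a trailing semicolon 
   is an assumption based on the example lines above, not a rule 
   stated by this specification. 

```python
def parse_fmtp(line):
    """Split 'a=fmtp:<format> name=value;name=value' into (format, dict)."""
    prefix = "a=fmtp:"
    if not line.startswith(prefix):
        raise ValueError("not an a=fmtp line")
    fmt, _, rest = line[len(prefix):].strip().partition(" ")
    params = {}
    for item in rest.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing semicolon
        name, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"malformed parameter: {item!r}")
        params[name.strip()] = value.strip()
    return fmt, params


print(parse_fmtp("a=fmtp:97 StreamType=4;DTSDeltaLength=4"))
# ('97', {'StreamType': '4', 'DTSDeltaLength': '4'})
```

   Values are returned as strings; converting them to integers is 
   left to the caller, since parameters such as "config" are not 
   numeric. 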
 
5. Other issues 
 
5.1 SL packetized stream reconstruction 
    
   The purpose of this section is to document how a receiver can 
   reconstruct a valid SL packetized stream. Since this format directly 
   transports SL packets this reconstruction is performed by reversing 
   the payload structure rules (section 3). We explicitly describe here 
   the most complex transformations.  
    
   In the following let (i) be the index of SL packets inside one RTP 
   packet (starting at zero for each RTP packet), let SLPacketHeader.x 
   denote field x of the reconstructed SL packet header, let MSLH.x 
   denote field x of the received MSLH, etc. 
    
   SLPacketHeader.packetSequenceNumber is restored from MSLH.Index and 
   MSLH.IndexDelta using: 
  
   if ( IndexLength == 0) { // or is absent 
      if ( SLConfig.packetSeqNumLength == 0 ) { 
          // this stream does not have SL packet sequence number 
      } 
      else { 
          // illegal, normally the sender MUST map 
          // SLPacketHeader.packetSequenceNumber in MSLH  
          // and set a relevant IndexLength value; 
          // otherwise it is unfortunately impossible for the receiver 
          // to reconstruct the correct sequence 
      }  
   } 
   else { // IndexLength is not zero 
      if ( SLConfig.packetSeqNumLength == 0 ) { 
          // the original SL stream does not have SL packet  
          // sequence numbers, typically the sender inserted them 
          // in order to implement interleaving at the RTP level; 
          // they must be ignored for SL stream reconstruction 
      } 
      else { 
         if (i == 0){ // first SL packet in RTP packet 
           SLPacketHeader.packetSequenceNumber(0) = MSLH.Index(0); 
         } 
         else { // remaining SL packets 
           SLPacketHeader.packetSequenceNumber(i) = 
              SLPacketHeader.packetSequenceNumber(i-1) 
              + MSLH.IndexDelta(i) 
              + 1; 
         } 
      } 
   } 
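
   For the case where both IndexLength and 
   SLConfig.packetSeqNumLength are non-zero, the rule can be sketched 
   in Python as follows. The function and argument names are 
   hypothetical, and the optional wrap-around modulo 2^seq_num_bits 
   is an assumption of this sketch rather than something stated in 
   the pseudocode above. 

```python
def restore_packet_sequence_numbers(first_index, index_deltas,
                                    seq_num_bits=None):
    """Return the sequence numbers of all SL packets in one RTP packet.

    first_index  -- MSLH.Index of the first SL packet (i == 0)
    index_deltas -- MSLH.IndexDelta of each following SL packet (i >= 1)
    seq_num_bits -- if given, wrap the result modulo 2**seq_num_bits
                    (assumption: the counter wraps at its field width)
    """
    numbers = [first_index]
    for delta in index_deltas:
        numbers.append(numbers[-1] + delta + 1)
    if seq_num_bits is not None:
        numbers = [n % (1 << seq_num_bits) for n in numbers]
    return numbers


print(restore_packet_sequence_numbers(5, [0, 0, 2]))  # [5, 6, 7, 10]
```

   An IndexDelta of zero therefore means "the next packet in 
   sequence", and a non-zero value indicates a gap, as produced for 
   instance by interleaving. 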
    
   All time stamps (CTS, DTS, OCR), when present, are restored from the 
   delta values. The time stamp flags (CTSFlag, DTSFlag) in MSLH are 
   used to reconstruct the compositionTimeStampFlag and 
   decodingTimeStampFlag of SLPacketHeader, respectively. 
    
   if ( CTSDeltaLength == 0) { // or CTSDeltaLength is absent 
      // CTS is not transported for this RTP stream 
      if (i == 0){ // first SL packet in RTP packet 
         if ( SLConfig.useTimeStamps == 1 ) { 
            if ( SLPacketHeader.accessUnitStartFlag == 1 ) { 
               SLPacketHeader.compositionTimeStampFlag(0) = 1; 
               SLPacketHeader.compositionTimeStamp(0) = RTP TimeStamp; 
            } 
            else { 
               // ignore 
            } 
         } 
         else { 
             // empty 
         } 
      } 
      else { // non-first SL packets in RTP packet 
         if ( SLConfig.useTimeStamps == 1 ) { 
             if ( SLPacketHeader.accessUnitStartFlag == 1 ) { 
                SLPacketHeader.compositionTimeStampFlag(i) = 0; 
             } 
             else { 
                // ignore 
             } 
         } 
         else { 
             // empty 
         } 
      } 
   } 
   else { // CTSDeltaLength is not zero 
      // CTS is transported for this stream 
      if ( SLConfig.useTimeStamps == 1 ) { 
         if ( SLPacketHeader.accessUnitStartFlag == 1 ) { 
             SLPacketHeader.compositionTimeStampFlag(i) = 
                      MSLH.CTSFlag(i); 
             SLPacketHeader.compositionTimeStamp(i) =  
                    RTP TimeStamp + MSLH.CTSDelta(i); 
         } 
         else { 
            // ignore CTSFlag (which must be zero) 
         } 
      } 
      else { 
         // this is strange and sub-optimal at best 
         // a receiver should ignore this 
      } 
   } 
    
   if ( DTSDeltaLength == 0) { // or DTSDeltaLength is absent 
      // DTS is not transported for this stream 
      if ( SLConfig.useTimeStamps == 1 ) { 
         if ( SLPacketHeader.accessUnitStartFlag == 1 ) { 
             SLPacketHeader.decodingTimeStampFlag(i) = 0; 
         } 
         else { 
             // ignore 
         } 
      } 
      else { 
          // empty 
      } 
   } 
   else { 
      // DTS is transported for this stream 
      if ( SLConfig.useTimeStamps == 1 ) { 
         if ( SLPacketHeader.accessUnitStartFlag == 1 ) { 
              SLPacketHeader.decodingTimeStampFlag(i) = 
                  MSLH.DTSFlag(i); 
              SLPacketHeader.decodingTimeStamp(i) =  
                  RTP TimeStamp + MSLH.DTSDelta(i); 
         } 
         else { 
             // ignore DTSFlag (which must be zero) 
         } 
      } 
      else { 
         // this is strange and sub-optimal at best 
         // a receiver should ignore this 
      } 
   } 
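
   The CTS and DTS rules above can be condensed into two small 
   functions. This is a sketch under stated assumptions, not a 
   normative restatement: all names are hypothetical, deltas are 
   taken as already-decoded integers, time stamp wrap-around is not 
   handled, and a time stamp value is produced only when the 
   corresponding flag is set (the pseudocode above assigns the value 
   without testing the flag). 

```python
def restore_cts(rtp_ts, i, use_time_stamps, au_start,
                cts_delta_length=0, cts_flag=0, cts_delta=0):
    """Return (compositionTimeStampFlag, compositionTimeStamp) for SL packet i.

    A flag of None means the field is not restored for this packet.
    """
    if not (use_time_stamps and au_start):
        return None, None
    if cts_delta_length == 0:
        # CTS is not transported: only the first SL packet of the RTP
        # packet inherits the RTP time stamp.
        return (1, rtp_ts) if i == 0 else (0, None)
    if cts_flag:
        return 1, rtp_ts + cts_delta
    return 0, None


def restore_dts(rtp_ts, use_time_stamps, au_start,
                dts_delta_length=0, dts_flag=0, dts_delta=0):
    """Return (decodingTimeStampFlag, decodingTimeStamp)."""
    if not (use_time_stamps and au_start):
        return None, None
    if dts_delta_length == 0 or not dts_flag:
        return 0, None  # DTS not transported, or absent for this packet
    return 1, rtp_ts + dts_delta


print(restore_cts(9000, 0, True, True))              # (1, 9000)
print(restore_cts(9000, 1, True, True, 8, 1, 3000))  # (1, 12000)
```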
    
   if ( OCRDeltaLength == 0) { // or OCRDeltaLength is absent 
      // the RTP stream does not transport any OCR 
      if ( SLConfig.OCRLength == 0 ) { 
          // this stream does not have any OCR 
      } 
      else { 
          // illegal, normally the sender MUST detect 
          // OCRs, replace them with OCRDelta and set 
          // a relevant OCRDeltaLength value 
      } 
   } 
   else { 
      if ( SLConfig.OCRLength == 0 ) { 
         // this is strange and sub-optimal at best 
         // a receiver should ignore this 
      } 
      else { 
          SLPacketHeader.OCRflag(i) = RSLH.OCRFlag(i); 
          if ( SLPacketHeader.OCRflag(i) == 1) { 
               SLPacketHeader.objectClockReference(i) =  
                    RTP TimeStamp + RSLH.OCRDelta(i); 
          } 
      } 
   } 
    
    
   In the SingleSL mode the AccessUnitEndFlag, if needed, is restored 
   from the M bit, as follows: 
    
   if ( SLConfig.useAccessUnitEndFlag == 0 ) { 
       // this SL stream does not signal access unit ends 
   } 
   else { 
       SLPacketHeader.AccessUnitEndFlag = M bit; 
   } 
    
   In the multipleSL mode the AccessUnitEndFlag is untouched in RSLH. 
    
   The other SL packet header fields SHALL remain as found in RSLH. 
    
   In the general case the reconstruction of the original SL 
   packetized stream requires SL-awareness. However, this payload 
   format always allows a receiver that does not know the SL syntax to 
   reconstruct the SL semantics for the following useful features: 
   - Packet order (decoding order) 
   - Access Unit boundaries (using the M bit) 
   - Access Unit fragments (i.e. SL packet boundaries using 
   MSLH.PayloadSize) 
   - Composition Time Stamps (using the RTP Time Stamp and 
   MSLH.CTSDelta) 
   - Decoding Time Stamps (using the RTP Time Stamp and MSLH.DTSDelta) 
   - Packet sequence number (using the RTP sequence number and 
   MSLH.Index) 
 
5.2 Handling of scene description streams 
    
   MPEG-4 introduces new stream types as described in section 1 namely 
   Object Descriptors and BIFS. In the following both OD and BIFS are 
   discussed on the same basis i.e. as "scene description". 
    
   Considering scene description as a "stream-able" type of content is 
   a rather new concept, and for that reason some specific comments 
   are needed. 
    
   Typically scene descriptions are encoded in such a way that 
   information loss would in the general case cripple the presentation 
   beyond any hope of repair by the receiver. Such encoding is still 
   well suited to a number of multimedia applications where the scene 
   is first made available to the client via reliable channels and 
   then played. This payload format is not intended for this type of 
   application, for which download of MPEG-4 interchange (.mp4) files 
   is typical. However, it can also be used if the RTP packets are 
   transported using TCP or any other reliable protocol. 
    
   On the other hand MPEG-4 has introduced the possibility to 
   dynamically change the scene description by sending animation 
   information (changes in parameters) and structural change 
   information (updates). Since this information has to be sent in a 
   timely fashion MPEG-4 has defined a number of techniques in order to 
   encode the scene description in a manner that makes it behave 
   similarly to other temporal encoding schemes such as audio and 
   video. This payload format is intended for this usage. 
    
   Note that in many cases the application will consist of first the 
   reliable transmission of a static initial scene followed by the 
   streaming of animations and updates. For this reason the usage of 
   this payload format is attractive, since it offers a single 
   solution for both. 
    
   Senders must be aware that suitable protection schemes should be 
   used when scene description streams transport sensitive 
   configuration information. For example, if the RTP packet 
   transporting an OD-update command is lost, the corresponding media 
   stream is not accessible by the receiver. 
 
  
   Redundancy is a possibility and may be added by tools 
   hierarchically higher than this payload format, e.g. packet-based 
   FEC, re-transmission, or similar tools. In such a case, the general 
   congestion control principles have to be observed. 
    
   Since BIFS and OD streams may be modified during the session with 
   update commands, there is a need to send both update commands and 
   full BIFS/OD refresh. For that reason MPEG-4 defines Random Access 
   Points (RAP) for scene description streams (OD and BIFS), at which 
   by definition a decoder can restart decoding, i.e. it receives a 
   "full update" of the scene. This mechanism is called Scene and 
   Object Description Carousel. The AU Sequence Number field of the SL 
   Packet Header is used to support this behavior at the 
   Synchronization 
   Layer. When two access units are sent consecutively with the same AU 
   Sequence Number, the second one is assumed to be a semantic 
   repetition of the first. If a receiver starts to listen in the 
   middle of a session or has detected losses, it can skip all received 
   Access Units until such a RAP. The periodicity of transmission of 
   these RAPs should be chosen/adjusted depending on the application 
   and the network it is deployed on; i.e. exactly like Intra-coded 
   frames for video, it is the responsibility of the sender to make 
   sure the periodicity of RAPs is suitable. 
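
   The receiver behavior described above can be sketched as follows. 
   This is only an illustration: the function name, the dictionary 
   keys 'rap' and 'au_seq', and the decision to drop repeated access 
   units once synchronized are assumptions of this sketch. Loss 
   detection, which would send the receiver back to the 
   unsynchronized state, is left out. 

```python
def units_to_decode(access_units):
    """Yield the access units a late-joining receiver should decode.

    Each access unit is a dict with hypothetical keys:
      'rap'    -- True when the unit is a Random Access Point
      'au_seq' -- AU Sequence Number from the SL packet header
    """
    synced = False
    last_seq = None
    for au in access_units:
        if not synced:
            if not au["rap"]:
                continue  # skip everything until the first RAP
            synced = True
        elif au["au_seq"] == last_seq:
            continue  # semantic repetition of the previous unit
        last_seq = au["au_seq"]
        yield au


stream = [
    {"id": "a", "rap": False, "au_seq": 1},
    {"id": "b", "rap": True, "au_seq": 2},
    {"id": "c", "rap": True, "au_seq": 2},   # repetition of "b"
    {"id": "d", "rap": False, "au_seq": 3},
]
print([au["id"] for au in units_to_decode(stream)])  # ['b', 'd']
```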
 
5.3 Multiplexing 
    
   An advanced MPEG-4 session may involve a large number of objects, 
   possibly as many as a few hundred, so transporting each ES as an 
   individual RTP stream may not always be practical. Allocating and 
   controlling hundreds of destination addresses for each MPEG-4 
   session may pose insurmountable session administration problems. 
   The input/output processing overhead at the end-points will also be 
   extremely high. Additionally, low delay transmission of low 
   bitrate data streams, e.g. facial animation parameters, results in 
   extremely high header overheads. 
    
   To solve these problems, MPEG-4 data transport requires a 
   multiplexing scheme that allows selective bundling of several ESs. 
   This is beyond the scope of the payload format defined here. 
    
   MPEG-4's FlexMux multiplexing scheme may be used for this purpose, 
   and a specific RTP payload format is being developed [11].  
    
   Another approach may be to develop a generic RTP multiplexing scheme 
   usable for MPEG-4 data. The multiplexing scheme reported in [8] may 
   be a candidate for this approach. 
    
   For MPEG-4 applications, the multiplexing technique needs to address 
   the following requirements: 
    
   i. The ESs multiplexed in one stream can change frequently during a 
   session. Consequently, the coding type, individual packet size and 
   temporal relationships between the multiplexed data units must be 
   handled dynamically. 
  
   ii. The multiplexing scheme should have a mechanism to determine the 
   ES identifier (ES_ID) for each of the multiplexed packets. ES_ID is 
   not a part of the SL header. 
    
   iii. In general, an SL packet does not contain information about its 
   size. The multiplexing scheme should be able to delineate the 
   multiplexed packets whose lengths may vary from a few bytes to close 
   to the path-MTU. 
    
5.4 Overlap with RFC 3016 
    
   This payload format has been designed to have a (large) overlap with 
   RFC 3016 [7]. The conditions for this overlap are as follows. 
    
   Conditions for RFC 3016: 
   i. MPEG-4 video elementary streams only 
   ii. There MUST be a single VOP or Video Packet per RTP packet (only 
   recommended in RFC 3016) 
   iii. The decoder configuration MUST be signaled out-of-band, either 
   using the Config MIME parameter or using the OD framework 
    
   Conditions for this payload format: 
   i. No structural parameters defined (or all set to zero), i.e. 
   Single-SL mode with empty MSLH and empty RSLH.  
   ii. Receivers MUST be ready to accept (and ignore) video 
   configuration headers (e.g. VOSH, VO and VOL) and 
   visual-object-sequence-end-code transported in-band. 
    
6. Security Considerations 
    
   RTP packets using the payload format defined in this specification 
   are subject to the security considerations discussed in the RTP 
   specification [5]. This implies that confidentiality of the media 
   streams is achieved by encryption. Because the data compression used 
   with this payload format is applied end-to-end, encryption may be 
   performed on the compressed data so there is no conflict between the 
   two operations. The packet processing complexity of this payload 
   type (i.e. excluding media data processing) does not exhibit any 
   significant non-uniformity in the receiver side to cause a denial-
   of-service threat. 
    
   However, it is possible to inject non-compliant MPEG streams (Audio, 
   Video, and Systems) to overload the receiver/decoder's buffers which 
   might compromise the functionality of the receiver or even crash it. 
   This is especially true for end-to-end systems like MPEG where the 
   buffer models are precisely defined. 
    
   MPEG-4 Systems supports stream types including commands that are 
   executed on the terminal, like OD commands, BIFS commands, etc., 
   and programmatic content like MPEG-J (Java(TM) Byte Code) and 
   ECMAScript. It is possible to use one or more of the above in a 
   manner non-compliant with MPEG to crash the receiver or make it 
   temporarily unavailable. 
    
  
   Authentication mechanisms can be used to validate the sender and 
   the data to prevent security problems due to non-compliant malignant 
   MPEG-4 streams. 
    
   MPEG-4 Systems defines a security model for streams carrying MPEG-J 
   access units, which comprise Java(TM) classes and objects. MPEG-J 
   defines a set of Java APIs and a secure execution model. MPEG-J 
   content can call this set of APIs and Java(TM) methods from a set of 
   Java packages supported in the receiver within the defined security 
   model. According to this security model, downloaded byte code is 
   forbidden to load libraries, define native methods, start programs, 
   read or write files, or read system properties. 
    
   Receivers can implement intelligent filters to validate the buffer 
   requirements or parametric (OD, BIFS, etc.) or programmatic (MPEG-J, 
   ECMAScript) commands in the streams. However, this can increase the 
   complexity significantly. 
 
7. Acknowledgements 
   This document evolved across several years thanks to contributions 
   from a large number of people since it is based on work within the 
   IETF AVT working group and various ISO MPEG working groups, 
   especially the 4-on-IP ad-hoc group in the last stages. The authors 
   wish to thank Guido Franceschini, Art Howarth, Dave Mackie, Dave 
   Singer, and Stephan Wenger for their valuable comments. 
 
8. References 
 
   [1] ISO/IEC 14496-1:2001 MPEG-4 Systems  
    
   [2] ISO/IEC 14496-2:2001 MPEG-4 Visual 
    
   [3] ISO/IEC 14496-3:2001 MPEG-4 Audio 
    
   [4] ISO/IEC 14496-6:2001 Delivery Multimedia Integration Framework. 
    
   [5] H. Schulzrinne, S. Casner, R. Frederick, V. Jacobson, RTP: A 
   Transport Protocol for Real Time Applications, RFC 1889, Internet 
   Engineering Task Force, January 1996. 
    
   [6] S. Bradner, Key words for use in RFCs to Indicate Requirement 
   Levels, RFC 2119, Internet Engineering Task Force, March 1997. 
    
   [7] Y. Kikuchi, T. Nomura, S. Fukunaga, Y. Matsui, H. Kimata, RTP 
   payload format for MPEG-4 Audio/Visual streams, RFC 3016, Internet 
   Engineering Task Force, November 2000. 
    
   [8] B. Thompson, T. Koren, D. Wing, Tunneling multiplexed Compressed 
   RTP ("TCRTP"), work in progress, draft-ietf-avt-tcrtp-02.txt, 
   November 2000. 
    

   [9] D. Singer, Y. Lim, A Framework for the delivery of MPEG-4 over 
   IP-based Protocols, work in progress, draft-singer-mpeg4-ip-02.txt, 
   May 2001. 
    
   [10] M. Handley, V. Jacobson, SDP: Session Description Protocol, RFC 
   2327, Internet Engineering Task Force, April 1998. 
    
   [11] C. Roux et al., RTP Payload Format for MPEG-4 FlexMultiplexed 
   Streams, work in progress, draft-curet-avt-rtp-mpeg4-flexmux-00.txt, 
   February 2001. 
    
   [12] H. Schulzrinne, RTP Profile for Audio and Video Conferences 
   with Minimal Control, RFC 1890, Internet Engineering Task Force, 
   January 1996. 
    
   [13] H. Schulzrinne, A. Rao, R. Lanphier, Real Time Streaming 
   Protocol, RFC 2326, Internet Engineering Task Force, April 1998. 
    
   [14] M. Handley, C. Perkins, E. Whelan, Session Announcement 
   Protocol, RFC 2974, Internet Engineering Task Force, October 2000. 
    
    
9. Authors' Addresses 
    
   Olivier Avaro 
   France Telecom 
   35 A Schutzenhuttenweg 
   60598 Frankfurt am Main 
   Deutschland 
   e-mail: olivier.avaro@francetelecom.fr 
    
   Andrea Basso 
   AT&T Labs Research 
   200 Laurel Avenue 
   Middletown, NJ 07748 
   USA 
   e-mail: basso@research.att.com 
    
   Stephen L. Casner 
   Packet Design, Inc. 
   66 Willow Place 
   Menlo Park, CA 94025 
   USA 
   e-mail: casner@acm.org 
    
   M. Reha Civanlar 
   AT&T Labs - Research 
   100 Schultz Drive 
   Red Bank, NJ 07701 
   USA 
   e-mail: civanlar@research.att.com 
    
   Philippe Gentric 
   Philips Digital Networks - MP4Net 
   51 rue Carnot 
   92156 Suresnes 
   France 
   e-mail: philippe.gentric@philips.com 
    
   Carsten Herpel 
   THOMSON multimedia 
   Karl-Wiechert-Allee 74 
   30625 Hannover 
   Germany 
   e-mail: herpelc@thmulti.com 
    
   Zvi Lifshitz 
   Optibase Ltd. 
   7 Shenkar St. 
   Herzliya 46120 
   Israel 
   e-mail: zvil@optibase.com 
    
   Young-kwon Lim 
   mp4cast (MPEG-4 Internet Broadcasting Solution Consortium) 
   1001-1 Daechi-Dong Gangnam-Gu 
   Seoul, 305-333, 
   Korea 
   e-mail: young@techway.co.kr 
    
   Colin Perkins 
   USC Information Sciences Institute 
   4350 N. Fairfax Drive #620 
   Arlington, VA 22203 
   USA 
   e-mail: csp@isi.edu 
    
   Jan van der Meer 
   Philips Digital Networks 
   Cederlaan 4 
   5600 JB Eindhoven 
   Netherlands 
   e-mail: jan.vandermeer@philips.com 
    
    
APPENDIX: Examples of usage 
    
   This payload format has been designed to efficiently transport a 
   very versatile packetization scheme, the MPEG-4 Synch Layer; as a 
   result it is more complex than the average RTP payload format. 
   For this reason this section describes a number of key examples of 
   how this payload format can be used.  
    
   A C++-like syntax called SDL (Syntactic Description Language) 
   defined in [1, section 14] is used to economically describe MPEG-4 
   system data structures. 
    
   However, as discussed in section 2, this payload format can also be 
   used without explicit knowledge of SL (logically equivalent to 
   configuring the SL headers as being empty). Several examples 
   (Appendix 1, 3, 4, 5) cover this case. 
    
   Furthermore these examples assume that the (a=fmtp) SDP syntax is 
   used to convey the MIME parameters of the payload format. 
 
Appendix.1 RFC 3016 compatible MPEG-4 Video (no SL) 
    
   This is an example of a video stream where the SL is configured to 
   produce RTP packets compatible with RFC 3016. 
    
SLConfigDescriptor 
    
   In this example the SLConfigDescriptor is: 
    
   class SLConfigDescriptor extends BaseDescriptor : bit(8) 
   tag=SLConfigDescrTag { 
    bit(8) predefined; 
    if (predefined==0) { 
     bit(1) useAccessUnitStartFlag; = 0 
     bit(1) useAccessUnitEndFlag; = 1 
     bit(1) useRandomAccessPointFlag; = 0 
     bit(1) hasRandomAccessUnitsOnlyFlag; = 0 
     bit(1) usePaddingFlag; = 0 
     bit(1) useTimeStampsFlag; = 0 
     bit(1) useIdleFlag; = 0 
     bit(1) durationFlag; = 0 
     bit(32) timeStampResolution; = 0 
     bit(32) OCRResolution; = 0 
     bit(8) timeStampLength; = 0 
     bit(8) OCRLength; = 0 
     bit(8) AU_Length; = 0 
     bit(8) instantBitrateLength; = 0 
     bit(4) degradationPriorityLength; = 0 
     bit(5) AU_seqNumLength; = 0 
     bit(5) packetSeqNumLength; = 0 
     bit(2) reserved=0b11; 
    } 
    if (durationFlag) { 
     bit(32) timeScale; // NOT USED 
     bit(16) accessUnitDuration;  // NOT USED 
     bit(16) compositionUnitDuration;  // NOT USED 
    } 
    if (!useTimeStampsFlag) { 
     bit(timeStampLength) startDecodingTimeStamp; = 0 
     bit(timeStampLength) startCompositionTimeStamp; = 0 
    } 
   } 
 
SL Packet Header structure 
    
   With this configuration we have the following SL packet header 
   structure: 
    
   aligned(8) class SL_PacketHeader (SLConfigDescriptor SL) { 
    if (SL.useAccessUnitEndFlag) { 
     bit(1) accessUnitEndFlag; // 1 bit 
    } 
   } 
    
   In this case this payload format produces RTP packets that conform 
   exactly to RFC 3016, and the Synch Layer is reduced to a purely 
   logical construction that neither sender nor receiver needs to 
   implement. 
    
Parameters 
    
   This configuration is the default one; no parameters are required. 
    
RTP packet structure 
    
   Note that accessUnitEndFlag is mapped to the RTP header M bit. 
    
   +=========================================+=============+ 
   | Field                                   |  size       | 
   +=========================================+=============+ 
   | RTP header                              |    -        | 
   +-----------------------------------------+-------------+ 
   | SL packet payload                       | 1400 bytes  | 
   +-----------------------------------------+-------------+ 
 
Overhead 
 
   In this example we have an RTP overhead of 40 bytes for 1400 bytes 
   of payload i.e. 3 % overhead. 
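
   The figure can be reproduced with a short calculation. The 
   breakdown of the 40 bytes into IPv4, UDP and RTP headers is an 
   assumption of this sketch (the text above only gives the total), 
   and the function name is hypothetical. 

```python
IP_HEADER = 20   # bytes; assumes IPv4 without options
UDP_HEADER = 8   # bytes
RTP_HEADER = 12  # bytes; fixed RTP header without CSRC entries


def overhead_percent(payload_bytes, extra_header_bytes=0):
    """Header bytes expressed as a percentage of the payload bytes."""
    header = IP_HEADER + UDP_HEADER + RTP_HEADER + extra_header_bytes
    return 100.0 * header / payload_bytes


print(round(overhead_percent(1400), 1))  # 2.9
```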
 
Appendix.2 MPEG-4 Video with SL 
    
   Let us consider the case of a 30 frames per second MPEG-4 video 
   stream whose bit rate is high enough that Access Units have to be 
   split into several SL packets (typically above 300 kb/s). 
    
   Let us also assume that in that case the video codec generates Video 
   Packets that each fit in one SL packet, i.e. that the video codec is 
   MTU aware and the MTU is 1500 bytes. We assume furthermore that this 
   stream contains B frames and that decodingTimeStamps are present. 
    
  
SLConfigDescriptor 
    
   In this example the SLConfigDescriptor is: 
    
   class SLConfigDescriptor extends BaseDescriptor : bit(8) 
   tag=SLConfigDescrTag { 
    bit(8) predefined; 
    if (predefined==0) { 
     bit(1) useAccessUnitStartFlag; = 1 
     bit(1) useAccessUnitEndFlag; = 0 
     bit(1) useRandomAccessPointFlag; = 1 
     bit(1) hasRandomAccessUnitsOnlyFlag; = 0 
     bit(1) usePaddingFlag; = 0 
     bit(1) useTimeStampsFlag; = 1 
     bit(1) useIdleFlag; = 0 
     bit(1) durationFlag; = 0 
     bit(32) timeStampResolution; = 30 
     bit(32) OCRResolution; = 0 
     bit(8) timeStampLength; = 32 
     bit(8) OCRLength; = 0 
     bit(8) AU_Length; = 0 
     bit(8) instantBitrateLength; = 0 
     bit(4) degradationPriorityLength; = 0 
     bit(5) AU_seqNumLength; = 0 
     bit(5) packetSeqNumLength; = 0 
     bit(2) reserved=0b11; 
    } 
    if (durationFlag) { 
     bit(32) timeScale; // NOT USED 
     bit(16) accessUnitDuration;  // NOT USED 
     bit(16) compositionUnitDuration;  // NOT USED 
    } 
    if (!useTimeStampsFlag) { 
     bit(timeStampLength) startDecodingTimeStamp; // NOT USED 
     bit(timeStampLength) startCompositionTimeStamp; // NOT USED 
    } 
   } 
    
   The useRandomAccessPointFlag is set so that the 
   randomAccessPointFlag can indicate that the corresponding SL packet 
   contains a GOV and the first Video Packet of an Intra coded frame. 
    
SL Packet Header structure 
    
   With this configuration we have the following SL packet header 
   structure: 
    
   aligned(8) class SL_PacketHeader (SLConfigDescriptor SL) { 
    bit(1) accessUnitStartFlag; // 1 bit 
    if (accessUnitStartFlag) { 
      bit(1) randomAccessPointFlag; // 1 bit 
      bit(1) decodingTimeStampFlag; // 1 bit 
      bit(1) compositionTimeStampFlag; // 1 bit 
      if (decodingTimeStampFlag) { 
         bit(SL.timeStampLength) decodingTimeStamp; 
      } 
      if (compositionTimeStampFlag) { 
         bit(SL.timeStampLength) compositionTimeStamp; 
      } 
    } 
   } 
    
Parameters 
    
   decodingTimeStamps are encoded on 32 bits, which is much more than 
   needed for a delta. Therefore the sender uses DTSDeltaLength to 
   signal that only 7 bits are used for the coding of relative DTS in 
   the RTP packet. 
    
   The RSLHSectionSize cannot exceed 4 (the RSLH is at most 4 bits 
   long); it is therefore encoded on 3 bits, as signaled by 
   RSLHSectionSizeLength. The resulting concatenated fmtp line is: 
    
   a=fmtp:<format> DTSDeltaLength=7;RSLHSectionSizeLength=3 
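
   The value RSLHSectionSizeLength=3 can be checked with a small 
   calculation. The SL packet header of this example carries at most 
   four one-bit fields, so the largest section size to be signaled is 
   4 bits. The helper name bits_needed is hypothetical. 

```python
def bits_needed(max_value):
    """Smallest field width, in bits, that can represent max_value."""
    return max(1, max_value.bit_length())


# accessUnitStartFlag, randomAccessPointFlag, decodingTimeStampFlag
# and compositionTimeStampFlag: four bits at most.
print(bits_needed(4))  # 3
```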
    
RTP packet structure 
    
   Two cases can occur; for packets that transport first fragments of 
   Access Units we have: 
    
   +=========================================+=============+ 
   | Field                                   |  size       | 
   +=========================================+=============+ 
   | RTP header                              |    -        | 
   +-----------------------------------------+-------------+ 
   | DTSFlag = 1                             |  1 bit      | 
   +-----------------------------------------+-------------+ 
   | DTSDelta                                |  7 bits     | 
   +-----------------------------------------+-------------+ 
   | bits to byte alignment                  |  0 bits     | 
   +-----------------------------------------+-------------+ 
   | RSLHSectionSize = 4                     |  3 bits     | 
   +-----------------------------------------+-------------+ 
   | accessUnitStartFlag = 1                 |  1 bit      | 
   +-----------------------------------------+-------------+ 
   | randomAccessPointFlag                   |  1 bit      | 
   +-----------------------------------------+-------------+ 
   | decodingTimeStampFlag                   |  1 bit      | 
   +-----------------------------------------+-------------+ 
   | compositionTimeStampFlag                |  1 bit      | 
   +-----------------------------------------+-------------+ 
   | bits to byte alignment                  |  1 bit      | 
   +-----------------------------------------+-------------+ 
   | SL packet payload                       |  N bytes    | 
   +-----------------------------------------+-------------+ 
    
 
   For packets that transport non-first fragments of Access Units we 
   have: 
    
   +=========================================+=============+ 
   | Field                                   |  size       | 
   +=========================================+=============+ 
   | RTP header                              |    -        | 
   +-----------------------------------------+-------------+ 
   | DTSFlag = 0                             |  1 bit      | 
   +-----------------------------------------+-------------+ 
   | bits to byte alignment                  |  7 bits     | 
   +-----------------------------------------+-------------+ 
   | RSLHSectionSize = 1                     |  3 bits     | 
   +-----------------------------------------+-------------+ 
   | accessUnitStartFlag = 0                 |  1 bit      | 
   +-----------------------------------------+-------------+ 
   | bits to byte alignment                  |  4 bits     | 
   +-----------------------------------------+-------------+ 
   | SL packet payload                       |  N bytes    | 
   +-----------------------------------------+-------------+ 
    
Overhead estimation 
    
   In this example we have an RTP overhead of 40 + 2 bytes for 1400 
   bytes of payload, i.e. 3 % overhead. 
 
Appendix.3 Low delay MPEG-4 Audio (no SL) 
    
   This example is for a low delay audio service. For this reason a 
   single SL packet is transported in each RTP packet. Each SL packet 
   contains a complete Access Unit. 
    
SLConfigDescriptor 
    
   Since CTS=DTS and the Access Unit duration is constant, signaling 
   of MPEG-4 time stamps is not needed (the durationFlag of SLConfig 
   is set). 
    
   We also assume here an audio Object Type for which all Access Units 
   are Random Access Points, which is signaled using the 
   hasRandomAccessUnitsOnlyFlag in the SLConfigDescriptor. 
    
   We assume furthermore a mode where the Access Unit size is constant 
   and equal to 5 bytes (which is signaled with AU_Length). 
    
   In this example the SLConfigDescriptor is: 
    
   class SLConfigDescriptor extends BaseDescriptor : bit(8) 
   tag=SLConfigDescrTag { 
    bit(8) predefined; 
    if (predefined==0) { 
     bit(1) useAccessUnitStartFlag; = 0 
     bit(1) useAccessUnitEndFlag; = 0 
     bit(1) useRandomAccessPointFlag; = 0 
     bit(1) hasRandomAccessUnitsOnlyFlag; = 1 
     bit(1) usePaddingFlag; = 0 
     bit(1) useTimeStampsFlag; = 0 
     bit(1) useIdleFlag; = 0 
     bit(1) durationFlag; = 1 // signals constant AU duration 
     bit(32) timeStampResolution; = 0 
     bit(32) OCRResolution; = 0 
     bit(8) timeStampLength; = 0 
     bit(8) OCRLength; = 0 
     bit(8) AU_Length; = 5 
     bit(8) instantBitrateLength; = 0 
     bit(4) degradationPriorityLength; = 0 
     bit(5) AU_seqNumLength; = 0 
     bit(5) packetSeqNumLength; = 0 
     bit(2) reserved=0b11; 
    } 
    if (durationFlag) { 
     bit(32) timeScale; = 1000 // for milliseconds 
     bit(16) accessUnitDuration; = 10 // ms 
     bit(16) compositionUnitDuration; = 10 // ms 
    } 
    if (!useTimeStampsFlag) { 
     bit(timeStampLength) startDecodingTimeStamp; = 0 
     bit(timeStampLength) startCompositionTimeStamp; = 0 
    } 
   } 
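    
   Since no time stamps are carried, a receiver has to regenerate them 
   from the durationFlag fields. The following is a minimal sketch in 
   Python, assuming Access Units are counted from 0 and the start time 
   stamp is 0 as configured above; the function name is illustrative 
   only and is not defined by this specification. 

```python
from fractions import Fraction


def au_time_seconds(n, access_unit_duration=10, time_scale=1000, start=0):
    """Time of the n-th Access Unit (n counted from 0), in seconds.

    access_unit_duration and start are expressed in ticks of time_scale,
    mirroring accessUnitDuration and timeScale in the descriptor above.
    """
    return Fraction(start + n * access_unit_duration, time_scale)


# With timeScale = 1000 and accessUnitDuration = 10, one Access Unit
# is due every 10 ms.
first_three = [au_time_seconds(n) for n in range(3)]
```

   Exact fractions are used here only to avoid rounding drift over 
   long sessions; an integer tick counter would serve equally well. 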
    
SL packet header 
    
   With this configuration the SL packet header is empty. The Synch 
   Layer is reduced to a purely logical construction that neither 
   sender nor receiver need to implement. 
    
Parameters 
    
   No parameters are required. 
    
RTP packet structure 
    
   Note that the RTP header M bit should always be set to 1. 
    
   +=========================================+=============+ 
   | Field                                   |  size       | 
   +=========================================+=============+ 
   | RTP header                              |    -        | 
   +-----------------------------------------+-------------+ 
   | SL packet payload                       |  5 bytes    | 
   +-----------------------------------------+-------------+ 
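    
   As an illustration, such a packet could be assembled as sketched 
   below. This assumes the 12-byte fixed RTP header without CSRC 
   entries or header extension; the dynamic payload type 96 and the 
   SSRC value are arbitrary choices made for the example. 

```python
import struct

RTP_VERSION = 2


def build_rtp_packet(seq, timestamp, ssrc, payload, payload_type=96, marker=True):
    """Return a 12-byte fixed RTP header followed by the payload."""
    byte0 = RTP_VERSION << 6                      # V=2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    header = struct.pack("!BBHII", byte0, byte1,
                         seq & 0xFFFF,
                         timestamp & 0xFFFFFFFF,
                         ssrc & 0xFFFFFFFF)
    return header + payload


# One 5-byte Access Unit per packet, M bit set (hypothetical values).
packet = build_rtp_packet(seq=1, timestamp=0, ssrc=0x12345678, payload=bytes(5))
```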
    
    
Overhead estimation 
    
  
   The overhead is extremely large, i.e. 800 %, since 40 bytes of 
   headers are required to transport 5 bytes of data. Note however 
   that RTP header compression would work well since time stamp 
   increments are constant. 
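    
   The overhead figures quoted in these appendices are simple ratios 
   of header bytes to payload bytes. A minimal sketch follows; the 
   split of the 40 header bytes into IPv4, UDP and RTP headers is an 
   assumption of the sketch, as the text only gives the total. 

```python
IP_UDP_RTP_BYTES = 20 + 8 + 12   # assumed IPv4 + UDP + fixed RTP header


def overhead_percent(header_bytes, payload_bytes):
    """Header bytes expressed as a percentage of payload bytes."""
    return 100.0 * header_bytes / payload_bytes


low_delay_audio = overhead_percent(IP_UDP_RTP_BYTES, 5)
```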
    
    
Appendix.4 Media delivery MPEG-4 Audio (no SL) 
    
   This example is for a media delivery service where delay is not an 
   issue but efficiency is. In this case several SL Packets are 
   transported in each RTP packet. 
    
SLConfigDescriptor 
    
   Similar to previous example. 
    
SL packet header 
    
   With this configuration the SL packet header is empty. The Synch 
   Layer is reduced to a purely logical construction that neither 
   sender nor receiver need to implement. 
    
Parameters 
    
   The absence of RSLHSectionSizeLength indicates that the RSLHSection 
   is empty. 
    
   The size of SL Packets (which are all complete Access Units in this 
   case) is constant and is indicated with: 
    
   a=fmtp:<format> ConstantSize=5 
    
   This also indicates to the receiver that the Multiple-SL mode will 
   be used. The 2-byte field that would give the size of the 
   MSLHSection is omitted, since in this case this field would always 
   contain zero (the MSLHSection is always empty). 
    
    
RTP packet structure 
    
   Note that the RTP header M bit is always set to 1, which indicates 
   to the receiver that only complete Access Units are transported. 
    
   +=========================================+=============+ 
   | Field                                   |  size       | 
   +=========================================+=============+ 
   | RTP header                              |    -        | 
   +-----------------------------------------+-------------+ 
   | SL packet payload                       |  5 bytes    | 
   +-----------------------------------------+-------------+ 
   | SL packet payload                       |  5 bytes    | 
   +-----------------------------------------+-------------+ 
   | etc, until MTU is reached                             | 
   +-----------------------------------------+-------------+ 
   | SL packet payload                       |  5 bytes    | 
   +-----------------------------------------+-------------+ 
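    
   On the receiving side, the payload of such a packet can be cut 
   into Access Units using ConstantSize alone. The sketch below is 
   illustrative only: the function names are hypothetical and the 
   40-byte header budget is the one assumed throughout these 
   examples. 

```python
def split_constant_size(payload, constant_size=5):
    """Cut an RTP payload into Access Units of ConstantSize bytes each."""
    if len(payload) % constant_size != 0:
        raise ValueError("payload length is not a multiple of ConstantSize")
    return [payload[i:i + constant_size]
            for i in range(0, len(payload), constant_size)]


def max_units_per_packet(mtu=1500, header_bytes=40, constant_size=5):
    """Number of whole Access Units that fit below the MTU."""
    return (mtu - header_bytes) // constant_size


units = split_constant_size(bytes(range(15)))
```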
    
Overhead estimation 
    
   The overhead is 3% i.e. minimal. 
 
 
Appendix.5 AAC with interleaving (no SL) 
    
   Let us consider AAC at 128 kb/s where each Access Unit is on 
   average 320 bytes. Interleaving is applied with a continuous 
   interleaving scheme (see table below) where 4 Access Units are used 
   to construct each RTP packet in order to match an MTU of 1500 bytes. 
    
   IndexDelta is constant and equal to 2 (since +1 is automatically 
   added); it is encoded on 3 bits. 
    
   Index (being encoded on 3 bits) rolls over very fast and is not very 
   useful for reordering. However, this is a case, as explained in 
   section 3.8, where time stamps should be used for de-interleaving. 
   Receivers know that each SL packet is a complete Access Unit because 
   all RTP packets have the M bit set to 1. Therefore, since the Access 
   Unit duration is constant, Access Unit timestamps can be computed 
   from RTP timestamps and IndexDelta values; this can be used for 
   de-interleaving even in case of losses. 
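    
   The computation can be sketched as follows. The sketch assumes 
   that the RTP timestamp of a packet is the time stamp of its first 
   Access Unit, that the Access Unit duration is expressed in RTP 
   clock ticks, and that timestamp wrap-around can be ignored; the 
   function names are hypothetical. 

```python
def au_timestamps(rtp_timestamp, index_deltas, au_duration):
    """Time stamp of every Access Unit in one RTP packet.

    Each later Access Unit is (IndexDelta + 1) durations after the
    previous one in the same packet.
    """
    stamps = [rtp_timestamp]
    for delta in index_deltas:
        stamps.append(stamps[-1] + (delta + 1) * au_duration)
    return stamps


def deinterleave(packets, au_duration):
    """packets: iterable of (rtp_timestamp, index_deltas, access_units).

    Returns the received Access Units in time stamp order. Lost packets
    simply leave gaps; no Index value is needed.
    """
    timed = []
    for rtp_timestamp, index_deltas, units in packets:
        stamps = au_timestamps(rtp_timestamp, index_deltas, au_duration)
        timed.extend(zip(stamps, units))
    return [unit for _, unit in sorted(timed, key=lambda pair: pair[0])]
```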
    
    
   +-----------------------------------------------------------------+ 
   | RTP packet | RTP Timestamp |    AUs          | Index,IndexDelta | 
   +-----------------------------------------------------------------+ 
   |    1       |   CTS(AU1)    |             1   |  1               | 
   +-----------------------------------------------------------------+ 
   |    2       |   CTS(AU2)    |          2, 5   |  2,2             | 
   +-----------------------------------------------------------------+ 
   |    3       |   CTS(AU3)    |       3, 6, 9   |  3,2,2           | 
   +-----------------------------------------------------------------+ 
   |    4       |   CTS(AU4)    |    4, 7,10,13   |  4,2,2,2         | 
   +-----------------------------------------------------------------+ 
   |    5       |   CTS(AU8)    |    8,11,14,17   |  0,2,2,2         | 
   +-----------------------------------------------------------------+ 
   |    6       |   CTS(AU12)   |   12,15,18,21   |  4,2,2,2         | 
   +-----------------------------------------------------------------+ 
   |    7       |   CTS(AU16)   |   16,19,22,25   |  0,2,2,2         | 
   +-----------------------------------------------------------------+ 
   |    8       |   CTS(AU20)   |   20,23,26,29   |  4,2,2,2         | 
   +-----------------------------------------------------------------+ 
   |    9       |   CTS(AU24)   |   24,27,30,33   |  0,2,2,2         | 
   +-----------------------------------------------------------------+ 
   |    10      |   CTS(AU28)   |   28,31,34,37   |  4,2,2,2         | 
   +-----------------------------------------------------------------+ 
   |                              etc                                | 
   +-----------------------------------------------------------------+ 
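    
   The table above can be regenerated with a few lines of code. This 
   is only a sketch of this particular schedule (4 Access Units per 
   packet, a step of 3 between them, a 3-bit Index); the constant and 
   function names are hypothetical. 

```python
AUS_PER_PACKET = 4     # Access Units per RTP packet in steady state
STRIDE = 3             # AU number step inside a packet (IndexDelta + 1)
START_UP_PACKETS = 3   # packets 1..3 carry fewer than 4 Access Units


def interleaving_schedule(num_packets, index_bits=3):
    """Return rows of (packet number, AU numbers, Index, IndexDelta list)."""
    rows = []
    for k in range(1, num_packets + 1):
        first = AUS_PER_PACKET * (k - START_UP_PACKETS)
        aus = [first + STRIDE * j for j in range(AUS_PER_PACKET)]
        aus = [au for au in aus if au >= 1]   # AUs before AU1 do not exist
        index = aus[0] % (1 << index_bits)    # Index rolls over on 3 bits
        deltas = [b - a - 1 for a, b in zip(aus, aus[1:])]
        rows.append((k, aus, index, deltas))
    return rows


rows = interleaving_schedule(10)
```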
    
SLConfigDescriptor  
    
   Similar to previous example. 
    
SL Packet Header 
    
   Similar to previous example (empty). 
    
Parameters 
    
   The resulting concatenated fmtp line is: 
    
   a=fmtp:<format> SizeLength=13;IndexLength=3;IndexDeltaLength=3 
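    
   A receiver has to extract these parameters before it can parse the 
   payload. A minimal parsing sketch follows, assuming that all 
   parameter values are decimal integers and that parameters are 
   separated by semicolons as shown; the payload type 96 used below 
   is a hypothetical dynamic value. 

```python
def parse_fmtp(line):
    """Split an 'a=fmtp:' line into (format, {parameter: integer value})."""
    prefix, _, rest = line.partition(":")
    if prefix != "a=fmtp":
        raise ValueError("not an fmtp attribute")
    fmt, _, params = rest.partition(" ")
    result = {}
    for item in params.split(";"):
        item = item.strip()
        if not item:
            continue
        name, _, value = item.partition("=")
        result[name.strip()] = int(value)
    return fmt, result


fmt, params = parse_fmtp(
    "a=fmtp:96 SizeLength=13;IndexLength=3;IndexDeltaLength=3")
```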
    
RTP packet structure 
    
   +=========================================+=============+ 
   | Field                                   |  size       | 
   +=========================================+=============+ 
   | RTP header                              |    -        | 
   +-----------------------------------------+-------------+ 
                         MSLHSection 
   +=========================================+=============+ 
   | MSLHSection size in bits = 64           |  2 bytes    | 
   +-----------------------------------------+-------------+ 
   | PayloadSize                             |  13 bits    | 
   +-----------------------------------------+-------------+ 
   | Index                                   |  3 bits     | 
   +-----------------------------------------+-------------+ 
   | PayloadSize                             |  13 bits    | 
   +-----------------------------------------+-------------+ 
   | IndexDelta                              |  3 bits     | 
   +-----------------------------------------+-------------+ 
   | PayloadSize                             |  13 bits    | 
   +-----------------------------------------+-------------+ 
   | IndexDelta                              |  3 bits     | 
   +-----------------------------------------+-------------+ 
   | PayloadSize                             |  13 bits    | 
   +-----------------------------------------+-------------+ 
   | IndexDelta                              |  3 bits     | 
   +-----------------------------------------+-------------+ 
   | bits to byte alignment                  |  0 bits     | 
   +-----------------------------------------+-------------+ 
                         SLPPSection 
   +=========================================+=============+ 
   | AAC Access Unit                         |   x bytes   | 
   +-----------------------------------------+-------------+ 
   | AAC Access Unit                         |   x bytes   | 
   +-----------------------------------------+-------------+ 
   | AAC Access Unit                         |   x bytes   | 
   +-----------------------------------------+-------------+ 
   | AAC Access Unit                         |   x bytes   | 
   +-----------------------------------------+-------------+ 
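    
   With four entries of 13 + 3 bits each, the per-packet headers 
   occupy 64 bits and need no alignment bits. The following sketch 
   packs such an MSLHSection. It assumes that the 2-byte size field 
   counts the header bits that follow it (excluding itself and any 
   alignment bits) and that all fields are written most significant 
   bit first; the function name is hypothetical. 

```python
def pack_mslh_section(sizes, first_index, index_delta,
                      size_len=13, index_len=3):
    """Pack (PayloadSize, Index or IndexDelta) pairs behind a 2-byte size."""
    bits = ""
    for position, size in enumerate(sizes):
        if size >= 1 << size_len:
            raise ValueError("payload size does not fit in SizeLength bits")
        index_field = first_index if position == 0 else index_delta
        bits += format(size, "0{}b".format(size_len))
        bits += format(index_field, "0{}b".format(index_len))
    length_in_bits = len(bits)
    bits += "0" * (-length_in_bits % 8)        # bits to byte alignment
    body = int(bits, 2).to_bytes(len(bits) // 8, "big") if bits else b""
    return length_in_bits.to_bytes(2, "big") + body


# Four Access Units of 320 bytes, first Index 4, IndexDelta 2.
section = pack_mslh_section([320, 320, 320, 320], first_index=4, index_delta=2)
```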
    
    
Overhead estimation 
    
   The MSLHSection is 8 bytes; in this example we have therefore an RTP 
   overhead of 40 + 8 bytes for 1400 bytes (approx) of payload, i.e. 
   around 4 % overhead. 
 
 
Appendix.6 A more complex case: AAC with interleaving and SL 
    
   Let us consider AAC around 130 kb/s where each Access Unit is split 
   into 4 SL packets corresponding to Error Sensitivity Categories 
   (ESC) of maximum 90 bytes, for which interleaving is very useful in 
   terms of error resilience. We thus use an interleaving scheme where 
   15 SL Packets (extracted from 15 consecutive Access Units) are used 
   to construct each RTP packet in order to match an MTU of 1500 bytes. 
   Note that since ESC fragments are not byte aligned we also use the 
   paddingFlag and paddingBits features of the Synch Layer. 
    
   The interleaving sequence is 4 RTP packets and 350 ms long, which is 
   too long for conferencing but perfectly OK for Internet radio. 
    
   Since the sequence contains 60 SL packets, the sequence number can 
   be encoded on 6 bits. However 2 bits are actually enough if the 
   sender always resets the SL packet sequence number to zero at the 
   start of each sequence, since only the first MSLH in each of the 4 
   RTP packets in the sequence carries an absolute sequence number 
   value (0,1,2,3). 
    
   2 bits are also enough for IndexDelta, which is constant and equal 
   to 3 (since +1 is automatically added). 
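    
   The numbering can be made explicit with a short sketch. It assumes 
   that within one sequence the SL packets are numbered Access Unit 
   by Access Unit, so that ESC fragment p of Access Unit a carries 
   sequence number 4*a + p; this numbering is inferred from the 
   values quoted above and the names are hypothetical. 

```python
AUS_PER_SEQUENCE = 15   # Access Units covered by one interleaving sequence
PACKETS_PER_AU = 4      # SL packets (ESC fragments) per Access Unit


def rtp_packet_layout(p):
    """SL packets carried by the p-th RTP packet of a sequence (p = 0..3).

    Returns (sequence numbers, Index of the first one, IndexDelta list).
    """
    seq_numbers = [PACKETS_PER_AU * au + p for au in range(AUS_PER_SEQUENCE)]
    index = seq_numbers[0]
    index_deltas = [b - a - 1 for a, b in zip(seq_numbers, seq_numbers[1:])]
    return seq_numbers, index, index_deltas


layouts = [rtp_packet_layout(p) for p in range(PACKETS_PER_AU)]
highest = max(seq for seqs, _, _ in layouts for seq in seqs)
bits_for_absolute_numbering = highest.bit_length()
```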
    
   Note that the 4th RTP packet in each sequence has its M bit set to 1 
   since it contains 15 SL packets transporting the end of 15 
   consecutive Access Units. 
    
   With this scheme a sender can, for example upon reception of RTCP 
   reports indicating high loss rates, choose to duplicate for each 
   interleaving sequence the first RTP packet, which contains the most 
   useful data in terms of ESC, or apply other error protection 
   techniques, with due care to congestion issues. 
    
   In this example we will also show several other SL features (OCR, AU 
   boundary flags, padding, as detailed below). 
    
   One feature demonstrated by this example is the degradation 
   priority. We assume degradation priority can take 4 different 
   values, mapped to Error Sensitivity Categories, and is encoded on 2 
   bits. This interleaving scheme makes sure that only SL packets of 
   identical degradation priorities are grouped in the same RTP packet 
   (3.6.3) and that only the first RSLH of each RTP packet transports 
   the degradation priority. 
    
   We also assume that the server inserts an OCR in the last SL packet 
   of each RTP packet. 
    
    
SLConfigDescriptor  
    
   In this example the SLConfigDescriptor is: 
    
   class SLConfigDescriptor extends BaseDescriptor : bit(8) 
   tag=SLConfigDescrTag { 
    bit(8) predefined; 
    if (predefined==0) { 
     bit(1) useAccessUnitStartFlag; = 1 
     bit(1) useAccessUnitEndFlag; = 1 
     bit(1) useRandomAccessPointFlag; = 0 
     bit(1) hasRandomAccessUnitsOnlyFlag; = 1 
     bit(1) usePaddingFlag; = 1 // we need to signal padding bits 
     bit(1) useTimeStampsFlag; = 0 
     bit(1) useIdleFlag; = 0 
     bit(1) durationFlag; = 1 
     bit(32) timeStampResolution; = 0 
     bit(32) OCRResolution; = 30 
     bit(8) timeStampLength; = 0 
     bit(8) OCRLength; = 32 
     bit(8) AU_Length; = 0 
     bit(8) instantBitrateLength; = 0 
     bit(4) degradationPriorityLength; = 2 
     bit(5) AU_seqNumLength; = 0 
     bit(5) packetSeqNumLength; = 6 
     bit(2) reserved=0b11; 
    } 
    if (durationFlag) { 
     bit(32) timeScale; = 44100 // ticks/s (44.1 kHz sampling assumed) 
     bit(16) accessUnitDuration; = 1024 // ticks, i.e. 23.22 ms 
     bit(16) compositionUnitDuration; = 1024 // ticks, i.e. 23.22 ms 
    } 
    if (!useTimeStampsFlag) { 
     bit(timeStampLength) startDecodingTimeStamp; = 0 
     bit(timeStampLength) startCompositionTimeStamp; = 0 
    } 
   } 
    
SL Packet Header structure 
    
   With this configuration we have the following SL packet header 
   structure: 
    
  
   aligned(8) class SL_PacketHeader (SLConfigDescriptor SL) { 
    bit(1) accessUnitStartFlag;  
    bit(1) accessUnitEndFlag; 
    bit(1) OCRflag; 
    bit(1) paddingFlag; 
    if (paddingFlag) bit(3) paddingBits; 
    bit(SL.packetSeqNumLength) packetSequenceNumber; 
    bit(1) DegPrioflag; 
    if (DegPrioflag) { 
     bit(SL.degradationPriorityLength) degradationPriority;} 
    if (OCRflag) { 
     bit(SL.OCRLength) objectClockReference;} 
   } 
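    
   The size of this header depends on which optional fields are 
   present. The sketch below simply adds up the field widths of the 
   class above for the lengths configured in this example 
   (packetSeqNumLength = 6, degradationPriorityLength = 2, 
   OCRLength = 32); the function names are hypothetical. 

```python
def sl_header_bits(padding_flag, deg_prio_flag, ocr_flag,
                   packet_seq_num_length=6,
                   degradation_priority_length=2,
                   ocr_length=32):
    """Number of bits in the SL packet header before byte alignment."""
    bits = 4                        # AU start, AU end, OCRflag, paddingFlag
    if padding_flag:
        bits += 3                   # paddingBits
    bits += packet_seq_num_length   # packetSequenceNumber
    bits += 1                       # DegPrioflag
    if deg_prio_flag:
        bits += degradation_priority_length
    if ocr_flag:
        bits += ocr_length          # objectClockReference
    return bits


def sl_header_bytes(padding_flag, deg_prio_flag, ocr_flag):
    """Header size after alignment to a byte boundary (aligned(8))."""
    return (sl_header_bits(padding_flag, deg_prio_flag, ocr_flag) + 7) // 8
```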
    
Parameters 
    
   The RSLHSectionSize cannot exceed 2 bits, which is encoded on 2 bits 
   and signaled by RSLHSectionSizeLength. 
    
   The resulting concatenated fmtp line is: 
    
   a=fmtp:<format> SizeLength=7;RSLHSectionSizeLength=2;IndexLength=2;
   IndexDeltaLength=2;OCRDeltaLength=16 
    
RTP packet structure 
    
   +=========================================+=============+ 
   | Field                                   |  size       | 
   +=========================================+=============+ 
   | RTP header                              |    -        | 
   +-----------------------------------------+-------------+ 
                         MSLHSection 
   +=========================================+=============+ 
   | MSLHSection size in bits = 135          |  2 bytes    | 
   +-----------------------------------------+-------------+ 
   | PayloadSize                             |  7 bits     | 
   +-----------------------------------------+-------------+ 
   | Index = 0 or 1 or 2 or 3                |  2 bits     | 
   +-----------------------------------------+-------------+ 
   | PayloadSize                             |  7 bits     | 
   +-----------------------------------------+-------------+ 
   | IndexDelta = 3                          |  2 bits     | 
   +-----------------------------------------+-------------+ 
   |            etc + 12 times 9 bits                      | 
   +-----------------------------------------+-------------+ 
   | PayloadSize                             |  7 bits     | 
   +-----------------------------------------+-------------+ 
   | IndexDelta = 3                          |  2 bits     | 
   +-----------------------------------------+-------------+ 
   | bits to byte alignment                  |  1 bit      | 
   +-----------------------------------------+-------------+ 
                         RSLHSection 
   +=========================================+=============+ 
   | RSLHSectionSize                         |  6 bits     | 
   +-----------------------------------------+-------------+ 
   | accessUnitStartFlag                     |  1 bit      | 
   +-----------------------------------------+-------------+ 
   | accessUnitEndFlag                       |  1 bit      | 
   +-----------------------------------------+-------------+ 
   | OCRFlag = 0                             |  1 bit      | 
   +-----------------------------------------+-------------+ 
   | paddingFlag = 1                         |  1 bit      | 
   +-----------------------------------------+-------------+ 
   | paddingBits                             |  3 bits     | 
   +-----------------------------------------+-------------+ 
   | DegPrioflag = 1                         |  1 bit      | 
   +-----------------------------------------+-------------+ 
   | degradationPriority                     |  2 bits     | 
   +-----------------------------------------+-------------+ 
   | accessUnitStartFlag                     |  1 bit      | 
   +-----------------------------------------+-------------+ 
   | accessUnitEndFlag                       |  1 bit      | 
   +-----------------------------------------+-------------+ 
   | OCRFlag = 0                             |  1 bit      | 
   +-----------------------------------------+-------------+ 
   | paddingFlag = 1                         |  1 bit      | 
   +-----------------------------------------+-------------+ 
   | paddingBits                             |  3 bits     | 
   +-----------------------------------------+-------------+ 
   | DegPrioflag = 0                         |  1 bit      | 
   +-----------------------------------------+-------------+ 
   |              etc + 12 times 8 bits                    | 
   +-----------------------------------------+-------------+ 
   | accessUnitStartFlag                     |  1 bit      | 
   +-----------------------------------------+-------------+ 
   | accessUnitEndFlag                       |  1 bit      | 
   +-----------------------------------------+-------------+ 
   | OCRFlag = 1                             |  1 bit      | 
   +-----------------------------------------+-------------+ 
   | OCRDelta                                |  16 bits    | 
   +-----------------------------------------+-------------+ 
   | paddingFlag = 0                         |  1 bit      | 
   +-----------------------------------------+-------------+ 
   | DegPrioflag = 0                         |  1 bit      | 
   +-----------------------------------------+-------------+ 
   | bits to byte alignment                  |  5 bits     | 
   +-----------------------------------------+-------------+ 
                         SLPPSection 
   +=========================================+=============+ 
   | SL packet payload                       |max 90 bytes | 
   +-----------------------------------------+-------------+ 
   |             etc + 13  SL packets                      | 
   +-----------------------------------------+-------------+ 
   | SL packet payload                       |max 90 bytes | 
   +-----------------------------------------+-------------+ 
    
    
   Note that in the above table the last SL packet in the RTP packet 
   has a payload that is byte-aligned (at the end). When this happens 
   paddingFlag is set to zero and the paddingBits field is omitted.  
    
Overhead estimation 
    
   The MSLHSection is 19 bytes, the RSLHSection is 16 bytes; in this 
   example we have therefore an RTP overhead of 40 + 35 bytes for 1350 
   bytes (max) of payload, i.e. around 6 % overhead. 
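    
   The MSLHSection figure can be checked with a short calculation. 
   The sketch assumes 15 entries of 7 + 2 bits each, alignment to 
   the next byte boundary, and that the 2-byte size field is counted 
   as part of the section; the RSLHSection size of 16 bytes is taken 
   from the text as given. The function name is hypothetical. 

```python
def mslh_section_size(num_sl_packets, size_len, index_len, size_field_bytes=2):
    """Return (header bits, alignment bits, total bytes incl. size field)."""
    header_bits = num_sl_packets * (size_len + index_len)
    padding_bits = -header_bits % 8
    total_bytes = size_field_bytes + (header_bits + padding_bits) // 8
    return header_bits, padding_bits, total_bytes


header_bits, padding_bits, mslh_bytes = mslh_section_size(15, 7, 2)
rslh_bytes = 16                      # value quoted in the text
overhead = 100.0 * (40 + mslh_bytes + rslh_bytes) / (15 * 90)
```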
 
