Internet Engineering Task Force                  Avaro-France Telecom
Internet Draft                                             Basso-AT&T
                                                 Casner-Packet Design
                                                        Civanlar-AT&T
                                                      Gentric-Philips
                                                       Herpel-Thomson
                                                    Lifshitz-Optibase
                                                          Lim-mp4cast
                                                          Perkins-ISI
                                                 van der Meer-Philips
 draft-gentric-avt-mpeg4-singlesl-00.txt

                                                        December 2000
                                                    Expires June 2001


                 RTP Payload Format for MPEG-4 Streams


                          Status of this Memo

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other groups
may also distribute working documents as Internet-Drafts. Internet-
Drafts are draft documents valid for a maximum of six months and may be
updated, replaced, or obsoleted by other documents at any time. It is
inappropriate to use Internet-Drafts as reference material or to cite
them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

This specification is a product of the Audio/Video Transport working
group within the Internet Engineering Task Force and ISO/IEC MPEG-4 ad
hoc group on MPEG-4 over Internet. Comments are solicited and should be
addressed to the working group's mailing list at rem-conf@es.net and/or
the authors.


                                Abstract

This document describes a payload format for transporting MPEG-4 encoded
data using RTP. MPEG-4 is a recent standard from ISO/IEC for the coding
of natural and synthetic audio-visual data. Several services provided by
RTP are beneficial for MPEG-4 encoded data transport over the Internet.
Additionally, the use of RTP makes it possible to synchronize MPEG-4
data with other real-time data types.



Gentric et al.                                                  1








RTP Payload Format for MPEG-4 Streams               December 2000







1. Introduction

MPEG-4 is a recent standard from ISO/IEC for the coding of natural and
synthetic audio-visual data in the form of audiovisual objects that are
arranged into an audiovisual scene by means of a scene description
[1][2][3][4]. This draft specifies an RTP [5] payload format for
transporting MPEG-4 encoded data streams.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [6].

The benefits of using RTP for MPEG-4 data stream transport include:

i. Ability to synchronize MPEG-4 streams with other RTP payloads

ii. Monitoring MPEG-4 delivery performance through RTCP

iii. Combining MPEG-4 and other real-time data streams received from
multiple end-systems into a set of consolidated streams through RTP
mixers

iv. Converting data types, etc. through the use of RTP translators.

1.1 Overview of MPEG-4 End-System Architecture

Fig. 1 below shows the general layered architecture of MPEG-4 terminals.
The Compression Layer processes individual audio-visual media streams.
The MPEG-4 compression schemes are defined in the ISO/IEC specifications
14496-2 [2] and 14496-3 [3]. The compression schemes in MPEG-4 achieve
efficient encoding over a bandwidth ranging from several Kbps to many
Mbps. The audio-visual content compressed by this layer is organized
into Elementary Streams (ESs). The MPEG-4 standard specifies MPEG-4
compliant streams. Within the constraint of this compliance the
compression layer is unaware of a specific delivery technology, but it
can be made to react to the characteristics of a particular delivery
layer such as the path-MTU or loss characteristics. Also, some
compressors can be designed to be delivery specific for implementation
efficiency.  In such cases the compressor may work in a non-optimal
fashion with delivery technologies that are different than the one it is
specifically designed to operate with.

The hierarchical relations, location and properties of ESs in a
presentation are described by a dynamic set of Object Descriptors (ODs).
Each OD groups one or more ES Descriptors referring to a single content
item (audio-visual object). Hence, multiple alternative or hierarchical
representations of each content item are possible.


ODs are themselves conveyed through one or more ESs. A complete set of
ODs can be seen as an MPEG-4 resource or session description at a stream
level. The resource description may itself be hierarchical, i.e. an ES
conveying an OD may describe other ESs conveying other ODs.

The session description is accompanied by a dynamic scene description,
Binary Format for Scene (BIFS), again conveyed through one or more ESs.
At this level, content is identified in terms of audio-visual objects.
The spatio-temporal location of each object is defined by BIFS. The
audio-visual content of those objects that are synthetic and static is
also described by BIFS. Natural and animated synthetic objects may refer
to an OD that points to one or more ESs that carry the coded
representation of the object or its animation data.

By conveying the session (or resource) description as well as the scene
(or content composition) description through their own ESs, it is made
possible to change portions of the content composition and the number
and properties of media streams that carry the audio-visual content
separately and dynamically at well known instants in time.

One or more initial Scene Description streams and the corresponding OD
stream have to be pointed to by an initial object descriptor (IOD). The
IOD needs to be made available to the receivers through some out-of-band
means that are not defined in this document.

A homogeneous encapsulation of ESs carrying media or control (ODs, BIFS)
data is defined by the Sync Layer (SL) that primarily provides the
synchronization between streams. The Compression Layer organizes the ESs
in Access Units (AU), the smallest elements that can be attributed
individual timestamps. Integer or fractional AUs are then encapsulated
in SL packets.  All consecutive data from one stream is called an SL-
packetized stream at this layer. The interface between the compression
layer and the SL is called the Elementary Stream Interface (ESI). The
ESI is informative.

The Delivery Layer in MPEG-4 consists of the Delivery Multimedia
Integration Framework (DMIF) defined in ISO/IEC 14496-6 [4]. This layer is
media unaware but delivery technology aware. It provides transparent
access to and delivery of content irrespective of the technologies used.
The interface between the SL and DMIF is called the DMIF Application
Interface (DAI). It offers content location independent procedures for
establishing MPEG-4 sessions and access to transport channels. The
specification of this payload format is considered as a part of the
MPEG-4 Delivery Layer.

 media aware        +-----------------------------------------+
 delivery unaware   |           COMPRESSION LAYER             |
 14496-2 Visual     |streams from as low as Kbps to multi-Mbps|
 14496-3 Audio      +-----------------------------------------+
                                                           Elementary
                                                           Stream
 ==========================================================Interface
                                                                (ESI)

                   +-------------------------------------------+
 media and         |              SYNC LAYER                   |
 delivery unaware  | manages elementary streams, their synch-  |
 14496-1 Systems   | ronization and hierarchical relations     |
                   +-------------------------------------------+
                                                           DMIF
                                                           Application
===========================================================Interface
                                                                (DAI)
                   +-------------------------------------------+
 delivery aware    |               DELIVERY LAYER              |
 media  unaware    |provides transparent access to and delivery|
 14496-6 DMIF      | of content irrespective of delivery       |
                   |                technology                 |
                   +-------------------------------------------+

                 Figure 1: General MPEG-4 terminal architecture

1.2 MPEG-4 Elementary Stream Data Packetization

The ESs from the encoders are fed into the SL with indications of AU
boundaries, random access points, desired composition time and the
current time.

The Sync Layer fragments the ESs into SL packets, each containing a
header that encodes information conveyed through the ESI. If the AU is
larger than an SL packet, subsequent packets containing remaining parts
of the AU are generated with subset headers until the complete AU is
packetized.
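
The fragmentation step described above can be sketched as follows. This
is only an illustration: the function name fragment_access_unit and the
max_payload parameter are hypothetical, and generation of the SL packet
headers themselves is left out.

```python
from typing import List, Tuple


def fragment_access_unit(au: bytes, max_payload: int) -> List[Tuple[bool, bool, bytes]]:
    """Split one Access Unit into (is_first, is_last, chunk) tuples.

    Each chunk would become the payload of one SL packet. The first
    fragment would carry the full SL header and later ones a subset
    header; is_last marks the fragment that completes the AU.
    """
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    # An empty AU still yields a single (empty) fragment.
    chunks = [au[i:i + max_payload] for i in range(0, len(au), max_payload)] or [b""]
    last = len(chunks) - 1
    return [(i == 0, i == last, chunk) for i, chunk in enumerate(chunks)]
```

A 10-byte AU with max_payload set to 4 yields three fragments of 4, 4
and 2 bytes, with only the final one flagged as last.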

The syntax of the Sync Layer is not fixed and can be adapted to the
needs of the stream to be transported. This includes the possibility to
select the presence or absence of individual syntax elements as well as
configuration of their length in bits. The configuration for each
individual stream is conveyed in an SLConfigDescriptor, which is an
integral part of the ES Descriptor for this stream.

2. Analysis of the alternatives for carrying MPEG-4 over IP

2.1 MPEG-4 over UDP

Considering that the MPEG-4 SL defines several transport related
functions such as timing, sequence numbering, etc., this seems to be the
most straightforward alternative for carrying MPEG-4 data over IP. One
group of problems with this approach, however, stems from the monolithic
architecture of MPEG-4. No other multimedia data stream (including those
carried with RTP) can be synchronized with MPEG-4 data carried directly
over UDP. Furthermore, the dynamic scene and session control concepts
can't be extended to non-MPEG-4 data.

Even if the coordination with non-MPEG-4 data is overlooked, carrying
MPEG-4 data over UDP has the following additional shortcomings:


i. Mechanisms need to be defined to protect sensitive parts of MPEG-4
data. Some of these (like FEC) are already defined for RTP.

ii. There is no defined technique for synchronizing MPEG-4 streams from
different servers in the variable delay environment of the Internet.

iii. MPEG-4 streams originating from two servers may collide (their
sources may become unresolvable at the destination) in a multicast
session.

iv. An MPEG-4 back channel needs to be defined for quality feedback
similar to that provided by RTCP.

v. RTP mixers and translators can't be used.

The back-channel problem may be alleviated by developing a reception
reporting protocol like RTCP. Such an effort may benefit from RTCP
design knowledge, but needs extensions.

2.2 RTP header followed by full MPEG-4 headers

This alternative may be implemented by using the send time or the
composition time coming from the reference clock as the RTP timestamp.
This way no new feedback protocol needs to be defined for MPEG-4's back
channel, but RTCP may not be sufficient for MPEG-4's feedback
requirements that are still in the definition stage. Additionally, due
to the duplication of header information, such as the sequence numbers
and time stamps, this alternative causes unnecessary increases in the
overhead. As in 2.1, scene description and dynamic session control
cannot be extended to non-MPEG-4 streams.

2.3 MPEG-4 ESs over RTP with individual payload types

This is the most suitable alternative for coordination with the existing
Internet multimedia transport techniques and does not use MPEG-4 systems
at all. Complete implementation of it requires definition of potentially
many payload types, as already proposed for audio and video payloads
[7], and might lead to constructing new session and scene description
mechanisms. Considering the size of the work involved which essentially
reconstructs MPEG-4 systems, this may only be a long term alternative if
no other solution can be found.

2.4 RTP header followed by a reduced SL header

The inefficiency of the approach described in 2.2 can be fixed by using
a reduced SL header that does not carry duplicate information following
the RTP header.

2.5 Recommendation

Based on the above analysis, the best compromise is to map the MPEG-4 SL
packets onto RTP packets, such that the common pieces of the headers
reside in the RTP header that is followed by an optional reduced SL
header providing the MPEG-4 specific information. The details of this
payload format are described in the next section.

3. Payload Format

MPEG-4 SL packets are mapped onto RTP packets. The SL Packet header is
transformed into a reduced SL packet header, with some fields replaced
by those in the RTP header and others transported in reduced form. The
payload is unchanged.

If the resulting, smaller, SL packet header consumes a non-integer
number of bytes, zero padding bits MUST be inserted at the end of the SL
header to byte-align the SL packet payload. Similarly the SL packet
payload MUST be byte-aligned using zero padding bits.
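
The number of zero padding bits needed for byte alignment follows
directly from the bit length of the field being aligned. A minimal
sketch, with hypothetical function names:

```python
def padding_bits(bit_length: int) -> int:
    """Number of zero bits to append so that bit_length becomes a multiple of 8."""
    if bit_length < 0:
        raise ValueError("bit_length must not be negative")
    return (-bit_length) % 8


def aligned_length_bytes(bit_length: int) -> int:
    """Size in bytes of a field of bit_length bits after zero padding."""
    return (bit_length + padding_bits(bit_length)) // 8
```

For instance, a 13-bit reduced SL header needs 3 padding bits and then
occupies 2 bytes.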

When generating an SL packetized stream specifically for this format,
use of all other fields in the SL packet headers that the RTP header
does not duplicate (including the decodingTimeStamp) is OPTIONAL.

RTP Packets SHOULD be sent in the decoding (MPEG-4 decodingTimeStamp)
order.

The size of the SL packet SHOULD be adjusted such that the resulting RTP
packet is not larger than the path-MTU. To handle larger packets, this
payload format relies on lower layers for fragmentation, which may not
be desirable.
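
As a rough illustration of the sizing rule above, the largest SL packet
payload that still fits in one RTP packet can be estimated from the
path-MTU. The sketch assumes RTP over UDP over IPv4 without IP options
and no RTP header extension or padding; the constant and function names
are hypothetical.

```python
IPV4_HEADER_BYTES = 20   # assumption: IPv4 without options
UDP_HEADER_BYTES = 8
RTP_FIXED_HEADER_BYTES = 12


def max_sl_payload(path_mtu: int, reduced_sl_header_bytes: int, csrc_count: int = 0) -> int:
    """Largest SL packet payload (in bytes) that avoids lower-layer fragmentation."""
    if not 0 <= csrc_count <= 15:
        raise ValueError("CC is a 4-bit field")
    overhead = (IPV4_HEADER_BYTES + UDP_HEADER_BYTES + RTP_FIXED_HEADER_BYTES
                + 4 * csrc_count + reduced_sl_header_bytes)
    room = path_mtu - overhead
    if room <= 0:
        raise ValueError("path MTU too small for the headers alone")
    return room
```

With a path-MTU of 1500 bytes and a 3-byte reduced SL header this gives
1457 bytes of SL packet payload.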

It is assumed that the MPEG-4 SLConfigDescriptor is transported "out of
band". This is typically done via an ObjectDescriptorStream using the
MPEG-4 Object Description framework.

However since some knowledge of the SLConfigDescriptor is required by an
RTP receiver in order to parse MPEG-4 System specific elements in the
RTP payload defined in this document, the SLConfigDescriptor MAY be
transported in the SDP associated with such a stream using the a=fmtp
syntax (see below).

0                   1                   2                   3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X|  CC   |M|     PT      |       sequence number         | RTP
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           timestamp                           |Header
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           synchronization source (SSRC) identifier            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
:            contributing source (CSRC) identifiers             :
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
|Reduced SL Packet Header (variable # of bytes)  |              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-              | RTP
|                                                               |
|       SL Packet Payload (byte aligned)                        |Payload
|                                                               |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :...OPTIONAL RTP padding        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                  Figure 2 - An RTP packet for MPEG-4
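
The RTP header portion of Figure 2 can be serialized as sketched below.
The function name pack_rtp_header is hypothetical, the sketch always
sets P=0 and X=0, and the payload type 96 used in the usage note is
merely an example value from the dynamic range.

```python
import struct
from typing import Sequence


def pack_rtp_header(payload_type: int, seq: int, timestamp: int, ssrc: int,
                    marker: bool = False, csrcs: Sequence[int] = ()) -> bytes:
    """Build the RTP header (V=2, P=0, X=0) followed by any CSRC identifiers."""
    if not 0 <= payload_type <= 127:
        raise ValueError("PT is a 7-bit field")
    if len(csrcs) > 15:
        raise ValueError("at most 15 CSRC identifiers")
    byte0 = (2 << 6) | len(csrcs)                 # V=2, P=0, X=0, CC
    byte1 = (int(marker) << 7) | payload_type     # M, PT
    header = struct.pack("!BBHII", byte0, byte1, seq & 0xFFFF,
                         timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)
    for csrc in csrcs:
        header += struct.pack("!I", csrc & 0xFFFFFFFF)
    return header
```

Calling pack_rtp_header(96, 1, 90000, 0x11223344, marker=True) returns
a 12-byte header; the reduced SL packet header and the SL packet
payload would be appended to it.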


3.1 RTP Header Fields Usage

Payload Type (PT): The assignment of an RTP payload type for this new
packet format is outside the scope of this document, and will not be
specified here. It is expected that the RTP profile for a particular
class of applications will assign a payload type for this encoding, or
if that is not done then a payload type in the dynamic range shall be
chosen.

Marker (M) bit: Set to one to mark the last fragment (or only fragment)
of an AU.

Extension (X) bit: Defined by the RTP profile used.

Sequence Number: The RTP sequence number should be generated by the
sender with a constant random offset and does not have to be correlated
to any (optional) MPEG-4 SL sequence numbers.

Timestamp: Set to the value of the compositionTimeStamp field of the
original SL packet, if present. If compositionTimeStamp is shorter than
32 bits, the most significant bits of the timestamp MUST be set to zero.

Although it is available from the SL configuration data, the resolution
of the timestamp may need to be conveyed explicitly through some out-of-
band means to be used by network elements which are not MPEG-4 aware.

If compositionTimeStamp is longer than 32 bits, this payload format
cannot be used.

In all cases, the sender SHALL make sure that RTP time stamps are
identical only for RTP packets transporting fragments of the same Access
Unit.

If compositionTimeStamp is not present in the current SL packet but was
present in a previous SL packet, the current packet carries a further
fragment of the same Access Unit; therefore the same timestamp value
MUST be used as the RTP timestamp.
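
The timestamp rules above can be sketched as follows. The function name
assign_rtp_timestamps is hypothetical; each input entry is the
compositionTimeStamp of one SL packet, or None when the field is absent
because the packet continues a fragmented Access Unit.

```python
from typing import Iterable, List, Optional


def assign_rtp_timestamps(cts_values: Iterable[Optional[int]],
                          time_stamp_length: int) -> List[int]:
    """Derive one RTP timestamp per SL packet from its compositionTimeStamp."""
    if time_stamp_length > 32:
        raise ValueError("payload format cannot be used: CTS longer than 32 bits")
    timestamps = []
    current = None
    for cts in cts_values:
        if cts is not None:
            if not 0 <= cts < (1 << time_stamp_length):
                raise ValueError("CTS does not fit in time_stamp_length bits")
            current = cts          # upper bits of the 32-bit field stay zero
        elif current is None:
            raise ValueError("no earlier compositionTimeStamp to inherit")
        timestamps.append(current)  # fragments of one AU share the value
    return timestamps
```

For the input [1000, None, None, 4000, None] the first three packets
receive timestamp 1000 and the last two receive 4000.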

According to RFC 1889 [5, Section 5.1], timestamps are recommended to
start at a random value for security reasons. With such an offset,
however, a receiver is in general not able to reconstruct the original
MPEG-4 Time Stamps (CTS, DTS, OCR), which can be of use for applications
where streams from multiple sources are to be synchronized. Therefore
the usage of such a random offset SHOULD be avoided.




SSRC: Set as described in RFC 1889 [5]. A mapping between the ES
identifiers (ESIDs) and SSRCs should be provided through out-of-band
means.

CC and CSRC fields are used as described in RFC 1889 [5].

RTCP SHOULD be used as defined in RFC 1889 [5].

Reduced SL Packet Header : Defined in sections 3.2 and 3.3. If the
Reduced SL Packet Header contains a non-integer number of bytes,
trailing padding bits, each coded as zero, MUST be inserted to byte
align the start of the SL Packet Payload.

SL Packet Payload : The payload of an SL Packet. The payload MUST be
byte aligned, if needed, by using trailing padding bits, each coded as
zero.

RTP timestamps in RTCP SR packets: according to the RTP timing model,
the RTP timestamp that is carried in an RTCP SR packet is the same as
the compositionTimeStamp that would be applied to an RTP packet for data
that was sampled at the instant the SR packet is being generated and
sent. The RTP timestamp value is calculated from the NTP timestamp for
the current time, which also goes in the RTCP SR packet. To perform that
calculation, an implementation needs to periodically establish a
correspondence between the CTS value of a data packet and the NTP time
at which that data was sampled.
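
One possible way to perform this calculation is sketched below. It
assumes that the sender has recorded a reference pair (ref_ntp,
ref_cts), that NTP times are handled as seconds in floating point, and
that clock_rate is the time stamp resolution in ticks per second; all
of these names are hypothetical.

```python
def rtp_timestamp_for_sr(ntp_now: float, ref_ntp: float,
                         ref_cts: int, clock_rate: int) -> int:
    """RTP timestamp to place in an RTCP SR generated at time ntp_now.

    (ref_ntp, ref_cts) is a previously established correspondence
    between an NTP time and the CTS of data sampled at that time.
    """
    elapsed_ticks = round((ntp_now - ref_ntp) * clock_rate)
    return (ref_cts + elapsed_ticks) & 0xFFFFFFFF   # wrap to 32 bits
```

With a 90000 Hz resolution, a reference CTS of 90000 taken at NTP time
10.0 s and an SR generated at 10.5 s, the SR carries RTP timestamp
135000.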

3.2 Reduced SL Packet header construction

The following modifications of the SL packet header MUST be applied to a
SL packetized stream before encapsulation in this RTP payload format.
The other fields of the SL packet header MUST remain unchanged (but are
bit-shifted to fill in the gaps left by the changes specified below).

3.2.1 Time Stamps transformation

After placing its value in the RTP time stamp, the sender MUST remove
the compositionTimeStamp, if any, from each original SL packet header.
Consequently, the reduced SL packet includes a header without
compositionTimeStamp field. Furthermore other MPEG-4 Time Stamps are
encoded as offsets.

If compositionTimeStamp is never present in SL packets for this stream,
the RTP packetizer SHOULD convey a reading of a local clock at the time
the RTP packet is created.

The decodingTimeStamp, if present, MUST be replaced by the difference
between its value and the value of the compositionTimeStamp. If an OCR
(Object Clock Reference) is present it MUST also be changed to encode a
difference from the compositionTimeStamp in the same fashion. With this
payload format OCRs MUST have the same clock resolution as Time Stamps.
If compositionTimeStamp is not present for a SL packet that has OCR then
the OCR SHALL be encoded as a difference to the RTP time stamp.

Since this subtraction may lead to negative values, the offset MUST be
encoded as a two's complement signed integer in network byte order.

Because these offsets (delta) typically require fewer bits to be
encoded, the sender MAY use a different length than the one indicated by
the original SLConfigDescriptor timeStampLength field. The length MUST
then be signaled to the receiver by using an SDP a=fmtp field (see
section 3.3 and section 8).
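
A minimal sketch of the offset encoding, assuming the delta is computed
as the time stamp value minus the compositionTimeStamp and stored in a
field of a given number of bits. The function names are hypothetical.

```python
def encode_delta(delta: int, bits: int) -> int:
    """Encode a signed offset as a two's complement field of the given width."""
    if bits < 1:
        raise ValueError("field width must be at least 1 bit")
    low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not low <= delta <= high:
        raise ValueError("delta does not fit in the signaled length")
    return delta & ((1 << bits) - 1)


def decode_delta(raw: int, bits: int) -> int:
    """Recover the signed offset from its two's complement field value."""
    sign_bit = 1 << (bits - 1)
    return (raw ^ sign_bit) - sign_bit
```

For example, a decodingTimeStamp that is 2 ticks earlier than the
compositionTimeStamp gives a delta of -2, which is stored as 0xFFE in a
12-bit field.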

3.2.2 Indication of size

For efficiency SL packets do not carry their own size. This is not an
issue for RTP packets that contain a single SL Packet.

3.3 Reduced SL Packet Header

The reduced SL Packet Header is configurable and depends on SDP
parameters.

               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
               |     decodingTimeStampFlag             |
               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
               |     decodingTimeStampDelta            |
               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
               |     remainingSLPacketHeaderSize       |
               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
               |     remainingSLPacketHeader           |
               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                 Figure 3 - Reduced SL Packet Header


3.3.1 Usage of fields

decodingTimeStampFlag :  Indicates whether the decodingTimeStampDelta
field is present. A value of 1 indicates that the field is present, a
value of 0 that it is not present. If the decodingTimeStampFlag is true,
the sender MUST remove the decodingTimeStamp from the original SL packet
headers.

decodingTimeStampDelta : Specifies the value of the DTS as a two's
complement offset from the timestamp in the RTP header of this packet.

remainingSLPacketHeaderSize : Specifies the length in bits of the
immediately following remainingSLPacketHeader.

remainingSLPacketHeader : The remainder of an SL header after removal of
the CTS and DTS fields, if any, but without any modification of the time
stamp flags. The semantics of the original SL Packet Header are defined
by an SLConfigDescriptor conveyed in SDP or by other means. If the
remaining SL Packet header contains an OCR field, then this field is not
coded as defined in that descriptor, but instead as described in 3.2.1,
with a length indicated by the OCRDeltaLength SDP parameter.

3.3.2 Relationship between reduced SL Packet header and SDP parameters

The relationship between the reduced SL Packet Header and the SDP
parameters decodingTimeStampDeltaLength and
remainingSLPacketHeaderSizeLength is as follows:

                                       Number of bits
Reduced SL Packet Header {
    If (decodingTimeStampDeltaLength > 0){
        decodingTimeStampFlag          1
        If (decodingTimeStampFlag == 1){
            decodingTimeStampDelta     decodingTimeStampDeltaLength
        }
    }
    If (remainingSLPacketHeaderSizeLength > 0){
        remainingSLPacketHeaderSize    remainingSLPacketHeaderSizeLength
        remainingSLPacketHeader()
    }
}
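
The syntax above can be exercised with the following sketch, which
writes and reads a reduced SL packet header including the trailing zero
padding. It is not a normative implementation: the function names are
hypothetical, dts_delta_length and size_length stand for the SDP
parameters decodingTimeStampDeltaLength and
remainingSLPacketHeaderSizeLength, and the remaining SL packet header
is represented as a string of '0' and '1' characters for readability.

```python
from typing import Optional, Tuple


def build_reduced_sl_header(dts_delta: Optional[int], remaining_bits: str,
                            dts_delta_length: int, size_length: int) -> bytes:
    """Serialize the reduced SL packet header, zero-padded to a byte boundary."""
    bits = ""
    if dts_delta_length > 0:
        if dts_delta is None:
            bits += "0"                                   # decodingTimeStampFlag
        else:
            low = -(1 << (dts_delta_length - 1))
            if not low <= dts_delta < -low:
                raise ValueError("delta does not fit in dts_delta_length bits")
            raw = dts_delta & ((1 << dts_delta_length) - 1)
            bits += "1" + format(raw, "0{}b".format(dts_delta_length))
    elif dts_delta is not None:
        raise ValueError("no room for a DTS delta when the length is zero")
    if size_length > 0:
        if len(remaining_bits) >= (1 << size_length):
            raise ValueError("remaining header too long for size field")
        bits += format(len(remaining_bits), "0{}b".format(size_length))
        bits += remaining_bits
    elif remaining_bits:
        raise ValueError("remaining header not allowed when size length is zero")
    bits += "0" * (-len(bits) % 8)                        # byte alignment
    return int(bits, 2).to_bytes(len(bits) // 8, "big") if bits else b""


def parse_reduced_sl_header(data: bytes, dts_delta_length: int,
                            size_length: int) -> Tuple[Optional[int], str, int]:
    """Return (dts_delta, remaining_bits, header_length_in_bytes)."""
    bits = "".join(format(b, "08b") for b in data)
    pos = 0
    dts_delta = None
    if dts_delta_length > 0:
        flag = bits[pos]
        pos += 1
        if flag == "1":
            raw = int(bits[pos:pos + dts_delta_length], 2)
            pos += dts_delta_length
            sign_bit = 1 << (dts_delta_length - 1)
            dts_delta = (raw ^ sign_bit) - sign_bit       # two's complement
    remaining_bits = ""
    if size_length > 0:
        size = int(bits[pos:pos + size_length], 2)
        pos += size_length
        remaining_bits = bits[pos:pos + size]
        pos += size
    return dts_delta, remaining_bits, (pos + 7) // 8
```

The third value returned by the parser tells the receiver where the
byte-aligned SL packet payload starts within the RTP payload.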


4. SL packetized stream reconstruction

The MPEG-4 over IP framework [9] requires that the way a receiver can
reconstruct a valid SL packetized stream be documented; this is the
purpose of this section.

Since this format directly transports SL packets, this reconstruction is
trivial with the following rules:
- The compositionTimeStamp is restored from the RTP timestamp if
   compositionTimeStampFlag is set to TRUE in the SLConfigDescriptor
   that has been signaled in SDP.
- The decodingTimeStamp and OCR, if present, are restored from the
   offsets relative to the RTP timestamp.
- The other SL packet header fields SHALL remain exactly the same as in
   the remainingSLPacketHeader.
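
The time stamp rules above can be sketched as follows. The function
name is hypothetical, and the sketch assumes that the deltas were
computed as (value - compositionTimeStamp) and that restored values
wrap at time_stamp_length bits; the same length is assumed for the OCR,
which this document does not state explicitly.

```python
from typing import Optional, Tuple


def restore_time_stamps(rtp_timestamp: int,
                        dts_delta: Optional[int] = None,
                        ocr_delta: Optional[int] = None,
                        time_stamp_length: int = 32
                        ) -> Tuple[int, Optional[int], Optional[int]]:
    """Return (CTS, DTS, OCR) restored from the RTP timestamp and the offsets."""
    mask = (1 << time_stamp_length) - 1
    cts = rtp_timestamp & mask
    dts = None if dts_delta is None else (cts + dts_delta) & mask
    ocr = None if ocr_delta is None else (cts + ocr_delta) & mask
    return cts, dts, ocr
```

With an RTP timestamp of 9000, a DTS delta of -3000 and an OCR delta of
-4500, the restored values are CTS 9000, DTS 6000 and OCR 4500.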

5. Multiplexing

Since a typical MPEG-4 session may involve a large number of objects,
possibly as many as a few hundred, transporting each ES as an
individual RTP session may not always be practical. Allocating and
controlling hundreds of destination addresses for each MPEG-4 session
may pose insurmountable session administration problems. The
input/output processing overhead at the end-points will also be
extremely high. Additionally, low delay transmission of low bitrate data
streams, e.g. facial animation parameters, results in extremely high
header overheads.


To solve these problems, MPEG-4 data transport requires a multiplexing
scheme that allows selective bundling of several ESs. This is beyond the
scope of the payload format defined here. MPEG-4's Flexmux multiplexing
scheme may be used for this purpose by defining an additional RTP
payload format for "multiplexed MPEG-4 streams." Another approach may be
to develop a generic RTP multiplexing scheme usable for MPEG-4 data. The
multiplexing scheme reported in [8] may be a candidate for this
approach.

For MPEG-4 applications, the multiplexing technique needs to address the
following requirements:

i. The ESs multiplexed in one stream can change frequently during a
session. Consequently, the coding type, individual packet size and
temporal relationships between the multiplexed data units must be
handled dynamically.

ii. The multiplexing scheme should have a mechanism to determine the ES
identifier (ES_ID) for each of the multiplexed packets. ES_ID is not a
part of the SL header.

iii. In general, an SL packet does not contain information about its
size. The multiplexing scheme should be able to delineate the
multiplexed packets whose lengths may vary from a few bytes to close to
the path-MTU.

6. Security Considerations

RTP packets using the payload format defined in this specification are
subject to the security considerations discussed in the RTP
specification [5]. This implies that confidentiality of the media
streams is achieved by encryption. Because the data compression used
with this payload format is applied end-to-end, encryption may be
performed on the compressed data so there is no conflict between the two
operations. The packet processing complexity of this payload type does
not exhibit any significant non-uniformity on the receiver side that
would cause a denial-of-service threat.

However, it is possible to inject non-compliant MPEG streams (Audio,
Video, and Systems) to overload the receiver/decoder's buffers which
might compromise the functionality of the receiver or even crash it.
This is especially true for end-to-end systems like MPEG where the
buffer models are precisely defined.

MPEG-4 Systems supports stream types including commands that are
executed on the terminal like OD commands, BIFS commands, etc. and
programmatic content like MPEG-J (Java(TM) Byte Code) and ECMAScript. It
is possible to use one or more of the above in a manner non-compliant to
MPEG to crash the receiver or make it temporarily unavailable.

Authentication mechanisms can be used to validate the sender and the
data to prevent security problems due to non-compliant malignant MPEG-4
streams.

A security model is defined for MPEG-4 Systems streams carrying MPEG-J
access units, which comprise Java(TM) classes and objects. MPEG-J
defines a set of Java APIs and a secure execution model.  MPEG-J content
can call this set of APIs and Java(TM) methods from a set of Java
packages supported in the receiver within the defined security model.
According to this security model, downloaded byte code is forbidden to
load libraries, define native methods, start programs, read or write
files, or read system properties.

Receivers can implement intelligent filters to validate the buffer
requirements or parametric (OD, BIFS, etc.) or programmatic (MPEG-J,
ECMAScript) commands in the streams. However, this can increase the
complexity significantly.

7. Types and names

The encoding name associated with this RTP payload format is "mpeg4-sl".

The media type may be any of:
- "video"
- "audio"
- "application"

"video" SHOULD be used for MPEG-4 Video streams (ISO/IEC 14496-2) or
MPEG-4 Systems streams that convey information needed for an
audio/visual presentation.

"audio" SHOULD be used for MPEG-4 Audio streams (ISO/IEC 14496-3) or
MPEG-4 Systems streams that convey information needed for an audio-only
presentation.

"application" SHOULD be used for MPEG-4 Systems streams (ISO/IEC 14496-
1) that serve other purposes than audio/visual presentation, e.g. in
some cases when MPEG-J streams are transmitted.

***

This section will have to be replaced by a full MIME type registration
before this is published as an RFC.

***


8. Additional SDP syntax

This format may require additional information about the mapping to be
made available to the receiver.

For example, as mentioned above, some fields of the SL packet header
MUST be reconfigured for optimal efficiency. When such a change is
performed, it MUST be signaled to the receiver using an SDP (a=fmtp)
parameter as in RFC 2327 [10, section 6].

8.1 Required Mapping information

8.1.1 Indication of decodingTimeStamp delta bit length

The following syntax should be used:

a=fmtp:<format> decodingTimeStampDeltaLength=<value>

<value> being the number of bits on which the decodingTimeStampDelta
field is encoded in the reduced SL packet headers. A value larger than
zero indicates that the decodingTimeStampFlag is contained in each
Reduced SL Packet Header. A value of zero indicates that the
decodingTimeStampFlag is not present; in that case, the sender MUST
remove any decodingTimeStampFlag from the original SL packet headers. If
this parameter is not present, it has a default value of zero.

8.1.2 Indication of the size of the remainingSLPacketHeader field

The following syntax should be used:

a=fmtp:<format> remainingSLPacketHeaderSize=<value>

<value> being the number of bits that is used to encode the subsequent
remainingSLPacketHeaderSize field. An encoded value of zero indicates
non-presence of the remainingSLPacketHeaderSize and the
remainingSLPacketHeader fields. Absence of this parameter is equivalent
to an encoded value of zero.

8.1.3 Indication of OCR delta bit lengths

The following syntax should be used:

a=fmtp:<format> OCRDeltaLength=<value>

<value> being the number of bits on which the Object Clock Reference
(OCR) deltas are encoded in the remainingSLPacketHeader.

8.2 Optional configuration information

In the MPEG-4 framework the following information is carried using the
Object Descriptor. For compatibility with receivers that do not
implement the full MPEG-4 system specification this information MAY also
be indicated in SDP.

For transport of MPEG-4 audio and video without the use of MPEG-4
systems, as well as to support non-MPEG-4 system receivers, it is
possible to transport information on the profile and level of the stream
and on the decoder configuration.

8.2.1 Indication of SLConfigDescriptor

Senders MAY transmit the SLConfigDescriptor in SDP.

The following syntax should be used:

a=fmtp:<format> SLConfigDescriptor=<value>

<value> being a base-64 encoding of the SLConfigDescriptor. This SHALL
be the original SLConfigDescriptor and it SHALL be the same as the one
transported by the OD framework.

8.2.2 Indications for MPEG-4 audio streams

8.2.2.1 Indication of profile and level

Senders MAY transmit the profile and level indication in SDP.

The following syntax should be used:

a=fmtp:<format> profile-level-id=<value>

<value> being a decimal representation of the MPEG-4 Audio Profile
Level indication value defined in ISO/IEC 14496-1. This parameter
indicates which MPEG-4 Audio tool subsets are applied to encode the
audio stream.

8.2.2.2 Indication of audio object type

Senders MAY transmit the audio object type indication in SDP.

The following syntax should be used:

a=fmtp:<format> object-type=<value>

<value> being a decimal representation of the MPEG-4 Audio Object Type
value defined in ISO/IEC 14496-3. This parameter specifies the tool used
by the encoder. It MAY be used to limit the capability within the
specified "profile-level-id".

8.2.2.3 Indication of audio bitrate

Senders MAY transmit the audio bitrate in SDP.

The following syntax should be used:

a=fmtp:<format> bitrate=<value>

<value> being a decimal representation of the audio bitrate in bits per
second for the audio bit stream.

8.2.2.4 Indication of audio decoder configuration

Senders MAY transmit the audio decoder configuration in SDP.


The following syntax should be used:

a=fmtp:<format> config=<value>

<value> being a hexadecimal representation of an octet string that
expresses the audio payload configuration data "StreamMuxConfig", as
defined in ISO/IEC 14496-3. Configuration data is mapped onto the octet
string on an MSB-first basis. The first bit of the configuration data
SHALL be located at the MSB of the first octet. In the last octet,
zero-padding bits, if necessary, shall follow the configuration data.
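
The MSB-first mapping with trailing zero padding can be illustrated
with a minimal Python sketch. The helper name bits_to_config_hex and the
sample bit pattern are assumptions made for illustration only; they are
not defined by this document or by ISO/IEC 14496-3.

```python
def bits_to_config_hex(bits: str) -> str:
    """Map a string of '0'/'1' characters onto a hexadecimal octet string.

    The first bit lands in the MSB of the first octet; the last octet is
    completed with zero padding bits when needed.
    """
    if not bits or set(bits) - {"0", "1"}:
        raise ValueError("bits must be a non-empty string of 0 and 1")
    padded = bits + "0" * (-len(bits) % 8)      # zero padding at the end
    octets = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    return octets.hex()


# Hypothetical 11-bit configuration, not a real StreamMuxConfig.
print(bits_to_config_hex("10110011101"))
```

A receiver would reverse the mapping and ignore the padding bits once
the configuration data has been fully parsed.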

8.2.3 Indications for MPEG-4 video streams

8.2.3.1 Indication of profile and level

Senders MAY transmit the video profile and level indication in SDP.

The following syntax should be used:

a=fmtp:<format> profile-level-id=<value>

<value> being a decimal representation of the MPEG-4 Visual Profile and
Level indication value (profile_and_level_indication) defined in Table
G-1 of ISO/IEC 14496-2. This parameter MAY be used in the capability
exchange or session setup procedure to indicate the MPEG-4 Visual
Profile and Level combination of which the MPEG-4 Visual codec is
capable. If this parameter is not specified by the procedure, its
default value of 1 (Simple Profile/Level 1) is used.

8.2.3.2 Indication of video decoder configuration

Senders MAY transmit the video decoder configuration in SDP. This
parameter indicates the configuration of the corresponding MPEG-4 visual
bitstream. It SHALL NOT be used to indicate the codec capability in the
capability exchange procedure.

The following syntax should be used:

a=fmtp:<format> config=<value>

<value> being a hexadecimal representation of an octet string that
expresses the MPEG-4 Visual configuration information, as defined in
subclause 6.2.1 "Start codes" of ISO/IEC 14496-2 [2][4][9]. The
configuration information is mapped onto the octet string on an
MSB-first basis. The first bit of the configuration information SHALL be
located at the MSB of the first octet. The configuration information
indicated by this parameter SHALL be the same as the configuration
information in the corresponding MPEG-4 Visual stream, except for
first_half_vbv_occupancy and latter_half_vbv_occupancy, if present,
which may vary in the repeated configuration information inside an
MPEG-4 Visual stream (see subclause 6.2.1 "Start codes" of
ISO/IEC 14496-2).

8.3 Concatenation of fmtp parameters

Multiple fmtp parameters SHOULD be expressed as a MIME media type
string, in the form of a semicolon separated list of parameter=value
pairs.
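
As an illustration, a receiver could split such a semicolon separated
list as in the following Python sketch. The function names parse_fmtp
and length_parameter are hypothetical, and the zero default is applied
here only to the two length parameters for which this document states
that default.

```python
def parse_fmtp(line: str) -> tuple[str, dict[str, str]]:
    """Split an 'a=fmtp:' line into (format, {parameter: value})."""
    prefix = "a=fmtp:"
    if not line.startswith(prefix):
        raise ValueError("not an fmtp attribute")
    fmt, _, rest = line[len(prefix):].strip().partition(" ")
    params = {}
    for item in rest.split(";"):
        item = item.strip()
        if not item:
            continue                      # tolerate a trailing semicolon
        name, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"malformed parameter: {item!r}")
        params[name.strip()] = value.strip()
    return fmt, params


def length_parameter(params: dict[str, str], name: str) -> int:
    """Return a bit-length parameter, using zero when it is absent."""
    return int(params.get(name, "0"))


fmt, params = parse_fmtp(
    "a=fmtp:97 decodingTimeStampDeltaLength=6;"
    "remainingSLPacketHeaderSizeLength=2"
)
print(fmt, params)
print(length_parameter(params, "decodingTimeStampDeltaLength"))
```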

8.4 SDP file example

The following is an example of SDP syntax for the description of a
session containing one MPEG-4 audio stream, one MPEG-4 video stream and
one MPEG-4 system stream, transported using this format. Note that the
video stream decoding time stamp deltas are encoded on 6 bits in this
example. The a=fmtp attribute is a single line; it is folded here only
to fit the page width.

o= ....
i= ....
c=IN IP4 123.234.71.112
m=video 1034 RTP/AVP 97
a=fmtp:97 decodingTimeStampDeltaLength=6;
  remainingSLPacketHeaderSizeLength=2
a=rtpmap:97 mpeg4-sl
m=audio 810 RTP/AVP 98
a=rtpmap:98 mpeg4-sl
m=application 1234 RTP/AVP 99
a=rtpmap:99 mpeg4-sl
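
To show how a receiver might associate each fmtp and rtpmap attribute
with its media section, here is a rough Python sketch. The function
name media_formats and the dictionary layout are assumptions for
illustration; the sample text reproduces the media lines of the example
with the fmtp attribute written on a single line.

```python
def media_formats(sdp: str) -> list[dict]:
    """Group the rtpmap and fmtp attributes of each 'm=' section."""
    sections = []
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("m="):
            media, port, proto, *formats = line[2:].split()
            sections.append({"media": media, "port": int(port),
                             "proto": proto, "formats": formats,
                             "rtpmap": {}, "fmtp": {}})
        elif sections and line.startswith(("a=rtpmap:", "a=fmtp:")):
            kind, _, rest = line[2:].partition(":")   # "rtpmap" or "fmtp"
            fmt, _, value = rest.partition(" ")
            sections[-1][kind][fmt] = value.strip()
    return sections


EXAMPLE = """\
c=IN IP4 123.234.71.112
m=video 1034 RTP/AVP 97
a=fmtp:97 decodingTimeStampDeltaLength=6;remainingSLPacketHeaderSizeLength=2
a=rtpmap:97 mpeg4-sl
m=audio 810 RTP/AVP 98
a=rtpmap:98 mpeg4-sl
m=application 1234 RTP/AVP 99
a=rtpmap:99 mpeg4-sl
"""

for section in media_formats(EXAMPLE):
    print(section["media"], section["formats"], section["fmtp"])
```

The audio and application sections carry no fmtp attribute, so their
length parameters take the default value of zero.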

9. Examples of usage of this payload format

9.1 MPEG-4 Video

Let us consider the case of a 30 frames per second MPEG-4 video stream
whose bit rate is high enough that Access Units have to be split into
several SL packets (typically above 300 kb/s).

Let us also assume that the video codec generates in that case Video
Packets suitable to fit in one SL packet, i.e. that the video codec is
MTU aware and the MTU is 1500 bytes. We assume furthermore that this
stream contains B frames and that decodingTimeStamps are present.

9.1.1 Typical SLConfigDescriptor for video streams

In this example the SLConfigDescriptor is:

class SLConfigDescriptor extends BaseDescriptor : bit(8)
tag=SLConfigDescrTag {
  bit(8) predefined;
  if (predefined==0) {
    bit(1) useAccessUnitStartFlag; = 1
    bit(1) useAccessUnitEndFlag; = 0
    bit(1) useRandomAccessPointFlag; = 1
    bit(1) hasRandomAccessUnitsOnlyFlag; = 0
    bit(1) usePaddingFlag; = 0
    bit(1) useTimeStampsFlag; = 1
    bit(1) useIdleFlag; = 0
    bit(1) durationFlag; = 0
    bit(32) timeStampResolution; = 30
    bit(32) OCRResolution; = 0
    bit(8) timeStampLength;     // must be <= 64  = 32
    bit(8) OCRLength;           // must be <= 64 = 0
    bit(8) AU_Length;           // must be <= 32 = 0
    bit(8) instantBitrateLength; = 0
    bit(4) degradationPriorityLength; = 0
    bit(5) AU_seqNumLength; // must be <= 16 = 0
    bit(5) packetSeqNumLength; // must be <= 16 = 0
    bit(2) reserved=0b11;
  }
  if (durationFlag) {
    bit(32) timeScale; // NOT USED
    bit(16) accessUnitDuration;  // NOT USED
    bit(16) compositionUnitDuration;  // NOT USED
  }
  if (!useTimeStampsFlag) {
    bit(timeStampLength) startDecodingTimeStamp; = 0
    bit(timeStampLength) startCompositionTimeStamp; = 0
  }
}

Note that:
useRandomAccessPointFlag is set so that the randomAccessPointFlag can
indicate that the corresponding SL packet contains a GOV followed by the
first Video Packet of an Intra coded frame.
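
As a rough check of the field widths, the following Python sketch packs
the values listed above, MSB first, into the body of the descriptor. It
is an illustration under stated assumptions: the names VIDEO_SL_CONFIG
and pack_fields are hypothetical, and the tag and size header inherited
from BaseDescriptor are deliberately left out. The conditional fields
are absent because durationFlag is 0 and useTimeStampsFlag is 1.

```python
# (field name, width in bits, value) in the order of the class above.
VIDEO_SL_CONFIG = [
    ("predefined", 8, 0),
    ("useAccessUnitStartFlag", 1, 1),
    ("useAccessUnitEndFlag", 1, 0),
    ("useRandomAccessPointFlag", 1, 1),
    ("hasRandomAccessUnitsOnlyFlag", 1, 0),
    ("usePaddingFlag", 1, 0),
    ("useTimeStampsFlag", 1, 1),
    ("useIdleFlag", 1, 0),
    ("durationFlag", 1, 0),
    ("timeStampResolution", 32, 30),
    ("OCRResolution", 32, 0),
    ("timeStampLength", 8, 32),
    ("OCRLength", 8, 0),
    ("AU_Length", 8, 0),
    ("instantBitrateLength", 8, 0),
    ("degradationPriorityLength", 4, 0),
    ("AU_seqNumLength", 5, 0),
    ("packetSeqNumLength", 5, 0),
    ("reserved", 2, 0b11),
]


def pack_fields(fields) -> bytes:
    """Concatenate the fields MSB first and return whole octets."""
    bits = ""
    for name, width, value in fields:
        if value < 0 or value >> width:
            raise ValueError(f"{name} does not fit in {width} bits")
        bits += format(value, f"0{width}b")
    if len(bits) % 8:
        raise ValueError("field widths do not add up to whole octets")
    return int(bits, 2).to_bytes(len(bits) // 8, "big")


body = pack_fields(VIDEO_SL_CONFIG)
print(len(body), body.hex())
```

The field widths add up to 128 bits, so the body occupies 16 octets.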

9.1.2 Typical SL packet header structure for video streams

With this configuration we can derive the following SL packet header
structure:

aligned(8) class SL_PacketHeader (SLConfigDescriptor SL) {
  if (SL.useAccessUnitStartFlag)
    bit(1) accessUnitStartFlag;                         // 1 bit
  if (accessUnitStartFlag) {
    if (SL.useRandomAccessPointFlag)
      bit(1) randomAccessPointFlag;                     // 1 bit
    if (SL.useTimeStampsFlag) {
      bit(1) decodingTimeStampFlag;                     // 1 bit
      bit(1) compositionTimeStampFlag;                  // 1 bit
    }
    if (decodingTimeStampFlag)
      bit(SL.timeStampLength) decodingTimeStamp;
    if (compositionTimeStampFlag)
      bit(SL.timeStampLength) compositionTimeStamp;
  }
}

9.1.3 SDP mapping information

decodingTimeStamps are encoded on 32 bits, which is much more than
needed. Therefore the sender will use decodingTimeStampDeltaLength in
the corresponding SDP to signal that only 6 bits are used for the coding
of relative DTS in the RTP packet.

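The remaining SL packet header never exceeds 3 bits
(accessUnitStartFlag, randomAccessPointFlag and
compositionTimeStampFlag), so remainingSLPacketHeaderSize takes values
up to 3 and is encoded on 2 bits, as signaled by
remainingSLPacketHeaderSizeLength.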

The resulting fmtp line in SDP is:

a=fmtp:<format> decodingTimeStampDeltaLength=6;
  remainingSLPacketHeaderSizeLength=2
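
The derivation of these two values can be sketched in Python as
follows. The function name build_fmtp and its arguments are
hypothetical; the sketch assumes that a parameter equal to its default
value of zero is simply omitted from the line.

```python
def build_fmtp(fmt: str, dts_delta_bits: int,
               max_remaining_header_bits: int) -> str:
    """Return the a=fmtp line, or '' when every length has its default."""
    # Number of bits needed to encode the largest header size value,
    # e.g. a size of up to 3 bits needs a 2-bit size field.
    size_length = max_remaining_header_bits.bit_length()
    params = []
    if dts_delta_bits:
        params.append(f"decodingTimeStampDeltaLength={dts_delta_bits}")
    if size_length:
        params.append(f"remainingSLPacketHeaderSizeLength={size_length}")
    return f"a=fmtp:{fmt} " + ";".join(params) if params else ""


# Payload type 97 is taken from the SDP example; any dynamic type works.
print(build_fmtp("97", dts_delta_bits=6, max_remaining_header_bits=3))
```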

9.1.4 RTP packet structure

Such SL packet headers can result in several reduced SL packet headers:

For packets that transport first fragments of Access Units:

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| RTP header                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| decodingTimeStampFlag = 1 (1 bit)       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| decodingTimeStampDelta (6 bits)         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| remainingSLPacketHeaderSize = 3 (2 bits)|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| accessUnitStartFlag = 1 (1 bit)         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| randomAccessPointFlag  (1 bit)          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| compositionTimeStampFlag = 1 (1 bit)    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| 0000 (4 zero bits to byte alignment)    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Video SL packet (N  bytes)              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

For packets that transport non-first fragments of Access Units:

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| RTP header                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| decodingTimeStampFlag = 0 (1 bit)       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| remainingSLPacketHeaderSize = 1 (2 bits)|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| accessUnitStartFlag = 0 (1 bit)         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| 0000 (4 zero bits to byte alignment)    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Video SL packet (N  bytes)              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
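
The two layouts above can be reproduced bit by bit with the following
Python sketch. The function names are hypothetical, and the
decodingTimeStampDelta is treated as an opaque 6-bit value supplied by
the caller; how it is computed is outside the scope of this sketch.

```python
def pack_reduced_header(fields) -> bytes:
    """Pack (value, width) pairs MSB first, zero padded to a whole octet."""
    bits = "".join(format(value, f"0{width}b") for value, width in fields)
    bits += "0" * (-len(bits) % 8)            # padding to byte alignment
    return int(bits, 2).to_bytes(len(bits) // 8, "big")


def first_fragment_header(dts_delta: int, random_access: bool) -> bytes:
    """Reduced header for the first fragment of an Access Unit."""
    if not 0 <= dts_delta < 64:
        raise ValueError("decodingTimeStampDelta must fit in 6 bits")
    return pack_reduced_header([
        (1, 1),                    # decodingTimeStampFlag
        (dts_delta, 6),            # decodingTimeStampDelta
        (3, 2),                    # remainingSLPacketHeaderSize
        (1, 1),                    # accessUnitStartFlag
        (int(random_access), 1),   # randomAccessPointFlag
        (1, 1),                    # compositionTimeStampFlag
    ])


def other_fragment_header() -> bytes:
    """Reduced header for a non-first fragment of an Access Unit."""
    return pack_reduced_header([
        (0, 1),   # decodingTimeStampFlag
        (1, 2),   # remainingSLPacketHeaderSize
        (0, 1),   # accessUnitStartFlag
    ])


print(first_fragment_header(5, True).hex())   # 12 header bits + 4 padding
print(other_fragment_header().hex())          # 4 header bits + 4 padding
```

The first layout occupies two octets and the second a single octet.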

Note that the compositionTimeStamp is never present, since it would be
redundant with the RTP time stamp. However, the value of
compositionTimeStampFlag is still 1 to indicate that a
compositionTimeStamp was present in this SL packet and should therefore
be restored by the receiver using the RTP time stamp.

In this example we have an RTP overhead of 40 + 2 bytes for 1400 bytes
of payload, i.e. 3 % overhead.
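
The arithmetic can be written out as a small Python sketch. The
function name overhead_ratio is hypothetical, and the 40 bytes are
assumed to be an IPv4 header without options (20), a UDP header (8) and
an RTP header without CSRC entries (12).

```python
def overhead_ratio(payload_bytes: int, reduced_header_bytes: int,
                   ip_udp_rtp_bytes: int = 40) -> float:
    """Header bytes divided by payload bytes for one RTP packet."""
    return (ip_udp_rtp_bytes + reduced_header_bytes) / payload_bytes


# 40 + 2 header bytes for 1400 payload bytes, as in the text.
print(f"{overhead_ratio(1400, 2):.0%}")
```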

9.2 MPEG-4 Audio

9.2.1 Typical SLConfigDescriptor for MPEG-4 Audio

Since CTS=DTS, signaling of MPEG-4 time stamps is not needed.

We also assume here an audio Object Type for which all Access Units are
Random Access Points, which is signaled using the
hasRandomAccessUnitsOnlyFlag in the SLConfigDescriptor.

In this example the SLConfigDescriptor can be:

class SLConfigDescriptor extends BaseDescriptor : bit(8)
tag=SLConfigDescrTag {
  bit(8) predefined;
  if (predefined==0) {
    bit(1) useAccessUnitStartFlag; = 0
    bit(1) useAccessUnitEndFlag; = 0
    bit(1) useRandomAccessPointFlag; = 0
    bit(1) hasRandomAccessUnitsOnlyFlag; = 1
    bit(1) usePaddingFlag; = 0
    bit(1) useTimeStampsFlag; = 0
    bit(1) useIdleFlag; = 0
    bit(1) durationFlag; = 0
    bit(32) timeStampResolution; = 0
    bit(32) OCRResolution; = 0
    bit(8) timeStampLength;     // must be <= 64  = 0
    bit(8) OCRLength;           // must be <= 64 = 0
    bit(8) AU_Length;           // must be <= 32 = 0
    bit(8) instantBitrateLength; = 0
    bit(4) degradationPriorityLength; = 0
    bit(5) AU_seqNumLength; // must be <= 16 = 0
    bit(5) packetSeqNumLength; // must be <= 16 = 0
    bit(2) reserved=0b11;
  }
  if (durationFlag) {
    bit(32) timeScale; // NOT USED
    bit(16) accessUnitDuration;  // NOT USED
    bit(16) compositionUnitDuration;  // NOT USED
  }
  if (!useTimeStampsFlag) {
    bit(timeStampLength) startDecodingTimeStamp; = 0
    bit(timeStampLength) startCompositionTimeStamp; = 0
  }
}

9.2.2 Typical SL packet header for MPEG-4 Audio

With this configuration the SL header is empty.

This does not have to be indicated in SDP since the default value for
remainingSLPacketHeaderSizeLength and decodingTimeStampDeltaLength is
zero. Therefore the absence of these parameters in SDP indicates the
absence of decodingTimeStampFlag and remainingSLPacketHeaderSize in RTP
packets.

9.2.3 Overhead estimation for MPEG-4 Audio

Depending on the actual MPEG-4 audio Object Type used, the relative RTP
overhead (IP+UDP+RTP headers) can be very large, since the SL packet
payload can be a few bytes or less.


10. References

[1] ISO/IEC 14496-1:2000, MPEG-4 Systems, October 2000.

[2] ISO/IEC 14496-2:1999/Amd.1:2000(E), MPEG-4 Visual, January 2000.

[3] ISO/IEC 14496-3:1999/FDAM 1:2000, MPEG-4 Audio, January 2000.

[4] ISO/IEC 14496-6 FDIS, Delivery Multimedia Integration Framework,
November 1998.

[5] Schulzrinne, Casner, Frederick, Jacobson, RTP: A Transport Protocol
for Real-Time Applications, RFC 1889, Internet Engineering Task Force,
January 1996.

[6] S. Bradner, Key words for use in RFCs to Indicate Requirement
Levels, RFC 2119, March 1997.

[7] Y. Kikuchi, T. Nomura, S. Fukunaga, Y. Matsui, H. Kimata, RTP
payload format for MPEG-4 Audio/Visual streams, RFC 3016, Internet
Engineering Task Force, November 2000.

[8] B. Thompson, T. Koren, D. Wing, Tunneling multiplexed Compressed RTP
("TCRTP"), work in progress, draft-ietf-avt-tcrtp-01.txt, July 2000.

[9] D. Singer, Y. Lim, A Framework for the delivery of MPEG-4 over
IP-based Protocols, work in progress, draft-singer-mpeg4-ip-01.txt,
October 2000.

[10] Handley, Jacobson, SDP: Session Description Protocol, RFC 2327,
Internet Engineering Task Force, April 1998.


11. Authors' Addresses

Olivier Avaro
France Telecom
35 A Schützenhüttenweg
60598 Frankfurt am Main
Deutschland
e-mail: olivier.avaro@francetelecom.fr

Andrea Basso
AT&T Labs Research
200 Laurel Avenue
Middletown, NJ 07748
USA
e-mail: basso@research.att.com

Stephen L. Casner
Packet Design, Inc.
66 Willow Place
Menlo Park, CA 94025
USA
e-mail: casner@acm.org

M. Reha Civanlar
AT&T Labs - Research
100 Schultz Drive
Red Bank, NJ 07701
USA
e-mail: civanlar@research.att.com

Philippe Gentric
Philips Digital Networks
22 Avenue Descartes
94453 Limeil-Brevannes CEDEX
France
e-mail: philippe.gentric@philips.com

Carsten Herpel
THOMSON multimedia
Karl-Wiechert-Allee 74
30625 Hannover
Germany
e-mail: herpelc@thmulti.com

Zvi Lifshitz
Optibase Ltd.
7 Shenkar St.
Herzliya 46120
Israel
e-mail: zvil@optibase.com

Young-kwon Lim
mp4cast (MPEG-4 Internet Broadcasting Solution Consortium)
1001-1 Daechi-Dong Gangnam-Gu
Seoul, 305-333,
Korea
e-mail: young@techway.co.kr

Colin Perkins
USC Information Sciences Institute
4350 N. Fairfax Drive #620
Arlington, VA 22203
USA
e-mail: csp@isi.edu

Jan van der Meer
Philips Digital Networks
Cederlaan 4
5600 JB Eindhoven
Netherlands
e-mail: jan.vandermeer@philips.com