[RFC] New audio output architecture

Sat May 4 01:23:17 CEST 2002

At 23:31 +0200 3/05/2002, Christophe Massiot wrote :

>I'm updating my document with a few major new ideas I've had today, 
>and after that I think I'm gonna start writing the core functions.

Enjoy. I rewrote the ending.

1. General architecture of aout3
   =============================
Whereas aout2 relied on the DSP plug-in behavior, aout3 is designed
for modern callback-based audio APIs, such as Mac OS X CoreAudio. Of
course DSP can still be supported, but through some kind of emulation.

| The following schematic has changed :

  +------+    +---------+    +---------+    +---------+
  | adec | -> |   Pre-  | -> |  Mix &  | -> |  Post-  | -> buffer #0
  |      |    | filters |    | Downmix |    | filters |    buffer #-1
  +------+    +---------+    +---------+    +---------+    buffer #-2    +--+
                                                           buffer #-3 -> |HW|
                                                                         +--+
  <--------------------->    <------------------------>    <---------------->
    audio decoder thread         audio mixer thread         audio output thr

              <--------->    <--------->    <--------->    <---------------->
modules :    aout filters    aout mixer    aout filters        aout aal

[aal = architecture abstraction layer]

As you can see, the major idea behind aout3 is to split the audio
output thread into two threads. As a matter of fact, the current aout
fulfills two contradictory missions : heavy calculation for mixing and
resampling, and waiting for a VERY accurate date to DMA the buffer to
the hardware.

The audio mixer thread takes over almost all functions of the current
audio output. The new audio output IO thread only cares about taking
the first spooled buffer and DMAing it to the hardware. The
implementation of the latter thread will differ greatly between
architectures. For instance the Mac OS X CoreAudio plug-in will not
need to spawn a thread, since this is already done by CoreAudio's IO
thread. The DSP plug-in on the contrary will launch a new thread, and
spend its time doing blocking write() calls.

On systems having real-time capabilities, the audio output IO thread 
should be assigned a VERY HIGH priority, so that DMA transfers are 
not even slightly delayed. This is the case in CoreAudio.

With that architecture, the only way we can get buffer underruns is
if the audio mixer thread doesn't have time to prepare the samples.
This implies a very long burst in the CPU load, and on real-time
systems one will want to assign the audio mixer thread a high (but
not as high as audio output's) priority.

And before every DMA transfer we will check the date and print a 
message if we're late. That way, at least we will know when we have 
to expect an audio underrun. Better than nothing.
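
A minimal sketch of that check, assuming dates are expressed in
microseconds. The aout_date_t type and the aout_check_late() name are
hypothetical, they are only here to illustrate the idea :

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical date type : microseconds on some monotonic clock. */
typedef int64_t aout_date_t;

/* Compare the current date with the date at which the buffer should
 * be played. Returns the lateness in microseconds (0 if we are on
 * time or early) and prints a warning when we are late. */
static int64_t aout_check_late(aout_date_t now, aout_date_t play_date)
{
    if (now <= play_date)
        return 0;

    int64_t late = now - play_date;
    fprintf(stderr, "aout: buffer is %lld us late, expect an underrun\n",
            (long long)late);
    return late;
}
```

The IO thread would call this just before handing the buffer to the
hardware ; the return value could also be fed back to the core
functions.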

2. aout3 data format
   =================
The schematic in chapter 1 is oversimplified. The audio output needs 
to understand a handful of input and output formats : u8, s8, u16, 
s16, float32, fixed24. It also must deal with an arbitrary number of 
channels (for instance #defined to 6), and with several input streams 
(this is debated [*]). We can no longer have an output per format (as
is the case in aout2), so we need some simplification.

I propose that the internal format for samples in the audio output be 
float32. It seems to be a very popular format ; for instance it is 
the native output format of liba52, and the native input format of 
Mac OS X CoreAudio. It allows for more precision in the samples, and 
as such I think it is the best choice that we have.
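
To give an idea of what a converter to the internal format could look
like, here is a small sketch for s16 input. The function name and the
1/32768 scale factor (which maps samples to [-1.0, 1.0)) are
assumptions, not something decided in this document :

```c
#include <stddef.h>
#include <stdint.h>

/* Convert signed 16-bit samples to float32.
 * Assumption : float32 samples are normalized to [-1.0, 1.0). */
static void s16_to_float32(const int16_t *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (float)in[i] / 32768.0f;
}
```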

float32 processing may take more CPU time than the current code. This
may hurt your feelings : we do not care. On all machines where VLC
runs, the audio output takes up close to 0 % CPU. We can afford
trading more CPU for more precision. I also think we should use more
expensive dithering algorithms for resampling. Audio output is the
main source of complaints from our users ; we must do something, even
at the expense of a higher CPU load.

| However, embedded systems usually do not have a floating-point unit, so
| we will also have a simpler mode using fixed24, the native format of
| libmad. It is not required for plug-in developers to support both float32
| and fixed24, since it is expected that embedded systems do not need as
| many features as workstations (for instance downmixing will probably
| never be useful on an embedded system).

[*] In case VLC reads several streams at once, there may be several 
instances of audio decoders at the same time, and thus several 
streams to mix before output. I'm not sure this is a good idea. In 
another thread I will speak about the multi-stream aspect of VLC, 
which can be debated.

| 3. aout3 filters
|    =============
aout3 is built as a pipeline of filters which have very different 
roles. A filter is a unit converting one stream format to another, a 
stream format being :

typedef struct audio_sample_format_s
{
     int i_type; /* u8, s8, u16, s16, float32, fixed24... */
     int i_rate;
     int i_channels; /* 1..6 */
} audio_sample_format_t;

3.1 Filter plug-ins

A filter plug-in takes one stream as input, and outputs one stream, 
with one or several parameters changed. Consequently, there will be 
three basic types of filter plug-ins :
- Converters, from one type to another (e.g. u16 -> float32)
- Resamplers (change i_rate), either because the hardware doesn't support
  the rate of the stream (48000 -> 44100), or because we have clock
  problems and need to go a little faster or slower (48000 -> 47500)
- Special effects plug-ins, which change the samples without changing
  the format ; for instance attenuation, balance or graphics effects.
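
As an illustration of the resampler type, here is a naive sketch
using linear interpolation on a mono float32 stream. It has no
anti-aliasing filter, and the function name and signature are
hypothetical ; a real plug-in would probably use a better algorithm :

```c
#include <stddef.h>

/* Resample a mono float32 buffer from in_rate to out_rate with
 * linear interpolation. Returns the number of samples written.
 * An output sample is produced only while both of its neighbours
 * are inside the input buffer, so the tail is left for the next
 * call. */
static size_t aout_resample_linear(const float *in, size_t n_in,
                                   int in_rate, float *out,
                                   size_t max_out, int out_rate)
{
    size_t n_out = 0;

    while (n_out < max_out)
    {
        /* Position of this output sample in the input stream. */
        double pos = (double)n_out * in_rate / out_rate;
        size_t i = (size_t)pos;
        if (i + 1 >= n_in)
            break;

        double frac = pos - (double)i;
        out[n_out++] = (float)(in[i] + frac * (in[i + 1] - in[i]));
    }
    return n_out;
}
```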

For optimization purposes, a filter plug-in can combine several 
operations. For instance, a filter can convert (u16 -> float32) and 
resample (48000 -> 44100) at the same time.

When a conversion is needed (i.e. for the pre-filters and post-filters
passes), the aout core functions will ask the modules for candidates
for the whole conversion, e.g. :
{ u16, 48000, 2 } -> { float32, 44100, 2 }
If there is no candidate, the core functions will split the transformation :
{ u16, 48000, 2 } -> { float32, 48000, 2 }
{ float32, 48000, 2 } -> { float32, 44100, 2 }
The type conversion will occur at the beginning (pre-filters pass) or
at the end (post-filters pass).
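
The splitting step could be sketched as follows. This only covers the
fallback case (no module handles the whole conversion) and assumes
the number of channels is unchanged. The AOUT_FMT_* constants,
aout_step_t and aout_split_conversion() are hypothetical names :

```c
#include <stddef.h>

typedef struct audio_sample_format_s
{
    int i_type;
    int i_rate;
    int i_channels;
} audio_sample_format_t;

/* Hypothetical type identifiers. */
enum { AOUT_FMT_U16 = 1, AOUT_FMT_FLOAT32 = 2 };

typedef struct { audio_sample_format_t in, out; } aout_step_t;

/* Split in -> out into steps that change one parameter each.
 * b_type_first != 0 : type conversion first (pre-filters pass),
 * b_type_first == 0 : type conversion last (post-filters pass).
 * Returns the number of steps written (0 to 2). */
static size_t aout_split_conversion(audio_sample_format_t in,
                                    audio_sample_format_t out,
                                    int b_type_first,
                                    aout_step_t steps[2])
{
    size_t n = 0;
    audio_sample_format_t cur = in;

    for (int pass = 0; pass < 2; pass++)
    {
        int b_do_type = (pass == 0) == (b_type_first != 0);
        audio_sample_format_t next = cur;

        if (b_do_type)
            next.i_type = out.i_type;
        else
            next.i_rate = out.i_rate;

        /* Only emit a step if it actually changes something. */
        if (next.i_type != cur.i_type || next.i_rate != cur.i_rate)
        {
            steps[n].in = cur;
            steps[n].out = next;
            n++;
            cur = next;
        }
    }
    return n;
}
```

With { u16, 48000, 2 } -> { float32, 44100, 2 } and b_type_first set,
this yields the two steps listed above, in the same order.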

In all cases, the user will have the ability to modify the pipeline 
and add or delete filters, provided the continuity of the formats is 
preserved.

3.2 Mixer plug-in

The mixer is a special type of filter, in that it takes several
streams as inputs, and outputs one stream. The input streams must all
have the same rate and type. Only two types will be supported :
float32 and fixed24.
Pre-filters are in charge of converting the streams to fulfill this
requirement.
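
A very simple float32 mixer could look like the sketch below. It just
sums the streams and hard-clips the result to [-1.0, 1.0] ; the
function name and the clipping policy are assumptions, and
downmixing is left out :

```c
#include <stddef.h>

/* Mix n_streams float32 buffers of n_samples samples each into out.
 * All streams are expected to share the same rate, type and channel
 * layout (this is the job of the pre-filters). */
static void aout_mix_float32(const float *const *streams,
                             size_t n_streams, float *out,
                             size_t n_samples)
{
    for (size_t i = 0; i < n_samples; i++)
    {
        float acc = 0.0f;
        for (size_t s = 0; s < n_streams; s++)
            acc += streams[s][i];

        /* Hard clipping ; a better mixer would attenuate instead. */
        if (acc > 1.0f)
            acc = 1.0f;
        else if (acc < -1.0f)
            acc = -1.0f;
        out[i] = acc;
    }
}
```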

The number of channels in the output stream will depend on the 
hardware capabilities. When necessary, downmixing or upmixing will be 
performed on the fly.

We can implement several mixers with different complexities. For 
instance workstations can use a float32 mixer with dithering and 
precise downmixing. Embedded systems may use a much faster fixed24 
mixer with limited accuracy.

4. aout data flow
   ==============

| 4.1 Pre-filters

  +------+    +-----------+    +-----------+    +----------+    +-----------+
  | adec | -> | Converter | -> | Resampler | -> | Optional | -> | amix thr. |
  |      |    |           |    |           |    | effects  |    |           |
  +------+    +-----------+    +-----------+    +----------+    +-----------+

  <-------------------------------------------------------->    <----------->
                  audio decoder thread                        audio mixer thr

Pre-filters transform the samples from the arbitrary format of the 
decoder to the native format of the audio mixer.

Please note that these operations take place in the audio decoder
thread, and not in the audio mixer thread. I see three advantages :
- when working with several streams, the conversion and resampling of 
every stream can be done in parallel for performance reasons (SMP 
machines) ;
- the conversion is done immediately, so that the buffer can be
immediately reused by the decoder ; we do not need to set up complex
buffering services with the decoder, which is free to allocate its
own buffers ;
- the filters are processed just after the decoding of the samples, 
which improves the processor cache efficiency, thus giving better 
performance.

And one drawback :
- the aout_PlayBuffer function doesn't return immediately. This must 
be taken into account when designing decoders.

When the audio decoder gives samples to output, it assigns them a
calculated timestamp. The audio output tries to figure out an
estimated timestamp of the output time of the first sample, with
information provided by the architecture abstraction layer. If the
two dates aren't equal, the audio output core functions may decide to :
1. do nothing if the difference is small ;
2. resample the current buffer to be in sync at the end of the buffer ;
3. skip samples if we're _really_ late.
The resampling is done on a per-stream basis, because some decoders
may be late while others aren't.
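
The decision could be sketched like this. The two thresholds are
made-up values for illustration only, and the names are hypothetical.
Being very early is handled by resampling here, which is also an
assumption :

```c
#include <stdint.h>

typedef enum
{
    AOUT_SYNC_NONE,      /* difference is small, do nothing */
    AOUT_SYNC_RESAMPLE,  /* resample the buffer to catch up */
    AOUT_SYNC_SKIP       /* really late, drop samples */
} aout_sync_action_t;

/* Hypothetical thresholds, in microseconds. */
#define AOUT_SYNC_TOLERANCE_US   5000
#define AOUT_SYNC_MAX_DRIFT_US 100000

/* decoder_date : timestamp assigned by the decoder.
 * estimated_date : estimated output date of the first sample. */
static aout_sync_action_t aout_sync_decide(int64_t decoder_date,
                                           int64_t estimated_date)
{
    /* drift > 0 means the samples would be played too late. */
    int64_t drift = estimated_date - decoder_date;
    int64_t abs_drift = drift < 0 ? -drift : drift;

    if (abs_drift <= AOUT_SYNC_TOLERANCE_US)
        return AOUT_SYNC_NONE;
    if (drift > AOUT_SYNC_MAX_DRIFT_US)
        return AOUT_SYNC_SKIP;
    return AOUT_SYNC_RESAMPLE;
}
```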

| 4.2 Mixer

The mixer combines several input streams into one output stream with
the same number of channels as the hardware. See § 3.2 for more 
information on the data flow inside the mixer plug-in.

The mixer runs in its own thread because it can multiplex streams 
coming from several adec threads. When there is only one decoder (the 
majority of the cases), it may seem smarter to run the mixer inside 
the audio decoder thread, for the same reasons as the pre-filters, 
and because the audio mixer doesn't have hard real-time constraints. 
We may choose to support such behavior. If a second stream suddenly
appears, we can either do the mixing in one of the decoder threads,
or spawn a new mixer thread.

| 4.3 Post-filters

Post-filters are run after the mixer, in the same thread, to control 
global properties such as volume or balance. They also convert the 
samples to the hardware type (float32 -> u16).

  +-------+    +----------+    +-----------+
  | mixer | -> | Optional | -> | Converter | -> spooled buffer for IO thread
  |       |    | effects  |    |           |
  +-------+    +----------+    +-----------+

  <---------------------------------------->
             audio mixer thread
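
For illustration, a post-filter combining a global volume and the
final conversion to u16 might be sketched as below. The function
name, the [-1.0, 1.0] sample range and the mapping to unsigned
values centered on 32768 are assumptions :

```c
#include <stddef.h>
#include <stdint.h>

/* Apply a volume factor, clip, and convert float32 to u16.
 * [-1.0, 1.0] is mapped to [1, 65535], silence being 32768. */
static void aout_post_float32_to_u16(const float *in, uint16_t *out,
                                     size_t n, float volume)
{
    for (size_t i = 0; i < n; i++)
    {
        float v = in[i] * volume;

        if (v > 1.0f)
            v = 1.0f;
        else if (v < -1.0f)
            v = -1.0f;

        /* The value is always positive here, so adding 0.5 before
         * truncating rounds to the nearest integer. */
        out[i] = (uint16_t)(v * 32767.0f + 32768.0f + 0.5f);
    }
}
```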

| 4.4 Architecture abstraction layer

The AAL runs a loop which regularly triggers the transfer of the
oldest spooled buffer to the hardware. No transformation is done. The
AAL will provide the core functions with information on the playback,
for instance whether we're late or early compared with the
timestamp of the buffer.
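
The spool between the mixer thread and the AAL can be seen as a small
FIFO. Here is a single-threaded sketch ; all names are hypothetical,
the size of 4 only mirrors the schematic of chapter 1, and the
locking needed between the two threads is deliberately left out :

```c
#include <stddef.h>

#define AOUT_SPOOL_SIZE 4  /* hypothetical */

typedef struct
{
    void *p_data;      /* samples, already in the hardware format */
    long long i_date;  /* date at which the buffer must be played */
} aout_buffer_t;

typedef struct
{
    aout_buffer_t slots[AOUT_SPOOL_SIZE];
    size_t head;   /* index of the oldest buffer */
    size_t count;  /* number of spooled buffers */
} aout_spool_t;

/* Called by the mixer thread. Returns -1 if the spool is full. */
static int aout_spool_push(aout_spool_t *s, aout_buffer_t b)
{
    if (s->count == AOUT_SPOOL_SIZE)
        return -1;
    s->slots[(s->head + s->count) % AOUT_SPOOL_SIZE] = b;
    s->count++;
    return 0;
}

/* Called by the AAL. Returns -1 on underrun (nothing to play). */
static int aout_spool_pop(aout_spool_t *s, aout_buffer_t *out)
{
    if (s->count == 0)
        return -1;
    *out = s->slots[s->head];
    s->head = (s->head + 1) % AOUT_SPOOL_SIZE;
    s->count--;
    return 0;
}
```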

The AAL has very strong real-time constraints and will thus run in 
its own thread, either provided by the operating system (Mac OS X) or 
spawned on the fly.

-- 
Christophe Massiot.

-- 
This is the vlc-devel mailing-list, see http://www.videolan.org/vlc/