[RFC] New audio output architecture
Christophe Massiot
massiot at via.ecp.fr
Sat May 4 01:23:17 CEST 2002
At 23:31 +0200 3/05/2002, Christophe Massiot wrote:
>I'm updating my document with a few major new ideas I've had today,
>and after that I think I'm gonna start writing the core functions.
Enjoy. I rewrote the ending.
1. General architecture of aout3
=============================
Whereas aout2 relied on the DSP plug-in behavior, aout3 is designed
for modern callback-based audio APIs, such as Mac OS X CoreAudio. Of
course DSP can still be implemented, but through some kind of
emulation.
| The following schematic has changed:
+------+    +---------+    +---------+    +---------+
| adec | -> | Pre-    | -> | Mix &   | -> | Post-   | -> buffer #0
|      |    | filters |    | Downmix |    | filters |    buffer #-1
+------+    +---------+    +---------+    +---------+    buffer #-2    +--+
                                                         buffer #-3 -> |HW|
                                                                       +--+
<--------------------->    <------------------------>    <---------------->
 audio decoder thread          audio mixer thread         audio output thr
<--------->                <--------->    <--------->    <---------------->
modules:    aout filters   aout mixer     aout filters        aout aal
[aal = architecture abstraction layer]
As you see, the major idea behind aout3 is to split the audio output
thread into two threads. As a matter of fact, the current aout
fulfills two contradictory missions: heavy calculations for mixing
and resampling, and waiting for a VERY accurate date to DMA the
buffer to the hardware.
The audio mixer thread takes over almost all the functions of the
current audio output. The new audio output IO thread only cares about
taking the first spooled buffer and DMAing it to the hardware. The
implementation of the latter thread will differ greatly between
architectures. For instance the Mac OS X CoreAudio plug-in will not
need to spawn a thread, since this is already done by CoreAudio's IO
thread. The DSP plug-in, on the contrary, will launch a new thread
and spend its time doing blocking write() calls.
On systems with real-time capabilities, the audio output IO thread
should be assigned a VERY HIGH priority, so that DMA transfers are
not even slightly delayed. This is the case with CoreAudio.
With that architecture, the only way we can get buffer underruns is
if the audio mixer thread doesn't have time to prepare the samples.
That implies a very long burst in the CPU load, so on real-time
systems one will want to assign the audio mixer thread a high
priority as well (though not as high as the audio output's).
And before every DMA transfer we will check the date and print a
message if we're late. That way, at least we will know when to
expect an audio underrun. Better than nothing.
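
To make this concrete, here is a minimal sketch of what the audio
output IO thread could boil down to; every name in it (aout_thread_t,
aout_buffer_t, WaitForNextBuffer, mdate, PrintLateMessage,
DMATransfer) is hypothetical and only meant to illustrate the loop
described above:

/* Hypothetical IO thread loop : wait for the oldest spooled buffer,
 * complain if we are late, and hand it to the hardware. */
static void RunIOThread( aout_thread_t * p_aout )
{
    for ( ; ; )
    {
        aout_buffer_t * p_buffer = WaitForNextBuffer( p_aout );

        /* Check the date before the transfer, as described above. */
        if ( mdate() > p_buffer->start_date )
            PrintLateMessage( p_aout, mdate() - p_buffer->start_date );

        /* Blocking transfer to the hardware (e.g. a write() to the
         * DSP device ; under CoreAudio this whole loop disappears,
         * since CoreAudio calls us back from its own IO thread). */
        DMATransfer( p_aout, p_buffer );
    }
}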
2. aout3 data format
=================
The schematic in chapter 1 is oversimplified. The audio output needs
to understand a handful of input and output formats: u8, s8, u16,
s16, float32, fixed24. It must also deal with an arbitrary number of
channels (for instance #defined to 6), and with several input streams
(this is debated [*]). We can no longer have an output per format (as
is the case in aout2), so we need some simplification.
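
For illustration, the handful of formats could simply be integer
constants; these names are invented here and are not part of any
existing header:

/* Hypothetical sample type identifiers for the formats above. */
#define AOUT_FMT_U8        0
#define AOUT_FMT_S8        1
#define AOUT_FMT_U16       2
#define AOUT_FMT_S16       3
#define AOUT_FMT_FLOAT32   4
#define AOUT_FMT_FIXED24   5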
I propose that the internal format for samples in the audio output be
float32. It seems to be a very popular format; for instance it is
the native output format of liba52, and the native input format of
Mac OS X CoreAudio. It allows for more precision in the samples, and
as such I think it is the best choice we have.
float32 processing may take more CPU time than what we do currently.
I may hurt your feelings: we do not care. On all machines where VLC
runs, the audio output takes up 0 % CPU. We can afford trading more
CPU for more precision. I also think we should use more expensive
dithering algorithms for resampling. Audio output is the main source
of complaints from our users; we must do something, even at the
expense of a higher CPU load.
| However, embedded systems usually do not have a floating-point unit, so
| we will also have a simpler mode using fixed24, the native format of
| libmad. It is not required for plug-in developers to support both float32
| and fixed24, since it is expected that embedded systems do not need as
| many features as workstations (for instance downmixing will probably
| never be useful on an embedded system).
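
As a sketch of what the fixed-point path implies, assuming fixed24
means 24 fractional bits stored in a signed 32-bit integer (the exact
layout is not specified here), a conversion to float32 could look
like:

/* Hypothetical conversion from a fixed24 sample (assumed : 24
 * fractional bits in a signed 32-bit integer) to float32. */
static inline float Fixed24ToFloat32( int i_sample )
{
    return (float)i_sample / (float)(1 << 24);
}

On a machine without an FPU we would of course stay in fixed24 from
one end of the pipeline to the other and never call such a function.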
[*] In case VLC reads several streams at once, there may be several
instances of audio decoders at the same time, and thus several
streams to mix before output. I'm not sure this is a good idea. In
another thread I will speak about the multi-stream aspect of VLC,
which can be debated.
| 3. aout3 filters
| =============
aout3 is built as a pipeline of filters which have very different
roles. A filter is a unit converting one stream format to another, a
stream format being:
typedef struct audio_sample_format_s
{
    int i_type;       /* u8, s8, u16, s16, float32, fixed24... */
    int i_rate;       /* sample rate in Hz, e.g. 44100 or 48000 */
    int i_channels;   /* 1..6 */
} audio_sample_format_t;
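
As an example of how the core could use this structure, here is a
trivial helper (with a hypothetical name) deciding whether a
conversion is needed at all:

/* Hypothetical helper : two streams need no conversion filter if
 * their formats are identical. */
static int FormatsAreEqual( const audio_sample_format_t * p_a,
                            const audio_sample_format_t * p_b )
{
    return p_a->i_type == p_b->i_type
        && p_a->i_rate == p_b->i_rate
        && p_a->i_channels == p_b->i_channels;
}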
3.1 Filter plug-ins
A filter plug-in takes one stream as input, and outputs one stream
with one or several parameters changed. Consequently, there will be
three basic types of filter plug-ins (sketched in code after this
list):
- Converters, from one type to another (e.g. u16 -> float32)
- Resamplers (changing i_rate), either because the hardware doesn't
support the rate of the stream (48000 -> 44100), or because we have
clock problems and need to go a little faster or slower (48000 -> 47500)
- Special effects plug-ins, which change the samples without changing
the format; for instance attenuation, balance or graphics effects.
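
Here is a rough sketch of what a filter plug-in descriptor could look
like; all names and the callback signature are invented for
illustration:

/* Hypothetical filter descriptor : a unit converting samples from
 * its input format to its output format. */
typedef struct aout_filter_s
{
    audio_sample_format_t input;
    audio_sample_format_t output;

    /* Process i_nb_samples samples from p_in into p_out ; buffers
     * are untyped because the sample type depends on the format. */
    void (*pf_do_work)( struct aout_filter_s * p_filter,
                        void * p_in, void * p_out, int i_nb_samples );
} aout_filter_t;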
For optimization purposes, a filter plug-in can combine several
operations. For instance, a filter can convert (u16 -> float32) and
resample (48000 -> 44100) at the same time.
When a conversion is needed (i.e. for the pre-filters and
post-filters passes), the aout core functions will probe modules for
candidates for the whole conversion, e.g.:
{ u16, 48000, 2 } -> { float32, 44100, 2 }
If there is no candidate, the core functions will split the
transformation in two:
{ u16, 48000, 2 } -> { float32, 48000, 2 }
{ float32, 48000, 2 } -> { float32, 44100, 2 }
The type conversion will occur at the beginning (pre-filters pass) or
at the end (post-filters pass).
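
In pseudo-C, the core's strategy could look like this; FindFilter is
a hypothetical probe returning a module able to do the requested
conversion, or NULL:

/* Sketch of the pipeline-building logic described above. */
aout_filter_t * pipeline[2] = { NULL, NULL };
audio_sample_format_t mid = input;

pipeline[0] = FindFilter( &input, &output );
if ( pipeline[0] == NULL )
{
    /* No module handles the whole conversion : convert the type
     * first, then resample. (In the post-filters pass the type
     * conversion would come last instead.) */
    mid.i_type = output.i_type;
    pipeline[0] = FindFilter( &input, &mid );
    pipeline[1] = FindFilter( &mid, &output );
}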
In all cases, the user will have the ability to modify the pipeline
and add or delete filters, provided the continuity of the formats is
preserved.
3.2 Mixer plug-in
The mixer is a special type of filter, in that it takes several
streams as inputs and outputs one stream. The input streams must all
have the same rate and type. Only two types will be supported:
float32 and fixed24.
Pre-filters are in charge of converting the streams to fulfill this
requirement.
The number of channels in the output stream will depend on the
hardware capabilities. When necessary, downmixing or upmixing will be
performed on the fly.
We can implement several mixers with different complexities. For
instance workstations can use a float32 mixer with dithering and
precise downmixing. Embedded systems may use a much faster fixed24
mixer with limited accuracy.
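
A minimal sketch of the float32 mixing loop itself, ignoring
downmixing and dithering; all names are illustrative:

/* Hypothetical float32 mixer core : sum i_nb_streams input buffers
 * into one output buffer, sample by sample. */
static void MixFloat32( float * p_out, float ** pp_in,
                        int i_nb_streams, int i_nb_samples )
{
    int i, j;
    for ( i = 0; i < i_nb_samples; i++ )
    {
        float f_sum = 0.0;
        for ( j = 0; j < i_nb_streams; j++ )
            f_sum += pp_in[j][i];
        p_out[i] = f_sum;   /* saturation is left to the converter */
    }
}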
4. aout data flow
==============
| 4.1 Pre-filters
+------+    +-----------+    +-----------+    +----------+    +-----------+
| adec | -> | Converter | -> | Resampler | -> | Optional | -> | amix thr. |
|      |    |           |    |           |    | effects  |    |           |
+------+    +-----------+    +-----------+    +----------+    +-----------+
<-------------------------------------------------------->    <----------->
                   audio decoder thread                      audio mixer thr
Pre-filters transform the samples from the arbitrary format of the
decoder to the native format of the audio mixer.
Note that these operations take place in the audio decoder thread,
and not in the audio mixer thread. I see three advantages:
- when working with several streams, the conversion and resampling of
every stream can be done in parallel for performance reasons (SMP
machines);
- the conversion is done immediately, so that the buffer can be
immediately reused by the decoder; we do not need to set up complex
buffering services with the decoder, which is free to allocate its
own buffers;
- the filters are processed just after the decoding of the samples,
which improves the processor cache efficiency, thus giving better
performance.
And one drawback:
- the aout_PlayBuffer function doesn't return immediately. This must
be taken into account when designing decoders.
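
From the decoder's point of view this could look as follows; only
aout_PlayBuffer is named in this RFC, the other names are invented:

/* Inside a hypothetical audio decoder loop. */
DecodeFrame( p_dec, p_buffer );            /* fill p_buffer */
aout_PlayBuffer( p_aout_input, p_buffer ); /* runs the pre-filters and
                                            * returns only when done */
/* p_buffer can be reused for the next frame right away. */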
When the audio decoder hands samples to the output, it assigns them a
calculated timestamp. The audio output tries to figure out an
estimated timestamp for the output time of the first sample, with
information provided by the architecture abstraction layer. If the
two dates aren't equal, the audio output core functions may decide to:
1. do nothing if the difference is small;
2. resample the current buffer to be in sync at the end of the buffer;
3. skip samples if we're _really_ late.
The resampling is done on a per-stream basis, because some decoders
may be late while others aren't.
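
A sketch of that three-way decision; the thresholds and all names
except mtime_t (VLC's microsecond date type) are invented:

/* drift > 0 means this stream is late. */
mtime_t drift = i_estimated_play_date - i_decoder_timestamp;

if ( drift < SMALL_DRIFT )
    ;                                   /* 1. do nothing */
else if ( drift < MAX_RESAMPLING_DRIFT )
    ResampleBuffer( p_input, drift );   /* 2. stretch this buffer */
else
    SkipSamples( p_input, drift );      /* 3. drop samples */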
| 4.2 Mixer
The mixer combines several input streams into one output stream with
the same number of channels as the hardware. See § 3.2 for more
information on the data flow inside the mixer plug-in.
The mixer runs in its own thread because it can multiplex streams
coming from several adec threads. When there is only one decoder (the
majority of cases), it may seem smarter to run the mixer inside the
audio decoder thread, for the same reasons as the pre-filters, and
because the audio mixer doesn't have hard real-time constraints. We
may choose to support such a behavior. If a second stream suddenly
appears, we can either do the mixing in one of the decoder threads,
or spawn a new mixer thread.
| 4.3 Post-filters
Post-filters are run after the mixer, in the same thread, to control
global properties such as volume or balance. They also convert the
samples to the hardware type (float32 -> u16).
+-------+    +----------+    +-----------+
| mixer | -> | Optional | -> | Converter | -> spooled buffer for IO thread
|       |    | effects  |    |           |
+-------+    +----------+    +-----------+
<----------------------------------------->
             audio mixer thread
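
For instance, the final converter for s16 hardware could be as simple
as this sketch (saturation included, dithering left out; the function
name is invented):

/* Hypothetical post-filters converter : float32 -> s16 with clipping. */
static void Float32ToS16( const float * p_in, short * p_out,
                          int i_nb_samples )
{
    int i;
    for ( i = 0; i < i_nb_samples; i++ )
    {
        float f = p_in[i] * 32768.0;
        if ( f > 32767.0 )  f = 32767.0;
        if ( f < -32768.0 ) f = -32768.0;
        p_out[i] = (short)f;
    }
}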
| 4.4 Architecture abstraction layer
The AAL runs a loop which regularly triggers the transfer of the
oldest spooled buffer to the hardware. No transformation is done. The
AAL will provide the core functions with information on the playback,
for instance whether we're late or early compared to the timestamp of
the buffer.
The AAL has very strong real-time constraints and will thus run in
its own thread, either provided by the operating system (Mac OS X) or
spawned on the fly.
--
Christophe Massiot.