[RFC] New audio output architecture

Christophe Massiot massiot at via.ecp.fr
Wed May 1 23:44:19 CEST 2002


Dear friends,

We're having more and more problems with our current audio output 
architecture, and I am fully convinced that the only option we have 
is to annihilate it. Nuke it.

I have spent quite a long time debugging aout_macosx.c, trying to 
avoid cracks and glitches whenever the CPU load is a little higher 
than usual. This is pointless. From what I have learned from this 
experience, the problem is that our architecture is completely f*cked 
up.

In this document I would like to propose ideas for a new audio output 
architecture, code-named aout3. I'm not saying here that _I_ will 
write it, though if nobody volunteers, I'll eventually end up doing 
it, with my usual huge latency. Comments from audio specialists are 
welcome, because I'm definitely not one of them (and, I must say, I'm 
really dumb when it comes to audio matters).

0. Current audio output implementation (code-named aout2)
    ======================================================
In this paragraph I'll try to point out what's wrong in aout2. 
First, please bear in mind that aout2 is probably the oldest piece of 
code in VLC; it was written at the very beginning of the project, 
before I had even started writing the first MPEG video decoder... We 
were young and didn't have much experience.

  +----------+               +----------+     +--------+     +-----------+
  |   Audio  | ------------> |   Mix &  | --> | UNIQUE | --> | soundcard |
  | decoders | [aout_fifo_t] | Resample |     | buffer |     +-----------+
  +----------+               +----------+     +--------+

                             <------------------------->     <----------->
                                    audio output                plug-in

[if you can't see this drawing, try with a fixed-size font such as Courier]

Here is the course of action of the audio output thread:

   Mix & Resample aout fifos <------------------------------------------+
                                                                        |
   Write a unique buffer of size AOUT_BUFFER_DURATION                   |
                                                                        |
   Ask the hardware how many bytes are remaining in the card's buffer,  |
   and calculate the output date of the next byte [pf_getbufinfo]       |
                                                                        |
   Write the unique buffer in the card's buffer [pf_play]               |
                                                                        |
   Block until this is done --------------------------------------------+

A first problem, which adds to the complexity of the audio output, is 
the multiplicity of aout_fifo_t formats. We currently support:
  - unsigned 8-bit
  - signed 8-bit
  - unsigned 16-bit
  - signed 16-bit
Each in stereo or mono mode, which makes 8 input formats to account 
for, with one audio output path per format. Meanwhile, we don't 
accept the popular float32 (liba52) or fixed24 (libmad) formats, and 
multi-channel output (5.1) is not even scheduled.

The second problem, and the most important one, is the unique buffer. 
Since we only have 100 ms of data in store, if, for one reason or 
another, the scheduler doesn't schedule the thread for 100 ms, we're 
lost. There is not much we can do about that with the DSP plug-in 
used on *NIX-like OSes, but hey, you know what, kernel developers 
have made some major advances over the last 10 years.

Take for instance the Mac OS X CoreAudio architecture, of which I am 
now officially a major fan. Writing the buffer is not done through a 
simple system call (write), but through a clever callback mechanism. 
Whenever CoreAudio is starving, it wakes up (with a VERY high 
priority) a thread called the IO thread, which calls your callback, 
so that data you have prepared in advance can be DMAed immediately. 
Unfortunately, with the unique buffer, there is no way we can prepare 
data in advance. The same applies to the DSP output plug-in: while 
we're stuck in the write() system call, we could already be preparing 
the next buffer.


1. General architecture of aout3
    =============================
Whereas aout2 relied on the DSP plug-in's behavior, aout3 is designed 
for modern callback-based audio APIs, such as Mac OS X CoreAudio. Of 
course DSP can still be implemented, but through some kind of emulation.

  +------+    +----------+    +---------+    +---------+
  | adec | -> |   Mix &  | -> | Sound   | -> | Channel | -> buffer #0
  |      | -> | Resample |    | effects |    | downmix |    buffer #-1
  +------+    +----------+    +---------+    +---------+    buffer #-2
                                                            buffer #-3 -> HW

              <---------------------------------------->    <-------------->
                         audio mixer thread                 audio output thr

              <---------->    <------------------------>    <-------------->
               aout core        aout filter plug-in(s)        aout plug-in

As you see, the major idea behind aout3 is to split the audio output 
thread in two. As a matter of fact, the current aout fulfills two 
contradictory missions: heavy calculation for mixing and resampling, 
and waiting for a VERY accurate date to DMA the buffer to the 
hardware.

The audio mixer thread takes over almost all functions of the current 
audio output. The new audio output IO thread only cares about taking 
the first spooled buffer and DMAing it to the hardware. The 
implementation of the latter thread will greatly differ between 
architectures. For instance the Mac OS X CoreAudio plug-in will not 
need to spawn a thread, since this is already done by CoreAudio's IO 
thread. The DSP plug-in, on the contrary, will launch a new thread 
and spend its time doing blocking write() calls.
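
Here is a rough sketch of what the DSP emulation thread could look 
like; aout_buffer_t, aout_FifoPop() and aout_BufferRelease() are 
assumed names for the spool between the two threads, nothing of this 
exists yet:

#include <unistd.h>

static void * DSPThread( void * p_data )
{
    int i_fd = *(int *)p_data;   /* file descriptor on /dev/dsp */

    for ( ; ; )
    {
        /* Blocks until the audio mixer thread has spooled a buffer. */
        aout_buffer_t * p_buffer = aout_FifoPop();

        /* Blocking write: the OSS driver wakes us up when the data
         * has been swallowed, emulating a callback-based API. */
        write( i_fd, p_buffer->p_samples, p_buffer->i_size );

        aout_BufferRelease( p_buffer );
    }
    return NULL;
}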

On systems having real-time capabilities, the audio output IO thread 
should be assigned a VERY HIGH priority, so that DMA transfers are 
not even slightly delayed. This is the case in CoreAudio.

With that architecture, the only way we can get buffer underruns is 
if the audio mixer thread doesn't have time to prepare the samples. 
That would imply a very long burst in CPU load, and on real-time 
systems one will want to assign the audio mixer thread a high 
priority (though not as high as the audio output's).

Also, before every DMA transfer we will check the date and print a 
message if we're late. That way, at least we will know when to 
expect an audio underrun. Better than nothing.
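
A rough sketch of that check, assuming the buffer carries a 
start_date field (mtime_t and mdate() are our existing 64-bit 
microsecond clock; real code would use VLC's logging functions 
instead of fprintf()):

#include <stdio.h>

static void CheckDate( aout_buffer_t * p_buffer )
{
    /* start_date is an assumed field: when the first sample of this
     * buffer is supposed to hit the hardware. */
    mtime_t i_delay = mdate() - p_buffer->start_date;

    if ( i_delay > 0 )
        fprintf( stderr, "aout: buffer is %lld us late, "
                 "expect an underrun\n", (long long)i_delay );
}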


2. aout3 data flow
    ===============
The schematic in section 1 is oversimplified. The audio output needs 
to understand a handful of input and output formats: u8, s8, u16, 
s16, float32, fixed24. It also must deal with an arbitrary number of 
channels (for instance #defined to 6), and with several input streams 
(this is debated [**]). We can no longer have one output per format 
(as is the case in aout2), so we need some simplification.

I propose that the internal sample format of the audio output be 
float32. It seems to be a very popular format; for instance it is 
the native output format of liba52, and the native input format of 
Mac OS X CoreAudio. It allows for more precision in the samples, and 
as such I think it is the best choice we have.

float32 processing may take more CPU time than what we have now. I 
may hurt your feelings: we do not care. On all machines where VLC 
runs, the audio output takes up 0 % CPU. We can afford trading more 
CPU for more precision. I also think we should use more expensive 
dithering algorithms for resampling. Audio output is the main source 
of complaints from our users; we must do something, even at the 
expense of a higher CPU load. [*]

Since all internal calculations will be done in float32, we need a 
bunch of converters to and from float32. The flow of operations on 
the data is as follows:

  -------------------------------------------------+------------------------
  Conversion from the adec format to float32       | AOUT_CONVERTER plug-in
                                                   +------------------------
  Mixing several input streams & resampling        | main audio mixer module
                                                   +------------------------
  Sound effects [optional]                         | AOUT_FILTER plug-in
                                                   +------------------------
  Downmixing if the output plug-in doesn't support | AOUT_FILTER plug-in #2
  as many channels                                 |
                                                   +------------------------
  Conversion from float32 to native output format  | AOUT_CONVERTER plug-in
  -------------------------------------------------+------------------------
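
As an example, the first conversion stage for s16 input could look 
like this rough sketch (the function name is illustrative, not a 
committed plug-in API):

#include <stddef.h>
#include <stdint.h>

static void S16ToFloat32( const int16_t * p_in, float * p_out,
                          size_t i_samples )
{
    size_t i;

    /* Map [-32768, 32767] onto [-1.0, 1.0). */
    for ( i = 0; i < i_samples; i++ )
        p_out[i] = (float)p_in[i] / 32768.f;
}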

Notes
[*] I'm not completely sure about that. The consequences of float32 
on embedded systems without hardware FPU support need to be 
evaluated. For these systems, fixed24 (the native format of libmad) 
may be a smarter choice. Perhaps it would be a good idea to have a 
version of the audio mixer using fixed24 for embedded systems. 
Caution: I'm not saying that we should support two native formats at 
the same time, just that there could be a #define AOUT_FORMAT 
fixed24 for some architectures. This implies adapting the sound 
effects and downmixing modules too, but embedded systems probably do 
not need such complicated things...

[**] In case VLC reads several streams at once, there may be several 
instances of audio decoders at the same time, and thus several 
streams to mix before output. I'm not sure this is a good idea. In 
another thread I will speak about the multi-stream aspect of VLC, 
which can be debated.


3. APIs
    ====

These APIs do not pretend to be exhaustive; this is just a quick 
look at what needs to be done.

3.1 Decoder API (aout_ext-dec.h)

I suggest that we use the same approach as for the video output. It 
is indeed far easier to understand than a single shared structure 
without any functions at all.

void * aout_NewStream( int i_format, int i_channels, int i_rate );
void aout_EndStream( void * p_stream );
void aout_PlaySound( void * p_stream, byte_t * p_samples, size_t i_size,
                      mtime_t play_date );
[play_date is MANDATORY]
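
To make the intended call sequence concrete, here is a hypothetical 
decoder-side usage; AOUT_FMT_FLOAT32 is an assumed identifier, since 
the format constants remain to be defined:

static void DecodeFrame( byte_t * p_samples, size_t i_size,
                         mtime_t i_play_date )
{
    /* Open a float32 stereo stream at 44.1 kHz. */
    void * p_stream = aout_NewStream( AOUT_FMT_FLOAT32, 2, 44100 );

    /* One call per decoded frame; the play date is mandatory so the
     * mixer knows exactly when these samples must hit the hardware. */
    aout_PlaySound( p_stream, p_samples, i_size, i_play_date );

    aout_EndStream( p_stream );
}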

When the audio mixer thread has received samples for all streams, it 
can start mixing them. If the streams do not all have the same number 
of channels, the highest number will be chosen. It may also change 
the sample rate if necessary.

In order to avoid too many malloc()s, it might be a good idea to 
have a buffer cache system, such as what we did for the input buffers.

3.2 AOUT_CONVERTER plug-in

An aout converter module shall be a very simple object:

int Probe( int * i_input_format, int i_output_format, int i_channels );
byte_t * Convert( int i_input_format, int i_output_format, int i_channels,
                   byte_t * p_input_samples, byte_t * p_output_samples,
                   size_t i_input_size, size_t * pi_output_size );
[if p_output_samples is too small, Convert returns a pointer to the 
first sample of p_input_samples which hasn't been converted]
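
As an illustration, a converter producing float32 could implement 
Probe() along these lines; the format constants are hypothetical, 
and since the RFC leaves the negotiation semantics of the 
i_input_format pointer open, it is only read here:

int Probe( int * i_input_format, int i_output_format, int i_channels )
{
    /* This module only produces float32. */
    if ( i_output_format != AOUT_FMT_FLOAT32 )
        return -1;

    switch ( *i_input_format )
    {
        case AOUT_FMT_S16:
        case AOUT_FMT_U16:
            return 0;    /* conversion supported */
        default:
            return -1;   /* let another converter handle it */
    }
}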

3.3 AUDIO_OUTPUT plug-in

An audio output plug-in deals with the hardware. It is also very simple:
int aout_Init( ... ); -> spawns the audio output IO thread if necessary
static void IOCallback( ... ); -> takes the first buffer prepared by 
the audio mixer and DMAs it to the hardware

3.4 AOUT_FILTER plug-in

An AOUT_FILTER takes a buffer with n channels and returns a buffer 
with m channels, possibly applying another transformation along the 
way (such as displaying a graph, equalizing, or whatever...).
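
For instance, a trivial downmixing filter could boil down to 
something like this sketch (the names are illustrative; a real 
filter would go through the AOUT_FILTER plug-in interface):

#include <stddef.h>

static void StereoToMono( const float * p_in, float * p_out,
                          size_t i_frames )
{
    size_t i;

    /* Average the left and right channels of interleaved float32. */
    for ( i = 0; i < i_frames; i++ )
        p_out[i] = 0.5f * ( p_in[2 * i] + p_in[2 * i + 1] );
}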


4. Suggested actions
    =================
This document is a request for comments. So for now, comments are 
welcome. The next step will be a call for volunteers and we'll see 
who does what.

-- 
Christophe Massiot.
