Saturday, 11 July

Since the work on NihAV is nearing the point when I can release it to public without that much shame (main features I wanted to implement are there and I’ve even documentation for all public interfaces plus some overview, you can’t ask for more than that) I want to give $title.

nihav-tool

This is the oldest tool oriented mostly to test decoders functionality. By default it will try to decode every stream in a file and output it either into a wave file or a sequence of images (PPM for RGB video, PGMYUV for YUV). Beside that it can also not decode a stream (and if you choose to decode neither then it tests demuxer or dumps raw frames).

Here is the list of switches it understands:

  • -noout makes it decode data but not produce any output (good for testing decoding process if you don’t currently care about decoder output);
  • -an/-vn makes it ignore audio or video streams correspondingly;
  • -nm=count/pktpts/frmpts make nihav-tool write frame numbers as a sequence or using PTS from input packet or decoded frame correspondingly;
  • -skip=key/inter tells video codec (if it is willing to listen) to skip less significant frames and decode only keyframes or intra- and interframes but no B-frames;
  • -seek time tells the tool to seek to the given position before decoding;
  • -apfx/-vpfx prefix specify the prefix for output filename(s) which comes useful when decoding files in a batch;
  • -ignerr tells nihav-tool to keep decoding ignoring errors the decoders report;
  • -dumpfrm tells nihav-tool to dump raw frames. This is both useful for obtaining raw audio frames (I could not make avconv do that) and because of the way it is implemented (it dumps packet contents first and then tries to decode it) if you use it along with the decoder and it errors out you’ll have raw frame on which it errored out.

Additionally you can specify end time after giving input name if you don’t need to decode the whole file.

As you can see this is not the most feature-rich tool but it works good enough for the declared goal (hence I use it mostly a debug build of it).

nihav-player

This is another quick and dirty tool that appeared when I decided that looking at long sequences of images is not the best way to ensure that decoding goes right. So I wrote something that can pass in a bad light for a player since it can show moving pictures and play sound that sometimes even goes in sync instead of deadlocking audio playback thread.

Currently it’s written using patched SDL1 crate (removing dependencies on num and rand and adding YUV overlay support and audio interface that you can actually use from Rust; patches will be available in the same repository) because my primary development system is too old and I don’t want to mess with various libraries or finding which version of sdl2 crate would compile using my current version of Rust (1.31 or 1.33.

In either case it’s a temporary solution used mostly for visual debugging and I want to write a proper media player based on SDL2 that would play audio-only files just as fine (so I can move to dogfooding). After all, can you really call yourself a multimedia developer if haven’t written a single player?

nihav-encoder

And finally the tool that appeared out of need to debug encoders instead of decoders. Hopefully it will become more useful than that one day but at least its interface should give you the idea what it does and what it will do in the future.

I still consider one of the main problems with ffmpeg (the tool) and later avconv the positional order of arguments. Except when the order does not matter. If you’ve never been annoyed by the fact you should put some arguments before -i infile in order for them to take effect on input and the rest of arguments should be put before output file name—well, in this case you’re luckier than me. So I’ve decided to have it in a more free-form format.

nihav-encoder command line looks a list of options in no particular order and some of them take complex arguments and then you provide a comma-separated list in form --options-list option1,option2=value,option3=.... Here is the list of recognised options:

  • --list-{decoders,encoders,demuxers,muxers} obviously lists the corresponding category and quits after listing all requested lists and options (see the next item);
  • --query-{decoder,encoder,demuxer,muxer}-options name prints the list of options supported by the corresponding codec or (de)muxer. Of course you can request options for several different things to be listed by adding this option several times;
  • --input inputfile and --output outputfile;
  • --input-format format and --output-format format force (de)muxer to use the provided format when autodetection fails;
  • --demuxer-options options takes a comma-separated list of options for demuxer (BTW you can also force input format with e.g. --demuxer-options format=avi);
  • --muxer-options options takes a comma-separated list of options for muxer (BTW you can also force output format with e.g. --muxer-options format=avi);
  • --no-audio and --no-video tell nihav-encoder to ignore all audio or video streams correspondingly;
  • --start time and --end time tell nihav-encoder to start decoding at given time and end at given time. The times are absolute so --start 1:10:00 --end 1:11:00 will process just a second of data;
  • --istreamX options and --ostreamX options set options for input and output streams with given numbers (starting with zero of course). More about them below.

nihav-encoder has two modes of operation: query mode, in which you specify which e.g. demuxers or codec options you want listed, and the program quits after listing them; and transcode mode, in which you specify input and output file and what you want to do with them. Maybe I’ll add a probe mode but I’ve never cared much about it before.

So what happens when you specify input and output? nihav-encoder will try to see which streams can be output (e.g. when transcoding from AVI to WAV there’s no point to even attempt to do anything with video stream), then it will try to copy input streams to the output unless anything else is specified. Of course you can specify that you want to discard some input stream with e.g. --istream0 drop. And for output streams you can also specify encoder and its parameters. For example my command line for testing Cinepak encoding looks like this:

./nihav-encoder –input laser05.avi –output cinepak.avi –no-audio –ostream0 encoder=cinepak,quant_mode=mediancut,nstrips=4

It takes input file laser05.avi, discards audio stream, encodes remaining video stream with Cinepak encoder that has options quant_mode and nstrips set explicitly, and writes the result to cinepak.avi.

As you can see, this tool has enough features to serve as a daily base transcoder but no complex features like taking input from several files, arbitrary mapping input streams from them to output streams and maybe applying some effects while at it. In my opinion that’s the task for some more complex application that builds a complex processing graph probably using a domain-specific language to specify inputs and outputs and what to do with them (and it should be a proper command file instead of command line that is impossible to type correctly even from the eighth try). Since I never had interest in GStreamer I’m definitely not going even to play with that. But a simple transcoder should serve my needs just fine.

Saturday, 04 July

So instead of doing something productive like adding missing functionality bits and writing documentation I wasted my time on adding some QuickTime decoders. And while wasting time on adding SVQ1, SVQ3, QDMC and QDM2 decoders it became apparent why NihAV is a good thing to exist.

Implementing two of them was not a very big deal but implementing SVQ3 and QDM2 decoders took more than a week each because there are only two specifications available for them and both are equally hard to comprehend: the first one is the official binary specification, the second one is source code in libavcodec which is derived from the former.

The problem arises when somebody wants to understand how it works and/or reimplement the code and both SVQ3 and QDM2 decoder demonstrate two different aspects of that problem.

SVQ3 decoder is based on some draft of H.264 (or ex-MPEG/AVC if you’re from Piedmont) with certain extensions mostly related to motion compensation. Documentation for it was scarce and because of optimisations and integration with common H.264 decoder bits it’s hard to understand some of the things. One of those is intra prediction with two modes having SVQ3-specific hacks hidden in libavcodec/h264pred.c (those are 16×16 plane prediction mode giving transposed result and 4×4 diagonal down prediction being simplified and not relating on pixels not immediately top/left from the block) and another one is block coefficients decoding function. It took me quite a while to realize that it actually decodes three different kinds of blocks: single 4×4 block with zigzag scan, 4×4 block divided into two parts with interlaced scan, and 2×2 block. I’ve documented most of that in The Wiki (before that nobody has touched that page for almost ten years; sometimes I feel like I’m the only person contributing there).

QDM2 is horrible in different way. It is slightly improved translation of the original binary specification with hardly any idea how it works (there are still names like local_int_8 in the code). Don’t get me wrong, back in 2003-2005 when reverse engineering was done the only tools you had were debugger, disassembler (you’re lucky if it’s not the one provided by debugger) and no decompilers at all (IIRC rec appeared much later and was of limited usefulness, especially on multi-megabyte QT monolith—and that’s assuming you’re not doing that on Mac with even less tools available). I did some of such work back then as well so I understand how hard it is and how you’re happy that it works somehow and you can ship it and forget about it.

Another thing is that now it’s clear that QDMC and QDM2 are predecessors of DT$ LBR (aka Express) and use the same principles (QDMC simply coded noise and tones, QDM2 is almost like LBR but without some features like LPC or multichannel audio and with different chunk structure), but back in the day there was no documentation on LBR (or LBR itself for that matter).

But the main problem is that nobody has tried to understand the code since. It became a so-called category killer i.e. its existence prevents others from doing something similar. At least until some idiot tried to do another implementation in NihAV.

And here we have the reason for NihAV to exist: it advances the understanding of codecs for me (and I document results in The Wiki) resulting in different implementations that are (hopefully) easier to understand and sometimes even fix long-standing bugs. I hope this shall convince you that sometimes it’s good to have reimplementation of the decoder even if an existing implementation is good enough (as far as I remember the only time a decoder was rewritten in FFmpeg was when a reverse-engineered Indeo 3 decoder that crashed on damaged content almost every time was replaced with a reverse-engineered Indeo 3 decoder where a guy had the idea how it works).

But back to QDM2: while my decoder is not finished yet and I probably won’t bother with inter-frames in it (I’ve never seen any samples with those), it still decodes sweeps much better. That’s mostly because of the various bugs I’ve uncovered (also while discovering that Ghidra effectively does not allow to edit about a megabyte large decoder context). Since I have no incentive to produce a patch and people who created the decoder are long gone from the project, here are some spotted bugs: wrong coarse quantiser band selection (resulting in noise generated in wrong frequency range), reading bits past the chunk end (because is some cases checks are missing), ignoring group 4 tones because of the wrong conditions, some initial variables are set in the wrong way too. Nevertheless it mostly works and it was very useful for mapping the functions in the binary specification (fun fact: QDM2 decoder is located in QuickTime.qts while QDMC is located in QuickTimeInternetExtras.qtx).

Sunday, 07 June

I’m happy to announce that NihAV has finally taken more or less complete form. Sure there are some concepts I wanted to play with (like raw streams handling) but I had no need for them so far so it can wait until much much later. But all major features required to build a transcoder are there as well as working transcoder itself.

As I wrote in the previous post I wanted to play with vector quantisation so first I implemented image palettisation but since that was not enough I implemented two encoders using vector quantisation: 15-bit MS Video 1 and Cinepak. I have no doubts that Tomas Härdin has written a much better encoder but why should that stop me from NIHing? Of course such encoder is not very useful by itself (and it was useless to begin with) so I needed a muxer to represent encoder output in some form. And then simply fiddling with parameters and recompiling became boring so I finally introduced generic options and in order to use those options without recompiling the binary every time I had to write a transcoder as well. But that means that now I can use NihAV to recode media into something else even if it’s just two crappy video encoders, MS ADPCM and PCM encoder with the large variety of supported output containers (AVI and WAV!). I called it conceptually done because all the essential concepts are there, not because there’s nothing left to do.

Now about video encoders. I’ll describe the NihAV design and how it works on a separate page, for now I just mention that while decoders are working on “frame in-picture/audio out” principle, encoders accept single picture or audio buffer for encoding and then may output a series of encoded packets. Why such asymmetry in design? Because decoders are expected to produce single output for single input (and frame reordering is handled externally) while most encoders are expected to have at least a single audio frame or couple of pictures of lookahead to make decisions about coding of a current input. For modern video codecs it may be a decision what frame type to assign or where to start a new scene, for audio codecs like AAC you may need to change current frame type if the following frame type has transients and previous frame type didn’t have them.

Anyway, back to the technical details about the encoders. MS Video 1 operates on 4×4 blocks that can be coded as skipped, filled with single colour, filled with two colours in a pattern, or split into 2×2 sub-blocks each filled with its own two colours in a pattern. Sounds perfect for median cut. Cinepak is much more complex. It splits frame into several strips and each strip is also split into 4×4 blocks that may be coded as skipped, single 2×2 YUV codeword (2×2 Y block and single U and V values) scaled twice or four YUV codewords from different codebook. Essentially for a good encoding you need to determine how to partition frame into strips optimally, split blocks into single and four-vector ones and find optimal codebooks for them separately. Since I wanted to write a working encoder mostly to check whether vector quantisation is working, I simply have fixed amount of strips and add every block as a candidate for both coding schemes without a following refining steps.

Here are some numbers if you really care about those. Input is laser05.avi (320×240 Indeo2 file with 196 video frames from the standard samples place). Encoding with MS Video 1 encoder takes about 4 seconds . Encoding Cinepak with median cut takes six seconds. Encoding Cinepak with ELBG and randomly-generated codebooks takes 36 seconds and result looks bad (but recognizable). Encoding Cinepak with ELBG that takes codebooks produced with median cut as the initial ones takes 68 seconds but the quality is higher than merely median cut and the output file is slightly smaller too.


Now with all of this done I should probably fix the knowingly bad decoders (RV6 and Bink2), add whatever missing decoders and features I see fit and start documenting it all. I have strong doubts about VDD this year but maybe I’ll be able to present my stuff at FOSDEM 2021.

Sunday, 31 May

While NihAV had support for paletted formats before, now it has more use cases covered. Previously I could only decode paletted format and convert picture into some other format. Now it can handle palette in standard containers like AVI and MOV and even palette change in AVI (it’s done via NASideData which is essentially the same thing I NIHed more than nine years ago). In addition to that it can convert image into paletted format as well and below I’d like to give a brief review of methods employed.

I always wanted to try vector quantisation and adding conversion to paletted formats is a perfect opportunity to do so.

It should be no surprise to you that conventional algorithms for this are based on ideas from 1980s.

Median cut (described in Color image quantization for frame buffer display by Paul Heckbert in 1982) is a very simple approach: you gather all data into a single box, split it on the largest dimension (e.g. if maximum green value minus minimum green value is greater than red or blue differences then split data into two boxes, one with green components less than average green value and one with larger than average green value), apply recursively to the resulting boxes until you get a desired amount of boxes or can’t split any more. This method works moderately fast and unlike other approaches it does not need an initial palette.

NeuQuant is an algorithm proposed by Anthony Dekker in 1994 but it’s based on work of Teuvo Kohonen in 1980s. Despite being based on neural network, the algorithm works and it’s quite simple to understand. Essentially you have nodes corresponding to palette entries and each new pixel is used to update value in both the nearest-matching palette entry and also its neighbours (with decreasing weight of course). It works fast and produces good results even on partially sampled image but its result is heavily affected by the order pixels are sampled. That’s why for the best results you need to walk image in pseudo-random order while other methods do not care about order at all.

ELBG is the enhancement of Linde-Buzo-Gray algorithm (from 1980) proposed in 2000. In the base algorithm you select random centroids (essentially the centres of pixel clusters that will serve as palette entries in the end), calculate which pixels are the closest to which centroid, calculate proper centroid for these clusters (an average of those pixels essentially), repeat last two steps until centroids don’t move much (or moving them does not improve the quantisation error by much). An enhancement comes from an observation that some clusters might have larger dispersion than the others so moving a centroid from a small cluster of pixels near another small cluster of pixels to a large cluster of pixels might reduce quantisation error. In my experience this is the slowest method of three I tried and the enhancement actually makes it about twice as slow (but of course I did not try to optimise any of these methods beside the very basic stuff). Also since it constantly searches for suitable centroids for every pixel (more than once if you consider enhancement step) it gets really slow with an increased number of pixels. But it’s supposed to give the best results.

And here are some words about palette lookup. After quantisation step is done you still have to assign palette indices to each pixel. The same paper by Heckbert lists three methods for doing that: the usual brute force, local search and k-d tree. Brute force obviously means comparing pixel against every palette entry to find the closest one. Local search means matching pixel against palette entries in its quantised vicinity (I simply keep a list of nearest palette entries for each possible 15-bit quantised pixel value). K-d tree means simply constructing a k-dimensional tree in about the same way as median cut but you do that on palette entries and nodes contain component threshold (e.g. “if red component is less than 120 take left branch, otherwise take right branch”). In my experience k-d tree is the fastest one but its results are not as good as local search.

Output image quality may be improved further by applying dithering but I did not care enough to play with it.

P.S. I’ve also made a generic version of median cut and ELBG algorithms that can be used to quantise any kind of input data and I want to use it in some encoder. But that’s a story for another day. There’s other new and questionably exciting stuff in NihAV I want to blog about meanwhile.

Tuesday, 19 May

I’ve finally had time and documented my findings so if somebody is interested in it he can have at least something to start with.

Saturday, 09 May

It seems that as a programmer and especially during these days you have an obligation to bake bread (the same way if you belonged to MPlayer community you had to watch anime). So here’s me doing it:

It’s made after a traditional recipe from Norrland: barley flour, wheat flour, milk, yeast, cinnamon, a bit of salt and molasses. IMO it goes fine with some gravad lax or proper cheese.

P.S. And if you think I should have made a sour-dough bread—I can always order some from Sweden instead.

Tuesday, 05 May

Before moving to improving parts of NihAV not related to decoding I decided to implement some small family of formats and I picked VivoActive since somebody complained some of it was unsupported.

This family consists of one custom container format and three codecs based on ITU standards. Container format is simple, intended just for one video and one audio stream with video frame most likely split into 128-byte chunks (probably for better streaming), the only interesting thing is that it stores header in text form which is too flexible compared to the rest of format.

First audio codecs is ITU G.723.1 and it was painful to implement it. As a proper speech codec it has a lot of proper speech codec math like “multiply 32-bit value A by 16-bit value B and shift result by 15 bits” which requires explicit casts in Rust. On the other hoof it has saturating_add() and friends which help in many other cases. There are places where functions take the same data as input and output while in other places the same functions have different input and output arrays. Plus I wanted to have a slightly better design structure so there are functions inherent to subframes, some functions belong to the decoder instance and some are used by both. And then I had to debug it. To give it a perspective, G.723.1 decoder takes 110 kB in source form and code part is 37 kB; for Siren the numbers are 45 kB and 15 kB respectively; Vivo video decoder is merely 19 kB because most of the decoding is done by base H.263 decoder in nihav-codec-support.

Siren (or more officially Polycom Siren 7) is a codec that served as a base for ITU G.722.1. Since RealAudio Cook is based on G.722.1 and I’ve written a decoder for it already, this one was quite easy to implement. Especially considering that some guy wrote an opensource decoder and encoder for it back in early 2000s. Also this might be the case when having 5*2^N FFT finally paid off since Siren frames are 320 samples long so I still can use my standard IMDCT implementation here (it outputs samples in reverse order but that’s no problem).

And finally Vivo Video. It’s yet another codec based on H.263 (but with slightly different headers) and notable mostly for how it represents codebooks. The codebooks are stored as a single set (but not in order e.g. codebook definition number two is used for codebook number fourteen), each codebook can represent codes up to eight bits long (for longer codes you have escape prefix which means that e.g. codes starting with 0000 10 have their tails defined in another codebook set). Another interesting feature is that the codes are stored as text strings with ones, zeroes, and spaces (yes, the decoder parses them to get the actual code). Additionally it has a weird decoding mode where you keep a state ID, there’s a special table to map it to the actual codebook number, and codebook tells you how to change state ID when you decoded a new code. This mode can be used to decode the whole stream or just macroblock coefficients.

As for the codec itself, there are two flavours of it: Vivo/1.0 (or Vivo/0.90) and Vivo/2.0. The first version is plain H.263 that does not use any special features, the second version has PB-frames (i.e. frames where B-frame macroblock data is stored together with P-frame macroblock data) and it employs AIC (advanced intra coding mode). It’s probably the only codec I’ve seen that actually has AIC in P-frames and not just in I-frames. Reconstruction of P-frames because of this AIC mode is not perfect but as with G.723.1 decoder it’s good enough to demonstrate that it works and I don’t want to waste more time on it.

All in all it was a meh-y experiment with mediocre results and I should move on.

Sunday, 19 April

Spoilers: as it turns out, a French company needs some Belgian technology in their life.

In last post I mentioned that NihAV can decode all of VMDs except maybe from the latest generation. Since I was provided with the samples I looked what changed there.

First of all the header got shorter. They’ve finally decided to throw out the initial palette which is not used by the 16- and 24-bit videos (that area is filled with what looks like some random excerpt from a DOS executable and it looks the same in all files I’ve seen) and adding two 16-bit fields at the end. What are those fields for? I’m getting to this.

Looking at the code for reading this new VMD header I could see that while old header was 816 bytes long, it handled header sizes 820 or 52 (820-768) which meant four additional bytes first added to the old header and then palette was thrown out. And the audio size was variable (previously it was 1:1, 1:2 or 1:4 fixed compression scheme but it seems to be not the case any more). In the files I have those additional fields at the end of the header were always 4 and 1152. The second number looks suspiciously like the number of samples for some audio codec and it turned out to be true.

It turns out that while Urban Runner encoded FMVs with Indeo 3 (and rest of the stuff with conventional 16- or 24-bit VMD video compression), those education games decided to use some external audio codecs. Studying the executable revealed that they load some external library depending on the audio type in the VMD header and use it for decoding. Those known types are:

  1. type 3 — L&H StreamTalk 15kbps @ 8kHz;
  2. type 4 — L&H StreamTalk 50kbps @ 22kHz;
  3. type 5 — L&H StreamTalk 25kbps @ 11kHz;
  4. type 6 — L&H StreamTalk CELP 4.8kbps @ 8kHz.

Since only type 4 is present in the audio data and st500f22.dll is present in the binaries the names for types 3 and 5 are interpolated from it (I could find st48w.dll for type 6 on some random DLL download site though).

As you can see from the name, Coktel Vision (a French company) decided to use codecs from Lernout&Hauspie (a Belgian company) but you knew that from the spoiler in the beginning.

For completeness sake I had to loop at the codec so that my VMD decoding would be complete (yes, I’ve mentioned three other codecs but I don’t think they’ve actually used them in production but if they did it should not be that hard to add support for them as well).

This StreamTalk 50kbps @ 22kHz codec turned out to be a simplified version of MPEG Audio Layer II. While it builds on all the sample principles, the frames are variable in size and don’t have headers. Actually they start at the next bit after the previous frame and if you have frame data stored in separate chunks (like in VMD) it will skip some bits from the first input byte if previous frame had some free bits left (yes, that means repeating last byte of the previous packet as the first byte of the current packet).

The content is the usual 36 sets of 32 sub-band samples (giving 1152 samples per frame) with bit allocation parameters and scales coded for each third of the frame, each three sub-band samples coded together. Synthesis is performed with the same QMF as in MP2 too. The main difference is that there’s only one bit allocation class and frames do not have fixed length and can be anything between 16.5 and 517 bytes long (yes, the reference decoder checks input length against those two values).

Overall this turned out to be a simple codec with familiar concepts (the only thing I had troubles with was that continuous decoding with skipping bits in the first byte of input) and I guess two other codecs are the same but use a different bit allocation class. CELP codec is something different but it looks not that complex either so it can be done in reasonable time if the need arises (hopefully never though).

Thursday, 16 April

As the certain doctor from ScummVMTrek reminded me, the VMD format (developed by Coktel Vision that was bought by Sierra) was used in their own games too (and even more so, there VMDs were used for many animations where simple sprite would suffice as well) and they kept making some educational games way into 2000s. So I decided to look at those as well.

The Last Dynasty (which I’ve never played and unlikely to ever play) features VMD that can be decoded mostly fine except that in 320×161 video you sometimes have sudden 640×322 frames near the end of video.

Some education game from Adi 4.0 generation. This one has some videos in 15-bit RGB format plus IMA ADPCM compressed audio track. But it can’t beat the weirdness of…

Urban Runner. Here we have a mix for 15-bit RGB VMD, 24-bit RGB VMD and VMDs with Indeo 3 video. And if you thought this was simple enough here’s another fun trick for you. Despite having different depths, all non-Indeo3 VMDs use the same compression methods, just in some cases the buffer should be interpreted as bytes, sometimes as 16-bit little-endian words and sometimes like triplets. So far so good. But they had a bright idea of sometimes storing the image dimensions in pixels and sometimes in bytes. In result I look at VMD header to see what flavour it has there to see if I need to scale frame dimensions by bytes per pixel before decoding or not. And this game also features some videos that have 312×136 resolution except that the last frame is 624×272 (I had to allow my Indeo 3 decoder to change dimensions to handle that particular case).

At least it could be done mostly by guesswork (except for audio, I had to look into ADI4.exe using Ghidra to find out that it has IMA ADPCM now) and all files can be decoded fine now.

If somebody can provide me with some samples and binary for their latest generation I’d look at it as well.

Monday, 13 April

Since we all in Europe have to suffer from the lockdown until things get better I can’t travel for now and in result I have to spend free time in different ways (mind you, I still work despite it’s work from home so it’s not that much free time added). Nevertheless I’ve spent a significant part of that time working on NihAV and in result I have some progress to report.

First of all, I have finally improved H.263-based decoders from the level “it decodes bitstream correctly and outputs something recognizable” all the way up to “decodes with just minor artefacts”. That required a lot of debugging and even some reverse engineering.

Currently I have just three decoders based on H.263: RealVideo 1, RealVideo 2 and Intel 263. And all of those decoders have features not present in the other ones. RealVideo 1 has the wonderful OBMC (overlapped block motion compensation) where you need to know motion vectors for all four neighbours (yes, including the bottom one) before you can perform motion compensation, RealVideo 2 has proper standalone B-frames, and Intel263 has PB-frames where P-frame and B-frame macroblock data is stored interleaved. I mentioned before that in some aspects H.263 is worse than MPEG-4 ASP or H.264. H.263 has more annexes than any other video codec specification with almost every letter of Latin alphabet taken and those annexes change decoding process significantly. Inter block with four motion vectors instead of one? Annex F. Alternative intra block coding mode with coefficient prediction? Annex I. PB-frames? Annex G. PB-frames with B part not always being bidirectional? That’s Annex M. In-loop filter? Annex J. Different quantisation mode especially for chroma? Annex T.
At least it all works now good enough and I don’t intend to return to it any time soon.

Second, I’ve checked and fixed my VMD decoder. The problem was audio part. First, there’s annoyingly two variants of stereo audio compression. They both use the same prediction scheme but one stores predictors in the beginning of each block (IIRC that’s the newer scheme that I’ve seen being used only in Shivers 2) and another one just stores the deltas. And here’s the annoyance: many clips from Lighthouse store audio payload in blocks of 4409 bytes. Each delta is exactly one byte and there are two channels and there are only deltas in the block—do the maths yourself. In case you can’t, it’s 2204 samples for one channel and 2205 for another one. I had to buffer that leftover sample and add it back in the beginning of the next block.

And finally I’ve added some framework for future support of common compression formats. Currently I have implemented only deflate decompressor and it was tricky to do that properly. The format itself is simple and I’ve implemented a decoder in a couple of hours (including distractions to other activities) but making it work in desired conditions took more than a day. What I mean by that? The fact that while normally simple uncompress() with known input and output lengths would suffice, sometimes you need to decode it exactly as a stream with an unknown input and output sizes. In order to do that you need either to somehow juggle inputs/outputs maybe even via callbacks (further complicated by Rust ownership concept) or save the decompressor state in case of input or output buffer exhaustion and letting the caller to do that. I took the second approach obviously and implemented it as a state machine where if there’s not enough input bits or no space left in the output buffer I simply save the bit reader state (without the data pointer) and return an error code telling caller either to feed more data or to do something with the output data and keep decoding the same block. It’s probably very ineffective but it works fine even in case of single byte input and output buffers.
This means that now I can write a decoder for some screen codec without any external dependencies and if the need arises I might even write an encoder counterpart. I’ve written simple LZ77- and LZ78-based compressors before so hopefully I can implement something decent for an encoder even if not as good as zlib.

All in all, this was probably not the most pleasant work to do but it sure feels good having this done. Now I can move to something different like Vividas container and codecs family or improving nihav-tool or implementing a proper player or something else entirely. I don’t make plans because they fail so only the future will show what will be done.

Saturday, 21 March

Since we have this wonderful situation in Europe and I need to stay at home why not do something useless and comment on the features of AV1 especially since there’s a nice paper from (some of?) the original authors is here. In this post I’ll try to review it and give my comments on various details presented there.

First of all I’d like to note that the paper has 21 author for a review that can be done by a single person. I guess this was done to give academic credit to the people involved and I have no problems with that (also I should note that even if two of fourteen pages are short authors’ biographies they were probably the most interesting part of paper to me).

The abstract starts with “In 2018, the Alliance for Open Media (AOMedia) finalized its first video compression format AV1” and it fails to continue with the word twice. Not a big deal but I still shan’t forget how well the standardisation process went.

Introduction

It starts with what I can describe as usual marketing words that amount to nothing and then it has this passage:

As a codec for production usage, libvpx-VP9 considerably outperforms x264, a popular open source encoder for the most widely used format H.264/AVC, while is also a strong competitor to x265, the open source encoder of the state-of-the-art royalty-bearing format H.265/HEVC codec on HD content.

While this is technically true, this sentence looks like a PR department helped to word it. “A codec for production usage” is not an exact metric (and in production you have not just quality/bitrate trade-off but rather quality/bitrate/encoding time), VP9 is a H.265 rip-off and thus it’s expected to beat H.264 encoder and be on par with H.265 encoder is a given (unless you botch your encoder) and cherry on top is “royalty-bearing format H.265”. H.264 is “royalty-bearing format” too and if Sisvel succeeds so are VP9 and AV1 and claims like those make me want for such organisations to achieve that goal. Not because I like the current situation with greedy IP owners and patent trolls but rather because you have the choice between paid and “shitty but free” (does anybody miss Theora BTW?).

Then you have

The focus of AV1 development includes, but is not limited to achieving: consistent high-quality real-time video delivery, scalability to modern devices at various bandwidths, tractable computational footprint, optimization for hardware, and flexibility for both commercial and non-commercial content.

Somehow I interpret it as “our focus was all these nice things but it’s not limited by the actual requirement to achieve all this”. And it makes me wonder what image features make the content non-commercial.

Since the soft freeze of coding tools in early 2018, libaom has made significant progress in terms of productization-readiness, by a radical acceleration mainly achieved via machine learning powered search space pruning and early termination, and extra bitrate savings via adaptive quantization and improved future reference frame generation/structure.

I’m sorry, do humans write such sentences? I have no problems with what is done for extra bitrate savings (though I’d expect that any modern advanced encoder would do the same) but what’s written before that makes me want to cry Bingo! for some reason.

The rest of introduction is dedicated to mentioning various AV1 encoders and decoders and ranting about tests. That’s perfectly normal since most of those tests conducted by entities with their own interests and biases. And then you have “…30% reduction in average bitrate compared with the most performant libvpx VP9 encoder at the same quality“—as the most sceptical Kostya who blogs on multimedia.cx I find this phrase funny.

AV1 Coding Tools

Now for the part that describes how AV1 achieves better compression than its predecessors. This is the reason why I looked at the paper in the first case: there have been many presentations about those coding tools already but having it all in a single paper instead of multiple videos is much better.

Coding block partition. Essentially “the H.265 tree partitioning from VP9 remains but we extend it a bit to allow larger and smaller sizes and more partitioning ways”. It also sounds like VP9 had some limitations that H.265 had not and now they are relaxed.

Intra prediction. Some of it sounds like “VP9 had intra prediction like in H.264, AV1 borrowed more from H.265” (H.264 has 8 directional modes plus DC and plane, VP9 had the same, H.265 had 35 modes most of them defined via universal directional predictor, AV1 has 56 directional modes most of them defined via universal directional predictor).

AV1 has also two new gradient prediction modes—smooth gradient (calculated from either top, left or both neighbours and using quadratic interpolation; this sounds like a logical development of plane prediction mode that was not used before for computational considerations) and Paeth prediction (you might remember it from PNG).

There’s also recursive-filtering-based intra predictor which predicts 4×2 pixel block from its top and left neighbours and the whole block is predicted by essentially filling it with those predicted blocks (somewhat like intra prediction in general without residue coding and with fixed prediction mode). This is an interesting idea and makes me wonder if H.266 uses it as well.

Of course there’s this proposal for H.265 rejected for computation complexity but picked up in AV1—Chroma-from-Luma mode. The idea is rather simple: you usually have correlation between luma and chroma so by multiplying luma values and adding some bias you can predict chroma values quite good in some cases.

Another method is palette mode. This reminds me a lot about screen coding extensions for ITU H.EVC.

And finally, intra block copy. This mode does what really old codecs did and copies data from already decoded part of the frame. Its description also features this passage:

To facilitate hardware implementations, there are certain additional constraints on the reference areas. For example, there is a 256-horizontal-pixel-delay between current superblock and the most recent superblock that IntraBC may refer to. Another constraint is that when the IntraBC mode is enabled for the current frame, all the in-loop filters, including deblocking filters, loop-restoration filters, and the CDEF filters, must be turned off.

I’d argue that only the first limitation is for hardware implementation. Having various features enabled would result in decoding frame in very specific order e.g. usually you may decode frame and deblock rows that are not used for intra prediction while you decode other rows, or you may decode frame and only that perform deblocking. With the IntraBC feature and deblocking you’d need to perform deblocking at the specific stage of decoding or you won’t have the right input. Same with the constraint that IntraBC should reference only the data in the same tile—obviously it would do wonders to multi-threaded decoding otherwise (and you need multi-threaded decoding for fast decoding on large frames packed with modern codecs). If you wonder if this feature is a bit alien to the codec, the paper puts it straight: “Despite all these constraints, the IntraBC mode still brings significant
compression improvement for screen content videos.
” And thus it was accepted (why shouldn’t it?).

Inter prediction. The paper lists five features: extended reference frames, dynamic spatial and temporal motion vector referencing, overlapped block motion compensation, warped motion compensation, and advanced compound prediction.

Extended reference frames means mimicking B-pyramid with VPx approach. I.e. instead of coded B-frames that may reference other B-frames (so you need to carry around the lists of which frames you may want to reference from the current frame) you have some previous frames and some future frames (coded but not displayed) like you had since VP7 times.

Dynamic spatial and temporal motion vector referencing means that co-located motion vectors from H.264 and H.265 got to AV1 at last.

Overlapped block motion compensation is a novel technique that some of us still remember from MPEG-4 ASP times, Snow and “that thing that Daala tried but decided not to use”.

Warped motion compensation also reminds of MPEG-4 ASP (and a bit about exotic VP7 motion compensation modes).

Advanced compound prediction means combining two sources usually at an angle (an interesting idea IMO, not sure how much it complicates decoding though). But also seems to include the familiar weighted motion compensation from H.264.

Transform coding. Essentially “we use the same DCT and ADST from VP9 but we got to separable transforms now”.

Entropy coding. There are two radical things here: moving to old-fashioned multi-symbol adaptive arithmetic coding (which went out of favour because adapting large models is much slower than adaptive binary coding) and using coefficient coding in a way that resembles H.264 (but with the new feature of multiple symbol coding it looks different enough).

In-loop filtering tools and post-processing. Those tools include: constrained directional enhancement filter (IIRC the brilliant thing made by collaboration of Xiph and Cisco), loop restoration filters (a bit more complex than usual deblocking filters), frame super-resolution (essentially the half-done scalability feature present along with full scalability; also it reminds me of VC1 feature), film grain synthesis (postprocessing option that has nothing to do with the codec itself but Some flix company made it normative).

Tiles and multi-threading. Nothing remarkable except for large-scale tiles described as

The large-scale tile tool allows the decoder to extract only an interesting section in a frame without the need to decompress the entire frame.

The description is vague and the specification is not any better. And it does not seem to be a dedicated feature by itself but rather “if you configure encoder to tile frames in certain way you can then use decoder to decode just a part of the frame efficiently”.

Performance Evaluation

Here I have two questions. The first question is why PSNR was chosen for a metric. As most people should know, the current “objective” metrics do not reflect visual quality perceived by humans well (so encoded content with high PSNR can look worse to humans than different content with lower PSNR). IIRC (and I’m not following it closely) SSIM has been used as a fairer metric since long time (look at the famous MSU codec comparison reports for example) and PSNR is synonymous with people who either don’t know what they’re doing or trying to massage data to present their encoder favourably.

The second question is the presented encoding and decoding complexity. The numbers look okay at the first glance and then you spot that x265 and HM have comparable encoding time and HM and libvpx have comparable decoding time. And for the record HM is the reference implementation that has just plain C++ code while others (maybe not VTM at the time of benchmarking) have a lot of the platform specific SIMD code. I can understand x265 wasting more power on encoding (by trying more combinations and such) but decoding time still looks very fishy.

Conclusion

Nothing to say here, it simply re-iterates what’s been said before and concludes that the goal of having format with 30% of bitrate saving with the same (PSNR?) quality over VP9 is reached.

Author biographies

As I said, still probably the most interesting part for me since it tells you enough information about people involved in creating AV1 and what they are responsible for. Especially the four people from Duck (so I don’t have to wonder why there are segments named WILK_DX in the old binary decoders from On2 era).


And now my conclusion.

The paper is good and worth reading for all those technical details of the format but it would be even better if it didn’t try to sell AV1 to me. The abstract and introduction look like they were written by marketing or PR division with some technical details inserted afterwards. The performance evaluation leaves a slight hint of dishonesty like on the one hand they wanted to conduct a fair benchmarking but on the other hand they were told to demonstrate AV1 superiority no matter what and thus the final report was compiled from whatever they could produce without outright lying. But if you ignore that it’s still a good source of some technical information.

Saturday, 22 February

For last couple of weeks I’ve been working on documenting and restructuring NihAV. In result I’ve documented every public thing in my crates (except H.263 decoder skeleton but I need to need to debug and maybe rework it anyway) and NihAV have final crate structure.

Speaking about crate structure, modern languages often suffer from npm.js syndrome—when almost any trivial action has a separate package and most of the packages consist of imports from other packages. The other extremity would be to have two or three monolithic libraries with everything. I don’t think there’s a perfectly balanced solution so I split features using a few principles and I’ll stick to the scheme:

  1. nihav-core—the basis structure definitions like frame, packet, demuxer and decoder interfaces etc etc and utility code that should be used by both crates implementing NihAV format support and various users (like my own decoding tool and player);
  2. nihav-registry contains essentially three things: codec descriptions, codec mapping from e.g. FOURCC to codec name used by NihAV (IMO it’s better to use a string as codec identifier instead of arbitrary number that may or may be not recognized by the different version of the library) and container detection code (i.e. something like what file utility on UNIX does). This functionality can belong to nihav-core but it’s expected to be updated way more often than the base code so I decided to finally split it out;
  3. nihav-codec-support contains various pieces of code and data that are reused by many various decoders. It is intended just for decoders and has such bits as functions for testing decoder on some file, the skeleton for H.263 decoder (just add some functions for parsing headers and your new decoder is ready), motion compensation code, audio DSP bits (including FFT) and more;
  4. various crates that cover codec families and related containers: nihav-commonfmt for AVI and codecs like AAC; nihav-duck, nihav-indeo, nihav-rad and nihav-realmedia for supporting corresponding codec families with e.g. Bink or RealMedia demuxers as well; and nihav-game for supporting various codecs from various games with their unique demuxers;
  5. and finally nihav-allstuff that simply re-exports decoder and demuxer registrations in single nihav_register_all_codecs() and nihav_register_all_demuxers(). Also it has a test to check that all registered decoders have codec description in nihav-registry but nobody beside me should care about that.

Now with all of this done at last I can return to polishing other decoders which I still find more pleasant than documenting.

Saturday, 15 February

I’ve finally finished polishing out decoders for all Duck codecs (before it was bought by Baidu) and now they all seem to work fine (except AVC, that one can wait for later—much much later). And while I moved to even more hairier and painful tasks (reorganising nihav-core and even documenting it) now, as I have full understanding how those codecs work, I can give an overview of their design (not the bit-by-bit description of the format, we have The Wiki for that but rather most notable features and similarities to other codecs) and form my opinion on them.

TrueMotion 1

Somehow this might be their most original codec. While it’s simple codec with delta prediction I can’t remember any other codec that used a variable-length codebook with byte indices. Also this is the only codec in the family that works with RGB (16- and 24-bit modes even; the rest of codecs use YUV).

TrueMotion RT

This one is a trivial codec for real-time video capturing (hence the name) that codes deltas with fixed quantisation scheme (2, 3 or 4 bits deltas with predefined step sizes).

TrueMotion 2

This codec is still based on delta coding but now instead of working with individual pixels it works with 4×4 blocks that can have different amount of deltas and even employ motion compensation (instead of coding deltas). Also the data is separated into different streams and each of them is Huffman coded.

The approach with coding different kinds of information in separate chunks will be used in later codecs as well.

TrueMotion 2X

TrueMotion 2X is some weird amalgamation of TrueMotion 1 and TrueMotion 2. It works with 8×8 blocks that may have different amount of deltas like TM2 and information is grouped into chunks like TM2 but it uses variable codebook approach from TM1.

The main distinguishing features of this codec though are having multiple chunk variants for holding the same data and obfuscating data using XORing with 32-bit key derived from a key stored in a frame by passing it through LSFR a couple of times. IIRC frame data also contains the name of person owning the copy of the program so it might be some kind of protection scheme but it looks dubious at best.

3- and 4-bit ADPCM

As you can guess these codecs are based on DVI ADPCM (4-bit variant is essentially IMA ADPCM with different block header), 3-bit variant simply expands three deltas into four samples by interpolating coded differences (which has been done by other formats as well but I don’t remember which ones).

VP3-VP4

Starting with this format Duck moved to the codec design approach which I can describe as “make an equivalent of some existing codec but with some crazy thing replacing some less important stage”. It’s not like they are the only company doing this but it’s probably the only one leaving you with “how did they manage to come up with that idea?” question and VP3 is a very good example of that.

First of all, VP3 has an unusual block clustering: 8×8 blocks are grouped into 16×16 macroblocks and into 32×32 superblocks; blocks in superblocks are walked in Hilbert pattern but macroblocks in superblocks use zigzag pattern. Except that when you have four motion vectors in a macroblock they are stored also in zigzag pattern. Oh, and superblocks are walked in raster format plane after plane. Macroblock having data for both luma and chroma? Leave that to other codecs.

Then we have another feature familiar from TM2 times: data is grouped by type. First you have superblock information (intra/skip/inter), then macroblock information (which kind of motion it uses), then motion vectors and finally block coefficients.

Speaking of motion vectors, there are four features related to them that make these codecs different. First, motion vector prediction uses last/second last motion vector (in the order of decoding) as the base instead of median prediction in other codecs (this scheme will live up until VP9 with some modifications; I guess it’s done so because of the scan order but who knows). Second, motion interpolation is done as averaging two pixels—and for (½,½) case you average pixels on diagonal, which one of two depends on motion vector direction (averaging all four pixels? who would do that?!). Third, the introduction of golden frame as an alternative reference frame (don’t confuse it with altref frame introduced in VP8). This one is probably done to avoid B-frames that were patented at the time (at least that’s what people think). Fun fact: in VP31-VP5 golden frame is selected as last intra frame, in later codecs it can be selected with a special bit or even partially updated but in VP30 any frame with low enough quantiser automatically becomes new golden frame. And fourth, VP4 moved the loop filtering to motion compensation process so the reference picture does not have its edges filtered but when you perform motion compensation you apply it on source block edges using the current strength. This scheme remained until VP7 where they moved to the usual in-loop deblocking again (also it’s fun to encounter blocky intra frame image that gets smoothed with the following frames).

Now the block coefficients coding. VP3-VP9 used essentially the same scheme: decode special token that tells you what you have—a run of end-of-block flags, a run of zeroes, some small non-zero value or a larger value falling into certain range. Then you decode trailing bits if needed and expand token to form coefficient block. For some (error resiliency?) reasons VP3 had those tokens stored by coefficient number for all blocks (with some skips if zero run was coded) while VP4 had them grouped by block.

I should also mention DC prediction here. For obvious reasons it’s not median predicted either but rather calculated as weighted sum of neighbour block DCs in VP3 or “if you have two neighbour values available take their average, otherwise use the last predicted value” in VP4.

And final pet peeve is the DCT they used in VP3-VP6. While it’s good to have clearly defined integer DCT instead of a mess with different DCT implementations in H.263 / MPEG-4 ASP era, they decided to use transform coefficients in range 12785-64277 so essentially you have to multiply signed 16-bit input coefficient by unsigned 16-bit transform coefficient (and discard low 16 bits immediately). Now realize you have SIMD instruction for either signed*signed->take high or unsigned*unsigned->take high operations and not for this case. Sigh.

VP5

The main difference of VP5 from VP4 is the support for interlaced coding mode. And maybe also new binary range coder (named bool coder) that’s been in use even in VP9.

So now all non-binary data in the frame is coded using trees with fixed probabilities (i.e. you read bit with probability stored in the node and it’s zero take left branch, otherwise take right branch). Those probabilities might be constant or set to some new values at the beginning of the frame.

Frame data still contains macroblock information first and coefficient data last.

Motion vectors are predicted using nearest and second nearest (called simply near) motion vectors from already decoded macroblocks scanned in certain order. Also the information about found prediction candidates is used as one of the context variables used to select some probability set in decoding process.

DC prediction is a bit weird and it’s easier to describe it in the form “you have a special cache for top/left DC values and you use them for prediction” except that you have an additional special case for chroma in the first macroblock.

VP6

There are several things that got changed from VP5, mainly coefficient data location and coding method and motion compensation. Also now you can signal that you want this particular inter frame to become new golden frame. And you can enjoy new alpha mode which is coded essentially as a separate frame after the first one but with just one plane.

First, now there are two coefficient ordering modes: the old “MB info first, coefficients later” and the mode where macroblock information interleaves coefficient data.

Second, now you have Huffman coding for coefficient data. You take the original tree with probabilities, calculate weights for each leaf and construct new Huffman tree that might be completely different from the original. And then you decode data by reading macroblock information with bool coder from one place and variable-length codes for DCT tokens from another.

Third, motion interpolation now uses either a special set of bicubic filter coefficients or simple bilinear interpolation. Also there’s a special mode for switching between interpolation methods depending on source block variance (i.e. if it’s greater than certain threshold then use bicubic interpolation, otherwise use bilinear interpolation). I don’t think this feature has been used after VP6 though.

Also it’s worth noting that now VP6 can change block scan per frame (probably it improves compression a bit by eliminating or shortening some zero runs).

Another fun fact is that depending on container (AVI or FLV) VP6 picture might be coded upside-down or downside-up.

AVC

My favourite audio codec. Essentially it’s simplified AAC-LC rip-off (just bands and coefficients, no noise codebooks or pulses or TNS) except for the special frame mode where you can have half of the frame or the whole frame coded with special mode which is essentially some arbitrarily selected subbands that should be merged together in certain order to reconstruct audio. I have the idea how it all works but I don’t want to debug my decoder yet.

VP7

The codec is not like H.264 at all: H.264 has plane prediction mode and VP7 has TrueMotion prediction mode. There is one thing though introduced in VP7 and dropped in VP8 (and resurrected in some form in VP9) called features (there’s also special frame fading mode but hardly anybody cares about that). Features is an alternative mode that may be present for some macroblocks: different quantiser, different deblocking strength, a flag to signal this macroblock should be used to update golden frame and special block drawing mode (related to interlacing but not quite). There are up to four possible feature values where it makes sense (i.e. not for golden frame update flag).

Last feature (called pitch) defines how block coefficients should be put and how motion compensation should be performed. So you can put decoded coefficients in interlaced mode or even doubly interlaced mode (i.e. using every fourth line instead of every second). Motion compensation has these modes too and more: you can get 4×4 block from 16×1 line or from a slanted block (i.e. every next line starts one pixel earlier/later than the previous one).

Another characteristic of VP7 is being evolved rather than designed. There are several places in the codec where you can safely claim they simply have written code (maybe with some bugs) and relied on its behaviour instead of making the code follow some principle. Below are some examples.

Motion vector candidates search may get wrong macroblock coordinates. Here are the words of Peter Ross from his VP7 decoder:

The vp7 reference decoder uses a padding macroblock column (added to right edge of the frame) to guard against illegal macroblock offsets. The algorithm has bugs that permit offsets to straddle the padding column.

Inter DC prediction for DC superblock that says “if three previously decoded DCs were the same then you should use it for prediction” is fine but why should you keep the history from the last frame? I understand it might improve compression if you have the same value for the whole previous frame but it still looks a bit strange.

Spatial (intra) prediction also behaves counter-intuitively. In 4×4 prediction mode when top right block is not available the bottom of macroblock right above should be used instead. And when it’s the last block in row then top right prediction is the replicated pixel from the top macroblock as well. This is hard to explain from codec design perspective but easy from implementer’s point of view: you have top pixels line cached and you update it after you decode the block (so if the data is unavailable you use last decoded data here instead of replicating last available pixel like in H.264).

Conclusion and final thoughts

I hope I was able to demonstrate in this post that Duck codecs have an element of originality but quite often they go so far in originality that you start wondering why they were doing it like that. While some of it might be because of the patent workarounds some things are showing that in some cases they were fiddling with the code instead of trying proper ideas first and implementing codec after the idea (no, idea “let’s use codec X as the base” does not count).

Also while I’m not going to deal with VP8 and VP9 unless I really have to, I can say that the people behind Duck codecs developing AV1 is both good and terrible thing. Good because they know how to propose stuff that looks different but still works similarly to some conventional codec. Terrible because they still don’t know how to design a codec properly—not writing some ad hoc code that does something but rather gather ideas, evaluate them and only after that implementing it. I heard the story that shortly before releasing VP8 to the public Baidu actually showed it to some opensource multimedia people and asked for their opinions and input; somebody (from Xiph IIRC) found a design flaw but it was left unfixed because the encoder relied on it as well and they were reluctant to change it.

AV1.0 Errata 1 shows similar design problems partly for the same reasons and I don’t expect AV2 to be conceptually better. Especially after hearing rumours that Baidu is working on it already probably to force mostly complete work on AOM so the codec is ready by the same time as H.266 (or MPEG/VVC as they say it in Italy). And since most opensource multimedia people are working on AV1 nowadays, the chances of some competitor appearing are slim. So don’t ask questions, just consume AV1 and then get excited for AV2.

Sunday, 09 February

Since I have nothing better to do (obviously) I want to talk about marzipan pig situation in Sweden.

There’s no Christmas in Sweden, there’s only the traditional Jul which is closer to Hogswatchnight than a Christian holiday. And one of the Jul traditions is consuming various pig products starting from Yule ham and ending with—you guessed it—marzipan pigs. And of course during the season you can find all such products like julskinka (the original Yule ham), julpressylta (kinda like head cheese but better), julprinskorvar (small sausages) and of course marsipangris (no prize for guessing). And of course any decent bakery offers some marsipangrisor as well (especially in Gävle for some reason).

They come in various shapes but usually there are three standard shapes being present everywhere: very small standing pink pigs (you can spot them on two photos below), round white piggy with lower half dipped into chocolate (quite often with a coin sticking from its back too) and larger white pig’s head holding a marzipan apple (somewhat like the second picture but smaller and shorter). As I mentioned, the most variety of them I’ve seen in Gävle so I’d be sure to drop by again if I be nearby during Jul season.

And now the pictures (I’d ask for forgiveness for the photos quality but there’s no quality there):







P.S. Obviously I wanted to write this post a bit earlier but you can see how lazy I am.

Saturday, 14 December

As I mentioned in the previous post, I’m polishing NihAV and improving some bits here and there. In this post I’d like to describe what has been done and what should be done (but not necessarily will be done).

For about a month I’ve been working on adding various functionality and functionality to support that functionality and resolving recursive dependency on that functionality as well. I wanted to have a better means to test decoder output so I tried to write a simple player. Which required implementing missing bits in nihav_core::scale and implementing sound conversion. And then implementing support for floating point numbers in nihav_core::io::byteio (at least Rust offers functionality to convert float into its integer representation and back). And then optimising scaler since it took two thirds of playback CPU. And also implementing seeking in demuxers (so that nihav-tool can start decoding from some position and skip uninteresting first frames).

Let’s start with the lest significant feature—seeking. I’ve introduced SeekIndex structure that can be filled by demuxer on opening format and later used to calculate seek position. I wonder if it’s worth to add automatic index building or not. At least I expect to have significantly less demuxers than decoders so it can be changed later with relative ease. Currently AVI and RM demuxers support seeking.

Now to testing. There are two different things: internal crate testing and manual testing with nihav-player. First, let’s talk about internal testing.

Rust allows modules to have separate test functionality so normally built crate won’t contain it but cargo test will build special version with tests enabled, run it and report which tests failed with the output they produce. Usually I abuse this functionality to debug decoder (since I don’t have to build full nihav-tool but rather just two-three crates required for decoder and its family). So previously test function just decoded payload and optionally output sequence of images or wave file. It’s good enough for manual checking or to ensure that it works but it’s not good enough for regression testing. So I’ve finally added MD5 hash support in testing function so now I can also compare decoded output to MD5 for the whole output or per-frame (I still need a better solution for floating point audio but it can wait). Also because I always operate either packed buffers or in native endianness and calculate MD5 directly on those buffers in the way defined by me it should produce the same hash on big-endian systems as well (it’s just I still remember the times when I forgot to add output colourspace conversion for high-bitdepth or RGBA output for FATE test in order to fix them on PowerPC).

And finally the hairiest part—nihav-player. It’s very primitive player that can only play audio and video hopefully in sync and quit when ‘q’ is pressed. And yet it both required a lot of additional code to be written and reduced the complexity of testing—for instance it demonstrates chroma artefacts (or simply swapped chroma planes) and playing a file takes less time than waiting for decoding to finish and reviewing frames one by one.

Here’s a list of features I had to add or improve just for player:

  • scaler performance—I had to change YUV to RGB conversion from the straight floating-point matrix multiplication to the conventional four tables. Also I’ve introduced special cases for nearest-neighbour scaling to make it work faster (yes, I still don’t have better scaler and it does not bother me);
  • sound format conversion (resampling is still pending)—this converts between various sample formats and somewhat channel layouts (up-/down-mixing works only for anything to mono and 5.1 to stereo). Internally it works by iterating over a set of samples for each channel using intermediate 32-bit integer or floating point format. That required adding traits for sample format conversions like 16-bit integer into float or float into 8-bit unsigned int plus special readers that produce set of samples in given format no matter what input format is. Maybe I should use the same design for scaler as well;
  • make all decoders thread-safe (so I can run decoders in different threads). That required non-trivial changes only in Indeo 4/5 decoder because of its custom frame management;
  • finally adding frame reorderers. Unlike other frameworks my decoders work on frame in-frame out principle with no reordering. Test decoding outputs frames with PTS as part of their name so lazy sort worked fine with no need for reordering. In result I had to add it only for the player.

Another thing I had to improve for my player is sdl crate. Yes, my main development system is too old for SDL2 (with or without sdl2-build crate) so I stuck to SDL1. The Rust wrapper lacks support for YUV overlays (it was trivial to add) and sane audio handling (i.e. audio callback is just a function pointer which is hardly usable; I’ve changed it to accept some audio callback track in style of SDL2 wrapper).

The player itself took a lot of fiddling because I barely remember how I played with SDL more than ten years ago and I’ve not played with multi-threading much. In either case spawning threads with moving variables and channels for passing data between threads was a bit bumpy but it works. In result I have main thread that demuxes data, passes it to separate audio and video decoding threads, and then receives decoded video surfaces or overlays from video thread and displays them when the time comes (or discards if it’s too late—I should add a control for decoder to skip e.g. B-frames during decoding). Audio thread decodes audio and puts it into FIFO which gets emptied by audio callback when SDL calls it. Video thread decodes video, converts and/or scales into RGB32 or YUV420 and sends results back to main thread in either SDL RGB surface or YUV overlay (to my surprise they are not mutually exclusive).

In either case that’s just a first attempt to build proof of a concept player just for looking how decoders work and eventually I should write a proper one with better syncing, interactivity and other bells and whistles. The current one does its task good so I could watch a lot of samples for various decoders and see what should be fixed. I’ve fixed bugs in Indeo 3-5 decoders already plus swapped chroma planes in other decoders but H.263-based decoders still suck a lot and VP3-7 decoders are even worse. So there’s still a lot of fixing for me left.

At least after all those fixes and rewriting player I should have something substantial to show the world.

Wednesday, 27 November

In the previous post I’ve mentioned that one of the levels contain an unfinished witch puzzle probably based on Match Two that also contains various photos of the (presumably team and their families (about a dozen of them actually). I also wanted to tell a bit more about it and since I’m working on boring parts of NihAV and it will take a while, why not tell the story now?

Here’s the actual mini-game background:

(click for full picture as usual)

There are two areas of interest there—actual playing field where tiles are placed and smaller area where probably the clip from the movie would play on winning (the dimensions seem to fit). There are about twenty various symbols that can be placed there (plus some that form the word “witch”). Here are some of them:

The other pictures include scales, crumbs or stones, ladle, lead ingot, fire, newt (…it will get better!), log, bridge, apples and more.

And the playing area is suspiciously the same size as the pictures. I can only make theories that the pictures would be revealed there if you win the game normally or that it was an Easter egg in an unfinished mini-game. Here’s one for the reference:

I still have a faint hope that ScummVM developers will support the game one day and then we find out how it’s supposed to work (or that my guesses were all wrong as usual). Until then I can just occupy myself with something else.

Sunday, 17 November

First of all, unlike Rheinland Palatine railways I’ve not travelled on all local railways here because of certain geographic peculiarities it takes less time to get to the farthest corners of Rheinland-Pfalz from here than to places around Bodensee. And I can’t visit all historic railways either because one of them is located in the middle of nowhere with the only connection being a bus that comes there maybe twice a day.

Nevertheless, I’ve seen most of them and here I’d like to describe my impressions about them.

Railways of Baden

Let’s start with the main line. Due to its strategic position at the border of Germany, Baden has connections with various neighbouring countries. So the main line would be the backbone of Europe connecting North and South, from Rotterdam to Genoa along the Rhine. One of the important parts of it is the track from Mannheim to Basel with branches that connect it to France as well. Unfortunately its importance was demonstrated when because of botched tunnelling they had to close the part between Rastatt and Offenburg and it had been causing chaos for many weeks because there are no sane alternatives to it: all other lines are single-track and French ones are not electrified—while Rheintalbahn is four-track for most of its length (which was mandated by an international agreement between Switzerland, Germany and Italy—yes, it’s that important).

I mentioned branches to France before. There are two of those: around Strasbourg and Mulhouse (there had been more but France decided to cut ties like they were afraid of German army or something) and those are essentially the main connections between Germany and France. The only other double-track line goes via Saarland and it’s rather a joke. It takes 1 hour 22 minutes from Mannheim to Saarbrücken with express train like ICE or TGV and 1 hour 36 minutes with regional train. That’s all you need to know about how fast that line is. The other lines are unelectrified single-track railways connecting rural towns: Strasbourg-Lauterbourg-Wörth am Rhein, Strasbourg-Wissembourg-Winden(-Neustadt an der Weinstraße), Metz-Apach-Perl(-Trier).

Now let’s talk about less important lines that are mostly connected to the main one. In Mannheim there’s a line going to Eberbach, Lauda and Würzburg (very scenic, I’d definitely visit it again) with branches to Miltenberg and Aschaffenburg. Near Bruchsal there’s also a high-speed branch to Stuttgart. From Karlsruhe you can take Residenzbahn to Stuttgart (fun fact: previously there were both Regio-Express and Inter-Regio-Express trains from Karlsruhe to Stuttgart with no other difference beyond that Inter- part, now there are only IREs left). From Offenburg the main line branches into Schwarzwaldbahn to Konstanz and further to Switzerland (very scenic but I’ve seen almost everything there). From Freiburg there’s a line to Donaueschingen (i.e. to Schwarzwaldbahn) known mostly for running through Höllental with a station Himmelreich there (for non-German speaking people like me: it’s Heaven Kingdom town in Hell Valley).

And in Basel the main line transforms into Hochrheinbahn that runs along the Rhine to Konstanz. The line has several connections to Switzerland, like Waldshut and Koblenz are connected with the oldest rail bridge over the Rhine. It is also remarkable that some stations on Swiss territory still belong to Germany, the most known is Basel Badischer Bahnhof but there are more in e.g. Beringen and Neuhausen.

Of course it’s worth mentioning that some rail lines are serviced by trams (it’s called Karlsruhe model for a reason), sometimes only by trams. For example you can ride S5 from Karlsruhe to Bietigheim-Bissingen which is also the end station for S5 coming from Stuttgart. And S4 for a long time hold the record for the longest tram line in the world (more than 150 kilometres from Öhringen to Karlsruhe and then to Freudenstadt), now it has been split into S4 and S8.

And now historical railways.

There are only three historical railways in Baden plus some smaller neglected branch lines (like Neckarbischofsheim-Hüffenhardt) that have only occasional traffic. And some routes are so scenic that they have historic trains running on them like part of Schwarzwaldbahn from Triberg or the ring around Kaiserstuhl. The lines I’m talking about are Kandertalbahn (Kandern-Haltingen, close to Swiss border), Sauschwänzlebahn and international historical railway Etzwilen-Singen.

Kandertalbahn is nice but without any distinct features, Sauschwänzlebahn is so remarkable that I wrote a separate post about it long time ago and Etzwilen-Singen line is worth talking about here. It runs across the Rhine from Switzerland to Baden (unfortunately not the full way to Singen, the track between Rielasingen and Singen requires some re-constructing. So while it’s Etzwilen-Singen line, the train I took ran from Stein am Rhein (hauled by electric locomotive to Etzwilen and changed to proper steam engine there) to Rielasingen. And the second class part of carriages looked exactly like third class part but that’s some Swiss peculiarities I don’t care about.

Railways of Württemberg

It might be a surprise but despite being part of high-speed corridor Paris-Stuttgart-München, Württemberg has just one bit of high speed line connecting Stuttgart to Mannheim and Stuttgart-Ulm part is very slow speed (essentially it takes an hour no matter what train kind it is). Why? Mountains and too steep curve. That’s why they’re building a new line and burying main station in Stuttgart (the infamous Stuttgart 21 project). And since it costs a lot of money they decided this is more important than building missing tracks on Rheintalbahn as the agreement demands. At least the railway from Stuttgart to Friedrichshafen via Ulm plays central role in the song Auf de’ schwäb’sche Eisebahne, the only railway-related German song I know.

The other notable railways are Stuttgart-Aalen-Crailsheim-Nürnberg, Miltenberg-Lauda-Crailsheim (it’s known as Westfrankenbahn), Aalen-Ulm and so-called Ring Railway (Villingen-Rottweil-Aldingen-Tuttlingen-Geisingen-Donaueschingen-Villingen). There are many other railways but they are not that remarkable.

And I should mention the quarter of German cogwheel railways (or half if you exclude Bavaria as Bavarians do) is located in Stuttgart—Zacke (the line Marienplatz-Degerloch). Speaking of German cogwheel railways, there was one connecting Reutlingen and Kleinengstingen but it got dismantled back in 1969 leaving confusing names like Südbahnhof Nord (South Station North) station in Reutlingen and Zahnradbahn Honau-Lichtenstein society that sometimes run steam trains near Reutlingen on tracks that have no relation to that old cogwheel railway at all.

Speaking about still existing historical railways, there is a lot of them of varying forms and gauges.

  • fully historical normal-gauge railway (i.e. exclusively for an occasional rides of old trains)—like Lokalbahn Amstetten-Gerstetten;
  • fully historical narrow-gauge railway—like 750mm Öchsle and 1000mm Albbähnle (this one also goes to Amstetten so you can catch two different rides the same day);
  • historical railway operated regularly but with rail buses from 1960s—that would be Schwäbisch-Alb-Bahn (which is the longest of them all with ~43km still in service);
  • branch Trossingen-Trossingen Stadt which operates rail bus normally but electric historic tram on special occasions (the line it connects to is not electrified, only 4km of the branch are);
  • mixed system that I’ve never seen elsewhere: part of branch line is in regular service but occasionally historic trains go to the very end of the line. There’s Korntalbahn where you can go from Korntal to Heimerdingen by rail bus every day but Korntal-Heimerdingen-Weissach train runs only couple of times per year. There’s Schorndorf-Rudersberg track in regular use but Schorndorf-Rudersberg-Welzheim is served only on Sundays and with historic train;
  • and final (dis)honourable mention is Härtsfeldbahn—a narrow-gauge railway from Aalen to Dillingen am Donau that got dismantled over the years so only 2-3km in the middle of nowhere remain (that’s the railway mentioned in the very beginning). Still enthusiasts run trains there on some holidays and even try to build some of it back.

And now for something about the same. Here’s a picture of train on Öchsle:

While talking about historical railways in Württemberg it’s worth mentioning Ulmer Eisenbahn-Freunde socienty (Railway Friends of Ulm if translated directly). They run steam trains on some of those lines and on other railways too (like Karlsruhe-Ettlingen-Bad Herrenalb which is serviced by trams).

Here’s a train operated by them in Gerstetten:

It was built in 1921 in Karlsruhe and it’s worth saying some words about the manufacturer. Maschinenbau-Gesellschaft Karlsruhe went bankrupt in 1929 but before that it was not merely a place where locomotives were built and whose founder later founded Maschinenfabrik Esslingen as well (this one still exists) but it was also a place where such people as Niklaus Riggenbach (one of the pioneers of cogwheel railways; he moved to Karlsruhe when he decided he wants to know how to build locomotives), Carl Benz (son of locomotive driver and inventor of car), Gottlieb Daimler (the guy who put internal combustion engine wherever possible) and Wilhelm Maybach (guess why he got a car and couple of streets named after him). It’s always nice to find out about such development of seemingly unremarkable things.

Thursday, 31 October

Finally NihAV got full-feature VP7 decoding support (well, except one very exotic case for a very exotic mode) so now I can move to other things like actually making various decoders bit-exact, fixing other bugs in them, adding missing pieces of code for player and even documenting stuff. I hope to give a presentation of my work on VDD 2020 or FOSDEM 2021 (whichever accepts it) and I want to have something decent to present by then.

Anyway, here’s a review of VP7.

First of all, I’d like to note complete lack of cooperation from Baidu. Back in the day when they just bought On2 and presented VP8 people asked them to release VP7 as open source. To which they replied that their developers are working on adding new features to the codec and they don’t have time to locate the old sources. Even better, we have managed to find VP6 and VP7 specification on some Chinese site and Baidu people could not even bother to release that. As it was said in certain series of games, press <F>+<U> to pay respect.

Well, maybe they were too ashamed of it because it looks like half-finished VP8 specification without source code in the appendix. It has the same passages written by Captain Obvious like this passage:

A reasonable “divide and conquer” approach to implementation of a decoder is to begin by decoding streams composed exclusively of key frames. After that works reliably, interframe handling can be added more easily than if complete functionality were attempted immediately.

And very informative things like:

const Prob kfUVmodeProb [numUVmodes - 1] = { ??, ??, ??};

Nevertheless, it was my primary source along with the binary specification which obviously has all those missing tables, pieces of code and more.

So let’s move to overall VP7 description. VP7 is Duck rip-off of H.264: the same coding method as used in VP5 and VP6 (to the point where I could reuse bits for decoding coefficients and motion vectors from my VP5/6 decoder) with overall structure of H.264 (4×4 blocks clustered into 16×16 macroblocks, optional separate 4×4 block for luma DCs, spatial prediction and such). It’s funny that if you consider how it does certain things then VP8 looks like a very minor improvement on VP7 (fixed probabilities are internally calculated from the fixed tree weights, golden frame can be partially updated after decoding a frame and other details). Another fun fact is that VP7 decoder recognizes three FOURCCs—VP70, VP71 and ONYX (and you can still see onyxc_int.h in libaom). VP71 is claimed to be error-resilient profile and it has some small changes in frame header and it can signal to preserve scan order and some frequencies (that cannot be changed regardless) and restore them after the frame has been decoded.

The nastiest part there is so-called macroblock features: if they are enabled then each macroblock can have something in its decoding process altered. First feature forces decoder to use different quantiser for current macroblock, second one tells decoder to use different loop filter strength, third one tells decoder to update golden frame with this macroblock and the last feature tells decoder to reconstruct macroblock in weird way. There are two parts there: residue reconstruction (i.e. how to apply reconstructed 4×4 blocks to the macroblock) and motion prediction. Residue reconstruction has four modes: normal, interlaced, very interlaced (i.e. using every fourth line instead every second one) and koda extremely interlaced (each 4×4 subblock is treated as 16×1 line instead). Motion prediction has all these modes and also warped motion (when you copy not from a rectangle but rather from a parallelogram with 45-degree angle). I can understand many things but “copy from a line like it’s a 4×4 block” is too much for me and thus I left it unimplemented (not that I have any samples for it either).

But other than that I have all features implemented (even though I don’t have samples to test them) and decoder produces recognizable picture so I’m happy with that and can move to other things as I stated in the beginning.

And in the unlikely case somebody reads this post and asks that question—no, I’m not going to implement VP8/VP9/AV1 decoder. Those are not Duck codecs and I’d rather reverse engineer some obscure codec instead.

Saturday, 28 September

I’ve spent two days on something that I wanted to do long time ago but postponed because of various reasons including complexity. But now thanks to Ghidra I’ve finally managed to pry into resources of Monty Python and the Quest for the Holy Grail and decode videos (and more!) stored there.

Probably it was VP7 (again) that made me look into it, but since Ghidra has decompiler for 16-bit code as well, I’ve tried it on libraries of surprisingly small game engine (pity that ScummVM does not even plan to support it but I’m in a similar situation with codecs). And what do you know, they have a special library dealing with resource files that included unpacking images (but not audio—there’s audio library for that).

First, let’s talk about resource file organisation and why it baffled me for too long. The files are organised into several chunks: header that contains offsets of the other chunks at the end of it (around bytes 0xE2..0x141), then you have actual table of contents (more about it later), then some small binary data chunk, then list of strings related to file names, then some chunk that looks like game script (in binary form), then palette and finally another list of strings that looks like variables list. It is no wonder I could not do anything useful with it since actual table of contents is hidden somewhere near the end of file but not exactly at the end. Also each game scene corresponds to its own archive which makes it clearer what to expect there (previous version used in Monty Python’s Complete Waste of Time even has a description right in the header and table of contents stored right after it, I might look at it later but no promises).

Now, files in the resource archive. The catalogue has only file type, its size and offset in the archive and nothing else. There are more than a dozen file types, 1 being images (both background and sprites), 2 being movies (the thing I was hunting), 4 is for MIDI tracks, 9 is for digitised audio, the rest I don’t care about.

Let’s move to simple things like music and audio. Music is stored in its custom format with four bytes per command except when high bit is set (then it’s delta time for next command). Why so? Probably because it’s being played by sending single MIDI commands and at least on Windows it requires packing them into 32-bit word anyway. Audio is stored in files with some header and either raw form or with IMA ADPCM compression. The notable thing is that it does not use any multiplications in any form—instead it has precomputed table for all eight possible values for each step.

Images are another interesting topic. First of all, palette is stored as 944-byte file with 32-bit entries (so it’s just 236 colours) but somehow decoded files use it with bias of ten, i.e. decoded index 13 corresponds to palette entry 3, index 27 is for entry 17 etc etc. I have no idea what it does with first colours in the palette (though sometimes images are not colorised properly so maybe they need different palette bias or a different palette at all).
Second, images employ two different compression schemes: RLE and LZW. RLE is rather trivial except that it uses opcode zero to signal end of line (and double zero for end of image):

Obvious words skipped

LZW-compressed image splits image into chunks that may be uncompressed or compressed with LZW using 10, 11 or 12 bits per index. And after all these years I was still able to correctly guess it was LZW and write a decoder that worked fine without resorting to any documentation or even studying the decompiled code (I just needed to realize that it reads sequences of n-bits indexes and does something with them to output bytes).

And finally the movie format. Those are actually stored in two files—movie data and companion frame index (i.e. frame type, size and offset). After extracting frames I could not realize what to do with them until I noticed that frame type 2 all have the constant size of 11025 except for first few. Obviously they turned out to be IMA ADPCM compressed audio frames (and since they were the first frames in every movie it was hard to guess the format looking at semi-random data without header). Frame type 0 turned out to be the same image format as normal pictures.

Frame type 1 turned out to be inter-frames that use index 0 for pixels that should not be updated.

Unfortunately even if I now have a way to decode the files I cannot add direct support for them in NihAV since it requires at least three different files to play it (palette, frame index and frame data). At least now I know I can transcode them into something when the need arises.

Now let’s talk about Ghidra a bit. Without it this probably would have not happened since I’m not that young to spend time on translating tons of disassembly (and REC failed to do a good job). Ghidra support for 16-bit code is far from perfect with all that mess with far pointers consisting of two 16-bit words, external DLL functions linked by ordinals (so I had to match them by hoof and rename where possible) and general confusion with int treated as 4-byte in some contexts but 2-byte in another. And yet it did the job and helped me to understand how the game libraries work and that’s what really counts.

And as a bonus here’s a team photo extracted from concentration mini-game related to the witch scene (not the Simon Says clone with fires but probably Match Two clone that I’ve never seen in-game before):

Thursday, 26 September

VP7 is such a nice codec that I decided to distract myself a little with something else. And that something else turned out to be MidiVid codec family. It turned out to be quite peculiar and somehow reminiscent of Duck codecs.

The family consists of three codecs:

  1. MidiVid — the original codec based on LZSS and vector quantisation;
  2. MidiVid Lossless — exactly what is says on a tin, based on LZSS and bunch of other technologies;
  3. MidiVid 3 — a codec based on simplified integer DCT and single codebook for all values.

I’ve actually added MidiVid decoder to NihAV because it’s simple (two hundred lines including boilerplate and tests) and way more fun than working on VP7 decoder. Now I’ll describe them and hopefully you’ll understand why it reminds me of Duck codecs despite not being similar in design.

MidiVid

This is a simple hold-and-modify video codec that had been used in some games back in PS2/Xbox era. The frame data can be stored either unpacked or packed with LZSS and it contains the following kinds of data: change mask for 8×8 blocks (in case of interframe—if it’s zero then leave block as is, otherwise decode new data for it), 4×4 block codebook data (up to 512 entries), high bits for 9-bit indices (if we have 257-512 various blocks) and 8-bit indexes for codebook.

The interesting part is that LZSS scheme looked very familiar and indeed it looks almost exactly like lzss.c from LZARI author (remember that? I still do), the only differences is that it does not use pre-filled window and flags are grouped into 16-bit word instead of single byte.

MidiVid Lossless

This one is a special best as it combines two completely different compression methods: the same LZSS as before and something used by BWT-based compressor (to the point that frame header contains FTWB or ZTWB IDs).

I’m positively convinced it was copied from some BTW-based compressor not just because of those IDs but also because it seems to employ the same methods as some old BTW-based compressor except for the Burrows–Wheeler transform itself (that would be too much for the old codecs): various data preprocessing methods (signalled by flags in the frame header), move-to-front coding (in its classical 1-2 coding form that does not update first two positions that much) plus coding coefficients in two groups: first just zero/one/large using order-3 adaptive model and then values larger than one using single order-1 adaptive model. What made it suspicious? Preprocessing methods.

MVLZ has different kinds of preprocessing methods: something looking like distance coding, static n-gram replacement, table prediction (i.e. when data is treated as series of n-bit numbers and the actual numbers are replaced with the difference between previous ones) and x86 call preprocessing (i.e. that trick when you change function call address from relative into absolute for better compression ratio and then undo it during decompression; known also as E8-preprocessing because x86 call opcode is E8 <32-bit offset> and it’s easy to just replace them instead of adding full disassembler to the archiver). I had my suspicions as n-gram replacement (that one is quite stupid for video codecs and it only replaces some values with some binary values that look more related to machine code than video) but the last item was a dead give-away. I’m pretty sure that somebody who knows open-source BWT compressors of late 1990s will probably recognize it even from this description but sadly I’ve not been following it that closely being more attracted to multimedia.

MidiVid 3

This codec is based on some static codebook for packing all values: block types, motion vectors and actual coefficients. Each block in macroblock can be coded with one of four modes: empty (fill with 0x80 in case of intra), DC only, just few coefficients DCT, and full DCT. As usual various kinds of data are grouped and coded as single array.

Motion compensation is full-pixel and unlike its predecessor it operates in YUV420 format.


This was an interesting detour but I have to return back to failing to start writing VP7 decoder.

P.S. I’ll try to document them with more details in the wiki soon.
P.P.S. This should’ve been a post about railways instead but I guess it will have to wait.

Tuesday, 10 July

I am working on PowerPC SIMD optimizations for x264. I was playing with SAD functions and was thinking it would be nice to have something similar to x86 PSADBW for computing the sum of absolute differences. Luca suggested me to try using the #power9 vec_absd.  Single vec_absd( fenc, pix0v ) replaces vec_sub( vec_max( fencv, pix0v ), vec_min( fencv, pix0v) ). My patch can be found here. To make it work -mcpu=power9 must be set. The patch contains the macro that makes its code backward compatible with POWER8:

#ifndef __POWER9_VECTOR__
#define vec_absd(a, b) vec_sub(vec_max(a, b), vec_min(a, b))
#endif


I got very nice results using vec_absd (the numbers are ratios of AltiVec/C checkasm timings):


I benchmarked overall encoding performance with perf and using vec_absd for  SAD and SSD functions makes 8% improvement. Which is amazing for such a small change.

Thanks Raptor Computing Systems for the chance to experiment with Power9 early.

Sunday, 30 April

I decided to write SIMD optimizations for HEVC decoder inverse transform (which is IDCT approximation) for ARMv7. (Here is an interesting post about DCT.) The inverse transform for HEVC operates on 4x4, 8x8, 16x16 and 32x32 blocks and I have finished them recently. For each block there are 2 functions, one for 8 bitdepth and the other for 10 bitdepth:

  • 4x4 block: ff_hevc_idct_4x4_8/10_neon, the speed up vs C code on A53 core ~3x
  • 8x8 block: ff_hevc_idct_8x8_8/10_neon, speed up (A53) ~4x (github)
  • 16x16 block: ff_hevc_idct_16x16_8/10_neon, speed up (A53) ~8x (github)
  • 32x32 block: ff_hevc_idct_32x32_8/10_neon, speed up (A53) ~13x (github)
Here are some things I learned about NEON ARMv7.
  • The values of q4 - q7 have to be preserved (with vpush/vpop {q4-q7} ) when one wants to use them. VPUSH/VPOP pushes and pops to/from stack.
  • Try to do things in parallel because many of the smaller ARM cores do not have out-of-order execution like x86 does.
  • Do not forget to preserve LR (link register) value when calling a function. When LR value is preserved it can be used as any other GPR.
  • If LR value was pushed to stack, one does not have to do pop lr and then bx lr to return but it's better to return with simply pop {pc} .
  • Use VSHL (Vector Shift Left) instead of VMUL (Vector multiply by scalar) when possible, it's much faster. (The same is valid in general, for example for x86.)
  • Align loads/stores when possible, it's faster.
  • To align the stack and allocate a temporary buffer there (rx is some GPR) mov rx, sp and rx, sp, #15 add rx, rx, #buffer_size sub sp, sp, rx sp now points to the buffer. After using the buffer, the stack pointer has to be restored with add sp, sp, rx .
  • Try to keep functions small. If it is needed to call some big macro several times, make a function of such a macro. Too big futions may fail to build and may hurt performance.

  • Always try to play with the instruction order when it is possible and benchmark the results. But what improves the performance on one core (mine is A53) may cause (or may not) a slowdown on some other core (A7, A8, A9).
Many of the things I learned could be found in ARM Architecture Reference Manual ARMv7-A or the other ARM documentation therefore it is important to read such documents. Many thanks to Kostya Shishkov, who introduced me to ARM and many thanks to Martin Storsjö, an ARM expert who reviewed my patches and helped me a lot with optimizing them.

Friday, 28 April

I tried my skills at optimising HEVC. My SIMD IDCT (Inverse Discrete Cosine Transform) for HEVC decoder was merged lately. What I did was 4x4, 8x8, 16x16 and 32x32 IDCTs for 8 and 10 bitdepths. Both 4x4 and 8x8 are supported on 32-bit CPUs but 16x16 and 32x32 are 64-bit only.

The larger transforms calls the smaller ones, 32 calls 16, 16 calles 8 and so on, so 4x4 is used by all the other transforms. Here is how the actual assembly looks:
; void ff_hevc_idct_4x4__{8,10}_(int16_t *coeffs, int col_limit)
; %1 = bitdepth
%macro IDCT_4x4 1
cglobal hevc_idct_4x4_%1, 1, 1, 5, coeffs
mova m0, [coeffsq]
mova m1, [coeffsq + 16]

TR_4x4 7, 1, 1
TR_4x4 20 - %1, 1, 1

mova [coeffsq], m0
mova [coeffsq + 16], m1
RET
%endmacro
*coeffs is a pointer to coefficients I want to transform. They are loaded to XMM registers and then TR_4x4 macro is called. This macro transforms the coeffs according to the following equations: res00 = 64 * src00 + 64 * src20 + 83 * src10 + 36 * src30
res10 = 64 * src01 + 64 * src21 + 83 * src11 + 36 * src31
res20 = 64 * src02 + 64 * src23 + 83 * src12 + 36 * src32
res30 = 64 * src03 + 64 * src23 + 83 * src13 + 36 * src33
Because the transformed coefficients are written back to the same place, "res" (as residual) is used for the results and "src" for the initial coefficients. The results from the calculations are then scaled res = (res + add_const) >> shift and the (4x4) block of the results is transposed. The macro is called again to perform the same transform but this time to rows.
; %1 - shift
; %2 - 1/0 - SCALE and Transpose or not
; %3 - 1/0 add constant or not
%macro TR_4x4 3
; interleaves src0 with src2 to m0
; and src1 with scr3 to m2
; src0: 00 01 02 03 m0: 00 20 01 21 02 22 03 23
; src1: 10 11 12 13 -->
; src2: 20 21 22 23 m1: 10 30 11 31 12 32 13 33
; src3: 30 31 32 33

SBUTTERFLY wd, 0, 1, 2

pmaddwd m2, m0, [pw_64] ; e0
pmaddwd m3, m1, [pw_83_36] ; o0
pmaddwd m0, [pw_64_m64] ; e1
pmaddwd m1, [pw_36_m83] ; o1

%if %3 == 1
%assign %%add 1 << (%1 - 1)
mova m4, [pd_ %+ %%add]
paddd m2, m4
paddd m0, m4
%endif

SUMSUB_BADC d, 3, 2, 1, 0, 4

%if %2 == 1
psrad m3, %1 ; e0 + o0
t psrad m1, %1 ; e1 + o1
psrad m2, %1 ; e0 - o0
psrad m0, %1 ; e1 - o1
;clip16
packssdw m3, m1
packssdw m0, m2
; Transpose
SBUTTERFLY wd, 3, 0, 1
SBUTTERFLY wd, 3, 0, 1
SWAP 3, 1, 0
%else
SWAP 3, 2, 0
%endif
%endmacro
The larger transforms are a bit more complicated but they works in a similar way.

There are the results benchmarked by checkasm bench_new() function for the bitdepth 8 (the results are similar for bitdepth 10). Checkasm can benchmark SIMD functions with --bench option, in my case the full command was:

    ./tests/checkasm/checkasm --bench=hevc_idct.
    The overall HEVC performace was benchmarked with perf:
    perf stat -r5 ./avconv -threads 1 -i sample.mkv -an -f null -. The sample details: duration 0:12:14, bitrate 200kb/s, yuv420p, 1920x1080, Divx encode of Tears of Steel. The result is 10% speed up after my SIMD optimisations.

    Many thanks to Kostya Shishkov and Henrik Gramner for their advices during the development process.

    Sunday, 22 January

    I asked Kostya Shishkov, an experienced ARM developer, to check my basic NEON knowledge. So here are his questions and my answers to them:

    • Where do you find informations about instruction details?
    • ARM Architecture Reference Manual ARMv7-A and ARMv7-R edition.
    • How many SIMD registers are there and how can you address them?
    • SIMD registers can be viewed as 32 64-bit registers named D0 - D31 or as 16 128-bit registers called Q0 - Q15. VFP view as 32 32-bit registers S0-S31 (mapped to D0- D15) can be also used.
    • In what ways can you load/store data to/from NEON registers and why use one over another?
      • use immediate constant to fill SIMD register:
      • vmov.i32 q0, #42 - move immediate constant 42 to q0 SIMD register, suffix i32 specifies the data type, 32-bit integer in this case, as q0 is 128-bit register, there will be 4 32-bit 42 constants
      • use GPR to store the offset for load/store instruction:
      • mov r1, #16 - move number 16 to r1 GPR register add r0, r1 - add 16 bytes to the address stored in r0 vst1.s16 {q0}, [r0] - store the content of q0 to the address stored in r1
      • update the address after loading (storing)
      • add r1, r0, \offset - add the offset to r0 and store the result in r1mov r2, \step - move the constant step to r2 vld1.s16 {d0}, [r1], r2 - store d0 content to the address in r1, then update r1 = r1 + r2
      • move data between GPR and SIMD registers:
      • vmov d0[0], r1
    • Why it's better to use all different registers in NEON instruction for all its arguments?
    • ( for example: why it's better to use vadd.s16 q1, q2, q3 instead of vadd.s16 q1, q1, q2?)Because it is faster in some cases.
    • What are the differences between vld1.16 q1, [r0] and vld1.16 q1, [r0,:128]?
    • :128 means the data are loaded/stored 128bit aligned. Aligned store means the addr I'm storing at minus the addr of the first array member is 128bit (in this case) multiple.
    • Why some instructions use e.g. vxxx.i8 form and others use vxxx.s8 form?
    • i stands for integer, s for signed integer, I is used when signedness does not matter just to know the element size.
    • Where would you use VZIP, VUZP and VTRN?
    • Those are useful for matrix transposition and various data shuffling.
    • When one NEON instruction can replace several different operations in other SIMDs?
      • VMLAL - multiply corresponding elements in 2 vectors and add the products to destination vector
      • VQRSHRN - Vector Saturating Shift Right Narrow - Shifts right vector elements and clips them into narrower element.

    Monday, 31 October

    Using hwaccel

    Had been a while since I mentioned the topic and we made a huge progress on this field.

    Currently with Libav12 we already have nice support for multiple different hardware support for decoding, scaling, deinterlacing and encoding.

    The whole thing works nicely but it isn’t foolproof yet so I’ll start describing how to setup and use it for some common tasks.

    This post will be about Intel MediaSDK, the next post will be about NVIDIA Video Codec SDK.

    Setup

    Prerequisites

    • A machine with QSV hardware, Haswell, Skylake or better.
    • The ability to compile your own kernel and modules
    • The MediaSDK mfx_dispatch

    It works nicely both on Linux and Windows. If you happen to have other platforms feel free to contact Intel and let them know, they’ll be delighted.

    Installation

    The MediaSDK comes with either the usual Windows setup binary or a Linux bash script that tries its best to install the prerequisites.

    # tar -xvf MediaServerStudioEssentials2017.tar.gz
    MediaServerStudioEssentials2017/
    MediaServerStudioEssentials2017/Intel(R)_Media_Server_Studio_EULA.pdf
    MediaServerStudioEssentials2017/MediaSamples_Linux_2017.tar.gz
    MediaServerStudioEssentials2017/intel_sdk_for_opencl_2016_6.2.0.1760_x64.tgz
    MediaServerStudioEssentials2017/site_license_materials.txt
    MediaServerStudioEssentials2017/third_party_programs.txt
    MediaServerStudioEssentials2017/redist.txt
    MediaServerStudioEssentials2017/FEI2017-16.5.tar.gz
    MediaServerStudioEssentials2017/SDK2017Production16.5.tar.gz
    MediaServerStudioEssentials2017/media_server_studio_essentials_release_notes.pdf
    

    Focus on SDK2017Production16.5.tar.gz.

    tar -xvf SDK2017Production16.5.tar.gz
    SDK2017Production16.5/
    SDK2017Production16.5/Generic/
    SDK2017Production16.5/Generic/intel-opencl-16.5-55964.x86_64.tar.xz.sig
    SDK2017Production16.5/Generic/intel-opencl-devel-16.5-55964.x86_64.tar.xz.sig
    SDK2017Production16.5/Generic/intel-opencl-devel-16.5-55964.x86_64.tar.xz
    SDK2017Production16.5/Generic/intel-linux-media_generic_16.5-55964_64bit.tar.gz
    SDK2017Production16.5/Generic/intel-opencl-16.5-55964.x86_64.tar.xz
    SDK2017Production16.5/Generic/vpg_ocl_linux_rpmdeb.public
    SDK2017Production16.5/media_server_studio_getting_started_guide.pdf
    SDK2017Production16.5/intel-opencl-16.5-release-notes.pdf
    SDK2017Production16.5/intel-opencl-16.5-installation.pdf
    SDK2017Production16.5/CentOS/
    SDK2017Production16.5/CentOS/libva-1.67.0.pre1-55964.el7.x86_64.rpm
    SDK2017Production16.5/CentOS/libdrm-devel-2.4.66-55964.el7.x86_64.rpm
    SDK2017Production16.5/CentOS/intel-linux-media-devel-16.5-55964.el7.x86_64.rpm
    SDK2017Production16.5/CentOS/intel-i915-firmware-16.5-55964.el7.x86_64.rpm
    SDK2017Production16.5/CentOS/install_scripts_centos_16.5-55964.tar.gz
    SDK2017Production16.5/CentOS/intel-opencl-devel-16.5-55964.x86_64.rpm
    SDK2017Production16.5/CentOS/ukmd-kmod-16.5-55964.el7.src.rpm
    SDK2017Production16.5/CentOS/libdrm-2.4.66-55964.el7.x86_64.rpm
    SDK2017Production16.5/CentOS/libva-utils-1.67.0.pre1-55964.el7.x86_64.rpm
    SDK2017Production16.5/CentOS/intel-linux-media-16.5-55964.el7.x86_64.rpm
    SDK2017Production16.5/CentOS/kmod-ukmd-16.5-55964.el7.x86_64.rpm
    SDK2017Production16.5/CentOS/intel-opencl-16.5-55964.x86_64.rpm
    SDK2017Production16.5/CentOS/libva-devel-1.67.0.pre1-55964.el7.x86_64.rpm
    SDK2017Production16.5/CentOS/drm-utils-2.4.66-55964.el7.x86_64.rpm
    SDK2017Production16.5/CentOS/MediaSamples_Linux_bin-16.5-55964.tar.gz
    SDK2017Production16.5/CentOS/vpg_ocl_linux_rpmdeb.public
    SDK2017Production16.5/media_server_studio_sdk_release_notes.pdf
    

    Libraries

    The MediaSDK leverages libva to access the hardware together with an highly extended DRI kernel module.
    They support CentOS with rpms and all the other distros with a tarball.

    BEWARE: if you use the installer script the custom libva would override your system one, you might not want that.

    I’m using Gentoo so it is intel-linux-media_generic_16.5-55964_64bit.tar.gz for me.

    The bits of this tarball you really want to install in the system no matter what is the firmware:

    ./lib/firmware/i915/skl_dmc_ver1_26.bin
    

    If you are afraid of adding custom stuff on your system I advise to offset the whole installation and then override the LD paths to use that only for Libav.

    BEWARE: you must use the custom iHD libva driver with the custom i915 kernel module.

    If you want to install using the provided script on Gentoo you should first emerge lsb-release.

    emerge lsb-release
    bash install_media.sh
    source /etc/profile.d/*.sh
    echo /opt/intel/mediasdk/lib64/ >> /etc/ld.so.conf.d/intel-msdk.conf
    ldconfig
    

    Kernel Modules

    The patchset resides in:

    opt/intel/mediasdk/opensource/patches/kmd/4.4/intel-kernel-patches.tar.bz2
    

    The current set is 143 patches against linux 4.4, trying to apply on a more recent kernel requires patience and care.

    The 4.4.27 works almost fine (even btrfs does not seem to have many horrible bugs).

    Libav

    In order to use the Media SDK with Libav you should use the mfx_dispatch from yours truly since it provides a default for Linux so it behaves in an uniform way compared to Windows.

    Building the dispatcher

    It is a standard autotools package.

    git clone git://github.com/lu-zero/mfx_dispatch
    cd mfx_dispatch
    autoreconf -ifv
    ./configure --prefix=/some/where
    make -j 8
    make install
    

    Building Libav

    If you want to use the advanced hwcontext features on Linux you must enable both the vaapi and the mfx support.

    git clone git://github.com/libav/libav
    cd libav
    export PKG_CONFIG_PATH=/some/where/lib/pkg-config
    ./configure --enable-libmfx --enable-vaapi --prefix=/that/you/like
    make -j 8
    make install
    

    Troubleshooting

    Media SDK is sort of temperamental and the setup process requires manual tweaking so the odds of having to do debug and investigate are high.

    If something misbehave here is a checklist:
    – Make sure you are using the right kernel and you are loading the module.

    uname -a
    lsmod
    dmesg
    
    • Make sure libva is the correct one and it is loading the right thing.
    vainfo
    strace -e open ./avconv -c:v h264_qsv -i test.h264 -f null -
    
    • Make sure you aren’t using the wrong ratecontrol or not passing all the parameters required
    ./avconv -v verbose -filter_complex testsrc -c:v h264_qsv {ratecontrol params omitted} out.mkv
    

    See below for some examples of working rate-control settings.
    – Use the MediaSDK examples provided with the distribution to confirm that everything works in case the SDK is more recent than the updates.

    Usage

    The Media SDK support in Libav covers decoding, encoding, scaling and deinterlacing.

    Decoding is straightforward, the rest has still quite a bit of rough edges and this blog post had been written mainly to explain them.

    Currently the most interesting format supported are h264 and hevc, but even other formats such as vp8 and vc1 are supported.

    ./avconv -codecs | grep qsv
    

    Decoding

    The decoders can output directly to system memory and can be used as normal decoders and feed a software implementation just fine.

    ./avconv -c:v h264_qsv -i input.h264 -c:v av1 output.mkv
    

    Or they can decode to opaque (gpu backed) buffers so further processing can happen

    ./avconv -hwaccel qsv -c:v h264_qsv -vf deinterlace_qsv,hwdownload,format=nv12 -c:v x265 output.mov
    

    NOTICE: you have to explicitly pass the filterchain hwdownload,format=nv12 not have mysterious failures.

    Encoding

    The encoders are almost as straightforward beside the fact that the MediaSDK provides multiple rate-control systems and they do require explicit parameters to work.

    ./avconv -i input.mkv -c:v h264_qsv -q 20 output.mkv
    

    Failing to set the nominal framerate or the bitrate would make the look-ahead rate control not happy at all.

    Rate controls

    The rate control is one of the most rough edges of the current MediaSDK support, most of them do require a nominal frame rate and that requires an explicit -r to be passed.

    There isn’t a default bitrate so also -b:v should be passed if you want to use a rate-control that has a bitrate target.

    Is it possible to use a look-ahead rate-control aiming to a quality metric passing -global_quality -la_depth.

    The full list is documented.

    Transcoding

    It is possible to have a full hardware transcoding pipeline with Media SDK.

    Deinterlacing

    ./avconv -hwaccel qsv -c:v h264_qsv -i input.mkv -vf deinterlace_qsv -c:v h264_qsv -r 25 -b:v 2M output.mov
    

    Scaling

    ./avconv -hwaccel qsv -c:v h264_qsv -i input.mkv -vf scale_qsv=640:480 -c:v h264_qsv -r 25 -b:v 2M -la_depth 10 output.mov
    
    

    Both at the same time

    ./avconv -hwaccel qsv -c:v h264_qsv -i input.mkv -vf deinterlace_qsv,scale_qsv=640:480 -c:v h264_qsv -r 25 -b:v 2M -la_depth 10 output.mov
    

    Hardware filtering caveats

    The hardware filtering system is quite new and introducing it shown a number of shortcomings in the Libavfilter architecture regarding format autonegotiation so for hybrid pipelines (those that do not keep using hardware frames all over) it is necessary to explicitly call for hwupload and hwdownload explictitly in such ways:

    ./avconv -hwaccel qsv -c:v h264_qsv -i in.mkv -vf deinterlace_qsv,hwdownload,format=nv12 -c:v vp9 out.mkv
    

    Future for MediaSDK in Libav

    The Media SDK supports already a good number of interesting codecs (h264, hevc, vp8/vp9) and Intel seems to be quite receptive regarding what codecs support.
    The Libav support for it will improve over time as we improve the hardware acceleration support in the filtering layer and we make the libmfx interface richer.

    We’d need more people testing and helping us to figure use-cases and corner-cases that hadn’t been thought of yet, your feedback is important!

    Tuesday, 18 October

    I decided to organise the Libav sprint again, this time in a small village near Pelhřimov. The participants:


    • Luca Barbato - came with a lot of Venchi chocolate
    • Anton Khirnov
    • Kostya Shishkov - came with a lot of Läderach chocolate
    • Mark Thompson
    • Alexandra Hájková (me) 
    All the chocolate was sooo tasty, we ate all of it of course.
    Topics:
    • Luca - Altivec
    • Anton and Mark - coworking on QSV and VP9
    • Alexandra - x86 SIMD HEVC IDCT
    • Kostya - consultations for the rest of us

    I rented a cosy cottage for the sprint. It was surprisingly warm for the end of September and we enjoyed a nice garden sitting not only with the table and chairs but even with a couch. The sitting was under the roof and it was possible to work outside which was really pleasant for me. There was also a grill so we grilled some sausages for one of our dinners. It started to rain the last day and making a fire in the fireplace made a nice feeling.

    Because the weather was really nice we decided to explore the countyside a bit, we finally found the path to the forest and spent mid-afternoon on its fresh air.

    Both Luca and me like to cook and to try new foods and dishes. We cooked a lot during the sprint, Luca prepared some delicious Italian meals, Kostya cooked us traditional ukranian dish from millet called куліш which was very tasty and I want to try it at home sometime. At least for me the most interesting thing of the cooking part was making another traditional Ukrainian meal вареники which is kind of filled-pasta. We filled one half of them with potato-salami  with fried onion filling and the other half with cottage cheese, both very good I can't decide which one was better.  The вареники was eaten with sour cream, there're also some dried cranberries on the picture.

    I almost finished my x86 SIMD  optimisation for HEVC IDCT there, Luca introduced altivec to me and I wrote PPC optimised 4x4 IDCT (github).

    A lot of work was done during the sprint, many patches sent (ML), many patches pushed, all of this in a friendly atmosphere in a comfortable cottage, fresh air and with a good cuisine. Personally I enjoyed the sprint very much, I'm glad I organised it and I hope the other people liked it as well.

    Thank you everyone for coming!



    Monday, 02 May

    Some time ago Niels Möller proposed a new method of bitreading that should be faster then the current one (here). It is an interesting idea and I decided to try it. Luca Barbato considered it to be a good idea and had his company sponsored this work. The new bitstream reader (bitstream.h) is faster in many cases and is never  slower than the existing one (get_bits.h).

    All the new and equivalent old bitreading functions was benchmarked using TIME macros in a simple test program. Because the results were good, I converted all the decoders to use the new bitstream reader. The performances of the most important decoders using the new and old bitreaders was benchmarked with perf stat (using x86_64, 64-bit (Intel Core i3-2120, 3.30GHz)) are pretty good and even on arm32 I could not see speed regressions.

    The old bitstream reader is quite inconsistent, with its core API made of macros and with at least 3 or 4 higher level functions reading a not easy to guess number of bits.
    static inline unsigned int get_bits(GetBitContext *s, int n){ register int tmp; OPEN_READER(re, s);
    UPDATE_CACHE(re, s); tmp = SHOW_UBITS(re, s, n); LAST_SKIP_BITS(re, s, n); CLOSE_READER(re, s); return tmp;}
    The new bitstream reader is written to be easier to use, more consistent and to be easier to follow. It is better documented, the functions are named according to the current naming convetions and to be consistent with the bytestream reader naming.
    Some of bitstream.h functions replaces several ones from get_bits.h at once:
    • bitstream_read_32() reads bits from the 0-32 range and replaces
      • get_bits()
      • get_bits_long()
      • get_bitsz()
    • bitstream_peek_32() replaces
      • show_bits()
      • show_bits_long()
      • show_bits1()
    • bitstream_skip() replaces
      • skip_bits1()
      • skip_bits()
      • skip_bits_long()
    The get_bits.h bitreading macros have to be used directly sometimes to achieve better decoding performance. Reading or writing the code that uses these macros requires good knowledge about how this bitreader works and they can be surprising at times since they create local variables.
    The new bitreader usage does not require such a deep knowledge, all needed operations require just to use a smaller set of function that happen to be faster in many usage patterns.

    Many thanks to Luca Barbato for his advices and consultations during the developing process.

     I hope this new bitreader could become a useful piece of the Libav code. Opinions and suggestions are welcome.

    Friday, 01 April

    swscale is one of the most annoying part of Libav, after a couple of years since the initial blueprint we have something almost functional you can play with.

    Colorspace conversion and Scaling

    Before delving in the library architecture and the outher API probably might be good to make a extra quick summary of what this library is about.

    Most multimedia concepts are more or less intuitive:
    encoding is taking some data (e.g. video frames, audio samples) and compress it by leaving out unimportant details
    muxing is the act of storing such compressed data and timestamps so that audio and video can play back in sync
    demuxing is getting back the compressed data with the timing information stored in the container format
    decoding inflates somehow the data so that video frames can be rendered on screen and the audio played on the speakers

    After the decoding step would seem that all the hard work is done, but since there isn’t a single way to store video pixels or audio samples you need to process them so they work with your output devices.

    That process is usually called resampling for audio and for video we have colorspace conversion to change the pixel information and scaling to change the amount of pixels in the image.

    Today I’ll introduce you to the new library for colorspace conversion and scaling we are working on.

    AVScale

    The library aims to be as simple as possible and hide all the gory details from the user, you won’t need to figure the heads and tails of functions with a quite large amount of arguments nor special-purpose functions.

    The API itself is modelled after avresample and approaches the problem of conversion and scaling in a way quite different from swscale, following the same design of NAScale.

    Everything is a Kernel

    One of the key concept of AVScale is that the conversion chain is assembled out of different components, separating the concerns.

    Those components are called kernels.

    The kernels can be conceptually divided in two kinds:
    Conversion kernels, taking an input in a certain format and providing an output in another (e.g. rgb2yuv) without changing any other property.
    Process kernels, modifying the data while keeping the format itself unchanged (e.g. scale)

    This pipeline approach gets great flexibility and helps code reuse.

    The most common use-cases (such as scaling without conversion or conversion with out scaling) can be faster than solutions trying to merge together scaling and conversion in a single step.

    API

    AVScale works with two kind of structures:
    AVPixelFormaton: A full description of the pixel format
    AVFrame: The frame data, its dimension and a reference to its format details (aka AVPixelFormaton)

    The library will have an AVOption-based system to tune specific options (e.g. selecting the scaling algorithm).

    For now only avscale_config and avscale_convert_frame are implemented.

    So if the input and output are pre-determined the context can be configured like this:

    AVScaleContext *ctx = avscale_alloc_context();
    
    if (!ctx)
        ...
    
    ret = avscale_config(ctx, out, in);
    if (ret < 0)
        ...
    

    But you can skip it and scale and/or convert from a input to an output like this:

    AVScaleContext *ctx = avscale_alloc_context();
    
    if (!ctx)
        ...
    
    ret = avscale_convert_frame(ctx, out, in);
    if (ret < 0)
        ...
    
    avscale_free(&ctx);
    

    The context gets lazily configured on the first call.

    Notice that avscale_free() takes a pointer to a pointer, to make sure the context pointer does not stay dangling.

    As said the API is really simple and essential.

    Help welcome!

    Kostya kindly provided an initial proof of concept and me, Vittorio and Anton prepared this preview on the spare time. There is plenty left to do, if you like the idea (since many kept telling they would love a swscale replacement) we even have a fundraiser.

    Tuesday, 29 March

    Another week another API landed in the tree and since I spent some time drafting it, I guess I should describe how to use it now what is implemented. This is part I

    What is here now

    Between theory and practice there is a bit of discussion and obviously the (lack) of time to implement, so here what is different from what I drafted originally:

    • Function Names: push got renamed to send and pull got renamed to receive.
    • No separated function to probe the process state, need_data and have_data are not here.
    • No codecs ported to use the new API, so no actual asyncronicity for now.
    • Subtitles aren’t supported yet.

    New API

    There are just 4 new functions replacing both audio-specific and video-specific ones:

    // Decode
    int avcodec_send_packet(AVCodecContext *avctx, const AVPacket *avpkt);
    int avcodec_receive_frame(AVCodecContext *avctx, AVFrame *frame);
    
    // Encode
    int avcodec_send_frame(AVCodecContext *avctx, const AVFrame *frame);
    int avcodec_receive_packet(AVCodecContext *avctx, AVPacket *avpkt);
    

    The workflow is sort of simple:
    – You setup the decoder or the encoder as usual
    – You feed data using the avcodec_send_* functions until you get a AVERROR(EAGAIN), that signals that the internal input buffer is full.
    – You get the data back using the matching avcodec_receive_* function until you get a AVERROR(EAGAIN), signalling that the internal output buffer is empty.
    – Once you are done feeding data you have to pass a NULL to signal the end of stream.
    – You can keep calling the avcodec_receive_* function until you get AVERROR_EOF.
    – You free the contexts as usual.

    Decoding examples

    Setup

    The setup uses the usual avcodec_open2.

        ...
    
        c = avcodec_alloc_context3(codec);
    
        ret = avcodec_open2(c, codec, &opts);
        if (ret < 0)
            ...
    

    Simple decoding loop

    People using the old API usually have some kind of simple loop like

    while (get_packet(pkt)) {
        ret = avcodec_decode_video2(c, picture, &got_picture, pkt);
        if (ret < 0) {
            ...
        }
        if (got_picture) {
            ...
        }
    }
    

    The old functions can be replaced by calling something like the following.

    // The flush packet is a non-NULL packet with size 0 and data NULL
    int decode(AVCodecContext *avctx, AVFrame *frame, int *got_frame, AVPacket *pkt)
    {
        int ret;
    
        *got_frame = 0;
    
        if (pkt) {
            ret = avcodec_send_packet(avctx, pkt);
            // In particular, we don't expect AVERROR(EAGAIN), because we read all
            // decoded frames with avcodec_receive_frame() until done.
            if (ret < 0)
                return ret == AVERROR_EOF ? 0 : ret;
        }
    
        ret = avcodec_receive_frame(avctx, frame);
        if (ret < 0 && ret != AVERROR(EAGAIN) && ret != AVERROR_EOF)
            return ret;
        if (ret >= 0)
            *got_frame = 1;
    
        return 0;
    }
    

    Callback approach

    Since the new API will output multiple frames in certain situations would be better to process them as they are produced.

    // return 0 on success, negative on error
    typedef int (*process_frame_cb)(void *ctx, AVFrame *frame);
    
    int decode(AVCodecContext *avctx, AVFrame *pkt,
               process_frame_cb cb, void *priv)
    {
        AVFrame *frame = av_frame_alloc();
        int ret;
    
        ret = avcodec_send_packet(avctx, pkt);
        // Again EAGAIN is not expected
        if (ret < 0)
            goto out;
    
        while (!ret) {
            ret = avcodec_receive_frame(avctx, frame);
            if (!ret)
                ret = cb(priv, frame);
        }
    
    out:
        av_frame_free(&frame);
        if (ret == AVERROR(EAGAIN))
            return 0;
        return ret;
    }
    

    Separated threads

    The new API makes sort of easy to split the workload in two separated threads.

    // Assume we have context with a mutex, a condition variable and the AVCodecContext
    
    
    // Feeding loop
    {
        AVPacket *pkt = NULL;
    
        while ((ret = get_packet(ctx, pkt)) >= 0) {
            pthread_mutex_lock(&ctx->lock);
    
            ret = avcodec_send_packet(avctx, pkt);
            if (!ret) {
                pthread_cond_signal(&ctx->cond);
            } else if (ret == AVERROR(EAGAIN)) {
                // Signal the draining loop
                pthread_cond_signal(&ctx->cond);
                // Wait here
                pthread_cond_wait(&ctx->cond, &ctx->mutex);
            } else if (ret < 0)
                goto out;
    
            pthread_mutex_unlock(&ctx->lock);
        }
    
        pthread_mutex_lock(&ctx->lock);
        ret = avcodec_send_packet(avctx, NULL);
    
        pthread_cond_signal(&ctx->cond);
    
    out:
        pthread_mutex_unlock(&ctx->lock)
        return ret;
    }
    
    // Draining loop
    {
        AVFrame *frame = av_frame_alloc();
    
        while (!done) {
            pthread_mutex_lock(&ctx->lock);
    
            ret = avcodec_receive_frame(avctx, frame);
            if (!ret) {
                pthread_cond_signal(&ctx->cond);
            } else if (ret == AVERROR(EAGAIN)) {
                // Signal the feeding loop
                pthread_cond_signal(&ctx->cond);
                // Wait
                pthread_cond_wait(&ctx->cond, &ctx->mutex);
            } else if (ret < 0)
                goto out;
    
            pthread_mutex_unlock(&ctx->lock);
    
            if (!ret) {
                do_something(frame);
            }
        }
    
    out:
            pthread_mutex_unlock(&ctx->lock)
        return ret;
    }
    

    It isn’t as neat as having all this abstracted away, but is mostly workable.

    Encoding Examples

    Simple encoding loop

    Some compatibility with the old API can be achieved using something along the lines of:

    int encode(AVCodecContext *avctx, AVPacket *pkt, int *got_packet, AVFrame *frame)
    {
        int ret;
    
        *got_packet = 0;
    
        ret = avcodec_send_frame(avctx, frame);
        if (ret < 0)
            return ret;
    
        ret = avcodec_receive_packet(avctx, pkt);
        if (!ret)
            *got_packet = 1;
        if (ret == AVERROR(EAGAIN))
            return 0;
    
        return ret;
    }
    

    Callback approach

    Since for each input multiple output could be produced, would be better to loop over the output as soon as possible.

    // return 0 on success, negative on error
    typedef int (*process_packet_cb)(void *ctx, AVPacket *pkt);
    
    int encode(AVCodecContext *avctx, AVFrame *frame,
               process_packet_cb cb, void *priv)
    {
        AVPacket *pkt = av_packet_alloc();
        int ret;
    
        ret = avcodec_send_frame(avctx, frame);
        if (ret < 0)
            goto out;
    
        while (!ret) {
            ret = avcodec_receive_packet(avctx, pkt);
            if (!ret)
                ret = cb(priv, pkt);
        }
    
    out:
        av_packet_free(&pkt);
        if (ret == AVERROR(EAGAIN))
            return 0;
        return ret;
    }
    

    The I/O should happen in a different thread when possible so the callback should just enqueue the packets.

    Coming Next

    This post is long enough so the next one might involve converting a codec to the new API.

    Monday, 21 March

    Last weekend, after few months of work, the new bitstream filter API eventually landed.

    Bitstream filters

    In Libav is possible to manipulate raw and encoded data in many ways, the most common being

    • Demuxing: extracting single data packets and their timing information
    • Decoding: converting the compressed data packets in raw video or audio frames
    • Encoding: converting the raw multimedia information in a compressed form
    • Muxing: store the compressed information along timing information and additional information.

    Bitstream filtering is somehow less considered even if the are widely used under the hood to demux and mux many widely used formats.

    It could be consider an optional final demuxing or muxing step since it works on encoded data and its main purpose is to reformat the data so it can be accepted by decoders consuming only a specific serialization of the many supported (e.g. the HEVC QSV decoder) or it can be correctly muxed in a container format that stores only a specific kind.

    In Libav this kind of reformatting happens normally automatically with the annoying exception of MPEGTS muxing.

    New API

    The new API is modeled against the pull/push paradigm I described for AVCodec before, it works on AVPackets and has the following concrete implementation:

    // Query
    const AVBitStreamFilter *av_bsf_next(void **opaque);
    const AVBitStreamFilter *av_bsf_get_by_name(const char *name);
    
    // Setup
    int av_bsf_alloc(const AVBitStreamFilter *filter, AVBSFContext **ctx);
    int av_bsf_init(AVBSFContext *ctx);
    
    // Usage
    int av_bsf_send_packet(AVBSFContext *ctx, AVPacket *pkt);
    int av_bsf_receive_packet(AVBSFContext *ctx, AVPacket *pkt);
    
    // Cleanup
    void av_bsf_free(AVBSFContext **ctx);
    

    In order to use a bsf you need to:

    • Look up its definition AVBitStreamFilter using a query function.
    • Set up a specific context AVBSFContext, by allocating, configuring and then initializing it.
    • Feed the input using av_bsf_send_packet function and get the processed output once it is ready using av_bsf_receive_packet.
    • Once you are done av_bsf_free cleans up the memory used for the context and the internal buffers.

    Query

    You can enumerate the available filters

    void *state = NULL;
    
    const AVBitStreamFilter *bsf;
    
    while ((bsf = av_bsf_next(&state)) {
        av_log(NULL, AV_LOG_INFO, "%s\n", bsf->name);
    }
    

    or directly pick the one you need by name:

    const AVBitStreamFilter *bsf = av_bsf_get_by_name("hevc_mp4toannexb");
    

    Setup

    A bsf may use some codec parameters and time_base and provide updated ones.

    AVBSFContext *ctx;
    
    ret = av_bsf_alloc(bsf, &ctx);
    if (ret < 0)
        return ret;
    
    ret = avcodec_parameters_copy(ctx->par_in, in->codecpar);
    if (ret < 0)
        goto fail;
    
    ctx->time_base_in = in->time_base;
    
    ret = av_bsf_init(ctx);
    if (ret < 0)
        goto fail;
    
    ret = avcodec_parameters_copy(out->codecpar, ctx->par_out);
    if (ret < 0)
        goto fail;
    
    out->time_base = ctx->time_base_out;
    

    Usage

    Multiple AVPackets may be consumed before an AVPacket is emitted or multiple AVPackets may be produced out of a single input one.

    AVPacket *pkt;
    
    while (got_new_packet(&pkt)) {
        ret = av_bsf_send_packet(ctx, pkt);
        if (ret < 0)
            goto fail;
    
        while ((ret = av_bsf_receive_packet(ctx, pkt)) == 0) {
            yield_packet(pkt);
        }
    
        if (ret == AVERROR(EAGAIN)
            continue;
        IF (ret == AVERROR_EOF)
            goto end;
        if (ret < 0)
            goto fail;
    }
    
    // Flush
    ret = av_bsf_send_packet(ctx, NULL);
    if (ret < 0)
        goto fail;
    
    while ((ret = av_bsf_receive_packet(ctx, pkt)) == 0) {
        yield_packet(pkt);
    }
    
    if (ret != AVERROR_EOF)
        goto fail;
    

    In order to signal the end of stream a NULL pkt should be fed to send_packet.

    Cleanup

    The cleanup function matches the av_freep signature so it takes the address of the AVBSFContext pointer.

        av_bsf_free(&ctx);
    

    All the memory is freed and the ctx pointer is set to NULL.

    Coming Soon

    Hopefully next I’ll document the new HWAccel layer that already landed and some other API that I discussed with Kostya before.
    Sadly my blog-time (and spare time in general) shrunk a lot in the past months so he rightfully blamed me a lot.

    Saturday, 05 March

    Sometimes it's very useful to print out how some parameters changes during the program execution.

    When writing the new version of some piece of code one usually needs to compare it with the old one to be sure it behaves the same in every case. Especially the corner cases might be tricky and I spent a lot of time with them while my code worked fine in general.

    For example when I was working on my ASF demuxer, I was happy there's an old demuxer and I can compare their behaviour. When debugging the ASF, I wanted to know the state of I/O context. In that time lu_zero (who was mentoring me) created a set of macros which printed logs for every I/O function (here). For example there's the macro for avio_seek() function (which is equivalent to fseek()).

      #define avio_seek(s, o, w) ({ \
    int64_t _ret = avio_seek(s, o, w); \
    int64_t _pos = avio_tell(s); \
    av_log(NULL, AV_LOG_VERBOSE|AV_LOG_C(154), "0x%08"PRIx64" - %s:%d seek %p %"PRId64" %d -> %"PRId64"\n", \
    _pos, __FUNCTION__, __LINE__, s, o, w, _ret); \
    _ret; \
    })
     When such a macro was present in my demuxer, for all the calls of avio_seek the following information was printed
    • _pos = avio_tell(s): the offset in the demuxed file
    • __FUNCTION__ : preprocessor define that contains the name of the function being compiled to know which function called avio_seek
    • __LINE__ : preprocessor define that contains the line number of the original source file that is being compiled to know from what line avio_seek was called from
    • s, o, w : the values of the parameters avio_seek was called with
    • _ret: the avio_seek return value
    • __FILE__: preprocessor define contains the name of the file being compiled (this one was not used in the example but might be useful when one needs more complex log). 
    Parentheses are used around the define body because such a construct may appear as an expression in GNU C. There's _ret; as the last statement in this macro because its value serves as the value of the entire construct. If the last _ret; would be omitted in my example, the return value of this macro expression would be printf return value. The underscores in _ret or _pos variables are used to be sure it does not shadow some other variables with the same names.
    Working with a log created by a set of macros similar to this example might be more effective than debugging with gdb in some cases.

    Many thanks to lu_zero for teaching me about it. The support from the more experienced developers is the thing I really love about Libav.

    Wednesday, 09 December

    I made a split my complex dcadec bit-exact patch (https://patches.libav.org/patch/59013/) to the several parts. The first part which contains changing the dcadec core to work with integer coefficients instead of converting the coefficients to floats just after reading them was sent to the mailing list (https://patches.libav.org/patch/59141/). Such a change was expected to slow down the decoding process. Therefore I made some measurements to examine how much slower decoding is after my patch.
     I decoded this sample: samples.libav.org/A-codecs/DTS/dts/dtswavsample16.wav 10 times and measured the user time between invocation and termination with the "time" command:

     time ./avconv -f dts -i dtswavsample16.wav -f null -c pcm_f32le null, 
    counted the average real time of avconv run and repeated everything for the master branch. The duration of the dtswavsample16.wav is ~4 mins and I wanted to look at the slow down for the longer files. Hence I used relatively new loop option for the avconv (http://sasshkas.blogspot.cz/2015/08/the-loop-option-for-avconv.html) to create ~24 mins long file from the initial file by looping it 6x with
     ./avconv -loop 6 -i dtswavsample16.wav -c copy dts_long.wav. 
    I decoded this longer dts file 10x again for both new integer and old float coefficients core and counted the averages.
    According to my results the integer patch causes ~20% slow down. The question is if this is still acceptable. I see 2 options here
    • To consider the slowdown acceptable and to try to make some speedups like SIMDifying VQ decoding and using inline asm for 64-bit math.
    • Or alternatively both int and float modes can be kept for different decoding modes but this might make the code too hairy.

    Opinions and suggestions are welcome.

    Tuesday, 08 December


    When playing a multimedia file, one usually wants to seek to reach different parts of a file. Mostly, containers allows this feature but it might be problem for streamed files.
    Withing the libavformat, seeking is performed with function (inside the demuxer) called read_seek. This function tries to find matching timestamp  for the requested position (offset) in the played file.
    There are 2 ways to seek through the file. One of them is when file contains some kind of index, which matches positions with appropriate timestamps. In this case index entries are created by calling av_add_index_entry. If index entries are present, av_index_search_timestamp, which is called inside the read_seek, looks for the closest timestamp for the requested position. When the file does not provide such entries, one can look for the requested position with ff_seek_frame_binary. For doing so, read_timestamp function has to be created inside the demuxer.
    Read_timestamp takes required position and stream index and then tries to find offset of the beginning of the closest packet wich is key frame with matching stream index. While doing this, read_timestamp reads timestamps for all the packets after given position and creates index entries. When the key frame with matching stream index is found, read_timestamp upgrades required position and returns timestamp matching to it. 

    I was told to test my ASF demuxer with the zzuf utility. Zuff is a fuzzer, it changes random bits in the program's input which simulates damaged file or unexpected data.
    For testing ASF's behaviour I want to feed avconv with some corrupted wmv files and see what will happen. Because I want to fuzz in several different ways I want to vary seed (the initial value of zzuf’s random number generator). I'll do this with command:

    while true; SEED=$RANDOM; for file *wmv; do zzuf -M -l -r 0.00001 -q -U 60 -s $SEED ./avconv -i "file" -f null -c copy - || echo $SEED $file >> fuzz; done; done;.

    I got the file fuzz which is the list of seed paired with filename.  Now I need to use zzuf for creating damaged files to check the problem with valgrind. I'll use the list to determine the seed which caused some crash for creating my damaged file:

    zzuf -M -l -r 0.00001 -q -U 60 -s myseed < somefile.wmv | cat out.asf.

    Now I'll just use valgrind to find out what happened:

    valgrind ./avconv -i out.asf -f null -c copy -.


    I tried to test the ASF demuxer with different tricky samples and with FATE
    and the demuxer behaved well but testing with zzuf detected several new crashes. Mainly it was insane values sizes and it was easy to fix them by adding some more checks. Zzuf is a great thing for testing.

    Pelhřimov is small but very nice town in Czech Republic approximately 120 km from the capital Prague and I decided to organize a small but nice Libav sprint in it.
    The participants and the topics were:

    • Luca Barbato -
      • AVScale, especially documenting it
      • HW accelerated encoders
      • async API
    • Anton Khirnov - The Evil Plan II and fixing H.264 decoder for it
    • Kostya Shishkov - trolling motivating the others
    • Alexandra Hájková (me) - dcadec decoder, mainly testing the new patch
    • everyone - eating chocolate

    I finished and sent my dcadec "Integer core decoder" patch (which transforms the dcadec core decoder to work with integers) during the sprint. After the discussion and some hints from the others I tested my patch better and found out some interesting things:
    • It seems XLL output really is lossless when using my patch.
    • But LFE channel was broken - this was fixed during the sprint.
    • While lossless or force_fixed output looks fine my patch breaks a float (lossy) output a little bit - it feels the same for my ears but looking at the output in audacity and comparing it with "before the integer patch" output shows something is wrong there.
    • I discovered for myself an avconv option called channelsplit ( https://libav.org/documentation/avconv.html#channelsplit) that splits the file into per-channel files wich was very useful for comparing the output with some reference decoder with the other channel order (for example with https://github.com/foo86/dcadec).
    My post sprint dcadec plans are:
    • Fix the lossy output issue.
    • Fix detection of the extensions and add the options for disabling them.
    • Rewrite the dca_decode_frame to handle all the extensions and working with the new options for them more systematicly.

    I decided to improve the Libav DTS decoder - dcadec. Here I want to explain what are its problems now and what I would like to do about them.
    DTS encoded audio stream consists of core audio and may contain extended audio. Dcadec supports XCH and XLL extensions but X96, XXCH and XBR extensions are waiting to be implemented - I'd like to implement them later.
    For the DTS lossless extension - XLL, the decoded output audio should be a bit for bit accurate reproduction of the encoded input. However there are some problems:

    • The main problem is that the core decoder converts integer coefficients read from the bitstream to floats just after reading them (along with dequantization). All other steps of the audio reconstruction are done with floats and the output can not be the bitexact reproduction of the input so it is not lossless.
    When the coefficients are read from the bitstream the core decoder does the following:
    dequantization (with int -> float conversion)

    inverse ADPCM (when needed)

    VQ decoding (when needed)

    filtering: QMF, LFE, downmixing (when needed)

    float output.
    I'm working now on modifying the core to work with integer coefficients and then convert them to floats before QMF filtering for lossy output but use bitexact QMF (intermediate LFE coefficients should be always integers and I think it's not correct in the current version) for lossless output. Also I added an option called -force_fixed to force fixed-point reconstruction for any kind of input.
    • Another problem is XLL extension presence detection. During the testing I found out that XLL extension is not detected sometimes and the core audio only is decoded in this case. I want to fix this issue as well.

    Saturday, 21 November

    This is a sort of short list of checklists and few ramblings in the wake of Fosdem’s Code of Conduct discussions and the not exactly welcoming statements about how to perceive a Code of Conduct such as this one.

    Code of Conduct and OpenSource projects

    A Code of Conduct is generally considered a mean to get rid of problematic people (and thus avoid toxic situations). I prefer consider it a mean to welcome people and provide good guidelines to newcomers.

    Communities without a code of conduct tend to reject the idea of having one, thinking that it is only needed to solve the above mentioned issue and adding more bureaucracy would just actually give more leeway to macchiavellian ploys.

    Sadly, no matter how good the environment is, it takes just few poisonous people to get in an unbearable situation and a you just need one in few selected cases.

    If you consider the CoC a shackle or a stick to beat “bad guys” so you do not need it until you see a bad guy, that is naive and utterly wrong: you will end up writing something that excludes people due a, quite understandable but wrong, knee-jerk reaction.

    A Code of Conduct should do exactly the opposite, it should embrace people and make easier joining and fit in. It should be the social equivalent of the developer handbook or the coding style guidelines.

    As everybody can make a little effort and make sure to send code with spaces between operators everybody can make an effort and not use colorful language. Likewise as people would be more happy to contribute if the codebase they are hacking on is readable so they are more confident in joining the community if the environment is pleasant.

    Making an useful Code of Conduct

    The Code of Conduct should be a guideline for people that have no idea what the expected behavior is.
    It should be written thinking on how to help people get along not on how to punish who does not.

    • It should be short. It is pointless to enumerate ALL the possible way to make people uncomfortable, you are bound to miss few.
    • It should be understanding and inclusive. Always assume cultural biases and not ill will.
    • It should be enforced. It gets quite depressing when you have a 100+ lines code of conduct but then nobody cares about it and nobody really enforces it. And I’m not talking about having specifically designated people to enforce it. Your WHOLE community should agree on what is an acceptable behavior and act accordingly on breaches.

    People joining the community should consider the Code of Conduct first as a request (and not a demand) to make an effort to get along with the others.

    Pitfalls

    Since I saw quite some long and convoluted wall of text being suggested as THE CODE OF CONDUCT everybody MUST ABIDE TO, here some suggestion on what NOT do.

    • It should not be a political statement: this is a strong cultural bias that would make potential contributors just stay away. No matter how good and great you think your ideas are, those are unrelated to a project that should gather all the people that enjoy writing code in their spare time. The Open Source movement is already an ideology in itself, overloading it with more is just a recipe for a disaster.
    • Do not try to make a long list of definitions, you just dilute the content and give even more ammo to lawyer-type arguers.
    • Do not think much about making draconian punishments, this is a community on internet, even nowadays nobody really knows if you are actually a dog or not, you cannot really enforce anything if the other party really wants to be a pest.

    Good examples

    Some CoC I consider good are obviously the ones used in the communities I belong to, Gentoo and Libav, they are really short and to the point.

    Enforcing

    As I said before no matter how well written a code of conduct is, the only way to really make it useful is if the community as whole helps new (and not so new) people to get along.

    The rule of thumb “if anybody feels uncomfortable in a non-technical discussion, once they say they are, drop it immediately”, is ok as long:

    • The person uncomfortable speaks up. If you are shy you might ask somebody else to speak up for you, but do not be quiet when it happens and then fill a complaint much later, that is NOT OK.
    • The rule is not abused to derail technical discussions. See my post about reviews to at least avoid this pitfall.
    • People agree to drop at least some of their cultural biases, otherwise it would end up like walking on eggshells every moment.

    Letting situations going unchecked is probably the main issue, newcomers can think it is OK to behave in a certain way if people are behaving such way and nobody stops that, again, not just specific enforcers of some kind, everybody should behave and tell clearly to those not behaving that they are problematic.

    Gentoo is a big community, so gets problematic having a swift reaction: lots of people prefer not to speak up when something happens, so people unwillingly causing the problem are not made aware immediately.

    Libav is a much smaller community and in general nobody has qualms in saying “please stop” (that is also partially due how the community evolved).

    Hopefully this post would help avoid making some mistakes and help people getting along better.

    Sunday, 08 November

    This mini-post spurred from this bug.

    AVFrame and AVCodecContext

    In Libav there are a number of patterns shared across most of the components.
    Does not matter if it models a codec, a demuxer or a resampler: You interact with it using a Context and you get data in or out of the module using some kind of Abstraction that wraps data and useful information such as the timestamp. Today’s post is about AVFrames and AVCodecContext.

    AVFrame

    The most used abstraction in Libav by far is the AVFrame. It wraps some kind of raw data that can be produced by decoders and fed to encoders, passed through filters, scalers and resamplers.

    It is quite flexible and contains the data and all the information to understand it e.g.:

    • format: Used to describe either the pixel format for video and the sample format for audio.
    • width and height: The dimension of a video frame.
    • channel_layout, nb_samples and sample_rate for audio frames.

    AVCodecContext

    This context contains all the information useful to describe a codec and to configure an encoder or a decoder (the generic, common features, there are private options for specific features).

    Being shared with encoder, decoder and (until Anton’s plan to avoid it is deployed) container streams this context is fairly large and a good deal of its fields are a little confusing since they seem to replicate what is present in the AVFrame or because they aren’t marked as write-only since they might be read in few situation.

    In the bug mentioned channel_layout was the confusing one but also width and height caused problems to people thinking the value of those fields in the AVCodecContext would represent what is in the AVFrame (then you’d wonder why you should have them in two different places…).

    As a rule of thumb everything that is set in a context is either the starting configuration and bound to change in the future.

    Video decoders can reconfigure themselves and output video frames with completely different geometries, audio decoders can report a completely different number of channels or variations in their layout and so on.

    Some encoders are able to reconfigure on the fly as well, but usually with more strict constraints.

    Why their information is not the same

    The fields in the AVCodecContext are used internally and updated as needed by the decoder. The decoder can be multithreaded so the AVFrame you are getting from one of the avcodec_decode_something() functions is not the last frame decoded.

    Do not expect any of the fields with names similar to the ones provided by AVFrame to stay immutable or to match the values provided by the AVFrame.

    Common pitfalls

    Allocating video surfaces

    Some quite common mistake is to use the AVCodecContext coded_width and coded_height to allocate the surfaces to present the decoded frames.

    As said the frame geometry can change mid-stream, so if you do that best case you have some lovely green surrounding your picture, worst case you have a bad crash.

    I suggest to always check that the AVFrame dimensions fit and be ready to reconfigure your video out when that happens.

    Resampling audio

    If you are using a current version of Libav you have avresample_convert_frame() doing most of the work for you, if you are not you need to check that format channel_layout and sample_rate do not change and manually reconfigure.

    Rescaling video

    Similarly you can misconfigure swscale and you should check manually that format, width and height and reconfigure as well. The AVScale draft API on provides an avscale_process_frame().

    In closing

    Be extra careful, think twice and beware of the examples you might find on internet, they might work until they wont.

    Friday, 06 November

    This spurred from some events happening in Gentoo, since with the move to git we eventually have more reviews and obviously comments over patches can be acceptable (and accepted) depending on a number of factors.

    This short post is about communicating effectively.

    When reviewing patches

    No point in pepper coating

    Do not disparage code or, even worse, people. There is no point in being insulting, you add noise to the signal:

    You are a moron! This is shit has no place here, do not do again something this stupid.

    This is not OK: most people will focus on the insult and the technical argument will be totally lost.

    Keep in mind that you want people doing stuff for the project not run away crying.

    No point in sugar coating

    Do not downplay stupid mistakes that would crash your application (or wipe an operating system) because you think it would hurt the feelings of the contributor.

        rm -fR /usr /local/foo
    

    Is as silly as you like but the impact is HUGE.

    This is a tiny mistake, you should not do that again.

    No, it isn’t tiny it is quite a problem.

    Mistakes happen, the review is there to avoid them hitting people, but a modicum of care is needed:
    wasting other people’s time is still bad.

    Point the mistake directly by quoting the line

    And use at most 2-3 lines to explain why it is a problem.
    If you can’t better if you fix that part yourself or move the discussion on a more direct media e.g. IRC.

    Be specific

    This kind of change is not portable, obscures the code and does not fix the overflow issue at hand:
    The expression as whole could still overflow.

    Hopefully even the most busy person juggling over 5 different tasks will get it.

    Be direct

    Do not suggest the use of those non-portable functions again anyway.

    No room for interpretation, do not do that.

    Avoid clashes

    If you and another reviewer disagree, move the discussion on another media, there is NO point in spamming
    the review system with countless comments.

    When receiving reviews (or waiting for them)

    Everybody makes mistakes

    YOU included, if the reviewer (or more than one) tells you that your changes are not right, there are good odds you are wrong.

    Conversely, the reviewer can make mistakes. Usually is better to move away from the review system and discuss over emails or IRC.

    Be nice

    There is no point in being confrontational. If you think the reviewer is making a mistake, politely point it out.

    If the reviewer is not nice, do not use the same tone to fit in. Even more if you do not like that kind of tone to begin with.

    Wait before answering

    Do not update your patch or write a reply as soon as you get a notification of a review, more changes might be needed and maybe other reviewers have additional opinions.

    Be patient

    If a patch is unanswered, ping it maybe once a week, possibly rebasing it if the world changed meanwhile.

    Keep in mind that most of your interaction is with other people volunteering their free time and not getting anything out of it as well, sometimes the real-life takes priority =)

    Wednesday, 21 October

    You might be subtle like this or just work on your stuff like that but then nobody will know that you are the one that did something (and praise somebody else completely unrelated for your stuff, e.g. Anton not being praised much for the HEVC threaded decoding, the huge work on ref-counted AVFrame and many other things).

    Blogging is boring

    Once you wrote something in code talking about it gets sort of boring, the code is there, it works and maybe you spent enough time on the mailing list and irc discussing about it that once it is done you wouldn’t want to think about it for at least a week.

    The people at xiph got it right and they wrote awesome articles about what they are doing.

    Blogging is important

    JB got it right by writing posts about what happened every week. Now journalist can pick from there what’s cool and coming from VLC and not have to try to extract useful information from git log, scattered mailing lists and conversations on irc.
    I’m not sure I’ll have the time to do the same, but surely I’ll prod at least Alexandra and the others to write more.

    Thursday, 15 October

    In Libav we try to clean up the API and make it more regular, this is one of the possibly many articles I write about APIs, this time about deprecating some relic from the past and why we are doing it.

    AVPicture

    This struct used to store image data using data pointers and linesizes. It comes from the far past and it looks like this:

    typedef struct AVPicture {
        uint8_t *data[AV_NUM_DATA_POINTERS];
        int linesize[AV_NUM_DATA_POINTERS];
    } AVPicture;
    

    Once the AVFrame was introduced it was made so it would alias to it and for some time the two structures were actually defined sharing the commond initial fields through a macro.

    The AVFrame then evolved to store both audio and image data, to use AVBuffer to not have to do needless copies and to provide more useful information (e.g. the actual data format), now it looks like:

    typedef struct AVFrame {
        uint8_t *data[AV_NUM_DATA_POINTERS];
        int linesize[AV_NUM_DATA_POINTERS];
    
        uint8_t **extended_data;
    
        int width, height;
    
        int nb_samples;
    
        int format;
    
        int key_frame;
    
        enum AVPictureType pict_type;
    
        AVRational sample_aspect_ratio;
    
        int64_t pts;
    
        ...
    } AVFrame;
    

    The image-data manipulation functions moved to the av_image namespace and use directly data and linesize pointers, while the equivalent avpicture became a wrapper over them.

    int avpicture_fill(AVPicture *picture, uint8_t *ptr,
                       enum AVPixelFormat pix_fmt, int width, int height)
    {
        return av_image_fill_arrays(picture->data, picture->linesize,
                                    ptr, pix_fmt, width, height, 1);
    }
    
    int avpicture_layout(const AVPicture* src, enum AVPixelFormat pix_fmt,
                         int width, int height,
                         unsigned char *dest, int dest_size)
    {
        return av_image_copy_to_buffer(dest, dest_size,
                                       src->data, src->linesize,
                                       pix_fmt, width, height, 1);
    }
    
    ...
    

    It is also used in the subtitle abstraction:

    typedef struct AVSubtitleRect {
        int x, y, w, h;
        int nb_colors;
    
        AVPicture pict;
        enum AVSubtitleType type;
    
        char *text;
        char *ass;
        int flags;
    } AVSubtitleRect;
    

    And to crudely pass AVFrame from the decoder level to the muxer level, for certain rawvideo muxers by doing something such as:

        pkt.data   = (uint8_t *)frame;
        pkt.size   =  sizeof(AVPicture);
    

    AVPicture problems

    In the codebase its remaining usage is dubious at best:

    AVFrame as AVPicture

    In some codecs the AVFrame produced or consumed are casted as AVPicture and passed to avpicture functions instead
    of directly use the av_image functions.

    AVSubtitleRect

    For the subtitle codecs, accessing the Rect data requires a pointless indirection, usually something like:

        wrap3 = rect->pict.linesize[0];
        p = rect->pict.data[0];
        pal = (const uint32_t *)rect->pict.data[1];  /* Now in YCrCb! */
    

    AVFMT_RAWPICTURE

    Copying memory from a buffer to another when can be avoided is consider a major sin (“memcpy is murder”) since it is a costly operation in itself and usually it invalidates the cache if we are talking about large buffers.

    Certain muxers for rawvideo, try to spare a memcpy and thus avoid a “murder” by not copying the AVFrame data to the AVPacket.

    The idea in itself is simple enough, store the AVFrame pointer as if it would point a flat array, consider the data size as the AVPicture size and hope that the data pointed by the AVFrame remains valid while the AVPacket is consumed.

    Simple and faulty: with the AVFrame ref-counted API codecs may use a Pool of AVFrames and reuse them.
    It can lead to surprising results because the buffer gets updated before the AVPacket is actually written.
    If the frame referenced changes dimensions or gets deallocated it could even lead to crashes.

    Definitely not a great idea.

    Solutions

    Vittorio, wm4 and I worked together to fix the problems. Radically.

    AVFrame as AVPicture

    The av_image functions are now used when needed.
    Some pointless copies got replaced by av_frame_ref, leading to less memory usage and simpler code.

    No AVPictures are left in the video codecs.

    AVSubtitle

    The AVSubtitleRect is updated to have simple data and linesize fields and each codec is updated to keep the AVPicture and the new fields in sync during the deprecation window.

    The code is already a little easier to follow now.

    AVFMT_RAWPICTURE

    Just dropping the “feature” would be a problem since those muxers are widely used in FATE and the time the additional copy takes adds up to quite a lot. Your regression test must be as quick as possible.

    I wrote a safer wrapper pseudo-codec that leverages the fact that both AVPacket and AVFrame use a ref-counted system:

    • The AVPacket takes the AVFrame and increases its ref-count by 1.
    • The AVFrame is then stored in the data field and wrapped in a custom AVBuffer.
    • That AVBuffer destructor callback unrefs the frame.

    This way the AVFrame data won’t change until the AVPacket gets destroyed.

    Goodbye AVPicture

    With the release 14 the AVPicture struct will be removed completely from Libav, people using it outside Libav should consider moving to use full AVFrame (and leverage the additional feature it provides) or the av_image functions directly.

    Friday, 02 October

    During the VDD we had lots of discussions and I enjoyed reviewing the initial NihAV implementation. Kostya already wrote some more about the decoupled API that I described at high level here.

    This article is about some possible implementation details, at least another will follow.

    The new API requires some additional data structures, mainly something to keep the data that is being consumed/produced, additional implementation-callbacks in AVCodec and possibly a mean to skip the queuing up completely.

    Data Structures

    AVPacketQueue and AVFrameQueue

    In the previous post I considered as given some kind of Queue.

    Ideally the API for it could be really simple:

    typedef struct AVPacketQueue;
    
    AVPacketQueue *av_packet_queue_alloc(int size);
    int av_packet_queue_put(AVPacketQueue *q, AVPacket *pkt);
    int av_packet_queue_get(AVPacketQueue *q, AVPacket *pkt);
    int av_packet_queue_size(AVPacketQueue *q);
    void av_packet_queue_free(AVPacketQueue **q);
    
    typedef struct AVFrameQueue;
    
    AVFrameQueue *av_frame_queue_alloc(int size);
    int av_frame_queue_put(AVFrameQueue *q, AVPacket *pkt);
    int av_frame_queue_get(AVFrameQueue *q, AVPacket *pkt);
    int av_frame_queue_size(AVFrameQueue *q);
    void av_frame_queue_free(AVFrameQueue **q);
    

    Internally it leverages the ref-counted API (av_packet_move_ref and av_frame_move_ref) and any data structure that could fit the queue-usage. It will be used in a multi-thread scenario so a form of Lock has to be fit into it.

    We have already something specific for AVPlay, using a simple Linked List and a FIFO for some other components that have a near-constant maximum number of items (e.g. avconv, NVENC, QSV).

    Possibly also a Tree could be used to implement something such as av_packet_queue_insert_by_pts and have some forms of reordering happen on the fly. I’m not a fan of it, but I’m sure someone will come up with the idea..

    The Queues are part of AVCodecContext.

    typedef struct AVCodecContext {
        // ...
    
        AVPacketQueue *packet_queue;
        AVFrameQueue *frame_queue;
    
        // ...
    } AVCodecContext;
    

    Implementation Callbacks

    In Libav the AVCodec struct describes some specific codec features (such as the supported framerates) and hold the actual codec implementation through callbacks such as init, decode/encode2, flush and close.
    The new model obviously requires additional callbacks.

    Once the data is in a queue it is ready to be processed, the actual decoding or encoding can happen in multiple places, for example:

    • In avcodec_*_push or avcodec_*_pull, once there is enough data. In that case the remaining functions are glorified proxies for the matching queue function.
    • somewhere else such as a separate thread that is started on avcodec_open or the first avcodec_decode_push and is eventually stopped once the context related to it is freed by avcodec_close. This is what happens under the hood when you have certain hardware acceleration.

    Common

    typedef struct AVCodec {
        // ... previous fields
        int (*need_data)(AVCodecContext *avctx);
        int (*has_data)(AVCodecContext *avctx);
        // ...
    } AVCodec;
    

    Those are used by both the encoder and decoder, some implementations such as QSV have functions that can be used to probe the internal state in this regard.

    Decoding

    typedef struct AVCodec {
        // ... previous fields
        int (*decode_push)(AVCodecContext *avctx, AVPacket *packet);
        int (*decode_pull)(AVCodecContext *avctx, AVFrame *frame);
        // ...
    } AVCodec;
    

    Those two functions can take a portion of the work the current decode function does, for example:
    – the initial parsing and dispatch to a worker thread can happen in the _push.
    – reordering and blocking until there is data to output can happen on _pull.

    Assuming the reordering does not happen outside the pull callback in some generic code.

    Encoding

    typedef struct AVCodec {
        // ... previous fields
        int (*encode_push)(AVCodecContext *avctx, AVFrame *frame);
        int (*encode_pull)(AVCodecContext *avctx, AVPacket *packet);
    } AVCodec;
    

    As per the Decoding callbacks, encode2 workload is split. the _push function might just keep queuing up until there are enough frames to complete the initial the analysis, while, for single thread encoding, the rest of the work happens at the _pull.

    Yielding data directly

    So far the API mainly keeps some queue filled and let some magic happen under the hood, let see some usage examples first:

    Simple Usage

    Let’s expand the last example from the previous post: register callbacks to pull/push the data and have some simple loops.

    Decoding

    typedef struct DecodeCallback {
        int (*pull_packet)(void *priv, AVPacket *pkt);
        int (*push_frame)(void *priv, AVFrame *frame);
        void *priv_data_pull, *priv_data_push;
    } DecodeCallback;
    

    Two pointers since you pull from a demuxer+parser and you push to a splitter+muxer.

    int decode_loop(AVCodecContext *avctx, DecodeCallback *cb)
    {
        AVPacket *pkt  = av_packet_alloc();
        AVFrame *frame = av_frame_alloc();
        int ret;
        while ((ret = avcodec_decode_need_data(avctx)) > 0) {
            ret = cb->pull_packet(cb->priv_data_pull, pkt);
            if (ret < 0)
                goto end;
            ret = avcodec_decode_push(avctx, pkt);
            if (ret < 0)
                goto end;
        }
        while ((ret = avcodec_decode_have_data(avctx)) > 0) {
            ret = avcodec_decode_pull(avctx, frame);
            if (ret < 0)
                goto end;
            ret = cb->push_frame(cb->priv_data_push, frame);
            if (ret < 0)
                goto end;
        }
    
    end:
        av_frame_free(&frame);
        av_packet_free(&pkt);
        return ret;
    }
    

    Encoding

    For encoding something quite similar can be done:

    typedef struct EncodeCallback {
        int (*pull_frame)(void *priv, AVFrame *frame);
        int (*push_packet)(void *priv, AVPacket *packet);
        void *priv_data_push, *priv_data_pull;
    } EncodeCallback;
    

    The loop is exactly the same beside the data types swapped.

    int encode_loop(AVCodecContext *avctx, EncodeCallback *cb)
    {
        AVPacket *pkt  = av_packet_alloc();
        AVFrame *frame = av_frame_alloc();
        int ret;
        while ((ret = avcodec_encode_need_data(avctx)) > 0) {
            ret = cb->pull_frame(cb->priv_data_pull, frame);
            if (ret < 0)
                goto end;
            ret = avcodec_encode_push(avctx, frame);
            if (ret < 0)
                goto end;
        }
        while ((ret = avcodec_encode_have_data(avctx)) > 0) {
            ret = avcodec_encode_pull(avctx, pkt);
            if (ret < 0)
                goto end;
            ret = cb->push_packet(cb->priv_data_push, pkt);
            if (ret < 0)
                goto end;
        }
    
    end:
        av_frame_free(&frame);
        av_packet_free(&pkt);
        return ret;
    }
    

    Transcoding

    Transcoding, the naive way, could be something such as

    int transcode(AVFormatContext *mux,
                  AVFormatContext *dem,
                  AVCodecContext *enc,
                  AVCodecContext *dec)
    {
        DecodeCallbacks dcb = {
            get_packet,
            av_frame_queue_put,
            dem, enc->frame_queue };
        EncodeCallbacks ecb = {
            av_frame_queue_get,
            push_packet,
            enc->frame_queue, mux };
        int ret = 0;
    
        while (ret > 0) {
            if ((ret = decode_loop(dec, &dcb)) > 0)
                ret = encode_loop(enc, &ecb);
        }
    }
    

    One loop feeds the other throught the queue. get_packet and push_packet are muxing and demuxing functions, they might end up being other two Queue functions once the AVFormat layer gets a similar overhaul.

    Advanced usage

    From the examples above you would notice that in some situation you would possibly do better,
    all the loops pull data from a queue push it immediately to another:

    • why not feeding right queue immediately once you have the data ready?
    • why not doing some processing before feeding the decoded data to the encoder, such as conver the pixel format?

    Here some additional structures and functions to enable advanced users:

    typedef struct AVFrameCallback {
        int (*yield)(void *priv, AVFrame *frame);
        void *priv_data;
    } AVFrameCallback;
    
    typedef struct AVPacketCallback {
        int (*yield)(void *priv, AVPacket *pkt);
        void *priv_data;
    } AVPacketCallback;
    
    typedef struct AVCodecContext {
    // ...
    
    AVFrameCallback *frame_cb;
    AVPacketCallback *packet_cb;
    
    // ...
    
    } AVCodecContext;
    
    int av_frame_yield(AVFrameCallback *cb, AVFrame *frame)
    {
        return cb->yield(cb->priv_data, frame);
    }
    
    int av_packet_yield(AVPacketCallback *cb, AVPacket *packet)
    {
        return cb->yield(cb->priv_data, packet);
    }
    

    Instead of using directly the Queue API, would be possible to use yield functions and give the user a mean to override them.

    Some API sugar could be something along the lines of this:

    int avcodec_decode_yield(AVCodecContext *avctx, AVFrame *frame)
    {
        int ret;
    
        if (avctx->frame_cb) {
            ret = av_frame_yield(avctx->frame_cb, frame);
        } else {
            ret = av_frame_queue_put(avctx->frame_queue, frame);
        }
    
        return ret;
    }
    

    Whenever a frame (or a packet) is ready it could be passed immediately to another function, depending on your threading model and cpu it might be much more efficient skipping some enqueuing+dequeuing steps such as feeding directly some user-queue that uses different datatypes.

    This approach might work well even internally to insert bitstream reformatters after the encoding or before the decoding.

    Open problems

    The callback system is quite powerful but you have at least a couple of issues to take care of:
    – Error reporting: when something goes wrong how to notify what broke?
    – Error recovery: how much the user have to undo to fallback properly?

    Probably this part should be kept for later, since there is already a huge amount of work.

    What’s next

    Muxing and demuxing

    Ideally the container format layer should receive the same kind of overhaul, I’m not even halfway documenting what should
    change, but from this blog post you might guess the kind of changes. Spoiler: The I/O layer gets spun in a separate library.

    Proof of Concept

    Soon^WNot so late I’ll complete a POC out of this and possibly hack avplay so that either it uses QSV or videotoolbox as test-case (depending on which operating system I’m playing with when I start), probably I’ll see which are the limitations in this approach soon.

    If you like the ideas posted above or you want to discuss them more, you can join the Libav irc channel or mailing list to discuss and help.

    Thursday, 10 September

    This is a tiny introduction to Libav, the organization.

    Libav

    The project aims to provide useful tools, written in portable code that is readable, trustworthy and performant.

    Libav is an opensource organization focused on developing libraries and tools to decode, manipulate and encode multimedia content.

    Structure

    The project tries to be as non-hierarchical as possible. Every contributor must abide by a well defined set of rules, no matter which role.

    For decisions we strive to reach near-unanimous consensus. Discussions may happen on irc, mailing-list or in real life meetings.

    If possible, conflicts should be avoided and otherwise resolved.

    Join us!

    We are always looking for enthusiastic new contributors and will help you get started. Below you can find a number of possible ways to contribute. Please contact us.

    Roles

    Even if the project is non-hierarchical, it is possible to define specific roles within it. Roles do not really give additional power but additional responsibilities.

    Contributor

    Contributing to Libav makes you a Contributor!
    Anybody who reviews patches, writes patches, helps triaging bugs, writes documentation, helps people solve their problems, or keeps our infrastructure running is considered a contributor.

    It does not matter how little you contribute. Any help is welcome.

    On top of the standard great feats of contributing to an opensource project, special chocolate is always available during the events.

    Reviewer

    Many eyes might not make every bug shallow, but probably a second and a third pair might prevent some silly mistakes.

    A reviewer is supposed to read the new patches and prevent mistakes (silly, tiny or huge) to land in the master.

    Because of our workflow, spending time reading other people patches is quite common.

    People with specific expertise might get nagged to give their opinion more often than others, but everybody might spot something that looks wrong and probably is.

    Bugwrangler

    Checking that the bugs are fixed and ask for better reports is important.

    Bug wrangling involves making sure reported issues have all the needed information to start fixing the problem and checking if old issues are still valid or had been fixed already.

    Committer

    Nobody can push a patch to the master until it is reviewed, but somebody has to push it once it is.

    Committers are the people who push code to the main repository after it has been reviewed.

    Being a committer requires you to take newly submitted patches, make sure they work as expected either locally or pushing them through our continuous integration system and possibly fix minor issues like typos.

    Patches from a committer go through the normal review process as well.

    Infrastructure Administrator

    The regression test system. git repository, the samples collection, the website, the patch trackers, the wiki and the issue tracker are all managed on dedicated hardware.

    This infrastructure needs constant maintaining and improving.

    Most of comes from people devoting their time and (beside few exceptions) their own hardware, definitely this role requires a huge amount of dedication.

    Rules

    The project strives to provide a pleasant environment for everybody.

    Every contributor is considered a member of the team, regardless if they are a newcomer or a founder. Nobody has special rights or prerogatives.

    Well defined rules have been adopted since the founding of the project to ensure fairness.

    Code of Conduct

    A quite simple code of conduct is in place in our project.

    It boils down to respecting the other people and being pleasant to deal with.

    It is commonly enforced with a friendly warning, followed by the request to leave if the person is unable to behave and, then, eventual removal if anything else fails.

    Contribution workflow

    The project has a simple contribution workflow:

    • Every patch must be sent to the mailing-list
    • Every patch must get a review and an Ok before it lands in the master branch

    Code Quality

    We have plenty of documentation to make it easy for you to prepare patches.

    The reviewers usually help newcomers by reformatting the first patches and pointing and fixing common pitfalls.

    If some mistakes are not caught during the review, there are few additional means to prevent them from hitting a release.

    Post Scriptum

    This post tried to summarize the project and its structure as if the legends surrounding it do not exist and the project is just a clean slate. Shame on me for not having written this blog post 5 years ago.

    Past and Present

    I already wrote about the past and the current situation of Libav, if you are curious please do read the previous posts. I will probably blog again about the social issues soon.

    Future

    The Release 12 is in the ABI break window now and soon the release branch will be spun off! After that some of my plans to improve the API will see some initial implementations and hopefully will be available as part of the release 13 (and nihav)

    I will discuss avframe_yield first since Kostya already posted about a better way to handle container formats.

    Friday, 14 August

    I'd like to add the loop option to avconv. This option allows to repeat an input file given number of times, so the output contains specified number of inputs. The command is ./avconv -loop n -i infile outfile, n specifies how many times the input file should be looped in the output.

    How does this work?
    After processing the input file for the first time, avconv calls new seek_to_start function to seek back to the beginning of the file. av_seek_frame is called to perform seeking itself but there are other things needed for loop option to work.

    1) flush
    Flush decoder buffers to take out delayed frames. In avconv this is done by calling process_input_file with NULL as frame, process_input_packet had to be modified a little to not to signal EOF on the filters when seeking.

    2) timestamps (ts)
    To have correct timestamps in the "after seeking" part of the output stream they have to be corrected with ts = ts_{from the demuxer} + n * (duration of the input stream), n is number of times the input stream was processed so far . This duration is the duration of the longest stream in a file because all the streams have to be processed (or played) before starting the next loop. The duration of the stream is the last timestamp - the first timestamp + duration of the last frame. For the audio streams one "frame" is usually a constant number of samples and its duration is number of samples/sample rate. Video frames on the other side are displayed unevenly so their average framerate can be used for the last frame duration if available or if the average framerate is not known the last frame duration is just 1 (in the current time base).

    https://github.com/sasshka/libav/commit/90f2071420b6fd50eea34982475819248e5f6c8f

    Thursday, 30 July

    We are getting closer to a new release and you can see it is an even release by the amount of old and crufty code we are dropping. This usually is welcomed by some people and hated by others. This post is trying to explain what we do and why we are doing it.

    New API and old API

    Since the start of Libav we tried to address the painful shortcomings of the previous management, here the short list:

    • No leaders or dictators, there are rules agreed by consensus and nobody bends them.
    • No territoriality, nobody “owns” a specific area of the codebase nor has special rights on it.
    • No unreviewed changes in the tree, all the patches must receive an Ok by somebody else before they can be pushed in the tree.
    • No “cvs is the release”, major releases at least twice per year, bugfix-only point releases as often as needed.
    • No flames and trollfests, some basic code of conduct is enforced.

    One of the effect of this is that the APIs are discussed, proposals are documented and little by little we are migrating to a hopefully more rational and less surprising API.

    What’s so bad regarding the old API?

    Many of the old APIs were not designed at all, but just randomly added because mplayer or ffmpeg.c happened to need some
    feature at the time. The result was usually un(der)documented, hard to use correctly and often not well defined in some cases. Most users of the old API that I’ve seen actually used it wrong and would at best occasionally fail to work, at worst crash randomly.
    – Anton

    To expand a bit on that you can break down the issues with the old API in three groups:

    • Unnamespaced common names (e.g. CODEC_ID_NONE), those may or might not clash with other libraries.
    • Now-internal-only fields previously exposed that were expected to be something that are not really are (e.g. AVCodecContext.width).
    • Functionality not really working well (e.g. the old audio resampler) for which a replacement got provided eventually (AVResample).

    The worst result of API misuse could be a crash in specific situations (e.g. if you use the AVCodecContext dimension when you should use the AVFrame dimensions to allocate your screen surface you get quite an ugly crash since the former represent the decoding time dimension while the latter the dimensions of the frame you are going to present and they can vary a LOT).

    But Compatibility ?!

    In Libav we try our best to give migration paths and in the past years we even went over the extra mile by providing patches for quite a bit of software Debian was distributing at the time. (Since nobody even thanked for the effort, I doubt the people involved would do that again…)

    Keeping backwards compatibility forever is not really feasible:

    • You do want to remove a clashing symbol from your API
    • You do want to not have application crashing because of wrong assumptions
    • You do want people to use the new API and not keep compatibility wrappers that might not work in certain
      corner cases.

    The current consensus is to try to keep an API deprecated for about 2 major releases, with release 12 we are dropping code that had been deprecated since 2-3 years ago.

    Next!

    I had been busy with my dayjob deadlines so I couldn’t progress on the new api for avformat and avcodec I described before, probably the next blogpost will be longer and a bit more technical again.

    Thursday, 09 July

    Debian decided to move to the new FFmpeg, what does it mean to me? Why should I care? This post won’t be technical for once, if you think “Libav is evil” start reading from here.

    Relationship between Libav and Debian

    After split between what was FFmpeg in two projects, with Michael Niedermayer keeping the name due his ties with the legal owner of the trademark and “merging” everything the group of 18 people was doing under the new Libav name.

    For Gentoo I, maybe naively, decided to just have both and let whoever want maintain the other package. Gentoo is about choice and whoever wants to shot himself on a foot has to be be free to do that in the safest possible way.

    For Debian, being binary packaged, who was maintaining the package decided to stay with Libav. It wasn’t surprising given “lack of releases” was one of the sore points of the former FFmpeg and he started to get involved with upstream to try to fix it.

    Perceived Leverage and Real Shackles

    Libav started with the idea to fix everything that went wrong with the Former FFmpeg:
    – Consensus instead of idolatry for THE Leader
    – Paced releases instead of cvs is always a release
    – Maintained releases branches for years
    git instead of svn
    – Cleaner code instead of quick hacks to solve the problem of the second
    – Helping downstreams instead of giving them the finger.

    Being in Debian, according to some people was undeserved because “Libav is evil” and since we wrongly though that people would look at actions and not at random blogpost by people with more bias than anything we just kept writing code. It was a huge mistake, this blogpost and this previous are my try to address this.

    Being in Debian to me meant that I had to help fixing stale version of software, often even without upstream.

    The people at Debian instead of helping, the amount of patches coming from people @debian.org over the years amounted to 1 according to git, kept piling up work on us.

    Fun requests such as “Do remove a standard test image because its origin according to them is unclear” or “Do maintain the ancient release branch that is 3 major releases behind” had been quite common.

    For me Debian had been no help and additional bourden.

    The leverage that being in a distribution theoretically gives according to those crying because the evil Libav was in Debian amounts to none to me: their user complain because the version provided is stale, their developers do not help even keeping the point releases up or updating the software using Libav because scared to be tainted, downstreams such as Kubi (that are so naive to praise FFmpeg for what happened in Libav, such as the HEVC multi-thread support Anton wrote) would keep picking the implementation they prefer and use ffmpeg-only API whenever they could (debian will ask us to fix that for them anyway).

    Is important being in Debian?

    Last time they were discussing moving to FFmpeg I had the unpleasant experience of reading lots of lovely email with passive-aggressive snide remarks such as “libav has just developers not users” or seeing the fruits of the smear campaign such as “is it true you stole the FFmpeg hardware” in their mailing list (btw during the past VDD the FFmpeg people there said at least that would be addressed, well, it had not been yet, thank you).

    At that time I got asked to present Libav, this time after reading in the debian wiki the “case” presented with skewed git statistics (maybe purge the merge commits when you count them to compare a project activity?) and other number dressing I just got sick of it.

    Personally I do not care. There is a better way to spend your own free time than do the distro maintenance work for people that not even thanks you (because you are evil).

    The smear campaign pays

    I’m sure that now that now that the new FFmpeg gets to replace Libav will get more contributions from people @debian.org and maybe those that were crying for the “oh so unjust” treatment would be happy to do the maintenance churn.

    Anyway that’s not my problem anymore and I guess I can spend more time writing about the “social issues” around the project trying to defuse at least a little the so effective “Libav is evil” narrative a post a time.

    Friday, 03 July

    Last weekend some libav developers met in the South Pole offices with additional sponsorship from Inteno Broadband Technology. (And the people at Borgodoro that gave us more chocolate to share with everybody).

    Sprints

    Since last year the libav started to have sprints to meet up, discuss in person topics that require a more direct media than IRC or Mailing List and usually write some code asking for direct opinions and help.

    Who attended

    Benjamin was our host for the event. Andreas joined us for the first day only, while Anton, Vittorio, Kostya, Janne, Jan and Rémi stayed both days.

    What we did

    The focus had been split in a number of area of interests:

    • API: with some interesting discussion between Rémi and Anton regarding on how to clarify a tricky detail regarding AVCodecContext and AVFrame and who to trust when.
    • Reverse Engineering: With Vittorio and Kostya having fun unraveling codecs one after the other (I think they got 3 working)
    • Release 12 API and ABI break
      • What to remove and what to keep further
      • What to change so it is simpler to use
      • If there is enough time to add the decoupled API for avcodec
    • Release 12 wishlist:
      • HEVC speed improvements, since even the C code can be sped up.
      • HEVC extended range support, since there is YUV 422 content out now.
      • More optimizations for the newer architectures (aarch64 and power64le)
      • More hardware accelerator support (e.g. HEVC encoding and decoding support for Intel MediaSDK).
      • Some more filters, since enough people asked for them.
      • Merge some of the pending work (e.g. go2meeting3, the new asf demuxer).
      • Get more security fixes in (with ago kindly helping me on this).
      • … and more …
    • New website with markdown support to make easier for people to update.

    During the sprint we managed to write a lot of code and even to push some during the sprint.
    Maybe a little too early in the case of asf, but better have it in and get to fix it for the release.

    Special mention to Jan for getting a quite exotic container almost ready, I’m looking forward to see it in the ml; and Andreas for reminding me that AVScale is needed sorely by sending me a patch that fixes a problem his PowerPC users are experiencing while uncovering some strange problem in swscale… I’ll need to figure out a good way to get a PowerPC big-endian running to look at it in detail.

    Thank you

    I want to especially thank all the people at South Pole that welcome me when I arrived with 1 day in advance and all the people that participated and made the event possible, had been fun!

    Post Scriptum

    • This post had been delayed 1 week since I had been horribly busy, sorry for the delay =)
    • During the sprint legends such as kropping the sourdough monster and the burning teapot had been created, some reference of them will probably appear in commits and code.
    • Anybody with experience with qemu-user for PowerPC is welcome to share his knowledge with me.

    Wednesday, 25 March

    I am hearing a lot of persons interested in open-source and giving back to the community. I think it can be an exciting experience and it can be positive in many different ways: first of all more contributors mean better open-source software being produced and that is great, but it also means that the persons involved can improve their skills and they can learn more about how successful projects get created.

    So I wondered why many developers do not do the first step: what is stopping them to send the first patch or the first pull-request? I think that often they do not know where to start or they think that contributing to the big projects out there is intimidating, something to be left to an alien form of life, some breed of extra-good programmers totally separated by the common fellows writing code in the world we experience daily.

    I think that hearing the stories of a few developers that have given major contributions to top level project could help to go over these misconceptions. So I asked a few questions to this dear friend of mine, Luca Barbato, who contributed among the others to Gentoo and VLC.

    Let’s start from the beginning: when did you start programming?

    I started dabbling stuff during high school, but I started doing something more consistent at the time I started university.

    What was your first contribution to an open-source project?

    I think either patching the ati-drivers to work with the 2.6 series or hacking cloop (a early kernel module for compressed loops) to use lzo instead of gzip.

    What are the main projects you have been involved into?

    Gentoo, MPlayer, Libav, VLC, cairo/pixman

    How did you started being involved in Gentoo? Can you explain the roles you have covered?

    Daniel Robbins invited me to join, I thought “why not?

    During the early times I took care of PowerPC and [Altivec](http://en.wikipedia.org/wiki/AltiVec), then I focused on the toolchain due the fact it gcc and binutils tended to break software in funny ways, then multimedia since altivec was mainly used there. I had been part of the Council a few times used to be a recruiter (if you want to join Gentoo feel free to contact me anyway, we love to have more people involved) and I’m involved with community relationship lately.

    Note: Daniel Robbins is the creator of Gentoo, a Linux distribution. 

    Are there other less famous projects you have contributed to?

    I have minor contributions in quite a bit of software due. The fact is that in Gentoo we try our best to upstream our changes and I like to get back fixes to what I like to use.

    What are your motivations to contribute to open-source?

    Mainly because I can =)

    Who helped you to start contributing? From who you have learnt the most?

    Daniel Robbins surely had been one of the first asking me directly to help.

    You learn from everybody so I can’t name a single person among all the great people I met.

    How did you get to know Daniel Robbins? How did he helped you?

    I was a gentoo user, I happened to do stuff he deemed interesting and asked me to join.

    He involved me in quite a number of interesting projects, some worked (e.g. Gentoo PowerPC), some (e.g. Gentoo Games) not so much.

    Do your contributions to open-source help your professional life?

    In some way it does, contrary to the assumption I’m just seldom paid to improve the projects I care about the most, but at the same time having them working helps me when I need them during the professional work.

    How do you face disagreement on technical solutions?

    I’m a fan of informed consensus, otherwise prototypes (as in “do, test and then tell me back”) work the best.

    To contribute to OSS are more important the technical skills or the diplomatic/relation skills?

    Both are needed at different time, opensource is not just software, you MUST get along with people.

    Have you found different way to organize projects? What works best in your opinion? What works worst?

    Usually the main problem is dealing with poisonous people, doesn’t matter if it is a 10-people project or a 300+-people project. You can have a dictator, you can have a council, you can have global consensus, poisonous people are what makes your community suffer a lot. Bonus point if the poisonous people get clueless fan giving him additional voices.

    Did you ever sent a patch for the Linux kernel?

    Not really, I’m not fond of that coding style so usually other people correct the small bugs I stumble upon before I decide to polish my fix so it is acceptable =)

    Do you have any suggestions for people looking to get started contributing to open-source?

    Pick something you use, scratch your own itch first, do not assume other people are infallible or heroes.

    ME: I certainly agree with that, it is one of the best advices. However if you cannot find anything suitable at the end of this post I wrote a short list of projects that could use some help.

    Can you tell us about your best and your worst moments with contribution to OSS?

    The best moment is recurring and it is when some user thanks you since you improved his or her life.

    The worst moment for me is when some rabid fan claims I’m evil because I’m contributing to Libav and even praises FFmpeg for something originally written in Libav in the same statement, happened more than once.

    What are you working on right now and what plans do you have for the future?

    Libav, plaid, bmdtools, commonmark. In the future I might play a little more with [rust](http://www.rust-lang.org/).

    Thanks Luca! I would be extremely happy if this short post could give to someone the last push they need to contribute to an existing open-source project or start their own: I think we could all use more, better, open-source software. So let’s write it.

    One thing I admire in Luca is that he is always curious and ready to jump on the next challenge. I think this is the perfect attitude to become an OSS contributor: just start play around with the things you like and talk to people, you could find more possibilities to contribute that you could imagine.

    …and one final thing: Luca is also the author of open-source recipes: he created the recipes of two types of chocolate bars dedicated to Libav and VLC. You can find them on the borgodoro website.

    1385040326653

    I suggest to take a look at his blog.

    A few open-source you could consider contributing to

    Well, just in case you are eager to start writing some code and you are looking for some projects to contribute to here there are a few, written with different technologies. If you want to start contributing to any of those and you need directions just drop me a line (federico at tomassetti dot me) and I would be glad to help!

    • If you are interested in contributing to Libav, you can take a look at this post: there I explained how I submitted my first patch (approved in the meantime!). It is written in C.

    • You could be also interested in plaid: it is a Python web application to manage git patches sent by e-mail (there are a few projects using this model like libav or the linux kernel)

    • WorldEngine, it is a world generator written in Python

    • Plate-tectonics, it is a library for plate tectonics simulation. It is written in C++

    • JavaParser a Java parser, written in Java

    • Incremental Java parser, an incremental Java parser, written in Scala

    The post How people get started contributing to open-source? A few questions to Luca Barbato, contributor to Gentoo, MPlayer, Libav, VLC, cairo/pixman appeared first on Federico Tomassetti - Consultant Software Engineer.

    Wednesday, 18 February

    I happened to have a few hours free and I was looking for some coding to do. I thought about VLC, the media player which I have enjoyed so much using over the years and I decided that I wanted to contribute in some way.

    To start helping in such a complex process there are a few steps involved. Here I describe how I got my first patched accepted. In particular I wrote a patch for libav, the library behind VLC.

    The general picture

    I started by reading the wiki. It is a very helpful starting point but the process to setup the environment and send a first patch was not yet 100% clear to me so I got in touch with some of the developers of libav to understand how they work and how I could start lending an hand with something simple. They explained me that the easier way to start is by solving issues reported by static analysis tools and style checkers. They use uncrustify to verify that the code is adhering to their style guidelines and they run coverity to check for potential issues like memory leaks or null deferences. So I:

    • started looking at some coverity issues
    • found something easy to address (a very simple null deference)
    • prepared the patch
    • submitted the patch

    After a few minutes the patch was approved by a committer, ready to be merged. The day after it made its way to the master branch. Yeah!

    Download source code, build libav and run the tests

    First of all, let’s clone the git repository:

    git clone git://git.libav.org/libav.git

    Alternatively you could use the GitHub mirror, if you want to.

    At this point you may want to install all the dependencies. The instructions are platform specific, you can find them here. If you have Mac Os-X be sure to have installed yasm, because nasm does not work. If you have installed both configure will pick up yasm (correctly). Just be sure to run configure after installing yasm.

    If everything goes well you can now build libav by running:

    ./configure
    make

    Note that it is fine to build in-tree (no need to build in a separate directory).

    Now it is time to run the tests. You will have to specify one directory where to download some samples, later used by tests. Let’s assume you wanted to put your samples under ~/libav-samples:

    mkdir ~/libav-samples
    # This download the samples
    make fate-rsync SAMPLES=~/libav-samples
    # This run the tests
    make fate

    Did everything run fine? Good! Let’s start to patch then!

    Write the patch

    First of all we need to find an open issue. Visit Coverity page for libav at https://scan.coverity.com/projects/106. You will have to ask for access and wait that someone grants it to you. When you will be able to login you will encounter a screen like this:

    Screenshot from 2015-02-14 19:39:06

    Here, this seems an easy one! The variable oggstream has been allocated by av_mallocz (basically a wrapper for malloc) but the result values has not been checked. If the allocation fails a NULL pointer is returned and when we will try to access it at the next line things are going end up unpleasantly. What we need to do is to check the return value of av_mallocz and if it is NULL we should return an error. The appropriate error to return in this case is AVERROR(ENOMEM). To get this information… you have to start reading code, getting familiar with the way of doing business of this codebase.

    Libav follows strict rules about the comments in git commits: use git log to look at previous commits and try to use the same style.

    Submitting the patch

    I think many of you are familiar with GitHub and the whole process of submitting a patch for revision. GitHub is great because it made that process so easy. However there are some projects (notably including the Linux kernel) which adopts another approach: they receive patches by e-mail.

    Git has a functionality that permits to submit a patch by e-mail with a simple command. The patch will be sent to the mailing list, discussed, and if approved the e-mail will be downloaded, processed through git and committed in the official repository. Does it sound cumbersome? Well, it sounds to me, spoiled as I am by GitHub and similar tools but, you know, if you go in Rome you should behave as the Romans do, so…  

    # This install the git extension for sending patches through e-mail
    sudo apt install git-email 
    # This submit a patch built using the last commit
    git send-email -1 --to libav-devel@libav.org

    Sending patches using gmail with 2-factor authentication enabled

    Now, many of you are using gmail and many of you have enable 2-factor authentication (right? If not, you should). If this is you case you will get an error along this lines:

    Password for 'smtp://f.tomassetti@gmail.com@smtp.gmail.com:587': 5.7.9 Application-specific password required. Learn more at 5.7.9 http://support.google.com/accounts/bin/answer.py?answer=185833 cj12sm14743233wjb.35 - gsmtp

    Here you can find how to create a password for this goal: https://support.google.com/accounts/answer/185833 The name of the application that I had to create was smtp://f.tomassetti@gmail.com@smtp.gmail.com:587. Note that I used the same name specified in the previous error message.

    What if I need to correct my patch?

    If things go well an e-mail with your patch will be sent to the mailing-list, someone will look at it and accept it. Most of the times you will receive suggestions about possible adjustments to be done to improve your password. When it happens you want to submit a new version of your patch in the same thread which contains your first version of the patch and the e-mails commenting it.

    To do that you want to update your patch (typically using git commit –amend) and then run something like:

    git send-email -1 --to libav-devel@libav.org --in-reply-to Message-ID: 54E0F459.3090707@gentoo.org

    Of course you need to find out the message-id of the e-mail to which you want to reply. To do that in gmail select the “Show original” item from the contextual menu for the message and in the screen opened look for the Message-Id header.

    Tools to manage patches sent by e-mail

    There are also web applications which are used to manage the patches sent by e-mail. Libav is currently using Patchwork for managing patches. You can see it deployed at: https://patches.libav.org/project/libav-devel/list/. Currently another tool has been developed to replace patchwork. It is named Plaid and I tried to help a little bit with that also 🙂

    Conclusions

    Mine has been a very small contribution, and in the future I hope to be able to do more. But being a maintainer of other open-source projects I learned that also small help is useful and appreciated, so for today I feel good.

    Screenshot from 2015-02-14 22:29:48

    Please, if I am missing something help me correct this post

    The post How to contribute to Libav (VLC): just got my first patch approved appeared first on Federico Tomassetti - Consultant Software Engineer.

    Monday, 12 January

    I'm interested in history, so I like visinting castles (or their ruins) and historical towns. There're many of them here in Czech Republic. Some of Czech sights like those in Prague, Kutná Hora or Český Krumlov are well known and, especially during the summer, overcrowded by tourists.  But there are also very nice less popular places which are much calmer and it is really a pleasure to visit them. One of such places is Jindřichův Hradec town, where the third biggest castle in the Czech Republic is located. Town centre with this castle is really amazing, it is full of romatic little streets, churches, museums and ancient buildings. This small town has really big  historical centre compared to its size, so one can spend the whole day exploring the castle and it surroundings. Recently, I decided to visit the town again, it was windy day, but it was realtively warm for the winter. There weren't any tourists around, and I really enjoyed my visit. The only disadvantage of this trip was that castle is closed for visitors during the winter and museums have short opening hours on weekends.





    Thursday, 20 November

    I've participated to last Libav sprint in Torino. I made new ASF demuxer for Libav, but during the testing problems with rtsp a mms protocols has appeared. Therefore, my main task during the sprint was to fix these issues. 
    It was second time I was at such sprint and also my second Torino visit and the sprint was even better than I expected. It's really nice to see people I'm communicating throught the irc channel in person, the thing I like about Libav a lot is its friendly community. But the most important thing for me as the most unexperienced person among skilled developers was naturally their help. My mentors from OPW participated the sprint and as a result all the issues was fixed and patch was sent to the ML (https://patches.libav.org/patch/55682/). Also, these personal consultations can be very productive in learning new things and because I'm not native English speaker I realized few days I have to speak or even think in English are really helpful for getting better in it.
    The last day of the sprint we had a trip to a really magical place called Sacra di San Michele (http://www.sacradisanmichele.com/).



    I like to visit places like this, in Czech Republic, where I'm living, I'm visiting ancient castles. But I think it may be the oldest place I've ever been to, the oldest parts of it was built in 10th century.  I had a feeling the history is breathing on us from the walls. We were lucky about the weather, it was sunny during our visit and the view from the terrace on top of the building was really breathtaking. We saw peaks of Alps covered by snow that divides this part of Italy and France.

    Saturday, 15 November

    After my challenge with the fused multiply-add instructions I managed to find some time to write a new test utility. It’s written ad hoc for unpaper but it can probably be used for other things too. It’s trivial and stupid but it got the job done.

    What it does is simple: it loads both a golden and a result image files, compares the size and format, and then goes through all the bytes to identify how many differences are there between them. If less than 0.1% of the image surface changed, it consider the test a pass.

    It’s not a particularly nice system, especially as it requires me to bundle some 180MB of golden files (they compress to just about 10 MB so it’s not a big deal), but it’s a strict improvement compared to what I had before, which is good.

    This change actually allowed me to explore one change that I abandoned before because it resulted in non-pixel-perfect results. In particular, unpaper now uses single-precision floating points all over, rather than doubles. This is because the slight imperfection caused by this change are not relevant enough to warrant the ever-so-slight loss in performance due to the bigger variables.

    But even up to here, there is very little gain in performance. Sure some calculation can be faster this way, but we’re still using the same set of AVX/FMA instructions. This is unfortunate, unless you start rewriting the algorithms used for searching for edges or rotations, there is no gain to be made by changing the size of the code. When I converted unpaper to use libavcodec, I decided to make the code simple and as stupid as I could make it, as that meant I could have a baseline to improve from, but I’m not sure what the best way to improve it is, now.

    I still have a branch that uses OpenMP for the processing, but since most of the filters applied are dependent on each other it does not work very well. Per-row processing gets slightly better results but they are really minimal as well. I think the most interesting parallel processing low-hanging fruit would be to execute processing in parallel on the two pages after splitting them from a single sheet of paper. Unfortunately, the loops used to do that processing right now are so complicated that I’m not looking forward to touch them for a long while.

    I tried some basic profile-guided optimization execution, just to figure out what needs to be improved, and compared with codiff a proper release and a PGO version trained after the tests. Unfortunately the results are a bit vague and it means I’ll probably have to profile it properly if I want to get data out of it. If you’re curious here is the output when using rbelf-size -D on the unpaper binary when built normally, with profile-guided optimisation, with link-time optimisation, and with both profile-guided and link-time optimisation:

    % rbelf-size -D ../release/unpaper ../release-pgo/unpaper ../release-lto/unpaper ../release-lto-pgo/unpaper
        exec         data       rodata        relro          bss     overhead    allocated   filename
       34951         1396        22284            0        11072         3196        72899   ../release/unpaper
       +5648         +312         -192           +0         +160           -6        +5922   ../release-pgo/unpaper
        -272           +0        -1364           +0         +144          -55        -1547   ../release-lto/unpaper
       +7424         +448        -1596           +0         +304          -61        +6519   ../release-lto-pgo/unpaper
    

    It’s unfortunate that GCC does not give you any diagnostic on what it’s trying to do achieve when doing LTO, it would be interesting to see if you could steer the compiler to produce better code without it as well.

    Anyway, enough with the microptimisations for now. If you want to make unpaper faster, feel free to send me pull requests for it, I’ll be glad to take a look at them!

    Friday, 15 August

    RealAudio files have several possible interleavers. The simplest is “Int0”, which means that the packets are in order. Today, I was contrasting “Int4” and “genr”. They both require rearranging data, in highly similar but not identical ways. “genr” is slightly more complex than “Int4”.

    A typical Int4 pattern, writing to subpacket 0, 1, 2, 3, etc, would read data from subpacket 0, 6, 12, 18, 24, 30, 36, 42, 48, 54, 60, 66, 1, 7, 13, etc, in that order – assuming subpkt_h is 12, as it was in one sample file. It is effectively subpacket_h rows of subpacket_h / 2 columns, counting up by subpacket_h / 2 and wrapping every two rows.

    A typical genr pattern is a little trickier. For subpacket_h = 14, and the same 6 columns per row as above, the pattern to read from looks like 0, 12, 24, 36, 48, 60, 72, 6, 18, 30, 42, 54, 66, 78, 1, etc.

    I spent most of today implementing genr, carefully working with a paper notebook, pencil, Python, and a terse formula from the old implementation:

    case DEINT_ID_GENR:
    for (x = 0; x < w/sps; x++) avio_read(pb, ast->pkt.data+sps*(h*x+((h+1)/2)*(y&1)+(y>>1)), sps);

    After various debug printfs, a lot of quality time in GDB running commands like x /94x (pkt->data + 14 * 94), a few interestingly garbled bits of audio playback, and a mentor pointing out I have some improvements to make on header parsing, I can play (some) genr files.

    I have also recently implemented SIPR support, and it works in both RA and RM files. RV10 video also largely works.

    Saturday, 09 August

    I've solved lost packets problems and finally my ASF demuxer started to work right at "ideal samples in vacuum".  So the time for fixing memory leaks had come and valgrind helped me a lot with this issue. After memory leaks was solved I had to start testing my demuxer on various samples of ASF format multimedia files. As expected, I've found many samples my demuxer failed for. The reasons was different - mostly it was my mistakes, misunderstood or overlooked parts of specs, but I think I found a case that needed unusual handling specs didn't mention about.
    some of problems was caused for example by
    * improper subpayloads handling - one should be really careful while reading specs to avoid problems for less common cases like one subpayload is inside single payload and there's is padding inside payload itself (while padding after payload is 0), but there was other problems too
    * I had to revise padding handling for all possible cases
    * ASF file has 3 places where ASF packet size is told - twice in the header objects and once in a packet itself, and specs are not specifying what should one do when they differs or at least I didn't found it
    * some stupid mistakes like when I just forgot to do something after adding new block to my code was really annoying
    Funny thing was when I fixed my demuxer for one group of samples and another one that worked before started to fail, I fixed this new group and third group failed. I was so much annoyed by this, but many mistakes I did was caused by my inexperience and I think one (at least me) just have to do all of these mistakes to get better.

    Sunday, 27 July

    Finally, all basic parts of ASF demuxer seems to work somehow.

     At last two weeks I fixed various bugs in my code and I hope packets handling is correct now. Only problem is that few packets at the end of the Data Object are still lost. Because I wanted a small break from this problem, my mentors allowed me to implement basic seeking first. ASF demuxer can now read index entries from Simple Index Object and adds them with av_add_index_entry to AVStream. So when Simple Index Object is present in an ASF file, my demuxer can seek to the requested time.

    Sunday, 13 July


    Skeleton of the new ASF demuxer was written, but only audio was demuxed properly now. Problem is complicated video frames handling in ASF format. I hope I finally found out how to process packets properly. ASF packet can contain single payload, single payload with subpayloads, multiple payloads or multiple payloads with subpayloads inside some of them. Every subpayload is always one frame, but single payload can be whole frame or just part of it. When ASF packet contains multiple payloads inside it, each of them can be one frame but it can be just fragment of it as well. When one of mulptiple payloads contains subpayloads, each of subpayload is one frame and it can be processed as AVPacket.
    For the case of fragmented frame in ASF packet I have to store several unfinished frames in ASFPacket structures that I've created for this purpose.  There should not be more than one unfinished frames per stream, so I have one ASFPacket in each ASFStream (ASFStream is structure for storing ASF stream properties). ASFPacket contains pointer to AVBufferRef where unfinished frame is stored. When frame is finished I can forward pointer to buffer  with data to AVPacket, set its properties like size, timestamps and others and finally return AVPacket.
    I introduced many bugs to my code that was working (at least ASF packets was parsed right and audio worked) and now I'm working on fixing all of them.



    I was accepted for OPW, May - August 2014 round with project "Rewrite the ASF demuxer". First task from my mentors was to create  wiki page  about ASF (Advanced Streaming Format), it was created at https://wiki.libav.org/ASF
    Interesting notes about other containers: http://codecs.multimedia.cx/?p=676.


    Next task from my mentors was to write simple program which reads asf file and prints its structure, i.e. list of asf objects, metadata and codec information. ASF file consists of so called ASF Objects. There're 3 top-level objects - Header Object, Data Object and Index Object. Especially Header Object can contain many other objects to provide different asf features, for example Codec List Object for codec information or Metadata object for metadata.  One can recognise object with GUID,  which is 16 byte array (each byte is number) that identifies object type. I was confused about the fact the GUID number you read from the file is not matching the GUID from specs. For some historical reasons one have to modify GUIDs from specs (reorder the numbers) for match GUID read from the file.
    My program is working now and can list objects, codecs and metadata info, but it ignores Index Objects by then. I hope I'll add support for them soon. Also I want to print offsets for each object and read Data Object deeper.

    Thursday, 03 July

    Today, I learned how to use framecrc as a debug tool. Many Libav tests use framecrc to compare expected and actual decoding. While rewriting existing code, the output from the old and new versions of the code on the same sample can be checked; this makes a lot of mistakes clear quickly, including ones that can be quite difficult to debug otherwise.

    Checking framecrcs interactively is straightforward: ./avconv -i somefile -c:a copy -f framecrc -. The -c:a copy specifies that the original, rather than decoded, packet should be used. The - at the end makes the output go to stdout, rather than a named file.

    The output has several columns, for the stream index, dts, pts, duration, packet size, and crc:

    0, 0, 0, 192, 2304, 0xbf0a6b45
    0, 192, 192, 192, 2304, 0xdd016b78
    0, 384, 384, 192, 2304, 0x18da71d6
    0, 576, 576, 192, 2304, 0xcf5a6a07
    0, 768, 768, 192, 2304, 0x3a84620a

    It is also unusually simple to find out what the fields are, as libavformat/framecrcenc.c spells it out quite clearly:

    static int framecrc_write_packet(struct AVFormatContext *s, AVPacket *pkt)
    {
    uint32_t crc = av_adler32_update(0, pkt->data, pkt->size);
    char buf[256];

    snprintf(buf, sizeof(buf), “%d, %10″PRId64″, %10″PRId64″, %8d, %8d, 0x%08″PRIx32″\n”,
    pkt->stream_index, pkt->dts, pkt->pts, pkt->duration, pkt->size, crc);
    avio_write(s->pb, buf, strlen(buf));
    return 0;
    }

    Keiler, one of my Libav mentors, patiently explained the above; I hope documenting it helps other people who are starting with Libav development.

    Thursday, 12 June

    Most recently, I have been adding documentation to Libav. Today, my work included writing a demuxer howto. In the last couple of weeks, I have also reimplemented RealAudio 1.0 support (2.0 is in progress), and learned more about Coccinelle and undefined behavior in C. Blog posts on these topics are pending.

    Tuesday, 20 May

    My first patch for undefined behavior eliminates left shifts of negative numbers, replacing a << b (where a can be negative) with a * (1 << b). This change fixes bug686, at least for fate-idct8x8 and libavcodec/dct-test -i (compiled with ubsan and fno-sanitize-recover). Due to Libav policy, the next step is to benchmark the change. I was also asked to write a simple benchmarking HowTo for the Libav wiki.

    First, I installed perf: sudo aptitude install linux-tools-generic
    I made two build directories, and built the code with defined behavior in one, and the code with undefined behavior in the other (with ../configure && make -j8 && make fate). Then, in each directory, I ran:

    perf stat --repeat 150 ./libavcodec/dct-test -i > /dev/null

    The results were somewhat more stable than with –repeat 30, but it still looks much more like noise than a meaningful result. I ran the command with –repeat 30 for both before the recorded 150 run, so both would start on equal footing. With defined behavior, the results were “0.121670022 seconds time elapsed ( +-  0.11% )”; with undefined behavior, “0.123038640 seconds time elapsed ( +-  0.15% )”. The best of a further three runs had the opposite result, shown below:

    % cat undef.150.best

    perf stat –repeat 150 ./libavcodec/dct-test -i > /dev/null

    Performance counter stats for ‘./libavcodec/dct-test -i’ (150 runs):

    120.427535 task-clock (msec) # 0.997 CPUs utilized ( +- 0.11% )
    21 context-switches # 0.178 K/sec ( +- 1.88% )
    0 cpu-migrations # 0.000 K/sec ( +-100.00% )
    226 page-faults # 0.002 M/sec ( +- 0.01% )
    455’393’772 cycles # 3.781 GHz ( +- 0.05% )
    <not supported> stalled-cycles-frontend
    <not supported> stalled-cycles-backend
    1’306’169’698 instructions # 2.87 insns per cycle ( +- 0.00% )
    89’674’090 branches # 744.631 M/sec ( +- 0.00% )
    1’144’351 branch-misses # 1.28% of all branches ( +- 0.18% )

    0.120741498 seconds time elapse

    % cat def.150.best

    Performance counter stats for ‘./libavcodec/dct-test -i’ (150 runs):

    120.838976 task-clock (msec) # 0.997 CPUs utilized ( +- 0.11% )
    21 context-switches # 0.172 K/sec ( +- 1.98% )
    0 cpu-migrations # 0.000 K/sec
    226 page-faults # 0.002 M/sec ( +- 0.01% )
    457’077’626 cycles # 3.783 GHz ( +- 0.08% )
    <not supported> stalled-cycles-frontend
    <not supported> stalled-cycles-backend
    1’306’321’521 instructions # 2.86 insns per cycle ( +- 0.00% )
    89’673’780 branches # 742.093 M/sec ( +- 0.00% )
    1’148’393 branch-misses # 1.28% of all branches ( +- 0.11% )

    0.121162660 seconds time elapsed ( +- 0.11% )

    I also compared the disassembled code from jrevdct.o, before and after the changes to have defined behavior (using gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2 on x86_64).

    In the build directory for the code with defined behavior:
    objdump -d libavcodec/jrevdct.o > def.dis
    sed -e 's/^.*://' def.dis > noline.def.dis

    In the build directory for the code with undefined behavior:
    objdump -d libavcodec/jrevdct.o > undef.dis
    sed -e 's/^.*://' undef.dis > noline.undef.dis

    Leaving aside difference in jump locations (despite the fact that they can impact performance), there are two differences:

    diff -u build_benchmark_undef/noline.undef.dis build_benchmark_def/noline.def.dis

    –       0f bf 50 f0             movswl -0x10(%rax),%edx
    +       0f b7 58 f0             movzwl -0x10(%rax),%ebxi

    It’s switched to using a zero-extension rather than sign-extension in one place.

    –       74 1c                   je     40 <ff_j_rev_dct+0x40>
    –       c1 e2 02                shl    $0x2,%edx
    –       0f bf d2                movswl %dx,%edx
    –       89 d1                   mov    %edx,%ecx
    –       0f b7 d2                movzwl %dx,%edx
    –       c1 e1 10                shl    $0x10,%ecx
    –       09 d1                   or     %edx,%ecx
    –       89 48 f0                mov    %ecx,-0x10(%rax)
    –       89 48 f4                mov    %ecx,-0xc(%rax)
    –       89 48 f8                mov    %ecx,-0x8(%rax)
    –       89 48 fc                mov    %ecx,-0x4(%rax)
    +       74 19                   je     3d <ff_j_rev_dct+0x3d>
    +       c1 e3 02                shl    $0x2,%ebx
    +       89 da                   mov    %ebx,%edx
    +       0f b7 db                movzwl %bx,%ebx
    +       c1 e2 10                shl    $0x10,%edx
    +       09 da                   or     %ebx,%edx
    +       89 50 f0                mov    %edx,-0x10(%rax)
    +       89 50 f4                mov    %edx,-0xc(%rax)
    +       89 50 f8                mov    %edx,-0x8(%rax)
    +       89 50 fc                mov    %edx,-0x4(%rax)

    Leaving aside differences in register use:

    –       0f bf d2                movswl %dx,%edx
    There is one extra movswl instruction in the version with undefined behavior, at least with the particular version of the particular compiler for the particular architecture checked.

    This is an example of a null result while benchmarking; neither version performs better, although any given benchmark has one or the other come out ahead, generally by less than the variance within the run. If this were a suggested performance change, it would not make sense to apply it. However, the point of this change was correctness; a performance increase is not expected, and the lack of a performance penalty is a bonus.

    Monday, 19 May

    One of my fantastic OPW mentors prepared a “Welcome task package”, of self-contained, approachable, useful tasks that can be done while getting used to the code, and with a much smaller scope than the core objective. This is awesome. To any mentors reading this: consider making a welcome package!

    Step one of it is to use ubsan with gdb. This turned out to be somewhat intricate, so I have decided to supplement the wiki’s documentation with a step-by-step guide for Ubuntu 14.04.

    1) Install clang-3.5 (sudo aptitude install clang-3.5), as Ubuntu 14.04 comes with gcc 4.8, which does not support -fsanitize=undefined.

    2) Under libav, mkdir build_ubsan && cd build_ubsan && ../configure --toolchain=clang-usan --extra-cflags=-fno-sanitize-recover (alternatively, –cc=clang –extra-cflags=-fsanitize=undefined –extra-ldflags=-fsanitize=undefined can be used instead of –toolchain=clang-usan).

    3) make -j8 && make fate

    4) Watch where the tests die (they only die if –extra-cflags=-fno-sanitize-recover is used). For me, they died on TEST idct8x8. Running make V=1 fate and asking my mentors pointed me towards libavcodec/dct-test -i, which is dying on jrevdct.c:310:47: with “runtime error: left shift of negative value -14”. If you really want to err on the side of caution, make a second build dir, and ./configure --cc=clang && make -j8 && make fate in it, making sure it does not fail… this confirms that the problem is related to configuring with –toolchain=clang-usan (and, it turns out, with -fsanitize=undefined).

    5) It’s time to use the information my mentor pointed out on the wiki about ubsan at https://wiki.libav.org/Security/Tools  – specifically, the information about useful gdb breakpoints. I put a modified version of the b_u definitions into ~/.gdbinit. The wiki has been updated now, but was originally missing a few functions, including one that turns out to be relevant: __ubsan_handle_shift_out_of_bounds

    6 Run gdb ./libavcodec/dct-test, then at the gdb prompt, set args -i to set the arguments dct-test was being run with, and then b_u to load the ubsan breakpoints defined above. Then start the program: type run at the gdb prompt.

    7) It turns out that a problem can be found, and the program stops running. Get a backtrace with bt.


    680 in __ubsan_handle_shift_out_of_bounds ()
    #1  0x000000000048ac96 in __ubsan_handle_shift_out_of_bounds_abort ()
    #2  0x000000000042c074 in row_fdct_8 (data=<optimized out>) at /home/me/opw/libav/libavcodec/jfdctint_template.c:219
    #3  ff_jpeg_fdct_islow_8 (data=<optimized out>) at /home/me/opw/libav/libavcodec/jfdctint_template.c:273
    #4  0x0000000000425c46 in dct_error (dct=<optimized out>, test=<optimized out>, is_idct=<optimized out>, speed=<optimized out>) at /home/me/opw/libav/libavcodec/dct-test.c:246
    #5  main (argc=<optimized out>, argv=<optimized out>) at /home/me/opw/libav/libavcodec/dct-test.c:522

    It would be nice to see a bit more detail, so I wanted to compile the project so that less would be optimized out, and eventually settled on -O1 because compiling with ubsan and without optimizations failed (which I reported as bug 683). This led to a slightly better backtrace:


    #0  0x0000000000491a70 in __ubsan_handle_shift_out_of_bounds ()
    #1  0x0000000000492086 in __ubsan_handle_shift_out_of_bounds_abort ()
    #2  0x0000000000434dfb in ff_j_rev_dct (data=<optimized out>) at /home/me/opw/libav/libavcodec/jrevdct.c:275
    #3  0x00000000004258eb in dct_error (dct=0x4962b0 <idct_tab+64>, test=1, is_idct=1, speed=0) at /home/me/opw/libav/libavcodec/dct-test.c:246
    #4  0x00000000004251cc in main (argc=<optimized out>, argv=<optimized out>) at /home/me/opw/libav/libavcodec/dct-test.c:522

    It is possible to work around the problem by modifying the source code rather than the compiler flags: FFmpeg did so within hours of the bug report – the commit is at http://git.videolan.org/?p=ffmpeg.git;a=commit;h=bebce653e5601ceafa004db0eb6b2c7d4d16f0c0 ! Both FFmpeg and Libav have also merged my patch to work around the problem (FFmpeg patch, Libav patch). The workaround of using -O1 was suggested by one of my mentors, lu_zero; –disable-optimizations does not actually disable all optimizations (in practice, it leaves in ones necessary for compilation), and it does not touch the -O1 that –toolchain=clang-usan now sets.

    Wanting a better backtrace leads to the next post: a detailed guide to narrowing down a bug in a the C compiler, Clang. Yes, I know, the problem is never a bug in the C compiler – but this time, it was.

    Thursday, 15 May

    What’s the fun of only running code on platforms you physically have? Portability is important, and Libav actively targets several platforms. It can be useful to be able to try out the code, even if the hardware is totally unavailable.

    Here is how to run Libav’s tests under aarch64, on x86_64 hardware and Ubuntu 14.04. This guide is provided in the hopes that it saves someone else 20 hours or more: there is a lot of once-excellent information which has become misleading, because a lot of progress has been made in aarch64 support. I have tried three approachs – building with Linaro’s cross-compiler, building under QEMU user emulation, and building under QEMU system emulation, and cross-compiling. Building with a cross-compiler is the fastest option. Building under user emulation is about ten times slower. Building under system emulation is about a hundred times slower. There is actually a fourth option, using ARM Foundation Model, but I have not tried it. Running under QEMU user emulation is the only approach I managed to make entirely work.

    For all three approaches, you will want a rootfs; I used Ubuntu Core. You can download Ubuntu Core for aarch64 (a minimal rootfs; see https://wiki.ubuntu.com/Core to learn more),  and untar it (as root) into a new directory. Then, set an environment variable that the rest of this guide/set of notes uses frequently, changing the path to match your system:

    export a64root=/path/to/your/aarch64/rootdir

    Approach 1 – build under QEMU’s user emulation.

    Step 1) Set up QEMU. The days when using SUSE branches were necessary are over, but it still needs to be statically linked, and not all QEMU packages are. Ubuntu has a static QEMU:

    sudo aptitude install qemu-user-static

    This package also sets up binfmt for you. You can delete broken or stale binfmt information by running:
    echo -1 > /proc/sys/fs/binfmt_misc/archnamehere – this can be useful, especially if you have previously installed QEMU by hand.

    Step 2) Copy your QEMU binary into the chroot, as root, with:

    cp `which qemu-aarch64-static` $a64root/usr/bin/

    Step 3) As root, set up the aarch64 image so it can do DNS resolution, so you can freely use apt-get:
    echo 'nameserver 8.8.8.8' > $a64root/etc/resolv.conf

    Step 4) Chroot into your new system. Run chroot $a64root /bin/bash as root.

    At this point, you should be able to run an aarch64 version of ls, and confirm with file /bin/ls that it is an aarch64 binary.

    Now you have a working, emulated, minimal aarch64 system.

    On x86, you would run aptitude build-dep libav, but there is no such package for aarch64 yet, so outside of the chroot, on the normal system, I installed apt-rdepends and ran:
    apt-rdepends --build-depends --follow=DEPENDS libav

    With version information stripped out, the following packages are considered dependencies:
    debhelper frei0r-plugins-dev libasound2-dev libbz2-dev libcdio-cdda-dev libcdio-dev libcdio-paranoia-dev libdc1394-22-dev libfreetype6-dev  libgnutls-dev libgsm1-dev libjack-dev libmp3lame-dev libopencore-amrnb-dev libopencore-amrwb-dev libopenjpeg-dev libopus-dev libpulse-dev libraw1394-dev librtmp-dev libschroedinger-dev libsdl1.2-dev libspeex-dev libtheora-dev libtiff-dev libtiff5-dev libva-dev libvdpau-dev libvo-aacenc-dev libvo-amrwbenc-dev libvorbis-dev libvpx-dev libx11-dev libx264-dev libxext-dev libxfixes-dev libxvidcore-dev libxvmc-dev texi2html yasm zlib1g-dev doxygen

    Many of the libraries do not have current aarch64 Ubuntu packages, and neither does frei0r-plugins-dev, but running aptitude install on the above list installs a lot of useful things – including build-essential. The full list is in the command below; the missing packages are non-essential.

    Step 5) Set it up: apt-get install aptitude

    aptitude install git debhelper frei0r-plugins-dev libasound2-dev libbz2-dev libcdio-cdda-dev libcdio-dev libcdio-paranoia-dev libdc1394-22-dev libfreetype6-dev  libgnutls-dev libgsm1-dev libjack-dev libmp3lame-dev libopencore-amrnb-dev libopencore-amrwb-dev libopenjpeg-dev libopus-dev libpulse-dev libraw1394-dev librtmp-dev libschroedinger-dev libsdl1.2-dev libspeex-dev libtheora-dev libtiff-dev libtiff5-dev libva-dev libvdpau-dev libvo-aacenc-dev libvo-amrwbenc-dev libvorbis-dev libvpx-dev libx11-dev libx264-dev libxext-dev libxfixes-dev libxvidcore-dev libxvmc-dev texi2html yasm zlib1g-dev doxygen

    Now it is time to actually build libav.

    Step 6) Create a user within your chroot: useradd -m auser, and switch to running as that user: sudo -u auser bash, and type cd to go to the home directory.

    Step 7) Run git clone git://git.libav.org/libav.git, then ./configure --disable-pthreads && make -j8 (change the 8 to approximately the number of CPU cores you have).
    On my hardware, this takes 10-11 minutes, and ‘make fate’ takes about 16. Disabling pthreads is essential, as qemu-user does not handle threads well, and running the tests hangs randomly without it.


    Approach 2: cross-compile (warning: I do not have the tests working with this approach).

    1) Start by getting an aarch64 compiler. A good place to get one is http://releases.linaro.org/latest/components/toolchain/binaries/; I am using http://releases.linaro.org/latest/components/toolchain/binaries/gcc-linaro-aarch64-linux-gnu-4.8-2014.04_linux.tar.xz . Untar it, and add it to your path:

    export PATH=$PATH:/path/to/your/linaro/tools/bin

    2) Make the cross-compiler work. Run aptitude install lsb lib32stdc++6. Without this, invoking the compiler will say “No such file or directory”. See http://lists.linaro.org/pipermail/linaro-toolchain/2012-January/002016.html.

    3) Under the libav directory (run git clone git://git.libav.org/libav.git if you do not have one), type mkdir a64crossbuild; cd a64crossbuild. Make sure the libav directory is somewhere under $a64root (it should simplify running the tests, later).

    4)./configure --arch=aarch64 --cpu=generic --cross-prefix=aarch64-linux-gnu- --cc=aarch64-linux-gnu-gcc --target-os=linux --sysroot=$a64root --target-exec="qemu-aarch64-static -L $a64root" --disable-pthreads

    This is a minimal variant of Jannau’s configuration – a developer who has recently done a lot of libav aarch64 work.

    5) Run make -j8. On my hardware, it takes just under a minute.

    6) Run make fate. Unfortunately, both versions of QEMU I tried hung on wait4 at this point (in fft-test, fate-fft-4), and used an extra couple of hundred megabytes of RAM per second until I stopped QEMU, even if I asked it to wait for a remote GDB. For anyone else trying this, https://lists.libav.org/pipermail/libav-devel/2014-May/059584.html has several useful tips for getting the tests to run after cross-compilation.


    Approach 3: Use QEMU’s system emulation. In theory, this should allow you to use pthreads; in practice, the tests hung for me. The following May 9th post describes what to do: http://www.bennee.com/~alex/blog/2014/05/09/running-linux-in-qemus-aarch64-system-emulation-mode/. In short: git clone git://git.qemu.org/qemu.git qemu.git && cd qemu.git && ./configure --target-list=aarch64-softmmu && make, then

    ./aarch64-softmmu/qemu-system-aarch64 -machine virt -cpu cortex-a57 -machine type=virt -nographic -smp 1 -m 2048 -kernel aarch64-linux-3.15rc2-buildroot.img  --append "console=ttyAMA0" -fsdev local,id=r,path=$a64root,security_model=none -device virtio-9p-device,fsdev=r,mount_tag=r

    Then, under the buildroot system, log in as root (no password), and type mkdir /mnt/core && mount -t 9p -o trans=virtio r /mnt/core. At this point, you can run chroot /mnt/core /bin/bash, and follow the approach 1 instructions from useradd onwards, except that ./configure without –disable-pthreads should theoretically work. On my system, ./configure takes a bit over 5 minutes with this approach. Running make is quite slow; time make took 113 minutes. Do not use -j – you are limited to a single core, so -j would slow compilation down slightly. However, make fate consistently hung on acodec-pcm-alaw, and I have not yet figured out why.


     

    Things not to do:

    • Use a rootfs from a year ago; I am yet to try one that is not broken, and some come with fun bonuses like infinite file system loops. These cost me well over a dozen hours.
    • Compile SUSE’s QEMU; qemu-system is bleeding-edge enough that you need to compile it from upstream, but SUSE’s patches have long been merged into the normal QEMU upstream. Unless you want qemu-system, you do not need to compile QEMU at all under Ubuntu 14.04.
    • Leave the environment variables in this tutorial unset in a new shell and wonder why things do not work.

     

    Wednesday, 23 April

    Applying to OPW requires an initial contribution. The Libav IRC channel suggested porting the asettb filter from FFmpeg, so I did (version 5 of the patch was merged upstream, in two parts: a rename patch and a content patch; the FFmpeg author was credited as author for the latter, while I did a signed-off-by). I also contributed a 3000+ line documentation patch, standardizing the libavfilter documentation and removing numerous English errors, and triaged a few bugs, git bisecting the one that was reproducible.

    Sunday, 13 April

    And how it nearly ruined another video coding standard.

    Everyone knows that interlacing was a trick in the '80s for pseudo motion compensation with analogue video. This more or less worked because it mimicked how television worked back then. This technique was preserved when flat panels for pc and tv were introduced, for a mix of backward compatibility and technical limitations, and video coding features interlacing in MPEG2 and H264 and similar.

    However as with black and white, TACS and Gopher, old technology has to be replaced with modern and efficient technology, as a trade off of users' interests and technology providers' market prospects. In case you are not familiar, interlacing is a mess to support, makes decoding slower and heavily degrades quality. People saying that interlacing saves bandwidth do not know much about video coding and bad marketing claiming that higher resolution is better than higher framerate has an effect too.

    So, when ITU and then MPEG set out to establish the mandates for a new video standard capable of superseding H264, it was decided that interlacing was old enough, did more harm than good and it was time for retirement: HEVC was going to be the first video codec to officially deprecate interlacing.

    Things went pretty swell during its development, until a few months before the completion of the standard. A group of US companies complained that the proposed tools were not sufficient (a set of SEI messages and treating fields like progressive frames) and heavily protested with both standardisation bodies. ITU firmly rejected the idea (with the video group chair threatening to step down) while MPEG set out to understand the needs of the industry and see if there was anything that could be done.

    An ad-hoc group was established to see if there was any evidence that interlaced coding tool would have improved the situation. Things looked really shady, the Requirements group even mentioned that it was the first time that an AhG was established to look for evidence, instead of establishing an AhG because there was evidence. Several liasons from EBU and other DVB members tried to point out this absurdity while the threat of adding interlacing back in HEVC became real. Luckily the first version of the specifications got published in the meantime, so this decision didn't slow down the standardisation process.

    Why so much love towards interlacing? Well in the "rebellious" group defence, it is true that interlaced content in HEVC is less performant than in H264; however it is also true that such deinterlaced content in HEVC outperforms H264 in any configuration. Truth is that mass marketed deinterlacers (commonly found in televisions for example) bring a lot of royalty income, so it is normal that companies with vested interests would prefer to have interlacing in a soon-popular video standard like HEVC. Also in markets like US where the network operator (which has control on the encoding but not on the video source) might differ from the content provider, it could be politically difficult to act as a carrier only if you have to deinterlace a video.

    However these problems are actually not enough for forcing every encoder, decoder, analyser to support a deprecated technology like interlacing. Technical problems can be solved with good deinterlacers at the top of the distribution chain, while political ones can be solved amending contracts. Plus having progressive only video will definitely improve quality and let the industry concentrate on other delicate subjects, like bit depth, both properties going in favour of users' interests.

    At the last MPEG meeting, the "rebellious" group which had been working on reintroducing interlacing for a year provided no real evidence that interlaced coding tools would improve HEVC at all. The only sensible solution was to disband the group over this wasted effort and support progressive video only, which is what happened luckily. So now both ITU and MPEG support progressive video only and this has finally nailed it.

    Interlacing is dead, long live progressive.

    Written by Vittorio Giovara (projectsymphony@gmail.com)
    Published under a CC-BY-SA 3.0 license.

    Tuesday, 25 March

    I am very glad to announce that Libav 10 has been released!

    This has a bunch of features that I contributed to, in particular regarding stereoscopic video and interlaced filtering, but more importantly this release has the work of an awesome group of people which has been carried out for a whole year. This is the magic of open source!

    I joined the group more or less one year ago, with some patches regarding an obscure H.264 specification which I then later reimplemented in HEVC and then I wrote a few filters I needed and then designed an API and then, wow! A whole year passed without me noticing, and I am still around, sending patches to the same group of people who welcomed someone who had problems with shifting values (sad but true story)!

    I met the team both at VDD and FOSDEM and they've been the most exciting conferences I ever went to (and I went to a lot of them). I couldn't believe I was with the devteam of my favourite multimeida opensource projects I've been following since I was a kid! Until a year ago, I saw the names from the commits and the blogposts from both VideoLAN and Libav projects and I had been thinking "Oh wouldn't it be so cool to be like one of them".

    The answer is yes, it definitely would, and it's something that can happen if one is really committed in it! The Libav Info page states "Being a committer is a duty, not a privilege", but it sure does feel like one.

    Thanks for this exciting year guys, I look forward to the next ones.

    Monday, 24 March

    ...using latest modern tools!

    X264 and VLC are two of the most awesomest opensource software you can find on-line and of course the pose no problem when you compile them on a Unix environment. Too bad that sometimes you need to think of Windowze as well, so we need a way to crosscompile that software: in this blogpost, I'll describe how to achieve that, using modern tools on a Ubuntu 12.04 installation.

    [0] Sources
    It goes without saying that without the following guides, I'd have had a much harder time!
    http://alex.jurkiewi.cz/blog/2010/cross-compiling-x264-for-win32-on-ubuntu-linux
    https://bbs.archlinux.org/viewtopic.php?id=138128
    http://wiki.videolan.org/Win32Compile
    http://forum.videolan.org/viewtopic.php?f=32&t=101489
    So a big thanks to all the original authors!

    [1] Introduction
    When you crosscompile you just use the same tools and toolchains that you are used to, gcc, ld and so on, but configured (and compiled) so that they produce executable code for a different platform. This platform can vary both in software and in hardware and it is usually identified by a triplet: the processor architecture, the ABI and the operating system.

    What we are going to use here is i686-w64-mingw32, which identifies any x86 cpu since the Pentium III, the w64 ABI used on modern Windows NT systems (if I'm not wrong), and the mingw32 architecture, that is the Windows gcc variant.



    [2] Prerequisites
    Note that the name of the packages might be slightly different according to your distribution. We are going to need a quite recent mingw-runtime for VLC (>=3.00) which has not yet landed on Ubuntu, so we'll take it from our Debian cousins.

    Execute this command

    $ wget http://ftp.jp.debian.org/debian/pool/main/m/mingw-w64/mingw-w64-dev_3.0~svn4933-1_all.deb
    $ sudo dpkg -i mingw-w64-dev_3.0~svn4933-1_all.deb


    and then install stock dependencies


    $ sudo dpkg -i gcc-mingw-w64 g++-mingw-w64
    $ sudo dpkg -i pkg-config yasm subversion cvs git-core

    [3] x264 and libav 
    x264 has very few dependencies, just pthreads and zlib, but it reaches its full potential when all of them are satisfied (encapsulation, avisynth support and so on).

    Loosely following Alex Jurkiewicz's work, we create a user-writable folder and then we prepare a script that sets some useful variables every time.


    $ mkdir -p ~/win32-cross/{src,lib,include,share,bin}
    #!/bin/sh

    TRIPLET=i686-w64-mingw32

    export CC=$TRIPLET-gcc
    export CXX=$TRIPLET-g++
    export CPP=$TRIPLET-cpp
    export AR=$TRIPLET-ar
    export RANLIB=$TRIPLET-ranlib
    export ADD2LINE=$TRIPLET-addr2line
    export AS=$TRIPLET-as
    export LD=$TRIPLET-ld
    export NM=$TRIPLET-nm
    export STRIP=$TRIPLET-strip

    export PATH="/usr/i586-mingw32msvc/bin:$PATH"
    export PKG_CONFIG_PATH="$HOME/win32-cross/lib/pkgconfig/"

    export CFLAGS="-static -static-libgcc -static-libstdc++ -I$HOME/win32-cross/include -L$HOME/win32-cross/lib -I/usr/$TRIPLET/include -L/usr/$TRIPLET/lib"
    export CXXFLAGS="$CFLAGS"

    exec "$@"
    Please not the use of the CFLAGS variables: without all the static parameters, the executable will dynamically link gcc, so you'll need to bundle the equivalent dll. I prefer to have one single exe, so everything goes static, but I'm not really sure which flag is actually needed. If you have any idea, please drop me a line.

    Anyway, let's compile latest revision of pthreads (2.9.1 as of this writing)

    $ cd ~/win32-cross/src
    $ wget -qO - ftp://sourceware.org/pub/pthreads-win32/pthreads-w32-2-9-1-release.tar.gz | tar xzvf -
    $ cd pthreads-w32-2-9-1-release
    $ make GC-static CROSS=i686-w64-mingw32-
    $ cp libpthreadGC2.a ../../lib
    $ cp *.h ../../include

    and zlib (1.2.7) - we need to remove the references to the libc library (which is implied anyway) otherwise we will get a linkage failure

    $ cd ~/win32-cross/src
    $ wget -qO - http://zlib.net/zlib-1.2.7.tar.gz | tar xzvf -
    $ cd zlib-1.2.7
    $ ../../mingw ./configure
    $ sed -i"" -e 's/-lc//' Makefile
    $ make
    $ DESTDIR=../.. make install prefix=

    Now it's turn for libav, so that x264 can use different input chroma and other stuff. If you need libav exececutables, you might want to change the configure line so that it suits you


    $ cd ~/win32-cross/src
    $ git clone git://git.libav.org/libav.git
    $ cd libav
    $ ./configure \
    --target-os=mingw32 --cross-prefix=i686-w64-mingw32- --arch=x86 --prefix=../.. \
    --enable-memalign-hack --enable-gpl --enable-avisynth --enable-runtime-cpudetect \
    --disable-encoders --disable-muxers --disable-network --disable-devices
    $ make
    $ make install

    and the nice tools that give more output options


    $ cd ~/win32-cross/src
    $ svn checkout http://ffmpegsource.googlecode.com/svn/trunk/ ffms
    $ cd ffms
    $ ../../mingw ./configure --host=mingw32 --with-zlib=../.. --prefix=$HOME/win32-cross
    $ ../../mingw make
    $ make install


    $ cd $HOME/win32-x264/src
    # Create a CVS auth file on your machine
    $ cvs -d:pserver:anonymous@gpac.cvs.sourceforge.net:/cvsroot/gpac login
    $ cvs -z3 -d:pserver:anonymous@gpac.cvs.sourceforge.net:/cvsroot/gpac co -P gpac
    $ cd gpac
    $ chmod +rwx configure src/Makefile
    # Hardcode cross-prefix
    $ sed -i'' -e 's/cross_prefix=""/cross_prefix="i686-w64-mingw32-"/' configure
    $ ../../mingw ./configure --static --use-js=no --use-ft=no --use-jpeg=no \
          --use-png=no --use-faad=no --use-mad=no --use-xvid=no --use-ffmpeg=no \
          --use-ogg=no --use-vorbis=no --use-theora=no --use-openjpeg=no \
          --disable-ssl --disable-opengl --disable-wx --disable-oss-audio \
          --disable-x11-shm --disable-x11-xv --disable-fragments--use-a52=no \
          --disable-xmlrpc --disable-dvb --disable-alsa --static-mp4box \
          --extra-cflags="-I$HOME/win32-cross/include -I/usr/i686-w64-mingw32/include" \
          --extra-ldflags="-L$HOME/win32-cross/lib -L/usr/i686-w64-mingw32/lib"
    # Fix pthread lib name
    $ sed -i"" -e 's/pthread/pthreadGC2/' config.mak
    # Add extra libs that are required but not included
    $ sed -i"" -e 's/-lpthreadGC2/-lpthreadGC2 -lwinmm -lwsock32 -lopengl32 -lglu32/' config.mak
    $ make
    # Make will fail a few commands after building libgpac_static.a
    # (i586-mingw32msvc-ar cr ../bin/gcc/libgpac_static.a ...).
    # That's fine, we just need libgpac_static.a 
    i686-w64-mingw32-ranlib bin/gcc/libgpac_static.a 
    $ cp bin/gcc/libgpac_static.a ../../lib/
    $ cp -r include/gpac ../../include/

     
    Finally we can compile x264 at full power! The configure script will provide a list of what features have been activated, make sure everything you need is there!

    $ cd ~/win32-cross/src
    $ git clone git://git.videolan.org/x264.git
    $ cd x264
    $ ./configure --cross-prefix=i686-w64-mingw32- --host=i686-w64-mingw32 \
          --extra-cflags="-static -static-libgcc -static-libstdc++ -I$HOME/win32-cross/include" \
          --extra-ldflags="-static -static-libgcc -static-libstdc++ -L$HOME/win32-cross/lib" \
          --enable-win32thread
    $ make

    And you're done! Take that x264.exe file and use it wherever you want!
    Most of the work here has been outlined by Alex Jurkiewicz in this guide so checkout his blog for more nice guides!


    [4] VideoLAN
    On the other hand, VLC has a LOT of dependencies, but thankfully it also has a nice way to get them working quickly. If you read the wiki guide, you'll notice that it will use i586-mingw32msvc everywhere, but you should definitely avoid that! In fact that one offers a very old toolchain, under which VLC will fail to compile! Also the latest versions provides much better code, x264 will weight 46MB against 38MB in one case!

    So let's update every script to the more modern version i686-w64-mingw32! As usual, first of all get the sources


    $ git clone git://git.videolan.org/vlc.git vlc
    $ cd vlc 
    And let's get the dependencies through the contrib scripts, qt4 needs to be compiled by hand as the version in Ubuntu repositories doesn't cope well with the rest of the process. I also had to remove some of the files because they were of the wrong architecture (mileage might vary here) .

    $ mkdir -p contrib/win32
    $ cd contrib/win32
    $ ../bootstrap --host=
    i686-w64-mingw32
    $ make prebuilt
    $ make .qt4
    $ rm ../i686-w64-mingw32/bin/{moc,uic,rcc}
    $ cd -

    We now return to the main sources folder and launch the boostrap and configure process; you need some standard automake/libtool dependencies for this.


    $ ./bootstrap
    $ mkdir win32 && cd win32
    $ ../extras/package/win32/configure.sh --host=i686-w64-mingw32
    $ ./compile
    $ make package-win-common

    Let's grab something to drink and celebrate when the compilation ends! You'll find all the necessary files in the vlc-x.x.x folder. A big thanks goes to the wiki authors and j-b who gave me pointers on #videolan irc.

    [5] Conclusions
    Whelp, that was a long run! As additional benefit you are able to customize every single piece of software to your need, eg. you can modify the libav version that you are going to use for Vlc as you wish! Also crosscompiling is often treated as black magic, but in reality is a simple process that just needs more careful configuration. Errors often are related to wrong paths or missing dependencies and sometimes a combination of both; don't lose hope and keep going until you get what you want!

    For future reference, all (or most of) functions and structs in libav have a prefix that indicates the exposure of that functions. Those are

    • av_ meaning a public function, present in the API;
    • ff_ meaning a private function, not present in the API;
    • avpriv_ meaning inter-library private function, used internally across libraries only.
    Source: #libav-devel

    Friday, 13 January

    Well, I've finished the new audio decoding API, which has been merged into Libav master. The new audio encoding API is basically done, pending a (hopefully final) round of review before committing.

    Next up is audio timestamp fixes/clean-up. This is a fairly undefined task. I've been collecting a list of various things that need to be fixed and ideas to try. Plus, the audio encoding API revealed quite a few bugs in some of the demuxers. Today I started a sort of TODO list for this stage of the project. I'll be editing it as the project continues to progress.

    Friday, 28 October

    For the past few weeks I've been working on a new project sponsored by FFMTech. The entire project involves reworking much of the existing audio framework in libavcodec.

    Part 1 is changing the audio decoding API to match the video decoding API. Currently the audio decoders take packet data from an AVPacket and decode it directly to a sample buffer supplied by the user. The video decoders take packet data from an AVPacket and decode it to an AVFrame structure with a buffer allocated by AVCodecContext.get_buffer(). My project will include modifying the audio decoding API to decode audio from an AVPacket to an AVFrame, as is done with video.

    AVCODEC_MAX_AUDIO_FRAME_SIZE puts an arbitrary limit on the amount of audio data returned by the decoder. For example, each FLAC frame can hold up to 65536 samples for 8 channels at 32-bit sample depth, which is 2097152 bytes of raw audio, but AVCODEC_MAX_AUDIO_FRAME_SIZE is only 192000. Using get/release_buffer() for audio decoding will solve this problem. It will, however, require changes to every audio decoder. Most of those changes are trivial since the frame size is known prior to decoding the frame or is easily parsed. Some of the changes are more intrusive due to having to determine the frame size prior to allocating and writing to the output buffer.

    As part of the preparation for the new API, I have been cleaning up all the audio decoder, which has been quite tedious. I've found some pretty surprising bugs along the way. I'm getting close to finishing that part so I'll be able to move on to implementing the new API in each decoder.

    Wednesday, 27 July

    So, I've moved on from AHT now, and it's on to Spectral Extension (SPX).  I got the full syntax working yesterday, now I just need to figure out how to calculate all the parameters.  I have a feeling this will help quality quite a bit, especially when used in conjunction with variable bandwidth/coupling.  My vision for automatic bandwidth adjustment is starting to come together.

    SPX encoding/decoding is fairly straightforward, so I expect this won't take too long to implement.  Similar to channel coupling, the encoder writes coarsely banded scale factors for frequencies above the fully-encoded bandwidth, along with noise blending factors.  The decoder copies lower frequency coefficients to the upper bands, multiplies them by the scale factors, and blends them with noise (which has been scaled according to the band energy and the blending factors in the bitstream).  For the encoder, I just need to make the reconstructed coefficients match the original coefficients as closely as possible by calculating appropriate spectral extension coordinates and blending factors.  Also, like coupling coordinates, the encoder can choose how often to resend the parameters to balance accuracy vs. bitrate.

    Once SPX encoding is working properly, I'll revisit variable bandwidth.  However, instead of adjusting the upper cutoff frequency (which is somewhat complex to avoid very audible attack/decay), it will adjust the channel coupling and/or spectral extension ranges to keep the cutoff frequency constant while still adjusting to changes in signal complexity to keep a more stable quality level at a constant bitrate.  This could also be used in a VBR mode with constrained bitrate limits.

    If you want to follow the development, I have a separate branch at my Libav github repository.
    http://github.com/justinruggles/Libav/commits/eac3_spx

    I finally got the complete AHT syntax working properly.  Unfortunately, the quality seems to be lower at all bitrates than with the normal AC-3 quantization.  I'm hoping that I just need to pick better gain values, but I have a suspicion that some of the difference is related to vector quantization, which the encoder has no control over (a basic 6-dimensional VQ minimum distance search is the best it can do).

    My first step is to find out for sure if choosing better gain values will help.  One problem is that the bit allocation model is saying we need X number of bits for each mantissa.  Using mode=0 (all zero gains) gives exactly X number of bits per mantissa (with no overhead for encoding the gain values), but the overall quality is lower than with normal AC-3 quantization or even GAQ with simplistic mode/gain decisions.  So I think that means there is some bias built-in to the AHT bit allocation table that assumes GAQ will appropriately fine-tune the final allocations.  Additionally, it could be that AHT should not always be turned on when the exponents are reused in blocks 1 through 5 (the condition required to use AHT).  This is probably the point where I need a more accurate bit allocation model...

    edit: After analyzing the bit allocation tables for AC-3 vs. E-AC-3, it seems there is no built-in bias in the GAQ range.  They are nearly identical.  So the difference is clearly in VQ.  Next step, try a direct comparison of quantized mantissas using VQ vs. linear quantization and consider that in the AHT mode decision.

    edit2: dct+VQ is nearly always worse than linear quantization...  I also tried turning AHT off for a channel if the quantization difference was over a certain threshold, but as the threshold approached zero, the quality approached that with AHT turned off.  I don't know what to do at this point... *sigh*

    note: analyzation of a commercial E-AC-3 sample using AHT shows that AHT is always turned on when the exponent strategy allows it.

    edit3: It turns out that the majority of the quality difference was in the 6-point DCT.  If I turn it off in both the encoder and decoder (but leave the quantization the same) the quality is much better.  I hope it's a bug or too much inaccuracy (it's 25-bit fixed-point) in my implementation...  If not then I'm at another dead-end.

    edit4: I'm giving up on AHT for now.  The DCT is definitely correct and is very certainly causing the quality decrease.  If I can get my hands on a source + encoded E-AC-3 file from a commercial encoder that uses AHT then I will revisit this.  Until then, I have nothing to analyze to tell me how using AHT can possibly produce better quality.

    Friday, 17 June

    Well, I finally got a working E-AC-3 encoder committed to Libav.  The bitstream format does save a few bits here and there, but the overall quality difference is minimal.  However, it will be the starting point for adding more E-AC-3 features that will improve quality.

    The first feature I completed was support for higher bit rates.  This is done in E-AC-3 by using fewer blocks per frame.  A normal AC-3 frame has 6 blocks of 256 samples each, but E-AC-3 can reduce that to 1, 2, or 3 blocks.  This way a small range can be used for the per-frame bit rate, but it still allow for increasing the per-second bit rate.  For example, 5.1-channel E-AC-3 content on HD-DVDs was typically encoded at 1536 kbps using 1 block per frame.

    Currently I am working on implementing AHT (adaptive hybrid transform).  The AHT process uses a 6-point DCT on each coefficient across the 6 blocks in the frame.  It basically uses the normal AC-3 bit allocation process to determine quantization of each DCT-processed "pre-mantissa" but it uses a finer resolution for quantization and different quantization methods.  I have the 6-point DCT working and one of the two quantization methods.  Now I just need to finish the other quantization method and implement mantissa bit counting and bitstream output.

    Feeds

    FeedRSSLast fetched
    Ambient Language XML 2020-07-14 18:31
    Kostya's Boring Codec World XML 2020-07-14 18:31
    libav – alpaastero XML 2020-07-14 18:31
    Libav – Federico Tomassetti – Consultant Software Engineer XML 2020-07-14 18:31
    libav – Luca Barbato XML 2020-07-14 18:31
    Multimedia – Flameeyes's Weblog XML 2020-07-14 18:31
    Project Symphony XML 2020-07-14 18:31
    Sasshka's XML 2020-07-14 18:31
    Scarabeus' blag XML 2020-07-14 18:31

    Planet Feed

    rss 2.0 | opml

    About

    If you want your blog to be added, send an email to Luca Barbato.

    Planet service provided by Luminem.