As I wrote in my previous post, I had a functioning audio player nearing completion. Now I've finally added all the features I wanted and can call it done.
While previously I mostly ranted about the bloat introduced by the component authors, here I'd like to describe the design and the reasoning behind it.
First of all, what I wanted to have: a player for audio that uses NihAV for decoding, a minimum of outside dependencies (i.e. just the audio output interface and nothing else), a simple design, and the ability to pause and seek. Support for certain kinds of corner cases was sacrificed for simplicity. After all, I'm using just Linux and I know what kind of content I play and how—so why bother about whether it will work just as well compiled for Windows or under some exotic terminal. The same applies to custom outputs, filters and such.
So, how does it work? About as simply as you can make it: first it spawns a separate thread to read key presses from the terminal, and then for each provided input file it opens it, configures the SDL output (16-bit mono or stereo, I don't have multichannel headphones or good enough ears to enjoy the intricacies of 24-bit audio) and then simply, in a loop, displays the current time, reacts to the commands sent by the terminal reader thread, and refills the audio queue (implemented by SDL2) when its fill level drops too low.
While this is enough to make a good enough player, it has some drawbacks too. For starters, certain formats may take a significant time to decode one frame (and we all know about Monkey's Audio insane mode), which means that while a frame is being decoded the screen is not updated, which may be a bit irritating. But compared to the alternative of having another thread for audio decoding (with additional logic to control it) I'd rather pick the simpler solution. Another drawback is that you cannot do anything to the already queued data, so if you change the volume it will be applied only to the newly queued data, and that is likely to happen in a second or more (again, a problem mostly for a certain codec in a certain mode). As I mentioned before, the proper solution would involve a separate object handling audio callbacks that maintains its own queue and applies volume modification only when output samples are requested, but I'll probably leave that to the upcoming video player.
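To give an idea of that structure, here is a minimal sketch of it: the terminal reader thread plus the refill loop, with a hypothetical AudioOut trait standing in for the SDL2 audio queue and the command handling reduced to the bare minimum.

use std::io::Read;
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Hypothetical stand-in for the SDL2 audio queue interface.
trait AudioOut {
    fn queued_samples(&self) -> usize;
    fn queue(&mut self, samples: &[i16]);
}

enum Command { Quit, TogglePause }

const LOW_WATERMARK: usize = 44100; // refill when less than about a second is queued

fn play<A: AudioOut>(out: &mut A, mut decode_more: impl FnMut() -> Option<Vec<i16>>) {
    // a separate thread reads key presses and turns them into commands
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        for byte in std::io::stdin().bytes().flatten() {
            let cmd = match byte {
                b'q' => Command::Quit,
                b' ' => Command::TogglePause,
                _ => continue,
            };
            if tx.send(cmd).is_err() { break; }
        }
    });

    loop {
        // react to commands sent by the reader thread (pause and seek omitted here)
        if let Ok(Command::Quit) = rx.try_recv() { break; }
        // refill the queue when its fill level drops too low
        if out.queued_samples() < LOW_WATERMARK {
            match decode_more() {
                Some(samples) => out.queue(&samples),
                None => break, // end of stream
            }
        }
        thread::sleep(Duration::from_millis(50));
    }
}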
Another thing is that I've not particularly tried to optimise the codecs, yet their speed seems to be satisfactory for my needs. On my more than ten years old laptop I get these results for various lossless audio decoders on audio CD rips (my music collection is mostly in lossless formats after all):
- … with avconv, 6.3s with nihav-tool (mostly because I don't have any optimisation for unary code reading and that's where most of the time is wasted). Still, it's 150 times real-time decoding so it's good enough;
- … with avconv and 33s with nihav-tool;
- … with avconv, 7:10 with nihav-tool (and it spends over 90% of the time performing adaptive filtering, which I do with 32-bit ints instead of 16-bit ones and no SIMD except what the compiler generates);
- … with avconv, 18s with nihav-tool;
- … with avconv and 54s with nihav-tool.
So while it's not as fast as the usual alternative, it is fast enough on my hardware for practical purposes (i.e. playing audio without loading the CPU too much, with one notable exception). And another problem with Monkey's Audio insane mode is latency. A frame in this mode contains by default a whopping 1179648 samples (over 26 seconds of audio at 44.1kHz) and it takes about two and a half seconds to decode such a frame. unmac, on which the libavcodec decoder is based, exploits the fact that the newer APE format codes samples in an interleaved manner (yes, old versions coded all samples for a single channel together—but then again, insane frames there were about six seconds long) and decodes them in blocks of 4608 samples by default, so you can have 1/256th of an insane frame decoded immediately and the rest decoded later. It also requires the whole frame data to be buffered at once and the decoder to be called again and again until the whole frame is decoded. Since that's not possible with the current NihAV design, I can probably make a special hacky demuxer and decoder specifically for the player, based on a sequence of a single data frame and dummy frames that just tell the decoder to decode more of the first frame. At least nothing prevents me from bundling such a demuxer and decoder with the player and registering them just there. But that should be done only if the problems with playing back such files irritate me enough to implement it.
In either case, nihav-sndplay
is done and works sufficiently well, so it's time to move on to writing a satisfactory video player. A good task for the rest of this year and maybe some chunk of the next one.
So after weeks of doing nothing and looking at lossless audio codecs (in no particular order) I’m going back to developing NihAV
and more particularly an audio player.
The main problem with nihav-player
concept is that 1) it's primarily a video player, 2) it's based on the outdated SDL1 instead of SDL2, and 3) the SDL1 wrapper, especially in the audio area, was incomplete and my shim implementation of the audio callback as a trait is not good enough so it deadlocks from time to time. So I've finally installed SDL2 and started to write a new audio-only player based on it (well, considering that I'm not trying to write a cross-platform player an ALSA wrapper would do, but I'd rather avoid it). It turns out that SDL2 has its own audio queue interface which simplifies my life (at least for now)—instead of NIHing it and encountering deadlocks I can now simply send data to the queue, check how many samples are still waiting to be played and add more when I want to. Maybe later, when I need more advanced processing, I'll implement a proper callback-based audio processing pipeline, but for my current needs this is enough.
And now I'd like to complain about the crates I use and don't use. This is not a fault of the Rust programming language but rather a downside created by the modern approach to programming and enabled by all those package managers like Cargo and most famously npm.js (which has become synonymous with bloated library dependencies).
First of all I need tcgetattr()/tcsetattr() for reading commands from the terminal. If I simply use the libc crate it does what I want, but if I used the suggested "friendly" wrapper for it called nix, it would try to pull in another two or three crates and would fail to build with rustc 1.33, which I still have no reason to migrate from. SDL2 wrappers are even worse. sdl2-sys, which provides just the bindings to the library, depends on the cfg-if and libc crates, which is reasonable. The sdl2 crate though is a bloated beast pulling a dozen dependencies, some of them duplicated (before version 0.33 it pulled the rand crate by default, which pulled a dozen other crates, some of them depending on a different rand_core version from the others; and the saddest thing is that it was required just to allow generating random pixel values). And I can't compile that version because it depends on the TryFrom trait instead of the num-traits crate and you need a compiler at least two months younger to support it. The older version should work just fine though.
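For reference, the terminal setup I mean is just a few libc calls; here is a minimal sketch of switching the terminal into non-canonical, no-echo mode with the libc crate alone (error handling omitted):

// Minimal sketch: put the terminal into non-canonical, no-echo mode using
// only the libc crate, so single key presses can be read immediately.
fn set_raw_terminal() -> libc::termios {
    unsafe {
        let mut tio: libc::termios = std::mem::zeroed();
        libc::tcgetattr(libc::STDIN_FILENO, &mut tio);
        let old = tio;
        tio.c_lflag &= !(libc::ICANON | libc::ECHO); // no line buffering, no echo
        libc::tcsetattr(libc::STDIN_FILENO, libc::TCSANOW, &tio);
        old // return the original settings so they can be restored on exit
    }
}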
The rant is over, back to the NihAV
matters.
Now my player can play various formats supported by NihAV
, pause/resume playback, seek forward and backward. Essentially it’s good enough for everyday music needs. I’m still considering whether I should not support MP3 or definitely not support it.
To achieve that I had to fix some bugs, add seeking for FLAC files without a seektable (which seems to be virtually all FLAC files out there), make demuxers report duration (either as a container or a stream property), plus do some optimisations in the Monkey's Audio decoder.
The optimisations were a funny thing. At first I tried the SSE intrinsics provided by the compiler and it turned out that unrolling the loop by 16 leads to wrong results in one of the sum registers. But unrolling it by 8 worked fine—except that the compiler's auto-vectorisation did an even better job on the same code turned into a standalone function with iterators. Plus IMO x86 intrinsics look ugly with their _mm_operation_epi32 naming. Maybe if I ever get to writing an H.264 decoder for my own viewing needs I'll try standalone assembly, or inline assembly if the asm!() syntax is stabilised by then.
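For illustration, this is the kind of standalone iterator-based function I mean: a sketch of an adaptive-filter step done with 32-bit ints (the scaling shift and the update rule are made-up placeholders, not the actual Monkey's Audio ones); written like this the compiler auto-vectorises the inner product quite well.

// Sketch of an adaptive filter prediction step done with 32-bit ints:
// an inner product over the history followed by a sign-based coefficient update.
fn filter_predict(coeffs: &mut [i32], hist: &[i32], residue: i32) -> i32 {
    let pred: i32 = coeffs.iter().zip(hist.iter()).map(|(&c, &h)| c * h).sum();
    let sample = residue + (pred >> 10);
    // adapt the coefficients towards the sign of the residue
    for (c, &h) in coeffs.iter_mut().zip(hist.iter()) {
        *c += h.signum() * residue.signum();
    }
    sample
}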
So for my player I think I just need to add volume control and print the name of the file currently being played and not just the current time. But this should not be that hard…
Recently The Mike asked if I could look at this format. In case you didn't know, The Multimedia Mike is one of the under-appreciated founders of opensource multimedia, involved both in reverse engineering codecs and maintaining infrastructure for about two decades (for example this particular blog has been here for fifteen years thanks to him and his maintenance efforts). So of course I had to look at it, if only out of sheer respect.
Sonarc is probably the first known lossless audio codec as the copyright mentions year 1992 as the first date (Shorten and VocPack appeared in 1993). Spoiler: it turned out to be closer to Shorten in design.
This was harder to RE because it was larger (the decompressor was three to four times larger than VocPack's) and the original was written in Borland Pascal with all the peculiarities that brings. By those peculiarities I mean mostly Pascal strings. Well, the code for manipulating them is annoying to parse but not too bad; the main problem is that they are put in the same segment as the code, right before the function that uses them, and that confuses Ghidra, which for some reason selects the segment with the standard library routines for them instead (and uninitialised variables are not assigned to any segment at all). The write() implementation is also no fun.
Side note: back in the day Turbo Pascal was probably the best programming language for DOS and back in school at least two of my schoolmates were doing crazy things with it (and Delphi later) which I couldn't (and I was writing in C as I still do today). Yet somehow the popularity of the language vanished and I haven't heard anything about them becoming famous programmers (neither did I, but they had better chances). And the only modern project written in Pascal that I'm aware of is Hedgewars.
Anyway, let’s talk about the format itself. Sonarc can compress raw PCM, .voc
and .wav
into either its own format or into .wav
and it supports both 8- and 16-bit audio.
From what I saw it uses the same general approach: optionally applying an LPC filter and coding the residues. Residues can be coded in two different ways: an old one for 8-bit audio and a new one for 8- and 16-bit audio. The old 8-bit coding uses one of eight different static Huffman codebooks or can store residues as raw bytes (I can't remember many other codecs doing the same except for MLP and DT$-HD Lossless, probably because why compress audio in that case). The new 8-/16-bit coding still uses fixed codebooks but in a different fashion: now they simply code the number of bits for the residue. It does not look like the data is split into segments but I may be wrong (I/O is still not the easiest thing to get around there).
Overall it’s not a bad codec for its time and e.g. FLAC has not come that far away from it in concepts (except that it uses Rice codes and has independent frames plus partitioning inside individual frame for better compression). I hope though there are no older lossless audio codecs out there to be discovered (CCITT G.711 infinite-law with its fixed 1:1 compression does not count).
One of the issues with On2 VPx family is that they started it from VP3 while having four different TrueMotion codecs before that (it’s like the company was called Valve and not Duck at that time). But I wanted to look at some lossless audio codecs and there’s VocPack or VP for short which has versions 1 and 2. Bingo!
This is a very old lossless audio codec that appeared in 1993 along with Shorten and, as it turns out, originated the second approach to lossless audio compression. While Shorten was a simple format oriented towards fast decoding and thus used fixed prediction (either an LPC filter or even a fixed prediction scheme) and Rice codes for residues (the same scheme used in FLAC and TAK), VocPack employed an adaptive filter and arithmetic coding (the approach carried on by LA, Monkey's Audio, OptimFROG and such). And it was made for DOS and 8-bit audio! Well, version 2 added support for 16-bit but it seems to compress only the high 8 bits of each sample anyway while transmitting the low bits verbatim.
And it turned out to be my first real experience of using Ghidra with DOS executables. The main troubles were identifying library functions and dealing with pointers. Since it was compiled with Borland C++ 3.0 (who doesn't remember it?) it was rather easy to decompile, but library functions were not recognized (DOS executables don't get much love these days…). Still, by searching the disassembly for int 21h with Ralf Brown's interrupt list at hand it was easy to identify the calls for file operations (open/read/write/seek), from those infer the stdio library functions using them, and finally the code using all those getc()s. And of course the segmented memory model makes decompiling fun, especially when the decompiler can't understand segment/offset variables being used separately. As a result, sometimes you recognise an offset but have to look at the data segment yourself to see what it refers to; even worse, for some local variables Ghidra seemed to assume the wrong segment, which resulted in variables in the disassembly and decompiled output pointing to non-existent locations. Despite all of that it was rather easy to understand what the unpacker for VP1 does. VP2 has only a packer and no unpacker publicly available (feel free to trace the author and buy a copy from him that supports unpacking), plus it depends more on those wrongly understood global variables, which prevented me from understanding how encoding a residue works there. In theory you should be able to set the data segment manually but I don't see a point in spending more than a couple of hours on REing the format.
It was a nice distraction though.
As I’d mentioned in a previous post on lossless audio codecs, I wanted to look at some of them that are still not reverse engineered for documentation sake. And I did exactly that so now entries on LA, OptimFROG and RK Audio are not stubs any more but rather contain some information on how the codecs work.
And if you look at the LA structure you see a lot of filters of various sizes and structures, plus an adaptive weight used to select certain parameters. If you look at other lossless audio codecs with high compression and slow decoding like OptimFROG
or Monkey's Audio
you’ll see the same picture: several filters of different kinds and sizes layered over each other plus adaptive weights also used in residuals coding. Of course that reminded me of AV2 and more specifically about neural networks. And what do you know, Monkey's Audio
actually calls its longer filters neural networks (hence the name NNFilter.h
in the official SDK, and you can spot it in the version history as well, leaving no doubt that it's exactly neural networks it is named after).
Which leads me to the only possible conclusion: lossless audio codecs had been using neural networks for compression before it became mainstream and it gave them the best compression ratios in the class.
And if we apply all this knowledge to video coding then maybe in AV4 we'll finally see some kind of convolution filters processing whole tiles and then smaller blocks removing spatial redundancy, maybe with some compaction layers like many neural network designs have (or transforms for the largest possible block size in H.265/AV1/AVS2) and expansion layers (well, what do you think motion interpolation actually does?), and using RNNs to code the residues left after all the prediction.
While I have nothing against Rust as such and keep writing my pet project in Rust, there are still some deficiencies I find preventing Rust from being a proper programming language. Here I’d like to present them and explain why I deem them as such even if not all of them have any impact on me.
First and foremost, Rust does not have a formal language specification and by that I mean that while some bits like grammar and objects are explained, there are no formal rules to describe what language features can and cannot be. If you’ve ever looked at ISO C standard you’ve seen that almost any entity there has three or four parts in the description: formal syntax, constraints (i.e. what is not allowed or what can’t be done with it), semantics (i.e. what it does, how it impacts the program, what implementation caveats are there), and there may be some examples to illustrate the points. The best close equivalent in Rust is The Rust Reference and e.g. structure there is described in the following way: syntax (no objections there), definition in a form of “A struct is a nominal struct type defined with the keyword struct
.”, examples, a brief mention of empty (or unit-like) structures in the middle of examples, and “The precise memory layout of a struct is not specified.” at the end. I understand that adding new features is more important than documenting them but this is lame.
A proper mature language (with 1.0 in its version) should have a formal specification that should be useful both for people developing compilers and the programmers trying to understand certain intricacies of the language and why it does not work as expected (more on that later). For example, for that struct
definition I find lacking at least these: mentioning that you can have impl
for it (even a reference would do—even if you have to repeat it for every type), splitting off tuple structs into a separate entry because they are syntactically very different and raise the question of why you have anonymous tuples but not anonymous structs (which you also can't find out from the documentation), and of course creating a proper layout so that rather important information (about memory layout for example) is not lost among the examples.
And now to the specific problems I encounter quite often, where I don't know whether I understand things wrong or the compiler does. And since there's no formal specification I can't tell which one it is (even if the former is most probable).
Function/method calling convention. Here’s a simple example:
struct Foo { a: i32 }
impl Foo { fn bar(&mut self, val: i32) { self.a = val + 42; } }
fn main() {
    let mut foo = Foo { a: 0 };
    foo.bar(foo.a);
}
For now this won’t compile because of the borrowing but shouldn’t the compiler be smart enough to create a copy of foo.a
before the call? I'm not sure, but IIRC the current implementation first mutably borrows the object for the call and only then tries to borrow the arguments. Is it really so, and if yes, why? Update: I'm told that newer versions of the compiler handle it just fine but the question still stands (was it just a compiler problem or has the call definition been changed?).
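For completeness, the obvious workaround (and presumably what the newer compilers effectively accept) is to copy the field into a temporary first:

let mut foo = Foo { a: 0 };
let val = foo.a; // copy the field before the mutable borrow of foo starts
foo.bar(val);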
The other thing is the old C caveat of function arguments evaluation. Here’s a simple example:
let mut iter = "abc".chars();
foo(iter.next().unwrap(), iter.next().unwrap(), iter.next().unwrap());
So would it be foo('a','b','c')
or foo('c','b','a')
call? In C it's undefined because it depends on how the arguments are passed on the current platform (consider yourself lucky if you don't remember __pascal
or __stdcall
). In Rust it’s undefined because there’s no formal specification to tell you even that much. And it would be even worse if you consider that you may use the same source for indexing the caller object like handler[iter.next().unwrap() as usize].process(iter.next().unwrap());
in some theoretical bytecode handler (of course it’s a horrible way to write code and you should use named temporary variables but it should illustrate the problem).
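For example, the named-temporaries version the sane code would use looks like this:

let op = iter.next().unwrap() as usize; // evaluation order is now explicit
let arg = iter.next().unwrap();
handler[op].process(arg);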
And another source of annoyance for me is traits. I have almost no problems with owning/lifetime/borrowing concepts but traits get me almost every time. I’m vaguely aware that the answer to why the following problems exist is “because traits are implemented as a call table” but again, should they be implemented like that and what should be the constraints on them (after all the original object should be somehow linked to the trait pointer). So when you have a supertrait (i.e. trait Foo: Bar
) you can't easily cast it to the supertrait (e.g. &Foo -> &Bar
) without writing a lot of boilerplate code. And even worse if you convert an object into Box<trait>
there’s no way to get the original object back (still in boxed form of course; I remember seeing a special crate that implements a lot of boilerplate code in order to get a mutable reference though). To reiterate: the problem is not me being stupid but rather the lack of formal description on how it’s done and why what I want is so hard. Then I’d probably at least be able to realize how I should change my code to work around the limitations.
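A sketch of the boilerplate I mean for the supertrait cast (an explicit upcast method that every implementor has to provide; the trait names are hypothetical):

trait Bar { fn bar(&self); }
trait Foo: Bar {
    // boilerplate: every Foo implementor has to provide its own upcast to Bar
    fn as_bar(&self) -> &dyn Bar;
}

struct Thing;
impl Bar for Thing { fn bar(&self) {} }
impl Foo for Thing {
    fn as_bar(&self) -> &dyn Bar { self }
}

fn use_bar(obj: &dyn Foo) {
    let bar: &dyn Bar = obj.as_bar(); // no implicit &dyn Foo -> &dyn Bar cast
    bar.bar();
}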
rustc problems
No, I'm not going to talk about compilation speed. It's certainly a nuisance but not a problem per se. Here I want to point out rather theoretical problems that a mature language should not have. And having just one compiler is one of those problems (call that problem zero).
First of all, the bootstrapping process is laughably bad. I realize that it's never too easy, but if you call yourself a systems programming language you should be able to bootstrap a compiler in a sane number of steps. For instance, IIRC Guix has the following bootstrapping process for a C compiler: a simple C compiler in Scheme (for which you can often write an implementation in assembly by hand) compiles TCC, TCC compiles GCC 2.95, GCC 2.95 compiles GCC 3.7, GCC 3.7 compiles GCC 4.9. For rustc you should either start with the original compiler written in OCaml and compile every following version with the previous one (i.e. 1.17 with 1.16) or cheat by using mrustc written in C++, which can compile Rust 1.19 or 1.29 (without borrow checks), then compile 1.30 with 1.29, 1.31 with 1.30 and so on. The problem here is that you cannot skip versions and e.g. compile rustc 1.46 with rustc 1.36 (I'd be happy to learn that I'm wrong). IMO you should have a maybe inefficient compiler but one written in a dialect that a much older compiler can understand, i.e. rustc 1.0 should be able to compile a compiler for 1.10, which can be used to compile 1.20 and so forth. Of course it's a huge waste of resources for a rather theoretical problem but it may prove beneficial for the compiler design itself.
Then there’s LLVM dependency. I understand that LLVM
provides many advantages (like no need to worry about code generation for many platforms and optimising it) but it gives some disadvantages too. First, you don’t have a really self-hosting compiler (a theoretical problem but still a thing worth thinking about; also consider that you have to rely on a framework developed mostly by large corporations mostly in their own interests). Second, you’re limited by what it does e.g. I read complaints about debug builds being too slow mostly because of LLVM backend. And I suspect it still can’t do certain kinds of memory-related optimisations because it was designed with C++ compiler in mind which still has certain quirks regarding multiple memory access (plus IIRC there was one LLVM bug triggered by an infinite loop in Rust code that’s perfectly valid there but not according to C++ rules). I’m aware that cranelift
exists (and Rust front-end for GCC
) so hopefully this will be improved.
And finally there's a thing related to the previous problem. Rust has poor support for assembly. Of course not many people need standalone assembly rather than inline (which is still lacking, but asm! is almost there), but languages oriented towards systems programming support compiling assembly alongside the higher-level code, so it would be proper to support assembly files even without a preprocessor syntax as rich as GAS has. Fiddling with build.rs to invoke an external assembler is possible but not nice at all.
There's also one problem with the Rust std library that I should mention too. It's useless for interfacing the OS. Now if I want to do something natural to any UNIX system I need to at least import the libc crate and link against an external libc (it's part of the runtime anyway). One solution would be that crate I heard of that wanted to translate musl into Rust, so you could at least eliminate the linking step. But the proper solution would be to support at least an OS-specific syscall() in the std crate, as many interesting libc functions are just wrappers over it (like open()/write()/ioctl(); Windows is a different beast so I don't mind if it's std::os::unix::syscall and not something more common).
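To illustrate, such a wrapper boils down to very little; here is a sketch of a raw write done through the libc crate today, which is exactly the kind of thing a std-provided syscall() would replace (Linux-only, no error handling):

// Sketch: what many libc wrappers amount to, a single raw syscall.
fn raw_write(fd: i32, buf: &[u8]) -> isize {
    unsafe {
        libc::syscall(libc::SYS_write,
                      fd as libc::c_long,
                      buf.as_ptr() as libc::c_long,
                      buf.len() as libc::c_long) as isize
    }
}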
I’m not a Rust language architect and I’m extremely unlikely to become one but I have an opinion on what Rust lacks in order to become a proper mature language really fit for systems development (three things essentially: being fully self-hosted, having a specification, and being able to interface low-level stuff without resorting to C compiler or assembler). Hopefully this will be rectified despite the lack of Mozilla.
Occasionally I look at the experiments in the AV1 repository that should be the base for AV2 (unless Baidu rolls out VP11 from its private repository to replace it entirely). A year ago they added an intra mode predictor based on a neural network, and in August they added a neural-network-based loop filter experiment as well. So, to make AV2 both simpler to implement in hardware and improve its compression efficiency I propose to switch all possible coding tools to use misapplied statistics. This way it can also attract more people from the corresponding field to compensate for the lack of video compression experts. Considering the amount of pixels (let alone the ways to encode them) in a modern video it is BigData™ indeed.
Anyway, here is what I propose specifically:
As a result we'll have a rather simple codec with most blocks being neural networks doing specific tasks, an arithmetic coder to provide input values, some logic to connect those blocks together, and some leftover DSP routines, though I'm not sure we'll need those at this stage. This will also greatly simplify the encoder, as it will be more about producing fitting model weights than trying some limited set of encoding combinations. And it may also be the first truly next-generation video codec after H.261, paving the road to radically different video codecs.
From a hardware implementation point of view this will be a win too: you just need some ROM and RAM for the models plus a generic tensor accelerator (which have become common these days) and there's no need to design all those custom DSP blocks.
P.S. Of course it may initially be slow and work in a range of thousands FPS (frames per season) but I’m not going to use AV1 let alone AV2 so why should I care?
I've decided to add a couple of lossless audio formats in preparation for the long-term goal of having a NihAV
-based player (the debug tool nihav-player
that I currently have can't really count as one, especially considering that it does not play pure audio files and tends to deadlock in the SDL audio thread).
So I’ve added nihav-llaudio
crate with the four most common formats for the music I have, namely FLAC, Monkey's Audio, TTA and WavPack. And I guess it's time to revisit my opinion of various lossless audio formats now that I've (re)implemented support for some of them (I tried to summarise my views on them almost ten years ago). Let's see what has changed since then:
The sample count in a TTA1 frame is a multiple to 576 (sound buffer granule). Based on this, the “frame time” is defined as a constant 1.04489795918367346939. Thus, the sample count in a regular TTA1 frame determined as: regular TTA1 frame length = frame time * sample rate.
I'm no mathematician so this does not form a coherent logical chain for me; I'd use something like “frame length in samples is sample rate rounded up to a multiple of 576” instead of “sample rate multiplied by 256/245”. The main irritating point is that the last frame contains fewer samples and you need to signal that it's the last frame (or merely check whether you have enough bits left to decode a full frame after you've decoded enough samples for the last frame). Oh, and TTA2 seems to be still in development.
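In code the rule from the specification boils down to something like this (a sketch; the magic frame time constant 1.04489795918367346939 is just 256/245):

// Number of samples in a regular TTA1 frame for a given sample rate,
// following the "frame time * sample rate" definition from the spec.
fn tta_frame_length(sample_rate: u32) -> u32 {
    (sample_rate as u64 * 256 / 245) as u32
}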
Normally lossless audio formats either store offsets for each frame or have an easily recognizable header, but FLAC is different. It's obvious that the author was inspired by the MPEG audio header design, but those actually had frame sizes coded. Here, in order to find where a frame ends you need either to decode it or to calculate the CRC for the data you read (and in the likely case of false positives also check that the data is followed by a valid header). One could argue that there's often a seek table in a FLAC file, but e.g. in luckynight.flac those entries are for positions at multiples of ten seconds, making seeking to a more precise position a matter of skipping frames (which is fun—see above).
I remember reading somewhere (on Hydrogenaudio most likely) a brief story about the development of several popular lossless audio codecs (possibly even told by the author of one, but I might be wrong). Essentially it's not NIH syndrome but very close: somebody develops a format, another guy finds a minor flaw the original developer refuses to address (my memory is hazy but I think such things were mentioned as no plugin for some player or not supporting some tags) and develops another format. The number of formats that came into existence because somebody wanted to create a format and could not keep it to himself is pretty large too.
But those days seem to be over and maybe I'll reverse engineer some of those old codecs for documentation's sake, as there's very little risk that somebody would pick them up and make them widespread now. Alternatively I can rant about newer formats sucking as well. Though why wait, let's do it now:
Now back to doing nothing.
Now that (as I believe) I’ve fixed remaining reconstruction bugs in VX decoder, why not do a quick comparison of various video codecs developed by Actimagine and see how they differ (if at all).
There seem to be the following codecs:
And while they all are based on H.264 with finer block partitioning, there are some differences as well.
Proper structure. The original VX codec used a quantiser derived from the FPS and all frames were encoded in the same way, while the later codecs have I-frames and quantisers transmitted for each frame (as a delta for non-keyframes).
References and motion compensation. VX had three previous frames as reference ones, later codecs increased that number to five. VX had fullpel motion compensation, later codecs use halfpel MC.
Data coding. VX relied on Elias gamma codes for everything except coefficient coding, while later codecs use codebooks for most coded values. Also, while VX coded residue in 4×4 blocks in the H.264 way (starting from the end and with the tail of ones coded explicitly), newer codecs use separable transforms and the usual (zero run, coefficient level) coding. Additionally, only nine coding modes out of the twenty-four have survived after VX (intra prediction, MC with motion vectors coded, and splits).
Overall, while all those codecs are related, there are large differences between VX and the later Mobiclip variants, and the only differences between the Mobiclip variants are the colourspace (Mods uses the YCoCg model, HD uses the proper YUV model), the quantiser being clipped to the 12-52 range, and the block mode codebooks being different.
As I mentioned before, somebody has reverse-engineered decoders for Mobiclip (and a quick check on codebooks used tells me that Mobiclip HD and 3DS versions are the same) so if somebody needs them it should not be that hard to write a decoder.
Sometimes I like to play old strategy games from my youth: Civilization II, Settlers II, WarCraft II and Reunion. You have probably never heard about the last one since it's not from some famous studio but from some Hungarians, and it was published by a rather obscure publisher too.
The idea is about the same as in Settlers II but IN SPACE! In some near future an experimental spaceship somehow gets into an unknown star system, most of the technologies are lost and now you have to colonise planets, fight with aliens and find your way back home. This game combines some planet-building with space exploration and ground battles (there are also battles in space but they're fought without your involvement). And since it has a story you get events like a chance to obtain some technology or to break the alliance between your enemies. So it's an interesting mix overall and it explains why I still return to it from time to time. Sadly the game was programmed in the traditional Hungarian manner (remember, Hungarians are responsible for such popular software as Windows 95 or MPlayer) and its intro (a separate program) sometimes crashes and sometimes it even makes DosBox segfault. The main game is also prone to corruption and crashes (yet I still play it sometimes).
Anyway, today I stumbled upon a page of a guy who reverse-engineered the image format used in this game just by fiddling with it. It turned out to be compressed with an RLE similar to the one used in PCX (0x00-0xBF – a normal pixel, 0xC0-0xFF – a run of the next byte value repeated 0-63 times). Since the game has some animations as well, I decided to look at them.
So the intro uses mostly still images split into 640×100 strips (so that each fits into one segment, if you remember those) that are scrolled and faded in and out. And there's a special animation format for some in-game animations, similar to the picture format (as expected). An animation file is a series of frames (without a palette) coded with a similar RLE, but there are some quirks not encountered in the still images. First of all, frames are coded as differences and codes in the range 0x80-0xBF are used to signal how many pixels to skip. Second, it turns out that the codes 0x80 and 0xC0 are actually escape codes and are followed by a 16-bit value giving the actual skip or run length (and in the case of the 0xC0 code, a pixel value after that). Again, since the format is so simple it could be worked out just by looking inside the animation files and messing with a decoder.
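Put together, a decoder for one such delta frame would look roughly like this (a sketch based on the description above; the 16-bit values are assumed to be little-endian and bounds checks are omitted):

// Sketch: decode one delta frame of the animation RLE on top of the
// previous frame contents already present in dst.
fn decode_frame(src: &[u8], dst: &mut [u8]) {
    let mut pos = 0; // position in the compressed data
    let mut out = 0; // position in the destination frame
    while pos < src.len() && out < dst.len() {
        let op = src[pos]; pos += 1;
        match op {
            0x00..=0x7F => { dst[out] = op; out += 1; }     // literal pixel
            0x80 => {                                       // escape: 16-bit skip
                let skip = u16::from_le_bytes([src[pos], src[pos + 1]]) as usize;
                pos += 2;
                out += skip;
            }
            0x81..=0xBF => { out += (op & 0x3F) as usize; } // short skip
            0xC0 => {                                       // escape: 16-bit run
                let run = u16::from_le_bytes([src[pos], src[pos + 1]]) as usize;
                let pix = src[pos + 2];
                pos += 3;
                for _ in 0..run { dst[out] = pix; out += 1; }
            }
            _ => {                                          // 0xC1-0xFF: short run
                let run = (op & 0x3F) as usize;
                let pix = src[pos]; pos += 1;
                for _ in 0..run { dst[out] = pix; out += 1; }
            }
        }
    }
}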
As for the other games mentioned in the beginning, Civ2 has GIF files mostly hidden inside resource .dll
s plus Indeo 4 video (with transparency even!) and Settlers II and WarCraft II have videos in Smacker format.
Having said that, my pointless diversion to looking at game formats is over, back to doing nothing!
Since I've got the second request for decoder relicensing I've decided to keep an open list of the projects that requested relicensing. This way it may satisfy somebody's curiosity about which parts of NihAV piqued some interest and also serve as proof for a project that I granted it a new license for the code.
The page is right here.
So I’ve released my decoder for Actimagine VX and it’s far from perfect.
The first problem is audio. The codec itself is not that tricky (it turned out to be some LPC codec that takes 5-10 16-bit words per frame to code the pulses and the filter for a 128-sample frame), but its data is stored right after the video frame data, so in order to decode audio you first need to decode the video frame and feed the remains of the input buffer to the audio decoder. Since I can't do that in a sane way I could not test the decoder either, so it's there for informative purposes only.
The second problem is obviously video. I've managed to decode the bitstream fine but the reconstructed images are not bit-exact, and in the case of plane prediction this leads to ugly artefacts (essentially the target value wraps around and you get gradients from white to black or vice versa instead of almost flat dark or white regions). I've introduced clipping which seems to help, but this is not right and maybe I'll fix it one day. Maybe even before Bink2.
And finally there are some problems with the demuxer. In theory VX files may have multiple tracks but my demuxer might not handle them at all and if it does then it’ll simply ignore anything but the first video stream.
So VX support is far from perfect but it serves its goal of proving that the format works as expected. And if it’s useful to anybody then it’s even better.
As you may know (but definitely not care), NihAV
has some limited support for Bink2 video. The problem with fixing it is that known samples are usually 720p video or more, which makes it hard to debug decoding past the few initial frames (okay, older versions have smaller known videos so they're likely to be fixed sooner). And of course the encoder is available only to RAD customers, to which I don't belong. So as a result I've decided to look at the Actimagine VX codec once again.
I looked at it four years ago but could only study it, not write a decoder, because of the binary. Essentially this codec lives on BigN DS consoles, so you have to deal with a raw ARM7 or ARM9 binary that (as it turns out) sets up its own segments (and the problems arise when you see absolute addresses pointing to areas not present there). So you load the binary at addresses e.g. 0x2000000-0x20e1030 but in reality it also contains the segments 0x1ffe800-0x1fff000 and 0x27e0000-0x27e4000. Thankfully Ghidra can not just load a raw ARM binary but also add aliases to data as new segments. This allowed me to work on the decoder again and now I have a more or less complete understanding of it and a semi-working decoder as well, here's an example:
Sample decoded frame.
Essentially it's a simplified variant of H.264 with the following features: frames are split into 16×16 macroblocks that can be further recursively divided horizontally or vertically down to 2×2 blocks. A block can be coded in 24 different modes that boil down to full-pel motion compensation from one of the three previous frames (without a motion vector, with a motion vector, or with a motion vector and an offset value that should be added to each pixel), intra prediction on the whole block, or intra prediction in 4×4 blocks. Whether residue is coded is also part of the mode (e.g. mode 11 is intra prediction without residue and mode 22 is intra prediction with residue). Residue is coded in 8×8 blocks comprising six 4×4 coefficient blocks, each block coded in a way reminiscent of H.264: there are numbers for the total count of non-zero coefficients, the count of trailing non-zero coefficients equal to ±1, and the number of zeroes dispersed between the non-zero coefficients. Those being coded with variable-length codes that I could not access earlier was the blocker, but not any more.
And there's one curious feature of this codec that made it worth REing: instead of using plane prediction like H.264, this codec fills the block in a recursive way. It interpolates the bottom-right corner as an average of the top-right and bottom-left neighbour pixels (e.g. [15,-1] and [-1,15] for a 16×16 block; it also adds a delta to it in certain decoding modes), then it calculates the halfway-bottom right and halfway-right bottom pixels (e.g. [15,7] and [7,15] for a 16×16 block), then a centre pixel, and then repeats the process for each quarter (or half for some rectangular blocks). This is less computationally intensive than ordinary plane prediction and it seems to give nice results too.
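Here is a sketch of that recursive fill for a square block as I understand it (pix holds the block together with its already decoded top and left neighbours, pos points at the block's top-left pixel; the exact rounding and delta handling in the real codec may differ):

// Sketch of the recursive block fill: pix covers the block plus its top and
// left neighbours, stride is the row length, pos points at the top-left
// pixel of the size x size block. Certain modes also add a transmitted
// delta to the corner, which is omitted here.
fn fill_block(pix: &mut [u8], pos: usize, stride: usize, size: usize) {
    let last = size - 1;
    // bottom-right corner from the top-right and bottom-left neighbours
    let tr = pix[pos - stride + last];     // pixel [last, -1]
    let bl = pix[pos + last * stride - 1]; // pixel [-1, last]
    pix[pos + last * stride + last] = avg(tr, bl);
    fill_square(pix, pos, stride, size);
}

fn avg(a: u8, b: u8) -> u8 { ((a as u16 + b as u16 + 1) >> 1) as u8 }

// Fill a square whose bottom-right corner and outer neighbours are known:
// compute the right and bottom edge midpoints and the centre, then recurse
// into the four quarters.
fn fill_square(pix: &mut [u8], pos: usize, stride: usize, size: usize) {
    if size < 2 { return; }
    let half = size / 2;
    let last = size - 1;
    // edge midpoints, e.g. [15,7] and [7,15] for a 16x16 block
    pix[pos + (half - 1) * stride + last] =
        avg(pix[pos - stride + last], pix[pos + last * stride + last]);
    pix[pos + last * stride + half - 1] =
        avg(pix[pos + last * stride - 1], pix[pos + last * stride + last]);
    // centre pixel, e.g. [7,7]
    pix[pos + (half - 1) * stride + half - 1] =
        avg(pix[pos + (half - 1) * stride + last], pix[pos + last * stride + half - 1]);
    // recurse into the quarters
    fill_square(pix, pos, stride, half);
    fill_square(pix, pos + half, stride, half);
    fill_square(pix, pos + half * stride, stride, half);
    fill_square(pix, pos + half * stride + half, stride, half);
}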
I mentioned before that my decoder is far from perfect (and you can see it for yourself on that picture) but I know how to debug and improve it. I’m not trying to say that piracy is okay, but being able to find some .nds
image with a game that has VX videos and using it with DeSmuME
with a GDB stub would help to debug the decoder. But piracy is bad and so it's not a proper way to do things.
As for the audio counterpart, I should mention this: curiously enough there's an opensource decoder for the later MobiClip formats that seems to contain a working Sx decoder for the audio used in VX files (it's a pity the person who did it could not finish VX as well—why should I do the work myself instead of letting other people do my work for me?!). Unfortunately it's mostly translated assembly, so while it should work it's mostly sub_XXX() functions doing various accesses to various positions of a large byte array of decoder state. I'll probably add it as well for completeness' sake and document the formats properly after I fix the decoder (which should happen during this year too).
A brief context: I watch videos from BaidUTube (name slightly altered just because) and my preferred way to do that is to grab video files with youtube-dl in 720p quality so I can watch them later at my leisure, in the way I like (i.e. without a browser), and re-watch them later even if they're taken down. It works fine, but in recent weeks I've noticed that some of the downloaded videos are unplayable. Of course this can be fixed by downloading the video again in a slightly different form (separate video and audio streams muxed locally; youtube-dl can do that) but today I was annoyed enough to look at the problem.
In case it's not obvious, I'm talking about mp4 files encoded and muxed at BaidUTube without any modifications by youtube-dl, which merely downloaded them. So, what's the problem?
Essentially an MP4 file contains a header with metadata telling at which offset and size the frames for each codec are, and the actual data is stored in an mdat atom. Not here. First you have lots of 12-byte sequences 90 00 00 00 00 0X XX XX 00 02 XX XX, then a moof atom (used in fragmented MP4) and then another mdat. And another. I've tried to avoid streaming stuff, but even to me it looks like somebody put all the fragments prepared for HLS streaming into a single MP4 file, making an unplayable mess.
Overall this happens only on a few random videos and probably most browsers would not pick it up (since VP9 or VP10 in WebMKV is the suggested format), so I don't expect it to be fixed. My theory is that they decided to roll out a new version of the encoding software with a broken muxer library or muxing mode. And if you ask "What were they thinking? You should run at least some tests to see if it encodes properly.", one wise guy has an answer for you: they weren't thinking about that, they were thinking about how long until the lunch break and then about when it's time to go home. This is the state of enterprise software and I have no reason to believe the situation will ever improve.
And there's a fact maybe related to it. Random files starting from 2019 may also show the marker "x264 – core 155 r2901 7d0ff22" in the encoded frames while most of the files have no markers at all. While I don't think they violate the license, it still looks strange that a company known for not admitting it uses open-source projects ("for their own protection" as it was explained once) lets such a marker slip through.
Well, that was an even more useless rant than usual.
As you might've heard, MPEG is essentially no more. And the last noticeable thing related to video coding that it did was MPEG-5 (plus synthesising actors and issuing commands to them with the unholy unity of the MPEG-G and MPEG-4 standards). As a result we have an abuse of the letter 'e'—in HEVC, EVC and LCEVC it means three different things. I'll probably talk about VVC when the AV2 specification is available, EVC is a slightly enhanced AVC, and LCEVC is interesting. And since I was able to locate the DIS for it, why not give a review of it?
LCEVC is based on Perseus and as such it’s still an interesting concept. For starters, it is not an independent codec but an enhancement layer to add scalability to other video codecs, somewhat like video SBR but hopefully it will remain more independent.
A good deal of specification is copied from H.264 probably because nobody in the industry can take a codec without NALs, SEIs and HRD seriously (I know their importance but here it still feels excessive). Regardless, here is what I understood from the description while suffering from thermal throttling.
The underlying idea is quite simple and hasn't changed since Perseus: you take a base frame, upscale it, add the high-frequency differences and display the result. The differences are first grouped into 4×4 or 8×8 blocks, transformed with a Walsh-Hadamard matrix or a modified Walsh-Hadamard matrix (with some coefficients zeroed out), quantised and coded. Coding is done in two phases: first there is a compaction stage where coefficients are turned into a byte stream with flags for zero runs and large values (or RLE just for zeroes and ones), and then it can be packed further with Huffman codes. I guess that there are essentially two modes: a faster one where coefficient data is stored as bytes (with or without RLE) and a slightly better compressed one where those values are further packed with Huffman codes generated per tile.
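For reference, the plain (unnormalised) 4×4 Walsh-Hadamard transform that such a layer applies to a block of differences looks like this (just the textbook butterfly, not the exact LCEVC matrices):

// Textbook 4-point Walsh-Hadamard butterfly applied to the rows and then the
// columns of a 4x4 block of differences (no normalisation).
fn wht4(v: [i32; 4]) -> [i32; 4] {
    let (a, b, c, d) = (v[0], v[1], v[2], v[3]);
    let (s0, s1, d0, d1) = (a + b, c + d, a - b, c - d);
    [s0 + s1, d0 + d1, s0 - s1, d0 - d1]
}

fn wht4x4(block: &mut [[i32; 4]; 4]) {
    for row in block.iter_mut() {
        *row = wht4(*row);
    }
    for x in 0..4 {
        let col = wht4([block[0][x], block[1][x], block[2][x], block[3][x]]);
        for y in 0..4 {
            block[y][x] = col[y];
        }
    }
}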
Overall this looks like a neat scheme and I hope it will have at least some success. No, not to prove Chiariglione's new approach of introducing codecs that the industry can use without complex patent licensing, but rather because it might be the only recent major video codec built on principles different from the H.26x line, and its success may bring more radically different codecs so my codec world will get less boring.
NihAV
was a fine joke that had been running for far too long. But today, on no particular date at all, I release it for the public to ignore or to briefly look at and forget immediately. Some decoders (Bink2, ClearVideo and Vivo 2) are still far from perfect, some features have simple or sketchy implementations, but despite all of that here it is.
The official website is here, source code is here.
Many thanks to people from former Libav
project for hosting.
Since the work on NihAV
is nearing the point when I can release it to the public without that much shame (the main features I wanted to implement are there and I even have documentation for all public interfaces plus some overview, you can't ask for more than that), I want to give $title.
This is the oldest tool, oriented mostly at testing decoder functionality. By default it will try to decode every stream in a file and output it either into a wave file or a sequence of images (PPM for RGB video, PGMYUV for YUV). Besides that it can also be told not to decode a stream (and if you choose to decode neither, it just tests the demuxer or dumps raw frames).
Here is the list of switches it understands:
- -noout makes it decode data but not produce any output (good for testing the decoding process if you don't currently care about the decoder output);
- -an/-vn makes it ignore audio or video streams correspondingly;
- -nm=count/pktpts/frmpts makes nihav-tool write frame numbers as a sequence, or using the PTS from the input packet or the decoded frame correspondingly;
- -skip=key/inter tells the video codec (if it is willing to listen) to skip less significant frames and decode only keyframes, or intra- and inter-frames but no B-frames;
- -seek time tells the tool to seek to the given position before decoding;
- -apfx/-vpfx prefix specify the prefix for the output filename(s), which comes in useful when decoding files in a batch;
- -ignerr tells nihav-tool to keep decoding, ignoring the errors the decoders report;
- -dumpfrm tells nihav-tool to dump raw frames. This is useful both for obtaining raw audio frames (I could not make avconv do that) and, because of the way it is implemented (it dumps the packet contents first and then tries to decode it), if you use it along with the decoder and it errors out you'll have the raw frame on which it errored out.

Additionally you can specify an end time after the input name if you don't need to decode the whole file.
As you can see this is not the most feature-rich tool but it works well enough for the declared goal (hence I mostly use a debug build of it).
This is another quick and dirty tool that appeared when I decided that looking at long sequences of images is not the best way to ensure that decoding goes right. So I wrote something that can pass for a player in a bad light, since it can show moving pictures and play sound that sometimes even stays in sync instead of deadlocking the audio playback thread.
Currently it's written using a patched SDL1 crate (removing the dependencies on num and rand and adding YUV overlay support and an audio interface that you can actually use from Rust; the patches will be available in the same repository) because my primary development system is too old and I don't want to mess with various libraries or find out which version of the sdl2 crate would compile with my current version of Rust (1.31 or 1.33).
In either case it's a temporary solution used mostly for visual debugging, and I want to write a proper media player based on SDL2 that would play audio-only files just as well (so I can move to dogfooding). After all, can you really call yourself a multimedia developer if you haven't written a single player?
And finally, the tool that appeared out of the need to debug encoders instead of decoders. Hopefully it will become more useful than that one day, but at least its interface should give you an idea of what it does and what it will do in the future.
I still consider the positional order of arguments one of the main problems with ffmpeg (the tool) and later avconv. Except when the order does not matter. If you've never been annoyed by the fact that you should put some arguments before -i infile in order for them to take effect on the input while the rest of the arguments should be put before the output file name—well, in that case you're luckier than me. So I've decided to go with a more free-form format.
nihav-encoder
command line looks like a list of options in no particular order; some of them take complex arguments, for which you provide a comma-separated list in the form --options-list option1,option2=value,option3=.... Here is the list of recognised options:
- --list-{decoders,encoders,demuxers,muxers} obviously lists the corresponding category and quits after listing all requested lists and options (see the next item);
- --query-{decoder,encoder,demuxer,muxer}-options name prints the list of options supported by the corresponding codec or (de)muxer. Of course you can request options for several different things to be listed by adding this option several times;
- --input inputfile and --output outputfile;
- --input-format format and --output-format format force the (de)muxer to use the provided format when autodetection fails;
- --demuxer-options options takes a comma-separated list of options for the demuxer (BTW you can also force the input format with e.g. --demuxer-options format=avi);
- --muxer-options options takes a comma-separated list of options for the muxer (BTW you can also force the output format with e.g. --muxer-options format=avi);
- --no-audio and --no-video tell nihav-encoder to ignore all audio or video streams correspondingly;
- --start time and --end time tell nihav-encoder to start decoding at the given time and end at the given time. The times are absolute so --start 1:10:00 --end 1:11:00 will process just a second of data;
- --istreamX options and --ostreamX options set options for the input and output streams with the given numbers (starting with zero of course). More about them below.

nihav-encoder
has two modes of operation: query mode, in which you specify which e.g. demuxers or codec options you want listed, and the program quits after listing them; and transcode mode, in which you specify input and output file and what you want to do with them. Maybe I’ll add a probe mode but I’ve never cared much about it before.
So what happens when you specify input and output? nihav-encoder
will try to see which streams can be output (e.g. when transcoding from AVI to WAV there's no point in even attempting to do anything with the video stream), then it will try to copy the input streams to the output unless anything else is specified. Of course you can specify that you want to discard some input stream with e.g. --istream0 drop. And for output streams you can also specify an encoder and its parameters. For example my command line for testing Cinepak encoding looks like this:
./nihav-encoder --input laser05.avi --output cinepak.avi --no-audio --ostream0 encoder=cinepak,quant_mode=mediancut,nstrips=4
It takes the input file laser05.avi, discards the audio stream, encodes the remaining video stream with the Cinepak encoder with the options quant_mode and nstrips set explicitly, and writes the result to cinepak.avi.
As you can see, this tool has enough features to serve as an everyday transcoder, but no complex features like taking input from several files, arbitrarily mapping their input streams to output streams and maybe applying some effects while at it. In my opinion that's a task for some more complex application that builds a complex processing graph, probably using a domain-specific language to specify the inputs and outputs and what to do with them (and it should be a proper command file instead of a command line that is impossible to type correctly even on the eighth try). Since I never had any interest in GStreamer I'm definitely not even going to play with that. But a simple transcoder should serve my needs just fine.
So instead of doing something productive like adding missing functionality bits and writing documentation I wasted my time on adding some QuickTime decoders. And while wasting time on adding SVQ1, SVQ3, QDMC and QDM2 decoders it became apparent why NihAV
is a good thing to exist.
Implementing two of them was not a very big deal but implementing SVQ3 and QDM2 decoders took more than a week each because there are only two specifications available for them and both are equally hard to comprehend: the first one is the official binary specification, the second one is source code in libavcodec
which is derived from the former.
The problem arises when somebody wants to understand how it works and/or reimplement the code and both SVQ3 and QDM2 decoder demonstrate two different aspects of that problem.
SVQ3 decoder is based on some draft of H.264 (or ex-MPEG/AVC if you’re from Piedmont) with certain extensions mostly related to motion compensation. Documentation for it was scarce and because of optimisations and integration with common H.264 decoder bits it’s hard to understand some of the things. One of those is intra prediction with two modes having SVQ3-specific hacks hidden in libavcodec/h264pred.c
(those are the 16×16 plane prediction mode giving a transposed result and the 4×4 diagonal down prediction being simplified and not relying on pixels not immediately to the top/left of the block) and another one is the block coefficient decoding function. It took me quite a while to realize that it actually decodes three different kinds of blocks: a single 4×4 block with zigzag scan, a 4×4 block divided into two parts with interlaced scan, and a 2×2 block. I've documented most of that in The Wiki (before that nobody had touched that page for almost ten years; sometimes I feel like I'm the only person contributing there).
QDM2 is horrible in a different way. It is a slightly improved translation of the original binary specification with hardly any idea of how it works (there are still names like local_int_8 in the code). Don't get me wrong: back in 2003-2005 when the reverse engineering was done the only tools you had were a debugger, a disassembler (you're lucky if it's not the one provided by the debugger) and no decompilers at all (IIRC rec appeared much later and was of limited usefulness, especially on the multi-megabyte QT monolith—and that's assuming you're not doing it on a Mac with even fewer tools available). I did some of such work back then as well so I understand how hard it is and how happy you are that it works somehow and you can ship it and forget about it.
Another thing is that now it’s clear that QDMC and QDM2 are predecessors of DT$ LBR (aka Express) and use the same principles (QDMC simply coded noise and tones, QDM2 is almost like LBR but without some features like LPC or multichannel audio and with different chunk structure), but back in the day there was no documentation on LBR (or LBR itself for that matter).
But the main problem is that nobody has tried to understand the code since. It became a so-called category killer i.e. its existence prevents others from doing something similar. At least until some idiot tried to do another implementation in NihAV
.
And here we have the reason for NihAV
to exist: it advances my understanding of codecs (and I document the results in The Wiki), resulting in different implementations that are (hopefully) easier to understand and sometimes even fix long-standing bugs. I hope this convinces you that sometimes it's good to have a reimplementation of a decoder even if an existing implementation is good enough (as far as I remember the only time a decoder was rewritten in FFmpeg
was when a reverse-engineered Indeo 3 decoder that crashed on damaged content almost every time was replaced with a reverse-engineered Indeo 3 decoder where a guy had an idea of how it works).
But back to QDM2: while my decoder is not finished yet and I probably won't bother with inter-frames in it (I've never seen any samples with those), it still decodes sweeps much better. That's mostly because of the various bugs I've uncovered (also while discovering that Ghidra effectively does not allow editing a decoder context that is about a megabyte large). Since I have no incentive to produce a patch and the people who created the decoder are long gone from the project, here are some of the spotted bugs: wrong coarse quantiser band selection (resulting in noise generated in the wrong frequency range), reading bits past the chunk end (because in some cases checks are missing), ignoring group 4 tones because of wrong conditions, and some initial variables being set in the wrong way too. Nevertheless it mostly works and it was very useful for mapping the functions in the binary specification (fun fact: the QDM2 decoder is located in QuickTime.qts while QDMC is located in QuickTimeInternetExtras.qtx).
I’m happy to announce that NihAV has finally taken a more or less complete form. Sure, there are some concepts I wanted to play with (like raw stream handling), but I have had no need for them so far, so they can wait until much, much later. But all the major features required to build a transcoder are there, as well as a working transcoder itself.
As I wrote in the previous post, I wanted to play with vector quantisation, so first I implemented image palettisation, but since that was not enough I implemented two encoders using vector quantisation: 15-bit MS Video 1 and Cinepak. I have no doubt that Tomas Härdin has written a much better encoder, but why should that stop me from NIHing? Of course such an encoder is not very useful by itself (and it was useless to begin with), so I needed a muxer to represent the encoder output in some form. And then simply fiddling with parameters and recompiling became boring, so I finally introduced generic options, and in order to use those options without recompiling the binary every time I had to write a transcoder as well. But that means that now I can use NihAV to recode media into something else, even if it’s just with two crappy video encoders plus MS ADPCM and PCM encoders and a large variety of supported output containers (AVI and WAV!). I called it conceptually done because all the essential concepts are there, not because there’s nothing left to do.
Now about video encoders. I’ll describe the NihAV design and how it works on a separate page; for now I’ll just mention that while decoders work on a “frame in, picture/audio out” principle, encoders accept a single picture or audio buffer for encoding and may then output a series of encoded packets. Why such asymmetry in the design? Because decoders are expected to produce a single output for a single input (with frame reordering handled externally), while most encoders are expected to have at least a single audio frame or a couple of pictures of lookahead to make decisions about coding the current input. For modern video codecs it may be a decision about which frame type to assign or where to start a new scene; for audio codecs like AAC you may need to change the current frame type if the following frame has transients and the previous one didn’t.
Anyway, back to the technical details of the encoders. MS Video 1 operates on 4×4 blocks that can be coded as skipped, filled with a single colour, filled with two colours in a pattern, or split into 2×2 sub-blocks each filled with its own two colours in a pattern. Sounds perfect for median cut. Cinepak is much more complex. It splits the frame into several strips, and each strip is also split into 4×4 blocks that may be coded as skipped, as a single 2×2 YUV codeword (a 2×2 Y block plus single U and V values) scaled twice, or as four YUV codewords from a different codebook. Essentially, for a good encoding you need to determine how to partition the frame into strips optimally, split blocks into single- and four-vector ones, and find optimal codebooks for them separately. Since I wanted to write a working encoder mostly to check whether vector quantisation works, I simply use a fixed amount of strips and add every block as a candidate for both coding schemes, without any following refinement steps.
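To give an idea of the kind of data the vector quantiser chews on, here is a rough sketch (not the actual NihAV code; the input layout is assumed) of how one 4×4 block could be turned into a 6-component V1 candidate vector: the four 2×2 luma quads are averaged down to a 2×2 Y block and the chroma is averaged over the whole block.
#include <stdint.h>

/* Turn one 4x4 block (full-resolution luma plus 2x2 subsampled chroma) into a
 * 6-component V1 candidate vector: 4 averaged luma quads, then U, then V. */
static void v1_candidate(const uint8_t y[4][4], const uint8_t u[2][2],
                         const uint8_t v[2][2], uint8_t out[6])
{
    for (int by = 0; by < 2; by++)
        for (int bx = 0; bx < 2; bx++)
            out[by * 2 + bx] = (y[by * 2][bx * 2]     + y[by * 2][bx * 2 + 1] +
                                y[by * 2 + 1][bx * 2] + y[by * 2 + 1][bx * 2 + 1] + 2) >> 2;
    out[4] = (u[0][0] + u[0][1] + u[1][0] + u[1][1] + 2) >> 2;
    out[5] = (v[0][0] + v[0][1] + v[1][0] + v[1][1] + 2) >> 2;
}
Such vectors are then fed to the codebook training (median cut or ELBG) described in the following post.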
Here are some numbers if you really care about those. Input is laser05.avi
(320×240 Indeo2 file with 196 video frames from the standard samples place). Encoding with the MS Video 1 encoder takes about 4 seconds. Encoding Cinepak with median cut takes six seconds. Encoding Cinepak with ELBG and randomly-generated codebooks takes 36 seconds and the result looks bad (but recognizable). Encoding Cinepak with ELBG that takes the codebooks produced with median cut as the initial ones takes 68 seconds, but the quality is higher than with median cut alone and the output file is slightly smaller too.
Now with all of this done I should probably fix the known bad decoders (RV6 and Bink2), add whatever missing decoders and features I see fit, and start documenting it all. I have strong doubts about VDD this year but maybe I’ll be able to present my stuff at FOSDEM 2021.
While NihAV
had support for paletted formats before, now it has more use cases covered. Previously I could only decode a paletted format and convert the picture into some other format. Now it can handle the palette in standard containers like AVI and MOV, and even palette changes in AVI (it’s done via NASideData, which is essentially the same thing I NIHed more than nine years ago). In addition to that it can convert an image into a paletted format as well, and below I’d like to give a brief review of the methods employed.
I always wanted to try vector quantisation and adding conversion to paletted formats is a perfect opportunity to do so.
It should be no surprise to you that conventional algorithms for this are based on ideas from 1980s.
Median cut (described in Color image quantization for frame buffer display by Paul Heckbert in 1982) is a very simple approach: you gather all data into a single box, split it along the dimension with the largest range (e.g. if the maximum green value minus the minimum green value is greater than the red or blue differences, then split the data into two boxes, one with green components less than the average green value and one with green components larger than it), and apply this recursively to the resulting boxes until you get the desired amount of boxes or can’t split any more. This method works moderately fast and, unlike the other approaches, it does not need an initial palette.
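A compact sketch of the idea (not NihAV's implementation; this variant sorts each box and splits at the median pixel rather than at the average value, which behaves similarly):
#include <stdint.h>
#include <stdlib.h>

typedef struct { uint8_t r, g, b; } RGB;

static int cmp_ch; /* channel to sort on: 0 = R, 1 = G, 2 = B (global: sketch only, not thread-safe) */

static int cmp_rgb(const void *a, const void *b)
{
    const uint8_t *pa = a, *pb = b;
    return pa[cmp_ch] - pb[cmp_ch];
}

/* Recursively split the pixel array into up to 2^depth boxes and emit one
 * palette entry (the box average) per final box. */
static void median_cut(RGB *px, int n, int depth, RGB *pal, int *pal_len)
{
    if (n <= 0)
        return;
    if (depth == 0 || n == 1) {
        unsigned r = 0, g = 0, b = 0;
        for (int i = 0; i < n; i++) { r += px[i].r; g += px[i].g; b += px[i].b; }
        pal[(*pal_len)++] = (RGB){ r / n, g / n, b / n };
        return;
    }
    /* find the channel with the largest range in this box */
    uint8_t lo[3] = { 255, 255, 255 }, hi[3] = { 0, 0, 0 };
    for (int i = 0; i < n; i++) {
        const uint8_t *c = (const uint8_t *)&px[i];
        for (int ch = 0; ch < 3; ch++) {
            if (c[ch] < lo[ch]) lo[ch] = c[ch];
            if (c[ch] > hi[ch]) hi[ch] = c[ch];
        }
    }
    cmp_ch = 0;
    for (int ch = 1; ch < 3; ch++)
        if (hi[ch] - lo[ch] > hi[cmp_ch] - lo[cmp_ch])
            cmp_ch = ch;
    /* split at the median along that channel and recurse into both halves */
    qsort(px, n, sizeof(*px), cmp_rgb);
    median_cut(px,         n / 2,     depth - 1, pal, pal_len);
    median_cut(px + n / 2, n - n / 2, depth - 1, pal, pal_len);
}
Calling median_cut(pixels, count, 8, palette, &len) with len initialised to zero then yields up to 256 entries.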
NeuQuant is an algorithm proposed by Anthony Dekker in 1994, but it’s based on the work of Teuvo Kohonen in the 1980s. Despite being based on a neural network, the algorithm works and is quite simple to understand. Essentially you have nodes corresponding to palette entries, and each new pixel is used to update the value of both the nearest-matching palette entry and its neighbours (with decreasing weight, of course). It works fast and produces good results even on a partially sampled image, but its result is heavily affected by the order in which pixels are sampled. That’s why for the best results you need to walk the image in pseudo-random order, while the other methods do not care about the order at all.
ELBG is an enhancement of the Linde–Buzo–Gray algorithm (from 1980), proposed in 2000. In the base algorithm you select random centroids (essentially the centres of the pixel clusters that will serve as palette entries in the end), calculate which pixels are closest to which centroid, calculate the proper centroid for each of these clusters (essentially an average of those pixels), and repeat the last two steps until the centroids don’t move much (or moving them does not improve the quantisation error by much). The enhancement comes from the observation that some clusters may have a larger dispersion than others, so moving a centroid from a small cluster of pixels near another small cluster of pixels to a large cluster of pixels might reduce the quantisation error. In my experience this is the slowest method of the three I tried, and the enhancement actually makes it about twice as slow (but of course I did not try to optimise any of these methods besides the very basic stuff). Also, since it constantly searches for suitable centroids for every pixel (more than once if you count the enhancement step), it gets really slow with an increased number of pixels. But it’s supposed to give the best results.
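For reference, one plain LBG refinement pass over RGB pixels could look roughly like this (a sketch only; the enhanced centroid-relocation step of ELBG is left out and at most 256 centroids are assumed):
typedef struct { int r, g, b; } Col;

static int dist2(Col a, Col b)
{
    int dr = a.r - b.r, dg = a.g - b.g, db = a.b - b.b;
    return dr * dr + dg * dg + db * db;
}

/* One plain LBG pass: assign every pixel to its nearest centroid, then move
 * each centroid to the mean of its cluster. Returns the total squared error
 * so the caller can iterate until it stops improving. Assumes k <= 256. */
static long lbg_pass(const Col *px, int n, Col *cent, int k)
{
    long sum_r[256] = {0}, sum_g[256] = {0}, sum_b[256] = {0};
    int  cnt[256] = {0};
    long err = 0;

    for (int i = 0; i < n; i++) {
        int best = 0, bestd = dist2(px[i], cent[0]);
        for (int j = 1; j < k; j++) {
            int d = dist2(px[i], cent[j]);
            if (d < bestd) { bestd = d; best = j; }
        }
        err += bestd;
        sum_r[best] += px[i].r;
        sum_g[best] += px[i].g;
        sum_b[best] += px[i].b;
        cnt[best]++;
    }
    for (int j = 0; j < k; j++) {
        if (!cnt[j])
            continue;           /* empty cluster: ELBG would move it somewhere more useful */
        cent[j].r = sum_r[j] / cnt[j];
        cent[j].g = sum_g[j] / cnt[j];
        cent[j].b = sum_b[j] / cnt[j];
    }
    return err;
}
The caller simply repeats lbg_pass() until the returned error stops shrinking noticeably.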
And here are some words about palette lookup. After the quantisation step is done you still have to assign palette indices to each pixel. The same paper by Heckbert lists three methods for doing that: the usual brute force, local search and a k-d tree. Brute force obviously means comparing the pixel against every palette entry to find the closest one. Local search means matching the pixel against palette entries in its quantised vicinity (I simply keep a list of the nearest palette entries for each possible 15-bit quantised pixel value). The k-d tree means simply constructing a k-dimensional tree in about the same way as median cut, but you do it on palette entries, and nodes contain a component threshold (e.g. “if the red component is less than 120 take the left branch, otherwise take the right branch”). In my experience the k-d tree is the fastest one but its results are not as good as those of local search.
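The brute-force lookup is the trivial baseline; local search just replaces the full palette scan below with a short per-cell candidate list indexed by the 15-bit quantised pixel value. A minimal version of the brute-force variant (illustrative, not the NihAV code):
#include <stdint.h>

/* Brute force: compare the pixel against every palette entry. */
static int find_nearest(const uint8_t pal[][3], int pal_size, const uint8_t px[3])
{
    int best = 0, bestd = 1 << 30;
    for (int i = 0; i < pal_size; i++) {
        int dr = px[0] - pal[i][0];
        int dg = px[1] - pal[i][1];
        int db = px[2] - pal[i][2];
        int d  = dr * dr + dg * dg + db * db;
        if (d < bestd) { bestd = d; best = i; }
    }
    return best;
}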
Output image quality may be improved further by applying dithering but I did not care enough to play with it.
P.S. I’ve also made generic versions of the median cut and ELBG algorithms that can be used to quantise any kind of input data, and I want to use them in some encoder. But that’s a story for another day. There’s other new and questionably exciting stuff in NihAV that I want to blog about in the meantime.
I am working on PowerPC SIMD optimizations for x264. I was playing with the SAD functions and was thinking it would be nice to have something similar to the x86 PSADBW for computing the sum of absolute differences. Luca suggested that I try the POWER9 vec_absd. A single vec_absd(fencv, pix0v) replaces vec_sub(vec_max(fencv, pix0v), vec_min(fencv, pix0v)). My patch can be found here. To make it work, -mcpu=power9 must be set. The patch contains a macro that keeps the code backward compatible with POWER8:
#ifndef __POWER9_VECTOR__
#define vec_absd(a, b) vec_sub(vec_max(a, b), vec_min(a, b))
#endif
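Just to illustrate the pattern (a sketch, not the actual x264 code; the helper name and block layout are made up), a reduced SAD kernel built around vec_absd could look like this:
#include <altivec.h>
#include <stdint.h>

#ifndef __POWER9_VECTOR__
#define vec_absd(a, b) vec_sub(vec_max(a, b), vec_min(a, b))
#endif

/* Sum of absolute differences over h rows of 16 bytes each; VSX loads are
 * used so the pointers do not have to be 16-byte aligned. */
static uint32_t sad_16xh(const uint8_t *src, const uint8_t *ref, int stride, int h)
{
    vector unsigned int acc = vec_splats(0u);

    for (int i = 0; i < h; i++) {
        vector unsigned char s = vec_vsx_ld(0, src);
        vector unsigned char r = vec_vsx_ld(0, ref);
        /* per-byte |s - r|, summed into four 32-bit partial sums */
        acc = vec_sum4s(vec_absd(s, r), acc);
        src += stride;
        ref += stride;
    }
    /* fold the four partial sums into one */
    vector signed int total = vec_sums((vector signed int)acc, vec_splats(0));
    return (uint32_t)vec_extract(total, 3);
}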
I got very nice results using vec_absd (the numbers are ratios of AltiVec/C checkasm timings): using vec_absd for the SAD and SSD functions gives an 8% improvement, which is amazing for such a small change.
I decided to write SIMD optimizations for the HEVC decoder inverse transform (which is an IDCT approximation) for ARMv7. (Here is an interesting post about DCT.) The inverse transform for HEVC operates on 4x4, 8x8, 16x16 and 32x32 blocks and I have finished them recently. For each block size there are 2 functions, one for 8 bitdepth and one for 10 bitdepth.
A few ARM assembly details that came up during this work:
* Registers q4-q7 are callee-saved, so they have to be preserved (vpush/vpop {q4-q7}) when one wants to use them. VPUSH/VPOP pushes and pops to/from the stack.
* One can pop lr and then bx lr to return, but it's better to return with simply pop {pc}.
* A temporary buffer can be allocated on the stack with:
mov rx, sp
and rx, sp, #15
add rx, rx, #buffer_size
sub sp, sp, rx
sp now points to the buffer. After using the buffer, the stack pointer has to be restored with add sp, sp, rx.
I tried my skills at optimising HEVC. My SIMD IDCT (Inverse Discrete Cosine Transform) for the HEVC decoder was merged recently. What I did was 4x4, 8x8, 16x16 and 32x32 IDCTs for 8 and 10 bitdepths. Both 4x4 and 8x8 are supported on 32-bit CPUs, but 16x16 and 32x32 are 64-bit only.
The larger transforms call the smaller ones: 32 calls 16, 16 calls 8 and so on, so 4x4 is used by all the other transforms. Here is how the actual assembly looks:
; void ff_hevc_idct_4x4_{8,10}(int16_t *coeffs, int col_limit)
*coeffs is a pointer to the coefficients I want to transform. They are loaded into XMM registers and then the TR_4x4 macro is called:
; %1 = bitdepth
%macro IDCT_4x4 1
cglobal hevc_idct_4x4_%1, 1, 1, 5, coeffs
mova m0, [coeffsq]
mova m1, [coeffsq + 16]
TR_4x4 7, 1, 1
TR_4x4 20 - %1, 1, 1
mova [coeffsq], m0
mova [coeffsq + 16], m1
RET
%endmacro
The TR_4x4 macro transforms the coeffs according to the following equations:
res00 = 64 * src00 + 64 * src20 + 83 * src10 + 36 * src30
res10 = 64 * src01 + 64 * src21 + 83 * src11 + 36 * src31
res20 = 64 * src02 + 64 * src22 + 83 * src12 + 36 * src32
res30 = 64 * src03 + 64 * src23 + 83 * src13 + 36 * src33
Because the transformed coefficients are written back to the same place, "res" (as residual) is used for the results and "src" for the initial coefficients. The results of the calculations are then scaled as res = (res + add_const) >> shift and the (4x4) block of results is transposed. The macro is then called again to perform the same transform, this time on rows.
; %1 - shift
; %2 - 1/0 - SCALE and Transpose or not
; %3 - 1/0 add constant or not
%macro TR_4x4 3
; interleaves src0 with src2 to m0
; and src1 with src3 to m2
; src0: 00 01 02 03 m0: 00 20 01 21 02 22 03 23
; src1: 10 11 12 13 -->
; src2: 20 21 22 23 m1: 10 30 11 31 12 32 13 33
; src3: 30 31 32 33
SBUTTERFLY wd, 0, 1, 2
pmaddwd m2, m0, [pw_64] ; e0
pmaddwd m3, m1, [pw_83_36] ; o0
pmaddwd m0, [pw_64_m64] ; e1
pmaddwd m1, [pw_36_m83] ; o1
%if %3 == 1
%assign %%add 1 << (%1 - 1)
mova m4, [pd_ %+ %%add]
paddd m2, m4
paddd m0, m4
%endif
SUMSUB_BADC d, 3, 2, 1, 0, 4
%if %2 == 1
psrad m3, %1 ; e0 + o0
psrad m1, %1 ; e1 + o1
psrad m2, %1 ; e0 - o0
psrad m0, %1 ; e1 - o1
;clip16
packssdw m3, m1
packssdw m0, m2
; Transpose
SBUTTERFLY wd, 3, 0, 1
SBUTTERFLY wd, 3, 0, 1
SWAP 3, 1, 0
%else
SWAP 3, 2, 0
%endif
%endmacro
The larger transforms are a bit more complicated but they work in a similar way.
Here are the results benchmarked with the checkasm bench_new() function for bitdepth 8 (the results are similar for bitdepth 10). checkasm can benchmark SIMD functions with the --bench option; in my case the full command was:
I asked Kostya Shishkov, an experienced ARM developer, to check my basic NEON knowledge. So here are his questions and my answers to them:
* vmov.i32 q0, #42 - move the immediate constant 42 to the q0 SIMD register; the suffix i32 specifies the data type, a 32-bit integer in this case, and as q0 is a 128-bit register, there will be 4 32-bit 42 constants in it
* mov r1, #16 - move the number 16 to the r1 GPR register
* add r0, r1 - add 16 bytes to the address stored in r0
* vst1.s16 {q0}, [r0] - store the content of q0 to the address stored in r0
* … r0 and store the result in r1
* mov r2, \step - move the constant step to r2
* vld1.s16 {d0}, [r1], r2 - load d0 from the address in r1, then update r1 = r1 + r2
* vmov d0[0], r1 - move the value of r1 into the lowest 32-bit lane of d0
I decided to organise the Libav sprint again, this time in a small village near Pelhřimov. The participants:
Some time ago Niels Möller proposed a new method of bitreading that should be faster than the current one (here). It is an interesting idea and I decided to try it. Luca Barbato considered it to be a good idea and had his company sponsor this work. The new bitstream reader (bitstream.h) is faster in many cases and is never slower than the existing one (get_bits.h).
static inline unsigned int get_bits(GetBitContext *s, int n)
{
    register int tmp;
    OPEN_READER(re, s);
    UPDATE_CACHE(re, s);
    tmp = SHOW_UBITS(re, s, n);
    LAST_SKIP_BITS(re, s, n);
    CLOSE_READER(re, s);
    return tmp;
}
The new bitstream reader is written to be easier to use, more consistent and easier to follow. It is better documented and the functions are named according to the current naming conventions and consistently with the bytestream reader naming.
* bitstream_read_32() reads bits in the 0-32 range and replaces get_bits(), get_bits_long() and get_bitsz()
* bitstream_peek_32() replaces show_bits(), show_bits_long() and show_bits1()
* bitstream_skip() replaces skip_bits1(), skip_bits() and skip_bits_long()
Sometimes it's very useful to print out how some parameters change during the program execution.
When writing a new version of some piece of code one usually needs to compare it with the old one to be sure it behaves the same in every case. Especially the corner cases can be tricky, and I spent a lot of time on them while my code worked fine in general.
For example, when I was working on my ASF demuxer, I was happy there was an old demuxer I could compare behaviour with. When debugging the ASF demuxer, I wanted to know the state of the I/O context. At that time lu_zero (who was mentoring me) created a set of macros which print logs for every I/O function (here). For example, there's the macro for the avio_seek() function (which is equivalent to fseek()).
#define avio_seek(s, o, w) ({ \
    int64_t _ret = avio_seek(s, o, w); \
    int64_t _pos = avio_tell(s); \
    av_log(NULL, AV_LOG_VERBOSE|AV_LOG_C(154), "0x%08"PRIx64" - %s:%d seek %p %"PRId64" %d -> %"PRId64"\n", \
           _pos, __FUNCTION__, __LINE__, s, o, w, _ret); \
    _ret; \
})
When such a macro was present in my demuxer, the following information was printed for every call of avio_seek:
* _pos = avio_tell(s): the offset in the demuxed file
* __FUNCTION__: preprocessor define that contains the name of the function being compiled, to know which function called avio_seek
* __LINE__: preprocessor define that contains the line number of the original source file being compiled, to know from what line avio_seek was called
* s, o, w: the values of the parameters avio_seek was called with
* _ret: the avio_seek return value
* __FILE__: preprocessor define that contains the name of the file being compiled (this one was not used in the example but might be useful when one needs a more complex log)
There has to be _ret; as the last statement in this macro because its value serves as the value of the entire construct. If the last _ret; were omitted in my example, the value of the macro expression would be the av_log() return value. The underscores in the _ret and _pos variables are there to make sure they do not shadow other variables with the same names.
Thanks to lu_zero for teaching me about it. The support from the more experienced developers is the thing I really love about Libav.
I split my complex dcadec bit-exact patch (https://patches.libav.org/patch/59013/) into several parts. The first part, which changes the dcadec core to work with integer coefficients instead of converting the coefficients to floats just after reading them, was sent to the mailing list (https://patches.libav.org/patch/59141/). Such a change was expected to slow down the decoding process, so I made some measurements to examine how much slower decoding is after my patch.
I decoded this sample: samples.libav.org/A-codecs/DTS/dts/dtswavsample16.wav 10 times and measured the user time between invocation and termination with the "time" command:
time ./avconv -f dts -i dtswavsample16.wav -f null -c pcm_f32le null
I counted the average real time of the avconv runs and repeated everything for the master branch. The duration of dtswavsample16.wav is ~4 mins and I wanted to look at the slowdown for longer files too. Hence I used the relatively new loop option for avconv (http://sasshkas.blogspot.cz/2015/08/the-loop-option-for-avconv.html) to create a ~24 min long file from the initial file by looping it 6x with
./avconv -loop 6 -i dtswavsample16.wav -c copy dts_long.wav.
I decoded this longer DTS file 10x again for both the new integer and the old float coefficients core and counted the averages.
When playing a multimedia file, one usually wants to seek to reach different parts of the file. Most containers allow this feature, but it might be a problem for streamed files.
Within libavformat, seeking is performed with a function (inside the demuxer) called read_seek. This function tries to find the position in the played file that matches the requested timestamp.
There are 2 ways to seek through a file. One of them is when the file contains some kind of index which matches positions with the appropriate timestamps. In this case index entries are created by calling av_add_index_entry. If index entries are present, av_index_search_timestamp, which is called inside read_seek, looks for the index entry closest to the requested timestamp. When the file does not provide such entries, one can look for the requested position with ff_seek_frame_binary. For doing so, a read_timestamp function has to be provided by the demuxer.
read_timestamp takes a required position and a stream index and then tries to find the offset of the beginning of the closest packet which is a keyframe with a matching stream index. While doing this, read_timestamp reads the timestamps of all the packets after the given position and creates index entries for them. When the keyframe with the matching stream index is found, read_timestamp updates the required position and returns the timestamp matching it.
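To make the shape of such a callback concrete, here is a rough sketch for a hypothetical container whose packets start with a small fixed header; the packet layout and constants are invented for illustration, only the libavformat calls are real:
#include <libavformat/avformat.h>

/* Sketch of a read_timestamp callback for an imaginary container whose packets
 * start with: 32-bit payload size, 32-bit stream id (top bit = keyframe flag),
 * 64-bit pts. Only the libavformat API usage reflects the real interface. */
static int64_t demo_read_timestamp(AVFormatContext *s, int stream_index,
                                   int64_t *pos, int64_t pos_limit)
{
    AVIOContext *pb = s->pb;

    if (avio_seek(pb, *pos, SEEK_SET) < 0)
        return AV_NOPTS_VALUE;

    while (avio_tell(pb) < pos_limit && !pb->eof_reached) {
        int64_t  pkt_pos = avio_tell(pb);
        uint32_t size    = avio_rl32(pb);
        uint32_t id      = avio_rl32(pb);
        int64_t  pts     = avio_rl64(pb);
        int      key     = id >> 31;
        int      stream  = id & 0x7fffffff;

        if (stream >= (int)s->nb_streams)
            return AV_NOPTS_VALUE;

        /* index every packet we have seen so later seeks get cheaper */
        av_add_index_entry(s->streams[stream], pkt_pos, pts, size, 0,
                           key ? AVINDEX_KEYFRAME : 0);

        if (stream == stream_index && key) {
            *pos = pkt_pos;      /* report the position that was actually found */
            return pts;
        }
        avio_skip(pb, size);     /* jump to the next packet header */
    }
    return AV_NOPTS_VALUE;
}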
I was told to test my ASF demuxer with the zzuf utility. Zzuf is a fuzzer: it changes random bits in the program's input, which simulates a damaged file or unexpected data.
For testing the ASF demuxer's behaviour I want to feed avconv some corrupted wmv files and see what happens. Because I want to fuzz in several different ways I want to vary the seed (the initial value of zzuf’s random number generator). I'll do this with the command:
while true; do SEED=$RANDOM; for file in *wmv; do zzuf -M -l -r 0.00001 -q -U 60 -s $SEED ./avconv -i "$file" -f null -c copy - || echo $SEED $file >> fuzz; done; done;.
I got the file fuzz, which is a list of seeds paired with filenames. Now I need to use zzuf to create the damaged files and check the problem with valgrind. I'll take a seed from the list that caused some crash and create my damaged file with it:
zzuf -M -l -r 0.00001 -q -U 60 -s myseed < somefile.wmv | cat > out.asf.
Now I'll just use valgrind to find out what happened:
valgrind ./avconv -i out.asf -f null -c copy -.
I tried to test the ASF demuxer with different tricky samples and with FATE
and the demuxer behaved well, but testing with zzuf detected several new crashes. Mainly these were insane size values, and it was easy to fix them by adding some more checks. Zzuf is a great thing for testing.
Pelhřimov is a small but very nice town in the Czech Republic, approximately 120 km from the capital Prague, and I decided to organize a small but nice Libav sprint there.
The participants and the topics were:
dca_decode_frame to handle all the extensions and working with the new options for them more systematically.
I decided to improve the Libav DTS decoder - dcadec. Here I want to explain what its problems are now and what I would like to do about them.
A DTS encoded audio stream consists of core audio and may contain extended audio. Dcadec supports the XCH and XLL extensions, but the X96, XXCH and XBR extensions are waiting to be implemented - I'd like to implement them later.
For the DTS lossless extension - XLL - the decoded output audio should be a bit-for-bit accurate reproduction of the encoded input. However there are some problems:
The core is currently decoded like this:
dequantization (with int -> float conversion)
↓
inverse ADPCM (when needed)
↓
VQ decoding (when needed)
↓
filtering: QMF, LFE, downmixing (when needed)
↓
float output.
I'm now working on modifying the core to work with integer coefficients and convert them to floats just before QMF filtering for lossy output, but to use a bit-exact QMF for lossless output (the intermediate LFE coefficients should always be integers, and I think this is not correct in the current version). Also I added an option called -force_fixed to force fixed-point reconstruction for any kind of input.
Sometimes the XLL extension is not detected and only the core audio is decoded in that case. I want to fix this issue as well.
I'd like to add the loop option to avconv. This option allows repeating an input file a given number of times, so the output contains the specified number of inputs. The command is ./avconv -loop n -i infile outfile, where n specifies how many times the input file should be looped in the output.
How does this work?
After processing the input file for the first time, avconv calls the new seek_to_start function to seek back to the beginning of the file. av_seek_frame is called to perform the seeking itself, but there are other things needed for the loop option to work.
1) flush
Flush the decoder buffers to take out the delayed frames. In avconv this is done by calling process_input_file with NULL as the frame; process_input_packet had to be modified a little so as not to signal EOF on the filters when seeking.
2) timestamps (ts)
To have correct timestamps in the "after seeking" part of the output stream, they have to be corrected as ts = ts_{from the demuxer} + n * (duration of the input stream), where n is the number of times the input stream has been processed so far (see the sketch below). This duration is the duration of the longest stream in the file, because all the streams have to be processed (or played) before starting the next loop. The duration of a stream is the last timestamp - the first timestamp + the duration of the last frame. For audio streams one "frame" is usually a constant number of samples and its duration is number of samples / sample rate. Video frames, on the other hand, are displayed unevenly, so their average framerate can be used for the last frame duration if available; if the average framerate is not known, the last frame duration is just 1 (in the current time base).
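The correction itself boils down to a one-liner; an illustrative helper (not the actual avconv code) might look like:
#include <stdint.h>
#include <libavutil/avutil.h>

/* Shift a demuxed timestamp by the input duration once per completed loop. */
static int64_t loop_adjust_ts(int64_t ts, int64_t input_duration, int loops_done)
{
    if (ts == AV_NOPTS_VALUE)
        return ts;                                  /* leave unknown timestamps alone */
    return ts + (int64_t)loops_done * input_duration;
}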
https://github.com/sasshka/libav/commit/90f2071420b6fd50eea34982475819248e5f6c8f
I am hearing from a lot of people interested in open-source and in giving back to the community. I think it can be an exciting experience and it can be positive in many different ways: first of all more contributors mean better open-source software being produced and that is great, but it also means that the persons involved can improve their skills and they can learn more about how successful projects get created.
So I wondered why many developers do not take the first step: what is stopping them from sending the first patch or the first pull request? I think that often they do not know where to start, or they think that contributing to the big projects out there is intimidating, something to be left to an alien form of life, some breed of extra-good programmers totally separated from the common fellows writing code in the world we experience daily.
I think that hearing the stories of a few developers that have given major contributions to top level project could help to go over these misconceptions. So I asked a few questions to this dear friend of mine, Luca Barbato, who contributed among the others to Gentoo and VLC.
Let’s start from the beginning: when did you start programming?
I started dabbling in stuff during high school, but I started doing something more consistent around the time I started university.
What was your first contribution to an open-source project?
I think either patching the ati-drivers to work with the 2.6 series or hacking cloop (an early kernel module for compressed loops) to use lzo instead of gzip.
What are the main projects you have been involved into?
Gentoo, MPlayer, Libav, VLC, cairo/pixman
How did you start being involved in Gentoo? Can you explain the roles you have covered?
Daniel Robbins invited me to join, I thought “why not?”
During the early times I took care of PowerPC and [Altivec](http://en.wikipedia.org/wiki/AltiVec), then I focused on the toolchain due to the fact that gcc and binutils tended to break software in funny ways, then on multimedia since AltiVec was mainly used there. I have been part of the Council a few times, used to be a recruiter (if you want to join Gentoo feel free to contact me anyway, we love to have more people involved), and lately I’m involved with community relations.
Note: Daniel Robbins is the creator of Gentoo, a Linux distribution.
Are there other less famous projects you have contributed to?
I have minor contributions in quite a bit of software, due to the fact that in Gentoo we try our best to upstream our changes and I like to get fixes back to what I like to use.
What are your motivations to contribute to open-source?
Mainly because I can =)
Who helped you to start contributing? From who you have learnt the most?
Daniel Robbins surely had been one of the first asking me directly to help.
You learn from everybody so I can’t name a single person among all the great people I met.
How did you get to know Daniel Robbins? How did he help you?
I was a Gentoo user, I happened to do stuff he deemed interesting and he asked me to join.
He involved me in quite a number of interesting projects, some worked (e.g. Gentoo PowerPC), some (e.g. Gentoo Games) not so much.
Do your contributions to open-source help your professional life?
In some way it does, contrary to the assumption I’m just seldom paid to improve the projects I care about the most, but at the same time having them working helps me when I need them during the professional work.
How do you face disagreement on technical solutions?
I’m a fan of informed consensus, otherwise prototypes (as in “do, test and then tell me back”) work the best.
To contribute to OSS, which are more important: the technical skills or the diplomatic/relational skills?
Both are needed at different time, opensource is not just software, you MUST get along with people.
Have you found different way to organize projects? What works best in your opinion? What works worst?
Usually the main problem is dealing with poisonous people; it doesn’t matter if it is a 10-people project or a 300+-people project. You can have a dictator, you can have a council, you can have global consensus: poisonous people are what makes your community suffer a lot. Bonus points if the poisonous people get clueless fans giving them additional voices.
Did you ever sent a patch for the Linux kernel?
Not really, I’m not fond of that coding style so usually other people correct the small bugs I stumble upon before I decide to polish my fix so it is acceptable =)
Do you have any suggestions for people looking to get started contributing to open-source?
Pick something you use, scratch your own itch first, do not assume other people are infallible or heroes.
ME: I certainly agree with that, it is one of the best pieces of advice. However, if you cannot find anything suitable, at the end of this post I wrote a short list of projects that could use some help.
Can you tell us about your best and your worst moments with contribution to OSS?
The best moment is recurring and it is when some user thanks you since you improved his or her life.
The worst moment for me is when some rabid fan claims I’m evil because I’m contributing to Libav and even praises FFmpeg for something originally written in Libav in the same statement, happened more than once.
What are you working on right now and what plans do you have for the future?
Libav, plaid, bmdtools, commonmark. In the future I might play a little more with [rust](http://www.rust-lang.org/).
Thanks Luca! I would be extremely happy if this short post could give to someone the last push they need to contribute to an existing open-source project or start their own: I think we could all use more, better, open-source software. So let’s write it.
One thing I admire in Luca is that he is always curious and ready to jump on the next challenge. I think this is the perfect attitude to become an OSS contributor: just start playing around with the things you like and talk to people; you could find more possibilities to contribute than you could imagine.
…and one final thing: Luca is also the author of open-source recipes: he created the recipes of two types of chocolate bars dedicated to Libav and VLC. You can find them on the borgodoro website.
I suggest to take a look at his blog.
Well, just in case you are eager to start writing some code and you are looking for some projects to contribute to, here are a few, written with different technologies. If you want to start contributing to any of those and you need directions just drop me a line (federico at tomassetti dot me) and I would be glad to help!
If you are interested in contributing to Libav, you can take a look at this post: there I explained how I submitted my first patch (approved in the meantime!). It is written in C.
You could be also interested in plaid: it is a Python web application to manage git patches sent by e-mail (there are a few projects using this model like libav or the linux kernel)
WorldEngine, it is a world generator written in Python
Plate-tectonics, it is a library for plate tectonics simulation. It is written in C++
JavaParser a Java parser, written in Java
Incremental Java parser, an incremental Java parser, written in Scala
The post How people get started contributing to open-source? A few questions to Luca Barbato, contributor to Gentoo, MPlayer, Libav, VLC, cairo/pixman appeared first on Federico Tomassetti - Consultant Software Engineer.
I happened to have a few hours free and I was looking for some coding to do. I thought about VLC, the media player which I have enjoyed so much using over the years and I decided that I wanted to contribute in some way.
To start helping in such a complex process there are a few steps involved. Here I describe how I got my first patch accepted. In particular I wrote a patch for libav, the library behind VLC.
I started by reading the wiki. It is a very helpful starting point, but the process to set up the environment and send a first patch was not yet 100% clear to me, so I got in touch with some of the developers of libav to understand how they work and how I could start lending a hand with something simple. They explained to me that the easiest way to start is by solving issues reported by static analysis tools and style checkers. They use uncrustify to verify that the code adheres to their style guidelines and they run coverity to check for potential issues like memory leaks or null dereferences. So I:
After a few minutes the patch was approved by a committer, ready to be merged. The day after it made its way to the master branch. Yeah!
First of all, let’s clone the git repository:
git clone git://git.libav.org/libav.git
Alternatively you could use the GitHub mirror, if you want to.
At this point you may want to install all the dependencies. The instructions are platform specific, you can find them here. If you are on Mac OS X, be sure to have yasm installed, because nasm does not work. If you have both installed, configure will pick up yasm (correctly). Just be sure to run configure after installing yasm.
If everything goes well you can now build libav by running:
./configure
make
Note that it is fine to build in-tree (no need to build in a separate directory).
Now it is time to run the tests. You will have to specify one directory where to download some samples, later used by tests. Let’s assume you wanted to put your samples under ~/libav-samples:
mkdir ~/libav-samples
# This downloads the samples
make fate-rsync SAMPLES=~/libav-samples
# This runs the tests
make fate
Did everything run fine? Good! Let’s start to patch then!
First of all we need to find an open issue. Visit the Coverity page for libav at https://scan.coverity.com/projects/106. You will have to ask for access and wait for someone to grant it to you. When you are able to log in you will encounter a screen like this:
Here, this seems an easy one! The variable oggstream has been allocated by av_mallocz (basically a wrapper for malloc) but the resulting value has not been checked. If the allocation fails a NULL pointer is returned, and when we try to access it at the next line things are going to end up unpleasantly. What we need to do is to check the return value of av_mallocz and if it is NULL we should return an error. The appropriate error to return in this case is AVERROR(ENOMEM). To get this information… you have to start reading code, getting familiar with the way of doing business of this codebase.
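The shape of the fix is the usual allocation-check pattern; an illustrative version (not the exact patch, the wrapper function here is made up) looks like this:
#include <errno.h>
#include <libavutil/error.h>
#include <libavutil/mem.h>

/* Illustrative only - the wrapper is invented, the pattern is what matters. */
static int alloc_stream_context(void **out, size_t size)
{
    void *oggstream = av_mallocz(size);   /* zeroed allocation, may fail */
    if (!oggstream)
        return AVERROR(ENOMEM);           /* report the failure instead of crashing later */
    *out = oggstream;
    return 0;
}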
Libav follows strict rules about the comments in git commits: use git log to look at previous commits and try to use the same style.
I think many of you are familiar with GitHub and the whole process of submitting a patch for revision. GitHub is great because it made that process so easy. However there are some projects (notably including the Linux kernel) which adopts another approach: they receive patches by e-mail.
Git has a functionality that permits to submit a patch by e-mail with a simple command. The patch will be sent to the mailing list, discussed, and if approved the e-mail will be downloaded, processed through git and committed in the official repository. Does it sound cumbersome? Well, it sounds to me, spoiled as I am by GitHub and similar tools but, you know, if you go in Rome you should behave as the Romans do, so…
# This installs the git extension for sending patches through e-mail
sudo apt install git-email
# This submits a patch built from the last commit
git send-email -1 --to libav-devel@libav.org
Now, many of you are using gmail and many of you have enabled 2-factor authentication (right? If not, you should). If this is your case you will get an error along these lines:
Password for 'smtp://f.tomassetti@gmail.com@smtp.gmail.com:587': 5.7.9 Application-specific password required. Learn more at 5.7.9 http://support.google.com/accounts/bin/answer.py?answer=185833 cj12sm14743233wjb.35 - gsmtp
Here you can find how to create a password for this goal: https://support.google.com/accounts/answer/185833 The name of the application that I had to create was smtp://f.tomassetti@gmail.com@smtp.gmail.com:587. Note that I used the same name specified in the previous error message.
If things go well an e-mail with your patch will be sent to the mailing-list, someone will look at it and accept it. Most of the time you will receive suggestions about possible adjustments to be done to improve your patch. When that happens you want to submit a new version of your patch in the same thread which contains the first version of the patch and the e-mails commenting on it.
To do that you want to update your patch (typically using git commit --amend) and then run something like:
git send-email -1 --to libav-devel@libav.org --in-reply-to="<54E0F459.3090707@gentoo.org>"
Of course you need to find out the message-id of the e-mail to which you want to reply. To do that in gmail select the “Show original” item from the contextual menu for the message and in the screen opened look for the Message-Id header.
There are also web applications which are used to manage the patches sent by e-mail. Libav is currently using Patchwork to manage patches. You can see it deployed at: https://patches.libav.org/project/libav-devel/list/. Currently another tool is being developed to replace Patchwork. It is named Plaid and I tried to help a little bit with that as well.
Mine has been a very small contribution, and in the future I hope to be able to do more. But being a maintainer of other open-source projects I learned that also small help is useful and appreciated, so for today I feel good.
Please, if I am missing something help me correct this post
The post How to contribute to Libav (VLC): just got my first patch approved appeared first on Federico Tomassetti - Consultant Software Engineer.
I participated in the last Libav sprint in Torino. I made a new ASF demuxer for Libav, but during testing, problems with the rtsp and mms protocols appeared. Therefore, my main task during the sprint was to fix these issues.
It was the second time I was at such a sprint and also my second Torino visit, and the sprint was even better than I expected. It's really nice to see the people I communicate with through the IRC channel in person; the thing I like about Libav a lot is its friendly community. But the most important thing for me, as the most inexperienced person among skilled developers, was naturally their help. My mentors from OPW participated in the sprint and as a result all the issues were fixed and a patch was sent to the ML (https://patches.libav.org/patch/55682/). Also, these personal consultations can be very productive for learning new things, and because I'm not a native English speaker, I realized that the few days when I have to speak or even think in English are really helpful for getting better at it.
The last day of the sprint we had a trip to a really magical place called Sacra di San Michele (http://www.sacradisanmichele.com/).
After my challenge with the fused multiply-add instructions I managed to find some time to write a new test utility. It’s written ad hoc for unpaper but it can probably be used for other things too. It’s trivial and stupid but it got the job done.
What it does is simple: it loads both a golden and a result image file, compares the size and format, and then goes through all the bytes to identify how many differences there are between them. If less than 0.1% of the image surface changed, it considers the test a pass.
It’s not a particularly nice system, especially as it requires me to bundle some 180MB of golden files (they compress to just about 10 MB so it’s not a big deal), but it’s a strict improvement compared to what I had before, which is good.
This change actually allowed me to explore one change that I abandoned before because it resulted in non-pixel-perfect results. In particular, unpaper now uses single-precision floating point all over, rather than doubles. This is because the slight imperfections caused by this change are not relevant enough to warrant the ever-so-slight loss in performance due to the bigger variables.
But even up to here, there is very little gain in performance. Sure, some calculations can be faster this way, but we’re still using the same set of AVX/FMA instructions. This is unfortunate: unless you start rewriting the algorithms used for searching for edges or rotations, there is no gain to be made by changing the size of the code. When I converted unpaper to use libavcodec, I decided to make the code simple and as stupid as I could make it, as that meant I could have a baseline to improve from, but I’m not sure what the best way to improve it is, now.
I still have a branch that uses OpenMP for the processing, but since most of the filters applied are dependent on each other it does not work very well. Per-row processing gets slightly better results but they are really minimal as well. I think the most interesting parallel processing low-hanging fruit would be to execute processing in parallel on the two pages after splitting them from a single sheet of paper. Unfortunately, the loops used to do that processing right now are so complicated that I’m not looking forward to touch them for a long while.
I tried some basic profile-guided optimization execution, just to figure out what needs to be improved, and compared with codiff
a proper release and a PGO version trained after the tests. Unfortunately the results are a bit vague and it means I’ll probably have to profile it properly if I want to get data out of it. If you’re curious here is the output when using rbelf-size -D
on the unpaper
binary when built normally, with profile-guided optimisation, with link-time optimisation, and with both profile-guided and link-time optimisation:
% rbelf-size -D ../release/unpaper ../release-pgo/unpaper ../release-lto/unpaper ../release-lto-pgo/unpaper
exec data rodata relro bss overhead allocated filename
34951 1396 22284 0 11072 3196 72899 ../release/unpaper
+5648 +312 -192 +0 +160 -6 +5922 ../release-pgo/unpaper
-272 +0 -1364 +0 +144 -55 -1547 ../release-lto/unpaper
+7424 +448 -1596 +0 +304 -61 +6519 ../release-lto-pgo/unpaper
It’s unfortunate that GCC does not give you any diagnostics on what it’s trying to achieve when doing LTO; it would be interesting to see if you could steer the compiler to produce better code without it as well.
Anyway, enough with the micro-optimisations for now. If you want to make unpaper faster, feel free to send me pull requests for it, I’ll be glad to take a look at them!
RealAudio files have several possible interleavers. The simplest is “Int0”, which means that the packets are in order. Today, I was contrasting “Int4” and “genr”. They both require rearranging data, in highly similar but not identical ways. “genr” is slightly more complex than “Int4”.
A typical Int4 pattern, writing to subpacket 0, 1, 2, 3, etc, would read data from subpacket 0, 6, 12, 18, 24, 30, 36, 42, 48, 54, 60, 66, 1, 7, 13, etc, in that order – assuming subpkt_h is 12, as it was in one sample file. It is effectively subpacket_h rows of subpacket_h / 2 columns, counting up by subpacket_h / 2 and wrapping every two rows.
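A toy program reproducing that read order (using subpkt_h = 12 and 6 columns, as in the sample file; this is just an illustration of the indexing, not the demuxer code):
#include <stdio.h>

int main(void)
{
    const int subpkt_h = 12;           /* rows, as in the sample file     */
    const int cols     = subpkt_h / 2; /* 6 columns per row               */

    for (int dst = 0; dst < subpkt_h * cols; dst++) {
        int src = (dst % subpkt_h) * cols + dst / subpkt_h;
        printf("%d ", src);            /* prints 0 6 12 ... 66 1 7 13 ... */
    }
    printf("\n");
    return 0;
}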
A typical genr pattern is a little trickier. For subpacket_h = 14, and the same 6 columns per row as above, the pattern to read from looks like 0, 12, 24, 36, 48, 60, 72, 6, 18, 30, 42, 54, 66, 78, 1, etc.
I spent most of today implementing genr, carefully working with a paper notebook, pencil, Python, and a terse formula from the old implementation:
case DEINT_ID_GENR:
for (x = 0; x < w/sps; x++) avio_read(pb, ast->pkt.data+sps*(h*x+((h+1)/2)*(y&1)+(y>>1)), sps);
After various debug printfs, a lot of quality time in GDB running commands like x /94x (pkt->data + 14 * 94), a few interestingly garbled bits of audio playback, and a mentor pointing out I have some improvements to make on header parsing, I can play (some) genr files.
I have also recently implemented SIPR support, and it works in both RA and RM files. RV10 video also largely works.
I've solved the lost packets problem and finally my ASF demuxer started to work right on "ideal samples in a vacuum". So the time for fixing memory leaks had come, and valgrind helped me a lot with this issue. After the memory leaks were solved I had to start testing my demuxer on various samples of ASF multimedia files. As expected, I found many samples my demuxer failed on. The reasons were different - mostly it was my mistakes, misunderstood or overlooked parts of the specs, but I think I also found a case that needed unusual handling the specs don't mention.
Some of the problems were caused, for example, by:
* improper subpayload handling - one should be really careful while reading the specs to avoid problems with less common cases, like a single subpayload inside a single payload with padding inside the payload itself (while the padding after the payload is 0), but there were other problems too
* I had to revise padding handling for all possible cases
* an ASF file has 3 places where the ASF packet size is stated - twice in the header objects and once in the packet itself - and the specs don't specify what one should do when they differ, or at least I didn't find it
* some stupid mistakes, like when I just forgot to do something after adding a new block to my code, were really annoying
A funny thing was when I fixed my demuxer for one group of samples and another group that worked before started to fail; then I fixed the new group and a third group failed. I was very annoyed by this, but many of the mistakes I made were caused by my inexperience and I think one (at least me) just has to make all of these mistakes to get better.
Finally, all the basic parts of the ASF demuxer seem to work somehow.
In the last two weeks I fixed various bugs in my code and I hope packet handling is correct now. The only problem is that a few packets at the end of the Data Object are still lost. Because I wanted a small break from this problem, my mentors allowed me to implement basic seeking first. The ASF demuxer can now read index entries from the Simple Index Object and add them with av_add_index_entry to the AVStream. So when a Simple Index Object is present in an ASF file, my demuxer can seek to the requested time.
The skeleton of the new ASF demuxer was written, but only audio is demuxed properly for now. The problem is the complicated handling of video frames in the ASF format. I hope I have finally found out how to process packets properly. An ASF packet can contain a single payload, a single payload with subpayloads, multiple payloads, or multiple payloads with subpayloads inside some of them. Every subpayload is always one frame, but a single payload can be a whole frame or just a part of it. When an ASF packet contains multiple payloads inside it, each of them can be one frame, but it can be just a fragment of one as well. When one of the multiple payloads contains subpayloads, each subpayload is one frame and can be processed as an AVPacket.
For the case of a fragmented frame in an ASF packet, I have to store several unfinished frames in ASFPacket structures that I created for this purpose. There should not be more than one unfinished frame per stream, so I have one ASFPacket in each ASFStream (ASFStream is a structure for storing ASF stream properties). An ASFPacket contains a pointer to an AVBufferRef where the unfinished frame is stored. When the frame is finished I can forward the pointer to the buffer with the data to an AVPacket, set its properties like size, timestamps and others, and finally return the AVPacket.
I introduced many bugs into code that was working (at least ASF packets were parsed right and audio worked) and now I'm working on fixing all of them.
I was accepted to OPW for the May–August 2014 round with the project "Rewrite the ASF demuxer". The first task from my mentors was to create a wiki page about ASF (Advanced Streaming Format); it was created at https://wiki.libav.org/ASF.
Interesting notes about other containers: http://codecs.multimedia.cx/?p=676.
The next task from my mentors was to write a simple program which reads an ASF file and prints its structure, i.e. the list of ASF objects, metadata and codec information. An ASF file consists of so-called ASF Objects. There are 3 top-level objects - the Header Object, the Data Object and the Index Object. The Header Object in particular can contain many other objects to provide different ASF features, for example the Codec List Object for codec information or the Metadata Object for metadata. One can recognise an object by its GUID, which is a 16-byte array that identifies the object type. I was confused by the fact that the GUID you read from the file does not match the GUID from the specs. For some historical reasons one has to modify the GUIDs from the specs (reorder the bytes) to match the GUID read from the file.
My program is working now and can list objects, codecs and metadata info, but it ignores Index Objects for now. I hope I'll add support for them soon. Also I want to print offsets for each object and dig deeper into the Data Object.
Today, I learned how to use framecrc as a debug tool. Many Libav tests use framecrc to compare expected and actual decoding. While rewriting existing code, the output from the old and new versions of the code on the same sample can be checked; this makes a lot of mistakes clear quickly, including ones that can be quite difficult to debug otherwise.
Checking framecrcs interactively is straightforward: ./avconv -i somefile -c:a copy -f framecrc -
. The -c:a copy
specifies that the original, rather than decoded, packet should be used. The -
at the end makes the output go to stdout, rather than a named file.
The output has several columns, for the stream index, dts, pts, duration, packet size, and crc:
0, 0, 0, 192, 2304, 0xbf0a6b45
0, 192, 192, 192, 2304, 0xdd016b78
0, 384, 384, 192, 2304, 0x18da71d6
0, 576, 576, 192, 2304, 0xcf5a6a07
0, 768, 768, 192, 2304, 0x3a84620a
It is also unusually simple to find out what the fields are, as libavformat/framecrcenc.c spells it out quite clearly:
static int framecrc_write_packet(struct AVFormatContext *s, AVPacket *pkt)
{
uint32_t crc = av_adler32_update(0, pkt->data, pkt->size);
char buf[256];
snprintf(buf, sizeof(buf), "%d, %10"PRId64", %10"PRId64", %8d, %8d, 0x%08"PRIx32"\n",
pkt->stream_index, pkt->dts, pkt->pts, pkt->duration, pkt->size, crc);
avio_write(s->pb, buf, strlen(buf));
return 0;
}
Keiler, one of my Libav mentors, patiently explained the above; I hope documenting it helps other people who are starting with Libav development.
Most recently, I have been adding documentation to Libav. Today, my work included writing a demuxer howto. In the last couple of weeks, I have also reimplemented RealAudio 1.0 support (2.0 is in progress), and learned more about Coccinelle and undefined behavior in C. Blog posts on these topics are pending.
My first patch for undefined behavior eliminates left shifts of negative numbers, replacing a << b (where a can be negative) with a * (1 << b). This change fixes bug686, at least for fate-idct8x8 and libavcodec/dct-test -i (compiled with ubsan and fno-sanitize-recover). Due to Libav policy, the next step is to benchmark the change. I was also asked to write a simple benchmarking HowTo for the Libav wiki.
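The substance of the transformation is tiny; an isolated example (not the actual patch) of the defined-behaviour replacement:
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int32_t a = -14, b = 3;
    /* "a << b" would be undefined behaviour in C because a is negative;   */
    /* the patch rewrites it as a multiplication with the same intended value. */
    int32_t fixed = a * (1 << b);
    printf("%d\n", fixed);   /* -112 */
    return 0;
}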
First, I installed perf: sudo aptitude install linux-tools-generic
I made two build directories, and built the code with defined behavior in one, and the code with undefined behavior in the other (with ../configure && make -j8 && make fate
). Then, in each directory, I ran:
perf stat --repeat 150 ./libavcodec/dct-test -i > /dev/null
The results were somewhat more stable than with --repeat 30, but it still looks much more like noise than a meaningful result. I ran the command with --repeat 30 for both before the recorded 150 run, so both would start on equal footing. With defined behavior, the results were “0.121670022 seconds time elapsed ( +- 0.11% )”; with undefined behavior, “0.123038640 seconds time elapsed ( +- 0.15% )”. The best of a further three runs had the opposite result, shown below:
% cat undef.150.best
perf stat --repeat 150 ./libavcodec/dct-test -i > /dev/null
Performance counter stats for ‘./libavcodec/dct-test -i’ (150 runs):
120.427535 task-clock (msec) # 0.997 CPUs utilized ( +- 0.11% )
21 context-switches # 0.178 K/sec ( +- 1.88% )
0 cpu-migrations # 0.000 K/sec ( +-100.00% )
226 page-faults # 0.002 M/sec ( +- 0.01% )
455’393’772 cycles # 3.781 GHz ( +- 0.05% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
1’306’169’698 instructions # 2.87 insns per cycle ( +- 0.00% )
89’674’090 branches # 744.631 M/sec ( +- 0.00% )
1’144’351 branch-misses # 1.28% of all branches ( +- 0.18% )
0.120741498 seconds time elapse
% cat def.150.best
Performance counter stats for ‘./libavcodec/dct-test -i’ (150 runs):
120.838976 task-clock (msec) # 0.997 CPUs utilized ( +- 0.11% )
21 context-switches # 0.172 K/sec ( +- 1.98% )
0 cpu-migrations # 0.000 K/sec
226 page-faults # 0.002 M/sec ( +- 0.01% )
457’077’626 cycles # 3.783 GHz ( +- 0.08% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
1’306’321’521 instructions # 2.86 insns per cycle ( +- 0.00% )
89’673’780 branches # 742.093 M/sec ( +- 0.00% )
1’148’393 branch-misses # 1.28% of all branches ( +- 0.11% )
0.121162660 seconds time elapsed ( +- 0.11% )
I also compared the disassembled code from jrevdct.o, before and after the changes to have defined behavior (using gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2 on x86_64).
In the build directory for the code with defined behavior:
objdump -d libavcodec/jrevdct.o > def.dis
sed -e 's/^.*://' def.dis > noline.def.dis
In the build directory for the code with undefined behavior:
objdump -d libavcodec/jrevdct.o > undef.dis
sed -e 's/^.*://' undef.dis > noline.undef.dis
Leaving aside differences in jump locations (despite the fact that they can impact performance), there are two differences:
diff -u build_benchmark_undef/noline.undef.dis build_benchmark_def/noline.def.dis
- 0f bf 50 f0 movswl -0x10(%rax),%edx
+ 0f b7 58 f0 movzwl -0x10(%rax),%ebx
It’s switched to using a zero-extension rather than sign-extension in one place.
- 74 1c je 40 <ff_j_rev_dct+0x40>
- c1 e2 02 shl $0x2,%edx
- 0f bf d2 movswl %dx,%edx
- 89 d1 mov %edx,%ecx
- 0f b7 d2 movzwl %dx,%edx
- c1 e1 10 shl $0x10,%ecx
- 09 d1 or %edx,%ecx
- 89 48 f0 mov %ecx,-0x10(%rax)
- 89 48 f4 mov %ecx,-0xc(%rax)
- 89 48 f8 mov %ecx,-0x8(%rax)
- 89 48 fc mov %ecx,-0x4(%rax)
+ 74 19 je 3d <ff_j_rev_dct+0x3d>
+ c1 e3 02 shl $0x2,%ebx
+ 89 da mov %ebx,%edx
+ 0f b7 db movzwl %bx,%ebx
+ c1 e2 10 shl $0x10,%edx
+ 09 da or %ebx,%edx
+ 89 50 f0 mov %edx,-0x10(%rax)
+ 89 50 f4 mov %edx,-0xc(%rax)
+ 89 50 f8 mov %edx,-0x8(%rax)
+ 89 50 fc mov %edx,-0x4(%rax)
Leaving aside differences in register use:
- 0f bf d2 movswl %dx,%edx
There is one extra movswl instruction in the version with undefined behavior, at least with the particular version of the particular compiler for the particular architecture checked.
This is an example of a null result while benchmarking; neither version performs better, although any given benchmark has one or the other come out ahead, generally by less than the variance within the run. If this were a suggested performance change, it would not make sense to apply it. However, the point of this change was correctness; a performance increase is not expected, and the lack of a performance penalty is a bonus.
One of my fantastic OPW mentors prepared a “Welcome task package”, of self-contained, approachable, useful tasks that can be done while getting used to the code, and with a much smaller scope than the core objective. This is awesome. To any mentors reading this: consider making a welcome package!
Step one of it is to use ubsan with gdb. This turned out to be somewhat intricate, so I have decided to supplement the wiki’s documentation with a step-by-step guide for Ubuntu 14.04.
1) Install clang-3.5 (sudo aptitude install clang-3.5
), as Ubuntu 14.04 comes with gcc 4.8, which does not support -fsanitize=undefined.
2) Under libav, mkdir build_ubsan && cd build_ubsan && ../configure --toolchain=clang-usan --extra-cflags=-fno-sanitize-recover
(alternatively, --cc=clang --extra-cflags=-fsanitize=undefined --extra-ldflags=-fsanitize=undefined can be used instead of --toolchain=clang-usan).
3) make -j8 && make fate
4) Watch where the tests die (they only die if --extra-cflags=-fno-sanitize-recover is used). For me, they died on TEST idct8x8. Running make V=1 fate
and asking my mentors pointed me towards libavcodec/dct-test -i, which is dying on jrevdct.c:310:47: with “runtime error: left shift of negative value -14”. If you really want to err on the side of caution, make a second build dir, and ./configure --cc=clang && make -j8 && make fate
in it, making sure it does not fail… this confirms that the problem is related to configuring with --toolchain=clang-usan (and, it turns out, with -fsanitize=undefined).
5) It’s time to use the information my mentor pointed out on the wiki about ubsan at https://wiki.libav.org/Security/Tools – specifically, the information about useful gdb breakpoints. I put a modified version of the b_u definitions into ~/.gdbinit. The wiki has been updated now, but was originally missing a few functions, including one that turns out to be relevant: __ubsan_handle_shift_out_of_bounds
6) Run gdb ./libavcodec/dct-test, then at the gdb prompt, set args -i to set the arguments dct-test was being run with, and then b_u to load the ubsan breakpoints defined above. Then start the program: type run at the gdb prompt.
7) It turns out that a problem can be found, and the program stops running. Get a backtrace with bt.
#0 0x… in __ubsan_handle_shift_out_of_bounds ()
#1 0x000000000048ac96 in __ubsan_handle_shift_out_of_bounds_abort ()
#2 0x000000000042c074 in row_fdct_8 (data=<optimized out>) at /home/me/opw/libav/libavcodec/jfdctint_template.c:219
#3 ff_jpeg_fdct_islow_8 (data=<optimized out>) at /home/me/opw/libav/libavcodec/jfdctint_template.c:273
#4 0x0000000000425c46 in dct_error (dct=<optimized out>, test=<optimized out>, is_idct=<optimized out>, speed=<optimized out>) at /home/me/opw/libav/libavcodec/dct-test.c:246
#5 main (argc=<optimized out>, argv=<optimized out>) at /home/me/opw/libav/libavcodec/dct-test.c:522
It would be nice to see a bit more detail, so I wanted to compile the project so that less would be optimized out, and eventually settled on -O1 because compiling with ubsan and without optimizations failed (which I reported as bug 683). This led to a slightly better backtrace:
#0 0x0000000000491a70 in __ubsan_handle_shift_out_of_bounds ()
#1 0x0000000000492086 in __ubsan_handle_shift_out_of_bounds_abort ()
#2 0x0000000000434dfb in ff_j_rev_dct (data=<optimized out>) at /home/me/opw/libav/libavcodec/jrevdct.c:275
#3 0x00000000004258eb in dct_error (dct=0x4962b0 <idct_tab+64>, test=1, is_idct=1, speed=0) at /home/me/opw/libav/libavcodec/dct-test.c:246
#4 0x00000000004251cc in main (argc=<optimized out>, argv=<optimized out>) at /home/me/opw/libav/libavcodec/dct-test.c:522
It is possible to work around the problem by modifying the source code rather than the compiler flags: FFmpeg did so within hours of the bug report – the commit is at http://git.videolan.org/?p=ffmpeg.git;a=commit;h=bebce653e5601ceafa004db0eb6b2c7d4d16f0c0 ! Both FFmpeg and Libav have also merged my patch to work around the problem (FFmpeg patch, Libav patch). The workaround of using -O1 was suggested by one of my mentors, lu_zero; --disable-optimizations does not actually disable all optimizations (in practice, it leaves in the ones necessary for compilation), and it does not touch the -O1 that --toolchain=clang-usan now sets.
Wanting a better backtrace leads to the next post: a detailed guide to narrowing down a bug in the C compiler, Clang. Yes, I know, the problem is never a bug in the C compiler – but this time, it was.
What’s the fun of only running code on platforms you physically have? Portability is important, and Libav actively targets several platforms. It can be useful to be able to try out the code, even if the hardware is totally unavailable.
Here is how to run Libav’s tests under aarch64, on x86_64 hardware and Ubuntu 14.04. This guide is provided in the hope that it saves someone else 20 hours or more: there is a lot of once-excellent information out there which has become misleading, because a lot of progress has been made in aarch64 support. I have tried three approaches – building with Linaro’s cross-compiler, building under QEMU user emulation, and building under QEMU system emulation. Building with a cross-compiler is the fastest option. Building under user emulation is about ten times slower. Building under system emulation is about a hundred times slower. There is actually a fourth option, using the ARM Foundation Model, but I have not tried it. Running under QEMU user emulation is the only approach I managed to make work entirely.
For all three approaches, you will want a rootfs; I used Ubuntu Core. You can download Ubuntu Core for aarch64 (a minimal rootfs; see https://wiki.ubuntu.com/Core to learn more), and untar it (as root) into a new directory. Then, set an environment variable that the rest of this guide/set of notes uses frequently, changing the path to match your system:
export a64root=/path/to/your/aarch64/rootdir
Approach 1 – build under QEMU’s user emulation.
Step 1) Set up QEMU. The days when using SUSE branches were necessary are over, but it still needs to be statically linked, and not all QEMU packages are. Ubuntu has a static QEMU:
sudo aptitude install qemu-user-static
This package also sets up binfmt for you. You can delete broken or stale binfmt information by running:
echo -1 > /proc/sys/fs/binfmt_misc/archnamehere
– this can be useful, especially if you have previously installed QEMU by hand.
Step 2) Copy your QEMU binary into the chroot, as root, with:
cp `which qemu-aarch64-static` $a64root/usr/bin/
Step 3) As root, set up the aarch64 image so it can do DNS resolution, so you can freely use apt-get:
echo 'nameserver 8.8.8.8' > $a64root/etc/resolv.conf
Step 4) Chroot into your new system. Run chroot $a64root /bin/bash as root.
At this point, you should be able to run an aarch64 version of ls, and confirm with file /bin/ls that it is an aarch64 binary.
Now you have a working, emulated, minimal aarch64 system.
On x86, you would run aptitude build-dep libav, but there is no such package for aarch64 yet, so outside of the chroot, on the normal system, I installed apt-rdepends and ran:
apt-rdepends --build-depends --follow=DEPENDS libav
With version information stripped out, the following packages are considered dependencies:
debhelper frei0r-plugins-dev libasound2-dev libbz2-dev libcdio-cdda-dev libcdio-dev libcdio-paranoia-dev libdc1394-22-dev libfreetype6-dev libgnutls-dev libgsm1-dev libjack-dev libmp3lame-dev libopencore-amrnb-dev libopencore-amrwb-dev libopenjpeg-dev libopus-dev libpulse-dev libraw1394-dev librtmp-dev libschroedinger-dev libsdl1.2-dev libspeex-dev libtheora-dev libtiff-dev libtiff5-dev libva-dev libvdpau-dev libvo-aacenc-dev libvo-amrwbenc-dev libvorbis-dev libvpx-dev libx11-dev libx264-dev libxext-dev libxfixes-dev libxvidcore-dev libxvmc-dev texi2html yasm zlib1g-dev doxygen
Many of the libraries do not have current aarch64 Ubuntu packages, and neither does frei0r-plugins-dev, but running aptitude install on the above list installs a lot of useful things – including build-essential. The full list is in the command below; the missing packages are non-essential.
Step 5) Set it up: apt-get install aptitude
aptitude install git debhelper frei0r-plugins-dev libasound2-dev libbz2-dev libcdio-cdda-dev libcdio-dev libcdio-paranoia-dev libdc1394-22-dev libfreetype6-dev libgnutls-dev libgsm1-dev libjack-dev libmp3lame-dev libopencore-amrnb-dev libopencore-amrwb-dev libopenjpeg-dev libopus-dev libpulse-dev libraw1394-dev librtmp-dev libschroedinger-dev libsdl1.2-dev libspeex-dev libtheora-dev libtiff-dev libtiff5-dev libva-dev libvdpau-dev libvo-aacenc-dev libvo-amrwbenc-dev libvorbis-dev libvpx-dev libx11-dev libx264-dev libxext-dev libxfixes-dev libxvidcore-dev libxvmc-dev texi2html yasm zlib1g-dev doxygen
Now it is time to actually build libav.
Step 6) Create a user within your chroot: useradd -m auser, and switch to running as that user: sudo -u auser bash, and type cd to go to the home directory.
Step 7) Run git clone git://git.libav.org/libav.git, then ./configure --disable-pthreads && make -j8 (change the 8 to approximately the number of CPU cores you have).
On my hardware, this takes 10-11 minutes, and ‘make fate’ takes about 16. Disabling pthreads is essential, as qemu-user does not handle threads well, and running the tests hangs randomly without it.
Approach 2: cross-compile (warning: I do not have the tests working with this approach).
1) Start by getting an aarch64 compiler. A good place to get one is http://releases.linaro.org/latest/components/toolchain/binaries/; I am using http://releases.linaro.org/latest/components/toolchain/binaries/gcc-linaro-aarch64-linux-gnu-4.8-2014.04_linux.tar.xz . Untar it, and add it to your path:
export PATH=$PATH:/path/to/your/linaro/tools/bin
2) Make the cross-compiler work. Run aptitude install lsb lib32stdc++6
. Without this, invoking the compiler will say “No such file or directory”. See http://lists.linaro.org/pipermail/linaro-toolchain/2012-January/002016.html.
3) Under the libav directory (run git clone git://git.libav.org/libav.git if you do not have one), type mkdir a64crossbuild; cd a64crossbuild. Make sure the libav directory is somewhere under $a64root (it should simplify running the tests, later).
4) ./configure --arch=aarch64 --cpu=generic --cross-prefix=aarch64-linux-gnu- --cc=aarch64-linux-gnu-gcc --target-os=linux --sysroot=$a64root --target-exec="qemu-aarch64-static -L $a64root" --disable-pthreads
This is a minimal variant of Jannau’s configuration – a developer who has recently done a lot of libav aarch64 work.
5) Run make -j8. On my hardware, it takes just under a minute.
6) Run make fate. Unfortunately, both versions of QEMU I tried hung on wait4 at this point (in fft-test, fate-fft-4), and used an extra couple of hundred megabytes of RAM per second until I stopped QEMU, even if I asked it to wait for a remote GDB. For anyone else trying this, https://lists.libav.org/pipermail/libav-devel/2014-May/059584.html has several useful tips for getting the tests to run after cross-compilation.
Approach 3: Use QEMU’s system emulation. In theory, this should allow you to use pthreads; in practice, the tests hung for me. The following May 9th post describes what to do: http://www.bennee.com/~alex/blog/2014/05/09/running-linux-in-qemus-aarch64-system-emulation-mode/. In short: git clone git://git.qemu.org/qemu.git qemu.git && cd qemu.git && ./configure --target-list=aarch64-softmmu && make, then:
./aarch64-softmmu/qemu-system-aarch64 -machine virt -cpu cortex-a57 -machine type=virt -nographic -smp 1 -m 2048 -kernel aarch64-linux-3.15rc2-buildroot.img --append "console=ttyAMA0" -fsdev local,id=r,path=$a64root,security_model=none -device virtio-9p-device,fsdev=r,mount_tag=r
Then, under the buildroot system, log in as root (no password), and type mkdir /mnt/core && mount -t 9p -o trans=virtio r /mnt/core. At this point, you can run chroot /mnt/core /bin/bash, and follow the approach 1 instructions from useradd onwards, except that ./configure without --disable-pthreads should theoretically work. On my system, ./configure takes a bit over 5 minutes with this approach. Running make is quite slow; time make took 113 minutes. Do not use -j – you are limited to a single core, so -j would slow compilation down slightly. However, make fate consistently hung on acodec-pcm-alaw, and I have not yet figured out why.
Things not to do:
Applying to OPW requires an initial contribution. The Libav IRC channel suggested porting the asettb filter from FFmpeg, so I did (version 5 of the patch was merged upstream, in two parts: a rename patch and a content patch; the FFmpeg author was credited as author for the latter, while I did a signed-off-by). I also contributed a 3000+ line documentation patch, standardizing the libavfilter documentation and removing numerous English errors, and triaged a few bugs, git bisecting the one that was reproducible.
And how it nearly ruined another video coding standard.
Everyone knows that interlacing was a trick in the '80s for pseudo motion compensation with analogue video. This more or less worked because it mimicked how television worked back then. The technique was preserved when flat panels for PCs and TVs were introduced, for a mix of backward compatibility and technical limitations, and interlacing made its way into video coding standards such as MPEG-2 and H.264.
However, as with black and white, TACS and Gopher, old technology eventually has to be replaced with something more modern and efficient, trading off users' interests against technology providers' market prospects. In case you are not familiar with it, interlacing is a mess to support: it makes decoding slower and heavily degrades quality. People who say that interlacing saves bandwidth do not know much about video coding, and bad marketing claiming that higher resolution is better than higher framerate has an effect too.
So, when ITU and then MPEG set out to establish the mandates for a new video standard capable of superseding H264, it was decided that interlacing was old enough, did more harm than good and it was time for retirement: HEVC was going to be the first video codec to officially deprecate interlacing.
Things went pretty swell during its development, until a few months before the completion of the standard. A group of US companies complained that the proposed tools were not sufficient (a set of SEI messages and treating fields like progressive frames) and heavily protested with both standardisation bodies. ITU firmly rejected the idea (with the video group chair threatening to step down) while MPEG set out to understand the needs of the industry and see if there was anything that could be done.
An ad-hoc group was established to see if there was any evidence that interlaced coding tools would have improved the situation. Things looked really shady: the Requirements group even mentioned that it was the first time an AhG was established to look for evidence, instead of establishing an AhG because there was evidence. Several liaisons from EBU and other DVB members tried to point out this absurdity while the threat of adding interlacing back into HEVC became real. Luckily the first version of the specification was published in the meantime, so this dispute didn't slow down the standardisation process.
Why so much love towards interlacing? Well, in the "rebellious" group's defence, it is true that interlaced content in HEVC is less performant than in H.264; however, it is also true that the same content, deinterlaced, outperforms H.264 in HEVC in any configuration. The truth is that mass-marketed deinterlacers (commonly found in televisions, for example) bring a lot of royalty income, so it is normal that companies with vested interests would prefer to have interlacing in a soon-to-be-popular video standard like HEVC. Also, in markets like the US, where the network operator (which has control over the encoding but not over the video source) might differ from the content provider, it could be politically difficult to act as a carrier only if you have to deinterlace the video.
However, these problems are not enough to justify forcing every encoder, decoder and analyser to support a deprecated technology like interlacing. Technical problems can be solved with good deinterlacers at the top of the distribution chain, while political ones can be solved by amending contracts. Plus, progressive-only video will definitely improve quality and let the industry concentrate on other delicate subjects, like bit depth – both of which work in users' favour.
At the last MPEG meeting, the "rebellious" group, which had been working on reintroducing interlacing for a year, provided no real evidence that interlaced coding tools would improve HEVC at all. The only sensible solution was to disband the group over this wasted effort and support progressive video only, which, luckily, is what happened. So now both ITU and MPEG support progressive video only, and that finally settles it.
Interlacing is dead, long live progressive.
Written by Vittorio Giovara (projectsymphony@gmail.com)
Published under a CC-BY-SA 3.0 license.
I am very glad to announce that Libav 10 has been released!
This release has a bunch of features that I contributed to, in particular regarding stereoscopic video and interlaced filtering, but more importantly it contains the work of an awesome group of people, carried out over a whole year. This is the magic of open source!
I joined the group more or less one year ago, with some patches regarding an obscure part of the H.264 specification which I then later reimplemented in HEVC, and then I wrote a few filters I needed, and then designed an API, and then, wow! A whole year passed without me noticing, and I am still around, sending patches to the same group of people who welcomed someone who had problems with shifting values (sad but true story)!
I met the team both at VDD and FOSDEM and they've been the most exciting conferences I have ever been to (and I have been to a lot of them). I couldn't believe I was with the dev team of my favourite multimedia opensource projects, which I've been following since I was a kid! Until a year ago, I only saw the names in the commits and the blog posts from both the VideoLAN and Libav projects, and I kept thinking "Oh, wouldn't it be so cool to be like one of them".
The answer is yes, it definitely would, and it's something that can happen if one is really committed to it! The Libav Info page states "Being a committer is a duty, not a privilege", but it sure does feel like one.
Thanks for this exciting year guys, I look forward to the next ones.
...using the latest modern tools!
x264 and VLC are two of the most awesome pieces of opensource software you can find online, and of course they pose no problem when you compile them in a Unix environment. Too bad that sometimes you need to think of Windowze as well, so we need a way to crosscompile that software: in this blogpost, I'll describe how to achieve that, using modern tools on an Ubuntu 12.04 installation.
[0] Sources
It goes without saying that without the following guides, I'd have had a much harder time!
http://alex.jurkiewi.cz/blog/2010/cross-compiling-x264-for-win32-on-ubuntu-linux
https://bbs.archlinux.org/viewtopic.php?id=138128
http://wiki.videolan.org/Win32Compile
http://forum.videolan.org/viewtopic.php?f=32&t=101489
So a big thanks to all the original authors!
[1] Introduction
When you crosscompile you just use the same tools and toolchains that you are used to, gcc, ld and so on, but configured (and compiled) so that they produce executable code for a different platform. This platform can vary both in software and in hardware and it is usually identified by a triplet: the processor architecture, the ABI and the operating system.
What we are going to use here is i686-w64-mingw32, which identifies any x86 CPU since the Pentium III, the w64 (mingw-w64) ABI used on modern Windows NT systems (if I'm not wrong), and mingw32 as the target system, that is, the Windows gcc variant.
[2] Prerequisites
Note that the names of the packages might be slightly different depending on your distribution. We are going to need a quite recent mingw-runtime for VLC (>= 3.00), which has not yet landed in Ubuntu, so we'll take it from our Debian cousins.
Execute these commands:
$ wget http://ftp.jp.debian.org/debian/pool/main/m/mingw-w64/mingw-w64-dev_3.0~svn4933-1_all.deb
$ sudo dpkg -i mingw-w64-dev_3.0~svn4933-1_all.deb
and then install stock dependencies
$ sudo apt-get install gcc-mingw-w64 g++-mingw-w64
$ sudo apt-get install pkg-config yasm subversion cvs git-core
$ mkdir -p ~/win32-cross/{src,lib,include,share,bin}
Now create a small wrapper script that sets up the cross-compilation environment; it is invoked as ../../mingw in the steps below, so save it as ~/win32-cross/mingw and make it executable. Please note the use of the CFLAGS variable: without all the static parameters, the executable will dynamically link against the gcc runtime, so you'd need to bundle the equivalent DLLs. I prefer to have one single exe, so everything goes static, but I'm not really sure which flag is actually needed. If you have any idea, please drop me a line.
#!/bin/sh
TRIPLET=i686-w64-mingw32
export CC=$TRIPLET-gcc
export CXX=$TRIPLET-g++
export CPP=$TRIPLET-cpp
export AR=$TRIPLET-ar
export RANLIB=$TRIPLET-ranlib
export ADD2LINE=$TRIPLET-addr2line
export AS=$TRIPLET-as
export LD=$TRIPLET-ld
export NM=$TRIPLET-nm
export STRIP=$TRIPLET-strip
export PATH="/usr/i586-mingw32msvc/bin:$PATH"
export PKG_CONFIG_PATH="$HOME/win32-cross/lib/pkgconfig/"
export CFLAGS="-static -static-libgcc -static-libstdc++ -I$HOME/win32-cross/include -L$HOME/win32-cross/lib -I/usr/$TRIPLET/include -L/usr/$TRIPLET/lib"
export CXXFLAGS="$CFLAGS"
exec "$@"
$ cd ~/win32-cross/src
$ wget -qO - http://zlib.net/zlib-1.2.7.tar.gz | tar xzvf -
$ cd zlib-1.2.7
$ ../../mingw ./configure
$ sed -i"" -e 's/-lc//' Makefile
$ make
$ DESTDIR=../.. make install prefix=
$ cd ~/win32-cross/src
$ git clone git://git.libav.org/libav.git
$ cd libav
$ ./configure \
--target-os=mingw32 --cross-prefix=i686-w64-mingw32- --arch=x86 --prefix=../.. \
--enable-memalign-hack --enable-gpl --enable-avisynth --enable-runtime-cpudetect \
--disable-encoders --disable-muxers --disable-network --disable-devices
$ make
$ make install
$ cd ~/win32-cross/src
$ svn checkout http://ffmpegsource.googlecode.com/svn/trunk/ ffms
$ cd ffms
$ ../../mingw ./configure --host=mingw32 --with-zlib=../.. --prefix=$HOME/win32-cross
$ ../../mingw make
$ make install
$ cd $HOME/win32-cross/src
# Create a CVS auth file on your machine
$ cvs -d:pserver:anonymous@gpac.cvs.sourceforge.net:/cvsroot/gpac login
$ cvs -z3 -d:pserver:anonymous@gpac.cvs.sourceforge.net:/cvsroot/gpac co -P gpac
$ cd gpac
$ chmod +rwx configure src/Makefile
# Hardcode cross-prefix
$ sed -i'' -e 's/cross_prefix=""/cross_prefix="i686-w64-mingw32-"/' configure
$ ../../mingw ./configure --static --use-js=no --use-ft=no --use-jpeg=no \
--use-png=no --use-faad=no --use-mad=no --use-xvid=no --use-ffmpeg=no \
--use-ogg=no --use-vorbis=no --use-theora=no --use-openjpeg=no \
--disable-ssl --disable-opengl --disable-wx --disable-oss-audio \
--disable-x11-shm --disable-x11-xv --disable-fragments --use-a52=no \
--disable-xmlrpc --disable-dvb --disable-alsa --static-mp4box \
--extra-cflags="-I$HOME/win32-cross/include -I/usr/i686-w64-mingw32/include" \
--extra-ldflags="-L$HOME/win32-cross/lib -L/usr/i686-w64-mingw32/lib"
# Fix pthread lib name
$ sed -i"" -e 's/pthread/pthreadGC2/' config.mak
# Add extra libs that are required but not included
$ sed -i"" -e 's/-lpthreadGC2/-lpthreadGC2 -lwinmm -lwsock32 -lopengl32 -lglu32/' config.mak
$ make
# Make will fail a few commands after building libgpac_static.a
# (i586-mingw32msvc-ar cr ../bin/gcc/libgpac_static.a ...).
# That's fine, we just need libgpac_static.a
$ i686-w64-mingw32-ranlib bin/gcc/libgpac_static.a
$ cp bin/gcc/libgpac_static.a ../../lib/
$ cp -r include/gpac ../../include/
Finally we can compile x264 at full power! The configure script will provide a list of what features have been activated; make sure everything you need is there!
$ cd ~/win32-cross/src
$ git clone git://git.videolan.org/x264.git
$ cd x264
$ ./configure --cross-prefix=i686-w64-mingw32- --host=i686-w64-mingw32 \
--extra-cflags="-static -static-libgcc -static-libstdc++ -I$HOME/win32-cross/include" \
--extra-ldflags="-static -static-libgcc -static-libstdc++ -L$HOME/win32-cross/lib" \
--enable-win32thread
$ make
$ git clone git://git.videolan.org/vlc.git vlc
$ cd vlc
And let's get the dependencies through the contrib scripts. qt4 needs to be compiled by hand, as the version in the Ubuntu repositories doesn't cope well with the rest of the process. I also had to remove some of the files because they were of the wrong architecture (mileage may vary here).
$ mkdir -p contrib/win32
$ cd contrib/win32
$ ../bootstrap --host=i686-w64-mingw32
$ make prebuilt
$ make .qt4
$ rm ../i686-w64-mingw32/bin/{moc,uic,rcc}
$ cd -
$ ./bootstrap
$ mkdir win32 && cd win32
$ ../extras/package/win32/configure.sh --host=i686-w64-mingw32
$ ./compile
$ make package-win-common
For future reference, all (or most) functions and structs in Libav carry a prefix that indicates the visibility of the symbol. Those are av_ for the public API, ff_ for symbols internal to a single library, and avpriv_ for symbols shared between the libav* libraries without being part of the public API.
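A tiny illustration of the convention (these prototypes are invented for the example and are not real Libav symbols):

void av_frobnicate(void);      /* av_     - public API, callable by applications      */
void ff_frobnicate(void);      /* ff_     - internal to a single libav* library        */
void avpriv_frobnicate(void);  /* avpriv_ - shared between the libav* libraries,       */
                               /*           but not part of the public API             */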
Well, I've finished the new audio decoding API, which has been merged into Libav master. The new audio encoding API is basically done, pending a (hopefully final) round of review before committing.
Next up is audio timestamp fixes/clean-up. This is a fairly undefined task. I've been collecting a list of various things that need to be fixed and ideas to try. Plus, the audio encoding API revealed quite a few bugs in some of the demuxers. Today I started a sort of TODO list for this stage of the project. I'll be editing it as the project continues to progress.
For the past few weeks I've been working on a new project sponsored by FFMTech. The entire project involves reworking much of the existing audio framework in libavcodec.
Part 1 is changing the audio decoding API to match the video decoding API. Currently the audio decoders take packet data from an AVPacket and decode it directly to a sample buffer supplied by the user. The video decoders take packet data from an AVPacket and decode it to an AVFrame structure with a buffer allocated by AVCodecContext.get_buffer(). My project will include modifying the audio decoding API to decode audio from an AVPacket to an AVFrame, as is done with video.
AVCODEC_MAX_AUDIO_FRAME_SIZE puts an arbitrary limit on the amount of audio data returned by the decoder. For example, each FLAC frame can hold up to 65536 samples for 8 channels at 32-bit sample depth, which is 2097152 bytes of raw audio, but AVCODEC_MAX_AUDIO_FRAME_SIZE is only 192000. Using get/release_buffer() for audio decoding will solve this problem. It will, however, require changes to every audio decoder. Most of those changes are trivial since the frame size is known prior to decoding the frame or is easily parsed. Some of the changes are more intrusive due to having to determine the frame size prior to allocating and writing to the output buffer.
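As a rough sketch of what the AVFrame-based path looks like from the caller's side (using the avcodec_decode_audio4() entry point this work became; error handling and the demuxer/decoder setup that would produce avctx and pkt are omitted, and the local names are invented):

#include <libavcodec/avcodec.h>

/* Sketch: decode one packet of audio into an AVFrame.  Assumes avctx has
 * already been opened with the right decoder and pkt comes from a demuxer. */
static int decode_one_packet(AVCodecContext *avctx, AVPacket *pkt)
{
    AVFrame *frame = avcodec_alloc_frame();   /* later replaced by av_frame_alloc() */
    int got_frame = 0;
    int ret;

    if (!frame)
        return AVERROR(ENOMEM);

    /* The decoder obtains the sample buffer through get_buffer(), sized for
     * the actual frame, so the old AVCODEC_MAX_AUDIO_FRAME_SIZE limit no
     * longer applies. */
    ret = avcodec_decode_audio4(avctx, frame, &got_frame, pkt);
    if (ret >= 0 && got_frame) {
        /* frame->nb_samples samples per channel are now in frame->data[]. */
    }

    av_free(frame);                           /* later replaced by av_frame_free() */
    return ret;
}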
As part of the preparation for the new API, I have been cleaning up all the audio decoders, which has been quite tedious. I've found some pretty surprising bugs along the way. I'm getting close to finishing that part, so I'll be able to move on to implementing the new API in each decoder.
So, I've moved on from AHT now, and it's on to Spectral Extension (SPX). I got the full syntax working yesterday, now I just need to figure out how to calculate all the parameters. I have a feeling this will help quality quite a bit, especially when used in conjunction with variable bandwidth/coupling. My vision for automatic bandwidth adjustment is starting to come together.
SPX encoding/decoding is fairly straightforward, so I expect this won't take too long to implement. Similar to channel coupling, the encoder writes coarsely banded scale factors for frequencies above the fully-encoded bandwidth, along with noise blending factors. The decoder copies lower frequency coefficients to the upper bands, multiplies them by the scale factors, and blends them with noise (which has been scaled according to the band energy and the blending factors in the bitstream). For the encoder, I just need to make the reconstructed coefficients match the original coefficients as closely as possible by calculating appropriate spectral extension coordinates and blending factors. Also, like coupling coordinates, the encoder can choose how often to resend the parameters to balance accuracy vs. bitrate.
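As a very rough sketch of the decoder-side reconstruction just described (plain floating-point C with an invented band layout and a toy noise source; the real E-AC-3 code uses a fixed band structure and different scaling):

#include <stdlib.h>

/* Rough illustration of spectral extension on the decoder side: copy
 * lower-frequency coefficients into an upper band, scale them, and blend in
 * noise.  Band layout, names and the noise generator are invented for the
 * example and do not match the real E-AC-3 band structure. */
static void spx_reconstruct_band(float *coeffs,
                                 int src_start, int dst_start, int band_size,
                                 float scale, float noise_blend,
                                 float band_energy)
{
    int i;

    for (i = 0; i < band_size; i++) {
        /* translated copy of a lower-frequency coefficient, scaled by the
         * coarsely banded scale factor from the bitstream */
        float copied = coeffs[src_start + i] * scale;

        /* noise scaled roughly to the band energy */
        float noise = ((float)rand() / RAND_MAX * 2.0f - 1.0f) * band_energy;

        /* noise_blend = 0 -> pure copy, 1 -> pure noise */
        coeffs[dst_start + i] = (1.0f - noise_blend) * copied
                              + noise_blend * noise;
    }
}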
Once SPX encoding is working properly, I'll revisit variable bandwidth. However, instead of adjusting the upper cutoff frequency (which is somewhat complex to avoid very audible attack/decay), it will adjust the channel coupling and/or spectral extension ranges to keep the cutoff frequency constant while still adjusting to changes in signal complexity to keep a more stable quality level at a constant bitrate. This could also be used in a VBR mode with constrained bitrate limits.
If you want to follow the development, I have a separate branch at my Libav github repository.
http://github.com/justinruggles/Libav/commits/eac3_spx
I finally got the complete AHT syntax working properly. Unfortunately, the quality seems to be lower at all bitrates than with the normal AC-3 quantization. I'm hoping that I just need to pick better gain values, but I have a suspicion that some of the difference is related to vector quantization, which the encoder has no control over (a basic 6-dimensional VQ minimum distance search is the best it can do).
My first step is to find out for sure if choosing better gain values will help. One problem is that the bit allocation model is saying we need X number of bits for each mantissa. Using mode=0 (all zero gains) gives exactly X number of bits per mantissa (with no overhead for encoding the gain values), but the overall quality is lower than with normal AC-3 quantization or even GAQ with simplistic mode/gain decisions. So I think that means there is some bias built-in to the AHT bit allocation table that assumes GAQ will appropriately fine-tune the final allocations. Additionally, it could be that AHT should not always be turned on when the exponents are reused in blocks 1 through 5 (the condition required to use AHT). This is probably the point where I need a more accurate bit allocation model...
edit: After analyzing the bit allocation tables for AC-3 vs. E-AC-3, it seems there is no built-in bias in the GAQ range. They are nearly identical. So the difference is clearly in VQ. Next step, try a direct comparison of quantized mantissas using VQ vs. linear quantization and consider that in the AHT mode decision.
edit2: dct+VQ is nearly always worse than linear quantization... I also tried turning AHT off for a channel if the quantization difference was over a certain threshold, but as the threshold approached zero, the quality approached that with AHT turned off. I don't know what to do at this point... *sigh*
note: analysis of a commercial E-AC-3 sample using AHT shows that AHT is always turned on when the exponent strategy allows it.
edit3: It turns out that the majority of the quality difference was in the 6-point DCT. If I turn it off in both the encoder and decoder (but leave the quantization the same) the quality is much better. I hope it's a bug or too much inaccuracy (it's 25-bit fixed-point) in my implementation... If not then I'm at another dead-end.
edit4: I'm giving up on AHT for now. The DCT is definitely correct and is very certainly causing the quality decrease. If I can get my hands on a source + encoded E-AC-3 file from a commercial encoder that uses AHT then I will revisit this. Until then, I have nothing to analyze to tell me how using AHT can possibly produce better quality.
Well, I finally got a working E-AC-3 encoder committed to Libav. The bitstream format does save a few bits here and there, but the overall quality difference is minimal. However, it will be the starting point for adding more E-AC-3 features that will improve quality.
The first feature I completed was support for higher bit rates. This is done in E-AC-3 by using fewer blocks per frame. A normal AC-3 frame has 6 blocks of 256 samples each, but E-AC-3 can reduce that to 1, 2, or 3 blocks. This way a small range can be used for the per-frame bit rate while still allowing a higher per-second bit rate. For example, 5.1-channel E-AC-3 content on HD-DVDs was typically encoded at 1536 kbps using 1 block per frame.
Currently I am working on implementing AHT (adaptive hybrid transform). The AHT process uses a 6-point DCT on each coefficient across the 6 blocks in the frame. It basically uses the normal AC-3 bit allocation process to determine quantization of each DCT-processed "pre-mantissa" but it uses a finer resolution for quantization and different quantization methods. I have the 6-point DCT working and one of the two quantization methods. Now I just need to finish the other quantization method and implement mantissa bit counting and bitstream output.
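To make the "DCT across the blocks" idea concrete, here is a toy floating-point length-6 DCT-II applied to one coefficient position across the six blocks of a frame; the real implementation is fixed-point (25-bit, as mentioned in the posts above) and the normalisation here is arbitrary:

#include <math.h>

/* Toy illustration: apply a length-6 DCT-II to the values of one transform
 * coefficient across the 6 audio blocks of a frame, producing the
 * "pre-mantissas" that are then quantized.  Floating point and unnormalised,
 * unlike the fixed-point version used in the real encoder and decoder. */
static void aht_dct6(const float in[6], float out[6])
{
    const float pi = 3.14159265f;
    int k, n;

    for (k = 0; k < 6; k++) {
        float sum = 0.0f;
        for (n = 0; n < 6; n++)
            sum += in[n] * cosf(pi / 6.0f * ((float)n + 0.5f) * (float)k);
        out[k] = sum;
    }
}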