318 lines
22 KiB
Plaintext
318 lines
22 KiB
Plaintext
|
This is a log of my development of dr_opus - a single file public domain Opus decoder in C. The purpose of this log is
|
||
|
to track the development of the project and log how I solved various problems.
|
||
|
|
||
|
References:
|
||
|
RFC 6716 - https://tools.ietf.org/html/rfc6716 - Definition of the Opus Audio Codec
|
||
|
RFC 7845 - https://tools.ietf.org/html/rfc7845 - Ogg Encapsulation for the Opus Audio Codec
|
||
|
|
||
|
|
||
|
The main reference for development is RFC 6716. An important detail with this specification is in Section 1
|
||
|
|
||
|
The primary normative part of this specification is provided by the
|
||
|
source code in Appendix A. Only the decoder portion of this software
|
||
|
is normative, though a significant amount of code is shared by both
|
||
|
the encoder and decoder. Section 6 provides a decoder conformance
|
||
|
test. The decoder contains a great deal of integer and fixed-point
|
||
|
arithmetic that needs to be performed exactly, including all rounding
|
||
|
considerations, so any useful specification requires domain-specific
|
||
|
symbolic language to adequately define these operations.
|
||
|
Additionally, any conflict between the symbolic representation and
|
||
|
the included reference implementation must be resolved. For the
|
||
|
practical reasons of compatibility and testability, it would be
|
||
|
advantageous to give the reference implementation priority in any
|
||
|
disagreement. The C language is also one of the most widely
|
||
|
understood, human-readable symbolic representations for machine
|
||
|
behavior. For these reasons, this RFC uses the reference
|
||
|
implementation as the sole symbolic representation of the codec.
|
||
|
|
||
|
What this is basically saying is that the source code should be considered the main point of reference for the format of
|
||
|
the Opus bit stream. We cannot, however, just willy nilly take code from the reference implementation because that would
|
||
|
cause licensing problems. However, I'm going to make a prediction before I even begin: I think RFC 6716 with a bit
|
||
|
intuition, problem solving and help from others it can be figured out. Indeed, I think RFC 6716 may even be enough, and
|
||
|
that perhaps the authors of said specification refer to the source code as a self-defence mechanism to guard themselves
|
||
|
from possible subtle errors in the specification. If I'm wrong, the log below will track it.
|
||
|
|
||
|
Lets begin...
|
||
|
|
||
|
|
||
|
2018/11/30 6:55 PM
|
||
|
==================
|
||
|
- Inserted skeleton code with licensing information. Dual licensed as Public Domain _or_ MIT, whichever you prefer. API
|
||
|
is undecided as of yet.
|
||
|
- Decided on supporting C89. Not yet decided on how the standard library will be used.
|
||
|
- Deciding for now to use the init() style of API from dr_wav and dr_mp3 which takes a pre-allocated decoder object. May
|
||
|
need to change this to dr_flac style open() (or perhaps new()/delete()) if the size of the structure is too big.
|
||
|
- dr_wav, etc. use booleans for results status so I'm copying that for consistency.
|
||
|
- Added boilerplate for standard sized types (copied from mini_al).
|
||
|
- Added a comment at the top of the file to show that this isn't complete (even though it's in the "wip" folder!).
|
||
|
- Added boilerplate for read and seek callbacks (copied from dr_flac).
|
||
|
- Added shell APIs:
|
||
|
- dropus_init()
|
||
|
- Not requiring onSeek for the moment because I'm hoping that we can make that optional somehow, with restriction.
|
||
|
- dropus_uninit()
|
||
|
- Does nothing at the moment, but in future will need to close a file when the decoder is opened against a file.
|
||
|
- Added some common overrideable macros (taken from dr_flac):
|
||
|
- DROPUS_ASSERT
|
||
|
- DROPUS_COPY_MEMORY
|
||
|
- DROPUS_ZERO_MEMORY
|
||
|
- DROPUS_ZERO_OBJECT
|
||
|
- Decided to use the C standard library for now. This simplifies things like memset() and assert(). Currently using the
|
||
|
following headers:
|
||
|
- stddef.h
|
||
|
- stdint.h
|
||
|
- stdlib.h
|
||
|
- string.h
|
||
|
- assert.h
|
||
|
- Added some infrastructure for future APIs: dropus_init_file() and dropus_init_memory()
|
||
|
- In dr_flac, dr_wav and dr_mp3, a mistake was made with how the internal file handle is managed by the decoder. In
|
||
|
these libraries the pUserData member of the decoder structure is recycled to hold the FILE pointer. dr_opus is
|
||
|
changing this to be it's own member of the main decoder structure.
|
||
|
- The internal members for use with dropus_init_memory() use the same system as dr_mp3. See drmp3.memory.
|
||
|
- Added dropus_init_file()
|
||
|
- All init() APIs call dropus_init_internal() which is where the main initialization work is done.
|
||
|
- This is using the stdio FILE API.
|
||
|
- dropus_uninit() has been updated to close the file.
|
||
|
- Added dropus_fopen() for compiler compatibility. Taken from dr_flac, but added the "mode" parameter for consistency
|
||
|
with fopen().
|
||
|
- Added dropus_fclose() for namespace consistency with dropus_fopen().
|
||
|
- Added dropus_init_memory()
|
||
|
- onRead and onSeek callbacks taken from dr_mp3.
|
||
|
- INITIAL COMMIT
|
||
|
|
||
|
2018/12/01 12:46 PM
|
||
|
===================
|
||
|
- Downloading all official test vectors in preparation for future testing.
|
||
|
|
||
|
2018/12/01 03:58 PM
|
||
|
===================
|
||
|
- Decided to figure out the different encapsulations for Opus so I can figure out a design for data flow.
|
||
|
- Ogg: https://tools.ietf.org/html/rfc7845
|
||
|
- Matroska: https://wiki.xiph.org/MatroskaOpus
|
||
|
- WebM: https://www.webmproject.org/docs/container/
|
||
|
- MPEG-TS: https://wiki.xiph.org/OpusTS
|
||
|
- Mysterious encapsulation used by the official test vectors. Looking at this it's hard to tell what they are using for
|
||
|
this, but I'm assuming it's not documented. Will need to figure that one out for myself later.
|
||
|
- Since WebM is a subset of Matroska, can we combine those together to avoid code duplication?
|
||
|
- Upon reading into Ogg and MPEG-TS encapsulation (not looked into Matroska in detail yet) it looks like handling large
|
||
|
channel counts is going to be a pain. It can support up to 255 channels, which are made up of multiple streams. My
|
||
|
instict is telling me that we won't be able to use a fixed sized data structure and instead might require dynamic
|
||
|
allocations. Darn...
|
||
|
- Leaning towards a two-tier architecture - a low-level decoder for raw Opus streams, and a high-level API for encapsulated
|
||
|
streams. Opus packets support only 1 or 2 channels which means the size of the structure _can_ be fixed.
|
||
|
- Currently I'm thinking something like the following:
|
||
|
- struct dropus_stream - low-level Opus stream decoding
|
||
|
- struct dropus - high-level encapsulated Opus stream decoding
|
||
|
- struct dropus_ogg (Ogg Opus)
|
||
|
- struct dropus_mk (Matroska)
|
||
|
- struct dropus_ts (MPEG-TS)
|
||
|
- struct dropus_tv (That annoying non-standard and undocumented encapsulation used by the test vectors)
|
||
|
- The maximum number of samples per frame (per-channel) is 960 because:
|
||
|
- The maximum frame size for wideband (16kHz) is 60: 16*60 = 960
|
||
|
- The maximum frame size for fullband (48kHz) is 20: 48*20 = 960
|
||
|
- To account for stereo frames we would need to multiply 960 by 2: 960*2 = 1920
|
||
|
- This is small enough that it can be stored statically inside the dropus_stream structure.
|
||
|
- It looks like low level Opus streams won't be able to do efficient seeking without an encapsulation.
|
||
|
- An Opus stream is made up of packets. Each packet has 1 or more frames. Each frame can be made up of 1 or 2 channels.
|
||
|
- The size of the packet in bytes is unknown at the Opus stream level - it depends on the encapsulation to know the
|
||
|
size of the packet. This could be a bit of a problem because we may need to know the size of a frame in order to know
|
||
|
how much data we need from the client. I'm wondering if the low-level API should look something like the following:
|
||
|
- dropus_result dropus_stream_decode_packet(dropus_stream* pOpusStream, const void* pCompressedData, size_t compressedDataSize);
|
||
|
- This will require the caller to have information on the size of the packet.
|
||
|
- Added struct dropus_stream_packet, dropus_stream_frame.
|
||
|
- Each packet contains a byte that defines the structure of the packet. It's called the TOC byte
|
||
|
(Section 3.1. The TOC Byte) in the RFC 6716.
|
||
|
- Added skeleton implementations for dropus_stream_init() and dropus_stream_decode_packet().
|
||
|
- Added helpers for decoding the different parts of TOC byte.
|
||
|
- dropus_toc_config(), dropus_toc_s() and dropus_toc_c()
|
||
|
- Added enum for the three different modes as outlined in Section 3.1 of RFC 6716. Also added an API to extract this
|
||
|
mode from the config component of the TOC: dropus_toc_config_mode() and dropus_toc_mode().
|
||
|
- Added API to extract sample rate from the TOC: dropus_toc_config_sample_rate() and dropus_toc_sample_rate()
|
||
|
- Added API to extract the size of an Opus frame in PCM frames from the TOC:
|
||
|
- dropus_toc_config_frame_size_in_pcm_frames() and dropus_toc_frame_size_in_pcm_frames().
|
||
|
- COMMIT
|
||
|
|
||
|
2018/12/02 09:45 AM
|
||
|
===================
|
||
|
- Compiling for the first time. Targetting C89 so need to get it all working property right from the start.
|
||
|
- "test" folder added to my local repository. Not version controlled yet as I'm not sure I want to include all of
|
||
|
the test stuff in the repository. Will definitely not be version controlling the test vectors. Perhaps use a submodule?
|
||
|
- Have decided to not support the mysterious encapsulation format used by the test vectors in dr_opus itself. Instead
|
||
|
I'm just doing it manually in the test program.
|
||
|
- After attempting to build the C++ version, it appears Clang will attempt to #include stdint.h from Visual Studio's
|
||
|
standard library which fails. May need to come up with a way to avoid the dependency on stdint.h. Same thing also
|
||
|
happens with stdlib.h.
|
||
|
- Have cleaned up the sized types stuff for Clang to avoid the dependency on stdint, however errors are still happening
|
||
|
due to the use of stdlib.h. Leaving the Clang C++ issue for now. It's technically a Clang issue rather than dr_opus.
|
||
|
- COMMIT
|
||
|
- Have decided to version control test programs, but not test vectors. Instead I'm putting a note in a readme about
|
||
|
where to download and place the test vectors.
|
||
|
- In my experience I have found that having a testing infrastructure set up earlier saves a lot of time in the long run
|
||
|
so the next thing I have decided to work on is a test program. It makes sense to test against the official test vectors,
|
||
|
so the first thing is to figure out this undocumented encapsulation format...
|
||
|
|
||
|
Test Vector Encapsulation Format
|
||
|
--------------------------------
|
||
|
- All test vectors seem to start with two bytes of 0x00.
|
||
|
- This does not feel like a magic number or identifier.
|
||
|
- Noticed that first 4 bytes add up to a relatively small number. A counter of some kind in little endian format?
|
||
|
- A pattern can be seen in testvector02.bit:
|
||
|
- The bytes 00 00 00 XX can been seen repeating. This indicates some kind of blocking.
|
||
|
- From the first 4 bytes you see the hex value of 0x0000001E which equals 30 in decimal
|
||
|
- Counting forward 30 bytes brings us 4 bytes before the next set of 00 00 00 XX bytes. Definitely starting to feel like
|
||
|
the first 4 bytes are a byte counter.
|
||
|
- Skipping past those four bytes and looking at the next 00 00 00 XX bytes we get 0x0000001D which is 29 in decimal.
|
||
|
- Again, counting forward by that number of bytes brings us to 4 bytes prior to the next 00 00 00 XX block. First 4
|
||
|
bytes are definitely a byte counter.
|
||
|
- It makes sense that each of these blocks are an Opus packet.
|
||
|
- Still don't know what the second set of 4 bytes could represent... Does it matter for our purposes? All we care about
|
||
|
is the raw Opus bit stream.
|
||
|
- A Google search for "opus test vector format" brings up the following result (a PDF document):
|
||
|
- https://www.ietf.org/proceedings/82/slides/codec-4.pdf
|
||
|
- From the "Force Multipliers" page:
|
||
|
- "The range coder has 32 bits of state which must match between the encoder and decoder"
|
||
|
- "Provides a checksum of all encoding and decoding decisions"
|
||
|
- "opus_demo bitstreams include the range value with every packet and test for mismatches"
|
||
|
- Willing to bet that the extra 4 bytes are that "32 bits of state". This is actually good for dr_opus.
|
||
|
- I think the format is just a list of Opus packets with 8 bytes of header data for each one:
|
||
|
4 bytes: Opus packet size in bytes
|
||
|
4 bytes: 32 bit of range coder state
|
||
|
[Opus Packet]
|
||
|
--------------------------------
|
||
|
|
||
|
- The test program can either hard code a list of vector files or we can enumerate over files in known directories. I'm
|
||
|
preferring the latter to make it easier to add new vectors. It's also consistent with the way dr_flac does it which has
|
||
|
been pretty good in the past.
|
||
|
- Started work on a "common" lib for test programs. First thing is the new file iterator.
|
||
|
- File iterator now done, but only supporting Windows currently.
|
||
|
- Have implemented the logic for loading test vectors and the byte counter (and I'm assuming the range state) is actually
|
||
|
in big-endian, not little-endian. I'm an idiot!
|
||
|
- Copying endian management APIs from dr_flac over to dr_opus.
|
||
|
- Basic infrastructure for testing against the standard test vectors is now done.
|
||
|
- COMMIT
|
||
|
|
||
|
2018/12/03 6:15 PM
|
||
|
==================
|
||
|
- The packet enumeration in the test program seems to work so far, but I'm still not 100% sure that my assumptions are
|
||
|
correct. Adding more validation to dropus_stream_decode_packet() to increase my certainty. The next stage is to
|
||
|
implement all of the main packet validation.
|
||
|
- Question for later. How are "Discontinuous Transmission (DTX) or lost packet" handled? Should
|
||
|
dropus_stream_decode_packet() return an error in this case?
|
||
|
- Code for validating packet codes 0, 1 and 2 are now in, however untested so far. Code 3 is a bit more complicated and
|
||
|
right now I need to eat dinner...
|
||
|
- COMMIT
|
||
|
|
||
|
2018/12/03 6:40 PM
|
||
|
==================
|
||
|
- Code 3 CBR is done (no VBR yet). Still untested.
|
||
|
- Code 3 VBR is done. Still untested.
|
||
|
- Testing and debugging against the standard test vectors coming next.
|
||
|
- Fixed a few compiler warnings.
|
||
|
- COMMIT
|
||
|
|
||
|
2019/03/08 6:28 PM
|
||
|
==================
|
||
|
- Starting on range decoding. From figure 4 in RFC 6716 the diagram shows that both the SILK and CELT decoders draw
|
||
|
their data from the Range Decoder. Therefore the Range Decoder needs to be figured out before SILK and CELT decoding.
|
||
|
I have no prior knowledge on range coding, so hopefully RFC 6716 has enough information for a complete enough
|
||
|
implementation for purpose of Opus decoding.
|
||
|
- Section 4.1.1 of RGC 6716 is how to initialize the decoder. It references a variable called "rng", but doesn't
|
||
|
specificy it's possible ranges. Setting to a 32-bit unsigned integer for now.
|
||
|
- In this section, it says "or containing zero if there are no bytes in this Opus frame". I'm assuming we do this
|
||
|
by looping over each frame in order...
|
||
|
- Just noticed that the end of section 4.1.1 says "rng > 2**23". So "rng" needs to be 32-bits.
|
||
|
- And now I've just noticed that section 4.1 explicitly says the following: "Both val and rng are 32-bit unsigned integer
|
||
|
values." Problem solved.
|
||
|
- Having read over the range coding stuff in RFC 6716 I had some difficulty figuring out where symbols frequencies are
|
||
|
being stored. They're not stored, but rather specified in the specification in a per-context basis. In the spec there
|
||
|
are references to "PDFs" which is like this "{f[0], f[1]}/ft". Example: "{1, 1}/2". Simple, now that I've finally
|
||
|
figured it out...
|
||
|
- I can therefore not do the range decoder independant of some higher level context. I will start with the SILK decoder
|
||
|
since that's the next section...
|
||
|
|
||
|
2019/03/09 6:52 AM
|
||
|
==================
|
||
|
- I think I'm getting a feel for the requirements for the range decoding API. I've decided for now to use a structure
|
||
|
for this called dropus_range_decoder.
|
||
|
- dropus_range_decoder_init() implemented based on section 4.1.1.
|
||
|
- Added stub for dropus_stream_decode_frame(). I think the idea is that we iterate over each frame in the packet and
|
||
|
decode them in order? Not fully sure yet how frame transitions will work, so may need to rethink this.
|
||
|
- I'm still not sure if the range decoder is supposed to be initialized once per frame or once per packet...
|
||
|
- After more testing it turns out my test program wasn't reporting errors properly. *Sigh*. These bugs were originating
|
||
|
from the packet decoding section.
|
||
|
- COMMIT
|
||
|
- Section 4.1.2 mentions that there's two steps to decoding a symbol. The first step seems easy enough. Implemented in
|
||
|
dropus_range_decoder_fs().
|
||
|
- The second step requires updating "val" and "rng" of the range decoder based on (fl[k], fh[k] and ft). Implemented in
|
||
|
dropus_range_decoder_update(pRangeDecoder, fl, fh, ft).
|
||
|
- Normalization of the range decoder is easy (RFC 6716 - 4.1.2.1). Implemented in dropus_range_decoder_normalize().
|
||
|
- I think at this point we have the foundations done for the range decoder. Still more to do for CELT, but I think I
|
||
|
now have enough to hack away a bit more at the SILK-specific data.
|
||
|
- COMMIT
|
||
|
- I'm still not 100% sure how range decoding works - I know it depends on the context of the specific thing being
|
||
|
decoded, but not sure how the index "k" is handled. The first part of the SILK frame is the VAD flags PDF={1,1}/2.
|
||
|
Do I use the value of k (which I think is 0 or 1) to know whether or not the bit is set? I am going to assume this
|
||
|
for now.
|
||
|
|
||
|
2019/03/10 7:40 AM
|
||
|
==================
|
||
|
- At the moment I am using what feels like a very cumbersome way of decoding "k" from the range decoder. I'm calling
|
||
|
dropus_range_decoder_fs(), followed by a function called dropus_range_decoder_k() to retrieve "k", and then
|
||
|
dropus_range_decoder_update() at the end of it. I can simplify this into a single function, I think, which I'm
|
||
|
calling dropus_range_decode() which will perform the "fs" decode, perform the update, then return "k".
|
||
|
- As I'm writing dropus_range_decoder_decode() I've realized that I can simplify my dropus_range_decoder_update()
|
||
|
function to do both the "k" lookup and the the update from section 4.1.2. It just needs to be changed to take
|
||
|
the "context" probabilities and "fs" and to return "k".
|
||
|
- This _seems_ to work. Or at least, the results are consistent with my first attempt. Committing.
|
||
|
- Added dropus_Q13() for doing the stereo weight lookup (RFC 7616 - Table 7).
|
||
|
- Section 4.2.7.1 mentions that the weights are linearly interpolated with the weights from the previous frame. What
|
||
|
factor do I use for the interpolation?
|
||
|
- Finished initial work on RFC 7616 - Section 4.2.7.1. Still don't know how the interpolation part works. I'm hoping
|
||
|
this is explained later in the spec.
|
||
|
- COMMIT
|
||
|
- Completed initial implementation of RFC 7616 - Section 4.2.7.2.
|
||
|
|
||
|
2020/03/29 6:33 AM
|
||
|
==================
|
||
|
- After a whole year, time to take another look at this. Only problem is, I've forgotten everything! Let's do a
|
||
|
quick pass on the high level stuff to get it consistent with dr_flac. Two mistakes I made with dr_flac: Not having
|
||
|
a low-level API that works on the FLAC frame level; and returning booleans instead of result codes. The first one
|
||
|
is already being handled, but lets update the high level APIs to use result codes.
|
||
|
- High level APIs now updated to return result codes.
|
||
|
- People are going to want to log errors: added dropus_result_description() to retrieve a human readable description
|
||
|
of a result code for logging and error reporting purposes.
|
||
|
- I've had reports of people having compilation errors with inlining before. Need to bring over the INLINE macro from
|
||
|
dr_flac to ensure we don't get more reports about that.
|
||
|
- I've had people request the ability to customize whether or not public APIs are marked as static or extern in other
|
||
|
libraries, so adding the DROPUS_API decoration. It's better to do this sooner than later, because this happened with
|
||
|
miniaudio recently which was a ~40K line library at the time, and it was nightmare to update!
|
||
|
- There will be a requirement to allocate memory for high level APIs. Support for allocation macros and allocation
|
||
|
callbacks need to be added now so we can avoid breaking APIs later.
|
||
|
- There have been requests in the past to support loading files from a wchar_t encoded path. Adding support by pulling
|
||
|
in custom fopen() and wfopen() implementations from dr_flac.
|
||
|
- Have not yet compiled on VC6 so lets do a pass on that to make it easier for us later.
|
||
|
- VC6 is now compiling clean. Time to check GCC and Clang in strict C89 mode.
|
||
|
- Getting compiler errors about some Linux specific code for handling byte swaps. Removed and replaced with a cross-
|
||
|
platform solution.
|
||
|
- From what I can see it looks like our boilerplate is now up to date with dr_flac, dr_wav and dr_mp3. Time to get onto
|
||
|
some actual Opus decoding.
|
||
|
- Looks like a typo in dropus_range_decoder_update() for handling the `fl[k] > 0` case. Changed this, but now getting
|
||
|
a division by zero. Debugging...
|
||
|
- And fixed. The last paragraph in RFC 6716 - Section 4.1.2 is the part I was missing:
|
||
|
After the updates, implemented by ec_dec_update() (entdec.c), the
|
||
|
decoder normalizes the range using the procedure in the next section,
|
||
|
and returns the index k.
|
||
|
- Reading through dropus_stream_decode_packet() to refresh my memory and some stuff needs cleaning up. Important from
|
||
|
now on to distinguish between PCM frames and Opus frames. All variables referring to one of the other needs to be
|
||
|
prefixed with "pcm" or "opus", depending on it's use. Example: opusFrameCount / pcmFrameCount.
|
||
|
- Added infrastructure for handling CELT and Hybrid frames when the time comes. For now still focused on getting SILK
|
||
|
frames working.
|
||
|
- Reviewing SILK decoding logic to refresh my memory and looks like I've misinterpreted the encoding for per-frame
|
||
|
LBRR flags. When there's more than one SILK frame, the first LBRR flag is just used to determine whether or not
|
||
|
the per-frame LBRR flags is present. I was _always_ reading the LBRR flags regardless of the value of the primary
|
||
|
LBRR flag.
|
||
|
|
||
|
2020/03/30 7:30 AM
|
||
|
==================
|
||
|
- The decoding of SILK frames needs to be put in their own function in order to handle LBRR and regular SILK frames.
|
||
|
Looking at the spec, there's some complicated branching logic for determine what needs to be decoded. I've put in
|
||
|
a rough draft as a start, but needs some work. Will continue on this later.
|