[V]ector quantization ... works well at bitrates around 1 kbps or lower, but quickly reaches its limits when using higher bitrates. For example, even at a bitrate as low as 3 kbps, and assuming the encoder produces 100 vectors per second, one would need to store a codebook with more than 1 billion vectors, which is infeasible in practice.
In SoundStream, we address this issue by proposing a new residual vector quantizer (RVQ), consisting of several layers ...
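For intuition, here is a minimal numpy sketch of the residual VQ idea (illustrative only, not the actual SoundStream implementation; the dimensions and codebook sizes are made up):

    # Minimal sketch of residual vector quantization (illustrative only,
    # not the SoundStream implementation; sizes below are made up).
    import numpy as np

    rng = np.random.default_rng(0)

    # The excerpt's arithmetic: 3 kbps at 100 vectors/s = 30 bits/vector,
    # i.e. a single codebook of 2**30 (~1.07 billion) entries.
    # An RVQ spreads those 30 bits over, say, 3 layers of 1024 entries:
    # 3 * 10 = 30 bits, but only 3 * 1024 = 3072 stored vectors.
    dim, n_layers, codebook_size = 64, 3, 1024
    codebooks = [rng.standard_normal((codebook_size, dim)) for _ in range(n_layers)]

    def rvq_encode(x, codebooks):
        """Each layer quantizes the residual left by the previous layer."""
        residual, indices = x, []
        for cb in codebooks:
            idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
            indices.append(idx)
            residual = residual - cb[idx]
        return indices

    def rvq_decode(indices, codebooks):
        """The reconstruction is the sum of the selected codewords."""
        return sum(cb[i] for cb, i in zip(codebooks, indices))

    x = rng.standard_normal(dim)
    codes = rvq_encode(x, codebooks)
    print(codes, np.linalg.norm(x - rvq_decode(codes, codebooks)))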
With that jump in quality, it makes you wonder about the ratio of quality gained to processing power consumed.
Opus is a remarkable codec because it’s excellent at almost everything. The only areas where it’s being beaten are extreme narrowband, which it can’t do, and narrowband, where it’s still not shabby (though some of this new stuff is redefining what’s possible).
Opus tackled a broad field of competitors that were each somewhat specialised for their part of the field, and pretty much beat all of them at their own game. And in most cases the incumbents were high-latency, while Opus not only achieves quality superiority but also supports low-latency operation.
https://en.wikipedia.org/wiki/Opus_(audio_format) has some good diagrams and explanation. https://upload.wikimedia.org/wikipedia/commons/8/8d/Opus_qua... especially shows Opus’ superiority except in narrowband.
Past about 16kb/s, Opus is pretty much just the format to use, except for extreme niches like if you want to represent frequencies over 20kHz (above which Opus cuts everything off).
Opus is so good there’s pretty much nothing left to improve (for now), and even if you improved things it probably wouldn’t be worth it. That’s why all the development is happening in narrowband, because that’s the only interesting space left. Perhaps eventually some of those techniques will be scaled up past narrowband. I don’t know.
I'm in a hotel in a Schengen country right now and I am lucky to get 200kbps via the wifi.
Other, less industrialized countries are frequently worse.
It's safe to say a ~billion people's lives can be improved with better low bitrate codecs.
Opus is fine down to 8kbps. It fits over a cheap, shitty mid-20th century analogue telephone line with room to spare.
The ultra-narrow band stuff is very niche, and is consequently unlikely to have the broad impact you're imagining.
In contrast there is continued enthusiasm for these pointless midband codecs that are similar in performance to Opus but have the "advantage" that somebody gets $$$.
8kbps is enough - when you don't have to spare anything for error correction. Maybe in those cases, somewhat-okay analogue audio is enough (for example, in long-distance radiotelephony). But having a very impressive digital codec raises the bar significantly, especially since the last time someone bothered with this was someone at Nokia trying to fit speech into 6kbps using what are now rudimentary phone chips.
Additionally, there are people in the world (including the US) who are stuck using unreliable 28kbps lines. Having an option to do excellent audio and video over those is something that no-one seemed to bother with.
Very surprising, I'll be migrating all my music to Opus to save space.
For on-listening-device though, oh heck yes - Opus is great.
It's interesting that you didn't see any differences between FLAC and Opus, as Opus has a hard 20 kHz low-pass filter and is 48kHz-only.
I did not like the degradation at 140-128kb/s: noise above 16kHz, and some artifacts at 10-16kHz that are clearly visible and much more impactful.
Here are spectrograms: https://disk.yandex.com/d/PvQGSS2xBu7ucQ
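For anyone who wants to reproduce this kind of comparison, here is a rough scipy/matplotlib sketch (file names are placeholders; both inputs are assumed to already be decoded to WAV at the same sample rate):

    # Sketch: compare an original and a transcoded file by spectrogram,
    # e.g. to see Opus's ~20 kHz cutoff. File names are placeholders.
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.io import wavfile
    from scipy.signal import spectrogram

    fig, axes = plt.subplots(2, 1, sharex=True, sharey=True)
    for ax, name in zip(axes, ["original.wav", "opus_decoded.wav"]):
        rate, samples = wavfile.read(name)
        if samples.ndim > 1:                      # mix stereo down to mono
            samples = samples.mean(axis=1)
        f, t, sxx = spectrogram(samples, fs=rate, nperseg=2048)
        ax.pcolormesh(t, f, 10 * np.log10(sxx + 1e-12), shading="auto")
        ax.set_title(name)
        ax.set_ylabel("Frequency (Hz)")
    axes[-1].set_xlabel("Time (s)")
    plt.show()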
Doesn't Ogg Vorbis perform better than Opus on bitrates above 128kbps?
After AAC, researchers realized that AAC produced perceptually transparent audio with excellent compression and that growing bandwidth and disk space meant there wasn't much point to further improvement in that direction.
And remember that, unlike MP3, AAC is more like an entire toolbox of compression technologies that applications/encoders can use for a huge variety of purposes and bitrates -- so if you need better performance, there's a good chance you can just tune the AAC settings rather than needing a whole new codec.
So research shifted in two other directions -- super low-bandwidth (like this post) for telephony etc., and then spatial audio (like with AirPods).
I'm sure there are unusual cases such as live streaming over satellite internet where getting an extra 1% compression on high quality audio is a big deal, but even that's likely a temporary problem. Starlink users already get >50Mbps.
That's true only if you don't mind losing a substantial portion of your potential audience.
In contrast, AAC (including the low-bitrate HE-AAC and HE-AACv2 flavors) is as ubiquitous as MP3 for anything made in the past 10 years.
So if you want higher quality at a relatively high bitrate? Use a higher bitrate. Just use 256Kbps AAC-LC instead of 128Kbps; all of its patents have expired, so it is truly patent-free. The only not-so-good thing is that none of the open-source AAC encoders are anywhere near as good as the one iTunes provides. Or you could use Opus, at a slightly better quality/bitrate, if it fits your usage scenario. Even high-bitrate MP3 for quite literally 100% backward compatibility. If you want something at 256Kbps+ but don't want to go lossless, Musepack continues to be best in its class, but you get very little decoding support.
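For concreteness, roughly what those choices look like with ffmpeg driven from Python (a sketch only: file names are placeholders, ffmpeg's built-in aac encoder is used for AAC-LC, and libopus is assumed to be present in your ffmpeg build):

    # Sketch: the bitrate choices above, via ffmpeg from Python.
    # File names are placeholders; assumes ffmpeg with libopus available.
    import subprocess

    src = "input.flac"

    # 256Kbps AAC-LC with ffmpeg's built-in encoder
    subprocess.run(["ffmpeg", "-i", src, "-c:a", "aac", "-b:a", "256k", "out_aac.m4a"],
                   check=True)

    # Opus at a lower bitrate for comparison
    subprocess.run(["ffmpeg", "-i", src, "-c:a", "libopus", "-b:a", "128k", "out.opus"],
                   check=True)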
 I propose we call it that unless there's already a snappy name for it?
Suppose a person says something that the codec interprets differently. Perhaps they have one of the many ever-evolving accents that almost certainly were not, and could not possibly be, included in the training set (ongoing vowel shifts might be a big cause of this). The algorithm removes the ambiguity, but the speaker can't tell, because they hear themselves through their own sense of hearing. Assume the user has somehow overcome the odd psychological effects that come with hearing the computer-generated audio played back; if that audio is mixed with what the person is already hearing, it's likely they still won't notice, because they still hear themselves. They would have to listen to a recording some time later and detect that the recording doesn't match what they thought they said... which happens all the time anyway, because memory is incredibly lossy and malleable.
Most of the time, it won't matter. People have context (assuming they're actually listening, which is a whole other tier of hearing what you expect to hear), and people pick the wrong word or pronounce things incorrectly (as in, not recognizable to the listener as what the speaker intended) all the time. But it'll be really hard to know that the recording doesn't reflect what was actually said. You'd need to record a local accurate copy, obtain the receiver's processed copy, and know to look for the discrepancy in what will likely be many hours of audio. It's also possible that "the algorithm said that" will be a common enough argument, due to other factors (incorrect memory and increasing awareness of ML-based algorithms), that it'll outnumber the cases where it really happens.
In a live situation, they will ask you to repeat if it's unclear.
But idk this may be too slow for real time.
I agree, neural networks are exactly the type of system that works well "most of the time" but then can fail unexpectedly in odd and subtle ways.
A NN is going to fail on that at least as miserably as linear interpolation fails in CS.
We already have trouble like this with texting autocorrection.
So, at what point during a phoneme does it become distinct?
After any one phoneme has been sent down the wire you can truncate the audio and send a phoneme identifier; at the other end they just replay the phoneme from a cache.
Like doing speech to text, but using phonemes, and sending the phoneme hash string instead of the actual audio.
Must have been done already?
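Roughly, the idea would look something like this toy sketch (everything here is hypothetical: recognize_phonemes() stands in for a real phoneme recognizer, and the cache would hold per-speaker audio snippets keyed by phoneme ID):

    # Toy sketch of the "send phoneme IDs, replay from a cache" idea.
    from typing import Dict, List
    import numpy as np

    PhonemeId = int

    def recognize_phonemes(audio: np.ndarray) -> List[PhonemeId]:
        """Placeholder: a real system would run a speech model here."""
        raise NotImplementedError

    def encode(audio: np.ndarray) -> bytes:
        # The entire transmitted payload is one byte per phoneme.
        return bytes(recognize_phonemes(audio))

    def decode(payload: bytes, cache: Dict[PhonemeId, np.ndarray]) -> np.ndarray:
        # Concatenating cached clips loses timing, pitch and prosody, which is
        # why real neural codecs transmit richer latent codes instead.
        return np.concatenate([cache[p] for p in payload])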
I've tried them; the current state of the art means they're only useful on relatively static content (some shaking etc.), while they spike up to AV1-level bitrates to reach perceptual similarity when there's too much movement. Maybe in the future (or with whatever under-wraps concoction Nvidia, Qualcomm or another player has), ML-based video codecs will surpass hand-tuned codecs, but that's not (yet) the present state.
The premise was using e2e learning to avoid patent issues. I am sure that with some preprocessing you can plug an NN into the decision process and learn very meaningful stuff.
Well that does seem to be Microsoft's position at least. (see: Copilot)
What if we want the codec to produce more easily perceptible sound, not just a reconstruction? It could bake in noise and echo reduction, voice improvements, etc.
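As a toy illustration of folding enhancement into decoding, something like a crude spectral gate could run as part of the decode step (spectral_gate() is a crude stand-in for real, likely learned, denoising, and decode_frames() is a placeholder for whatever produces the codec's raw reconstruction):

    # Toy sketch: enhancement folded into decoding.
    import numpy as np

    def spectral_gate(x: np.ndarray, threshold_db: float = -50.0) -> np.ndarray:
        """Zero out STFT bins below a fixed level (very crude denoising)."""
        n = 1024
        window = np.hanning(n)
        out = np.zeros(len(x))
        for start in range(0, len(x) - n, n // 2):   # 50% overlap-add
            frame = x[start:start + n] * window
            spec = np.fft.rfft(frame)
            mag_db = 20 * np.log10(np.abs(spec) + 1e-12)
            spec[mag_db < threshold_db] = 0.0
            out[start:start + n] += np.fft.irfft(spec, n) * window
        return out

    def decode_with_enhancement(bitstream):
        raw = decode_frames(bitstream)               # placeholder codec decode
        return spectral_gate(raw)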
I suppose there are a few futures:
- the paper was very innovative but nothing really happens in the ‘real world’
- Google roll it out globally to one of their products and we discover that some voices/accents/languages work poorly (or well)
- The same but with slow rollout, feedback, studies, and improvements to try to make sure everyone gets a good result.
It's certainly able to be kept, and it's not a particularly unwieldy amount of data for the average power user, but if I ever had to bring my music somewhere (and I mean all of it) I'd be screwed. Plus that data would take hours to copy if ever needed. I spent just an hour today dding a 64GB SD card.
I just keep my favorite albums and discographies in 320 MP3 as I know there is almost no music player on earth that will fail to play it. Then I keep that in a micro SD in my wallet, for that emergency in which you desperately need some Pink Floyd.
In a few years you’ll have it on your phone anyway!
2. Consider https://beets.io for management, it may be just the system you want. One of its features is converting your library into a lossy version while keeping your high-res files in a very managed way.
These days, I can fortunately download most stuff directly from bandcamp :)
See also: xerox photocopiers changing dimensions of technical drawings with "too clever" compression schemes.
Note also that Codec2 had some experimental work extending it with the WaveNet neural network, which improved the performance.
Given both of these, it seems disingenuous to exclude Codec2 from the comparison. I can only assume it's left out because it performs well at even lower bitrates.
I guess the Google AI team works separately from the main Google Team. Lot of respect for pointing out EVS.
Also, it would be kinda neat to see something that is pretrained in this way and then does a small number of human-in-the-loop training iterations, to see if quality improves, or perhaps to uncover something previously unknown about human auditory perception...
Why 5.9 instead of 6 kbps? Did they want to have some kind of PR victory over Opus?