SoundStream: An End-to-End Neural Audio Codec (googleblog.com)
286 points by todsacerdoti 3 days ago | 103 comments

The post is contaminated with some wet marketing language, adding unnecessary noise to the information. The most important part of this whole post is this:

... [V]ector quantization ... works well at bitrates around 1 kbps or lower, but quickly reaches its limits when using higher bitrates. For example, even at a bitrate as low as 3 kbps, and assuming the encoder produces 100 vectors per second, one would need to store a codebook with more than 1 billion vectors, which is infeasible in practice.

In SoundStream, we address this issue by proposing a new residual vector quantizer (RVQ), consisting of several layers ...
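For what it's worth, the arithmetic in the quoted passage is easy to check, and a toy residual quantizer shows why stacking layers helps. This is only a sketch: the 3 kbps and 100 vectors per second figures come from the post, but the split into 3 layers of 10 bits each, the tiny 2-D codebooks, and the helper names `nearest`, `rvq_encode` and `rvq_decode` are assumptions picked for illustration, not SoundStream's actual configuration or code.

```python
# Figures quoted from the post.
bitrate_bps = 3000
vectors_per_second = 100
bits_per_vector = bitrate_bps // vectors_per_second   # bits to index one vector
flat_codebook_size = 2 ** bits_per_vector              # one entry per index

# Assumption for illustration: 3 residual layers of 10 bits each.
num_layers = 3
bits_per_layer = bits_per_vector // num_layers
rvq_total_entries = num_layers * 2 ** bits_per_layer

print(flat_codebook_size)   # 1073741824
print(rvq_total_entries)    # 3072


def nearest(codebook, vector):
    """Index of the codebook entry closest to vector (squared distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vector)),
    )


def rvq_encode(codebooks, vector):
    """Quantize with each layer in turn, passing the residual to the next."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices


def rvq_decode(codebooks, indices):
    """Sum the selected entry from every layer."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, i in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out


# Hypothetical toy codebooks: a coarse layer and a refinement layer.
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0], [1.0, -1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, 0.0]],
]
indices = rvq_encode(toy_codebooks, [1.2, 1.0])
print(indices)                             # [1, 1]
print(rvq_decode(toy_codebooks, indices))  # [1.25, 1.0]
```

Under these assumptions the layered version stores about three thousand vectors instead of about a billion, while still spending 30 bits per vector.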

What is “wet marketing language?”

This is really impressive. They say that at 3kbps this new codec sounds as good as Opus at 12kbps, and that it is trained across a wide range of bandwidth. It is a much bigger deal than Lyra was. I'd like to know whether it can run at low latency.

> They say that at 3kbps this new codec sounds as good as Opus at 12kbps

With that jump in quality, it makes you wonder about the processing power consumption ratio.

Low latency was a design goal.

How low though? Lyra added significant overhead, from what I read. They said it was small, but in practice it was something like 100ms, which, given that a typical OK connection is easily 50ms, would put it in the perceivable range of delay.

That's why they open-sourced Lyra, so they can focus on this.

Just curious, is there any progress on relatively high-bitrate audio codecs? Not that I'm unsatisfied with the current state of AAC, but I find most of these new developments are about some super low bitrate (this case is even more extreme, 3kbps!?).


Opus is a remarkable codec because it’s excellent at almost everything. The only areas where it’s being beaten are extreme narrowband, which it can’t do, and narrowband, where it’s still not shabby (though some of this new stuff is redefining what’s possible).

Opus tackled a broad field of competitors that were each somewhat specialised for their part of the field, and pretty much beat all of them at their own game. And in most cases the incumbents were high-latency, while Opus not only achieves quality superiority but also supports low-latency operation.

https://en.wikipedia.org/wiki/Opus_(audio_format) has some good diagrams and explanation. https://upload.wikimedia.org/wikipedia/commons/8/8d/Opus_qua... especially shows Opus’ superiority except in narrowband.

Past about 16kb/s, Opus is pretty much just the format to use, except for extreme niches like if you want to represent over 20kHz (above which Opus cuts).

Opus is so good there’s pretty much nothing left to improve for now, and even if you improved things it probably wouldn’t be worth it. That’s why all the development is happening in narrowband, because that’s the only interesting space left. Perhaps eventually some of those techniques will be scaled up past narrowband. I don’t know.

Yeah. In reality the main reason new audio codecs are developed post-Opus isn't technical, it's so that companies can get their patents into new standards and rake in the licensing royalties. There are better codecs for really low bitrates but that's quite niche these days; even telephony is going wideband and higher bitrate.

I think you perhaps overestimate the bandwidth available to most telephone handsets in most of the world.

I'm in a hotel in a schengen country right now and I am lucky to get 200kbps via the wifi.

Other, less industrialized countries are frequently worse.

It's safe to say a ~billion people's lives can be improved with better low bitrate codecs.

Let's assume for a moment that you're not stupid enough to confuse 200kBps (1.6Mbps) for 200kbps.

Opus is fine down to 8kbps. It fits over a cheap, shitty mid-20th century analogue telephone line with room to spare.

The ultra-narrow band stuff is very niche, and is consequently unlikely to have the broad impact you're imagining.

In contrast there is continued enthusiasm for these pointless midband codecs that are similar in performance to Opus but have the "advantage" that somebody gets $$$.

"Pointless" is relative here.

8kbps is enough when you don't spare anything for error correction. Maybe in those cases, somewhat okay analogue audio is enough (for example, in long-distance radiotelephony). But having a very impressive digital codec raises the bar significantly, especially since the last time anyone bothered with this was someone at Nokia trying to fit speech into 6kbps on what are now rudimentary phone chips.

Additionally, there are people in the world (including US) who are stuck using unreliable 28kbps lines. Having an option to do excellent audio and video is something that no-one seemed bothered to do.

That's 200kbps for a building with a few dozen people in it.

Why are WhatsApp calls so terrible even on 100 Mbps links?

I just compared 730 kb/s Flac with 160kb/s Opus, I can see no difference even on spectrogram using 'mother of mp3' track: Tom's Diner (Live At The Royal Albert Hall, London, UK, November 18, 1986).

Very surprising, will be migrating all my music to opus to save space.

Beware of phase differences: they won't show up on a spectrogram but could seriously upset your stereo impression. Check for them before you compress all your FLAC content; figuring this out only afterwards could be quite annoying.

Yeah - definitely keep your backups in FLAC. Having a lossless source gives you infinite future flexibility, and that cannot be overstated. Otherwise you're kinda doomed to hit https://www.youtube.com/watch?v=fZCRYo-0K0c eventually.

For on-listening-device though, oh heck yes - Opus is great.

Just as a side note, spectrograms are not in any way indicative of compression quality.

It's interesting that you didn't see any differences between FLAC and Opus, as Opus has a hard 20 kHz low-pass filter and is 48kHz-only.

There is a visible 20 kHz LPF, but it is meaningless for the use case of music listening, right?

I did not like the degradation at 140-128kb/s: the noise above 16kHz and some artifacts at 10-16kHz are clearly visible and much more impactful.

Here are the spectrograms: https://disk.yandex.com/d/PvQGSS2xBu7ucQ

As I said, spectrograms are not indicative of compression quality. Codecs should be judged by ears only. What you see in a spectrogram will vastly differ between codecs and will not reflect their compression efficiency.

> I just compared 730 kb/s Flac with 160kb/s Opus, I can see no difference even on spectrogram

Doesn't Ogg Vorbis perform better than Opus on bitrates above 128kbps?

Opus is transparent at 160kb/s, so no.

When comparing, were your Vorbis examples encoded at 48kHz too?

AFAIK, not really.

After AAC, researchers realized that AAC produced perceptually transparent audio with excellent compression and that growing bandwidth and disk space meant there wasn't much point to further improvement in that direction.

And remember that, unlike MP3, AAC is more like an entire toolbox of compression technologies that applications/encoders can use for a huge variety of purposes and bitrates -- so if you need better performance, there's a good chance you can just tune the AAC settings rather than need a whole new codec.

So research shifted in two other directions -- super low-bandwidth (like this post) for telephony etc., and then spatial audio (like with AirPods).

For 99.9% of use cases, that's a solved problem. You'll never use anything other than FLAC and Opus for lossless and lossy compression respectively.

I'm sure there are unusual cases such as live streaming over satellite internet where getting an extra 1% compression on high quality audio is a big deal, but even that's likely a temporary problem. Starlink users already get >50Mbps.

I'm sure Sirius XM would be interested in something that outperforms AAC v2 by a large margin

> You'll never use anything other than FLAC and Opus for lossless and lossy compression respectively.

That's true only if you don't mind losing a substantial portion of your potential audience.

In contrast, AAC (including the low-bitrate HE-AAC and HE-AACv2 flavors) is as ubiquitous as MP3 for anything made in the past 10 years.

Oh, I wrote something about that [1] on HN and HydrogenAudio. But basically there is zero incentive to do so. We are no longer limited by storage or bandwidth. Bandwidth cost, or cost per transfer, declines at a much greater rate than any audio or video codec advancement.

So if you want higher quality at a relatively high bitrate? Use a higher bitrate. Just use 256Kbps AAC-LC instead of 128Kbps; all of its patents have expired, so it is truly patent free. The only not-so-good thing is that none of the open source AAC encoders are anywhere near as good as the one iTunes provides. Or you could use Opus, at slightly better quality per bitrate, if it fits your usage scenario. Or even high-bitrate MP3, for quite literally 100% backward compatibility. If you want something at 256Kbps+ but don't want to go lossless, Musepack continues to be best in its class, but you get very little decoding support.

[1] https://news.ycombinator.com/item?id=10787455

The article mentions that it can scale bitrate directly by adding or removing layers. But I sure wish they had included some hi-fi quality sample audio.

General Harmonics[0] may have come up with something. [0]www.generalharmonics.com

Would anyone compress their music library using a deep learning codec, given the possibility of awkward artifacts?

Any sufficiently fancy compression for communication formats immediately makes me worry about the Xerox Effect[1], where the reconstructed content is valid yet semantically different in some important way to the original.

[1] I propose we call it that unless there's already a snappy name for it?

Indeed. I also expect this failure mode will be undetected for a long time due to how our sense of hearing works. My last neuroscience class was many years ago, but I do remember that in some sense, we hear what we expect to hear (more so than vision if I recall correctly, though there is plenty that happens in our vision processing) in that our ears tune for particular frequencies to filter out ambiguities.

Suppose a person says something that the codec interprets differently. Perhaps they have one of the many ever evolving accents that almost certainly were not and absolutely could not possibly be included in the training set (ongoing vowel shifts might be a big cause of this). The algorithm removes the ambiguity, but the listener can't tell because they hear themselves through their own sense of hearing. Assume the user has somehow overcome the odd psychological effects that come with hearing the computer generated audio played back; if that audio is mixed with what the person is already hearing, it's likely they still won't notice because they still hear themselves. They would have to listen to a recording some time later and detect that the recording doesn't match up with what they thought they said... which happens all the time because memory is incredibly lossy and malleable.

Most of the time, it won't matter. People have context (assuming they're actually listening, which is a whole other tier of hearing what you expect to hear) and people pick the wrong word or pronounce things incorrectly (as in not recognizable to the listener as what the speaker intended) all the time. But it'll be really hard to know that the recording doesn't record what was actually said. You need to record a local accurate copy, the receiver's processed copy, and know to look for it in what will likely be many hours of audio. It's also possible that "the algorithm said that" will be a common enough argument due to other factors (incorrect memory and increasing awareness of ML-based algorithms) that it'll outnumber the cases where it really happens.

This seems similar to being able to read your own handwriting, when others can't. If it's an important recording, someone else should listen, and it would be better to verify a transcription.

In a live situation, they will ask you to repeat if it's unclear.

Yep, it's kinda happening with the music example on the page: the Lyra (3kbps) sample has some human-sounding parts where the original reference is just music without any speech. Probably because Lyra was trained on speech.

It's a valid concern but I think it can also be solved. Compress-decompress and compare to the original using a method not susceptible to xerox effect. If the sound has materially changed then use a fallback method, robust but less efficient, for that particular window.

But idk this may be too slow for real time.
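A rough sketch of that per-window fallback idea, in case it helps. Everything here is hypothetical: `choose_codec`, the two toy "codecs" (which just round samples at different precisions and return the reconstruction), and the max-absolute-error comparison are stand-ins for illustration. A real system would need a comparison that actually catches semantic changes, which plain sample error does not claim to do.

```python
def choose_codec(window, primary_roundtrip, fallback_roundtrip, distance, threshold):
    """Return (codec_name, reconstruction) for one window of samples.

    Each *_roundtrip callable compresses and decompresses the window and
    returns the reconstruction.
    """
    candidate = primary_roundtrip(window)
    if distance(window, candidate) <= threshold:
        return "primary", candidate
    # The efficient codec changed the window too much: use the robust one.
    return "fallback", fallback_roundtrip(window)


def coarse(window):
    """Toy stand-in for the efficient codec: snap samples to steps of 0.5."""
    return [round(x * 2) / 2 for x in window]


def fine(window):
    """Toy stand-in for the robust codec: keep two decimal places."""
    return [round(x, 2) for x in window]


def max_abs_error(a, b):
    """Placeholder comparison: largest per-sample difference."""
    return max(abs(x - y) for x, y in zip(a, b))


easy_window = [0.0, 0.5, 1.0, 0.5]
hard_window = [0.2, 0.7, 0.3, 0.8]
print(choose_codec(easy_window, coarse, fine, max_abs_error, 0.1)[0])  # primary
print(choose_codec(hard_window, coarse, fine, max_abs_error, 0.1)[0])  # fallback
```

The cost is one extra decode and comparison per window on the sender side, which is where the real-time worry comes from.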

I remember that: https://news.ycombinator.com/item?id=6156238

I agree, neural networks are exactly the type of system that works well "most of the time" but then can fail unexpectedly in odd and subtle ways.

Next step: train a NN to predict what we’re going to say to compensate network latency :)

Have you ever played Counter-Strike with a lagging network? Observe the teammates running into walls and such. That's what the conversation would sound like.

Running into a wall is not a result of NN prediction. It's a result of naïve linear prediction. A NN will predict that he'll stop, turn around, maybe shoot someone. The hard thing is to not disappoint the NN by actually running into a wall.

You need to take the complexity into consideration. Counter-Strike has a very tiny set of possible actions you can take. Compare that to the space of things one can say.

A NN is going to fail on that at least as miserably as linear interpolation fails in CS.

When speech prediction fails, it should sound clearly wrong. Otherwise we risk serious misunderstanding when the prediction says something that sounds good but means something different.

We already have trouble like this with texting autocorrection.

It could make for a hilarious new comedic art form.

Don’t modern codecs already do something like this to conceal packet loss? (Obviously only at the phoneme and not word or semantic level.)

Does this work for the starting phoneme of a word?

Just running with that a little ...

So, at what point during a phoneme does it become distinct?

After any one phoneme has been sent down the wire you can truncate the audio and send a phoneme identifier; at the other end they just replay the phoneme from a cache.

Like doing speech to text, but using phonemes, and sending the phoneme hash string instead of the actual audio.

Must have been done already?

Wondering the same.

Google's Duo already does this. In my experience, it's amazing. http://ai.googleblog.com/2020/04/improving-audio-quality-in-...

I wonder how big the codec itself is. Neural networks are not exactly small. Also, the processing latency - Lyra mentions on its GitHub page 100ms.

The paper says 8.4 million parameters for the default model. Assuming 4 bytes per parameter, that's about 30 MB.
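Spelling out that estimate as a quick sketch: the parameter count is the figure quoted from the paper above, and 4 bytes per parameter is the same assumption (32-bit floats, no quantization or compression of the weights).

```python
params = 8_400_000      # default model size quoted above
bytes_per_param = 4     # assumption: 32-bit floats

size_bytes = params * bytes_per_param
print(size_bytes / 1e6)                # 33.6 (decimal megabytes)
print(round(size_bytes / 2 ** 20, 1))  # 32.0 (mebibytes)
```

So "about 30 MB" is the right ballpark; storing weights in 16 or 8 bits would halve or quarter it.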

Being able to download in advance is huge. Looking forward to satellite audio for everything spoken-voice one day. Some org was working on it, but they were working with 2kbps 24/7, which is definitely not a lot!

In the paper they compare latencies from 7ms to 26ms and observe no loss of quality. The processing requirement is lower for higher latencies because of batching.

Has anyone thought of trying to end-to-end train an h.265 decoder? The results might not be 100% perfect, but the resulting codec might bypass a ton of patents.

Since the h265 relies heavily on operations which are not easily differentiable, such as translation of patches of images, together with a pretty complicated binary format, I'd be pretty amazed if the NN actually learned anything meaningful at all.

Interpolated translation is continuous and easily differentiated. There's lots of work on machine-learned video codecs already, from Nvidia, Qualcomm and others.

> There's lots of work on machine-learned video codecs already, from Nvidia, Qualcomm and others.

I've tried them. The current state of the art means that they're only useful on relatively static things (some shaking etc.), while they spike up to AV1-level bitrates to reach perceptual similarity when there is too much movement. Maybe in the future (or with whatever under-wraps concoction Nvidia, Qualcomm or another player has), ML-based video codecs will surpass hand-tuned codecs, but that's not (yet) the present state.

It's not difficult to propagate gradients while translating an image. Learning "pick an 8x8 patch from (145,17), apply X to it and translate it by (4,-8)" from data is on a different level, isn't it?

The premise was using e2e learning to avoid patent issues. I am sure that with some preprocessing you can plug a NN inside the decoding process and learn very meaningful stuff.

Aren't software implementations royalty free anyway? and even if they weren't, could they not classify a neural network as a software implementation? (from the patent enforcing point of view) Because if that is not the case, this idea would be applicable to a lot of things right? Seems like an easy hack

> Because if that is not the case, this idea would be applicable to a lot of things right? Seems like an easy hack

Well that does seem to be Microsoft's position at least. (see: Copilot)

Is it possible to apply this same technique to video codecs? If not general video, then at least video streams where the center subject is a human face?

> that induce the reconstructed audio to sound like the uncompressed original input.

What if we want the codec to produce easier-perceptible sound, not just a reconstruction. Could bake in noise and echo reduction, voice improvements etc

Did you read the “Joint Audio Compression and Enhancement” section of the announcement?

I wonder how the training works in that system

I assume they start with the "signal" and "noise" as separate audio files, and then they play them together in order to create a synthetic noisy input. Then they can train the output against only the signal so that it will learn to filter out the noise.
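A minimal sketch of how such training pairs could be built, purely as a guess at the setup described in this comment. Mixing at a chosen signal-to-noise ratio is my assumption (the comment only says the two are played together), and `mix_at_snr` and `make_training_pair` are hypothetical helper names, not anything from the paper.

```python
import math


def mix_at_snr(signal, noise, snr_db):
    """Add noise to signal, scaled so the mix has the requested SNR in dB."""
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = sum(n * n for n in noise) / len(noise)
    # Choose gain so that signal_power / (gain**2 * noise_power) == 10**(snr_db / 10).
    gain = math.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(signal, noise)]


def make_training_pair(signal, noise, snr_db):
    """Model input is the noisy mix; the training target is the clean signal."""
    return mix_at_snr(signal, noise, snr_db), signal


clean = [1.0, -1.0, 1.0, -1.0]
background = [0.5, 0.5, -0.5, -0.5]
noisy_input, target = make_training_pair(clean, background, snr_db=20)
print(noisy_input)
print(target)
```

Because the loss only ever sees the clean target, the model has no incentive to spend bits reproducing the background.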

I wonder if this approach would discriminate against certain voices.

Almost certainly, but this is true of most low bitrate codecs. I've got a very deep voice and it becomes largely unintelligible in marginal mobile signal conditions. If anything this one might be more tweakable and/or personalizable than what we use today.

To some extent, surely. In their samples they have some music and some audio with background noise. The music survives ok and the clanging of the background noise is reduced to clicking so maybe languages with clicking sounds do ok too.

I suppose there are a few futures:

- the paper was very innovative but nothing really happens in the ‘real world’

- Google roll it out globally to one of their products and we discover that some voices/accents/languages work poorly (or well)

- The same but with slow rollout, feedback, studies, and improvements to try to make sure everyone gets a good result.

For the previous version of Lyra, there was a concern that it would discriminate against some languages.

For a company that penalizes sites for being "not mobile friendly", they really drop the ball on this blog. All the graphs have their right sides off screen, hidden, with no chance to see them on a normal modern Android phone.

The pictures and graphs are embedded in an html table for reasons I don't understand.

Looks cut off in portrait mode but fits fine in landscape (iPhone X).

Now this is impressive. Truly quite amazing technology. I pray we reach the day where my 1200GB music collection can be compressed without noticing, to a fraction of the size.

Depends what format you have it in now. 160kbps opus is transparent for almost all music. That's 1/5 the space of a FLAC collection, or 1/2 the size of 320kbps mp3, two other popular ways to store music without noticeable loss in quality.
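To put numbers on that for the 1200GB collection mentioned above, here is a back-of-the-envelope sketch. The ~800 kbps FLAC average is an assumption (it is what the "1/5" figure implies; real FLAC bitrates vary by material), and `reencoded_size_gb` is just an illustrative helper that scales size linearly with bitrate.

```python
def reencoded_size_gb(current_size_gb, current_kbps, target_kbps):
    """Estimate the re-encoded size, assuming size scales with average bitrate."""
    return current_size_gb * target_kbps / current_kbps


# If the collection is FLAC averaging roughly 800 kbps (assumed):
print(reencoded_size_gb(1200, 800, 160))  # 240.0
# If it is 320 kbps MP3 instead:
print(reencoded_size_gb(1200, 320, 160))  # 600.0
```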

I actually do encode to 128 opus and crank up the encoder settings for music synced to my own devices, because I know they can play it. It's pretty much transparent to me but I am not an audiophile with audiophile equipment, which is why I keep the originals.

Why? that's not much more than 1TB, easily available in various storage formats for a very low price. What's the issue with the size?

It's not very portable. I can't keep it on my phone without a 1TB micro SD, and I could theoretically carry an SSD but that's not really portable as it requires an internal connection. And if it's an external drive with a USB connection then it's probably much larger than my phone.

It's certainly able to be kept and it's not a particularly unwieldy amount of data for the average power user, but if I ever had to bring my music somewhere (and I mean all of it) I'd be screwed. Plus that data would take hours to copy if ever needed. I spent just an hour today dding a 64GB SD card.

I just keep my favorite albums and discographies in 320 MP3 as I know there is almost no music player on earth that will fail to play it. Then I keep that in a micro SD in my wallet, for that emergency in which you desperately need some Pink Floyd.

More likely that storage capacity/$ will scale faster than codec efficiency.

In a few years you’ll have it on your phone anyway!

1. Wow, what an obsessive music geek you are.

2. Consider https://beets.io for management, it may be just the system you want. One of its features is converting your library into a lossy version while keeping your high res files in a very managed way.

As you seem knowledgeable, it looks like beets manages music you've already ripped, do you have a suggestion for ripping? Last time I tried, it was a disaster (I got a lot of poor quality audio and ended up nuking everything because it was too much work to listen to everything and try again)

It's been a while since I ripped my last CD, but I always found https://github.com/bleskodev/rubyripper very useful. It uses cdparanoia to get multiple rips and combines them into what it thinks is the exact information located on the original cd.

These days, I can fortunately download most stuff directly from bandcamp :)

Now that's cool. Thank you so much, I'll look into that.

Personally for me the issue is not the size but the transfer (think streaming, service migration, etc.)

I just have a syncthing folder shared between my VPS, my home server, my desktop, laptop, and phone; it syncs it continuously so a download once syncs it everywhere else. It's basically set it and forget it.

In some countries data consumption is still capped, and reducing bandwidth usage goes a long way in making media consumption more affordable.

Notice in the examples how Lyra tries to construct voice-like sounds out of the "music" sample? This is what scares me a bit about these codecs.


Because of the possibility of it adding "words" that were never there.

See also: xerox photocopiers changing dimensions of technical drawings with "too clever" compression schemes.

How does it compare to Codec2?

in low bandwidth situations, for recording speeches or podcasts, i have a similar question. the codec2 examples have an 8 kHz range so can't be compared to the lyra ones as is. perhaps you could encode your own voice wavs with ffmpeg and compare. in this case there's also the question of portability: can the resulting files be played on android or iphone, and how many cpu cycles/how much battery power would it cost? I'd rather listen to 1 hour of lyra speech than codec2 speech if the battery would last twice as long.

I don’t know but there seemed to be a demo for Codec2 running on STM32F4, whereas Lyra repository README explains how its optimized implementation allows it to run on a midrange phone in real time, so…

Isn't Codec2 a speech codec? This seems to target general audio.

Lyra is also a speech-only codec, yet is included in the comparison.

Note also that Codec2 had some experimental work extending it with the WaveNet neural network, which improved the performance.

Given both of these, it seems disingenuous to exclude Codec2 from the comparison. I can only assume it's left out because it performs well at even lower bitrates.

>Over the past few years, different audio codecs have been successfully developed to meet these requirements, including Opus and Enhanced Voice Services (EVS).

I guess the Google AI team works separately from the main Google Team. Lot of respect for pointing out EVS.

would be interesting to characterize the behavior of the encoder (ie; how does it differ from a mel warped spectrum... or what is the warping that it learns?)

also would be kinda neat to see something that is pretrained in this way, and then does a small amount of humans in the loop training iterations to see if quality improves or perhaps an uncovering of something previously unknown about human auditory perception...

> Opus is a versatile speech and audio codec, supporting bitrates from 6 kbps (kilobits per second) to 510 kbps, which has been widely deployed across applications ranging from video conferencing platforms, like Google Meet, to streaming services, like YouTube. EVS is the latest codec developed by the 3GPP standardization body targeting mobile telephony. Like Opus, it is a versatile codec operating at multiple bitrates, 5.9 kbps to 128 kbps.

Why 5.9 instead of 6 kbps? Did they want to have some kind of PR victory over Opus?

Is this good for synced music transmissions or is it another simulation of the 1950s telephone sound?

3 kbps is not enough for music in any meaningful way.

Where is a line between end-to-end audio compression and end-to-end audio encryption ?

Well encryption would likely be bad if it made things smaller and compression would be bad if it didn’t make things smaller.
