May 18, 2024
  • At Meta, we help real-time communication (RTC) for billions of individuals by means of our apps, together with Messenger, Instagram, and WhatsApp.
  • We’ve seen important advantages by adopting the AV1 codec for RTC.
  • Right here’s how we’re enhancing the RTC video high quality for our apps with instruments just like the AV1 codec, the challenges we face, and the way we mitigate these challenges.

The previous couple of many years have seen large enhancements in cell phone digital camera high quality in addition to video high quality for streaming video companies. But when we have a look at real-time communication (RTC) purposes, whereas the video high quality additionally has improved over time, it has at all times lagged behind that of digital camera high quality. 

Once we checked out methods to enhance video high quality for RTC throughout our household of apps, AV1 stood out as the best choice. Meta has more and more adopted the AV1 codec over time as a result of it affords excessive video high quality at bitrates a lot decrease than older codecs. However, as we’ve implemented AV1 for mobile RTC, we’ve additionally needed to tackle various challenges together with scaling,  enhancing video high quality for low-bandwidth customers in addition to high-end networks, CPU and battery utilization, and sustaining high quality stability.

Bettering video high quality for low-bandwidth networks

This submit goes to concentrate on peer-to-peer (P2P, or 1:1) calls, which contain two contributors. 

Individuals who use our services and products expertise a spread of community circumstances – some have actually nice networks, whereas others are utilizing throttled or low-bandwidth networks.

This chart illustrates what the distribution of bandwidth seems to be like for a few of these calls on Messenger:

Determine 1: Bandwidth distribution of P2P calls on Messenger.

As seen in Determine 1, some calls function in very low-bandwidth circumstances. 

We take into account something lower than 300 Kbps to be a low-end community, however we additionally see quite a lot of video calls working at simply 50 Kbps, and even below 25 Kbps.

Be aware that this bandwidth is the share for the video encoder. Complete bandwidth is shared with audio, RTP overhead, signaling overhead, RTX (re-transmissions of packets to deal with misplaced packets)/FEC (ahead error correction)/duplication (packet duplication), and so forth. The large assumption right here is that the bandwidth estimator is working appropriately and estimating true bitrates. 

There are not any common definitions for low, mid, and excessive networks, however for the aim of this weblog submit, lower than 300 Kbps shall be thought of as low, 300-800 Kbps as mid, and above 800 Kbps as a excessive, HD-capable, or high-end community.

Once we appeared into enhancing the video high quality for low-bandwidth customers, there have been few key choices. Migrating to a more moderen codec corresponding to AV1 introduced the best alternative. Different choices corresponding to higher video scalers and region-of-interest encoding supplied incremental enhancements. 

Video scalers

We use WebRTC in most of our apps, however the video scalers shipped with WebRTC don’t have the highest quality video scaling. We’ve been in a position to enhance the video scaling high quality considerably by leveraging in-house scalers. 

At low bitrates, we regularly find yourself downscaling the video to encode at ¼ decision (assuming the digital camera seize is 640×480 or 1280×720). With our customized scaler implementations, we’ve got seen important enhancements in video high quality. From public checks we noticed beneficial properties in peak signal-to-noise ratio (PSNR) by 0.75 db on common.

Here’s a snapshot displaying outcomes with the default libyuv scaler (a field filter):

Determine 2.a: Video picture outcomes utilizing WebRTC/libyuv video scaler.

And the outcomes after downscaling with our video scaler:

Determine 2.b: Video picture outcomes utilizing Meta’s video scaler.

Area-of-interest encoding

Figuring out the area of curiosity (ROI) allowed us to optimize by spending extra encoder bitrate within the space that’s most necessary to a viewer (the speaker’s face in a speaking head video, for instance). Most cell units have APIs to find the face area with out using any CPU overhead. As soon as we’ve got discovered the face area we are able to configure the encoder to spend extra bits on this necessary area and fewer on the remaining. The simplest method to do that was to have some APIs on encoders to configure the quantization parameters (QP) for ROI versus the remainder of the picture. These adjustments offered incremental enhancements within the video high quality metrics like PSNR. 

Adopting the AV1 video codec

The video encoder is a key component in the case of video high quality for RTC. H.264 has been the preferred codec during the last decade, with {hardware} help and most purposes supporting it. However it’s a 20-year-old codec. Again in 2018, the Alliance for Open Media (AOMedia) standardized the AV1 video codec. Since then, a number of firms together with Meta, YouTube, and Netflix have deployed it at a large scale for video streaming

At Meta, transferring from H.264 to AV1 led us to our best enhancements in video high quality at low bitrates.

Determine 3: Enhancements over time, transferring from H.262 to AV1 and H.266

Why AV1?

We selected to make use of AV1 partly as a result of it’s royalty-free. Codec licensing (and concurrent charges) was an necessary side in our decision-making course of. Sometimes, if an utility makes use of a tool’s {hardware} codec, no further codec licensing prices shall be incurred. But when an utility is transport a software program model of the codec,  there’ll probably be licensing prices to cowl.

However why do we have to use software program codecs though most telephones have hardware-supported codecs?

Most cell units have devoted {hardware} for video encoding and decoding. And nowadays most cell units help H.264 and even H.265s. However these encoders are designed for widespread use circumstances corresponding to digital camera seize, which makes use of a lot greater resolutions, body charges, and bitrates. Most cell machine {hardware} is at present able to encoding 4K 60 FPS in actual time with very low battery utilization, however the outcomes of encoding a 7 FPS, 320×180, 200 Kbps video are sometimes worse than these of software program encoders working on the identical cell machine. 

The rationale for that’s prioritization of the RTC use case. Most impartial {hardware} distributors (IHVs) should not conscious of the community circumstances the place RTC calls function; therefore, these {hardware} codecs should not optimized for RTC eventualities, particularly for low bitrates, resolutions, and body charges. So, we leverage software program encoders when working in these low bitrates to offer high-quality video.

And since we are able to’t ship software program codecs with out a license, AV1 is an excellent possibility for RTC.

AV1 for RTC

The most important cause to maneuver to a extra superior video codec is easy: The identical high quality expertise may be delivered with a a lot decrease bitrate, and we are able to ship a a lot higher-quality real-time calling expertise for our customers who’re on bandwidth-constrained networks.

Measuring video high quality is a fancy subject, however a comparatively easy method to have a look at it’s to make use of the Bjontegaard Delta-Bit Rate (BD-BR) metric. BD-BR compares how a lot bitrate numerous codecs want to supply a sure high quality stage. By producing a number of samples at totally different bitrates, measuring the standard of the produced video offers a rate-distortion (RD) curve, and from the RD curve you may derive the BD-BR (as proven under).

As may be seen in Determine 4, AV1 offered greater high quality for all bitrate ranges in our native checks.

Determine 4: Bitrate distortion comparability chart.

Display-encoding instruments

AV1 additionally has just a few key instruments which might be helpful for RTC. Display content material high quality is turning into an more and more necessary issue for Meta, with related use circumstances, together with display sharing, sport streaming, and VR distant desktop, requiring high-quality encoding. In these areas, AV1 really shines. 

Historically, video encoders aren’t effectively suited to complicated content material corresponding to textual content with quite a lot of high-frequency content material, and people are delicate to studying blurry textual content. AV1 has a set of coding instruments—palette mode and intra-block copy—that drastically enhance efficiency for display content material. Palette mode is designed based on the commentary that the pixel values in a screen-content body often consider the restricted variety of coloration values. Palette mode can signify the display content material effectively by signaling the colour clusters as a substitute of the quantized transform-domain coefficients. As well as, for typical display content material, repetitive patterns can often be discovered throughout the similar image. Intra-block copy facilitates block prediction throughout the similar body, in order that the compression effectivity may be improved considerably. That AV1 offers these two instruments on the baseline profile is a large plus.

Reference image resampling: Fewer key frames

One other helpful function is reference image resampling (RPR), which permits decision adjustments with out producing a key body. In video compression, a key body is one which’s encoded independently, like a nonetheless picture. It’s the one sort of body that may be decoded with out having one other body as reference. 

For RTC purposes, because the bandwidth retains on altering typically, there are frequent decision adjustments wanted to adapt to those community adjustments. With older codecs like H.264, every of those decision adjustments requires a key body that’s a lot bigger in dimension and thus inefficient for RTC apps. Such giant key frames enhance the quantity of knowledge needing to be despatched over the community and end in greater end-to-end latencies and congestion. 

Through the use of RPR, we are able to keep away from producing any key frames.

Challenges round enhancing video high quality for low-bandwidth customers

CPU/battery utilization

AV1 is nice for coding effectivity, however codecs obtain this at the price of greater CPU and battery utilization. Plenty of fashionable codecs pose these challenges when working real-time purposes on cell platforms.

Primarily based on native lab testing, we anticipated a roughly 4 % enhance in battery utilization, and we noticed related leads to public checks. We used an influence meter to do that native battery measurement.

Regardless that the AV1 encoder itself elevated CPU utilization three-fold when in comparison with H.264 implementation, the general contribution of CPU utilization from the encoder was a small a part of the battery utilization. The telephone show display, networking/radio, and different processes utilizing the CPU contribute considerably to battery utilization, therefore the rise in battery utilization was 5-6 % (a big enhance in battery utilization). 

Plenty of calls run out of machine battery, or individuals hold up as soon as their working system signifies a low battery, so growing battery utilization isn’t worthwhile for customers until it offers elevated worth corresponding to video high quality enchancment. Even then it’s a trade-off between video high quality versus battery use.

We use WebRTC and Session Description Protocol (SDP) for codec negotiation, which permits us to barter a number of codecs (e.g., AV1 and H.264) up entrance after which swap the codecs with none want for signaling or a handshake throughout the name. This implies the codec swap is seamless, with out customers noticing any glitches or pauses in video.

We created a customized encoder that encapsulates each H.264 and the AV1 encoders. We name it a hybrid encoder. This allowed us to modify the codec throughout the name primarily based on triggers corresponding to CPU utilization, battery stage, or encoding time — and to modify to the extra battery-efficient H.264 encoder when wanted. 

Elevated crashes and out of reminiscence errors

Even with out new leaks added, AV1 used extra reminiscence than H.264. Any time further reminiscence is used, apps usually tend to hit out of reminiscence (OOM) crashes or hit OOM sooner due to different leaks or reminiscence calls for on the system from different apps. To mitigate this, we needed to disable AV1 on units with low reminiscence. That is one space for enchancment and for additional optimizing the encoder’s reminiscence utilization.

In-product high quality measurement

To check the standard between H.264 and AV1 utilizing public checks, we would have liked a low-complexity metric. Metrics corresponding to encoded bitrates and body charges received’t present any beneficial properties as the whole bandwidth accessible to ship video remains to be the identical, as these are restricted by the community capability, which implies the bitrates and body charges for video is not going to change a lot with the change within the codec. We had been utilizing composite metrics that mix the quantization parameter (QP is commonly used as a proxy for video high quality, as this introduces pixel information loss throughout the encoding course of), resolutions, and body price, and freezes it to generate video composite metrics, however QP shouldn’t be comparable between AV1 and H.264 codecs, and therefore can’t be used.

PSNR is a typical metric, nevertheless it’s reference-based and therefore doesn’t work for RTC. Non-reference, video-quality metrics are fairly CPU-intensive (e.g., BRISQUE: Blind/Referenceless Picture Spatial High quality Evaluator), although we’re exploring these as effectively.

Determine 6: Excessive-level structure for PSNR computation in RTC.

We’ve give you a framework for PSNR computation. We first modified the encoder to report distortions brought on by compression (most software program encoders have already got help for this metric). Then we designed a light-weight, scaling-distortion algorithm that estimates the distortion launched by video scaling. This algorithm can mix these scaling distortions with the encoder distortions to supply output PSNR. We developed and verified this algorithm regionally and shall be sharing the findings in publications and at tutorial conferences over the subsequent yr. With this light-weight PSNR metric, we noticed 2 db enhancements with AV1 in comparison with H.264.

Challenges round enhancing video high quality for high-end networks

As a fast evaluation: For our functions, excessive bandwidth covers customers for whom bandwidth is larger than 800 kbps. 

Through the years, there have been big enhancements in digital camera seize high quality. In consequence, individuals’s expectations have gone up, they usually wish to see RTC video high quality on par with native digital camera seize high quality. 

Primarily based on native testing, we settled on settings leading to video high quality that appears much like that of digital camera recordings. We name this HD mode. We discovered that with a video codec like H.264 encoding at 3.5 Mbps and 30 frames per second, 720p decision appeared similar to native digital camera recordings. We additionally in contrast 720p to 1080p in subjective high quality checks and located that the distinction shouldn’t be noticeable on most units aside from these with a bigger display once we performed subjective high quality checks.

Bandwidth estimator enhancements

Bettering the video high quality for customers who’ve high-end telephones with good CPUs, good batteries, {hardware} codecs, and good community speeds appears trivial. It might look like all it’s a must to do is enhance the utmost bitrate, seize decision, and seize body charges, and customers will ship high-quality video. However, in actuality, it’s not that straightforward. 

If you happen to enhance the bitrate, you expose your bandwidth estimation and congestion detection algorithm to hit congestion extra typically, and your algorithm shall be examined many extra instances than if you weren’t utilizing these greater bitrates. 

Determine 7: Instance displaying how utilizing greater bandwidth will increase the situations for congestion.

If you happen to have a look at the community pipeline in Determine 7, the upper the bitrates you’re utilizing, the extra your algorithm/code shall be examined for robustness over the time of the RTC name. Determine 7 reveals how utilizing 1 Mbps hits extra congestion than utilizing 500 Kbps and utilizing 3 Mbps hits extra congestion than 1 Mbps, and so forth. In case you are utilizing bandwidths decrease than the minimal throughput of the community, nonetheless, you received’t hit congestion in any respect. For instance, see the 500-Kbps name in Determine 7. 

To mitigate these points, we improved congestion detection. For instance, we added customized ISP throttling detection, one thing that was not being caught by the standard delay-based estimator of WebRTC. 

Bandwidth estimator and community resilience comprise a fancy space on their very own, and that is the place RTC merchandise stand out. They’ve their very own customized algorithms that work finest for his or her merchandise and clients.

Secure high quality

Individuals don’t like oscillations in video high quality. These can occur once we ship high-quality video for just a few seconds after which drop again to low-quality due to congestion. Studying from previous historical past, we added help in  bandwidth estimation to forestall these oscillations.

Audio is extra necessary than video for RTC

When community congestion happens, all media packets might be misplaced. This causes video freezes and damaged audio, (aka, robotic audio). For RTC, each are dangerous, however audio high quality is extra necessary than video. 

Damaged audio typically fully prevents conversations from occurring, typically inflicting individuals to hold up or redial the decision. Damaged video, however, typically leads to much less pleasant conversations, however, relying on the situation, it may be a block for some customers.

At excessive bitrates like 2.5 Mbps and better, you may afford to have three to 5 instances extra audio packets or duplication with none noticeable degradation to video. When working in these greater bitrates with cellphone connections, we noticed extra of those congestion, packet loss, and ISP throttling points, so we needed to make adjustments to our community resiliency algorithms. And since individuals are extremely delicate to information utilization on their cell telephones, we disabled excessive bitrates on mobile connections.

When to allow HD?

We used ML-based focusing on to guess which name ought to be HD-capable. We relied on the community stats from the customers’ earlier calls to foretell if HD ought to be enabled or not.

Battery regressions

We’ve a lot of metrics, together with efficiency, networking, and media high quality, to trace the standard of RTC calls. Once we ran checks for HD, we observed regressions in battery metrics. What we discovered was that almost all battery regressions don’t come from greater bitrates or decision however from the seize body charges.

To mitigate the regressions, we constructed a mechanism for detecting each caller and callee machine capabilities, together with machine mannequin, battery ranges, Wi-Fi or cell utilization, and so forth. To allow high-quality modes, we examine each side of the decision to make sure that they fulfill the necessities and solely then will we allow these high-quality, resource-intensive configurations.

Determine 8: Signaling server setup for turning HD on or off.

What the long run holds for RTC

{Hardware} producers are acknowledging the numerous advantages of utilizing AV1 for RTC. The brand new Apple iPhone 15 Professional helps AV1’s {hardware} decoder, and the Google Pixel 8 helps AV1 encoding and decoding. {Hardware} codecs are an absolute necessity for high-end community and HD resolutions. Video calling is turning into as ubiquitous as conventional audio calling and we hope that as {hardware} producers acknowledge this shift, there shall be extra alternatives for collaboration between RTC app creators and {hardware} producers to optimize encoders for these eventualities. 

On the software program aspect, we are going to proceed to work on optimizing AV1 software program encoders and creating new encoder implementations. We attempt to present the perfect expertise for our customers, however on the similar time we wish to let individuals have full management over their RTC expertise. We’ll present controls to the customers in order that they will select whether or not they need greater high quality at the price of battery and information utilization, or vice versa.

We additionally plan to work with IHVs to collaborate on {hardware} codec improvement to make these codecs usable for RTC eventualities together with low-bandwidth use circumstances. 

We additionally will examine forward-looking options corresponding to video processing to extend the decision and body charges on the receiver’s rendering stack and leveraging AI/ML to enhance bandwidth estimation (BWE) and community resiliency.

Additional, we’re investigating Pixel Codec Avatar applied sciences that may permit us to transmit the mannequin/share as soon as after which ship the geometry/vectors for receiver aspect rendering. This allows video rendering with a lot smaller bandwidth utilization than conventional video codecs for RTC eventualities.