Audio Watermarking and Deepfakes
In my 500 Words newsletter, I’ve been discussing audio deepfakes and how they are changing the career landscape for voice actors. If an actor can license a digital version of their voice, would they ever again need to go into a recording studio? Or would licensing a copy of an actor’s voice be a terrible, career-ending idea? Once a digital version of an actor’s voice is in the wild, how will we tell the difference between what’s real and a copy?
Worse, will voice thieves simply steal an actor’s voice outright, damaging their livelihood?
I needed experts to address these technical questions, so I turned to the folks at Pindrop. They are scientists and technologists who are developing the future of voice authentication. Nick Gaubitch, Director of Research, and Elie Khoury, VP of Research, answered a few questions from me over email.
1/ How close are we, given current software developments, to creating synthetic recordings of a person’s voice that would be indistinguishable from an authentic recording of that person’s voice?
To answer this question, we must first define what we mean by the term ‘indistinguishable’. A voice may be ‘indistinguishable’ from its genuine counterpart to a human listener. In that case, current technology for synthetic speech generation and voice conversion has matured to the point where it is becoming close to impossible for a human listener, trained or not, to distinguish a genuine live voice from a synthetic one. Indeed, studies [1,2] have already shown that human performance at deepfake detection is close to random, especially for untrained listeners.
From a machine speaker recognition point of view, it has been difficult to distinguish between a genuine and a synthetic voice for some time, and here the requirements differ from those of human perception. A voice that sounds robotic or synthetic to a human listener may still be perfectly recognizable as the genuine voice by an automatic speaker recognition system [3], unless, of course, the system has been designed specifically to guard against this.
Lastly, and perhaps most importantly, for systems that are specifically designed to detect synthetically generated speech, there are a plethora of cues (often inaudible to the human ear) that make it possible to distinguish, with high accuracy, synthetic speech from genuine speech. That is typically what Pindrop’s deepfake detection systems rely on [4].
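As a toy illustration of the kind of cue such a system could use (a sketch written for this article, not Pindrop’s actual detector), one could measure how much of a recording’s spectral energy sits above the range that listeners barely attend to, where some synthesis and vocoder pipelines leave artifacts; a real detector would combine many such cues in a trained classifier.

```python
import numpy as np
from scipy.signal import welch

def high_band_energy_ratio(audio, sample_rate, cutoff_hz=7000):
    """Fraction of spectral energy above cutoff_hz.

    Hypothetical cue: some synthesis and vocoder pipelines attenuate or
    distort high-band energy in ways listeners barely notice. A trained
    classifier would use many statistics like this, not just one.
    """
    freqs, psd = welch(audio, fs=sample_rate, nperseg=1024)
    total = np.sum(psd)
    high = np.sum(psd[freqs >= cutoff_hz])
    return float(high / total) if total > 0 else 0.0
```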
2/ How would audio watermarking help us identify whether a recording of a person’s voice is authentic?
Digital audio watermarking comprises two components: an embedding unit that embeds the watermark signal into the carrier audio signal, and a detection unit that checks a given audio signal for the presence of the watermark. The watermark itself is typically a pseudorandom sequence of values from {-1, +1}.
One promising approach would be to embed a watermark in each synthetically generated speech utterance. A watermark detector could then check whether the watermark is present in a particular audio recording and, based on that, decide whether the voice is authentic.
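To make the embed/detect split concrete, here is a minimal spread-spectrum sketch, for illustration only; real systems shape the watermark psychoacoustically, work frame by frame, and carry an actual payload. A keyed pseudorandom +/-1 sequence is added at low amplitude, and the detector correlates the audio against the same keyed sequence:

```python
import numpy as np

def make_watermark(key, length):
    """Pseudorandom +/-1 sequence derived from a shared secret key."""
    rng = np.random.default_rng(key)
    return rng.choice([-1.0, 1.0], size=length)

def embed(audio, key, strength=0.005):
    """Embedding unit: add the keyed watermark to the carrier at low amplitude."""
    return audio + strength * make_watermark(key, len(audio))

def detect(audio, key, threshold=0.0025):
    """Detection unit: correlate against the keyed sequence.
    A high normalized correlation suggests the watermark is present."""
    score = float(np.dot(audio, make_watermark(key, len(audio)))) / len(audio)
    return score > threshold, score

# One second of a 220 Hz test tone at 16 kHz stands in for a speech utterance.
sr = 16000
clean = 0.1 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr)
marked = embed(clean, key=42)
print(detect(marked, key=42))  # (True, ~0.005): watermark found
print(detect(clean, key=42))   # (False, ~0.0): no watermark in the original
```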
3/ What are some of the challenges we face when implementing effective audio watermarking?
In general, robust watermarking technology has to balance three competing requirements: robustness, so the watermark withstands both deliberate and non-deliberate attacks; imperceptibility, so the watermark cannot be perceived in the carrier signal (which, for speech watermarking, means it must be inaudible); and capacity, the amount of information the watermark can carry.
In the world of audio, it is generally accepted that robust and secure digital watermarking of speech is harder to achieve than for music signals. There are two main reasons for this. First, speech is much less spectrally rich than music, which leaves far less room in which to hide an imperceptible watermark. Second, and perhaps more importantly, a speech signal can be degraded significantly, for example by compression, additive noise, and reverberation, and still achieve its goal of communicating a message. A watermark therefore has to withstand much tougher degradations than are typically expected in music.
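To see how much tougher those degradations are, one can stress-test the toy watermark from the earlier sketch (this snippet assumes the embed and detect functions, sr and clean defined above are still in scope): add noise at a telephone-like signal-to-noise ratio, simulate a narrowband codec by down- and up-sampling, and check whether the correlation score survives.

```python
import numpy as np
from scipy.signal import resample

def degrade(audio, sample_rate, snr_db=20.0, codec_rate=8000):
    """Degradations speech routinely survives: additive noise at snr_db,
    plus a crude narrowband 'codec' simulated by down- and up-sampling."""
    rng = np.random.default_rng(0)
    noise_power = np.mean(audio ** 2) / (10 ** (snr_db / 10))
    noisy = audio + rng.normal(0.0, np.sqrt(noise_power), size=len(audio))
    narrow = resample(noisy, int(len(noisy) * codec_rate / sample_rate))
    return resample(narrow, len(audio))

# Reusing embed/detect and the test tone from the watermarking sketch above:
marked = embed(clean, key=42)
degraded = degrade(marked, sr)
print(detect(marked, key=42))    # comfortably above the threshold
print(detect(degraded, key=42))  # the score drops sharply and may no longer clear it
```

In this crude sketch the simulated codec alone wipes out much of the watermark energy, which is exactly the difficulty described above.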
In addition to the above, there are still open questions about how watermarking could be deployed at scale for real-time communications, and how, when, and by whom the watermark embedding and detection should be done.
3a/ For example, audio watermarking requires cooperation amongst content creators, publishers, and developers. Is that only possible if a standard audio watermark is used? Or do you see another path?
I believe that a centralized and collaborative approach would be the best way forward for watermarking. Thus, all interested parties would have to agree on a watermarking standard and comply with it. One relevant and related example of such an initiative is the Coalition for Content Provenance and Authenticity (C2PA) [5].
Of course, this is not the only way forward, and the current trend is for each synthetic media content provider to develop its own watermarking strategy dedicated to its own content. There are at least two drawbacks to such a distributed approach. First, one would have to check for a watermark against every provider separately, which becomes increasingly inefficient as the number of providers grows. Second, there would inevitably be differences in the quality of the watermarking technology across providers. Both of these problems would be addressed by the centralized approach.
Nevertheless, whichever route is chosen, it requires synthetic content creators to comply with a set of rules. That will most certainly not be the case when fraudsters or other bad actors join the scene and start building their own tools for synthetic speech generation or voice conversion. This is probably the greatest weakness of watermarking as a candidate technology for verifying voice authenticity.
4/ Are there other ways that may help us determine whether an audio recording of a voice is authentic?
Watermarking is only one route towards the detection of synthetic media. Hopefully, it has become clear by now that it is not a very strong means of protection, since it is susceptible to deliberate attacks and relies on the goodwill of the voice cloning providers. Digital watermarking therefore has the potential to serve as a decent soft layer of defense, informing users of the authenticity of a voice in a collaborative environment, but it will likely not work when bad actors are involved.
Consequently, we will need other layers of technology, such as the more generic deep-learning-based tools for synthetic speech detection [6]. These tools show great promise not only for accurately detecting synthetically generated voices but also for identifying the software used to generate them.
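To give a sense of what such a tool can look like (a minimal, untrained sketch for this article, not Pindrop’s system), one common pattern is a small convolutional network over log-mel spectrograms whose outputs are ‘genuine’ versus ‘synthetic’; widening the output layer to one class per known generator is what lets the same architecture point at the software that produced a clip.

```python
import torch
import torch.nn as nn
import torchaudio

class SpoofDetector(nn.Module):
    """Toy CNN classifier over log-mel spectrograms.

    Two outputs (genuine vs. synthetic); using N+1 output classes would
    also let the model guess which of N known generators produced the audio.
    """
    def __init__(self, n_classes=2):
        super().__init__()
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=16000, n_mels=64)
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, n_classes),
        )

    def forward(self, waveform):                      # waveform: (batch, samples)
        feats = torch.log1p(self.melspec(waveform))   # (batch, n_mels, frames)
        return self.net(feats.unsqueeze(1))           # (batch, n_classes)

# Untrained forward pass on one second of silence, just to show the shapes:
model = SpoofDetector()
logits = model(torch.zeros(1, 16000))
print(logits.shape)  # torch.Size([1, 2])
```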