Tags

  • deep fakes
  • content creators
  • AI
  • synthetic media
  • audio processing

98. The Ethical Side of Deep Fakes

Hosted by Julián Duque, with guest Alex Serdiuk.

The rise of manipulated pictures and videos has given a name to this notorious practice: deep fakes. But Alex Serdiuk, the CEO of Respeecher, suggests it's how we use these tools that makes them bad, not the technology in and of itself. He'll explain how his platform, which produces AI-generated audio samples, is actually helping the entertainment industry deliver fresh content to its customers.


Show notes

Julián Duque is a Lead Developer Advocate at Salesforce and Heroku. He's joined by Alex Serdiuk, the CEO of Respeecher. Respeecher has created AI software which works within the speech-to-speech domain: it takes one voice and makes it sound exactly like another. Alex rejects the premise that all deep fakes--that is, pictures and videos generated by AI--are inherently evil. He considers tools like CGI and Photoshop to fall within the realm of synthesized media, which helps artists create content. He positions Respeecher within that same milieu.

Respeecher has been working with Hollywood studios for some time. It removes pressure from actors who are unable to rerecord lines. It's also been used in situations where actors need to sound much younger, a visual-audio process called de-aging. In the future, applications of speech-to-speech work could also be used in museums, to provide a new dimension of history for audiences.

Of course, Alex recognizes that the main issue with deep fakes is not their existence, but their inability to be detected. To solve this problem, Respeecher watermarks its audio, to generate inaudible metadata which can nonetheless be analyzed to show whether a particular recording was faked. He also believes that more people need to be educated that synthesized media exists. Something one sees or hears might not be real, because technology is getting more and more advanced. We should all be mindful about the content we consume.

Transcript

Julián: Hello, and welcome to Code[ish]. My name is Julián Duque, and I am a Lead Developer Advocate here at Salesforce and Heroku. And today we have a very special topic that we will be covering in two different episodes. We are going to be discussing the ethical side of deep fakes. Today I have with me Alex Serdiuk. He's the CEO of Respeecher. Hello Alex, how are you doing?

Alex: Hi there, doing good. Thanks for having me.

Julián: Of course, Alex, thank you very much for joining us today here at Code[ish]. Please introduce yourself to the audience.

Alex: I'm CEO of Respeecher. We are the startup company that created the technology that actually lets one voice sound exactly like another voice. We are in the speech-to-speech domain. That's quite different from text-to-speech technologies you might have heard of. And we work for companies that produce content. So we help content creators get more flexibility in voiceover. Even resurrect the age voices. And we actually bring the way you create content to the next level. We get rid of this tie between one person, one voice, and we can just operate with voices in a very different way compared to what we used to do previously.

Julián: Let me ask you something before starting with the topic. When you mention like speech-to-speech, that means I do a recording of my voice or a speech and then it is going to be changed live or after the recording.

Alex: Yeah, it still requires some time for processing. So there are two stages in our technology. In the first one, we train our system to understand the difference between two particular voices. And once it's trained, we can do conversion. And conversion also takes some time, but it's rather short. So you can hear output from the trained model within minutes after you did the recordings.

Julián: Oh, nice. Interesting. Now that I have satisfied my curiosity, thank you very much. We are going to start with today's topic. We are going to talk about the ethical side of the so-called deep fakes. So what is a deep fake? Or what is the proper name of that specific technology?

Alex: Yeah, I prefer calling this technology not a deep fake but more synthesized media-

Julián: Okay. Correct.

Alex: ... because deep fake always has this negative connotation-

Julián: Totally correct.

Alex: ... and synthesized media is like AI generated media or judge generated media or personalized media. It's more like a general term for artificial production, manipulation and to some extent modification of data and media. And there are many technologies under this umbrella, like text generation, we call it natural-language generation, music generation, video and image generation. And of course, voice generation where Respeecher operates.

Julián: Oh, okay. So synthesized media is like the proper technical term for the so-called deep fakes.

Alex: Yeah, exactly. So deep fake is actually related to the name of one person like nickname of one person who made like a visual. What we called deep fake some time ago and it became just a very generalized term of bad usage of synthesized media.

Julián: So we're going to be playing an example of what Respeecher does. So please tell us about the example we are going to play.

Alex: Yeah. I mean, you can play our MIT project. And in this project we actually made President Nixon say the speech that was written in case the Apollo 11 mission went wrong. And this speech was extremely powerful and we partnered with MIT and another company called Canny AI, to do the visual part, to deliver that speech like a piece of alternative history. That project was presented at a big documentary film festival in Amsterdam, IDFA. And now MIT brings it as an installation to other places. And that's kind of a cool application of the technology where you can see and hear history in a much more colorful way, compared to just reading this stuff.

Julián: Nice. Beautiful. So let's hear the example.

Speaker 4: Good evening, my fellow Americans. Fate has ordained that the men who went to the moon to explore in peace will stay on the moon to rest in peace. These brave men, Neil Armstrong and Edwin Aldrin, know that there is no hope for their recovery. But they also know that there is hope for mankind in their sacrifice. These two men are laying down their lives in mankind's most noble goal: the search for truth and understanding.

Julián: That was impressive. So I think this may have a lot of bad ethical connotations because it's very uncanny. You're pretty much hearing that person and it is pretty much the same voice saying things that were not said. Now let's talk about the ethical aspects of synthesized media. But let's just start with the good side of things. Tell me about the proper use cases. And why does this technology matter? And why is it important?

Alex: I mean, we can talk a lot about the good side of deep fakes, of synthesized media, and there are some very cool applications you can find for using synthesized media. And in general, CGI is synthesized media, Photoshop is synthesized media. All that stuff that helps you to create content and operate with content to some extent is related to synthesized media. What we do at Respeecher, we bring this kind of flexibility for content creators when they need to use some particular voice, but they cannot get this voice in the studio at the time they need it.

Alex: Sometimes it's crucial for doing ADR. That's the usual process for post production in all movies that are being created. When you need to get an actor back to the studio to re-record some lines or just be available for spinoff TV shows. Something like that. And in that case, with the permission of an actor, we give the voice to another human who would do the takes on their behalf without actually getting them to the studio and doing that kind of monotonous and to some extent boring work.

Alex: So they don't like to be in front of microphones. They are actors. They want to play in front of the camera. But when you need to fix things, it's a lot of time spent in the studio. So Don Hahn once said that he usually spends 3 days on additional recordings that need to be done for any movie he participates in. And sometimes one line could take 2 hours to get right. So you can just imagine how annoying that could be.

Alex: Also, there are voices from the past of actors that need to be used in the movie. And that would be voices of deceased people but also there is the de-aging use case, we are actually doing one project related to de-aging right now, where you need an actor to sound much younger. Content creators are used to making them look younger. They apply a lot of CGI and all that kind of stuff, but making them sound younger is a problem that has not been solved for a while. We can do that.

Alex: Also there are voices of kids that are kind of hard to use. From one point of view, it's hard to get a kid in the studio to sit still and record what you need them to record. And you still, to some extent, steal their childhood by just spending time in the studio. But the other issue is you just need to stick to one voice and the voice of a kid could change over a short period of time. And when you do a series of animation, you need to be consistent in the voice of this character. And we can actually help with getting this consistency. We can make the kid whose voice has changed be able to speak in their previous voice that was used in this content.

Julián: Mm-hmm (affirmative). That's great. And I imagine this is also being used on video games and other types of digital media.

Alex: Yeah, exactly. So for media, there are many use cases around this kind of flexibility content creators need in voiceover. But also there are some industries where more real-time voice conversion or accent conversion is really needed, like call centers, communication, online or Zoom with people for whom English is a second language. So that's a great use case where we can make, say, call center operators from another country be able to speak with a less noticeable accent and improve their communication level.

Julián: It is very interesting how the future is being shaped right now. And I have a question. Right now, let's say, if you need a voice actor you will ask that actor for permission to use their voice, but what happens with deceased people? How to get the proper permissions to use their voices? For example, the Richard Nixon video.

Alex: Yeah. I mean, it depends on the voice itself. On the person themselves. And usually for actors, there are estates that own the rights for using this person's identity. And these estates are quite big so you might have noticed that one estate did like a James Dean commercial for McDonald's. So they manage the new content that's been created. That's a common process where you want to use someone's identity. You just go to the estate that manages the rights for that actor or that celebrity. And the estate gives you rights to create new content using their identity, and voice being part of the identity just fits into that kind of legal framework.

Julián: Wow. Can it be also applied to musicians? For example, I've heard from certain musicians that never released songs, can this technology be used to recreate those songs that were never recorded before?

Alex: Yeah, it's possible to do. I mean, our technology just started to sing. It happened just a couple of months ago and frankly, we don't know to what extent it can sing, but we can make it sing, even having a speech data on the input for a target voice. And that's just a question of a couple of months for us to be able to provide a good singing conversion engine that could be used for performing the songs that were written, but never delivered. However, we need to understand that we just do voice and there is a big part of emotional content that should be created. And there is no way in my opinion, to duplicate the way how people used to sing, how people used to speak. You can just try to impersonate, but the voice part is being covered by Respeecher.

Julián: Okay. Got it. Well, it seems we can get a lot of different benefits. It can be applied on art. We also talk about entertainment, learning, like as a great resource for museums and history conversations. But what about the bad things around this technology? What are the main concerns about synthesized media?

Alex: Yeah, even the main concerns are related to this deep fake term. I mean that there is a bad use case where you pretend that someone said something they never said. And that's the main ethical concern about using synthesized media. However, it's not as new as many people think about deep fakes because as soon as humanity started to speak, they started to generate rumors. So they started to say that someone said something they never did.

Alex: And in that case, a deep fake is just the next level of rumors. And we also need to understand that many technologies, if not all of them, have bad usage. Like the internet could be used for bad things. Photoshop could be used for bad things. Printing machine, printing press was used for bad things. And we just have to get used to this new reality where truth should be configured to some extent.

Julián: But I see a difference. Well, there are brilliant and brilliant skilled Photoshop artists out there doing impressive manipulations, of course. But, right now there are like enough technologies that are able to detect that an image has been manipulated digitally. That might be what they say a fake. But the examples I have seen about video and audio are also very realistic and the citizens get that on a different way, especially the audio. So how to differentiate that? That the audio has been synthesized? That it is not the real audio? How can we differentiate that?

Alex: Yeah, I mean, at this point, synthesized video is much more advanced than synthesized audio. So most of deep fakes you might have seen are about video part.

Julián: Mm-hmm (affirmative).

Alex: And it's quite hard to make audio sound very cool. And in my personal opinion, that's because when we see something, our imagination works for the purpose to actually make it more real. So if we see a not-perfect picture, we can make it perfect in our mind. But when we hear some non-perfect sound, we usually hear all these artifacts, robustness, all that stuff that makes us think that it's not real. And the reason for that is because as human beings we were used to protecting ourselves using our hearing. So we had to lie in the cave quite a while ago and be able to hear and not filter all the threats in the surroundings.

Alex: And the way we can protect and how we can detect, I mean, there are two ways to detect. One is to create detectors that would catch some artifacts or some special things that are being produced in video and audio when it's being generated by machine, by artificial intelligence. Another way is to add what we can call watermarking to all generated images, to generated sound, and we actually work in both ways with our technology. So we've worked on a watermarking technique to add to all our audio. And the idea is just to have some kind of picture on the spectrogram that would be seeable by machine, but will not be hearable when you listen to it.

Alex: So you can run some recordings through an engine that detects this particular watermark and you can say that it was created digitally. That should be quite soon, this arms race between good audio, fake audio detectors and good synthesizers. I feel we are not there yet because there are not that many good synthesizers. And usually you can just easily tell when you hear the stuff that it was generated, or you can apply very simple techniques to see some standard artifacts that are present in good high quality synthesized audio.
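The watermark-and-detect idea Alex describes can be illustrated with a toy sketch. This is not Respeecher's technique, which the episode does not spell out; it assumes a made-up scheme in which a faint sine tone at a hypothetical frequency (`MARK_FREQ`) is mixed into the audio and a detector measures the energy at that one frequency. All names, constants, and the stand-in "recording" are assumptions for illustration, and a real watermark would need psychoacoustic masking and robustness to compression that this sketch ignores.

```python
import math
import random

SAMPLE_RATE = 16_000     # samples per second (assumed for this sketch)
MARK_FREQ = 7_500.0      # hypothetical watermark frequency in Hz
MARK_AMPLITUDE = 0.01    # faint compared with the speech-like signal below
THRESHOLD = MARK_AMPLITUDE / 2


def embed_watermark(samples):
    """Return a copy of `samples` with a faint sine tone added at MARK_FREQ."""
    step = 2 * math.pi * MARK_FREQ / SAMPLE_RATE
    return [s + MARK_AMPLITUDE * math.sin(step * n) for n, s in enumerate(samples)]


def tone_amplitude(samples, freq):
    """Estimate the amplitude at one frequency by correlating with cosine and sine."""
    step = 2 * math.pi * freq / SAMPLE_RATE
    re = sum(s * math.cos(step * n) for n, s in enumerate(samples))
    im = sum(s * math.sin(step * n) for n, s in enumerate(samples))
    return 2 * math.hypot(re, im) / len(samples)


def has_watermark(samples):
    """Report whether the amplitude at MARK_FREQ exceeds the detection threshold."""
    return tone_amplitude(samples, MARK_FREQ) > THRESHOLD


def fake_voice(seconds=1, seed=0):
    """Stand-in for a recording: a 220 Hz tone plus a little random noise."""
    rng = random.Random(seed)
    step = 2 * math.pi * 220 / SAMPLE_RATE
    return [0.5 * math.sin(step * n) + rng.gauss(0, 0.01)
            for n in range(seconds * SAMPLE_RATE)]


clean = fake_voice()
marked = embed_watermark(clean)
print(has_watermark(clean), has_watermark(marked))
```

The detector here only checks a single frequency, so it is trivially easy to strip or spoof; it is meant to show the shape of the idea (a mark a machine can measure but a listener is unlikely to notice), not a production design.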

Julián: Well, what I'm seeing is that now all of the different social platforms will need to start adding this type of checker. Because yeah, there are people that have the technical knowledge to use those tools and be able to discern, identify that it is a fake audio. That it's not real. But there are people that are very gullible; they believe everything they see on the internet, and especially if they are hearing a voice the impact is going to be way more powerful. So we need to start being more conscious of how to check the facts about what people are publishing online. Is there any regulation around any of these technologies?

Alex: There are not, as far as I know. They might come, but I mean, in this discussion we are missing one important argument to this equation. That would be about educating the society.

Julián: Yeah.

Alex: We need not just to detect the synthesized media, but we also need to educate society that synthesized media exists. And something you see or something you hear might be generated and might be not real because technology is getting more and more advanced. And that's a very big part that should be done by governments, where they spend time and effort to actually educate their society. Well, think about the example of this printing press.

Alex: A few generations ago, our parents and grandparents, they used to believe in everything that was printed. And now we do not believe in everything that was printed because we know that it could be not true. The same should happen with synthesized media, with video deep fakes and audio deep fakes. So people should just be critical in the way they think about what they see or what they hear.

Julián: That's true, true. And a very good point. Let's say that a couple of years ago, if you wanted to do some sort of face deep fake, for example, you needed to spend a bunch of time using different tools. Running models to be able to have some sort of result. But today you just need to download an app on your phone and you have a tool to pretty much change any image with your face, even video. What tools are available to do synthesized voice today? I know you have a service, but how easy is it for people to start creating these voices today?

Alex: Yeah. I mean, as we mentioned before, these video technologies and synthesized video are much more advanced compared to synthesized audio. Yeah, that's true. Now it's kind of easier to get some code from GitHub to synthesize someone's face and replace someone's face in the video, but it's much harder to do with audio. So for example, for our technology that can produce very good, highly realistic output, it's still quite a project. So we need to get recordings from a target voice. We need to get recordings from a source voice. We need to train our models and it usually takes like a week or so to train the model on the particular voice pair.

Alex: And then we need to ask the source speaker to say new phrases. And these new phrases will be converted into the target voice. But many models can make mistakes. And the way to work with those mistakes is still to apply some post production magic, just pick the best takes. Stitch them all together. Cut and paste. Do all this post-production stuff. So it's not that easy and it's not yet available to the public because it just requires a lot of effort to create very highly realistic speech or singing.

Julián: And how is Respeecher doing it today, for example, if I want to use you to generate voices, how is that process?

Alex: Yeah. I mean, right now we are focused on working with studios, with content creators. And the first step would be to get permission from a voice owner or from their estate to actually synthesize their voice for this particular piece of content. Usually studios have a good relationship with talent and they land this kind of agreement where they either license or just get permission for using a particular voice.

Alex: And then we ask for some data to train our system. With the current system we launched recently, it's a bit more simplified compared to what we had for a few years. We need one hour of any recording of a target voice and then we need another hour of recording of a source voice. No matter what script is there, we just need good samples of the voice to train our system. And then we proof-listen to all the data. We clean it up if needed. We can throw away some bad pieces of data and we train our system and it takes some computational time.

Alex: Most of the time, as I mentioned before, for training goes to computation directly. And our GPUs are being heated. But after that we can do what we call inference. So we can just get new lines recorded by the source actor, and this model would convert these lines to a target voice. That would be done within minutes. It doesn't take long.

Julián: Okay, nice. Interesting. What is the current status of the ecosystem? I mean, people working in this type of technologies, what are they doing? What type of applications or creative uses have you seen there in the ecosystem? Are more bad or fake things being created? Or more on the good side of things? Can you tell us how is the panorama today?

Alex: Yeah, there are more and more companies working in the synthesized media field and actually recently Samsung Next released a big overview of the synthesized media landscape, and it covers a lot of different areas. And in the voice domain, in the speech generation domain, most of the companies are doing text-to-speech--

Julián: Exactly.

Alex: ... so their technology can generate speech from the text. So you put text on the input. We are doing speech-to-speech and there are just a few companies in the market that are doing the speech-to-speech, and it's quite different from text to speech because we take all the emotional content from real human and we still need to keep humans in the loop. So humans need to give the right emotion and the right way of saying the stuff as say, director of the movie wants it to be said.

Alex: And in terms of how it's been used in the media, I would say that we are very early in the adoption curve. So in most of our conversations with prospective clients, we just need to educate them to the way that the technology is already there. So it exists on that level. And it's possible to create voice and put it on big screen where it would not be noticeable that it was generated that there are some artifacts or something like that. So we are still very early.

Julián: Okay. And what does that education strategy look like? How are you educating your prospective customers or the people out there about this technology?

Alex: Yeah. We talk to studios. We actually talk and work with most of the Hollywood, biggest Hollywood studios. And we talk a lot with sound professionals. And frankly, these sound professionals, they were looking for technology like ours for years, because there are cases where they cannot get an actor back to the studio or they need this to be done right now, not in two weeks when the actor would be available. It's kind of a pain in the neck for sound professionals.

Alex: But most of them have never heard of good speech synthesis technology because these technologies were not there. So we need to explain that it's actually possible to do. And in some cases, people cannot even believe that it's possible to do. They think that we faked our demo. That's why we had to do some demos with very famous voices and create some stuff that these famous voices have not said ever.

Julián: Oh, yeah. Yeah. Right. And I also imagine these being used in art expositions, museums, things that are going to be more for the public. So they'll know that this technology exists and can produce very exact results.

Alex: Yeah, exactly. Yeah. So this kind of historical projects are very interesting for us, because first, we bring history back to life. We bring more colors to the history where you can hear voice of Winston Churchill, or you can hear voice of president who deceased quite a while ago in the museum. But also it's a big part of our ethics policy where we educate the society about possibilities of the technology because the best way to educate is to show and the best way to show, is to show it on good well known examples.

Julián: Definitely. Definitely. So to wrap up, do you have any advice for our audience? What to look for to get more education about this topic? Or how to start playing a little bit with this technology?

Alex: I mean it's worth checking this Samsung Next synthetic media landscape that was released I believe a month ago, because it covers a lot of companies and a lot of examples of synthesized media. It's also worth thinking about the content that's being published in a critical way, because at some point in future, maybe in a year or in two years, the technology like ours could be available to bigger amount of people not just to studios and sound professionals. And in that case, we should be mindful about the content we consume.

Julián: Alex, thank you very much for joining us here today. This is impressive. And there are a lot of new things to be aware of and learn about. And definitely it is exciting to see how this technology is starting to shape our reality and our future. So we will continue this conversation in part two, where we are going to dive into the technical side of how this content is created and validated. So how can we identify if it is real or fake. Thank you, Alex, again for joining us and thank you everybody for listening to this episode, see you on the next one. And, bye-bye.

Alex: Thank you so much, Julián.

About Code[ish]

A podcast brought to you by the developer advocate team at Heroku, exploring code, technology, tools, tips, and the life of the developer.

Hosted by

Julián Duque

Principal Developer Advocate, Heroku

Developer Advocate, Community Leader, and Educator with experience in Node.js and JavaScript

With guests

Alex Serdiuk

CEO, Respeecher

Alex is the co-founder of Respeecher. He and the team have been working on building a fine voice cloning tool for content creators for more than 3 years.
