Research Article

Realistic Speech-Driven Talking Video Generation with Personalized Pose

Figure 2: Pipeline of our method. The input can be either audio or text. For audio input, we convert the audio signal into log-mel features and feed them to the Aud2Kps model to obtain the pose keypoints. For text input, an acoustic model first converts the text into log-mel features, which are then fed to the Aud2Kps network; the remaining steps are identical to those for audio input.
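As a rough illustration of this two-branch front end (not the authors' released code), the sketch below computes log-mel features with librosa and routes either modality to the keypoint predictor. The `aud2kps` and `acoustic_model` callables stand in for the paper's Aud2Kps network and text-to-spectrogram acoustic model, and the sample rate and mel parameters are assumed values, since the caption does not specify them.

```python
import librosa
import numpy as np

def audio_to_logmel(wav_path, sr=16000, n_mels=80, hop_length=200):
    """Convert an audio file to log-mel features (audio branch of Figure 2).

    sr, n_mels, and hop_length are illustrative defaults, not the
    paper's configuration.
    """
    wav, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_mels=n_mels, hop_length=hop_length
    )
    return np.log(mel + 1e-6)  # log compression; epsilon avoids log(0)

def synthesize_pose_keypoints(inp, aud2kps, acoustic_model=None, is_text=False):
    """Route audio or text input through the pipeline of Figure 2.

    `aud2kps` and `acoustic_model` are hypothetical placeholders for the
    Aud2Kps network and the acoustic model; neither is a public API.
    """
    if is_text:
        logmel = acoustic_model(inp)   # text -> log-mel via acoustic model
    else:
        logmel = audio_to_logmel(inp)  # audio -> log-mel features
    return aud2kps(logmel)             # log-mel -> pose keypoints
```

The single `synthesize_pose_keypoints` entry point mirrors the caption's observation that, after feature extraction, both modalities share the same downstream steps.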