With the progression of deep learning techniques, automatically generating videos from audio or text inputs has emerged as a promising and rapidly evolving area of research. This paper presents NeRF-THIS (Neural Radiance Field based Talking Head Synthesis Incorporating Text-to-Speech), a novel approach to text-driven talking head generation that combines the strengths of text-based audio generation models with audio-driven video generation models. The method builds a Neural Radiance Fields (NeRF) based talking head generation architecture integrated with text-to-speech (TTS). This approach has several advantages: 1) it requires only 5 minutes of training data; 2) it is not constrained by Automatic Speech Recognition (ASR) models, thereby offering freedom from language barriers; 3) it supports real-time inference at low computational cost. Our findings indicate a promising direction for future research in multimedia content generation, opening new avenues for applications in virtual reality, digital entertainment, and interactive media.