Recent research on automatic speaker recognition (ASR) has shown that many challenges remain before real-life deployment. Several issues related to real-world environments, privacy, and spoofing attacks have been raised and addressed. However, for recognition performance in real environments, short test utterances still stand as a main obstacle. Specifically, speaker features extracted from short speech utterances have been reported to yield significantly lower recognition performance than features extracted from longer utterances.
To overcome these challenges, many approaches have demonstrated improvements through methods such as multi-scale frequency-channel attention, meta-learning, and segment aggregation. Although many models improve performance in short-utterance conditions, the performance gap between full and short utterances remains substantial.
This thesis proposes an efficient system that performs well across various utterance lengths. First, an improved attentive network is employed to produce more discriminative speaker embeddings. Second, the speaker verification model architecture is refined to achieve better performance than previous work. Finally, an additional training strategy that mitigates short-utterance performance degradation is implemented and compared with the original methods.
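To make the attentive-embedding idea concrete, the sketch below implements attentive statistics pooling, one common way an attention mechanism summarizes frame-level features into a fixed-length speaker embedding. This is an illustrative minimal version, not the thesis model: the weight matrices `w`, `b`, and `v`, the feature dimensions, and the function name are all assumptions for demonstration.

```python
import math

def attentive_stats_pooling(frames, w, b, v):
    """Pool a variable-length sequence of frame features into one vector.

    frames : list of T feature vectors, each of length D
    w, b   : weights/bias of a one-layer scoring network (H x D matrix, H bias)
    v      : projection vector (length H) mapping hidden units to a scalar score
    Returns the attention-weighted mean and standard deviation, concatenated (2*D).
    """
    # Frame-level attention scores: e_t = v . tanh(W x_t + b)
    scores = []
    for x in frames:
        h = [math.tanh(sum(wij * xj for wij, xj in zip(wi, x)) + bi)
             for wi, bi in zip(w, b)]
        scores.append(sum(vi * hi for vi, hi in zip(v, h)))

    # Softmax over time gives attention weights alpha_t
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    alpha = [e / z for e in exps]

    # Weighted first- and second-order statistics over the frames
    d_dim = len(frames[0])
    mean = [sum(a * x[d] for a, x in zip(alpha, frames)) for d in range(d_dim)]
    std = [math.sqrt(max(sum(a * (x[d] - mean[d]) ** 2
                             for a, x in zip(alpha, frames)), 1e-12))
           for d in range(d_dim)]
    return mean + std  # fixed-length embedding regardless of T
```

Because the pooled statistics are weighted by learned attention, informative frames can dominate the embedding, which is one reason attentive pooling is reported to help when only a few frames are available. With zero weights the attention is uniform and the pooled mean reduces to a plain average over frames.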