Title Page
ABSTRACT
Abstract (in Korean)
PREFACE
Contents
CHAPTER 1. Introduction 21
CHAPTER 2. Evolution Until Large-Scale ASR Models 28
2.1. ASR Before Transformer 29
2.1.1. Basic Recurrent Neural Network Models 29
2.1.2. Encoder-Decoder with Attention 34
2.2. ASR After Transformer 40
2.2.1. Transformer Architecture 40
2.2.2. Evolution of the Encoder in ASR 44
2.2.3. Evolution of Self-Supervised Speech Representations 47
2.3. Large-Scale ASR Models 52
2.3.1. OpenAI's Whisper 53
2.3.2. Meta AI's MMS 60
CHAPTER 3. Parameter-Efficient Fine-Tuning 64
3.1. Traditional Fine-Tuning Approach 66
3.2. Low-Rank Adaptation for Whisper 71
3.3. Language Specific Adapter Layers of MMS 80
CHAPTER 4. Experiment Details 87
4.1. SPGIspeech Corpus 88
4.1.1. Key Characteristics of SPGIspeech Corpus 88
4.1.2. Performances of Large-Scale ASR Models on SPGIspeech Corpus 93
4.1.3. Fine-Tuning Specifications for SPGIspeech Corpus 96
4.2. Experiment Details on Fine-Tuning Whisper Model 99
4.2.1. Experiment Details for Full Fine-Tuning 100
4.2.2. Experiment Details for LoRA Fine-Tuning 105
4.3. Experiment Details on Fine-Tuning MMS Model 110
4.3.1. Experiment Details for Parameter-Efficient Fine-Tuning of the MMS Model 115
4.3.2. Experiment Details for Full Fine-Tuning 121
CHAPTER 5. Experiment Results 130
5.1. Efficiency Evaluations of Fine-Tuning the Whisper Model 131
5.2. Performance Evaluations 138
5.2.1. Comparative Analysis of Fine-Tuning on 'Tiny' Set 143
5.2.2. Comparative Analysis between Fine-Tuning on 'Tiny' and 'Small' Sets 148
5.2.3. Analyzing the Benefits Gained by Increased Volume of Data 152
5.3. Mathematical Evaluations 157
5.3.1. Amplification Factor for Low-Rank Matrices 159
5.3.2. Singular Value Decomposition for the MMS Model 165
CHAPTER 6. Conclusion 170
6.1. Reviewing the Dissertation 170
6.2. Key Findings 172
6.2.1. Findings from Fine-Tuning Whisper Model 172
6.2.2. Findings from Fine-Tuning MMS Model 174
6.3. Limitations and Further Research 176
REFERENCES 178
List of Tables
Table 1. Comparative architectural specifications of Whisper model variants. This table enumerates the differences in layers, dimensions, attention heads, and total parameters... 56
Table 2. Identifier key for Whisper model variants. Specific naming conventions are applied for Whisper model sizes, ranging from 'Tiny' to 'Large', and delineate between... 58
Table 3. LoRA trainable parameters for Whisper model variants 79
Table 4. Number and percentage of trainable parameters for LoRA fine-tuning Whisper models. The LoRA rank is set to 32 for all the models. 80
Table 5. Characteristics and example sentences from the SPGIspeech dataset's transcripts. Verbatim examples for each characteristic are provided for comparison and to explain the... 90
Table 6. WER and CER on the SPGIspeech test set for baseline models 93
Table 7. WER and CER for Whisper models on SPGIspeech test set 94
Table 8. Comparison of PEFT methods for Whisper and MMS models on different SPGIspeech subsets. 97
Table 9. Hyperparameter settings for full fine-tuning the Whisper models on 'tiny' set of SPGIspeech corpus 100
Table 10. Hyperparameter settings for LoRA fine-tuning Whisper models on 'tiny' set of SPGIspeech corpus 105
Table 11. Hyperparameter settings for LoRA fine-tuning the Whisper models on 'small' set of SPGIspeech corpus 108
Table 12. Composition of the character set used for fine-tuning MMS model 112
Table 13. Hyperparameter settings for fine-tuning the MMS model with PEFT methodologies on the 'tiny' and 'small' sets of the SPGIspeech corpus 117
Table 14. Hyperparameter settings for full fine-tuning the MMS model on 'tiny' set of SPGIspeech corpus 122
Table 15. Effects of various learning rates on full fine-tuning of the MMS model 123
Table 16. CER for fine-tuned models. The value outside the parentheses is the CER, while the value inside is the RCR. The performance metrics are expressed as percentages. 143
Table 17. Comparative performance and resource utilization of fine-tuning methods on the MMS Model, fine-tuned on 'tiny' set of the SPGIspeech corpus 146
Table 18. Comparative results of WER and CER across Whisper model variants LoRA fine-tuned on the 'tiny' and 'small' sets of the SPGIspeech corpus. The performance... 149
Table 19. Comparative results of WER and CER for the MMS model applying PEFT methodologies on the 'tiny' and 'small' sets of the SPGIspeech corpus. The... 150
Table 20. Comparative analysis of RCR and DRCR for LoRA fine-tuning on Whisper variants 153
Table 21. Comparative analysis of RCR and DRCR for the MMS model applying different PEFT methodologies 154
List of Figures
Figure 1. Diagram of how an RNN processes sequential data 30
Figure 2. The process of CTC in ASR 31
Figure 3. Graphical illustration of the basic encoder-decoder architecture 35
Figure 4. Graphical illustration of the basic encoder-decoder architecture with an attention mechanism 36
Figure 5. Illustration of an ASR system using an encoder-decoder with an attention mechanism 38
Figure 6. Graphical illustration of the Transformer architecture 41
Figure 7. Progression of encoder-decoder ASR model architectures, with key innovations in encoder design. This figure maps the advancement from the foundational Transformer... 45
Figure 8. The evolution of self-supervised speech representation models. The shaded area highlights models that incorporate a Transformer encoder, signifying a key architectural... 48
Figure 9. Schematic of the Whisper model architecture. This diagram showcases the Transformer-based encoder-decoder structure utilized by OpenAI's Whisper for ASR... 54
Figure 10. Evolution of multilingual self-supervised speech representations that apply the Wav2Vec 2.0 architecture 61
Figure 11. The two-phase learning process of BERT for NLP applications. Once pre-trained on a large corpus, fine-tuning requires far less data. 67
Figure 12. Visualization of homograph word embedding representation in NLP models. The word 'bass' receives the same vector representation despite being used in different... 68
Figure 13. Contextual word embeddings using BERT. The BERT model generates distinct vector representations for the word 'bass' based on its usage in different sentences,... 69
Figure 14. Illustration of LoRA fine-tuning. Matrix A projects the input down to a lower rank r, while matrix B restores it to the original dimension d. 73
Figure 15. Target modules for LoRA fine-tuning on Whisper models. 76
Figure 16. Whisper decoder architecture highlighting LoRA target modules. The red dashed outlines with arrows pinpoint the 'Query' and 'Value' components within the 'Multi-... 77
Figure 17. Comparative overview of fine-tuning strategies for the MMS model. (a) Full fine-tuning: All layers including the LM Head and Transformer Encoder Blocks are trainable.... 81
Figure 18. Detailed architecture of the MMS model's Transformer encoder block with an integrated adapter layer. 83
Figure 19. Learning rate schedule during full fine-tuning of the Whisper models on 'tiny' set of SPGIspeech corpus 101
Figure 20. Cross-entropy loss value of the Whisper models during full fine-tuning on 'tiny' set of SPGIspeech corpus 103
Figure 21. Learning rate schedule during LoRA fine-tuning of the Whisper models on 'tiny' set of SPGIspeech corpus 106
Figure 22. Cross-entropy loss value of Whisper models during the LoRA fine-tuning on 'tiny' set of SPGIspeech corpus 107
Figure 23. Learning rate schedule during LoRA fine-tuning of the Whisper models on 'small' set of SPGIspeech corpus 109
Figure 24. Cross-entropy loss value of the Whisper models during LoRA fine-tuning on 'small' set of SPGIspeech corpus 110
Figure 25. The scope of training for various fine-tuning methodologies for the MMS model 114
Figure 26. Learning rate schedule during fine-tuning the MMS model with PEFT methodologies on 'tiny' set of the SPGIspeech corpus 118
Figure 27. Learning rate schedule during fine-tuning the MMS model with PEFT methodologies on 'small' set of the SPGIspeech corpus 118
Figure 28. CTC loss value of the MMS model during fine-tuning with PEFT methodologies on 'tiny' set of the SPGIspeech corpus. The y-axis is in logarithmic scale. 119
Figure 29. CTC loss value of the MMS model during fine-tuning with PEFT methodologies on 'small' set of the SPGIspeech corpus. The y-axis is in logarithmic scale. 120
Figure 30. CTC loss value of the MMS model during full fine-tuning with various learning rates on 'tiny' set of the SPGIspeech corpus. The y-axis is in logarithmic scale. 125
Figure 31. CTC loss value of the MMS model during fine-tuning with various fine-tuning methods on 'tiny' set of the SPGIspeech corpus. The y-axis is in logarithmic scale. 127
Figure 32. Maximum GPU memory allocation during fine-tuning of the Whisper model variants, with a red dotted line marking the 16 GB threshold. The hatched 'large-v2' bar... 132
Figure 33. Comparative training duration for Whisper model variants, highlighting the efficiency of LoRA fine-tuning. The hatched bar for the 'small.en' model represents LoRA fine-... 135
Figure 34. Illustration of WER and CER calculations for the SPGIspeech corpus test set. (a) illustrates the WER calculation and (b) illustrates the CER calculation. Dotted... 140
Figure 35. Comparative DRCR values across model sizes and PEFT methodologies. This figure illustrates the DRCR percentages plotted against the logarithmic scale of model... 155
Figure 36. Target modules for LoRA fine-tuning Whisper model 161
Figure 37. Comparison of amplification factors for the Whisper large-v2 model variants, each fine-tuned using LoRA on the 'tiny' and 'small' sets of the SPGIspeech corpus. The... 163
Figure 38. Singular values of LM head layer of the MMS model for varying fine-tuning methodology and dataset volume 167