Title Page
ABSTRACT
Abstract (in Korean)
PREFACE
Contents
CHAPTER 1. Introduction 21
CHAPTER 2. Evolution Until Large-Scale ASR Models 28
2.1. ASR Before Transformer 29
2.1.1. Basic Recurrent Neural Network Models 29
2.1.2. Encoder-Decoder with Attention 34
2.2. ASR After Transformer 40
2.2.1. Transformer Architecture 40
2.2.2. Evolution of the Encoder in ASR 44
2.2.3. Evolution of Self-Supervised Speech Representations 47
2.3. Large-Scale ASR Models 52
2.3.1. OpenAI's Whisper 53
2.3.2. Meta AI's MMS 60
CHAPTER 3. Parameter-Efficient Fine-Tuning 64
3.1. Traditional Fine-Tuning Approach 66
3.2. Low-Rank Adaptation for Whisper 71
3.3. Language Specific Adapter Layers of MMS 80
CHAPTER 4. Experiment Details 87
4.1. SPGIspeech Corpus 88
4.1.1. Key Characteristics of SPGIspeech Corpus 88
4.1.2. Performances of Large-Scale ASR Models on SPGIspeech Corpus 93
4.1.3. Fine-Tuning Specifications for SPGIspeech Corpus 96
4.2. Experiment Details on Fine-Tuning Whisper Model 99
4.2.1. Experiment Details for Full Fine-Tuning 100
4.2.2. Experiment Details for LoRA Fine-Tuning 105
4.3. Experiment Details on Fine-Tuning MMS Model 110
4.3.1. Experiment Details for Parameter-Efficient Fine-Tuning of the MMS Model 115
4.3.2. Experiment Details for Full Fine-Tuning 121
CHAPTER 5. Experiment Results 130
5.1. Efficiency Evaluations of Fine-Tuning the Whisper Model 131
5.2. Performance Evaluations 138
5.2.1. Comparative Analysis of Fine-Tuning on 'Tiny' Set 143
5.2.2. Comparative Analysis between Fine-Tuning on 'Tiny' and 'Small' Sets 148
5.2.3. Analyzing the Benefits Gained by Increased Volume of Data 152
5.3. Mathematical Evaluations 157
5.3.1. Amplification Factor for Low-Rank Matrices 159
5.3.2. Singular Value Decomposition for the MMS Model 165
CHAPTER 6. Conclusion 170
6.1. Reviewing the Dissertation 170
6.2. Key Findings 172
6.2.1. Findings from Fine-Tuning Whisper Model 172
6.2.2. Findings from Fine-Tuning MMS Model 174
6.3. Limitations and Further Research 176
REFERENCES 178
List of Tables
Table 1. Comparative architectural specifications of Whisper model variants. This table enumerates the differences in layers, dimensions, attention heads, and total parameters... 56
Table 2. Identifier key for Whisper model variants. Specific naming conventions are applied for Whisper model sizes, ranging from 'Tiny' to 'Large', and delineate between... 58
Table 3. LoRA trainable parameters for Whisper model variants 79
Table 4. Number and percentage of trainable parameters for LoRA fine-tuning Whisper models. The LoRA rank is set to 32 for all the models. 80
Table 5. Characteristics and example sentences from the SPGIspeech dataset's transcripts. Verbatim examples for each characteristic are provided for comparison and to explain the... 90
Table 6. WER and CER on the SPGIspeech test set for baseline models 93
Table 7. WER and CER for Whisper models on SPGIspeech test set 94
Table 8. Comparison of PEFT methods for Whisper and MMS models on different SPGIspeech subsets. 97
Table 9. Hyperparameter settings for full fine-tuning the Whisper models on 'tiny' set of SPGIspeech corpus 100
Table 10. Hyperparameter settings for LoRA fine-tuning Whisper models on 'tiny' set of SPGIspeech corpus 105
Table 11. Hyperparameter settings for LoRA fine-tuning the Whisper models on 'small' set of SPGIspeech corpus 108
Table 12. Composition of the character set used for fine-tuning MMS model 112
Table 13. Hyperparameter settings for fine-tuning the MMS model with PEFT methodologies on the 'tiny' and 'small' sets of the SPGIspeech corpus 117
Table 14. Hyperparameter settings for full fine-tuning the MMS model on 'tiny' set of SPGIspeech corpus 122
Table 15. Effects of various learning rates on full fine-tuning of the MMS model 123
Table 16. CER for fine-tuned models. The value outside the parentheses is the CER, while the value inside is the RCR. The performance metrics are expressed as percentages. 143
Table 17. Comparative performance and resource utilization of fine-tuning methods on the MMS Model, fine-tuned on 'tiny' set of the SPGIspeech corpus 146
Table 18. Comparative results of WER and CER across Whisper model variants LoRA fine-tuned on the 'tiny' and 'small' sets of the SPGIspeech corpus. The performance... 149
Table 19. Comparative results of WER and CER for the MMS model applying PEFT methodologies on the 'tiny' and 'small' sets of the SPGIspeech corpus. The... 150
Table 20. Comparative analysis of RCR and DRCR for LoRA fine-tuning on Whisper variants 153
Table 21. Comparative analysis of RCR and DRCR for the MMS model applying different PEFT methodologies 154
List of Figures
Figure 1. Diagram of how an RNN processes sequential data 30
Figure 2. The process of CTC in ASR 31
Figure 3. Graphical illustration of the basic encoder-decoder architecture 35
Figure 4. Graphical illustration of the basic encoder-decoder architecture with an attention mechanism 36
Figure 5. Illustration of an ASR system using an encoder-decoder with an attention mechanism 38
Figure 6. Graphical illustration of the Transformer architecture 41
Figure 7. Progression of encoder-decoder ASR model architectures, with key innovations in encoder design. This figure maps the advancement from the foundational Transformer... 45
Figure 8. The evolution of self-supervised speech representation models. The shaded area highlights models that incorporate a Transformer encoder, signifying a key architectural... 48
Figure 9. Schematic of the Whisper model architecture. This diagram showcases the Transformer-based encoder-decoder structure utilized by OpenAI's Whisper for ASR... 54
Figure 10. Evolution of multilingual self-supervised speech representations that apply the Wav2Vec 2.0 architecture 61
Figure 11. The two-phase learning process of BERT for NLP applications. Once pre-trained on a large corpus, fine-tuning requires far less data. 67
Figure 12. Visualization of homograph word embedding representation in NLP models. The word 'bass' receives the same vector representation despite being used in different... 68
Figure 13. Contextual word embeddings using BERT. The BERT model generates distinct vector representations for the word 'bass' based on its usage in different sentences,... 69
Figure 14. Illustration of LoRA fine-tuning. Matrix A projects the input down to a lower rank r, while matrix B restores it to the original dimension d. 73
Figure 15. Target modules for LoRA fine-tuning on Whisper models. 76
Figure 16. Whisper decoder architecture highlighting LoRA target modules. The red dashed outlines with arrows pinpoint the 'Query' and 'Value' components within the 'Multi-... 77
Figure 17. Comparative overview of fine-tuning strategies for the MMS model. (a) Full fine-tuning: All layers including the LM Head and Transformer Encoder Blocks are trainable.... 81
Figure 18. Detailed architecture of the MMS model's Transformer encoder block with an integrated adapter layer. 83
Figure 19. Learning rate schedule during full fine-tuning of the Whisper models on 'tiny' set of SPGIspeech corpus 101
Figure 20. Cross-entropy loss value of the Whisper models during full fine-tuning on 'tiny' set of SPGIspeech corpus 103
Figure 21. Learning rate schedule during LoRA fine-tuning of the Whisper models on 'tiny' set of SPGIspeech corpus 106
Figure 22. Cross-entropy loss value of Whisper models during the LoRA fine-tuning on 'tiny' set of SPGIspeech corpus 107
Figure 23. Learning rate schedule during LoRA fine-tuning of the Whisper models on 'small' set of SPGIspeech corpus 109
Figure 24. Cross-entropy loss value of the Whisper models during LoRA fine-tuning on 'small' set of SPGIspeech corpus 110
Figure 25. The scope of training for various fine-tuning methodologies for the MMS model 114
Figure 26. Learning rate schedule during fine-tuning the MMS model with PEFT methodologies on 'tiny' set of the SPGIspeech corpus 118
Figure 27. Learning rate schedule during fine-tuning the MMS model with PEFT methodologies on 'small' set of the SPGIspeech corpus 118
Figure 28. CTC loss value of the MMS model during fine-tuning with PEFT methodologies on 'tiny' set of the SPGIspeech corpus. The y-axis is in logarithmic scale. 119
Figure 29. CTC loss value of the MMS model during fine-tuning with PEFT methodologies on 'small' set of the SPGIspeech corpus. The y-axis is in logarithmic scale. 120
Figure 30. CTC loss value of the MMS model during full fine-tuning with various learning rates on 'tiny' set of the SPGIspeech corpus. The y-axis is in logarithmic scale. 125
Figure 31. CTC loss value of the MMS model during fine-tuning with various fine-tuning methods on 'tiny' set of the SPGIspeech corpus. The y-axis is in logarithmic scale. 127
Figure 32. Maximum GPU memory allocation during fine-tuning of the Whisper model variants, with a red dotted line marking the 16 GB threshold. The hatched 'large-v2' bar... 132
Figure 33. Comparative training duration for Whisper model variants, highlighting the efficiency of LoRA fine-tuning. The hatched bar for the 'small.en' model represents LoRA fine-... 135
Figure 34. Illustration of WER and CER calculations for the SPGIspeech corpus test set. (a) illustrates the WER calculation and (b) illustrates the CER calculation. Dotted... 140
Figure 35. Comparative DRCR values across model sizes and PEFT methodologies. This figure illustrates the DRCR percentages plotted against the logarithmic scale of model... 155
Figure 36. Target modules for LoRA fine-tuning Whisper model 161
Figure 37. Comparison of amplification factors for the Whisper large-v2 model variants, each fine-tuned using LoRA on the 'tiny' and 'small' sets of the SPGIspeech corpus. The... 163
Figure 38. Singular values of LM head layer of the MMS model for varying fine-tuning methodology and dataset volume 167