Title Page
Abstract
Contents
Chapter 1. Introduction 10
Chapter 2. Background and Related Work 13
2.1. Transformers 13
2.1.1. The Architecture of the Transformer 13
2.1.2. The Output Structure of the Transformer 14
2.1.3. Multi-head Attention 14
2.2. Knowledge Distillation on Transformers 15
2.2.1. Knowledge Distillation on Transformer Encoders 16
2.2.2. Knowledge Distillation on Transformer Decoders 17
2.2.3. Knowledge Distillation on Transformer Encoders and Decoders 17
Chapter 3. Proposed Method 20
3.1. Finding Replaceable Pairs in the Encoder and Decoder 22
3.2. Warm-up with Simplified Task 23
3.2.1. Simplified Task by Reducing the Number of Target Classes 25
3.2.2. Mapping the Prediction Probabilities to Simplified Task Labels 26
3.3. Layer-wise Attention Head Sampling 28
Chapter 4. Experiments 30
4.1. Experimental Settings 30
4.1.1. Dataset 30
4.1.2. Competitors 31
4.1.3. Evaluation Metric 31
4.2. Translation Accuracy of PET 32
4.3. Translation Speed of PET 34
4.4. Effectiveness of Replaceable Pair 34
4.5. Effectiveness of Simplified Task 35
4.6. Effectiveness of Layer-wise Attention Head Sampling 36
4.7. Sensitivity Analysis 37
Chapter 5. Conclusion 39
References 40
Summary (in Korean) 42
List of Tables
Table 1. Table of symbols. 12
Table 2. The label-assignment rules for the teacher's predictions. 26
Table 3. The summary of the datasets. 31
Table 4. Comparison of the BLEU score. 33
Table 5. Comparison of the BLEU score [%] according to various replaceable pairs. 35
Table 6. Comparison of the BLEU score [%] with and without simplified task pre-training. 35
Table 7. Comparison of the BLEU score [%] according to pre-training methods. 36
Table 8. Comparison of the BLEU score [%] according to the use of layer-wise attention head sampling. 37
List of Figures
Figure 1. The model architecture of the transformer. 18
Figure 2. The model output structure of the transformer. 19
Figure 3. An example of the encoder of PET with four layers. 23
Figure 4. An example of the decoder of PET with four layers. 24
Figure 5. An example of the pre-training method of PET with four layers. 28
Figure 6. An example of the layer-wise attention head sampling. 29
Figure 7. Translation accuracy and speed on IWSLT'14 DE↔EN. 32
Figure 8. The translation accuracy according to the beam size. 38