Title Page
Abstract
Contents
Chapter 1. Introduction 10
Chapter 2. Background and Related Work 13
2.1. Transformers 13
2.1.1. The Architecture of the Transformer 13
2.1.2. The Output Structure of the Transformer 14
2.1.3. Multi-head Attention 14
2.2. Knowledge Distillation on Transformers 15
2.2.1. Knowledge Distillation on Transformer Encoders 16
2.2.2. Knowledge Distillation on Transformer Decoders 17
2.2.3. Knowledge Distillation on Transformer Encoders and Decoders 17
Chapter 3. Proposed Method 20
3.1. Finding Replaceable Pairs in the Encoder and Decoder 22
3.2. Warm-up with Simplified Task 23
3.2.1. Simplified Task by Reducing the Number of Target Classes 25
3.2.2. Mapping the Prediction Probabilities to Simplified Task Labels 26
3.3. Layer-wise Attention Head Sampling 28
Chapter 4. Experiments 30
4.1. Experimental Settings 30
4.1.1. Dataset 30
4.1.2. Competitors 31
4.1.3. Evaluation Metric 31
4.2. Translation Accuracy of PET 32
4.3. Translation Speed of PET 34
4.4. Effectiveness of Replaceable Pair 34
4.5. Effectiveness of Simplified Task 35
4.6. Effectiveness of Layer-wise Attention Head Sampling 36
4.7. Sensitivity Analysis 37
Chapter 5. Conclusion 39
References 40
Summary (in Korean) 42
List of Tables
Table 1. Table of symbols. 12
Table 2. The label-assignment rules for the teacher's predictions. 26
Table 3. The summary of the datasets. 31
Table 4. Comparison of the BLEU score. 33
Table 5. Comparison of the BLEU score [%] according to various replaceable pairs. 35
Table 6. Comparison of the BLEU score [%] with and without simplified task pre-training. 35
Table 7. Comparison of the BLEU score [%] according to pre-training methods. 36
Table 8. Comparison of the BLEU score [%] according to the use of layer-wise attention head sampling. 37
List of Figures
Figure 1. The model architecture of the transformer. 18
Figure 2. The model output structure of the transformer. 19
Figure 3. An example of the encoder of PET with four layers. 23
Figure 4. An example of the decoder of PET with four layers. 24
Figure 5. An example of the pre-training method of PET with four layers. 28
Figure 6. An example of the layer-wise attention head sampling. 29
Figure 7. Translation accuracy and speed on IWSLT'14 DE↔EN. 32
Figure 8. The translation accuracy according to the beam size. 38