Title Page
Abstract
Abstract (in Korean)
Contents
Chapter 1. Introduction 15
1.1. Various Token-level Considerations 16
1.2. Augmenting the Token Set 18
1.3. Structure of the Thesis 19
Chapter 2. Token-level Data Augmentation 21
2.1. Background 22
2.2. TokenMixup 23
2.2.1. Sample Difficulty Assessment 23
2.2.2. Attention-guided Saliency Detection 24
2.2.3. Optimal Assignment 25
2.2.4. Token-level Mixup 27
2.3. Experiments 29
2.3.1. Implementation Details 29
2.3.2. Benchmark Experiments 30
2.3.3. Component Ablation 31
2.3.4. Sensitivity Experiments 33
2.3.5. Robustness Experiments 34
2.4. Discussions 35
2.4.1. Comparison with Gradient-based Methods 35
2.4.2. Comparison with Manifold Mixup 36
2.4.3. TokenMixup as Curriculum Learning 38
Chapter 3. Recurrent Token Decoding 39
3.1. Background 39
3.2. Recurrent DETR 41
3.2.1. Preliminaries 41
3.2.2. Recurrent DETR Architecture 43
3.2.3. Pondering Hungarian Loss 46
3.3. Experiments 49
3.3.1. Implementation Details 49
3.3.2. Standard Setting: COCO 2017 51
3.3.3. Crowded Setting: CrowdHuman 53
3.3.4. Component Ablation 54
3.4. Discussions 56
3.4.1. Effect of Novelty Bias 56
3.4.2. Effect of Pondering 57
3.4.3. Usefulness of Non-Maximum Suppression 60
3.4.4. Computation–Performance Tradeoff 61
Chapter 4. Conclusion 62
Bibliography 64
List of Tables
Table 2.1. Experimental results on CIFAR. 31
Table 2.2. Experimental results on ImageNet-1K. 32
Table 2.3. Ablation study on CIFAR-100. 32
Table 2.4. Sample difficulty and saliency thresholds. 33
Table 2.5. Robustness to Gaussian noise. 34
Table 2.6. Robustness to PGD attack. 34
Table 2.7. Saliency detector comparison. 35
Table 2.8. Mixup combinations. 36
Table 3.1. Dataset statistics of COCO 2017 and CrowdHuman. 50
Table 3.2. Experiments on COCO 2017. 52
Table 3.3. Experiments on CrowdHuman. 54
Table 3.4. Experiments with CrowdHuman++. 54
Table 3.5. Ablation study on model components of Recurrent DETR with COCO 2017. 55
Table 3.6. Deformable DETR performance on different post-processors. 60
Table 3.7. FPS and average iteration. 61
List of Figures
Figure 2.1. TokenMixup. 24
Figure 2.2. Qualitative comparison of saliency detectors. 35
Figure 2.3. Attention map comparison of HTM and Manifold Mixup. 37
Figure 2.4. Trend in the number of mixed instances. 38
Figure 3.1. Recurrent DETR architecture. 45
Figure 3.2. Average loss evaluated at each iteration. 56
Figure 3.3. Relative box size detected in each iteration. 57
Figure 3.4. Qualitative comparison. 58
Figure 3.5. Pondering of Recurrent DETR–Q5. 59