Title Page
ABSTRACT
ABSTRACT (IN KOREAN)
PREFACE
Contents
CHAPTER 1. INTRODUCTION 12
CHAPTER 2. RELATED WORK 16
CHAPTER 3. PROPOSED METHOD 17
3.1. Dynamic Spatiotemporal Sampling 17
3.2. Backbone 19
3.3. Teacher-Student Framework 20
3.4. Cross-Resolution Correspondence 22
3.5. Masked Self-Distillation 23
3.6. Total Loss 23
CHAPTER 4. EXPERIMENTS 24
4.1. Benchmark Datasets 24
4.2. Implementation Details 24
4.3. Ablation Studies 25
4.4. Results 31
CHAPTER 5. CONCLUSION 33
REFERENCES 34
Table 1. Ablation table of augmentation combinations for global and local clips. C denotes random cropping, S soft color jittering, H hard color jittering, B image blur, and F image... 26
Table 2. Ablation study of spatial and temporal resolution variations. We utilized the model pre-trained on the Kinetics400 dataset and fine-tuned on UCF101 and HMDB51. 27
Table 3. Action recognition performance with CNN and Transformer backbones. When we pre-trained the model on the UCF101 dataset, we used 200 training epochs with a learning rate... 28
Table 4. Ablation table of the proposed tasks. Correspondence denotes our proposed cross-resolution correspondence task, and self-distillation denotes our masked self-distillation. We... 30
Table 5. Comparison of self-supervised methods on action recognition performance. Methods marked in gray required massive computational cost for training or utilized ImageNet... 31
Figure 1. The global clip with fast playback speed has wide temporal coverage but low temporal resolution and little spatial information. In contrast, the local clip with normal (or... 12
Figure 2. Our STATS framework performs self-distillation with spatial masking and learns cross-resolution correspondence by matching features of video clips with varying spatial and... 14
Figure 3. Illustration of our dynamic spatiotemporal sampling method. First, the global and local views are created by collecting frames according to each playback speed from a... 18
Figure 4. Illustration of overall STATS framework. 21
Figure 5. Illustration of action recognition accuracy of the model with various playback speeds. We trained a TimeSformer backbone for 100 epochs on UCF101 and... 25
Figure 6. Illustration of action recognition accuracy of the model with various masking ratios. We trained a TimeSformer backbone on Kinetics400 and evaluated it on UCF101 29