Title Page
Abstract
Abstract (in Korean)
Contents
1. Introduction 12
2. Related Work 16
2.1. 3D Human Pose and Shape Estimation 16
2.2. Uncertainty Modeling 18
2.3. Transformer Architecture in Vision Tasks 19
3. Proposed Method 21
3.1. SMPL Model 21
3.2. Backbone Module 22
3.3. Encoder 26
3.4. Decoder 27
3.5. Loss Functions 28
3.6. Implementation Details 30
3.7. Extension to Video Dataset 32
4. Experimental Results 36
4.1. Datasets and Evaluation Metrics 36
4.2. Comparisons to the State-of-the-art Methods 37
4.3. More Results on Occlusion 44
4.4. Ablation Study 49
4.5. More Results on Video Dataset 51
5. Conclusions and Future Work 57
References 57
Table 1. Evaluation of state-of-the-art methods on the 3DPW dataset. The best results are highlighted in bold, and "-" denotes results that are not available. 38
Table 2. Evaluation of state-of-the-art methods on the Human3.6M dataset. The best results are highlighted in bold, and "-" denotes results that are not available. 40
Table 3. Quantitative comparison of keypoint localization AP with state-of-the-art methods on the COCO dataset. 43
Table 4. Evaluation of 2D keypoint projection accuracy on the 3DPW dataset. 43
Table 5. Evaluation on the occlusion datasets 3DPW-OCC and 3DOH. 46
Table 6. Evaluation on the 3DPW-Crowd dataset. 47
Table 7. Ablation study on the proposed main modules. 49
Table 8. Evaluation of state-of-the-art methods on the 3DPW, MPI-INF-3DHP, and Human3.6M datasets. The best results are highlighted in bold, and "-" denotes results that are not available. Our method achieves the state-of-... 52
Figure 1. The overall framework of the proposed method. The backbone extracts the human-related features and the uncertainty features. The encoder module refines the features to reflect the uncertainty and relationships between body... 23
Figure 2. We propose a method that estimates 3D human pose and shape from video using uncertainty information and part-based 3D dynamics. Our method is able to recover accurate and smooth 3D motion, achieving the... 31
Figure 3. The overall framework of the proposed method. Given a temporal sequence of images, the model extracts an uncertainty-aware temporal feature that includes uncertainty in 2D space and optical flow information. Then, the... 34
Figure 4. Qualitative comparison of reconstruction results on the COCO dataset. From left to right: Image, SPIN results, PARE results, and UDT results. 41
Figure 5. Qualitative comparison of reconstruction results on the partially occluded images. From left to right: Image, SPIN results, PARE results, and UDT results. 45
Figure 6. Visualization of our results from a variety of viewpoints. The estimated 3D meshes appear natural even when viewed from directions that are... 46
Figure 7. Qualitative comparison of reconstruction results on images with multiple humans. From left to right: Image, PARE results, and UDT results. 48
Figure 8. Qualitative comparison of reconstruction results for the ablation study. From top to bottom: (a): Image, (b) w/o encoder and decoder, (c) w/o... 50
Figure 9. Comparison of the acceleration errors of our method, MEVA, and VIBE. Our method shows clearly lower acceleration errors than the previous methods. 53
Figure 10. Qualitative comparison with VIBE. For each sequence, the top row shows the input images, the middle row shows our results (blue), and... 55
Figure 11. Qualitative comparison with VIBE on the 3DPW dataset. The output meshes from VIBE and our method are rendered in pink and blue, respectively. Our method (blue) is able to produce accurate 3D meshes for difficult poses. 56