Table of Contents

Title Page

Contents

Abstract 11

I. Introduction 13

1.1. Research Background and Significance 13

1.2. Important concepts and application status 15

1.2.1. Data pre-processing 15

1.2.2. Data Cleaning 16

1.3. Research content and organization of the thesis 17

1.3.1. Main research content of the thesis 17

1.3.2. Thesis Structure 19

1.4. Summary of this chapter 19

II. Principles and knowledge of machine learning and core algorithms 20

2.1. Machine Learning 20

2.2. Unsupervised Machine Learning 21

2.3. Application of unsupervised algorithm in anomaly data detection 22

2.4. Introduction to the unsupervised algorithms in this paper 22

2.4.1. Gaussian mixture model 22

2.4.2. OPTICS (Ordering points to identify the clustering structure) 24

2.4.3. MissForest 27

2.4.4. Isolation Forest 29

2.5. Summary of this chapter 31

III. Data distillation: a solution for separating trustworthy data from massive amounts of unreliable data 32

3.1. Brief description of data distillation 32

3.2. Basic ideas and principles of data distillation 33

3.2.1. The concept and principle of distillation 33

3.2.2. Processing framework and methods for data distillation 33

3.3. Data distillation model processing flow 34

3.3.1. Noise reduction processing of data based on critical values 34

3.3.2. Null-filling approach based on MissForest 35

3.3.3. Gaussian mixture model-based data modeling and decomposition 35

3.3.4. Isolation Forest-based outlier sieving 35

3.3.5. Average and OPTICS-based quadratic clustering 35

3.3.6. Evaluation of data distillation results based on Symmetric Mean Absolute Percentage Error (SMAPE) 36

3.4. Summary of this chapter 37

IV. Data distillation experiments and analysis 38

4.1. Experiment data 38

4.2. First-time noise rejection based on lower bound critical values 40

4.3. Missing value filling based on MissForest 41

4.4. GMM-based first round clustering 42

4.5. Isolation Forest-based outlier detection 44

4.6. Secondary clustering based on OPTICS 45

4.7. Summary and analysis of experimental results 46

4.8. Summary of this chapter 50

V. Summary and Outlook 51

5.1. Discovery and Innovation Points 51

5.2. Prospects 52

References 55

Achievements during the master's studies 58

Figure 1.1. Relationship between data distillation and other data processing concepts 14

Figure 1.2. Key elements of data pre-processing 15

Figure 1.3. Data cleansing steps 16

Figure 2.1. Image based on Gaussian distribution 24

Figure 2.2. OPTICS clustering algorithm flow chart 26

Figure 2.3. Example of an Isolation Forest model (outliers are marked in red) 30

Figure 3.1. Distillation experiment illustration 33

Figure 3.2. Schematic diagram of data distillation principle 34

Figure 3.3. Principle of critical value noise reduction 35

Figure 4.1. Test site, material and equipment 38

Figure 4.2. Raw data in csv file format 39

Figure 4.3. Time-weight pair data set 40

Figure 4.4. Scatter plot of raw data 41

Figure 4.5. Plot of missing data 42

Figure 4.6. The trend of BIC values under the four covariance types when k values are set to 100, 50 and 20, respectively 43

Figure 4.7. Two-dimensional Gaussian mixture model clustering results 44

Figure 4.8. Outliers filtered by Isolation Forest 45

Figure 4.9. Secondary clustering based on OPTICS 46

Figure 4.10. Comparison of the measured values (1) at 5-day intervals with the results calculated by this algorithm 47

Figure 4.11. Comparison of the measured values (2) at 5-day intervals with the results calculated by this algorithm 48

Figure 4.12. Comparison of the data containing only delivery day measurements (3-113) with the results calculated by this algorithm 49

Figure 4.13. Comparison of the delivery day measurements of the other 111 datasets with the results calculated by this algorithm 50

2.1. One-dimensional Gaussian distribution formula 23

2.2. Multi-dimensional Gaussian distribution formula 23

2.3. Gaussian mixture model formula 24

2.4. Gaussian mixture model component formula (same as 2.2) 24

2.5. MissForest iteration termination condition formula 29

2.6. Formula for calculating anomaly scores for Isolation Forest 31

3.1. MAPE formula 36

3.2. SMAPE formula 36