Title Page
Contents
Abstract 11
I. Introduction 13
1.1. Research Background and Significance 13
1.2. Important concepts and application status 15
1.2.1. Data pre-processing 15
1.2.2. Data Cleaning 16
1.3. Research content and organization of the thesis 17
1.3.1. Main research content of the thesis 17
1.3.2. Thesis Structure 19
1.4. Summary of this chapter 19
II. Principles and knowledge of machine learning and core algorithms 20
2.1. Machine Learning 20
2.2. Unsupervised Machine Learning 21
2.3. Application of unsupervised algorithms in anomalous data detection 22
2.4. Introduction to the unsupervised algorithms in this paper 22
2.4.1. Gaussian mixture model 22
2.4.2. OPTICS (Ordering points to identify the clustering structure) 24
2.4.3. MissForest 27
2.4.4. IsolationForest 29
2.5. Summary of this chapter 31
III. Data distillation: a solution for separating trustworthy data from massive amounts of unreliable data 32
3.1. Brief description of data distillation 32
3.2. Basic ideas and principles of data distillation 33
3.2.1. The concept and principle of distillation 33
3.2.2. Processing framework and methods for data distillation 33
3.3. Data distillation model processing flow 34
3.3.1. Noise reduction processing of data based on critical values 34
3.3.2. Null-filling approach based on MissForest 35
3.3.3. Gaussian mixture model-based data modeling and decomposition 35
3.3.4. Isolation Forest-based outlier sieving 35
3.3.5. Average and OPTICS-based quadratic clustering 35
3.3.6. Evaluation of data distillation results based on Symmetric Mean Absolute Percentage Error (SMAPE) 36
3.4. Summary of this chapter 37
IV. Data distillation experiments and analysis 38
4.1. Experiment data 38
4.2. First-pass noise rejection based on lower-bound critical values 40
4.3. Missing value filling based on MissForest 41
4.4. GMM-based first round clustering 42
4.5. Isolation Forest-based outlier detection 44
4.6. Secondary clustering based on OPTICS 45
4.7. Summary and analysis of experimental results 46
4.8. Summary of this chapter 50
V. Summary and Outlook 51
5.1. Discovery and Innovation Points 51
5.2. Prospects 52
References 55
Achievements during the master's studies 58
Figure 1.1. Relationship between data distillation and other data processing concepts 14
Figure 1.2. Key elements of data pre-processing 15
Figure 1.3. Data cleansing steps 16
Figure 2.1. Image based on Gaussian distribution 24
Figure 2.2. OPTICS clustering algorithm flow chart 26
Figure 2.3. Example of an Isolation Forest model (outliers are marked in red) 30
Figure 3.1. Distillation experiment illustration 33
Figure 3.2. Schematic diagram of data distillation principle 34
Figure 3.3. Principle of critical value noise reduction 35
Figure 4.1. Test site, material and equipment 38
Figure 4.2. Raw data in csv file format 39
Figure 4.3. Time-weight pair data set 40
Figure 4.4. Scatter plot of raw data 41
Figure 4.5. Plot of missing data 42
Figure 4.6. The trend of BIC values under the four covariance types when k values are set to 100, 50 and 20, respectively 43
Figure 4.7. Two-dimensional Gaussian mixture model clustering results 44
Figure 4.8. Outliers filtered by Isolation Forest 45
Figure 4.9. Secondary clustering based on OPTICS 46
Figure 4.10. Comparison of the measured values (1) at 5-day intervals with the results calculated by this algorithm 47
Figure 4.11. Comparison of the measured values (2) at 5-day intervals with the results calculated by this algorithm 48
Figure 4.12. Comparison of the data containing only delivery day measurements (3-113) with the results calculated by this algorithm 49
Figure 4.13. Comparison of the delivery day measurements of the other 111 datasets with the results calculated by this algorithm 50
2.1. One-dimensional Gaussian distribution formula 23
2.2. Multi-dimensional Gaussian distribution formula 23
2.3. Gaussian mixture model formula 24
2.4. Gaussian mixture model component formula (same as 2.2) 24
2.5. MissForest iteration termination condition formula 29
2.6. Formula for calculating anomaly scores for Isolation Forest 31
3.1. MAPE formula 36
3.2. SMAPE formula 36