Abstract

Most of the data that people obtain from real-world information systems is dirty. Data cleaning methods that aim to improve accuracy usually handle observations containing small amounts of noise well. However, little research addresses how to extract valuable data when the amount of noise is large and measurement system errors flood the observations with outliers. To fill this gap, we propose a data distillation framework based on multiple clustering for separating and extracting real data. Data distillation is a procedure for separating and extracting true data from noisy data. It has two stages: data feature decomposition ("data evaporation") and feature clustering ("data condensation"). The "data evaporation" module combines a Gaussian mixture model with Isolation Forest noise reduction and extracts data features through an iterative EM process. The "data condensation" module re-clusters the extracted values with the OPTICS clustering algorithm, sieving out outliers to select the target values. The experiments used 113 datasets collected by sensors and validated by manual measurements. The results show that the method is feasible in practice. Finally, we summarize the data distillation framework and suggest future research directions. The main findings are as follows: (i) setting critical values removes noise that obviously violates the rules of the scene; (ii) outlier screening with Isolation Forest is highly robust and screens outliers effectively; (iii) in theory, model accuracy increases with the number of components in the Gaussian mixture model, and a smaller BIC value indicates a better model; however, when the BIC value does not vary significantly, the model with lower complexity is preferable. Comparing the BIC values for the first hundred component counts in this experiment shows that 15 to 20 GMM components is the most suitable range; (iv) filling null values with MissForest does not change the initial distribution of the data and preserves the timeliness and validity of the data analysis; (v) the dual clustering methods based on the Gaussian mixture model and on OPTICS significantly narrow the range of feature values; and (vi) to determine the true value more accurately, the feature value with the highest weight is chosen.
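The selection rule in finding (iii) can be illustrated with a minimal sketch: compute the BIC for each candidate component count, then prefer the smallest count whose BIC is close to the best one. The function names, the relative tolerance `rel_tol`, and the example BIC values below are hypothetical and chosen only for illustration; they are not the values or the exact procedure used in the experiments.

```python
import math


def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion; lower values indicate a better model."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood


def gmm_param_count(n_components, n_features):
    """Number of free parameters of a full-covariance Gaussian mixture."""
    means = n_components * n_features
    covariances = n_components * n_features * (n_features + 1) // 2
    weights = n_components - 1  # mixture weights sum to one
    return means + covariances + weights


def pick_component_count(bic_by_k, rel_tol=0.01):
    """Return the smallest component count whose BIC lies within
    rel_tol (an assumed threshold) of the lowest BIC observed."""
    best = min(bic_by_k.values())
    slack = abs(best) * rel_tol
    for k in sorted(bic_by_k):
        if bic_by_k[k] <= best + slack:
            return k


# Made-up BIC values keyed by component count, for illustration only.
example_bic = {5: 1200.0, 10: 1050.0, 15: 1004.0, 20: 1000.0, 25: 1003.0}
chosen = pick_component_count(example_bic, rel_tol=0.01)
print(chosen)
```

With `rel_tol=0` the rule reduces to choosing the count with the minimum BIC; a positive tolerance encodes the preference for lower model complexity when the BIC curve is nearly flat.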