Contents
Summary (in Korean) 13
Abstract 15
Ⅰ. INTRODUCTION 17
1.1. Introduction 17
1.2. Motivation 18
1.3. Background of Software Evolution and Change Propagation 20
1.3.1. Software Evolution 20
1.3.2. Change Propagation 20
1.3.3. Change coupling 22
1.3.4. Code change 22
1.4. Recommendation System 22
1.5. Research Questions 27
1.6. Research Objectives 28
1.7. Research Contributions 29
1.7.1. FCP2Vec: File-level Change Propagation to Vector 30
1.7.2. FCP2BERT: File-level Change Propagation to Bidirectional Encoder Representations from Transformers 31
1.7.3. FCP2DGNN: File-level change propagation to Dynamic Graph Neural Network 32
1.8. Required data-gathering methods 34
1.9. Organization of the dissertation 35
Ⅱ. FCP2Vec: Deep learning-based approach to software change prediction by learning co-changing patterns from change logs 38
Abstract 38
2.1. Introduction 39
2.2. Materials and Methods 44
2.2.1. Leveraging changelog data in change propagation 45
2.2.2. Neural language model 51
2.3. Background 54
2.3.1. Problem definition 54
2.3.2. Word2vec 57
2.4. Proposed Approach 63
2.4.1. Data Preparation 63
2.4.2. Data processing 65
2.4.3. Learning d-dimensional element representation 67
2.4.4. Evaluation Metrics 69
2.4.5. Hyperparameters 71
2.5. Empirical Studies 75
2.5.1. Transaction data 76
2.5.2. System Environments 77
2.6. Result and Discussion 79
2.6.1. Comparison between FCP2Vec and DN 89
2.7. Implications for practical use 93
Ⅲ. FCP2BERT: A BERT-Based Sequential Recommendation System for Effective Change Propagation Prediction in Large Software Systems 95
Abstract 95
3.1. Introduction 96
3.2. Literature review 102
3.2.1. Leveraging changelog data in change propagation 102
3.2.2. Distributional representation 106
3.3. Background 111
3.3.1. Problem definition 111
3.3.2. Bidirectional Encoder Representations from Transformers (BERT) 113
3.4. Proposed methods 116
3.4.1. Data preprocessing 116
3.4.2. Learning Element Representation 119
3.4.3. Evaluation Metrics 121
3.4.4. Hyperparameters 123
3.5. Empirical studies 126
3.5.1. Transaction data 127
3.5.2. System environment 127
3.5.3. Results and discussion 128
3.6. Implications for practical use 131
Ⅳ. Dynamic Graph Neural Network-Based Sequential Recommendation System for Effective Change Propagation Prediction in Large Software Systems 134
Abstract 134
4.1. Introduction 135
4.2. Literature review 142
4.2.1. GNN and sequential recommendation system 143
4.3. Background 147
4.3.1. Problem definition 147
4.3.2. Graph Neural Network 148
4.4. Proposed method 149
4.4.1. Data preparation 151
4.4.2. Developer sequence generation 152
4.4.3. Dynamic graph and sub-graph sampling 153
4.4.4. Data operation 153
4.4.5. Learning graph data representation 154
4.4.6. Evaluation metrics 156
4.4.7. Experiment setup 157
4.5. Empirical studies 158
4.5.1. Transaction data 159
4.5.2. Results 160
4.6. Implications for practical use 160
Ⅴ. Comparison Results 163
5.1. Comparison between FCP2BERT and FCP2Vec 163
5.1.1. Change prediction performance 163
5.1.2. Computation efficiency 164
5.2. Comparison between FCP2BERT and FCP2DGNN 165
5.2.1. Change prediction performance 166
5.2.2. Computation efficiency 167
5.3. General discussion 168
Ⅵ. Conclusion 174
6.1. Conclusions 174
6.2. Summary and contributions 177
6.3. Research directions 180
References 181
Table 1. Applications of recommendation systems in different domains 24
Table 2. Comprehensive overview of machine learning and deep learning in different domains. 46
Table 3. Exploring the versatility of word embeddings: Some uses in various domains. 52
Table 4. An example of a set of SG pairs of (target word, context word) where the context word appears in the neighboring context of the target word. 59
Table 5. Status of files before and after processing data. 65
Table 6. Optimal hyperparameters for each dataset at both file and package levels. The search method employed encompasses uniform and log-uniform. In the uniform method, integers are sampled uniformly between the lower and upper limits. In the log-uniform method, integers are... 73
Table 7. Utility information about development changelog datasets. This table provides the Unique Counts Before (UCB) and After (UCA) processing data for each dataset. 78
Table 8. Evaluation performance using normalized discounted cumulative gain (NDCG) and hit ratio (HR) of the FCP2Vec model in multiple trial scenarios. Trial I results from the default parameter values for an estimator and Trial II results from Bayesian optimization over hyperparameters using... 80
Table 9. Evaluation performance of each study period using the normalized discounted cumulative gain (NDCG) and the hit ratio (HR) of the FCP2Vec model on the Vuze dataset at file level. All Trial II results are from Bayesian optimization over hyperparameters. Based on the split ratio, A=90:5:5, B=... 81
Table 10. A sample of random results from the Vuze dataset at the file level, employing single-input queries and group queries with the highest probability score as output. The outcomes are organized in descending order, accompanied by a probability score for each element. 87
Table 11. Exploring the versatility of BERT: Some uses in various domains. 108
Table 12. Optimal hyperparameters for the BERT-based models on HR@10 and NDCG@K. A: SASRec, B: FCP2BERT and C: FCP2BERT+. Random (R), Frequently changed (F) 125
Table 13. Utility information about Vuze development changelog dataset. 127
Table 14. Evaluation performance using normalized discounted cumulative gain (NDCG) and hit ratio (HR) of the different models. SE: Number of segments, ST: Session Token, TE: temporal dimension 130
Table 15. Exploring the versatility of GNNs: Some application areas in various domains. 144
Table 16. Optimal hyperparameters for FCP2DGNN on HR@10. 158
Table 17. Utility information about Vuze development changelog dataset 159
Figure 1. An example of a historical changelog. 55
Figure 2. Architecture of the Word2vec training model. (a) CBOW and (b) SG model architectures. 57
Figure 3. The SG model considers the probability of predicting the surrounding context words given a center word. 60
Figure 4. Overall framework for predicting CP using the architecture of FCP2Vec model: word2vec skip-gram with negative sampling model (M1) and unsupervised nearest neighbor model (M2). 64
Figure 5. Sample of data from the Vuze raw historical development changelog. 66
Figure 6. Ascending order of transactional changelogs spanning Vuze (2003-07-10~2014-08-02), Spring framework (2008-07-11~2023-05-03) and Elasticsearch (2010-02-08~2023-05-05). 75
Figure 7. The two-dimensional vector representation of Vuze change logs at file level in the vector space. The denser the dot cluster, the more files are close together and the more... 84
Figure 8. A randomly selected data point from the two-dimensional vector representation of Vuze change logs at file level in the vector space. The closer the dot cluster is, the closer the files are to each other and the more frequently they changed together in the past. 85
Figure 9. Comparative hit ratios of change prediction for dependency network (DN) and FCP2Vec in various scenarios at package level. 90
Figure 10. Computational time between different years and split ratios for Vuze datasets at file level: (a) shows optimization elapsed time and (b) compares training elapsed time based on... 92
Figure 11. Sequential recommendation architecture: SASRec (a), BERT4Rec (b) and BERT4Rec+ (c): modified version of BERT4Rec in the input layer from model architecture... 110
Figure 12. An example of a historical changelog. 112
Figure 13. Overall framework for predicting change propagation using FCP2BERT. 117
Figure 14. Sample data from the Vuze raw historical development changelog. 117
Figure 15. Most frequently changed files. 123
Figure 16. Ascending order of transactional changelogs spanned between 2003-07-10~2014-08-02. 126
Figure 17. An example of time-variant graph 139
Figure 18. Overall framework for predicting change propagation using FCP2DGNN 150
Figure 19. Sample committed co-change data point from Vuze dataset 151
Figure 20. Ascending order of transactional changelogs spanned between 2003-07-10~2014-08-02. 159
Figure 21. Top comparative HR@10 and NDCG@10 of change prediction recommendation for the FCP2Vec and FCP2BERT models at file level. 164
Figure 22. Performance comparison of FCP2DGNN and FCP2BERT methods in terms of HR@10 and NDCG@10 at file level. 166
Figure 23. A simple web program that makes use of the domain class and a configuration file. 170
Figure 24. Raw data example of domain class and configuration file co-changes. 171