Table of Contents

Title Page

Abstract

Contents

Chapter 1. General introduction 16

1.1. Advancing error-free genome assembly 17

1.2. Structural error made by assembly artifacts 18

1.3. Challenges of false duplication 19

Chapter 2. Widespread false gene gains caused by duplication errors in genome assemblies 22

2.1. Abstract 23

2.2. Introduction 24

2.3. Materials and Methods 28

2.3.1. Assemblies and read data 28

2.3.2. Identifying false duplications 31

2.3.3. Evaluating false duplications 36

2.3.4. Identification of false gene gain annotation errors 37

2.3.5. False duplication correction using the VGP pipeline v1.7 39

2.3.6. Duplicated k-mers in different hummingbird assembly approaches 41

2.3.7. Gene ontology enrichment test for falsely duplicated genes 42

2.3.8. False duplications in emu assemblies 43

2.4. Results 44

2.4.1. Genome assemblies and identifying false duplications 44

2.4.2. False duplications in previous and VGP assemblies 50

2.4.3. High heterozygosity and sequencing errors associated with false duplications 56

2.4.4. False duplications cause false annotation errors 63

2.4.5. Specific categories of genes have higher levels of false duplications 80

2.4.6. False duplication and annotation errors remaining in VGP assemblies 85

2.4.7. Specific partitions of the genome with greater false duplications 100

2.4.8. Assembly methods to minimize false duplication 104

2.5. Discussion 111

Chapter 3. Automated HiFi-Based Genome Assemblies Reveal Lower Assembly Errors than Current Long-Read-Based Assembly 115

3.1. Abstract 116

3.2. Introduction 118

3.3. Materials and Methods 121

3.3.1. Used data 121

3.3.2. Read mapping and coverage calculation 121

3.3.3. K-mer counting and K* calculation 122

3.3.4. False duplication and loss identification 123

3.4. Results and Discussion 126

3.4.1. K-mer profiles of CLR and HiFi-based assemblies 126

3.4.2. The amount of structural assembly errors in CLR and HiFi-based assemblies 129

3.4.3. Assembly errors in the current reference genome and the optimal strategy for assembly 133

Chapter 4. Purge mers: A New False Duplication Curation Tool Based on Sequencing Read and K-mers for Diploid Genome Assembly 136

4.1. Abstract 137

4.2. Introduction 138

4.3. Materials and Methods 141

4.3.1. Generating simulation data 141

4.3.2. Parameter estimation 145

4.3.3. False duplication identification 146

4.3.4. Performance assessment 149

4.4. Results 152

4.4.1. Model parameter estimation 152

4.4.2. False duplication candidate identification algorithm 156

4.4.3. Simulation statistics 157

4.4.4. Performance assessment 164

4.5. Discussion 169

Chapter 5. A K-mer Counting Method Minimizing GC bias in Sequencing Reads 173

5.1. Abstract 174

5.2. Introduction 175

5.3. Materials and Methods 178

5.3.1. Bias function estimation 178

5.3.2. Used data 180

5.3.3. K* calculation 181

5.4. Results and Discussion 183

5.4.1. Bias function estimation 183

5.4.2. K* distributions along GC proportions 185

General discussion 189

References 192

Abstract in Korean 226

Table 2.1. Statistics of previous and VGP assemblies. Contig NG50 and Scaffold NG50 for each assembly were calculated using source code from the VGP repository. 30

Table 2.2. False duplication statistics in previous and VGP assemblies. 55

Table 2.3. Mis-annotations caused by false duplications in both previous and VGP assemblies. 69

Table 2.4. False duplication of V1R family genes in the previous platypus assembly. 76

Table 2.5. False duplications on transposable elements in previous assemblies. 79

Table 2.6. Gene ontology enrichment analysis for the false gene gains, false chimeric gains and false exon gains in previous assemblies. 83

Table 2.7. Reduction of false duplications in the reassembled bTaeGut1.4 zebra finch genome with the VGP v1.7 pipeline. 99

Table 2.8. Proportion of k-mer duplication measured for each assembly strategy. 107

Table 4.1. Statistics of original zebra finch and human assemblies in this study. 143

Table 4.2. Statistics of simulation data. 160

Table 4.3. The amount of false duplication in each assembly calculated by each sequencing technology. 166

Figure 2.1. Unsupported sequences with or without assembly gaps. 34

Figure 2.2. Overview of false duplication identification. 48

Figure 2.3. Depth-coverage profiling of all assemblies. 52

Figure 2.4. K-mer profiling for all assemblies. 53

Figure 2.5. The amount of false duplication and factors that correlate with false duplication. 58

Figure 2.6. The presence of a gap and discordant reads between false duplications. 60

Figure 2.7. True duplication in a VGP assembly. 10X linked reads are shown as paired read alignments above the PacBio CLR read... 61

Figure 2.8. Mis-annotations due to false duplications. 67

Figure 2.9. The genome landscape of false gene gains. 70

Figure 2.10. Cases of false gene gain annotations in the prior hummingbird and platypus assemblies. 73

Figure 2.11. Genome landscape of platypus assembly false duplications using Sanger reads. 74

Figure 2.12. Additional findings for the genome false duplication landscape of the ADAMTS13-like gene. 75

Figure 2.13. Gene ontology enrichment analysis of falsely duplicated genes. 82

Figure 2.14. Heterozygosity of ATP-binding genes with or without false duplications. 84

Figure 2.15. False duplications left in VGP assemblies. 88

Figure 2.16. Chromosomal location of false duplications in the VGP assemblies. 90

Figure 2.17. False duplications and their correction in the VGP zebra finch assembly. 92

Figure 2.18. Example cases of false duplications in the VGP assemblies. 94

Figure 2.19. Correction of the NPNT gene in VGP v1.7 pipeline assembly. 98

Figure 2.20. Proportions of genomic partitions represented among the falsely duplicated regions. 101

Figure 2.21. Difference of proportion of each genomic partition containing false duplications relative to expected frequency. 102

Figure 2.22. The genome landscape of false duplications in emu assemblies. 108

Figure 2.23. K-mer profiling for emu assemblies. a, Previous short-read-based assembly. 110

Figure 3.1. K-mer evaluation of zebra finch assemblies made by PacBio CLR and HiFi reads. 128

Figure 3.2. Amount of false duplication and losses in zebra finch assemblies made by PacBio CLR and HiFi reads. 130

Figure 3.3. Number of genes affected by false duplication (a) and losses (b). 132

Figure 3.4. K-mer evaluation between bTaeGut2 and bTaeGut1.4. 134

Figure 3.5. Genome characteristics profiles of the zebra finch assemblies bTaeGut2 (a) and bTaeGut1.4 (b), estimated with GenomeScope. 135

Figure 4.1. Overview of identifying false duplication based on both read coverage and K*. 153

Figure 4.2. Genome characteristics of zebra finch (a) and human (b) assemblies estimated by GenomeScope2. 158

Figure 4.3. K-mer profiles of simulated assemblies. 161

Figure 4.4. Bivariate distributions of read coverage and K* of each zebra finch and human assembly. 163

Figure 4.5. The proportion of false duplications and performance assessment. 167

Figure 5.1. Mean depth coverage and bias function along GC proportions for a, zebra finch, and b, human assemblies. 184

Figure 5.2. K* distribution across GC proportion categories. 188