Title Page
Abstract
Contents
Chapter 1. General introduction 16
1.1. Advancing error-free genome assembly 17
1.2. Structural error made by assembly artifacts 18
1.3. Challenges of false duplication 19
Chapter 2. Widespread false gene gains caused by duplication errors in genome assemblies 22
2.1. Abstract 23
2.2. Introduction 24
2.3. Materials and Methods 28
2.3.1. Assemblies and read data 28
2.3.2. Identifying false duplications 31
2.3.3. Evaluating false duplications 36
2.3.4. Identification of false gene gain annotation errors 37
2.3.5. False duplication correction using the VGP pipeline v1.7 39
2.3.6. Duplicated k-mers in different hummingbird assembly approaches 41
2.3.7. Gene ontology enrichment test for falsely duplicated genes 42
2.3.8. False duplications in emu assemblies 43
2.4. Results 44
2.4.1. Genome assemblies and identifying false duplications 44
2.4.2. False duplications in previous and VGP assemblies 50
2.4.3. High heterozygosity and sequencing errors associated with false duplications 56
2.4.4. False duplications cause false annotation errors 63
2.4.5. Specific categories of genes have higher levels of false duplications 80
2.4.6. False duplication and annotation errors remaining in VGP assemblies 85
2.4.7. Specific partitions of the genome with greater false duplications 100
2.4.8. Assembly methods to minimize false duplication 104
2.5. Discussion 111
Chapter 3. Automated HiFi-Based Genome Assemblies Reveal Lower Assembly Errors than Current Long-Read-Based Assembly 115
3.1. Abstract 116
3.2. Introduction 118
3.3. Materials and Methods 121
3.3.1. Used data 121
3.3.2. Read mapping and coverage calculation 121
3.3.3. K-mer counting and K* calculation 122
3.3.4. False duplication and loss identification 123
3.4. Results and Discussion 126
3.4.1. K-mer profiles of CLR and HiFi-based assemblies 126
3.4.2. The amount of structural assembly errors in CLR and HiFi-based assemblies 129
3.4.3. Assembly errors in the current reference genome and the optimal strategy for assembly 133
Chapter 4. Purge mers: A New False Duplication Curation Tool Based on Sequencing Read and K-mers for Diploid Genome Assembly 136
4.1. Abstract 137
4.2. Introduction 138
4.3. Materials and Methods 141
4.3.1. Generating simulation data 141
4.3.2. Parameter estimation 145
4.3.3. False duplication identification 146
4.3.4. Performance assessment 149
4.4. Results 152
4.4.1. Model parameter estimation 152
4.4.2. False duplication candidate identification algorithm 156
4.4.3. Simulation statistics 157
4.4.4. Performance assessment 164
4.5. Discussion 169
Chapter 5. A K-mer Counting Method Minimizing GC bias in Sequencing Reads 173
5.1. Abstract 174
5.2. Introduction 175
5.3. Materials and Methods 178
5.3.1. Bias function estimation 178
5.3.2. Used data 180
5.3.3. K* calculation 181
5.4. Results and Discussion 183
5.4.1. Bias function estimation 183
5.4.2. K* distributions along GC proportions 185
General discussion 189
References 192
Abstract in Korean 226
Table 2.1. Statistics of previous and VGP assemblies. Contig NG50 and Scaffold NG50 for each assembly were calculated using source code from the VGP repository. 30
Table 2.2. False duplication statistics in previous and VGP assemblies. 55
Table 2.3. Mis-annotations caused by false duplications in both previous and VGP assemblies. 69
Table 2.4. False duplication of V1R family genes in the previous platypus assembly. 76
Table 2.5. False duplications on transposable elements in previous assemblies. 79
Table 2.6. Gene ontology enrichment analysis for the false gene gains, false chimeric gains and false exon gains in previous assemblies. 83
Table 2.7. Reduction of false duplications in the reassembled bTaeGut1.4 zebra finch genome with the VGP v1.7 pipeline. 99
Table 2.8. Proportion of k-mer duplication measured for each assembly strategy. 107
Table 4.1. Statistics of original zebra finch and human assemblies in this study. 143
Table 4.2. Statistics of simulation data. 160
Table 4.3. The amount of false duplication in each assembly calculated by each sequencing technology. 166
Figure 2.1. Unsupported sequences with or without assembly gaps. 34
Figure 2.2. Overview of false duplication identification. 48
Figure 2.3. Depth-coverage profiling of all assemblies. 52
Figure 2.4. K-mer profiling for all assemblies. 53
Figure 2.5. The amount of false duplication and factors that correlate with false duplication. 58
Figure 2.6. The presence of a gap and discordant reads between false duplications. 60
Figure 2.7. True duplication in a VGP assembly. 10X linked reads are shown as paired read alignments above the PacBio CLR read... 61
Figure 2.8. Mis-annotations due to false duplications. 67
Figure 2.9. The genome landscape of false gene gains. 70
Figure 2.10. Cases of false gene gain annotations in the prior hummingbird and platypus assemblies. 73
Figure 2.11. Genome landscape of platypus assembly false duplications using Sanger reads. 74
Figure 2.12. Additional findings for the genome false duplication landscape of the ADAMTS13-like gene. 75
Figure 2.13. Gene ontology enrichment analysis of falsely duplicated genes. 82
Figure 2.14. Heterozygosity of ATP-binding genes with or without false duplications. 84
Figure 2.15. False duplications left in VGP assemblies. 88
Figure 2.16. Chromosomal location of false duplications in the VGP assemblies. 90
Figure 2.17. False duplications and their correction in the VGP zebra finch assembly. 92
Figure 2.18. Example cases of false duplications in the VGP assemblies. 94
Figure 2.19. Correction of the NPNT gene in VGP v1.7 pipeline assembly. 98
Figure 2.20. Proportions of genomic partitions represented among the falsely duplicated regions. 101
Figure 2.21. Difference of proportion of each genomic partition containing false duplications relative to expected frequency. 102
Figure 2.22. The genome landscape of false duplications in emu assemblies. 108
Figure 2.23. K-mer profiling for emu assemblies. a, Previous short read based assembly. 110
Figure 3.1. K-mer evaluation of zebra finch assemblies made by PacBio CLR and HiFi reads. 128
Figure 3.2. Amount of false duplication and losses in zebra finch assemblies made by PacBio CLR and HiFi reads. 130
Figure 3.3. Number of genes affected by false duplication (a) and losses (b). 132
Figure 3.4. K-mer evaluation between bTaeGut2 and bTaeGut1.4. 134
Figure 3.5. Genome characteristics profiles of the zebra finch assemblies bTaeGut2 (a) and bTaeGut1.4 (b), estimated with GenomeScope. 135
Figure 4.1. Overview of identifying false duplication on both read coverage and K*. 153
Figure 4.2. Genome characteristics of zebra finch (a) and human (b) assemblies estimated by GenomeScope2. 158
Figure 4.3. K-mer profiles of simulated assemblies. 161
Figure 4.4. Bivariate distributions of read coverage and K* of each zebra finch and human assembly. 163
Figure 4.5. The proportion of false duplications and performance assessment. 167
Figure 5.1. Mean depth of coverage and bias function along GC proportions for a, zebra finch, and b, human assemblies. 184
Figure 5.2. K* distribution across GC proportion categories. 188