Building a Genome Assembly Pipeline for Different Heterozygosity Intervals Based on Snakemake
黄艺伟 (Huang Yiwei)
Thesis Advisor: 杨俊波 (Yang Junbo)
Keywords: Snakemake, Genome Assembly, Heterozygosity, Cymbidium
Abstract: With the rapid development of sequencing technology, plant genome research has entered the era of pan-genomics and of highly accurate, highly complete genome assemblies. Resolved genome assemblies provide an essential foundation for studying genetic and phenotypic variation in plants. However, because of large data volumes, diverse data types, and large differences among study species, plant genome assembly often runs into problems with software installation, parameter optimization, numerous processing steps, and long computing times. Using sequencing data from three plant genomes of differing complexity (genome size, heterozygosity, etc.), this study used Snakemake to build an automated genome assembly scheme with heterozygosity as the main distinguishing parameter, and thereby initially established a complete plant genome assembly pipeline. The main results are as follows:

More than 30 commonly used open-source tools for genome assembly tasks, covering data quality control, genome heterozygosity analysis, draft assembly, sequence polishing, removal of redundant sequences, chromosome anchoring, and assembly quality assessment, were integrated into the environment management platform Singularity, and an image file was created to package them as a whole. With the packaged Singularity image, genome assembly requires no software installation or environment configuration and can be completed with one click.

To evaluate more accurately how heterozygosity affects genome assembly, heterozygosity was subdivided into three intervals: below 0.5%, 0.5-1.0%, and above 1.0%. A different assembly strategy (a different software configuration and parameter combination) was used in each interval. Each strategy was built as a Snakemake workflow: the software needed during assembly was converted into the file format that Snakemake calls at run time, and Python was used to handle input and output dependencies.

The pipeline was tested and evaluated with Arabidopsis thaliana (heterozygosity 0.18%), Cymbidium macrorhizon (0.65%), and C. lancifolium (2.8%). For A. thaliana, the pipeline produced a 119 Mb assembly with a contig N50 of 12.5 Mb from Nanopore data and a 122 Mb assembly with a contig N50 of 13.0 Mb from PacBio data; both sets of assembly metrics were better than those reported in the original publication. For C. macrorhizon, the MECAT2-based workflow produced a 3.43 Gb assembly with a contig N50 of 1.37 Mb and a BUSCO score of 94.6%, better than the result obtained with Wtdbg2. For C. lancifolium, the Hifiasm-based workflow produced a 4.55 Gb assembly with an N50 of 5.31 Mb and a BUSCO score of 97.2%, the highest N50 among all published Cymbidium genomes.

In summary, this study used Snakemake to build a plant genome assembly pipeline suited to different data types and different levels of genome heterozygosity, and used Singularity to modularize and package the workflow as a whole. This greatly simplifies the assembly process and offers a scalable, automated solution for rapidly constructing plant genome assemblies.
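The abstract says the pipeline picks an assembly strategy from one of three heterozygosity intervals. The following is a minimal Python sketch of that selection step only, not the thesis's actual code. The function name `heterozygosity_interval` and the labels `low`, `medium`, and `high` are hypothetical. Putting exactly 0.5% and exactly 1.0% in the middle interval is an assumption drawn from the wording "lower than 0.5%" and "higher than 1.0%".

```python
import math


def heterozygosity_interval(percent: float) -> str:
    """Map an estimated heterozygosity (in percent) to one of three intervals.

    Interval bounds follow the abstract: below 0.5%, 0.5-1.0%, above 1.0%.
    The labels and the handling of the exact boundary values are
    illustrative assumptions, not taken from the thesis.
    """
    if math.isnan(percent) or percent < 0:
        raise ValueError(f"heterozygosity must be a non-negative number, got {percent!r}")
    if percent < 0.5:
        return "low"
    if percent <= 1.0:
        return "medium"
    return "high"


# Heterozygosity values reported in the abstract for the three test species.
samples = {
    "Arabidopsis thaliana": 0.18,
    "Cymbidium macrorhizon": 0.65,
    "Cymbidium lancifolium": 2.8,
}

for species, value in samples.items():
    print(f"{species}: {value}% -> {heterozygosity_interval(value)}")
```

In a Snakemake workflow, a label like this could be read from the configuration to decide which set of rules to run, since a Snakefile accepts ordinary Python. How the thesis wires this decision into its rules is not described in the abstract.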
Language: Chinese
2022-05
Degree Grantor: University of Chinese Academy of Sciences
Document Type: Thesis
Identifier: http://ir.kib.ac.cn/handle/151853/75090
Collection: Master's and Doctoral Theses of Kunming Institute of Botany
Recommended Citation
GB/T 7714
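Huang Yiwei. Building a Genome Assembly Pipeline for Different Heterozygosity Intervals Based on Snakemake [D]. University of Chinese Academy of Sciences, 2022.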
Files in This Item:
File Name/Size: 黄艺伟-黄艺伟硕士学位论文0531897 (11168 KB)
DocType: Thesis
Access: Restricted
License: CC BY-NC-SA
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.