A Snakemake-Based Genome Assembly Pipeline for Different Heterozygosity Intervals
Huang Yiwei (黄艺伟)
Supervisor: Yang Junbo (杨俊波)
Keywords: Snakemake, Genome Assembly, Heterozygosity, Cymbidium
Abstract: With the rapid development of sequencing technology, plant genome research has entered the era of pan-genomics and of high-accuracy, highly complete genome assemblies. Resolved genome assemblies provide an essential foundation for studying genetic and phenotypic variation in plants. However, plant genome assembly often involves large data volumes, diverse data types, and large differences among study species, which lead to problems such as cumbersome software installation and environment configuration, parameter optimization, numerous analysis steps, and long computation times. In this study, based on sequencing data from three plant genomes of differing complexity (genome size, heterozygosity, etc.), Snakemake was used to build an automated genome assembly scheme with heterozygosity as the main distinguishing parameter, establishing a preliminary but complete plant genome assembly pipeline. The main results are as follows:

More than 30 commonly used open-source tools for genome assembly tasks, covering data quality control, genome heterozygosity analysis, initial assembly, sequence polishing, removal of redundant sequences, chromosome anchoring and scaffolding, and assembly quality assessment, were integrated into the environment management platform Singularity and packaged as a whole into an image file. With the packaged Singularity image, users need no software installation or environment configuration and can complete a genome assembly in a single step.

To evaluate more accurately how heterozygosity affects genome assembly, heterozygosity was divided into three intervals: below 0.5%, 0.5-1.0%, and above 1.0%. A different assembly strategy (a different combination of software configurations and parameters) was used in each interval. Each strategy was built as a Snakemake workflow: the software required during assembly was converted into files that Snakemake can call, and input and output dependencies were handled in Python.

The pipeline was tested and evaluated with Arabidopsis thaliana (heterozygosity 0.18%), Cymbidium macrorhizon (0.65%), and Cymbidium lancifolium (2.8%). For A. thaliana, the Nanopore assembly was 119 Mb with a contig N50 of 12.5 Mb, and the PacBio assembly was 122 Mb with a contig N50 of 13.0 Mb; both sets of metrics were better than those reported in the original publication. For C. macrorhizon, the MECAT2-based workflow produced a 3.43 Gb assembly with a contig N50 of 1.37 Mb and a BUSCO score of 94.6%, better than the result obtained with Wtdbg2. For C. lancifolium, the Hifiasm-based workflow produced a 4.55 Gb assembly with an N50 of 5.31 Mb and a BUSCO score of 97.2%, the highest N50 among all published Cymbidium genomes.

In summary, this study used Snakemake to build a plant genome assembly pipeline suited to different data types and different levels of genome heterozygosity, and used Singularity to modularize and package the analysis workflow as a whole. This greatly simplifies the assembly process and offers a scalable, automated solution for rapidly constructing plant genome assemblies.
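The three heterozygosity intervals named in the abstract can be illustrated with a minimal Python sketch. The function name, the interval labels ("low", "medium", "high"), and the treatment of the exact boundary values (0.5% and 1.0% are placed in the middle interval here) are assumptions made for illustration; the record does not state how the thesis handles the boundaries or how its Snakemake rules consume this value.

```python
def heterozygosity_interval(het_percent: float) -> str:
    """Classify a heterozygosity estimate (in percent) into one of three intervals.

    Assumption: 0.5% and 1.0% both fall into the middle interval; the
    abstract only lists "below 0.5%", "0.5-1.0%", and "above 1.0%".
    """
    if het_percent < 0:
        raise ValueError("heterozygosity must be non-negative")
    if het_percent < 0.5:
        return "low"      # below 0.5%
    if het_percent <= 1.0:
        return "medium"   # 0.5-1.0%
    return "high"         # above 1.0%


# Heterozygosity values reported in the abstract for the three test species.
SPECIES = {
    "Arabidopsis thaliana": 0.18,
    "Cymbidium macrorhizon": 0.65,
    "Cymbidium lancifolium": 2.8,
}

if __name__ == "__main__":
    for name, het in SPECIES.items():
        print(f"{name}: {het}% -> {heterozygosity_interval(het)}")
```

In a Snakemake workflow, a helper of this kind could be called from the Snakefile to decide which assembly targets to request, but that wiring is a design suggestion here, not something the abstract describes.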
Language: Chinese
2022-05
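Degree-granting institution: University of Chinese Academy of Sciences
Document type: Thesis
Item identifier: http://ir.kib.ac.cn/handle/151853/75090
Collection: Kunming Institute of Botany master's and doctoral theses
Recommended citation
GB/T 7714
Huang Yiwei. A Snakemake-Based Genome Assembly Pipeline for Different Heterozygosity Intervals[D]. University of Chinese Academy of Sciences, 2022.
Files in this item
File name/size: 黄艺伟-黄艺伟硕士学位论文0531897 (11168 KB)
Document type: Thesis
Access type: Restricted access
License: CC BY-NC-SA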
Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.