Using advanced indexing strategies for genome reconstruction from metagenomic data

Sivakumar, Srinivas

Please use this identifier to cite or link to this item: http://dspace.cityu.edu.hk/handle/2031/9351

Full metadata record

DC Field	Value	Language
dc.contributor.author	Sivakumar, Srinivas	en_US
dc.date.accessioned	2020-11-17T09:36:24Z	-
dc.date.available	2020-11-17T09:36:24Z	-
dc.date.issued	2020	en_US
dc.identifier.other	2020eess977	en_US
dc.identifier.uri	http://dspace.cityu.edu.hk/handle/2031/9351	-
dc.description.abstract	DNA sequencing machines cannot sequence entire genomes, but they can sequence short fragments of DNA strands, known as 'reads.' Next Generation Sequencing (NGS) provides large numbers of reads from a sample. The broader problem addressed by this project is detecting the presence of a genome within metagenomic data, for example, detecting HIV in a human sample. The two approaches to finding genomes from a set of reads are read alignment and de novo assembly. Alignment is the process of matching each read against a reference genome. This is inherently a string-matching problem, where the query is the read and the text is the genome. De novo assembly, or overlap assembly, on the other hand, is the process of assembling a genome without a reference: reads are joined together wherever the suffix of one read matches the prefix of another. The specific objective of this project is to identify and 'remove' all bacterial genomes from viral metagenomic data. Once the bacterial reads have been removed from the sequencing data, both new and known viruses can be recognized. Because most bacteria do not have complete reference genomes available, the project combines the two strategies. First, the reads are mapped to 16S rRNA sequences, which are present in all types of bacteria, using Bowtie2 (Langmead B, 2012) or BWA (Li & Durbin, 2010), both efficient read-alignment tools. The output of the alignment is a subset of reads mapped to the reference, referred to as the 'seed.' The seed is then extended iteratively using overlap extension with the remaining reads. Serial and multi-threaded implementations were developed using base code from SGA (Jared T. Simpson, June 2010), as it is an open-source project. The backbone of both implementations is the Burrows-Wheeler Transform (Burrows & Wheeler, 1994) and the FM index (Ferragina & Manzini, 2000). The effectiveness and correctness of the program were checked using simulated sequencing data.
A scalable and memory-efficient implementation was developed using SGA’s codebase.	en_US
dc.rights	This work is protected by copyright. Reproduction or distribution of the work in any format is prohibited without written permission of the copyright owner.	en_US
dc.rights	Access is restricted to CityU users.	en_US
dc.title	Using advanced indexing strategies for genome reconstruction from metagenomic data	en_US
dc.contributor.department	Department of Electrical Engineering	en_US
dc.description.supervisor	Supervisor: Dr. Sun, Yanni; Assessor: Prof. Chow, Tommy W S	en_US
Appears in Collections:	Electrical Engineering - Undergraduate Final Year Projects
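
The abstract names the Burrows-Wheeler Transform and the FM index as the backbone of the implementations. As a rough illustration of that idea only, the following Python sketch builds a toy FM index and uses backward search to locate a read in a text. It is not the project's code and not SGA's API: the function names build_fm_index and backward_search are hypothetical, and the naive suffix sort and full occurrence table are assumptions made for brevity that would not scale to real sequencing data.

```python
from collections import Counter


def build_fm_index(text):
    """Build a toy FM index for `text` (a '$' sentinel is appended)."""
    text += "$"
    # Naive suffix array: acceptable for a sketch, far too slow for genomes.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (index -1 wraps to '$').
    bwt = "".join(text[i - 1] for i in sa)
    # c_table[c]: number of characters in the text strictly smaller than c.
    counts = Counter(text)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (1 if ch == b else 0)
    return sa, bwt, c_table, occ


def backward_search(pattern, sa, c_table, occ):
    """Return the sorted start positions of `pattern` in the indexed text."""
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


# Usage: treat the read "ACG" as the query and "ACGTACGA" as the text.
sa, bwt, c_table, occ = build_fm_index("ACGTACGA")
hits = backward_search("ACG", sa, c_table, occ)
```

In this sketch the search processes the read from its last character to its first, narrowing a range of suffix-array rows at each step, so the work per query depends on the read length rather than on the text length.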

Files in This Item:

File	Size	Format
fulltext.html	147 B	HTML	View/Open
