Run Run Shaw Library, City University of Hong Kong

Please use this identifier to cite or link to this item: http://dspace.cityu.edu.hk/handle/2031/9351
Full metadata record
DC Field | Value | Language
dc.contributor.author | Sivakumar, Srinivas | en_US
dc.date.accessioned | 2020-11-17T09:36:24Z | -
dc.date.available | 2020-11-17T09:36:24Z | -
dc.date.issued | 2020 | en_US
dc.identifier.other | 2020eess977 | en_US
dc.identifier.uri | http://dspace.cityu.edu.hk/handle/2031/9351 | -
dc.description.abstract | DNA sequencing machines cannot sequence entire genomes, but they can sequence short fragments of DNA strands, known as 'reads.' Next-generation sequencing (NGS) provides large numbers of reads from a sample. The general objective of this project is to detect the presence of a genome within metagenomic data, for example, detecting the presence of HIV in a human sample. The two approaches to finding genomes from a set of reads are read alignment and de novo assembly. Alignment is the process of matching each read against a reference genome. This is inherently a string-matching problem, where the query is the read and the text is the genome. De novo assembly, or overlap assembly, on the other hand, is the process of assembling a genome without a reference; reads are joined together wherever the suffix of one read matches the prefix of another. The specific objective of this project is to identify and 'remove' all bacterial genomes from viral metagenomic data. Once the bacterial reads have been removed from the sequencing data, both new and known viruses can be recognized. The approach used in this project combines the two strategies, because complete reference genomes are not available for most bacteria. First, the reads are mapped to 16S rRNA sequences, which are present in all bacteria, using Bowtie2 (Langmead B, 2012) or BWA (Li & Durbin, 2010), both efficient read-alignment tools. The output of the alignment is the subset of reads that map to the reference, referred to as the 'seed.' The seed is then extended iteratively using overlap extension with the remaining reads. Serial and multi-threaded implementations were developed using base code from SGA (Jared T. Simpson, June 2010), as it is an open-source project. The backbone of both implementations is the Burrows-Wheeler Transform (Burrows & Wheeler, 1994) and the FM index (Ferragina & Manzini, 2000). The effectiveness and correctness of the program were checked using simulated sequencing data. A scalable and memory-efficient implementation was developed using SGA's codebase. | en_US
dc.rights | This work is protected by copyright. Reproduction or distribution of the work in any format is prohibited without written permission of the copyright owner. | en_US
dc.rights | Access is restricted to CityU users. | en_US
dc.title | Using advanced indexing strategies for genome reconstruction from metagenomic data | en_US
dc.contributor.department | Department of Electrical Engineering | en_US
dc.description.supervisor | Supervisor: Dr. Sun, Yanni; Assessor: Prof. Chow, Tommy W S | en_US
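
The iterative seed extension described in the abstract can be illustrated with a minimal sketch. This is not the project's implementation: the function names (`overlap_length`, `extend_seed`) and the `min_overlap` parameter are hypothetical, the overlap search is a naive pairwise scan rather than the BWT/FM-index queries the abstract mentions, and reverse complements and sequencing errors are ignored for brevity.

```python
def overlap_length(a, b, min_overlap):
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 if no overlap of at least `min_overlap` bases exists.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def extend_seed(seed, pool, min_overlap=4):
    """Recruit reads from `pool` that overlap the seed, round by round.

    A read is recruited when it shares a suffix-prefix overlap (in either
    direction) with a read recruited in the previous round. The loop stops
    when a round recruits nothing new.
    """
    recruited = list(seed)
    remaining = list(pool)
    frontier = list(seed)  # reads added in the most recent round
    while frontier and remaining:
        newly_recruited, still_remaining = [], []
        for read in remaining:
            overlaps = any(
                overlap_length(s, read, min_overlap)
                or overlap_length(read, s, min_overlap)
                for s in frontier
            )
            (newly_recruited if overlaps else still_remaining).append(read)
        recruited.extend(newly_recruited)
        remaining = still_remaining
        frontier = newly_recruited
    return recruited, remaining


if __name__ == "__main__":
    seed_reads = ["ACGTACGT"]  # stand-in for reads aligned to 16S rRNA
    other_reads = ["ACGTTTGA", "TTGACCCA", "GGGGGGGG"]
    grown, left_over = extend_seed(seed_reads, other_reads, min_overlap=4)
    print("recruited:", grown)
    print("remaining:", left_over)
```

In this toy input the second read overlaps the seed directly and the third is reached only through the second, which is why the extension has to be iterated. The pairwise scan is quadratic in the number of reads; an index-based overlap search is what would make the same idea practical on real sequencing data.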
Appears in Collections: Electrical Engineering - Undergraduate Final Year Projects

Files in This Item:
File | Size | Format
fulltext.html | 147 B | HTML


Items in Digital CityU Collections are protected by copyright, with all rights reserved, unless otherwise indicated.
