Computing the maximum similarity tri-clusters of gene expression data

Mu, Zongxu

Please use this identifier to cite or link to this item: http://dspace.cityu.edu.hk/handle/2031/7214

Title:	Computing the maximum similarity tri-clusters of gene expression data
Authors:	Mu, Zongxu
Department:	Department of Computer Science
Issue Date:	2013
Supervisor:	Prof. Wang, Lusheng; First Reader: Dr. Li, Shuai Cheng; Second Reader: Prof. Li, Qing
Abstract:	In bioinformatics, the most popular technology after genome sequencing has been DNA microarray analysis, a functional genomics approach (Lim et al., 2012). However, such experiments produce huge amounts of genome-wide data, leaving biologists with a challenging problem. Over the past decade or so, thanks to intensive research, bi-clustering has emerged as an effective and continuously improving approach to microarray data analysis. The method can be used to find sets of co-regulated genes under subsets of test conditions. With advances in DNA microarray technology, three-dimensional (3-D) gene expression data (matrices) are increasingly common in biomedical research (Bhar et al., 2012). The task is thus to determine which genes co-express under which subset of samples over which time period. The underlying bases for using tri-clustering in the analysis of gene expression data are: (1) similar genes may exhibit similar behaviors only under a subset of test conditions, over only some of the time points; and (2) genes may participate in more than one function, resulting in one regulation pattern in one context or time and a different pattern in another. This project attempts to capture co-expression properties in a 3-D microarray gene expression dataset. The dataset is modeled as a 3-D matrix, with each element corresponding to the expression value of a gene under a sample or experimental condition at a particular time point. In particular, the project is inspired by the MSB (Maximum Similarity Bi-cluster) algorithm and its extensions proposed by Liu and Wang (2007). MSB is a polynomial-time algorithm designed to find an optimal bi-cluster with the maximum similarity score. This project presents an algorithm named MST (Maximum Similarity Tri-cluster), which makes use of the (R)MSBE ([Randomized] MSB Extended) algorithm.
The MST algorithm first mines bi-clusters in gene-sample (G-S) matrices at different time points, then merges the obtained bi-clusters into tri-clusters, and finally merges the resulting tri-clusters to improve their quality. Apart from the algorithm, the project also delivers a home-grown software package, MSTS (MST System), which provides functionality ranging from reading data from an input file or generating synthetic data to clustering and performance evaluation. MSTS features a simple-to-use command-line user interface and efficiency in finding tri-clusters. In addition, experiments are conducted on both synthetic and real microarray datasets. For synthetic data, a match score is used to measure the quality of the obtained tri-clusters under different settings or noise levels. For real data, the Gene Ontology project is used as a reference to measure the biological significance of the MST algorithm's results. In summary, the project proposes a novel tri-clustering algorithm and delivers home-grown bioinformatics software. It thus represents a self-contained study on tri-clustering of gene expression data.
Appears in Collections:	Computer Science - Undergraduate Final Year Projects
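
The data model and the first two stages described in the abstract (slicing the 3-D matrix into gene-sample matrices per time point, then merging per-time-point bi-clusters into a tri-cluster) can be illustrated with a minimal Python sketch. This is not the MST algorithm itself: the function names, the nested-list layout `data[g][s][t]`, the size thresholds, and the greedy intersection rule are all assumptions made for illustration, and the bi-clustering step (MSB or (R)MSBE in the project) is taken as given input rather than implemented.

```python
from typing import Dict, List, Set, Tuple

Bicluster = Tuple[Set[int], Set[int]]             # (gene indices, sample indices)
Tricluster = Tuple[Set[int], Set[int], Set[int]]  # (genes, samples, time points)


def gene_sample_slice(data: List[List[List[float]]], t: int) -> List[List[float]]:
    """Return the gene-by-sample (G-S) matrix at time point t.

    Assumed layout: data[g][s][t] is the expression value of gene g
    under sample s at time point t.
    """
    return [[cell[t] for cell in gene_row] for gene_row in data]


def merge_biclusters(per_time: Dict[int, Bicluster],
                     min_genes: int = 2,
                     min_samples: int = 2) -> Tricluster:
    """Greedily intersect per-time-point bi-clusters into one tri-cluster.

    Hypothetical merge rule: a time point is added only if the shared
    genes and samples stay above the given minimum sizes.
    """
    genes: Set[int] = set()
    samples: Set[int] = set()
    times: Set[int] = set()
    for t in sorted(per_time):
        g_t, s_t = per_time[t]
        if not times:
            # Seed the tri-cluster with the earliest time point.
            genes, samples, times = set(g_t), set(s_t), {t}
            continue
        g_new, s_new = genes & g_t, samples & s_t
        if len(g_new) >= min_genes and len(s_new) >= min_samples:
            genes, samples = g_new, s_new
            times.add(t)
    return genes, samples, times


# Toy dataset: 2 genes x 2 samples x 2 time points.
data = [[[1.0, 2.0], [3.0, 4.0]],
        [[5.0, 6.0], [7.0, 8.0]]]
slice_t1 = gene_sample_slice(data, 1)

# Bi-clusters assumed to come from a separate bi-clustering step.
per_time = {0: ({0, 1, 2}, {0, 1, 2}),
            1: ({1, 2, 3}, {0, 1}),
            2: ({5, 6}, {3, 4})}
tricluster = merge_biclusters(per_time)
```

In this toy input, the bi-cluster at time point 2 shares no genes with the others, so the sketch leaves that time point out of the merged tri-cluster.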

Files in This Item:

File	Size	Format
fulltext.html	145 B	HTML
