Run Run Shaw Library, City University of Hong Kong

Please use this identifier to cite or link to this item: http://dspace.cityu.edu.hk/handle/2031/6713
Title: English document analysis and visualization
Authors: Lo, Chi Mun
Department: Department of Electronic Engineering
Issue Date: 2012
Supervisor: Prof. Chow, Tommy W S; Assessor: Prof. Chen, Guanrong
Abstract: This software project aims to develop a system that analyzes and automatically categorizes English documents, and to study how different factors affect system performance. The categorization system is divided into three stages: data representation, classifier building, and performance evaluation. Because a document is originally raw text, the system transforms it into n-grams, which are more suitable for later processing. Feature selection is then used to filter out less important features (n-grams) and to reduce computational complexity; the chi-square test is used in this project. After the data have been pre-processed, a classification approach is used to build a classifier for categorization. The system provides two classification approaches: k-nearest neighbor (KNN) and Naive Bayes. Using the classifier, documents can be categorized automatically. To evaluate system performance, the F1 score is used to measure classification accuracy. Testing shows that the system achieves acceptable accuracy with both the KNN and the Naive Bayes approaches.
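The three stages named in the abstract can be illustrated with a minimal Python sketch. It is not the project's implementation, which is not reproduced in this record. The choice of character trigrams, cosine similarity for KNN, and every function name below are assumptions made for illustration, and Naive Bayes is left out for brevity.

```python
from collections import Counter
from math import sqrt

def char_ngrams(text, n=3):
    """Data representation: count overlapping character n-grams (assumed n-gram type)."""
    s = text.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chi_square(docs, labels, feature, category):
    """Chi-square statistic of the 2x2 table: feature presence vs. category membership."""
    a = b = c = d = 0
    for grams, label in zip(docs, labels):
        present, in_cat = feature in grams, label == category
        if present and in_cat:
            a += 1          # feature present, document in category
        elif present:
            b += 1          # feature present, document in another category
        elif in_cat:
            c += 1          # feature absent, document in category
        else:
            d += 1          # feature absent, document in another category
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def select_features(docs, labels, k):
    """Feature selection: keep the k n-grams with the highest score over any category."""
    vocab = set().union(*docs)
    cats = set(labels)
    score = {f: max(chi_square(docs, labels, f, c) for c in cats) for f in vocab}
    return set(sorted(vocab, key=lambda f: (-score[f], f))[:k])

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[f] * v[f] for f in u if f in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def knn_predict(train, labels, query, k=3):
    """Classifier: majority vote among the k training documents most similar to query."""
    ranked = sorted(zip(train, labels), key=lambda pair: -cosine(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def f1_score(true, pred, positive):
    """Evaluation: F1 = 2*TP / (2*TP + FP + FN) for one category."""
    tp = sum(t == positive and p == positive for t, p in zip(true, pred))
    fp = sum(t != positive and p == positive for t, p in zip(true, pred))
    fn = sum(t == positive and p != positive for t, p in zip(true, pred))
    total = 2 * tp + fp + fn
    return 2 * tp / total if total else 0.0

# Toy usage with made-up documents (not data from the project).
texts = ["the cat sat", "a cat purred", "stock prices fell", "stock market rose"]
labels = ["animal", "animal", "finance", "finance"]
train = [char_ngrams(t) for t in texts]
kept = select_features(train, labels, k=20)
reduced = [Counter({f: c for f, c in g.items() if f in kept}) for g in train]
print(knn_predict(train, labels, char_ngrams("cat"), k=3))
```

Word n-grams, a different distance measure, or a different value of k would slot into the same structure. The abstract does not say which of these the project used.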
Appears in Collections: Electrical Engineering - Undergraduate Final Year Projects

Files in This Item:
File: fulltext.html; Size: 146 B; Format: HTML


Items in Digital CityU Collections are protected by copyright, with all rights reserved, unless otherwise indicated.
