Please use this identifier to cite or link to this item:
http://dspace.cityu.edu.hk/handle/2031/6713
Title: | English document analysis and visualization |
Authors: | Lo, Chi Mun |
Department: | Department of Electronic Engineering |
Issue Date: | 2012 |
Supervisor: | Supervisor: Prof. Chow, Tommy W S; Assessor: Prof. Chen, Guanrong |
Abstract: | This software project develops a system that analyzes and automatically categorizes English documents, and studies how different factors affect system performance. The categorization system is divided into three stages: data representation, classifier building, and performance evaluation. Because a document is originally raw text, the system first transforms it into n-grams, which are more suitable for later processing. Feature selection is then used to filter out less important features (n-grams) and to reduce computational complexity; this project uses the chi-square test. After pre-processing, a classifier is built for categorization. The system provides two classification approaches: k-nearest neighbor (KNN) and Naive Bayes. With the classifier, documents can be categorized automatically. System performance is evaluated with the F1 score, which measures classification accuracy. Testing shows that the system achieves acceptable accuracy with both the KNN and Naive Bayes approaches. |
Appears in Collections: | Electrical Engineering - Undergraduate Final Year Projects |
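
The three stages named in the abstract (n-gram representation, chi-square feature selection, F1 evaluation) can be illustrated with a minimal Python sketch. This is not the project's code: the function names `word_ngrams`, `chi_square` and `f1_score` are hypothetical, the choice of word-level n-grams is an assumption (the abstract does not say whether n-grams are built from words or characters), and the chi-square form shown is the standard 2x2 contingency statistic, which may differ from the variant the project used.

```python
def word_ngrams(text, n=2):
    """Return the word-level n-grams of a text as a list of tuples.

    Assumption: whitespace tokenization and lowercasing; the project's
    actual tokenization is not described in the abstract.
    """
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 feature/category contingency table.

    a: documents in the category that contain the feature
    b: documents outside the category that contain the feature
    c: documents in the category that lack the feature
    d: documents outside the category that lack the feature
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A row or column of the table is empty, so the feature
        # carries no information about the category.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def f1_score(true_labels, predicted_labels, positive):
    """F1 for one category: harmonic mean of precision and recall."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

In a pipeline of this kind, features with a high chi-square score for some category would be kept and the rest discarded before a KNN or Naive Bayes classifier is trained; the F1 score is then computed per category on held-out documents.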
Files in This Item:
File | Size | Format |
---|---|---|
fulltext.html | 146 B | HTML |
Items in Digital CityU Collections are protected by copyright, with all rights reserved, unless otherwise indicated.