Please use this identifier to cite or link to this item:
http://dspace.cityu.edu.hk/handle/2031/8219
Title: | Document Similarity Comparison |
Authors: | Chu, Yilei |
Department: | Department of Electronic Engineering |
Issue Date: | 2015 |
Supervisor: | Supervisor: Prof. CHOW, Tommy W S; Assessor: Prof. CHEN, Guanrong |
Abstract: | The algorithms behind document similarity comparison have been widely applied in fields such as (1) plagiarism checking in academic libraries, (2) redundancy elimination in large collections of web pages, and (3) web search engines like Google. However, past research has relied on huge databases consisting of millions or billions of web pages, so the recall of those experiments usually cannot be justified. This final year project explored three of the best-known English text similarity detection techniques: (1) cosine distance, (2) shingling and (3) SimHash. The database consists of short news articles crawled from BBC News. Both accuracy and efficiency were evaluated in order to find the most suitable algorithm for a short-text search engine. All experiments were conducted using Python and Java, with support from open-source libraries such as NLTK, the Stanford POS Tagger, and Guava. |
Appears in Collections: | Electrical Engineering - Undergraduate Final Year Projects |
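
The three techniques named in the abstract can be illustrated with a minimal Python sketch. This is not the project's implementation: the function names, the whitespace tokenizer, the shingle size `k=3`, and the use of MD5 as the token hash are all assumptions chosen for illustration, and only the standard library is used rather than NLTK, the Stanford tagger, or Guava.

```python
import hashlib
import math
from collections import Counter

SIMHASH_BITS = 64  # assumed fingerprint width for this sketch


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def cosine_similarity(a, b):
    """Cosine of the angle between two raw term-frequency vectors."""
    ta, tb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ta[t] * tb[t] for t in ta.keys() & tb.keys())
    sq_a = sum(v * v for v in ta.values())
    sq_b = sum(v * v for v in tb.values())
    norm = math.sqrt(sq_a * sq_b)
    return dot / norm if norm else 0.0


def shingles(text, k=3):
    """Set of k-word shingles (contiguous word k-grams)."""
    words = tokenize(text)
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b, k=3):
    """Shingling-based resemblance: Jaccard index of the two shingle sets."""
    sa, sb = shingles(a, k), shingles(b, k)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def simhash(text):
    """64-bit SimHash fingerprint, weighting each token by its frequency."""
    weights = [0] * SIMHASH_BITS
    for token, count in Counter(tokenize(text)).items():
        # Take the first 8 bytes of MD5 as a 64-bit token hash (an assumption).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(SIMHASH_BITS):
            weights[i] += count if (h >> i) & 1 else -count
    # Bit i of the fingerprint is set when the weighted vote is positive.
    return sum(1 << i for i in range(SIMHASH_BITS) if weights[i] > 0)


def hamming_distance(x, y):
    """Number of bit positions in which two fingerprints differ."""
    return bin(x ^ y).count("1")
```

In this sketch, cosine similarity and the Jaccard index both range from 0 to 1, with higher values meaning more similar texts, while two SimHash fingerprints are compared by Hamming distance, where a smaller distance suggests near-duplicate documents.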
Files in This Item:
File | Size | Format |
---|---|---|
fulltext.html | 145 B | HTML |
Items in Digital CityU Collections are protected by copyright, with all rights reserved, unless otherwise indicated.