Please use this identifier to cite or link to this item:
http://dspace.cityu.edu.hk/handle/2031/8219
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Chu, Yilei | en_US |
dc.date.accessioned | 2016-01-07T01:24:09Z | |
dc.date.accessioned | 2017-09-19T09:14:52Z | |
dc.date.accessioned | 2019-02-12T07:33:20Z | - |
dc.date.available | 2016-01-07T01:24:09Z | |
dc.date.available | 2017-09-19T09:14:52Z | |
dc.date.available | 2019-02-12T07:33:20Z | - |
dc.date.issued | 2015 | en_US |
dc.identifier.other | 2015eecy303 | en_US |
dc.identifier.uri | http://144.214.8.231/handle/2031/8219 | - |
dc.description.abstract | The algorithms behind document similarity comparison have been widely applied in fields such as (1) plagiarism checking in academic libraries, (2) redundancy elimination in large collections of web pages, and (3) web search engines like Google. However, past research relies on huge databases consisting of millions or billions of web pages, so the recall of those experiments usually cannot be justified. This final year project explores three of the best-known English text similarity detection techniques: (1) cosine distance, (2) shingling and (3) SimHash. The database consists of short news articles crawled from BBC News. Both accuracy and efficiency were evaluated in order to find the most suitable algorithm for a short-text search engine. All the experiments were conducted using Python and Java, with support from open-source libraries such as NLTK, the Stanford POS Tagger, and Guava. | en_US |
dc.rights | This work is protected by copyright. Reproduction or distribution of the work in any format is prohibited without written permission of the copyright owner. | en_US |
dc.rights | Access is restricted to CityU users. | en_US |
dc.title | Document Similarity Comparison | en_US |
dc.contributor.department | Department of Electronic Engineering | en_US |
dc.description.supervisor | Supervisor: Prof. CHOW, Tommy W S; Assessor: Prof. CHEN, Guanrong | en_US |
Appears in Collections: | Electrical Engineering - Undergraduate Final Year Projects |
Files in This Item:
File | Size | Format |
---|---|---|
fulltext.html | 145 B | HTML |
Items in Digital CityU Collections are protected by copyright, with all rights reserved, unless otherwise indicated.