Copy or Not Dawei (David) Shi

Feb 23, 2016

Transcript
Page 1: Copy or Not

Copy or Not
Dawei (David) Shi

Page 2: Copy or Not

Copy Or Not
Introduction
Algorithm
Framework
Future work
Demo

Page 3: Copy or Not

Copy Or Not
Introduction
Algorithm
Framework
Future work
Demo

Page 4: Copy or Not

Introduction
A web-based document comparator
Calculates an accurate similarity between 2 documents

Page 5: Copy or Not

Copy Or Not
Introduction
Algorithm
Framework
Future work
Demo

Page 6: Copy or Not

Algorithm
Preprocessing
Vector space
Similarity calculation

Page 7: Copy or Not

Preprocessing
Lowercase
Stop words filtering
Stemming

Page 8: Copy or Not

Preprocessing
Stemming
› Porter Stemming Algorithm
› E.g.
  cat – cats
  meet – meeting
  agree – agreed
  correct – correctness
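The three preprocessing steps can be sketched in Python roughly as follows. This is a minimal illustration, not the project's code: the stop-word list is a tiny hypothetical sample, and `naive_stem` is a simple suffix stripper standing in for the Porter Stemming Algorithm, so its output will differ from a real Porter stemmer on many words.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real list is much longer.
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to"}

# Suffixes tried in order. This is a naive stand-in, NOT the Porter algorithm.
SUFFIXES = ("ness", "ing", "ed", "s")


def naive_stem(word):
    """Strip one common suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize, drop stop words, then stem each remaining token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]
```

With these assumptions, `preprocess("The cats are meeting")` yields `["cat", "meet"]`.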

Page 9: Copy or Not

Vector Space
Build dictionary 1
› word -> frequency
Sort the keys of dictionary 1
Build dictionary 2
› key -> (index, count)
Build binary vectors
› index -> occurrence
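One possible reading of the four steps above, sketched in Python. The slide does not say whether dictionary 1 is built per document or over both documents together; this sketch assumes a single shared dictionary so that both vectors use the same indices. The function name `build_vectors` is hypothetical.

```python
from collections import Counter


def build_vectors(tokens1, tokens2):
    """Return the sorted vocabulary and two binary occurrence vectors."""
    # Dictionary 1: word -> frequency (assumed here to span both documents).
    frequency = Counter(tokens1) + Counter(tokens2)

    # Sort the keys so every word gets a stable position.
    keys = sorted(frequency)

    # Dictionary 2: key -> (index, count).
    index_of = {key: (i, frequency[key]) for i, key in enumerate(keys)}

    # Binary vectors: index -> occurrence (1 if the word appears, else 0).
    v1 = [0] * len(keys)
    v2 = [0] * len(keys)
    for word in set(tokens1):
        v1[index_of[word][0]] = 1
    for word in set(tokens2):
        v2[index_of[word][0]] = 1
    return keys, v1, v2
```

For example, `build_vectors(["cat", "meet"], ["cat", "agree"])` gives the vocabulary `["agree", "cat", "meet"]` with vectors `[0, 1, 1]` and `[1, 1, 0]`.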

Page 10: Copy or Not

Similarity Calculation
Vectors v1 and v2
Similarity = (v1 · v2) / (norm(v1) * norm(v2)), i.e. the cosine similarity, where v1 · v2 is the dot product
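The formula can be written out in plain Python as a small sketch. The function name is hypothetical, and returning 0.0 when either vector has zero length is an assumption; the slides do not say how that case is handled.

```python
import math


def cosine_similarity(v1, v2):
    """Dot product of v1 and v2 divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumption: an empty document is similar to nothing
    return dot / (norm1 * norm2)
```

For the binary vectors `[0, 1, 1]` and `[1, 1, 0]`, the dot product is 1 and each norm is the square root of 2, so the similarity is 0.5.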

Page 11: Copy or Not

Performance
Algorithms coded in Python
› Dynamic typing
› Not good at numerical operations
Solution: numpy

Page 12: Copy or Not

Numpy
A Python extension module
Written mostly in C
Defines numerical array and matrix types and basic operations on them

Page 13: Copy or Not

Numpy vs Python
Python code
› a = range(10000000)
› b = range(10000000)
› c = []
› for i in range(len(a)):
›     c.append(a[i] + b[i])
Takes up to 10 seconds on a processor running at several GHz

Page 14: Copy or Not

Numpy vs Python
Numpy code
› import numpy as np
› a = np.arange(10000000)
› b = np.arange(10000000)
› c = a + b
Almost instant

Page 15: Copy or Not

Numpy Usage
Vector dot product
Vector normalization
Vector zero filling
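A rough sketch of how these three operations might look with numpy. The helper names are hypothetical, and "zero filling" is interpreted here as padding a shorter vector with zeros up to a common length, which the slides do not spell out.

```python
import numpy as np


def zero_fill(values, length):
    """Copy values into a zero vector of the given length (zero filling)."""
    padded = np.zeros(length)
    padded[: len(values)] = values
    return padded


def normalize(v):
    """Scale v to unit length; a zero vector is returned unchanged."""
    norm = np.linalg.norm(v)
    return v if norm == 0 else v / norm


def similarity(v1, v2):
    """Cosine similarity as the dot product of the two normalized vectors."""
    return float(np.dot(normalize(v1), normalize(v2)))
```

Normalizing first and then taking the dot product gives the same value as dividing the dot product by the product of the norms.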

Page 16: Copy or Not

Copy Or Not
Introduction
Algorithm
Framework
Future work
Demo

Page 17: Copy or Not

Framework
Django
› The web framework for perfectionists with deadlines

Page 18: Copy or Not

Libraries
Python
› Numpy
› Porter Stemming
jQuery

Page 19: Copy or Not

Hosting
Alwaysdata
› Django 1.3
› Python 2.6

Page 20: Copy or Not

Copy Or Not
Introduction
Algorithm
Framework
Future work
Demo

Page 21: Copy or Not

Future Work
Support file uploading and comparison
Add HTML5 features

Page 22: Copy or Not

Copy Or Not
Introduction
Algorithm
Framework
Future work
Demo

Page 23: Copy or Not

Demo
http://imds.alwaysdata.net

Page 24: Copy or Not

Thank you!