Topic Modelling in Farsi and APIs @aliostad Ali Kheyrollahi Barcelona 2015

Topic Modelling and APIs

Aug 08, 2015


Ali Kheyrollahi
Transcript
Page 1: Topic Modelling and APIs

Topic Modelling in Farsi

and APIs

@aliostad

Ali Kheyrollahi

Barcelona 2015

Page 2: Topic Modelling and APIs

Machine Learning

and APIs

Page 3: Topic Modelling and APIs
Page 4: Topic Modelling and APIs

Topic Modelling

Page 5: Topic Modelling and APIs

Topic Modelling

Page 6: Topic Modelling and APIs

Topic Modelling

Search

Page 7: Topic Modelling and APIs

Topic Modelling

Document-to-document similarity

Page 8: Topic Modelling and APIs

Farsi

Page 9: Topic Modelling and APIs

Farsi

23rd most popular Spoken Language,

above Italian and Polish

Page 10: Topic Modelling and APIs

Farsi

14th most popular Internet Language,

above Korean and Swedish

Page 11: Topic Modelling and APIs

Acquisition and pre-

My Book

کتاب‌من

ketaab-e-man

Page 12: Topic Modelling and APIs

What did you just say?

کتاب‌منکک ت اب ‌م ن

باککتکيیکیکم

Page 13: Topic Modelling and APIs

How weird can it get?

Unicode codez

Page 14: Topic Modelling and APIs

And there’s more

کتاب‌من

کتاب من

Zero-width non-joiner (U+200C)

HTML => &zwnj;

کتابمن
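
The slides above show the same word arriving with a zero-width non-joiner, with a plain space, or with no separator at all, and the HTML form adds an entity on top. A minimal normalisation sketch in Python follows. The function name `normalise_farsi` and the decision to map Arabic Yeh and Kaf onto their Farsi code points are illustrative assumptions, not the author's pipeline.

```python
import html
import re

ZWNJ = "\u200c"  # zero-width non-joiner

# Assumption: Arabic code points that stand in for Farsi letters are unified.
ARABIC_TO_FARSI = str.maketrans({
    "\u064a": "\u06cc",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06a9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
})


def normalise_farsi(text: str) -> str:
    """Decode HTML entities, unify letter variants, collapse repeated ZWNJ."""
    # html.unescape turns the "&zwnj;" entity into the real U+200C character.
    text = html.unescape(text)
    text = text.translate(ARABIC_TO_FARSI)
    # Several non-joiners in a row carry no more meaning than one.
    return re.sub(ZWNJ + "+", ZWNJ, text)
```

Whether a plain space between the two parts should also become a non-joiner is a linguistic decision this sketch deliberately leaves open.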

Page 15: Topic Modelling and APIs

Topic Modelling

and LDA

Page 16: Topic Modelling and APIs
Page 17: Topic Modelling and APIs

Latent Dirichlet Allocation (LDA)

✤ Mainly a “clustering” algorithm

✤ Defines topics as latent variables within the documents

✤ Implementations are available in most programming languages

✤ Python => Gensim and Java => Mallet

Page 18: Topic Modelling and APIs

Topic Modelling concepts in LDA

✤ Document: “Bag of words” vs. “Markov chain”

✤ Word: merely an id (“library”=>123, “librarian”=>789)

✤ Dictionary: set of all words

✤ Corpus: set of all documents

✤ Topic: Distribution over words (LDA)
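
The concepts above can be sketched in a few lines of plain Python. This is not Gensim's API; the helper names `build_dictionary` and `to_bag_of_words` and the tiny corpus are made up for illustration, and the ids are assigned in order of first appearance rather than matching the numbers on the slide.

```python
from collections import Counter


def build_dictionary(corpus):
    """Map every distinct word in the corpus to an integer id."""
    dictionary = {}
    for document in corpus:
        for word in document:
            # setdefault only assigns the next free id the first time a word is seen.
            dictionary.setdefault(word, len(dictionary))
    return dictionary


def to_bag_of_words(document, dictionary):
    """Return {word_id: count}; word order is discarded, unlike a Markov chain."""
    return dict(Counter(dictionary[w] for w in document if w in dictionary))


# Hypothetical, already tokenised corpus.
corpus = [
    ["library", "librarian", "library"],
    ["librarian", "book"],
]
dictionary = build_dictionary(corpus)
bags = [to_bag_of_words(doc, dictionary) for doc in corpus]
```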

Page 19: Topic Modelling and APIs

Using Latent Dirichlet Allocation

✤ Document as a vector of topic weights {0: 0.01, 12: 0.19, 42: 0.23}

✤ Cosine similarity for document similarity

✤ Document similarity works really well

✤ Not great in some domains [to fix => Hierarchical]

✤ Boosting
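
Cosine similarity over sparse topic-weight vectors, as listed above, can be sketched as follows. The function name and the second example document are assumptions for illustration; only the first vector comes from the slide.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors given as {topic_id: weight}."""
    # Only topics present in both documents contribute to the dot product.
    dot = sum(weight * b.get(topic, 0.0) for topic, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


doc_a = {0: 0.01, 12: 0.19, 42: 0.23}  # vector from the slide
doc_b = {12: 0.30, 42: 0.10, 77: 0.05}  # hypothetical second document
similarity = cosine_similarity(doc_a, doc_b)
```

Because the weights are non-negative, the result lies between 0 (no shared topics) and 1 (same topic mix).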

Page 20: Topic Modelling and APIs

Topic Model and

APIs

Page 21: Topic Modelling and APIs

Resources: Dictionary

Dictionary

POST languages/farsi/dictionaries HTTP/1.1
Host: example.com

200 OK
Location: languages/farsi/dictionaries/123

languages/{lang}/dictionaries/{id}

Create

Page 22: Topic Modelling and APIs

Resources: Dictionary

Dictionary

PUT languages/farsi/dictionaries/123 HTTP/1.1
Content-Type: application/json

{
  "documents": [
    {"fullText": "یک فوق‌تخصص قلب و عروق گفت"},
    …
  ]
}

languages/{lang}/dictionaries/{id}

Add document words

Page 23: Topic Modelling and APIs

Resources: Corpus

Corpus

POST languages/farsi/corpora HTTP/1.1
Host: example.com

200 OK
Location: languages/farsi/corpora/123

languages/{lang}/corpora/{id}

Create

Page 24: Topic Modelling and APIs

Resources: Corpus

Corpus

PUT languages/farsi/corpora/123 HTTP/1.1
Content-Type: application/json

{
  "documents": [
    {"fullText": "یک فوق‌تخصص قلب و عروق گفت"},
    …
  ]
}

languages/{lang}/corpora/{id}

Add documents

Page 25: Topic Modelling and APIs

Resources: TopicModel

TopicModel

POST languages/farsi/topicmodels?passes=6&alpha=auto HTTP/1.1
Host: example.com

{
  "dictionaryId": 123,
  "corpusId": 456
}

languages/{lang}/topicmodels

Create (request)

Page 26: Topic Modelling and APIs

Resources: TopicModel

TopicModel

202 Accepted
Location: languages/farsi/topicmodels/789

languages/{lang}/topicmodels

Create (response)
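
The create request and its 202 Accepted response could be handled from a client roughly like this. It is a sketch under assumptions: the `http` scheme, the helper names, and the idea that the client reads the Location header to find the model being trained are not stated on the slides. The request is only built, not sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://example.com/"  # host from the slides; the scheme is an assumption


def build_create_topic_model_request(lang, dictionary_id, corpus_id, passes=6, alpha="auto"):
    """Build (but do not send) the POST that asks the server to train a model."""
    query = urlencode({"passes": passes, "alpha": alpha})
    url = f"{BASE_URL}languages/{lang}/topicmodels?{query}"
    body = json.dumps({"dictionaryId": dictionary_id, "corpusId": corpus_id}).encode("utf-8")
    return Request(url, data=body, headers={"Content-Type": "application/json"}, method="POST")


def topic_model_location(status, headers):
    """Return the Location of the model resource, or None if not accepted."""
    # 202 Accepted signals that training has been queued, not that it has finished.
    if status != 202:
        return None
    return headers.get("Location")


request = build_create_topic_model_request("farsi", 123, 456)
```

Returning 202 rather than blocking fits a long-running training job: the client gets a resource to come back to instead of holding a connection open.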

Page 27: Topic Modelling and APIs

State of current ML APIs

Page 28: Topic Modelling and APIs

State of current ML APIs

HATEOAS

Hypermedia

REST

APIs

CACHE

Markov Chain

Graph Theory

Deep Learning

Bayesian

Page 29: Topic Modelling and APIs

Server Authority

State

Page 30: Topic Modelling and APIs

Server Authority

Algorithm

Page 31: Topic Modelling and APIs

Is this really a resource?

Converts

Page 32: Topic Modelling and APIs

Mills

Page 33: Topic Modelling and APIs

Mills

✤ A single piece of work/specialty (& verb)

✤ Encapsulating an “algorithm”

✤ Do not own data (but do own their config): raw data in, processed result out

✤ All calls are safe and idempotent
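
One way to read the properties above is as an immutable, callable object: configuration lives on the instance, data only passes through. The class below is a hypothetical illustration (the name `TokenizerMill` and its options are invented here), not the forthcoming Mill proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenizerMill:
    """A hypothetical mill: owns its configuration, never stores the data it processes."""

    lowercase: bool = True
    min_length: int = 2

    def __call__(self, raw_text):
        # Raw data in, processed result out; nothing is written to self.
        words = raw_text.split()
        if self.lowercase:
            words = [w.lower() for w in words]
        return [w for w in words if len(w) >= self.min_length]


mill = TokenizerMill()
first = mill("A Topic Model")
second = mill("A Topic Model")  # same input, same output, no side effects
```

Freezing the dataclass makes the "safe and idempotent" property hard to break by accident, since a call cannot change the mill's state.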

Page 34: Topic Modelling and APIs

Topic Model Mills: classifier

TopicModel

languages/{lang}/topicmodels/{id}/classifier

classifier (request)

POST languages/farsi/topicmodels/789/classifier HTTP/1.1
Host: example.com

{
  "fullText": "یک فوق‌تخصص قلب و عروق گفت",
  "refinement": "hierarchical"
}

Page 35: Topic Modelling and APIs

Topic Model Mills: classifier

TopicModel

languages/{lang}/topicmodels/{id}/classifier

classifier (response)

200 OK
Content-Type: application/json

{
  "15": 0.03,
  "123": 0.2,
  "390": 0.09,
  …
}
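
A client would typically rank the returned topic weights before using them. A minimal sketch follows, assuming the response body is a flat JSON object of topic id to weight as shown; the helper name `top_topics` is invented, and the sample body contains only the three entries visible on the slide.

```python
import json


def top_topics(response_body, n=3):
    """Return the n highest-weighted (topic_id, weight) pairs from a classifier response."""
    weights = json.loads(response_body)
    # JSON object keys are strings, so topic ids are converted back to integers.
    ranked = sorted(
        ((int(topic), weight) for topic, weight in weights.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:n]


body = '{"15": 0.03, "123": 0.2, "390": 0.09}'
best = top_topics(body, n=2)
```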

Page 36: Topic Modelling and APIs

Thank you!

@aliostad

aliostad [at] gmail [dot] com

Page 37: Topic Modelling and APIs

Acknowledgements

✤ Windmill picture: https://www.flickr.com/photos/capnkroaker/2473951927/in/photolist-4LBDHz-4Mba53-pnrARE-4Ktk7H

✤ Algorithm picture: https://www.flickr.com/photos/peterrosbjerg/4257452000/in/photolist-7udy9Q-834w2L-dcPgeA-dcPg7s-dcPdHZ-jiRgZs-jiQstc-8qnJKb-8qdWAF-8qh6D1-8qdWEz-b8ADMZ-b8Ausi-ansdvD-dcPgc9-8kNsd1-pNCgk1-b7G8ZT-8pQRPF-8pTvUy-eXse1A-99XXLF-eKT1Y-831n9H-jj7vuo-jiRZij-b7G84Z-b7G78P-fvrqUB-b7GajB-jiPF46-8ERYQY-jiNzQv-jiRkkW-jiNUNm-jiPgPZ-jiPbWU-jiNQQm-jiRhDF-jiPF7m-jiSdJs-jiPqDT-jiSa8u-jiPwr5-jiM5Ac-jiMKze-jiPZCZ-jiNwf6-jiNtMF-jiPHut

✤ Water Tanks picture: https://www.flickr.com/photos/psilver/2280385292/in/photolist-4tvz6N-9h1PHy-cuie7-RJ83k-696owN-85tcrs-74MqFc-pkuu5-o3BsGV-bR11F2-8jNAnA-ep2fVX-8YyHWv-ABECA-av9mMk-7LMozD-dMySvh-7Pipgo-5rXApy-Q8zgi-eFxGYc-7sDbjx-87LdLE-aELtQV-7AnXb7-dJqjNR-XYpHK-nAFVCS-95G4EU-9jxNiT-7F1RPj-68hFop-7VFYSs-nzr9W2-pb3zpe-9j5sua-9962cu-bJ1UED-dp6yqD-8UCQTj-NywAX-kBG3xr-9aTXxq-pVmJui-k8BDsX-7XXtce-7pKUVr-5Hn3CL-rvWcUu-kW6dat

✤ IT Web Jobs: http://www.itjobswatch.co.uk/jobs/uk/machine%20learning.do

✤ Question mark picture: https://www.flickr.com/photos/129627585@N07/15684220620/in/photolist-pTXJmU-4W4Xed-dKRgQ2-LLBYA-8uuSCh-8qk5Q-4y7wzQ-6feu6Z-6EsuSe-f7eVmb-9WAPNR-f75fPR-8Lzt7R-9L2t4y-apWJQR-fhdGoH-4v7kg9-65wR72-7FyjMW-epmYfa-abQKN1-6m1HuV-86Uor8-a64uYL-a61DMr-9oCDYc-dW2Xad-a64vnN-a61A6c-2Zvn7-5pjxSz-9Gd43a-oQRf6d-oQReWf-p8kRm6-p8iYSE-xtEEP-7oxXJg-a64sGS-7U52mA-2Z97S-a61D7e-a64uFd-aiWYZk-2Z9mV-4cmUWW-2Zoxb-2Zg4r-a61Fmi-a61B1T

✤ Timber picture: https://www.flickr.com/photos/simonbleasdale/2797031694/in/photolist-5gaw5o-hLB1BA-9Zafk9-bnjrwy-cSf2fU-cSf4tN-69qu5j-69qzMC-dkby9F-7wjTFp-kRxidH-53cRdq-nDr93N-kRFANh-25iCW3-cjLSKs-9R81XA-4xHk5Z-9R7MXG-5gay3S-baDJJF-bnjrdE-9R7XMf-9R8Knj-9R8Ceq-bAej8Z-bnjrhU-8WcymA-bnjrA5-9R8RgQ-9R5CtZ-9R8e6h-9R8Peh-9R5eca-9R8htw-9R5bbD-9R5yov-9R51kT-9R5hqK-dvvEFb-dvq6fP-dvq7GT-dvvFG7-dvvEyL-dvvGtA-dvq8bT-dvvGAN-dvq6U2-dvvGWQ-dvq5sD

✤ Timber: https://commons.wikimedia.org/wiki/Category:Timber#/media/File:Oregon_BLM_Forestry_10_(6871708937).jpg

Page 38: Topic Modelling and APIs

References

✤ Gensim: https://radimrehurek.com/gensim/

✤ LDA Paper: http://machinelearning.wustl.edu/mlpapers/paper_files/BleiNJ03.pdf

✤ Client-Server Domain Separation: http://byterot.blogspot.com.es/2012/11/client-server-domain-separation-csds-rest.html

✤ Mill proposal: Is coming!