Real-World Deep Learning for NLP
Zoltan Varju, Precognox
06.13.2017, Budapest Data Forum
About the title
● Real-World: it’s about an enterprise search problem, not about dogs and cats or interlingua
● Deep Learning: word2vec, seq2seq (no fancy new architectures)
● NLP: correcting typos, finding synonyms
The Problem
Semantic Search
● Job titles:
  ○ We have to find all possible titles for a given query
  ○ Java fejlesztő, Java programozó, Java Developer, Java Programmer, Junior Java Ninja, Rockstar Java Programmer
● Typos:
  ○ Queries are full of typos, especially on mobile
  ○ Programmer: porgrammer, programer, programme
● Broader terms:
  ○ If you can’t find anything for a given term, try a query with its broader term
  ○ Prolog Developer -> Developer, Senior Project Manager -> Project Manager
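The broader-term fallback can be sketched in a few lines. The mapping entries and the head-noun heuristic below are illustrative, not the real thesaurus:

```python
# Hypothetical fallback sketch: when a query returns no hits, retry with a
# broader term from a (made-up) broader-term map, or fall back to dropping
# the leading modifier, since the head noun usually comes last in these titles.
BROADER = {  # illustrative entries only
    "Prolog Developer": "Developer",
    "Senior Project Manager": "Project Manager",
}

def broaden(query: str) -> str:
    """Return a broader query for a too-specific job title."""
    if query in BROADER:
        return BROADER[query]
    # Crude generalization step: strip the leading modifier word.
    words = query.split()
    return " ".join(words[1:]) if len(words) > 1 else query

print(broaden("Prolog Developer"))   # -> Developer
print(broaden("Junior Java Ninja"))  # -> Java Ninja
```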
Search Backend @ Profession
● Job thesaurus:
  ○ Used for query expansion
  ○ Typos, synonyms, broader and narrower terms
  ○ Flat categories (e.g. Bank and Finance, Health, Agriculture)
● Solr index
● InfoHarvester (Hadoop-based crawler)
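Query expansion with a flat synonym/typo thesaurus can be sketched as an OR-expanded Solr-style query. The thesaurus entries and the field name `title` are made up for illustration:

```python
# Minimal query-expansion sketch over a flat thesaurus; the entries and the
# Solr field name ("title") are illustrative, not the production schema.
THESAURUS = {
    "java developer": ["java fejlesztő", "java programozó", "java programmer"],
    "programmer": ["porgrammer", "programer"],  # known typo variants
}

def expand(query: str, field: str = "title") -> str:
    """Build an OR-expanded Solr-style query string from the thesaurus."""
    terms = [query] + THESAURUS.get(query.lower(), [])
    return " OR ".join(f'{field}:"{t}"' for t in terms)

print(expand("Programmer"))
# title:"Programmer" OR title:"porgrammer" OR title:"programer"
```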
Thesaurus Building
● Use the query log to find similar queries
  ○ Use string similarity (e.g. Levenshtein distance) to generate a list of candidate pairs
  ○ Filter results
● Annotators use Thesaurus Manager
  ○ to determine the type of relation between the candidates (synonym, broader, narrower, or typo)
  ○ to eliminate false positives (about ⅔ of the generated data is false positive!)
  ○ to extend the thesaurus with synonyms based on common sense
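The candidate-generation step above can be sketched in plain Python; the query-log entries and the distance threshold are illustrative:

```python
# Pair up query-log entries whose Levenshtein distance is small; these pairs
# are the candidates the annotators later label. Threshold and log are toy values.
from itertools import combinations

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def candidate_pairs(queries, max_dist=2):
    """Yield query pairs close enough in edit distance to be related."""
    for q1, q2 in combinations(set(queries), 2):
        if levenshtein(q1, q2) <= max_dist:
            yield tuple(sorted((q1, q2)))

log = ["programmer", "porgrammer", "programer", "project manager"]
print(sorted(candidate_pairs(log)))
```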
Pros & Cons
Pros
● It works (we’ve been doing this for years)
● We are comfortable with such projects
● Requires minimal maintenance (software and data)
● The client can use the Thesaurus Manager to manage its data
Cons
● Building the first, basic thesaurus is extremely hard and slow
● Annotators have a hard time for weeks; they get frustrated, and we need lots of QA
● Hard work means high costs
There is more than one way to do it
Why?
● Our annotators get TOO MUCH data (because of the false positives), so
● they have to make LOTS OF DECISIONS.
● The ‘Typo’ relation is the hardest to check during thesaurus building; we want to do it automatically.
● We don’t want to fire them: reducing the time and costs of search projects means we can win more projects and provide the annotators with meaningful paid jobs
● We are doing spell checking, but we are interested in finding terms in the index
Finding semantic relations with word2vec
● Mikolov et al. 2013
● Preprocessing: extracting named entities, idioms, etc.
● Training with Gensim
● A simple similarity measure covers 96.7% of the relations in the original thesaurus
● 15% of the results are false positives
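The slide does not spell out the similarity measure; cosine similarity over the learned vectors (what Gensim's `most_similar` computes) is the usual choice. A stdlib sketch with made-up toy vectors standing in for trained embeddings:

```python
import math

# Toy 3-d vectors standing in for word2vec embeddings; in the real pipeline
# these would come from a trained Gensim model rather than a hand-written dict.
VEC = {
    "java developer":  [0.9, 0.1, 0.2],
    "java programmer": [0.85, 0.15, 0.25],
    "nurse":           [0.1, 0.9, 0.3],
}

def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def related(term, threshold=0.95):
    """Terms above the similarity threshold become candidate thesaurus relations."""
    return [w for w in VEC
            if w != term and cosine(VEC[term], VEC[w]) >= threshold]

print(related("java developer"))  # -> ['java programmer']
```

The threshold here is illustrative; in practice it would be tuned against the existing thesaurus, which is how a coverage figure like 96.7% gets measured.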
Spell checking with seq2seq
● Generate synthetic typos
  ○ Brute force: generate ALL the strings within three edit distance, and filter out real words from the results
  ○ Sophisticated: generate typos based on misspelling stats
  ○ Mixed: generate n random typos within three edit distance + use stats
● Train a seq2seq network on your data and evaluate it:
  ○ Brute force: 63%
  ○ Sophisticated: 81.3%
  ○ Mixed: 97.66%
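The random half of the mixed strategy can be sketched as below; the typo/correct-word pairs it yields are what a seq2seq spelling model trains on. The edit operations and alphabet are illustrative, and the real pipeline additionally weights edits by observed misspelling statistics:

```python
import random
import string

# Sketch of the "mixed" strategy's random component: n random typos, each
# within max_dist edits of the source word. Alphabet and ops are illustrative.
ALPHABET = string.ascii_lowercase

def one_edit(word: str) -> str:
    """Apply one random insert/delete/substitute/transpose edit."""
    i = random.randrange(len(word))
    op = random.choice(["insert", "delete", "substitute", "transpose"])
    if op == "insert":
        return word[:i] + random.choice(ALPHABET) + word[i:]
    if op == "delete" and len(word) > 1:
        return word[:i] + word[i + 1:]
    if op == "transpose" and i < len(word) - 1:
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    # Fall through to substitution.
    return word[:i] + random.choice(ALPHABET) + word[i + 1:]

def random_typos(word: str, n: int, max_dist: int = 3):
    """Generate n distinct synthetic typos, each within max_dist edits of word."""
    typos = set()
    while len(typos) < n:
        t = word
        for _ in range(random.randint(1, max_dist)):
            t = one_edit(t)
        if t != word:
            typos.add(t)
    return typos

print(random_typos("programmer", 5))
```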
Pros and Cons
Pros
● The annotators get more structured data
● We can reduce annotation costs by a third
● We don’t have typos in our thesaurus!
● We can start the project very quickly
● It is not domain dependent
● It is language independent
● It can deal with “Hunglish” terms like “Szenior project manager”
Cons
● Requires LOTS OF DATA!!!
● Resource intensive
● It is hard to optimize your models
Useful stuff
● Peter Norvig: How to Write a Spelling Corrector http://norvig.com/spell-correct.html
● Deep Spelling https://medium.com/@majortal/deep-spelling-9ffef96a24f6
● Keras seq2seq https://github.com/farizrahman4u/seq2seq
● Gensim word2vec https://radimrehurek.com/gensim/models/word2vec.html
That’s All [email protected]
@zoltanvarjuhttps://www.linkedin.com/in/zol
tanvarju/
Company Blog https://blog.precognox.com/
Company Website
http://precognox.com/