Real-World Deep Learning for NLP
Zoltan Varju, Precognox
06.13.2017, Budapest Data Forum
About the title
● Real-World: it’s about an enterprise search problem, not about dogs and cats or interlingua
● Deep Learning: word2vec, seq2seq (no fancy new architectures)
● NLP: correcting typos, finding synonyms
The Problem
Semantic Search
● Job titles:
  ○ We have to find all possible titles for a given query
  ○ Java fejlesztő, Java programozó, Java Developer, Java Programmer, Junior Java Ninja, Rockstar Java Programmer
● Typos:
  ○ Queries are full of typos, especially on mobile
  ○ Programmer: porgrammer, programer, programme
● Broader terms:
  ○ If you can’t find anything for a given term, try a query with its broader term
  ○ Prolog Developer -> Developer, Senior Project Manager -> Project Manager
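The broader-term fallback can be sketched in a few lines. The mapping entries and the head-noun heuristic below are illustrative, not the real thesaurus:

```python
# Hypothetical fallback sketch: when a query returns no hits, retry with a
# broader term from a (made-up) broader-term map, or fall back to dropping
# the leading modifier, since the head noun usually comes last in these titles.
BROADER = {  # illustrative entries only
    "Prolog Developer": "Developer",
    "Senior Project Manager": "Project Manager",
}

def broaden(query: str) -> str:
    """Return a broader query for a too-specific job title."""
    if query in BROADER:
        return BROADER[query]
    # Crude generalization step: strip the leading modifier word.
    words = query.split()
    return " ".join(words[1:]) if len(words) > 1 else query

print(broaden("Prolog Developer"))   # -> Developer
print(broaden("Junior Java Ninja"))  # -> Java Ninja
```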
Search Backend @ Profession
● Job thesaurus:
  ○ Used for query expansion
  ○ Typos, synonyms, broader and narrower terms
  ○ Flat categories (e.g. Bank and Finance, Health, Agriculture)
● Solr index
● InfoHarvester (Hadoop-based crawler)
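Query expansion with a flat synonym/typo thesaurus can be sketched as an OR-expanded Solr-style query. The thesaurus entries and the field name `title` are made up for illustration:

```python
# Minimal query-expansion sketch over a flat thesaurus; the entries and the
# Solr field name ("title") are illustrative, not the production schema.
THESAURUS = {
    "java developer": ["java fejlesztő", "java programozó", "java programmer"],
    "programmer": ["porgrammer", "programer"],  # known typo variants
}

def expand(query: str, field: str = "title") -> str:
    """Build an OR-expanded Solr-style query string from the thesaurus."""
    terms = [query] + THESAURUS.get(query.lower(), [])
    return " OR ".join(f'{field}:"{t}"' for t in terms)

print(expand("Programmer"))
# title:"Programmer" OR title:"porgrammer" OR title:"programer"
```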
Thesaurus Building
● Use the query log to find similar queries
  ○ Use string similarity (e.g. Levenshtein distance) to generate a list of candidate pairs
  ○ Filter results
● Annotators use Thesaurus Manager
  ○ to determine the type of relation between the candidates (synonym, broader, narrower, or typo)
  ○ to eliminate false positives (about ⅔ of the generated data is false positive!)
  ○ to extend the thesaurus with synonyms based on common sense
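The candidate-generation step above can be sketched in plain Python; the query-log entries and the distance threshold are illustrative:

```python
# Pair up query-log entries whose Levenshtein distance is small; these pairs
# are the candidates the annotators later label. Threshold and log are toy values.
from itertools import combinations

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def candidate_pairs(queries, max_dist=2):
    """Yield query pairs close enough in edit distance to be related."""
    for q1, q2 in combinations(set(queries), 2):
        if levenshtein(q1, q2) <= max_dist:
            yield tuple(sorted((q1, q2)))

log = ["programmer", "porgrammer", "programer", "project manager"]
print(sorted(candidate_pairs(log)))
```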
Pros & Cons
Pros
● It works (we’ve been doing this for years)
● We are comfortable with such projects
● Requires minimal maintenance (software and data)
● The client can use the Thesaurus Manager to manage its data
Cons
● Building the first, basic thesaurus is extremely hard and slow
● Annotators have a hard time for weeks; they get frustrated, and we need lots of QA
● Hard work means high costs
There is more than one way to do it
Why?
● Our annotators get TOO MUCH data (because of the false positives), so
● they have to make LOTS OF DECISIONS.
● The ‘Typo’ relation is the hardest to check during thesaurus building; we want to do it automatically.
● We don’t want to fire them: reducing the time and costs of search projects means we can win more projects and provide the annotators with meaningful paid jobs
● We are doing spell checking, but we are interested in finding terms in the index
Finding semantic relations with word2vec
● Mikolov et al. 2013
● Preprocessing: extracting named entities, idioms, etc.
● Training with Gensim
● A simple similarity measure covers 96.7% of the relations in the original thesaurus
● 15% of the results are false positives
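The slide does not spell out the similarity measure; cosine similarity over the learned vectors (what Gensim's `most_similar` computes) is the usual choice. A stdlib sketch with made-up toy vectors standing in for trained embeddings:

```python
import math

# Toy 3-d vectors standing in for word2vec embeddings; in the real pipeline
# these would come from a trained Gensim model rather than a hand-written dict.
VEC = {
    "java developer":  [0.9, 0.1, 0.2],
    "java programmer": [0.85, 0.15, 0.25],
    "nurse":           [0.1, 0.9, 0.3],
}

def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def related(term, threshold=0.95):
    """Terms above the similarity threshold become candidate thesaurus relations."""
    return [w for w in VEC
            if w != term and cosine(VEC[term], VEC[w]) >= threshold]

print(related("java developer"))  # -> ['java programmer']
```

The threshold here is illustrative; in practice it would be tuned against the existing thesaurus, which is how a coverage figure like 96.7% gets measured.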
Spell checking with seq2seq
● Generate synthetic typos
  ○ Brute force: generate ALL the strings within three edit distance, and filter out real words from the results
  ○ Sophisticated: generate typos based on misspelling stats
  ○ Mixed: generate n random typos within three edit distance + use stats
● Train a seq2seq network on your data and evaluate it:
  ○ Brute force: 63%
  ○ Sophisticated: 81.3%
  ○ Mixed: 97.66%
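The random half of the mixed strategy can be sketched as below; the typo/correct-word pairs it yields are what a seq2seq spelling model trains on. The edit operations and alphabet are illustrative, and the real pipeline additionally weights edits by observed misspelling statistics:

```python
import random
import string

# Sketch of the "mixed" strategy's random component: n random typos, each
# within max_dist edits of the source word. Alphabet and ops are illustrative.
ALPHABET = string.ascii_lowercase

def one_edit(word: str) -> str:
    """Apply one random insert/delete/substitute/transpose edit."""
    i = random.randrange(len(word))
    op = random.choice(["insert", "delete", "substitute", "transpose"])
    if op == "insert":
        return word[:i] + random.choice(ALPHABET) + word[i:]
    if op == "delete" and len(word) > 1:
        return word[:i] + word[i + 1:]
    if op == "transpose" and i < len(word) - 1:
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    # Fall through to substitution.
    return word[:i] + random.choice(ALPHABET) + word[i + 1:]

def random_typos(word: str, n: int, max_dist: int = 3):
    """Generate n distinct synthetic typos, each within max_dist edits of word."""
    typos = set()
    while len(typos) < n:
        t = word
        for _ in range(random.randint(1, max_dist)):
            t = one_edit(t)
        if t != word:
            typos.add(t)
    return typos

print(random_typos("programmer", 5))
```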
Pros and Cons
Pros
● The annotators get more structured data
● We can reduce annotation costs by a third
● We don’t have typos in our thesaurus!
● We can start the project very quickly
● It is not domain dependent
● It is language independent
● It can deal with “Hunglish” terms like “Szenior project manager”
Cons
● Requires LOTS OF DATA!!!
● Resource intensive
● It is hard to optimize your models
Useful stuff
● Peter Norvig: How to Write a Spelling Corrector http://norvig.com/spell-correct.html
● Deep Spelling https://medium.com/@majortal/deep-spelling-9ffef96a24f6
● Keras seq2seq https://github.com/farizrahman4u/seq2seq
● Gensim word2vec https://radimrehurek.com/gensim/models/word2vec.html
That’s All [email protected]
@zoltanvarjuhttps://www.linkedin.com/in/zol
tanvarju/
Company Blog https://blog.precognox.com/
Company Website
http://precognox.com/