Be on Science’s Pulse with NLP or “What are the serious men talking about?” TETIANA KODLIUK - SEPTEMBER 24th, 2016
Apr 16, 2017
www.vitech.com.ua
Who are you? Why did we wake up so early?
Data Scientist
Lead of Data Scientists
Lecturer in Apache Spark
Lecturer in Math
Ph.D. in Math
Analyst
Active Wizards
We are living in a magnificent time! According to futuretimeline.net:
● 4000: Computer science is reaching its ultimate potential
● 2050: Robots take 50% of our jobs
● 2100: Human intelligence is being vastly amplified by AI
● 2150: Terraforming of Mars is underway
● 2200: Traditional employment is becoming obsolete
Artificial Intelligence is everywhere. Isn't it?
www.digitalgov.gov
image recognition
knowledge management
probabilistic reasoning
representation of human expression
robotics
speech to text
natural language processing
machine learning
Why NLP?
Main tasks in NLP:
● process text
● order text
● understand text
● extract information
● cluster text
● generate text
● summarize text
...
http://milestoneseducation.com/
What is NLP? Jumping NLP Curves (Stanford, 2014)
http://sentic.net/jumping-nlp-curves.pdf
So our idea was born...
Extract the trends from scientific publications
https://www.ucl.ac.uk/human-evolution/
Keywords extraction
Keyword (keyphrase) extraction is tasked with the automatic identification of terms (phrases)
that best describe the subject of a document
www.socialappshq.com
Everyone needs it...
I want to know only KEY news.
I want to know KEY problems of customers.
I want to know KEY approaches in medicine.
I want to know KEY trends in Science.
YOU need it, but you don’t know...
waittilyouhearthis.com
Let’s talk about a rough way
www.123rf.com
Main problems: native language
language la langue valoda لغة мова ভাষা ulimi ภาษา言語 语言 sprache unicode:)
wikipedia.org
Main problems: polysemy
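Ukrainian idioms whose literal meaning differs from the intended one:
Водити за ніс (literally "to lead by the nose": to deceive)
Розвісити вуха (literally "to hang out one's ears": to listen gullibly)
Точити зуби (literally "to sharpen one's teeth": to bear a grudge)
Клювати носом (literally "to peck with one's nose": to nod off)
Робити з мухи слона (literally "to make an elephant out of a fly": to exaggerate)
Прикусити язик (literally "to bite one's tongue": to fall silent)
Чесати язики (literally "to scratch tongues": to gossip)
Пускати бісики (literally "to let out little devils": to flirt with one's eyes)
Не в своїй тарілці (literally "not in one's own plate": to feel out of place)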
facebook.com
Corpus
wikipedia.org
ArXiv Categories
Category Number of subcategories
1 Statistics 5
2 Quantitative Biology 10
3 Computer Science 36
4 Nonlinear Sciences 5
5 Mathematics 32
6 Physics 39
ArXiv submissions per category
http://arxiv.org/stats
[Figure: two panels plotted against Year: Submissions and Fractional Submissions]
Computer Science submissions
http://arxiv.org/stats
[Figure: two panels plotted against Year: Submissions and Fractional Submissions]
What does the input data look like?
arXiv.org
Atom structure looks like this
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <updated>2016-08-26T00:00:00-04:00</updated>
  <entry>
    <id>http://arxiv.org/abs/cond-mat/0102536v1</id>
    <updated>2001-02-28T20:12:09Z</updated>
    <published>2001-02-28T20:12:09Z</published>
    <title>Impact of Electron-Electron Cusp on Configuration Interaction Energies</title>
    <summary>The effect of the electron-electron cusp on the convergence of configuration interaction (CI) wave functions is examined. By analogy with the pseudopotential approach for electron-ion interactions, an effective electron-electron interaction is developed which closely reproduces the scattering of the Coulomb interaction but is smooth and finite at zero electron-electron separation.</summary>
    <author>
      <name>David Prendergast</name>
      <arxiv:affiliation xmlns:arxiv="http://arxiv.org/schemas/atom">Department of Physics</arxiv:affiliation>
    </author>
    <author>
      <name>M. Nolan</name>
      <arxiv:affiliation xmlns:arxiv="http://arxiv.org/schemas/atom">NMRC, University College, Cork, Ireland</arxiv:affiliation>
    </author>
  </entry>
</feed>
arXiv.org
Where can you find the data?
Bulk Metadata Access
● ArXiv API
● OAI-PMH
● RSS
wikipedia.org
You can use these queries
● ArXiv API
Url = "http://export.arxiv.org/api/query?id_list={}"
● OAI
Url = "http://export.arxiv.org/oai2?verb=ListRecords&from={}&until={}&metadataPrefix=arXiv"
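A minimal sketch of how the two query templates might be filled in. The function names `api_url` and `oai_url` are illustrative and not from the talk; only the endpoints and parameter names come from the slide.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"
ARXIV_OAI = "http://export.arxiv.org/oai2"


def api_url(paper_ids):
    """Build an arXiv API URL for a list of paper ids (fills id_list={})."""
    # The ids are joined with commas; "," and "/" are kept unescaped
    # so that old-style ids such as cond-mat/0102536 stay readable.
    query = urlencode({"id_list": ",".join(paper_ids)}, safe=",/")
    return ARXIV_API + "?" + query


def oai_url(date_from, date_until):
    """Build an OAI ListRecords URL for a date range (fills from={} and until={})."""
    query = urlencode({
        "verb": "ListRecords",
        "from": date_from,
        "until": date_until,
        "metadataPrefix": "arXiv",
    })
    return ARXIV_OAI + "?" + query
```

For example, `oai_url("2016-08-01", "2016-08-26")` produces the OAI query for records in that date range.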
www.metalinjection.net
It was pretty easy, but...
Problems
➔ The versions of the papers can change at any moment
➔ arXiv.org does not allow scraping many atoms at once
➔ There can be problems with the internet connection
➔ There can be extra characters in abstracts, author names, titles, etc.
➔ The category names can change: "adap-org" = "nlin.AO", "Q-alg" = "math.QA"
➔ Paper ids can have different formats: 1606.04426, 160323
Solutions
★ We update the versions of the papers each month
★ We wait 20 seconds between requests for new atoms
★ We retry after 30 seconds each time a request fails
★ We use encoding
★ We learn the changes and map them
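The waiting, retrying, and category-mapping steps could look roughly like the sketch below. The 20 and 30 second delays and the two category aliases come from the slide; the function names, the `max_retries` limit, and the injectable `fetch` and `sleep` arguments are assumptions made so the sketch can be tried without network access.

```python
import time
from urllib.request import urlopen

REQUEST_DELAY = 20  # seconds to wait between requests (from the slide)
RETRY_DELAY = 30    # seconds to wait before retrying (from the slide)

# Old category names mapped to current ones (the two examples from the slide).
CATEGORY_ALIASES = {"adap-org": "nlin.AO", "Q-alg": "math.QA"}


def normalize_category(name):
    """Return the current name of a category, or the name itself if unknown."""
    return CATEGORY_ALIASES.get(name, name)


def _download(url):
    # Decode leniently so stray bytes in abstracts do not abort the harvest.
    return urlopen(url).read().decode("utf-8", errors="replace")


def fetch_with_retry(url, max_retries=3, fetch=None, sleep=time.sleep):
    """Fetch a URL, sleeping RETRY_DELAY seconds after each network error."""
    fetch = fetch or _download
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except OSError:  # urllib.error.URLError is a subclass of OSError
            if attempt == max_retries:
                raise
            sleep(RETRY_DELAY)


def harvest(urls, fetch=None, sleep=time.sleep):
    """Fetch URLs one by one, pausing REQUEST_DELAY seconds between them."""
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(REQUEST_DELAY)
        pages.append(fetch_with_retry(url, fetch=fetch, sleep=sleep))
    return pages
```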
Actual methods for keyphrases extraction
Statistical methods
RAKE
TextRank, KeyRank
Supervised Machine Learning
Neural Networks
Text for keyphrases
onmogul.com
How does Unsupervised ML work?
www.acfe.com
Methods
Words, N-grams, phrases, POS patterns, Named entities, ...
Frequencies, Weights: TF-IDF, BM25, Rank, ...
Manual, Supervised, Depended, ...
Where
techchunks.com
Where am I?
I thought she would speak about serious men...
Some strange tables and not a word about men...
Frequencies
techchunks.com
Keyphrase Freq
in this paper, we propose 597
in this paper, we present 323
this paper, we present a 187
we consider the problem of 179
in this paper, we study 163
in this paper we present 149
in this paper we propose 143
the effectiveness of the proposed 134
to the best of our 128
Keyphrase Freq
позич менi сонце в моє 17
в моє вiконце є, є 13
я до тебе, я до 12
тебе нема... коли тебе нема... 12
маю то, я маю то, 12
до тебе, я до тебе, 12
я до тебе, я до 12
коли тебе нема... коли тебе 12
менi сонце в моє вiконце 11
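Raw frequency tables like the ones above can be produced by counting word n-grams over all texts. The sketch below is one simple way to do it, assuming whitespace tokenization and lowercasing; the function names are illustrative, and the talk does not say how the counts were actually computed.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_frequencies(texts, n=5):
    """Count how often each word n-gram occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(text.lower().split(), n))
    return counts
```

Without stopword removal, boilerplate such as "in this paper, we propose" dominates the counts, which is what the tables show.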
Frequencies (punctuation and stopwords deleted, duplicates removed)
techchunks.com
Keyphrase Freq Keyphrase Freq
simultaneous wireless information power transfer 24 позич менi сонце вiконце 10
alternating direction method multipliers admm 20 бебі онлайн твоє лібідо 7
cloud radio access network cran 19 вільний син полетить 7
orthogonal frequency division multiplexing ofdm 17 стіна впала 7
wireless information power transfer swipt 16 ховалися зими 18 хвилин листопад 7
deep convolutional neural networks cnn 14 небi сяє нова зоря 7
improve oncurrent gram modeling techniques 12 десь живе свiтло дискотек бiблiотек 7
polynomial depth quantum circuit effects 10 я не здамся без бою 7
maximum likelihood reestimation procedure belonging 10 тінь твого тіла 7
TF-IDF
TF-IDF = TF*IDF
Term Frequency
TF(t) = Count of term t in a document
Inverse Document Frequency
IDF(t) = log[Total number of documents / Number of documents that contain term t]
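A small sketch of the computation, using the standard definition IDF(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. The function name and the token-list input format are assumptions for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute TF-IDF per document.

    documents: list of token lists.
    Returns one {term: score} dict per document.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    scores = []
    for tokens in documents:
        term_freq = Counter(tokens)  # TF(t): count of t in this document
        scores.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return scores
```

A term that occurs in every document gets IDF = log(1) = 0, so it never ranks as a keyphrase no matter how frequent it is.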
TF-IDF
techchunks.com
Keyphrase TF-IDF Keyphrase TF-IDF
complexity extension finite fields note 0.26 коли тебе нема 0.57
upper bounds symmetric bilinear complexity 0.25 так знай другом твоїм 0.54
note bounds asymptotical uniform discuss 0.25 обійми мене 0.50
bounds asymptotical uniform discuss validity 0.25 стіна впала між нами 0.46
fields note bounds asymptotical uniform 0.25 не йди 0.45
establish new upper bounds symmetric 0.25 позич менi сонце моє вiконце 0.44
large number works address rdf 0.24 більше не можу без тебе 0.43
reduced problem application propose heuristic 0.24 твого тіла тінь 0.40
encode data semantic web data 0.24 моя сьюзi мила 0.37
Graph-based ranking method
TextRank
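As a rough illustration of graph-based ranking, the sketch below builds a word co-occurrence graph and runs a PageRank-style iteration over it. It ranks words rather than sentences; for key sentences the same iteration applies with sentences as nodes and a sentence similarity as the edge. The window size, damping factor, and iteration count are assumed defaults, not values from the talk.

```python
from collections import defaultdict


def textrank(tokens, window=1, damping=0.85, iterations=50):
    """Rank words by a PageRank-style score on a co-occurrence graph."""
    # Link each word to the `window` words that follow it (undirected edges).
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        # Each word receives an equal share of every neighbor's score.
        scores = {
            word: (1 - damping) + damping * sum(
                scores[v] / len(neighbors[v]) for v in neighbors[word]
            )
            for word in neighbors
        }
    return scores
```

Words that co-occur with many different words end up with the highest scores.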
techchunks.com
TextRank
techchunks.com
Key-sentences
In this study, the historical roots of tribology are investigated using a newly developed scientometric method called Referenced Publication Years Spectroscopy.
It was found that for one group of contacts the oscillation amplitude nonmonotonously increases with the bias voltage increase.
Cortical Learning Algorithms based on the Hierarchical Temporal Memory, HTM have been developed by Numenta Incorporation from which variations and modifications are currently being investigated
While our Local Degree method is best for preserving connectivity and short distances, other newly introduced local variants are best for preserving the community structure
We propose a combined caching scheme where part of the available cache space is reserved for caching the most popular content in every SBS
Key-sentences
Доля таки зовсім не зла
Бо навіть мить для мене свято
Ми будуєм і ламаєм знов
Нам казали не чекайте, всі давно вже тут.
Там, де кожен день, там де кожен, як останній день
Ми будуем і ламаєм знов, як не як свою любов.
І знову жива вода змиває старі сліди, І я думаю так буде завжди.
Так ся дивлю за тобою, Що й не мушу казати слів.
Не показуйте жаль, не впадайте в екстаз, Ми сьогодні залишимо більше для нас.
Ти там віддала більше ніж могла мені.
Rapid Automatic Keyword Extraction (RAKE)
Author: Stuart Rose (2010)
● Unsupervised method for extracting keywords
● Incorporates co-occurrence and frequency of words
Candidate keyword selection
RAKE partitions the text using stop words and phrase delimiters
Candidate keyword selection
Compatibility of systems of linear constraints over the set of natural numbers.
Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types of systems and systems of mixed types.
Compatibility – systems – linear constraints – set – natural numbers – Criteria – compatibility – system – linear Diophantine equations – strict inequations – nonstrict inequations – Upper bounds – components – minimal set – solutions – algorithms – minimal generating sets – solutions – systems – criteria – corresponding algorithms – constructing – minimal supporting set – solving – systems – systems
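The split shown above can be sketched as follows: punctuation acts as a phrase delimiter, and stop words cut each fragment into candidate phrases. The tiny `STOP_WORDS` set is an assumption chosen for this example; a real run would use a full stop-word list.

```python
import re

# Illustrative stop-word list only; not the list used in the talk.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "of", "over", "the"}


def candidate_keywords(text, stop_words=STOP_WORDS):
    """Split text into candidate phrases at punctuation and stop words."""
    candidates = []
    # Phrase delimiters: cut the text at punctuation first.
    for fragment in re.split(r"[.,;:!?()\[\]\"]+", text.lower()):
        phrase = []
        for word in fragment.split():
            if word in stop_words:
                # A stop word closes the phrase collected so far.
                if phrase:
                    candidates.append(" ".join(phrase))
                phrase = []
            else:
                phrase.append(word)
        if phrase:
            candidates.append(" ".join(phrase))
    return candidates
```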
Candidates scoring
Score of phrase = SUM(score(word))
Metrics for calculating word scores:
1. word frequency: freq(w)
2. word degree: deg(w)
3. ratio of degree to frequency: deg(w)/freq(w)
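A sketch of the scoring step using the third metric, deg(w)/freq(w). Here deg(w) is taken as the total length of the candidate phrases that contain w, so each word counts its own occurrence, as in the original RAKE description. The function name is illustrative.

```python
from collections import defaultdict


def rake_scores(candidates):
    """Score candidate phrases as the sum of deg(w)/freq(w) over their words."""
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in candidates:
        words = phrase.split()
        for word in words:
            freq[word] += 1
            # Co-occurrences inside this phrase, counting the word itself.
            degree[word] += len(words)
    word_score = {word: degree[word] / freq[word] for word in freq}
    return {
        phrase: sum(word_score[word] for word in phrase.split())
        for phrase in set(candidates)
    }
```

Because every word adds its own score to the phrase total, longer phrases tend to score higher.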
Additional option: adjoining keywords
1. Find the same sequences
2. Join keyphrases
3. Score
RAKE
techchunks.com
Keyphrase RAKE
spectral amplitude coding optical code division multiple access networks intelligent pinning based cooperative secondary control 51.64
service function localization enabling fine grained rdf data completeness assessment balanced ranking mechanisms convolutional neural networks 51.33
strongly magnetized neutron stars powering superluminous supernovae remarkable magnetostructural coupling 49.85
randomized version room temperature tetragonal noncollinear antiferromagnet ptmnga optimal system maneuver 49.62
Keyphrase RAKE
мiнiмум волi за мiнiмум долi 10.56
вона хотiла знiматись в кiно 10.04
вiн менi свої пiснi спiвав 8.83
я-а-а почую голос твiй 7.72
мало-мало-мало менi 7.72
вiсiм днiв вiн її шукав 7.07
усi знайомi ледь знайомi 7.07
ключ не пiдiйшов а може й не посмiв 7.07
я слухав звук дощу i Бiллi Холiдей 7.07
якби тодi сказала ти менi стати тiнню в ночi 7.07
але ж вiзьми собi з полицi молодiсть 7.07
What is the best method?
Method | Advantage | Disadvantage
TF-IDF | Important keyphrases extraction, n-grams possible | Candidates extraction
TextRank | Co-occurrences calculating | Importance missing
RAKE | Frequencies, co-occurrences calculating, candidates extraction | Long phrases
Supervised ML | High score of extraction | Train dataset is needed
RAKE
techchunks.com
Keyphrase Weight Keyphrase Weight
massive multiple input output 7.25 все буде добре 1
long short term memory architecture 5.12 коли тебе нема 1
live action virtual reality games 3.15 небо над дніпром 1
low rank hankel matrix completion 3.04 хочу напитись тобою 0.78
multi point wireless energy transmission 3.01 жити без мети 0.78
tree augmented naive bayes classifier 2.89 мила моя сьюзі 0.78
long short term memorized fusion 2.15 тінь твого тіла 0.75
fine grained entity type classification 1.51 коли настане день 0.75
high speed railway communication systems 1.27 кожну хвилину життя 0.75
partially observable markov decision process 1.13 коли тобі важко 0.75
How to join different methods?
Input text RAKE TF-IDF
Keyphrase 1
Keyphrase 2
Keyphrase 3
Keyphrase 4
Keyphrase 5
RAKE weight TF-IDF score
Text for keyphrases selecting
[Diagram: for each Month \ Year and Category, the titles and abstracts (Title 1, Abstract 1, ..., Title n, Abstract n) yield Keyphrase 1 ... Keyphrase 100, followed by Duplicates removing and Similar keyphrases]
Text preprocessing
● Deleting formulas
● Encoding
● Removing some punctuation
● Removing extra spaces
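The four preprocessing steps might be sketched as below. The exact rules are assumptions: formulas are taken to be inline LaTeX between dollar signs, "encoding" is interpreted as Unicode normalization, and the set of punctuation that is kept is chosen arbitrarily for the example.

```python
import re
import unicodedata


def preprocess(text):
    """Clean an abstract before keyphrase extraction."""
    # Encoding: unify equivalent Unicode forms (assumed interpretation).
    text = unicodedata.normalize("NFKC", text)
    # Formulas: drop inline LaTeX written between dollar signs.
    text = re.sub(r"\$[^$]*\$", " ", text)
    # Punctuation: keep word characters and basic sentence punctuation only.
    text = re.sub(r"[^\w\s.,;:!?'-]", " ", text)
    # Extra spaces: collapse any run of whitespace into one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()
```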
Duplicates removing
Bag of Words
Keyphrase 1, Keyphrase 2, ..., Keyphrase n, ..., Keyphrase 100
      | Occurrence in Keyphrase1 | Occurrence in Keyphrase2
word1 | 1 | 1
word2 | 0 | 0
wordn | 1 | 0
Duplicates removing
cosine_similarity = cosine_sim(keyphrase1, keyphrase2)
filter(similarity >= 0.8)
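A minimal sketch of this duplicate filter on bag-of-words vectors. It assumes that a keyphrase is dropped when its cosine similarity to an already kept keyphrase is at least 0.8; the slide gives only the threshold, so the greedy keep-first strategy and the function names are illustrative.

```python
import math
from collections import Counter


def cosine_sim(phrase_a, phrase_b):
    """Cosine similarity between the bag-of-words vectors of two phrases."""
    a, b = Counter(phrase_a.split()), Counter(phrase_b.split())
    dot = sum(a[word] * b[word] for word in a)
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


def remove_duplicates(keyphrases, threshold=0.8):
    """Keep a keyphrase only if it is not too similar to one already kept."""
    kept = []
    for phrase in keyphrases:
        if all(cosine_sim(phrase, other) < threshold for other in kept):
            kept.append(phrase)
    return kept
```

The input order matters: when keyphrases are sorted by weight first, the highest-weighted variant of each near-duplicate group is the one that survives.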
Trends Recommendations
Word2Vec
stats.stackexchange.com
Recommendations: Word2Vec
vene.ro
Recommendations: Word2Vec
Wikipedia + Gigaword 5: number of dimensions: 300, window size: 10
Wikipedia: number of dimensions: 1000, window size: 10
ArXiv (abstracts): number of dimensions: 300, window size: 10
Additional rules for similarity
1. A year was chosen as the period within which similar statements are searched for
2. The cosine distance between sets of words is calculated
3. The minimum cosine similarity between statements is 0.70
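One possible reading of rules 2 and 3 is sketched below: each statement is represented by the mean of its word vectors, and two statements count as similar when the cosine similarity of those means is at least 0.70. The averaging step, the function names, and the toy two-dimensional vectors in the usage example are assumptions; the talk does not say how the sets of words were compared.

```python
import math


def mean_vector(words, vectors):
    """Average the vectors of the words that are in the vocabulary."""
    known = [vectors[word] for word in words if word in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def are_similar(statement_a, statement_b, vectors, threshold=0.70):
    """True when the mean word vectors of two statements are close enough."""
    a = mean_vector(statement_a.split(), vectors)
    b = mean_vector(statement_b.split(), vectors)
    if a is None or b is None:
        return False
    return cosine(a, b) >= threshold
```

With toy vectors such as `{"neural": [1.0, 0.0], "network": [0.9, 0.1], "galaxy": [0.0, 1.0]}`, "neural network" and "network" are judged similar while "neural network" and "galaxy" are not. In practice `vectors` would come from a trained Word2Vec model.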
Text for keyphrases
onmogul.com
Science Pulse analytics
Keyphrase-Atom relationships
Mean number of Atoms per Keyphrase = 1.6178478064
Max number of Atoms per Keyphrase = 690
Min number of Atoms per Keyphrase = 1
onmogul.com
Keyphrase-Atom relationships
KeyphraseId Value AtomNumber CategoryId
2106584 low energy effective field theory 690 6
2250344 black hole mass bulge 641 6
2190498 star formation rate stellar mass 460 6
2099199 proposed algorithm outperforms state art 362 3
2106533 ultra high energy cosmic ray 336 6
2118614 high energy proton collisions 311 6
1534841 coupled dark matter energy 290 6
1534818 low mass standard model higgs 289 6
2118593 low energy heavy ion collisions 276 6
2153778 high energy gamma ray sources 270 6
Atom-Keyphrases relationships
Mean number of Keyphrases per Atom = 1.69036455056
Max number of Keyphrases per Atom = 36
Min number of Keyphrases per Atom = 0
onmogul.com
Atom-Keyphrases relationships
AtomId AtomTitle KeyphraseNumber
1145661 Active galactic nuclei at gamma-ray energies 36
774479 Ultrahigh-Energy Cosmic Rays from the "En Caul" Birth of Magnetars 35
784859 The IBM 2016 Speaker Recognition System 32
777367 Hybrid Digital and Analog Beamforming Design for Large-Scale Antenna Arrays 32
778646 Chandra X-ray and Hubble Space Telescope Imaging of Optically Selected Kiloparsec-Scale Binary Active Galactic Nuclei II: Host Galaxy Morphology and AGN Activity 31
Keyphrases Weights
Id Value Weight CalculationId CategoryId
2250394 maximally supersymmetric yang mills theory 1.48377 829 6
2250395 silver coated cds quantum dots 1.48377 829 6
2250396 gauss bonnet ads black hole 1.44081 829 6
2250397 exchange coupled ferri ferromagnetic composite 1.43178 829 6
2250398 hollow core photonic crystal fiber 1.43178 829 6
2250399 neutrinoless double beta minus decays 1.43178 829 6
2250404 cmos monolithic active pixel sensors 1.43178 829 6
2250405 high dimensional quantum key distribution 1.40149 829 6
2250406 body reduced density matrix functional 1.39866 829 6
2250407 hybrid halide perovskite solar cells 1.0000 829 6
2250408 spatially weighted generalized robertson walker 1.0000 829 6
Processing time
The duration of the general process is about 16 hours
Category: Physics
Year: 2016
Number of atoms: 34000
Time of keyphrases calculating: 2 minutes
Time of Atom searching: 4 minutes
Time of results saving: 2 minutes
AMD FX(tm)-6100 Six-Core Processor
RAM: 16 GB
Components
● Harvester: responsible for keyphrase extraction
● Visualization: responsible for the application
● MySQL is used as storage
Visualization architecture
[Diagram: Front end, REST API, search, queries, cache, ArXiv API, Google Mail]
The best V.I.Tech team
www.talentedladiesclub.com
Summary
www.brainyquote.com
Plans for the future
● Add trends that people search for
● Add trends extraction per subcategory
● Add trends analysis of other sources
● Add author analysis and H-index calculation
● Add Google Analytics
● Scoring
blazepress.com
Ask this question
What keyphrases could you extract from our talk?
http://blog.cleveland.com/
Keyphrases Weight
natural language processing keywords extraction 1.0
method rapid automatic keyword extraction 1.0
scientific organizations explore artificial intelligence 0.7
scientists e-print repository arXiv 0.7
extracting hot topics 0.67
economize scientists time 0.67
human product generally text data 0.67
Keyphrases from our talk by SciencePulse
Useful links
http://sciencepulse.vitech.com.ua/
http://ijarcsse.com/docs/papers/Volume_6/5_May2016/V6I5-0392.pdf
https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf
https://hassetukda.wordpress.com/2012/09/24/ukda-keyword-indexing-with-a-skos-version-of-hasset-thesaurus/