Life-Cycles and Mutual Effects of Scientific Communities
Václav Belák, Marcel Karnstedt, Conor Hayes
Digital Enterprise Research Institute, NUI Galway
ASNA 2010, Zürich
May 11, 2015
Introduction Methodology Data-Set Results Conclusion and FW
Motivation
Progress in science is often measured by citation measures, which are relatively static
Detecting and explaining the evolution and life-cycles of communities provides better arguments for progress
Previous approaches focused mainly on analysing co-citation graphs or textual clustering
Little work has been done on the analysis of cross-community effects
Kuhn [5] claimed that the development of scientific knowledge proceeds in discrete steps:
1 Pre-paradigm period
2 Paradigm period (normal science)
3 Crisis
4 Reaction to the crisis (paradigm shift)
Cross-Community Effects I
[Figure: expected phenomena, in two panels: (a) community shift; (b) community merge (with community shift)]
Cross-Community Effects II
Although inspired by Kuhn, we expected the evolution of communities to take a rather milder form
Instead of paradigm shift, we were looking for community shift
Community merge is a complementary phenomenon, but a rather uninteresting one
  Thus, combinations of shifts with subsequent merges, i.e. community shift/merges, were investigated instead
Instead of paradigm articulation, we were looking for community specialization
Co-citation networks of two big camps in CS were analysed: Semantic Web (solution-driven) and Information Retrieval (problem-driven) [1]
Outline
1 Methodology
2 Data-Sets
3 Results
4 Conclusion and Future Work
Initial Expectations & Requirements
The methodology was developed with a set of requirements arising from the nature of the problem:
1 Dynamic data-set represented by snapshots of several consecutive time-steps
2 Communities have to be identified in the network in each time-step
3 Authors (nodes in general) have to be uniquely identified across all time-steps
4 For topical analysis, meta-data (topics) describing the nodes are necessary
Community Detection
We identified communities using three popular algorithms:
  Infomap [7]
  Louvain [2]
  WT [8]
All have publicly available implementations, are able to operate over weighted networks, and produce non-overlapping communities
In each time-step $t$, we identified a clustering $C^t$ of $n$ communities: $C^t = \{c^t_1, c^t_2, \ldots, c^t_n\}$, where $n$ is determined automatically for each time-step
Tracking of Dynamic Communities
Communities are identified independently for each time-step. It is thus necessary to track the evolution of each community across subsequent time-steps
Communities were matched according to the highest Jaccard coefficient:
$$\mathrm{match}(c^t_i) = \arg\max_{c^{t+1}_j \in C^{t+1}} \frac{|c^t_i \cap c^{t+1}_j|}{|c^t_i \cup c^{t+1}_j|}$$
Important ancestors and descendants were identified by a modified Jaccard coefficient:
$$\mathrm{ancestor}(c^t_i, c^{t+1}_j) = \frac{|c^t_i \cap c^{t+1}_j|}{|c^{t+1}_j|}, \qquad \mathrm{descendant}(c^t_i, c^{t+1}_j) = \frac{|c^t_i \cap c^{t+1}_j|}{|c^t_i|}$$
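The matching and the ancestor/descendant coefficients above can be sketched with plain Python sets. This is only an illustration under stated assumptions: the community labels, the author identifiers, and the tie-breaking rule (first maximum wins) are hypothetical and not taken from the slides.

```python
from typing import Dict, FrozenSet, Hashable, Optional

Community = FrozenSet[Hashable]


def jaccard(a: Community, b: Community) -> float:
    """Jaccard coefficient |a & b| / |a | b| (0.0 if both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def ancestor(c_t: Community, c_next: Community) -> float:
    """Share of the members of c_next that were already in c_t."""
    return len(c_t & c_next) / len(c_next) if c_next else 0.0


def descendant(c_t: Community, c_next: Community) -> float:
    """Share of the members of c_t that are found in c_next."""
    return len(c_t & c_next) / len(c_t) if c_t else 0.0


def match(c_t: Community, next_clustering: Dict[str, Community]) -> Optional[str]:
    """Label of the community in the next time-step with the highest Jaccard
    coefficient; ties go to the first maximum (an assumption)."""
    if not next_clustering:
        return None
    return max(next_clustering, key=lambda label: jaccard(c_t, next_clustering[label]))


# Toy data with hypothetical author identifiers.
c6 = frozenset({"a", "b", "c", "d", "e"})
clustering_next = {
    "c6": frozenset({"a", "b", "c", "x"}),
    "c26": frozenset({"d", "e", "y", "z"}),
}
best = match(c6, clustering_next)
anc = ancestor(c6, clustering_next["c26"])
desc = descendant(c6, clustering_next["c26"])
print(best, anc, desc)
```

In the toy data, two of the four members of the new community come from the old one, and two of the old community's five members moved there, which is exactly the asymmetry the two modified coefficients are meant to capture.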
Visualization
To compare and inspect the state of the network in different time-steps, a proper visualization is very helpful
  Nodes that appeared previously should have similar positions
  Colours denoting the affiliation of a node to its cluster should be preserved
As we did not find any existing tool implementing these requirements, we built our own based on JUNG
Another tool based on Graphviz was built to automatically create diagrams of ancestors and descendants based on the respective relations
Topic Detection I
We mined keywords using NLP techniques [3] from the abstracts or full-texts of almost 70% of the underlying articles
Tokenised and stemmed [6] keywords were then assigned to each author
The ability of keywords to discriminate authors was ranked according to their frequency (TF) and uniqueness in the corpus (IAF): TF-IAF
Each author $a$ in time-step $t$ was thus described by a bag-of-words vector $k^t_a$
The topical description of a cluster $c$ was obtained as the centroid of its members
Cosine similarity was used to determine the topical similarity of two clusters
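The pipeline on this slide (author vectors, cluster centroids, cosine similarity) can be sketched as follows. The slides do not give the exact TF-IAF formula, so the weighting below, TF times log(N / AF) with authors playing the role of documents, is an assumption modelled on TF-IDF; the author names and stemmed keywords are made up.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, List

Vector = Dict[str, float]


def tf_iaf(author_keywords: Dict[str, List[str]]) -> Dict[str, Vector]:
    """Weight each author's keywords by TF * log(N / AF).

    Assumed weighting: N is the number of authors and AF the number of
    authors who use the keyword at least once.
    """
    n_authors = len(author_keywords)
    author_freq = Counter(k for kws in author_keywords.values() for k in set(kws))
    vectors: Dict[str, Vector] = {}
    for author, kws in author_keywords.items():
        tf = Counter(kws)
        vectors[author] = {k: tf[k] * math.log(n_authors / author_freq[k]) for k in tf}
    return vectors


def centroid(vectors: Iterable[Vector]) -> Vector:
    """Component-wise mean of sparse vectors (empty dict for no vectors)."""
    vectors = list(vectors)
    total: Dict[str, float] = defaultdict(float)
    for v in vectors:
        for k, w in v.items():
            total[k] += w
    return {k: w / len(vectors) for k, w in total.items()}


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two sparse vectors (0.0 if either is all zero)."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical authors with stemmed keywords.
authors = {
    "alice": ["ontolog", "rdf", "rdf"],
    "bob": ["ontolog", "queri"],
    "carol": ["retriev", "queri"],
    "dave": ["retriev", "rank"],
}
vectors = tf_iaf(authors)
sw_topic = centroid([vectors["alice"], vectors["bob"]])
ir_topic = centroid([vectors["carol"], vectors["dave"]])
similarity = cosine(sw_topic, ir_topic)
print(round(similarity, 3))
```

Under this weighting, a keyword used by every author gets weight zero, so only the shared keyword "queri" contributes to the similarity of the two toy clusters.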
Topic Detection II
Interpretation of a cluster's topic was based on characterizing keywords, a union of:
  the 20 highest ranked keywords
  the 20 most frequent keywords
We were particularly interested in cross-community activity between the IR and SW camps
The definition of what is an IR-related and what an SW-related community was based on frequent patterns mined from the publications
Any event detected by the community topic evolution measures and associated with both IR- and SW-related communities was then considered an inter-camp dynamic
Meta-data was used to assess the quality of the clusterings; WT was omitted from further analysis
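The selection of characterizing keywords described at the top of this slide can be sketched as a union of two top-k lists. The sketch assumes that "highest ranked" refers to the TF-IAF weight in the cluster's centroid and "most frequent" to raw keyword counts within the cluster; the function name and the toy numbers are hypothetical.

```python
from collections import Counter
from typing import Dict, Set


def characterizing_keywords(
    weights: Dict[str, float], counts: Dict[str, int], k: int = 20
) -> Set[str]:
    """Union of the k highest-weighted and the k most frequent keywords."""
    top_ranked = sorted(weights, key=weights.get, reverse=True)[:k]
    most_frequent = [kw for kw, _ in Counter(counts).most_common(k)]
    return set(top_ranked) | set(most_frequent)


# Toy cluster: discriminative keywords are rare, frequent ones are generic.
weights = {"rdf": 0.9, "ontolog": 0.7, "web": 0.1, "system": 0.05}
counts = {"web": 40, "system": 35, "rdf": 5, "ontolog": 4}
keywords = characterizing_keywords(weights, counts, k=1)
print(sorted(keywords))
```

Taking the union keeps both the discriminative and the generic vocabulary of a cluster, which may explain why general words such as "system" or "information" appear next to specific ones in the tag clouds.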
Measures
Overlap measures induce a huge number of interactions between communities
The solution is to apply more specific measures or to use the simple ones in combination
We developed and/or used two categories of measures:
1 community life-cycle measures for measuring and explaining the state and the evolution of a community
2 community topic evolution measures for revealing cross-community phenomena like community shift
Community Life-Cycle Measures
Structural perspective:
  size $S$
  average vertex betweenness $B$, $R_B \in \mathbb{R}^+$
  relative density $\rho$, $R_\rho \in [0, 1]$
  author entropy $A$, $R_A \in [0, 1]$
Topical perspective:
  topic drift $T$, $R_T \in [0, 1]$
  cluster content ratio $H$, $R_H \in \mathbb{R}^+$
Community Topic Evolution Measures
We looked for parallel changes in the structure and topic of communities
Structural and topical measures were combined by multiplication, for simplicity and because the range remains within [0, 1]
Community shift $P_S$ may be detected as the emergence of a new community topically distinct from its ancestor:
$$P_S(c^t_i, c^{t+1}_j) = \mathrm{dissim}(c^t_i, c^{t+1}_j) \times \mathrm{ancestor}(c^t_i, c^{t+1}_j)$$
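A minimal sketch of the community shift measure follows. The slides do not spell out dissim; the sketch assumes it is one minus the cosine similarity of the two topic centroids, which keeps the product within [0, 1] for non-negative keyword weights. All names and data are hypothetical.

```python
import math
from typing import Dict, Set

Vector = Dict[str, float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two sparse vectors (0.0 if either is all zero)."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def dissim(u: Vector, v: Vector) -> float:
    """Assumed topical dissimilarity: 1 - cosine similarity."""
    return 1.0 - cosine(u, v)


def ancestor(members_t: Set[str], members_next: Set[str]) -> float:
    """Share of the new community's members that came from the old one."""
    return len(members_t & members_next) / len(members_next) if members_next else 0.0


def community_shift(
    members_t: Set[str], topic_t: Vector, members_next: Set[str], topic_next: Vector
) -> float:
    """P_S = dissim(topics) * ancestor(members)."""
    return dissim(topic_t, topic_next) * ancestor(members_t, members_next)


# Toy case: 8 of the 10 members of the new community come from the old one,
# and the two topic vectors share no keyword.
old_members = {f"a{i}" for i in range(10)}
new_members = {f"a{i}" for i in range(8)} | {"n1", "n2"}
p_s = community_shift(old_members, {"retriev": 1.0}, new_members, {"ontolog": 1.0})
print(p_s)
```

The product is high only when both factors are high, so a topically unchanged offshoot or a topically distinct community with few inherited members both score low.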
Community Topic Evolution Measures II
Community shift/merge $P_{S/M}$ may be detected as a merge of two topically distinct communities:
$$P_{S/M}(c^t_i, c^{t+1}_j) = \mathrm{dissim}(c^t_i, c^{t+1}_j) \times \mathrm{descendant}(c^t_i, c^{t+1}_j)$$
Note that both $P_S$ and $P_{S/M}$ are defined only for two different communities, i.e. only if $i \neq j$
Community topic change $P_C$ expresses a change of topic of a structurally stable community:
$$P_C(c^t_i) = \mathrm{dissim}(c^t_i, c^{t+1}_i) \times (1 - A(c^{t+1}_i))$$
Only events with values $> 0.5$ and with a minimal overlap of 10 authors were selected for deeper analysis
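The selection rule on this slide can be sketched as a simple filter over scored events. The `Event` record, the helper names, and all numbers below are hypothetical; the dissimilarity, descendant share, and author entropy are taken as precomputed inputs because the slides do not define how author entropy is calculated.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    kind: str      # "shift", "shift/merge" or "topic change"
    source: str    # community label in time-step t
    target: str    # community label in time-step t+1
    value: float   # value of the corresponding measure
    overlap: int   # number of authors shared by source and target


def shift_merge(dissim: float, descendant_share: float) -> float:
    """P_S/M = dissim * descendant."""
    return dissim * descendant_share


def topic_change(dissim: float, author_entropy: float) -> float:
    """P_C = dissim * (1 - A)."""
    return dissim * (1.0 - author_entropy)


def select_events(
    events: List[Event], min_value: float = 0.5, min_overlap: int = 10
) -> List[Event]:
    """Keep events with value > min_value and at least min_overlap shared authors."""
    return [e for e in events if e.value > min_value and e.overlap >= min_overlap]


# Made-up events for illustration only.
events = [
    Event("shift/merge", "c1", "c2", shift_merge(0.9, 0.8), overlap=40),
    Event("shift/merge", "c3", "c2", shift_merge(0.9, 0.8), overlap=6),
    Event("topic change", "c4", "c4", topic_change(0.6, 0.5), overlap=30),
]
selected = select_events(events)
print([(e.kind, e.source, e.target) for e in selected])
```

Only the first event passes: the second has too few shared authors and the third scores 0.3, below the threshold.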
Data-Set
We first picked a set of major conferences in both fields
We then selected publications from these conferences from DBLP for 2000–2009
A co-citation network of 5772 authors and 817642 edges over all years was extracted
3-year time-steps with 2-year overlap: 2000–2002, 2001–2003, ...
The total number of articles was 39314, for which we were able to scrape 22975 abstracts and 3740 full-texts
  Nearly 70% coverage by content
We scraped 18313 author-provided keywords for 4102 distinct articles
  Coverage by these high-quality meta-data was 10%
We mined 263742 keywords from abstracts and full-texts
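The overlapping time-steps described on this slide can be generated with a short sliding-window helper. This is a sketch that assumes only full 3-year windows inside 2000–2009 are kept; the function and parameter names are hypothetical.

```python
from typing import List, Tuple


def time_steps(
    first_year: int, last_year: int, length: int = 3, step: int = 1
) -> List[Tuple[int, int]]:
    """Inclusive (start, end) windows of `length` years, advanced by `step` years.

    Consecutive windows overlap by length - step years (2 years by default).
    """
    return [
        (start, start + length - 1)
        for start in range(first_year, last_year - length + 2, step)
    ]


windows = time_steps(2000, 2009)
print(windows[0], windows[1], windows[-1], len(windows))
```

With these defaults the period 2000–2009 yields eight windows, from 2000–2002 to 2007–2009.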
Shift of Louvain Community 26
The emergence of Louvain community 26 was identified as an inter-camp community shift $P_S \doteq 0.62$ in 2006
It was formed by 80% of community 6 “web IR” and by 20% of community 5 “SW”
Keywords in 2006 such as “navigation”, “personalization”, and “semantic web” suggest transdisciplinary topics
Massive influence of community 15 “SW and IR” in 2007 and a change of topic towards “SW and business processes”
  Observed as a low topic drift $T \doteq 0.29$
IR-related keywords appeared again among the characterizing keywords in 2008
  The topic then stabilized: $T \doteq 0.65$
Evolution of Louvain Community 26
[Figure: communities 6 “web information retrieval”, 5 “semantic web”, 15 “semantic web and information retrieval” and their descendant community 26, shown over the time-steps 2005–2007, 2006–2008, 2007–2009 and 2008–2009 (edge labels: 90.6, 8.3, 20, 80, 4.7, 2.8, 48.6, 51.4)]
Position of Louvain Community 26 in 2006 and 2007
Communities 6 “web information retrieval” (pink), 5 “semantic web” (red), 15 “semantic web and information retrieval” (violet) and their descendant community 26 (green)
Specialization of Infomap Community 9
First oriented towards general and core SW-related topics in 2000
Between 2002 and 2004 we identified 3 shifts
One of these shifts was community 99 “semantic desktop and personalization”
The community itself then specialized in “SW services”
$S$, $T$, and $H$ provided valuable insights
$\rho$, $B$, and $A$ did not seem to provide any further insights
Life-Cycle Measures of Infomap Community 9
[Plot: measures $\rho$, $H$, $B$, $S$, $A$, $T$ over time, 2000–2008. One vertical axis shows $H$, $T$, $A$, $\rho$ (ticks 0 to 2); the other shows $B$, $S$ (ticks 0 to 4500)]
Life-Cycle Measures of Infomap Community 99
[Plot: measures $\rho$, $H$, $B$, $S$, $A$, $T$ over time, 2003–2008. One vertical axis shows $H$, $T$, $A$, $\rho$ (ticks 0.2 to 1.6); the other shows $B$, $S$ (ticks 0 to 1000)]
Shift/Merge of Community 86
We identified a shift/merge $P_{S/M} \doteq 0.91$ of community 86 with community 0
Both communities were concerned with IR-related topics, but each had its specific theme:
  86 being more focused on “development”, “engine”, and “system”
  0 being more focused on “question answering”
90.9% of authors from 86 moved to community 0
Relative density $\rho \doteq 0.47$ and high cluster content ratio $H \doteq 1.91$ suggest it was topically coherent, but structurally weak
It is not possible to generalize the suitability of any life-cycle measures as we have identified only one shift/merge
Tag Clouds of Communities 86 and 0
community: characterising keywords
$c^{2002}_{86}$: intuitive, development, ir, retrieval, control, implemented, describing, high-dimensional, reducing, engine, execution, advanced, information, system, multi-dimensional, image, usin, accurate, time, precise, features, queries, service, dataset, document, analysis, large, structure, cluster, and, web, processing
$c^{2003}_{0}$: resolution, evaluation, passages, architecture, question, qa, patterns, definitions, development, trec, mit, candidates, linguistic, retrieval, answering, system, analysis, javelin, modules, advanced, methods, science, information, approaches, processing, using, computer, language, techniques
Change of topic of Infomap community 54
An inter-camp community topic change $P_C \doteq 0.58$ was identified for Infomap community 54 between 2005 and 2006
The topic changed from “knowledge management” and “information extraction” towards “knowledge querying” and “semantic web”
Zero author entropy $A$ suggests this might have been caused by new members joining the community
  34.5% were completely new, i.e. they did not come from any previous community
  20.7% came from 54 “knowledge management and information extraction”
  17.2% came from 29 “ontologies and SW”
  6.9% came from 70 “ontologies and folksonomies”
  6.9% came from 112 “semantic web services”
Tag Clouds of Infomap community 54
community: characterising keywords
$c^{2005}_{54}$: organizational, kms, sw, capturing, environment, working, ie, acquisition, wikifactory, legacy, manager, goal, semantic, tool, cooperative, layers, healthier, defining, quantitative, knowledge, web, text, learning, techniques, computer, supporting, science, machine, documents, information, system
$c^{2006}_{54}$: ontologies, language, query, specification, knowledge, manager, semantic, pure, capturing, data, search, keyword, layers, keyword-based, hybrid, architecture, spreadsheet, web, ie, application, information, modelling, approach, algorithm, using, methodic, retrieval, service, system, structures
Emergence of Intermediary Louvain Community 15
The most complex scenario we investigated
It first emerged as a descendant of community 4 “IR” with the topic “cross-language IR”, which was identified as a community shift $P_S \doteq 0.55$ in 2003
Since 2004, this community was under the massive influence of community 5 “SW”, which caused a change towards SW-related topics, $P_C \doteq 0.31$
Since 2005, IR-related keywords appeared again among the characterizing keywords, while those keywords disappeared in community 5
Therefore, whereas community 5 kept its focus on the core SW-related topics, it largely participated in the formation of a new interdisciplinary community
Betweenness of Louvain Community 15
Despite still being focused mainly on SW-related topics, community 15 worked as an intermediary between both camps
This hypothesis is supported by the high average author betweenness $B$
                 2004–2006             2007–2009
                 S      ⟨⟨B⟩⟩          S      ⟨⟨B⟩⟩
c15              444    1591.01659     445    2535.02
entire network   2776   2066.70764     2190   2192.85117
Position of Louvain Community 15 in 2004 and 2007
Community 5 “SW” (red, left side), “IR” communities 0, 4, 6 and 9 (grey, beige, pink and red, right side, respectively) and their intermediary community 15 (violet)
Conclusion and Future Work I
We presented a general and scalable methodology for the analysis of cross-community phenomena, uniquely combining topological and content analysis and supported by special visualization techniques
Three community topic evolution measures tailored for identifying phenomena like community shift, shift/merge, and change of topic were proposed and successfully assessed
  Community shift and topic change were detected quite commonly, which suggests that they are part of many community life-cycles
  Community shift/merge was detected very rarely, which means either that we have to improve the measure or that this is simply a rare phenomenon
We proposed life-cycle measures characterising the states and evolution of communities
Conclusion and Future Work II
The assessment showed that average vertex betweenness, relative density, cluster content ratio, and topic drift offered valuable insights into the phenomena revealed by the community topic evolution measures
We observed strong shifts, $P_S \to 1$, when the shifted community disappeared in the next time-step
  These strong shifts usually had very different but coherent topics
  They might have been the initial sources of new topics or even research streams
Frequently, a newly emerged community had a quite weak structure (low $\rho$, high $A$) and/or topic (low $T$), while these characteristics then improved in the subsequent time-steps
$B$ seems to be a good measure for the identification of intermediary communities
Conclusion and Future Work III
We intend to cluster the community life-cycles by the characteristic events expressed by all the measures
  We expect this to provide an automated way of extracting life-cycle taxonomies
The combination of content and structural analysis allowed us to assess the quality of clusterings revealed only by inspection of the structure of the network
  We consider this original approach a fertile ground for future research
We plan to use other algorithms, e.g. a co-clustering algorithm of both content and objects [4]
We will extend the whole work to a larger data-set
References I
[1] R. Baeza-Yates, P. Mika, and H. Zaragoza. Search, Web 2.0, and the Semantic Web. IEEE Intelligent Systems, 23(1):80–82, 2008.
[2] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, P10008, 2008.
[3] Georgeta Bordea. Concept Extraction Applied to the Task of Expert Finding. In The Semantic Web: Research and Applications, pages 451–456. Springer, 2010.
References II
[4] Derek Greene and Padraig Cunningham. Spectral Co-Clustering for Dynamic Bipartite Graphs. Technical report, School of Computer Science & Informatics, UCD, 2010.
[5] Th. S. Kuhn. The Structure of Scientific Revolutions. University of Chicago Press, December 1996.
[6] Martin F. Porter. An algorithm for suffix stripping. Program, 14:130–137, 1980.
References III
[7] Martin Rosvall and Carl T. Bergstrom. Maps of random walks on complex networks reveal community structure. In Proceedings of the National Academy of Sciences USA, volume 105, pages 1118–1123, 2008.
[8] Ken Wakita and Toshiyuki Tsurumi. Finding community structure in a mega-scale social networking service. In IADIS International Conference on WWW/Internet 2007, pages 153–162, 2007.