Life-Cycles and Mutual Effects of Scientific Communities: ASNA 2010


Vaclav Belak, Marcel Karnstedt, Conor Hayes

Digital Enterprise Research Institute, NUI Galway

ASNA 2010, Zurich


Motivation

Progress in science is often measured by citation measures, which are relatively static

Detection and explanation of evolution and life-cycles provide better arguments for the progress

Previous approaches focused mainly on analysing co-citation graphs or textual clustering

Little work on analysis of cross-community effects

Kuhn [5] claimed the development of scientific knowledge proceeds in discrete steps:

1 Pre-paradigm period
2 Paradigm period (normal science)
3 Crisis
4 Reaction to the crisis (paradigm shift)



Cross-Community Effects I

Expected phenomena: paradigm shift and paradigm merge

[Figure: (a) community shift; (b) community merge (with community shift)]



Cross-Community Effects II

Although inspired by Kuhn, we expected communities to evolve in a rather milder form

Instead of paradigm shift, we were looking for community shift

Community merge is a complementary phenomenon, but a rather uninteresting one

Thus, we investigated combinations of shifts with subsequent merges, i.e. community merge/shifts

Instead of paradigm articulation, we were looking for community specialization

Co-citation networks of two big camps in CS were analysed: Semantic Web (solution-driven) and Information Retrieval (problem-driven) [1]



Outline

1 Methodology

2 Data-Sets

3 Results

4 Conclusion and Future Work



Initial Expectations & Requirements

The methodology was developed with a set of requirements arising from the nature of the problem:

1 Dynamic data-set represented by snapshots of several consecutive time-steps

2 Communities have to be identified in the network in each time-step

3 Authors (nodes in general) have to be uniquely identified among all time-steps

4 For topical analysis, meta-data (topics) describing the nodes are necessary



Community Detection

We identified communities using three popular algorithms:

Infomap [7]
Louvain [2]
WT [8]

All have publicly available implementations, are able to operate over weighted networks, and produce non-overlapping communities

In each time-step $t$, we identified a clustering $C^t$ of $n$ communities: $C^t = \{c^t_1, c^t_2, \ldots, c^t_n\}$, where $n$ is determined automatically for each time-step



Tracking of Dynamic Communities

Communities are identified independently for each time-step. It is thus necessary to track the evolution of each community in further time-steps

Communities were matched according to the highest Jaccard coefficient:

$$\mathrm{match}(c^t_i) = \arg\max_{c^{t+1}_j \in C^{t+1}} \frac{|c^t_i \cap c^{t+1}_j|}{|c^t_i \cup c^{t+1}_j|}$$

Important ancestors and descendants were identified by a modified Jaccard coefficient:

$$\mathrm{ancestor}(c^t_i, c^{t+1}_j) = \frac{|c^t_i \cap c^{t+1}_j|}{|c^{t+1}_j|}, \qquad \mathrm{descendant}(c^t_i, c^{t+1}_j) = \frac{|c^t_i \cap c^{t+1}_j|}{|c^t_i|}$$
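The tracking step can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes each community is a set of author identifiers, and the names `Clustering`, `jaccard`, `match`, `ancestor` and `descendant` are chosen here for readability. Returning `None` when no community at $t+1$ shares any author is also an assumption of this sketch.

```python
from typing import Dict, Hashable, Optional, Set

# Assumed representation: community id -> set of author ids in one time-step.
Clustering = Dict[Hashable, Set[str]]


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard coefficient |a ∩ b| / |a ∪ b| (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match(c: Set[str], next_clustering: Clustering) -> Optional[Hashable]:
    """Id of the community at t+1 with the highest Jaccard coefficient.

    Returns None if no community at t+1 overlaps with c (sketch assumption).
    """
    best_id, best_score = None, 0.0
    for cid, members in next_clustering.items():
        score = jaccard(c, members)
        if score > best_score:
            best_id, best_score = cid, score
    return best_id


def ancestor(c_t: Set[str], c_next: Set[str]) -> float:
    """Share of c_next's members that come from c_t."""
    return len(c_t & c_next) / len(c_next) if c_next else 0.0


def descendant(c_t: Set[str], c_next: Set[str]) -> float:
    """Share of c_t's members that moved to c_next."""
    return len(c_t & c_next) / len(c_t) if c_t else 0.0


# Toy data (hypothetical author ids).
prev = {"a", "b", "c", "d"}
nxt = {"x": {"a", "b", "c", "e"}, "y": {"d", "f"}}
best = match(prev, nxt)
```

In the toy data, `prev` shares three of five authors with `x` and one of five with `y`, so `x` is the match. The two modified coefficients differ only in the denominator, which is what lets a small community count as an important ancestor of a large one, or the reverse.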



Visualization

To compare and inspect the state of the network in different time-steps, a proper visualization is very helpful

Nodes that appeared previously should have similar positions
Colours denoting the affiliation of a node to its cluster should be preserved

As we did not find any existing tool implementing these requirements, we built our own based on JUNG

Another tool based on Graphviz was built to automatically create diagrams of ancestors and descendants based on the respective relations



Topic Detection I

We mined keywords using NLP techniques [3] from the abstracts or full-texts for almost 70% of the underlying articles

Tokenised and stemmed [6] keywords were then assigned to each author

The ability of keywords to discriminate authors was ranked according to their frequency (TF) and uniqueness in the corpus (IAF): TF-IAF

Each author $a$ in time-step $t$ was thus described by a bag-of-words vector $k^t_a$

The topical description of a cluster $c$ was obtained as the centroid of its members

Cosine similarity was used for determining the topical similarity of two clusters
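The pipeline above can be sketched in Python. The slides do not give the exact TF-IAF formula, so this sketch assumes an analogue of TF-IDF in which IAF is read as an inverse author frequency, $\log(N / \mathit{af}(k))$, with $N$ the number of authors and $\mathit{af}(k)$ the number of authors using keyword $k$. The function names and the sparse-dictionary vector representation are also assumptions.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List

Vector = Dict[str, float]  # sparse bag-of-words: keyword -> weight


def tf_iaf(author_keywords: Dict[str, List[str]]) -> Dict[str, Vector]:
    """Weight each author's keywords by TF * log(N / author frequency).

    Assumed formula: the source only names the two ingredients (TF and IAF).
    """
    n_authors = len(author_keywords)
    author_freq: Counter = Counter()
    for kws in author_keywords.values():
        author_freq.update(set(kws))  # count each keyword once per author
    vectors: Dict[str, Vector] = {}
    for author, kws in author_keywords.items():
        tf = Counter(kws)
        vectors[author] = {
            k: tf[k] * math.log(n_authors / author_freq[k]) for k in tf
        }
    return vectors


def centroid(vectors: Iterable[Vector]) -> Vector:
    """Component-wise mean of the member vectors (topic of a cluster)."""
    vectors = list(vectors)
    if not vectors:
        return {}
    total: Vector = {}
    for v in vectors:
        for k, w in v.items():
            total[k] = total.get(k, 0.0) + w
    return {k: w / len(vectors) for k, w in total.items()}


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two sparse vectors (0.0 if either is all-zero)."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy corpus (hypothetical authors and stemmed keywords).
keywords = {
    "a1": ["ontology", "ontology", "web"],
    "a2": ["retrieval", "web"],
    "a3": ["retrieval", "query"],
}
vecs = tf_iaf(keywords)
sw_topic = centroid([vecs["a1"]])
ir_topic = centroid([vecs["a2"], vecs["a3"]])
similarity = cosine(sw_topic, ir_topic)
```

Under this assumed weighting, a keyword used by every author gets weight zero, which matches the intent of ranking keywords by how well they discriminate between authors.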



Topic Detection II

Interpretation of a cluster’s topic was based on characterizing keywords, a union of:

20 highest ranked keywords
20 most frequent keywords
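A minimal Python sketch of this union follows. It assumes "highest ranked" refers to the keyword weights of the cluster centroid and "most frequent" to raw keyword counts within the cluster. The function name and both inputs are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def characterizing_keywords(
    weights: Dict[str, float], occurrences: List[str], k: int = 20
) -> Set[str]:
    """Union of the k highest-weighted and the k most frequent keywords."""
    top_ranked = sorted(weights, key=weights.get, reverse=True)[:k]
    top_frequent = [kw for kw, _ in Counter(occurrences).most_common(k)]
    return set(top_ranked) | set(top_frequent)


# Toy cluster (hypothetical values), with k=1 to keep the result small.
weights = {"qa": 0.9, "trec": 0.7, "system": 0.1}
occurrences = ["system", "system", "retrieval", "qa"]
result = characterizing_keywords(weights, occurrences, k=1)
```

Because the two lists can overlap, the union contains between 20 and 40 keywords for the default `k`.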

We were particularly interested in cross-community activity between the IR and SW camps

The definition of what is an IR-related and what is an SW-related community was based on frequent patterns mined from the publications

Any event detected by the community topic evolution measures and associated with both IR- and SW-related communities was then considered an instance of inter-camp dynamics

Meta-data was used to assess the quality of the clusterings; WT was omitted from further analysis



Measures

Overlap measures induce a huge number of interactions between communities

The solution is to apply more specific measures or to use the simple ones in combination

We developed and/or used two categories of measures:

1 community life-cycle measures for measuring and explaining the state and the evolution of a community
2 community topic evolution measures for revealing cross-community phenomena such as community shift



Community Life-Cycle Measures

Structural perspective:

size $S$

average vertex betweenness $B$, $R_B \in \mathbb{R}^+$

relative density $\rho$, $R_\rho \in [0, 1]$

author entropy $A$, $R_A \in [0, 1]$

Topical perspective:

topic drift $T$, $R_T \in [0, 1]$

cluster content ratio $H$, $R_H \in \mathbb{R}^+$



Community Topic Evolution Measures

We looked for parallel changes of structure and topic of communities

Structural and topical measures were combined by multiplication for simplicity and because the range remains within [0, 1]

Community shift $P_S$ may be detected as the emergence of a new community topically distinct from its ancestor:

$$P_S(c^t_i, c^{t+1}_j) = \mathrm{dissim}(c^t_i, c^{t+1}_j) \times \mathrm{ancestor}(c^t_i, c^{t+1}_j)$$



Community Topic Evolution Measures II

Community shift/merge $P_{S/M}$ may be detected as a merge of two topically distinct communities:

$$P_{S/M}(c^t_i, c^{t+1}_j) = \mathrm{dissim}(c^t_i, c^{t+1}_j) \times \mathrm{descendant}(c^t_i, c^{t+1}_j)$$

Note that both $P_S$ and $P_{S/M}$ are defined only for two different communities, i.e. only if $i \neq j$

Community topic change $P_C$ expresses a change of topic of a structurally stable community:

$$P_C(c^t_i) = \mathrm{dissim}(c^t_i, c^{t+1}_i) \times (1 - A(c^{t+1}_i))$$

Only events with values > 0.5 and with a minimal overlap of 10 authors were selected for deeper analysis
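The three measures and the selection rule can be sketched in Python. The sketch assumes that dissim is one minus the cosine similarity of the two topic centroids (the slides name cosine similarity for topical similarity but do not define dissim explicitly) and that the author entropy $A$ is supplied as a precomputed number, since its formula is not given here. All function names are illustrative.

```python
import math
from typing import Dict, Set

Vector = Dict[str, float]  # sparse topic centroid: keyword -> weight


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two sparse vectors (0.0 if either is all-zero)."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def dissim(u: Vector, v: Vector) -> float:
    """Assumed definition: 1 - cosine similarity, clamped to [0, 1]."""
    return min(1.0, max(0.0, 1.0 - cosine(u, v)))


def community_shift(c_t: Set[str], c_next: Set[str],
                    topic_t: Vector, topic_next: Vector) -> float:
    """P_S = dissim * ancestor, for two different communities."""
    anc = len(c_t & c_next) / len(c_next) if c_next else 0.0
    return dissim(topic_t, topic_next) * anc


def shift_merge(c_t: Set[str], c_next: Set[str],
                topic_t: Vector, topic_next: Vector) -> float:
    """P_S/M = dissim * descendant, for two different communities."""
    desc = len(c_t & c_next) / len(c_t) if c_t else 0.0
    return dissim(topic_t, topic_next) * desc


def topic_change(topic_t: Vector, topic_next: Vector,
                 author_entropy_next: float) -> float:
    """P_C = dissim * (1 - A), for the same community in t and t+1."""
    return dissim(topic_t, topic_next) * (1.0 - author_entropy_next)


def is_selected(value: float, overlap: int,
                min_value: float = 0.5, min_overlap: int = 10) -> bool:
    """Selection rule: value > 0.5 and at least 10 shared authors."""
    return value > min_value and overlap >= min_overlap


# Toy example (hypothetical authors and topics).
old = {f"a{i}" for i in range(12)}
new = {f"a{i}" for i in range(10)} | {"n0", "n1"}
p_s = community_shift(old, new, {"ir": 1.0}, {"sw": 1.0})
selected = is_selected(p_s, len(old & new))
```

With non-negative keyword weights the cosine similarity lies in [0, 1], so each product stays in [0, 1] as long as $A$ does, which is the property the slides give as a reason for combining by multiplication.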



Data-Set

We first picked a set of major conferences in both fields

We then selected publications from these conferences from DBLP for 2000–2009

A co-citation network of 5772 authors and 817642 edges over all years was extracted

3-year time-steps with 2-year overlap: 2000–2002, 2001–2003, ...

The total number of articles was 39314, for which we were able to scrape 22975 abstracts and 3740 full-texts

Nearly 70% coverage by content

We scraped 18313 author-provided keywords for 4102 distinct articles

Coverage by these high-quality meta-data was 10%

We mined 263742 keywords from abstracts and full-texts
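The time-step construction and the two coverage figures can be checked with a short Python sketch. The function `sliding_windows` is a hypothetical helper that generates only full-length windows. The content coverage is computed on the assumption that each abstract and each full-text belongs to a different article.

```python
from typing import List, Tuple


def sliding_windows(first_year: int, last_year: int,
                    length: int = 3, step: int = 1) -> List[Tuple[int, int]]:
    """Inclusive (start, end) windows of `length` years, shifted by `step`.

    Only windows that fit entirely within [first_year, last_year] are returned.
    """
    return [
        (start, start + length - 1)
        for start in range(first_year, last_year - length + 2, step)
    ]


# 3-year time-steps with a 2-year overlap over 2000-2009.
windows = sliding_windows(2000, 2009)

# Coverage figures from the counts reported on the slide.
articles = 39314
content_coverage = (22975 + 3740) / articles   # abstracts + full-texts
keyword_coverage = 4102 / articles             # author-provided keywords
```

The computed ratios are roughly 0.68 and 0.10, in line with the "nearly 70%" and "10%" stated on the slide.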



Shift of Louvain Community 26

The emergence of Louvain community 26 was identified as an inter-camp community shift $P_S \doteq 0.62$ in 2006

It was formed by 80% of community 6 “web IR” and by 20% of community 5 “SW”

Keywords in 2006 such as “navigation”, “personalization”, and “semantic web” suggest transdisciplinary topics

Massive influence of community 15 “SW and IR” in 2007 and a change of topic towards “SW and business processes”

Observed as a low topic drift $T \doteq 0.29$

IR-related keywords appeared again among the characterizing keywords in 2008

Topic then stabilized: $T \doteq 0.65$



Evolution of Louvain Community 26

Communities 6 “web information retrieval”, 5 “semantic web”, 15 “semantic web and information retrieval” and their descendant community 26

[Figure: ancestor/descendant diagram of communities c5, c6, c15 and c26 across the time-steps 2005–2007, 2006–2008, 2007–2009 and 2008–2009; edges are labelled with percentages]



Position of Louvain Community 26 in 2006 and 2007

[Figure: communities 6 “web information retrieval” (pink), 5 “semantic web” (red), 15 “semantic web and information retrieval” (violet) and their descendant community 26 (green)]



Specialization of Infomap Community 9

Initially oriented towards general and core SW-related topics in 2000

Between 2002 and 2004 we identified 3 shifts

One of these shifts was community 99 “semantic desktop and personalization”

The community itself then specialized in “SW services”

$S$, $T$, and $H$ provided valuable insights

$\rho$, $B$, and $A$ did not seem to provide any further insights



Life-Cycle Measures of Infomap Community 9

[Figure: life-cycle measures ρ, H, B, S, A and T of Infomap community 9 plotted over time (2000–2008); one axis for H, T, A, ρ (0 to 2) and a second axis for B, S (0 to 4500)]



Life-Cycle Measures of Infomap Community 99

[Figure: life-cycle measures ρ, H, B, S, A and T of Infomap community 99 plotted over time (2003–2008); one axis for H, T, A, ρ (0.2 to 1.6) and a second axis for B, S (0 to 1000)]



Shift/Merge of Community 86

We identified a shift/merge $P_{S/M} \doteq 0.91$ of community 86 with community 0

Both communities were concerned with IR-related topics, but each had its specific theme:

86 being more focused on “development”, “engine”, and “system”
0 being more focused on “question answering”

90.9% of authors from 86 moved to community 0

Relative density $\rho \doteq 0.47$ and a high cluster content ratio $H \doteq 1.91$ suggest it was topically coherent, but structurally weak

It is not possible to generalize about the suitability of any life-cycle measure, as we identified only one shift/merge



Tag Clouds of Communities 86 and 0

community: characterising keywords

$c^{2002}_{86}$: intuitive, development, ir, retrieval, control, implemented, describing, high-dimensional, reducing, engine, execution, advanced, information, system, multi-dimensional, image, usin, accurate, time, precise, features, queries, service, dataset, document, analysis, large, structure, cluster, and, web, processing

$c^{2003}_{0}$: resolution, evaluation, passages, architecture, question, qa, patterns, definitions, development, trec, mit, candidates, linguistic, retrieval, answering, system, analysis, javelin, modules, advanced, methods, science, information, approaches, processing, using, computer, language, techniques



Change of topic of Infomap community 54

An inter-camp community topic change $P_C \doteq 0.58$ was identified for Infomap community 54 between 2005 and 2006

The topic changed from “knowledge management” and “information extraction” towards “knowledge querying” and “semantic web”

Zero author entropy $A$ suggests this might have been caused by new members joining the community

34.5% were completely new, i.e. they did not come from any previous community
20.7% coming from 54 “knowledge management and information extraction”
17.2% coming from 29 “ontologies and SW”
6.9% coming from 70 “ontologies and folksonomies”
6.9% coming from 112 “semantic web services”



Tag Clouds of Infomap community 54

community: characterising keywords

$c^{2005}_{54}$: organizational, kms, sw, capturing, environment, working, ie, acquisition, wikifactory, legacy, manager, goal, semantic, tool, cooperative, layers, healthier, defining, quantitative, knowledge, web, text, learning, techniques, computer, supporting, science, machine, documents, information, system

$c^{2006}_{54}$: ontologies, language, query, specification, knowledge, manager, semantic, pure, capturing, data, search, keyword, layers, keyword-based, hybrid, architecture, spreadsheet, web, ie, application, information, modelling, approach, algorithm, using, methodic, retrieval, service, system, structures



Emergence of Intermediary Louvain Community 15

The most complex scenario we investigated

It first emerged as a descendant of community 4 “IR” with the topic “cross-language IR”, which was identified as a community shift $P_S \doteq 0.55$ in 2003

From 2004, this community was under a massive influence of community 5 “SW”, which caused a change towards SW-related topics, $P_C \doteq 0.31$

From 2005, IR-related keywords appeared again among the characterizing keywords, while those keywords disappeared from community 5

Therefore, whereas community 5 kept its focus on the core SW-related topics, it largely participated in forming a new interdisciplinary community



Betweenness of Louvain Community 15

Despite still being focused mainly on SW-related topics, community 15 acted as an intermediary between both camps

This hypothesis is supported by a high average author betweenness $B$

                  2004–2006              2007–2009
                  S      ⟨⟨B⟩⟩            S      ⟨⟨B⟩⟩
c15               444    1591.01659      445    2535.02
entire network    2776   2066.70764      2190   2192.85117



Position of Louvain Community 15 in 2004 and 2007

[Figure: community 5 “SW” (red, left side), “IR” communities 0, 4, 6 and 9 (grey, beige, pink and red, right side, respectively) and their intermediary community 15 (violet)]



Conclusion and Future Work I

We presented a general and scalable methodology for the analysis of cross-community phenomena, uniquely combining topological and content analysis and supported by special visualization techniques

Three community topic evolution measures tailored for identifying phenomena like community shift, shift/merge, and change of topic were proposed and successfully assessed

Community shift and topic change were detected quite commonly, which suggests that they are part of many community life-cycles
Community shift/merge was detected very rarely, which either means we have to improve the measure or that this is simply a rare phenomenon

We proposed life-cycle measures characterising the states and evolution of communities



Conclusion and Future Work II

The assessment showed that average vertex betweenness, relative density, cluster content ratio, and topic drift offered valuable insights into the phenomena revealed by community topic evolution measures

We observed strong shifts $P_S \to 1$ when the shifted community disappeared in the next time-step

These strong shifts usually had very different but coherent topics
They might have been the initial sources of new topics or even research streams

Frequently, a newly emerged community had a quite weak structure (low $\rho$, high $A$) and/or topic (low $T$), while these characteristics then improved in the subsequent time-steps

$B$ seems to be a good measure for the identification of intermediary communities



Conclusion and Future Work III

We intend to cluster the community life-cycles by the characteristic events expressed by all the measures

We expect this to provide an automated way of extracting life-cycle taxonomies

The combination of content and structural analysis allowed us to assess the quality of a clustering revealed only by inspection of the structure of the network

We consider this original approach a fertile ground for future research

We plan to use other algorithms, e.g. an algorithm that co-clusters both content and objects [4]

We will extend the whole work to a larger data-set



References I

[1] R. Baeza-Yates, P. Mika, and H. Zaragoza. Search, Web 2.0, and the Semantic Web. IEEE Intelligent Systems, 23(1):80–82, 2008.

[2] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, P10008, 2008.

[3] Georgeta Bordea. The Semantic Web: Research and Applications, chapter Concept Extraction Applied to the Task of Expert Finding, pages 451–456. Springer, 2010.



References II

[4] Derek Greene and Padraig Cunningham. Spectral Co-Clustering for Dynamic Bipartite Graphs. Technical report, School of Computer Science & Informatics, UCD, 2010.

[5] Th. S. Kuhn. The Structure of Scientific Revolutions. University Of Chicago Press, December 1996.

[6] Martin F. Porter. An algorithm for suffix stripping. Program, 14:130–137, 1980.



References III

[7] Martin Rosvall and Carl T. Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences USA, 105:1118–1123, 2008.

[8] Ken Wakita and Toshiyuki Tsurumi. Finding community structure in a mega-scale social networking service. In IADIS international conference on WWW/Internet 2007, pages 153–162, 2007.

