Research Interests : Their Dynamics, Structures and Applications in Personalized Web Search


Yi Zeng1, Erzhong Zhou1, Xu Ren1, Yulin Qin1,3, Ning Zhong1,2 , Zhisheng Huang4

1. International WIC Institute, Beijing University of Technology, China

2. Maebashi Institute of Technology, Japan

3. Carnegie Mellon University, USA

4. Vrije University Amsterdam, the Netherlands

Web Intelligence Consortium

The Large Knowledge Collider Project


13 partner institutions (from 11 countries, 2 from Asia)

a platform for infinitely scalable querying and reasoning on the linked data-web.

Motivation

Vague/Incomplete queries over large scale data.

(How to get more refined queries to reduce the size of the result set?).

Large scale data vs most relevant data for a specific user.

Diversity for different users in the context of large scale data.

Realizing Diversity of Users by user interests.

Understanding the structural and dynamical characteristics of user interests is the foundation for its utilization in Web search refinement.

The Acquisition, Structure and Dynamics of Research Interests

Why?

Human Learning Theory [Bransford 2000]

Basic Level Advantage [Rogers 2007]

How?

Identifying key interests

Utilizing interests for the unification of knowledge retrieval and reasoning.

What if interests change dynamically? Do they really change all the time, and if so, how?

Different Interests Evaluation Functions

(Frequency) Cumulative Interest :

$$CI(t(i), n) = \sum_{j=1}^{n} y_{t(i),j}.$$
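A minimal Python sketch of the cumulative interest count may help make the formula concrete. It assumes that y_{t(i),j} is the number of times the term t(i) occurs in publication titles during year j, and that terms are simply lowercased title words; the function names and the toy data are hypothetical.

```python
from collections import Counter

def yearly_counts(papers):
    """Map each year j to a Counter holding y[t, j] for every term t.

    `papers` is an iterable of (year, title) pairs. Splitting titles on
    whitespace is a simplifying assumption, not the authors' preprocessing.
    """
    counts = {}
    for year, title in papers:
        counts.setdefault(year, Counter()).update(title.lower().split())
    return counts

def cumulative_interest(counts, term):
    """CI(t(i), n): sum of y[t(i), j] over all n time intervals."""
    return sum(year_counter[term] for year_counter in counts.values())

# Hypothetical toy data, not taken from any real publication list.
papers = [
    (2007, "web search ranking"),
    (2008, "web query analysis"),
    (2008, "search engine query logs"),
]
counts = yearly_counts(papers)
print(cumulative_interest(counts, "web"))
```

A `Counter` returns 0 for unseen terms, so a term that never appears simply has a cumulative interest of 0.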

An analysis of cumulative interests in different time intervals. (Paul Erdos, with more than 1400 papers involved)

Statistical characteristics analysis: all the points lie close to a straight line, and the Shapiro-Wilk test gives a significance value of 0.058, which is greater than 0.05, so normality cannot be rejected: the distribution of Erdos’s publication numbers over the years is consistent with a normal distribution. Cumulative interests of an author may follow different kinds of distributions.

The “basic level advantage” [Rogers 2007]: concepts at the basic level are used more frequently than other terms [Wisniewski 1989].

Weights of Interests

Users’ attention is divided when they hold several interests at the same time.

Each interest has its ups and downs. These can be discovered from the change in the weight of an interest relative to the other interests.

An analysis of Ricardo Baeza-Yates’ weighted interests w(t(i), j).

$$w(t(i), j) = \frac{y_{t(i),j}}{\sum_{i=1}^{n} y_{t(i),j}}$$
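The weighted interest w(t(i), j) is the share of an interest within one time interval. A small sketch, assuming the counts y_{t(i),j} for a single interval j are already available as a dictionary (the names and numbers below are hypothetical):

```python
def interest_weights(interval_counts):
    """Return w(t(i), j) for one interval j: each count divided by the interval total."""
    total = sum(interval_counts.values())
    if total == 0:
        return {}
    return {term: count / total for term, count in interval_counts.items()}

# Hypothetical counts y[t(i), j] for a single year j.
counts_2008 = {"web": 3, "search": 2, "query": 1}
weights = interest_weights(counts_2008)
print(weights["web"])  # 0.5
```

Because the weights within an interval sum to 1, a rising weight for one interest necessarily means a falling share for the others, which is what makes the ups and downs visible.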

Obtaining the Retained Interests

Besides frequency, what else is important for correctly obtaining retained interests?

Forgetting mechanism in cognitive memory retention

(exponential function model, power function model) [Anderson, Schooler 1991].

Pictures from: [Schooler 1993] Schooler, L. J. & Anderson, J. R.: Recency and Context: An Environmental Analysis of Memory. In Proceedings of the Fifteenth Annual Conference of the Cognitive Science Society, pp. 889-894, 1993.

(Frequency and Recency) Memory Retention:

$$P = A e^{-bT}; \qquad P = A T^{-b}$$

Obtaining the Retained Interests (cont.)

[Zeng 2009a] Cognitive Memory Retention Based Starting Point for Query Extension and Granular Selection, Yi Zeng, Haiyan Zhou, Ning Zhong, Yulin Qin, Shengfu Lu, Yiyu Yao, Yang Gao. In: Cognitive Memory Component (v1), LarKC deliverable 2-3-1, Coordinated by Jose Quesada and Yi Zeng, March 30, 2009.

[Zeng 2009b] Yi Zeng, Yiyu Yao, Ning Zhong. DBLP-SSE: A DBLP Search Support Engine, In: Proceedings of the 2009 IEEE/WIC/ACM International Conference on Web Intelligence, IEEE Computer Society, Milan, Italy, September 15-18, 2009.

[Maanen 2009] Leendert van Maanen, Julian N. Marewski.: Recommender Systems for Literature Selection: A Competition between Decision Making and Memory Models, CogSci 2009, July 31-August 1, 2009.

(Frequency and Recency) Exponential Model for Interest Retention:

$$EIR(i) = \sum_{j=1}^{n} m(i, j)\, A e^{-b T_{i,j}}$$

(Frequency and Recency) Power Model for Interest Retention:

$$PIR(i) = \sum_{j=1}^{n} m(i, j)\, A T_{i,j}^{-b}$$
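The two retention models can be sketched in Python as follows. This is only an illustration under stated assumptions: m(i, j) is taken to be the number of times interest i appears in year j, T is taken to be the number of years elapsed between year j and the evaluation year, and the default values of A and b are placeholders rather than fitted parameters.

```python
import math

def exponential_retention(appearances, now, A=1.0, b=0.5):
    """EIR(i): sum over years j of m(i, j) * A * exp(-b * T), with T = now - j."""
    total = 0.0
    for year, m in appearances.items():
        elapsed = now - year
        if elapsed <= 0:
            raise ValueError("every appearance must precede the evaluation year")
        total += m * A * math.exp(-b * elapsed)
    return total

def power_retention(appearances, now, A=1.0, b=0.5):
    """PIR(i): sum over years j of m(i, j) * A * T**(-b), with T = now - j."""
    total = 0.0
    for year, m in appearances.items():
        elapsed = now - year
        if elapsed <= 0:
            raise ValueError("every appearance must precede the evaluation year")
        total += m * A * elapsed ** (-b)
    return total

# Hypothetical data: the same frequency, but at different distances in time.
recent = {2008: 3}
old = {2000: 3}
print(power_retention(recent, 2009), power_retention(old, 2009))
print(exponential_retention(recent, 2009), exponential_retention(old, 2009))
```

With equal frequency, the more recent interest keeps a higher retained value under both models, which is the effect of adding recency to a purely frequency-based count.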

Obtaining the Top N Interests

A comparative study of total research interests from 1990 to 2008 and retained interests in 2009 (based on both the power law and exponential law models).

Difference on the contribution values from papers published in different years.

• Retained interests vs future interests.
• Publication numbers are within [200, 300]
• Top 9 interests
• 2001 to 2008
• 140 persons
• 51.14% predict 5 out of 9 interests.
• Spearman rank correlation: rho = 0.66
• 1-tail t-test: 0.02 (close to statistically significant)

Building and Analyzing the Structure of Research Interests

Observed Phenomenon:

[1] Main research interests (pivotal nodes) change dynamically over time, with older ones disappearing and new ones emerging.

[2] Relations among research interests vary over time (they strengthen or weaken).

[3] Main research interests are closely related to each other. (The closeness grows stronger over time, bringing the degree of separation to around 2-3. This indicates that an author's research interests are not isolated but highly relevant to one another.)

[4] Many top research interests (pivotal nodes) remain active in the interest network (e.g. search, analysis, match).

Figure 7. Ricardo's research interest dynamic evolution network from 1991 to 2009 (based on his DBLP publication list, with 232 papers involved). The network is a graph with weighted edges and weighted vertices.

An Author’s Research Interest Evolution Network

Statistical Characteristics on the Dynamics of Total Research Interests

Not a purely random process!

There might be some universal characteristics and hidden rules!

Pictures from math.ucsd.edu. and math.tsukuba.ac.jp

Figure 2: Power-law distribution of the weights of research interests for Leonhard Euler (publication list from Euler's Archive, with 856 papers), Paul Erdos (publication list from Erdos' publication collection project (1929-1989) and MathSciNet (1990-2004), with 1437 papers involved; titles in German, French and Hungarian were translated with Google translation and Babylon translation), and Ricardo Baeza-Yates (from DBLP). Titles were preprocessed to handle meaningless words, tense, singular and plural forms, third person, etc.

Interests with Self Organized Criticality

Self organized criticality [Barabasi 2002]

(The winner takes all)

Figure 11. Zdzislaw Pawlak’s Interest statistics showing Self organized criticality.

Rough Set

Figure 12. Zdzislaw Pawlak’s Interest connection network (1984-2008, with 62.1% interests directly connected to “rough”).

Timing characteristics of research interests

Dynamic characteristics of Interest Longest Duration and Interest Cumulative Duration.

Figure 9: Distribution statistics of the lasting time and appearance time of Ricardo's research interests.

• A few large spikes in the plot correspond to very long Interest Longest Duration and Interest Cumulative Duration values for some research interests: a non-Poisson process.

Figure 9(b): the probability of having n research interests whose lasting time is a fixed time interval $\tau$.

Statistical distribution approximation (by linear fit): $P(\tau) \sim \tau^{-\alpha}$, with $\alpha = 2.30\,(\pm 0.26)$ and $\alpha' = 1.64\,(\pm 0.18)$.
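The "by linear fit" note above is consistent with the common practice of estimating a power-law exponent as the slope of a straight line in log-log space. The sketch below assumes that procedure; it is not necessarily the authors' exact fitting method, and the data are synthetic.

```python
import math

def fit_power_law_exponent(xs, ps):
    """Least-squares slope of log(p) against log(x).

    Returns alpha such that p is approximately proportional to x ** (-alpha).
    """
    log_x = [math.log(x) for x in xs]
    log_p = [math.log(p) for p in ps]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_p = sum(log_p) / n
    covariance = sum((a - mean_x) * (b - mean_p) for a, b in zip(log_x, log_p))
    variance = sum((a - mean_x) ** 2 for a in log_x)
    return -covariance / variance

# Synthetic data that follows p = x ** -2 exactly (not measured values).
durations = [1, 2, 4, 8]
probabilities = [d ** -2.0 for d in durations]
alpha = fit_power_law_exponent(durations, probabilities)
print(round(alpha, 6))
```

On real duration histograms the points scatter around the line, which is where an uncertainty on the fitted exponent comes from.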

Inspired by Human Dynamics [Barabasi 2005]

Explanations on the Observed Power Law Distributions

What causes the “Scale-free Phenomenon” in research interests?

‘the rich get richer’ effect [Simon 1955] (preferential attachment [Barabasi 1999])

Researchers are likely to work around a few general topics. The more specific topics change from time to time, but they stay around, or closely related to, these general topics.

The picture is from: Peter Csermely. Weak Links: Stabilizers of Complex Systems from Proteins to Social Networks, Springer, 2006.

[Simon 1955] Simon, H.: On a class of skew distribution functions. Biometrika 42, 425–440, 1955.

[Barabasi 1999] Barabasi, A.L. and Albert, R.: Emergence of scaling in random networks. Science 286, 509–512, 1999.

A Comparative Study of Different Interest Evaluation Methods

Interests Longest Duration

$$ILD(t(i), n) = \max\big(ID(t(i))_n\big)$$

Interests Cumulative Duration

$$ICD(t(i), n) = \sum_{n=1}^{n'} ID(t(i))_n$$
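As an illustration of the two duration measures, the sketch below assumes that time is measured in years and that each ID(t(i))_n is the length of the n-th run of consecutive years in which the interest appears. Both assumptions, and the sample years, are illustrative only.

```python
def interest_durations(years):
    """Lengths of the runs of consecutive years in which an interest appears."""
    runs = []
    previous = None
    for year in sorted(set(years)):
        if previous is not None and year == previous + 1:
            runs[-1] += 1      # the current run continues
        else:
            runs.append(1)     # a gap (or the first year) starts a new run
        previous = year
    return runs

def longest_duration(years):
    """ILD: the longest single run."""
    return max(interest_durations(years), default=0)

def cumulative_duration(years):
    """ICD: the sum of all runs."""
    return sum(interest_durations(years))

# Hypothetical years in which one interest appears.
years = [1995, 1996, 1997, 2001, 2004, 2005]
print(interest_durations(years))   # [3, 1, 2]
print(longest_duration(years))     # 3
print(cumulative_duration(years))  # 6
```

An interest that keeps returning after gaps can therefore have a large cumulative duration even when its longest duration is short.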

Zhisheng Huang’s Interests Evaluation from CI, ILD and ICD

Social Network based Group Interests Models

• An example of Group Interest.

[Figure: interest networks of Carlos Castillo and Ricardo A. Baeza-Yates. Node labels: Web, PageRank, Network, Spam, Search, Detection, Analysis, Link, Content; Web, Search, Retrieval, Information, Query, Analysis, Challenge, Engine, Mining.]

How to acquire the top N interests?

• Group Interest Function:

$$GI(t(i), u) = \sum_{c=1}^{m} E(t(i), u, c), \qquad E(t(i), u, c) = \begin{cases} 1 & (t(i) \in I_c^{topN}) \\ 0 & (t(i) \notin I_c^{topN}) \end{cases}$$
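The group interest function counts, for a given interest, how many group members have it among their own top N interests. A minimal sketch, assuming that c ranges over the m members of the social group of user u (for example co-authors) and that each member's top-N interests are already known; the member names and interest sets are hypothetical.

```python
def group_interest(term, member_top_n):
    """GI(t(i), u): number of group members c whose top-N set contains the term."""
    return sum(1 for top_n in member_top_n.values() if term in top_n)

def top_group_interests(member_top_n, n):
    """Rank all terms by group interest (ties broken alphabetically) and keep n."""
    terms = set().union(*member_top_n.values())
    ranked = sorted(terms, key=lambda t: (-group_interest(t, member_top_n), t))
    return ranked[:n]

# Hypothetical top-N interest sets of the group members of a user u.
member_top_n = {
    "member_a": {"web", "search"},
    "member_b": {"search", "spam"},
    "member_c": {"search", "web"},
}
print(group_interest("search", member_top_n))  # 3
print(top_group_interests(member_top_n, 2))    # ['search', 'web']
```

The alphabetical tie-break is an arbitrary choice made here so that the output is deterministic.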

Overlap of User Interests and Group Interests

Top 9 Retained Interests    Value    Top 9 Group Retained Interests    Value
Web                         7.81     Search                            35
Search                      5.59     Retrieval                         30
Retrieval                   3.19     Web                               28
Information                 2.27     Information                       26
Query                       2.14     System                            19
Engine                      2.10     Query                             18
Mining                      1.26     Analysis                          14
Challenge                   …        Text                              …
Analysis                    …        Model                             …

Top 9 interests retention of a user and his group interests retention. (Ricardo A. Baeza-Yates, based on May 2008 version of SwetoDBLP).

A Step Forward: Semantic Similarity for Obtaining More Accurate Interest Descriptions

Consistent interests without consideration of semantic similarity.

[Figure: Carlos Castillo; node labels Web, PageRank, Network, Spam, Search, Detection, Analysis, Link, Content, Web, Search, Query, Analysis, Challenge, Engine, Mining.]

Consistent interests with consideration of semantic similarity.

[Figure: Carlos Castillo; node labels Web, PageRank, Network, Spam, Search, Detection, Analysis, Link, Content, Web, Search, Query, Analysis, Challenge, Engine, Mining.]

Semantic Similarity and Interests Re-ranking

Semantic similarity is judged by Normalized Google Distance [Rudi and Paul 2007]:

$$NGD(x, y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x, y)}{\log M - \min\{\log f(x), \log f(y)\}}$$

Normalized Google Distance

interest x    interest y    NGD      interest x    interest y    NGD
search        retrieval     0.529    logic         reasoning     0.239
search        query         0.483    logic         semantic      0.276
search        pagerank      0.490    ontology      semantic      -0.003
retrieval     query         0.403    reasoning     semantic      0.050
retrieval     pagerank      0.497    logic         ontology      0.332
query         pagerank      0.460    ontology      reasoning     0.080

Threshold on NGD(x, y): 0.3
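The Normalized Google Distance formula translates directly into a few lines of Python. In this sketch f(x), f(y) and f(x, y) are hit counts for each term and for both terms together, and M is the total number of indexed pages. The counts in the example are made up, and obtaining real counts from a search engine is outside the scope of the sketch.

```python
import math

def ngd(fx, fy, fxy, m):
    """Normalized Google Distance from hit counts f(x), f(y), f(x, y) and index size M."""
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(m) - min(log_fx, log_fy))

# Made-up hit counts, for illustration only.
print(ngd(1000, 1000, 1000, 10**6))  # 0.0: the terms always co-occur
print(ngd(1000, 1000, 10, 10**6))    # larger: the terms rarely co-occur
```

If a reported joint count f(x, y) exceeds both single counts, the numerator becomes negative, which is one way a slightly negative distance can arise from estimated hit counts.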

Google, Bing as the Knowledge base.

A comparative study of interests ranking without and with re-ranking strategy

Semantic Similarity and Interests Re-ranking (cont.)

$$rank'(x) = \begin{cases} rank(x),\ \ rank'(y) = rank(x) + 1 & (rank(x) < rank(y)) \\ rank(y) + 1,\ \ rank'(y) = rank(y) & (rank(y) < rank(x)) \end{cases}$$

Interests Re-ranking Function

                     (a) Without semantic similarity based re-ranking    (b) With semantic similarity based re-ranking
Perspectives         CI(t(i), n)    PRI(t(i), n)    ERI(t(i), n)         CI(t(i), n)    PRI(t(i), n)    ERI(t(i), n)
Interests Ranking    Agent          Ontology        Ontology             Agent          Ontology        Ontology
                     Web            Web             Semantic             Web            Semantic        Semantic
                     Ontology       Semantic        Reasoning            Ontology       Reasoning       Reasoning
                     Logic          Reasoning       Web                  Semantic       Logic           Logic
                     Semantic       Inconsistent    Inconsistent         Reasoning      Web             Web
                     Reasoning      Logic           Prolog               Logic          Inconsistent    Inconsistent
                     Dynamic        Prolog          Logic                Dynamic        Prolog          Prolog
                     Inconsistent   Dynamic         Agent                Inconsistent   Dynamic         Agent
                     Prolog         Agent           Dynamic              Prolog         Agent           Dynamic
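The re-ranking idea can be sketched as a single pass over the ranked list: for each interest, every lower-ranked interest that is semantically similar to it is moved to the position directly after it. This is one possible reading of the pairwise re-ranking function, not a confirmed description of the authors' procedure. The sketch assumes that two interests are similar when their NGD is strictly below 0.3 and that pairs missing from the NGD table are not similar.

```python
# NGD values quoted from the Normalized Google Distance table above.
NGD = {
    ("logic", "reasoning"): 0.239,
    ("logic", "semantic"): 0.276,
    ("ontology", "semantic"): -0.003,
    ("reasoning", "semantic"): 0.050,
    ("logic", "ontology"): 0.332,
    ("ontology", "reasoning"): 0.080,
}
THRESHOLD = 0.3  # assumption: "similar" means NGD strictly below this value

def similar(x, y):
    """True when the pair is listed (in either order) with an NGD below the threshold."""
    value = NGD.get((x, y), NGD.get((y, x)))
    return value is not None and value < THRESHOLD

def rerank(ranking, is_similar):
    """Pull lower-ranked interests up to directly follow a similar higher-ranked one."""
    result = list(ranking)
    for position in range(len(result)):
        anchor = result[position]
        tail = result[position + 1:]
        followers = [t for t in tail if is_similar(anchor, t)]
        rest = [t for t in tail if not is_similar(anchor, t)]
        result = result[: position + 1] + followers + rest
    return result

# CI ranking from column (a) of the table above.
before = ["agent", "web", "ontology", "logic", "semantic",
          "reasoning", "dynamic", "inconsistent", "prolog"]
after = rerank(before, similar)
print(after)
```

Under these assumptions the output for the CI ranking coincides with the CI column of part (b): "semantic" and "reasoning" move up behind "ontology", and "logic" drops to sixth place.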

Similarity Measures

Google Similarity Distance

The whole Google file system as the knowledge base

It is not accurate enough, since the data contain a lot of noise from different fields.

$$NGD(x, y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x, y)}{\log M - \min\{\log f(x), \log f(y)\}}$$

Medline Similarity Distance

We ask the right person the right question!

Domain specific knowledge source is needed to acquire more accurate and professional answers.

Just ask me! I know everything and believe me!

I know Chemistry

I know Medical Science

I know Cognition

I know Mathematics

My Question is about Medical Science

Evaluations on Normalized Medline Distance (NMD)


Experts evaluated 30 Medline term pairs

Pearson Correlation: NMD gets the highest value among the measures, 0.792

T-test significance: 0.995

Experts from AstraZeneca evaluated 90 randomly generated pairs

Pearson Correlation: NMD: 0.736 vs NGD: 0.531

Average: Experts: 0.590, NMD: 0.390, NGD: 0.289

NMD is closer to experts’ evaluation

Motivation for User Interests Description

Based on the idea of Linked data , it will be very useful if user interests data can be shared across various applications.

Consistent description and representation of user interests are needed so that the integration and sharing of user interests data will be easier.

The Linked Open Data figure is from http://richard.cyganiak.de/2007/10/lod/

Large scale data vs most relevant data for a specific user.

User interests serve as the foundation for addressing scalability problems by exploiting the diversity of user backgrounds and needs.

Defining User Interests

Interest: the activities that you enjoy doing and the subjects that you like to spend time learning about.

------ Cambridge Advanced Learner's Dictionary.

A user interest is a subject that an agent wants to participate in, get to know, learn about, or be involved in.

User interests need to be described from various perspectives. It is better that each perspective can be quantitatively evaluated.

User Interest can be described as a five tuple:

< InterestURI, AgentURI, Property(i), Value(i), Time(i) >
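The five-tuple can be mirrored by a small immutable record type. This is only a sketch: the class name, field names and the numeric type of the value are assumptions, the example.org URIs and the date are hypothetical, and the value 7.81 is the retained interest listed for "Web" in the Top 9 table of this deck.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class InterestStatement:
    """One <InterestURI, AgentURI, Property(i), Value(i), Time(i)> tuple."""
    interest_uri: str
    agent_uri: str
    property_name: str   # which perspective is being measured
    value: float         # assumption: values are numeric
    time: date           # when the value was computed or last updated

# Hypothetical URIs and date, for illustration only.
statement = InterestStatement(
    interest_uri="http://example.org/interest/web",
    agent_uri="http://example.org/agent/ricardo-baeza-yates",
    property_name="e-foaf:retained_interest_value",
    value=7.81,
    time=date(2008, 5, 1),
)
print(statement.property_name, statement.value)
```

Keeping one record per (property, time) pair lets the same interest of the same agent carry several quantitative evaluations side by side.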

The e-FOAF: interest Vocabulary

Vocabulary Branch                          Vocabulary                             Type
e-foaf:interest Complete / Basic           e-foaf:interest                        Class
                                           e-foaf:interest_value                  Attribute
                                           e-foaf:interest_value_updatetime       Attribute
                                           e-foaf:interest_appeared_in            Attribute
                                           e-foaf:interest_appeare_time           Attribute
                                           e-foaf:interest_has_synonym            Attribute
                                           e-foaf:interest_co-occur_with          Attribute
e-foaf:interest Complete / Complement      e-foaf:cumulative_interest_value       Attribute
                                           e-foaf:retained_interest_value         Attribute
                                           e-foaf:interest_longest_duration       Attribute
                                           e-foaf:interest_cumulative_duration    Attribute

An extension of current FOAF vocabulary in the Semantic Web community. Following the definition of “user interests” in the above slide. Describe user interests quantitatively from various perspectives.

Integration of WI and e-FOAF:interests by FOAF community

By Balthasar A.C. Schopman from Vrije University Amsterdam

Integration of WI and e-FOAF:interests by FOAF community (cont.)

Based on Vocamp 2010

By Bob Ferris (SMI), the picture is from:

http://smiy.sourceforge.net/wi/spec/weightedinterests.html

The wi:ComplexInterest concept as graph with relations:

This photo is taken by Professor Lora Aroyo from Vrije University Amsterdam at Vocamp 2010.

Computer Scientists’ Research Interests Dataset

We analyzed research interests of all the computer scientists in DBLP from different perspectives.

We released the “computer scientists’ research interest RDF dataset” (0.19 billion triples): http://wiki.larkc.eu/csri-rdf

The Utilization of e-FOAF:interests Vocabulary

Accessing user interests and downloading them as an RDF file.

The utilization of the interests dataset.

The SPARQL endpoint for DBLP user interests is available at http://www.wici-lab.org/wici/dblp-sse/

Dieter & Frank 2007

Bring User Interests to Literature Search Refinement

User interests

“ They come to formal education with a range of prior knowledge, skills, beliefs, and concepts that significantly influence what they notice about the environment and how they organize and interpret it. This, in turn, affects their abilities to remember, reason, solve problems, and acquire new knowledge. ” [Bransford 2000]

Humans acquire new knowledge based on pre-existing knowledge. People with different background knowledge will understand the same knowledge source in different ways.

Literature search systems are for researchers to acquire knowledge for their needs based on their queries.

[Diagram: Pre-existing Knowledge + Search, Acquired Knowledge]

Useful literature that is relevant to the query and the authors’ research interests

Search Refinement by Interests from Different Perspectives

Vague/incomplete queries may produce too many results that the users have to wade through.

Research interests may be closely related to search tasks.

Research interests can be evaluated from various perspectives.

(1) Cumulative Interests;

(2) Retained Interests;

(3) Interests Longest Duration;

(4) Interests Cumulative Duration;

(5) Group interests;

DBLP-SSE : DBLP Search Support Engine

Log in: Dieter Fensel

Top 9 interests: Web, Service, Semantic, Architecture, Model, Ontology, Knowledge, Computing, Language

Query: Artificial Intelligence

List 1: without current interests constraints (Top 5 results)

* PROLOG Programming for Artificial Intelligence, Second Edition.
* Artificial Intelligence Architectures for Composition and Performance Environment.
* Artificial Intelligence in Music Education: A Critical Review.
* Music, Intelligence and Artificiality. Artificial Intelligence and Music Education.
* Musical Knowledge: What can Artificial Intelligence Bring to the Musician?
* ...

List 2: with current interests constraints (Top 5 results)

* Web Intelligence and Artificial Intelligence in Education.
* Artificial Intelligence Exchange and Service Tie to All Test Environments (AI-ESTATE)-A New Standard for System Diagnostics.
* Semantic Model for Artificial Intelligence Based on Molecular Computing.
* Open Information Systems Semantics for Distributed Artificial Intelligence.
* Artificial Intelligence and Financial Services.
* …

[Figure: the DBLP dataset; sub-dataset pre-selection by interests (Web, Semantic, Knowledge); search results without any refinement vs search results with interests-based refinement.]

http://www.wici-lab.org/wici/dblp-sse/
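The effect of interest-based refinement on the "Artificial Intelligence" query can be imitated with a toy ranking function. This is a simplified sketch, not the DBLP-SSE implementation: it keeps the titles that contain every query word and orders them by how many of the user's interest terms they contain. The titles and the interest list are quoted from the example in this deck; the function names are hypothetical.

```python
import re

def tokens(text):
    """Lowercased alphabetic words of a title or query."""
    return set(re.findall(r"[a-z]+", text.lower()))

def refine(titles, query, interests, top_k=5):
    """Filter titles by the query, then rank them by overlap with the interests."""
    query_terms = tokens(query)
    interest_terms = {term.lower() for term in interests}
    matching = [t for t in titles if query_terms <= tokens(t)]
    # sorted() is stable, so titles with equal overlap keep their input order.
    return sorted(matching,
                  key=lambda t: len(tokens(t) & interest_terms),
                  reverse=True)[:top_k]

interests = ["Web", "Service", "Semantic", "Architecture", "Model",
             "Ontology", "Knowledge", "Computing", "Language"]
titles = [
    "PROLOG Programming for Artificial Intelligence, Second Edition.",
    "Web Intelligence and Artificial Intelligence in Education.",
    "Artificial Intelligence in Music Education: A Critical Review.",
    "Semantic Model for Artificial Intelligence Based on Molecular Computing.",
]
refined = refine(titles, "Artificial Intelligence", interests)
for title in refined:
    print(title)
```

Titles that share terms with the user's interests rise to the top, while titles without any overlap keep their original relative order at the bottom.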

User Evaluation of Refinement Strategy

Participants: 7 DBLP authors

Preference order 100%: List 2, List 3 > List 1

Preference order 100%: List 2 > List 3

Preference order 83.3%: List 2 > List 3 > List 1

Preference order 16.7%: List 3 > List 2 > List 1

Social Relation Based Search Refinement: Let Your Friends Help You!. Xu Ren, Yi Zeng, Yulin Qin, Ning Zhong, Zhisheng Huang, Yan Wang, and Cong Wang. Proceedings of the 2010 International Conference on Active Media Technology, Lecture Notes in Computer Science 6335, 475-485, 2010.

Scalability for Query Time

              Unrefined query                    Refined query based on interests    Interest based selection before querying
Query Time    medium                             much slower                         the fastest
Results       may be very far from user needs    much closer to user needs           equivalent to Refined query based on interests

With selection: approximately 80% of the time can be saved.

The Effect of Query Constraints Numbers

Recall and Spent Time (Unrefined queries vs Interest-based Selection)

As the data scale grows, interest-based selection achieves almost the same recall as unrefined queries, while the ratio of spent time grows almost linearly.

Sometimes one can get higher recall while the ratio of spent time is lower.

Context-Aware Linked Life Data Search

Utilizing user interests to refine vague and incomplete search

Publications related to this talk

Research Interests : Their Dynamics, Structures and Applications in Web Search Refinement. Yi Zeng, Erzhong Zhou, Yulin Qin, and Ning Zhong. Proceedings of the 2010 IEEE/WIC/ACM International Conference on Web Intelligence, IEEE Computer Society, Toronto, Canada, August 31- September 3, 2010.

User Interests: Definition, Vocabulary, and Utilization in Unifying Search and Reasoning. Yi Zeng, Yan Wang, Zhisheng Huang, Danica Damljanovic, Ning Zhong, and Cong Wang. Proceedings of the 2010 International Conference on Active Media Technology, Lecture Notes in Computer Science 6335, 98-107, 2010.

Social Relation Based Search Refinement: Let Your Friends Help You!. Xu Ren, Yi Zeng, Yulin Qin, Ning Zhong, Zhisheng Huang, Yan Wang, and Cong Wang. Proceedings of the 2010 International Conference on Active Media Technology, Lecture Notes in Computer Science 6335, 475-485, 2010.

Normalized Medline Distance and Its Utilization in Context-aware Life Science Literature Search. Yan Wang, Cong Wang, Yi Zeng, Zhisheng Huang, Vassil Momtchev, Bo Andersson, Xu Ren, and Ning Zhong. Proceedings of the 4th Chinese Semantic Web Symposium, August 19-21, 2010 (Recommended to Tsinghua Science and Technology, Elsevier).

User-centric Query Refinement and Processing Using Granularity Based Strategies. Yi Zeng, Ning Zhong, Yan Wang, Yulin Qin, Zhisheng Huang, Haiyan Zhou, Yiyu Yao, and Frank van Harmelen. Knowledge and Information Systems, Springer.

DBLP-SSE: A DBLP Search Support Engine, Yi Zeng, Yiyu Yao, Ning Zhong. In: Proceedings of the 2009 IEEE/WIC/ACM International Conference on Web Intelligence, IEEE Computer Society, Milan, Italy, September 15-18, 2009.

http://www.wici-lab.org/wici/~yizeng/papers/WI-2010-06-26.pdf

http://www.wici-lab.org/wici/~yizeng/papers/AMT-Yi-Yan-Danica-Zhisheng-0623.pdf

http://www.wici-lab.org/wici/~yizeng/papers/AMT-2010-Xu-Yi.pdf

http://www.wici-lab.org/wici/~yizeng/papers/CSWS2010-Yi.pdf

http://springerlink.com/content/r855r106r0577750/?p=481a12ba43c94f09860225dcea74668f&pi=0

http://www.wici-lab.org/wici/~yizeng/papers/WI2009-camera-ready.pdf


Thank you!

URL: http://www.wici-lab.org/wici/~yizeng Email: [email protected]
