
PeARS: a Peer-to-peer Agent for Reciprocated Search

Aurélie Herbelot
University of Trento, Centre for Mind/Brain Sciences

Palazzo Fedrigotti, Corso Bettini 31, 38068 Rovereto, Italy

[email protected]

ABSTRACT
This paper presents PeARS (Peer-to-peer Agent for Reciprocated Search), an algorithm for distributed Web search that emulates the offline behaviour of a human with an information need. Specifically, the algorithm models the process of ‘calling a friend’, i.e. directing one’s query to the knowledge holder most likely to answer it. The system allows network users to index and share part of their browsing history in a way that makes them ‘experts’ on some topics. A layer of distributional semantics agents then performs targeted information retrieval in response to user queries, in a fully automated fashion.

Keywords
Web search; distributional semantics; distributed systems

1. INTRODUCTION
The Web hosts billions of documents. Recording the content of those documents and providing the ability to search them in response to a specific information need within a mere couple of seconds is considered a challenging big data problem. In this paper, we propose that searching the Web does not necessarily require a large infrastructure. We redesign the notion of search as a distributed process mirroring the offline behaviour of a human agent with an information need and access to a community of knowledge holders.

Our implementation deals with the size of the search space in the way humans deal with the complexity of their environment: by focusing their attention on relevant sources of information. In our scenario, where representations of a large number of Web documents are spread out across a network of human users with specific browsing habits, queries are matched with those user records that are most likely to hold relevant information, thus considerably reducing the search space. Every semantic element in the network, from words to documents to user profiles, is modelled as a distributional semantics (DS) vector in a shared space.

Copyright is held by the author/owner(s).
WWW’16 Companion, April 11–15, 2016, Montréal, Québec, Canada.
ACM 978-1-4503-4144-8/16/04.
http://dx.doi.org/10.1145/2872518.2889369

2. INTUITION
Let’s imagine a person with a particular information need, for instance, getting sightseeing tips for a holiday destination. Let’s also imagine that this person does not have access to the Internet. How might she gain access to the information she seeks? Typically, she will identify the actors that may hold the answer to her query within her community: she might call a friend who has done the trip before, or her local travel agent. The relevance criterion used in this process ensures that she does not waste time talking to actors who are less likely to be able to help, such as the local baker or a poorly-travelled uncle.

PeARS reproduces this process by creating a distributed network layer over a community of real Internet users. Each peer in the network corresponds to a user and models that human’s online behaviour – in particular their primary interests. A keen traveller is likely to know about the most informative travel sites, while a dog trainer may repeatedly visit the online shops which they consider reliable for buying training equipment. PeARS creates artificial agents – one per human user – which can query each other about the topics they are ‘specialists’ in (see §6 for details on respecting the user’s privacy). These agents make decisions using distributional semantics models of both documents and users.

3. DISTRIBUTIONAL SEMANTICS (DS)
DS [2] is an approach to computational semantics actively researched within linguistics, cognitive science and neuroscience. A DS system analyses large corpora to build word meaning representations in the form of ‘distributions’. In their most basic form, such distributions are vectors in a so-called semantic space where each dimension represents a term in the overall system vocabulary. The value of a vector along a particular dimension expresses how characteristic the dimension is for the word modelled by the vector (as calculated using e.g. Pointwise Mutual Information). Typically, the vector cat will have high weight along the dimension meow but low weight along politics. Actual implementations vary, from the basic setup described here to multi-modal, dimensionality-reduced models. DS relates to, but is distinct from, the use of vector spaces in classic information retrieval – in particular in its focus on cognitive plausibility.
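To make the weighting scheme concrete, here is a minimal sketch of (positive) Pointwise Mutual Information over an invented toy co-occurrence table; the two words, two context dimensions and all counts are made up for illustration, and a real system would use a vocabulary of hundreds of thousands of terms:

```python
import numpy as np

# Invented co-occurrence counts: rows = target words, columns = context dimensions.
vocab = ["cat", "politics"]
dims = ["meow", "vote"]
counts = np.array([[50.0, 1.0],    # 'cat' mostly co-occurs with 'meow'
                   [1.0, 40.0]])   # 'politics' mostly co-occurs with 'vote'

def ppmi(counts):
    """Positive PMI weighting of a raw count matrix."""
    total = counts.sum()
    p_xy = counts / total
    p_x = counts.sum(axis=1, keepdims=True) / total
    p_y = counts.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore"):
        pmi = np.log2(p_xy / (p_x * p_y))
    return np.maximum(pmi, 0.0)  # clip negative associations to zero

weights = ppmi(counts)
# The 'cat' row now has high weight along 'meow' and zero weight along 'vote'.
```

The clipping to zero (PPMI rather than raw PMI) is a common practical choice in count-based DS models, not something the paper specifies.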

One of the research areas in DS concerns compositionality, i.e., how words combine to form phrases and sentences. It has repeatedly been found that simple addition of vectors performs well in modelling the meaning of larger constituents (i.e., we express the meaning of black cat by simply summing the vectors for black and cat). This paper expands on this result by positing that the (rough) meaning of a document is similarly the addition of all characteristic words for that document. Further, we can sum the distributions of the documents in a user’s search history to get a single vector modelling that user. So in a single semantic space, we may model that cat and dog are similar, that two documents on classic cars belong to the same topic, and that two users who browse programming forums may have relevant information for each other (even if they do not necessarily browse the same sites).
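The additive picture above can be sketched in a few lines of numpy; the three-dimensional word vectors are invented for the example (real spaces have hundreds of dimensions):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity, the standard closeness measure in a semantic space."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented toy word vectors.
black = np.array([0.9, 0.1, 0.0])
cat   = np.array([0.1, 0.9, 0.2])
dog   = np.array([0.0, 0.8, 0.3])

# Phrase meaning by addition, as described above.
black_cat = black + cat

# A 'document' is the sum of its characteristic words,
# and a 'user' the sum of their document vectors.
doc1 = cat + dog        # a pet-related page
doc2 = black + cat      # another page
user = doc1 + doc2      # profile of a user who browsed both

# Words, documents and users all live in the same space, so cat is
# closer to dog than to black, and pet-heavy users sit near pet pages.
```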

4. SYSTEM ARCHITECTURE
A PeARS network consists of n peers {p1...pn}, corresponding to n users {u1...un} connected in a distributed topology (all peers are connected to all other peers). Each peer pk has two components: a) an indexing component Ik; b) a query component Qk. All peers also share a common semantic space S which gives DS representations for words in the system’s vocabulary. In our current implementation, S is given by the CBOW semantic space of [1], a 400-dimensional vector space of 300,000 lexical items built using a state-of-the-art neural network language model.
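The per-peer state can be sketched as follows; the class and field names are invented for illustration, and a toy dimensionality stands in for the 400 dimensions of S:

```python
from dataclasses import dataclass, field
import numpy as np

DIMS = 4  # stand-in for the 400 dimensions of the shared space S

@dataclass
class Peer:
    """One network peer: a document index (component Ik's output)
    plus the user profile derived from it."""
    name: str
    index: dict = field(default_factory=dict)  # URL -> document vector
    profile: np.ndarray = field(default_factory=lambda: np.zeros(DIMS))

    def add_document(self, url, doc_vector):
        self.index[url] = doc_vector
        # The profile is the sum of all indexed document vectors.
        self.profile = sum(self.index.values(), np.zeros(DIMS))

p = Peer("u1")
p.add_document("https://en.wikipedia.org/wiki/Bangalore",
               np.array([1.0, 0.0, 0.5, 0.0]))  # invented document vector
```

Both `index` and `profile` correspond to the information the paper says is shared across the network; the query component Qk is sketched separately below in the querying section.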

Indexing: Ik builds vector representations for each document in uk’s browsing history. For instance, if uk visits the Wikipedia page on Bangalore, the URL of that page becomes associated with a 400-dimensional vector produced by summing the distributions of the 10 most characteristic words for the document (these are identified by comparing their document frequency to their entropy in a large corpus). At regular intervals, Ik also updates uk’s profile by summing the vectors of all indexed documents, outputting a 400-dimensional vector ~uk which inhabits an area of the semantic space related to their interests (i.e., the type of information they have browsed).
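A rough sketch of the document-indexing step follows. The paper ranks characteristic words by comparing document frequency to corpus entropy; as a simpler stand-in, this sketch takes the k most frequent in-vocabulary tokens after dropping a tiny stopword list. The word space, stopword list and example sentence are all invented:

```python
import numpy as np
from collections import Counter

# Invented toy word space; a real deployment uses the shared 400-dim space S.
space = {
    "bangalore": np.array([1.0, 0.0, 0.0]),
    "city":      np.array([0.5, 0.5, 0.0]),
    "india":     np.array([0.8, 0.1, 0.1]),
    "the":       np.array([0.3, 0.3, 0.3]),
}

def index_document(tokens, space, k=10):
    """Sum the vectors of the k most characteristic words of a document.

    Stand-in ranking: frequency of in-vocabulary, non-stopword tokens
    (the paper's actual criterion is document frequency vs. corpus entropy).
    """
    stopwords = {"the", "a", "of"}
    counts = Counter(t for t in tokens if t in space and t not in stopwords)
    top = [w for w, _ in counts.most_common(k)]
    return sum((space[w] for w in top), np.zeros(3))

doc = "the city of bangalore is a city in india".split()
vec = index_document(doc, space, k=3)  # document vector for the URL
```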

As a result of the indexing process, two types of information are made freely available across the network: the user profile ~uk and the individual document vectors Dk = {d1...dn} used to build ~uk (at a particular URI, or in the form of a distributed hash table). Periodically, each peer p1...pn scans the network to collect all profiles ~u1...~un and stores them locally.

Querying: Qk takes a query q and translates it into a vector ~q by summing the vectors of the words in the query. It then goes through a 2-stage process: 1) find relevant peers amongst p1...pn by calculating the distance between ~q and all users’ profiles ~u1...~un (vector distance is operationalised as cosine similarity); 2) on the m relevant peers, calculate the distance between ~q and all documents indexed by the peer. Return the URLs corresponding to the smallest distances, in sorted order.
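The two-stage retrieval can be sketched directly; the peer names, URLs and two-dimensional vectors below are invented for the example:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query_vec, profiles, indices, m=2, top=3):
    """Two-stage retrieval: pick the m closest peers, then rank their documents.

    profiles: {peer_id: profile vector}; indices: {peer_id: {url: doc vector}}.
    """
    # Stage 1: the m peers whose profiles are most similar to the query.
    peers = sorted(profiles, key=lambda p: cosine(query_vec, profiles[p]),
                   reverse=True)[:m]
    # Stage 2: rank only those peers' documents against the query.
    hits = [(cosine(query_vec, vec), url)
            for p in peers for url, vec in indices[p].items()]
    return [url for _, url in sorted(hits, reverse=True)[:top]]

# Invented toy network: one travel-focused peer, one programming-focused peer.
profiles = {"traveller": np.array([1.0, 0.0]),
            "coder":     np.array([0.0, 1.0])}
indices = {"traveller": {"travel.example/rome":  np.array([0.9, 0.1]),
                         "travel.example/paris": np.array([0.8, 0.0])},
           "coder":     {"dev.example/python":   np.array([0.1, 0.9])}}

query = np.array([1.0, 0.1])  # a travel-flavoured query vector
results = search(query, profiles, indices, m=1, top=2)
```

With m = 1, only the traveller’s index is ever scanned, which is exactly how the search space is reduced: the coder’s documents are never compared against the query.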

5. PERFORMANCE
Speed: Qk involves two stages: 1) the computation of cosine similarities between a query (one vector with 400 dimensions) and all the peers on the network (a matrix with dimensionality n × 400); 2) calculating cosines between the query and the documents hosted by the most relevant peers, as identified in the first stage. For the purpose of assessing system speed, we generate random vectors and perform the cosine computation over the generated set. Our current implementation, running on a 4GB Ubuntu laptop under normal load, performs the calculation over batches of n = 100,000 peers at stage 1. Each batch is computed in around 350ms. At stage 2, assuming an average of 10,000 documents per node, the computation time is 45ms for each peer.
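The stage-1 computation reduces to a single matrix–vector product. A minimal timing sketch over random vectors, in the spirit of the setup above but scaled down from 100,000 to 10,000 peers so it runs instantly, might look like this:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
n, dims = 10_000, 400          # scaled-down stand-in for a 100,000-peer batch

query = rng.standard_normal(dims)
profiles = rng.standard_normal((n, dims))  # one profile vector per peer

# Cosine of the query against every profile in one matrix-vector product.
start = time.perf_counter()
sims = (profiles @ query) / (np.linalg.norm(profiles, axis=1)
                             * np.linalg.norm(query))
elapsed = time.perf_counter() - start

best = int(np.argmax(sims))    # index of the most similar peer
```

Absolute timings will of course differ across machines; the point is that stage 1 is one dense matrix operation, which is why a batch of 100,000 peers fits in a few hundred milliseconds on commodity hardware.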

This preliminary investigation indicates that on a home machine, the system covers up to 200,000 peers × 10,000 = 2 billion documents in around a second (we must subtract potential redundancies between peers from this figure). Note that in a ‘real-life’ system, we would need to include additional time to retrieve the indices of the remote peers. However, we can also increase efficiency by sorting the list of known peers as a function of their similarity to the user’s profile and caching the most similar nodes. The premise is that a user will very often search for information related to their interests and thus require access to peers that are like their own profile.
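The caching idea can be sketched as follows; the peer names and three-dimensional profiles are invented for the example:

```python
import numpy as np

def build_peer_cache(own_profile, profiles, k=2):
    """Keep the k peers whose profiles are most similar to the local user.

    Premise from the text: queries tend to resemble the user's own interests,
    so the cached peers will answer most queries without a full network scan.
    """
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(profiles, key=lambda p: cosine(own_profile, profiles[p]),
                    reverse=True)
    return ranked[:k]

me = np.array([1.0, 0.2, 0.0])                 # local user's profile
others = {"p1": np.array([0.9, 0.1, 0.0]),     # very similar interests
          "p2": np.array([0.0, 0.1, 1.0]),     # unrelated interests
          "p3": np.array([0.8, 0.4, 0.1])}     # fairly similar interests
cache = build_peer_cache(me, others, k=2)      # p2 is left out of the cache
```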

Accuracy: Measuring the search accuracy of the system is ongoing work. We are testing the system’s architecture on real user queries from the search engine Bing, as available – together with the Wikipedia page users found relevant for the respective queries – from the WikiQA corpus [3]. Our current simulation is a network of around 4,000 peers covering 1M documents, modelled after the publicly available profiles of Wikipedia contributors. Preliminary results indicate that our system, when consulting the m = 5 most promising peers for each query, outperforms a centralised solution, as implemented by the Apache Lucene search engine1 (Herbelot & QasemiZadeh, in prep.).

6. CONCLUSION
We have presented an architecture for a user-centric, distributed Web search algorithm that utilises the inherent ‘specialisms’ of individuals as they browse the Internet.

We should note that our system relies on the willingness of its users to share some of their search history with others. We alleviate the privacy concerns associated with this requirement in three ways: a) the user can create a blacklist of sites that will never be indexed by the system; b) before making an index available, the agent clusters documents into labelled topics and presents them to the user, who can decide to exclude certain topics from the index; c) there is no requirement for the shared index to be linked to a named and known user.

PeARS is under active development and code is regularlymade available at https://github.com/PeARSearch.

7. ACKNOWLEDGMENTS
Grateful thanks to Hrishikesh K.B., Veesa Norman, Shobha Tyagi, Nandaja Varma and Behrang QasemiZadeh for their technical contributions to the project, and to the anonymous reviewers for their helpful feedback. The author acknowledges support from the ERC Grant 283554 (COMPOSES).

8. REFERENCES
[1] Marco Baroni, Georgiana Dinu, and Germán Kruszewski. Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In ACL 2014, pages 238–247, 2014.

[2] Katrin Erk. Vector space models of word meaning and phrase meaning: A survey. Language and Linguistics Compass, 6(10):635–653, 2012.

[3] Yi Yang, Wen-tau Yih, and Christopher Meek. WikiQA: A Challenge Dataset for Open-Domain Question Answering. In EMNLP 2015, 2015.

1 http://lucene.apache.org/
