Searching for patterns in crowdsourced information

Silvia Puglisi

Table of contents

- Let me introduce myself..
- What is crowdsourcing?
- Discovering network dynamics and patterns in unstructured data.
- Where to go from here..

Let me introduce myself..

2007: Graduated in Computer Engineering from Polimi [Politecnico di Milano].

Thesis on applications in robotics of a model of the hippocampal spatial function.

The project involved applying a path-planning algorithm based on neural networks to an e-puck robot.

http://www.e-puck.org for more info on e-puck



2007: Joined Google as Corporate Operations Engineer.

My responsibilities included maintaining, designing, diagnosing, troubleshooting and/or updating Google corporate IT infrastructure and user-facing services.


2010: Joined the Google Enterprise team as Technical Account Manager for Gmail and Postini.

My responsibilities included:

- Develop creative solutions to maximize the adoption of Google Apps in organisations.
- Work with product and engineering teams to translate customer needs into a better product experience.
- Develop and implement processes and infrastructure to scale customer-facing operations.


2012: Left Google to finish M.Sc. Thesis and prepare for Ph.D.

2012: Graduated from Trinity College Dublin in M.Sc. program in Management of Information Systems.

Final Thesis: Proposing a method for evaluating the quality of crowdsourced geographical information.

What is crowdsourcing?

Crowdsourcing can be defined as the application of Open Source principles to fields outside of software.

Howe, 2006.

What is crowdsourcing?

Crowdsourcing takes a decentralized approach to problem solving, handing tasks that have traditionally been performed by individuals to a group of people: the crowd.

From crowdsourcing to spontaneous collaboration.

Crowdsourcing initiatives usually start with a call for solutions from an organization or an entity.

However, network dynamics are sometimes also an indirect source of data and answers to specific problems.

Wikipedia is perhaps the most striking example of this phenomenon, in which people decide to collaborate spontaneously on a task.

Discovering network dynamics and patterns in unstructured data.

“Some twenty years ago I saw, or thought I saw, a synchronal or simultaneous flashing of fireflies. I could hardly believe my eyes, for such a thing to occur among insects is certainly contrary to all natural laws.”

Philip Laurent, Science Journal 1917

Discovering network dynamics and patterns in unstructured data.

Complex network structures describe a wide variety of systems of technological and biological importance.

The web itself is an example of a complex network of pages linked by their hyperlinks.

A social network is instead a network whose nodes are human beings and whose edges are the various relationships that occur between them.

The web is a giant bobble of unstructured data.

The web has hence been developing as an open environment with infinite possibilities for collaboration and information sharing.

User activity on the web now generates content that provides diverse information about the interactions between different entities and the world around them.

This is amplified in social networks, where people voluntarily share information about anything.

http://en.wikipedia.org/wiki/Bobble_%28knitting%29

Volunteered Information VS web pages.

Volunteered information consists of snippets of text, most of the time just a few words, with other media attached: photos, videos, sounds.

Volunteered information is to web pages what post-its or snippets are to books.

Volunteered Information VS web pages.

Volunteered information does not exhibit an explicit network structure formed by explicit links between items.

In the case of a web page, this structure is evident, since one page can link to other pages explicitly.

Links between pieces of volunteered information are instead created by the relationships between the contexts of the documents.

The context of a document is made of the surrounding circumstances and facts that influence the meaning of a sentence, a passage, or even just a picture, a video or an audio file.

Understanding the context is the key to understanding the semantics of a document, and hence how much valuable information it actually contains.

Defining context..

Defining context hence means trying to figure out what can be automatically inferred regarding:

- Where was the document created?
- Who created the document and shared it?
- What does the document describe?
- When was it shared?

Context is the key ingredient.

Context is then the ingredient that adds value to information.

If a document can be contextually linked to other documents it becomes more relevant.

It means more information can be inferred regarding that document.

Which context?

Regarding volunteered information, five types of context can be identified for a given object:

1) personal, 2) social, 3) geographical, 4) temporal, 5) linguistic.

A network model.

If context is interpreted as a property for a given object, we find out that at every level, each attribute will define a derived hierarchy in which an element “belongs” or is a “child” of another element higher or lower in the hierarchy.

A network model.

Let's imagine a following/followed relationship in a social network:

- John Stewart follows Dave Matthews and Stephen Colbert
- Tim Reynolds follows Dave Matthews and Stephen Colbert
- Stephen Colbert follows John Stewart
- Dave Matthews follows John Stewart and Tim Reynolds
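The relationships above can be written down as a directed graph. The sketch below is only one possible encoding: a plain dictionary mapping each person to the people they follow. The names `follows`, `followers_of` and `edge_count` are illustrative and not part of the original talk.

```python
# Who follows whom, copied from the example above.
# Each key follows every person in its list (a directed edge key -> person).
follows = {
    "John Stewart": ["Dave Matthews", "Stephen Colbert"],
    "Tim Reynolds": ["Dave Matthews", "Stephen Colbert"],
    "Stephen Colbert": ["John Stewart"],
    "Dave Matthews": ["John Stewart", "Tim Reynolds"],
}


def followers_of(graph):
    """Invert the relation: map each person to the people who follow them."""
    followers = {person: [] for person in graph}
    for person, followed in graph.items():
        for target in followed:
            followers.setdefault(target, []).append(person)
    return followers


followers = followers_of(follows)

# Total number of directed edges in the graph.
edge_count = sum(len(targets) for targets in follows.values())
```

Inverting the dictionary answers the "followed by" question for each person without storing the edges twice.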

A network model.

Let's now concentrate on attributes for volunteered information.

Every attribute could describe a node in our system.

Every edge describes the frequency (or probability) with which two attributes appear together.

This behaviour is particularly evident in tag networks.
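As a rough illustration of such a tag network, the sketch below counts how often pairs of tags appear on the same item and turns the counts into edge weights. The tagged items are invented for the example, and weighting an edge by the fraction of items that carry both tags is only one possible choice.

```python
from collections import Counter
from itertools import combinations

# Hypothetical tagged items; the tags are invented for illustration.
items = [
    {"dublin", "rain", "street"},
    {"dublin", "pub", "music"},
    {"dublin", "rain"},
    {"music", "concert"},
]


def cooccurrence_edges(tagged_items):
    """Count how many items each unordered pair of tags appears in together."""
    counts = Counter()
    for tags in tagged_items:
        # Sorting gives every pair a canonical (a, b) order with a < b.
        for a, b in combinations(sorted(tags), 2):
            counts[(a, b)] += 1
    return counts


def edge_probabilities(tagged_items):
    """Weight each edge by the fraction of items that contain both tags."""
    total = len(tagged_items)
    counts = cooccurrence_edges(tagged_items)
    return {pair: n / total for pair, n in counts.items()}


edges = cooccurrence_edges(items)
probs = edge_probabilities(items)
```

Here each tag is a node and each key of `edges` is an edge, so tags that never appear together are simply not connected.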

A network model.

Such a model hence consists of N nodes, where each pair of nodes is connected with probability p, creating a graph with approximately pN(N-1)/2 randomly distributed edges.

This is what is called a random graph model, and it is among the most used models in complex networks theory.
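A minimal sketch of the random graph model follows, assuming an undirected graph without self-loops. The function names, the seed and the values of N and p are arbitrary choices made for the example.

```python
import random
from itertools import combinations


def random_graph(n, p, seed=None):
    """Return the edge list of a random graph on nodes 0..n-1.

    Each of the n*(n-1)/2 possible undirected edges is kept
    independently with probability p.
    """
    rng = random.Random(seed)
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]


def expected_edges(n, p):
    """Expected number of edges: p * N * (N - 1) / 2."""
    return p * n * (n - 1) / 2


n, p = 200, 0.05
edges = random_graph(n, p, seed=42)
expected = expected_edges(n, p)
```

Comparing `len(edges)` with `expected` for a few seeds shows the edge count fluctuating around pN(N-1)/2.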

Small world networks.

It is generally agreed that the relationships between nodes in such networks are not entirely random, but display some hints of the underlying organizing principles.

One such principle is the small-world concept, which describes how, despite their often large size, complex networks have a relatively short path between any two nodes (Watts, D. J., & Strogatz, S. H., 1998).
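The "short path" idea can be made concrete by measuring the average shortest-path length of a graph. The sketch below uses breadth-first search on a small ring of nodes and then adds two long-range links. The ring size and the shortcut positions are arbitrary assumptions; this is not the Watts and Strogatz rewiring procedure itself, only an illustration of how a few shortcuts shorten paths.

```python
from collections import deque


def ring_lattice(n):
    """Undirected ring: node i is linked to its two immediate neighbours."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def add_edge(adj, u, v):
    """Add an undirected edge between u and v."""
    adj[u].add(v)
    adj[v].add(u)


def bfs_distances(adj, source):
    """Shortest hop count from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def average_path_length(adj):
    """Mean shortest-path length over all ordered pairs of connected nodes."""
    total, pairs = 0, 0
    for source in adj:
        for target, d in bfs_distances(adj, source).items():
            if target != source:
                total += d
                pairs += 1
    return total / pairs


ring = ring_lattice(20)
ring_avg = average_path_length(ring)

shortcut = ring_lattice(20)
add_edge(shortcut, 0, 10)  # hypothetical long-range links
add_edge(shortcut, 5, 15)
shortcut_avg = average_path_length(shortcut)
```

With only two extra edges, `shortcut_avg` drops below `ring_avg`, which is the effect the small-world concept describes.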

Properties of small world networks.

A common property of such networks is that the relationships between the nodes tend to form cliques.

Cliques may represent circles of acquaintances at a social level, they can describe all the users of an online community who tend to communicate with one another, or they can describe relationships between words in different documents.
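One common way to quantify how clique-like the neighbourhood of a node is, is the local clustering coefficient: the fraction of possible links between a node's neighbours that actually exist. The sketch below assumes an undirected graph stored as a dictionary of neighbour sets; the acquaintance graph is invented for the example.

```python
from itertools import combinations


def local_clustering(adj, node):
    """Fraction of pairs of neighbours of `node` that are themselves linked."""
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # fewer than two neighbours: no pair to check
    links = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
    return links / (k * (k - 1) / 2)


# Hypothetical acquaintance graph: a, b, c, d all know each other (a clique),
# while e only knows d.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c", "e"},
    "e": {"d"},
}

coefficients = {node: local_clustering(adj, node) for node in adj}
```

A value of 1.0 means the node and its neighbours form a clique; node `d` scores lower because its neighbour `e` is not connected to the others.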

Properties of small world networks.

Another aspect of complex networks that helps in understanding their properties and dynamics is the degree distribution, which describes how the number of edges per node varies across the network.

In fact, we would not expect all nodes in the network to have the same degree. Node degree is instead characterized by a probability distribution function P(k), which gives the probability that a randomly selected node has exactly k edges.
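For a concrete graph, P(k) can be estimated empirically as the fraction of nodes with exactly k edges. The sketch below assumes an undirected graph stored as a dictionary of neighbour sets; the small graph is invented for the example.

```python
from collections import Counter


def degree_distribution(adj):
    """Empirical P(k): fraction of nodes that have exactly k edges."""
    n = len(adj)
    degree_counts = Counter(len(neighbours) for neighbours in adj.values())
    return {k: count / n for k, count in sorted(degree_counts.items())}


# Hypothetical undirected graph, given as node -> set of neighbours.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c", "e"},
    "e": {"d"},
}

P = degree_distribution(adj)
```

In this toy graph three of the five nodes have degree 3, so `P[3]` is 0.6, while degrees 1 and 4 each occur once.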

Where to go from here?

Search and Quality Ranking.

In Page and Brin's PageRank algorithm, the rank of a node in the network (i.e. a web page) can be calculated as follows:

$$R(i) = \sum_{j \in B_i} \frac{R(j)}{N(j)}$$

where B_i is the set of documents connected to i, R(i) is the rank of the given document i, R(j) is the rank of a document j connected to i, and N(j) is the number of connections from j.
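The rank described above can be approximated iteratively: every node repeatedly passes its current rank, divided by its number of outgoing connections, to the nodes it points to. The sketch below is a hedged illustration, not the original implementation. It adds a damping factor of 0.85, which is an assumption not mentioned in the talk, so that the iteration also converges on graphs where the plain update would oscillate. It also assumes every node has at least one outgoing connection. The example graph reuses the follower relationships from the network model section.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=1000):
    """Iteratively estimate the rank of every node.

    `links` maps each node j to the list of nodes it points to.
    Assumption: every node appears as a key and has at least one
    outgoing connection (no dangling nodes).
    """
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}  # start from a uniform rank
    for _ in range(max_iter):
        # Damping term (an assumption of this sketch, see the text above).
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for j, targets in links.items():
            share = damping * rank[j] / len(targets)  # R(j) / N(j)
            for i in targets:
                new_rank[i] += share
        delta = sum(abs(new_rank[x] - rank[x]) for x in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Follower relationships from the network model example:
# each key follows (points to) the people in its list.
follows = {
    "John Stewart": ["Dave Matthews", "Stephen Colbert"],
    "Tim Reynolds": ["Dave Matthews", "Stephen Colbert"],
    "Stephen Colbert": ["John Stewart"],
    "Dave Matthews": ["John Stewart", "Tim Reynolds"],
}

ranks = pagerank(follows)
top = max(ranks, key=ranks.get)
```

In this small graph the ranks sum to 1 and the most followed-by-well-ranked node ends up on top.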


Both the local clustering coefficient and the degree of a given node in the network give an estimate of how well connected that node is to other nodes nearby.

Because the model used is built on the document context, more connections are an indication of richer content and better-quality information in the document itself.

Privacy and Security.. just some food for thought.

We said that a common property of small world networks is that the relationships between the nodes tend to form cliques.

What if this could be applied to the rules in a stateful firewall?

What if we want to find out which data we are most likely to share with which people on a social network?

Questions and Answers.

?
