S4

The S4 project & code

Owner: Raphael Bonaque
Presenter: Juan Álvaro Muñoz Naranjo

OAK Code Days 16-18 October 2014

(Very) general overview

•  S4 stands for “Social Semantic Structured Search”
•  Goal: an RDF-based keyword search engine for social and structured environments (currently Twitter)
•  The keywords to search for are defined by RDF semantics
•  Results are ranked by the proximity and position of the (extended) keywords within the documents and their comments
•  Example: searching for “animal” should return tweets containing “cat”, “dog”, or “eagle”, sorted by the ranking criteria
•  Keywords are currently taken from DBpedia
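The “animal” example above can be sketched as a keyword expansion step followed by a simple match. This is only an illustration under stated assumptions: the `NARROWER` dictionary is a hand-written stand-in for semantics that S4 takes from DBpedia, and the names `expand_keyword` and `matching_tweets` are hypothetical, not taken from the S4 code.

```python
from collections import deque

# Hypothetical, hand-written hierarchy standing in for DBpedia data:
# each key maps to its direct, more specific terms.
NARROWER = {
    "animal": ["mammal", "bird"],
    "mammal": ["cat", "dog"],
    "bird": ["eagle"],
}


def expand_keyword(keyword, narrower=NARROWER):
    """Return the keyword plus every term reachable through `narrower`."""
    seen = {keyword}
    queue = deque([keyword])
    while queue:
        term = queue.popleft()
        for child in narrower.get(term, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def matching_tweets(keyword, tweets):
    """Keep the tweets that contain at least one extended keyword."""
    terms = expand_keyword(keyword)
    return [t for t in tweets if terms & set(t.lower().split())]


tweets = ["My cat sleeps all day", "Nice weather today", "An eagle flew by"]
print(matching_tweets("animal", tweets))
```

The breadth-first walk keeps a `seen` set, so a hierarchy with shared or cyclic links would still terminate. Ranking of the matched tweets is left out of this sketch.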

Programming language

•  Python

History, versions

•  Recent project
•  Two branches:
   •  Storage through serialization
   •  Storage via PostgreSQL
•  No code is reused from or into other projects

Code size

Branch          Lines    Python scripts (.py)    Classes
Serialization   ~1240    12                      9
PostgreSQL      ~1480    9                       7

Contributors, users

•  Raphael Bonaque

Code repository

•  https://gforge.inria.fr/scm/viewvc.php/?root=xrp (permission needed)

Folder “postgres4”: version for the PostgreSQL DB.

Papers

•  R. Bonaque, B. Cautis, F. Goasdoué, I. Manolescu. Toward Social, Structured and Semantic Search. SDSW’14, co-located with ISWC’14.
•  R. Bonaque, B. Cautis, F. Goasdoué, I. Manolescu. S4 Structured Social and Semantic Search (working draft).

Overview of the software

•  Input:
   •  User query
   •  Twitter (static) database
   •  RDF semantics
•  Output:
   •  A ranked collection of tweets

Main modules

1. Tweets retrieval
   •  Uses the Twitter API through the Tweepy library
   •  Compresses the retrieved data
   Files:
   •  receiving.py: tweet retriever (through Tweepy)
   •  archiving.py: data compression and management
   •  secrets.py: API key (not in the repo)

2. Semantics retrieval & storage
   •  Creates and stores RDF semantics from DBpedia
   Files:
   •  rdf_db.py: PostgreSQL I/O wrapper

3. Tweets storage
   •  Decompresses tweets
   •  Parses tweets according to the RDF semantics
   •  Stores parsed tweets in the DB
   Files:
   •  twitter_database.py: (old version) object serialization
   •  social_db.py: (new version) PostgreSQL I/O wrapper
   •  archiving.py
   •  config.py: database parameters (connection string, etc.)

4. Search engine
   •  Search algorithm
   Files:
   •  algorithm.py: interface for algorithms
   •  baseline_algorithm.py: actual algorithm and entry point
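The split between an algorithm interface and a baseline implementation in module 4 could look roughly like the sketch below. Only the method name `top` comes from the slides; the class names, the `top(keywords, tweets, k)` signature, and the scoring rule (an earlier keyword position gives a higher score) are assumptions made for illustration and do not reproduce the actual S4 ranking.

```python
import heapq
from abc import ABC, abstractmethod


class Algorithm(ABC):
    """Hypothetical interface: every search algorithm exposes top()."""

    @abstractmethod
    def top(self, keywords, tweets, k):
        """Return the k best-scoring tweets for the given keywords."""


class PositionBaseline(Algorithm):
    """Toy scorer: a keyword that appears earlier in a tweet scores higher."""

    def score(self, keywords, tweet):
        words = tweet.lower().split()
        positions = [i for i, w in enumerate(words) if w in keywords]
        if not positions:
            return 0.0  # no keyword found: the tweet is not a match
        return 1.0 / (1 + positions[0])

    def top(self, keywords, tweets, k):
        scored = [(self.score(keywords, t), t) for t in tweets]
        matches = [(s, t) for s, t in scored if s > 0]
        best = heapq.nlargest(k, matches, key=lambda pair: pair[0])
        return [t for _, t in best]


tweets = ["look at this dog", "dog photos here", "nothing relevant"]
print(PositionBaseline().top({"cat", "dog"}, tweets, k=2))
```

Keeping the interface separate means an alternative ranking can be swapped in by subclassing `Algorithm` without touching the code that calls `top()`.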

Workflow

[Figure: workflow diagram. Boxes: Module 1 (receiving.py, archiving.py); Module 2 (rdf_db.py); Module 3 (social_db.py, archiving.py); Module 4 (baseline_algorithm.py, entry point: baseline_algorithm.top()). Arrow labels: tweets, compressed tweets, uncompressed tweets, keywords from DB, User query, Top-k tweets. Part of the diagram is marked “Offline”.]

External software

•  Tweepy: Twitter API interface for Python
   http://github.com/tweepy/tweepy
•  Twitter_nlp: tweet natural language processing for Python
   http://github.com/aritter/twitter_nlp
•  Psycopg: PostgreSQL adapter for Python
   http://initd.org/psycopg
•  Scipy: scientific calculations library for Python
   http://www.scipy.org
•  XZ: data compression tool
   http://tukaani.org/xz
•  Matplotlib: plotting library for Python (to be used soon)
   http://matplotlib.org
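As an illustration of the compression step, the sketch below round-trips a batch of tweets through the XZ format using Python's standard `lzma` module. How archiving.py actually invokes XZ is not stated in the slides, so the JSON layout, the function names, and the use of `lzma` rather than the `xz` command-line tool are all assumptions.

```python
import json
import lzma


def compress_tweets(tweets):
    """Serialize a list of tweet dicts to JSON and compress it as .xz data."""
    raw = json.dumps(tweets).encode("utf-8")
    return lzma.compress(raw)  # default container format is XZ


def decompress_tweets(blob):
    """Invert compress_tweets()."""
    return json.loads(lzma.decompress(blob).decode("utf-8"))


# Made-up sample data, repeated so the compression has something to gain.
tweets = [{"id": 1, "text": "My cat sleeps all day"}] * 100
blob = compress_tweets(tweets)
print(len(json.dumps(tweets)), "->", len(blob), "bytes")
```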

TODO

•  Implement execution scripts
•  Testing, benchmarking
•  Graph drawing
•  Optimization: query rewriting
•  Use of the “RDF loader into PostgreSQL” project
•  Alternatives to the baseline algorithm

Known bugs

•  Tweepy crashes randomly, so Raphael had to write a wrapper that restarts it when needed
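A restart wrapper of this kind might look like the following minimal sketch. It is not the wrapper from the S4 repository: the function name `run_with_restart`, the `max_restarts` and `delay` parameters, and the simulated `flaky_retriever` are hypothetical, and the retriever is a stand-in rather than a real Twitter API call.

```python
import time


def run_with_restart(task, max_restarts=3, delay=0.0):
    """Call task(); if it raises, wait `delay` seconds and start it again."""
    for attempt in range(max_restarts + 1):
        try:
            return task()
        except Exception as exc:
            if attempt == max_restarts:
                raise  # out of restarts: let the last error propagate
            print(f"task failed ({exc!r}), restarting")
            time.sleep(delay)


# Stand-in for a flaky retriever: fails twice, then succeeds.
calls = {"count": 0}


def flaky_retriever():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated crash")
    return "stream finished"


print(run_with_restart(flaky_retriever))
```

Capping the number of restarts avoids looping forever on a permanent failure such as an invalid API key.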

Thank you!
