The S4 project & code OWNER: Raphael Bonaque PRESENTER: Juan Álvaro Muñoz Naranjo OAK Code Days 16-18 October 2014
(Very) general overview
• S4 stands for “Social Semantic Structured Search”
• Goal: RDF-based keyword search engine for social and structured environments (currently Twitter)
• The keywords to search for are defined by RDF semantics
• Results are ranked by proximity and position of the (extended) keywords within the documents and their comments
• Example: searching for “animal” should return tweets containing “cat”, “dog” or “eagle”, sorted by the ranking criteria
• Keywords are currently taken from DBpedia
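The “animal” example can be illustrated with a minimal sketch of keyword extension. The `NARROWER` dictionary, `expand_keyword` and `matching_tweets` are hypothetical names made up for this illustration; the toy hierarchy is not taken from DBpedia and the real S4 code may extend keywords differently.

```python
from collections import deque

# Toy "narrower term" edges standing in for RDF semantics.
# Hypothetical data for illustration only, not taken from DBpedia.
NARROWER = {
    "animal": ["cat", "dog", "bird"],
    "bird": ["eagle"],
}


def expand_keyword(keyword, narrower=NARROWER):
    """Return the keyword plus every term reachable through narrower-term edges."""
    seen = {keyword}
    queue = deque([keyword])
    while queue:
        term = queue.popleft()
        for child in narrower.get(term, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def matching_tweets(keyword, tweets):
    """Keep tweets that contain at least one extended keyword as a whole word."""
    extended = expand_keyword(keyword)
    return [t for t in tweets if extended & set(t.lower().split())]


tweets = ["My cat sleeps all day", "Stock markets fell", "An eagle over the lake"]
print(matching_tweets("animal", tweets))
```

The sketch only filters tweets; ranking the matches is a separate step.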
Programming language
• Python
History, versions
• Recent project
• Two branches:
  • Storage through serialization
  • Storage via PostgreSQL
• No code is reused from or into other projects
Code size
Branch          Lines    Python scripts (.py)    Classes
Serialization   ~1240    12                      9
PostgreSQL      ~1480    9                       7
Contributors, users
• Raphael Bonaque
Code repository
• https://gforge.inria.fr/scm/viewvc.php/?root=xrp (permission needed)
• Folder “postgres4”: version for the PostgreSQL DB.
Papers
• R. Bonaque, B. Cautis, F. Goasdoué, I. Manolescu. Toward Social, Structured and Semantic Search. SDSW’14, co-located with ISWC’14.
• R. Bonaque, B. Cautis, F. Goasdoué, I. Manolescu. S4 Structured Social and Semantic Search (working draft).
Overview of the software
• Input:
  • User query
  • Twitter (static) database
  • RDF semantics
• Output:
  • A ranked collection of tweets
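To make the input/output contract concrete, here is a minimal sketch of a top-k ranking step. The scoring rule (a keyword hit counts more the earlier it appears) is an assumption chosen for illustration; it is not the S4 ranking, which also takes comments and proximity into account. `position_score` and `top_k` are hypothetical names.

```python
import heapq


def position_score(text, extended_keywords):
    """Toy score: a keyword hit counts more the earlier it appears in the text."""
    score = 0.0
    for index, word in enumerate(text.lower().split()):
        if word in extended_keywords:
            score += 1.0 / (1 + index)
    return score


def top_k(extended_keywords, tweets, k):
    """Return the k best-scoring tweets that contain at least one keyword."""
    scored = [(position_score(t, extended_keywords), t) for t in tweets]
    hits = [pair for pair in scored if pair[0] > 0]
    best = heapq.nlargest(k, hits, key=lambda pair: pair[0])
    return [tweet for _, tweet in best]


keywords = {"cat", "dog", "eagle"}  # assumed to be already extended
tweets = ["I saw a dog", "cat videos again", "nothing here", "the eagle landed"]
print(top_k(keywords, tweets, 2))
```

Tweets with no keyword hit are dropped before the top-k selection, so the result can be shorter than k.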
Main modules
1. Tweets retrieval
• Uses the Twitter API through the Tweepy library
• Compresses retrieved data
• receiving.py: tweet retriever built on Tweepy
• archiving.py: data compression and management
• secrets.py: API key (not in the repo)
2. Semantics retrieval & storage
• RDF semantics creation & storage from DBpedia
• rdf_db.py: PostgreSQL I/O wrapper
3. Tweets storage
• Decompresses tweets
• Parses tweets according to the RDF semantics
• Stores parsed tweets in the DB
• twitter_database.py: (old version) object serialization
• social_db.py: (new version) PostgreSQL I/O wrapper
• archiving.py
• config.py: database parameters (connection string, etc.)
4. Search engine
• Search algorithm
• algorithm.py: interface for algorithms
• baseline_algorithm.py: actual algorithm and entry point
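The split between an algorithm interface and a concrete baseline in module 4 could look like the sketch below. The class names `Algorithm` and `CountingBaseline`, the `top(query, k)` signature and the counting rule are assumptions for illustration; the actual contents of algorithm.py and baseline_algorithm.py are not shown in these slides.

```python
from abc import ABC, abstractmethod


class Algorithm(ABC):
    """Hypothetical interface: every search algorithm exposes top()."""

    @abstractmethod
    def top(self, query, k):
        """Return the k best tweets for the query."""


class CountingBaseline(Algorithm):
    """Stand-in baseline: ranks tweets by the number of query-word occurrences."""

    def __init__(self, tweets):
        self.tweets = list(tweets)

    def top(self, query, k):
        terms = set(query.lower().split())

        def hits(tweet):
            return sum(1 for word in tweet.lower().split() if word in terms)

        # sorted() is stable, so equally scored tweets keep their input order.
        ranked = sorted(self.tweets, key=hits, reverse=True)
        return [t for t in ranked if hits(t) > 0][:k]


engine = CountingBaseline(["dog dog cat", "cat", "tree"])
print(engine.top("dog cat", 2))
```

With this shape, an alternative algorithm only has to subclass the interface and implement `top()`.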
Workflow
[Diagram: the four modules and the data exchanged between them]
• Module 1: receiving.py, archiving.py
• Module 2: rdf_db.py
• Module 3: social_db.py, archiving.py
• Module 4: baseline_algorithm.py; entry point: baseline_algorithm.top()
• Data labels: tweets, compressed tweets, uncompressed tweets, keywords from DB, user query, top-k tweets
• Region label: Offline
External software
• Tweepy: Twitter API interface for Python
  http://github.com/tweepy/tweepy
• Twitter_nlp: tweet natural language processing for Python
  http://github.com/aritter/twitter_nlp
• Psycopg: PostgreSQL adapter for Python
  http://initd.org/psycopg
• SciPy: scientific computing library for Python
  http://www.scipy.org
• XZ: data compression tool
  http://tukaani.org/xz
• Matplotlib: (soon) plotting library for Python
  http://matplotlib.org
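As an illustration of the compress/decompress step, the sketch below round-trips tweets through the XZ format with Python's standard `lzma` module. Storing tweets as JSON lines is an assumption made for this example; the format actually used by archiving.py, and whether it calls the xz tool or a library, is not stated in these slides.

```python
import json
import lzma


def compress_tweets(tweets):
    """Serialize tweets as JSON lines and compress them in the .xz format."""
    payload = "\n".join(json.dumps(t) for t in tweets).encode("utf-8")
    return lzma.compress(payload)  # default container format is XZ


def decompress_tweets(blob):
    """Inverse of compress_tweets: decompress and parse one tweet per line."""
    text = lzma.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line]


tweets = [{"id": 1, "text": "My cat sleeps"}, {"id": 2, "text": "Héllo\nworld"}]
blob = compress_tweets(tweets)
print(decompress_tweets(blob) == tweets)
```

`json.dumps` escapes newlines and non-ASCII characters by default, so each tweet stays on a single line of the payload.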
TODO
• Implement execution scripts
• Testing, benchmarking
• Graph drawing
• Optimization: query rewriting
• Use of the “RDF loader into PostgreSQL” project
• Alternatives to the baseline algorithm
Known bugs
• Tweepy crashes randomly, so Raphael had to write a wrapper that restarts it when needed
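A restart wrapper of this kind can be sketched generically as below. This is not Raphael's wrapper: `run_with_restart`, its parameters and the simulated `flaky_retriever` are hypothetical, and no Twitter client is called.

```python
import time


def run_with_restart(task, max_restarts=3, delay_seconds=0.0):
    """Call task(); if it raises, wait and call it again, up to max_restarts times."""
    restarts = 0
    while True:
        try:
            return task()
        except Exception as error:
            if restarts >= max_restarts:
                raise  # give up and surface the last error
            restarts += 1
            print(f"task crashed ({error!r}); restart {restarts}/{max_restarts}")
            time.sleep(delay_seconds)


# Simulated retriever that crashes twice before succeeding.
attempts = {"count": 0}


def flaky_retriever():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated crash")
    return "retrieved"


print(run_with_restart(flaky_retriever))
```

Capping the number of restarts keeps a permanent failure, such as a revoked API key, from looping forever.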
Thank you!