The S4 project & code OWNER: Raphael Bonaque PRESENTER: Juan Álvaro Muñoz Naranjo OAK Code Days 16-18 October 2014
Page 1: S4

The S4 project & code

OWNER: Raphael Bonaque

PRESENTER: Juan Álvaro Muñoz Naranjo

OAK Code Days 16-18 October 2014

Page 2: S4

(Very) general overview

• S4 stands for “Social Semantic Structured Search”

• Goal: an RDF-based keyword search engine for social and structured environments (currently Twitter)

• Keywords to be searched are defined by RDF semantics

• Results are ranked by proximity and position of the (extended) keywords within the documents and their comments

• Example: searching for “animal” should return tweets containing “cat”, “dog”, “eagle”, sorted by the ranking criteria

• Keywords are currently taken from DBpedia
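
The keyword extension in the “animal” example can be sketched as follows. This is only an illustration, assuming the semantics are available as (child, parent) pairs in the style of rdfs:subClassOf statements. The toy hierarchy and the helper names `expand_keyword` and `matching_tweets` are hypothetical and are not taken from the S4 code.

```python
from collections import defaultdict, deque

# Toy subclass statements (child, parent); hypothetical data, not DBpedia output.
SUBCLASS_OF = [
    ("cat", "mammal"),
    ("dog", "mammal"),
    ("eagle", "bird"),
    ("mammal", "animal"),
    ("bird", "animal"),
]


def expand_keyword(keyword, subclass_pairs):
    """Return the keyword plus every term that is transitively a subclass of it."""
    children = defaultdict(set)
    for child, parent in subclass_pairs:
        children[parent].add(child)
    seen = {keyword}
    queue = deque([keyword])
    while queue:  # breadth-first walk down the hierarchy
        term = queue.popleft()
        for child in children[term]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def matching_tweets(keyword, tweets, subclass_pairs):
    """Keep the tweets that contain at least one extended keyword as a whole word."""
    extended = expand_keyword(keyword, subclass_pairs)
    return [t for t in tweets if extended & set(t.lower().split())]
```

With this toy data, a search for “animal” also matches tweets that only mention “cat” or “eagle”; ranking the matches is a separate step.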

Page 3: S4

Programming language

• Python

History, versions

• Recent project

• Two branches:
  • Storage through serialization
  • Storage via PostgreSQL

• No code is reused from or into other projects
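
As a rough picture of what a serialization-based storage branch might involve, here is a minimal sketch that assumes Python's standard `pickle` module is used. The slides do not say which serialization mechanism S4 relies on, and the function names `save_tweets` and `load_tweets` are hypothetical.

```python
import pickle


def save_tweets(tweets, path):
    """Write the whole in-memory tweet collection to a single file."""
    with open(path, "wb") as handle:
        pickle.dump(tweets, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_tweets(path):
    """Read the collection back; only unpickle files from a trusted source."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

The trade-off with a database branch is the usual one: serialization is simple but loads everything at once, whereas PostgreSQL allows querying without reading the full collection into memory.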

Page 4: S4

Code size

Branch          Lines    Python classes    Python scripts (.py)
Serialization   ~1240    12                9
PostgreSQL      ~1480    9                 7

Contributors, users

• Raphael Bonaque

Page 5: S4

Code repository

• https://gforge.inria.fr/scm/viewvc.php/?root=xrp

Folder “postgres4”: version for the PostgreSQL DB.

(permission needed)

Papers

• R. Bonaque, B. Cautis, F. Goasdoué, I. Manolescu. Toward Social, Structured and Semantic Search. SDSW’14, co-located with ISWC’14.

• R. Bonaque, B. Cautis, F. Goasdoué, I. Manolescu. S4 Structured Social and Semantic Search (working draft).

Page 6: S4

Overview of the software

• Input:
  • User query
  • Twitter (static) database
  • RDF semantics

• Output:
  • A ranked collection of tweets
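
To make the input/output contract concrete, the sketch below turns a set of (already extended) keywords and a static list of tweets into a ranked list. The `1 / (1 + position)` weighting is an assumption made for this illustration only. The actual S4 ranking also takes keyword proximity and comments into account, which this sketch ignores, and the function names are hypothetical.

```python
def position_score(text, keywords):
    """Score a text: each keyword hit counts more the earlier it appears.

    The 1 / (1 + position) weighting is an assumption made for this sketch.
    """
    tokens = text.lower().split()
    return sum(1.0 / (1 + i) for i, tok in enumerate(tokens) if tok in keywords)


def rank_tweets(tweets, keywords):
    """Return (score, tweet) pairs with a positive score, best first."""
    scored = [(position_score(t, keywords), t) for t in tweets]
    scored = [pair for pair in scored if pair[0] > 0]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Tweets without any keyword are dropped rather than ranked last, so the output contains only relevant results.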

Page 7: S4

Main modules

1. Tweets retrieval
• Uses the Twitter API through the Tweepy library
• Compresses retrieved data
Files:
• receiving.py: tweet retriever through Tweepy
• archiving.py: data compression and management
• secrets.py: API key (not in the repo)

2. Semantics retrieval & storage
• RDF semantics creation & storage from DBpedia
Files:
• rdf_db.py: PostgreSQL I/O wrapper

3. Tweets storage
• Decompresses tweets
• Parses tweets according to the RDF semantics
• Stores parsed tweets in the DB
Files:
• twitter_database.py: (old version) object serialization
• social_db.py: (new version) PostgreSQL I/O wrapper
• archiving.py
• config.py: database parameters (connection string, etc.)

4. Search engine
• Search algorithm
Files:
• algorithm.py: interface for algorithms
• baseline_algorithm.py: actual algorithm and entry point
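
The split between an algorithm interface and a baseline implementation in module 4 could look roughly like the sketch below. The class names, the `top(keywords, k)` signature, and the hit-counting score are all assumptions for illustration; the slides only state that algorithm.py defines an interface and that baseline_algorithm.py holds the actual algorithm.

```python
import heapq
from abc import ABC, abstractmethod


class Algorithm(ABC):
    """Hypothetical interface: every search algorithm exposes top()."""

    @abstractmethod
    def top(self, keywords, k):
        """Return the k best (score, tweet) pairs for the given keywords."""


class CountingBaseline(Algorithm):
    """Toy baseline that scores a tweet by its number of keyword hits."""

    def __init__(self, tweets):
        self.tweets = list(tweets)

    def top(self, keywords, k):
        wanted = {kw.lower() for kw in keywords}
        scored = []
        for tweet in self.tweets:
            hits = sum(1 for tok in tweet.lower().split() if tok in wanted)
            if hits:
                scored.append((hits, tweet))
        # heapq.nlargest avoids sorting the whole candidate list.
        return heapq.nlargest(k, scored)
```

Keeping the interface separate means an alternative to the baseline only has to provide its own `top()` method.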

Page 8: S4

Workflow

[Diagram: data flow between the four modules. Boxes: Module 1 (receiving.py, archiving.py), Module 2 (rdf_db.py), Module 3 (social_db.py, archiving.py), Module 4 (baseline_algorithm.py; entry point: baseline_algorithm.top()). Arrow labels: tweets, compressed tweets, uncompressed tweets, keywords from DB, user query, top-k tweets. Part of the pipeline is marked “Offline”.]

Page 9: S4

External software

• Tweepy: Twitter API interface for Python
  http://github.com/tweepy/tweepy

• Twitter_nlp: tweet natural language processing for Python
  http://github.com/aritter/twitter_nlp

• Psycopg: PostgreSQL adapter for Python
  http://initd.org/psycopg

• Scipy: scientific calculations library for Python
  http://www.scipy.org

• XZ: data compression tool
  http://tukaani.org/xz

• Matplotlib (soon): plotting library for Python
  http://matplotlib.org
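
The compression of retrieved tweets with XZ could be sketched as below. This assumes tweets are held as JSON-compatible dictionaries and uses Python's standard `lzma` module, which reads and writes the .xz format. Whether S4 calls the `xz` tool or a library binding is not stated in the slides, and the function names are hypothetical.

```python
import json
import lzma


def compress_tweets(tweets):
    """Serialize tweets as JSON lines and compress them in the .xz format."""
    payload = "\n".join(json.dumps(t) for t in tweets).encode("utf-8")
    return lzma.compress(payload)


def decompress_tweets(blob):
    """Inverse of compress_tweets: return the list of tweet dictionaries."""
    text = lzma.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line]
```

One JSON object per line keeps the archive easy to inspect with standard tools once it is decompressed.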

Page 10: S4

TODO

• Implement execution scripts

• Testing, benchmarking

• Graph drawing

• Optimization: query rewriting

• Use of the “RDF loader into PostgreSQL” project

• Alternatives to the baseline algorithm

Known bugs

• Tweepy crashes randomly, so Raphael had to write a wrapper that restarts it when needed
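
A restart wrapper of this kind might look like the following sketch. It is not Raphael's actual wrapper: the function name, the restart budget, and the delay are assumptions, and `task` stands for any callable that runs the retrieval loop.

```python
import time


def run_with_restart(task, max_restarts=5, delay_seconds=1.0, sleep=time.sleep):
    """Call task() and restart it after an exception, up to max_restarts times.

    Returns the task's result, or re-raises the last error once the
    restart budget is exhausted.
    """
    restarts = 0
    while True:
        try:
            return task()
        except Exception:
            if restarts >= max_restarts:
                raise
            restarts += 1
            sleep(delay_seconds)  # short pause before starting again
```

The bounded budget is a design choice: a permanent failure such as an invalid API key then surfaces as an error instead of looping forever.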

Page 11: S4

Thank you!