MULTIMEDIA BIG DATA COMPUTING FOR TREND DETECTION
MASTER THESIS
Presented by: Omar Iván Sulca Correa
Director: PhD. Ruben Tous Liesa
Codirector: PhD. Jordi Torres Viñals
FACULTAT D’INFORMÀTICA DE BARCELONA
Agenda
1. Introduction
2. Background
3. Multimedia Big Data Computing for Trend Detection
4. Results and Conclusions
Introduction
• The relevance of social media data has grown explosively in the last few years, because users' interactions and communications in social networks provide key information for government and non-government organizations.
• Social media data is vast, noisy, distributed, unstructured, and dynamic in nature; traditional analysis methods therefore prove inefficient and expensive when applied to it, and new alternatives are needed.
• Photo-sharing social networks such as Instagram hold great potential, especially for digital marketing.
This work is a proof of concept of the streaming and machine learning functionality of the new Big Data platform Apache Spark, using its subprojects MLlib and Spark Streaming. It seeks to implement an application that finds trending topics (using the LDA model) in data collected from the social network Instagram.
II. Background
[Figure: a Kafka topic as a log ordered from older to newer messages, written by a producer and read by a consumer]
Apache Kafka
• Apache Kafka is an open-source, distributed, partitioned, and replicated publish-subscribe messaging system based on a commit log.
• Streams
• Batch
[Figure: Kafka cluster with Broker 1, Broker 2, and Broker 3, coordinated by ZooKeeper. Front-end producers publish to the cluster; Hadoop, real-time, and security systems consume from it.]
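To make the commit-log idea concrete, here is a minimal sketch of a single topic partition held in memory. It is a toy stand-in and not the Kafka API: the class `TopicLog`, its methods, and the group names are hypothetical names chosen for illustration. It shows the two properties the slide relies on: producers only append, and each consumer group keeps its own read offset, so a real-time reader and a batch reader can consume the same messages independently.

```python
from collections import defaultdict


class TopicLog:
    """Toy in-memory stand-in for one partition of a topic (not the Kafka API)."""

    def __init__(self):
        self._messages = []                 # append-only log, oldest first
        self._offsets = defaultdict(int)    # next offset to read, per consumer group

    def publish(self, message):
        """Producer side: append to the end of the log and return the offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, group, max_messages=10):
        """Consumer side: read from the group's offset, then advance it."""
        start = self._offsets[group]
        batch = self._messages[start:start + max_messages]
        self._offsets[group] = start + len(batch)
        return batch


log = TopicLog()
for caption in ["#sunset", "#food", "#travel"]:
    log.publish(caption)

first = log.poll("realtime", max_messages=2)   # the two oldest messages
second = log.poll("realtime")                  # the remaining message
batch_view = log.poll("batch")                 # another group starts at offset 0
```

Because reading never removes messages, the "batch" group still sees all three captions after the "realtime" group has consumed them.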
[Figure: the Apache Spark stack, with Spark SQL, Spark Streaming, MLlib, and GraphX on top of Apache Spark]
Apache Spark (I)
• Apache Spark is an open-source cluster-computing platform designed to be fast and general-purpose.
• It allows different types of computation to be combined on a single platform.
• Spark supports in-memory processing, which according to the Spark project allows a speedup of up to 100x over disk-based Hadoop MapReduce.
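[Figure: workloads that Spark combines: interactive queries, streaming, batch applications, and iterative algorithms. Main data abstraction of each component: DataFrame (Spark SQL), DStream (Spark Streaming), Vector and Matrix (MLlib), Vertex and Edge (GraphX).]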
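RDDs: Actions and Transformations
An RDD (Resilient Distributed Dataset) represents a collection of elements that can be manipulated in parallel. Transformations build a new RDD from an existing one, while actions compute a result from an RDD.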
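The difference between transformations and actions can be illustrated with a small single-machine sketch. This is not the Spark API: `MiniRDD` is a hypothetical toy class, and it ignores distribution and fault tolerance entirely. It only shows the evaluation model, in which transformations such as `map` and `filter` describe a new dataset lazily and nothing runs until an action such as `collect` or `count` is called.

```python
class MiniRDD:
    """Toy single-machine illustration of the RDD idea (not the Spark API)."""

    def __init__(self, source):
        self._source = source   # zero-argument callable returning an iterator

    # Transformations: return a new MiniRDD and compute nothing yet.
    def map(self, fn):
        return MiniRDD(lambda: (fn(x) for x in self._source()))

    def filter(self, pred):
        return MiniRDD(lambda: (x for x in self._source() if pred(x)))

    # Actions: trigger the computation and return a result.
    def collect(self):
        return list(self._source())

    def count(self):
        return sum(1 for _ in self._source())


calls = []


def length(word):
    calls.append(word)      # record when the function actually runs
    return len(word)


words = MiniRDD(lambda: iter(["spark", "kafka", "lda", "instagram"]))
lengths = words.filter(lambda w: len(w) > 3).map(length)   # nothing computed yet
calls_before_action = len(calls)
result = lengths.collect()                                  # the action runs the pipeline
```

In this sketch every action recomputes the pipeline from the source, which is one reason real Spark offers explicit caching for datasets that are reused.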
Apache Spark (II)
• Spark Streaming is a Spark component that enables the processing of live data streams. It provides an API for manipulating data streams that closely matches Spark Core's RDD API.
• Spark Streaming provides a high-level abstraction called a Discretized Stream, or DStream.
• A DStream is a sequence of data arriving over time, represented as a series of RDDs.
[Figure: Spark Streaming input sources (sockets, file streams, Akka actors, queues of RDDs) and DStream operations (transformations, window operations, output operations)]
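The following sketch shows, in plain Python, what "discretizing" a stream and applying a window operation mean. It does not use Spark Streaming: the functions `discretize` and `windowed_counts`, the batch interval, and the sample events are all assumptions made for illustration. Each micro-batch plays the role of one RDD in a DStream, and the window combines the most recent batches.

```python
from collections import Counter


def discretize(events, batch_interval):
    """Group (timestamp, value) events into consecutive micro-batches.

    Timestamps are assumed to be non-negative seconds.
    Each returned list plays the role of one RDD in a DStream.
    """
    if not events:
        return []
    last_time = max(t for t, _ in events)
    n_batches = int(last_time // batch_interval) + 1
    batches = [[] for _ in range(n_batches)]
    for t, value in events:
        batches[int(t // batch_interval)].append(value)
    return batches


def windowed_counts(batches, window_length, slide):
    """Count values over a sliding window, both measured in batches."""
    results = []
    for end in range(window_length, len(batches) + 1, slide):
        window = batches[end - window_length:end]
        results.append(Counter(v for batch in window for v in batch))
    return results


# Made-up hashtag events: (arrival time in seconds, hashtag).
events = [(0.5, "#a"), (1.2, "#b"), (1.8, "#a"), (2.1, "#a"), (3.7, "#c")]
batches = discretize(events, batch_interval=1.0)
counts = windowed_counts(batches, window_length=2, slide=1)
```

With a one-second batch interval the five events fall into four batches, and a window of two batches sliding by one yields three overlapping counts.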
LDA notation:
• $\beta_{1:K}$: the topics; each $\beta_k$ is a distribution over the vocabulary
• $\theta_d$: the topic proportions for the $d$-th document
• $\theta_{d,k}$: the proportion of topic $k$ in document $d$
• $z_d$: the topic assignments for the $d$-th document
• $z_{d,n}$: the topic assignment for the $n$-th word in document $d$
• $w_d$: the observed words in document $d$
• $w_{d,n}$: the $n$-th word in document $d$, which is an element of the fixed vocabulary

$$p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{i=1}^{K} p(\beta_i) \prod_{d=1}^{D} p(\theta_d) \left( \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \right)$$
Topic Modeling
• Topic modeling is a suite of statistical algorithms that aim to discover the themes in large archives of documents and annotate the documents with them.
• Topic models do not require any prior annotation or labeling of the documents; the topics emerge from the analysis of the original texts.
Latent Dirichlet Allocation (LDA)
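[Figure: LDA illustration relating documents, their topic proportions and assignments, and the topics]
Example topics (word and probability):
• gene 0.04, dna 0.02, genetic 0.01, ...
• life 0.02, evolve 0.01, organism 0.01
• brain 0.04, neuron 0.02, nerve 0.01
• data 0.02, number 0.02, computer 0.01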
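As a sketch of how the LDA joint distribution over topics, topic proportions, topic assignments, and words factorizes, the function below evaluates its logarithm for fixed values of all variables. Two things are assumptions not stated in the slides: the priors on the topics and on the topic proportions are taken to be symmetric Dirichlet distributions (the usual choice in LDA), and the hyperparameter names `alpha` and `eta` are illustrative. This only evaluates the model; it is not an inference algorithm and not the MLlib implementation.

```python
import math


def log_dirichlet(x, concentration):
    """Log density of a symmetric Dirichlet(concentration) at probability vector x."""
    k = len(x)
    log_norm = math.lgamma(k * concentration) - k * math.lgamma(concentration)
    return log_norm + sum((concentration - 1.0) * math.log(p) for p in x)


def lda_log_joint(beta, theta, z, w, alpha=1.0, eta=1.0):
    """Return log p(beta, theta, z, w) under the LDA factorization.

    beta[k][v]:  probability of vocabulary word v under topic k
    theta[d][k]: proportion of topic k in document d
    z[d][n]:     topic assigned to the n-th word of document d
    w[d][n]:     vocabulary index of the n-th word of document d
    alpha, eta:  assumed symmetric Dirichlet hyperparameters
    """
    total = sum(log_dirichlet(topic, eta) for topic in beta)   # product over topics
    for d in range(len(w)):                                    # product over documents
        total += log_dirichlet(theta[d], alpha)
        for n in range(len(w[d])):                             # product over words
            k = z[d][n]
            total += math.log(theta[d][k])                     # p(z_dn | theta_d)
            total += math.log(beta[k][w[d][n]])                # p(w_dn | beta, z_dn)
    return total


# Tiny made-up example: 2 topics, a 2-word vocabulary, 1 document of 2 words.
beta = [[0.9, 0.1], [0.2, 0.8]]
theta = [[0.5, 0.5]]
z = [[0, 1]]
w = [[0, 1]]
value = lda_log_joint(beta, theta, z, w)
```

With `alpha = eta = 1` the Dirichlet terms contribute zero in this two-dimensional example, so the value reduces to the log of the four word-level factors, 0.5 · 0.9 · 0.5 · 0.8.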
III. Multimedia Big Data Computing for Trend Detection
Overview
[Figure: application pipeline. Stages: Ingest, Read Data, Pre-processing & filtering, Topic Modeling. Components: Instagram's API, Kafka, JSON files, and Apache Spark (Spark Streaming, Spark SQL, MLlib).]
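The pre-processing and filtering stage can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the thesis code: the JSON field name `caption`, the stopword list, the minimum token length, and the sample records are all hypothetical. The sketch turns JSON records into a vocabulary and one term-count vector per document, which is the kind of input a topic model such as LDA expects.

```python
import json
import re
from collections import Counter

STOPWORDS = {"the", "a", "and", "de", "la", "el", "es"}   # illustrative only


def tokenize(caption):
    """Lowercase a caption and keep alphabetic tokens (the '#' of hashtags is dropped)."""
    tokens = re.findall(r"[^\W\d_]+", caption.lower())
    return [t for t in tokens if len(t) > 2 and t not in STOPWORDS]


def build_corpus(json_lines):
    """Turn JSON records into a vocabulary and per-document term-count vectors."""
    docs = []
    for line in json_lines:
        record = json.loads(line)
        tokens = tokenize(record.get("caption", ""))   # "caption" is a hypothetical field
        if tokens:                                     # skip posts with no usable words
            docs.append(Counter(tokens))
    vocabulary = sorted(set(term for doc in docs for term in doc))
    index = {term: i for i, term in enumerate(vocabulary)}
    vectors = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for term, count in doc.items():
            row[index[term]] = count
        vectors.append(row)
    return vocabulary, vectors


# Made-up sample records, one JSON object per line.
sample = [
    '{"caption": "La vida es chula #lavidaeschula"}',
    '{"caption": "#lavidaeschula vida!"}',
    '{"caption": "123"}',
]
vocabulary, vectors = build_corpus(sample)
```

Posts that contain no usable words after filtering are dropped, so the third sample record produces no vector.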
#Lavidaeschula dataset
• Iterations: 1
• Topics: 5
• Processing time: 129 min
Conclusions
• Spark fulfills its purpose efficiently; however, Spark Streaming is not yet entirely stable.
• The algorithms available in MLlib are basic and, at least up to Spark 1.3.0, do not work in streaming mode.
• The performance of the application depends on the size of the dataset, the number of iterations, and the number of topics searched for.
• The dataset must contain a sufficient number of words for the results to be consistent.