Social Network Analytics on Cray Urika-XA
Mike Hinchey, [email protected]
Technical Solutions Architect
Cray Inc, Analytics Products Group
April, 2015
Agenda
1. Introduce platform
2. Technology and architecture for analytics
3. Use case analysis and results
4. Conclusions
• Urika-XA
• Apache Spark
• Social Network Analysis
Urika-XA Hardware
Extreme Analytics
• 48 Analytic Nodes
• 96 CPUs, 1536 cores
• 6 TB total RAM
• 38 TB total local SSD (for HDFS)
• 48 TB total local HDD
• 120 TB Sonexion 900 Lustre Storage
• FDR InfiniBand Fabric Network
• Standard 42U Rack
• Dual rack configuration also available
Urika-XA Software
Extreme Analytics
• Cloudera Hadoop Distribution and Management UI
• HDFS on the 38 TB local SSD
• YARN manages jobs on 48 nodes
• Hadoop MapReduce
• Apache Spark
Urika-GD
Graph Discovery
• 4 TB RAM
• 128 XMT compute processors
• 128 hardware threads per processor
• Lustre file system
• RDF: W3C Resource Description Framework
• SPARQL, W3C graph query language
Goals for this Project
1. Business Use Case
• Demonstrate analytics
• On a broadly accessible use case
• Showing valuable insights
2. Technology and Architecture
• Bring together various technologies and techniques
• Demonstrate architecture of an end-to-end solution
• Cray R&D also uses this for performance tests
Business Use Case
Collect data from social media
Discover communities of users with interest in a particular topic (consumer electronics, sports)
Identify users according to role: key influencers, rebroadcasters, connectors
Process Overview
Technology Overview
• Apache Spark applications
  • Data load and transform
  • Community detection
  • Analytics
  • Query for visualization
• Web app, JavaScript
  • Query from the Spark app
  • Charts and graphs to visualize data and results
  • This presentation
Technology and Architecture
Bring together various technologies and techniques
Demonstrate an end-to-end solution
Lambda Architecture
Principles for an analytics system that includes Batch and Real-time pipelines
Based on functional programming (lambdas)
To achieve consistency, reliability, etc
Source data is immutable, append-only
Business/analytics code duplicated for Batch and Real-time use cases
Lambda Architecture
Batch layer for completeness and accuracy: typically Hadoop/MapReduce
Speed layer for real-time, minimal latency, may sacrifice accuracy
[Diagram labels: Data stream, Batch layer, Real-time stream, Serving layer, Presentation]
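The split between the batch, speed, and serving layers can be sketched in a few lines. This is a minimal illustration in plain Python, not the code used in the project; the tweet records, the function names, and the hashtag-count query are all assumptions made for the example. In a real Lambda deployment the two layers would typically run on different frameworks, which is where the duplicated logic comes from.

```python
from collections import Counter

def batch_view(archive):
    """Batch layer: recompute exact counts from the full, immutable archive."""
    return Counter(tag for tweet in archive for tag in tweet["hashtags"])

def speed_view(recent):
    """Speed layer: count only the events the last batch run has not seen yet."""
    return Counter(tag for tweet in recent for tag in tweet["hashtags"])

def serve(batch, speed):
    """Serving layer: merge both views to answer a query."""
    return batch + speed

# Hypothetical sample data: archived tweets and tweets that arrived since the batch run.
archive = [{"hashtags": ["NBA"]}, {"hashtags": ["NBA", "Music"]}]
recent = [{"hashtags": ["Music"]}]

merged = serve(batch_view(archive), speed_view(recent))
```

Note that `batch_view` and `speed_view` contain the same counting logic written twice, which mirrors the duplication of business code that the Lambda Architecture accepts.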
Kappa Architecture
Rethinking the Lambda Architecture: multiple frameworks and duplication of code is too difficult
Rethinking the traditional database: based on a transaction log, but only internally
Use streams everywhere; the transaction/event log is the foundation of all data
Avoid the traditional batch pipeline (where possible, with respect to legacy software)
Avoid inconsistent caches of data, like memcached, within apps, etc.
Kappa Architecture
Batch is a slow-lane stream, and allows for re-processing of historical data
Real-time is a fast-lane stream using the same framework, so code is shared
[Diagram labels: Data stream, Batch stream, Real-time, Serving layer, Presentation]
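The point about shared code can be shown with a small sketch. It is plain Python rather than Spark, and the event log, the `count_hashtags` function, and the record layout are assumptions for illustration only. The same function serves the slow lane (replaying the whole log) and the fast lane (one event at a time).

```python
from collections import Counter

def count_hashtags(events, state=None):
    """Fold a stream of events into a running hashtag count."""
    state = Counter() if state is None else state
    for event in events:
        state.update(event["hashtags"])
    return state

# Hypothetical append-only event log.
log = [{"hashtags": ["NBA"]}, {"hashtags": ["NBA", "Music"]}]

# Slow lane: re-process the historical log in one pass.
historical = count_hashtags(log)

# Fast lane: the same function, fed events as they arrive.
live = Counter()
for event in log:
    count_hashtags([event], live)
```

Because both lanes call the same function, replaying the log and processing it incrementally end in the same state.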
Apache Spark
Functional API, immutability, stateless
Immutable dataset abstraction, transparently distributed
High-level API: map, reduce, filter, group by, join, union, left outer join
Graph algorithms: PageRank, SVD++, connected components, shortest paths
Machine learning: k-means, linear regression, logistic regression, naive Bayes
Streaming: real-time, periodic
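The flavor of the high-level operations listed above (map, filter, reduce, group by) can be sketched with Python's built-in functional tools. This is only an analogy on a local list with made-up data; it does not use the Spark API, and Spark would distribute the same style of pipeline across the cluster.

```python
from functools import reduce
from itertools import groupby

# Hypothetical (user, hashtag) pairs.
tweets = [
    ("alice", "NBA"),
    ("bob", "Music"),
    ("alice", "Music"),
    ("bob", "NBA"),
    ("carol", "NBA"),
]

# filter, then map: users who used the NBA hashtag.
nba_users = list(map(lambda t: t[0], filter(lambda t: t[1] == "NBA", tweets)))

# reduce: count the records by folding over them.
total = reduce(lambda acc, _: acc + 1, tweets, 0)

# group by: hashtags per user (groupby needs its input sorted by the key).
by_user = {
    user: [tag for _, tag in rows]
    for user, rows in groupby(sorted(tweets), key=lambda t: t[0])
}
```

No step mutates its input; each one produces a new value, which is the property that lets a framework re-run or redistribute steps safely.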
Why XA?
Considering the principles of Lambda and Kappa Architectures,
And the capabilities of Spark,
What is the value of Urika-XA?
• Pre-configured Hadoop/YARN cluster
• Minimize time to value for a project
• Hardware architecture built for both batch and real-time
Why XA?
Hardware architecture built for both batch size and real-time speed
Lots of memory and CPU
• Perform numerous transformations and joins in memory
HDFS on fast, local SSD
• Bigger joins, and temporary files are fast
Shared file system, Lustre
• Parallel and fast, for input and output data
• Reliable without 3x data duplication
SNA - ETL
Source data is stored in immutable files
Start the ETL (extract, transform, load) process from a chosen start date (to re-process old data)
The Spark Streaming window specifies how much data goes into each micro-batch
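How a start date and a window size carve archived records into micro-batches can be sketched as follows. This is a plain-Python illustration, not Spark Streaming; the `created_at` field, the `micro_batches` helper, and the 30-second window are assumptions made for the example.

```python
from datetime import datetime, timedelta
from itertools import groupby

def micro_batches(records, start, window):
    """Yield (window_start, records) for every record at or after `start`."""
    def window_start(record):
        # Whole number of windows elapsed since the start date.
        n = (record["created_at"] - start) // window
        return start + n * window

    selected = sorted(
        (r for r in records if r["created_at"] >= start),
        key=lambda r: r["created_at"],
    )
    for ws, group in groupby(selected, key=window_start):
        yield ws, list(group)

# Hypothetical records; the first one predates the start date and is skipped.
start = datetime(2014, 10, 1)
records = [
    {"id": 0, "created_at": datetime(2014, 9, 30, 23, 59)},
    {"id": 1, "created_at": datetime(2014, 10, 1, 0, 0, 5)},
    {"id": 2, "created_at": datetime(2014, 10, 1, 0, 0, 25)},
    {"id": 3, "created_at": datetime(2014, 10, 1, 0, 0, 40)},
]
batches = list(micro_batches(records, start, timedelta(seconds=30)))
```

Moving `start` back in time re-processes older data with the same code, which is what the immutable source files make possible.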
SNA - Real-time
Fast-lane window is seconds: for real-time alerts, complex events
Aggregations, metrics
Complex Event Processing (CEP), such as spotting trending hashtags
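One simple way to spot a trending hashtag is to compare its count in the current window with the previous window. The rule below (a minimum count plus a growth factor) is an assumption chosen for illustration, not the rule used in the project, and the thresholds are arbitrary.

```python
from collections import Counter

def trending(previous, current, min_count=3, factor=2.0):
    """Hashtags that are frequent now and at least `factor` times the previous window."""
    return sorted(
        tag
        for tag, n in current.items()
        if n >= min_count and n >= factor * previous.get(tag, 0)
    )

# Hypothetical per-window hashtag counts.
previous_window = Counter({"NBA": 5, "Music": 1})
current_window = Counter({"NBA": 6, "Music": 4, "Ubuntu": 1})

hot = trending(previous_window, current_window)
```

Here "NBA" is frequent but not growing fast enough, and "Ubuntu" is too rare, so only "Music" qualifies.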
Community Detection
Label Propagation (LP) is a graph-based Community Detection Algorithm (CDA)
LP is not implemented as a streaming algorithm, so it is executed periodically, on one day of collected data
More data produces better results, more meaningful communities
This is done in a second stream: not real-time, longer window, higher latency
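The idea behind label propagation can be sketched in plain Python. This is a simplified, synchronous variant written for illustration; it is not the GraphX implementation, and the tie-breaking rule (prefer the smallest label) and the stopping condition are assumptions. Every vertex starts in its own community and repeatedly adopts the most common label among its neighbors.

```python
from collections import Counter, defaultdict

def label_propagation(edges, max_steps=5):
    """Return {vertex: community label} for an undirected edge list."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Each vertex starts as its own community.
    labels = {v: v for v in neighbors}
    for _ in range(max_steps):
        updated = {}
        for v, nbrs in neighbors.items():
            counts = Counter(labels[n] for n in nbrs)
            # Most frequent neighbor label; ties go to the smallest label.
            updated[v] = min(counts, key=lambda lab: (-counts[lab], lab))
        if updated == labels:
            break  # no label changed, so the result is stable
        labels = updated
    return labels

# Hypothetical network: two separate triangles of users.
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6)]
communities = label_propagation(edges)
```

Synchronous updates can oscillate on some graphs, which is one reason implementations cap the number of steps.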
Visualization
This presentation is a web app that loads the data output by the Spark job
• d3.js: render charts and graphs in SVG
• crossfilter.js: manipulate data across multiple dimensions
• dc.js: reusable charts
SNA Analytics Pipeline
Social Network Analytics
[Diagram labels: ETL, Algorithms, Analysis, Visualization]
ETL (extract, transform, load): Spark Streaming, Scala
Algorithms: GraphX Label Propagation, Machine Learning
Analysis: Spark, Scala, SQL
Visualization: JavaScript, D3, SVG
Source Data - Twitter.com
Tweet download is based on search terms (related to consumer electronics, sports, life sciences, etc.)
Streaming download since April 2014
Data archived in files to allow reprocessing
[Chart: tweets per hour, by hour of day (0 to 24); y-axis 0 to 45,000,000]
[Chart: tweets per day, 09/29 to 02/01; y-axis 0 to 8,000,000]
Source Data - Twitter.com
The full Twitter firehose is about 600M tweets/day.
During the displayed timeframe, we collected 674,106,415 tweets, about 0.91% of the firehose.
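As a sanity check on these figures, the quoted share can be worked backwards into the length of the collection period and the average daily volume it implies. The sketch below uses only the three numbers stated above; the function name is made up for the example, and because 0.91% is a rounded value the result is approximate.

```python
FIREHOSE_PER_DAY = 600_000_000   # "about 600M tweets/day"
COLLECTED = 674_106_415          # tweets collected in the displayed timeframe
SHARE = 0.0091                   # "about 0.91% of the firehose"

def implied_days(collected, share, firehose_per_day):
    """Days of collection implied by the collected total and the quoted share."""
    return collected / (share * firehose_per_day)

days = implied_days(COLLECTED, SHARE, FIREHOSE_PER_DAY)
avg_per_day = COLLECTED / days   # equals SHARE * FIREHOSE_PER_DAY
```

The result is roughly 123 days at about 5.46 million tweets per day on average.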
[Chart: files per day, 09/29 to 02/01; y-axis 0 to 26]
Source Data - Storage
JSON saved to files, gzipped
2,290 files, 317 GB
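Reading such files back for re-processing might look like the sketch below. It assumes one JSON object per line, which the slides do not state, and it skips lines that fail to parse. The file name and sample records are invented for the demonstration.

```python
import gzip
import json
import os
import tempfile

def read_tweets(path):
    """Yield one parsed record per line of a gzipped JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                continue  # tolerate malformed lines in the source data

# Demonstration with a small temporary file (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.json.gz")
    with gzip.open(sample, "wt", encoding="utf-8") as out:
        out.write('{"id": 1, "text": "hello"}\n')
        out.write("not json\n")
        out.write('{"id": 2, "text": "world"}\n')
    loaded = list(read_tweets(sample))
```

Because the function is a generator, a large file is processed one line at a time rather than loaded whole.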
SNA - Counts and Aggregations
Tweets
Users
Unique hashtags
Aggregations are done for both
• periodic, per window
• running total since processing began
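The two kinds of aggregation can be sketched together: a count per window and a running total carried across windows. This is plain Python rather than Spark Streaming, and the record fields (`user`, `hashtags`) and the `aggregate` helper are assumptions for illustration.

```python
def aggregate(windows):
    """Return per-window and running counts of tweets, users, and unique hashtags."""
    seen_users, seen_tags = set(), set()
    total_tweets = 0
    report = []
    for tweets in windows:
        users = {t["user"] for t in tweets}
        tags = {h for t in tweets for h in t["hashtags"]}
        # Running state grows with every window processed.
        seen_users |= users
        seen_tags |= tags
        total_tweets += len(tweets)
        report.append({
            "window": {"tweets": len(tweets), "users": len(users), "hashtags": len(tags)},
            "running": {"tweets": total_tweets, "users": len(seen_users), "hashtags": len(seen_tags)},
        })
    return report

# Hypothetical data: two consecutive windows of tweets.
windows = [
    [{"user": "a", "hashtags": ["NBA"]}, {"user": "b", "hashtags": ["NBA", "Music"]}],
    [{"user": "a", "hashtags": ["Ubuntu"]}],
]
report = aggregate(windows)
```

Unique counts need the sets of users and hashtags seen so far, so the running totals carry state that the per-window counts do not.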
More Counts and Aggregations
Hashtags matched to topics
Top hashtags
Top hashtags per user
Errors in source data
NSFW: censor out some tweets based on keywords
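Three of these steps (keyword censoring, matching hashtags to topics, and top hashtags per user) can be sketched as small functions. The topic table, the blocked-word list, and the record fields are hypothetical placeholders; the project's actual lists are not given in the slides.

```python
from collections import Counter, defaultdict

# Hypothetical lookup tables for illustration only.
TOPIC_TAGS = {"Sports": {"nba"}, "Finance": {"money"}}
BLOCKED_WORDS = {"badword"}

def is_clean(tweet):
    """True when the tweet text contains none of the blocked keywords."""
    return not (set(tweet["text"].lower().split()) & BLOCKED_WORDS)

def topics_for(tweet):
    """Topics whose hashtag set overlaps the tweet's hashtags."""
    tags = {h.lower() for h in tweet["hashtags"]}
    return sorted(topic for topic, known in TOPIC_TAGS.items() if tags & known)

def top_hashtags_per_user(tweets, n=1):
    """The n most used hashtags for each user."""
    per_user = defaultdict(Counter)
    for t in tweets:
        per_user[t["user"]].update(t["hashtags"])
    return {u: [tag for tag, _ in c.most_common(n)] for u, c in per_user.items()}

tweets = [
    {"user": "a", "text": "go team", "hashtags": ["NBA"]},
    {"user": "a", "text": "again", "hashtags": ["NBA", "Music"]},
    {"user": "b", "text": "some badword here", "hashtags": ["Ubuntu"]},
]
clean = [t for t in tweets if is_clean(t)]
```

Filtering first keeps censored tweets out of every later count.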
Build the network for CDA
Label propagation (LP) is a community detection algorithm (CDA), built in to Spark GraphX
Input is a network: a list of relationships between entities
We'll look at users that mention other users in tweets
Further restrict to where users have mentioned each other
• If user A mentioned user B
• and B mentioned A
• then infer that A knows B
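The mutual-mention rule above can be sketched as a set operation: keep a pair only when the mention exists in both directions. The function name and the sample data are illustrative; the project did this step in Spark.

```python
def mutual_mentions(mentions):
    """Undirected 'knows' pairs from directed (author, mentioned_user) pairs."""
    directed = {(a, b) for a, b in mentions if a != b}  # ignore self-mentions
    return {tuple(sorted(pair)) for pair in directed if (pair[1], pair[0]) in directed}

# Hypothetical mentions: A and B mention each other, A mentions C one way only.
mentions = [("A", "B"), ("B", "A"), ("A", "C"), ("C", "C")]
knows = mutual_mentions(mentions)
```

Sorting each pair means that A-B and B-A collapse into a single undirected edge.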
Communities
LP results in one community for each user
[Graph: each User node linked by a "member" edge to its Community node; communities shown: 99212017, 996217376, 999453985]
Community metrics
Count the users in each community
Density is the proportion of users that know each other
Filter out tiny and huge communities as not interesting
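These metrics can be sketched as follows. Density is interpreted here as the number of "knows" edges inside a community divided by the number of possible pairs of members; that reading, the size thresholds, and the function names are assumptions, since the slides do not give exact definitions.

```python
from collections import defaultdict

def community_metrics(labels, edges):
    """Size and density per community; `edges` holds unique undirected pairs."""
    members = defaultdict(set)
    for user, community in labels.items():
        members[community].add(user)

    internal = defaultdict(int)
    for a, b in edges:
        if a in labels and b in labels and labels[a] == labels[b]:
            internal[labels[a]] += 1

    metrics = {}
    for community, users in members.items():
        n = len(users)
        possible = n * (n - 1) // 2  # number of distinct pairs of members
        density = internal[community] / possible if possible else 0.0
        metrics[community] = {"size": n, "density": density}
    return metrics

def interesting(metrics, min_size=3, max_size=1000):
    """Drop communities that are too small or too large to be informative."""
    return {c: m for c, m in metrics.items() if min_size <= m["size"] <= max_size}

# Hypothetical labels (user -> community) and 'knows' edges.
labels = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2}
edges = {("a", "b"), ("b", "c"), ("d", "e")}
metrics = community_metrics(labels, edges)
kept = interesting(metrics)
```

In this example community 1 has two of its three possible pairs connected, and community 2 is dropped for being too small.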
Community Characterization
Find ways to describe communities
• Most popular topics among users
[Graph: Community nodes linked by "references" edges to Topic nodes; topics shown: Sports, Finance, Consumer Electronics]
Community Characterization
• Most popular hashtags among users
[Graph: Community nodes linked by "references" edges to Hashtag nodes; hashtags shown: hardwork, GenerationsLegacy, RageBoy, Bellarke, autism, NBA, watch, Music, cover, MaxScherzer, quote, Ubuntu, vaccines, money, startpharma, Business, fandomscollide, stream]
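Characterizing a community by its most popular hashtags amounts to joining the community labels with each user's hashtags and counting. The sketch below is plain Python with invented data and names; the same shape works for topics if hashtags are first mapped to topics.

```python
from collections import Counter, defaultdict

def top_hashtags_by_community(labels, user_hashtags, n=1):
    """The n most used hashtags among the members of each community."""
    counts = defaultdict(Counter)
    for user, tags in user_hashtags.items():
        if user in labels:  # skip users that were not assigned a community
            counts[labels[user]].update(tags)
    return {c: [tag for tag, _ in ctr.most_common(n)] for c, ctr in counts.items()}

# Hypothetical inputs: community labels and the hashtags each user posted.
labels = {"a": 1, "b": 1, "c": 2}
user_hashtags = {
    "a": ["NBA", "NBA"],
    "b": ["NBA", "Music"],
    "c": ["Ubuntu"],
    "z": ["ignored"],
}
popular = top_hashtags_by_community(labels, user_hashtags)
```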
User Roles
• Identify user roles within a community
  • key influencers: retweeted by others
  • rebroadcasters: retweet a lot
• Identify user roles between communities
  • connectors
    • relationships with people in different groups
    • and the strength to each community is balanced
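The role definitions above can be turned into simple counting rules. The sketch below is one possible reading, not the project's method: influencers and rebroadcasters are ranked by retweets received and made, and a connector is a user with ties to at least two communities whose weakest tie count is at least `min_balance` times the strongest. The threshold and all names are assumptions.

```python
from collections import Counter, defaultdict

def roles_within(retweets, top=1):
    """Rank users from (retweeter, original_author) pairs."""
    made = Counter(retweeter for retweeter, _ in retweets)
    received = Counter(author for _, author in retweets)
    return {
        "key_influencers": [u for u, _ in received.most_common(top)],
        "rebroadcasters": [u for u, _ in made.most_common(top)],
    }

def connectors(edges, labels, min_balance=0.5):
    """Users tied to two or more communities with balanced strength."""
    ties = defaultdict(Counter)
    for a, b in edges:
        ties[a][labels[b]] += 1
        ties[b][labels[a]] += 1
    found = []
    for user, per_community in ties.items():
        strengths = per_community.values()
        if len(per_community) >= 2 and min(strengths) / max(strengths) >= min_balance:
            found.append(user)
    return sorted(found)

# Hypothetical data.
retweets = [("b", "a"), ("c", "a"), ("b", "d")]
roles = roles_within(retweets)

labels = {"a": 1, "b": 1, "c": 2, "d": 2, "x": 1}
edges = [("x", "a"), ("x", "b"), ("x", "c"), ("x", "d"), ("a", "b")]
bridge_users = connectors(edges, labels)
```

User "x" has two ties into each community, so it is the only connector in this example.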
Results
Communities and Popular Hashtags
User Roles within a Community
Connector Role across Communities
Conclusions
Analytics needs a variety of techniques: Graph, Machine Learning, Iterative, Streaming
Spark: functional, high-level, transparently distributed
Urika-XA: pre-configured cluster, 6 TB memory
References
Apache Spark: http://spark.apache.org/
Twitter Data: https://dev.twitter.com/streaming/public
Lambda Architecture: http://lambda-architecture.net/
Kreps, Jay, "Questioning the Lambda Architecture", 7/2/2014, http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
Kleppmann, Martin, "Turning the database inside out with Apache Samza", 9/21/2014, https://youtu.be/fU9hR3kiOK0
DC, dimensional charting: http://dc-js.github.io/dc.js/
Questions?
Or contact me later...
Cray Analytics, Urika-XA: http://cray.com/analytics
Mike Hinchey, [email protected]