Social Network Analytics on Cray Urika-XA

cug.org/proceedings/cug2015_proceedings/includes/files/pap158...

Mike Hinchey, [email protected], Solutions Architect

Cray Inc, Analytics Products Group

April, 2015

Agenda

1. Introduce platform
2. Technology and architecture for analytics
3. Use case analysis and results
4. Conclusions

• Urika-XA
• Apache Spark
• Social Network Analysis

Urika-XA Hardware

Extreme Analytics

• 48 Analytic Nodes
• 96 CPUs, 1536 cores
• 6 TB total RAM
• 38 TB total local SSD (for HDFS)
• 48 TB total local HDD
• 120 TB Sonexion 900 Lustre Storage
• FDR InfiniBand Fabric Network
• Standard 42U Rack
• Dual rack configuration also available

Urika-XA Software

Extreme Analytics

• Cloudera Hadoop Distribution and Management UI
• HDFS on the 38 TB local SSD
• YARN manages jobs on 48 nodes
• Hadoop MapReduce
• Apache Spark

Urika-GD

Graph Discovery

• 4 TB RAM
• 128 XMT compute processors
• 128 hardware threads per processor
• Lustre file system
• RDF: W3C Resource Description Framework
• SPARQL, W3C graph query language

Goals for this Project

1. Business Use Case
• Demonstrate analytics
• On a broadly accessible use case
• Showing valuable insights

2. Technology and Architecture
• Bring together various technologies and techniques
• Demonstrate architecture of an end-to-end solution
• Cray R&D also uses this for performance tests

Business Use Case

Collect data from social media

Discover communities of users with interest in a particular topic (consumer electronics, sports)

Identify users according to role: key influencers, rebroadcasters, connectors

Process Overview

Technology Overview

Apache Spark applications
• Data load and transform
• Community detection
• Analytics
• Query for visualization

Web app, JavaScript
• Query from the Spark app
• Charts and graphs to visualize data and results
• This presentation

Technology and Architecture

Bring together various technologies and techniques

Demonstrate an end-to-end solution

Lambda Architecture

Principles for an analytics system that includes Batch and Real-time pipelines

Based on functional programming (lambdas)

To achieve consistency, reliability, etc

Source data is immutable, append-only

Business/analytics code duplicated for Batch and Real-time use cases

Lambda Architecture

Batch layer for completeness and accuracy: typically Hadoop/MapReduce

Speed layer for real-time, minimal latency, may sacrifice accuracy

[Diagram: Data stream, Batch layer, Real-time stream, Serving layer, Presentation]

Kappa Architecture

Rethinking the Lambda Architecture - multiple frameworks and duplication of code is too difficult

Rethinking the traditional database - based on a transaction log, but only internally

Use Streams everywhere, the transaction/event log is the foundation of all data

Avoid the traditional batch pipeline (where possible, wrt legacy software)

Avoid inconsistent caches of data, like memcached, within apps, etc

Kappa Architecture

Batch is a slow-lane stream, and allows for re-processing of historical data

Real-time is a fast-lane stream using the same framework, so code is shared

[Diagram: Data stream, Batch stream, Real-time, Serving layer, Presentation]

Apache Spark

Functional API, immutability, stateless

Immutable dataset abstraction, transparently distributed

High-level API: map, reduce, filter, group by, join, union, left outer join

Graph Algorithms: PageRank, SVD++, connected components, shortest paths

Machine Learning: k-means, linear regression, logistic regression, naive Bayes

Streaming: real-time, periodic
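
The functional, immutable style of the high-level API can be illustrated without a cluster. The following is a minimal sketch in plain Python, not the Spark API; the sample records and the names `mentions`, `topics` and `topic_of` are hypothetical. Each step builds a new collection and leaves its input unchanged.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical sample records: (user, hashtag) pairs extracted from tweets.
mentions = [("alice", "NBA"), ("bob", "Ubuntu"), ("alice", "Ubuntu"), ("carol", "NBA")]
# Hypothetical lookup table: (hashtag, topic) pairs.
topics = [("NBA", "Sports"), ("Ubuntu", "Consumer Electronics")]

# map: keep only the hashtag of each record.
hashtags = list(map(lambda pair: pair[1], mentions))

# filter: keep the records for one hashtag.
nba_only = list(filter(lambda pair: pair[1] == "NBA", mentions))

# group by: collect the users seen for each hashtag.
users_by_tag = defaultdict(list)
for user, tag in mentions:
    users_by_tag[tag].append(user)

# reduce: count the records by folding over the list.
total = reduce(lambda acc, _record: acc + 1, mentions, 0)

# join: attach a topic to each record through the shared hashtag key.
topic_of = dict(topics)
joined = [(user, tag, topic_of[tag]) for user, tag in mentions if tag in topic_of]
```

In Spark the same operations are applied to a distributed dataset, so the program reads much the same while the work is spread over the cluster.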

Why XA?

Considering the principles of Lambda and Kappa Architectures,

And the capabilities of Spark,

What is the value of Urika-XA?

• Pre-configured Hadoop/YARN cluster
• Minimize time to value for a project
• Hardware architecture built for both batch and real-time

Why XA?

Hardware architecture built for both batch size and real-time speed

Lots of memory and CPU
• Perform numerous transformations and joins in memory

HDFS on fast, local SSD
• Bigger joins, and temporary files are fast

Shared file system, Lustre
• Parallel and fast, for input and output data
• Reliable without 3x data duplication

SNA - ETL

Source data is stored in immutable files

Start the ETL (extract, transform, load) process from a chosen start date (to re-process old data)

The Spark Streaming window specifies how much data goes into each micro-batch
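
How a start point and a window size turn archived events into micro-batches can be sketched in a few lines. This is plain Python rather than Spark Streaming, and the function name `micro_batches`, its parameters and the (timestamp, payload) event shape are assumptions made for illustration.

```python
from itertools import groupby

def micro_batches(events, start, window_seconds):
    """Yield (window_start, [payloads]) for events at or after `start`.

    `events` is an iterable of (timestamp_seconds, payload) pairs,
    assumed to be sorted by timestamp.
    """
    # Re-processing old data means choosing an earlier `start`.
    selected = [(ts, payload) for ts, payload in events if ts >= start]

    def window_of(event):
        ts, _payload = event
        # Align each timestamp to the beginning of its window.
        return start + ((ts - start) // window_seconds) * window_seconds

    for window_start, group in groupby(selected, key=window_of):
        yield window_start, [payload for _ts, payload in group]
```

A small window gives many small batches with little delay; a large window gives fewer, bigger batches. Windows that contain no events are simply not emitted in this sketch.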

SNA - Real-time

Fast-lane window is seconds: for real-time alerts, complex events

Aggregations, metrics

Complex Event Processing (CEP), such as spotting trending hashtags
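
One simple way to spot a trending hashtag is to compare its count in the current window with its count in the previous one. The sketch below is plain Python under stated assumptions: the rule and the thresholds `min_count` and `ratio` are hypothetical and are not taken from the project.

```python
from collections import Counter

def trending(previous_window, current_window, min_count=3, ratio=2.0):
    """Return hashtags whose count is at least `min_count` in the current
    window and at least `ratio` times their count in the previous window."""
    prev = Counter(previous_window)
    curr = Counter(current_window)
    return sorted(
        tag for tag, n in curr.items()
        if n >= min_count and n >= ratio * prev.get(tag, 0)
    )
```

A hashtag that is busy but steady is not reported, because its count has not grown relative to the previous window.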

Community Detection

Label Propagation (LP) is a graph-based Community Detection Algorithm (CDA)

LP is not implemented as a streaming algorithm, so it is executed periodically, on one day of collected data

More data produces better results, more meaningful communities

This is done in a second stream: not real-time, longer window, higher latency
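
The idea behind label propagation can be shown on a toy graph. This is a minimal sketch in plain Python, not the GraphX implementation: every node starts with its own label and, in each synchronous step, adopts the label most common among its neighbours. Breaking ties by the smallest label and stopping early when nothing changes are assumptions added here to keep the result deterministic.

```python
from collections import Counter

def label_propagation(edges, max_steps=5):
    """Return {node: community_label} for an undirected edge list."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Every node starts in its own community.
    labels = {node: node for node in neighbors}

    for _ in range(max_steps):
        updated = {}
        for node, adjacent in neighbors.items():
            counts = Counter(labels[n] for n in adjacent)
            best = max(counts.values())
            # Tie-break (assumption): pick the smallest of the most common labels.
            updated[node] = min(label for label, c in counts.items() if c == best)
        if updated == labels:
            break  # labels are stable
        labels = updated
    return labels
```

Because each step only needs the neighbours' current labels, the computation maps naturally onto a graph framework that passes messages along edges. A fixed step limit is needed because synchronous updates can oscillate on some graphs.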

Visualization

This presentation is a web app that loads the data output by the Spark job

• d3.js: render charts and graphs in SVG
• crossfilter.js: manipulate data across multiple dimensions
• dc.js: reusable charts

SNA Analytics Pipeline

Social Network Analytics

[Diagram: ETL, Algorithms, Analysis, Visualization]

ETL (extract, transform, load): Spark Streaming, Scala

Algorithms: GraphX Label Propagation, Machine Learning

Analysis: Spark, Scala, SQL

Visualization: JavaScript, D3, SVG

Source Data - Twitter.com

Tweet download is based on search terms (related to consumer electronics, sports, life sciences, etc)

Streaming download since April 2014

Data archived in files to allow reprocessing

[Chart: tweets per hour, x-axis 0 to 24, y-axis 0 to 45,000,000]

[Chart: tweets per day, x-axis 09/29 to 02/01, y-axis 0 to 8,000,000]

Source Data - Twitter.com

The full Twitter firehose is about 600M tweets/day. During the displayed timeframe, we collected 674,106,415 tweets, about 0.91% of the firehose.

[Chart: files per day, x-axis 09/29 to 02/01, y-axis 0 to 26]

Source Data - Storage

JSON saved to files, gzipped

2,290 files, 317GB
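
Reading such an archive file back can be sketched with the Python standard library. The layout of one JSON object per line is an assumption (the slides only say gzipped JSON), and the function name `read_tweets` is hypothetical.

```python
import gzip
import json

def read_tweets(path):
    """Yield one dict per parseable line of a gzipped JSON-lines file.

    Assumes one JSON object per line; blank and malformed lines are skipped.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                continue  # malformed record in the source data
```

Because the files are never modified after they are written, the same reader can be pointed at any archived file to re-process it later.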

SNA - Counts and Aggregations

• Tweets
• Users
• Unique hashtags

Aggregations are done for both
• periodic, per window
• running total since processing began
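
The difference between the per-window counts and the running totals can be made concrete with a small sketch. It is plain Python rather than Spark Streaming; the function name `count_windows` and the tweet fields `user` and `hashtags` are hypothetical.

```python
def count_windows(windows):
    """Return per-window and running counts of tweets, users and hashtags.

    `windows` is a list of micro-batches; each micro-batch is a list of
    dicts with a 'user' string and a 'hashtags' list (hypothetical fields).
    """
    seen_users, seen_tags = set(), set()
    total_tweets = 0
    report = []
    for tweets in windows:
        users = {t["user"] for t in tweets}
        tags = {tag for t in tweets for tag in t["hashtags"]}
        # Running state carried from one window to the next.
        seen_users |= users
        seen_tags |= tags
        total_tweets += len(tweets)
        report.append({
            "window": {"tweets": len(tweets), "users": len(users), "hashtags": len(tags)},
            "running": {"tweets": total_tweets, "users": len(seen_users), "hashtags": len(seen_tags)},
        })
    return report
```

Unique counts cannot be summed across windows, which is why the running totals keep the sets of users and hashtags seen so far.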

More Counts and Aggregations

Hashtags matched to topics

Top hashtags

Top hashtags per user

Errors in source data

NSFW: censor out some tweets based on keywords
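
Two of these aggregations, matching hashtags to topics and finding the top hashtags per user, can be sketched as follows. This is plain Python; the keyword table `TOPIC_KEYWORDS` is a hypothetical example and not the project's actual topic list.

```python
from collections import Counter, defaultdict

# Hypothetical mapping from topic to lower-cased hashtags.
TOPIC_KEYWORDS = {
    "Sports": {"nba"},
    "Consumer Electronics": {"google", "ubuntu"},
}

def topics_for(hashtag):
    """Return the topics whose keyword set contains the hashtag."""
    tag = hashtag.lower()
    return sorted(topic for topic, words in TOPIC_KEYWORDS.items() if tag in words)

def top_hashtags_per_user(pairs, n=2):
    """Return {user: [most used hashtags]} from (user, hashtag) pairs."""
    per_user = defaultdict(Counter)
    for user, tag in pairs:
        per_user[user][tag.lower()] += 1
    return {
        user: [tag for tag, _count in counts.most_common(n)]
        for user, counts in per_user.items()
    }
```

Lower-casing the hashtags before counting treats "NBA" and "nba" as the same tag.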

Build the network for CDA

Label propagation (LP) is a community detection algorithm (CDA), built into Spark GraphX

Input is a network - a list of relationships between entities

We'll look at users that mention other users in tweets

Further restrict to pairs of users who have mentioned each other

• If user A mentioned user B
• and B mentioned A
• then infer that A knows B
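
The mutual-mention rule can be written as a small filter over directed mention pairs. This is a minimal sketch in plain Python; the function name `mutual_mentions` and the choice to ignore self-mentions are assumptions.

```python
def mutual_mentions(mentions):
    """Return undirected 'knows' edges from directed (author, mentioned) pairs.

    An edge (a, b) is kept only if a mentioned b and b mentioned a.
    Self-mentions are ignored (an assumption of this sketch).
    """
    directed = {(a, b) for a, b in mentions if a != b}
    # Sorting each pair stores the undirected edge once.
    return {tuple(sorted((a, b))) for a, b in directed if (b, a) in directed}
```

The resulting edge set is the network that a community detection algorithm takes as input.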

Communities

LP assigns each user to exactly one community

[Graph: users linked as members to their community; three communities shown, with IDs 99212017, 996217376 and 999453985]

Community metrics

Count the users in each community

Density is the proportion of user pairs in a community that know each other

Filter out tiny and huge communities as not interesting
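
Size, density and the size filter can be sketched together. This is plain Python under stated assumptions: density is taken as internal edges divided by the number of possible user pairs, and the thresholds `min_size` and `max_size` are hypothetical.

```python
from collections import defaultdict

def community_metrics(labels, edges, min_size=3, max_size=1000):
    """Return {community: {'size', 'density'}} for communities within the size limits.

    `labels` maps user -> community; `edges` holds unique undirected pairs.
    """
    members = defaultdict(set)
    for user, community in labels.items():
        members[community].add(user)

    # Count the edges whose two endpoints share a community.
    internal = defaultdict(int)
    for a, b in edges:
        if a != b and a in labels and labels[a] == labels.get(b):
            internal[labels[a]] += 1

    result = {}
    for community, users in members.items():
        n = len(users)
        if not (min_size <= n <= max_size):
            continue  # too small or too large to be interesting
        possible_pairs = n * (n - 1) / 2
        density = internal[community] / possible_pairs if possible_pairs else 0.0
        result[community] = {"size": n, "density": density}
    return result
```

A density of 1.0 means every pair of members knows each other; values near 0 indicate a loosely connected group.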

Community Characterization

Find ways to describe communities

• Most popular topics among users

[Graph: communities linked to the topics they reference: Sports, Finance, Consumer Electronics]

Community Characterization

• Most popular hashtags among users

[Graph: communities linked to the hashtags they reference: hardwork, GenerationsLegacy, RageBoy, Bellarke, autism, NBA, watch, Music, cover, MaxScherzer, quote, google, Ubuntu, vaccines, money, startpharma, Business, fandomscollide, stream]

User Roles

Identify user roles within a community
• key influencers: retweeted by others
• rebroadcasters: retweet a lot

Identify user roles between communities
• connectors: relationships with people in different groups, and the strength to each community is balanced
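
The three roles can be approximated with simple counts. The sketch below is plain Python and makes its own assumptions: roles are ranked by raw retweet counts, and a connector's balance is measured as the smallest divided by the largest number of ties to any one community. Neither scoring rule is taken from the project.

```python
from collections import Counter

def retweet_roles(retweets, top=1):
    """Rank users from (retweeter, original_author) pairs."""
    retweeted = Counter(author for _user, author in retweets)
    retweeting = Counter(user for user, _author in retweets)
    return {
        "key_influencers": [u for u, _n in retweeted.most_common(top)],
        "rebroadcasters": [u for u, _n in retweeting.most_common(top)],
    }

def connector_balance(user, edges, labels):
    """Return a score in [0, 1]: 0.0 if the user has ties to fewer than two
    communities, 1.0 if the ties are spread evenly across them."""
    ties = Counter()
    for a, b in edges:
        if a == user and b in labels:
            ties[labels[b]] += 1
        elif b == user and a in labels:
            ties[labels[a]] += 1
    if len(ties) < 2:
        return 0.0
    return min(ties.values()) / max(ties.values())
```

A user with two ties into each of two communities scores 1.0, while a user whose ties all point into one community scores 0.0.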

Results

Communities and Popular Hashtags

User Roles within a Community

Connector Role across Communities

Conclusions

Analytics needs a variety of techniques: Graph, Machine Learning, Iterative, Streaming

Spark: functional, high-level, transparently distributed

Urika-XA: pre-configured cluster, 6 TB memory

References

Apache Spark: http://spark.apache.org/

Twitter Data: https://dev.twitter.com/streaming/public

Lambda Architecture: http://lambda-architecture.net/

Kreps, Jay, "Questioning the Lambda Architecture", 7/2/2014, http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html

Kleppmann, Martin, "Turning the database inside out with Apache Samza", 9/21/2014, https://youtu.be/fU9hR3kiOK0

DC, dimensional charting: http://dc-js.github.io/dc.js/


Questions?

Or contact me later...

Cray Analytics, Urika-XA: http://cray.com/analytics

Mike Hinchey, [email protected]

