METADATA AND THE POWER OF PATTERN-FINDING
MAY 24, 2016 FOR DATAVERSITY
LEON GUZENDA, Chief Technology Marketing Officer
AGENDA
• Who We Are
• Open Source Big & Fast Data Analytics
• Our Core Technology & New Product
• Pattern Finding Examples
• Q & A
OBJECTIVITY, INC.
OBJECTIVITY, INC. OVERVIEW
• Private company, headquartered in Silicon Valley since 1988
• Verticals:
  • Government: Intelligence, defense, crime detection & prevention
  • Financial Services
  • Industrial Internet of Things (IIoT)
  • Energy
  • Healthcare
• Horizontals:
  • Graph analytics
  • Complex, distributed, scalable database applications
SAMPLE CUSTOMERS AND PARTNERS
[Figure: logo groups labeled Capital Intensive Customers, Government Customers, Telco & Network Customers, Technology Partners and SI Partners.]
OPEN SOURCE BIG & FAST DATA ANALYTICS
OPEN SOURCE ANALYTICS...
[Figure: open source analytics stack. Recoverable labels: [Fall 2016]; R; Proprietary Rules, Ontologies, Queries...; Reports, Archives...; Workflow Design GUI; Proprietary.]
...OPEN SOURCE ANALYTICS
PROS:
• Large community
• Lots of algorithms
• Model works at scale
• Low startup costs
• Cost effective
CONS:
• Most algorithms are based on statistical correlation, clustering or filtering
• Graph algorithms mainly tackle theoretical problems
• Hadoop mostly targets files, not metadata.
• Metadata tools focus on technical parameters, not semantic content.
APACHE SPARK GRAPHX API
• Vertex, Edge and Triplet operations
• Graph modification operations
• RDD join operations
• Adjacent triplet operations
• Iterative graph-parallel operations
• PageRank, connected components, triangle counts etc.
Spark GraphFrames add Motifs (a simple subgraph definition)
BUT
Efficient pathfinding and complex navigation are inhibited because of a table/triplet approach.
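The trade-off named on this slide can be illustrated with a toy sketch in plain Python. It is not GraphX code and not a benchmark: it only contrasts a traversal that rescans a flat table of (source, label, destination) triplets at every hop with one that follows per-vertex edge lists. The sample graph, the function names and the counters are all hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical edge table: (source, label, destination) triplets.
TRIPLETS = [
    ("A", "knows", "B"), ("B", "knows", "C"), ("C", "knows", "D"),
    ("A", "knows", "E"), ("E", "knows", "D"), ("X", "knows", "Y"),
]

def hops_by_table_scan(triplets, start, goal):
    """Breadth-first search that rescans the whole edge table at every vertex."""
    rows_read = 0
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        vertex, depth = frontier.popleft()
        if vertex == goal:
            return depth, rows_read
        for src, _label, dst in triplets:      # full scan per visited vertex
            rows_read += 1
            if src == vertex and dst not in seen:
                seen.add(dst)
                frontier.append((dst, depth + 1))
    return None, rows_read

def hops_by_adjacency(triplets, start, goal):
    """Same search, but each vertex keeps a direct list of its out-edges."""
    adjacency = defaultdict(list)
    for src, _label, dst in triplets:
        adjacency[src].append(dst)
    edges_read = 0
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        vertex, depth = frontier.popleft()
        if vertex == goal:
            return depth, edges_read
        for dst in adjacency[vertex]:          # only this vertex's edges
            edges_read += 1
            if dst not in seen:
                seen.add(dst)
                frontier.append((dst, depth + 1))
    return None, edges_read

table_result = hops_by_table_scan(TRIPLETS, "A", "D")
index_result = hops_by_adjacency(TRIPLETS, "A", "D")
print("table scan:", table_result, "adjacency:", index_result)
```

Both searches find the same hop count; the difference is how many rows or edges each one has to touch, and that gap widens as the edge table grows.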
OUR CORE TECHNOLOGY
OUR FOCUS
• Complex Objects at scale:
  • Relationships are first class citizens
  • Ultra-fast navigation and pathfinding
  • Not restricted by available RAM
• Scalability, performance, reliability and flexibility:
  • Distributed database and distributed processing
  • Light, small database kernel - from embedded to cluster to cloud
DISTRIBUTED DATA - SINGLE LOGICAL VIEW
• 1,000s of trillions of unique objects
• 1,000s of petabytes of storage
• Resolving an ID is fast, regardless of the number of objects
Put the data and processing where it's needed
DISTRIBUTED PROCESSING
Put the data and processing where it's needed
[Figure labels: ThingSpan, Cache, Client Processes]
THINGSPAN
THINGSPAN ENVIRONMENT
THINGSPAN FEATURES
• Uses the Apache Spark open source processing engine
• In partnership with Cloudera, Databricks, Hortonworks and MapR
• Powerful object and relationship modeling
• Can store data in HDFS and/or POSIX
• Ultra-fast graph navigation, pathfinding and pattern finding
• REST Server and API for loading data and performing graph analytics
• Spark DataFrame support to leverage MLlib, GraphX, SQL etc.
DISTRIBUTED PROCESSING & DATABASE
Distributed from top to bottom
[Figure label: Hadoop Distributed File System]
THINGSPAN COMPONENTS
[Figure: OPEN SOURCE ANALYTICS STACK, with labels [Fall 2016]; R; Proprietary Rules, Ontologies, Queries...; Reports, Archives...; Workflow Design GUI; Proprietary. THINGSPAN ENHANCED ANALYTICS STACK, with label [Later this year].]
PATTERN FINDING
PATTERN FINDING TECHNIQUES
• Conventional Business Intelligence Analytics: Uses filters and statistical correlation to find relationships between parameters.
• Graph Pattern Finding Analytics: Uses a combination of outlier, navigational and pathfinding queries.
  • Find outliers with SQL or MLlib.
  • A navigational query can specify Vertex and Edge types to be included/excluded and can invoke methods during the traversal, e.g. to compute transit time to a node.
  • A pathfinding query can find the shortest path, or all paths, between two or more Vertices.
  • The order in which the query types are applied depends upon the problem.
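A navigational query of the kind described here can be sketched in plain Python. This is not the ThingSpan API: the graph, the type names and the `navigate` and `transit_hours` functions are hypothetical, chosen only to show type-based include/exclude filters plus a method invoked on each traversed edge to accumulate transit time.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    kind: str      # edge type, e.g. "Road" or "Rail"
    hours: float   # time needed to cross this edge

# Hypothetical graph: vertex name -> vertex type, plus typed edges.
VERTEX_TYPES = {"Depot": "Site", "Hub": "Site", "Port": "Site", "Customs": "Checkpoint"}
EDGES = [
    Edge("Depot", "Hub", "Road", 4.0),
    Edge("Hub", "Port", "Rail", 6.0),
    Edge("Hub", "Customs", "Road", 1.0),
    Edge("Customs", "Port", "Road", 2.5),
]

def transit_hours(edge):
    """Method invoked on every edge the traversal follows."""
    return edge.hours

def navigate(start, include_edge_kinds, exclude_vertex_types, edge_method):
    """Traverse from start, honoring the type filters.

    Returns the smallest accumulated transit time for each reached vertex.
    """
    reached = {}
    stack = [(start, 0.0)]
    while stack:
        vertex, elapsed = stack.pop()
        if vertex in reached and reached[vertex] <= elapsed:
            continue                      # already reached at least as quickly
        reached[vertex] = elapsed
        for edge in EDGES:
            if edge.src != vertex or edge.kind not in include_edge_kinds:
                continue
            if VERTEX_TYPES[edge.dst] in exclude_vertex_types:
                continue
            stack.append((edge.dst, elapsed + edge_method(edge)))
    return reached

# Include only Road edges.
road_only = navigate("Depot", {"Road"}, set(), transit_hours)
# Include Road and Rail edges, but exclude Checkpoint vertices.
no_checkpoints = navigate("Depot", {"Road", "Rail"}, {"Checkpoint"}, transit_hours)
print(road_only)
print(no_checkpoints)
```

Changing the filters changes which vertices are reachable and how long it takes to get there, which is the point of pushing type constraints into the traversal itself.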
PATH-FINDING QUERY
[Figure: graph model. CITY vertices joined by LINK edges with attributes Mode, Duration and Cost.]
• Problem: Find the least expensive route between San Francisco and New York for a 60-ton, very wide load that must arrive by Saturday, while minimizing mode transitions (road/rail/water etc.)
• Implied: We can avoid Rail connections.
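The problem above can be sketched in plain Python under stated assumptions. The cities, durations and costs below are invented for illustration, the deadline is expressed as a maximum number of hours, and routes are ranked by cost first and mode transitions second. The sketch simply enumerates every simple route, which is only workable for a tiny graph; it says nothing about how ThingSpan evaluates pathfinding queries.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Link:
    src: str
    dst: str
    mode: str    # "road", "rail" or "water"
    hours: int   # Duration
    cost: int    # Cost in dollars

# Hypothetical network; none of these figures come from the slides.
LINKS = [
    Link("San Francisco", "Denver", "road", 30, 4000),
    Link("San Francisco", "Denver", "rail", 24, 2500),
    Link("Denver", "Chicago", "road", 22, 3000),
    Link("Denver", "Chicago", "rail", 18, 2000),
    Link("Chicago", "New York", "road", 16, 2500),
    Link("Chicago", "Buffalo", "water", 40, 1200),
    Link("Buffalo", "New York", "road", 8, 900),
]

def all_routes(src, dst, banned_modes, visited=None):
    """Yield every simple route (list of Links) from src to dst that avoids banned modes."""
    visited = (visited or set()) | {src}
    if src == dst:
        yield []
        return
    for link in LINKS:
        if link.src == src and link.mode not in banned_modes and link.dst not in visited:
            for rest in all_routes(link.dst, dst, banned_modes, visited):
                yield [link] + rest

def best_route(src, dst, max_hours, banned_modes=frozenset({"rail"})):
    """Return (cost, mode_transitions, hours, cities) for the best route, or None."""
    candidates = []
    for route in all_routes(src, dst, banned_modes):
        hours = sum(link.hours for link in route)
        if hours > max_hours:
            continue                      # would miss the deadline
        cost = sum(link.cost for link in route)
        transitions = sum(1 for a, b in zip(route, route[1:]) if a.mode != b.mode)
        candidates.append((cost, transitions, hours, [src] + [link.dst for link in route]))
    return min(candidates) if candidates else None

tight = best_route("San Francisco", "New York", max_hours=96)
relaxed = best_route("San Francisco", "New York", max_hours=120)
print(tight)
print(relaxed)
```

With the tighter deadline only the all-road route qualifies; relaxing the deadline lets a cheaper route with two mode transitions win, which shows why the ranking order between cost and transitions has to be decided up front.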
PATTERN FINDING EXAMPLES
• Financial: Money Laundering Detection
• Intelligence Analysis: Threat Detection
• AdTech: Recommendation Engine Support
• Industrial Internet of Things (IIoT): Network Congestion Analysis
FINANCIAL: MONEY LAUNDERING DETECTION
1. Load Person, Account and Transaction data into ThingSpan
[Figure: graph of people P1, P2 and P3 and accounts Acc 1, Acc 2, Acc 20 to Acc 24, Acc 31 to Acc 33 and Acc 35, with $ markers.]
FINANCIAL: APPLY SPARK GRAPHX
2. Identify people with more than 5 accounts (centrality)
[Figure: graph of people P1, P2 and P3 and accounts Acc 1, Acc 2, Acc 20 to Acc 24, Acc 31 to Acc 33 and Acc 35, with $ markers.]
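Step 2 is a degree count: how many Account vertices each Person vertex is connected to. A minimal plain-Python sketch follows; the ownership edges are an assumption made for illustration (the slide text does not state who owns which account), and `people_with_many_accounts` is a hypothetical helper, not a Spark or ThingSpan call.

```python
from collections import Counter

# Hypothetical (person, account) ownership edges.
OWNS = [
    ("P1", "Acc 1"), ("P1", "Acc 2"),
    ("P2", "Acc 20"), ("P2", "Acc 21"), ("P2", "Acc 22"),
    ("P2", "Acc 23"), ("P2", "Acc 24"), ("P2", "Acc 35"),
    ("P3", "Acc 31"), ("P3", "Acc 32"), ("P3", "Acc 33"),
]

def people_with_many_accounts(owns, threshold=5):
    """Return people whose number of owned accounts exceeds the threshold."""
    degree = Counter(person for person, _account in owns)
    return sorted(person for person, count in degree.items() if count > threshold)

flagged = people_with_many_accounts(OWNS)
print(flagged)
```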
FINANCIAL: APPLY A NAVIGATIONAL QUERY
3. Look at all of that person's transactions to see if they terminate in just 1 or 2 offshore accounts
4. INVESTIGATE
[Figure: graph of people P1, P2 and P3 and accounts Acc 1, Acc 2, Acc 20 to Acc 24, Acc 31 to Acc 33 and Acc 35, with $ markers.]
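Step 3 can be sketched as a traversal that follows transfers out of a person's accounts and collects the accounts where the money stops. Everything in the data below is hypothetical: the transfer edges, the choice of offshore account and the ownership of accounts by P2 are assumptions for illustration, and the functions are not ThingSpan calls.

```python
# Hypothetical transfers: account -> accounts it sent money to.
TRANSFERS = {
    "Acc 20": ["Acc 31"], "Acc 21": ["Acc 31"], "Acc 22": ["Acc 32"],
    "Acc 23": ["Acc 32"], "Acc 24": ["Acc 31"], "Acc 35": ["Acc 32"],
    "Acc 31": ["Acc 33"], "Acc 32": ["Acc 33"],
}
OFFSHORE = {"Acc 33"}   # assumption: this account is flagged as offshore
PERSON_ACCOUNTS = {
    "P2": ["Acc 20", "Acc 21", "Acc 22", "Acc 23", "Acc 24", "Acc 35"],
}

def terminal_accounts(start_accounts):
    """Follow transfers from the starting accounts; return accounts with no onward transfer."""
    seen, stack, terminals = set(), list(start_accounts), set()
    while stack:
        account = stack.pop()
        if account in seen:
            continue
        seen.add(account)
        targets = TRANSFERS.get(account, [])
        if not targets:
            terminals.add(account)
        stack.extend(targets)
    return terminals

def needs_investigation(person, max_sinks=2):
    """True if all of the person's money trails end in at most max_sinks offshore accounts."""
    sinks = terminal_accounts(PERSON_ACCOUNTS[person])
    return 0 < len(sinks) <= max_sinks and sinks <= OFFSHORE

sinks_for_p2 = terminal_accounts(PERSON_ACCOUNTS["P2"])
print(sinks_for_p2, needs_investigation("P2"))
```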
HUMINT: THREAT DETECTION
1. Load People, Calls, Places and Sightings into the Graph
[Figure: graph of People (P1 to P3, P5 to P18), Calls (CDR1 to CDR17), Places (PlaceX, PlaceY, PlaceZ) and Sightings (Seen1 to Seen4).]
HUMINT: APPLY SPARK GRAPHX
2. Use Spark GraphX to find "islands" of callers/callees.
[Figure: graph of People (P1 to P3, P5 to P18) linked by Calls (CDR1 to CDR17).]
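Finding "islands" of callers and callees is a connected components computation, which GraphX offers as a built-in operator. The plain-Python sketch below does the same job with a small union-find structure so that it runs without Spark. The call pairs are hypothetical; the slide text does not say who called whom.

```python
# Hypothetical (caller, callee) pairs, one per call record.
CALLS = [
    ("P1", "P2"), ("P2", "P3"), ("P5", "P6"),
    ("P7", "P8"), ("P8", "P11"), ("P11", "P14"), ("P14", "P15"),
    ("P17", "P18"),
]

def islands(calls):
    """Group people into connected components of the call graph."""
    parent = {}

    def find(person):
        parent.setdefault(person, person)
        while parent[person] != person:
            parent[person] = parent[parent[person]]   # path halving
            person = parent[person]
        return person

    for caller, callee in calls:
        root_a, root_b = find(caller), find(callee)
        if root_a != root_b:
            parent[root_b] = root_a                   # merge the two islands

    groups = {}
    for person in list(parent):
        groups.setdefault(find(person), set()).add(person)
    return list(groups.values())

call_islands = islands(CALLS)
for island in call_islands:
    print(sorted(island))
```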
HUMINT: APPLY A NAVIGATIONAL QUERY
3. Use a navigational query to see if any of those People have been seen near Places that need to be protected.
[Figure: graph of People (P1 to P3, P5 to P18), Calls (CDR1 to CDR17), Places (PlaceX, PlaceY, PlaceZ) and Sightings (Seen1 to Seen4).]
HUMINT: PLAN ACTION
4. P14 and P15 have been seen near potential target PlaceX, so they, plus P11, P7 and P8, should be put under surveillance.
[Figure: graph of People (P1 to P3, P5 to P18), Calls (CDR1 to CDR17), Places (PlaceX, PlaceY, PlaceZ) and Sightings (Seen1 to Seen4).]
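Steps 3 and 4 can be sketched together: navigate from the protected Places through Sightings to People, then widen the result to everyone in the same calling island. The data is partly assumed. That P14 and P15 were seen near PlaceX, and that P11, P7 and P8 are connected to them, comes from the slide; the other sightings, the other islands and the function names are hypothetical.

```python
# Sighting -> (person, place). Seen3 and Seen4 are hypothetical assignments.
SIGHTINGS = {
    "Seen1": ("P14", "PlaceX"),
    "Seen2": ("P15", "PlaceX"),
    "Seen3": ("P17", "PlaceY"),
    "Seen4": ("P18", "PlaceZ"),
}
# Calling islands, as produced by a connected components step (hypothetical grouping).
ISLANDS = [
    {"P7", "P8", "P11", "P14", "P15"},
    {"P1", "P2", "P3"},
    {"P17", "P18"},
]
PROTECTED_PLACES = {"PlaceX"}

def seen_near(protected_places):
    """People with at least one sighting at a protected place."""
    return {person for person, place in SIGHTINGS.values() if place in protected_places}

def surveillance_list(protected_places):
    """Everyone who shares a calling island with a person seen near a protected place."""
    suspects = seen_near(protected_places)
    watch = set()
    for island in ISLANDS:
        if island & suspects:
            watch |= island
    return watch

suspects = seen_near(PROTECTED_PLACES)
watch_list = surveillance_list(PROTECTED_PLACES)
print(sorted(suspects), sorted(watch_list))
```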
ADTECH: PRE-PLANNED ADS
1. Load Products, Orders, People and Social_Links into ThingSpan.
[Figure: graph of People (Joe, Fred, Mary, Jane, Bill), Products (Pr1 to Pr6), Orders (Sale1 to Sale5) and Social_Links (Follows).]
ADTECH: PRE-PLANNED ADS
2. We want to place ads for Product Pr2
[Figure: graph of People (Joe, Fred, Mary, Jane, Bill), Products (Pr1 to Pr6), Orders (Sale1 to Sale5) and Social_Links (Follows).]
ADTECH: WHO FOLLOWS BUYERS OF THE PRODUCT?
3. Use ThingSpan to find bloggers who bought Pr2 and who also have followers.
Result: Fred bought Pr2. Mary follows Fred's blogs. Jane & Bill follow Mary's.
[Figure: graph of People (Joe, Fred, Mary, Jane, Bill), Products (Pr1 to Pr6), Orders (Sale1 to Sale5) and Social_Links (Follows).]
ADTECH: DISPLAY THE AD
4. Next time you spot Mary, Jane or Bill, display a personalized ad for Pr2.
[Figure: graph of People (Joe, Fred, Mary, Jane, Bill), Products (Pr1 to Pr6), Orders (Sale1 to Sale5) and Social_Links (Follows), with a sample ad reading "Buy 1!".]
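The AdTech query can be sketched in plain Python: find buyers of the product who have followers, then walk the Follows links outward to collect the audience. The Fred, Mary, Jane and Bill relationships are taken from the Result line on the slide; Joe's purchase is a made-up filler row, and `ad_audience` is a hypothetical helper rather than a ThingSpan call.

```python
from collections import defaultdict, deque

# (buyer, product). Only Fred's purchase of Pr2 comes from the slide.
PURCHASES = [("Fred", "Pr2"), ("Joe", "Pr1")]
# (follower, followed), as stated in the slide's Result line.
FOLLOWS = [("Mary", "Fred"), ("Jane", "Mary"), ("Bill", "Mary")]

def ad_audience(product):
    """Return (bloggers, audience): buyers with followers, and their direct and indirect followers."""
    followers_of = defaultdict(set)
    for follower, followed in FOLLOWS:
        followers_of[followed].add(follower)

    # Buyers of the product who have at least one follower.
    bloggers = {buyer for buyer, item in PURCHASES if item == product and followers_of[buyer]}

    audience, queue = set(), deque(bloggers)
    while queue:
        person = queue.popleft()
        for follower in followers_of[person]:
            if follower not in audience and follower not in bloggers:
                audience.add(follower)
                queue.append(follower)
    return bloggers, audience

bloggers, audience = ad_audience("Pr2")
print(sorted(bloggers), sorted(audience))
```

Joe bought Pr1 but has no followers in this data, so a query for Pr1 returns no bloggers and no audience.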
IIOT: TELCO NETWORK CONGESTION
1. Load Location, Equipment, Link (+Load) into the graph
[Figure: network of locations L1 to L4 (labeled SAN JOSE, SALT LAKE CITY, CHICAGO, NEW YORK), equipment E1 to E3, E20 to E22, E30 to E33 and E40, and links Link 1 to Link 9. Link loads shown: 20%, 20%, 95%, 65%, 20%, 50%, 30%, 25%, and one link marked Off.]
IIOT: APPLY SPARK SQL
2. Use Spark SQL to find links that are over 90% loaded.
[Figure: network of locations L1 to L4 (labeled SAN JOSE, SALT LAKE CITY, CHICAGO, NEW YORK), equipment E1 to E3, E20 to E22, E30 to E33 and E40, and links Link 1 to Link 9. Link loads shown: 20%, 20%, 95%, 65%, 20%, 50%, 30%, 25%, and one link marked Off.]
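The outlier step is an ordinary SQL filter. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Spark SQL so that it runs without a cluster; the SELECT statement itself is plain SQL. The table name, column names and most of the load assignments are hypothetical. Only the facts that Link 6 is the overloaded link and that Link 5 is off come from the slides.

```python
import sqlite3

# (link name, load in percent). None marks a link that is switched off.
LINK_LOADS = [
    ("Link 1", 20), ("Link 2", 20), ("Link 3", 65), ("Link 4", 20),
    ("Link 5", None),
    ("Link 6", 95), ("Link 7", 50), ("Link 8", 30), ("Link 9", 25),
]

def overloaded_links(threshold_pct=90):
    """Return the names of links whose load exceeds threshold_pct."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE links (name TEXT, load_pct INTEGER)")
    conn.executemany("INSERT INTO links VALUES (?, ?)", LINK_LOADS)
    rows = conn.execute(
        "SELECT name FROM links WHERE load_pct > ? ORDER BY name",
        (threshold_pct,),
    ).fetchall()
    conn.close()
    return [name for (name,) in rows]

hot_links = overloaded_links()
print(hot_links)
```

A link with a NULL load never satisfies the comparison, so the switched-off link is left out of the result.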
IIOT: APPLY A THINGSPAN NAVIGATIONAL QUERY
3. Use a graph query to find the leaf nodes (branch ends)...
... Then Investigate...
[Figure: network of locations L1 to L4 (labeled SAN JOSE, SALT LAKE CITY, CHICAGO, NEW YORK), equipment E1 to E3, E20 to E22, E30 to E33 and E40, and links Link 1 to Link 9. Link loads shown: 20%, 20%, 95%, 65%, 20%, 50%, 30%, 25%, and one link marked Off.]
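Finding the leaf nodes (branch ends) amounts to selecting vertices with exactly one connection. The plain-Python sketch below assumes a simplified topology in which locations form a chain and each piece of equipment hangs off one location; the real diagram's wiring is not recoverable from the slide text, so these edges are hypothetical.

```python
from collections import defaultdict

# Hypothetical undirected connections between locations (L*) and equipment (E*).
CONNECTIONS = [
    ("L1", "L2"), ("L2", "L3"), ("L3", "L4"),
    ("L1", "E1"), ("L1", "E2"), ("L1", "E3"),
    ("L2", "E20"), ("L2", "E21"), ("L2", "E22"),
    ("L3", "E30"), ("L3", "E31"), ("L3", "E32"), ("L3", "E33"),
    ("L4", "E40"),
]

def leaf_nodes(connections):
    """Return the vertices that have exactly one connection."""
    degree = defaultdict(int)
    for a, b in connections:
        degree[a] += 1
        degree[b] += 1
    return {vertex for vertex, count in degree.items() if count == 1}

leaves = leaf_nodes(CONNECTIONS)
print(sorted(leaves))
```

In this layout every leaf is a piece of equipment, which gives the list of endpoints to investigate as possible traffic sources.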
IIOT: DIAGNOSE
4. Aha! E2 and E3 in San Jose are streaming 8K UHDTV movies from MovieFlix in New York, overloading Link 6.
[Figure: network of locations L1 to L4 (labeled SAN JOSE, SALT LAKE CITY, CHICAGO, NEW YORK), equipment E1 to E3, E20 to E22, E30 to E33 and E40, and links Link 1 to Link 9. Link loads shown: 20%, 20%, 95%, 65%, 20%, 50%, 30%, 25%; Link 5 is Off.]
IIOT: FIX
5. Solved - by switching on Link 5.
[Figure: network of locations L1 to L4 (labeled SAN JOSE, SALT LAKE CITY, CHICAGO, NEW YORK), equipment E1 to E3, E20 to E22, E30 to E33 and E40, and links Link 1 to Link 9. Link loads shown: 20%, 20%, 50%, 65%, 20%, 50%, 30%, 25%; Link 5 now shows 45%.]
SUMMARY
• Open Source Big & Fast Data analytics tools are great at what they're designed for.
• ThingSpan adds a Metadata Store and scalable graph analytics
  • Ultra-fast navigation and pathfinding queries.
• It can interoperate with streaming systems and Big Data platforms
  • ThingSpan is extensible to other open source systems
QUESTIONS?