
Understanding Big Data for policy professionals

Jan 23, 2018


Alex Jouravlev
Transcript
Page 1: Understanding Big Data for policy professionals

BA

([email protected])

It is not only Hadoop…

Page 2: Understanding Big Data for policy professionals


• Big Data are new types of data, free of the limitations we had to impose decades ago because of the state of hardware and software back then

• The main challenge is therefore unlearning those limitations, and learning to incorporate Big Data capabilities and agility into [policy] work

• Traditional reporting and BI work with “known knowns”. Big Data allows working with “known unknowns”, “unknown knowns” and “unknown unknowns”.

• Several distinct types of technology fall under the “Big Data” moniker, each with its own unique capabilities: Hadoop, NoSQL, Semantic, Graph

© Copyright Business Abstraction Pty Ltd 2014-2015

Page 3: Understanding Big Data for policy professionals


• Consists of tables tightly packed with data, specific type per row

• Tables identified and created in advance

• Tables populated from human input

• Tables are used by filtering and grouping rows, and by performing a limited number of joins, for reports, OLAP etc

• Text data are supposed to be read by people

Page 4: Understanding Big Data for policy professionals

• Data coming from all over the Internet

• Data from Internet of Things.

• Human circumstances

• XML structures

• Data come from someplace, designed by someone else

• Machine learning

• Clustering

• Graph algorithms

Page 5: Understanding Big Data for policy professionals

• Traditional for IT

• Fully defined data

• Traditional Database

• Master Data Model

• Data Warehouse

• New generation of ideas and technologies

• Presumes only part of information is known

• Internet

• Information across multiple enterprises

• Information extracted from texts

Page 6: Understanding Big Data for policy professionals

New generations of tools, often coming from Internet companies, designed for “New Data”

• Hadoop File System

• NoSQL: Cassandra, MarkLogic, Couchbase, DynamoDB

• Column-store RDB

• Semantic DBs

• Graph DBs

• Map/Reduce of different flavours

• XQuery

• SPARQL

• Gremlin

Page 7: Understanding Big Data for policy professionals

• Write anything associated with a primary key (akin to a file path)

• Distributed over commodity servers

• Highly concurrent write and read

• Everything is cheap – hardware, “design” etc

• However, small records have to be packed into SequenceFiles or MapFiles

• Anything at scale – can store petabytes of files

• Designed for Map/Reduce batch work, data lakes

• Anything interactive requires massive hardware

Page 8: Understanding Big Data for policy professionals

The term “NoSQL” means “not relational”, and as such covers a lot of different models. Some of them suit the complexity of generic data storage. They are called “semi-structured” because, although individual data items are structures, the structures are not necessarily defined in advance.

NoSQL platforms combine Hadoop’s “store anything” capability with indexing and

• store and index XML or JSON documents (“trees”).

• A deep row store can be seen as a document database where the depth of trees is limited to 2.

• Tables with named fields per row
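
As a minimal sketch of the “store and index documents” idea, the Python below parses two JSON documents whose structures were never declared in advance and indexes every leaf value by its path in the tree. The documents, field names and the `walk` and `index_paths` helpers are all hypothetical, chosen for illustration only; real NoSQL platforms build comparable indexes internally and far more efficiently.

```python
import json
from collections import defaultdict

# Two hypothetical JSON documents with different structures;
# neither structure was declared in advance.
RAW_DOCS = {
    "doc1": '{"name": "Alice", "address": {"city": "Canberra"}}',
    "doc2": '{"name": "Bob", "licences": [{"type": "fishing"}]}',
}


def walk(node, path=()):
    """Yield (path, value) for every leaf of a JSON tree."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from walk(child, path + (key,))
    elif isinstance(node, list):
        for child in node:
            yield from walk(child, path)  # list positions are ignored
    else:
        yield path, node


def index_paths(raw_docs):
    """Build an index: (path, value) -> set of document ids."""
    index = defaultdict(set)
    for doc_id, raw in raw_docs.items():
        for path, value in walk(json.loads(raw)):
            index[(path, value)].add(doc_id)
    return index


INDEX = index_paths(RAW_DOCS)
# Find documents by a nested field without knowing any schema up front.
print(INDEX[(("address", "city"), "Canberra")])
```

The index is keyed on whatever paths happen to occur in the data, which is one way to read “semi-structured”: each item is a structure, but no structure is imposed beforehand.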

Page 9: Understanding Big Data for policy professionals

• “Interactive Hadoop”

• Low-granularity Hadoop

• Data Lake

• Operations DBs with complex data

• Data consolidation

• Dynamic Data Warehouse

• Operational Data Warehouse

• Data are presumed to be “forests” of “trees” – connected data are not handled as well

• A touch more expensive than Hadoop

Page 10: Understanding Big Data for policy professionals

Column-store RDBs provide a traditional RDB interface in the new world

• Different internal structure

• Less suitable for OLTP

• Suitable for sparse data – empty fields take no space and carry no read penalty

• Much faster for analytics, especially if only selected fields are used

• Analytics when schema is known

• Cannot do schema-on-read
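
The points about sparse data and selected fields can be illustrated with a toy column layout. This is only a sketch in Python: the records, the field names and the `to_columns` helper are hypothetical, and real column stores add compression, typing and disk layout that are not shown here.

```python
# Hypothetical sparse records: most fields are missing in most rows.
ROWS = [
    {"id": 1, "age": 34},
    {"id": 2, "income": 52000},
    {"id": 3, "age": 51, "income": 61000},
]


def to_columns(rows):
    """Store each field as its own {row_number: value} mapping.

    Missing fields are simply absent, so they take no space.
    """
    columns = {}
    for row_number, row in enumerate(rows):
        for field, value in row.items():
            columns.setdefault(field, {})[row_number] = value
    return columns


COLUMNS = to_columns(ROWS)

# An analytic query over one field reads only that column.
ages = COLUMNS["age"].values()
average_age = sum(ages) / len(ages)
print(average_age)
```

Because the average touches only the `age` column, the cost of the query does not grow with the number of other fields in the table.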

Page 11: Understanding Big Data for policy professionals

Semantic DBs support the Resource Description Framework (RDF), originally created for Semantic Web metadata. RDF stores information in Subject-Predicate-Object “triples”, a highly flexible representation. They use SPARQL for queries.

• Graph patterns

• Metadata for Hadoop/NoSQL. Lack of internal schema requires external metadata

• Do not scale as much

• Hype-contaminated: people who understand enterprise and understand Semantic Tech are rare
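
To make Subject-Predicate-Object triples and graph patterns concrete, here is a small Python sketch that matches a single triple pattern against a set of triples. The facts and the `match` helper are hypothetical; a real Semantic DB would evaluate SPARQL queries over RDF, with URIs rather than the short names used here.

```python
# Hypothetical facts as (subject, predicate, object) triples.
TRIPLES = {
    ("alice", "worksFor", "acme"),
    ("bob", "worksFor", "acme"),
    ("acme", "locatedIn", "sydney"),
    ("alice", "knows", "bob"),
}


def match(pattern, triples):
    """Return variable bindings for one triple pattern.

    Terms starting with "?" are variables, as in SPARQL.
    """
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A repeated variable must bind to the same value.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            results.append(binding)
    return results


# Roughly equivalent to: SELECT ?who WHERE { ?who worksFor acme }
employees = sorted(b["?who"] for b in match(("?who", "worksFor", "acme"), TRIPLES))
print(employees)
```

Adding a new kind of fact means adding a new triple with a new predicate; nothing resembling a table definition has to change.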

Page 12: Understanding Big Data for policy professionals

Graph Databases see data as one huge graph. They are optimised for navigating the edges of the graph. Use Gremlin.

• Implementing Graph Analytics

• Bespoke graph logic

• Backend for general apps (if BASE jumping is too boring)

• Not as scalable as NoSQL

• Lack the declarative data type, pattern and rule definitions of Semantic DBs

• Depend on ability to build and maintain a graph
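
“Navigating the edges of the graph” can be sketched as a breadth-first traversal. The Python below is an illustration under stated assumptions: the graph, its vertex names and the `within_hops` helper are hypothetical, and a graph database would express the same traversal in its own query language (such as Gremlin) over stored data.

```python
from collections import deque

# Hypothetical graph stored as adjacency lists: vertex -> neighbours.
EDGES = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": ["erin"],
    "erin": [],
}


def within_hops(graph, start, max_hops):
    """Breadth-first traversal: vertices reachable in at most max_hops edges."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if seen[vertex] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(vertex, []):
            if neighbour not in seen:
                seen[neighbour] = seen[vertex] + 1
                queue.append(neighbour)
    del seen[start]
    return seen


print(within_hops(EDGES, "alice", 2))
```

The work done depends on the neighbourhood visited rather than on the total size of the data, which is the property graph stores are optimised for.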

Page 13: Understanding Big Data for policy professionals

A platform for massively parallel computation that shares the workload effectively between commodity servers.

• MapReduce

• YARN

• Apache Spark

• Batch jobs over massive data

• On-demand queries where some lag is acceptable

• Implementations have powerful Analytics/Machine Learning libraries

• Latency
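
The map/shuffle/reduce pattern behind these platforms can be shown with the classic word count, run here in a single Python process. This is a sketch only: the input chunks and function names are hypothetical, and a real framework would run the map and reduce steps on many servers and handle the shuffle over the network.

```python
from collections import defaultdict

# Hypothetical input, already split into chunks as if spread over servers.
CHUNKS = [
    "big data is not only hadoop",
    "hadoop is a platform for big data",
]


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk."""
    return [(word, 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key."""
    return key, sum(values)


mapped = [pair for chunk in CHUNKS for pair in map_phase(chunk)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts["hadoop"], counts["big"])
```

Each call to `map_phase` depends on one chunk only and each call to `reduce_phase` on one key only, which is what lets the work be shared between servers.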

Page 14: Understanding Big Data for policy professionals

• Ensure datasets are identifiable

• Capture metadata

• Ensure your data are not lost

• Profile data across field names, structures etc

• Locate data as needed

• As you learn more about data, build up your metadata

• Hadoop

• NoSQL
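
Profiling data across field names can start very simply. The Python sketch below counts how often each field name appears across records that arrived from different sources; the records and the `profile_fields` helper are hypothetical and stand in for whatever a real data lake holds.

```python
from collections import Counter

# Hypothetical records landed in a data lake from different sources.
RECORDS = [
    {"name": "Alice", "dob": "1980-01-01"},
    {"name": "Bob", "postcode": "2600"},
    {"full_name": "Carol", "postcode": "2000"},
]


def profile_fields(records):
    """Count how many records carry each field name."""
    return Counter(field for record in records for field in record)


PROFILE = profile_fields(RECORDS)
print(PROFILE.most_common())
```

A profile like this is a first piece of metadata: it shows, for example, that `name` and `full_name` may be the same attribute under two labels, which is worth recording as you learn more about the data.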

Page 15: Understanding Big Data for policy professionals

• A server in $1,000-$10,000 range

• 0.5TB – 25TB per server

• A lot of them if needed

• In the ideal case, doubling the number of servers halves the time to execute a task.
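
The halving claim holds when the work is perfectly parallel. As a rough sketch, the Python below uses Amdahl's law, which is not mentioned in the original slides and is brought in here only to show how a serial share of the work limits the gain. The `run_time` function and the numbers are hypothetical.

```python
def run_time(single_server_time, servers, serial_fraction=0.0):
    """Estimated run time on several servers (Amdahl's law).

    serial_fraction is the share of the work that cannot be parallelised.
    """
    parallel_fraction = 1.0 - serial_fraction
    return single_server_time * (serial_fraction + parallel_fraction / servers)


# Perfectly parallel job: doubling the servers halves the time.
print(run_time(100.0, 10), run_time(100.0, 20))

# Hypothetical job with 10% serial work: doubling helps noticeably less.
print(run_time(100.0, 10, 0.1), run_time(100.0, 20, 0.1))
```

With no serial work the time drops from 10 to 5 units when going from 10 to 20 servers; with a 10% serial share the same doubling only takes it from about 19 to about 14.5.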

Page 16: Understanding Big Data for policy professionals

Unlearning is perhaps more complex than learning

• A lot of data you do not know about are available and can be used

• For many types of objects, it is natural to have uncommon attributes

• Data storage is cheap. It doesn’t cost much to store everything remotely related

• No massive pre-work.

• Ask everything

Page 17: Understanding Big Data for policy professionals

• Traditional reporting, BI

• Predictive analytics

• Data consolidation, Semantic Integration, Object-based Intelligence

• Clustering

Page 18: Understanding Big Data for policy professionals

• Straightforward operation, no design upfront

• Can take immensely complex metadata, like UML & BPMN models

• Apply OWL for classification

• SWRL builds complex linkages

• Refer to Classes defined by lower-level Ontologies rather than data

Page 19: Understanding Big Data for policy professionals

• Words to be converted to tags (URLs)

• Some words have multiple meanings

• Ontology provides possible tags for nouns

• Software tries to resolve expected predicates

• The tag that can find necessary relations (predicates) wins

• Use ontology to restrict search

• Much more flexible than “foreign key”
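
The rule “the tag that can find the necessary relations wins” can be sketched as a simple scoring step. Everything in the Python below is hypothetical: the mini-ontology, the `example.org` URLs, the predicate names and the `resolve` helper are made up for illustration, and real text-analytics software would use far richer evidence.

```python
# Hypothetical mini-ontology: each candidate tag (URL) for a word lists
# the predicates that an entity of that kind is expected to have.
ONTOLOGY = {
    "bank": {
        "http://example.org/FinancialInstitution": {"lends", "employs", "holdsDeposit"},
        "http://example.org/RiverBank": {"borders", "erodes"},
    },
}


def resolve(word, predicates_in_text, ontology):
    """Pick the tag whose expected predicates best match the text.

    Returns None when the word is unknown or no predicate matches.
    """
    best_tag, best_score = None, 0
    for tag, expected in ontology.get(word, {}).items():
        score = len(expected & predicates_in_text)
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag


# Predicates found near the word in: "The bank lends money and employs 200 people."
tag = resolve("bank", {"lends", "employs"}, ONTOLOGY)
print(tag)
```

The ontology restricts the search to the tags it knows for a word, and the relations found in the text decide between them.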

Page 20: Understanding Big Data for policy professionals

• Information about data

• Traditionally, metadata were stored in the form of a data schema

• With schema-less storage, metadata should be stored separately

• Incremental discovery process requires Open World Assumption – you don’t know what other data are there.

• Reasoning to handle complexity

• Relationships as first-class citizens and the basis for classification

Page 21: Understanding Big Data for policy professionals

• Work in progress – not there yet

• Different data paradigms mandate different views

• SQL-like view of Big Data (Apache Pig etc)

• Excel import

• Analytic visualisation frontends

• 30+ JavaScript libraries

• Presume development

• Mahout & other libraries

• Writing code in Scala, Java, Python, Groovy

Page 22: Understanding Big Data for policy professionals

• Better picture of the current state

• What-if prediction

• Researching impact

• Increasing the number of categories, by several orders of magnitude if necessary

• Common, meaningful view of individual, organisation etc

• Prevention of undesirable effects on insights, complex events and prediction

Page 23: Understanding Big Data for policy professionals

Page 24: Understanding Big Data for policy professionals

• Description Logic, while using First-Order Predicate Logic terminology

• Reduced for practical purposes

• Is not necessary to be productive

• Can be applied to anything

• Class can be derived depending on values

• A State is a Class

• New triples can be derived from existing
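
The last three points can be illustrated with a tiny forward-chaining loop. This Python sketch is not a Description Logic reasoner: the facts, the two rules and the `derive` helper are hypothetical, and they only show the general idea of deriving a class from a value and a new triple from an existing one.

```python
# Hypothetical facts as (subject, predicate, object) triples.
FACTS = {
    ("alice", "age", 17),
    ("bob", "age", 42),
    ("bob", "parentOf", "alice"),
}


def derive(facts):
    """Apply two illustrative rules until no new triples appear."""
    facts = set(facts)
    while True:
        new = set()
        for s, p, o in facts:
            # Rule 1: a class (here also a state) derived from a value.
            if p == "age":
                new.add((s, "type", "Minor" if o < 18 else "Adult"))
            # Rule 2: a new triple derived from an existing one.
            if p == "parentOf":
                new.add((o, "childOf", s))
        if new <= facts:
            return facts
        facts |= new


DERIVED = derive(FACTS)
print(("alice", "type", "Minor") in DERIVED)
```

When Alice's age changes, re-running the rules moves her from `Minor` to `Adult` without anyone editing her class by hand, which is one way to read “a State is a Class”.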
