Big Data and High Performance Computing Solutions in the AWS Cloud
Michel Pereira, Enterprise Solutions Architect
May 27, 2014
Big Data HPC
Customer Success Story
Getting Started on AWS
What we’ll cover today…
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Big Data: Unconstrained data growth
Data volume scale: GB, TB, PB, EB, ZB
95% of the 1.2 zettabytes of data in the digital universe is unstructured
70% of this is user-generated content
Unstructured data growth is explosive, with estimates of compound annual growth rate (CAGR) at 62% from 2008 to 2012. Source: IDC
Lower cost, higher throughput
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Customer segmentation
Marketing spend optimization
Financial modeling & forecasting
Ad targeting & real time bidding
Clickstream analysis
Fraud detection
Use Cases
Visits, views, clicks, purchases
Source, device, location, time
Latency, throughput, uptime
Likes, shares, friends, follows
Price, frequency
Metrics
Relational
NoSQL
Web servers
Mobile phones
Tablets
3rd party feeds
Sources
Structured
Unstructured
Text
Binary
Near Real-time
Batched
Formats
Reporting
Dashboards
Sentiment
Clustering
Machine Learning
Optimization
Analysis
Lower cost, higher throughput
Highly constrained
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011
IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
Generated data
Available for analysis
Data volume
Elastic and highly scalable
+ No upfront capital expense
+ Only pay for what you use
+ Available on-demand
= Remove constraints
Accelerated
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Technologies and techniques for working productively with data, at any scale.
Big Data
Big data and AWS cloud computing
Big data: Variety, volume, and velocity requiring new tools
Cloud computing: Variety of compute, storage, and networking options
Big data: Potentially massive datasets
Cloud computing: Massive, virtually unlimited capacity
Big data: Iterative, experimental style of data manipulation and analysis
Cloud computing: Iterative, experimental style of infrastructure deployment/usage
Big data: Frequently not a steady-state workload; peaks and valleys
Cloud computing: At its most efficient with highly variable workloads
Big data: Absolute performance not as critical as “time to results”; shared resources are a bottleneck
Cloud computing: Parallel compute projects allow each workgroup to have more autonomy, get faster results
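The "peaks and valleys" point can be made concrete with a small sketch. The hourly demand profile below is hypothetical (it does not come from the talk), and the comparison is in node-hours only, with no pricing assumed.

```python
# Hypothetical demand profile: nodes needed in each hour of one day.
# 8 quiet hours, a 4-hour analysis burst, then 12 quiet hours.
hourly_demand = [2] * 8 + [40] * 4 + [2] * 12


def fixed_capacity_node_hours(demand):
    """Node-hours consumed when capacity is sized for the peak and kept all day."""
    return max(demand) * len(demand)


def elastic_node_hours(demand):
    """Node-hours consumed when capacity follows demand hour by hour."""
    return sum(demand)


fixed = fixed_capacity_node_hours(hourly_demand)
elastic = elastic_node_hours(hourly_demand)
utilization = elastic / fixed  # share of the fixed capacity actually used

print(f"fixed capacity: {fixed} node-hours")
print(f"elastic:        {elastic} node-hours")
print(f"utilization of fixed capacity: {utilization:.1%}")
```

The spikier the demand, the lower the utilization of peak-sized capacity, which is the case where paying only for what is used helps most.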
Lower costs
no capital investment
pay as you go
no subscriptions
only pay for what you use
Ease of use
programmable
zero admin
easy to configure
integrate with existing tools
One tool to rule them all
Use the right tools
Amazon S3
Amazon Kinesis
Amazon DynamoDB
Amazon Redshift
Amazon Elastic MapReduce
Store anything
Object storage
Scalable
99.999999999% durability
Amazon S3
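As a rough illustration of what eleven nines of durability means, the sketch below treats the figure as an annual, independent per-object survival probability. That is a simplifying assumption for the arithmetic, not a statement of how the service is engineered. The object count is hypothetical.

```python
from decimal import Decimal

# Durability figure from the slide, read as the chance that one object
# survives one year (simplifying assumption).
ANNUAL_DURABILITY = Decimal("0.99999999999")


def expected_annual_losses(object_count):
    """Expected number of objects lost per year under the assumption above."""
    loss_probability = Decimal(1) - ANNUAL_DURABILITY
    return Decimal(object_count) * loss_probability


stored_objects = 10_000_000  # hypothetical bucket size
losses_per_year = expected_annual_losses(stored_objects)
years_per_single_loss = Decimal(1) / losses_per_year

print(f"expected losses per year: {losses_per_year}")
print(f"years per single expected loss: {years_per_single_loss}")
```

Decimal arithmetic is used here because the loss probability is too small to subtract cleanly in binary floating point.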
Real-time processing
High throughput; elastic
Easy to use
Integrations: EMR, S3, Redshift, DynamoDB
Amazon Kinesis
NoSQL Database
Seamless scalability
Zero admin
Single digit millisecond latency
Amazon DynamoDB
Relational data warehouse
Massively parallel
Petabyte scale
Fully managed
$1,000/TB/Year
Amazon Redshift
Hadoop/HDFS clusters
Hive, Pig, Impala, HBase
Easy to use; fully managed
On-demand and spot pricing
Tight integration with S3, DynamoDB, and Kinesis
Amazon Elastic MapReduce
HDFS
Analytics languages
Data management
Amazon Redshift
Amazon EMR
Amazon RDS
Amazon S3
Amazon DynamoDB
Amazon Kinesis
Data sources
AWS Data Pipeline
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Amazon Glacier
S3
Amazon DynamoDB
Amazon RDS
Amazon Redshift
AWS Direct Connect
AWS Storage Gateway
AWS Import/Export
Amazon Kinesis
Amazon EMR
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Amazon EC2
Amazon EMR
Amazon Kinesis
Generation
Collection & storage
Analytics & computation
Collaboration & sharing
Amazon CloudFront
AWS CloudFormation
S3
Amazon DynamoDB
Amazon RDS
Amazon Redshift
Amazon EC2
Amazon EMR
AWS Data Pipeline
The right tools. At the right scale. At the right time.
AWS Customer Success Story
Sergio Mafra, IT Innovation Lead, ONS – Operador Nacional do Sistema Elétrico (National Electric System Operator)
• The Operador Nacional do Sistema Elétrico (ONS) is a private company responsible for planning and operating the generation and transmission of electric power in the Sistema Interligado Nacional (SIN), the National Interconnected System.
• With about 800 employees in 5 locations (Rio de Janeiro, Recife, Florianópolis, and Brasília), ONS is an information-intensive company that continuously runs mathematical models requiring HPC (High Performance Computing) and Big Data.
“Amazon Web Services lets us provision high-performance clusters in minutes, significantly reducing total processing time.”
“With that, we realized that AWS turns High Performance Computers into High Performance Customers.”
- Sérgio Mafra
The SIN serves 98% of Brazil's electricity consumption.
SIN - Brazilian Electric System
Isolated Systems (Legal Amazon): 2% of the market; predominantly thermal; more than 300 isolated localities
Predominantly hydroelectric model with large reservoirs and large interconnections.
The Challenge
• Give ONS a platform with greater processing capacity that shortens the solution time of the mathematical models, at a cost proportional to usage time, with easy management of the cluster environment, and that is transparent to the organization.
• Enable “time-to-market” for the IT department, giving it the knowledge and responsiveness to handle unexpected demands from other areas of the organization.
“Scotty, We Need More Power”
Benefits achieved
• A reduction of about 40% in the solution time of the electro-energy planning mathematical models, at 30% lower cost.
• The ability to analyze 5 strategies for using the Newave/Decomp models in record time (1 week), running 600 cases. On-premises, this would have taken 3 weeks, incompatible with the commitment agreed with the MME.
Virtual Private Cloud
Diagram labels: Work, Controller, Internet/AWS
Address ranges shown: 10.24.0.0/24, 10.24.1.0/24, 10.21.0.0/16
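The address ranges in this diagram can be sanity-checked with Python's standard `ipaddress` module. The sketch below only reports the size of each range and whether any two overlap. Which range belongs to which side of the diagram is not stated in the slide, so nothing is assumed about that here.

```python
import ipaddress
from itertools import combinations

# CIDR blocks copied from the diagram.
networks = [
    ipaddress.ip_network("10.24.0.0/24"),
    ipaddress.ip_network("10.24.1.0/24"),
    ipaddress.ip_network("10.21.0.0/16"),
]

# Total addresses in each block (before any provider-reserved addresses).
sizes = {str(net): net.num_addresses for net in networks}

# Overlapping blocks cannot be routed to each other unambiguously.
overlapping_pairs = [
    (str(a), str(b)) for a, b in combinations(networks, 2) if a.overlaps(b)
]

for cidr, count in sizes.items():
    print(f"{cidr}: {count} addresses")
print("overlapping pairs:", overlapping_pairs)
```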
Benefits achieved
• “Wow... from 40 minutes to 4 minutes!”
• “Now I will use all the calculation parameters to get a more complete study”
• “Bring on a 4 x 80 right now!”
• “Thank you for letting me leave 2 hours early. All the cases have already run”
• “We ran the study in 2 minutes. The system can go operational and will become an international success story”
Synchronized Phasor Measurement System (Sistema de Medição Sincronizada de Fasores, SMSF)
PDC
Annual SMSF storage
2013: 8.5 TB
2015: 70 TB
2018: 120 TB
2022: 312 TB
Big Data
Data
Estimated collection for only 7 measured quantities
Total storage volume of the Rio data center in 2013
Historical
1 Tb
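To put the storage figures for 2013 through 2022 in perspective, the sketch below computes the compound annual growth rate they imply. It assumes the four values are comparable annual storage figures in TB, as the slide title suggests. The function name is illustrative.

```python
# Annual SMSF storage figures from the slide, in TB.
storage_tb = {2013: 8.5, 2015: 70, 2018: 120, 2022: 312}


def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1


ordered_years = sorted(storage_tb)
first, last = ordered_years[0], ordered_years[-1]
overall_growth = cagr(storage_tb[first], storage_tb[last], last - first)
print(f"{first}-{last}: {overall_growth:.1%} per year")

# Growth rate for each interval between consecutive data points.
interval_growth = {}
for start, end in zip(ordered_years, ordered_years[1:]):
    interval_growth[(start, end)] = cagr(
        storage_tb[start], storage_tb[end], end - start
    )
    print(f"{start}-{end}: {interval_growth[(start, end)]:.1%} per year")
```

The per-interval rates show that the growth is front-loaded: the jump between the first two data points is much steeper than the later intervals.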
Architecture (under study)
Diagram components: PMUs, OpenPDC, Collector, Controller, Processing, Analytics
Hadoop cluster: Master; Node 1, Node 2, Node 3, Node N (HDFS on each)
Storage: S3, Storer (Armazenador), Glacier, Historian (Historiador)
Solution Architects
Professional Services
Premium Support
AWS Partner Network (APN)
AWS is here to help
AWS Architecture Diagrams
https://aws.amazon.com/architecture/
Processing large amounts of parallel data using a scalable cluster
Use commonly available cluster scheduling tools, such as Grid Engine or Condor
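The pattern a cluster scheduler applies to this kind of workload is scatter and gather: independent cases are handed to whichever worker is free, and results are collected at the end. The sketch below imitates that pattern on a single machine with a thread pool from Python's standard library. It is a local stand-in, not Grid Engine or Condor, and `run_case` is a hypothetical placeholder for one model run.

```python
from concurrent.futures import ThreadPoolExecutor


def run_case(case_id):
    """Hypothetical placeholder for one independent model run."""
    result = sum(i * i for i in range(case_id * 1000))
    return case_id, result


def run_all(case_ids, workers=4):
    """Scatter the cases across workers and gather results keyed by case id."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(run_case, case_ids))


results = run_all(range(1, 9))
for case_id in sorted(results):
    print(f"case {case_id}: {results[case_id]}")
```

Because the cases share no state, adding workers (or, on a real cluster, nodes) shortens the wall-clock time without changing any individual result.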