Streaming Analytics Pipeline
AWS Implementation Guide

Chris Rec

December 2016

Copyright (c) 2016 by Amazon.com, Inc. or its affiliates. Streaming Analytics Pipeline is licensed under the terms of the Amazon Software License available at https://aws.amazon.com/asl/
Amazon Web Services – Streaming Analytics Pipeline on the AWS Cloud December 2016
Page 4 of 20
solution uses Amazon Kinesis Streams to load streaming data, Amazon Kinesis Analytics to
filter and process that data, and Amazon Kinesis Firehose to deliver the data to various data
stores for search, storage, or further analytics.
Cost

You are responsible for the cost of the AWS services used while running the Streaming
Analytics Pipeline. The total cost of this solution depends on the amount of data you stream
through the Streaming Analytics Pipeline. As of the date of publication, the cost of running
this solution with the default settings in the US East (N. Virginia) Region is approximately
$1.38 per hour.1 Prices are subject to change. For full details, see the pricing webpage for
each AWS service you will be using in this solution.
We recommend adjusting your AWS Lambda and Amazon Kinesis Firehose batch
configurations as your record count and data size increase to manage costs.
Architecture Overview

Deploying this solution with the default parameters builds the following environment in
the AWS Cloud.
Figure 1: Streaming Analytics Pipeline default architecture on AWS
By default, the AWS CloudFormation template creates a new Amazon Kinesis stream with
two shards, an Amazon Kinesis Firehose delivery stream that encrypts data with AWS Key
Management Service, an Amazon Simple Storage Service (Amazon S3) bucket to store raw
and analyzed data, and an AWS Identity and Access Management (IAM) role with least-privilege access permissions. The template also launches an AWS Lambda custom resource that creates an Amazon Kinesis Analytics application based on settings you specify in a YAML configuration file. For more information, see Appendix B. The application consumes records from the source Amazon Kinesis stream and puts records into the Amazon Kinesis Firehose delivery stream.

1 The cost estimate assumes the solution will stream 1,000 records per second with an average size of three kilobytes per record, and that the external destination is an Amazon Simple Storage Service (Amazon S3) bucket.
Note: If you do not specify a YAML configuration file, the Amazon Kinesis Analytics application will require further modification through the AWS Management Console and/or the service API to efficiently analyze your data.
If you choose to persist raw data, an AWS Lambda function is deployed. The Lambda
function gets raw records from the source Amazon Kinesis stream, decodes the Base64-
encoded data, batches the records, and puts them into another Amazon Kinesis Firehose
delivery stream for delivery to Amazon S3.
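The raw-data Lambda function is deployed as part of the template; the flow it performs (decode Base64 payloads, batch records, deliver to Firehose) might be sketched as follows. The delivery stream name is illustrative, and the 500-record cap matches Firehose's PutRecordBatch per-call limit:

```python
import base64

BATCH_LIMIT = 500  # Firehose PutRecordBatch accepts at most 500 records per call

def decode_records(event):
    """Decode the Base64-encoded payloads of a Kinesis event."""
    return [base64.b64decode(r["kinesis"]["data"]) for r in event.get("Records", [])]

def batch(records, size=BATCH_LIMIT):
    """Split records into Firehose-sized batches."""
    return [records[i:i + size] for i in range(0, len(records), size)]

def handler(event, context):
    # boto3 is available in the Lambda runtime; imported lazily here so the
    # pure helpers above can be used without it.
    import boto3
    firehose = boto3.client("firehose")
    delivered = 0
    for chunk in batch(decode_records(event)):
        firehose.put_record_batch(
            DeliveryStreamName="raw-data-delivery-stream",  # hypothetical name
            Records=[{"Data": data} for data in chunk],
        )
        delivered += len(chunk)
    return {"delivered": delivered}
```

This is a sketch of the described behavior, not the solution's actual source; the production function also honors the configurable timeout and batch-size settings mentioned below.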
The Streaming Analytics Pipeline can be customized to fit your needs. When you deploy the
solution, you can specify an existing Amazon Kinesis stream, a configuration for an Amazon
Kinesis Analytics application, whether or not to encrypt the data, and whether or not to
persist raw data from your source Amazon Kinesis stream to Amazon S3. You can also
choose from four destinations for your analyzed data: an Amazon S3 bucket (default), a pre-
configured Amazon Redshift cluster, a pre-configured Amazon Elasticsearch Service
domain, or an existing Amazon Kinesis stream.
Figure 2: Streaming Analytics Pipeline architecture on AWS
Design Considerations
Regional Deployments

The Streaming Analytics Pipeline uses AWS Lambda and Amazon Kinesis Analytics.
Therefore, you must deploy this solution in an AWS Region that supports both Lambda and
Amazon Kinesis Analytics. As of the date of publication, this includes the US East (N.
Virginia) Region, the US West (Oregon) Region, and the EU (Ireland) Region.
Streaming Data Format

Amazon Kinesis Analytics allows you to specify a schema to classify your streaming data
before it executes SQL queries against your input Amazon Kinesis stream. If you specify a
strict schema for all records, the analysis could fail if some records do not match the
expected format specified in the schema. For this solution, consider applying a flexible
schema to your streaming data to ensure all data is collected. Then, refine the schema using
standard SQL.
Shard Count

The number of shards you need for a new Amazon Kinesis stream depends on the amount
of streaming data you plan to produce. Each shard can support up to 1,000 records per
second for writes, up to a maximum total data write rate of 1 MB per second (including
partition keys). For example, an application that produces 100 records per second with a
size of 35 kilobytes per record for a total data input rate of 3.4 megabytes per second needs
4 shards.
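The sizing rule above can be expressed as a small helper (the two per-shard write limits are the ones just quoted; the helper is illustrative, not part of the solution):

```python
import math

def required_shards(records_per_second, record_size_kb):
    """Minimum shard count for a Kinesis stream, given the two per-shard
    write limits: 1,000 records per second and 1 MB per second."""
    by_records = math.ceil(records_per_second / 1000)
    by_throughput = math.ceil(records_per_second * record_size_kb / 1024)  # KB -> MB
    return max(by_records, by_throughput, 1)

# The guide's example: 100 records/second at 35 KB each is about 3.4 MB/s,
# so the throughput limit, not the record-count limit, dictates 4 shards.
print(required_shards(100, 35))  # -> 4
```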
The Streaming Analytics Pipeline AWS Lambda function processes data at a default rate of
1,000 records per second. But, you can adjust the timeout and batch size to accommodate
faster processing and delivery of raw data.
While there is no upper limit to the number of shards in a stream or account, each region
has a default shard limit. For information on shard limits, please visit Amazon Kinesis
Streams Limits. To request an increase in your shard limit, please use the Stream Limits
form.
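One way to check remaining shard capacity from a script is the Kinesis DescribeLimits API. A brief sketch:

```python
def headroom(limits):
    """Remaining shard capacity from a DescribeLimits response."""
    return limits["ShardLimit"] - limits["OpenShardCount"]

def shard_headroom(region="us-east-1"):
    """Query Kinesis for this account's shard headroom in the given region.
    boto3 is imported lazily so headroom() is usable without it."""
    import boto3
    kinesis = boto3.client("kinesis", region_name=region)
    return headroom(kinesis.describe_limits())
```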
Multiple External Destinations

Amazon Kinesis Analytics allows users to specify up to three external destinations for
analyzed data. By default, the Streaming Analytics Pipeline allows users to specify a single
external destination for their analyzed data. For customers who want to send analyzed data
to multiple external destinations, this solution includes a template (add-output) to allow
accessible with a public IP address. The cluster’s Amazon Elastic Compute Cloud (Amazon
EC2) security group should allow access from the AWS Region’s Amazon Kinesis Firehose
IP addresses:
US East (N. Virginia) Region: 52.70.63.192/27
US West (Oregon) Region: 52.89.255.224/27
EU (Ireland) Region: 52.19.239.192/27
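If you script this step, the ingress rule can be added with the EC2 API. A sketch, assuming the cluster listens on Redshift's default port 5439; the region-to-CIDR mapping is an assumption you should verify against the current Amazon Kinesis Firehose documentation:

```python
# Firehose CIDR blocks for the regions this solution supports (verify
# against the current Firehose documentation before use).
FIREHOSE_CIDR = {
    "us-east-1": "52.70.63.192/27",
    "us-west-2": "52.89.255.224/27",
    "eu-west-1": "52.19.239.192/27",
}

def ingress_rule(region, port=5439):
    """Build the IpPermissions entry for authorize_security_group_ingress.
    5439 is Amazon Redshift's default port."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{"CidrIp": FIREHOSE_CIDR[region]}],
    }

def allow_firehose(security_group_id, region):
    """Open the cluster's EC2 security group to Firehose's CIDR block."""
    import boto3  # imported lazily so ingress_rule() is usable without boto3
    ec2 = boto3.client("ec2", region_name=region)
    ec2.authorize_security_group_ingress(
        GroupId=security_group_id,
        IpPermissions=[ingress_rule(region)],
    )
```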
Amazon Elasticsearch Service

Your Amazon Elasticsearch Service domain
should have an existing index and type to which data can be assigned. We also recommend
you create and map your fields to the appropriate data type before you start the Amazon
Kinesis Analytics application to ensure that the solution assigns your data to the right type.
If you do not map the data types before you deploy the Streaming Analytics Pipeline, the
solution will create data types for you. But, these data types may not be the types you want.
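A minimal way to pre-create the index and typed mapping is a single PUT request. This sketch assumes a domain access policy that accepts unsigned requests from your IP (IAM-restricted domains require SigV4-signed requests instead); the field names are illustrative:

```python
import json
import urllib.request

def mapping_body(doc_type, properties):
    """Index-creation body with an explicit mapping for one type."""
    return {"mappings": {doc_type: {"properties": properties}}}

def create_index(endpoint, index, doc_type, properties):
    """PUT the index to the domain endpoint, e.g. search-...es.amazonaws.com.
    Assumes the domain's access policy permits unsigned requests."""
    req = urllib.request.Request(
        "https://%s/%s" % (endpoint, index),
        data=json.dumps(mapping_body(doc_type, properties)).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    return urllib.request.urlopen(req)

# Illustrative mapping: type your fields before any data arrives so the
# solution does not guess the types for you.
EXAMPLE_PROPERTIES = {
    "event_time": {"type": "date"},
    "metric_value": {"type": "double"},
}
```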
What We’ll Cover The procedure for deploying this architecture on AWS consists of the following steps. For
detailed instructions, follow the links for each step.
Step 1. Launch the stack
Launch the AWS CloudFormation template into your AWS account.
Enter values for required parameters.
Review the other template parameters, and adjust if necessary.
Step 2. Validate and Start the Application
Verify that the schema and application code are correct.
Start the application.
Step 3. Start Streaming Data
Start streaming data to the source Amazon Kinesis stream.
View results in your external destination.
Step 1. Launch the Stack

This automated AWS CloudFormation template deploys the Streaming Analytics Pipeline on
the AWS Cloud. Please make sure that you’ve configured your Amazon Redshift cluster or
Amazon Elasticsearch Service domain before launching the stack, if you chose one of those
as your destination.
Note: You are responsible for the cost of the AWS services used while running this solution. See the Cost section for more details. For full details, see the pricing webpage for each AWS service you will be using in this solution.
1. Log in to the AWS Management Console and click the button to
the right to launch the streaming-analytics-pipeline AWS
CloudFormation template.
You can also download the template as a starting point for your own implementation.
2. The template is launched in the US East (N. Virginia) Region by default. To launch this
solution in a different AWS Region, use the region selector in the console navigation bar.
Note: This solution uses AWS Lambda and Amazon Kinesis Analytics, which are currently available in the US East (N. Virginia) Region, the US West (Oregon) Region, and the EU (Ireland) Region. Therefore, you must launch this solution in one of those regions.2
3. On the Select Template page, verify that you selected the correct template and choose
Next.
4. On the Specify Details page, assign a name to your Streaming Analytics Pipeline
solution stack.
5. Under Parameters, review the parameters for the template and modify them as
necessary. This solution uses the following default values.
Parameter | Default | Description
New or Existing Stream | New Kinesis Stream | The source Amazon Kinesis stream. Create a new stream or choose an existing stream.
New Stream Shard Count | <Requires input> | The number of shards to allot to your new stream. Note: If you use an existing stream, leave this parameter blank.
Existing Stream Name | <Requires input> | The name of an existing stream in the same AWS Region where you launch the solution. Note: If you use a new stream, leave this parameter blank.
External Destination | Amazon S3 | The destination for your analyzed data. Select Amazon S3 (default), Amazon Redshift, Amazon Elasticsearch Service, or Amazon Kinesis stream. Note: If you choose Amazon Redshift, Amazon Elasticsearch Service, or Amazon Kinesis stream, you must configure the destination. See Appendix A for steps to configure the destination.
Configuration File Location | <Requires input> | The Amazon S3 bucket and key where the completed YAML configuration file is stored. For example, <bucket-name>/<key>. For information about the YAML file configuration, see Appendix B.
Encrypt Data at Rest? | Yes | Specify whether the solution will create an AWS KMS encryption key and encrypt raw and analyzed data in Amazon S3.
Persist Raw Source Data? | Yes | Specify whether the solution will persist raw streaming data from your source Amazon Kinesis stream to Amazon S3.
Destination Prefix | AggregateData | The prefix that will be created in the Amazon S3 bucket. Note: Use this parameter only if you choose the default option (Amazon S3) as your destination.
Buffer Interval | 300 | The number of seconds (60-900) that Amazon Kinesis Firehose should buffer data before loading it to Amazon S3.
Buffer Size | 5 | The size of data in MB (1-128) that Amazon Kinesis Firehose should buffer before loading it to Amazon S3.
Send Anonymous Usage Data | Yes | Send anonymous data to AWS to help us understand usage across our customer base as a whole. To opt out of this feature, choose No. For more information, see Appendix C.

2 For the most current Lambda and Amazon Kinesis Analytics availability by region, see the AWS service offerings by region.
6. Verify that you modified the correct parameters for your chosen destination.
7. Click Next.
8. On the Options page, choose Next.
9. On the Review page, review and confirm the settings. Be sure to check the box
acknowledging that the template will create IAM resources.
10. Click Create to deploy the stack.
You can view the status of the stack in the AWS CloudFormation Console in the Status
column. You should see a status of CREATE_COMPLETE in roughly five (5) minutes.
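If you prefer to poll from a script rather than the console, CloudFormation's built-in waiter covers this. A sketch (the status names are the standard CloudFormation ones):

```python
TERMINAL = ("CREATE_COMPLETE", "CREATE_FAILED", "ROLLBACK_COMPLETE")

def is_done(status):
    """True once stack creation has finished, successfully or not."""
    return status in TERMINAL

def wait_for_stack(stack_name, region="us-east-1"):
    """Block until the stack finishes creating, then return its status.
    boto3 is imported lazily so is_done() is usable without it."""
    import boto3
    cfn = boto3.client("cloudformation", region_name=region)
    cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)
    return cfn.describe_stacks(StackName=stack_name)["Stacks"][0]["StackStatus"]
```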
Step 2. Validate and Start the Application

Once the stack is created, complete the following steps.
1. Navigate to the stack Outputs tab.
2. Note the name of the Amazon Kinesis Analytics application.
3. Navigate to the Amazon Kinesis Analytics console.
4. Select the name of your Analytics application and choose Application Details.
Figure 3: Example Amazon Kinesis Analytics application details
5. To view your data schema, select the pencil icon next to the source Amazon Kinesis
stream, and scroll to the bottom of the page.
6. Under Real-time analytics, choose Go to SQL editor.
7. When asked if you want to start the application, select No, I’ll do this later.
8. Review your SQL code and edit as necessary. Then, choose Save and run SQL.
Your Amazon Kinesis Analytics application will change to a Starting state. Your application will start after 30-90 seconds.
Step 3. Start Streaming Data

Once you start your Amazon Kinesis Analytics application, configure your streaming data producers to send streaming records to your source Amazon Kinesis stream. For more information on how to configure streaming data producers, please visit Writing Data to Amazon Kinesis Streams.
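A minimal producer sketch using the Kinesis PutRecord API; the two-column CSV payload is illustrative and must match whatever input schema your Analytics application expects:

```python
import time
import uuid

def make_record(metric_value):
    """A sample CSV-style payload (timestamp, value); adjust the columns to
    match your application's input schema."""
    return ("%d,%s\n" % (int(time.time()), metric_value)).encode()

def stream_records(stream_name, values, region="us-east-1"):
    """Send one record per value to the source Kinesis stream.
    boto3 is imported lazily so make_record() is usable without it."""
    import boto3
    kinesis = boto3.client("kinesis", region_name=region)
    for v in values:
        kinesis.put_record(
            StreamName=stream_name,
            Data=make_record(v),
            PartitionKey=str(uuid.uuid4()),  # random key spreads writes across shards
        )
```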
Note: To test this solution with sample data, you can use the Amazon Kinesis Data Producer. The data producer generates records using random data based on a template you provide.
As data flows through the Amazon Kinesis stream, the Amazon Kinesis Analytics application automatically processes it, and Amazon Kinesis Firehose delivers the data to the specified external destination.
Once the Amazon Kinesis Firehose buffer size or buffer interval threshold is reached, analyzed data is delivered to the destination. If you chose to persist raw streaming data to Amazon Simple Storage Service (Amazon S3), you will also see Base64-decoded record data in the solution's Amazon S3 bucket under the prefix rawStreamData.
Security

When you build systems on AWS infrastructure, security responsibilities are shared between
you and AWS. This shared model can reduce your operational burden as AWS operates,
manages, and controls the components from the host operating system and virtualization
layer down to the physical security of the facilities in which the services operate. For more
information about security on AWS, visit the AWS Security Center.
Security Groups

The Streaming Analytics Pipeline does not create any security groups. However, we
recommend that you follow best practices for least-privilege access when creating access
rules for associated resources. If you selected an existing Amazon Redshift cluster as your
external destination, and your cluster is in an Amazon VPC with a publicly available IP
address, you must open the Amazon Redshift security group to the Amazon Kinesis Firehose
CIDR block for your AWS Region. For more information, see Prerequisites.
IAM Roles

AWS Identity and Access Management (IAM) roles enable customers to assign granular
access policies and permissions to services and users on the AWS Cloud. Depending on
your configuration, the Streaming Analytics Pipeline creates between two and five IAM
roles. The solution creates the following roles:
A role with granular access policies for each Amazon Kinesis Firehose delivery stream
that the solution creates. The policies allow the Amazon Kinesis Firehose delivery
streams to log their events, get a particular AWS Key Management Service encryption
key to encrypt data in a specific Amazon S3 prefix, and send streaming events to a
Appendix A: Modify Destination Parameters

If you select Amazon Redshift, Amazon Elasticsearch Service, or an Amazon Kinesis stream as the destination for your analyzed data, you must modify the parameters for your selected destination.
For Amazon Redshift, modify the parameters in the following table.
Parameter | Default | Description
Master User Name | <Requires input> | Username of the user with permissions to edit the specified table in the Amazon Redshift cluster.
Master User Password | <Requires input> | Password of the user with permissions to edit the specified table in the Amazon Redshift cluster.
JDBC URL | <Requires input> | The JDBC URL of the Amazon Redshift cluster. You can obtain this from the Amazon Redshift console. The URL has the following format: jdbc:redshift://endpoint:port/database
Table Name | <Requires input> | The name of an existing, preconfigured table in the specified Amazon Redshift cluster, to which the results of the Amazon Kinesis Analytics application will be loaded.
Column Pattern | <Requires input> | By default, Amazon Kinesis Firehose will copy records to Amazon Redshift in the same order they leave the Amazon Kinesis Analytics application. If you wish to change the order or enter analyzed data into certain columns, provide a comma-separated list of the column names in the desired order. For example, column1, column2, column3, column4.
Buffer Interval | 300 | The number of seconds (60-900) that Amazon Kinesis Firehose should buffer data before loading it to Amazon Redshift.
Buffer Size | 5 | The size of data in MB (1-128) that Amazon Kinesis Firehose should buffer before loading it to Amazon Redshift.
For Amazon Elasticsearch Service, modify the parameters in the following table.
Parameter | Default | Description
Domain Name | <Requires input> | The name of the Amazon Elasticsearch Service domain. You must deploy the solution in the same AWS Region as the domain.
Index Name | <Requires input> | The name of the index for analyzed data.
Type Name | <Requires input> | The name of the type for analyzed data. We recommend that you create the type before you start the Amazon Kinesis Analytics application.
Index Rotation | NoRotation | The frequency at which the specified Amazon Elasticsearch Service index rotates.
Buffer Interval | 300 | The number of seconds (60-900) that Amazon Kinesis Firehose should buffer data before loading it to Amazon Elasticsearch Service.
Buffer Size | 5 | The size of data in MB (1-128) that Amazon Kinesis Firehose should buffer before loading it to Amazon Elasticsearch Service.
For an Amazon Kinesis stream, modify the Destination Stream Name parameter: the name of the existing stream that will receive your analyzed data.
Appendix B: YAML File Configuration

The Streaming Analytics Pipeline includes a YAML file that contains configuration
information for the Amazon Kinesis Analytics application that the solution creates. Review
the parameters in the YAML file and modify them as necessary for your implementation.
Then, upload the file to an Amazon S3 bucket.
streaming-analytics-pipeline-config.yaml: Use this file to
specify your Amazon Kinesis Analytics application configuration.
Parameter | Default | Description
Input Format Type | CSV | The format of the records of the source stream. Choose CSV or JSON.
Record Column Delimiter | "," | The column delimiter of CSV-formatted data from the source stream. For example, "|" or ",". Note: Leave this parameter blank if you chose JSON as your Input Format Type.
Record Row Delimiter | "\n" | The row delimiter of CSV-formatted data from the source stream. For example, "\n". Note: Leave this parameter blank if you chose JSON as your Input Format Type.
Record Row Path | "$" | The path to the top-level parent that contains the records. Note: Leave this parameter blank if you chose CSV as your Input Format Type.
Output Format Type | CSV | The format of the analyzed data that is put in the output