
Self-Service Data-Preparation with Trifacta for Amazon SageMaker

In a recent blog covering the use of SageMaker's unique modeling algorithms such as DeepAR, as well as more traditional ones such as Autoregressive Integrated Moving Average (ARIMA) or Exponential Smoothing (ES) for forecasting, it struck me that all these algorithms expect well-structured, clean data to deliver the most accurate predictions. However, forecast modeling depends on numerous data sets, such as inventory data, promotions, past orders, products, or even weather data and product ratings. These originate from internal systems, but more often than not from various parties (retailers, distributors, brokers, manufacturers, CPG, public data, social media, etc.), each with its own proprietary formats, standards, and a very personal perspective on what data quality is. To structure, clean, and combine these disparate data sets into a consistent matrix for modeling, most data scientists have to spend a massive amount of their time preparing the data.

Data Preparation for Machine Learning Data scientists should focus on finding and tuning the best models instead of spending too much time on the janitorial work of cleaning the data. This is the reason self-service data preparation is gaining increasing adoption in the ML/AI world; this blog covers the typical work one has to go through. Trifacta is a recognized leader in self-service data preparation, is an AWS ML Competency partner, and is available on the AWS Marketplace. Self-service data preparation consists of a visual approach to structuring, cleaning, combining, and enriching disparate data at scale (leveraging the elastic power of AWS), and it achieves this far faster than traditional approaches such as coding. Let's take a closer look at preparing data for SageMaker using Trifacta.

The 4 Steps to Prepare Data for SageMaker Data Preparation involves 4 major steps to transform raw data into a refined asset ready for modeling in Amazon SageMaker.


STEP 1: STRUCTURING THE DATA The data can come from internal systems, where it will mostly be files, applications, and databases. However, when it has to be combined with retailer or distributor data, it may come in various shapes (hierarchical files, JSON, XML), often as exports of a report and, of course, Excel. Any data like this must first be put into a readable form, standardized into columns and rows, so it can be assessed. Trifacta automatically recognizes data types and organizes them into a familiar grid interface.

STEP 2: ASSESSING DATA ACCURACY Once data has been reformatted a bit, it becomes easier to assess its overall quality. Are fields well formed? Maybe there are mismatched values such as an invalid zip code, or units of measure may be inconsistent. You might even see value distribution anomalies, a form of unexpected data value (such as percentages over 100 for fractions, or a weight far above the average). Identifying these data flaws and reviewing for accuracy is critical to effective data modeling, and it's at the core of Trifacta's architecture. Every time you open a data set or derive a new value from existing data, Trifacta automatically and dynamically profiles your data, assesses its accuracy, and then displays a health check bar with information for each column.
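Outside of Trifacta, a similar accuracy check can of course be scripted by hand. Here is a minimal pandas sketch that flags the kinds of issues described above; the file name and the zip_code and discount_pct columns are hypothetical, purely for illustration.

import pandas as pd

# Hypothetical input file and column names, purely for illustration
df = pd.read_csv('orders.csv', dtype=str)

# Mismatched values: US zip codes should be exactly 5 digits
invalid_zip = ~df['zip_code'].str.fullmatch(r'\d{5}', na=False)

# Distribution anomalies: a fraction expressed as a percentage should stay within 0-100
pct = pd.to_numeric(df['discount_pct'], errors='coerce')
out_of_range = (pct < 0) | (pct > 100)

# A simple health check: share of missing, mismatched, and anomalous rows
print(df['zip_code'].isna().mean(), invalid_zip.mean(), out_of_range.mean())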

STEP 3: ADDRESSING DATA INCONSISTENCY Following the assessment and discovery of the data, the problems can be addressed. Starting with simple missing and mismatched attributes as well as anomalies and outliers, the work then moves on to data formats, standardization, and enrichment to make the data consistent for the model. Sometimes cross-reference and lookup tables or conversion formulas are useful, depending on an organization's needs. Lastly, duplicate and inconsistent records are also addressed.

STEP 4: BUILDING A CONSOLIDATED VIEW Once the data is clean, Trifacta provides easy ways to join datasets, pivot and unpivot data, aggregate values, and calculate key indicators and predictors. Because modeling implies sampling out of very large datasets, Trifacta offers native sampling techniques to make sure data flows are comprehensively identified. If all goes well, the data preparation recipes can run at scale to provide structured data to SageMaker (via S3 files). If there's a problem, the process iterates through the various steps to tune the data until it is finally ready. While these 4 steps are described separately, they are not linear; they are iterative and interrelated by nature. As the data is structured, it is assessed, then cleaned to generate a new data set that is assessed again, and so on.


Better Product Forecasting With Trifacta and SageMaker In order to demonstrate data preparation for SageMaker, let's take the scenario of a retailer implementing a predictive model to identify the type of goods and quantities to deliver to its stores based on extreme weather conditions. The predictive model will have to be trained on historical data such as past transactions, store locations, and weather data points. This particular company maintains its list of stores in an Excel spreadsheet (no surprise here!), manages orders in a business application backed by a relational database, and subscribes to a weather service that delivers JSON files in S3 with numerous past and future weather data points. Let's follow the 4 steps presented above.

STEP 1: STRUCTURING THE DATA: The weather data is provided in daily batches of large JSON files with historical weather information for the past day and forecasts for the next few days and weeks. The files are stored in an S3 bucket. With Trifacta, we can browse the S3 buckets and select the weather file we want to wrangle, or a whole folder at once if the files share the same structure.


We will now also connect to the tables containing our store transaction data, a copy of which is stored in Amazon Redshift.

And finally, we also import the Excel file that contains the list of stores across the US. Let's start wrangling now so we can supply comprehensive, clean data to SageMaker to predict our product forecast based on weather! As you can see in this preview, the weather file is structured in JSON, which is not very user friendly, especially since we have multiple levels of nested structures.


As soon as we open the file to wrangle it, Trifacta automatically recognizes the JSON format and starts to unnest it to present it in columnar format (see the recipe on the right with some transformation steps? We will come back to it later). The other thing that Trifacta does is infer the column types and run a series of statistics on them to inform the user about value distribution and an overall data quality score. In this case everything looks green: the data seems to be in good shape.


We now select the 3 bars that represent temperature, CloudCover, and PrecipType in the weather_summary JSON column, and Trifacta suggests unnesting the JSON structure to create 3 columns out of it. Trifacta always shows a preview (in yellow) of the data so one can validate that the transformation step is actually what we expect. We can continue in this way to completely flatten the JSON file so these features are available for our model.
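For readers who prefer code, the same flattening step can be approximated in pandas. This is only a sketch: the file name and the exact field layout of the weather feed are assumptions, not the actual feed format.

import json
import pandas as pd

# Hypothetical daily weather file with nested weather_summary records
with open('weather_2018-09-27.json') as f:
    records = json.load(f)

# Flatten nested structures into flat columns such as
# weather_summary.temperature, weather_summary.cloudCover, weather_summary.precipType
weather = pd.json_normalize(records)
print(weather.columns.tolist())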


STEP 2 and 3: ASSESSING DATA ACCURACY and ADDRESSING DATA INCONSISTENCY: While the JSON file structure was decent, we may have some surprises when dealing with Excel files or supplier and distributor data. Opening the ACME_STORE.xlsx file, we can see that some extra structuring, standardization, and cleansing might be needed.

For example, we can see that there are some invalid values for the Staff column as well as missing values. This column is important because it gives us an idea of the size of the store, so it could be a predictor for our SageMaker model. Better to have it clean.


By clicking the red square in the quality bar, we can preview the invalid values along with a suggestion to delete them. In this case, we accept the suggestion (although ideally we should reach out to the owner of the Excel file and ask them to correct it). Either way, invalid values will not do our model any good, so let's remove this noise by accepting the suggestion and also deleting the missing values.
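For reference, the equivalent cleanup (dropping rows whose Staff value is missing or not a valid number) could be sketched in pandas as follows; the file name matches the example above, the rest is an assumption.

import pandas as pd

stores = pd.read_excel('ACME_STORE.xlsx')

# Coerce Staff to a number; invalid entries become NaN, then drop those rows
stores['Staff'] = pd.to_numeric(stores['Staff'], errors='coerce')
stores = stores.dropna(subset=['Staff'])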

Next, we will need to standardize the store hours into discrete values. By brushing over the ',' characters in the Store_hours column, we get a suggestion to split the column in 3, which is perfect and far simpler than writing a regular expression.

Trifacta supports regular expressions, but it also provides a pattern language to make things easier, such as in this other suggestion to split on a dayofweek-abbrev.
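A rough pandas equivalent of that split suggestion might look like this, assuming every Store_hours value holds three comma-separated parts; the new column names are made up for illustration.

import pandas as pd

stores = pd.read_excel('ACME_STORE.xlsx')

# Split the comma-separated Store_hours column into 3 new (illustratively named) columns
stores[['Hours_Weekday', 'Hours_Sat', 'Hours_Sun']] = stores['Store_hours'].str.split(',', n=2, expand=True)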

By proceeding the same way, we can create a reusable and readable recipe of steps that will structure and clean the data to feed SageMaker. Here is an example of a recipe for this Excel file that could be edited and enriched to reach the level of quality required.


Each individual dataset has to be assessed, structured, and standardized to provide as many features as needed. Trifacta provides hundreds of functions to manipulate the data, such as pivoting, unpivoting, aggregating, calculating, windowing, date pattern standardization, and many more, which can be used visually or scripted in the natural Wrangler language.

STEP 4: BUILDING A CONSOLIDATED VIEW Training an ML model requires creating a large flattened file with all the attributes and features as individual columns, hence the need to combine these datasets together. Trifacta provides a join wizard, lookup, and union functions to combine the data into a consistent and consolidated view. Similar to suggesting transformations, Trifacta will suggest the best columns to match when combining the datasets. In this case, the suggestion for joining the weather file and the Excel store file is to map them based on the latitude and longitude columns.
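As a rough pandas analogue of that join suggestion (file and column names are illustrative only, not the actual recipe):

import pandas as pd

weather = pd.read_csv('weather_flat.csv')  # flattened weather data from Step 1
stores = pd.read_csv('stores_clean.csv')   # cleaned store list from Steps 2 and 3

# Join the two datasets on latitude/longitude, keeping every store row
combined = stores.merge(weather, on=['latitude', 'longitude'], how='left')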

Proceeding the same way, we can combine the Redshift data to define a master file with all the attributes. Trifacta provides a logical flow view that showcases how the data is combined.


Before supplying the data to SageMaker, we want to validate that there are no data inconsistencies remaining in the final dataset. Trifacta runs a full profile on the data and highlights possible errors. In this case, we have invalid values in the QTY column. We can revisit the recipe and figure out how we can eliminate this issue.
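The same kind of final check can also be scripted; here is a quick sketch, assuming the consolidated data sits in a CSV file and that QTY is expected to be a non-negative number.

import pandas as pd

master = pd.read_csv('master.csv')  # hypothetical export of the consolidated dataset

# Flag rows where QTY is missing, non-numeric, or negative
qty = pd.to_numeric(master['QTY'], errors='coerce')
invalid = master[qty.isna() | (qty < 0)]
print(len(invalid), 'rows with invalid QTY values')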


GENERATING THE DATA FOR SAGEMAKER Now that the combined dataset is standardized and consistent, there are a few final steps left to structure the data and make it suitable for SageMaker training. The overall goal is to create 3 files (train.csv, validation.csv, and test.csv) with binary categorical results to feed into the SageMaker built-in XGBoost algorithm. Columns filled with categorical data must be converted into a series of new columns, one per possible value, containing 1 when that value is present and 0 otherwise (i.e. one-hot encoding). For example, the column class_temp, which contains up to 5 possible values, must be converted from a single categorical column into 5 indicator columns. Trifacta provides a one-hot function that performs exactly this translation.
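In pandas, the same one-hot transformation could be sketched as follows; this is not Trifacta's recipe syntax, just an equivalent, and the file name is assumed.

import pandas as pd

master = pd.read_csv('master.csv')

# Replace the categorical class_temp column with one 0/1 indicator column per value,
# e.g. class_temp_cold, class_temp_mild, and so on
master = pd.get_dummies(master, columns=['class_temp'], prefix='class_temp', dtype=int)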


The next step is to randomly shuffle the data set and create 3 distinct files to train, validate, and test the model. This is easily achieved in two steps: first, create a new column with a random value; then, by sorting the dataset on this new column, we shuffle the whole dataset.
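The same two steps (random column, then sort) can be sketched in pandas; note that df.sample(frac=1) would achieve the same shuffle in a single call.

import numpy as np
import pandas as pd

master = pd.read_csv('master.csv')  # hypothetical export of the consolidated dataset

# Add a random value per row, then sort on it to shuffle the whole dataset
master['random_key'] = np.random.rand(len(master))
master = master.sort_values('random_key').reset_index(drop=True)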


Next, we need to create the 3 files for SageMaker that will respectively contain 70% of the rows to train the model, 20% to validate it, and 10% to test it. To do so, we assign a row number to each row using the ROWNUMBER() function, and then assign each row to a train, validation, or test group based on the respective percentages using a case function. This produces a new column, DataSet_Model_Usage, that distributes the data into a particular bin. We can now easily split the files by filtering on the DataSet_Model_Usage value.
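For completeness, here is a pandas sketch of the same row-number-and-case logic; the cut points follow the 70/20/10 percentages and the DataSet_Model_Usage column mirrors the one created in the recipe, but the file names are assumptions.

import numpy as np
import pandas as pd

master = pd.read_csv('master_shuffled.csv')  # the shuffled dataset from the previous step

# Row number, then a case-like assignment into train / validation / test bins
n = len(master)
row_number = np.arange(1, n + 1)
master['DataSet_Model_Usage'] = np.select(
    [row_number <= 0.7 * n, row_number <= 0.9 * n],
    ['train', 'validation'],
    default='test')

# Write the 3 files; SageMaker's built-in XGBoost expects CSV with no header row
# and the target label in the first column
for usage in ['train', 'validation', 'test']:
    subset = master[master['DataSet_Model_Usage'] == usage].drop(columns='DataSet_Model_Usage')
    subset.to_csv(usage + '.csv', index=False, header=False)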


Clicking the Run Job button takes the recipe and pushes it to EMR to process the data at scale and output CSV, which is the format used by the SageMaker XGBoost algorithm.


Data preparation at scale is part of the iteration process needed to get data fit for machine learning modeling, and doing it with Trifacta reduces the time it takes (often by more than 70%) so that data scientists can focus on the modeling part of the project, which is where the business outcome becomes visible. However, if the data preparation is inconsistent, so are the business predictions. Now we have the data ready for the SageMaker built-in algorithm, XGBoost. We use the algorithm to predict whether QTY is 10 or more (1) or less (0) for given features. We have three .csv files, namely train.csv, validation.csv, and test.csv. We can make use of our SageMaker XGBoost script, which is adapted from two Amazon SageMaker sample scripts: Targeting Direct Marketing with Amazon SageMaker XGBoost and Predicting Product Success When Review Data Is Available. The three CSV files and the sample notebook (Trifacta-Blog-Retail-Transaction-Xgboost.ipynb) are placed in the same directory in the SageMaker Jupyter notebook.

Once the data and notebook are ready, there are only three steps to start training.

(1) Specify the S3 bucket and prefix that you want to use for training and model artifacts. Copy the two CSV files, train.csv and validation.csv, to S3 as input for SageMaker's managed training.

bucket = '<your s3 bucket>'  # replace with your own S3 bucket name
prefix = 'sagemaker/DEMO-XGBOOST-RETAIL-TRANSACTIONS-QTY-csv'

import boto3
import sagemaker

role = sagemaker.get_execution_role()

# Upload the prepared files to S3 as input for SageMaker's managed training
boto3.Session().resource('s3').Bucket(bucket).Object(prefix + '/train/train.csv').upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(prefix + '/validation/validation.csv').upload_file('validation.csv')


(2) Specify the algorithm, the locations of train.csv and validation.csv, the training instances, and the hyperparameters.

from sagemaker.amazon.amazon_estimator import get_image_uri

# Built-in XGBoost container image for the current region
container = get_image_uri(boto3.Session().region_name, 'xgboost')

# Point SageMaker at the train and validation CSV files in S3
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='csv')

sess = sagemaker.Session()

xgb = sagemaker.estimator.Estimator(container,
                                    role,
                                    train_instance_count=1,
                                    train_instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sess)

xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='binary:logistic',
                        num_round=100)

(3) Start training by calling fit().

xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})

First we copy train.csv and validation.csv to S3 (i.e. s3://<- your s3 bucket name ->/sagemaker/DEMO-XGBOOST-RETAIL-TRANSACTIONS-QTY-csv/train and s3://<- your s3 bucket name ->/sagemaker/DEMO-XGBOOST-RETAIL-TRANSACTIONS-QTY-csv/validation, respectively). Second, we specify the algorithm; in this case we use XGBoost. We also need to declare the S3 locations where train.csv and validation.csv are stored. We choose a single ml.m4.xlarge as the training instance, although we could use multiple training instances as well. The S3 location for the output model artifact is s3://<- your s3 bucket name ->/sagemaker/DEMO-XGBOOST-RETAIL-TRANSACTIONS-QTY-csv/output. We set the hyperparameters for the binary:logistic objective. Finally, we start training by calling .fit(). After the training is over, the model is deployed to an inference endpoint in the cloud by executing the following command:

xgb_predictor = xgb.deploy(initial_instance_count=1,
                           instance_type='ml.m4.xlarge')
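With the endpoint up, predictions for test.csv can be obtained along these lines. This is only a sketch using the same version of the SageMaker Python SDK as the snippets above; the assumption that the label sits in the first column of test.csv follows the CSV layout expected by the built-in XGBoost algorithm.

import numpy as np
import pandas as pd
from sagemaker.predictor import csv_serializer

# Send CSV-formatted rows to the endpoint and parse the returned probabilities
xgb_predictor.content_type = 'text/csv'
xgb_predictor.serializer = csv_serializer

test = pd.read_csv('test.csv', header=None)
features = test.iloc[:, 1:].values  # drop the label column before predicting

result = xgb_predictor.predict(features).decode('utf-8')
probabilities = np.fromstring(result, sep=',')

For a large test file, the rows would need to be sent in smaller batches to stay within the endpoint's payload limits.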

Conclusion In this blog post we demonstrated an end-to-end process to prepare the data using Trifacta, and then train and host the model using Amazon SageMaker. Although we used a popular SageMaker built-in algorithm, XGBoost, the process would be very similar for other training methods on SageMaker, whether using other built-in algorithms, deep learning frameworks such as TensorFlow, MXNet, or PyTorch, or your own custom algorithms. A free trial of Trifacta Wrangler is available on the AWS Marketplace here. To learn more about the SageMaker example used in this blog post, please take a look at the Amazon SageMaker examples. Enjoy!