About the Presenter: David J Corliss
• PhD in statistical astrophysics; formerly part-time faculty at Wayne State University
• Analytics Architect in the automotive industry
• Work focuses on bringing university research in big data and time series analysis to the private sector
• Founder of Peace-Work, a volunteer cooperative of statisticians, data scientists and other researchers applying analytics to issues in poverty, education and social justice
Best Practices in Big Data
David J Corliss, PhD
Peace-Work
4/27/2016
IHBI: The Institute for Health and Business Insight
OUTLINE
Data Management
Sampling and Coding for Big Data
Tests For Model Performance
Distributed Computing
Summary
Data Management for Big Data
• Pre-screen records and variables
• Process only the records and variables needed
• Efficient Data Step Coding
• Use less computationally intensive methods
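The first two bullets can be sketched with data set options, which restrict records and variables as the data is read rather than afterwards. This is only an illustration under stated assumptions: the applicants data set and the ID, accept, demographic_seg and var1-var221 names follow the example used in this deck, while app_year, the cutoff 2015, the output name applicants_small and the choice of var1-var20 are hypothetical.

```sas
/* Sketch only: app_year, the 2015 cutoff, applicants_small and the kept
   variable list are hypothetical; the other names follow the deck's example. */
data applicants_small;
   /* keep= and where= are applied while the input is read, so unneeded
      variables and records are never carried through the step.
      app_year must appear in keep= because where= is evaluated after keep=. */
   set applicants(keep=ID accept demographic_seg app_year var1-var20
                  where=(app_year >= 2015));
   drop app_year;   /* needed only for the filter, not downstream */
run;
```

Later steps would then read applicants_small instead of the full table, so every subsequent procedure touches fewer rows and columns.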
Bad Data Management 101
proc sort data=applicants;
   by demographic_seg ID;
proc genmod data=applicants;
   class demographic_seg;
   model accept = var1-var221 /
      dist = bin
      link = logit
      lrci;
run;
• Unnecessary sort
• Doesn’t screen variables first
• Models all variables
• Computationally intensive but not needed
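Two of these problems can be removed without changing the model at all. The sketch below is not from the slides. It assumes the "computationally intensive but not needed" callout points at the LRCI option, which requests profile-likelihood confidence intervals that are computed iteratively for each parameter. It also relies on the fact that a CLASS statement, unlike a BY statement, does not require sorted input.

```sas
/* Sketch: the same model as the slide, minus PROC SORT and LRCI.
   The deck does not show this intermediate step. */
proc genmod data=applicants;       /* no sort: CLASS does not need sorted input */
   class demographic_seg;
   model accept = var1-var221 /
         dist = bin
         link = logit;             /* Wald confidence limits are reported by default */
run;
```

This still fits all 221 candidate variables on every record, so the remaining two issues (screening variables and limiting what is modeled) are left untouched here.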
proc glmselect data=applicants(where=(ranuni(0) le 0.001));
   model accept = var1-var221 / selection=lasso(stop=none choose=sbc);