2 JUNE 2016
BUILDING A CLOUD BASED DATA WAREHOUSE
GILDAS BAH, BRENT BENSON, & RYAN FRAZIER
About HBX | HBX Data Management Initiative | Architecture & Implementation | Challenges | Data Driven Culture | Impacts | What’s Next?
Presenters
Ryan Frazier – Director, Systems Engineering and Operations
Brent Benson – Enterprise Architect
Gildas Bah – Data Analyst Engineer
What is HBX?
• Harvard Business School’s newest division, tasked with reimagining business education for the digital age
• Launched in June 2014
• Located in Allston, five minutes from HBS campus
• Moving from start-up to enterprise mode
• The teaching model sets HBX apart from many online learning options and is reflective of the HBS in-person classroom approach
HBX Platforms
HBX Online Platform
• Mainly asynchronous online business education
• Engagement through student interaction in cohorts of ~400
• Case-based learning with highly interactive teaching elements and peer help
HBX Live
• WGBH studio-based virtual classroom
• Synchronous audio/video with chat, polls, boards
• Up to 60 global students on studio wall, hundreds or more observers
Building a Data Management Practice
Why Build a Data-Driven Culture?
STUDENTS: Enhance Outcomes
• Proactively support struggling students
• Identify challenging content
• Evaluate and improve interactive content, social engagement, and retention
STAFF: Improve Effectiveness
• Scale data intensive activities like marketing, admissions, & grading
• Use data to test ideas and improve quality of decisions
FACULTY: Refine Pedagogy
• Evaluate new pedagogical approaches
• Optimize evaluation approaches
• Support pedagogical research activities and innovation
Foster Innovation & Continuous Improvement
• Identify and evaluate innovation opportunities
• Drive continuous improvement
Data Management Program Objectives
• Integrate Data Sources into Comprehensive Data Warehouse
• Build Reports and Dashboards
• Enable Self Service
• Ensure Data Quality and Integrity
Tool and Vendor Selection
Data Warehouse
• Standard relational DB, Redshift
• Chose Redshift because of scalability, performance
• Aligns with AWS platform focus
ETL
• Informatica, Talend
• Chose Informatica because of university relationship and myriad of pluggable connectors
Reporting/Analytics
• MicroStrategy, Qlik, Tableau
• Chose Tableau because of feature set and industry adoption
HBX Data Ecosystem
[Diagram. Labeled components: Course Platform Ver. A (MongoDB, MySQL); Course Platform Ver. B (MongoDB, MySQL); Reporting Copy (two instances); Historical Data (MongoDB, MySQL); Admin System (MySQL); Salesforce; Progress ODBC for MongoDB; Informatica Secure Agent; Redshift; Tableau Server. Other labels: "sync", "NEW!".]
HBX Data Management by the Numbers
Source Systems
• 35 databases
• 887 tables
• 5,844 fields
• 109,751,902 rows
Data Warehouse
• 4 Redshift clusters
• 8 databases
• 404 tables
• 5,674 fields
• 400,794,679 rows
Daily ETL Process
• 300 jobs
• 6,515,599 rows
* Updated 6/1/2016
HBX Data Models
MySQL-Relational
• Course offerings
• Student demographics
• Applications & registration
• Limited course content
MongoDB-Semi-Structured
• Course structure
• Course content
• Student course state
• Metric (timing) data
Example MongoDB document (truncated on the slide):
{ "_id": ObjectId("556f25ab662a9b059ea8df8b"),
  "tei_id": "554a607b241b5a3f0e09eefe",
  "course_instance_id": "556dcf55b7431f414d87f06f",
  "user_id": "8701",
  "comments": [
    { "id": "ce59a25a-ce69-47-c534611f7ebf",
      "text": "This is a great response…",
      "author_id": "6411",
      "date_created": "2015-09-10",
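To make the relational versus semi-structured contrast concrete, here is a minimal Python sketch of how a document with an embedded `comments` array could be split into one parent row and one child row per comment. The sample values, the `split_document` helper, and the `parent_id` and `position` columns are illustrative assumptions, not part of the HBX pipeline.

```python
# Hypothetical document shaped like the slide's example; the values are
# made-up placeholders, not data from the slide.
doc = {
    "_id": "doc-1",
    "tei_id": "tei-1",
    "course_instance_id": "course-1",
    "user_id": "8701",
    "comments": [
        {"id": "c-1", "text": "First comment", "author_id": "6411", "date_created": "2015-09-10"},
        {"id": "c-2", "text": "Second comment", "author_id": "7002", "date_created": "2015-09-11"},
    ],
}


def split_document(document: dict) -> tuple:
    """Return one parent row plus one child row per embedded comment."""
    # Everything except the nested array goes into the parent row.
    parent = {key: value for key, value in document.items() if key != "comments"}
    # Each comment becomes its own row, linked back to the parent by _id.
    children = [
        {"parent_id": document["_id"], "position": index, **comment}
        for index, comment in enumerate(document.get("comments", []))
    ]
    return parent, children


parent_row, comment_rows = split_document(doc)
print(parent_row)
for row in comment_rows:
    print(row)
```

The parent row maps naturally onto one relational table and the comment rows onto a second table keyed by `parent_id`.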
Challenges
• Immature data connection support
• Large object storage limitations in Redshift
• Difficulty flattening complex/polymorphic data structures
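The slide does not say which storage limitation was hit. One well-known constraint is that a Redshift `VARCHAR` column holds at most 65,535 bytes, so a large serialized document cannot be stored whole in a single value. Assuming that is the relevant limit, a pre-load check might look like the sketch below; `fits_in_varchar` and the sample documents are hypothetical.

```python
import json

# Maximum size of a Redshift VARCHAR value, in bytes.
REDSHIFT_VARCHAR_MAX_BYTES = 65535


def fits_in_varchar(document: dict, limit: int = REDSHIFT_VARCHAR_MAX_BYTES) -> bool:
    """True if the JSON-serialized document fits in a single VARCHAR value."""
    # The limit is in bytes, not characters, so measure the UTF-8 encoding.
    encoded = json.dumps(document, ensure_ascii=False).encode("utf-8")
    return len(encoded) <= limit


# Hypothetical documents: one small, one with an oversized text field.
small = {"user_id": "3312", "state": {"answer": "short reflection"}}
large = {"user_id": "3312", "state": {"answer": "x" * 70000}}

print(fits_in_varchar(small), fits_in_varchar(large))
```

Documents that fail the check would need to be split, truncated, or flattened before loading.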
Document-Structured Data Challenges
{"_id": ObjectId("2804c514e4c20e6d"), "course_instance_id": "2804c51563c9c772", "tei_id": "241b5a14b75fac83", "user_id": "3312", "date_created": datetime.datetime(2015, 10, 14, 10, 56, 59, 137000), "category": "timespan", "metric": {"interaction_time": 180, "is_interaction_time": True}}
{"_id": ObjectId("2804c514f92f1f64"), "course_instance_id": "2804c51563c9c772", "tei_id": "241b5a14b75fab70", "user_id": "3312", "date_created": datetime.datetime(2015, 10, 14, 17, 11, 56, 967000), "category": "view_user_response", "metric": {"viewed_user_id": "9212"}}
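The two metric documents share their top-level fields but carry different keys under `metric`, depending on `category`. One way to put such simple polymorphic documents into a single relational table is to promote every `metric` key to its own nullable column. The sketch below assumes that approach; the `flatten_metric` helper and the `metric_` column prefix are illustrative, and the `ObjectId` values are shown as plain strings.

```python
import datetime

# The two metric documents from the slide, with ObjectId values as strings.
docs = [
    {"_id": "2804c514e4c20e6d", "course_instance_id": "2804c51563c9c772",
     "tei_id": "241b5a14b75fac83", "user_id": "3312",
     "date_created": datetime.datetime(2015, 10, 14, 10, 56, 59, 137000),
     "category": "timespan",
     "metric": {"interaction_time": 180, "is_interaction_time": True}},
    {"_id": "2804c514f92f1f64", "course_instance_id": "2804c51563c9c772",
     "tei_id": "241b5a14b75fab70", "user_id": "3312",
     "date_created": datetime.datetime(2015, 10, 14, 17, 11, 56, 967000),
     "category": "view_user_response",
     "metric": {"viewed_user_id": "9212"}},
]


def flatten_metric(doc: dict) -> dict:
    """Copy top-level fields and lift each metric key into a metric_* column."""
    row = {key: value for key, value in doc.items() if key != "metric"}
    for key, value in doc.get("metric", {}).items():
        row[f"metric_{key}"] = value
    return row


rows = [flatten_metric(doc) for doc in docs]
# The table's columns are the union of all keys seen; missing values become None.
columns = sorted({column for row in rows for column in row})
table = [{column: row.get(column) for column in columns} for row in rows]

for record in table:
    print(record)
```

Each category fills only its own `metric_*` columns and leaves the others empty, which works while the number of distinct metric shapes stays small.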
Document-Structured Data Challenges
User state documents for reflection and multiple choice:
{"_id": ObjectId("562666868c58dab88be84345"), "course_instance_id": "561829c02804c51563c9c772", "tei_id": "55c21b88241b5a14b75fab8d", "user_id": "3312", "state": {"answer": "The case really drove home..."}}
{"_id": ObjectId("56414b498c58dab88bf10873"), "course_instance_id": "561829c02804c51563c9c772", "tei_id": "5639ef402804c509af1d2721", "user_id": "3312", "state": {"summary": [{"content": "Incorrect: Being quick to market...", "correct": False, "id": "5bc32d05-1173-452c-801b-34c2368ea4b6"}, {"content": "Correct: In the early stages...", "correct": True, "id": "88893c55-3e30-4f50-8495-a6fe1f1cef94"}, {"content": "Incorrect: Customization becomes...", "correct": False, "id": "4afbf29f-d23d-43a3-8266-e41df3defa69"}]}}
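The two user state documents differ in the shape of `state`, not just in its keys: one holds a single `answer` string, the other a `summary` list of objects. A rough sketch of turning both shapes into a summary row per document plus detail rows per list item follows. The `to_rows` helper, the `kind` labels, and the output column names are assumptions made for illustration; the documents are trimmed copies of the slide's examples with `ObjectId` values as strings.

```python
# Trimmed copies of the two user state documents from the slide.
docs = [
    {"_id": "562666868c58dab88be84345", "tei_id": "55c21b88241b5a14b75fab8d",
     "user_id": "3312",
     "state": {"answer": "The case really drove home..."}},
    {"_id": "56414b498c58dab88bf10873", "tei_id": "5639ef402804c509af1d2721",
     "user_id": "3312",
     "state": {"summary": [
         {"content": "Incorrect: Being quick to market...", "correct": False,
          "id": "5bc32d05-1173-452c-801b-34c2368ea4b6"},
         {"content": "Correct: In the early stages...", "correct": True,
          "id": "88893c55-3e30-4f50-8495-a6fe1f1cef94"},
         {"content": "Incorrect: Customization becomes...", "correct": False,
          "id": "4afbf29f-d23d-43a3-8266-e41df3defa69"},
     ]}},
]


def to_rows(doc: dict) -> tuple:
    """Return (summary_row, detail_rows) for one user state document."""
    state = doc["state"]
    summary = {"doc_id": doc["_id"], "tei_id": doc["tei_id"], "user_id": doc["user_id"],
               "kind": None, "answer": None, "item_count": 0, "correct_count": 0}
    details = []
    if "answer" in state:
        # Reflection: a single free-text answer, no detail rows.
        summary.update(kind="reflection", answer=state["answer"])
    elif "summary" in state:
        # Multiple choice: one detail row per item in the summary list.
        items = state["summary"]
        summary.update(kind="multiple_choice", item_count=len(items),
                       correct_count=sum(1 for item in items if item["correct"]))
        details = [{"doc_id": doc["_id"], "item_id": item["id"],
                    "content": item["content"], "correct": item["correct"]}
                   for item in items]
    else:
        raise ValueError(f"unrecognized state shape: {sorted(state)}")
    return summary, details


summary_rows, detail_rows = [], []
for doc in docs:
    summary, details = to_rows(doc)
    summary_rows.append(summary)
    detail_rows.extend(details)

print(summary_rows)
print(detail_rows)
```

Every new `state` shape needs its own branch here, which illustrates why this kind of data is hard to handle with purely declarative mapping tools.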
Document-Structured Data Challenges
• Documents with simple and consistent structure are easy to translate into relational form
• Documents with simple but polymorphic structure are handled by modern MongoDB drivers (metric example)
• Documents with complicated and polymorphic structure (user state example) push the boundaries of current drivers and declarative tools
• Current solution: copy like-typed documents into separate collections
• Preferred solution: copy all documents into warehouse and do post-copy transforms for summary and detailed information in relational form
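The "current solution" bullet depends on deciding which documents are like-typed. The slides do not describe how that is done; one possible approach, sketched below, is to compute a structural signature for each document's `state` field and group documents by it. The `signature` and `partition_by_shape` helpers and the sample documents are hypothetical.

```python
from collections import defaultdict


def signature(value) -> str:
    """Describe the structure of a value while ignoring the actual data."""
    if isinstance(value, dict):
        parts = (f"{key}:{signature(item)}" for key, item in sorted(value.items()))
        return "{" + ",".join(parts) + "}"
    if isinstance(value, list):
        # A list is described by the distinct shapes of its elements.
        return "[" + "|".join(sorted({signature(item) for item in value})) + "]"
    return type(value).__name__


def partition_by_shape(docs: list, field: str = "state") -> dict:
    """Group documents whose `field` values share the same signature."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[signature(doc.get(field))].append(doc)
    return dict(buckets)


# Hypothetical documents: two reflections and one multiple-choice state.
docs = [
    {"_id": "a", "state": {"answer": "First reflection"}},
    {"_id": "b", "state": {"answer": "Second reflection"}},
    {"_id": "c", "state": {"summary": [
        {"content": "Option 1", "correct": False, "id": "x"},
        {"content": "Option 2", "correct": True, "id": "y"},
    ]}},
]

buckets = partition_by_shape(docs)
for shape, members in buckets.items():
    print(shape, [doc["_id"] for doc in members])
```

Each bucket would correspond to one of the separate collections, and each of those has a consistent structure that translates directly to relational form.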
Creating a Data-Driven Culture
Creating a Data Driven Culture
People
• Technical Partners
• Leadership
• Staff
Technology
• Self-service
• Eliminate Complexity
• Experimentation
Process
• Process Governance
• Data Governance
• Education
Enablers for Building Data Driven Culture at HBX
Strong Partners
• Use off-shore partner Mindtree to accelerate
• Active engagement of vendors on technology challenges
Education
• Short presentations to staff
• Data analysis exercise at all-staff team meeting
Program Governance
• Active interest & involvement from Business Areas
• Alignment to organizational priorities
Experimentation
• HBX willingness to try new things
• Helps drive engagement with vendors
Organizational Impacts
Enablement of real-time data-driven decision making
• Dashboards for Registration Pipeline and Demographics
• Application Forecasting Dashboard
A move from spreadsheets to dashboards and configurable business processes
• Development of grading automation data pipeline
• Reporting for B2B Participants
A move from individually handled data requests to dashboards and self-service reporting
• Self-service marketing data extract
What’s Next?
Streaming Data?
Native JSON Data Warehouse?
Analytics?
Additional Data Sources?
www.hbx.hbs.edu
Questions?