BIG DATA ANALYTICS MEANS “IN-DATABASE” ANALYTICS

Jan 19, 2015

Presented by: Dr. Bruce Aldridge, Sr. Industry Consultant Hi-Tech Manufacturing, Teradata

TIBCO Spotfire and Teradata: First to Insight, First to Action; Warehousing, Analytics and Visualizations for the High Tech Industry Conference
July 22, 2013, The Four Seasons Hotel, Palo Alto, CA

1. BIG DATA ANALYTICS MEANS IN-DATABASE ANALYTICS
Dr. Bruce Aldridge, Sr. Industry Consultant, Hi-Tech Manufacturing, Teradata
760.458.1376
Teradata Proprietary and Confidential

2. Overview of Topics
7/30/2013, Teradata Confidential
Big Data Analytics
  > The problems of extreme data
  > Key principles for analytic engines
Analytic Technologies
  > Changing from sequential to parallel
  > Design for analytics
Operationalizing Analytics
  > Analytic life cycle management
  > Visualization / interacting

3. What is Big Data?
Big Data: any information that's too fast, too large or doesn't fit what you are using
Data Explosion
  > Automation of equipment and business processes
  > Sensor integration
  > Communication (networks / web)
  > Compliance

4. Using Big Data
Collecting data and using data are different things
  > Data Lakes serve as high volume, low cost repositories for collection
  > Data may be semi-structured or structured, with the conversion frequently happening within the repository
  > Large amounts of data may be stored for reporting, compliance or investigations
Unusual or new events provide learning (most big data will not provide new information or knowledge)

5. Guidelines for Big Data
Collecting / Learning / Using data
  > Data stored on the appropriate system for use
  > Data mining and statistical tools for learning
  > Model publication (PMML) and monitoring for deployment
  > Visualization tools critical for all

6. Extreme data brings new challenges
New techniques to limit variables for analysis / modeling
Emergence of columnar analytics
Wealth of data results in more variables than responses: $y = f(x_1, x_2, x_3, x_4, \ldots, x_n)$, where $n > m$ ($n$ variables, $m$ responses)
Data organization struggles with wide data (>100,000 columns)

Wide table:
Id  V1   V2   V3  V4  V5  V6  V7   V8   V9
AA  1.2  3.1  41  56  a   9   0.2  ?    ?
AB  0.9  2.7  41  62  a   8   0.2  1.1  7
BA  1.0  2.9  42  57  b   9   0.1  1.1  ?

Pivoted:
Id  Col ID  Val
AA  V1      1.2
AA  V3      41
AB  V1      0.9
AB  V8      1.1
AB  V2      2.7

Second wide table:
Id  V1   V2   V3  V4  V5  V6  V7   V8   V9
CA  1.2  3.1  41  56  a   9   0.2  ?    ?
CB  0.9  2.7  41  62  a   8   0.2  1.1  7
BB  1.0  2.9  42  57  b   9   0.1  1.1  ?

Pivoted (both tables):
Id  Col ID  Val
AA  V1      1.2
AA  V3      41
AB  V1      0.9
AB  V8      1.1
AB  V2      2.7
CA  V4      56
CB  V3      41
BB  V1      1.0

Multiple tables add more rows

7. Technology Requirements for Big Data Analytics
Need for large amounts of data storage
Ability to get at the data (SQL)
Availability of tools for
  > Visualization
  > Characterizing, organizing and cleaning data
  > Summarizing (descriptive statistics)
  > Analyzing (predictive models, data discovery)
  > Monitoring & reporting
Analytic Fault Tolerance (massive systems imply more failures)
Dynamic growth: ability to add more capability without starting over; mixing technologies
ROI

8. Analytic Tools
Faster analytics require a different approach: parallel
  > Sequential processing will be limited
  > Parallel analytics distributes calculations across multiple nodes, with each node having the data necessary
  > Management of calculation (distribution) and collection
Because data is generally stored on multiple nodes, there is no choice but to bring the analytics to the data.
Diagram labels: Data; Analytic Modeling Tools; Business Results; Local Data repository; Parallel Analytic Procedures; Simple reporting / management tools; Data

9. Putting it all together: Analytic Architecture
Diagram labels: LANGUAGES; MATH & STATS; DATA MINING; DISCOVERY PLATFORM; LOW COST; HIGH CAPACITY; PARALLEL; DATA LAKE; CAPTURE | STORE | REFINE; LANGUAGES; MATH & STATS; DATA MINING; BUSINESS INTELLIGENCE; APPLICATIONS; FLEXIBLE ANALYTIC / DISCOVERY PLATFORM; REPORTING / MONITOR; SYSTEM OF RECORD - DATA WAREHOUSE; AUDIO & VIDEO; IMAGES; TEXT; WEB & SOCIAL; MACHINE LOGS; CRM; SCM; ERP
Environment for:
  Low cost, high capacity storage
  High power analytics
  Fault tolerant, high performance reporting
  Exploration / visualization across all areas
Visualization exploration
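The wide-to-tall pivot shown on slide 6 can be sketched in a few lines of Python. This is a minimal illustration rather than the presenter's method. The function name `pivot_wide_to_tall` and the treatment of "?" as a missing cell are assumptions made for the example. Each wide row becomes one (Id, Col ID, Val) triple per populated cell, so sparse columns cost nothing in the tall form.

```python
def pivot_wide_to_tall(rows, id_col="Id", missing="?"):
    """Turn wide records into (id, column, value) triples, skipping missing cells."""
    tall = []
    for row in rows:
        for col, val in row.items():
            # The id column is the key, not a measurement; "?" marks an empty cell.
            if col == id_col or val == missing:
                continue
            tall.append((row[id_col], col, val))
    return tall


# First two rows of the wide table on slide 6.
wide = [
    {"Id": "AA", "V1": 1.2, "V2": 3.1, "V3": 41, "V4": 56, "V5": "a",
     "V6": 9, "V7": 0.2, "V8": "?", "V9": "?"},
    {"Id": "AB", "V1": 0.9, "V2": 2.7, "V3": 41, "V4": 62, "V5": "a",
     "V6": 8, "V7": 0.2, "V8": 1.1, "V9": 7},
]

tall = pivot_wide_to_tall(wide)
for record in tall:
    print(record)
```

Because every tall row has the same three columns, rows from several wide tables can be appended to one tall table, which is the "multiple tables add more rows" point on the slide.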
10. Analytics Process
Data Preparation: Transform, clean and aggregate data to form a data set suitable for analysis
Monitor / Model Deployment: Deploy statistical model to run routinely, automatically monitoring for control
Data Exploration: Explore all data with statistical profiling and visualization
Understand / Model the data: Apply mathematical / relational models to test hypotheses about the data
Diagram labels: Modeling ADS; Sample Data; Build ADS; Production ADS; Automated process; SQL; In-dbs Function; PMML or UDF; Models

11. Analytic discovery process
CRISP: Cross Industry Standard Process for Data Mining
Business / Data understanding
  > Defining objectives and requirements of the business
  > Data collection and data profiling / characterization
Data preparation: joins between tables, attribute selection, cleaning, building new values
Modeling: Analytic algorithms applied and parameters adjusted
Evaluation: Results scored according to objectives and requirements
Deployment: Models and parameters put into on-demand or automatic execution on new data

12. Analysis: The generation of knowledge
Generation of knowledge is iterative and interactive
  An idea related to a problem or observation is formulated
  Data is collected to support or refute the idea (deduction: what kind of data is necessary?)
  Analysis is made on the data to validate or refute (induction)
  Results either support / reject the idea or suggest modifications
Monitoring
  Known analytic models used for prediction / verification
  Adjust / control based on prediction vs. observation
  Business scoring used to prioritize
Diagram labels: Data (facts, phenomena); Idea (model, hypothesis, theory, conjecture); Monitor / control; Validate; Revise
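The monitoring idea on slide 12, "adjust / control based on prediction vs. observation", can be sketched as a residual check. This is a hedged illustration, not something the slides specify: the function names, the 3-sigma rule and all sample numbers are assumptions. Residuals from a baseline period set the control limits, and any new observation whose residual falls outside them is flagged.

```python
from statistics import mean, stdev


def control_limits(residuals, k=3.0):
    """Return (lower, upper) limits as mean +/- k sample standard deviations."""
    center = mean(residuals)
    spread = stdev(residuals)
    return center - k * spread, center + k * spread


def flag_out_of_control(predicted, observed, limits):
    """Return indices where (observed - predicted) falls outside the limits."""
    lower, upper = limits
    return [
        i
        for i, (p, o) in enumerate(zip(predicted, observed))
        if not lower <= (o - p) <= upper
    ]


# Hypothetical residuals collected while the model was being validated.
baseline_residuals = [0.1, -0.2, 0.05, 0.0, -0.1, 0.15, -0.05, 0.1]
limits = control_limits(baseline_residuals)

# Hypothetical new predictions and the values actually observed.
predicted = [10.0, 10.5, 11.0, 11.5]
observed = [10.1, 10.4, 12.5, 11.45]
alerts = flag_out_of_control(predicted, observed, limits)
print(alerts)
```

A flagged index is a candidate for the "revise" arrow in the slide's diagram: either the process has drifted or the model no longer fits.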
13. Establishing a Robust Environment
Quality Information
  Master Data Management
  Data Profiling / visualization
  Logical / Physical Model
  Data correction
  Data Steward / Cleansing processes
Discovery
  Statistical / Data Mining Tools
  Secure access
  Robust Analysis capabilities
  Visualization / understanding
  Clear and significant results
  Flexibility in data and models
Automation and Alerts
  Simple publication of discovery knowledge
  Automated pattern / anomaly detection
  Business scoring for notification and escalation
  Clear communication of results
Visualization and Reports
  Choice of the tools to match needs (e.g. Dashboard vs. Engineering views)
  Timing and need for data refresh
  Reporting on Core or staging
  Consistent use of metrics / results (e.g. analytics in database vs. at the reporting layer)

14. Analytics Key Requirements
Performance
  > Parallel processing: true shared nothing architecture
  > Data structure influences analytics (order of magnitude)
  > Management of analytics and data critical
Fault tolerance
  > More nodes WILL result in more failures
  > Analytic Fault Tolerance is more than database fault tolerance: the ability to avoid restarting the analytics
Different node performance
  > Execution in parallel will never be identical: adjust for node differences
  > System expansions must be compatible
Flexible analytics
  > Big data analytics combine queries with analytic functions
  > Analytic languages are not parallel (in general): need the ability to add / customize new functions
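The shared-nothing pattern on slide 14 can be sketched as follows: each node reduces its own partition to a small summary, and only the summaries travel to a coordinator. This is a minimal sketch under stated assumptions. Threads stand in for database nodes, and the names `local_summary` and `merge` are hypothetical, not part of any Teradata API.

```python
from concurrent.futures import ThreadPoolExecutor
from math import sqrt


def local_summary(partition):
    """Runs on one 'node': reduce local rows to (count, sum, sum of squares)."""
    return len(partition), sum(partition), sum(x * x for x in partition)


def merge(summaries):
    """Combine per-node summaries into a global mean and sample standard deviation."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    total_sq = sum(s[2] for s in summaries)
    mean = total / n
    variance = (total_sq - n * mean * mean) / (n - 1)
    return mean, sqrt(variance)


# Each inner list plays the role of the rows stored on one node.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]

with ThreadPoolExecutor(max_workers=3) as pool:
    summaries = list(pool.map(local_summary, partitions))

mean, std = merge(summaries)
print(mean, std)
```

Three numbers per node cross the network regardless of partition size, which is the sense in which the analytics are brought to the data. The sum-of-squares formula is used here for brevity; it can lose precision on large or poorly scaled data, where a pairwise-merge variance update would be a safer choice.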
15. Analytic Applications
Existing parallel analytics
  > In-database proprietary
  > In-database add-ons (Fuzzy Logix, SAS, Partial R)
  > Hybrid (Aster): database architecture supporting MapReduce functions
Many existing applications moving parallel
  > SAS: Partnered with Teradata for seamless in-database execution of more analytics
  > R: Partnered with Revolution R for rapid data extraction and execution of some analytics in-database AND in parallel
  > Spotfire: Execution of aggregation analytics and ability to define in-database analytic functions. Embedded TERR (TIBCO Enterprise Runtime for R)
Write your own
  > MapReduce framework
  > User defined functions

16. Analytic Libraries and Enhancements
Database built in
  Descriptive statistics
  Basic data mining models (regression, cluster, trees, PCA)
  User defined functions
Partners
  Revolution R, SAS, Fuzzy Logix, Spotfire, ...
Enhancements
  High speed connections
  Native data storage

17. Dashboard as an Analytic Tool
The Dashboard becomes a 2-way interface
User interaction parameterizes and launches new analytics
Diagram labels: Device; Lot; Raw Data; Wafer

18. Integration of Dashboard
Reporting / visualization tool with ability to execute custom functions in-database
  > Empower all users: ability to publish in-database analytics to users

19. Monitor Analytics
Analytic models are generally published into SQL-compatible queries
Applying models to data involves:
  > Gather and format data for the analytic
  > Group data into consistent sets
  > Screen data
  > Apply algorithms
  > Evaluate results
Complete sequence applied to
  > Massive amounts of analyses
  > Repetitive / automated analyses
  > Scoring / triage to identify most significant results
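Slide 19's statement that models are published into SQL-compatible queries can be illustrated with a linear model rendered as a scoring query. This is a sketch under assumptions, not the deck's implementation. SQLite stands in for the warehouse so the example runs anywhere, and the table `readings`, its columns, the coefficients and the helper `linear_model_to_sql` are all hypothetical.

```python
import sqlite3


def linear_model_to_sql(intercept, coefficients, table):
    """Render a linear model as a SELECT that scores every row of `table`.

    Column and table names are interpolated directly, so they are assumed
    to be trusted identifiers rather than user input.
    """
    terms = [repr(float(intercept))]
    terms += [f"{float(w)!r} * {col}" for col, w in coefficients.items()]
    expression = " + ".join(terms)
    return f"SELECT id, {expression} AS score FROM {table} ORDER BY score DESC"


# Hypothetical fitted model: score = 1.0 + 0.5 * temperature - 2.0 * pressure
coefficients = {"temperature": 0.5, "pressure": -2.0}
query = linear_model_to_sql(1.0, coefficients, "readings")

# In-memory SQLite database standing in for the data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id TEXT, temperature REAL, pressure REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [("w1", 20.0, 1.0), ("w2", 30.0, 2.0), ("w3", 10.0, 0.5)],
)

scored = conn.execute(query).fetchall()
print(query)
print(scored)
```

The scoring runs where the rows live and only (id, score) pairs come back. Sorting by score in the same query is one simple way to do the triage step the slide lists last.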
20. An Analytic Monitor approach
Diagram labels: User direct edit of Group Description Table (very infrequent); View for Instances; Stage Data; Group Instance Table; Group Description Table; Core EDW Data Model; Reporting, BI & Alert management tools; Alert settings; Standard ETL; Views for Data (Group Instance); Creation of Views to evaluate load data (installation / dba level user); Analytic proced