IBM Research Brazil Fabio Porto ( [email protected] ), LNCC – MCTI DEXL Lab (dexl.lncc.br) Challenges in Scientific Big Data Management
Ibmr 2014

Jun 19, 2015

Data & Analytics

Fabio Porto

This talk was given at IBM Research Brazil and presents some of the work we are currently doing at the DEXL Lab at LNCC.
Transcript
  • 1. IBM Research Brazil Fabio Porto ([email protected]), LNCC MCTI DEXL Lab (dexl.lncc.br) Challenges in Scientific Big Data Management

2. Outline: Introduction; Big Data in Science; Hypothesis-Driven Research; Hypothesis as Data; Upsilon-DB; SimDB; Final remarks.

3. Laboratório Nacional de Computação Científica (LNCC), Petrópolis, Rio de Janeiro.

4. LNCC - MCTI: Graduate Course in Computational Modelling (CAPES 6); BioInfo Laboratory (high-throughput sequencing); Coordinator of INCT MACC (Medicine Supported by Computational Science); Coordinator of SINAPAD (HPC National System). Thematic laboratories: ACIMA (Augmented Reality); MARTIN (Network and Software Engineering); DEXL (Big Data); COMCIDIS (Distributed Systems); HEMOLAB (Cardio Vascular System Modelling).

5. DEXL Ongoing Projects (DEXL Laboratory): What-if data analysis; Astronomy Data Management; Simulation Data Management; Hypothesis Upsilon-DB; Gene Regulation Network; Scientific workflow optimization; Seismic Data Management (EMC); Bioknowlogy; Olympic Laboratory.

6. Objective: To provide scientists with an in-silico cockpit from which scientific data and metadata can be efficiently managed.

7. "Scientists are spending most of their time manipulating, organizing, finding and moving data, instead of researching. And it's going to get worse." Office of Science, Data-Management Challenge Report, DoE, 2004.

8. Big Data in science: An expression that reflects the data deluge produced in science: astronomy, astrophysics; biology, neuroscience; sports; geology, geophysics, etc.

9. Big Data - Dimensions. Volume: MB, GB, TB, PB. Variety: file, database, uncertainty, heterogeneity, evolution. Velocity: batch, online, sensors, alerts, real time.

10. A challenge on volume: The Dark Energy Survey (DES) project expects to produce 100 PB in 10 years (source: personal comm.);
5000 square degrees of sky coverage, all objects, perfect accuracy. Yahoo claims to manage 2 PB of click data in a modified PostgreSQL. EMBL nucleotide database: 260 Gbases. High-throughput sequencing: 454 Roche technology sequences 400-600 million bases in 10 hours. E.g., a project at the Max Planck Institute aims at sequencing the whole genome of the Neanderthal; at 3 billion base pairs, it is expected to take 2 years to finish.

11. 1D-3D coupled simulations

12. From Observation to Data Analysis

13. BIG DATA in Science: The scientific process is being remodelled to be developed within an in-silico environment. Powerful instruments: digital telescopes, DNA sequencers, mass spectrometers. Huge simulations: weak lensing, human cardiovascular system. Massive amounts of information stream in and out. Hypothesis-driven research is supported by in-silico infrastructure, methods and models.

14. e-Science life cycle (diagram labels): Hypothesis Formulation, Modeling, Experiment Life-cycle, Publication, Phenomenon.

15. Big Data urgent call in e-science: scientific life cycle metadata management; scientific hypothesis formulation and validation; scientific data management; scientific data processing architecture.

16. MODELLING - HYPOTHESIS-DRIVEN BIG DATA RESEARCH

17. "To see what is in front of one's nose needs a constant struggle." George Orwell

18. To make sense of Big Data we need models [Peter Haas, "Data is Dead without what-if models", PVLDB 2011]. Scientific models are formal interpretations of phenomena. Hypotheses are formalized as models. The scientific life cycle is driven by hypothesis validation.

19.
Hypothesis-driven Big Data analyses. Scientific hypothesis: a model for a scientist's interpretation of a phenomenon. Different hypotheses co-habit a scientific domain. Scientific method: prove hypotheses. Big Data analyses: hypotheses exploration. In new Big Data prediction analysis: identify first principles that guide predictions; deep vs shallow prediction.

20. Big Data hypotheses-driven life cycle (diagram, adapted from [Mattoso et al. 2010]). Labels: Hypotheses, Experiment Goals; Experiment, Workflow Design; Workflow Preparation; Workflow Execution; Post-Execution Analysis; Workflow repository; Data Sources; Provenance Store; Monitoring; Hypotheses database; Analysis Results.

21. Equivalence of interpretation (diagram): (hypotheses), Models, (phenomenon), (data).

22. HYPOTHESIS AS DATA

23. Υ-DB Conceptual Model (diagram) [Porto et al. ER 2008, ER 2012]. Entities shown include: Phenomenon; Continuous, Discrete and State Ph_Process; SC Hypothesis; Scientist; Mathematical Model; Mathematical Formulae (XML); Formal Representation and Formal Language; elements (constant, function, equation, variable); Physical Quantities; Observation Element; Simulated Element; Data View (query over Data view); Space-Time Dimension; Event; Computational Model; Simulation; Mesh and Mesh Data view; Domain ontology (URL). Relationship labels include: explains, formulatedBy, isTheBlendOf, isBasedOn, represented_as, compared_with, modeled_as, refers-to, transforms, represents, isAuthor, topologically modeled by.

24. Hypotheses as Data, Upsilon-DB: From the triangular equivalence, we derive that Hypothesis = Model = Data. How can we infer data from a model? [Bernardo Gonçalves, Fabio Porto, PVLDB 2014]

25.
Hypothesis as Models.
Hypothesis (Law of free fall): "If a body falls from rest, its velocity at any point is proportional to the time it has been falling."
Scientific Model: $a(t) = -g$; $v(t) = -g t + v_0$; $s(t) = -\frac{g}{2} t^2 + v_0 t + s_0$.
Computational Model:
    % Evaluate the free-fall model at n+1 time steps of size dt
    for k = 0:n
        t = k * dt;
        v = -g*t + v0;                 % velocity
        s = -(g/2)*t^2 + v0*t + s0;    % position
        t_plot(k+1) = t;               % arrays are 1-indexed
        v_plot(k+1) = v;
        s_plot(k+1) = s;
    end

26. Hypothesis - From Models to Data: the computational model of slide 25 is given to a SOLVER, which maps Input to Output.

27. Hypothesis as Data: the law of free fall, its scientific model and its computational model (slide 25) yield the relation Free_Fall:
    t | v    | s
    0 | 0    | 5000
    1 | -32  | 4984
    2 | -64  | 4936
    3 | -96  | 4856
    4 | -128 | 4744

28. Hypothesis as Data - Computing a DB schema: Mathematical models formalize hypotheses. Equations establish a functional dependency between dimensions and parameters on one side and predicting variables on the other, e.g. g, t, v0 → v. Derive a DB schema from the functional dependencies (FDs) extracted from the equations.

29. Hypothesis as Data. In the Free Fall example (with $v(t) = -g t + v_0$):
    Σ1 = { φ → g, v0, s0;
           g, υ → a;
           g, v0, t, υ → v;
           g, v0, s0, t, υ → s }
Observe that φ and υ are epistemological variables referring to the phenomenon and the hypothesis, respectively.

30. Hypothesis as Data - schema: φ → g, v0, s0 defines the model parameters. It is expected to be violated, reproducing the uncertainty in the model input; such uncertainty contributes to the quality of the hypothesis. From Σ1, the schema for predicting a under hypothesis h1 would be: h1(φ, υ, a). From Σ1, the input parameters are defined as: h1_input(φ, g, v0, s0) (*key violation).

31.
Hypothesis as Uncertain Data. Relation INPUT_H1:
    φ | g    | v0 | s0
    1 | 32   | 0  | 5000
    1 | 32   | 10 | 5000
    1 | 32   | 20 | 5000
    1 | 32.2 | 0  | 5000
    1 | 32.2 | 10 | 5000
    1 | 32.2 | 20 | 5000
Uncertainty: 50% (two values of g); Uncertainty: 33% (three values of v0).

32. Uncertainty Introduction: Υ-DB is a probabilistic database [D. Suciu et al., Probabilistic Databases, 2011]. A Y-relation includes certain and conditional columns; a conditional column is a pair (Vi, Di), where Vi is a random variable and Di is one of its possible values. Example:
    create table Y_g as
      select U.phi, U.g
      from (repair key phi in
              (select phi, g, count(*) as Fr
               from INPUT_H1 group by phi, g)
            weight by Fr) as U

33. Hypothesis as Uncertain Data:
    create table Y_g as
      select U.phi, U.g
      from (repair key phi in
              (select phi, g, count(*) as Fr
               from INPUT_H1 group by phi, g)) as U
INPUT_H1 projected on (φ, g): (1, 32) three times and (1, 32.2) three times.
Resulting Y_g:
    V → D  | φ | g
    x1 → 1 | 1 | 32
    x1 → 2 | 1 | 32.2

34. Synthesizing Prediction as a query: as g, υ → a is in Σ1, we can predict a as a query on the uncertain relations Y_g and Y_R:
    create table Y1_a as
      select H.phi, H.upsilon, H.a
      from H1_OUTPUT_a as H, Y_R as R, Y1_g as G,
           (select min(tid) as tid, phi, g
            from H1_INPUT group by phi, g) as U
      where H.tid = U.tid and G.phi = U.phi and G.g = U.g
        and H.phi = R.phi and H.upsilon = R.upsilon

35. Predicted Y-DB relation Y[a]:
    φ | g    | a    | u
    1 | 32   | 32   | 0.5
    1 | 32.2 | 32.2 | 0.5
Υ-DB enables data-oriented uncertainty quantification of predicting variables.

36. Sum up: Υ-DB is a probabilistic database designed to manage hypotheses as data. In Υ-DB, both the intrinsic uncertainty of the model, Υ[R], and that of the predicting variables (e.g. Υ[a]) are automatically computed. Υ-DB and Research Lattices are the basis for managing hypotheses over Big Data in science (and, we believe, in any domain).

37. ORDERING HYPOTHESES

38.
Ordering Hypotheses: Different competing hypotheses must be placed into context according to their: phenomenon explanation capacity; predicting capability (predicting variables); assumptions and constraints.

39. Hypotheses in the Dark Energy Survey Project. Phenomenon: the expansion of the universe is accelerating; discovered in 1998 during supernovae investigation; supported by redshift observation of far-away supernovae. Hypotheses: a new behaviour, Dark Energy, pushes the acceleration; the density of the Universe is not uniform. Evidence: gravitational lenses; galaxy clusters.

40. Research Lattices structure the hypotheses of a phenomenon (diagram nodes: Dark Energy; Non-uniform universe; Weak lensing; Galaxy clustering; Earth special location). [B. Gonçalves, F. Porto, Research Lattices, AMW 2013]

41. Research Lattices: Each node is a hypothesis. Given two hypotheses h1 and h2 in a research lattice, if h1 precedes h2 in the partial order, then h1 is more general than h2. Top corresponds to all knowledge of a domain; Bottom is the empty representation, or lack of knowledge.

42. Research Lattice: Acceleration.
    h1: Law of free fall, $d^2s/dt^2 = 9.8$
    h2: Newton's First Law
    h3: Newton's Second Law, $F = m a_g$
    h4: Centripetal acceleration, $a_c = 4\pi^2 r / T^2$
    h5: Kepler's 3rd Law, $r^3 / T^2 = c$
    h6: Law of Universal Gravitation, $F_g = G M m / r^2$
    h7: Inverse-square law of distance, $a_c \propto 1 / r^2$

43. Research Lattice for the Human Cardio Vascular System

44. Research lattice operations. Add/delete hypotheses: consistently keep the partial ordering; automatic placement of hypotheses in the RL. Querying: find hypotheses based on the Free Fall hypothesis; find competing hypotheses with respect to the Dark Energy hypothesis. How to assess the predictive capacity of a hypothesis?

45. Sum up: A Research Lattice enables a formal yet bound representation of a research domain. Hypotheses are scientists' encoding of their interpretation of the studied phenomenon.

46.
MULTIDIMENSIONAL REPRESENTATION

47. Dealing with the Space-Time dimension on Hypotheses: Most phenomena occur in space-time. In the computational model, simulations use meshes to model the physical domain. Predicting variables are computed at a point of a multidimensional space (3D, 1D, 4D, etc.). The data representation is a multidimensional matrix [ArrayDB, SciDB, ...].

48. 1D and 3D mesh representations of a human artery

49. Multidimensional Array Representation: Is it efficient for processing queries over meshes such as the ones of the HCVS?

50. SciDB in 20 sec: The unit of representation is the multidimensional array; each dimension has a name and a size; a reference to all dimensions in an array leads to a cell; a cell has many attributes; columnar store; an array may be partitioned along its dimensions; two query languages: AQL and AFL.

51. [no text]

52. Loading Simulation data: Simulation output → Wrapper → Unidimensional array (Geometry3d_raw) → redimension_store → Multidimensional array (Geometry3d).

53. A multidimensional array in SciDB:
    CREATE ARRAY Geometry3d
      < velocity_x:double, velocity_y:double, velocity_z:double,
        pressure:double,
        displacement_x:double, displacement_y:double, displacement_z:double >
      [ simulation_number=0:9,1,0,
        time_step=0:30720,1920,0,
        x_axis=0:39,40,0,
        y_axis=0:39,40,0,
        z_axis=0:39,40,0 ]

54. Challenge: to map an irregular mesh into a regular array structure.

55. SimDB: We developed SimDB, a layer on top of SciDB to map irregular meshes into a regular array data representation.

56. [no text]

57. Experiment Results

58. Experiments set-up: 4 servers and 16 VMs; 4 queries. Configurations by number of VMs (rows) and number of servers (columns):
    VMs | 1 server | 2 servers   | 4 servers
    1   | 10GB     | -           | -
    2   | 5GB      | 5GB / 10GB  | -
    4   | 2.5GB    | 2.5GB / 5GB | 2.5GB / 5GB / 10GB
    8   | -        | 2.5GB       | 2.5GB / 5GB
    16  | -        | -           | 2.5GB

59.
Experiments: 8 SciDB instances per VM; 1, 2, 4, 8, 16 VMs; 7 queries; 2 arrays, (S, T) and (T, S); 30 executions.

60. Results (chart): execution time of Queries 1 to 4 for each VM (Server) configuration: 1(1), 2(1), 2(2), 4(1), 4(2), 4(4), 8(2), 8(4), 16(4); time axis from 00:00.00 to 04:19.20. Queries:
    1. select avg(pressure) from Geometry3d where time_step < 1920 group by simulation_number
    2. select avg(pressure) from Geometry3d group by simulation_number
    3. select avg(pressure) from Geometry3d group by time_step
    4. select avg(pressure) from Geometry3d where simulation_number < 2 group by time_step

61. Results. Queries:
    1. select avg(pressure) from Geometry3d where time_step < 1920 group by simulation_number
    2. select avg(pressure) from Geometry3d group by simulation_number
    3. select avg(pressure) from Geometry3d group by time_step
    4. select avg(pressure) from Geometry3d where simulation_number < 2 group by time_step
    5. select avg(pressure) from Geometry3d where simulation_number = 0 and time_step < 1920 group by time_step
    6. select avg(pressure) from Geometry3d where simulation_number < 0 and time_step < 0 group by time_step
    7. select avg(pressure) from Geometry3d where (time_step % 512) = 0 group by time_step

62. Final Remarks: We argue that Big Data analytics requires a scientific approach based on hypothesis formulation and follow-up. Υ-DB is an innovative approach for Big Data management: it reflects the hypothesis-as-data principle; it is formal and guards the equivalence between data and models; it models uncertainty in the model and in the data; it must be extended to cope with observation validation (Bayesian model) and to support multidimensional representation; read our paper at VLDB 2014. SimDB is an extension to SciDB to efficiently store irregular meshes on multidimensional array systems.

63.
Final Remarks: Space-time models require a multidimensional data model for hypothesis-as-data management. SciDB is a parallel multidimensional array DBMS. We want to extend Υ-DB to multidimensional arrays; this is still very immature.

64. This is DEXL team work: PhD candidate Bernardo Gonçalves, IBM PhD Fellowship 2013-2014 ([email protected]); Dr Ramon Gomes Costa ([email protected]); MSc student Hermano Lustosa ([email protected]).

65. Thank you! http://dexl.lncc.br

66. LNCC Meeting 2012

67. EMC Summer School 2013. Olympic Laboratory. Objective: to study high-performance sports as a science discipline; to build the first sports laboratory in South America. US$ 10M project sponsored by FINEP (funding agency). Departments: biochemistry, physiology, genetics, nutrition, computational modeling, computer science, physiology.

68. Our task: To support athletes' follow-up data: athletes' training; variation in biochemical elements; variation in biometric variables. More recently: for some modalities, integrate meteorological conditions.

69. Analyses Board

70. Athletes follow-up database: Athletes' follow-up data are modeled as trajectories, registering measurements from athletes in different training states. Trajectory model: ordered set of measurements; division of time into training states; materialized view limited in time range. Imprecise measurements: not detected = 0; < x -> ]0,x[; y , y x

71. More on Athletes Trajectories. Stops: modelled as measurements; qualified according to the athlete's training state; training states (recovery, training, rest, ...). Moves: extrapolation between two stops. Trajectory: the set of measurements, ordered in time and limited in time according to some criteria (e.g. a training program); measurements of the same observable element; measurements of the same athlete.

72. Metaphoric Trajectory

73.
[no text]

74. [no text]

75. Challenges: Integrating athletes' trajectories with weather information. How to efficiently store metaphoric trajectories? TrajStore [Cudré-Mauroux et al., ICDE 2010]; SciDB. How to express and efficiently process similar trajectories?
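
The trajectory model from the athletes follow-up slides (stops as measurements, moves between consecutive stops, a trajectory as the time-ordered, time-limited measurements of one athlete and one observable element) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the DEXL implementation: the class `Measurement`, its field names, and the helpers `below_threshold`, `trajectory` and `moves` are all hypothetical, and an imprecise reading is assumed to be stored as a (low, high) interval.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Tuple


@dataclass(frozen=True)
class Measurement:
    """One 'stop' of a metaphoric trajectory (all field names are hypothetical)."""
    athlete: str
    element: str        # observable element, e.g. a biochemical marker
    taken_at: datetime
    low: float          # imprecise value stored as an interval (low, high)
    high: float
    state: str          # training state: "recovery", "training", "rest", ...


def below_threshold(x: float) -> Tuple[float, float]:
    """A reading reported as '< x' is kept as the interval ]0, x[."""
    return (0.0, x)


def trajectory(measurements: List[Measurement], athlete: str, element: str,
               start: datetime, end: datetime) -> List[Measurement]:
    """Measurements of one athlete and one element within [start, end], ordered in time."""
    selected = [m for m in measurements
                if m.athlete == athlete and m.element == element
                and start <= m.taken_at <= end]
    return sorted(selected, key=lambda m: m.taken_at)


def moves(stops: List[Measurement]) -> List[Tuple[Measurement, Measurement]]:
    """Consecutive pairs of stops; a 'move' is what gets extrapolated between them."""
    return list(zip(stops, stops[1:]))
```

Calling `trajectory(...)` with the date range of a training program gives the "materialized view limited in time range" of slide 70, and `moves(...)` gives the pairs between which values would be extrapolated. How to store and compare such trajectories efficiently is exactly the open question of slide 75.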