Big Data and Large Scale Data Analysis
Andrew Mead, School of Life Sciences
23rd October 2013
Jan 04, 2016
Big Data
• Modern technologies make it increasingly easy to collect large quantities of data
– ‘Omics revolution
– Remote sensing
– Weather (and hence climate change applications)
– Internet applications
– Social networking
– Shopping preferences
– Health applications
– …
• But how do we make the most of these data?
Gene expression microarrays
• Data on many thousands of genes (spots) on each array
• Comparisons of multiple samples (treatments, time, individual plants or animals, …)
• Processing of data for each gene separately or in combination
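As an illustration of processing each gene separately, the sketch below computes a Welch t statistic per gene for a two-group comparison. The gene names, the expression values, and the helper names `welch_t` and `per_gene_t` are hypothetical and not taken from the slides; a real microarray analysis would also need normalisation and an adjustment for testing many thousands of genes at once.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Hypothetical log-expression values: gene -> (control arrays, treated arrays)
expression = {
    "gene_A": ([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]),
    "gene_B": ([2.0, 2.5, 3.0], [2.1, 2.4, 3.1]),
}


def per_gene_t(data):
    """Apply the same test to every gene independently."""
    return {
        gene: welch_t(control, treated)
        for gene, (control, treated) in data.items()
    }


t_stats = per_gene_t(expression)
```

Because each gene is handled independently, this step could be split across processors without changing the result.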
Landscape data
• Land-use/cover for each land-parcel
• Basis for simulation studies of changes in land-use
• Summary of spatial data into simple statistics
[Figure: JCA101 - Simulation01 - Run001 - Year2009]
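One simple statistic for a land-use map is the proportion of parcels in each class. The sketch below assumes the map is stored as a small grid of class labels; the grid, the class names, and the function name `land_use_proportions` are hypothetical.

```python
from collections import Counter

# Hypothetical land-use raster: each cell holds the class of one land parcel
grid = [
    ["arable", "arable", "grass"],
    ["arable", "arable", "grass"],
    ["wood", "wood", "grass"],
]


def land_use_proportions(raster):
    """Return the fraction of cells belonging to each land-use class."""
    counts = Counter(cell for row in raster for cell in row)
    total = sum(counts.values())
    return {land_class: n / total for land_class, n in counts.items()}


proportions = land_use_proportions(grid)
```

The same summary could be computed for every simulated year and run, reducing each map to a handful of numbers that are easy to compare.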
Challenges
• Storage of big data sets
• Management
– Structured
– Unstructured
• Analysis
– Often similar questions as for smaller data sets
– Computationally intractable as data volume increases
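One way to keep a familiar question tractable as volume grows is to answer it in a single pass, without holding the data in memory. The sketch below uses Welford's online algorithm for the mean and sample variance; it is offered as one possible illustration rather than a method named in the slides, and the function name `running_mean_variance` is hypothetical.

```python
def running_mean_variance(stream):
    """One-pass mean and sample variance (Welford's online algorithm)."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / (n - 1) if n > 1 else float("nan")
    return n, mean, variance


# The generator yields one value at a time, so memory use stays constant
n, mean, variance = running_mean_variance(float(i) for i in range(1, 1001))
```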
Multivariate Statistics and Data Mining
• Dimensionality reduction
– Find the important combinations of variables
– Use these in models
• Use computing power to search for “patterns”
• Challenge in connecting the analysis process to the data
• Distributed computing, massively parallel processing (MPP), machine learning, search-based applications (SBA), …
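Principal component analysis is one common way to find the important combinations of variables mentioned under dimensionality reduction. The sketch below finds the first principal component by power iteration on the sample covariance matrix, using only the standard library. The data and the function name `first_principal_component` are hypothetical, and the fixed starting vector is an assumption that would fail if it happened to be orthogonal to the leading component.

```python
from math import sqrt


def first_principal_component(rows, iterations=200):
    """Return (loadings, scores) for the first principal component."""
    n = len(rows)
    p = len(rows[0])
    # Centre each variable on its mean
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix (p x p)
    cov = [
        [sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(p)]
        for i in range(p)
    ]
    # Power iteration: repeatedly multiply by cov and renormalise
    v = [1.0 / sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each centred observation onto the component
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centred]
    return v, scores


# Hypothetical data: the second variable is roughly twice the first
data = [
    [1.0, 2.0, 0.5],
    [2.0, 4.1, 0.4],
    [3.0, 5.9, 0.6],
    [4.0, 8.0, 0.5],
]
loadings, scores = first_principal_component(data)
```

The scores form a single derived variable that could stand in for the original three in a subsequent model.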
Statistics and Big Data
• Computing power is probably crucial!
• But statistical approaches are important
– Designing the data collection
  • Sub-sampling?
– Defining the problem
– Managing the data
– Dimension reduction
• Finding the signal amidst the noise!
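Sub-sampling can be done even when the full data set is too large to load or its size is not known in advance. The sketch below uses reservoir sampling (Algorithm R) to draw a uniform random sample in one pass; it is one possible approach, not one named in the slides, and the function name `reservoir_sample` and the example stream are hypothetical.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Draw k items uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1)
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical stream of 100,000 record identifiers
sample = reservoir_sample(range(100_000), k=5, seed=1)
```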