Closing the Loop Evaluating Big Data Analysis Karolina Alexiou
Transcript
Page 1: Evaluation of big data analysis

Closing the Loop
Evaluating Big Data Analysis
Karolina Alexiou

Page 2: Evaluation of big data analysis

About
The speaker
● ETH graduate
● Joined Teralytics in September 2013
● Data Scientist/Software Engineer
The talk (takeaways)
● Point out how evaluation can improve your project
● Suggest concrete steps to build an evaluation framework

Page 3: Evaluation of big data analysis

The value of evaluation

Data analysis can be fun and exploratory, BUT:

“If you torture the data long enough, it will confess to anything.”

-Ronald Coase, economist

Page 4: Evaluation of big data analysis

The value of evaluation

Without feedback on the data analysis results (= closing the loop), I don’t know whether my fancy algorithm is better than a naive one.

How to measure?

Page 5: Evaluation of big data analysis

Strategy
People-driven
● Get a 2nd opinion on your methodology
Data-driven
● Get another data source to verify results (ground truth)
● Convert ground truth and your output to the same format
● Compare against meaningful metric
● Store & visualize results
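
The data-driven steps can be sketched as a small harness that scores a naive baseline and a candidate algorithm against the same ground truth and stores the result. This is only an illustration: the function names, the toy numbers, the two stand-in algorithms and the 10% tolerance are all assumptions, not taken from the talk.

```python
import csv
import io


def fraction_within(ground_truth, estimates, tolerance):
    """Share of points whose relative error against ground truth is below tolerance."""
    hits = sum(
        1
        for truth, est in zip(ground_truth, estimates)
        if abs(est - truth) / truth < tolerance
    )
    return hits / len(ground_truth)


def evaluate(algorithms, inputs, ground_truth, tolerance=0.1):
    """Run every algorithm on the same inputs and score it with the same metric."""
    return {
        name: fraction_within(ground_truth, [algo(x) for x in inputs], tolerance)
        for name, algo in algorithms.items()
    }


def to_csv(scores):
    """Serialize scores so that runs can be stored and compared later."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["algorithm", "score"])
    for name, score in sorted(scores.items()):
        writer.writerow([name, f"{score:.2f}"])
    return buffer.getvalue()


# Hypothetical toy data: raw inputs and the values an ideal algorithm would output.
inputs = [10.0, 20.0, 30.0, 40.0]
ground_truth = [21.0, 40.0, 62.0, 79.0]

# Hypothetical stand-ins for "naive" and "fancy" algorithms.
algorithms = {
    "naive": lambda x: x + 10.0,
    "fancy": lambda x: 2.0 * x,
}

scores = evaluate(algorithms, inputs, ground_truth)
print(to_csv(scores))
```

Because both algorithms go through the same `evaluate` call, a change in score can be attributed to the algorithm rather than to a change in the comparison itself.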

Page 6: Evaluation of big data analysis

General evaluation framework

Page 7: Evaluation of big data analysis

General evaluation framework

Statistical significance?

Page 8: Evaluation of big data analysis

Teralytics Case Study: Congestion Estimation

Ongoing project: Use of cellular data to estimate traffic/congestion on Swiss roads

Our estimations: Mean speed on a highway at a given time, given location

Page 9: Evaluation of big data analysis

Ground truth
● Complex algorithm with lots of knobs and subproblems
● How to know we’re changing things for the better?

● Collect ground truth regarding road traffic in Switzerland -> sensor data available from 3rd party site

● Write hackish script to log in to the website and fetch sensor data that matches our highway locations

● Instant sense of purpose :)

Page 10: Evaluation of big data analysis

Same format

Not just a data architecture problem.

● Our algorithm’s speed estimations are fancy averages of distance/time_needed_for_distance (journey speed)

● Sensor data reports instantaneous speed.
● Sensors are probably going to report higher speeds systematically (bias).
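
One way such a bias can arise is shown in the sketch below. It assumes the journey speed is total distance over total travel time and that the sensor reports the plain average of the speeds of passing vehicles. Both are simplifying assumptions for illustration; the actual algorithm and sensors may aggregate differently. The vehicle speeds are made up.

```python
def journey_speed(segment_km, vehicle_speeds_kmh):
    """Total distance covered divided by total time needed (journey speed)."""
    total_distance_km = segment_km * len(vehicle_speeds_kmh)
    total_time_h = sum(segment_km / speed for speed in vehicle_speeds_kmh)
    return total_distance_km / total_time_h


def sensor_mean_speed(vehicle_speeds_kmh):
    """Plain average of the instantaneous speeds seen at one point on the road."""
    return sum(vehicle_speeds_kmh) / len(vehicle_speeds_kmh)


# Hypothetical vehicles crossing a 1 km segment at constant speeds (km/h).
speeds = [40.0, 60.0, 120.0]

journey = journey_speed(1.0, speeds)
sensor = sensor_mean_speed(speeds)
print(f"journey speed: {journey:.1f} km/h, sensor mean: {sensor:.1f} km/h")
```

Slow vehicles spend more time on the segment, so they weigh more heavily in the journey speed than in the sensor average. Under these assumptions the sensor figure is never lower than the journey speed.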

Page 11: Evaluation of big data analysis

Comparing against metric

● Group data every 3 minutes
● Metric: Percentage of data where the difference between ground truth and estimation is <7%
● Other options
  ○ linear correlation of time-series of speed
  ○ cross-correlation to find optimal time shift
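
A minimal pandas sketch of the metric and of the time-shift search could look as follows. The function names, the made-up speed values and the search range are assumptions. The difference is read here as relative to the ground truth value, which is one possible interpretation of "<7%".

```python
import pandas as pd


def share_within_tolerance(truth, estimate, freq="3min", tolerance=0.07):
    """Fraction of time bins where |estimate - truth| / truth is below tolerance."""
    binned = pd.concat(
        {
            "truth": truth.resample(freq).mean(),
            "estimate": estimate.resample(freq).mean(),
        },
        axis=1,
    ).dropna()  # keep only bins where both sources have data
    relative_error = (binned["estimate"] - binned["truth"]).abs() / binned["truth"]
    return float((relative_error < tolerance).mean())


def best_time_shift(truth, estimate, max_shift=2):
    """Shift of the estimate (in bins) that maximizes linear correlation with truth."""
    correlations = {
        shift: truth.corr(estimate.shift(shift))
        for shift in range(-max_shift, max_shift + 1)
    }
    return max(correlations, key=correlations.get), correlations


# Hypothetical per-minute speeds in km/h (made-up numbers, not real measurements).
minutes = pd.date_range("2014-01-01 08:00", periods=12, freq="1min")
truth = pd.Series([100.0] * 3 + [80.0] * 3 + [60.0] * 3 + [90.0] * 3, index=minutes)
estimate = pd.Series([104.0] * 3 + [90.0] * 3 + [62.0] * 3 + [85.0] * 3, index=minutes)

share = share_within_tolerance(truth, estimate)
print(f"share of 3-minute bins within 7%: {share:.2f}")

# Hypothetical binned series where the estimate lags the truth by exactly one bin.
bins = pd.date_range("2014-01-01 08:00", periods=8, freq="3min")
truth_binned = pd.Series([100.0, 80.0, 60.0, 90.0, 110.0, 70.0, 95.0, 85.0], index=bins)
late_estimate = truth_binned.shift(1)

best_shift, correlations = best_time_shift(truth_binned, late_estimate)
print(f"best shift: {best_shift} bin(s)")
```

A negative best shift means the estimate has to be moved earlier to line up with the ground truth, that is, the estimate is lagging.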

Page 12: Evaluation of big data analysis

Pitfalls of comparison

● Overfitting to ground truth
● Correlation may be statistically insignificant

Need proper methodology (training set/testing set) & adequate amounts of ground truth
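
To illustrate why adequate amounts of ground truth matter, here is a sketch of a permutation test for the correlation between estimate and ground truth. The permutation test is a technique swapped in for illustration, not something the talk prescribes, and all numbers are made up. It treats the points as exchangeable, which consecutive time-series samples usually are not, so it should be read as an optimistic check.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def permutation_p_value(xs, ys, n_permutations=2000, seed=0):
    """Share of random pairings whose |correlation| reaches the observed one."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson(xs, shuffled)) >= observed:
            at_least_as_extreme += 1
    # The +1 terms count the observed pairing itself and avoid a p-value of zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical speeds in km/h: ground truth and an estimate that tracks it closely.
truth = [100.0, 80.0, 60.0, 90.0, 110.0, 70.0, 95.0, 85.0, 75.0, 105.0]
estimate = [98.0, 83.0, 62.0, 88.0, 107.0, 73.0, 96.0, 82.0, 77.0, 103.0]

p_many = permutation_p_value(truth, estimate)
p_few = permutation_p_value(truth[:3], estimate[:3])
print(f"10 points: p = {p_many:.4f}, 3 points: p = {p_few:.4f}")
```

With only three points, even a near-perfect correlation cannot be distinguished from chance, because a large share of the possible pairings look just as good.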

Page 13: Evaluation of big data analysis

Visualization
● Instant feedback on what is working and what is not.
● Insights
  ○ on assumptions
  ○ on quality of data sources
  ○ presence of time shift

Page 14: Evaluation of big data analysis

Lessons learned

Ground truth isn’t easy to get
● No API - web scraping

● May be biased

● May have to create it yourself

Page 15: Evaluation of big data analysis

Lessons learned

Use the right tools
● The output of a Big Data analysis problem is of more manageable size -> no need to overengineer, Python is fitting for the job

● Need to be able to handle missing data / add constraints / average / interpolate -> use existing library (pandas) with useful abstractions

● Crucial to be able to pinpoint what goes wrong -> interactivity (IPython), logging
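
As an illustration of the pandas abstractions mentioned above, the sketch below applies a constraint, interpolates short gaps and reports how much data is usable. The readings, the 200 km/h plausibility bound and the one-bin gap limit are hypothetical choices, not values from the project.

```python
import pandas as pd

nan = float("nan")

# Hypothetical sensor readings in km/h, one per 3-minute bin, with gaps and one outlier.
index = pd.date_range("2014-01-01 08:00", periods=8, freq="3min")
raw = pd.Series([100.0, nan, 90.0, 250.0, 60.0, nan, nan, 70.0], index=index)

# Constraint: treat readings outside an assumed plausible range as missing.
plausible = raw.where(raw.between(0.0, 200.0))

# Interpolate along the time axis, filling at most one consecutive missing bin,
# so that longer outages stay visible as missing data.
filled = plausible.interpolate(method="time", limit=1)

coverage = filled.notna().mean()
print(filled)
print(f"usable bins after cleaning: {coverage:.0%}")
```

Leaving long gaps unfilled is a deliberate choice in this sketch: interpolating across an outage would produce values that look like measurements but are not.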

Page 16: Evaluation of big data analysis

Lessons learned

Use the right workflow
● Run the whole thing at once for timely feedback

● Always visualize -> large CSVs are hard to make sense of (false sense of security)

● Iterative development pays off & is sped up by automated evaluation :)

Page 17: Evaluation of big data analysis

Action Points

Ask questions
● Is there some part of my data analysis where my results are unverified?

● Am I using the right tools to evaluate?

● Is overengineering getting in the way of quick & timely feedback?

Page 18: Evaluation of big data analysis

Action Points

Make a plan
● What ground truth can I get or create?
● How can I make sure I am comparing apples to apples?
● How should I compare my data to the ground truth (metric, comparison method)?
● What’s the best visualization to show correlation?

Page 19: Evaluation of big data analysis

Recommended Reading
● Excellent abstractions for data cleaning & transformation
● Good performance
● Portable data formats
● Increases productivity
● +ipython for easy exploring of the data (more insight, what went wrong etc)

It takes some time to learn to use the full power of pandas - so get your data scientists to learn it asap. :)

Page 20: Evaluation of big data analysis

Recommended Reading
● Even new companies have “legacy” code (code that is blocking change)

● Acknowledges the imperfection of the real world (even if design is good, problems may arise)

● Acknowledges the value of quick feedback in dev productivity

● Case-by-case scenarios to unblock yourself and be able to evaluate your code

Page 21: Evaluation of big data analysis

Recommended Reading

Page 22: Evaluation of big data analysis

Thanks

I would like to thank my colleagues for making good decisions, in particular
● Valentin for introducing pandas to Teralytics

● Nima for organizing the collection of ground truth on several projects

● Laurent for insisting on testing & best practices

Page 23: Evaluation of big data analysis

Questions?

We are hiring :)
Looking for Machine Learning/Big Data experts

Experience with pandas is a plus
Just send your CV to [email protected]

Page 24: Evaluation of big data analysis

Bonus Recommended Reading

Evaluation of impact of charity organizations is a hard, unsolved problem involving data

● transparency
● more motivation to give