Developing Machine Learning Applications
Geoff Holmes, University of Waikato
Jun 12, 2015
Outline
• What application development have we done?
• What lessons have we learned?
• What is needed in terms of the future of machine learning application development?
Applications – a taxonomy
• UCI data sets – very much like our early agricultural data
• Competition data – usually larger than above, often difficult
• Signal control applications (often involve reinforcement learning) – eg autonomous helicopters, vehicles, learning the signature of a great pianist, learning to sail, learning to drive racing cars faster, learning to play soccer (often linked to robotics)
• Key to success = objective measurement – eg Human Computer Interaction, Speech and Image Recognition, Computer Games, etc.
WEKA – Waikato Environment for Knowledge Analysis
• Machine Learning at Waikato started in 1993
• Build an interface to enable several ML methods to be compared on the same data
• Explore datasets of importance to the agricultural sector in NZ
• Apple bruising, Venison bruising, Bull behaviour, Grass grubs, Pasture production, Pea seed colour, Slugs, Squash harvest, Wasp nests, White clover persistence
• Cow culling
• Datasets were very much of the “bring out your dead” variety
WEKA – unscientific study from Google Scholar
• For the query “WEKA applications”
• Bioinformatics
• Grid Computing
• Medicine
• Business and Finance
• Computer Networks
• Education
Early lessons learned
• Using WEKA is good but only static solutions are possible
• Datasets need to be large enough to yield significant and meaningful results
• Datasets involving human judgement tend to be unreliable
Scientific Equipment Application Methodology
• Obtain samples and reference data from existing technology (eg wet chemistry) – establish targets Y.
• Process same samples using a proxy (eg NIR) – new X
• Construct new dataset with new X and Y
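The three steps above amount to pairing each proxy measurement with its reference value, sample by sample. A minimal Python sketch, assuming both sources share a sample ID; the names `build_dataset`, `reference` and `spectra` are hypothetical, and the numbers are made up for illustration:

```python
def build_dataset(reference, spectra):
    """Pair each proxy spectrum (X) with its reference value (Y) by sample ID.

    Samples that are missing from either source are skipped.
    """
    rows = []
    for sample_id in sorted(reference.keys() & spectra.keys()):
        rows.append((sample_id, spectra[sample_id], reference[sample_id]))
    return rows


# Illustrative values only: reference targets (Y) and proxy readings (X).
reference = {"s1": 4.2, "s2": 3.9, "s3": 5.1}
spectra = {"s1": [0.11, 0.35], "s2": [0.10, 0.31], "s4": [0.09, 0.28]}

dataset = build_dataset(reference, spectra)
```

Only samples measured by both technologies (here `s1` and `s2`) end up in the new dataset.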
Near Infrared Spectroscopy
• Once concept was proven we needed a system to support commercial use (ie alongside the LIMS)
• Developed S2 (with WEKA interface):
• Used continuously at Hill Laboratories and BLGG (Holland) since around 2005 – never gone wrong!
• So far it is the best application of the technology that we have ever come across.
• Faster than wet chemistry
• Predictions can be more accurate
• Large cost savings – multiple analyses per sample
S2
NIR – lessons learned
• A very lightweight input/output solution using a drop-box methodology was successful, as it is transparent and seamless alongside a LIMS.
• Instrument data is extremely reliable
• In this industry compliance is important, which implies that a single algorithm is better than choosing the best method per dataset.
• As data is abundant, models are rebuilt from time to time.
• No facility for users to develop new applications.
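The slides do not describe how the drop-box in S2 is implemented. As a rough illustration of the idea only, the Python sketch below assumes an input folder that receives one CSV file per sample and an output folder where predictions are written; `process_dropbox`, `mean_model` and the file layout are all hypothetical:

```python
import tempfile
from pathlib import Path


def process_dropbox(inbox, outbox, predict):
    """Predict for every CSV file in `inbox` and write results to `outbox`.

    Each input file is removed once its result file has been written.
    """
    handled = []
    for path in sorted(Path(inbox).glob("*.csv")):
        values = [float(v) for v in path.read_text().strip().split(",")]
        result = Path(outbox) / (path.stem + ".result")
        result.write_text(f"{predict(values):.3f}\n")
        path.unlink()
        handled.append(result.name)
    return handled


def mean_model(values):
    # Stand-in for a trained model: returns the mean of the inputs.
    return sum(values) / len(values)


with tempfile.TemporaryDirectory() as root:
    inbox, outbox = Path(root, "in"), Path(root, "out")
    inbox.mkdir()
    outbox.mkdir()
    (inbox / "sample1.csv").write_text("1.0,2.0,3.0\n")
    handled = process_dropbox(inbox, outbox, mean_model)
    written = (outbox / "sample1.result").read_text()
    inbox_empty = not any(inbox.iterdir())
```

The appeal of this style is that the surrounding system only needs to write and read files, so nothing on the LIMS side has to know about the learning component.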
Gas Chromatography Mass Spectrometry
• Analytical instrument that combines the features of gas chromatography and mass spectrometry to identify different substances within a test sample
• Typical Applications
• Environmental monitoring
• Food and beverage analysis
• Criminal forensics (CSI!)
• Drugs/explosives detection
Example Chromatogram (PAH) – ion counts
MS fingerprints
Machine Learning Approach
• Chromatograms are pre-processed to extract features
• Dataset constructed by combining pre-processed chromatograms with analyst-checked compound concentrations
• Learn the relationship between pre-processed chromatograms and compound concentrations:
• extensive pre-processing of data
• parallel processing – 5000 * 300 values per instance (NIR = 1000)
• pre-processing varies among compounds
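At 5000 * 300 values, a single instance holds 1.5 million numbers, so some reduction step is needed before learning. The slides do not say which features are extracted; the Python sketch below is one simple possibility, not the pre-processing used in this work. It assumes a chromatogram is a list of scans, each a list of ion counts per channel, and the names `total_ion_profile` and `bin_profile` are hypothetical:

```python
def total_ion_profile(chromatogram):
    """Sum the ion counts across all channels for each scan."""
    return [sum(scan) for scan in chromatogram]


def bin_profile(profile, n_bins):
    """Reduce a profile to n_bins features by averaging contiguous segments."""
    if not 0 < n_bins <= len(profile):
        raise ValueError("n_bins must be between 1 and len(profile)")
    features = []
    for b in range(n_bins):
        # Integer arithmetic keeps the segment boundaries exact.
        start = b * len(profile) // n_bins
        end = (b + 1) * len(profile) // n_bins
        segment = profile[start:end]
        features.append(sum(segment) / len(segment))
    return features


# Toy chromatogram: 4 scans by 3 channels (illustrative values only).
chromatogram = [[1, 2, 3], [0, 0, 0], [4, 4, 4], [2, 2, 2]]
profile = total_ion_profile(chromatogram)
features = bin_profile(profile, 2)
values_per_instance = 5000 * 300
```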
Process Requirements
Solution = ADAMS (Advanced DAta Mining System)
• get database IDs of chromatograms
• load chromatograms from DB
• identify and reject outliers
• obtain calibration set information, check correctness of set
• align with calibration chromatogram, check correlation
• compound-specific outlier detection
• generate artificial chromatogram with peaks of compound and spike compound
• generate output for WEKA
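The last step of the workflow hands data to WEKA, whose native file format is ARFF. How the system produces that output is not shown in the slides; the Python sketch below only illustrates what a minimal numeric ARFF file looks like. The function `to_arff` and the attribute names are hypothetical, and the values are made up:

```python
def to_arff(relation, attribute_names, rows):
    """Render numeric rows as a minimal ARFF document."""
    lines = [f"@relation {relation}", ""]
    for name in attribute_names:
        lines.append(f"@attribute {name} numeric")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(v) for v in row))
    return "\n".join(lines) + "\n"


# Illustrative instance layout: one feature plus the target concentration.
arff_text = to_arff(
    "gcms",
    ["peak_area", "concentration"],
    [[120.5, 0.8], [98.0, 0.6]],
)
```

In WEKA the last attribute is conventionally treated as the class, which is why the target is listed last here.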
Limitations and future directions
• What we have seen so far works with data resident in memory (RAM) all the time
• This implies a limit can easily be reached, esp in applications like GCMS.
• We would like to be able to learn from potentially infinite data sources but with finite memory (RAM).
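Learning from an unbounded source with finite memory means each instance is seen once and then discarded, with only a fixed-size summary kept. The Python sketch below illustrates this with a one-variable least-squares fit maintained from five running sums. It is a toy example under that assumption, not MOA's API, and the class name `StreamingLinearRegression` is hypothetical:

```python
class StreamingLinearRegression:
    """Least-squares line y = a + b*x maintained from running sums only."""

    def __init__(self):
        self.n = 0
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def update(self, x, y):
        # Constant work and constant memory per instance.
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def coefficients(self):
        denom = self.n * self.sxx - self.sx ** 2
        if denom == 0:
            raise ValueError("need at least two distinct x values")
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return intercept, slope


def stream(n):
    # Generator: instances are produced one at a time and never stored.
    for i in range(n):
        yield float(i), 2.0 * i + 1.0


model = StreamingLinearRegression()
for x, y in stream(1000):
    model.update(x, y)
intercept, slope = model.coefficients()
```

The memory used by `model` is the same after one instance as after a million, which is the property the bullet above asks for.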
Solution = MOA (Massive Online Analysis)
Future Directions
• Investigate how to get users to deploy their own DM solutions
• Implement incremental pre-processing techniques (Joao has already started!), eg incremental outlier detection.
• Implement incremental algorithms, especially for regression.
• Encourage work on abstention classifiers, uncertainty associated with point predictions etc.
• Meta-mine which units of a workflow are useful in tandem
• Investigate fusion: ADAMS with MOA, data (image+features), tasks (multiview, multitask, transfer)
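As an illustration of what incremental outlier detection could look like, the Python sketch below keeps a running mean and variance (Welford's method) and flags values that lie more than a fixed number of standard deviations from the mean. This is a generic technique chosen for the example, not the approach referred to above; `IncrementalOutlierDetector`, `threshold` and `warmup` are hypothetical names:

```python
import math


class IncrementalOutlierDetector:
    """Flag values far from the running mean, using constant memory."""

    def __init__(self, threshold=3.0, warmup=10):
        self.threshold = threshold
        self.warmup = max(warmup, 2)  # variance needs at least two values
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def is_outlier(self, x):
        if self.n < self.warmup:
            return False
        std = math.sqrt(self.m2 / (self.n - 1))
        return std > 0 and abs(x - self.mean) > self.threshold * std

    def update(self, x):
        # Welford's update of the mean and squared deviations.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def process(self, x):
        """Test x against the statistics so far; learn from it only if it is not flagged."""
        flagged = self.is_outlier(x)
        if not flagged:
            self.update(x)
        return flagged


# Illustrative stream: steady readings with one gross error at position 20.
values = [10.0, 10.2, 9.8, 10.1, 9.9] * 4 + [25.0, 10.0]
detector = IncrementalOutlierDetector()
outlier_positions = [i for i, v in enumerate(values) if detector.process(v)]
```

Skipping flagged values in `update` keeps a single gross error from inflating the variance, at the cost of adapting more slowly if the data genuinely drift.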
Finally
Questions or Comments?