Machine Learning Application Development

Developing Machine Learning Applications

Geoff Holmes, University of Waikato

Outline

• What application development have we done?

• What lessons have we learned?

• What is needed in terms of the future of machine learning application development?

Applications – a taxonomy

• UCI data sets – very much like our early agricultural data

• Competition data – usually larger than above, often difficult

• Signal control applications (often involve reinforcement learning) – eg autonomous helicopters, vehicles, learning the signature of a great pianist, learning to sail, learning to drive racing cars faster, learning to play soccer (often linked to robotics)

• Key to success = objective measurement – eg Human Computer Interaction, Speech and Image Recognition, Computer Games, etc.

WEKA Waikato Environment for Knowledge Analysis

• Machine Learning at Waikato started in 1993

• Build an interface to enable several ML methods to be compared on the same data

• Explore datasets of importance to the agricultural sector in NZ

• Apple bruising, Venison bruising, Bull behaviour, Grass grubs, Pasture production, Pea seed colour, Slugs, Squash harvest, Wasp nests, White clover persistence

• Cow culling

• Datasets very much of the “bring out your dead” variety

WEKA – unscientific study from Google Scholar

• For the query “WEKA applications”

• Bioinformatics

• Grid Computing

• Medicine

• Business and Finance

• Computer Networks

• Education

Early lessons learned

• Using WEKA is good but only static solutions are possible

• Datasets need to be large enough to yield significant and meaningful results

• Datasets involving human judgement tend to be unreliable

Scientific Equipment Application Methodology

• Obtain samples and reference data from existing technology (eg wet chemistry) – establish targets Y.

• Process same samples using a proxy (eg NIR) – new X

• Construct new dataset with new X and Y
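
The three steps above amount to pairing each proxy measurement with its reference value by sample. A minimal sketch in Python, assuming samples carry a shared ID; the function name `build_dataset`, the sample IDs and all numbers are hypothetical and only illustrate the idea:

```python
from typing import Dict, List, Tuple


def build_dataset(
    reference: Dict[str, float],
    spectra: Dict[str, List[float]],
) -> Tuple[List[List[float]], List[float], List[str]]:
    """Pair each proxy measurement (X) with its reference target (Y) by sample ID."""
    X, y, ids = [], [], []
    for sample_id in sorted(reference):
        if sample_id not in spectra:
            continue  # sample was never processed by the proxy instrument
        X.append(spectra[sample_id])
        y.append(reference[sample_id])
        ids.append(sample_id)
    return X, y, ids


# Hypothetical values, for illustration only.
reference = {"s1": 2.4, "s2": 3.1, "s3": 1.9}   # targets Y, e.g. from wet chemistry
spectra = {                                      # new X, e.g. NIR readings
    "s1": [0.10, 0.20, 0.30],
    "s3": [0.20, 0.10, 0.40],
}

X, y, ids = build_dataset(reference, spectra)
print(ids, y)
```

Samples that lack a proxy measurement are dropped rather than imputed, so every row of the new dataset has both an X and a Y.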

Near Infrared Spectroscopy

• Once concept was proven we needed a system to support commercial use (ie alongside the LIMS)

• Developed S2 (with WEKA interface):

• Used continuously at Hill Laboratories and BLGG (Holland) since around 2005 – never gone wrong!

• So far it is the best application of the technology that we have ever come across.

• Faster than wet chemistry

• Predictions can be more accurate

• Large cost savings – multiple analyses per sample

S2

NIR – lessons learned

• A very lightweight input/output solution using a dropbox methodology was successful because it is transparent and works seamlessly alongside a LIMS.

• Instrument data is extremely reliable

• In this industry, compliance is important, which implies that a single algorithm is better than choosing the best method per dataset.

• As data is abundant, models are rebuilt from time to time.

• No facility for users to develop new applications.
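
One way to read the dropbox methodology is as a pair of folders: the instrument side drops a file into an inbox, and predictions appear in an outbox. The sketch below assumes that reading; the file layout (one comma-separated spectrum per `.csv` file), the `.result` suffix, and the placeholder `mean_model` are all hypothetical and are not taken from S2:

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def process_dropbox(
    inbox: Path,
    outbox: Path,
    predict: Callable[[List[float]], float],
) -> int:
    """Process every file waiting in `inbox` once; return how many were handled."""
    outbox.mkdir(parents=True, exist_ok=True)
    handled = 0
    for path in sorted(inbox.glob("*.csv")):
        values = [float(v) for v in path.read_text().strip().split(",")]
        result = predict(values)
        (outbox / f"{path.stem}.result").write_text(f"{result:.4f}\n")
        path.unlink()  # remove the input so it is not processed twice
        handled += 1
    return handled


def mean_model(values: List[float]) -> float:
    """Placeholder standing in for a trained model."""
    return sum(values) / len(values)


with tempfile.TemporaryDirectory() as tmp:
    inbox, outbox = Path(tmp) / "in", Path(tmp) / "out"
    inbox.mkdir()
    (inbox / "sample1.csv").write_text("1.0,2.0,3.0\n")
    handled = process_dropbox(inbox, outbox, mean_model)
    result_text = (outbox / "sample1.result").read_text()

print(handled, result_text.strip())
```

Because the only interface is the file system, the surrounding laboratory software needs no knowledge of the model behind it.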

Gas Chromatography Mass Spectrometry

• Analytical instrument that combines the features of gas chromatography and mass spectrometry to identify different substances within a test sample

• Typical Applications

• Environmental monitoring

• Food and beverage analysis

• Criminal forensics (CSI!)

• Drugs/explosives detection

Example Chromatogram (PAH) – ion counts

MS fingerprints

Machine Learning Approach

• Chromatograms are pre-processed to extract features

• Dataset constructed combining pre-processed chromatograms with analyst checked compound concentrations

• Learn the relationship between pre-processed chromatograms and compound concentrations:

• extensive pre-processing of data

• parallel processing – 5000 * 300 values per instance (NIR = 1000)

• pre-processing varies among compounds
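
The slides do not say which pre-processing is applied, so the following is only an illustration of how a scans-by-channels instance can be reduced to a short feature vector. It assumes two generic steps, summing ion counts per scan and then binning; the function names and the toy 6 x 3 instance are hypothetical:

```python
from typing import List


def total_ion_chromatogram(scans: List[List[float]]) -> List[float]:
    """Sum ion counts over all m/z channels for each scan (time step)."""
    return [sum(scan) for scan in scans]


def bin_features(signal: List[float], n_bins: int) -> List[float]:
    """Reduce a signal to n_bins values by summing near-equal-width segments."""
    n = len(signal)
    return [
        sum(signal[i * n // n_bins:(i + 1) * n // n_bins])
        for i in range(n_bins)
    ]


# Toy instance: 6 scans x 3 m/z channels (a real instance is far larger).
scans = [
    [1.0, 0.0, 0.0],
    [2.0, 1.0, 0.0],
    [0.0, 3.0, 1.0],
    [0.0, 0.0, 4.0],
    [1.0, 1.0, 1.0],
    [0.0, 2.0, 0.0],
]

tic = total_ion_chromatogram(scans)
features = bin_features(tic, n_bins=3)
print(tic, features)
```

Since pre-processing varies among compounds, a real system would select steps and parameters per compound rather than use one fixed reduction.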

Process Requirements

Solution = Advanced DAta Mining System

• get database IDs of chromatograms

• load chromatograms from DB

• identify and reject outliers

• obtain calibration set information, check correctness of set

• align with calibration chromatogram, check correlation

• compound-specific outlier detection

• generate artificial chromatogram with peaks of compound and spike compound

• generate output for WEKA
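
The list above reads as a workflow: an ordered chain of steps, each consuming the output of the previous one. The sketch below shows that shape in plain Python for three of the steps (outlier rejection, alignment, output). It is not the ADAMS API; the record layout, step names, toy outlier criterion and peak-based alignment are all assumptions made for illustration:

```python
from typing import Any, Callable, Dict, List

Chromatogram = Dict[str, Any]  # hypothetical record: {"id": str, "signal": [float, ...]}
Step = Callable[[List[Chromatogram]], List[Chromatogram]]


def reject_outliers(max_total: float) -> Step:
    """Drop chromatograms whose summed signal exceeds max_total (toy criterion)."""
    def step(items: List[Chromatogram]) -> List[Chromatogram]:
        return [c for c in items if sum(c["signal"]) <= max_total]
    return step


def align_to(reference_peak: int) -> Step:
    """Shift each signal so its largest peak sits at index reference_peak."""
    def step(items: List[Chromatogram]) -> List[Chromatogram]:
        aligned = []
        for c in items:
            signal = c["signal"]
            shift = reference_peak - signal.index(max(signal))
            n = len(signal)
            shifted = [
                signal[i - shift] if 0 <= i - shift < n else 0.0
                for i in range(n)
            ]
            aligned.append({**c, "signal": shifted})
        return aligned
    return step


def run_workflow(items: List[Chromatogram], steps: List[Step]) -> List[Chromatogram]:
    """Apply each step in order to the output of the previous one."""
    for step in steps:
        items = step(items)
    return items


def to_csv_rows(items: List[Chromatogram]) -> List[str]:
    """Render each chromatogram as one comma-separated row."""
    return [",".join([c["id"]] + [f"{v:g}" for v in c["signal"]]) for c in items]


batch = [
    {"id": "a", "signal": [0.0, 5.0, 1.0, 0.0]},
    {"id": "b", "signal": [0.0, 0.0, 6.0, 1.0]},
    {"id": "c", "signal": [90.0, 80.0, 70.0, 60.0]},  # rejected by the toy criterion
]
result = run_workflow(batch, [reject_outliers(max_total=50.0), align_to(reference_peak=1)])
rows = to_csv_rows(result)
print(rows)
```

Keeping every step behind the same signature means steps can be reordered, swapped per compound, or inspected individually.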

Limitations and future directions

• What we have seen so far works with data resident in memory (RAM) all the time

• This implies a limit can easily be reached, esp in applications like GCMS.

• We would like to be able to learn from potentially infinite data sources but with finite memory (RAM).
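
Learning from an unbounded source with finite memory means each instance is seen once, used to update the model, and then discarded. A minimal sketch of that pattern, assuming a linear model updated by stochastic gradient descent; the class `OnlineLinearRegressor` and the synthetic `stream` are hypothetical and do not use any particular streaming library:

```python
import random
from typing import Iterator, List, Tuple


def stream(n: int, seed: int = 0) -> Iterator[Tuple[List[float], float]]:
    """Yield synthetic instances one at a time; nothing is kept in memory."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        yield x, 2.0 * x[0] - 1.0 * x[1] + 0.5  # noiseless target, for illustration


class OnlineLinearRegressor:
    """Linear model updated by one gradient step per instance."""

    def __init__(self, n_features: int, learning_rate: float = 0.05) -> None:
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x: List[float]) -> float:
        return self.bias + sum(w * v for w, v in zip(self.weights, x))

    def learn_one(self, x: List[float], y: float) -> None:
        error = self.predict(x) - y
        self.bias -= self.learning_rate * error
        self.weights = [
            w - self.learning_rate * error * v
            for w, v in zip(self.weights, x)
        ]


model = OnlineLinearRegressor(n_features=2)
for x, y in stream(5000):
    model.learn_one(x, y)  # memory use is independent of the stream length

print(model.weights, model.bias)
```

The model state is three numbers however long the stream runs, which is the property that in-memory batch learning lacks.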

Solution = MOA

Future Directions

• Investigate how to get users to deploy their own DM solutions

• Implement incremental pre-processing techniques (Joao has already started!), eg incremental outlier detection.

• Implement incremental algs esp. for regression.

• Encourage work on abstention classifiers, uncertainty associated with point predictions etc.

• Meta-mine which units of a workflow are useful in tandem

• Investigate fusion: ADAMS with MOA, data (image+features), tasks (multiview, multitask, transfer)
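
As an illustration of what incremental pre-processing can look like, the sketch below flags outliers in a stream using running statistics (Welford's update) instead of a stored dataset. It is one simple possibility, not the technique referred to above; the class name, the threshold of three standard deviations, the warm-up length and the readings are assumptions:

```python
import math
from typing import List


class RunningOutlierDetector:
    """Flag values far from the running mean, using constant memory."""

    def __init__(self, threshold: float = 3.0, warmup: int = 10) -> None:
        self.threshold = threshold  # allowed distance, in standard deviations
        self.warmup = warmup        # values to absorb before flagging starts
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0               # running sum of squared deviations

    def update(self, value: float) -> bool:
        """Return True if value is flagged; otherwise fold it into the statistics."""
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            if std > 0 and abs(value - self.mean) > self.threshold * std:
                return True  # flagged values do not update the statistics
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)
        return False


# Hypothetical readings; the value 25.0 is the planted outlier.
readings: List[float] = [10.0, 10.2, 9.9, 10.1, 9.8, 10.0, 10.3, 9.7, 10.1, 9.9, 25.0, 10.0]

detector = RunningOutlierDetector()
flagged = [i for i, value in enumerate(readings) if detector.update(value)]
print(flagged)
```

Excluding flagged values from the statistics keeps a burst of bad readings from widening the accepted range, at the cost of adapting more slowly if the data genuinely drifts.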

Finally

Questions or Comments?
