NASA Ames Data Sciences Group

Post on 05-Jul-2020


NASA Ames Data Sciences Group
Nikunj C. Oza, Ph.D.

Leader, Data Sciences Group
nikunj.c.oza@nasa.gov

www.nasa.gov

The Data Sciences Group at NASA Ames

Data Mining Research and Development (R&D) for application to NASA problems (Aeronautics, Earth Science, Space Exploration, Space Science)

Group Members
• Ilya Avrekh
• Kamalika Das, Ph.D.
• Dave Iverson
• Vijay Janakiraman, Ph.D.
• Rodney Martin, Ph.D.
• Bryan Matthews
• David Nielsen
• Nikunj Oza, Ph.D.
• Veronica Phillips
• John Stutz
• Hamed Valizadegan, Ph.D.
• + summer students

Funding Sources

• Science Mission Directorate: AIST and CMAC programs

• NASA Aeronautics Research Mission Directorate - ATD, SMART-NAS, SASO Project

• NASA Engineering and Safety Center

• Exploration Systems Mission Directorate, Exploration Technology Development Program

• Non-NASA: DARPA, DoD

Team Members are NASA Employees, Contractors, and Students.

Example Data Mining Problems

• Aeronautics: anomaly detection, precursor identification, text mining (classification, topic identification)

• Earth Science: filling in missing measurements, anomaly detection, teleconnections, climate understanding

• Space Science: Kepler planet candidates

• Space Exploration: system health management, vascular structure identification

Four V’s of Big Tough, Sleep Depriving Data

• Volume
– Radar Tracks: 47 facilities (1 year): ~423 GB (compressed), ~3.2 TB (CSV)
– Weather and Forecast (Entire NAS): CIWS ~2.8 TB

• Veracity
– Data dropouts
– Duplicate tracks
– Track ending in midair
– Reused flight identifiers

• Velocity
– Radar Tracks: 47 facilities: ~35 GB/month (compressed), ~268 GB/month (uncompressed)
– Weather and Forecast (Entire NAS): CIWS ~233 GB/month

• Variety
– Numerical (continuous/binary)
– Weather (forecast/actual)
– Radar/Airport metadata
– ATC Voice
– ASRS text reports (Pilot/Controller)

Amazing Algorithm

Intuitive Reports

Aeronautics Data Mining Problems

• Anomaly Detection
– Anomaly discovery over a large set of variables
– Particular variable of interest, for example, fuel burn
  • Determine expected instantaneous fuel burn given current state of aircraft
  • Compare with actual instantaneous fuel burn
  • Where the difference is high, a problem may be occurring

• Precursor Identification
– Given an undesirable effect (e.g., go-around), identify precursors (e.g., overtake situation, high-speed approach)

• Text mining
– Text classification, topic identification
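The fuel-burn item above (predict expected fuel burn from aircraft state, compare with actual, flag large differences) can be sketched as follows. This is a minimal illustration, not the group's method: the single predictor (a hypothetical throttle setting), the straight-line model, the sample numbers, and the 3-sigma cutoff are all assumptions made for the example.

```python
from statistics import mean, pstdev

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with one predictor."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def flag_fuel_burn(train_x, train_y, test_x, test_y, n_sigma=3.0):
    """Flag samples whose actual fuel burn is far from the expected value.

    The expected value comes from a line fitted on nominal training data;
    'far' means more than n_sigma training-residual standard deviations
    (an assumed cutoff, chosen only for illustration).
    """
    a, b = fit_line(train_x, train_y)
    sigma = pstdev([y - (a + b * x) for x, y in zip(train_x, train_y)])
    return [abs(y - (a + b * x)) > n_sigma * sigma
            for x, y in zip(test_x, test_y)]

# Hypothetical data: throttle setting (%) versus fuel flow (arbitrary units).
train_x = [50, 60, 70, 80, 90]
train_y = [1010, 1190, 1405, 1600, 1795]
test_x = [65, 75, 85]
test_y = [1300, 1650, 1700]  # the middle sample burns far more than expected
flags = flag_fuel_burn(train_x, train_y, test_x, test_y)
```

In practice the state would be multivariate and the model nonlinear; the point is only that the anomaly score is the residual between expected and actual fuel burn.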

Topic Extraction Example

TOPIC 1: autoplt, acft, spd, capture, mode, rate, level, engaged, leveloff, vert, ctl, disconnected, selected, fpm, light, clb, pitch, manually, warning, pwr

TOPIC 2: time, day, leg, contributing, factors, hrs, crew, factor, fatigue, night, trip, rest, duty, flying, long, late, previous, incident, lack, alerter

TOPIC 3: apch, rwy, visual, ils, twr, lndg, loc, arpt, final, missed, clred, msl, intercept, vectored, sight, gar, terrain, field, uneventful, ctl

Other examples of ‘fatigue’

• Altitude Deviation
• Spatial Deviation
• Ramp Excursion
• Landing without clearance
• Runway Incursion
• Unstable Approach

Aeronautics Anomaly Detection: Current Methods

Exceedance-Based Methods

• Known anomalies
• Conditions over 2-3 variables (e.g., speed > 250 knots, altitude = 1000 ft, landing)
• Cannot identify unknown anomalies
• Low false positive rate, high false negative (missed detection) rate

Data-Driven Methods

• DISCOVER anomalies by
– learning statistical properties of the data
– finding which data points do not fit (e.g., far away, low probability)
– no background knowledge on anomalies needed

• Complementary to existing methods
– Low false negative (missed detection) rate
– Higher false positive rate (identified points/flights unusual, but not always operationally significant)

• Data-driven methods -> insights -> modification of exceedance detection
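The contrast between the two families above can be illustrated with a small sketch. It assumes each record is a hypothetical snapshot of a landing flight at 1000 ft with two variables (speed and descent rate); the standardized-distance score and the cutoff of 2.0 are stand-ins chosen for brevity, not the detectors used by the group.

```python
from math import sqrt
from statistics import mean, pstdev

def exceedance(speed_kts, limit_kts=250.0):
    """Rule-based check for one known condition: too fast at the 1000 ft gate."""
    return speed_kts > limit_kts

def standardized_distance(rows):
    """Distance of each row from the data mean, in per-variable z-score units.

    A simple 'far away' score: no knowledge of specific anomalies is used.
    """
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [sqrt(sum(((v - m) / s) ** 2 for v, m, s in zip(row, mus, sds)))
            for row in rows]

# Hypothetical snapshots: (speed in knots, descent rate in ft/min).
flights = [(180, 700), (185, 720), (178, 690), (182, 710),
           (260, 705),    # fast: covered by the exceedance rule
           (181, 1900)]   # steep descent: no rule exists for it

rule_hits = [i for i, (spd, _) in enumerate(flights) if exceedance(spd)]
scores = standardized_distance(flights)
data_hits = [i for i, s in enumerate(scores) if s > 2.0]  # assumed cutoff
```

The rule finds only the flight it was written for, while the distance score also surfaces the steep-descent flight, which a subject matter expert would then judge as operationally significant or not.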

Example: High-Speed Go-Around

• Overshoots Extended Runway Centerline (ERC) by over 1 SM
• Over 250 kts @ 2500 ft
• Angle of intercept > 40°
• Overshoots 2nd approach

Bryan Matthews, David Nielsen, John Schade, Kennis Chan, and Mike Kiniry, Automated Discovery of Flight Track Anomalies, 33rd Digital Avionics Systems Conference, 2014

Providing Domain Expert Feedback: Active Learning with Rationales Framework

[Figure: framework diagram. Training panel labels: Input Features, MKAD, Nominals, Anomalies, SME, Active learning strategy. Testing panel labels: Input Features, MKAD, Nominals, Anomalies, Rationale features, 2-class classification/ranking algorithm, Uninteresting anomalies, Operationally significant anomalies.]

Manali Sharma, Kamalika Das, Mustafa Bilgic, Bryan Matthews, David Nielsen, and Nikunj Oza, Active Learning with Rationales for Identifying Operationally Significant Anomalies in Aviation, European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD), 2016

Earth Science Example

• Understand relationships between ecosystem dynamics and climatic factors
• Model as a regression analysis problem
• 3 science questions
– Magnitude and extent of ecosystem exposure, sensitivity, and resilience to the 2005 and 2010 Amazon droughts
– Understand human-induced and other attribution as causes of vegetation anomalies
– How the learned dependency model varies across eco-climatic zones and geographical regions on a global scale

NASA ESTO AIST-14 project, Uncovering Effects of Climate Variables on Global Vegetation (PI: Kamalika Das, Ph.D.)

Problem Formulation

• Point-to-point regression analysis (Genetic Programming based Symbolic Regression)
• Estimate spatio-temporal dependency of forest ecosystems on climate variables

$$V_{ij}^{t} = f\left(LC_{ij},\; CV_{ij}^{t},\; CV_{nb}^{t},\; CV_{ij}^{t-1},\; CV_{nb}^{t-1},\; \ldots,\; CV_{ij}^{t-k},\; CV_{nb}^{t-k}\right)$$

V: vegetation, LC: land cover type, CV: climate variable(s)
i, j: pixel location indices
t: time index
nb: spatial neighborhood of index i, j
k: temporal dependency

Open challenges:
1. Estimating function f
2. Estimating best choices for k, nb
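To make the argument list of f concrete, the sketch below assembles the inputs for one pixel (i, j) at time t: the land cover value, then the climate variable at the pixel and over its neighborhood for lags 0 through k. Summarizing the neighborhood by its mean, using a square neighborhood of a given radius, and working with a single climate variable are assumptions for illustration; the slides leave the choice of nb and k open.

```python
def neighborhood_mean(frame, i, j, radius):
    """Mean of the cells within `radius` of (i, j), excluding (i, j) itself."""
    vals = []
    for di in range(-radius, radius + 1):
        for dj in range(-radius, radius + 1):
            if di == 0 and dj == 0:
                continue
            ni, nj = i + di, j + dj
            if 0 <= ni < len(frame) and 0 <= nj < len(frame[0]):
                vals.append(frame[ni][nj])
    return sum(vals) / len(vals)

def build_features(lc, cv, i, j, t, k, radius=1):
    """Inputs to f for pixel (i, j) at time t.

    lc: 2-D grid of land cover codes; cv: list of 2-D grids, one per time step.
    Returns [LC_ij, CV_ij^t, CV_nb^t, ..., CV_ij^(t-k), CV_nb^(t-k)].
    """
    if t - k < 0:
        raise ValueError("not enough history for the requested lag k")
    feats = [lc[i][j]]
    for lag in range(k + 1):
        frame = cv[t - lag]
        feats.append(frame[i][j])
        feats.append(neighborhood_mean(frame, i, j, radius))
    return feats

# Toy data: three 3x3 time steps; the center pixel differs from its neighbors.
cv = [[[s, s, s], [s, 10 * s, s], [s, s, s]] for s in (1, 2, 3)]
lc = [[1, 1, 1], [1, 2, 1], [1, 1, 1]]
x = build_features(lc, cv, i=1, j=1, t=2, k=1)
```

The feature vector has 1 + 2(k + 1) entries per climate variable, so larger k or more variables enlarge the regression problem quickly.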

Data Pipeline

• Input data
– NDVI: resolution 250 m, projection Sinusoidal
– LST: resolution 1 km, projection Sinusoidal
– TRMM (Ver 6): resolution 25 km, projection WGS84

• Reproject and resample data
– Output: 2000 – 2010 monthly data (NDVI, TRMM, LST), resolution 1 km, projection WGS84

• Time-Series: change to seasonal (Monthly -> Seasonal)
– Output: 2000 – 2010 seasonal data, 4 seasons/year
– Season 1: March – May
– Season 2: June – Sep
– Season 3: Oct
– Season 4: Nov – Feb

• Windowing: smoothing over 25x25 size window

• Filter data based on land cover
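Two of the pipeline steps above, the monthly-to-seasonal change and the window smoothing, can be sketched in a few lines. The season boundaries come from the slide; averaging the months within a season, assigning January and February to the season that began the previous November, and using a plain box mean clipped at the image edges are assumptions made for this example.

```python
# Month -> season number, following the season definitions on the slide.
SEASON_OF_MONTH = {3: 1, 4: 1, 5: 1,
                   6: 2, 7: 2, 8: 2, 9: 2,
                   10: 3,
                   11: 4, 12: 4, 1: 4, 2: 4}

def to_seasonal(monthly):
    """Aggregate {(year, month): value} into {(season_year, season): mean}.

    Assumption: Jan/Feb belong to the Season 4 that started in the
    previous calendar year.
    """
    buckets = {}
    for (year, month), value in monthly.items():
        season_year = year - 1 if month in (1, 2) else year
        key = (season_year, SEASON_OF_MONTH[month])
        buckets.setdefault(key, []).append(value)
    return {key: sum(v) / len(v) for key, v in buckets.items()}

def box_smooth(grid, size=25):
    """Replace each cell by the mean of the size x size window around it.

    Windows are clipped at the grid edges (an assumed boundary treatment).
    """
    half = size // 2
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            window = [grid[a][b]
                      for a in range(max(0, i - half), min(rows, i + half + 1))
                      for b in range(max(0, j - half), min(cols, j + half + 1))]
            out[i][j] = sum(window) / len(window)
    return out

monthly = {(2005, 3): 1.0, (2005, 4): 2.0, (2005, 5): 3.0, (2005, 10): 5.0,
           (2005, 11): 4.0, (2005, 12): 6.0, (2006, 1): 8.0, (2006, 2): 6.0}
seasonal = to_seasonal(monthly)
smoothed = box_smooth([[0, 0, 0], [0, 9, 0], [0, 0, 0]], size=3)
```

For full-resolution rasters a vectorized or separable filter would be needed; the nested loops here only show what the windowing step computes.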

Results for 2004-2010

Mean Squared Error

Year   Ridge Regression   LASSO   SVR     Symbolic Regression
2004   0.284              0.284   0.280   0.262
2005   0.289              0.289   0.288   0.278
2006   0.426              0.426   0.430   0.321
2007   0.374              0.374   0.370   0.318
2008   0.308              0.308   0.310   0.336
2009   0.353              0.353   0.360   0.328
2010   0.546              0.547   0.540   0.479

Marcin Szubert, Anuradha Kodali, Sangram Ganguly, Kamalika Das, and Josh C. Bongard, Reducing Antagonism between Behavioral Diversity and Fitness in Semantic Genetic Programming, Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), pp. 797-804, 2016.
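As a reading aid for the results table, the sketch below states the mean squared error metric and tallies which method has the lowest error in each year. The numbers are copied from the table; the helper names are hypothetical.

```python
def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

# Values copied from the results table (lower is better).
MSE_TABLE = {
    2004: {"Ridge": 0.284, "LASSO": 0.284, "SVR": 0.280, "Symbolic": 0.262},
    2005: {"Ridge": 0.289, "LASSO": 0.289, "SVR": 0.288, "Symbolic": 0.278},
    2006: {"Ridge": 0.426, "LASSO": 0.426, "SVR": 0.430, "Symbolic": 0.321},
    2007: {"Ridge": 0.374, "LASSO": 0.374, "SVR": 0.370, "Symbolic": 0.318},
    2008: {"Ridge": 0.308, "LASSO": 0.308, "SVR": 0.310, "Symbolic": 0.336},
    2009: {"Ridge": 0.353, "LASSO": 0.353, "SVR": 0.360, "Symbolic": 0.328},
    2010: {"Ridge": 0.546, "LASSO": 0.547, "SVR": 0.540, "Symbolic": 0.479},
}

def best_method_per_year(table):
    """Method with the lowest MSE per year (ties go to the first listed)."""
    return {year: min(scores, key=scores.get) for year, scores in table.items()}

best = best_method_per_year(MSE_TABLE)
symbolic_wins = sum(1 for m in best.values() if m == "Symbolic")
```

Symbolic regression has the lowest error in six of the seven years; 2008 is the exception, where Ridge Regression and LASSO tie at 0.308.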

Ongoing and Future Work

• Experiment with different combinations of temporal lookback and/or spatial effects
• Introduce additional regressors (radiation, forest fire maps, deforestation maps)
• Study the effect of different regressors on different Amazon tiles
• Derive nonlinear GP models on Amazon tiles
• Given appropriate historical data, have the ability to predict: “Under what conditions does vegetation not recover within a certain time frame.”
• Do global scale analysis in parallel

VESsel GENeration (VESGEN) Analysis
Patricia Parsons-Wingerter, PhD, NASA Chief Innovator/POC
NASA Ames 2016 Innovation Fund Award, Chief Technologist’s Office

• VESGEN 2D maps and quantifies vascular remodeling for a wide variety of quasi-2D vascularized biomedical tissue applications.

• Working on transforming to VESGEN 3D, in line with most vascularized organs and tissues in humans and vertebrates.

• Vascular-dependent diseases include cancer, diabetes, coronary vessel disease, and major astronaut health challenges in the space microgravity and radiation environments, especially for long-duration missions.

• One key component is binarization: conversion of grayscale images to black/white vascular branching patterns.
– Takes 10-25 hours of human effort.
– Exploring pattern recognition, matched filtering, vessel tracking/tracing, mathematical morphology, multiscale approaches, and model-based algorithms.

Otsu Thresholding

Otsu vs. Adaptive Thresholding
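The difference between the two binarization approaches named above can be shown on a toy image. The sketch implements Otsu's method (one global threshold that maximizes between-class variance) and a simple local-mean adaptive threshold. The 3x3 window, the `offset` margin above the local mean, and the synthetic image with an illumination ramp are assumptions for illustration and are not taken from VESGEN.

```python
def otsu_threshold(pixels, levels=256):
    """Global threshold t maximizing between-class variance (Otsu's method).

    Pixels with value > t are treated as foreground.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(v * h for v, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels at or below t
    sum0 = 0    # sum of their intensities
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        between = w0 * w1 * (m0 - m1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t

def adaptive_mean_threshold(image, size=3, offset=10):
    """Foreground where a pixel exceeds its local window mean by `offset`."""
    half = size // 2
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            window = [image[a][b]
                      for a in range(max(0, i - half), min(rows, i + half + 1))
                      for b in range(max(0, j - half), min(cols, j + half + 1))]
            if image[i][j] > sum(window) / len(window) + offset:
                out[i][j] = 1
    return out

# Synthetic image: background brightens left to right; two bright vertical
# "vessels" sit at columns 2 and 5.
row = [20, 40, 110, 80, 100, 170, 140, 160]
image = [row[:] for _ in range(3)]

t = otsu_threshold([p for r in image for p in r])
global_mask = [[1 if p > t else 0 for p in r] for r in image]
local_mask = adaptive_mean_threshold(image)
```

On this image the single global threshold marks the bright side of the background as vessel, while the local threshold recovers only the two vessel columns, which is the usual motivation for adaptive thresholding under uneven illumination.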

Future Work

• Work in progress: exploring more preprocessing and post-processing techniques

• Each step of preprocessing and post-processing has some input parameters
– The result is sensitive to these parameters
– We aim to make the parameter selection either automated (machine learning) or semi-automated (user can choose the right parameter)

• Machine learning to learn the binarization
– Given the manual labels, perform supervised or semi-supervised learning
– Each pixel and its class label (foreground or background) is a training example
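The last item above, treating each pixel and its manual label as a training example, can be sketched with a very small supervised learner. The two per-pixel features (intensity and contrast against the 3x3 local mean) and the nearest-centroid classifier are assumptions chosen to keep the example short; the slides do not say which features or learner are planned.

```python
def pixel_features(image, i, j):
    """Features for one pixel: its intensity and its contrast vs. the 3x3 mean."""
    rows, cols = len(image), len(image[0])
    window = [image[a][b]
              for a in range(max(0, i - 1), min(rows, i + 2))
              for b in range(max(0, j - 1), min(cols, j + 2))]
    return (image[i][j], image[i][j] - sum(window) / len(window))

def train_centroids(image, labels):
    """One training example per pixel; returns the mean feature vector per class."""
    sums = {0: [0.0, 0.0, 0], 1: [0.0, 0.0, 0]}
    for i, row in enumerate(image):
        for j, _ in enumerate(row):
            f = pixel_features(image, i, j)
            s = sums[labels[i][j]]
            s[0] += f[0]
            s[1] += f[1]
            s[2] += 1
    return {c: (s[0] / s[2], s[1] / s[2]) for c, s in sums.items()}

def binarize(image, centroids):
    """Label each pixel with the class whose centroid is nearest in feature space."""
    def nearest(f):
        return min(centroids, key=lambda c: (f[0] - centroids[c][0]) ** 2
                                            + (f[1] - centroids[c][1]) ** 2)
    return [[nearest(pixel_features(image, i, j)) for j in range(len(row))]
            for i, row in enumerate(image)]

# Hypothetical manually labeled image: one bright vessel in column 2.
train_img = [[20, 20, 90, 20] for _ in range(3)]
train_lbl = [[0, 0, 1, 0] for _ in range(3)]
centroids = train_centroids(train_img, train_lbl)

# New image with the vessel in a different column and different brightness.
mask = binarize([[30, 100, 30, 30] for _ in range(3)], centroids)
```

A semi-supervised variant would add unlabeled pixels when estimating the class structure; that extension is not shown here.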

How do we get the Word Out? DASHlink

disseminate. collaborate. innovate.
https://dashlink.ndc.nasa.gov/

DASHlink is a collaborative website designed to promote:
• Sustainability
• Reproducibility
• Dissemination
• Community building

Users can create profiles
• Share papers, upload and download open source algorithms
• Find NASA data sets.
