
Thamme Gowda's Summer2016- NASA JPL Internship

Apr 11, 2017

Page 1: Thamme Gowda's Summer2016- NASA JPL Internship

Thamme Gowda’s

2016 SUMMER INTERNSHIP

of 75 days (June 06 to August 19) with Explorers & Rocket Scientists

@ NASA Jet Propulsion Laboratory, Pasadena, CA

Page 2: Thamme Gowda's Summer2016- NASA JPL Internship

ABOUT ME

● ThammeGowda Narayanaswamy (or TG or Thamme)

● An MS student at University of Southern California, LA

● Majoring in Computer Science, graduating in May 2017

● @thammegowda

2

Page 3: Thamme Gowda's Summer2016- NASA JPL Internship

WHAT DID I DO THIS SUMMER?

DEEP LEARNING

MACHINE LEARNING

NATURAL LANGUAGE PROCESSING

INFORMATION RETRIEVAL

WEB CRAWLING/ SCRAPING

WEB DEVELOPMENT

DISTRIBUTED COMPUTING

INFRASTRUCTURE/ DEV - OPS.

SYSTEM DESIGN ARCHITECTURE

3

Page 4: Thamme Gowda's Summer2016- NASA JPL Internship

INITIAL GOALS

ASSIGNMENT:

● will be primarily involved in supporting and influencing DARPA Memex.

● will involve active and iterative development of predominantly media (images, videos, etc.) retrieval models

● will be part of 398M’s vision and aspirations within the field of open source engagement, big data and information retrieval with a specific focus on media.

4

Page 5: Thamme Gowda's Summer2016- NASA JPL Internship

TIKA-1508, TIKA-1986

SYSTEM DESIGN ARCHITECTURE

INFRASTRUCTURE/ DEV - OPS.

5

Page 6: Thamme Gowda's Summer2016- NASA JPL Internship

TIKA-1508, TIKA-1986

With Dr. Tim Allison, Apache Tika PMC Member

● TIKA-1508 Add uniformity to parser parameter configuration
● TIKA-1986 Support typed parameters (int, double, file, etc)
● Tika Parsers can now accept runtime parameters
● All parameters in a single XML config file
● Framework binds the parameters to parser instances
  ○ JAXB
  ○ Java Reflection
  ○ Java Annotations
● My Contributions - Design, implementation, and architectural discussions

6

Page 7: Thamme Gowda's Summer2016- NASA JPL Internship

TIKA-1508, TIKA-1986

With Dr. Tim Allison, Apache Tika PMC

[Figures: BEFORE and NOW]

7

Page 8: Thamme Gowda's Summer2016- NASA JPL Internship

TIKA-1986 SUPPORTED TYPES

● int
● float
● byte
● double
● bool
● string
● short
● long
● file
● url
● uri
● bigint

With Dr. Tim Allison, Apache Tika PMC
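As a rough illustration of typed parameters, the sketch below reads `<param>` elements from an XML snippet, converts each value according to its declared type, and sets it on a parser object. It is written in Python rather than Tika's Java, and the element layout, the `CONVERTERS` table, and the `DemoParser` class are assumptions made for this example, not Tika's actual API.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical converters for a few of the type names listed on the slide.
CONVERTERS = {
    "int": int,
    "long": int,
    "short": int,
    "bigint": int,
    "float": float,
    "double": float,
    "bool": lambda s: s.strip().lower() == "true",
    "string": str,
    "file": Path,
}


class DemoParser:
    """Stand-in for a parser that accepts runtime parameters."""
    top_n = 1
    min_confidence = 0.0
    verbose = False


def bind_params(parser, xml_text):
    """Convert each <param> to its declared type and set it on the parser."""
    root = ET.fromstring(xml_text)
    for param in root.iter("param"):
        name = param.get("name")
        convert = CONVERTERS[param.get("type", "string")]
        if not hasattr(parser, name):
            raise AttributeError(f"unknown parameter: {name}")
        setattr(parser, name, convert(param.text))
    return parser


# Assumed config layout, for illustration only.
CONFIG = """
<params>
  <param name="top_n" type="int">2</param>
  <param name="min_confidence" type="double">0.3</param>
  <param name="verbose" type="bool">true</param>
</params>
"""

parser = bind_params(DemoParser(), CONFIG)
print(parser.top_n, parser.min_confidence, parser.verbose)
```

Keeping the conversion in one lookup table means a new type only needs one new entry, which is roughly the uniformity the two tickets aim for.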

8

Page 9: Thamme Gowda's Summer2016- NASA JPL Internship

TIKA-1993: OBJECT RECOGNITION

DEEP LEARNING

INFRASTRUCTURE/ DEV - OPS.

9

Page 10: Thamme Gowda's Summer2016- NASA JPL Internship

TIKA-1993: OBJECT RECOGNITION

● Image Recognition with Apache Tika
  ○ now Tika can also see images!
● Integrated TensorFlow Inception-V3 model
● Evaluated multiple ways of integration
  ○ Command Line Invocation → S-l-o-w as a turtle
  ○ Java Native Interface → Transitive dependency issues
  ○ gRPC Client Server → Dependency version issues
  ○ REST Client Server → Works best, please use this!
● https://wiki.apache.org/tika/TikaAndVision

With Dr. Chris Mattmann

10

Page 11: Thamme Gowda's Summer2016- NASA JPL Internship

TIKA-1993 DEMO

<meta name="OBJECT" content="German shepherd, German shepherd dog, German police dog, alsatian (0.36203)"/>

<meta name="OBJECT" content="military uniform (0.13061)"/>

* Photo Credits - Wikimedia.org
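The `OBJECT` values in the demo pack a label and a confidence into one string. A minimal sketch of splitting them apart, assuming every value follows the `label (score)` layout shown above (the function name and regular expression are illustrative, not part of Tika):

```python
import re

# Pattern for the layout shown in the demo: "<label text> (<score>)".
OBJECT_RE = re.compile(r"^(?P<label>.*) \((?P<score>[0-9.]+)\)$")


def parse_object(content):
    """Split an OBJECT meta value into (label, confidence)."""
    match = OBJECT_RE.match(content)
    if match is None:
        raise ValueError(f"unexpected OBJECT value: {content!r}")
    return match.group("label"), float(match.group("score"))


values = [
    "German shepherd, German shepherd dog, German police dog, alsatian (0.36203)",
    "military uniform (0.13061)",
]
for value in values:
    print(parse_object(value))
```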

With Dr. Chris Mattmann

11

Page 12: Thamme Gowda's Summer2016- NASA JPL Internship

LABELING THE WEAPONS DATASET

● 1.3 Million Images of DARPA MEMEX dataset
● Detected objects in the images
● Ran it two times
  ○ 1st time - top 2 objects
  ○ 2nd time - top 2 objects + confidence threshold of 0.3
● Improved the efficiency for large jobs
  ○ 1.3 million images took ~ 36 hours on 32 CPU cores
  ○ Improvements are upstreamed to Apache Tika
● Indexed the results back to Imagecatdev Solr
● You can view this on imagespace today

With Dr. Chris Mattmann

http://imagecat.dyndns.org/weapons/imagespace-dev/

12
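The two labeling runs differ only in how predictions are filtered. A minimal sketch of that filtering, assuming each prediction is a `(label, confidence)` pair and that the threshold is inclusive (both are assumptions; the third candidate below is made up):

```python
def select_labels(predictions, top_k=2, min_confidence=None):
    """Keep the top_k highest-scoring labels, optionally dropping weak ones."""
    ranked = sorted(predictions, key=lambda item: item[1], reverse=True)[:top_k]
    if min_confidence is not None:
        ranked = [item for item in ranked if item[1] >= min_confidence]
    return ranked


predictions = [
    ("military uniform", 0.13061),
    ("German shepherd", 0.36203),
    ("bulletproof vest", 0.05),  # made-up third candidate for illustration
]

first_run = select_labels(predictions, top_k=2)
second_run = select_labels(predictions, top_k=2, min_confidence=0.3)
print(first_run)
print(second_run)
```

With the threshold in place, an image can end up with fewer than two labels, or none at all, which trades coverage for cleaner search results.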

Page 13: Thamme Gowda's Summer2016- NASA JPL Internship

DEMO

http://imagecat.dyndns.org/weapons/imagespace-dev/

With Dr. Chris Mattmann

13

Page 14: Thamme Gowda's Summer2016- NASA JPL Internship

PARSER-INDEXER

INFRASTRUCTURE/ DEV - OPS.

14

Page 15: Thamme Gowda's Summer2016- NASA JPL Internship

PARSER-INDEXER (aka IMAGECAT2)

● Set of tools for parsing and indexing
  ○ Parse various documents
  ○ Index to Solr
  ○ Enrich documents
    ■ NER
    ■ Links
    ■ Dates
● Last semester’s work
● This summer -
  ○ Enhanced a few tools to meet new requirements
  ○ Created a Docker image
● https://github.com/uscdataScience/imagecat2

With Asitang Mishra

15

Page 16: Thamme Gowda's Summer2016- NASA JPL Internship

SPARKLER

INFORMATION RETRIEVAL

WEB CRAWLING/ SCRAPING

SYSTEM DESIGN ARCHITECTURE

DISTRIBUTED COMPUTING

INFRASTRUCTURE/ DEV - OPS.

16

Page 17: Thamme Gowda's Summer2016- NASA JPL Internship

SPARKLER

● Spark-Crawler
● Redesigned and reimagined crawler
  ○ Taking the best parts of Apache Nutch
  ○ Combined with recent advancements in distributed computing
● My contributions:
  ○ Architecture
  ○ The Spark way of distributed computing
  ○ Prototyping the proof of concept
  ○ Initial code
● https://github.com/uscdataScience/sparkler

With Karanjeet Singh

17

Page 18: Thamme Gowda's Summer2016- NASA JPL Internship

SPARKLER - ROADMAP

● Working Prototype with partitioning for Fair Fetching ✔
● Apache Solr Backend for crawldb ✔
● Stores Data on FS ✔
● OSGI based Plugin Framework (Apache Felix) ✔
  ○ Regex URL Filter ✔
● Admin Dashboard -- coming soon
● ApacheCon EU 2016 -- Planning
● More Plugins from Nutch -- Coming soon

Quick Start: https://github.com/USCDataScience/sparkler/wiki/sparkler-0.1

With Karanjeet Singh
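The slide does not spell out how partitioning achieves fair fetching. One common reading is that URLs are grouped by host and the fetch order rotates across hosts, so a single large site cannot monopolize the crawl. The sketch below shows that idea in plain Python under that assumption; it is not Sparkler's code, and the function names and URLs are hypothetical.

```python
from collections import defaultdict
from itertools import zip_longest
from urllib.parse import urlparse


def group_by_host(urls):
    """Bucket URLs by host name, preserving input order within each host."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlparse(url).netloc].append(url)
    return dict(groups)


def interleave(groups):
    """Round-robin across hosts so no single host dominates the queue."""
    order = []
    for batch in zip_longest(*groups.values()):
        order.extend(url for url in batch if url is not None)
    return order


urls = [
    "http://a.example/1", "http://a.example/2", "http://a.example/3",
    "http://b.example/1", "http://c.example/1",
]
groups = group_by_host(urls)
print(interleave(groups))
```

In a distributed setting each host group would typically map to one partition, which also makes per-host politeness delays easy to enforce.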

18

Page 19: Thamme Gowda's Summer2016- NASA JPL Internship

NUTCH-2292 MAVENIZE THE BUILD

WEB CRAWLING/ SCRAPING

INFRASTRUCTURE/ DEV - OPS.

19

Page 20: Thamme Gowda's Summer2016- NASA JPL Internship

NUTCH-2292 MAVENIZE THE BUILD

● Bigger Goal: reuse Apache Nutch plugins in Sparkler
● Challenge: Nutch’s plugins are not reusable
● Why?
  ○ The build is messy. Uses old-school tools: Ant + Ivy
  ○ The plugins are not publishable on Maven Central Repo.
● Smaller Goals:
  ○ 1. Mavenize the entire build system of Nutch
  ○ 2. Keep the build output backward compatible
    ■ There are 2 builds: local & deploy
● Progress: Backward compatible local build using Apache Maven
● Pending: deploy build is not done yet!
● https://github.com/apache/nutch/tree/NUTCH-2292

With Dr. Lewis McGibbney, PMC member of Apache Nutch

20

Page 21: Thamme Gowda's Summer2016- NASA JPL Internship

WEB OF SCIENCE SCRAPER

WEB CRAWLING/ SCRAPING

21

Page 22: Thamme Gowda's Summer2016- NASA JPL Internship

WEB OF SCIENCE SCRAPER

● Web of Science is an online scientific citation indexing service
● Goal: Scrape all the research papers matching the given keywords
● Challenge: Get only the TSV data of articles, not the web pages
● Solution:
  ○ Focused/custom crawler in Python using selenium-driver

With Kyle Hundman

22

Page 23: Thamme Gowda's Summer2016- NASA JPL Internship

WORKFLOW

[Figure: scraper workflow diagram, ending in STOP]

With Kyle Hundman

23

Page 24: Thamme Gowda's Summer2016- NASA JPL Internship

PRIOR TO SUMMER

Hi prof! I like to work on ML, NLP this summer. …. pursue PhD in AI field, so ….

JPL is an internationally recognized home to AI. There are the world’s experts here ….

24

Page 25: Thamme Gowda's Summer2016- NASA JPL Internship

LANDMARKS CLASSIFICATION

DEEP LEARNING

MACHINE LEARNING

25

Page 26: Thamme Gowda's Summer2016- NASA JPL Internship

LANDMARKS CLASSIFICATION

With Dr. Kiri Wagstaff

● Goal: Classify Landmark images from High Resolution Imaging Science Experiment (HiRISE)
● Classes: Crater, Dark Dune, Bright Dune, Streak etc
● Trained a deep neural net for image classification
● Compared with the results from Caffe based classifier

26

Page 27: Thamme Gowda's Summer2016- NASA JPL Internship

SIMPLE NEURAL NET

* Photo Credits: commons.wikimedia.org

27

Page 28: Thamme Gowda's Summer2016- NASA JPL Internship

INCEPTION

* Photo Credits: quickmeme.com

28

Page 29: Thamme Gowda's Summer2016- NASA JPL Internship

INCEPTION-V3 ARCHITECTURE

* Photo Credits - Google Research

This network has a 5.64% top-5 error on the ILSVRC 2012 validation dataset

29

Page 30: Thamme Gowda's Summer2016- NASA JPL Internship

LANDMARKS CLASSIFICATION

● Challenges:
  ○ Too little training data
  ○ Demands lots of CPU power
  ○ Labels are not precise
● Solution: Transfer Learning
  ○ Start with the state-of-the-art Inception-V3 network
  ○ Erase the weights of the last layer
  ○ Retrain the network for new classes
● Advantages:
  ○ Knowledge reuse
  ○ Faster experiments with fewer resources

With Dr. Kiri Wagstaff
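The three transfer-learning steps can be shown on a toy scale. In the sketch below a fixed function stands in for the pretrained layers, and only a freshly initialized logistic output layer is trained on top of it. This is a pure-Python illustration with made-up 2-D data, not the Inception-V3 retraining used in the project; every name in it is hypothetical.

```python
import math


def frozen_features(x):
    """Stand-in for the pretrained layers: a fixed transform that is never updated."""
    return [x[0] + x[1], x[0] - x[1]]


def train_last_layer(samples, labels, epochs=500, lr=0.1):
    """Fit only a fresh logistic output layer on top of the frozen features."""
    weights, bias = [0.0, 0.0], 0.0  # the "erased" last layer
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            f = frozen_features(x)
            z = sum(w * v for w, v in zip(weights, f)) + bias
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y  # gradient of the log loss with respect to z
            weights = [w - lr * err * v for w, v in zip(weights, f)]
            bias -= lr * err
    return weights, bias


def predict(x, weights, bias):
    """Classify one sample with the frozen base plus the retrained layer."""
    f = frozen_features(x)
    return int(sum(w * v for w, v in zip(weights, f)) + bias > 0)


# Made-up, linearly separable data for two new classes.
samples = [(0, 0), (1, 0), (0, 1), (3, 3), (4, 3), (3, 4)]
labels = [0, 0, 0, 1, 1, 1]

weights, bias = train_last_layer(samples, labels)
print([predict(x, weights, bias) for x in samples])
```

Because only the small last layer is updated, each training pass is cheap, which is where the faster experiments come from.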

30

Page 31: Thamme Gowda's Summer2016- NASA JPL Internship

LANDMARKS CLASSIFICATION

TENSORFLOW
Judge↓\TFlo→     streak   other   dark_dune   bright_dune   crater   [TFlo.Tot]
streak                1      55           1             0        1           58
other                 1    1562         143             2       60         1768
dark_dune             0      18         471             0        1          490
bright_dune           0       6           1             0       16           23
crater                0     225           0             0      158          383
[Judge.Total]         2    1866         616             2      236       [2722]

CAFFE
Judge↓\Caffe→    streak   other   dark_dune   bright_dune   crater   [Caffe.Tot]
streak                2      13          75             0        1           91
other                 0    1722         102             0       54         1878
dark_dune             0      11         407             0        1          419
bright_dune           0      12          28             2        3           45
crater                0     108           4             0      177          289
[Judge.Total]         2    1866         616             2      236       [2722]

With Dr. Kiri Wagstaff
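A quick way to compare the two tables is the share of images on the diagonal, where the classifier and the judge agree. The sketch below computes that from the counts above; the variable and function names are illustrative.

```python
LABELS = ["streak", "other", "dark_dune", "bright_dune", "crater"]

# Rows and columns follow the order of LABELS, copied from the two tables.
TENSORFLOW = [
    [1, 55, 1, 0, 1],
    [1, 1562, 143, 2, 60],
    [0, 18, 471, 0, 1],
    [0, 6, 1, 0, 16],
    [0, 225, 0, 0, 158],
]
CAFFE = [
    [2, 13, 75, 0, 1],
    [0, 1722, 102, 0, 54],
    [0, 11, 407, 0, 1],
    [0, 12, 28, 2, 3],
    [0, 108, 4, 0, 177],
]


def agreement(matrix):
    """Fraction of images on the diagonal, i.e. where both labels match."""
    total = sum(sum(row) for row in matrix)
    matched = sum(matrix[i][i] for i in range(len(matrix)))
    return matched / total


for name, matrix in [("TensorFlow", TENSORFLOW), ("Caffe", CAFFE)]:
    print(f"{name}: {agreement(matrix):.4f}")
```

Overall agreement is dominated by the large "other" class, so the per-class rows are still worth reading alongside it.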

31

Page 32: Thamme Gowda's Summer2016- NASA JPL Internship

TENSORFLOW VS CAFFE

Caffe↓\TFlo→ streak other dark_dune bright_dune crater [TFlo.Tot]

streak 9 47 0 0 2 58

other 25 1536 88 21 98 1768

dark_dune 57 74 330 17 12 490

bright_dune 0 8 0 3 11 22

crater 0 211 1 4 166 382

[Caffe.Total] 91 1876 419 45 289 [2720]

With Dr. Kiri Wagstaff

32

Page 33: Thamme Gowda's Summer2016- NASA JPL Internship

SUPERVISING UI

● Web UI for labelling images
● Simple to set up and start
● Creates SQLite DB to store labels and the progress
● Web UI using Python Flask and Bootstrap
  ○ Can be extended using new templates and settings
● Shows random document for labelling
● Feature to navigate back and forth (using browser history)
● Accepts multiple labels
● Option to download the data as CSV, sorted by time
● https://github.com/USCDataScience/supervising-ui

With Dr. Kiri Wagstaff
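The storage side of such a tool can be quite small. The sketch below uses Python's built-in `sqlite3` module to keep items and labels, pick a random unlabelled item, and accept several labels per item. The table and column names are assumptions for this example and may differ from the schema in the supervising-ui repository.

```python
import sqlite3


def open_db(path=":memory:"):
    """Create the (assumed) schema: one row per item, one row per label."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, url TEXT UNIQUE)")
    db.execute(
        "CREATE TABLE IF NOT EXISTS labels ("
        "item_id INTEGER REFERENCES items(id), label TEXT, "
        "created_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    return db


def random_unlabelled(db):
    """Pick a random item that has no label yet, or None when finished."""
    return db.execute(
        "SELECT id, url FROM items WHERE id NOT IN (SELECT item_id FROM labels) "
        "ORDER BY RANDOM() LIMIT 1"
    ).fetchone()


def save_labels(db, item_id, labels):
    """Store one row per label so an item can carry multiple labels."""
    db.executemany(
        "INSERT INTO labels (item_id, label) VALUES (?, ?)",
        [(item_id, label) for label in labels],
    )
    db.commit()


db = open_db()
db.executemany("INSERT INTO items (url) VALUES (?)", [("img/1.jpg",), ("img/2.jpg",)])
item = random_unlabelled(db)
save_labels(db, item[0], ["crater", "dark_dune"])
remaining = random_unlabelled(db)
print(item, remaining)
```

The `created_at` column is what would make a CSV export sorted by time straightforward.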

33

Page 34: Thamme Gowda's Summer2016- NASA JPL Internship

DEMO

With Dr. Kiri Wagstaff

34

Page 35: Thamme Gowda's Summer2016- NASA JPL Internship

CP1: CLASSIFIER FOR DARPA MEMEX

DEEP LEARNING

MACHINE LEARNING

NATURAL LANGUAGE PROCESSING

35

Page 36: Thamme Gowda's Summer2016- NASA JPL Internship

CP1: CLASSIFIER FOR DARPA MEMEX

● Goal: Binary classifier for Human Trafficking (HT) related web page classification
● Training data: Clusters of ads, positive and negative for HT
● Approach 1:
  ○ Used LSTM Neural Net for classification
  ○ Challenges: One training epoch took 36 hours
  ○ Outcomes:
    ■ Tight fitting - The model didn’t generalize well
    ■ Got ~39% ROC AUC
  ○ What went wrong:
    ■ Only one training epoch was done

With Kyle Hundman

36

Page 37: Thamme Gowda's Summer2016- NASA JPL Internship

CP1: CLASSIFIER FOR DARPA MEMEX

● Built a new classifier using SVM
● Created custom vectors
● Stanford CoreNLP for tokenization
● Features:
  ○ Unigrams
  ○ Selected Bigrams
● All grams are lemmatized
● Classification is done at the cluster level
● https://github.com/USCDataScience/svm-classifier-memex

With Kyle Hundman

* Photo credits http://scikit-learn.org/

37

Page 38: Thamme Gowda's Summer2016- NASA JPL Internship

CP1: CLASSIFIER FOR DARPA MEMEX

With Kyle Hundman

Step 1. Build Dictionary
Input text → Tokenizer (CoreNLP) → Dictionary

Step 2. Make Vectors (Training, Testing)
Input text → Tokenizer (CoreNLP) → Vectorizer (uses Dictionary) → Vectors
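The two steps can be sketched in a few lines. The version below swaps in a whitespace tokenizer for CoreNLP and keeps every adjacent bigram rather than a selected subset, so it is a simplified stand-in for the project's vectorizer; all function names are hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Whitespace tokenizer, a stand-in for the CoreNLP tokenizer."""
    return text.lower().split()


def grams(tokens):
    """Unigrams plus adjacent bigrams (the slides use only selected bigrams)."""
    return tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def build_dictionary(texts, min_count=1):
    """Step 1: map every sufficiently frequent gram to a column index."""
    counts = Counter(g for text in texts for g in grams(tokenize(text)))
    kept = sorted(g for g, c in counts.items() if c >= min_count)
    return {gram: index for index, gram in enumerate(kept)}


def vectorize(text, dictionary):
    """Step 2: count vector for one text; unknown grams are ignored."""
    vector = [0] * len(dictionary)
    for gram in grams(tokenize(text)):
        if gram in dictionary:
            vector[dictionary[gram]] += 1
    return vector


train_texts = ["best party tonight", "party tonight"]
dictionary = build_dictionary(train_texts)
print(dictionary)
print(vectorize("best party", dictionary))
```

Building the dictionary once from the training text and reusing it for the test text keeps both sets of vectors in the same column space.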

38

Page 39: Thamme Gowda's Summer2016- NASA JPL Internship

CP1: CLASSIFIER FOR DARPA MEMEX

With Kyle Hundman

[Figure: training and evaluation loop]
Training: Training Vectors + Labels (Ground Truth) → SVM Model
Predicting: Testing Vectors → SVM Model → Labels (Predicted)
Eval: Labels (Predicted) vs Labels (Ground Truth)
Accept → Stop
Feedback/Iterate → Change parameters, features, etc

39

Page 40: Thamme Gowda's Summer2016- NASA JPL Internship

CP1: CLASSIFIER FOR DARPA MEMEX

With Kyle Hundman

● Challenges:
  ○ Dataset was ~23 GB, time was roughly 1 day
    ■ tasks like POS Tagging, Parsing, NER are expensive
    ■ single threaded job @ 1.7GHz, estimated ~30 Hrs
● Solution: Multithreading
  ○ 4 cores @ 1.7GHz, estimated 8 Hrs
  ○ 32 CPU Cores @ 2.8GHz
    ■ ~ 30 Minutes for training + test dataset
    ■ < 30 minutes for evaluation dataset
● Faster experiments → More experiments in search of optimal model
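The speed-up comes from processing independent documents side by side. A minimal sketch of that pattern with Python's `concurrent.futures`, where `annotate` is a placeholder for the expensive per-document work. In CPython, CPU-bound work usually needs a process pool rather than threads to scale across cores; the structure of the code stays the same.

```python
from concurrent.futures import ThreadPoolExecutor


def annotate(document):
    """Placeholder for an expensive per-document step (tagging, parsing, NER)."""
    return document.lower().split()


def annotate_all(documents, workers=4):
    """Process documents in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(annotate, documents))


documents = ["First ad text", "Second ad text", "Third ad text"]
print(annotate_all(documents, workers=4))
```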

40

Page 41: Thamme Gowda's Summer2016- NASA JPL Internship

CP1: TEST DATASET ROC CURVE

With Kyle Hundman

83.8% AU-ROC for 145 Clusters (101 +ve, 44 -ve)

41
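For readers unfamiliar with the metric, the area under the ROC curve equals the probability that a randomly chosen positive cluster receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition. The labels and scores are made up for illustration and are not the CP1 data.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up cluster labels and classifier scores, for illustration only.
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.3, 0.4, 0.1]
print(roc_auc(labels, scores))
```

A value of 0.5 corresponds to random ranking, so a score below 0.5 means the ranking is worse than chance.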

Page 42: Thamme Gowda's Summer2016- NASA JPL Internship

CP1: EVALUATION DATASET AU ROC

With Kyle Hundman

81.7% AU-ROC for 487 Clusters (Next best result was 65%)

42

Page 43: Thamme Gowda's Summer2016- NASA JPL Internship

CP1: CLASSIFIER FOR DARPA MEMEX

● What worked?
  ○ Cluster level Classification instead of document
  ○ Lemmatization
    ■ Verbs: run, ran, running are same
    ■ Nouns: child, children are same
    ■ Adjectives: hot, hotter, hottest are same
  ○ Selected Bigrams and N-grams:
    ■ Adjectives and nouns - together
    ■ Adverbs and verbs - together

With Kyle Hundman

Tokens like ‘girl’, ‘best’, ‘party’, ‘hot’ may not mean much; but the phrases mean a lot to HT classifier!
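Selecting bigrams by part of speech can be sketched as a filter over adjacent tagged tokens. The example below assumes Penn Treebank style tags and an adjective-then-noun, adverb-then-verb order. The tokens are hand-tagged and already lemmatized here, whereas a real pipeline would get both from a tagger such as CoreNLP; the names are hypothetical.

```python
# Penn Treebank tag prefixes: JJ = adjective, NN = noun, RB = adverb, VB = verb.
SELECTED_PAIRS = {("JJ", "NN"), ("RB", "VB")}


def selected_bigrams(tagged_lemmas):
    """Return bigrams whose coarse tag pair is in SELECTED_PAIRS."""
    result = []
    for (left, left_tag), (right, right_tag) in zip(tagged_lemmas, tagged_lemmas[1:]):
        if (left_tag[:2], right_tag[:2]) in SELECTED_PAIRS:
            result.append(f"{left} {right}")
    return result


# Hand-tagged, already-lemmatized tokens for illustration.
tagged = [("good", "JJ"), ("party", "NN"), ("start", "VBZ"), ("really", "RB"), ("run", "VBG")]
print(selected_bigrams(tagged))
```

Restricting bigrams this way keeps the feature space far smaller than using every adjacent pair.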

43

Page 44: Thamme Gowda's Summer2016- NASA JPL Internship

MARS TARGET EXTRACTION

MACHINE LEARNING

NATURAL LANGUAGE PROCESSING

INFORMATION RETRIEVAL

44

Page 45: Thamme Gowda's Summer2016- NASA JPL Internship

MARS TARGET EXTRACTION (MTE)

With Dr. Kiri Wagstaff, Dr. Raymond Francis

● Goal: Build a search engine for research articles related to planetary science.
  ■ Minerals, Elements, Targets etc
● Contributions: parser and indexer tools
  ■ Apache Tika to extract text
  ■ Grobid parser to extract title, authors, affiliations etc
  ■ Stanford CoreNLP for NER
  ■ Apache Lucene/Solr inverted index
● https://github.com/USCDataScience/parser-indexer-py

45

Page 46: Thamme Gowda's Summer2016- NASA JPL Internship

MARS TARGET EXTRACTION

With Dr. Kiri Wagstaff, Dr. Raymond Francis

● Building a custom Named Entity Recognition (NER) model for planetary science
● Entities include ELEMENTS, MINERALS, TARGETS
● Dr. Raymond Francis annotated documents published in Lunar and Planetary Science Conference [1] (LPSC) 2015 using brat [2]
● Used these annotations as training data for Stanford CoreNLP’s CRFClassifier [3]
● Scripts and the documentation are at https://github.com/USCDataScience/parser-indexer-py/tree/master/src/corenlp

1. http://www.hou.usra.edu/meetings/lpsc2015/
2. http://brat.nlplab.org/
3. http://nlp.stanford.edu/software/CRF-NER.shtml
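Turning brat annotations into training data mostly means aligning character offsets with tokens. The sketch below reads brat text-bound annotation lines and emits one token and label per row, with `O` for tokens outside any entity, which is the general shape of token-per-line training files. The tokenizer, the sample sentence, and the function names are assumptions for this example, and discontinuous spans are not handled; the project's own scripts are in the repository linked above.

```python
import re


def parse_ann(ann_text):
    """Read brat text-bound annotations: 'T1<TAB>Type start end<TAB>text'."""
    spans = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue
        _, info, _ = line.split("\t", 2)
        label, start, end = info.split(" ")
        spans.append((int(start), int(end), label))
    return spans


def to_token_rows(text, spans, outside="O"):
    """Emit (token, label) rows; a token takes the label of the span covering it."""
    rows = []
    for match in re.finditer(r"\w+|[^\w\s]", text):
        label = outside
        for start, end, span_label in spans:
            if match.start() >= start and match.end() <= end:
                label = span_label
                break
        rows.append((match.group(), label))
    return rows


# Made-up sentence and annotations, for illustration only.
text = "Windjana contains Fe and hematite."
ann = "T1\tTarget 0 8\tWindjana\nT2\tElement 18 20\tFe\nT3\tMineral 25 33\thematite"

rows = to_token_rows(text, parse_ann(ann))
for token, label in rows:
    print(f"{token}\t{label}")
```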

46

Page 47: Thamme Gowda's Summer2016- NASA JPL Internship

NER MODEL EVALUATION #1

Entity P R F1 TP FP FN

Element 0.7105 0.9 0.7941 81 33 9

Locality 0 0 0 0 1 2

Mineral 0.9444 0.6667 0.7816 34 2 17

Target 0.9524 0.8696 0.9091 20 1 3

Totals 0.7849 0.7714 0.7781 135 37 40

● Training data: 55 documents from LPSC 2015
● Testing data: 8 documents from LPSC 2015
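The P, R, and F1 columns follow directly from the TP, FP, and FN counts. A minimal sketch that recomputes them from the rows of evaluation #1; the counts, including the Totals row, are copied from the table as given, and the function name is illustrative.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# (TP, FP, FN) per row, copied from the evaluation #1 table.
COUNTS = {
    "Element": (81, 33, 9),
    "Locality": (0, 1, 2),
    "Mineral": (34, 2, 17),
    "Target": (20, 1, 3),
    "Totals": (135, 37, 40),
}

for name, counts in COUNTS.items():
    p, r, f = prf(*counts)
    print(f"{name:<9}{p:.4f} {r:.4f} {f:.4f}")
```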

47

Page 48: Thamme Gowda's Summer2016- NASA JPL Internship

NER MODEL EVALUATION #2

Entity P R F1 TP FP FN

Element 0.6864 0.9 0.7788 81 37 9

Locality 0 0 0 0 1 2

Material 0 0 0 0 4 2

Mineral 0.9111 0.8039 0.8542 41 4 10

Target 1 0.913 0.9545 21 0 2

Totals 0.7566 0.8171 0.7857 143 46 32

● Training data: 64 documents from LPSC 2015 and 2016
● Testing data: 8 documents from LPSC 2015

48

Page 49: Thamme Gowda's Summer2016- NASA JPL Internship

ACKNOWLEDGEMENTS

● Dr. Chris Mattmann
● Paul Ramirez
● Dr. Kiri Wagstaff

49

Page 50: Thamme Gowda's Summer2016- NASA JPL Internship

THANKS

● Thanks for the opportunity!
● Am I coming back? (September 12th, hopefully)
● Wish to do more ML, NLP research

50