Federal Big Data Working Group Meetup Dr. Brand Niemann Director and Senior Data Scientist Semantic Community http://semanticommunity.info/ http://www.meetup.com/Federal-Big-Data-Working-Group/ http://semanticommunity.info/Data_Science/Federal_Big_Data_Working_Group_Meetup February 18, 2014

Federal Big Data Working Group Meetup

Feb 25, 2016


Transcript
Page 1: Federal Big Data Working Group  Meetup

Federal Big Data Working Group Meetup

Dr. Brand Niemann
Director and Senior Data Scientist
Semantic Community
http://semanticommunity.info/
http://www.meetup.com/Federal-Big-Data-Working-Group/
http://semanticommunity.info/Data_Science/Federal_Big_Data_Working_Group_Meetup

February 18, 2014

Page 2: Federal Big Data Working Group  Meetup

Mission Statement

• Federal: Supports the Federal Big Data Initiative, but not endorsed by the Federal Government or its Agencies;
• Big Data: Supports the Federal Digital Government Strategy which is "treating all content as data", so big data = all your content;
• Working Group: Data Science Teams composed of Federal Government and Non-Federal Government experts producing big data products (How was the data collected, Where is it stored, and What are the results?); and
• Meetup: The world's largest network of local groups to revitalize local community and help people around the world self-organize like MOOCs (Massive Open Online Courses) being considered by the White House.

Page 3: Federal Big Data Working Group  Meetup

Co-organizers

• Brand Niemann and Kate Goodier
• Kate Goodier, Host: Xcelerate Solutions offices in Tysons Corner:
  – Capacity about 50 with Skype and WiFi available. The Silver Line Spring Hill Metro Stop (planned to open in March) is across the street (Route 7 and Spring Hill Road).
• Directions to the building are easy and they have open underground parking:
  – See photo on Web Site from Xcelerate Solutions Office looking south to the Spring Hill Road Silver Line Metro Station (planned to open in March 2014).
• Logistics:
  – Refreshments, restrooms, etc.

Page 4: Federal Big Data Working Group  Meetup

Suggested Format

• 6:30 p.m. Tutorials (I will start with - Proposed GMU Course, and hope that others would offer to do tutorials as well) and Refreshments
  – Continue Data Science Tutorial: Class 3, Recent Tutorial, and Modus Operandi Semantic Knowledge Base
• 7:00 p.m. Introductions and Announcements (10 seconds per individual depending on the size of the group)
  – Remarks by Dr. George Strawn, Director, NITRD/NCO and co-chair of the Federal Big Data Senior Steering Work Group
• 7:15 p.m. Featured Presentation/Demonstration (where did you get the data, where did you store the data, and what were your results?)
  – Evolution of Semantic Technologies-The Value of Merging Smart Data With Big Data: Eric Little, Modus Operandi and Department of Defense Metadata Engineers
  – White Paper “Making Big Data Small” using Semantics & Advanced Analytics for NITRD: Jeff Lessner, Modus Operandi
• 8:30 p.m. Networking/Individual Demos (talk among yourselves and look at one another's work)
• 9:00 p.m. Continue Your Conversations Elsewhere (We need to clear out of the space)

Page 5: Federal Big Data Working Group  Meetup

Next Meetups

• NIST Data Science Symposium, March 4-5, 9 a.m.
  – Hosted at NIST (Gaithersburg, MD)
  – Registration closes February 21st (free)
  – We have a poster presentation at 2:45-4 p.m.
• Fourth Meetup: March 4, 6:30 p.m.
  – Hosted at NSF (Ballston, VA)
  – Welcome by NIH Program Director, Dr. Peter Lyster
  – Brief demo of NIH Semantic Medline/YarcData by Tom Rindflesch and Aaron Bossett
  – Presentation by Drs. George Strawn and Barend Mons on A Data Fairport and Semantic Scientific Publishing
  – Discussions and Networking
• Fifth Meetup: March 18, 6:30 p.m.
  – Continue Data Science Tutorial: Graph Databases and Bigdata SYSTAP Literature Survey of Graph Databases
  – Bigdata SYSTAP, Bryan Thompson, SYSTAP
  – Discussions and Networking

• Sixth Meetup: April 1, 6:30 p.m., Seventh Meetup: April 15, 6:30 p.m., Eighth Meetup: May 4, 6:30 p.m. and Ninth Meetup: May 18, 6:30 p.m.

• 2nd Cloud, SOA, Semantics and Data Science Conference, June (in planning)

Page 6: Federal Big Data Working Group  Meetup

Overview

• Practical Data Science for Data Scientists:
  – 2/4 Asking and Answering Questions About Data
  – Chapters 5 & 6
• Two Book Review Tutorials:
  – Thinking with Data:
    • Recall the Borne Ultimatum: Data Literacy for All! Teach Learning From Data K-12
  – Data Science for Business:
    • Introduction to Data Science for NYU’s new MS in Data Science and adopted by more than twenty other universities for programs in nine countries
• Two Data Science Client Applications:
  – GIS Inc. – EPA Waterways
  – Semantic Verses - Data Science for Business Data Science
• Data Science @ Capital One:
  – Data Science Story and Invitation
  – Senior Data Scientist Position

Page 7: Federal Big Data Working Group  Meetup

Practical Data Science for Data Scientists

http://semanticommunity.info/Data_Science/Practical_Data_Science_for_Data_Scientists

Class 3

Providing On-Line Class With Private Tutoring

Page 8: Federal Big Data Working Group  Meetup

Resources

• Required Textbook
  – Doing Data Science:
    • http://shop.oreilly.com/product/0636920028529.do
    • Free Sampler:
      – http://cdn.oreillystatic.com/oreilly/booksamplers/9781449358655_sampler.pdf (PDF)
• Optional Supplemental Reading:
  – Data Science Starter Kit:
    • http://shop.oreilly.com/category/get/data-science-kit.do
  – DC Data Community:
    • http://datacommunitydc.org/blog/about/
• DC Data Community Calendar:
  – http://datacommunitydc.org/blog/calendar/
• Technology Requirements
  – Internet and Free Tools like Spotfire Cloud:
    • https://spotfire.cloud.tibco.com/tsc/#!/compproductrequest
  – NodeXL:
    • http://nodexl.codeplex.com/

Page 9: Federal Big Data Working Group  Meetup

Class 3

• 1/28 Finding, Cleaning, Analyzing, and Visualizing Data
  – Discuss Reading: Chapters 5 and 6, Present and Discuss Team Homework Exercise, Hands-on Class Exercise, and Team Homework Exercise.
  – My Resources:
    • AOL Government Stories
• Hands-on Class Exercise:
  – Media 6 Degrees Exercise
    • Media 6 Degrees kindly provided a dataset that is perfect for exploring logistic regression models, and evaluating how good the models are. dds_ch5_binary-class-dataset
  – See Spotfire Web Player: Chapter 5 Logistic Regression Media 6 Degrees and Spotfire File
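
For readers without Spotfire, the idea behind the logistic regression exercise (fit a model, then evaluate how good it is) can be sketched in plain Python. This is only an illustration under stated assumptions: the data below are synthetic, not the Media 6 Degrees dataset (whose columns are not described in these slides), and the names `fit_logistic`, `predict_proba`, and `accuracy`, as well as the learning rate and epoch count, are hypothetical choices.

```python
import math
import random

def sigmoid(z):
    # Logistic function, written to avoid overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(w, x):
    # w[0] is the bias; w[1:] are the feature weights.
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))

def fit_logistic(xs, ys, lr=0.5, epochs=3000):
    """Fit weights by batch gradient descent on the mean log-loss."""
    w = [0.0] * (len(xs[0]) + 1)
    n = len(xs)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(xs, ys):
            err = predict_proba(w, x) - y
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wi - lr * g / n for wi, g in zip(w, grad)]
    return w

def accuracy(w, xs, ys):
    # Fraction of rows where the 0.5-threshold prediction matches the label.
    hits = sum((predict_proba(w, x) >= 0.5) == (y == 1) for x, y in zip(xs, ys))
    return hits / len(xs)

# Synthetic data (an assumption for illustration): label is 1 when x1 + x2 > 1.
random.seed(0)
xs = [[random.random(), random.random()] for _ in range(200)]
ys = [1 if x[0] + x[1] > 1.0 else 0 for x in xs]

# Hold out the last 50 rows to evaluate the fitted model.
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]
weights = fit_logistic(train_x, train_y)
test_accuracy = accuracy(weights, test_x, test_y)
print("held-out accuracy:", round(test_accuracy, 3))
```

Evaluating on held-out rows, rather than on the rows used for fitting, is the simplest way to check "how good the models are".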

Page 10: Federal Big Data Working Group  Meetup

Discuss Reading

• Chapter 5:
  – In this chapter, we’re talking about logistic regression, but there are other classification algorithms available, including decision trees (which we’ll cover in Chapter 7), random forests (Chapter 7), and support vector machines and neural networks (which we aren’t covering in this book).
• Chapter 6:
  – The main topics for this chapter: time series, financial modeling, and fancy-pants regression, and building a GetGlue-like recommendation system to address the problem of content discovery within the movie and TV space.

Page 11: Federal Big Data Working Group  Meetup

Present and Discuss Team Homework Exercise

• Select One, But Please Present Both:
  – Jake’s Exercise: Naive Bayes for Article Classification: NYT Data Set (31 CSV files, 151 MB) Already Used
  – A Spam Filter for Individual Words: To do this yourself, go online and download Enron emails

• My Note: Because of the difficulty with these data sets, I provided the Two Data Science Client Applications.
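
A spam filter for an individual word rests on Bayes' rule: P(spam | word) = P(word | spam) · P(spam) / P(word). The sketch below shows that calculation on a tiny made-up corpus, not the Enron emails; the function name `word_spam_probability` and the toy messages are hypothetical.

```python
def word_spam_probability(word, messages):
    """Estimate P(spam | word) with Bayes' rule from labeled messages.

    `messages` is a list of (text, is_spam) pairs. A message "contains" the
    word if the word is one of its lowercase whitespace-separated tokens.
    """
    n_total = len(messages)
    n_spam = sum(1 for _, is_spam in messages if is_spam)
    n_word = sum(1 for text, _ in messages if word in text.lower().split())
    n_word_and_spam = sum(
        1 for text, is_spam in messages
        if is_spam and word in text.lower().split()
    )
    if n_word == 0 or n_spam == 0:
        return None  # no estimate possible without the word and some spam

    p_spam = n_spam / n_total
    p_word = n_word / n_total
    p_word_given_spam = n_word_and_spam / n_spam
    return p_word_given_spam * p_spam / p_word

# Hypothetical toy corpus, for illustration only.
toy_messages = [
    ("win money now", True),
    ("free money offer", True),
    ("meeting agenda attached", False),
    ("send money for lunch", False),
    ("project meeting moved", False),
]

p_money = word_spam_probability("money", toy_messages)
p_meeting = word_spam_probability("meeting", toy_messages)
print(p_money, p_meeting)
```

Here "money" appears in three messages, two of them spam, so the estimate is 2/3; "meeting" never appears in spam, so its estimate is 0.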

Page 14: Federal Big Data Working Group  Meetup

Team Homework Exercise

• Exercise: GetGlue and Timestamped Event Data
  – GetGlue kindly provided a dataset for us to explore their data, which contains timestamped events of users checking in and rating TV shows and movies.
  – Raw data is 11 GB (once it’s uncompressed) and could not be imported into Spotfire.
• Get the Data: Go to Yahoo! Finance and download daily data from a stock that has at least eight years of data, making sure it goes from earlier to later.
  – If you don’t know how to do it, Google it. Yahoo: http://finance.yahoo.com/q/hp?s=%5EO...torical+Prices (CSV) See Spotfire Web Player and File
• Form Teams (Same or New), Ask Me Questions, and Prepare to Present Next Week
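
Downloaded price files often arrive newest first, so "making sure it goes from earlier to later" usually means sorting by date before any time-series work. The sketch below assumes a CSV with `Date` and `Close` columns and uses three made-up rows; the helper names and the choice of daily log returns as a first derived series are illustrative assumptions, not part of the exercise text.

```python
import csv
import io
import math
from datetime import datetime

# Hypothetical rows in the shape of a daily-price CSV export, newest first.
SAMPLE_CSV = """Date,Close
2014-02-14,104.0
2014-02-13,102.0
2014-02-12,100.0
"""

def load_prices_oldest_first(csv_text):
    """Parse (date, close) rows and sort them from earlier to later."""
    rows = [
        (datetime.strptime(r["Date"], "%Y-%m-%d").date(), float(r["Close"]))
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    rows.sort(key=lambda row: row[0])
    return rows

def log_returns(prices):
    """Daily log returns log(p_t / p_{t-1}) for a chronological price list."""
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]

rows = load_prices_oldest_first(SAMPLE_CSV)
returns = log_returns([close for _, close in rows])
print(rows[0][0], rows[-1][0], [round(r, 4) for r in returns])
```

With a real download, the same two functions would be applied to the file contents read from disk instead of `SAMPLE_CSV`.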

Page 16: Federal Big Data Working Group  Meetup

Two Book Review Tutorials

• Thinking with Data:
  – Recall the Borne Ultimatum: Data Literacy for All! Teach Learning From Data K-12:
    • Thinking with Data: Book Review Tutorial
  – In-depth look at many of the same topics in Data Science for Business, with a greater focus on the high-level technical ideas.
• Data Science for Business:
  – Introduction to Data Science for NYU’s new MS in Data Science and adopted by more than twenty other universities for programs in nine countries:
    • Data Science for Business: Book Review Tutorial
  – Used in Semantic Verses - Data Science for Business Pilot.

Page 17: Federal Big Data Working Group  Meetup

Two Data Science Client Applications 1

• GIS Inc. – EPA Waterways:
  – We’ve been following your work with the Voyager implementation at the National Geospatial-Intelligence Agency (NGA). I’m doing some work for the Williams Company (Oil and Gas) and we have just launched Voyager at 5 of their office locations. My team was wondering if you might have any time to relay any lessons learned with your implementation at NGA?
• Semantic Community:
  – I use Voyager like a Geographic Clearinghouse to find GIS data and then Spotfire 6.0 to analyze it.
  – What is the spatial relationship between Williams Co. current and planned activities and EPA Waterways data?

Page 18: Federal Big Data Working Group  Meetup

Answer the Questions About EPA Waterways

• Where did we find the data?
  – Online most recent (Better than Voyager this time)
• Where did we store the data?
  – Shape files & Excel spreadsheets (ultimately Spotfire)
• What did we find when we analyzed the data?
  – See Spotfire dashboards
• What is our data story and product?
  – See Spotfire dashboards and TIBCO Spotfire 6 for Data Science Documentation

Page 19: Federal Big Data Working Group  Meetup

Where did we find the data?

Web Site

The Environmental Protection Agency has maintained public databases on the condition of rivers, lakes and streams for decades. But until about a year ago, anyone who wanted to get at that data faced a labyrinthine process, either devising search queries to try to navigate the databases or resorting to a Freedom of Information Act request.

My Note: We improved on that!

Page 20: Federal Big Data Working Group  Meetup

Where did we store the data?

http://semanticommunity.info/@api/deki/files/28161/EPAWaterways.xlsx

The Data Ecosystem!

Page 22: Federal Big Data Working Group  Meetup

What is our data story and product?

• Data Ecosystem:
  – My Note: I downloaded, inventoried and imported these to Spotfire, which resulted in a 2.5 GB Spotfire file that I then reduced three times, to 1.7 GB, 0.7 GB, and finally a 0.3 GB Spotfire file that I published to the Web Player.
• Individual Tabs:
  – 303(d) Listed Impaired Waters
  – 305(b) Waters
  – 2002 Impaired Waters
  – Watershed Boundaries 2002
  – Impaired Waters with TMDL
  – 2009 Beaches
  – Water Quality Standards Program

Page 23: Federal Big Data Working Group  Meetup

Two Data Science Client Applications 2

• Semantic Verses - Data Science for Business Data Science:
• “Magnet is the only engine that treats topics as semantic objects, which gives it a competitive edge since the identification of “key topics” is generally considered to be the main feature of any semantic engine.”
  – Source: Walid S. Saba, PhD, AI/NLP Scientist, February 2014.
  – Produced a Data Science for Business Knowledge Base in MindTouch, Excel, and Spotfire:
• Structured Mashup with everything treated as an object with a well-defined URL for the Glossary (taxonomy) and Table of Contents (thesaurus) Integrated together in an Information Model!
  – Allows one to construct a natural language front-end for enterprise data (and big data) integration across multiple sources.

Page 24: Federal Big Data Working Group  Meetup

Where did we store the data?

http://semanticommunity.info/@api/deki/files/28363/DataScienceforBusiness.xlsx

The Data Ecosystem:
491 rows by 16 columns
94 rows by 12 columns
23 rows by 2 columns
so far!

Page 25: Federal Big Data Working Group  Meetup

TIBCO Spotfire 6 for Data Science

http://semanticommunity.info/Data_Science/TIBCO_Spotfire_6_for_Data_Science#Story

Data Science Recipe for Data Science Cooking

Page 26: Federal Big Data Working Group  Meetup

TIBCO Spotfire 6 for Data Science

http://semanticommunity.info/Data_Science/TIBCO_Spotfire_6_for_Data_Science

The Data Relationships tool is used for investigating the relationships between different column pairs. The Linear regression and the Spearman R options allow you to compare numerical columns, the Anova option will help you determine how well a category column categorizes values in a (numerical) value column, the Kruskal-Wallis option is used to compare sortable columns to categorical columns, and the Chi-square option helps you to compare categorical columns.
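
To make two of these comparisons concrete, the sketch below computes Spearman's R for a pair of numerical columns and the chi-square statistic for a pair of categorical columns, using the textbook definitions. It is an illustration only and does not reproduce Spotfire's implementation; the column values and function names are hypothetical.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(xs, ys):
    """Pearson correlation; assumes neither column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def spearman_r(xs, ys):
    """Spearman's R: the Pearson correlation of the two columns' ranks."""
    return pearson(ranks(xs), ranks(ys))

def chi_square_statistic(a, b):
    """Pearson chi-square statistic for two categorical columns."""
    n = len(a)
    count_a, count_b, count_ab = {}, {}, {}
    for x, y in zip(a, b):
        count_a[x] = count_a.get(x, 0) + 1
        count_b[y] = count_b.get(y, 0) + 1
        count_ab[(x, y)] = count_ab.get((x, y), 0) + 1
    stat = 0.0
    for x, na in count_a.items():
        for y, nb in count_b.items():
            expected = na * nb / n  # count expected if columns were independent
            observed = count_ab.get((x, y), 0)
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical columns, for illustration only.
size = [1.0, 2.0, 3.0, 4.0, 5.0]
weight = [1.0, 4.0, 9.0, 16.0, 25.0]        # increases with size
color = ["red", "red", "blue", "blue"]
shape = ["round", "round", "flat", "flat"]  # perfectly associated with color

rho = spearman_r(size, weight)
chi2 = chi_square_statistic(color, shape)
print(rho, chi2)
```

Because Spearman's R works on ranks, any strictly increasing relationship gives R = 1 even when it is not linear, which is why it complements the linear regression option.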

Page 27: Federal Big Data Working Group  Meetup

What did we find when we analyzed the data?

My Note: This and the next slide show why it is impossible to distinguish between poisonous and non-poisonous mushrooms.

Page 28: Federal Big Data Working Group  Meetup

What did we find when we analyzed the data?

Page 29: Federal Big Data Working Group  Meetup

Data Science @ Capital One

• Data Science Story and Invitation:

  – The classic story of little Signet Bank from the 1990s provides a case in point for Data and Data Science Capability as a Strategic Asset. Fairbank and Morris became Chairman and CEO and President and COO, and proceeded to apply data science principles throughout the business, not just customer acquisition but retention as well. You may not have heard of little Signet Bank, but if you’re reading this book you’ve probably heard of the spin-off: Capital One. Fairbank and Morris’s new company grew to be one of the largest credit card issuers in the industry with one of the lowest chargeoff rates. My Note: I invited their data science lead to present.

• Senior Data Scientist Position:
  – Basic Qualifications:
    • Bachelor’s Degree
    • 2 years experience in Hadoop
    • 7 years of experience with data mining, machine learning, statistical modeling tools and underlying algorithms
    • 5 years experience with relational database and SQL
    • 5 years experience working with large, unstructured (terabyte or larger) data sets

Page 30: Federal Big Data Working Group  Meetup

Preview of What You Are Going To Hear

• Remarks by Dr. George Strawn, Director, NITRD/NCO and co-chair of the Federal Big Data Senior Steering Work Group

• Evolution of Semantic Technologies-The Value of Merging Smart Data With Big Data: Eric Little, Modus Operandi and Department of Defense Metadata Engineers

• White Paper “Making Big Data Small” using Semantics & Advanced Analytics for NITRD: Jeff Lessner, Modus Operandi