Federal Big Data Working Group Meetup

1

Federal Big Data Working Group Meetup

Dr. Brand Niemann
Director and Senior Data Scientist
Semantic Community
http://semanticommunity.info/

http://www.meetup.com/Federal-Big-Data-Working-Group/
http://semanticommunity.info/Data_Science/Federal_Big_Data_Working_Group_Meetup

February 18, 2014


2

Mission Statement
• Federal: Supports the Federal Big Data Initiative, but is not endorsed by the Federal Government or its Agencies;
• Big Data: Supports the Federal Digital Government Strategy, which is "treating all content as data", so big data = all your content;
• Working Group: Data Science Teams composed of Federal Government and Non-Federal Government experts producing big data products (How was the data collected, Where is it stored, and What are the results?); and
• Meetup: The world's largest network of local groups to revitalize local community and help people around the world self-organize, like the MOOCs (Massive Open Online Courses) being considered by the White House.

3

Co-organizers
• Brand Niemann and Kate Goodier
• Kate Goodier, Host: Xcelerate Solutions offices in Tysons Corner:
  – Capacity about 50 with Skype and WiFi available. The Silver Line Spring Hill Metro Stop (planned to open in March) is across the street (Route 7 and Spring Hill Road).
• Directions to the building are easy and they have open underground parking:
  – See photo on Web Site from Xcelerate Solutions Office looking south to the Spring Hill Road Silver Line Metro Station (planned to open in March 2014).
• Logistics:
  – Refreshments, restrooms, etc.

4

Suggested Format
• 6:30 p.m. Tutorials (I will start with - Proposed GMU Course, and hope that others would offer to do tutorials as well) and Refreshments
  – Continue Data Science Tutorial: Class 3, Recent Tutorial, and Modus Operandi Semantic Knowledge Base
• 7:00 p.m. Introductions and Announcements (10 seconds per individual depending on the size of the group)
  – Remarks by Dr. George Strawn, Director, NITRD/NCO and co-chair of the Federal Big Data Senior Steering Work Group
• 7:15 p.m. Featured Presentation/Demonstration (where did you get the data, where did you store the data, and what were your results?)
  – Evolution of Semantic Technologies-The Value of Merging Smart Data With Big Data: Eric Little, Modus Operandi and Department of Defense Metadata Engineers
  – White Paper “Making Big Data Small” using Semantics & Advanced Analytics for NITRD: Jeff Lessner, Modus Operandi
• 8:30 p.m. Networking/Individual Demos (talk among yourselves and look at one another's work)
• 9:00 p.m. Continue Your Conversations Elsewhere (We need to clear out of the space)

5

Next Meetups
• NIST Data Science Symposium, March 4-5, 9 a.m.
  – Hosted at NIST (Gaithersburg, MD)
  – Registration closes February 21st (free)
  – We have a poster presentation at 2:45-4 p.m.
• Fourth Meetup: March 4, 6:30 p.m.
  – Hosted at NSF (Ballston, VA)
  – Welcome by NIH Program Director, Dr. Peter Lyster
  – Brief demo of NIH Semantic Medline/YarcData by Tom Rindflesch and Aaron Bossett
  – Presentation by Drs. George Strawn and Barend Mons on A Data Fairport and Semantic Scientific Publishing
  – Discussions and Networking
• Fifth Meetup: March 18, 6:30 p.m.
  – Continue Data Science Tutorial: Graph Databases and Bigdata SYSTAP Literature Survey of Graph Databases
  – Bigdata SYSTAP, Bryan Thompson, SYSTAP
  – Discussions and Networking

• Sixth Meetup: April 1, 6:30 p.m., Seventh Meetup: April 15, 6:30 p.m., Eighth Meetup: May 4, 6:30 p.m. and Ninth Meetup: May 18, 6:30 p.m.

• 2nd Cloud, SOA, Semantics and Data Science Conference, June (in planning)

6

Overview
• Practical Data Science for Data Scientists:
  – 2/4 Asking and Answering Questions About Data
  – Chapters 5 & 6
• Two Book Review Tutorials:
  – Thinking with Data:
    • Recall the Borne Ultimatum: Data Literacy for All! Teach Learning From Data K-12
  – Data Science for Business:
    • Introduction to Data Science for NYU’s new MS in Data Science and adopted by more than twenty other universities for programs in nine countries
• Two Data Science Client Applications:
  – GIS Inc. – EPA Waterways
  – Semantic Verses - Data Science for Business Data Science
• Data Science @ Capital One:
  – Data Science Story and Invitation
  – Senior Data Scientist Position

7

Practical Data Science for Data Scientists

http://semanticommunity.info/Data_Science/Practical_Data_Science_for_Data_Scientists

Class 3

Providing On-Line Class With Private Tutoring


8

Resources
• Required Textbook
  – Doing Data Science:
    • http://shop.oreilly.com/product/0636920028529.do
    • Free Sampler:
      – http://cdn.oreillystatic.com/oreilly/booksamplers/9781449358655_sampler.pdf (PDF)
• Optional Supplemental Reading:
  – Data Science Starter Kit:
    • http://shop.oreilly.com/category/get/data-science-kit.do
  – DC Data Community:
    • http://datacommunitydc.org/blog/about/
• DC Data Community Calendar:
  – http://datacommunitydc.org/blog/calendar/
• Technology Requirements
  – Internet and Free Tools like Spotfire Cloud:
    • https://spotfire.cloud.tibco.com/tsc/#!/compproductrequest
  – NodeXL:
    • http://nodexl.codeplex.com/


http://semanticommunity.info/@api/deki/files/27053/9781449358655_sampler.pdf


9

Class 3
• 1/28 Finding, Cleaning, Analyzing, and Visualizing Data
  – Discuss Reading: Chapters 5 and 6, Present and Discuss Team Homework Exercise, Hands-on Class Exercise, and Team Homework Exercise.
  – My Resources:
    • AOL Government Stories
• Hands-on Class Exercise:
  – Media 6 Degrees Exercise
    • Media 6 Degrees kindly provided a dataset that is perfect for exploring logistic regression models, and evaluating how good the models are: dds_ch5_binary-class-dataset
  – See Spotfire Web Player: Chapter 5 Logistic Regression Media 6 Degrees and Spotfire File

http://semanticommunity.info/Data_Science/Doing_Data_Science#5._Logistic_Regression

http://semanticommunity.info/Data_Science/Doing_Data_Science#6._Time_Stamps_and_Financial_Modeling

http://semanticommunity.info/#AOL_Government_Stories

http://semanticommunity.info/Data_Science/Doing_Data_Science#Media_6_Degrees_Exercise

http://semanticommunity.info/@api/deki/files/27793/dds_ch5_binary-class-dataset.txt

https://spotfire.cloud.tibco.com/public/ViewAnalysis.aspx?file=/users/bniemann/Public/DoingDataScienceChapters2-5-Spotfire&waid=ca633b46bd01c0002b17c-21015227bfb983


http://semanticommunity.info/@api/deki/files/27802/DoingDataScienceChapters2-5-Spotfire.dxp

10

Discuss Reading

• Chapter 5:
  – In this chapter, we’re talking about logistic regression, but there are other classification algorithms available, including decision trees (which we’ll cover in Chapter 7), random forests (Chapter 7), and support vector machines and neural networks (which we aren’t covering in this book).
• Chapter 6:
  – The main topics for this chapter are time series, financial modeling, and fancy-pants regression, plus building a GetGlue-like recommendation system to address the problem of content discovery within the movie and TV space.

11

Present and Discuss Team Homework Exercise

• Select One, But Please Present Both:
  – Jake’s Exercise: Naive Bayes for Article Classification: NYT Data Set (31 CSV files, 151 MB) Already Used
  – A Spam Filter for Individual Words: To do this yourself, go online and download Enron emails
• My Note: Because of the difficulty with these data sets, I provided the Two Data Science Client Applications.
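
The spam filter for individual words comes down to Bayes' rule applied to one word at a time. The sketch below is only an illustration under stated assumptions: the message counts are made up (they are not taken from the Enron emails), and the function name `p_spam_given_word` is hypothetical rather than anything defined in the book or these slides.

```python
def p_spam_given_word(spam_with_word, spam_total, ham_with_word, ham_total):
    """Bayes' rule for a single word: P(spam | word appears).

    All four arguments are message counts from a labeled training set.
    """
    p_spam = spam_total / (spam_total + ham_total)      # prior P(spam)
    p_ham = 1.0 - p_spam                                 # prior P(ham)
    p_word_given_spam = spam_with_word / spam_total      # likelihood in spam
    p_word_given_ham = ham_with_word / ham_total         # likelihood in ham
    evidence = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    if evidence == 0:
        # Word never seen in training: fall back to the prior.
        return p_spam
    return p_word_given_spam * p_spam / evidence


# Hypothetical counts: 1500 spam and 3500 non-spam (ham) messages.
SPAM_TOTAL, HAM_TOTAL = 1500, 3500
# word -> (spam messages containing it, ham messages containing it)
word_counts = {
    "meeting": (16, 153),
    "money": (180, 45),
}

scores = {
    word: p_spam_given_word(s, SPAM_TOTAL, h, HAM_TOTAL)
    for word, (s, h) in word_counts.items()
}
for word, score in scores.items():
    print(f"P(spam | '{word}') = {score:.3f}")
```

With a single word the posterior reduces to the share of messages containing that word that are spam, which is a useful sanity check before moving on to the full Naive Bayes model that combines many words.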

http://semanticommunity.info/Data_Science/Doing_Data_Science#Jake.E2.80.99s_Exercise:_Naive_Bayes_for_Article_Classification


http://semanticommunity.info/Data_Science/Doing_Data_Science#A_Spam_Filter_for_Individual_Words

https://www.cs.cmu.edu/~enron/

12

Hands-on Class Exercise

• Media 6 Degrees Exercise:
  – Media 6 Degrees kindly provided a dataset that is perfect for exploring logistic regression models, and evaluating how good the models are: dds_ch5_binary-class-dataset
  – See Spotfire Web Player Chapter 5 Logistic Regression Media 6 Degrees and Spotfire File
• See Spotfire User's Guide for Data Science:
  – Logistic Regression Method
  – How to Use the Evaluation Page
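
For readers who want to see what "exploring logistic regression models, and evaluating how good the models are" involves outside Spotfire, here is a minimal sketch. It is an assumption-laden illustration: it uses synthetic data with a single feature rather than the Media 6 Degrees file (whose column layout is not assumed here), fits by plain batch gradient descent, and evaluates with holdout accuracy. The function names and parameter values are illustrative, not Spotfire's method.

```python
import math
import random


def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit P(y=1 | x) = sigmoid(w*x + b) by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y   # gradient of log loss w.r.t. the logit
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def accuracy(xs, ys, w, b, threshold=0.5):
    """Share of examples whose thresholded prediction matches the label."""
    hits = sum((sigmoid(w * x + b) >= threshold) == bool(y) for x, y in zip(xs, ys))
    return hits / len(xs)


# Synthetic data (an assumption): the label is more likely 1 when x is large.
random.seed(0)
xs = [random.uniform(-3, 3) for _ in range(200)]
ys = [1 if random.random() < sigmoid(2.0 * x) else 0 for x in xs]

# Hold out the last 50 rows so the model is scored on data it did not see.
train_x, train_y = xs[:150], ys[:150]
test_x, test_y = xs[150:], ys[150:]

w, b = fit_logistic(train_x, train_y)
test_acc = accuracy(test_x, test_y, w, b)
print(f"w={w:.2f}, b={b:.2f}, holdout accuracy={test_acc:.2f}")
```

Accuracy is only one way to judge a classifier; for an imbalanced binary dataset, measures such as lift or the ROC curve are usually more informative.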



http://semanticommunity.info/@api/deki/files/27793/dds_ch5_binary-class-dataset.txt


http://semanticommunity.info/@api/deki/files/27802/DoingDataScienceChapters2-5-Spotfire.dxp

http://semanticommunity.info/Data_Science/TIBCO_Spotfire_6_for_Data_Science




http://semanticommunity.info/Data_Science/TIBCO_Spotfire_6_for_Data_Science#Logistic_Regression_Method

http://semanticommunity.info/Data_Science/TIBCO_Spotfire_6_for_Data_Science#How_to_Use_the_Evaluation_Page

13

Chapter 5 Logistic Regression
Media 6 Degrees

Web Player

https://spotfire.cloud.tibco.com/public/ViewAnalysis.aspx?file=/users/bniemann/Public/DoingDataScienceChapters2-5-Spotfire&waid=b07f22899b9be7a2d22aa-21015227bfb983

14

Team Homework Exercise
• Exercise: GetGlue and Timestamped Event Data
  – GetGlue kindly provided a dataset for us to explore their data, which contains timestamped events of users checking in and rating TV shows and movies.
  – Raw data is 11 GB (once it’s uncompressed) and could not be imported into Spotfire.
• Get the Data: Go to Yahoo! Finance and download daily data from a stock that has at least eight years of data, making sure it goes from earlier to later.
  – If you don’t know how to do it, Google it. Yahoo: http://finance.yahoo.com/q/hp?s=%5EO...torical+Prices (CSV) See Spotfire Web Player and File
• Form Teams (Same or New), Ask Me Questions, and Prepare to Present Next Week
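
The "making sure it goes from earlier to later" step is easy to get wrong, because downloaded price files often arrive newest-first. The sketch below shows one way to load daily prices, sort them chronologically, and compute daily log returns. It assumes a CSV with `Date` and `Close` columns and uses three made-up rows instead of a real download; the column names, sample values, and function names are all assumptions for illustration.

```python
import csv
import io
import math
from datetime import datetime

# Hypothetical sample, deliberately in newest-first order.
SAMPLE_CSV = """Date,Close
2014-01-06,102.0
2014-01-03,100.0
2014-01-02,101.0
"""


def load_prices(text):
    """Parse (date, close) pairs and sort them from earliest to latest."""
    rows = csv.DictReader(io.StringIO(text))
    prices = [
        (datetime.strptime(row["Date"], "%Y-%m-%d").date(), float(row["Close"]))
        for row in rows
    ]
    prices.sort(key=lambda pair: pair[0])
    return prices


def log_returns(prices):
    """Daily log returns log(close_t / close_{t-1}) for chronologically sorted prices."""
    closes = [close for _, close in prices]
    return [math.log(later / earlier) for earlier, later in zip(closes, closes[1:])]


prices = load_prices(SAMPLE_CSV)
returns = log_returns(prices)
for (day, _), r in zip(prices[1:], returns):
    print(day.isoformat(), f"{r:+.5f}")
```

Log returns add up across days, so their sum over the whole file equals the log of the last close divided by the first close, which gives a quick check that the sort order is right.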

http://semanticommunity.info/Data_Science/Doing_Data_Science#Exercise:_GetGlue_and_Timestamped_Event_Data






http://finance.yahoo.com/q?s=GO.TO

http://finance.yahoo.com/q/hp?s=%5EOEX+Historical+Prices




http://semanticommunity.info/@api/deki/files/27804/YahooHistoricStockPrices01012014.csv

https://spotfire.cloud.tibco.com/public/ViewAnalysis.aspx?file=/users/bniemann/Public/YahooHistoricStockPrices01012014&waid=ee0b29d30907cd0a2c3cd-13004927bfe819


http://semanticommunity.info/@api/deki/files/27805/YahooHistoricStockPrices01012014.dxp

15

Chapter 6 Timestamps and Financial Modeling

Web Player

https://spotfire.cloud.tibco.com/public/ViewAnalysis.aspx?file=/users/bniemann/Public/YahooHistoricStockPrices01012014&waid=e75f25f709a571df398c6-13004927bfe819

16

Two Book Review Tutorials

• Thinking with Data:
  – Recall the Borne Ultimatum: Data Literacy for All! Teach Learning From Data K-12:
    • Thinking with Data: Book Review Tutorial
      – In-depth look at many of the same topics in Data Science for Business, with a greater focus on the high-level technical ideas.
• Data Science for Business:
  – Introduction to Data Science for NYU’s new MS in Data Science and adopted by more than twenty other universities for programs in nine countries:
    • Data Science for Business: Book Review Tutorial
      – Used in Semantic Verses - Data Science for Business Pilot.

http://semanticommunity.info/@api/deki/files/28361/BrandNiemann02122014.pptx




17

Two Data Science Client Applications 1

• GIS Inc. – EPA Waterways:
  – We’ve been following your work with the Voyager implementation at the National Geospatial Intelligence Agency (NGA). I’m doing some work for the Williams Company (Oil and Gas) and we have just launched Voyager at 5 of their office locations. My team was wondering if you might have any time to relay any lessons learned with your implementation at NGA?
• Semantic Community:
  – I use Voyager like a Geographic Clearinghouse to find GIS data and then Spotfire 6.0 to analyze it.
  – What is the spatial relationship between Williams Co. current and planned activities and EPA Waterways data?

18

Answer the Questions About EPA Waterways

• Where did we find the data?
  – Online most recent (Better than Voyager this time)
• Where did we store the data?
  – Shape files & Excel spreadsheets (ultimately Spotfire)
• What did we find when we analyzed the data?
  – See Spotfire dashboards
• What is our data story and product?
  – See Spotfire dashboards and TIBCO Spotfire 6 for Data Science Documentation

19

Where did we find the data?

Web Site

The Environmental Protection Agency has maintained public databases on the condition of rivers, lakes and streams for decades. But until about a year ago, anyone who wanted to get at that data faced a labyrinthine process, either devising search queries to try to navigate the databases or resorting to a Freedom of Information Act request.

My Note: We improved on that!

http://water.epa.gov/scitech/datait/tools/waters/data/downloads.cfm

20

Where did we store the data?

http://semanticommunity.info/@api/deki/files/28161/EPAWaterways.xlsx

The Data Ecosystem!



21

What did we find when we analyzed the data?

Web Player

https://spotfire.cloud.tibco.com/public/ViewAnalysis.aspx?file=/users/bniemann/Public/EPAWaterways3-Spotfire&waid=c7427bf190fc287eb8641-01171027bfe341

22

What is our data story and product?

• Data Ecosystem:
  – My Note: I downloaded, inventoried, and imported these into Spotfire, which resulted in a 2.5 GB Spotfire file. I then reduced it three times, to 1.7 GB, 0.7 GB, and finally 0.3 GB, and published the 0.3 GB Spotfire file to the Web Player.
• Individual Tabs:
  – 303(d) Listed Impaired Waters
  – 305(b) Waters
  – 2002 Impaired Waters
  – Watershed Boundaries 2002
  – Impaired Waters with TMDL
  – 2009 Beaches
  – Water Quality Standards Program

23

Two Data Science Client Applications 2

• Semantic Verses - Data Science for Business Data Science:
  • “Magnet is the only engine that treats topics as semantic objects, which gives it a competitive edge since the identification of “key topics” is generally considered to be the main feature of any semantic engine.”
    – Source: Walid S. Saba, PhD, AI/NLP Scientist, February 2014.
  – Produced a Data Science for Business Knowledge Base in MindTouch, Excel, and Spotfire:
    • Structured Mashup with everything treated as an object with a well-defined URL for the Glossary (taxonomy) and Table of Contents (thesaurus) Integrated together in an Information Model!
      – Allows one to construct a natural language front-end for enterprise data (and big data) integration across multiple sources.

24

Where did we store the data?

http://semanticommunity.info/@api/deki/files/28363/DataScienceforBusiness.xlsx

The Data Ecosystem:
491 rows by 16 columns
94 rows by 12 columns
23 rows by 2 columns
so far!



25

TIBCO Spotfire 6 for Data Science

http://semanticommunity.info/Data_Science/TIBCO_Spotfire_6_for_Data_Science#Story

Data Science Recipe for Data Science Cooking



26

TIBCO Spotfire 6 for Data Science


The Data Relationships tool is used for investigating the relationships between different column pairs. The Linear regression and the Spearman R options allow you to compare numerical columns, the Anova option will help you determine how well a category column categorizes values in a (numerical) value column, the Kruskal-Wallis option is used to compare sortable columns to categorical columns, and the Chi-square option helps you to compare categorical columns.
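
To make two of these options concrete, the sketch below computes a Spearman rank correlation for a pair of numerical columns and a chi-square statistic for a pair of categorical columns, using the textbook definitions. It is not Spotfire's implementation, which is not shown in these slides; the function names and the small example inputs (including the odor-by-edibility table) are hypothetical.

```python
def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman R: the Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def chi_square(table):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Numerical pair: y rises with x but not linearly, so the ranks agree fully.
rho = spearman([1, 2, 3, 4, 5], [1, 4, 9, 16, 25])

# Categorical pair (hypothetical counts): rows = odor present / absent,
# columns = poisonous / edible.
stat = chi_square([[30, 10], [10, 30]])
print(f"Spearman R = {rho:.3f}, chi-square = {stat:.3f}")
```

A chi-square statistic near zero means the two categorical columns look independent; larger values indicate that one column carries information about the other.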



27


My Note: This and the next slide show why it is impossible to distinguish between poisonous and non-poisonous mushrooms.

28


29

Data Science @ Capital One
• Data Science Story and Invitation:
  – The classic story of little Signet Bank from the 1990s provides a case in point for Data and Data Science Capability as a Strategic Asset. Fairbank and Morris became Chairman and CEO and President and COO, and proceeded to apply data science principles throughout the business, not just customer acquisition but retention as well. You may not have heard of little Signet Bank, but if you’re reading this book you’ve probably heard of the spin-off: Capital One. Fairbank and Morris’s new company grew to be one of the largest credit card issuers in the industry with one of the lowest charge-off rates.
  – My Note: I invited their data science lead to present.
• Senior Data Scientist Position:
  – Basic Qualifications:
    • Bachelor’s Degree
    • 2 years of experience in Hadoop
    • 7 years of experience with data mining, machine learning, statistical modeling tools and underlying algorithms
    • 5 years of experience with relational database and SQL
    • 5 years of experience working with large, unstructured (terabyte or larger) data sets

http://jobs.capitalone.com/vienna/quantitative-analytics/jobid4042033-senior-director-data-scientist-web-%EF%B9%A0-mobile-jobs


30

Preview of What You Are Going To Hear

• Remarks by Dr. George Strawn, Director, NITRD/NCO and co-chair of the Federal Big Data Senior Steering Work Group

• Evolution of Semantic Technologies-The Value of Merging Smart Data With Big Data: Eric Little, Modus Operandi and Department of Defense Metadata Engineers

• White Paper “Making Big Data Small” using Semantics & Advanced Analytics for NITRD: Jeff Lessner, Modus Operandi
