Page 1

Federal Big Data Working Group Meetup

Dr. Brand Niemann
Director and Senior Data Scientist
Semantic Community
http://semanticommunity.info/
http://www.meetup.com/Federal-Big-Data-Working-Group/
http://semanticommunity.info/Data_Science/Federal_Big_Data_Working_Group_Meetup

March 18, 2014

Page 2

Mission Statement

• Federal: Supports the Federal Big Data Initiative, but not endorsed by the Federal Government or its Agencies;
• Big Data: Supports the Federal Digital Government Strategy, which is "treating all content as data", so big data = all your content;
• Working Group: Data Science Teams composed of Federal Government and Non-Federal Government experts producing big data products (How was the data collected, Where is it stored, and What are the results?); and
• Meetup: The world's largest network of local groups to revitalize local community and help people around the world self-organize, like the MOOCs (Massive Open Online Courses) being considered by the White House.

Co-organizers: Brand Niemann and Kate Goodier

Page 3

Joint NSF-NIH Biomedical Big Data Research Meetup

http://semanticommunity.info/Data_Science/Euretos_BRAIN#Story

“Thanks again for a wonderful gathering of deep thinkers at the NIH-NSF Big Data event -- that was terrific. Great line up of speakers.”

Page 4

Scientific Data: A View from the US

• Dr. George Strawn, Director, NITRD/NCO and co-chair of the Federal Big Data Senior Steering Work Group:
  – Public access mandated for "scientific results" supported by the U.S. government
  – Federal agencies have submitted their "initial plans" for public access to scientific data to OSTP
  – Digital Object Architecture:
    • An "hour glass" for data? (As the Internet was an hour glass for networks: TCP/IP at the narrow point; many applications above, many implementations below)
  – One result will be to make the scientific record into a first class scientific object

http://semanticommunity.info/@api/deki/files/28467/GeorgeStrawn01132014.ppt

Page 5

Activities

• White House OSTP - MIT Big Data Privacy Workshop:
  – Story and Network Analysis of Tweets:
    • April 1st Meetup with Kate Goodier and Marc Smith
• NIST Data Science Symposium:
  – Poster and Story:
    • Data Science Team Pilot with Information Services Office
• White Paper for NIST and NITRD:
  – “Making Big Data Small” using Data Science and Semantics:
    • “Thanks again for your effort in putting this program together!”
• Information Visualization MOOC:
  – Story and Course Work:
    • Forming Teams to Work with Clients for the Remaining 7 Weeks
• DARPA Big Mechanism:
  – Story and Pilot:
    • April 15th Meetup with Mike Megginson, Northrop Grumman, and Fredrik Salvesen, YarcData (in planning)

Page 6

Agenda

• 6:30 p.m. Tutorials (Proposed GMU Course) and Refreshments
  – Continue Data Science Tutorial: Class 4 and Graph Databases and Bigdata SYSTAP Literature Survey of Graph Databases
• 7:00 p.m. Introductions and Announcements (10 seconds per individual depending on the size of the group)
• 7:15 p.m. Featured Presentation/Demonstration (where did you get the data, where did you store the data, and what were your results?)
  – Bryan Thompson, Chief Scientist of SYSTAP, LLC, will speak about their SYSTAP open source graph database platform. Highlights will include support for highly available replication clusters as well as their recent work with accelerated graph processing on GPUs at 3 billion traversed edges per second.
  – See CSHALS 2014: Tech Talk and Poster in Wiki
• 8:30 p.m. Networking/Individual Demos (talk among yourselves and look at one another's work)
• 9:00 p.m. Continue Your Conversations Elsewhere (We need to clear out of the space)

Page 7

Next Meetups

• Sixth Meetup: April 1, 6:30 p.m.
  – Network Analytics and Visualization of Big Data Privacy Workshop Tweets, Dr. Marc A. Smith, Chief Social Scientist, Connected Action Consulting Group, and Remarks by the President on Review of Signals Intelligence, Dr. Kate Goodier, Information Architect, Xcelerate Solutions
• Seventh Meetup: April 15, 6:30 p.m.
  – DARPA Big Mechanism, Mike Megginson, Northrop Grumman, and Fredrik Salvesen, YarcData (in planning)
• Eighth Meetup: May 6, 6:30 p.m.
  – Federating Big Data for Big Innovation, Dr. Jeanne Holm, Data.gov Evangelist
• Ninth Meetup: May 18, 6:30 p.m.
  – The Science Behind Data Science, Ruhollah Farchtchi, Director of Big Data, UNISYS
• 2nd Cloud, SOA, Semantics and Data Science Conference, June (in planning)

Page 8

Overview

• Practical Data Science for Data Scientists:
  – 2/11 Specific Data Science Tools and Applications 1
  – Chapters 7 & 8
• Data Science for VIVO & Information Visualization MOOC (no time to cover):
  – 7 Weeks of Course Work with Sci2 Tools
  – Forming Teams to Work with Clients for Next 7 Weeks
• NodeXL and Sci2 for Data Science (no time to cover):
  – NodeXL: A free, open-source template for Microsoft® Excel® that makes it easy to explore network graphs.
  – Sci2: A modular tool for science of science research & practice on scholarly datasets.

Page 9

Practical Data Science for Data Scientists

http://semanticommunity.info/Data_Science/Practical_Data_Science_for_Data_Scientists

Class 4

Providing On-Line Class With Private Tutoring

Page 10

Resources

• Required Textbook
  – Doing Data Science:
    • http://shop.oreilly.com/product/0636920028529.do
    • Free Sampler:
      – http://cdn.oreillystatic.com/oreilly/booksamplers/9781449358655_sampler.pdf (PDF)
• Optional Supplemental Reading:
  – Data Science Starter Kit:
    • http://shop.oreilly.com/category/get/data-science-kit.do
  – DC Data Community:
    • http://datacommunitydc.org/blog/about/
• DC Data Community Calendar:
  – http://datacommunitydc.org/blog/calendar/
• Technology Requirements
  – Internet and Free Tools like Spotfire Cloud:
    • https://spotfire.cloud.tibco.com/tsc/#!/compproductrequest
  – NodeXL:
    • http://nodexl.codeplex.com/ (My Note: Current Focus)

Page 11

Class 4

• 2/11 Specific Data Science Tools and Applications 1
  – Discuss Reading: Chapters 7 and 8, Present and Discuss Team Homework Exercises, Hands-on Class Exercise, and Team Homework Exercise.
  – My Resources:
    • http://semanticommunity.info/Data_Science/Free_Data_Visualization_and_Analysis_Tools
    • http://semanticommunity.info/Data_Science/KDD_Cup
    • http://www.kdnuggets.com/datasets/
• Hands-on Class Exercise:
  – SAS and SAS Public Data Sets
  – See Spotfire Web Player and Spotfire File, Spotfire Web Player and Spotfire File, and Spotfire Web Player and Spotfire File
  – Exercise: Build Your Own Recommendation System

Page 12

Discuss Reading

• Chapter 7:
  – How do companies extract meaning from the data they have? In this chapter we hear from two people with very different approaches to that question, namely William Cukierski from Kaggle and David Huffaker from Google.
• Chapter 8:
  – This is the most difficult chapter in the book for me to teach since I do not understand the Python code at the end and have never built a Recommendation Engine myself. I would welcome some help here.
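
A possible starting point for the "Build Your Own Recommendation System" exercise, offered as a minimal sketch and not as the code from the book: a user-based nearest-neighbor recommender in plain Python. The ratings, user names, item names, and function names are all hypothetical, and cosine similarity over co-rated items is only one of several reasonable similarity choices.

```python
from math import sqrt

# Hypothetical toy ratings (user -> {item: rating on a 1-5 scale}); not from the book.
RATINGS = {
    "ann": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 4, "B": 5, "D": 4},
    "cat": {"A": 1, "C": 5, "D": 2},
}


def cosine(u, v):
    """Cosine similarity between two users, using only the items both rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)


def recommend(user, ratings, top_n=3):
    """Rank items the user has not rated by similarity-weighted average rating."""
    mine = ratings[user]
    totals, weights = {}, {}
    for other, theirs in ratings.items():
        if other == user:
            continue
        sim = cosine(mine, theirs)
        if sim <= 0:
            continue  # ignore users with nothing in common
        for item, score in theirs.items():
            if item in mine:
                continue  # only predict for unseen items
            totals[item] = totals.get(item, 0.0) + sim * score
            weights[item] = weights.get(item, 0.0) + sim
    predicted = {item: totals[item] / weights[item] for item in totals}
    return sorted(predicted.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


if __name__ == "__main__":
    # "ann" has not rated item "D"; bob (similar taste) liked it, cat (dissimilar) did not.
    print(recommend("ann", RATINGS))
```

In this toy data the only item "ann" has not rated is "D", so the prediction for it is pulled toward bob's rating because bob's tastes are closer to ann's than cat's are. A real class project would swap in an actual ratings table and compare against a simple baseline such as the item's mean rating.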

Page 17

Team Homework Exercise

• Read in next week's reading: Data Visualization for the Rest of Us:
  – See my Slides and Web Player.
  – Start to create your own Hubway Data Visualization Challenge and eventually submit it for your class project and the challenge (now closed but still accepting submissions) if you want.
• Form Teams (Same or New), Ask Me Questions, and Prepare to Present Next Week

Page 19

A Data Science Big Mechanism for DARPA

• DARPA wants to help the DoD get to the essence of cause and effect for cancer from reading the medical literature.
• The Federal Big Data Working Group Meetup has also been doing that with Semantic Medline - YarcData and Euretos BRAIN (Bio Relations and Intelligence Network).
  – See the video for Cancer Immunotherapy (21 minutes), which Science magazine called the biggest breakthrough of 2013 at the end of 2013 and which Dr. Tom Rindflesch (the inventor of Semantic Medline) identified from Semantic Medline as a very important breakthrough in early 2013!

Page 20

Data Science Data Mining Process

• Business Understanding:
  – Broad Agency Announcement (PDF) and Slide Presentation (PPT)
• Data Understanding:
  – Semantic Medline, Open Catalog, CSHALS* 2014, and “Starter kit” (to be provided)
• Data Preparation:
  – Knowledge Base of the Above
• Modeling:
  – Semantic Medline, Data Papers, and NanoPublications
• Evaluation:
  – Searchability, Discovery, and Reasoning
• Deployment:
  – Story and Knowledge Base in MindTouch, Excel, Spotfire, and Be Informed

* Conference on Semantics in Healthcare and Life Sciences

Page 21

The Initial Knowledge Base-Data Ecosystem

http://semanticommunity.info/Data_Science/A_Data_Science_Big_Mechanism_for_DARPA

Page 22

Where did we find some structured data?

http://www.darpa.mil/opencatalog/

Page 23

Where did we store the structured data?

http://semanticommunity.info/@api/deki/files/28732/DARPA.xlsx

Page 24

Modeling: Approaches

• Semantic Medline
  – Semantic MEDLINE Query: mesothelioma and Data Science for VIVO
• Data Papers:
  – Sepublica 2014: The Semantics for e-science in an intelligent Big Data Context
    • http://sepublica.mywikipaper.org/
• Nanopublications:
  – The smallest unit of publishable information: an assertion about anything that can be uniquely identified and attributed to its author.
    • http://nanopub.org/wordpress/?page_id=65
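
To make the nanopublication definition above concrete, here is a minimal sketch of one as a plain Python data structure: a single assertion (subject, predicate, object), a unique identifier, and an attribution to its author. The class, field names, and example.org URIs are hypothetical and for illustration only. The three group labels follow the parts usually described for a nanopublication (assertion, provenance, publication info), but this is a simplification and not the official RDF serialization.

```python
from dataclasses import dataclass
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


@dataclass(frozen=True)
class Nanopublication:
    """Simplified, hypothetical model of a nanopublication."""
    uri: str           # unique identifier of this nanopublication
    assertion: Triple  # the single claim being published
    author: str        # who the assertion is attributed to
    created: str       # publication date, ISO 8601 string

    def is_attributed(self) -> bool:
        """True when the claim is both uniquely identified and attributed."""
        return bool(self.uri) and bool(self.author)

    def to_statements(self) -> List[Tuple[str, str, str, str]]:
        """Flatten into (group, subject, predicate, object) rows, e.g. for a spreadsheet."""
        s, p, o = self.assertion
        return [
            ("assertion", s, p, o),
            ("provenance", self.uri, "attributedTo", self.author),
            ("pubinfo", self.uri, "created", self.created),
        ]


if __name__ == "__main__":
    # Placeholder URIs only; this is not a real scientific claim.
    np1 = Nanopublication(
        uri="http://example.org/np/1",
        assertion=(
            "http://example.org/substanceX",
            "http://example.org/interactsWith",
            "http://example.org/substanceY",
        ),
        author="http://example.org/people/author1",
        created="2014-03-18",
    )
    for row in np1.to_statements():
        print(row)
```

Flattening to rows is a deliberate choice here: rows of this shape can be pasted into a spreadsheet tab and read as an edge list by graph tools.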

Page 25

How did we store the unstructured data?

http://semanticommunity.info/@api/deki/files/28470/BRAIN.xlsx

• Well-defined URLs
• Knowledge and Glossary
• Relational and Graph
• Linked Data
• Footnote and References
• Metadata and Data Sources
• Ready for NodeXL & Spotfire

Page 26

Modeling: Examples

Most Recent: 500 citations, Start Date: 01/01/1900, End Date: 11/30/2013, 3169 predications extracted. Summarized for Substance Interactions

Dr. Barend Mons: BRAIN
Dr. Tom Rindflesch: Semantic Medline

Page 28

What is our data story and product?

• Data Ecosystem:
  – BRAIN.xlsx
  – DARPA.xlsx
• Individual Tabs:
  – DARPA Open Catalog:
    • Bigdata SYSTAP is Category: Infrastructure and License: GPLv2
  – DARPA Big Mechanism Knowledge Base:
    • DARPA Big Mechanism Knowledge Base by Function (21)
    • DARPA Big Mechanism Knowledge Base by Number of References (175)
  – BRAIN Knowledge Base and Examples:
    • BRAIN Knowledge Base by Function (References)
    • Data Fairport Conference Dropbox Files by Type (PPTX)
  – Data Science for VIVO & IVMOOC
    • Citations by Publisher (APS)
    • Total Award Amount ($) by Principal Investigator (Geoffrey Fox)

Page 29

Graph Databases

http://semanticommunity.info/Data_Science/Graph_Databases#Story
http://semanticommunity.info/Data_Science/Graph_Databases/Tutorial

Absent: Bigdata SYSTAP, Virtuoso, YarcData, etc.

12 Leading BI Tools and Analytic Platforms I Tested for OMB

Page 30

Bigdata SYSTAP Literature Survey of Graph Databases

http://semanticommunity.info/Data_Science/Bigdata_SYSTAP_Literature_Survey_of_Graph_Databases#Story

Awarded Best Paper in 2004! And 10 Years Later...