Federal Big Data Working Group Meetup

Post on 26-Feb-2016

DESCRIPTION

Federal Big Data Working Group Meetup. Dr. Brand Niemann, Director and Senior Data Scientist, Semantic Community. http://semanticommunity.info/ http://www.meetup.com/Federal-Big-Data-Working-Group/ http://semanticommunity.info/Data_Science/Federal_Big_Data_Working_Group_Meetup March 18, 2014.

Transcript

1

Federal Big Data Working Group Meetup

Dr. Brand Niemann
Director and Senior Data Scientist
Semantic Community
http://semanticommunity.info/
http://www.meetup.com/Federal-Big-Data-Working-Group/
http://semanticommunity.info/Data_Science/Federal_Big_Data_Working_Group_Meetup

March 18, 2014

2

Mission Statement

• Federal: Supports the Federal Big Data Initiative, but is not endorsed by the Federal Government or its Agencies;
• Big Data: Supports the Federal Digital Government Strategy, which is "treating all content as data", so big data = all your content;
• Working Group: Data Science Teams composed of Federal Government and Non-Federal Government experts producing big data products (How was the data collected, Where is it stored, and What are the results?); and
• Meetup: The world's largest network of local groups to revitalize local community and help people around the world self-organize, like the MOOCs (Massive Open Online Courses) being considered by the White House.

Co-organizers: Brand Niemann and Kate Goodier

3

Joint NSF-NIH Biomedical Big Data Research Meetup

http://semanticommunity.info/Data_Science/Euretos_BRAIN#Story

“Thanks again for a wonderful gathering of deep thinkers at the NIH-NSF Big Data event -- that was terrific. Great line up of speakers.”

4

Scientific Data: A View from the US

• Dr. George Strawn, Director, NITRD/NCO and co-chair of the Federal Big Data Senior Steering Work Group:
  – Public access mandated for "scientific results" supported by the U.S. government
  – Federal agencies have submitted their "initial plans" for public access to scientific data to OSTP
  – Digital Object Architecture:
    • An "hour glass" for data? (As the Internet was an hour glass for networks: TCP/IP at the narrow point; many applications above, many implementations below)
  – One result will be to make the scientific record into a first class scientific object

http://semanticommunity.info/@api/deki/files/28467/GeorgeStrawn01132014.ppt

5

Activities

• White House OSTP - MIT Big Data Privacy Workshop:
  – Story and Network Analysis of Tweets:
    • April 1st Meetup with Kate Goodier and Marc Smith
• NIST Data Science Symposium:
  – Poster and Story:
    • Data Science Team Pilot with Information Services Office
• White Paper for NIST and NITRD:
  – "Making Big Data Small" using Data Science and Semantics:
    • "Thanks again for your effort in putting this program together!"
• Information Visualization MOOC:
  – Story and Course Work:
    • Forming Teams to Work with Clients for the Remaining 7 Weeks
• DARPA Big Mechanism:
  – Story and Pilot:
    • April 15th Meetup with Mike Megginson, Northrop Grumman, and Fredrik Salvesen, YarcData (in planning)

6

Agenda

• 6:30 p.m. Tutorials (Proposed GMU Course) and Refreshments
  – Continue Data Science Tutorial: Class 4 and Graph Databases and Bigdata SYSTAP Literature Survey of Graph Databases
• 7:00 p.m. Introductions and Announcements (10 seconds per individual depending on the size of the group)
• 7:15 p.m. Featured Presentation/Demonstration (where did you get the data, where did you store the data, and what were your results?)
  – Bryan Thompson, Chief Scientist of SYSTAP, LLC, will speak about their SYSTAP open source graph database platform. Highlights will include support for highly available replication clusters as well as their recent work with accelerated graph processing on GPUs at 3 billion traversed edges per second.
  – See CSHALS 2014: Tech Talk and Poster in Wiki
• 8:30 p.m. Networking/Individual Demos (talk among yourselves and look at one another's work)
• 9:00 p.m. Continue Your Conversations Elsewhere (We need to clear out of the space)
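
For readers unfamiliar with the "traversed edges per second" figure quoted for the featured talk, the sketch below shows one simple way such a rate can be measured: run a breadth-first search, count every edge the search examines, and divide by the elapsed time. It is a plain-Python illustration only, not SYSTAP's GPU implementation; the toy graph and the `bfs_edge_count` helper are assumptions made for this example, and benchmark suites may define the edge count differently.

```python
import time
from collections import deque


def bfs_edge_count(adjacency, start):
    """Breadth-first search; returns (visited vertices, number of edges examined)."""
    visited = {start}
    queue = deque([start])
    edges = 0
    while queue:
        vertex = queue.popleft()
        for neighbor in adjacency.get(vertex, ()):
            edges += 1  # every outgoing edge we look at counts as traversed
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return visited, edges


# Hypothetical toy graph (directed adjacency lists), for illustration only.
GRAPH = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}

t0 = time.perf_counter()
visited, edges = bfs_edge_count(GRAPH, "a")
elapsed = time.perf_counter() - t0

# Traversed edges per second for this run; guard against a zero timer reading.
teps = edges / elapsed if elapsed > 0 else float("inf")
print(f"{edges} edges examined, {len(visited)} vertices reached")
```

On a graph this small the rate itself is meaningless; the point is only what gets counted and what it is divided by.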

7

Next Meetups

• Sixth Meetup: April 1, 6:30 p.m.
  – Network Analytics and Visualization of Big Data Privacy Workshop Tweets, Dr. Marc A. Smith, Chief Social Scientist, Connected Action Consulting Group, and Remarks by the President on Review of Signals Intelligence, Dr. Kate Goodier, Information Architect, Xcelerate Solutions
• Seventh Meetup: April 15, 6:30 p.m.
  – DARPA Big Mechanism, Mike Megginson, Northrop Grumman, and Fredrik Salvesen, YarcData (in planning)
• Eighth Meetup: May 6, 6:30 p.m.
  – Federating Big Data for Big Innovation, Dr. Jeanne Holm, Data.gov Evangelist
• Ninth Meetup: May 18, 6:30 p.m.
  – The Science Behind Data Science, Ruhollah Farchtchi, Director of Big Data, UNISYS
• 2nd Cloud, SOA, Semantics and Data Science Conference, June (in planning)

8

Overview

• Practical Data Science for Data Scientists:
  – 2/11 Specific Data Science Tools and Applications 1
  – Chapters 7 & 8
• Data Science for VIVO & Information Visualization MOOC (no time to cover):
  – 7 Weeks of Course Work with Sci2 Tools
  – Forming Teams to Work with Clients for Next 7 Weeks
• NodeXL and Sci2 for Data Science (no time to cover):
  – NodeXL: A free, open-source template for Microsoft® Excel® that makes it easy to explore network graphs.
  – Sci2: A modular tool for science of science research & practice on scholarly datasets.

9

Practical Data Science for Data Scientists

http://semanticommunity.info/Data_Science/Practical_Data_Science_for_Data_Scientists

Class 4

Providing On-Line Class With Private Tutoring

10

Resources

• Required Textbook
  – Doing Data Science:
    • http://shop.oreilly.com/product/0636920028529.do
    • Free Sampler:
      – http://cdn.oreillystatic.com/oreilly/booksamplers/9781449358655_sampler.pdf (PDF)
• Optional Supplemental Reading:
  – Data Science Starter Kit:
    • http://shop.oreilly.com/category/get/data-science-kit.do
  – DC Data Community:
    • http://datacommunitydc.org/blog/about/
• DC Data Community Calendar:
  – http://datacommunitydc.org/blog/calendar/
• Technology Requirements
  – Internet and Free Tools like Spotfire Cloud:
    • https://spotfire.cloud.tibco.com/tsc/#!/compproductrequest
  – NodeXL:
    • http://nodexl.codeplex.com/ (My Note: Current Focus)

11

Class 4

• 2/11 Specific Data Science Tools and Applications 1
  – Discuss Reading: Chapters 7 and 8, Present and Discuss Team Homework Exercises, Hands-on Class Exercise, and Team Homework Exercise.
  – My Resources:
    • http://semanticommunity.info/Data_Science/Free_Data_Visualization_and_Analysis_Tools
    • http://semanticommunity.info/Data_Science/KDD_Cup
    • http://www.kdnuggets.com/datasets/
• Hands-on Class Exercise:
  – SAS and SAS Public Data Sets
  – See Spotfire Web Player and Spotfire File, Spotfire Web Player and Spotfire File, and Spotfire Web Player and Spotfire File
  – Exercise: Build Your Own Recommendation System

12

Discuss Reading

• Chapter 7:
  – How do companies extract meaning from the data they have? In this chapter we hear from two people with very different approaches to that question, namely William Cukierski from Kaggle and David Huffaker from Google.
• Chapter 8:
  – This is the most difficult chapter in the book for me to teach since I do not understand the Python code at the end and have never built a Recommendation Engine myself. I would welcome some help here.
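
As a possible starting point for the "Build Your Own Recommendation System" exercise, here is a minimal sketch of a user-based nearest-neighbor recommender in plain Python. It is not the code from Doing Data Science. The ratings, the user and item names, and the `cosine` and `recommend` helpers are all made up for illustration, and cosine similarity over shared items is only one of several reasonable similarity choices.

```python
from math import sqrt

# Hypothetical ratings (user -> {item: rating}); not taken from the book.
RATINGS = {
    "ann": {"A": 5, "B": 3, "C": 4},
    "bob": {"A": 4, "B": 3, "C": 5, "D": 5},
    "cat": {"A": 1, "B": 5, "D": 2},
}


def cosine(u, v):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)


def recommend(user, ratings, k=2):
    """Score items the user has not rated by the similarity-weighted
    average rating of the k most similar other users."""
    me = ratings[user]
    neighbors = sorted(
        ((cosine(me, other), name) for name, other in ratings.items() if name != user),
        reverse=True,
    )[:k]
    scores, weights = {}, {}
    for sim, name in neighbors:
        if sim <= 0:
            continue  # ignore users with nothing in common
        for item, rating in ratings[name].items():
            if item in me:
                continue  # only recommend unseen items
            scores[item] = scores.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    # Highest predicted rating first.
    return sorted(((scores[i] / weights[i], i) for i in scores), reverse=True)


print(recommend("ann", RATINGS))
```

The neighbor who agrees most with "ann" on shared items gets the largest say in the predicted rating, which is the core idea behind this family of recommenders.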

17

Team Homework Exercise

• Read next week's reading: Data Visualization for the Rest of Us:
  – See my Slides and Web Player.
  – Start to create your own Hubway Data Visualization Challenge and eventually submit it for your class project and the challenge (now closed but still accepting submissions) if you want.
• Form Teams (Same or New), Ask Me Questions, and Prepare to Present Next Week

19

A Data Science Big Mechanism for DARPA

• DARPA wants to help the DoD get to the essence of cause and effect for cancer from reading the medical literature.

• The Federal Big Data Working Group Meetup has also been doing that with Semantic Medline - YarcData and Euretos BRAIN (Bio Relations and Intelligence Network).
  – See the video for Cancer Immunotherapy (21 minutes), which Science magazine called the biggest breakthrough of 2013 at the end of 2013 and which Dr. Tom Rindflesch (the inventor of Semantic Medline) identified from Semantic Medline as a very important breakthrough in early 2013!

20

Data Science Data Mining Process

• Business Understanding:
  – Broad Agency Announcement (PDF) and Slide Presentation (PPT)
• Data Understanding:
  – Semantic Medline, Open Catalog, CSHALS* 2014, and "Starter kit" (to be provided)
• Data Preparation:
  – Knowledge Base of the Above
• Modeling:
  – Semantic Medline, Data Papers, and NanoPublications
• Evaluation:
  – Searchability, Discovery, and Reasoning
• Deployment:
  – Story and Knowledge Base in MindTouch, Excel, Spotfire, and Be Informed

* Conference on Semantics in Healthcare and Life Sciences

21

The Initial Knowledge Base-Data Ecosystem

http://semanticommunity.info/Data_Science/A_Data_Science_Big_Mechanism_for_DARPA

22

Where did we find some structured data?

http://www.darpa.mil/opencatalog/

23

Where did we store the structured data?

http://semanticommunity.info/@api/deki/files/28732/DARPA.xlsx

24

Modeling: Approaches

• Semantic Medline
  – Semantic MEDLINE Query: mesothelioma and Data Science for VIVO
• Data Papers:
  – Sepublica 2014: The Semantics for e-science in an intelligent Big Data Context
    • http://sepublica.mywikipaper.org/
• Nanopublications:
  – The smallest unit of publishable information: an assertion about anything that can be uniquely identified and attributed to its author.
    • http://nanopub.org/wordpress/?page_id=65
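
To make the nanopublication definition above concrete, here is a minimal Python sketch of one as a data structure: a single subject-predicate-object assertion, the author it is attributed to, and a content-derived identifier. This is an illustration under stated assumptions, not the nanopub.org schema: the `Nanopub` class, its field names, the hash-based identifier, and the `example.org` URIs are all hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Nanopub:
    """One assertion plus the minimum needed to identify and attribute it."""
    subject: str    # what the assertion is about
    predicate: str  # the relation being asserted
    obj: str        # the value or target of the relation
    author: str     # who the assertion is attributed to
    source: str     # where the assertion came from (provenance)

    @property
    def identifier(self) -> str:
        """Content-derived ID: identical content always yields the same ID."""
        text = "|".join([self.subject, self.predicate, self.obj, self.author, self.source])
        return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


# Hypothetical example values, for illustration only.
example = Nanopub(
    subject="http://example.org/drug/X",
    predicate="http://example.org/relation/inhibits",
    obj="http://example.org/protein/Y",
    author="http://example.org/person/jane-doe",
    source="http://example.org/article/123",
)
print(example.identifier)
```

Deriving the identifier from the content is one design choice among several; it means the same assertion by a different author gets a different identifier, so attribution is part of identity.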

25

How did we store the unstructured data?

http://semanticommunity.info/@api/deki/files/28470/BRAIN.xlsx

Well-defined URLs
Knowledge and Glossary
Relational and Graph
Linked Data
Footnote and References
Metadata and Data Sources
Ready for NodeXL & Spotfire

26

Modeling: Examples

Most Recent: 500 citations, Start Date: 01/01/1900, End Date: 11/30/2013, 3169 predications extracted. Summarized for Substance Interactions

Dr. Barend Mons: BRAIN
Dr. Tom Rindflesch: Semantic Medline
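
A predication here is a subject-predicate-object statement extracted from a citation. The sketch below illustrates, in plain Python, the general idea of summarizing a set of predications for an interaction-focused view: keep only the predicates of interest and count how often each distinct statement occurs. The example triples, the predicate names kept, and the `summarize` helper are assumptions for illustration; they are not output of, or the algorithm used by, Semantic Medline.

```python
from collections import Counter

# Hypothetical predications (subject, predicate, object); not real extracted data.
PREDICATIONS = [
    ("drug_a", "INHIBITS", "protein_x"),
    ("drug_a", "TREATS", "disease_y"),
    ("protein_x", "INTERACTS_WITH", "protein_z"),
    ("drug_b", "STIMULATES", "protein_z"),
    ("drug_a", "INHIBITS", "protein_x"),  # same statement found in a second citation
]

# Assumed set of predicates to keep for an interaction-focused summary.
INTERACTION_PREDICATES = {"INHIBITS", "STIMULATES", "INTERACTS_WITH"}


def summarize(predications, keep):
    """Keep predications whose predicate is in `keep` and count each distinct triple."""
    return Counter(p for p in predications if p[1] in keep)


summary = summarize(PREDICATIONS, INTERACTION_PREDICATES)
for triple, count in summary.most_common():
    print(count, *triple)
```

The resulting counts can be loaded into a spreadsheet or a graph tool as weighted edges, with subjects and objects as nodes.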

28

What is our data story and product?

• Data Ecosystem:
  – BRAIN.xlsx
  – DARPA.xlsx
• Individual Tabs:
  – DARPA Open Catalog:
    • Bigdata SYSTAP is Category: Infrastructure and License: GPLv2
  – DARPA Big Mechanism Knowledge Base:
    • DARPA Big Mechanism Knowledge Base by Function (21)
    • DARPA Big Mechanism Knowledge Base by Number of References (175)
  – BRAIN Knowledge Base and Examples:
    • BRAIN Knowledge Base by Function (References)
    • Data Fairport Conference Dropbox Files by Type (PPTX)
  – Data Science for VIVO & IVMOOC
    • Citations by Publisher (APS)
    • Total Award Amount ($) by Principal Investigator (Geoffrey Fox)

29

Graph Databases

http://semanticommunity.info/Data_Science/Graph_Databases#Story http://semanticommunity.info/Data_Science/Graph_Databases/Tutorial

Absent:
Bigdata SYSTAP
Virtuoso
YarcData
Etc.

12 Leading BI Tools and Analytic Platforms I Tested for OMB

30

Bigdata SYSTAP Literature Survey of Graph Databases

http://semanticommunity.info/Data_Science/Bigdata_SYSTAP_Literature_Survey_of_Graph_Databases#Story

Awarded Best Paper in 2004!
And 10 Years Later…
