Federal Big Data Working Group Meetup
Dr. Brand Niemann
Director and Senior Data Scientist
Semantic Community
http://semanticommunity.info/
http://www.meetup.com/Federal-Big-Data-Working-Group/
http://semanticommunity.info/Data_Science/Federal_Big_Data_Working_Group_Meetup
May 20, 2014
Mission Statement
• Federal: Supports the Federal Big Data Initiative, but is not endorsed by the Federal Government or its Agencies;
• Big Data: Supports the Federal Digital Government Strategy, which is "treating all content as data", so big data = all your content;
• Working Group: Data Science Teams composed of Federal Government and Non-Federal Government experts producing big data products (How was the data collected, Where is it stored, and What are the results?); and
• Meetup: The world's largest network of local groups to revitalize local community and help people around the world self-organize, like the MOOCs (Massive Open Online Courses) being considered by the White House.
Co-organizers: Brand Niemann and Kate Goodier
May 6th Meetup: EPA/NASA Climate-Environmental Data Analytics & A Redesigned, Open Data.gov
• How was the Meetup?
  – Thanks for continually providing a forum facilitating discussion and bringing in speakers with diverse experience. On my drive home NPR was fittingly enough talking about big data.
  – Just lots of good info on big data; I also am a big fan of data.gov, so it's exciting that so much is happening with government open data. Perhaps we'll see even more APIs?
  – Jeanne Holm: You can find more of the APIs at https://www.data.gov/developers/apis and http://catalog.data.gov/dataset?res_format=api There are about 450 between the two.
  – Amazing growth in membership: Our 200th member!
    • Welcome: Inge, Consultant working in the federal/health space.
Federating Big Data for Big Innovation and A Redesigned, Open Source Data.gov, Dr. Jeanne Holm, Data.gov Evangelist
• Background:
  – Usability Tests Put Brakes on Data.gov Redesign
  – LinkedIn Discussion
• Main Points:
  – Releasing and using open data is about empowering people to make better decisions
  – Open data is an ecosystem
  – Building a federated catalog of national data
  – Keeping the conversation fresh: Multiple rounds of usability testing found that redesign was needed and now doing monthly builds
  – A Global Movement has begun to provide transparency and democratization of data
• My Note:
  – See my Tutorial Slides 12-19 http://semanticommunity.info/@api/deki/files/29263/JeanneHolm05062014.pptx
Activities
• White Paper for DARPA, NASA, NIH, NIST and NITRD: "Making Big Data Small" using Data Science and Semantics:
  – See Framework and Questions and Answers
  – Dan Kaufman, DARPA Director of Innovation, and Paul Cohen, DARPA Big Mechanism Project Director
  – Drs. Farnam Jahanian (NSF Big Data Publications), Phil Bourne (Data Culture at NIH), and John Holdren (Climate Change Impacts)
• Health Datapalooza V, June 1-3:
  • See next slides
• CODATA International Society for Digital Earth (ISDE) Workshop on Big Data for International Scientific Programmes: Challenges and Opportunities, June 8-9:
  • See next slides
• Big Data for Government, June 16-17:
  • Keynote from Dr. George Strawn and Presentation by Dr. Tom Rindflesch and Semantic Medline/YarcData Team
• Earth Cube All-Hands Meeting, June 24-26:
  • Report at July Meetup
Framework for White Paper
• Organize a Community of Data Scientists and Related Fields to focus on treating all of your content as "Big Data"
  – Example: Federal Big Data Working Group Meetup
• Follow the Cross Industry Standard Process for Data Mining (CRISP-DM; Shearer, 2000) consisting of Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment
  – Example: Semantic Community Data Science Knowledge Base (Big Data Science for CODATA)
• Mine prominent scientific journals for data policy, databases, and data results that can be reused.
  – Example: CODATA Data Science Journal (509 publications by 9 attributes)
• Provide data stories and presentation materials for public education and conferences
  – Example: CODATA International Workshop on Big Data for International Scientific Programmes, June 8-9, in Beijing
• Obtain NSF funding for sustained data science for data publications work over a period of years
  – Example: Critical Techniques and Technologies for Advancing Big Data Science & Engineering (BIGDATA)
• Provide a Data Fairport with "Data Publication in Data Browsers"
  – Example: Semantic Community Spotfire Cloud Library
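The six CRISP-DM phases named in the framework can be written down as an ordered sequence. The sketch below is only an illustration and is not part of the slides or of any CRISP-DM tooling: the names `Phase` and `next_phase` are hypothetical, and the model is deliberately linear even though CRISP-DM in practice allows moving back to earlier phases.

```python
from enum import Enum
from typing import Optional


class Phase(Enum):
    """CRISP-DM phases in the order listed on the slide (hypothetical helper)."""
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


def next_phase(current: Phase) -> Optional[Phase]:
    """Return the phase that follows `current`, or None after Deployment.

    Assumption: a strictly linear walk through the phases; real projects
    often loop back (for example from Evaluation to Business Understanding).
    """
    ordered = list(Phase)  # Enum iteration preserves definition order
    position = ordered.index(current)
    if position + 1 < len(ordered):
        return ordered[position + 1]
    return None


if __name__ == "__main__":
    # Walk the phases from start to finish and print each one.
    step: Optional[Phase] = Phase.BUSINESS_UNDERSTANDING
    while step is not None:
        print(step.name.replace("_", " ").title())
        step = next_phase(step)
```

A team could use such a list as a checklist when answering the three presentation questions (how the data was collected, where it is stored, and what the results are), since those map roughly onto the data understanding, data preparation, and evaluation phases.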
Framework Questions & Answers
• Is this the Barend Mons Nanopub approach to the data publication of cardinal assertions? No, please see the examples in these slides.
• What are the goals of the White Paper and NSF Grant Proposal?
  – The White Paper documents the Framework for general public relations and marketing purposes. The NSF Grant Proposal is to obtain long-term funding to sustain this Framework and Mission Statement activity.
    • In essence we know that NSF wants a community that follows standards to produce data science publications that reside in a knowledge base repository, and workforce training that supports STEM and data scientists.
• What type of Meetup presentations do we want?
  – Content that supports the Framework, Mission Statement, and White Paper. Not every presentation does, because we leave that to each presenter. All we ask is that they at least answer three fundamental questions in their presentation:
    • How was the data collected?
    • Where is the data stored?
    • What are the data results?
    • This keeps the presentations from being marketing, vendor, or organization promotion.
Kaufman and Cohen: A Data Science Big Mechanism for DARPA
• International Association of Scientific, Technical & Medical Publishers: The Voice of Academic and Professional Publishing
  – STM is at the leading edge of the latest technology trends within publishing. This annual US event brings together the industry's most established thinkers and bright up-and-coming future stars to give attendees an insight into the hottest innovations and vital technological trends and developments which will define STM publishing for years to come.
• Annual US Event: Bright Research, Smart Articles and the new Author Ego-System
  – Opening Keynotes: Analytics and Metrics
    • David Smith (Baseball) and Kevin Boyack (Mapping & Analytics of Science Publishing)
  – Plenary: The Smart Article
    • Increasingly the research article becomes computable, adding research data, algorithms and smart searching. How intelligent will the article become? Can it find you so you no longer need to search for it? Can it test assertions? Generate new hypotheses? Can articles generate new articles without human interference? Will human analysis be eliminated and, if so, up to what point? Where are the new opportunities for publishers? Come and listen to two experts in data mining and actionable articles, both well known from FORCE11. (Larry Hunter and Anita de Waard)
Mined STM 2014 Tweets
• Tech trend 1: the machine is the new reader. Highlights from the Future Lab team
• Tech trend 2: the return to the author
• Tech trend 3: new players changing the game. see http://ow.ly/3jPdvY
• Kevin Boyack of SciTech shares data that shows books are 2 to 4x more cited than journal articles in sciences
• L Hunter: "With enough data you don't need semantic search. You can just use statistics."
• L Hunter: Knowledge Representation (publishers) look at Alzforum collaborative knowledge sharing
• A baseball metrics talk to open. With perfect timing, the latest submission to the @writelatex gallery is an article on baseball!: https://www.writelatex.com/articles/professional-baseball-pitchers-performance-and-its-effect-on-salary/
• Anita de Waard: "Looking for Data: Finding New Science": http://t.co/eok3ma37vO
http://semanticommunity.info/Data_Science/NSF_Big_Data_Publications#Story
Data Science for Climate Change: Spotfire Data Publication
Web Player (in progress)
Agenda
• 6:30 p.m. Brand Niemann, Introduction and Continue Data Science Tutorials (Refreshments)
• 7:00 p.m. Introductions and Announcements (10 seconds per individual depending on the size of the group)
• 7:10 p.m. Big Data: Forward - Backward, Charles Randall Howard, Adjunct Professor in the Applied IT Department and Sr. Data Scientist at Novetta Solutions
• 7:45 p.m. Stories that Persuade, Anita de Waard, VP Research Data Collaborations at Elsevier Research Data Services/University of Utrecht. Also see Looking for Data: Finding New Science and Ten Habits of Highly Effective Data
• 8:30 p.m. Networking/Individual Demos (talk among yourselves and look at one another's work)
• 9:00 p.m. Continue Your Conversations Elsewhere (We need to clear out of the space)
Next Meetups
• June 2nd: In Planning: Ontology Summit 2014 Postmortem and Reading & Reasoning with Semantic Insights for the DARPA Big Mechanism
  – 6:30 p.m. Welcome and Introduction Slides
  – 6:35 p.m. Continue Data Science Tutorial: Practical Data Science for Data Scientists: Data Science Students and Careers and Sarah Soliman, Rand, and IV MOOC Student Project (invited)
  – 7:00 p.m. Brief Member Introductions
  – 7:10 p.m. Ontology Summit 2014 Postmortem: Big Data with Semantic Web and Applied Ontology, Brand Niemann. See Ontology for Big Data
  – 7:30 p.m. Two SIRA-based products: Research Assistant™ and Research Librarian™, Chuck Rehberg, Semantic Insights and Kate Goodier, Xcelerate Solutions (limited beta test in process). See A Data Science Big Mechanism for DARPA
• June 30th: MIT Big Data Initiative: bigdata@CSAIL and the new Intel Science and Technology Center for Big Data, Sam Madden and Why the current "elephants" are good at nothing, Data Tamer, and data integration issues, Michael Stonebraker
• July and August: Once a month to be announced
  – Silver Line Spring Hill Metro Station Opens in July?
• Practical Data Science for Data Scientists:
  – Reading Assignments:
    • Chapter 11: Causality
      – This chapter will explore the topic of causality, and we have two experts in this area as guest contributors, Ori Stitelman and David Madigan. In these cases your mentality or goal is not to optimize for predictive accuracy, but rather to be able to isolate causes.
    • Chapter 12: Epidemiology
      – The contributor for this chapter is David Madigan, professor and chair of statistics at Columbia. Madigan has over 100 publications in such areas as Bayesian statistics, text mining, Monte Carlo methods, pharmacovigilance, and probabilistic graphical models.
  – Resources: See 2/25 Specific Data Science Tools and Applications 3
• Team Homework Exercise:
  – See my work with the KDD Cup data sets where I have updated this to include 2011-2013.
  – See my Research Notes for Project TYCHO Data for Health.
  – Form Teams (Same or New), Ask Me Questions, and Prepare to Present One of