Build the UK’s COINS in the Data Science Library Cloud
Brand Niemann
US EPA
June 9, 2010
http://semanticommunity.net
Disclaimer: These slides do not reflect the views of the U.S. Environmental Protection Agency and do not constitute endorsement by the EPA of the standards or products mentioned.
• The Challenge
• The Data.gov.uk Program
• The Expert and His Advice
• The Cloud Tools
• The Inspiration
• The Data Sources
• Other Sources of Data
• The Process
• The Results
• Comments
• Acknowledgements
• References
The Challenge
• Tim Berners-Lee "Bag of Chips" talk:
– http://www.youtube.com/watch?v=ga1aSJXCFe0
• To get five stars: 1-Expose your data, 2-Provide in machine readable format (Excel), 3-Provide as CSV, 4-Provide at permanent URL, and 5-Provide metadata.
• Nigel_Shadbolt: Lots of eyeballs poring over COINS:
– http://bit.ly/b8XQGB - opendata in the wild - more functionality all the time.
• Advised by Sir Tim Berners-Lee and Professor Nigel Shadbolt and others, government is opening up data for reuse. This site seeks to give a way into the wealth of government data and is under constant development. We want to work with you to make it better.
• We’re very aware that there are more people like you outside of government who have the skills and abilities to make wonderful things out of public data. These are our first steps in building a collaborative relationship with you.
• Edward Tufte Presidential appointment announced by White House, March 5, 2010.
• Tufte Comment on iPhone interface design: Better to have users looking over material adjacent in space within our eyespan rather than stacked in time. This is especially the case for statistical data, where the fundamental analytical task is to make comparisons. Also see page 159 in the above book reference.
• What is data science? Analysis: The future belongs to the companies and people that turn data into products. Mike Loukides.
– http://radar.oreilly.com/2010/06/wha...a-science.html
• My Response: Please see my Data Science Library in the Cloud: http://ondemand.spotfire.com/public/...VL-4372/public and my suggestion that The 2010 Health 2.0 Developer Challenge should build a community health data science library. See http://federaldata.wik.is/ June 3rd: http://twitter.com/bniemannsr/status/15482514867 and http://www.hhs.gov/open/discussion/chdi.html.
• Tried the zipped 2009/10 Adjustment table, 31MiB (405MiB uncompressed): Got a 405 MB text file that, when imported into Spotfire, gave three columns with no headers and 317,346 rows, with the last row saying "(316,119 row(s) affected)"!
– See next slide.
• Read Comments: Saw where others had had trouble using these datasets.
– Is this CSV?
• I unzipped the (non-torrent) version of the 09/10 adjustment table and it wasn't CSV but rather @-sign delimited (think tab-delim with an @ instead of a tab). Also the data wasn't clean for import to something like Excel as it had some lines of non-table data at the end - just the sort of thing to upset already hard-pushed spreadsheet importers on non-high end rigs.
• And read: COINS contains millions of rows of data; as a consequence the files are large and the data held within the files complex. Using these download files will require some degree of technical competence and expertise in handling and manipulating large volumes of data. As such it is likely that this data will be most easily used by organisations that have such expertise, rather than individuals. More directly useful and accessible datasets that draw on the contents of the COINS database will be made available by August 2010.
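The import problems described above (an @-delimited file with non-table lines at the end) can be worked around before loading the data into a spreadsheet or Spotfire. The following is a minimal sketch, not the method used for these slides. It assumes that the first line of the file is a real data row and that any line with a different number of fields is non-table data, such as the trailing "(316,119 row(s) affected)" line. The function name and the sample rows are hypothetical.

```python
import csv
import io


def convert_to_csv(lines, out, delimiter="@"):
    """Stream @-delimited lines into CSV and return (rows_written, lines_skipped).

    Assumption: the first line is a data row, and every real data row has
    the same number of fields as that first line. Lines with a different
    field count (blank lines, trailer messages) are skipped.
    """
    writer = csv.writer(out)
    expected_width = None
    written = 0
    skipped = 0
    for raw in lines:
        fields = raw.rstrip("\r\n").split(delimiter)
        if expected_width is None:
            expected_width = len(fields)
        if len(fields) != expected_width:
            skipped += 1  # non-table line, e.g. "(2 row(s) affected)"
            continue
        writer.writerow(fields)  # csv quotes fields that contain commas
        written += 1
    return written, skipped


# Hypothetical sample: three columns, no header row, a blank line and a trailer.
sample = io.StringIO(
    "A@100@12.5\n"
    "B@200@7,000\n"
    "\n"
    "(2 row(s) affected)\n"
)
result = io.StringIO()
rows_written, lines_skipped = convert_to_csv(sample, result)
print(rows_written, lines_skipped)
```

Because the function reads one line at a time, the same call should work on an open file handle for the 405 MB download without loading it all into memory. The file's text encoding is not stated in the source, so it would have to be checked before opening the file.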
• The Basic Steps:
– Inventory Data Sources and Plan Application
– Prepare and Import Data and Metadata
– Implement Layout and Analytics
– Add Bookmarks and Create Data Stories
– Publish and Test in Web Player
– Get Feedback and Improve
• First create visualizations, faceted search (filters), and analytics for each individual data source and then look for relationships between the data sources.
Comments
• The initial objective was to see how fast one could create this basic application. I am waiting to hear back on requests for full data sets. I want to emulate the Dashboard for Where Does My Money Go? I want to work with other data sources in Data.gov.uk, e.g. Climate Change.
• Please use the Add Comment feature at the bottom of this wiki page to provide feedback and suggest additional analyses you would like to see. To use the Add Comment feature you first need to register by providing your email address. Your privacy will be respected and your email address will not be available to others or used for any other purpose. You can also download the Spotfire File from this Wiki and a 30-day free evaluation copy from http://spotfire.tibco.com/ and reuse these analyses, or add your own data to this file or to new Spotfire files that you create. Have fun and give us your feedback!
• The author acknowledges gratefully Dean Allemang, Cory Casanave, Sean Connors, Mills Davis, Li Ding, David Eng, Lee Feigenbaum, Aaron Fulkerson, Jim Hendler, Ralph Hodgson, Kevin Kirby, Kevin Jackson, Bob Marcus, John McMahon, Richard Murphy, Brand Niemann, Jr., Barry Nussbaum, Matthew Phoenix, Tony Shaw, Jeff Stein, George Strawn, George Thomas, Pete Tseronis, and Edward Tufte.
References
• Brand L. Niemann, Put Your Desktop in the Cloud to Support the Open Government Directive and Data.gov/semantic, April 19, 2010, Semantic Universe.
• Brand L. Niemann, Build Your Own Data.gov (Spotfire) and EPA Microsite (Spotfire) with Semantics and Statistics in the Cloud, May 15, 2010. Slides.
• Brand L. Niemann, Build Your Community Health Information "Design for America" Using Mindtouch and Spotfire Analytics, May 17, 2010. Slides.
• Brand L. Niemann, Build Your Own Data.gov/semantic with Spotfire in the Cloud: The White House Visitor Database, May 22, 2010. Slides. See Data.gov takes the 'Mumsy' test, FCW, May 26, 2010.
• Edward R. Tufte, Beautiful Evidence (2006), Graphics Press LLC.