Data Management for Geoinformatics A short course on good data management for taught postgraduate students in geoinformatics and related data sciences. John Murtagh, UEL

Dec 24, 2015

Data Management for Geoinformatics

A short course on good data management for taught postgraduate students in geoinformatics and related data sciences.

John Murtagh, UEL

Data Collection

Data sources

1. Finding data – searching for and finding data that has already been released.

2. Getting hold of more data – asking for ‘new’ data from official sources, e.g. through Freedom of Information requests.

3. Collecting data yourself – gathering data and entering it into a database or a spreadsheet, whether you work alone or collaboratively.

Sometimes data is public on a website but there is no download link to get hold of it in bulk – but don’t give up! This data can be liberated with what data wranglers call scraping. More later…
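To give a rough idea of what scraping means in practice, here is a minimal sketch using only the Python standard library. It reads an HTML table and turns it into CSV text. The sample page, the station names and the helper names (`TableScraper`, `scrape_table`, `rows_to_csv`) are all made up for illustration; a real scraper would first download the page and would need to cope with messier markup.

```python
import csv
import io
from html.parser import HTMLParser

# Made-up sample page; a real scraper would download the HTML first.
SAMPLE_HTML = """
<table>
  <tr><th>Station</th><th>Rainfall (mm)</th></tr>
  <tr><td>Alpha</td><td>12.5</td></tr>
  <tr><td>Beta</td><td>8.0</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collect the text of every table cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows
        self._row = None   # row currently being read
        self._cell = None  # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return the table cells found in `html` as a list of rows."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


def rows_to_csv(rows):
    """Serialise rows as CSV text, ready to save or load into a spreadsheet."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


print(rows_to_csv(scrape_table(SAMPLE_HTML)))
```

Before scraping a real site, check its terms of use and the licence attached to the data.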

Finding already released data, or… open data

Open Data – a definition

“A piece of data is open if anyone is free to use, reuse, and redistribute it – subject only, at most, to the requirement to attribute and/or share-alike.”

Where can I find open data?

Lots of places!

ScraperWiki

…is an online tool to make the process of extracting “useful bits of data easier so they can be reused in other apps, or rummaged through by journalists and researchers.”

Most of the scrapers and their databases are public and can be re-used.

Geospatial data…

A group for open geospatial data with an emphasis on use in teaching and research.

Data.ac.uk

“a landmark site for academia providing a single point of contact for linked open data development.”

It not only provides access to the know-how and tools to discuss and create linked data and data aggregation sites, but also enables access to, and the creation of, large aggregated data sets providing powerful and flexible collections of information.

Other developments

• A number of startups are emerging that aim to build communities around data sharing and resale. These include Buzzdata and Figshare (places to share and collaborate on private and public datasets) and data shops such as Infochimps and DataMarket.

• DataCouch: a place to upload, refine, share & visualize your data.

• The World Bank and United Nations data portals provide high-level indicators for all countries, often for many years in the past.

• An interesting Google subsidiary, Freebase, provides “an entity graph of people, places and things, built by a community that loves open data.”

• Research data. There are numerous national and disciplinary aggregators of research data, such as the UK Data Archive. While there will be lots of data that is free at the point of access, there will also be much data that requires a subscription, or which cannot be reused or redistributed without asking permission first.

While they may not always be easy to find, many databases on the web are indexed by search engines, whether the publisher intended this or not. Here are a few tips:

Tips for searching for data (from the Data Journalism Handbook)

When searching for data, include both search terms relating to the content of the data you’re trying to find and some information on the format or source you expect it to be in.

Google and other search engines allow you to search by file type.

http://datajournalismhandbook.org/1.0/en/getting_data_0.html

For example, you can look only for…

• Spreadsheets (by appending ‘filetype:xls’ or ‘filetype:csv’ to your search)

• Geodata (‘filetype:shp’)

• Database extracts (‘filetype:mdb’, ‘filetype:sql’, ‘filetype:db’)

• PDFs (‘filetype:pdf’)

You can also search by part of a URL. Googling for ‘inurl:downloads filetype:xls’ will try to find all Excel files that have “downloads” in their web address. (If you find a single download, it’s often worth checking what other results exist for the same folder on the web server.)

You can also limit your search to results on a single domain name by searching for, e.g., ‘site:agency.gov’.

Another popular trick is not to search for content directly, but for places where bulk data may be available. For example, ‘site:agency.gov Directory Listing’ may give you some listings generated by the web server with easy access to raw files, while ‘site:agency.gov Database Download’ will look for intentionally created listings.
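The search operators above are easy to combine by hand, but a small helper keeps them consistent when you try many variations. The sketch below only builds the query text to paste into a search engine; `build_query` and its parameter names are made up for this example, and ‘agency.gov’ is the placeholder domain used on the slide.

```python
def build_query(terms="", site=None, inurl=None, filetype=None):
    """Compose a search string from plain terms plus optional operators."""
    parts = [terms.strip()]
    if site:
        parts.append(f"site:{site}")          # restrict to one domain
    if inurl:
        parts.append(f"inurl:{inurl}")        # match part of the web address
    if filetype:
        parts.append(f"filetype:{filetype.lower()}")  # match the file type
    # Drop empty pieces so that operator-only queries have no stray spaces.
    return " ".join(part for part in parts if part)


# One query per spreadsheet format for the same topic.
for extension in ("xls", "csv"):
    print(build_query("South Africa", filetype=extension))

# Look for bulk download pages on a single (placeholder) domain.
print(build_query("Database Download", site="agency.gov"))
```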

Getting hold of new data

Freedom of Information (FoI) and Environmental Information Regulations (EIR) legislation gives the public a right to access information (including research data) held by a UK public authority, which includes most universities, colleges and publicly funded research institutions.

The information requested must be provided unless an exemption or exception allows the institution not to disclose it.

A request can be addressed to anyone in the University, and the institution has only 20 working days to respond.
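To see what 20 working days means on a calendar, the following sketch counts forward over weekdays. It is a rough estimate only: it assumes the count starts on the day after the request is received, and it ignores public holidays, which a real deadline calculation would have to exclude. The function name and the example date are made up for illustration.

```python
from datetime import date, timedelta


def add_working_days(start, days):
    """Return the date `days` working days after `start` (Monday to Friday only)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


received = date(2024, 6, 3)  # hypothetical date on which the request arrived
print(add_working_days(received, 20))  # 2024-07-01
```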

You can make an FOI request through the website whatdotheyknow.com.

Statistics and Registration Service Act 2007

The Act is mainly concerned with the UK Statistics Authority and applies only to data designated as Official Statistics. It defines how 'personal information' can be disclosed to an 'Approved Researcher' i.e. an individual to whom the Statistics Authority has granted access, for the purposes of statistical research, to personal information held by it.

Although the Act does not apply to individual researchers managing confidential research data not designated as Official Statistics, such researchers might wish to adapt the Approved Researcher model for access to confidential data.

Environmental Information Regulations 2004

These Regulations give the public access rights to environmental information held by a public authority (including universities) in response to requests (similar to the Freedom of Information Act). Freedom of access does not imply free access. There are circumstances under which requests may or must be refused, for example if the data contain personal information.

Collecting data yourself

This means gathering data and entering it into a database or a spreadsheet – whether you work alone or collaboratively.

Getting data in the format you need

Finding more data using Google

You can search for CSV files on Google by typing +filetype:csv in the search bar. Searching for "South Africa +filetype:csv" will return CSV files mentioning South Africa. You can try other file types as well (such as "xls" for Excel spreadsheets, or "pdf").

Permissions and Licensing data

Licensing of Open Data - reuse

• Public domain dedications, which also serve as maximally permissive licenses; there are no conditions put upon using the work;

• Permissive or attribution-only licenses; giving credit is the only substantial condition;

• Copyleft, reciprocal, or share-alike licenses; these also require that modified works, if published, be shared under the same license.

As defined by the Open Knowledge Foundation

Using data

Ethics of carrying out research with data

The University's Research Ethics Committee (UREC) has specific responsibility for institutional oversight of matters relating to ethics and governance in research, undertaken by both staff and postgraduate research students, that involves human participation, personal sensitive data or human material.

Further information from the Quality Assurance and Enhancement officer: [email protected]
www.uel.ac.uk/qa/research/

Collecting personal or sensitive data

The Data Protection Act (1998) applies to personal or sensitive personal data, but not to research data in general, to anonymised data, or to data about participants who are no longer living.

Questions to ask yourself in your data creation.

1. Is personal data needed? Names and addresses, for example? If it is required, store it separately from the rest of your data.

2. Inform your participants about how their personal data will be used.

3. Not all research data obtained from participants constitute personal data: data that have been anonymised, for example, do not.
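One way to store personal details separately is to split each record in two and link the halves with a random identifier. The sketch below illustrates the idea; the function name, field names and sample record are made up for this example. Note that while the linking table exists the research data are pseudonymised rather than fully anonymised, so the personal table still needs to be kept securely.

```python
import uuid


def split_personal_data(records, personal_fields):
    """Split each record into a personal part and a research part.

    The two parts share only a randomly generated participant ID.
    """
    personal, research = [], []
    for record in records:
        participant_id = uuid.uuid4().hex  # random, carries no meaning
        personal_part = {k: v for k, v in record.items() if k in personal_fields}
        research_part = {k: v for k, v in record.items() if k not in personal_fields}
        personal.append({"participant_id": participant_id, **personal_part})
        research.append({"participant_id": participant_id, **research_part})
    return personal, research


# Made-up survey responses, for illustration only.
responses = [
    {"name": "A. Example", "address": "1 Sample Street", "commute_km": 3.2},
]
personal, research = split_personal_data(responses, {"name", "address"})
print(sorted(research[0]))  # ['commute_km', 'participant_id']
```

The research table can then be analysed and shared within the project, while the personal table is stored elsewhere with tighter access controls.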

Lastly: to gain access to police data or records, you may be subject to a Disclosure or DBS check (previously a CRB check), which provides details of any criminal record data held on you.

https://www.gov.uk/government/organisations/disclosure-and-barring-service

Other sessions as part of Data Management in Geoinformatics:

• Data Integration

• Data Management

• Data Sharing

Data Management for Geoinformatics by John Murtagh, as part of the Jisc-funded project TraD (University of East London), is licensed under a Creative Commons Attribution Share Alike Licence.