introduction
Outline
Introduction
Course outline
Software packages
Accessing data
References
Data Analytics for Social Science
Introduction
Johan A. Elkink
School of Politics & International Relations
University College Dublin
23 January 2020
Outline
1 Introduction
2 Course outline
3 Software packages
4 Accessing data
1 Introduction
Statistics and politics
Trade-offs in research methods
In statistical analysis, observations are quantified and statistical models are used to investigate relationships between variables.
Qualitative              Quantitative
small number of cases    many cases
in-depth                 breadth
more accurate            more generalizable
measurement validity     measurement reliability
(Gerring, 2001)
Typical data
• Survey data, where individuals are asked about demographics, attitudes, behaviour, preferences, etc.
• National data, where characteristics of the institutional regime or economic variables are recorded.
• Organisational data, e.g. political parties, companies, etc.
2016 Brexit Referendum questionnaire
“If you do vote in the referendum on Britain’s membership of the European Union, how do you think you will vote?”
1 Remain in the EU
2 Leave the EU
3 Don’t know
Example: age and voting in Brexit
[Figure: Support for Brexit by age. x-axis: Age; y-axis: Proportion voting for leave.]
2016 ‘Brexit’ Referendum questionnaire
“How sure are you about what would happen to the UK if it left the EU or if it remained in the EU?” (separately asked for leaving and for remaining in the EU)
1 Very unsure
2 Quite unsure
3 Quite sure
4 Very sure
5 Don’t know
Example: uncertainty and voting in Brexit
[Figure: Support for Brexit by levels of uncertainty. x-axis: Uncertainty remain in EU (1 to 4); y-axis: Proportion voting for leave; legend: Uncertainty leaving EU (1 to 4).]
Data science
In academic research, our objective is to understand the social world. We typically want to identify causal relationships between variables. E.g. are voters with less knowledge of politics less likely to vote?
Commercially, the objective is often to predict the social world. E.g. given that you bought this book, and others bought the following books, which book are you most likely to want to buy next?
The high demand for data scientists not only means more work for people who understand social science and statistics; many new tools are also being developed.
E.g. statistical analysis of text, network analysis, deep learning tools.
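The book-recommendation question can be illustrated with a simple co-purchase count. The following is a minimal sketch in Python (one of the script formats listed later in these slides), not a description of how any real retailer works. The baskets, the book titles and the function name `recommend` are all hypothetical.

```python
from collections import Counter

# Hypothetical purchase histories: each set is one customer's basket.
baskets = [
    {"Book A", "Book B", "Book C"},
    {"Book A", "Book B"},
    {"Book A", "Book C", "Book D"},
    {"Book B", "Book D"},
]


def recommend(bought, baskets):
    """Return the title most often bought together with `bought`, or None."""
    co_purchases = Counter()
    for basket in baskets:
        if bought in basket:
            # Count every other title in a basket that contains `bought`.
            co_purchases.update(basket - {bought})
    if not co_purchases:
        return None
    # Sort by count (descending), then by title, so ties resolve predictably.
    ranked = sorted(co_purchases.items(), key=lambda item: (-item[1], item[0]))
    return ranked[0][0]


print(recommend("Book A", baskets))
```

Note that the sketch only predicts: it says nothing about why customers buy these books together, which is the kind of causal question academic research would ask.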
Data science
social science = substantive expertise
2 Course outline
Topics
looking at data
Introduction
Data inspection & visualisation
Comparing through visualisation
classifying groups
Linear regression
Logistic regression
Trees and forests
Networks and geography
mapping data
Cluster analysis
Principal components & multidimensional scaling
Wordscores
Topic models
Topics
visualisation
Introduction
Data inspection & visualisation
Comparing through visualisation
supervised learning
Linear regression
Logistic regression
Trees and forests
Networks and geography
unsupervised learning
Cluster analysis
Principal components & multidimensional scaling
Wordscores
Topic models
Topics
survey data
Introduction
Data inspection & visualisation
Comparing through visualisation
national data
Linear regression
Logistic regression
Trees and forests
Networks and geography
text data
Cluster analysis
Principal components & multidimensional scaling
Wordscores
Topic models
Textbook
Assignments
1 The first analysis will concern survey data and make use primarily of graphical and descriptive statistics, based on the Brexit referendum survey. Deadline: 24 February.
2 The second analysis will focus on the use of regression analysis and classification, based on country-level data. Deadline: 6 April.
3 The third analysis will focus on the statistical analysis of text. Deadline: 5 May.
Note that all assignments will be based on the statistical analyses performed during the lab sessions, in class; you will not be required to perform new analysis (although trying some additional variations can improve the quality of the submission).
Plagiarism
Is not allowed. Details can be found in the syllabus ...
Contact
No office this year, so contact by email: [email protected]
http://www.joselkink.net/DASS-Spring-2020.php
Always check the course website!
Do not hesitate to get in touch when struggling with the module!
3 Software packages
Software comparison
Source: http://r4stats.com/articles/popularity/, 12 June 2015
Software comparison (log scale)
Source: http://r4stats.com/articles/popularity/, 12 June 2015
Software and code
For the sake of replicability and transparency, saving commands is key in the use of statistical software.
• Data preparation
• Data transformation
• Descriptives
• Analysis
Including clarifying commentary.
Software   Script format
SPSS       .sps
Stata      .do
R          .R
Python     .py
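The four stages above could be laid out in one commented script file. The sketch below uses Python (a `.py` file) purely for illustration; the same layout applies to an `.R`, `.do` or `.sps` script. It uses the Age and Vote columns of the example data set shown in the "Accessing data" part of these slides. The age cut-off of 40 and all variable names are assumptions made for the example.

```python
from statistics import mean

# --- Data preparation: enter the raw observations ---
# Age and Vote columns of the example data set from these slides.
ages = [21, 30, 80, 50, 33, 20, 43, 42]
votes = ["Yes", "No", "Yes", "Yes", "No", "No", "Yes", "Yes"]
respondents = [{"age": a, "voted": v == "Yes"} for a, v in zip(ages, votes)]

# --- Data transformation: derive an age group (the cut-off of 40 is arbitrary) ---
for r in respondents:
    r["age_group"] = "under 40" if r["age"] < 40 else "40 and over"

# --- Descriptives: summarise a single variable ---
mean_age = mean(r["age"] for r in respondents)


# --- Analysis: compare turnout between the two age groups ---
def turnout(group):
    """Proportion of respondents in `group` who voted."""
    members = [r for r in respondents if r["age_group"] == group]
    return sum(r["voted"] for r in members) / len(members)


turnout_by_group = {g: turnout(g) for g in ("under 40", "40 and over")}

print(f"Mean age: {mean_age}")
print(turnout_by_group)
```

Because every step is saved in the script, anyone can rerun it from the raw values and arrive at the same numbers.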
R
Developed by statisticians and extensively used in political science, data science, statistics, etc.
Pros                             Cons
Free software                    Variable documentation quality
Very extensive package library   Inconsistent interfaces
Real programming language        Steep learning curve at start
Large and active user-base       No graphical user interface¹
Multiple data sets
Highest quality graphics
http://www.r-project.org
http://www.rstudio.com
¹ ... but that’s why we use RStudio.
RStudio
RStudio data view
RStudio with RMarkdown
See also the video on using Markdown in RStudio.
Data analysis process
4 Accessing data
Example data set
   Age  Vote  Party  Education  Sex
1  21   Yes   FF     4          Male
2  30   No           3          Female
3  80   Yes   FG     3          Male
4  50   Yes   Lab    2          Male
5  33   No           5          Female
6  20   No           2          Female
7  43   Yes   FF     5          Female
8  42   Yes   FF     2          Male
FF = Fianna Fail; FG = Fine Gael; Lab = Labour
Education: 1 = none; 2 = primary; 3 = secondary; 4 = tertiary; 5 = post-graduate
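To show how such a table looks once it is held in software, the sketch below enters the eight rows in Python, with one record per respondent. Storing a blank Party cell as `None` and the variable names are choices made for this illustration; the education labels follow the legend above.

```python
from collections import Counter

# Education codes as given in the legend of the example data set.
EDUCATION = {1: "none", 2: "primary", 3: "secondary", 4: "tertiary", 5: "post-graduate"}

# One tuple per respondent: (age, vote, party, education code, sex).
# A blank Party cell in the table is stored as None (missing value).
rows = [
    (21, "Yes", "FF", 4, "Male"),
    (30, "No", None, 3, "Female"),
    (80, "Yes", "FG", 3, "Male"),
    (50, "Yes", "Lab", 2, "Male"),
    (33, "No", None, 5, "Female"),
    (20, "No", None, 2, "Female"),
    (43, "Yes", "FF", 5, "Female"),
    (42, "Yes", "FF", 2, "Male"),
]

columns = ("age", "vote", "party", "education", "sex")
data = [dict(zip(columns, row)) for row in rows]

# Replace the numeric education codes by their labels.
for record in data:
    record["education"] = EDUCATION[record["education"]]

# Tabulate party choice, skipping respondents with a missing party.
party_counts = Counter(r["party"] for r in data if r["party"] is not None)
print(party_counts)
```

In this table the party is missing exactly for the respondents who did not vote, so any tabulation of party choice has to decide explicitly how to treat those rows.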
Data formats
• data prepared for competitor statistical packages such as Stata and SPSS;
• data published in tables on the web, such as in Wikipedia;
• data published in raw text tabular format, especially for large surveys;
• data published in Excel or other spreadsheet (see video);
• data stored in relational or non-relational databases, such as SQL, Redis, etc.;
• or just plain text files.
→ package “rio”, command “import()”
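The appeal of a single import command is that the reader is chosen from the file name, so the user does not have to remember a different function per format. The sketch below illustrates that idea in Python with the standard library only. It is not the rio package and does not reproduce its behaviour; the function name `import_data`, the `READERS` table and the three supported extensions are assumptions made for the example.

```python
import csv
import json
from pathlib import Path


def read_delimited(path, delimiter):
    """Read a delimited text file with a header row into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter=delimiter))


def read_json(path):
    """Read a JSON file into the corresponding Python objects."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Hypothetical mapping from file extension to reader function.
READERS = {
    ".csv": lambda path: read_delimited(path, ","),
    ".tsv": lambda path: read_delimited(path, "\t"),
    ".json": read_json,
}


def import_data(path):
    """Pick a reader based on the file extension; fail loudly if unsupported."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"Unsupported file format: {suffix!r}")
    return READERS[suffix](path)
```

A call such as `import_data("survey.csv")` (a hypothetical file name) would return one dictionary per row. All values read from delimited text arrive as strings, so numeric columns still need converting in the data preparation step.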
Lab
There is more to say about graphs and visualisation of data, but it is best to first get familiar with some basic visualisations through the use of R(Studio).
Demonstration:
• RStudio interface
• Data view
• Markdown syntax
• Installing R packages
• ... and the error shown when a package is missing
References
Gerring, John. 2001. Social science methodology: A critical framework. Cambridge: Cambridge University Press.