Dr. Claudia Wagner http://claudiawagner.info/ Web Science Summer School WS3, Southampton, UK, 21st July 2014

Datascience Introduction WebSci Summer School 2014

Aug 17, 2014

Claudia Wagner

http://www.summerschool.websci.net/
WebScience Summer School Southampton
Data Science 2014
Transcript
Page 1: Datascience Introduction WebSci Summer School 2014

Dr. Claudia Wagner http://claudiawagner.info/

Web Science Summer School WS3, Southampton, UK, 21st July 2014

Page 2: Datascience Introduction WebSci Summer School 2014

source: Twitter

Page 3: Datascience Introduction WebSci Summer School 2014

Statistical computing is central, but data science is more than statistics

Activities of data scientists: collection and generation, preparation, analysis, visualization, management and preservation of large collections of data

Jeffrey Stanton, Introduction to Data Science, free e-book

Page 4: Datascience Introduction WebSci Summer School 2014

Ask an interesting question
▪ Why is it important? Which number answers your question?

Get or generate the data
▪ Which data will help answer your question? How is the data generated? Are there any sampling biases? Ethical issues?

Analyze the data
▪ Are there any anomalies or regularities? Which hidden process has generated the data? Fit a model to the data and validate it

Visualize and communicate results
▪ What does 75% probability mean?

Preserve and share the data to make results reproducible

Page 5: Datascience Introduction WebSci Summer School 2014

Data is a collection of facts
▪ Facts can be numbers, words, measurements, observations or even just descriptions of things

Qualitative data (e.g., “it was great”)

Quantitative data
▪ Discrete (e.g., 5)
▪ Continuous (e.g., 3.723)

Page 6: Datascience Introduction WebSci Summer School 2014

Nominal (e.g., ethnic group, sex, nationality): observations are only named

Ordinal (e.g., status): observations can be ordered

Interval (e.g., temperature in Celsius): distance is meaningful

Ratio (e.g., weight): absolute zero

Stevens, S. S. (1946). "On the Theory of Scales of Measurement". Science 103 (2684): 677–680.

Page 7: Datascience Introduction WebSci Summer School 2014


Page 8: Datascience Introduction WebSci Summer School 2014

Random sample of Twitter users
▪ Random sample of tweets from the public timeline
▪ More active users are more likely to be included

Friendship Paradox
▪ Select a random sample of people and ask them to list the people they know. Contact a sample of the listed friends and repeat the survey.
▪ Sampling bias: people with more friends are more likely to show up in the friend lists which we generate at the first stage
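The friend-list sampling bias can be checked on a toy network. The following is a minimal Python sketch; the graph `edges` and the person names are hypothetical and not taken from the slides. It compares the average number of friends of a uniformly chosen person with the average number of friends of a person drawn from the friend lists.

```python
from collections import defaultdict

# Hypothetical toy friendship graph (undirected edges), not data from the slides.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("a", "e"), ("b", "c"), ("e", "f")]


def degrees(edge_list):
    """Count the number of friends (degree) of every person."""
    deg = defaultdict(int)
    for u, v in edge_list:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)


deg = degrees(edges)

# Stage 1: pick people uniformly at random -> plain mean degree.
mean_degree = sum(deg.values()) / len(deg)

# Stage 2: pick names from friend lists -> every person appears once per
# friendship, so a person with d friends is d times as likely to be listed.
listed = [person for edge in edges for person in edge]
mean_degree_of_listed = sum(deg[p] for p in listed) / len(listed)

print(mean_degree, mean_degree_of_listed)
```

In this toy graph the people reached through friend lists have more friends on average than a uniformly sampled person, which is the bias described above.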

Page 9: Datascience Introduction WebSci Summer School 2014

A study found that the profession with the lowest average age of death was student.
▪ Being a student does not cause you to die at an early age.
▪ Being a student means you are young. This is what makes the average age of those who die so low.

The amount of ice cream consumed per day is highly correlated with the number of drownings per day
▪ Both variables are correlated with the daily temperature

"Teaching Statistics: A Bag of Tricks," by Gelman and Nolan (2002)

Page 10: Datascience Introduction WebSci Summer School 2014

A study found that only 1.5% of drivers in accidents reported that they were using a cell phone, whereas 10.9% reported that they were distracted by another occupant in the car.

Can we conclude that using a cell phone is safer than speaking with another occupant?
▪ P(cellphone | accident) != P(accident | cellphone)
▪ Compare P(accident | cellphone) and P(accident | occupant)
▪ We need to know the prevalence of cell phone use
▪ It is likely that many more people talk to another occupant in the car while driving than talk on a cell phone

Jessica Utts, What Educated Citizens Should Know about Statistics and Probability, The American Statistician, Vol. 57, No. 2 (May, 2003), pp. 74-79
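Why the prevalence matters can be seen with Bayes' rule: P(accident | cellphone) = P(cellphone | accident) · P(accident) / P(cellphone). A minimal Python sketch follows. The two percentages among accidents are the ones reported in the study, while both prevalence values are purely hypothetical assumptions chosen for illustration.

```python
def relative_risk(p_a_given_acc, p_b_given_acc, prevalence_a, prevalence_b):
    """Return P(accident | a) / P(accident | b) using Bayes' rule.

    P(accident) appears in both numerator and denominator and cancels out.
    """
    return (p_a_given_acc / prevalence_a) / (p_b_given_acc / prevalence_b)


p_cell_given_accident = 0.015      # reported in the study
p_occupant_given_accident = 0.109  # reported in the study
prevalence_cell = 0.01      # HYPOTHETICAL: share of driving time spent on the phone
prevalence_occupant = 0.30  # HYPOTHETICAL: share of driving time spent talking to an occupant

rr = relative_risk(p_cell_given_accident, p_occupant_given_accident,
                   prevalence_cell, prevalence_occupant)
print(round(rr, 2))
```

Under these assumed prevalences the cell phone would be the riskier distraction even though it shows up less often in the accident reports; with other prevalences the conclusion can flip.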

Page 11: Datascience Introduction WebSci Summer School 2014

Ecological Fallacy

Illiteracy rate in each US state and the proportion of immigrants per state

Negative correlation of −0.53
▪ The greater the proportion of immigrants in a state, the lower its average illiteracy.

When individuals are considered, the correlation was +0.12: immigrants were on average more illiterate than native citizens.

Robinson, W.S. (1950). "Ecological Correlations and the Behavior of Individuals". American Sociological Review 15 (3): 351–357.

Page 12: Datascience Introduction WebSci Summer School 2014

Data Collection Data Preprocessing Data Analysis Data Visualization Data Preservation

Page 13: Datascience Introduction WebSci Summer School 2014

Found data or observational data
▪ Are observational data enough?
▪ Are such data available?

Generate data
▪ Design the data generation process
▪ E.g., via surveys, experiments, crowdsourcing

Page 14: Datascience Introduction WebSci Summer School 2014

http://www.intel.com/content/www/us/en/communications/internet-minute-infographic.html

Page 15: Datascience Introduction WebSci Summer School 2014

Two general types of traces:
▪ Accretion: a build-up of physical traces
▪ Erosion: the wearing away of material

Webb, Eugene J. et al. Unobtrusive Measures: nonreactive research in the social sciences. Chicago: Rand McNally, 1966

Page 16: Datascience Introduction WebSci Summer School 2014

Bulk downloads
▪ Wikipedia, IMDB, Million Song Dataset, etc.

API access
▪ NY Times, Twitter, Facebook, Foursquare, etc.

Web scraping
▪ Tools, e.g., http://scrapy.org/
▪ What data is OK to scrape?
  ▪ Public, non-sensitive, anonymized, fully referenced information. Check the terms and conditions!

Page 17: Datascience Introduction WebSci Summer School 2014

Takes time to accumulate

Conservative estimate
▪ Only what happened counts! Intentions, motivations or internal states don't count.

Inferentially weak
▪ Cannot answer “what-if” questions

Page 18: Datascience Introduction WebSci Summer School 2014

Surveys

Simulations
▪ Model behavior of users/agents on a micro-level
▪ Simulate what happens under different conditions
▪ Empirical validation

Experiments
▪ Keep all variables constant and only manipulate one variable (e.g., emotions)

Page 19: Datascience Introduction WebSci Summer School 2014

Simulations
▪ Study of macro-phenomena
▪ Difficult to validate empirically

Surveys and/or Experiments
▪ We only get data from those who are accessible and willing to respond or participate
▪ Respondents provide answers that are in line with their self-image and the researcher's expectations
▪ Hawthorne effect, etc.

Page 20: Datascience Introduction WebSci Summer School 2014

Data Collection Data Preprocessing Data Analysis Data Visualization Data Preservation

Page 21: Datascience Introduction WebSci Summer School 2014

Data cleaning
▪ Fill in missing values
▪ Smooth noisy data
▪ Identify or remove outliers
▪ Resolve inconsistencies

Data integration
▪ Integration of multiple databases or files

Page 22: Datascience Introduction WebSci Summer School 2014

Data transformation
▪ Normalization: values are scaled to fall within a small, specified range
▪ Standardization: how many standard deviations from the mean each data point lies
▪ Discretization: divide the range of a continuous attribute into intervals; some algorithms require discrete attributes

Data reduction
▪ Dimensionality reduction (remove unimportant attributes via feature selection, group features into factors, e.g., PCA, SVD)
▪ Aggregation and clustering
▪ Sampling
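The three data transformations can be sketched in a few lines of Python. This is an illustrative sketch only: the attribute values and the function names `min_max_normalize`, `standardize` and `discretize` are assumptions, and equal-width binning is just one of several possible discretization schemes.

```python
import statistics

values = [2.0, 4.0, 6.0, 8.0, 10.0]  # hypothetical attribute values


def min_max_normalize(xs, lo=0.0, hi=1.0):
    """Normalization: rescale the values linearly into the range [lo, hi]."""
    x_min, x_max = min(xs), max(xs)
    return [lo + (x - x_min) * (hi - lo) / (x_max - x_min) for x in xs]


def standardize(xs):
    """Standardization: distance from the mean in units of standard deviations."""
    mean = statistics.fmean(xs)
    sd = statistics.pstdev(xs)  # population SD of the given values
    return [(x - mean) / sd for x in xs]


def discretize(xs, n_bins):
    """Discretization: map each value to the index of an equal-width interval."""
    x_min, x_max = min(xs), max(xs)
    width = (x_max - x_min) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((x - x_min) / width), n_bins - 1) for x in xs]


normalized = min_max_normalize(values)
z_scores = standardize(values)
bins = discretize(values, n_bins=2)
print(normalized, z_scores, bins)
```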

Page 23: Datascience Introduction WebSci Summer School 2014

Data Collection Data Preprocessing Data Mining Data Analysis Statistical Inference Data Visualization Machine Learning Data Preservation

Page 24: Datascience Introduction WebSci Summer School 2014

Problem:

▪ Given a high-dimensional space (e.g., fb-users who are described via various attributes such as locations they visited)
▪ Find pairs of data points (x, y) that are within some distance threshold: $d(x, y) \le s$

We first need to decide what “distance” means

Page 25: Datascience Introduction WebSci Summer School 2014

Distance Measures

Jaccard similarity between two sets of items $I_1$, $I_2$:

$\mathrm{sim}(I_1, I_2) = \frac{|I_1 \cap I_2|}{|I_1 \cup I_2|}$

$\mathrm{dist}(I_1, I_2) = 1 - \mathrm{sim}(I_1, I_2)$

Euclidean distance, Hamming distance, cosine similarity, etc.
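A minimal Python sketch of the Jaccard distance, and of finding all pairs within a distance threshold s, could look as follows. The users and visited locations in `visited` are hypothetical, and the brute-force loop over all pairs is only an assumption suitable for tiny data sets.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b| for two collections of items."""
    a, b = set(a), set(b)
    union = a | b
    if not union:  # two empty sets: treat them as identical
        return 0.0
    return 1.0 - len(a & b) / len(union)


def close_pairs(items, s):
    """Return all pairs of keys whose item sets are within distance s."""
    return [
        (x, y)
        for x, y in combinations(sorted(items), 2)
        if jaccard_distance(items[x], items[y]) <= s
    ]


# Hypothetical users described by the locations they visited.
visited = {
    "u1": {"bar", "gym", "park"},
    "u2": {"bar", "park"},
    "u3": {"museum"},
}

print(close_pairs(visited, s=0.5))
```

Here only u1 and u2 share locations (distance 1/3), so they form the only pair within the threshold.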

Page 26: Datascience Introduction WebSci Summer School 2014

Goal: Given a set of items, group the items into some number of clusters, so that
▪ Members of a cluster are similar to each other
▪ Members of different clusters are dissimilar

Anand Rajaraman, Jeffrey Ullman, Jure Leskovec, Mining of Massive Datasets, Cambridge University Press

Page 27: Datascience Introduction WebSci Summer School 2014

Non-hierarchical / point assignment:
▪ Maintain a set of clusters
▪ Points belong to the “nearest” cluster

Hierarchical:
▪ Agglomerative (bottom up):
  ▪ Initially, each point is a cluster
  ▪ Repeatedly combine the two “nearest” clusters into one
▪ Divisive (top down):
  ▪ Start with one cluster and recursively split it

Anand Rajaraman, Jeffrey Ullman, Jure Leskovec, Mining of Massive Datasets, Cambridge University Press
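The agglomerative (bottom-up) variant can be sketched in Python as below. This is a simplified illustration under stated assumptions: the points are one-dimensional, “nearest” is taken to mean the smallest distance between cluster centroids, and the stopping rule is a fixed number of clusters. The function name `agglomerative` and the data are hypothetical.

```python
def agglomerative(points, n_clusters):
    """Bottom-up clustering of 1-D points using centroid distance."""
    clusters = [[p] for p in points]  # initially, each point is a cluster
    while len(clusters) > n_clusters:
        best = None  # (distance, index_i, index_j) of the two nearest clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                centroid_i = sum(clusters[i]) / len(clusters[i])
                centroid_j = sum(clusters[j]) / len(clusters[j])
                distance = abs(centroid_i - centroid_j)
                if best is None or distance < best[0]:
                    best = (distance, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # combine the two nearest clusters
        del clusters[j]
    return [sorted(c) for c in clusters]


result = agglomerative([1.0, 1.5, 2.0, 10.0, 10.5, 30.0], n_clusters=3)
print(result)
```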

Page 28: Datascience Introduction WebSci Summer School 2014


Page 29: Datascience Introduction WebSci Summer School 2014


Page 30: Datascience Introduction WebSci Summer School 2014


Page 31: Datascience Introduction WebSci Summer School 2014


Page 32: Datascience Introduction WebSci Summer School 2014


Page 33: Datascience Introduction WebSci Summer School 2014

Try different k, looking at the change in the average distance to centroid as k increases

The average falls rapidly until the right k, then changes little

[Figure: average diameter plotted against k; the best k is where the curve flattens]

Page 34: Datascience Introduction WebSci Summer School 2014

Aim: Find hidden concepts/groups in a matrix

Method: Singular Value Decomposition (SVD)

Leskovec et al., Mining of Massive Datasets, p. 418

Page 35: Datascience Introduction WebSci Summer School 2014

Rank = 2

Rank denotes the information content of the matrix.

For instance, a rank-1 matrix can be written as the product of one column vector and one row vector

Page 36: Datascience Introduction WebSci Summer School 2014


Page 37: Datascience Introduction WebSci Summer School 2014

Relates users to concepts

Relates movies to concepts

Strength of concepts

Leskovec et al., Mining of Massive Datasets, p. 418

Page 38: Datascience Introduction WebSci Summer School 2014

Data Collection Data Preprocessing Data Mining Data Analysis Statistical Inference Data Visualization Machine Learning Data Preservation

Page 39: Datascience Introduction WebSci Summer School 2014

Estimate population parameters from sample statistics

Sampling distribution of a statistic:
▪ Draw a sample of size n from the population
▪ Compute the statistic on the sample
▪ Repeat this process

The mean of the sampling distribution is the expected value of the statistic in the true population

The SD of the sampling distribution is the standard error

Page 40: Datascience Introduction WebSci Summer School 2014


Page 41: Datascience Introduction WebSci Summer School 2014

Some descriptive statistics such as mean or median are unbiased estimators of central tendency
▪ The expected value of the statistic is the true population parameter

The expected value of dispersion in a sample is an underestimate of the true population value

Page 42: Datascience Introduction WebSci Summer School 2014

True population size is N; sample size n < N (e.g., n = 100)

Correction factor: $\frac{n}{n-1}$

For n = 100 the correction factor is ~1.01

For n = 100,000 the correction factor is ~1.00001

Estimate of the population variance: $\left(\frac{n}{n-1}\right) \cdot \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}$
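The effect of the correction factor n/(n−1) can be verified with Python's standard library. In this sketch the sample values are only illustrative; `statistics.pvariance` divides by n and `statistics.variance` divides by n−1.

```python
import statistics

sample = [2.0, 3.0, 8.0, 8.0]  # illustrative sample
n = len(sample)

uncorrected = statistics.pvariance(sample)  # sum of squared deviations / n
corrected = uncorrected * n / (n - 1)       # apply the correction factor n/(n-1)

# The corrected value matches the usual sample variance (division by n-1).
print(uncorrected, corrected, statistics.variance(sample))

# The correction factor shrinks towards 1 as n grows.
for size in (100, 100_000):
    print(size, size / (size - 1))
```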

Page 43: Datascience Introduction WebSci Summer School 2014

Specify the range of values that have a high probability of containing the true population parameter

Confidence level 1 − α: the probability that the confidence interval contains the true population parameter

Page 44: Datascience Introduction WebSci Summer School 2014

CI = sample statistic ± MOE

MOE = SE × critical value

$\mathrm{MOE} = \frac{\sigma}{\sqrt{n}} \cdot z_{\alpha/2}$

n … sample size
σ … standard deviation
$z_{\alpha/2}$ … confidence coefficient

Critical value: how far away from the mean must a point lie in order to be considered “extreme” or “unexpected”?

Page 45: Datascience Introduction WebSci Summer School 2014


Page 46: Datascience Introduction WebSci Summer School 2014

Area under the curve is 0.475. What's the z-score?

Page 47: Datascience Introduction WebSci Summer School 2014

Select 1000 fb-users randomly

Average number of bar visits per year: $\bar{X} = 78$

Standard deviation: $\sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}} = 30$

Confidence level is 95%: divide 0.95 by 2 to get 0.475

Check the z table: z = 1.96

$\mathrm{MOE} = \frac{\sigma}{\sqrt{n}} \cdot z_{\alpha/2} = \frac{30}{\sqrt{1000}} \cdot 1.96 = 1.86$

78 ± 1.86, CI: [76.14; 79.86]
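The same interval can be computed without a printed z table. The following Python sketch uses `statistics.NormalDist` from the standard library to look up the critical value (about 1.96 for a 95% confidence level); the variable names are illustrative.

```python
import math
from statistics import NormalDist

n = 1000           # sample size
mean = 78.0        # average number of bar visits per year
sd = 30.0          # standard deviation
confidence = 0.95

# An area of 0.475 between the mean and z corresponds to the 0.975 quantile.
z = NormalDist().inv_cdf(0.5 + confidence / 2)
moe = z * sd / math.sqrt(n)        # margin of error = critical value * SE
ci = (mean - moe, mean + moe)

print(round(z, 2), round(moe, 2), [round(bound, 2) for bound in ci])
```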

Page 48: Datascience Introduction WebSci Summer School 2014

An exact CI can only be computed when the sampling distribution and the SD of the sampling distribution (i.e., the SE) are known

Otherwise we have to estimate the standard error (SE)
▪ Bootstrap

Page 49: Datascience Introduction WebSci Summer School 2014

Sampling with replacement
▪ The population is unknown
▪ But we observe one sample of size n = 4 from the population: {2, 3, 8, 8}
▪ We use this sample to generate a large number of bootstrap samples of size n:
  ▪ 8, 8, 8, 3
  ▪ 3, 3, 8, 2
  ▪ …
▪ Compute the statistic (e.g., mean) for each bootstrap sample
▪ Estimate the SE from the bootstrap distribution
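The bootstrap steps can be sketched in Python as follows, using the sample {2, 3, 8, 8}. The function name `bootstrap_se`, the number of resamples and the fixed seed are assumptions made for illustration.

```python
import random
import statistics


def bootstrap_se(sample, statistic, n_resamples=10_000, seed=0):
    """Estimate the standard error of `statistic` by resampling with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    estimates = [
        statistic(rng.choices(sample, k=n))  # one bootstrap sample of size n
        for _ in range(n_resamples)
    ]
    # The SD of the bootstrap distribution is the estimated standard error.
    return statistics.stdev(estimates)


sample = [2, 3, 8, 8]
se = bootstrap_se(sample, statistics.fmean)
print(round(se, 2))
```

For the mean, the result should be close to the SD of the sample divided by √n, which is roughly 1.39 here.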

Page 50: Datascience Introduction WebSci Summer School 2014

[Figure: Population → Sample → several Bootstrap Samples → Bootstrap Distribution]

Calculate the statistic for each bootstrap sample

Standard Error (SE): SD of the bootstrap distribution

Statistic ± MOE

MOE for 95% CI = 2 × SE

Page 51: Datascience Introduction WebSci Summer School 2014

Randomly selected sample of fb-users

Have they ever checked in at a nightclub?
▪ Democrats: 100/1000 yes
▪ Republicans: 90/1000 yes

Do the nightlife preferences differ significantly across political parties?
▪ Give a 95% CI for the difference in proportions

Page 52: Datascience Introduction WebSci Summer School 2014

dems = rep( c(0,1), c(1000-100, 100) )
repubs = rep( c(0,1), c(1000-90, 90) )
mean(dems)    # 0.1
mean(repubs)  # 0.09
del.p = mean(dems) - mean(repubs)  # 0.01 (point estimate)

# bootstrap the difference in proportions
reps = replicate( 1000, {
  ds = sample( dems, 1000, replace=TRUE )
  rs = sample( repubs, 1000, replace=TRUE )
  mean( ds ) - mean( rs )
} )
SE = sd( reps )  # 0.0131
c( del.p - 2*SE, del.p + 2*SE )  # -0.0162 0.0362 (interval estimate)

Page 53: Datascience Introduction WebSci Summer School 2014

H1: political party affects nightlife preferences

H0: political party does not affect nightlife preferences

Proportion of users who visited nightclubs, no matter which party they belong to: 190/2000 = 0.095

Observed frequencies:

      Democrats  Republicans  Total
yes   100        90           190
no    900        910          1810

If political affinities have no effect, we would expect the following frequencies:

      Democrats  Republicans  Total
yes   95         95           190
no    905        905          1810

Page 54: Datascience Introduction WebSci Summer School 2014

$\chi^2 = \sum \frac{(o - e)^2}{e} = 0.5815$

DF = (number of rows − 1) × (number of columns − 1) = 1

Critical value of χ² at 5% significance and 1 DF is 3.84

Our χ² does not exceed the critical value

We cannot reject H0

      Democrats  Republicans  Total
yes   100        90           190
no    900        910          1810
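The χ² statistic for this table can be reproduced with a short Python sketch. The function name `chi_square` is illustrative; expected counts are derived from the row and column totals under H0.

```python
# Observed frequencies: rows are yes/no, columns are Democrats/Republicans.
observed = [[100, 90], [900, 910]]


def chi_square(table):
    """Return the chi-square statistic and degrees of freedom of a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / total  # expected count under H0
            statistic += (o - e) ** 2 / e
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


statistic, df = chi_square(observed)
critical_value = 3.84  # chi-square critical value at 5% significance and 1 DF
print(round(statistic, 4), df, statistic > critical_value)
```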

Page 55: Datascience Introduction WebSci Summer School 2014

If α = 0.05 then 95% of all values fall in this interval

Two-tailed test: the 2.5% of values in the upper tail and the 2.5% in the lower tail are considered so extreme that we reject H0 if we observe them

Page 56: Datascience Introduction WebSci Summer School 2014

Test if democrats on fb, on average, have more than 60 bar visits per year
▪ H1: µ > 60
▪ H0: µ ≤ 60

Random sample of 20 democratic fb-users: {65, 73, 51, 67, 48, 80, 69, 53, 59, 62, 71, 67, 64, 78, 65, 49, 80, 60, 51, 70}

Sample mean $\hat{\mu} = 64.1$

Assume we know the SD in the population = 10

$z = \frac{\hat{\mu} - \mu}{SE}$, $SE = \frac{SD}{\sqrt{n}}$, $z = \frac{64.1 - 60}{10/\sqrt{20}} = 1.8336$

Page 57: Datascience Introduction WebSci Summer School 2014

Would we expect that? How extreme is this observation?
▪ If H0 is true (mean ≤ 60), in which area around the mean do 95% of all points lie?

Pick alpha level α = 0.05: that is the maximum probability of rejecting the null hypothesis when the null hypothesis is true

Right-tailed test: find the critical value for 0.45 using the z-distribution

If the z-score of our observed data exceeds this value we have to reject H0

1.8336 > 1.645: reject the null hypothesis
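The right-tailed z-test can be reproduced in Python with the standard library. This sketch assumes the sixteenth sample value is 49, which reproduces the stated sample mean of 64.1; the variable names are illustrative.

```python
import math
from statistics import NormalDist, fmean

# Sample of 20 users; the 16th value is assumed to be 49 (gives mean 64.1).
visits = [65, 73, 51, 67, 48, 80, 69, 53, 59, 62,
          71, 67, 64, 78, 65, 49, 80, 60, 51, 70]
mu0 = 60.0      # mean under H0
pop_sd = 10.0   # population SD, assumed known
alpha = 0.05

se = pop_sd / math.sqrt(len(visits))        # standard error of the mean
z = (fmean(visits) - mu0) / se              # observed z-score
critical = NormalDist().inv_cdf(1 - alpha)  # right-tail critical value
p_value = 1 - NormalDist().cdf(z)           # probability of a z at least this large under H0
reject_h0 = z > critical

print(round(z, 4), round(critical, 3), reject_h0)
```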

Page 58: Datascience Introduction WebSci Summer School 2014

Large effects, small samples:
▪ In small samples it is easy to overestimate an effect which might have happened by chance

Small effects, large samples:
▪ The smaller the effect you want to measure, the larger the sample size you need to show that it is significant!

Example: Assume a coin is biased: 10% heads and 90% tails
▪ Tossing the coin 10 times should be enough to convince people that the coin is biased.

Example: Assume a coin is biased: 51% heads and 49% tails
▪ The minimum sample size increases with decreasing effect size which one wants to demonstrate

Page 59: Datascience Introduction WebSci Summer School 2014

The more we analyze, the more we find by chance!

If you calculate the correlations between 10 variables (i.e., 45 different correlation coefficients), you should expect about 2 correlations to be significant at p < 0.05 by chance alone (one in every 20)

Corrections or adjustments for the total number of comparisons are needed!
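The arithmetic behind the multiple-comparisons problem can be made explicit in Python. In this sketch the Bonferroni adjustment is shown as one common choice of correction (it is not named on the slide), and the probability of at least one false positive assumes independent tests.

```python
import math

n_variables = 10
alpha = 0.05

# Number of distinct pairs of variables, i.e. correlation coefficients.
n_tests = math.comb(n_variables, 2)

# Expected number of "significant" results if no real correlation exists.
expected_false_positives = n_tests * alpha

# Probability of at least one false positive (ASSUMES independent tests).
p_at_least_one = 1 - (1 - alpha) ** n_tests

# Bonferroni: one common correction, test each comparison at alpha / n_tests.
bonferroni_alpha = alpha / n_tests

print(n_tests, expected_false_positives, round(p_at_least_one, 3), bonferroni_alpha)
```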

Page 60: Datascience Introduction WebSci Summer School 2014

Many tests such as the z-test, t-test and ANOVA make the normality assumption.

If the true population is very skewed (e.g., power law) the sampling distribution of the statistic will not be normal

Nonparametric methods like the sign test use, e.g., the median rather than the mean
▪ Hypothesis about the median of the true population (e.g., H1: median < 100, H0: median = 100)
▪ Count the number of measurements that favor the null hypothesis
▪ If H0 is true, half of the measurements should fall on each side.

Page 61: Datascience Introduction WebSci Summer School 2014

Data Collection Data Preprocessing Data Mining Data Analysis Statistical Inference Data Visualization Machine Learning Data Preservation

Page 62: Datascience Introduction WebSci Summer School 2014

Aim
▪ Find a function that describes the relation between X (e.g., bar visits) and Y (e.g., new friends)
▪ Given X, predict Y

Problem
▪ Infinite number of ways X and Y could be related

Idea
▪ Reduce the space of possible functions and start with the simplest one (linear relation)

$Y = b_0 + b_1 X$

Page 63: Datascience Introduction WebSci Summer School 2014

Y = 2 + 0.5 X

[Figure: plot of the line Y = 2 + 0.5 X, with X from 0 to 8 and Y from 0 to 6]

Page 64: Datascience Introduction WebSci Summer School 2014

Use gradient descent to minimize the cost function $C(b_0, b_1)$

$C(b_0, b_1) = \frac{1}{2N} \sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2$

$C(b_0, b_1) = \frac{1}{2N} \sum_{i=1}^{N} (Y_i - b_0 - b_1 X_i)^2$

Start with some guess for $b_0$, $b_1$

Keep changing $b_0$, $b_1$ to reduce $C(b_0, b_1)$ until we hopefully end up at a minimum

Page 65: Datascience Introduction WebSci Summer School 2014

$b_0 := b_0 - \alpha \frac{\partial}{\partial b_0} C(b_0, b_1)$

$b_1 := b_1 - \alpha \frac{\partial}{\partial b_1} C(b_0, b_1)$

Simultaneous updates of $b_0$ and $b_1$

α: learning rate

The derivative of the cost function informs us about the slope of the cost function

[Figure: cost function C(b) plotted against b]

Page 67: Datascience Introduction WebSci Summer School 2014

Residuals: deviation between the observed and the predicted values

Residual sum of squares: $\sum_{i=1}^{N} (y_i - \hat{y}_i)^2$

Is this a good measure?
▪ No, it depends on the number of observations N
▪ What if we multiply it by 1/N?

Page 68: Datascience Introduction WebSci Summer School 2014

$y_i$ … observed value
$\hat{y}_i$ … value predicted by the model
$\bar{y}$ … mean of observed data

Total variability in the outcome that needs to be explained

Unexplained variability! Residuals: difference between the observed value and the estimated value

Proportion of the total variability unexplained by the model
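These quantities combine into R-squared. The Python sketch below assumes the usual definition, R² = 1 − (unexplained variability / total variability); the names `ss_res` and `ss_tot` and the example values are illustrative.

```python
def r_squared(observed, predicted):
    """1 minus the proportion of total variability left unexplained by the model."""
    mean_observed = sum(observed) / len(observed)
    # Unexplained variability: squared residuals (observed minus predicted).
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    # Total variability: squared deviations of the observations from their mean.
    ss_tot = sum((y - mean_observed) ** 2 for y in observed)
    return 1 - ss_res / ss_tot


observed = [1.0, 2.0, 3.0, 4.0]   # hypothetical observed values
predicted = [1.5, 1.5, 3.5, 3.5]  # hypothetical model predictions

r2 = r_squared(observed, predicted)
print(r2)
```

Unlike the raw residual sum of squares, this ratio does not grow with the number of observations.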

Page 69: Datascience Introduction WebSci Summer School 2014

Dependent variable is binary (e.g., went to nightclub or not)

We can group users by number of new friends per year (20-25, 25-30, 30-35, etc.) and compute the proportion of people with high “nightclub-probability”

Page 70: Datascience Introduction WebSci Summer School 2014

Logistic Regression:

$\ln \frac{P(Y = 1)}{1 - P(Y = 1)} = b_0 + b_1 X + \epsilon$

Maximum Likelihood Estimator
▪ Estimate the unknown coefficients by maximizing the log-likelihood function

The coefficient is interpreted as the rate of change in the "log odds" as X changes

Page 71: Datascience Introduction WebSci Summer School 2014

Simple Example:
▪ You have a coin that you know is biased towards heads and you want to know what the probability of heads (p) is.
▪ We want to estimate the unknown parameter p!

Page 72: Datascience Introduction WebSci Summer School 2014

You flip the coin 10 times and the coin comes up heads 7 times. What's your best guess for p?

Page 73: Datascience Introduction WebSci Summer School 2014

The probability of observing 7 heads when tossing a coin 10 times is given by this binomial distribution:

$P(7 \text{ heads}) = \binom{10}{7} p^7 (1-p)^3 = \frac{10!}{7!\,3!} p^7 (1-p)^3$

Find the value for p that makes our data most likely!

Page 74: Datascience Introduction WebSci Summer School 2014

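$\text{Likelihood} = \binom{10}{7} p^7 (1-p)^3 = \frac{10!}{7!\,3!} p^7 (1-p)^3$

$\log \text{Likelihood} = \log \frac{10!}{7!\,3!} + 7 \log p + 3 \log (1-p)$

Derivative with respect to p:

$\frac{d}{dp} \log \text{Likelihood} = 0 + \frac{7}{p} - \frac{3}{1-p}$

* derivative of a constant is 0
* derivative of 7f(x) is 7f'(x)
* derivative of log x is 1/x

Set the derivative equal to 0 and solve for p:

$\frac{7}{p} - \frac{3}{1-p} = 0 \;\Rightarrow\; \frac{7(1-p) - 3p}{p(1-p)} = 0 \;\Rightarrow\; 7(1-p) = 3p \;\Rightarrow\; 7 - 7p = 3p \;\Rightarrow\; 7 = 10p \;\Rightarrow\; p = \frac{7}{10}$

web.stanford.edu/~kcobb/hrp261/lecture4.ppt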

Page 75: Datascience Introduction WebSci Summer School 2014

The likelihood of observing 7 heads when tossing a biased coin with p(heads) = 0.7 and p(tails) = 0.3 10 times is:

$\text{Value of the likelihood} = \binom{10}{7} (0.7)^7 (0.3)^3 = 120 \, (0.7)^7 (0.3)^3 = 0.267$
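The maximum likelihood estimate can also be found numerically. The Python sketch below evaluates the binomial likelihood on a grid of candidate values for p and keeps the best one; the grid resolution and the function name `binomial_likelihood` are assumptions made for illustration.

```python
import math


def binomial_likelihood(p, heads=7, flips=10):
    """Probability of observing `heads` heads in `flips` tosses if P(heads) = p."""
    return math.comb(flips, heads) * p ** heads * (1 - p) ** (flips - heads)


# Candidate values 0.001, 0.002, ..., 0.999 for the unknown parameter p.
grid = [i / 1000 for i in range(1, 1000)]

# The estimate is the candidate that makes the observed data most likely.
p_hat = max(grid, key=binomial_likelihood)

print(p_hat, round(binomial_likelihood(p_hat), 3))
```

The grid search agrees with the analytical solution p = 7/10 obtained by setting the derivative of the log-likelihood to zero.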

Page 76: Datascience Introduction WebSci Summer School 2014

Linear Regression (R-squared)

Logistic Regression (pseudo R-squared)


Page 77: Datascience Introduction WebSci Summer School 2014

You can “prove” anything with graphics

Data Collection Data Preprocessing Data Analysis Data Visualization Data Preservation

Page 78: Datascience Introduction WebSci Summer School 2014


Page 79: Datascience Introduction WebSci Summer School 2014

http://www.motherjones.com/kevin-drum/2012/01/lying-charts-global-warming-edition

Page 80: Datascience Introduction WebSci Summer School 2014

http://www.motherjones.com/kevin-drum/2012/01/lying-charts-global-warming-edition

Page 81: Datascience Introduction WebSci Summer School 2014

Be careful when drawing conclusions from graphs

Size of the effect shown in the graphic != size of the effect in the sample data != size of the effect in the true population
▪ Scale distortion (e.g., bar charts not starting at zero)
▪ Snapshot
▪ …

Page 82: Datascience Introduction WebSci Summer School 2014

Data Collection Data Preprocessing Data Analysis Data Visualization Data Preservation

Page 83: Datascience Introduction WebSci Summer School 2014

GESIS Data Archives & Data Centers
▪ Preserve research data and make them accessible for reuse.
▪ Competencies and infrastructure
  ▪ e.g. https://datorium.gesis.org/xmlui/

CESSDA:
▪ umbrella organisation for the European national data archives (http://www.cessda.net/)

Re3data
▪ browse data archives by topic: http://www.re3data.org/

DPC Digital Preservation Handbook: http://www.dpconline.org/advice/preservationhandbook

Page 84: Datascience Introduction WebSci Summer School 2014

Legal and regulatory framework, including open access and licenses

Incentives to share data
▪ Credentials? Citation principles under development (see e.g. http://www.datacite.org/).

Long-term preservation strategies
▪ Software and hardware changes, documentation, metadata and retrieval/access

Data preservation starts at an individual level
▪ Reasons for data loss are often on an individual level, e.g. broken hardware, researchers leaving a group.

Page 85: Datascience Introduction WebSci Summer School 2014

http://claudiawagner.info/teaching/WebSciSS2014/

Page 86: Datascience Introduction WebSci Summer School 2014

Vasant Dhar. Data Science and Prediction. In: Communications of the ACM, December 2013, Vol. 56, No. 12, pp. 64-73

Anand Rajaraman, Jeffrey Ullman, Jure Leskovec, Mining of Massive Datasets, Cambridge University Press (free download)

Jeffrey Stanton, Introduction to Data Science (free download)

Steffen Staab, Data Science Course, University Koblenz-Landau, https://www.uni-koblenz-landau.de/campus-koblenz/fb4/west/teaching/ss14/data-science/data-science1

Thom Baguley, Serious Stats