Page 1: Introduction to data science intro,ch(1,2,3)

Data Science

An emerging area of work concerned with the collection, preparation, analysis, visualization, management, and preservation of large collections of information.


Page 2: Introduction to data science intro,ch(1,2,3)

Web page

Much of the data in the world is non-numeric and unstructured.

Unstructured means that the data are not arranged in neat rows and columns. Think of a web page.
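The contrast between structured and unstructured data can be sketched in R. This is only an illustration: the names, ages, and the web-page fragment below are invented and do not come from the slides.

```r
# Structured data: neat rows and columns (values are hypothetical).
structured <- data.frame(
  name = c("Ann", "Bob"),
  age  = c(34, 27)
)

# Unstructured data: a fragment of a web page held as one string of text
# (the markup is an invented example).
webPage <- "<html><body><p>Ann is 34 and Bob is 27.</p></body></html>"

# The table can be queried by column right away.
meanAge <- mean(structured$age)

# The web page needs extra work before anything numeric can be used.
# Here a regular expression pulls out the runs of digits in the text.
numbersInPage <- as.numeric(regmatches(webPage, gregexpr("[0-9]+", webPage))[[1]])

print(meanAge)
print(numbersInPage)
```

The same two ages are present in both forms, but only the data frame says which number belongs to which person without further parsing.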


Page 3: Introduction to data science intro,ch(1,2,3)


Page 4: Introduction to data science intro,ch(1,2,3)

Data architecture

Data acquisition

Data analysis

Data archiving


Page 5: Introduction to data science intro,ch(1,2,3)

Data architect

Provides input on how the data need to be routed and organized to support the analysis, visualization, and presentation of the data to the appropriate people.


Page 6: Introduction to data science intro,ch(1,2,3)

Data acquisition

Focuses on how the data are collected and, importantly, how the data are represented prior to analysis and presentation.

Tool example: barcode

Different barcodes are used for the same product (for example, for different-sized boxes of cereal).
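The barcode point can be sketched in R: before analysis, several barcodes may need to be represented as one product. The barcode digits and product names below are hypothetical, chosen only for this illustration.

```r
# Hypothetical barcodes: two different codes for the same cereal in two
# box sizes (the digits are made up, not real product codes).
barcodeToProduct <- c(
  "0001112223334" = "Corn cereal",
  "0001112223341" = "Corn cereal",
  "0009998887776" = "Oat cereal"
)

# Scanned items arrive as barcodes, not as product names.
scanned <- c("0001112223334", "0001112223341", "0009998887776", "0001112223341")

# Represent each scan by its product before analysis.
products <- unname(barcodeToProduct[scanned])

# Count sales per product rather than per barcode.
salesByProduct <- table(products)
print(salesByProduct)
```

Counting by raw barcode would split the corn cereal across two rows, which is why the representation chosen at acquisition time matters.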


Page 7: Introduction to data science intro,ch(1,2,3)

Data analysis

Using portions of data (samples) to make inferences about the larger context, and visualizing the data by presenting it in tables, graphs, and even animations.
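A minimal R sketch of this idea follows. The "population" is simulated, and the sample size and age bands are assumptions made for the example.

```r
set.seed(42)  # fixed seed so the sketch is repeatable

# A hypothetical "larger context": ages of 1,000 people (simulated values).
population <- round(runif(1000, min = 18, max = 80))

# Take a portion of the data (a sample) of 50 people.
mySample <- sample(population, size = 50)

# Use the sample to make an inference about the larger context.
sampleMean     <- mean(mySample)
populationMean <- mean(population)

# Present the sample as a small table of counts per age band.
ageBands <- cut(mySample, breaks = c(17, 40, 60, 80))
print(table(ageBands))
print(c(sample = sampleMean, population = populationMean))
```

The sample mean will usually be close to the population mean but not identical; that gap is the uncertainty that comes with inference from samples.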


Page 8: Introduction to data science intro,ch(1,2,3)

Data archiving

Preservation of collected data in a form that makes it highly reusable. This "data curation" is a difficult challenge because it is so hard to anticipate all of the future uses of the data.

Example (Twitter):

Geocodes: data that show the geographical location from which a tweet was sent could be a useful element to store with the data.
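As a rough illustration, the R sketch below stores geocodes alongside tweet text and archives both in a plain CSV file. The tweets, coordinates, and column names are invented; real Twitter data would have a different layout.

```r
# Hypothetical tweets; the text and coordinates are invented for illustration.
tweets <- data.frame(
  text      = c("Good morning!", "Traffic is heavy today"),
  latitude  = c(40.71, 51.51),
  longitude = c(-74.01, -0.13),
  stringsAsFactors = FALSE
)

# Archive the data in a plain, widely readable format (CSV).
archiveFile <- tempfile(fileext = ".csv")
write.csv(tweets, archiveFile, row.names = FALSE)

# A later user can reload the archive and still find the geocodes.
restored <- read.csv(archiveFile, stringsAsFactors = FALSE)
print(restored)
```

Had the latitude and longitude columns been dropped at archiving time, no later analysis could recover where the tweets were sent from.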


Page 9: Introduction to data science intro,ch(1,2,3)

Learning the application domain

Communicating with data users

Seeing the big picture of a complex system

Knowing how data can be represented: metadata

Data transformation and analysis

Visualization and presentation

Attention to quality

Ethical reasoning: privacy

Page 10: Introduction to data science intro,ch(1,2,3)

About Data

Data comes from the Latin word "datum," meaning a "thing given."


Page 11: Introduction to data science intro,ch(1,2,3)


Page 12: Introduction to data science intro,ch(1,2,3)

“The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point”

CLAUDE SHANNON

Yes: 1

No: 0

Maybe: 01

ASCII
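The idea of turning a message into bits and reproducing it elsewhere can be tried in R. The sketch below uses base R functions only; the helper name toBits is made up for this example.

```r
# Encode a short message as ASCII codes, then as bits.
msg <- "yes"

# Each character becomes a number (its ASCII code).
codes <- utf8ToInt(msg)

# Helper (a name chosen for this sketch): the 8-bit pattern for one code.
toBits <- function(code) {
  bits <- as.integer(intToBits(code))[1:8]  # least significant bit first
  paste(rev(bits), collapse = "")           # most significant bit first
}
bitStrings <- vapply(codes, toBits, character(1))

# Reproduce the message at the "other point" from the codes alone.
decoded <- intToUtf8(codes)

print(codes)       # 121 101 115
print(bitStrings)  # "01111001" "01100101" "01110011"
print(decoded)     # "yes"
```

Only the numbers need to travel between the two points; as long as both sides agree on ASCII, the receiver can rebuild the original text exactly.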


Page 13: Introduction to data science intro,ch(1,2,3)

Identifying Data Problems

Data science is an applied activity, and data scientists serve the needs and solve the problems of data users.

Hint:

The data scientist may never actually become a farmer, but if you are going to identify a data problem that a farmer has, you have to learn to think like a farmer, to some degree.

3 questions:

Ask subject matter experts.

Ask about anomalies.

Ask about risks and uncertainty.


Page 14: Introduction to data science intro,ch(1,2,3)

Introduction to R

"R" is an open source software program.

R is an integrated suite of software facilities for data manipulation, calculation, and graphical display. Among other things, it has:

an effective data handling and storage facility,

a suite of operators for calculations on arrays, in particular matrices,

a large, coherent, integrated collection of intermediate tools for data analysis,

graphical facilities for data analysis and display either directly at the computer or on hardcopy.
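Two of the facilities listed above, matrix operators and graphical display, can be sketched with base R. The matrix values and the choice of a PDF file as the "hardcopy" are assumptions made for this example.

```r
# A 2 x 2 matrix, filled column by column.
m <- matrix(1:4, nrow = 2)

doubled    <- m * 2      # element-wise arithmetic
transposed <- t(m)       # swap rows and columns
product    <- m %*% m    # matrix multiplication

print(product)

# Graphical display: draw a bar chart of the column sums into a PDF file
# (a temporary file is used so the sketch leaves nothing behind).
plotFile <- tempfile(fileext = ".pdf")
pdf(plotFile)
barplot(colSums(m), names.arg = c("col 1", "col 2"))
invisible(dev.off())
```

Leaving out the pdf() and dev.off() calls would draw the same chart directly on screen in an interactive session.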


Page 15: Introduction to data science intro,ch(1,2,3)

Additional Pros:

R was among the first analysis programs to integrate capabilities for drawing data directly from the Twitter® social media platform.

The extensibility of R means that new modules are being added all the time by volunteers.

The lessons one learns in working with R are almost universally applicable to other programs and environments.


Page 16: Introduction to data science intro,ch(1,2,3)

Cons:

R is "command line" oriented.

R is not especially good at giving feedback or error messages.


Page 17: Introduction to data science intro,ch(1,2,3)

How to write text:

myText <- "this is a piece of text"

Create a data set:

myFamilyAges <- c(43, 42, 12, 8, 5)

c(): Concatenates data elements together

Assignment arrow: <-

Some mathematical functions:

sum(): Adds data elements

range(): Min value and max value

mean(): The average
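Putting the commands from this slide together gives a short runnable session. The variable names total, ageSpan, and average are added here for illustration; everything else is taken from the slide.

```r
# Store a piece of text and a small data set, as on the slide.
myText <- "this is a piece of text"
myFamilyAges <- c(43, 42, 12, 8, 5)

# Apply the three functions to the ages.
total   <- sum(myFamilyAges)    # adds the data elements: 110
ageSpan <- range(myFamilyAges)  # min and max values: 5 43
average <- mean(myFamilyAges)   # the average: 22

print(myText)
print(total)
print(ageSpan)
print(average)
```

Typing a variable name on its own at the R prompt has the same effect as print() in an interactive session.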


Page 18: Introduction to data science intro,ch(1,2,3)
