Algorithm Foundations of Data Science and Engineering Lecture 0: Course Introduction MING GAO DaSE @ ECNU (for course related communications) [email protected] Sep. 14, 2020
Algorithm Foundations of Data Science and Engineering
dase.ecnu.edu.cn/mgao/teaching/DataSci_2020_Fall/slides/...
Overview of This Course · What Is Data Science? · Course Schedule · Take-aways

Algorithm Foundations of Data Science and Engineering

Lecture 0: Course Introduction

MING GAO

DaSE @ ECNU (for course related communications)

[email protected]

Sep. 14, 2020

Outline

Textbooks and References

Requirements and Assessment

Office Hour and Contact Information

Overview of This Course

What Is Data Science?

Course Schedule

Take-aways

Required sources

- Ming Gao, Huiqi Hu, Lecture notes.

- John Hopcroft and Ravindran Kannan, Foundations of Data Science.

- Anand Rajaraman and Jeffrey D. Ullman, Mining of Massive Datasets.

References

- Daphne Koller and Nir Friedman, Probabilistic Graphical Models: Principles and Techniques.

- Gilbert Strang, Linear Algebra and Its Applications (Fourth Edition).

Requirements

1. Slides and lecture notes will be posted 1-2 days before each lecture.

2. Even so, students are expected to

  - take notes during the lecture
  - read the assigned readings before and after the lecture
  - think through the answers to the weekly tutorial (a set of questions) before the lecture

3. Implement a technique published in a top venue, such as KDD, ICDM, SIGMOD, SIGIR, or ACL (honestly and independently)

Assessment

Contact information

Lecturer: GAO Ming

- Office: Rm. East 115, Math. Building

- Phone: 6223 2061

- Mobile: 189 1694 3299

- Email: [email protected]

- TA: Tingting Liu and Lei Li

- Course homepage: http://dase.ecnu.edu.cn/mgao/teaching/DataSci_2020_Fall/ADS.html

- Research interests:
  - Data platform
  - Knowledge graph and knowledge engineering
  - Computational pedagogy
  - Streaming and social data mining

Data science and big data

- How to understand big data?
  - Volume: Baidu and Google process 100 PB and 20 PB of data per day, respectively; Alibaba and Tencent hold more than 100 PB of data.
  - Velocity: the Large Hadron Collider generates petabytes of data within seconds; many sources are streams, such as clickstreams, logs, RFID, and Twitter. Taobao handles almost 100,000 transactions per second during “Double 11”.
  - Variety: structured, semi-structured, and unstructured data, including text, logs, video, voice, and images.
  - Value: interests, behaviors, trustworthiness, preferences, etc.

- Fragmentation of information:
  - Telecom
  - E-commerce
  - Social media
  - Internet of things (IoT)
  - ...

Birth of data science

- Reasons
  - Challenges of the 4Vs
  - Hardware upgrades
  - Open-source software, including Hadoop, Spark, Storm, and so on
  - Applications, such as e-commerce, the sharing economy, Industry 4.0, smart cities, and intelligent education

Data importance

- Data has become an independent factor of production.

- In 2017, in the age of the Internet economy, data is a new factor of production, a fundamental resource, and a strategic resource.

- On April 9, 2020, data became a new factor of production, just like land, labor, capital, and technology.

- Data is the foundational resource of the digital economy, and it faces many challenges, such as data silos, the digital divide, data privacy, and data security.

Data is power

What is data science?

Definition

Data science is an interdisciplinary field that continues several data analysis fields, such as mathematics, statistics, machine learning, data mining, and parallel computing. It is similar to Knowledge Discovery in Databases (KDD).

Objective

Data science aims to:

- extract knowledge and insight from data in various forms, either structured or unstructured

- help users understand massive data

DS co-evolution

- Data science was mentioned by John W. Tukey in 1962 (“The Future of Data Analysis”).

- Data science was defined by Peter Naur in 1974 (“Concise Survey of Computer Methods”).

- Many data mining approaches were proposed in the 1980s.

- In 1996, the International Federation of Classification Societies held a conference titled “Data Science, Classification and Related Methods”.

- In June 2009, Nathan Yau published a paper about the rise of data science.

- Data scientist is the sexiest job of the 21st century (Hal Varian, Sep. 2012).

Types of data scientists

- Data developer: data acquisition, organization, and management.

- Data researcher: statisticians, social scientists, computer scientists, etc.

- Data creative: experts in machine learning, data mining, programming, etc., and contributors to open-source communities.

- Data businessman: project manager, Chief Data Officer (CDO).

- Mixed/generic type: a deep understanding of business, professional in technology, good at programming, etc.

Why do we need to learn this course?

Remarks

1. The most popular among the new options added in 2016 are K-nearest neighbors, PCA, Random Forests, Optimization, Neural networks, Deep Learning, and Singular Value Decomposition.

2. The biggest declines are Association rules, Statistics, and Decision Trees.

Course features

Features

1. Algorithms for data science involve many disciplines, such as data mining, machine learning, statistics, visualization, NLP, data management, optimization, and algebra.

2. Tasks in data science problems vary with the type of data involved.

Four paradigms of scientific research

- Experimental science

- Theoretical science

- Computational science

- Data science?
  - It was first proposed by Jim Gray (a database researcher) in 2009.
  - The Fourth Paradigm: Data-Intensive Scientific Discovery was written by Tony Hey (vice president of Microsoft) et al. in 2009.
  - Thus, the capability for big data processing is important to scientific researchers.

The shortage of data scientists

Schedule

Background

DS overview

Probabilistic and statistical algorithms

- Probabilistic inequality

- Hashing algorithm

- Sampling

- Sketch

- Random Walk

- EM algorithm
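
As a small taste of the "Sampling" topic in the list above, here is a minimal sketch of reservoir sampling, which draws k items uniformly at random from a stream whose length is not known in advance. It is an illustration only and is not taken from the course materials; the function name `reservoir_sample` and its parameters are assumptions made for this example.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return up to k items drawn uniformly at random from an iterable.

    Hypothetical helper for illustration; only one pass over the
    stream is needed and only k items are kept in memory.
    """
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item survives with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Draw 5 items from a stream of 10,000 integers with a fixed seed.
    print(reservoir_sample(range(10_000), k=5, rng=random.Random(0)))
```

If the stream holds fewer than k items, this sketch simply returns all of them in their original order.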

Schedule

Linear Algebra

- Eigenvalue computation

- SVD and PCA

- Matrix factorization

Combinatorial Optimization

- Integer programming

- Submodular

- Community
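
To make the "Eigenvalue computation" item above more concrete, here is a minimal sketch of power iteration, one standard way to approximate the largest-magnitude eigenvalue of a matrix. It is an illustration under stated assumptions rather than the method taught in the course: the matrix is assumed to be symmetric with a unique dominant eigenvalue, the starting vector must not be orthogonal to the dominant eigenvector, and the name `power_iteration` and its parameters are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def power_iteration(matrix, num_iters=200, tol=1e-12):
    """Approximate the dominant eigenvalue and eigenvector.

    Hypothetical helper for illustration; assumes a symmetric matrix
    whose dominant eigenvalue is unique.
    """
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n  # unit-length starting vector
    eigenvalue = 0.0
    for _ in range(num_iters):
        w = matvec(matrix, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0, v  # v lies in the null space of the matrix
        w = [x / norm for x in w]
        # The Rayleigh quotient w^T A w estimates the eigenvalue.
        estimate = sum(a * b for a, b in zip(w, matvec(matrix, w)))
        converged = abs(estimate - eigenvalue) < tol
        v, eigenvalue = w, estimate
        if converged:
            break
    return eigenvalue, v


if __name__ == "__main__":
    lam, vec = power_iteration([[2.0, 1.0], [1.0, 3.0]])
    print(lam, vec)
```

For the 2-by-2 matrix in the usage example, the exact dominant eigenvalue is (5 + sqrt(5)) / 2, roughly 3.618, so the printed estimate should be close to that value.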

A project to subvert the traditional learning manner

Randomized learning

Randomized learning is a new, learner-initiated way of learning that does not set out to teach knowledge; instead, students pick up knowledge while browsing emerging media.

- Classroom teaching is a learning method whose aim is knowledge transfer.

- Classroom teaching cannot avoid the tension between individualization and scale.

- Randomized learning, however, is a good way to resolve this tension:
  - It is the result of the students’ independent choice;
  - It can support thousands of students learning online at the same time;
  - The learning process can take place anytime and anywhere.

Details of the project

Start-up project

- System implementation
  - It is similar to TikTok, but it is not TikTok;
  - System architecture;
  - Back-end development;
  - Front-end app;
  - DaSE is recruiting undergraduate students for this start-up project. If you are interested, don’t hesitate to contact me.

- Content production
  - Every student in this course needs to upload at least two videos to the system (counted toward your regular coursework grade);
  - Each video must be at least 2 minutes long.

Take-aways

Course homepage: http://dase.ecnu.edu.cn/mgao/teaching/DataSci_2020_Fall/ADS.html

Advice for learning the algorithm foundations of data science and engineering

- Not a reading course.

- More than a programming course, though it is project-heavy.

- No standard answers.
