Algorithm Foundations of Data Science and Engineering Lecture 0: Course Introduction MING GAO DaSE @ ECNU (for course related communications) [email protected] Sep. 14, 2020
Outline
Textbooks and References
Requirements and Assessment
Office Hour and Contact Information
Overview of This Course
    What Is Data Science?
    Course Schedule
Take-aways
Required sources
• Ming Gao, Huiqi Hu, Lecture notes.
• John Hopcroft and Ravindran Kannan, Foundations of Data Science.
• Anand Rajaraman and Jeffrey D. Ullman, Mining of Massive Datasets.
References
• Daphne Koller and Nir Friedman, Probabilistic Graphical Models: Principles and Techniques.
• Gilbert Strang, Linear Algebra and Its Applications (Fourth Edition).
Requirements
1. Slides and lecture notes will be posted 1-2 days before each lecture, but
2. Students are expected to
    • take notes during the lecture
    • read the assigned readings before and after the lecture
    • think through the answers to the weekly tutorial (a set of questions) before the lecture
3. Implement a technique published in a top venue such as KDD, ICDM, SIGMOD, SIGIR or ACL (honestly and independently).
Assessment
Contact information
Lecturer: GAO Ming
• Office: Rm. East 115, Math. Building
• Phone: 6223 2061
• Mobile: 189 1694 3299
• Email: [email protected]
• TA: Tingting Liu and Lei Li
• Course homepage: http://dase.ecnu.edu.cn/mgao/teaching/DataSci_2020_Fall/ADS.html
• Research interests:
    - Data platform
    - Knowledge graph and knowledge engineering
    - Computational pedagogy
    - Streaming and social data mining
Data science and big data
• How should we understand big data?
    - Volume: Baidu and Google process about 100 PB and 20 PB of data per day, respectively; Alibaba and Tencent each hold more than 100 PB of data.
    - Velocity: the Large Hadron Collider generates petabytes of data within seconds; many sources are streams, such as clickstreams, logs, RFID and Twitter. Taobao handles almost 100,000 transactions per second during “Double 11”.
    - Variety: structured, semi-structured and unstructured data, including text, logs, video, voice and images.
    - Value: interests, behaviors, trustworthiness, preferences, etc.
• Fragmentation of information:
    - Telecom
    - E-commerce
    - Social media
    - Internet of Things (IoT)
    - · · ·
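To put the volume and velocity figures above in perspective, the following back-of-the-envelope sketch converts them into per-day and per-second rates. The decimal petabyte (10^15 bytes) and the helper names are assumptions made for illustration. The quoted figures are rough, and the peak transaction rate is of course not sustained for a whole day.

```python
# Back-of-the-envelope scale check for the figures quoted on this slide.
# Assumption: 1 PB = 10**15 bytes (decimal SI units).
SECONDS_PER_DAY = 24 * 60 * 60
PB = 10**15


def per_day(rate_per_second):
    """Total count if a per-second rate were sustained for one day."""
    return rate_per_second * SECONDS_PER_DAY


def sustained_bytes_per_second(bytes_per_day):
    """Average throughput needed to process a daily volume."""
    return bytes_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    peak_tx = per_day(100_000)                       # Taobao peak rate
    throughput = sustained_bytes_per_second(20 * PB)  # 20 PB per day
    print(f"{peak_tx:.2e} transactions/day at the quoted peak rate")
    print(f"{throughput / 10**9:.1f} GB/s sustained for 20 PB/day")
```

Even at these rough numbers, a single machine cannot keep up, which is one motivation for the streaming and sketching techniques listed later in the schedule.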
Birth of data science
• Reasons
    - Challenges of the 4 Vs
    - Hardware upgrades
    - Open-source software, including Hadoop, Spark and Storm
    - Applications such as e-commerce, the sharing economy, Industry 4.0, smart cities and intelligent education
Data importance
• Data has become an independent factor of production.
• In 2017, data was described as a new factor of production, a fundamental resource and a strategic resource in the age of the Internet economy.
• On April 9, 2020, data was listed as a new factor of production, alongside land, labor, capital and technology.
• Data is the foundational resource of the digital economy, yet it faces many challenges, such as data silos, the digital divide, data privacy and data security.
Data is power
What is data science?
Definition
Data science is an interdisciplinary field that continues several data analysis fields, such as mathematics, statistics, machine learning, data mining and parallel computing. It is similar to Knowledge Discovery in Databases (KDD).
Objective
Data science aims to:
• extract knowledge
• gain insight from data in various forms, either structured or unstructured
• help users understand massive data
DS co-evolution
• Data science was mentioned by John W. Tukey in 1962 (“The Future of Data Analysis”).
• Data science was defined by Peter Naur in 1974 (“Concise Survey of Computer Methods”).
• Many data mining approaches were proposed in the 1980s.
• In 1996, the International Federation of Classification Societies (IFCS) held a conference titled “Data Science, Classification, and Related Methods”.
• In June 2009, Nathan Yau published an article on the rise of data science.
• “Data Scientist: The Sexiest Job of the 21st Century” (Thomas H. Davenport and D. J. Patil, Harvard Business Review, Oct. 2012).
Types of data scientists
• Data developer: data acquisition, organization and management.
• Data researcher: statisticians, social scientists, computer scientists, etc.
• Data creative: experts in machine learning, data mining and programming; contributors to the open-source community.
• Data businessperson: project manager, Chief Data Officer (CDO).
• Mixed/generic type: deep understanding of the business, professional in technology, good at programming, etc.
Why do we need to learn this course?
Remarks
1. The most popular among the new options added in 2016 are K-nearest neighbors, PCA, Random Forests, Optimization, Neural Networks, Deep Learning and Singular Value Decomposition.
2. The biggest declines are in Association Rules, Statistics and Decision Trees.
Course features
Features
1. Algorithms for data science draw on many disciplines, such as data mining, machine learning, statistics, visualization, NLP, data management, optimization and algebra.
2. Data science tasks vary widely with the type of data involved.
Four paradigms of scientific research
• Experimental science
• Theoretical science
• Computational science
• Data science?
    - It was first proposed by Jim Gray (a database researcher) in a 2007 talk.
    - The Fourth Paradigm: Data-Intensive Scientific Discovery, edited by Tony Hey (a vice president at Microsoft) et al., was published in 2009.
    - Thus, the ability to process big data is important for scientific researchers.
The shortage of data scientists
Schedule
Background
DS overview
Probabilistic and Statistical Algorithms
• Probabilistic inequalities
• Hashing algorithms
• Sampling
• Sketches
• Random walks
• EM algorithm
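Several topics in this part of the schedule deal with data that arrives as a stream too large to store. As a small taste of the “Sampling” topic, here is a minimal sketch of reservoir sampling (often called Algorithm R), which keeps a uniform random sample of k items from a stream of unknown length. It is an illustration chosen for this overview and not necessarily the variant covered in the lecture notes; the function name and parameters are hypothetical.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return up to k items drawn uniformly at random from `stream`.

    Uses O(k) memory and a single pass, so the stream length need not
    be known in advance.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Pick a slot uniformly from 0..i (inclusive); the new item
            # survives with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Sample 5 "log lines" from a simulated stream of one million.
    print(reservoir_sample(range(1_000_000), 5, random.Random(42)))
```

Passing an explicit `random.Random` instance makes the sample reproducible, which is convenient when checking project results.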
Schedule
Linear Algebra
• Eigenvalue computation
• SVD and PCA
• Matrix factorization
Combinatorial Optimization
• Integer programming
• Submodular
• Community
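To give a flavor of the “Eigenvalue computation” topic, the sketch below implements power iteration, a classic way to approximate the dominant eigenvalue and eigenvector of a matrix. It is written in plain Python for readability and assumes a symmetric matrix with a unique largest-magnitude eigenvalue and a starting vector that is not orthogonal to its eigenvector. The function names are hypothetical, and the course may present a different method.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def mat_vec(a: Matrix, x: Vector) -> Vector:
    """Multiply matrix `a` by vector `x`."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def power_iteration(
    a: Matrix, iters: int = 200, tol: float = 1e-12
) -> Tuple[float, Vector]:
    """Approximate the dominant eigenpair of a square matrix `a`."""
    n = len(a)
    x = [1.0 / math.sqrt(n)] * n  # unit-norm starting vector
    eigenvalue = 0.0
    for _ in range(iters):
        y = mat_vec(a, x)
        norm = math.sqrt(sum(v * v for v in y))
        if norm == 0.0:
            break  # x was mapped to zero; no direction to follow
        x_next = [v / norm for v in y]
        # Rayleigh quotient x^T A x for the unit vector x_next.
        next_value = sum(
            xi * yi for xi, yi in zip(x_next, mat_vec(a, x_next))
        )
        converged = abs(next_value - eigenvalue) < tol
        x, eigenvalue = x_next, next_value
        if converged:
            break
    return eigenvalue, x


if __name__ == "__main__":
    # Symmetric matrix with eigenvalues 1, 2 and 4.
    a = [[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
    value, vector = power_iteration(a)
    print(value, vector)
```

Each step costs one matrix-vector product, which is why the same idea scales to the very large sparse matrices that arise in random-walk and PCA-style computations.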
A project to subvert the traditional way of learning
Randomized learning
Randomized learning is a new way of learning that is initiated by learners rather than driven by the purpose of teaching knowledge; knowledge is acquired while students browse emerging media.
• Classroom teaching is a learning method whose aim is knowledge transfer.
• Classroom teaching cannot avoid the tension between individualization and scale.
• Randomized learning, however, is a good way to resolve this tension:
    - It is the result of the students’ independent choice;
    - It can support thousands of students learning online at the same time;
    - The learning process can take place anytime and anywhere.
Details of the project
Start-up project
• System implementation
    - It resembles TikTok, but it is not TikTok;
    - System architecture;
    - Back-end development;
    - Front-end app;
    - DaSE is recruiting undergraduate students for this start-up project. If you are interested, don’t hesitate to contact me.
• Content production
    - Every student in this course needs to upload at least two videos to the system (this counts toward your regular coursework grade);
    - Each video must be at least 2 minutes long.
Take-aways
Course homepage: http://dase.ecnu.edu.cn/mgao/teaching/DataSci_2020_Fall/ADS.html
Advice for learning the algorithmic foundations of data science and engineering
• Not a reading course.
• More than a programming course, though it is project-heavy.
• No standard answers.