Page 1: Math in data

Math & Data Science

Dr June Andrews

July 29, 2015

Dr June Andrews Math & Data Science July 29, 2015 1 / 59

Page 2: Math in data

Table of contents

1 Data Science
    Origins
    People
    Work

2 Math Behind Data Science
    Experimentation
    Growth Normalization
    If Time

Page 3: Math in data

First Data Science Job Rec

Be challenged at LinkedIn. We’re looking for superb analytical minds of all levels to expand our small team that will build some of the most innovative products at LinkedIn.

No specific technical skills are required (we’ll help you learn SQL, Python, and R). You should be extremely intelligent, have quantitative background, and be able to learn quickly and work independently. This is the perfect job for someone who’s really smart, driven, and extremely skilled at creatively solving problems. You’ll learn statistics, data mining, programming, and product design, but you’ve gotta start with what we can’t teach - intellectual sharpness and creativity.

Figure: LinkedIn Job Posting April 2008

Page 4: Math in data

Latest Data Science Job Rec

Data Scientist – Growth Analytics at LinkedIn

Data Scientists on our team partner with product managers, engineers and a cross-functional team to drive LinkedIn membership growth and connectivity. We inform product strategy and product decisions by:

Extracting and analyzing LinkedIn data to derive actionable insights.

Formulating success metrics for completely novel products and creating dashboards/reports to monitor them.

Designing and analyzing experiments to test new product ideas.

Developing models and data-driven solutions that add material lift to principal performance metrics.

LinkedIn member data is amazingly rich and provides a fantastic opportunity for Data Scientists to explore and create, ultimately developing ways for members to improve their professional lives. You’ll have the opportunity to work with some of the best data people anywhere in an environment which truly values data-driven decisions.

Required qualifications include:

BS/MS in a quantitative discipline: Statistics, Applied Mathematics, Operations Research, Computer Science, Engineering, Economics, etc.

1+ years experience working with large amounts of real data with SQL (Teradata, Oracle, or MySQL) and R, or other statistical package.

1+ years work experience programming in Java or Python - Pig experience desired.

Proficiency in a Unix/Linux environment for automating processes with shell scripting.

Able to translate business objectives into actionable analyses.

Able to communicate findings clearly to both technical and non-technical audiences

Preferred Qualifications include:

Experience with Consumer Internet products.

Knowledge in one of the following areas is a strong plus: Viral Growth mechanisms, user acquisition in International markets, Search Engine Optimization (SEO)

Expertise in applied statistics, understanding of controlled experiments.

Figure: LinkedIn Job Posting July 2015

Page 5: Math in data

Latest Data Science Job Rec - Applicants

Figure: Applicants now have SQL, Python, and R. 702 applicants in 5 months.

Page 6: Math in data

Trend is to Demand More

Definition (Data Science as a Victim of Success)

When use of a skill demonstrates improvements in support and innovation, it is added to the next job rec.

Rule of thumb when hiring: does your favorite colleague pass your interview?

Page 7: Math in data

Goals

Invariant

Use data to support colleagues: marketing, finance, engineering, ...

Use data to innovate: products, strategies, performance, ...

Cherry on Top

Do what it takes to drive company success.

Page 8: Math in data

Progress

1 Data Science
    Origins
    People
    Work

2 Math Behind Data Science
    Experimentation
    Growth Normalization
    If Time

Page 9: Math in data

LinkedIn Data

Page 10: Math in data

Source of 125k Data Professionals

Figure: Incredibly diverse.

Page 11: Math in data

Data Professionals on LinkedIn

> 2k degree fields (after standardization)

16% are Unique Degrees:

Oral Surgery

Phytopathology

Wedding Planning

Ground Transportation

Library Sciences

Turfgrass Management

Embryology

Fire Fighting

Stagecraft

Art Conservation

Page 12: Math in data

Data Science Homogenization Trend

Page 13: Math in data

Uneven Growth of Top 10 Backgrounds

Page 14: Math in data

Uneven Growth of Top 10 Backgrounds

Figure: Increased recruitment of economists and statisticians.

Page 15: Math in data

Destinations of Data Professionals

Page 16: Math in data

Industry Diversification of Data Professionals

Page 17: Math in data

Uneven Growth of Top 10 Industries

Page 18: Math in data

Trends

Homogenization of Sources of Data Professionals

Diversification of Industry Destinations of Data Professionals

Page 19: Math in data

Progress

1 Data Science
    Origins
    People
    Work

2 Math Behind Data Science
    Experimentation
    Growth Normalization
    If Time

Page 20: Math in data

Product Cycle

Figure: The mix of work data scientists do on a daily basis depends on the stage of the product life cycle.

Page 21: Math in data

Content

Ask - Make content go big.

Page 22: Math in data

Connection Network

Figure: Content spreads along existing connection network.

Page 23: Math in data

Follow Network

Figure: Change the game. Increase readership and visibility via follows.

Page 24: Math in data

Product Cycle - Follow Network

Stage | Work | Time
Ideation | Explore how to make content go big. Follows. | 2 weeks
Design & Spec | Define a Follow for security, PR, marketing, all teams possibly affected. | 3 weeks
Development | Database engineering, rollback safe, experimental framework. | 6 months
Test & Iterate | Slow release experiment. | 3 months
Release | Clean up code, outline fast follows. | 1 month

Table: Follow Network, slow and steady development cycle.

Page 25: Math in data

Types of Work

Area of Data | Goal
Analyze | Understand
Visualize | Communicate
Business Decisions | Orchestrate Action
Prototype Product | Demonstrate Usefulness
Refine Product | Maximize Usefulness
Design Experiment | Measure Changes
Analyze Experiment | Learn
Log | Save Everything
Process | Make Data Useable
Load to Server/DB | Make Data Accessible

Table: General data science stack.

Page 26: Math in data

Who does What

Figure: Depth v. breadth of different fields.

Page 27: Math in data

Skills of Data Professionals

Languages | Tools | Hard Skills | Soft Skills
SQL | Microsoft (Office, Excel, SQL, Visio) | Research | Management
Java | Oracle | Statistics | Leadership
Matlab | SAS | ETL | Process Improvement
Javascript | SharePoint | Data Modeling | Customer Service
R | SAP | Software Dev | Software Docs
Python | Cisco | Data Mining | Strategy
C++ | Salesforce | Forecasting | Public Speaking
XML | Six Sigma | Database Design | Team Leadership

Table: From LinkedIn’s 125k Data Professionals.

Page 28: Math in data

Network Product Development

1 Data Science
    Origins
    People
    Work

2 Math Behind Data Science
    Experimentation
    Growth Normalization
    If Time

Page 29: Math in data

Traditional A/B Testing

Figure: Traditional A/B testing. [Salesforce]

High Level

Randomly divides users into two groups for different treatments.
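A minimal sketch of the random split, assuming a hash-based assignment so that a user always lands in the same group. The function name `assign_cohort`, the experiment label, and the user ids are hypothetical and only serve the illustration; the slides do not say how the split is implemented.

```python
import hashlib


def assign_cohort(user_id: str, experiment: str = "demo-experiment") -> str:
    """Deterministically place a user in cohort 'A' or 'B'.

    Hashing the experiment name together with the user id gives a
    pseudo-random but repeatable 50/50 split (an assumed design).
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"


# Hypothetical user ids, used only to check the split is roughly even.
users = [f"user-{i}" for i in range(10_000)]
cohorts = {user: assign_cohort(user) for user in users}
share_a = sum(1 for cohort in cohorts.values() if cohort == "A") / len(users)
print(f"share of users in A: {share_a:.3f}")
```

Keying the hash on the experiment name means a different experiment would reshuffle the same users into new groups.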

Page 30: Math in data

Social Influence

Figure: Users can communicate experiences in social networks.

Cross Over

Tests of interaction features such as messaging, connections, and profile views inherently involve cross-cohort communication.

Page 31: Math in data

Elegant Solution

Figure: See geographical bounds. [Ugander et al]

High Level

Partition the network into groups with relatively little communication between them.
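One rough way to judge a candidate partition is the share of connections that cross group boundaries. The sketch below uses a made-up six-node network and two hypothetical partitions; it only measures a given partition and is not the partitioning method of [Ugander et al].

```python
def cross_group_share(edges, group_of):
    """Fraction of edges whose two endpoints sit in different groups."""
    if not edges:
        return 0.0
    crossing = sum(1 for u, v in edges if group_of[u] != group_of[v])
    return crossing / len(edges)


# Toy network: two triangles joined by a single edge (illustrative data).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("c", "d")]

# Partition that follows the two communities.
by_community = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
# Partition that ignores the network structure.
interleaved = {"a": 0, "b": 1, "c": 0, "d": 1, "e": 0, "f": 1}

print(cross_group_share(edges, by_community))
print(cross_group_share(edges, interleaved))
```

The lower the crossing share, the less one treatment group can leak its experience into the other.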

Page 32: Math in data

Elegant Solution

Downside

The elegant solution is costly to implement and assign. Only a limited number of experiments can run simultaneously.

Cohort | Actual Performance | Observed Performance | Observed Diff
A | x | z |
B | y | c · z | c − 1

Table: What exists and what is observed. With 2 equations and 3 variables, we can compute an upper bound for x / y.

Page 33: Math in data

Elegant v. Brute Force Tradeoff

Bound

Actual impact a is bounded by observed impact c and viral coefficient V:

$$a = \frac{c - V}{1 - cV}$$

Figure: Small impact for low viral products. [Andrews]
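The bound can be evaluated directly. The sketch below transcribes the formula a = (c − V) / (1 − cV) from the slide; the sample value c = 1.05 and the range of V are made up for illustration, and the guard against cV = 1 is an added assumption.

```python
def actual_impact(c: float, v: float) -> float:
    """Evaluate a = (c - V) / (1 - c * V) as written on the slide."""
    denominator = 1.0 - c * v
    if denominator == 0:
        raise ValueError("c * V must not equal 1")
    return (c - v) / denominator


# Hypothetical observed impact, with a few hypothetical viral coefficients.
observed = 1.05
for viral in (0.0, 0.01, 0.05, 0.10):
    print(f"V={viral:.2f}  a={actual_impact(observed, viral):.4f}")
```

With V = 0 the formula returns the observed impact unchanged, and for small V the correction stays small, which is what the figure caption describes.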

Page 34: Math in data

Alternative Brute Force

Control Interactions

Split on the interaction at the cost of an inconsistent user experience. The benefit is that we can test the impact of sending or receiving.

Sender / Receiver | A | B
A | Treatment | Control
B | Control | Control
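The sender/receiver table can be read as a lookup on each interaction. A minimal sketch, assuming cohort labels "A" and "B" are already assigned to users; the function name `interaction_arm` is hypothetical.

```python
def interaction_arm(sender_cohort: str, receiver_cohort: str) -> str:
    """Return the arm for one interaction, following the table:
    only A-to-A interactions get the treatment."""
    if sender_cohort == "A" and receiver_cohort == "A":
        return "treatment"
    return "control"


# Print the full 2 x 2 grid to compare against the table.
for sender in ("A", "B"):
    for receiver in ("A", "B"):
        print(sender, "->", receiver, ":", interaction_arm(sender, receiver))
```

Because the arm depends on both parties, the same user can see the treatment in one conversation and the control in another, which is the inconsistent experience the slide mentions.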

Page 35: Math in data

Progress

1 Data Science
    Origins
    People
    Work

2 Math Behind Data Science
    Experimentation
    Growth Normalization
    If Time

Page 36: Math in data

Health Care’s Relations with other Industries

Figure: Since 2008 Health Care has increased relationships with Recruiters.

Page 37: Math in data

Not so Fast

Figure: Growth of relationships is dominated by LinkedIn’s growth.

Page 38: Math in data

Confounding or Masking Variables

Control Confounding Variables

Data quality and growth can dominate underlying trends.

LinkedIn’s Network Growth is massive and diverse

Venture Capitalists and Recruiters are hyperconnectors

Figure: Stan Lee

Page 39: Math in data

Control for Growth and Behavioral Variables

Approach

Hold constant the number of users in each industry and the number of connections each user has. Then break the existing connections and reconnect them at random.

Figure: Break edges and reconnect randomly.
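A small simulation of the break-and-reconnect idea, assuming a stub-matching scheme: every edge is cut into two half-edges, the half-edges are shuffled, and consecutive pairs are rejoined. The toy users, industries, and function names are hypothetical, and this simple version can produce self-loops or duplicate edges, which a careful analysis would need to handle.

```python
import random
from collections import Counter


def rewire(edges, seed=0):
    """Break every edge into two stubs and re-pair the stubs at random.

    Each node keeps its original number of connections.
    """
    rng = random.Random(seed)
    stubs = [node for edge in edges for node in edge]
    rng.shuffle(stubs)
    return list(zip(stubs[0::2], stubs[1::2]))


def degrees(edges):
    """Number of edge endpoints attached to each node."""
    return Counter(node for edge in edges for node in edge)


def industry_pair_counts(edges, industry):
    """Count edges per unordered pair of industries."""
    return Counter(tuple(sorted((industry[u], industry[v]))) for u, v in edges)


# Toy data (illustrative only).
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
industry = {"a": "Health Care", "b": "Health Care", "c": "Recruiting",
            "d": "Recruiting", "e": "Construction"}

shuffled = rewire(edges, seed=42)
print(industry_pair_counts(edges, industry))
print(industry_pair_counts(shuffled, industry))
```

Averaging the pair counts over many seeds gives an empirical baseline for how many connections two industries would share by chance alone.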

Page 40: Math in data

Expected Connections

Closed Form Solution

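Reducible to pulling red and blue balls from a bag without replacement. The solution is the expectation of the Hypergeometric distribution.

$$E\left[\lVert \mathrm{Edges}(\text{Health Care}, I) \rVert\right] = \frac{\lVert \mathrm{Edges}(\text{Health Care}) \rVert \, \lVert \mathrm{Edges}(I) \rVert}{\sum_{i,j} \lVert \mathrm{Edges}(i,j) \rVert - \lVert \mathrm{Edges}(\text{Health Care}) \rVert}$$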
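The closed form is cheap to evaluate. The sketch below transcribes the expression from the slide; the edge counts are invented, and the final ratio of observed to expected connections is an assumed way to compare industries, not something the slides state.

```python
def expected_edges(edges_health_care: float, edges_industry: float,
                   total_edges: float) -> float:
    """Expected Health Care to industry I edges under random rewiring.

    Implements |E(HC)| * |E(I)| / (sum_ij |E(i, j)| - |E(HC)|).
    """
    denominator = total_edges - edges_health_care
    if denominator <= 0:
        raise ValueError("total edges must exceed Health Care edges")
    return edges_health_care * edges_industry / denominator


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
expected = expected_edges(edges_health_care=100, edges_industry=50,
                          total_edges=1100)
observed = 12  # hypothetical observed number of connections
print(expected)             # 100 * 50 / (1100 - 100)
print(observed / expected)  # assumed comparison: above 1 means more than chance
```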

Page 41: Math in data

Expected Connections

Figure: Given growth and behavioral patterns, we expect some industries to have a dramatic number of connections to health care professionals.

Page 42: Math in data

Significant Relations with Health Care Appear

Figure: Venture Capitalists and Recruiters are no longer in the top rankings.

Page 43: Math in data

Significant Relations with Health Care Appear

Relations Now Reflect the Larger Economy

City programs have increased in-home and preventative care

Many hospitals are named after Saints and affiliated with Religious Denominations

Medical Devices and Pharmaceuticals have always had a strong connection to Health Care

Figure: Industries with Significant Connections to Health Care

Page 44: Math in data

Significant Relations with Realtors

Figure: Period of dramatic growth for real estate

Page 45: Math in data

Significant Relations with Realtors

Figure: Period of economic change

Page 46: Math in data

Significant Relations with Construction

Figure: Symmetric relationship between real estate and construction. Construction workers migrate between real estate and oil and mining.

Page 47: Math in data

Industry Migration - Mechanics

Figure: Construction workers connecting with Oil & Mining over Real Estate

Page 48: Math in data

Industry Migration - Mechanics

How?

Is migration prompted by influential people?

Is migration independent pockets of movement?

Page 49: Math in data

Industry Migration - Cascades

Figure: A median of 4 neighbors had migrated before a user converted.

Page 50: Math in data

Industry Migration - Mechanics

Figure: Size of bubble is proportional to size of complete cascade.

How?

Migration is largely independent, with some cascades.

Page 51: Math in data

Wrap

1 Data Science
    Origins
    People
    Work

2 Math Behind Data Science
    Experimentation
    Growth Normalization
    If Time

Page 52: Math in data

Data & Computing Growth

Figure: Data growth is exponential. A rule of thumb is that it doubles every 4-8 months.
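To put the doubling rule in yearly terms, a quantity that doubles every T months grows by a factor of 2^(12 / T) in a year. A short worked sketch; the function name `growth_factor` is hypothetical.

```python
def growth_factor(months: float, doubling_months: float) -> float:
    """Multiplicative growth after `months` given a fixed doubling time."""
    return 2.0 ** (months / doubling_months)


# The two ends of the 4-8 month rule of thumb, over one year.
fast = growth_factor(12, 4)  # doubling every 4 months
slow = growth_factor(12, 8)  # doubling every 8 months
print(f"doubling every 4 months: x{fast:.2f} per year")
print(f"doubling every 8 months: x{slow:.2f} per year")
```

Under this rule the data grows roughly 2.8x to 8x per year.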

Page 53: Math in data

Linearity Wins

Figure: Linear algorithms are fast, predictable, and complete.

Page 54: Math in data

Takes a Village - Thank You!

Figure: Every project involved at least 3 people.

Page 55: Math in data

Progress

1 Data Science
    Origins
    People
    Work

2 Math Behind Data Science
    Experimentation
    Growth Normalization
    If Time

Page 56: Math in data

MAP

MAP combines:

Precision - Give me only what I want

Recall - Give me everything I want

Figure: (Precision, Recall) values with same MAP score.

Page 57: Math in data

MAP

Figure: Two sets of (Precision, Recall) values with same MAP score.

Page 58: Math in data

MAP

Figure: Improve a search algorithm from point a with either a small increase in Recall or a large increase in Precision.

Page 59: Math in data

MAP

Figure: Additional points.

Page 60: Math in data

MAP

Snake Oil

When Precision and Recall values are not balanced, MAP only responds to changes in the lower one.

North Star

When Precision and Recall values are balanced, MAP promotes improvement of both Precision and Recall.
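The slides do not give the formula behind MAP, so the sketch below does not reproduce it. As a stand-in, it uses the harmonic mean of Precision and Recall (the F1 score), a different but well-known combined metric that shows a similar pattern: when the two values are far apart, the score moves mostly with the lower one. All numbers are made up.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (stand-in metric, not MAP)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Unbalanced case: precision is high, recall is low.
base = f1(0.8, 0.1)
gain_from_precision = f1(0.9, 0.1) - base  # raise the higher value by 0.1
gain_from_recall = f1(0.8, 0.2) - base     # raise the lower value by 0.1
print(f"unbalanced: +precision {gain_from_precision:.4f}, "
      f"+recall {gain_from_recall:.4f}")

# Balanced case: improving either value helps equally.
balanced = f1(0.5, 0.5)
print(f"balanced: +precision {f1(0.6, 0.5) - balanced:.4f}, "
      f"+recall {f1(0.5, 0.6) - balanced:.4f}")
```

In the unbalanced case the same 0.1 improvement is worth far more on the lower value, while in the balanced case both improvements count the same.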
