Measuring the digital economy using big data Prash Majmudar Growth Intelligence @growthintel @prashmaj

Measuring the Digital Economy using Big Data by Prash Majmudar

Jan 27, 2015

Technology

PyData

Transcript
Page 1: Measuring the Digital Economy using Big Data by Prash Majmudar

Measuring the digital economy using big data

Prash Majmudar – Growth Intelligence

@growthintel

@prashmaj

Page 2: Measuring the Digital Economy using Big Data by Prash Majmudar

Overview

• Background

• Approach (Data + Python)

• Sizing the economy - Results

• Examples

Page 3: Measuring the Digital Economy using Big Data by Prash Majmudar

Background

Page 4: Measuring the Digital Economy using Big Data by Prash Majmudar

Project background

• Research project supported by NESTA, Google

• Worked with independent economists at the National Institute of Economic and Social Research (NIESR) – Max Nathan, Anna Rosso

• Published report in 2013

• Further phases of work underway

Page 5: Measuring the Digital Economy using Big Data by Prash Majmudar

Research questions

• What’s the most appropriate definition of UK ‘digital companies’? Cleaner definitions, company counts

• What do the UK’s ‘digital companies’ (really) look like? Key characteristics, focus on start-ups, innovating and ‘high-growth’ companies, spatial footprint

• What drives innovation and/or high-growth status in digital companies? Performance analysis and characteristics. Sample historic data to investigate causality

Page 6: Measuring the Digital Economy using Big Data by Prash Majmudar

Why?

• The digital economy is poorly served by conventional definitions and datasets.

• Reliance on Companies House (historic data)

• Standard definitions used for:

– Credit / risk

– Government policy (e.g. focus on Tech City)

– Economic productivity measures

– Companies that sell / market to other companies

Page 7: Measuring the Digital Economy using Big Data by Prash Majmudar

SIC - Standard Industrial Classification

• Brought into being in 1948

– Since 1948 the classification has been revised in 1958, 1968, 1980, 1992, 1997, and 2003

• Latest version is “SIC 2007”

– adopted by UK in 2008.

– adopted by Companies House in October 2011.

• 731 SIC codes, but not without issues

– Self-classification

– Emerging sectors e.g. no codes for Nanotechnology

Page 8: Measuring the Digital Economy using Big Data by Prash Majmudar

SIC

• 77220 Renting of video tapes and disks

• 81223 Furnace and chimney cleaning services

• 01440 Raising of camels and camelids

• 32110 Striking of coins – Royal Mint

• 38310 Dismantling of wrecks

• 01260 Growing of oleaginous fruits

• 82990 Other business support service activities n.e.c. – 10% of Businesses

• 20% not classified

Page 9: Measuring the Digital Economy using Big Data by Prash Majmudar
Page 10: Measuring the Digital Economy using Big Data by Prash Majmudar

Challenge

• The ‘digital economy’ is not straightforward to define

• Refers to:

– a set of sectors,

– a set of outputs (products and services),

– and a set of inputs (production and distribution tools, underpinned by information and communication technologies).

• Mapping the digital economy onto industries is necessarily imprecise.

• Government defines it as ‘information’ and ‘digital content’ industries (BIS 2012, 2013)

• Data driven methods can provide richer, more informative and more up to date analysis.

Page 11: Measuring the Digital Economy using Big Data by Prash Majmudar

Data driven approach

Page 12: Measuring the Digital Economy using Big Data by Prash Majmudar

[Diagram: linked datasets and algorithms covering all companies in the economy (~3M companies)]

• Data sources shown: Online activity, News / Events, Technologies, Classifications, Financials, TMs / Patents, Trade activity

• Labels shown: UNUSUAL DATA, UNIQUE DATA, COMPANIES, USER DATA

• Users shown: Enterprise users, Tech users, Medium company users

Page 13: Measuring the Digital Economy using Big Data by Prash Majmudar

Approach

• Classification system is multi-dimensional:

– Sector: vertical they operate in

– Product type: principal output (services / physical goods)

– Client type: business or consumer focussed

– Sales process: how they sell / route to market
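
One way to picture this multi-dimensional label is as a small record with one field per dimension. The sketch below is illustrative only: the class name `CompanyClassification`, its field names and the example values are assumptions, not the project's actual schema or category lists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompanyClassification:
    """Hypothetical record holding one label per classification dimension."""
    sector: str         # vertical the company operates in
    product_type: str   # principal output (services / physical goods)
    client_type: str    # business or consumer focussed
    sales_process: str  # how the company sells / its route to market


# Placeholder values chosen for illustration.
example = CompanyClassification(
    sector="publishing",
    product_type="software web application",
    client_type="businesses",
    sales_process="project",
)
```

Keeping the four dimensions as separate fields means each one can be predicted by its own classifier and combined afterwards.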

Page 14: Measuring the Digital Economy using Big Data by Prash Majmudar

[Diagram labels: IT, Film, Telco, Publishing, Oil & Gas, Architecture, Software – web, Consultancy, Hardware / tools, Electronics, Media distribution]

Page 15: Measuring the Digital Economy using Big Data by Prash Majmudar

Approach

[Diagram: classification pipeline]

• Inputs shown: Crowd sourced labelled data, Pre-labelled data, Crawl / APIs

• Stages shown: Feature extraction / pre-processing, Feature generation / selection, Training set, Model training, Processing

• Tools named: Scrapy, Python scikit-learn / pandas

Page 16: Measuring the Digital Economy using Big Data by Prash Majmudar

Building training sets

Crowd sourcing (create classification tasks) / Expert panels / Pre-labelled data

• Using crowd sourcing

– Users follow pre-defined instructions – are rewarded for successfully completing tasks

– Can put in place qualification tests etc.

– Vote to produce labels – majority of 5

• Used expert panel when large number of classes
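
The "majority of 5" vote can be sketched as follows. This is a minimal illustration, assuming five workers label each company and that a label needs at least three votes to be accepted; the function name `majority_label` and the choice to return `None` when no label reaches a majority are assumptions.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_label(votes: Sequence[str], min_votes: int = 3) -> Optional[str]:
    """Return the label chosen by at least `min_votes` workers, else None."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_votes else None


# Five hypothetical worker votes for one company.
votes = ["fashion", "fashion", "publishing", "fashion", "printing"]
winner = majority_label(votes)
```

Items that return `None` could be sent back for more votes or passed to an expert panel.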

Page 17: Measuring the Digital Economy using Big Data by Prash Majmudar

Feature engineering

– Multiple sources of features

• Free text (News / Web)

• Structured datasets (e.g. patent filings etc.)

– Cleaning data

• Malformed HTML

• Stripping out HTML, Javascript

– Tokenising and calculating TF-IDF weights
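
The cleaning and weighting steps above can be sketched with the standard library alone. This is an illustration under assumptions, not the project's code: the helper names are made up, the tokeniser is a simple lowercase letter match, and the weight is the plain `tf * log(N / df)` variant (scikit-learn's `TfidfVectorizer` uses a smoothed variant by default).

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_html(html):
    parser = TextExtractor()
    parser.feed(html)  # html.parser tolerates unclosed tags
    parser.close()     # flush any trailing text
    return " ".join(parser.parts)


def tokenise(text):
    return re.findall(r"[a-z]+", text.lower())


def tfidf(docs):
    """Return one {term: weight} dict per document."""
    tokenised = [tokenise(strip_html(d)) for d in docs]
    df = Counter(term for tokens in tokenised for term in set(tokens))
    n = len(docs)
    weights = []
    for tokens in tokenised:
        tf = Counter(tokens)
        weights.append(
            {t: (c / len(tokens)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return weights


# Made-up pages, including malformed HTML and inline script / style.
pages = [
    "<html><body><h1>Cisco switches</h1><script>var x = 1;</script><p>Network cabling",
    "<p>Bridal lace and satin</p><style>p {color: red}</style>",
    "<p>Network hosting</p>",
]
weights = tfidf(pages)
```

Terms that appear on only one page (such as "cisco") receive a higher weight than terms shared across pages (such as "network").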

Page 18: Measuring the Digital Economy using Big Data by Prash Majmudar

Modelling

• Supervised learning classification problem

• scikit-learn (fast iteration on different models). Use of linear SVMs and processing pipelines

– One vs many classifier

• Pandas plays well here – can quickly build up feature sets

• Large number of features (thousands) – linear models are fast.
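
A minimal sketch of such a pipeline, assuming TF-IDF features feeding a one-vs-rest linear SVM. The training texts and labels below are made up for illustration, and the step names and default hyperparameters are assumptions rather than the settings used in the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy training set (not project data).
texts = [
    "ethernet cabling cisco switches",
    "fibre hosting server infrastructure",
    "wireless telecom hardware",
    "bridal lace satin accessories",
    "mens womens clothing footwear",
    "wholesale fashion apparel shirts",
    "books newspapers journals",
    "periodicals magazine publisher",
    "editorial print newspaper",
]
labels = ["computer networking"] * 3 + ["fashion"] * 3 + ["publishing"] * 3

# Vectorise free text, then train one linear SVM per class.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", OneVsRestClassifier(LinearSVC(random_state=0))),
])
model.fit(texts, labels)

predictions = model.predict(["cisco server cabling", "satin shirts"])
```

Wrapping both steps in a `Pipeline` keeps the vectoriser and classifier together, so swapping in a different model for comparison is a one-line change.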

Page 19: Measuring the Digital Economy using Big Data by Prash Majmudar

[Chart: largest clf.coef_ term weights for two classes, horizontal axis from 0 to 1.4]

• Computer networking: cables, smes, termination, ip, networking, server, sap, consultant, ethernet, installer, fault, cloud, remote, setup, ict, servers, copper, telecom, wireless, hardware, conferencing, desk, disruption, crm, infrastructure, hosting, fibre, cisco, switches, cabling

• Fashion: luxurious, quantity, footwear, collection, cotton, courier, shirts, stockists, cart, logo, satin, wholesale, hats, nylon, wear, workwear, bridal, womens, designs, socks, accessories, lace, mens, clothing, fashion, apparel

Page 20: Measuring the Digital Economy using Big Data by Prash Majmudar

Summary

• Use multiple datasets as an input

• Build multi-class classifiers for sector, product, client, sales process

• Apply classifiers to 3M companies in the UK

Page 21: Measuring the Digital Economy using Big Data by Prash Majmudar

Sizing the digital economy

Page 22: Measuring the Digital Economy using Big Data by Prash Majmudar

Challenges

• Sole traders are not observed

• Registered company addresses are not always trading addresses

• Understanding company structure

• Employee coverage is limited – gaps in data due to the traditional reliance on historic filing data

Page 23: Measuring the Digital Economy using Big Data by Prash Majmudar

Cleaning the company data

• Aim = build a benchmarking sample

• Include only observations with SIC and GI info => smaller than the ‘true’ sample

– Step 1: drop non-trading, dormant, dissolved companies or those in administration

– Step 2: drop holding companies

– Step 3: identify groups of linked companies (via name, postcode), keep the unit that reports highest revenue

• Benchmarking sample = 1.868m companies

• Validate ‘true’ sample (2.254m) vs. BPS enterprise counts
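
The three cleaning steps can be sketched in pandas roughly as follows. The column names, status values and records are hypothetical, and linking companies by an exact match on name and postcode is a simplification of whatever matching the project actually used.

```python
import pandas as pd

# Hypothetical company records; column names are assumptions.
companies = pd.DataFrame({
    "name":       ["Acme Ltd", "Acme Ltd", "Beta Holdings", "Gamma Ltd", "Delta Ltd"],
    "postcode":   ["GU1 1AA", "GU1 1AA", "AB10 1AB", "M1 1AE", "BN1 1AA"],
    "status":     ["active", "active", "active", "dormant", "active"],
    "is_holding": [False, False, True, False, False],
    "revenue":    [500_000, 1_200_000, 3_000_000, 0, 250_000],
})

# Step 1: drop non-trading, dormant, dissolved companies or those in administration.
excluded = {"non-trading", "dormant", "dissolved", "administration"}
trading = companies[~companies["status"].isin(excluded)]

# Step 2: drop holding companies.
operating = trading[~trading["is_holding"]]

# Step 3: treat rows sharing a name and postcode as one linked group and
# keep the unit that reports the highest revenue.
sample = (
    operating.sort_values("revenue", ascending=False)
    .drop_duplicates(subset=["name", "postcode"], keep="first")
    .sort_index()
)
```

Here only the higher-revenue "Acme Ltd" row and "Delta Ltd" survive all three steps.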

Page 24: Measuring the Digital Economy using Big Data by Prash Majmudar

Identifying ‘digital companies’

• Aim = more robust definition, compare against SIC-based

• Use ‘sector’ and ‘product’ categories

• Intuition = we want companies in ‘digital’ sectors that also do ‘digital’ things (e.g. digital publishing, media, design …)

– Step 1: Identify GI sector and product categories

– Steps 2-5: clean out ‘non-digital’ GI sector / product combinations

– Step 6: Count companies

– E.g. process designed to exclude a large proportion of architecture firms, except those whose principal product type is software for CAD / technical drawing
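
The sector-by-product filter can be sketched as a lookup against a set of allowed combinations. Everything below is hypothetical: the entries in `DIGITAL_CELLS`, the company records and the helper name are placeholders, not the project's actual definition.

```python
# Hypothetical whitelist of (sector, product) combinations counted as digital.
DIGITAL_CELLS = {
    ("publishing", "digital media"),
    ("computer software", "software web application"),
    ("architecture", "software desktop or server"),
}

# Made-up companies with predicted sector and product labels.
companies = [
    {"name": "A", "sector": "architecture", "product": "consultancy"},
    {"name": "B", "sector": "architecture", "product": "software desktop or server"},
    {"name": "C", "sector": "publishing", "product": "digital media"},
    {"name": "D", "sector": "financial services", "product": "consultancy"},
]


def is_digital(company):
    """A company counts as digital only if its sector/product pair is listed."""
    return (company["sector"], company["product"]) in DIGITAL_CELLS


digital = [c["name"] for c in companies if is_digital(c)]
digital_count = len(digital)
```

In this toy data the architecture consultancy is excluded while the architecture firm whose product is software is kept, mirroring the example above.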

Page 25: Measuring the Digital Economy using Big Data by Prash Majmudar

Company counts               Observations        %

A. SIC 07
   Other                        1,681,151    89.96
   Digital Economy                187,616    10.04

B. GI sector and product
   Other                        1,599,072    85.57
   Digital Economy                269,695    14.43

Note: Panel A follows the BIS (2009) definition. Panel B defines the digital economy using GI digital sector by digital product "cells".
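
The percentages follow directly from the counts, which can be checked with a few lines of arithmetic. The counts are taken from the table; the variable names are illustrative.

```python
# Company counts from the two panels of the table.
counts = {
    "SIC 07": {"Other": 1_681_151, "Digital Economy": 187_616},
    "GI sector and product": {"Other": 1_599_072, "Digital Economy": 269_695},
}

shares = {}
totals = {}
for panel, rows in counts.items():
    totals[panel] = sum(rows.values())
    shares[panel] = {
        name: round(100 * n / totals[panel], 2) for name, n in rows.items()
    }

# Additional companies counted as digital under the GI definition.
extra_digital = (
    counts["GI sector and product"]["Digital Economy"]
    - counts["SIC 07"]["Digital Economy"]
)
```

Both panels sum to the same 1,868,767 companies, which matches the 1.868m benchmarking sample.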

Page 26: Measuring the Digital Economy using Big Data by Prash Majmudar

Classifications:

• Sector – Oil and Energy

• Product – Computer Software

• Client – Businesses

• Sales process – Project

Based in Aberdeen

SIC Code: 82990 - Other business support service activities

Page 27: Measuring the Digital Economy using Big Data by Prash Majmudar

Company counts are highest in London.

But we also find large counts in Manchester, Birmingham, Bristol and Brighton...

... as well as the wider Greater South East.

Page 28: Measuring the Digital Economy using Big Data by Prash Majmudar

[Chart: areas plotted on a horizontal axis from 0 to 1.8]

• Areas shown: Livingston & Bathgate, Crawley, Oxford, Southampton, Coventry, Middlesbrough & Stockton, Cheltenham & Evesham, Swindon, Cambridge, Andover, Brighton, Bournemouth, Wycombe & Slough, Luton & Watford, Stevenage, Guildford & Aldershot, Poole, Milton Keynes & Aylesbury, Newbury, Reading & Bracknell, Basingstoke

Page 29: Measuring the Digital Economy using Big Data by Prash Majmudar

Guildford: counts by sector (rows) and product type (columns)

Product types: consultancy; custom software development; digital media; media distribution; peer to peer communications; photography; printing services; software desktop or server; software web application; web hosting

animation 1

architecture 178

computer games 2 80

computer hardware 12 7 1

computer network security 7 1

computer networking 23 5

computer software 88 459 70

defense space 37

electrical electronic manufacturing 13 72 1

entertainment film production 6 33

financial services 820

information services 8 3

information technology 2756 6 94

internet 14 15 1 16

marketing advertising 192

photography 74 7 1

printing 12 2 63

publishing 29

semiconductors 3

telecommunications 58 9 31 1 1

Page 30: Measuring the Digital Economy using Big Data by Prash Majmudar

Additional findings

Page 31: Measuring the Digital Economy using Big Data by Prash Majmudar

Digital companies’ revenue growth in 2010-2012 is faster than non-digital ...

                    A. Annual Revenues         B. Annual Revenue Growth
                    mean          median       mean       median
Other               18,380,097    110,048      15.68      1.70
Digital Economy     10,547,218    123,388      20.21      4.17

Note: Sub-sample of those companies who report revenue. Companies House average revenues are averaged over the period 2010 to 2012. If for each company there is more than one observation, only the most recent is kept. Average annual revenue growth is computed on a smaller sample, as information for at least two consecutive years is needed.
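
The need for two consecutive years can be illustrated with a short sketch. The filings below are invented, and averaging each company's year-on-year percentage changes is an assumption; the note does not spell out the exact formula used.

```python
from statistics import mean, median

# Hypothetical revenue filings: company -> {year: revenue}.
filings = {
    "A": {2010: 100.0, 2011: 110.0, 2012: 121.0},
    "B": {2010: 200.0, 2012: 260.0},  # no consecutive pair, so excluded
    "C": {2011: 50.0, 2012: 60.0},
    "D": {2012: 75.0},                # single filing, so excluded
}


def annual_growth_rates(by_year):
    """Percentage growth for every pair of consecutive years on file."""
    return [
        100.0 * (by_year[y + 1] - by_year[y]) / by_year[y]
        for y in sorted(by_year)
        if y + 1 in by_year and by_year[y] > 0
    ]


# Keep only companies with at least one consecutive-year pair.
growth = {}
for company, by_year in filings.items():
    rates = annual_growth_rates(by_year)
    if rates:
        growth[company] = mean(rates)

mean_growth = mean(growth.values())
median_growth = median(growth.values())
```

Companies B and D drop out, which is why the growth figures rest on a smaller sample than the revenue figures.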

Page 32: Measuring the Digital Economy using Big Data by Prash Majmudar

... and digital employers have higher average staff levels.

Employees per company          Mean    Median    % of all employment

A. Official / SIC07
   Other                      20.94         4                  94.92
   Digital Economy            17.23         3                   5.08

B. GI sector and product
   Other                      20.40         4                  88.67
   Digital Economy            23.37         4                  11.33

Note: sub-sample of firms reporting employment to Companies House. Data is averaged over 2010-2012.

Page 33: Measuring the Digital Economy using Big Data by Prash Majmudar

Further work

• Drivers of innovation / growth

• Use of ‘tags’ to provide further descriptive analysis of digital companies

• Unsupervised approach to identify clusters

• Extension to sole traders

• Extending this approach to Europe – e.g. Belgium, France, Germany, Italy

Page 34: Measuring the Digital Economy using Big Data by Prash Majmudar

Questions?

@growthintel

@prashmaj

Page 35: Measuring the Digital Economy using Big Data by Prash Majmudar

SIC – ICT Sector

28230 MANUFACTURE OF OFFICE MACHINERY AND COMPUTERS

26200 MANUFACTURE OF COMPUTERS AND OTHER INFORMATION PROCESSING EQUIPMENT

27320 INSULATED WIRE AND CABLE

26110 ELECTRONIC VALVES AND TUBES AND OTHER ELECTRONIC COMPONENTS

33200 TELEVISION, RADIO TRANSMITTERS AND APPARATUS FOR TELEPHONY AND TELEGRAPHY

26400 TELEVISION AND RADIO RECEIVERS, SOUND OR VIDEO RECORDING OR PRODUCING APPARATUS AND ASSOCIATED GOODS

26511 INSTRUMENTS AND APPLIANCES FOR MEASURING, CHECKING, TESTING AND NAVIGATING AND OTHER PURPOSES

26512 INDUSTRIAL PROCESS EQUIPMENT

46439 WHOLESALE OF ELECTRICAL HOUSEHOLD APPLIANCES

46510 WHOLESALE OF COMPUTERS, COMPUTER PERIPHERAL EQUIPMENT AND SOFTWARE

46660 WHOLESALE OF OTHER OFFICE MACHINERY AND EQUIPMENT

46520 WHOLESALE OF OTHER ELECTRONIC PARTS AND EQUIPMENT

46690 WHOLESALE OF OTHER MACHINERY FOR USE IN INDUSTRY, TRADE AND NAVIGATION

61900 TELECOMMUNICATIONS SERVICES

77330 RENTING OF OFFICE MACHINERY AND EQUIPMENT INCLUDING COMPUTERS

62020 COMPUTER HARDWARE CONSULTANCY

95110 MAINTENANCE AND REPAIR OF OFFICE, ACCOUNTING AND COMPUTING MACHINERY

62090 OTHER COMPUTER RELATED ACTIVITIES

Page 36: Measuring the Digital Economy using Big Data by Prash Majmudar

SIC – Digital content industries

58110 PUBLISHING OF BOOKS

58130 PUBLISHING OF NEWSPAPERS

58142 PUBLISHING OF JOURNALS AND PERIODICALS

59200 PUBLISHING OF SOUND RECORDINGS

58190 OTHER PUBLISHING

18110 PRINTING OF NEWSPAPERS

18129 PRINTING N.E.C

18130 PRE-PRESS ACTIVITIES

18130 ANCILLARY ACTIVITIES RELATING TO PRINTING

18201 REPRODUCTION OF SOUND RECORDING

18202 REPRODUCTION OF VIDEO RECORDING

18203 REPRODUCTION OF COMPUTER MEDIA

58290 PUBLISHING OF SOFTWARE

62020 OTHER SOFTWARE CONSULTANCY AND SUPPLY

63110 DATA PROCESSING

63110 DATABASE ACTIVITIES

73110 ADVERTISING

74209 PHOTOGRAPHIC ACTIVITIES

59111 MOTION PICTURE AND VIDEO PRODUCTION

59131 MOTION PICTURE AND VIDEO DISTRIBUTION

59140 MOTION PICTURE PROJECTION

59113 RADIO & TV (DCMS ESTIMATES)

63910 NEWS AGENCY ACTIVITIES