THE FUTURE OF DATA

Feb 22, 2023

DAVID WHITE, IMPORT.IO

At its core, import.io is the manifestation of the belief

that integrated access to data, especially from the web, is go-

ing to be the most important technological advantage in the

next 10 years.

The internet represents the largest repository of information

in the world. Experts estimate that the sum total of information

available online amounts to around 1,200 exabytes

of data which, if you placed it all on CDs, would reach to the

moon five times. The amount of digital information doubles

roughly every two years, and today, less than two percent of

information is non-digital.

The age of Big Data is here. The world is quickly becoming

“datafied”. Locations, prices, words, relationships and even

the things you like on Facebook can all be stored, measured

and analysed - potentially for profit. The prospect of what

could be achieved if all that information were available programmatically

at an affordable price was the driving motivation

for founding import.io.

We believe that data is the future. Its widespread use will produce

a marked transformation in how society processes information;

not just in the business sector, but in every aspect

of human life. To give you an example, experts estimate that

simply by using data to drive efficiency, the US Healthcare

Industry could create more than $300 billion in value - every

year.

The possibilities of what we can achieve with data are end-

less. It will develop new sectors of the economy, disrupt in-

dustries, drive innovation and create an estimated 4.4 million

jobs in the next year alone. As a society, we have only just

scratched the surface of what access to data can do and the

insights it can bring.

Already we’ve seen a few early pioneers who are using web

data to do some amazing things and generate some very real

value (some of which you’ll read about in this paper). So far

we’ve helped a number of large corporations improve their

bottom line and save countless man hours; allowed hun-

dreds of small startups to launch a business or disrupt an

industry; and given a major charity the evidence it need-

ed to lobby for real policy changes; all due to the ability

to seamlessly access critical data that wasn’t previously

available.

This is just the beginning. Data belongs in the hands of

people who can, and will, do amazing things with it. Al-

ready, our users have created over 100,000 APIs on our

platform, which means they’ve opened up more data in

the last 6 months than has existed in the last 10 years.

The Structured Web is well on its way to becoming a reality,

and as you’ll hear from these top-level experts in the

data industry, access to it is only the first step to unlocking

its true potential. Import is merely a tool, but we hope that

it is one that will act as a catalyst for change, innovation

and growth. In this paper we’ve brought together some of

the brightest minds in the data space as a way of inspiring

you to keep finding new and innovative ways to use data.

GECKOBOARD

GROWING INDUSTRIES THROUGH DATA

Big Data Never Sleeps. Too many organisations fly

blind when it comes to data. Yet using it smartly can both

transform businesses - and boost the wider economy. While

there is still very much a place in business for intuition, instinct

and eureka moments, it’s a certainty that a decision made with

even a small amount of data is substantially better than one

made without any at all.

It’s estimated that the world generates 2.5 quintillion bytes of

data every day, with 90% of it created in the last two years

alone [1]. This deluge of machine-readable ones and zeroes

streams from every corner of daily life and work, growing

steadily richer and more voluminous. But while data’s transformative

power in fields ranging from government to medicine

and finance to business is well documented, the truth is that

the overwhelming majority of companies are more or less in

the dark about how to use the information they collect.

Torrents of data, in themselves, are of little value - they are

white noise, without signal. In order to sift through and analyse

them, to find nuggets of insight or correlations within them,

data scientists or expensive enterprise software are required.

But few organisations have access to either. So instead, as the

information accumulates, many companies are simply fro-

zen like rabbits trapped in headlights.

It’s no wonder then that research by IBM [2] recently found

that 71% of CMOs they interviewed said they were “unpre-

pared to manage the Big Data Explosion” - and that despite

the fact that 54% of “outperforming CEOs [say they] distin-

guish their organisations by ensuring increased access to

data”. The truth of the matter is that in the digital econo-

my, running businesses without access to data is a bit like

flying blind. Being equipped with real-time, accurate and

meaningful data means problems can often be spotted be-

fore they occur, plans can be put in place and, of course,

bottom-lines boosted.

If a retail business, for example, has a sudden spike in traf-

fic on its website, having data at its fingertips means that

the height and length of that surge can be increased to

maximise its effect. Knowing about it even hours (let alone

a day or two) later is probably going to be too late.

To create what we call ‘informed businesses’, three things

need to happen. First and most fundamental, the data has

to be there, which means reliably collecting the right raw

material, at the right frequency so that it’s ready for use.

Second, an organisation needs the correct tools in order

to be able to sift and analyse the data - transforming

it into a form human beings, rather than computers,

can use. Third, an organisation needs to have a culture in

place in which data informs their entire decision-making

process.

A truly informed business has each of these elements,

which ensures data flows directly to people, instead of

them having to dig it out for themselves. To do that an

organisation needs first to establish which datasets

they require in order to improve what they do. Their goal

might be something as simple as enhanced brand rec-

ognition, improving customer service response times or

selling more widgets.

Second, once they have the answers, a business needs

to decide on the metrics that properly reflect its goals at

that point in time - so that it can be guided by the data

and know whether it is on track, or otherwise, with its pri-

orities. (Of course, goals change between teams, and as

businesses evolve - so the metrics will too.) Crucially, a

data-dashboard - whether on a giant screen on the office

wall or on a device - should be limited to no more than

ten discrete pieces of information, so that a quick glance

will be enough to glean that organisation’s vital signs.

However, there will always be those for whom any data is

too much. Organisations still abound with strong-mind-

ed individuals who think they don’t need data or that they

have what it takes to make decisions without it. Indeed,

people sometimes say to me that the boardroom giants

and great industrialists of the past didn’t have access to

data and yet they built vast empires and global corpora-

tions anyway.

True, I usually reply, but if they achieved all that without

data, just imagine what they could have done with it.

In fact, nothing substitutes for cold hard numbers. Used effec-

tively, they boost growth in businesses - and the wider econo-

my - in two main ways. First, by motivating people. Providing

real-time data dashboards spurs co-workers on to improve

their performance. After all, if you see a number, it’s human

instinct to want to improve upon it. Second, data enables a

constant process of optimisation. By measuring before and af-

ter you try something, you can accurately gauge whether the

change you made is working or not. If it isn’t, you discard it -

and iterate until you find something that does. If it works, then

you double down on it.
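
The before-and-after measurement described above can be sketched in a few lines of Python. This is only an illustration: the visitor and order counts are invented, and the two-proportion z-test is one common way to check whether a change is bigger than random noise, not a method the author names.

```python
import math

def two_proportion_z(orders_a, visitors_a, orders_b, visitors_b):
    """z-score for the difference between two conversion rates (pooled estimate)."""
    rate_a = orders_a / visitors_a
    rate_b = orders_b / visitors_b
    pooled = (orders_a + orders_b) / (visitors_a + visitors_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    return (rate_b - rate_a) / std_err

# Hypothetical numbers: one week before and one week after a checkout change.
before = {"visitors": 10_000, "orders": 200}  # 2.0% conversion
after = {"visitors": 10_000, "orders": 260}   # 2.6% conversion

z = two_proportion_z(before["orders"], before["visitors"],
                     after["orders"], after["visitors"])

# 1.96 is the usual two-sided 5% critical value for a z-score.
keep_change = z > 1.96
print(f"z = {z:.2f}, keep the change: {keep_change}")
```

If the z-score stays below the threshold, the sketch treats the change as unproven, which corresponds to the "discard it and iterate" branch of the loop.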

This way, an online retailer, for example, can marginally in-

crease their conversion rate by a percentage point here and

there, right across the board. Optimise continually and that re-

tailer suddenly looks like a very different business. Repeat that

across 100,000 SMEs and you are suddenly talking about a

very different economy.

In the end, only data allows you to understand how a set of

factors shapes and drives a business. It’s the glue that holds

it all together. By comparison, everything else is just opinion.

Paul Joyce is co-founder and CEO of Geckoboard, a data communication

company. www.geckoboard.com @geckoboard

[1] Source: IBM

[2] Source: IBM Global CEO Study 2012

BIGML

USING DATA TO PREDICT THE FUTURE

With enough data anything is possible. Even predicting

the future. Using advanced machine learning techniques

and predictive analytics we can start to not only identify past

trends, but also uncover future ones. At BigML, we believe

the time is right to introduce large scale machine learning to

the masses, and show all of the value that can be extracted

from data.

One of the great aspects of BigML’s machine learning plat-

form is that you can use it not only to build predictive analyt-

ics, but also to uncover hidden relationships within a dataset.

These relationships can then be examined using BigML’s in-

tuitive decision tree visualizations and related functions with-

in the BigML interface.

As demonstrated at the Data Summit London, we decided to

put over 58,000 Kindle book titles through the machine learn-

ing test to examine what features have the greatest impact

on a Kindle title’s rating.

To do this, we first leveraged import.io to pull key statistics on

the Kindle titles directly from Amazon’s website. The fields

we examined were: URL, title, author, price, save (whether or

not the book was saved), pages, book description, size, pub-

lisher, language, text-to-speech enabled (y/n), x-ray enabled

(y/n), lending enabled (y/n), number of reviews and stars (the

rating).

The resulting data set was put into a ~70MB .csv file, which

we quickly uploaded into BigML. This data source includes

text, numeric and categorical fields. The ability to analyze

free text alongside other types of fields is one of the

features that sets BigML apart in the marketplace.

For the model, we tried various iterations of the data, before

settling upon the following fields: price, pages, description,

lending, and number of reviews. We were then able to gener-

ate a decision tree from which we could easily analyse and

identify patterns in the data.
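
To make the idea of a decision tree concrete, here is a minimal sketch of how a single split in such a tree can be chosen. It is not BigML's algorithm and does not use the Kindle data set: the six rows are invented, only three numeric fields are used (the text and lending fields are left out), and the criterion is the plain sum of squared errors used in textbook regression trees.

```python
from statistics import mean

# Hypothetical rows: (price in USD, pages, number of reviews, star rating).
# These values are invented for illustration; they are not the Kindle data.
books = [
    (0.99, 120, 15, 3.4),
    (2.99, 250, 40, 3.9),
    (4.99, 310, 220, 4.3),
    (7.99, 400, 510, 4.5),
    (9.99, 450, 35, 3.8),
    (12.99, 520, 900, 4.6),
]
FIELDS = ["price", "pages", "reviews"]

def sse(values):
    """Sum of squared errors of the values around their mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(rows):
    """Return (error, field, threshold) for the split that best separates ratings."""
    best = None
    for col, field in enumerate(FIELDS):
        for threshold in sorted({row[col] for row in rows}):
            left = [row[-1] for row in rows if row[col] <= threshold]
            right = [row[-1] for row in rows if row[col] > threshold]
            if not left or not right:
                continue  # a split must put at least one book on each side
            error = sse(left) + sse(right)
            if best is None or error < best[0]:
                best = (error, field, threshold)
    return best

error, field, threshold = best_split(books)
print(f"Best first split: {field} <= {threshold} (error {error:.3f})")
```

A full tree repeats this search on each side of the split. In this toy data the number of reviews separates high and low ratings best, which says nothing about what the real Kindle model found.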

With only a cursory analysis, by simply mousing through

the decision tree we were able to uncover a variety of in-

sights on which fields lend themselves to positive or less

positive Kindle reviews. From a practical standpoint, a

publisher could build a similar model and factor it into its

decision-making before green-lighting a book.

This was a fun exercise which also demonstrates that

machine learning and decision trees can be informative

beyond the predictive models that they create. It stands

to illustrate an important point: the amount of data is outpacing

the human skill available to analyze it.

Cheap cloud computing and storage services have led to

a huge increase in the amount of data accumulated. Com-

panies with large numbers of analysts can gain valuable

insight from their data. However, not everyone can afford

such a team, let alone find the individuals with the right

skill. There is a strong need for new services (such as

BigML) that put predictive power in the hands of many.

INFOGR.AM

THE IMPORTANCE OF DATA LITERACY

The ability to convincingly communicate one’s opin-

ion in public as well as critically evaluate the messages of

others is vital for both democracies and businesses.

In democracies, we want to empower everyone, not only the

rich and the educated, to defend their interests with argu-

ments and stories. We also want everyone to have highly

developed critical thinking skills, so that they are not swayed

by faulty arguments and appealing messages that are not

based on reasoning and evidence.

In businesses, we want to make sure all employees can com-

municate their ideas and their views about various business

questions. We want business organisations that are based

on speaking up, transparency and management teams that

do not ignore the views of employees, yet are able to critical-

ly evaluate their validity.

WE NOW LIVE IN A WORLD OF DATA. The explosion of data collection and

analysis technologies has made a lot of previously unan-

swerable data questions answerable. For example, take

global warming. What is the exact extent of global warm-

ing? What are the economic, environmental and social

costs of global warming? What is the likely impact of differ-

ent interventions? Which parts of the world are likely to be

more affected? All of these are questions that have to be

answered with data.

It is important to stop and think about what this emergence

of data means for democracies and businesses. It all boils

down to the same two principles - making sure all people

have the ability to communicate their views with data as

well as critically evaluate the data stories of others.

We at Infogr.am have thought long and hard about how to

solve the first problem - creating a tool that empowers peo-

ple to tell convincing stories with the help of data. Infogr.

am has created a tool for building beautiful, shareable in-

fographics. We launched two years ago and already have

more than 1.2m registered users with more than 2.5m info-

graphics created on our site.

The vision of Infogr.am is that any tool that empowers people

to tell data stories should be extremely simple to use,

freely available to everyone, beautiful and social. We at In-

fogr.am think that we have managed to create a tool that

embodies these four principles. The positive response from

our users - including The Verge, TechCrunch, The Next Web,

BBC and others - also suggests that we are moving in the

right direction.

A problem that is still largely unsolved is critical thinking when it

comes to data - data literacy. Most people can still be tricked by

simple things like cherry-picked data, disproportionate scaling

of x- and y-axes and other misleading tactics. Of course, education

is part of the solution, but it cannot stand alone. We

need to develop tools that nudge people in the critical direction

when they see data; otherwise it becomes too easy for politicians

or marketers to manipulate people with data for personal gain.
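
As a small illustration of how axis scaling misleads, the sketch below computes Edward Tufte's "lie factor" (the size of the effect shown in a graphic divided by the size of the effect in the data) for a bar chart with and without a truncated y-axis. The figures are hypothetical, and the lie factor is brought in here as one possible check; the author does not name it.

```python
def bar_height(value, axis_min, axis_max):
    """Fraction of the chart height a bar occupies for a given y-axis range."""
    return (value - axis_min) / (axis_max - axis_min)

def lie_factor(first, second, axis_min, axis_max):
    """Relative change shown in the graphic divided by relative change in the data."""
    shown = (bar_height(second, axis_min, axis_max)
             / bar_height(first, axis_min, axis_max)) - 1
    actual = second / first - 1
    return shown / actual

# Hypothetical figures: a metric rises from 50 to 52, a 4% increase.
honest = lie_factor(50, 52, axis_min=0, axis_max=60)      # y-axis starts at zero
truncated = lie_factor(50, 52, axis_min=48, axis_max=60)  # y-axis starts at 48

print(f"honest axis: {honest:.1f}, truncated axis: {truncated:.1f}")
```

With the axis starting at 48, the second bar is drawn twice as tall as the first even though the underlying value grew by only 4%, so the chart overstates the change about 25-fold.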

I encourage everyone reading this paper to think long and hard

about how they can contribute to the spread of data literacy.

CHARLES ARTHUR, THE GUARDIAN

DATA PRIVACY CONCERNS US ALL

In an age where information is power, data has be-

come the ultimate currency. In today’s online society, people

(often the younger generation) are increasingly putting their

lives online - they post information about who they are, where

they are, what they’re doing and who they’re with.

The problem with this is that many of them do not realize that

it isn’t just their friends who are seeing this information. For

example, the popular online-content site, Buzzfeed, collects 49

different variables about you as a visitor to their website. And

they are certainly not the only ones.

The truth of the matter is that your data is valuable. Very valu-

able in fact. Data points that are individually innocuous

can become enormously powerful and revealing when

aggregated - that is the essence of Big Data.

And this isn’t necessarily a bad thing. If Amazon could always

guarantee you the lowest price on your online shopping,

would you be happy to share your personal information

with them? The answer probably depends on what

data they were asking for, but my guess would be that for

most people the answer is yes. To some extent, the more

companies understand about you the better they can serve

your needs by providing you with exactly what you’re

looking for.

It’s also what allows services like Google and Facebook

to remain free to use. If they couldn’t sell your data (or at

least the aggregate of it) to advertisers and other compa-

nies, then they’d have to start charging you. Would you

look everything up on Google if you had to pay $0.05 per

search? Probably not.

The real problem isn’t that companies are accessing your

data, it’s that many of them aren’t doing a good job of stat-

ing their data-usage intentions. The growth of big data

presents a greater need for data collectors to provide full

disclosure, by alerting their visitors what data is being col-

lected and what it’s for.

If privacy is going to exist in the world of big data it’s going

to require both transparency and literacy. People need to

be aware of when they are giving their data away and what

it is for. They need to be conscious of the fact that it is a

tradeoff. Yes, companies will use your personal data, but

you can also benefit from things like targeted marketing

or recommendations.

At the same time, as people become more aware of giving

away their personal data, companies will need to become

more transparent about how they are using that data to

make sure that the tradeoff of data for a service remains

beneficial to both parties. Eventually, this will probably

lead to some type of opt-in or opt-out system, similar to that

used for placing you on email lists.

Online data privacy isn’t a one-way street. It requires

changes on the part of both the individual and the website.

ANDREW FOGG, IMPORT.IO

WEB DATA IN GOVERNMENT

Governments provide a large variety of programmes

and services, which both produce and require massive

amounts of data, often unstructured and increasingly in re-

al-time. As such, governments have a responsibility to make

sure that they are using data efficiently and openly.

In a digital age, data is a key resource for social and econom-

ic activities. Everything from finding your local post office to

starting a company requires access to data, much of which

is created or held by our governments. By opening up data,

governments can help drive the creation of innovative busi-

nesses and services that deliver social and economic value.

The Open Data movement has centered largely around con-

vincing governments to make their data publicly available in

a format that is useable. This movement is vital because it

will allow citizens to more easily hold our governments to

account, improve transparency and contribute to economic

growth.

Even more critical, however, is that governments embrace

the web as a source of data. Web data can enable govern-

ments to do existing things more cheaply, do existing things

better and do new things that they do not currently do. By

accessing web data sources, governments can be more effi-

cient, save money, identify fraud and help public bodies bet-

ter serve their citizens.

Recent developments in big data analysis have taught us that

having more data points – even if not all of them are perfect

– is often preferable to trying to collect a small but perfect

sample. By leveraging the vast amount of data on the web,

governments can make better decisions and more accurate

predictions. And because web data is live data, web data can

help governments make decisions faster.

For example, the UK government now includes the sale of

drugs and prostitution in the calculation of gross domestic

product (GDP). The Office for National Statistics (ONS) estimates

that prostitution adds a whopping £5.314 billion to

the economy (0.4% of GDP).

The calculation of this estimate is based on a single survey

conducted 10 years ago that tried to count the number of fe-

male prostitutes in London. Without access to web data, the

ONS had no choice but to use this single study and a bunch

of assumptions in order to extrapolate from the number of

female prostitutes in London to the total number of prosti-

tutes in the whole of the UK.

Using web data at Import.io, we were able to pull information

from public websites where sex workers advertise their ser-

vices and calculate a new GDP estimate that uses the most

up to date numbers and relies on fewer assumptions. We

found that male sex workers were missing from the ONS

numbers despite the fact that they constitute 42% of the

sex workers advertising online. In addition

we found that the average price that female sex work-

ers charge for their services is double the number used

by the ONS in their calculations. Using our new data we

were able to re-run the ONS calculations and estimate that

prostitution contributes £12.374 billion (0.9% of GDP) to

the UK economy.

This is only a preliminary analysis but the large difference

between the two values stands to highlight the impor-

tance of having good, accurate data. You can read more

about this work at http://go.import.io/prostitution and

http://go.import.io/genderdifferences.

Web data has great potential to make governments more

efficient and to improve citizens’ lives. By incorporating a

greater variety of structured and unstructured data from

both internal and external sources, governments can

improve efficiency and effectiveness across their broad

range of responsibilities.

SALESFORCE

DATA WILL CHANGE EVERYTHING

We live in increasingly challenging times. On a glob-

al level, climate change continues apace while our need for

energy, water and food continues to rise; we are in a constant

battle with disease; our political climate has become

overshadowed by war, terrorism and failed states; and our

economy is emerging from the largest recession to date

with increased wealth inequality. On a personal level we are

constantly bombarded with advice on how to conduct our

lives: what to eat, what to drink, how to work and how to

rest.

Solving these problems will require the answers to some

tough questions: How much of climate change is man-

made? What’s the right response? Can we solve our energy

problems by fracking? Is nuclear dead? Should we eat fat?

Or sugar? Both? Should we all be vegan in order to solve the

water crisis? What about GM foods? Will the pharmaceuti-

cal industry continue to invest in antibiotics? Can it afford

to? Can we afford not to?

Debate on these issues has traditionally been polarised, ex-

treme and often emotional. Our answers need to be based

on evidence. On data. Data that is both reliable and accu-

rate. Data from which real insight can be gained, and sensi-

ble, sustainable decisions made. Decisions that are tracta-

ble despite the polarised nature of the debates.

That is why data is so important today, and why we need

to change the way we think about data: what it is and how

we use it.

Soon everyone and pretty much everything will be connect-

ed. The amount of data this will generate is unimaginable

and ever expanding. But, unless we can harness and manip-

ulate this data, it is useless.

The real key to being able to answer these vital questions

is understanding the provenance of data, being able to val-

idate its accuracy and veracity, as well as knowing how to

extract value from it. This will require evolving ontologies

and taxonomies that allow us to do this on a global scale,

across cultures and time zones. We will need to allow for differences

in perceptions and collection methods. Learning how

to make sense of collective data is the true challenge.

Given the mushrooming of data sources, collectors and sen-

sors, we need to get far better at filtering and routing data to

make the right data available in the right format where and

when it is needed.

Thinking differently about data (where it comes from; where

it goes; how it is collected, maintained and aggregated; what it

is used for; and what it shouldn’t be used for) requires a fundamental

shift in our collective thinking. Only then will we be

able to gain insight so as to make the right decisions at the

right time in the right place.

-JP Rangaswami, chief scientist, Salesforce.com
