THE FUTURE OF DATA
DAVID WHITE, IMPORT.IO
At its core, import.io is the manifestation of the belief
that integrated access to data, especially from the web, is go-
ing to be the most important technological advantage in the
next 10 years.
The internet represents the largest repository of information
in the world. Experts estimate that the sum total of information
available online amounts to around 1,200 exabytes of data which,
if you placed it all on CDs and stacked them, would reach to the
moon five times. The amount of digital information doubles
roughly every two years, and today, less than two percent of
information is non-digital.
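As a rough sanity check on the CD comparison, the arithmetic can be sketched in a few lines of Python. The disc capacity (700 MB), disc thickness (1.2 mm), decimal exabyte (10^18 bytes) and average Earth-moon distance (384,400 km) are assumptions brought in for illustration; none of them comes from the estimate quoted above.

```python
# Back-of-envelope check of the "CDs to the moon" comparison.
# Assumptions (not from the text): 700 MB per CD, 1.2 mm per disc,
# decimal exabytes (10**18 bytes), 384,400 km average Earth-moon distance.
EXABYTE = 10**18
CD_CAPACITY_BYTES = 700 * 10**6
CD_THICKNESS_M = 1.2e-3
EARTH_MOON_KM = 384_400


def moon_trips(exabytes: float) -> float:
    """Return how many Earth-moon distances a bare stack of CDs would span."""
    discs = exabytes * EXABYTE / CD_CAPACITY_BYTES
    stack_km = discs * CD_THICKNESS_M / 1000
    return stack_km / EARTH_MOON_KM


if __name__ == "__main__":
    print(f"{moon_trips(1200):.1f} Earth-moon distances")
```

Under these assumptions the stack comes to a little over five Earth-moon distances, which is in line with the figure quoted.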
The age of Big Data is here. The world is quickly becoming
“datafied”. Locations, prices, words, relationships and even
the things you like on Facebook can all be stored, measured
and analysed - potentially for profit. The possibility of what
could be achieved if all that information were available
programmatically at an affordable price was the driving
motivation for founding import.io.
We believe that data is the future. Its widespread use will
produce a marked transformation in how society processes
information, not just in the business sector, but in every aspect
of human life. To give you an example, experts estimate that
simply by using data to drive efficiency, the US healthcare
industry could create more than $300 billion in value - every
year.
The possibilities of what we can achieve with data are end-
less. It will develop new sectors of the economy, disrupt in-
dustries, drive innovation and create an estimated 4.4 million
jobs in the next year alone. As a society, we have only just
scratched the surface of what access to data can do and the
insights it can bring.
Already we’ve seen a few early pioneers who are using web
data to do some amazing things and generate some very real
value (some of which you’ll read about in this paper). So far
we’ve helped a number of large corporations improve their
bottom line and save countless man hours; allowed hun-
dreds of small startups to launch a business or disrupt an
industry; and given a major charity the evidence it need-
ed to lobby for real policy changes; all due to the ability
to seamlessly access critical data that wasn’t previously
available.
This is just the beginning. Data belongs in the hands of
people who can, and will, do amazing things with it. Al-
ready, our users have created over 100,000 APIs on our
platform, which means they’ve opened up more data in
the last 6 months than has existed in the last 10 years.
The Structured Web is well on its way to becoming a
reality, and as you’ll hear from these top-level experts in the
data industry, access to it is only the first step to unlocking
its true potential. Import is merely a tool, but we hope that
it is one that will act as a catalyst for change, innovation
and growth. In this paper we’ve brought together some of
the brightest minds in the data space as a way of inspiring
you to keep finding new and innovative ways to use data.
GECKOBOARD
GROWING INDUSTRIES THROUGH DATA
Big Data Never Sleeps. Too many organisations fly
blind when it comes to data. Yet using it smartly can both
transform businesses and boost the wider economy. While
there is still very much a place in business for intuition, instinct
and eureka moments, it’s a certainty that a decision made with
even a small amount of data is substantially better than one
made without any at all.
It’s estimated that the world generates 2.5 quintillion bytes of
data every day, with 90% of it created in the last two years
alone [1]. This deluge of machine-readable ones and zeroes
streams from every corner of daily life and work, growing
steadily richer and more voluminous. But while data’s
transformative power in fields ranging from government to medicine
and finance to business is well documented, the truth is that
the overwhelming majority of companies are more or less in
the dark about how to use the information they collect.
Torrents of data, in themselves, are of little value - they are
white noise, without signal. In order to sift through and analyse
them, to find nuggets of insight or correlations within them,
data scientists or expensive enterprise software are required.
But few organisations have access to either. So instead, as the
information accumulates, many companies are simply fro-
zen like rabbits trapped in headlights.
It’s no wonder then that research by IBM [2] recently found
that 71% of CMOs they interviewed said they were “unpre-
pared to manage the Big Data Explosion” - and that despite
the fact that 54% of “outperforming CEOs [say they] distin-
guish their organisations by ensuring increased access to
data”. The truth of the matter is that in the digital econo-
my, running businesses without access to data is a bit like
flying blind. Being equipped with real-time, accurate and
meaningful data means problems can often be spotted before
they occur, plans can be put in place and, of course,
bottom lines boosted.
If a retail business, for example, has a sudden spike in traf-
fic on its website, having data at its fingertips means that
the height and length of that surge can be increased to
maximise its effect. Knowing about it even hours (let alone
a day or two) later is probably going to be too late.
To create what we call ‘informed businesses’, three things
need to happen. First and most fundamental, the data has
to be there, which means reliably collecting the right raw
material, at the right frequency so that it’s ready for use.
Second, an organisation needs the correct tools in order
to be able to sift and analyse the data - transforming
it into a form human beings, rather than computers,
can use. Third, an organisation needs to have a culture in
place in which data informs its entire decision-making
process.
A truly informed business has each of these elements,
which ensures data flows directly to people, instead of
them having to dig it out for themselves. To do that, an
organisation needs first to establish which datasets
it requires in order to improve what it does. Its goal
might be something as simple as enhancing brand
recognition, improving customer service response times or
selling more widgets.
Second, once they have the answers, a business needs
to decide on the metrics that properly reflect its goals at
that point in time - so that it can be guided by the data
and know whether it is on track, or otherwise, with its pri-
orities. (Of course, goals change between teams, and as
businesses evolve - so the metrics will too.) Crucially, a
data-dashboard - whether on a giant screen on the office
wall or on a device - should be limited to no more than
ten discrete pieces of information, so that a quick glance
will be enough to glean that organisation’s vital signs.
However, there will always be those for whom any data is
too much. Organisations still abound with strong-mind-
ed individuals who think they don’t need data or that they
have what it takes to make decisions without it. Indeed,
people sometimes say to me that the boardroom giants
and great industrialists of the past didn’t have access to
data and yet they built vast empires and global corpora-
tions anyway.
True, I usually reply, but if they achieved all that without
data, just imagine what they could have done with it.
In fact, nothing substitutes for cold hard numbers. Used
effectively, they boost growth in businesses - and the wider
economy - in two main ways. First, by motivating people. Providing
real-time data dashboards spurs co-workers on to improve
their performance. After all, if you see a number, it’s human
instinct to want to improve upon it. Second, data enables a
constant process of optimisation. By measuring before and after
you try something, you can accurately gauge whether the
change you made is working or not. If it isn’t, you discard it -
and iterate until you find something that does work. If it works,
then you double down on it.
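The before-and-after measurement described here can be sketched as a simple comparison of two conversion rates. The version below is a minimal illustration, not a description of any particular product: it uses a standard two-proportion z-test, and the function names, the 5% significance threshold and the extra "keep measuring" outcome for inconclusive results are all assumptions added for the example.

```python
from math import erf, sqrt


def two_proportion_z(conv_before, n_before, conv_after, n_after):
    """Return (z, two-sided p-value) for the change in conversion rate."""
    rate_before = conv_before / n_before
    rate_after = conv_after / n_after
    # Pooled rate under the hypothesis that the change made no difference.
    pooled = (conv_before + conv_after) / (n_before + n_after)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_before + 1 / n_after))
    z = (rate_after - rate_before) / std_err
    normal_cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - normal_cdf)


def decide(conv_before, n_before, conv_after, n_after, alpha=0.05):
    """Hypothetical decision rule: discard, double down, or keep measuring."""
    z, p_value = two_proportion_z(conv_before, n_before, conv_after, n_after)
    if p_value >= alpha:
        return "keep measuring"  # assumption: no clear signal either way yet
    return "double down" if z > 0 else "discard"


if __name__ == "__main__":
    # Illustrative numbers only: 200 of 10,000 visitors converted before
    # the change, 260 of 10,000 after it.
    print(decide(200, 10_000, 260, 10_000))
```

The point of the test is to avoid doubling down on a change whose apparent improvement could easily be noise; with small visitor counts, the honest answer is often to keep collecting data.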
This way, an online retailer, for example, can marginally in-
crease their conversion rate by a percentage point here and
there, right across the board. Optimise continually and that re-
tailer suddenly looks like a very different business. Repeat that
across 100,000 SMEs and you are suddenly talking about a
very different economy.
In the end, only data allows you to understand how a set of
factors shapes and drives a business. It’s the glue that holds
it all together. By comparison, everything else is just opinion.
Paul Joyce is co-founder and CEO of Geckoboard, a data
communication company. www.geckoboard.com @geckoboard
[1] Source: IBM
[2] Source: IBM Global CEO study 2012
BIGML
USING DATA TO PREDICT THE FUTURE
With enough data anything is possible. Even predicting
the future. Using advanced machine learning techniques
and predictive analytics we can start to not only identify past
trends, but also uncover future ones. At BigML, we believe
the time is right to introduce large scale machine learning to
the masses, and show all of the value that can be extracted
from data.
One of the great aspects of BigML’s machine learning plat-
form is that you can use it not only to build predictive analyt-
ics, but also to uncover hidden relationships within a dataset.
These relationships can then be examined using BigML’s in-
tuitive decision tree visualizations and related functions with-
in the BigML interface.
As demonstrated at the Data Summit London, we decided to
put over 58,000 Kindle book titles through the machine learn-
ing test to examine what features have the greatest impact
on a Kindle title’s rating.
To do this, we first leveraged import.io to pull key statistics on
the Kindle titles directly from Amazon’s website. The fields
we examined were: URL, title, author, price, save (whether or
not the book was saved), pages, book description, size, pub-
lisher, language, text-to-speech enabled (y/n), x-ray enabled
(y/n), lending enabled (y/n), number of reviews and stars (the
rating).
The resulting data set was put into a ~70MB .csv file, which
we quickly uploaded into BigML. This data source includes
text, numeric and categorical fields. The ability to analyze
free text alongside other types of fields is one of the
features that sets BigML apart in the marketplace.
For the model, we tried various iterations of the data, before
settling upon the following fields: price, pages, description,
lending, and number of reviews. We were then able to gener-
ate a decision tree from which we could easily analyse and
identify patterns in the data.
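To give a feel for how a decision tree surfaces the fields that matter most, here is a minimal sketch of its first step: finding, for each numeric field, the single split that best separates high ratings from low ones. This is not BigML's API or algorithm. It is a plain-Python one-level regression tree (a "decision stump") scored by reduction in squared error, and the handful of rows are invented for illustration rather than taken from the Kindle dataset.

```python
from statistics import mean

# Hypothetical toy rows: (price, pages, reviews, stars). Not real Kindle data.
ROWS = [
    (0.99, 120, 5, 3.0),
    (2.99, 300, 8, 3.2),
    (9.99, 150, 400, 4.5),
    (1.99, 320, 350, 4.4),
    (4.99, 200, 10, 3.1),
    (7.99, 280, 500, 4.6),
]
FIELDS = ["price", "pages", "reviews"]


def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty group)."""
    if not values:
        return 0.0
    centre = mean(values)
    return sum((v - centre) ** 2 for v in values)


def best_split(rows, field_index, target_index=-1):
    """Return (threshold, error_reduction) for the best binary split on one field."""
    base = sse([r[target_index] for r in rows])
    best = (None, 0.0)
    values = sorted({r[field_index] for r in rows})
    # Candidate thresholds are midpoints between neighbouring distinct values.
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [r[target_index] for r in rows if r[field_index] <= threshold]
        right = [r[target_index] for r in rows if r[field_index] > threshold]
        gain = base - (sse(left) + sse(right))
        if gain > best[1]:
            best = (threshold, gain)
    return best


def rank_fields(rows, fields):
    """Rank fields by how much their best split reduces rating error."""
    scored = [(name, *best_split(rows, i)) for i, name in enumerate(fields)]
    return sorted(scored, key=lambda item: item[2], reverse=True)


if __name__ == "__main__":
    for name, threshold, gain in rank_fields(ROWS, FIELDS):
        print(f"{name}: split at {threshold}, error reduction {gain:.2f}")
```

A full decision tree repeats this search inside each resulting group, and a platform such as BigML also handles text and categorical fields, which this sketch leaves out.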
With only a cursory analysis, by simply mousing through
the decision tree we were able to uncover a variety of in-
sights on which fields lend themselves to positive or less
positive Kindle reviews. From a practical standpoint, a
publisher could build a similar model and factor it into its
decision-making before green-lighting a book.
This was a fun exercise which also demonstrates that
machine learning and decision trees can be informative
beyond the predictive models that they create. It stands
to illustrate an important point: the amount of data is sur-
passing the human skill necessary to analyze it.
Cheap cloud computing and storage services have led to
a huge increase in the amount of data accumulated. Com-
panies with large numbers of analysts can gain valuable
insight from their data. However, not everyone can afford
such a team, let alone find the individuals with the right
skill. There is a strong need for new services (such as
BigML) that put predictive power in the hands of many.
INFOGR.AM
THE IMPORTANCE OF DATA LITERACY
The ability to convincingly communicate one’s opin-
ion in public as well as critically evaluate the messages of
others is vital for both democracies and businesses.
In democracies, we want to empower everyone, not only the
rich and the educated, to defend their interests with argu-
ments and stories. We also want everyone to have highly
developed critical thinking skills, so that they are not swayed
by faulty arguments and appealing messages that are not
based on reasoning and evidence.
In businesses, we want to make sure all employees can com-
municate their ideas and their views about various business
questions. We want business organisations that are based
on speaking up, transparency and management teams that
do not ignore the views of employees, yet are able to critical-
ly evaluate their validity.
WE NOW LIVE IN A WORLD OF DATA. The explosion of data collection and
analysis technologies has made a lot of previously unan-
swerable data questions answerable. For example, take
global warming. What is the exact extent of global warm-
ing? What are the economic, environmental and social
costs of global warming? What is the likely impact of differ-
ent interventions? Which parts of the world are likely to be
more affected? All of these are questions that have to be
answered with data.
It is important to stop and think about what this emergence
of data means for democracies and businesses. It all boils
down to the same two principles - making sure all people
have the ability to communicate their views with data as
well as critically evaluate the data stories of others.
We at Infogr.am have thought long and hard about how to
solve the first problem - creating a tool that empowers peo-
ple to tell convincing stories with the help of data. Infogr.
am has created a tool for building beautiful, shareable in-
fographics. We launched two years ago and already have
more than 1.2m registered users with more than 2.5m info-
graphics created on our site.
The vision of Infogr.am is that any tool that empowers people
to tell data stories should be extremely simple to use,
freely available to everyone, beautiful and social. We at In-
fogr.am think that we have managed to create a tool that
embodies these four principles. The positive response from
our users - including The Verge, TechCrunch, The Next Web,
BBC and others - also suggests that we are moving in the
right direction.
A problem that is still largely unsolved is critical thinking when it
comes to data - data literacy. Most people can still be tricked by
simple things like cherry-picked data, disproportionate scaling
of x- and y-axes and other misleading tactics. Of course,
education is part of the solution, but it cannot stand alone. We
need to develop tools that nudge people in the critical direction
when they see data; otherwise it becomes too easy for politicians
or marketers to manipulate people with data for personal gain.
I encourage everyone reading this paper to think long and hard
about how they can contribute to the spread of data literacy.
CHARLES ARTHUR, THE GUARDIAN
DATA PRIVACY CONCERNS US ALL
In an age where information is power, data has be-
come the ultimate currency. In today’s online society, people
(often the younger generation) are increasingly putting their
lives online - they post information about who they are, where
they are, what they’re doing and who they’re with.
The problem with this is that many of them do not realize that
it isn’t just their friends who are seeing this information. For
example, the popular online-content site, Buzzfeed, collects 49
different variables about you as a visitor to their website. And
they are certainly not the only ones.
The truth of the matter is that your data is valuable. Very valu-
able in fact. Data points that are individually innocuous
can become enormously powerful and revealing when
aggregated - that is the essence of Big Data.
And this isn’t necessarily a bad thing. If Amazon could
always guarantee you the lowest price on your online
shopping, would you be happy to share your personal
information with them? The answer probably depends on what
data they were asking for, but my guess would be that for
most people the answer is yes. To some extent, the more
companies understand about you the better they can
serve your needs by providing you with exactly what you’re
looking for.
It’s also what allows services like Google and Facebook
to remain free to use. If they couldn’t sell your data (or at
least the aggregate of it) to advertisers and other compa-
nies, then they’d have to start charging you. Would you
look everything up on Google if you had to pay $0.05 per
search? Probably not.
The real problem isn’t that companies are accessing your
data, it’s that many of them aren’t doing a good job of
stating their data-usage intentions. The growth of big data
presents a greater need for data collectors to provide full
disclosure, by alerting their visitors to what data is being
collected and what it’s for.
If privacy is going to exist in the world of big data it’s going
to require both transparency and literacy. People need to
be aware of when they are giving their data away and what
it is for. They need to be conscious of the fact that it is a
tradeoff. Yes, companies will use your personal data, but
you can also benefit from things like targeted marketing
or recommendations.
At the same time, as people become more aware of giving
away their personal data, companies will need to become
more transparent about how they are using that data to
make sure that the tradeoff of data for a service remains
beneficial to both parties. Eventually, this will probably
lead to some type of opt-in or opt-out system, similar to that
used for placing you on email lists.
Online data privacy isn’t a one-way street. It requires
changes on the part of both the individual and the website.
ANDREW FOGG, IMPORT.IO
WEB DATA IN GOVERNMENT
Governments provide a large variety of programmes
and services, which both produce and require massive
amounts of data, often unstructured and increasingly in re-
al-time. As such, governments have a responsibility to make
sure that they are using data efficiently and openly.
In a digital age, data is a key resource for social and econom-
ic activities. Everything from finding your local post office to
starting a company requires access to data, much of which
is created or held by our governments. By opening up data,
governments can help drive the creation of innovative busi-
nesses and services that deliver social and economic value.
The Open Data movement has centered largely around con-
vincing governments to make their data publicly available in
a format that is useable. This movement is vital because it
will allow citizens to more easily hold our governments to
account, improve transparency and contribute to economic
growth.
Even more critical, however, is that governments embrace
the web as a source of data. Web data can enable govern-
ments to do existing things more cheaply, do existing things
better and do new things that they do not currently do. By
accessing web data sources, governments can be more effi-
cient, save money, identify fraud and help public bodies bet-
ter serve their citizens.
Recent developments in big data analysis have taught us that
having more data points – even if not all of them are perfect
– is often preferable to trying to collect a small but perfect
sample. By leveraging the vast amount of data on the web,
governments can make better decisions and more accurate
predictions. And because web data is live data, web data can
help governments make decisions faster.
For example, the UK government now includes the sale of
drugs and prostitution in the calculation of gross domestic
product (GDP). The Office for National Statistics (ONS)
estimates that prostitution adds a whopping £5.314 billion to
the economy (0.4% of GDP).
The calculation of this estimate is based on a single survey
conducted 10 years ago that tried to count the number of fe-
male prostitutes in London. Without access to web data, the
ONS had no choice but to use this single study and a bunch
of assumptions in order to extrapolate from the number of
female prostitutes in London to the total number of prosti-
tutes in the whole of the UK.
Using web data at Import.io, we were able to pull information
from public websites where sex workers advertise their
services and calculate a new GDP estimate that uses the most
up-to-date numbers and relies on fewer assumptions. We
found that male sex workers were missing from the ONS
numbers despite the fact that they constitute 42% of the
sex workers advertising online. In addition,
we found that the average price that female sex work-
ers charge for their services is double the number used
by the ONS in their calculations. Using our new data we
were able to re-run the ONS calculations and estimate that
prostitution contributes £12.374 billion (0.9% of GDP) to
the UK economy.
This is only a preliminary analysis but the large difference
between the two values stands to highlight the impor-
tance of having good, accurate data. You can read more
about this work at http://go.import.io/prostitution and
http://go.import.io/genderdifferences.
Web data has great potential to make governments more
efficient and to improve citizens’ lives. By incorporating a
greater variety of structured and unstructured data from
both internal and external sources, governments can
improve efficiency and effectiveness across their broad
range of responsibilities.
SALESFORCE
DATA WILL CHANGE EVERYTHING
We live in increasingly challenging times. On a global
level, climate change continues apace while our need for
energy, water and food continues to rise; we are in a
constant battle with disease; our political climate has become
overshadowed by war, terrorism and failed states; and our
economy is emerging from the largest recession to date
with increased wealth inequality. On a personal level we are
constantly bombarded with advice on how to conduct our
lives: what to eat, what to drink, how to work and how to
rest.
Solving these problems will require the answers to some
tough questions: How much of climate change is man-
made? What’s the right response? Can we solve our energy
problems by fracking? Is nuclear dead? Should we eat fat?
Or sugar? Both? Should we all be vegan in order to solve the
water crisis? What about GM foods? Will the pharmaceuti-
cal industry continue to invest in antibiotics? Can it afford
to? Can we afford not to?
Debate on these issues has traditionally been polarised, ex-
treme and often emotional. Our answers need to be based
on evidence. On data. Data that is both reliable and accu-
rate. Data from which real insight can be gained, and sensi-
ble, sustainable decisions made. Decisions that are tracta-
ble despite the polarised nature of the debates.
That is why data is so important today, and why we need
to change the way we think about data: what it is and how
we use it.
Soon everyone and pretty much everything will be connect-
ed. The amount of data this will generate is unimaginable
and ever expanding. But, unless we can harness and manip-
ulate this data, it is useless.
The real key to being able to answer these vital questions
is understanding the provenance of data, being able to val-
idate its accuracy and veracity, as well as knowing how to
extract value from it. This will require evolving ontologies
and taxonomies that allow us to do this on a global scale,
across cultures and time zones. We will also need to allow for
differences in perceptions and collection methods. Learning how
to make sense of collective data is the true challenge.
Given the mushrooming of data sources, collectors and sen-
sors, we need to get far better at filtering and routing data to
make the right data available in the right format where and
when it is needed.
Thinking differently about data - where it comes from, where
it goes, how it is collected, maintained and aggregated, what it
is used for and what it shouldn’t be used for - requires a
fundamental shift in our collective thinking. Only then will we be
able to gain insight so as to make the right decisions at the
right time in the right place.
-JP Rangaswami, chief scientist, Salesforce.com