December 2016, IDC #US41841116
IDC TechScape
IDC TechScape: Internet of Things Analytics and Information Management
Maureen Fleming, Stewart Bond, Carl W. Olofson, David Schubmehl, Dan Vesset, Chandana Gopal, Carrie Solinger
IDC TECHSCAPE FIGURE
FIGURE 1
IDC TechScape: Internet of Things Analytics and Information Management — Current Adoption Patterns
Note: The IDC TechScape represents a snapshot of various technology adoption life cycles, given IDC's current market analysis.
Expect, over time, for these technologies to follow the adoption curve on which they are currently mapped.
Managed Data Transport
Managed data transport technology picks up data from files or databases populated by the IoT
collector or historian and sends those files to the target central data processing facility.
Managed data transport technology is more likely to be used:
To support applications where batch or microbatch frequencies meet the IoT data latency requirements of the solution
As a rudimentary bridge between the collector and streaming data technologies, where incompatibility is an issue or where decoupling of the two components is desired
To periodically send data from the historian to the data stores used for the discovery and training required for analytics
We use the term managed data transport because there are underlying choices about what technology
to use. It is common to use managed file transfer (MFT) software, and it is also reasonable to use
extract, transform, and load (ETL) technology. A file sync and share service can also be used in some
applications.
Examples of software vendors and products in this category include but are not limited to Attunity's
MFT and Replicate; Axway; IBM's Sterling MFT, Aspera, and Datastage; Informatica PowerCenter;
Box; Dropbox; and Egnyte.
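To make the microbatch pattern concrete, the following is a minimal sketch, not drawn from any of the products above, of a batch transport step: it sweeps a spool directory that a collector or historian writes to and ships completed files to a central landing directory. The spool and landing paths are hypothetical, and a production MFT or ETL tool would add security, retries, and monitoring.

    import shutil
    import time
    from pathlib import Path

    SPOOL = Path("/var/iot/spool")      # hypothetical: where the collector drops files
    LANDING = Path("/central/landing")  # hypothetical: central processing facility

    def ship_batch():
        """Move every completed file in the spool to the central landing zone."""
        for f in sorted(SPOOL.glob("*.csv")):
            shutil.move(str(f), str(LANDING / f.name))

    if __name__ == "__main__":
        while True:          # microbatch loop: one sweep per minute
            ship_batch()
            time.sleep(60)

The sweep interval is the tuning knob here: it trades data latency against the size of the activity spikes the central facility must absorb.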
Pros:
Managed data transport technology is relatively mature, and many organizations with IoT projects are likely to already have ETL, MFT, or file sync and share in their portfolio. The
issue is implementing agents at the stationary or mobile edge to handle secure transport.
Managed data transport technology can also facilitate data transport from the edge to
target processing facilities if data streaming technology is not available or feasible.
Managed data transport technology can be used to decouple the collector from transport,
offering a higher quality of service in situations where network connectivity is low or unstable.
Cons:
Batch or microbatch will increase data latency between the edge and target processing facilities.
Central processing facilities may need to accommodate spikes of activity with each batch, depending on data volume.
Decoupling has benefits, but it also adds another component to the solution that must be monitored, managed, and maintained.
Depending on the software used for the implementation, a heavier footprint may be required, implying sufficient processing and persistence capacity at the edge.
Streaming Data
FIGURE 6
Streaming Data Markers of Momentum
Source: IDC, 2016
Streaming data is the transport that facilitates the flow of data from a source to a target or, in some
cases, multiple targets. Streaming data software transports data that is generated continuously and
transmitted simultaneously in small sizes (order of kilobytes). Transmission is handled by messaging
technology, by specialized agents that forward data, or through APIs that continuously post data to
deliver to a target and, in some cases, by application-level coordination of communication using lower-
level protocols, such as HTTP or MQTT. Some solutions handle streaming sensor events directly from
the sensor client through a gateway to the targets and from a central source through a forwarder to the
sensor client. Other solutions pick up sensor events from the collector, which has already converted
the protocol to an IP-compatible format. The messaging that transports data streams also may serve
as the queuing mechanism at the target to receive and queue data from multiple data streams.
Many organizations have already adopted streaming data for IoT. We list it as transformational
because it is a core component of an event-driven architecture, which in its entirety is considered
transformational.
Examples of technology in this category include but are not limited to messaging software such as
Apache's ActiveMQ, Apache Kafka, MQTT-S, RabbitMQ, and ZeroMQ. Software for posting sensor
events via REST APIs include Google Apigee Link and Red Hat 3Scale. IBM offers Bluemix Message
Hub to connect its IoT platform to IBM's Hadoop Bluemix service.
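As an illustration of transport at the sensor client, here is a minimal sketch using the open source Eclipse Paho Python client (paho-mqtt 1.x API) to publish a sensor event over MQTT. The broker host and topic name are assumptions, and a production deployment would add TLS, authentication, and reconnect handling.

    import json
    import time

    import paho.mqtt.client as mqtt  # pip install paho-mqtt (1.x API shown)

    client = mqtt.Client()
    client.connect("broker.example.com", 1883, keepalive=60)  # hypothetical broker
    client.loop_start()  # background thread handles network traffic and acks

    event = {"device_id": "pump-17", "temp_c": 71.4, "ts": time.time()}
    # QoS 1 asks the broker to acknowledge delivery at least once
    client.publish("plant/sensors/pump-17", json.dumps(event), qos=1)

    client.loop_stop()
    client.disconnect()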
Pros:
Message queuing technologies offer higher quality of service levels over base transport protocols such as HTTP and MQTT.
Message queuing technologies are not new and as such have a lower level of risk associated with them.
Cons:
HTTP and MQTT methods can result in tightly coupled systems, requiring the source to maintain history in the event of data transmission issues. Sending applications will need to manage potential breaks in network connectivity.
Message queuing services add another layer of complexity to the end-to-end solution, bringing additional requirements for monitoring, management, and maintenance.
Message queuing services may insert additional latency into the data transmission.
Streaming Integration
FIGURE 7
Streaming Integration Markers of Momentum
Source: IDC, 2016
Streaming integration technologies are used to provide intermediary functionality between the edge
and central processing facilities. Intermediary functionality may be required to perform protocol
conversion, data normalization, and/or filtering. Streaming integration technology sits between the
collector and the data stream, such as an API gateway into the data stream or a change-data-capture
component listening to a collector's local database. It could also be a component that intercepts
messages from the stream or the target message queue, processes the data, and forwards or puts the
processed data back on the stream or queue. Examples of products in this category include
StreamAnalytix, StreamSets Data Collector, and Striim.
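The intermediary role is easy to picture in code. Below is a minimal, vendor-neutral sketch of the normalize-and-filter step such products perform between collector and stream; the field names and the unit conversion are assumptions for illustration.

    def normalize(raw):
        """Convert a raw collector reading into the canonical stream format."""
        return {
            "device_id": raw["id"],
            "temp_c": (raw["temp_f"] - 32) * 5.0 / 9.0,  # unit normalization
            "ts": raw["timestamp"],
        }

    def integrate(readings):
        """Yield normalized events, filtering out-of-range values to cut stream volume."""
        for raw in readings:
            event = normalize(raw)
            if -40.0 <= event["temp_c"] <= 150.0:  # drop implausible sensor glitches
                yield event

    batch = [{"id": "fan-3", "temp_f": 98.6, "timestamp": 1480550400}]
    for event in integrate(batch):
        print(event)  # in practice: forward to the stream or target queue

Filtering at this tier is what relieves pressure on stream bandwidth and central processing capacity, as noted in the pros below.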
Pros:
Streaming integration is useful if transport and/or data protocol conversion is required
between the edge and the stream.
Streaming integration can also be useful to filter, normalize, and reduce the volume of
data, relieving the pressure on stream bandwidth and central processing capacities.
Much of the functionality is borrowed from existing segments of data and application
integration software markets, so there is low risk associated with the technology.
Cons:
Streaming integration adds more components in the end-to-end solution, resulting in more points of failure that need to be monitored, managed, and maintained, and can increase
latency in the data transmission process.
IoT Data Event Services
Thing Event Store
FIGURE 8
Thing Event Store Markers of Momentum
Source: IDC, 2016
Event stores capture and organize sensor data, adding to the store when new sensor data is delivered.
A key attribute is the creation timestamp of the sensor event. Event stores are also created when
streaming analytics is deployed. The event store can be queried by end users, applications, and time
series–based analytical software. Event stores can also be used to backstream for testing and auditing
purposes. Event stores are offered by some vendors as part of their IoT portfolio and also can be
implemented using an in-memory time series database, a data grid, or a general-purpose database that supports time-stamped writes and time-range queries.
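As a minimal sketch of the pattern, assuming a general-purpose database is used, the following appends timestamped sensor events to SQLite and runs the kind of time-range query that time series–based analytical software would issue. The table and column names are illustrative.

    import sqlite3
    import time

    db = sqlite3.connect(":memory:")  # stand-in for a real event store
    db.execute("""CREATE TABLE events (
        ts REAL, device_id TEXT, metric TEXT, value REAL)""")

    def append_event(device_id, metric, value, ts=None):
        """Append-only insert keyed by the event's creation timestamp."""
        db.execute("INSERT INTO events VALUES (?, ?, ?, ?)",
                   (ts or time.time(), device_id, metric, value))

    append_event("pump-17", "temp_c", 71.4)
    append_event("pump-17", "temp_c", 74.9)

    # Time-range query, as an end user or analytics tool might issue
    since = time.time() - 3600
    rows = db.execute(
        "SELECT ts, value FROM events WHERE metric = ? AND ts >= ? ORDER BY ts",
        ("temp_c", since)).fetchall()
    print(rows)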
Thing Registry
FIGURE 9
Thing Registry Markers of Momentum
Source: IDC, 2016
A thing registry maintains a central record of the things connected to the network and their attributes.
Examples of products include Amazon's AWS Thing Registry, Bosch's IoT Things and IoT Remote
Manager, GE Digital's Predix Edge Manager, IBM Watson IoT Platform Foundation Device
Management, Microsoft Azure's IoT Hub Device Identity Registry, and PTC's ThingWorx Foundation.
Pros:
A registry provides a central repository of things connected to the network.
A registry can be used for analytics of lifetime, runtimes, service history, and inventories.
A registry can be used to identify location for MRO (maintenance, repair, and operations).
Cons:
The registry will need to be maintained, and unless the things themselves are providing the data for the attributes in the registry, manual maintenance could become
overwhelming.
Thing State Machine
FIGURE 10
Thing State Machine Markers of Momentum
Source: IDC, 2016
A thing state machine maintains the current status of a thing's sensors. While the thing registry
maintains static information about a thing, the state model maintains the current status of information.
Depending on the complexity, a thing state model may also consist of a series of state models.
Depending on product capabilities, state machines can consist of direct sensor readings as well as
calculated — or derived — state. This derived state may also use analytics to arrive at the state, for
example, scoring the status of a derived property in the state model. Using an event-driven
architecture built around publish-and-subscribe provides a way for multiple thing state models to
subscribe to the same sensor data event, depending on the use case. State models may also be
propagated from edge to cloud and across clouds. Ultimately, state models provide the thing state data
required for custom and packaged IoT-related applications.
Not all IoT platforms have a state model construct and may choose to store state data in a time series
database or a relational database. Depending on the complexity of the use case, enterprises may
choose to build their own state models using NoSQL database technology.
Examples of products include Amazon's Device Shadows for AWS IoT, PTC's ThingWorx Thing Model,
and Salesforce's Thunder and IoT Cloud.
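To illustrate the construct, here is a minimal sketch, independent of the products above, of a thing state machine that holds current sensor readings plus a derived state computed from them. The thresholds and property names are assumptions; a real implementation might use an analytic scoring model for the derived property.

    class ThingState:
        """Current state of one thing: raw sensor readings plus derived status."""

        def __init__(self, thing_id):
            self.thing_id = thing_id
            self.readings = {}          # latest value per sensor
            self.derived_status = "unknown"

        def apply_event(self, sensor, value):
            """Update state from a new sensor event and recompute derived state."""
            self.readings[sensor] = value
            self.derived_status = self._derive()

        def _derive(self):
            # Derived state: a simple rule standing in for a scoring model
            temp = self.readings.get("temp_c")
            if temp is None:
                return "unknown"
            return "overheating" if temp > 90.0 else "normal"

    pump = ThingState("pump-17")
    pump.apply_event("temp_c", 95.2)
    print(pump.thing_id, pump.derived_status)  # -> pump-17 overheating

In a publish-and-subscribe design, several such state objects could subscribe to the same sensor event and derive different properties per use case.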
Pros:
The thing state model is an important asset in an event-driven architecture and for low-code environments, particularly for application development and where nontechnical subject matter experts (SMEs) are developing condition detection and response logic.
The thing state model makes it easier to distribute sensor data to all systems that need the data, particularly in decentralized systems where the design of the system has multiple
tiers managed by different vendors or products, such as an edge tier or a middle tier for machine-specific use cases or an interaction tier for customer experience–centric use cases, where there is an advantage in splitting up the design based on assets required in
each tier.
Cons:
Not all IoT platforms have this capability and may require internal skills to develop and manage on an ongoing basis.
Not all organizations working on IoT projects are structuring around events; some may be more comfortable using more familiar databases.
IoT Data Services
Dynamic Data Management
FIGURE 11
Dynamic Data Management Markers of Momentum
Source: IDC, 2016
A dynamic data management system can accept data without requiring that the structure and elements
of the data be defined in advance. These include scalable data collection managers (the most common
being Hadoop) and dynamic DBMSs. Because they do not require the use of SQL, dynamic DBMSs
are sometimes called NoSQL database systems. There are two categories of dynamic DBMS:
Semischematic, where the data may be governed by a schema, but one is not required (Any
data may be entered into the database that conforms to the general data format of the DBMS if no schema is present. If a schema is present, it governs the data and optimizes database operation on that basis.)
Nonschematic, where no schema is required, and any data conforming to the general format of the database may be added
The resulting collection of data may end up being rationalized under a schematic structure (in the case
of semischematic), mapped on the basis of field names and values or simply accessed by means of
key-value pairs. Types of dynamic data management systems include:
Document-oriented database systems: Document-oriented database systems manage data blocks containing fields that are identified according to a generally accepted document format. The two most common such formats are Extensible Markup Language (XML) and JavaScript
Object Notation (JSON). Examples of products include Amazon DynamoDB, Couchbase, IBM Cloudant, and MongoDB.
Key accessible database systems: Key accessible databases are nonschematic and store data in a way that supports random retrieval by key value or retrieval in key-value order. They are not true database management systems because they merely facilitate the storage and
retrieval of data according to certain optimized techniques but do not actually manage the database per se — the applications do that. Examples of products include Amazon SimpleDB, Apache HBase, Basho's Riak, and Oracle NoSQL Database.
This category also includes graph databases and Hadoop, which are covered separately.
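The defining behavior, accepting data whose structure is not declared in advance, can be sketched in a few lines. The example below is a toy nonschematic document store, standing in for none of the products above: records with different shapes are accepted as-is and retrieved by key.

    store = {}  # toy nonschematic store: key -> document

    def put(key, document):
        """Accept any dict-shaped document; no schema is declared or enforced."""
        store[key] = document

    # Two records with different shapes coexist without schema migration
    put("pump-17", {"type": "pump", "temp_c": 71.4})
    put("truck-9", {"type": "vehicle", "lat": 42.3, "lon": -71.4, "fuel_pct": 61})

    print(store["truck-9"])            # random retrieval by key
    print(sorted(store))               # retrieval in key order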
Pros:
Faster, more flexible way to manage data, particularly data structures that change rapidly or do not lend themselves to an RDBMS
Low-latency response times
High scalability
Cons:
This technology can't be used for applications that query using SQL.
There are skills gaps compared with SQL-based systems.
Graph Database
FIGURE 12
Graph Database Markers of Momentum
Source: IDC, 2016
Graph DBMS software manages data as graph structures. These contain objects sometimes called
"nodes" or "vertices" with recursive attributed relationships, sometimes called "edges." The attributes
of the objects and relationships are called "properties." Unlike a fully schematic database, the structure
of a graph database is derived from the relationship structure that is found in the instance data.
Graph databases are used to capture and analyze extremely complex relationship instance structures.
For example, a thing registry could logically be built in a graph database to make it easier to show
relationships between things and networks of things as well as data flows. Graph databases are also
used to support some types of machine learning.
Graphs are especially useful for discovering previously unknown or little understood relationships.
These relationships can include those arising from behavioral patterns or coincident patterns of
change. With respect to connected devices, these could be such things as tracking customers through
shopping areas using their cell phone location data and correlating this tracking data with that of others
to find useful patterns.
Another example comes from the automotive industry where new cars are heavily instrumented,
regularly transmitting data about the condition of the engine and various wear on parts of the vehicle.
Combine that with geospatial data, and the geolocation data from vehicles with coincident data about
weather and traffic conditions, and it becomes possible to find patterns of relationships between
engine and drivetrain wear, fuel consumption, and various combinations of weather (hot versus cold
and dry versus wet) and traffic (heavy versus light). These patterns, in turn, may be analyzed to a level
of detail that can better inform maintenance service intervals for specific locales and even future
design changes.
Examples of products include Neo Technology's Neo4j, IBM's Bluemix Graph, Blazegraph, Ontotext
Graph DB, OrientDB, Objectivity's ThingSpan (formerly known as InfiniteGraph), and DataStax's DSE
Graph.
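For a feel of the graph model, the sketch below uses the open source networkx library (an in-memory graph toolkit, not a graph DBMS) to build a tiny thing-to-gateway graph and traverse its relationships. The node names and the relation property are invented for the example.

    import networkx as nx  # pip install networkx

    g = nx.Graph()
    # Nodes are things; edges carry a relationship property
    g.add_edge("pump-17", "gateway-A", relation="connected_to")
    g.add_edge("fan-3", "gateway-A", relation="connected_to")
    g.add_edge("gateway-A", "plant-cloud", relation="forwards_to")

    # Traversal: everything attached to pump-17's gateway
    gateway = next(iter(g.neighbors("pump-17")))
    print([n for n in g.neighbors(gateway) if n != "pump-17"])

    # Path discovery: how data flows from the thing to the cloud
    print(nx.shortest_path(g, "pump-17", "plant-cloud"))

Note that no relationship structure was declared up front: the graph's shape emerges from the instance data, which is the property the section above describes.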
Pros:
Unlike other NoSQL DBMSs, a graph DBMS is driven by instance relationships and so makes analysis of patterns and combinations of relationships relatively easy and fast. Unlike an RDBMS, which requires data to conform to a fixed relationship structure, a
graph database reveals the relationships inherent in the data, with very little preparation ahead of the data load.
Because actions and consequences in a complex system generally result in changes to data relationship patterns, graph databases can help drive machine learning and other AI-related operations.
Cons:
Because graph databases can make no assumptions about relationships and patterns of relationships in the data, preloading query optimization is not possible. This is different from an RDBMS, where the relationship structures are fixed in the schema, so query plans are typically optimized. This means that the work of graph databases must be focused on situations where relationship pattern discovery is primary; it is not a substitute for an RDBMS. Because of the overhead involved in relationship management, it is also not a substitute for the relatively simple object-by-object processing of a document-oriented database system (e.g., JSON or XML).
Not all graph databases are good at all graph workloads. Some graph databases do text graphing well but fall down with large volume object graphs. Some graph databases are better for relationship traversal (such as finding all objects with at least a fifth-degree
relationship to a given object), while others are good at statistical patterns based on large numbers of related objects.
This area is still evolving. There is no one standard graph query language (such as SQL for relational), although TinkerPop is emerging as a framework, and Gremlin is its
language. Neo4j offers a language called Cypher. SPARQL is sometimes used for graphs that represent semantic information structures. GraphQL is a graph data access method that uses a RESTful API, though its name would suggest a query language. There are
various efforts under way to develop a common query language.
Hadoop
FIGURE 13
Hadoop Markers of Momentum
Source: IDC, 2016
Apache Hadoop is a cluster-based platform for the ingesting and processing of large volumes of data
using a massively parallel processing (MPP) approach. It exists through a group of closely related
Apache open source projects that provide software to manage the cluster and handle the consolidation
of result data across the cluster and various administrative functions. Closely related to Apache
Hadoop are HDFS, which acts as a cluster-based file system, and HBase, which runs on top of HDFS
and acts as a key-value store (a simple NoSQL database that randomly stores and retrieves blocks of
data based on unique key-value pairs). Also commonly used in this context is Apache Hive, a facility
for defining the data in HBase for retrieval using standard SQL.
The normal mode of processing data, especially new data, in Hadoop is a programming technique
called MapReduce. For IoT and machine learning cases, MapReduce has fallen out of favor as more
users are turning to the high-speed in-memory processing of Spark, either coding natively or in
conjunction with a query processing layer such as Spark SQL. Apache Spark is described in the data
services section under in-memory data processing.
Hadoop is commonly used in the following ways:
As an initial ingest engine, accepting data as well as ordering, filtering, and formatting it and then delivering a subset for further processing either in HDFS or on another platform
For the one-time or limited frequency analysis of very large amounts of data
For the long-term storage of data that ought to be retained but is accessed only occasionally
As a clearinghouse or transformation platform as data is moved from system to system, sometimes as a substitute or replacement for an extract, transform, and load facility
As a combination of the aforementioned bullets, commonly called a "data lake"
Apache Hadoop may be downloaded and used directly from the Apache website, but this requires
considerable technical expertise and a willingness on the part of the enterprise to act as its own
software tech support organization. Most enterprises choose instead to use a commercial packaged
distribution of Hadoop, which comes with advanced management tools, professional support, and
regular software updates ready to install.
Examples of commercial Hadoop distributions include Cloudera Enterprise, Hortonworks Data
Platform (HDP), and MapR Converged Data Platform (which includes an indexed file system called
MapR-FS and its companion NoSQL DBMS MapR-DB as substitutes for HDFS and HBase). Also, IBM
bundles Hadoop into IBM BigInsights, Oracle bundles it in Oracle Big Data Appliance (OBDA), and
similarly, Microsoft offers HDInsight. Amazon offers an AWS-optimized variant called Elastic
MapReduce (EMR).
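The MapReduce technique itself is simple to sketch. The pure-Python toy below mimics the map, shuffle, and reduce phases over sensor events to compute a per-device maximum; a real Hadoop job would distribute each phase across the cluster.

    from collections import defaultdict

    events = [("pump-17", 71.4), ("fan-3", 40.1), ("pump-17", 95.2)]

    # Map: emit (key, value) pairs
    mapped = [(device, temp) for device, temp in events]

    # Shuffle: group values by key (Hadoop does this across the cluster)
    groups = defaultdict(list)
    for device, temp in mapped:
        groups[device].append(temp)

    # Reduce: collapse each group to a result
    result = {device: max(temps) for device, temps in groups.items()}
    print(result)  # -> {'pump-17': 95.2, 'fan-3': 40.1}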
Pros:
Is ultimately flexible and scalable; can accept any data of any size because the processing
details depend on code.
Is cost effective as a storage platform for huge amounts of searchable data, which is
particularly useful for IoT long-term storage of sensor event data
Supports IoT discovery and training, which is critical to the ultimate success of IoT projects
but is not part of an IoT platform
Cons:
Hadoop applications must be coded. There is no schema and no optimizer. The user is responsible for the maintenance of the system and must do work that DBMSs normally do,
such as data structure management and access optimization.
This is a batch-oriented system, so real-time processing of streaming data is not possible.
Where streaming data is involved, it is usually a companion to some stream data processing engine, serving as a back-end storage facility for later processing of historical data after the fact.
Hadoop in its native form is not suitable for random data update and so should not be considered for transaction processing.
In-Memory Data Processing
FIGURE 14
In-Memory Data Processing Markers of Momentum
Source: IDC, 2016
In-memory data processing platforms enable large-scale data-centric operations to be carried out
entirely in memory, without reference to storage. This sometimes takes the form of loading the data
from some source (such as a database) into memory and maintaining it there for analytic query
processing. It can also take place by managing the data in memory as a database, using snapshots
and logs, or replication to prevent data loss in case of system failure.
The most common of the former type of in-memory data processing platform is Apache Spark. This
facility is run on a cluster, holds data in memory, and performs MPP-based queries on the data. It is
optimized for speed. Spark is most commonly deployed on a Hadoop cluster, using the HDFS (or
HBase) layer for its storage, but it is also run on top of the wide column database, Apache Cassandra,
and can even run on its own clusters. This last configuration is becoming more and more common on
AWS, where it uses the S3 layer for its storage.
Spark is popular for data operations on large data collections where an outcome is expected
immediately or nearly immediately or to speed up time-consuming analytics training. This contrasts it
with Hadoop MapReduce, which is not typically used for interactive query because of the batch nature
of its processing. Spark is also used to collect streaming data, making it available for nearly immediate
use.
Examples of in-memory data processing include Apache Spark, Apache Flink, Apache Ignite,
Databricks, and GridGain. In addition to the Hadoop distributors listed in the Hadoop section, there
are many commercial Spark distributions. Databricks is a pure-play Spark distributor.
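As a minimal illustration of the Spark programming model, this sketch uses PySpark's DataFrame API to aggregate sensor readings in memory. It assumes a local Spark installation, and the column names are invented for the example.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("iot-sketch").getOrCreate()

    readings = [("pump-17", 71.4), ("pump-17", 95.2), ("fan-3", 40.1)]
    df = spark.createDataFrame(readings, ["device_id", "temp_c"])

    # MPP-style aggregation runs across the cluster, entirely in memory
    df.groupBy("device_id").agg(F.avg("temp_c").alias("avg_temp")).show()

    spark.stop()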
Pros:
In-memory data processing is much faster than Hadoop MapReduce and is assuming increasing amounts of the latter's workloads.
Spark has a range of other projects and a growing ecosystem around it that are designed to add value and functionality to the basic platform. These include MLlib for machine
learning, GraphX for graph support, Spark Streaming for streaming data ingestion, and Spark SQL. There are also examples of using Spark in combination with GPUs to speed up model training, particularly for highly complex use cases.
Cons:
Like Hadoop, Spark and similar products require a lot of hand coding to make solutions work.
This category is still evolving. Spark, in particular, is evolving rapidly, and new versions are not always fully compatible with previous versions, which means that some adaptation of applications to successive versions of Spark may be necessary.
In-Memory Relational
FIGURE 15
In-Memory Relational Markers of Momentum
Source: IDC, 2016
In-memory relational technology is found in memory-optimized RDBMSs (i.e., they are optimized
for the management of data in memory as opposed to in storage). Some of these databases are
designed mainly for transaction processing, some mainly for analytical processing, and some do both.
Typically, the analytic RDBMSs in this category are columnar, and most use a compression technique
that not only saves memory but also ensures that the data is organized optimally for query processing,
enabling the entire microprocessor data cache to be used with data test operators (e.g.,
equals, not equals, greater than, and less than). This makes single instruction, multiple data
(SIMD) operations possible, greatly increasing processing speed. RDBMSs that mainly process
transactions typically hold the data in rows. Those that handle mixed workloads may hold some data in
rows, some in columns, or in some cases, other formats designed to minimize instructions and
memory access.
Some of these in-memory relational databases can accept streaming data at speed, allowing queries
that include current and previously collected data to execute on a very timely basis. Other databases
are simply designed to process transactions very quickly or support complex queries very quickly. All
of these RDBMSs use various techniques including persistent transaction logging and snapshotting to
ensure recoverability so that data loss is no more a concern with them than with storage-based
RDBMSs.
Examples of in-memory RDBMSs include Altibase, deepSQL, MemSQL, Oracle TimesTen, SAP
HANA, and VoltDB.
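To show the familiar-SQL-at-memory-speed idea, the sketch below uses SQLite's in-memory mode as a stand-in; the commercial in-memory RDBMSs above add columnar storage, SIMD execution, and the durability mechanisms (logging, snapshotting) that this toy omits.

    import sqlite3

    db = sqlite3.connect(":memory:")  # the entire database lives in RAM
    db.execute("CREATE TABLE readings (device_id TEXT, temp_c REAL)")
    db.executemany("INSERT INTO readings VALUES (?, ?)",
                   [("pump-17", 71.4), ("pump-17", 95.2), ("fan-3", 40.1)])

    # Standard SQL over in-memory data: which devices are running hot?
    hot = db.execute("""SELECT device_id, MAX(temp_c)
                        FROM readings GROUP BY device_id
                        HAVING MAX(temp_c) > 90""").fetchall()
    print(hot)  # -> [('pump-17', 95.2)]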
Pros:
SQL is the most commonly understood query language in the IT world, and these products are optimized for it.
In-memory relational technology delivers speed with structure in a familiar format.
Cons:
Requires the data to conform to the schema of the database, so it is really only usable where the data is well understood and its format does not change much
Requires systems with large amounts of memory, which could be a cost concern
Open Data Platform
FIGURE 16
Open Data Platform Markers of Momentum
Source: IDC, 2016
Cons:
Open data platforms are early in their development and deployment. Many of these offer a set of capabilities that need to be assembled for technical and business use cases, and assembly may not be trivial. As these platforms become more widely used, standard architectures and
best practices will emerge, but for now, this represents a high-risk component.
Nontrivial assemblies lead to complex monitoring, management, and maintenance.
IoT Value-Added Data Services
Blockchain
FIGURE 17
Blockchain Markers of Momentum
Source: IDC, 2016
Blockchain provides a decentralized chain of trust for transactions against an object. Blockchain
originates from bitcoin, and many of the first applications of blockchain technologies are focused on
financial services: payments, equities, and money transfers. However, blockchain can be applied
beyond financial transaction use cases to provide a chain of trust for any type of transaction against
any type of object — real or virtual. The value of the blockchain is that it can be trusted, and it is
distributed, not centralized, providing full provenance of the data on the chain.
Blockchain in IoT can be used to validate that data being received from a thing is actually from that
thing and not an imposter. Likewise, instructions from a source to update a thing can also use
blockchain for validation. Blockchain can also be used to represent the most recent state of a thing,
potentially as an alternative to the thing registry and state model because the blockchain keeps an
immutable record of the history of the thing, and could represent the current state. Every trusted
application that needs access to data about the thing will have a local copy of the thing's chain. When
new blocks are added, the distributed chain is also updated. However, these are still speculations on
how the technology could be applied. There is a lot of work and innovation yet to happen before the
most appropriate use cases of blockchain in IoT emerge.
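The chain-of-trust mechanics can be sketched briefly. The toy below, which stands in for no particular vendor's implementation, hashes each block of thing data together with the previous block's hash, so any tampering with history invalidates the chain.

    import hashlib
    import json

    def block_hash(prev_hash, payload):
        """Hash the payload together with the previous hash to chain blocks."""
        data = prev_hash + json.dumps(payload, sort_keys=True)
        return hashlib.sha256(data.encode()).hexdigest()

    chain = [{"prev": "0" * 64, "payload": {"device": "pump-17", "state": "normal"}}]
    chain[0]["hash"] = block_hash(chain[0]["prev"], chain[0]["payload"])

    def append(payload):
        prev = chain[-1]["hash"]
        chain.append({"prev": prev, "payload": payload,
                      "hash": block_hash(prev, payload)})

    append({"device": "pump-17", "state": "overheating"})

    # Verification: recompute hashes; any altered block breaks the chain
    valid = all(b["hash"] == block_hash(b["prev"], b["payload"]) for b in chain)
    print(valid)  # -> True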
Blockchain in IoT is still very much in its infancy, although some vendors are releasing technology
building blocks, including the IBM Watson IoT Platform and Chronicled, which has launched an
Ethereum IoT registry based on blockchain. Slock.it is a start-up at the intersection of blockchain and
IoT applications in the sharing economy.
It is also not clear whether the blockchain technology used in bitcoin will be exactly the same as that used in
IoT use cases or whether the term will instead come to represent an ultra-secure method for recording
and validating transactions.
Data as a Service
FIGURE 18
Data as a Service Markers of Momentum
Source: IDC, 2016
Data as a service (DaaS) supplies external data that organizations can combine with their own IoT data.
For example, inclusion of weather or location data — two ubiquitous DaaS options — can enhance
predictive asset maintenance or logistics optimization processes. Organizations providing DaaS
include commercial enterprises and government agencies that generate the original raw data
and companies that locate, extract, mine, aggregate, enrich, and/or curate data for resale. There is a
broad range of data providers, brokers, and marketplaces.
In IoT, there are a handful of general-purpose DaaS, such as weather and location data, but there are
also many other specialized, industry-specific, and business process–specific data services. Examples
of data services include GE SmartSignal, Michelin solutions, Volkswagen Car-Net, IBM's The Weather
Company, Pirelli, MyJohnDeere.com, and Verizon's Precision Market Insights.
Pros:
As consumers of external (third-party) data, organizations can enhance their analytic models with the availability of more data and augment their things master data.
As producers of data or various derived value-added content, organizations have the opportunity to monetize such data assets either directly (by selling data to third parties) or
indirectly (by incorporating data into other services they provide).
Cons:
Use of external data can create additional challenges in data integration and data integrity management.
Monetizing one's data is a complex task that requires creation of a strategy and specific plans for packaging, pricing, and ongoing maintenance and delivery of such data products.
IoT Analytics
Rich Media Analytics
FIGURE 19
Rich Media Analytics Markers of Momentum
Source: IDC, 2016
Rich media analytics solutions identify objects, entities, events, attributes, or patterns of behavior
(including temporal and spatial events, either in real time or post event) through the detection,
determination, and analysis of video and image data. Use cases for these solutions include security,
object identification, video monitoring/tracking, image search, automatic alerting, forensic analysis,
image categorization, and pattern, image, and shape recognition.
The amount of rich media data that needs to be analyzed and understood is increasing exponentially
with the growth of the internet and mobile devices that capture images and videos on a more or less
constant basis. However, IDC estimates that much of this data is useless unless some type of
analytics is applied to it.
The market and opportunities for image and video analytics are growing significantly. Many
organizations would like to be able to monetize images for ecommerce. In addition, there is increased
interest in automated solutions for video surveillance — of human and nonhuman activity.
Organizations are also looking at using video and images as part of the data needed to understand
and improve customer experiences, along with social media data, geolocation information, and
transactional sales data. Video data and video surveillance are being used in a variety of ways by
many different organizations. Governments and enterprises are primary users of image and video
analytics today.
Companies offering image and video analytics include Hitachi, Fujitsu, NEC, Sony, JustVisual, HPE,
IBM, Clarifai, Cortexica, Ramp, Aventure, IntelliVision, 3VR, Accenture, and ObjectVideo.
Pros:
The exponentially increasing amount of image and video content offers an opportunity to apply rich media analytics technology to extract valuable information and knowledge.
Image and video analytics can add another dimension to text-based knowledge in diverse areas from healthcare to terrorism investigation to Internet of Things.
Image and video analytics can provide real-time feedback and information for cognitive decision making in areas such as robotics, drones, and driverless vehicles.
Cons:
This area is still emerging, and identifying objects, patterns, and visual cues correctly can be prone to errors depending on the algorithms and tools used.
Many of these tools use extensive amounts of machine learning, which is highly processing intensive.
Relating entities and objects from video to textual records and information can be challenging.
Statistical Analytics
FIGURE 20
Statistical Analytics Markers of Momentum
Source: IDC, 2016
Statistical analytics software includes packages that use a range of statistical techniques to create,
test, and execute models for analyzing IoT data. This genre falls into the advanced and predictive
analytics software segment of business intelligence and analytics tools. Sample techniques used
include descriptive and predictive analysis, regression, and clustering.
Statistical analytics is used to discover relationships in data that are hidden, not apparent, or too
complex to extract otherwise, and to make predictions when there is not enough data for other types of modeling.
An example use case in IoT is predictive maintenance, where analysis of sensor data
would provide predictions on components that will be in imminent need of maintenance.
Most statistical analytics packages use programming languages that might be proprietary or open
source or a combination. Most packages also include a graphical user interface that allows analysts to
interact with the software and build models with no or minimal programming.
Examples of products include SAS Analytics and SAS Enterprise Miner, IBM SPSS, SAP Predictive
Analytics and SAP Infinite Insight, and Oracle Data Mining. SPSS is a component of IBM Watson IoT
Platform. In addition, open source modeling languages like R are commonly used by advanced data
analysts.
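A compact example of the predictive maintenance use case, using only open source NumPy as a stand-in for the packages above: fit a linear trend to vibration readings and extrapolate when a failure threshold will be crossed. The readings and the threshold are invented for illustration.

    import numpy as np  # pip install numpy

    hours = np.array([0, 100, 200, 300, 400], dtype=float)
    vibration = np.array([1.0, 1.3, 1.7, 2.1, 2.4])  # mm/s, invented readings

    # Ordinary least-squares regression: vibration as a linear function of hours
    slope, intercept = np.polyfit(hours, vibration, deg=1)

    FAILURE_THRESHOLD = 4.0  # mm/s, assumed maintenance limit
    hours_to_threshold = (FAILURE_THRESHOLD - intercept) / slope
    print(f"predicted threshold crossing at ~{hours_to_threshold:.0f} operating hours")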
Pros:
This technology can be used where large gaps exist in data models or where data models
are incomplete.
Data can be easily imported from Excel files or other formats.
A variety of statistical techniques for analyzing data can be used. Most packages allow power users to use programming languages for complex analyses that cannot be done
with graphical user interfaces.
Cons:
These tools can be fairly complex to use. Users need to be sophisticated in concepts of statistics and data mining and programming in order to be able to take full advantage of
the capabilities of these tools.
Statistical packages typically cannot account for all factors that might affect an outcome,
especially those that cannot be expressed as structured data.
These tools are not suitable for data manipulation or data preparation. They assume that
the data is cleansed, validated, and prepared, and hence bad data will result in poor predictions.
Streaming Analytics
FIGURE 21
Streaming Analytics Markers of Momentum
Source: IDC, 2016
Examples of products in this category include Apache Kafka Streams, Apache Storm, Apache Spark
Streaming, AWS Kinesis, IBM Streams, Microsoft Azure Stream Analytics, Salesforce Thunder, SAS
Event Stream Processing, SQLStream, and TIBCO's Streambase and BusinessEvents.
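The low-latency condition detection these products perform can be sketched as a sliding-window computation. The window size and threshold below are assumptions, and real engines add richer windowing semantics, state management, and scale-out.

    from collections import deque

    WINDOW = deque(maxlen=5)   # last five readings
    THRESHOLD = 90.0           # assumed alert condition, degrees C

    def on_event(temp_c):
        """Process one stream event; detect the condition with low latency."""
        WINDOW.append(temp_c)
        moving_avg = sum(WINDOW) / len(WINDOW)
        if moving_avg > THRESHOLD:
            print(f"ALERT: moving average {moving_avg:.1f} exceeds {THRESHOLD}")

    for reading in [85.0, 88.0, 91.0, 96.0, 99.0, 101.0]:
        on_event(reading)

Because the state is just a small window, this style of detection can also run in a compact footprint at the edge, one of the pros listed below.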
Pros:
Can be used when requirements call for low-latency detection of conditions, particularly
under high data volume conditions
A central component of event-driven design that is oriented to decision support and
decision automation
Can be used in a compact way at the edge compared with other analytical techniques
Can plug in machine learning as part of a stream, supporting hybrid cognitive/programmatic use cases
Can be used for preprocessing events that need to be correlated before moving to a different analytical environment
Cons:
Some organizations may opt not to use when they have many applications that rely on mature data management systems. Instead, the organizations will improve their data refresh rates and forego event-driven design.
Popular open source–based streaming analytics software is less mature, missing many of the key elements that are present in proprietary streaming analytics.
There is a scarcity of developer skills in use of streaming analytics.
Supervised Learning
FIGURE 22
Supervised Learning Markers of Momentum
Source: IDC, 2016
Supervised machine learning begins with examples of training data paired with identifying labels (e.g.,
right or wrong and positive or negative) selected from the categories to be learned. Using these pairs
of example data and labels ("training data"), the system learns parameters of statistical models that it
can then generalize to unlabeled examples of data items that were not seen in the training data ("test
data"). In most cases, the learned models improve over time via a feedback loop that adjusts the
model parameters to better reflect additional sets of training or production data. The performance of a
learned model can be measured by simple prediction accuracy or by the particular business metric the
learned model is designed to support. Performance depends on the degree to which the training data
matches the real world, the choice of algorithm, the algorithm's parameters, and the quantity of data.
Companies like IBM, IPsoft, Wipro, Intel's Saffron Technologies, Infosys, CognitiveScale, and Tata
Consultancy Services include machine learning capabilities in their cognitive system platforms that
allow developers and enterprises to build cognitively enabled "smart" applications that learn over time.
In addition, vendors such as Google, Amazon, Microsoft, and Skytree offer commercial machine
learning libraries as standalone tools. There are also many free and open source machine learning
packages, including Apache Spark's MLlib, which is designed to make machine learning easy and
useful inside the popular Apache Spark framework for cluster computing. In addition, Microsoft
recently open sourced its distributed machine learning library, DMTK, under an MIT License.
Additional open source software includes Waikato Environment for Knowledge Analysis (Weka) and
Massive Online Analysis (MOA) from the University of Waikato and H2O.
Deep learning is a particular type of supervised machine learning based on neural network algorithms,
which has seen recent commercial success. Google released its second-generation deep learning
library, TensorFlow, to open source. Other open source deep learning libraries include Caffe from the
University of California, Berkeley; Theano from the University of Montreal; and Torch from Idiap, used
extensively by Google and Facebook as well as Weka and H2O.
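A minimal supervised learning example with the open source scikit-learn library (not one of the platforms above): train a classifier on labeled sensor snapshots, then generalize to unseen test data. The features and labels are invented.

    from sklearn.linear_model import LogisticRegression  # pip install scikit-learn

    # Training data: (temp_c, vibration) snapshots labeled 1 = failed, 0 = healthy
    X_train = [[70, 1.0], [72, 1.1], [95, 2.5], [98, 2.8], [75, 1.2], [93, 2.4]]
    y_train = [0, 0, 1, 1, 0, 1]

    model = LogisticRegression()
    model.fit(X_train, y_train)  # learn parameters of a statistical model

    # Test data: unlabeled examples not seen during training
    X_test = [[74, 1.1], [96, 2.6]]
    print(model.predict(X_test))        # predicted labels
    print(model.predict_proba(X_test))  # per-class probabilities

Retraining on fresh labeled data is the feedback loop the section describes: the model parameters are adjusted as new training or production data arrives.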
Pros:
Supervised learning algorithms can learn quickly from examples and self-correct when changing trends are reflected in new sets of labeled data.
Advances in computing power and ever expanding sources of data make advanced algorithms possible.
Significant investment by vendors and venture capital firms is leading to rapid progress.
Cons:
Finding or creating the required labeled data is costly and difficult.
A wide range of options makes vendor selection tricky. Costs range from quite inexpensive with open source to very expensive with large vendor offerings, and the less costly options entail significant internal resources to make them work.
As advanced as these products have become, there are still challenges in achieving objectives when there are large numbers of variables and interdependencies for a
particular decision.
Subject matter experts are needed to assist with the initial and ongoing review of data
training sets, which may prove costly and time consuming.
Balancing the bias-variance trade-off requires tuning learning algorithms to the amount of
available data and the discernible complexity of the function to be learned.
Low-code rules can make it complex to manage decisions as assets.
The traditional, more highly adopted rules engines can be too hard for nondevelopers to
use, and the newer, easier-to-use rules engines may be too simple for sophisticated use cases.
Low-Code App Platform
FIGURE 25
Low-Code App Platform Markers of Momentum
Source: IDC, 2016
Low-code application platforms combine development and runtime into a single offering. They typically
consist of graphical modeling environments to describe workflows, data objects, and forms; point-and-
click configurations; and relatively simple scripting. These environments are popular for rapid
development as well as development teams that include both business participants and developers.
In IoT, low-code platforms are useful for automating workflows, building mobile apps, and
assigning and managing tasks. Moving forward, low-code platforms will also be useful for
designing and automating interactions used to manage an event-based customer experience.
IoT-specific examples of products include IBM's Node-RED and PTC's ThingWorx Foundation.
Generalized low-code workflow or mobile app environments include Nintex Workflow Cloud, Alfresco
Activiti, Bonitasoft's Bonita BPM, BP Logix, Appian, and Salesforce Lightning.
Pros:
Useful for application design involving collaboration between process experts and developers
Fast development cycles
Ability to provide short-term situational apps
Cons:
May not offer the control developers need for specific use cases
Pros:
Produces a higher-level business view of sensor-supporting operations
Makes it easier to get started in IoT by identifying low-hanging opportunities where the
problems are straightforward to identify and causes are fairly well known
Shifts from reactively responding to problems to proactively identifying them to speed up
resolution
Cons:
Is supplemental rather than core IoT AIM technology
Does not incorporate advanced analytics
TECHNOLOGY ADOPTION OUTLOOK
IoT reverses traditional AIM technology adoption, which involves moving data in batches and
then normalizing and loading the data into target systems. Analytical software is used once the data is
loaded and at rest, typically to produce reports or statistical analysis that help in decision making or for
on-demand decision automation.
IoT AIM is about sensing and responding within a time window, continuously moving and managing
sensor events, and handling large volumes of data, continuous decision automation, and decision
support using analytics and rules. Data must travel from a sensor to edge collection to central
processing where it is normalized and analyzed against some type of prediction model or algorithm to
determine whether further action is required. Once actions are required, response cycle times vary
substantially, but the end-to-end cycle time must be faster than the time window allotted to derive
business benefit.
Four considerations should dominate IoT AIM technology adoption planning:
What is the total time window available to deliver business value when a condition is identified
that requires a response? Adoption of AIM technology is required when time windows are narrower than the cycle time of the end-to-end IoT system.
How good is the prediction or insight from your analytics software? Quality problems occur for a variety of reasons, but noisy predictions and wrong or nonactionable predictions are all expensive. Using the best approach to analytics for a particular problem requires an
assessment of whether there are data gaps that need to be resolved as well as identifying options and experimenting with them prior to adoption. Different techniques may also be required for different workloads or stages within a workload.
How much technical debt are you accumulating by repurposing existing AIM technology or investing in custom development? In the beginning, it makes sense to keep costs low by
leveraging existing AIM technology for an IoT project. But technical debt rapidly accumulates when existing technology doesn't really align with needs and has to be customized or contorted on an ongoing basis to make it work. As IoT initiatives are operationalized, the use
of purpose-built tools is almost always a better path once those tools reach the required level of sophistication.
How do technology choices align with your enterprise's adoption risk profiles? Different organizations have different approaches to risk. When it becomes clear that there is a need to add new functionality or replace nonperforming existing technology, the selection has to align
with the skills of the team implementing and using the technology. We assess the adoption risk
and speed of adoption for each of the 25 technologies highlighted in this IDC TechScape. Planning should take both of those factors into account. If a new technology identified in
Figure 1 has a higher risk than is acceptable to your organization but has a fast rate of market adoption, it is important to begin planning and acquiring skills sooner rather than later for eventual adoption.
LEARN MORE
Related Research
IDC's Worldwide Software Taxonomy, 2016 (IDC #US41572216, July 2016)
Internet of Things Analytics and Information Management Software Taxonomy, 2016 (IDC #US40708515, December 2015)
IDC TechScape Methodology
Unlike other technology assessment frameworks, the IDC TechScape provides a visual representation
of the process of technology adoption, dividing technologies into three major categories based on their
impact on the organization and assessing their relative maturity within their respective categories. The
study examines particular individual categories and provides additional insights about the speed of
adoption, technology potential for success (risk), and industry hype. Refer back to Figure 1 for the IDC
TechScape for Internet of Things analytics and information management.
The IDC TechScape is a tool for strategic planning and tactical decision making for technology
professionals in IT buyer organizations. This audience may include CIOs and senior technology
professionals, strategists, and IT buyers from IT or from lines of business.
This document serves two functions:
Strategic planning tool:
Offers a view into where a technology exists in its overall adoption life cycle. Generally, technologies in the early stages of evaluation and deployment are riskier investments than those further along in the adoption life cycle, as they are deployed more broadly.
Sorts technologies into three categories that may help organizations make judgements about which technologies might provide the greatest positive impact on their organization.
IT strategists can use this information to prioritize interest in a technology or group of technologies.
Tactical decision-making tool: Because it lays out where a technology exists within its overall adoption life cycle, and a certain level of associated risk may be inferred, an organization can use the IDC TechScape to determine whether or not it should immediately adopt a particular
technology or should wait until the risk of adoption is less.
IDC TechScape Categories and Definitions: Transformational, Incremental, and Opportunistic
Executives use the IDC TechScape model to:
Inform technology adoption decisions based on organizational appetite for risk and potential for transformational change
Support a decision on when a technology or group of technologies might be ready for adoption, given the purchasing organization's preferred appetite for risk — whether or not an organization should immediately adopt a particular technology or wait until the risk of adoption
decreases
The three types of adoption curves in an IDC TechScape are:
Transformational. These technologies will completely reshape markets and investment
strategies. They may create new business and/or market opportunities and lead to new enterprise and consumer capabilities. They may differ significantly from current technologies and may have mostly unrecognized market impacts/opportunities. Transformational technologies
have already demonstrated that they fundamentally change current best practices.
Incremental. This new generation of technology measurably improves on an existing category
of technologies to deliver better business outcomes. In terms of business processes, technologies deliver small but measurable improvement over current best practices.
Opportunistic. These technologies will grow based on specific use cases, and they have an undetermined or limited capability to improve existing technologies/processes. Their potential changes currently lack a clear impact on current best practices.
Synopsis
Over time, analytics and information management (AIM) technology adopted for IoT will be different
from an organization's existing technology investments that perform a similar, but less time-sensitive
or data volume–intensive, function. Enterprises will want to leverage as much of their existing AIM
investments as possible, especially initially, but will want to adopt IoT-aligned technology as they
operationalize and identify functionality gaps in how data is moved and managed, how analytics are
applied, and how actions are defined and triggered at the moment of insight. This IDC TechScape
covering IoT AIM is designed to help:
Enterprises learn more about the newer AIM technologies that support IoT
Align these technologies with an enterprise's technology risk profile to determine what is ready to adopt and what should be monitored
Gain a better understanding of where an IoT team will need to create skills and competencies as it plans to adopt newer AIM technologies
According to Maureen Fleming, vice president for IDC's IoT Analytics and Information Management
research program, "Implementing the analytics and information management tier of an IoT initiative is
about the delivery and processing of sensor data, the insights that can be derived from that data and,
at the moment of insight, initiating actions that should then be taken to respond as rapidly as possible.
To achieve value, insight to action must fall within a useful time window. That means the IoT AIM tier
needs to be designed for the shortest time window of IoT workloads running through the end-to-end
system. It is also critical that the correct type of analytics is used to arrive at the insight."
About IDC
International Data Corporation (IDC) is the premier global provider of market intelligence, advisory
services, and events for the information technology, telecommunications and consumer technology
markets. IDC helps IT professionals, business executives, and the investment community make fact-
based decisions on technology purchases and business strategy. More than 1,100 IDC analysts
provide global, regional, and local expertise on technology and industry opportunities and trends in
over 110 countries worldwide. For 50 years, IDC has provided strategic insights to help our clients
achieve their key business objectives. IDC is a subsidiary of IDG, the world's leading technology
media, research, and events company.
Global Headquarters
5 Speen Street
Framingham, MA 01701
USA
508.872.8200
Twitter: @IDC
idc-community.com
www.idc.com
Copyright and Trademark Notice
This IDC research document was published as part of an IDC continuous intelligence service, providing written
research, analyst interactions, telebriefings, and conferences. Visit www.idc.com to learn more about IDC
subscription and consulting services. To view a list of IDC offices worldwide, visit www.idc.com/offices. Please
contact the IDC Hotline at 800.343.4952, ext. 7988 (or +1.508.988.7988) or [email protected] for information on
applying the price of this document toward the purchase of an IDC service or for information on additional copies
or web rights. IDC and TechScape are trademarks of International Data Group, Inc. IDC TechScape is a
registered trademark of International Data Corporation, Ltd. in Japan.
Copyright 2016 IDC. Reproduction is forbidden unless authorized. All rights reserved.