
IT & DATA MANAGEMENT RESEARCH, INDUSTRY ANALYSIS & CONSULTING

Foundations for Data-Driven Enterprises: The Rapidly Evolving Hadoop-based Enterprise Data Hub

An ENTERPRISE MANAGEMENT ASSOCIATES® (EMA™) Product View

May 2014

©2014 Enterprise Management Associates, Inc. All Rights Reserved. | www.enterprisemanagement.com


Executive Summary

Hadoop-based solutions have helped solve many of the historical challenges of data warehousing and advanced analytical processing. When most enterprises embark on Hadoop, however, the initial use cases and number of users are often limited to a sandbox for analytics and business intelligence specialists. Recently, Hadoop has evolved, both in terms of open source Hadoop and commercial distributions, to the point where a far wider set of applications and users, in both business and IT, are able to take advantage of Hadoop-based “enterprise data hubs.”

Enterprises have used data repositories specifically for structured data and semi-structured content for decades, but enterprise data hubs house a far wider set of data types, arriving into that data hub at differing speeds, from batch to streaming. While the modern Hadoop-based enterprise data hub is different from previous-generation repositories because of the data the hub houses, it is also different from first-generation Hadoop because it incorporates the full range of enterprise-class features, including high availability, security, and better performance. The result is that enterprises are now using Hadoop for operational applications, not just for complex queries.

Some examples of business applications that take optimal advantage of enterprise data hubs include recommendation engines for marketing purposes, biometric identity management solutions, and fraud detection. In all cases, the ability of the enterprise data hub to wrap itself around what previously were data silos, and process through all that data rapidly and with reliability, makes a business-visible difference.

IT departments are also finding several uses for enterprise data hubs. For example, the long desired but seldom accomplished mainframe offloading project has a new solution candidate in enterprise data hubs. In addition, the enterprise data hub offers a dependable data facility for development and test purposes.

Every organization will find different uses and benefits for enterprise data hubs, but no organization should take the building of an enterprise data hub lightly. At a high level, the usefulness of an enterprise data hub seems clear, but the details of how to create and manage such a sophisticated data repository should not be underestimated. As organizations gain familiarity with enterprise data hubs, the usefulness and the mission criticality of the data hub will grow.

MapR, one of several vendors that offer a commercial Hadoop distribution, has taken extra pains to give its customers a full-fledged and dependable data hub. With its MapR M7 Enterprise Database Edition (“M7”), MapR includes a NoSQL database and NFS to augment the data and file management capabilities that come natively with Apache HBase and HDFS – all running co-resident to ensure low latency. MapR also includes a long list of specific data management techniques, such as mirroring and snapshots, to provide options for backup, restore, data protection, and disaster recovery in its Hadoop distribution. MapR has added security to the point where M7 is now government-class. In terms of dependability, the MapR Distribution for Hadoop eliminates the outage risk associated with NameNodes, and reduces a variety of other outage and performance risks.

For those enterprises looking to become more data-driven, and enjoy a better return on their data assets, an enterprise data hub may make enormous sense. MapR stands out as a leading provider for Hadoop-based enterprise data hubs.



The Coming of Age of the Enterprise Data Hub

When using Apache Hadoop, many enterprises experience a metamorphosis regarding how they tap into the value of their data. In the earlier, more experimental days of using Hadoop, enterprises primarily focused on tactical projects to gain insights from complex Business Intelligence (BI) and analytics queries. While that early Hadoop work proved valuable, as enterprises became more comfortable with Hadoop, and as Hadoop matured, a diverse set of data-driven operational solutions evolved. Rounded out by commercial offerings that extend the enterprise data management features of Hadoop, today enterprises enjoy perhaps unanticipated benefits associated with Hadoop that include, but go beyond, basic BI.

The increasingly popular second-generation Hadoop-based data management platform is sometimes referred to as a data hub, or data lake. These data hubs offer enterprises a sophisticated central repository, making it easier to elicit value from the myriad data flowing through their organization. For the purposes herein, to distinguish a basic data management platform used for project-oriented BI from a data management platform that delivers staying power in terms of value and lifecycle, we will refer to the newer generation as an “enterprise data hub.”

The purpose of this ENTERPRISE MANAGEMENT ASSOCIATES® (EMA™) paper is to understand the value proposition of an enterprise data hub, and to distinguish those features that define an “enterprise data hub” versus a more tactical Hadoop implementation. Two primary questions will be addressed:

1. What are some examples that illustrate the business and IT value enterprises gain through an enterprise data hub?

2. What are the technology characteristics and attributes required of enterprise data hubs?

The paper will also take a look at a particular software vendor, MapR, and what it has done to innovate its Hadoop distribution in order to make it an excellent solution candidate for enterprise data hubs.

Business and IT Examples of Hadoop-based Enterprise Data Hub Value

It is clear that Hadoop has proven successful for handling large-scale storage and data management, which enables complex BI-oriented query and analytical processing. In many cases, however, Hadoop solutions have been used by a small set of users, such as data scientists or data analysts. Thus, in many companies, Hadoop starts out as a sandbox for specialty users to gain analytical insights. With some additional innovation, and enhanced data governance and management practices, however, a Hadoop-based enterprise data hub dramatically expands the applicability of Hadoop (see Figure 1). Both the business and IT domains are finding a number of excellent uses for a Hadoop-based data hub platform, outside of data exploration. The following section illustrates examples of how Hadoop-based enterprise data hubs deliver for operational business and IT purposes.




Figure 1: First versus second generation (enterprise data hub) Hadoop. Source: Enterprise Management Associates, 2014

Business Operations

• Marketing - Unified Real-Time Advertising and Recommendation Engine: The pursuit of the “360-degree view of the customer” or prospect has been underway for decades now. The problem has always been that touch-point data are captured and gathered in silos, as are user profiles, and thus advertising and recommendation engines work with thin data slices, resulting in weak relevancy. In addition, those companies with the foresight to cross data silos often run into performance challenges, which make it impossible to analyze and spawn useful offers in a customer’s buying window.

Hadoop-based enterprise data hubs enable all the elements needed for recommendations to co-exist on the same cluster or enterprise data hub, including search engines and the various data. The result is speedier closed-loop analytical processes where the concurrence models are updated closer to real time. The net results are more relevant offers to customers while they are still in the same browser session.

• Banking - Fraud Monitoring plus Custom Offers: Fraud detection and custom product offers might seem like unrelated operations for banks, but with an enterprise data hub the hidden harmony between the two operations becomes apparent. By combining clickstream data with transactional data, the fraud analyst is able to tap into patterns of potential fraud, and evolve the fraud detection model over time. Using the same data, the marketing person is able to evolve the rules and models for custom offers. Both operations tap into the same data samples: the resulting metadata and models are inextricably linked. Also, both fraud detection and custom offerings serve mission-critical business needs, and therefore they share the requirement for enterprise-grade data management features like mirroring, file management, and Disaster Recovery (DR).




• Biometrics for Identity Management: Identity theft has become a worldwide, multi-billion dollar “business.” The old approaches of, for example, trying to secure social security data, or to control client access at the device level, have failed over and over.

Biometrically-based identity management, however, holds enormous promise for both developed and developing societies. With biometrics in place, identification card theft and counterfeiting, and the stealing of identity data, whether in a database or in stream, are almost entirely eliminated. Biometric data, however, comes in many forms beyond retinal scans, and from several sources arriving at different speeds. Equally important, identity processing using biometrics must be blazing fast to support commercial and governmental processing requirements. A Hadoop-based enterprise data hub, augmented with real-time operational and enterprise-class features, is a natural for biometric identity management; its ability to combine massive amounts of retained data and support a variety of update approaches, together with its blazing query and logging speeds, makes it a perfect fit.

• Secure Repository for Compliance: Governments, industries such as banking, and multi-nationals operate endlessly under the microscope of compliance and audit. A secure Hadoop-based repository gives such organizations the ability to effectively address compliance and even forensic data requests, with far faster lookup than through a tape-based facility buried in a hillside. Hadoop-based enterprise data hubs, enhanced with government-class security, high availability, data protection and file management, and enterprise-class DR (your company cannot lose this data after all) offer a tantalizing approach for a long-term, trusted repository.

For example, a major investment bank that had previously only kept one year of data active has now expanded to 10 years of active data through an enterprise data hub. Not only does this enable the bank to far more rapidly respond to audit and compliance requests, it also supports the generation of insights into long-term market pricing and valuation trends, and it supports customer queries that span years.
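The banking example above rests on one idea: clickstream and transaction data are joined once, and both the fraud model and the offer model read the same joined record. The following is a minimal Python sketch of that idea, not a description of any bank's or vendor's implementation. The record fields, the sample data, and both rules (`fraud_flag` and `custom_offer`) are hypothetical and deliberately simplistic.

```python
from collections import defaultdict

# Hypothetical sample records; field names are illustrative only.
clickstream = [
    {"customer": "c1", "page": "mortgage-rates", "country": "US"},
    {"customer": "c1", "page": "mortgage-calculator", "country": "US"},
    {"customer": "c2", "page": "login", "country": "RO"},
]
transactions = [
    {"customer": "c1", "amount": 120.0, "country": "US"},
    {"customer": "c2", "amount": 9500.0, "country": "US"},
]


def build_profiles(clicks, txns):
    """Join both sources by customer into one shared profile per customer."""
    profiles = defaultdict(lambda: {
        "pages": [],
        "click_countries": set(),
        "txn_countries": set(),
        "total_spend": 0.0,
    })
    for click in clicks:
        profile = profiles[click["customer"]]
        profile["pages"].append(click["page"])
        profile["click_countries"].add(click["country"])
    for txn in txns:
        profile = profiles[txn["customer"]]
        profile["txn_countries"].add(txn["country"])
        profile["total_spend"] += txn["amount"]
    return dict(profiles)


def fraud_flag(profile):
    """Toy fraud rule (assumption): browsing and transacting from disjoint countries."""
    clicks, txns = profile["click_countries"], profile["txn_countries"]
    # Require data on both sides; two empty sets would otherwise count as disjoint.
    return bool(clicks) and bool(txns) and clicks.isdisjoint(txns)


def custom_offer(profile):
    """Toy marketing rule (assumption): mortgage browsing triggers a mortgage offer."""
    if any(page.startswith("mortgage") for page in profile["pages"]):
        return "mortgage-consultation"
    return None


profiles = build_profiles(clickstream, transactions)
```

Here `fraud_flag(profiles["c2"])` and `custom_offer(profiles["c1"])` are both evaluated against the same `profiles` structure, which is the point of the example: when the joined data lives in one place, the two teams evolve their rules independently without maintaining two copies of the data.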

IT Operations

• Mainframe Offloading: The terms “mainframe offloading” and “application modernization” – also known as migrating applications off of mainframes – have been part of IT’s vocabulary for well over a decade, but progress has been slow. One reason for the slowness is that massive quantities of data are locked up on the mainframe. Companies want to retain the data, but have found no legitimate alternative to housing the data on the mainframe. IT’s search for mainframe alternatives is also driven by economics and innovation: mainframes are comparatively expensive, don’t take advantage of more recent, less expensive and faster CPUs and layered storage, and don’t readily lend themselves to modern application development tools and frameworks.

Hadoop-based enterprise data hubs, however, offer arguably the best mainframe offloading solution for data-intensive use cases. The enterprise data hub steps up to the mainframe’s strengths in, for example, availability and security, but gives enterprises a new place to house data with a richer programmatic model, and lower hardware and license costs. While a Hadoop-based enterprise data hub solution is not an alternative for mainframe-based transaction processing applications, mainframes in turn stumble over diverse data sets, complex queries, scale up/scale out data scenarios, and easy-to-manage storage – all strengths of the enterprise data hub.




• Development and Test Environments: The chief data officer’s notion of being data-driven is to establish an ongoing center of excellence, in fact a culture, of business and IT personnel who understand how to use data to their company’s advantage. The term “continuous improvement” applies in the chief data officer’s strategy, but how does continuous improvement happen when every project, every new idea, involves the task of setting up an appropriate data environment for development and testing? Hadoop-based enterprise data hubs, enhanced with flexible security, multiple job support, multi-tenancy, and the flexibility of dedicated storage for particular groups and projects, are a natural to provide the data environment for continuous improvement – exactly what the chief data officer and IT were looking for.

• Enterprise File Management: Structured data sources, like operational databases, often come with their own embedded data management features, typically applied by DBAs and their IT operations partners. But what about the endless non-database files generated, received, and retained by an enterprise? Solutions like Microsoft’s SharePoint are relatively ancient, lack the complex query strengths of Hadoop, and were never intended to house all files, just Microsoft files. Hadoop, armed with high performance NFS capabilities, serves as an excellent general purpose enterprise file system. It may also coexist with SharePoint, optimizing an organization’s overall file management and query processing.

Enterprise Fundamentals for Data-driven ROA

Enterprises thinking strategically about data, such as those that have established and filled the position of chief data officer, often think of data as an asset. That is, they treat “big data” as a business imperative rather than an IT project. For those enterprises, the best metric associated with being “data-driven” is ROA – Return On Asset – rather than ROI, suggesting that the same data is potentially useful across many business endeavors. In that sense, data’s value is renewable, like machinery or recycling materials.

Few organizations, however, leap directly to a solution that helps them consciously pursue the “data as an asset” strategic approach. Rather, they wend their way through a series of proof-of-concept and business production projects that inexorably raise similar data management challenges. Such data-intensive projects invariably give rise to the observation, “Why are we reinventing the same data management wheel every time we work on another data-oriented project?”

Eventually enterprises recognize that a single data management repository, albeit a sophisticated repository, could prevent them from having to constantly reinvent the data management wheel. The pursuit of crafting an enterprise data hub, a single, sophisticated data management facility to service the long-term data-driven ROA needs of the business, ensues. What specific characteristics does an enterprise data hub exhibit to support data-driven ROA?

• Real-Time and Multi-Job: An oft-cited limitation of Hadoop is its native batch orientation, which likely reflects the historical batch loading and report writing approach of data warehouses. Higher cost, near real-time operational data stores typically only support structured data. Ultimately for data-driven enterprises, the business requires real-time and near-real-time data load and access for a wide variety of data, users, and uses.




Related to real-time needs, and sometimes overlooked, is that an enterprise-grade Hadoop-based data management facility must support a multi-job environment. The days of the dedicated data scientist, running one batch MapReduce job at a time, are well on the wane. Enterprise data hubs should be architected, and must operate, under the assumption that all data speeds must be supported, and that data processing requests will not arrive in a single-threaded fashion.

• Security: The need for security has risen in importance in data-oriented projects that use customer and organizationally sensitive data. In the past, user communities were more well-defined, both internally and externally, the devices of display and network were similarly limited, and the sources and types of data were fewer. At the same time, however, it doesn’t make sense to go overboard in terms of applying security to all data management scenarios, for security comes at a cost. Finding an adaptable approach to security for enterprise data management makes the most sense. Enterprise data hubs need the “right” security to match the variety of organizational risks and related policies.

• Availability and Performance: In the early days of Hadoop projects, crashing nodes and clusters were just a part of the experiment, and were often shrugged off. For implementations delivering serious business value, such as an enterprise data hub supporting a range of business and IT solutions, availability is as paramount as it is for the most important transactional application at your company.

Assuming that clusters or nodes will never fail, however, is naïve. The ability to recover quickly, preferably in seconds versus minutes or hours, becomes the service level requirement for day-to-day operational availability. The availability requirement extends to specific situations where recovery techniques ensure zero data loss in order to mitigate business risk.

Putting performance next to availability has everything to do with customer and user experience. Users don’t want or need to know what is happening behind the scenes; they just want to ensure that their solution is working and doing so at an acceptable rate of speed. Hadoop carries a fair amount of overhead, and often requires significant optimization to give users a low-latency experience. Enterprises must take care to ensure Hadoop-based solutions, and certainly enterprise data hubs, don’t end up with the same reputation as their older data warehouse ancestors – too slow to use.

• Dealing with Disasters: Despite taking measures to ensure that day-to-day operations run smoothly, providing for a predictable and enjoyable user experience, the power still goes out occasionally. Offering disaster recovery for experimental Hadoop projects was often viewed as overkill. When an enterprise data hub, however, has been established to serve a variety of ever-evolving business and IT needs, the hub must operationally match the value of the resulting data solutions, which will inevitably carry DR requirements.

• Administrative Ease: As Hadoop has grown up and taken on more data management responsibility and use cases, the administration required for deployment and management has become more convoluted. The enterprise data hub, however, endeavors to limit the administrative load, particularly in terms of tuning and optimization. “Automation” has been the buzzword in the systems management world for many years now, and that same expectation applies to Hadoop-based enterprise data hubs.



MapR: Exemplifying the Hadoop-Based Enterprise Data Hub

Given the requirements and use cases an enterprise would typically demand from a Hadoop-based solution for an enterprise data hub, what specific technology augmentations would be required? What vendor has concentrated on supplying those augmentations? To date, no Hadoop distributor has focused on rounding out Hadoop for enterprise data hub purposes more than MapR. In particular, the MapR M7 Enterprise Database Edition, which includes an in-Hadoop NoSQL database for Apache HBase applications, stands out as an excellent choice for enterprise data hub implementations. The following are some critical enhancements MapR has made to its distribution for Hadoop.

• True High Availability and Disaster Recovery: The inherent Hadoop NameNode architecture carries too much risk for “production” data management, and Hadoop’s native lack of “HA” (High Availability) permeates other areas of its architecture. Hadoop-based data management must offer not just a work-around, but a fully isolated yet integrated approach to ensure that metadata high availability is achieved without NameNode bottlenecks and single points of failure, for node and job recovery, as well as for data protection, and for DR scenarios. A quick comparison between a centralized native Hadoop metadata architecture, versus the distributed metadata architecture of the MapR Distribution for Hadoop, is depicted in Figures 2 and 3. Figure 2 illustrates the native Hadoop issues with high availability. Figure 3 shows the updated architecture from MapR, and identifies some of the benefits of the MapR approach.

Figure 2: NameNode High Availability Architecture in Traditional Hadoop Distributions. Source: MapR, 2014



Figure 3: No-NameNode High Availability Architecture in MapR Distribution for Hadoop. Source: MapR, 2014

In addition to support for rolling upgrades, MapR includes optimized server redundancy to minimize user-visible impact when a node fails. MapR also includes consistent point-in-time snapshots and recovery of open files, which are not inherent in other Hadoop distributions.

• Performance and Low Latency: MapR holds world records for MapReduce operations on Hadoop, using the TeraSort and MinuteSort benchmarks. With respect to NoSQL operations, MapR has eliminated data compactions, and added in-process auto tuning features that ensure it consistently delivers low latency, for both read and the harder to achieve mixed read-write scenarios. The fact that MapR M7 eliminates two layers of the stack compared to a native HBase on Apache HDFS means fewer Java Virtual Machines (JVMs), improving both reliability and performance. MapR M7, for certain workloads, has tested at a 10x throughput improvement over native HBase.

• Scalability: For database applications, MapR M7 scales to millions of columns across billions of rows over one trillion tables. Automated compression optimization helps not only storage scalability, but performance as well. Similarly, for flat files MapR scales to a trillion files on Hadoop. This means you can scale to support more data and applications with much lower hardware and administrative costs. MapR eliminates the need to deploy multiple clusters to work around the file limits of HDFS.

• Security: MapR M7 offers a variety of security tools to be applied depending on each customer’s needs, policies, use cases, workloads, and data sources. It begins with standards-based native authentication and NSA Suite B encryption. MapR M7 offers authentication via Kerberos, and supports a wide variety of access controls from both a data perspective (table and/or column), and entity viewpoint (role, group, individual user). Finally, it supports Linux Pluggable Authentication Modules (PAMs) for enterprises that have more advanced and federated approaches to authentication.

• Administration and Management: Appreciating the complexities of enterprise-grade operations, MapR has made every effort to make its graphically-based provisioning tools easy to use, and to streamline node installation and scheduling. In addition, MapR includes a heat map for cluster management, and event-based alarms and alerts to take the guesswork out of on-going Hadoop management. It also has added group and user quota management, and automated ingest compression to better manage and optimize the data management facility.

• Data Access and APIs: What good is a data management facility if IT is limited in how it can access and interoperate with the platform? MapR M7 comes with built-in API and data access support such as ODBC, REST, and LDAP, along with all the native Hadoop interfaces.
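The Security item above describes access controls along two axes: the data being protected (table and/or column) and the entity asking (role, group, individual user). The following is a minimal Python sketch of how such a two-axis check can be modeled. It is not MapR's API or its access-control syntax; the `Principal` class, the `POLICY` table, the `can_read` function, and the rule that a column-level entry overrides the table-level entry are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Principal:
    """A requesting entity: one user plus the groups and roles it belongs to."""
    user: str
    groups: frozenset = frozenset()
    roles: frozenset = frozenset()

    def identities(self):
        """All identity strings this principal can match in a policy entry."""
        ids = {f"user:{self.user}"}
        ids |= {f"group:{g}" for g in self.groups}
        ids |= {f"role:{r}" for r in self.roles}
        return ids


# Hypothetical policy: (table, column) -> identities allowed to read.
# A column of None stands for the table-level default.
POLICY = {
    ("customers", None): {"role:analyst"},
    ("customers", "ssn"): {"group:compliance", "user:alice"},
}


def can_read(principal, table, column):
    """Column-level entry wins if present; otherwise fall back to the table entry."""
    allowed = POLICY.get((table, column))
    if allowed is None:
        allowed = POLICY.get((table, None), set())
    return bool(principal.identities() & allowed)


analyst = Principal("bob", roles=frozenset({"analyst"}))
auditor = Principal("carol", groups=frozenset({"compliance"}))
```

With this policy, `analyst` can read ordinary columns of `customers` but not `ssn`, while `auditor` can read `ssn` only. The design choice worth noting is that the sensitive column is denied by default to anyone covered only by the table-level rule, which matches the paper's point that security should be applied selectively to match organizational risk.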



The Big Data Management Difference for Customers

While the many security and management features that help set MapR M7 apart are important, several other vendors offer similar capabilities or are in close pursuit of doing so. The three feature groups that make implementations like the aforementioned bank fraud/offer and biometrics applications even possible, and that clearly set MapR apart from other Hadoop providers, are:

1. Data Model: The MapR M7 data model, based on Google BigTable, broadens the types of solutions enterprises may pursue with a Hadoop foundation. MapR M7 also supports native HBase, and both databases may run co-resident in the same cluster.

2. NFS: MapR includes the “Direct Access NFS [Network File System]” which further expands the potential application mix, augments the real-time nature of MapR M7, and spreads the application client base across the spectrum of users and machines. In essence, Hadoop data is treated as if it were stored on a regular disk drive, meaning any file system-based application can run on MapR without modification.

3. Reliability: MapR adds a long list of reliability related features, such as snapshots and mirroring, all adding up to best-in-class high availability.

By adding NFS and NoSQL into its offering, and designing them as integrated features of the data management platform rather than bolting them on, MapR has made it possible for enterprises to truly consider Hadoop as a foundation for the many possible uses of enterprise data hubs.
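The NFS point above says that data in the cluster looks like files on a regular disk, so file-based applications run unmodified. The following minimal Python sketch shows what "unmodified" means in practice: the function uses only standard file I/O and has no knowledge of Hadoop. The `summarize_logs` and `demo` names are hypothetical, and because the actual mount path depends on each deployment, the demo substitutes a temporary local directory for the NFS mount point.

```python
import os
import tempfile


def summarize_logs(directory):
    """Count lines per file using only standard file I/O (no Hadoop client library)."""
    counts = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            with open(path, encoding="utf-8") as handle:
                counts[name] = sum(1 for _ in handle)
    return counts


def demo():
    """Run the summary against a stand-in directory.

    Assumption: on a real cluster, the directory would be wherever the
    administrator mounted the NFS export; a temp directory is used here.
    """
    with tempfile.TemporaryDirectory() as mount_point:
        log_path = os.path.join(mount_point, "events.log")
        with open(log_path, "w", encoding="utf-8") as handle:
            handle.write("login\npurchase\nlogout\n")
        return summarize_logs(mount_point)
```

The same `summarize_logs` call would work against any mounted directory, which is the property the paper attributes to NFS access: the application code does not change, only the path it is pointed at.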

EMA Perspective

Over the past three decades, relational databases and document management systems have acted as the primary repositories of enterprise data. Those repositories served their purposes well, and they will continue to deliver value for IT and the business, albeit narrowly. Unfortunately, they do not, generally speaking, do a good job serving the more advanced data management needs of data-driven organizations – the old-style data warehouse does not adapt well to the big data era that incorporates social, mobile, open, machine, and web data, plus new sources that will undoubtedly pop up in the future.

Furthermore, big data, as it turns out, isn’t just about running complex BI-oriented queries. Rather, it is about achieving a compelling ROA on your data, with advanced analytics being one compelling possibility. The foundational technology to enable data-driven applications is an enterprise data hub that supports:

• Aggregation and ongoing management of a wide variety of data, supporting different velocities including batch, interactive, and streaming/real-time, for both ingest and user access.

• Full enterprise-grade features in terms of reliability, recoverability, manageability, performance, security, and scalability.

• A variety of data and file models to optimize storage and access, but most importantly to support creative application development over the course of time.




MapR took it upon itself to pursue this type of sophisticated data platform from its early days; it has long endeavored to turn Hadoop into an enterprise-class, general purpose data hub, extending Hadoop’s usefulness from BI to a far wider range of operational applications for business and IT. When the computer industry looks back at the early days of the “big data” movement, it will see that MapR was one of the first with a wider vision of Hadoop, and one of the first to execute and deliver on that vision. Leading enterprises interested in the ROA of data need only look to several existing MapR customers to find the state of the art in being data-driven through the use of enterprise data hubs.

About EMA

Founded in 1996, Enterprise Management Associates (EMA) is a leading industry analyst firm that provides deep insight across the full spectrum of IT and data management technologies. EMA analysts leverage a unique combination of practical experience, insight into industry best practices, and in-depth knowledge of current and planned vendor solutions to help EMA’s clients achieve their goals. Learn more about EMA research, analysis, and consulting services for enterprise line of business users, IT professionals and IT vendors at www.enterprisemanagement.com or blogs.enterprisemanagement.com. You can also follow EMA on Twitter, Facebook or LinkedIn.

http://www.enterprisemanagement.com

http://blogs.enterprisemanagement.com/

http://twitter.com/ema_research

https://www.facebook.com/enterprisemanagementassociates

http://www.linkedin.com/company/25620
