
Big Data Implementation:

Hadoop and Beyond

CONDUCTED BY: Tabor Communications Custom Publishing Group

IN CONJUNCTION WITH:


Dear Colleague,

Welcome to the first in a series of professionally written reports and live online events on topics related to big data, high-performance computing, cloud computing, green computing, and digital manufacturing that comprise the media portfolio produced by Tabor Communications, Inc.

Most upcoming Tabor Communications reports and events will be offered at no charge to viewers through underwriting by a limited group of sponsors. Our first report, titled Big Data Implementation: Hadoop and Beyond, is graciously underwritten by IBM, Omnibond and Xyratex. More about how these organizations tackle challenges associated with big data can be found in the sponsor section.

Tabor Communications has been writing about the most progressive technologies driving science, manufacturing, finance, healthcare, energy, government research, and more for nearly 25 years. The added depth of our new report series extends beyond the limited space we can devote to specific topics through our regular media resources. The writers of these reports are typically contributing editors for Tabor Communications, and/or content partners like Intersect360 Research, who developed the research cited within this report.

The best topics to cover typically come from the user community. We encourage you to offer any suggestions for upcoming reports that you feel will help you better reach your business goals. Please send your ideas to our SVP of Digital Demand Services, Alan El Faye ([email protected]). Sponsorship inquiries for future reports should also go to Alan.

Thank you for reading Big Data Implementation: Hadoop and Beyond.

Please visit our corporate website to learn more about the suite of Tabor publications: http://www.taborcommunications.com/publications/index.htm

Cordially,

Nicole Hemsoth
Editorial Director, Tabor Communications, Inc.

Big Data Implementation: Hadoop and Beyond
Conducted by: Tabor Communications Custom Publishing Group

A Tabor Communications, Inc. (TCI) publication © 2013. This report cannot be duplicated without prior permission from TCI. While every effort is made to assure accuracy of the contents, we do not assume liability for that accuracy, its presentation, or for any opinions presented.

Study Sponsors


Media and Research Partners


Big Data Implementation: Hadoop and Beyond

In this new era of data-intensive computing, the next generation of tools to power everything from research to business operations is emerging. Keeping up with the influx of emerging big data technologies presents its own challenges—but the rewards for adoption can be of incredible value.

By Mike May

The big data phenomenon is far more than a set of buzzwords and new technologies that fit into a growing niche of application areas. Further, all the hype around big data goes far beyond what many think is the key issue, which is data size. While data volume is a critical issue, the related matters of fast movement and operations on that data, not to mention the incredible diversity of that data, demand a new movement—and a new set of complementary technologies.

But we've heard all of this before, of course. Without context, it's hard to communicate just how important this movement is. Consider, for example, President Obama's recent campaign, which many touted as by far the most comprehensive Web and data-driven campaign in history.

The team backing the big data efforts for Obama utilized vast, diverse wells of information ranging from GIS, social media, demographic and other open and proprietary data to seek out the most likely voters to lure from challenger Mitt Romney. This approach to wrangling big data for a big win worked, although it's still a matter of speculation just how much these efforts tipped the scale.


High-performance computing and machine learning help users track their family tree at Ancestry.com. This screen shot features the Shaky Leaf, which employs predictive analytics to suggest possible new information in a user’s family tree. (Image courtesy of Ancestry.com.)


So let's take a more concrete set of measurable examples. Healthcare data are being processed at incredible speed, pulling in data from a number of sources to turn over rapid insights. Insurance, retail and Web companies are also tapping into big, fast data based on transactions. And life sciences companies and research entities alike are turning to new big data technologies to enhance the results from their high performance systems by allowing more complex data ingestion and processing.

There is little doubt that the next wave of big data technologies will revolutionize computing in business and research environments, but we should back up for a moment and take in the 50,000-foot view to put this all in clearer context.

The roots of big data are not in the traditional places one used to look for the cutting edge of computational technologies—they are entrenched in the daily needs of the world's largest, most complex and earliest web companies.

As Jack Norris, VP of Marketing at Hadoop distribution startup MapR, reminds us, "Google is the pioneer of big data and showed the dramatic power and ability to change industries through better analysis." He adds, "Not only did Google do better analysis, but they did it on a larger dataset and much cheaper than traditional alternatives." Making use of big data requires powerful new computing tools, and perhaps the best-known one is called Hadoop.

As the case of Google calls to mind, even though big data are often big, size alone does not adequately define the space. Likewise, the field of big data extends beyond any particular application, as indicated by the examples given above. As Norris explains, "Big data is not just about the kind or volume of data. It's a paradigm shift." He adds, "It's a new way to process and analyze existing data and new data sources."

The point at which a computing challenge turns into big data depends on the specific situation. For some research and enterprise end users, existing applications are being strained by data complexity and volume overload; for others, performance is lagging due to these factors. Another group of users simply needs to abandon old applications, systems or even methodologies completely and build again to create a big data-ready environment.

As Yonggang Hu, chief architect for Platform Computing, IBM Systems and Technology Group, says, "One of the simplest ways to define big data is when an organization outgrows their current data-management systems, and they are experiencing problems with the scalability, speed and reliability of current systems in the data they are seeing or expect to get." When faced with such a situation, says Hu, "They need to find a cost-effective way to move forward." (See 'A to Z for Big Data Challenges.')

When viewed in total, the scalability, performance, storage, network and other concerns continue to mount—which has opened the door for the new breed of big data technologies. This includes everything from fresh approaches to systems (new appliances purpose-built for high-volume and -velocity data ingestion and processing) to new applications, databases and even frameworks (Hadoop being a prime example).

The Deluge of Data to Decode

There is no one definition of a big data application, but it’s safe to say that many existing applications need to be retooled, reworked or even thrown overboard in the pursuit to keep pace with data volume, variety and velocity needs.


Benefits of BigQuery

To explore big data, people want to ask questions about the information. "Google's BigQuery is a tool that allows you to run ad hoc queries," says Michael Manoochehri, one of Mountain View, California-based Google's developer programs engineers. "It lets you ask questions on large data sets, like terabytes."

To make this tool accessible to more users, Google implemented a SQL-style syntax. Moreover, the company had used this tool—in a version called Dremel—internally for years before releasing a form of it as their generally available service, BigQuery, in 2012. "It's the only tool that lets you get results in seconds from terabytes," Manoochehri says.
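To make the SQL-style, ad hoc querying concrete, here is a minimal sketch of submitting such a query from Python with the google-cloud-bigquery client, one of several ways to run a BigQuery job; the project, dataset, table and column names are hypothetical placeholders rather than anything from the report.

```python
# Minimal sketch: an ad hoc, SQL-style query submitted to BigQuery from Python.
# The project, dataset, table and column names below are hypothetical placeholders.
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client(project="my-project")  # uses your configured credentials

query = """
    SELECT user_id, COUNT(*) AS events
    FROM `my-project.web_logs.page_views`
    WHERE event_date >= '2013-01-01'
    GROUP BY user_id
    ORDER BY events DESC
    LIMIT 10
"""

for row in client.query(query).result():  # runs the job and waits for the result
    print(row.user_id, row.events)
```

The same statement could be pasted into the BigQuery web console; the point is that plain SQL runs against terabyte-scale tables without any cluster to set up.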

This tool brings big data to many users because it is hosted on Google's infrastructure. "You take your data, put it in our cloud and use this application," Manoochehri explains. Because of that simplicity, a diverse group of users already employs BigQuery.

As examples, Manoochehri mentions Claritics—another Mountain View–based company, which describes its services as providing "social gaming and social commerce analytics"—and Blue Box—a company in Brazil that provides mobile advertisements. Regarding Blue Box, Manoochehri says, "They started with MySQL, and when the data got too big they moved to Hadoop." He adds, "Hadoop is cool, but it takes setup and administration." Then, Blue Box tried BigQuery. "It fit," Manoochehri says, "because Blue Box is a small shop with lots of data to process. They wanted to run queries on data and integrate it into their application."

Various other industries also use BigQuery. As an example, Manoochehri points out the hospitality industry, which includes hotel and travel data. Overall, he says, "Almost all of the case studies for BigQuery are small shops, like two or three engineers, that have no time to set up hardware."

In general, BigQuery’s biggest benefits arise from speed. “It’s optimized for speed,” Manoochehri explains. It’s fast to set up and it brings back results fast, even from very big datasets.



Throw in the tricky element of real-time applications—which require high performance on top of complex ingestion, storage and processing—and users are left with a lot of questions about how to meet their goals. While there is no shortage of new technologies that are aimed at meeting those targets, the challenges can be overwhelming to many users.

When asked about the hurdles that users face today in working with big data, Michael Manoochehri, developer programs engineer at Google headquarters, says, "They are often looking at complexity, trying to manage complexity with limited resources." He adds, "That's a big problem for lots of people." In the past, the lack of resources kept many potential users out of big data.

As Manoochehri says, "Google and Amazon and Microsoft could handle big data," but not everyone. Today, big data tools exist for many more users. For example, Manoochehri says, "Our products make these seemingly impossible big data challenges more accessible." (See 'Benefits of BigQuery.')

A prime example of this can be found in large-scale data mining, which can quickly turn into a big data problem, even if it's a process that used to fit neatly in a relational database and commodity system. For life sciences companies, data mining has always been a key area of computational R&D, but the addition of new, diverse, large data sets continues to push them toward new solutions.

The data mining, analytics and processing challenges for life sciences are great. Genomics explores the chromosomes of a wide range of organisms—everything from seeds to trees and humans. Chromosomes consist of genes, which code for proteins, and nucleotides make up the genes. The nucleotides are arranged, more or less, like beads on a necklace, and the order of the nucleotides and the types—they come in four forms—are like letters spelling out words. In the case of genomics, the words translate into proteins, which control all of the biochemical processes that fuel life. Hidden in those words are complex programming and processing challenges as life sciences users attempt to piece these bits together to find "needle in the haystack" discoveries amidst millions (and more often, billions) of pieces at a time.

Sequencing determines the order and identity of the nucleotides, and that information can be used in basic research, biotechnology, medicine and more. Today's sequencing machines produce data so fast that this quickly generates big data. In 2010, for example, BGI—a sequencing center in Beijing, China—bought 128 Illumina HiSeq 2000 sequencing platforms. According to specifications from Illumina, one of these machines can produce up to 600 gigabytes of data in 8.5 days. So 128 of these sequencing machines could produce about one-quarter million gigabytes of data a month, and that's big data. Today's sophisticated life science tools and big data applications also impact the healthcare industry. (See 'Boosting Big Pharma.')
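That quarter-million-gigabyte figure is easy to verify with back-of-the-envelope arithmetic; here is a quick sketch using only the numbers quoted above:

```python
# Rough check of the sequencing-output estimate quoted above.
gb_per_run = 600        # up to 600 GB per machine per run (Illumina HiSeq 2000 spec)
days_per_run = 8.5      # one run takes about 8.5 days
machines = 128          # sequencers purchased by BGI in 2010

gb_per_machine_per_day = gb_per_run / days_per_run      # about 70.6 GB per day
gb_per_month = gb_per_machine_per_day * machines * 30   # about 271,000 GB per month

print(f"{gb_per_month:,.0f} GB per month")  # roughly a quarter-million gigabytes
```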

The challenges across the computational spectrum are not hard to point to in such a case, especially when it comes to large scale genomic sequence analysis. With the terabytes of data rolling off sequencers, the volume side of the big data equation is incredible, necessitating innovative approaches to storage, ingestion and data movement.

Boosting Big Pharma

"The big data question hasn't really been fully exploited in big pharma at the moment," says Gavin Nichols, vice president of strategy and R&D IT at Quintiles in Research Triangle Park, North Carolina. Nonetheless, many big data opportunities exist in healthcare. "If you took the big data around genomics, the ability to master investigator and doctor information, the information from clinical trials and what goes on drug labels and much more, you get a picture of where you could influence the field," says Nichols. In short, big data could impact the pharmaceutical industry from discovering new drugs to commercializing them. "There is the ability to influence the whole lifecycle," Nichols says, "but that has not been fully realized."

Quintiles, on the other hand, started working with big data some time ago. "For the past seven or eight years," Nichols says, "we've been very strategic about the data that we consider." This includes both enterprise and technical computing. For example, Quintiles analyzes large datasets in an effort to improve efficiencies across its various organizations. Likewise, the company takes advantage of electronic health records (EHRs) and registries in an effort to better understand the populations that need new pharmaceuticals.

As one crucial example, Quintiles uses big data in an effort to make clinical trials more efficient. In a late-phase trial, for example, better use of EHRs could reduce time and cost. "Instead of going to a trial site and asking hundreds of questions, like what drugs the people take," Nichols explains, "we could ask the EHRs those questions, and maybe we can answer 40 percent of the questions." Beyond increasing efficiency, a similar approach could also enhance the information gained from a clinical trial. "Information from EHRs could help us design more-targeted studies on a smaller population," Nichols says, "and that's a win-win for everyone."

Although big data offers a gold mine to the pharmaceutical industry, companies must use this resource wisely. “Big data in the life sciences without good clinical interpretations could be a hazard,” Nichols says. “So we bring data together and enrich it.” Quintiles combines data from, say, clinical trials with the experience of clinical experts.

To make the most of big data opportunities, the experts at Quintiles consider various options. "We look long and hard at things," Nichols says. "We're looking at Hadoop, and we've also used Solr on a SQL frontend." He adds, "We're also looking at techniques developed by companies that are building on top of Hadoop." For patient privacy, however, Quintiles segregates analytics from the raw data.

In many cases, though, each pharmaceutical question requires its own big data implementation. For efficiency, Quintiles relies heavily on data scientists. “By using research analysts—individuals who sit between the knowledge of what the data are and the analytical approaches—we can see if we’ve examined similar questions in the past,” Nichols says. “Then, we might be able to go back to some application that you’ve used before.” Nonetheless, Nichols adds, “Clinical research has more and more one-off questions.”


In terms of the needle-in-a-haystack side of the analytics problem, standard software that fits into neat boxes is no longer able to keep pace with the incredible speed and complexity of the puzzle-piecing. Hence, life sciences is one of the key use cases representing the larger challenges of big data operations.

Big data goes beyond our bodies; another example of data-intensive computing in action could soon change our homes.

The "smart grid" is another hot application area for a wide range of big data technologies across the entire stack. The US Department of Energy describes the smart grid as "a class of technology people are using to bring utility electricity delivery systems into the 21st century, using computer-based remote control and automation." Of course, without ample processing, frameworks and applications designed from the ground up to handle the many diverse data feeds that power the smart grid, the project couldn't be a success.

Making the smart grid work requires real innovation on the data front, according to Ogi Kavozovic, vice president of strategy and product at Opower in Arlington, Virginia. “Smart grid deployment will create a three-orders-of-magnitude jump in the amount of energy-usage data that the utility industry will have,” he says.

The energy exec went on to note that when this data spike happens, "Then there will be a several-orders-of-magnitude jump above that from energy-management devices that go inside people's homes, like Wi-Fi thermostats that read data by the second." Opower is already working with Honeywell, the global leader in thermostats, to make such devices. "This is all just emerging now, and the utility industry is preparing for it," Kavozovic says. "The data presents an unprecedented opportunity to bring energy management and efficiency into the mainstream." (See 'Evolution in Energy Efficiency.')

For the smart grid and energy sector, moving to big data technologies is a priority because it will enable new, critical efficiencies that can create a new foundation for energy consumption and distribution. But of course, the frameworks, both on the hardware and programming side, need to be in place.

Data-Based Decisions

In the field of big data, the wide range of applications mirrors the sorts of computing situations that generate big data challenges. Some users run into big data because of handling many files. In other cases, the challenge comes from working with very large files. Big data issues also emerge from extensive data sharing, allowing multiple users to explore or analyze the same data set.

"Time plays a role in big data," says Steve Paulhus, senior director, strategic business development at Xyratex. "Depending on how the data is used, it may be valuable today or it may become valuable tomorrow. For instance, with financial information, the data can lose value if you don't get it right now. In other cases, like weather data, you might want to look at it over a long period of time." (See 'Crafting Storage Solutions.')

Evolution in Energy Efficiency

Utility companies that want to use big energy data to enhance efficiency can turn to Opower in Arlington, Virginia. "We partner with about 80 utilities around the world to help their customers understand and control their energy usage through our tools," says Anand Babu, lead on all data and platform marketing efforts at Opower. With these tools, customers can compare their energy usage to that of their neighbors and learn to improve their energy efficiency. Likewise, this use of big energy data can help utilities reduce peak energy demands, which can reduce the need to develop or purchase more energy.

To suggest some of the volume of big energy data, Babu says, “In the United States alone, consider what happens once every one of our 120 million households has a smart meter and WiFi thermostat. That’s 120 million times 100,000 data points per month.” That’s 12 trillion data points acquired every month. Keep in mind too that most households today only get one monthly meter read. So big energy data includes high volume and fast growth.
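As a quick sanity check on that estimate, a short sketch with the round figures Babu cites:

```python
# Quick check of the smart-meter data-volume estimate quoted above.
households = 120_000_000                   # US households in Babu's example
points_per_household_per_month = 100_000   # smart meter plus WiFi thermostat readings

total_per_month = households * points_per_household_per_month
print(f"{total_per_month:,} data points per month")  # 12,000,000,000,000 -- 12 trillion
```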

As utilities start to roll out smart meters, most companies lack the infrastructure to store the data. That's one place where Opower comes in on big energy data. This software-as-a-service company provides the infrastructure and the analytical expertise. "We operate a managed service where utilities securely transfer data to us, and we do the analytics," Babu explains.

To provide this service, Opower uses several tools. Babu says, “We are the biggest user of Hadoop in our industry, operating at the scale of nearly 50 million homes.” In other cases, different tools work better. “For some things, we still use MySQL,” Babu explains. “For example, to manage how we communicate with certain segments of the customer base, that can be done more efficiently using a traditional database.”

“The exciting thing here is that customers see value on day one,” Babu says, “and they can start to take action right away.”

In “Different Perspectives on Big Data: A Broad-Based, End-User View” from Intersect360 Research, people polled rated a series of big data challenges on a scale of 1–5, ranging from ‘not a challenge’ to an ‘insurmountable challenge,’ respectively. As the results show, users run into the most trouble managing many files and performing computations and analyses fast. (Image courtesy of Intersect360 Research.)

Challenging dimension of big data | Business Avg. | Business % 4/5 | Technical Avg. | Technical % 4/5 | All Data Avg. | All Data % 4/5
Management of very large files | 2.7 | 25% | 3.5 | 56% | 3.3 | 46%
Management of very large number of files (many files) | 3.5 | 58% | 3.9 | 70% | 3.7 | 66%
Providing concurrent access to data for multiple users | 3.1 | 40% | 3.5 | 52% | 3.4 | 48%
Completing computation or analysis within a short timeframe (short lifespan of data) | 3.8 | 65% | 3.7 | 61% | 3.7 | 62%
Long-term access to data (long lifespan of data) | 3.6 | 61% | 3.5 | 56% | 3.5 | 57%
Number of respondents | 102 | | 203 | | 305 |

Source: Intersect360 Research, 2012



Addison Snell, CEO of Intersect360 Research in Sunnyvale, California, describes another example where data must be stored for a long time: "A pediatric hospital, for example, keeps information for the life of a patient plus some additional number of years."

Depending on the particular application and the kind of data that it includes, end-users must select a system that works for them. No single solution handles every big data problem. That said, most users of big data face one consistent problem: input/output scalability. "From a technical perspective," says Snell, "this is the number one big data problem."

Moreover, I/O scalability impacts both technical and enterprise computing. Technical computing often involves a user's top-line mission, such as car design for an automotive manufacturer or drug discovery for a pharmaceutical company. Enterprise computing, on the other hand, consists of the steps that keep a company running, such as handling the mechanics of an e-commerce website.

In the past, technical computing focused almost entirely on unsolved problems. As Snell explains, "You don't need to design the same bridge twice." Since this area of computing moves to harder and harder problems, the users quickly adopt new technologies and algorithms. In enterprise computing, by comparison, once users find solutions, they stick with them. "In enterprise computing, there are people using approaches that are 30 years old because they still work," says Snell.

In many ways, the big data era changes this distinction. By definition, big data scenarios push an end-user to a point where previous methods won't work. This applies to technical and enterprise computing. "So, for the first time, you have enterprise-class customers viewing performance specifications of infrastructure as the dominant purchase criteria," Snell says. The purchase criterion used the most is I/O performance.


Computing Family Connections

At Ancestry.com, data come in many forms. This family-history company stores data from historical records, such as census data and military and immigration records. This historical information also includes birth and death certificates. Overall, the historical data at Ancestry.com consists of around 11 billion records. In addition, this web-based company stores user-generated data.

"When users come to this site, their goal is to build a family tree," explains Leonid Zhukov, the company's director of machine learning and development. In this process, a user creates a profile of themselves, plus their parents, grandparents and so on. "We have more than 40 million family trees and 4 billion people represented here," Zhukov says. Users can also upload images, such as photographs or even handwritten stories, and Ancestry.com stores about 185 million of these records. On top of all of that, this site keeps track of user activity. "On average, Ancestry.com servers handle 40 million searches a day," says Zhukov, "and we log user searches and use it in machine learning."

The kinds of data being stored at Ancestry.com also evolve over time. As an example of that, the company can now collect a user's DNA data. The company provides the DNA test and processes the information. As the company explains: "AncestryDNA uses recent advances in DNA technology to look into your past to the people and places that matter to you most. Your test results will reach back hundreds—maybe even a thousand years—to tell you things that aren't always in historical records, but are recent enough to be important parts of your family story."

This range of data—a variety of types from various sources—can only be stored and analyzed with powerful tools. “There’s no black-box solution for our data challenges,” Zhukov explains. “We use different pipelines for processing different types of data. Plus, we deal with big and versatile data, so we need to do lots of hand work on it.” He also points out that moving so much data creates challenges. “We must customize servers and how they are connected—where to collect data and where to process it,” Zhukov says. “The logistics of the data are quite important.”

Some of the most interesting data analysis at Ancestry.com involves machine learning. This works on two types of objects: a person or a record, which is any document with information about that person. Then, users can find information about ancestors by searching for names, places of birth and so on. These focused searches provide precise results. On the other hand, users can also turn to exploratory analysis. “In this case, a user is looking for suggestions,” says Zhukov. “It’s similar to when you log on Amazon, and they show you the books that you might be interested in.” So instead of a search, the exploratory analysis provides suggestions of records or family-tree information based on what the user has done in the past, and that depends on machine learning.

The system at Ancestry.com creates the suggestions for exploratory analysis offline. "We train the system to make suggestions based on the feedback of previous suggestions that were accepted or rejected," Zhukov explains. "The system looks for similarity. It asks: How similar is a person to a particular record?" The software uses a range of features to make these comparisons. It can start with a first and last name, but maybe a document includes a misspelling, and then the system relies on a phonetic comparison. Also, a record could have been handwritten, scanned and turned into data with optical character recognition. "You can imagine how many mistakes there are in hand-written historical records," Zhukov says, "so we do fuzzy matching instead of exact matches." The features of a record serve as the input, and the training data for machine learning come from the users rejecting or accepting the suggestions.
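The following is an illustrative sketch of what such similarity features might look like, not Ancestry.com's actual pipeline; the simplified Soundex-style code and the sample names are stand-ins for the exact, fuzzy and phonetic comparisons described above.

```python
# Illustrative similarity features for matching a person in a family tree to a record.
# This is a simplified stand-in, not Ancestry.com's pipeline.
from difflib import SequenceMatcher

def soundex(name: str) -> str:
    """Very simplified Soundex-style phonetic code for a name."""
    groups = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3", "l": "4", "mn": "5", "r": "6"}
    name = "".join(c for c in name.lower() if c.isalpha())
    if not name:
        return ""
    code, last = name[0].upper(), ""
    for c in name[1:]:
        digit = next((d for letters, d in groups.items() if c in letters), "")
        if digit and digit != last:
            code += digit
        last = digit
    return (code + "000")[:4]

def name_features(tree_name: str, record_name: str) -> dict:
    """Exact, fuzzy and phonetic comparisons between two names."""
    return {
        "exact_match": tree_name.lower() == record_name.lower(),
        "fuzzy_ratio": SequenceMatcher(None, tree_name.lower(), record_name.lower()).ratio(),
        "phonetic_match": soundex(tree_name) == soundex(record_name),
    }

# A misspelled, OCR'd surname still scores well on the fuzzy and phonetic features.
print(name_features("Johansson", "Johanson"))
```

Features like these, computed for every candidate person-record pair, are the sort of input a model trained on previously accepted and rejected suggestions could score.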

For many parts of this processing, Ancestry.com uses Hadoop. As Zhukov says, “Hadoop is really great technology, but it’s not at the point where you can open the box and things miraculously work.” Consequently, this company relies on Hadoop experts and also experts who understand how to search huge databases for similarities between disparate sources of information. As Zhukov adds, “Hadoop is not only for data storage, but it also provides a data-processing pipeline. So you need to feed it with the right things.”


In addition to I/O performance scalability, users must ask other questions about their big-data infrastructure needs. For example, Ken Claffey, senior vice president and general manager, ClusterStor at Xyratex, asks: "Once you have all of your data in a common environment, you need to think about the reliability, availability and serviceability of that infrastructure, as you want that data to be consistently available." He adds, "You need to think about the constant with big data—multifaceted growth in capacity, data sources, data types, number of clients, applications, workloads and the new ways to use the data. It is key to ask yourself how you are going to efficiently manage such multifaceted exponential growth over the next number of years against the constraints of available data scientists and data center space, power, cooling and budget. Your big data infrastructure needs to be scalable to meet performance and capacity demands, flexible to respond to change and extremely efficient."

Delving even deeper, Norris notes that when looking at new architecture and what's possible, the real focus is on what types of applications will be used. "What data sources are we going to exploit? What is the desired outcome?" In short, a user must answer questions like these to know where to start. After that, a user decides how to integrate new infrastructure into an existing computing environment. "How does the new computing fit my current security?" Norris asks. "How do I make it enterprise grade?" As Norris points out, a user might not start asking these questions until working on the second or third big data application and sharing it across a group of users. "You can generalize about the best platform, but you have to have it working well with the existing environment and applications," Norris says. "So you'll end up with a little bit of a hybrid solution. You'll find yourself asking: How do I take advantage of this new technology, but on a platform that delivers what I need in terms of reliability, ease of integration and so forth?"

Among the frameworks emerging in the key sectors we've discussed are Hadoop and MapReduce, which were, again, born out of the largest-scale data operations the world has seen to date—the search engines and web companies.

Mapping the MapReduce Paradigm

For a wide range of industries, MapReduce makes up one of the most common big data application frameworks. This often gets implemented as Hadoop MapReduce, which makes it easier for a user to develop applications that handle large data sets—even in multiple terabytes—on computer clusters.
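For readers new to the model, here is a minimal, runnable sketch of the MapReduce pattern itself, the classic word count, in plain Python. On a real Hadoop cluster the map and reduce steps would run as distributed tasks (for example via Hadoop Streaming) and the framework would handle the shuffle and sort between them; here those stages are simulated locally on a couple of invented input lines.

```python
# Minimal sketch of the MapReduce model (word count), runnable locally.
# On Hadoop, map_phase and reduce_phase would run as distributed tasks and the
# framework would perform the shuffle/sort between them.
from itertools import groupby
from operator import itemgetter

def map_phase(line: str):
    """Map: emit a (word, 1) pair for every word in an input line."""
    for word in line.lower().split():
        yield word, 1

def reduce_phase(word: str, counts):
    """Reduce: sum all counts emitted for one word."""
    return word, sum(counts)

lines = ["big data needs big tools", "hadoop handles big data"]  # invented sample input

# Shuffle/sort: group the intermediate pairs by key, as the framework would.
pairs = sorted(kv for line in lines for kv in map_phase(line))
for word, group in groupby(pairs, key=itemgetter(0)):
    print(reduce_phase(word, (count for _, count in group)))
```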

The complete range of big data applications could always remain in flux. As Boyd Wilson, executive at Omnibond in Pendleton, South Carolina, says, "As you iterate on various methods of teasing out the information from the data, that learning can lead to new areas of big data." As an example, Wilson mentions cross-disciplinary analytics.

Overall, he says, "The big data sweet spot is lots of—potentially diverse—data that you want to combine to learn new correlations, and you need a high-performance medium, something like OrangeFS, to aggregate and analyze in large volume and with high throughput." (See 'A Big Data File System for the Masses.')

As this schematic indicates, a wide range of information plays a part in big energy data. Providing benefits to both utilities and customers requires advanced storage and analytics. (Image courtesy of Opower.)



That "sweet spot" covers a lot of ground: consumer-behavior analysis, dynamic pricing, inventory and logistics, scientific analysis, searching and pattern matching, and more. Moreover, these applications of big data arise in many vertical markets, including financial services, retail, transportation, research and development, intelligence and the web.

In some cases, the use of big data will come from some unexpected places. For instance, Leonid Zhukov, director of machine learning and development at Ancestry.com, says, "We're a technology company, and we do family history, but we are a big data company." He adds, "We use data science to drive family-history advances. So we're hiring data people." (See 'Computing Family Connections.')

As MapReduce and Hadoop continue to mature, these use cases will continue to extend further into mainstream enterprise contexts, but for now, there are some bigger problems to address.

Hadoop's High Points

Although Hadoop only made up slightly more than 10 percent of the applications mentioned in Intersect360 Research's poll (see below for details), many experts endorse this approach. When asked about the benefits of applying a Hadoop solution, Norris of MapR says, "The way to summarize the benefits of Hadoop is: It's about flexibility; it's about power; it's about new insights." (See 'M Series from MapR.')

In terms of flexibility, according to Norris, Hadoop does not require that you understand ahead of time what you're going to ask. "That seems simple," he says, "but it's a really dramatic difference in terms of how you set up and establish an application." He adds, "Setting up a typical data warehouse is initially a large process, because you are asking how you will use the data, and that requires a whole set of processes." With Hadoop, on the other hand, a user can just start putting in data, according to Norris. "It's really flexible in the analysis you can perform and the data you can perform it on, and that really surprises people."

Norris points out several features of the power behind Hadoop. "You can use different data sets, and the volume that you can use is quite broad." He adds, "This eliminates the sampling. For example, you can store a whole year of transactions instead of just the last 10 days. Instead of sampling, you just analyze everything."

By using such large volumes of data in analytical procedures, users can apply new kinds of analysis and also more easily identify outliers. Because of that ability to work with all of the data, says Norris, "Relatively simple approaches can produce some dramatic results." For instance, just the added ability to explore outliers could reveal some pleasant surprises, maybe some tactic that generated an unexpectedly high amount of revenue. On the other hand, analyzing outliers can reveal failures. As an example, Norris says, "You might find an anomaly like an expensive part of the supply chain that you want to eliminate. Overall, analyzing these outliers makes dramatic returns possible."

In terms of new insights, says Norris, "With Hadoop, you can recognize patterns much more easily and drive new applications."

M Series from MapR

MapR in San Jose, California, developed a series of Apache Hadoop solutions, including M3, M5 and the latest, M7. The M7 version adds capabilities for HBase, the Hadoop database. "This provides real-time database operations side by side with analytics," says Jack Norris, the company's vice president of marketing. "This provides lots of flexibility for the developers, because you can run applications from deep analytics to real-time database operations."

All of the MapR tools make it possible to treat a Hadoop cluster like standard storage. "Consequently, you can use industry-standard protocols and existing tools to access and manipulate data in the Hadoop clusters," Norris explains. "This makes it dramatically simpler and easier to support and use." This tool can also be used as an enterprise-grade solution.

For all of the MapR editions, the developers aimed at making Hadoop easy, fast and reliable to use. For example, these tools provide full data protection. Norris adds, "In terms of hardware systems, the beauty of MapR is that it supports standard, off-the-shelf hardware. So you don't need a specialized server." He adds, "You get better hardware utilization with MapR, because, for example, you can use a greater density of disk drives and faster drives."

To get an idea of how MapR can be applied, consider these case studies with M5. "One of the largest credit-card issuers is using M5 to create new products and services that better detect fraud," says Norris. "This includes some broad uses of analytics that benefit the cardholders." Some large online retailers also use M5 to better analyze shopping behavior and then use that to recommend items or simplify the user's experience. Other websites also use M5 in digital advertising and digital-media platforms. For example, says Norris, "Ancestry.com is using it as part of their service." Norris also points out: "Manufacturing customers use M5 for basically analyzing failures. They collect sensor data on a global basis and detect precursors to problems and schedule maintenance to avoid or prevent downtime." In summary, Norris calls the MapR users "really, really broad."

A user just starting with big data might begin with Apache Hadoop. "When you really put this into production use," Norris explains, "having enterprise grade with advanced features is a real benefit." Some of the advanced features of MapR tools include a reduced need for service. For instance, the MapR platforms are self-healing.

To ensure that MapR platforms include the features that users really want and need, the company works with an extensive network of partners. "In some cases, these partners provide vertical expertise within a market or even do consulting work to identify likely use cases to help us select the best implementations of MapR," Norris says. In addition, this partner network also helps develop some of the training available from the MapR Academy, which provides free training. "We want to ensure that users can easily share information and applications on a cluster," Norris says. "That is basically a unique value proposition of MapR."


Norris also adds that Hadoop can be an economical solution. "A high-end data warehouse can cost $1 million per terabyte, whereas Hadoop is only hundreds of dollars per terabyte."

At Ancestry.com, Zhukov and his colleagues use Hadoop to solve many problems. "For different data types and different business units, we use Hadoop in different ways," Zhukov explains. For example, the team that provides optical character recognition for documents scanned into Ancestry.com uses Hadoop for that. Also, when this company acquires newspaper data that are structured and in a text format, facts—such as dates, names and places—must be extracted, and Hadoop provides the technology for the parallel processing that achieves this. Zhukov adds, "In the DNA pipeline, where we compare customer DNAs and do matching for relatives, we look at the percentage of overlap and ethnicity. That's a big data job." Hadoop also gets used at this company for the machine learning that generates the possible family connections for users. As Zhukov concludes: "We do analytics in more of a traditional setting, but we are turning to Hadoop for all business units in the company."

Emerging Options

Hadoop does not solve every big data problem, nor is it the best fit for every user. In some cases, the actual implementation of Hadoop creates the biggest hurdle. As Zhukov says, "If you compare Hadoop to traditional database systems, it's in the early stages of its technology. So there's lots of tweaking and twisting for it to run smoothly."

In fact, the Hadoop challenges vary by user. “The biggest challenge really depends on the use case the customer has,” says Hu of IBM. For example, he says, “If they are just analyzing data in some standalone system and they have Hadoop clusters that operate in isolation, then the challenge is just to keep those clusters up and running.”

Hu also points out some general concerns with Hadoop. He says, "It has challenges in single points of failure, and massive amounts of system must be stitched together, which makes system provisioning and management a challenge." Beyond that, he lists several other issues, including the latency of execution in an application and the inability to execute workloads outside the Hadoop paradigm. To explore the general concerns even more, Hu directs readers to "Top 5 Challenges for Hadoop MapReduce in the Enterprise."

Beyond concerns of implementation and execution, Zhukov points out that it's not easy to hire Hadoop experts. As he says, "Here in Silicon Valley, it's proved to be quite difficult to hire people with Hadoop skills. So we're trying to grow and train our own people to work with Hadoop."

Sometimes, users assume that Hadoop is the only big data option. As Snell says, "In many cases, we find that Hadoop is a technology in search of solutions." To explain this, he says, "Imagine that I've designed a new screwdriver, and I'm trying to find how many things I can do with it. In practice, you might be able to drive in a Phillips screw with a flathead screwdriver, but the flathead screwdriver might not have been the best tool." Likewise, Hadoop is not the best tool for every big data situation.

Other options keep emerging for users. In some cases, companies even develop new approaches to creating the new tools. For example, Omnibond makes all of its advanced tools available to the open-source community. As Ray Keys, the company's marketing manager, explains, "We have a very unique business model. It's pure open source, and pushing everything out to the open-source community is really powerful." He adds, "There is goodwill involved, but it's goodwill from people we do development for, and they all realize that it's goodwill for the entire community. As a result, the community contributes, as well." Beyond just goodwill, this company's OrangeFS provides a file system that can handle millions and millions of files in a single directory without reducing the performance of the solution.

End-User Applications

To get a quantitative grasp of how users apply big data, Intersect360 Research conducted a survey and produced this report: "Different Perspectives on Big Data: A Broad-Based, End-User View."

This study arose from surveying—in partnership with the Gabriel Consulting Group in Beaverton, Oregon—more than 300 users of big data applications, which included both high-performance computing (HPC) and non-HPC. The survey sampled a wide range of users, including commercial, government and academic (or not-for-profit) users. For example, the commercial users ranged from consulting and computing to energy and engineering, as well as manufacturing and medical care. The government users worked in national research laboratories, defense, security and other areas. The users also came from a range of group sizes, from a few to more than 100,000 employees. For instance, nearly 20 percent of the groups had only 1–49 employees, and almost 10 percent had 20,000–99,999.

Such a diverse sample should provide a broad overview of big data applications, and the survey results confirm that. Among those polled, the big data applications included enterprise, research and real-time analytics, plus MapReduce, complex-event processing, data mining and visualization.



For example, more than two-thirds of those polled said that they work with big data applications in research analytics, and more than half apply data mining to big data situations. A small portion of those surveyed—6 percent—reported no current applications, although they were planning future ones.

Those surveyed also rated a series of big data challenges on a scale of 1–5, ranging from 'not a challenge' to an 'insurmountable challenge,' respectively. Overall, the users ranked the most significant challenges as managing many files and performing computations and analyses fast, where the average scores for both were 3.7 on the 1–5 scale. Among the technical users, 70 percent ranked managing many files as a 4 or 5 on the scale—meaning 7 out of 10 of those polled saw it as an insurmountable, or nearly insurmountable, challenge. Among business users, 65 percent rated performing computations and analyses fast as a 4 or 5 on the scale of big data challenges. In general, though, most users rated most of the challenges as significant. For example, users also reported serious challenges—average scores above 3—for managing large files, giving multiple users simultaneous access to data and providing long-term access to data.

This survey also explored where users get their software for big data applications. The results show some unexpected, or at least inefficient, findings: Half of those polled developed the applications internally; one-quarter of them purchased the applications; and about another quarter used open-source applications. "For those who developed the applications internally," says Snell, "those are all one-offs by definition." With challenges as complicated as big data, developing a unique solution to every problem does not make for efficient computing. Moreover, the wide use of one-offs limits these applications to groups that maintain the personnel to develop these advanced applications.

This survey also examined the specific tools the users apply to big data scenarios. Out of the 300-plus responses to this survey, 225 provided details on their applications, and Hadoop came out on top. Even so, only 11 percent of the users reported that they make use of Hadoop. Nonetheless, says Snell, "If people have a big data problem, they often think they must use Hadoop."

Future Fits

By definition, big data problems outstrip the capabilities of a user's current system. So some of the tools typically used only in HPC will work their way into non-HPC environments. A key example of that could arise from the use of parallel file systems.

The Intersect360 Research survey asked users about the file systems that they employ. The replies included a diverse set of approaches. Across all types of users—academic, commercial and government—most selected the Network File System (NFS), which was developed by Sun Microsystems in 1984. Overall, nearly 70 percent of the users employ NFS. Not many of the users implement parallel file systems at this point. For example, less than 20 percent of the people polled used the Hadoop Distributed File System (HDFS), IBM's General Parallel File System (GPFS) or Lustre.

As Snell says, "Parallel file systems are still poorly penetrated into commercial markets." He adds, "Commercial users will need to find solutions for parallel I/O."

Beyond learning to build the infrastructure for big data, much remains to be learned about where to use it. Gavin Nichols, vice president of strategy and R&D IT at Quintiles in Research Triangle Park, North Carolina, provides a recent example. During the 2012 influenza season, big data experts in Scotland sequenced the genotype of every person who contracted the flu over a two-week period. From that data, the analysts extracted information that revealed who was most susceptible to the illness, and who might require hospitalization. Scotland's government then used those data to adjust its healthcare policy.

"They are using big data in how they deal with influenza," Nichols explains. "This is a good example that we haven't even seen the starting point for what we can do with this technology." He adds, "If big data solutions grow in the right way, then there's the opportunity to make meaningful changes in the way drugs are developed, put through to market and utilized."

The variety of infrastructures being employed and the questions being considered reflect the diversity of approaches that today's users take to big data challenges. This reveals that many technologies in computing can be modified to handle big data problems in many enterprises. Moreover, many technologies besides Hadoop can be explored. For any particular user, the best technology depends on the big data problem at hand. Even for the same user, different questions and applications will often require different approaches to making use of big data. As Snell concludes, "Hadoop is not all things to all people. Plus, there is not one perfect answer for all big data problems."

About the Author

Mike May, Ph.D., is a science and technology writer, editor and book author. Mike has published hundreds of articles in prestigious publications during his 20-year professional writing career, and is currently editorial director for Scientific American Worldview. He holds a Ph.D. from Cornell University in neurobiology and behavior.



Sponsor Profile

A to Z for Big-Data Challenges

IBM provides every tool—from applications to complete systems—for cost-effective solutions

When a customer runs into a big-data problem—whether struggling with data velocity, volume or variety—IBM offers a solution. Such solutions, however, must supply flexible options, because of the relativity of the term big in big data. "The 'big' in big data is like the 'high' in high performance computing," says Yonggang Hu, chief architect for Platform Computing, IBM Systems and Technology Group. "It's different to different people."

To provide flexibility, IBM orchestrates a very broad big-data initiative. It ranges from complete systems—like the xSeries and pSeries architectures that handle big-data workflows—to various aspects of computing solutions. For example, IBM offers various forms of storage technology that benefit big-data situations. The company also creates networking components and interconnects needed for extremely high-throughput systems. The data in any system does not live in isolation. Big data come from a location and the analysis goes to a destination. So IBM solutions can provide connectivity between big-data systems and the databases and data warehouses in the enterprise.

The Power of Parallel Files

To provide an overview of big-data challenges, Hu divides the computer processing of big-data problems into different patterns. The first data workload might require only a monthly assessment. This does not require any immediate analysis. Instead, data can be collected and analyzed over time, and the results get used for decision-making as needed. At the other end of the spectrum, some applications—such as a traffic-monitoring system or software that maintains power across a utility grid—require real-time decisions. This requires a system with interactivity task latencies on the order of milliseconds. "The current Hadoop system doesn't have this kind of near real-time response capability," Hu says.

Working effectively with this variety of big-data applications requires a powerful file system. IBM's new General Parallel File System File Placement Optimizer (GPFS FPO), developed specifically for big data, allows for a model very similar to a Hadoop distributed file system (HDFS).

This file system provides a range of necessary features. For example, it includes enterprise capability. GPFS FPO is also extremely scalable, so it can grow with a user's data needs. Furthermore, this file system includes no single points of failure, which provides the robust computing needed to create sophisticated data solutions. Moreover, its POSIX support allows applications to get the benefits of a UNIX-compatible file system.
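In practical terms, POSIX support means ordinary programs can read and write files on the parallel file system with standard system calls, no Hadoop-specific client required. A minimal sketch, assuming a hypothetical GPFS FPO mount point (substitute any local directory to try it):

```python
# Ordinary POSIX-style file I/O works directly on a mounted parallel file system.
# GPFS_MOUNT is a hypothetical mount point; the fallback lets the sketch run anywhere.
import os

data_dir = os.environ.get("GPFS_MOUNT", "/tmp/gpfs-demo")

os.makedirs(data_dir, exist_ok=True)
path = os.path.join(data_dir, "events.csv")
with open(path, "a") as f:            # standard open/append -- no special client library
    f.write("2013-01-15,sensor-42,ok\n")
print(open(path).read())

# HDFS, by contrast, is not POSIX: the same data would normally be reached through
# the HDFS API or its shell, for example:  hadoop fs -put events.csv /logs/
```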


IBM Platform Symphony brings mature multi-tenancy and low-latency scheduling to Hadoop. Symphony enables multiple MapReduce workloads to run on the same cluster at the same time, dynamically sharing resources based on priorities that can change at run-time.

IBM: A globally integrated technology and consulting company

Founded: 1911 (as Computing Tabulating Recording Company)

Number of Employees: 433,362

Headquarters: Armonk, NY

Top Executive: Virginia M. Rometty (chairman, president and CEO)




Orchestrating a Solution

In many cases, big-data applications require advanced analytics. For those situations, IBM created Platform Symphony. "This software implements a Hadoop MapReduce that is very fast," says Hu. Moreover, this software provides very low latency. For instance, the overhead provisioning for an executed task takes less than 1 millisecond. "Open-source tool sets take an order of magnitude longer," Hu says. On average, performance tests reveal that Platform Symphony is 7.3 times faster than open-source Hadoop for jobs comprised of a representative mix of social media workloads.¹

For someone who has been running Hadoop MapReduce on a cluster, one challenge involves efficient utilization. In many cases, the user lacks sufficient jobs to keep the cluster occupied. In fact, utilization often runs as low as 20 percent. Platform Symphony helps users extract more from a cluster by getting workloads there faster and not keeping them waiting to run. In addition, users can run heterogeneous workloads, such as Hadoop MapReduce mixed with C++ jobs. Moreover, a business can divide a cluster among multiple units with Platform Symphony, giving each a slice of the structure. Then, the units can operate virtually independently of each other.

In a recent return-on-investment study, IBM explored the cost of running a big-data system, covering everything from the cost of the hardware and administration to cooling the data center. By using Platform Symphony instead of a standard Hadoop system, the results showed a potential savings of almost $1 million, depending on the customer's use case.

For Real-Time Needs

IBM's InfoSphere offers further options in big-data analytics. This approach handles big-data applications that need to run in real or almost real time. It includes ready-made components for people who already use Hadoop analytics.

Furthermore, InfoSphere's accelerators and frameworks speed up the development of big-data analytics. One ready-made tool, for example, is BigSheets. This provides a spreadsheet-like interface to the Hadoop stack. It extracts and manipulates data from the spreadsheet interface.

InfoSphere also lets users combine BigSheets with other software. It supports all of the components of the open-source Hadoop ecosystem. So a user could select open-source Hadoop layers in any of the stacks. Hu adds, "Customers can choose the layers where they want enterprise-capable technology."

Building RelationshipsIn tailoring big-data solutions to

customers, IBM supplies a wide range of customer services. This includes assessing a customer’s existing infrastructure and pro-viding business-level needs. This combines IBM’s technical- and business-service arms.

Getting the right solution often requires both levels of analysis. For example, an efficient big-data solution requires the right computing infrastructure and software systems, as well as the business knowledge to de-termine what reports should be generated from the analysis.

The very basis of big data often implies an analytical challenge. For example, Vestas, an energy supplier in Denmark, turned to IBM to determine the best locations for electricity-generating wind turbines. The resulting analysis included weather information, such as wind and climate, as well as the local geography. The company used IBM technology to put together a system that could produce accurate answers in a timely fashion, and the results identified the best locations for the wind turbines and how to arrange them to harvest the most energy possible.

As big-data problems grow and evolve, IBM's solutions keep adapting and advancing. It takes the teamwork of hardware and software infrastructure plus powerful analytics to perform the needed computation. In addition, technical and business solutions must work together to generate the most useful result. From the xSeries and pSeries systems to powerful interconnects and advanced analytical algorithms, IBM supplies big-data tools and solutions that extend from high-performance infrastructure to sophisticated business intelligence and planning. Only approaches as orchestrated as this can turn big-data opportunities into business growth and success.

Reference

[1] STAC Report: A Comparison of IBM Platform Symphony and Apache Hadoop MapReduce using Berkeley SWIM, 2012.

IBM provides a variety of System x and Power-based solutions optimized to run big-data tools, like IBM InfoSphere BigInsights, IBM Platform Symphony and IBM GPFS.


To break down big-data problems, many data scientists are looking to the Hadoop MapReduce model. Currently, those systems run on dedicated Hadoop clusters. Many companies and researchers have existing clusters that run all sorts of code but cannot leverage Hadoop in their shared environments; dedicated clusters are required because of the close ties between Hadoop MapReduce and the Hadoop Distributed File System (HDFS). To solve that problem, Omnibond has extended the open-source Orange File System (OrangeFS) to natively support Hadoop MapReduce without the HDFS requirement.
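
The report does not spell out how a job is pointed at OrangeFS instead of HDFS. In stock Hadoop, an alternative file system is normally plugged in through the fs.defaultFS URI and an fs.<scheme>.impl property naming the implementation class, so a sketch of that general mechanism is shown below. The ofs:// scheme, the port number and the class name used here are assumptions made for illustration, not documented OrangeFS settings.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class OrangeFsConfigSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical values: the real scheme, URI and implementation class
            // shipped with the OrangeFS Hadoop support may differ.
            conf.set("fs.defaultFS", "ofs://metadata-server:3334/");
            conf.set("fs.ofs.impl", "org.example.orangefs.OrangeFileSystem");
            // With the default file system redirected, MapReduce input and output
            // paths resolve on the parallel file system rather than on HDFS.
            FileSystem fs = FileSystem.get(conf);
            System.out.println(fs.exists(new Path("/data")) ? "found /data" : "no /data yet");
        }
    }

Because no dedicated HDFS instance is involved, the same cluster and the same storage can keep serving non-Hadoop workloads alongside MapReduce jobs.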

According to Boyd Wilson, an Omnibond executive, “Our main business model for OrangeFS is all open-source, so you can download and start running today.” In addition, Omnibond provides support and development services. For example, if a customer needs a specific feature added to OrangeFS, Omnibond will price out the cost and create it; the customer must agree that the resulting code goes back into open source. Furthermore, OrangeFS supports a wide range of applications in high-performance and enterprise computing.

The need to make parallel file systems easier to use, including leveraging MapReduce on existing clusters, ignited the original work on OrangeFS. To bring such systems to more users, parallel file system development needs to continue. OrangeFS is not limited to traditional data science alone; its broadening list of use cases extends from research into finance, entertainment and analytics.

Enhancing the open-source business model, Omnibond provides everything that a customer would expect from a company that develops closed-source code. “If you have a problem, just call. If you need a feature, we can add it, and in addition, you get the peace of mind in having the source code, too,” Wilson says. Customers can purchase this service and directed development on an annual subscription basis, giving them the assurance of stable development and support. Moreover, this model provides a triple-win situation that's great for the support customer, Omnibond and the entire open-source community, thus assuring the ongoing development of OrangeFS.

Inside OrangeFS

In general, OrangeFS provides the next-generation approach to the Parallel Virtual File System, version 2 (PVFS2). From the start, PVFS2 delivered desirable features, including a modular design and distributed metadata. It also supports the Message Passing Interface (MPI). Other features include PVFS2's high scalability, which makes it a great choice for large data sets or files.

A Big-Data File System for the Masses
The computing community and Omnibond developed open-source infrastructure software for storage clusters

Omnibond Systems develops open-source tools that can be used in a wide range of big-data applications. (Image courtesy of Omnibond Systems.)

Omnibond Systems, LLC
A software engineering and support company focused on infrastructure software.

Founded: 1999

Employees: 45

Headquarters: Clemson, SC

Top Executive: Boyd Wilson


Beyond working with such large files, PVFS2 works fast. For example, it dramatically outruns a network file system (NFS) when used through the MPI-IO interface.

Omnibond started with those features and created OrangeFS by adding even more. For example, the file system is backed by a focused development team, and it includes improvements in working with smaller files; Omnibond does all of its testing on 1-megabyte files. Version 2.8.6, released on June 18, 2012, provided users with advanced client libraries, Linux kernel support and Direct Interface libraries for POSIX and low-level system calls. Additionally, users can receive support for WebDAV and S3 via the new OrangeFS Web Pack.
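
As a rough illustration of what the kernel-module support means in practice: once an OrangeFS volume is mounted, applications can reach it through ordinary POSIX-style file I/O, with no Hadoop or MPI code involved. The /mnt/orangefs mount point below is a hypothetical example, not a documented default.

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class MountedOrangeFsDemo {
        public static void main(String[] args) throws IOException {
            // Hypothetical mount point; the actual path depends on how the
            // OrangeFS kernel client is configured on the node.
            Path sample = Paths.get("/mnt/orangefs", "demo", "hello.txt");
            Files.createDirectories(sample.getParent());
            Files.write(sample, "written through ordinary file-system calls".getBytes(StandardCharsets.UTF_8));
            System.out.println(new String(Files.readAllBytes(sample), StandardCharsets.UTF_8));
        }
    }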

Future releases promise even more features, such as distributed directories, which will make access to large directories even faster, and capability-based security, which will make the entire file system more secure.

Added Incentives

For many big-data users, the list of OrangeFS features already mentioned sounds enticing enough, especially combined with the best of the open- and closed-source worlds. Nonetheless, even more incentives exist.

The object-based design of OrangeFS appeals to many users. In addition, this file system separates data and metadata, and it provides optimized MPI-IO requests. Furthermore, the software can be installed on a range of architectures: users can run OrangeFS over multiple network architectures and on many commodity hardware platforms. To ensure file safety, OrangeFS will also provide cross-server redundancy in the near future.

Given the open-source nature of OrangeFS, documentation might sound like a rarity, but Omnibond goes even further. For instance, the recent OrangeFS Installation Instructions supply complete documentation, all available online.

Many applications of big data require aggregation. For example, a system might gather real-time data and store it for later access. That stored data can then be made available to the nodes of a Hadoop cluster, where the collected data can be queried and analyzed with MapReduce.

Likewise, a big-data system might require other forms of data aggregation. For instance, real-time data could be collected, combined with private data and then processed. On the other hand, a user might take ‘snapshots’ while processing real-time data to collect a subset of data, and then analyze it. OrangeFS can provide storage for all of these big-data scenarios.
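
As a hedged sketch of the first scenario, the classes below aggregate stored real-time readings by hour using plain Hadoop MapReduce. The CSV record layout (timestamp, source ID, numeric reading) is an assumption made for illustration; a real collection pipeline would define its own schema. The driver would be configured the same way as the earlier word-count sketch.

    import java.io.IOException;

    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sums stored readings per hour; assumes lines such as
    // "2013-01-15T09:23:41,sensor-07,12.5" (timestamp, source ID, reading).
    public class HourlyTotals {

        public static class HourMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
            private final Text hour = new Text();
            private final DoubleWritable reading = new DoubleWritable();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] fields = value.toString().split(",");
                if (fields.length < 3 || fields[0].length() < 13) {
                    return;                                    // skip malformed records
                }
                hour.set(fields[0].substring(0, 13));          // "2013-01-15T09" buckets by hour
                reading.set(Double.parseDouble(fields[2].trim()));
                context.write(hour, reading);
            }
        }

        public static class SumReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
            @Override
            protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                    throws IOException, InterruptedException {
                double total = 0.0;
                for (DoubleWritable v : values) {
                    total += v.get();
                }
                context.write(key, new DoubleWritable(total));
            }
        }
    }

Whether the underlying storage is HDFS or a parallel file system such as OrangeFS makes no difference to this code; only the input and output paths handed to the job change.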

Advanced Applications

The open-source community stays at the cutting edge of computation and data analysis. Consequently, experts at Omnibond track all of the trends in high-performance and big-data environments, and the information they glean continues to shape future versions of OrangeFS.

As an example, ongoing work in data analysis spawns previously unimagined approaches, and new ways of thinking about data keep emerging. Analysis of social media and other large data streams, for instance, can trigger entirely new ways of analyzing information such as log, customer and other real- and near-real-time data. In other cases, new perspectives on existing data will generate novel approaches to data integration that were never considered before.

As Wilson explains, “The new concept of data analysis covers a wide range. In some cases, it involves huge amounts of data, and other times not so much.” He adds, “You can bring together different data points that you wouldn't think would correlate, but sometimes they do.”

This kind of thinking, seeing today's challenges and turning them into tomorrow's opportunities, lies at the heart of Omnibond's philosophy. “We are adding exciting new features on our development path to OrangeFS 3.0,” says Wilson. “These should be real game-changing enhancements.”

As a company, Omnibond focuses on building the OrangeFS that customers want, and then sharing those advances with the wider community.

Omnibond's open source Orange File System (OrangeFS) supports Hadoop MapReduce and provides a simplified parallel file system. (Image courtesy of Omnibond Systems.)



Big data varies from one customer to the next. As Ken Claffey, senior vice president and general manager, ClusterStor at Xyratex, said, “What's big data to one guy is tiny to another.” No matter how you define big data, achieving analytic results rapidly requires one thing: fast, reliable data storage. And with more than 25 years of experience as a leading provider of data-storage technology, this is exactly where Xyratex leads the market.

Xyratex's innovative technology solutions help make data storage faster, more reliable and more cost-effective. As an organization, Xyratex provides a range of products, from disk-drive heads and disk-media test equipment to enterprise storage platforms and scale-out storage for high-performance computing (HPC). For big-data applications, where users need storage that scales in both capacity and performance, Xyratex solutions support everything from high-performance processing of relatively small data quantities to some of the largest supercomputers in the world.

“Performance is often overlooked,” Claffey said, “but it’s the key driver. Once you have your data together, you need performance for analytics to transform that data into information you can use to make more-informed decisions.”

Combining Features

Xyratex achieves its big-data storage success by providing solutions, not just parts. The company has developed a fresh, groundbreaking scale-out storage architecture that meets user performance and scalability needs while also providing the industry's highest levels of integrated solution reliability, availability and serviceability.

“You can go from a single pair of scalable storage nodes with less than 100 terabytes of capacity to solutions that support more than 30 petabytes of capacity and deliver terabyte-per-second performance,” Claffey said. “Then we embed the analytics engine right into the storage solution, helping you make better use of all the data you are storing and maximize operational efficiency within the solution.”

Adding Engineering

Traditional HPC users often tried to build their own data-storage infrastructure by stitching together different pieces of off-the-shelf hardware. This approach often encountered challenges.

Xyratex saw an opportunity to engineer a high-performance, singly managed, appliance-like data storage solution called ClusterStor. ClusterStor is a tightly integrated scale-out solution that is easy to deploy, scale and manage.

What is now termed big data has been used in practice for many years within compute-intensive environments and is evolving toward an enterprise class of requirements. However, enterprise users do not have the time to struggle with the level of complexity these environments bring with them. Supporting exponential growth in these types of environments requires data-storage solutions that can keep costs in check. As with general-purpose storage acquisitions, enterprise users should consider the total cost of ownership, including the cost of upgrades, time to productivity and operational efficiency.

Crafting Storage Solutions
Xyratex builds scalable capacity and performance into big data systems

XYRATEX
A leading provider of data-storage technology.

Founded: 1994

Employees: 1,900

Headquarters: Havant, United Kingdom

To help big-data users, Xyratex provides its ultra dense 5U84 enclosure. (Image courtesy of Xyratex.)


ClusterStor allows enterprise-class users to move to an engineered, high-performance data-storage solution with enterprise reliability, serviceability and availability, without the hidden costs associated with stitching together components.

Breadth of Benefits

With a Xyratex ClusterStor storage solution, a user gets an integrated product. While Xyratex leverages open-source and standards-based hardware platforms, it also adds its own blend of ease of deployment and reliability. It all adds up to the fastest and most reliable data storage on the market today. In addition, all ClusterStor systems come pre-configured from the factory, enabling turnkey deployment, which significantly reduces installation time.

Many sectors, such as digital manufacturing, government and commercial industry, are finding that big-data initiatives can provide new insight into the data they already have, but many organizations in these sectors lack the in-house expertise to create and manage big-data solutions.

“Coming to users with a big-data solution that they can easily deploy and manage removes a barrier for them to get access to the technology they need,” said Claffey. “The complexity of the solution is completely hidden from the user, and you don't see all of the thousands of man-hours that it took to create that experience. We help you reduce complexity, get a great out-of-the-box experience and maximize the uptime of the system.”

“Instead of building and tweaking to get a system up and running, our turnkey solutions reduce the time it takes to start computing. We even have architects who can come in and design how the solution will fit into your environment,” added Steve Paulhus, senior director, strategic business development at Xyratex. “We are known for high-quality storage products, in the way we engineer, test, validate and deliver preconfigured ClusterStor solutions. Through all of this, we offer the quickest path to the performance and results users seek.”

As big data continues to evolve quickly, users' storage needs will as well, and data-storage systems and vendors will have to adapt to stay relevant in the market.

“We continue to leverage our knowledge and experience-based understanding of end-user data storage needs to relentlessly push beyond traditional boundaries,” said Paulhus. “That's how we will continue to advance in big data and provide solutions ideally suited to serving the needs of the market.”


The ClusterStor 6000 provides the ultimate integrated HPC data-storage solution. (Image courtesy of Xyratex.)
