
Database Architecture Proposal

May 21, 2015


Scalable database and IT architecture using Open Source / Cloud Computing technologies such as Hadoop, Talend and Force.com

<Client – Confidential>

Architecture Proposal

Prepared By

Bernard Dedieu


Table of Contents

1. Background
2. Problem Statement
3. Proposed Architecture—High-Level
   a. 100GB-scale Data Volume
   b. Log Files as Data Source
   c. Customer-facing OLAP
4. Proposed Architecture—Low-Level
   a. Hadoop
   b. Data Marts
      i. One vs. Many
      ii. Brand of RDBMS
   c. Reporting Portal
   d. Hardware
   e. Java Programming
5. Data Anomaly Detection
6. Data integration/importation and Data Quality Management
7. Summary
Appendix A. Hadoop Overview
   MapReduce
      Map
      Reduce
   Hadoop Distributed File System (HDFS)
8. Query Optimization
9. Access and Data Security
10. Internal Management and Collaboration tools
11. Salesforce and Force.com integration
12. Roadmap


1. Background <Company presentation and background – Confidential>

2. Problem Statement In terms of load on the database, the number of sites is the best metric since it describes the number …. So it is very important that the web application remains effective as the company grows (this includes the database, the framework and the architecture of the servers). Also, as the company grows in …, there will be a need to deploy a server in Europe to manage ...

In addition, the historical data will be kept, and as the number of ... grows, the data volume will grow exponentially. So the overall database architecture needs to be highly and easily scalable.

It is also more than likely that, as the solution price decreases, bigger corporations will be interested in the ... solution. Therefore, the ... solution will need to be integrated into existing information systems.

This will require:

- Interfacing the ... solution with existing applications.

- Having the ... solution rely on standard and open technologies.

- Building partnerships with System Integrators, or building an internal Professional Services organization to support these customers.

With its current, somewhat limited database schema, the data warehouse’s millions of records consume more than 2GB of disk space, including indexes. Extensions to the data warehouse schema, coupled with a growing customer base, will easily push the data warehouse volume beyond 100GB. The single instance, multi-schema MySQL database architecture simply does not provide the scalability necessary to meet ... demands. In addition to these scalability problems, the reporting infrastructure is also limited in its potential for enhanced functionality. For instance, ... would like to extend the Reporting Portal to provide customers with ad-hoc, multi-dimensional query capability and custom reporting based on searchable attribute tags in the data warehouse. At present, the data warehouse dimensions do not provide the flexibility needed to easily accommodate these kinds of changes. Therefore, ... has a pressing need to replace its current reporting infrastructure with a scalable, flexible architecture that can not only accommodate their growing data volumes, but also dramatically extend their reporting functionality. Key goals for the new infrastructure include:

- Redundant, efficient retention of historical detail
  o Write once, read many
  o Compression
  o No encryption required
  o ANSI-7 single-byte code page is sufficient

- Linear scalability (i.e., as data volume increases, performance is not degraded)

- Flexible extensibility (e.g., attributes can easily be added and exposed to customers for reporting, either as dimensional attributes or fact attributes)

- Full OLAP support
  o Standard reports
  o Custom reports
  o Ad-hoc query
  o Multi-dimensional
  o Hierarchical categories (i.e., tagging, snowflakes)
  o Charts and graphs
  o Drill-down to atomic detail (i.e., ... log)
  o 24x7 availability
  o Query response time measured in seconds (not minutes)

- Efficient ETL
  o Near real time (i.e., < 15 minutes)
  o Handles fluctuating volumes throughout the day without becoming a bottleneck (which can cause synchronization problems in the data warehouse)

- Partitioning of data by customer

This new architecture must deliver vastly improved functionality, while controlling for implementation cost and time to roll-out.

3. Proposed Architecture—High-Level From an architectural perspective, there are three overarching factors driving the technical solution for ... reporting needs:

a. 100GB-scale Data Volume Due to their sheer size, large applications like ...s data warehouse require more resources than can typically be served by a single, cost-effective machine. Even if a large, expensive server could be configured with enough disk and CPU to handle the heavy workload, it is unlikely that a single machine could provide the continuous, uninterrupted operation needed to meet ... SLAs. A cloud computing architecture, on the other hand, is an economical, scalable solution that provides seamless fault tolerance for large data applications.

b. Log Files as Data Source More and more organizations are seeking to leverage the rich content in their verbose log files to drive business intelligence. Sourcing from log files presents a different set of challenges compared to selecting data out of a highly structured OLTP database. Efficient, robust, and flexible parsing routines must be programmed to identify tagged attributes and map these to business constructs in the data warehouse. And because log files tend to consume lots of disk space, they should ideally be stored in a distributed file system in order to load balance I/O and improve fault tolerance.

c. Customer-facing OLAP The stakes are usually higher when building and maintaining a customer-facing business intelligence solution, as opposed to one that is implemented internally. ... reputation and marketability depend in part on its customers' opinions of the Reporting Portal. It must be intuitive, easy to use, powerful, secure, and available anytime. Its data should be as fresh as possible, while providing historical data for trend analyses. Customers should have seamless access to both aggregated metrics and ... log detail. The Reporting Portal should expose the customizability of the speech application through its reports. Any customer-specific categories, tags, and data content should be faithfully reflected in the Reporting Portal, just as the customer would expect to see them.


Based on these driving factors, we propose a cloud computing architecture comprising a distributed file system, distributed file processing, one or more relational data marts, and a browser-based OLAP package (see Figure 1). Most of this infrastructure will be built using open source software technologies running on commodity hardware. This strategy keeps initial implementation costs low for a right-sized solution, while providing a path for scalable growth.

Figure 1. High-Level Architecture

In this design, Apache Hadoop (http://hadoop.apache.org/) is used to perform some of the functions normally provided by a relational data warehouse. Most specifically, Hadoop behaves as the system of record, storing all of the historical detail generated by the Speech Applications. New ... logs are immediately replicated into the Hadoop Distributed File System (HDFS), which is massively scalable to accommodate virtually any amount of data. HDFS is based on Google's GFS, which essentially stores the content of the Web in order to facilitate index generation. Other well-known companies that store huge volumes of data in HDFS include Yahoo!, AOL, Facebook, and Amazon. Hadoop is free to download and install. It uses a cloud computing architecture (i.e., lots of inexpensive computers linked together, sharing workload), so it can be easily and economically extended as needed to scale for growth. Scaling performance is linear; performance does not degrade as you increase data volume.

Hadoop cannot fulfill all of the functions of a data warehouse, though. For instance, it does not contain indexes like a relational database, so it can't truly be optimized to return query results quickly. Hadoop provides a very powerful, distributed job processing technology called MapReduce, which can perform much of the extract and transform work that is commonly done by ETL tools. Therefore, Hadoop powerfully augments ... business intelligence architecture by using distributed storage and processing to perform the data warehousing functions that would otherwise be the hardest to scale under a traditional, single-machine, relational data warehouse architecture.

[Figure 1 components: ... logs flow into the Hadoop Distributed File System (HDFS), then into Relational Data Mart(s), and finally to the Reporting Portal. Figure annotations: ... logs are retained forever (or as otherwise specified per customer requirements); ... logs are immediately replicated into HDFS and can be retained indefinitely; any portion of historical data can be read from Hadoop and aggregated as needed into optimized reporting database(s); reports, ad-hoc queries, graphs, and charts are presented via browser-based software.]


While Hadoop does the "heavy lifting," other, more traditional technologies are used to provide familiar business intelligence functionality. Relational data marts serve up optimized OLAP database schemas (e.g., highly indexed star schemas) for querying via standard business intelligence tools. One defining factor of a data mart is that it can be completely truncated and reloaded from the upstream data repository (in this case, Hadoop) as needed. This means that if ... needs to enhance the reporting database design by altering a dimension or adding new metrics, the data mart's schema can be altered—even dramatically—and repopulated without the risk of losing any historical data. It's also worth noting that because the Hadoop repository stores all historical detail, it is possible to retroactively back-populate new metrics that are added to the data mart(s).

As of this writing, it is not known how much data volume must be accommodated in a given data mart. And we don't yet know whether one data mart would suffice, or if there would be many data marts. These questions will influence the choice of relational database management system (RDBMS) that is selected for .... For example, MySQL is cheap to procure and implement, but has serious scalability limitations. A columnar MPP database like ParAccel is ideal for handling multi-terabyte data volumes, but comes with a price tag. One advantage of this proposed architecture, though, is that the data marts can be migrated from one technology to another without risk of losing valuable data.

The customer-facing front-end technology should be a mature, fully supported product like BusinessObjects or MicroStrategy. Such technologies are rich with features that would otherwise be very costly to develop in-house, even with open source Java libraries. Besides, the customers who use this interface should not become quality assurance testers for internally developed user interfaces. The Reporting Portal is a marketed service and, as such, must leave customers with a great impression.

4. Proposed Architecture—Low-Level This section provides an in-depth look at each component in Figure 1 above.

a. Hadoop Hadoop is an extremely powerful open source technology that does certain things very well, like store immense volumes of data and perform distributed computations on that data. Some of these strengths can be leveraged within the context of a business intelligence application. For instance, several of the functions that would normally be performed within a traditional data warehouse could be taken up by Hadoop.

One defining feature of a data warehouse is that it stores historical data. While source systems may only keep a rolling window of recent data, the data warehouse retains all or most of the history. This frees up the transactional systems to efficiently run the business, while keeping a historical system of record in the data warehouse. HDFS is ideal for archiving large volumes of static data, such as ... ... logs. HDFS provides linear scalability as data volumes increase. Not only can HDFS easily handle the ... forever retention requirement, but it could also permit ... to retain all of its history. HDFS comfortably scales into the petabyte range, so the need to age out and purge files could be eliminated altogether. Hadoop is a perfect fit for this kind of historical retention problem, because it scales to petabyte sizes simply by configuring additional hardware into the cluster.

Another benefit of HDFS is its data redundancy. HDFS replicates file blocks across nodes, which can physically reside in the same data center or in another data center (assuming the VPN bandwidth supports it). This would entirely eliminate the need for ... to copy zipped ... log files between data centers (see Figure 2).
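To make the archival path concrete, below is a minimal, illustrative Java sketch of copying a ... log file into HDFS and raising its replication factor. It uses only the standard org.apache.hadoop.fs.FileSystem API; the local path, HDFS directory, and replication factor are hypothetical placeholders, not values from this proposal.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LogArchiver {
        public static void main(String[] args) throws Exception {
            // Cluster settings are read from core-site.xml / hdfs-site.xml on the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical local log file and HDFS destination directory.
            Path localLog = new Path("/var/logs/app/session-20150521.log.gz");
            Path hdfsDir  = new Path("/archive/logs/2015/05/21/");

            // Copy the log into HDFS for permanent, distributed storage.
            fs.copyFromLocalFile(localLog, hdfsDir);

            // Optionally raise the replication factor for extra redundancy
            // (e.g., when blocks should also be replicated to a second data center).
            fs.setReplication(new Path(hdfsDir, localLog.getName()), (short) 3);

            fs.close();
        }
    }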


Figure 2. ... Log-Hadoop Architecture

Although business intelligence solutions depend on lots of data, business users are interested in information. In order to transform large volumes of raw data into meaningful business metrics, calculations must be performed, business rules must be applied, and large numbers of data elements must be summarized into a few figures. Traditionally, this type of aggregation work is done outside of the data warehouse by an extract, transform, and load (ETL) tool, or within the data warehouse using stored procedures and materialized views. Due to the inherent constraints imposed by a relational database system like MySQL, there are limits to how much data can reasonably be aggregated this way. As source data volumes increase, the time required to perform aggregations can extend beyond the point in time when the resulting metrics are needed by the customers.

Hadoop is able to perform these kinds of aggregations much more quickly on large data volumes because it distributes the processing across many computers, each one crunching the numbers for a subset of the source data. Consequently, aggregated metrics that might have taken days to calculate in a traditional data warehouse model can be churned out by Hadoop in a couple of hours or even minutes. MapReduce is particularly well suited to structured data sets like ... ... logs. Tagged attributes map easily to key/value pairs, which are the transactional unit of MapReduce jobs (see Figure A-1 in the appendix). ... ETL routines could therefore be replaced with Java MapReduce jobs that read ... log files from HDFS and write to the data marts (see Figure 3).

[Figure 2 components: VXML … logs feed HDFS (Figure A-2) and MapReduce (Figure A-1), running on racks of TaskTracker/DataNode servers with a JobTracker and a NameNode. Figure annotations: a Java program reads each ... log and writes it into HDFS for permanent storage; the Hadoop Distributed File System (HDFS) can be configured to transparently replicate data across racks and across data centers, providing redundant failover copies of all file blocks.]


Figure 3. Hadoop MapReduce Architecture

There are also quite a few maturing open source tools that can give analysts direct access to Hadoop data. For instance, Hive can be used as a SQL-like interface into Hadoop, permitting analysts to run queries in much the same way that they would access a traditional data warehouse, while HBase offers low-latency access to data stored in the cluster. These tools might be useful to ... personnel who want to perform analyses that are not immediately available through the Reporting Portal. Such tools are best suited for more technically literate analysts who are comfortable writing their own queries and do not require fast query response time (a hedged example of such a query, issued over JDBC, follows the feature list below).

Cloudera (http://www.cloudera.com/) recently unveiled its browser-based Cloudera Desktop product. This tool simplifies some of the work required to set up, execute, and monitor MapReduce jobs. For the more technically inclined analysts in ... organization, Cloudera Desktop might be a good fit—even better than a SQL-like interface such as Hive. Cloudera Desktop's main features include:

File Browser – Navigate the Hadoop file system

Job Browser – Examine MapReduce job states

Job Designer – Create MapReduce job designs

Cluster Health – At-a-glance state of the Hadoop cluster
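Returning to the SQL-like access mentioned above, the following is a hedged sketch of the kind of ad-hoc query an analyst could issue over JDBC against a HiveServer2 endpoint, assuming an external Hive table (hypothetically named call_logs) has been defined over the ... log files in HDFS. The host name, credentials, table, and columns are illustrative assumptions only.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveAdHocQuery {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC driver (assumes the Hive JDBC jar is on the classpath).
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            // Hypothetical HiveServer2 host and database.
            try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://hive-host:10000/default", "analyst", "");
                 Statement stmt = con.createStatement()) {

                // Example ad-hoc aggregation over a hypothetical external table of ... logs.
                ResultSet rs = stmt.executeQuery(
                    "SELECT customer_id, COUNT(*) AS calls " +
                    "FROM call_logs WHERE log_date = '2015-05-21' " +
                    "GROUP BY customer_id");

                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }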

It is also possible to use Hadoop's MapReduce to generate "canned reports" in batch processing mode. That is, nightly batch jobs can be scheduled to produce static reports. These reports would consume data directly from Hadoop, and the resulting content could be pre-formatted for presentation via HTML. Such reports would effectively bypass the relational data mart altogether.

[Figure 3 components: HDFS (Figure A-2) and MapReduce (Figure A-1), running on racks of TaskTracker/DataNode servers with a JobTracker and a NameNode, feed the Relational Data Mart(s) over JDBC, alongside other tools. Figure annotations: the entire history of ... logs is permanently stored in Hadoop, making it possible to back-populate new BI metrics with old data, perform year-over-year trend reports, and manually mine data as needed; Java programs execute MapReduce jobs to extract and transform any subset of ... log data, and then write the aggregated results into the relational data marts via JDBC.]
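As a sketch of the load path annotated in Figure 3, the following Java program reads the tab-separated output that a MapReduce aggregation job has written to HDFS and batch-inserts it into a relational data mart over JDBC. The HDFS output directory, JDBC URL, credentials, and table name are hypothetical placeholders, not part of the proposal.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DataMartLoader {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical HDFS directory holding the reducer output (part-r-* files).
            Path outputDir = new Path("/jobs/daily-metrics/2015-05-21");

            // Hypothetical data mart connection and target table.
            try (Connection con = DriverManager.getConnection(
                     "jdbc:mysql://mart-host:3306/reporting", "etl", "secret");
                 PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO daily_metrics (metric_key, metric_value) VALUES (?, ?)")) {

                for (FileStatus file : fs.listStatus(outputDir)) {
                    if (!file.getPath().getName().startsWith("part-")) {
                        continue; // skip _SUCCESS and other non-data files
                    }
                    try (BufferedReader reader = new BufferedReader(
                             new InputStreamReader(fs.open(file.getPath())))) {
                        String line;
                        while ((line = reader.readLine()) != null) {
                            // MapReduce text output is "key<TAB>value".
                            String[] kv = line.split("\t", 2);
                            ps.setString(1, kv[0]);
                            ps.setLong(2, Long.parseLong(kv[1]));
                            ps.addBatch();
                        }
                    }
                }
                ps.executeBatch();
            }
        }
    }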


b. Data Marts Stated simply, Hadoop can make an excellent contribution as a component of a business intelligence solution, but it cannot be the whole solution. A key limitation is that a data warehouse is indexed to provide fast query response time, while Hadoop data is not. A data warehouse (or data mart) typically contains pre-aggregated metrics in order to deliver selected results as fast as possible (i.e., without re-aggregating on the fly). Therefore, a gating factor in deciding whether to run analytic queries and reports against Hadoop is the end user’s expectation for response time. Since ... customers expect and deserve immediate to near-immediate query performance, directly querying Hadoop is not a viable design for the Reporting Portal. It’s also worth noting here that most of the mature, industry-standard OLAP tools like BusinessObjects and MicroStrategy cannot be coupled directly with Hadoop. Therefore, the ... reporting infrastructure will still require a traditional, relational, indexed data store containing pre-aggregated metrics. This data store is rightly called a data mart, because it is not the historical repository of detailed data, or system of record. All of its content can be regenerated at any time from the upstream data source. ... has two basic architectural decisions to make with regard to the data mart. First is whether to create one data mart or multiple data marts. The second decision is which brand of RDBMS to implement.

i. One vs. Many There are a couple of compelling reasons to implement multiple, separate data marts. One reason is performance. The less data you cram into a relational database, the faster it generally performs. There can be exceptions to this rule (like ParAccel's Analytic Database), but relational databases are usually more responsive with smaller data volumes. A second motivation for splitting ... data into multiple marts is security. It's certainly quite possible to implement robust security within a single relational database instance, but physically separating each customer's data definitely ensures that they cannot see one another's content. However, it is strongly recommended that ... not rely solely on physical separation to enforce data security. There might be situations in which it is not economical to store lots of small customers' data separately. ... should retain the option to co-mingle multiple customers' data in one database instance, while ensuring privacy to each of them.


Figure 4. Multiple Data Marts

A third reason for implementing multiple data marts is customizability. It's quite possible that Customer A might require different kinds of metrics from what Customer B needs. One data mart would have to be all things to all customers, making it horribly complex. The turnaround time required to add customer-specific metrics would be greatly improved by hosting them in a dedicated data mart. Having multiple data marts would be very similar to ... current reporting architecture, which uses dedicated MySQL schemas to partition customer data.

ii. Brand of RDBMS There are several factors influencing ... choice of relational database management system. The primary factor will likely be data volume, which itself is influenced by many factors (e.g., data model, historical timeframe, individual customer's ... log volume). Therefore, within the context of this proposal, it is not possible to accurately estimate data sizing. Instead, we can provide some basic guidance for future reference. From our experience, relatively small volumes (i.e., 10s of GB or less) can be comfortably accommodated by MySQL. Medium volumes (up to 100s of GB) are better served by Microsoft SQL Server or Oracle. Large volumes (100s of GB to TB-scale) require a columnar MPP database like ParAccel Analytic Database, Netezza, Teradata, Exadata, or Vertica.

In addition to data volume, ... will likely consider cost. MySQL is free, while other products can cost hundreds of thousands of dollars to purchase. The cost of a given RDBMS may also depend in part on the hardware needed to support it. Some RDBMS products only run on certain brands of hardware. Clearly, this can have far-reaching ramifications for ... costs of operations. We recommend that ... choose database software that can run on any Intel-powered, rackable server. Such hardware will provide the most economical scalability path.

[Figure 4 components: the Hadoop system of record, containing all historical detail, feeds separate relational data marts for Customer A, Customer B, and Customer C.]


Table 1. RDBMS Recommendations

Data Volume         Brand                        Notes
Up to 10s of GB     MySQL                        Free, but doesn't scale well
Up to 100s of GB    Microsoft SQL Server         Good value for money, easy to run on commodity hardware
100s of GB to TB    ParAccel Analytic Database   Powerful, hardware-flexible, negotiable pricing model

c. Reporting Portal ... next generation Reporting Portal could provide its customers with a greatly expanded set of features if it is replaced with an industry-standard business intelligence tool like BusinessObjects or MicroStrategy. The choice of such a tool will essentially be driven by how ... customers' needs change and, more importantly, by whether ... starts to win bigger corporations with existing IT architectures as clients. In the short and medium term, an open source tool such as DataVision (http://datavision.sourceforge.net) would be a good solution, allowing custom reports to be produced easily and results to be generated in XML format. The XML format makes report distribution largely operating-system agnostic; the only requirement is XML file reading capability on the platform where the reports need to be viewed.

These web-based tools leverage the power of metadata to enforce security and map business metrics to back-end data structures. A metadata-based tool flexibly supports business abstractions like categories and hierarchies that are not inherent to the physical data. Business intelligence tools offer a rich presentation layer capable of displaying the graphs, charts, and pivot tables that business users have come to expect from reporting interfaces.
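To illustrate the operating-system-agnostic XML report format mentioned above, report rows could be serialized as in the sketch below. This is plain Java using javax.xml.stream, not DataVision's own API; the element names and sample values are made up for the example.

    import java.io.FileOutputStream;
    import javax.xml.stream.XMLOutputFactory;
    import javax.xml.stream.XMLStreamWriter;

    public class XmlReportWriter {
        public static void main(String[] args) throws Exception {
            try (FileOutputStream out = new FileOutputStream("daily-report.xml")) {
                XMLStreamWriter xml = XMLOutputFactory.newInstance()
                        .createXMLStreamWriter(out, "UTF-8");

                xml.writeStartDocument("UTF-8", "1.0");
                xml.writeStartElement("report");
                xml.writeAttribute("name", "daily-call-summary");   // hypothetical report name
                xml.writeAttribute("date", "2015-05-21");

                // In practice each row would come from the data mart; two sample rows shown here.
                String[][] rows = { {"Customer A", "1250"}, {"Customer B", "984"} };
                for (String[] row : rows) {
                    xml.writeStartElement("row");
                    xml.writeAttribute("customer", row[0]);
                    xml.writeAttribute("calls", row[1]);
                    xml.writeEndElement();
                }

                xml.writeEndElement();
                xml.writeEndDocument();
                xml.close();
            }
        }
    }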

Figure 5. Browser-based Front-end

[Figure 5 components: Relational Data Marts and a BI Metadata Repository sit behind a BI Web Server on the ... network, accessed by the customer's browser over the Internet. Figure annotation: a vendor-supported business intelligence application provides a richly featured, web-based interface; customers can run standard and custom reports and ad-hoc queries, generate charts and graphs, save results to Excel, etc.]


By leveraging a mature front-end technology, ... gains the advantage of reducing its internal Java development effort, while giving its customers a greatly expanded set of reporting and OLAP functionality. There are many products on the market, some cheaper and less mature than the long-standing industry leaders, BusinessObjects XI 3.1 and MicroStrategy 9. Our recommendation to ... is to be willing to invest in this customer-facing component so that it leaves the most appealing impression on its end users.

d. Hardware All of the technologies outlined thus far will run quite well on the type of hardware that ... currently uses to serve the Reporting Portal's data warehouse. ... could purchase several more of the rackable Dell PowerEdge 2950 server trays running Windows Server 2003 and array them as a Hadoop cluster, data mart hosts, or web servers. Operational considerations like data center space and power notwithstanding, this hardware choice would preserve ... current SOE (standard operating environment) and minimize retraining of operations staff.

e. Java Programming One reason that the Hadoop technology was selected is the high degree of skill and experience that ... personnel have with Java programming. As discussed earlier, interfaces into and out of Hadoop will most likely be coded in Java. These interfaces would likely be designed, developed, tested, and supported by ... personnel. At first blush, this statement might raise concerns about the cost of hand-coding data interfaces versus buying a vendor-supported product. However, there are currently no data integration products available on the market to perform these tasks. Furthermore, if an off-the-shelf data integration (ETL) tool like Informatica PowerCenter could be purchased, it would still require expensive consulting services to implement and support. Net net, programming these interfaces in Java is actually a very logical choice for ....

5. Data Anomaly Detection In addition, thanks to its extensive analytics capabilities and performance, Hadoop makes it possible to run several kinds of deep analysis to define data anomaly patterns and then detect and report them within minutes. You'll find attached several documents describing different anomaly detection approaches. There is also a lot of information available on the Hadoop wiki, such as http://wiki.apache.org/hadoop/Anomaly_Detection_Framework_with_Chukwa, which describes the Chukwa framework for detecting anomalies.

6. Data integration/importation and Data Quality Management As an alternative to using Hadoop's ETL features, Cloudera (a commercial provider of Hadoop distributions) and Talend (an open source ETL tool – Extract, Transform and Load) recently announced a technology partnership: http://www.cloudera.com/company/press-center/releases/talend_and_cloudera_announce_technology_partnership_to_simplify_processing_of_large_scale_data. Talend is the recognized market leader in open source data management. Talend's solutions and services help minimize the costs and maximize the value of data integration, ETL, data quality and master data management. We highly recommend using Talend as the dedicated tool for data integration, ETL and data quality.


7. Summary Based on key factors like terabyte-scale data volumes, log files as data source, and customer-facing OLAP, the optimal architecture for ... Reporting Portal infrastructure comprises a cloud computing model with distributed file storage; distributed processing; optimized, relational data marts; and an industry-leading, web-based, metadata-driven business intelligence package. The cloud computing architecture affords ... virtually unlimited, linear scalability that can grow economically with demand. Relational data marts ensure excellent query performance and low-risk flexibility for adding metrics, changing reporting hierarchies, etc.


Appendix A. Hadoop Overview Due to their sheer size, large applications like ...s data warehouse require more resources than can typically be served by a single, cost-effective machine. Even if a large, expensive server could be configured with enough disk and CPU to handle the heavy workload, it is unlikely that a single machine could provide the continuous, uninterrupted operation needed by today’s full-time applications. The Hadoop open-source framework—or Hadoop Common, as it is now officially known—is a Java cloud computing architecture designed as an economical, scalable solution that provides seamless fault tolerance for large data applications. Hadoop is a top-level Apache Software Foundation project, being built and used by a community of contributors from all over the world. As such, Hadoop is not a vendor-supported software package. It is a development framework that requires in-depth programming skills to implement and maintain. Therefore, an organization that chooses to deploy Hadoop will need to employ skilled personnel to maintain the cluster, program MapReduce jobs, and develop input/output interfaces. Hadoop Common runs applications on large, high-availability clusters of commodity hardware. It implements a computational paradigm named MapReduce, where the application is divided into many small fragments of work, each of which may be executed on any node in the cluster. In addition, Hadoop Common provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and HDFS are designed so that node failures are automatically handled by the framework.

MapReduce Hadoop supports the MapReduce parallel processing model, which was introduced by Google as a method of solving a class of petabyte-scale problems with large clusters of inexpensive machines. MapReduce is a programming paradigm that expresses a large distributed computation as a sequence of distributed operations on data sets of key/value pairs. The Hadoop MapReduce framework harnesses a cluster of machines and executes user defined MapReduce jobs across the nodes in the cluster. A MapReduce computation has two phases, a map phase and a reduce phase (see Figure A-1 below).

Map In the map phase, the framework splits the input data set into a large number of fragments and assigns each fragment to a map task. The framework also distributes the many map tasks across the cluster of nodes on which it operates. Each map task consumes key/value pairs from its assigned fragment and produces a set of intermediate key/value pairs. For each input key/value pair (K,V), the map task invokes a user-defined map function that transmutes the input into a different key/value pair (K',V'). Following the map phase, the framework sorts the intermediate data set by key and produces a set of (K',V'*) tuples so that all the values associated with a particular key appear together. It also partitions the set of tuples into a number of fragments equal to the number of reduce tasks.

Reduce In the reduce phase, each reduce task consumes the fragment of (K',V'*) tuples assigned to it. For each such tuple it invokes a user-defined reduce function that transmutes the tuple into an output key/value pair (K,V). Once again, the framework distributes the many reduce tasks across the cluster of nodes and deals with shipping the appropriate fragment of intermediate data to each reduce task.
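A minimal, self-contained Java sketch of this model follows: the mapper turns each input record (K,V) into an intermediate pair (K',V'), and the reducer folds each (K',V'*) group into a single output pair. The assumed input layout (tab-separated ... log lines whose first field is a tagged attribute) and the HDFS paths are illustrative assumptions, not part of the proposal.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TagCountJob {

        // Map phase: (file offset, log line) -> (tag, 1)
        public static class TagMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text tag = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Hypothetical log layout: the tagged attribute is the first tab-separated field.
                String[] fields = value.toString().split("\t");
                if (fields.length > 0 && !fields[0].isEmpty()) {
                    tag.set(fields[0]);
                    context.write(tag, ONE);   // emit intermediate (K', V')
                }
            }
        }

        // Reduce phase: (tag, [1, 1, ...]) -> (tag, total)
        public static class TagReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));  // one output pair per key
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "tag-count");
            job.setJarByClass(TagCountJob.class);
            job.setMapperClass(TagMapper.class);
            job.setCombinerClass(TagReducer.class); // optional local pre-aggregation
            job.setReducerClass(TagReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("/archive/logs/2015/05/21"));
            FileOutputFormat.setOutputPath(job, new Path("/jobs/tag-counts/2015-05-21"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }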


Tasks in each phase are executed in a fault-tolerant manner. If node(s) fail in the middle of a computation the tasks assigned to them are re-distributed among the remaining nodes. Having many map and reduce tasks enables efficient load balancing and allows failed tasks to be re-run with small runtime overhead. The Hadoop MapReduce framework has a master/slave architecture comprising a single master server or JobTracker and several slave servers or TaskTrackers, one per node in the cluster. The master node manages the execution of jobs, which involves assigning small chunks of a large problem to many nodes. The master also monitors node failures and substitutes other nodes as needed to pick up dropped tasks. The JobTracker is the point of interaction between users and the framework. Users submit MapReduce jobs to the JobTracker, which puts them in a queue of pending jobs and executes them on a first-come, first-served basis. The JobTracker manages the assignment of map and reduce tasks to the TaskTrackers. The TaskTrackers execute tasks upon instruction from the JobTracker and also handle data motion between the Map and Reduce phases.


Figure A-1. MapReduce Model

Hadoop Distributed File System (HDFS) Hadoop's Distributed File System (HDFS) is designed to reliably store very large files across clustered machines. It is inspired by the Google File System (GFS). HDFS sits on top of the native operating system’s file system and stores each file as a sequence of blocks. All blocks in a file except the last block are the same size. Blocks belonging to a file are replicated across machines for fault tolerance. The block size and replication factor are configurable per file. Files in HDFS are "write once, read many" and have strictly one writer at any time.
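As an illustration of that per-file configurability, the sketch below creates a file through the HDFS client API with an explicit replication factor and block size; the path and the chosen values are assumptions for the example, not recommendations.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsPerFileSettings {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            Path file = new Path("/archive/logs/2015/05/21/session-0001.log");

            // create(path, overwrite, bufferSize, replication, blockSize):
            // this file gets 3 replicas and 128 MB blocks, regardless of the cluster defaults.
            try (FSDataOutputStream out = fs.create(
                    file, true, 4096, (short) 3, 128L * 1024 * 1024)) {
                out.writeBytes("example log content\n");  // write once; a file has a single writer
            }
        }
    }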

[Figure A-1 components: an input data set of records is split and fed to map tasks (map phase); the intermediate key/value pairs are shuffled and sorted by key (intermediate phase); reduce tasks then produce the output data set of records (reduce phase).]


Like Hadoop MapReduce, HDFS follows a master/slave architecture, made up of a robust master node and multiple data nodes (see Figure A-2 below). An HDFS installation consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, one per node in the cluster, which manage storage attached to the nodes that they run on. The NameNode makes file system namespace operations like opening, closing, and renaming of files and directories available via an RPC interface. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from file system clients. They also perform block creation, deletion, and replication upon instruction from the NameNode.
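From a client's point of view, these namespace operations are invoked through the same FileSystem API; a brief sketch with hypothetical paths follows. The NameNode services the metadata calls, while the DataNodes handle the actual block traffic.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class NamespaceOperations {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Namespace operations handled by the NameNode:
            fs.mkdirs(new Path("/archive/logs/2015/05/22"));                  // create a directory
            fs.rename(new Path("/incoming/session-0001.log"),                 // move/rename a file
                      new Path("/archive/logs/2015/05/22/session-0001.log"));

            for (FileStatus status : fs.listStatus(new Path("/archive/logs/2015/05/22"))) {
                System.out.println(status.getPath() + "  " + status.getLen() + " bytes"
                        + "  replication=" + status.getReplication());
            }

            // Block creation, deletion, and replication are carried out by the DataNodes
            // on instruction from the NameNode; the client never addresses them directly here.
        }
    }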

Figure A-2. HDFS Model

[Figure A-2 components: two racks of TaskTracker/DataNode servers, one also hosting the JobTracker and one the NameNode; each rack is connected to a 100 Mbit switch, and both rack switches and the client connect through a 1 Gbit switch.]


8. Query Optimization Our recommendation is to take a deep dive into the worst-performing queries, focusing on the ones that run frequently. In addition, moving most of the analytics from the MySQL production database to Hadoop will reduce the data volume and the load on the MySQL database, which will in itself bring a performance improvement.
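One possible starting point for that deep dive, sketched below under the assumption that the team has the necessary MySQL privileges, is to enable the slow query log and then EXPLAIN the statements it captures; the threshold, connection details, and sample query are placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.ResultSetMetaData;
    import java.sql.Statement;

    public class SlowQueryAudit {
        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection(
                     "jdbc:mysql://db-host:3306/reporting", "dba", "secret");
                 Statement stmt = con.createStatement()) {

                // Capture every statement slower than 1 second (requires sufficient privileges).
                stmt.execute("SET GLOBAL slow_query_log = 'ON'");
                stmt.execute("SET GLOBAL long_query_time = 1");

                // Once candidates are identified, inspect their execution plans.
                ResultSet rs = stmt.executeQuery(
                    "EXPLAIN SELECT customer_id, COUNT(*) "
                    + "FROM call_summary WHERE report_date = '2015-05-21' "
                    + "GROUP BY customer_id");

                ResultSetMetaData meta = rs.getMetaData();
                while (rs.next()) {
                    StringBuilder row = new StringBuilder();
                    for (int i = 1; i <= meta.getColumnCount(); i++) {
                        row.append(meta.getColumnLabel(i)).append('=')
                           .append(rs.getString(i)).append("  ");
                    }
                    System.out.println(row);  // look for full table scans and missing indexes
                }
            }
        }
    }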

9. Access and Data Security During our discussions it was mentioned that some effort would be needed to better protect and encrypt the URLs used to access the different website pages. In addition, we have suggested, for future use, securing the data itself through encryption.

10. Internal Management and Collaboration tools Salesforce appears to be the recommended choice given its numerous management and collaboration features. It includes all the capabilities required: contact management, project management and time tracking, technical support management, and more.

Salesforce Professional is $65/user/month, i.e., $3,900 (2,846 €) per year for 5 users.


11. Salesforce and Force.com integration In addition, Salesforce offers a complete API platform named Force.com that allows new features to be integrated into your existing platform. In the future, this API will give ... an easy way to add new features to the ... application, such as mobile device support, interfaces with existing applications using AppExchange, and real-time analytics.


12. Roadmap Hadoop installation and configuration takes no more than 2 days for one person (see the "Building and Installing Hadoop-MapReduce" PDF file). We recommend taking the design phase seriously in order to build strong foundations for your future architecture. Your customer data mart should take no more than a month for a full implementation. Regarding your internal data mart, the implementation time will depend on how deep you want to go with analytics; however, with the experience gained from implementing the customer data mart, it shouldn't take longer than a month. Of course, we'll be able to assist you as needed to follow up on your future architecture implementation.

Cloudera also provides a range of services around Hadoop:

Professional Services (http://www.cloudera.com/hadoop-services)

- Best practices for setting up and configuring a cluster suitable to run Cloudera's Distribution for Hadoop:
  o Choice of hardware, operating system, and related systems software
  o Configuration of storage in the cluster, including ways to integrate with existing storage repositories
  o Balancing compute power with storage capacity on nodes in the cluster

- A comprehensive design review of your current system and your plans for Hadoop:
  o Discovery and analysis sessions aimed at identifying the various data types and sources streaming into your cluster
  o Design recommendations for a data-processing pipeline that addresses your business needs

- Operational guidance for a cluster running Hadoop, including:
  o Best practices for loading data into the cluster and for ensuring locality of data to compute nodes
  o Identifying, diagnosing, and fixing errors in Hadoop and the site-specific analyses our customers run
  o Tools and techniques for monitoring an active Hadoop cluster

- Advice on the integration of MapReduce job submission into an existing data-processing pipeline, so Hadoop can read data from, and write data to, the analytic tools and databases our customers already use

- Guidance on the use of additional analytic or developmental tools, such as Hive and Pig, that offer high-level interfaces for data evaluation and visualization

- Hands-on help in developing Hadoop applications that deliver the data processing and analysis you need

- How to connect Hadoop to your existing IT infrastructure: we can help with moving data between Hadoop and data warehouses, collecting data from file systems, creating document repositories, logging infrastructure and other sources, and setting up existing visualization and analytic tools to work with Hadoop

- Performance audits of your Hadoop cluster, with tuning recommendations for speed, throughput, and response times


Training (http://www.cloudera.com/hadoop-training)

Cloudera offers numerous on-line training resources and live public sessions:

Developer Training and Certification

Cloudera offers a three-day training program targeted toward developers who want to learn how to use Hadoop to build powerful data processing applications. Over three days, this course will assume only a casual understanding of Hadoop and teach you everything you need to know to take advantage of some of the most powerful features. We’ll get into deep details about Hadoop itself, but also devote ample time for hands-on exercises, importing data from existing sources, working with Hive and Pig, debugging MapReduce and much more. A full agenda is on the registration page. This course includes the certification exam to become Cloudera Certified Hadoop Developer.

Sysadmin Training and Certification Systems administrators need to know how Hadoop operates in order to deploy and manage clusters for their organizations. Cloudera offers a two-day intensive course on Hadoop for operations staff. The course describes Hadoop’s architecture, covers the management and monitoring tools most commonly used to oversee it, and provides valuable advice on setting up, maintaining and troubleshooting Hadoop for development and production systems. This course includes the certification exam to become Cloudera Certified Hadoop Administrator.

HBase Training Use HBase as a distributed data store to achieve low-latency queries and highly scalable throughput. HBase training covers the HBase architecture, data model, and Java API, as well as some advanced topics and best practices. This training is for developers (Java experience is recommended) who already have a basic understanding of Hadoop.