A Survey on Data Mapping Strategy for data stored in the storage cloud 111

VOLUME 9, ISSUE 3, MAY 2016 ISSN-2347-8047

INTERNATIONAL JOURNAL OF COGNITIVE SCIENCE, ENGINEERING AND TECHNOLOGY

Available online at: http://airnetjournal.org/

A Survey on Data Mapping Strategy for data stored in the storage cloud

Navneet Kumar, Department of CSE

K S Institute of Technology

Bengaluru, India

Naseeruddin V N, Department of CSE

Bengaluru, India

Murali Krishna V, Department of CSE

Bengaluru, India

S K Manu, Department of CSE

Bengaluru, India

Swathi K, Department of CSE

Bengaluru, India

Abstract— The volume of data processed over the internet has grown exponentially in recent years, making such data difficult to store and computationally expensive to analyze. There is currently considerable enthusiasm around the MapReduce paradigm for large-scale data analysis. Inspired by functional programming, it allows distributed computations to be expressed over massive amounts of data, and it is designed to run on clusters of commodity hardware. As a prominent parallel data processing tool, MapReduce is gaining significant momentum in both industry and academia as the volume of data to analyze grows rapidly. In this paper we propose a method for processing large amounts of data over the internet: the data to be processed is stored in the cloud and processed in a Hadoop multi-cluster environment.

Keywords— Storage Cloud, Hadoop cluster, Hadoop, Distributed File System, Parallel Processing, MapReduce

I.Introduction

Analyzing big data is a very challenging problem. For the effective handling of such massive data and applications, the MapReduce framework has come into wide focus. Over the last few years, MapReduce has emerged as the most popular computing paradigm for parallel, batch-style analysis of large amounts of data, and it is used in many areas where massive data analysis is required. A growing number of applications handle big data, yet handling such huge collections of data remains a very challenging problem. MapReduce, or its open-source equivalent Hadoop, is a powerful tool for building such applications. Data-intensive processing is fast becoming a necessity for handling large databases efficiently, and it requires algorithms that can scale to real-world datasets.

There is currently considerable enthusiasm around the MapReduce paradigm for large-scale data analysis. Inspired by functional programming, it allows distributed computations to be expressed over massive amounts of data, and it is designed to run on clusters of commodity hardware. MapReduce is used in areas where the volume of data to analyze grows rapidly. Despite these abilities, there are still arguments about its performance, its efficiency, and the simplicity of its concept. With the present explosion of data, parallel processing is essential for handling such massive volumes in a timely manner. MapReduce gained popularity after Google used it successfully. It is a scalable and fault-tolerant data processing tool that can process very large volumes of data in parallel on many low-end computing nodes. By virtue of its simplicity, scalability, and fault tolerance, MapReduce is becoming ubiquitous, gaining significant momentum in both industry and academia. However, MapReduce has inherent limitations in performance and efficiency, and many studies have endeavoured to overcome them.

The goal of this analysis is to provide a timely remark on the status of MapReduce studies and related work, focusing on current research aimed at improving and enhancing the MapReduce framework. This paper is intended to help the database community understand the technical aspects of the MapReduce framework. We focus on the working of the framework and examine its inherent advantages and drawbacks. We then introduce its applications and effective ways to improve its properties so that an optimized result can be obtained. We also highlight the issues and challenges raised about MapReduce. It is well known for its simplicity, effectiveness, and ability to handle “Big Data” in a timely manner, yet it still has some limitations that need to be resolved.

II.Working

MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks [3]. The MapReduce paradigm of parallel programming provides simplicity while also offering load balancing and fault tolerance. The Google File System (GFS) that typically underlies a MapReduce system provides the efficient and reliable distributed data storage needed for applications involving large databases [10]. MapReduce is inspired by the map and reduce primitives present in functional languages. Various implementations of the MapReduce interface are possible, depending on the desired context. Currently available implementations target shared-memory multi-core systems [11][12], asymmetric multi-core processors [13], graphics processors, and clusters of networked machines [4]. The most popular implementation is probably the one introduced by Google, which uses large clusters of commodity computers connected with switched Ethernet. In essence, Google’s MapReduce technique simplifies the development and lowers the cost of large-scale distributed applications on clusters of commodity machines.

The MapReduce framework executes its tasks using a runtime scheduling scheme: it does not build an execution plan that specifies, before execution, which tasks will run on which nodes [14]. The model can process large datasets distributed across many nodes in parallel. Its main goal is to simplify large-scale data processing on inexpensive cluster computers and to make this easy for users while achieving both load balancing and fault tolerance. MapReduce has two primary functions, the Map function and the Reduce function, both defined by the user to meet specific requirements. The original MapReduce software is a proprietary system of Google and is therefore not available for public use [15]. Although the Map and Reduce primitives greatly simplify distributed computing, the underlying infrastructure needed to achieve the desired performance is non-trivial [2]. A key piece of infrastructure in Google’s MapReduce is the underlying distributed file system, which ensures data locality and availability [3]. By combining the MapReduce programming technique with an efficient distributed file system, one can achieve distributed computing with data parallelism over thousands of computing nodes.
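The roles of the user-defined Map and Reduce functions can be illustrated with a minimal single-machine sketch in Python. It is not Hadoop or Google's implementation; the function names (map_words, shuffle, reduce_counts, run_job) and the word-count task are assumptions chosen for illustration. A thread pool stands in for the cluster so that map tasks are handed to free workers at run time rather than according to a plan fixed in advance.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_words(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_outputs):
    """Shuffle: group all intermediate values by their key."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce: combine all values that share one key."""
    return key, sum(values)


def run_job(chunks, workers=4):
    """Run map tasks on a pool of workers, then shuffle and reduce."""
    # The pool hands each map task to a free worker thread at run time;
    # no task-to-worker assignment is decided before execution starts.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_words, chunks))
    groups = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in groups.items())


if __name__ == "__main__":
    chunks = ["big data needs big clusters", "data moves to the cloud"]
    print(run_job(chunks))
```

Only map_words and reduce_counts are specific to the task; the surrounding steps stay the same for any other job, which is the sense in which the user supplies just the two functions.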

III.Methodology

The architecture illustrates the layout of the project. The user uploads the data to the cloud over the internet and then selects the operation to be carried out. The controller, acting as middleware, interprets the request and forwards it to the Hadoop master. The Hadoop master starts the JobTracker and connects the cloud as the data node, and the MapReduce algorithm is run, which maps and reduces the data according to the implemented algorithm. The result is collected, concatenated, and stored back on the cloud for the user to download. Using a storage cloud lets the user upload and download data from any place connected to the internet without knowing where the actual processing is done.
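The request flow described above can be sketched as a small in-process simulation. Every class and method name here (StorageCloud, HadoopMaster, Controller, handle, run) is hypothetical and merely stands in for the corresponding component; no real cloud or Hadoop API is used, and word counting is assumed as the example operation.

```python
from collections import Counter


class StorageCloud:
    """Hypothetical in-memory stand-in for the storage cloud."""

    def __init__(self):
        self.containers = {}

    def upload(self, container, name, data):
        self.containers.setdefault(container, {})[name] = data

    def download(self, container, name):
        return self.containers[container][name]


def word_count(lines):
    """Assumed example operation: map lines to counts, reduce by summing."""
    partials = [Counter(line.split()) for line in lines]  # map step
    total = Counter()
    for partial in partials:  # reduce step
        total.update(partial)
    return dict(total)


class HadoopMaster:
    """Hypothetical stand-in for the Hadoop master; runs the job in-process."""

    def __init__(self, cloud):
        self.cloud = cloud

    def run(self, operation, in_container, name, out_container):
        lines = self.cloud.download(in_container, name).splitlines()
        result = operation(lines)
        # Concatenate the reduced records into a single result object.
        text = "\n".join(f"{key}\t{value}" for key, value in sorted(result.items()))
        result_name = name + ".result"
        self.cloud.upload(out_container, result_name, text)
        return result_name


class Controller:
    """Middleware: interprets the user's request and forwards it to the master."""

    OPERATIONS = {"wordcount": word_count}

    def __init__(self, master):
        self.master = master

    def handle(self, operation_name, in_container, name, out_container):
        operation = self.OPERATIONS[operation_name]
        return self.master.run(operation, in_container, name, out_container)


if __name__ == "__main__":
    cloud = StorageCloud()
    cloud.upload("input", "notes.txt", "a b a\nb c")
    controller = Controller(HadoopMaster(cloud))
    result_name = controller.handle("wordcount", "input", "notes.txt", "output")
    print(cloud.download("output", result_name))
```

The user only touches StorageCloud and Controller in this sketch, which mirrors the point that the location of the actual processing stays hidden from the user.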

IV.Design

The goal of the application is to provide an easy-to-use interface, so that a user with even little experience of websites can use it through a browser. We have designed a few models and structures to explain the design and structure of the application under discussion.

Data Flow Model

A data flow diagram (DFD) is a graphical representation of the "flow" of data through an information system, modeling its process aspects. DFDs are often a preliminary step used to create an overview of the system, which can later be elaborated. DFDs can also be used for the visualization of data processing (structured design).

A DFD shows what kinds of information will be input to and output from the system, where the data will come from and go to, and where the data will be stored. It does not show information about the timing of processes, or about whether processes will operate in sequence or in parallel.

Data flow: A data flow shows the flow of information from its source to its destination. A data flow is represented by a line, with arrowheads showing the direction of flow.

Data Store: A data store is a holding place for information within the system. It is represented by an open-ended narrow rectangle.

External Entities: It is normal for all information represented within a system to have been obtained from, and/or to be passed on to, an external source or recipient.

Processes: When naming processes, avoid glossing over them without really understanding their role. A sign that this has happened is a vague descriptive title such as ‘process’ or ‘update’.

Data Flows: Double-headed arrows can be used on all but bottom-level diagrams. Furthermore, in common with most of the other symbols used, a data flow at a particular level of a diagram may be decomposed into multiple data flows.
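The DFD elements listed above can also be written down as a small data structure. The following Python sketch is one possible representation, not the model used in the project; the class names, the node kinds, and the example nodes and flow labels in build_example are assumptions for illustration. In keeping with the definition of a DFD, it records where data comes from and goes to but nothing about timing or ordering.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    name: str
    kind: str  # "external", "process", or "store"


@dataclass
class DataFlowDiagram:
    nodes: dict = field(default_factory=dict)
    flows: list = field(default_factory=list)  # (source, destination, label)

    def add_node(self, name, kind):
        if kind not in {"external", "process", "store"}:
            raise ValueError(f"unknown node kind: {kind}")
        self.nodes[name] = Node(name, kind)

    def add_flow(self, source, destination, label):
        # A flow is a directed, labelled line between two declared nodes.
        for end in (source, destination):
            if end not in self.nodes:
                raise KeyError(f"undeclared node: {end}")
        self.flows.append((source, destination, label))

    def inputs_of(self, name):
        return [label for _, dst, label in self.flows if dst == name]

    def outputs_of(self, name):
        return [label for src, _, label in self.flows if src == name]


def build_example():
    """Hypothetical top-level DFD for the upload-and-process scenario."""
    dfd = DataFlowDiagram()
    dfd.add_node("User", "external")
    dfd.add_node("Upload data", "process")
    dfd.add_node("Storage cloud", "store")
    dfd.add_node("Run MapReduce job", "process")
    dfd.add_flow("User", "Upload data", "input file")
    dfd.add_flow("Upload data", "Storage cloud", "stored data")
    dfd.add_flow("Storage cloud", "Run MapReduce job", "stored data")
    dfd.add_flow("Run MapReduce job", "Storage cloud", "result")
    return dfd


if __name__ == "__main__":
    dfd = build_example()
    print(dfd.inputs_of("Storage cloud"))
```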

V.Snapshots

Description: The admin logs in using "admin" as both the admin name and the password. If the admin name or password does not match, the message "Wrong admin or wrong password" is displayed; if both match, the message "Login Success" is displayed.
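The login behaviour in this description can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's code: the function name admin_login is hypothetical, and the plain-text credentials are taken directly from the description, whereas a deployed system would normally store a salted password hash.

```python
import hmac

# Placeholder credentials taken from the description above (assumption:
# the real application may store and check them differently).
ADMIN_NAME = "admin"
ADMIN_PASSWORD = "admin"


def admin_login(name, password):
    """Return the message shown to the admin after a login attempt."""
    # compare_digest takes time independent of where the inputs differ.
    name_ok = hmac.compare_digest(name.encode(), ADMIN_NAME.encode())
    password_ok = hmac.compare_digest(password.encode(), ADMIN_PASSWORD.encode())
    if name_ok and password_ok:
        return "Login Success"
    return "Wrong admin or wrong password"


if __name__ == "__main__":
    print(admin_login("admin", "admin"))
    print(admin_login("admin", "guess"))
```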

Description: The website presents the user with options for carrying out the necessary tasks.

Description: The user can select a container and upload data to the cloud.

Description: The user can select the output container where the result is stored and download the data from the cloud.

Acknowledgment

The satisfaction and euphoria that accompany the successful completion of any task would be incomplete without mention of the individuals to whom we are greatly indebted, who through their guidance and provision of facilities have served as a beacon of light and crowned our efforts with success. We are thankful to Mrs. Swathi K, Assistant Professor, CSE, KSIT, for being our Project Guide, under whose able guidance this project work has been carried out and completed successfully.

We thank the management, the principal, and the Department of Computer Science and Engineering, KSIT. We thank VGST (Vision Group on Science and Technology), Government of Karnataka, India, for providing infrastructure facilities through the K-FIST Level II project at KSIT, CSE R&D Department, Bengaluru.

References

[1] Maitrey S, Jha. An Integrated Approach for CURE Clustering using Map-Reduce Technique. In Proceedings of Elsevier, ISBN 978-81-910691-6-3, 2nd August 2013.

[2] Kyuseok Shim. MapReduce Algorithms for Big Data Analysis. In Proceedings of the VLDB Endowment, Vol. 5, No. 12, August 27th 2012, Istanbul, Turkey.

[3] Jeffrey Dean et al. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th USENIX OSDI, pages 137–150, 2004.

[4] J. Dean et al. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

[5] D. DeWitt and M. Stonebraker. MapReduce: A major step backwards. The Database Column, 1, 2008.

[6] A. Pavlo et al. A comparison of approaches to large-scale data analysis. In Proceedings of the ACM SIGMOD, pages 165–178, 2009.

[7] M. Stonebraker et al. MapReduce and parallel DBMSs: friends or foes? Communications of the ACM, 53(1):64–71, 2010.

[8] A. Thusoo et al. Hive: a warehousing solution over a MapReduce framework. Proceedings of the VLDB Endowment, (2):1626–1629, 2009.

[9] A.F. Gates et al. Building a high-level dataflow system on top of Map-Reduce: the Pig experience. Proceedings of the VLDB Endowment, 2(2):1414–1425, 2009.

[10] S. Ghemawat et al. The Google file system. ACM SIGOPS Operating Systems Review, 37(5):29–43, 2003.

[11] OpenStack Installation Guide for Ubuntu 14.04, February 26, 2015.

[12] http://www.stackoverflow.com/

[13] https://github.com/

