Big Data: A Survey

Min Chen · Shiwen Mao · Yunhao Liu

© Springer Science+Business Media New York 2014

Abstract In this paper, we review the background and state-of-the-art of big data. We first introduce the general background of big data and review related technologies, such as cloud computing, Internet of Things, data centers, and Hadoop. We then focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally examine several representative applications of big data, including enterprise management, Internet of Things, online social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide readers with a comprehensive overview and big picture of this exciting area. This survey concludes with a discussion of open problems and future directions.

Keywords Big data · Cloud computing · Internet of things · Data center · Hadoop · Smart grid · Big data analysis

M. Chen (✉), School of Computer Science and Technology, Huazhong University of Science and Technology, 1037 Luoyu Road, Wuhan 430074, China. e-mail: minchen2012@hust.edu.cn; minchen@ieee.org

S. Mao, Department of Electrical & Computer Engineering, Auburn University, 200 Broun Hall, Auburn, AL 36849-5201, USA. e-mail: smao@ieee.org

Y. Liu, TNLIST, School of Software, Tsinghua University, Beijing, China. e-mail: yunhao@greenorbs.com

1 Background

1.1 Dawn of big data era

Over the past 20 years, data has increased on a large scale in various fields. According to a report from International Data Corporation (IDC), in 2011 the overall created and copied data volume in the world was 1.8 ZB (≈ 10^21 B), which increased by nearly nine times within five years [1]. This figure will double at least every two years in the near future.
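As a rough illustration of the growth figures quoted above, the following Python sketch converts the nine-fold increase over five years into an implied doubling time and projects the 2011 volume forward under the "doubles every two years" assumption. The function names are hypothetical and the projection is simple exponential arithmetic, not a forecast taken from the IDC report.

```python
import math


def doubling_time(growth_factor: float, years: float) -> float:
    """Years needed to double, given an overall growth factor over `years`."""
    return years * math.log(2) / math.log(growth_factor)


def project_volume(start_zb: float, start_year: int, end_year: int,
                   doubling_years: float = 2.0) -> float:
    """Volume in ZB at end_year if the volume doubles every `doubling_years`."""
    return start_zb * 2 ** ((end_year - start_year) / doubling_years)


if __name__ == "__main__":
    # Nine-fold growth in five years (figure quoted from [1]).
    print(f"implied doubling time: {doubling_time(9, 5):.2f} years")
    # Projection from the 2011 value of 1.8 ZB, assuming doubling every 2 years.
    for year in (2011, 2013, 2015):
        print(year, f"{project_volume(1.8, 2011, year):.1f} ZB")
```

Under these assumptions, a nine-fold increase in five years corresponds to a doubling time of roughly 1.6 years, which is consistent with the statement that the volume doubles at least every two years.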

Under the explosive increase of global data, the term big data is mainly used to describe enormous datasets. Compared with traditional datasets, big data typically includes masses of unstructured data that need more real-time analysis. In addition, big data also brings about new opportunities for discovering new values, helps us to gain an in-depth understanding of the hidden values, and also incurs new challenges, e.g., how to effectively organize and manage such datasets.

Recently, industries have become interested in the high potential of big data, and many government agencies have announced major plans to accelerate big data research and applications [2]. In addition, issues on big data are often covered in public media, such as The Economist [3, 4], New York Times [5], and National Public Radio [6, 7]. Two premier scientific journals, Nature and Science, also opened special columns to discuss the challenges and impacts of big data [8, 9]. The era of big data has come beyond all doubt [10].

Mobile Netw Appl (2014) 19:171–209, DOI 10.1007/s11036-013-0489-0

Nowadays, big data related to the services of Internet companies grows rapidly. For example, Google processes data of hundreds of Petabytes (PB), Facebook generates log data of over 10 PB per month, Baidu, a Chinese company, processes data of tens of PB, and Taobao, a subsidiary of Alibaba, generates data of tens of Terabytes (TB) for online trading per day. Figure 1 illustrates the boom of the global data volume. While the number of large datasets is drastically rising, it also brings about many challenging problems demanding prompt solutions:

– The latest advances of information technology (IT) make it easier to generate data. For example, on average, 72 hours of videos are uploaded to YouTube every minute [11]. Therefore, we are confronted with the main challenge of collecting and integrating massive data from widely distributed data sources.

– The rapid growth of cloud computing and the Internet of Things (IoT) further promotes the sharp growth of data. Cloud computing provides safeguarding, access sites, and channels for data assets. In the paradigm of IoT, sensors all over the world are collecting and transmitting data to be stored and processed in the cloud. Such data, in both quantity and mutual relations, will far surpass the capacities of the IT architectures and infrastructure of existing enterprises, and its real-time requirement will also greatly stress the available computing capacity. The increasingly growing data cause a problem of how to store and manage such huge heterogeneous datasets with moderate requirements on hardware and software infrastructure.

– In consideration of the heterogeneity, scalability, real-time requirements, complexity, and privacy of big data, we shall effectively “mine” the datasets at different levels during analysis, modeling, visualization, and forecasting, so as to reveal their intrinsic properties and improve decision making.

1.2 Definition and features of big data

Big data is an abstract concept. Apart from masses of data, it also has some other features, which determine the difference between itself and “massive data” or “very big data.” At present, although the importance of big data has been generally recognized, people still have different opinions on its definition. In general, big data shall mean the datasets that could not be perceived, acquired, managed, and processed by traditional IT and software/hardware tools within a tolerable time. Because of different concerns, scientific and technological enterprises, research scholars, data analysts, and technical practitioners have different definitions of big data. The following definitions may help us have a better understanding of the profound social, economic, and technological connotations of big data.

Fig. 1 The continuously increasing big data

In 2010, Apache Hadoop defined big data as “datasets which could not be captured, managed, and processed by general computers within an acceptable scope.” On the basis of this definition, in May 2011, McKinsey & Company, a global consulting agency, announced Big Data as the next frontier for innovation, competition, and productivity. Big data shall mean such datasets which could not be acquired, stored, and managed by classic database software. This definition includes two connotations: first, datasets’ volumes that conform to the standard of big data are changing, and may grow over time or with technological advances; second, datasets’ volumes that conform to the standard of big data in different applications differ from each other. At present, big data generally ranges from several TB to several PB [10]. From the definition by McKinsey & Company, it can be seen that the volume of a dataset is not the only criterion for big data. The increasingly growing data scale and its management that could not be handled by traditional database technologies are the next two key features.

As a matter of fact, big data has been defined as early as 2001. Doug Laney, an analyst of META (presently Gartner), defined challenges and opportunities brought about by increased data with a 3Vs model, i.e., the increase of Volume, Velocity, and Variety, in a research report [12]. Although such a model was not originally used to define big data, Gartner and many other enterprises, including IBM [13] and some research departments of Microsoft [14], still used the “3Vs” model to describe big data within the following ten years [15]. In the “3Vs” model, Volume means that, with the generation and collection of masses of data, data scale becomes increasingly big; Velocity means the timeliness of big data, specifically, data collection and analysis, etc. must be conducted rapidly and in a timely manner, so as to make maximum use of the commercial value of big data; Variety indicates the various types of data, which include semi-structured and unstructured data such as audio, video, webpage, and text, as well as traditional structured data.

However, others have different opinions, including IDC, one of the most influential leaders in big data and its research fields. In 2011, an IDC report defined big data as “big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling the high-velocity capture, discovery, and/or analysis” [1]. With this definition, characteristics of big data may be summarized as four Vs, i.e., Volume (great volume), Variety (various modalities), Velocity (rapid generation), and Value (huge value but very low density), as shown in Fig. 2. Such 4Vs definition was widely recognized since it highlights the meaning and necessity of big data, i.e., exploring the huge hidden values. This definition indicates the most critical problem in big data, which is how to discover values from datasets with an enormous scale, various types, and rapid generation. As Jay Parikh, Deputy Chief Engineer of Facebook, said, “You could only own a bunch of data other than big data if you do not utilize the collected data” [11].

In addition, NIST defines big data as “Big data shall mean the data of which the data volume, acquisition speed, or data representation limits the capacity of using traditional relational methods to conduct effective analysis or the data which may be effectively processed with important horizontal zoom technologies,” which focuses on the technological aspect of big data. It indicates that efficient methods or technologies need to be developed and used to analyze and process big data.

There have been considerable discussions from both industry and academia on the definition of big data [16, 17]. In addition to developing a proper definition, big data research should also focus on how to extract its value, how to use data, and how to transform “a bunch of data” into “big data.”

1.3 Big data value

McKinsey & Company observed how big data created values after in-depth research on the US healthcare, the EU public sector administration, the US retail, the global manufacturing, and the global personal location data. Through research on the five core industries that represent the global economy, the McKinsey report pointed out that big data may give full play to the economic function, improve the productivity and competitiveness of enterprises and public sectors, and create huge benefits for consumers. In [10], McKinsey summarized the values that big data could create: if big data could be creatively and effectively utilized to improve efficiency and quality, the potential value of the US medical industry gained through data may surpass USD 300 billion, thus reducing the expenditure for the US healthcare by over 8%; retailers that fully utilize big data may improve their profit by more than 60%; big data may also be utilized to improve the efficiency of government operations, such that the developed economies in Europe could save over EUR 100 billion (which excludes the effect of reduced frauds, errors, and tax difference).

Fig. 2 The 4Vs feature of big data

The McKinsey report is regarded as prospective and predictive, while the following facts may validate the values of big data. During the 2009 flu pandemic, Google obtained timely information by analyzing big data, which even provided more valuable information than that provided by disease prevention centers. Nearly all countries required hospitals to inform agencies such as disease prevention centers of new influenza cases. However, patients usually did not see doctors immediately when they got infected. It also took some time to send information from hospitals to disease prevention centers, and for disease prevention centers to analyze and summarize such information. Therefore, by the time the public became aware of the pandemic of the new type of influenza, the disease might have already been spreading for one to two weeks; the official information inherently lagged behind. Google found that during the spreading of influenza, entries frequently sought at its search engines would be different from those at ordinary times, and the use frequencies of the entries were correlated to the influenza spreading in both time and location. Google found 45 search entry groups that were closely relevant to the outbreak of influenza and incorporated them in specific mathematical models to forecast the spreading of influenza and even to predict places where influenza spread from. The related research results have been published in Nature [18].
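The idea of selecting search entries whose frequencies track influenza activity can be sketched as follows. This is a minimal illustration under stated assumptions, not the model published in [18]: it ranks candidate queries by the Pearson correlation between their weekly frequency and weekly case counts. All data values, query strings, and function names below are invented for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / sqrt(var_x * var_y)


def rank_queries(query_counts, case_counts, top_k=2):
    """Return the top_k queries whose weekly frequency best tracks case counts."""
    scored = [(pearson(counts, case_counts), query)
              for query, counts in query_counts.items()]
    scored.sort(reverse=True)
    return [(query, round(r, 3)) for r, query in scored[:top_k]]


# Synthetic weekly data, invented for illustration only.
weekly_cases = [10, 14, 30, 55, 80, 60, 35]
weekly_queries = {
    "flu symptoms": [120, 150, 310, 560, 790, 610, 340],
    "fever remedy": [200, 210, 330, 480, 650, 520, 390],
    "cheap flights": [500, 480, 510, 495, 505, 490, 500],
}

if __name__ == "__main__":
    for query, r in rank_queries(weekly_queries, weekly_cases):
        print(query, r)
```

In this toy dataset the two health-related queries rank above the unrelated one. A real system would additionally need to guard against spurious correlations and validate the selected queries on held-out periods.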

In 2008, Microsoft purchased Farecast, a sci-tech venture company in the US. Farecast has an airline ticket forecast system that predicts the trends and rising/dropping ranges of airline ticket prices. The system has been incorporated into the Bing search engine of Microsoft. By 2012, the system had saved nearly USD 50 per ticket per passenger, with a forecast accuracy as high as 75%.

At present, data has become an important production factor that could be comparable to material assets and human capital. As multimedia, social media, and IoT are developing, enterprises will collect more information, leading to an exponential growth of data volume. Big data will have a huge and increasing potential in creating values for businesses and consumers.

1.4 The development of big data

In the late 1970s, the concept of “database machine” emerged, which is a technology specially used for storing and analyzing data. With the increase of data volume, the storage and processing capacity of a single mainframe computer system became inadequate. In the 1980s, people proposed “share nothing,” a parallel database system, to meet the demand of the increasing data volume [19]. The share nothing system architecture is based on the use of a cluster, and every machine has its own processor, storage, and disk. The Teradata system was the first successful commercial parallel database system. Such databases later became very popular. On June 2, 1986, a milestone event occurred when Teradata delivered the first parallel database system with a storage capacity of 1 TB to Kmart to help the large-scale retail company in North America expand its data warehouse [20]. In the late 1990s, the advantages of parallel databases were widely recognized in the database field.
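The share nothing principle described above can be illustrated with a small single-process sketch: rows are hash-partitioned across nodes, each node keeps and aggregates only its own partition, and a coordinator merely combines the partial results. The class and method names are hypothetical, and the sketch simulates nodes as in-memory objects rather than modeling any particular product such as Teradata.

```python
import zlib


class Node:
    """One machine in the cluster: it owns its rows and shares nothing."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.rows = []  # private storage, never read by other nodes

    def insert(self, row):
        self.rows.append(row)

    def local_sum(self, column):
        # Each node aggregates only its own partition.
        return sum(row[column] for row in self.rows)


class SharedNothingCluster:
    def __init__(self, num_nodes):
        self.nodes = [Node(i) for i in range(num_nodes)]

    def _owner(self, key):
        # crc32 is deterministic across runs, unlike Python's salted str hash.
        index = zlib.crc32(str(key).encode("utf-8")) % len(self.nodes)
        return self.nodes[index]

    def insert(self, key, row):
        self._owner(key).insert(row)

    def total(self, column):
        # The coordinator only combines per-node partial results.
        return sum(node.local_sum(column) for node in self.nodes)


if __name__ == "__main__":
    cluster = SharedNothingCluster(num_nodes=4)
    for i in range(1000):
        cluster.insert(f"store-{i}", {"sales": i})
    print(cluster.total("sales"))
    print([len(node.rows) for node in cluster.nodes])
```

Because no node reads another node's storage, capacity in such a design can in principle be extended by adding nodes, at the cost of redistributing rows when the node count changes.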

However, many challenges on big data arose. With the development of Internet services, indexes and queried contents were rapidly growing. Therefore, search engine companies had to face the challenges of handling such big data. Google created the GFS [21] file system and the MapReduce [22] programming model to cope with the challenges brought about by data management and analysis at the Internet scale. In addition, contents generated by users, sensors, and other ubiquitous data sources also fueled the overwhelming data flows, which required a fundamental change in the computing architecture and large-scale data processing mechanism. In January 2007, Jim Gray, a pioneer of database software, called such transformation “The Fourth Paradigm” [23]. He also thought the only way to cope with such a paradigm was to develop a new generation of computing tools to manage, visualize, and analyze massive data. In June 2011, another milestone event occurred: EMC/IDC published a research report titled Extracting Values from Chaos [1], which introduced the concept and potential of big data for the first time. This research report triggered great interest in big data in both industry and academia.
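To make the MapReduce programming model [22] mentioned in this paragraph more concrete, the following single-process Python sketch walks through its three conceptual steps (map, shuffle, reduce) on a word-count task. It is only an illustration of the programming model: the function names are hypothetical, and nothing here reflects the distributed execution, fault tolerance, or API of Google's implementation or of Hadoop.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # In a real cluster, map and reduce calls would run on different machines.
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values)
                for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big value", "data center"]))
```

The user supplies only the map and reduce functions; grouping by key is handled by the framework, which is what allows the same program to be spread over many machines.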

Over the past few years, nearly all major companies, including EMC, Oracle, IBM, Microsoft, Google, Amazon, and Facebook, have started their big data projects. Taking IBM as an example, since 2005, IBM has invested USD 16 billion in 30 acquisitions related to big data. In academia, big data was also under the spotlight. In 2008, Nature published a big data special issue. In 2011, Science also launched a special issue on the key technologies of “data processing” in big data. In 2012, European Research Consortium for Informatics and Mathematics (ERCIM) News published a special issue on big data. In the beginning of 2012, a report titled Big Data, Big Impact, presented at the Davos Forum in Switzerland, announced that big data has become a new kind of economic asset, just like currency or gold. Gartner, an international research agency, issued Hype Cycles from 2012 to 2013, which classified big data computing, social analysis, and stored data analysis into 48 emerging technologies that deserve most attention.

Many national governments, such as the US, also paid great attention to big data. In March 2012, the Obama Administration announced a USD 200 million investment to launch the “Big Data Research and Development Plan,” which was a second major scientific and technological development initiative after the “Information Highway” initiative in 1993. In July 2012, the “Vigorous ICT Japan” project issued by Japan’s Ministry of Internal Affairs and Communications indicated that big data development should be a national strategy and application technologies should be the focus. In July 2012, the United Nations issued the Big Data for Development report, which summarized how governments utilized big data to better serve and protect their people.

1.5 Challenges of big data

The sharply increasing data deluge in the big data era brings about huge challenges on data acquisition, storage, management, and analysis. Traditional data management and analysis systems are based on the relational database management system (RDBMS). However, such RDBMSs only apply to structured data, rather than semi-structured or unstructured data. In addition, RDBMSs are increasingly utilizing more and more expensive hardware. It is apparent that traditional RDBMSs could not handle the huge volume and heterogeneity of big data. The research community has proposed some solutions from different perspectives. For example, cloud computing is utilized to meet the requirements on infrastructure for big data, e.g., cost efficiency, elasticity, and smooth upgrading/downgrading. For solutions of permanent storage and management of large-scale disordered datasets, distributed file systems [24] and NoSQL [25] databases are good choices. Such programming frameworks have achieved great success in processing clustered tasks, especially for webpage ranking. Various big data applications can be developed based on these innovative technologies or platforms. Moreover, it is non-trivial to deploy big data analysis systems.

Some literature [26–28] discusses obstacles in the development of big data applications. The key challenges are listed as follows:

– Data representation: many datasets have certain levels of heterogeneity in type, structure, semantics, organization, granularity, and accessibility. Data representation aims to make data more meaningful for computer analysis and user interpretation. Nevertheless, an improper data representation will reduce the value of the original data and may even obstruct effective data analysis. Efficient data representation shall reflect data structure, class, and type, as well as integrated technologies, so as to enable efficient operations on different datasets.

– Redundancy reduction and data compression: generally, there is a high level of redundancy in datasets. Redundancy reduction and data compression are effective in reducing the indirect cost of the entire system, on the premise that the potential values of the data are not affected. For example, most data generated by sensor networks are highly redundant and may be filtered and compressed by orders of magnitude.

– Data life cycle management: compared with the relatively slow advances of storage systems, pervasive sensing and computing are generating data at unprecedented rates and scales. We are confronted with a lot of pressing challenges, one of which is that the current storage system could not support such massive data. Generally speaking, values hidden in big data depend on data freshness. Therefore, a data importance principle related to the analytical value should be developed to decide which data shall be stored and which data shall be discarded.

– Analytical mechanism: the analytical system of big data shall process masses of heterogeneous data within a limited time. However, traditional RDBMSs are strictly designed with a lack of scalability and expandability, which could not meet the performance requirements. Non-relational databases have shown their unique advantages in the processing of unstructured data and started to become mainstream in big data analysis. Even so, there are still some problems of non-relational databases in their performance and particular applications. We shall find a compromising solution between RDBMSs and non-relational databases. For example, some enterprises have utilized a mixed database architecture that integrates the advantages of both types of database (e.g., Facebook and Taobao). More research is needed on the in-memory database and sample data based on approximate analysis.

– Data confidentiality: most big data service providers or owners at present could not effectively maintain and analyze such huge datasets because of their limited capacity. They must rely on professionals or tools to analyze such data, which increases the potential safety risks. For example, the transactional dataset generally includes a set of complete operating data to drive key business processes. Such data contains details of the lowest granularity and some sensitive information such as credit card numbers. Therefore, analysis of big data may be delivered to a third party for processing only when proper preventive measures are taken to protect such sensitive data, to ensure its safety.

– Energy management: the energy consumption of mainframe computing systems has drawn much attention from both economic and environmental perspectives. With the increase of data volume and analytical demands, the processing, storage, and transmission of big data will inevitably consume more and more electric energy. Therefore, system-level power consumption control and management mechanisms shall be established for big data while expandability and accessibility are ensured.

– Expandability and scalability: the analytical system of big data must support present and future datasets. The analytical algorithm must be able to process increasingly expanding and more complex datasets.

– Cooperation: analysis of big data is an interdisciplinary research, which requires experts in different fields to cooperate to harvest the potential of big data. A comprehensive big data network architecture must be established to help scientists and engineers in various fields access different kinds of data and fully utilize their expertise, so as to cooperate to complete the analytical objectives.
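As an illustration of the redundancy reduction and data compression challenge listed above, the following sketch first filters a redundant sensor trace and then compresses the result. The deadband filter (dropping readings that differ from the last kept value by no more than a tolerance) is one simple, lossy technique chosen here as an assumption; the survey does not prescribe a specific method. The data is synthetic and the function names are hypothetical; compression uses Python's standard zlib module.

```python
import json
import zlib


def deadband_filter(readings, tolerance):
    """Keep a reading only if it differs from the last kept one by more than tolerance."""
    kept = []
    for timestamp, value in readings:
        if not kept or abs(value - kept[-1][1]) > tolerance:
            kept.append((timestamp, value))
    return kept


def compress(readings):
    """Losslessly compress a list of (timestamp, value) readings with DEFLATE."""
    return zlib.compress(json.dumps(readings).encode("utf-8"))


def decompress(blob):
    """Inverse of compress(): recover the list of (timestamp, value) tuples."""
    decoded = json.loads(zlib.decompress(blob).decode("utf-8"))
    return [tuple(item) for item in decoded]


# Synthetic temperature trace: flat at 20.0, then a step to 25.0 at t = 500.
raw = [(t, 20.0 if t < 500 else 25.0) for t in range(1000)]
filtered = deadband_filter(raw, tolerance=0.5)
blob = compress(filtered)

if __name__ == "__main__":
    print(len(raw), "raw readings ->", len(filtered), "after filtering")
    print(len(json.dumps(raw)), "bytes raw JSON ->", len(blob), "bytes stored")
```

The filter step discards information and therefore only suits cases where small fluctuations carry no analytical value, in line with the premise that the potential values of the data must not be affected; the compression step is fully reversible.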

2 Related technologies

In order to gain a deep understanding of big data, this section will introduce several fundamental technologies that are closely related to big data, including cloud computing, IoT, data center, and Hadoop.

2.1 Relationship between cloud computing and big data

Cloud computing is closely related to big data. The key components of cloud computing are shown in Fig. 3. Big data is the object of the computation-intensive operation and stresses the storage capacity of a cloud system. The main objective of cloud computing is to use huge computing and storage resources under concentrated management, so as to provide big data applications with fine-grained computing capacity. The development of cloud computing provides solutions for the storage and processing of big data. On the other hand, the emergence of big data also accelerates the development of cloud computing. The distributed storage technology based on cloud computing can effectively manage big data; the parallel computing capacity by virtue of cloud computing can improve the efficiency of acquiring and analyzing big data.

Even though there are many overlapping technologies in cloud computing and big data, they differ in the following two aspects. First, the concepts are different to a certain extent. Cloud computing transforms the IT architecture while big data influences business decision-making. However, big data depends on cloud computing as the fundamental infrastructure for smooth operation.

Second, big data and cloud computing have different target customers. Cloud computing is a technology and product targeting Chief Information Officers (CIO) as an advanced IT solution. Big data is a product targeting Chief Executive Officers (CEO), focusing on business operations. Since the decision makers may directly feel the pressure from market competition, they must defeat business opponents in more competitive ways. With the advances of big data and cloud computing, these two technologies are certainly and increasingly entwined with each other. Cloud computing, with functions similar to those of computers and operating systems, provides system-level resources; big data operates in the upper level supported by cloud computing and provides functions similar to those of a database and efficient data processing capacity. Gelsinger, President of EMC, indicated that the application of big data must be based on cloud computing.

Fig. 3 Key components of cloud computing

The evolution of big data was driven by the rapid growth of application demands, and cloud computing developed from virtualized technologies. Therefore, cloud computing not only provides computation and processing for big data, but also itself is a service mode. To a certain extent, the advances of cloud computing also promote the development of big data, both of which supplement each other.

2.2 Relationship between IoT and big data

In the IoT paradigm, an enormous number of networked sensors are embedded into various devices and machines in the real world. Such sensors deployed in different fields may collect various kinds of data, such as environmental data, geographical data, astronomical data, and logistic data. Mobile equipment, transportation facilities, public facilities, and home appliances could all be data acquisition equipment in IoT, as illustrated in Fig. 4.

The big data generated by IoT has different characteristics compared with general big data because of the different types of data collected, of which the most classical characteristics include heterogeneity, variety, unstructured features, noise, and high redundancy. Although the current IoT data is not the dominant part of big data, by 2030, the quantity of sensors will reach one trillion and then the IoT data will be the most important part of big data, according to the forecast of HP. A report from Intel pointed out that big data in IoT has three features that conform to the big data paradigm: (i) abundant terminals generating masses of data; (ii) data generated by IoT is usually semi-structured or unstructured; (iii) data of IoT is useful only when it is analyzed.

At present, the data processing capacity of IoT has fallen behind the collected data and it is extremely urgent to accelerate the introduction of big data technologies to promote the development of IoT. Many operators of IoT realize the importance of big data since the success of IoT hinges upon the effective integration of big data and cloud computing. The widespread deployment of IoT will also bring many cities into the big data era.

There is a compelling need to adopt big data for IoT applications, while the development of big data is already lagging behind. It has been widely recognized that these two technologies are inter-dependent and should be jointly developed: on one hand, the widespread deployment of IoT drives the high growth of data both in quantity and category, thus providing the opportunity for the application and development of big data; on the other hand, the application of big data technology to IoT also accelerates the research advances and business models of IoT.

Fig. 4 Illustration of data acquisition equipment in IoT

2.3 Data center

In the big data paradigm, the data center not only is a platform for concentrated storage of data, but also undertakes more responsibilities, such as acquiring data, managing data, organizing data, and leveraging the data values and functions. Data centers mainly concern “data” rather than “center.” A data center has masses of data and organizes and manages data according to its core objective and development path, which is more valuable than owning a good site and resources. The emergence of big data brings about sound development opportunities and great challenges to data centers. Big data is an emerging paradigm, which will promote the explosive growth of the infrastructure and related software of data centers. The physical data center network is the core for supporting big data, but, at present, is the key infrastructure that is most urgently required [29].

– Big data requires the data center to provide powerful backstage support. The big data paradigm has more stringent requirements on storage capacity and processing capacity, as well as network transmission capacity. Enterprises must take the development of data centers into consideration to improve the capacity of rapidly and effectively processing big data under a limited price/performance ratio. The data center shall provide the infrastructure with a large number of nodes, build a high-speed internal network, effectively dissipate heat, and effectively back up data. Only when a highly energy-efficient, stable, safe, expandable, and redundant data center is built can the normal operation of big data applications be ensured.

– The growth of big data applications accelerates the revolution and innovation of data centers. Many big data applications have developed their unique architectures and directly promote the development of storage, network, and computing technologies related to the data center. With the continued growth of the volumes of structured and unstructured data, and the variety of sources of analytical data, the data processing and computing capacities of the data center shall be greatly enhanced. In addition, as the scale of data centers is increasingly expanding, how to reduce the operational cost is also an important issue for the development of data centers.

– Big data endows more functions to the data center. In the big data paradigm, the data center shall not only be concerned with hardware facilities but also strengthen soft capacities, i.e., the capacities of acquisition, processing, organization, analysis, and application of big data. The data center may help business personnel analyze the existing data, discover problems in business operation, and develop solutions from big data.

2.4 Relationship between Hadoop and big data

Presently, Hadoop is widely used in big data applications in the industry, e.g., spam filtering, network searching, clickstream analysis, and social recommendation. In addition, considerable academic research is now based on Hadoop. Some representative cases are given below. As declared in June 2012, Yahoo runs Hadoop on 42,000 servers at four data centers to support its products and services, e.g., searching and spam filtering. At present, the biggest Hadoop cluster has 4,000 nodes, but the number of nodes will be increased to 10,000 with the release of Hadoop 2.0. In the same month, Facebook announced that their Hadoop cluster can process 100 PB of data, which grew by 0.5 PB per day as of November 2012. Some well-known agencies that use Hadoop to conduct distributed computation are listed in [30]. In addition, many companies provide Hadoop commercial execution and/or support, including Cloudera, IBM, MapR, EMC, and Oracle.

Among modern industrial machinery and systems, sensors are widely deployed to collect information for environment monitoring and failure forecasting, etc. Bahga and others in [31] proposed a framework for data organization and cloud computing infrastructure, termed CloudView. CloudView uses mixed architectures, local nodes, and remote clusters based on Hadoop to analyze machine-generated data. Local nodes are used for the forecast of real-time failures; clusters based on Hadoop are used for complex offline analysis, e.g., case-driven data analysis.

The exponential growth of genome data and the sharp drop in sequencing cost are transforming bio-science and bio-medicine into data-driven sciences. Gunarathne et al. [32] utilized cloud computing infrastructures (Amazon AWS and Microsoft Azure) and data processing frameworks based on MapReduce (Hadoop and Microsoft DryadLINQ) to run two parallel bio-medicine applications: (i) assembly of genome segments; (ii) dimension reduction in the analysis of chemical structure. In the latter application, the 166-D datasets used include 26,000,000 data points. The authors compared the performance of all the frameworks in terms of efficiency, cost, and availability. According to the study, the authors concluded that loose coupling will be increasingly applied to research on electron cloud, and that the parallel programming technology (MapReduce) framework may provide the user with an interface offering more convenient services and reduce unnecessary costs.
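As a rough illustration of the MapReduce programming model on which Hadoop is based, the following single-process Python sketch counts page views in a toy clickstream. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, click_mapper, count_reducer) and the "user url" record layout are assumptions made for this example only.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input record and yield (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all values that share the same key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key to obtain the final result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def click_mapper(line):
    # Hypothetical record layout: "<user> <url>"
    _user, url = line.split()
    yield url, 1


def count_reducer(_key, values):
    return sum(values)


clicks = ["u1 /home", "u2 /home", "u1 /cart"]
counts = reduce_phase(shuffle(map_phase(clicks, click_mapper)), count_reducer)
print(counts)
```

In a real cluster the map and reduce calls run in parallel on many nodes and the shuffle moves data across the network; the sketch only shows the data flow between the three steps.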

3 Big data generation and acquisition

We have introduced several key technologies related to big data, i.e., cloud computing, IoT, data center, and Hadoop. Next, we will focus on the value chain of big data, which can be generally divided into four phases: data generation, data acquisition, data storage, and data analysis. If we take data as a raw material, data generation and data acquisition are an exploitation process, data storage is a storage process, and data analysis is a production process that utilizes the raw material to create new value.

3.1 Data generation

Data generation is the first step of big data. Taking Internet data as an example, a huge amount of data in terms of search entries, Internet forum posts, chat records, and microblog messages is generated. Such data are closely related to people's daily lives and share the features of high value and low density. Such Internet data may be valueless individually, but through the exploitation of accumulated big data, useful information such as the habits and hobbies of users can be identified, and it is even possible to forecast users' behaviors and emotional moods.

Moreover, being generated through longitudinal and/or distributed data sources, datasets are becoming larger in scale, highly diverse, and complex. Such data sources include sensors, videos, clickstreams, and/or all other available data sources. At present, the main sources of big data are the operation and trading information in enterprises, logistic and sensing information in the IoT, human interaction information and position information in the Internet world, and data generated in scientific research. This information far surpasses the capacities of the IT architectures and infrastructures of existing enterprises, while its real-time requirement also greatly stresses existing computing capacity.

3.1.1 Enterprise data

In 2013, IBM issued Analysis the Applications of Big Data to the Real World, which indicates that the internal data of enterprises are the main sources of big data. The internal data of enterprises mainly consist of online trading data and online analysis data, most of which are historically static data and are managed by RDBMSs in a structured manner. In addition, production data, inventory data, sales data, financial data, etc., also constitute enterprise internal data, which aim to capture the informationized and data-driven activities of enterprises, so as to record all activities of enterprises in the form of internal data.

Over the past decades, IT and digital data have contributed a lot to improving the profitability of business departments. It is estimated that the business data volume of all companies in the world may double every 1.2 years [10], in which the business turnover through the Internet, enterprises to enterprises, and enterprises to consumers, per day will reach USD 450 billion [33]. The continuously increasing business data volume requires more effective real-time analysis so as to fully harvest its potential. For example, Amazon processes millions of terminal operations and more than 500,000 queries from third-party sellers per day [12]. Walmart processes one million customer trades per hour, and such trading data are imported into a database with a capacity of over 2.5PB [3]. Akamai analyzes 75 million events per day for its targeted advertisements [13].

3.1.2 IoT data

As discussed, IoT is an important source of big data. In smart cities constructed based on IoT, big data may come from industry, agriculture, traffic, transportation, medical care, public departments, families, etc.

According to the processes of data acquisition and transmission in IoT, its network architecture may be divided into three layers: the sensing layer, the network layer, and the application layer. The sensing layer is responsible for data acquisition and mainly consists of sensor networks. The network layer is responsible for information transmission and processing, where short-range transmission may rely on sensor networks and remote transmission depends on the Internet. Finally, the application layer supports specific applications of IoT.

According to the characteristics of the Internet of Things, the data generated from IoT has the following features:

– Large-scale data: in IoT, masses of data acquisition equipment are deployed in a distributed manner, which may acquire simple numeric data, e.g., location, or complex multimedia data, e.g., surveillance video. In order to meet the demands of analysis and processing, not only the currently acquired data but also the historical data within a certain time frame should be stored. Therefore, data generated by IoT are characterized by large scales.

– Heterogeneity: because of the variety of data acquisition devices, the acquired data also differ, and such data feature heterogeneity.

– Strong time and space correlation: in IoT, every data acquisition device is placed at a specific geographic location and every piece of data has a time stamp. The time and space correlation is an important property of data from IoT. During data analysis and processing, time and space are also important dimensions for statistical analysis.

– Effective data accounts for only a small portion of the big data: a great quantity of noise may occur during the acquisition and transmission of data in IoT. Among datasets acquired by acquisition devices, only a small amount of abnormal data is valuable. For example, during the acquisition of traffic video, the few video frames that capture violations of traffic regulations and traffic accidents are more valuable than those only capturing the normal flow of traffic.
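The features listed above can be illustrated with a small Python sketch: each reading carries a location and a time stamp, and only the few readings that deviate from the norm are kept as "effective" data. The record layout (Reading), the device names, the coordinates, and the median-plus-tolerance rule are assumptions chosen for illustration; real systems would use detection methods suited to the data at hand.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Reading:
    device_id: str    # which acquisition device produced the value
    lat: float        # space dimension: where the device is placed
    lon: float
    timestamp: int    # time dimension: seconds since some epoch
    value: float      # the measured quantity


def abnormal_readings(readings, tolerance):
    """Return readings whose value is farther than `tolerance` from the median."""
    center = median(r.value for r in readings)
    return [r for r in readings if abs(r.value - center) > tolerance]


# Hypothetical sample data: four normal values and one abnormal value.
readings = [
    Reading("sensor-1", 30.51, 114.41, 1000, 20.1),
    Reading("sensor-2", 30.52, 114.42, 1000, 20.3),
    Reading("sensor-3", 30.53, 114.43, 1001, 19.9),
    Reading("sensor-4", 30.54, 114.44, 1001, 35.0),
    Reading("sensor-5", 30.55, 114.45, 1002, 20.0),
]

outliers = abnormal_readings(readings, tolerance=5.0)
print([r.device_id for r in outliers])
```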

3.1.3 Bio-medical data

As a series of high-throughput bio-measurement technologies were innovatively developed at the beginning of the 21st century, frontier research in the bio-medicine field has also entered the era of big data. By constructing smart, efficient, and accurate analytical models and theoretical systems for bio-medicine applications, the essential governing mechanisms behind complex biological phenomena may be revealed. Not only can the future development of bio-medicine be determined, but leading roles can also be assumed in the development of a series of important strategic industries related to the national economy, people's livelihood, and national security, with important applications such as medical care, new drug R&D, and grain production (e.g., transgenic crops).

The completion of the HGP (Human Genome Project) and the continued development of sequencing technology also lead to widespread applications of big data in the field. The masses of data generated by gene sequencing go through specialized analysis according to different application demands, to be combined with clinical gene diagnosis and provide valuable information for early diagnosis and personalized treatment of disease. A single sequencing of a human gene may generate 100–600GB of raw data. In the China National Genebank in Shenzhen, there are 1.3 million samples, including 1.15 million human samples and 150,000 animal, plant, and microorganism samples. By the end of 2013, 10 million traceable biological samples will be stored, and by the end of 2015 this figure will reach 30 million. It is predictable that, with the development of bio-medicine technologies, gene sequencing will become faster and more convenient, making the big data of bio-medicine grow continuously beyond all doubt.

In addition, data generated from clinical medical care and medical R&D also rise quickly. For example, the University of Pittsburgh Medical Center (UPMC) has stored 2TB of such data. Explorys, an American company, provides platforms to collocate clinical data, operation and maintenance data, and financial data. At present, information on about 13 million people has been collocated, with 44 articles of data at the scale of about 60TB, which will reach 70TB in 2013. Practice Fusion, another American company, manages the electronic medical records of about 200,000 patients.

Apart from such small and medium-sized enterprises, other well-known IT companies, such as Google, Microsoft, and IBM, have invested extensively in the research and computational analysis of methods related to high-throughput biological big data, for shares in the huge market known as the "Next Internet." IBM forecast at the 2013 Strategy Conference that, with the sharp increase of medical images and electronic medical records, medical professionals may utilize big data to extract useful clinical information from masses of data, to obtain a medical history and forecast treatment effects, thus improving patient care and reducing cost. It is anticipated that, by 2015, the average data volume of every hospital will increase from 167TB to 665TB.

3.1.4 Data generation from other fields

As scientific applications are increasing, the scale of datasets is gradually expanding, and the development of some disciplines greatly relies on the analysis of masses of data. Here, we examine several such applications. Although they belong to different scientific fields, the applications have similar and increasing demands on data analysis. The first example is related to computational biology. GenBank is a nucleotide sequence database maintained by the U.S. National Bio-Technology Innovation Center. Data in this database may double every 10 months. By August 2009, GenBank had more than 250 billion bases from 150,000 different organisms [34]. The second example is related to astronomy. The Sloan Digital Sky Survey (SDSS), the biggest sky survey project in astronomy, recorded 25TB of data from 1998 to 2008. As the resolution of the telescope is improved, by 2004 the data volume generated per night will surpass 20TB. The last application is related to high-energy physics. At the beginning of 2008, the ATLAS experiment of the Large Hadron Collider (LHC) of the European Organization for Nuclear Research generates raw data at 2PB/s and stores about 10TB of processed data per year.
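To make the doubling figure concrete: if a dataset doubles every T months, it grows by a factor of 2^(t/T) after t months. The short Python sketch below evaluates this for the 10-month doubling period quoted for GenBank. The function name and the assumption of smooth exponential growth are illustrative only and are not taken from the cited report.

```python
def growth_factor(elapsed_months, doubling_months=10):
    """Growth factor after `elapsed_months`, assuming steady doubling."""
    return 2 ** (elapsed_months / doubling_months)


# With a 10-month doubling period, data volume quadruples in 20 months
# and grows by three orders of magnitude in roughly a decade.
for months in (10, 20, 120):
    print(months, "months ->", growth_factor(months), "x")
```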

In addition, pervasive sensing and computing across natural, commercial, Internet, government, and social environments are generating heterogeneous data with unprecedented complexity. These datasets have their own unique characteristics in scale, time dimension, and data category. For example, mobile data were recorded with respect to positions, movement, approximation degrees, communications, multimedia, use of applications, and audio environment [108]. According to the application environment and requirements, such datasets can be classified into different categories, so as to select proper and feasible solutions for big data.

3.2 Big data acquisition

As the second phase of the big data system, big data acquisition includes data collection, data transmission, and data pre-processing. During big data acquisition, once we collect the raw data, we shall utilize an efficient transmission mechanism to send it to a proper storage management system to support different analytical applications. The collected datasets may sometimes include much redundant or useless data, which unnecessarily increases storage space and affects the subsequent data analysis. For example, high redundancy is very common among datasets collected by sensors for environment monitoring. Data compression technology can be applied to reduce the redundancy. Therefore, data pre-processing operations are indispensable to ensure efficient data storage and exploitation.
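As a small illustration of how compression removes redundancy from sensor data, the Python sketch below compresses a highly repetitive series of temperature readings with the standard zlib module. The readings and their text encoding are made up for this example; the compression ratio obtained in practice depends entirely on the data.

```python
import zlib

# Hypothetical environment-monitoring data: the temperature barely changes,
# so consecutive readings are highly redundant.
readings = [21.5] * 500 + [21.6] * 500
raw = ",".join(f"{r:.1f}" for r in readings).encode("ascii")

compressed = zlib.compress(raw, 9)      # 9 = highest compression level
restored = zlib.decompress(compressed)  # lossless: the original bytes return

print(len(raw), "bytes raw,", len(compressed), "bytes compressed")
```

The compression and decompression steps cost CPU time, so the saving in storage and transmission has to be weighed against that extra computation.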

3.2.1 Data collection

Data collection utilizes special data collection techniques to acquire raw data from a specific data generation environment. Four common data collection methods are shown as follows.

– Log files: As one widely used data collection method, log files are record files automatically generated by the data source system, so as to record activities in designated file formats for subsequent analysis. Log files are typically used in nearly all digital devices. For example, web servers record in log files the number of clicks, click rates, visits, and other property records of web users [35]. To capture the activities of users at web sites, web servers mainly use the following three log file formats: public log file format (NCSA), expanded log format (W3C), and IIS log format (Microsoft). All three types of log files are in the ASCII text format. Databases, rather than text files, may sometimes be used to store log information to improve the query efficiency of the massive log store [36, 37]. There are also some other log-file-based data collection methods, including stock indicators in financial applications and the determination of operating states in network monitoring and traffic management.

– Sensing: Sensors are common in daily life to measure physical quantities and transform physical quantities into readable digital signals for subsequent processing (and storage). Sensory data may be classified as sound wave, voice, vibration, automobile, chemical, current, weather, pressure, temperature, etc. Sensed information is transferred to a data collection point through wired or wireless networks. For applications that may be easily deployed and managed, e.g., a video surveillance system [38], the wired sensor network is a convenient solution to acquire related information. Sometimes the accurate position of a specific phenomenon is unknown, and sometimes the monitored environment does not have the energy or communication infrastructures. Then wireless communication must be used to enable data transmission among sensor nodes under limited energy and communication capability. In recent years, WSNs have received considerable interest and have been applied to many applications, such as environmental research [39, 40], water quality monitoring [41], civil engineering [42, 43], and wildlife habitat monitoring [44]. A WSN generally consists of a large number of geographically distributed sensor nodes, each being a micro device powered by battery. Such sensors are deployed at designated positions as required by the application to collect remote sensing data. Once the sensors are deployed, the base station will send control information for network configuration/management or data collection to the sensor nodes. Based on such control information, the sensory data is assembled in different sensor nodes and sent back to the base station for further processing. Interested readers are referred to [45] for more detailed discussions.

– Methods for acquiring network data: At present, network data acquisition is accomplished using a combination of a web crawler, word segmentation system, task system, index system, etc. A web crawler is a program used by search engines for downloading and storing web pages [46]. Generally speaking, a web crawler starts from the uniform resource locator (URL) of an initial web page to access other linked web pages, during which it stores and sequences all the retrieved URLs. The web crawler acquires a URL in order of precedence through a URL queue, then downloads web pages, identifies all URLs in the downloaded web pages, and extracts new URLs to be put in the queue. This process is repeated until the web crawler is stopped. Data acquisition through a web crawler is widely applied in applications based on web pages, such as search engines or web caching. Traditional web page extraction technologies feature multiple efficient solutions, and considerable research has been done in this field. As more advanced web page applications are emerging, some extraction strategies are proposed in [47] to cope with rich Internet applications.

The current network data acquisition technologies mainly include traditional Libpcap-based packet capture technology, zero-copy packet capture technology, as well as some specialized network monitoring software such as Wireshark, SmartSniff, and WinNetCap.

– Libpcap-based packet capture technology: Libpcap (packet capture library) is a widely used network data packet capture function library. It is a general tool that does not depend on any specific system and is mainly used to capture data in the data link layer. It features simplicity, ease of use, and portability, but has relatively low efficiency. Therefore, under a high-speed network environment, considerable packet losses may occur when Libpcap is used.


– Zero-copy packet capture technology: The so-called zero-copy (ZC) means that no copies between any internal memories occur during packet receiving and sending at a node. In sending, the data packets directly start from the user buffer of applications, pass through the network interfaces, and arrive at an external network. In receiving, the network interfaces directly send data packets to the user buffer. The basic idea of zero-copy is to reduce data copy times, reduce system calls, and reduce CPU load while datagrams are passed from network equipment to user program space. The zero-copy technology first utilizes direct memory access (DMA) technology to directly transmit network datagrams to an address space pre-allocated by the system kernel, so as to avoid the participation of the CPU. In the meanwhile, it maps the internal memory of the datagrams in the system kernel to that of the detection program, or builds a cache region in the user space and maps it to the kernel space. Then the detection program directly accesses the internal memory, so as to reduce internal memory copies from system kernel to user space and reduce the number of system calls.

– Mobile equipment: At present, mobile devices are more widely used. As mobile device functions become increasingly stronger, they feature more complex and multiple means of data acquisition as well as a greater variety of data. Mobile devices may acquire geographical location information through positioning systems; acquire audio information through microphones; acquire pictures, videos, streetscapes, two-dimensional barcodes, and other multimedia information through cameras; and acquire user gestures and other body language information through touch screens and gravity sensors. Over the years, wireless operators have improved the service level of the mobile Internet by acquiring and analyzing such information. For example, the iPhone itself is a "mobile spy." It may collect wireless data and geographical location information and then send such information back to Apple Inc. for processing, of which the user is not aware. Apart from Apple, smart phone operating systems such as Android of Google and Windows Phone of Microsoft can also collect information in a similar manner.
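The URL-queue loop described above for web crawlers can be sketched in a few lines of Python. This is a minimal illustration rather than a production crawler: the crawl function, the LinkExtractor class, and the in-memory FAKE_WEB dictionary standing in for real HTTP downloads are all assumptions made so that the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags found in one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch, max_pages=100):
    """Crawl from seed_url; fetch(url) returns the page HTML or None."""
    queue = deque([seed_url])  # URL queue, served in order of arrival
    seen = {seed_url}          # every URL that was ever queued
    pages = {}                 # url -> downloaded HTML
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:       # page could not be downloaded
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute not in seen:   # only new URLs enter the queue
                seen.add(absolute)
                queue.append(absolute)
    return pages


# Hypothetical in-memory "web" used instead of real HTTP requests.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a>',
    "http://example.test/b": '<a href="/">home</a> <a href="/missing">x</a>',
}

crawled = crawl("http://example.test/", FAKE_WEB.get)
print(list(crawled))
```

A real crawler would replace FAKE_WEB.get with an HTTP client and add politeness rules (robots.txt, rate limits), which are omitted here.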

In addition to the aforementioned three data acquisition methods of main data sources, there are many other data collection methods or systems. For example, in scientific experiments, many special tools can be used to collect experimental data, such as magnetic spectrometers and radio telescopes. We may classify data collection methods from different perspectives. From the perspective of data sources, data collection methods can be classified into two categories: collection methods recording through data sources, and collection methods recording through other auxiliary tools.
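As an illustration of the log-file collection method listed earlier, the Python sketch below parses lines in the NCSA Common Log Format (host, ident, user, time, request, status, size) and counts hits per requested path. The regular expression, the helper name parse_clf_line, and the sample lines are assumptions written for this example; they are not taken from the cited works.

```python
import re
from collections import Counter

# NCSA Common Log Format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" means that no response body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


# Hypothetical sample lines; the last one is deliberately malformed.
SAMPLE_LOG = [
    '192.0.2.1 - - [10/Oct/2013:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.7 - alice [10/Oct/2013:13:56:01 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.1 - - [10/Oct/2013:13:57:12 -0700] "GET /missing HTTP/1.0" 404 -',
    'not a log line',
]

records = [r for r in map(parse_clf_line, SAMPLE_LOG) if r is not None]
hits_per_path = Counter(r["request"].split()[1] for r in records)
print(hits_per_path)
```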

3.2.2 Data transportation

Upon the completion of raw data collection, data will be transferred to a data storage infrastructure for processing and analysis. As discussed in Section 2.3, big data is mainly stored in a data center. The data layout should be adjusted to improve computing efficiency or facilitate hardware maintenance. In other words, internal data transmission may occur in the data center. Therefore, data transmission consists of two phases: Inter-DCN transmissions and Intra-DCN transmissions.

– Inter-DCN transmissions: Inter-DCN transmissions are from the data source to the data center, which is generally achieved with the existing physical network infrastructure. Because of the rapid growth of traffic demands, the physical network infrastructure in most regions around the world is constituted by high-volume, high-rate, and cost-effective optical fiber transmission systems. Over the past 20 years, advanced management equipment and technologies have been developed, such as IP-based wavelength division multiplexing (WDM) network architecture, to conduct smart control and management of optical fiber networks [48, 49]. WDM is a technology that multiplexes multiple optical carrier signals with different wavelengths and couples them to the same optical fiber of the optical link. In such technology, lasers with different wavelengths carry different signals. So far, the backbone network has been deployed with WDM optical transmission systems with a single channel rate of 40Gb/s. At present, 100Gb/s commercial interfaces are available, and 100Gb/s systems (or Tb/s systems) will be available in the near future [50]. However, traditional optical transmission technologies are limited by the bandwidth of the electronic bottleneck [51]. Recently, orthogonal frequency-division multiplexing (OFDM), initially designed for wireless systems, is regarded as one of the main candidate technologies for future high-speed optical transmission. OFDM is a multi-carrier parallel transmission technology. It segments a high-speed data flow to transform it into low-speed sub-data-flows to be transmitted over multiple orthogonal sub-carriers [52]. Compared with the fixed channel spacing of WDM, OFDM allows sub-channel frequency spectrums to overlap with each other [53]. Therefore, it is a flexible and efficient optical networking technology.

– Intra-DCN transmissions: Intra-DCN transmissions are the data communication flows within data centers. Intra-DCN transmissions depend on the communication mechanism within the data center (i.e., on physical connection plates, chips, internal memories of data servers, network architectures of data centers, and communication protocols). A data center consists of multiple integrated server racks interconnected with its internal connection networks. Nowadays, the internal connection networks of most data centers are fat-tree, two-layer or three-layer structures based on multi-commodity network flows [51, 54]. In the two-layer topological structure, the racks are connected by 1Gbps top rack switches (TOR), and such top rack switches are then connected with 10Gbps aggregation switches. The three-layer topological structure is a structure augmented with one layer on top of the two-layer topological structure, and such layer is constituted by 10Gbps or 100Gbps core switches to connect the aggregation switches. There are also other topological structures which aim to improve data center networks [55–58]. Because of the inadequacy of electronic packet switches, it is difficult to increase communication bandwidths while keeping energy consumption low. Over the years, due to the huge success achieved by optical technologies, optical interconnection among the networks in data centers has drawn great interest. Optical interconnection is a high-throughput, low-delay, and low-energy-consumption solution. At present, optical technologies are only used for point-to-point links in data centers. Such optical links provide connection for the switches using low-cost multi-mode fiber (MMF) with a 10Gbps data rate. Optical interconnection (switching in the optical domain) of networks in data centers is a feasible solution, which can provide Tbps-level transmission bandwidth with low energy consumption. Recently, many optical interconnection plans have been proposed for data center networks [59]. Some plans add optical paths to upgrade the existing networks, and other plans completely replace the current switches [59–64]. As a strengthening technology, Zhou et al. [65] adopt wireless links in the 60GHz frequency band to strengthen wired links. Network virtualization should also be considered to improve the efficiency and utilization of data center networks.

3.2.3 Data pre-processing

Because of the wide variety of data sources, the collected datasets vary with respect to noise, redundancy, consistency, etc., and it is undoubtedly a waste to store meaningless data. In addition, some analytical methods have stringent requirements on data quality. Therefore, in order to enable effective data analysis, we shall pre-process data under many circumstances to integrate the data from different sources, which can not only reduce storage expense but also improve analysis accuracy. Some relational data pre-processing techniques are discussed as follows.

– Integration: data integration is the cornerstone of modern commercial informatics, which involves the combination of data from different sources and provides users with a uniform view of data [66]. This is a mature research field for traditional databases. Historically, two methods have been widely recognized: data warehouse and data federation. Data warehousing includes a process named ETL (Extract, Transform, and Load). Extraction involves connecting source systems and selecting, collecting, analyzing, and processing the necessary data. Transformation is the execution of a series of rules to transform the extracted data into standard formats. Loading means importing extracted and transformed data into the target storage infrastructure. Loading is the most complex procedure among the three, which includes operations such as transformation, copy, clearing, standardization, screening, and data organization. In data federation, a virtual database can be built to query and aggregate data from different data sources, but such a database does not contain data. On the contrary, it includes information or metadata related to the actual data and its positions. These two "storage-reading" approaches do not satisfy the high performance requirements of data flows or search programs and applications. Compared with queries, data in these two approaches is more dynamic and must be processed during data transmission. Generally, data integration methods are accompanied by flow processing engines and search engines [30, 67].

– Cleaning: data cleaning is a process to identify inaccurate, incomplete, or unreasonable data, and then modify or delete such data to improve data quality. Generally, data cleaning includes five complementary procedures [68]: defining and determining error types, searching and identifying errors, correcting errors, documenting error examples and error types, and modifying data entry procedures to reduce future errors. During cleaning, data formats, completeness, rationality, and restrictions shall be inspected. Data cleaning is of vital importance to keeping data consistent, and is widely applied in many fields, such as banking, insurance, retail industry, telecommunications, and traffic control.

In e-commerce, most data is electronically collected, which may have serious data quality problems. Classic data quality problems mainly come from software defects, customized errors, or system mis-configuration. The authors in [69] discussed data cleaning in e-commerce by crawlers and regularly re-copying customer and account information.

In [70], the problem of cleaning RFID data was examined. RFID is widely used in many applications, e.g., inventory management and target tracking. However, raw RFID data features low quality and includes a lot of abnormal data, limited by the physical design and affected by environmental noise. In [71], a probability model was developed to cope with data loss in mobile environments. Khoussainova et al. [72] proposed a system to automatically correct errors of input data by defining global integrity constraints.

Herbert et al. [73] proposed a framework called BIO-AJAX to standardize biological data so as to conduct further computation and improve search quality. With BIO-AJAX, some errors and repetitions may be eliminated, and common data mining technologies can be executed more effectively.

– Redundancy elimination: data redundancy refers to data repetitions or surplus, which usually occurs in many datasets. Data redundancy can increase unnecessary data transmission expense and cause defects in storage systems, e.g., waste of storage space, data inconsistency, reduction of data reliability, and data damage. Therefore, various redundancy reduction methods have been proposed, such as redundancy detection, data filtering, and data compression. Such methods may apply to different datasets or application environments. However, redundancy reduction may also bring about certain negative effects. For example, data compression and decompression cause additional computational burden. Therefore, the benefits of redundancy reduction and its cost should be carefully balanced. Data collected from different fields will increasingly appear in image or video formats. It is well known that images and videos contain considerable redundancy, including temporal redundancy, spatial redundancy, statistical redundancy, and sensing redundancy. Video compression is widely used to reduce redundancy in video data, as specified in the many video coding standards (MPEG-2, MPEG-4, H.263, and H.264/AVC). In [74], the authors investigated the problem of video compression in a video surveillance system with a video sensor network. The authors propose a new MPEG-4 based method by investigating the contextual redundancy related to background and foreground in a scene. The low complexity and the low compression ratio of the proposed approach were demonstrated by the evaluation results.

For generalized data transmission or storage, repeated data deletion is a special data compression technology, which aims to eliminate repeated data copies [75]. With repeated data deletion, individual data blocks or data segments are assigned identifiers (e.g., using a hash algorithm) and stored, with the identifiers added to the identification list. As the analysis of repeated data deletion continues, if a new data block has an identifier that is identical to one listed in the identification list, the new data block will be deemed redundant and will be replaced by the corresponding stored data block. Repeated data deletion can greatly reduce storage requirements, which is particularly important to a big data storage system.

Apart from the aforementioned data pre-processing methods, specific data objects shall go through some other operations, such as feature extraction. Such operations play an important role in multimedia search and DNA analysis [76–78]. Usually, high-dimensional feature vectors (or high-dimensional feature points) are used to describe such data objects, and the system stores the dimensional feature vectors for future retrieval. Data transfer is usually used to process distributed heterogeneous data sources, especially business datasets [79]. As a matter of fact, in consideration of various datasets, it is non-trivial or impossible to build a uniform data pre-processing procedure and technology that is applicable to all types of datasets; the specific features, problems, performance requirements, and other factors of the datasets should be considered, so as to select a proper data pre-processing strategy.
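The repeated data deletion (deduplication) procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: fixed-size blocks, SHA-256 as the hash algorithm, and an in-memory dictionary playing the role of both the identification list and the block store. The function names deduplicate and restore are hypothetical and do not come from [75].

```python
import hashlib


def deduplicate(data, block_size=4):
    """Split data into fixed-size blocks and store each distinct block once."""
    store = {}    # identifier -> stored block (the identification list)
    recipe = []   # ordered identifiers needed to rebuild the original data
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        identifier = hashlib.sha256(block).hexdigest()
        if identifier not in store:
            store[identifier] = block   # new block: keep one copy
        recipe.append(identifier)       # repeated block: reference only
    return store, recipe


def restore(store, recipe):
    """Rebuild the original data from the stored blocks and the recipe."""
    return b"".join(store[identifier] for identifier in recipe)


data = b"ABCDABCDEFGHABCD"              # the block "ABCD" occurs three times
store, recipe = deduplicate(data)
print(len(recipe), "blocks referenced,", len(store), "blocks stored")
```

Production systems typically use much larger or variable-size blocks and persist the index, but the identifier lookup shown here is the core of the technique.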

4 Big data storage

The explosive growth of data imposes stricter requirements on storage and management. In this section, we focus on the storage of big data. Big data storage refers to the storage and management of large-scale datasets while achieving reliability and availability of data access. We will review important issues including massive storage systems, distributed storage systems, and big data storage mechanisms. On one hand, the storage infrastructure needs to provide information storage service with reliable storage space; on the other hand, it must provide a powerful access interface for query and analysis of a large amount of data.

Traditionally, as auxiliary equipment of servers, data storage devices are used to store, manage, look up, and analyze data with structured RDBMSs. With the sharp growth of data, data storage devices are becoming increasingly important, and many Internet companies pursue large storage capacity to remain competitive. Therefore, there is a compelling need for research on data storage.

184 Mobile Netw Appl (2014) 19:171–209

4.1 Storage system for massive data

Various storage systems have emerged to meet the demands of massive data. Existing massive storage technologies can be classified as Direct Attached Storage (DAS) and network storage, while network storage can be further classified into Network Attached Storage (NAS) and Storage Area Network (SAN).

In DAS, various hard disks are directly connected to servers, and data management is server-centric, such that storage devices are peripheral equipment, each of which takes a certain amount of I/O resources and is managed by individual application software. For this reason, DAS is only suitable for interconnecting servers at a small scale. Moreover, due to its low scalability, DAS exhibits poor efficiency when the storage capacity is increased, i.e., upgradeability and expandability are greatly limited. Thus, DAS is mainly used in personal computers and small-sized servers.

Network storage uses a network to provide users with a unified interface for data access and sharing. Network storage equipment includes special data exchange equipment, disk arrays, tape libraries, and other storage media, as well as special storage software. It is characterized by strong expandability.

NAS is essentially auxiliary storage equipment of a network. It is directly connected to a network through a hub or switch using TCP/IP protocols. In NAS, data is transmitted in the form of files. Compared to DAS, the I/O burden at a NAS server is greatly reduced, since the server accesses a storage device indirectly through a network.

While NAS is network-oriented, SAN is specially designed for data storage with a scalable and bandwidth-intensive network, e.g., a high-speed network with optical fiber connections. In SAN, data storage management is relatively independent within a storage local area network, where multipath-based data switching among any internal nodes is utilized to achieve a maximum degree of data sharing and data management.

In terms of the organization of a data storage system, DAS, NAS, and SAN can all be divided into three parts: (i) the disk array, which is the foundation of a storage system and the fundamental guarantee for data storage; (ii) connection and network sub-systems, which provide connections among one or more disk arrays and servers; and (iii) storage management software, which handles data sharing, disaster recovery, and other storage management tasks of multiple servers.

4.2 Distributed storage system

The first challenge brought about by big data is how to develop a large-scale distributed storage system for efficient data processing and analysis. To use a distributed system to store massive data, the following factors should be taken into consideration:

– Consistency: a distributed storage system requires multiple servers to cooperatively store data. As more servers are used, the probability of server failures increases. Usually, data is divided into multiple pieces stored at different servers to ensure availability in case of server failure. However, server failures and parallel storage may cause inconsistency among different copies of the same data. Consistency refers to assuring that multiple copies of the same data are identical.

– Availability: a distributed storage system operates on multiple sets of servers. As more servers are used, server failures are inevitable. It is desirable that the entire system is not seriously affected by such failures and can still satisfy customers' read and write requests. This property is called availability.

– Partition Tolerance: multiple servers in a distributed storage system are connected by a network. The network could suffer link/node failures or temporary congestion. The distributed system should have a certain level of tolerance to problems caused by network failures. It is desirable that the distributed storage still works well when the network is partitioned.

Eric Brewer proposed the CAP theory [80, 81] in 2000, which states that a distributed system cannot simultaneously meet the requirements of consistency, availability, and partition tolerance; at most two of the three requirements can be satisfied simultaneously. Seth Gilbert and Nancy Lynch from MIT proved the correctness of the CAP theory in 2002. Since consistency, availability, and partition tolerance cannot be achieved simultaneously, we can have a CA system by ignoring partition tolerance, a CP system by ignoring availability, and an AP system by ignoring consistency, according to different design goals. The three kinds of systems are discussed in the following.

CA systems do not have partition tolerance, i.e., they cannot handle network failures. Therefore, CA systems are generally regarded as storage systems with a single server, such as traditional small-scale relational databases. Such systems keep a single copy of data, so consistency is easily ensured. Availability is guaranteed by the sound design of relational databases. However, since CA systems cannot handle network failures, they cannot be expanded to use many servers. Therefore, most large-scale storage systems are CP systems or AP systems.

Compared with CA systems, CP systems ensure partition tolerance. Therefore, CP systems can be expanded to become distributed systems. CP systems generally maintain several copies of the same data in order to ensure a level of fault tolerance. CP systems also ensure data consistency, i.e., multiple copies of the same data are guaranteed to be completely identical. However, CP systems cannot ensure sound availability because of the high cost of consistency assurance. Therefore, CP systems are useful for scenarios with moderate load but stringent requirements on data accuracy (e.g., trading data). BigTable and HBase are two popular CP systems.

AP systems also ensure partition tolerance. However, AP systems differ from CP systems in that they also ensure availability, while providing only eventual consistency rather than the strong consistency of the previous two kinds of systems. Therefore, AP systems only apply to scenarios with frequent requests but not very high requirements on accuracy. For example, in online Social Networking Services (SNS) systems, there are many concurrent visits to the data, but a certain amount of data errors is tolerable. Furthermore, because AP systems ensure eventual consistency, accurate data can still be obtained after a certain delay. Therefore, AP systems may also be used in circumstances with no stringent real-time requirements. Dynamo and Cassandra are two popular AP systems.

4.3 Storage mechanism for big data

Considerable research on big data has promoted the development of storage mechanisms for big data. Existing storage mechanisms for big data may be classified into three bottom-up levels: (i) file systems, (ii) databases, and (iii) programming models.

File systems are the foundation of the applications at upper levels. Google's GFS is an expandable distributed file system that supports large-scale, distributed, data-intensive applications [25]. GFS uses cheap commodity servers to achieve fault tolerance and provides customers with high-performance services. GFS supports large-scale file applications with more frequent reading than writing. However, GFS also has some limitations, such as a single point of failure and poor performance for small files. Such limitations have been overcome by Colossus [82], the successor of GFS.

In addition, other companies and researchers have their own solutions to meet the different demands for storage of big data. For example, HDFS and Kosmosfs are open-source derivatives of the GFS design. Microsoft developed Cosmos [83] to support its search and advertisement business. Facebook utilizes Haystack [84] to store its large number of small-sized photos. Taobao also developed TFS and FastDFS. In conclusion, distributed file systems have become relatively mature after years of development and business operation. Therefore, we will focus on the other two levels in the rest of this section.

4.3.1 Database technology

Database technology has been evolving for more than 30 years. Various database systems have been developed to handle datasets at different scales and to support various applications. Traditional relational databases cannot meet the challenges on categories and scales brought about by big data. NoSQL databases (i.e., non-traditional relational databases) are becoming more popular for big data storage. NoSQL databases feature flexible schemas, simple and easy replication, simple APIs, eventual consistency, and support for large data volumes. NoSQL databases are becoming the core technology for big data. We will examine the following three main kinds of NoSQL databases in this section: key-value databases, column-oriented databases, and document-oriented databases, each based on a certain data model.

– Key-value Databases: Key-value databases are built on a simple data model in which data is stored as values associated with keys. Every key is unique, and customers may query values by their keys. Such databases feature a simple structure, and modern key-value databases are characterized by high expandability and shorter query response times than those of relational databases. Over the past few years, many key-value databases have appeared, motivated by Amazon's Dynamo system [85]. We will introduce Dynamo and several other representative key-value databases.

– Dynamo: Dynamo is a highly available and expandable distributed key-value data storage system. It is used to store and manage the state of some core services of the Amazon e-Commerce Platform that can be realized with key access. The common pattern of using relational databases may introduce inefficiencies and limit data scale and availability, while Dynamo resolves these problems with a simple key-object interface consisting of simple read and write operations. Dynamo achieves elasticity and availability through data partitioning, data replication, and object versioning mechanisms. The Dynamo partitioning scheme relies on Consistent Hashing [86] to divide the load among multiple main storage machines; its main advantage is that the arrival or departure of a node only affects its directly adjacent nodes and does not affect other nodes. Dynamo replicates data to N servers, where N is a configurable parameter, in order to achieve high availability and durability. The Dynamo system also provides eventual consistency, so that updates can be propagated asynchronously to all copies.

– Voldemort: Voldemort is also a key-value storage system, which was initially developed for and is still used by LinkedIn. Keys and values in Voldemort are composite objects that may include lists and maps. The Voldemort interface includes three simple operations: read, write, and delete, all of which are addressed by keys. Voldemort provides asynchronous updating with multi-version concurrency control but does not ensure data consistency. However, Voldemort supports optimistic locking for consistent multi-record updating: when a conflict occurs between an update and any other operation, the update is aborted. The data replication mechanism of Voldemort is the same as that of Dynamo. Voldemort not only stores data in RAM but also allows a storage engine to be plugged in. In particular, Voldemort supports two storage engines: Berkeley DB and Random Access Files.

Key-value databases emerged only a few years ago. Deeply influenced by Amazon's Dynamo, other key-value storage systems include Redis, Tokyo Cabinet and Tokyo Tyrant, Memcached and MemcacheDB, Riak, and Scalaris, all of which provide expandability by distributing keys across nodes. Voldemort, Riak, Tokyo Cabinet, and Memcached can utilize attached storage devices to store data in RAM or on disks. The other storage systems store data in RAM and provide disk backup, or rely on replication and recovery to avoid the need for backup.
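The partitioning and replication scheme described for Dynamo can be sketched as follows. This is a minimal illustration under simplifying assumptions, not Dynamo's actual implementation: it uses MD5 to place nodes and keys on the ring, omits refinements such as virtual nodes, and the names `ConsistentHashRing` and `preference_list` are hypothetical.

```python
import bisect
import hashlib


def ring_position(name: str) -> int:
    """Map a node name or key to a position on the hash ring."""
    return int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16)


class ConsistentHashRing:
    def __init__(self, nodes, n_replicas=3):
        self.n_replicas = n_replicas  # the configurable parameter N
        self._ring = sorted((ring_position(node), node) for node in nodes)

    def add_node(self, node):
        bisect.insort(self._ring, (ring_position(node), node))

    def remove_node(self, node):
        self._ring.remove((ring_position(node), node))

    def preference_list(self, key):
        """Return the first N nodes found clockwise from the key's position."""
        if not self._ring:
            return []
        positions = [pos for pos, _ in self._ring]
        start = bisect.bisect_right(positions, ring_position(key))
        count = min(self.n_replicas, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


ring = ConsistentHashRing(["node-a", "node-b", "node-c", "node-d"], n_replicas=3)
print(ring.preference_list("shopping-cart:42"))
```

Because each key is owned by the first node clockwise from its position, removing a node only reassigns the keys that node held; keys owned by other nodes keep their owner.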

– Column-oriented Databases: Column-oriented databases store and process data by columns rather than rows. Both columns and rows are segmented across multiple nodes to realize expandability. Column-oriented databases are mainly inspired by Google's BigTable. In this section, we first discuss BigTable and then introduce several derivative tools.

– BigTable: BigTable is a distributed, structured data storage system designed to process large-scale (PB-class) data across thousands of commodity servers [87]. The basic data structure of BigTable is a sparse, distributed, persistent, multi-dimensional sorted map. The map is indexed by row key, column key, and timestamp, and every value in the map is an uninterpreted byte array. Each row key in BigTable is an arbitrary string of up to 64 KB. Rows are stored in lexicographical order and are continually segmented into Tablets (i.e., units of distribution) for load balancing. Thus, reading a short range of rows can be highly efficient, since it only involves communication with a small number of machines. Columns are grouped according to the prefixes of their keys, thus forming column families. These column families are the basic units for access control. Timestamps are 64-bit integers that distinguish different versions of cell values. Clients may flexibly determine the number of cell versions stored. These versions are kept in descending order of timestamps, so the latest version is always read first.

The BigTable API features the creation and deletion of tables and column families, as well as the modification of metadata of clusters, tables, and column families. Client applications may insert or delete values in BigTable, query values from columns, or iterate over subsets of the data in a table. BigTable also supports some other features, such as transaction processing within a single row. Users may utilize such features to conduct more complex data processing.

Every BigTable deployment includes three main components: the Master server, Tablet servers, and the client library. BigTable allows only one active Master server, which is responsible for assigning Tablets to Tablet servers, detecting added or removed Tablet servers, and balancing load. In addition, it handles BigTable schema changes, e.g., creating tables and column families, and garbage-collects files stored in GFS that have been deleted or disabled for a specific BigTable instance. Every Tablet server manages a set of Tablets and is responsible for reads and writes to the Tablets it has loaded. When Tablets grow too big, they are split by the server. The application client library is used to communicate with BigTable instances.

BigTable is based on many fundamental components of Google, including GFS [25], the cluster management system, the SSTable file format, and Chubby [88]. GFS is used to store data and log files. The cluster management system is responsible for task scheduling, resource sharing, handling of machine failures, and monitoring of machine status. The SSTable file format is used to store BigTable data internally, and it provides a persistent, ordered, and immutable mapping from keys to values, both of which are arbitrary byte strings. BigTable utilizes Chubby for the following tasks: 1) ensuring there is at most one active Master copy at any time; 2) storing the bootstrap location of BigTable data; 3) discovering Tablet servers; 4) conducting error recovery in case of Tablet server failures; 5) storing BigTable schema information; and 6) storing the access control table.

– Cassandra: Cassandra is a distributed storage system for managing huge amounts of structured data distributed among multiple commodity servers [89]. The system was developed by Facebook and became an open-source tool in 2008. It adopts the ideas and concepts of both Amazon Dynamo and Google BigTable, in particular integrating the distributed system technology of Dynamo with the BigTable data model. Tables in Cassandra take the form of a distributed four-dimensional structured map, where the four dimensions are row, column, column family, and super column. A row is identified by a string key of arbitrary length. No matter how many columns are read or written, an operation on a row is atomic. Columns may be grouped into clusters called column families, similar to the data model of BigTable. Cassandra provides two kinds of column families: column families and super columns. A super column includes an arbitrary number of columns related to the same name. A column family includes columns and super columns, which may be continuously inserted into the column family at runtime. The partitioning and replication mechanisms of Cassandra are very similar to those of Dynamo, so as to achieve consistency.

– Derivative tools of BigTable: since the BigTable code is not available under an open-source license, several open-source projects compete to implement the BigTable concept in similar systems, such as HBase and Hypertable.

HBase is a clone of BigTable written in Java and is a part of Apache's Hadoop MapReduce framework [90]. HBase replaces GFS with HDFS. It writes updated contents into RAM and regularly writes them into files on disk. Row operations are atomic, with row-level locking and transaction processing, and support for transactions of larger scope is optional. Partitioning and distribution are handled transparently, with no need for client-side hashing or a fixed key space.

HyperTable was developed along the lines of BigTable to obtain a high-performance, expandable, distributed storage and processing system for structured and unstructured data [91]. HyperTable relies on a distributed file system, e.g., HDFS, and a distributed lock manager. Its data representation, processing, and partitioning mechanisms are similar to those of BigTable. HyperTable has its own query language, called HyperTable Query Language (HQL), which allows users to create, modify, and query underlying tables.

Since column-oriented databases mainly emulate BigTable, their designs are all similar except for the concurrency mechanism and several other features. For example, Cassandra emphasizes weak consistency with multi-version concurrency control, while HBase and HyperTable focus on strong consistency through locks or log records.
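The BigTable data model discussed above, a sorted map indexed by row key, column key, and timestamp, can be mimicked in memory with a short sketch. The class `SparseSortedTable` and its methods are hypothetical names chosen for illustration; the sketch ignores distribution, Tablets, and persistence entirely.

```python
import bisect


class SparseSortedTable:
    def __init__(self, max_versions=3):
        self.max_versions = max_versions  # number of cell versions to retain
        self._rows = {}      # row key -> {column key -> [(timestamp, value), ...]}
        self._row_keys = []  # row keys kept in lexicographical order

    def put(self, row, column, timestamp, value):
        if row not in self._rows:
            bisect.insort(self._row_keys, row)
            self._rows[row] = {}
        versions = self._rows[row].setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)  # newest version first
        del versions[self.max_versions:]                   # drop the oldest versions

    def get(self, row, column):
        """Return the latest value of a cell, or None if the cell is empty."""
        versions = self._rows.get(row, {}).get(column)
        return versions[0][1] if versions else None

    def scan(self, start_row, end_row):
        """Yield (row, column, latest value) for start_row <= row < end_row."""
        lo = bisect.bisect_left(self._row_keys, start_row)
        hi = bisect.bisect_left(self._row_keys, end_row)
        for row in self._row_keys[lo:hi]:
            for column in sorted(self._rows[row]):
                yield row, column, self._rows[row][column][0][1]


table = SparseSortedTable(max_versions=2)
table.put("com.example/a", "contents:html", 1, b"<html>v1</html>")
table.put("com.example/a", "contents:html", 2, b"<html>v2</html>")
print(table.get("com.example/a", "contents:html"))
```

Keeping row keys sorted is what makes a short range scan cheap: only the contiguous slice of rows between the two bounds is touched.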

– Document Databases: Compared with key-value storage, document storage can support more complex data forms. Since documents do not follow a strict schema, there is no need to conduct schema migration. In addition, key-value pairs can still be saved. We will examine three important representatives of document storage systems, i.e., MongoDB, SimpleDB, and CouchDB.

– MongoDB: MongoDB is an open-source, document-oriented database [92]. MongoDB stores documents as Binary JSON (BSON) objects [93], which are similar to JSON objects. Every document has an ID field as the primary key. Queries in MongoDB are expressed with a syntax similar to JSON. A database driver sends the query as a BSON object to MongoDB. The system allows queries on all documents, including embedded objects and arrays. To enable rapid queries, indexes can be created on the queryable fields of documents. Replication in MongoDB is carried out with log files on the master nodes that record all the high-level operations conducted in the database. During replication, the slaves query the master for all write operations since the last synchronization and execute the operations from the log files on their local databases. MongoDB supports horizontal expansion with automatic sharding, distributing data among thousands of nodes with automatic load balancing and failover.

– SimpleDB: SimpleDB is a distributed database and a web service of Amazon [94]. Data in SimpleDB is organized into various domains, in which data may be stored, acquired, and queried. Domains include different properties and sets of name/value pairs of items. Data is replicated to different machines at different data centers in order to ensure data safety and improve performance. This system does not support automatic partitioning and thus cannot be expanded as the data volume changes. SimpleDB allows users to query with SQL. It is worth noting that SimpleDB can assure eventual consistency but does not support Multi-Version Concurrency Control (MVCC). Therefore, conflicts cannot be detected from the client side.

– CouchDB: Apache CouchDB is a document-oriented database written in Erlang [95]. Data in CouchDB is organized into documents consisting of fields named by keys/names and values, which are stored and accessed as JSON objects. Every document is provided with a unique identifier. CouchDB allows access to database documents through a RESTful HTTP API. If a document needs to be modified, the client must download the entire document, modify it, and then send it back to the database. Each time a document is rewritten, its revision identifier is updated. CouchDB utilizes optimistic replication to obtain scalability without a sharding mechanism. Since multiple CouchDB instances may run alongside other transactions simultaneously, any kind of replication topology can be built. The consistency of CouchDB relies on the replication mechanism. CouchDB supports MVCC with historical hash records.
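The download, modify, and write-back cycle described for document stores, together with client-side conflict detection through revisions, can be sketched with a small in-memory store. This is only an illustration in the spirit of MVCC: `DocumentStore`, `ConflictError`, and the "generation-hash" revision format are assumptions made for the sketch and do not reproduce the API of CouchDB or any other system.

```python
import hashlib
import json


class ConflictError(Exception):
    """Raised when a write is based on a stale revision."""


class DocumentStore:
    def __init__(self):
        self._docs = {}  # document id -> (revision, document body)

    @staticmethod
    def _next_rev(prev_rev, body):
        generation = 1 if prev_rev is None else int(prev_rev.split("-")[0]) + 1
        payload = json.dumps(body, sort_keys=True).encode("utf-8")
        return "%d-%s" % (generation, hashlib.md5(payload).hexdigest())

    def get(self, doc_id):
        """Return (revision, copy of the whole document)."""
        rev, body = self._docs[doc_id]
        return rev, dict(body)

    def put(self, doc_id, body, rev=None):
        """Write the whole document; rev must match the stored revision."""
        current = self._docs.get(doc_id)
        current_rev = current[0] if current else None
        if rev != current_rev:
            raise ConflictError("stale revision for %r" % doc_id)
        new_rev = self._next_rev(current_rev, body)
        self._docs[doc_id] = (new_rev, dict(body))
        return new_rev


store = DocumentStore()
first_rev = store.put("user:1", {"name": "Ann"})
rev, doc = store.get("user:1")       # download the entire document
doc["city"] = "Wuhan"                # modify it locally
store.put("user:1", doc, rev=rev)    # send it back with the revision it was based on
```

A second client still holding `first_rev` would receive a `ConflictError` on its next write, which is the kind of client-visible conflict detection that a store without MVCC cannot offer.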

4.3.2 Database programming model

Big data are generally stored on hundreds or even thousands of commodity servers. Thus, traditional parallel models, such as Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), may not be adequate to support such large-scale parallel programs. Recently, some proposed parallel programming models have effectively improved the performance of NoSQL and reduced the performance gap to relational databases. Therefore, these models have become the cornerstone for the analysis of massive data.

– MapReduce: MapReduce [22] is a simple but powerful programming model for large-scale computing, which uses large clusters of commodity PCs to achieve automatic parallel processing and distribution. The MapReduce computing model has only two functions, i.e., Map and Reduce, both of which are programmed by users. The Map function processes input key-value pairs and generates intermediate key-value pairs. MapReduce then combines all the intermediate values related to the same key and passes them to the Reduce function, which merges the value set into a smaller set. MapReduce has the advantage that it hides the complicated steps of developing parallel applications, e.g., data scheduling, fault tolerance, and inter-node communication. The user only needs to program the two functions to develop a parallel application. The initial MapReduce framework did not support multiple datasets in a task, a limitation that has been mitigated by some recent enhancements [96, 97].

Over the past decades, programmers have become familiar with the high-level declarative language SQL, often used in relational databases for task description and dataset analysis. However, the succinct MapReduce framework only provides two opaque functions, which cannot cover all the common operations. Therefore, programmers have to spend time programming basic functions, which are typically hard to maintain and reuse. In order to improve programming efficiency, some high-level language systems have been proposed, e.g., Sawzall [98] of Google, Pig Latin [99] of Yahoo, Hive [100] of Facebook, and Scope [87] of Microsoft.
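The two user-programmed functions of MapReduce can be illustrated with the classic word-count task. The sketch below runs on a single machine and only mimics the model; `run_mapreduce` is a hypothetical driver that stands in for the grouping step a real framework would perform across a cluster.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map: emit an intermediate (word, 1) pair for every word in a document."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: merge all values of one key into a smaller result."""
    return word, sum(counts)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)  # combine values of the same key
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


documents = [("d1", "big data big"), ("d2", "data storage")]
print(run_mapreduce(documents, map_fn, reduce_fn))
```

Scheduling, fault tolerance, and inter-node communication are absent here precisely because, in the real model, the framework rather than the user is responsible for them.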

– Dryad: Dryad [101] is a general-purpose distributed execution engine for coarse-grained data-parallel applications. The operational structure of Dryad is a directed acyclic graph, in which vertexes represent programs and edges represent data channels. Dryad executes operations on the vertexes in clusters and transmits data via data channels, including files, TCP connections, and shared-memory FIFOs. During operation, resources in the logical operation graph are automatically mapped to physical resources.

The operational structure of Dryad is coordinated by a central program called the job manager, which can be executed in clusters or on workstations across the network. A job manager consists of two parts: 1) application code, which is used to build a job communication graph, and 2) program library code, which is used to arrange available resources. All kinds of data are directly transmitted between vertexes. Therefore, the job manager is only responsible for decision-making and does not obstruct any data transmission.

In Dryad, application developers can flexibly choose any directed acyclic graph to describe the communication patterns of the application and express data transmission mechanisms. In addition, Dryad allows vertexes to use any number of input and output data sets, while MapReduce supports only one input and one output set. DryadLINQ [102] is the high-level language of Dryad and is used to integrate the aforementioned SQL-like language execution environment.

– All-Pairs: All-Pairs [103] is a system specially designed for biometrics, bioinformatics, and data mining applications. It focuses on comparing element pairs in two datasets by a given function. All-Pairs can be expressed as a three-tuple (Set A, Set B, Function F), in which Function F is used to compare all elements in Set A and Set B. The comparison result is an output matrix M, which is also called the Cartesian product or cross join of Set A and Set B.

All-Pairs is implemented in four phases: system modeling, distribution of input data, batch job management, and result collection. In Phase I, an approximate model of system performance is built to evaluate how much CPU resource is needed and how to partition the job. In Phase II, a spanning tree is built for data transmission, which allows the workload of every partition to retrieve input data effectively. In Phase III, after the data flow is delivered to the proper nodes, the All-Pairs engine builds a batch-processing submission for the jobs in the partitions, sequences them in the batch processing system, and formulates a node running command to acquire data. In the last phase, after the batch processing system completes the jobs, the extraction engine collects the results and combines them into a proper structure, which is generally a single file list in which all results are put in order.
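The three-tuple (Set A, Set B, Function F) and its output matrix M can be written down directly. The sketch below is a sequential illustration only; the comparison function `hamming` is an arbitrary example chosen for the sketch, and none of the four distributed phases of the actual system is modeled.

```python
def all_pairs(set_a, set_b, compare):
    """Return matrix M with M[i][j] = compare(set_a[i], set_b[j])."""
    return [[compare(a, b) for b in set_b] for a in set_a]


def hamming(a, b):
    """Example Function F: number of differing positions in equal-length strings."""
    if len(a) != len(b):
        raise ValueError("inputs must have equal length")
    return sum(1 for x, y in zip(a, b) if x != y)


matrix = all_pairs(["abc", "abd"], ["abc", "xbc", "xyz"], hamming)
for row in matrix:
    print(row)
```

Since every cell of M is independent of the others, the rows (or blocks of cells) are natural units for the job partitioning that the system performs.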

– Pregel: The Pregel [104] system of Google facilitates the processing of large-sized graphs, e.g., analysis of network graphs and social networking services. A computational task is expressed by a directed graph consisting of vertexes and directed edges. Every vertex is associated with a modifiable, user-defined value, and every directed edge is associated with a source vertex and consists of a user-defined value and the identifier of a target vertex. Once the graph is built, the program conducts iterative calculations, called supersteps, between which global synchronization points are set, until the algorithm completes and the output is produced. In every superstep, vertex computations are parallel, and every vertex executes the same user-defined function to express a given algorithm logic. Every vertex may modify its own state and that of its outgoing edges, receive messages sent in the previous superstep, send messages to other vertexes, and even modify the topological structure of the entire graph. Edges have no corresponding computations. A vertex may deactivate itself by voting to halt. When all vertexes are inactive and there are no messages to transmit, the entire program execution is completed. The Pregel program output is a set consisting of the values output by all the vertexes. Generally speaking, the input and output of a Pregel program are isomorphic directed graphs.
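The superstep model described for Pregel can be illustrated with a small example that propagates the maximum vertex value through a graph. This is a single-process sketch under simplifying assumptions: the function name `pregel_max` is hypothetical, vertexes are processed in a loop rather than in parallel, and swapping the message boxes plays the role of the global synchronization point.

```python
def pregel_max(values, edges, max_supersteps=50):
    """Propagate the maximum value along directed edges.

    values: {vertex: number}; edges: {vertex: [target vertex, ...]}.
    Returns (final values, number of supersteps executed).
    """
    values = dict(values)
    # Superstep 0: every vertex is active and sends its value along its out-edges.
    inbox = {v: [] for v in values}
    for v in values:
        for target in edges.get(v, []):
            inbox[target].append(values[v])
    supersteps = 1
    # Later supersteps: only vertexes that received messages are (re)activated.
    while any(inbox.values()) and supersteps < max_supersteps:
        outbox = {v: [] for v in values}
        for v, messages in inbox.items():
            if not messages:
                continue                 # vertex stays inactive
            best = max(messages)
            if best > values[v]:
                values[v] = best         # vertex modifies its own value
                for target in edges.get(v, []):
                    outbox[target].append(best)
            # a vertex that learned nothing new sends nothing (votes to halt)
        inbox = outbox                   # global synchronization point
        supersteps += 1
    return values, supersteps


graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
final, steps = pregel_max({"a": 3, "b": 6, "c": 2, "d": 1}, graph)
print(final, steps)
```

Execution ends when no vertex has a message to process, matching the termination condition stated above.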

Inspired by the above programming models, other research has also focused on programming models for more complex computational tasks, e.g., iterative computations [105, 106], fault-tolerant memory computations [107], incremental computations [108], and flow control decision-making related to data [109].

5 Big data analysis

The analysis of big data mainly involves analytical methods for traditional data and big data, analytical architecture for big data, and software used for mining and analysis of big data. Data analysis is the final and most important phase in the value chain of big data, with the purpose of extracting useful values and providing suggestions or decisions. Different levels of potential value can be generated through the analysis of datasets in different fields [10]. However, data analysis is a broad area, which changes frequently and is extremely complex. In this section, we introduce the methods, architectures, and tools for big data analysis.

5.1 Traditional data analysis

Traditional data analysis means using proper statistical methods to analyze massive data, to concentrate, extract, and refine useful data hidden in a batch of chaotic datasets, and to identify the inherent law of the subject matter, so as to maximize the value of data. Data analysis plays a huge guiding role in making development plans for a country, understanding customer demands for commerce, and predicting market trends for enterprises. Big data analysis can be deemed the analysis technique for a special kind of data. Therefore, many traditional data analysis methods may still be utilized for big data analysis. Several representative traditional data analysis methods are examined in the following, many of which are from statistics and computer science.

– Cluster Analysis: a statistical method for grouping objects, specifically classifying objects according to some features. Cluster analysis is used to differentiate objects with particular features and divide them into categories (clusters) according to these features, such that objects in the same category have high homogeneity while different categories have high heterogeneity. Cluster analysis is an unsupervised learning method that requires no training data.

– Factor Analysis: basically targeted at describing the relations among many elements with only a few factors, i.e., grouping several closely related variables into a factor; the few factors are then used to reveal most of the information in the original data.

– Correlation Analysis: an analytical method for determining the law of relations, such as correlation, correlative dependence, and mutual restriction, among observed phenomena, and accordingly conducting forecasting and control. Such relations may be classified into two types: (i) function, reflecting the strict dependence relationship among phenomena, which is also called a definitive dependence relationship; (ii) correlation, covering undetermined or inexact dependence relations, in which a numerical value of one variable may correspond to several numerical values of the other variable, and such numerical values fluctuate regularly around their mean values.

– Regression Analysis: a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies dependence relationships among variables hidden by randomness. Regression analysis can turn complex and undetermined correlations among variables into simple and regular ones.

– A/B Testing: also called bucket testing. It is a technique for determining how to improve target variables by comparing tested groups. Big data will require a large number of tests to be executed and analyzed.

– Statistical Analysis: Statistical analysis is based on statistical theory, a branch of applied mathematics. In statistical theory, randomness and uncertainty are modeled with probability theory. Statistical analysis can provide description and inference for big data. Descriptive statistical analysis can summarize and describe datasets, while inferential statistical analysis can draw conclusions from data subject to random variations. Statistical analysis is widely applied in the economic and medical care fields [110].

– Data Mining Algorithms: Data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. In 2006, the IEEE International Conference on Data Mining (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [111], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, and CART. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and link mining, all of which are among the most important problems in data mining research.
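To make the correlation and regression items in the list above concrete, the following sketch computes the Pearson correlation coefficient and fits a least-squares line for a single explanatory variable. The function names `pearson_r` and `least_squares` are chosen for this illustration, and the single-variable case is a simplification of the multi-variable setting the text describes.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def least_squares(xs, ys):
    """Return (slope, intercept) of the line minimizing squared error."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]   # noisy observations around an underlying linear trend
print(pearson_r(xs, ys), least_squares(xs, ys))
```

A coefficient near 1 or -1 indicates a nearly functional (definitive) dependence, while values closer to 0 correspond to the inexact relations for which regression extracts the regular part.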

5.2 Big data analytic methods

At the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data so as to bring value to enterprises and individuals. At present, the main processing methods of big data are as follows.

– Bloom Filter: a Bloom filter consists of a series of hash functions and a bit array. Its principle is to store hash values of data, rather than the data itself, in the bit array, which is in essence a bitmap index that uses hash functions to conduct lossy compression storage of data. It has such advantages as high space efficiency and high query speed, but also has disadvantages in misrecognition (false positives) and deletion.

– Hashing: a method that essentially transforms data into shorter fixed-length numerical values or index values. Hashing has such advantages as rapid reading and writing and high query speed, but it is hard to find a sound hash function.

– Index: an index is always an effective method to reduce the expense of disk reading and writing and to improve insertion, deletion, modification, and query speeds, both in traditional relational databases that manage structured data and in other technologies that manage semi-structured and unstructured data. However, an index has the disadvantage of the additional cost of storing index files, which must be maintained dynamically when data is updated.

– Trie: also called trie tree, a variant of the hash tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of a trie is to utilize common prefixes of character strings to reduce comparisons on character strings to the greatest extent, so as to improve query efficiency.

– Parallel Computing: compared to traditional serial computing, parallel computing refers to simultaneously utilizing several computing resources to complete a computation task. Its basic idea is to decompose a problem into parts and assign them to several separate processes to be completed independently, so as to achieve co-processing. Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see a comparison in Table 1).
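Two of the structures listed above, the Bloom filter and the trie, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the bit array size, the number of hash functions, and the use of salted SHA-256 digests as the hash family are arbitrary choices made for the example, and the class and method names are hypothetical.

```python
import hashlib

class BloomFilter:
    """Bit array plus k hash functions; stores hash positions, not the items."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive k positions by salting SHA-256 with the hash index
        # (an illustrative choice of hash family).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))


class Trie:
    """Prefix tree that counts how often each word was inserted."""

    def __init__(self):
        self.children = {}
        self.count = 0  # number of inserted words ending at this node

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.count += 1

    def frequency(self, word):
        node = self
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return 0
        return node.count
```

The sketch also shows the two disadvantages mentioned for the Bloom filter: `might_contain` can return a false positive when unrelated items set the same bits, and an item cannot be deleted without possibly clearing bits shared with other items. In the trie, words such as "data" and "date" share the nodes of their common prefix "dat", which is what reduces string comparisons.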

Although parallel computing systems or tools such as MapReduce and Dryad are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages are being developed based on these systems. Such high-level languages include Sawzall, Pig, and Hive, used for MapReduce, as well as Scope and DryadLINQ, used for Dryad.
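The MapReduce programming model mentioned above can be illustrated with the classic word count example. The following is only a single-process sketch of the map, shuffle, and reduce phases in plain Python; it does not use the Hadoop API or any cluster, and the function names are hypothetical.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)

def word_count(documents):
    # On a real cluster, each map and reduce call could run on a different node;
    # here the phases simply run one after another in one process.
    emitted = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(emitted)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

counts = word_count(["big data analysis", "Big data mining"])
```

Because each map call depends only on its own input split and each reduce call only on the values of one key, the framework can partition the task automatically, which corresponds to the "Task partitioning" row of Table 1.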

5.3 Architecture for big data analysis

Because of the 4Vs of big data, different analytical architectures should be considered for different application requirements.

Mobile Netw Appl (2014) 19:171–209

Table 1 Comparison of MPI, MapReduce, and Dryad

| | MPI | MapReduce | Dryad |
|---|---|---|---|
| Deployment | Computing node and data storage arranged separately (data should be moved to the computing node) | Computing and data storage arranged at the same node (computing should be close to data) | Computing and data storage arranged at the same node (computing should be close to data) |
| Resource management/scheduling | – | Workqueue (Google), HOD (Yahoo) | Not clear |
| Low-level programming | MPI API | MapReduce API | Dryad API |
| High-level programming | – | Pig, Hive, Jaql, ... | Scope, DryadLINQ |
| Data storage | The local file system, NFS, ... | GFS (Google), HDFS (Hadoop), KFS, Amazon S3, ... | NTFS, Cosmos DFS |
| Task partitioning | User manually partitions the tasks | Automation | Automation |
| Communication | Messaging, remote memory access | Files (local FS, DFS) | Files, TCP pipes, shared-memory FIFOs |
| Fault-tolerant | Checkpoint | Task re-execute | Task re-execute |

5.3.1 Real-time vs. offline analysis

According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

– Real-time analysis is mainly used in e-commerce and finance. Since data constantly changes, rapid data analysis is needed and analytical results must be returned with a very short delay. The main existing architectures of real-time analysis include (i) parallel processing clusters using traditional relational databases, and (ii) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

– Offline analysis is usually used for applications without high requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally conducts analysis by importing logs into a special platform through data acquisition tools. Under the big data setting, many Internet enterprises utilize an offline analysis architecture based on Hadoop in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool Timetunnel, and Chukwa of Hadoop. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.

5.3.2 Analysis at different levels

Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive-level analysis, which are examined in the following.

– Memory-level analysis is for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays, the memory of a server cluster surpasses hundreds of GB, and even the TB level is common. Therefore, an in-memory database technology may be used, with hot data residing in memory so as to improve analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis. MongoDB is a representative memory-level analytical architecture. With the development of SSDs (Solid-State Drives), the capacity and performance of memory-level data analysis have been further improved and widely applied.

– BI analysis is for the case when the data scale surpasses the memory level but may be imported into the BI analysis environment. The current mainstream BI products provide data analysis plans that support data volumes above the TB level.

– Massive analysis is for the case when the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analysis utilizes HDFS of Hadoop to store data and uses MapReduce for data analysis. Most massive analysis belongs to the offline analysis category.


5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithms differ greatly from each other according to different kinds of data and application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the five most widely used software packages, according to a survey on "What Analytics, Data mining, Big Data software that you used in the past 12 months for a real project?" of 798 professionals made by KDnuggets in 2012 [112].

– R (30.7 %): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. When computing-intensive tasks are executed, code written in C, C++, and Fortran may be called in the R environment. In addition, skilled users can directly call R objects in C. R is in fact an implementation of the S language, an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and drawing plots. Compared to S, R is more popular since it is open source. R ranked first in the KDnuggets 2012 survey. Furthermore, in a 2012 survey on "Design languages you have used for data mining/analysis in the past year", R was also in first place, ahead of SQL and Java. Due to the popularity of R, database manufacturers such as Teradata and Oracle have released products supporting R.

– Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins with powerful functions for data analysis, such as Analysis ToolPak and Solver Add-in, are integrated initially, but such plug-ins can be used only if users enable them. Excel is also the only commercial software among the top five.

– Rapid-I RapidMiner (26.7 %): RapidMiner is open source software used for data mining, machine learning, and predictive analysis. In a KDnuggets survey in 2011, it was used more frequently than R and ranked first. Data mining and machine learning programs provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment. The data mining flow is described in XML and displayed through a graphical user interface (GUI). RapidMiner is written in Java. It integrates the learners and evaluation methods of Weka, and works with R. Functions of RapidMiner are implemented by connecting processes that include various operators. The entire flow can be deemed a production line of a factory, with original data as input and model results as output. The operators can be considered specific functions with different input and output characteristics.

– KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, and open source platform for data integration, data processing, data analysis, and data mining [113]. It allows users to create data flows or data channels in a visualized manner, to selectively run some or all analytical procedures, and it provides analytical results, models, and interactive views. KNIME is written in Java and, being based on Eclipse, provides additional functions as plug-ins. Through plug-in files, users can insert processing modules for files, pictures, and time series, and integrate them into various open source projects, e.g., R and Weka. KNIME controls data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted in a visualized environment. KNIME is designed as a module-based and expandable framework. There is no dependence between its processing units and data containers, making it suitable for distributed environments and independent development. In addition, KNIME is easy to expand: developers can readily extend its nodes and views.

– Weka/Pentaho (14.8 %): Weka, abbreviated from Waikato Environment for Knowledge Analysis, is free and open source machine learning and data mining software written in Java. Weka provides such functions as data processing, feature selection, classification, regression, clustering, association rules, and visualization. Pentaho is one of the most popular open source BI software packages. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, data mining, and other aspects of BI. Weka's data processing algorithms are also integrated in Pentaho and can be called directly.

6 Big data applications

In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful values via judgments, suggestions, supports, or decisions. However, data analysis involves a wide range of applications, which frequently change and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields, including structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

6.1 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically includes large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have emerged over the past decades. For example, as early as the 1990s, BI became a prevailing technology for business applications, and network search engines based on massive data mining emerged in the early 21st century. Some potential and influential applications from different fields, and their data and analysis characteristics, are discussed as follows.

– Evolution of Commercial Applications: the earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. Analytical techniques used in such systems were prevalent in the 1990s and were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, score cards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have been providing a unique opportunity for organizations to have an online presence and directly interact with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted by text analysis and website mining techniques. As reported in [115], the quantity of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the Internet of Things based on sensors are opening a new generation of innovative applications, and require considerably larger capacity for supporting location-sensing, people-oriented, and context-aware operation.

– Evolution of Network Applications: the early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to the mining of email contents and the building of search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages, full of various kinds of data, such as text, images, audio, videos, and interactive contents. Therefore, plenty of advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share contents.

– Evolution of Scientific Applications: scientific research in many fields, such as astrophysics, oceanology, genomics, and environmental research, is acquiring massive data with high-throughput sensors and instruments. The US National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, inter-operative analysis software, and data services to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets are highly varied in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and it is not easy to have a comprehensive coverage, we will focus on the key problems and technologies in data analysis in the following discussions.

6.2 Big data analysis fields


6.2.1 Structured data analysis

Business applications and scientific research may generate massive structured data, of which the management and analysis rely on mature commercialized technologies, such as RDBMS, data warehouse, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning based on exact mathematical models and powerful algorithms has been applied to anomaly detection [117] and energy control [118]. Exploiting data characteristics, temporal and spatial mining can extract knowledge structures hidden in high-speed data flows and sensor data [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially in process analysis with event data [121].

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to have more business potential than structured data analysis. Generally, text analysis is a process to extract useful information and knowledge from unstructured text. Text mining is inter-disciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and data mining in particular. Most text mining systems are based on text representations and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammars [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.
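As a small illustration of how unstructured text can be turned into a numerical representation for mining, the following Python sketch computes TF-IDF weights for a toy corpus. TF-IDF is a standard information retrieval weighting that is not discussed in the text above; it is used here only as an example, and the whitespace tokenizer and sample documents are simplifying assumptions.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace; real systems use richer NLP tokenizers."""
    return text.lower().split()

def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus.
docs = ["big data analysis", "text data mining", "opinion mining"]
scores = tf_idf(docs)
```

A term that appears in few documents (such as "big" in the toy corpus) receives a higher weight than a term shared by several documents (such as "data"), and a term that occurs in every document receives a weight of zero.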

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, image, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which will be discussed in Section 6.2.4. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves the mining of semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., email, newsgroup management, and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to model and integrate data in the Web so as to conduct more complex queries than keyword-based searches.
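A first step in hypertext mining is to pull the hyperlinks and their anchor text out of semi-structured HTML. The following minimal sketch does this with `html.parser` from the Python standard library; the class name `LinkExtractor` and the sample page are hypothetical, and nested or malformed markup is not handled.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects (href, anchor text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> element currently open, if any
        self._text = []     # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

# Hypothetical page fragment.
page = '<p>See <a href="/hadoop">Hadoop docs</a> and <a href="http://example.org">an example</a>.</p>'
parser = LinkExtractor()
parser.feed(page)
```

The extracted pairs can serve both as features for classifying pages and as the edges of the link graph used in Web structure mining.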

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagrams of links within a website or among multiple websites. Models are built based on topological structures provided by hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages. The topic-oriented crawler is another successful case of utilizing the models [127].
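The idea of ranking pages from link structure alone can be sketched with a simplified PageRank computed by power iteration. This is an illustrative approximation rather than the exact formulation of [125]: the damping factor of 0.85, the fixed number of iterations, the treatment of pages without outgoing links, and the four-page graph are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small base share (random jump).
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on equally to the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: C is linked from A, B, and D; nothing links to D.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
```

In this toy graph, page C collects links from three pages and ends up with the highest score, while page D, which no page links to, keeps only the base share.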

Web usage mining aims to mine auxiliary data generated by Web sessions or activities, whereas Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or transactions, cache, user queries, bookmark data, mouse clicks and scrolls, and any other kinds of data generated through interaction with the Web. As Web services and Web 2.0 are becoming mature and popular, Web usage data will have increasingly high variety. Web usage mining plays key roles in personalized space, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.


6.2.4 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) have been growing at an amazing speed, and analysis is used to extract useful knowledge from such data and to understand its semantics. Because multimedia data is heterogeneous and most of such data contains richer information than simple structured data or text data, extracting information is confronted with the huge challenge of the semantic gap. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection.

Audio summarization can be accomplished by extracting the prominent words or phrases from metadata or by synthesizing a new representation. Video summarization is to interpret the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied to many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video, and apply smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.

Multimedia annotation inserts labels to describe the contents of images and videos at both the syntactic and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time and labor intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic gap. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to jointly explore both manual and automatic multimedia annotation [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users to conveniently and quickly look up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, and includes shot boundary detection, key frame extraction, and scene segmentation. Based on the result of structural analysis, the second procedure is feature extraction, which mainly includes further mining the features of key frames, objects, texts, and movements, which are the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find the patterns of video contents and put videos into predefined categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos. The retrieval result is then optimized through relevance feedback.

Multimedia recommendation is to recommend specific multimedia contents according to users' preferences. It has proven to be an effective approach to providing personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. The content-based methods identify general features of users or their interests, and recommend to users other contents with similar features. These methods largely rely on content similarity measurement, but most of them are troubled by limited content analysis and over-specialization. The collaborative-filtering-based methods identify groups with similar interests and recommend contents to group members according to their behavior [132]. Presently, a mixed method has been introduced, which integrates the advantages of the aforementioned two types of methods to improve recommendation quality [133].
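The collaborative-filtering idea described above can be sketched as follows. This is a minimal user-based variant that uses cosine similarity between rating vectors, which is one common choice and not necessarily the method of [132]; the rating data, user names, and function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse rating dicts {item: rating}."""
    common = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def recommend(ratings, user, top_n=1):
    """Score items the user has not seen by similarity-weighted ratings of others."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical ratings of video clips on a 1-5 scale.
ratings = {
    "alice": {"video1": 5, "video2": 4},
    "bob": {"video1": 5, "video2": 5, "video3": 4},
    "carol": {"video4": 5},
}
suggestion = recommend(ratings, "alice")
```

Here bob's ratings resemble alice's, so his additional clip is suggested to her, while carol shares no rated items with alice and therefore contributes nothing. A content-based method would instead compare features of the clips themselves.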

The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains some text descriptions related to concepts and video examples [134]. In [135], the author proposed a new algorithm for special multimedia event detection using a few positive training examples. The research on video event detection is still in its infancy, and mainly focuses on sports or news events, running or abnormal events in monitoring videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graph structures, describing the communications between two entities. The content data contains text, image, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. From the data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

The research on link-based structural analysis has focused on link prediction, community discovery, social network evolution, and social influence analysis. SNS may be visualized as graphs, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertexes and edges are continually added to the graphs. Link prediction is to predict the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and utilizes the existing link information to generate binary classifiers to predict future links [139]. Probabilistic methods aim to build models for connection probabilities among vertexes in SNS [140]. Linear algebra computes the similarity between two vertexes according to the singular similarity matrix [141]. A community is represented by a sub-graph matrix, in which edges connecting vertexes within the sub-graph feature high density, while the edges between two sub-graphs feature much lower density [142].
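As a simple illustration of link prediction, the following sketch scores every pair of vertexes that is not yet connected by the Jaccard overlap of their neighbourhoods. This neighbourhood heuristic is a swapped-in baseline chosen for brevity, not the method of [139], [140], or [141]; such a score could be used as one feature in feature-based classification. The graph and the names are hypothetical.

```python
from itertools import combinations

def jaccard_scores(edges):
    """Score every non-adjacent vertex pair by neighbourhood overlap."""
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    scores = {}
    for u, v in combinations(sorted(neighbours), 2):
        if v in neighbours[u]:
            continue  # already linked, nothing to predict
        union = neighbours[u] | neighbours[v]
        scores[(u, v)] = len(neighbours[u] & neighbours[v]) / len(union)
    return scores

# Hypothetical friendship graph given as undirected edges.
edges = [("ann", "bob"), ("ann", "cat"), ("dan", "bob"), ("dan", "cat"), ("dan", "eve")]
scores = jaccard_scores(edges)
best_pair = max(scores, key=scores.get)  # most likely future link under this heuristic
```

In this toy graph, bob and cat have exactly the same friends (ann and dan), so the pair receives the highest score and is the predicted future link.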

Many methods for community detection have been proposed and studied, most of which are topology-based and rely on an objective function that captures the concept of community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. The research on SNS evolution aims to look for laws and deduction models to interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generation methods have been proposed to assist network and system design [147].

Social influence refers to the case when individuals change their behavior under the influence of others. The strength of social influence depends on the relation among individuals, network distances, time effects, and characteristics of networks and individuals. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise. For example, the blogosphere contains a large number of spam blogs, and Twitter likewise contains many trivial Tweets. Third, SNS are dynamic networks, which vary and are updated frequently and quickly. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge among different media [150].

6.2.6 Mobile data analysis

By April 2013, Android Apps had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since the research on mobile analysis has just started, we will only introduce some recent and representative analysis applications in this section.

With the growth in the number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities with geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest Webchat). Traditional network communities or SNS communities lack online interaction among members, and the communities are active only when members are sitting in front of computers. On the contrary, mobile phones can support rich interaction at any time and anywhere. A mobile community is defined as a group of individuals with the same hobbies (e.g., health, safety, and entertainment) who gather together on networks, meet to set a common goal, decide measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build body area networks for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations, etc. In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop an application for smart phones, which analyzes paces when people walk and uses the pace information for unlocking the safety system [11]. Meanwhile, Robert Delano and Brian Parise from Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with a built-in seismograph in a mobile phone, so as to cope with Parkinson's disease and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, in marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business modes. In sales planning, after comparison of massive data, enterprises can optimize their commodity prices. In operations, enterprises can improve their operational efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor cost. In the supply chain, using big data, enterprises may conduct inventory optimization, logistics optimization, and supplier coordination, etc., to mitigate the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that such activities as "Multi-times score accumulation" and "score exchange in shops" are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers' transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.
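As an illustration of the selection step in such a drop-out warning model, the following minimal Python sketch ranks customers by a drop-out score and returns the top 20 %. The scores, customer identifiers, and the function name `top_fraction_at_risk` are hypothetical and chosen only for illustration; the survey does not describe how CMB actually computes or uses its scores.

```python
from math import ceil


def top_fraction_at_risk(scores, fraction=0.2):
    """Return the customer ids in the top `fraction` by drop-out score, highest first.

    `scores` maps a customer id to a drop-out score produced by some
    warning model (the model itself is outside this sketch).
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    k = ceil(len(scores) * fraction)
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [customer_id for customer_id, _ in ranked[:k]]


# Hypothetical scores for ten customers (not real bank data).
scores = {
    "c01": 0.91, "c02": 0.12, "c03": 0.55, "c04": 0.78, "c05": 0.05,
    "c06": 0.33, "c07": 0.64, "c08": 0.20, "c09": 0.47, "c10": 0.09,
}
targets = top_fraction_at_risk(scores, fraction=0.2)
print(targets)
```

With ten customers and a fraction of 0.2, two customers are selected as targets for a retention offer.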

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction time, commodity prices, and purchase quantities are recorded, and more importantly, along with the age, gender, address, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, market conditions of their brands, and consumers' behaviors, etc., and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan of Alibaba automatically analyzes and judges whether to lend loans to enterprises through the acquired enterprise transaction data by virtue of big data technology, while manual intervention does not occur in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan with only about 0.3 % bad loans, which is greatly lower than those of other commercial banks.

6.3.2 Application of IoT based big data

IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises may have had profound experience with the application of IoT big data. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project cooperation between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and Miami city, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills due to timely identifying and fixing water pipes that were running and leaking this year.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and connections among individuals based on an information network. Big data of online SNS mainly comes from instant messages, online social networking, micro blogs, and shared space, etc., which represents various user activities. The analysis of big data from online SNS uses computational analytical methods for understanding relations in human society by virtue of theories and methods that involve mathematics, informatics, sociology, and management science, etc., from three dimensions including network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education, etc. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire values.

– Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preference, emotion, interest, and demand, etc. may be revealed.

– Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies, etc. aggregate relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.
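The notion of a community as a group with close internal relations but loose external relations can be made quantitative. The minimal Python sketch below scores a given partition of users with Newman's modularity, which is one common measure and is not named in this survey; the example graph and the function name `modularity` are assumptions made only for illustration.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph.

    Q = sum over communities c of (L_c / m) - (d_c / (2 m)) ** 2, where m is
    the number of edges, L_c the number of edges inside c, and d_c the sum of
    the degrees of the nodes in c.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    intra_edges = defaultdict(int)
    degree_sum = defaultdict(int)
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            intra_edges[cu] += 1
    return sum(
        intra_edges[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )


# Hypothetical network: two triangles of users joined by a single edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in two_groups}

print(modularity(edges, two_groups))  # higher: dense inside, sparse between
print(modularity(edges, one_group))   # zero: no community structure expressed
```

A partition that groups each triangle separately scores higher than placing all users in one group, which matches the intuitive definition of a community given above.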

The U.S. Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

Fig. 5 Enabling technologies for online social network-oriented big data

In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, it was found that most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and then get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the U.S. In addition, Global Pulse conducted research that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans. The goal is to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting the sharp growth or drop of the amount of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation from the official statistics of Indonesia matches the number of Tweets about rice price on Twitter, as shown in Fig. 6.
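Two of the analyses listed above, detecting sharp growth or drop in the amount of a topic and checking how well a Tweet series matches an external indicator, can be sketched in a few lines of Python. The threshold rule, the function names, and all numbers below are assumptions made for illustration; the survey does not state which methods or data Global Pulse actually used.

```python
from math import sqrt


def sharp_changes(counts, ratio=2.0):
    """Indices where a count is at least `ratio` times the previous count,
    or at most 1/ratio of it (a deliberately simple rule)."""
    flagged = []
    for i in range(1, len(counts)):
        previous, current = counts[i - 1], counts[i]
        if previous > 0 and (current >= ratio * previous or current <= previous / ratio):
            flagged.append(i)
    return flagged


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / sqrt(var_x * var_y)


# Hypothetical monthly Tweet counts on a topic and a hypothetical price index.
tweet_counts = [100, 110, 260, 250, 100]
price_index = [1.0, 1.1, 2.4, 2.3, 1.2]

print(sharp_changes(tweet_counts))
print(pearson(tweet_counts, price_index))
```

A correlation close to 1 would indicate that the two series move together, although correlation alone says nothing about which series drives the other.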

Generally speaking, the application of big data from online SNS may help to better understand users' behavior and master the laws of social and economic activities from the following three aspects:

– Early Warning: to rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotion, and preference of users.

– Real-time Feedback: to acquire groups' feedback on some social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the health care business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final result into an extremely personalized treatment plan to assess the dangerous factors and main treatment plans of patients. Then, doctors may reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or suggesting patients to reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20.

The Mount Sinai Medical Center in the U.S. utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

Fig. 6 The correlation between Tweets about rice price and food price inflation

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through the software development kit (SDK) and open interface.

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units to conduct coordination with mobile networks for distribution of sensing tasks and collection and utilization of sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of Crowdsourcing has been successfully applied to geotagged photographs, positioning and navigation, urban road traffic sensing, market forecast, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach for problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, Crowdsourcing had been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of Crowdsourcing. The main idea of Crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, Crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, Spatial Crowdsourcing becomes a hot topic. The operation framework of Spatial Crowdsourcing is as follows. A user may request the service and resources related to a specified location. Then the mobile users who are willing to participate in the task will move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data will be sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions provided by mobile devices, it can be forecasted that Spatial Crowdsourcing will be more prevailing than traditional Crowdsourcing, e.g., Amazon Turk and Crowdflower.
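The request and assignment steps of this framework can be sketched as follows. The sketch assumes, purely for illustration, that a task is given to the nearest willing mobile user within a maximum travel distance; the survey does not prescribe an assignment rule, and the names `haversine_km` and `pick_worker`, the coordinates, and the distance limit are all hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def pick_worker(task_location, willing_workers, max_km):
    """Return the name of the nearest willing worker within `max_km`, or None."""
    best = None
    for name, location in willing_workers.items():
        distance = haversine_km(task_location, location)
        if distance <= max_km and (best is None or distance < best[1]):
            best = (name, distance)
    return best[0] if best else None


# Hypothetical request location and current positions of willing mobile users.
task = (0.0, 0.0)
workers = {"w1": (0.0, 0.01), "w2": (0.0, 0.05), "w3": (1.0, 1.0)}

print(pick_worker(task, workers, max_km=5.0))
print(pick_worker(task, workers, max_km=0.5))
```

Returning `None` when nobody is close enough leaves it to the platform to widen the search radius or to wait for more participants.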

6.3.6 Smart grid

Smart Grid is the next generation power grid, constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart Grid related big data are generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, which are measured by phasor measurement units (PMU) deployed nation-wide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as Circuit Breaker Monitors and transformers). Smart Grid brings about the following challenges on exploiting big data:

– Grid planning: By analyzing data in the Smart Grid, regions that have excessively high electrical load or high power outage frequencies can be identified. Even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an "electric map" according to the big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of the block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides effective and visual load forecasts for power grid planning in a city. Preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads may be conducted, as displayed in the map.

– Interaction between power generation and power consumption: An ideal power grid shall balance power generation and consumption. However, the traditional power grid is constructed based on the one-directional approach of transmission-transformation-distribution-consumption, which does not allow adjusting the generation capacity according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters are developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which can help suppliers read power utilization data every 15 min rather than every month as in the past. Labor cost for meter reading is greatly reduced. Because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and low periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help the realization of time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users.

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable new energy sources can be efficiently managed, and the electricity generated by the new energy resources can be allocated to regions with electricity shortage. Such energy resources can complement the traditional hydropower and thermal power generations.
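The time-sharing dynamic pricing mentioned in the list above becomes possible once meter data arrive at 15-minute granularity. The following minimal Python sketch prices each reading according to whether it falls in a peak period. The peak window, the prices, the readings, and the function name `time_of_use_cost` are hypothetical and do not reflect the actual tariffs of TXU Energy or any other supplier.

```python
def time_of_use_cost(readings, peak_hours, peak_price, off_peak_price):
    """Total cost of 15-minute meter readings under a two-level tariff.

    `readings` is a list of (minute_of_day, kwh) pairs, one per 15-minute
    interval; `peak_hours` is a collection of hours (0-23) billed at
    `peak_price`, and all other hours are billed at `off_peak_price`.
    """
    cost = 0.0
    for minute_of_day, kwh in readings:
        hour = (minute_of_day // 60) % 24
        price = peak_price if hour in peak_hours else off_peak_price
        cost += kwh * price
    return cost


# Hypothetical readings: one at 00:00, two in the evening peak, one at 23:00.
readings = [(0, 0.2), (18 * 60, 0.5), (18 * 60 + 15, 0.5), (23 * 60, 0.3)]
peak_hours = range(17, 21)  # assumed peak window, 17:00 to 20:59

dynamic_cost = time_of_use_cost(readings, peak_hours, peak_price=0.30, off_peak_price=0.10)
flat_cost = sum(kwh for _, kwh in readings) * 0.20  # flat tariff for comparison

print(dynamic_cost, flat_cost)
```

With monthly readings only the flat total would be computable; the per-interval timestamps are what allow the peak and off-peak portions to be priced differently.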

7 Conclusion, open issues, and outlook

In this paper, we review the background and state-of-the-art of big data. Firstly, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture to readers of this exciting area.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still at an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, which are discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

– Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still not a unified evaluation standard and benchmark to balance the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated when the system is implemented and deployed, so the advantages and disadvantages of various alternative solutions cannot be horizontally compared, even before and after the implementation of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes: This includes memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.
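To make the MR (MapReduce) mode mentioned in the list above concrete, the following single-process Python sketch walks through the map, shuffle, and reduce steps on the classic word-count task. It is only a didactic sketch: it uses no Hadoop API, the function names are our own, and a real deployment distributes each step across machines, which is exactly where the data transfer bottleneck arises.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence within one process."""
    pairs = (pair for document in documents for pair in map_phase(document))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["big data", "Big data analysis"]))
```

In a cluster, the shuffle step moves intermediate pairs between machines, so its cost grows with data volume rather than with the complexity of the computation.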

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, etc., should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor that restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.


– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor in improving big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems in big data processing arise from traditional data analysis, including (i) data re-utilization: with the increase of data scale, more value may be mined from re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized, from which more value can be mined; and (iii) data exhaust, which means wrong data produced during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems should be solved.

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from comprehensive utilization of multiple datasets is far higher than the sum of the values of individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance is the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and from different datasets.

– Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction of big data, etc. are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume is growing fast, there are more severe safety risks, while the traditional data protection methods have already been shown not to be applicable to big data. In particular, big data safety is confronted with the following security related challenges.

– Big data privacy: Big data privacy includes two aspects. (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc. of users may be more easily acquired, and users may not be aware. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed a big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher of Skull Security, used an information acquisition tool to acquire data from the public pages of Facebook users who had failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low quality data wastes transmission and storage resources with poor usability. There are a lot of factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though a lot of measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.

– Big data safety mechanism: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data cannot meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safety communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data application in information security: Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) after analysis of big data in the form of log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc. may also be more easily identified through the analysis of big data.
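Two of the data quality dimensions named in the list above, completeness and redundancy, lend themselves to simple automatic checks. The Python sketch below computes them for a small set of records. The metric definitions, the function names, and the sample records are assumptions made for illustration and are not taken from the survey, which leaves the evaluation of data quality as an open problem.

```python
def completeness(records, fields):
    """Share of (record, field) cells that are present and non-empty."""
    total = len(records) * len(fields)
    if total == 0:
        return 1.0
    filled = sum(
        1 for record in records for field in fields
        if record.get(field) not in (None, "")
    )
    return filled / total


def redundancy(records, fields):
    """Share of records that exactly repeat an earlier record on `fields`."""
    if not records:
        return 0.0
    seen = set()
    duplicates = 0
    for record in records:
        key = tuple(record.get(field) for field in fields)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates / len(records)


# Hypothetical sensor records: one missing value, one missing field, one duplicate.
records = [
    {"id": 1, "temp": 20.5},
    {"id": 2, "temp": None},
    {"id": 1, "temp": 20.5},
    {"id": 3},
]
fields = ("id", "temp")

print(completeness(records, fields))
print(redundancy(records, fields))
```

Accuracy and consistency are harder to score automatically because they need a reference value or cross-record rules, which is part of why the problem remains open.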

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage of energy efficiency optimization, and processed hardware and software system architectures, etc. Particularly, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the "T" (Technology) was the main concern, while technology drives the development of data. In the big data era, with the prominence of data value and advances in "I" (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact but also influence everyone's ways of living and thinking, which is just happening. We cannot predict the future, but may take precautions for possible events that occur in the future.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced, given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have been concerned with better ways to cope with larger-scale, higher-diversity, and more complexly structured data. These efforts are represented by the globally-distributed database Spanner of Google and the fault-tolerant, expandable, distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc. shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., as in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a friendly way can they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc. are frequently used to visualize results of data analysis. New presentation forms will occur in the future, e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well known that programs consist of data structures and algorithms, and that data structures are used to store data. In the history of program design, the role of data has become increasingly significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed that of business logic. Consequently, programs are gradually being transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods will emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

    – During data analysis, we will try to utilize all data rather than analyzing only a small set of sample data.

    – Rather than insisting on accurate data, we will accept numerous and complicated data.

    – We shall pay greater attention to correlations between things rather than exploring causal relationships.

    – Simple algorithms on big data are more effective than complex algorithms on small data.

    – Analytical results of big data will reduce hasty and subjective factors in decision making, and data scientists will replace “experts”.
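Two of the points above (using all available data rather than a small sample, and looking at correlations) can be illustrated with a minimal Python sketch. It is not taken from [11]: the synthetic dataset, the helper names `make_dataset` and `pearson`, the slope of 0.3, and the sample size of 50 are all assumptions chosen purely for illustration.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def make_dataset(n, seed=0):
    """Hypothetical data: y depends weakly on x, plus independent noise."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [0.3 * x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


# "All data": every record contributes to the estimate.
xs, ys = make_dataset(100_000)
full_r = pearson(xs, ys)

# "Small sample": 50 records drawn at random from the same dataset.
idx = random.Random(1).sample(range(len(xs)), 50)
sample_x = [xs[i] for i in idx]
sample_y = [ys[i] for i in idx]
sample_r = pearson(sample_x, sample_y)

print(f"correlation on all records: {full_r:.3f}")
print(f"correlation on 50 records:  {sample_r:.3f}")
```

In this sketch, the estimate from 50 records can drift noticeably from the value computed on all records, which is typically much closer to the dependence built into the generator. Neither number says anything about causation; the coefficient only measures how strongly the two quantities vary together.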

Throughout the history of human society, the demands and willingness of human beings have always been the driving forces of scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

References

1. Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

2. Fact sheet: Big data across the federal government (2012). http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf

3. Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper

4. Drowning in numbers - digital data will flood the planet - and help us understand it better (2011). http://www.economist.com/blogs/dailychart/2011/11/bigdata-0

5. Lohr S (2012) The age of big data. New York Times, pp 11

6. Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/thedigitalbreadcrumbs-that-lead-to-big-data

7. Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-searchforanalysts-to-make-sense-of-big-data

8. Big data (2008). http://www.nature.com/news/specials/bigdata/index.html

9. Special online collection: dealing with big data (2011). http://www.sciencemag.org/site/special/data

10. Manyika J, McKinsey Global Institute, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute

11. Mayer-Schonberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Eamon Dolan/Houghton Mifflin Harcourt

12. Laney D (2001) 3-d data management: controlling data volume, velocity and variety. META Group Research Note, 6 February

13. Zikopoulos P, Eaton C, et al (2011) Understanding big data: analytics for enterprise class hadoop and streaming data. McGraw-Hill Osborne Media

14. Meijer E (2011) The world according to linq. Communications of the ACM 54(10):45–51

15. Beyer M (2011) Gartner says solving big data challenge involves more than just managing volumes of data. Gartner. http://www.gartner.com/it/page.jsp

16. O R Team (2011) Big data now: current perspectives from O'Reilly Radar. O'Reilly Media

17. Grobelnik M (2012) Big data tutorial. http://videolectures.net/eswc2012grobelnikbigdata

18. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2008) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014

19. DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6):85–98

Mobile Netw Appl (2014) 19:171–209

20. Walter T (2009) Teradata past, present, and future. UCI ISG lecture series on scalable data management

21. Ghemawat S, Gobioff H, Leung S-T (2003) The google file system. In: ACM SIGOPS Operating Systems Review, vol 37. ACM, pp 29–43

22. Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

23. Hey AJG, Tansley S, Tolle KM, et al (2009) The fourth paradigm: data-intensive scientific discovery

24. Howard JH, Kazar ML, Menees SG, Nichols DA, Satyanarayanan M, Sidebotham RN, West MJ (1988) Scale and performance in a distributed file system. ACM Trans Comput Syst (TOCS) 6(1):51–81

25. Cattell R (2011) Scalable sql and nosql data stores. ACM SIGMOD Record 39(4):12–27

26. Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033

27. Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98

28. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J, et al (2012) Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States

29. Sun Y, Chen M, Liu B, Mao S (2013) Far: a fault-avoidant routing method for data center networks with regular topology. In: Proceedings of ACM/IEEE symposium on architectures for networking and communications systems (ANCS'13). ACM

30. Wiki (2013) Applications and organizations using hadoop. http://wiki.apache.org/hadoop/PoweredBy

31. Bahga A, Madisetti VK (2012) Analyzing massive machine maintenance data in a computing cloud. IEEE Transac Parallel Distrib Syst 23(10):1831–1843

32. Gunarathne T, Wu T-L, Choi JY, Bae S-H, Qiu J (2011) Cloud computing paradigms for pleasingly parallel biomedical applications. Concurr Comput Prac Experience 23(17):2338–2354

33. Gantz J, Reinsel D (2010) The digital universe decade - are you ready. External publication of IDC (Analyse the Future) information and data, pp 1–16

34. Bryant RE (2011) Data-intensive scalable computing for scientific applications. Comput Sci Eng 13(6):25–33

35. Wahab MHA, Mohd MNH, Hanafi HF, Mohsin MFM (2008) Data pre-processing on web server logs for generalized association rules mining algorithm. World Acad Sci Eng Technol 48:2008

36. Nanopoulos A, Manolopoulos Y, Zakrzewicz M, Morzy T (2002) Indexing web access-logs for pattern queries. In: Proceedings of the 4th international workshop on web information and data management. ACM, pp 63–68

37. Joshi KP, Joshi A, Yesha Y (2003) On using a warehouse to analyze web logs. Distrib Parallel Databases 13(2):161–180

38. Chandramohan V, Christensen K (2002) A first look at wired sensor networks for video surveillance systems. In: Proceedings LCN 2002, 27th annual IEEE conference on local computer networks. IEEE, pp 728–729

39. Selavo L, Wood A, Cao Q, Sookoor T, Liu H, Srinivasan A, Wu Y, Kang W, Stankovic J, Young D, et al (2007) Luster: wireless sensor network for environmental research. In: Proceedings of the 5th international conference on Embedded networked sensor systems. ACM, pp 103–116

40. Barrenetxea G, Ingelrest F, Schaefer G, Vetterli M, Couach O, Parlange M (2008) Sensorscope: out-of-the-box environmental monitoring. In: Information processing in sensor networks, 2008, international conference on, IPSN'08. IEEE, pp 332–343

41. Kim Y, Schmid T, Charbiwala ZM, Friedman J, Srivastava MB (2008) Nawms: nonintrusive autonomous water monitoring system. In: Proceedings of the 6th ACM conference on Embedded network sensor systems. ACM, pp 309–322

42. Kim S, Pakzad S, Culler D, Demmel J, Fenves G, Glaser S, Turon M (2007) Health monitoring of civil infrastructures using wireless sensor networks. In: Information Processing in Sensor Networks, 2007, 6th International Symposium on, IPSN 2007. IEEE, pp 254–263

43. Ceriotti M, Mottola L, Picco GP, Murphy AL, Guna S, Corra M, Pozzi M, Zonta D, Zanon P (2009) Monitoring heritage buildings with wireless sensor networks: the torre aquila deployment. In: Proceedings of the 2009 International Conference on Information Processing in Sensor Networks. IEEE Computer Society, pp 277–288

44. Tolle G, Polastre J, Szewczyk R, Culler D, Turner N, Tu K, Burgess S, Dawson T, Buonadonna P, Gay D, et al (2005) A macroscope in the redwoods. In: Proceedings of the 3rd international conference on embedded networked sensor systems. ACM, pp 51–63

45. Wang F, Liu J (2011) Networked wireless sensor data collection: issues, challenges, and approaches. IEEE Commun Surv Tutor 13(4):673–687

46. Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the 11th international conference on World Wide Web. ACM, pp 124–135

47. Choudhary S, Dincturk ME, Mirtaheri SM, Moosavi A, von Bochmann G, Jourdan G-V, Onut I-V (2012) Crawling rich internet applications: the state of the art. In: CASCON, pp 146–160

48. Ghani N, Dixit S, Wang T-S (2000) On ip-over-wdm integration. IEEE Commun Mag 38(3):72–84

49. Manchester J, Anderson J, Doshi B, Dravida S (1998) Ip over sonet. IEEE Commun Mag 36(5):136–142

50. Jinno M, Takara H, Kozicki B (2009) Dynamic optical mesh networks: drivers, challenges and solutions for the future. In: Optical communication, 2009, 35th European conference on, ECOC'09. IEEE, pp 1–4

51. Barroso LA, Holzle U (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synt Lect Comput Archit 4(1):1–108

52. Armstrong J (2009) Ofdm for optical communications. J Light Technol 27(3):189–204

53. Shieh W (2011) Ofdm for flexible high-speed optical networks. J Light Technol 29(10):1560–1577

54. Cisco data center interconnect design and deployment guide (2010)

55. Greenberg A, Hamilton JR, Jain N, Kandula S, Kim C, Lahiri P, Maltz DA, Patel P, Sengupta S (2009) Vl2: a scalable and flexible data center network. In: ACM SIGCOMM computer communication review, vol 39. ACM, pp 51–62

56. Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S (2009) Bcube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 39(4):63–74

57. Farrington N, Porter G, Radhakrishnan S, Bazzaz HH, Subramanya V, Fainman Y, Papen G, Vahdat A (2011) Helios: a hybrid electrical/optical switch architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 41(4):339–350

58. Abu-Libdeh H, Costa P, Rowstron A, O'Shea G, Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Comput Commun Rev 40(4):51–62

59. Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what's needed for datacenter network operations. IEEE Commun Mag 48(7):32–39


60. Wang G, Andersen DG, Kaminsky M, Papagiannaki K, Ng TS, Kozuch M, Ryan M (2010) c-through: part-time optics in data centers. In: ACM SIGCOMM Computer Communication Review, vol 40. ACM, pp 327–338

61. Ye X, Yin Y, Yoo SJB, Mejia P, Proietti R, Akella V (2010) Dos: a scalable optical switch for datacenters. In: Proceedings of the 6th ACM/IEEE symposium on architectures for networking and communications systems. ACM, p 24

62. Singla A, Singh A, Ramachandran K, Xu L, Zhang Y (2010) Proteus: a topology malleable data center network. In: Proceedings of the 9th ACM SIGCOMM workshop on hot topics in networks. ACM, p 8

63. Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P (2011) Energy-efficient design of a scalable optical multiplane interconnection architecture. IEEE J Sel Top Quantum Electron 17(2):377–383

64. Kodi AK, Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance computing (hpc) systems. IEEE J Sel Top Quantum Electron 17(2):384–395

65. Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H (2012) Mirror mirror on the ceiling: flexible wireless links for data centers. ACM SIGCOMM Comput Commun Rev 42(4):443–454

66. Lenzerini M (2002) Data integration: a theoretical perspective. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. ACM, pp 233–246

67. Cafarella MJ, Halevy A, Khoussainova N (2009) Data integration for the relational web. Proc VLDB Endowment 2(1):1090–1101

68. Maletic JI, Marcus A (2000) Data cleansing: beyond integrity analysis. In: IQ. Citeseer, pp 200–209

69. Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1-2):83–113

70. Chen H, Ku W-S, Wang H, Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 51–62

71. Zhao Z, Ng W (2012) A model-based approach for rfid data stream cleansing. In: Proceedings of the 21st ACM international conference on information and knowledge management. ACM, pp 862–871

72. Khoussainova N, Balazinska M, Suciu D (2008) Probabilistic event extraction from rfid data. In: Data Engineering, 2008, IEEE 24th international conference on, ICDE 2008. IEEE, pp 1480–1482

73. Herbert KG, Wang JTL (2007) Biological data cleaning: a case study. Int J Inf Qual 1(1):60–82

74. Tsai T-H, Lin C-Y (2012) Exploring contextual redundancy in improving object-based video coding for video sensor networks surveillance. IEEE Transac Multmed 14(3):669–682

75. Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 269–278

76. Kamath U, Compton J, Dogan RI, Jong KD, Shehu A (2012) An evolutionary algorithm approach for feature generation from sequence data and its application to dna splice site prediction. IEEE/ACM Transac Comput Biol Bioinforma (TCBB) 9(5):1387–1398

77. Leung K-S, Lee KH, Wang J-F, Ng EYT, Chan HLY, Tsui SKW, Mok TSK, Tse PC-H, Sung JJ-Y (2011) Data mining on dna sequences of hepatitis b virus. IEEE/ACM Transac Comput Biol Bioinforma 8(2):428–440

78. Huang Z, Shen H, Liu J, Zhou X (2011) Effective data co-reduction for multimedia similarity search. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, pp 1021–1032

79. Bleiholder J, Naumann F (2008) Data fusion. ACM Comput Surv (CSUR) 41(1):1

80. Brewer EA (2000) Towards robust distributed systems. In: PODC, p 7

81. Gilbert S, Lynch N (2002) Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59

82. McKusick MK, Quinlan S (2009) Gfs: evolution on fast-forward. ACM Queue 7(7):10

83. Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J (2008) Scope: easy and efficient parallel processing of massive data sets. Proc VLDB Endowment 1(2):1265–1276

84. Beaver D, Kumar S, Li HC, Sobel J, Vajgel P, et al (2010) Finding a needle in haystack: facebook's photo storage. In: OSDI, vol 10, pp 1–8

85. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: amazon's highly available key-value store. In: SOSP, vol 7, pp 205–220

86. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In: Proceedings of the twenty-ninth annual ACM symposium on theory of computing. ACM, pp 654–663

87. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2):4

88. Burrows M (2006) The chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th symposium on Operating systems design and implementation. USENIX Association, pp 335–350

89. Lakshman A, Malik P (2009) Cassandra: structured storage system on a p2p network. In: Proceedings of the 28th ACM symposium on principles of distributed computing. ACM, pp 5–5

90. George L (2011) HBase: the definitive guide. O'Reilly Media Inc

91. Judd D (2008) hypertable-0.9.0.4-alpha

92. Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc

93. Crockford D (2006) The application/json media type for javascript object notation (json)

94. Murty J (2009) Programming amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc

95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc

96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 975–986

97. Yang H-C, Parker DS (2009) Traverse: simplified indexing on large map-reduce-merge clusters. In: Database systems for advanced applications. Springer, pp 308–322

98. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with sawzall. Sci Program 13(4):277–298

99. Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U (2009) Building a high-level dataflow system on top of map-reduce: the pig experience. Proceedings VLDB Endowment 2(2):1414–1425


100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629

101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72

102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) Dryadlinq: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14

103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: Parallel and distributed processing, 2008, IEEE international symposium on, IPDPS 2008. IEEE, pp 1–11

104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146

105. Bu Y, Bill H, Balazinska M, Ernst MD (2010) Haloop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296

106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative mapreduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing. ACM, pp 810–818

107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2

108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: mapreduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing. ACM, p 7

109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) Ciel: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on Networked systems design and implementation, p 9

110. Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York

111. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY, et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

112. What analytics, data mining, big data software you used in the past 12 months for a real project (2012). http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html

113. Berthold MR, Cebron N, Dill F, Gabriel TR, Kotter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer

114. Sallam RL, Richardson J, Hagerty J, Hostmann B (2011) Magic quadrant for business intelligence platforms. CT, Gartner Group, Stamford

115. Beyond the PC. Special Report on Personal Technology (2011)

116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A, et al (2011) The iplant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on Software quality assurance. ACM, pp 70–77

118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers. ACM, pp 277–286

119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Record 34(2):18–26

120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM Sigmod Record 33(1):50–57

121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Transac Manag Inform Syst (TMIS) 3(2):7

122. Manning CD, Schutze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press

123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Transac Neural Netw 13(5):1163–1177

124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11

125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117

126. Konopnicki D, Shmueli O (1995) W3qs: a query system for the world-wide web. In: VLDB, vol 95, pp 54–65

127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640

128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval. ACM, p 2

129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25

130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19

131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819

132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939

133. Barragans-Martinez AB, Costa-Montenegro E, Burguillo JC, Rey-Lopez M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend tv programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311

134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91

135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia. ACM, pp 469–478

136. Hirsch JE (2005) An index to quantify an individual's scientific research output. Proc Natl Acad Sci USA 102(46):16569

137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company

138. Aggarwal CC (2011) An introduction to social network data analytics. Springer

139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1046–1054

140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied Computing. ACM, pp 1115–1116


141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Transac Knowl Discov Data (TKDD) 5(2):10

142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on World wide web. ACM, pp 631–640

143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis. ACM, pp 16–25

144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. ACM, pp 315–321

145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 145–158

146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 131–144

147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1007–1016

148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 807–816

149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on Web search and data mining. ACM, pp 657–666

150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360

151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (last accessed 5 May 2013)

152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51

153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive rfid, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 2

154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval. ACM, pp 469–478

155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated, multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium. ACM, pp 445–454

156. Tasevski P (2011) Password attacks and generation strategies. Tartu University: Faculty of Mathematics and Computer Sciences



    generates data of tens of Terabyte (TB) for online trading per day. Figure 1 illustrates the boom of the global data volume. While the amount of large datasets is drastically rising, it also brings about many challenging problems demanding prompt solutions:

    – The latest advances of information technology (IT) make it easier to generate data. For example, on average, 72 hours of videos are uploaded to YouTube every minute [11]. Therefore, we are confronted with the main challenge of collecting and integrating massive data from widely distributed data sources.

    – The rapid growth of cloud computing and the Internet of Things (IoT) further promotes the sharp growth of data. Cloud computing provides safeguarding, access sites, and channels for data assets. In the paradigm of IoT, sensors all over the world are collecting and transmitting data to be stored and processed in the cloud. Such data, in both quantity and mutual relations, will far surpass the capacities of the IT architectures and infrastructure of existing enterprises, and its real-time requirement will also greatly stress the available computing capacity. The increasingly growing data cause a problem of how to store and manage such huge heterogeneous datasets with moderate requirements on hardware and software infrastructure.

    – In consideration of the heterogeneity, scalability, real-time, complexity, and privacy of big data, we shall effectively “mine” the datasets at different levels during the analysis, modeling, visualization, and forecasting, so as to reveal its intrinsic property and improve the decision making.

    1.2 Definition and features of big data

    Big data is an abstract concept. Apart from masses of data, it also has some other features, which determine the difference between itself and “massive data” or “very big data.”

    Fig. 1 The continuously increasing big data

    172 Mobile Netw Appl (2014) 19:171–209

    At present, although the importance of big data has been generally recognized, people still have different opinions on its definition. In general, big data shall mean the datasets that could not be perceived, acquired, managed, and processed by traditional IT and software/hardware tools within a tolerable time. Because of different concerns, scientific and technological enterprises, research scholars, data analysts, and technical practitioners have different definitions of big data. The following definitions may help us have a better understanding on the profound social, economic, and technological connotations of big data.

    In 2010, Apache Hadoop defined big data as “datasets which could not be captured, managed, and processed by general computers within an acceptable scope.” On the basis of this definition, in May 2011, McKinsey & Company, a global consulting agency, announced Big Data as the next frontier for innovation, competition, and productivity. Big data shall mean such datasets which could not be acquired, stored, and managed by classic database software. This definition includes two connotations: First, datasets’ volumes that conform to the standard of big data are changing, and may grow over time or with technological advances; Second, datasets’ volumes that conform to the standard of big data in different applications differ from each other. At present, big data generally ranges from several TB to several PB [10]. From the definition by McKinsey & Company, it can be seen that the volume of a dataset is not the only criterion for big data. The increasingly growing data scale and its management that could not be handled by traditional database technologies are the next two key features.

    As a matter of fact, big data has been defined as early as 2001. Doug Laney, an analyst of META (presently Gartner), defined challenges and opportunities brought about by increased data with a 3Vs model, i.e., the increase of Volume, Velocity, and Variety, in a research report [12]. Although such a model was not originally used to define big data, Gartner and many other enterprises, including IBM [13] and some research departments of Microsoft [14], still used the “3Vs” model to describe big data within the following ten years [15]. In the “3Vs” model, Volume means that, with the generation and collection of masses of data, data scale becomes increasingly big; Velocity means the timeliness of big data, specifically, data collection and analysis, etc. must be conducted rapidly and in a timely manner, so as to make maximal use of the commercial value of big data; Variety indicates the various types of data, which include semi-structured and unstructured data such as audio, video, webpage, and text, as well as traditional structured data.

    However, others have different opinions, including IDC, one of the most influential leaders in big data and its research fields. In 2011, an IDC report defined big data as “big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling the high-velocity capture, discovery, and/or analysis.” [1] With this definition, characteristics of big data may be summarized as four Vs, i.e., Volume (great volume), Variety (various modalities), Velocity (rapid generation), and Value (huge value but very low density), as shown in Fig. 2. Such 4Vs definition was widely recognized since it highlights the meaning and necessity of big data, i.e., exploring the huge hidden values. This definition indicates the most critical problem in big data, which is how to discover values from datasets with an enormous scale, various types, and rapid generation. As Jay Parikh, Deputy Chief Engineer of Facebook, said, “You could only own a bunch of data other than big data if you do not utilize the collected data.” [11]

    In addition, NIST defines big data as “Big data shall mean the data of which the data volume, acquisition speed, or data representation limits the capacity of using traditional relational methods to conduct effective analysis or the data which may be effectively processed with important horizontal zoom technologies,” which focuses on the technological aspect of big data. It indicates that efficient methods or technologies need to be developed and used to analyze and process big data.

    There have been considerable discussions from both industry and academia on the definition of big data [16, 17]. In addition to developing a proper definition, the big data research should also focus on how to extract its value, how to use data, and how to transform “a bunch of data” into “big data.”

    1.3 Big data value

    McKinsey & Company observed how big data created values after in-depth research on the US healthcare, the EU public sector administration, the US retail, the global manufacturing, and the global personal location data. Through research on the five core industries that represent the global economy, the McKinsey report pointed out that big data may give a full play to the economic function, improve the productivity and competitiveness of enterprises and public sectors, and create huge benefits for consumers. In [10], McKinsey summarized the values that big data could create: if big data could be creatively and effectively utilized to improve efficiency and quality, the potential value of the US medical industry gained through data may surpass USD 300 billion, thus reducing the expenditure for the US healthcare by over 8 %; retailers that fully utilize big data may improve their profit by more than 60 %; big data may also be utilized to improve the efficiency of government operations, such that the developed economies in Europe could save over EUR 100 billion (which excludes the effect of reduced frauds, errors, and tax difference).

    Fig. 2 The 4Vs feature of big data

    The McKinsey report is regarded as prospective and predictive, while the following facts may validate the values of big data. During the 2009 flu pandemic, Google obtained timely information by analyzing big data, which even provided more valuable information than that provided by disease prevention centers. Nearly all countries required hospitals to inform agencies such as disease prevention centers of the new type of influenza cases. However, patients usually did not see doctors immediately when they got infected. It also took some time to send information from hospitals to disease prevention centers, and for disease prevention centers to analyze and summarize such information. Therefore, by the time the public became aware of the pandemic of the new type of influenza, the disease may have already spread for one to two weeks, i.e., official reporting lagged behind the outbreak. Google found that during the spreading of influenza, entries frequently sought at its search engines would be different from those at ordinary times, and the use frequencies of the entries were correlated to the influenza spreading in both time and location. Google found 45 search entry groups that were closely relevant to the outbreak of influenza and incorporated them in specific mathematic models to forecast the spreading of influenza and even to predict places where influenza spread from. The related research results have been published in Nature [18].
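
    The idea of selecting search entries whose frequency tracks influenza activity can be illustrated with a small sketch. It is not the model used in [18]; it only assumes that each candidate query has a weekly frequency series and ranks the queries by their Pearson correlation with weekly case counts. The function names and all numbers below are hypothetical and made up for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)

def rank_queries(query_series, case_series):
    """Rank search terms by how closely their weekly frequency tracks case counts."""
    scores = {term: pearson(freqs, case_series)
              for term, freqs in query_series.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Made-up weekly numbers, for illustration only.
weekly_cases = [10, 20, 40, 80, 60, 30]
weekly_queries = {
    "flu symptoms": [1.0, 2.1, 3.9, 8.2, 6.1, 2.9],
    "football scores": [5.0, 4.8, 5.1, 4.9, 5.2, 5.0],
}
ranking = rank_queries(weekly_queries, weekly_cases)
```

    In this toy setting the flu-related query ranks first because its frequency rises and falls with the case counts, whereas the unrelated query stays roughly flat. A real system would also have to guard against spurious correlations across a very large number of candidate queries.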

    In 2008, Microsoft purchased Farecast, a sci-tech venture company in the US. Farecast has an airline ticket forecast system that predicts the trends and rising/dropping ranges of airline ticket price. The system has been incorporated into the Bing search engine of Microsoft. By 2012, the system had saved nearly USD 50 per ticket per passenger, with the forecasted accuracy as high as 75 %.

    At present, data has become an important production factor that could be comparable to material assets and human capital. As multimedia, social media, and IoT are developing, enterprises will collect more information, leading to an exponential growth of data volume. Big data will have a huge and increasing potential in creating values for businesses and consumers.

    1.4 The development of big data

    In the late 1970s, the concept of “database machine” emerged, which is a technology specially used for storing and analyzing data. With the increase of data volume, the storage and processing capacity of a single mainframe computer system became inadequate. In the 1980s, people proposed “share nothing,” a parallel database system, to meet the demand of the increasing data volume [19]. The share nothing system architecture is based on the use of a cluster, and every machine has its own processor, storage, and disk. Teradata system was the first successful commercial parallel database system. Such databases later became very popular. On June 2, 1986, a milestone event occurred when Teradata delivered the first parallel database system with the storage capacity of 1TB to Kmart to help the large-scale retail company in North America to expand its data warehouse [20]. In the late 1990s, the advantages of parallel databases were widely recognized in the database field.

    However, many challenges on big data arose. With the development of Internet services, indexes and queried contents were rapidly growing. Therefore, search engine companies had to face the challenges of handling such big data. Google created the GFS [21] file system and the MapReduce [22] programming model to cope with the challenges brought about by data management and analysis at the Internet scale. In addition, contents generated by users, sensors, and other ubiquitous data sources also fueled the overwhelming data flows, which required a fundamental change on the computing architecture and large-scale data processing mechanism. In January 2007, Jim Gray, a pioneer of database software, called such transformation “The Fourth Paradigm” [23]. He also thought the only way to cope with such paradigm was to develop a new generation of computing tools to manage, visualize, and analyze massive data. In June 2011, another milestone event occurred: EMC/IDC published a research report titled Extracting Values from Chaos [1], which introduced the concept and potential of big data for the first time. This research report triggered great interest in big data in both industry and academia.

    Over the past few years, nearly all major companies, including EMC, Oracle, IBM, Microsoft, Google, Amazon, and Facebook, etc. have started their big data projects. Taking IBM as an example, since 2005, IBM has invested USD 16 billion on 30 acquisitions related to big data. In academia, big data was also under the spotlight. In 2008, Nature published a big data special issue. In 2011, Science also launched a special issue on the key technologies of “data processing” in big data. In 2012, European Research Consortium for Informatics and Mathematics (ERCIM) News published a special issue on big data. In the beginning of 2012, a report titled Big Data, Big Impact presented at the Davos Forum in Switzerland announced that big data has become a new kind of economic asset, just like currency or gold. Gartner, an international research agency, issued Hype Cycles from 2012 to 2013, which classified big data computing, social analysis, and stored data analysis into 48 emerging technologies that deserve most attention.

    Many national governments such as the US also paid great attention to big data. In March 2012, the Obama Administration announced a USD 200 million investment to launch the “Big Data Research and Development Plan,” which was a second major scientific and technological development initiative after the “Information Highway” initiative in 1993. In July 2012, the “Vigorous ICT Japan” project issued by Japan’s Ministry of Internal Affairs and Communications indicated that the big data development should be a national strategy and application technologies should be the focus. In July 2012, the United Nations issued the Big Data for Development report, which summarized how governments utilized big data to better serve and protect their people.

    1.5 Challenges of big data

    The sharply increasing data deluge in the big data era brings about huge challenges on data acquisition, storage, management, and analysis. Traditional data management and analysis systems are based on the relational database management system (RDBMS). However, such RDBMSs only apply to structured data, not to semi-structured or unstructured data. In addition, RDBMSs are increasingly utilizing more and more expensive hardware. It is apparent that the traditional RDBMSs could not handle the huge volume and heterogeneity of big data. The research community has proposed some solutions from different perspectives. For example, cloud computing is utilized to meet the requirements on infrastructure for big data, e.g., cost efficiency, elasticity, and smooth upgrading/downgrading. For solutions of permanent storage and management of large-scale disordered datasets, distributed file systems [24] and NoSQL [25] databases are good choices. Such programming frameworks have achieved great success in processing clustered tasks, especially for webpage ranking. Various big data applications can be developed based on these innovative technologies or platforms. Moreover, it is non-trivial to deploy the big data analysis systems.

    Some literature [26–28] discuss obstacles in the development of big data applications. The key challenges are listed as follows:

    – Data representation: many datasets have certain levels of heterogeneity in type, structure, semantics, organization, granularity, and accessibility. Data representation aims to make data more meaningful for computer analysis and user interpretation. Nevertheless, an improper data representation will reduce the value of the original data and may even obstruct effective data analysis. Efficient data representation shall reflect data structure, class, and type, as well as integrated technologies, so as to enable efficient operations on different datasets.

    – Redundancy reduction and data compression: generally, there is a high level of redundancy in datasets. Redundancy reduction and data compression are effective in reducing the indirect cost of the entire system, on the premise that the potential values of the data are not affected. For example, most data generated by sensor networks are highly redundant, and may be filtered and compressed by orders of magnitude.

    – Data life cycle management: compared with the relatively slow advances of storage systems, pervasive sensing and computing are generating data at unprecedented rates and scales. We are confronted with a lot of pressing challenges, one of which is that the current storage system could not support such massive data. Generally speaking, values hidden in big data depend on data freshness. Therefore, a data importance principle related to the analytical value should be developed to decide which data shall be stored and which data shall be discarded.

    – Analytical mechanism: the analytical system of big data shall process masses of heterogeneous data within a limited time. However, traditional RDBMSs are strictly designed with a lack of scalability and expandability, which could not meet the performance requirements. Non-relational databases have shown their unique advantages in the processing of unstructured data and started to become mainstream in big data analysis. Even so, there are still some problems of non-relational databases in their performance and particular applications. We shall find a compromise solution between RDBMSs and non-relational databases. For example, some enterprises have utilized a mixed database architecture that integrates the advantages of both types of database (e.g., Facebook and Taobao). More research is needed on the in-memory database and sample data based on approximate analysis.

    – Data confidentiality: most big data service providers or owners at present could not effectively maintain and analyze such huge datasets because of their limited capacity. They must rely on professionals or tools to analyze such data, which increases the potential safety risks. For example, the transactional dataset generally includes a set of complete operating data to drive key business processes. Such data contains details of the lowest granularity and some sensitive information such as credit card numbers. Therefore, analysis of big data may be delivered to a third party for processing only when proper preventive measures are taken to protect such sensitive data, to ensure its safety.

    – Energy management: the energy consumption of mainframe computing systems has drawn much attention from both economy and environment perspectives. With the increase of data volume and analytical demands, the processing, storage, and transmission of big data will inevitably consume more and more electric energy. Therefore, system-level power consumption control and management mechanism shall be established for big data while the expandability and accessibility are ensured.

    – Expandability and scalability: the analytical system of big data must support present and future datasets. The analytical algorithm must be able to process increasingly expanding and more complex datasets.

    – Cooperation: analysis of big data is an interdisciplinary research, which requires experts in different fields to cooperate to harvest the potential of big data. A comprehensive big data network architecture must be established to help scientists and engineers in various fields access different kinds of data and fully utilize their expertise, so as to cooperate to complete the analytical objectives.
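
    The redundancy reduction and data compression challenge listed above can be made concrete with a minimal sketch, assuming a stream of (timestamp, value) sensor readings. It combines a dead-band filter, which drops readings that barely differ from the last kept one, with lossless zlib compression of what remains. The filter is lossy, so whether the potential value of the data is preserved depends on the chosen threshold; the function names, the threshold, and the temperature trace are hypothetical.

```python
import json
import zlib

def deadband_filter(readings, threshold):
    """Keep a reading only if it differs from the last kept value by more than threshold."""
    kept = []
    last = None
    for timestamp, value in readings:
        if last is None or abs(value - last) > threshold:
            kept.append((timestamp, value))
            last = value
    return kept

def compress(readings):
    """Serialize readings as JSON and compress them losslessly with zlib."""
    return zlib.compress(json.dumps(readings).encode("utf-8"))

def decompress(blob):
    """Inverse of compress(); returns a list of [timestamp, value] pairs."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))

# Made-up temperature trace: flat at 20.0, then a single step to 25.0.
raw = [(t, 20.0) for t in range(50)] + [(t, 25.0) for t in range(50, 100)]
filtered = deadband_filter(raw, threshold=0.5)
blob = compress(filtered)
```

    On this trace the filter keeps only the first reading and the reading at the step change, and the compressed form of even the unfiltered trace is much smaller than its JSON serialization, because repeated values compress well.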

    2 Related technologies

    In order to gain a deep understanding of big data, this section will introduce several fundamental technologies that are closely related to big data, including cloud computing, IoT, data center, and Hadoop.

    2.1 Relationship between cloud computing and big data

    Cloud computing is closely related to big data. The key components of cloud computing are shown in Fig. 3. Big data is the object of the computation-intensive operation and stresses the storage capacity of a cloud system. The main objective of cloud computing is to use huge computing and storage resources under concentrated management, so as to provide big data applications with fine-grained computing capacity. The development of cloud computing provides solutions for the storage and processing of big data. On the other hand, the emergence of big data also accelerates the development of cloud computing. The distributed storage technology based on cloud computing can effectively manage big data; the parallel computing capacity by virtue of cloud computing can improve the efficiency of acquiring and analyzing big data.

    Even though there are many overlapped technologies in cloud computing and big data, they differ in the following two aspects. First, the concepts are different to a certain extent. Cloud computing transforms the IT architecture while big data influences business decision-making. However, big data depends on cloud computing as the fundamental infrastructure for smooth operation.

    Second, big data and cloud computing have different target customers. Cloud computing is a technology and product targeting Chief Information Officers (CIO) as an advanced IT solution. Big data is a product targeting Chief Executive Officers (CEO) focusing on business operations. Since the decision makers may directly feel the pressure from market competition, they must defeat business opponents in more competitive ways. With the advances of big data and cloud computing, these two technologies are certainly and increasingly entwined with each other. Cloud computing, with functions similar to those of computers and operating systems, provides system-level resources; big data operates in the upper level supported by cloud computing and provides functions similar to those of database and efficient data processing capacity. Kissinger, President of EMC, indicated that the application of big data must be based on cloud computing.

    Fig. 3 Key components of cloud computing

    The evolution of big data was driven by the rapid growth of application demands and cloud computing developed from virtualized technologies. Therefore, cloud computing not only provides computation and processing for big data, but also itself is a service mode. To a certain extent, the advances of cloud computing also promote the development of big data, both of which supplement each other.

    2.2 Relationship between IoT and big data

    In the IoT paradigm, an enormous amount of networking sensors are embedded into various devices and machines in the real world. Such sensors deployed in different fields may collect various kinds of data, such as environmental data, geographical data, astronomical data, and logistic data. Mobile equipment, transportation facilities, public facilities, and home appliances could all be data acquisition equipment in IoT, as illustrated in Fig. 4.

    The big data generated by IoT has different characteristics compared with general big data because of the different types of data collected, of which the most classical characteristics include heterogeneity, variety, unstructured feature, noise, and high redundancy. Although the current IoT data is not the dominant part of big data, by 2030, the quantity of sensors will reach one trillion and then the IoT data will be the most important part of big data, according to the forecast of HP. A report from Intel pointed out that big data in IoT has three features that conform to the big data paradigm: (i) abundant terminals generating masses of data; (ii) data generated by IoT is usually semi-structured or unstructured; (iii) data of IoT is useful only when it is analyzed.

    At present, the data processing capacity of IoT has fallen behind the collected data and it is extremely urgent to accelerate the introduction of big data technologies to promote the development of IoT. Many operators of IoT realize the importance of big data since the success of IoT is hinged upon the effective integration of big data and cloud computing. The widespread deployment of IoT will also bring many cities into the big data era.

    There is a compelling need to adopt big data for IoT applications, while the development of big data is already lagging behind. It has been widely recognized that these two technologies are inter-dependent and should be jointly developed: on one hand, the widespread deployment of IoT drives the high growth of data both in quantity and category, thus providing the opportunity for the application and development of big data; on the other hand, the application of big data technology to IoT also accelerates the research advances and business models of IoT.

    Fig. 4 Illustration of data acquisition equipment in IoT

    2.3 Data center

    In the big data paradigm, the data center not only is a platform for concentrated storage of data, but also undertakes more responsibilities, such as acquiring data, managing data, organizing data, and leveraging the data values and functions. Data centers mainly concern “data” rather than “center.” A data center has masses of data and organizes and manages data according to its core objective and development path, which is more valuable than owning a good site and resource. The emergence of big data brings about sound development opportunities and great challenges to data centers. Big data is an emerging paradigm, which will promote the explosive growth of the infrastructure and related software of data center. The physical data center network is the core for supporting big data, but, at present, is the key infrastructure that is most urgently required [29].

    – Big data requires the data center to provide powerful backstage support. The big data paradigm has more stringent requirements on storage capacity and processing capacity, as well as network transmission capacity. Enterprises must take the development of data centers into consideration to improve the capacity of rapidly and effectively processing big data under a limited price/performance ratio. The data center shall provide the infrastructure with a large number of nodes, build a high-speed internal network, effectively dissipate heat, and effectively back up data. Only when a highly energy-efficient, stable, safe, expandable, and redundant data center is built can the normal operation of big data applications be ensured.

    – The growth of big data applications accelerates the revolution and innovation of data centers. Many big data applications have developed their unique architectures and directly promote the development of storage, network, and computing technologies related to data center. With the continued growth of the volumes of structured and unstructured data, and the variety of sources of analytical data, the data processing and computing capacities of the data center shall be greatly enhanced. In addition, as the scale of data center is increasingly expanding, how to reduce the operational cost is also an important issue for the development of data centers.

    – Big data endows more functions to the data center. In the big data paradigm, data center shall not only be concerned with hardware facilities but also strengthen soft capacities, i.e., the capacities of acquisition, processing, organization, analysis, and application of big data. The data center may help business personnel analyze the existing data, discover problems in business operation, and develop solutions from big data.

    2.4 Relationship between Hadoop and big data

    Presently, Hadoop is widely used in big data applications in the industry, e.g., spam filtering, network searching, clickstream analysis, and social recommendation. In addition, considerable academic research is now based on Hadoop. Some representative cases are given below. As declared in June 2012, Yahoo runs Hadoop in 42,000 servers at four data centers to support its products and services, e.g., searching and spam filtering, etc. At present, the biggest Hadoop cluster has 4,000 nodes, but the number of nodes will be increased to 10,000 with the release of Hadoop 2.0. In the same month, Facebook announced that their Hadoop cluster can process 100 PB data, which grew by 0.5 PB per day as in November 2012. Some well-known agencies that use Hadoop to conduct distributed computation are listed in [30]. In addition, many companies provide Hadoop commercial execution and/or support, including Cloudera, IBM, MapR, EMC, and Oracle.

    Among modern industrial machinery and systems, sensors are widely deployed to collect information for environment monitoring and failure forecasting, etc. Bahga and others in [31] proposed a framework for data organization and cloud computing infrastructure, termed CloudView. CloudView uses mixed architectures, local nodes, and remote clusters based on Hadoop to analyze machine-generated data. Local nodes are used for the forecast of real-time failures; clusters based on Hadoop are used for complex offline analysis, e.g., case-driven data analysis.

    The exponential growth of the genome data and the sharp drop of sequencing cost transform bio-science and bio-medicine to data-driven science. Gunarathne et al. in [32] utilized cloud computing infrastructures, Amazon AWS and Microsoft Azure, and data processing frameworks based on MapReduce, Hadoop, and Microsoft DryadLINQ to run two parallel bio-medicine applications: (i) assembly of genome segments; (ii) dimension reduction in the analysis of chemical structure. In the subsequent application, the 166-D datasets used include 26,000,000 data points. The authors compared the performance of all the frameworks in terms of efficiency, cost, and availability. According to the study, the authors concluded that the loose coupling will be increasingly applied to research on electron cloud, and the parallel programming technology (MapReduce) framework may provide the user an interface with more convenient services and reduce unnecessary costs.
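
    Several of the systems cited in this section build on the MapReduce programming model. As a rough illustration of that model only, the following single-process sketch counts words with explicit map, shuffle, and reduce steps. It does not use Hadoop or any distributed runtime; the function names are hypothetical, and a real framework would run the map and reduce calls in parallel across many nodes.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group emitted values by key, as a framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data big value", "data center"])
```

    Because each map call looks at only one record and each reduce call at only one key, both phases can be distributed freely, which is what makes the model attractive for cluster-scale processing.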

    3 Big data generation and acquisition

    We have introduced several key technologies related to big data, i.e., cloud computing, IoT, data center, and Hadoop. Next, we will focus on the value chain of big data, which can be generally divided into four phases: data generation, data acquisition, data storage, and data analysis. If we take data as a raw material, data generation and data acquisition are an exploitation process, data storage is a storage process, and data analysis is a production process that utilizes the raw material to create new value.

    3.1 Data generation

    Data generation is the first step of big data. Taking Internet data as an example, a huge amount of data in terms of searching entries, Internet forum posts, chatting records, and microblog messages is generated. Those data are closely related to people’s daily life, and have similar features of high value and low density. Such Internet data may be valueless individually, but, through the exploitation of accumulated big data, useful information such as habits and hobbies of users can be identified, and it is even possible to forecast users’ behaviors and emotional moods.

    Moreover, generated through longitudinal and/or distributed data sources, datasets are larger in scale, highly diverse, and complex. Such data sources include sensors, videos, clickstreams, and/or all other available data sources. At present, main sources of big data are the operation and trading information in enterprises, logistic and sensing information in the IoT, human interaction information and position information in the Internet world, and data generated in scientific research, etc. The information far surpasses the capacities of IT architectures and infrastructures of existing enterprises, while its real-time requirement also greatly stresses the existing computing capacity.

    3.1.1 Enterprise data

    In 2013, IBM issued Analysis: the Applications of Big Data to the Real World, which indicates that the internal data of enterprises are the main sources of big data. The internal data of enterprises mainly consists of online trading data and online analysis data, most of which are historically static data and are managed by RDBMSs in a structured manner. In addition, production data, inventory data, sales data, and financial data, etc., also constitute enterprise internal data, which aims to capture informationized and data-driven activities in enterprises, so as to record all activities of enterprises in the form of internal data.

    Over the past decades, IT and digital data have contributed a lot to improving the profitability of business departments. It is estimated that the business data volume of all companies in the world may double every 1.2 years [10], in which the business turnover through the Internet, enterprises to enterprises, and enterprises to consumers, per day will reach USD 450 billion [33]. The continuously increasing business data volume requires more effective real-time analysis so as to fully harvest its potential. For example, Amazon processes millions of terminal operations and more than 500,000 queries from third-party sellers per day [12]. Walmart processes one million customer trades per hour, and such trading data is imported into a database with a capacity of over 2.5PB [3]. Akamai analyzes 75 million events per day for its targeted advertisements [13].

    3.1.2 IoT data

    As discussed, IoT is an important source of big data. In smart cities constructed based on IoT, big data may come from industry, agriculture, traffic, transportation, medical care, public departments, families, etc.

    According to the processes of data acquisition and transmission in IoT, its network architecture may be divided into three layers: the sensing layer, the network layer, and the application layer. The sensing layer is responsible for data acquisition and mainly consists of sensor networks. The network layer is responsible for information transmission and processing, where short-range transmission may rely on sensor networks and remote transmission depends on the Internet. Finally, the application layer supports specific applications of IoT.

    According to the characteristics of the Internet of Things, the data generated from IoT has the following features:

    – Large-scale data: in IoT, masses of data acquisition equipment are deployed in a distributed manner, which may acquire simple numeric data (e.g., location) or complex multimedia data (e.g., surveillance video). In order to meet the demands of analysis and processing, not only the currently acquired data but also the historical data within a certain time frame should be stored. Therefore, data generated by IoT is characterized by large scale.

    – Heterogeneity: because of the variety of data acquisition devices, the acquired data also differs, and such data features heterogeneity.

    – Strong time and space correlation: in IoT, every data acquisition device is placed at a specific geographic location and every piece of data has a time stamp. The time and space correlation is an important property of data from IoT. During data analysis and processing, time and space are also important dimensions for statistical analysis.

    – Effective data accounts for only a small portion of the big data: a great quantity of noise may occur during the acquisition and transmission of data in IoT. Among datasets acquired by acquisition devices, only a small amount of abnormal data is valuable. For example, during the acquisition of traffic video, the few video frames that capture violations of traffic regulations and traffic accidents are more valuable than those only capturing the normal flow of traffic.

    3.1.3 Bio-medical data

    As a series of high-throughput bio-measurement technologies were innovatively developed at the beginning of the 21st century, frontier research in the bio-medicine field has also entered the era of big data. By constructing smart, efficient, and accurate analytical models and theoretical systems for bio-medicine applications, the essential governing mechanism behind complex biological phenomena may be revealed. Not only can the future development of bio-medicine be determined, but leading roles can also be assumed in the development of a series of important strategic industries related to the national economy, people's livelihood, and national security, with important applications such as medical care, new drug R&D, and grain production (e.g., transgenic crops).

    The completion of the HGP (Human Genome Project) and the continued development of sequencing technology have also led to widespread applications of big data in the field. The masses of data generated by gene sequencing go through specialized analysis according to different application demands, to be combined with clinical gene diagnosis and provide valuable information for early diagnosis and personalized treatment of disease. One sequencing of human genes may generate 100–600GB of raw data. In the China National Genebank in Shenzhen, there are 1.3 million samples, including 1.15 million human samples and 150,000 animal, plant, and microorganism samples. By the end of 2013, 10 million traceable biological samples will be stored, and by the end of 2015 this figure will reach 30 million. It is predictable that, with the development of bio-medicine technologies, gene sequencing will become faster and more convenient, thus making the big data of bio-medicine grow continuously.

    In addition, data generated from clinical medical care and medical R&D also rises quickly. For example, the University of Pittsburgh Medical Center (UPMC) has stored 2TB of such data. Explorys, an American company, provides platforms to collate clinical data, operation and maintenance data, and financial data. At present, about 13 million people's information has been collated, with 44 articles of data at the scale of about 60TB, which will reach 70TB in 2013. Practice Fusion, another American company, manages electronic medical records of about 200,000 patients.

    Apart from such small and medium-sized enterprises, other well-known IT companies such as Google, Microsoft, and IBM have invested extensively in the research and computational analysis of methods related to high-throughput biological big data, for shares in the huge market known as the “Next Internet.” IBM forecast in the 2013 Strategy Conference that, with the sharp increase of medical images and electronic medical records, medical professionals may utilize big data to extract useful clinical information from masses of data to obtain a medical history and forecast treatment effects, thus improving patient care and reducing cost. It is anticipated that by 2015 the average data volume of every hospital will increase from 167TB to 665TB.

    3.1.4 Data generation from other fields

    As scientific applications are increasing, the scale of datasets is gradually expanding, and the development of some disciplines greatly relies on the analysis of masses of data. Here we examine several such applications. Although they are in different scientific fields, the applications have similar and increasing demands on data analysis. The first example is related to computational biology. GenBank is a nucleotide sequence database maintained by the U.S. National Center for Biotechnology Information (NCBI). Data in this database may double every 10 months. By August 2009, GenBank had more than 250 billion bases from 150,000 different organisms [34]. The second example is related to astronomy. The Sloan Digital Sky Survey (SDSS), the biggest sky survey project in astronomy, recorded 25TB of data from 1998 to 2008. As the resolution of the telescope is improved, by 2004 the data volume generated per night will surpass 20TB. The last application is related to high-energy physics. At the beginning of 2008, the ATLAS experiment of the Large Hadron Collider (LHC) of the European Organization for Nuclear Research generates raw data at 2PB/s and stores about 10TB of processed data per year.

    In addition, pervasive sensing and computing among nature, commercial, Internet, government, and social environments are generating heterogeneous data with unprecedented complexity. These datasets have their unique data characteristics in scale, time dimension, and data category. For example, mobile data were recorded with respect to positions, movement, approximation degrees, communications, multimedia, use of applications, and audio environment [108]. According to the application environment and requirements, such datasets can be classified into different categories, so as to select the proper and feasible solutions for big data.

    3.2 Big data acquisition

    As the second phase of the big data system, big data acquisition includes data collection, data transmission, and data pre-processing. During big data acquisition, once we collect the raw data, we shall utilize an efficient transmission mechanism to send it to a proper storage management system to support different analytical applications. The collected datasets may sometimes include much redundant or useless data, which unnecessarily increases storage space and affects the subsequent data analysis. For example, high redundancy is very common among datasets collected by sensors for environment monitoring. Data compression technology can be applied to reduce the redundancy. Therefore, data pre-processing operations are indispensable to ensure efficient data storage and exploitation.
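    As a minimal sketch of how compression reduces redundancy in sensor data, the following example compresses a batch of nearly identical readings with Python's standard zlib module. The sensor name, timestamps, and record layout are hypothetical and chosen only for illustration; the survey does not prescribe any particular compression scheme.

```python
import zlib


def compression_ratio(payload: bytes) -> float:
    """Return the original size divided by the zlib-compressed size."""
    return len(payload) / len(zlib.compress(payload, 9))


# Hypothetical readings: a temperature sensor that reports almost the same
# value every minute, which makes the batch highly redundant.
readings = ["2014-01-01T00:%02d:00,sensor-7,21.5" % (i % 60) for i in range(1000)]
redundant = "\n".join(readings).encode("ascii")

compressed = zlib.compress(redundant, 9)   # lossless compression
restored = zlib.decompress(compressed)     # exact round trip

print(len(redundant), "bytes before,", len(compressed), "bytes after")
```

    The round trip is lossless, so the saving in storage and transmission is paid for only with the extra computation needed to compress and decompress.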

    3.2.1 Data collection

    Data collection is to utilize special data collection techniques to acquire raw data from a specific data generation environment. Four common data collection methods are shown as follows.

    – Log files: As one widely used data collection method, log files are record files automatically generated by the data source system, so as to record activities in designated file formats for subsequent analysis. Log files are typically used in nearly all digital devices. For example, web servers record in log files the number of clicks, click rates, visits, and other property records of web users [35]. To capture activities of users at web sites, web servers mainly use the following three log file formats: public log file format (NCSA), expanded log format (W3C), and IIS log format (Microsoft). All three types of log files are in the ASCII text format. Databases other than text files may sometimes be used to store log information to improve the query efficiency of the massive log store [36, 37]. There are also other forms of log-file-based data collection, including stock indicators in financial applications and determination of operating states in network monitoring and traffic management.

    – Sensing: Sensors are common in daily life to measure physical quantities and transform physical quantities into readable digital signals for subsequent processing (and storage). Sensory data may be classified as sound wave, voice, vibration, automobile, chemical, current, weather, pressure, temperature, etc. Sensed information is transferred to a data collection point through wired or wireless networks. For applications that may be easily deployed and managed, e.g., a video surveillance system [38], the wired sensor network is a convenient solution to acquire related information. Sometimes the accurate position of a specific phenomenon is unknown, and sometimes the monitored environment does not have the energy or communication infrastructures. Then wireless communication must be used to enable data transmission among sensor nodes under limited energy and communication capability. In recent years, WSNs have received considerable interest and have been applied to many applications, such as environmental research [39, 40], water quality monitoring [41], civil engineering [42, 43], and wildlife habitat monitoring [44]. A WSN generally consists of a large number of geographically distributed sensor nodes, each being a micro device powered by battery. Such sensors are deployed at designated positions as required by the application to collect remote sensing data. Once the sensors are deployed, the base station will send control information for network configuration/management or data collection to sensor nodes. Based on such control information, the sensory data is assembled in different sensor nodes and sent back to the base station for further processing. Interested readers are referred to [45] for more detailed discussions.

    – Methods for acquiring network data: At present, network data acquisition is accomplished using a combination of web crawler, word segmentation system, task system, index system, etc. A web crawler is a program used by search engines for downloading and storing web pages [46]. Generally speaking, a web crawler starts from the uniform resource locator (URL) of an initial web page to access other linked web pages, during which it stores and sequences all the retrieved URLs. The web crawler acquires a URL in the order of precedence through a URL queue, then downloads the web page, identifies all URLs in the downloaded web page, and extracts new URLs to be put in the queue. This process is repeated until the web crawler is stopped. Data acquisition through a web crawler is widely applied in applications based on web pages, such as search engines or web caching. Traditional web page extraction technologies feature multiple efficient solutions, and considerable research has been done in this field. As more advanced web page applications are emerging, some extraction strategies are proposed in [47] to cope with rich Internet applications.

    The current network data acquisition technologies mainly include traditional Libpcap-based packet capture technology, zero-copy packet capture technology, as well as some specialized network monitoring software such as Wireshark, SmartSniff, and WinNetCap.

    – Libpcap-based packet capture technology: Libpcap (packet capture library) is a widely used network data packet capture function library. It is a general tool that does not depend on any specific system and is mainly used to capture data in the data link layer. It features simplicity, ease of use, and portability, but has a relatively low efficiency. Therefore, under a high-speed network environment, considerable packet losses may occur when Libpcap is used.


    – Zero-copy packet capture technology: The so-called zero-copy (ZC) means that no copies between any internal memories occur during packet receiving and sending at a node. In sending, the data packets directly start from the user buffer of applications, pass through the network interfaces, and arrive at an external network. In receiving, the network interfaces directly send data packets to the user buffer. The basic idea of zero-copy is to reduce data copy times, reduce system calls, and reduce CPU load while datagrams are passed from network equipment to user program space. The zero-copy technology first utilizes direct memory access (DMA) technology to directly transmit network datagrams to an address space pre-allocated by the system kernel, so as to avoid the participation of the CPU. In the meanwhile, it maps the internal memory of the datagrams in the system kernel to that of the detection program, or builds a cache region in the user space and maps it to the kernel space. Then the detection program directly accesses the internal memory, so as to reduce internal memory copies from system kernel to user space and reduce the number of system calls.

    – Mobile equipment: At present, mobile devices are more widely used. As mobile device functions become increasingly stronger, they feature more complex and multiple means of data acquisition as well as a greater variety of data. Mobile devices may acquire geographical location information through positioning systems; acquire audio information through microphones; acquire pictures, videos, streetscapes, two-dimensional barcodes, and other multimedia information through cameras; and acquire user gestures and other body language information through touch screens and gravity sensors. Over the years, wireless operators have improved the service level of the mobile Internet by acquiring and analyzing such information. For example, the iPhone itself is a “mobile spy.” It may collect wireless data and geographical location information, and then send such information back to Apple Inc. for processing, of which the user is not aware. Apart from Apple, smart phone operating systems such as Android of Google and Windows Phone of Microsoft can also collect information in a similar manner.
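    The URL-queue loop of the web crawler described in the list above can be sketched as follows. This is only an illustrative sketch under stated assumptions: the pages live in a hypothetical in-memory dictionary (PAGES) instead of being downloaded over HTTP, links are extracted with a simple regular expression, and the function name crawl and its parameters are invented for this example.

```python
import re
from collections import deque

# Hypothetical in-memory "web": URL -> HTML. A real crawler would download
# each page over the network instead of looking it up in a dictionary.
PAGES = {
    "http://example.org/": '<a href="http://example.org/a">A</a> <a href="http://example.org/b">B</a>',
    "http://example.org/a": '<a href="http://example.org/b">B</a>',
    "http://example.org/b": '<a href="http://example.org/">home</a> <a href="http://example.org/c">C</a>',
    "http://example.org/c": "",
}

HREF = re.compile(r'href="([^"]+)"')


def crawl(seed, fetch, max_pages=100):
    """Visit pages in queue order, starting from the seed URL."""
    queue = deque([seed])   # URLs waiting to be downloaded
    seen = {seed}           # every URL that was ever queued
    stored = {}             # downloaded pages, keyed by URL
    while queue and len(stored) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:    # page could not be retrieved
            continue
        stored[url] = html
        for link in HREF.findall(html):   # identify URLs in the page
            if link not in seen:          # queue only new URLs
                seen.add(link)
                queue.append(link)
    return stored


result = crawl("http://example.org/", PAGES.get)
print(list(result))
```

    The seen set keeps the crawler from queueing the same URL twice, and max_pages plays the role of the stop condition mentioned in the text.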

    In addition to the aforementioned data acquisition methods of main data sources, there are many other data collection methods or systems. For example, in scientific experiments, many special tools can be used to collect experimental data, such as magnetic spectrometers and radio telescopes. We may classify data collection methods from different perspectives. From the perspective of data sources, data collection methods can be classified into two categories: collection methods recording through data sources and collection methods recording through other auxiliary tools.
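    As an illustration of the log-file method listed at the start of this section, the sketch below parses lines in the NCSA common log format and counts requests per path. The sample lines, field names, and helper functions are assumptions made for this example; production logs often carry extra fields (for instance in the combined format) that this pattern ignores.

```python
import re
from collections import Counter

# NCSA common log format: host ident authuser [date] "request" status bytes
CLF = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

# Hypothetical sample log; the last line is deliberately malformed.
SAMPLE_LOG = """\
192.0.2.1 - - [10/Oct/2013:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326
192.0.2.2 - - [10/Oct/2013:13:55:40 +0000] "GET /about.html HTTP/1.1" 200 1024
192.0.2.1 - - [10/Oct/2013:13:56:02 +0000] "GET /index.html HTTP/1.1" 304 -
this line is malformed and is skipped
"""


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = CLF.match(line)
    return match.groupdict() if match else None


def hits_per_path(log_text):
    """Count how many well-formed requests were made for each path."""
    counts = Counter()
    for line in log_text.splitlines():
        record = parse_line(line)
        if record is not None:
            counts[record["path"]] += 1
    return counts


counts = hits_per_path(SAMPLE_LOG)
print(counts.most_common())
```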

    3.2.2 Data transportation

    Upon the completion of raw data collection, data will be transferred to a data storage infrastructure for processing and analysis. As discussed in Section 2.3, big data is mainly stored in a data center. The data layout should be adjusted to improve computing efficiency or facilitate hardware maintenance. In other words, internal data transmission may occur in the data center. Therefore, data transmission consists of two phases: Inter-DCN transmissions and Intra-DCN transmissions.

    – Inter-DCN transmissions: Inter-DCN transmissions are from data source to data center, which is generally achieved with the existing physical network infrastructure. Because of the rapid growth of traffic demands, the physical network infrastructure in most regions around the world is constituted by high-volume, high-rate, and cost-effective optical fiber transmission systems. Over the past 20 years, advanced management equipment and technologies have been developed, such as IP-based wavelength division multiplexing (WDM) network architecture, to conduct smart control and management of optical fiber networks [48, 49]. WDM is a technology that multiplexes multiple optical carrier signals with different wavelengths and couples them to the same optical fiber of the optical link. In such technology, lasers with different wavelengths carry different signals. So far, backbone networks have been deployed with WDM optical transmission systems with a single-channel rate of 40Gb/s. At present, 100Gb/s commercial interfaces are available, and 100Gb/s systems (or Tb/s systems) will be available in the near future [50]. However, traditional optical transmission technologies are limited by the bandwidth of the electronic bottleneck [51]. Recently, orthogonal frequency-division multiplexing (OFDM), initially designed for wireless systems, is regarded as one of the main candidate technologies for future high-speed optical transmission. OFDM is a multi-carrier parallel transmission technology. It segments a high-speed data flow to transform it into low-speed sub-data-flows to be transmitted over multiple orthogonal sub-carriers [52]. Compared with the fixed channel spacing of WDM, OFDM allows sub-channel frequency spectrums to overlap with each other [53]. Therefore, it is a flexible and efficient optical networking technology.

    – Intra-DCN transmissions: Intra-DCN transmissions are the data communication flows within data centers. Intra-DCN transmissions depend on the communication mechanism within the data center (i.e., on physical connection plates, chips, internal memories of data servers, network architectures of data centers, and communication protocols). A data center consists of multiple integrated server racks interconnected with its internal connection networks. Nowadays, the internal connection networks of most data centers are fat-tree, two-layer, or three-layer structures based on multi-commodity network flows [51, 54]. In the two-layer topological structure, the racks are connected by 1Gbps top rack switches (TOR), and then such top rack switches are connected with 10Gbps aggregation switches. The three-layer topological structure is a structure augmented with one layer on top of the two-layer topological structure, and such layer is constituted by 10Gbps or 100Gbps core switches to connect the aggregation switches. There are also other topological structures which aim to improve data center networks [55–58]. Because of the inadequacy of electronic packet switches, it is difficult to increase communication bandwidths while keeping energy consumption low. Over the years, due to the huge success achieved by optical technologies, the optical interconnection among the networks in data centers has drawn great interest. Optical interconnection is a high-throughput, low-delay, and low-energy-consumption solution. At present, optical technologies are only used for point-to-point links in data centers. Such optical links provide connection for the switches using low-cost multi-mode fiber (MMF) with a 10Gbps data rate. Optical interconnection (switching in the optical domain) of networks in data centers is a feasible solution, which can provide Tbps-level transmission bandwidth with low energy consumption. Recently, many optical interconnection plans have been proposed for data center networks [59]. Some plans add optical paths to upgrade the existing networks, and other plans completely replace the current switches [59–64]. As a strengthening technology, Zhou et al. in [65] adopt wireless links in the 60GHz frequency band to strengthen wired links. Network virtualization should also be considered to improve the efficiency and utilization of data center networks.
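    To make the bandwidth limits of the two-layer structure described above concrete, the following sketch computes the oversubscription ratio at a top rack switch. Only the 1Gbps and 10Gbps link rates come from the text; the number of servers per rack and the number of uplinks are hypothetical values picked for illustration.

```python
def oversubscription(downlinks, downlink_gbps, uplinks, uplink_gbps):
    """Ratio of total server-facing bandwidth to total uplink bandwidth."""
    return (downlinks * downlink_gbps) / (uplinks * uplink_gbps)


# Hypothetical rack: 40 servers on 1Gbps ports, 2 uplinks of 10Gbps each
# towards the aggregation switches.
tor_ratio = oversubscription(downlinks=40, downlink_gbps=1, uplinks=2, uplink_gbps=10)

# If every server sends traffic out of the rack at once, each one can count
# on only this share of its 1Gbps port.
worst_case_gbps_per_server = 1 / tor_ratio

print(tor_ratio, worst_case_gbps_per_server)
```

    A ratio above 1 means the uplinks cannot carry the full load of the rack, which is one reason higher-bandwidth interconnects inside data centers attract interest.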

    3.2.3 Data pre-processing

    Because of the wide variety of data sources, the collected datasets vary with respect to noise, redundancy, consistency, etc., and it is undoubtedly a waste to store meaningless data. In addition, some analytical methods have strict requirements on data quality. Therefore, in order to enable effective data analysis, we shall pre-process data under many circumstances to integrate the data from different sources, which can not only reduce storage expense but also improve analysis accuracy. Some relational data pre-processing techniques are discussed as follows.

    – Integration: data integration is the cornerstone of modern commercial informatics, which involves the combination of data from different sources and provides users with a uniform view of data [66]. This is a mature research field for traditional databases. Historically, two methods have been widely recognized: data warehouse and data federation. Data warehousing includes a process named ETL (Extract, Transform, and Load). Extraction involves connecting source systems, selecting, collecting, analyzing, and processing necessary data. Transformation is the execution of a series of rules to transform the extracted data into standard formats. Loading means importing extracted and transformed data into the target storage infrastructure. Loading is the most complex procedure among the three, which includes operations such as transformation, copy, clearing, standardization, screening, and data organization. A virtual database can be built to query and aggregate data from different data sources, but such a database does not contain data. On the contrary, it includes information or metadata related to the actual data and its positions. Such two “storage-reading” approaches do not satisfy the high performance requirements of data flows or search programs and applications. Compared with queries, data in such two approaches is more dynamic and must be processed during data transmission. Generally, data integration methods are accompanied by flow processing engines and search engines [30, 67].

    – Cleaning: data cleaning is a process to identify inaccurate, incomplete, or unreasonable data, and then modify or delete such data to improve data quality. Generally, data cleaning includes five complementary procedures [68]: defining and determining error types, searching and identifying errors, correcting errors, documenting error examples and error types, and modifying data entry procedures to reduce future errors. During cleaning, data formats, completeness, rationality, and restriction shall be inspected. Data cleaning is of vital importance to keep data consistency, and it is widely applied in many fields, such as banking, insurance, retail industry, telecommunications, and traffic control.

    In e-commerce, most data is electronically collected, which may have serious data quality problems. Classic data quality problems mainly come from software defects, customized errors, or system mis-configuration. Authors in [69] discussed data cleaning in e-commerce by crawlers and regularly re-copying customer and account information.

    In [70], the problem of cleaning RFID data was examined. RFID is widely used in many applications, e.g., inventory management and target tracking. However, the original RFID data features low quality and includes a lot of abnormal data, limited by the physical design and affected by environmental noise. In [71], a probability model was developed to cope with data loss in mobile environments. Khoussainova et al. in [72] proposed a system to automatically correct errors of input data by defining global integrity constraints.

    Herbert et al. [73] proposed a framework called BIO-AJAX to standardize biological data so as to conduct further computation and improve search quality. With BIO-AJAX, some errors and repetitions may be eliminated, and common data mining technologies can be executed more effectively.

    – Redundancy elimination: data redundancy refers to data repetitions or surplus, which usually occurs in many datasets. Data redundancy can increase unnecessary data transmission expense and cause defects in storage systems, e.g., waste of storage space, data inconsistency, reduction of data reliability, and data damage. Therefore, various redundancy reduction methods have been proposed, such as redundancy detection, data filtering, and data compression. Such methods may apply to different datasets or application environments. However, redundancy reduction may also bring about certain negative effects. For example, data compression and decompression cause additional computational burden. Therefore, the benefits of redundancy reduction and the cost should be carefully balanced. Data collected from different fields will increasingly appear in image or video formats. It is well known that images and videos contain considerable redundancy, including temporal redundancy, spatial redundancy, statistical redundancy, and sensing redundancy. Video compression is widely used to reduce redundancy in video data, as specified in the many video coding standards (MPEG-2, MPEG-4, H.263, and H.264/AVC). In [74], the authors investigated the problem of video compression in a video surveillance system with a video sensor network. The authors propose a new MPEG-4 based method by investigating the contextual redundancy related to background and foreground in a scene. The low complexity and the low compression ratio of the proposed approach were demonstrated by the evaluation results.

    On generalized data transmission or storage, repeated data deletion (i.e., data deduplication) is a special data compression technology, which aims to eliminate repeated data copies [75]. With repeated data deletion, individual data blocks or data segments will be assigned identifiers (e.g., using a hash algorithm) and stored, with the identifiers added to the identification list. As the analysis of repeated data deletion continues, if a new data block has an identifier that is identical to one listed in the identification list, the new data block will be deemed redundant and will be replaced by the corresponding stored data block. Repeated data deletion can greatly reduce storage requirements, which is particularly important to a big data storage system. Apart from the aforementioned data pre-processing methods, specific data objects shall go through some other operations, such as feature extraction. Such operation plays an important role in multimedia search and DNA analysis [76–78]. Usually, high-dimensional feature vectors (or high-dimensional feature points) are used to describe such data objects, and the system stores the dimensional feature vectors for future retrieval. Data transfer is usually used to process distributed heterogeneous data sources, especially business datasets [79]. As a matter of fact, in consideration of various datasets, it is non-trivial or impossible to build a uniform data pre-processing procedure and technology that is applicable to all types of datasets. The specific features, problems, performance requirements, and other factors of the datasets should be considered, so as to select a proper data pre-processing strategy.
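    The repeated data deletion procedure described above can be sketched as a small block store that keeps each distinct block once, keyed by a hash identifier. This is a minimal sketch under stated assumptions: fixed-size blocks, SHA-256 as the hash algorithm, and a hypothetical class name DedupStore; real systems differ in how they split data into blocks and how they guard against hash collisions.

```python
import hashlib


class DedupStore:
    """Toy block store that keeps one copy of every distinct block."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}   # identification list: identifier -> stored block

    def put(self, data: bytes):
        """Store data and return the list of block identifiers that rebuilds it."""
        recipe = []
        for start in range(0, len(data), self.block_size):
            block = data[start:start + self.block_size]
            ident = hashlib.sha256(block).hexdigest()
            if ident not in self.blocks:   # new block: store it
                self.blocks[ident] = block
            recipe.append(ident)           # redundant block: reference only
        return recipe

    def get(self, recipe):
        """Rebuild the original data from its block identifiers."""
        return b"".join(self.blocks[ident] for ident in recipe)


store = DedupStore(block_size=4)
original = b"ABCDABCDABCDWXYZ"
recipe = store.put(original)

print(len(recipe), "blocks referenced,", len(store.blocks), "blocks stored")
```

    In this run four blocks are referenced but only two are stored, because the block ABCD occurs three times.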

    4 Big data storage

    The explosive growth of data imposes stricter requirements on storage and management. In this section, we focus on the storage of big data. Big data storage refers to the storage and management of large-scale datasets while achieving reliability and availability of data access. We will review important issues including massive storage systems, distributed storage systems, and big data storage mechanisms. On one hand, the storage infrastructure needs to provide information storage service with reliable storage space; on the other hand, it must provide a powerful access interface for query and analysis of a large amount of data.

    Traditionally, as auxiliary equipment of servers, data storage devices are used to store, manage, look up, and analyze data with structured RDBMSs. With the sharp growth of data, data storage devices are becoming increasingly important, and many Internet companies pursue large storage capacity to be competitive. Therefore, there is a compelling need for research on data storage.


    4.1 Storage system for massive data

    Various storage systems emerge to meet the demands of massive data. Existing massive storage technologies can be classified as Direct Attached Storage (DAS) and network storage, while network storage can be further classified into Network Attached Storage (NAS) and Storage Area Network (SAN).

    In DAS, various hard disks are directly connected with servers, and data management is server-centric, such that storage devices are peripheral equipment, each of which takes a certain amount of I/O resource and is managed by individual application software. For this reason, DAS is only suitable for interconnecting servers at a small scale. Moreover, due to its low scalability, DAS will exhibit undesirable efficiency when the storage capacity is increased, i.e., the upgradeability and expandability are greatly limited. Thus, DAS is mainly used in personal computers and small-sized servers.

    Network storage is to utilize the network to provide users with a unified interface for data access and sharing. Network storage equipment includes special data exchange equipment, disk arrays, tape libraries, and other storage media, as well as special storage software. It is characterized by strong expandability.

    NAS is actually auxiliary storage equipment of a network. It is directly connected to a network through a hub or switch using TCP/IP protocols. In NAS, data is transmitted in the form of files. Compared to DAS, the I/O burden at a NAS server is reduced extensively, since the server accesses a storage device indirectly through a network.

    While NAS is network-oriented, SAN is especially designed for data storage with a scalable and bandwidth-intensive network, e.g., a high-speed network with optical fiber connections. In SAN, data storage management is relatively independent within a storage local area network, where multipath-based data switching among any internal nodes is utilized to achieve a maximum degree of data sharing and data management.

    From the organization of a data storage system, DAS, NAS, and SAN can all be divided into three parts: (i) the disk array, which is the foundation of a storage system and the fundamental guarantee for data storage; (ii) connection and network sub-systems, which provide connection among one or more disk arrays and servers; (iii) storage management software, which handles data sharing, disaster recovery, and other storage management tasks of multiple servers.

    4.2 Distributed storage system

    The first challenge brought about by big data is how to develop a large-scale distributed storage system for efficient data processing and analysis. To use a distributed system to store massive data, the following factors should be taken into consideration:

    – Consistency: a distributed storage system requires multiple servers to cooperatively store data. As there are more servers, the probability of server failures will be larger. Usually, data is divided into multiple pieces, and copies are stored at different servers to ensure availability in case of server failure. However, server failures and parallel storage may cause inconsistency among different copies of the same data. Consistency refers to assuring that multiple copies of the same data are identical.

    – Availability: a distributed storage system operates on multiple sets of servers. As more servers are used, server failures are inevitable. It would be desirable if the entire system is not seriously affected and can still satisfy customers' read and write requests. This property is called availability.

    – Partition Tolerance: multiple servers in a distributed storage system are connected by a network. The network could have link/node failures or temporary congestion. The distributed system should have a certain level of tolerance to problems caused by network failures. It would be desirable that the distributed storage still works well when the network is partitioned.

    Eric Brewer proposed the CAP theory [80, 81] in 2000, which indicated that a distributed system could not simultaneously meet the requirements on consistency, availability, and partition tolerance; at most two of the three requirements can be satisfied simultaneously. Seth Gilbert and Nancy Lynch from MIT proved the correctness of the CAP theory in 2002. Since consistency, availability, and partition tolerance could not be achieved simultaneously, we can have a CA system by ignoring partition tolerance, a CP system by ignoring availability, and an AP system that ignores consistency, according to different design goals. The three systems are discussed in the following.

    CA systems do not have partition tolerance, i.e., they could not handle network failures. Therefore, CA systems are generally deemed as storage systems with a single server, such as traditional small-scale relational databases. Such systems feature a single copy of data, such that consistency is easily ensured. Availability is guaranteed by the excellent design of relational databases. However, since CA systems could not handle network failures, they could not be expanded to use many servers. Therefore, most large-scale storage systems are CP systems and AP systems.

    Compared with CA systems, CP systems ensure partition tolerance. Therefore, CP systems can be expanded to become distributed systems. CP systems generally maintain several copies of the same data in order to ensure a level of fault tolerance. CP systems also ensure data consistency, i.e., multiple copies of the same data are guaranteed to be completely identical. However, CP systems could not ensure sound availability because of the high cost of consistency assurance. Therefore, CP systems are useful for scenarios with moderate load but stringent requirements on data accuracy (e.g., trading data). BigTable and HBase are two popular CP systems.

    AP systems also ensure partition tolerance. However, AP systems differ from CP systems in that they also ensure availability, while providing only eventual consistency rather than the strong consistency of the previous two systems. Therefore, AP systems only apply to scenarios with frequent requests but not very high requirements on accuracy. For example, in online Social Networking Services (SNS) systems, there are many concurrent visits to the data, but a certain amount of data errors is tolerable. Furthermore, because AP systems ensure eventual consistency, accurate data can still be obtained after a certain amount of delay. Therefore, AP systems may also be used under circumstances with no stringent real-time requirements. Dynamo and Cassandra are two popular AP systems.
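
    The difference between CP and AP behavior under a network partition can be illustrated with a toy model. The following is a minimal sketch, not a description of any real system: the class names (TwoReplicaRegister, Unavailable) are hypothetical, and last-writer-wins reconciliation is an assumption chosen only to keep the example short.

```python
class Unavailable(Exception):
    """Raised by the CP register when it refuses a write during a partition."""


class Replica:
    def __init__(self):
        self.value = None
        self.version = 0  # logical timestamp of the last accepted write


class TwoReplicaRegister:
    """A single value stored on two replicas (hypothetical toy model)."""

    def __init__(self, mode):
        assert mode in ("CP", "AP")
        self.mode = mode
        self.replicas = [Replica(), Replica()]
        self.partitioned = False
        self.clock = 0

    def write(self, replica_id, value):
        if self.partitioned and self.mode == "CP":
            # CP: give up availability rather than let the copies diverge.
            raise Unavailable("cannot reach the other replica")
        self.clock += 1
        if self.partitioned:
            # AP: accept the write locally; the other copy becomes stale.
            targets = [self.replicas[replica_id]]
        else:
            targets = self.replicas
        for replica in targets:
            replica.value, replica.version = value, self.clock

    def read(self, replica_id):
        return self.replicas[replica_id].value

    def heal(self):
        """End the partition and reconcile (assumed policy: last writer wins)."""
        self.partitioned = False
        newest = max(self.replicas, key=lambda r: r.version)
        for replica in self.replicas:
            replica.value, replica.version = newest.value, newest.version
```

    In this sketch, a CP register rejects writes while partitioned and its copies stay identical, whereas an AP register keeps accepting writes, serves stale reads from the other replica, and converges only after heal() is called.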

    4.3 Storage mechanism for big data

    Considerable research on big data promotes the development of storage mechanisms for big data. Existing storage mechanisms of big data may be classified into three bottom-up levels: (i) file systems, (ii) databases, and (iii) programming models.

    File systems are the foundation of the applications at upper levels. Google's GFS is an expandable distributed file system to support large-scale, distributed, data-intensive applications [25]. GFS uses cheap commodity servers to achieve fault tolerance and provides customers with high-performance services. GFS supports large-scale file applications with more frequent reading than writing. However, GFS also has some limitations, such as a single point of failure and poor performance for small files. Such limitations have been overcome by Colossus [82], the successor of GFS.

    In addition, other companies and researchers also have their solutions to meet the different demands for storage of big data. For example, HDFS and Kosmosfs are open-source derivatives of GFS. Microsoft developed Cosmos [83] to support its search and advertisement business. Facebook utilizes Haystack [84] to store its large amount of small-sized photos. Taobao also developed TFS and FastDFS. In conclusion, distributed file systems have become relatively mature after years of development and business operation. Therefore, we will focus on the other two levels in the rest of this section.

    4.3.1 Database technology

    Database technology has been evolving for more than 30 years. Various database systems have been developed to handle datasets at different scales and to support various applications. Traditional relational databases cannot meet the challenges on categories and scales brought about by big data. NoSQL databases (i.e., databases other than traditional relational ones) are becoming more popular for big data storage. NoSQL databases feature flexible schemas, simple and easy replication, simple APIs, eventual consistency, and support for large volumes of data, and they are becoming the core technology for big data. We will examine the following three main kinds of NoSQL databases in this section: key-value databases, column-oriented databases, and document-oriented databases, each based on certain data models.

    – Key-value Databases: Key-value databases are built on a simple data model in which data is stored as key-value pairs. Every key is unique, and customers may query values according to their keys. Such databases feature a simple structure, and modern key-value databases are characterized by higher expandability and shorter query response times than relational databases. Over the past few years, many key-value databases have appeared, motivated by Amazon's Dynamo system [85]. We will introduce Dynamo and several other representative key-value databases.

        – Dynamo: Dynamo is a highly available and expandable distributed key-value data storage system. It is used to store and manage the status of some core services in the Amazon e-Commerce Platform that can be realized with key access. The public mode of relational databases may generate invalid data and limit data scale and availability, while Dynamo can resolve these problems with a simple key-object interface, which consists of simple read and write operations. Dynamo achieves elasticity and availability through its data partition, data replication, and object versioning mechanisms. The Dynamo partition plan, which divides the load among multiple main storage machines, relies on Consistent Hashing [86], whose main advantage is that the arrival or departure of a node only affects its directly adjacent nodes and does not affect other nodes. Dynamo copies data to N sets of servers, where N is a configurable parameter, in order to achieve high availability and durability. The Dynamo system also provides eventual consistency, so as to conduct asynchronous updates on all copies.

        – Voldemort: Voldemort is also a key-value storage system, which was initially developed for and is still used by LinkedIn. Keys and values in Voldemort are composite objects constituted by tables and images. The Voldemort interface includes three simple operations: reading, writing, and deletion, all of which are identified by keys. Voldemort provides asynchronous updating with concurrency control of multiple versions but does not ensure data consistency. However, Voldemort supports optimistic locking for consistent multi-record updating: when a conflict happens between an update and any other operation, the update operation will quit. The data replication mechanism of Voldemort is the same as that of Dynamo. Voldemort not only stores data in RAM but also allows data to be inserted into a storage engine. In particular, Voldemort supports two storage engines, Berkeley DB and Random Access Files.

    Key-value databases emerged a few years ago. Deeply influenced by Amazon Dynamo, other key-value storage systems include Redis, Tokyo Cabinet and Tokyo Tyrant, Memcached and MemcacheDB, Riak, and Scalaris, all of which provide expandability by distributing keys among nodes. Voldemort, Riak, Tokyo Cabinet, and Memcached can utilize attached storage devices to store data in RAM or on disks. Other storage systems store data in RAM and provide disk backup, or rely on replication and recovery to avoid backup.
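
    The partitioning and replication scheme described for Dynamo can be sketched as a consistent-hash ring. This is an illustrative sketch only: the HashRing class, the use of MD5, and the number of virtual nodes are assumptions made for the example and are not taken from Dynamo itself.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a position on the ring (MD5 is an arbitrary choice)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes (hypothetical helper class)."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node name)
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def preference_list(self, key, n_replicas=3):
        """Return the first n_replicas distinct nodes clockwise from hash(key)."""
        if not self._ring:
            return []
        start = bisect.bisect(self._ring, (_hash(key), ""))
        owners = []
        for i in range(len(self._ring)):
            node = self._ring[(start + i) % len(self._ring)][1]
            if node not in owners:
                owners.append(node)
                if len(owners) == n_replicas:
                    break
        return owners
```

    With this layout, removing a node reassigns only the keys that the node owned; keys whose first owner is another node stay where they are, which is the property the text attributes to consistent hashing. The parameter n_replicas plays the role of the configurable N.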

    – Column-oriented Databases: Column-oriented databases store and process data by columns rather than rows. Both columns and rows are segmented across multiple nodes to realize expandability. Column-oriented databases are mainly inspired by Google's BigTable. In this section, we first discuss BigTable and then introduce several derivative tools.

        – BigTable: BigTable is a distributed, structured data storage system, which is designed to process large-scale (PB-class) data among thousands of commodity servers [87]. The basic data structure of BigTable is a sparse, distributed, persistent, multi-dimensional sorted map. The map is indexed by row key, column key, and timestamp, and every value in the map is an uninterpreted byte array. Each row key in BigTable is a character string of up to 64 KB. Rows are stored in lexicographical order and continually segmented into Tablets (i.e., units of distribution) for load balancing. Thus, reading a short range of rows can be highly efficient, since it only involves communication with a small portion of machines. The columns are grouped according to the prefixes of keys, thus forming column families. These column families are the basic units for access control. The timestamps are 64-bit integers used to distinguish different versions of cell values. Clients may flexibly determine the number of cell versions stored. These versions are sorted in descending order of timestamps, so the latest version will always be read first.

        The BigTable API features the creation and deletion of tables and column families, as well as modification of the metadata of clusters, tables, and column families. Client applications may insert or delete values in BigTable, query values from columns, or browse sub-datasets in a table. BigTable also supports some other characteristics, such as transaction processing in a single row. Users may utilize such features to conduct more complex data processing.

        Every BigTable implementation includes three main components: the Master server, Tablet servers, and the client library. BigTable allows only one Master server to be active, which is responsible for assigning Tablets to Tablet servers, detecting added or removed Tablet servers, and conducting load balancing. In addition, the Master handles changes to the BigTable schema, e.g., creating tables and column families, and garbage-collects the deleted or disabled files saved in GFS that were used by specific BigTable instances. Every Tablet server manages a set of Tablets and is responsible for the reading and writing of its loaded Tablets. When Tablets grow too big, they are segmented by the server. The application client library is used to communicate with BigTable instances.

        BigTable is based on many fundamental components of Google, including GFS [25], a cluster management system, the SSTable file format, and Chubby [88]. GFS is used to store data and log files. The cluster management system is responsible for task scheduling, resource sharing, processing of machine failures, and monitoring of machine statuses. The SSTable file format is used to store BigTable data internally, and it provides a persistent, ordered, and immutable mapping from keys to values, both of which are arbitrary byte strings. BigTable utilizes Chubby for the following tasks: (1) ensuring that there is at most one active Master copy at any time; (2) storing the bootstrap location of BigTable data; (3) looking up Tablet servers; (4) conducting error recovery in case of Tablet server failures; (5) storing BigTable schema information; and (6) storing the access control table.

        – Cassandra: Cassandra is a distributed storage system to manage the huge amount of structured data distributed among multiple commodity servers [89]. The system was developed by Facebook and became an open-source tool in 2008. It adopts the ideas and concepts of both Amazon Dynamo and Google BigTable, especially integrating the distributed system technology of Dynamo with the BigTable data model. Tables in Cassandra take the form of a distributed four-dimensional structured map, where the four dimensions are row, column, column family, and super column. A row is distinguished by a string key of arbitrary length. No matter how many columns are read or written, the operation on a row is atomic. Columns may be grouped into clusters called column families, which are similar to those in the data model of BigTable. Cassandra provides two kinds of column families: column families and super columns. A super column includes an arbitrary number of columns related to the same name. A column family includes columns and super columns, which may be continuously inserted into the column family during runtime. The partition and replication mechanisms of Cassandra are very similar to those of Dynamo, so as to achieve consistency.

        – Derivative tools of BigTable: since the BigTable code cannot be obtained under an open-source license, some open-source projects compete to implement the BigTable concept and develop similar systems, such as HBase and HyperTable.

        HBase is a BigTable clone programmed in Java and is a part of Apache's Hadoop MapReduce framework [90]. HBase replaces GFS with HDFS. It writes updated contents into RAM and regularly writes them into files on disks. Row operations are atomic and are equipped with row-level locking and transaction processing, with optional support for transactions of larger scope. Partition and distribution are operated transparently, without requiring client-side hashing or a fixed key space.

        HyperTable was developed, similar to BigTable, as a high-performance, expandable, distributed storage and processing system for structured and unstructured data [91]. HyperTable relies on distributed file systems, e.g., HDFS, and a distributed lock manager. Its data representation, processing, and partition mechanisms are similar to those in BigTable. HyperTable has its own query language, called HyperTable Query Language (HQL), and allows users to create, modify, and query underlying tables.

    Since column-oriented storage databases mainly emulate BigTable, their designs are all similar, except for the concurrency mechanism and several other features. For example, Cassandra emphasizes weak consistency with concurrency control of multiple versions, while HBase and HyperTable focus on strong consistency through locks or log records.
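
    The BigTable-style data model described above, a sorted map indexed by row key, column key, and timestamp with versions kept in descending timestamp order, can be sketched in a few lines. This is an in-memory illustration under stated assumptions: the SparseTable class and its method names are hypothetical and do not correspond to the API of BigTable, HBase, or HyperTable.

```python
import bisect


class SparseTable:
    """Sorted map (row key, column key, timestamp) -> bytes (illustrative only)."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions  # number of cell versions to keep
        self._rows = {}       # row key -> {column key -> [(timestamp, value)]}
        self._row_keys = []   # row keys kept in lexicographic order

    def put(self, row, column, timestamp, value: bytes):
        if row not in self._rows:
            bisect.insort(self._row_keys, row)
            self._rows[row] = {}
        versions = self._rows[row].setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)  # newest first
        del versions[self.max_versions:]                   # drop oldest versions

    def get(self, row, column, timestamp=None):
        """Newest value at or before `timestamp` (newest overall if None)."""
        for ts, value in self._rows.get(row, {}).get(column, []):
            if timestamp is None or ts <= timestamp:
                return value
        return None

    def scan(self, start_row, end_row):
        """Yield (row, columns) for start_row <= row < end_row in key order."""
        lo = bisect.bisect_left(self._row_keys, start_row)
        hi = bisect.bisect_left(self._row_keys, end_row)
        for row in self._row_keys[lo:hi]:
            yield row, self._rows[row]
```

    Because row keys are kept sorted, a scan over a short key range touches only a contiguous slice of rows, which mirrors why short range reads involve only a small portion of machines once rows are split into Tablets.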

    – Document Databases: Compared with key-value storage, document storage can support more complex data forms. Since documents do not follow strict schemas, there is no need to conduct schema migration. In addition, key-value pairs can still be saved. We will examine three important representatives of document storage systems, i.e., MongoDB, SimpleDB, and CouchDB.

        – MongoDB: MongoDB is an open-source, document-oriented database [92]. MongoDB stores documents as Binary JSON (BSON) objects [93], which are similar to JSON objects. Every document has an ID field as the primary key. Queries in MongoDB are expressed with a syntax similar to JSON. A database driver sends the query as a BSON object to MongoDB. The system allows queries on all documents, including embedded objects and arrays. To enable rapid queries, indexes can be created on the queryable fields of documents. Replication in MongoDB is executed with log files on the master nodes that record all the high-level operations conducted in the database. During replication, the slaves query the master for all the write operations since the last synchronization and execute the operations from the log files in their local databases. MongoDB supports horizontal expansion with automatic sharding to distribute data among thousands of nodes, with automatic load balancing and failover.


        – SimpleDB: SimpleDB is a distributed database and a web service of Amazon [94]. Data in SimpleDB is organized into various domains, in which data may be stored, acquired, and queried. Domains include different properties and name/value pair sets of items. Data is copied to different machines at different data centers in order to ensure data safety and improve performance. The system does not support automatic partition and thus could not be expanded with the change of data volume. SimpleDB allows users to query with SQL. It is worth noting that SimpleDB can assure eventual consistency but does not support Multi-Version Concurrency Control (MVCC). Therefore, conflicts therein could not be detected from the client side.

        – CouchDB: Apache CouchDB is a document-oriented database written in Erlang [95]. Data in CouchDB is organized into documents consisting of fields named by keys/names and values, which are stored and accessed as JSON objects. Every document is provided with a unique identifier. CouchDB allows access to database documents through a RESTful HTTP API. If a document needs to be modified, the client must download the entire document, modify it, and then send it back to the database. After a document is rewritten once, its revision identifier will be updated. CouchDB utilizes optimistic replication to obtain scalability without a sharding mechanism. Since various CouchDB instances may be executed along with other transactions simultaneously, any kind of replication topology can be built. The consistency of CouchDB relies on the replication mechanism. CouchDB supports MVCC with historical hash records.
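
    The read-modify-write cycle with revision checking described for document stores can be sketched as follows. This is a simplified in-memory illustration, not the CouchDB API: the DocumentStore and Conflict names are hypothetical, and the revision format (a generation counter plus a short content hash) is an assumption loosely modeled on the description above.

```python
import hashlib
import json


class Conflict(Exception):
    """Raised when an update is based on a stale revision."""


class DocumentStore:
    """In-memory document store with revision checks (hypothetical sketch)."""

    def __init__(self):
        self._docs = {}  # document id -> (revision, document dict)

    @staticmethod
    def _revision(generation, doc):
        body = json.dumps(doc, sort_keys=True).encode("utf-8")
        return f"{generation}-{hashlib.sha1(body).hexdigest()[:8]}"

    def get(self, doc_id):
        rev, doc = self._docs[doc_id]
        return rev, dict(doc)  # hand out a copy so callers cannot mutate state

    def put(self, doc_id, doc, rev=None):
        """Store the whole document; `rev` must match the current revision."""
        current = self._docs.get(doc_id)
        current_rev = current[0] if current else None
        if rev != current_rev:
            raise Conflict(f"expected {current_rev!r}, got {rev!r}")
        generation = int(current_rev.split("-")[0]) + 1 if current_rev else 1
        new_rev = self._revision(generation, doc)
        self._docs[doc_id] = (new_rev, dict(doc))
        return new_rev
```

    A client that fetched an older revision and tries to write it back receives a Conflict, which is the kind of client-side conflict detection that the text says MVCC makes possible.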

    4.3.2 Programming models

    Big data are generally stored in hundreds and even thousands of commodity servers. Thus, traditional parallel models, such as Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), may not be adequate to support such large-scale parallel programs. Recently, some proposed parallel programming models effectively improve the performance of NoSQL and reduce the performance gap to relational databases. Therefore, these models have become the cornerstone for the analysis of massive data.

    – MapReduce: MapReduce [22] is a simple but powerful programming model for large-scale computing, using a large number of clusters of commodity PCs to achieve automatic parallel processing and distribution. In MapReduce, the computing model only has two functions, i.e., Map and Reduce, both of which are programmed by users. The Map function processes input key-value pairs and generates intermediate key-value pairs. Then, MapReduce will combine all the intermediate values related to the same key and transmit them to the Reduce function, which further compresses the value set into a smaller set. MapReduce has the advantage that it avoids the complicated steps for developing parallel applications, e.g., data scheduling, fault tolerance, and inter-node communications. The user only needs to program the two functions to develop a parallel application. The initial MapReduce framework did not support multiple datasets in a task, which has been mitigated by some recent enhancements [96, 97].

    Over the past decades, programmers have become familiar with the advanced declarative language of SQL, often used in relational databases for task description and dataset analysis. However, the succinct MapReduce framework only provides two nontransparent functions, which cannot cover all the common operations. Therefore, programmers have to spend time on programming basic functions, which are typically hard to maintain and reuse. In order to improve programming efficiency, some advanced language systems have been proposed, e.g., Sawzall [98] of Google, Pig Latin [99] of Yahoo, Hive [100] of Facebook, and Scope [87] of Microsoft.
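
    The Map and Reduce functions described above can be illustrated with a single-process word count. This is a sketch of the programming model only: run_mapreduce is a hypothetical driver that performs the grouping step in memory, whereas a real framework distributes the work and handles scheduling and fault tolerance.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: emit an intermediate (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: compress all counts for one word into a single total."""
    yield word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Hypothetical single-process driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)  # group values by intermediate key
    results = []
    for out_key in sorted(groups):
        results.extend(reducer(out_key, groups[out_key]))
    return results
```

    The user supplies only map_fn and reduce_fn; everything inside run_mapreduce stands in for what the framework would do on the user's behalf.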

    – Dryad: Dryad [101] is a general-purpose distributed execution engine for processing parallel applications on coarse-grained data. The operational structure of Dryad is a directed acyclic graph, in which vertexes represent programs and edges represent data channels. Dryad executes operations on the vertexes in clusters and transmits data via data channels, including files, TCP connections, and shared-memory FIFOs. During operation, resources in a logic operation graph are automatically mapped to physical resources.

    The operation structure of Dryad is coordinated by a central program called the job manager, which can be executed in clusters or on workstations through the network. A job manager consists of two parts: (1) application code, which is used to build a job communication graph, and (2) program library code, which is used to arrange available resources. All kinds of data are directly transmitted between vertexes. Therefore, the job manager is only responsible for decision-making and does not obstruct any data transmission.

    In Dryad, application developers can flexibly choose any directed acyclic graph to describe the communication modes of the application and express data transmission mechanisms. In addition, Dryad allows vertexes to use any number of input and output datasets, while MapReduce supports only one input and one output set.


    DryadLINQ [102] is the advanced language of Dryad and is used to integrate the aforementioned SQL-like language execution environment.
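
    The idea of describing a job as a directed acyclic graph whose vertexes are programs and whose edges are data channels can be sketched as below. This is not the Dryad API: run_dag and its arguments are hypothetical, channels are modeled as plain in-memory values, and the sketch relies on graphlib from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(vertices, edges, inputs):
    """Run each vertex once all of its upstream vertexes have finished.

    vertices: name -> function taking a list of upstream outputs
    edges:    list of (source, destination) pairs, i.e. the data channels
    inputs:   name -> initial data for vertexes that read external input
    """
    preds = {name: [] for name in vertices}
    for src, dst in edges:
        preds[dst].append(src)
    results = {}
    # static_order() yields every vertex after all of its predecessors.
    for name in TopologicalSorter(preds).static_order():
        upstream = [results[p] for p in preds[name]]
        if name in inputs:
            upstream = [inputs[name]] + upstream
        results[name] = vertices[name](upstream)
    return results


# Example graph: two readers feed one merge vertex, which feeds two outputs.
example_vertices = {
    "read_a": lambda ins: ins[0],
    "read_b": lambda ins: ins[0],
    "merge": lambda ins: sorted(ins[0] + ins[1]),
    "total": lambda ins: sum(ins[0]),
    "largest": lambda ins: max(ins[0]),
}
example_edges = [
    ("read_a", "merge"),
    ("read_b", "merge"),
    ("merge", "total"),
    ("merge", "largest"),
]
```

    The merge vertex has two inputs and two consumers, which illustrates the point that a vertex may use any number of input and output datasets.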

    – All-Pairs: All-Pairs [103] is a system specially designed for biometrics, bio-informatics, and data mining applications. It focuses on comparing element pairs in two datasets by a given function. An All-Pairs problem can be expressed as a three-tuple (Set A, Set B, and Function F), in which Function F is utilized to compare all elements in Set A and Set B. The comparison result is an output matrix M, which is also called the Cartesian product or cross join of Set A and Set B.

    All-Pairs is implemented in four phases: system modeling, distribution of input data, batch job management, and result collection. In Phase I, an approximation model of system performance is built to evaluate how much CPU resource is needed and how to conduct job partition. In Phase II, a spanning tree is built for data transmissions, which allows the workload of every partition to retrieve input data effectively. In Phase III, after the data flow is delivered to proper nodes, the All-Pairs engine builds a batch-processing submission for jobs in partitions, while sequencing them in the batch processing system and formulating a node running command to acquire data. In the last phase, after the jobs of the batch processing system are completed, the extraction engine collects results and combines them in a proper structure, which is generally a single file list in which all results are put in order.
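
    The three-tuple (Set A, Set B, Function F) and the output matrix M can be made concrete with a small sketch. It covers only the comparison itself, not the four-phase distributed implementation; the all_pairs and hamming names are hypothetical, and splitting the work by rows over a thread pool is an assumption made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def all_pairs(set_a, set_b, func, workers=4):
    """Return matrix M with M[i][j] = func(set_a[i], set_b[j])."""

    def row(a):
        return [func(a, b) for b in set_b]

    # Each row of M is an independent partition of the job.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(row, set_a))


def hamming(x, y):
    """Example Function F: number of positions where two strings differ."""
    return sum(c1 != c2 for c1, c2 in zip(x, y))
```

    For instance, all_pairs(["ACGT", "AAAA"], ["ACGA", "TTTT", "ACGT"], hamming) compares every sequence in the first list with every sequence in the second and returns a 2 x 3 matrix of distances.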

    – Pregel: The Pregel [104] system of Google facilitates the processing of large-sized graphs, e.g., analysis of network graphs and social networking services. A computational task is expressed by a directed graph constituted by vertexes and directed edges. Every vertex is associated with a modifiable, user-defined value, and every directed edge associated with a source vertex is constituted by a user-defined value and the identifier of a target vertex. When the graph is built, the program conducts iterative calculations, called supersteps, among which global synchronization points are set, until algorithm completion and output completion. In every superstep, vertex computations are parallel, and every vertex executes the same user-defined function to express a given algorithm logic. Every vertex may modify its status and that of its output edges, receive messages sent in the previous superstep, send messages to other vertexes, and even modify the topological structure of the entire graph. Edges are not provided with corresponding computations. Every vertex may deactivate itself by suspension. When all vertexes are in an inactive status without any message to transmit, the entire program execution is completed. The Pregel program output is a set consisting of the values output from all the vertexes. Generally speaking, the input and output of a Pregel program are isomorphic directed graphs.
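
    The superstep model can be illustrated with a classic toy task, propagating the maximum vertex value through a graph. The following single-process sketch is an assumption-laden illustration rather than the Pregel API: pregel_max is a hypothetical function, and message passing is simulated with in-memory dictionaries.

```python
def pregel_max(values, out_edges, max_supersteps=50):
    """Propagate the maximum value along directed edges, superstep by superstep.

    values:    vertex -> initial value
    out_edges: vertex -> list of target vertexes
    """
    values = dict(values)
    active = set(values)                  # every vertex starts active
    inbox = {v: [] for v in values}
    for superstep in range(max_supersteps):
        outbox = {v: [] for v in values}
        for v in values:
            messages = inbox[v]
            if v not in active and not messages:
                continue                  # halted, and no message wakes it up
            new_value = max([values[v]] + messages)
            changed = new_value != values[v]
            values[v] = new_value
            if superstep == 0 or changed:
                for target in out_edges.get(v, []):
                    outbox[target].append(new_value)
            active.discard(v)             # vote to halt after computing
        inbox = outbox                    # global synchronization point
        if not any(inbox.values()):
            break                         # all halted, no messages in flight
    return values
```

    Messages written during one superstep are delivered only in the next one, and the loop ends when every vertex is inactive and no message remains, matching the termination condition described above.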

    Inspired by the above programming models, other research has also focused on programming models for more complex computational tasks, e.g., iterative computations [105, 106], fault-tolerant memory computations [107], incremental computations [108], and flow control decision-making related to data [109].

    5 Big data analysis

    The analysis of big data mainly involves analytical methods for traditional data and big data, analytical architecture for big data, and software used for mining and analysis of big data. Data analysis is the final and the most important phase in the value chain of big data, with the purpose of extracting useful values and providing suggestions or decisions. Different levels of potential values can be generated through the analysis of datasets in different fields [10]. However, data analysis is a broad area, which frequently changes and is extremely complex. In this section, we introduce the methods, architectures, and tools for big data analysis.

    5.1 Traditional data analysis

    Traditional data analysis means to use proper statistical methods to analyze massive data, to concentrate, extract, and refine useful data hidden in a batch of chaotic datasets, and to identify the inherent law of the subject matter, so as to maximize the value of data. Data analysis plays a huge guidance role in making development plans for a country, understanding customer demands for commerce, and predicting market trends for enterprises. Big data analysis can be deemed as the analysis technique for a special kind of data. Therefore, many traditional data analysis methods may still be utilized for big data analysis. Several representative traditional data analysis methods are examined in the following, many of which are from statistics and computer science.

    – Cluster Analysis: a statistical method for grouping objects, and specifically, classifying objects according to some features. Cluster analysis is used to differentiate objects with particular features and divide them into some categories (clusters) according to these features, such that objects in the same category will have high homogeneity while different categories will have high heterogeneity. Cluster analysis is an unsupervised learning method without training data.

    – Factor Analysis: basically targeted at describing the relation among many elements with only a few factors, i.e., grouping several closely related variables into a factor; the few factors are then used to reveal most of the information of the original data.


    – Correlation Analysis: an analytical method for determining the law of relations, such as correlation, correlative dependence, and mutual restriction, among observed phenomena, and accordingly conducting forecast and control. Such relations may be classified into two types: (i) function, reflecting the strict dependence relationship among phenomena, which is also called a definitive dependence relationship; and (ii) correlation, i.e., some undetermined or inexact dependence relations, in which a numerical value of one variable may correspond to several numerical values of the other variable, and such numerical values present a regular fluctuation around their mean values.

    – Regression Analysis: a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies dependence relationships among variables hidden by randomness. Regression analysis may turn complex and undetermined correlations among variables into simple and regular ones.

    – A/B Testing: also called bucket testing. It is a technology for determining how to improve target variables by comparing the tested groups. Big data will require a large number of tests to be executed and analyzed.

    – Statistical Analysis: Statistical analysis is based on statistical theory, a branch of applied mathematics. In statistical theory, randomness and uncertainty are modeled with probability theory. Statistical analysis can provide a description and an inference for big data. Descriptive statistical analysis can summarize and describe datasets, while inferential statistical analysis can draw conclusions from data subject to random variations. Statistical analysis is widely applied in the economic and medical care fields [110].

    – Data Mining Algorithms: Data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. In 2006, the IEEE International Conference on Data Mining Series (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [111], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, and CART. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and link mining, all of which are among the most important problems in data mining research.
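
    Two of the traditional methods listed above, correlation analysis and regression analysis, can be illustrated for the simplest case of two variables. The sketch below assumes ordinary least squares for the regression line and the Pearson coefficient for correlation; both choices, and the function names, are illustrative assumptions rather than methods prescribed by the text.

```python
import math


def mean(values):
    return sum(values) / len(values)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def least_squares(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx
```

    A coefficient close to 1 or -1 corresponds to the near-functional dependence described under correlation analysis, while values near 0 indicate that the fitted line explains little of the variation.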

    5.2 Big data analytic methods

    At the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data, so as to bring value to enterprises and individuals. At present, the main processing methods of big data are as follows:

    – Bloom Filter: A Bloom filter consists of a bit array and a series of hash functions. The principle of the Bloom filter is to store hash values of data rather than the data itself by utilizing a bit array, which is in essence a bitmap index that uses hash functions to conduct lossy compression storage of data. It has such advantages as high space efficiency and high query speed, but also has some disadvantages in misrecognition (it may report false positives) and deletion (stored elements cannot simply be removed).

    – Hashing: a method that essentially transforms data into shorter fixed-length numerical values or index values. Hashing has such advantages as rapid reading, rapid writing, and high query speed, but it is hard to find a sound hash function.

    – Index: an index is always an effective method to reduce the expense of disk reading and writing and to improve insertion, deletion, modification, and query speeds, in both traditional relational databases that manage structured data and other technologies that manage semi-structured and unstructured data. However, an index has the disadvantage of the additional cost of storing index files, which should be maintained dynamically when data is updated.

    – Trie: also called trie tree, a variant of Hash Tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of a trie is to utilize common prefixes of character strings to reduce comparisons on character strings to the greatest extent, so as to improve query efficiency.

    – Parallel Computing: compared to traditional serial computing, parallel computing refers to simultaneously utilizing several computing resources to complete a computation task. Its basic idea is to decompose a problem into parts and assign them to several separate processes to be independently completed, so as to achieve co-processing. Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see a comparison in Table 1).
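
    Two of the structures in the list above, the Bloom filter and the trie, are small enough to sketch directly. The classes below are minimal illustrations under stated assumptions: the bit-array size, the number of hash functions, and the use of SHA-256 with different prefixes as a family of hash functions are arbitrary choices made for the example.

```python
import hashlib


class BloomFilter:
    """Bit array plus several hash functions; may report false positives."""

    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item: str):
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


class Trie:
    """Prefix tree that also counts how often each word was inserted."""

    def __init__(self):
        self.children = {}
        self.count = 0  # insertions of the word that ends at this node

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.count += 1

    def _find(self, prefix):
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def frequency(self, word):
        node = self._find(word)
        return node.count if node else 0

    def with_prefix(self, prefix):
        """Return all stored words starting with `prefix`, in sorted order."""
        found = []

        def walk(node, path):
            if node.count:
                found.append(path)
            for ch in sorted(node.children):
                walk(node.children[ch], path + ch)

        start = self._find(prefix)
        if start is not None:
            walk(start, prefix)
        return found
```

    The Bloom filter never stores the items themselves, which is why it cannot list or delete them, while the trie shares the nodes of common prefixes, so looking up a word costs one step per character regardless of how many words are stored.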

    Although parallel computing systems or tools, such as MapReduce or Dryad, are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages are being developed based on these systems. Such high-level languages include Sawzall, Pig, and Hive, used for MapReduce, as well as Scope and DryadLINQ, used for Dryad.

    5.3 Architecture for big data analysis

    Because of the 4Vs of big data, different analytical architectures shall be considered for different application requirements.


    Table 1 Comparison of MPI, MapReduce, and Dryad

    | | MPI | MapReduce | Dryad |
    |---|---|---|---|
    | Deployment | Computing node and data storage arranged separately (data should be moved to the computing node) | Computing and data storage arranged at the same node (computing should be close to data) | Computing and data storage arranged at the same node (computing should be close to data) |
    | Resource management/scheduling | – | Workqueue (Google), HOD (Yahoo) | Not clear |
    | Low-level programming | MPI API | MapReduce API | Dryad API |
    | High-level programming | – | Pig, Hive, Jaql, ... | Scope, DryadLINQ |
    | Data storage | The local file system, NFS, ... | GFS (Google), HDFS (Hadoop), KFS, Amazon S3, ... | NTFS, Cosmos DFS |
    | Task partitioning | User manually partitions the tasks | Automation | Automation |
    | Communication | Messaging, remote memory access | Files (local FS, DFS) | Files, TCP pipes, shared-memory FIFOs |
    | Fault-tolerant | Checkpoint | Task re-execute | Task re-execute |

    5.3.1 Real-time vs. offline analysis

    According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

    – Real-time analysis is mainly used in e-commerce and finance. Since data constantly changes, rapid data analysis is needed and analytical results shall be returned with a very short delay. The main existing architectures of real-time analysis include (i) parallel processing clusters using traditional relational databases, and (ii) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

    – Offline analysis is usually used for applications without high requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally conducts analysis by importing logs into a special platform through data acquisition tools. Under the big data setting, many Internet enterprises utilize the offline analysis architecture based on Hadoop in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool Timetunnel, and Chukwa of Hadoop. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.

    5.3.2 Analysis at different levels

    Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive-level analysis, which are examined in the following.

    – Memory-level analysis is for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays, the memory of a server cluster surpasses hundreds of GB, and even the TB level is common. Therefore, an in-memory database technology may be used, and hot data shall reside in memory so as to improve the analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis. MongoDB is a representative memory-level analytical architecture. With the development of SSD (Solid-State Drive), the capacity and performance of memory-level data analysis have been further improved and widely applied.

    – BI analysis is for the case when the data scale surpasses the memory level but may be imported into the BI analysis environment. The current mainstream BI products provide data analysis plans that support data volumes above the TB level.

    – Massive analysis is for the case when the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analysis utilizes HDFS of Hadoop to store data and uses MapReduce for data analysis. Most massive analysis belongs to the offline analysis category.


    5.3.3 Analysis with different complexity

    The time and space complexity of data analysis algorithms differ greatly according to the kind of data and the application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.

    5.4 Tools for big data mining and analysis

    Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the five most widely used software packages, according to a survey of 798 professionals conducted by KDnuggets in 2012 that asked "What Analytics, Data mining, Big Data software that you used in the past 12 months for a real project?" [112].

    – R (30.7 %): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. When computing-intensive tasks are executed, code programmed in C, C++, and Fortran may be called in the R environment. In addition, skilled users can directly call R objects in C. R is in fact a realization of the S language, an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and drawing plots. Compared to S, R is more popular since it is open source. R ranks first in the KDnuggets 2012 survey. Furthermore, in a 2012 survey of "Design languages you have used for data mining/analysis in the past year", R was also in first place, ahead of SQL and Java. Due to the popularity of R, database manufacturers such as Teradata and Oracle have released products supporting R.

    – Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins with powerful functions for data analysis, such as Analysis ToolPak and Solver Add-in, are integrated initially, but such plug-ins can be used only if users enable them. Excel is also the only commercial software among the top five.

    – Rapid-I RapidMiner (26.7 %): RapidMiner is open source software used for data mining, machine learning, and predictive analysis. In a KDnuggets survey in 2011, it was used more frequently than R (ranking first). Data mining and machine learning programs provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment. The data mining flow is described in XML and displayed through a graphical user interface (GUI). RapidMiner is written in Java. It integrates the learners and evaluation methods of Weka, and works with R. Functions of RapidMiner are implemented by connecting processes that include various operators. The entire flow can be deemed a production line of a factory, with original data as input and model results as output. The operators can be considered specific functions with different input and output characteristics.

    – KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, and open-source-rich platform for data integration, data processing, data analysis, and data mining [113]. It allows users to create data flows or data channels in a visualized manner, to selectively run some or all analytical procedures, and provides analytical results, models, and interactive views. KNIME is written in Java and, being based on Eclipse, provides additional functions as plug-ins. Through plug-in files, users can insert processing modules for files, pictures, and time series, and integrate them into various open source projects, e.g., R and Weka. KNIME controls data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted in a visualized environment. KNIME is designed as a module-based and expandable framework. There is no dependence between its processing units and data containers, making it adaptive to distributed environments and independent development. In addition, KNIME is easy to expand: developers can effortlessly extend its various nodes and views.

    – Weka/Pentaho (14.8 %): Weka, abbreviated from Waikato Environment for Knowledge Analysis, is free and open-source machine learning and data mining software written in Java. Weka provides such functions as data processing, feature selection, classification, regression, clustering, association rules, and visualization. Pentaho is one of the most popular open-source BI software packages. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, data mining, and all other aspects of BI. Weka's data processing algorithms are also integrated in Pentaho and can be called directly.

    6 Big data applications

    In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful values via judgments, suggestions, supports, or decisions. However, data analysis involves a wide range of applications, which frequently change and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields, including structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

    6.1 Application evolutions

    Recently, big data analysis has been proposed as an advanced analytical technology, which typically includes large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have emerged over the past decades. For example, as early as the 1990s, BI became a prevailing technology for business applications, and network search engines based on massive data mining emerged in the early 21st century. Some potential and influential applications from different fields, and their data and analysis characteristics, are discussed as follows.

    – Evolution of Commercial Applications: The earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. Analytical techniques used in such systems prevailed in the 1990s and were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, score cards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have been providing a unique opportunity for organizations to have an online display and directly interact with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted by text analysis and website mining techniques. As reported in [115], the quantity of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the Internet of Things based on sensors are opening a new generation of innovative applications, and require considerably larger capacity for supporting location-sensing, people-oriented, and context-aware operation.

    – Evolution of Network Applications: The early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to the mining of email contents and the building of search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages, full of various kinds of data, such as text, images, audio, videos, and interactive contents. Therefore, a wealth of advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share contents.

    – Evolution of Scientific Applications: Scientific research in many fields, such as astrophysics, oceanology, genomics, and environmental research, is acquiring massive data with high-throughput sensors and instruments. The US National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, inter-operative analysis software, and data services to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets are highly varied in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

    As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and it is not easy to have a comprehensive coverage, we will focus on the key problems and technologies in data analysis in the following discussions.

    6.2 Big data analysis fields


    6.2.1 Structured data analysis

    Business applications and scientific research may generate massive structured data, the management and analysis of which rely on mature commercialized technologies, such as RDBMS, data warehouse, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

    However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning based on exact mathematical models and powerful algorithms has been applied to anomaly detection [117] and energy control [118]. Exploiting data characteristics, temporal and spatial mining can extract knowledge structures hidden in high-speed data flows and sensors [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially in process analysis with event data [121].

    6.2.2 Text data analysis

    The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to have more business potential than structured data analysis. Generally, text analysis is a process to extract useful information and knowledge from unstructured text. Text mining is inter-disciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and data mining in particular. Most text mining systems are based on text representations and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammars [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.
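As one possible illustration of turning unstructured text into a representation that mining algorithms can work on, the sketch below computes TF-IDF term weights. TF-IDF is not named in the survey; it is chosen here only as a familiar example, and the whitespace tokenizer and the function name are simplifying assumptions.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dictionary per (non-empty) document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents in which each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


docs = ["big data analysis", "text data mining", "web data mining"]
vectors = tf_idf(docs)
```

A term that occurs in every document, such as "data" here, receives weight zero, while rarer terms are weighted more heavily; classification or clustering could then operate on these vectors.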

    6.2.3 Web data analysis

    Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

    Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, image, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which will be discussed in Section 6.2.4. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers on text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves the mining of semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., email, newsgroup management, and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to simulate and integrate data in the Web, so as to conduct more complex queries than searches based on keywords.

    Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagrams of links within a website or among multiple websites. Models are built based on topological structures provided with hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages. The topic-oriented crawler is another successful case that utilizes the models [127].
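To show how a link-structure model can rank pages, the following is a minimal power-iteration sketch in the spirit of PageRank. The damping factor of 0.85, the fixed iteration count, the handling of pages without outgoing links, and the toy graph are assumptions of this example rather than details taken from [125].

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal shares to the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank


graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
```

In this toy graph, page C collects links from three pages and ends up with the highest score, while page D, which nobody links to, keeps only the baseline share.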

    Web usage mining aims to mine auxiliary data generated by Web dialogues or activities, whereas Web content mining and Web structure mining use the master Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or trades, caches, user queries, bookmark data, mouse clicks and scrolls, and any other kinds of data generated through interaction with the Web. As Web services and Web 2.0 are becoming mature and popular, Web usage data will have increasingly high variety. Web usage mining plays key roles in personalized space, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.
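A common first step when working with access logs is to group raw requests into user sessions. The sketch below does this with an inactivity timeout; the record layout `(user_id, unix_timestamp, url)`, the 30-minute timeout, and the function name are hypothetical choices made for illustration and are not taken from the survey.

```python
def sessionize(records, timeout=1800):
    """Group (user_id, unix_timestamp, url) records into per-user sessions."""
    sessions = {}
    last_seen = {}
    for user, ts, url in sorted(records, key=lambda r: (r[0], r[1])):
        # Start a new session for a new user or after a long period of inactivity.
        if user not in sessions or ts - last_seen[user] > timeout:
            sessions.setdefault(user, []).append([])
        sessions[user][-1].append(url)
        last_seen[user] = ts
    return sessions


log = [
    ("u1", 0, "/home"),
    ("u1", 60, "/item/42"),
    ("u2", 100, "/home"),
    ("u1", 4000, "/cart"),
]
result = sessionize(log)
```

The resulting sessions could then feed later usage-mining steps, such as finding frequent navigation paths or building user profiles.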


    6.2.4 Multimedia data analysis

    Multimedia data (mainly including images, audio, and videos) have been growing at an amazing speed, and analysis is needed to extract useful knowledge from them and understand their semantics. Because multimedia data is heterogeneous and most of such data contains richer information than simple structured data or text data, extracting information is confronted with the huge challenge of semantic differences. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia suggestion, and multimedia event detection.

    Audio summarization can be accomplished by extracting the prominent words or phrases from metadata or by synthesizing a new representation. Video summarization is to interpret the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied to many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video, and take other smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.
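The idea of static summarization by key frames can be sketched as follows: a frame is kept whenever it differs sufficiently from the most recently kept frame. Representing a frame as a flat list of pixel intensities in [0, 1], using the mean absolute difference, and the threshold of 0.2 are all simplifying assumptions of this example; the systems cited above may work quite differently.

```python
def select_key_frames(frames, threshold=0.2):
    """Return the indices of frames chosen as key frames."""
    if not frames:
        return []
    key_indices = [0]  # the first frame always starts the summary
    for i in range(1, len(frames)):
        last_key = frames[key_indices[-1]]
        # Mean absolute pixel difference to the most recent key frame.
        diff = sum(abs(a - b) for a, b in zip(frames[i], last_key)) / len(last_key)
        if diff > threshold:
            key_indices.append(i)
    return key_indices


video = [
    [0.1, 0.1, 0.1, 0.1],
    [0.1, 0.2, 0.1, 0.1],  # nearly identical to frame 0
    [0.9, 0.8, 0.9, 0.9],  # scene change
    [0.9, 0.9, 0.9, 0.8],
    [0.2, 0.1, 0.2, 0.1],  # another scene change
]
keys = select_key_frames(video)
```

Such a rule is simple to implement, which matches the remark above that static methods are simple; it also ignores motion and semantics, which hints at why their quality is limited.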

    Multimedia annotation inserts labels to describe the contents of images and videos at both the syntactic and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time and labor intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic difference. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to explore manual and automatic multimedia annotation in combination [129].

    Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information, and assisting users to conveniently and quickly look up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, and includes shot boundary detection, key frame extraction, and scene segmentation. Based on the result of structural analysis, the second procedure is feature extraction, which mainly includes further mining the features of key frames, objects, texts, and movements, which are the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find the patterns of video contents and put videos into predefined categories, so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos. The retrieval result is then optimized through relevance feedback.

    Multimedia recommendation is to recommend specific multimedia contents according to users' preferences. It has proven to be an effective approach to providing personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. The content-based methods identify general features of users or their interests, and recommend to users other contents with similar features. These methods largely rely on content similarity measurement, but most of them are troubled by limited content analysis and over-specialization. The collaborative-filtering-based methods identify groups with similar interests and recommend contents for group members according to their behavior [132]. Presently, a mixed method has been introduced, which integrates the advantages of the aforementioned two types of methods to improve recommendation quality [133].
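The collaborative-filtering idea described above can be sketched in a few lines: users with similar rating behavior are found, and items they liked are suggested to the target user. The use of cosine similarity, the rating dictionaries, and all names below are assumptions made for this illustration; the methods in [132] and [133] are not reproduced here.

```python
import math


def cosine(u, v):
    # Similarity of two users, based on the items both have rated.
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


def recommend(ratings, user, top_n=2):
    """ratings: {user: {item: score}}. Returns unseen items ranked by predicted interest."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        if sim <= 0:
            continue  # ignore users with no overlapping taste
        for item, score in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * score
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


ratings = {
    "alice": {"clip1": 5, "clip2": 4},
    "bob": {"clip1": 5, "clip2": 5, "clip3": 4},
    "carol": {"clip4": 5, "clip5": 3},
}
suggestions = recommend(ratings, "alice")
```

Here only bob shares rated items with alice, so only his unseen clip is suggested. A content-based method would instead compare features of the clips themselves, and a mixed method would combine both signals.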

    The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains some text description related to concepts and video examples [134]. In [135], the author proposed a new algorithm for special multimedia event detection using a few positive training examples. The research on video event detection is still in its infancy, and mainly focuses on sports or news events, running or abnormal events in monitoring videos, and other similar events with repetitive patterns.

    6.2.5 Network data analysis

    Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graph structures, describing the communications between two entities. The content data contains text, image, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. From the data-centered perspective, the existing research on social networking service (SNS) contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

    The research on link-based structural analysis has focused on link prediction, community discovery, social network evolution, and social influence analysis. SNS may be visualized as graphs, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertexes and edges are continually added to the graphs. Link prediction is to predict the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and utilizes the existing link information to generate binary classifiers to predict future links [139]. Probabilistic methods aim to build models for connection probabilities among vertexes in SNS [140]. Linear algebra computes the similarity between two vertexes according to the singular similar matrix [141]. A community is represented by a sub-graph in which edges connecting vertexes within the sub-graph feature high density, while the edges between two sub-graphs feature much lower density [142].
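As a small illustration of link prediction on such a graph, the sketch below scores every unconnected pair of vertexes by the Jaccard overlap of their neighbor sets. This neighborhood score is only one example of a feature that a feature-based classifier might consume; it is an assumption of this sketch, not the specific method of [139], and the toy edge list is invented.

```python
from itertools import combinations


def jaccard_scores(edges):
    """Score each unconnected vertex pair by the overlap of their neighbor sets."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    scores = {}
    for u, v in combinations(sorted(neighbors), 2):
        if v in neighbors[u]:
            continue  # already linked, nothing to predict
        union = neighbors[u] | neighbors[v]
        scores[(u, v)] = len(neighbors[u] & neighbors[v]) / len(union)
    return scores


edges = [
    ("ann", "bob"), ("ann", "cat"), ("bob", "cat"),
    ("bob", "dan"), ("cat", "dan"), ("dan", "eve"),
]
scores = jaccard_scores(edges)
best_pair = max(scores, key=scores.get)
```

In this toy network, ann and dan share two of their three combined neighbors, so they receive the highest score and would be the first candidates for a predicted future link.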

    Many methods for community detection have been proposed and studied, most of which use topology-based objective functions that rely on the concept of capturing community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. The research on SNS evolution aims to find laws and deduction models to interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generation methods have been proposed to assist network and system design [147].

    Social influence refers to the case when individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, time effects, and the characteristics of networks and individuals. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is considered, the performance of link-based structural analysis may be further improved.

    Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise. For example, the blogosphere contains a large number of spam blogs, and Twitter likewise contains many trivial Tweets. Third, SNS are dynamic networks, which vary and are updated frequently and quickly. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge among different media [150].

    6.2.6 Mobile data analysis

    By April 2013, Android Apps had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since the research on mobile analysis has only just started, we will only introduce some recent and representative analysis applications in this section.

    With the growth in the number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities based on geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest Webchat). Traditional network communities or SNS communities lack online interaction among members, and the communities are active only when members are sitting before computers. On the contrary, mobile phones can support rich interaction at any time and anywhere. A mobile community is defined as a group of individuals with the same hobbies (e.g., health, safety, and entertainment) who gather together on networks, meet to set a common goal, decide measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

    Recently, the progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build body area networks for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations. In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize

    Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop an application for smart phones, which analyzes paces when people walk and uses the pace information for unlocking the safety system [11]. Meanwhile, Robert Delano and Brian Parise from the Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with a built-in seismograph in a mobile phone, so as to cope with Parkinson's disease and other nervous system diseases [11].

    63 Key applications of big data

    631 Application of big data in enterprises

    At present big data mainly comes from and is mainly usedin enterprises while BI and OLAP can be regarded as thepredecessors of big data application The application of bigdata in enterprises can enhance their production efficiencyand competitiveness in many aspects In particular on mar-keting with correlation analysis of big data enterprises canmore accurately predict the consumer behavior and findnew business modes On sales planning after comparisonof massive data enterprises can optimize their commodityprices On operation enterprises can improve their opera-tion efficiency and satisfaction optimize the labor forceaccurately forecast personnel allocation requirements avoidexcess production capacity and reduce labor cost On sup-ply chain using big data enterprises may conduct inventoryoptimization logistic optimization and supplier coordina-tion etc to mitigate the gap between supply and demandcontrol budgets and improve services

    In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) used data analysis to recognize that activities such as "multi-times score accumulation" and "score exchange in shops" are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers' transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.
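    The retention step described above, which targets the share of customers most likely to drop out, can be sketched as follows. This is a minimal illustration, not CMB's actual model: the scores are assumed to come from some previously trained warning model, and the function name `select_retention_targets` and the customer IDs are hypothetical.

```python
import math


def select_retention_targets(churn_scores, fraction=0.20):
    """Return the customer IDs with the highest drop-out scores.

    churn_scores: dict mapping customer ID -> estimated drop-out
    probability (assumed to be produced by a separate warning model).
    fraction: share of customers to target (0.20 = top 20 %).
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Round up so that at least one customer is selected for non-empty input.
    k = math.ceil(len(churn_scores) * fraction)
    # Sort by score (descending); break ties by ID for reproducibility.
    ranked = sorted(churn_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [customer_id for customer_id, _ in ranked[:k]]


# Hypothetical scores for ten customers.
scores = {
    "c01": 0.05, "c02": 0.91, "c03": 0.12, "c04": 0.47, "c05": 0.88,
    "c06": 0.30, "c07": 0.02, "c08": 0.65, "c09": 0.10, "c10": 0.22,
}
targets = select_retention_targets(scores)
```

    With ten customers and the default fraction, the two highest-scoring customers are selected. In practice the quality of the targeting depends entirely on the warning model that produces the scores.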

    Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, along with, more importantly, the age, gender, address, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, the market conditions of their brands, and consumers' behaviors, and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan service of Alibaba automatically analyzes and judges whether to lend to enterprises based on the acquired enterprise transaction data by virtue of big data technology, with no manual intervention in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan with only about 0.3 % bad loans, which is much lower than those of other commercial banks.

    6.3.2 Application of IoT based big data

    IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

    Logistics enterprises may have profound experience with the application of IoT big data. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so that headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

    Smart city is a hot research area based on the application of IoT data. For example, the smart city project cooperation between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and Miami city, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills this year by promptly identifying and fixing water pipes that were running and leaking.

    6.3.3 Application of online social network-oriented big data

    Online SNS is a social structure constituted by social individuals and connections among individuals based on an information network. Big data of online SNS mainly comes from instant messages, online social networking, micro blogs, and shared spaces, which represent various user activities. The analysis of big data from online SNS uses computational analytical methods for understanding relations in human society by virtue of theories and methods that involve mathematics, informatics, sociology, and management science, from three dimensions: network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire value.

    – Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preference, emotion, interest, and demand may be revealed.

    – Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies aggregate relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.
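    To make the notion of a community concrete, the sketch below counts, for a candidate group of users, how many relations stay inside the group and how many cross its boundary. It is only an illustrative check under assumed inputs (an undirected edge list with hypothetical user names); it is not a community detection algorithm, and the survey does not prescribe this measure.

```python
def community_edge_counts(edges, group):
    """Count internal and boundary edges of a candidate community.

    edges: iterable of (u, v) pairs describing undirected relations.
    group: collection of nodes proposed as one community.
    Returns (internal, boundary): edges with both endpoints in the
    group, and edges with exactly one endpoint in the group.
    """
    members = set(group)
    internal = boundary = 0
    for u, v in edges:
        inside = (u in members) + (v in members)
        if inside == 2:
            internal += 1
        elif inside == 1:
            boundary += 1
    return internal, boundary


# Hypothetical SNS graph: two tightly knit triangles joined by one edge.
edges = [
    ("ann", "bob"), ("bob", "cat"), ("ann", "cat"),   # first triangle
    ("dan", "eve"), ("eve", "fay"), ("dan", "fay"),   # second triangle
    ("cat", "dan"),                                   # single bridge
]
internal, boundary = community_edge_counts(edges, {"ann", "bob", "cat"})
```

    A group with many internal edges and few boundary edges matches the description of a community given above; a group that cuts across both triangles would show the opposite pattern.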

    The US Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

    Fig. 5 Enabling technologies for online social network-oriented big data

    In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, it was found that most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted research that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans. The goal is to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting sharp growth or drops in the amount of topics; 2) observing the weekly and monthly trends of dialogs on Twitter, and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation from the official statistics of Indonesia matches the number of Tweets about rice price on Twitter, as shown in Fig. 6.
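    Two of the analyses listed above can be sketched in a few lines: flagging a sharp growth or drop in topic volume, and measuring how closely a tweet-count series tracks an official indicator. The code is a simplified illustration with made-up numbers, not the method or data used by Global Pulse; the window length, the threshold, and the function names are assumptions.

```python
from math import sqrt
from statistics import mean, pstdev


def find_spikes(counts, window=4, threshold=3.0):
    """Return indices whose value deviates sharply from the preceding window.

    A point is flagged when it lies more than `threshold` population
    standard deviations away from the mean of the previous `window` points.
    """
    spikes = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu, sigma = mean(history), pstdev(history)
        if sigma == 0:
            # Flat history: any change at all counts as a deviation.
            if counts[i] != mu:
                spikes.append(i)
        elif abs(counts[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series.

    Raises ZeroDivisionError if either series is constant.
    """
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up monthly tweet counts about a topic, and a made-up price index.
tweets = [100, 104, 98, 102, 101, 240, 103, 99]
index = [2.0, 2.1, 1.9, 2.0, 2.0, 4.5, 2.1, 2.0]

spikes = find_spikes(tweets)   # positions of abnormal months
r = pearson(tweets, index)     # close to 1 when the series move together
```

    A high correlation on its own does not show that tweets predict the indicator; it only indicates that the two series moved together over the observed period.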

    Generally speaking, the application of big data from online SNS may help to better understand users' behavior and master the laws of social and economic activities from the following three aspects:

    – Early Warning: to rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

    – Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotion, and preference of users.

    – Real-time Feedback: to acquire groups' feedback on some social activities based on real-time monitoring.

    6.3.4 Applications of healthcare and medical big data

    Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the health care business.

    For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final result into a highly personalized treatment plan to assess the risk factors and main treatment plans of patients. Then, doctors may reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose five pounds, or by suggesting that patients reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20.

    The Mount Sinai Medical Center in the US utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

    Fig. 6 The correlation between Tweets about rice price and food price inflation

    HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through the software development kit (SDK) and open interface.

    6.3.5 Collective intelligence

    With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units to coordinate with mobile networks for the distribution of sensing tasks and the collection and utilization of sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of Crowdsourcing has been successfully applied to geotagged photographs, positioning and navigation, urban road traffic sensing, market forecasting, opinion mining, and other labor-intensive applications.

    Crowdsourcing, a new approach for problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, Crowdsourcing had been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of Crowdsourcing. The main idea of Crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, Crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

    In the big data era, Spatial Crowdsourcing has become a hot topic. The operation framework of Spatial Crowdsourcing is as follows. A user may request the service and resources related to a specified location. Then, the mobile users who are willing to participate in the task will move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data will be sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions provided by mobile devices, it can be forecast that Spatial Crowdsourcing will become more prevalent than traditional Crowdsourcing, e.g., Amazon Turk and Crowdflower.
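    The matching step in this framework, choosing which willing mobile users should travel to the requested location, can be sketched as a nearest-worker selection. This is a simplified illustration under assumptions not stated in the survey: workers report latitude and longitude, distance is the great-circle (haversine) distance, and the `Worker` type, the `pick_workers` function, and the 5 km radius are hypothetical.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


@dataclass
class Worker:
    name: str
    lat: float
    lon: float
    willing: bool  # whether the user agreed to take part in the task


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def pick_workers(workers, task_lat, task_lon, k=1, max_km=5.0):
    """Return up to k willing workers within max_km, nearest first."""
    candidates = []
    for w in workers:
        if not w.willing:
            continue
        d = haversine_km(w.lat, w.lon, task_lat, task_lon)
        if d <= max_km:
            candidates.append((d, w.name))
    candidates.sort()
    return [name for _, name in candidates[:k]]


# Hypothetical workers around a task location.
workers = [
    Worker("u1", 30.520, 114.310, True),
    Worker("u2", 30.515, 114.305, False),  # close, but not willing
    Worker("u3", 30.600, 114.400, True),   # willing, but too far away
    Worker("u4", 30.512, 114.302, True),
]
chosen = pick_workers(workers, 30.510, 114.300, k=2)
```

    Real systems would also weigh factors such as worker reliability, travel cost, and task deadlines; distance alone is used here only to keep the example short.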

    6.3.6 Smart grid

    Smart Grid is the next generation power grid, constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart Grid related big data are generated from various sources, such as (i) power utilization habits of users; (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nationwide; (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI); (iv) energy market pricing and bidding data; and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as circuit breaker monitors and transformers). Smart Grid brings about the following challenges in exploiting big data:

    – Grid planning: By analyzing data in the Smart Grid, regions that have excessively high electrical load or high power outage frequencies can be identified. Even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance. For example, researchers from the University of California, Los Angeles designed an "electric map" according to big data theory, and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of the block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides effective and visual load forecasts for power grid planning in a city. Preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads may be conducted, as displayed in the map.

    – Interaction between power generation and power consumption: An ideal power grid shall balance power generation and consumption. However, the traditional power grid is constructed based on the one-directional approach of transmission-transformation-distribution-consumption, which does not allow the generation capacity to be adjusted according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters have been developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which can help the supplier read power utilization data every 15 min rather than every month as in the past. Labor cost for meter reading is greatly reduced. Because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and low periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help the realization of time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users.

    – The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortage. Such energy resources can complement traditional hydropower and thermal power generation.
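    As an illustration of the time-sharing dynamic pricing mentioned in the list above, the sketch below bills 15-minute smart-meter readings at different rates for peak and off-peak hours. The peak window and tariff values are invented for the example and do not reflect TXU Energy's actual pricing.

```python
from datetime import datetime, timedelta

PEAK_HOURS = range(17, 21)   # assumed peak window: 17:00-20:59
PEAK_RATE = 0.30             # assumed price per kWh during peak hours
OFF_PEAK_RATE = 0.10         # assumed price per kWh otherwise


def time_of_use_bill(readings):
    """Compute a bill from (timestamp, kWh) interval readings.

    Each reading is the energy consumed in the 15-minute interval
    starting at `timestamp`. Returns (peak_kwh, off_peak_kwh, cost).
    """
    peak_kwh = off_peak_kwh = 0.0
    for timestamp, kwh in readings:
        if timestamp.hour in PEAK_HOURS:
            peak_kwh += kwh
        else:
            off_peak_kwh += kwh
    cost = peak_kwh * PEAK_RATE + off_peak_kwh * OFF_PEAK_RATE
    return peak_kwh, off_peak_kwh, cost


# One hypothetical day: 96 intervals, 0.5 kWh each during the peak window
# and 0.25 kWh each at other times.
start = datetime(2014, 1, 1)
readings = []
for i in range(96):
    ts = start + timedelta(minutes=15 * i)
    readings.append((ts, 0.5 if ts.hour in PEAK_HOURS else 0.25))

peak_kwh, off_peak_kwh, cost = time_of_use_bill(readings)
```

    Monthly meter readings could not support this kind of tariff, because the bill depends on when within the day the energy was consumed; the 15-minute granularity is what makes the split possible.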

    7 Conclusion, open issues, and outlook

    In this paper, we review the background and state-of-the-art of big data. Firstly, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture to readers of this exciting area.

    In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

    7.1 Open issues

    The analysis of big data is confronted with many challenges, but the current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

    7.1.1 Theoretical research

    Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, which are discussed below.

    – Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

    – Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to weigh the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated once the system is implemented and deployed, so the advantages and disadvantages of various alternative solutions cannot be compared horizontally, even before and after the implementation of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

    – Evolution of big data computing modes: This includes memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

    7.1.2 Technology development

    Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, should be fully investigated.

    – Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor which restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.

    – Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor in improving big data computing.

    – Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

    – Processing of big data: As big data research advances, new problems in big data processing arise from traditional data analysis, including (i) data re-utilization: with the increase of data scale, more value may be mined from re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized to mine more value; and (iii) data exhaust, which means wrong data generated during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.

    7.1.3 Practical implications

    Although there are already many successful big data applications, many practical problems remain to be solved.

    – Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

    – Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning.

    – Integration and provenance of big data: As discussed, the value acquired from comprehensive utilization of multiple datasets is far higher than the sum of the values of individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and from different datasets.

    – Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication; big data applications in small and medium-sized businesses; big data applications in government departments; big data services; and human-computer interaction of big data are all important research problems.

    7.1.4 Data security

    In IT, safety and privacy are always two key concerns. In the big data era, as data volume is growing fast, there are more severe safety risks, while traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security related challenges:

    – Big data privacy: Big data privacy includes two aspects: (i) protection of personal privacy during data acquisition: personal interests, habits, and body properties of users may be more easily acquired, and users may not be aware of it; (ii) personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed a big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher of Skull Security, used an information acquisition tool to acquire data from the public pages of Facebook users who had failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

    – Data quality: Data quality influences big data utilization. Low quality data wastes transmission and storage resources and has poor usability. There are a lot of factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though a lot of measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.

    – Big data safety mechanism: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data could not meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safety communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

    – Big data application in information security: Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) after analysis of big data in the form of log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, and attack characteristics may also be more easily identified through the analysis of big data.
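    Two of the data quality dimensions named in the list above, completeness and redundancy, lend themselves to simple automatic checks. The following is a minimal sketch under assumed definitions (completeness as the share of non-missing required fields, redundancy as the share of exact duplicate records); the survey does not define these metrics formally, and the record layout is hypothetical.

```python
def completeness(records, required_fields):
    """Share of required field values that are present (not None or '')."""
    total = len(records) * len(required_fields)
    if total == 0:
        return 1.0  # nothing required, so nothing is missing
    present = sum(
        1
        for record in records
        for field in required_fields
        if record.get(field) not in (None, "")
    )
    return present / total


def redundancy(records):
    """Share of records that exactly duplicate an earlier record."""
    if not records:
        return 0.0
    seen = set()
    duplicates = 0
    for record in records:
        key = tuple(sorted(record.items()))  # order-independent fingerprint
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates / len(records)


# Hypothetical sensor records.
records = [
    {"id": 1, "temp": 21.5, "unit": "C"},
    {"id": 2, "temp": None, "unit": "C"},   # missing measurement
    {"id": 1, "temp": 21.5, "unit": "C"},   # exact duplicate of the first
    {"id": 3, "temp": 22.0, "unit": ""},    # missing unit
]
c = completeness(records, ["id", "temp", "unit"])
d = redundancy(records)
```

    Accuracy and consistency are harder to check automatically, since they usually need reference data or domain rules; they are left out of this sketch for that reason.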

    The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy efficiency optimization, and processing hardware and software system architectures. Particularly, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

    7.2 Outlook

    The emergence of big data opens great opportunities. In the IT era, the "T" (Technology) was the main concern, while technology drove the development of data. In the big data era, with the prominence of data value and advances in "I" (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact, but also influence everyone's ways of living and thinking, which is happening right now. We cannot predict the future, but may take precautions for possible events to occur in the future.

    – Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have been concerned with better ways to cope with larger-scale, higher diversity, and more complexly structured data. These efforts are represented by Spanner, the globally-distributed database of Google, and F1, a fault-tolerant, expandable, distributed relational database. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to relational databases, and effectively handle data through grammars similar to SQL.

    – Data resource performance: Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

    – Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security. Then, the impacts of big data on production management, business operation, and decision making shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

    – Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., as in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a friendly way can they be effectively utilized by users. Reports, histograms, pie charts, and regression curves are frequently used to visualize results of data analysis. New presentation forms will occur in the future; e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

    – Data-oriented: It is well known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it is observed that the role of data is becoming increasingly significant. In the small scale data era, in which logic is more complex than data, program design is mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed business logic. Consequently, programs are gradually being transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods are certain to emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

    – Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

      – During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

      – Compared with accurate data, we would like to accept numerous and complicated data.

      – We shall pay greater attention to correlations between things rather than exploring causal relationships.

      – The simple algorithms of big data are more effective than the complex algorithms of small data.

      – Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace "experts".

    Throughout the history of human society, the demands and willingness of human beings have always been the driving forces of scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in the future society.

    Acknowledgments This work was supported by China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao’s research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

    References

    1. Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

    2. Fact sheet: Big data across the federal government (2012) http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf

    3. Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper

    4. Drowning in numbers - digital data will flood the planet - and help us understand it better (2011) http://www.economist.com/blogs/dailychart/2011/11/bigdata-0

    5. Lohr S (2012) The age of big data. New York Times, pp 11

    6. Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/thedigitalbreadcrumbs-that-lead-to-big-data

    7. Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-searchforanalysts-to-make-sense-of-big-data

    8. Big data (2008) http://www.nature.com/news/specials/bigdata/index.html

    9. Special online collection: dealing with big data (2011) http://www.sciencemag.org/site/special/data

    10. Manyika J, McKinsey Global Institute, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute

    11. Mayer-Schönberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Eamon Dolan/Houghton Mifflin Harcourt

    12. Laney D (2001) 3-d data management: controlling data volume, velocity and variety. META Group Research Note, 6 February

    13. Zikopoulos P, Eaton C et al (2011) Understanding big data: analytics for enterprise class hadoop and streaming data. McGraw-Hill Osborne Media

    14. Meijer E (2011) The world according to linq. Communications of the ACM 54(10):45–51

    15. Beyer M (2011) Gartner says solving big data challenge involves more than just managing volumes of data. Gartner. http://www.gartner.com/it/page.jsp

    16. O R Team (2011) Big data now: current perspectives from O’Reilly Radar. O’Reilly Media

    17. Grobelnik M (2012) Big data tutorial. http://videolectures.net/eswc2012_grobelnik_big_data/

    18. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2008) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014

    19. DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6):85–98

    Mobile Netw Appl (2014) 19171ndash209 205

    20. Walter T (2009) Teradata past, present, and future. UCI ISG lecture series on scalable data management

    21. Ghemawat S, Gobioff H, Leung S-T (2003) The google file system. In: ACM SIGOPS Operating Systems Review, vol 37. ACM, pp 29–43

    22. Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

    23. Hey AJG, Tansley S, Tolle KM et al (2009) The fourth paradigm: data-intensive scientific discovery

    24. Howard JH, Kazar ML, Menees SG, Nichols DA, Satyanarayanan M, Sidebotham RN, West MJ (1988) Scale and performance in a distributed file system. ACM Trans Comput Syst (TOCS) 6(1):51–81

    25. Cattell R (2011) Scalable sql and nosql data stores. ACM SIGMOD Record 39(4):12–27

    26. Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033

    27. Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98

    28. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J et al (2012) Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States

    29. Sun Y, Chen M, Liu B, Mao S (2013) Far: a fault-avoidant routing method for data center networks with regular topology. In: Proceedings of ACM/IEEE symposium on architectures for networking and communications systems (ANCS’13). ACM

    30. Wiki (2013) Applications and organizations using hadoop. http://wiki.apache.org/hadoop/PoweredBy

    31. Bahga A, Madisetti VK (2012) Analyzing massive machine maintenance data in a computing cloud. IEEE Transac Parallel Distrib Syst 23(10):1831–1843

    32. Gunarathne T, Wu T-L, Choi JY, Bae S-H, Qiu J (2011) Cloud computing paradigms for pleasingly parallel biomedical applications. Concurr Comput Prac Experience 23(17):2338–2354

    33. Gantz J, Reinsel D (2010) The digital universe decade-are you ready. External publication of IDC (Analyse the Future) information and data, pp 1–16

    34. Bryant RE (2011) Data-intensive scalable computing for scientific applications. Comput Sci Eng 13(6):25–33

    35. Wahab MHA, Mohd MNH, Hanafi HF, Mohsin MFM (2008) Data pre-processing on web server logs for generalized association rules mining algorithm. World Acad Sci Eng Technol 48:2008

    36. Nanopoulos A, Manolopoulos Y, Zakrzewicz M, Morzy T (2002) Indexing web access-logs for pattern queries. In: Proceedings of the 4th international workshop on web information and data management. ACM, pp 63–68

    37. Joshi KP, Joshi A, Yesha Y (2003) On using a warehouse to analyze web logs. Distrib Parallel Databases 13(2):161–180

    38. Chandramohan V, Christensen K (2002) A first look at wired sensor networks for video surveillance systems. In: Proceedings LCN 2002, 27th annual IEEE conference on local computer networks. IEEE, pp 728–729

    39. Selavo L, Wood A, Cao Q, Sookoor T, Liu H, Srinivasan A, Wu Y, Kang W, Stankovic J, Young D et al (2007) Luster: wireless sensor network for environmental research. In: Proceedings of the 5th international conference on Embedded networked sensor systems. ACM, pp 103–116

    40. Barrenetxea G, Ingelrest F, Schaefer G, Vetterli M, Couach O, Parlange M (2008) Sensorscope: out-of-the-box environmental monitoring. In: Information processing in sensor networks, 2008, international conference on IPSN’08. IEEE, pp 332–343

    41. Kim Y, Schmid T, Charbiwala ZM, Friedman J, Srivastava MB (2008) Nawms: nonintrusive autonomous water monitoring system. In: Proceedings of the 6th ACM conference on Embedded network sensor systems. ACM, pp 309–322

    42. Kim S, Pakzad S, Culler D, Demmel J, Fenves G, Glaser S, Turon M (2007) Health monitoring of civil infrastructures using wireless sensor networks. In: Information Processing in Sensor Networks 2007, 6th International Symposium on IPSN 2007. IEEE, pp 254–263

    43. Ceriotti M, Mottola L, Picco GP, Murphy AL, Guna S, Corra M, Pozzi M, Zonta D, Zanon P (2009) Monitoring heritage buildings with wireless sensor networks: the torre aquila deployment. In: Proceedings of the 2009 International Conference on Information Processing in Sensor Networks. IEEE Computer Society, pp 277–288

    44. Tolle G, Polastre J, Szewczyk R, Culler D, Turner N, Tu K, Burgess S, Dawson T, Buonadonna P, Gay D et al (2005) A macroscope in the redwoods. In: Proceedings of the 3rd international conference on embedded networked sensor systems. ACM, pp 51–63

    45. Wang F, Liu J (2011) Networked wireless sensor data collection: issues, challenges, and approaches. IEEE Commun Surv Tutor 13(4):673–687

    46. Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the 11th international conference on World Wide Web. ACM, pp 124–135

    47. Choudhary S, Dincturk ME, Mirtaheri SM, Moosavi A, von Bochmann G, Jourdan G-V, Onut I-V (2012) Crawling rich internet applications: the state of the art. In: CASCON, pp 146–160

    48. Ghani N, Dixit S, Wang T-S (2000) On ip-over-wdm integration. IEEE Commun Mag 38(3):72–84

    49. Manchester J, Anderson J, Doshi B, Dravida S (1998) Ip over sonet. IEEE Commun Mag 36(5):136–142

    50. Jinno M, Takara H, Kozicki B (2009) Dynamic optical mesh networks: drivers, challenges and solutions for the future. In: Optical communication, 2009, 35th European conference on ECOC’09. IEEE, pp 1–4

    51. Barroso LA, Hölzle U (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synt Lect Comput Archit 4(1):1–108

    52. Armstrong J (2009) Ofdm for optical communications. J Light Technol 27(3):189–204

    53. Shieh W (2011) Ofdm for flexible high-speed optical networks. J Light Technol 29(10):1560–1577

    54. Cisco data center interconnect design and deployment guide (2010)

    55. Greenberg A, Hamilton JR, Jain N, Kandula S, Kim C, Lahiri P, Maltz DA, Patel P, Sengupta S (2009) Vl2: a scalable and flexible data center network. In: ACM SIGCOMM computer communication review, vol 39. ACM, pp 51–62

    56. Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S (2009) Bcube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 39(4):63–74

    57. Farrington N, Porter G, Radhakrishnan S, Bazzaz HH, Subramanya V, Fainman Y, Papen G, Vahdat A (2011) Helios: a hybrid electrical/optical switch architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 41(4):339–350

    58. Abu-Libdeh H, Costa P, Rowstron A, O’Shea G, Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Comput Commun Rev 40(4):51–62

    59. Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what’s needed for datacenter network operations. IEEE Commun Mag 48(7):32–39

    60. Wang G, Andersen DG, Kaminsky M, Papagiannaki K, Ng TS, Kozuch M, Ryan M (2010) c-through: part-time optics in data centers. In: ACM SIGCOMM Computer Communication Review, vol 40. ACM, pp 327–338

    61. Ye X, Yin Y, Yoo SJB, Mejia P, Proietti R, Akella V (2010) Dos: a scalable optical switch for datacenters. In: Proceedings of the 6th ACM/IEEE symposium on architectures for networking and communications systems. ACM, p 24

    62. Singla A, Singh A, Ramachandran K, Xu L, Zhang Y (2010) Proteus: a topology malleable data center network. In: Proceedings of the 9th ACM SIGCOMM workshop on hot topics in networks. ACM, p 8

    63. Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P (2011) Energy-efficient design of a scalable optical multiplane interconnection architecture. IEEE J Sel Top Quantum Electron 17(2):377–383

    64. Kodi AK, Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance computing (hpc) systems. IEEE J Sel Top Quantum Electron 17(2):384–395

    65. Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H (2012) Mirror mirror on the ceiling: flexible wireless links for data centers. ACM SIGCOMM Comput Commun Rev 42(4):443–454

    66. Lenzerini M (2002) Data integration: a theoretical perspective. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. ACM, pp 233–246

    67. Cafarella MJ, Halevy A, Khoussainova N (2009) Data integration for the relational web. Proc VLDB Endowment 2(1):1090–1101

    68. Maletic JI, Marcus A (2000) Data cleansing: beyond integrity analysis. In: IQ. Citeseer, pp 200–209

    69. Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1-2):83–113

    70. Chen H, Ku W-S, Wang H, Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 51–62

    71. Zhao Z, Ng W (2012) A model-based approach for rfid data stream cleansing. In: Proceedings of the 21st ACM international conference on information and knowledge management. ACM, pp 862–871

    72. Khoussainova N, Balazinska M, Suciu D (2008) Probabilistic event extraction from rfid data. In: Data Engineering, 2008. IEEE 24th international conference on ICDE 2008. IEEE, pp 1480–1482

    73. Herbert KG, Wang JTL (2007) Biological data cleaning: a case study. Int J Inf Qual 1(1):60–82

    74. Tsai T-H, Lin C-Y (2012) Exploring contextual redundancy in improving object-based video coding for video sensor networks surveillance. IEEE Transac Multmed 14(3):669–682

    75. Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 269–278

    76. Kamath U, Compton J, Dogan RI, Jong KD, Shehu A (2012) An evolutionary algorithm approach for feature generation from sequence data and its application to dna splice site prediction. IEEE/ACM Transac Comput Biol Bioinforma (TCBB) 9(5):1387–1398

    77. Leung K-S, Lee KH, Wang J-F, Ng EYT, Chan HLY, Tsui SKW, Mok TSK, Tse PC-H, Sung JJ-Y (2011) Data mining on dna sequences of hepatitis b virus. IEEE/ACM Transac Comput Biol Bioinforma 8(2):428–440

    78. Huang Z, Shen H, Liu J, Zhou X (2011) Effective data co-reduction for multimedia similarity search. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, pp 1021–1032

    79. Bleiholder J, Naumann F (2008) Data fusion. ACM Comput Surv (CSUR) 41(1):1

    80. Brewer EA (2000) Towards robust distributed systems. In: PODC, p 7

    81. Gilbert S, Lynch N (2002) Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59

    82. McKusick MK, Quinlan S (2009) Gfs: evolution on fast-forward. ACM Queue 7(7):10

    83. Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J (2008) Scope: easy and efficient parallel processing of massive data sets. Proc VLDB Endowment 1(2):1265–1276

    84. Beaver D, Kumar S, Li HC, Sobel J, Vajgel P et al (2010) Finding a needle in haystack: facebook’s photo storage. In: OSDI, vol 10, pp 1–8

    85. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: amazon’s highly available key-value store. In: SOSP, vol 7, pp 205–220

    86. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In: Proceedings of the twenty-ninth annual ACM symposium on theory of computing. ACM, pp 654–663

    87. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2):4

    88. Burrows M (2006) The chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th symposium on Operating systems design and implementation. USENIX Association, pp 335–350

    89. Lakshman A, Malik P (2009) Cassandra: structured storage system on a p2p network. In: Proceedings of the 28th ACM symposium on principles of distributed computing. ACM, pp 5–5

    90. George L (2011) HBase: the definitive guide. O’Reilly Media Inc

    91. Judd D (2008) hypertable-0.9.0.4-alpha

    92. Chodorow K (2013) MongoDB: the definitive guide. O’Reilly Media Inc

    93. Crockford D (2006) The application/json media type for javascript object notation (json)

    94. Murty J (2009) Programming amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O’Reilly Media Inc

    95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O’Reilly Media Inc

    96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 975–986

    97. Yang H-C, Parker DS (2009) Traverse: simplified indexing on large map-reduce-merge clusters. In: Database systems for advanced applications. Springer, pp 308–322

    98. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with sawzall. Sci Program 13(4):277–298

    99. Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U (2009) Building a high-level dataflow system on top of map-reduce: the pig experience. Proceedings VLDB Endowment 2(2):1414–1425

    100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629

    101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72

    102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) Dryadlinq: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14

    103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: Parallel and distributed processing, 2008. IEEE international symposium on IPDPS 2008. IEEE, pp 1–11

    104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146

    105. Bu Y, Bill H, Balazinska M, Ernst MD (2010) Haloop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296

    106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative mapreduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing. ACM, pp 810–818

    107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2

    108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: mapreduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing. ACM, p 7

    109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) Ciel: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on Networked systems design and implementation, p 9

    110. Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York

    111. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

    112. What analytics, data mining, big data software you used in the past 12 months for a real project? (2012) http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html

    113. Berthold MR, Cebron N, Dill F, Gabriel TR, Kötter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer

    114. Sallam RL, Richardson J, Hagerty J, Hostmann B (2011) Magic quadrant for business intelligence platforms. CT, Gartner Group, Stamford

    115. Beyond the PC. Special Report on Personal Technology (2011)

    116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A et al (2011) The iplant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

    117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on Software quality assurance. ACM, pp 70–77

    118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers. ACM, pp 277–286

    119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Record 34(2):18–26

    120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM Sigmod Record 33(1):50–57

    121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Transac Manag Inform Syst (TMIS) 3(2):7

    122. Manning CD, Schütze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press

    123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Transac Neural Netw 13(5):1163–1177

    124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11

    125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117

    126. Konopnicki D, Shmueli O (1995) W3qs: a query system for the world-wide web. In: VLDB, vol 95, pp 54–65

    127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640

    128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval. ACM, pp 2

    129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25

    130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19

    131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819

    132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939

    133. Barragáns-Martínez AB, Costa-Montenegro E, Burguillo JC, Rey-López M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend tv programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311

    134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91

    135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia. ACM, pp 469–478

    136. Hirsch JE (2005) An index to quantify an individual’s scientific research output. Proc Natl Acad Sci USA 102(46):16569

    137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company

    138. Aggarwal CC (2011) An introduction to social network data analytics. Springer

    139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1046–1054

    140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied Computing. ACM, pp 1115–1116

    141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Transac Knowl Discov Data (TKDD) 5(2):10

    142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on World wide web. ACM, pp 631–640

    143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis. ACM, pp 16–25

    144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. ACM, pp 315–321

    145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 145–158

    146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 131–144

    147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1007–1016

    148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 807–816

    149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on Web search and data mining. ACM, pp 657–666

    150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360

    151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (accessed 5 May 2013)

    152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51

    153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive rfid trajectory and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 2

    154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval. ACM, pp 469–478

    155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated, multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium. ACM, pp 445–454

    156. Tasevski P (2011) Password attacks and generation strategies. Tartu University: Faculty of Mathematics and Computer Sciences

    • Big Data: A Survey
      • Abstract
      • Background
        • Dawn of big data era
        • Definition and features of big data
        • Big data value
        • The development of big data
        • Challenges of big data
      • Related technologies
        • Relationship between cloud computing and big data
        • Relationship between IoT and big data
        • Data center
        • Relationship between hadoop and big data
      • Big data generation and acquisition
        • Data generation
          • Enterprise data
          • IoT data
          • Bio-medical data
          • Data generation from other fields
        • Big data acquisition
          • Data collection
          • Data transportation
          • Data pre-processing
      • Big data storage
        • Storage system for massive data
        • Distributed storage system
        • Storage mechanism for big data
          • Database technology
        • Traditional data analysis
        • Big data analytic methods
        • Architecture for big data analysis
          • Real-time vs. offline analysis
          • Analysis at different levels
          • Analysis with different complexity
        • Tools for big data mining and analysis
      • Big data applications
        • Key applications of big data
          • Application evolutions
          • Structured data analysis
          • Text data analysis
          • Web data analysis
          • Multimedia data analysis
          • Network data analysis
          • Mobile data analysis
        • Key applications of big data
          • Application of big data in enterprises
          • Application of IoT based big data
          • Application of online social network-oriented big data
          • Applications of healthcare and medical big data
          • Collective intelligence
          • Smart grid
      • Conclusion, open issues, and outlook
        • Open issues
          • Theoretical research
          • Technology development
          • Practical implications
          • Data security
        • Outlook
      • Acknowledgments
      • References

      At present, although the importance of big data has been generally recognized, people still have different opinions on its definition. In general, big data refers to datasets that cannot be perceived, acquired, managed, and processed by traditional IT and software/hardware tools within a tolerable time. Because of their different concerns, scientific and technological enterprises, research scholars, data analysts, and technical practitioners have different definitions of big data. The following definitions may help us better understand the profound social, economic, and technological connotations of big data.

      In 2010, Apache Hadoop defined big data as “datasets which could not be captured, managed, and processed by general computers within an acceptable scope.” On the basis of this definition, in May 2011, McKinsey & Company, a global consulting agency, announced big data as the next frontier for innovation, competition, and productivity. In its definition, big data refers to datasets that cannot be acquired, stored, and managed by classic database software. This definition carries two connotations. First, the dataset volumes that qualify as big data are changing, and may grow over time or with technological advances. Second, the dataset volumes that qualify as big data differ from one application to another. At present, big data generally ranges from several TB to several PB [10]. From the definition by McKinsey & Company, it can be seen that the volume of a dataset is not the only criterion for big data. The ever-growing data scale and its management, which cannot be handled by traditional database technologies, are the other two key features.

      As a matter of fact, big data was defined as early as 2001. Doug Laney, an analyst at META (presently Gartner), described the challenges and opportunities brought about by increased data with a 3Vs model, i.e., the increase of Volume, Velocity, and Variety, in a research report [12]. Although such a model was not originally used to define big data, Gartner and many other enterprises, including IBM [13] and some research departments of Microsoft [14], still used the “3Vs” model to describe big data within the following ten years [15]. In the “3Vs” model, Volume means that, with the generation and collection of masses of data, the data scale becomes increasingly large. Velocity means the timeliness of big data: specifically, data collection and analysis must be conducted rapidly and in a timely manner so as to make maximum use of the commercial value of big data. Variety indicates the various types of data, which include semi-structured and unstructured data such as audio, video, webpages, and text, as well as traditional structured data.

      However, others have different opinions, including IDC, one of the most influential leaders in big data and its research fields. In 2011, an IDC report defined big data as follows: “big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling the high-velocity capture, discovery, and/or analysis” [1]. With this definition, the characteristics of big data may be summarized as four Vs, i.e., Volume (great volume), Variety (various modalities), Velocity (rapid generation), and Value (huge value but very low density), as shown in Fig. 2. Such a 4Vs definition was widely recognized since it highlights the meaning and necessity of big data, i.e., exploring the huge hidden values. This definition indicates the most critical problem in big data, which is how to discover values from datasets with an enormous scale, various types, and rapid generation. As Jay Parikh, Deputy Chief Engineer of Facebook, said, “You could only own a bunch of data other than big data if you do not utilize the collected data” [11].

      In addition, NIST defines big data as follows: “Big data shall mean the data of which the data volume, acquisition speed, or data representation limits the capacity of using traditional relational methods to conduct effective analysis, or the data which may be effectively processed with important horizontal zoom technologies,” which focuses on the technological aspect of big data. It indicates that efficient methods or technologies need to be developed and used to analyze and process big data.

      There have been considerable discussions in both industry and academia on the definition of big data [16, 17]. In addition to developing a proper definition, big data research should also focus on how to extract its value, how to use data, and how to transform “a bunch of data” into “big data.”

      1.3 Big data value

      McKinsey & Company observed how big data created value after in-depth research on US healthcare, EU public sector administration, US retail, global manufacturing, and global personal location data. Through research on these five core industries that represent the global economy, the McKinsey report pointed out that big data may give full play to the economic function, improve the productivity and competitiveness of enterprises and public sectors, and create huge benefits for consumers. In [10], McKinsey summarized the values that big data could create: if big data could be creatively and effectively utilized to improve efficiency and quality, the potential value gained by the US medical industry through data may surpass USD 300 billion, thus reducing US healthcare expenditure by over 8 %; retailers that fully utilize big data may improve their profit by more than 60 %; big data may also be utilized to improve the efficiency of government operations, such that the developed economies in Europe could save over EUR 100 billion (which excludes the effect of reduced fraud, errors, and tax differences).

      Mobile Netw Appl (2014) 19:171–209

      Fig. 2 The 4Vs feature of big data

      The McKinsey report is regarded as prospective and predictive, while the following facts may validate the values of big data. During the 2009 flu pandemic, Google obtained timely information by analyzing big data, which even provided more valuable information than that provided by disease prevention centers. Nearly all countries required hospitals to inform agencies such as disease prevention centers of new influenza cases. However, patients usually did not see doctors immediately when they got infected. It also took some time to send information from hospitals to disease prevention centers, and for disease prevention centers to analyze and summarize such information. Therefore, by the time the public became aware of the pandemic of the new type of influenza, the disease might already have been spreading for one to two weeks; the official reporting inherently lagged behind. Google found that during the spreading of influenza, the entries frequently sought at its search engines differed from those at ordinary times, and the usage frequencies of the entries were correlated with the influenza spreading in both time and location. Google found 45 search entry groups that were closely relevant to the outbreak of influenza and incorporated them into specific mathematical models to forecast the spreading of influenza and even to predict the places from which influenza spread. The related research results have been published in Nature [18].
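The idea of selecting search entries whose frequencies track influenza activity can be illustrated with a small sketch. This is not the model published in [18]; it only shows, under simplifying assumptions, how candidate entries could be ranked by their Pearson correlation with reported case counts. The function names (`pearson`, `rank_queries`) and all numbers below are hypothetical and invented for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)

def rank_queries(query_freqs, case_counts):
    """Rank search entries by correlation with case counts, strongest first."""
    scored = [(term, pearson(freqs, case_counts))
              for term, freqs in query_freqs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical weekly data (invented for illustration only).
weekly_cases = [10, 20, 40, 80, 60]
weekly_query_freqs = {
    "flu symptoms": [1, 2, 4, 8, 6],    # rises and falls with the cases
    "cheap flights": [5, 5, 4, 5, 6],   # largely unrelated to the cases
}
ranking = rank_queries(weekly_query_freqs, weekly_cases)
```

In this toy setting, the highest-ranked entries would be the candidates to feed into a forecasting model; a real system would also have to account for location, seasonality, and spurious correlations.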

      In 2008, Microsoft purchased Farecast, a sci-tech venture company in the US. Farecast has an airline ticket forecast system that predicts the trends and rising/dropping ranges of airline ticket prices. The system has been incorporated into the Bing search engine of Microsoft. By 2012, the system had saved nearly USD 50 per ticket per passenger, with a forecast accuracy as high as 75 %.

      At present, data has become an important production factor that is comparable to material assets and human capital. As multimedia, social media, and IoT develop, enterprises will collect more information, leading to an exponential growth of data volume. Big data will have a huge and increasing potential in creating value for businesses and consumers.

      1.4 The development of big data

      In the late 1970s, the concept of the “database machine” emerged, which is a technology specially used for storing and analyzing data. With the increase of data volume, the storage and processing capacity of a single mainframe computer system became inadequate. In the 1980s, people proposed “share nothing,” a parallel database system, to meet the demand of the increasing data volume [19]. The share nothing system architecture is based on the use of clusters, and every machine has its own processor, storage, and disk. The Teradata system was the first successful commercial parallel database system. Such databases later became very popular. On June 2, 1986, a milestone event occurred when Teradata delivered the first parallel database system, with a storage capacity of 1 TB, to Kmart to help the large-scale North American retail company expand its data warehouse [20]. In the late 1990s, the advantages of parallel databases were widely recognized in the database field.
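The share nothing idea described above can be sketched as follows: each node owns a private partition of the rows, a hash of the key decides which node stores a row, and a query is answered by combining per-node partial results. This is a minimal in-memory illustration, not how Teradata or any specific product is implemented; the class names `Node` and `SharedNothingCluster` are hypothetical.

```python
import zlib

class Node:
    """One machine in the cluster: it owns its rows and shares nothing."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.rows = []  # stands in for the node's private storage

    def local_sum(self, column):
        # Work done entirely on this node's own partition.
        return sum(row[column] for row in self.rows)

class SharedNothingCluster:
    def __init__(self, num_nodes):
        self.nodes = [Node(i) for i in range(num_nodes)]

    def _owner(self, key):
        # A stable hash, so the same key always maps to the same node.
        digest = zlib.crc32(str(key).encode("utf-8"))
        return self.nodes[digest % len(self.nodes)]

    def insert(self, key, row):
        self._owner(key).rows.append(row)

    def total(self, column):
        # Each node aggregates locally; only partial sums are combined.
        return sum(node.local_sum(column) for node in self.nodes)

cluster = SharedNothingCluster(num_nodes=4)
for store_id in range(100):
    cluster.insert(store_id, {"store": store_id, "sales": 10})
grand_total = cluster.total("sales")
```

Because no node reads another node's storage, capacity can in principle be increased by adding nodes, at the cost of redistributing rows when the node count changes.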

      However, many challenges on big data arose. With the development of Internet services, indexes and queried contents were rapidly growing. Therefore, search engine companies had to face the challenges of handling such big data. Google created the GFS [21] and MapReduce [22] programming models to cope with the challenges brought about by data management and analysis at the Internet scale. In addition, contents generated by users, sensors, and other ubiquitous data sources also fueled the overwhelming data flows, which required a fundamental change in the computing architecture and large-scale data processing mechanisms. In January 2007, Jim Gray, a pioneer of database software, called such a transformation “The Fourth Paradigm” [23]. He also thought the only way to cope with such a paradigm was to develop a new generation of computing tools to manage, visualize, and analyze massive data. In June 2011, another milestone event occurred: EMC/IDC published a research report titled Extracting Values from Chaos [1], which introduced the concept and potential of big data for the first time. This research report triggered great interest in big data in both industry and academia.
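To make the MapReduce programming model mentioned above more concrete, the following single-process sketch counts words in three steps: map, shuffle, and reduce. It only illustrates the shape of the model; it is not Google's implementation [22] or Hadoop's API, and the helper names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)

def word_count(documents):
    emitted = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())

counts = word_count(["big data big value", "data value chain"])
```

In a real deployment the map and reduce calls would run in parallel on many machines, with the framework handling the shuffle, scheduling, and failure recovery.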

      Over the past few years, nearly all major companies, including EMC, Oracle, IBM, Microsoft, Google, Amazon, and Facebook, have started their big data projects. Taking IBM as an example, since 2005 IBM has invested USD 16 billion in 30 acquisitions related to big data. In academia, big data was also under the spotlight. In 2008, Nature published a big data special issue. In 2011, Science also launched a special issue on the key technologies of “data processing” in big data. In 2012, European Research Consortium for Informatics and Mathematics (ERCIM) News published a special issue on big data. At the beginning of 2012, a report titled Big Data, Big Impact, presented at the Davos Forum in Switzerland, announced that big data has become a new kind of economic asset, just like currency or gold. Gartner, an international research agency, issued Hype Cycles from 2012 to 2013, which classified big data computing, social analysis, and stored data analysis into 48 emerging technologies that deserve most attention.

      Many national governments, such as that of the US, also paid great attention to big data. In March 2012, the Obama Administration announced a USD 200 million investment to launch the “Big Data Research and Development Plan,” which was the second major scientific and technological development initiative after the “Information Highway” initiative in 1993. In July 2012, the “Vigorous ICT Japan” project issued by Japan’s Ministry of Internal Affairs and Communications indicated that big data development should be a national strategy and that application technologies should be the focus. In July 2012, the United Nations issued the Big Data for Development report, which summarized how governments utilized big data to better serve and protect their people.

      1.5 Challenges of big data

      The sharply increasing data deluge in the big data era brings about huge challenges in data acquisition, storage, management, and analysis. Traditional data management and analysis systems are based on the relational database management system (RDBMS). However, such RDBMSs only apply to structured data, not to semi-structured or unstructured data. In addition, RDBMSs increasingly rely on more and more expensive hardware. It is apparent that traditional RDBMSs cannot handle the huge volume and heterogeneity of big data. The research community has proposed some solutions from different perspectives. For example, cloud computing is utilized to meet the infrastructure requirements of big data, e.g., cost efficiency, elasticity, and smooth upgrading/downgrading. For solutions of permanent storage and management of large-scale disordered datasets, distributed file systems [24] and NoSQL [25] databases are good choices. Such programming frameworks have achieved great success in processing clustered tasks, especially for webpage ranking. Various big data applications can be developed based on these innovative technologies or platforms. Moreover, it is non-trivial to deploy big data analysis systems.

      Some literature [26–28] discusses obstacles in the development of big data applications. The key challenges are listed as follows:

      – Data representation: many datasets have certain levels of heterogeneity in type, structure, semantics, organization, granularity, and accessibility. Data representation aims to make data more meaningful for computer analysis and user interpretation. Nevertheless, an improper data representation will reduce the value of the original data and may even obstruct effective data analysis. Efficient data representation shall reflect data structure, class, and type, as well as integrated technologies, so as to enable efficient operations on different datasets.

      – Redundancy reduction and data compression: generally, there is a high level of redundancy in datasets. Redundancy reduction and data compression are effective in reducing the indirect cost of the entire system, on the premise that the potential values of the data are not affected. For example, most data generated by sensor networks are highly redundant and may be filtered and compressed by orders of magnitude.

      – Data life cycle management: compared with the relatively slow advances of storage systems, pervasive sensing and computing are generating data at unprecedented rates and scales. We are confronted with many pressing challenges, one of which is that current storage systems cannot support such massive data. Generally speaking, the values hidden in big data depend on data freshness. Therefore, a data importance principle related to the analytical value should be developed to decide which data shall be stored and which data shall be discarded.

      – Analytical mechanism: the analytical system of big data shall process masses of heterogeneous data within a limited time. However, traditional RDBMSs are strictly designed, with a lack of scalability and expandability, and cannot meet the performance requirements. Non-relational databases have shown their unique advantages in the processing of unstructured data and have started to become mainstream in big data analysis. Even so, non-relational databases still have some problems in their performance and in particular applications. We shall find a compromise solution between RDBMSs and non-relational databases. For example, some enterprises (e.g., Facebook and Taobao) have utilized a mixed database architecture that integrates the advantages of both types of database. More research is needed on in-memory databases and on approximate analysis based on sample data.

      – Data confidentiality: most big data service providers or owners at present cannot effectively maintain and analyze such huge datasets because of their limited capacity. They must rely on professionals or tools to analyze such data, which increases the potential safety risks. For example, a transactional dataset generally includes a set of complete operating data to drive key business processes. Such data contains details of the lowest granularity and some sensitive information such as credit card numbers. Therefore, analysis of big data may be delivered to a third party for processing only when proper preventive measures are taken to protect such sensitive data and ensure its safety.

      – Energy management: the energy consumption of mainframe computing systems has drawn much attention from both economic and environmental perspectives. With the increase of data volume and analytical demands, the processing, storage, and transmission of big data will inevitably consume more and more electric energy. Therefore, system-level power consumption control and management mechanisms shall be established for big data, while expandability and accessibility are ensured.

      – Expandability and scalability: the analytical system of big data must support present and future datasets. The analytical algorithm must be able to process increasingly expanding and more complex datasets.

      – Cooperation: analysis of big data is interdisciplinary research, which requires experts in different fields to cooperate to harvest the potential of big data. A comprehensive big data network architecture must be established to help scientists and engineers in various fields access different kinds of data and fully utilize their expertise, so as to cooperate to complete the analytical objectives.
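As an illustration of the redundancy reduction and data compression challenge listed above, the sketch below first drops sensor readings that barely differ from the last kept reading (a simple dead-band filter, which is lossy) and then compresses what remains losslessly with zlib. The tolerance value, the helper names, and the synthetic temperature trace are assumptions made for the example; whether such filtering preserves the "potential values of the data" depends on the application.

```python
import json
import zlib

def drop_redundant(readings, tolerance):
    """Keep a reading only if it differs from the last kept one by more than tolerance."""
    kept = []
    for timestamp, value in readings:
        if not kept or abs(value - kept[-1][1]) > tolerance:
            kept.append((timestamp, value))
    return kept

def compress(readings):
    """Losslessly compress the remaining readings with DEFLATE (zlib)."""
    return zlib.compress(json.dumps(readings).encode("utf-8"))

def decompress(blob):
    """Inverse of compress(): recover the list of (timestamp, value) tuples."""
    decoded = json.loads(zlib.decompress(blob).decode("utf-8"))
    return [tuple(item) for item in decoded]

# Hypothetical temperature trace: flat, with a single jump at t = 500.
raw = [(t, 20.0 if t < 500 else 25.0) for t in range(1000)]
filtered = drop_redundant(raw, tolerance=0.5)
blob = compress(filtered)
```

On this artificial trace, 1000 readings collapse to the two points where the value actually changes, which is the kind of reduction by orders of magnitude the text refers to; real sensor data is noisier and would compress less.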

      2 Related technologies

      In order to gain a deep understanding of big data, this section will introduce several fundamental technologies that are closely related to big data, including cloud computing, IoT, data centers, and Hadoop.

      2.1 Relationship between cloud computing and big data

      Cloud computing is closely related to big data. The key components of cloud computing are shown in Fig. 3. Big data is the object of the computation-intensive operation and stresses the storage capacity of a cloud system. The main objective of cloud computing is to use huge computing and storage resources under concentrated management, so as to provide big data applications with fine-grained computing capacity. The development of cloud computing provides solutions for the storage and processing of big data. On the other hand, the emergence of big data also accelerates the development of cloud computing. The distributed storage technology based on cloud computing can effectively manage big data; the parallel computing capacity, by virtue of cloud computing, can improve the efficiency of acquiring and analyzing big data.

      Even though there are many overlapping technologies in cloud computing and big data, they differ in the following two aspects. First, the concepts are different to a certain extent. Cloud computing transforms the IT architecture, while big data influences business decision-making. However, big data depends on cloud computing as the fundamental infrastructure for smooth operation.

      Second, big data and cloud computing have different target customers. Cloud computing is a technology and product targeting Chief Information Officers (CIOs) as an advanced IT solution. Big data is a product targeting Chief Executive Officers (CEOs), focusing on business operations. Since the decision makers may directly feel the pressure from market competition, they must defeat business opponents in more competitive ways. With the advances of big data and cloud computing, these two technologies are certainly and increasingly entwined with each other. Cloud computing, with functions similar to those of computers and operating systems, provides system-level resources; big data operates in the upper level supported by cloud computing and provides functions similar to those of databases, together with efficient data processing capacity. Kissinger, President of EMC, indicated that the application of big data must be based on cloud computing.

      Fig. 3 Key components of cloud computing

      The evolution of big data was driven by the rapid growth of application demands, and cloud computing developed from virtualization technologies. Therefore, cloud computing not only provides computation and processing for big data, but is also itself a service mode. To a certain extent, the advances of cloud computing also promote the development of big data, and the two supplement each other.

      2.2 Relationship between IoT and big data

      In the IoT paradigm, an enormous number of networked sensors are embedded into various devices and machines in the real world. Such sensors, deployed in different fields, may collect various kinds of data, such as environmental data, geographical data, astronomical data, and logistic data. Mobile equipment, transportation facilities, public facilities, and home appliances could all be data acquisition equipment in IoT, as illustrated in Fig. 4.

      The big data generated by IoT has different characteristics compared with general big data because of the different types of data collected, of which the most classical characteristics include heterogeneity, variety, unstructured features, noise, and high redundancy. Although the current IoT data is not the dominant part of big data, by 2030 the quantity of sensors will reach one trillion, and then the IoT data will be the most important part of big data, according to the forecast of HP. A report from Intel pointed out that big data in IoT has three features that conform to the big data paradigm: (i) abundant terminals generating masses of data; (ii) data generated by IoT is usually semi-structured or unstructured; (iii) data of IoT is useful only when it is analyzed.

      At present, the data processing capacity of IoT has fallen behind the collected data, and it is extremely urgent to accelerate the introduction of big data technologies to promote the development of IoT. Many operators of IoT realize the importance of big data, since the success of IoT hinges upon the effective integration of big data and cloud computing. The widespread deployment of IoT will also bring many cities into the big data era.

      There is a compelling need to adopt big data for IoT applications, while the development of big data is already lagging behind. It has been widely recognized that these two technologies are inter-dependent and should be jointly developed: on one hand, the widespread deployment of IoT drives the high growth of data both in quantity and category, thus providing the opportunity for the application and development of big data; on the other hand, the application of big data technology to IoT also accelerates the research advances and business models of IoT.

      Fig. 4 Illustration of data acquisition equipment in IoT


      2.3 Data center

      In the big data paradigm, the data center is not only a platform for concentrated storage of data, but also undertakes more responsibilities, such as acquiring data, managing data, organizing data, and leveraging the data values and functions. Data centers mainly concern “data” rather than “center.” A data center holds masses of data and organizes and manages them according to its core objective and development path, which is more valuable than owning a good site and resources. The emergence of big data brings about sound development opportunities and great challenges to data centers. Big data is an emerging paradigm, which will promote the explosive growth of the infrastructure and related software of data centers. The physical data center network is the core for supporting big data, but at present it is also the key infrastructure that is most urgently required [29].

      – Big data requires the data center to provide powerful backstage support. The big data paradigm has more stringent requirements on storage capacity and processing capacity, as well as network transmission capacity. Enterprises must take the development of data centers into consideration to improve the capacity of rapidly and effectively processing big data under a limited price/performance ratio. The data center shall provide the infrastructure with a large number of nodes, build a high-speed internal network, effectively dissipate heat, and effectively back up data. Only when a highly energy-efficient, stable, safe, expandable, and redundant data center is built can the normal operation of big data applications be ensured.

      – The growth of big data applications accelerates the revolution and innovation of data centers. Many big data applications have developed their unique architectures and directly promote the development of storage, network, and computing technologies related to the data center. With the continued growth of the volumes of structured and unstructured data, and the variety of sources of analytical data, the data processing and computing capacities of the data center shall be greatly enhanced. In addition, as the scale of data centers keeps expanding, how to reduce the operational cost is also an important issue for the development of data centers.

      – Big data endows more functions to the data center. In the big data paradigm, the data center shall not only be concerned with hardware facilities but also strengthen soft capacities, i.e., the capacities of acquisition, processing, organization, analysis, and application of big data. The data center may help business personnel analyze the existing data, discover problems in business operation, and develop solutions from big data.

      2.4 Relationship between Hadoop and big data

      Presently, Hadoop is widely used in big data applications in the industry, e.g., spam filtering, network searching, clickstream analysis, and social recommendation. In addition, considerable academic research is now based on Hadoop. Some representative cases are given below. As declared in June 2012, Yahoo runs Hadoop on 42,000 servers at four data centers to support its products and services, e.g., searching and spam filtering. At present, the biggest Hadoop cluster has 4,000 nodes, but the number of nodes will be increased to 10,000 with the release of Hadoop 2.0. In the same month, Facebook announced that their Hadoop cluster can process 100 PB of data, which grew by 0.5 PB per day as of November 2012. Some well-known agencies that use Hadoop to conduct distributed computation are listed in [30]. In addition, many companies provide commercial Hadoop execution and/or support, including Cloudera, IBM, MapR, EMC, and Oracle.

      Among modern industrial machinery and systems, sensors are widely deployed to collect information for environment monitoring, failure forecasting, etc. Bahga et al. in [31] proposed a framework for data organization and cloud computing infrastructure, termed CloudView. CloudView uses mixed architectures, local nodes and remote clusters based on Hadoop, to analyze machine-generated data. Local nodes are used for the forecast of real-time failures; clusters based on Hadoop are used for complex offline analysis, e.g., case-driven data analysis.

      The exponential growth of genome data and the sharp drop of sequencing cost are transforming bio-science and bio-medicine into data-driven science. Gunarathne et al. in [32] utilized cloud computing infrastructures (Amazon AWS and Microsoft Azure) and data processing frameworks based on MapReduce (Hadoop and Microsoft DryadLINQ) to run two parallel bio-medicine applications: (i) assembly of genome segments; (ii) dimension reduction in the analysis of chemical structure. In the latter application, the 166-D datasets used include 26,000,000 data points. The authors compared the performance of all the frameworks in terms of efficiency, cost, and availability. According to the study, the authors concluded that loose coupling will be increasingly applied to research on electron cloud, and that the parallel programming technology (MapReduce) framework may provide the user with an interface with more convenient services and reduce unnecessary costs.

      3 Big data generation and acquisition

      We have introduced several key technologies related to big data, i.e., cloud computing, IoT, data centers, and Hadoop. Next, we will focus on the value chain of big data, which can be generally divided into four phases: data generation, data acquisition, data storage, and data analysis. If we take data as a raw material, data generation and data acquisition are an exploitation process, data storage is a storage process, and data analysis is a production process that utilizes the raw material to create new value.

      3.1 Data generation

      Data generation is the first step of big data. Taking Internet data as an example, a huge amount of data in terms of search entries, Internet forum posts, chat records, and microblog messages is generated. Those data are closely related to people’s daily lives and share the features of high value and low density. Such Internet data may be valueless individually, but through the exploitation of accumulated big data, useful information such as the habits and hobbies of users can be identified, and it is even possible to forecast users’ behaviors and emotional moods.

      Moreover, generated through longitudinal and/or distributed data sources, datasets are becoming more large-scale, highly diverse, and complex. Such data sources include sensors, videos, clickstreams, and/or all other available data sources. At present, the main sources of big data are the operation and trading information in enterprises, logistic and sensing information in the IoT, human interaction information and position information in the Internet world, and data generated in scientific research. The information far surpasses the capacities of the IT architectures and infrastructures of existing enterprises, while its real-time requirement also greatly stresses the existing computing capacity.

      3.1.1 Enterprise data

      In 2013, IBM issued Analysis the Applications of Big Data to the Real World, which indicates that the internal data of enterprises are the main sources of big data. The internal data of enterprises mainly consist of online trading data and online analysis data, most of which are historically static data and are managed by RDBMSs in a structured manner. In addition, production data, inventory data, sales data, financial data, etc., also constitute enterprise internal data, which aims to capture the informationized and data-driven activities of enterprises, so as to record all activities of enterprises in the form of internal data.

      Over the past decades, IT and digital data have contributed a lot to improving the profitability of business departments. It is estimated that the business data volume of all companies in the world may double every 1.2 years [10], in which the business turnover through the Internet, enterprises to enterprises, and enterprises to consumers, per day will reach USD 450 billion [33]. The continuously increasing business data volume requires more effective real-time analysis so as to fully harvest its potential. For example, Amazon processes millions of terminal operations and more than 500,000 queries from third-party sellers per day [12]. Walmart processes one million customer trades per hour, and such trading data are imported into a database with a capacity of over 2.5 PB [3]. Akamai analyzes 75 million events per day for its target advertisements [13].

      3.1.2 IoT data

      As discussed, IoT is an important source of big data. Among smart cities constructed based on IoT, big data may come from industry, agriculture, traffic, transportation, medical care, public departments, families, etc.

      According to the processes of data acquisition and transmission in IoT, its network architecture may be divided into three layers: the sensing layer, the network layer, and the application layer. The sensing layer is responsible for data acquisition and mainly consists of sensor networks. The network layer is responsible for information transmission and processing, where close transmission may rely on sensor networks and remote transmission shall depend on the Internet. Finally, the application layer supports specific applications of IoT.
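The three-layer division described above can be pictured as a simple pipeline in which each layer only consumes the output of the layer below it. The sketch is purely illustrative: the function names, the message layout, the gateway identifier, and the over-temperature application are all hypothetical and do not correspond to any particular IoT standard.

```python
def sensing_layer(sensors):
    """Sensing layer: acquire one raw reading from every sensor."""
    return [{"sensor_id": sensor_id, "value": read()}
            for sensor_id, read in sensors.items()]

def network_layer(readings, gateway_id):
    """Network layer: wrap the readings in a message for transmission."""
    return {"gateway": gateway_id, "payload": readings}

def application_layer(message, threshold):
    """Application layer: turn delivered data into an application-level result."""
    return [reading["sensor_id"]
            for reading in message["payload"]
            if reading["value"] > threshold]

# Hypothetical sensors; each callable stands in for a hardware read.
sensors = {"temp-1": lambda: 21.5, "temp-2": lambda: 38.0}
message = network_layer(sensing_layer(sensors), gateway_id="gw-A")
alerts = application_layer(message, threshold=30.0)
```

Keeping the layers separate in this way means that, for instance, the transmission mechanism could change from a local sensor network to the Internet without touching the application logic.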

      According to the characteristics of the Internet of Things, the data generated from IoT has the following features:

      – Large-scale data: in IoT, masses of data acquisition equipment are deployed in a distributed manner, which may acquire simple numeric data, e.g., location, or complex multimedia data, e.g., surveillance video. In order to meet the demands of analysis and processing, not only the currently acquired data but also the historical data within a certain time frame should be stored. Therefore, data generated by IoT are characterized by large scales.

      – Heterogeneity: because of the variety of data acquisition devices, the acquired data is also diverse, and such data features heterogeneity.

      – Strong time and space correlation: in IoT, every data acquisition device is placed at a specific geographic location, and every piece of data has a time stamp. The time and space correlation is an important property of data from IoT. During data analysis and processing, time and space are also important dimensions for statistical analysis.

      – Effective data accounts for only a small portion of the big data: a great quantity of noise may occur during the acquisition and transmission of data in IoT. Among datasets acquired by acquisition devices, only a small amount of abnormal data is valuable. For example, during the acquisition of traffic video, the few video frames that capture violations of traffic regulations and traffic accidents are more valuable than those that only capture the normal flow of traffic.

      3.1.3 Bio-medical data

      As a series of high-throughput bio-measurement technologies were innovatively developed at the beginning of the 21st century, frontier research in the bio-medicine field also entered the era of big data. By constructing smart, efficient, and accurate analytical models and theoretical systems for bio-medicine applications, the essential governing mechanism behind complex biological phenomena may be revealed. This can not only shape the future development of bio-medicine but also play a leading role in the development of a series of important strategic industries related to the national economy, people's livelihood, and national security, with important applications such as medical care, new drug R&D, and grain production (e.g., transgenic crops).

      The completion of the HGP (Human Genome Project) and the continued development of sequencing technology also lead to widespread applications of big data in the field. The masses of data generated by gene sequencing go through specialized analysis according to different application demands, so as to combine them with clinical gene diagnosis and provide valuable information for early diagnosis and personalized treatment of disease. One sequencing of human genes may generate 100–600 GB of raw data. In the China National Genebank in Shenzhen, there are 1.3 million samples, including 1.15 million human samples and 150,000 animal, plant, and microorganism samples. By the end of 2013, 10 million traceable biological samples will be stored, and by the end of 2015 this figure will reach 30 million. It is predictable that, with the development of bio-medicine technologies, gene sequencing will become faster and more convenient, thus making the big data of bio-medicine grow continuously.

      In addition, data generated from clinical medical care and medical R&D also rise quickly. For example, the University of Pittsburgh Medical Center (UPMC) has stored 2 TB of such data. Explorys, an American company, provides platforms to collocate clinical data, operation and maintenance data, and financial data. At present, about 13 million people's information has been collocated, with 44 articles of data at the scale of about 60 TB, which will reach 70 TB in 2013. Practice Fusion, another American company, manages electronic medical records of about 200,000 patients.

      Apart from such small and medium-sized enterprises, well-known IT companies such as Google, Microsoft, and IBM have invested extensively in the research and computational analysis of methods related to high-throughput biological big data, seeking shares in the huge market known as the "Next Internet". IBM forecast at the 2013 Strategy Conference that, with the sharp increase of medical images and electronic medical records, medical professionals may utilize big data to extract useful clinical information from masses of data, to obtain a medical history and forecast treatment effects, thus improving patient care and reducing cost. It is anticipated that, by 2015, the average data volume of every hospital will increase from 167 TB to 665 TB.

      3.1.4 Data generation from other fields

      As scientific applications are increasing, the scale of datasets is gradually expanding, and the development of some disciplines greatly relies on the analysis of masses of data. Here, we examine several such applications. Although they are in different scientific fields, the applications have similar and increasing demands on data analysis. The first example is related to computational biology. GenBank is a nucleotide sequence database maintained by the U.S. National Center for Biotechnology Information (NCBI). Data in this database may double every 10 months. By August 2009, GenBank had more than 250 billion bases from 150,000 different organisms [34]. The second example is related to astronomy. The Sloan Digital Sky Survey (SDSS), the biggest sky survey project in astronomy, recorded 25 TB of data from 1998 to 2008. As the resolution of the telescope is improved, by 2004 the data volume generated per night will surpass 20 TB. The last application is related to high-energy physics. At the beginning of 2008, the ATLAS experiment of the Large Hadron Collider (LHC) of the European Organization for Nuclear Research generates raw data at 2 PB/s and stores about 10 TB of processed data per year.

      In addition, pervasive sensing and computing among nature, commercial, Internet, government, and social environments are generating heterogeneous data with unprecedented complexity. These datasets have their unique data characteristics in scale, time dimension, and data category. For example, mobile data were recorded with respect to positions, movement, approximation degrees, communications, multimedia, use of applications, and audio environment [108]. According to the application environment and requirements, such datasets should be classified into different categories, so as to select proper and feasible solutions for big data.

      3.2 Big data acquisition

      As the second phase of the big data system, big data acquisition includes data collection, data transmission, and data pre-processing. During big data acquisition, once we collect the raw data, we shall utilize an efficient transmission mechanism to send it to a proper storage management system to support different analytical applications. The collected datasets may sometimes include much redundant or useless data, which unnecessarily increases storage space and affects the subsequent data analysis. For example, high redundancy is very common among datasets collected by sensors for environment monitoring. Data compression technology can be applied to reduce the redundancy. Therefore, data pre-processing operations are indispensable to ensure efficient data storage and exploitation.
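To make the remark about redundancy in sensor data concrete, the following is a minimal sketch of run-length encoding, one simple lossless compression technique that works well when a sensor reports the same value many times in a row. It is an illustration only, not a method proposed in the surveyed work; the function names and the sample readings are hypothetical.

```python
from itertools import groupby


def rle_encode(readings):
    """Collapse consecutive identical readings into (value, count) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(readings)]


def rle_decode(pairs):
    """Expand (value, count) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


# Hypothetical temperature samples from one environment-monitoring sensor.
samples = [21.5, 21.5, 21.5, 21.6, 21.6, 21.5, 21.5, 21.5, 21.5]
encoded = rle_encode(samples)

# The encoding is lossless: decoding restores the original readings.
restored = rle_decode(encoded)
```

Nine readings collapse into three pairs here. The gain depends entirely on how often consecutive readings repeat, which is why the cost of compression should be weighed against its benefit for each dataset.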

      3.2.1 Data collection

      Data collection utilizes special data collection techniques to acquire raw data from a specific data generation environment. Four common data collection methods are as follows:

      – Log files: as one widely used data collection method, log files are record files automatically generated by the data source system, so as to record activities in designated file formats for subsequent analysis. Log files are used in nearly all digital devices. For example, web servers record in log files the number of clicks, click rates, visits, and other property records of web users [35]. To capture activities of users at web sites, web servers mainly use the following three log file formats: common log file format (NCSA), extended log format (W3C), and IIS log format (Microsoft). All three types of log files are in ASCII text format. Databases, rather than text files, may sometimes be used to store log information to improve the query efficiency of the massive log store [36, 37]. There are also other forms of log-based data collection, including stock indicators in financial applications and determination of operating states in network monitoring and traffic management.

      – Sensing: sensors are common in daily life to measure physical quantities and transform them into readable digital signals for subsequent processing (and storage). Sensory data may be classified as sound wave, voice, vibration, automobile, chemical, current, weather, pressure, temperature, etc. Sensed information is transferred to a data collection point through wired or wireless networks. For applications that may be easily deployed and managed, e.g., a video surveillance system [38], the wired sensor network is a convenient solution to acquire related information. Sometimes the accurate position of a specific phenomenon is unknown, and sometimes the monitored environment does not have the energy or communication infrastructures. Then wireless communication must be used to enable data transmission among sensor nodes under limited energy and communication capability. In recent years, wireless sensor networks (WSNs) have received considerable interest and have been applied to many applications, such as environmental research [39, 40], water quality monitoring [41], civil engineering [42, 43], and wildlife habitat monitoring [44]. A WSN generally consists of a large number of geographically distributed sensor nodes, each being a micro device powered by a battery. Such sensors are deployed at designated positions as required by the application to collect remote sensing data. Once the sensors are deployed, the base station will send control information for network configuration/management or data collection to sensor nodes. Based on such control information, the sensory data is assembled in different sensor nodes and sent back to the base station for further processing. Interested readers are referred to [45] for more detailed discussions.

      – Methods for acquiring network data: at present, network data acquisition is accomplished using a combination of a web crawler, a word segmentation system, a task system, an index system, etc. A web crawler is a program used by search engines for downloading and storing web pages [46]. Generally speaking, a web crawler starts from the uniform resource locator (URL) of an initial web page to access other linked web pages, during which it stores and sequences all the retrieved URLs. The web crawler acquires a URL in the order of precedence through a URL queue, then downloads the web page, identifies all URLs in the downloaded page, and extracts new URLs to be put in the queue. This process is repeated until the web crawler is stopped. Data acquisition through a web crawler is widely applied in applications based on web pages, such as search engines or web caching. Traditional web page extraction technologies feature multiple efficient solutions, and considerable research has been done in this field. As more advanced web page applications are emerging, some extraction strategies are proposed in [47] to cope with rich Internet applications.

      The current network data acquisition technologies mainly include traditional Libpcap-based packet capture technology, zero-copy packet capture technology, as well as some specialized network monitoring software such as Wireshark, SmartSniff, and WinNetCap.

      – Libpcap-based packet capture technology: Libpcap (packet capture library) is a widely used network data packet capture function library. It is a general tool that does not depend on any specific system and is mainly used to capture data in the data link layer. It features simplicity, ease of use, and portability, but has a relatively low efficiency. Therefore, under a high-speed network environment, considerable packet losses may occur when Libpcap is used.


      – Zero-copy packet capture technology: the so-called zero-copy (ZC) means that no copies between any internal memories occur during packet receiving and sending at a node. In sending, the data packets start directly from the user buffer of applications, pass through the network interfaces, and arrive at an external network. In receiving, the network interfaces send data packets directly to the user buffer. The basic idea of zero-copy is to reduce data copy times, reduce system calls, and reduce CPU load while datagrams are passed from network equipment to user program space. The zero-copy technology first utilizes direct memory access (DMA) technology to directly transmit network datagrams to an address space pre-allocated by the system kernel, so as to avoid the participation of the CPU. In the meanwhile, it maps the internal memory of the datagrams in the system kernel to that of the detection program, or builds a cache region in the user space and maps it to the kernel space. Then the detection program directly accesses the internal memory, so as to reduce internal memory copies from system kernel to user space and reduce the number of system calls.

      – Mobile equipment: at present, mobile devices are more widely used. As mobile device functions become increasingly stronger, they feature more complex and multiple means of data acquisition as well as a greater variety of data. Mobile devices may acquire geographical location information through positioning systems; acquire audio information through microphones; acquire pictures, videos, streetscapes, two-dimensional barcodes, and other multimedia information through cameras; and acquire user gestures and other body language information through touch screens and gravity sensors. Over the years, wireless operators have improved the service level of the mobile Internet by acquiring and analyzing such information. For example, the iPhone itself is a "mobile spy". It may collect wireless data and geographical location information, and then send such information back to Apple Inc. for processing, of which the user is not aware. Apart from Apple, smart phone operating systems such as Android of Google and Windows Phone of Microsoft can also collect information in a similar manner.
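As an illustration of log-based collection, the sketch below parses web server records in the NCSA common log format (host, identity, user, timestamp, request line, status, size) and counts hits per requested path. It is a minimal example under stated assumptions: the regular expression covers only well-formed lines, and the function names and sample records are hypothetical.

```python
import re
from collections import Counter

# One record: host ident user [time] "request" status size
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" means no body was returned; treat it as zero bytes.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


def count_hits_per_path(lines):
    """Count requests per URL path, skipping lines that fail to parse."""
    hits = Counter()
    for line in lines:
        record = parse_clf_line(line)
        if record is None:
            continue
        parts = record["request"].split()
        if len(parts) >= 2:  # e.g. ["GET", "/index.html", "HTTP/1.0"]
            hits[parts[1]] += 1
    return hits


# Hypothetical sample records, including one malformed line.
sample_log = [
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '127.0.0.1 - - [10/Oct/2000:13:55:40 -0700] "GET /about.html HTTP/1.0" 200 512',
    '10.0.0.7 - - [10/Oct/2000:13:56:01 -0700] "GET /index.html HTTP/1.0" 304 -',
    'not a log line',
]
hits = count_hits_per_path(sample_log)
```

Because the records are plain ASCII text, a line-by-line scan like this is easy to write but slow for very large log stores, which is one reason databases are sometimes used instead.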

      In addition to the aforementioned data acquisition methods for the main data sources, there are many other data collection methods or systems. For example, in scientific experiments, many special tools can be used to collect experimental data, such as magnetic spectrometers and radio telescopes. We may classify data collection methods from different perspectives. From the perspective of data sources, data collection methods can be classified into two categories: collection methods recording through data sources and collection methods recording through other auxiliary tools.
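The URL-queue loop of a web crawler described earlier in this section can be sketched as follows. This is a simplified illustration, not a production crawler: the `fetch` callable, the `max_pages` stop condition, and the in-memory `FAKE_WEB` pages are assumptions introduced here so that the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch, max_pages=100):
    """Breadth-first crawl from seed_url; fetch(url) returns HTML or None."""
    queue = deque([seed_url])  # URLs waiting to be downloaded, in order
    seen = {seed_url}          # every URL ever queued, to avoid repeats
    pages = {}                 # url -> downloaded HTML
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:       # page could not be downloaded
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# Hypothetical three-page site standing in for real HTTP downloads.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.test/b": '<a href="/missing">gone</a>',
}
pages = crawl("http://example.test/", FAKE_WEB.get)
```

The `seen` set is what keeps the process from looping forever on pages that link to each other. A real crawler would also need politeness delays, robots.txt handling, and error handling, all of which are omitted here.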

      3.2.2 Data transportation

      Upon the completion of raw data collection, data will be transferred to a data storage infrastructure for processing and analysis. As discussed in Section 2.3, big data is mainly stored in a data center. The data layout should be adjusted to improve computing efficiency or facilitate hardware maintenance. In other words, internal data transmission may occur in the data center. Therefore, data transmission consists of two phases: Inter-DCN transmissions and Intra-DCN transmissions.

      – Inter-DCN transmissions: Inter-DCN transmissions are from the data source to the data center, and are generally achieved with the existing physical network infrastructure. Because of the rapid growth of traffic demands, the physical network infrastructure in most regions around the world is constituted by high-volume, high-rate, and cost-effective optical fiber transmission systems. Over the past 20 years, advanced management equipment and technologies have been developed, such as the IP-based wavelength division multiplexing (WDM) network architecture, to conduct smart control and management of optical fiber networks [48, 49]. WDM is a technology that multiplexes multiple optical carrier signals with different wavelengths and couples them to the same optical fiber of the optical link. In such technology, lasers with different wavelengths carry different signals. So far, backbone networks have been deployed with WDM optical transmission systems with a single-channel rate of 40 Gb/s. At present, 100 Gb/s commercial interfaces are available, and 100 Gb/s systems (or TB/s systems) will be available in the near future [50]. However, traditional optical transmission technologies are limited by the bandwidth of the electronic bottleneck [51]. Recently, orthogonal frequency-division multiplexing (OFDM), initially designed for wireless systems, is regarded as one of the main candidate technologies for future high-speed optical transmission. OFDM is a multi-carrier parallel transmission technology. It segments a high-speed data flow into low-speed sub-data-flows to be transmitted over multiple orthogonal sub-carriers [52]. Compared with the fixed channel spacing of WDM, OFDM allows sub-channel frequency spectrums to overlap with each other [53]. Therefore, it is a flexible and efficient optical networking technology.

      – Intra-DCN transmissions: Intra-DCN transmissions are the data communication flows within data centers. Intra-DCN transmissions depend on the communication mechanism within the data center (i.e., on physical connection plates, chips, internal memories of data servers, network architectures of data centers, and communication protocols). A data center consists of multiple integrated server racks interconnected with its internal connection networks. Nowadays, the internal connection networks of most data centers are fat-tree, two-layer, or three-layer structures based on multi-commodity network flows [51, 54]. In the two-layer topological structure, the racks are connected by 1 Gbps top-of-rack switches (TOR), and such top-of-rack switches are then connected with 10 Gbps aggregation switches. The three-layer topological structure adds one layer on top of the two-layer structure, and this layer is constituted by 10 Gbps or 100 Gbps core switches that connect the aggregation switches. There are also other topological structures which aim to improve data center networks [55–58]. Because of the inadequacy of electronic packet switches, it is difficult to increase communication bandwidths while keeping energy consumption low. Over the years, due to the huge success achieved by optical technologies, optical interconnection among the networks in data centers has drawn great interest. Optical interconnection is a high-throughput, low-delay, and low-energy-consumption solution. At present, optical technologies are only used for point-to-point links in data centers. Such optical links provide connection for the switches using low-cost multi-mode fiber (MMF) with a 10 Gbps data rate. Optical interconnection (switching in the optical domain) of networks in data centers is a feasible solution, which can provide Tbps-level transmission bandwidth with low energy consumption. Recently, many optical interconnection plans have been proposed for data center networks [59]. Some plans add optical paths to upgrade the existing networks, and other plans completely replace the current switches [59–64]. As a strengthening technology, Zhou et al. in [65] adopt wireless links in the 60 GHz frequency band to strengthen wired links. Network virtualization should also be considered to improve the efficiency and utilization of data center networks.

      3.2.3 Data pre-processing

      Because of the wide variety of data sources, the collected datasets vary with respect to noise, redundancy, consistency, etc., and it is undoubtedly a waste to store meaningless data. In addition, some analytical methods have strict requirements on data quality. Therefore, in order to enable effective data analysis, we shall pre-process data under many circumstances to integrate the data from different sources, which can not only reduce storage expense but also improve analysis accuracy. Some related data pre-processing techniques are discussed as follows.

      – Integration: data integration is the cornerstone of modern commercial informatics, which involves the combination of data from different sources and provides users with a uniform view of data [66]. This is a mature research field for traditional databases. Historically, two methods have been widely recognized: data warehouse and data federation. Data warehousing includes a process named ETL (Extract, Transform, and Load). Extraction involves connecting source systems and selecting, collecting, analyzing, and processing the necessary data. Transformation is the execution of a series of rules to transform the extracted data into standard formats. Loading means importing extracted and transformed data into the target storage infrastructure. Loading is the most complex procedure among the three, and includes operations such as transformation, copy, clearing, standardization, screening, and data organization. In data federation, a virtual database can be built to query and aggregate data from different data sources, but such a database does not contain data. Instead, it includes information or metadata related to the actual data and its positions. These two "storage-reading" approaches do not satisfy the high performance requirements of data flows or search programs and applications. Compared with queries, data in these two approaches is more dynamic and must be processed during data transmission. Generally, data integration methods are accompanied by flow processing engines and search engines [30, 67].

      – Cleaning: data cleaning is a process to identify inaccurate, incomplete, or unreasonable data, and then modify or delete such data to improve data quality. Generally, data cleaning includes five complementary procedures [68]: defining and determining error types, searching and identifying errors, correcting errors, documenting error examples and error types, and modifying data entry procedures to reduce future errors. During cleaning, data formats, completeness, rationality, and restrictions shall be inspected. Data cleaning is of vital importance for keeping data consistent, and is widely applied in many fields, such as banking, insurance, retail, telecommunications, and traffic control.

      In e-commerce, most data is electronically collected, which may have serious data quality problems. Classic data quality problems mainly come from software defects, customized errors, or system misconfiguration. The authors in [69] discussed data cleaning in e-commerce by crawlers and regularly re-copying customer and account information.

      In [70], the problem of cleaning RFID data was examined. RFID is widely used in many applications, e.g., inventory management and target tracking. However, raw RFID data is of low quality and includes a lot of abnormal data, limited by the physical design and affected by environmental noise. In [71], a probability model was developed to cope with data loss in mobile environments. Khoussainova et al. in [72] proposed a system to automatically correct errors of input data by defining global integrity constraints.

      Herbert et al. [73] proposed a framework called BIO-AJAX to standardize biological data so as to conduct further computation and improve search quality. With BIO-AJAX, some errors and repetitions may be eliminated, and common data mining technologies can be executed more effectively.

      – Redundancy elimination: data redundancy refers to data repetitions or surplus, which usually occurs in many datasets. Data redundancy can increase unnecessary data transmission expense and cause defects in storage systems, e.g., waste of storage space, data inconsistency, reduction of data reliability, and data damage. Therefore, various redundancy reduction methods have been proposed, such as redundancy detection, data filtering, and data compression. Such methods may apply to different datasets or application environments. However, redundancy reduction may also bring about certain negative effects. For example, data compression and decompression cause additional computational burden. Therefore, the benefits of redundancy reduction and its cost should be carefully balanced. Data collected from different fields will increasingly appear in image or video formats. It is well known that images and videos contain considerable redundancy, including temporal redundancy, spatial redundancy, statistical redundancy, and sensing redundancy. Video compression is widely used to reduce redundancy in video data, as specified in the many video coding standards (MPEG-2, MPEG-4, H.263, and H.264/AVC). In [74], the authors investigated the problem of video compression in a video surveillance system with a video sensor network. The authors propose a new MPEG-4 based method by investigating the contextual redundancy related to background and foreground in a scene. The low complexity and the low compression ratio of the proposed approach were demonstrated by the evaluation results.

      For generalized data transmission or storage, repeated data deletion is a special data compression technology, which aims to eliminate repeated data copies [75]. With repeated data deletion, individual data blocks or data segments are assigned identifiers (e.g., using a hash algorithm) and stored, with the identifiers added to the identification list. As the analysis of repeated data deletion continues, if a new data block has an identifier that is identical to one in the identification list, the new data block is deemed redundant and is replaced by a reference to the corresponding stored data block. Repeated data deletion can greatly reduce storage requirements, which is particularly important to a big data storage system. Apart from the aforementioned data pre-processing methods, specific data objects shall go through some other operations, such as feature extraction. Such operations play an important role in multimedia search and DNA analysis [76–78]. Usually, high-dimensional feature vectors (or high-dimensional feature points) are used to describe such data objects, and the system stores the feature vectors for future retrieval. Data transfer is usually used to process distributed heterogeneous data sources, especially business datasets [79]. As a matter of fact, in consideration of the variety of datasets, it is non-trivial or even impossible to build a uniform data pre-processing procedure and technology that is applicable to all types of datasets. The specific features, problems, performance requirements, and other factors of the datasets should be considered, so as to select a proper data pre-processing strategy.
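The repeated data deletion procedure described above (assign each block an identifier with a hash algorithm, keep an identification list, and store a block only if its identifier is new) can be sketched as follows. This is a minimal illustration under stated assumptions: fixed-size blocks, SHA-256 as the hash, and an in-memory dictionary as the store. The class name `DedupStore` and its methods are hypothetical.

```python
import hashlib


class DedupStore:
    """Store data as fixed-size blocks, keeping one copy of each distinct block."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        # Identification list and block storage in one map: identifier -> bytes.
        self.blocks = {}

    def put(self, data):
        """Store data and return its recipe: the ordered list of block identifiers."""
        recipe = []
        for start in range(0, len(data), self.block_size):
            block = data[start:start + self.block_size]
            ident = hashlib.sha256(block).hexdigest()
            if ident not in self.blocks:
                # First time this content is seen: store the block itself.
                self.blocks[ident] = block
            # Redundant blocks are represented only by their identifier.
            recipe.append(ident)
        return recipe

    def get(self, recipe):
        """Rebuild the original data from a recipe."""
        return b"".join(self.blocks[ident] for ident in recipe)


store = DedupStore(block_size=4)
original = b"ABCDABCDABCDXYZ!"
recipe = store.put(original)
restored = store.get(recipe)
```

In this example the 16-byte input is split into four blocks, but only two distinct blocks are stored. The approach assumes that two different blocks never share an identifier, which holds in practice for a cryptographic hash but is still a design assumption worth stating.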

      4 Big data storage

      The explosive growth of data imposes stricter requirements on storage and management. In this section, we focus on the storage of big data. Big data storage refers to the storage and management of large-scale datasets while achieving reliability and availability of data access. We will review important issues including massive storage systems, distributed storage systems, and big data storage mechanisms. On one hand, the storage infrastructure needs to provide information storage service with reliable storage space; on the other hand, it must provide a powerful access interface for query and analysis of a large amount of data.

      Traditionally, as auxiliary equipment of servers, data storage devices are used to store, manage, look up, and analyze data with structured RDBMSs. With the sharp growth of data, data storage devices are becoming increasingly important, and many Internet companies pursue large storage capacity to be competitive. Therefore, there is a compelling need for research on data storage.


      4.1 Storage system for massive data

      Various storage systems have emerged to meet the demands of massive data. Existing massive storage technologies can be classified as Direct Attached Storage (DAS) and network storage, while network storage can be further classified into Network Attached Storage (NAS) and Storage Area Network (SAN).

      In DAS, various hard disks are directly connected with servers, and data management is server-centric, such that storage devices are peripheral equipment, each of which takes a certain amount of I/O resource and is managed by individual application software. For this reason, DAS is only suitable for interconnecting servers at a small scale. Due to its low scalability, DAS exhibits undesirable efficiency when the storage capacity is increased, i.e., its upgradeability and expandability are greatly limited. Thus, DAS is mainly used in personal computers and small-sized servers.

      Network storage utilizes the network to provide users with a unified interface for data access and sharing. Network storage equipment includes special data exchange equipment, disk arrays, tape libraries, and other storage media, as well as special storage software. It is characterized by strong expandability.

      NAS is actually auxiliary storage equipment of a network. It is directly connected to a network through a hub or switch using TCP/IP protocols. In NAS, data is transmitted in the form of files. Compared to DAS, the I/O burden at a NAS server is reduced extensively, since the server accesses a storage device indirectly through a network.

      While NAS is network-oriented, SAN is especially designed for data storage with a scalable and bandwidth-intensive network, e.g., a high-speed network with optical fiber connections. In SAN, data storage management is relatively independent within a storage local area network, where multipath-based data switching among any internal nodes is utilized to achieve a maximum degree of data sharing and data management.

      From the perspective of the organization of a data storage system, DAS, NAS, and SAN can all be divided into three parts: (i) the disk array, which is the foundation of a storage system and the fundamental guarantee for data storage; (ii) the connection and network sub-systems, which provide connection among one or more disk arrays and servers; and (iii) the storage management software, which handles data sharing, disaster recovery, and other storage management tasks of multiple servers.

      4.2 Distributed storage system

      The first challenge brought about by big data is how to develop a large-scale distributed storage system for efficient data processing and analysis. To use a distributed system to store massive data, the following factors should be taken into consideration:

      – Consistency: a distributed storage system requires multiple servers to cooperatively store data. As there are more servers, the probability of server failures will be larger. Usually, data is divided into multiple pieces to be stored at different servers to ensure availability in case of server failure. However, server failures and parallel storage may cause inconsistency among different copies of the same data. Consistency refers to assuring that multiple copies of the same data are identical.

      – Availability: a distributed storage system operates on multiple sets of servers. As more servers are used, server failures are inevitable. It is desirable that the entire system is not seriously affected and can still satisfy customers' read and write requests. This property is called availability.

      – Partition tolerance: multiple servers in a distributed storage system are connected by a network. The network could have link/node failures or temporary congestion. The distributed system should have a certain level of tolerance to problems caused by network failures. It is desirable that the distributed storage still works well when the network is partitioned.

      Eric Brewer proposed the CAP theory [80, 81] in 2000, which indicated that a distributed system could not simultaneously meet the requirements on consistency, availability, and partition tolerance; at most two of the three requirements can be satisfied simultaneously. Seth Gilbert and Nancy Lynch from MIT proved the correctness of the CAP theory in 2002. Since consistency, availability, and partition tolerance cannot be achieved simultaneously, we can have a CA system by ignoring partition tolerance, a CP system by ignoring availability, and an AP system by ignoring consistency, according to different design goals. The three systems are discussed in the following.

      CA systems do not have partition tolerance, i.e., they cannot handle network failures. Therefore, CA systems are generally deemed to be storage systems with a single server, such as traditional small-scale relational databases. Such systems feature a single copy of data, such that consistency is easily ensured. Availability is guaranteed by the excellent design of relational databases. However, since CA systems cannot handle network failures, they cannot be expanded to use many servers. Therefore, most large-scale storage systems are CP systems or AP systems.

      Compared with CA systems, CP systems ensure partition tolerance. Therefore, CP systems can be expanded to become distributed systems. CP systems generally maintain several copies of the same data in order to ensure a level of fault tolerance. CP systems also ensure data consistency, i.e., multiple copies of the same data are guaranteed to be completely identical. However, CP systems cannot ensure sound availability because of the high cost of consistency assurance. Therefore, CP systems are useful for scenarios with moderate load but stringent requirements on data accuracy (e.g., trading data). BigTable and HBase are two popular CP systems.

      AP systems also ensure partition tolerance. However, AP systems differ from CP systems in that they also ensure availability, while ensuring only eventual consistency rather than the strong consistency of the previous two systems. Therefore, AP systems only apply to scenarios with frequent requests but not very high requirements on accuracy. For example, in online Social Networking Services (SNS) systems, there are many concurrent visits to the data, but a certain amount of data errors is tolerable. Furthermore, because AP systems ensure eventual consistency, accurate data can still be obtained after a certain amount of delay. Therefore, AP systems may also be used under circumstances with no stringent real-time requirements. Dynamo and Cassandra are two popular AP systems.

      4.3 Storage mechanism for big data

      Considerable research on big data promotes the development of storage mechanisms for big data. Existing storage mechanisms of big data may be classified into three bottom-up levels: (i) file systems, (ii) databases, and (iii) programming models.

      File systems are the foundation of the applications at upper levels. Google's GFS is an expandable distributed file system that supports large-scale, distributed, data-intensive applications [25]. GFS uses cheap commodity servers to achieve fault tolerance and provides customers with high-performance services. GFS supports large-scale file applications with more frequent reading than writing. However, GFS also has some limitations, such as a single point of failure and poor performance for small files. Such limitations have been overcome by Colossus [82], the successor of GFS.

      In addition, other companies and researchers also have their own solutions to meet the different demands for storage of big data. For example, HDFS and Kosmosfs are derivatives of the open source codes of GFS. Microsoft developed Cosmos [83] to support its search and advertisement business. Facebook utilizes Haystack [84] to store its large amount of small-sized photos. Taobao also developed TFS and FastDFS. In conclusion, distributed file systems have become relatively mature after years of development and business operation. Therefore, we will focus on the other two levels in the rest of this section.

      4.3.1 Database technology

      Database technology has been evolving for more than 30 years. Various database systems have been developed to handle datasets at different scales and to support various applications. Traditional relational databases cannot meet the challenges on categories and scales brought about by big data. NoSQL databases (i.e., non-traditional relational databases) are becoming more popular for big data storage. NoSQL databases feature flexible modes (schemas), support for simple and easy copy (replication), simple APIs, eventual consistency, and support for large volumes of data. NoSQL databases are becoming the core technology for big data. We will examine the following three main NoSQL databases in this section: key-value databases, column-oriented databases, and document-oriented databases, each based on certain data models.

      – Key-value Databases: Key-value databases are constituted by a simple data model, in which data is stored as key-value pairs. Every key is unique, and customers may query values according to their keys. Such databases feature a simple structure, and modern key-value databases are characterized by higher expandability and shorter query response time than relational databases. Over the past few years, many key-value databases have appeared, motivated by Amazon's Dynamo system [85]. We will introduce Dynamo and several other representative key-value databases.

      – Dynamo: Dynamo is a highly available and expandable distributed key-value data storage system. It is used to store and manage the status of some core services in the Amazon e-Commerce Platform, which can be realized with key access. The public mode of relational databases may generate invalid data and limit data scale and availability, while Dynamo can resolve these problems with a simple key-object interface, which is constituted by simple reading and writing operations. Dynamo achieves elasticity and availability through its data partition, data copy, and object edition (versioning) mechanisms. Dynamo's partition plan relies on Consistent Hashing [86] to divide the load among multiple main storage machines; its main advantage is that the arrival or departure of a node only affects the directly adjacent nodes and does not affect other nodes. Dynamo copies data to N sets of servers, in which N is a configurable parameter, in order to achieve high availability and durability. The Dynamo system also provides eventual consistency, so as to conduct asynchronous updates on all copies.

      – Voldemort: Voldemort is also a key-value storage system, which was initially developed for and is still used by LinkedIn. Keys and values in Voldemort are composite objects constituted by lists and maps. The Voldemort interface includes three simple operations: reading, writing, and deletion, all of which are addressed by keys. Voldemort provides asynchronous updating with concurrency control of multiple editions (versions) but does not ensure data consistency. However, Voldemort supports optimistic locking for consistent multi-record updating: when a conflict happens between an update and any other operation, the update operation will quit. The data copy mechanism of Voldemort is the same as that of Dynamo. Voldemort not only stores data in RAM but also allows data to be inserted into a storage engine. In particular, Voldemort supports two storage engines: Berkeley DB and Random Access Files.

      The key-value database emerged a few years ago. Deeply influenced by Amazon's Dynamo, other key-value storage systems include Redis, Tokyo Cabinet and Tokyo Tyrant, Memcached and MemcacheDB, Riak, and Scalaris, all of which provide expandability by distributing keys among nodes. Voldemort, Riak, Tokyo Cabinet, and Memcached can utilize attached storage devices to store data in RAM or on disks. Other storage systems store data in RAM and provide disk backup, or rely on copy and recovery to avoid backup.
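The consistent-hashing partition scheme mentioned for Dynamo can be illustrated with a minimal sketch. It is not Dynamo's implementation: the class name `HashRing`, the use of MD5 to place nodes on the ring, and the absence of virtual nodes are all simplifying assumptions made here for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a point on the ring (a 128-bit integer)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hashing ring (hypothetical helper, one point per node)."""

    def __init__(self, nodes=()):
        self._points = []   # sorted ring positions
        self._owner = {}    # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        point = _hash(node)
        bisect.insort(self._points, point)
        self._owner[point] = node

    def remove(self, node: str) -> None:
        point = _hash(node)
        self._points.remove(point)
        del self._owner[point]

    def lookup(self, key: str) -> str:
        """Return the first node clockwise from the key's ring position."""
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]

    def preference_list(self, key: str, n: int):
        """Return up to n distinct successor nodes that would hold replicas."""
        start = bisect.bisect_right(self._points, _hash(key))
        count = min(n, len(self._points))
        return [self._owner[self._points[(start + i) % len(self._points)]]
                for i in range(count)]
```

When a node is removed, only the keys that node owned move to its clockwise successor; every other key keeps its owner, which is the property the text attributes to consistent hashing. `preference_list(key, n)` mirrors the idea of copying data to N servers, with N chosen by the caller.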

      – Column-oriented Databases: Column-oriented databases store and process data according to columns rather than rows. Both columns and rows are segmented across multiple nodes to realize expandability. Column-oriented databases are mainly inspired by Google's BigTable. In this section, we first discuss BigTable and then introduce several derivative tools.

      – BigTable: BigTable is a distributed, structured data storage system, which is designed to process large-scale (PB class) data among thousands of commercial servers [87]. The basic data structure of BigTable is a sparse, distributed, and persistent multi-dimensional sorted map. The indexes of the map are row key, column key, and timestamp, and every value in the map is an uninterpreted byte array. Each row key in BigTable is a character string of up to 64 KB. Rows are stored in lexicographical order and continually segmented into Tablets (i.e., units of distribution) for load balance. Thus, reading a short range of rows can be highly effective, since it only involves communication with a small portion of machines. The columns are grouped according to the prefixes of their keys, thus forming column families. These column families are the basic units for access control. The timestamps are 64-bit integers that distinguish different editions of cell values. Clients may flexibly determine the number of cell editions stored. These editions are sequenced in descending order of timestamps, so the latest edition will always be read first.

      The BigTable API features the creation and deletion of tables and column families, as well as the modification of metadata of clusters, tables, and column families. Client applications may insert or delete values in BigTable, query values from columns, or browse sub-datasets in a table. BigTable also supports some other characteristics, such as transaction processing in a single row. Users may utilize such features to conduct more complex data processing.

      Every BigTable implementation includes three main components: a Master server, Tablet servers, and a client library. BigTable allows only one Master server, which is responsible for assigning Tablets to Tablet servers, detecting added or removed Tablet servers, and conducting load balance. In addition, the Master can also modify the BigTable schema, e.g., creating tables and column families, and collect garbage saved in GFS, i.e., deleted or disabled files used by specific BigTable instances. Every Tablet server manages a set of Tablets and is responsible for the reading and writing of its loaded Tablets. When Tablets become too big, they are segmented by the server. The application client library is used to communicate with BigTable instances.

      BigTable is based on many fundamental components of Google, including GFS [25], a cluster management system, the SSTable file format, and Chubby [88]. GFS is used to store data and log files. The cluster management system is responsible for task scheduling, resource sharing, processing of machine failures, and monitoring of machine statuses. The SSTable file format is used to store BigTable data internally, and it provides a persistent, sequenced, and unchangeable mapping from keys to values, both of which are arbitrary byte strings. BigTable utilizes Chubby for the following tasks: 1) ensure there is at most one active Master copy at any time; 2) store the bootstrap location of BigTable data; 3) look up Tablet servers; 4) conduct error recovery in case of Tablet server failures; 5) store BigTable schema information; 6) store the access control table.

      – Cassandra: Cassandra is a distributed storage system that manages huge amounts of structured data distributed among multiple commercial servers [89]. The system was developed by Facebook and became an open source tool in 2008. It adopts the ideas and concepts of both Amazon Dynamo and Google BigTable, especially integrating the distributed system technology of Dynamo with the BigTable data model. Tables in Cassandra are in the form of a distributed four-dimensional structured mapping, where the four dimensions are row, column, column family, and super column. A row is distinguished by a string key of arbitrary length. No matter how many columns are read or written, the operation on a row is atomic. Columns may constitute clusters, which are called column families and are similar to those in the data model of BigTable. Cassandra provides two kinds of column families: column families and super columns. A super column includes an arbitrary number of columns related to the same name. A column family includes columns and super columns, which may be continuously inserted into the column family during runtime. The partition and copy mechanisms of Cassandra are very similar to those of Dynamo, so as to achieve consistency.

      – Derivative tools of BigTable: since the BigTable code cannot be obtained through an open source license, some open source projects compete to implement the BigTable concept and develop similar systems, such as HBase and HyperTable.

      HBase is a BigTable clone programmed in Java and is part of Apache's Hadoop MapReduce framework [90]. HBase replaces GFS with HDFS. It writes updated contents into RAM and regularly writes them into files on disks. Row operations are atomic, equipped with row-level locking and transaction processing, which is optional for large scale. Partition and distribution are operated transparently and have space for client hash or fixed key.

      HyperTable was developed, similar to BigTable, to obtain a set of high-performance, expandable, distributed storage and processing systems for structured and unstructured data [91]. HyperTable relies on distributed file systems, e.g., HDFS, and a distributed lock manager. Its data representation, processing, and partition mechanisms are similar to those in BigTable. HyperTable has its own query language, called HyperTable query language (HQL), and allows users to create, modify, and query underlying tables.

      Since the column-oriented storage databases mainly emulate BigTable, their designs are all similar, except for the concurrency mechanism and several other features. For example, Cassandra emphasizes weak consistency with concurrency control of multiple editions, while HBase and HyperTable focus on strong consistency through locks or log records.
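The BigTable-style data model discussed above, a sorted map indexed by row key, column key, and timestamp, can be sketched in a few lines. This is only an in-memory illustration under stated assumptions: the class `SparseTable`, its method names, and the `family:qualifier` column-key convention used in the usage example are hypothetical and are not the API of BigTable, HBase, or Cassandra.

```python
from collections import defaultdict


class SparseTable:
    """Toy (row, column, timestamp) -> bytes map; illustrative only."""

    def __init__(self, max_versions: int = 3):
        self.max_versions = max_versions
        # row key -> column key -> list of (timestamp, value), newest first
        self._rows = defaultdict(dict)

    def put(self, row: str, column: str, timestamp: int, value: bytes) -> None:
        versions = self._rows[row].setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)  # descending timestamps
        del versions[self.max_versions:]                   # keep only the newest editions

    def get(self, row: str, column: str):
        """Return the latest value of a cell, or None if the cell is empty."""
        versions = self._rows.get(row, {}).get(column)
        return versions[0][1] if versions else None

    def scan(self, start_row: str, end_row: str):
        """Yield (row, columns) for start_row <= row < end_row, in lexicographic order."""
        for row in sorted(self._rows):
            if start_row <= row < end_row:
                yield row, self._rows[row]
```

A short usage example: after `t = SparseTable(max_versions=2)` and three `put` calls on the same cell with timestamps 1, 3, and 2, `t.get(...)` returns the value written with timestamp 3 and only the editions with timestamps 3 and 2 are retained. A real system would keep rows physically sorted and split them into tablets rather than sorting on every scan.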

      – Document Databases: Compared with key-value storage, document storage can support more complex data forms. Since documents do not follow strict modes (schemas), there is no need to conduct mode migration. In addition, key-value pairs can still be saved. We will examine three important representatives of document storage systems, i.e., MongoDB, SimpleDB, and CouchDB.

      – MongoDB: MongoDB is an open-source, document-oriented database [92]. MongoDB stores documents as Binary JSON (BSON) objects [93], which are similar to objects. Every document has an ID field as the primary key. Queries in MongoDB are expressed with a syntax similar to JSON. A database driver sends the query as a BSON object to MongoDB. The system allows queries on all documents, including embedded objects and arrays. To enable rapid queries, indexes can be created on the queryable fields of documents. The copy operation in MongoDB can be executed with log files in the main nodes that record all the high-level operations conducted in the database. During copying, the slaves query the master for all the writing operations since the last synchronization and execute the operations in the log files on their local databases. MongoDB supports horizontal expansion with automatic sharding to distribute data among thousands of nodes, with automatic load balancing and failover.


      – SimpleDB: SimpleDB is a distributed database and a web service of Amazon [94]. Data in SimpleDB is organized into various domains, in which data may be stored, acquired, and queried. Domains include different properties and sets of name/value pairs of items. Data is copied to different machines at different data centers in order to ensure data safety and improve performance. This system does not support automatic partition and thus could not be expanded with the change of data volume. SimpleDB allows users to query with SQL. It is worth noting that SimpleDB can assure eventual consistency but does not support Multi-Version Concurrency Control (MVCC). Therefore, conflicts therein could not be detected from the client side.

      – CouchDB: Apache CouchDB is a document-oriented database written in Erlang [95]. Data in CouchDB is organized into documents consisting of fields named by keys/names and values, which are stored and accessed as JSON objects. Every document is provided with a unique identifier. CouchDB allows access to database documents through a RESTful HTTP API. If a document needs to be modified, the client must download the entire document, modify it, and then send it back to the database. After a document is rewritten once, the identifier will be updated. CouchDB utilizes optimistic replication to obtain scalability without a sharding mechanism. Since various CouchDB instances may be executed along with other transactions simultaneously, any kind of replication topology can be built. The consistency of CouchDB relies on the replication mechanism. CouchDB supports MVCC with historical hash records.
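The whole-document update cycle and the MVCC-style conflict detection described for document stores can be sketched as follows. This is a hypothetical in-memory model, not the HTTP API of CouchDB or of any other product: the names `DocumentStore` and `RevisionConflict` and the revision format (a sequence number plus a content hash) are assumptions chosen for illustration.

```python
import hashlib
import json


class RevisionConflict(Exception):
    """Raised when an update is based on an outdated revision."""


class DocumentStore:
    """Toy in-memory document store; names and layout are illustrative only."""

    def __init__(self):
        self._docs = {}  # document id -> (revision, document body)

    @staticmethod
    def _next_rev(old_rev, body):
        # Revision token = sequence number plus a short hash of the new content.
        seq = int(old_rev.split("-")[0]) + 1 if old_rev else 1
        payload = json.dumps(body, sort_keys=True).encode("utf-8")
        return f"{seq}-{hashlib.sha1(payload).hexdigest()[:8]}"

    def get(self, doc_id):
        """Return (revision, copy of the whole document)."""
        rev, body = self._docs[doc_id]
        return rev, dict(body)

    def put(self, doc_id, body, rev=None):
        """Store a whole document; rev must match the currently stored revision."""
        current = self._docs.get(doc_id, (None, None))[0]
        if rev != current:
            raise RevisionConflict(f"expected {current}, got {rev}")
        new_rev = self._next_rev(current, body)
        self._docs[doc_id] = (new_rev, dict(body))
        return new_rev
```

A client reads the whole document with `get`, changes it locally, and writes it back with the revision it read. A second writer holding a stale revision receives `RevisionConflict` instead of silently overwriting the newer edition, which is the kind of client-side conflict detection that a store without MVCC cannot offer.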

      4.3.2 Programming models

      Big data are generally stored in hundreds and even thousands of commercial servers. Thus, the traditional parallel models, such as Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), may not be adequate to support such large-scale parallel programs. Recently, some proposed parallel programming models have effectively improved the performance of NoSQL and reduced the performance gap to relational databases. Therefore, these models have become the cornerstone for the analysis of massive data.

      – MapReduce: MapReduce [22] is a simple but powerful programming model for large-scale computing, which uses a large number of clusters of commercial PCs to achieve automatic parallel processing and distribution. The MapReduce computing model has only two functions, i.e., Map and Reduce, both of which are programmed by users. The Map function processes input key-value pairs and generates intermediate key-value pairs. Then, MapReduce combines all the intermediate values related to the same key and transmits them to the Reduce function, which further compresses the value set into a smaller set. MapReduce has the advantage that it avoids the complicated steps for developing parallel applications, e.g., data scheduling, fault tolerance, and inter-node communications. The user only needs to program the two functions to develop a parallel application. The initial MapReduce framework did not support multiple datasets in a task, which has been mitigated by some recent enhancements [96, 97].

      Over the past decades, programmers have become familiar with the advanced declarative language of SQL, often used in relational databases for task description and dataset analysis. However, the succinct MapReduce framework only provides two nontransparent functions, which cannot cover all the common operations. Therefore, programmers have to spend time on programming the basic functions, which are typically hard to maintain and reuse. In order to improve programming efficiency, some advanced language systems have been proposed, e.g., Sawzall [98] of Google, Pig Latin [99] of Yahoo, Hive [100] of Facebook, and Scope [87] of Microsoft.
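The division of labor between the user-written Map and Reduce functions and the framework can be illustrated with the classic word-count task. The sketch below runs in a single process; `run_mapreduce` is a hypothetical stand-in for the framework's grouping step and does not model distribution, scheduling, or fault tolerance.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: emit an intermediate (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: compress all counts for one word into a single total."""
    return word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)  # group intermediate values by key
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))


# Input records are (line number, line text) pairs.
lines = enumerate(["big data big value", "data storage and data analysis"])
word_counts = run_mapreduce(lines, map_fn, reduce_fn)
```

Only `map_fn` and `reduce_fn` are specific to the task; replacing them yields a different job without touching the grouping logic, which is the simplicity the model is credited with.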

      – Dryad: Dryad [101] is a general-purpose distributed execution engine for processing parallel applications of coarse-grained data. The operational structure of Dryad is a directed acyclic graph, in which vertexes represent programs and edges represent data channels. Dryad executes operations on the vertexes in clusters and transmits data via data channels, including files, TCP connections, and shared-memory FIFOs. During operation, resources in a logic operation graph are automatically mapped to physical resources.

      The operation structure of Dryad is coordinated by a central program called the job manager, which can be executed in clusters or on workstations through the network. A job manager consists of two parts: 1) application codes, which are used to build a job communication graph, and 2) program library codes, which are used to arrange available resources. All kinds of data are directly transmitted between vertexes. Therefore, the job manager is only responsible for decision-making and does not obstruct any data transmission.

      In Dryad, application developers can flexibly choose any directed acyclic graph to describe the communication modes of the application and express data transmission mechanisms. In addition, Dryad allows vertexes to use any amount of input and output data, while MapReduce supports only one input and output set. DryadLINQ [102] is the advanced language of Dryad and is used to integrate the aforementioned SQL-like language execution environment.

      – All-Pairs: All-Pairs [103] is a system specially designed for biometrics, bio-informatics, and data mining applications. It focuses on comparing element pairs in two datasets by a given function. All-Pairs can be expressed as a three-tuple (Set A, Set B, and Function F), in which Function F is utilized to compare all elements in Set A and Set B. The comparison result is an output matrix M, which is also called the Cartesian product or cross join of Set A and Set B.

      All-Pairs is implemented in four phases: system modeling, distribution of input data, batch job management, and result collection. In Phase I, an approximation model of system performance is built to evaluate how much CPU resource is needed and how to conduct job partition. In Phase II, a spanning tree is built for data transmissions, which lets the workload of every partition retrieve input data effectively. In Phase III, after the data flow is delivered to proper nodes, the All-Pairs engine builds a batch-processing submission for jobs in partitions, while sequencing them in the batch processing system and formulating a node running command to acquire data. In the last phase, after the job completion of the batch processing system, the extraction engine collects results and combines them in a proper structure, which is generally a single file list in which all results are put in order.
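The All-Pairs abstraction itself, applying Function F to every pair drawn from Set A and Set B to fill the matrix M, is small enough to write down directly. The sketch below is a serial illustration only; it does not model the four distributed phases, and the comparison function `hamming` and the sample strings are hypothetical choices.

```python
def all_pairs(set_a, set_b, compare):
    """Return matrix M with M[i][j] = compare(set_a[i], set_b[j])."""
    return [[compare(a, b) for b in set_b] for a in set_a]


def hamming(x: str, y: str) -> int:
    """Hypothetical Function F: number of differing positions in equal-length strings."""
    return sum(c1 != c2 for c1, c2 in zip(x, y))


A = ["ACGT", "AAAA"]
B = ["ACGA", "TTTT", "AAAT"]
M = all_pairs(A, B, hamming)
# M == [[1, 3, 2], [2, 4, 1]]
```

Because every cell of M is independent of the others, the work can be cut into blocks of rows and columns and handed to separate nodes, which is what the job partition step of the real system decides.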

      – Pregel: The Pregel [104] system of Google facilitates the processing of large-sized graphs, e.g., analysis of network graphs and social networking services. A computational task is expressed by a directed graph constituted by vertexes and directed edges. Every vertex is related to a modifiable, user-defined value, and every directed edge related to a source vertex is constituted by a user-defined value and the identifier of a target vertex. When the graph is built, the program conducts iterative calculations, which are called supersteps, among which global synchronization points are set, until algorithm completion and output completion. In every superstep, vertex computations are parallel, and every vertex executes the same user-defined function to express a given algorithm logic. Every vertex may modify its status and the status of its output edges, receive messages sent in the previous superstep, send messages to other vertexes, and even modify the topological structure of the entire graph. Edges are not provided with corresponding computations. Every vertex may deactivate itself by suspension. When all vertexes are in an inactive status without any message to transmit, the entire program execution is completed.

      The Pregel program output is a set consisting of the values output from all the vertexes. Generally speaking, the input and output of a Pregel program are isomorphic directed graphs.
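The superstep model described for Pregel can be sketched with a small vertex-centric program that propagates the maximum vertex value through a directed graph. This is a serial simulation written for illustration, not Pregel's API: the function name `pregel_max`, the dictionary-based graph encoding, and the `max_supersteps` safety limit are assumptions.

```python
def pregel_max(values, out_edges, max_supersteps=50):
    """Propagate the maximum vertex value along directed edges.

    values:    dict vertex -> initial value
    out_edges: dict vertex -> list of target vertices
    """
    values = dict(values)
    active = set(values)             # every vertex starts active
    inbox = {v: [] for v in values}  # messages to be read in the current superstep

    for superstep in range(max_supersteps):
        if not active:
            break                    # all vertices inactive and no messages in flight
        outbox = {v: [] for v in values}
        for v in active:
            new_value = max([values[v]] + inbox[v])
            changed = new_value != values[v]
            values[v] = new_value
            if superstep == 0 or changed:
                for target in out_edges.get(v, []):
                    outbox[target].append(new_value)
            # the vertex becomes inactive here; only a message reactivates it
        # global synchronization point: messages become visible in the next superstep
        inbox = outbox
        active = {v for v, msgs in inbox.items() if msgs}
    return values
```

For the cycle a -> b -> c -> a with values 3, 6, 2 and an extra vertex d (value 1) that only points at a, the call returns 6 for a, b, and c, while d keeps 1 because no edge leads into it. The output has the same vertexes as the input, in line with the remark that input and output graphs are isomorphic.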

      Inspired by the above programming models, other researches have also focused on programming modes for more complex computational tasks, e.g., iterative computations [105, 106], fault-tolerant memory computations [107], incremental computations [108], and flow control decision-making related to data [109].

      5 Big data analysis

      The analysis of big data mainly involves analytical methods for traditional data and big data, analytical architecture for big data, and software used for mining and analysis of big data. Data analysis is the final and the most important phase in the value chain of big data, with the purpose of extracting useful values and providing suggestions or decisions. Different levels of potential values can be generated through the analysis of datasets in different fields [10]. However, data analysis is a broad area, which frequently changes and is extremely complex. In this section, we introduce the methods, architectures, and tools for big data analysis.

      5.1 Traditional data analysis

      Traditional data analysis means to use proper statistical methods to analyze massive data, to concentrate, extract, and refine useful data hidden in a batch of chaotic datasets, and to identify the inherent law of the subject matter, so as to maximize the value of data. Data analysis plays a huge guidance role in making development plans for a country, understanding customer demands for commerce, and predicting market trends for enterprises. Big data analysis can be deemed as the analysis technique for a special kind of data. Therefore, many traditional data analysis methods may still be utilized for big data analysis. Several representative traditional data analysis methods are examined in the following, many of which are from statistics and computer science.

      – Cluster Analysis: a statistical method for grouping objects, and specifically, classifying objects according to some features. Cluster analysis is used to differentiate objects with particular features and divide them into some categories (clusters) according to these features, such that objects in the same category have high homogeneity while different categories have high heterogeneity. Cluster analysis is an unsupervised study method without training data.

      – Factor Analysis: basically targeted at describing the relation among many elements with only a few factors, i.e., grouping several closely related variables into a factor; the few factors are then used to reveal most of the information of the original data.


      – Correlation Analysis: an analytical method for determining the law of relations, such as correlation, correlative dependence, and mutual restriction, among observed phenomena and accordingly conducting forecast and control. Such relations may be classified into two types: (i) function, reflecting the strict dependence relationship among phenomena, which is also called a definitive dependence relationship; (ii) correlation, some undetermined or inexact dependence relations, in which the numerical value of a variable may correspond to several numerical values of the other variable, and such numerical values present a regular fluctuation surrounding their mean values.

      – Regression Analysis: a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies dependence relationships among variables hidden by randomness. Regression analysis may make complex and undetermined correlations among variables simple and regular.

      – A/B Testing: also called bucket testing. It is a technology for determining how to improve target variables by comparing the tested groups. Big data will require a large number of tests to be executed and analyzed.

      – Statistical Analysis: Statistical analysis is based on statistical theory, a branch of applied mathematics. In statistical theory, randomness and uncertainty are modeled with Probability Theory. Statistical analysis can provide a description and an inference for big data. Descriptive statistical analysis can summarize and describe datasets, while inferential statistical analysis can draw conclusions from data subject to random variations. Statistical analysis is widely applied in the economic and medical care fields [110].

      – Data Mining Algorithms: Data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. In 2006, the IEEE International Conference on Data Mining Series (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [111], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, and CART, etc. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and linking mining, all of which are the most important problems in data mining research.
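As a small illustration of the correlation and regression analysis listed above, the following sketch computes the Pearson correlation coefficient and a least-squares line for two variables. The helper names and the choice of a single explanatory variable are assumptions made to keep the example short; they are not taken from the survey.

```python
from math import sqrt


def _centered_sums(xs, ys):
    """Return (mean_x, mean_y, Sxx, Syy, Sxy) for two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return mean_x, mean_y, sxx, syy, sxy


def pearson_r(xs, ys):
    """Correlation analysis: strength of the linear relation, in [-1, 1]."""
    _, _, sxx, syy, sxy = _centered_sums(xs, ys)
    return sxy / sqrt(sxx * syy)


def fit_line(xs, ys):
    """Regression analysis: least-squares (slope, intercept) of y on x."""
    mean_x, mean_y, sxx, _, sxy = _centered_sums(xs, ys)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

For data that lie exactly on the line y = 2x + 1, `fit_line` recovers slope 2 and intercept 1 and `pearson_r` is 1 up to floating-point rounding; with noisy observations the coefficient drops below 1 while the fitted line still summarizes the dependence.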

      5.2 Big data analytic methods

      In the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data, so as to bring value to enterprises and individuals. At present, the main processing methods of big data are as follows.

      – Bloom Filter: a Bloom Filter consists of a series of Hash functions. The principle of the Bloom Filter is to store Hash values of data rather than the data itself by utilizing a bit array, which is in essence a bitmap index that uses Hash functions to conduct lossy compression storage of data. It has such advantages as high space efficiency and high query speed, but also has some disadvantages in misrecognition and deletion.

      – Hashing: a method that essentially transforms data into shorter fixed-length numerical values or index values. Hashing has such advantages as rapid reading, writing, and high query speed, but it is hard to find a sound Hash function.

      – Index: an index is always an effective method to reduce the expense of disk reading and writing and to improve insertion, deletion, modification, and query speeds, in both traditional relational databases that manage structured data and other technologies that manage semi-structured and unstructured data. However, an index has the disadvantage of the additional cost of storing index files, which should be maintained dynamically when data is updated.

      – Trie: also called trie tree, a variant of Hash Tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of the Trie is to utilize common prefixes of character strings to reduce comparisons on character strings to the greatest extent, so as to improve query efficiency.

      – Parallel Computing: compared to traditional serial computing, parallel computing refers to simultaneously utilizing several computing resources to complete a computation task. Its basic idea is to decompose a problem into parts and assign them to several separate processes to be independently completed, so as to achieve co-processing. Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see a comparison in Table 1).
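The Bloom Filter in the list above can be made concrete with a minimal sketch. The parameters (`num_bits`, `num_hashes`) and the trick of deriving several hash functions by salting SHA-256 with an index are assumptions chosen for brevity; production filters size the bit array from the expected number of items and the target false-positive rate.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: stores hash positions in a bit array, not the data itself."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

An added item is always reported as present, while an item that was never added is usually, but not always, reported as absent: this is the misrecognition drawback. Deletion is also problematic, because clearing the bits of one item could clear bits shared with other items.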

      Although parallel computing systems or tools, such as MapReduce or Dryad, are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages are being developed based on these systems. Such high-level languages include Sawzall, Pig, and Hive used for MapReduce, as well as Scope and DryadLINQ used for Dryad.

      5.3 Architecture for big data analysis

      Because of the 4Vs of big data, different analytical architectures shall be considered for different application requirements.


      Table 1 Comparison of MPI, MapReduce, and Dryad

      |  | MPI | MapReduce | Dryad |
      |---|---|---|---|
      | Deployment | Computing node and data storage arranged separately (data should be moved to the computing node) | Computing and data storage arranged at the same node (computing should be close to data) | Computing and data storage arranged at the same node (computing should be close to data) |
      | Resource management/scheduling | – | Workqueue (Google), HOD (Yahoo) | Not clear |
      | Low-level programming | MPI API | MapReduce API | Dryad API |
      | High-level programming | – | Pig, Hive, Jaql, ... | Scope, DryadLINQ |
      | Data storage | The local file system, NFS, ... | GFS (Google), HDFS (Hadoop), KFS, Amazon S3, ... | NTFS, Cosmos DFS |
      | Task partitioning | User manually partitions the tasks | Automation | Automation |
      | Communication | Messaging, remote memory access | Files (local FS, DFS) | Files, TCP pipes, shared-memory FIFOs |
      | Fault-tolerant | Checkpoint | Task re-execute | Task re-execute |

      5.3.1 Real-time vs. offline analysis

      According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

      – Real-time analysis: mainly used in E-commerce and finance. Since data constantly changes, rapid data analysis is needed and analytical results shall be returned with a very short delay. The main existing architectures of real-time analysis include (i) parallel processing clusters using traditional relational databases, and (ii) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

      – Offline analysis: usually used for applications without high requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally conducts analysis by importing logs into a special platform through data acquisition tools. Under the big data setting, many Internet enterprises utilize the offline analysis architecture based on Hadoop in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool Timetunnel, and Chukwa of Hadoop, etc. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.

      5.3.2 Analysis at different levels

      Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive-level analysis, which are examined in the following.

      ndash Memory-level analysis is for the case where the totaldata volume is smaller than the maximum memory ofa cluster Nowadays the memory of server cluster sur-passes hundreds of GB while even the TB level iscommon Therefore an internal database technologymay be used and hot data shall reside in the memory soas to improve the analytical efficiency Memory-levelanalysis is extremely suitable for real-time analysisMongoDB is a representative memory-level analyticalarchitecture With the development of SSD (Solid-StateDrive) the capacity and performance of memory-leveldata analysis has been further improved and widelyapplied

      ndash BI analysis is for the case when the data scale sur-passes the memory level but may be imported intothe BI analysis environment The currently mainstreamBI products are provided with data analysis plans tosupport the level over TB

      ndash Massive analysis is for the case when the data scalehas completely surpassed the capacities of BI productsand traditional relational databases At present mostmassive analysis utilize HDFS of Hadoop to store dataand use MapReduce for data analysis Most massiveanalysis belongs to the offline analysis category
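      The three levels above amount to comparing the data volume against two capacity limits. The following is a minimal sketch of that decision, assuming hypothetical names (`Capacity`, `analysis_level`) and illustrative capacity figures that do not come from the survey.

```python
from dataclasses import dataclass

GB = 1024 ** 3
TB = 1024 ** 4


@dataclass(frozen=True)
class Capacity:
    """Hypothetical capacity limits of one deployment, in bytes."""
    cluster_memory: int  # total RAM across the cluster
    bi_limit: int        # largest dataset the BI product can import


def analysis_level(data_bytes: int, cap: Capacity) -> str:
    """Return the analysis level suggested by the data volume."""
    if data_bytes < cap.cluster_memory:
        return "memory-level"   # fits in cluster memory
    if data_bytes <= cap.bi_limit:
        return "BI-level"       # too big for memory, importable into BI
    return "massive-level"      # beyond BI products and RDBMSs


# Illustrative limits only: 512 GB of cluster memory, 20 TB BI limit.
cap = Capacity(cluster_memory=512 * GB, bi_limit=20 * TB)
print(analysis_level(100 * GB, cap))
```

      In practice the boundaries are soft, since they also depend on the workload and the product in use; the sketch only makes the ordering of the three levels explicit.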

      192 Mobile Netw Appl (2014) 19:171–209

      5.3.3 Analysis with different complexity

      The time and space complexity of data analysis algorithms differs greatly according to the kind of data and the application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.
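      As an illustration of a parallel processing model, the sketch below imitates the map, shuffle, and reduce phases of MapReduce for a word count. It runs in a single process and is not Hadoop code; the function names are assumptions made for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each map call is independent of the others, so a real system
    # could run them on different nodes in parallel.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = word_count(["big data analysis", "Big data tools"])
print(counts)
```

      The problem parallelizes well because no map call depends on another and each key is reduced independently.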

      5.4 Tools for big data mining and analysis

      Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the five most widely used software tools according to a 2012 KDNuggets survey of 798 professionals, which asked “What Analytics, Data mining, Big Data software that you used in the past 12 months for a real project?” [112].

      – R (30.7 %): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. When computing-intensive tasks are executed, code written in C, C++, and Fortran may be called from the R environment. In addition, skilled users can directly call R objects in C. R is an implementation of the S language, an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and plotting. Compared to S, R is more popular since it is open source. R ranked first in the KDNuggets 2012 survey. Furthermore, in a 2012 survey of “Design languages you have used for data mining/analysis in the past year”, R was also in first place, ahead of SQL and Java. Due to the popularity of R, database vendors such as Teradata and Oracle have released products supporting R.

      – Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins with powerful data analysis functions, such as Analysis ToolPak and Solver Add-in, are included, but they can be used only after users enable them. Excel is also the only commercial software among the top five.

      – Rapid-I RapidMiner (26.7 %): RapidMiner is open source software used for data mining, machine learning, and predictive analysis. In a KDnuggets survey in 2011, it was used more frequently than R and ranked first. The data mining and machine learning procedures provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment. The data mining flow is described in XML and displayed through a graphical user interface (GUI). RapidMiner is written in Java. It integrates the learners and evaluation methods of Weka, and works with R. Functions of RapidMiner are implemented by connecting processes that consist of various operators. The entire flow can be regarded as a factory production line, with original data as input and model results as output. The operators can be considered as specific functions with different input and output characteristics.

      – KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, and open-source platform for data integration, data processing, data analysis, and data mining [113]. It allows users to create data flows or data channels visually, to selectively run some or all analytical procedures, and it provides analytical results, models, and interactive views. KNIME is written in Java and, being based on Eclipse, provides additional functions as plug-ins. Through plug-ins, users can insert processing modules for files, pictures, and time series, and integrate with various open source projects, e.g., R and Weka. KNIME covers data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted in a visual environment. KNIME is designed as a module-based and extensible framework. There is no dependence between its processing units and data containers, which makes it suitable for distributed environments and independent development. In addition, KNIME is easy to extend: developers can readily add new nodes and views.

      – Weka/Pentaho (14.8 %): Weka, short for Waikato Environment for Knowledge Analysis, is free and open-source machine learning and data mining software written in Java. Weka provides functions such as data processing, feature selection, classification, regression, clustering, association rules, and visualization. Pentaho is one of the most popular open-source BI software packages. It includes a web server platform and several tools that support reporting, analysis, charting, data integration, data mining, and other aspects of BI. Weka's data processing algorithms are also integrated in Pentaho and can be called directly.

      6 Big data applications

      In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful value via judgments, suggestions, support, or decisions. However, data analysis involves a wide range of applications, which change frequently and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields: structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

      6.1 Application evolutions

      Recently, big data analysis has been proposed as an advanced analytical technology, which typically includes large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have been emerging for decades. For example, as early as the 1990s, BI had become a prevailing technology for business applications, and network search engines based on massive data mining emerged in the early 21st century. Some potential and influential applications from different fields, together with their data and analysis characteristics, are discussed as follows.

      – Evolution of Commercial Applications: The earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. The analytical techniques used in such systems prevailed in the 1990s and were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, score cards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have provided a unique opportunity for organizations to have an online presence and interact directly with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted with text analysis and website mining techniques. As reported in [115], the number of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the Internet of Things based on sensors are opening a new generation of innovative applications and require considerably greater capacity to support location-sensing, people-oriented, and context-aware operation.

      – Evolution of Network Applications: The early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to mining email contents and building search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages full of various kinds of data, such as text, images, audio, videos, and interactive content. Therefore, a wealth of advanced technologies for semi-structured and unstructured data has emerged. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share content.

      – Evolution of Scientific Applications: Scientific research in many fields, such as astrophysics, oceanology, genomics, and environmental research, is acquiring massive data with high-throughput sensors and instruments. The US National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, inter-operative analysis software, and data services to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets vary widely in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

      As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may use similar basic technologies. Since data analysis has a broad scope and comprehensive coverage is not easy, we will focus on the key problems and technologies of data analysis in the following discussions.

      6.2 Big data analysis fields

      6.2.1 Structured data analysis

      Business applications and scientific research may generate massive structured data, whose management and analysis rely on mature commercial technologies, such as RDBMS, data warehouse, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

      However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning based on exact mathematical models and powerful algorithms has been applied to anomaly detection [117] and energy control [118]. Exploiting data characteristics, temporal and spatial mining can extract knowledge structures hidden in high-speed data flows and sensors [119]. Driven by privacy protection needs in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially in process analysis with event data [121].

      6.2.2 Text data analysis

      The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to have more business potential than structured data analysis. Generally, text analysis is a process of extracting useful information and knowledge from unstructured text. Text mining is interdisciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and data mining in particular. Most text mining systems are based on text representation and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammars [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.

      6.2.3 Web data analysis

      Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

      Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, images, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which will be discussed in Section 6.2.4. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves mining semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., in email, newsgroup management, and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters information for users according to inferred or configured profiles. The database method aims to model and integrate data on the Web so as to support more complex queries than keyword-based searches.

      Web structure mining involves models for discovering Web link structures. Here, the structure refers to the graph of links within a website or among multiple websites. Models are built on the topological structure given by hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify web pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant web pages. The topic-oriented crawler is another successful application of these models [127].
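      To make the idea of ranking pages by link structure concrete, the following is a minimal sketch of the PageRank iteration on a toy link graph. It is a textbook power-iteration version with a damping factor of 0.85, not the production algorithm of any search engine; the graph and the function name are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages from hyperlinks; `links` maps a page to its link targets."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small base share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal parts to its targets.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Toy Web graph: page "c" is linked to by every other page.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
print(sorted(ranks, key=ranks.get, reverse=True))
```

      Pages with many incoming links from well-ranked pages end up with high scores, while a page with no incoming links keeps only the base share.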

      Web usage mining aims to mine auxiliary data generated by Web sessions or activities, whereas Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or trades, caches, user queries, bookmark data, mouse clicks and scrolls, and any other kinds of data generated through interaction with the Web. As Web services and Web 2.0 become mature and popular, Web usage data will have increasingly high variety. Web usage mining plays key roles in personalized space, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by exploiting the different preferences of users.


      6.2.4 Multimedia data analysis

      Multimedia data (mainly including images, audio, and videos) have been growing at an amazing speed, and analysis is used to extract useful knowledge from them and understand their semantics. Because multimedia data is heterogeneous and mostly contains richer information than simple structured data or text data, information extraction faces the huge challenge of the semantic gap. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection.

      Audio summarization can be accomplished by extracting the prominent words or phrases from metadata, or by synthesizing a new representation. Video summarization aims to present the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods use a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been used in many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video and apply smoothing measures to make the final summary look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.

      Multimedia annotation inserts labels that describe the contents of images and videos at both the syntactic and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time and labor intensive, automatic annotation without human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic gap. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to explore manual and automatic multimedia annotation jointly [129].

      Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and helping users to look up multimedia resources conveniently and quickly [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, and includes shot boundary detection, key frame extraction, and scene segmentation. Based on the result of structural analysis, the second procedure is feature extraction, which mainly involves further mining the features of key frames, objects, texts, and movements, which are the foundation of video indexing and retrieval. Data mining, classification, and annotation use the extracted features to find patterns in video contents and put videos into predefined categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos. The retrieval results are then refined through relevance feedback.
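      The query step can be pictured as a nearest-neighbor search over extracted feature vectors. The sketch below assumes that each indexed video has already been reduced to a short numeric feature vector and uses cosine similarity as the similarity measure; the vectors, video names, and function names are invented for illustration, and real systems use far richer features.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(index, query_vector, top_k=2):
    """Return the ids of the top_k indexed videos most similar to the query."""
    scored = [(cosine_similarity(vec, query_vector), video_id)
              for video_id, vec in index.items()]
    scored.sort(reverse=True)
    return [video_id for _, video_id in scored[:top_k]]


# Hypothetical index: video id -> feature vector from the extraction step.
index = {
    "news_clip": [0.9, 0.1, 0.0],
    "soccer_match": [0.1, 0.8, 0.3],
    "tennis_final": [0.0, 0.7, 0.6],
}
results = retrieve(index, [0.0, 1.0, 0.4])
print(results)
```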

      Multimedia recommendation aims to recommend specific multimedia contents according to users' preferences. It has proven to be an effective approach to providing personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. Content-based methods identify general features of users or their interests, and recommend to users other contents with similar features. These methods rely largely on content similarity measurement, but most of them suffer from limited content analysis and over-specialization. Collaborative-filtering-based methods identify groups with similar interests and recommend contents to group members according to their behavior [132]. More recently, hybrid methods have been introduced, which combine the advantages of the two types of methods to improve recommendation quality [133].
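      The collaborative-filtering idea can be sketched in a few lines: users who rated the same items similarly are treated as a group, and items liked by similar users are suggested. The following is a minimal user-based sketch with invented ratings and hypothetical function names; it is not the method of [132] or [133].

```python
import math


def cosine(r1, r2):
    """Cosine similarity between two users' rating dicts (item -> rating)."""
    common = set(r1) & set(r2)
    if not common:
        return 0.0
    dot = sum(r1[i] * r2[i] for i in common)
    n1 = math.sqrt(sum(v * v for v in r1.values()))
    n2 = math.sqrt(sum(v * v for v in r2.values()))
    return dot / (n1 * n2)


def recommend(ratings, user, top_k=1):
    """Score unseen items by the similarity-weighted ratings of other users."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:top_k]


# Invented ratings on a 1-5 scale.
ratings = {
    "alice": {"clip_a": 5, "clip_b": 4},
    "bob": {"clip_a": 5, "clip_b": 5, "clip_c": 4},
    "carol": {"clip_d": 5, "clip_b": 1},
}
print(recommend(ratings, "alice"))
```

      Here bob's tastes are close to alice's, so his well-rated clip outranks the one rated by the dissimilar user carol. A content-based method would instead compare features of the clips themselves.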

      The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains text descriptions of the concepts and video examples [134]. In [135], the author proposed a new algorithm for special multimedia event detection using only a few positive training examples. Research on video event detection is still in its infancy and mainly focuses on sports or news events, running or abnormal events in surveillance videos, and other similar events with repetitive patterns.

      6.2.5 Network data analysis

      Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the online social network analysis that emerged at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graph structures describing the communications between two entities. The content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. From a data-centered perspective, the existing research on social networking services can be classified into two categories: link-based structural analysis and content-based analysis [138].

      The research on link-based structural analysis has focused on link prediction, community discovery, social network evolution, and social influence analysis. SNS may be visualized as graphs, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertexes and edges are continually added to the graphs. Link prediction aims to predict the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and uses the existing link information to train binary classifiers that predict future links [139]. Probabilistic methods aim to build models of the connection probabilities among vertexes in SNS [140]. Linear algebra methods compute the similarity between two vertexes according to the singular similarity matrix [141]. A community is represented by a sub-graph in which the edges connecting vertexes within the sub-graph have high density, while the edges between two sub-graphs have much lower density [142].
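      As a simple illustration of link prediction, the sketch below scores each unconnected pair of vertexes by the Jaccard similarity of their neighbor sets. This neighborhood score is a common baseline and could serve as one feature for a feature-based classifier; it is not the specific method of [139], [140], or [141], and the graph and function names are invented.

```python
from itertools import combinations


def jaccard(graph, u, v):
    """Share of common neighbors among all neighbors of u and v."""
    union = graph[u] | graph[v]
    if not union:
        return 0.0
    return len(graph[u] & graph[v]) / len(union)


def predict_links(graph, top_k=1):
    """Return the top_k unconnected vertex pairs with the highest score."""
    candidates = [
        (jaccard(graph, u, v), u, v)
        for u, v in combinations(sorted(graph), 2)
        if v not in graph[u]          # consider only missing edges
    ]
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))
    return [(u, v) for _, u, v in candidates[:top_k]]


# Toy undirected social graph: vertex -> set of neighbors.
graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"bob", "cat"},
    "eve": set(),
}
print(predict_links(graph))
```

      ann and dan share all of their neighbors, so the pair receives the highest score, whereas the isolated vertex eve gives no evidence for any link.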

      Many methods for community detection have been proposed and studied, most of which rely on topology-based objective functions that capture the notion of community structure. Du et al. exploited the overlapping nature of real-life communities to propose an effective large-scale SNS community detection method [143]. Research on SNS evolution looks for laws and derivation models that explain network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generative methods have been proposed to assist network and system design [147].

      Social influence refers to the case in which individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, time effects, and the characteristics of networks and individuals. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is considered, the performance of link-based structural analysis may be further improved.

      Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, location information, and comments. However, social media analysis faces unprecedented challenges. First, massive and continually growing social media data must be analyzed automatically within a reasonable time window. Second, social media data contains much noise. For example, the blogosphere contains a large number of spam blogs, and Twitter likewise contains many trivial Tweets. Third, SNS are dynamic networks, which vary and are updated frequently and quickly. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge among different media [150].

      6.2.6 Mobile data analysis

      By April 2013, Android Apps had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has started in different fields. Since this research has only just begun, we will introduce only some recent and representative analysis applications in this section.

      With the growing number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities based on geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest Webchat). Traditional network communities or SNS communities lack online interaction among members, and the communities are active only when members are sitting in front of computers. In contrast, mobile phones can support rich interaction at any time and anywhere. A mobile community is defined as a group of individuals with the same hobbies (e.g., health, safety, and entertainment) who gather on networks, agree on a common goal, decide through consultation on measures to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

      Recently, progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build body area networks for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, and physiological relations. In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize

      Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop an application for smart phones that analyzes people's gait as they walk and uses the gait information to unlock the security system [11]. Meanwhile, Robert Delano and Brian Parise from the Georgia Institute of Technology developed an application called iTrem, which monitors body tremors with a built-in seismograph in a mobile phone, so as to help manage Parkinson's disease and other nervous system diseases [11].

      6.3 Key applications of big data

      6.3.1 Application of big data in enterprises

      At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, in marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business modes. In sales planning, after comparing massive data, enterprises can optimize their commodity prices. In operations, enterprises can improve their operational efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor costs. In the supply chain, using big data, enterprises may conduct inventory optimization, logistics optimization, and supplier coordination to narrow the gap between supply and demand, control budgets, and improve services.

      In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) used data analysis to recognize that activities such as “Multi-times score accumulation” and “score exchange in shops” are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers' transaction records, potential small business customers can be efficiently identified. By using remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.

      Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, along with, more importantly, the age, gender, address, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, the market conditions of their brands, and consumers' behaviors, and make production and inventory decisions accordingly. Meanwhile, more consumers can purchase their favorite commodities at better prices. The credit loan service of Alibaba uses big data technology to automatically analyze acquired enterprise transaction data and judge whether to lend to an enterprise, with no manual intervention in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan with a bad-loan rate of only about 0.3 %, which is far lower than that of commercial banks.

      6.3.2 Application of IoT based big data

      IoT is not only an important source of big data, but also one of the main markets for big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

      Logistics enterprises may have the most profound experience with the application of IoT big data. For example, UPS trucks are equipped with sensors, wireless adapters, and GPS, so that headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

      Smart city is a hot research area based on the applicationof IoT data For example the smart city project coopera-tion between the Miami-Dade County in Florida and IBMclosely connects 35 types of key county government depart-ments and Miami city and helps government leaders obtainbetter information support in decision making for manag-ing water resources reducing traffic jam and improvingpublic safety The application of smart city brings aboutbenefits in many aspects for Dade County For instance theDepartment of Park Management of Dade County saved onemillion USD in water bills due to timely identifying andfixing water pipes that were running and leaking this year

      633 Application of online social network-oriented big data

      Online SNS is a social structure constituted by social individuals and connections among individuals based on an information network. Big data of online SNS mainly comes from instant messages, online social networking, micro blogs, and shared spaces, etc., which represent various user activities. The analysis of big data from online SNS uses computational analytical methods for understanding relations in human society by virtue of theories and methods that involve mathematics, informatics, sociology, and management science, etc., from three dimensions including network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education, etc. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire values.

      – Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preference, emotion, interest, and demand, etc. may be revealed.

      – Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies, etc. aggregate relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance to improve information propagation and for interpersonal relation analysis.
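The informal definition of a community given above (close internal relations, loose external relations) can be made concrete by counting edges inside a candidate group versus edges crossing its boundary. The sketch below is only an illustration of that definition, not a community detection algorithm from the survey; the graph, the function names, and the threshold `ratio` are assumptions.

```python
def edge_counts(edges, group):
    """Count edges inside `group` and edges crossing its boundary.

    edges: iterable of (u, v) pairs of an undirected graph.
    group: iterable of node ids forming the candidate community.
    """
    group = set(group)
    internal = external = 0
    for u, v in edges:
        in_u, in_v = u in group, v in group
        if in_u and in_v:
            internal += 1      # both endpoints inside the group
        elif in_u or in_v:
            external += 1      # exactly one endpoint inside the group
    return internal, external


def is_community_like(edges, group, ratio=2.0):
    """Heuristic: internal edges outnumber boundary edges by `ratio` (assumed)."""
    internal, external = edge_counts(edges, group)
    return internal > 0 and internal >= ratio * external


# Toy graph: two triangles joined by the single edge (c, d).
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("c", "d"),
             ("d", "e"), ("e", "f"), ("d", "f")]
triangle = edge_counts(toy_edges, {"a", "b", "c"})
mixed = edge_counts(toy_edges, {"a", "d"})
```

Here the triangle has three internal edges and one boundary edge, while the arbitrary pair {a, d} has no internal edge at all, which matches the intuition behind the definition.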

      The U.S. Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

      Fig. 5 Enabling technologies for online social network-oriented big data

      In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and then get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the U.S. In addition, Global Pulse conducted research that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans. The goal is to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting sharp growth or drop in the amount of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation from the official statistics of Indonesia matches the number of Tweets about rice price on Twitter, as shown in Fig. 6.
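Two of the analyses described above, detecting a sharp growth or drop in topic volume and comparing tweet counts with an external indicator, can be sketched with standard-library Python. This is a minimal illustration under assumed choices (a trailing window, a standard-deviation threshold, and Pearson correlation); the text does not state which methods Global Pulse used, and all numbers below are made up.

```python
from math import sqrt
from statistics import mean, pstdev


def sharp_changes(counts, window=4, threshold=3.0):
    """Indices whose count deviates strongly from the preceding `window` values."""
    flagged = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu, sigma = mean(history), pstdev(history)
        if sigma == 0:
            # Perfectly flat history: any change counts as abnormal.
            if counts[i] != mu:
                flagged.append(i)
        elif abs(counts[i] - mu) > threshold * sigma:
            flagged.append(i)
    return flagged


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(vx * vy)


# Hypothetical weekly tweet counts on one topic, with one burst.
weekly_counts = [100, 102, 98, 101, 99, 300, 100]
bursts = sharp_changes(weekly_counts)
```

A correlation close to 1 between the tweet series and an official indicator would correspond to the kind of match reported for rice-price tweets and food price inflation, although correlation alone does not establish that one series predicts the other.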

      Generally speaking, the application of big data from online SNS may help to better understand users’ behavior and master the laws of social and economic activities from the following three aspects:

      – Early Warning: to rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

      – Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotion, and preference of users.

      – Real-time Feedback: to acquire groups’ feedback on some social activities based on real-time monitoring.

      6.3.4 Applications of healthcare and medical big data

      Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the health care business.

      For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final result into an extremely personalized treatment plan to assess the risk factors and main treatment plans of patients. Then, doctors may reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or by suggesting that patients reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20.

      The Mount Sinai Medical Center in the U.S. utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

      HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through the software development kit (SDK) and open interface.

      Fig. 6 The correlation between Tweets about rice price and food price inflation

      6.3.5 Collective intelligence

      With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units and coordinate with mobile networks for the distribution of sensing tasks and the collection and utilization of sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of Crowdsourcing has been successfully applied to geotagged photographs, positioning and navigation, urban road traffic sensing, market forecasting, opinion mining, and other labor-intensive applications.

      Crowdsourcing, a new approach for problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, Crowdsourcing had been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of Crowdsourcing. The main idea of Crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, Crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

      In the big data era, Spatial Crowdsourcing has become a hot topic. The operation framework of Spatial Crowdsourcing is as follows. A user may request the service and resources related to a specified location. Then the mobile users who are willing to participate in the task will move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data will be sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions they provide, it can be forecast that Spatial Crowdsourcing will become more prevalent than traditional Crowdsourcing, e.g., Amazon Mechanical Turk and Crowdflower.
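The operation framework of Spatial Crowdsourcing outlined above (a location-bound request, willing mobile users, data returned to the requester) can be sketched as a simple assignment step. The version below picks the nearest willing worker within a travel radius using the haversine great-circle distance; the worker records, the radius `max_km`, and the nearest-worker policy are all assumptions made for illustration, not a scheme described in the survey.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def assign_task(task_lat, task_lon, workers, max_km=5.0):
    """Return (worker_id, distance_km) of the nearest willing worker, or None.

    workers: list of dicts with keys 'id', 'lat', 'lon', 'willing'
    (a hypothetical record layout).
    """
    best = None
    for w in workers:
        if not w["willing"]:
            continue  # only users who opted in take part
        d = haversine_km(task_lat, task_lon, w["lat"], w["lon"])
        if d <= max_km and (best is None or d < best[1]):
            best = (w["id"], d)
    return best


# Made-up workers around a task located at (0.0, 0.0).
workers = [
    {"id": "a", "lat": 0.0, "lon": 0.01, "willing": True},    # about 1.1 km away
    {"id": "b", "lat": 0.0, "lon": 0.001, "willing": False},  # closest, but unwilling
    {"id": "c", "lat": 1.0, "lon": 1.0, "willing": True},     # far outside the radius
]
chosen = assign_task(0.0, 0.0, workers)
```

A real platform would also weigh worker reliability, incentives, and task deadlines; distance is used here only because the request is tied to a specified location.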

      6.3.6 Smart grid

      Smart Grid is the next generation power grid constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart Grid related big data are generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nationwide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as Circuit Breaker Monitors and transformers). Smart Grid brings about the following challenges on exploiting big data.

      – Grid planning: By analyzing data in the Smart Grid, the regions that have excessively high electrical load or high power outage frequencies can be identified. Even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an “electric map” according to the big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of the block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides an effective and visual load forecast for power grid planning in a city. Preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads may be conducted, as displayed in the map.

      – Interaction between power generation and power consumption: An ideal power grid shall balance power generation and consumption. However, the traditional power grid is constructed based on the one-directional approach of transmission-transformation-distribution-consumption, which does not allow the generation capacity to be adjusted according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters are developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which can help the supplier read power utilization data every 15 min rather than every month as in the past. Labor cost for meter reading is greatly reduced. Because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and low periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help the realization of time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users.

      – The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable new energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortage. Such energy resources can complement the traditional hydropower and thermal power generation.
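The time-sharing dynamic pricing mentioned in the list above depends on meter readings that are fine-grained enough to separate peak from low periods. The sketch below bills a series of 15-minute readings under a two-level tariff and reports the share of energy used at peak time. The peak window (17:00 to 20:59) and both prices are invented for illustration and are not TXU Energy's actual tariff.

```python
PEAK_HOURS = range(17, 21)  # assumed peak window: 17:00 to 20:59


def _is_peak(minute_of_day):
    """True if the interval starting at `minute_of_day` falls in the peak window."""
    return (minute_of_day // 60) % 24 in PEAK_HOURS


def time_of_use_bill(readings, peak_price=0.30, off_peak_price=0.10):
    """Total cost of interval readings under a two-level tariff.

    readings: iterable of (minute_of_day, kwh) pairs, one per 15-min interval.
    Prices are per kWh and purely illustrative.
    """
    total = 0.0
    for minute_of_day, kwh in readings:
        price = peak_price if _is_peak(minute_of_day) else off_peak_price
        total += kwh * price
    return total


def peak_share(readings):
    """Fraction of total energy consumed during the peak window."""
    readings = list(readings)
    total = sum(kwh for _, kwh in readings)
    if total == 0:
        return 0.0
    peak = sum(kwh for minute, kwh in readings if _is_peak(minute))
    return peak / total


# Four made-up interval readings: midnight, 17:00, 17:15, and 21:00.
sample = [(0, 1.0), (17 * 60, 1.0), (17 * 60 + 15, 2.0), (21 * 60, 1.0)]
bill = time_of_use_bill(sample)
share = peak_share(sample)
```

With monthly readings only the total would be known; the interval data is what allows a supplier to see that most of this sample's energy falls in the peak window and to price it accordingly.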

      7 Conclusion, open issues, and outlook

      In this paper, we review the background and state-of-the-art of big data. Firstly, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big-picture to readers of this exciting area.

      In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

      7.1 Open issues

      The analysis of big data is confronted with many challenges, but the current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

      7.1.1 Theoretical research

      Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, which are discussed below.

      – Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

      – Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to weigh the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated once the system is implemented and deployed, which makes it impossible to horizontally compare the advantages and disadvantages of various alternative solutions, even before and after the implementation of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

      – Evolution of big data computing modes: This includes memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

      7.1.2 Technology development

      Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, etc. should be fully investigated.

      – Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor that restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.

      – Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor to improve big data computing.

      – Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

      – Processing of big data: As big data research advances, new problems on big data processing arise from traditional data analysis, including (i) data re-utilization: with the increase of data scale, more value may be mined from the re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized, from which more value can be mined; and (iii) data exhaust, which means wrong data during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.

      7.1.3 Practical implications

      Although there are already many successful big data applications, many practical problems should be solved:

      – Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

      – Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

      – Integration and provenance of big data: As discussed, the value acquired from the comprehensive utilization of multiple datasets is far higher than the sum of the values of individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and from different datasets.

      – Big data application: At present, the application of big data is just beginning and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction of big data, etc. are all important research problems.

      7.1.4 Data security

      In IT, safety and privacy are always two key concerns. In the big data era, as data volume is growing fast, there are more severe safety risks, while the traditional data protection methods have already been shown not to be applicable to big data. In particular, big data safety is confronted with the following security related challenges.

      – Big data privacy: Big data privacy includes two aspects: (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc. of users may be more easily acquired, and users may not be aware. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed a big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher of Skull Security, used an information acquisition tool to acquire data in the public pages of Facebook users who had failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

      – Data quality: Data quality influences big data utilization. Low quality data wastes transmission and storage resources with poor usability. There are a lot of factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though a lot of measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.

      – Big data safety mechanism: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data could not meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safety communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants’ data should be enabled on the premise of efficiency assurance.

      – Big data application in information security: Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) after analysis of big data in the form of log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc. may also be more easily identified through the analysis of big data.
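Two of the data quality dimensions named in the list above, completeness and redundancy, lend themselves to simple automatic checks. The sketch below computes both for a list of records; the metric definitions (share of filled fields, share of exact duplicate records) and the sample records are assumptions chosen for illustration, since the survey does not define how these dimensions should be measured.

```python
def completeness(records, fields):
    """Fraction of expected field values that are present and non-empty."""
    total = len(records) * len(fields)
    if total == 0:
        return 1.0  # nothing expected, so nothing is missing
    filled = sum(1 for r in records for f in fields
                 if r.get(f) not in (None, ""))
    return filled / total


def redundancy(records, fields):
    """Fraction of records that exactly repeat an earlier record on `fields`."""
    if not records:
        return 0.0
    seen = set()
    duplicates = 0
    for r in records:
        key = tuple(r.get(f) for f in fields)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates / len(records)


# Made-up sensor records: one missing value, one missing field, one duplicate.
sensor_records = [
    {"id": 1, "temp": 20.5},
    {"id": 2, "temp": None},
    {"id": 1, "temp": 20.5},
    {"id": 3},
]
fields = ["id", "temp"]
c = completeness(sensor_records, fields)
d = redundancy(sensor_records, fields)
```

Accuracy and consistency are harder to automate because they need a reference source or domain rules to compare against, which is part of why the text calls for further investigation.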

      The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy efficiency optimization, and processing hardware and software system architectures, etc. Particularly, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

      7.2 Outlook

      The emergence of big data opens great opportunities. In the IT era, the “T” (Technology) was the main concern, while technology drives the development of data. In the big data era, with the prominence of data value and advances in “I” (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact, but also influence everyone’s ways of living and thinking, which is happening right now. We cannot predict the future, but may take precautions for possible events to occur in the future.

      – Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have sought better ways to cope with larger-scale, higher-diversity, and more complexly structured data. These efforts are represented by Spanner, the globally-distributed database of Google, and F1, a fault-tolerant, expandable, distributed relational database. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to relational databases, and effectively handle data through grammars similar to SQL.

      – Data resource performance: Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

      – Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc. shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

      – Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., as in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a friendly manner can they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc. are frequently used to visualize results of data analysis. New presentation forms will occur in the future, e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

      – Data-oriented: It is well-known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it is observed that the role of data is becoming increasingly significant. In the small scale data era, in which logic is more complex than data, program design is mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed business logic. Consequently, programs are gradually transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods are certain to emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

      – Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

      – During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

      – Compared with accurate data, we would like to accept numerous and complicated data.

      – We shall pay greater attention to correlations between things rather than exploring causal relationships.

      – The simple algorithms of big data are more effective than complex algorithms of small data.

      – Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace “experts.”

      Throughout the history of human society, the demands and willingness of human beings are always the source powers to promote scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in the future society.

      Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao’s research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

      References

      1. Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

      2. Fact sheet: Big data across the federal government (2012). http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf

      3. Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper

      4. Drowning in numbers - digital data will flood the planet - and help us understand it better (2011). http://www.economist.com/blogs/dailychart/2011/11/bigdata-0

      5. Lohr S (2012) The age of big data. New York Times, pp 11

      6. Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/thedigitalbreadcrumbs-that-lead-to-big-data

      7. Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-searchforanalysts-to-make-sense-of-big-data

      8. Big data (2008). http://www.nature.com/news/specials/bigdata/index.html

      9. Special online collection: dealing with big data (2011). http://www.sciencemag.org/site/special/data

      10. Manyika J, McKinsey Global Institute, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute

      11. Mayer-Schönberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Eamon Dolan/Houghton Mifflin Harcourt

      12. Laney D (2001) 3-d data management: controlling data volume, velocity and variety. META Group Research Note, 6 February

      13. Zikopoulos P, Eaton C et al (2011) Understanding big data: analytics for enterprise class hadoop and streaming data. McGraw-Hill Osborne Media

      14. Meijer E (2011) The world according to linq. Communications of the ACM 54(10):45–51

      15. Beyer M (2011) Gartner says solving big data challenge involves more than just managing volumes of data. Gartner. http://www.gartner.com/it/page.jsp

      16. O R Team (2011) Big data now: current perspectives from O'Reilly Radar. O'Reilly Media

      17. Grobelnik M (2012) Big data tutorial. http://videolectures.net/eswc2012grobelnikbigdata

      18. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2008) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014

      19. DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6):85–98

      Mobile Netw Appl (2014) 19:171–209

      20. Walter T (2009) Teradata past, present, and future. UCI ISG lecture series on scalable data management

      21. Ghemawat S, Gobioff H, Leung S-T (2003) The google file system. In: ACM SIGOPS Operating Systems Review, vol 37. ACM, pp 29–43

      22. Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

      23. Hey AJG, Tansley S, Tolle KM et al (2009) The fourth paradigm: data-intensive scientific discovery

      24. Howard JH, Kazar ML, Menees SG, Nichols DA, Satyanarayanan M, Sidebotham RN, West MJ (1988) Scale and performance in a distributed file system. ACM Trans Comput Syst (TOCS) 6(1):51–81

      25. Cattell R (2011) Scalable sql and nosql data stores. ACM SIGMOD Record 39(4):12–27

      26. Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033

      27. Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98

      28. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J et al (2012) Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States

      29. Sun Y, Chen M, Liu B, Mao S (2013) Far: a fault-avoidant routing method for data center networks with regular topology. In: Proceedings of ACM/IEEE symposium on architectures for networking and communications systems (ANCS'13). ACM

      30. Wiki (2013) Applications and organizations using hadoop. http://wiki.apache.org/hadoop/PoweredBy

      31. Bahga A, Madisetti VK (2012) Analyzing massive machine maintenance data in a computing cloud. IEEE Transac Parallel Distrib Syst 23(10):1831–1843

      32. Gunarathne T, Wu T-L, Choi JY, Bae S-H, Qiu J (2011) Cloud computing paradigms for pleasingly parallel biomedical applications. Concurr Comput Prac Experience 23(17):2338–2354

      33. Gantz J, Reinsel D (2010) The digital universe decade - are you ready? External publication of IDC (Analyse the Future) information and data, pp 1–16

      34. Bryant RE (2011) Data-intensive scalable computing for scientific applications. Comput Sci Eng 13(6):25–33

      35. Wahab MHA, Mohd MNH, Hanafi HF, Mohsin MFM (2008) Data pre-processing on web server logs for generalized association rules mining algorithm. World Acad Sci Eng Technol 48:2008

      36. Nanopoulos A, Manolopoulos Y, Zakrzewicz M, Morzy T (2002) Indexing web access-logs for pattern queries. In: Proceedings of the 4th international workshop on web information and data management. ACM, pp 63–68

      37. Joshi KP, Joshi A, Yesha Y (2003) On using a warehouse to analyze web logs. Distrib Parallel Databases 13(2):161–180

      38. Chandramohan V, Christensen K (2002) A first look at wired sensor networks for video surveillance systems. In: Proceedings LCN 2002, 27th annual IEEE conference on local computer networks. IEEE, pp 728–729

      39. Selavo L, Wood A, Cao Q, Sookoor T, Liu H, Srinivasan A, Wu Y, Kang W, Stankovic J, Young D et al (2007) Luster: wireless sensor network for environmental research. In: Proceedings of the 5th international conference on Embedded networked sensor systems. ACM, pp 103–116

      40. Barrenetxea G, Ingelrest F, Schaefer G, Vetterli M, Couach O, Parlange M (2008) Sensorscope: out-of-the-box environmental monitoring. In: Information processing in sensor networks, 2008, international conference on IPSN'08. IEEE, pp 332–343

      41. Kim Y, Schmid T, Charbiwala ZM, Friedman J, Srivastava MB (2008) Nawms: nonintrusive autonomous water monitoring system. In: Proceedings of the 6th ACM conference on Embedded network sensor systems. ACM, pp 309–322

      42. Kim S, Pakzad S, Culler D, Demmel J, Fenves G, Glaser S, Turon M (2007) Health monitoring of civil infrastructures using wireless sensor networks. In: Information Processing in Sensor Networks, 2007, 6th International Symposium on IPSN 2007. IEEE, pp 254–263

      43. Ceriotti M, Mottola L, Picco GP, Murphy AL, Guna S, Corra M, Pozzi M, Zonta D, Zanon P (2009) Monitoring heritage buildings with wireless sensor networks: the torre aquila deployment. In: Proceedings of the 2009 International Conference on Information Processing in Sensor Networks. IEEE Computer Society, pp 277–288

      44. Tolle G, Polastre J, Szewczyk R, Culler D, Turner N, Tu K, Burgess S, Dawson T, Buonadonna P, Gay D et al (2005) A macroscope in the redwoods. In: Proceedings of the 3rd international conference on embedded networked sensor systems. ACM, pp 51–63

      45. Wang F, Liu J (2011) Networked wireless sensor data collection: issues, challenges, and approaches. IEEE Commun Surv Tutor 13(4):673–687

      46. Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the 11th international conference on World Wide Web. ACM, pp 124–135

      47. Choudhary S, Dincturk ME, Mirtaheri SM, Moosavi A, von Bochmann G, Jourdan G-V, Onut I-V (2012) Crawling rich internet applications: the state of the art. In: CASCON, pp 146–160

      48. Ghani N, Dixit S, Wang T-S (2000) On ip-over-wdm integration. IEEE Commun Mag 38(3):72–84

      49. Manchester J, Anderson J, Doshi B, Dravida S (1998) Ip over sonet. IEEE Commun Mag 36(5):136–142

      50. Jinno M, Takara H, Kozicki B (2009) Dynamic optical mesh networks: drivers, challenges and solutions for the future. In: Optical communication, 2009, 35th European conference on ECOC'09. IEEE, pp 1–4

      51. Barroso LA, Holzle U (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synt Lect Comput Archit 4(1):1–108

      52. Armstrong J (2009) Ofdm for optical communications. J Light Technol 27(3):189–204

      53. Shieh W (2011) Ofdm for flexible high-speed optical networks. J Light Technol 29(10):1560–1577

      54. Cisco data center interconnect design and deployment guide (2010)

      55. Greenberg A, Hamilton JR, Jain N, Kandula S, Kim C, Lahiri P, Maltz DA, Patel P, Sengupta S (2009) Vl2: a scalable and flexible data center network. In: ACM SIGCOMM computer communication review, vol 39. ACM, pp 51–62

      56. Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S (2009) Bcube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 39(4):63–74

      57. Farrington N, Porter G, Radhakrishnan S, Bazzaz HH, Subramanya V, Fainman Y, Papen G, Vahdat A (2011) Helios: a hybrid electrical/optical switch architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 41(4):339–350

      58. Abu-Libdeh H, Costa P, Rowstron A, O'Shea G, Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Comput Commun Rev 40(4):51–62

      59. Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what's needed for datacenter network operations. IEEE Commun Mag 48(7):32–39


      60. Wang G, Andersen DG, Kaminsky M, Papagiannaki K, Ng TS, Kozuch M, Ryan M (2010) c-through: Part-time optics in data centers. In: ACM SIGCOMM Computer Communication Review, vol 40. ACM, pp 327–338

      61. Ye X, Yin Y, Yoo SJB, Mejia P, Proietti R, Akella V (2010) Dos: a scalable optical switch for datacenters. In: Proceedings of the 6th ACM/IEEE symposium on architectures for networking and communications systems. ACM, p 24

      62. Singla A, Singh A, Ramachandran K, Xu L, Zhang Y (2010) Proteus: a topology malleable data center network. In: Proceedings of the 9th ACM SIGCOMM workshop on hot topics in networks. ACM, p 8

      63. Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P (2011) Energy-efficient design of a scalable optical multiplane interconnection architecture. IEEE J Sel Top Quantum Electron 17(2):377–383

      64. Kodi AK, Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance computing (hpc) systems. IEEE J Sel Top Quantum Electron 17(2):384–395

      65. Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H (2012) Mirror mirror on the ceiling: flexible wireless links for data centers. ACM SIGCOMM Comput Commun Rev 42(4):443–454

      66. Lenzerini M (2002) Data integration: a theoretical perspective. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. ACM, pp 233–246

      67. Cafarella MJ, Halevy A, Khoussainova N (2009) Data integration for the relational web. Proc VLDB Endowment 2(1):1090–1101

      68. Maletic JI, Marcus A (2000) Data cleansing: beyond integrity analysis. In: IQ. Citeseer, pp 200–209

      69. Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1-2):83–113

      70. Chen H, Ku W-S, Wang H, Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 51–62

      71. Zhao Z, Ng W (2012) A model-based approach for rfid data stream cleansing. In: Proceedings of the 21st ACM international conference on information and knowledge management. ACM, pp 862–871

      72. Khoussainova N, Balazinska M, Suciu D (2008) Probabilistic event extraction from rfid data. In: Data Engineering, 2008, IEEE 24th international conference on ICDE 2008. IEEE, pp 1480–1482

      73. Herbert KG, Wang JTL (2007) Biological data cleaning: a case study. Int J Inf Qual 1(1):60–82

      74. Tsai T-H, Lin C-Y (2012) Exploring contextual redundancy in improving object-based video coding for video sensor networks surveillance. IEEE Transac Multmed 14(3):669–682

      75. Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 269–278

      76. Kamath U, Compton J, Dogan RI, Jong KD, Shehu A (2012) An evolutionary algorithm approach for feature generation from sequence data and its application to dna splice site prediction. IEEE/ACM Transac Comput Biol Bioinforma (TCBB) 9(5):1387–1398

      77. Leung K-S, Lee KH, Wang J-F, Ng EYT, Chan HLY, Tsui SKW, Mok TSK, Tse PC-H, Sung JJ-Y (2011) Data mining on dna sequences of hepatitis b virus. IEEE/ACM Transac Comput Biol Bioinforma 8(2):428–440

      78. Huang Z, Shen H, Liu J, Zhou X (2011) Effective data co-reduction for multimedia similarity search. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, pp 1021–1032

      79. Bleiholder J, Naumann F (2008) Data fusion. ACM Comput Surv (CSUR) 41(1):1

      80. Brewer EA (2000) Towards robust distributed systems. In: PODC, p 7

      81. Gilbert S, Lynch N (2002) Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59

      82. McKusick MK, Quinlan S (2009) Gfs: evolution on fast-forward. ACM Queue 7(7):10

      83. Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J (2008) Scope: easy and efficient parallel processing of massive data sets. Proc VLDB Endowment 1(2):1265–1276

      84. Beaver D, Kumar S, Li HC, Sobel J, Vajgel P et al (2010) Finding a needle in haystack: facebook's photo storage. In: OSDI, vol 10, pp 1–8

      85. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: amazon's highly available key-value store. In: SOSP, vol 7, pp 205–220

      86. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In: Proceedings of the twenty-ninth annual ACM symposium on theory of computing. ACM, pp 654–663

      87. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2):4

      88. Burrows M (2006) The chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th symposium on Operating systems design and implementation. USENIX Association, pp 335–350

      89. Lakshman A, Malik P (2009) Cassandra: structured storage system on a p2p network. In: Proceedings of the 28th ACM symposium on principles of distributed computing. ACM, pp 5–5

      90. George L (2011) HBase: the definitive guide. O'Reilly Media Inc

      91. Judd D (2008) hypertable-0.9.0.4-alpha

      92. Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc

      93. Crockford D (2006) The application/json media type for javascript object notation (json)

      94. Murty J (2009) Programming amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc

      95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc

      96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 975–986

      97. Yang H-C, Parker DS (2009) Traverse: simplified indexing on large map-reduce-merge clusters. In: Database systems for advanced applications. Springer, pp 308–322

      98. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with sawzall. Sci Program 13(4):277–298

      99. Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U (2009) Building a high-level dataflow system on top of map-reduce: the pig experience. Proceedings VLDB Endowment 2(2):1414–1425


      100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629

      101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72

      102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) Dryadlinq: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14

      103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: Parallel and distributed processing, 2008, IEEE international symposium on IPDPS 2008. IEEE, pp 1–11

      104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146

      105. Bu Y, Bill H, Balazinska M, Ernst MD (2010) Haloop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296

      106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative mapreduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing. ACM, pp 810–818

      107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2

      108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: mapreduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing. ACM, p 7

      109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) Ciel: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on Networked systems design and implementation, p 9

      110. Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York

      111. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

      112. What analytics, data mining, big data software you used in the past 12 months for a real project? (2012). http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html

      113. Berthold MR, Cebron N, Dill F, Gabriel TR, Kotter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer

      114. Sallam RL, Richardson J, Hagerty J, Hostmann B (2011) Magic quadrant for business intelligence platforms. Gartner Group, Stamford, CT

      115. Beyond the PC. Special Report on Personal Technology (2011)

      116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A et al (2011) The iplant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

      117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on Software quality assurance. ACM, pp 70–77

      118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers. ACM, pp 277–286

      119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Record 34(2):18–26

      120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM Sigmod Record 33(1):50–57

      121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Transac Manag Inform Syst (TMIS) 3(2):7

      122. Manning CD, Schutze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press

      123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Transac Neural Netw 13(5):1163–1177

      124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11

      125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117

      126. Konopnicki D, Shmueli O (1995) W3qs: a query system for the world-wide web. In: VLDB, vol 95, pp 54–65

      127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640

      128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval. ACM, pp 2

      129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25

      130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19

      131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819

      132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939

      133. Barragans-Martinez AB, Costa-Montenegro E, Burguillo JC, Rey-Lopez M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend tv programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311

      134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91

      135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia. ACM, pp 469–478

      136. Hirsch JE (2005) An index to quantify an individual's scientific research output. Proc Natl Acad Sci USA 102(46):16569

      137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company

      138. Aggarwal CC (2011) An introduction to social network data analytics. Springer

      139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1046–1054

      140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied Computing. ACM, pp 1115–1116


      141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Transac Knowl Discov Data (TKDD) 5(2):10

      142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on World wide web. ACM, pp 631–640

      143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis. ACM, pp 16–25

      144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. ACM, pp 315–321

      145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 145–158

      146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 131–144

      147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1007–1016

      148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 807–816

      149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on Web search and data mining. ACM, pp 657–666

      150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360

      151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (last accessed 5 May 2013)

      152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51

      153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive rfid, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 2

      154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval. ACM, pp 469–478

      155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated, multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium. ACM, pp 445–454

      156. Tasevski P (2011) Password attacks and generation strategies. Tartu University, Faculty of Mathematics and Computer Sciences



        Fig 2 The 4Vs feature of big data

        The McKinsey report is regarded as prospective and predictive, while the following facts may validate the value of big data. During the 2009 flu pandemic, Google obtained timely information by analyzing big data, which provided even more valuable information than that provided by disease prevention centers. Nearly all countries required hospitals to inform agencies such as disease prevention centers of new influenza cases. However, patients usually did not see doctors immediately after they were infected. It also took some time to send information from hospitals to disease prevention centers, and for disease prevention centers to analyze and summarize such information. Therefore, by the time the public became aware of the pandemic, the disease might already have been spreading for one to two weeks; official reporting was inherently delayed. Google found that, during the spread of influenza, the entries frequently searched for on its search engine differed from those at ordinary times, and that the usage frequencies of those entries were correlated with the spread of influenza in both time and location. Google found 45 search entry groups that were closely relevant to the outbreak of influenza and incorporated them into specific mathematical models to forecast the spread of influenza and even to predict the places from which influenza spread. The related research results have been published in Nature [18].
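The correlation idea described above can be illustrated with a minimal sketch. It is not the model used in [18]; it only assumes that a weekly share of flu-related queries and a weekly reported illness rate are available, computes their Pearson correlation, and fits a least-squares line that could turn a new query share into an estimate. All data values and the names `query_share`, `ili_rate`, `pearson`, and `fit_line` are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


# Hypothetical weekly observations (assumed values, for illustration only):
# fraction of searches that match flu-related entries, and the percentage
# of doctor visits later reported as influenza-like illness.
query_share = [0.010, 0.012, 0.015, 0.021, 0.030, 0.026, 0.018]
ili_rate = [1.1, 1.2, 1.6, 2.0, 3.1, 2.5, 1.9]

r = pearson(query_share, ili_rate)
slope, intercept = fit_line(query_share, ili_rate)

# Query logs are available immediately, so this week's query share can
# give an estimate before the official reports arrive.
estimate_this_week = slope * 0.024 + intercept
print(f"correlation={r:.3f}, estimated rate={estimate_this_week:.2f}")
```

The sketch uses a single query series; a forecasting system in the spirit of the text would first select the query groups that correlate best with historical reports and then combine them.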

        In 2008, Microsoft purchased Farecast, a sci-tech venture company in the US. Farecast has an airline ticket forecast system that predicts the trends and the rising/dropping ranges of airline ticket prices. The system has been incorporated into the Bing search engine of Microsoft. By 2012, the system had saved nearly USD 50 per ticket per passenger, with a forecast accuracy as high as 75 %.

        At present, data has become an important production factor that could be comparable to material assets and human capital. As multimedia, social media, and IoT develop, enterprises will collect more information, leading to an exponential growth of data volume. Big data will have a huge and increasing potential in creating value for businesses and consumers.

        1.4 The development of big data

        In the late 1970s, the concept of “database machine” emerged, which is a technology specially used for storing and analyzing data. With the increase of data volume, the storage and processing capacity of a single mainframe computer system became inadequate. In the 1980s, people proposed “share nothing,” a parallel database system, to meet the demand of the increasing data volume [19]. The share nothing system architecture is based on the use of a cluster, in which every machine has its own processor, storage, and disk. The Teradata system was the first successful commercial parallel database system. Such databases later became very popular. On June 2, 1986, a milestone event occurred when Teradata delivered the first parallel database system, with a storage capacity of 1 TB, to Kmart to help the large-scale retail company in North America expand its data warehouse [20]. In the late 1990s, the advantages of parallel databases were widely recognized in the database field.
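The share nothing idea can be sketched in a few lines, under loud assumptions: the classes `Node` and `SharedNothingCluster`, the hash-partitioning rule, and the sample rows are all hypothetical and run in a single process. No real parallel database (Teradata included) is being reproduced here. Each node keeps its own rows, a row is routed to exactly one node by hashing its partition key, and a query is answered by letting every node aggregate locally before a coordinator merges the partial results.

```python
import zlib
from collections import defaultdict


class Node:
    """One machine of the cluster: it owns its rows and shares nothing."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.rows = []  # stands in for the node's private disk

    def insert(self, row):
        self.rows.append(row)

    def local_sum_by_key(self, key_field, value_field):
        """Aggregate only the rows stored on this node."""
        partial = defaultdict(float)
        for row in self.rows:
            partial[row[key_field]] += row[value_field]
        return dict(partial)


class SharedNothingCluster:
    """Coordinator that hash-partitions rows across independent nodes."""

    def __init__(self, num_nodes, partition_field):
        self.nodes = [Node(i) for i in range(num_nodes)]
        self.partition_field = partition_field

    def _owner(self, key):
        # crc32 is used as a deterministic hash (the built-in hash() of
        # strings varies between Python runs).
        digest = zlib.crc32(str(key).encode("utf-8"))
        return self.nodes[digest % len(self.nodes)]

    def insert(self, row):
        self._owner(row[self.partition_field]).insert(row)

    def sum_by_key(self, key_field, value_field):
        """Each node aggregates locally; only partial results are merged."""
        merged = defaultdict(float)
        for node in self.nodes:
            partials = node.local_sum_by_key(key_field, value_field)
            for key, value in partials.items():
                merged[key] += value
        return dict(merged)


# Hypothetical retail rows, loosely echoing the data warehouse example.
cluster = SharedNothingCluster(num_nodes=4, partition_field="store")
for sale in [
    {"store": "A", "amount": 10.0},
    {"store": "B", "amount": 5.0},
    {"store": "A", "amount": 20.0},
]:
    cluster.insert(sale)

totals = cluster.sum_by_key("store", "amount")
print(totals)
```

Because rows with the same partition key always land on the same node, capacity can be added by adding nodes; a real system would also have to handle repartitioning and node failures, which this sketch ignores.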

        However, many challenges of big data arose. With the development of Internet services, indexes and queried contents grew rapidly. Therefore, search engine companies had to face the challenges of handling such big data. Google created GFS [21] and the MapReduce [22] programming model to cope with the challenges brought about by data management and analysis at the Internet scale. In addition, contents generated by users, sensors, and other ubiquitous data sources also fueled the overwhelming data flows, which required a fundamental change in the computing architecture and large-scale data processing mechanisms. In January 2007, Jim Gray, a pioneer of database software, called such transformation “The Fourth Paradigm” [23]. He also thought that the only way to cope with such a paradigm was to develop a new generation of computing tools to manage, visualize, and analyze massive data. In June 2011, another milestone event occurred: EMC/IDC published a research report titled Extracting Value from Chaos [1], which introduced the concept and potential of big data for the first time. This research report triggered great interest in big data in both industry and academia.

        Over the past few years, nearly all major companies, including EMC, Oracle, IBM, Microsoft, Google, Amazon, and Facebook, have started their big data projects. Taking IBM as an example, since 2005 IBM has invested USD 16 billion in 30 acquisitions related to big data. In academia, big data was also under the spotlight. In 2008, Nature published a big data special issue. In 2011, Science also launched a special issue on the key technologies of “data processing” in big data. In 2012, European Research Consortium for Informatics and Mathematics (ERCIM) News published a special issue on big data. At the beginning of 2012, a report titled Big Data, Big Impact, presented at the Davos Forum in Switzerland, announced that big data has become a new kind of economic asset, just like currency or gold. Gartner, an international research agency, issued Hype Cycles from 2012 to 2013, which classified big data computing, social analysis, and stored data analysis among 48 emerging technologies that deserve the most attention.

        Many national governments, such as that of the U.S., also paid great attention to big data. In March 2012, the Obama Administration announced a USD 200 million investment to launch the “Big Data Research and Development Plan,” which was the second major scientific and technological development initiative after the “Information Highway” initiative in 1993. In July 2012, the “Vigorous ICT Japan” project issued by Japan’s Ministry of Internal Affairs and Communications indicated that big data development should be a national strategy and that application technologies should be the focus. In July 2012, the United Nations issued the Big Data for Development report, which summarized how governments utilized big data to better serve and protect their people.

        1.5 Challenges of big data

        The sharply increasing data deluge in the big data era brings about huge challenges for data acquisition, storage, management, and analysis. Traditional data management and analysis systems are based on the relational database management system (RDBMS). However, such RDBMSs apply only to structured data, not to semi-structured or unstructured data. In addition, RDBMSs increasingly rely on more and more expensive hardware. It is apparent that traditional RDBMSs cannot handle the huge volume and heterogeneity of big data. The research community has proposed some solutions from different perspectives. For example, cloud computing is utilized to meet the infrastructure requirements of big data, e.g., cost efficiency, elasticity, and smooth upgrading/downgrading. For the permanent storage and management of large-scale disordered datasets, distributed file systems [24] and NoSQL [25] databases are good choices. Programming frameworks such as MapReduce have achieved great success in processing clustered tasks, especially for webpage ranking. Various big data applications can be developed based on these innovative technologies or platforms. Nevertheless, it is non-trivial to deploy big data analysis systems.

        Some literature [26–28] discusses obstacles in the development of big data applications. The key challenges are listed as follows:

        – Data representation: many datasets have certain levels of heterogeneity in type, structure, semantics, organization, granularity, and accessibility. Data representation aims to make data more meaningful for computer analysis and user interpretation. Nevertheless, an improper data representation will reduce the value of the original data and may even obstruct effective data analysis. Efficient data representation shall reflect data structure, class, and type, as well as integrated technologies, so as to enable efficient operations on different datasets.

        – Redundancy reduction and data compression: generally, there is a high level of redundancy in datasets. Redundancy reduction and data compression are effective in reducing the indirect cost of the entire system, on the premise that the potential value of the data is not affected. For example, most data generated by sensor networks are highly redundant and may be filtered and compressed by orders of magnitude.

        – Data life cycle management: compared with the relatively slow advances of storage systems, pervasive sensing and computing are generating data at unprecedented rates and scales. We are confronted with many pressing challenges, one of which is that current storage systems cannot support such massive data. Generally speaking, the value hidden in big data depends on data freshness. Therefore, a data importance principle related to the analytical value should be developed to decide which data shall be stored and which data shall be discarded.

        – Analytical mechanism: the analytical system of big data shall process masses of heterogeneous data within a limited time. However, traditional RDBMSs are strictly designed, with a lack of scalability and expandability, and cannot meet the performance requirements. Non-relational databases have shown their unique advantages in the processing of unstructured data and have started to become mainstream in big data analysis. Even so, non-relational databases still have some problems in their performance and particular applications. We shall find a compromise solution between RDBMSs and non-relational databases. For example, some enterprises (e.g., Facebook and Taobao) have utilized a mixed database architecture that integrates the advantages of both types of database. More research is needed on in-memory databases and on sample data based on approximate analysis.

        – Data confidentiality: most big data service providers or owners at present cannot effectively maintain and analyze such huge datasets because of their limited capacity. They must rely on professionals or tools to analyze such data, which increases the potential safety risks. For example, a transactional dataset generally includes a set of complete operating data that drives key business processes. Such data contains details of the lowest granularity and some sensitive information such as credit card numbers. Therefore, the analysis of big data may be delivered to a third party for processing only when proper preventive measures are taken to protect such sensitive data and ensure its safety.

        – Energy management: the energy consumption of mainframe computing systems has drawn much attention from both economic and environmental perspectives. With the increase of data volume and analytical demands, the processing, storage, and transmission of big data will inevitably consume more and more electric energy. Therefore, system-level power consumption control and management mechanisms shall be established for big data while expandability and accessibility are ensured.

        – Expandability and scalability: the analytical system of big data must support present and future datasets. The analytical algorithm must be able to process increasingly expanding and more complex datasets.

        – Cooperation: analysis of big data is interdisciplinary research, which requires experts in different fields to cooperate to harvest the potential of big data. A comprehensive big data network architecture must be established to help scientists and engineers in various fields access different kinds of data and fully utilize their expertise, so as to cooperate to complete the analytical objectives.
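
        The redundancy reduction challenge listed above can be illustrated with a small sketch. It assumes a simple dead-band filter (a reading is kept only if it differs from the last kept reading by more than a threshold) followed by general-purpose compression with Python's zlib module. The sample data, the threshold, and the function name deadband_filter are hypothetical; real deployments would choose the filter according to the analytical value of the data.

```python
import json
import zlib


def deadband_filter(readings, threshold):
    """Keep a reading only if it differs from the last kept one by more than threshold."""
    kept = []
    for timestamp, value in readings:
        if not kept or abs(value - kept[-1][1]) > threshold:
            kept.append((timestamp, value))
    return kept


# Hypothetical temperature samples: (seconds, degrees Celsius), one per second.
readings = [(t, 20.0 + (0.01 if t % 2 else 0.0)) for t in range(1000)]
readings[500] = (500, 25.0)  # a single abnormal spike

# Step 1: drop near-duplicate readings. Step 2: compress what remains.
filtered = deadband_filter(readings, threshold=0.5)
raw_bytes = json.dumps(readings).encode("utf-8")
packed = zlib.compress(json.dumps(filtered).encode("utf-8"))
```

        In this toy series only the first reading, the spike, and the return to the baseline survive the filter, which shows how highly redundant sensor data can shrink before it is stored or transmitted.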

        2 Related technologies

        In order to gain a deep understanding of big data, this section introduces several fundamental technologies that are closely related to big data, including cloud computing, IoT, data centers, and Hadoop.

        2.1 Relationship between cloud computing and big data

        Cloud computing is closely related to big data. The key components of cloud computing are shown in Fig. 3. Big data is the object of the computation-intensive operation and stresses the storage capacity of a cloud system. The main objective of cloud computing is to use huge computing and storage resources under concentrated management, so as to provide big data applications with fine-grained computing capacity. The development of cloud computing provides solutions for the storage and processing of big data. On the other hand, the emergence of big data also accelerates the development of cloud computing. The distributed storage technology based on cloud computing can effectively manage big data; the parallel computing capacity by virtue of cloud computing can improve the efficiency of acquiring and analyzing big data.

        Even though there are many overlapping technologies in cloud computing and big data, they differ in the following two aspects. First, the concepts are different to a certain extent. Cloud computing transforms the IT architecture, while big data influences business decision-making. However, big data depends on cloud computing as the fundamental infrastructure for smooth operation.

        Second, big data and cloud computing have different target customers. Cloud computing is a technology and product targeting Chief Information Officers (CIOs) as an advanced IT solution. Big data is a product targeting Chief Executive Officers (CEOs), focusing on business operations. Since the decision makers may directly feel the pressure from market competition, they must defeat business opponents in more competitive ways. With the advances of big data and cloud computing, these two technologies are increasingly entwined with each other. Cloud computing, with functions similar to those of computers and operating systems, provides system-level resources; big data operates in the upper level supported by cloud computing and provides functions similar to those of a database, together with efficient data processing capacity. Kissinger, President of EMC, indicated that the application of big data must be based on cloud computing.

        Fig. 3 Key components of cloud computing

        The evolution of big data was driven by the rapid growth of application demands, and cloud computing developed from virtualization technologies. Therefore, cloud computing not only provides computation and processing for big data, but is also itself a service mode. To a certain extent, the advances of cloud computing also promote the development of big data, and the two supplement each other.

        2.2 Relationship between IoT and big data

        In the IoT paradigm, an enormous number of networked sensors are embedded into various devices and machines in the real world. Such sensors, deployed in different fields, may collect various kinds of data, such as environmental data, geographical data, astronomical data, and logistic data. Mobile equipment, transportation facilities, public facilities, and home appliances could all be data acquisition equipment in IoT, as illustrated in Fig. 4.

        The big data generated by IoT has different characteristics compared with general big data because of the different types of data collected, of which the most typical characteristics include heterogeneity, variety, unstructured features, noise, and high redundancy. Although current IoT data is not the dominant part of big data, by 2030 the quantity of sensors will reach one trillion, and IoT data will then be the most important part of big data, according to the forecast of HP. A report from Intel pointed out that big data in IoT has three features that conform to the big data paradigm: (i) abundant terminals generate masses of data; (ii) data generated by IoT is usually semi-structured or unstructured; (iii) data of IoT is useful only when it is analyzed.

        At present, the data processing capacity of IoT has fallen behind the collected data, and it is extremely urgent to accelerate the introduction of big data technologies to promote the development of IoT. Many operators of IoT realize the importance of big data, since the success of IoT hinges upon the effective integration of big data and cloud computing. The widespread deployment of IoT will also bring many cities into the big data era.

        There is a compelling need to adopt big data for IoT applications, while the development of big data is already lagging behind. It has been widely recognized that these two technologies are inter-dependent and should be jointly developed: on one hand, the widespread deployment of IoT drives the high growth of data both in quantity and category, thus providing the opportunity for the application and development of big data; on the other hand, the application of big data technology to IoT also accelerates the research advances and business models of IoT.

        Fig. 4 Illustration of data acquisition equipment in IoT

        2.3 Data center

        In the big data paradigm, the data center is not only a platform for the concentrated storage of data, but also undertakes more responsibilities, such as acquiring data, managing data, organizing data, and leveraging the data values and functions. Data centers mainly concern “data” rather than “center.” A data center holds masses of data and organizes and manages them according to its core objective and development path, which is more valuable than owning a good site and resources. The emergence of big data brings about sound development opportunities and great challenges to data centers. Big data is an emerging paradigm, which will promote the explosive growth of the infrastructure and related software of data centers. The physical data center network is the core for supporting big data, but at present it is also the key infrastructure that is most urgently required [29].

        – Big data requires the data center to provide powerful backstage support. The big data paradigm has more stringent requirements on storage capacity and processing capacity, as well as network transmission capacity. Enterprises must take the development of data centers into consideration to improve the capacity for rapidly and effectively processing big data under a limited price/performance ratio. The data center shall provide the infrastructure with a large number of nodes, build a high-speed internal network, effectively dissipate heat, and effectively back up data. Only when a highly energy-efficient, stable, safe, expandable, and redundant data center is built can the normal operation of big data applications be ensured.

        – The growth of big data applications accelerates the revolution and innovation of data centers. Many big data applications have developed their own unique architectures and directly promote the development of storage, network, and computing technologies related to the data center. With the continued growth of the volumes of structured and unstructured data, and the variety of sources of analytical data, the data processing and computing capacities of the data center shall be greatly enhanced. In addition, as the scale of data centers keeps expanding, how to reduce the operational cost is also an important issue for the development of data centers.

        – Big data endows the data center with more functions. In the big data paradigm, the data center shall not only be concerned with hardware facilities but also strengthen soft capacities, i.e., the capacities of acquisition, processing, organization, analysis, and application of big data. The data center may help business personnel analyze the existing data, discover problems in business operation, and develop solutions from big data.

        2.4 Relationship between Hadoop and big data

        Presently, Hadoop is widely used in big data applications in the industry, e.g., spam filtering, network searching, clickstream analysis, and social recommendation. In addition, considerable academic research is now based on Hadoop. Some representative cases are given below. As declared in June 2012, Yahoo runs Hadoop on 42,000 servers at four data centers to support its products and services, e.g., searching and spam filtering. At present, the biggest Hadoop cluster has 4,000 nodes, but the number of nodes will be increased to 10,000 with the release of Hadoop 2.0. In the same month, Facebook announced that their Hadoop cluster can process 100 PB of data, which grew by 0.5 PB per day as of November 2012. Some well-known agencies that use Hadoop to conduct distributed computation are listed in [30]. In addition, many companies provide commercial Hadoop distributions and/or support, including Cloudera, IBM, MapR, EMC, and Oracle.

        Among modern industrial machinery and systems, sensors are widely deployed to collect information for environment monitoring, failure forecasting, etc. Bahga et al. [31] proposed a framework for data organization and cloud computing infrastructure, termed CloudView. CloudView uses mixed architectures, i.e., local nodes and remote clusters based on Hadoop, to analyze machine-generated data. Local nodes are used for the forecast of real-time failures; clusters based on Hadoop are used for complex offline analysis, e.g., case-driven data analysis.

        The exponential growth of genome data and the sharp drop of sequencing cost are transforming bio-science and bio-medicine into data-driven sciences. Gunarathne et al. [32] utilized cloud computing infrastructures, Amazon AWS and Microsoft Azure, and data processing frameworks based on MapReduce, Hadoop, and Microsoft DryadLINQ to run two parallel bio-medicine applications: (i) assembly of genome segments; (ii) dimension reduction in the analysis of chemical structure. In the latter application, the 166-D datasets used include 26,000,000 data points. The authors compared the performance of all the frameworks in terms of efficiency, cost, and availability. According to the study, the authors concluded that loose coupling will be increasingly applied to research on electron cloud, and that the parallel programming technology (MapReduce) framework may provide the user an interface with more convenient services and reduce unnecessary costs.
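
        To make the MapReduce programming model mentioned above more concrete, the following single-process sketch walks through the map, shuffle, and reduce steps. It is not Hadoop's API and not the applications evaluated in [32]; the function names and the toy task (counting 3-letter substrings in hypothetical sequence fragments) are assumptions made for illustration only.

```python
from collections import defaultdict


def map_phase(record):
    """Emit (key, 1) pairs: here, one pair per 3-letter substring of a read."""
    k = 3
    return [(record[i:i + k], 1) for i in range(len(record) - k + 1)]


def shuffle(pairs):
    """Group values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def map_reduce(records):
    """Run map over every record, shuffle by key, then reduce each group."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


reads = ["ACGTAC", "GTACGT"]  # hypothetical sequence fragments
counts = map_reduce(reads)
```

        In a real framework, the map and reduce calls run in parallel on different nodes and the shuffle moves data across the network; the user supplies only the two functions.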

        3 Big data generation and acquisition

        We have introduced several key technologies related to big data, i.e., cloud computing, IoT, data centers, and Hadoop. Next, we will focus on the value chain of big data, which can be generally divided into four phases: data generation, data acquisition, data storage, and data analysis. If we take data as a raw material, data generation and data acquisition are an exploitation process, data storage is a storage process, and data analysis is a production process that utilizes the raw material to create new value.

        3.1 Data generation

        Data generation is the first step of big data. Taking Internet data as an example, huge amounts of data in terms of search entries, Internet forum posts, chat records, and microblog messages are generated. Those data are closely related to people’s daily life and share the features of high value and low density. Such Internet data may be valueless individually, but through the exploitation of accumulated big data, useful information such as the habits and hobbies of users can be identified, and it is even possible to forecast users’ behaviors and emotional moods.

        Moreover, generated through longitudinal and/or distributed data sources, datasets are larger in scale, highly diverse, and complex. Such data sources include sensors, videos, clickstreams, and/or all other available data sources. At present, the main sources of big data are the operation and trading information in enterprises, logistic and sensing information in the IoT, human interaction information and position information in the Internet world, and data generated in scientific research. The information far surpasses the capacities of the IT architectures and infrastructures of existing enterprises, while its real-time requirement also greatly stresses the existing computing capacity.

        3.1.1 Enterprise data

        In 2013, IBM issued Analysis the Applications of Big Data to the Real World, which indicates that the internal data of enterprises are the main sources of big data. The internal data of enterprises mainly consist of online trading data and online analysis data, most of which are historically static data and are managed by RDBMSs in a structured manner. In addition, production data, inventory data, sales data, and financial data also constitute enterprise internal data, which aim to capture informationized and data-driven activities in enterprises, so as to record all activities of enterprises in the form of internal data.

        Over the past decades, IT and digital data have contributed a lot to improving the profitability of business departments. It is estimated that the business data volume of all companies in the world may double every 1.2 years [10], in which the business turnover through the Internet, enterprises to enterprises, and enterprises to consumers, per day will reach USD 450 billion [33]. The continuously increasing business data volume requires more effective real-time analysis so as to fully harvest its potential. For example, Amazon processes millions of terminal operations and more than 500,000 queries from third-party sellers per day [12]. Walmart processes one million customer trades per hour, and such trading data are imported into a database with a capacity of over 2.5 PB [3]. Akamai analyzes 75 million events per day for its targeted advertisements [13].

        3.1.2 IoT data

        As discussed, IoT is an important source of big data. In smart cities constructed based on IoT, big data may come from industry, agriculture, traffic, transportation, medical care, public departments, families, etc.

        According to the processes of data acquisition and transmission in IoT, its network architecture may be divided into three layers: the sensing layer, the network layer, and the application layer. The sensing layer is responsible for data acquisition and mainly consists of sensor networks. The network layer is responsible for information transmission and processing, where short-range transmission may rely on sensor networks and remote transmission depends on the Internet. Finally, the application layer supports specific applications of IoT.

        According to the characteristics of the Internet of Things, the data generated from IoT has the following features:

        – Large-scale data: in IoT, masses of data acquisition equipment are deployed in a distributed manner, which may acquire simple numeric data, e.g., location, or complex multimedia data, e.g., surveillance video. In order to meet the demands of analysis and processing, not only the currently acquired data but also the historical data within a certain time frame should be stored. Therefore, data generated by IoT are characterized by large scales.

        – Heterogeneity: because of the variety of data acquisition devices, the acquired data also differ, and such data feature heterogeneity.

        – Strong time and space correlation: in IoT, every data acquisition device is placed at a specific geographic location, and every piece of data has a time stamp. The time and space correlation is an important property of data from IoT. During data analysis and processing, time and space are also important dimensions for statistical analysis.

        – Effective data accounts for only a small portion of the big data: a great quantity of noise may occur during the acquisition and transmission of data in IoT. Among the datasets acquired by acquisition devices, only a small amount of abnormal data is valuable. For example, during the acquisition of traffic video, the few video frames that capture violations of traffic regulations and traffic accidents are more valuable than those capturing only the normal flow of traffic.
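
        The last two features above (time and space correlation, and the small share of valuable abnormal data) can be sketched together. The example assumes each reading carries a device location and a time stamp, groups readings by location and time window, and keeps only readings that deviate strongly from the median of their group. The readings, the window length, the tolerance, and the function names bucket and abnormal are all hypothetical.

```python
from collections import defaultdict
from statistics import median


def bucket(readings, window_s):
    """Group reading values by (location, time window)."""
    groups = defaultdict(list)
    for location, timestamp, value in readings:
        groups[(location, timestamp // window_s)].append(value)
    return groups


def abnormal(readings, window_s, tolerance):
    """Return readings that deviate from their bucket median by more than tolerance."""
    groups = bucket(readings, window_s)
    flagged = []
    for location, timestamp, value in readings:
        mid = median(groups[(location, timestamp // window_s)])
        if abs(value - mid) > tolerance:
            flagged.append((location, timestamp, value))
    return flagged


# Hypothetical vehicle speeds (km/h) from two roadside sensors.
readings = [
    ("junction-a", 0, 48), ("junction-a", 20, 50), ("junction-a", 40, 52),
    ("junction-a", 50, 95),  # speeding vehicle
    ("junction-b", 10, 30), ("junction-b", 30, 31), ("junction-b", 55, 29),
]
events = abnormal(readings, window_s=60, tolerance=20)
```

        Only the single outlying reading is retained as an event, while the normal traffic flow is summarized by its group statistics; this mirrors the observation that a few abnormal samples carry most of the value.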

        3.1.3 Bio-medical data

        As a series of high-throughput bio-measurement technologies were innovatively developed at the beginning of the 21st century, the frontier research in the bio-medicine field also entered the era of big data. By constructing smart, efficient, and accurate analytical models and theoretical systems for bio-medicine applications, the essential governing mechanism behind complex biological phenomena may be revealed. Not only can the future development of bio-medicine be determined, but leading roles can also be assumed in the development of a series of important strategic industries related to the national economy, people’s livelihood, and national security, with important applications such as medical care, new drug R&D, and grain production (e.g., transgenic crops).

        The completion of the HGP (Human Genome Project) and the continued development of sequencing technology also lead to widespread applications of big data in the field. The masses of data generated by gene sequencing go through specialized analysis according to different application demands, to be combined with clinical gene diagnosis and provide valuable information for the early diagnosis and personalized treatment of disease. One sequencing of a human genome may generate 100–600 GB of raw data. In the China National Genebank in Shenzhen, there are 1.3 million samples, including 1.15 million human samples and 150,000 animal, plant, and microorganism samples. By the end of 2013, 10 million traceable biological samples will be stored, and by the end of 2015 this figure will reach 30 million. It is predictable that, with the development of bio-medicine technologies, gene sequencing will become faster and more convenient, thus making the big data of bio-medicine grow continuously.

        In addition, data generated from clinical medical care and medical R&D also rise quickly. For example, the University of Pittsburgh Medical Center (UPMC) has stored 2 TB of such data. Explorys, an American company, provides platforms to collocate clinical data, operation and maintenance data, and financial data. At present, the information of about 13 million people has been collocated, with 44 articles of data at the scale of about 60 TB, which will reach 70 TB in 2013. Practice Fusion, another American company, manages the electronic medical records of about 200,000 patients.

        Apart from such small and medium-sized enterprises, other well-known IT companies, such as Google, Microsoft, and IBM, have invested extensively in the research and computational analysis of methods related to high-throughput biological big data, for shares in the huge market known as the “Next Internet.” IBM forecast at the 2013 Strategy Conference that, with the sharp increase of medical images and electronic medical records, medical professionals may utilize big data to extract useful clinical information from masses of data to obtain a medical history and forecast treatment effects, thus improving patient care and reducing cost. It is anticipated that, by 2015, the average data volume of every hospital will increase from 167 TB to 665 TB.

        3.1.4 Data generation from other fields

        As scientific applications are increasing, the scale of datasets is gradually expanding, and the development of some disciplines greatly relies on the analysis of masses of data. Here, we examine several such applications. Although they are in different scientific fields, the applications have similar and increasing demands on data analysis. The first example is related to computational biology. GenBank is a nucleotide sequence database maintained by the U.S. National Center for Biotechnology Information. Data in this database may double every 10 months. By August 2009, GenBank had more than 250 billion bases from 150,000 different organisms [34]. The second example is related to astronomy. The Sloan Digital Sky Survey (SDSS), the biggest sky survey project in astronomy, recorded 25 TB of data from 1998 to 2008. As the resolution of the telescope is improved, by 2004 the data volume generated per night will surpass 20 TB. The last application is related to high-energy physics. As of the beginning of 2008, the ATLAS experiment of the Large Hadron Collider (LHC) of the European Organization for Nuclear Research generates raw data at 2 PB/s and stores about 10 TB of processed data per year.

        In addition, pervasive sensing and computing among nature, commercial, Internet, government, and social environments are generating heterogeneous data with unprecedented complexity. These datasets have their unique data characteristics in scale, time dimension, and data category. For example, mobile data were recorded with respect to positions, movement, approximation degrees, communications, multimedia, use of applications, and audio environment [108]. According to the application environment and requirements, such datasets can be classified into different categories, so as to select the proper and feasible solutions for big data.

        3.2 Big data acquisition

        As the second phase of the big data system, big data acquisition includes data collection, data transmission, and data pre-processing. During big data acquisition, once we collect the raw data, we shall utilize an efficient transmission mechanism to send it to a proper storage management system to support different analytical applications. The collected datasets may sometimes include much redundant or useless data, which unnecessarily increases storage space and affects the subsequent data analysis. For example, high redundancy is very common among datasets collected by sensors for environment monitoring. Data compression technology can be applied to reduce the redundancy. Therefore, data pre-processing operations are indispensable to ensure efficient data storage and exploitation.

        3.2.1 Data collection

        Data collection utilizes special data collection techniques to acquire raw data from a specific data generation environment. Four common data collection methods are shown as follows:

        – Log files: As one widely used data collection method, log files are record files automatically generated by the data source system, so as to record activities in designated file formats for subsequent analysis. Log files are used in nearly all digital devices. For example, web servers record in log files the number of clicks, click rates, visits, and other property records of web users [35]. To capture the activities of users at web sites, web servers mainly use the following three log file formats: common log file format (NCSA), extended log format (W3C), and IIS log format (Microsoft). All three types of log files are in the ASCII text format. Databases, rather than text files, may sometimes be used to store log information to improve the query efficiency of the massive log store [36, 37]. There are also other forms of log-based data collection, including stock indicators in financial applications and the determination of operating states in network monitoring and traffic management.

        – Sensing: Sensors are common in daily life to measure physical quantities and transform them into readable digital signals for subsequent processing (and storage). Sensory data may be classified as sound wave, voice, vibration, automobile, chemical, current, weather, pressure, temperature, etc. Sensed information is transferred to a data collection point through wired or wireless networks. For applications that may be easily deployed and managed, e.g., a video surveillance system [38], the wired sensor network is a convenient solution to acquire related information. Sometimes the accurate position of a specific phenomenon is unknown, and sometimes the monitored environment does not have the energy or communication infrastructures. Wireless communication must then be used to enable data transmission among sensor nodes under limited energy and communication capability. In recent years, wireless sensor networks (WSNs) have received considerable interest and have been applied to many applications, such as environmental research [39, 40], water quality monitoring [41], civil engineering [42, 43], and wildlife habitat monitoring [44]. A WSN generally consists of a large number of geographically distributed sensor nodes, each being a micro device powered by battery. Such sensors are deployed at designated positions as required by the application to collect remote sensing data. Once the sensors are deployed, the base station sends control information for network configuration/management or data collection to the sensor nodes. Based on such control information, the sensory data is assembled in different sensor nodes and sent back to the base station for further processing. Interested readers are referred to [45] for more detailed discussions.

        – Methods for acquiring network data: At present, network data acquisition is accomplished using a combination of web crawler, word segmentation system, task system, index system, etc. A web crawler is a program used by search engines for downloading and storing web pages [46]. Generally speaking, a web crawler starts from the uniform resource locator (URL) of an initial web page to access other linked web pages, during which it stores and sequences all the retrieved URLs. The crawler takes a URL from a URL queue in order of precedence, downloads the web page, identifies all URLs in the downloaded page, and extracts the new URLs to be put in the queue. This process is repeated until the web crawler is stopped. Data acquisition through a web crawler is widely applied in applications based on web pages, such as search engines or web caching. Traditional web page extraction technologies feature multiple efficient solutions, and considerable research has been done in this field. As more advanced web page applications are emerging, some extraction strategies are proposed in [47] to cope with rich Internet applications.

        The current network data acquisition technologies mainly include traditional Libpcap-based packet capture technology, zero-copy packet capture technology, as well as some specialized network monitoring software such as Wireshark, SmartSniff, and WinNetCap.

        – Libpcap-based packet capture technology: Libpcap (packet capture library) is a widely used network data packet capture function library. It is a general tool that does not depend on any specific system and is mainly used to capture data in the data link layer. It is simple, easy to use, and portable, but has relatively low efficiency. Therefore, under a high-speed network environment, considerable packet losses may occur when Libpcap is used.


        – Zero-copy packet capture technology: The so-called zero-copy (ZC) means that no copies between any internal memories occur during packet receiving and sending at a node. In sending, the data packets start directly from the user buffer of applications, pass through the network interfaces, and arrive at an external network. In receiving, the network interfaces send data packets directly to the user buffer. The basic idea of zero-copy is to reduce data copy times, reduce system calls, and reduce CPU load while datagrams are passed from network equipment to user program space. The zero-copy technology first utilizes direct memory access (DMA) technology to transmit network datagrams directly to an address space pre-allocated by the system kernel, so as to avoid the participation of the CPU. In the meantime, it maps the internal memory of the datagrams in the system kernel to that of the detection program, or builds a cache region in the user space and maps it to the kernel space. The detection program then accesses the internal memory directly, so as to reduce internal memory copies from the system kernel to user space and reduce the number of system calls.

        – Mobile equipment: At present, mobile devices are more widely used. As mobile device functions become increasingly stronger, they feature more complex and multiple means of data acquisition as well as a greater variety of data. Mobile devices may acquire geographical location information through positioning systems; acquire audio information through microphones; acquire pictures, videos, streetscapes, two-dimensional barcodes, and other multimedia information through cameras; and acquire user gestures and other body language information through touch screens and gravity sensors. Over the years, wireless operators have improved the service level of the mobile Internet by acquiring and analyzing such information. For example, the iPhone itself is a “mobile spy”. It may collect wireless data and geographical location information and then send such information back to Apple Inc. for processing, of which the user is not aware. Apart from Apple, smart phone operating systems such as Android of Google and Windows Phone of Microsoft can also collect information in a similar manner.

        In addition to the aforementioned three data acquisition methods of main data sources, there are many other data collection methods or systems. For example, in scientific experiments, many special tools can be used to collect experimental data, such as magnetic spectrometers and radio telescopes. We may classify data collection methods from different perspectives. From the perspective of data sources, data collection methods can be classified into two categories: collection methods recording through data sources and collection methods recording through other auxiliary tools.
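The URL-queue loop described above for web crawlers can be illustrated with a minimal sketch. It is not taken from the surveyed systems: the function name `crawl`, the callback `fetch_links`, and the in-memory `TOY_WEB` link graph are hypothetical stand-ins for the download-and-parse step, so that the example runs without network access.

```python
from collections import deque

def crawl(seed, fetch_links, max_pages=100):
    """Breadth-first crawl starting from `seed`.

    `fetch_links(url)` is a hypothetical callback that "downloads" a page
    and returns the URLs found in it.
    """
    queue = deque([seed])   # URL queue, served in order of arrival
    seen = {seed}           # every URL ever queued, to avoid repeats
    visited = []            # pages downloaded, in crawl order
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:   # only new URLs enter the queue
                seen.add(link)
                queue.append(link)
    return visited

# Toy link graph used instead of real HTTP requests (an assumption).
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}

order = crawl("a", lambda url: TOY_WEB.get(url, []))
print(order)
```

A production crawler would additionally need politeness delays, robots.txt handling, and URL normalization, none of which are shown here.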

        3.2.2 Data transportation

        Upon the completion of raw data collection, data will be transferred to a data storage infrastructure for processing and analysis. As discussed in Section 2.3, big data is mainly stored in a data center. The data layout should be adjusted to improve computing efficiency or facilitate hardware maintenance. In other words, internal data transmission may occur in the data center. Therefore, data transmission consists of two phases: Inter-DCN transmissions and Intra-DCN transmissions.

        – Inter-DCN transmissions: Inter-DCN transmissions are from the data source to the data center, which is generally achieved with the existing physical network infrastructure. Because of the rapid growth of traffic demands, the physical network infrastructure in most regions around the world is constituted by high-volume, high-rate, and cost-effective optical fiber transmission systems. Over the past 20 years, advanced management equipment and technologies have been developed, such as IP-based wavelength division multiplexing (WDM) network architecture, to conduct smart control and management of optical fiber networks [48, 49]. WDM is a technology that multiplexes multiple optical carrier signals with different wavelengths and couples them to the same optical fiber of the optical link. In such technology, lasers with different wavelengths carry different signals. So far, the backbone network has been deployed with WDM optical transmission systems with a single channel rate of 40 Gb/s. At present, 100 Gb/s commercial interfaces are available, and 100 Gb/s systems (or TB/s systems) will be available in the near future [50]. However, traditional optical transmission technologies are limited by the bandwidth of the electronic bottleneck [51]. Recently, orthogonal frequency-division multiplexing (OFDM), initially designed for wireless systems, is regarded as one of the main candidate technologies for future high-speed optical transmission. OFDM is a multi-carrier parallel transmission technology. It segments a high-speed data flow into low-speed sub-data-flows to be transmitted over multiple orthogonal sub-carriers [52]. Compared with the fixed channel spacing of WDM, OFDM allows sub-channel frequency spectrums to overlap with each other [53]. Therefore, it is a flexible and efficient optical networking technology.

        – Intra-DCN transmissions: Intra-DCN transmissions are the data communication flows within data centers. Intra-DCN transmissions depend on the communication mechanism within the data center (i.e., on physical connection plates, chips, internal memories of data servers, network architectures of data centers, and communication protocols). A data center consists of multiple integrated server racks interconnected with its internal connection networks. Nowadays, the internal connection networks of most data centers are fat-tree, two-layer or three-layer structures based on multi-commodity network flows [51, 54]. In the two-layer topological structure, the racks are connected by 1 Gbps top rack switches (TOR), and such top rack switches are then connected with 10 Gbps aggregation switches. The three-layer topological structure is augmented with one layer on top of the two-layer structure, and this layer is constituted by 10 Gbps or 100 Gbps core switches that connect the aggregation switches. There are also other topological structures which aim to improve the data center networks [55–58]. Because of the inadequacy of electronic packet switches, it is difficult to increase communication bandwidths while keeping energy consumption low. Over the years, due to the huge success achieved by optical technologies, the optical interconnection among the networks in data centers has drawn great interest. Optical interconnection is a high-throughput, low-delay, and low-energy-consumption solution. At present, optical technologies are only used for point-to-point links in data centers. Such optical links provide connection for the switches using low-cost multi-mode fiber (MMF) with a 10 Gbps data rate. Optical interconnection (switching in the optical domain) of networks in data centers is a feasible solution, which can provide Tbps-level transmission bandwidth with low energy consumption. Recently, many optical interconnection plans have been proposed for data center networks [59]. Some plans add optical paths to upgrade the existing networks, and other plans completely replace the current switches [59–64]. As a strengthening technology, Zhou et al. in [65] adopt wireless links in the 60 GHz frequency band to strengthen wired links. Network virtualization should also be considered to improve the efficiency and utilization of data center networks.

        3.2.3 Data pre-processing

        Because of the wide variety of data sources, the collected datasets vary with respect to noise, redundancy, consistency, etc., and it is undoubtedly a waste to store meaningless data. In addition, some analytical methods have stringent requirements on data quality. Therefore, in order to enable effective data analysis, we shall pre-process data under many circumstances to integrate the data from different sources, which not only reduces storage expense but also improves analysis accuracy. Some relational data pre-processing techniques are discussed as follows.

        – Integration: Data integration is the cornerstone of modern commercial informatics, which involves the combination of data from different sources and provides users with a uniform view of data [66]. This is a mature research field for traditional databases. Historically, two methods have been widely recognized: data warehouse and data federation. Data warehousing includes a process named ETL (Extract, Transform and Load). Extraction involves connecting source systems and selecting, collecting, analyzing, and processing necessary data. Transformation is the execution of a series of rules to transform the extracted data into standard formats. Loading means importing extracted and transformed data into the target storage infrastructure. Loading is the most complex procedure among the three, which includes operations such as transformation, copy, clearing, standardization, screening, and data organization. In data federation, a virtual database can be built to query and aggregate data from different data sources, but such a database does not contain data. Instead, it includes information or metadata related to the actual data and its positions. Such two “storage-reading” approaches do not satisfy the high performance requirements of data flows or search programs and applications. Compared with queries, data in such two approaches is more dynamic and must be processed during data transmission. Generally, data integration methods are accompanied with flow processing engines and search engines [30, 67].

        – Cleaning: Data cleaning is a process to identify inaccurate, incomplete, or unreasonable data, and then modify or delete such data to improve data quality. Generally, data cleaning includes five complementary procedures [68]: defining and determining error types, searching and identifying errors, correcting errors, documenting error examples and error types, and modifying data entry procedures to reduce future errors. During cleaning, data formats, completeness, rationality, and restriction shall be inspected. Data cleaning is of vital importance to keep data consistent, and is widely applied in many fields, such as banking, insurance, retail industry, telecommunications, and traffic control.

        In e-commerce, most data is electronically collected, which may have serious data quality problems. Classic data quality problems mainly come from software defects, customized errors, or system mis-configuration. Authors in [69] discussed data cleaning in e-commerce by crawlers and regularly re-copying customer and account information.

        In [70], the problem of cleaning RFID data was examined. RFID is widely used in many applications, e.g., inventory management and target tracking. However, the original RFID data features low quality, including a lot of abnormal data limited by the physical design and affected by environmental noises. In [71], a probability model was developed to cope with data loss in mobile environments. Khoussainova et al. in [72] proposed a system to automatically correct errors of input data by defining global integrity constraints.

        Herbert et al. [73] proposed a framework called BIO-AJAX to standardize biological data so as to conduct further computation and improve search quality. With BIO-AJAX, some errors and repetitions may be eliminated, and common data mining technologies can be executed more effectively.

        – Redundancy elimination: Data redundancy refers to data repetitions or surplus, which usually occurs in many datasets. Data redundancy can increase unnecessary data transmission expense and cause defects on storage systems, e.g., waste of storage space, data inconsistency, reduction of data reliability, and data damage. Therefore, various redundancy reduction methods have been proposed, such as redundancy detection, data filtering, and data compression. Such methods may apply to different datasets or application environments. However, redundancy reduction may also bring about certain negative effects. For example, data compression and decompression cause additional computational burden. Therefore, the benefits of redundancy reduction and its cost should be carefully balanced. Data collected from different fields will increasingly appear in image or video formats. It is well known that images and videos contain considerable redundancy, including temporal redundancy, spatial redundancy, statistical redundancy, and sensing redundancy. Video compression is widely used to reduce redundancy in video data, as specified in the many video coding standards (MPEG-2, MPEG-4, H.263, and H.264/AVC). In [74], the authors investigated the problem of video compression in a video surveillance system with a video sensor network. The authors propose a new MPEG-4 based method by investigating the contextual redundancy related to background and foreground in a scene. The low complexity and the low compression ratio of the proposed approach were demonstrated by the evaluation results.

        In generalized data transmission or storage, repeated data deletion (deduplication) is a special data compression technology which aims to eliminate repeated data copies [75]. With repeated data deletion, individual data blocks or data segments are assigned identifiers (e.g., using a hash algorithm) and stored, with the identifiers added to an identification list. As the analysis of repeated data deletion continues, if a new data block has an identifier that is identical to one in the identification list, the new data block is deemed redundant and is replaced by a reference to the corresponding stored data block. Repeated data deletion can greatly reduce storage requirements, which is particularly important to a big data storage system. Apart from the aforementioned data pre-processing methods, specific data objects shall go through some other operations such as feature extraction. Such operation plays an important role in multimedia search and DNA analysis [76–78]. Usually, high-dimensional feature vectors (or high-dimensional feature points) are used to describe such data objects, and the system stores the dimensional feature vectors for future retrieval. Data transfer is usually used to process distributed heterogeneous data sources, especially business datasets [79]. As a matter of fact, in consideration of the variety of datasets, it is non-trivial or impossible to build a uniform data pre-processing procedure and technology that is applicable to all types of datasets. The specific features, problem, performance requirements, and other factors of the datasets should be considered so as to select a proper data pre-processing strategy.
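The identifier-list mechanism of repeated data deletion described above can be sketched as follows. This is only an illustration under stated assumptions: fixed-size blocks and SHA-256 identifiers are choices made here for brevity (the text only says "a hash algorithm"), and the names `deduplicate` and `restore` are hypothetical.

```python
import hashlib

def deduplicate(data: bytes, block_size: int = 4):
    """Split `data` into fixed-size blocks and store each distinct block once.

    Returns (store, recipe): `store` maps identifier -> block and plays the
    role of the identification list; `recipe` is the ordered list of
    identifiers needed to rebuild the original data.
    """
    store = {}
    recipe = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        ident = hashlib.sha256(block).hexdigest()  # assumed hash choice
        if ident not in store:      # first time this block is seen
            store[ident] = block
        recipe.append(ident)        # redundant blocks become references
    return store, recipe

def restore(store, recipe) -> bytes:
    """Rebuild the original byte string from stored blocks."""
    return b"".join(store[ident] for ident in recipe)

data = b"ABCDABCDABCDWXYZ"          # three identical blocks plus one more
store, recipe = deduplicate(data)
print(len(recipe), "blocks referenced,", len(store), "blocks stored")
```

The trade-off noted in the text is visible here: storage shrinks from four blocks to two, at the price of hashing every block on the way in.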

        4 Big data storage

        The explosive growth of data places stricter requirements on storage and management. In this section, we focus on the storage of big data. Big data storage refers to the storage and management of large-scale datasets while achieving reliability and availability of data accessing. We will review important issues including massive storage systems, distributed storage systems, and big data storage mechanisms. On one hand, the storage infrastructure needs to provide information storage service with reliable storage space; on the other hand, it must provide a powerful access interface for query and analysis of a large amount of data.

        Traditionally, as auxiliary equipment of servers, data storage devices are used to store, manage, look up, and analyze data with structured RDBMSs. With the sharp growth of data, data storage devices are becoming increasingly important, and many Internet companies pursue large storage capacity to be competitive. Therefore, there is a compelling need for research on data storage.


        4.1 Storage system for massive data

        Various storage systems have emerged to meet the demands of massive data. Existing massive storage technologies can be classified as Direct Attached Storage (DAS) and network storage, while network storage can be further classified into Network Attached Storage (NAS) and Storage Area Network (SAN).

        In DAS, various hard disks are directly connected with servers, and data management is server-centric, such that storage devices are peripheral equipment, each of which takes a certain amount of I/O resource and is managed by individual application software. For this reason, DAS is only suitable for interconnecting servers at a small scale. Moreover, due to its low scalability, DAS exhibits undesirable efficiency when the storage capacity is increased, i.e., the upgradeability and expandability are greatly limited. Thus, DAS is mainly used in personal computers and small-sized servers.

        Network storage utilizes the network to provide users with a unified interface for data access and sharing. Network storage equipment includes special data exchange equipment, disk arrays, tape libraries, and other storage media, as well as special storage software. It is characterized by strong expandability.

        NAS is actually auxiliary storage equipment of a network. It is directly connected to a network through a hub or switch using TCP/IP protocols. In NAS, data is transmitted in the form of files. Compared to DAS, the I/O burden at a NAS server is reduced extensively, since the server accesses a storage device indirectly through a network.

        While NAS is network-oriented, SAN is especially designed for data storage with a scalable and bandwidth-intensive network, e.g., a high-speed network with optical fiber connections. In SAN, data storage management is relatively independent within a storage local area network, where multipath-based data switching among any internal nodes is utilized to achieve a maximum degree of data sharing and data management.

        From the organization of a data storage system, DAS, NAS, and SAN can all be divided into three parts: (i) the disk array, which is the foundation of a storage system and the fundamental guarantee for data storage; (ii) connection and network sub-systems, which provide connection among one or more disk arrays and servers; and (iii) storage management software, which handles data sharing, disaster recovery, and other storage management tasks of multiple servers.

        4.2 Distributed storage system

        The first challenge brought about by big data is how to develop a large-scale distributed storage system for efficient data processing and analysis. To use a distributed system to store massive data, the following factors should be taken into consideration:

        – Consistency: A distributed storage system requires multiple servers to cooperatively store data. As there are more servers, the probability of server failures will be larger. Usually, data is divided into multiple pieces to be stored at different servers to ensure availability in case of server failure. However, server failures and parallel storage may cause inconsistency among different copies of the same data. Consistency refers to assuring that multiple copies of the same data are identical.

        – Availability: A distributed storage system operates in multiple sets of servers. As more servers are used, server failures are inevitable. It is desirable that the entire system is not seriously affected and can still satisfy customers' requests in terms of reading and writing. This property is called availability.

        – Partition tolerance: Multiple servers in a distributed storage system are connected by a network. The network could have link/node failures or temporary congestion. The distributed system should have a certain level of tolerance to problems caused by network failures. It is desirable that the distributed storage still works well when the network is partitioned.

        Eric Brewer proposed the CAP theory [80, 81] in 2000, which indicated that a distributed system could not simultaneously meet the requirements on consistency, availability, and partition tolerance; at most two of the three requirements can be satisfied simultaneously. Seth Gilbert and Nancy Lynch from MIT proved the correctness of the CAP theory in 2002. Since consistency, availability, and partition tolerance could not be achieved simultaneously, we can have a CA system by ignoring partition tolerance, a CP system by ignoring availability, and an AP system that ignores consistency, according to different design goals. The three systems are discussed in the following.

        CA systems do not have partition tolerance, i.e., they could not handle network failures. Therefore, CA systems are generally deemed as storage systems with a single server, such as the traditional small-scale relational databases. Such systems feature a single copy of data, such that consistency is easily ensured. Availability is guaranteed by the excellent design of relational databases. However, since CA systems could not handle network failures, they could not be expanded to use many servers. Therefore, most large-scale storage systems are CP systems or AP systems.

        Compared with CA systems, CP systems ensure partition tolerance. Therefore, CP systems can be expanded to become distributed systems. CP systems generally maintain several copies of the same data in order to ensure a level of fault tolerance. CP systems also ensure data consistency, i.e., multiple copies of the same data are guaranteed to be completely identical. However, CP systems could not ensure sound availability because of the high cost of consistency assurance. Therefore, CP systems are useful for scenarios with moderate load but stringent requirements on data accuracy (e.g., trading data). BigTable and HBase are two popular CP systems.

        AP systems also ensure partition tolerance. However, AP systems differ from CP systems in that they also ensure availability, while only ensuring eventual consistency rather than the strong consistency of the previous two systems. Therefore, AP systems only apply to scenarios with frequent requests but not very high requirements on accuracy. For example, in online Social Networking Services (SNS) systems, there are many concurrent visits to the data, but a certain amount of data errors are tolerable. Furthermore, because AP systems ensure eventual consistency, accurate data can still be obtained after a certain amount of delay. Therefore, AP systems may also be used under circumstances with no stringent real-time requirements. Dynamo and Cassandra are two popular AP systems.
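The behaviour of eventual consistency in an AP system (a stale read during a partition, followed by convergence once replicas exchange state) can be illustrated with a toy model. Everything here is an assumption made for illustration: the `Replica` class, the `sync` function, and the last-write-wins rule based on integer timestamps are not how Dynamo or Cassandra are actually implemented.

```python
class Replica:
    """One copy of the data; always accepts reads and writes (availability)."""

    def __init__(self):
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        current = self.data.get(key)
        # Last-write-wins (assumed rule): keep only the newest version.
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]

def sync(a, b):
    """Exchange state in both directions so each side keeps the newer version."""
    for key, (ts, value) in list(a.data.items()):
        b.write(key, value, ts)
    for key, (ts, value) in list(b.data.items()):
        a.write(key, value, ts)

r1, r2 = Replica(), Replica()
r1.write("profile", "v1", timestamp=1)
sync(r1, r2)                              # both replicas now hold v1

# Simulated partition: the update reaches r1 only.
r1.write("profile", "v2", timestamp=2)
stale = r2.read("profile")                # r2 still answers, with old data

sync(r1, r2)                              # partition heals
fresh = r2.read("profile")
print(stale, fresh)
```

During the partition both replicas keep serving requests, which is the availability side of the trade-off; the price is the stale read on `r2` until the next synchronization.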

        4.3 Storage mechanism for big data

        Considerable research on big data promotes the development of storage mechanisms for big data. Existing storage mechanisms of big data may be classified into three bottom-up levels: (i) file systems, (ii) databases, and (iii) programming models.

        File systems are the foundation of the applications at upper levels. Google's GFS is an expandable distributed file system to support large-scale, distributed, data-intensive applications [25]. GFS uses cheap commodity servers to achieve fault-tolerance and provides customers with high-performance services. GFS supports large-scale file applications with more frequent reading than writing. However, GFS also has some limitations, such as a single point of failure and poor performance for small files. Such limitations have been overcome by Colossus [82], the successor of GFS.

        In addition, other companies and researchers also have their solutions to meet the different demands for storage of big data. For example, HDFS and Kosmosfs are open-source derivatives of GFS. Microsoft developed Cosmos [83] to support its search and advertisement business. Facebook utilizes Haystack [84] to store the large amount of small-sized photos. Taobao also developed TFS and FastDFS. In conclusion, distributed file systems have become relatively mature after years of development and business operation. Therefore, we will focus on the other two levels in the rest of this section.

        4.3.1 Database technology

        Database technology has been evolving for more than 30 years. Various database systems have been developed to handle datasets at different scales and support various applications. Traditional relational databases cannot meet the challenges on categories and scales brought about by big data. NoSQL databases (i.e., non-traditional relational databases) are becoming more popular for big data storage. NoSQL databases feature flexible modes, support for simple and easy copy, simple API, eventual consistency, and support of large volume data. NoSQL databases are becoming the core technology for big data. We will examine the following three main NoSQL databases in this section: key-value databases, column-oriented databases, and document-oriented databases, each based on certain data models.

        – Key-value databases: Key-value databases are constituted by a simple data model, and data is stored as key-value pairs. Every key is unique, and customers may look up values according to the keys. Such databases feature a simple structure, and modern key-value databases are characterized by higher expandability and shorter query response times than those of relational databases. Over the past few years, many key-value databases have appeared, motivated by Amazon's Dynamo system [85]. We will introduce Dynamo and several other representative key-value databases.

        – Dynamo: Dynamo is a highly available and expandable distributed key-value data storage system. It is used to store and manage the status of some core services in the Amazon e-Commerce Platform, which can be realized with key access. The public mode of relational databases may generate invalid data and limit data scale and availability, while Dynamo can resolve these problems with a simple key-object interface, which is constituted by simple reading and writing operations. Dynamo achieves elasticity and availability through the data partition, data copy, and object edition mechanisms. The Dynamo partition plan relies on Consistent Hashing [86] to divide the load among multiple main storage machines; its main advantage is that the arrival or departure of a node only affects directly adjacent nodes and does not affect other nodes. Dynamo copies data to N sets of servers, in which N is a configurable parameter, in order to achieve high availability and durability. The Dynamo system also provides eventual consistency, so as to conduct asynchronous updates on all copies.

        – Voldemort: Voldemort is also a key-value storage system, which was initially developed for and is still used by LinkedIn. Keys and values in Voldemort are composite objects constituted by tables and images. The Voldemort interface includes three simple operations: reading, writing, and deletion, all of which are identified by keys. Voldemort provides asynchronous updating and concurrency control of multiple editions but does not ensure data consistency. However, Voldemort supports optimistic locking for consistent multi-record updating. When a conflict happens between the updating and any other operation, the updating operation will quit. The data copy mechanism of Voldemort is the same as that of Dynamo. Voldemort not only stores data in RAM but also allows data to be inserted into a storage engine. In particular, Voldemort supports two storage engines: Berkeley DB and Random Access Files.

        The key-value database emerged a few years ago. Deeply influenced by Amazon Dynamo DB, other key-value storage systems include Redis, Tokyo Cabinet and Tokyo Tyrant, Memcached and Memcache DB, Riak, and Scalaris, all of which provide expandability by distributing keys among nodes. Voldemort, Riak, Tokyo Cabinet, and Memcached can utilize attached storage devices to store data in RAM or disks. Other storage systems store data in RAM and provide disk backup, or rely on copy and recovery to avoid backup.
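The property of consistent hashing mentioned above for Dynamo, namely that a node joining or leaving only affects its neighbours on the ring, can be checked with a small sketch. This is a generic textbook version, not Dynamo's implementation: the `HashRing` class is hypothetical, SHA-256 is an arbitrary hash choice, and refinements such as virtual nodes and replication to N servers are omitted.

```python
import bisect
import hashlib

def _position(name: str) -> int:
    """Map a node name or key to a point on the ring (assumed hash: SHA-256)."""
    return int(hashlib.sha256(name.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes=()):
        self._ring = []                    # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        bisect.insort(self._ring, (_position(node), node))

    def remove(self, node):
        self._ring.remove((_position(node), node))

    def lookup(self, key):
        """Return the first node at or after the key's position, wrapping around."""
        if not self._ring:
            raise LookupError("ring is empty")
        positions = [pos for pos, _ in self._ring]
        index = bisect.bisect_left(positions, _position(key)) % len(self._ring)
        return self._ring[index][1]

ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"key-{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}

ring.remove("node-b")                      # one node leaves
after = {k: ring.lookup(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(len(moved), "of", len(keys), "keys changed owner")
```

Only the keys that were owned by the departed node are reassigned; with a naive `hash(key) % number_of_nodes` scheme, most keys would move instead.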

        – Column-oriented databases: Column-oriented databases store and process data according to columns rather than rows. Both columns and rows are segmented across multiple nodes to realize expandability. The column-oriented databases are mainly inspired by Google's BigTable. In this section, we first discuss BigTable and then introduce several derivative tools.

        – BigTable: BigTable is a distributed, structured data storage system, which is designed to process large-scale (PB class) data among thousands of commercial servers [87]. The basic data structure of BigTable is a multi-dimensional sorted map with sparse, distributed, and persistent storage. The indexes of the map are row key, column key, and timestamp, and every value in the map is an uninterpreted byte array. Each row key in BigTable is a character string of up to 64 KB. Rows are stored in lexicographical order and continually segmented into Tablets (i.e., units of distribution) for load balance. Thus, reading a short range of rows can be highly efficient, since it only involves communication with a small portion of machines. The columns are grouped according to the prefixes of keys, thus forming column families. These column families are the basic units for access control. The timestamps are 64-bit integers that distinguish different editions of cell values. Clients may flexibly determine the number of cell editions stored. These editions are sequenced in descending order of timestamps, so the latest edition will always be read first.

The BigTable API features the creation and deletion of tables and column families, as well as modification of the metadata of clusters, tables, and column families. Client applications may insert or delete values in BigTable, query values from columns, or browse sub-datasets in a table. BigTable also supports some other features, such as transaction processing within a single row. Users may utilize such features to conduct more complex data processing.

The BigTable implementation includes three main components: the Master server, the Tablet servers, and the client library. BigTable allows only one active Master server, which is responsible for assigning Tablets to Tablet servers, detecting added or removed Tablet servers, and balancing the load. In addition, the Master handles BigTable schema changes, e.g., creating tables and column families, and garbage-collects files in GFS that have been deleted or are no longer used by a specific BigTable instance. Every Tablet server manages a set of Tablets and is responsible for the reading and writing of the Tablets it has loaded. When Tablets grow too big, they are split by the server. The application client library is used to communicate with BigTable instances.

BigTable is built on many fundamental components of Google, including GFS [25], a cluster management system, the SSTable file format, and Chubby [88]. GFS is used to store data and log files. The cluster management system is responsible for task scheduling, resource sharing, handling of machine failures, and monitoring of machine status. The SSTable file format is used to store BigTable data internally, and it provides a persistent, ordered, immutable map from keys to values, where both keys and values are arbitrary byte strings. BigTable utilizes Chubby for the following tasks: 1) ensuring that there is at most one active Master at any time; 2) storing the bootstrap location of BigTable data; 3) discovering Tablet servers; 4) conducting error recovery in case of Tablet server failures; 5) storing BigTable schema information; and 6) storing the access control lists.

– Cassandra: Cassandra is a distributed storage system for managing huge amounts of structured data spread across many commodity servers [89]. The system was developed by Facebook and became an open source tool in 2008. It adopts the ideas and concepts of both Amazon Dynamo and Google BigTable, in particular integrating the distributed system technology of Dynamo with the BigTable data model. Tables in Cassandra take the form of a distributed four-dimensional structured map, where the four dimensions are row, column, column family, and super column. A row is identified by a string key of arbitrary length. No matter how many columns are read or written, an operation on a row is atomic. Columns may be grouped into sets called column families, similar to the data model of BigTable. Cassandra provides two kinds of column families: simple column families and super column families. A super column contains an arbitrary number of columns grouped under the same name. A column family contains columns and super columns, which may be continuously inserted into the column family at runtime. The partitioning and replication mechanisms of Cassandra are very similar to those of Dynamo, so as to achieve consistency.

– Derivative tools of BigTable: Since the BigTable code is not available under an open source license, several open source projects have competed to implement the BigTable concept in similar systems, such as HBase and Hypertable.

HBase is a BigTable clone written in Java and is part of Apache's Hadoop MapReduce framework [90]. HBase replaces GFS with HDFS. It writes updates into RAM and periodically flushes them to files on disk. Row operations are atomic, with row-level locking and transaction processing; support for wider-scope transactions is optional. Partitioning and distribution are handled transparently, with no need for client-side hashing or a fixed key space.

HyperTable was developed, along the lines of BigTable, as a high-performance, scalable, distributed storage and processing system for structured and unstructured data [91]. HyperTable relies on a distributed file system, e.g., HDFS, and a distributed lock manager. Its data representation, processing, and partitioning mechanisms are similar to those of BigTable. HyperTable has its own query language, called HyperTable Query Language (HQL), which allows users to create, modify, and query the underlying tables.

Since the column-oriented databases mainly emulate BigTable, their designs are all similar except for the concurrency mechanism and several other features. For example, Cassandra emphasizes weak consistency through multi-version concurrency control, while HBase and HyperTable focus on strong consistency through locks or log records.
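
The BigTable data model discussed above, a sorted map indexed by row key, column key, and timestamp, can be sketched in a few lines. This is a single-process toy under stated assumptions: the class `SparseTable`, its methods, and the `max_versions` parameter are hypothetical and do not reproduce the BigTable, HBase, or Cassandra APIs.

```python
import time


class SparseTable:
    """Toy (row, column, timestamp) -> bytes map, for illustration only."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions   # cell versions kept per (row, column)
        self._cells = {}                   # (row, column) -> [(timestamp, value)], newest first

    def put(self, row, column, value, timestamp=None):
        # Assumption: microsecond wall-clock timestamps when none is supplied.
        ts = time.time_ns() // 1000 if timestamp is None else timestamp
        versions = self._cells.setdefault((row, column), [])
        versions.append((ts, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)   # descending timestamps
        del versions[self.max_versions:]                     # drop the oldest versions

    def get(self, row, column, timestamp=None):
        """Return the newest value, or the newest one not later than `timestamp`."""
        for ts, value in self._cells.get((row, column), []):
            if timestamp is None or ts <= timestamp:
                return value
        return None

    def scan(self, start_row, end_row):
        """Yield latest cell values for start_row <= row < end_row in lexicographic order."""
        for row, column in sorted(self._cells):
            if start_row <= row < end_row:
                yield row, column, self._cells[(row, column)][0][1]
```

Column keys here follow the family-prefix convention (for example `"contents:"`), so grouping cells by the prefix before the colon recovers the column families.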

– Document Database: Compared with key-value storage, document storage can support more complex data forms. Since documents do not follow a strict schema, there is no need to conduct schema migration. In addition, key-value pairs can still be saved. We will examine three important representatives of document storage systems, i.e., MongoDB, SimpleDB, and CouchDB.

– MongoDB: MongoDB is an open-source, document-oriented database [92]. MongoDB stores documents as Binary JSON (BSON) objects [93], which are similar to JSON objects. Every document has an ID field as the primary key. Queries in MongoDB are expressed with a syntax similar to JSON. A database driver sends the query as a BSON object to MongoDB. The system allows queries on all documents, including embedded objects and arrays. To enable rapid queries, indexes can be created on the queryable fields of documents. Replication in MongoDB is implemented with a log file on the master node that records all the high-level operations conducted in the database. During replication, the slaves ask the master for all the write operations since their last synchronization and execute the operations from the log on their local databases. MongoDB supports horizontal scaling with automatic sharding, distributing data among thousands of nodes with automatic load balancing and failover.


– SimpleDB: SimpleDB is a distributed database offered as a web service by Amazon [94]. Data in SimpleDB is organized into domains, in which data may be stored, acquired, and queried. Domains contain items with different properties, each described by a set of name/value pairs. Data is replicated to different machines at different data centers in order to ensure data safety and improve performance. The system does not support automatic partitioning and thus cannot scale with changes in data volume. SimpleDB allows users to query with SQL. It is worth noting that SimpleDB can assure eventual consistency but does not support Multi-Version Concurrency Control (MVCC). Therefore, conflicts cannot be detected from the client side.

– CouchDB: Apache CouchDB is a document-oriented database written in Erlang [95]. Data in CouchDB is organized into documents consisting of fields named by keys/names and values, which are stored and accessed as JSON objects. Every document is provided with a unique identifier. CouchDB allows access to database documents through a RESTful HTTP API. If a document needs to be modified, the client must download the entire document, modify it, and then send it back to the database. After a document is rewritten, its revision identifier is updated. CouchDB uses optimistic replication to obtain scalability without a sharding mechanism. Since multiple CouchDB instances may run alongside one another and accept transactions simultaneously, any kind of replication topology can be built. The consistency of CouchDB relies on the replication mechanism. CouchDB supports MVCC with historical hash records.
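
The update cycle described for CouchDB (fetch the whole document, modify it, send it back, and let the store detect conflicts through revision records) can be sketched as follows. The in-memory `DocStore` class, the `Conflict` exception, and the `generation-hash` revision format are assumptions made for this example, loosely modeled on the description above; this is not the CouchDB HTTP API.

```python
import hashlib
import json


class Conflict(Exception):
    """Raised when a write is based on a stale revision."""


class DocStore:
    """Hypothetical in-memory document store with revision checks."""

    def __init__(self):
        self._docs = {}   # doc_id -> (revision, body)

    @staticmethod
    def _next_rev(prev_rev, body):
        generation = 1 if prev_rev is None else int(prev_rev.split("-")[0]) + 1
        digest = hashlib.md5(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()
        return f"{generation}-{digest}"

    def get(self, doc_id):
        rev, body = self._docs[doc_id]
        return rev, dict(body)          # hand out a copy of the whole document

    def put(self, doc_id, body, rev=None):
        current = self._docs.get(doc_id)
        current_rev = current[0] if current else None
        if rev != current_rev:          # writer did not see the latest revision
            raise Conflict(f"{doc_id}: expected {current_rev}, got {rev}")
        new_rev = self._next_rev(current_rev, body)
        self._docs[doc_id] = (new_rev, dict(body))
        return new_rev
```

A writer holding an outdated revision receives `Conflict` instead of silently overwriting a newer version, which is the client-visible conflict detection that the text notes SimpleDB lacks.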

Big data are generally stored on hundreds or even thousands of commodity servers. Thus, traditional parallel models, such as Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), may not be adequate to support such large-scale parallel programs. Recently, several proposed parallel programming models have effectively improved the performance of NoSQL systems and reduced the performance gap to relational databases. Therefore, these models have become the cornerstone for the analysis of massive data.

– MapReduce: MapReduce [22] is a simple but powerful programming model for large-scale computing that uses large clusters of commodity PCs to achieve automatic parallel processing and distribution. The MapReduce computing model has only two functions, i.e., Map and Reduce, both of which are programmed by users. The Map function processes input key-value pairs and generates intermediate key-value pairs. MapReduce then combines all the intermediate values related to the same key and transmits them to the Reduce function, which further compresses the value set into a smaller set. MapReduce has the advantage that it hides the complicated steps of developing parallel applications, e.g., data scheduling, fault tolerance, and inter-node communication. The user only needs to program the two functions to develop a parallel application. The initial MapReduce framework did not support multiple datasets in a task, a limitation that has been mitigated by some recent enhancements [96, 97].

Over the past decades, programmers have become familiar with the high-level declarative language SQL, often used in relational databases for task description and dataset analysis. However, the succinct MapReduce framework only provides two opaque functions, which cannot cover all the common operations. Therefore, programmers have to spend time programming basic functions, which are typically hard to maintain and reuse. In order to improve programming efficiency, some high-level language systems have been proposed, e.g., Sawzall [98] of Google, Pig Latin [99] of Yahoo, Hive [100] of Facebook, and Scope [87] of Microsoft.
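
To make the two-function model concrete, the sketch below runs a word count through a tiny single-process driver. The driver `run_mapreduce` and the functions `map_fn` and `reduce_fn` are hypothetical names for this example; a real framework would execute the same two user functions across many machines and handle scheduling and fault tolerance itself.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map: emit an intermediate (word, 1) pair for every word in the input value."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: compress all values of one key into a single total."""
    yield word, sum(counts)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)   # combine values of the same key
    output = []
    for out_key in sorted(groups):
        output.extend(reduce_fn(out_key, groups[out_key]))
    return output
```

Everything outside `map_fn` and `reduce_fn` belongs to the framework, which is the point of the model: the user never writes the grouping or distribution logic.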

– Dryad: Dryad [101] is a general-purpose distributed execution engine for coarse-grained data-parallel applications. The operational structure of Dryad is a directed acyclic graph, in which vertexes represent programs and edges represent data channels. Dryad executes operations on the vertexes in clusters and transmits data via data channels, including files, TCP connections, and shared-memory FIFOs. During operation, resources in the logical operation graph are automatically mapped to physical resources.

The operational structure of Dryad is coordinated by a central program called the job manager, which can be executed in a cluster or on a workstation connected through the network. A job manager consists of two parts: 1) application code, which is used to build the job communication graph, and 2) library code, which is used to schedule available resources. All data is transmitted directly between vertexes. Therefore, the job manager is only responsible for decision-making and does not obstruct any data transmission.

In Dryad, application developers can flexibly choose any directed acyclic graph to describe the communication patterns of the application and to express data transmission mechanisms. In addition, Dryad allows vertexes to use any number of inputs and outputs, while MapReduce supports only one input set and one output set.


DryadLINQ [102] is the high-level language of Dryad and is used to integrate the aforementioned SQL-like language execution environment.
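
The graph-structured execution described for Dryad can be illustrated with a small local executor in which vertexes are plain functions and edges carry the output of one vertex to the input list of another. The function `run_dag` and its argument layout are hypothetical; the sketch uses Python's standard `graphlib` module (Python 3.9 or later) for ordering and does not model clusters, channels, or the job manager.

```python
from graphlib import TopologicalSorter


def run_dag(vertices, edges, inputs):
    """Run a directed acyclic graph of programs.

    vertices: name -> callable taking a list of input datasets
    edges:    list of (source, target) pairs, i.e., data channels
    inputs:   name -> external dataset for source vertexes
    """
    preds = {name: [] for name in vertices}
    for src, dst in edges:
        preds[dst].append(src)
    results = {}
    # static_order() yields every vertex after all of its predecessors.
    for name in TopologicalSorter(preds).static_order():
        incoming = [results[p] for p in preds[name]]
        if name in inputs:
            incoming.append(inputs[name])
        results[name] = vertices[name](incoming)
    return results


# Example graph: two readers feed one merge vertex, which feeds two consumers.
example_vertices = {
    "read_a": lambda ins: ins[0],
    "read_b": lambda ins: ins[0],
    "merge": lambda ins: sorted(ins[0] + ins[1]),
    "count": lambda ins: len(ins[0]),
    "largest": lambda ins: ins[0][-1],
}
example_edges = [("read_a", "merge"), ("read_b", "merge"),
                 ("merge", "count"), ("merge", "largest")]
```

The `merge` vertex takes two inputs and feeds two consumers, a shape that the text above notes a single MapReduce job cannot express directly.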

– All-Pairs: All-Pairs [103] is a system specially designed for biometrics, bioinformatics, and data mining applications. It focuses on comparing element pairs in two datasets by a given function. An All-Pairs problem can be expressed as a three-tuple (Set A, Set B, Function F), in which Function F is applied to compare every element of Set A with every element of Set B. The comparison result is an output matrix M, computed over what is also called the Cartesian product or cross join of Set A and Set B.

All-Pairs is implemented in four phases: system modeling, distribution of input data, batch job management, and result collection. In Phase I, an approximate model of system performance is built to evaluate how much CPU resource is needed and how to partition the job. In Phase II, a spanning tree is built for data transmission, so that the workload of every partition can retrieve its input data efficiently. In Phase III, after the data flow is delivered to the proper nodes, the All-Pairs engine builds a batch-processing submission for the jobs in the partitions, queues them in the batch processing system, and formulates a node running command to acquire data. In the last phase, after the batch processing system completes the jobs, the extraction engine collects the results and combines them into a proper structure, which is generally a single file list in which all results are put in order.
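
The All-Pairs abstraction itself is compact: given Set A, Set B, and Function F, fill the matrix M with M[i][j] = F(A[i], B[j]). The sketch below shows the computation and a naive row-block partitioning that mimics splitting the work into batch jobs and collecting the results in order. The names `all_pairs`, `all_pairs_partitioned`, `hamming`, and `rows_per_job` are hypothetical, and the performance modeling and data distribution phases are not represented.

```python
def all_pairs(set_a, set_b, compare):
    """Return the matrix M with M[i][j] = compare(set_a[i], set_b[j])."""
    return [[compare(a, b) for b in set_b] for a in set_a]


def all_pairs_partitioned(set_a, set_b, compare, rows_per_job=2):
    """Split Set A into row blocks (one per batch job) and collect results in order."""
    matrix = []
    for start in range(0, len(set_a), rows_per_job):
        block = set_a[start:start + rows_per_job]          # input of one batch job
        matrix.extend(all_pairs(block, set_b, compare))    # result collection
    return matrix


def hamming(a, b):
    """Example comparison function: number of differing positions."""
    return sum(x != y for x, y in zip(a, b))
```

Because every cell of M is independent, any partitioning of rows or columns yields the same matrix; the engine's real difficulty lies in choosing partitions and moving the input data, not in the comparison itself.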

– Pregel: The Pregel [104] system of Google facilitates the processing of large-sized graphs, e.g., analysis of network graphs and social networking services. A computational task is expressed by a directed graph consisting of vertexes and directed edges. Every vertex is associated with a modifiable, user-defined value, and every directed edge is associated with a source vertex and consists of a user-defined value and the identifier of a target vertex. When the graph is built, the program conducts iterative calculations, called supersteps, between which global synchronization points are set, until the algorithm completes and the output is produced. In every superstep, vertex computations are parallel, and every vertex executes the same user-defined function to express a given algorithm logic. Every vertex may modify its own state and the state of its outgoing edges, receive messages sent in the previous superstep, send messages to other vertexes, and even modify the topological structure of the entire graph. Edges have no computations of their own. Every vertex may deactivate itself by voting to halt. When all vertexes are inactive and no messages are in transit, the entire program execution is completed. The output of a Pregel program is a set consisting of the values output by all the vertexes. Generally speaking, the input and output of a Pregel program are isomorphic directed graphs.
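
The superstep model can be sketched with the classic example of propagating the maximum value through a graph. In this single-process approximation, every vertex computes on the messages from the previous superstep, sends its value along its outgoing edges when the value changes, and then votes to halt; an incoming message reactivates it. The function `pregel_max` and its parameters are hypothetical and only mimic the model; this is not the Pregel API.

```python
def pregel_max(values, edges, max_supersteps=50):
    """Propagate the maximum vertex value along directed edges.

    values: vertex -> initial value
    edges:  vertex -> list of target vertexes (targets must appear in `values`)
    """
    values = dict(values)
    inbox = {v: [] for v in values}
    active = set(values)                 # every vertex starts active
    superstep = 0
    while active and superstep < max_supersteps:
        outbox = {v: [] for v in values}
        for v in active:
            new_value = max([values[v]] + inbox[v])
            changed = new_value != values[v]
            values[v] = new_value
            if superstep == 0 or changed:
                for target in edges.get(v, []):
                    outbox[target].append(new_value)
            # After computing, the vertex votes to halt.
        # Global synchronization point: deliver messages, wake up receivers.
        inbox = outbox
        active = {v for v, messages in inbox.items() if messages}
        superstep += 1
    return values, superstep
```

The run ends when no vertex is active and no message is pending, matching the termination condition described above.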

Inspired by the above programming models, other studies have also focused on programming models for more complex computational tasks, e.g., iterative computations [105, 106], fault-tolerant memory computations [107], incremental computations [108], and flow control decision-making related to data [109].

5 Big data analysis

The analysis of big data mainly involves analytical methods for traditional data and big data, analytical architecture for big data, and software used for mining and analysis of big data. Data analysis is the final and the most important phase in the value chain of big data, with the purpose of extracting useful values and providing suggestions or decisions. Different levels of potential values can be generated through the analysis of datasets in different fields [10]. However, data analysis is a broad area, which frequently changes and is extremely complex. In this section, we introduce the methods, architectures, and tools for big data analysis.

5.1 Traditional data analysis

Traditional data analysis means using proper statistical methods to analyze massive data, to concentrate, extract, and refine useful data hidden in a batch of chaotic datasets, and to identify the inherent law of the subject matter, so as to maximize the value of data. Data analysis plays a huge guidance role in making development plans for a country, understanding customer demands for commerce, and predicting market trends for enterprises. Big data analysis can be deemed as the analysis technique for a special kind of data. Therefore, many traditional data analysis methods may still be utilized for big data analysis. Several representative traditional data analysis methods are examined in the following, many of which are from statistics and computer science.

– Cluster Analysis: a statistical method for grouping objects, and specifically, classifying objects according to some features. Cluster analysis is used to differentiate objects with particular features and divide them into categories (clusters) according to these features, such that objects in the same category have high homogeneity while different categories have high heterogeneity. Cluster analysis is an unsupervised learning method that requires no training data.

– Factor Analysis: basically targeted at describing the relation among many elements with only a few factors, i.e., grouping several closely related variables into a factor; the few factors are then used to reveal most of the information in the original data.


– Correlation Analysis: an analytical method for determining the law of relations, such as correlation, correlative dependence, and mutual restriction, among observed phenomena, and accordingly conducting forecast and control. Such relations may be classified into two types: (i) function, reflecting a strict dependence relationship among phenomena, which is also called a definitive dependence relationship; and (ii) correlation, referring to undetermined or inexact dependence relations, in which a numerical value of one variable may correspond to several numerical values of the other variable, and such numerical values fluctuate regularly around their mean values.

– Regression Analysis: a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies dependence relationships among variables that are hidden by randomness. Regression analysis may turn complex and undetermined correlations among variables into simple and regular ones.

– A/B Testing: also called bucket testing. It is a technique for determining how to improve target variables by comparing a tested group with a control group. Big data will require a large number of tests to be executed and analyzed.

– Statistical Analysis: Statistical analysis is based on statistical theory, a branch of applied mathematics. In statistical theory, randomness and uncertainty are modeled with probability theory. Statistical analysis can provide a description and an inference for big data. Descriptive statistical analysis can summarize and describe datasets, while inferential statistical analysis can draw conclusions from data subject to random variations. Statistical analysis is widely applied in the economic and medical care fields [110].

– Data Mining Algorithms: Data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. In 2006, the IEEE International Conference on Data Mining (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [111], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, and CART. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and link mining, all of which are among the most important problems in data mining research.
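
As a small worked instance of two of the traditional methods above, the sketch below computes the Pearson correlation coefficient and fits a least-squares line for paired observations. The function names `pearson` and `fit_line` are hypothetical, and the code uses only the textbook formulas; it is an illustration rather than a recommendation of a particular statistical package.

```python
import math


def _centered_sums(xs, ys):
    """Return (sum of xy deviations, sum of squared x deviations, same for y, means)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy, sxx, syy, mean_x, mean_y


def pearson(xs, ys):
    """Correlation analysis: strength of the linear relation, in [-1, 1]."""
    sxy, sxx, syy, _, _ = _centered_sums(xs, ys)
    return sxy / math.sqrt(sxx * syy)


def fit_line(xs, ys):
    """Regression analysis: least-squares slope and intercept of y = slope * x + intercept."""
    sxy, sxx, _, mean_x, mean_y = _centered_sums(xs, ys)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

Correlation quantifies how strongly the two variables move together, while regression turns that relation into a simple rule that can be used for forecasting.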

5.2 Big data analytic methods

At the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data so as to bring value to enterprises and individuals. At present, the main processing methods of big data are as follows.

– Bloom Filter: A Bloom filter consists of a bit array and a series of hash functions. Its principle is to store hash values of data, rather than the data itself, in the bit array, which is in essence a bitmap index that uses hash functions to conduct lossy compressed storage of data. It has such advantages as high space efficiency and high query speed, but also has disadvantages in misrecognition (false positives) and deletion.

– Hashing: a method that essentially transforms data into shorter fixed-length numerical values or index values. Hashing has such advantages as rapid reading and writing and high query speed, but it is hard to find a sound hash function.

– Index: an index is always an effective method to reduce the expense of disk reading and writing and to improve insertion, deletion, modification, and query speeds, both in traditional relational databases that manage structured data and in other technologies that manage semi-structured and unstructured data. However, an index has the disadvantage of the additional cost of storing index files, which must be maintained dynamically when data is updated.

– Trie: also called a trie tree, a variant of the hash tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of the trie is to utilize common prefixes of character strings to reduce comparisons on character strings to the greatest extent, so as to improve query efficiency.

– Parallel Computing: compared to traditional serial computing, parallel computing refers to simultaneously utilizing several computing resources to complete a computation task. Its basic idea is to decompose a problem into parts and assign them to several separate processes to be completed independently, so as to achieve co-processing. Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see a comparison in Table 1).
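
The Bloom filter from the list above is short enough to sketch in full. The class `BloomFilter` and its parameters `num_bits` and `num_hashes` are hypothetical names; deriving the hash functions from salted SHA-256 digests is a convenience for this example, not a tuned design.

```python
import hashlib


class BloomFilter:
    """Bit array plus several hash functions; supports add and membership test only."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one cryptographic hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

Both weaknesses named above are visible here: unrelated items can happen to hit bits that are already set (misrecognition), and clearing bits to delete one item could erase evidence of others.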

Although parallel computing systems or tools such as MapReduce or Dryad are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages are being developed based on these systems. Such high-level languages include Sawzall, Pig, and Hive, used for MapReduce, as well as Scope and DryadLINQ, used for Dryad.
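
Returning to the trie listed among the processing methods above, its two stated uses, rapid retrieval and word frequency statistics, can be sketched as follows. The class `Trie` and its methods are hypothetical names for this example.

```python
class Trie:
    """Prefix tree in which each node stores how many inserted words end there."""

    def __init__(self):
        self.children = {}   # character -> child Trie node
        self.count = 0       # number of inserted words ending at this node

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.count += 1

    def _find(self, prefix):
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def frequency(self, word):
        """Word frequency statistics: how often the exact word was inserted."""
        node = self._find(word)
        return node.count if node else 0

    def with_prefix(self, prefix):
        """Rapid retrieval: all stored words sharing the given prefix, sorted."""
        node = self._find(prefix)
        if node is None:
            return []
        found, stack = [], [(node, prefix)]
        while stack:
            current, word = stack.pop()
            if current.count:
                found.append(word)
            for ch, child in current.children.items():
                stack.append((child, word + ch))
        return sorted(found)
```

Words that share a prefix share the nodes of that prefix, so a lookup costs time proportional to the length of the query string rather than to the number of stored words.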

5.3 Architecture for big data analysis

Because of the 4Vs of big data, different analytical architectures shall be considered for different application requirements.


Table 1 Comparison of MPI, MapReduce, and Dryad

Aspect | MPI | MapReduce | Dryad
Deployment | Computing nodes and data storage arranged separately (data should be moved to the computing node) | Computing and data storage arranged at the same node (computing should be close to data) | Computing and data storage arranged at the same node (computing should be close to data)
Resource management/scheduling | – | Workqueue (Google), HOD (Yahoo) | Not clear
Low-level programming | MPI API | MapReduce API | Dryad API
High-level programming | – | Pig, Hive, Jaql, ... | Scope, DryadLINQ
Data storage | The local file system, NFS, ... | GFS (Google), HDFS (Hadoop), KFS, Amazon S3, ... | NTFS, Cosmos DFS
Task partitioning | User manually partitions the tasks | Automatic | Automatic
Communication | Messaging, remote memory access | Files (local FS, DFS) | Files, TCP pipes, shared-memory FIFOs
Fault tolerance | Checkpoint | Task re-execution | Task re-execution

5.3.1 Real-time vs. offline analysis

According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

– Real-time analysis: mainly used in e-commerce and finance. Since data constantly changes, rapid data analysis is needed and analytical results must be returned with a very short delay. The main existing architectures of real-time analysis include (i) parallel processing clusters using traditional relational databases and (ii) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

– Offline analysis: usually used for applications without strict requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally conducts analysis by importing logs into a special platform through data acquisition tools. Under the big data setting, many Internet enterprises utilize the offline analysis architecture based on Hadoop in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool Timetunnel, and Chukwa of Hadoop. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.

5.3.2 Analysis at different levels

Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive-level analysis, which are examined in the following.

– Memory-level analysis: for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays, the memory of a server cluster surpasses hundreds of GB, and even the TB level is common. Therefore, an in-memory database technology may be used, and hot data shall reside in memory so as to improve analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis. MongoDB is a representative memory-level analytical architecture. With the development of SSDs (Solid-State Drives), the capacity and performance of memory-level data analysis have been further improved, and such analysis has been widely applied.

– BI analysis: for the case when the data scale surpasses the memory level but may still be imported into the BI analysis environment. The current mainstream BI products provide data analysis plans that support data volumes above the TB level.

– Massive analysis: for the case when the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analysis utilizes HDFS of Hadoop to store data and uses MapReduce for data analysis. Most massive analysis belongs to the offline analysis category.


5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithms differ greatly from each other according to different kinds of data and application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the five most widely used software packages, according to a survey of "What Analytics, Data mining, Big Data software that you used in the past 12 months for a real project?" of 798 professionals made by KDNuggets in 2012 [112].

– R (30.7 %): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. When computing-intensive tasks are executed, code written in C, C++, and Fortran may be called in the R environment. In addition, skilled users can directly call R objects in C. Actually, R is a realization of the S language, which is an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and drawing plots. Compared to S, R is more popular since it is open source. R ranked first in the KDNuggets 2012 survey. Furthermore, in a survey of "Design languages you have used for data mining/analysis in the past year" in 2012, R was also in first place, defeating SQL and Java. Due to the popularity of R, database manufacturers such as Teradata and Oracle have released products supporting R.

– Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins with powerful functions for data analysis, such as Analysis ToolPak and Solver Add-in, are included, but such plug-ins can be used only if users enable them. Excel is also the only commercial software among the top five.

– Rapid-I RapidMiner (26.7 %): RapidMiner is open source software used for data mining, machine learning, and predictive analysis. In a KDnuggets survey in 2011, it was more frequently used than R (ranked first). Data mining and machine learning programs provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment. The data mining flow is described in XML and displayed through a graphical user interface (GUI). RapidMiner is written in Java. It integrates the learners and evaluation methods of Weka and works with R. Functions of RapidMiner are implemented by connecting processes that consist of various operators. The entire flow can be deemed as a production line of a factory, with original data as input and model results as output. The operators can be considered as specific functions with different input and output characteristics.

– KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, open-source platform rich in data integration, data processing, data analysis, and data mining functions [113]. It allows users to create data flows or data channels in a visualized manner, to selectively run some or all analytical procedures, and provides analytical results, models, and interactive views. KNIME was written in Java and, being based on Eclipse, provides more functions as plug-ins. Through plug-ins, users can insert processing modules for files, pictures, and time series, and integrate them with various open source projects, e.g., R and Weka. KNIME covers data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted in a visualized environment. KNIME is designed as a module-based and extensible framework. There is no dependence between its processing units and data containers, making it suitable for distributed environments and independent development. In addition, KNIME is easy to extend: developers can effortlessly extend its various nodes and views.

– Weka/Pentaho (14.8 %): Weka, abbreviated from Waikato Environment for Knowledge Analysis, is free and open-source machine learning and data mining software written in Java. Weka provides such functions as data processing, feature selection, classification, regression, clustering, association rules, and visualization. Pentaho is one of the most popular open-source BI software packages. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, and data mining, i.e., all aspects of BI. Weka's data processing algorithms are also integrated in Pentaho and can be directly called.

        6 Big data applications

In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful values via judgments, suggestions, supports, or decisions. However, data analysis involves a wide range of applications, which frequently change and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields, including structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

        6.1 Application evolutions

        Recently, big data analysis has been proposed as an advanced analytical technology, which typically includes large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have emerged over the past decades. For example, as early as the 1990s, BI had become a prevailing technology for business applications, and network search engines based on massive data mining emerged in the early 21st century. Some potential and influential applications from different fields, together with their data and analysis characteristics, are discussed as follows.

        – Evolution of Commercial Applications: The earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. Analytical techniques used in such systems prevailed in the 1990s and were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, score cards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have been providing a unique opportunity for organizations to establish an online presence and interact directly with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted by text analysis and website mining techniques. As reported in [115], the quantity of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the Internet of Things based on sensors are opening a new generation of innovative applications and require considerably larger capacity for supporting location-sensing, people-oriented, and context-aware operation.

        – Evolution of Network Applications: The early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to the mining of email contents and the building of search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages, full of various kinds of data, such as text, images, audio, videos, and interactive contents. Therefore, plenty of advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share contents.

        – Evolution of Scientific Applications: Scientific research in many fields, such as astrophysics, oceanology, genomics, and environmental research, is acquiring massive data with high-throughput sensors and instruments. The US National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, inter-operative analysis software, and data services to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets vary widely in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

        As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and comprehensive coverage is not easy, we will focus on the key problems and technologies in data analysis in the following discussions.

        6.2 Big data analysis fields


        6.2.1 Structured data analysis

        Business applications and scientific research may generate massive structured data, the management and analysis of which rely on mature commercialized technologies, such as RDBMS, data warehouse, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

        However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning based on exact mathematical models and powerful algorithms has been applied to anomaly detection [117] and energy control [118]. Exploiting data characteristics, time and space mining can extract knowledge structures hidden in high-speed data flows and sensor data [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially in process analysis with event data [121].

        6.2.2 Text data analysis

        The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to have more business potential than structured data analysis. Generally, text analysis is a process of extracting useful information and knowledge from unstructured text. Text mining is inter-disciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and data mining in particular. Most text mining systems are based on text representations and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammars [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.
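As a minimal sketch of the kind of text representation on which such systems can build, the following Python fragment weights terms by TF-IDF. The whitespace tokenization, the helper name `tf_idf`, and the sample documents are illustrative assumptions and do not come from the surveyed works.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    # Assumption: lowercased whitespace splitting is enough for this sketch.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = ["big data analysis", "text data mining", "web data analysis"]
vectors = tf_idf(docs)
```

A term that occurs in every document (here "data") receives weight zero, whereas a term confined to one document (here "big") receives the largest weight, which is why such vectors are a common input to classification and clustering.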

        6.2.3 Web data analysis

        Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

        Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, image, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which will be discussed in Section 6.2.4. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves the mining of semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., email, newsgroup management, and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters information for users according to inferred or configured user profiles. The database method aims to model and integrate the data on the Web so as to conduct more complex queries than keyword-based searches.

        Web structure mining involves models for discovering Web link structures. Here, the structure refers to the graph of links within a website or among multiple websites. Models are built based on the topological structures provided by hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages. The topic-oriented crawler is another successful case of utilizing the models [127].
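To illustrate how a link-structure model can rank pages, the following is a minimal power-iteration sketch in the spirit of PageRank [125]. The damping factor of 0.85, the fixed number of iterations, the even redistribution of rank from pages without outgoing links, and the toy link graph are assumptions made for this sketch only.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the same small "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Toy link graph (hypothetical): page "c" is linked to by every other page.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
```

In this toy graph the page with the most incoming links ("c") obtains the highest score, and the page that nobody links to ("d") obtains the lowest.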

        Web usage mining aims to mine auxiliary data generated by Web dialogues or activities, whereas Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or trades, caches, user queries, bookmark data, mouse clicks and scrolls, and any other kinds of data generated through interaction with the Web. As Web services and Web 2.0 become mature and popular, Web usage data will have increasingly high variety. Web usage mining plays key roles in personalized space, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.

        6.2.4 Multimedia data analysis

        Multimedia data (mainly including images, audio, and videos) have been growing at an amazing speed. Multimedia analysis extracts useful knowledge from such data and interprets its semantics. Because multimedia data is heterogeneous and most of it contains richer information than simple structured data or text data, extracting information is confronted with the huge challenge of the semantic gap. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection.

        Audio summarization can be accomplished by extracting the prominent words or phrases from metadata or by synthesizing a new representation. Video summarization interprets the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied to many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video and take other smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.
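A static summary can be sketched very simply: keep a frame as a key frame whenever it differs strongly from the previously kept one. The sketch below assumes that each frame is a flat list of pixel intensities and uses the mean absolute difference with an arbitrary threshold of 30; it is a deliberately naive stand-in and not the method of TOMS [128] or of any commercial system.

```python
def select_key_frames(frames, threshold=30.0):
    """Return indices of frames that differ strongly from the last key frame."""
    if not frames:
        return []
    key_indices = [0]  # the first frame always opens the summary
    last_key = frames[0]
    for index, frame in enumerate(frames[1:], start=1):
        # Mean absolute pixel difference to the most recent key frame.
        diff = sum(abs(a - b) for a, b in zip(frame, last_key)) / len(frame)
        if diff > threshold:
            key_indices.append(index)
            last_key = frame
    return key_indices


# Hypothetical 4-pixel "frames": three visually distinct scenes.
frames = [
    [10, 10, 10, 10],
    [12, 11, 10, 9],
    [200, 210, 190, 205],
    [198, 212, 191, 204],
    [50, 40, 60, 55],
]
key_frames = select_key_frames(frames)
```

For the toy input, one frame per scene is kept (indices 0, 2, and 4), which shows both the appeal of the approach (simplicity) and its weakness (no notion of content importance).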

        Multimedia annotation inserts labels to describe the contents of images and videos at both the syntactic and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time- and labor-intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic gap. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to jointly explore both manual and automatic multimedia annotation [129].

        Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users in conveniently and quickly looking up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, and includes shot boundary detection, key frame extraction, and scene segmentation. Based on the result of structural analysis, the second procedure is feature extraction, which mainly includes further mining the features of key frames, objects, texts, and movements, which are the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find patterns in video contents and put videos into predefined categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos. The retrieval results are then refined through relevance feedback.
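The query-and-retrieval step can be illustrated with a small sketch that ranks indexed videos by the similarity of their feature vectors to a query vector. Cosine similarity is used here as one possible similarity measure; the three-dimensional feature vectors, the video identifiers, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equally long feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query, index, top_k=2):
    """Return the ids of the top_k indexed videos most similar to the query."""
    scored = [
        (cosine_similarity(query, features), video_id)
        for video_id, features in index.items()
    ]
    scored.sort(reverse=True)
    return [video_id for _, video_id in scored[:top_k]]


# Hypothetical index: each video is described by three extracted features.
video_index = {
    "news_clip": [0.9, 0.1, 0.0],
    "soccer_match": [0.1, 0.8, 0.3],
    "tennis_match": [0.0, 0.7, 0.6],
}
results = retrieve([0.0, 1.0, 0.4], video_index)
```

In a full system, the user's reaction to `results` would feed back into the query (relevance feedback); that loop is omitted here.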

        Multimedia recommendation aims to recommend specific multimedia contents according to users' preferences. It has proven to be an effective approach to providing personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. Content-based methods identify general features of users or their interests and recommend to users other contents with similar features. These methods largely rely on content similarity measurement, but most of them are troubled by limited content analysis and overspecialization. Collaborative-filtering-based methods identify groups with similar interests and recommend contents to group members according to their behavior [132]. Presently, hybrid methods have been introduced, which integrate the advantages of the aforementioned two types of methods to improve recommendation quality [133].
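The collaborative-filtering idea can be sketched as follows: users who rated the same items similarly are treated as a group, and items liked by similar users are suggested to the target user. The rating data, the 1 to 5 rating scale, and the similarity-weighted averaging are assumptions chosen for illustration; the systems surveyed in [132, 133] are considerably more elaborate.

```python
import math


def user_similarity(ratings_a, ratings_b):
    """Cosine similarity over the items that both users have rated."""
    shared = set(ratings_a) & set(ratings_b)
    if not shared:
        return 0.0
    dot = sum(ratings_a[i] * ratings_b[i] for i in shared)
    norm_a = math.sqrt(sum(ratings_a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(ratings_b[i] ** 2 for i in shared))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(target, ratings, top_k=1):
    """Rank items unseen by `target` by similarity-weighted average rating."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == target:
            continue
        sim = user_similarity(ratings[target], other_ratings)
        if sim <= 0.0:
            continue
        for item, rating in other_ratings.items():
            if item not in ratings[target]:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    predicted = {item: scores[item] / weights[item] for item in scores}
    return sorted(predicted, key=lambda item: (-predicted[item], item))[:top_k]


# Hypothetical ratings on a 1-5 scale.
ratings = {
    "ann": {"clip_a": 5, "clip_b": 1},
    "bob": {"clip_a": 5, "clip_b": 1, "clip_c": 5, "clip_d": 2},
    "cat": {"clip_a": 1, "clip_b": 5, "clip_c": 1, "clip_d": 4},
}
suggestions = recommend("ann", ratings)
```

Because "ann" agrees with "bob" far more than with "cat", the item that "bob" rated highly ("clip_c") is suggested first.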

        The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains some text descriptions related to concepts and video examples [134]. In [135], the author proposed a new algorithm for special multimedia event detection using a few positive training examples. The research on video event detection is still in its infancy and mainly focuses on sports or news events, running or abnormal events in surveillance videos, and other similar events with repetitive patterns.

        6.2.5 Network data analysis

        Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graph structures describing the communications between two entities. The content data contains text, image, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. From the data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

        The research on link-based structural analysis has focused on link prediction, community discovery, social network evolution, and social influence analysis. SNS may be visualized as graphs, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertices and edges are continually added to the graphs. Link prediction aims to predict the possibility of a future connection between two vertices. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and utilizes the existing link information to generate binary classifiers to predict future links [139]. Probabilistic methods aim to build models for connection probabilities among vertices in SNS [140]. Linear algebra methods compute the similarity between two vertices according to the singular similarity matrix [141]. A community is represented by a sub-graph matrix, in which the edges connecting vertices within the sub-graph feature high density, while the edges between two sub-graphs feature much lower density [142].
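As a concrete but deliberately simple illustration of link prediction, the sketch below scores every pair of vertices that is not yet connected by the Jaccard coefficient of their neighbor sets. Such neighborhood scores are often used as input features for the classifiers mentioned above; the sketch is a stand-in chosen for brevity, not the methods of [139–141], and the toy friendship graph is hypothetical.

```python
from itertools import combinations


def jaccard_link_scores(edges):
    """Score each unlinked vertex pair by the overlap of their neighbor sets."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    scores = {}
    for u, v in combinations(sorted(neighbors), 2):
        if v in neighbors[u]:
            continue  # already linked, nothing to predict
        common = neighbors[u] & neighbors[v]
        union = neighbors[u] | neighbors[v]
        scores[(u, v)] = len(common) / len(union)
    return scores


# Hypothetical undirected friendship graph.
edges = [
    ("ann", "bob"),
    ("ann", "cat"),
    ("bob", "dan"),
    ("cat", "dan"),
    ("dan", "eve"),
]
scores = jaccard_link_scores(edges)
best_pair = max(scores, key=scores.get)
```

Here "bob" and "cat" share all of their neighbors, so their pair obtains the highest score and would be predicted as the most likely future link.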

        Many methods for community detection have been proposed and studied, most of which are topology-based target functions relying on the concept of capturing community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. The research on SNS evolution aims to look for laws and deduction models to interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generation methods have been proposed to assist network and system design [147].
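The density criterion behind the notion of a community can be checked directly on a candidate partition. The following sketch compares the edge density inside each group with the edge density between two groups; the graph, the two groups, and the helper names are hypothetical, and the sketch assumes an undirected graph without duplicate edges and disjoint groups. It only evaluates a given grouping and is not a community detection method such as [143].

```python
def edge_density(nodes, edges):
    """Fraction of possible undirected edges among `nodes` that exist."""
    nodes = set(nodes)
    possible = len(nodes) * (len(nodes) - 1) / 2
    if possible == 0:
        return 0.0
    inside = sum(1 for u, v in edges if u in nodes and v in nodes)
    return inside / possible


def cross_density(group_a, group_b, edges):
    """Fraction of possible edges between two disjoint groups that exist."""
    group_a, group_b = set(group_a), set(group_b)
    possible = len(group_a) * len(group_b)
    if possible == 0:
        return 0.0
    crossing = sum(
        1
        for u, v in edges
        if (u in group_a and v in group_b) or (u in group_b and v in group_a)
    )
    return crossing / possible


# Two triangles joined by a single bridge edge (3, 4).
edges = [(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (3, 4)]
inner_a = edge_density({1, 2, 3}, edges)
inner_b = edge_density({4, 5, 6}, edges)
between = cross_density({1, 2, 3}, {4, 5, 6}, edges)
```

Both groups are fully connected internally, while only one of the nine possible cross-group edges exists, which matches the high-inside, low-between pattern described above.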

        Social influence refers to the case when individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, time effects, and characteristics of networks and individuals. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is considered, the performance of link-based structural analysis may be further improved.

        Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise. For example, the blogosphere contains a large number of spam blogs, and Twitter likewise contains many trivial Tweets. Third, SNS are dynamic networks, which vary and are updated frequently and quickly. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge among different media [150].

        6.2.6 Mobile data analysis

        By April 2013, Android Apps had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since the research on mobile analysis has only just started, we will only introduce some recent and representative analysis applications in this section.

        With the growth of the number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities with geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest Webchat). Traditional network communities or SNS communities are short of online interaction among members, and the communities are active only when members are sitting before computers. On the contrary, mobile phones can support rich interaction at any time and anywhere. A mobile community is defined as a group of individuals with the same hobbies (e.g., health, safety, and entertainment) who gather together on networks, meet to set a common goal, decide measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

        Recently, progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build body area networks for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations. In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize

        Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop an application for smart phones, which analyzes people's pace when they walk and uses the pace information for unlocking the safety system [11]. In the meanwhile, Robert Delano and Brian Parise from Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with a built-in seismograph in a mobile phone, so as to cope with Parkinson's disease and other nervous system diseases [11].

        6.3 Key applications of big data

        6.3.1 Application of big data in enterprises

        At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, in marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business modes. In sales planning, after comparison of massive data, enterprises can optimize their commodity prices. In operation, enterprises can improve their operation efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor cost. In the supply chain, using big data, enterprises may conduct inventory optimization, logistics optimization, and supplier coordination to mitigate the gap between supply and demand, control budgets, and improve services.

        In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that such activities as "Multi-times score accumulation" and "score exchange in shops" are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers' transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.
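The selection step of such a warning model can be sketched in a few lines: given a drop-out score per customer, rank the customers and keep the top 20 %. How CMB actually computes its scores is not described in the source; the scores, customer identifiers, and function name below are purely hypothetical.

```python
import math


def top_fraction(scores, fraction=0.2):
    """Return the ids with the highest scores, covering `fraction` of all ids."""
    count = math.ceil(len(scores) * fraction)
    # Sort by descending score; break ties by id to keep the result stable.
    ranked = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return ranked[:count]


# Hypothetical drop-out scores produced by some upstream warning model.
churn_scores = {
    "c01": 0.91, "c02": 0.15, "c03": 0.42, "c04": 0.77, "c05": 0.08,
    "c06": 0.33, "c07": 0.64, "c08": 0.21, "c09": 0.55, "c10": 0.12,
}
retention_targets = top_fraction(churn_scores)
```

With ten customers, the two with the highest scores ("c01" and "c04") would be the ones offered retention products.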

        Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction time, commodity prices, and purchase quantities are recorded, along with, more importantly, the age, gender, address, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, the market conditions of their brands, and consumers' behaviors, and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan of Alibaba automatically analyzes and judges whether to grant loans to enterprises through the acquired enterprise transaction data by virtue of big data technology, without manual intervention in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion with only about 0.3 % bad loans, which is much lower than that of other commercial banks.

        6.3.2 Application of IoT based big data

        IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

        Logistics enterprises may have the most profound experience with the application of IoT big data. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

        Smart city is a hot research area based on the application of IoT data. For example, the smart city project cooperation between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and Miami city, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills this year by identifying and fixing running and leaking water pipes in time.

        6.3.3 Application of online social network-oriented big data

        Online SNS is a social structure constituted by social individuals and the connections among them, based on an information network. Big data of online SNS mainly comes from instant messaging, online social networking, micro blogs, and shared spaces, which represent various user activities. The analysis of big data from online SNS uses computational analytical methods to understand relations in human society, drawing on theories and methods from mathematics, informatics, sociology, and management science, along three dimensions: network structure, group interaction, and information spreading. Applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS, which mainly mine and analyze content information and structural information to acquire value, are introduced in the following.

        – Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preferences, emotions, interests, and demands may be revealed.

        – Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies aggregate relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.

        The US Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

        Fig. 5 Enabling technologies for online social network-oriented big data

        In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted research that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans. The goal is to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting sharp growth or drops in the amount of topics; 2) observing the weekly and monthly trends of dialogs on Twitter, and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing the ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation in the official statistics of Indonesia matches the number of Tweets about rice prices on Twitter, as shown in Fig. 6.
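The first of these aspects, detecting sharp growth or drops in topic volume, can be sketched with a simple rolling z-score: a count is flagged when it deviates strongly from the mean of the preceding window. The window length of four, the threshold of three standard deviations, and the weekly counts are assumptions for illustration; the source does not state which detection method Global Pulse used.

```python
import statistics


def detect_spikes(counts, window=4, z_threshold=3.0):
    """Return indices whose count deviates sharply from the preceding window."""
    spikes = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mean = statistics.mean(history)
        stdev = statistics.pstdev(history)
        if stdev == 0:
            continue  # a flat history gives no scale for comparison
        z_score = (counts[i] - mean) / stdev
        if abs(z_score) >= z_threshold:
            spikes.append(i)
    return spikes


# Hypothetical weekly numbers of Tweets mentioning one topic.
weekly_tweets = [100, 104, 98, 102, 101, 180, 175, 99]
spike_weeks = detect_spikes(weekly_tweets)
```

Only the first week of the surge (index 5) is flagged; once the surge has entered the history window, the following weeks no longer look abnormal, which is one known limitation of such a simple detector.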

        Generally speaking, the application of big data from online SNS may help to better understand users' behavior and master the laws of social and economic activities from the following three aspects:

        – Early Warning: to rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

        – Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotion, and preference of users.

        – Real-time Feedback: to acquire groups' feedback on some social activities based on real-time monitoring.

        6.3.4 Applications of healthcare and medical big data

        Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the health care business.

        For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final result into a highly personalized treatment plan to assess the risk factors and main treatment plans of patients. Then, doctors may reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or suggesting that patients reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20.

        The Mount Sinai Medical Center in the US utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

        HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with a third-party application with the software development kit (SDK) and open interface.

        Fig. 6 The correlation between Tweets about rice price and food price inflation

        6.3.5 Collective intelligence

        With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units and coordinate through mobile networks to distribute sensing tasks and to collect and utilize sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of Crowdsourcing has been successfully applied to geotagged photographs, positioning and navigation, urban road traffic sensing, market forecasting, opinion mining, and other labor-intensive applications.

        Crowdsourcing, a new approach for problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, Crowdsourcing had been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of Crowdsourcing. The main idea of Crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, Crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

        In the big data era, Spatial Crowdsourcing becomes a hot topic. The operation framework of Spatial Crowdsourcing is as follows. A user may request the service and resources related to a specified location. Then the mobile users who are willing to participate in the task will move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data will be sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions provided by mobile devices, it can be forecasted that Spatial Crowdsourcing will be more prevailing than traditional Crowdsourcing, e.g., Amazon Mechanical Turk and CrowdFlower.
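
The request, travel, and return steps of Spatial Crowdsourcing described above can be illustrated with a small sketch. It assumes one possible assignment rule, choosing the nearest willing participant by straight-line distance; the `Worker` type, the `assign_task` function, and the coordinates are hypothetical and are not taken from any particular platform.

```python
from dataclasses import dataclass
from math import hypot

@dataclass
class Worker:
    name: str
    x: float        # current position (arbitrary planar units)
    y: float
    willing: bool   # whether the user opted in to this task

def assign_task(task_xy, workers):
    """Pick the nearest willing worker for a location-bound task.

    Returns None when nobody is willing to participate.
    """
    candidates = [w for w in workers if w.willing]
    if not candidates:
        return None
    tx, ty = task_xy
    return min(candidates, key=lambda w: hypot(w.x - tx, w.y - ty))

workers = [
    Worker("a", 0.0, 0.0, True),
    Worker("b", 1.0, 1.0, False),  # closest to (1, 1) but not willing
    Worker("c", 5.0, 5.0, True),
]
chosen = assign_task((1.0, 1.0), workers)
print(chosen.name if chosen else "no volunteer")
```

Real systems would likely also weigh travel time, worker reliability, and task deadlines; the nearest-worker rule is only the simplest choice.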

        6.3.6 Smart grid

        Smart Grid is the next generation power grid constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart Grid related big data are generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nationwide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as Circuit Breaker Monitors and transformers). Smart Grid brings about the following challenges on exploiting big data.

        – Grid planning: By analyzing data in the Smart Grid, the regions that have excessively high electrical load or high power outage frequencies can be identified. Even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from University of California, Los Angeles designed an "electric map" according to the big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of the block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides effective and visual load forecast for power grid planning in a city. Preferential transformation on the power grid facilities in blocks with high power outage frequencies and serious overloads may be conducted, as displayed in the map.

        – Interaction between power generation and power consumption: An ideal power grid shall balance power generation and consumption. However, the traditional power grid is constructed based on the one-directional approach of transmission-transformation-distribution-consumption, which does not allow the generation capacity to be adjusted according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters are developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which can help the supplier read power utilization data every 15 min rather than every month as in the past. Labor cost for meter reading is greatly reduced. Because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and low periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help the realization of time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users.

        – The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable new energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortage. Such energy resources can complement the traditional hydropower and thermal power generations.
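
The time-sharing pricing idea in the list above, where 15-minute smart meter readings are billed at different rates in peak and low periods, can be sketched as follows. The peak window, the two rates, and the function names are illustrative assumptions; they are not TXU Energy's actual tariff.

```python
# Assumed tariff parameters, for illustration only.
PEAK_HOURS = range(17, 21)   # 17:00 to 20:59 is treated as the peak period
PEAK_RATE = 0.30             # price per kWh during peak hours
OFF_PEAK_RATE = 0.10         # price per kWh otherwise

def interval_cost(slot_index, kwh):
    """Cost of one 15-minute interval; slot 0 starts at 00:00."""
    hour = (slot_index * 15) // 60 % 24
    rate = PEAK_RATE if hour in PEAK_HOURS else OFF_PEAK_RATE
    return kwh * rate

def daily_bill(readings):
    """Total cost of consecutive 15-minute readings (kWh per interval)."""
    return sum(interval_cost(i, kwh) for i, kwh in enumerate(readings))

# A flat load of 1 kWh per interval for a full day (96 intervals).
flat_day = [1.0] * 96
print(round(daily_bill(flat_day), 2))  # prints 12.8
```

With monthly readings only the total consumption is known, so a tariff of this kind cannot be applied; the 15-minute granularity is what makes per-period pricing possible.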

        7 Conclusion, open issues, and outlook

        In this paper, we review the background and state-of-the-art of big data. Firstly, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. Finally, we review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big-picture to readers of this exciting area.

        In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

        7.1 Open issues

        The analysis of big data is confronted with many challenges, but the current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

        7.1.1 Theoretical research

        Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, which are discussed below.

        – Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

        – Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to assess the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated after the system is implemented and deployed, so the advantages and disadvantages of alternative solutions cannot be compared horizontally, or even before and after the adoption of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

        – Evolution of big data computing modes: This includes memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.
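
Assuming that "MR mode" in the list above refers to MapReduce [22], the data-intensive style it stands for can be shown with a single-process word count. This is a toy sketch of the map, shuffle, and reduce phases only; the function names are illustrative, and nothing here is distributed or tied to Hadoop's API.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine the values collected for one key."""
    return key, sum(values)

def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

print(word_count(["big data", "Big deal"]))
```

In a real deployment the shuffle step moves data between machines, which is where the data transfer bottleneck mentioned above shows up.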

        7.1.2 Technology development

        Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data model, and software systems supporting big data, etc., should be fully investigated.

        – Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor which restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.


        – Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor to improve big data computing.

        – Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

        – Processing of big data: As big data research advances, new problems on big data processing arise from traditional data analysis, including (i) data re-utilization: with the increase of data scale, more value may be mined from re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized so that more value can be mined from them; (iii) data exhaust, which means wrong data collected during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.

        7.1.3 Practical implications

        Although there are already many successful big data applications, many practical problems should be solved.

        – Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

        – Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

        – Integration and provenance of big data: As discussed, the value acquired from comprehensive utilization of multiple datasets is far higher than the sum value of individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance is the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and from different datasets.

        – Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction of big data, etc., are all important research problems.

        7.1.4 Data security

        In IT, safety and privacy are always two key concerns. In the big data era, as data volume is fast growing, there are more severe safety risks, while the traditional data protection methods have already been shown not applicable to big data. In particular, big data safety is confronted with the following security related challenges.

        – Big data privacy: Big data privacy includes two aspects: (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc. of users may be more easily acquired, and users may not be aware. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed as a big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher of Skull Security, used an information acquisition tool to acquire data in the public pages of Facebook users who had failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

        – Data quality: Data quality influences big data utilization. Low quality data wastes transmission and storage resources with poor usability. There are a lot of factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though a lot of measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.

        – Big data safety mechanism: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data could not meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safety communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

        – Big data application in information security: Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APT (Advanced Persistent Threat) after analysis of big data in the form of log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc. may also be more easily identified through the analysis of big data.
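
Two of the data quality dimensions named in the list above, completeness and redundancy, can be turned into simple measurable ratios. The sketch below is one possible formulation under stated assumptions: records are plain dictionaries, a missing or `None` field counts as incomplete, and a record is redundant when it exactly repeats an earlier one. The function names and sample records are hypothetical.

```python
def completeness(records, fields):
    """Fraction of expected field values that are present and not None."""
    total = len(records) * len(fields)
    if total == 0:
        return 1.0
    filled = sum(1 for r in records for f in fields if r.get(f) is not None)
    return filled / total

def redundancy(records, fields):
    """Fraction of records that exactly duplicate an earlier record."""
    if not records:
        return 0.0
    seen = set()
    duplicates = 0
    for r in records:
        key = tuple(r.get(f) for f in fields)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates / len(records)

# Hypothetical sensor records: one missing value, one absent field, one duplicate.
records = [
    {"id": 1, "temp": 20.5},
    {"id": 2, "temp": None},
    {"id": 1, "temp": 20.5},
    {"id": 3},
]
print(completeness(records, ["id", "temp"]))  # prints 0.75
print(redundancy(records, ["id", "temp"]))    # prints 0.25
```

Accuracy and consistency are harder to automate because they need a reference value or cross-record rules, which this sketch does not attempt.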

        The safety of big data has drawn great attention of researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy efficiency optimization, and processing hardware and software system architectures, etc. Particularly, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

        7.2 Outlook

        The emergence of big data opens great opportunities. In the IT era, the "T" (Technology) was the main concern, while technology drove the development of data. In the big data era, with the prominence of data value and advances in "I" (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact but also influence everyone's ways of living and thinking, which is just happening. We cannot predict the future, but we may take precautions for possible events to occur in the future.

        – Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have been seeking better ways to cope with larger-scale, higher diversity, and more complexly structured data. These efforts are represented by Google's globally-distributed database Spanner and the fault-tolerant, expandable, distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to relational databases, and effectively handle data through grammars similar to SQL.

        – Data resource performance: Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

        – Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc., shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

        – Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., as in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a friendly manner can they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc., are frequently used to visualize results of data analysis. New presentation forms will occur in the future, e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

        – Data-oriented: It is well-known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it is observed that the role of data is becoming increasingly more significant. In the small scale data era, in which logic is more complex than data, program design is mainly process-oriented. As business data is becoming more complex, object-oriented design methods are developed. Nowadays, the complexity of business data has far surpassed business logic. Consequently, programs are gradually transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods will emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

        – Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

            – During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

            – Compared with accurate data, we would like to accept numerous and complicated data.

            – We shall pay greater attention to correlations between things rather than exploring causal relationships.

            – The simple algorithms of big data are more effective than complex algorithms of small data.

            – Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace "experts."

        Throughout the history of human society, the demands and willingness of human beings are always the source powers to promote scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, development of mobile sensing technology, and progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in the future society.

        Acknowledgments This work was supported by China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

        References

        1. Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

        2. Fact sheet: Big data across the federal government (2012). http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf

        3. Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper

        4. Drowning in numbers - digital data will flood the planet - and help us understand it better (2011). http://www.economist.com/blogs/dailychart/2011/11/big-data-0

        5. Lohr S (2012) The age of big data. New York Times, pp 11

        6. Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/the-digital-breadcrumbs-that-lead-to-big-data

        7. Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-search-for-analysts-to-make-sense-of-big-data

        8. Big data (2008). http://www.nature.com/news/specials/bigdata/index.html

        9. Special online collection: dealing with big data (2011). http://www.sciencemag.org/site/special/data/

        10. Manyika J, McKinsey Global Institute, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute

        11. Mayer-Schönberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Eamon Dolan/Houghton Mifflin Harcourt

        12. Laney D (2001) 3-D data management: controlling data volume, velocity and variety. META Group Research Note, 6 February

        13. Zikopoulos P, Eaton C et al (2011) Understanding big data: analytics for enterprise class Hadoop and streaming data. McGraw-Hill Osborne Media

        14. Meijer E (2011) The world according to LINQ. Commun ACM 54(10):45–51

        15. Beyer M (2011) Gartner says solving big data challenge involves more than just managing volumes of data. Gartner. http://www.gartner.com/it/page.jsp

        16. O. R. Team (2011) Big data now: current perspectives from O'Reilly Radar. O'Reilly Media

        17. Grobelnik M (2012) Big data tutorial. http://videolectures.net/eswc2012_grobelnik_big_data/

        18. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2008) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014

        19. DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6):85–98


        20. Walter T (2009) Teradata past, present, and future. UCI ISG lecture series on scalable data management

        21. Ghemawat S, Gobioff H, Leung S-T (2003) The Google file system. In: ACM SIGOPS Operating Systems Review, vol 37. ACM, pp 29–43

        22. Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

        23. Hey AJG, Tansley S, Tolle KM et al (2009) The fourth paradigm: data-intensive scientific discovery

        24. Howard JH, Kazar ML, Menees SG, Nichols DA, Satyanarayanan M, Sidebotham RN, West MJ (1988) Scale and performance in a distributed file system. ACM Trans Comput Syst (TOCS) 6(1):51–81

        25. Cattell R (2011) Scalable SQL and NoSQL data stores. ACM SIGMOD Record 39(4):12–27

        26. Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033

        27. Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98

        28. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J et al (2012) Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States

        29. Sun Y, Chen M, Liu B, Mao S (2013) FAR: a fault-avoidant routing method for data center networks with regular topology. In: Proceedings of ACM/IEEE symposium on architectures for networking and communications systems (ANCS'13). ACM

        30. Wiki (2013) Applications and organizations using Hadoop. http://wiki.apache.org/hadoop/PoweredBy

        31. Bahga A, Madisetti VK (2012) Analyzing massive machine maintenance data in a computing cloud. IEEE Transac Parallel Distrib Syst 23(10):1831–1843

        32. Gunarathne T, Wu T-L, Choi JY, Bae S-H, Qiu J (2011) Cloud computing paradigms for pleasingly parallel biomedical applications. Concurr Comput Prac Experience 23(17):2338–2354

        33. Gantz J, Reinsel D (2010) The digital universe decade - are you ready? External publication of IDC (Analyse the Future) information and data, pp 1–16

        34. Bryant RE (2011) Data-intensive scalable computing for scientific applications. Comput Sci Eng 13(6):25–33

        35. Wahab MHA, Mohd MNH, Hanafi HF, Mohsin MFM (2008) Data pre-processing on web server logs for generalized association rules mining algorithm. World Acad Sci Eng Technol 48:2008

        36. Nanopoulos A, Manolopoulos Y, Zakrzewicz M, Morzy T (2002) Indexing web access-logs for pattern queries. In: Proceedings of the 4th international workshop on web information and data management. ACM, pp 63–68

        37. Joshi KP, Joshi A, Yesha Y (2003) On using a warehouse to analyze web logs. Distrib Parallel Databases 13(2):161–180

        38. Chandramohan V, Christensen K (2002) A first look at wired sensor networks for video surveillance systems. In: Proceedings LCN 2002, 27th annual IEEE conference on local computer networks. IEEE, pp 728–729

        39. Selavo L, Wood A, Cao Q, Sookoor T, Liu H, Srinivasan A, Wu Y, Kang W, Stankovic J, Young D et al (2007) LUSTER: wireless sensor network for environmental research. In: Proceedings of the 5th international conference on embedded networked sensor systems. ACM, pp 103–116

        40. Barrenetxea G, Ingelrest F, Schaefer G, Vetterli M, Couach O, Parlange M (2008) SensorScope: out-of-the-box environmental monitoring. In: Information processing in sensor networks, 2008, international conference on, IPSN'08. IEEE, pp 332–343

        41. Kim Y, Schmid T, Charbiwala ZM, Friedman J, Srivastava MB (2008) NAWMS: nonintrusive autonomous water monitoring system. In: Proceedings of the 6th ACM conference on embedded network sensor systems. ACM, pp 309–322

        42. Kim S, Pakzad S, Culler D, Demmel J, Fenves G, Glaser S, Turon M (2007) Health monitoring of civil infrastructures using wireless sensor networks. In: Information processing in sensor networks, 2007, 6th international symposium on, IPSN 2007. IEEE, pp 254–263

        43. Ceriotti M, Mottola L, Picco GP, Murphy AL, Guna S, Corra M, Pozzi M, Zonta D, Zanon P (2009) Monitoring heritage buildings with wireless sensor networks: the Torre Aquila deployment. In: Proceedings of the 2009 international conference on information processing in sensor networks. IEEE Computer Society, pp 277–288

        44. Tolle G, Polastre J, Szewczyk R, Culler D, Turner N, Tu K, Burgess S, Dawson T, Buonadonna P, Gay D et al (2005) A macroscope in the redwoods. In: Proceedings of the 3rd international conference on embedded networked sensor systems. ACM, pp 51–63

        45. Wang F, Liu J (2011) Networked wireless sensor data collection: issues, challenges, and approaches. IEEE Commun Surv Tutor 13(4):673–687

        46. Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the 11th international conference on World Wide Web. ACM, pp 124–135

        47. Choudhary S, Dincturk ME, Mirtaheri SM, Moosavi A, von Bochmann G, Jourdan G-V, Onut I-V (2012) Crawling rich internet applications: the state of the art. In: CASCON, pp 146–160

        48 Ghani N Dixit S Wang T-S (2000) On ip-over-wdm integrationIEEE Commun Mag 38(3)72ndash84

        49 Manchester J Anderson J Doshi B Dravida S Ip over sonet(1998) IEEE Commun Mag 36(5)136ndash142

        50 Jinno M Takara H Kozicki B (2009) Dynamic optical mesh net-works drivers challenges and solutions for the future In Opticalcommunication 2009 35th European conference on ECOCrsquo09IEEE pp 1ndash4

        51 Barroso LA Holzle U (2009) The datacenter as a computer anintroduction to the design of warehouse-scale machines SyntLect Comput Archit 4(1)1ndash108

        52 Armstrong J (2009) Ofdm for optical communications J LightTechnol 27(3)189ndash204

        53. Shieh W (2011) Ofdm for flexible high-speed optical networks. J Light Technol 29(10):1560–1577

        54. Cisco data center interconnect design and deployment guide (2010)

        55. Greenberg A, Hamilton JR, Jain N, Kandula S, Kim C, Lahiri P, Maltz DA, Patel P, Sengupta S (2009) Vl2: a scalable and flexible data center network. In: ACM SIGCOMM computer communication review, vol 39. ACM, pp 51–62

        56. Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S (2009) Bcube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 39(4):63–74

        57. Farrington N, Porter G, Radhakrishnan S, Bazzaz HH, Subramanya V, Fainman Y, Papen G, Vahdat A (2011) Helios: a hybrid electrical/optical switch architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 41(4):339–350

        58. Abu-Libdeh H, Costa P, Rowstron A, O’Shea G, Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Comput Commun Rev 40(4):51–62

        59. Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what’s needed for datacenter network operations. IEEE Commun Mag 48(7):32–39

        206 Mobile Netw Appl (2014) 19:171–209

        60. Wang G, Andersen DG, Kaminsky M, Papagiannaki K, Ng TS, Kozuch M, Ryan M (2010) c-through: Part-time optics in data centers. In: ACM SIGCOMM Computer Communication Review, vol 40. ACM, pp 327–338

        61. Ye X, Yin Y, Yoo SJB, Mejia P, Proietti R, Akella V (2010) Dos: a scalable optical switch for datacenters. In: Proceedings of the 6th ACM/IEEE symposium on architectures for networking and communications systems. ACM, p 24

        62. Singla A, Singh A, Ramachandran K, Xu L, Zhang Y (2010) Proteus: a topology malleable data center network. In: Proceedings of the 9th ACM SIGCOMM workshop on hot topics in networks. ACM, p 8

        63. Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P (2011) Energy-efficient design of a scalable optical multiplane interconnection architecture. IEEE J Sel Top Quantum Electron 17(2):377–383

        64. Kodi AK, Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance computing (hpc) systems. IEEE J Sel Top Quantum Electron 17(2):384–395

        65. Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H (2012) Mirror mirror on the ceiling: flexible wireless links for data centers. ACM SIGCOMM Comput Commun Rev 42(4):443–454

        66. Lenzerini M (2002) Data integration: a theoretical perspective. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. ACM, pp 233–246

        67. Cafarella MJ, Halevy A, Khoussainova N (2009) Data integration for the relational web. Proc VLDB Endowment 2(1):1090–1101

        68. Maletic JI, Marcus A (2000) Data cleansing: beyond integrity analysis. In: IQ. Citeseer, pp 200–209

        69. Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1-2):83–113

        70. Chen H, Ku W-S, Wang H, Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 51–62

        71. Zhao Z, Ng W (2012) A model-based approach for rfid data stream cleansing. In: Proceedings of the 21st ACM international conference on information and knowledge management. ACM, pp 862–871

        72. Khoussainova N, Balazinska M, Suciu D (2008) Probabilistic event extraction from rfid data. In: Data Engineering, 2008, IEEE 24th international conference on ICDE 2008. IEEE, pp 1480–1482

        73. Herbert KG, Wang JTL (2007) Biological data cleaning: a case study. Int J Inf Qual 1(1):60–82

        74. Tsai T-H, Lin C-Y (2012) Exploring contextual redundancy in improving object-based video coding for video sensor networks surveillance. IEEE Transac Multmed 14(3):669–682

        75. Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 269–278

        76. Kamath U, Compton J, Dogan RI, Jong KD, Shehu A (2012) An evolutionary algorithm approach for feature generation from sequence data and its application to dna splice site prediction. IEEE/ACM Transac Comput Biol Bioinforma (TCBB) 9(5):1387–1398

        77. Leung K-S, Lee KH, Wang J-F, Ng EYT, Chan HLY, Tsui SKW, Mok TSK, Tse PC-H, Sung JJ-Y (2011) Data mining on dna sequences of hepatitis b virus. IEEE/ACM Transac Comput Biol Bioinforma 8(2):428–440

        78. Huang Z, Shen H, Liu J, Zhou X (2011) Effective data co-reduction for multimedia similarity search. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, pp 1021–1032

        79. Bleiholder J, Naumann F (2008) Data fusion. ACM Comput Surv (CSUR) 41(1):1

        80. Brewer EA (2000) Towards robust distributed systems. In: PODC, p 7

        81. Gilbert S, Lynch N (2002) Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59

        82. McKusick MK, Quinlan S (2009) Gfs: evolution on fast-forward. ACM Queue 7(7):10

        83. Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J (2008) Scope: easy and efficient parallel processing of massive data sets. Proc VLDB Endowment 1(2):1265–1276

        84. Beaver D, Kumar S, Li HC, Sobel J, Vajgel P et al (2010) Finding a needle in haystack: facebook’s photo storage. In: OSDI, vol 10, pp 1–8

        85. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: amazon’s highly available key-value store. In: SOSP, vol 7, pp 205–220

        86. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In: Proceedings of the twenty-ninth annual ACM symposium on theory of computing. ACM, pp 654–663

        87. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2):4

        88. Burrows M (2006) The chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th symposium on Operating systems design and implementation. USENIX Association, pp 335–350

        89. Lakshman A, Malik P (2009) Cassandra: structured storage system on a p2p network. In: Proceedings of the 28th ACM symposium on principles of distributed computing. ACM, pp 5–5

        90. George L (2011) HBase: the definitive guide. O’Reilly Media Inc

        91. Judd D (2008) hypertable-0.9.0.4-alpha

        92. Chodorow K (2013) MongoDB: the definitive guide. O’Reilly Media Inc

        93. Crockford D (2006) The application/json media type for javascript object notation (json)

        94. Murty J (2009) Programming amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O’Reilly Media Inc

        95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O’Reilly Media Inc

        96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 975–986

        97. Yang H-C, Parker DS (2009) Traverse: simplified indexing on large map-reduce-merge clusters. In: Database systems for advanced applications. Springer, pp 308–322

        98. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with sawzall. Sci Program 13(4):277–298

        99. Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U (2009) Building a high-level dataflow system on top of map-reduce: the pig experience. Proceedings VLDB Endowment 2(2):1414–1425


        100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629

        101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72

        102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) Dryadlinq: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14

        103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: Parallel and distributed processing, 2008, IEEE international symposium on IPDPS 2008. IEEE, pp 1–11

        104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146

        105. Bu Y, Bill H, Balazinska M, Ernst MD (2010) Haloop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296

        106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative mapreduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing. ACM, pp 810–818

        107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2

        108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: mapreduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing. ACM, p 7

        109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) Ciel: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on Networked systems design and implementation, p 9

        110. Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York

        111. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

        112. What analytics, data mining, big data software you used in the past 12 months for a real project (2012) http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html

        113. Berthold MR, Cebron N, Dill F, Gabriel TR, Kotter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer

        114. Sallam RL, Richardson J, Hagerty J, Hostmann B (2011) Magic quadrant for business intelligence platforms. CT, Gartner Group, Stamford

        115. Beyond the PC. Special Report on Personal Technology (2011)

        116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A et al (2011) The iplant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

        117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on Software quality assurance. ACM, pp 70–77

        118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers. ACM, pp 277–286

        119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Record 34(2):18–26

        120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM Sigmod Record 33(1):50–57

        121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Transac Manag Inform Syst (TMIS) 3(2):7

        122. Manning CD, Schutze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press

        123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Transac Neural Netw 13(5):1163–1177

        124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11

        125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117

        126. Konopnicki D, Shmueli O (1995) W3qs: a query system for the world-wide web. In: VLDB, vol 95, pp 54–65

        127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640

        128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval. ACM, p 2

        129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25

        130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19

        131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819

        132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939

        133. Barragans-Martínez AB, Costa-Montenegro E, Burguillo JC, Rey-Lopez M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend tv programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311

        134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91

        135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia. ACM, pp 469–478

        136. Hirsch JE (2005) An index to quantify an individual’s scientific research output. Proc Natl Acad Sci USA 102(46):16569

        137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company

        138. Aggarwal CC (2011) An introduction to social network data analytics. Springer

        139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1046–1054

        140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied Computing. ACM, pp 1115–1116


        141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Transac Knowl Discov Data (TKDD) 5(2):10

        142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on World wide web. ACM, pp 631–640

        143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis. ACM, pp 16–25

        144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. ACM, pp 315–321

        145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 145–158

        146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 131–144

        147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1007–1016

        148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 807–816

        149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on Web search and data mining. ACM, pp 657–666

        150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360

        151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (accessed 5 May 2013)

        152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51

        153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive rfid, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 2

        154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval. ACM, pp 469–478

        155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated, multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium. ACM, pp 445–454

        156. Tasevski P (2011) Password attacks and generation strategies. Tartu University: Faculty of Mathematics and Computer Sciences



          called such transformation “The Fourth Paradigm” [23]. He also thought the only way to cope with such a paradigm was to develop a new generation of computing tools to manage, visualize, and analyze massive data. In June 2011, another milestone event occurred: EMC/IDC published a research report titled Extracting Values from Chaos [1], which introduced the concept and potential of big data for the first time. This research report triggered great interest in big data in both industry and academia.

          Over the past few years, nearly all major companies, including EMC, Oracle, IBM, Microsoft, Google, Amazon, and Facebook, have started their big data projects. Taking IBM as an example, since 2005 IBM has invested USD 16 billion in 30 acquisitions related to big data. In academia, big data was also under the spotlight. In 2008, Nature published a big data special issue. In 2011, Science also launched a special issue on the key technologies of “data processing” in big data. In 2012, European Research Consortium for Informatics and Mathematics (ERCIM) News published a special issue on big data. At the beginning of 2012, a report titled Big Data, Big Impact, presented at the Davos Forum in Switzerland, announced that big data has become a new kind of economic asset, just like currency or gold. Gartner, an international research agency, issued Hype Cycles from 2012 to 2013, which classified big data computing, social analysis, and stored data analysis among the 48 emerging technologies that deserve the most attention.

          Many national governments, such as that of the US, also paid great attention to big data. In March 2012, the Obama Administration announced a USD 200 million investment to launch the “Big Data Research and Development Plan,” which was the second major scientific and technological development initiative after the “Information Highway” initiative in 1993. In July 2012, the “Vigorous ICT Japan” project issued by Japan’s Ministry of Internal Affairs and Communications indicated that big data development should be a national strategy and that application technologies should be the focus. In July 2012, the United Nations issued the Big Data for Development report, which summarized how governments utilized big data to better serve and protect their people.

          1.5 Challenges of big data

          The sharply increasing data deluge in the big data era brings huge challenges for data acquisition, storage, management, and analysis. Traditional data management and analysis systems are based on the relational database management system (RDBMS). However, such RDBMSs apply only to structured data, not to semi-structured or unstructured data. In addition, RDBMSs increasingly require more and more expensive hardware. It is apparent that traditional RDBMSs cannot handle the huge volume and heterogeneity of big data. The research community has proposed some solutions from different perspectives. For example, cloud computing is utilized to meet the infrastructure requirements of big data, e.g., cost efficiency, elasticity, and smooth upgrading/downgrading. For permanent storage and management of large-scale disordered datasets, distributed file systems [24] and NoSQL [25] databases are good choices. Such programming frameworks have achieved great success in processing clustered tasks, especially for webpage ranking. Various big data applications can be developed based on these innovative technologies or platforms. Moreover, it is non-trivial to deploy big data analysis systems.
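To make the schema mismatch described above concrete, the following is a minimal Python sketch, not taken from the survey. The record fields, the fixed column set, and the helper names `split_by_schema` and `store_as_documents` are all hypothetical. SQLite is used only as a convenient stand-in for a schema-free document store; it is not a NoSQL system.

```python
import json
import sqlite3

# Hypothetical semi-structured records: each one carries a different set of fields.
records = [
    {"id": 1, "type": "sensor", "temperature": 21.5},
    {"id": 2, "type": "tweet", "text": "big data", "tags": ["survey", "cloud"]},
    {"id": 3, "type": "log", "level": "WARN", "detail": {"code": 42}},
]


def split_by_schema(rows, columns):
    """Return (rows whose keys match the fixed columns exactly, rows that do not)."""
    fits, misfits = [], []
    for row in rows:
        (fits if set(row) == set(columns) else misfits).append(row)
    return fits, misfits


def store_as_documents(rows):
    """Store every record as a JSON document keyed by id, with no fixed schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
    conn.executemany(
        "INSERT INTO docs (id, body) VALUES (?, ?)",
        [(row["id"], json.dumps(row)) for row in rows],
    )
    return conn


# A fixed relational schema (assumed here) accommodates only the first record.
fixed_columns = ("id", "type", "temperature")
fits, misfits = split_by_schema(records, fixed_columns)

# Document-style storage keeps all three records intact.
conn = store_as_documents(records)
stored = [json.loads(body) for (body,) in conn.execute("SELECT body FROM docs ORDER BY id")]
```

In this toy setting only one of the three records fits the fixed columns, while the document-style table returns all of them unchanged. The trade-off is that the database can no longer enforce or index individual fields without extra work.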

          Some literature [26–28] discusses obstacles in the development of big data applications. The key challenges are listed as follows:

          – Data representation: many datasets have certain levels of heterogeneity in type, structure, semantics, organization, granularity, and accessibility. Data representation aims to make data more meaningful for computer analysis and user interpretation. Nevertheless, an improper data representation will reduce the value of the original data and may even obstruct effective data analysis. Efficient data representation shall reflect data structure, class, and type, as well as integrated technologies, so as to enable efficient operations on different datasets.

          – Redundancy reduction and data compression: generally, there is a high level of redundancy in datasets. Redundancy reduction and data compression are effective in reducing the indirect cost of the entire system, on the premise that the potential value of the data is not affected. For example, most data generated by sensor networks are highly redundant and may be filtered and compressed by orders of magnitude.

          – Data life cycle management: compared with the relatively slow advances of storage systems, pervasive sensing and computing are generating data at unprecedented rates and scales. We are confronted with many pressing challenges, one of which is that current storage systems cannot support such massive data. Generally speaking, the value hidden in big data depends on data freshness. Therefore, a data importance principle related to the analytical value should be developed to decide which data shall be stored and which data shall be discarded.

          – Analytical mechanism: the analytical system of big data shall process masses of heterogeneous data within a limited time. However, traditional RDBMSs are strictly designed, with a lack of scalability and expandability, and cannot meet the performance requirements. Non-relational databases have shown their unique advantages in the processing of unstructured data and have started to become mainstream in big data analysis. Even so, non-relational databases still have some problems in their performance and in particular applications. We shall find a compromise solution between RDBMSs and non-relational databases. For example, some enterprises (e.g., Facebook and Taobao) have utilized a mixed database architecture that integrates the advantages of both types of database. More research is needed on in-memory databases and on sample data based on approximate analysis.

          – Data confidentiality: most big data service providers or owners at present cannot effectively maintain and analyze such huge datasets because of their limited capacity. They must rely on professionals or tools to analyze such data, which increases the potential safety risks. For example, the transactional dataset generally includes a set of complete operating data to drive key business processes. Such data contains details of the lowest granularity and some sensitive information such as credit card numbers. Therefore, analysis of big data may be delivered to a third party for processing only when proper preventive measures are taken to protect such sensitive data and ensure its safety.

          – Energy management: the energy consumption of mainframe computing systems has drawn much attention from both economic and environmental perspectives. With the increase of data volume and analytical demands, the processing, storage, and transmission of big data will inevitably consume more and more electric energy. Therefore, system-level power consumption control and management mechanisms shall be established for big data while expandability and accessibility are ensured.

          – Expandability and scalability: the analytical system of big data must support present and future datasets. The analytical algorithm must be able to process increasingly expanding and more complex datasets.

          – Cooperation: analysis of big data is an interdisciplinary research area, which requires experts in different fields to cooperate to harvest the potential of big data. A comprehensive big data network architecture must be established to help scientists and engineers in various fields access different kinds of data and fully utilize their expertise, so as to cooperate to complete the analytical objectives.
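The redundancy reduction and compression challenge listed above can be illustrated with a small Python sketch. It is an illustration under stated assumptions rather than a method from the survey: the sensor trace is synthetic, the dead-band threshold of 0.5 is arbitrary, and the helper names are hypothetical. Note that dead-band filtering is lossy, so whether it preserves the "potential value" of the data depends on the application; only the zlib step is lossless.

```python
import json
import zlib


def deadband_filter(readings, threshold):
    """Keep a reading only if it differs from the last kept one by more than threshold."""
    kept = []
    for timestamp, value in readings:
        if not kept or abs(value - kept[-1][1]) > threshold:
            kept.append((timestamp, value))
    return kept


def compress(readings):
    """Serialize readings as JSON and compress them losslessly with zlib."""
    return zlib.compress(json.dumps(readings).encode("utf-8"))


def decompress(blob):
    """Invert compress(); JSON turns tuples into lists, so convert them back."""
    return [tuple(item) for item in json.loads(zlib.decompress(blob).decode("utf-8"))]


# Hypothetical temperature trace: long flat stretches with an occasional step change.
readings = [(t, 20.0 + (t // 250)) for t in range(1000)]

filtered = deadband_filter(readings, threshold=0.5)   # drops repeated values
raw_size = len(json.dumps(readings).encode("utf-8"))  # size before any reduction
blob = compress(filtered)                             # filtered and compressed payload
```

On this synthetic trace the filter keeps only the four readings at which the value changes, and the compressed payload is far smaller than the raw serialization. Real sensor data is noisier, so the achievable reduction would have to be measured per deployment.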

          2 Related technologies

          In order to gain a deep understanding of big data, this section will introduce several fundamental technologies that are closely related to big data, including cloud computing, IoT, data center, and Hadoop.

          2.1 Relationship between cloud computing and big data

          Cloud computing is closely related to big data. The key components of cloud computing are shown in Fig. 3. Big data is the object of the computation-intensive operation and stresses the storage capacity of a cloud system. The main objective of cloud computing is to use huge computing and storage resources under concentrated management, so as to provide big data applications with fine-grained computing capacity. The development of cloud computing provides solutions for the storage and processing of big data. On the other hand, the emergence of big data also accelerates the development of cloud computing. The distributed storage technology based on cloud computing can effectively manage big data; the parallel computing capacity by virtue of cloud computing can improve the efficiency of acquiring and analyzing big data.
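As a rough illustration of how partitioned storage and parallel computing fit together, the Python sketch below splits a dataset into partitions, processes each partition concurrently, and merges the partial results. It is a toy stand-in, not a description of any cloud platform: the input lines are made up, the function names are hypothetical, and local threads merely play the role of cluster nodes (CPython threads do not provide true CPU parallelism for this kind of work).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(items, n_parts):
    """Split items into n_parts roughly equal chunks (round-robin)."""
    return [items[i::n_parts] for i in range(n_parts)]


def count_words(lines):
    """Map step: count words within one partition."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def parallel_word_count(lines, n_workers=4):
    """Run the map step on each partition concurrently, then merge (reduce)."""
    chunks = partition(lines, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(count_words, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical input: two short lines repeated 100 times each.
lines = ["big data needs cloud storage", "cloud computing serves big data"] * 100
result = parallel_word_count(lines)
```

The merged result matches what a single sequential pass would produce, which is the property that lets such a job be spread over many machines. In a real deployment the partitions would live on different storage nodes and the map step would run next to the data.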

          Even though there are many overlapping technologies in cloud computing and big data, they differ in the following two aspects. First, the concepts are different to a certain extent. Cloud computing transforms the IT architecture, while big data influences business decision-making. However, big data depends on cloud computing as the fundamental infrastructure for smooth operation.

Second, big data and cloud computing have different target customers. Cloud computing is a technology and product targeting Chief Information Officers (CIOs) as an advanced IT solution. Big data is a product targeting Chief Executive Officers (CEOs), focusing on business operations. Since the decision makers may directly feel the pressure from market competition, they must defeat business opponents in more competitive ways. With the advances of big data and cloud computing, these two technologies are certainly and increasingly entwined with each other. Cloud computing, with functions similar to those of computers and operating systems, provides system-level resources; big data operates in the upper level supported by cloud computing and provides functions similar to those of a database, along with efficient data processing capacity. Kissinger, President of EMC, indicated that the application of big data must be based on cloud computing.

Fig. 3 Key components of cloud computing

The evolution of big data was driven by the rapid growth of application demands, and cloud computing developed from virtualization technologies. Therefore, cloud computing not only provides computation and processing for big data, but is also itself a service mode. To a certain extent, the advances of cloud computing also promote the development of big data, and the two supplement each other.

2.2 Relationship between IoT and big data

In the IoT paradigm, an enormous number of networked sensors are embedded into various devices and machines in the real world. Such sensors, deployed in different fields, may collect various kinds of data, such as environmental data, geographical data, astronomical data, and logistic data. Mobile equipment, transportation facilities, public facilities, and home appliances could all be data acquisition equipment in IoT, as illustrated in Fig. 4.

The big data generated by IoT has different characteristics compared with general big data because of the different types of data collected, of which the most classical characteristics include heterogeneity, variety, unstructured features, noise, and high redundancy. Although the current IoT data is not the dominant part of big data, by 2030 the quantity of sensors will reach one trillion, and then the IoT data will be the most important part of big data, according to the forecast of HP. A report from Intel pointed out that big data in IoT has three features that conform to the big data paradigm: (i) abundant terminals generating masses of data; (ii) data generated by IoT is usually semi-structured or unstructured; (iii) data of IoT is useful only when it is analyzed.

At present, the data processing capacity of IoT has fallen behind the collected data, and it is extremely urgent to accelerate the introduction of big data technologies to promote the development of IoT. Many operators of IoT realize the importance of big data, since the success of IoT hinges upon the effective integration of big data and cloud computing. The widespread deployment of IoT will also bring many cities into the big data era.

There is a compelling need to adopt big data for IoT applications, while the development of big data is already lagging behind. It has been widely recognized that these two technologies are inter-dependent and should be jointly developed: on one hand, the widespread deployment of IoT drives the high growth of data both in quantity and category, thus providing the opportunity for the application and development of big data; on the other hand, the application of big data technology to IoT also accelerates the research advances and business models of IoT.

Fig. 4 Illustration of data acquisition equipment in IoT


2.3 Data center

In the big data paradigm, the data center is not only a platform for concentrated storage of data, but also undertakes more responsibilities, such as acquiring data, managing data, organizing data, and leveraging the data values and functions. Data centers mainly concern “data” rather than “center.” A data center has masses of data and organizes and manages data according to its core objective and development path, which is more valuable than owning a good site and resources. The emergence of big data brings about sound development opportunities and great challenges to data centers. Big data is an emerging paradigm, which will promote the explosive growth of the infrastructure and related software of data centers. The physical data center network is the core for supporting big data, but at present it is also the key infrastructure that is most urgently required [29].

– Big data requires the data center to provide powerful backstage support. The big data paradigm has more stringent requirements on storage capacity and processing capacity, as well as network transmission capacity. Enterprises must take the development of data centers into consideration to improve the capacity for rapid and effective processing of big data under a limited price/performance ratio. The data center shall provide the infrastructure with a large number of nodes, build a high-speed internal network, effectively dissipate heat, and effectively back up data. Only when a highly energy-efficient, stable, safe, expandable, and redundant data center is built can the normal operation of big data applications be ensured.

– The growth of big data applications accelerates the revolution and innovation of data centers. Many big data applications have developed their unique architectures and directly promote the development of storage, network, and computing technologies related to data centers. With the continued growth of the volumes of structured and unstructured data, and the variety of sources of analytical data, the data processing and computing capacities of the data center shall be greatly enhanced. In addition, as the scale of data centers keeps expanding, how to reduce the operational cost of data centers is also an important issue.

– Big data endows more functions to the data center. In the big data paradigm, the data center shall not only be concerned with hardware facilities, but also strengthen soft capacities, i.e., the capacities of acquisition, processing, organization, analysis, and application of big data. The data center may help business personnel analyze the existing data, discover problems in business operation, and develop solutions from big data.

2.4 Relationship between Hadoop and big data

Presently, Hadoop is widely used in big data applications in the industry, e.g., spam filtering, network searching, clickstream analysis, and social recommendation. In addition, considerable academic research is now based on Hadoop. Some representative cases are given below. As declared in June 2012, Yahoo runs Hadoop on 42,000 servers at four data centers to support its products and services, e.g., searching and spam filtering. At present, the biggest Hadoop cluster has 4,000 nodes, but the number of nodes will be increased to 10,000 with the release of Hadoop 2.0. In the same month, Facebook announced that their Hadoop cluster can process 100 PB of data, which grew by 0.5 PB per day as of November 2012. Some well-known agencies that use Hadoop to conduct distributed computation are listed in [30]. In addition, many companies provide Hadoop commercial execution and/or support, including Cloudera, IBM, MapR, EMC, and Oracle.

Among modern industrial machinery and systems, sensors are widely deployed to collect information for environment monitoring, failure forecasting, etc. Bahga et al. in [31] proposed a framework for data organization and cloud computing infrastructure, termed CloudView. CloudView uses mixed architectures, local nodes, and remote clusters based on Hadoop to analyze machine-generated data. Local nodes are used for the forecast of real-time failures; clusters based on Hadoop are used for complex offline analysis, e.g., case-driven data analysis.

The exponential growth of genome data and the sharp drop of sequencing cost transform bio-science and bio-medicine into data-driven sciences. Gunarathne et al. in [32] utilized cloud computing infrastructures, Amazon AWS and Microsoft Azure, and data processing frameworks based on MapReduce, Hadoop, and Microsoft DryadLINQ, to run two parallel bio-medicine applications: (i) assembly of genome segments; (ii) dimension reduction in the analysis of chemical structure. In the latter application, the 166-D datasets used include 26,000,000 data points. The authors compared the performance of all the frameworks in terms of efficiency, cost, and availability. According to the study, the authors concluded that loose coupling will be increasingly applied to research on electron cloud, and that the parallel programming technology (MapReduce) framework may provide the user an interface with more convenient services and reduce unnecessary costs.
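To make the MapReduce programming model mentioned above more concrete, the following is a minimal single-process sketch in Python of its three steps (map, shuffle, reduce), applied to a clickstream count. It is an illustration only and does not use the Hadoop API; the record format ("user page") and all function names are assumptions introduced here.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map: emit (page, 1) for one clickstream record of the form 'user page'."""
    _user, page = record.split()
    return [(page, 1)]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one key."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on an in-memory list."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical clickstream records (user, requested page).
clicks = ["alice /home", "bob /home", "alice /cart", "carol /home"]
page_counts = run_job(clicks)
```

In a real cluster the map and reduce calls would run on different nodes and the shuffle would move data over the network; here everything runs in one process so that the data flow is easy to follow.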

          3 Big data generation and acquisition

We have introduced several key technologies related to big data, i.e., cloud computing, IoT, data center, and Hadoop. Next, we will focus on the value chain of big data, which can be generally divided into four phases: data generation, data acquisition, data storage, and data analysis. If we take data as a raw material, data generation and data acquisition are an exploitation process, data storage is a storage process, and data analysis is a production process that utilizes the raw material to create new value.

3.1 Data generation

Data generation is the first step of big data. Taking Internet data as an example, a huge amount of data in terms of search entries, Internet forum posts, chat records, and microblog messages is generated. Those data are closely related to people's daily life and share the features of high value and low density. Such Internet data may be valueless individually, but through the exploitation of accumulated big data, useful information such as habits and hobbies of users can be identified, and it is even possible to forecast users' behaviors and emotional moods.

Moreover, generated through longitudinal and/or distributed data sources, datasets are more large-scale, highly diverse, and complex. Such data sources include sensors, videos, clickstreams, and/or all other available data sources. At present, the main sources of big data are the operation and trading information in enterprises, logistic and sensing information in the IoT, human interaction information and position information in the Internet world, and data generated in scientific research, etc. The information far surpasses the capacities of IT architectures and infrastructures of existing enterprises, while its real-time requirement also greatly stresses the existing computing capacity.

3.1.1 Enterprise data

In 2013, IBM issued Analysis the Applications of Big Data to the Real World, which indicates that the internal data of enterprises are the main sources of big data. The internal data of enterprises mainly consists of online trading data and online analysis data, most of which are historically static data and are managed by RDBMSs in a structured manner. In addition, production data, inventory data, sales data, and financial data, etc., also constitute enterprise internal data, which aims to capture informationized and data-driven activities in enterprises, so as to record all activities of enterprises in the form of internal data.

Over the past decades, IT and digital data have contributed a lot to improving the profitability of business departments. It is estimated that the business data volume of all companies in the world may double every 1.2 years [10], in which the business turnover through the Internet, enterprises to enterprises, and enterprises to consumers, per day will reach USD 450 billion [33]. The continuously increasing business data volume requires more effective real-time analysis so as to fully harvest its potential. For example, Amazon processes millions of terminal operations and more than 500,000 queries from third-party sellers per day [12]. Walmart processes one million customer trades per hour, and such trading data are imported into a database with a capacity of over 2.5PB [3]. Akamai analyzes 75 million events per day for its targeted advertisements [13].

3.1.2 IoT data

As discussed, IoT is an important source of big data. Among smart cities constructed based on IoT, big data may come from industry, agriculture, traffic, transportation, medical care, public departments, families, etc.

According to the processes of data acquisition and transmission in IoT, its network architecture may be divided into three layers: the sensing layer, the network layer, and the application layer. The sensing layer is responsible for data acquisition and mainly consists of sensor networks. The network layer is responsible for information transmission and processing, where close transmission may rely on sensor networks, and remote transmission shall depend on the Internet. Finally, the application layer supports specific applications of IoT.

According to the characteristics of the Internet of Things, the data generated from IoT has the following features:

– Large-scale data: in IoT, masses of data acquisition equipment are deployed in a distributed manner, which may acquire simple numeric data, e.g., location, or complex multimedia data, e.g., surveillance video. In order to meet the demands of analysis and processing, not only the currently acquired data, but also the historical data within a certain time frame should be stored. Therefore, data generated by IoT are characterized by large scales.

– Heterogeneity: because of the variety of data acquisition devices, the acquired data is also different, and such data features heterogeneity.

– Strong time and space correlation: in IoT, every data acquisition device is placed at a specific geographic location and every piece of data has a time stamp. The time and space correlation is an important property of data from IoT. During data analysis and processing, time and space are also important dimensions for statistical analysis.

– Effective data accounts for only a small portion of the big data: a great quantity of noise may occur during the acquisition and transmission of data in IoT. Among datasets acquired by acquisition devices, only a small amount of abnormal data is valuable. For example, during the acquisition of traffic video, the few video frames that capture the violation of traffic regulations and traffic accidents are more valuable than those only capturing the normal flow of traffic.
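The time and space correlation and the sparsity of valuable data listed above can be illustrated with a small Python sketch. It groups readings by location and time window, then keeps only the windows whose mean value is abnormal. The reading format, the 60-second window, the threshold, and the location names are all assumptions chosen for illustration, not values from the cited works.

```python
from collections import defaultdict
from statistics import mean


def bucket_readings(readings, window_s=60):
    """Group (timestamp_s, location, value) tuples by (location, window start)."""
    buckets = defaultdict(list)
    for timestamp_s, location, value in readings:
        window_start = (timestamp_s // window_s) * window_s
        buckets[(location, window_start)].append(value)
    return buckets


def abnormal_buckets(buckets, threshold):
    """Keep only the buckets whose mean value exceeds the threshold."""
    means = {key: mean(values) for key, values in buckets.items()}
    return {key: m for key, m in means.items() if m > threshold}


# Hypothetical vehicle-speed readings from two road junctions.
readings = [
    (0, "junction-A", 30), (20, "junction-A", 34),
    (70, "junction-A", 95), (80, "junction-A", 105),
    (10, "junction-B", 28),
]
buckets = bucket_readings(readings)
alerts = abnormal_buckets(buckets, threshold=80)
```

Only one of the three buckets is kept, which mirrors the observation that the valuable portion of IoT data is small compared with the total volume collected.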

3.1.3 Bio-medical data

As a series of high-throughput bio-measurement technologies were innovatively developed at the beginning of the 21st century, the frontier research in the bio-medicine field also entered the era of big data. By constructing smart, efficient, and accurate analytical models and theoretical systems for bio-medicine applications, the essential governing mechanism behind complex biological phenomena may be revealed. Not only can the future development of bio-medicine be determined, but leading roles can also be assumed in the development of a series of important strategic industries related to the national economy, people's livelihood, and national security, with important applications such as medical care, new drug R&D, and grain production (e.g., transgenic crops).

The completion of the HGP (Human Genome Project) and the continued development of sequencing technology also lead to widespread applications of big data in the field. The masses of data generated by gene sequencing go through specialized analysis according to different application demands, to be combined with clinical gene diagnosis and provide valuable information for early diagnosis and personalized treatment of disease. One sequencing of a human gene may generate 100–600GB of raw data. In the China National Genebank in Shenzhen, there are 1.3 million samples, including 1.15 million human samples and 150,000 animal, plant, and microorganism samples. By the end of 2013, 10 million traceable biological samples will be stored, and by the end of 2015, this figure will reach 30 million. It is predictable that, with the development of bio-medicine technologies, gene sequencing will become faster and more convenient, thus making big data of bio-medicine continuously grow beyond all doubt.

In addition, data generated from clinical medical care and medical R&D also rise quickly. For example, the University of Pittsburgh Medical Center (UPMC) has stored 2TB of such data. Explorys, an American company, provides platforms to collocate clinical data, operation and maintenance data, and financial data. At present, about 13 million people's information has been collocated, with 44 articles of data at the scale of about 60TB, which will reach 70TB in 2013. Practice Fusion, another American company, manages electronic medical records of about 200,000 patients.

Apart from such small and medium-sized enterprises, other well-known IT companies, such as Google, Microsoft, and IBM, have invested extensively in the research and computational analysis of methods related to high-throughput biological big data, for shares in the huge market known as the “Next Internet.” IBM forecast at the 2013 Strategy Conference that, with the sharp increase of medical images and electronic medical records, medical professionals may utilize big data to extract useful clinical information from masses of data to obtain a medical history and forecast treatment effects, thus improving patient care and reducing cost. It is anticipated that, by 2015, the average data volume of every hospital will increase from 167TB to 665TB.

3.1.4 Data generation from other fields

As scientific applications are increasing, the scale of datasets is gradually expanding, and the development of some disciplines greatly relies on the analysis of masses of data. Here, we examine several such applications. Although being in different scientific fields, the applications have similar and increasing demands on data analysis. The first example is related to computational biology. GenBank is a nucleotide sequence database maintained by the U.S. National Bio-Technology Innovation Center. Data in this database may double every 10 months. By August 2009, GenBank had more than 250 billion bases from 150,000 different organisms [34]. The second example is related to astronomy. The Sloan Digital Sky Survey (SDSS), the biggest sky survey project in astronomy, recorded 25TB of data from 1998 to 2008. As the resolution of the telescope is improved, by 2004, the data volume generated per night will surpass 20TB. The last application is related to high-energy physics. At the beginning of 2008, the ATLAS experiment of the Large Hadron Collider (LHC) of the European Organization for Nuclear Research generates raw data at 2PB/s and stores about 10TB of processed data per year.
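A fixed doubling period, such as the 10 months quoted above for GenBank, implies exponential growth: after t months the volume is multiplied by 2^(t/T), where T is the doubling period. The short Python sketch below works through this arithmetic. The function names are ours, and the assumption that the doubling period stays constant is a simplification.

```python
import math


def growth_factor(elapsed_months, doubling_months=10.0):
    """Multiplicative growth after elapsed_months for a fixed doubling period."""
    return 2.0 ** (elapsed_months / doubling_months)


def months_to_reach(multiple, doubling_months=10.0):
    """Months needed for the volume to grow by the given multiple."""
    return doubling_months * math.log2(multiple)


# Three doubling periods give a factor of 2 * 2 * 2 = 8.
factor_30_months = growth_factor(30)

# A thousandfold increase takes about ten doublings, i.e. roughly 100 months.
months_for_1000x = months_to_reach(1000)
```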

In addition, pervasive sensing and computing among nature, commercial, Internet, government, and social environments are generating heterogeneous data with unprecedented complexity. These datasets have their unique data characteristics in scale, time dimension, and data category. For example, mobile data were recorded with respect to positions, movement, approximation degrees, communications, multimedia, use of applications, and audio environment [108]. According to the application environment and requirements, such datasets can be classified into different categories, so as to select the proper and feasible solutions for big data.

3.2 Big data acquisition

As the second phase of the big data system, big data acquisition includes data collection, data transmission, and data pre-processing. During big data acquisition, once we collect the raw data, we shall utilize an efficient transmission mechanism to send it to a proper storage management system to support different analytical applications. The collected datasets may sometimes include much redundant or useless data, which unnecessarily increases storage space and affects the subsequent data analysis. For example, high redundancy is very common among datasets collected by sensors for environment monitoring. Data compression technology can be applied to reduce the redundancy. Therefore, data pre-processing operations are indispensable to ensure efficient data storage and exploitation.
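As a small illustration of how compression reduces redundancy in sensor data, the following Python sketch compresses a highly repetitive temperature log with the standard zlib module and checks that the data is restored exactly. The log format and the sensor name are hypothetical, and the achieved ratio depends entirely on the data.

```python
import zlib

# Hypothetical temperature log: one sensor reporting a constant value each minute.
readings = ["2014-01-01T00:%02d:00,sensor-7,21.5" % minute for minute in range(60)]
raw = "\n".join(readings).encode("ascii")

# Lossless compression: repeated substrings are replaced by back-references.
compressed = zlib.compress(raw, level=9)
restored = zlib.decompress(compressed)

# Compression ratio (original size divided by compressed size).
ratio = len(raw) / len(compressed)
```

Because the compression is lossless, the stored form occupies less space while the subsequent analysis still sees the original readings.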

3.2.1 Data collection

Data collection is to utilize special data collection techniques to acquire raw data from a specific data generation environment. Four common data collection methods are shown as follows.

– Log files: As one widely used data collection method, log files are record files automatically generated by the data source system, so as to record activities in designated file formats for subsequent analysis. Log files are typically used in nearly all digital devices. For example, web servers record in log files the number of clicks, click rates, visits, and other property records of web users [35]. To capture activities of users at the web sites, web servers mainly include the following three log file formats: public log file format (NCSA), expanded log format (W3C), and IIS log format (Microsoft). All the three types of log files are in the ASCII text format. Databases other than text files may sometimes be used to store log information to improve the query efficiency of the massive log store [36, 37]. There are also some other log files based on data collection, including stock indicators in financial applications and determination of operating states in network monitoring and traffic management.

– Sensing: Sensors are common in daily life to measure physical quantities and transform physical quantities into readable digital signals for subsequent processing (and storage). Sensory data may be classified as sound wave, voice, vibration, automobile, chemical, current, weather, pressure, temperature, etc. Sensed information is transferred to a data collection point through wired or wireless networks. For applications that may be easily deployed and managed, e.g., video surveillance systems [38], the wired sensor network is a convenient solution to acquire related information. Sometimes the accurate position of a specific phenomenon is unknown, and sometimes the monitored environment does not have the energy or communication infrastructures. Then wireless communication must be used to enable data transmission among sensor nodes under limited energy and communication capability. In recent years, wireless sensor networks (WSNs) have received considerable interest and have been applied to many applications, such as environmental research [39, 40], water quality monitoring [41], civil engineering [42, 43], and wildlife habitat monitoring [44]. A WSN generally consists of a large number of geographically distributed sensor nodes, each being a micro device powered by battery. Such sensors are deployed at designated positions as required by the application to collect remote sensing data. Once the sensors are deployed, the base station will send control information for network configuration/management or data collection to sensor nodes. Based on such control information, the sensory data is assembled in different sensor nodes and sent back to the base station for further processing. Interested readers are referred to [45] for more detailed discussions.

– Methods for acquiring network data: At present, network data acquisition is accomplished using a combination of web crawler, word segmentation system, task system, and index system, etc. A web crawler is a program used by search engines for downloading and storing web pages [46]. Generally speaking, a web crawler starts from the uniform resource locator (URL) of an initial web page to access other linked web pages, during which it stores and sequences all the retrieved URLs. The web crawler acquires a URL in the order of precedence through a URL queue, then downloads web pages, identifies all URLs in the downloaded web pages, and extracts new URLs to be put in the queue. This process is repeated until the web crawler is stopped. Data acquisition through a web crawler is widely applied in applications based on web pages, such as search engines or web caching. Traditional web page extraction technologies feature multiple efficient solutions, and considerable research has been done in this field. As more advanced web page applications are emerging, some extraction strategies are proposed in [47] to cope with rich Internet applications.

The current network data acquisition technologies mainly include traditional Libpcap-based packet capture technology, zero-copy packet capture technology, as well as some specialized network monitoring software such as Wireshark, SmartSniff, and WinNetCap.

– Libpcap-based packet capture technology: Libpcap (packet capture library) is a widely used network data packet capture function library. It is a general tool that does not depend on any specific system and is mainly used to capture data in the data link layer. It features simplicity, ease of use, and portability, but has a relatively low efficiency. Therefore, under a high-speed network environment, considerable packet losses may occur when Libpcap is used.


– Zero-copy packet capture technology: The so-called zero-copy (ZC) means that no copies between any internal memories occur during packet receiving and sending at a node. In sending, the data packets directly start from the user buffer of applications, pass through the network interfaces, and arrive at an external network. In receiving, the network interfaces directly send data packets to the user buffer. The basic idea of zero-copy is to reduce data copy times, reduce system calls, and reduce CPU load while datagrams are passed from network equipment to user program space. The zero-copy technology first utilizes direct memory access (DMA) technology to directly transmit network datagrams to an address space pre-allocated by the system kernel, so as to avoid the participation of the CPU. In the meanwhile, it maps the internal memory of the datagrams in the system kernel to that of the detection program, or builds a cache region in the user space and maps it to the kernel space. Then the detection program directly accesses the internal memory, so as to reduce internal memory copies from system kernel to user space and reduce the number of system calls.

– Mobile equipment: At present, mobile devices are more widely used. As mobile device functions become increasingly stronger, they feature more complex and multiple means of data acquisition as well as a greater variety of data. Mobile devices may acquire geographical location information through positioning systems; acquire audio information through microphones; acquire pictures, videos, streetscapes, two-dimensional barcodes, and other multimedia information through cameras; and acquire user gestures and other body language information through touch screens and gravity sensors. Over the years, wireless operators have improved the service level of the mobile Internet by acquiring and analyzing such information. For example, the iPhone itself is a “mobile spy.” It may collect wireless data and geographical location information, and then send such information back to Apple Inc. for processing, of which the user is not aware. Apart from Apple, smart phone operating systems such as Android of Google and Windows Phone of Microsoft can also collect information in a similar manner.
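The queue-driven web crawler procedure described in the list above (take a URL from the queue, download the page, extract its URLs, enqueue the new ones) can be sketched in Python as follows. This is a minimal illustration, not a production crawler: the fetch function is passed in as a parameter, and the three-page site is a hypothetical in-memory stand-in for the network so that the example runs without any connection.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch, max_pages=100):
    """Breadth-first crawl; fetch(url) returns HTML text or None."""
    queue = deque([seed_url])   # URLs waiting to be downloaded
    seen = {seed_url}           # URLs already queued, to avoid repeats
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue            # page could not be retrieved
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# Hypothetical three-page site held in memory instead of the real network.
FAKE_WEB = {
    "http://example.test/": '<a href="http://example.test/a">A</a> '
                            '<a href="http://example.test/b">B</a>',
    "http://example.test/a": '<a href="http://example.test/">home</a>',
    "http://example.test/b": '<a href="http://example.test/missing">gone</a>',
}
crawled = crawl("http://example.test/", FAKE_WEB.get)
visit_order = list(crawled)
```

A real crawler would additionally resolve relative URLs, respect robots.txt, and limit its request rate; these are omitted here to keep the queue logic visible.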

In addition to the aforementioned three data acquisition methods of main data sources, there are many other data collection methods or systems. For example, in scientific experiments, many special tools can be used to collect experimental data, such as magnetic spectrometers and radio telescopes. We may classify data collection methods from different perspectives. From the perspective of data sources, data collection methods can be classified into two categories: collection methods recording through data sources and collection methods recording through other auxiliary tools.
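As an example of a collection method that records through the data source itself, the web server log files discussed earlier in this section can be parsed with a few lines of Python. The sketch below assumes the NCSA Common Log Format (host, ident, user, time, request, status, size); the sample lines and addresses are invented for illustration.

```python
import re
from collections import Counter

# NCSA Common Log Format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # A size of "-" means that no response body was sent.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


# Invented sample lines; the last one is deliberately malformed.
sample_log = [
    '192.0.2.1 - - [10/Oct/2013:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.2 - alice [10/Oct/2013:13:56:01 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.1 - - [10/Oct/2013:13:57:12 -0700] "GET /missing HTTP/1.0" 404 -',
    'not a log line',
]
entries = [e for e in map(parse_line, sample_log) if e is not None]
hits_per_path = Counter(e["request"].split()[1] for e in entries)
```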

3.2.2 Data transportation

Upon the completion of raw data collection, data will be transferred to a data storage infrastructure for processing and analysis. As discussed in Section 2.3, big data is mainly stored in a data center. The data layout should be adjusted to improve computing efficiency or facilitate hardware maintenance. In other words, internal data transmission may occur in the data center. Therefore, data transmission consists of two phases: Inter-DCN transmissions and Intra-DCN transmissions, where DCN denotes the data center network.

– Inter-DCN transmissions: Inter-DCN transmissions are from the data source to the data center, which is generally achieved with the existing physical network infrastructure. Because of the rapid growth of traffic demands, the physical network infrastructure in most regions around the world is constituted by high-volume, high-rate, and cost-effective optic fiber transmission systems. Over the past 20 years, advanced management equipment and technologies have been developed, such as IP-based wavelength division multiplexing (WDM) network architecture, to conduct smart control and management of optical fiber networks [48, 49]. WDM is a technology that multiplexes multiple optical carrier signals with different wavelengths and couples them to the same optical fiber of the optical link. In such technology, lasers with different wavelengths carry different signals. By far, the backbone network has been deployed with WDM optical transmission systems with a single channel rate of 40Gb/s. At present, 100Gb/s commercial interfaces are available, and 100Gb/s systems (or Tb/s systems) will be available in the near future [50]. However, traditional optical transmission technologies are limited by the bandwidth of the electronic bottleneck [51]. Recently, orthogonal frequency-division multiplexing (OFDM), initially designed for wireless systems, is regarded as one of the main candidate technologies for future high-speed optical transmission. OFDM is a multi-carrier parallel transmission technology. It segments a high-speed data flow to transform it into low-speed sub-data-flows to be transmitted over multiple orthogonal sub-carriers [52]. Compared with the fixed channel spacing of WDM, OFDM allows sub-channel frequency spectrums to overlap with each other [53]. Therefore, it is a flexible and efficient optical networking technology.

– Intra-DCN transmissions: Intra-DCN transmissions are the data communication flows within data centers. Intra-DCN transmissions depend on the communication mechanism within the data center (i.e., on physical connection plates, chips, internal memories of data servers, network architectures of data centers, and communication protocols). A data center consists of multiple integrated server racks interconnected with its internal connection networks. Nowadays, the internal connection networks of most data centers are fat-tree, two-layer or three-layer structures based on multi-commodity network flows [51, 54]. In the two-layer topological structure, the racks are connected by 1Gbps top rack switches (TOR), and then such top rack switches are connected with 10Gbps aggregation switches in the topological structure. The three-layer topological structure is a structure augmented with one layer on the top of the two-layer topological structure, and such layer is constituted by 10Gbps or 100Gbps core switches to connect aggregation switches in the topological structure. There are also other topological structures which aim to improve the data center networks [55–58]. Because of the inadequacy of electronic packet switches, it is difficult to increase communication bandwidths while keeping energy consumption low. Over the years, due to the huge success achieved by optical technologies, the optical interconnection among the networks in data centers has drawn great interest. Optical interconnection is a high-throughput, low-delay, and low-energy-consumption solution. At present, optical technologies are only used for point-to-point links in data centers. Such optical links provide connection for the switches using the low-cost multi-mode fiber (MMF) with 10Gbps data rate. Optical interconnection (switching in the optical domain) of networks in data centers is a feasible solution, which can provide Tbps-level transmission bandwidth with low energy consumption. Recently, many optical interconnection plans have been proposed for data center networks [59]. Some plans add optical paths to upgrade the existing networks, and other plans completely replace the current switches [59–64]. As a strengthening technology, Zhou et al. in [65] adopt wireless links in the 60GHz frequency band to strengthen wired links. Network virtualization should also be considered to improve the efficiency and utilization of data center networks.

          3.2.3 Data pre-processing

          Because of the wide variety of data sources, the collected datasets vary with respect to noise, redundancy, consistency, etc., and it is undoubtedly a waste to store meaningless data. In addition, some analytical methods have strict requirements on data quality. Therefore, to enable effective data analysis, we shall in many circumstances pre-process data to integrate the data from different sources, which not only reduces storage expense but also improves analysis accuracy. Some relational data pre-processing techniques are discussed as follows.

          – Integration: data integration is the cornerstone of modern commercial informatics. It involves combining data from different sources and providing users with a uniform view of the data [66]. This is a mature research field for traditional databases. Historically, two methods have been widely recognized: data warehousing and data federation. Data warehousing includes a process named ETL (Extract, Transform, and Load). Extraction involves connecting to source systems and selecting, collecting, analyzing, and processing the necessary data. Transformation is the execution of a series of rules that transform the extracted data into standard formats. Loading means importing the extracted and transformed data into the target storage infrastructure. Loading is the most complex procedure among the three, and includes operations such as transformation, copy, clearing, standardization, screening, and data organization. In data federation, a virtual database can be built to query and aggregate data from different data sources, but such a database does not contain data. Instead, it holds information or metadata related to the actual data and its locations. These two “storage-reading” approaches do not satisfy the high performance requirements of data flows or of search programs and applications. Compared with queries, data in these two approaches is more dynamic and must be processed during data transmission. Generally, data integration methods are therefore accompanied by flow processing engines and search engines [30, 67].
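
The three ETL steps described above can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the CSV text stands in for a source system, the `sales` table and the cleaning rules are hypothetical, and an in-memory SQLite database plays the role of the target storage infrastructure.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real pipeline would read from a source system.
RAW_CSV = """id,name,amount
1, Alice ,10.50
2,BOB,7
3,carol,not-a-number
"""

def extract(text):
    """Extraction: select and collect the necessary records from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transformation: apply rules that bring records into a standard format."""
    out = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # illustrative rule: drop records with a non-numeric amount
        out.append((int(row["id"]), row["name"].strip().title(), amount))
    return out

def load(records, conn):
    """Loading: import the transformed records into the target store."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(id INTEGER PRIMARY KEY, name TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
rows = conn.execute("SELECT id, name, amount FROM sales ORDER BY id").fetchall()
print(rows)
```

Keeping the three steps as separate functions mirrors the separation in the text; in this sketch the malformed third record is rejected during transformation, so only two rows reach the target table.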

          – Cleaning: data cleaning is a process that identifies inaccurate, incomplete, or unreasonable data and then modifies or deletes such data to improve data quality. Generally, data cleaning includes five complementary procedures [68]: defining and determining error types, searching for and identifying errors, correcting errors, documenting error examples and error types, and modifying data entry procedures to reduce future errors. During cleaning, data formats, completeness, rationality, and restrictions shall be inspected. Data cleaning is vital for maintaining data consistency and is widely applied in many fields, such as banking, insurance, retail, telecommunications, and traffic control.

          In e-commerce, most data is collected electronically and may have serious data quality problems. Classic data quality problems mainly come from software defects, customized errors, or system misconfiguration. Authors in [69] discussed data cleaning in e-commerce by crawlers and by regularly re-copying customer and account information.

          In [70], the problem of cleaning RFID data was examined. RFID is widely used in many applications, e.g., inventory management and target tracking. However, raw RFID data is of low quality and includes a lot of abnormal data, because it is limited by the physical design and affected by environmental noise. In [71], a probability model was developed to cope with data loss in mobile environments. Khoussainova et al. in [72] proposed a system that automatically corrects errors in input data by defining global integrity constraints.

          Herbert et al. [73] proposed a framework called BIO-AJAX to standardize biological data so as to enable further computation and improve search quality. With BIO-AJAX, some errors and repetitions may be eliminated, and common data mining technologies can be executed more effectively.

          – Redundancy elimination: data redundancy refers to data repetition or surplus, which occurs in many datasets. Data redundancy can increase unnecessary data transmission expense and cause defects in storage systems, e.g., waste of storage space, data inconsistency, reduced data reliability, and data damage. Therefore, various redundancy reduction methods have been proposed, such as redundancy detection, data filtering, and data compression. Such methods may apply to different datasets or application environments. However, redundancy reduction may also bring about certain negative effects. For example, data compression and decompression cause additional computational burden. Therefore, the benefits of redundancy reduction should be carefully balanced against its cost. Data collected from different fields will increasingly appear in image or video formats. It is well known that images and videos contain considerable redundancy, including temporal redundancy, spatial redundancy, statistical redundancy, and sensing redundancy. Video compression is widely used to reduce redundancy in video data, as specified in the many video coding standards (MPEG-2, MPEG-4, H.263, and H.264/AVC). In [74], the authors investigated the problem of video compression in a video surveillance system with a video sensor network. The authors propose a new MPEG-4 based method that exploits the contextual redundancy related to the background and foreground in a scene. The low complexity and the low compression ratio of the proposed approach were demonstrated by the evaluation results.

          In generalized data transmission or storage, repeated data deletion is a special data compression technology that aims to eliminate repeated copies of data [75]. With repeated data deletion, individual data blocks or data segments are assigned identifiers (e.g., using a hash algorithm) and stored, with the identifiers added to an identification list. As the analysis of repeated data deletion continues, if a new data block has an identifier identical to one already in the identification list, the new data block is deemed redundant and is replaced by the corresponding stored data block. Repeated data deletion can greatly reduce storage requirements, which is particularly important for a big data storage system.

          Apart from the aforementioned data pre-processing methods, specific data objects shall go through some other operations, such as feature extraction. Such operations play an important role in multimedia search and DNA analysis [76–78]. Usually, high-dimensional feature vectors (or high-dimensional feature points) are used to describe such data objects, and the system stores the feature vectors for future retrieval. Data transfer is usually used to process distributed heterogeneous data sources, especially business datasets [79]. As a matter of fact, given the variety of datasets, it is non-trivial or even impossible to build a uniform data pre-processing procedure and technology applicable to all types of datasets. The specific features, problem, performance requirements, and other factors of the datasets should be considered so as to select a proper data pre-processing strategy.
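
The repeated data deletion procedure described above (hash identifiers per block plus an identification list) can be sketched as follows. This is only an illustration under stated assumptions: the `DedupStore` class, the fixed 4-byte block size, and the choice of SHA-256 as the hash algorithm are all hypothetical and are not taken from [75].

```python
import hashlib

class DedupStore:
    """Toy block store that keeps a single copy of each distinct block."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}  # identifier -> stored block (the identification list)

    def put(self, data: bytes):
        """Split data into blocks and return the list of block identifiers."""
        recipe = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            ident = hashlib.sha256(block).hexdigest()
            if ident not in self.blocks:
                self.blocks[ident] = block  # first occurrence: store the block
            # a repeated block is not stored again, only referenced
            recipe.append(ident)
        return recipe

    def get(self, recipe):
        """Rebuild the original data from its list of identifiers."""
        return b"".join(self.blocks[ident] for ident in recipe)

store = DedupStore(block_size=4)
recipe = store.put(b"ABCDABCDABCDWXYZ")
print(len(recipe), "blocks referenced,", len(store.blocks), "blocks stored")
```

In this example the input consists of four blocks, three of which are identical, so only two blocks are physically stored while the data can still be reconstructed from the identifier list.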

          4 Big data storage

          The explosive growth of data imposes stricter requirements on storage and management. In this section, we focus on the storage of big data. Big data storage refers to the storage and management of large-scale datasets while achieving reliability and availability of data access. We will review important issues including massive storage systems, distributed storage systems, and big data storage mechanisms. On one hand, the storage infrastructure needs to provide an information storage service with reliable storage space; on the other hand, it must provide a powerful access interface for the query and analysis of a large amount of data.

          Traditionally, as auxiliary equipment of servers, data storage devices are used to store, manage, look up, and analyze data with structured RDBMSs. With the sharp growth of data, data storage devices are becoming increasingly important, and many Internet companies pursue large storage capacity to be competitive. Therefore, there is a compelling need for research on data storage.

          4.1 Storage system for massive data

          Various storage systems have emerged to meet the demands of massive data. Existing massive storage technologies can be classified as Direct Attached Storage (DAS) and network storage, while network storage can be further classified into Network Attached Storage (NAS) and Storage Area Network (SAN).

          In DAS, various hard disks are directly connected to servers, and data management is server-centric, such that storage devices are peripheral equipment, each of which takes a certain amount of I/O resource and is managed by individual application software. For this reason, DAS is only suitable for interconnecting servers on a small scale. Moreover, due to its low scalability, DAS exhibits undesirable efficiency when the storage capacity is increased, i.e., upgradeability and expandability are greatly limited. Thus, DAS is mainly used in personal computers and small-sized servers.

          Network storage utilizes a network to provide users with a unified interface for data access and sharing. Network storage equipment includes special data exchange equipment, disk arrays, tape libraries, and other storage media, as well as special storage software. It is characterized by strong expandability.

          NAS is actually auxiliary storage equipment of a network. It is directly connected to a network through a hub or switch using TCP/IP protocols. In NAS, data is transmitted in the form of files. Compared to DAS, the I/O burden at a NAS server is reduced extensively, since the server accesses a storage device indirectly through a network.

          While NAS is network-oriented, SAN is specially designed for data storage with a scalable and bandwidth-intensive network, e.g., a high-speed network with optical fiber connections. In SAN, data storage management is relatively independent within a storage local area network, where multipath-based data switching among any internal nodes is utilized to achieve a maximum degree of data sharing and data management.

          In terms of the organization of a data storage system, DAS, NAS, and SAN can all be divided into three parts: (i) the disk array, which is the foundation of a storage system and the fundamental guarantee for data storage; (ii) connection and network sub-systems, which provide connection among one or more disk arrays and servers; (iii) storage management software, which handles data sharing, disaster recovery, and other storage management tasks of multiple servers.

          4.2 Distributed storage system

          The first challenge brought about by big data is how to develop a large-scale distributed storage system for efficient data processing and analysis. To use a distributed system to store massive data, the following factors should be taken into consideration:

          – Consistency: a distributed storage system requires multiple servers to cooperatively store data. As there are more servers, the probability of server failures is larger. Usually, data is divided into multiple pieces to be stored at different servers to ensure availability in case of server failure. However, server failures and parallel storage may cause inconsistency among different copies of the same data. Consistency refers to assuring that multiple copies of the same data are identical.

          – Availability: a distributed storage system operates on multiple sets of servers. As more servers are used, server failures are inevitable. It is desirable that the entire system is not seriously affected, so that it can still satisfy customers' read and write requests. This property is called availability.

          – Partition tolerance: multiple servers in a distributed storage system are connected by a network. The network could have link/node failures or temporary congestion. The distributed system should have a certain level of tolerance to problems caused by network failures. It is desirable that the distributed storage still works well when the network is partitioned.

          Eric Brewer proposed the CAP theory [80, 81] in 2000, which states that a distributed system cannot simultaneously meet the requirements of consistency, availability, and partition tolerance; at most two of the three requirements can be satisfied simultaneously. Seth Gilbert and Nancy Lynch from MIT proved the correctness of the CAP theory in 2002. Since consistency, availability, and partition tolerance cannot be achieved simultaneously, we can have a CA system by ignoring partition tolerance, a CP system by ignoring availability, or an AP system by ignoring consistency, according to different design goals. The three systems are discussed in the following.

          CA systems do not have partition tolerance, i.e., they cannot handle network failures. Therefore, CA systems are generally deemed to be storage systems with a single server, such as traditional small-scale relational databases. Such systems keep a single copy of the data, so consistency is easily ensured. Availability is guaranteed by the excellent design of relational databases. However, since CA systems cannot handle network failures, they cannot be expanded to use many servers. Therefore, most large-scale storage systems are CP systems or AP systems.

          Compared with CA systems, CP systems ensure partition tolerance. Therefore, CP systems can be expanded to become distributed systems. CP systems generally maintain several copies of the same data in order to ensure a level of fault tolerance. CP systems also ensure data consistency, i.e., multiple copies of the same data are guaranteed to be completely identical. However, CP systems cannot ensure sound availability because of the high cost of consistency assurance. Therefore, CP systems are useful for scenarios with moderate load but stringent requirements on data accuracy (e.g., trading data). BigTable and HBase are two popular CP systems.

          AP systems also ensure partition tolerance. However, AP systems differ from CP systems in that they also ensure availability, while providing only eventual consistency rather than the strong consistency of the previous two systems. Therefore, AP systems only apply to scenarios with frequent requests but without very high requirements on accuracy. For example, in online Social Networking Services (SNS) systems, there are many concurrent visits to the data, but a certain amount of data errors is tolerable. Furthermore, because AP systems ensure eventual consistency, accurate data can still be obtained after a certain amount of delay. Therefore, AP systems may also be used in circumstances with no stringent real-time requirements. Dynamo and Cassandra are two popular AP systems.

          4.3 Storage mechanism for big data

          Considerable research on big data has promoted the development of storage mechanisms for big data. Existing storage mechanisms for big data may be classified into three bottom-up levels: (i) file systems, (ii) databases, and (iii) programming models.

          File systems are the foundation of the applications at upper levels. Google's GFS is an expandable distributed file system that supports large-scale, distributed, data-intensive applications [25]. GFS uses cheap commodity servers to achieve fault tolerance and provides customers with high-performance services. GFS supports large-scale file applications with more frequent reading than writing. However, GFS also has some limitations, such as a single point of failure and poor performance for small files. Such limitations have been overcome by Colossus [82], the successor of GFS.

          In addition, other companies and researchers have their own solutions to meet the different demands for storage of big data. For example, HDFS and Kosmosfs are open-source derivatives of GFS. Microsoft developed Cosmos [83] to support its search and advertisement business. Facebook utilizes Haystack [84] to store its large amount of small-sized photos. Taobao also developed TFS and FastDFS. In conclusion, distributed file systems have become relatively mature after years of development and business operation. Therefore, we will focus on the other two levels in the rest of this section.

          4.3.1 Database technology

          Database technology has been evolving for more than 30 years. Various database systems have been developed to handle datasets at different scales and to support various applications. Traditional relational databases cannot meet the challenges of variety and scale brought about by big data. NoSQL databases (i.e., non-traditional relational databases) are becoming more popular for big data storage. NoSQL databases feature flexible schemas, support for simple and easy replication, simple APIs, eventual consistency, and support for large data volumes. NoSQL databases are becoming the core technology for big data. We will examine the following three main NoSQL databases in this section: key-value databases, column-oriented databases, and document-oriented databases, each based on certain data models.

          – Key-value databases: key-value databases are built on a simple data model in which data is stored as key-value pairs. Every key is unique, and customers may query values by their keys. Such databases feature a simple structure, and modern key-value databases are characterized by higher expandability and shorter query response times than relational databases. Over the past few years, many key-value databases have appeared, motivated by Amazon's Dynamo system [85]. We will introduce Dynamo and several other representative key-value databases.

          – Dynamo: Dynamo is a highly available and expandable distributed key-value data storage system. It is used to store and manage the state of some core services in the Amazon e-commerce platform that can be realized with key access. The public mode of relational databases may generate invalid data and limit data scale and availability, while Dynamo resolves these problems with a simple key-object interface consisting of simple read and write operations. Dynamo achieves elasticity and availability through its data partition, data copy, and object edition mechanisms. The Dynamo partition plan relies on consistent hashing [86] to divide the load among multiple main storage machines; its main advantage is that the arrival or departure of a node only affects directly adjacent nodes and does not affect other nodes. Dynamo copies data to N sets of servers, where N is a configurable parameter, in order to achieve high availability and durability. The Dynamo system also provides eventual consistency, so as to conduct asynchronous updates on all copies.

          – Voldemort: Voldemort is also a key-value storage system, which was initially developed for and is still used by LinkedIn. Keys and values in Voldemort are composite objects constituted by tables and images. The Voldemort interface includes three simple operations, reading, writing, and deletion, all of which are identified by keys. Voldemort provides asynchronous updating with concurrent control of multiple editions but does not ensure data consistency. However, Voldemort supports optimistic locking for consistent multi-record updating: when a conflict happens between an update and any other operation, the update operation quits. The data copy mechanism of Voldemort is the same as that of Dynamo. Voldemort not only stores data in RAM but also allows data to be inserted into a storage engine. In particular, Voldemort supports two storage engines: Berkeley DB and Random Access Files.

          The key-value database emerged a few years ago. Deeply influenced by Amazon's Dynamo, other key-value storage systems include Redis, Tokyo Cabinet and Tokyo Tyrant, Memcached and MemcacheDB, Riak, and Scalaris, all of which provide expandability by distributing keys among nodes. Voldemort, Riak, Tokyo Cabinet, and Memcached can utilize attached storage devices to store data in RAM or on disks. Other storage systems store data in RAM and provide disk backup, or rely on copy and recovery to avoid backup.
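
The consistent hashing idea that Dynamo's partition plan relies on, together with copying each key to N servers, can be sketched as follows. This is a simplified illustration rather than Dynamo's actual implementation: the `HashRing` class and node names are hypothetical, each node occupies a single ring position (no virtual nodes), and MD5 is used only as a convenient hash function.

```python
import bisect
import hashlib

def ring_position(name: str) -> int:
    """Map a node name or a key to a position on the hash ring."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes=()):
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        bisect.insort(self._ring, (ring_position(node), node))

    def remove(self, node):
        self._ring.remove((ring_position(node), node))

    def preference_list(self, key, n=1):
        """Return the n distinct nodes that follow the key clockwise on the ring."""
        positions = [pos for pos, _ in self._ring]
        start = bisect.bisect_right(positions, ring_position(key))
        count = min(n, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]

ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
keys = [f"key-{i}" for i in range(1000)]
before = {k: ring.preference_list(k)[0] for k in keys}

ring.remove("node-c")  # a node leaves the ring
after = {k: ring.preference_list(k)[0] for k in keys}
moved = [k for k in keys if before[k] != after[k]]
replicas = ring.preference_list("key-1", n=3)  # copies on N = 3 servers
print(len(moved), "of", len(keys), "keys changed owner")
```

Only keys that were owned by the departed node are reassigned (to its successor on the ring); keys owned by every other node stay where they were, which is the property highlighted in the text.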

          – Column-oriented databases: column-oriented databases store and process data by columns rather than by rows. Both columns and rows are segmented across multiple nodes to realize expandability. The column-oriented databases are mainly inspired by Google's BigTable. In this section, we first discuss BigTable and then introduce several derivative tools.

          – BigTable: BigTable is a distributed, structured data storage system designed to process large-scale (PB-class) data among thousands of commercial servers [87]. The basic data structure of BigTable is a sparse, distributed, persistent, multi-dimensional sorted map. The map is indexed by row key, column key, and timestamp, and every value in the map is an uninterpreted byte array. Each row key in BigTable is a character string of up to 64 KB. Rows are stored in lexicographical order and continually segmented into Tablets (i.e., units of distribution) for load balancing. Thus, reading a short range of rows can be highly efficient, since it only involves communication with a small portion of machines. The columns are grouped according to the prefixes of their keys, thus forming column families. These column families are the basic units for access control. The timestamps are 64-bit integers that distinguish different editions of cell values. Clients may flexibly determine the number of cell editions stored. These editions are sorted in descending order of timestamps, so the latest edition is always read first.

          The BigTable API features the creation and deletion of tables and column families, as well as modification of the metadata of clusters, tables, and column families. Client applications may insert or delete values in BigTable, query values from columns, or browse sub-datasets in a table. BigTable also supports some other characteristics, such as transaction processing within a single row. Users may utilize such features to conduct more complex data processing.

          Every BigTable deployment includes three main components: a Master server, Tablet servers, and a client library. BigTable allows only one active Master server, which is responsible for assigning Tablets to Tablet servers, detecting added or removed Tablet servers, and balancing load. In addition, it can also modify the BigTable schema, e.g., creating tables and column families, and garbage-collect files saved in GFS, including deleted or disabled files of specific BigTable instances. Every Tablet server manages a set of Tablets and is responsible for the reading and writing of its loaded Tablets. When Tablets grow too big, they are segmented by the server. The application client library is used to communicate with BigTable instances.

          BigTable is based on many fundamental components of Google, including GFS [25], a cluster management system, the SSTable file format, and Chubby [88]. GFS is used to store data and log files. The cluster management system is responsible for task scheduling, resource sharing, processing of machine failures, and monitoring of machine statuses. The SSTable file format is used to store BigTable data internally, and it provides a persistent, sorted, and immutable mapping from keys to values, both of which are arbitrary byte strings. BigTable utilizes Chubby for the following tasks: 1) ensuring there is at most one active Master copy at any time; 2) storing the bootstrap location of BigTable data; 3) looking up Tablet servers; 4) conducting error recovery in case of Tablet server failures; 5) storing BigTable schema information; 6) storing the access control table.

          – Cassandra: Cassandra is a distributed storage system that manages huge amounts of structured data distributed among multiple commercial servers [89]. The system was developed by Facebook and became an open source tool in 2008. It adopts the ideas and concepts of both Amazon Dynamo and Google BigTable, in particular integrating the distributed system technology of Dynamo with the BigTable data model. Tables in Cassandra take the form of a distributed four-dimensional structured map, where the four dimensions are row, column, column family, and super column. A row is identified by a string key of arbitrary length. No matter how many columns are read or written, an operation on a row is atomic. Columns may be grouped into clusters called column families, similar to the data model of BigTable. Cassandra provides two kinds of column families: column families and super columns. A super column includes an arbitrary number of columns related to the same name. A column family includes columns and super columns, which may be continuously inserted into the column family during runtime. The partition and copy mechanisms of Cassandra are very similar to those of Dynamo, so as to achieve consistency.

          – Derivative tools of BigTable: since the BigTable code is not available under an open source license, some open source projects compete to implement the BigTable concept and develop similar systems, such as HBase and Hypertable.

          HBase is a BigTable clone programmed in Java and is part of Apache's Hadoop MapReduce framework [90]. HBase replaces GFS with HDFS. It writes updated contents into RAM and regularly writes them into files on disks. Row operations are atomic, equipped with row-level locking and transaction processing, which is optional for large scale. Partition and distribution are transparently operated and have space for client hash or fixed key.

          HyperTable was developed, similarly to BigTable, to provide a set of high-performance, expandable, distributed storage and processing systems for structured and unstructured data [91]. HyperTable relies on distributed file systems, e.g., HDFS, and a distributed lock manager. Its data representation, processing, and partition mechanisms are similar to those in BigTable. HyperTable has its own query language, called HyperTable Query Language (HQL), which allows users to create, modify, and query the underlying tables.

          Since the column-oriented storage databases mainly emulate BigTable, their designs are all similar, except for the concurrency mechanism and several other features. For example, Cassandra emphasizes weak consistency with concurrent control of multiple editions, while HBase and HyperTable focus on strong consistency through locks or log records.
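
The BigTable data model described above, a sorted map indexed by row key, column key, and timestamp with a bounded number of cell editions, can be mimicked in a few lines. The sketch below is a single-process toy, not BigTable's API: the `TinyTable` class, its method names, and the sample rows are all hypothetical, and Tablets, column-family access control, and persistence are omitted.

```python
import bisect

class TinyTable:
    """In-memory map: (row key, column key, timestamp) -> bytes."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self._rows = {}      # row key -> {column key -> [(timestamp, value), ...]}
        self._row_keys = []  # row keys kept in lexicographic order

    def put(self, row, column, timestamp, value: bytes):
        if row not in self._rows:
            bisect.insort(self._row_keys, row)
            self._rows[row] = {}
        cells = self._rows[row].setdefault(column, [])
        cells.append((timestamp, value))
        cells.sort(key=lambda c: c[0], reverse=True)  # newest edition first
        del cells[self.max_versions:]                 # discard the oldest editions

    def get(self, row, column):
        """Return the latest edition of a cell, or None if it does not exist."""
        cells = self._rows.get(row, {}).get(column)
        return cells[0][1] if cells else None

    def scan(self, start_row, end_row):
        """Yield (row, columns) for start_row <= row < end_row in key order."""
        lo = bisect.bisect_left(self._row_keys, start_row)
        hi = bisect.bisect_left(self._row_keys, end_row)
        for row in self._row_keys[lo:hi]:
            yield row, self._rows[row]

t = TinyTable(max_versions=2)
t.put("com.cnn.www", "contents:", 1, b"<html>v1")
t.put("com.cnn.www", "contents:", 3, b"<html>v3")
t.put("com.cnn.www", "contents:", 2, b"<html>v2")
t.put("com.example", "contents:", 1, b"<html>ex")
t.put("org.apache", "contents:", 1, b"<html>ap")

latest = t.get("com.cnn.www", "contents:")
# "/" sorts immediately after ".", so this range covers every key starting with "com."
com_rows = [row for row, _ in t.scan("com.", "com/")]
print(latest, com_rows)
```

Because row keys are kept in lexicographic order, a scan over a short key range touches only neighboring rows, which is why reading a short range of rows is cheap in the design described above.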

          – Document databases: compared with key-value storage, document storage can support more complex data forms. Since documents do not follow strict schemas, there is no need to conduct schema migration. In addition, key-value pairs can still be saved. We will examine three important representatives of document storage systems, i.e., MongoDB, SimpleDB, and CouchDB.

          – MongoDB: MongoDB is an open-source, document-oriented database [92]. MongoDB stores documents as Binary JSON (BSON) objects [93], which are similar to JSON objects. Every document has an ID field as the primary key. Queries in MongoDB are expressed with a syntax similar to JSON. A database driver sends the query as a BSON object to MongoDB. The system allows queries on all documents, including embedded objects and arrays. To enable rapid queries, indexes can be created on the queryable fields of documents. The copy operation in MongoDB is executed with log files in the main nodes, which record all the high-level operations conducted in the database. During copying, the slaves query the master for all the write operations since the last synchronization and execute the operations from the log files in their local databases. MongoDB supports horizontal expansion with automatic sharding to distribute data among thousands of nodes, with automatic load balancing and failover.

          – SimpleDB: SimpleDB is a distributed database and a web service of Amazon [94]. Data in SimpleDB is organized into various domains, in which data may be stored, acquired, and queried. Domains include different properties and name/value pair sets of projects. Data is copied to different machines at different data centers in order to ensure data safety and improve performance. This system does not support automatic partition and thus cannot be expanded with the change of data volume. SimpleDB allows users to query with SQL. It is worth noting that SimpleDB can assure eventual consistency but does not support Multi-Version Concurrency Control (MVCC). Therefore, conflicts therein cannot be detected from the client side.

          – CouchDB: Apache CouchDB is a document-oriented database written in Erlang [95]. Data in CouchDB is organized into documents consisting of fields named by keys/names and values, which are stored and accessed as JSON objects. Every document is provided with a unique identifier. CouchDB allows access to database documents through a RESTful HTTP API. If a document needs to be modified, the client must download the entire document, modify it, and then send it back to the database. After a document is rewritten once, its revision identifier is updated. CouchDB utilizes optimistic replication to obtain scalability without a sharding mechanism. Since various CouchDBs may be executed along with other transactions simultaneously, any kind of replication topology can be built. The consistency of CouchDB relies on the replication mechanism. CouchDB supports MVCC with historical hash records.

          Big data are generally stored in hundreds and even thousands of commercial servers. Thus, the traditional parallel models, such as Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), may not be adequate to support such large-scale parallel programs. Recently, some proposed parallel programming models have effectively improved the performance of NoSQL and reduced the performance gap to relational databases. Therefore, these models have become the cornerstone for the analysis of massive data.

          – MapReduce: MapReduce [22] is a simple but powerful programming model for large-scale computing that uses a large number of clusters of commercial PCs to achieve automatic parallel processing and distribution. The MapReduce computing model has only two functions, i.e., Map and Reduce, both of which are programmed by users. The Map function processes input key-value pairs and generates intermediate key-value pairs. Then, MapReduce combines all the intermediate values related to the same key and transmits them to the Reduce function, which further compresses the value set into a smaller set. MapReduce has the advantage that it avoids the complicated steps of developing parallel applications, e.g., data scheduling, fault tolerance, and inter-node communications. The user only needs to program the two functions to develop a parallel application. The initial MapReduce framework did not support multiple datasets in a task, a limitation that has been mitigated by some recent enhancements [96, 97].

          Over the past decades, programmers have become familiar with the advanced declarative language SQL, often used in relational databases for task description and dataset analysis. However, the succinct MapReduce framework only provides two nontransparent functions, which cannot cover all the common operations. Therefore, programmers have to spend time programming the basic functions, which are typically hard to maintain and reuse. In order to improve programming efficiency, some advanced language systems have been proposed, e.g., Sawzall [98] of Google, Pig Latin [99] of Yahoo, Hive [100] of Facebook, and Scope [87] of Microsoft.
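
The Map and Reduce functions described above can be illustrated with the classic word-count task. The sketch below runs in a single process and only imitates the programming model: `run_mapreduce` is a hypothetical stand-in for the framework (it performs the grouping of intermediate values by key), and real systems add the distribution, scheduling, and fault tolerance that the text mentions.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Map: emit an intermediate (word, 1) pair for every word in a document."""
    for word in text.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: compress all values of one key into a single count."""
    return word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical single-process driver standing in for the framework."""
    groups = defaultdict(list)
    for key, value in inputs:
        for inter_key, inter_value in map_fn(key, value):
            groups[inter_key].append(inter_value)  # group by intermediate key
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

docs = [
    ("d1", "big data needs big storage"),
    ("d2", "data storage and data analysis"),
]
counts = run_mapreduce(docs, map_fn, reduce_fn)
print(counts)
```

The user-written part is limited to `map_fn` and `reduce_fn`; everything else belongs to the framework, which is the division of labor the model is built around.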

– Dryad: Dryad [101] is a general-purpose distributed execution engine for processing parallel applications of coarse-grained data. The operational structure of Dryad is a directed acyclic graph, in which vertexes represent programs and edges represent data channels. Dryad executes operations on the vertexes in clusters and transmits data via data channels, including files, TCP connections, and shared-memory FIFOs. During operation, resources in a logic operation graph are automatically mapped to physical resources.

The operation structure of Dryad is coordinated by a central program called the job manager, which can be executed in clusters or on workstations through a network. A job manager consists of two parts: 1) application codes, which are used to build a job communication graph, and 2) program library codes, which are used to arrange available resources. All kinds of data are directly transmitted between vertexes. Therefore, the job manager is only responsible for decision-making and does not obstruct any data transmission.

In Dryad, application developers can flexibly choose any directed acyclic graph to describe the communication modes of the application and express data transmission mechanisms. In addition, Dryad allows vertexes to use any amount of input and output data, while MapReduce supports only one input and output set.


DryadLINQ [102] is the advanced language of Dryad and is used to integrate the aforementioned SQL-like language execution environment.

– All-Pairs: All-Pairs [103] is a system specially designed for biometrics, bio-informatics, and data mining applications. It focuses on comparing element pairs in two datasets by a given function. All-Pairs can be expressed as a three-tuple (Set A, Set B, and Function F), in which Function F is utilized to compare all elements in Set A and Set B. The comparison result is an output matrix M, which is also called the Cartesian product or cross join of Set A and Set B.

All-Pairs is implemented in four phases: system modeling, distribution of input data, batch job management, and result collection. In Phase I, an approximation model of system performance is built to evaluate how much CPU resource is needed and how to conduct job partition. In Phase II, a spanning tree is built for data transmissions, which lets the workload of every partition retrieve input data effectively. In Phase III, after the data flow is delivered to the proper nodes, the All-Pairs engine builds a batch-processing submission for jobs in partitions, while sequencing them in the batch processing system and formulating a node running command to acquire data. In the last phase, after the job completion of the batch processing system, the extraction engine collects results and combines them in a proper structure, which is generally a single file list in which all results are put in order.

– Pregel: The Pregel [104] system of Google facilitates the processing of large-sized graphs, e.g., analysis of network graphs and social networking services. A computational task is expressed by a directed graph constituted by vertexes and directed edges. Every vertex is related to a modifiable and user-defined value, and every directed edge related to a source vertex is constituted by a user-defined value and the identifier of a target vertex. When the graph is built, the program conducts iterative calculations, which are called supersteps, among which global synchronization points are set until algorithm completion and output completion. In every superstep, vertex computations are parallel, and every vertex executes the same user-defined function to express a given algorithm logic. Every vertex may modify its own status and the status of its output edges, receive messages sent in the previous superstep, send messages to other vertexes, and even modify the topological structure of the entire graph. Edges are not provided with corresponding computations. Every vertex may deactivate itself by suspension. When all vertexes are in an inactive status without any message to transmit, the entire program execution is completed. The Pregel program output is a set consisting of the values output from all the vertexes. Generally speaking, the input and output of a Pregel program are isomorphic directed graphs.
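The superstep mechanism of the Pregel item can be illustrated with a minimal single-process sketch. It is not Google's Pregel API: the function name `pregel_max_value` and the dictionary-based graph format are assumptions made for illustration. Every vertex runs the same logic, messages sent in one superstep become visible only in the next one, and execution ends when all vertexes are inactive and no message is in transit.

```python
from collections import defaultdict


def pregel_max_value(values, edges, max_supersteps=50):
    """Propagate the maximum vertex value along directed edges.

    values: dict vertex -> initial value (hypothetical input format)
    edges:  dict vertex -> list of target vertexes (all present in values)
    """
    values = dict(values)
    active = set(values)        # every vertex starts in the active status
    inbox = defaultdict(list)   # messages delivered in the current superstep

    for superstep in range(max_supersteps):
        if not active and not inbox:
            break               # all vertexes inactive, nothing to transmit
        outbox = defaultdict(list)
        # A vertex computes if it is active or was woken up by a message.
        for v in active | set(inbox):
            new_value = max([values[v]] + inbox.get(v, []))
            changed = new_value != values[v]
            values[v] = new_value
            # Send messages in the first superstep or after a change.
            if superstep == 0 or changed:
                for target in edges.get(v, []):
                    outbox[target].append(new_value)
        # Global synchronization point: every vertex suspends itself and
        # the messages just sent are delivered in the next superstep.
        active = set()
        inbox = outbox
    return values


graph_values = {"a": 3, "b": 6, "c": 2, "d": 1}
graph_edges = {"a": ["b"], "b": ["a", "d"], "c": ["b", "d"], "d": ["c"]}
result = pregel_max_value(graph_values, graph_edges)
```

In this strongly connected example every vertex ends with the value 6, and the output has the same graph shape as the input, which matches the remark that input and output are isomorphic directed graphs.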

Inspired by the above programming models, other research has also focused on programming models for more complex computational tasks, e.g., iterative computations [105, 106], fault-tolerant memory computations [107], incremental computations [108], and flow control decision-making related to data [109].

5 Big data analysis

The analysis of big data mainly involves analytical methods for traditional data and big data, analytical architecture for big data, and software used for mining and analysis of big data. Data analysis is the final and the most important phase in the value chain of big data, with the purpose of extracting useful values and providing suggestions or decisions. Different levels of potential values can be generated through the analysis of datasets in different fields [10]. However, data analysis is a broad area, which frequently changes and is extremely complex. In this section, we introduce the methods, architectures, and tools for big data analysis.

5.1 Traditional data analysis

Traditional data analysis means to use proper statistical methods to analyze massive data, to concentrate, extract, and refine useful data hidden in a batch of chaotic datasets, and to identify the inherent law of the subject matter, so as to maximize the value of data. Data analysis plays a huge guidance role in making development plans for a country, understanding customer demands for commerce, and predicting market trends for enterprises. Big data analysis can be deemed as the analysis technique for a special kind of data. Therefore, many traditional data analysis methods may still be utilized for big data analysis. Several representative traditional data analysis methods are examined in the following, many of which are from statistics and computer science.

– Cluster Analysis: a statistical method for grouping objects, and specifically, classifying objects according to some features. Cluster analysis is used to differentiate objects with particular features and divide them into some categories (clusters) according to these features, such that objects in the same category have high homogeneity while different categories have high heterogeneity. Cluster analysis is an unsupervised learning method that needs no training data.

– Factor Analysis: basically targeted at describing the relation among many elements with only a few factors, i.e., grouping several closely related variables into a factor; the few factors are then used to reveal most of the information in the original data.


– Correlation Analysis: an analytical method for determining the law of relations, such as correlation, correlative dependence, and mutual restriction, among observed phenomena, and accordingly conducting forecast and control. Such relations may be classified into two types: (i) function, reflecting the strict dependence relationship among phenomena, which is also called a definitive dependence relationship; (ii) correlation, covering undetermined or inexact dependence relations, in which the numerical value of a variable may correspond to several numerical values of the other variable, and such numerical values present a regular fluctuation surrounding their mean values.

– Regression Analysis: a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies dependence relationships among variables hidden by randomness. Regression analysis can make complex and undetermined correlations among variables simple and regular.

– A/B Testing: also called bucket testing. It is a technology for determining how to improve target variables by comparing tested groups. Big data will require a large number of tests to be executed and analyzed.

– Statistical Analysis: Statistical analysis is based on statistical theory, a branch of applied mathematics. In statistical theory, randomness and uncertainty are modeled with probability theory. Statistical analysis can provide a description and an inference for big data. Descriptive statistical analysis can summarize and describe datasets, while inferential statistical analysis can draw conclusions from data subject to random variations. Statistical analysis is widely applied in the economic and medical care fields [110].

– Data Mining Algorithms: Data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. In 2006, the IEEE International Conference on Data Mining Series (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [111], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, and CART, etc. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and link mining, all of which are among the most important problems in data mining research.
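As an illustration of cluster analysis with k-means, one of the algorithms named in the list above, the following is a minimal pure-Python sketch. The function name `kmeans` and the choice of the first k points as initial centroids are assumptions made to keep the example deterministic; practical implementations usually use randomized or smarter seeding.

```python
import math


def kmeans(points, k, iterations=100):
    """Group points into k clusters by alternating assignment and update."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the result is stable
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment


sample = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
centers, labels = kmeans(sample, k=2)
```

No labeled training data is involved: the grouping follows only from the distances between objects, so points in the same cluster are close to each other (high homogeneity) and the two clusters lie far apart (high heterogeneity).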

5.2 Big data analytic methods

At the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data so as to bring value to enterprises and individuals. At present, the main processing methods of big data are as follows.

– Bloom Filter: A Bloom Filter consists of a series of Hash functions. The principle of the Bloom Filter is to store Hash values of data, rather than the data itself, by utilizing a bit array, which is in essence a bitmap index that uses Hash functions to conduct lossy compression storage of data. It has such advantages as high space efficiency and high query speed, but also has some disadvantages in misrecognition (false positives) and deletion.

– Hashing: a method that essentially transforms data into shorter fixed-length numerical values or index values. Hashing has such advantages as rapid reading and writing and high query speed, but it is hard to find a sound Hash function.

– Index: An index is always an effective method to reduce the expense of disk reading and writing and to improve insertion, deletion, modification, and query speeds, both in traditional relational databases that manage structured data and in other technologies that manage semi-structured and unstructured data. However, an index has the disadvantage of the additional cost for storing index files, which should be maintained dynamically when data is updated.

– Trie: also called trie tree, a variant of Hash Tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of the Trie is to utilize common prefixes of character strings to reduce comparisons of character strings to the greatest extent, so as to improve query efficiency.

– Parallel Computing: Compared to traditional serial computing, parallel computing refers to simultaneously utilizing several computing resources to complete a computation task. Its basic idea is to decompose a problem into parts and assign them to several separate processes to be independently completed, so as to achieve co-processing. Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see a comparison in Table 1).
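The Bloom Filter item in the list above can be made concrete with a minimal sketch. The class name `BloomFilter`, the default sizes, and the use of seeded SHA-256 digests as the series of Hash functions are assumptions chosen for illustration, not a prescribed design.

```python
import hashlib


class BloomFilter:
    """Bit array plus several hash functions; stores hash positions only."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Assumption: derive independent hash functions by prefixing a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


seen = BloomFilter()
seen.add("user-42")
```

A membership test such as `"user-42" in seen` never misses an inserted item, but it may report an item that was never inserted, which is the misrecognition mentioned above. Deletion is problematic because clearing a bit could also remove evidence of other items that hash to the same position.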

Although parallel computing systems or tools, such as MapReduce or Dryad, are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages are being developed based on these systems. Such high-level languages include Sawzall, Pig, and Hive used for MapReduce, as well as Scope and DryadLINQ used for Dryad.
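To give a feeling for what programming at this low level involves, the following is a minimal single-process sketch of the two-function MapReduce contract, using word counting as the task. It does not use Hadoop or any real MapReduce API; `run_mapreduce`, `map_fn`, and `reduce_fn` are hypothetical names, and the grouping loop only imitates the step in which the framework collects intermediate values by key.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: emit an intermediate (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: compress all values of one key into a single total."""
    return word, sum(counts)


def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate the framework: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


lines = [(0, "big data big"), (1, "data analysis")]
word_counts = run_mapreduce(lines, map_fn, reduce_fn)
```

Even this small task has to be phrased as key-value transformations. A declarative query in a high-level language such as Pig or Hive expresses the same grouping and counting in a few lines and is compiled to such functions, which is the motivation given above for those languages.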

5.3 Architecture for big data analysis

Because of the 4Vs of big data, different analytical architectures shall be considered for different application requirements.


Table 1 Comparison of MPI, MapReduce, and Dryad

| | MPI | MapReduce | Dryad |
|---|---|---|---|
| Deployment | Computing node and data storage arranged separately (data should be moved to the computing node) | Computing and data storage arranged at the same node (computing should be close to data) | Computing and data storage arranged at the same node (computing should be close to data) |
| Resource management/scheduling | – | Workqueue (Google), HOD (Yahoo) | Not clear |
| Low-level programming | MPI API | MapReduce API | Dryad API |
| High-level programming | – | Pig, Hive, Jaql, ... | Scope, DryadLINQ |
| Data storage | The local file system, NFS, ... | GFS (Google), HDFS (Hadoop), KFS, Amazon S3, ... | NTFS, Cosmos DFS |
| Task partitioning | User manually partitions the tasks | Automation | Automation |
| Communication | Messaging, remote memory access | Files (local FS, DFS) | Files, TCP pipes, shared-memory FIFOs |
| Fault-tolerant | Checkpoint | Task re-execute | Task re-execute |

5.3.1 Real-time vs. offline analysis

According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

– Real-time analysis: mainly used in E-commerce and finance. Since data constantly changes, rapid data analysis is needed and analytical results shall be returned with a very short delay. The main existing architectures of real-time analysis include (i) parallel processing clusters using traditional relational databases, and (ii) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

– Offline analysis: usually used for applications without high requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally conducts analysis by importing logs into a special platform through data acquisition tools. Under the big data setting, many Internet enterprises utilize the offline analysis architecture based on Hadoop in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples of such data acquisition tools include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool Timetunnel, and Chukwa of Hadoop, etc. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.

5.3.2 Analysis at different levels

Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive-level analysis, which are examined in the following.

– Memory-level analysis: for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays, the memory of a server cluster surpasses hundreds of GB, and even the TB level is common. Therefore, an internal database technology may be used, and hot data shall reside in memory so as to improve the analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis. MongoDB is a representative memory-level analytical architecture. With the development of SSDs (Solid-State Drives), the capacity and performance of memory-level data analysis have been further improved and it has been widely applied.

– BI analysis: for the case when the data scale surpasses the memory level but may be imported into the BI analysis environment. The current mainstream BI products are provided with data analysis plans that support data volumes above the TB level.

– Massive analysis: for the case when the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analysis utilizes HDFS of Hadoop to store data and uses MapReduce for data analysis. Most massive analysis belongs to the offline analysis category.


5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithms differ greatly from each other according to different kinds of data and application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the top five most widely used software, according to a survey of "What Analytics, Data mining, Big Data software that you used in the past 12 months for a real project?" of 798 professionals made by KDnuggets in 2012 [112].

– R (30.7 %): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. When computing-intensive tasks are executed, code programmed in C, C++, and Fortran may be called in the R environment. In addition, skilled users can directly call R objects in C. Actually, R is a realization of the S language, which is an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and drawing plots. Compared to S, R is more popular since it is open source. R ranked first in the KDnuggets 2012 survey. Furthermore, in a survey of "Design languages you have used for data mining/analysis in the past year" in 2012, R was also in first place, defeating SQL and Java. Due to the popularity of R, database manufacturers, such as Teradata and Oracle, have released products supporting R.

– Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins with powerful functions for data analysis, such as Analysis ToolPak and Solver Add-in, are integrated initially, but such plug-ins can be used only if users enable them. Excel is also the only commercial software among the top five.

– Rapid-I RapidMiner (26.7 %): RapidMiner is an open source software used for data mining, machine learning, and predictive analysis. In an investigation of KDnuggets in 2011, it was more frequently used than R (ranked Top 1). Data mining and machine learning programs provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment. The data mining flow is described in XML and displayed through a graphic user interface (GUI). RapidMiner is written in Java. It integrates the learners and evaluation methods of Weka, and works with R. Functions of RapidMiner are implemented by connecting processes that include various operators. The entire flow can be deemed as a production line of a factory, with original data as input and model results as output. The operators can be considered as specific functions with different input and output characteristics.

– KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, and open-source-rich platform for data integration, data processing, data analysis, and data mining [113]. It allows users to create data flows or data channels in a visualized manner, to selectively run some or all analytical procedures, and provides analytical results, models, and interactive views. KNIME is written in Java and, being based on Eclipse, provides more functions as plug-ins. Through plug-in files, users can insert processing modules for files, pictures, and time series, and integrate them into various open source projects, e.g., R and Weka. KNIME controls data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted under a visualized environment. KNIME is designed as a module-based and expandable framework. There is no dependence between its processing units and data containers, making it adaptive to the distributed environment and independent development. In addition, it is easy to expand KNIME. Developers can effortlessly expand various nodes and views of KNIME.

– Weka/Pentaho (14.8 %): Weka, abbreviated from Waikato Environment for Knowledge Analysis, is a free and open-source machine learning and data mining software written in Java. Weka provides such functions as data processing, feature selection, classification, regression, clustering, association rules, and visualization, etc. Pentaho is one of the most popular open-source BI software. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, and data mining, etc., covering all aspects of BI. Weka's data processing algorithms are also integrated in Pentaho and can be directly called.

6 Big data applications

In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful values via judgments, suggestions, supports, or decisions.


However, data analysis involves a wide range of applications, which frequently change and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields, including structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

6.1 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically includes large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have emerged in the past decades. For example, as early as the 1990s, BI became a prevailing technology for business applications, and network search engines based on massive data mining processing emerged in the early 21st century. Some potential and influential applications from different fields, and their data and analysis characteristics, are discussed as follows.

– Evolution of Commercial Applications: The earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. Analytical techniques used in such systems were prevailing in the 1990s and were intuitive and simple, e.g., in the forms of reports, dashboards, queries with condition, search-based business intelligence, online transaction processing, interactive visualization, score cards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have been providing a unique opportunity for organizations to have an online display and directly interact with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted by text analysis and website mining techniques. As reported in [115], the quantity of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the Internet of Things based on sensors are opening a new generation of innovative applications, and requiring considerably larger capacity for supporting location-sensing, people-oriented, and context-aware operation.

– Evolution of Network Applications: The early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to the mining of email contents and building search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages, full of various kinds of data, such as text, images, audio, videos, and interactive contents, etc. Therefore, plenty of advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share contents.

– Evolution of Scientific Applications: Scientific research in many fields, such as astrophysics, oceanology, genomics, and environmental research, is acquiring massive data with high-throughput sensors and instruments. The US National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, inter-operative analysis software, and data services to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets have high varieties in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

6.2 Big data analysis fields

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and it is not easy to have a comprehensive coverage, we will focus on the key problems and technologies in data analysis in the following discussions.


6.2.1 Structured data analysis

Business applications and scientific research may generate massive structured data, of which the management and analysis rely on mature commercialized technologies, such as RDBMS, data warehouse, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning based on exact mathematical models and powerful algorithms has been applied to anomaly detection [117] and energy control [118]. Exploiting data characteristics, time and space mining can extract knowledge structures hidden in high-speed data flows and sensors [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy protection data mining is an emerging research field [120]. Over the past decade, process mining has been becoming a new research field, especially in process analysis with event data [121].

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to feature more business-based potential than structured data analysis. Generally, text analysis is a process to extract useful information and knowledge from unstructured text. Text mining is inter-disciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and data mining in particular. Most text mining systems are based on text representations and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context free grammar [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including database, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process to discover useful knowledge in Web pages, which generally involve several types of data, such as text, image, audio, video, code, metadata, and hyperlink. The research on image, audio, and video mining has recently been called multimedia analysis, which will be discussed in Section 6.2.4. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves the mining of semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., email, newsgroup management, and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to simulate and integrate data in the Web, so as to conduct more complex queries than searches based on key words.
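As a small illustration of the first step of hypertext mining, the following sketch extracts hyperlinks and their anchor text from a semi-structured HTML fragment, using only the Python standard library. The class name `LinkExtractor` and the example URLs are hypothetical; a real system would add crawling, error handling, and the learning or classification stage on top of such extracted pairs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect (absolute URL, anchor text) pairs from an HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # target of the <a> element being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page address.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


page = '<p>See <a href="b.html">page B</a> and <a href="http://example.com/">C</a>.</p>'
extractor = LinkExtractor("http://example.org/a/")
extractor.feed(page)
```

The resulting list in `extractor.links` pairs each link target with its description, which is the kind of input that both content-oriented classification and the link-structure models of Web structure mining can build on.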

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagrams linked in a website or among multiple websites. Models are built based on topological structures provided with hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages. The topic-oriented crawler is another successful case utilizing these models [127].
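A link-structure model of this kind can be sketched with a simplified power-iteration form of PageRank. The function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the even redistribution of rank from pages without out-links are conventional choices assumed here for illustration; this is not a reproduction of the method in [125].

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate page importance from the hyperlink graph alone.

    links: dict page -> list of pages it links to (hypothetical format)
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {
            p: (1.0 - damping) / n + damping * dangling / n for p in pages
        }
        # Every page passes an equal share of its rank along each out-link.
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = pagerank(web)
```

In this three-page example, page c receives links from both a and b and therefore ends up with the highest score, while b, which is linked only from a, ends up with the lowest. The scores always sum to one, so they can be read as a distribution over pages.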

Web usage mining aims to mine auxiliary data generated by Web dialogues or activities, whereas Web content mining and Web structure mining use the master Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or trades, cache, user queries, bookmark data, mouse clicks and scrolls, and any other kinds of data generated through interaction with the Web. As Web services and Web 2.0 are becoming mature and popular, Web usage data will have increasingly high variety. Web usage mining plays key roles in personalized space, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.
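The collaborative recommender idea can be sketched as user-based collaborative filtering over a small rating table. Everything in the example is an assumption for illustration: the names `cosine_similarity` and `recommend`, the toy ratings, and the scoring rule (a similarity-weighted sum of other users' ratings) are only one of many possible designs.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse rating dictionaries."""
    shared = set(a) & set(b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if not shared or norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(a[i] * b[i] for i in shared) / (norm_a * norm_b)


def recommend(ratings, user, top_n=3):
    """Rank items the user has not rated by similarity-weighted ratings."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        if sim <= 0:
            continue  # ignore users with no overlapping preferences
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]


usage = {
    "ann": {"book": 5, "lamp": 1},
    "bob": {"book": 5, "lamp": 1, "pen": 4},
    "cat": {"book": 1, "lamp": 5, "mug": 5},
}
suggestions = recommend(usage, "ann")
```

Because ann's ratings resemble bob's far more than cat's, the item liked by bob (pen) is ranked ahead of the item liked by cat (mug). In practice the rating table would be derived from usage data such as clicks, sessions, or trades rather than entered by hand.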


6.2.4 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) have been growing at an amazing speed, and analysis is used to extract useful knowledge from them and understand their semantics. Because multimedia data is heterogeneous and most of such data contains richer information than simple structured data or text data, extracting information is confronted with the huge challenge of the semantic differences. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia suggestion, and multimedia event detection, etc.

          Audio summarization can be accomplished by extracting the prominent words or phrases from metadata or by synthesizing a new representation. Video summarization interprets the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied to many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video and apply smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.
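
          As an illustration of static summarization, the sketch below keeps a frame as a key frame whenever its feature vector differs sufficiently from the last key frame. This is a generic thresholding heuristic, not the method of any system cited above; the per-frame color histograms, the L1 distance, and the threshold of 0.5 are assumptions made for the example.

```python
def select_key_frames(histograms, threshold=0.5):
    """Return indices of frames whose histogram differs from the last key frame.

    histograms: list of equal-length feature vectors, one per frame (assumed input).
    threshold:  minimum L1 distance that starts a new key frame (illustrative value).
    """
    if not histograms:
        return []
    key_indices = [0]  # the first frame always opens the summary
    for i in range(1, len(histograms)):
        last_key = histograms[key_indices[-1]]
        distance = sum(abs(a - b) for a, b in zip(histograms[i], last_key))
        if distance > threshold:
            key_indices.append(i)
    return key_indices


# Hypothetical 3-bin histograms for five frames covering three visually distinct scenes.
frames = [
    [0.90, 0.10, 0.00],
    [0.85, 0.15, 0.00],
    [0.10, 0.80, 0.10],
    [0.10, 0.75, 0.15],
    [0.00, 0.10, 0.90],
]
key_frames = select_key_frames(frames)
```

          The simplicity of this approach also hints at why static methods perform poorly: the result depends entirely on low-level features and a hand-picked threshold.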

          Multimedia annotation inserts labels to describe the contents of images and videos at both the syntax and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time and labor intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic difference. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to jointly explore both manual and automatic multimedia annotation [129].

          Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users to conveniently and quickly look up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements and includes shot boundary detection, key frame extraction, and scene segmentation. Based on the result of structural analysis, the second procedure is feature extraction, which mainly includes further mining the features of key frames, objects, texts, and movements; these features are the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find the patterns of video contents and put videos into predefined categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos. The retrieval results are then optimized through relevance feedback.
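
          The query and retrieval step can be sketched as a similarity search over the extracted feature vectors. The example below uses cosine similarity, which is only one possible similarity measurement; the index contents, the feature dimensions, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(index, query_features, top_k=2):
    """Rank indexed videos by similarity to the query and return the top_k identifiers."""
    scored = [(cosine_similarity(query_features, feats), video_id)
              for video_id, feats in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [video_id for _, video_id in scored[:top_k]]


# Hypothetical index: one feature vector per video, produced by earlier procedures.
video_index = {
    "news_clip": [0.9, 0.1, 0.2],
    "soccer_match": [0.1, 0.9, 0.3],
    "cooking_show": [0.2, 0.2, 0.9],
}
results = retrieve(video_index, [0.0, 1.0, 0.2], top_k=2)
```

          A relevance-feedback loop would then adjust the query vector or the ranking using the results a user marks as relevant; that step is omitted here.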

          Multimedia recommendation recommends specific multimedia contents according to users' preferences. It has proven to be an effective approach to providing personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. Content-based methods identify general features of users or their interests and recommend to users other contents with similar features. These methods largely rely on content similarity measurement, but most of them are troubled by limited content analysis and excessive specialization. Collaborative-filtering-based methods identify groups with similar interests and recommend contents to group members according to their behavior [132]. More recently, hybrid methods have been introduced, which integrate the advantages of the two aforementioned types of methods to improve recommendation quality [133].
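
          A minimal user-based collaborative filtering sketch is given below. It is an illustration of the general idea rather than the approach of [132]: the rating dictionary, the use of cosine similarity between users, and the similarity-weighted average used for prediction are all assumptions.

```python
import math


def user_similarity(ratings_a, ratings_b):
    """Cosine similarity between two users' rating dictionaries (missing items count as 0)."""
    shared = set(ratings_a) & set(ratings_b)
    if not shared:
        return 0.0
    dot = sum(ratings_a[item] * ratings_b[item] for item in shared)
    norm_a = math.sqrt(sum(r * r for r in ratings_a.values()))
    norm_b = math.sqrt(sum(r * r for r in ratings_b.values()))
    return dot / (norm_a * norm_b)


def recommend(ratings, user, top_k=2):
    """Suggest unseen items, scored by a similarity-weighted average of other users' ratings."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = user_similarity(ratings[user], other_ratings)
        if sim <= 0:
            continue
        for item, rating in other_ratings.items():
            if item in ratings[user]:
                continue  # only recommend items the user has not rated yet
            scores[item] = scores.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    predicted = {item: scores[item] / weights[item] for item in scores}
    return sorted(predicted, key=predicted.get, reverse=True)[:top_k]


# Hypothetical ratings on a 1-5 scale.
ratings = {
    "alice": {"video_a": 5, "video_b": 4},
    "bob": {"video_a": 5, "video_b": 5, "video_c": 5, "video_d": 1},
    "carol": {"video_a": 1, "video_c": 2, "video_d": 5},
}
suggestions = recommend(ratings, "alice")
```

          Because alice's ratings resemble bob's far more than carol's, bob's favorite unseen video is ranked first.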

          The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains some text descriptions related to concepts and video examples [134]. In [135], the author proposed a new algorithm for special multimedia event detection using a few positive training examples. Research on video event detection is still in its infancy and mainly focuses on sports or news events, running or abnormal events in monitoring videos, and other similar events with repetitive patterns.

          6.2.5 Network data analysis

          Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graph structures describing the communications between two entities. The content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. From the data-centered perspective, the existing research on social networking service (SNS) contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

          Research on link-based structural analysis has focused on link prediction, community discovery, social network evolution, and social influence analysis. An SNS may be visualized as a graph in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertexes and edges are continually added to the graphs. Link prediction predicts the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and utilizes the existing link information to generate binary classifiers that predict future links [139]. Probabilistic methods aim to build models for connection probabilities among vertexes in SNS [140]. Linear algebra methods compute the similarity between two vertexes according to the singular similarity matrix [141]. A community is represented by a sub-graph matrix, in which the edges connecting vertexes within the sub-graph feature high density, while the edges between two sub-graphs feature much lower density [142].
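
          As a small illustration of link prediction, the sketch below scores every unlinked pair of vertexes by the Jaccard coefficient of their neighbor sets. This neighborhood score is a common baseline and could serve as one feature for the feature-based classifiers mentioned above; it is not the method of [139], [140], or [141], and the example graph is hypothetical.

```python
from itertools import combinations


def jaccard_scores(adjacency):
    """Score each currently unlinked vertex pair by |common neighbors| / |all neighbors|."""
    scores = {}
    for u, v in combinations(sorted(adjacency), 2):
        if v in adjacency[u]:
            continue  # the link already exists, nothing to predict
        union = adjacency[u] | adjacency[v]
        if union:
            scores[(u, v)] = len(adjacency[u] & adjacency[v]) / len(union)
    return scores


# Hypothetical undirected friendship graph, stored as neighbor sets.
graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"bob", "cat", "eve"},
    "eve": {"dan"},
}
scores = jaccard_scores(graph)
best_pair = max(scores, key=scores.get)
```

          Here ann and dan share two of their three distinct neighbors, so that pair is ranked as the most likely future link.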

          Many methods for community detection have been proposed and studied, most of which rely on topology-based objective functions that capture the concept of community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. Research on social network evolution aims to find laws and deduction models to interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generation methods have been proposed to assist network and system design [147].
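
          The density-based notion of a community can be checked numerically. The sketch below compares the edge density inside a candidate group with the density between two groups; it only evaluates a given partition and is not a detection algorithm such as the one in [143]. The graph, the two groups, and the function name are assumptions.

```python
from itertools import combinations


def edge_density(nodes_a, nodes_b, edges):
    """Fraction of possible edges that exist within one set (if both sets are equal)
    or between two disjoint sets."""
    if nodes_a == nodes_b:
        possible = [frozenset(pair) for pair in combinations(sorted(nodes_a), 2)]
    else:
        possible = [frozenset((a, b)) for a in nodes_a for b in nodes_b]
    if not possible:
        return 0.0
    present = sum(1 for pair in possible if pair in edges)
    return present / len(possible)


# Hypothetical graph: two triangles joined by a single bridge edge.
edges = {frozenset(e) for e in [
    ("a", "b"), ("a", "c"), ("b", "c"),
    ("d", "e"), ("d", "f"), ("e", "f"),
    ("c", "d"),
]}
group_1, group_2 = {"a", "b", "c"}, {"d", "e", "f"}
inside = edge_density(group_1, group_1, edges)
between = edge_density(group_1, group_2, edges)
```

          A partition that reflects real communities should yield a much higher internal density than cross-group density, as it does for the two triangles in this example.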

          Social influence refers to the case when individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, time effects, and the characteristics of networks and individuals. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is considered, the performance of link-based structural analysis may be further improved.

          Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise. For example, the blogosphere contains a large number of spam blogs, and Twitter likewise contains many trivial Tweets. Third, SNS are dynamic networks that vary and are updated frequently and quickly. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge among different media [150].

          6.2.6 Mobile data analysis

          By April 2013, Android Apps had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since research on mobile analysis has only just started, we will only introduce some recent and representative analysis applications in this section.

          With the growing number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities with geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest Webchat). Traditional network communities or SNS communities lack online interaction among members, and the communities are active only when members are sitting in front of computers. On the contrary, mobile phones can support rich interaction at any time and anywhere. A mobile community is defined as a group of individuals with the same hobbies (e.g., health, safety, and entertainment) who gather together on networks, meet to set a common goal, decide on measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

          Recently, the progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build body area networks for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, and physiological relations. In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. For the case in which only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize them.

          Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop an application for smart phones that analyzes people's gait as they walk and uses the gait information to unlock the security system [11]. Meanwhile, Robert Delano and Brian Parise from the Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with a built-in seismograph in a mobile phone, so as to cope with Parkinson's disease and other nervous system diseases [11].

          6.3 Key applications of big data

          6.3.1 Application of big data in enterprises

          At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, in marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business modes. In sales planning, after comparison of massive data, enterprises can optimize their commodity prices. In operations, enterprises can improve their operation efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor cost. In the supply chain, using big data, enterprises may conduct inventory optimization, logistic optimization, and supplier coordination to mitigate the gap between supply and demand, control budgets, and improve services.

          In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that such activities as "Multi-times score accumulation" and "score exchange in shops" are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers' transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.
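
          The targeting step of such a drop-out warning model can be sketched as ranking customers by predicted risk and selecting the highest-risk share. The sketch assumes that risk scores have already been produced by some model; the scores, the customer identifiers, and the function name are hypothetical and do not describe CMB's actual system.

```python
import math


def top_fraction_by_risk(risk_scores, fraction=0.2):
    """Return the customers in the highest-risk share, most at-risk first.

    risk_scores: dict {customer_id: predicted drop-out probability} (assumed input).
    fraction:    share of customers to target; 0.2 is an illustrative default.
    """
    count = math.ceil(len(risk_scores) * fraction)
    ranked = sorted(risk_scores, key=risk_scores.get, reverse=True)
    return ranked[:count]


# Hypothetical model output for ten customers.
scores = {
    "c01": 0.91, "c02": 0.12, "c03": 0.55, "c04": 0.78, "c05": 0.05,
    "c06": 0.33, "c07": 0.64, "c08": 0.20, "c09": 0.47, "c10": 0.09,
}
targets = top_fraction_by_risk(scores)
```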

          The most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, along with, more importantly, the age, gender, address, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, the market conditions of their brands, and consumers' behaviors, and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan service of Alibaba automatically analyzes and judges whether to grant loans to enterprises through the acquired enterprise transaction data by virtue of big data technology, and manual intervention does not occur in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan with only about 0.3 % bad loans, which is much lower than the ratios of other commercial banks.

          6.3.2 Application of IoT based big data

          IoT is not only an important source of big data but also one of the main markets for big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

          Logistics enterprises may have the most profound experience with the application of IoT big data. For example, the trucks of UPS are equipped with sensors, wireless adapters, and GPS, so the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

          The smart city is a hot research area based on the application of IoT data. For example, the smart city project cooperation between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and Miami city, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city technologies brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills this year by identifying and fixing running and leaking water pipes in a timely manner.

          6.3.3 Application of online social network-oriented big data

          An online SNS is a social structure constituted by social individuals and the connections among individuals based on an information network. Big data of online SNS mainly comes from instant messages, online social networks, micro blogs, and shared spaces, which represent various user activities. The analysis of big data from online SNS uses computational analytical methods for understanding relations in human society by virtue of theories and methods from mathematics, informatics, sociology, and management science, along three dimensions: network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following; they mainly mine and analyze content information and structural information to acquire value.

          – Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preference, emotion, interest, and demand may be revealed.

          – Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies aggregate relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.

          The US Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

          Fig. 5 Enabling technologies for online social network-oriented big data

          In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and then get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted research that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans. The goal is to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting sharp growth or drops in the volume of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing the ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation in the official statistics of Indonesia matches the number of Tweets about rice prices on Twitter, as shown in Fig. 6.
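
          Two of the analyses described above, detecting a sharp change in topic volume and comparing Tweet counts with an external indicator, can be sketched with basic statistics. The sketch below is not the Global Pulse methodology: the z-score rule, the window length, the threshold, and all numbers are made up for illustration.

```python
import math
from statistics import mean, stdev


def detect_spikes(counts, window=4, z_threshold=3.0):
    """Return indices where a count deviates sharply from the preceding window."""
    spikes = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(counts[i] - mu) / sigma > z_threshold:
            spikes.append(i)
    return spikes


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up weekly Tweet counts on one topic, with a burst in week 5.
weekly_tweets = [100, 104, 98, 102, 101, 240, 110, 105]
spikes = detect_spikes(weekly_tweets)

# Made-up monthly series: Tweets about a commodity price vs. an official price index.
monthly_tweets = [10, 12, 15, 20, 26]
price_index = [2.0, 2.3, 3.1, 4.0, 5.2]
correlation = pearson(monthly_tweets, price_index)
```

          A high correlation on real data would only indicate that the two series move together; it does not by itself establish that Tweets predict official statistics.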

          Generally speaking, the application of big data from online SNS may help to better understand users' behavior and master the laws of social and economic activities from the following three aspects:

          – Early Warning: to rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

          – Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotion, and preference of users.

          – Real-time Feedback: to acquire groups' feedback on social activities based on real-time monitoring.

          6.3.4 Applications of healthcare and medical big data

          Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information value. Big data has great potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the health care business.

          For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final result into a highly personalized treatment plan to assess the risk factors and main treatment plans of patients. Then, doctors may reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or by suggesting that patients reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20.

          The Mount Sinai Medical Center in the US utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

          Fig. 6 The correlation between Tweets about rice price and food price inflation

          HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through the software development kit (SDK) and open interface.

          6.3.5 Collective intelligence

          With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units and coordinate through mobile networks for the distribution of sensing tasks and the collection and utilization of sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of Crowdsourcing has been successfully applied to geotagged photographs, positioning and navigation, urban road traffic sensing, market forecasting, opinion mining, and other labor-intensive applications.

          Crowdsourcing, a new approach to problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, Crowdsourcing had been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of Crowdsourcing. The main idea of Crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, Crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

          In the big data era, Spatial Crowdsourcing has become a hot topic. The operation framework of Spatial Crowdsourcing is as follows. A user may request a service and resources related to a specified location. Then, the mobile users who are willing to participate in the task move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data is sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions they provide, it can be forecast that Spatial Crowdsourcing will become more prevalent than traditional Crowdsourcing, e.g., Amazon Mechanical Turk and Crowdflower.
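
          The operation framework above can be sketched as a simple assignment step: given a task location, choose the nearest willing mobile user within some range. This is an illustrative greedy rule, not a published Spatial Crowdsourcing algorithm; the worker records, the 5 km range, and the coordinates are assumptions.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (latitude, longitude) points."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def assign_task(task_location, workers, max_distance_km=5.0):
    """Return the id of the nearest willing worker within range, or None if there is none."""
    candidates = []
    for worker in workers:
        if not worker["willing"]:
            continue  # only users who agreed to participate are considered
        distance = haversine_km(*task_location, *worker["location"])
        if distance <= max_distance_km:
            candidates.append((distance, worker["id"]))
    return min(candidates)[1] if candidates else None


# Hypothetical request location and mobile users.
task = (30.5130, 114.4200)
workers = [
    {"id": "w1", "location": (30.5200, 114.4300), "willing": True},
    {"id": "w2", "location": (30.5135, 114.4205), "willing": False},
    {"id": "w3", "location": (30.6000, 114.3000), "willing": True},
]
assigned = assign_task(task, workers)
```

          The closest user (w2) is skipped because participation is voluntary, so the task goes to the nearest willing user instead.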

          6.3.6 Smart grid

          The Smart Grid is the next-generation power grid, constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart Grid related big data are generated from various sources, such as (i) power utilization habits of users; (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nationwide; (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI); (iv) energy market pricing and bidding data; and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as Circuit Breaker Monitors and transformers). The Smart Grid brings about the following challenges in exploiting big data:

          – Grid planning: By analyzing data in the Smart Grid, the regions that have excessively high electrical load or high power outage frequencies can be identified. Even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance. For example, researchers from the University of California, Los Angeles designed an "electric map" according to the big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of the block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides an effective and visual load forecast for power grid planning in a city. Priority transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads, as displayed in the map, may then be conducted.

          – Interaction between power generation and power consumption: An ideal power grid shall balance power generation and consumption. However, the traditional power grid is constructed based on the one-directional approach of transmission-transformation-distribution-consumption, which does not allow the generation capacity to be adjusted according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters have been developed to improve power supply efficiency. TXU Energy has had several successful deployments of smart electric meters, which can help suppliers read power utilization data every 15 min rather than every month as in the past. Labor cost for meter reading is greatly reduced. Because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and low periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help the realization of time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users.

          – The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable energy sources can be efficiently managed, and the electricity generated by the new energy resources can be allocated to regions with electricity shortages. Such energy resources can complement the traditional hydropower and thermal power generation.
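
          The time-sharing dynamic pricing discussed in the list above can be illustrated with a small calculation over 15-minute smart meter readings. The sketch assumes one day of 96 readings in kWh; the peak hours, the two prices, and the load profile are made-up values and do not reflect any utility's actual tariff.

```python
def hourly_load(readings_kwh):
    """Sum one day of 15-minute interval readings (96 values) into 24 hourly totals."""
    if len(readings_kwh) != 96:
        raise ValueError("expected 96 quarter-hour readings for one day")
    return [sum(readings_kwh[hour * 4:(hour + 1) * 4]) for hour in range(24)]


def time_of_use_bill(readings_kwh, peak_hours, peak_price, off_peak_price):
    """Price each hour's energy at the peak or off-peak rate and return the daily bill."""
    total = 0.0
    for hour, energy in enumerate(hourly_load(readings_kwh)):
        price = peak_price if hour in peak_hours else off_peak_price
        total += energy * price
    return total


# Made-up profile: 1 kWh per hour all day, doubling between 18:00 and 21:00.
readings = [0.25] * 96
for quarter in range(18 * 4, 21 * 4):
    readings[quarter] = 0.5

# Made-up tariff: hours 17-21 are billed at the peak rate.
bill = time_of_use_bill(readings, peak_hours=range(17, 22),
                        peak_price=0.30, off_peak_price=0.10)
```

          With monthly readings only the total consumption is known, so a tariff of this kind cannot be applied; the 15-minute resolution is what makes the peak and off-peak split computable.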

          7 Conclusion, open issues, and outlook

          In this paper, we review the background and state-of-the-art of big data. Firstly, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture to readers of this exciting area.

          In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

          7.1 Open issues

          The analysis of big data is confronted with many challenges, but the current research is still at an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

          7.1.1 Theoretical research

          Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, which are discussed below.

          – Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

          – Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions for big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to assess the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated once the system is implemented and deployed, which makes it impossible to horizontally compare the advantages and disadvantages of various alternative solutions, or to compare the situation before and after the implementation of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, effectively and rigorously evaluating data quality is also an urgent problem.

          – Evolution of big data computing modes: This includes the memory mode, data flow mode, PRAM mode, and MR mode. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

          7.1.2 Technology development

          Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, should be fully investigated.

          – Format conversion of big data. Because data sources are wide and diverse, heterogeneity is always a characteristic of big data, as well as a key factor that restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.

          – Big data transfer. Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which makes it the bottleneck of big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is therefore a key factor in improving big data computing.

          – Real-time performance of big data. The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

          – Processing of big data. As big data research advances, new problems in big data processing arise beyond those of traditional data analysis, including (i) data re-utilization: as data scale increases, more value may be mined from the re-utilization of existing data; (ii) data re-organization: datasets from different businesses can be re-organized so that more value can be mined; and (iii) data exhaust, which means erroneous data collected during acquisition. In big data, not only the correct data but also the erroneous data should be utilized to generate more value.
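
          The bullet on real-time performance calls for a way to compute the rate of depreciation of data but does not fix a model. As a minimal sketch, assuming (purely for illustration) that the value of a record decays exponentially with its age, the remaining value and a simple end-of-life test could be written as follows. The half-life parameter, the function names, and the 10% expiry threshold are hypothetical choices, not part of the survey.

```python
def data_value(initial_value: float, age_seconds: float,
               half_life_seconds: float) -> float:
    """Remaining value of a record under an assumed exponential decay.

    The value halves every `half_life_seconds`; this decay law is an
    illustrative assumption, not a model proposed by the survey.
    """
    if half_life_seconds <= 0:
        raise ValueError("half_life_seconds must be positive")
    if age_seconds < 0:
        raise ValueError("age_seconds must be non-negative")
    return initial_value * 0.5 ** (age_seconds / half_life_seconds)


def is_expired(age_seconds: float, half_life_seconds: float,
               threshold: float = 0.1) -> bool:
    """Treat a record as past its life cycle once the fraction of its
    original value that remains drops below `threshold`."""
    return data_value(1.0, age_seconds, half_life_seconds) < threshold
```

          With a half-life of 60 seconds, a record that is four minutes old retains 1/16 of its value and would be treated as expired under the default threshold. A real system would have to estimate the decay law from how analysis quality degrades with stale input.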

          7.1.3 Practical implications

          Although there are already many successful big data applications, many practical problems remain to be solved.

          – Big data management. The emergence of big data brings new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

          – Searching, mining, and analysis of big data. Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

          – Integration and provenance of big data. As discussed, the value acquired from the comprehensive utilization of multiple datasets is far higher than the sum of the values of the individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information that follows different standards and comes from different datasets.

          – Big data application. At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication; big data applications in small and medium-sized businesses; big data applications in government departments; big data services; and human-computer interaction of big data, etc. are all important research problems.

          7.1.4 Data security

          In IT, safety and privacy are always two key concerns. In the big data era, as data volume grows rapidly, safety risks become more severe, while traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security-related challenges.

          – Big data privacy. Big data privacy includes two aspects: (i) protection of personal privacy during data acquisition: personal interests, habits, body properties, etc. of users may be more easily acquired, and users may not be aware of it; (ii) personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is regarded as the big data company with the most SNS data at present. According to a report [156], Ron Bowes, a researcher at Skull Security, used an information acquisition tool to collect data from the public pages of Facebook users who had not modified their privacy settings. He packaged the data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

          – Data quality. Data quality influences big data utilization. Low-quality data wastes transmission and storage resources and has poor usability. Many factors may restrict data quality; for example, generation, acquisition, and transmission may all influence it. Data quality is mainly manifested in accuracy, completeness, redundancy, and consistency. Even though many measures have been taken to improve data quality, the related problems have not yet been well addressed. Therefore, effective methods to automatically detect data quality and repair damaged data need to be investigated.

          – Big data safety mechanism. Big data brings challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods, designed for small and medium-scale data, cannot meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safe communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, the isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

          – Big data application in information security. Big data not only brings challenges to information security but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) by analyzing big data in the form of log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc. may also be more easily identified through the analysis of big data.
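
          The data quality bullet above names accuracy, completeness, redundancy, and consistency as the dimensions in which quality shows up, and asks for automatic detection. The following is a minimal sketch of three of those measures over a list of records; the function names, the record layout, and the rule-based notion of consistency are assumptions made for illustration. Accuracy is left out because it needs ground-truth values to compare against.

```python
from typing import Any, Callable, Dict, List, Sequence

Record = Dict[str, Any]


def completeness(records: List[Record], required: Sequence[str]) -> float:
    """Fraction of required field slots that are present and non-empty."""
    total = len(records) * len(required)
    if total == 0:
        return 1.0
    filled = sum(1 for r in records for f in required
                 if r.get(f) not in (None, ""))
    return filled / total


def redundancy(records: List[Record], key_fields: Sequence[str]) -> float:
    """Fraction of records that repeat an earlier record on `key_fields`.

    Field values are assumed to be hashable.
    """
    if not records:
        return 0.0
    seen = set()
    duplicates = 0
    for r in records:
        key = tuple(r.get(f) for f in key_fields)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates / len(records)


def consistency(records: List[Record],
                rule: Callable[[Record], bool]) -> float:
    """Fraction of records that satisfy a caller-supplied validity rule."""
    if not records:
        return 1.0
    return sum(1 for r in records if rule(r)) / len(records)
```

          For example, a hypothetical temperature feed could be scored with `completeness(readings, ["id", "temp"])`, `redundancy(readings, ["id", "temp"])`, and `consistency(readings, lambda r: r["temp"] is None or -50 <= r["temp"] <= 60)`. Records that lower these scores are the candidates for the repair step the bullet mentions.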

          The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, energy-efficiency optimization of distributed storage, and hardware and software system architectures for processing, etc. In particular, big data security, including credibility, backup and recovery, and completeness maintenance, should be further investigated.

          7.2 Outlook

          The emergence of big data opens great opportunities. In the IT era, the "T" (Technology) was the main concern, and technology drove the development of data. In the big data era, with the prominence of data value and advances in "I" (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact but also influence everyone's ways of living and thinking, which is already happening. We cannot predict the future, but we may take precautions against possible future events.

          – Data with a larger scale, higher diversity, and more complex structures. Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced, given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have been looking for better ways to cope with larger-scale, more diverse, and more complexly structured data. These efforts are represented by Spanner, the globally distributed database of Google, and F1, a fault-tolerant, expandable, distributed relational database. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to those of relational databases, and handle data effectively through grammars similar to SQL.

          – Data resource performance. Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and that the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

          – Big data promotes the cross fusion of science. Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc. shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

          – Visualization. In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., in text and image editors. In big data applications, mixed data is very useful for decision making. Analytical results can be effectively utilized by users only when they are displayed in a friendly way. Reports, histograms, pie charts, and regression curves, etc. are frequently used to visualize the results of data analysis. New presentation forms will appear in the future; e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

          – Data-oriented. It is well known that programs consist of data structures and algorithms, and that data structures are used to store data. In the history of program design, the role of data has become increasingly significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed that of business logic. Consequently, programs are gradually being transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods will emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

          – Big data triggers the revolution of thinking. Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

          – During data analysis, we will try to utilize all data rather than analyzing only a small set of sample data.

          – Compared with accurate data, we would like to accept numerous and complicated data.

          – We shall pay greater attention to correlations between things rather than exploring causal relationships.

          – The simple algorithms of big data are more effective than the complex algorithms of small data.

          – Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace "experts".

          Throughout the history of human society, the demands and willingness of human beings have always been the driving forces of scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will attract increasing attention and will certainly cause enormous transformations of social activities in the future society.

          Acknowledgments This work was supported by China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

          References

          1. Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

          2. Fact sheet: Big data across the federal government (2012). http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf

          3. Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper

          4. Drowning in numbers - digital data will flood the planet - and help us understand it better (2011). http://www.economist.com/blogs/dailychart/2011/11/bigdata-0

          5. Lohr S (2012) The age of big data. New York Times, pp 11

          6. Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/thedigitalbreadcrumbs-that-lead-to-big-data

          7. Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-searchforanalysts-to-make-sense-of-big-data

          8. Big data (2008). http://www.nature.com/news/specials/bigdata/index.html

          9. Special online collection: dealing with big data (2011). http://www.sciencemag.org/site/special/data

          10. Manyika J, McKinsey Global Institute, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute

          11. Mayer-Schönberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Eamon Dolan/Houghton Mifflin Harcourt

          12. Laney D (2001) 3-d data management: controlling data volume, velocity and variety. META Group Research Note, 6 February

          13. Zikopoulos P, Eaton C et al (2011) Understanding big data: analytics for enterprise class hadoop and streaming data. McGraw-Hill Osborne Media

          14. Meijer E (2011) The world according to linq. Communications of the ACM 54(10):45–51

          15. Beyer M (2011) Gartner says solving big data challenge involves more than just managing volumes of data. Gartner. http://www.gartner.com/it/page.jsp

          16. O R Team (2011) Big data now: current perspectives from O'Reilly Radar. O'Reilly Media

          17. Grobelnik M (2012) Big data tutorial. http://videolectures.net/eswc2012grobelnikbigdata

          18. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2008) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014

          19. DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6):85–98

          20. Walter T (2009) Teradata past, present, and future. UCI ISG lecture series on scalable data management

          21. Ghemawat S, Gobioff H, Leung S-T (2003) The google file system. In: ACM SIGOPS Operating Systems Review, vol 37. ACM, pp 29–43

          22. Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

          23. Hey AJG, Tansley S, Tolle KM et al (2009) The fourth paradigm: data-intensive scientific discovery

          24. Howard JH, Kazar ML, Menees SG, Nichols DA, Satyanarayanan M, Sidebotham RN, West MJ (1988) Scale and performance in a distributed file system. ACM Trans Comput Syst (TOCS) 6(1):51–81

          25. Cattell R (2011) Scalable sql and nosql data stores. ACM SIGMOD Record 39(4):12–27

          26. Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033

          27. Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98

          28. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J et al (2012) Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States

          29. Sun Y, Chen M, Liu B, Mao S (2013) Far: a fault-avoidant routing method for data center networks with regular topology. In: Proceedings of ACM/IEEE symposium on architectures for networking and communications systems (ANCS'13). ACM

          30. Wiki (2013) Applications and organizations using hadoop. http://wiki.apache.org/hadoop/PoweredBy

          31. Bahga A, Madisetti VK (2012) Analyzing massive machine maintenance data in a computing cloud. IEEE Transac Parallel Distrib Syst 23(10):1831–1843

          32. Gunarathne T, Wu T-L, Choi JY, Bae S-H, Qiu J (2011) Cloud computing paradigms for pleasingly parallel biomedical applications. Concurr Comput Prac Experience 23(17):2338–2354

          33. Gantz J, Reinsel D (2010) The digital universe decade-are you ready. External publication of IDC (Analyse the Future) information and data, pp 1–16

          34. Bryant RE (2011) Data-intensive scalable computing for scientific applications. Comput Sci Eng 13(6):25–33

          35. Wahab MHA, Mohd MNH, Hanafi HF, Mohsin MFM (2008) Data pre-processing on web server logs for generalized association rules mining algorithm. World Acad Sci Eng Technol 48:2008

          36. Nanopoulos A, Manolopoulos Y, Zakrzewicz M, Morzy T (2002) Indexing web access-logs for pattern queries. In: Proceedings of the 4th international workshop on web information and data management. ACM, pp 63–68

          37. Joshi KP, Joshi A, Yesha Y (2003) On using a warehouse to analyze web logs. Distrib Parallel Databases 13(2):161–180

          38. Chandramohan V, Christensen K (2002) A first look at wired sensor networks for video surveillance systems. In: Proceedings LCN 2002, 27th annual IEEE conference on local computer networks. IEEE, pp 728–729

          39. Selavo L, Wood A, Cao Q, Sookoor T, Liu H, Srinivasan A, Wu Y, Kang W, Stankovic J, Young D et al (2007) Luster: wireless sensor network for environmental research. In: Proceedings of the 5th international conference on Embedded networked sensor systems. ACM, pp 103–116

          40. Barrenetxea G, Ingelrest F, Schaefer G, Vetterli M, Couach O, Parlange M (2008) Sensorscope: out-of-the-box environmental monitoring. In: Information processing in sensor networks, 2008, international conference on IPSN'08. IEEE, pp 332–343

          41. Kim Y, Schmid T, Charbiwala ZM, Friedman J, Srivastava MB (2008) Nawms: nonintrusive autonomous water monitoring system. In: Proceedings of the 6th ACM conference on Embedded network sensor systems. ACM, pp 309–322

          42. Kim S, Pakzad S, Culler D, Demmel J, Fenves G, Glaser S, Turon M (2007) Health monitoring of civil infrastructures using wireless sensor networks. In: Information Processing in Sensor Networks, 2007, 6th International Symposium on IPSN 2007. IEEE, pp 254–263

          43. Ceriotti M, Mottola L, Picco GP, Murphy AL, Guna S, Corra M, Pozzi M, Zonta D, Zanon P (2009) Monitoring heritage buildings with wireless sensor networks: the torre aquila deployment. In: Proceedings of the 2009 International Conference on Information Processing in Sensor Networks. IEEE Computer Society, pp 277–288

          44. Tolle G, Polastre J, Szewczyk R, Culler D, Turner N, Tu K, Burgess S, Dawson T, Buonadonna P, Gay D et al (2005) A macroscope in the redwoods. In: Proceedings of the 3rd international conference on embedded networked sensor systems. ACM, pp 51–63

          45. Wang F, Liu J (2011) Networked wireless sensor data collection: issues, challenges, and approaches. IEEE Commun Surv Tutor 13(4):673–687

          46. Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the 11th international conference on World Wide Web. ACM, pp 124–135

          47. Choudhary S, Dincturk ME, Mirtaheri SM, Moosavi A, von Bochmann G, Jourdan G-V, Onut I-V (2012) Crawling rich internet applications: the state of the art. In: CASCON, pp 146–160

          48. Ghani N, Dixit S, Wang T-S (2000) On ip-over-wdm integration. IEEE Commun Mag 38(3):72–84

          49. Manchester J, Anderson J, Doshi B, Dravida S (1998) Ip over sonet. IEEE Commun Mag 36(5):136–142

          50. Jinno M, Takara H, Kozicki B (2009) Dynamic optical mesh networks: drivers, challenges and solutions for the future. In: Optical communication, 2009, 35th European conference on ECOC'09. IEEE, pp 1–4

          51. Barroso LA, Hölzle U (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synt Lect Comput Archit 4(1):1–108

          52. Armstrong J (2009) Ofdm for optical communications. J Light Technol 27(3):189–204

          53. Shieh W (2011) Ofdm for flexible high-speed optical networks. J Light Technol 29(10):1560–1577

          54. Cisco data center interconnect design and deployment guide (2010)

          55. Greenberg A, Hamilton JR, Jain N, Kandula S, Kim C, Lahiri P, Maltz DA, Patel P, Sengupta S (2009) Vl2: a scalable and flexible data center network. In: ACM SIGCOMM computer communication review, vol 39. ACM, pp 51–62

          56. Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S (2009) Bcube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 39(4):63–74

          57. Farrington N, Porter G, Radhakrishnan S, Bazzaz HH, Subramanya V, Fainman Y, Papen G, Vahdat A (2011) Helios: a hybrid electrical/optical switch architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 41(4):339–350

          58. Abu-Libdeh H, Costa P, Rowstron A, O'Shea G, Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Comput Commun Rev 40(4):51–62

          59. Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what's needed for datacenter network operations. IEEE Commun Mag 48(7):32–39

          60. Wang G, Andersen DG, Kaminsky M, Papagiannaki K, Ng TS, Kozuch M, Ryan M (2010) c-through: Part-time optics in data centers. In: ACM SIGCOMM Computer Communication Review, vol 40. ACM, pp 327–338

          61. Ye X, Yin Y, Yoo SJB, Mejia P, Proietti R, Akella V (2010) Dos: a scalable optical switch for datacenters. In: Proceedings of the 6th ACM/IEEE symposium on architectures for networking and communications systems. ACM, p 24

          62. Singla A, Singh A, Ramachandran K, Xu L, Zhang Y (2010) Proteus: a topology malleable data center network. In: Proceedings of the 9th ACM SIGCOMM workshop on hot topics in networks. ACM, p 8

          63. Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P (2011) Energy-efficient design of a scalable optical multiplane interconnection architecture. IEEE J Sel Top Quantum Electron 17(2):377–383

          64. Kodi AK, Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance computing (hpc) systems. IEEE J Sel Top Quantum Electron 17(2):384–395

          65. Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H (2012) Mirror mirror on the ceiling: flexible wireless links for data centers. ACM SIGCOMM Comput Commun Rev 42(4):443–454

          66. Lenzerini M (2002) Data integration: a theoretical perspective. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. ACM, pp 233–246

          67. Cafarella MJ, Halevy A, Khoussainova N (2009) Data integration for the relational web. Proc VLDB Endowment 2(1):1090–1101

          68. Maletic JI, Marcus A (2000) Data cleansing: beyond integrity analysis. In: IQ. Citeseer, pp 200–209

          69. Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1-2):83–113

          70. Chen H, Ku W-S, Wang H, Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 51–62

          71. Zhao Z, Ng W (2012) A model-based approach for rfid data stream cleansing. In: Proceedings of the 21st ACM international conference on information and knowledge management. ACM, pp 862–871

          72. Khoussainova N, Balazinska M, Suciu D (2008) Probabilistic event extraction from rfid data. In: Data Engineering, 2008, IEEE 24th international conference on ICDE 2008. IEEE, pp 1480–1482

          73. Herbert KG, Wang JTL (2007) Biological data cleaning: a case study. Int J Inf Qual 1(1):60–82

          74. Tsai T-H, Lin C-Y (2012) Exploring contextual redundancy in improving object-based video coding for video sensor networks surveillance. IEEE Transac Multmed 14(3):669–682

          75. Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 269–278

          76. Kamath U, Compton J, Dogan RI, Jong KD, Shehu A (2012) An evolutionary algorithm approach for feature generation from sequence data and its application to dna splice site prediction. IEEE/ACM Transac Comput Biol Bioinforma (TCBB) 9(5):1387–1398

          77. Leung K-S, Lee KH, Wang J-F, Ng EYT, Chan HLY, Tsui SKW, Mok TSK, Tse PC-H, Sung JJ-Y (2011) Data mining on dna sequences of hepatitis b virus. IEEE/ACM Transac Comput Biol Bioinforma 8(2):428–440

          78. Huang Z, Shen H, Liu J, Zhou X (2011) Effective data co-reduction for multimedia similarity search. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, pp 1021–1032

          79. Bleiholder J, Naumann F (2008) Data fusion. ACM Comput Surv (CSUR) 41(1):1

          80. Brewer EA (2000) Towards robust distributed systems. In: PODC, p 7

          81. Gilbert S, Lynch N (2002) Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59

          82. McKusick MK, Quinlan S (2009) Gfs: evolution on fast-forward. ACM Queue 7(7):10

          83. Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J (2008) Scope: easy and efficient parallel processing of massive data sets. Proc VLDB Endowment 1(2):1265–1276

          84. Beaver D, Kumar S, Li HC, Sobel J, Vajgel P et al (2010) Finding a needle in haystack: facebook's photo storage. In: OSDI, vol 10, pp 1–8

          85. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: amazon's highly available key-value store. In: SOSP, vol 7, pp 205–220

          86. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In: Proceedings of the twenty-ninth annual ACM symposium on theory of computing. ACM, pp 654–663

          87. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2):4

          88. Burrows M (2006) The chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th symposium on Operating systems design and implementation. USENIX Association, pp 335–350

          89. Lakshman A, Malik P (2009) Cassandra: structured storage system on a p2p network. In: Proceedings of the 28th ACM symposium on principles of distributed computing. ACM, pp 5–5

          90. George L (2011) HBase: the definitive guide. O'Reilly Media Inc

          91. Judd D (2008) hypertable-0.9.0.4-alpha

          92. Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc

          93. Crockford D (2006) The application/json media type for javascript object notation (json)

          94. Murty J (2009) Programming amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc

          95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc

          96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 975–986

          97. Yang H-C, Parker DS (2009) Traverse: simplified indexing on large map-reduce-merge clusters. In: Database systems for advanced applications. Springer, pp 308–322

          98. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with sawzall. Sci Program 13(4):277–298

          99. Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U (2009) Building a high-level dataflow system on top of map-reduce: the pig experience. Proceedings VLDB Endowment 2(2):1414–1425

100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629

101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72

102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) Dryadlinq: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14

103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: Parallel and distributed processing, 2008. IEEE international symposium on IPDPS 2008. IEEE, pp 1–11

104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146

105. Bu Y, Bill H, Balazinska M, Ernst MD (2010) Haloop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296

106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative mapreduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing. ACM, pp 810–818

107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2

108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: mapreduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing. ACM, p 7

109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) Ciel: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on Networked systems design and implementation, p 9

110. Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York

111. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY, et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

112. What analytics, data mining, big data software you used in the past 12 months for a real project? (2012) http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html

113. Berthold MR, Cebron N, Dill F, Gabriel TR, Kotter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer

114. Sallam RL, Richardson J, Hagerty J, Hostmann B (2011) Magic quadrant for business intelligence platforms. CT, Gartner Group, Stamford

115. Beyond the PC. Special Report on Personal Technology (2011)

116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A, et al (2011) The iplant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on Software quality assurance. ACM, pp 70–77

118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers. ACM, pp 277–286

119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Record 34(2):18–26

120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM Sigmod Record 33(1):50–57

121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Transac Manag Inform Syst (TMIS) 3(2):7

122. Manning CD, Schutze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press

123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Transac Neural Netw 13(5):1163–1177

124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11

125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117

126. Konopnicki D, Shmueli O (1995) W3qs: a query system for the world-wide web. In: VLDB, vol 95, pp 54–65

127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640

128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval. ACM, pp 2

129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25

130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19

131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819

132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939

133. Barragans-Martinez AB, Costa-Montenegro E, Burguillo JC, Rey-Lopez M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend tv programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311

134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91

135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia. ACM, pp 469–478

136. Hirsch JE (2005) An index to quantify an individual's scientific research output. Proc Natl Acad Sci USA 102(46):16569

137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company

138. Aggarwal CC (2011) An introduction to social network data analytics. Springer

139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1046–1054

140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied Computing. ACM, pp 1115–1116


141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Transac Knowl Discov Data (TKDD) 5(2):10

142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on World wide web. ACM, pp 631–640

143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis. ACM, pp 16–25

144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. ACM, pp 315–321

145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 145–158

146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 131–144

147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1007–1016

148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 807–816

149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on Web search and data mining. ACM, pp 657–666

150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360

151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (Last accessed 5 May 2013)

152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51

153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive rfid, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 2

154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval. ACM, pp 469–478

155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated, multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium. ACM, pp 445–454

156. Tasevski P (2011) Password attacks and generation strategies. Tartu University: Faculty of Mathematics and Computer Sciences



started to become mainstream in big data analysis. Even so, non-relational databases still have problems with performance and in particular applications. A compromise between RDBMSs and non-relational databases is needed. For example, some enterprises (e.g., Facebook and Taobao) have adopted a mixed database architecture that integrates the advantages of both types of database. More research is needed on in-memory databases and on sample data based on approximate analysis.

– Data confidentiality: most big data service providers or owners at present cannot effectively maintain and analyze such huge datasets because of their limited capacity. They must rely on professionals or tools to analyze such data, which increases the potential safety risks. For example, a transactional dataset generally includes a set of complete operating data to drive key business processes. Such data contains details of the lowest granularity and some sensitive information, such as credit card numbers. Therefore, analysis of big data may be delivered to a third party for processing only when proper preventive measures are taken to protect such sensitive data and ensure its safety.

– Energy management: the energy consumption of mainframe computing systems has drawn much attention from both economic and environmental perspectives. With the increase of data volume and analytical demands, the processing, storage, and transmission of big data will inevitably consume more and more electric energy. Therefore, system-level power consumption control and management mechanisms shall be established for big data while expandability and accessibility are ensured.

– Expandability and scalability: the analytical system of big data must support present and future datasets. The analytical algorithm must be able to process increasingly expanding and more complex datasets.

– Cooperation: analysis of big data is interdisciplinary research, which requires experts in different fields to cooperate to harvest the potential of big data. A comprehensive big data network architecture must be established to help scientists and engineers in various fields access different kinds of data and fully utilize their expertise, so as to cooperate to complete the analytical objectives.

            2 Related technologies

In order to gain a deep understanding of big data, this section introduces several fundamental technologies that are closely related to big data, including cloud computing, IoT, data centers, and Hadoop.

2.1 Relationship between cloud computing and big data

Cloud computing is closely related to big data. The key components of cloud computing are shown in Fig. 3. Big data is the object of the computation-intensive operation and stresses the storage capacity of a cloud system. The main objective of cloud computing is to use huge computing and storage resources under concentrated management, so as to provide big data applications with fine-grained computing capacity. The development of cloud computing provides solutions for the storage and processing of big data. On the other hand, the emergence of big data also accelerates the development of cloud computing. The distributed storage technology based on cloud computing can effectively manage big data; the parallel computing capacity by virtue of cloud computing can improve the efficiency of acquiring and analyzing big data.

Even though there are many overlapping technologies in cloud computing and big data, they differ in the following two aspects. First, the concepts are different to a certain extent. Cloud computing transforms the IT architecture, while big data influences business decision-making. However, big data depends on cloud computing as the fundamental infrastructure for smooth operation.

Second, big data and cloud computing have different target customers. Cloud computing is a technology and product targeting Chief Information Officers (CIO) as an advanced IT solution. Big data is a product targeting Chief Executive Officers (CEO), focusing on business operations. Since the decision makers may directly feel the pressure from market competition, they must defeat business opponents in more competitive ways. With the advances of big data and cloud computing, these two technologies are certainly and increasingly entwined with each other. Cloud computing, with functions similar to those of computers and operating systems, provides system-level resources; big data operates in the upper level supported by cloud computing and provides functions similar to those of a database and efficient data processing capacity. Kissinger, President of EMC, indicated that the application of big data must be based on cloud computing.

Fig. 3 Key components of cloud computing

The evolution of big data was driven by the rapid growth of application demands, and cloud computing developed from virtualized technologies. Therefore, cloud computing not only provides computation and processing for big data, but is also itself a service mode. To a certain extent, the advances of cloud computing also promote the development of big data, both of which supplement each other.

2.2 Relationship between IoT and big data

In the IoT paradigm, an enormous number of networked sensors are embedded into various devices and machines in the real world. Such sensors, deployed in different fields, may collect various kinds of data, such as environmental data, geographical data, astronomical data, and logistic data. Mobile equipment, transportation facilities, public facilities, and home appliances could all be data acquisition equipment in IoT, as illustrated in Fig. 4.

The big data generated by IoT has different characteristics compared with general big data because of the different types of data collected, of which the most classical characteristics include heterogeneity, variety, unstructured feature, noise, and high redundancy. Although the current IoT data is not the dominant part of big data, by 2030 the quantity of sensors will reach one trillion, and then the IoT data will be the most important part of big data, according to the forecast of HP. A report from Intel pointed out that big data in IoT has three features that conform to the big data paradigm: (i) abundant terminals generating masses of data; (ii) data generated by IoT is usually semi-structured or unstructured; (iii) data of IoT is useful only when it is analyzed.

At present, the data processing capacity of IoT has fallen behind the collected data, and it is extremely urgent to accelerate the introduction of big data technologies to promote the development of IoT. Many operators of IoT realize the importance of big data, since the success of IoT hinges upon the effective integration of big data and cloud computing. The widespread deployment of IoT will also bring many cities into the big data era.

There is a compelling need to adopt big data for IoT applications, while the development of big data is already lagging behind. It has been widely recognized that these two technologies are inter-dependent and should be jointly developed: on one hand, the widespread deployment of IoT drives the high growth of data both in quantity and category, thus providing the opportunity for the application and development of big data; on the other hand, the application of big data technology to IoT also accelerates the research advances and business models of IoT.

Fig. 4 Illustration of data acquisition equipment in IoT


2.3 Data center

In the big data paradigm, the data center is not only a platform for concentrated storage of data, but also undertakes more responsibilities, such as acquiring data, managing data, organizing data, and leveraging the data values and functions. Data centers mainly concern "data" rather than "center". A data center has masses of data and organizes and manages data according to its core objective and development path, which is more valuable than owning a good site and resources. The emergence of big data brings about sound development opportunities and great challenges to data centers. Big data is an emerging paradigm, which will promote the explosive growth of the infrastructure and related software of data centers. The physical data center network is the core for supporting big data, but, at present, it is the key infrastructure that is most urgently required [29].

– Big data requires the data center to provide powerful backstage support. The big data paradigm has more stringent requirements on storage capacity and processing capacity, as well as network transmission capacity. Enterprises must take the development of data centers into consideration to improve the capacity for rapidly and effectively processing big data under a limited price/performance ratio. The data center shall provide the infrastructure with a large number of nodes, build a high-speed internal network, effectively dissipate heat, and effectively back up data. Only when a highly energy-efficient, stable, safe, expandable, and redundant data center is built can the normal operation of big data applications be ensured.

– The growth of big data applications accelerates the revolution and innovation of data centers. Many big data applications have developed their unique architectures and directly promote the development of storage, network, and computing technologies related to data centers. With the continued growth of the volumes of structured and unstructured data, and the variety of sources of analytical data, the data processing and computing capacities of the data center shall be greatly enhanced. In addition, as the scale of data centers is increasingly expanding, how to reduce the operational cost for the development of data centers is also an important issue.

– Big data endows more functions to the data center. In the big data paradigm, the data center shall not only be concerned with hardware facilities but also strengthen soft capacities, i.e., the capacities of acquisition, processing, organization, analysis, and application of big data. The data center may help business personnel analyze the existing data, discover problems in business operation, and develop solutions from big data.

2.4 Relationship between Hadoop and big data

Presently, Hadoop is widely used in big data applications in the industry, e.g., spam filtering, network searching, clickstream analysis, and social recommendation. In addition, considerable academic research is now based on Hadoop. Some representative cases are given below. As declared in June 2012, Yahoo runs Hadoop on 42,000 servers at four data centers to support its products and services, e.g., searching and spam filtering. At present, the biggest Hadoop cluster has 4,000 nodes, but the number of nodes will be increased to 10,000 with the release of Hadoop 2.0. In the same month, Facebook announced that their Hadoop cluster can process 100 PB of data, which grew by 0.5 PB per day as of November 2012. Some well-known agencies that use Hadoop to conduct distributed computation are listed in [30]. In addition, many companies provide Hadoop commercial execution and/or support, including Cloudera, IBM, MapR, EMC, and Oracle.

Among modern industrial machinery and systems, sensors are widely deployed to collect information for environment monitoring and failure forecasting, etc. Bahga and others in [31] proposed a framework for data organization and cloud computing infrastructure, termed CloudView. CloudView uses mixed architectures, local nodes, and remote clusters based on Hadoop to analyze machine-generated data. Local nodes are used for the forecast of real-time failures; clusters based on Hadoop are used for complex offline analysis, e.g., case-driven data analysis.

The exponential growth of the genome data and the sharp drop of sequencing cost transform bio-science and bio-medicine to data-driven science. Gunarathne et al. in [32] utilized cloud computing infrastructures, Amazon AWS and Microsoft Azure, and data processing frameworks based on MapReduce, Hadoop, and Microsoft DryadLINQ to run two parallel bio-medicine applications: (i) assembly of genome segments; (ii) dimension reduction in the analysis of chemical structure. In the latter application, the 166-D datasets used include 26,000,000 data points. The authors compared the performance of all the frameworks in terms of efficiency, cost, and availability. According to the study, the authors concluded that the loose coupling will be increasingly applied to research on electron cloud, and the parallel programming technology (MapReduce) framework may provide the user an interface with more convenient services and reduce unnecessary costs.
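To make the MapReduce programming model mentioned above more concrete, the following is a minimal single-process sketch in Python of its map, shuffle, and reduce steps. It is an illustration only: it does not use Hadoop or DryadLINQ, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the machine-log input format are hypothetical choices made for this example.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit (key, value) pairs for one input record.

    A record is a hypothetical 'machine_id,status' log line; one pair
    (machine_id, 1) is emitted for every line whose status is 'FAIL'.
    """
    machine_id, status = record.split(",")
    if status == "FAIL":
        yield machine_id, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on an in-memory input."""
    pairs = (pair for record in records for pair in map_phase(record))
    groups = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


if __name__ == "__main__":
    logs = ["m1,OK", "m2,FAIL", "m1,FAIL", "m2,FAIL", "m3,OK"]
    print(run_job(logs))  # failure count per machine
```

In a real cluster framework the map and reduce functions would run in parallel on many nodes and the shuffle would move data over the network; here all three steps run in one process so that the data flow is easy to follow.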

            3 Big data generation and acquisition

We have introduced several key technologies related to big data, i.e., cloud computing, IoT, data center, and Hadoop. Next, we will focus on the value chain of big data, which can be generally divided into four phases: data generation, data acquisition, data storage, and data analysis. If we take data as a raw material, data generation and data acquisition are an exploitation process, data storage is a storage process, and data analysis is a production process that utilizes the raw material to create new value.

3.1 Data generation

Data generation is the first step of big data. Taking Internet data as an example, a huge amount of data in terms of search entries, Internet forum posts, chat records, and microblog messages is generated. Such data is closely related to people's daily life and shares the features of high value and low density. Such Internet data may be valueless individually, but through the exploitation of accumulated big data, useful information such as habits and hobbies of users can be identified, and it is even possible to forecast users' behaviors and emotional moods.

Moreover, generated through longitudinal and/or distributed data sources, datasets are larger in scale, highly diverse, and complex. Such data sources include sensors, videos, clickstreams, and/or all other available data sources. At present, the main sources of big data are the operation and trading information in enterprises, logistic and sensing information in the IoT, human interaction information and position information in the Internet world, and data generated in scientific research, etc. The information far surpasses the capacities of IT architectures and infrastructures of existing enterprises, while its real-time requirement also greatly stresses the existing computing capacity.

3.1.1 Enterprise data

In 2013, IBM issued Analysis: the Applications of Big Data to the Real World, which indicates that the internal data of enterprises are the main sources of big data. The internal data of enterprises mainly consists of online trading data and online analysis data, most of which are historically static data and are managed by RDBMSs in a structured manner. In addition, production data, inventory data, sales data, and financial data, etc., also constitute enterprise internal data, which aims to capture informationized and data-driven activities in enterprises, so as to record all activities of enterprises in the form of internal data.

Over the past decades, IT and digital data have contributed a lot to improving the profitability of business departments. It is estimated that the business data volume of all companies in the world may double every 1.2 years [10], in which the business turnover through the Internet, enterprises to enterprises, and enterprises to consumers, per day will reach USD 450 billion [33]. The continuously increasing business data volume requires more effective real-time analysis so as to fully harvest its potential. For example, Amazon processes millions of terminal operations and more than 500,000 queries from third-party sellers per day [12]. Walmart processes one million customer trades per hour, and such trading data are imported into a database with a capacity of over 2.5PB [3]. Akamai analyzes 75 million events per day for its target advertisements [13].
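The growth and throughput figures above can be turned into quick back-of-the-envelope estimates. The Python sketch below is only illustrative: the helper names `projected_volume` and `average_rate_per_second` are hypothetical, the doubling period is passed in as a parameter rather than fixed, and a constant doubling period and an evenly spread load are simplifying assumptions that the cited reports do not claim.

```python
def projected_volume(initial_volume, years, doubling_period_years):
    """Volume after `years`, assuming it doubles every `doubling_period_years`.

    Uses V(t) = V0 * 2 ** (t / T); the unit of the result is the unit
    of `initial_volume`.
    """
    return initial_volume * 2 ** (years / doubling_period_years)


def average_rate_per_second(count, period_seconds):
    """Average number of events per second, assuming an even spread."""
    return count / period_seconds


if __name__ == "__main__":
    # Illustrative doubling period in years; adjust to the cited estimate.
    doubling_period = 1.2
    growth_factor = projected_volume(1.0, years=6, doubling_period_years=doubling_period)
    print(f"Growth factor over 6 years: {growth_factor:.1f}x")

    # One million customer trades per hour, expressed per second.
    trades_per_second = average_rate_per_second(1_000_000, 3600)
    print(f"Average trades per second: {trades_per_second:.0f}")
```

Such estimates are useful for sizing storage and real-time analysis capacity, but peak loads are typically well above the average rate computed here.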

3.1.2 IoT data

As discussed, IoT is an important source of big data. Among smart cities constructed based on IoT, big data may come from industry, agriculture, traffic, transportation, medical care, public departments, and families, etc.

According to the processes of data acquisition and transmission in IoT, its network architecture may be divided into three layers: the sensing layer, the network layer, and the application layer. The sensing layer is responsible for data acquisition and mainly consists of sensor networks. The network layer is responsible for information transmission and processing, where close transmission may rely on sensor networks, and remote transmission shall depend on the Internet. Finally, the application layer supports specific applications of IoT.
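The three-layer division can be sketched as a small pipeline in Python. This is a toy model under stated assumptions, not an implementation of any IoT standard: the `Reading` type, the three function names, and the temperature-alert application are all hypothetical, and "transmission" is simulated by converting readings to plain dictionaries.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    """One value acquired by one sensor (hypothetical record type)."""
    sensor_id: str
    value: float


def sensing_layer(raw_samples):
    """Sensing layer: acquire data, here from a list of (id, value) tuples."""
    return [Reading(sensor_id, value) for sensor_id, value in raw_samples]


def network_layer(readings):
    """Network layer: transmit readings; simulated by serializing to dicts."""
    return [{"sensor_id": r.sensor_id, "value": r.value} for r in readings]


def application_layer(messages, threshold):
    """Application layer: a specific application, here a high-value alert."""
    return [m["sensor_id"] for m in messages if m["value"] > threshold]


if __name__ == "__main__":
    samples = [("t1", 21.5), ("t2", 80.0), ("t3", 19.0)]
    alerts = application_layer(network_layer(sensing_layer(samples)), threshold=50.0)
    print(alerts)  # sensors whose value exceeds the threshold
```

Keeping each layer as a separate function mirrors the separation of concerns in the text: acquisition, transmission, and application logic can each be replaced without changing the others.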

According to characteristics of Internet of Things, the data generated from IoT has the following features:

– Large-scale data: in IoT, masses of data acquisition equipment are deployed in a distributed manner, which may acquire simple numeric data, e.g., location, or complex multimedia data, e.g., surveillance video. In order to meet the demands of analysis and processing, not only the currently acquired data, but also the historical data within a certain time frame should be stored. Therefore, data generated by IoT are characterized by large scales.

– Heterogeneity: because of the variety of data acquisition devices, the acquired data is also different, and such data features heterogeneity.

– Strong time and space correlation: in IoT, every data acquisition device is placed at a specific geographic location and every piece of data has a time stamp. The time and space correlation is an important property of data from IoT. During data analysis and processing, time and space are also important dimensions for statistical analysis.

– Effective data accounts for only a small portion of the big data: a great quantity of noise may occur during the acquisition and transmission of data in IoT. Among datasets acquired by acquisition devices, only a small amount of abnormal data is valuable. For example, during the acquisition of traffic video, the few video frames that capture the violation of traffic regulations and traffic accidents are more valuable than those only capturing the normal flow of traffic.

            3.1.3 Bio-medical data

            As a series of high-throughput bio-measurement technologies were innovatively developed at the beginning of the 21st century, frontier research in the bio-medicine field also entered the era of big data. By constructing smart, efficient, and accurate analytical models and theoretical systems for bio-medicine applications, the essential governing mechanism behind complex biological phenomena may be revealed. Not only can the future development of bio-medicine be determined, but leading roles can also be assumed in the development of a series of important strategic industries related to the national economy, people's livelihood, and national security, with important applications such as medical care, new drug R&D, and grain production (e.g., transgenic crops).

            The completion of the HGP (Human Genome Project) and the continued development of sequencing technology also lead to widespread applications of big data in the field. The masses of data generated by gene sequencing go through specialized analysis according to different application demands, to be combined with clinical gene diagnosis and provide valuable information for early diagnosis and personalized treatment of disease. A single sequencing of human genes may generate 100–600GB of raw data. In the China National Genebank in Shenzhen, there are 1.3 million samples, including 1.15 million human samples and 150,000 animal, plant, and microorganism samples. By the end of 2013, 10 million traceable biological samples will be stored, and by the end of 2015 this figure will reach 30 million. It is predictable that, with the development of bio-medicine technologies, gene sequencing will become faster and more convenient, making the big data of bio-medicine grow continuously.

            In addition, data generated from clinical medical care and medical R&D also grows quickly. For example, the University of Pittsburgh Medical Center (UPMC) has stored 2TB of such data. Explorys, an American company, provides platforms to collocate clinical data, operation and maintenance data, and financial data. At present, information on about 13 million people has been collocated, with 44 articles of data at the scale of about 60TB, which will reach 70TB in 2013. Practice Fusion, another American company, manages electronic medical records of about 200,000 patients.

            Apart from such small and medium-sized enterprises, well-known IT companies such as Google, Microsoft, and IBM have invested extensively in the research and computational analysis of methods related to high-throughput biological big data, competing for shares in the huge market known as the “Next Internet.” IBM forecast at the 2013 Strategy Conference that, with the sharp increase of medical images and electronic medical records, medical professionals may utilize big data to extract useful clinical information from masses of data to obtain a medical history and forecast treatment effects, thus improving patient care and reducing cost. It is anticipated that by 2015 the average data volume of every hospital will increase from 167TB to 665TB.

            3.1.4 Data generation from other fields

            As scientific applications are increasing, the scale of datasets is gradually expanding, and the development of some disciplines greatly relies on the analysis of masses of data. Here, we examine several such applications. Although they are in different scientific fields, the applications have similar and increasing demands on data analysis. The first example is related to computational biology. GenBank is a nucleotide sequence database maintained by the U.S. National Bio-Technology Innovation Center. Data in this database may double every 10 months. By August 2009, GenBank had more than 250 billion bases from 150,000 different organisms [34]. The second example is related to astronomy. The Sloan Digital Sky Survey (SDSS), the biggest sky survey project in astronomy, recorded 25TB of data from 1998 to 2008. As the resolution of the telescope is improved, by 2004 the data volume generated per night will surpass 20TB. The last application is related to high-energy physics. At the beginning of 2008, the ATLAS experiment of the Large Hadron Collider (LHC) of the European Organization for Nuclear Research generates raw data at 2PB/s and stores about 10TB of processed data per year.

            In addition, pervasive sensing and computing among nature, commercial, Internet, government, and social environments are generating heterogeneous data with unprecedented complexity. These datasets have their unique data characteristics in scale, time dimension, and data category. For example, mobile data were recorded with respect to positions, movement, approximation degrees, communications, multimedia, use of applications, and audio environment [108]. According to the application environment and requirements, such datasets can be classified into different categories, so as to select the proper and feasible solutions for big data.

            3.2 Big data acquisition

            As the second phase of the big data system, big data acquisition includes data collection, data transmission, and data pre-processing. During big data acquisition, once we collect the raw data, we shall utilize an efficient transmission mechanism to send it to a proper storage management system to support different analytical applications. The collected datasets may sometimes include much redundant or useless data, which unnecessarily increases storage space and affects the subsequent data analysis. For example, high redundancy is very common among datasets collected by sensors for environment monitoring. Data compression technology can be applied to reduce the redundancy. Therefore, data pre-processing operations are indispensable to ensure efficient data storage and exploitation.
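
            As a minimal sketch of how compression removes the redundancy typical of environment-monitoring data, the following Python example compresses a synthetic sensor log with the standard zlib module. The record layout, the sensor name, and the helper names make_readings and compression_ratio are assumptions made for illustration only; they do not come from any system discussed in this survey.

```python
import zlib


def make_readings(n: int = 1000) -> bytes:
    """Build a hypothetical temperature log whose value changes only every 100 samples."""
    lines = [f"sensor-01,temp,{20 + (i // 100) % 3}.5" for i in range(n)]
    return "\n".join(lines).encode("ascii")


def compression_ratio(payload: bytes, level: int = 6) -> float:
    """Return original_size / compressed_size for a DEFLATE-compressed payload."""
    compressed = zlib.compress(payload, level)
    return len(payload) / len(compressed)


if __name__ == "__main__":
    raw = make_readings()
    packed = zlib.compress(raw)
    # Lossless round trip: decompression restores the exact original bytes.
    assert zlib.decompress(packed) == raw
    print(f"{len(raw)} bytes -> {len(packed)} bytes "
          f"(ratio {compression_ratio(raw):.1f}x)")
```

            The more repetitive the readings, the higher the ratio. The price is the extra CPU time spent compressing and decompressing, which has to be weighed against the storage and transmission savings.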

            3.2.1 Data collection

            Data collection utilizes special data collection techniques to acquire raw data from a specific data generation environment. Four common data collection methods are as follows.

            – Log files: As one widely used data collection method, log files are record files automatically generated by the data source system, so as to record activities in designated file formats for subsequent analysis. Log files are typically used in nearly all digital devices. For example, web servers record in log files the number of clicks, click rates, visits, and other property records of web users [35]. To capture activities of users at web sites, web servers mainly use the following three log file formats: public log file format (NCSA), expanded log format (W3C), and IIS log format (Microsoft). All three types of log files are in the ASCII text format. Databases, rather than text files, may sometimes be used to store log information to improve the query efficiency of the massive log store [36, 37]. There are also some other forms of log-file-based data collection, including stock indicators in financial applications and determination of operating states in network monitoring and traffic management.

            – Sensing: Sensors are common in daily life to measure physical quantities and transform them into readable digital signals for subsequent processing (and storage). Sensory data may be classified as sound wave, voice, vibration, automobile, chemical, current, weather, pressure, temperature, etc. Sensed information is transferred to a data collection point through wired or wireless networks. For applications that may be easily deployed and managed, e.g., a video surveillance system [38], the wired sensor network is a convenient solution to acquire related information. Sometimes the accurate position of a specific phenomenon is unknown, and sometimes the monitored environment does not have the energy or communication infrastructures. Then wireless communication must be used to enable data transmission among sensor nodes under limited energy and communication capability. In recent years, wireless sensor networks (WSNs) have received considerable interest and have been applied to many applications, such as environmental research [39, 40], water quality monitoring [41], civil engineering [42, 43], and wildlife habitat monitoring [44]. A WSN generally consists of a large number of geographically distributed sensor nodes, each being a micro device powered by battery. Such sensors are deployed at designated positions as required by the application to collect remote sensing data. Once the sensors are deployed, the base station will send control information for network configuration/management or data collection to sensor nodes. Based on such control information, the sensory data is assembled in different sensor nodes and sent back to the base station for further processing. Interested readers are referred to [45] for more detailed discussions.

            – Methods for acquiring network data: At present, network data acquisition is accomplished using a combination of web crawler, word segmentation system, task system, index system, etc. A web crawler is a program used by search engines for downloading and storing web pages [46]. Generally speaking, a web crawler starts from the uniform resource locator (URL) of an initial web page to access other linked web pages, during which it stores and sequences all the retrieved URLs. The web crawler acquires a URL in the order of precedence through a URL queue, downloads the web page, identifies all URLs in the downloaded page, and extracts new URLs to be put in the queue. This process is repeated until the web crawler is stopped. Data acquisition through a web crawler is widely applied in applications based on web pages, such as search engines or web caching. Traditional web page extraction technologies feature multiple efficient solutions, and considerable research has been done in this field. As more advanced web page applications are emerging, some extraction strategies are proposed in [47] to cope with rich Internet applications.

            The current network data acquisition technologies mainly include traditional Libpcap-based packet capture technology, zero-copy packet capture technology, as well as some specialized network monitoring software such as Wireshark, SmartSniff, and WinNetCap.

            – Libpcap-based packet capture technology: Libpcap (packet capture library) is a widely used network data packet capture function library. It is a general tool that does not depend on any specific system and is mainly used to capture data in the data link layer. It features simplicity, ease of use, and portability, but has relatively low efficiency. Therefore, under a high-speed network environment, considerable packet losses may occur when Libpcap is used.

            – Zero-copy packet capture technology: The so-called zero-copy (ZC) means that no copies between any internal memories occur during packet receiving and sending at a node. In sending, the data packets start directly from the user buffer of applications, pass through the network interfaces, and arrive at an external network. In receiving, the network interfaces send data packets directly to the user buffer. The basic idea of zero-copy is to reduce data copy times, reduce system calls, and reduce CPU load while datagrams are passed from network equipment to user program space. The zero-copy technology first utilizes direct memory access (DMA) technology to transmit network datagrams directly to an address space pre-allocated by the system kernel, so as to avoid the participation of the CPU. Meanwhile, it maps the internal memory of the datagrams in the system kernel to that of the detection program, or builds a cache region in the user space and maps it to the kernel space. Then the detection program accesses the internal memory directly, so as to reduce internal memory copies from system kernel to user space and reduce the number of system calls.

            – Mobile equipment: At present, mobile devices are more widely used. As mobile device functions become increasingly stronger, they feature more complex and multiple means of data acquisition as well as a greater variety of data. Mobile devices may acquire geographical location information through positioning systems; acquire audio information through microphones; acquire pictures, videos, streetscapes, two-dimensional barcodes, and other multimedia information through cameras; and acquire user gestures and other body language information through touch screens and gravity sensors. Over the years, wireless operators have improved the service level of the mobile Internet by acquiring and analyzing such information. For example, the iPhone itself is a “mobile spy.” It may collect wireless data and geographical location information, and then send such information back to Apple Inc. for processing, of which the user is not aware. Apart from Apple, smart phone operating systems such as Android of Google and Windows Phone of Microsoft can also collect information in a similar manner.
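
            The URL-queue procedure described for web crawlers in the list above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the crawler of any particular search engine: the names crawl and LinkExtractor, the max_pages limit, and the injected fetch function are all hypothetical, and the demo uses a small in-memory "web" on a made-up host so that it runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Dict, List
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags found in one page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed: str, fetch: Callable[[str], str], max_pages: int = 100) -> Dict[str, str]:
    """Breadth-first crawl from `seed`; `fetch(url)` returns HTML or raises on failure."""
    queue = deque([seed])        # URL queue, served in order of arrival
    seen = {seed}                # every URL ever enqueued, to avoid revisits
    pages: Dict[str, str] = {}   # downloaded pages keyed by URL
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue             # skip pages that cannot be downloaded
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts before queuing.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


if __name__ == "__main__":
    # Hypothetical three-page site held in memory.
    site = {
        "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
        "http://example.test/a": '<a href="/b#top">B</a>',
        "http://example.test/b": '<a href="/">home</a>',
    }
    result = crawl("http://example.test/", lambda u: site[u])
    print(list(result))
```

            The seen set is what keeps the loop from revisiting pages that link to each other. A production crawler would add politeness delays, robots.txt handling, and persistent storage, all of which are omitted here.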

            In addition to the aforementioned three data acquisition methods of main data sources, there are many other data collection methods or systems. For example, in scientific experiments, many special tools can be used to collect experimental data, such as magnetic spectrometers and radio telescopes. We may classify data collection methods from different perspectives. From the perspective of data sources, data collection methods can be classified into two categories: collection methods recording through data sources and collection methods recording through other auxiliary tools.

            3.2.2 Data transportation

            Upon the completion of raw data collection, data will be transferred to a data storage infrastructure for processing and analysis. As discussed in Section 2.3, big data is mainly stored in a data center. The data layout should be adjusted to improve computing efficiency or facilitate hardware maintenance. In other words, internal data transmission may occur in the data center. Therefore, data transmission consists of two phases: Inter-DCN transmissions and Intra-DCN transmissions.

            – Inter-DCN transmissions: Inter-DCN transmissions are from data source to data center, which is generally achieved with the existing physical network infrastructure. Because of the rapid growth of traffic demands, the physical network infrastructure in most regions around the world is constituted by high-volume, high-rate, and cost-effective optical fiber transmission systems. Over the past 20 years, advanced management equipment and technologies have been developed, such as IP-based wavelength division multiplexing (WDM) network architecture, to conduct smart control and management of optical fiber networks [48, 49]. WDM is a technology that multiplexes multiple optical carrier signals with different wavelengths and couples them to the same optical fiber of the optical link. In such technology, lasers with different wavelengths carry different signals. So far, the backbone network has been deployed with WDM optical transmission systems with a single-channel rate of 40Gb/s. At present, 100Gb/s commercial interfaces are available, and 100Gb/s systems (or TB/s systems) will be available in the near future [50]. However, traditional optical transmission technologies are limited by the bandwidth of the electronic bottleneck [51]. Recently, orthogonal frequency-division multiplexing (OFDM), initially designed for wireless systems, has been regarded as one of the main candidate technologies for future high-speed optical transmission. OFDM is a multi-carrier parallel transmission technology. It segments a high-speed data flow into low-speed sub-data-flows to be transmitted over multiple orthogonal sub-carriers [52]. Compared with the fixed channel spacing of WDM, OFDM allows sub-channel frequency spectrums to overlap with each other [53]. Therefore, it is a flexible and efficient optical networking technology.

            – Intra-DCN transmissions: Intra-DCN transmissions are the data communication flows within data centers. Intra-DCN transmissions depend on the communication mechanism within the data center (i.e., on physical connection plates, chips, internal memories of data servers, network architectures of data centers, and communication protocols). A data center consists of multiple integrated server racks interconnected by its internal connection networks. Nowadays, the internal connection networks of most data centers are fat-tree, two-layer, or three-layer structures based on multi-commodity network flows [51, 54]. In the two-layer topological structure, the racks are connected by 1Gbps top-of-rack (TOR) switches, and these switches are then connected to 10Gbps aggregation switches. The three-layer topological structure adds one layer on top of the two-layer structure; this layer consists of 10Gbps or 100Gbps core switches that connect the aggregation switches. There are also other topological structures that aim to improve data center networks [55–58]. Because of the inadequacy of electronic packet switches, it is difficult to increase communication bandwidth while keeping energy consumption low. Over the years, due to the huge success achieved by optical technologies, optical interconnection among the networks in data centers has drawn great interest. Optical interconnection is a high-throughput, low-delay, and low-energy-consumption solution. At present, optical technologies are only used for point-to-point links in data centers. Such optical links connect the switches using low-cost multi-mode fiber (MMF) with a 10Gbps data rate. Optical interconnection (switching in the optical domain) of networks in data centers is a feasible solution, which can provide Tbps-level transmission bandwidth with low energy consumption. Recently, many optical interconnection plans have been proposed for data center networks [59]. Some plans add optical paths to upgrade the existing networks, and other plans completely replace the current switches [59–64]. As a strengthening technology, Zhou et al. [65] adopt wireless links in the 60GHz frequency band to strengthen wired links. Network virtualization should also be considered to improve the efficiency and utilization of data center networks.

            3.2.3 Data pre-processing

            Because of the wide variety of data sources, the collected datasets vary with respect to noise, redundancy, consistency, etc., and it is undoubtedly a waste to store meaningless data. In addition, some analytical methods have strict requirements on data quality. Therefore, to enable effective data analysis, we shall pre-process data under many circumstances to integrate the data from different sources, which not only reduces storage expense but also improves analysis accuracy. Some relevant data pre-processing techniques are discussed as follows.

            – Integration: data integration is the cornerstone of modern commercial informatics, which involves the combination of data from different sources and provides users with a uniform view of data [66]. This is a mature research field for traditional databases. Historically, two methods have been widely recognized: data warehouse and data federation. Data warehousing includes a process named ETL (Extract, Transform, and Load). Extraction involves connecting source systems and selecting, collecting, analyzing, and processing necessary data. Transformation is the execution of a series of rules to transform the extracted data into standard formats. Loading means importing extracted and transformed data into the target storage infrastructure. Loading is the most complex procedure among the three, which includes operations such as transformation, copy, clearing, standardization, screening, and data organization. In data federation, a virtual database can be built to query and aggregate data from different data sources, but such a database does not contain data. Instead, it includes information or metadata related to the actual data and its positions. These two “storage-reading” approaches do not satisfy the high performance requirements of data flows or search programs and applications. Compared with queries, data in such two approaches is more dynamic and must be processed during data transmission. Generally, data integration methods are accompanied by flow processing engines and search engines [30, 67].

            – Cleaning: data cleaning is a process to identify inaccurate, incomplete, or unreasonable data, and then modify or delete such data to improve data quality. Generally, data cleaning includes five complementary procedures [68]: defining and determining error types, searching and identifying errors, correcting errors, documenting error examples and error types, and modifying data entry procedures to reduce future errors. During cleaning, data formats, completeness, rationality, and restrictions shall be inspected. Data cleaning is of vital importance to keep data consistent, and it is widely applied in many fields, such as banking, insurance, retail industry, telecommunications, and traffic control.

            In e-commerce, most data is electronically collected, which may have serious data quality problems. Classic data quality problems mainly come from software defects, customized errors, or system misconfiguration. Authors in [69] discussed data cleaning in e-commerce by crawlers and regularly re-copying customer and account information.

            In [70], the problem of cleaning RFID data was examined. RFID is widely used in many applications, e.g., inventory management and target tracking. However, the original RFID data features low quality and includes a lot of abnormal data, limited by the physical design and affected by environmental noise. In [71], a probability model was developed to cope with data loss in mobile environments. Khoussainova et al. in [72] proposed a system to automatically correct errors of input data by defining global integrity constraints.

            Herbert et al. [73] proposed a framework called BIO-AJAX to standardize biological data so as to conduct further computation and improve search quality. With BIO-AJAX, some errors and repetitions may be eliminated, and common data mining technologies can be executed more effectively.

            – Redundancy elimination: data redundancy refers to data repetitions or surplus, which usually occurs in many datasets. Data redundancy can increase unnecessary data transmission expense and cause defects in storage systems, e.g., waste of storage space, data inconsistency, reduction of data reliability, and data damage. Therefore, various redundancy reduction methods have been proposed, such as redundancy detection, data filtering, and data compression. Such methods may apply to different datasets or application environments. However, redundancy reduction may also bring about certain negative effects. For example, data compression and decompression cause additional computational burden. Therefore, the benefits of redundancy reduction and its cost should be carefully balanced. Data collected from different fields will increasingly appear in image or video formats. It is well known that images and videos contain considerable redundancy, including temporal redundancy, spatial redundancy, statistical redundancy, and sensing redundancy. Video compression is widely used to reduce redundancy in video data, as specified in the many video coding standards (MPEG-2, MPEG-4, H.263, and H.264/AVC). In [74], the authors investigated the problem of video compression in a video surveillance system with a video sensor network. The authors proposed a new MPEG-4 based method by investigating the contextual redundancy related to background and foreground in a scene. The low complexity and the low compression ratio of the proposed approach were demonstrated by the evaluation results.

            In generalized data transmission or storage, repeated data deletion is a special data compression technology, which aims to eliminate repeated data copies [75]. With repeated data deletion, individual data blocks or data segments are assigned identifiers (e.g., using a hash algorithm) and stored, with the identifiers added to the identification list. As the analysis of repeated data deletion continues, if a new data block has an identifier that is identical to one in the identification list, the new data block is deemed redundant and is replaced by a reference to the corresponding stored data block. Repeated data deletion can greatly reduce storage requirements, which is particularly important to a big data storage system. Apart from the aforementioned data pre-processing methods, specific data objects shall go through some other operations, such as feature extraction. Such operations play an important role in multimedia search and DNA analysis [76–78]. Usually, high-dimensional feature vectors (or high-dimensional feature points) are used to describe such data objects, and the system stores the dimensional feature vectors for future retrieval. Data transfer is usually used to process distributed heterogeneous data sources, especially business datasets [79]. As a matter of fact, in consideration of various datasets, it is non-trivial or impossible to build a uniform data pre-processing procedure and technology that is applicable to all types of datasets; the specific features, problem, performance requirements, and other factors of the datasets should be considered so as to select a proper data pre-processing strategy.
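
            The repeated data deletion procedure described above (assign each block an identifier, keep an identification list, and replace repeats by references) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the method of [75]: fixed-size blocks, SHA-256 as the hash algorithm, and the function names deduplicate and restore are choices made here for the example, and hash collisions are ignored.

```python
import hashlib
from typing import Dict, List, Tuple


def deduplicate(data: bytes, block_size: int = 4096) -> Tuple[Dict[str, bytes], List[str]]:
    """Split `data` into fixed-size blocks and store each distinct block once.

    Returns (store, recipe): `store` maps identifier -> block, and `recipe`
    lists the identifiers in original order so the data can be rebuilt.
    """
    store: Dict[str, bytes] = {}   # identification list together with stored blocks
    recipe: List[str] = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        identifier = hashlib.sha256(block).hexdigest()
        if identifier not in store:
            store[identifier] = block   # first occurrence: keep the block
        recipe.append(identifier)       # repeats become references only
    return store, recipe


def restore(store: Dict[str, bytes], recipe: List[str]) -> bytes:
    """Rebuild the original byte string from the block store and the recipe."""
    return b"".join(store[identifier] for identifier in recipe)


if __name__ == "__main__":
    payload = (b"A" * 4096) * 3 + (b"B" * 4096) + (b"A" * 4096)
    store, recipe = deduplicate(payload)
    assert restore(store, recipe) == payload
    print(f"{len(recipe)} blocks referenced, {len(store)} blocks stored")
```

            In this example five blocks are referenced but only two are stored. Fixed-size blocking is the simplest choice; it loses effectiveness when an insertion shifts all later block boundaries, which is one reason other blocking strategies exist.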

            4 Big data storage

            The explosive growth of data places stricter requirements on storage and management. In this section, we focus on the storage of big data. Big data storage refers to the storage and management of large-scale datasets while achieving reliability and availability of data access. We will review important issues including massive storage systems, distributed storage systems, and big data storage mechanisms. On one hand, the storage infrastructure needs to provide information storage service with reliable storage space; on the other hand, it must provide a powerful access interface for query and analysis of a large amount of data.

            Traditionally, as auxiliary equipment of a server, a data storage device is used to store, manage, look up, and analyze data with structured RDBMSs. With the sharp growth of data, data storage devices are becoming increasingly important, and many Internet companies pursue large storage capacity to be competitive. Therefore, there is a compelling need for research on data storage.

            4.1 Storage system for massive data

            Various storage systems have emerged to meet the demands of massive data. Existing massive storage technologies can be classified as Direct Attached Storage (DAS) and network storage, while network storage can be further classified into Network Attached Storage (NAS) and Storage Area Network (SAN).

            In DAS, various hard disks are directly connected with servers, and data management is server-centric, such that storage devices are peripheral equipment, each of which takes a certain amount of I/O resource and is managed by individual application software. For this reason, DAS is only suitable for interconnecting servers at a small scale. However, due to its low scalability, DAS will exhibit undesirable efficiency when the storage capacity is increased, i.e., the upgradeability and expandability are greatly limited. Thus, DAS is mainly used in personal computers and small-sized servers.

            Network storage utilizes the network to provide users with a unified interface for data access and sharing. Network storage equipment includes special data exchange equipment, disk arrays, tape libraries, and other storage media, as well as special storage software. It is characterized by strong expandability.

            NAS is actually auxiliary storage equipment of a network. It is directly connected to a network through a hub or switch using TCP/IP protocols. In NAS, data is transmitted in the form of files. Compared to DAS, the I/O burden at a NAS server is reduced extensively, since the server accesses a storage device indirectly through a network.

            While NAS is network-oriented, SAN is especially designed for data storage with a scalable and bandwidth-intensive network, e.g., a high-speed network with optical fiber connections. In SAN, data storage management is relatively independent within a storage local area network, where multipath-based data switching among any internal nodes is utilized to achieve a maximum degree of data sharing and data management.

            From the organization of a data storage system, DAS, NAS, and SAN can all be divided into three parts: (i) the disk array, which is the foundation of a storage system and the fundamental guarantee for data storage; (ii) connection and network sub-systems, which provide connection among one or more disk arrays and servers; (iii) storage management software, which handles data sharing, disaster recovery, and other storage management tasks of multiple servers.

            4.2 Distributed storage system

            The first challenge brought about by big data is how to develop a large-scale distributed storage system for efficient data processing and analysis. To use a distributed system to store massive data, the following factors should be taken into consideration:

            – Consistency: a distributed storage system requires multiple servers to cooperatively store data. As there are more servers, the probability of server failures will be larger. Usually, data is divided into multiple pieces to be stored at different servers to ensure availability in case of server failure. However, server failures and parallel storage may cause inconsistency among different copies of the same data. Consistency refers to assuring that multiple copies of the same data are identical.

            – Availability: a distributed storage system operates in multiple sets of servers. As more servers are used, server failures are inevitable. It would be desirable if the entire system is not seriously affected and can still satisfy customers' requests in terms of reading and writing. This property is called availability.

            – Partition tolerance: multiple servers in a distributed storage system are connected by a network. The network could have link/node failures or temporary congestion. The distributed system should have a certain level of tolerance to problems caused by network failures. It would be desirable that the distributed storage still works well when the network is partitioned.

            Eric Brewer proposed the CAP theory [80, 81] in 2000, which indicated that a distributed system could not simultaneously meet the requirements on consistency, availability, and partition tolerance; at most two of the three requirements can be satisfied simultaneously. Seth Gilbert and Nancy Lynch from MIT proved the correctness of the CAP theory in 2002. Since consistency, availability, and partition tolerance could not be achieved simultaneously, we can have a CA system by ignoring partition tolerance, a CP system by ignoring availability, and an AP system that ignores consistency, according to different design goals. The three systems are discussed in the following.

            CA systems do not have partition tolerance, i.e., they could not handle network failures. Therefore, CA systems are generally deemed as storage systems with a single server, such as the traditional small-scale relational databases. Such systems feature a single copy of data, such that consistency is easily ensured. Availability is guaranteed by the excellent design of relational databases. However, since CA systems could not handle network failures, they could not be expanded to use many servers. Therefore, most large-scale storage systems are CP systems and AP systems.

Compared with CA systems, CP systems ensure partition tolerance. Therefore, CP systems can be expanded to become distributed systems. CP systems generally maintain several copies of the same data in order to ensure a level of fault tolerance. CP systems also ensure data consistency, i.e., multiple copies of the same data are guaranteed to be completely identical. However, CP systems cannot ensure sound availability because of the high cost of consistency assurance. Therefore, CP systems are useful for scenarios with moderate load but stringent requirements on data accuracy (e.g., trading data). BigTable and HBase are two popular CP systems.

AP systems also ensure partition tolerance. However, AP systems differ from CP systems in that they also ensure availability, while providing only eventual consistency rather than the strong consistency of the previous two systems. Therefore, AP systems only apply to scenarios with frequent requests but not very high requirements on accuracy. For example, in online Social Networking Services (SNS) systems, there are many concurrent visits to the data, but a certain amount of data errors is tolerable. Furthermore, because AP systems ensure eventual consistency, accurate data can still be obtained after a certain delay. Therefore, AP systems may also be used in circumstances with no stringent real-time requirements. Dynamo and Cassandra are two popular AP systems.

4.3 Storage mechanism for big data

Considerable research on big data promotes the development of storage mechanisms for big data. Existing storage mechanisms for big data may be classified into three bottom-up levels: (i) file systems, (ii) databases, and (iii) programming models.

File systems are the foundation of the applications at upper levels. Google's GFS is an expandable distributed file system that supports large-scale, distributed, data-intensive applications [25]. GFS uses cheap commodity servers to achieve fault tolerance and provides customers with high-performance services. GFS supports large-scale file applications in which reads are more frequent than writes. However, GFS also has some limitations, such as a single point of failure and poor performance for small files. Such limitations have been overcome by Colossus [82], the successor of GFS.

In addition, other companies and researchers have their own solutions to meet the different demands for storage of big data. For example, HDFS and Kosmosfs are open source derivatives of GFS. Microsoft developed Cosmos [83] to support its search and advertisement business. Facebook uses Haystack [84] to store a large number of small-sized photos. Taobao also developed TFS and FastDFS. In conclusion, distributed file systems have become relatively mature after years of development and business operation. Therefore, we will focus on the other two levels in the rest of this section.

4.3.1 Database technology

Database technology has been evolving for more than 30 years. Various database systems have been developed to handle datasets at different scales and to support various applications. Traditional relational databases cannot meet the challenges on categories and scales brought about by big data. NoSQL databases (i.e., databases other than traditional relational databases) are becoming more popular for big data storage. NoSQL databases feature flexible schemas, support for simple and easy replication, simple APIs, eventual consistency, and support for large data volumes. NoSQL databases are becoming the core technology for big data. We will examine the following three main types of NoSQL databases in this section: key-value databases, column-oriented databases, and document-oriented databases, each based on a certain data model.

- Key-value Databases: Key-value databases are built on a simple data model in which data is stored as key-value pairs. Every key is unique, and clients may look up values by their keys. Such databases feature a simple structure, and modern key-value databases are characterized by higher expandability and shorter query response times than relational databases. Over the past few years, many key-value databases have appeared, motivated by Amazon's Dynamo system [85]. We will introduce Dynamo and several other representative key-value databases.

  - Dynamo: Dynamo is a highly available and expandable distributed key-value data storage system. It is used to store and manage the state of some core services of the Amazon e-Commerce Platform that can be realized with key access. The general schema of relational databases may generate invalid data and limit data scale and availability, while Dynamo resolves these problems with a simple key-object interface consisting of simple read and write operations. Dynamo achieves elasticity and availability through data partitioning, data replication, and object versioning mechanisms. The Dynamo partitioning scheme relies on Consistent Hashing [86] to divide the load among multiple main storage machines; its main advantage is that the arrival or departure of a node only affects its directly adjacent nodes and does not affect other nodes. Dynamo replicates data to N sets of servers, where N is a configurable parameter, in order to achieve high availability and durability. Dynamo also provides eventual consistency, which allows all copies to be updated asynchronously.

  - Voldemort: Voldemort is also a key-value storage system, which was initially developed for and is still used by LinkedIn. Keys and values in Voldemort are composite objects built from lists and maps. The Voldemort interface includes three simple operations: read, write, and delete, all of which are identified by keys. Voldemort provides multi-version concurrency control for asynchronous updates but does not ensure data consistency. However, Voldemort supports optimistic locking for consistent multi-record updates: when an update conflicts with any other operation, the update is aborted. The data replication mechanism of Voldemort is the same as that of Dynamo. Voldemort not only stores data in RAM but also allows data to be inserted into a storage engine. In particular, Voldemort supports two storage engines: Berkeley DB and Random Access Files.

  Key-value databases emerged a few years ago. Deeply influenced by Amazon's Dynamo, other key-value storage systems include Redis, Tokyo Cabinet and Tokyo Tyrant, Memcached and MemcacheDB, Riak, and Scalaris, all of which provide expandability by distributing keys across nodes. Voldemort, Riak, Tokyo Cabinet, and Memcached can use attached storage devices to store data in RAM or on disk. The other storage systems store data in RAM and provide disk backup, or rely on replication and recovery to avoid the need for backup.
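The partitioning and replication idea described for Dynamo can be illustrated with a small sketch. The following is a toy consistent-hash ring, not Dynamo's actual implementation: the class `HashRing`, the use of MD5, the node names, and the absence of virtual nodes are all assumptions made for illustration. It shows that adding a node only takes keys away from its neighbor on the ring.

```python
import bisect
import hashlib


def ring_hash(key: str) -> int:
    """Map a string to a ring position (MD5 is an arbitrary illustrative choice)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring with N-way replication (hypothetical, no virtual nodes)."""

    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        self._ring = sorted((ring_hash(n), n) for n in nodes)

    def add_node(self, node):
        bisect.insort(self._ring, (ring_hash(node), node))

    def remove_node(self, node):
        self._ring.remove((ring_hash(node), node))

    def preference_list(self, key):
        """Return the first `replicas` nodes found clockwise from the key's position."""
        if not self._ring:
            return []
        positions = [pos for pos, _ in self._ring]
        start = bisect.bisect_right(positions, ring_hash(key))
        count = min(self.replicas, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


ring = HashRing(["node-a", "node-b", "node-c", "node-d"], replicas=2)
keys = [f"cart:{i}" for i in range(1000)]
before = {k: ring.preference_list(k)[0] for k in keys}

ring.add_node("node-e")
after = {k: ring.preference_list(k)[0] for k in keys}

# Keys whose primary owner changed; every one of them now belongs to the new node.
moved = [k for k in keys if before[k] != after[k]]
```

Production systems usually place several virtual nodes per physical server on the ring to even out the load; that refinement is omitted here to keep the sketch short.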

- Column-oriented Databases: Column-oriented databases store and process data by columns rather than by rows. Both columns and rows are split across multiple nodes to achieve expandability. Column-oriented databases are mainly inspired by Google's BigTable. In this section, we first discuss BigTable and then introduce several derivative tools.

  - BigTable: BigTable is a distributed, structured data storage system designed to process large-scale (PB class) data across thousands of commodity servers [87]. The basic data structure of BigTable is a sparse, distributed, persistent, multi-dimensional sorted map. The map is indexed by row key, column key, and timestamp, and every value in the map is an uninterpreted byte array. Each row key in BigTable is a character string of up to 64 KB. Rows are stored in lexicographic order and dynamically partitioned into Tablets (i.e., units of distribution) for load balancing. Thus, reading a short range of rows can be highly efficient, since it only involves communication with a small portion of the machines. Column keys are grouped by prefix, thus forming column families. These column families are the basic units of access control. Timestamps are 64-bit integers that distinguish different versions of a cell value. Clients may flexibly determine the number of cell versions stored. These versions are ordered by descending timestamp, so the latest version is always read first.

    The BigTable API features the creation and deletion of tables and column families, as well as modification of the metadata of clusters, tables, and column families. Client applications may insert or delete values in BigTable, query values from columns, or browse subsets of the data in a table. BigTable also supports some other features, such as transaction processing within a single row. Users may use such features to conduct more complex data processing.

    Every BigTable implementation includes three main components: a Master server, Tablet servers, and a client library. BigTable allows only one active Master server, which is responsible for assigning Tablets to Tablet servers, detecting added or removed Tablet servers, and balancing load. In addition, the Master handles changes to the BigTable schema, e.g., creating tables and column families, and performs garbage collection of files saved in GFS, i.e., files that have been deleted or are no longer used by specific BigTable instances. Every Tablet server manages a set of Tablets and is responsible for reads and writes on the Tablets it has loaded. When Tablets grow too big, they are split by the server. The application client library is used to communicate with BigTable instances.

    BigTable is built on many fundamental components of Google, including GFS [25], a cluster management system, the SSTable file format, and Chubby [88]. GFS is used to store data and log files. The cluster management system is responsible for task scheduling, resource sharing, handling of machine failures, and monitoring of machine status. The SSTable file format is used to store BigTable data internally, and it provides a persistent, ordered, immutable map from keys to values, both of which are arbitrary byte strings. BigTable uses Chubby for the following tasks: 1) ensuring that there is at most one active Master at any time; 2) storing the bootstrap location of BigTable data; 3) discovering Tablet servers; 4) conducting error recovery in case of Tablet server failures; 5) storing BigTable schema information; and 6) storing access control lists.

  - Cassandra: Cassandra is a distributed storage system for managing huge amounts of structured data distributed among multiple commodity servers [89]. The system was developed by Facebook and became an open source tool in 2008. It adopts the ideas and concepts of both Amazon Dynamo and Google BigTable, in particular integrating the distributed system technology of Dynamo with the BigTable data model. Tables in Cassandra take the form of a distributed four-dimensional structured map, where the four dimensions are row, column, column family, and super column. A row is identified by a string key of arbitrary length. Regardless of the number of columns read or written, an operation on a row is atomic. Columns may be grouped into clusters called column families, which are similar to those in the BigTable data model. Cassandra provides two kinds of column families: column families and super columns. A super column contains an arbitrary number of columns associated with the same name. A column family contains columns and super columns, which may be continuously inserted into the column family at runtime. The partitioning and replication mechanisms of Cassandra are very similar to those of Dynamo, so as to achieve consistency.

  - Derivative tools of BigTable: since the BigTable code is not available under an open source license, several open source projects compete to implement the BigTable concept in similar systems, such as HBase and Hypertable.

    HBase is a BigTable clone written in Java and is part of Apache's Hadoop MapReduce framework [90]. HBase replaces GFS with HDFS. It writes updates into RAM and regularly flushes them to files on disk. Row operations are atomic, with row-level locking and transaction processing, which is optional at large scale. Partitioning and distribution are handled transparently, without requiring client-side hashing or a fixed key space.

    HyperTable was developed along the lines of BigTable to obtain a high-performance, expandable, distributed storage and processing system for structured and unstructured data [91]. HyperTable relies on a distributed file system, e.g., HDFS, and a distributed lock manager. Its data representation, processing, and partitioning mechanisms are similar to those of BigTable. HyperTable has its own query language, called HyperTable Query Language (HQL), which allows users to create, modify, and query the underlying tables.

  Since column-oriented databases mainly emulate BigTable, their designs are all similar except for the concurrency mechanism and several other features. For example, Cassandra emphasizes weak consistency based on multi-version concurrency control, while HBase and HyperTable focus on strong consistency through locks or log records.
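The BigTable data model described above, a sorted map from (row key, column key, timestamp) to an uninterpreted byte array, can be sketched in a few lines. This is only an in-memory illustration under stated assumptions: the class `SparseTable`, its methods, and the sample keys are hypothetical and are not part of the BigTable API.

```python
from collections import defaultdict


class SparseTable:
    """Toy (row key, column key, timestamp) -> bytes map, loosely after BigTable."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # row key -> column key -> list of (timestamp, value), newest first
        self._rows = defaultdict(dict)

    def put(self, row, column, timestamp, value: bytes):
        versions = self._rows[row].setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]  # keep only the newest versions

    def get(self, row, column, timestamp=None):
        """Return the newest value, or the newest one at or before `timestamp`."""
        for ts, value in self._rows.get(row, {}).get(column, []):
            if timestamp is None or ts <= timestamp:
                return value
        return None

    def scan(self, start_row, end_row):
        """Yield (row, columns) for start_row <= row < end_row in lexicographic order."""
        for row in sorted(self._rows):
            if start_row <= row < end_row:
                yield row, self._rows[row]


table = SparseTable(max_versions=2)
# Column keys use a "family:qualifier" prefix convention.
table.put("com.example.www", "contents:html", 1, b"<html>v1</html>")
table.put("com.example.www", "contents:html", 2, b"<html>v2</html>")
table.put("com.example.www", "contents:html", 3, b"<html>v3</html>")
table.put("org.example.www", "contents:html", 1, b"<html>other</html>")

latest = table.get("com.example.www", "contents:html")
as_of_2 = table.get("com.example.www", "contents:html", timestamp=2)
com_rows = [row for row, _ in table.scan("com.", "com/")]
```

Because rows are kept in lexicographic order, a short range scan such as the one over the `com.` prefix touches only neighboring rows, which is the property that makes Tablet-based partitioning efficient.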

- Document Databases: Compared with key-value storage, document storage can support more complex data forms. Since documents do not follow a strict schema, there is no need to conduct schema migration. In addition, key-value pairs can still be saved. We will examine three important representatives of document storage systems, i.e., MongoDB, SimpleDB, and CouchDB.

  - MongoDB: MongoDB is an open-source, document-oriented database [92]. MongoDB stores documents as Binary JSON (BSON) objects [93], which are similar to JSON objects. Every document has an ID field as its primary key. Queries in MongoDB are expressed with a syntax similar to JSON. A database driver sends the query as a BSON object to MongoDB. The system allows queries on all documents, including embedded objects and arrays. To enable rapid queries, indexes can be created on the queryable fields of documents. Replication in MongoDB is carried out through log files on the master nodes, which contain all the high-level operations conducted in the database. During replication, the slaves ask the master for all write operations since their last synchronization and execute the operations from the log files on their local databases. MongoDB supports horizontal expansion with automatic sharding to distribute data among thousands of nodes, with automatic load balancing and failover.


  - SimpleDB: SimpleDB is a distributed database offered as a web service by Amazon [94]. Data in SimpleDB is organized into various domains in which data may be stored, acquired, and queried. Each domain holds items with different attributes, stored as sets of name/value pairs. Data is replicated to different machines at different data centers in order to ensure data safety and improve performance. The system does not support automatic partitioning and thus cannot be expanded as the data volume changes. SimpleDB allows users to query with SQL. It is worth noting that SimpleDB can assure eventual consistency but does not support Multi-Version Concurrency Control (MVCC). Therefore, conflicts cannot be detected from the client side.

  - CouchDB: Apache CouchDB is a document-oriented database written in Erlang [95]. Data in CouchDB is organized into documents consisting of fields named by keys/names and values, which are stored and accessed as JSON objects. Every document is provided with a unique identifier. CouchDB allows access to database documents through a RESTful HTTP API. If a document needs to be modified, the client must download the entire document, modify it, and then send it back to the database. Each time a document is rewritten, its revision identifier is updated. CouchDB uses optimistic replication to obtain scalability without a sharding mechanism. Since multiple CouchDB instances may run simultaneously alongside other transactions, any kind of replication topology can be built. The consistency of CouchDB relies on the replication mechanism. CouchDB supports MVCC with historical hash records.
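The difference between a store with MVCC (such as CouchDB) and one without it (such as SimpleDB) is whether a client can detect that a document changed after it was read. The sketch below illustrates revision-based optimistic concurrency in general terms. It is an assumption-laden toy: `DocStore`, `ConflictError`, and the "generation-hash" revision format are hypothetical and only loosely modeled on the behavior described above, not on CouchDB's actual revision algorithm or HTTP API.

```python
import hashlib
import json


class ConflictError(Exception):
    """Raised when a write is based on a stale revision."""


class DocStore:
    """Toy document store in which every write must name the revision it was based on."""

    def __init__(self):
        self._docs = {}  # doc_id -> (revision, body)

    @staticmethod
    def _next_rev(prev_rev, body):
        # Hypothetical revision format: "<generation>-<hash of the new body>".
        generation = int(prev_rev.split("-")[0]) + 1 if prev_rev else 1
        digest = hashlib.md5(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()
        return f"{generation}-{digest}"

    def get(self, doc_id):
        rev, body = self._docs[doc_id]
        return rev, dict(body)  # the client receives the whole document

    def put(self, doc_id, body, rev=None):
        current = self._docs.get(doc_id)
        current_rev = current[0] if current else None
        if rev != current_rev:
            raise ConflictError(f"{doc_id}: expected {current_rev}, got {rev}")
        new_rev = self._next_rev(current_rev, body)
        self._docs[doc_id] = (new_rev, dict(body))
        return new_rev


store = DocStore()
rev1 = store.put("doc-1", {"title": "draft"})

# Two clients read the same revision of the document.
rev_a, body_a = store.get("doc-1")
rev_b, body_b = store.get("doc-1")

body_a["title"] = "edited by A"
rev2 = store.put("doc-1", body_a, rev=rev_a)  # first writer wins

try:
    store.put("doc-1", body_b, rev=rev_b)     # second writer holds a stale revision
    conflict_detected = False
except ConflictError:
    conflict_detected = True
```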

4.3.2 Database programming model

Big data is generally stored on hundreds or even thousands of commodity servers. Thus, traditional parallel models, such as Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), may not be adequate to support such large-scale parallel programs. Recently, several proposed parallel programming models have effectively improved the performance of NoSQL and reduced the performance gap to relational databases. Therefore, these models have become the cornerstone for the analysis of massive data.

- MapReduce: MapReduce [22] is a simple but powerful programming model for large-scale computing that uses large clusters of commodity PCs to achieve automatic parallel processing and distribution. The MapReduce computing model has only two functions, i.e., Map and Reduce, both of which are programmed by users. The Map function processes input key-value pairs and generates intermediate key-value pairs. MapReduce then combines all the intermediate values related to the same key and passes them to the Reduce function, which further merges the value set into a smaller set. MapReduce has the advantage that it avoids the complicated steps of developing parallel applications, e.g., data scheduling, fault tolerance, and inter-node communication. The user only needs to program the two functions to develop a parallel application. The initial MapReduce framework did not support multiple datasets in a task, which has been mitigated by some recent enhancements [96, 97].

  Over the past decades, programmers have become familiar with the high-level declarative language SQL, often used in relational databases for task description and dataset analysis. However, the succinct MapReduce framework only provides two opaque functions, which cannot cover all the common operations. Therefore, programmers have to spend time programming basic functions, which are typically hard to maintain and reuse. In order to improve programming efficiency, some high-level language systems have been proposed, e.g., Sawzall [98] of Google, Pig Latin [99] of Yahoo, Hive [100] of Facebook, and Scope [87] of Microsoft.
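The division of labor between the user-supplied Map and Reduce functions and the framework can be sketched with the classic word-count task. The runner below is a single-process illustration only: `run_mapreduce`, `map_fn`, and `reduce_fn` are hypothetical names, and the scheduling, fault tolerance, and inter-node communication that a real framework provides are deliberately absent.

```python
from collections import defaultdict


def map_fn(_doc_id, text):
    """User-defined Map: emit an intermediate (word, 1) pair for every word."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """User-defined Reduce: merge all values of one key into a smaller set."""
    yield word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Stand-in for the framework: group intermediate values by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    results = []
    for out_key in sorted(groups):
        results.extend(reducer(out_key, groups[out_key]))
    return results


documents = [("d1", "big data big value"), ("d2", "data value chain")]
word_counts = dict(run_mapreduce(documents, map_fn, reduce_fn))
```

Only `map_fn` and `reduce_fn` would change for a different task, which is the sense in which the user "only needs to program the two functions".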

- Dryad: Dryad [101] is a general-purpose distributed execution engine for processing coarse-grained data-parallel applications. The operational structure of Dryad is a directed acyclic graph, in which vertexes represent programs and edges represent data channels. Dryad executes operations on the vertexes in clusters and transmits data via data channels, including files, TCP connections, and shared-memory FIFOs. During operation, resources in the logical operation graph are automatically mapped to physical resources.

  The operation structure of Dryad is coordinated by a central program called the job manager, which can run in the cluster or on a workstation connected through the network. A job manager consists of two parts: 1) application code, which is used to build a job communication graph, and 2) program library code, which is used to schedule available resources. All data is transmitted directly between vertexes. Therefore, the job manager is only responsible for decision-making and does not obstruct any data transmission.

  In Dryad, application developers can flexibly choose any directed acyclic graph to describe the communication patterns of the application and to express data transmission mechanisms. In addition, Dryad allows vertexes to use any number of inputs and outputs, while MapReduce supports only one input and one output set. DryadLINQ [102] is the high-level language of Dryad and is used to integrate the aforementioned SQL-like language execution environment.
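The graph-structured execution model described for Dryad can be sketched as follows: vertexes are small programs, edges are channels, and a vertex runs once all of its inputs are available. The function `run_dag` and the vertex names are assumptions made for illustration, and the sketch runs in one process using the standard-library `graphlib` module (Python 3.9+), so it shows only the dataflow idea, not Dryad's job manager or its channel types.

```python
from graphlib import TopologicalSorter


def run_dag(vertices, edges, inputs):
    """Run each vertex once all of its predecessors have produced output.

    vertices: name -> function taking a list of input values
    edges:    list of (source, destination) channels
    inputs:   name -> external input for vertexes without predecessors
    """
    predecessors = {name: [] for name in vertices}
    for src, dst in edges:
        predecessors[dst].append(src)

    outputs = {}
    for name in TopologicalSorter(predecessors).static_order():
        if predecessors[name]:
            args = [outputs[p] for p in predecessors[name]]
        else:
            args = [inputs[name]]
        outputs[name] = vertices[name](args)
    return outputs


vertices = {
    "read_a": lambda args: [x for x in args[0] if x % 2 == 0],  # keep even numbers
    "read_b": lambda args: [x * 10 for x in args[0]],           # scale by ten
    "merge": lambda args: sorted(args[0] + args[1]),            # vertex with two inputs
    "total": lambda args: sum(args[0]),
}
edges = [("read_a", "merge"), ("read_b", "merge"), ("merge", "total")]
inputs = {"read_a": [1, 2, 3, 4], "read_b": [1, 2]}

outputs = run_dag(vertices, edges, inputs)
```

The `merge` vertex consumes two input channels, which is the kind of multi-input stage that the text contrasts with the single input and output set of MapReduce.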

- All-Pairs: All-Pairs [103] is a system specially designed for biometrics, bio-informatics, and data mining applications. It focuses on comparing element pairs in two datasets using a given function. An All-Pairs problem can be expressed as a three-tuple (Set A, Set B, Function F), in which Function F is used to compare all elements in Set A and Set B. The comparison result is an output matrix M, which is also called the Cartesian product or cross join of Set A and Set B.

  All-Pairs is implemented in four phases: system modeling, distribution of input data, batch job management, and result collection. In Phase I, an approximation model of system performance is built to evaluate how much CPU resource is needed and how to partition the job. In Phase II, a spanning tree is built for data transmission, which allows the workload of every partition to retrieve its input data effectively. In Phase III, after the data flow is delivered to the proper nodes, the All-Pairs engine builds a batch-processing submission for the jobs in the partitions, while sequencing them in the batch processing system and formulating a node running command to acquire data. In the last phase, after the batch processing system completes the jobs, the extraction engine collects the results and combines them in a proper structure, which is generally a single file list in which all results are put in order.
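The abstraction itself, as opposed to the four-phase distributed implementation, is compact enough to state directly: M[i][j] = F(A[i], B[j]). The sketch below is a sequential illustration; the function names and the choice of Hamming distance as the comparison function F are assumptions made for the example.

```python
def all_pairs(set_a, set_b, compare):
    """Return the matrix M with M[i][j] = compare(set_a[i], set_b[j])."""
    return [[compare(a, b) for b in set_b] for a in set_a]


def hamming(a, b):
    """Example comparison function F: number of positions at which two strings differ."""
    return sum(x != y for x, y in zip(a, b))


A = ["1010", "1111"]
B = ["1010", "0000", "1110"]
M = all_pairs(A, B, hamming)
```

The result has one row per element of Set A and one column per element of Set B, so the amount of work grows with the product of the two set sizes, which is why the real system models performance and partitions the job before running it.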

- Pregel: The Pregel [104] system of Google facilitates the processing of large-sized graphs, e.g., analysis of network graphs and social networking services. A computational task is expressed by a directed graph consisting of vertexes and directed edges. Every vertex is associated with a modifiable, user-defined value, and every directed edge is associated with its source vertex and consists of a user-defined value and the identifier of a target vertex. Once the graph is built, the program conducts iterative calculations, called supersteps, separated by global synchronization points, until the algorithm completes and the output is produced. In every superstep, vertex computations are parallel, and every vertex executes the same user-defined function that expresses the logic of a given algorithm. Every vertex may modify its own state and that of its outgoing edges, receive messages sent in the previous superstep, send messages to other vertexes, and even modify the topological structure of the entire graph. Edges are not provided with corresponding computations. Every vertex may deactivate itself by suspension. When all vertexes are inactive and there are no messages to transmit, the entire program execution is completed. The output of a Pregel program is a set consisting of the values output by all the vertexes. Generally speaking, the input and output of a Pregel program are isomorphic directed graphs.
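The superstep model can be sketched with a small vertex-centric computation that propagates the maximum value through a graph. This is a single-process illustration under stated assumptions: `pregel_max`, the graph, and the message-passing bookkeeping are hypothetical and do not reflect Pregel's actual API or its distributed runtime.

```python
def pregel_max(values, out_edges, max_supersteps=30):
    """Propagate the maximum vertex value with synchronous supersteps.

    values:    vertex -> initial value
    out_edges: vertex -> list of target vertexes
    """
    values = dict(values)
    active = set(values)                 # every vertex starts active
    inbox = {v: [] for v in values}
    superstep = 0
    while superstep < max_supersteps and (active or any(inbox.values())):
        outbox = {v: [] for v in values}
        for v in values:
            messages = inbox[v]
            if v not in active and not messages:
                continue                 # a halted vertex stays idle until a message arrives
            new_value = max([values[v]] + messages)
            changed = new_value != values[v]
            values[v] = new_value
            if superstep == 0 or changed:
                for target in out_edges.get(v, []):
                    outbox[target].append(new_value)
            active.discard(v)            # suspend; an incoming message reactivates it
        inbox = outbox                   # global synchronization point
        superstep += 1
    return values, superstep


initial = {"a": 3, "b": 6, "c": 2, "d": 1}
graph = {"a": ["b"], "b": ["a", "d"], "c": ["b", "d"], "d": ["c"]}
final_values, steps = pregel_max(initial, graph)
```

Messages written during one superstep become visible only in the next one, and the loop ends once every vertex is suspended and no messages are in flight, mirroring the termination condition described above.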

Inspired by the above programming models, other research has also focused on programming models for more complex computational tasks, e.g., iterative computations [105, 106], fault-tolerant memory computations [107], incremental computations [108], and flow control decision-making related to data [109].

5 Big data analysis

The analysis of big data mainly involves analytical methods for traditional data and big data, analytical architectures for big data, and software used for mining and analysis of big data. Data analysis is the final and most important phase in the value chain of big data, with the purpose of extracting useful values and providing suggestions or decisions. Different levels of potential value can be generated through the analysis of datasets in different fields [10]. However, data analysis is a broad area, which frequently changes and is extremely complex. In this section, we introduce the methods, architectures, and tools for big data analysis.

5.1 Traditional data analysis

Traditional data analysis means using proper statistical methods to analyze massive data, to concentrate, extract, and refine useful data hidden in a batch of chaotic datasets, and to identify the inherent law of the subject matter, so as to maximize the value of data. Data analysis plays a huge guidance role in making development plans for a country, understanding customer demands for commerce, and predicting market trends for enterprises. Big data analysis can be deemed the analysis technique for a special kind of data. Therefore, many traditional data analysis methods may still be used for big data analysis. Several representative traditional data analysis methods are examined in the following, many of which come from statistics and computer science.

- Cluster Analysis: a statistical method for grouping objects, and specifically, classifying objects according to some features. Cluster analysis is used to differentiate objects with particular features and divide them into categories (clusters) according to these features, such that objects in the same category have high homogeneity while different categories have high heterogeneity. Cluster analysis is an unsupervised learning method that requires no training data.

- Factor Analysis: basically targeted at describing the relation among many elements with only a few factors, i.e., grouping several closely related variables into a factor; the few factors are then used to reveal most of the information in the original data.


- Correlation Analysis: an analytical method for determining the law of relations, such as correlation, correlative dependence, and mutual restriction, among observed phenomena, and accordingly conducting forecast and control. Such relations may be classified into two types: (i) function, reflecting a strict dependence relationship among phenomena, which is also called a definitive dependence relationship; and (ii) correlation, covering undetermined or inexact dependence relations, in which a numerical value of one variable may correspond to several numerical values of the other variable, and such numerical values fluctuate regularly around their mean values.

- Regression Analysis: a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies dependence relationships among variables that are hidden by randomness. Regression analysis can make complex and undetermined correlations among variables simple and regular.

- A/B Testing: also called bucket testing. It is a technique for determining how to improve target variables by comparing tested groups. Big data will require a large number of tests to be executed and analyzed.

- Statistical Analysis: Statistical analysis is based on statistical theory, a branch of applied mathematics. In statistical theory, randomness and uncertainty are modeled with probability theory. Statistical analysis can provide a description and an inference for big data. Descriptive statistical analysis can summarize and describe datasets, while inferential statistical analysis can draw conclusions from data subject to random variations. Statistical analysis is widely applied in the economic and medical care fields [110].

- Data Mining Algorithms: Data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. In 2006, the IEEE International Conference on Data Mining Series (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [111], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, and CART. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and link mining, all of which are among the most important problems in data mining research.
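As a concrete instance of the regression analysis described in the list above, the sketch below fits a straight line to observed data by ordinary least squares. It covers only the simplest case of one explanatory variable; the function name `fit_line` and the sample data are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Noisy observations that roughly follow y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = fit_line(xs, ys)
```

The fitted slope and intercept summarize the dependence of y on x that the noise in the individual observations obscures.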

5.2 Big data analytic methods

At the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data so as to bring value to enterprises and individuals. At present, the main processing methods for big data are as follows.

- Bloom Filter: A Bloom filter consists of a series of hash functions. The principle of the Bloom filter is to store hash values of data rather than the data itself by using a bit array, which is in essence a bitmap index that uses hash functions to conduct lossy compressed storage of data. It has such advantages as high space efficiency and high query speed, but also has some disadvantages, namely false positives (misrecognition) and difficulty of deletion.

- Hashing: a method that essentially transforms data into shorter fixed-length numerical values or index values. Hashing has such advantages as rapid reading and writing and high query speed, but it is hard to find a sound hash function.

- Index: an index is always an effective method for reducing the expense of disk reading and writing and for improving insertion, deletion, modification, and query speeds, both in traditional relational databases that manage structured data and in other technologies that manage semi-structured and unstructured data. However, an index has the disadvantage of the additional cost of storing index files, which must be maintained dynamically when data is updated.

- Trie: also called a trie tree, a variant of the hash tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of the trie is to use common prefixes of character strings to reduce comparisons between character strings to the greatest extent, so as to improve query efficiency.

- Parallel Computing: compared to traditional serial computing, parallel computing refers to simultaneously using several computing resources to complete a computation task. Its basic idea is to decompose a problem into parts and assign them to several separate processes to be completed independently, so as to achieve co-processing. Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see a comparison in Table 1).
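The Bloom filter from the list above can be sketched in a few lines to show both its space efficiency and the source of its misrecognition. The class below is a minimal illustration: the sizes, the number of hash functions, and the trick of deriving several hash functions from SHA-256 with a seed prefix are assumptions, not a recommended configuration.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: stores hash positions in a bit array, never the items themselves."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive several hash functions by prefixing the item with a seed (an assumption).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


seen = BloomFilter(num_bits=1024, num_hashes=3)
urls = [f"http://example.com/page/{i}" for i in range(50)]
for url in urls:
    seen.add(url)

all_found = all(url in seen for url in urls)
```

An item that was added is always reported as present, whereas an item that was never added may occasionally be reported as present when its bit positions happen to be set by other items. Clearing bits to delete an item would risk erasing bits shared with other items, which is why deletion is listed as a disadvantage.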

Although parallel computing systems or tools such as MapReduce and Dryad are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages are being developed based on these systems. Such high-level languages include Sawzall, Pig, and Hive for MapReduce, as well as Scope and DryadLINQ for Dryad.

5.3 Architecture for big data analysis

Because of the 4Vs of big data, different analytical architectures should be considered for different application requirements.


Table 1 Comparison of MPI, MapReduce, and Dryad

| | MPI | MapReduce | Dryad |
|---|---|---|---|
| Deployment | Computing nodes and data storage arranged separately (data should be moved to the computing node) | Computing and data storage arranged at the same node (computing should be close to data) | Computing and data storage arranged at the same node (computing should be close to data) |
| Resource management/scheduling | - | Workqueue (Google), HOD (Yahoo) | Not clear |
| Low-level programming | MPI API | MapReduce API | Dryad API |
| High-level programming | - | Pig, Hive, Jaql, ... | Scope, DryadLINQ |
| Data storage | The local file system, NFS, ... | GFS (Google), HDFS (Hadoop), KFS, Amazon S3, ... | NTFS, Cosmos DFS |
| Task partitioning | User manually partitions the tasks | Automatic | Automatic |
| Communication | Messaging, remote memory access | Files (local FS, DFS) | Files, TCP pipes, shared-memory FIFOs |
| Fault tolerance | Checkpoint | Task re-execution | Task re-execution |

5.3.1 Real-time vs. offline analysis

According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

- Real-time analysis: mainly used in e-commerce and finance. Since data constantly changes, rapid data analysis is needed and analytical results must be returned with a very short delay. The main existing architectures for real-time analysis include (i) parallel processing clusters using traditional relational databases, and (ii) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

- Offline analysis: usually used for applications without high requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally conducts analysis by importing logs into a special platform through data acquisition tools. Under the big data setting, many Internet enterprises use the offline analysis architecture based on Hadoop in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples of such data acquisition tools include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool Timetunnel, and Chukwa of Hadoop. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.

5.3.2 Analysis at different levels

Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive-level analysis, which are examined in the following.

– Memory-level analysis is for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays, the memory of a server cluster surpasses hundreds of GB, while even the TB level is common. Therefore, an in-memory database technology may be used, and hot data shall reside in the memory so as to improve the analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis. MongoDB is a representative memory-level analytical architecture. With the development of SSD (Solid-State Drive), the capacity and performance of memory-level data analysis have been further improved and such analysis has been widely applied.

– BI analysis is for the case when the data scale surpasses the memory level but may be imported into the BI analysis environment. The currently mainstream BI products are provided with data analysis plans to support the level over TB.

– Massive analysis is for the case when the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analysis utilizes HDFS of Hadoop to store data and uses MapReduce for data analysis. Most massive analysis belongs to the offline analysis category.
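The MapReduce style of massive analysis mentioned above can be illustrated with a minimal single-process sketch. It is not Hadoop's API: the names map_phase, shuffle, reduce_phase, and run_job are our own illustrative choices, and a real job would distribute the map and reduce tasks over the nodes that hold the HDFS blocks.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map step: emit a (word, 1) pair for every token in one input line."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: aggregate the values collected for one key."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce sequentially over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    # Hypothetical log lines, used only to exercise the sketch.
    print(run_job(["GET /index", "GET /cart", "POST /cart"]))
```

Because each map call depends only on its own record and each reduce call only on one key, both phases can be parallelized, which is why this model suits offline analysis of data too large for BI products.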


5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithms differ greatly from each other according to different kinds of data and application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the top five most widely used software, according to a survey of "What Analytics, Data mining, Big Data software that you used in the past 12 months for a real project?" of 798 professionals made by KDNuggets in 2012 [112].

– R (30.7 %): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. While computing-intensive tasks are executed, code programmed with C, C++, and Fortran may be called in the R environment. In addition, skilled users can directly call R objects in C. Actually, R is a realization of the S language, which is an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and drawing plots. Compared to S, R is more popular since it is open source. R ranks top 1 in the KDNuggets 2012 survey. Furthermore, in a survey of "Design languages you have used for data mining/analysis in the past year" in 2012, R was also in the first place, defeating SQL and Java. Due to the popularity of R, database manufacturers, such as Teradata and Oracle, have released products supporting R.

– Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins, such as Analysis ToolPak and Solver Add-in, with powerful functions for data analysis are integrated initially, but such plug-ins can be used only if users enable them. Excel is also the only commercial software among the top five.

– Rapid-I RapidMiner (26.7 %): RapidMiner is an open source software used for data mining, machine learning, and predictive analysis. In an investigation of KDnuggets in 2011, it was more frequently used than R (ranked Top 1). Data mining and machine learning programs provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment. The data mining flow is described in XML and displayed through a graphic user interface (GUI). RapidMiner is written in Java. It integrates the learner and evaluation method of Weka, and works with R. Functions of RapidMiner are implemented with connection of processes including various operators. The entire flow can be deemed as a production line of a factory, with original data input and model results output. The operators can be considered as some specific functions with different input and output characteristics.

– KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, and open-source rich data integration, data processing, data analysis, and data mining platform [113]. It allows users to create data flows or data channels in a visualized manner, to selectively run some or all analytical procedures, and provides analytical results, models, and interactive views. KNIME is written in Java and, based on Eclipse, provides more functions as plug-ins. Through plug-in files, users can insert processing modules for files, pictures, and time series, and integrate them into various open source projects, e.g., R and Weka. KNIME controls data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted under a visualized environment. KNIME is designed as a module-based and expandable framework. There is no dependence between its processing units and data containers, making it adaptive to the distributed environment and independent development. In addition, it is easy to expand KNIME. Developers can effortlessly expand various nodes and views of KNIME.

– Weka/Pentaho (14.8 %): Weka, abbreviated from Waikato Environment for Knowledge Analysis, is a free and open-source machine learning and data mining software written in Java. Weka provides such functions as data processing, feature selection, classification, regression, clustering, association rule, and visualization, etc. Pentaho is one of the most popular open-source BI software. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, and data mining, etc., all aspects of BI. Weka's data processing algorithms are also integrated in Pentaho and can be directly called.
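The "production line" view of a data mining flow described for RapidMiner above, in which operators with defined inputs and outputs are chained, can be sketched in a few lines. This is only an illustration of the idea and does not use RapidMiner's or KNIME's actual APIs; the operator names, the row format, and the threshold value are assumptions made for the example.

```python
def drop_missing(rows):
    """Pre-processing operator: discard rows that contain a missing value."""
    return [row for row in rows if all(v is not None for v in row.values())]


def add_ratio(rows):
    """Transformation operator: derive a click-through ratio (assumes views > 0)."""
    return [{**row, "ratio": row["clicks"] / row["views"]} for row in rows]


def label_high_ratio(rows, threshold=0.1):
    """Toy 'model' operator: flag rows whose ratio reaches a hypothetical threshold."""
    return [{**row, "label": row["ratio"] >= threshold} for row in rows]


def run_flow(data, operators):
    """Feed the output of each operator into the next one, like a production line."""
    for operator in operators:
        data = operator(data)
    return data


if __name__ == "__main__":
    raw = [
        {"views": 100, "clicks": 20},
        {"views": 50, "clicks": None},   # dropped by the first operator
        {"views": 200, "clicks": 10},
    ]
    for row in run_flow(raw, [drop_missing, add_ratio, label_high_ratio]):
        print(row)
```

Keeping every operator a plain function from rows to rows is what makes the flow easy to rearrange, which mirrors the module-based design the tools above expose through their graphical editors.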

6 Big data applications

In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful values via judgments, suggestions, supports, or decisions. However, data analysis involves a wide range of applications, which frequently change and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields, including structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

6.1 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically includes large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have emerged in the past decades. For example, as early as the 1990s, BI became a prevailing technology for business applications, and network search engines based on massive data mining processing emerged in the early 21st century. Some potential and influential applications from different fields and their data and analysis characteristics are discussed as follows.

– Evolution of Commercial Applications: The earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. Analytical techniques used in such systems were prevailing in the 1990s and were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, score cards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have been providing a unique opportunity for organizations to have online display and directly interact with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted by text analysis and website mining techniques. As reported in [115], the quantity of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and Internet of Things based on sensors are opening a new generation of innovative applications, and require considerably larger capacity for supporting location sensing, people-oriented, and context-aware operation.

– Evolution of Network Applications: The early generation of the Internet mainly provided email and the WWW services. Text analysis, data mining, and webpage analysis have been applied to the mining of email contents and building search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages, full of various kinds of data, such as text, images, audio, videos, and interactive contents, etc. Therefore, plenty of advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share contents.

– Evolution of Scientific Applications: Scientific research in many fields, such as astrophysics, oceanology, genomics, and environmental research, is acquiring massive data with high-throughput sensors and instruments. The US National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, coordination environment, virtual machine resources, inter-operative analysis software, and data service to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets have high varieties in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and it is not easy to have a comprehensive coverage, we will focus on the key problems and technologies in data analysis in the following discussions.

6.2 Big data analysis fields


6.2.1 Structured data analysis

Business applications and scientific research may generate massive structured data, of which the management and analysis rely on mature commercialized technologies, such as RDBMS, data warehouse, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning based on exact mathematical models and powerful algorithms has been applied to anomaly detection [117] and energy control [118]. Exploiting data characteristics, time and space mining can extract knowledge structures hidden in high-speed data flows and sensors [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy protection data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially in process analysis with event data [121].

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to feature more business-based potential than structured data. Generally, text analysis is a process to extract useful information and knowledge from unstructured text. Text mining is inter-disciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and data mining in particular. Most text mining systems are based on text representations and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammar [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.
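To make the notion of a text representation concrete, the following minimal sketch turns raw documents into weighted bag-of-words vectors. TF-IDF weighting is our illustrative choice and is not prescribed by the text above; the whitespace tokenizer and the function names are likewise assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive tokenizer: lowercase and split on whitespace (an assumption for brevity)."""
    return text.lower().split()


def tf_idf(documents):
    """Return one {term: weight} vector per document.

    weight = (term count / document length) * log(N / document frequency)
    """
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


if __name__ == "__main__":
    docs = ["big data analysis", "big data storage"]
    for vec in tf_idf(docs):
        print(vec)
```

Terms that occur in every document receive weight zero, so the vectors emphasize the words that distinguish one document from another, which is the property that classification and clustering methods then exploit.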

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including database, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process to discover useful knowledge in Web pages, which generally involve several types of data, such as text, image, audio, video, code, metadata, and hyperlink. The research on image, audio, and video mining has recently been called multimedia analysis, which will be discussed in Section 6.2.4. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves the mining of the semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., email, newsgroup management, and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to model and integrate data in the Web, so as to conduct more complex queries than searches based on key words.

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagrams linked in a website or among multiple websites. Models are built based on topological structures provided with hyperlinks with or without link description. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of the models to look up relevant website pages. Topic-oriented crawler is another successful case by utilizing the models [127].
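As an illustration of how a link-structure model can rank pages, the sketch below runs a basic power iteration in the spirit of PageRank [125]. It is a simplified textbook formulation, not the production algorithm; the damping factor of 0.85, the fixed iteration count, and the even redistribution of rank from pages without outgoing links are assumptions of this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute a rank for every page.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical four-page site used only to exercise the sketch.
    site = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
    for page, score in sorted(pagerank(site).items()):
        print(page, round(score, 4))
```

A page that is linked to by many well-ranked pages accumulates a high score, so the ranking depends only on the hyperlink topology and not on page content.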

Web usage mining aims to mine auxiliary data generated by Web dialogues or activities. Web content mining and Web structure mining use the master Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or trades, cache, user queries, bookmark data, mouse clicks and scrolls, and any other kinds of data generated through interaction with the Web. As Web services and Web 2.0 are becoming mature and popular, Web usage data will have increasingly high variety. Web usage mining plays key roles in personalized space, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.


6.2.4 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) have been growing at an amazing speed, and analysis is needed to extract useful knowledge from them and understand their semantics. Because multimedia data is heterogeneous and most of such data contains richer information than simple structured data or text data, extracting information is confronted with the huge challenge of the semantic gap. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection, etc.

Audio summarization can be accomplished by extracting the prominent words or phrases from metadata or synthesizing a new representation. Video summarization is to interpret the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied to many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video, and apply smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of extracted features from the video.

Multimedia annotation inserts labels to describe contents of images and videos at both syntax and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time and labor intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic gap. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to jointly explore both manual and automatic multimedia annotation [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users to conveniently and quickly look up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, including shot boundary detection, key frame extraction, and scene segmentation, etc. According to the result of structural analysis, the second procedure is feature extraction, which mainly includes further mining the features of key frames, objects, texts, and movements, which are the foundation of video indexing and retrieval. Data mining, classification, and annotation are to utilize the extracted features to find the patterns of video contents and put videos into predefined categories so as to generate video indexes. Upon receiving a query, the system will use a similarity measurement method to look up candidate videos. The retrieval result is then optimized through relevance feedback.

Multimedia recommendation is to recommend specific multimedia contents according to users' preferences. It is proven to be an effective approach to provide personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. The content-based methods identify general features of users or their interests, and recommend to users other contents with similar features. These methods largely rely on content similarity measurement, but most of them are troubled by limited content analysis and overspecialization. The collaborative-filtering-based methods identify groups with similar interests and recommend contents for group members according to their behavior [132]. Presently, a mixed method is introduced, which integrates advantages of the aforementioned two types of methods to improve recommendation quality [133].
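The collaborative-filtering idea described above, i.e., recommending items liked by users with similar interests, can be sketched as follows. This is a minimal user-based variant under stated assumptions: ratings are positive numbers stored in nested dictionaries, similarity is the cosine between rating vectors, and the user and item names are hypothetical. It is not the method of [132] or [133].

```python
import math


def cosine(u, v):
    """Cosine similarity between two {item: rating} dictionaries."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(ratings, user, top_n=3):
    """Rank unseen items by the similarity-weighted average rating of other users."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        if sim <= 0:
            continue
        for item, rating in other_ratings.items():
            if item in ratings[user]:
                continue  # only recommend items the user has not rated yet
            scores[item] = scores.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    predicted = {item: scores[item] / weights[item] for item in scores}
    return sorted(predicted, key=predicted.get, reverse=True)[:top_n]


if __name__ == "__main__":
    ratings = {
        "ann": {"v1": 5, "v2": 4},
        "bob": {"v1": 5, "v2": 4, "v3": 5},
        "cat": {"v1": 1, "v4": 2},
    }
    print(recommend(ratings, "ann"))
```

A content-based method would instead compare item features, so a mixed method can fall back on content similarity when too few users have rated an item.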

The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on Event Kit, which contains some text description related to concepts and video examples [134]. In [135], the author proposed a new algorithm on special multimedia event detection using a few positive training examples. The research on video event detection is still in its infancy, and mainly focuses on sports or news events, running or abnormal events in monitoring videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis in the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, etc., have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graph structures, describing the communications between two entities. The content data contains text, image, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. In accordance with the data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

The research on link-based structural analysis has always been committed to link prediction, community discovery, social network evolution, and social influence analysis, etc. SNS may be visualized as graphs, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertexes and edges are continually added to the graphs. Link prediction is to predict the possibility of future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification is to select a group of features for a vertex and utilize the existing link information to generate binary classifiers to predict the future link [139]. Probabilistic methods aim to build models for connection probabilities among vertexes in SNS [140]. Linear algebra methods compute the similarity between two vertexes according to a reduced-rank similarity matrix [141]. A community is represented by a sub-graph, in which edges connecting vertexes in the sub-graph feature high density, while the edges between two sub-graphs feature much lower density [142].
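To give a feel for link prediction, the sketch below scores every pair of unconnected vertexes by the Jaccard coefficient of their neighbor sets. This neighborhood-overlap score is a simple stand-in chosen for illustration; it is not one of the methods of [139–141], although such scores are often used as features in feature-based classifiers. The function names and the toy graph are assumptions.

```python
from itertools import combinations


def build_adjacency(edges):
    """Turn an undirected edge list into a {vertex: set of neighbors} map."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def jaccard_scores(edges):
    """Score each unconnected pair by |common neighbors| / |all neighbors|."""
    adjacency = build_adjacency(edges)
    scores = {}
    # Vertex labels are assumed to be mutually comparable (e.g., all strings).
    for u, v in combinations(sorted(adjacency), 2):
        if v in adjacency[u]:
            continue  # already linked, nothing to predict
        union = adjacency[u] | adjacency[v]
        scores[(u, v)] = len(adjacency[u] & adjacency[v]) / len(union)
    return scores


if __name__ == "__main__":
    # Hypothetical friendship graph used only to exercise the sketch.
    graph = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
    for pair, score in jaccard_scores(graph).items():
        print(pair, score)
```

Pairs with a higher score share a larger fraction of their neighbors and are therefore treated as more likely to become connected in the future.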

Many methods for community detection have been proposed and studied, most of which rely on topology-based objective functions that capture the notion of community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. The research on SNS evolution aims to look for laws and deduction models to interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generative methods are proposed to assist network and system design [147].

Social influence refers to the case when individuals change their behavior under the influence of others. The strength of social influence depends on the relation among individuals, network distances, time effect, and characteristics of networks and individuals, etc. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise. For example, the blogosphere contains a large number of spam blogs, and Twitter likewise contains many trivial Tweets. Third, SNS are dynamic networks, which are frequently and quickly varying and updated. The existing research on social media analysis is still in its infancy. Considering that SNS contains massive information, transfer learning in heterogeneous networks aims to transfer knowledge information among different media [150].

6.2.6 Mobile data analysis

By April 2013, Android Apps had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since the research on mobile analysis has just started, we will only introduce some recent and representative analysis applications in this section.

With the growth of the number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities with geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest Webchat). Traditional network communities or SNS communities are short of online interaction among members, and the communities are active only when members are sitting before computers. On the contrary, mobile phones can support rich interaction at any time and anywhere. A mobile community is defined as a group of individuals with the same hobbies (i.e., health, safety, and entertainment, etc.) who gather together on networks, meet to make a common goal, decide measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, the progress in wireless sensors, mobile communication technology, and stream processing enables people to build a body area network to have real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations, etc. In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop an application for smart phones, which analyzes paces when people walk and uses the pace information for unlocking the safety system [11]. In the meanwhile, Robert Delano and Brian Parise from Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with a built-in seismograph in a mobile phone, so as to cope with Parkinson's disease and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data application. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, on marketing, with correlation analysis of big data, enterprises can more accurately predict the consumer behavior and find new business modes. On sales planning, after comparison of massive data, enterprises can optimize their commodity prices. On operation, enterprises can improve their operation efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor cost. On supply chain, using big data, enterprises may conduct inventory optimization, logistic optimization, and supplier coordination, etc., to mitigate the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has been rapidly developed. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that such activities as "Multi-times score accumulation" and "score exchange in shops" are effective for attracting quality customers. By building a customer drop out warning model, the bank can sell high-yield financial products to the top 20 % customers who are most likely to drop out so as to retain them. As a result, the drop out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers' transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.
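The targeting step of such a drop out warning model can be sketched as follows. The sketch assumes that some model has already produced a drop-out probability per customer; how CMB's model computes these scores is not described in the text, so the scores, customer identifiers, and function name below are purely hypothetical.

```python
import math


def top_fraction_at_risk(churn_scores, fraction=0.20):
    """Return the customers with the highest estimated drop-out probability.

    churn_scores maps a customer id to a score in [0, 1]; fraction is the
    share of customers to target (20 % in the example from the text).
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    k = math.ceil(len(churn_scores) * fraction)
    ranked = sorted(churn_scores, key=churn_scores.get, reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical model outputs, used only to exercise the sketch.
    scores = {"c1": 0.9, "c2": 0.1, "c3": 0.5, "c4": 0.7, "c5": 0.2}
    print(top_fraction_at_risk(scores))
```

Ranking by score and cutting at a fixed fraction keeps the retention campaign at a predictable size, regardless of how the scores are calibrated.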

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted in Taobao every day, and the corresponding transaction time, commodity prices, and purchase quantities are recorded, and more importantly, along with age, gender, address, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, market conditions of their brands, and consumers' behaviors, etc., and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities with more preferable prices. The credit loan of Alibaba automatically analyzes and judges whether to grant loans to enterprises through the acquired enterprise transaction data by virtue of big data technology, while manual intervention does not occur in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan with only about 0.3 % bad loans, which is greatly lower than those of other commercial banks.

            6.3.2 Application of IoT based big data

            IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

            Logistics enterprises may have the most profound experience with the application of IoT big data. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so the headquarters can track truck positions and prevent engine failures. This system also helps UPS supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

            Smart city is a hot research area based on the application of IoT data. For example, the smart city project, a cooperation between Miami-Dade County in Florida and IBM, closely connects 35 types of key county government departments and the city of Miami, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The smart city application brings benefits in many respects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills this year by identifying and fixing running and leaking water pipes in time.

            6.3.3 Application of online social network-oriented big data

            Online SNS is a social structure constituted by social individuals and the connections among individuals, based on an information network. Big data of online SNS mainly comes from instant messages, online social networking, micro blogs, and shared spaces, which represent various user activities. The analysis of big data from online SNS uses computational analytical methods for understanding relations in human society, drawing on theories and methods from mathematics, informatics, sociology, and management science, along three dimensions: network structure, group interaction, and information spreading. Applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following; they mainly mine and analyze content information and structural information to acquire value.

            – Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preference, emotion, interest, and demand may be revealed.

            – Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies aggregate relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.
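To make the notion of a community in the second item above concrete, the following sketch counts how many relations of a candidate group stay inside the group and how many cross its boundary. A group with many internal and few boundary edges matches the "close internal, loose external" description. The toy graph, the node names, and the function `internal_external_edges` are illustrative assumptions, and this is only a cohesion check, not a community detection algorithm.

```python
def internal_external_edges(edges, group):
    """Count edges inside `group` and edges crossing its boundary.

    `edges` is an iterable of undirected (u, v) pairs; `group` is a set of nodes.
    """
    group = set(group)
    internal = external = 0
    for u, v in edges:
        in_u, in_v = u in group, v in group
        if in_u and in_v:
            internal += 1      # both endpoints are members
        elif in_u or in_v:
            external += 1      # exactly one endpoint is a member
    return internal, external


# Hypothetical friendship graph: two triangles joined by one edge (c, d).
edges = [("a", "b"), ("a", "c"), ("b", "c"),
         ("c", "d"),
         ("d", "e"), ("d", "f"), ("e", "f")]

counts = internal_external_edges(edges, {"a", "b", "c"})
print(counts)
```

Comparing the two counts for different candidate groups gives a quick sense of which groupings behave like communities in the sense used here.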

            The U.S. Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

            Fig. 5 Enabling technologies for online social network-oriented big data

            In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the U.S. In addition, Global Pulse conducted research that revealed some laws in social and economic activities using SNS data. This project used publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans. The goal was to better understand public behavior and concerns. The project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting sharp growth or drops in topic volume; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing the ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation in the official statistics of Indonesia matches the number of Tweets about the rice price on Twitter, as shown in Fig. 6.
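Two of the analyses listed above lend themselves to a short sketch: flagging a sharp change in topic volume, and measuring how closely two series move together. The code below uses a trailing-window z-score and the Pearson correlation coefficient. Both are stand-in techniques chosen for illustration; the survey does not state which methods Global Pulse used, and the weekly counts are made-up numbers.

```python
from statistics import mean, pstdev


def spikes(counts, window=4, threshold=3.0):
    """Return indices whose value deviates from the mean of the previous
    `window` values by more than `threshold` standard deviations."""
    flagged = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu, sigma = mean(history), pstdev(history)
        if sigma > 0 and abs(counts[i] - mu) > threshold * sigma:
            flagged.append(i)
    return flagged


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical weekly message counts for one topic; week 5 is a burst.
weekly_tweets = [100, 104, 98, 102, 101, 250, 103, 99]
print(spikes(weekly_tweets))

# Hypothetical paired series, e.g. message counts vs. an official indicator.
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
```

A high correlation between message counts and an official indicator, as in the rice price example, suggests the two move together; it does not by itself show that one predicts the other.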

            Generally speaking, the application of big data from online SNS may help to better understand users’ behavior and master the laws of social and economic activities from the following three aspects:

            – Early Warning: to rapidly cope with a crisis, if any, by detecting abnormalities in the usage of electronic devices and services.

            – Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotion, and preference of users.

            – Real-time Feedback: to acquire groups’ feedback on some social activities based on real-time monitoring.

            6.3.4 Applications of healthcare and medical big data

            Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the health care business.

            For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients over three consecutive years. In addition, it summarized the final result into a highly personalized treatment plan to assess the risk factors and main treatment plans of patients. Doctors may then reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose five pounds, or by suggesting that patients reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20.

            The Mount Sinai Medical Center in the U.S. uses technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

            HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through the software development kit (SDK) and open interface.

            Fig. 6 The correlation between Tweets about rice price and food price inflation

            6.3.5 Collective intelligence

            With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users use mobile devices as basic sensing units and coordinate with mobile networks to distribute sensing tasks and to collect and use sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of Crowdsourcing has been successfully applied to geotagged photographs, positioning and navigation, urban road traffic sensing, market forecast, opinion mining, and other labor-intensive applications.

            Crowdsourcing, a new approach for problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, Crowdsourcing had been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of Crowdsourcing. The main idea of Crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, Crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

            In the big data era, Spatial Crowdsourcing has become a hot topic. The operation framework of Spatial Crowdsourcing is as follows. A user may request a service and resources related to a specified location. Then the mobile users who are willing to participate in the task move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data is sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions provided by mobile devices, it can be forecast that Spatial Crowdsourcing will be more prevalent than traditional Crowdsourcing, e.g., Amazon Turk and Crowdflower.
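The request-and-dispatch step of the Spatial Crowdsourcing framework above can be sketched as picking the nearest willing mobile user for a requested location. The nearest-worker rule, the worker records, the coordinates, and the names `haversine_km` and `assign_nearest` are assumptions made for illustration; the survey does not prescribe an assignment policy.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) points
    given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def assign_nearest(task_location, workers):
    """Return the id of the closest worker who is willing to take the task,
    or None if nobody is willing."""
    willing = [w for w in workers if w["willing"]]
    if not willing:
        return None
    best = min(willing, key=lambda w: haversine_km(task_location, w["position"]))
    return best["id"]


# Hypothetical workers: w1 is closest but has declined to participate.
workers = [
    {"id": "w1", "position": (30.52, 114.36), "willing": False},
    {"id": "w2", "position": (30.60, 114.30), "willing": True},
    {"id": "w3", "position": (31.23, 121.47), "willing": True},
]
chosen = assign_nearest((30.51, 114.41), workers)
print(chosen)
```

A real platform would also weigh travel time, worker reliability, and task deadlines; the sketch keeps only the location aspect that distinguishes spatial from traditional Crowdsourcing.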

            6.3.6 Smart grid

            Smart Grid is the next generation power grid, constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart Grid related big data are generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nationwide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as Circuit Breaker Monitors and transformers). Smart Grid brings about the following challenges on exploiting big data:

            – Grid planning: By analyzing data in the Smart Grid, the regions with excessively high electrical load or high power outage frequencies can be identified. Even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance. For example, researchers from the University of California, Los Angeles designed an “electric map” according to big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of the block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides an effective and visual load forecast for power grid planning in a city. Preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads, as displayed in the map, may be conducted.

            – Interaction between power generation and power consumption: An ideal power grid shall balance power generation and consumption. However, the traditional power grid is constructed based on the one-directional approach of transmission-transformation-distribution-consumption, which does not allow the generation capacity to be adjusted according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters were developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which help the supplier read power utilization data every 15 min rather than every month as in the past. Labor cost for meter reading is greatly reduced. Because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and low periods of power consumption. TXU Energy used such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help realize time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users.

            – The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable energy sources can be efficiently managed, and the electricity generated by the new energy resources can be allocated to regions with electricity shortages. Such energy resources can complement traditional hydropower and thermal power generation.
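The time-sharing pricing idea in the list above can be illustrated with a small calculation: given one day of 15-minute meter readings, each interval is billed at a peak or an off-peak price. The peak hours, the prices, and the function `time_of_use_cost` are hypothetical values chosen for the sketch and are not TXU Energy's actual tariff.

```python
INTERVALS_PER_DAY = 96  # 24 hours x four 15-minute readings


def time_of_use_cost(readings_kwh, peak_hours, peak_price, offpeak_price):
    """Cost of one day of 15-minute readings under a two-level tariff.

    `readings_kwh[i]` is the energy used in the i-th 15-minute interval;
    `peak_hours` is a collection of hours (0-23) billed at `peak_price`.
    """
    if len(readings_kwh) != INTERVALS_PER_DAY:
        raise ValueError("expected 96 readings for one day")
    cost = 0.0
    for i, kwh in enumerate(readings_kwh):
        hour = i // 4  # four readings per hour
        price = peak_price if hour in peak_hours else offpeak_price
        cost += kwh * price
    return cost


# Hypothetical household using a constant 1 kWh per hour (0.25 kWh per reading).
flat_day = [0.25] * INTERVALS_PER_DAY
evening_peak = range(18, 22)  # assumed peak window: 18:00 to 22:00
cost = time_of_use_cost(flat_day, evening_peak, peak_price=0.5, offpeak_price=0.25)
print(cost)
```

With monthly readings this kind of tariff cannot be applied at all, which is why the 15-minute granularity matters for shifting consumption away from peak periods.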

            7 Conclusion, open issues, and outlook

            In this paper, we review the background and state-of-the-art of big data. First, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. Finally, we review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide readers with a comprehensive overview and big picture of this exciting area.

            In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

            7.1 Open issues

            The analysis of big data is confronted with many challenges, but the current research is still at an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

            7.1.1 Theoretical research

            Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, which are discussed below.

            – Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

            – Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to assess the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated once the system is implemented and deployed, which makes it impossible to horizontally compare the advantages and disadvantages of various alternative solutions, or even to compare performance before and after the adoption of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

            – Evolution of big data computing modes: This includes memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

            7.1.2 Technology development

            Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, should be fully investigated.

            – Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor that restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.

            – Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor in improving big data computing.

            – Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

            – Processing of big data: As big data research advances, new problems in big data processing arise beyond traditional data analysis, including (i) data re-utilization: with the increase of data scale, more value may be mined from re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized, from which more value can be mined; and (iii) data exhaust, which means wrong data collected during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.

            7.1.3 Practical implications

            Although there are already many successful big data applications, many practical problems remain to be solved:

            – Big data management: The emergence of big data brings new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

            – Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning.

            – Integration and provenance of big data: As discussed, the value acquired from comprehensive utilization of multiple datasets is far higher than the sum of the values of individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance is the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and from different datasets.

            – Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction of big data are all important research problems.

            7.1.4 Data security

            In IT, safety and privacy are always two key concerns. In the big data era, as data volume is growing fast, there are more severe safety risks, while the traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security related challenges:

            – Big data privacy: Big data privacy includes two aspects: (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties of users may be more easily acquired, and users may not be aware. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed a big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher at Skull Security, used an information acquisition tool to acquire data from the public pages of Facebook users who had failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

            – Data quality: Data quality influences big data utilization. Low quality data wastes transmission and storage resources and has poor usability. Many factors may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though a lot of measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.

            – Big data safety mechanism: Big data brings challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data cannot meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safety communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants’ data should be enabled on the premise of efficiency assurance.

            – Big data application in information security: Big data not only brings challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) by analyzing big data in the form of log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, and attack characteristics may also be more easily identified through the analysis of big data.
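The log-analysis idea in the last item above can be sketched as a simple aggregation: count how often each source address triggers a given event and flag the sources that exceed a threshold. The three-field log format, the `LOGIN_FAIL` event name, the threshold, and the function `suspicious_sources` are all hypothetical; real Intrusion Detection System logs and APT detection are far richer than this.

```python
from collections import Counter


def suspicious_sources(log_lines, event="LOGIN_FAIL", min_count=3):
    """Return the sorted source addresses that produced `event` at least
    `min_count` times.

    Each line is assumed to look like: "<timestamp> <source> <event>".
    Lines that do not have exactly three fields are ignored.
    """
    counts = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) == 3 and parts[2] == event:
            counts[parts[1]] += 1
    return sorted(src for src, n in counts.items() if n >= min_count)


# Hypothetical log excerpt using documentation-only IP ranges.
log = [
    "2014-01-01T10:00:00 192.0.2.10 LOGIN_FAIL",
    "2014-01-01T10:00:05 192.0.2.10 LOGIN_FAIL",
    "2014-01-01T10:00:09 198.51.100.7 LOGIN_OK",
    "2014-01-01T10:00:12 192.0.2.10 LOGIN_FAIL",
    "2014-01-01T10:01:00 198.51.100.7 LOGIN_FAIL",
    "malformed line",
]
flagged = suspicious_sources(log)
print(flagged)
```

At big data scale the same counting would typically run as a distributed aggregation over many log files, but the per-source grouping is the same.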

            The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy efficiency optimization, and processing hardware and software system architectures. In particular, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

            7.2 Outlook

            The emergence of big data opens great opportunities. In the IT era, the “T” (Technology) was the main concern, while technology drives the development of data. In the big data era, with the prominence of data value and advances in “I” (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact but also influence everyone’s ways of living and thinking, which is just happening. We cannot predict the future, but may take precautions for possible events to occur in the future.

            – Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have been concerned with better ways to cope with larger-scale, higher-diversity, and more complexly structured data. These efforts are represented by Spanner, the globally-distributed database of Google, and F1, a fault-tolerant, expandable, distributed relational database. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to relational databases, and effectively handle data through grammars similar to SQL.

            – Data resource performance: Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

            – Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security. Then, the impacts of big data on production management, business operation, and decision making shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

            – Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a friendly way can they be effectively used by users. Reports, histograms, pie charts, and regression curves are frequently used to visualize results of data analysis. New presentation forms will occur in the future; e.g., Microsoft Renlifang, a social search engine, uses relational diagrams to express interpersonal relationships.

            – Data-oriented: It is well known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it is observed that the role of data is becoming increasingly significant. In the small-scale data era, in which logic is more complex than data, program design is mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed business logic. Consequently, programs are gradually being transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods will emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

            – Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

            – During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

            – Compared with accurate data, we would like to accept numerous and complicated data.

            – We shall pay greater attention to correlations between things rather than exploring causal relationships.

            – The simple algorithms of big data are more effective than complex algorithms of small data.

            – Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace “experts.”

            Throughout the history of human society, the demands and willingness of human beings are always the source powers to promote scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in future society.

            Acknowledgments This work was supported by China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao’s research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.



              operates in the upper level supported by cloud computing and provides functions similar to those of a database, together with efficient data processing capacity. Kissinger, President of EMC, indicated that the application of big data must be based on cloud computing.

              The evolution of big data was driven by the rapid growth of application demands, and cloud computing developed from virtualized technologies. Therefore, cloud computing not only provides computation and processing for big data but is also itself a service mode. To a certain extent, the advances of cloud computing also promote the development of big data, and the two supplement each other.

2.2 Relationship between IoT and big data

In the IoT paradigm, an enormous number of networked sensors are embedded into various devices and machines in the real world. Such sensors, deployed in different fields, may collect various kinds of data, such as environmental data, geographical data, astronomical data, and logistic data. Mobile equipment, transportation facilities, public facilities, and home appliances could all be data acquisition equipment in IoT, as illustrated in Fig. 4.

The big data generated by IoT has different characteristics compared with general big data because of the different types of data collected, of which the most classical characteristics include heterogeneity, variety, unstructured features, noise, and high redundancy. Although the current IoT data is not the dominant part of big data, by 2030 the quantity of sensors will reach one trillion, and then the IoT data will be the most important part of big data, according to the forecast of HP. A report from Intel pointed out that big data in IoT has three features that conform to the big data paradigm: (i) abundant terminals generating masses of data; (ii) data generated by IoT is usually semi-structured or unstructured; (iii) data of IoT is useful only when it is analyzed.

At present, the data processing capacity of IoT has fallen behind the collected data, and it is extremely urgent to accelerate the introduction of big data technologies to promote the development of IoT. Many operators of IoT realize the importance of big data, since the success of IoT hinges upon the effective integration of big data and cloud computing. The widespread deployment of IoT will also bring many cities into the big data era.

There is a compelling need to adopt big data for IoT applications, while the development of big data already lags behind. It has been widely recognized that these two technologies are inter-dependent and should be jointly developed: on one hand, the widespread deployment of IoT drives the high growth of data both in quantity and category, thus providing the opportunity for the application and development of big data; on the other hand, the application of big data technology to IoT also accelerates the research advances and business models of IoT.

Fig. 4 Illustration of data acquisition equipment in IoT

Mobile Netw Appl (2014) 19:171–209

2.3 Data center

In the big data paradigm, the data center is not only a platform for concentrated storage of data, but also undertakes more responsibilities, such as acquiring data, managing data, organizing data, and leveraging the data values and functions. Data centers mainly concern “data” rather than “center”. A data center holds masses of data and organizes and manages them according to its core objective and development path, which is more valuable than owning a good site and resources. The emergence of big data brings about sound development opportunities and great challenges to data centers. Big data is an emerging paradigm, which will promote the explosive growth of the infrastructure and related software of the data center. The physical data center network is the core for supporting big data and, at present, is the key infrastructure that is most urgently required [29].

– Big data requires the data center to provide powerful backstage support. The big data paradigm has more stringent requirements on storage capacity and processing capacity, as well as network transmission capacity. Enterprises must take the development of data centers into consideration to improve the capacity of rapidly and effectively processing big data under a limited price/performance ratio. The data center shall provide the infrastructure with a large number of nodes, build a high-speed internal network, effectively dissipate heat, and effectively back up data. Only when a highly energy-efficient, stable, safe, expandable, and redundant data center is built can the normal operation of big data applications be ensured.

– The growth of big data applications accelerates the revolution and innovation of data centers. Many big data applications have developed their unique architectures and directly promote the development of storage, network, and computing technologies related to the data center. With the continued growth of the volumes of structured and unstructured data, and the variety of sources of analytical data, the data processing and computing capacities of the data center shall be greatly enhanced. In addition, as the scale of data centers is increasingly expanding, how to reduce the operational cost for the development of data centers is also an important issue.

– Big data endows more functions to the data center. In the big data paradigm, the data center shall not only be concerned with hardware facilities but also strengthen soft capacities, i.e., the capacities of acquisition, processing, organization, analysis, and application of big data. The data center may help business personnel analyze the existing data, discover problems in business operation, and develop solutions from big data.

2.4 Relationship between Hadoop and big data

Presently, Hadoop is widely used in big data applications in the industry, e.g., spam filtering, network searching, clickstream analysis, and social recommendation. In addition, considerable academic research is now based on Hadoop. Some representative cases are given below. As declared in June 2012, Yahoo runs Hadoop on 42,000 servers at four data centers to support its products and services, e.g., searching and spam filtering. At present, the biggest Hadoop cluster has 4,000 nodes, but the number of nodes will be increased to 10,000 with the release of Hadoop 2.0. In the same month, Facebook announced that their Hadoop cluster can process 100 PB of data, which grew by 0.5 PB per day as of November 2012. Some well-known agencies that use Hadoop to conduct distributed computation are listed in [30]. In addition, many companies provide Hadoop commercial execution and/or support, including Cloudera, IBM, MapR, EMC, and Oracle.

Among modern industrial machinery and systems, sensors are widely deployed to collect information for environment monitoring, failure forecasting, etc. Bahga and others in [31] proposed a framework for data organization and cloud computing infrastructure, termed CloudView. CloudView uses mixed architectures, local nodes and remote clusters based on Hadoop, to analyze machine-generated data. Local nodes are used for the forecast of real-time failures; clusters based on Hadoop are used for complex offline analysis, e.g., case-driven data analysis.

The exponential growth of genome data and the sharp drop of sequencing cost transform bio-science and bio-medicine into data-driven sciences. Gunarathne et al. in [32] utilized cloud computing infrastructures, Amazon AWS and Microsoft Azure, and data processing frameworks based on MapReduce, Hadoop, and Microsoft DryadLINQ to run two parallel bio-medicine applications: (i) assembly of genome segments; (ii) dimension reduction in the analysis of chemical structure. In the latter application, the 166-D datasets used include 26,000,000 data points. The authors compared the performance of all the frameworks in terms of efficiency, cost, and availability. According to the study, the authors concluded that loose coupling will be increasingly applied to research on electron cloud, and that the parallel programming technology (MapReduce) framework may provide the user with an interface with more convenient services and reduce unnecessary costs.

              3 Big data generation and acquisition

We have introduced several key technologies related to big data, i.e., cloud computing, IoT, data center, and Hadoop. Next, we will focus on the value chain of big data, which can be generally divided into four phases: data generation, data acquisition, data storage, and data analysis. If we take data as a raw material, data generation and data acquisition are an exploitation process, data storage is a storage process, and data analysis is a production process that utilizes the raw material to create new value.

3.1 Data generation

Data generation is the first step of big data. Taking Internet data as an example, a huge amount of data in terms of search entries, Internet forum posts, chat records, and microblog messages is generated. Those data are closely related to people's daily life, and share the features of high value and low density. Such Internet data may be valueless individually, but through the exploitation of accumulated big data, useful information such as habits and hobbies of users can be identified, and it is even possible to forecast users' behaviors and emotional moods.

Moreover, generated through longitudinal and/or distributed data sources, datasets are more large-scale, highly diverse, and complex. Such data sources include sensors, videos, clickstreams, and/or all other available data sources. At present, the main sources of big data are the operation and trading information in enterprises, logistic and sensing information in the IoT, human interaction information and position information in the Internet world, and data generated in scientific research. The information far surpasses the capacities of the IT architectures and infrastructures of existing enterprises, while its real-time requirement also greatly stresses the existing computing capacity.

3.1.1 Enterprise data

In 2013, IBM issued Analysis the Applications of Big Data to the Real World, which indicates that the internal data of enterprises are the main sources of big data. The internal data of enterprises mainly consist of online trading data and online analysis data, most of which are historically static data and are managed by RDBMSs in a structured manner. In addition, production data, inventory data, sales data, and financial data also constitute enterprise internal data, which aims to capture informationized and data-driven activities in enterprises, so as to record all activities of enterprises in the form of internal data.

Over the past decades, IT and digital data have contributed a lot to improving the profitability of business departments. It is estimated that the business data volume of all companies in the world may double every 1.2 years [10], in which the business turnover through the Internet, enterprises to enterprises, and enterprises to consumers per day will reach USD 450 billion [33]. The continuously increasing business data volume requires more effective real-time analysis so as to fully harvest its potential. For example, Amazon processes millions of terminal operations and more than 500,000 queries from third-party sellers per day [12]. Walmart processes one million customer trades per hour, and such trading data are imported into a database with a capacity of over 2.5PB [3]. Akamai analyzes 75 million events per day for its target advertisements [13].
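
As a rough worked example of what a fixed doubling period implies, the sketch below computes the growth factor 2^(t/T) after t years for a doubling period of T years. The function name `growth_factor` and the sample horizons are illustrative assumptions; only the doubling period itself is taken from the estimate quoted above.

```python
def growth_factor(years: float, doubling_period_years: float) -> float:
    """Return the multiplicative growth after `years` when the volume
    doubles once every `doubling_period_years`."""
    if doubling_period_years <= 0:
        raise ValueError("doubling period must be positive")
    return 2.0 ** (years / doubling_period_years)


# Doubling period in years, as quoted in the text; the horizons are arbitrary.
DOUBLING_PERIOD_YEARS = 1.2

if __name__ == "__main__":
    for horizon in (1.2, 2.4, 6.0):
        factor = growth_factor(horizon, DOUBLING_PERIOD_YEARS)
        print(f"after {horizon} years the volume is about {factor:.1f} times larger")
```

Under this assumption, six years correspond to five doublings, i.e., a factor of 2^5 = 32, which illustrates why storage and real-time analysis capacity must be planned for exponential rather than linear growth.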

3.1.2 IoT data

As discussed, IoT is an important source of big data. Among smart cities constructed based on IoT, big data may come from industry, agriculture, traffic, transportation, medical care, public departments, families, etc.

According to the processes of data acquisition and transmission in IoT, its network architecture may be divided into three layers: the sensing layer, the network layer, and the application layer. The sensing layer is responsible for data acquisition and mainly consists of sensor networks. The network layer is responsible for information transmission and processing, where short-range transmission may rely on sensor networks and remote transmission shall depend on the Internet. Finally, the application layer supports specific applications of IoT.

According to the characteristics of the Internet of Things, the data generated from IoT has the following features:

– Large-scale data: in IoT, masses of data acquisition equipment are deployed in a distributed manner, which may acquire simple numeric data, e.g., location, or complex multimedia data, e.g., surveillance video. In order to meet the demands of analysis and processing, not only the currently acquired data but also the historical data within a certain time frame should be stored. Therefore, data generated by IoT are characterized by large scales.

– Heterogeneity: because of the variety of data acquisition devices, the acquired data is also different, and such data features heterogeneity.

– Strong time and space correlation: in IoT, every data acquisition device is placed at a specific geographic location and every piece of data has a time stamp. The time and space correlation is an important property of data from IoT. During data analysis and processing, time and space are also important dimensions for statistical analysis.

– Effective data accounts for only a small portion of the big data: a great quantity of noise may occur during the acquisition and transmission of data in IoT. Among datasets acquired by acquisition devices, only a small amount of abnormal data is valuable. For example, during the acquisition of traffic video, the few video frames that capture the violation of traffic regulations and traffic accidents are more valuable than those only capturing the normal flow of traffic.
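
The features listed above suggest that each IoT reading should carry its time stamp and location, and that only the small abnormal portion usually deserves detailed analysis. The following is a minimal sketch of that idea, not a method from the survey: the `Reading` record, its field names, and the z-score threshold are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class Reading:
    """One IoT sample; every field name here is a hypothetical choice."""
    sensor_id: str
    timestamp: float  # seconds since the epoch (time correlation)
    lat: float        # latitude of the device (space correlation)
    lon: float        # longitude of the device
    value: float      # the measured quantity


def abnormal_readings(readings, z_threshold=2.0):
    """Keep readings whose value deviates from the mean by more than
    `z_threshold` population standard deviations."""
    values = [r.value for r in readings]
    if len(values) < 2:
        return []
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all values identical: nothing stands out
    return [r for r in readings if abs(r.value - mu) / sigma > z_threshold]


# Nine ordinary samples and one outlier from the same (made-up) sensor.
samples = [Reading("s1", float(i), 30.5, 114.3, 20.0) for i in range(9)]
samples.append(Reading("s1", 9.0, 30.5, 114.3, 80.0))
valuable = abnormal_readings(samples)
```

A simple global z-score is used here only because it is short; a real deployment would more likely compare each reading against neighbours in time and space, which is exactly why the time stamp and location are kept on every record.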

3.1.3 Bio-medical data

As a series of high-throughput bio-measurement technologies were innovatively developed at the beginning of the 21st century, the frontier research in the bio-medicine field also entered the era of big data. By constructing smart, efficient, and accurate analytical models and theoretical systems for bio-medicine applications, the essential governing mechanism behind complex biological phenomena may be revealed. Not only can the future development of bio-medicine be determined, but also the leading roles can be assumed in the development of a series of important strategic industries related to the national economy, people's livelihood, and national security, with important applications such as medical care, new drug R&D, and grain production (e.g., transgenic crops).

The completion of the HGP (Human Genome Project) and the continued development of sequencing technology also lead to widespread applications of big data in the field. The masses of data generated by gene sequencing go through specialized analysis according to different application demands, to be combined with clinical gene diagnosis and to provide valuable information for early diagnosis and personalized treatment of disease. One sequencing of human gene may generate 100–600GB of raw data. In the China National Genebank in Shenzhen, there are 1.3 million samples, including 1.15 million human samples and 150,000 animal, plant, and microorganism samples. By the end of 2013, 10 million traceable biological samples will be stored, and by the end of 2015, this figure will reach 30 million. It is predictable that, with the development of bio-medicine technologies, gene sequencing will become faster and more convenient, thus making the big data of bio-medicine grow continuously.

In addition, data generated from clinical medical care and medical R&D also rise quickly. For example, the University of Pittsburgh Medical Center (UPMC) has stored 2TB of such data. Explorys, an American company, provides platforms to collocate clinical data, operation and maintenance data, and financial data. At present, information on about 13 million people has been collocated, with 44 articles of data at the scale of about 60TB, which will reach 70TB in 2013. Practice Fusion, another American company, manages electronic medical records of about 200,000 patients.

Apart from such small and medium-sized enterprises, other well-known IT companies, such as Google, Microsoft, and IBM, have invested extensively in the research and computational analysis of methods related to high-throughput biological big data, for shares in the huge market known as the “Next Internet”. IBM forecast at the 2013 Strategy Conference that, with the sharp increase of medical images and electronic medical records, medical professionals may utilize big data to extract useful clinical information from masses of data to obtain a medical history and forecast treatment effects, thus improving patient care and reducing cost. It is anticipated that, by 2015, the average data volume of every hospital will increase from 167TB to 665TB.

3.1.4 Data generation from other fields

As scientific applications are increasing, the scale of datasets is gradually expanding, and the development of some disciplines greatly relies on the analysis of masses of data. Here, we examine several such applications. Although being in different scientific fields, the applications have similar and increasing demand on data analysis. The first example is related to computational biology. GenBank is a nucleotide sequence database maintained by the U.S. National Bio-Technology Innovation Center. Data in this database may double every 10 months. By August 2009, GenBank had more than 250 billion bases from 150,000 different organisms [34]. The second example is related to astronomy. The Sloan Digital Sky Survey (SDSS), the biggest sky survey project in astronomy, recorded 25TB of data from 1998 to 2008. As the resolution of the telescope is improved, by 2004, the data volume generated per night will surpass 20TB. The last application is related to high-energy physics. In the beginning of 2008, the Atlas experiment of the Large Hadron Collider (LHC) of the European Organization for Nuclear Research generates raw data at 2PB/s and stores about 10TB of processed data per year.

In addition, pervasive sensing and computing among nature, commercial, Internet, government, and social environments are generating heterogeneous data with unprecedented complexity. These datasets have their unique data characteristics in scale, time dimension, and data category. For example, mobile data were recorded with respect to positions, movement, approximation degrees, communications, multimedia, use of applications, and audio environment [108]. According to the application environment and requirements, such datasets can be classified into different categories, so as to select the proper and feasible solutions for big data.

3.2 Big data acquisition

As the second phase of the big data system, big data acquisition includes data collection, data transmission, and data pre-processing. During big data acquisition, once we collect the raw data, we shall utilize an efficient transmission mechanism to send it to a proper storage management system to support different analytical applications. The collected datasets may sometimes include much redundant or useless data, which unnecessarily increases storage space and affects the subsequent data analysis. For example, high redundancy is very common among datasets collected by sensors for environment monitoring. Data compression technology can be applied to reduce the redundancy. Therefore, data pre-processing operations are indispensable to ensure efficient data storage and exploitation.
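
To make the point about redundancy concrete, the sketch below compresses a highly repetitive, made-up environment-monitoring log with Python's standard `zlib` module and reports the size ratio. The payload format and the helper name `compression_ratio` are assumptions for illustration; the survey does not prescribe a particular compression scheme.

```python
import zlib


def compression_ratio(payload: bytes, level: int = 6) -> float:
    """Return the original size divided by the zlib-compressed size."""
    if not payload:
        raise ValueError("payload must not be empty")
    return len(payload) / len(zlib.compress(payload, level))


# Hypothetical sensor log: the same reading repeated many times.
redundant = b"temp=21.5;humidity=40\n" * 1000

if __name__ == "__main__":
    print(f"compression ratio: {compression_ratio(redundant):.0f}x")
```

Because the compression is lossless, `zlib.decompress` restores the original bytes exactly, so the storage saving does not affect later analysis.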

3.2.1 Data collection

Data collection is to utilize special data collection techniques to acquire raw data from a specific data generation environment. Four common data collection methods are shown as follows.

– Log files: As one widely used data collection method, log files are record files automatically generated by the data source system, so as to record activities in designated file formats for subsequent analysis. Log files are typically used in nearly all digital devices. For example, web servers record in log files the number of clicks, click rates, visits, and other property records of web users [35]. To capture activities of users at web sites, web servers mainly use the following three log file formats: public log file format (NCSA), expanded log format (W3C), and IIS log format (Microsoft). All three types of log files are in the ASCII text format. Databases, rather than text files, may sometimes be used to store log information to improve the query efficiency of the massive log store [36, 37]. There are also some other forms of log-file-based data collection, including stock indicators in financial applications and determination of operating states in network monitoring and traffic management.

– Sensing: Sensors are common in daily life to measure physical quantities and transform physical quantities into readable digital signals for subsequent processing (and storage). Sensory data may be classified as sound wave, voice, vibration, automobile, chemical, current, weather, pressure, temperature, etc. Sensed information is transferred to a data collection point through wired or wireless networks. For applications that may be easily deployed and managed, e.g., a video surveillance system [38], the wired sensor network is a convenient solution to acquire related information. Sometimes the accurate position of a specific phenomenon is unknown, and sometimes the monitored environment does not have the energy or communication infrastructures. Then wireless communication must be used to enable data transmission among sensor nodes under limited energy and communication capability. In recent years, wireless sensor networks (WSNs) have received considerable interest and have been applied to many applications, such as environmental research [39, 40], water quality monitoring [41], civil engineering [42, 43], and wildlife habitat monitoring [44]. A WSN generally consists of a large number of geographically distributed sensor nodes, each being a micro device powered by battery. Such sensors are deployed at designated positions as required by the application to collect remote sensing data. Once the sensors are deployed, the base station will send control information for network configuration/management or data collection to sensor nodes. Based on such control information, the sensory data is assembled in different sensor nodes and sent back to the base station for further processing. Interested readers are referred to [45] for more detailed discussions.

– Methods for acquiring network data: At present, network data acquisition is accomplished using a combination of web crawler, word segmentation system, task system, index system, etc. A web crawler is a program used by search engines for downloading and storing web pages [46]. Generally speaking, a web crawler starts from the uniform resource locator (URL) of an initial web page to access other linked web pages, during which it stores and sequences all the retrieved URLs. The web crawler acquires a URL in the order of precedence through a URL queue, then downloads web pages, identifies all URLs in the downloaded web pages, and extracts new URLs to be put in the queue. This process is repeated until the web crawler is stopped. Data acquisition through a web crawler is widely applied in applications based on web pages, such as search engines or web caching. Traditional web page extraction technologies feature multiple efficient solutions, and considerable research has been done in this field. As more advanced web page applications are emerging, some extraction strategies are proposed in [47] to cope with rich Internet applications.

The current network data acquisition technologies mainly include traditional Libpcap-based packet capture technology, zero-copy packet capture technology, as well as some specialized network monitoring software such as Wireshark, SmartSniff, and WinNetCap.

– Libpcap-based packet capture technology: Libpcap (packet capture library) is a widely used network data packet capture function library. It is a general tool that does not depend on any specific system and is mainly used to capture data in the data link layer. It features simplicity, ease of use, and portability, but has a relatively low efficiency. Therefore, under a high-speed network environment, considerable packet losses may occur when Libpcap is used.


– Zero-copy packet capture technology: The so-called zero-copy (ZC) means that no copies between any internal memories occur during packet receiving and sending at a node. In sending, the data packets directly start from the user buffer of applications, pass through the network interfaces, and arrive at an external network. In receiving, the network interfaces directly send data packets to the user buffer. The basic idea of zero-copy is to reduce data copy times, reduce system calls, and reduce CPU load while datagrams are passed from network equipment to user program space. The zero-copy technology first utilizes direct memory access (DMA) technology to directly transmit network datagrams to an address space pre-allocated by the system kernel, so as to avoid the participation of the CPU. In the meanwhile, it maps the internal memory of the datagrams in the system kernel to that of the detection program, or builds a cache region in the user space and maps it to the kernel space. Then the detection program directly accesses the internal memory, so as to reduce internal memory copies from system kernel to user space and reduce the number of system calls.

– Mobile equipment: At present, mobile devices are more widely used. As mobile device functions become increasingly stronger, they feature more complex and multiple means of data acquisition as well as a greater variety of data. Mobile devices may acquire geographical location information through positioning systems; acquire audio information through microphones; acquire pictures, videos, streetscapes, two-dimensional barcodes, and other multimedia information through cameras; and acquire user gestures and other body language information through touch screens and gravity sensors. Over the years, wireless operators have improved the service level of the mobile Internet by acquiring and analyzing such information. For example, the iPhone itself is a “mobile spy”. It may collect wireless data and geographical location information, and then send such information back to Apple Inc. for processing, of which the user is not aware. Apart from Apple, smart phone operating systems such as Android of Google and Windows Phone of Microsoft can also collect information in a similar manner.
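
The URL-queue behaviour of a web crawler described in the list above can be sketched as follows. This is a simplified illustration rather than the design of any particular search engine: the `crawl` function, the injected `fetch` callable, and the in-memory `FAKE_WEB` pages are hypothetical, and a real crawler would add politeness delays, robots.txt handling, and error handling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href values of <a> tags found in one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch, max_pages=100):
    """Breadth-first crawl starting from `seed_url`.

    `fetch(url)` is a caller-supplied function that returns the HTML text
    of a page, or None if the page is unavailable (a stand-in for a real
    downloader).
    """
    queue = deque([seed_url])  # URLs waiting to be downloaded, in order
    seen = {seed_url}          # URLs already queued, to avoid repeats
    pages = {}                 # url -> downloaded HTML
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


# Tiny in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/">home</a>',
    "http://example.test/b": '<a href="/c">C</a>',
}

pages = crawl("http://example.test/", FAKE_WEB.get)
```

The `seen` set is what keeps the loop from revisiting pages that link back to each other, and `max_pages` plays the role of the external stop condition mentioned in the text.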

In addition to the aforementioned three data acquisition methods of main data sources, there are many other data collection methods or systems. For example, in scientific experiments, many special tools can be used to collect experimental data, such as magnetic spectrometers and radio telescopes. We may classify data collection methods from different perspectives. From the perspective of data sources, data collection methods can be classified into two categories: collection methods recording through data sources and collection methods recording through other auxiliary tools.
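
As an illustration of collecting data from web server log files, the sketch below parses lines in the NCSA Common Log Format (host, identity, user, time stamp, request line, status code, response size) and counts requests per path. The regular expression is one reasonable reading of that format, and the helper names and sample lines are made up for this example.

```python
import re
from collections import Counter

# NCSA Common Log Format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A "-" size means no response body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


def hits_per_path(lines):
    """Count requests per requested path, skipping malformed lines."""
    counts = Counter()
    for line in lines:
        record = parse_clf_line(line)
        if record is None:
            continue
        parts = record["request"].split()
        if len(parts) >= 2:  # expected shape: METHOD PATH PROTOCOL
            counts[parts[1]] += 1
    return counts


SAMPLE = [
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '127.0.0.1 - - [10/Oct/2000:13:56:01 -0700] "GET /index.html HTTP/1.0" 304 -',
    'not a log line',
]
```

Calling `hits_per_path(SAMPLE)` counts two requests for `/index.html` and ignores the malformed third line. For very large log stores, the parsed records could be loaded into a database instead, in line with the remark above about query efficiency.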

3.2.2 Data transportation

Upon the completion of raw data collection, data will be transferred to a data storage infrastructure for processing and analysis. As discussed in Section 2.3, big data is mainly stored in a data center. The data layout should be adjusted to improve computing efficiency or facilitate hardware maintenance. In other words, internal data transmission may occur in the data center. Therefore, data transmission consists of two phases: Inter-DCN transmissions and Intra-DCN transmissions.

– Inter-DCN transmissions: Inter-DCN transmissions are from the data source to the data center, which is generally achieved with the existing physical network infrastructure. Because of the rapid growth of traffic demands, the physical network infrastructure in most regions around the world is constituted by high-volume, high-rate, and cost-effective optical fiber transmission systems. Over the past 20 years, advanced management equipment and technologies have been developed, such as IP-based wavelength division multiplexing (WDM) network architecture, to conduct smart control and management of optical fiber networks [48, 49]. WDM is a technology that multiplexes multiple optical carrier signals with different wavelengths and couples them to the same optical fiber of the optical link. In such technology, lasers with different wavelengths carry different signals. So far, the backbone network has been deployed with WDM optical transmission systems with a single channel rate of 40Gb/s. At present, 100Gb/s commercial interfaces are available, and 100Gb/s systems (or TB/s systems) will be available in the near future [50]. However, traditional optical transmission technologies are limited by the bandwidth of the electronic bottleneck [51]. Recently, orthogonal frequency-division multiplexing (OFDM), initially designed for wireless systems, is regarded as one of the main candidate technologies for future high-speed optical transmission. OFDM is a multi-carrier parallel transmission technology. It segments a high-speed data flow to transform it into low-speed sub-data-flows to be transmitted over multiple orthogonal sub-carriers [52]. Compared with the fixed channel spacing of WDM, OFDM allows sub-channel frequency spectrums to overlap with each other [53]. Therefore, it is a flexible and efficient optical networking technology.

– Intra-DCN transmissions: Intra-DCN transmissions are the data communication flows within data centers. Intra-DCN transmissions depend on the communication mechanism within the data center (i.e., on physical connection plates, chips, internal memories of data servers, network architectures of data centers, and communication protocols). A data center consists of multiple integrated server racks interconnected with its internal connection networks. Nowadays, the internal connection networks of most data centers are fat-tree, two-layer, or three-layer structures based on multi-commodity network flows [51, 54]. In the two-layer topological structure, the racks are connected by 1Gbps top-of-rack (TOR) switches, and such TOR switches are then connected with 10Gbps aggregation switches. The three-layer topological structure is augmented with one layer on top of the two-layer topological structure, and such layer is constituted by 10Gbps or 100Gbps core switches that connect the aggregation switches. There are also other topological structures which aim to improve data center networks [55–58]. Because of the inadequacy of electronic packet switches, it is difficult to increase communication bandwidths while keeping energy consumption low. Over the years, due to the huge success achieved by optical technologies, the optical interconnection among the networks in data centers has drawn great interest. Optical interconnection is a high-throughput, low-delay, and low-energy-consumption solution. At present, optical technologies are only used for point-to-point links in data centers. Such optical links provide connection for the switches using low-cost multi-mode fiber (MMF) with a 10Gbps data rate. Optical interconnection (switching in the optical domain) of networks in data centers is a feasible solution, which can provide Tbps-level transmission bandwidth with low energy consumption. Recently, many optical interconnection plans have been proposed for data center networks [59]. Some plans add optical paths to upgrade the existing networks, and other plans completely replace the current switches [59–64]. As a strengthening technology, Zhou et al. in [65] adopt wireless links in the 60GHz frequency band to strengthen wired links. Network virtualization should also be considered to improve the efficiency and utilization of data center networks.
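
The OFDM principle mentioned under Inter-DCN transmissions, splitting one high-speed flow into parallel low-speed flows carried on orthogonal sub-carriers, can be illustrated numerically with a discrete Fourier transform pair. The sketch below is a baseband toy model under strong simplifying assumptions (ideal channel, no cyclic prefix, no optical components); the QPSK bit mapping and all function names are illustrative choices, not part of any cited system.

```python
import cmath


def idft(symbols):
    """Inverse DFT: place one symbol on each orthogonal sub-carrier and
    sum them into time-domain samples (the OFDM modulator)."""
    n = len(symbols)
    return [
        sum(symbols[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def dft(samples):
    """Forward DFT: correlate the samples with each sub-carrier to recover
    its symbol (the OFDM demodulator)."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def bits_to_qpsk(bits):
    """Map bit pairs to QPSK symbols (bit 0 -> +1, bit 1 -> -1 per axis)."""
    if len(bits) % 2:
        raise ValueError("need an even number of bits")
    return [complex(1 - 2 * bits[i], 1 - 2 * bits[i + 1]) for i in range(0, len(bits), 2)]


def qpsk_to_bits(symbols):
    """Invert bits_to_qpsk by checking the sign of each component."""
    bits = []
    for s in symbols:
        bits += [int(s.real < 0), int(s.imag < 0)]
    return bits


# One serial stream of 16 bits becomes 8 parallel sub-carrier symbols.
tx_bits = [0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1]
time_samples = idft(bits_to_qpsk(tx_bits))
rx_bits = qpsk_to_bits(dft(time_samples))
```

Because the sub-carriers are mutually orthogonal over one symbol period, the forward transform separates them again even though their spectra overlap, which is the property contrasted with the fixed channel spacing of WDM in the text.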

3.2.3 Data pre-processing

Because of the wide variety of data sources, the collected datasets vary with respect to noise, redundancy, consistency, etc., and it is undoubtedly a waste to store meaningless data. In addition, some analytical methods have stringent requirements on data quality. Therefore, in order to enable effective data analysis, we shall pre-process data under many circumstances to integrate the data from different sources, which can not only reduce storage expense but also improve analysis accuracy. Some relational data pre-processing techniques are discussed as follows.

– Integration: data integration is the cornerstone of modern commercial informatics. It involves combining data from different sources and providing users with a uniform view of the data [66]. This is a mature research field for traditional databases. Historically, two methods have been widely recognized: data warehouse and data federation. Data warehousing includes a process named ETL (Extract, Transform and Load). Extraction involves connecting to source systems and selecting, collecting, analyzing, and processing the necessary data. Transformation is the execution of a series of rules to transform the extracted data into standard formats. Loading means importing the extracted and transformed data into the target storage infrastructure. Loading is the most complex procedure among the three, which includes operations such as transformation, copy, clearing, standardization, screening, and data organization. In data federation, a virtual database can be built to query and aggregate data from different data sources, but such a database does not contain data. On the contrary, it includes information or metadata related to the actual data and its positions. These two "storage-reading" approaches do not satisfy the high performance requirements of data flows or search programs and applications. Compared with queries, data in these two approaches is more dynamic and must be processed during data transmission. Generally, data integration methods are accompanied by flow processing engines and search engines [30, 67].

– Cleaning: data cleaning is a process to identify inaccurate, incomplete, or unreasonable data, and then modify or delete such data to improve data quality. Generally, data cleaning includes five complementary procedures [68]: defining and determining error types, searching for and identifying errors, correcting errors, documenting error examples and error types, and modifying data entry procedures to reduce future errors. During cleaning, data formats, completeness, rationality, and restrictions shall be inspected. Data cleaning is of vital importance for keeping data consistent, and it is widely applied in many fields, such as banking, insurance, retail industry, telecommunications, and traffic control.

In e-commerce, most data is electronically collected, which may have serious data quality problems. Classic data quality problems mainly come from software defects, customized errors, or system misconfiguration. Authors in [69] discussed data cleaning in e-commerce by crawlers and regularly re-copying customer and account information.

In [70], the problem of cleaning RFID data was examined. RFID is widely used in many applications, e.g., inventory management and target tracking. However, raw RFID data is of low quality and includes a lot of abnormal data, because it is limited by the physical design and affected by environmental noise. In [71], a probability model was developed to cope with data loss in mobile environments. Khoussainova et al. in [72] proposed a system to automatically correct errors of input data by defining global integrity constraints.

Herbert et al. [73] proposed a framework called BIO-AJAX to standardize biological data so as to conduct further computation and improve search quality. With BIO-AJAX, some errors and repetitions may be eliminated, and common data mining technologies can be executed more effectively.

– Redundancy elimination: data redundancy refers to data repetitions or surplus, which usually occurs in many datasets. Data redundancy can increase unnecessary data transmission expense and cause defects in storage systems, e.g., waste of storage space, data inconsistency, reduction of data reliability, and data damage. Therefore, various redundancy reduction methods have been proposed, such as redundancy detection, data filtering, and data compression. Such methods may apply to different datasets or application environments. However, redundancy reduction may also bring about certain negative effects. For example, data compression and decompression cause additional computational burden. Therefore, the benefits of redundancy reduction and its cost should be carefully balanced. Data collected from different fields will increasingly appear in image or video formats. It is well known that images and videos contain considerable redundancy, including temporal redundancy, spatial redundancy, statistical redundancy, and sensing redundancy. Video compression is widely used to reduce redundancy in video data, as specified in the many video coding standards (MPEG-2, MPEG-4, H.263, and H.264/AVC). In [74], the authors investigated the problem of video compression in a video surveillance system with a video sensor network. The authors propose a new MPEG-4 based method by investigating the contextual redundancy related to background and foreground in a scene. The low complexity and the low compression ratio of the proposed approach were demonstrated by the evaluation results.

In general data transmission or storage, repeated data deletion (i.e., data deduplication) is a special data compression technology that aims to eliminate repeated data copies [75]. With repeated data deletion, individual data blocks or data segments are assigned identifiers (e.g., using a hash algorithm) and stored, with the identifiers added to an identification list. As the analysis of repeated data deletion continues, if a new data block has an identifier that is identical to one in the identification list, the new data block is deemed redundant and is replaced by a reference to the corresponding stored data block. Repeated data deletion can greatly reduce storage requirements, which is particularly important to a big data storage system.

Apart from the aforementioned data pre-processing methods, specific data objects shall go through some other operations, such as feature extraction. Such an operation plays an important role in multimedia search and DNA analysis [76–78]. Usually, high-dimensional feature vectors (or high-dimensional feature points) are used to describe such data objects, and the system stores the feature vectors for future retrieval. Data transfer is usually used to process distributed heterogeneous data sources, especially business datasets [79]. As a matter of fact, in consideration of the variety of datasets, it is non-trivial or impossible to build a uniform data pre-processing procedure and technology that is applicable to all types of datasets. The specific features, problems, performance requirements, and other factors of the datasets should be considered so as to select a proper data pre-processing strategy.
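The repeated data deletion procedure described above (assign each block an identifier, keep an identification list, and store a block only when its identifier is new) can be sketched in a few lines of Python. This is a minimal illustration only: the fixed block size, the choice of SHA-256 as the hash algorithm, and the names `deduplicate` and `restore` are assumptions made for the example and are not taken from [75].

```python
import hashlib


def deduplicate(data: bytes, block_size: int = 4):
    """Split data into fixed-size blocks and store each distinct block once.

    The block size and SHA-256 identifiers are illustrative assumptions.
    """
    store = {}    # identifier -> block contents (the "identification list")
    recipe = []   # ordered identifiers needed to rebuild the original data
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        ident = hashlib.sha256(block).hexdigest()
        if ident not in store:
            # First time this content is seen: keep the block itself.
            store[ident] = block
        # Redundant blocks are represented only by their identifier.
        recipe.append(ident)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original data from the stored blocks and the recipe."""
    return b"".join(store[ident] for ident in recipe)


data = b"ABCDABCDXYZ!ABCD"
store, recipe = deduplicate(data)
print(len(recipe), "blocks referenced,", len(store), "blocks stored")
```

In this toy input the block `ABCD` occurs three times but is stored once. Practical systems must also weigh the cost of hashing and of maintaining the identifier index, which is the same benefit-versus-cost balance noted for redundancy reduction in general.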

              4 Big data storage

The explosive growth of data imposes stricter requirements on storage and management. In this section, we focus on the storage of big data. Big data storage refers to the storage and management of large-scale datasets while achieving reliability and availability of data access. We will review important issues including massive storage systems, distributed storage systems, and big data storage mechanisms. On one hand, the storage infrastructure needs to provide an information storage service with reliable storage space; on the other hand, it must provide a powerful access interface for query and analysis of a large amount of data.

Traditionally, as auxiliary equipment of servers, data storage devices are used to store, manage, look up, and analyze data with structured RDBMSs. With the sharp growth of data, data storage devices are becoming increasingly important, and many Internet companies pursue large storage capacity to be competitive. Therefore, there is a compelling need for research on data storage.


4.1 Storage system for massive data

Various storage systems have emerged to meet the demands of massive data. Existing massive storage technologies can be classified as Direct Attached Storage (DAS) and network storage, while network storage can be further classified into Network Attached Storage (NAS) and Storage Area Network (SAN).

In DAS, various hard disks are directly connected to servers, and data management is server-centric, such that storage devices are peripheral equipment, each of which takes a certain amount of I/O resource and is managed by individual application software. For this reason, DAS is only suitable for interconnecting servers at a small scale. Moreover, due to its low scalability, DAS exhibits undesirable efficiency when the storage capacity is increased, i.e., the upgradeability and expandability are greatly limited. Thus, DAS is mainly used in personal computers and small-sized servers.

Network storage utilizes a network to provide users with a unified interface for data access and sharing. Network storage equipment includes special data exchange equipment, disk arrays, tape libraries, and other storage media, as well as special storage software. It is characterized by strong expandability.

NAS is actually auxiliary storage equipment of a network. It is directly connected to a network through a hub or switch using TCP/IP protocols. In NAS, data is transmitted in the form of files. Compared to DAS, the I/O burden at a NAS server is reduced extensively, since the server accesses a storage device indirectly through a network.

While NAS is network-oriented, SAN is especially designed for data storage with a scalable and bandwidth-intensive network, e.g., a high-speed network with optical fiber connections. In SAN, data storage management is relatively independent within a storage local area network, where multipath-based data switching among any internal nodes is utilized to achieve a maximum degree of data sharing and data management.

From the perspective of the organization of a data storage system, DAS, NAS, and SAN can all be divided into three parts: (i) the disk array, which is the foundation of a storage system and the fundamental guarantee for data storage; (ii) connection and network sub-systems, which provide connection among one or more disk arrays and servers; and (iii) storage management software, which handles data sharing, disaster recovery, and other storage management tasks of multiple servers.

4.2 Distributed storage system

The first challenge brought about by big data is how to develop a large-scale distributed storage system for efficient data processing and analysis. To use a distributed system to store massive data, the following factors should be taken into consideration:

– Consistency: a distributed storage system requires multiple servers to cooperatively store data. As there are more servers, the probability of server failures becomes larger. Usually, data is divided into multiple pieces to be stored at different servers to ensure availability in case of server failure. However, server failures and parallel storage may cause inconsistency among different copies of the same data. Consistency refers to assuring that multiple copies of the same data are identical.

– Availability: a distributed storage system operates on multiple sets of servers. As more servers are used, server failures are inevitable. It is desirable that the entire system is not seriously affected by such failures and can still satisfy customers' read and write requests. This property is called availability.

– Partition tolerance: multiple servers in a distributed storage system are connected by a network. The network could have link/node failures or temporary congestion. The distributed system should have a certain level of tolerance to problems caused by network failures. It is desirable that the distributed storage still works well when the network is partitioned.

Eric Brewer proposed the CAP theory [80, 81] in 2000, which states that a distributed system cannot simultaneously meet the requirements on consistency, availability, and partition tolerance; at most two of the three requirements can be satisfied simultaneously. Seth Gilbert and Nancy Lynch from MIT proved the correctness of the CAP theory in 2002. Since consistency, availability, and partition tolerance cannot be achieved simultaneously, we can have a CA system by ignoring partition tolerance, a CP system by ignoring availability, or an AP system by ignoring consistency, according to different design goals. The three kinds of systems are discussed in the following.

CA systems do not have partition tolerance, i.e., they cannot handle network failures. Therefore, CA systems are generally deemed to be storage systems with a single server, such as traditional small-scale relational databases. Such systems keep a single copy of data, so that consistency is easily ensured. Availability is guaranteed by the excellent design of relational databases. However, since CA systems cannot handle network failures, they cannot be expanded to use many servers. Therefore, most large-scale storage systems are CP systems or AP systems.

Compared with CA systems, CP systems ensure partition tolerance. Therefore, CP systems can be expanded to become distributed systems. CP systems generally maintain several copies of the same data in order to ensure a level of fault tolerance. CP systems also ensure data consistency, i.e., multiple copies of the same data are guaranteed to be completely identical. However, CP systems cannot ensure sound availability because of the high cost of consistency assurance. Therefore, CP systems are useful for scenarios with moderate load but stringent requirements on data accuracy (e.g., trading data). BigTable and HBase are two popular CP systems.

AP systems also ensure partition tolerance. However, AP systems differ from CP systems in that they also ensure availability, and they only ensure eventual consistency rather than the strong consistency of the previous two kinds of systems. Therefore, AP systems only apply to scenarios with frequent requests but not very high requirements on accuracy. For example, in online Social Networking Services (SNS) systems, there are many concurrent visits to the data, but a certain amount of data errors is tolerable. Furthermore, because AP systems ensure eventual consistency, accurate data can still be obtained after a certain amount of delay. Therefore, AP systems may also be used under circumstances with no stringent real-time requirements. Dynamo and Cassandra are two popular AP systems.
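The notion of eventual consistency in AP systems can be illustrated with a toy model: each replica accepts writes on its own, so reads may temporarily disagree, and a later synchronization step makes the copies converge. The sketch below is only an illustration under stated assumptions. The `Replica` class, the `anti_entropy` function, and the last-write-wins rule based on a logical timestamp are hypothetical simplifications, not the mechanism of Dynamo or Cassandra.

```python
class Replica:
    """One copy of the data; always accepts reads and writes (availability)."""

    def __init__(self):
        self.data = {}  # key -> (logical timestamp, value)

    def write(self, key, value, ts):
        current = self.data.get(key)
        # Assumed conflict rule: the write with the larger timestamp wins.
        if current is None or ts > current[0]:
            self.data[key] = (ts, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def anti_entropy(a: Replica, b: Replica):
    """Exchange state so both replicas keep the newest version of every key."""
    for key, (ts, value) in list(a.data.items()):
        b.write(key, value, ts)
    for key, (ts, value) in list(b.data.items()):
        a.write(key, value, ts)


r1, r2 = Replica(), Replica()
r1.write("likes", 10, ts=1)   # a client reaches replica 1
r2.write("likes", 11, ts=2)   # another client reaches replica 2
stale = r1.read("likes")      # replica 1 still serves the older value
anti_entropy(r1, r2)          # background synchronization
converged = r1.read("likes") == r2.read("likes")
```

Before synchronization the two replicas answer differently, which is the "certain amount of data errors" an SNS application may tolerate; after synchronization both return the same value.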

4.3 Storage mechanism for big data

Considerable research on big data promotes the development of storage mechanisms for big data. Existing storage mechanisms of big data may be classified into three bottom-up levels: (i) file systems, (ii) databases, and (iii) programming models.

File systems are the foundation of the applications at upper levels. Google's GFS is an expandable distributed file system to support large-scale, distributed, data-intensive applications [25]. GFS uses cheap commodity servers to achieve fault tolerance and provides customers with high-performance services. GFS supports large-scale file applications with more frequent reading than writing. However, GFS also has some limitations, such as a single point of failure and poor performance for small files. Such limitations have been overcome by Colossus [82], the successor of GFS.

In addition, other companies and researchers also have their own solutions to meet the different demands for storage of big data. For example, HDFS and Kosmosfs are open-source derivatives of GFS. Microsoft developed Cosmos [83] to support its search and advertisement business. Facebook utilizes Haystack [84] to store its large amount of small-sized photos. Taobao also developed TFS and FastDFS. In conclusion, distributed file systems have become relatively mature after years of development and business operation. Therefore, we will focus on the other two levels in the rest of this section.

4.3.1 Database technology

Database technology has been evolving for more than 30 years. Various database systems have been developed to handle datasets at different scales and to support various applications. Traditional relational databases cannot meet the challenges on categories and scales brought about by big data. NoSQL databases (i.e., non-traditional relational databases) are becoming more popular for big data storage. NoSQL databases feature flexible schemas, support for simple and easy replication, a simple API, eventual consistency, and support for large data volumes. NoSQL databases are becoming the core technology for big data. We will examine the following three main types of NoSQL databases in this section: key-value databases, column-oriented databases, and document-oriented databases, each based on a certain data model.

– Key-value databases: key-value databases are built on a simple data model in which data is stored as key-value pairs. Every key is unique, and clients may query values by their keys. Such databases feature a simple structure, and modern key-value databases are characterized by higher expandability and shorter query response times than relational databases. Over the past few years, many key-value databases have appeared, motivated by Amazon's Dynamo system [85]. We will introduce Dynamo and several other representative key-value databases.

– Dynamo: Dynamo is a highly available and expandable distributed key-value data storage system. It is used to store and manage the status of some core services in the Amazon e-Commerce Platform that can be realized with key access. The common pattern of relational databases may generate invalid data and limit data scale and availability, while Dynamo can resolve these problems with a simple key-object interface, which consists of simple read and write operations. Dynamo achieves elasticity and availability through its data partition, data replication, and object versioning mechanisms. The Dynamo partition plan relies on Consistent Hashing [86] to divide the load among multiple main storage machines; its main advantage is that the arrival or departure of a node only affects directly adjacent nodes and does not affect other nodes. Dynamo replicates data to N servers, in which N is a configurable parameter, in order to achieve high availability and durability. Dynamo also provides eventual consistency, so that updates can be propagated to all copies asynchronously.

– Voldemort: Voldemort is also a key-value storage system, which was initially developed for and is still used by LinkedIn. Keys and values in Voldemort are composite objects constituted by tables and images. The Voldemort interface includes three simple operations: reading, writing, and deletion, all of which are addressed by keys. Voldemort provides asynchronous updating with concurrency control of multiple versions but does not ensure data consistency. However, Voldemort supports optimistic locking for consistent multi-record updating: when a conflict happens between an update and any other operation, the update operation quits. The data replication mechanism of Voldemort is the same as that of Dynamo. Voldemort not only stores data in RAM but also allows data to be inserted into a storage engine. In particular, Voldemort supports two storage engines, Berkeley DB and Random Access Files.

Key-value databases emerged a few years ago. Deeply influenced by Amazon's Dynamo, other key-value storage systems include Redis, Tokyo Cabinet and Tokyo Tyrant, Memcached and MemcacheDB, Riak, and Scalaris, all of which provide expandability by distributing keys among nodes. Voldemort, Riak, Tokyo Cabinet, and Memcached can utilize attached storage devices to store data in RAM or on disks. Other storage systems store data in RAM and provide disk backup, or rely on replication and recovery to avoid backup.
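The advantage attributed to consistent hashing above, namely that a node leaving the system only affects its directly adjacent node, can be checked with a small sketch. This is a simplified illustration, assuming one ring position per node (no virtual nodes) and MD5 as the hash function; the `HashRing` class and the node names are hypothetical and do not reproduce Dynamo's implementation.

```python
import bisect
import hashlib


def _position(name: str) -> int:
    """Map a node name or key to a point on the ring (MD5 is an arbitrary choice)."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes):
        # Sorted list of (ring position, node name).
        self._ring = sorted((_position(n), n) for n in nodes)

    def owner(self, key: str) -> str:
        """Return the first node clockwise from the key's position."""
        positions = [p for p, _ in self._ring]
        idx = bisect.bisect_right(positions, _position(key)) % len(self._ring)
        return self._ring[idx][1]

    def remove(self, node: str):
        self._ring = [(p, n) for p, n in self._ring if n != node]


ring = HashRing(["node-A", "node-B", "node-C", "node-D"])
keys = [f"key{i}" for i in range(200)]
before = {k: ring.owner(k) for k in keys}
ring.remove("node-B")
after = {k: ring.owner(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
# Only keys previously owned by node-B change owner.
print(all(before[k] == "node-B" for k in moved))
```

With a plain modulo scheme (`hash(key) % number_of_nodes`), removing one node would instead remap most keys, which is why key-value stores that distribute keys among nodes tend to prefer a ring.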

– Column-oriented databases: column-oriented databases store and process data according to columns rather than rows. Both columns and rows are segmented across multiple nodes to realize expandability. Column-oriented databases are mainly inspired by Google's BigTable. In this section, we first discuss BigTable and then introduce several derivative tools.

– BigTable: BigTable is a distributed, structured data storage system, which is designed to process large-scale (PB class) data among thousands of commercial servers [87]. The basic data structure of BigTable is a sparse, distributed, persistent, multi-dimensional sorted map. The map is indexed by row key, column key, and timestamp, and every value in the map is an uninterpreted byte array. Each row key in BigTable is an arbitrary character string of up to 64 KB. Rows are stored in lexicographical order and continually segmented into Tablets (i.e., units of distribution) for load balance. Thus, reading a short range of rows can be highly efficient, since it only involves communication with a small portion of the machines. The columns are grouped according to the prefixes of keys, thus forming column families. These column families are the basic units for access control. The timestamps are 64-bit integers that distinguish different versions of cell values. Clients may flexibly determine the number of cell versions stored. These versions are sequenced in descending order of timestamps, so the latest version is always read first.

The BigTable API features the creation and deletion of tables and column families, as well as modification of the metadata of clusters, tables, and column families. Client applications may insert or delete values in BigTable, query values from columns, or browse sub-datasets in a table. BigTable also supports some other characteristics, such as transaction processing within a single row. Users may utilize such features to conduct more complex data processing.

Every BigTable implementation includes three main components: the Master server, Tablet servers, and the client library. BigTable allows only one Master server to be deployed, which is responsible for assigning Tablets to Tablet servers, detecting added or removed Tablet servers, and conducting load balancing. In addition, the Master can modify the BigTable schema, e.g., creating tables and column families, and it garbage-collects files saved in GFS, such as deleted or disabled files of specific BigTable instances. Every Tablet server manages a set of Tablets and is responsible for the reading and writing of its loaded Tablets. When Tablets become too big, they are segmented by the server. The application client library is used to communicate with BigTable instances.

BigTable is based on many fundamental components of Google, including GFS [25], the cluster management system, the SSTable file format, and Chubby [88]. GFS is used to store data and log files. The cluster management system is responsible for task scheduling, resource sharing, processing of machine failures, and monitoring of machine statuses. The SSTable file format is used to store BigTable data internally, and it provides a persistent, ordered, and immutable mapping from keys to values, both of which are arbitrary byte strings. BigTable utilizes Chubby for the following tasks: 1) ensure there is at most one active Master copy at any time; 2) store the bootstrap location of BigTable data; 3) look up Tablet servers; 4) conduct error recovery in case of Tablet server failures; 5) store BigTable schema information; 6) store the access control table.

– Cassandra: Cassandra is a distributed storage system to manage the huge amount of structured data distributed among multiple commercial servers [89]. The system was developed by Facebook and became an open source tool in 2008. It adopts the ideas and concepts of both Amazon Dynamo and Google BigTable, especially integrating the distributed system technology of Dynamo with the BigTable data model. Tables in Cassandra are in the form of a distributed four-dimensional structured map, where the four dimensions are row, column, column family, and super column. A row is distinguished by a string key of arbitrary length. No matter how many columns are read or written, the operation on a row is atomic. Columns may constitute clusters, which are called column families and are similar to the data model of BigTable. Cassandra provides two kinds of column families: column families and super columns. A super column includes an arbitrary number of columns related to the same name. A column family includes columns and super columns, which may be continuously inserted into the column family during runtime. The partition and replication mechanisms of Cassandra are very similar to those of Dynamo, so as to achieve consistency.

– Derivative tools of BigTable: since the BigTable code cannot be obtained through an open source license, some open source projects compete to implement the BigTable concept and develop similar systems, such as HBase and Hypertable.

HBase is a BigTable clone programmed in Java and is a part of Apache's Hadoop MapReduce framework [90]. HBase replaces GFS with HDFS. It writes updated contents into RAM and regularly writes them into files on disks. The row operations are atomic operations equipped with row-level locking and transaction processing, which is optional for large scale. Partition and distribution are transparently operated and have space for client hash or fixed key.

HyperTable was developed, similarly to BigTable, to obtain a set of high-performance, expandable, distributed storage and processing systems for structured and unstructured data [91]. HyperTable relies on distributed file systems, e.g., HDFS, and a distributed lock manager. Its data representation, processing, and partition mechanisms are similar to those of BigTable. HyperTable has its own query language, called HyperTable query language (HQL), which allows users to create, modify, and query underlying tables.

Since the column-oriented storage databases mainly emulate BigTable, their designs are all similar, except for the concurrency mechanism and several other features. For example, Cassandra emphasizes weak consistency with concurrency control of multiple versions, while HBase and HyperTable focus on strong consistency through locks or log records.
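The data model shared by BigTable and its derivatives, a sorted map indexed by row key, column key, and timestamp whose values are uninterpreted bytes, can be mimicked in memory as follows. This is a toy sketch under stated assumptions: the `SparseTable` class, its `max_versions` parameter, and the example row and column names are hypothetical, and nothing here reflects how Tablets, SSTables, or distribution are actually implemented.

```python
class SparseTable:
    """In-memory imitation of a (row, column, timestamp) -> bytes map."""

    def __init__(self, max_versions: int = 3):
        self.max_versions = max_versions  # client-chosen number of versions kept
        self.cells = {}  # (row_key, column_key) -> [(timestamp, value), ...], newest first

    def put(self, row: str, column: str, timestamp: int, value: bytes):
        versions = self.cells.setdefault((row, column), [])
        versions.append((timestamp, value))
        # Keep versions in descending timestamp order and drop the oldest ones.
        versions.sort(key=lambda v: v[0], reverse=True)
        del versions[self.max_versions:]

    def get(self, row: str, column: str):
        """Return the latest version of a cell, or None if the cell is empty."""
        versions = self.cells.get((row, column))
        return versions[0][1] if versions else None

    def scan(self, start_row: str, end_row: str):
        """Yield (row, column, latest value) for rows in [start_row, end_row),
        in lexicographic row order."""
        for row, column in sorted(self.cells):
            if start_row <= row < end_row:
                yield row, column, self.cells[(row, column)][0][1]


table = SparseTable(max_versions=2)
table.put("com.cnn.www", "contents:", 1, b"<html>v1")
table.put("com.cnn.www", "contents:", 3, b"<html>v3")
table.put("com.cnn.www", "contents:", 2, b"<html>v2")
latest = table.get("com.cnn.www", "contents:")  # the value written at timestamp 3
```

Only cells that were actually written occupy space, which is what "sparse" means here, and a range scan over adjacent row keys touches a contiguous slice of the sorted map, mirroring why short row-range reads involve few machines.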

– Document databases: compared with key-value storage, document storage can support more complex data forms. Since documents do not follow strict schemas, there is no need to conduct schema migration. In addition, key-value pairs can still be saved. We will examine three important representatives of document storage systems, i.e., MongoDB, SimpleDB, and CouchDB.

– MongoDB: MongoDB is an open-source, document-oriented database [92]. MongoDB stores documents as Binary JSON (BSON) objects [93], which are similar to objects. Every document has an ID field as the primary key. Queries in MongoDB are expressed with a syntax similar to JSON. A database driver sends the query as a BSON object to MongoDB. The system allows queries on all documents, including embedded objects and arrays. To enable rapid queries, indexes can be created on the queryable fields of documents. Replication in MongoDB is executed with log files on the master nodes, which record all the high-level operations conducted in the database. During replication, the slaves query the master for all the write operations since the last synchronization and execute the operations from the log files on their local databases. MongoDB supports horizontal expansion with automatic sharding to distribute data among thousands of nodes, with automatic load balancing and failover.


– SimpleDB: SimpleDB is a distributed database and is a web service of Amazon [94]. Data in SimpleDB is organized into various domains, in which data may be stored, acquired, and queried. Domains include different properties and name/value pair sets of items. Data is copied to different machines at different data centers in order to ensure data safety and improve performance. This system does not support automatic partition and thus cannot be expanded with the change of data volume. SimpleDB allows users to query with SQL. It is worth noting that SimpleDB can assure eventual consistency but does not support Multi-Version Concurrency Control (MVCC). Therefore, conflicts therein cannot be detected from the client side.

– CouchDB: Apache CouchDB is a document-oriented database written in Erlang [95]. Data in CouchDB is organized into documents consisting of fields named by keys/names and values, which are stored and accessed as JSON objects. Every document is provided with a unique identifier. CouchDB allows access to database documents through a RESTful HTTP API. If a document needs to be modified, the client must download the entire document, modify it, and then send it back to the database. After a document is rewritten, its identifier is updated. CouchDB utilizes optimistic replication to obtain scalability without a sharding mechanism. Since various CouchDB instances may be executed along with other transactions simultaneously, any kind of replication topology can be built. The consistency of CouchDB relies on the replication mechanism. CouchDB supports MVCC with historical hash records.
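The difference between a store with MVCC-style revision checks (as described for CouchDB) and one without (as described for SimpleDB) is that the former lets a client detect a conflicting update. The sketch below models this with a revision string made of a generation counter and a content hash. It is a toy in-memory model: the `DocumentStore` class, `ConflictError`, and the exact revision format are assumptions for illustration and are not CouchDB's actual HTTP API.

```python
import hashlib
import json


class ConflictError(Exception):
    """Raised when an update is based on an outdated revision."""


class DocumentStore:
    def __init__(self):
        self._docs = {}  # doc_id -> (revision, body)

    @staticmethod
    def _revision(generation: int, body: dict) -> str:
        # Assumed format: "<generation>-<hash of the document content>".
        digest = hashlib.md5(json.dumps(body, sort_keys=True).encode()).hexdigest()
        return f"{generation}-{digest}"

    def get(self, doc_id: str):
        """Return (revision, copy of the whole document)."""
        rev, body = self._docs[doc_id]
        return rev, dict(body)

    def put(self, doc_id: str, body: dict, rev=None) -> str:
        """Store a whole document; the caller must name the revision it read."""
        current = self._docs.get(doc_id)
        current_rev = current[0] if current else None
        if rev != current_rev:
            raise ConflictError(f"{doc_id}: expected {current_rev}, got {rev}")
        generation = 1 if current_rev is None else int(current_rev.split("-")[0]) + 1
        new_rev = self._revision(generation, body)
        self._docs[doc_id] = (new_rev, dict(body))
        return new_rev


store = DocumentStore()
first_rev = store.put("user:1", {"name": "Ann"})
rev, doc = store.get("user:1")   # download the entire document
doc["name"] = "Anne"             # modify it locally
store.put("user:1", doc, rev=rev)  # send it back, quoting the revision read
```

A second client that still holds `first_rev` and tries to write would receive `ConflictError` rather than silently overwriting the newer document, which is the client-side conflict detection that a store without MVCC cannot offer.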

Big data are generally stored in hundreds and even thousands of commercial servers. Thus, the traditional parallel models, such as Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), may not be adequate to support such large-scale parallel programs. Recently, some proposed parallel programming models effectively improve the performance of NoSQL and reduce the performance gap to relational databases. Therefore, these models have become the cornerstone for the analysis of massive data.

– MapReduce: MapReduce [22] is a simple but powerful programming model for large-scale computing, using a large number of clusters of commercial PCs to achieve automatic parallel processing and distribution. In MapReduce, the computing model has only two functions, i.e., Map and Reduce, both of which are programmed by users. The Map function processes input key-value pairs and generates intermediate key-value pairs. Then, MapReduce combines all the intermediate values related to the same key and transmits them to the Reduce function, which further compresses the value set into a smaller set. MapReduce has the advantage that it avoids the complicated steps for developing parallel applications, e.g., data scheduling, fault tolerance, and inter-node communications. The user only needs to program the two functions to develop a parallel application. The initial MapReduce framework did not support multiple datasets in a task, which has been mitigated by some recent enhancements [96, 97].

Over the past decades, programmers have become familiar with the advanced declarative language of SQL, often used in relational databases for task description and dataset analysis. However, the succinct MapReduce framework only provides two nontransparent functions, which cannot cover all the common operations. Therefore, programmers have to spend time on programming the basic functions, which are typically hard to maintain and reuse. In order to improve programming efficiency, some advanced language systems have been proposed, e.g., Sawzall [98] of Google, Pig Latin [99] of Yahoo, Hive [100] of Facebook, and Scope [87] of Microsoft.
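The division of labor in MapReduce, where the user writes only Map and Reduce and the framework groups intermediate values by key, can be sketched with a single-process word count. This is a minimal illustration and not the distributed framework itself: `run_mapreduce` is a hypothetical stand-in for the runtime, and it ignores partitioning, scheduling, and fault tolerance.

```python
from collections import defaultdict


def map_fn(_key, line):
    """User-defined Map: emit an intermediate (word, 1) pair for every word."""
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """User-defined Reduce: compress all values for one key into a single count."""
    yield word, sum(counts)


def run_mapreduce(records, map_fn, reduce_fn):
    """Stand-in for the framework: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)   # the "shuffle" step
    output = {}
    for key in sorted(groups):
        for out_key, out_value in reduce_fn(key, groups[key]):
            output[out_key] = out_value
    return output


records = [(0, "big data big"), (1, "Data storage")]
print(run_mapreduce(records, map_fn, reduce_fn))
# {'big': 2, 'data': 2, 'storage': 1}
```

Even this small task requires two hand-written functions, whereas the same result is a one-line `GROUP BY` query in SQL, which is the motivation given above for higher-level languages on top of MapReduce.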

– Dryad: Dryad [101] is a general-purpose distributed execution engine for processing parallel applications of coarse-grained data. The operational structure of Dryad is a directed acyclic graph, in which vertexes represent programs and edges represent data channels. Dryad executes operations on the vertexes in clusters and transmits data via data channels, including files, TCP connections, and shared-memory FIFOs. During operation, resources in the logical operation graph are automatically mapped to physical resources.

The operation structure of Dryad is coordinated by a central program called the job manager, which can be executed in clusters or on workstations through the network. A job manager consists of two parts: 1) application code, which is used to build a job communication graph, and 2) program library code, which is used to arrange available resources. All kinds of data are transmitted directly between vertexes. The job manager is therefore responsible only for decision-making and does not obstruct any data transmission.

In Dryad, application developers can flexibly choose any directed acyclic graph to describe the communication modes of the application and express data transmission mechanisms. In addition, Dryad allows vertexes to use any number of input and output datasets, while MapReduce supports only one input set and one output set.


DryadLINQ [102] is the advanced language of Dryad and is used to integrate the aforementioned SQL-like language execution environment.

– All-Pairs: All-Pairs [103] is a system specially designed for biometrics, bioinformatics, and data mining applications. It focuses on comparing element pairs in two datasets by a given function. An All-Pairs problem can be expressed as a three-tuple (Set A, Set B, and Function F), in which Function F is used to compare all elements in Set A with all elements in Set B. The comparison result is an output matrix M, which is also called the Cartesian product or cross join of Set A and Set B.

All-Pairs is implemented in four phases: system modeling, distribution of input data, batch job management, and result collection. In Phase I, an approximation model of system performance is built to evaluate how much CPU resource is needed and how to partition the job. In Phase II, a spanning tree is built for data transmission, which lets the workload of every partition retrieve input data effectively. In Phase III, after the data flow is delivered to the proper nodes, the All-Pairs engine builds a batch-processing submission for the jobs in the partitions, sequences them in the batch processing system, and formulates a node running command to acquire data. In the last phase, after the batch processing system completes the jobs, the extraction engine collects the results and combines them in a proper structure, which is generally a single file list in which all results are put in order.

– Pregel: The Pregel [104] system of Google facilitates the processing of large-sized graphs, e.g., the analysis of network graphs and social networking services. A computational task is expressed by a directed graph constituted by vertexes and directed edges. Every vertex is associated with a modifiable, user-defined value, and every directed edge is associated with its source vertex and consists of a user-defined value and the identifier of a target vertex. Once the graph is built, the program conducts iterative calculations, called supersteps, which are separated by global synchronization points, until the algorithm completes and the output is produced. In every superstep, vertex computations run in parallel, and every vertex executes the same user-defined function to express the logic of a given algorithm. Every vertex may modify its own status and that of its output edges, receive messages sent in the previous superstep, send messages to other vertexes, and even modify the topological structure of the entire graph. Edges have no associated computations. Every vertex may deactivate itself by voting to halt. When all vertexes are inactive and no messages remain to be transmitted, the entire program execution is completed. The output of a Pregel program is a set consisting of the values output by all the vertexes. Generally speaking, the input and output of a Pregel program are isomorphic directed graphs.
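The superstep mechanism of Pregel can be illustrated with a small single-machine simulation. The sketch below is not Pregel's API: the function name `pregel_max` and the dictionary-based graph are assumptions made for illustration, and the vertex program (propagating the largest vertex value) is just one convenient example. It assumes every vertex that appears as an edge target also has an entry in `values`.

```python
from collections import defaultdict


def pregel_max(edges, values):
    """Simulate Pregel-style supersteps that propagate the maximum vertex value.

    edges:  dict mapping each vertex to a list of target vertexes (out-edges)
    values: dict mapping each vertex to its initial, modifiable value
    Returns the final values and the number of supersteps executed.
    """
    values = dict(values)
    active = set(values)        # every vertex starts in the active state
    inbox = defaultdict(list)   # messages to be consumed in the current superstep
    superstep = 0
    while active or inbox:
        outbox = defaultdict(list)
        # A vertex runs if it is active or if a message has reactivated it.
        for vertex in active | set(inbox):
            old = values[vertex]
            new = max([old] + inbox.get(vertex, []))
            values[vertex] = new
            # Send messages in the first superstep or when the value changed.
            if superstep == 0 or new > old:
                for target in edges.get(vertex, []):
                    outbox[target].append(new)
        # Every vertex votes to halt; only incoming messages wake it up again.
        active = set()
        inbox = outbox          # global synchronization point between supersteps
        superstep += 1
    return values, superstep


if __name__ == "__main__":
    graph = {"a": ["b"], "b": ["c"], "c": ["a"]}
    print(pregel_max(graph, {"a": 3, "b": 6, "c": 2}))
```

Messages are buffered in `outbox` and only become visible in the next superstep, which mirrors the synchronization barrier described above; the loop ends once no vertex is active and no message is in transit.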

Inspired by the above programming models, other research has also focused on programming models for more complex computational tasks, e.g., iterative computations [105, 106], fault-tolerant in-memory computations [107], incremental computations [108], and flow control decision-making related to data [109].
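As a concrete illustration of the simplest of these programming models, the two user-written functions of MapReduce can be sketched for word counting. This is a single-process simulation under stated assumptions: `run_mapreduce`, `map_fn`, and `reduce_fn` are hypothetical names, not the API of Hadoop or of Google's implementation, and the grouping step stands in for the distributed shuffle that a real framework performs.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: emit an intermediate (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: compress all values that share a key into a single count."""
    return word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Apply mapper to each input pair, group by key, then apply reducer."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)   # stands in for the shuffle phase
    return dict(reducer(key, values) for key, values in groups.items())


if __name__ == "__main__":
    lines = [(0, "big data big"), (1, "data survey")]
    print(run_mapreduce(lines, map_fn, reduce_fn))
```

Only `map_fn` and `reduce_fn` are specific to the task; scheduling, grouping, and (in a real cluster) fault tolerance belong to the framework, which is the division of labor the model relies on.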

5 Big data analysis

The analysis of big data mainly involves analytical methods for traditional data and big data, analytical architecture for big data, and software used for mining and analysis of big data. Data analysis is the final and most important phase in the value chain of big data, with the purpose of extracting useful values and providing suggestions or decisions. Different levels of potential value can be generated through the analysis of datasets in different fields [10]. However, data analysis is a broad area, which frequently changes and is extremely complex. In this section, we introduce the methods, architectures, and tools for big data analysis.

5.1 Traditional data analysis

Traditional data analysis means using proper statistical methods to analyze massive data, to concentrate, extract, and refine useful data hidden in a batch of chaotic datasets, and to identify the inherent law of the subject matter, so as to maximize the value of data. Data analysis plays a huge guidance role in making development plans for a country, understanding customer demands for commerce, and predicting market trends for enterprises. Big data analysis can be deemed as the analysis technique for a special kind of data. Therefore, many traditional data analysis methods may still be utilized for big data analysis. Several representative traditional data analysis methods are examined in the following, many of which are from statistics and computer science.

– Cluster Analysis: a statistical method for grouping objects, and specifically for classifying objects according to some features. Cluster analysis is used to differentiate objects with particular features and divide them into categories (clusters) according to these features, such that objects in the same category have high homogeneity while different categories have high heterogeneity. Cluster analysis is an unsupervised learning method that needs no labeled training data.

– Factor Analysis: basically targeted at describing the relation among many elements with only a few factors, i.e., grouping several closely related variables into a factor, so that the few factors can then be used to reveal most of the information of the original data.


– Correlation Analysis: an analytical method for determining the law of relations, such as correlation, correlative dependence, and mutual restriction, among observed phenomena, and accordingly conducting forecast and control. Such relations may be classified into two types: (i) function, reflecting the strict dependence relationship among phenomena, which is also called a definitive dependence relationship; and (ii) correlation, denoting undetermined or inexact dependence relations, in which a numerical value of one variable may correspond to several numerical values of the other variable, and such numerical values fluctuate regularly around their mean values.

– Regression Analysis: a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies dependence relationships among variables hidden by randomness. Regression analysis may turn complex and undetermined correlations among variables into simple and regular ones.

– A/B Testing: also called bucket testing. It is a technique for determining how to improve target variables by comparing the tested groups. Big data will require a large number of tests to be executed and analyzed.

– Statistical Analysis: Statistical analysis is based on statistical theory, a branch of applied mathematics. In statistical theory, randomness and uncertainty are modeled with probability theory. Statistical analysis can provide description and inference for big data. Descriptive statistical analysis can summarize and describe datasets, while inferential statistical analysis can draw conclusions from data subject to random variations. Statistical analysis is widely applied in the economic and medical care fields [110].

– Data Mining Algorithms: Data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. In 2006, the IEEE International Conference on Data Mining (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [111], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, and CART. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and link mining, all of which are among the most important problems in data mining research.
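Of the methods listed above, cluster analysis with k-means is easy to show in a few lines. The following is a minimal sketch rather than a reference implementation: seeding the centroids with the first k points is a simplifying assumption made here so that the result is deterministic, whereas practical implementations usually choose random or otherwise spread-out seeds.

```python
import math


def kmeans(points, k, iterations=100):
    """Group points (tuples of numbers) into k clusters.

    Returns (centroids, assignment), where assignment[i] is the index of the
    cluster that points[i] belongs to.
    """
    centroids = [tuple(p) for p in points[:k]]   # assumption: first k points as seeds
    assignment = None
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break                                # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                          # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, assignment


if __name__ == "__main__":
    data = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
    print(kmeans(data, 2))
```

No labels are supplied at any point, which is what makes the method unsupervised: the two groups emerge only from the distances between the points.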

5.2 Big data analytic methods

At the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data so as to bring value to enterprises and individuals. At present, the main processing methods for big data are as follows.

– Bloom Filter: A Bloom filter consists of a series of hash functions. Its principle is to store hash values of data, rather than the data itself, by utilizing a bit array, which is in essence a bitmap index that uses hash functions to conduct lossy compression storage of data. It has advantages such as high space efficiency and high query speed, but also has disadvantages in misrecognition (false positives) and deletion.

– Hashing: a method that essentially transforms data into shorter fixed-length numerical values or index values. Hashing has advantages such as rapid reading and writing and high query speed, but it is hard to find a sound hash function.

– Index: an index is always an effective method to reduce the expense of disk reading and writing and to improve insertion, deletion, modification, and query speeds, both in traditional relational databases that manage structured data and in other technologies that manage semi-structured and unstructured data. However, an index has the disadvantage of the additional cost of storing index files, which must be maintained dynamically when data is updated.

– Trie: also called a trie tree, a variant of the hash tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of a trie is to use common prefixes of character strings to reduce string comparisons to the greatest extent, so as to improve query efficiency.

– Parallel Computing: compared to traditional serial computing, parallel computing refers to simultaneously utilizing several computing resources to complete a computation task. Its basic idea is to decompose a problem into parts and assign them to several separate processes to be completed independently, so as to achieve co-processing. Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see a comparison in Table 1).
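The Bloom filter described in the list above can be sketched as follows. The bit-array size, the number of hash functions, and the use of salted SHA-256 digests to derive several hash values are all assumptions chosen for illustration; real deployments size these parameters from the expected number of items and the tolerated false-positive rate.

```python
import hashlib


class BloomFilter:
    """Set-membership sketch that stores hash positions instead of the data."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes          # assumed to stay below 256
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        """Derive num_hashes bit positions by salting one hash function."""
        data = str(item).encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


if __name__ == "__main__":
    seen = BloomFilter()
    seen.add("alice@example.org")
    print("alice@example.org" in seen)
```

Because different items can set the same bits, a bit cannot be cleared without risking the loss of other items, which is the deletion weakness noted above.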

Although parallel computing systems or tools such as MapReduce and Dryad are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages are being developed based on these systems. Such high-level languages include Sawzall, Pig, and Hive, used for MapReduce, as well as Scope and DryadLINQ, used for Dryad.
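Among the processing methods listed earlier in this section, the trie also lends itself to a compact illustration. The sketch below covers the two uses mentioned for it, word frequency statistics and prefix-based retrieval; the class and method names are hypothetical and chosen only for this example.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # maps a character to the child node
        self.count = 0       # how many inserted words end at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        """Add one occurrence of word, sharing nodes with common prefixes."""
        node = self.root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.count += 1

    def _find(self, text):
        node = self.root
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def frequency(self, word):
        """Return how many times word was inserted (0 if never)."""
        node = self._find(word)
        return node.count if node else 0

    def starts_with(self, prefix):
        """Return all stored words that begin with prefix, sorted."""
        start = self._find(prefix)
        if start is None:
            return []
        results, stack = [], [(start, prefix)]
        while stack:
            node, text = stack.pop()
            if node.count:
                results.append(text)
            for ch, child in node.children.items():
                stack.append((child, text + ch))
        return sorted(results)


if __name__ == "__main__":
    trie = Trie()
    for word in ["data", "data", "database", "day"]:
        trie.insert(word)
    print(trie.frequency("data"), trie.starts_with("dat"))
```

A lookup walks at most one node per character of the query, regardless of how many words are stored, because words with a common prefix share the same path.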

5.3 Architecture for big data analysis

Because of the 4Vs of big data, different analytical architectures should be considered for different application requirements.


Table 1 Comparison of MPI, MapReduce, and Dryad

| | MPI | MapReduce | Dryad |
|---|---|---|---|
| Deployment | Computing node and data storage arranged separately (data should be moved to the computing node) | Computing and data storage arranged at the same node (computing should be close to data) | Computing and data storage arranged at the same node (computing should be close to data) |
| Resource management/scheduling | – | Workqueue (Google), HOD (Yahoo) | Not clear |
| Low-level programming | MPI API | MapReduce API | Dryad API |
| High-level programming | – | Pig, Hive, Jaql, ... | Scope, DryadLINQ |
| Data storage | The local file system, NFS, ... | GFS (Google), HDFS (Hadoop), KFS, Amazon S3, ... | NTFS, Cosmos DFS |
| Task partitioning | User manually partitions the tasks | Automation | Automation |
| Communication | Messaging, remote memory access | Files (local FS, DFS) | Files, TCP pipes, shared-memory FIFOs |
| Fault tolerance | Checkpoint | Task re-execute | Task re-execute |

5.3.1 Real-time vs. offline analysis

According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

– Real-time analysis is mainly used in e-commerce and finance. Since data constantly changes, rapid data analysis is needed and analytical results must be returned with a very short delay. The main existing architectures for real-time analysis include (i) parallel processing clusters using traditional relational databases and (ii) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

– Offline analysis is usually used for applications without high requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally conducts analysis by importing logs into a special platform through data acquisition tools. Under the big data setting, many Internet enterprises utilize offline analysis architectures based on Hadoop in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples of such acquisition tools include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool Timetunnel, and Chukwa of Hadoop. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.

5.3.2 Analysis at different levels

Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive-level analysis, which are examined in the following.

– Memory-level analysis is for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays, the memory of a server cluster surpasses hundreds of GB, and even the TB level is common. Therefore, an in-memory database technology may be used, and hot data should reside in memory so as to improve analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis. MongoDB is a representative memory-level analytical architecture. With the development of SSDs (solid-state drives), the capacity and performance of memory-level data analysis have been further improved and widely applied.

– BI analysis is for the case where the data scale surpasses the memory level but may be imported into the BI analysis environment. The current mainstream BI products provide data analysis plans that support data at the TB level and above.

– Massive analysis is for the case where the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analysis utilizes HDFS of Hadoop to store data and MapReduce for data analysis. Most massive analysis belongs to the offline analysis category.


5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithms differ greatly according to the kinds of data and application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.
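For a task that is amenable to parallel processing, the decomposition can be sketched as splitting the input into chunks, processing the chunks independently, and combining the partial results. The example below is a minimal sketch: the function names and the sum-of-squares workload are illustrative assumptions. It uses threads from the standard library for simplicity; in standard CPython builds, CPU-bound work would typically need `concurrent.futures.ProcessPoolExecutor` or a cluster framework to gain real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(values, num_chunks):
    """Split values into at most num_chunks contiguous chunks of similar size."""
    size = max(1, -(-len(values) // num_chunks))   # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def sum_of_squares(chunk):
    """Independent sub-task: needs no data from any other chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(values, workers=4):
    """Decompose the problem, run the sub-tasks concurrently, then combine."""
    chunks = chunked(values, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(sum_of_squares, chunks))
    return sum(partial_results)


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(10))))
```

The approach works here because addition is associative, so partial sums can be combined in any order; algorithms without such a property need more careful distributed designs.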

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the top five most widely used software packages, according to a 2012 KDnuggets survey of 798 professionals on "What Analytics, Data mining, Big Data software that you used in the past 12 months for a real project?" [112].

– R (30.7 %): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. When computing-intensive tasks are executed, code programmed in C, C++, and Fortran may be called in the R environment. In addition, skilled users can directly call R objects in C. R is in fact an implementation of the S language, an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and drawing plots. Compared to S, R is more popular since it is open source. R ranked first in the KDnuggets 2012 survey. Furthermore, in a 2012 survey of "Design languages you have used for data mining/analysis in the past year", R was also in first place, ahead of SQL and Java. Due to the popularity of R, database manufacturers such as Teradata and Oracle have released products supporting R.

– Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins with powerful data analysis functions, such as Analysis ToolPak and Solver Add-in, are integrated initially, but such plug-ins can be used only if users enable them. Excel is also the only commercial software among the top five.

– Rapid-I RapidMiner (26.7 %): RapidMiner is open source software used for data mining, machine learning, and predictive analysis. In a 2011 KDnuggets survey, it was used more frequently than R (ranking first). The data mining and machine learning procedures provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment. The data mining flow is described in XML and displayed through a graphical user interface (GUI). RapidMiner is written in Java. It integrates the learners and evaluation methods of Weka, and works with R. Functions of RapidMiner are implemented by connecting processes that include various operators. The entire flow can be deemed as a production line of a factory, with original data as input and model results as output. The operators can be considered as specific functions with different input and output characteristics.

– KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, and open-source platform rich in data integration, data processing, data analysis, and data mining functions [113]. It allows users to create data flows or data channels in a visualized manner, to selectively run some or all analytical procedures, and provides analytical results, models, and interactive views. KNIME is written in Java and, being based on Eclipse, provides more functions as plug-ins. Through plug-in files, users can insert processing modules for files, pictures, and time series, and integrate them into various open source projects, e.g., R and Weka. KNIME controls data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted in a visualized environment. KNIME is designed as a module-based and expandable framework. There is no dependence between its processing units and data containers, making it adaptive to distributed environments and independent development. In addition, KNIME is easy to expand: developers can effortlessly extend its various nodes and views.

– Weka/Pentaho (14.8 %): Weka, abbreviated from Waikato Environment for Knowledge Analysis, is free and open-source machine learning and data mining software written in Java. Weka provides functions such as data processing, feature selection, classification, regression, clustering, association rules, and visualization. Pentaho is one of the most popular open-source BI software packages. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, data mining, and other aspects of BI. Weka's data processing algorithms are also integrated in Pentaho and can be called directly.

              6 Big data applications

In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful value via judgments, suggestions, support, or decisions. However, data analysis involves a wide range of applications, which frequently change and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields, including structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

6.1 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically includes large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have emerged over the past decades. For example, as early as the 1990s, BI became a prevailing technology for business applications, and network search engines based on massive data mining emerged in the early 21st century. Some potential and influential applications from different fields, together with their data and analysis characteristics, are discussed as follows.

– Evolution of Commercial Applications: The earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. Analytical techniques used in such systems prevailed in the 1990s and were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, score cards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have been providing a unique opportunity for organizations to have an online display and interact directly with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted with text analysis and website mining techniques. As reported in [115], the quantity of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the Internet of Things based on sensors are opening a new generation of innovative applications and requiring considerably larger capacity for supporting location-sensing, people-oriented, and context-aware operation.

– Evolution of Network Applications: The early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to the mining of email contents and the building of search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages, full of various kinds of data, such as text, images, audio, videos, and interactive content. Therefore, plenty of advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share content.

– Evolution of Scientific Applications: Scientific research in many fields, such as astrophysics, oceanology, genomics, and environmental research, is acquiring massive data with high-throughput sensors and instruments. The US National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, inter-operative analysis software, and data services to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets vary widely in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

6.2 Big data analysis fields

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and comprehensive coverage is not easy, we will focus on the key problems and technologies in data analysis in the following discussions.


6.2.1 Structured data analysis

Business applications and scientific research may generate massive structured data, whose management and analysis rely on mature commercialized technologies, such as RDBMS, data warehouses, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning based on exact mathematical models and powerful algorithms has been applied to anomaly detection [117] and energy control [118]. By exploiting data characteristics, temporal and spatial mining can extract knowledge structures hidden in high-speed data flows and sensor data [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially in process analysis with event data [121].

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to have more business potential than structured data analysis. Generally, text analysis is a process of extracting useful information and knowledge from unstructured text. Text mining is interdisciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and data mining in particular. Most text mining systems are based on text representations and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammars [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, images, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which is discussed in Section 6.2.4. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves the mining of semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., email, newsgroup management, and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to simulate and integrate data in the Web, so as to conduct more complex queries than searches based on keywords.
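A first step in hypertext mining is usually to pull the hyperlinks and their anchor text out of the semi-structured HTML. The following minimal sketch does this with Python's standard `html.parser` module; the class name `LinkExtractor` and the helper `extract_links` are hypothetical, and the sketch ignores complications such as nested anchors or relative URL resolution.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None     # href of the <a> element currently open, if any
        self._text = []       # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    page = '<p>See <a href="https://example.org/a">page A</a> and <a href="/b">B</a>.</p>'
    print(extract_links(page))
```

The extracted pairs can feed both content mining (the anchor text describes the target page) and the link-structure models used in Web structure mining.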

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagrams of links within a website or among multiple websites. Models are built based on topological structures provided with hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages. The topic-oriented crawler is another successful case that utilizes the models [127].
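To make the idea of a link-structure model concrete, the sketch below computes PageRank scores on a tiny link graph by power iteration. It is a simplified textbook formulation, not the production algorithm of any search engine: the damping factor of 0.85, the fixed iteration count, and the even redistribution of rank from pages without out-links are assumptions made for this example, and the graph is assumed to be non-empty.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Return a dict of PageRank scores for a link graph.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(rank[page] for page in pages if not links.get(page))
        new_rank = {page: (1 - damping) / n + damping * dangling / n for page in pages}
        # Every page passes an equal share of its rank along each out-link.
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    for page, score in sorted(pagerank(web).items()):
        print(page, round(score, 3))
```

In this toy graph, page "c" is linked from both "a" and "b" and therefore ends up with a higher score than "b", which is linked only from "a".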

Web usage mining aims to mine auxiliary data generated by Web dialogues or activities, whereas Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or trades, caches, user queries, bookmark data, mouse clicks and scrolls, and any other kinds of data generated through interaction with the Web. As Web services and Web 2.0 are becoming mature and popular, Web usage data will have increasingly high variety. Web usage mining plays key roles in personalized space, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.
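The collaborative recommender idea mentioned above can be sketched with user-based collaborative filtering. This is one common variant, chosen here as an assumption since the text does not name a specific algorithm: users are compared by the cosine similarity of their rating vectors, and unseen items are scored by the similarity-weighted average of other users' ratings. All names and the sample ratings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse rating dicts (item -> rating)."""
    dot = sum(u[item] * v[item] for item in set(u) & set(v))
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def recommend(ratings, user, top_n=3):
    """Rank items the user has not rated by similarity-weighted average rating."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        similarity = cosine(ratings[user], other_ratings)
        if similarity <= 0:
            continue                      # ignore users with nothing in common
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + similarity * rating
                weights[item] = weights.get(item, 0.0) + similarity
    predicted = {item: scores[item] / weights[item] for item in scores}
    # Highest predicted rating first; ties are broken alphabetically.
    return sorted(predicted, key=lambda item: (-predicted[item], item))[:top_n]


if __name__ == "__main__":
    shop = {
        "ann": {"book": 5, "lamp": 1},
        "bob": {"book": 5, "lamp": 1, "pen": 4, "mug": 2},
        "cat": {"book": 1, "lamp": 5, "mug": 5},
    }
    print(recommend(shop, "ann"))
```

Here "ann" rates items much like "bob", so the item "bob" liked ("pen") is ranked ahead of "mug", even though the less similar user "cat" rated "mug" highly.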


              624 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) have been growing at an amazing speed. Multimedia analysis extracts useful knowledge from such data and seeks to understand its semantics. Because multimedia data is heterogeneous and most of such data contains richer information than simple structured data or text data, extracting information is confronted with the huge challenge of the semantic gap. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection, etc.

Audio summarization can be accomplished by extracting the prominent words or phrases from metadata or by synthesizing a new representation. Video summarization interprets the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied to many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video and apply additional smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.

Multimedia annotation inserts labels to describe the contents of images and videos at both the syntactic and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time and labor intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic gap. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to jointly explore manual and automatic multimedia annotation [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users to conveniently and quickly look up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, including shot boundary detection, key frame extraction, and scene segmentation, etc. Based on the result of structural analysis, the second procedure is feature extraction, which mainly includes further mining the features of key frames, objects, texts, and movements, which are the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find the patterns of video contents and put videos into predefined categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos. The retrieval results are then optimized through relevance feedback.
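The query-and-retrieval step can be sketched as follows. This is a minimal Python illustration, assuming each indexed video has already been reduced to a numeric feature vector and that cosine similarity serves as the similarity measurement. The survey does not prescribe a particular measure, and the index contents and function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def retrieve(query_features, index, top_k=2):
    """Return the ids of the top_k indexed videos most similar to the query."""
    ranked = sorted(
        index,
        key=lambda video_id: cosine_similarity(query_features, index[video_id]),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical index: video id -> feature vector produced by feature extraction.
video_index = {
    "soccer_match": [0.9, 0.1, 0.0],
    "news_bulletin": [0.1, 0.8, 0.3],
    "basketball_game": [0.8, 0.2, 0.1],
}
results = retrieve([1.0, 0.0, 0.0], video_index)
```

A relevance-feedback loop would then adjust the query vector or the ranking using the results a user marks as relevant; that step is omitted here.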

Multimedia recommendation recommends specific multimedia contents according to users' preferences. It has proven to be an effective approach to providing personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. The content-based methods identify general features of users or of the contents they are interested in, and recommend to users other contents with similar features. These methods largely rely on content similarity measurement, but most of them are troubled by limited content analysis and over-specialization. The collaborative-filtering-based methods identify groups with similar interests and recommend contents to group members according to their behavior [132]. Presently, a hybrid method has been introduced, which integrates the advantages of the aforementioned two types of methods to improve recommendation quality [133].
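The collaborative-filtering idea can be illustrated with a small user-based sketch in Python. It is not the method of [132]: the ratings, the cosine similarity between users, and the similarity-weighted sum used for ranking are assumptions chosen for brevity.

```python
import math


def user_similarity(ratings_a, ratings_b):
    """Cosine similarity between two users' rating dicts (missing items count as 0)."""
    shared = set(ratings_a) & set(ratings_b)
    if not shared:
        return 0.0
    dot = sum(ratings_a[item] * ratings_b[item] for item in shared)
    norm_a = math.sqrt(sum(r * r for r in ratings_a.values()))
    norm_b = math.sqrt(sum(r * r for r in ratings_b.values()))
    return dot / (norm_a * norm_b)


def recommend(user, ratings, top_k=1):
    """Rank items the user has not rated by the similarity-weighted ratings of others."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = user_similarity(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical ratings (1 to 5) of video clips by three users.
ratings = {
    "ann": {"clip_a": 5, "clip_b": 4},
    "bob": {"clip_a": 5, "clip_b": 5, "clip_c": 5},
    "cat": {"clip_a": 1, "clip_d": 5},
}
suggestions = recommend("ann", ratings)
```

Here "ann" is recommended "clip_c" because "bob", whose tastes are closest to hers, rated it highly. A content-based system would instead compare features of the clips themselves.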

The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains text descriptions related to concepts and video examples [134]. In [135], the author proposed a new algorithm for special multimedia event detection using a few positive training examples. The research on video event detection is still in its infancy and mainly focuses on sports or news events, running or abnormal events in surveillance videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graph structures describing the communications between two entities. The content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. From the data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

The research on link-based structural analysis has long been committed to link prediction, community discovery, social network evolution, and social influence analysis, etc. SNS may be visualized as graphs in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertexes and edges are continually added to the graphs. Link prediction aims to predict the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and utilizes the existing link information to generate binary classifiers to predict future links [139]. Probabilistic methods aim to build models for connection probabilities among vertexes in SNS [140]. Linear algebra methods compute the similarity between two vertexes according to a rank-reduced similarity matrix [141]. A community is represented by a sub-graph in which the edges connecting vertexes within the sub-graph feature high density, while the edges between two sub-graphs feature much lower density [142].
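As a simple illustration of link prediction, the Python sketch below scores every unconnected pair of vertexes by the Jaccard coefficient of their neighbor sets and proposes the highest-scoring pair as the most likely future link. This neighborhood heuristic is a stand-in chosen for brevity; it is not the classifier of [139], although such a score could serve as one input feature for it. The friendship graph is hypothetical.

```python
from itertools import combinations


def jaccard(graph, u, v):
    """Share of neighbors that u and v have in common."""
    union = graph[u] | graph[v]
    return len(graph[u] & graph[v]) / len(union) if union else 0.0


def predict_links(graph, top_k=1):
    """Return the top_k currently unconnected pairs, ranked by Jaccard score."""
    candidates = [
        (u, v) for u, v in combinations(sorted(graph), 2) if v not in graph[u]
    ]
    candidates.sort(key=lambda pair: jaccard(graph, *pair), reverse=True)
    return candidates[:top_k]


# Hypothetical undirected friendship graph: vertex -> set of neighbors.
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "carol", "dave"},
    "carol": {"alice", "bob", "dave"},
    "dave": {"bob", "carol", "erin"},
    "erin": {"dave"},
}
likely = predict_links(friends)
```

In this graph "alice" and "dave" share two of their three distinct neighbors, so that pair is ranked first.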

Many methods for community detection have been proposed and studied, most of which are topology-based and rely on objective functions that capture the concept of community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. The research on SNS evolution aims to look for laws and deduction models to interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generation methods have been proposed to assist network and system design [147].

Social influence refers to the case when individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, time effects, and characteristics of networks and individuals, etc. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise. For example, the blogosphere contains a large number of spam blogs, and Twitter likewise contains many trivial Tweets. Third, SNS are dynamic networks, which vary and are updated frequently and quickly. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge among different media [150].

6.2.6 Mobile data analysis

By April 2013, Android Apps had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since the research on mobile analysis has only just started, we will only introduce some recent and representative analysis applications in this section.

With the growth in the number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities based on geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest Webchat). Traditional network communities or SNS communities lack online interaction among members, and the communities are active only when members are sitting before computers. On the contrary, mobile phones can support rich interaction at any time and anywhere. A mobile community is defined as a group of individuals with the same hobbies (i.e., health, safety, and entertainment, etc.) who gather together on networks, meet to set a common goal, decide on measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, the progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build body area networks for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations, etc. In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop an application for smart phones, which analyzes people's paces when they walk and uses the pace information for unlocking the safety system [11]. Meanwhile, Robert Delano and Brian Parise from the Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with a built-in seismograph in a mobile phone, so as to cope with Parkinson's and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, on marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business modes. On sales planning, after comparison of massive data, enterprises can optimize their commodity prices. On operation, enterprises can improve their operation efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor cost. On supply chain, using big data, enterprises may conduct inventory optimization, logistic optimization, and supplier coordination, etc., to mitigate the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that such activities as “Multi-times score accumulation” and “score exchange in shops” are effective for attracting quality customers. By building a customer drop out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out, so as to retain them. As a result, the drop out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers' transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction time, commodity prices, and purchase quantities are recorded, along with, more importantly, the age, gender, address, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, market conditions of their brands, and consumers' behaviors, etc., and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan of Alibaba automatically analyzes and judges whether to grant loans to enterprises through the acquired enterprise transaction data by virtue of big data technology, while manual intervention does not occur in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan with only about 0.3 % bad loans, which is much lower than those of other commercial banks.

6.3.2 Application of IoT based big data

IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistic enterprises may have had profound experience with the application of IoT big data. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project cooperation between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and Miami city, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills this year by timely identifying and fixing water pipes that were running and leaking.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and connections among individuals based on an information network. Big data of online SNS mainly comes from instant messages, online social networking, micro blogs, and shared space, etc., which represent various user activities. The analysis of big data from online SNS uses computational analytical methods for understanding relations in human society by virtue of theories and methods that involve mathematics, informatics, sociology, and management science, etc., from three dimensions: network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education, etc. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire value.

– Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preference, emotion, interest, and demand, etc. may be revealed.

– Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies, etc. aggregate relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.

The US Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

Fig. 5 Enabling technologies for online social network-oriented big data

In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, it was found that most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and then get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted research that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans. The goal was to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting sharp growth or drops in the amount of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation from the official statistics of Indonesia matches the number of Tweets about rice price on Twitter, as shown in Fig. 6.
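The first aspect above, detecting sharp growth or drops in the volume of a topic, can be sketched in Python as follows. This is an illustrative stand-in, not the Global Pulse method: the trailing-window z-score rule, the window length, the threshold, and the daily counts are all assumptions.

```python
from statistics import mean, stdev


def detect_anomalies(counts, window=5, threshold=3.0):
    """Flag days whose count deviates strongly from the trailing window.

    Returns a list of (day_index, "growth" | "drop") tuples.
    """
    anomalies = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        # Floor the spread at 1.0 so a perfectly flat history cannot divide by zero.
        sigma = max(stdev(history), 1.0)
        z = (counts[i] - mu) / sigma
        if abs(z) > threshold:
            anomalies.append((i, "growth" if z > 0 else "drop"))
    return anomalies


# Hypothetical daily number of messages mentioning one topic.
daily_counts = [100, 104, 98, 101, 99, 103, 240, 101, 97, 102, 100, 99, 20]
events = detect_anomalies(daily_counts)
```

One limitation of such a simple rule is that a spike inflates the spread of the following windows, so a second anomaly shortly after the first can be masked. A robust statistic such as the median absolute deviation would reduce that effect.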

Generally speaking, the application of big data from online SNS may help to better understand users' behavior and master the laws of social and economic activities from the following three aspects:

– Early Warning: to rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotion, and preference of users.

– Real-time Feedback: to acquire groups' feedback on some social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the health care business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final result into a highly personalized treatment plan to assess the risk factors and main treatment plans of patients. Then, doctors may reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or by suggesting that patients reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20.

The Mount Sinai Medical Center in the US utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through the software development kit (SDK) and open interface.

Fig. 6 The correlation between Tweets about rice price and food price inflation

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units to coordinate with mobile networks for the distribution of sensing tasks and the collection and utilization of sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of Crowdsourcing has been successfully applied to geotagged photographs, positioning and navigation, urban road traffic sensing, market forecast, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach for problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, Crowdsourcing had been applied by many companies before the emergence of big data. For example, P & G, BMW, and Audi improved their R & D and design capacities by virtue of Crowdsourcing. The main idea of Crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, Crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, Spatial Crowdsourcing has become a hot topic. The operation framework of Spatial Crowdsourcing is as follows. A user may request the service and resources related to a specified location. Then the mobile users who are willing to participate in the task will move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data will be sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions provided by mobile devices, it can be forecasted that Spatial Crowdsourcing will be more prevailing than traditional Crowdsourcing, e.g., Amazon Turk and Crowdflower.
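The request-and-assign step of this framework can be sketched in Python as follows. The survey does not specify how a worker is selected. This sketch assumes the simplest policy, choosing the nearest willing worker within a maximum travel distance, with great-circle distance computed by the haversine formula. The coordinates, worker records, and the 5 km limit are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def assign_worker(task_location, workers, max_km=5.0):
    """Return the name of the nearest willing worker within max_km, or None."""
    candidates = [
        (haversine_km(task_location, w["location"]), w["name"])
        for w in workers
        if w["willing"]
    ]
    in_range = [c for c in candidates if c[0] <= max_km]
    return min(in_range)[1] if in_range else None


# Hypothetical task location and mobile users.
task = (30.5130, 114.4200)
workers = [
    {"name": "w1", "location": (30.5150, 114.4210), "willing": True},
    {"name": "w2", "location": (30.5131, 114.4201), "willing": False},
    {"name": "w3", "location": (30.6000, 114.3000), "willing": True},
]
chosen = assign_worker(task, workers)
```

Worker "w2" is closest but unwilling and "w3" is beyond the distance limit, so "w1" is selected. A real platform would also weigh factors such as worker reliability and task deadlines.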

6.3.6 Smart grid

Smart Grid is the next generation power grid, constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart Grid related big data are generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, which are measured by phasor measurement units (PMU) deployed nationwide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as Circuit Breaker Monitors and transformers). Smart Grid brings about the following challenges on exploiting big data:

– Grid planning: By analyzing data in the Smart Grid, the regions that have excessively high electrical load or high power outage frequencies can be identified. Even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an “electric map” according to big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of the block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides effective and visual load forecasts for power grid planning in a city. Preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads, as displayed in the map, may be conducted.

– Interaction between power generation and power consumption: An ideal power grid shall balance power generation and consumption. However, the traditional power grid is constructed based on the one-directional approach of transmission-transformation-distribution-consumption, which does not allow the generation capacity to be adjusted according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters are developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which can help the supplier read power utilization data every 15 min rather than every month as in the past. Labor cost for meter reading is greatly reduced. Because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and low periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help the realization of time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users.

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable new energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortage. Such energy resources can complement the traditional hydropower and thermal power generations.
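The time-sharing dynamic pricing mentioned under the interaction between power generation and power consumption can be illustrated with a short Python sketch that bills one day of 15-minute smart meter readings at a higher rate during peak hours. The peak window (17:00 to 21:00) and both prices are hypothetical values chosen for illustration; they are not TXU Energy tariffs.

```python
def time_of_use_bill(readings_kwh, peak_hours=range(17, 21),
                     peak_price=0.30, offpeak_price=0.10):
    """Bill one day of 15-minute readings (96 values, starting at 00:00).

    Energy consumed during peak_hours is charged at peak_price per kWh,
    everything else at offpeak_price per kWh.
    """
    if len(readings_kwh) != 96:
        raise ValueError("expected 96 readings, one per 15-minute interval")
    bill = 0.0
    for slot, kwh in enumerate(readings_kwh):
        hour = slot // 4  # four 15-minute slots per hour
        price = peak_price if hour in peak_hours else offpeak_price
        bill += kwh * price
    return bill


# Two hypothetical households that both use 24 kWh per day.
flat_usage = [0.25] * 96
shifted_usage = [0.0 if 17 <= slot // 4 < 21 else 0.3 for slot in range(96)]

flat_bill = time_of_use_bill(flat_usage)
shifted_bill = time_of_use_bill(shifted_usage)
```

The household that moves its consumption out of the peak window pays less for the same total energy, which is the incentive that lets a supplier smooth peak and low fluctuations.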

7 Conclusion, open issues, and outlook

In this paper, we review the background and state-of-the-art of big data. Firstly, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture to readers of this exciting area.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, which are discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

– Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to assess the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated when the system is implemented and deployed, which makes it impossible to horizontally compare the advantages and disadvantages of various alternative solutions, even before and after the implementation of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes: This includes memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.
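To make one of the computing modes listed above concrete, the following is a minimal single-process sketch of the MR mode, assuming MR refers to MapReduce as elsewhere in this survey. It is an illustration only and not the implementation of Hadoop or any other system; the function names map_phase, shuffle, reduce_phase, and word_count are hypothetical. On a real cluster, the shuffle step is where records move between machines, which is the kind of data transfer identified above as a main bottleneck.

```python
from collections import defaultdict


def map_phase(records):
    """Emit a (word, 1) pair for every word in every input record."""
    for record in records:
        for word in record.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group values by key; on a cluster this step moves data over the network."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(records):
    """Run the three phases in sequence on an in-memory list of strings."""
    return reduce_phase(shuffle(map_phase(records)))


if __name__ == "__main__":
    print(word_count(["big data", "big data analysis"]))
```

Because the map and reduce functions only see one record or one key at a time, they can in principle be distributed across workers, while the grouping step in between determines how much data has to be moved.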

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor that restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.

– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor in improving big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems in big data processing arise from traditional data analysis, including (i) data re-utilization: with the increase of data scale, more value may be mined from the re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized, from which more value can be mined; and (iii) data exhaust, which means wrong data collected during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.
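The real-time item in the list above mentions defining the life cycle of data and computing its rate of depreciation. The text does not prescribe a model, so the sketch below simply assumes exponential decay with a configurable half-life; the function names and the min_weight cut-off are hypothetical and chosen only for illustration.

```python
def freshness_weight(age_seconds, half_life_seconds):
    """Weight in (0, 1] that halves every half_life_seconds (assumed decay model)."""
    if age_seconds < 0 or half_life_seconds <= 0:
        raise ValueError("age must be non-negative and half-life positive")
    return 0.5 ** (age_seconds / half_life_seconds)


def is_expired(age_seconds, half_life_seconds, min_weight=0.01):
    """Treat a record as past its life cycle once its weight drops below min_weight."""
    return freshness_weight(age_seconds, half_life_seconds) < min_weight


def weighted_mean(readings, now, half_life_seconds):
    """Average (timestamp, value) readings, discounting older ones."""
    weights = [freshness_weight(now - t, half_life_seconds) for t, _ in readings]
    total = sum(weights)
    if total == 0:
        raise ValueError("no readings to average")
    return sum(w * v for w, (_, v) in zip(weights, readings)) / total
```

With a 60-second half-life, a reading that is one minute old counts half as much as a fresh one, so an online analysis built on weighted_mean leans toward recent data without discarding older data outright.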

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved.

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big-data-oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from comprehensive utilization of multiple datasets is far higher than the sum of the values of the individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance is the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and from different datasets.

– Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication; big data applications in small and medium-sized businesses; big data applications in government departments; big data services; and human-computer interaction of big data, etc., are all important research problems.
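The integration and provenance item above can be illustrated with a small sketch: records from several sources are merged by identifier, redundant copies are noted, and each field remembers which sources supplied it. This is only one possible design under stated assumptions; the function name integrate, the record layout with an "id" key, and the first-source-wins rule are hypothetical choices, not a method described in this survey.

```python
def integrate(sources):
    """Merge records from several named sources, keyed by record id.

    sources maps a source name to a list of dict records with an "id" key.
    The result maps each id to its merged values, the provenance of every
    field (which sources supplied it), and a list of conflicting values.
    The first source to supply a field wins; later sources that agree are
    added to the provenance, and disagreements are recorded as conflicts.
    """
    merged = {}
    for source_name, records in sources.items():
        for record in records:
            entry = merged.setdefault(
                record["id"], {"values": {}, "provenance": {}, "conflicts": []}
            )
            for field, value in record.items():
                if field == "id":
                    continue
                if field not in entry["values"]:
                    entry["values"][field] = value
                    entry["provenance"][field] = [source_name]
                elif entry["values"][field] == value:
                    # Redundant copy: keep one value, remember both sources.
                    entry["provenance"][field].append(source_name)
                else:
                    entry["conflicts"].append((field, source_name, value))
    return merged
```

Keeping conflicts alongside the merged values, instead of silently overwriting them, leaves room for a later step to resolve disagreements using whatever provenance standard each source follows.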

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume is growing fast, there are more severe safety risks, while traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security-related challenges.

– Big data privacy: Big data privacy includes two aspects: (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc., of users may be more easily acquired, and users may not be aware of it. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed a big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher of Skull Security, used an information acquisition tool to collect data from the public pages of Facebook users who had not modified their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low-quality data wastes transmission and storage resources and has poor usability. Many factors may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though many measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.

– Big data safety mechanism: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data cannot meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safe communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants’ data should be enabled on the premise of efficiency assurance.

– Big data application in information security: Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) after analysis of big data in the form of log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc., may also be more easily identified through the analysis of big data.
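As a toy illustration of the log analysis mentioned in the last item, the sketch below counts alert events per source address and flags the sources that exceed a threshold. The three-field log format, the "ALERT" event label, and the default threshold are all assumptions made for this example; real Intrusion Detection Systems use their own formats, and detecting an APT requires far richer analysis than simple counting.

```python
from collections import Counter


def count_alerts(log_lines):
    """Count alert events per source address.

    Assumes a hypothetical log format: "<timestamp> <source_ip> <event>".
    Lines that do not have exactly three fields are skipped.
    """
    counts = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) != 3:
            continue
        _, source_ip, event = parts
        if event == "ALERT":
            counts[source_ip] += 1
    return counts


def suspicious_sources(log_lines, threshold=3):
    """Return the source addresses with at least `threshold` alerts, sorted."""
    counts = count_alerts(log_lines)
    return sorted(ip for ip, n in counts.items() if n >= threshold)
```

The same counting pattern scales out naturally, since per-source counts computed on separate log shards can be added together before the threshold is applied.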

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage of energy efficiency optimization, and processed hardware and software system architectures, etc. Particularly, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.
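Three of the data quality dimensions named in this subsection (completeness, redundancy, and consistency) can be measured automatically in simple cases. The sketch below is a minimal illustration on in-memory records; the function names, the definition of each ratio, and the idea of passing a consistency rule as a predicate are assumptions for this example, not an evaluation standard proposed in the text. Accuracy is omitted because it needs a trusted reference to compare against.

```python
def completeness(records, fields):
    """Fraction of expected field values that are present (not None or empty)."""
    expected = len(records) * len(fields)
    if expected == 0:
        return 1.0
    present = sum(
        1 for r in records for f in fields if r.get(f) not in (None, "")
    )
    return present / expected


def redundancy(records, fields):
    """Fraction of records that exactly duplicate an earlier record on `fields`."""
    if not records:
        return 0.0
    seen = set()
    duplicates = 0
    for r in records:
        key = tuple(r.get(f) for f in fields)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates / len(records)


def consistency(records, rule):
    """Fraction of records that satisfy a caller-supplied rule (a predicate)."""
    if not records:
        return 1.0
    return sum(1 for r in records if rule(r)) / len(records)
```

For example, a rule such as "age is missing or non-negative" can be passed to consistency as a lambda, and records failing it become candidates for the automatic repair the text calls for.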

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the “T” (Technology) was the main concern, while technology drove the development of data. In the big data era, with the prominence of data value and advances in “I” (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact but also influence everyone’s ways of living and thinking, which is already happening. We cannot predict the future, but we may take precautions for possible future events.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have sought better ways to cope with larger-scale, more diverse, and more complexly structured data. These efforts are represented by Google’s globally distributed database Spanner and the fault-tolerant, expandable, distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, impacts of big data on production management, business operation, and decision making, etc., shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a user-friendly way can they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc., are frequently used to visualize results of data analysis. New presentation forms will occur in the future; e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it is observed that the role of data is becoming increasingly significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed that of business logic. Consequently, programs are gradually being transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods will emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

    – During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

    – Compared with accurate data, we would like to accept numerous and complicated data.

    – We shall pay greater attention to correlations between things rather than exploring causal relationships.

    – The simple algorithms of big data are more effective than complex algorithms of small data.

    – Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace “experts”.

Throughout the history of human society, the demands and willingness of human beings have always been the driving forces of scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao’s research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

              References

1. Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

2. Fact sheet: Big data across the federal government (2012). http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf

3. Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper

4. Drowning in numbers - digital data will flood the planet - and help us understand it better (2011). http://www.economist.com/blogs/dailychart/2011/11/bigdata-0

5. Lohr S (2012) The age of big data. New York Times, pp 11

6. Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/thedigitalbreadcrumbs-that-lead-to-big-data

7. Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-searchforanalysts-to-make-sense-of-big-data

8. Big data (2008). http://www.nature.com/news/specials/bigdata/index.html

9. Special online collection: dealing with big data (2011). http://www.sciencemag.org/site/special/data

10. Manyika J, McKinsey Global Institute, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute

11. Mayer-Schönberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Eamon Dolan/Houghton Mifflin Harcourt

12. Laney D (2001) 3-d data management: controlling data volume, velocity and variety. META Group Research Note, 6 February

13. Zikopoulos P, Eaton C, et al (2011) Understanding big data: analytics for enterprise class hadoop and streaming data. McGraw-Hill Osborne Media

14. Meijer E (2011) The world according to linq. Communications of the ACM 54(10):45–51

15. Beyer M (2011) Gartner says solving big data challenge involves more than just managing volumes of data. Gartner. http://www.gartner.com/it/page.jsp

16. O R Team (2011) Big data now: current perspectives from O’Reilly Radar. O’Reilly Media

17. Grobelnik M (2012) Big data tutorial. http://videolectures.net/eswc2012grobelnikbigdata

18. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2008) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014

19. DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6):85–98

20. Walter T (2009) Teradata past, present, and future. UCI ISG lecture series on scalable data management

21. Ghemawat S, Gobioff H, Leung S-T (2003) The google file system. In: ACM SIGOPS Operating Systems Review, vol 37. ACM, pp 29–43

22. Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

23. Hey AJG, Tansley S, Tolle KM, et al (2009) The fourth paradigm: data-intensive scientific discovery

24. Howard JH, Kazar ML, Menees SG, Nichols DA, Satyanarayanan M, Sidebotham RN, West MJ (1988) Scale and performance in a distributed file system. ACM Trans Comput Syst (TOCS) 6(1):51–81

25. Cattell R (2011) Scalable sql and nosql data stores. ACM SIGMOD Record 39(4):12–27

26. Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033

27. Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98

28. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J, et al (2012) Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States

29. Sun Y, Chen M, Liu B, Mao S (2013) Far: a fault-avoidant routing method for data center networks with regular topology. In: Proceedings of ACM/IEEE symposium on architectures for networking and communications systems (ANCS’13). ACM

30. Wiki (2013) Applications and organizations using hadoop. http://wiki.apache.org/hadoop/PoweredBy

31. Bahga A, Madisetti VK (2012) Analyzing massive machine maintenance data in a computing cloud. IEEE Transac Parallel Distrib Syst 23(10):1831–1843

32. Gunarathne T, Wu T-L, Choi JY, Bae S-H, Qiu J (2011) Cloud computing paradigms for pleasingly parallel biomedical applications. Concurr Comput Prac Experience 23(17):2338–2354

33. Gantz J, Reinsel D (2010) The digital universe decade-are you ready. External publication of IDC (Analyse the Future) information and data, pp 1–16

34. Bryant RE (2011) Data-intensive scalable computing for scientific applications. Comput Sci Eng 13(6):25–33

35. Wahab MHA, Mohd MNH, Hanafi HF, Mohsin MFM (2008) Data pre-processing on web server logs for generalized association rules mining algorithm. World Acad Sci Eng Technol 48:2008

36. Nanopoulos A, Manolopoulos Y, Zakrzewicz M, Morzy T (2002) Indexing web access-logs for pattern queries. In: Proceedings of the 4th international workshop on web information and data management. ACM, pp 63–68

37. Joshi KP, Joshi A, Yesha Y (2003) On using a warehouse to analyze web logs. Distrib Parallel Databases 13(2):161–180

38. Chandramohan V, Christensen K (2002) A first look at wired sensor networks for video surveillance systems. In: Proceedings LCN 2002, 27th annual IEEE conference on local computer networks. IEEE, pp 728–729

39. Selavo L, Wood A, Cao Q, Sookoor T, Liu H, Srinivasan A, Wu Y, Kang W, Stankovic J, Young D, et al (2007) Luster: wireless sensor network for environmental research. In: Proceedings of the 5th international conference on Embedded networked sensor systems. ACM, pp 103–116

40. Barrenetxea G, Ingelrest F, Schaefer G, Vetterli M, Couach O, Parlange M (2008) Sensorscope: out-of-the-box environmental monitoring. In: Information processing in sensor networks, 2008, international conference on IPSN’08. IEEE, pp 332–343

41. Kim Y, Schmid T, Charbiwala ZM, Friedman J, Srivastava MB (2008) Nawms: nonintrusive autonomous water monitoring system. In: Proceedings of the 6th ACM conference on Embedded network sensor systems. ACM, pp 309–322

42. Kim S, Pakzad S, Culler D, Demmel J, Fenves G, Glaser S, Turon M (2007) Health monitoring of civil infrastructures using wireless sensor networks. In: Information Processing in Sensor Networks, 2007, 6th International Symposium on IPSN 2007. IEEE, pp 254–263

43. Ceriotti M, Mottola L, Picco GP, Murphy AL, Guna S, Corra M, Pozzi M, Zonta D, Zanon P (2009) Monitoring heritage buildings with wireless sensor networks: the torre aquila deployment. In: Proceedings of the 2009 International Conference on Information Processing in Sensor Networks. IEEE Computer Society, pp 277–288

44. Tolle G, Polastre J, Szewczyk R, Culler D, Turner N, Tu K, Burgess S, Dawson T, Buonadonna P, Gay D, et al (2005) A macroscope in the redwoods. In: Proceedings of the 3rd international conference on embedded networked sensor systems. ACM, pp 51–63

45. Wang F, Liu J (2011) Networked wireless sensor data collection: issues, challenges, and approaches. IEEE Commun Surv Tutor 13(4):673–687

46. Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the 11th international conference on World Wide Web. ACM, pp 124–135

47. Choudhary S, Dincturk ME, Mirtaheri SM, Moosavi A, von Bochmann G, Jourdan G-V, Onut I-V (2012) Crawling rich internet applications: the state of the art. In: CASCON, pp 146–160

48. Ghani N, Dixit S, Wang T-S (2000) On ip-over-wdm integration. IEEE Commun Mag 38(3):72–84

49. Manchester J, Anderson J, Doshi B, Dravida S (1998) Ip over sonet. IEEE Commun Mag 36(5):136–142

50. Jinno M, Takara H, Kozicki B (2009) Dynamic optical mesh networks: drivers, challenges and solutions for the future. In: Optical communication, 2009, 35th European conference on ECOC’09. IEEE, pp 1–4

51. Barroso LA, Hölzle U (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synt Lect Comput Archit 4(1):1–108

52. Armstrong J (2009) Ofdm for optical communications. J Light Technol 27(3):189–204

53. Shieh W (2011) Ofdm for flexible high-speed optical networks. J Light Technol 29(10):1560–1577

54. Cisco data center interconnect design and deployment guide (2010)

55. Greenberg A, Hamilton JR, Jain N, Kandula S, Kim C, Lahiri P, Maltz DA, Patel P, Sengupta S (2009) Vl2: a scalable and flexible data center network. In: ACM SIGCOMM computer communication review, vol 39. ACM, pp 51–62

56. Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S (2009) Bcube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 39(4):63–74

57. Farrington N, Porter G, Radhakrishnan S, Bazzaz HH, Subramanya V, Fainman Y, Papen G, Vahdat A (2011) Helios: a hybrid electrical/optical switch architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 41(4):339–350

58. Abu-Libdeh H, Costa P, Rowstron A, O’Shea G, Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Comput Commun Rev 40(4):51–62

59. Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what’s needed for datacenter network operations. IEEE Commun Mag 48(7):32–39

60. Wang G, Andersen DG, Kaminsky M, Papagiannaki K, Ng TS, Kozuch M, Ryan M (2010) c-through: part-time optics in data centers. In: ACM SIGCOMM Computer Communication Review, vol 40. ACM, pp 327–338

61. Ye X, Yin Y, Yoo SJB, Mejia P, Proietti R, Akella V (2010) Dos: a scalable optical switch for datacenters. In: Proceedings of the 6th ACM/IEEE symposium on architectures for networking and communications systems. ACM, p 24

62. Singla A, Singh A, Ramachandran K, Xu L, Zhang Y (2010) Proteus: a topology malleable data center network. In: Proceedings of the 9th ACM SIGCOMM workshop on hot topics in networks. ACM, p 8

63. Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P (2011) Energy-efficient design of a scalable optical multiplane interconnection architecture. IEEE J Sel Top Quantum Electron 17(2):377–383

64. Kodi AK, Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance computing (hpc) systems. IEEE J Sel Top Quantum Electron 17(2):384–395

65. Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H (2012) Mirror mirror on the ceiling: flexible wireless links for data centers. ACM SIGCOMM Comput Commun Rev 42(4):443–454

66. Lenzerini M (2002) Data integration: a theoretical perspective. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. ACM, pp 233–246

67. Cafarella MJ, Halevy A, Khoussainova N (2009) Data integration for the relational web. Proc VLDB Endowment 2(1):1090–1101

68. Maletic JI, Marcus A (2000) Data cleansing: beyond integrity analysis. In: IQ. Citeseer, pp 200–209

69. Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1-2):83–113

70. Chen H, Ku W-S, Wang H, Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 51–62

71. Zhao Z, Ng W (2012) A model-based approach for rfid data stream cleansing. In: Proceedings of the 21st ACM international conference on information and knowledge management. ACM, pp 862–871

72. Khoussainova N, Balazinska M, Suciu D (2008) Probabilistic event extraction from rfid data. In: Data Engineering, 2008, IEEE 24th international conference on ICDE 2008. IEEE, pp 1480–1482

73. Herbert KG, Wang JTL (2007) Biological data cleaning: a case study. Int J Inf Qual 1(1):60–82

74. Tsai T-H, Lin C-Y (2012) Exploring contextual redundancy in improving object-based video coding for video sensor networks surveillance. IEEE Transac Multmed 14(3):669–682

75. Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 269–278

76. Kamath U, Compton J, Dogan RI, Jong KD, Shehu A (2012) An evolutionary algorithm approach for feature generation from sequence data and its application to dna splice site prediction. IEEE/ACM Transac Comput Biol Bioinforma (TCBB) 9(5):1387–1398

77. Leung K-S, Lee KH, Wang J-F, Ng EYT, Chan HLY, Tsui SKW, Mok TSK, Tse PC-H, Sung JJ-Y (2011) Data mining on dna sequences of hepatitis b virus. IEEE/ACM Transac Comput Biol Bioinforma 8(2):428–440

78. Huang Z, Shen H, Liu J, Zhou X (2011) Effective data co-reduction for multimedia similarity search. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, pp 1021–1032

79. Bleiholder J, Naumann F (2008) Data fusion. ACM Comput Surv (CSUR) 41(1):1

80. Brewer EA (2000) Towards robust distributed systems. In: PODC, p 7

81. Gilbert S, Lynch N (2002) Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59

82. McKusick MK, Quinlan S (2009) Gfs: evolution on fast-forward. ACM Queue 7(7):10

83. Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J (2008) Scope: easy and efficient parallel processing of massive data sets. Proc VLDB Endowment 1(2):1265–1276

84. Beaver D, Kumar S, Li HC, Sobel J, Vajgel P, et al (2010) Finding a needle in haystack: facebook’s photo storage. In: OSDI, vol 10, pp 1–8

85. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: amazon’s highly available key-value store. In: SOSP, vol 7, pp 205–220

86. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In: Proceedings of the twenty-ninth annual ACM symposium on theory of computing. ACM, pp 654–663

87. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2):4

88. Burrows M (2006) The chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th symposium on Operating systems design and implementation. USENIX Association, pp 335–350

89. Lakshman A, Malik P (2009) Cassandra: structured storage system on a p2p network. In: Proceedings of the 28th ACM symposium on principles of distributed computing. ACM, pp 5–5

90. George L (2011) HBase: the definitive guide. O’Reilly Media Inc

91. Judd D (2008) hypertable-0.9.0.4-alpha

92. Chodorow K (2013) MongoDB: the definitive guide. O’Reilly Media Inc

93. Crockford D (2006) The application/json media type for javascript object notation (json)

94. Murty J (2009) Programming amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O’Reilly Media Inc

95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O’Reilly Media Inc

96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 975–986

97. Yang H-C, Parker DS (2009) Traverse: simplified indexing on large map-reduce-merge clusters. In: Database systems for advanced applications. Springer, pp 308–322

98. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with sawzall. Sci Program 13(4):277–298

99. Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U (2009) Building a high-level dataflow system on top of map-reduce: the pig experience. Proceedings VLDB Endowment 2(2):1414–1425

100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629

101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72

102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) Dryadlinq: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14

103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: Parallel and distributed processing, 2008. IEEE international symposium on IPDPS 2008. IEEE, pp 1–11

104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146

105. Bu Y, Bill H, Balazinska M, Ernst MD (2010) Haloop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296

106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative mapreduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing. ACM, pp 810–818

107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2

108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: mapreduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing. ACM, p 7

109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) Ciel: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on Networked systems design and implementation, p 9

110. Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York

111. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

112. What analytics, data mining, big data software you used in the past 12 months for a real project? (2012) http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html

113. Berthold MR, Cebron N, Dill F, Gabriel TR, Kotter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer

114. Sallam RL, Richardson J, Hagerty J, Hostmann B (2011) Magic quadrant for business intelligence platforms. CT, Gartner Group, Stamford

115. Beyond the PC. Special Report on Personal Technology (2011)

116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A et al (2011) The iplant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on Software quality assurance. ACM, pp 70–77

118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers. ACM, pp 277–286

119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Record 34(2):18–26

120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM Sigmod Record 33(1):50–57

121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Transac Manag Inform Syst (TMIS) 3(2):7

122. Manning CD, Schutze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press

123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Transac Neural Netw 13(5):1163–1177

124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11

125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117

126. Konopnicki D, Shmueli O (1995) W3qs: a query system for the world-wide web. In: VLDB, vol 95, pp 54–65

127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640

128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval. ACM, p 2

129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25

130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19

131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819

132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939

133. Barragans-Martinez AB, Costa-Montenegro E, Burguillo JC, Rey-Lopez M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend tv programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311

134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91

135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia. ACM, pp 469–478

136. Hirsch JE (2005) An index to quantify an individual's scientific research output. Proc Natl Acad Sci USA 102(46):16569

137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company

138. Aggarwal CC (2011) An introduction to social network data analytics. Springer

139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1046–1054

140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied Computing. ACM, pp 1115–1116


141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Transac Knowl Discov Data (TKDD) 5(2):10

142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on World wide web. ACM, pp 631–640

143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis. ACM, pp 16–25

144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. ACM, pp 315–321

145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 145–158

146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 131–144

147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1007–1016

148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 807–816

149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on Web search and data mining. ACM, pp 657–666

150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360

151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (Last accessed 5 May 2013)

152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51

153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive rfid, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 2

154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval. ACM, pp 469–478

155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated, multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium. ACM, pp 445–454

156. Tasevski P (2011) Password attacks and generation strategies. Tartu University: Faculty of Mathematics and Computer Sciences



2.3 Data center

In the big data paradigm, the data center is not only a platform for concentrated storage of data, but also undertakes more responsibilities, such as acquiring data, managing data, organizing data, and leveraging the data values and functions. Data centers mainly concern "data" rather than "center." A data center holds masses of data and organizes and manages them according to its core objective and development path, which is more valuable than owning a good site and resources. The emergence of big data brings about sound development opportunities and great challenges to data centers. Big data is an emerging paradigm, which will promote the explosive growth of the infrastructure and related software of data centers. The physical data center network is the core for supporting big data, but, at present, it is also the key infrastructure that is most urgently required [29].

– Big data requires data centers to provide powerful backstage support. The big data paradigm has more stringent requirements on storage capacity and processing capacity, as well as network transmission capacity. Enterprises must take the development of data centers into consideration to improve the capacity for processing big data rapidly and effectively under a limited price/performance ratio. The data center shall provide the infrastructure with a large number of nodes, build a high-speed internal network, effectively dissipate heat, and effectively back up data. Only when a highly energy-efficient, stable, safe, expandable, and redundant data center is built can the normal operation of big data applications be ensured.

– The growth of big data applications accelerates the revolution and innovation of data centers. Many big data applications have developed their unique architectures and directly promote the development of storage, network, and computing technologies related to data centers. With the continued growth of the volumes of structured and unstructured data, and the variety of sources of analytical data, the data processing and computing capacities of the data center shall be greatly enhanced. In addition, as the scale of data centers is increasingly expanding, how to reduce the operational cost is also an important issue for the development of data centers.

– Big data endows more functions to the data center. In the big data paradigm, the data center shall not only concern itself with hardware facilities, but also strengthen soft capacities, i.e., the capacities of acquisition, processing, organization, analysis, and application of big data. The data center may help business personnel analyze the existing data, discover problems in business operation, and develop solutions from big data.

2.4 Relationship between Hadoop and big data

Presently, Hadoop is widely used in big data applications in the industry, e.g., spam filtering, network searching, clickstream analysis, and social recommendation. In addition, considerable academic research is now based on Hadoop. Some representative cases are given below. As declared in June 2012, Yahoo runs Hadoop on 42,000 servers at four data centers to support its products and services, e.g., searching and spam filtering. At present, the biggest Hadoop cluster has 4,000 nodes, but the number of nodes will be increased to 10,000 with the release of Hadoop 2.0. In the same month, Facebook announced that their Hadoop cluster can process 100 PB of data, which grew by 0.5 PB per day as of November 2012. Some well-known agencies that use Hadoop to conduct distributed computation are listed in [30]. In addition, many companies provide commercial Hadoop execution and/or support, including Cloudera, IBM, MapR, EMC, and Oracle.

Among modern industrial machinery and systems, sensors are widely deployed to collect information for environment monitoring, failure forecasting, etc. Bahga and others in [31] proposed a framework for data organization and cloud computing infrastructure, termed CloudView. CloudView uses mixed architectures, local nodes, and remote clusters based on Hadoop to analyze machine-generated data. Local nodes are used for the forecast of real-time failures; clusters based on Hadoop are used for complex offline analysis, e.g., case-driven data analysis.

The exponential growth of genome data and the sharp drop in sequencing cost are transforming bio-science and bio-medicine into data-driven sciences. Gunarathne et al. in [32] utilized cloud computing infrastructures, Amazon AWS and Microsoft Azure, and data processing frameworks based on MapReduce, Hadoop and Microsoft DryadLINQ, to run two parallel bio-medicine applications: (i) assembly of genome segments; (ii) dimension reduction in the analysis of chemical structure. In the latter application, the 166-D datasets used include 26,000,000 data points. The authors compared the performance of all the frameworks in terms of efficiency, cost, and availability. According to the study, the authors concluded that loose coupling will be increasingly applied to research on electron cloud, and that the parallel programming technology (MapReduce) framework may provide the user an interface with more convenient services and reduce unnecessary costs.
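To make the MapReduce programming model mentioned in this section more concrete, the following is a minimal single-process sketch in plain Python. It is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the toy clickstream records are assumptions introduced only for illustration. The sketch counts page visits, a simplified form of the clickstream analysis cited above.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map step: emit one (page, 1) pair per click record (hypothetical schema)."""
    _user, page = record
    yield page, 1


def shuffle(pairs):
    """Group intermediate values by key, as a framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: sum the counts collected for one key."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce sequentially in one process."""
    intermediate = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())


# Toy clickstream: (user, page) pairs, invented for this example.
clicks = [("u1", "/home"), ("u2", "/home"), ("u1", "/cart"), ("u3", "/home")]
page_counts = run_job(clicks)
print(page_counts)
```

In a real cluster the map and reduce functions would run in parallel on different nodes and the shuffle would move data across the network; the user-facing contract (two small functions over key-value pairs) is what the sketch is meant to show.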

                3 Big data generation and acquisition

We have introduced several key technologies related to big data, i.e., cloud computing, IoT, data center, and Hadoop. Next, we will focus on the value chain of big data, which can be generally divided into four phases: data generation, data acquisition, data storage, and data analysis. If we take data as a raw material, data generation and data acquisition are an exploitation process, data storage is a storage process, and data analysis is a production process that utilizes the raw material to create new value.

3.1 Data generation

Data generation is the first step of big data. Taking Internet data as an example, a huge amount of data in terms of search entries, Internet forum posts, chat records, and microblog messages is generated. Those data are closely related to people's daily life, and share the features of high value and low density. Such Internet data may be valueless individually, but, through the exploitation of accumulated big data, useful information such as habits and hobbies of users can be identified, and it is even possible to forecast users' behaviors and emotional moods.

Moreover, generated through longitudinal and/or distributed data sources, datasets are larger in scale, highly diverse, and complex. Such data sources include sensors, videos, clickstreams, and/or all other available data sources. At present, the main sources of big data are the operation and trading information in enterprises, logistic and sensing information in the IoT, human interaction information and position information in the Internet world, and data generated in scientific research, etc. The information far surpasses the capacities of the IT architectures and infrastructures of existing enterprises, while its real-time requirement also greatly stresses the existing computing capacity.

3.1.1 Enterprise data

In 2013, IBM issued Analysis: the Applications of Big Data to the Real World, which indicates that the internal data of enterprises are the main sources of big data. The internal data of enterprises mainly consist of online trading data and online analysis data, most of which are historically static data and are managed by RDBMSs in a structured manner. In addition, production data, inventory data, sales data, and financial data, etc., also constitute enterprise internal data, which aim to capture informationized and data-driven activities in enterprises, so as to record all activities of enterprises in the form of internal data.

Over the past decades, IT and digital data have contributed a lot to improving the profitability of business departments. It is estimated that the business data volume of all companies in the world may double every 1.2 years [10], in which the business turnover through the Internet, enterprises to enterprises, and enterprises to consumers, per day will reach USD 450 billion [33]. The continuously increasing business data volume requires more effective real-time analysis so as to fully harvest its potential. For example, Amazon processes millions of terminal operations and more than 500,000 queries from third-party sellers per day [12]. Walmart processes one million customer trades per hour, and such trading data are imported into a database with a capacity of over 2.5PB [3]. Akamai analyzes 75 million events per day for its target advertisements [13].

3.1.2 IoT data

As discussed, IoT is an important source of big data. Among smart cities constructed based on IoT, big data may come from industry, agriculture, traffic, transportation, medical care, public departments, and families, etc.

According to the processes of data acquisition and transmission in IoT, its network architecture may be divided into three layers: the sensing layer, the network layer, and the application layer. The sensing layer is responsible for data acquisition and mainly consists of sensor networks. The network layer is responsible for information transmission and processing, where close transmission may rely on sensor networks, and remote transmission shall depend on the Internet. Finally, the application layer supports specific applications of IoT.

According to the characteristics of the Internet of Things, the data generated from IoT has the following features:

– Large-scale data: in IoT, masses of data acquisition equipment are distributedly deployed, which may acquire simple numeric data, e.g., location, or complex multimedia data, e.g., surveillance video. In order to meet the demands of analysis and processing, not only the currently acquired data, but also the historical data within a certain time frame should be stored. Therefore, data generated by IoT are characterized by large scales.

– Heterogeneity: because of the variety of data acquisition devices, the acquired data are also different, and such data feature heterogeneity.

– Strong time and space correlation: in IoT, every data acquisition device is placed at a specific geographic location and every piece of data has a time stamp. The time and space correlation is an important property of data from IoT. During data analysis and processing, time and space are also important dimensions for statistical analysis.

– Effective data accounts for only a small portion of the big data: a great quantity of noise may occur during the acquisition and transmission of data in IoT. Among datasets acquired by acquisition devices, only a small amount of abnormal data is valuable. For example, during the acquisition of traffic video, the few video frames that capture violations of traffic regulations and traffic accidents are more valuable than those only capturing the normal flow of traffic.
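The last two features (every reading carries a time stamp and a location, and only a small abnormal fraction is valuable) can be sketched as follows. This is a minimal illustration under stated assumptions: the `Reading` record, its field names, and the z-score threshold are hypothetical choices made for this example, not something prescribed by the survey or by any IoT platform.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class Reading:
    """One IoT sample tagged with time and space (hypothetical schema)."""
    device_id: str
    timestamp: float  # seconds since epoch
    lat: float
    lon: float
    value: float


def abnormal_readings(readings, threshold=2.0):
    """Keep readings whose value deviates from the mean by more than
    `threshold` population standard deviations (a simple z-score filter)."""
    values = [r.value for r in readings]
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [r for r in readings if abs(r.value - mu) / sigma > threshold]


# Nine routine samples and one spike from the same device and location.
readings = [Reading("s1", 1000.0 + i, 30.5, 114.3, 20.0) for i in range(9)]
readings.append(Reading("s1", 1009.0, 30.5, 114.3, 80.0))
outliers = abnormal_readings(readings)
print(outliers)
```

Because each retained record still carries its time stamp and coordinates, later statistical analysis can group the rare abnormal readings by time window or region, as the list above suggests.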

3.1.3 Bio-medical data

As a series of high-throughput bio-measurement technologies were innovatively developed at the beginning of the 21st century, frontier research in the bio-medicine field has also entered the era of big data. By constructing smart, efficient, and accurate analytical models and theoretical systems for bio-medicine applications, the essential governing mechanisms behind complex biological phenomena may be revealed. Not only can the future development of bio-medicine be determined, but also leading roles can be assumed in the development of a series of important strategic industries related to the national economy, people's livelihood, and national security, with important applications such as medical care, new drug R&D, and grain production (e.g., transgenic crops).

The completion of the HGP (Human Genome Project) and the continued development of sequencing technology also lead to widespread applications of big data in the field. The masses of data generated by gene sequencing go through specialized analysis according to different application demands, to combine them with clinical gene diagnosis and provide valuable information for early diagnosis and personalized treatment of disease. One sequencing of human gene may generate 100 to 600GB of raw data. In the China National Genebank in Shenzhen, there are 1.3 million samples, including 1.15 million human samples and 150,000 animal, plant, and microorganism samples. By the end of 2013, 10 million traceable biological samples will be stored, and by the end of 2015, this figure will reach 30 million. It is predictable that, with the development of bio-medicine technologies, gene sequencing will become faster and more convenient, thus making the big data of bio-medicine continuously grow beyond all doubt.

In addition, data generated from clinical medical care and medical R&D also rise quickly. For example, the University of Pittsburgh Medical Center (UPMC) has stored 2TB of such data. Explorys, an American company, provides platforms to collocate clinical data, operation and maintenance data, and financial data. At present, about 13 million people's information has been collocated, with 44 articles of data at the scale of about 60TB, which will reach 70TB in 2013. Practice Fusion, another American company, manages electronic medical records of about 200,000 patients.

Apart from such small and medium-sized enterprises, other well-known IT companies, such as Google, Microsoft, and IBM, have invested extensively in the research and computational analysis of methods related to high-throughput biological big data, for shares in the huge market known as the "Next Internet." IBM forecast at the 2013 Strategy Conference that, with the sharp increase of medical images and electronic medical records, medical professionals may utilize big data to extract useful clinical information from masses of data to obtain a medical history and forecast treatment effects, thus improving patient care and reducing cost. It is anticipated that, by 2015, the average data volume of every hospital will increase from 167TB to 665TB.

3.1.4 Data generation from other fields

As scientific applications are increasing, the scale of datasets is gradually expanding, and the development of some disciplines greatly relies on the analysis of masses of data. Here, we examine several such applications. Although being in different scientific fields, the applications have similar and increasing demand on data analysis. The first example is related to computational biology. GenBank is a nucleotide sequence database maintained by the US National Bio-Technology Innovation Center. Data in this database may double every 10 months. By August 2009, GenBank had more than 250 billion bases from 150,000 different organisms [34]. The second example is related to astronomy. The Sloan Digital Sky Survey (SDSS), the biggest sky survey project in astronomy, recorded 25TB of data from 1998 to 2008. As the resolution of the telescope is improved, by 2004, the data volume generated per night will surpass 20TB. The last application is related to high-energy physics. In the beginning of 2008, the Atlas experiment of the Large Hadron Collider (LHC) of the European Organization for Nuclear Research generates raw data at 2PB/s and stores about 10TB of processed data per year.

In addition, pervasive sensing and computing among nature, commercial, Internet, government, and social environments are generating heterogeneous data with unprecedented complexity. These datasets have their unique data characteristics in scale, time dimension, and data category. For example, mobile data were recorded with respect to positions, movement, approximation degrees, communications, multimedia, use of applications, and audio environment [108]. According to the application environment and requirements, such datasets can be classified into different categories, so as to select the proper and feasible solutions for big data.

3.2 Big data acquisition

As the second phase of the big data system, big data acquisition includes data collection, data transmission, and data pre-processing. During big data acquisition, once we collect the raw data, we shall utilize an efficient transmission mechanism to send it to a proper storage management system to support different analytical applications. The collected datasets may sometimes include much redundant or useless data, which unnecessarily increases storage space and affects the subsequent data analysis. For example, high redundancy is very common among datasets collected by sensors for environment monitoring. Data compression technology can be applied to reduce the redundancy. Therefore, data pre-processing operations are indispensable to ensure efficient data storage and exploitation.
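As a small illustration of how compression reduces redundancy in sensor data, the sketch below packs a highly repetitive series of temperature readings into bytes and compresses it losslessly with Python's standard `zlib` module. The sample values, the 32-bit float encoding, and the helper names are assumptions chosen for this example; a real deployment might prefer delta encoding or a domain-specific codec.

```python
import struct
import zlib


def pack_readings(values):
    """Serialize readings as little-endian 32-bit floats."""
    return struct.pack(f"<{len(values)}f", *values)


def compress_readings(values, level=6):
    """Return (raw_bytes, compressed_bytes) for a list of readings."""
    raw = pack_readings(values)
    return raw, zlib.compress(raw, level)


def decompress_readings(blob, count):
    """Invert compress_readings; `count` is the number of readings stored."""
    return list(struct.unpack(f"<{count}f", zlib.decompress(blob)))


# A slowly changing environment produces long runs of identical samples.
samples = [21.5] * 500 + [21.75] * 500
raw, packed = compress_readings(samples)
restored = decompress_readings(packed, len(samples))

print(len(raw), "bytes raw,", len(packed), "bytes compressed")
```

The compression is lossless, so `restored` equals `samples`; the size reduction comes entirely from the repetition in the input, which is exactly the redundancy described above.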

3.2.1 Data collection

                Data collection is to utilize special data collection tech-niques to acquire raw data from a specific data generationenvironment Four common data collection methods areshown as follows

                – Log files: As one widely used data collection method, log files are record files automatically generated by the data source system, so as to record activities in designated file formats for subsequent analysis. Log files are used in nearly all digital devices. For example, web servers record in log files the number of clicks, click rates, visits, and other property records of web users [35]. To capture the activities of users at web sites, web servers mainly use the following three log file formats: common log file format (NCSA), extended log format (W3C), and IIS log format (Microsoft). All three types of log files are in ASCII text format. Databases, rather than text files, may sometimes be used to store log information to improve the query efficiency of the massive log store [36, 37]. There are also some other forms of log-based data collection, including stock indicators in financial applications and determination of operating states in network monitoring and traffic management.

                – Sensing: Sensors are common in daily life to measure physical quantities and transform them into readable digital signals for subsequent processing (and storage). Sensory data may be classified as sound wave, voice, vibration, automobile, chemical, current, weather, pressure, temperature, etc. Sensed information is transferred to a data collection point through wired or wireless networks. For applications that may be easily deployed and managed, e.g., a video surveillance system [38], the wired sensor network is a convenient solution to acquire related information. Sometimes the accurate position of a specific phenomenon is unknown, and sometimes the monitored environment does not have the energy or communication infrastructures. Then wireless communication must be used to enable data transmission among sensor nodes under limited energy and communication capability. In recent years, wireless sensor networks (WSNs) have received considerable interest and have been applied to many applications, such as environmental research [39, 40], water quality monitoring [41], civil engineering [42, 43], and wildlife habitat monitoring [44]. A WSN generally consists of a large number of geographically distributed sensor nodes, each being a micro device powered by battery. Such sensors are deployed at designated positions as required by the application to collect remote sensing data. Once the sensors are deployed, the base station will send control information for network configuration/management or data collection to sensor nodes. Based on such control information, the sensory data is assembled in different sensor nodes and sent back to the base station for further processing. Interested readers are referred to [45] for more detailed discussions.

                – Methods for acquiring network data: At present, network data acquisition is accomplished using a combination of web crawler, word segmentation system, task system, index system, etc. A web crawler is a program used by search engines for downloading and storing web pages [46]. Generally speaking, a web crawler starts from the uniform resource locator (URL) of an initial web page to access other linked web pages, during which it stores and sequences all the retrieved URLs. The web crawler acquires a URL in order of precedence through a URL queue, then downloads the web page, identifies all URLs in the downloaded page, and extracts new URLs to be put in the queue. This process is repeated until the web crawler is stopped. Data acquisition through a web crawler is widely applied in applications based on web pages, such as search engines or web caching. Traditional web page extraction technologies feature multiple efficient solutions, and considerable research has been done in this field. As more advanced web page applications are emerging, some extraction strategies are proposed in [47] to cope with rich Internet applications.
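
                The crawl loop described above (take a URL from the queue, download the page, extract its links, enqueue the new ones) can be sketched in a few lines. This is a minimal illustration, not a production crawler: the names `crawl`, `LinkExtractor`, and `FAKE_WEB` are hypothetical, a first-in-first-out queue is assumed as the "order of precedence", and an in-memory dictionary stands in for real HTTP downloads.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags found in one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch, max_pages=100):
    """Breadth-first crawl; returns {url: html} for every page stored.

    `fetch` is any callable mapping a URL to its HTML, or None on failure.
    """
    queue = deque([seed_url])   # URL queue, served first-in-first-out (assumption)
    seen = {seed_url}           # URLs already queued, to avoid revisiting
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:        # page could not be downloaded
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)   # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# A tiny in-memory "web" (hypothetical) replaces real network access.
FAKE_WEB = {
    "http://example.org/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.org/a": '<a href="/b">B</a>',
    "http://example.org/b": '<a href="/">home</a> <a href="/missing">x</a>',
}

pages = crawl("http://example.org/", FAKE_WEB.get)
```

                The `seen` set is what keeps the loop from cycling on pages that link back to each other; the `max_pages` bound plays the role of stopping the crawler.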

                The current network data acquisition technologies mainly include traditional Libpcap-based packet capture technology, zero-copy packet capture technology, as well as some specialized network monitoring software such as Wireshark, SmartSniff, and WinNetCap.

                – Libpcap-based packet capture technology: Libpcap (packet capture library) is a widely used network data packet capture function library. It is a general tool that does not depend on any specific system and is mainly used to capture data in the data link layer. It features simplicity, ease of use, and portability, but has a relatively low efficiency. Therefore, under a high-speed network environment, considerable packet losses may occur when Libpcap is used.


                – Zero-copy packet capture technology: The so-called zero-copy (ZC) means that no copies between any internal memories occur during packet receiving and sending at a node. In sending, data packets start directly from the user buffer of applications, pass through the network interfaces, and arrive at an external network. In receiving, the network interfaces send data packets directly to the user buffer. The basic idea of zero-copy is to reduce the number of data copies, reduce system calls, and reduce CPU load while datagrams are passed from network equipment to user program space. The zero-copy technology first utilizes direct memory access (DMA) to transmit network datagrams directly to an address space pre-allocated by the system kernel, so as to avoid the participation of the CPU. Meanwhile, it maps the memory holding the datagrams in the system kernel to that of the detection program, or builds a cache region in the user space and maps it to the kernel space. The detection program then accesses this memory directly, which reduces memory copies from system kernel to user space and the number of system calls.

                – Mobile equipment: At present, mobile devices are more widely used. As mobile device functions become increasingly powerful, they feature more complex and multiple means of data acquisition as well as a greater variety of data. Mobile devices may acquire geographical location information through positioning systems; acquire audio information through microphones; acquire pictures, videos, streetscapes, two-dimensional barcodes, and other multimedia information through cameras; and acquire user gestures and other body language information through touch screens and gravity sensors. Over the years, wireless operators have improved the service level of the mobile Internet by acquiring and analyzing such information. For example, iPhone itself is a "mobile spy." It may collect wireless data and geographical location information and then send such information back to Apple Inc. for processing, of which the user is not aware. Apart from Apple, smartphone operating systems such as Android of Google and Windows Phone of Microsoft can also collect information in a similar manner.

                In addition to the aforementioned three data acquisition methods of main data sources, there are many other data collection methods or systems. For example, in scientific experiments, many special tools can be used to collect experimental data, such as magnetic spectrometers and radio telescopes. We may classify data collection methods from different perspectives. From the perspective of data sources, data collection methods can be classified into two categories: collection methods recording through data sources and collection methods recording through other auxiliary tools.
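
                As an illustration of the log-file method discussed earlier in this subsection, the following minimal sketch parses lines in the NCSA common log format and counts responses by status code. The field layout (host, identity, user, timestamp, request line, status, size) follows that format; the helper name `parse_clf_line` and the sample lines are hypothetical.

```python
import re
from collections import Counter

# host ident user [timestamp] "request" status size
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf_line(line):
    """Return a dict of fields for one log line, or None if it is malformed."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" means that no response body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


# Hypothetical sample data; the third line is deliberately malformed.
SAMPLE_LOG = [
    '192.0.2.1 - alice [10/Oct/2013:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.7 - - [10/Oct/2013:13:56:01 -0700] "GET /missing HTTP/1.0" 404 -',
    'not a log line',
]

records = [r for r in map(parse_clf_line, SAMPLE_LOG) if r is not None]
status_counts = Counter(r["status"] for r in records)
```

                Malformed lines are dropped rather than raising an error, which is one simple policy among several; a real pipeline might instead route them to a separate file for inspection.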

                3.2.2 Data transportation

                Upon the completion of raw data collection, data will be transferred to a data storage infrastructure for processing and analysis. As discussed in Section 2.3, big data is mainly stored in a data center. The data layout should be adjusted to improve computing efficiency or facilitate hardware maintenance. In other words, internal data transmission may occur in the data center. Therefore, data transmission consists of two phases: Inter-DCN transmissions and Intra-DCN transmissions.

                – Inter-DCN transmissions: Inter-DCN transmissions are from data source to data center, which is generally achieved with the existing physical network infrastructure. Because of the rapid growth of traffic demands, the physical network infrastructure in most regions around the world is constituted by high-volume, high-rate, and cost-effective optical fiber transmission systems. Over the past 20 years, advanced management equipment and technologies have been developed, such as IP-based wavelength division multiplexing (WDM) network architecture, to conduct smart control and management of optical fiber networks [48, 49]. WDM is a technology that multiplexes multiple optical carrier signals with different wavelengths and couples them to the same optical fiber of the optical link. In such technology, lasers with different wavelengths carry different signals. So far, the backbone network has been deployed with WDM optical transmission systems with a single channel rate of 40Gb/s. At present, 100Gb/s commercial interfaces are available, and 100Gb/s systems (or Tb/s systems) will be available in the near future [50]. However, traditional optical transmission technologies are limited by the bandwidth of the electronic bottleneck [51]. Recently, orthogonal frequency-division multiplexing (OFDM), initially designed for wireless systems, has been regarded as one of the main candidate technologies for future high-speed optical transmission. OFDM is a multi-carrier parallel transmission technology. It segments a high-speed data flow into low-speed sub-data-flows that are transmitted over multiple orthogonal sub-carriers [52]. Compared with the fixed channel spacing of WDM, OFDM allows sub-channel frequency spectra to overlap with each other [53]. Therefore, it is a flexible and efficient optical networking technology.

                – Intra-DCN transmissions: Intra-DCN transmissions are the data communication flows within data centers. Intra-DCN transmissions depend on the communication mechanism within the data center (i.e., on physical connection plates, chips, internal memories of data servers, network architectures of data centers, and communication protocols). A data center consists of multiple integrated server racks interconnected by its internal connection networks. Nowadays, the internal connection networks of most data centers are fat-tree, two-layer, or three-layer structures based on multi-commodity network flows [51, 54]. In the two-layer topological structure, the racks are connected by 1Gbps top-of-rack (TOR) switches, and such TOR switches are then connected with 10Gbps aggregation switches. The three-layer topological structure adds one layer on top of the two-layer structure; this layer is constituted by 10Gbps or 100Gbps core switches that connect the aggregation switches. There are also other topological structures which aim to improve data center networks [55–58]. Because of the inadequacy of electronic packet switches, it is difficult to increase communication bandwidths while keeping energy consumption low. Over the years, due to the huge success achieved by optical technologies, the optical interconnection among the networks in data centers has drawn great interest. Optical interconnection is a high-throughput, low-delay, and low-energy-consumption solution. At present, optical technologies are only used for point-to-point links in data centers. Such optical links provide connection for the switches using low-cost multi-mode fiber (MMF) with a 10Gbps data rate. Optical interconnection (switching in the optical domain) of networks in data centers is a feasible solution, which can provide Tbps-level transmission bandwidth with low energy consumption. Recently, many optical interconnection plans have been proposed for data center networks [59]. Some plans add optical paths to upgrade the existing networks, and other plans completely replace the current switches [59–64]. As a strengthening technology, Zhou et al. in [65] adopt wireless links in the 60GHz frequency band to strengthen wired links. Network virtualization should also be considered to improve the efficiency and utilization of data center networks.

                3.2.3 Data pre-processing

                Because of the wide variety of data sources, the collected datasets vary with respect to noise, redundancy, consistency, etc., and it is undoubtedly a waste to store meaningless data. In addition, some analytical methods have stringent requirements on data quality. Therefore, in order to enable effective data analysis, we shall pre-process data under many circumstances to integrate the data from different sources, which can not only reduce storage expense but also improve analysis accuracy. Some related data pre-processing techniques are discussed as follows.

                – Integration: Data integration is the cornerstone of modern commercial informatics, which involves the combination of data from different sources and provides users with a uniform view of data [66]. This is a mature research field for traditional databases. Historically, two methods have been widely recognized: data warehouse and data federation. Data warehousing includes a process named ETL (Extract, Transform, and Load). Extraction involves connecting source systems and selecting, collecting, analyzing, and processing necessary data. Transformation is the execution of a series of rules to transform the extracted data into standard formats. Loading means importing extracted and transformed data into the target storage infrastructure. Loading is the most complex procedure among the three, which includes operations such as transformation, copy, clearing, standardization, screening, and data organization. In data federation, a virtual database can be built to query and aggregate data from different data sources, but such a database does not contain data. Instead, it includes information or metadata related to the actual data and its positions. These two "storage-reading" approaches do not satisfy the high performance requirements of data flows or search programs and applications. Compared with queries, data in these two approaches is more dynamic and must be processed during data transmission. Generally, data integration methods are accompanied by flow processing engines and search engines [30, 67].

                – Cleaning: Data cleaning is a process to identify inaccurate, incomplete, or unreasonable data, and then modify or delete such data to improve data quality. Generally, data cleaning includes five complementary procedures [68]: defining and determining error types, searching and identifying errors, correcting errors, documenting error examples and error types, and modifying data entry procedures to reduce future errors. During cleaning, data formats, completeness, rationality, and restrictions shall be inspected. Data cleaning is of vital importance to keep data consistent, and it is widely applied in many fields, such as banking, insurance, retail industry, telecommunications, and traffic control.

                In e-commerce, most data is electronically collected, which may have serious data quality problems. Classic data quality problems mainly come from software defects, customized errors, or system misconfiguration. Authors in [69] discussed data cleaning in e-commerce by crawlers and regularly re-copying customer and account information.

                In [70], the problem of cleaning RFID data was examined. RFID is widely used in many applications, e.g., inventory management and target tracking. However, raw RFID data is of low quality and includes a lot of abnormal data, because it is limited by the physical design and affected by environmental noise. In [71], a probability model was developed to cope with data loss in mobile environments. Khoussainova et al. in [72] proposed a system to automatically correct errors of input data by defining global integrity constraints.

                Herbert et al. [73] proposed a framework called BIO-AJAX to standardize biological data so as to conduct further computation and improve search quality. With BIO-AJAX, some errors and repetitions may be eliminated, and common data mining technologies can be executed more effectively.

                – Redundancy elimination: Data redundancy refers to data repetitions or surplus, which usually occurs in many datasets. Data redundancy can increase unnecessary data transmission expense and cause defects in storage systems, e.g., waste of storage space, data inconsistency, reduction of data reliability, and data damage. Therefore, various redundancy reduction methods have been proposed, such as redundancy detection, data filtering, and data compression. Such methods may apply to different datasets or application environments. However, redundancy reduction may also bring about certain negative effects. For example, data compression and decompression cause additional computational burden. Therefore, the benefits of redundancy reduction and its cost should be carefully balanced. Data collected from different fields will increasingly appear in image or video formats. It is well known that images and videos contain considerable redundancy, including temporal redundancy, spatial redundancy, statistical redundancy, and sensing redundancy. Video compression is widely used to reduce redundancy in video data, as specified in the many video coding standards (MPEG-2, MPEG-4, H.263, and H.264/AVC). In [74], the authors investigated the problem of video compression in a video surveillance system with a video sensor network. The authors propose a new MPEG-4 based method by investigating the contextual redundancy related to background and foreground in a scene. The low complexity and the low compression ratio of the proposed approach were demonstrated by the evaluation results.

                In generalized data transmission or storage, repeated data deletion (data deduplication) is a special data compression technology, which aims to eliminate repeated data copies [75]. With repeated data deletion, individual data blocks or data segments are assigned identifiers (e.g., using a hash algorithm) and stored, with the identifiers added to an identification list. As the analysis of repeated data deletion continues, if a new data block has an identifier that is identical to one in the identification list, the new data block is deemed redundant and is replaced by a reference to the corresponding stored data block. Repeated data deletion can greatly reduce storage requirements, which is particularly important to a big data storage system.

                Apart from the aforementioned data pre-processing methods, specific data objects shall go through some other operations, such as feature extraction. Such operations play an important role in multimedia search and DNA analysis [76–78]. Usually, high-dimensional feature vectors (or high-dimensional feature points) are used to describe such data objects, and the system stores the feature vectors for future retrieval. Data transfer is usually used to process distributed heterogeneous data sources, especially business datasets [79]. As a matter of fact, in consideration of the variety of datasets, it is non-trivial, or even impossible, to build a uniform data pre-processing procedure and technology that is applicable to all types of datasets. The specific features, problem, performance requirements, and other factors of the datasets should be considered, so as to select a proper data pre-processing strategy.
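
                The repeated data deletion procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: blocks have a fixed size, SHA-256 is used as the hash algorithm, and the class name `DedupStore` and its methods are hypothetical. The `blocks` dictionary plays the role of the identification list.

```python
import hashlib


class DedupStore:
    """Stores each distinct data block once, keyed by its hash identifier."""

    def __init__(self, block_size=4):
        self.block_size = block_size   # fixed-size blocks (assumption)
        self.blocks = {}               # identifier -> stored block

    def put(self, data: bytes):
        """Split data into blocks; return the list of block identifiers."""
        recipe = []
        for start in range(0, len(data), self.block_size):
            block = data[start:start + self.block_size]
            ident = hashlib.sha256(block).hexdigest()
            if ident not in self.blocks:
                # First time this content is seen: keep the block itself.
                self.blocks[ident] = block
            # Redundant blocks are represented by their identifier only.
            recipe.append(ident)
        return recipe

    def get(self, recipe):
        """Rebuild the original data from its list of identifiers."""
        return b"".join(self.blocks[ident] for ident in recipe)


store = DedupStore(block_size=4)
data = b"ABCDABCDABCDWXYZ"
recipe = store.put(data)
restored = store.get(recipe)
```

                Here the 16-byte input consists of four blocks but only two distinct ones, so only two blocks are physically stored. The sketch also shows the cost side mentioned earlier: every block must be hashed on write and looked up on read.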

                4 Big data storage

                The explosive growth of data imposes stricter requirements on storage and management. In this section, we focus on the storage of big data. Big data storage refers to the storage and management of large-scale datasets while achieving reliability and availability of data access. We will review important issues including massive storage systems, distributed storage systems, and big data storage mechanisms. On one hand, the storage infrastructure needs to provide an information storage service with reliable storage space; on the other hand, it must provide a powerful access interface for query and analysis of a large amount of data.

                Traditionally, as auxiliary equipment of servers, data storage devices are used to store, manage, look up, and analyze data with structured RDBMSs. With the sharp growth of data, data storage devices are becoming increasingly important, and many Internet companies pursue large storage capacity to be competitive. Therefore, there is a compelling need for research on data storage.


                4.1 Storage system for massive data

                Various storage systems have emerged to meet the demands of massive data. Existing massive storage technologies can be classified as Direct Attached Storage (DAS) and network storage, while network storage can be further classified into Network Attached Storage (NAS) and Storage Area Network (SAN).

                In DAS, various hard disks are directly connected with servers, and data management is server-centric, such that storage devices are peripheral equipment, each of which takes a certain amount of I/O resources and is managed by individual application software. For this reason, DAS is only suitable for interconnecting servers at a small scale. Moreover, due to its low scalability, DAS exhibits undesirable efficiency when the storage capacity is increased, i.e., its upgradeability and expandability are greatly limited. Thus, DAS is mainly used in personal computers and small-sized servers.

                Network storage utilizes a network to provide users with a unified interface for data access and sharing. Network storage equipment includes special data exchange equipment, disk arrays, tape libraries, and other storage media, as well as special storage software. It is characterized by strong expandability.

                NAS is actually auxiliary storage equipment of a network. It is directly connected to a network through a hub or switch using TCP/IP protocols. In NAS, data is transmitted in the form of files. Compared to DAS, the I/O burden at a NAS server is reduced extensively, since the server accesses a storage device indirectly through a network.

                While NAS is network-oriented, SAN is especially designed for data storage with a scalable and bandwidth-intensive network, e.g., a high-speed network with optical fiber connections. In SAN, data storage management is relatively independent within a storage local area network, where multipath-based data switching among any internal nodes is utilized to achieve a maximum degree of data sharing and data management.

                From the perspective of the organization of a data storage system, DAS, NAS, and SAN can all be divided into three parts: (i) disk array, which is the foundation of a storage system and the fundamental guarantee for data storage; (ii) connection and network sub-systems, which provide connection among one or more disk arrays and servers; (iii) storage management software, which handles data sharing, disaster recovery, and other storage management tasks of multiple servers.

                4.2 Distributed storage system

                The first challenge brought about by big data is how to develop a large-scale distributed storage system for efficient data processing and analysis. To use a distributed system to store massive data, the following factors should be taken into consideration.

                – Consistency: A distributed storage system requires multiple servers to cooperatively store data. As there are more servers, the probability of server failures will be larger. Usually, data is divided into multiple pieces to be stored at different servers to ensure availability in case of server failure. However, server failures and parallel storage may cause inconsistency among different copies of the same data. Consistency refers to assuring that multiple copies of the same data are identical.

                – Availability: A distributed storage system operates on multiple sets of servers. As more servers are used, server failures are inevitable. It would be desirable if the entire system is not seriously affected and can still satisfy customers' requests in terms of reading and writing. This property is called availability.

                – Partition tolerance: Multiple servers in a distributed storage system are connected by a network. The network could have link/node failures or temporary congestion. The distributed system should have a certain level of tolerance to problems caused by network failures. It would be desirable that the distributed storage still works well when the network is partitioned.

                Eric Brewer proposed the CAP theory [80, 81] in 2000, which indicated that a distributed system could not simultaneously meet the requirements on consistency, availability, and partition tolerance; at most two of the three requirements can be satisfied simultaneously. Seth Gilbert and Nancy Lynch from MIT proved the correctness of the CAP theory in 2002. Since consistency, availability, and partition tolerance could not be achieved simultaneously, we can have a CA system by ignoring partition tolerance, a CP system by ignoring availability, and an AP system that ignores consistency, according to different design goals. The three systems are discussed in the following.

                CA systems do not have partition tolerance, i.e., they could not handle network failures. Therefore, CA systems are generally deemed as storage systems with a single server, such as traditional small-scale relational databases. Such systems feature a single copy of data, such that consistency is easily ensured. Availability is guaranteed by the excellent design of relational databases. However, since CA systems could not handle network failures, they could not be expanded to use many servers. Therefore, most large-scale storage systems are CP systems and AP systems.

                Compared with CA systems, CP systems ensure partition tolerance. Therefore, CP systems can be expanded to become distributed systems. CP systems generally maintain several copies of the same data in order to ensure a level of fault tolerance. CP systems also ensure data consistency, i.e., multiple copies of the same data are guaranteed to be completely identical. However, CP systems could not ensure sound availability because of the high cost of consistency assurance. Therefore, CP systems are useful for scenarios with moderate load but stringent requirements on data accuracy (e.g., trading data). BigTable and HBase are two popular CP systems.

                AP systems also ensure partition tolerance. However, AP systems are different from CP systems in that AP systems also ensure availability. The trade-off is that AP systems only ensure eventual consistency rather than the strong consistency of the previous two systems. Therefore, AP systems only apply to scenarios with frequent requests but not very high requirements on accuracy. For example, in online Social Networking Services (SNS) systems, there are many concurrent visits to the data, but a certain amount of data errors are tolerable. Furthermore, because AP systems ensure eventual consistency, accurate data can still be obtained after a certain amount of delay. Therefore, AP systems may also be used under circumstances with no stringent real-time requirements. Dynamo and Cassandra are two popular AP systems.
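
                The notion of eventual consistency can be made concrete with a toy example. The sketch below is only one possible reconciliation policy, assumed here for illustration: each replica accepts writes locally, and a periodic exchange keeps the value with the highest timestamp (last-writer-wins). The names `Replica` and `anti_entropy` are hypothetical, and real AP systems may use richer versioning schemes.

```python
class Replica:
    """One copy of the data; accepts writes without contacting its peers."""

    def __init__(self, name):
        self.name = name
        self.store = {}   # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        current = self.store.get(key)
        # Last-writer-wins (assumption): keep the entry with the newest timestamp.
        if current is None or timestamp > current[0]:
            self.store[key] = (timestamp, value)

    def read(self, key):
        entry = self.store.get(key)
        return None if entry is None else entry[1]


def anti_entropy(a, b):
    """Exchange all entries in both directions so the replicas converge."""
    for key, (timestamp, value) in list(a.store.items()):
        b.write(key, value, timestamp)
    for key, (timestamp, value) in list(b.store.items()):
        a.write(key, value, timestamp)


r1, r2 = Replica("r1"), Replica("r2")
r1.write("status", "at work", timestamp=1)
r2.write("status", "on holiday", timestamp=2)   # newer write lands on r2 only

stale = r1.read("status")       # r1 still serves the older value
anti_entropy(r1, r2)            # background synchronization
fresh = r1.read("status")       # r1 now serves the newer value
```

                Both replicas stay available for reads and writes throughout; the price is the window in which `r1` returns a stale value, which matches the SNS scenario above where a short-lived inaccuracy is tolerable.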

                4.3 Storage mechanism for big data

                Considerable research on big data promotes the development of storage mechanisms for big data. Existing storage mechanisms of big data may be classified into three bottom-up levels: (i) file systems, (ii) databases, and (iii) programming models.

                File systems are the foundation of the applications at upper levels. Google's GFS is an expandable distributed file system to support large-scale, distributed, data-intensive applications [25]. GFS uses cheap commodity servers to achieve fault tolerance and provides customers with high-performance services. GFS supports large-scale file applications with more frequent reading than writing. However, GFS also has some limitations, such as a single point of failure and poor performance for small files. Such limitations have been overcome by Colossus [82], the successor of GFS.

                In addition, other companies and researchers also have their solutions to meet the different demands for storage of big data. For example, HDFS and KosmosFS are open-source derivatives of GFS. Microsoft developed Cosmos [83] to support its search and advertisement business. Facebook utilizes Haystack [84] to store its large number of small-sized photos. Taobao also developed TFS and FastDFS. In conclusion, distributed file systems have become relatively mature after years of development and business operation. Therefore, we will focus on the other two levels in the rest of this section.

                4.3.1 Database technology

                Database technology has been evolving for more than 30 years. Various database systems have been developed to handle datasets at different scales and support various applications. Traditional relational databases cannot meet the challenges on categories and scales brought about by big data. NoSQL databases (i.e., non-traditional relational databases) are becoming more popular for big data storage. NoSQL databases feature flexible schemas, support for simple and easy copy, simple APIs, eventual consistency, and support for large volumes of data. NoSQL databases are becoming the core technology for big data. We will examine the following three main NoSQL databases in this section: key-value databases, column-oriented databases, and document-oriented databases, each based on certain data models.

                – Key-value databases: Key-value databases are constituted by a simple data model, in which data is stored as values corresponding to keys. Every key is unique, and customers may query values according to the keys. Such databases feature a simple structure, and modern key-value databases are characterized by higher expandability and shorter query response time than those of relational databases. Over the past few years, many key-value databases have appeared, motivated by Amazon's Dynamo system [85]. We will introduce Dynamo and several other representative key-value databases.

                – Dynamo: Dynamo is a highly available and expandable distributed key-value data storage system. It is used to store and manage the status of some core services in the Amazon e-commerce platform that can be realized with key access. The general-purpose schema of relational databases may generate invalid data and limit data scale and availability, while Dynamo can resolve these problems with a simple key-object interface, which is constituted by simple reading and writing operations. Dynamo achieves elasticity and availability through data partition, data copy, and object versioning mechanisms. Dynamo's partition plan relies on consistent hashing [86] to divide the load among multiple main storage machines; its main advantage is that the arrival or departure of a node only affects directly adjacent nodes and does not affect other nodes. Dynamo copies data to N servers, where N is a configurable parameter, in order to achieve high availability and durability. Dynamo also provides eventual consistency, so that updates can be propagated asynchronously to all copies.

                – Voldemort: Voldemort is also a key-value storage system, which was initially developed for and is still used by LinkedIn. Keys and values in Voldemort are composite objects constituted by lists and maps. The Voldemort interface includes three simple operations, namely reading, writing, and deletion, all of which are identified by keys. Voldemort provides asynchronous updating with multi-version concurrency control but does not ensure data consistency. However, Voldemort supports optimistic locking for consistent multi-record updating: when a conflict happens between an update and any other operation, the update operation quits. The data copy mechanism of Voldemort is the same as that of Dynamo. Voldemort not only stores data in RAM but also allows data to be inserted into a storage engine. In particular, Voldemort supports two storage engines: Berkeley DB and Random Access Files.

                The key-value database emerged a few years ago. Deeply influenced by Amazon Dynamo, other key-value storage systems include Redis, Tokyo Cabinet and Tokyo Tyrant, Memcached and MemcacheDB, Riak, and Scalaris, all of which provide scalability by distributing keys across nodes. Voldemort, Riak, Tokyo Cabinet, and Memcached can utilize attached storage devices to store data in RAM or on disks. Other storage systems store data in RAM and provide disk backup, or rely on replication and recovery to avoid backup.
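                The consistent hashing scheme mentioned for Dynamo can be illustrated with a minimal sketch. This is not Dynamo's implementation: the class name HashRing, the node names, and the choice of MD5 are assumptions made for illustration, and virtual nodes are omitted. It shows that removing a node only reassigns the keys that node owned, and that the N replicas of a key are the next N nodes clockwise on the ring.

```python
import bisect
import hashlib


def ring_position(name: str) -> int:
    """Map a string to a position on the ring (MD5 is an arbitrary choice)."""
    return int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hash ring without virtual nodes."""

    def __init__(self, nodes=()):
        self._positions = []  # sorted ring positions
        self._owner = {}      # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        pos = ring_position(node)
        bisect.insort(self._positions, pos)
        self._owner[pos] = node

    def remove(self, node: str) -> None:
        pos = ring_position(node)
        self._positions.remove(pos)
        del self._owner[pos]

    def replicas(self, key: str, n: int = 1):
        """Return the first n nodes clockwise from the key's position."""
        start = bisect.bisect_right(self._positions, ring_position(key))
        size = len(self._positions)
        return [self._owner[self._positions[(start + i) % size]]
                for i in range(min(n, size))]

    def lookup(self, key: str) -> str:
        """Return the node responsible for the key."""
        return self.replicas(key, 1)[0]


ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
keys = [f"key-{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}
ring.remove("node-c")
after = {k: ring.lookup(k) for k in keys}
# Only keys that were stored on the removed node change owner.
moved = [k for k in keys if before[k] != after[k]]
```

                In this sketch every key in moved was previously owned by node-c; keys on the other nodes stay where they were, which is the property the text attributes to consistent hashing.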

                – Column-oriented Database: The column-oriented databases store and process data by columns rather than rows. Both columns and rows are split across multiple nodes to achieve scalability. The column-oriented databases are mainly inspired by Google's BigTable. In this section, we first discuss BigTable and then introduce several derivative tools.

                – BigTable: BigTable is a distributed, structured data storage system designed to process large-scale (PB class) data across thousands of commodity servers [87]. The basic data structure of BigTable is a sparse, distributed, persistent, multi-dimensional sorted map. The map is indexed by row key, column key, and timestamp, and every value in the map is an uninterpreted byte array. Each row key in BigTable is an arbitrary string of up to 64 KB. Rows are stored in lexicographic order and dynamically partitioned into Tablets (i.e., units of distribution) for load balancing. Thus, reading a short range of rows can be highly efficient, since it only involves communication with a small number of machines. Columns are grouped according to the prefixes of their keys, thus forming column families. These column families are the basic units for access control. The timestamps are 64-bit integers that distinguish different versions of cell values. Clients may flexibly determine the number of cell versions stored. These versions are sorted in descending order of timestamps, so the latest version is always read first.

                The BigTable API features the creation and deletion of tables and column families, as well as modification of the metadata of clusters, tables, and column families. Client applications may insert or delete values in BigTable, query values from columns, or browse sub-datasets in a table. BigTable also supports some other features, such as transaction processing within a single row. Users may utilize such features to conduct more complex data processing.

                Every BigTable implementation includes three main components: the Master server, Tablet servers, and the client library. BigTable allows only one active Master server, which is responsible for assigning Tablets to Tablet servers, detecting added or removed Tablet servers, and balancing load. In addition, it handles BigTable schema changes, e.g., creating tables and column families, and garbage-collects files stored in GFS, i.e., files that have been deleted or are no longer used by a specific BigTable instance. Every Tablet server manages a set of Tablets and is responsible for reads and writes to the Tablets it has loaded. When Tablets grow too big, they are split by the server. The application client library is used to communicate with BigTable instances.

                BigTable is built on many fundamental components of Google, including GFS [25], a cluster management system, the SSTable file format, and Chubby [88]. GFS is used to store data and log files. The cluster management system is responsible for task scheduling, resource sharing, handling machine failures, and monitoring machine status. The SSTable file format is used to store BigTable data internally, and it provides a persistent, ordered, immutable map from keys to values, both of which are arbitrary byte strings. BigTable relies on Chubby for the following tasks: 1) ensuring there is at most one active Master at any time; 2) storing the bootstrap location of BigTable data; 3) discovering Tablet servers; 4) conducting error recovery in case of Tablet server failures; 5) storing BigTable schema information; 6) storing the access control lists.

                – Cassandra: Cassandra is a distributed storage system for managing huge amounts of structured data spread across multiple commodity servers [89]. The system was developed by Facebook and became an open source tool in 2008. It adopts the ideas and concepts of both Amazon Dynamo and Google BigTable, in particular integrating the distributed system technology of Dynamo with the BigTable data model. Tables in Cassandra take the form of a distributed four-dimensional structured map, where the four dimensions are row, column, column family, and super column. A row is identified by a string key of arbitrary length. No matter how many columns are read or written, the operation on a row is atomic. Columns may be grouped into clusters called column families, which are similar to those in the BigTable data model. Cassandra provides two kinds of column families: column families and super columns. A super column includes an arbitrary number of columns related to the same name. A column family includes columns and super columns, which may be continuously inserted into the column family during runtime. The partitioning and replication mechanisms of Cassandra are very similar to those of Dynamo, so as to achieve consistency.

                – Derivative tools of BigTable: Since the BigTable code cannot be obtained under an open source license, several open source projects, such as HBase and Hypertable, compete to implement the BigTable concept in similar systems.

                HBase is a BigTable clone written in Java and is part of Apache's Hadoop MapReduce framework [90]. HBase replaces GFS with HDFS. It writes updates into RAM and regularly flushes them to files on disk. Row operations are atomic, equipped with row-level locking and transaction processing, which is optional at large scale. Partitioning and distribution are handled transparently, with no need for client-side hashing or a fixed keyspace.

                HyperTable was developed along the lines of BigTable to provide a high-performance, scalable, distributed storage and processing system for structured and unstructured data [91]. HyperTable relies on a distributed file system, e.g., HDFS, and a distributed lock manager. Its data representation, processing, and partitioning mechanisms are similar to those of BigTable. HyperTable has its own query language, called HyperTable Query Language (HQL), which allows users to create, modify, and query the underlying tables.

                Since the column-oriented databases mainly emulate BigTable, their designs are all similar except for the concurrency mechanism and several other features. For example, Cassandra emphasizes weak consistency with multi-version concurrency control, while HBase and HyperTable focus on strong consistency through locks or log records.
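                The BigTable data model described above, a sorted map from (row key, column key, timestamp) to an uninterpreted byte string, can be sketched in a few lines. The class SparseTable and its methods are hypothetical names for a single-process toy, not the BigTable or HBase API; the sketch only shows versions kept newest first, a client-chosen version limit, and lexicographic row scans.

```python
from collections import defaultdict


class SparseTable:
    """Toy (row key, column key, timestamp) -> bytes map (illustrative only)."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # (row, column) -> list of (timestamp, value), newest first
        self._cells = defaultdict(list)

    def put(self, row, column, timestamp, value: bytes):
        versions = self._cells[(row, column)]
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]  # drop versions beyond the limit

    def get(self, row, column, timestamp=None):
        """Return the newest value, or the newest one at or before timestamp."""
        for ts, value in self._cells.get((row, column), []):
            if timestamp is None or ts <= timestamp:
                return value
        return None

    def scan(self, start_row, end_row):
        """Yield (row, column, newest value) for start_row <= row < end_row."""
        for row, column in sorted(self._cells):
            if start_row <= row < end_row:
                yield row, column, self._cells[(row, column)][0][1]


table = SparseTable(max_versions=2)
table.put("com.cnn.www", "contents:", 1, b"v1")
table.put("com.cnn.www", "contents:", 3, b"v3")
table.put("com.cnn.www", "contents:", 2, b"v2")
table.put("com.bbc.www", "anchor:x", 1, b"a")
```

                Because rows are kept in lexicographic order, a scan over a short key range touches only neighboring rows, which in the real system corresponds to a small number of Tablets.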

                – Document Database: Compared with key-value storage, document storage can support more complex data forms. Since documents do not follow a strict schema, there is no need to conduct schema migration. In addition, key-value pairs can still be saved. We will examine three important representatives of document storage systems, i.e., MongoDB, SimpleDB, and CouchDB.

                – MongoDB: MongoDB is an open-source, document-oriented database [92]. MongoDB stores documents as Binary JSON (BSON) objects [93], which are similar to JSON objects. Every document has an ID field as the primary key. Queries in MongoDB are expressed with a syntax similar to JSON. A database driver sends the query as a BSON object to MongoDB. The system allows queries on all documents, including embedded objects and arrays. To enable rapid queries, indexes can be created on the queryable fields of documents. Replication in MongoDB is performed with log files on the master nodes that record all the high-level operations conducted in the database. During replication, the slaves query the master for all write operations since their last synchronization and execute the logged operations on their local databases. MongoDB supports horizontal scaling with automatic sharding, which distributes data among thousands of nodes with automatic load balancing and failover.


                – SimpleDB: SimpleDB is a distributed database and a web service of Amazon [94]. Data in SimpleDB is organized into domains, in which data may be stored, acquired, and queried. Domains include different properties and sets of name/value pairs of items. Data is replicated to different machines at different data centers in order to ensure data safety and improve performance. This system does not support automatic partitioning and thus cannot scale with the change of data volume. SimpleDB allows users to query with SQL. It is worth noting that SimpleDB can assure eventual consistency but does not support Multi-Version Concurrency Control (MVCC). Therefore, conflicts cannot be detected from the client side.

                – CouchDB: Apache CouchDB is a document-oriented database written in Erlang [95]. Data in CouchDB is organized into documents consisting of fields named by keys/names and values, which are stored and accessed as JSON objects. Every document is provided with a unique identifier. CouchDB allows access to database documents through a RESTful HTTP API. If a document needs to be modified, the client must download the entire document, modify it, and then send it back to the database. Each time a document is rewritten, its revision identifier is updated. CouchDB utilizes optimistic replication to obtain scalability without a sharding mechanism. Since multiple CouchDB instances may run along with other transactions simultaneously, any kind of replication topology can be built. The consistency of CouchDB relies on the replication mechanism. CouchDB supports MVCC with historical hash records.
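                The read-modify-write cycle described for CouchDB, combined with MVCC, can be sketched as an in-memory store that rejects writes based on a stale revision. The names DocStore and Conflict, and the "generation-hash" revision format, are assumptions loosely modeled on the description above; this is not CouchDB's HTTP API.

```python
import hashlib
import json


class Conflict(Exception):
    """Raised when a write is based on an outdated revision."""


class DocStore:
    """Hypothetical in-memory document store with revision checks."""

    def __init__(self):
        self._docs = {}  # document id -> (revision, body)

    @staticmethod
    def _next_rev(prev_rev, body):
        generation = int(prev_rev.split("-")[0]) + 1 if prev_rev else 1
        payload = json.dumps(body, sort_keys=True).encode("utf-8")
        return f"{generation}-{hashlib.md5(payload).hexdigest()}"

    def get(self, doc_id):
        """Return (revision, copy of the whole document)."""
        rev, body = self._docs[doc_id]
        return rev, dict(body)

    def put(self, doc_id, body, rev=None):
        """Store the whole document; rev must match the current revision."""
        current = self._docs.get(doc_id)
        current_rev = current[0] if current else None
        if rev != current_rev:
            raise Conflict(f"{doc_id}: expected {current_rev}, got {rev}")
        new_rev = self._next_rev(current_rev, body)
        self._docs[doc_id] = (new_rev, dict(body))
        return new_rev


store = DocStore()
rev1 = store.put("user:1", {"name": "Ann"})
rev, doc = store.get("user:1")        # download the entire document
doc["city"] = "Wuhan"                 # modify it locally
rev2 = store.put("user:1", doc, rev)  # send it back with the revision read
```

                A second client that still holds rev1 and tries to write would receive a Conflict, which is how a client can detect concurrent updates under this kind of scheme.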

                Big data are generally stored on hundreds or even thousands of commodity servers. Thus, traditional parallel models, such as Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), may not be adequate to support such large-scale parallel programs. Recently, some proposed parallel programming models have effectively improved the performance of NoSQL and reduced the performance gap to relational databases. Therefore, these models have become the cornerstone for the analysis of massive data.

                – MapReduce: MapReduce [22] is a simple but powerful programming model for large-scale computing, using large clusters of commodity PCs to achieve automatic parallel processing and distribution. The MapReduce computing model has only two functions, i.e., Map and Reduce, both of which are programmed by users. The Map function processes input key-value pairs and generates intermediate key-value pairs. Then, MapReduce combines all the intermediate values related to the same key and passes them to the Reduce function, which further compresses the value set into a smaller set. MapReduce has the advantage that it avoids the complicated steps of developing parallel applications, e.g., data scheduling, fault tolerance, and inter-node communication. The user only needs to program the two functions to develop a parallel application. The initial MapReduce framework did not support multiple datasets in a task, which has been mitigated by some recent enhancements [96, 97].

                Over the past decades, programmers have become familiar with SQL, an advanced declarative language often used in relational databases for task description and dataset analysis. However, the succinct MapReduce framework provides only two opaque functions, which cannot cover all the common operations. Therefore, programmers have to spend time programming basic functions, which are typically hard to maintain and reuse. In order to improve programming efficiency, some advanced language systems have been proposed, e.g., Sawzall [98] of Google, Pig Latin [99] of Yahoo, Hive [100] of Facebook, and Scope [87] of Microsoft.

                – Dryad: Dryad [101] is a general-purpose distributed execution engine for processing coarse-grained data-parallel applications. The operational structure of Dryad is a directed acyclic graph, in which vertexes represent programs and edges represent data channels. Dryad executes operations on the vertexes in clusters and transmits data via data channels, including files, TCP connections, and shared-memory FIFOs. During operation, resources in a logical operation graph are automatically mapped to physical resources.

                The operation structure of Dryad is coordinated by a central program called the job manager, which can be executed in clusters or on workstations across the network. A job manager consists of two parts: 1) application code, which is used to build a job communication graph, and 2) library code, which is used to schedule available resources. All data are transmitted directly between vertexes. Therefore, the job manager is only responsible for decision-making and does not obstruct any data transmission.

                In Dryad, application developers can flexibly choose any directed acyclic graph to describe the communication patterns of the application and express data transmission mechanisms. In addition, Dryad allows vertexes to use any number of inputs and outputs, while MapReduce supports only one input set and one output set.


                DryadLINQ [102] is the high-level language of Dryad and is used to integrate the aforementioned SQL-like language execution environment.

                – All-Pairs: All-Pairs [103] is a system specially designed for biometrics, bioinformatics, and data mining applications. It focuses on comparing element pairs in two datasets by a given function. All-Pairs can be expressed as a three-tuple (Set A, Set B, Function F), in which Function F is used to compare all elements in Set A and Set B. The comparison result is an output matrix M, which is also called the Cartesian product or cross join of Set A and Set B.

                All-Pairs is implemented in four phases: system modeling, distribution of input data, batch job management, and result collection. In Phase I, an approximate model of system performance is built to evaluate how much CPU resource is needed and how to partition the job. In Phase II, a spanning tree is built for data transmission, which lets the workload of every partition retrieve input data efficiently. In Phase III, after the data flow is delivered to the proper nodes, the All-Pairs engine builds a batch-processing submission for the jobs in each partition, sequences them in the batch processing system, and formulates a node running command to acquire data. In the last phase, after the batch processing system completes the jobs, the extraction engine collects the results and combines them in a proper structure, which is generally a single file list in which all results are put in order.

                – Pregel: The Pregel [104] system of Google facilitates the processing of large-sized graphs, e.g., analysis of network graphs and social networking services. A computational task is expressed by a directed graph consisting of vertexes and directed edges. Every vertex is associated with a modifiable, user-defined value, and every directed edge is associated with its source vertex and consists of a user-defined value and the identifier of a target vertex. Once the graph is built, the program conducts iterative calculations, called supersteps, separated by global synchronization points, until the algorithm completes and the output is produced. In every superstep, vertex computations run in parallel, and every vertex executes the same user-defined function that expresses the logic of a given algorithm. Every vertex may modify its own state and that of its outgoing edges, receive messages sent in the previous superstep, send messages to other vertexes, and even modify the topological structure of the entire graph. Edges have no associated computation. Every vertex may deactivate itself by suspension. When all vertexes are inactive and there are no messages to transmit, the entire program execution is completed. The output of a Pregel program is the set of values output by all the vertexes. Generally speaking, the input and output of a Pregel program are isomorphic directed graphs.
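                Returning to the first model in this list, the division of labor in MapReduce can be sketched with a word count. The helper run_mapreduce below is a hypothetical single-process stand-in for the framework, not Hadoop's or Google's API: it applies the user's Map function, groups intermediate values by key, and hands each group to the user's Reduce function.

```python
from collections import defaultdict


def map_fn(_doc_id, text):
    """User-defined Map: emit (word, 1) for every word in a document."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """User-defined Reduce: compress the list of counts into one total."""
    yield word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Sequential stand-in for the framework (illustrative only)."""
    groups = defaultdict(list)
    for key, value in records:                   # map phase
        for ikey, ivalue in mapper(key, value):
            groups[ikey].append(ivalue)          # group by intermediate key
    output = {}
    for ikey in sorted(groups):                  # reduce phase
        for okey, ovalue in reducer(ikey, groups[ikey]):
            output[okey] = ovalue
    return output


documents = [("d1", "big data big value"), ("d2", "data analysis")]
word_counts = run_mapreduce(documents, map_fn, reduce_fn)
```

                Only map_fn and reduce_fn are application code; in a real deployment the grouping step is the distributed shuffle, and scheduling and fault tolerance are handled by the framework.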

                Inspired by the above programming models, other research has also focused on programming models for more complex computational tasks, e.g., iterative computations [105, 106], fault-tolerant memory computations [107], incremental computations [108], and flow control decision-making related to data [109].
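                The superstep model of Pregel can also be sketched briefly. The function below is a sequential, hypothetical simulation and not the Pregel API: every vertex runs the same logic, reads messages from the previous superstep, sends messages along its outgoing edges, and then halts until a new message reactivates it. The example algorithm propagates the maximum value through the graph.

```python
def pregel_max(values, edges, max_supersteps=50):
    """Propagate the maximum value along directed edges.

    values: vertex -> initial value
    edges:  vertex -> list of target vertexes (all present in values)
    """
    values = dict(values)
    active = set(values)                  # every vertex starts active
    inbox = {v: [] for v in values}
    for superstep in range(max_supersteps):
        if not active and not any(inbox.values()):
            break                         # all halted, no messages in flight
        outbox = {v: [] for v in values}
        for v in values:
            messages = inbox[v]
            if v not in active and not messages:
                continue                  # halted vertex with nothing to do
            new_value = max([values[v]] + messages)
            changed = new_value != values[v]
            values[v] = new_value
            if superstep == 0 or changed:
                for target in edges.get(v, []):
                    outbox[target].append(new_value)
            active.discard(v)             # vote to halt
        inbox = outbox                    # global synchronization point
    return values


cycle = pregel_max({"a": 3, "b": 6, "c": 2},
                   {"a": ["b"], "b": ["c"], "c": ["a"]})
chain = pregel_max({"a": 1, "b": 5}, {"a": ["b"]})
```

                The output has the same vertex set as the input, which matches the remark that the input and output of a Pregel program are isomorphic graphs.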

                5 Big data analysis

                The analysis of big data mainly involves analytical methods for traditional data and big data, analytical architecture for big data, and software used for mining and analysis of big data. Data analysis is the final and most important phase in the value chain of big data, with the purpose of extracting useful values and providing suggestions or decisions. Different levels of potential value can be generated through the analysis of datasets in different fields [10]. However, data analysis is a broad area, which frequently changes and is extremely complex. In this section, we introduce the methods, architectures, and tools for big data analysis.

                5.1 Traditional data analysis

                Traditional data analysis means using proper statistical methods to analyze massive data, to concentrate, extract, and refine useful data hidden in a batch of chaotic datasets, and to identify the inherent laws of the subject matter, so as to maximize the value of data. Data analysis plays a huge guidance role in making development plans for a country, understanding customer demands for commerce, and predicting market trends for enterprises. Big data analysis can be deemed the analysis technique for a special kind of data. Therefore, many traditional data analysis methods may still be utilized for big data analysis. Several representative traditional data analysis methods are examined in the following, many of which are from statistics and computer science.

                – Cluster Analysis: a statistical method for grouping objects, specifically, classifying objects according to some features. Cluster analysis is used to differentiate objects with particular features and divide them into categories (clusters) according to these features, such that objects in the same category have high homogeneity while different categories have high heterogeneity. Cluster analysis is an unsupervised learning method that requires no training data.

                – Factor Analysis: basically targeted at describing the relations among many elements with only a few factors, i.e., grouping several closely related variables into a factor; the few factors are then used to reveal most of the information in the original data.


                – Correlation Analysis: an analytical method for determining the laws of relations, such as correlation, correlative dependence, and mutual restriction, among observed phenomena, and accordingly conducting forecast and control. Such relations may be classified into two types: (i) function, reflecting a strict dependence relationship among phenomena, which is also called a definitive dependence relationship; (ii) correlation, covering undetermined or inexact dependence relations, in which a numerical value of one variable may correspond to several numerical values of the other variable, and such numerical values fluctuate regularly around their mean values.

                – Regression Analysis: a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies dependence relationships among variables hidden by randomness. Regression analysis may turn complex and undetermined correlations among variables into simple and regular ones.

                – A/B Testing: also called bucket testing. It is a technology for determining how to improve target variables by comparing tested groups. Big data will require a large number of tests to be executed and analyzed.

                – Statistical Analysis: Statistical analysis is based on statistical theory, a branch of applied mathematics. In statistical theory, randomness and uncertainty are modeled with probability theory. Statistical analysis can provide description and inference for big data. Descriptive statistical analysis can summarize and describe datasets, while inferential statistical analysis can draw conclusions from data subject to random variations. Statistical analysis is widely applied in the economic and medical care fields [110].

                – Data Mining Algorithms: Data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. In 2006, the IEEE International Conference on Data Mining (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [111], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, and CART, etc. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and link mining, all of which are among the most important problems in data mining research.
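                Two of the traditional methods listed above, correlation analysis and regression analysis, can be made concrete with a short sketch. It assumes the simplest case, one explanatory variable and a linear relation, and uses the standard Pearson coefficient and least-squares formulas; the function names and sample data are illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up sample that follows y = 2x + 1 exactly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
r = pearson_r(xs, ys)
slope, intercept = fit_line(xs, ys)
```

                On this sample the relation is functional (a strict dependence), so the coefficient is 1 and the fitted line recovers the slope and intercept; noisy data would give a coefficient between -1 and 1, the correlation case described above.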

                5.2 Big data analytic methods

                At the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data so as to bring value to enterprises and individuals. At present, the main processing methods of big data are as follows.

                – Bloom Filter: A Bloom filter consists of a series of hash functions. The principle of the Bloom filter is to store hash values of data rather than the data itself by utilizing a bit array, which is in essence a bitmap index that uses hash functions to conduct lossy compression storage of data. It has such advantages as high space efficiency and high query speed, but also has some disadvantages in misrecognition (false positives) and deletion.

                – Hashing: a method that essentially transforms data into shorter fixed-length numerical values or index values. Hashing has such advantages as rapid reading and writing and high query speed, but it is hard to find a sound hash function.

                – Index: an index is always an effective method to reduce the expense of disk reading and writing and to improve insertion, deletion, modification, and query speeds, both in traditional relational databases that manage structured data and in other technologies that manage semi-structured and unstructured data. However, an index has the disadvantage of the additional cost of storing index files, which must be maintained dynamically when data is updated.

                – Trie: also called trie tree, a variant of the hash tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of the trie is to utilize common prefixes of character strings to reduce comparisons on character strings to the greatest extent, so as to improve query efficiency.

                – Parallel Computing: compared to traditional serial computing, parallel computing refers to simultaneously utilizing several computing resources to complete a computation task. Its basic idea is to decompose a problem into parts and assign them to several separate processes to be completed independently, so as to achieve co-processing. Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see a comparison in Table 1).
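                As an illustration of the Bloom filter in the list above, the following minimal sketch stores only bit positions derived from hash values, never the items themselves. The sizes (1024 bits, 3 hash functions) and the use of salted SHA-256 digests as the family of hash functions are arbitrary assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over strings (illustrative parameters)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


seen = BloomFilter()
for url in ["example.org/a", "example.org/b", "example.org/c"]:
    seen.add(url)
```

                Added items are always reported as present, while an item that was never added may occasionally be reported as present as well (misrecognition). Because several items can share a bit, clearing bits to delete one item could erase others, which is why deletion is listed as a disadvantage.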

                Although parallel computing systems or tools, such as MapReduce or Dryad, are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages are being developed based on these systems. Such high-level languages include Sawzall, Pig, and Hive, used for MapReduce, as well as Scope and DryadLINQ, used for Dryad.
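                The trie listed among the methods above can likewise be sketched in a few lines. The class below is a hypothetical minimal version supporting the two uses named in the text, word frequency statistics and rapid retrieval by common prefix.

```python
class Trie:
    """Minimal trie node; the root represents the empty prefix."""

    def __init__(self):
        self.children = {}  # character -> child Trie node
        self.count = 0      # how many inserted words end at this node

    def insert(self, word: str) -> None:
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.count += 1

    def _find(self, prefix: str):
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def frequency(self, word: str) -> int:
        """Number of times the exact word was inserted."""
        node = self._find(word)
        return node.count if node else 0

    def with_prefix(self, prefix: str):
        """Sorted list of distinct stored words that start with prefix."""
        node = self._find(prefix)
        if node is None:
            return []
        found, stack = [], [(node, prefix)]
        while stack:
            current, word = stack.pop()
            if current.count:
                found.append(word)
            for ch, child in current.children.items():
                stack.append((child, word + ch))
        return sorted(found)


index = Trie()
for token in ["data", "database", "data", "day", "dryad"]:
    index.insert(token)
```

                A lookup costs one step per character of the query string regardless of how many words are stored, because words sharing a prefix share the same path from the root.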

                5.3 Architecture for big data analysis

                Because of the 4Vs of big data, different analytical architectures shall be considered for different application requirements.


                Table 1 Comparison of MPI, MapReduce, and Dryad

                |  | MPI | MapReduce | Dryad |
                |---|---|---|---|
                | Deployment | Computing node and data storage arranged separately (data should be moved to the computing node) | Computing and data storage arranged at the same node (computing should be close to data) | Computing and data storage arranged at the same node (computing should be close to data) |
                | Resource management/scheduling | – | Workqueue (Google), HOD (Yahoo) | Not clear |
                | Low-level programming | MPI API | MapReduce API | Dryad API |
                | High-level programming | – | Pig, Hive, Jaql, ... | Scope, DryadLINQ |
                | Data storage | The local file system, NFS, ... | GFS (Google), HDFS (Hadoop), KFS, Amazon S3, ... | NTFS, Cosmos DFS |
                | Task partitioning | User manually partitions the tasks | Automatic | Automatic |
                | Communication | Messaging, remote memory access | Files (local FS, DFS) | Files, TCP pipes, shared-memory FIFOs |
                | Fault tolerance | Checkpoint | Task re-execution | Task re-execution |

                5.3.1 Real-time vs. offline analysis

                According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

                – Real-time analysis: mainly used in e-commerce and finance. Since data constantly changes, rapid data analysis is needed, and analytical results shall be returned with a very short delay. The main existing architectures of real-time analysis include (i) parallel processing clusters using traditional relational databases and (ii) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

                – Offline analysis: usually used for applications without high requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally conducts analysis by importing logs into a special platform through data acquisition tools. Under the big data setting, many Internet enterprises utilize offline analysis architectures based on Hadoop in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples of such acquisition tools include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool Timetunnel, and Chukwa of Hadoop, etc. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.

                5.3.2 Analysis at different levels

                Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive-level analysis, which are examined in the following.

                – Memory-level analysis: for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays, the memory of a server cluster surpasses hundreds of GB, and even the TB level is common. Therefore, an in-memory database technology may be used, and hot data shall reside in memory so as to improve analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis. MongoDB is a representative memory-level analytical architecture. With the development of SSDs (Solid-State Drives), the capacity and performance of memory-level data analysis have been further improved and widely applied.

                – BI analysis: for the case when the data scale surpasses the memory level but may be imported into the BI analysis environment. The current mainstream BI products are provided with data analysis plans that support data volumes above the TB level.

                – Massive analysis: for the case when the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analysis utilizes HDFS of Hadoop to store data and uses MapReduce for data analysis. Most massive analysis belongs to the offline analysis category.


                5.3.3 Analysis with different complexity

                The time and space complexity of data analysis algorithms differ greatly from each other according to different kinds of data and application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.

                5.4 Tools for big data mining and analysis

                Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the five most widely used software packages, according to a survey of 798 professionals, “What Analytics, Data mining, Big Data software that you used in the past 12 months for a real project?”, conducted by KDnuggets in 2012 [112].

                – R (30.7%): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. When computing-intensive tasks are executed, code written in C, C++, and Fortran may be called in the R environment. In addition, skilled users can directly call R objects in C. R is in fact an implementation of the S language, an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and drawing plots. Compared to S, R is more popular since it is open source. R ranked first in the KDnuggets 2012 survey. Furthermore, in a 2012 survey of “Design languages you have used for data mining/analysis in the past year”, R was also in first place, ahead of SQL and Java. Due to the popularity of R, database manufacturers such as Teradata and Oracle have released products supporting R.

                – Excel (29.8%): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins with powerful functions for data analysis, such as Analysis ToolPak and Solver Add-in, are included, but such plug-ins can be used only if users enable them. Excel is also the only commercial software among the top five.

                – Rapid-I RapidMiner (26.7%): RapidMiner is open source software used for data mining, machine learning, and predictive analysis. In an investigation by KDnuggets in 2011, it was more frequently used than R (ranked Top 1). Data mining and machine learning programs provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment. The data mining flow is described in XML and displayed through a graphical user interface (GUI). RapidMiner is written in Java. It integrates the learners and evaluation methods of Weka and works with R. Functions of RapidMiner are implemented by connecting processes that consist of various operators. The entire flow can be deemed a production line of a factory, with original data as input and model results as output. The operators can be considered specific functions with different input and output characteristics.

                – KNIME (21.8%): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, open-source platform rich in data integration, data processing, data analysis, and data mining functions [113]. It allows users to create data flows or data channels in a visualized manner, to selectively run some or all analytical procedures, and provides analytical results, models, and interactive views. KNIME is written in Java and, being based on Eclipse, provides more functions as plug-ins. Through plug-in files, users can insert processing modules for files, pictures, and time series, and integrate them into various open source projects, e.g., R and Weka. KNIME controls data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted in a visualized environment. KNIME is designed as a module-based and extensible framework. There is no dependence between its processing units and data containers, making it adaptive to distributed environments and independent development. In addition, KNIME is easy to extend: developers can effortlessly extend its various nodes and views.

– Weka/Pentaho (14.8%): Weka, abbreviated from Waikato Environment for Knowledge Analysis, is free and open-source machine learning and data mining software written in Java. Weka provides functions such as data processing, feature selection, classification, regression, clustering, association rules, and visualization. Pentaho is one of the most popular open-source BI software suites. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, data mining, and other aspects of BI. Weka's data processing algorithms are also integrated in Pentaho and can be called directly.

                6 Big data applications

In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful value via judgments, suggestions, support, or decisions.

Mobile Netw Appl (2014) 19:171–209

However, data analysis involves a wide range of applications, which change frequently and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields: structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

6.1 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically involves large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have emerged over the past decades. For example, as early as the 1990s, BI became a prevailing technology for business applications, and network search engines based on massive data mining emerged in the early 21st century. Some potential and influential applications from different fields, together with their data and analysis characteristics, are discussed as follows.

– Evolution of Commercial Applications: The earliest business data was generally structured data, which companies collected from legacy systems and stored in RDBMSs. The analytical techniques used in such systems prevailed in the 1990s and were intuitive and simple, e.g., reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, score cards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have provided a unique opportunity for organizations to present themselves online and interact directly with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted with text analysis and website mining techniques. As reported in [115], the number of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the sensor-based Internet of Things are opening a new generation of innovative applications and require considerably larger capacity to support location-sensing, people-oriented, and context-aware operation.

– Evolution of Network Applications: The early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to mining email contents and building search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages full of various kinds of data, such as text, images, audio, videos, and interactive content. Therefore, plenty of advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share content.

– Evolution of Scientific Applications: Scientific research in many fields, such as astrophysics, oceanology, genomics, and environmental research, is acquiring massive data with high-throughput sensors and instruments. The US National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, inter-operative analysis software, and data services to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets vary widely in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and comprehensive coverage is not easy, we will focus on the key problems and technologies in data analysis in the following discussions.

6.2 Big data analysis fields


6.2.1 Structured data analysis

Business applications and scientific research may generate massive structured data, whose management and analysis rely on mature commercialized technologies such as RDBMS, data warehouse, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning based on exact mathematical models and powerful algorithms has been applied to anomaly detection [117] and energy control [118]. By exploiting data characteristics, time and space mining can extract knowledge structures hidden in high-speed data flows and sensors [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially in process analysis with event data [121].

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to have more business potential than structured data analysis. Generally, text analysis is the process of extracting useful information and knowledge from unstructured text. Text mining is inter-disciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and, in particular, data mining. Most text mining systems are based on text representations and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammars [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.
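As a small illustration of a text representation that many text mining pipelines start from, the sketch below computes TF-IDF weights for a toy corpus. It is not taken from any cited system: the whitespace tokenizer, the `tf * log(N / df)` weighting variant, and the sample documents are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Invented sample corpus, for illustration only.
docs = [
    "big data analysis needs scalable storage",
    "text mining extracts knowledge from text",
    "web mining studies web usage data",
]
weights = tf_idf(docs)
top_terms = [max(w, key=w.get) for w in weights]
```

Terms that are frequent in one document but rare in the corpus receive the highest weights, which is why such vectors are a common input to classification, clustering, and summarization.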

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, image, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which will be discussed in Section 6.2.4. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers on text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves the mining of semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., email, newsgroup management, and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to model and integrate data on the Web so as to conduct more complex queries than keyword-based searches.

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagrams of links within a website or among multiple websites. Models are built on the topological structures provided by hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages. The topic-oriented crawler is another successful case that utilizes the models [127].
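To make the link-structure idea concrete, the following is a minimal power-iteration sketch in the spirit of PageRank. It is a textbook-style simplification rather than the algorithm as specified in [125]: the damping factor of 0.85, the fixed iteration count, the even redistribution of rank from pages without outgoing links, and the four-page toy graph are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: C is linked from A, B, and D.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
most_relevant = max(ranks, key=ranks.get)
```

In this toy graph the page with the most incoming links accumulates the highest score, while the page nobody links to keeps only the baseline share.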

Web usage mining aims to mine auxiliary data generated by Web dialogues or activities, whereas Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or trades, caches, user queries, bookmark data, mouse clicks and scrolls, and any other kinds of data generated through interaction with the Web. As Web services and Web 2.0 are becoming mature and popular, Web usage data will have increasingly high variety. Web usage mining plays key roles in personalized space, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.


6.2.4 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) have been growing at an amazing speed, and analysis is used to extract useful knowledge from them and to understand their semantics. Because multimedia data is heterogeneous and most of it contains richer information than simple structured data or text data, extracting information is confronted with the huge challenge of the semantic gap. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection.

Audio summarization can be accomplished by extracting the prominent words or phrases from metadata or by synthesizing a new representation. Video summarization interprets the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied in many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video and apply additional smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.

Multimedia annotation inserts labels to describe the contents of images and videos at both the syntactic and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time- and labor-intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic gap. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to explore manual and automatic multimedia annotation jointly [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users in looking up multimedia resources conveniently and quickly [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, and includes shot boundary detection, key frame extraction, and scene segmentation. Based on the result of structural analysis, the second procedure is feature extraction, which mainly involves further mining the features of key frames, objects, texts, and movements, which are the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find the patterns of video contents and put videos into predefined categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos. The retrieval results are then optimized through relevance feedback.

Multimedia recommendation recommends specific multimedia contents according to users' preferences. It has proven to be an effective approach to providing personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. Content-based methods identify general features of users or their interests and recommend other contents with similar features to them. These methods rely largely on content similarity measurement, but most of them are troubled by limited analysis and over-specialization. Collaborative-filtering-based methods identify groups with similar interests and recommend contents to group members according to their behavior [132]. More recently, hybrid methods have been introduced, which integrate the advantages of the two aforementioned types of methods to improve recommendation quality [133].
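The collaborative-filtering idea can be sketched in a few lines. The example below is a generic user-based variant with cosine similarity, not the method of [132] or [133]; the user names, video identifiers, ratings, and the `recommend` helper are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse rating dicts {item: rating}."""
    shared = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(ratings, user, top_n=2):
    """Rank items the user has not rated by similarity-weighted ratings of other users."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        if sim <= 0:
            continue  # users with no overlapping taste contribute nothing
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    predicted = {item: scores[item] / weights[item] for item in scores}
    return sorted(predicted, key=predicted.get, reverse=True)[:top_n]


# Hypothetical ratings on a 1-5 scale.
ratings = {
    "alice": {"video_a": 5, "video_b": 4},
    "bob": {"video_a": 5, "video_b": 5, "video_c": 4},
    "carol": {"video_a": 1, "video_d": 5},
    "dave": {"video_b": 4, "video_c": 5, "video_d": 1},
}
suggestions = recommend(ratings, "alice")
```

Here `alice` is most similar to `bob`, so items that `bob` rated highly dominate her suggestions. A content-based system would instead compare item features, and a hybrid system would combine both signals.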

The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains text descriptions related to concepts and video examples [134]. In [135], the author proposed a new algorithm for special multimedia event detection using a few positive training examples. The research on video event detection is still in its infancy and mainly focuses on sports or news events, running or abnormal events in surveillance videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graph structures describing the communications between two entities. The content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. From the data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

The research on link-based structural analysis has focused on link prediction, community discovery, social network evolution, and social influence analysis. SNS may be visualized as graphs, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertexes and edges are continually added to the graphs. Link prediction aims to predict the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and utilizes the existing link information to generate binary classifiers that predict future links [139]. Probabilistic methods aim to build models for connection probabilities among vertexes in SNS [140]. Linear algebra methods compute the similarity between two vertexes according to the singular similar matrix [141]. A community is represented by a sub-graph in which the edges connecting vertexes within the sub-graph have high density, while the edges between two sub-graphs have much lower density [142].
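As a rough illustration of link prediction, the sketch below scores every unconnected pair of vertexes by the Jaccard overlap of their neighbor sets. This neighborhood heuristic is a simpler stand-in chosen for brevity, not one of the cited techniques [139–141], although such a score could serve as one input feature for a feature-based classifier. The friendship graph and names are hypothetical.

```python
from itertools import combinations


def jaccard_link_scores(edges):
    """Score each unconnected vertex pair by the Jaccard overlap of their neighbor sets."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    scores = {}
    for u, v in combinations(sorted(neighbors), 2):
        if v in neighbors[u]:
            continue  # already linked, nothing to predict
        common = neighbors[u] & neighbors[v]
        union = neighbors[u] | neighbors[v]
        scores[(u, v)] = len(common) / len(union)
    return scores


# Hypothetical undirected friendship graph.
friendships = [
    ("ann", "bob"), ("ann", "cat"), ("bob", "cat"),
    ("bob", "dan"), ("cat", "dan"), ("dan", "eve"),
]
scores = jaccard_link_scores(friendships)
best_pair = max(scores, key=scores.get)
```

The pair that shares the largest fraction of friends without being connected yet is ranked as the most likely future link.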

Many methods for community detection have been proposed and studied, most of which are topology-based objective functions relying on the concept of capturing community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. The research on SNS evolution aims to look for laws and deduction models to interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generation methods have been proposed to assist network and system design [147].

Social influence refers to the case in which individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, time effects, and characteristics of networks and individuals. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise. For example, the blogosphere contains a large number of spam blogs, and Twitter likewise contains many trivial Tweets. Third, SNS are dynamic networks, which vary and are updated frequently and quickly. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge among different media [150].

6.2.6 Mobile data analysis

By April 2013, Android Apps had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since the research on mobile analysis has only just started, we will only introduce some recent and representative analysis applications in this section.

With the growth in the number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities based on geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest Webchat). Traditional network communities or SNS communities lack online interaction among members, and the communities are active only when members are sitting in front of computers. In contrast, mobile phones can support rich interaction at any time and anywhere. A mobile community is defined as a group of individuals with the same hobbies (e.g., health, safety, and entertainment) who gather on networks, meet to set a common goal, decide on measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build body area networks for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations. In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop a smartphone application that analyzes people's gait when they walk and uses the gait information to unlock the security system [11]. Meanwhile, Robert Delano and Brian Parise from the Georgia Institute of Technology developed an application called iTrem, which monitors human body tremors with a built-in seismograph in a mobile phone, so as to cope with Parkinson's and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business models. In sales planning, after comparing massive data, enterprises can optimize their commodity prices. In operations, enterprises can improve their operational efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor costs. In the supply chain, using big data, enterprises may conduct inventory optimization, logistics optimization, and supplier coordination to narrow the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) utilized data analysis to recognize that activities such as “Multi-times score accumulation” and “score exchange in shops” are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20% of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15% and 7%, respectively. By analyzing customers' transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.
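The selection step of such a warning model can be sketched as follows. This is only an illustration of ranking customers by a drop-out score and keeping the top fraction, not CMB's actual system: the customer IDs and scores are invented, and `top_fraction` is a hypothetical helper. A real model would estimate the scores from transaction records.

```python
import math


def top_fraction(scores, fraction=0.2):
    """Return the IDs with the highest scores, keeping ceil(fraction * n) of them."""
    keep = math.ceil(len(scores) * fraction)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:keep]


# Hypothetical drop-out probabilities for ten customers.
churn_scores = {
    "c01": 0.91, "c02": 0.15, "c03": 0.40, "c04": 0.08, "c05": 0.77,
    "c06": 0.33, "c07": 0.85, "c08": 0.22, "c09": 0.51, "c10": 0.05,
}
at_risk = top_fraction(churn_scores)
```

The resulting list identifies the customers a retention campaign would target first.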

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, along with, more importantly, the age, gender, address, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, the market conditions of their brands, and consumers' behaviors, and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more favorable prices. The credit loan of Alibaba automatically analyzes and judges whether to lend to enterprises based on the acquired enterprise transaction data by virtue of big data technology, with no manual intervention in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion yuan with only about 0.3% bad loans, which is much lower than the ratios of other commercial banks.

6.3.2 Application of IoT based big data

IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises may have profound experience with the application of IoT big data. For example, UPS trucks are equipped with sensors, wireless adapters, and GPS, so that headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project in cooperation between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and the city of Miami, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills this year by identifying and fixing running and leaking water pipes in a timely manner.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and connections among individuals based on an information network. Big data of online SNS mainly comes from instant messages, online social networking, micro blogs, and shared spaces, which represent various user activities. The analysis of big data from online SNS uses computational analytical methods to understand relations in human society, by virtue of theories and methods from mathematics, informatics, sociology, and management science, along three dimensions: network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire value.

– Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preferences, emotions, interests, and demands may be revealed.

– Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies aggregate relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.

The US Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime patterns, and even predict the crime rates in major regions [11].

Fig. 5 Enabling technologies for online social network-oriented big data

In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American Facebook users. According to the analysis, most Facebook users fall in love in their early 20s, get engaged at about 27 years old, and get married at about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted research that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans. The goal is to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting sharp growth or drops in the volume of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation in the official statistics of Indonesia matches the number of Tweets about rice prices on Twitter, as shown in Fig. 6.
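The first of these aspects, detecting a sharp growth or drop in topic volume, can be illustrated with a simple trailing-window z-score. This is a generic sketch and not the method used by Global Pulse: the seven-day window, the threshold of three standard deviations, and the daily mention counts are all assumptions.

```python
from statistics import mean, stdev


def detect_spikes(counts, window=7, threshold=3.0):
    """Flag indexes whose count deviates from the trailing-window mean
    by more than `threshold` sample standard deviations (growth or drop)."""
    flagged = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            continue  # flat history: no scale to compare against
        z = (counts[i] - mu) / sigma
        if abs(z) > threshold:
            flagged.append(i)
    return flagged


# Hypothetical daily counts of Tweets mentioning one topic.
daily_mentions = [100, 104, 98, 101, 97, 103, 99, 102, 250, 100, 101]
spikes = detect_spikes(daily_mentions)
```

Only the day with the sudden jump is flagged. The days after it are not, because the jump itself inflates the trailing standard deviation, which is one known limitation of this simple rule.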

                Generally speaking, the application of big data from online SNS may help to better understand users’ behavior and master the laws of social and economic activities from the following three aspects:

                – Early Warning: to rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

                – Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotion, and preference of users.

                – Real-time Feedback: to acquire groups’ feedback on social activities based on real-time monitoring.

                6.3.4 Applications of healthcare and medical big data

                Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the healthcare business.

                For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment intended to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final result into a highly personalized treatment plan to assess the risk factors and main treatment plans of patients. Doctors may then reduce morbidity by 50% in the next 10 years by prescribing statins and helping patients to lose five pounds of weight, or by advising patients to reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20.

                The Mount Sinai Medical Center in the US utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

                HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and

                Fig. 6 The correlation between Tweets about rice price and food price inflation


                imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through the software development kit (SDK) and open interface.

                6.3.5 Collective intelligence

                With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units and coordinate with mobile networks to distribute sensing tasks and to collect and utilize sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of Crowdsourcing has been successfully applied to geotagged photographs, positioning and navigation, urban road traffic sensing, market forecasting, opinion mining, and other labor-intensive applications.

                Crowdsourcing, a new approach for problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, Crowdsourcing had been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of Crowdsourcing. The main idea of Crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, Crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

                In the big data era, Spatial Crowdsourcing has become a hot topic. The operation framework of Spatial Crowdsourcing is as follows. A user may request the service and resources related to a specified location. Then the mobile users who are willing to participate in the task will move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data will be sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions they provide, it can be forecast that Spatial Crowdsourcing will become more prevalent than traditional Crowdsourcing, e.g., Amazon Turk and Crowdflower.
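The request-and-dispatch step of the Spatial Crowdsourcing framework described above can be illustrated with a minimal sketch. It assumes a simple policy that is not stated in the text: the task goes to the nearest willing worker within a maximum travel distance. The `Worker` class, the `assign_task` function, and the `max_km` parameter are hypothetical names introduced for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Worker:
    worker_id: str
    lat: float
    lon: float
    willing: bool = True  # whether the user agrees to take part in tasks


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to guard against tiny floating-point overshoot above 1.
    return 2 * earth_radius_km * math.asin(math.sqrt(min(1.0, a)))


def assign_task(task_lat, task_lon, workers, max_km=5.0):
    """Pick the nearest willing worker within `max_km`, or None.

    Nearest-worker assignment is an assumed policy for this sketch.
    """
    in_range = []
    for w in workers:
        if not w.willing:
            continue
        dist = haversine_km(w.lat, w.lon, task_lat, task_lon)
        if dist <= max_km:
            in_range.append((dist, w))
    if not in_range:
        return None
    return min(in_range, key=lambda pair: pair[0])[1]


# Synthetic example positions.
workers = [
    Worker("w1", 30.50, 114.40),
    Worker("w2", 30.51, 114.41, willing=False),
    Worker("w3", 31.50, 115.00),
]
chosen = assign_task(30.51, 114.41, workers)
```

A real platform would also weigh worker reliability, task deadlines, and incentives; distance alone is used here to keep the example small.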

                6.3.6 Smart grid

                Smart Grid is the next-generation power grid, constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart Grid related big data are generated from various sources, such as (i) power utilization habits of users; (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nationwide; (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI); (iv) energy market pricing and bidding data; and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as Circuit Breaker Monitors and transformers). Smart Grid brings about the following challenges on exploiting big data:

                – Grid planning: By analyzing data in the Smart Grid, regions that have excessively high electrical load or high power outage frequencies can be identified. Even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an “electric map” according to the big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of the block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides an effective and visual load forecast for power grid planning in a city. Preferential transformation of the power grid facilities may be conducted in blocks with high power outage frequencies and serious overloads, as displayed in the map.

                – Interaction between power generation and power consumption: An ideal power grid shall balance power generation and consumption. However, the traditional power grid is constructed based on the one-directional approach of transmission-transformation-distribution-consumption, which does not allow the generation capacity to be adjusted according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters have been developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which help the supplier read power utilization data every 15 min rather than every month as in the past. Labor cost for meter reading is greatly reduced. Because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price


                according to peak and low periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help realize time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users.

                – The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable energy sources can be efficiently managed, and the electricity generated by the new energy resources can be allocated to regions with electricity shortage. Such energy resources can complement the traditional hydropower and thermal power generation.
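The time-sharing pricing mentioned in the second item above depends on interval meter readings rather than a single monthly total. The following minimal sketch shows how 15-minute readings could be split into peak and off-peak consumption and billed at different rates. The peak window and both tariffs are hypothetical values chosen for illustration; they are not taken from TXU Energy or any real tariff.

```python
from datetime import datetime

# Hypothetical tariff parameters, for illustration only.
PEAK_HOURS = range(17, 21)   # 17:00 to 20:59 counts as the peak period
PEAK_RATE = 0.30             # price per kWh during peak hours
OFF_PEAK_RATE = 0.10         # price per kWh outside peak hours


def time_of_use_bill(readings):
    """Aggregate interval readings into a peak/off-peak bill.

    `readings` is an iterable of (interval_start, kwh) pairs, where
    `interval_start` is a datetime and `kwh` is the energy consumed
    in that 15-minute interval.
    """
    peak_kwh = 0.0
    off_peak_kwh = 0.0
    for interval_start, kwh in readings:
        if interval_start.hour in PEAK_HOURS:
            peak_kwh += kwh
        else:
            off_peak_kwh += kwh
    cost = peak_kwh * PEAK_RATE + off_peak_kwh * OFF_PEAK_RATE
    return {"peak_kwh": peak_kwh, "off_peak_kwh": off_peak_kwh, "cost": cost}


# Synthetic readings: two peak intervals and one night-time interval.
sample = [
    (datetime(2014, 1, 1, 18, 0), 0.5),
    (datetime(2014, 1, 1, 18, 15), 0.5),
    (datetime(2014, 1, 1, 2, 0), 1.0),
]
bill = time_of_use_bill(sample)
```

A monthly reading could only produce the total of 2.0 kWh here; the interval data is what makes the peak and off-peak split, and therefore the differentiated price, possible.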

                7 Conclusion, open issues, and outlook

                In this paper, we review the background and state-of-the-art of big data. Firstly, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture of this exciting area to readers.

                In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

                7.1 Open issues

                The analysis of big data is confronted with many challenges, but the current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

                7.1.1 Theoretical research

                Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, which are discussed below.

                – Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

                – Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to measure the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated once the system is implemented and deployed, which makes it impossible to horizontally compare the advantages and disadvantages of alternative solutions, even before and after the implementation of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

                – Evolution of big data computing modes: This includes memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

                7.1.2 Technology development

                Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, etc., should be fully investigated.

                – Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor that restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.


                – Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor in improving big data computing.

                – Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

                – Processing of big data: As big data research advances, new problems in big data processing arise from traditional data analysis, including (i) data re-utilization: with the increase of data scale, more value may be mined from the re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized so that more value can be mined; and (iii) data exhaust, which means wrong data collected during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.

                7.1.3 Practical implications

                Although there are already many successful big data applications, many practical problems remain to be solved.

                – Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big data oriented databases and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

                – Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

                – Integration and provenance of big data: As discussed, the value acquired from comprehensive utilization of multiple datasets is far higher than the sum of the values of individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance is the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and from different datasets.

                – Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication; big data applications in small and medium-sized businesses; big data applications in government departments; big data services; and human-computer interaction of big data, etc., are all important research problems.

                7.1.4 Data security

                In IT, safety and privacy are always two key concerns. In the big data era, as data volume is growing fast, there are more severe safety risks, while the traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security-related challenges.

                – Big data privacy: Big data privacy includes two aspects: (i) protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc. of users may be more easily acquired, and users may not be aware of it; (ii) personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed a big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher of Skull Security, used an information acquisition tool to acquire data from the public pages of Facebook users who had failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

                – Data quality: Data quality influences big data utilization. Low-quality data wastes transmission and storage resources and has poor usability. There are many factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data


                quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though many measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.

                – Big data safety mechanism: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data could not meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safety communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants’ data should be enabled on the premise of efficiency assurance.

                – Big data application in information security: Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) by analyzing big data in the form of log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc. may also be more easily identified through the analysis of big data.
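The data quality item above names accuracy, completeness, redundancy, and consistency as the main dimensions and calls for automatic detection. A minimal sketch of such a check is given below for three of these dimensions; accuracy is left out because it requires ground-truth values. The metric definitions, the `quality_report` function, and the record layout are assumptions made for illustration, not definitions from the survey.

```python
def quality_report(records, required_fields, is_consistent):
    """Compute simple quality ratios for a list of flat dict records.

    Assumed definitions (illustrative only):
      completeness: share of records with every required field filled
      redundancy:   share of records that exactly repeat an earlier one
      consistency:  share of records accepted by the caller's rule
    Field values must be hashable so duplicates can be detected.
    """
    total = len(records)
    if total == 0:
        raise ValueError("no records to assess")
    complete = consistent = duplicates = 0
    seen = set()
    for rec in records:
        if all(rec.get(field) not in (None, "") for field in required_fields):
            complete += 1
        if is_consistent(rec):
            consistent += 1
        key = tuple(sorted(rec.items()))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return {
        "completeness": complete / total,
        "redundancy": duplicates / total,
        "consistency": consistent / total,
    }


# Synthetic meter records: one duplicate, one missing value, one negative value.
records = [
    {"id": 1, "kwh": 2.0},
    {"id": 1, "kwh": 2.0},
    {"id": 2, "kwh": None},
    {"id": 3, "kwh": -1.0},
]
report = quality_report(
    records,
    required_fields=["id", "kwh"],
    is_consistent=lambda r: r["kwh"] is None or r["kwh"] >= 0,
)
```

Detection is only the first half of the problem stated in the text; repairing the flagged records (for example, by imputing the missing value) would need domain-specific rules.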

                The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage of energy efficiency optimization, and processed hardware and software system architectures, etc. Particularly, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

                7.2 Outlook

                The emergence of big data opens great opportunities. In the IT era, the “T” (Technology) was the main concern, and technology drove the development of data. In the big data era, with the prominence of data value and advances in the “I” (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact but also influence everyone’s ways of living and thinking, which is already happening. We cannot predict the future, but we may take precautions for events that could occur in the future.

                – Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced, given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have sought better ways to cope with larger-scale, higher-diversity, and more complexly structured data. These efforts are represented by Spanner, the globally-distributed database of Google, and F1, a fault-tolerant, expandable, distributed relational database. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to relational databases, and effectively handle data through grammars similar to SQL.

                – Data resource performance: Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

                – Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc. shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

                – Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., as in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a friendly manner can they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc. are frequently used to visualize results of data analysis. New presentation forms will occur in the future; e.g., Microsoft Renlifang, a social search engine,


                utilizes relational diagrams to express interpersonal relationships.

                – Data-oriented: It is well-known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it is observed that the role of data is becoming increasingly significant. In the small-scale data era, in which logic is more complex than data, program design is mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed that of business logic. Consequently, programs are gradually being transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods are certain to emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

                – Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

                – During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

                – Compared with accurate data, we would like to accept numerous and complicated data.

                – We shall pay greater attention to correlations between things rather than exploring causal relationships.

                – The simple algorithms of big data are more effective than the complex algorithms of small data.

                – Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace “experts.”

                Throughout the history of human society, the demands and willingness of human beings have always been the driving forces of scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in the future society.

                Acknowledgments This work was supported by China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao’s research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

                References

                1. Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

                2. Fact sheet: Big data across the federal government (2012). http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf

                3. Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper

                4. Drowning in numbers - digital data will flood the planet - and help us understand it better (2011). http://www.economist.com/blogs/dailychart/2011/11/bigdata-0

                5. Lohr S (2012) The age of big data. New York Times, pp 11

                6. Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/the-digital-breadcrumbs-that-lead-to-big-data

                7. Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-search-for-analysts-to-make-sense-of-big-data

                8. Big data (2008). http://www.nature.com/news/specials/bigdata/index.html

                9. Special online collection: dealing with big data (2011). http://www.sciencemag.org/site/special/data

                10. Manyika J, McKinsey Global Institute, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute

                11. Mayer-Schonberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Eamon Dolan/Houghton Mifflin Harcourt

                12. Laney D (2001) 3-d data management: controlling data volume, velocity and variety. META Group Research Note, 6 February

                13. Zikopoulos P, Eaton C et al (2011) Understanding big data: analytics for enterprise class hadoop and streaming data. McGraw-Hill Osborne Media

                14. Meijer E (2011) The world according to linq. Communications of the ACM 54(10):45–51

                15. Beyer M (2011) Gartner says solving big data challenge involves more than just managing volumes of data. Gartner. http://www.gartner.com/it/page.jsp

                16. O R Team (2011) Big data now: current perspectives from O’Reilly Radar. O’Reilly Media

                17. Grobelnik M (2012) Big data tutorial. http://videolectures.net/eswc2012_grobelnik_big_data

                18. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2008) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014

                19. DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6):85–98


                20. Walter T (2009) Teradata past, present, and future. UCI ISG lecture series on scalable data management

                21. Ghemawat S, Gobioff H, Leung S-T (2003) The google file system. In: ACM SIGOPS Operating Systems Review, vol 37. ACM, pp 29–43

                22. Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

                23. Hey AJG, Tansley S, Tolle KM et al (2009) The fourth paradigm: data-intensive scientific discovery

                24. Howard JH, Kazar ML, Menees SG, Nichols DA, Satyanarayanan M, Sidebotham RN, West MJ (1988) Scale and performance in a distributed file system. ACM Trans Comput Syst (TOCS) 6(1):51–81

                25. Cattell R (2011) Scalable sql and nosql data stores. ACM SIGMOD Record 39(4):12–27

                26. Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033

                27. Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98

                28. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J et al (2012) Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States

                29. Sun Y, Chen M, Liu B, Mao S (2013) Far: a fault-avoidant routing method for data center networks with regular topology. In: Proceedings of ACM/IEEE symposium on architectures for networking and communications systems (ANCS’13). ACM

                30. Wiki (2013) Applications and organizations using hadoop. http://wiki.apache.org/hadoop/PoweredBy

                31 Bahga A Madisetti VK (2012) Analyzing massive machinemaintenance data in a computing cloud IEEE Transac ParallelDistrib Syst 23(10)1831ndash1843

                32 Gunarathne T Wu T-L Choi JY Bae S-H Qiu J (2011)Cloud computing paradigms for pleasingly parallel biomedicalapplications Concurr Comput Prac Experience 23(17)2338ndash2354

                33 Gantz J Reinsel D (2010) The digital universe decade-are youready External publication of IDC (Analyse the Future) informa-tion and data pp 1ndash16

                34 Bryant RE (2011) Data-intensive scalable computing for scien-tific applications Comput Sci Eng 13(6)25ndash33

                35 Wahab MHA Mohd MNH Hanafi HF Mohsin MFM (2008)Data pre-processing on web server logs for generalized asso-ciation rules mining algorithm World Acad Sci Eng Technol482008

                36 Nanopoulos A Manolopoulos Y Zakrzewicz M Morzy T(2002) Indexing web access-logs for pattern queries In Proceed-ings of the 4th international workshop on web information anddata management ACM pp 63ndash68

                37 Joshi KP Joshi A Yesha Y (2003) On using a warehouse toanalyze web logs Distrib Parallel Databases 13(2)161ndash180

                38 Chandramohan V Christensen K (2002) A first look at wiredsensor networks for video surveillance systems In Proceed-ings LCN 2002 27th annual IEEE conference on local computernetworks IEEE pp 728ndash729

                39 Selavo L Wood A Cao Q Sookoor T Liu H Srinivasan A WuY Kang W Stankovic J Young D et al (2007) Luster wirelesssensor network for environmental research In Proceedings ofthe 5th international conference on Embedded networked sensorsystems ACM pp 103ndash116

                40 Barrenetxea G Ingelrest F Schaefer G Vetterli M CouachO Parlange M (2008) Sensorscope out-of-the-box environmen-tal monitoring In Information processing in sensor networks2008 international conference on IPSNrsquo08 IEEE pp 332ndash343

41. Kim Y, Schmid T, Charbiwala ZM, Friedman J, Srivastava MB (2008) Nawms: nonintrusive autonomous water monitoring system. In: Proceedings of the 6th ACM conference on Embedded network sensor systems. ACM, pp 309–322

42. Kim S, Pakzad S, Culler D, Demmel J, Fenves G, Glaser S, Turon M (2007) Health monitoring of civil infrastructures using wireless sensor networks. In: Information Processing in Sensor Networks, 2007, 6th International Symposium on, IPSN 2007. IEEE, pp 254–263

43. Ceriotti M, Mottola L, Picco GP, Murphy AL, Guna S, Corra M, Pozzi M, Zonta D, Zanon P (2009) Monitoring heritage buildings with wireless sensor networks: the Torre Aquila deployment. In: Proceedings of the 2009 International Conference on Information Processing in Sensor Networks. IEEE Computer Society, pp 277–288

44. Tolle G, Polastre J, Szewczyk R, Culler D, Turner N, Tu K, Burgess S, Dawson T, Buonadonna P, Gay D et al (2005) A macroscope in the redwoods. In: Proceedings of the 3rd international conference on embedded networked sensor systems. ACM, pp 51–63

45. Wang F, Liu J (2011) Networked wireless sensor data collection: issues, challenges, and approaches. IEEE Commun Surv Tutor 13(4):673–687

46. Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the 11th international conference on World Wide Web. ACM, pp 124–135

47. Choudhary S, Dincturk ME, Mirtaheri SM, Moosavi A, von Bochmann G, Jourdan G-V, Onut I-V (2012) Crawling rich internet applications: the state of the art. In: CASCON, pp 146–160

48. Ghani N, Dixit S, Wang T-S (2000) On IP-over-WDM integration. IEEE Commun Mag 38(3):72–84

49. Manchester J, Anderson J, Doshi B, Dravida S (1998) IP over SONET. IEEE Commun Mag 36(5):136–142

50. Jinno M, Takara H, Kozicki B (2009) Dynamic optical mesh networks: drivers, challenges and solutions for the future. In: Optical communication, 2009, 35th European conference on, ECOC'09. IEEE, pp 1–4

51. Barroso LA, Holzle U (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synt Lect Comput Archit 4(1):1–108

52. Armstrong J (2009) OFDM for optical communications. J Light Technol 27(3):189–204

53. Shieh W (2011) OFDM for flexible high-speed optical networks. J Light Technol 29(10):1560–1577

54. Cisco data center interconnect design and deployment guide (2010)

55. Greenberg A, Hamilton JR, Jain N, Kandula S, Kim C, Lahiri P, Maltz DA, Patel P, Sengupta S (2009) VL2: a scalable and flexible data center network. In: ACM SIGCOMM computer communication review, vol 39. ACM, pp 51–62

56. Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S (2009) BCube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 39(4):63–74

57. Farrington N, Porter G, Radhakrishnan S, Bazzaz HH, Subramanya V, Fainman Y, Papen G, Vahdat A (2011) Helios: a hybrid electrical/optical switch architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 41(4):339–350

58. Abu-Libdeh H, Costa P, Rowstron A, O'Shea G, Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Comput Commun Rev 40(4):51–62

59. Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what's needed for datacenter network operations. IEEE Commun Mag 48(7):32–39


60. Wang G, Andersen DG, Kaminsky M, Papagiannaki K, Ng TS, Kozuch M, Ryan M (2010) c-Through: part-time optics in data centers. In: ACM SIGCOMM Computer Communication Review, vol 40. ACM, pp 327–338

61. Ye X, Yin Y, Yoo SJB, Mejia P, Proietti R, Akella V (2010) DOS: a scalable optical switch for datacenters. In: Proceedings of the 6th ACM/IEEE symposium on architectures for networking and communications systems. ACM, p 24

62. Singla A, Singh A, Ramachandran K, Xu L, Zhang Y (2010) Proteus: a topology malleable data center network. In: Proceedings of the 9th ACM SIGCOMM workshop on hot topics in networks. ACM, p 8

63. Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P (2011) Energy-efficient design of a scalable optical multiplane interconnection architecture. IEEE J Sel Top Quantum Electron 17(2):377–383

64. Kodi AK, Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance computing (HPC) systems. IEEE J Sel Top Quantum Electron 17(2):384–395

65. Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H (2012) Mirror mirror on the ceiling: flexible wireless links for data centers. ACM SIGCOMM Comput Commun Rev 42(4):443–454

66. Lenzerini M (2002) Data integration: a theoretical perspective. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. ACM, pp 233–246

67. Cafarella MJ, Halevy A, Khoussainova N (2009) Data integration for the relational web. Proc VLDB Endowment 2(1):1090–1101

68. Maletic JI, Marcus A (2000) Data cleansing: beyond integrity analysis. In: IQ. Citeseer, pp 200–209

69. Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1-2):83–113

70. Chen H, Ku W-S, Wang H, Sun M-T (2010) Leveraging spatio-temporal redundancy for RFID data cleansing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 51–62

71. Zhao Z, Ng W (2012) A model-based approach for RFID data stream cleansing. In: Proceedings of the 21st ACM international conference on information and knowledge management. ACM, pp 862–871

72. Khoussainova N, Balazinska M, Suciu D (2008) Probabilistic event extraction from RFID data. In: Data Engineering, 2008, IEEE 24th international conference on, ICDE 2008. IEEE, pp 1480–1482

73. Herbert KG, Wang JTL (2007) Biological data cleaning: a case study. Int J Inf Qual 1(1):60–82

74. Tsai T-H, Lin C-Y (2012) Exploring contextual redundancy in improving object-based video coding for video sensor networks surveillance. IEEE Transac Multmed 14(3):669–682

75. Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 269–278

76. Kamath U, Compton J, Dogan RI, Jong KD, Shehu A (2012) An evolutionary algorithm approach for feature generation from sequence data and its application to DNA splice site prediction. IEEE/ACM Transac Comput Biol Bioinforma (TCBB) 9(5):1387–1398

77. Leung K-S, Lee KH, Wang J-F, Ng EYT, Chan HLY, Tsui SKW, Mok TSK, Tse PC-H, Sung JJ-Y (2011) Data mining on DNA sequences of hepatitis B virus. IEEE/ACM Transac Comput Biol Bioinforma 8(2):428–440

78. Huang Z, Shen H, Liu J, Zhou X (2011) Effective data co-reduction for multimedia similarity search. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, pp 1021–1032

79. Bleiholder J, Naumann F (2008) Data fusion. ACM Comput Surv (CSUR) 41(1):1

80. Brewer EA (2000) Towards robust distributed systems. In: PODC, p 7

81. Gilbert S, Lynch N (2002) Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59

82. McKusick MK, Quinlan S (2009) GFS: evolution on fast-forward. ACM Queue 7(7):10

83. Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J (2008) SCOPE: easy and efficient parallel processing of massive data sets. Proc VLDB Endowment 1(2):1265–1276

84. Beaver D, Kumar S, Li HC, Sobel J, Vajgel P et al (2010) Finding a needle in Haystack: Facebook's photo storage. In: OSDI, vol 10, pp 1–8

85. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: Amazon's highly available key-value store. In: SOSP, vol 7, pp 205–220

86. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web. In: Proceedings of the twenty-ninth annual ACM symposium on theory of computing. ACM, pp 654–663

87. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2):4

88. Burrows M (2006) The Chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th symposium on Operating systems design and implementation. USENIX Association, pp 335–350

89. Lakshman A, Malik P (2009) Cassandra: structured storage system on a P2P network. In: Proceedings of the 28th ACM symposium on principles of distributed computing. ACM, pp 5–5

90. George L (2011) HBase: the definitive guide. O'Reilly Media Inc

91. Judd D (2008) hypertable-0.9.0.4-alpha

92. Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc

93. Crockford D (2006) The application/json media type for JavaScript object notation (JSON)

94. Murty J (2009) Programming Amazon Web Services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc

95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc

96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in MapReduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 975–986

97. Yang H-C, Parker DS (2009) Traverse: simplified indexing on large map-reduce-merge clusters. In: Database systems for advanced applications. Springer, pp 308–322

98. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with Sawzall. Sci Program 13(4):277–298

99. Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U (2009) Building a high-level dataflow system on top of map-reduce: the Pig experience. Proceedings VLDB Endowment 2(2):1414–1425


100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629

101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72

102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14

103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: Parallel and distributed processing, 2008, IEEE international symposium on, IPDPS 2008. IEEE, pp 1–11

104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146

105. Bu Y, Bill H, Balazinska M, Ernst MD (2010) HaLoop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296

106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative MapReduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing. ACM, pp 810–818

107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2

108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: MapReduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing. ACM, p 7

109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) CIEL: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on Networked systems design and implementation, p 9

110. Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York

111. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

112. What analytics, data mining, big data software you used in the past 12 months for a real project? (2012) http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html

113. Berthold MR, Cebron N, Dill F, Gabriel TR, Kotter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer

114. Sallam RL, Richardson J, Hagerty J, Hostmann B (2011) Magic quadrant for business intelligence platforms. CT, Gartner Group, Stamford

115. Beyond the PC. Special Report on Personal Technology (2011)

116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A et al (2011) The iPlant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on Software quality assurance. ACM, pp 70–77

118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers. ACM, pp 277–286

119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Record 34(2):18–26

120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM Sigmod Record 33(1):50–57

121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Transac Manag Inform Syst (TMIS) 3(2):7

122. Manning CD, Schutze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press

123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Transac Neural Netw 13(5):1163–1177

124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11

125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117

126. Konopnicki D, Shmueli O (1995) W3QS: a query system for the world-wide web. In: VLDB, vol 95, pp 54–65

127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640

128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval. ACM, p 2

129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25

130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19

131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819

132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939

133. Barragans-Martinez AB, Costa-Montenegro E, Burguillo JC, Rey-Lopez M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend TV programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311

134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91

135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia. ACM, pp 469–478

136. Hirsch JE (2005) An index to quantify an individual's scientific research output. Proc Natl Acad Sci USA 102(46):16569

137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company

138. Aggarwal CC (2011) An introduction to social network data analytics. Springer

139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1046–1054

140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied Computing. ACM, pp 1115–1116


141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Transac Knowl Discov Data (TKDD) 5(2):10

142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on World wide web. ACM, pp 631–640

143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis. ACM, pp 16–25

144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. ACM, pp 315–321

145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 145–158

146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using Google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 131–144

147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1007–1016

148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 807–816

149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on Web search and data mining. ACM, pp 657–666

150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360

151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (last accessed 5 May 2013)

152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51

153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive RFID, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 2

154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval. ACM, pp 469–478

155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated, multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium. ACM, pp 445–454

156. Tasevski P (2011) Password attacks and generation strategies. Tartu University: Faculty of Mathematics and Computer Sciences



can be generally divided into four phases: data generation, data acquisition, data storage, and data analysis. If we take data as a raw material, data generation and data acquisition are an exploitation process, data storage is a storage process, and data analysis is a production process that utilizes the raw material to create new value.

3.1 Data generation

Data generation is the first step of big data. Taking Internet data as an example, a huge amount of data in the form of search entries, Internet forum posts, chat records, and microblog messages is generated. Such data are closely related to people's daily lives and share the features of high value and low density. Such Internet data may be valueless individually, but through the exploitation of accumulated big data, useful information such as the habits and hobbies of users can be identified, and it is even possible to forecast users' behaviors and emotional moods.

Moreover, generated through longitudinal and/or distributed data sources, datasets are larger in scale, highly diverse, and complex. Such data sources include sensors, videos, clickstreams, and/or all other available data sources. At present, the main sources of big data are the operation and trading information in enterprises, logistic and sensing information in the IoT, human interaction information and position information in the Internet world, and data generated in scientific research. Such information far surpasses the capacities of the IT architectures and infrastructures of existing enterprises, while its real-time requirement also greatly stresses existing computing capacity.

3.1.1 Enterprise data

In 2013, IBM issued Analysis: the Applications of Big Data to the Real World, which indicates that the internal data of enterprises are the main sources of big data. The internal data of enterprises mainly consist of online trading data and online analysis data, most of which are historically static data managed by RDBMSs in a structured manner. In addition, production data, inventory data, sales data, and financial data also constitute enterprise internal data, which aims to capture informationized and data-driven activities in enterprises, so as to record all activities of enterprises in the form of internal data.

Over the past decades, IT and digital data have contributed a lot to improving the profitability of business departments. It is estimated that the business data volume of all companies in the world may double every 1.2 years [10], in which the business turnover through the Internet, enterprises to enterprises, and enterprises to consumers, per day will reach USD 450 billion [33]. The continuously increasing business data volume requires more effective real-time analysis so as to fully harvest its potential. For example, Amazon processes millions of terminal operations and more than 500,000 queries from third-party sellers per day [12]. Walmart processes one million customer trades per hour, and such trading data are imported into a database with a capacity of over 2.5PB [3]. Akamai analyzes 75 million events per day for its targeted advertisements [13].
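To make the scale of these figures more tangible, the following minimal sketch converts the quoted Walmart and Akamai workloads into average per-second rates and shows how a fixed doubling period compounds over time. The function names are hypothetical, the doubling period is left as a parameter (substitute the figure cited in the text), and the averages ignore peak-hour effects.

```python
HOUR = 3600          # seconds in one hour
DAY = 24 * HOUR      # seconds in one day


def growth_factor(elapsed_years: float, doubling_period_years: float) -> float:
    """Factor by which a volume grows if it doubles every doubling_period_years."""
    if doubling_period_years <= 0:
        raise ValueError("doubling period must be positive")
    return 2.0 ** (elapsed_years / doubling_period_years)


def per_second(events: float, window_seconds: float) -> float:
    """Average event rate over a time window."""
    return events / window_seconds


# Figures quoted in the text: one million trades per hour, 75 million events per day.
walmart_rate = per_second(1_000_000, HOUR)    # roughly 278 trades per second
akamai_rate = per_second(75_000_000, DAY)     # roughly 868 events per second

# Illustrative only: a volume that doubles every 2 years grows 8-fold in 6 years.
example_growth = growth_factor(6, doubling_period_years=2)

print(f"Walmart: {walmart_rate:.0f}/s, Akamai: {akamai_rate:.0f}/s, growth: {example_growth:.0f}x")
```

Even as plain averages, rates of hundreds of events per second sustained around the clock illustrate why the text stresses real-time analysis.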

3.1.2 IoT data

As discussed, IoT is an important source of big data. In smart cities constructed based on IoT, big data may come from industry, agriculture, traffic, transportation, medical care, public departments, and families.

According to the processes of data acquisition and transmission in IoT, its network architecture may be divided into three layers: the sensing layer, the network layer, and the application layer. The sensing layer is responsible for data acquisition and mainly consists of sensor networks. The network layer is responsible for information transmission and processing, where short-range transmission may rely on sensor networks and remote transmission depends on the Internet. Finally, the application layer supports specific applications of IoT.
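The three-layer division can be pictured as a simple pipeline. The sketch below is an illustration only, not an implementation of any IoT standard: the names `Reading`, `sensing_layer`, `network_layer`, and `application_layer`, the device identifiers, and the threshold are all hypothetical, and "transmission" is reduced to serializing readings into plain dictionaries.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Reading:
    device_id: str
    value: float


def sensing_layer(devices: Dict[str, Callable[[], float]]) -> List[Reading]:
    """Data acquisition: take one reading from each device (a callable stands in for a sensor)."""
    return [Reading(device_id, read()) for device_id, read in devices.items()]


def network_layer(readings: Iterable[Reading]) -> List[dict]:
    """Transmission: here simply serialization of readings into messages."""
    return [{"device_id": r.device_id, "value": r.value} for r in readings]


def application_layer(messages: Iterable[dict], threshold: float) -> List[str]:
    """One specific application: list the devices whose value exceeds a threshold."""
    return [m["device_id"] for m in messages if m["value"] > threshold]


# Two simulated temperature sensors with fixed readings.
devices = {"thermo-1": lambda: 21.5, "thermo-2": lambda: 38.0}
alerts = application_layer(network_layer(sensing_layer(devices)), threshold=30.0)
print(alerts)
```

Keeping each layer behind its own function mirrors the separation of concerns in the text: the application never touches a sensor directly, and the sensing code knows nothing about how its data are used.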

                  According to characteristics of Internet of Things thedata generated from IoT has the following features

                  ndash Large-scale data in IoT masses of data acquisi-tion equipments are distributedly deployed which mayacquire simple numeric data eg location or complexmultimedia data eg surveillance video In order tomeet the demands of analysis and processing not onlythe currently acquired data but also the historical datawithin a certain time frame should be stored Thereforedata generated by IoT are characterized by large scales

                  – Heterogeneity: because of the variety of data acquisition devices, the acquired data also differ in type, and such data feature heterogeneity.

                  – Strong time and space correlation: in IoT, every data acquisition device is placed at a specific geographic location and every piece of data has a time stamp. The time and space correlation is an important property of data from IoT. During data analysis and processing, time and space are also important dimensions for statistical analysis.

                  – Effective data accounts for only a small portion of the big data: a great quantity of noise may occur during the acquisition and transmission of data in IoT. Among datasets acquired by acquisition devices, only a small amount of abnormal data is valuable. For example, during the acquisition of traffic video, the few video frames that capture violations of traffic regulations and traffic accidents are more valuable than those capturing only the normal flow of traffic.

                  3.1.3 Bio-medical data

                  As a series of high-throughput bio-measurement technologies were developed at the beginning of the 21st century, frontier research in the bio-medicine field also entered the era of big data. By constructing smart, efficient, and accurate analytical models and theoretical systems for bio-medicine applications, the essential governing mechanisms behind complex biological phenomena may be revealed. This may not only determine the future development of bio-medicine, but also give it a leading role in the development of a series of important strategic industries related to the national economy, people's livelihood, and national security, with important applications such as medical care, new drug R&D, and grain production (e.g., transgenic crops).

                  The completion of the HGP (Human Genome Project) and the continued development of sequencing technology also lead to widespread applications of big data in the field. The masses of data generated by gene sequencing go through specialized analysis according to different application demands, are combined with clinical gene diagnosis, and provide valuable information for early diagnosis and personalized treatment of disease. One sequencing of human gene may generate 100 to 600GB of raw data. In the China National Genebank in Shenzhen, there are 1.3 million samples, including 1.15 million human samples and 150,000 animal, plant, and microorganism samples. By the end of 2013, 10 million traceable biological samples will be stored, and by the end of 2015 this figure will reach 30 million. It is predictable that, with the development of bio-medicine technologies, gene sequencing will become faster and more convenient, making the big data of bio-medicine grow continuously.

                  In addition, data generated from clinical medical care and medical R&D also rise quickly. For example, the University of Pittsburgh Medical Center (UPMC) has stored 2TB of such data. Explorys, an American company, provides platforms to collocate clinical data, operation and maintenance data, and financial data. At present, the information of about 13 million people has been collocated, with 44 articles of data at the scale of about 60TB, which will reach 70TB in 2013. Practice Fusion, another American company, manages electronic medical records of about 200,000 patients.

                  Apart from such small and medium-sized enterprises, well-known IT companies such as Google, Microsoft, and IBM have invested extensively in the research and computational analysis of methods related to high-throughput biological big data, seeking shares in the huge market known as the "Next Internet." IBM forecast at the 2013 Strategy Conference that, with the sharp increase of medical images and electronic medical records, medical professionals may utilize big data to extract useful clinical information from masses of data to obtain a medical history and forecast treatment effects, thus improving patient care and reducing cost. It is anticipated that, by 2015, the average data volume of every hospital will increase from 167TB to 665TB.

                  3.1.4 Data generation from other fields

                  As scientific applications are increasing, the scale of datasets is gradually expanding, and the development of some disciplines greatly relies on the analysis of masses of data. Here, we examine several such applications. Although they belong to different scientific fields, the applications have a similar and increasing demand for data analysis. The first example is related to computational biology. GenBank is a nucleotide sequence database maintained by the U.S. National Bio-Technology Innovation Center. Data in this database may double every 10 months. By August 2009, GenBank had more than 250 billion bases from 150,000 different organisms [34]. The second example is related to astronomy. The Sloan Digital Sky Survey (SDSS), the biggest sky survey project in astronomy, recorded 25TB of data from 1998 to 2008. As the resolution of the telescope is improved, by 2004, the data volume generated per night will surpass 20TB. The last application is related to high-energy physics. In the beginning of 2008, the ATLAS experiment of the Large Hadron Collider (LHC) of the European Organization for Nuclear Research generates raw data at 2PB/s and stores about 10TB of processed data per year.

                  In addition, pervasive sensing and computing among nature, commercial, Internet, government, and social environments are generating heterogeneous data with unprecedented complexity. These datasets have their unique data characteristics in scale, time dimension, and data category. For example, mobile data were recorded with respect to positions, movement, approximation degrees, communications, multimedia, use of applications, and audio environment [108]. According to the application environment and requirements, such datasets can be classified into different categories, so as to select the proper and feasible solutions for big data.

                  3.2 Big data acquisition

                  As the second phase of the big data system, big data acquisition includes data collection, data transmission, and data pre-processing. During big data acquisition, once we collect the raw data, we shall utilize an efficient transmission mechanism to send it to a proper storage management system to support different analytical applications. The collected datasets may sometimes include much redundant or useless data, which unnecessarily increases storage space and affects the subsequent data analysis. For example, high redundancy is very common among datasets collected by sensors for environment monitoring. Data compression technology can be applied to reduce the redundancy. Therefore, data pre-processing operations are indispensable to ensure efficient data storage and exploitation.
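The effect of compression on redundant sensor data can be illustrated with a minimal sketch. The readings below are hypothetical (a temperature sensor that reports nearly the same value at every sample), and general-purpose DEFLATE compression from Python's standard `zlib` module is used only as a stand-in for whatever compression technology a real acquisition system would adopt.

```python
import zlib

# Hypothetical environment-monitoring readings: the sensor reports almost
# the same temperature at every sample, so the stream is highly redundant.
readings = [21.5] * 950 + [21.6] * 50
raw = ",".join(f"{r:.1f}" for r in readings).encode("ascii")

# Compress before transmission/storage, decompress before analysis.
compressed = zlib.compress(raw, 9)
restored = zlib.decompress(compressed)

ratio = len(raw) / len(compressed)
print(f"raw={len(raw)} bytes, compressed={len(compressed)} bytes, ratio={ratio:.1f}x")
```

The compression is lossless, so `restored` equals `raw`; the price is extra computation at both ends of the transmission.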

                  3.2.1 Data collection

                  Data collection utilizes special data collection techniques to acquire raw data from a specific data generation environment. Four common data collection methods are as follows:

                  – Log files: As one widely used data collection method, log files are record files automatically generated by the data source system, so as to record activities in designated file formats for subsequent analysis. Log files are used in nearly all digital devices. For example, web servers record in log files the number of clicks, click rates, visits, and other property records of web users [35]. To capture the activities of users at web sites, web servers mainly use the following three log file formats: public log file format (NCSA), expanded log format (W3C), and IIS log format (Microsoft). All three types of log files are in the ASCII text format. Databases, rather than text files, may sometimes be used to store log information to improve the query efficiency of the massive log store [36, 37]. There are also some other log-file-based data collection methods, including stock indicators in financial applications and determination of operating states in network monitoring and traffic management.

                  – Sensing: Sensors are common in daily life to measure physical quantities and transform them into readable digital signals for subsequent processing (and storage). Sensory data may be classified as sound wave, voice, vibration, automobile, chemical, current, weather, pressure, temperature, etc. Sensed information is transferred to a data collection point through wired or wireless networks. For applications that may be easily deployed and managed, e.g., video surveillance systems [38], the wired sensor network is a convenient solution to acquire related information. Sometimes the accurate position of a specific phenomenon is unknown, and sometimes the monitored environment does not have the energy or communication infrastructure. Then wireless communication must be used to enable data transmission among sensor nodes under limited energy and communication capability. In recent years, wireless sensor networks (WSNs) have received considerable interest and have been applied to many applications, such as environmental research [39, 40], water quality monitoring [41], civil engineering [42, 43], and wildlife habitat monitoring [44]. A WSN generally consists of a large number of geographically distributed sensor nodes, each being a micro device powered by a battery. Such sensors are deployed at designated positions as required by the application to collect remote sensing data. Once the sensors are deployed, the base station will send control information for network configuration/management or data collection to sensor nodes. Based on such control information, the sensory data is assembled in different sensor nodes and sent back to the base station for further processing. Interested readers are referred to [45] for more detailed discussions.

                  – Methods for acquiring network data: At present, network data acquisition is accomplished using a combination of a web crawler, word segmentation system, task system, index system, etc. A web crawler is a program used by search engines for downloading and storing web pages [46]. Generally speaking, a web crawler starts from the uniform resource locator (URL) of an initial web page to access other linked web pages, during which it stores and sequences all the retrieved URLs. The web crawler acquires a URL in order of precedence through a URL queue, downloads the web page, identifies all URLs in the downloaded page, and extracts new URLs to be put in the queue. This process is repeated until the web crawler is stopped. Data acquisition through a web crawler is widely applied in applications based on web pages, such as search engines or web caching. Traditional web page extraction technologies feature multiple efficient solutions, and considerable research has been done in this field. As more advanced web page applications are emerging, some extraction strategies are proposed in [47] to cope with rich Internet applications.

                  The current network data acquisition technologies mainly include traditional Libpcap-based packet capture technology, zero-copy packet capture technology, as well as some specialized network monitoring software such as Wireshark, SmartSniff, and WinNetCap.

                  – Libpcap-based packet capture technology: Libpcap (packet capture library) is a widely used network data packet capture function library. It is a general tool that does not depend on any specific system and is mainly used to capture data in the data link layer. It features simplicity, ease of use, and portability, but has a relatively low efficiency. Therefore, under a high-speed network environment, considerable packet losses may occur when Libpcap is used.


                  – Zero-copy packet capture technology: The so-called zero-copy (ZC) means that no copies between any internal memories occur during packet receiving and sending at a node. In sending, the data packets directly start from the user buffer of applications, pass through the network interfaces, and arrive at an external network. In receiving, the network interfaces directly send data packets to the user buffer. The basic idea of zero-copy is to reduce data copy times, reduce system calls, and reduce CPU load while datagrams are passed from network equipment to user program space. The zero-copy technology first utilizes direct memory access (DMA) technology to directly transmit network datagrams to an address space pre-allocated by the system kernel, so as to avoid the participation of the CPU. In the meanwhile, it maps the internal memory of the datagrams in the system kernel to that of the detection program, or builds a cache region in the user space and maps it to the kernel space. Then the detection program directly accesses the internal memory, so as to reduce internal memory copies from system kernel to user space and reduce the number of system calls.

                  – Mobile equipment: At present, mobile devices are more widely used. As mobile device functions become increasingly stronger, they feature more complex and multiple means of data acquisition as well as a greater variety of data. Mobile devices may acquire geographical location information through positioning systems; acquire audio information through microphones; acquire pictures, videos, streetscapes, two-dimensional barcodes, and other multimedia information through cameras; and acquire user gestures and other body language information through touch screens and gravity sensors. Over the years, wireless operators have improved the service level of the mobile Internet by acquiring and analyzing such information. For example, the iPhone itself is a "mobile spy." It may collect wireless data and geographical location information, and then send such information back to Apple Inc. for processing, of which the user is not aware. Apart from Apple, smart phone operating systems such as Android of Google and Windows Phone of Microsoft can also collect information in a similar manner.
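As an illustration of the log-file method at the top of this list, the following minimal sketch parses web server entries in the NCSA common log format and counts hits per requested path. The sample lines, the `parse_line` helper, and the field names are assumptions made for illustration; real servers are often configured with extended formats that carry additional fields.

```python
import re
from collections import Counter

# NCSA common log format: host ident authuser [date] "request" status bytes
CLF = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line):
    """Return a dict of fields, or None if the line is not in the expected format."""
    m = CLF.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["status"] = int(rec["status"])
    rec["size"] = 0 if rec["size"] == "-" else int(rec["size"])
    return rec

# Hypothetical sample entries (the last one is deliberately malformed).
sample = [
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '127.0.0.1 - - [10/Oct/2000:13:55:40 -0700] "GET /missing.html HTTP/1.0" 404 -',
    'not a log line',
]

records = [r for r in map(parse_line, sample) if r is not None]
# The request field is "METHOD PATH PROTOCOL"; count hits by PATH.
hits_per_path = Counter(r["request"].split()[1] for r in records)
print(hits_per_path)
```

Malformed lines are skipped rather than raising an error, which is usually the safer default when the log store is massive and partly noisy.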

                  In addition to the aforementioned four data acquisition methods of main data sources, there are many other data collection methods or systems. For example, in scientific experiments, many special tools can be used to collect experimental data, such as magnetic spectrometers and radio telescopes. We may classify data collection methods from different perspectives. From the perspective of data sources, data collection methods can be classified into two categories: collection methods recording through data sources and collection methods recording through other auxiliary tools.
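The queue-driven crawl loop described above for network data acquisition can be sketched as follows. To keep the example self-contained, the "web" is a hypothetical in-memory dictionary and `fetch_links` is a stand-in for downloading a page and extracting its URLs; a real crawler would add HTTP requests, politeness delays, and robots.txt handling.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> URLs linked from that page.
PAGES = {
    "http://example.org/": ["http://example.org/a", "http://example.org/b"],
    "http://example.org/a": ["http://example.org/b", "http://example.org/c"],
    "http://example.org/b": ["http://example.org/"],
    "http://example.org/c": [],
}

def fetch_links(url):
    """Stand-in for downloading a page and identifying all URLs in it."""
    return PAGES.get(url, [])

def crawl(seed, max_pages=100):
    """Visit pages in queue order, starting from the seed URL."""
    queue = deque([seed])   # URL queue, processed first-in first-out
    seen = {seed}           # URLs already queued, to avoid revisiting
    order = []              # pages in the order they were downloaded
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:    # only new URLs are put in the queue
                seen.add(link)
                queue.append(link)
    return order

visited = crawl("http://example.org/")
print(visited)
```

The `max_pages` limit plays the role of the stopping condition; without it, the loop ends only when the queue is empty.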

                  3.2.2 Data transportation

                  Upon the completion of raw data collection, data will be transferred to a data storage infrastructure for processing and analysis. As discussed in Section 2.3, big data is mainly stored in a data center. The data layout should be adjusted to improve computing efficiency or facilitate hardware maintenance. In other words, internal data transmission may occur in the data center. Therefore, data transmission consists of two phases: Inter-DCN transmissions and Intra-DCN transmissions.

                  – Inter-DCN transmissions: Inter-DCN transmissions are from the data source to the data center, which is generally achieved with the existing physical network infrastructure. Because of the rapid growth of traffic demands, the physical network infrastructure in most regions around the world is constituted by high-volume, high-rate, and cost-effective optical fiber transmission systems. Over the past 20 years, advanced management equipment and technologies have been developed, such as IP-based wavelength division multiplexing (WDM) network architecture, to conduct smart control and management of optical fiber networks [48, 49]. WDM is a technology that multiplexes multiple optical carrier signals with different wavelengths and couples them to the same optical fiber of the optical link. In such technology, lasers with different wavelengths carry different signals. So far, the backbone network has been deployed with WDM optical transmission systems with a single-channel rate of 40Gb/s. At present, 100Gb/s commercial interfaces are available, and 100Gb/s systems (or Tb/s systems) will be available in the near future [50]. However, traditional optical transmission technologies are limited by the bandwidth of the electronic bottleneck [51]. Recently, orthogonal frequency-division multiplexing (OFDM), initially designed for wireless systems, is regarded as one of the main candidate technologies for future high-speed optical transmission. OFDM is a multi-carrier parallel transmission technology. It segments a high-speed data flow to transform it into low-speed sub-data-flows to be transmitted over multiple orthogonal sub-carriers [52]. Compared with the fixed channel spacing of WDM, OFDM allows sub-channel frequency spectrums to overlap with each other [53]. Therefore, it is a flexible and efficient optical networking technology.

                  – Intra-DCN transmissions: Intra-DCN transmissions are the data communication flows within data centers. Intra-DCN transmissions depend on the communication mechanism within the data center (i.e., on physical connection plates, chips, internal memories of data servers, network architectures of data centers, and communication protocols). A data center consists of multiple integrated server racks interconnected with its internal connection networks. Nowadays, the internal connection networks of most data centers are fat-tree, two-layer, or three-layer structures based on multi-commodity network flows [51, 54]. In the two-layer topological structure, the racks are connected by 1Gbps top-of-rack (TOR) switches, and such TOR switches are then connected with 10Gbps aggregation switches. The three-layer topological structure is augmented with one layer on top of the two-layer topological structure, and such layer is constituted by 10Gbps or 100Gbps core switches that connect the aggregation switches. There are also other topological structures which aim to improve data center networks [55–58]. Because of the inadequacy of electronic packet switches, it is difficult to increase communication bandwidths while keeping energy consumption low. Over the years, due to the huge success achieved by optical technologies, optical interconnection among the networks in data centers has drawn great interest. Optical interconnection is a high-throughput, low-delay, and low-energy-consumption solution. At present, optical technologies are only used for point-to-point links in data centers. Such optical links provide connection for the switches using low-cost multi-mode fiber (MMF) with a 10Gbps data rate. Optical interconnection (switching in the optical domain) of networks in data centers is a feasible solution, which can provide Tbps-level transmission bandwidth with low energy consumption. Recently, many optical interconnection plans have been proposed for data center networks [59]. Some plans add optical paths to upgrade the existing networks, and other plans completely replace the current switches [59–64]. As a strengthening technology, Zhou et al. in [65] adopt wireless links in the 60GHz frequency band to strengthen wired links. Network virtualization should also be considered to improve the efficiency and utilization of data center networks.
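The property that lets OFDM sub-channels overlap, discussed under Inter-DCN transmissions, is the orthogonality of its sub-carriers over one symbol period. The sketch below is a toy baseband model in discrete time, not an optical system: the symbol length `N`, the chosen sub-carrier indices, and the helper names are illustrative assumptions. It splits a stream into one symbol per sub-carrier, superimposes the modulated sub-carriers, and recovers each symbol by correlation.

```python
import cmath

N = 64  # samples per OFDM symbol (illustrative value)

def subcarrier(k):
    """Samples of the k-th complex sub-carrier over one symbol period."""
    return [cmath.exp(2j * cmath.pi * k * n / N) for n in range(N)]

def inner(a, b):
    """Inner product of two sampled signals."""
    return sum(x * y.conjugate() for x, y in zip(a, b))

# One low-rate symbol per sub-carrier (sub-carrier index -> symbol).
symbols = {1: 1 + 0j, 2: -1 + 0j, 3: 0 + 1j}
carriers = {k: subcarrier(k) for k in symbols}

# Transmitter: superimpose all modulated sub-carriers into one signal.
signal = [sum(symbols[k] * carriers[k][n] for k in symbols) for n in range(N)]

# Receiver: correlating with one sub-carrier cancels all the others.
recovered = {k: inner(signal, carriers[k]) / N for k in symbols}

# Distinct sub-carriers have (numerically) zero inner product.
cross = abs(inner(carriers[1], carriers[2]))
print(recovered, cross)
```

Because the cross terms vanish, the receiver separates the sub-streams even though their spectra overlap, which is the contrast with the fixed, non-overlapping channel spacing of WDM noted in the text.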

                  3.2.3 Data pre-processing

                  Because of the wide variety of data sources, the collected datasets vary with respect to noise, redundancy, consistency, etc., and it is undoubtedly a waste to store meaningless data. In addition, some analytical methods have strict requirements on data quality. Therefore, in order to enable effective data analysis, we shall pre-process data under many circumstances to integrate the data from different sources, which can not only reduce storage expense but also improve analysis accuracy. Some relational data pre-processing techniques are discussed as follows.

                  – Integration: data integration is the cornerstone of modern commercial informatics, which involves the combination of data from different sources and provides users with a uniform view of data [66]. This is a mature research field for traditional databases. Historically, two methods have been widely recognized: data warehouse and data federation. Data warehousing includes a process named ETL (Extract, Transform and Load). Extraction involves connecting source systems, selecting, collecting, analyzing, and processing necessary data. Transformation is the execution of a series of rules to transform the extracted data into standard formats. Loading means importing extracted and transformed data into the target storage infrastructure. Loading is the most complex procedure among the three, which includes operations such as transformation, copy, clearing, standardization, screening, and data organization. In data federation, a virtual database can be built to query and aggregate data from different data sources, but such a database does not contain data. On the contrary, it includes information or metadata related to actual data and its positions. Such two "storage-reading" approaches do not satisfy the high performance requirements of data flows or search programs and applications. Compared with queries, data in such two approaches is more dynamic and must be processed during data transmission. Generally, data integration methods are accompanied with flow processing engines and search engines [30, 67].

                  – Cleaning: data cleaning is a process to identify inaccurate, incomplete, or unreasonable data, and then modify or delete such data to improve data quality. Generally, data cleaning includes five complementary procedures [68]: defining and determining error types, searching and identifying errors, correcting errors, documenting error examples and error types, and modifying data entry procedures to reduce future errors. During cleaning, data formats, completeness, rationality, and restriction shall be inspected. Data cleaning is of vital importance to keep data consistency, and is widely applied in many fields, such as banking, insurance, retail industry, telecommunications, and traffic control.

                  In e-commerce, most data is electronically collected, which may have serious data quality problems. Classic data quality problems mainly come from software defects, customized errors, or system mis-configuration. The authors in [69] discussed data cleaning in e-commerce by crawlers and regularly re-copying customer and account information.

                  In [70], the problem of cleaning RFID data was examined. RFID is widely used in many applications, e.g., inventory management and target tracking. However, the original RFID data features low quality, and includes a lot of abnormal data limited by the physical design and affected by environmental noise. In [71], a probability model was developed to cope with data loss in mobile environments. Khoussainova et al. in [72] proposed a system to automatically correct errors of input data by defining global integrity constraints.

                  Herbert et al. [73] proposed a framework called BIO-AJAX to standardize biological data so as to conduct further computation and improve search quality. With BIO-AJAX, some errors and repetitions may be eliminated, and common data mining technologies can be executed more effectively.

                  – Redundancy elimination: data redundancy refers to data repetitions or surplus, which usually occurs in many datasets. Data redundancy can increase unnecessary data transmission expense and cause defects in storage systems, e.g., waste of storage space, data inconsistency, reduction of data reliability, and data damage. Therefore, various redundancy reduction methods have been proposed, such as redundancy detection, data filtering, and data compression. Such methods may apply to different datasets or application environments. However, redundancy reduction may also bring about certain negative effects. For example, data compression and decompression cause additional computational burden. Therefore, the benefits of redundancy reduction and the cost should be carefully balanced. Data collected from different fields will increasingly appear in image or video formats. It is well-known that images and videos contain considerable redundancy, including temporal redundancy, spatial redundancy, statistical redundancy, and sensing redundancy. Video compression is widely used to reduce redundancy in video data, as specified in the many video coding standards (MPEG-2, MPEG-4, H.263, and H.264/AVC). In [74], the authors investigated the problem of video compression in a video surveillance system with a video sensor network. The authors propose a new MPEG-4 based method by investigating the contextual redundancy related to background and foreground in a scene. The low complexity and the low compression ratio of the proposed approach were demonstrated by the evaluation results.

                  On generalized data transmission or storage, repeated data deletion is a special data compression technology, which aims to eliminate repeated data copies [75]. With repeated data deletion, individual data blocks or data segments will be assigned identifiers (e.g., using a hash algorithm) and stored, with the identifiers added to the identification list. As the analysis of repeated data deletion continues, if a new data block has an identifier that is identical to one listed in the identification list, the new data block will be deemed redundant and will be replaced by the corresponding stored data block. Repeated data deletion can greatly reduce storage requirements, which is particularly important to a big data storage system. Apart from the aforementioned data pre-processing methods, specific data objects shall go through some other operations such as feature extraction. Such operations play an important role in multimedia search and DNA analysis [76–78]. Usually, high-dimensional feature vectors (or high-dimensional feature points) are used to describe such data objects, and the system stores the dimensional feature vectors for future retrieval. Data transfer is usually used to process distributed heterogeneous data sources, especially business datasets [79]. As a matter of fact, in consideration of various datasets, it is non-trivial or impossible to build a uniform data pre-processing procedure and technology that is applicable to all types of datasets; the specific features, problems, performance requirements, and other factors of the datasets should be considered, so as to select a proper data pre-processing strategy.
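The repeated data deletion procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: fixed-size blocks, SHA-256 as the hash algorithm, and an in-memory dictionary serving as the identification list; the function names and the tiny block size are hypothetical, and production systems typically use much larger or variable-size blocks.

```python
import hashlib

def deduplicate(data, block_size=4):
    """Split data into fixed-size blocks and store each distinct block once."""
    store = {}    # identification list: identifier -> stored block
    recipe = []   # ordered identifiers needed to rebuild the original data
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        ident = hashlib.sha256(block).hexdigest()
        if ident not in store:   # new block: keep it
            store[ident] = block
        recipe.append(ident)     # redundant block: only its identifier is kept
    return store, recipe

def rebuild(store, recipe):
    """Reassemble the original data from stored blocks."""
    return b"".join(store[ident] for ident in recipe)

data = b"ABCDABCDABCDXYZ!"
store, recipe = deduplicate(data)
print(len(recipe), "blocks referenced,", len(store), "blocks stored")
```

Here four blocks are referenced but only two are stored, and `rebuild(store, recipe)` returns the original bytes. The trade-off noted in the text applies: hashing every block adds computation in exchange for the storage savings.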

                  4 Big data storage

                  The explosive growth of data imposes stricter requirements on storage and management. In this section, we focus on the storage of big data. Big data storage refers to the storage and management of large-scale datasets while achieving reliability and availability of data access. We will review important issues including massive storage systems, distributed storage systems, and big data storage mechanisms. On one hand, the storage infrastructure needs to provide information storage service with reliable storage space; on the other hand, it must provide a powerful access interface for query and analysis of a large amount of data.

                  Traditionally, as auxiliary equipment of servers, data storage devices are used to store, manage, look up, and analyze data with structured RDBMSs. With the sharp growth of data, data storage devices are becoming increasingly important, and many Internet companies pursue large storage capacity to be competitive. Therefore, there is a compelling need for research on data storage.


                  4.1 Storage system for massive data

                  Various storage systems have emerged to meet the demands of massive data. Existing massive storage technologies can be classified as Direct Attached Storage (DAS) and network storage, while network storage can be further classified into Network Attached Storage (NAS) and Storage Area Network (SAN).

                  In DAS, various hard disks are directly connected with servers, and data management is server-centric, such that storage devices are peripheral equipment, each of which takes a certain amount of I/O resource and is managed by individual application software. For this reason, DAS is only suitable for interconnecting servers at a small scale. However, due to its low scalability, DAS will exhibit undesirable efficiency when the storage capacity is increased, i.e., the upgradeability and expandability are greatly limited. Thus, DAS is mainly used in personal computers and small-sized servers.

                  Network storage utilizes a network to provide users with a unified interface for data access and sharing. Network storage equipment includes special data exchange equipment, disk arrays, tape libraries, and other storage media, as well as special storage software. It is characterized by strong expandability.

                  NAS is actually auxiliary storage equipment of a network. It is directly connected to a network through a hub or switch using TCP/IP protocols. In NAS, data is transmitted in the form of files. Compared to DAS, the I/O burden at a NAS server is reduced extensively, since the server accesses a storage device indirectly through a network.

                  While NAS is network-oriented, SAN is especially designed for data storage with a scalable and bandwidth-intensive network, e.g., a high-speed network with optical fiber connections. In SAN, data storage management is relatively independent within a storage local area network, where multipath-based data switching among any internal nodes is utilized to achieve a maximum degree of data sharing and data management.

                  From the organization of a data storage system, DAS, NAS, and SAN can all be divided into three parts: (i) the disc array, which is the foundation of a storage system and the fundamental guarantee for data storage; (ii) connection and network sub-systems, which provide connection among one or more disc arrays and servers; (iii) storage management software, which handles data sharing, disaster recovery, and other storage management tasks of multiple servers.

                  4.2 Distributed storage system

                  The first challenge brought about by big data is how to develop a large-scale distributed storage system for efficient data processing and analysis. To use a distributed system to store massive data, the following factors should be taken into consideration:

                  – Consistency: a distributed storage system requires multiple servers to cooperatively store data. As there are more servers, the probability of server failures will be larger. Usually, data is divided into multiple pieces to be stored at different servers to ensure availability in case of server failure. However, server failures and parallel storage may cause inconsistency among different copies of the same data. Consistency refers to assuring that multiple copies of the same data are identical.

                  – Availability: a distributed storage system operates in multiple sets of servers. As more servers are used, server failures are inevitable. It would be desirable if the entire system is not seriously affected and can still satisfy customer's requests in terms of reading and writing. This property is called availability.

                  – Partition Tolerance: multiple servers in a distributed storage system are connected by a network. The network could have link/node failures or temporary congestion. The distributed system should have a certain level of tolerance to problems caused by network failures. It would be desirable that the distributed storage still works well when the network is partitioned.

Eric Brewer proposed the CAP theory [80, 81] in 2000, which indicated that a distributed system could not simultaneously meet the requirements on consistency, availability, and partition tolerance; at most two of the three requirements can be satisfied simultaneously. Seth Gilbert and Nancy Lynch from MIT proved the correctness of the CAP theory in 2002. Since consistency, availability, and partition tolerance cannot be achieved simultaneously, we can have a CA system by ignoring partition tolerance, a CP system by ignoring availability, and an AP system that ignores consistency, according to different design goals. The three systems are discussed in the following.

CA systems do not have partition tolerance, i.e., they cannot handle network failures. Therefore, CA systems are generally deemed as storage systems with a single server, such as traditional small-scale relational databases. Such systems keep a single copy of data, so that consistency is easily ensured. Availability is guaranteed by the excellent design of relational databases. However, since CA systems cannot handle network failures, they cannot be expanded to use many servers. Therefore, most large-scale storage systems are CP systems and AP systems.

Compared with CA systems, CP systems ensure partition tolerance. Therefore, CP systems can be expanded to become distributed systems. CP systems generally maintain several copies of the same data in order to ensure a level of fault tolerance. CP systems also ensure data consistency, i.e., multiple copies of the same data are guaranteed to be completely identical. However, CP systems cannot ensure sound availability because of the high cost of consistency assurance. Therefore, CP systems are useful for scenarios with moderate load but stringent requirements on data accuracy (e.g., trading data). BigTable and HBase are two popular CP systems.

AP systems also ensure partition tolerance. However, AP systems differ from CP systems in that they also ensure availability. In exchange, AP systems only ensure eventual consistency rather than the strong consistency of the previous two systems. Therefore, AP systems only apply to scenarios with frequent requests but not very high requirements on accuracy. For example, in online Social Networking Services (SNS) systems, there are many concurrent visits to the data, but a certain amount of data errors is tolerable. Furthermore, because AP systems ensure eventual consistency, accurate data can still be obtained after a certain amount of delay. Therefore, AP systems may also be used in circumstances with no stringent real-time requirements. Dynamo and Cassandra are two popular AP systems.
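The difference between CP and AP behavior during a network partition can be illustrated with a toy model. The following is a minimal sketch, not a description of any real system: the class `ToyStore`, its `mode` flag, and the "highest version wins" reconciliation rule are all assumptions introduced here for illustration.

```python
class Replica:
    """One copy of a single data item, tagged with a version counter."""
    def __init__(self):
        self.value = None
        self.version = 0


class ToyStore:
    """Two replicas of one item; mode is "CP" or "AP" (hypothetical names)."""
    def __init__(self, mode):
        self.mode = mode
        self.replicas = [Replica(), Replica()]
        self.partitioned = False  # True while the replicas cannot reach each other

    def write(self, replica_id, value):
        if self.partitioned:
            if self.mode == "CP":
                # CP: refuse the write, because not every copy can be updated.
                return False
            # AP: accept the write locally; the copies diverge for a while.
            local = self.replicas[replica_id]
            local.value, local.version = value, local.version + 1
            return True
        # No partition: update every copy together.
        version = max(r.version for r in self.replicas) + 1
        for r in self.replicas:
            r.value, r.version = value, version
        return True

    def read(self, replica_id):
        return self.replicas[replica_id].value

    def heal(self):
        """End the partition and reconcile: the highest version wins (an assumption)."""
        self.partitioned = False
        newest = max(self.replicas, key=lambda r: r.version)
        for r in self.replicas:
            r.value, r.version = newest.value, newest.version
```

In this sketch the CP store sacrifices availability (the write is rejected) while the AP store sacrifices consistency until `heal()` runs, which mirrors the eventual consistency described above.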

4.3 Storage mechanism for big data

Considerable research on big data promotes the development of storage mechanisms for big data. Existing storage mechanisms of big data may be classified into three bottom-up levels: (i) file systems, (ii) databases, and (iii) programming models.

File systems are the foundation of the applications at upper levels. Google's GFS is an expandable distributed file system that supports large-scale, distributed, data-intensive applications [25]. GFS uses cheap commodity servers to achieve fault tolerance and provides customers with high-performance services. GFS supports large-scale file applications with more frequent reading than writing. However, GFS also has some limitations, such as a single point of failure and poor performance for small files. Such limitations have been overcome by Colossus [82], the successor of GFS.

In addition, other companies and researchers also have their own solutions to meet the different demands for storage of big data. For example, HDFS and Kosmosfs are derivatives of the open source codes of GFS. Microsoft developed Cosmos [83] to support its search and advertisement business. Facebook utilizes Haystack [84] to store its large amount of small-sized photos. Taobao also developed TFS and FastDFS. In conclusion, distributed file systems have become relatively mature after years of development and business operation. Therefore, we will focus on the other two levels in the rest of this section.

4.3.1 Database technology

Database technology has been evolving for more than 30 years. Various database systems have been developed to handle datasets at different scales and to support various applications. Traditional relational databases cannot meet the challenges on categories and scales brought about by big data. NoSQL databases (i.e., non-traditional relational databases) are becoming more popular for big data storage. NoSQL databases feature flexible schemas, support for simple and easy replication, simple APIs, eventual consistency, and support for large data volumes. NoSQL databases are becoming the core technology for big data. We will examine the following three main NoSQL databases in this section: key-value databases, column-oriented databases, and document-oriented databases, each based on certain data models.

– Key-value Databases: Key-value databases are built on a simple data model in which data is stored as key-value pairs. Every key is unique, and customers may query values according to the keys. Such databases feature a simple structure, and modern key-value databases are characterized by higher expandability and shorter query response times than relational databases. Over the past few years, many key-value databases have appeared, motivated by Amazon's Dynamo system [85]. We will introduce Dynamo and several other representative key-value databases.

– Dynamo: Dynamo is a highly available and expandable distributed key-value data storage system. It is used to store and manage the status of some core services of the Amazon e-Commerce Platform that can be realized with key access. The public mode of relational databases may generate invalid data and limit data scale and availability, while Dynamo can resolve these problems with a simple key-object interface, which consists of simple read and write operations. Dynamo achieves elasticity and availability through data partitioning, data replication, and object versioning mechanisms. The Dynamo partitioning scheme relies on Consistent Hashing [86] to divide the load among multiple main storage machines; its main advantage is that the arrival or departure of a node only affects directly adjacent nodes and does not affect other nodes. Dynamo copies data to N sets of servers, in which N is a configurable parameter, in order to achieve high availability and durability. The Dynamo system also provides eventual consistency, so as to conduct asynchronous updates on all copies.

– Voldemort: Voldemort is also a key-value storage system, which was initially developed for and is still used by LinkedIn. Keys and values in Voldemort are composite objects constituted by lists and maps. The Voldemort interface includes three simple operations: reading, writing, and deletion, all of which are identified by keys. Voldemort provides multi-version concurrency control with asynchronous updating, but does not ensure data consistency. However, Voldemort supports optimistic locking for consistent multi-record updating: when a conflict happens between the update and any other operation, the update operation quits. The data replication mechanism of Voldemort is the same as that of Dynamo. Voldemort not only stores data in RAM, but also allows data to be inserted into a storage engine. In particular, Voldemort supports two storage engines: Berkeley DB and Random Access Files.

The key-value database emerged a few years ago. Deeply influenced by Amazon Dynamo DB, other key-value storage systems include Redis, Tokyo Cabinet and Tokyo Tyrant, Memcached and MemcacheDB, Riak, and Scalaris, all of which provide expandability by distributing keys among nodes. Voldemort, Riak, Tokyo Cabinet, and Memcached can utilize attached storage devices to store data in RAM or on disks. Other storage systems store data in RAM and provide disk backup, or rely on replication and recovery to avoid backup.
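The consistent hashing scheme described for Dynamo above can be sketched as follows. This is a minimal illustration under stated assumptions, not Dynamo's actual implementation: the class `HashRing`, the use of MD5 as the ring hash, and the method name `preference_list` are choices made here, and virtual nodes are omitted for brevity.

```python
import bisect
import hashlib


def ring_hash(key: str) -> int:
    """Map a string to a position on the ring (MD5 is an arbitrary choice here)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, n_replicas=3):
        self.n_replicas = n_replicas  # the configurable parameter N
        self._ring = sorted((ring_hash(n), n) for n in nodes)

    def add(self, node):
        bisect.insort(self._ring, (ring_hash(node), node))

    def remove(self, node):
        self._ring.remove((ring_hash(node), node))

    def preference_list(self, key):
        """Return the first N nodes met when walking clockwise from the key."""
        start = bisect.bisect(self._ring, (ring_hash(key), ""))
        count = min(self.n_replicas, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]
```

When a node is removed, only the keys whose preference list contained that node are reassigned; keys owned by other nodes keep their primary owner, which is the locality property noted above.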

– Column-oriented Databases: Column-oriented databases store and process data by columns rather than rows. Both columns and rows are segmented across multiple nodes to realize expandability. Column-oriented databases are mainly inspired by Google's BigTable. In this section, we first discuss BigTable and then introduce several derivative tools.

– BigTable: BigTable is a distributed, structured data storage system, which is designed to process large-scale (PB class) data among thousands of commercial servers [87]. The basic data structure of BigTable is a sparse, distributed, persistent, multi-dimensional sorted map. The map is indexed by row key, column key, and timestamp, and every value in the map is an uninterpreted byte array. Each row key in BigTable is a character string of up to 64 KB. Rows are stored in lexicographical order and continually segmented into Tablets (i.e., units of distribution) for load balancing. Thus, reading a short range of rows can be highly efficient, since it only involves communication with a small portion of machines. The columns are grouped according to the prefixes of their keys, thus forming column families. These column families are the basic units for access control. The timestamps are 64-bit integers that distinguish different versions of cell values. Clients may flexibly determine the number of cell versions stored. These versions are sequenced in descending order of timestamps, so the latest version is always read first.

The BigTable API features the creation and deletion of tables and column families, as well as modification of the metadata of clusters, tables, and column families. Client applications may insert or delete values in BigTable, query values from columns, or browse sub-datasets in a table. BigTable also supports some other characteristics, such as transaction processing in a single row. Users may utilize such features to conduct more complex data processing.

Every BigTable deployment includes three main components: the Master server, Tablet servers, and the client library. BigTable allows only one Master server, which is responsible for assigning Tablets to Tablet servers, detecting added or removed Tablet servers, and conducting load balancing. In addition, it can modify the BigTable schema, e.g., creating tables and column families, and it collects garbage saved in GFS, i.e., deleted or disabled files used by specific BigTable instances. Every Tablet server manages a set of Tablets and is responsible for the reading and writing of the Tablets it has loaded. When Tablets become too big, they are segmented by the server. The application client library is used to communicate with BigTable instances.

BigTable is based on many fundamental components of Google, including GFS [25], the cluster management system, the SSTable file format, and Chubby [88]. GFS is used to store data and log files. The cluster management system is responsible for task scheduling, resource sharing, processing of machine failures, and monitoring of machine statuses. The SSTable file format is used to store BigTable data internally, and it provides a persistent, ordered, and immutable mapping between keys and values, both of which are arbitrary byte strings. BigTable utilizes Chubby for the following tasks: 1) ensure there is at most one active Master copy at any time; 2) store the bootstrap location of BigTable data; 3) look up Tablet servers; 4) conduct error recovery in case of Tablet server failures; 5) store BigTable schema information; 6) store the access control table.

– Cassandra: Cassandra is a distributed storage system that manages huge amounts of structured data distributed among multiple commercial servers [89]. The system was developed by Facebook and became an open source tool in 2008. It adopts the ideas and concepts of both Amazon Dynamo and Google BigTable, especially integrating the distributed system technology of Dynamo with the BigTable data model. Tables in Cassandra are in the form of a distributed four-dimensional structured map, where the four dimensions are row, column, column family, and super column. A row is identified by a string key of arbitrary length. No matter how many columns are read or written, the operation on a row is atomic. Columns may be grouped into clusters called column families, which are similar to those in the data model of BigTable. Cassandra provides two kinds of column families: column families and super columns. A super column includes an arbitrary number of columns related to the same name. A column family includes columns and super columns, which may be continuously inserted into the column family during runtime. The partitioning and replication mechanisms of Cassandra are very similar to those of Dynamo, so as to achieve consistency.

– Derivative tools of BigTable: since the BigTable code cannot be obtained through an open source license, some open source projects compete to implement the BigTable concept and develop similar systems, such as HBase and Hypertable.

HBase is a BigTable clone programmed in Java and is a part of Hadoop, Apache's MapReduce framework [90]. HBase replaces GFS with HDFS. It writes updated contents into RAM and regularly writes them into files on disks. Row operations are atomic, equipped with row-level locking and transaction processing, which is optional for large scale. Partitioning and distribution are operated transparently and leave room for client-side hashing or fixed keys.

HyperTable was developed along the lines of BigTable to obtain a set of high-performance, expandable, distributed storage and processing systems for structured and unstructured data [91]. HyperTable relies on distributed file systems, e.g., HDFS, and a distributed lock manager. Its data representation, processing, and partitioning mechanisms are similar to those of BigTable. HyperTable has its own query language, called HyperTable query language (HQL), which allows users to create, modify, and query underlying tables.

Since the column-oriented storage databases mainly emulate BigTable, their designs are all similar, except for the concurrency mechanism and several other features. For example, Cassandra emphasizes weak consistency with multi-version concurrency control, while HBase and HyperTable focus on strong consistency through locks or log records.
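The BigTable data model described above, a sorted map indexed by row key, column key, and timestamp, can be sketched in a few lines. This is a single-process toy for illustration only; the class `ToyTable`, its method names, and the `max_versions` parameter are assumptions introduced here, and nothing in it reflects how BigTable distributes Tablets.

```python
from collections import defaultdict


class ToyTable:
    """(row key, column key, timestamp) -> bytes, with versions kept newest first."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions  # number of cell versions to retain
        # row key -> column key ("family:qualifier") -> [(timestamp, value), ...]
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, timestamp, value: bytes):
        cells = self._rows[row][column]
        cells.append((timestamp, value))
        cells.sort(key=lambda cell: cell[0], reverse=True)  # descending timestamps
        del cells[self.max_versions:]                       # drop the oldest versions

    def get(self, row, column):
        """Return the latest version of a cell, or None if the cell is absent."""
        cells = self._rows.get(row, {}).get(column, [])
        return cells[0][1] if cells else None

    def scan(self, start_row, end_row):
        """Yield row keys in lexicographical order within [start_row, end_row)."""
        for row in sorted(self._rows):
            if start_row <= row < end_row:
                yield row
```

Because rows are visited in lexicographical order, keys that share a prefix (for example reversed host names) are adjacent, which is why reading a short range of rows touches only a few machines in the real system.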

– Document Databases: Compared with key-value storage, document storage can support more complex data forms. Since documents do not follow strict schemas, there is no need to conduct schema migration. In addition, key-value pairs can still be saved. We will examine three important representatives of document storage systems, i.e., MongoDB, SimpleDB, and CouchDB.

– MongoDB: MongoDB is an open-source, document-oriented database [92]. MongoDB stores documents as Binary JSON (BSON) objects [93], which are similar to objects. Every document has an ID field as the primary key. Queries in MongoDB are expressed with a syntax similar to JSON. A database driver sends the query as a BSON object to MongoDB. The system allows queries on all documents, including embedded objects and arrays. To enable rapid queries, indexes can be created on the queryable fields of documents. Replication in MongoDB is carried out with log files on the master nodes, which record all the high-level operations conducted in the database. During replication, the slaves query the master for all write operations since their last synchronization and execute the operations from the log files on their local databases. MongoDB supports horizontal expansion with automatic sharding to distribute data among thousands of nodes, with automatic load balancing and failover.


– SimpleDB: SimpleDB is a distributed database and a web service of Amazon [94]. Data in SimpleDB is organized into various domains, in which data may be stored, acquired, and queried. Domains include different properties and name/value pair sets of items. Data is copied to different machines at different data centers in order to ensure data safety and improve performance. This system does not support automatic partitioning and thus cannot be expanded as the data volume changes. SimpleDB allows users to query with SQL. It is worth noting that SimpleDB can assure eventual consistency but does not support Multi-Version Concurrency Control (MVCC). Therefore, conflicts therein cannot be detected from the client side.

– CouchDB: Apache CouchDB is a document-oriented database written in Erlang [95]. Data in CouchDB is organized into documents consisting of fields named by keys/names and values, which are stored and accessed as JSON objects. Every document is provided with a unique identifier. CouchDB allows access to database documents through a RESTful HTTP API. If a document needs to be modified, the client must download the entire document, modify it, and then send it back to the database. After a document is rewritten once, its identifier is updated. CouchDB utilizes optimistic replication to obtain scalability without a sharding mechanism. Since various CouchDB instances may be executed along with other transactions simultaneously, any kind of replication topology can be built. The consistency of CouchDB relies on the replication mechanism. CouchDB supports MVCC with historical hash records.
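The download-modify-resubmit cycle described for CouchDB above is a form of optimistic concurrency control. The following in-memory sketch shows the idea under stated assumptions: the class `ToyDocStore`, the exception `Conflict`, and the "counter-hash" revision format are illustrative choices made here (loosely modeled on revision identifiers), and no HTTP API is involved.

```python
import hashlib
import json


class Conflict(Exception):
    """Raised when a writer submits a document based on a stale revision."""


class ToyDocStore:
    def __init__(self):
        self._docs = {}  # document id -> (revision string, document dict)

    @staticmethod
    def _next_rev(prev_rev, doc):
        # Revision = increasing counter plus a hash of the content (an assumption).
        counter = int(prev_rev.split("-")[0]) + 1 if prev_rev else 1
        body = json.dumps(doc, sort_keys=True).encode()
        return f"{counter}-{hashlib.md5(body).hexdigest()}"

    def get(self, doc_id):
        """Download the whole document together with its current revision."""
        rev, doc = self._docs[doc_id]
        return rev, dict(doc)

    def put(self, doc_id, doc, rev=None):
        """Store the whole document; the caller must name the revision it read."""
        current = self._docs.get(doc_id)
        current_rev = current[0] if current else None
        if rev != current_rev:
            raise Conflict(f"expected revision {current_rev}, got {rev}")
        new_rev = self._next_rev(current_rev, doc)
        self._docs[doc_id] = (new_rev, dict(doc))
        return new_rev
```

A client that read an older revision receives `Conflict` instead of silently overwriting a newer version, so conflicts become detectable on the client side.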

Big data are generally stored in hundreds and even thousands of commercial servers. Thus, the traditional parallel models, such as Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), may not be adequate to support such large-scale parallel programs. Recently, some proposed parallel programming models have effectively improved the performance of NoSQL and reduced the performance gap to relational databases. Therefore, these models have become the cornerstone for the analysis of massive data.

– MapReduce: MapReduce [22] is a simple but powerful programming model for large-scale computing that uses a large number of clusters of commercial PCs to achieve automatic parallel processing and distribution. The MapReduce computing model has only two functions, i.e., Map and Reduce, both of which are programmed by users. The Map function processes input key-value pairs and generates intermediate key-value pairs. Then, MapReduce combines all the intermediate values related to the same key and transmits them to the Reduce function, which further compresses the value set into a smaller set. MapReduce has the advantage that it avoids the complicated steps of developing parallel applications, e.g., data scheduling, fault tolerance, and inter-node communications. The user only needs to program the two functions to develop a parallel application. The initial MapReduce framework did not support multiple datasets in a task, which has been mitigated by some recent enhancements [96, 97].

Over the past decades, programmers have become familiar with the advanced declarative language SQL, often used in relational databases for task description and dataset analysis. However, the succinct MapReduce framework only provides two nontransparent functions, which cannot cover all the common operations. Therefore, programmers have to spend time on programming basic functions, which are typically hard to maintain and reuse. In order to improve programming efficiency, some advanced language systems have been proposed, e.g., Sawzall [98] of Google, Pig Latin [99] of Yahoo, Hive [100] of Facebook, and Scope [87] of Microsoft.
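The Map and Reduce functions described above can be illustrated with the classic word-count task. The sketch below runs in a single process and only imitates the data flow (map, group by key, reduce); the function names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical, and none of the scheduling, distribution, or fault tolerance of a real framework is modeled.

```python
from collections import defaultdict


def map_fn(_doc_id, text):
    """Map: emit an intermediate (word, 1) pair for every word in the input."""
    for word in text.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """Reduce: compress all values of one key into a single total."""
    yield word, sum(counts)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Run map over (key, value) inputs, group by intermediate key, then reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for inter_key, inter_value in map_fn(key, value):
            groups[inter_key].append(inter_value)
    output = {}
    for inter_key in sorted(groups):
        for out_key, out_value in reduce_fn(inter_key, groups[inter_key]):
            output[out_key] = out_value
    return output
```

For example, `run_mapreduce([("d1", "big data big"), ("d2", "Data")], map_fn, reduce_fn)` groups the intermediate pairs by word before summing them. The user writes only the two functions; everything in `run_mapreduce` stands in for what the framework would do.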

– Dryad: Dryad [101] is a general-purpose distributed execution engine for processing parallel applications of coarse-grained data. The operational structure of Dryad is a directed acyclic graph, in which vertexes represent programs and edges represent data channels. Dryad executes operations on the vertexes in clusters and transmits data via data channels, including files, TCP connections, and shared-memory FIFOs. During operation, resources in a logic operation graph are automatically mapped to physical resources.

The operation structure of Dryad is coordinated by a central program called the job manager, which can be executed in clusters or on workstations through the network. A job manager consists of two parts: 1) application codes, which are used to build a job communication graph, and 2) program library codes, which are used to arrange available resources. All kinds of data are directly transmitted between vertexes. Therefore, the job manager is only responsible for decision-making and does not obstruct any data transmission.

In Dryad, application developers can flexibly choose any directed acyclic graph to describe the communication modes of the application and to express data transmission mechanisms. In addition, Dryad allows vertexes to use any amount of input and output data, while MapReduce supports only one input and output set. DryadLINQ [102] is the advanced language of Dryad and is used to integrate the aforementioned SQL-like language execution environment.
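The directed acyclic graph structure described for Dryad above, with programs on vertexes and data flowing along edges, can be sketched as a small single-process executor. This is an illustration only, not Dryad's API: the function `run_dag` and its argument layout are assumptions made here, and the standard-library `graphlib` module (Python 3.9+) supplies the topological order.

```python
from graphlib import TopologicalSorter


def run_dag(vertices, edges):
    """Run each vertex program once all of its input channels have data.

    vertices: dict mapping a vertex name to a function that takes a list of inputs
    edges:    list of (source, destination) pairs, one per data channel
    """
    predecessors = {name: [] for name in vertices}
    for source, destination in edges:
        predecessors[destination].append(source)

    results = {}
    # static_order() yields every vertex after all of its predecessors.
    for name in TopologicalSorter(predecessors).static_order():
        inputs = [results[p] for p in predecessors[name]]
        results[name] = vertices[name](inputs)
    return results


# A vertex may have any number of inputs, unlike a single Map or Reduce stage.
example_vertices = {
    "read_a": lambda _inputs: [3, 1, 2],
    "read_b": lambda _inputs: [5, 4],
    "merge": lambda inputs: sorted(inputs[0] + inputs[1]),
    "total": lambda inputs: sum(inputs[0]),
}
example_edges = [("read_a", "merge"), ("read_b", "merge"), ("merge", "total")]
```

Calling `run_dag(example_vertices, example_edges)` executes the two readers, then the merge, then the total. In the sketch the executor only decides the order; the values pass directly from one vertex result to the next, loosely mirroring the job manager's decision-only role.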

– All-Pairs: All-Pairs [103] is a system specially designed for biometrics, bio-informatics, and data mining applications. It focuses on comparing element pairs in two datasets by a given function. All-Pairs can be expressed as a three-tuple (Set A, Set B, and Function F), in which Function F is utilized to compare all elements in Set A and Set B. The comparison result is an output matrix M, which is also called the Cartesian product or cross join of Set A and Set B.

All-Pairs is implemented in four phases: system modeling, distribution of input data, batch job management, and result collection. In Phase I, an approximation model of system performance is built to evaluate how much CPU resource is needed and how to conduct job partitioning. In Phase II, a spanning tree is built for data transmissions, which lets the workload of every partition retrieve input data effectively. In Phase III, after the data flow is delivered to the proper nodes, the All-Pairs engine builds a batch-processing submission for jobs in partitions, while sequencing them in the batch processing system and formulating a node running command to acquire data. In the last phase, after the batch processing system completes the jobs, the extraction engine collects the results and combines them in a proper structure, which is generally a single file list in which all results are put in order.
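The three-tuple (Set A, Set B, Function F) described above amounts to filling a matrix M with M[i][j] = F(A[i], B[j]). A minimal sequential sketch follows; the function names `all_pairs` and `hamming` are hypothetical, Hamming distance is just one possible comparison function, and the four distribution phases are not modeled.

```python
def all_pairs(set_a, set_b, f):
    """Return the matrix M with M[i][j] = f(set_a[i], set_b[j])."""
    return [[f(a, b) for b in set_b] for a in set_a]


def hamming(x, y):
    """Example comparison function: number of differing positions."""
    return sum(c1 != c2 for c1, c2 in zip(x, y))
```

The number of comparisons is |A| x |B|, which is why the real system spends effort on modeling, partitioning, and data distribution before running the jobs.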

– Pregel: The Pregel [104] system of Google facilitates the processing of large-sized graphs, e.g., analysis of network graphs and social networking services. A computational task is expressed by a directed graph constituted by vertexes and directed edges. Every vertex is associated with a modifiable, user-defined value, and every directed edge associated with a source vertex consists of a user-defined value and the identifier of a target vertex. When the graph is built, the program conducts iterative calculations, called supersteps, among which global synchronization points are set, until algorithm completion and output completion. In every superstep, vertex computations are parallel, and every vertex executes the same user-defined function to express a given algorithm logic. Every vertex may modify its own status and that of its output edges, receive messages sent in the previous superstep, send messages to other vertexes, and even modify the topological structure of the entire graph. Edges are not provided with corresponding computations. Every vertex may deactivate itself by suspension. When all vertexes are in an inactive status without any message to transmit, the entire program execution is completed.

The Pregel program output is a set consisting of the values output from all the vertexes. Generally speaking, the input and output of a Pregel program are isomorphic directed graphs.
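The superstep model described for Pregel above can be illustrated with a standard toy task: propagating the maximum vertex value through a graph. The sketch below is a sequential imitation under stated assumptions; the function `pregel_max` and its data layout are hypothetical, and the real system runs the vertex computations in parallel with a synchronization barrier between supersteps.

```python
def pregel_max(values, out_edges):
    """Propagate the maximum value along directed edges, one superstep at a time.

    values:    dict mapping vertex -> initial value
    out_edges: dict mapping vertex -> list of target vertexes
    Returns the final values and the number of supersteps executed.
    """
    values = dict(values)

    # Superstep 0: every vertex is active and sends its value to its neighbors.
    inbox = {v: [] for v in values}
    for vertex, value in values.items():
        for target in out_edges.get(vertex, []):
            inbox[target].append(value)
    supersteps = 1

    # Later supersteps: a vertex is reactivated only by incoming messages.
    while any(inbox.values()):
        outbox = {v: [] for v in values}
        for vertex, messages in inbox.items():
            if messages and max(messages) > values[vertex]:
                values[vertex] = max(messages)
                for target in out_edges.get(vertex, []):
                    outbox[target].append(values[vertex])
            # Otherwise the vertex stays inactive and sends nothing.
        inbox = outbox  # stands in for the global synchronization point
        supersteps += 1
    return values, supersteps
```

Execution stops when no messages are in flight, matching the termination condition above: all vertexes inactive and nothing left to transmit.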

Inspired by the above programming models, other research has also focused on programming models for more complex computational tasks, e.g., iterative computations [105, 106], fault-tolerant memory computations [107], incremental computations [108], and flow control decision-making related to data [109].

5 Big data analysis

The analysis of big data mainly involves analytical methods for traditional data and big data, analytical architectures for big data, and software used for mining and analysis of big data. Data analysis is the final and the most important phase in the value chain of big data, with the purpose of extracting useful values and providing suggestions or decisions. Different levels of potential values can be generated through the analysis of datasets in different fields [10]. However, data analysis is a broad area, which frequently changes and is extremely complex. In this section, we introduce the methods, architectures, and tools for big data analysis.

5.1 Traditional data analysis

Traditional data analysis means using proper statistical methods to analyze massive data, to concentrate, extract, and refine useful data hidden in a batch of chaotic datasets, and to identify the inherent law of the subject matter, so as to maximize the value of data. Data analysis plays a huge guidance role in making development plans for a country, understanding customer demands for commerce, and predicting market trends for enterprises. Big data analysis can be deemed as the analysis technique for a special kind of data. Therefore, many traditional data analysis methods may still be utilized for big data analysis. Several representative traditional data analysis methods are examined in the following, many of which are from statistics and computer science.

– Cluster Analysis: a statistical method for grouping objects, specifically, classifying objects according to some features. Cluster analysis is used to differentiate objects with particular features and divide them into categories (clusters) according to these features, such that objects in the same category have high homogeneity while different categories have high heterogeneity. Cluster analysis is an unsupervised learning method that requires no training data.

– Factor Analysis: basically targeted at describing the relation among many elements with only a few factors, i.e., grouping several closely related variables into a factor; the few factors are then used to reveal most of the information in the original data.


– Correlation Analysis: an analytical method for determining the law of relations, such as correlation, correlative dependence, and mutual restriction, among observed phenomena, and accordingly conducting forecast and control. Such relations may be classified into two types: (i) function, reflecting a strict dependence relationship among phenomena, which is also called a definitive dependence relationship; (ii) correlation, covering undetermined or inexact dependence relations, in which a numerical value of one variable may correspond to several numerical values of the other variable, and such numerical values present a regular fluctuation around their mean values.

– Regression Analysis: a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies dependence relationships among variables hidden by randomness. Regression analysis may turn complex and undetermined correlations among variables into simple and regular ones.

– A/B Testing: also called bucket testing. It is a technology for determining how to improve target variables by comparing the tested groups. Big data will require a large number of tests to be executed and analyzed.

– Statistical Analysis: Statistical analysis is based on statistical theory, a branch of applied mathematics. In statistical theory, randomness and uncertainty are modeled with probability theory. Statistical analysis can provide a description and an inference for big data. Descriptive statistical analysis can summarize and describe datasets, while inferential statistical analysis can draw conclusions from data subject to random variations. Statistical analysis is widely applied in the economic and medical care fields [110].

– Data Mining Algorithms: Data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. In 2006, the IEEE International Conference on Data Mining Series (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [111], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, and CART, etc. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and link mining, all of which are among the most important problems in data mining research.
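As a concrete instance of the cluster analysis described above, the k-means algorithm named in the ICDM list can be sketched for one-dimensional data. This is a minimal illustration, not a production implementation: the function `kmeans_1d` is hypothetical, the initial centroids are supplied by the caller rather than chosen randomly, and an empty cluster simply keeps its previous centroid.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if updated == centroids:
            break  # converged
        centroids = updated
    return centroids, clusters
```

No labels are supplied at any point, which is what makes the method unsupervised: the groups emerge from the distances between the objects alone.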

5.2 Big data analytic methods

At the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data, so as to bring value to enterprises and individuals. At present, the main processing methods of big data are as follows.

– Bloom Filter: A Bloom Filter consists of a series of hash functions. The principle of the Bloom Filter is to store hash values of data rather than the data itself by utilizing a bit array, which is in essence a bitmap index that uses hash functions to conduct lossy compression storage of data. It has such advantages as high space efficiency and high query speed, but also has some disadvantages in misrecognition (false positives) and deletion.

– Hashing: a method that essentially transforms data into shorter fixed-length numerical values or index values. Hashing has such advantages as rapid reading and writing and high query speed, but it is hard to find a sound hash function.

– Index: an index is always an effective method to reduce the expense of disk reading and writing and to improve insertion, deletion, modification, and query speeds, both in traditional relational databases that manage structured data and in other technologies that manage semi-structured and unstructured data. However, an index has the disadvantage of the additional cost of storing index files, which must be maintained dynamically when data is updated.

– Trie: also called a trie tree, a variant of the hash tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of the trie is to utilize common prefixes of character strings to reduce comparisons on character strings to the greatest extent, so as to improve query efficiency.

– Parallel Computing: compared to traditional serial computing, parallel computing refers to simultaneously utilizing several computing resources to complete a computation task. Its basic idea is to decompose a problem into parts and assign them to several separate processes to be independently completed, so as to achieve co-processing. Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see a comparison in Table 1).
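Of the methods listed above, the Bloom Filter is compact enough to sketch in full. The following is a minimal illustration under stated assumptions: the class `BloomFilter` and its parameters are hypothetical, the hash functions are derived by salting SHA-256 with a seed, and one byte is used per bit for readability rather than space efficiency.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity only

    def _positions(self, item: str):
        """Derive num_hashes array positions from seeded SHA-256 digests."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        """False means definitely absent; True means possibly present."""
        return all(self.bits[pos] for pos in self._positions(item))
```

Only hash positions are stored, never the items, so an added item is always reported as possibly present, while an item that was never added can occasionally be reported as present too. Clearing bits to delete an item would risk erasing bits shared with other items, which is the deletion drawback noted above.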

Although parallel computing systems or tools, such as MapReduce or Dryad, are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages are being developed based on these systems. Such high-level languages include Sawzall, Pig, and Hive used for MapReduce, as well as Scope and DryadLINQ used for Dryad.

5.3 Architecture for big data analysis

Because of the 4Vs of big data, different analytical architectures should be considered for different application requirements.


Table 1 Comparison of MPI, MapReduce, and Dryad

| | MPI | MapReduce | Dryad |
|---|---|---|---|
| Deployment | Computing node and data storage arranged separately (data should be moved to the computing node) | Computing and data storage arranged at the same node (computing should be close to data) | Computing and data storage arranged at the same node (computing should be close to data) |
| Resource management / scheduling | – | Workqueue (Google), HOD (Yahoo) | Not clear |
| Low-level programming | MPI API | MapReduce API | Dryad API |
| High-level programming | – | Pig, Hive, Jaql, ··· | Scope, DryadLINQ |
| Data storage | The local file system, NFS, ··· | GFS (Google), HDFS (Hadoop), KFS, Amazon S3, ··· | NTFS, Cosmos DFS |
| Task partitioning | User manually partitions the tasks | Automation | Automation |
| Communication | Messaging, remote memory access | Files (local FS, DFS) | Files, TCP pipes, shared-memory FIFOs |
| Fault tolerance | Checkpoint | Task re-execution | Task re-execution |

5.3.1 Real-time vs. offline analysis

According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

– Real-time analysis: mainly used in e-commerce and finance. Since data constantly changes, rapid data analysis is needed and analytical results must be returned with a very short delay. The main existing architectures for real-time analysis include (i) parallel processing clusters using traditional relational databases, and (ii) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

– Offline analysis: usually used for applications without strict requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally works by importing logs into a dedicated platform through data acquisition tools. In the big data setting, many Internet enterprises utilize an offline analysis architecture based on Hadoop in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool Timetunnel, and Chukwa of Hadoop. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.

5.3.2 Analysis at different levels

Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive-level analysis, which are examined in the following.

– Memory-level analysis: for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays, the memory of a server cluster surpasses hundreds of GB, and even the TB level is common. Therefore, in-memory database technology may be used, and hot data should reside in memory so as to improve analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis. MongoDB is a representative memory-level analytical architecture. With the development of SSDs (Solid-State Drives), the capacity and performance of memory-level data analysis have been further improved, and it has been widely applied.

– BI analysis: for the case where the data scale surpasses the memory level but the data can still be imported into the BI analysis environment. Current mainstream BI products provide data analysis plans that support data volumes above the TB level.

– Massive analysis: for the case where the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analysis uses HDFS of Hadoop to store data and MapReduce to analyze it. Most massive analysis belongs to the offline analysis category.
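The three levels above amount to a simple decision rule on data volume. The sketch below spells that rule out; the function name and the idea of passing the cluster memory and BI capacity as explicit parameters are assumptions for illustration, and the numbers in the usage lines are made up.

```python
def analysis_level(data_bytes, cluster_memory_bytes, bi_capacity_bytes):
    """Pick an analysis level from the data volume (thresholds are inputs, not fixed constants)."""
    if data_bytes < cluster_memory_bytes:
        return "memory-level"  # data fits in cluster memory
    if data_bytes <= bi_capacity_bytes:
        return "BI"  # too large for memory, but importable into the BI environment
    return "massive"  # beyond BI products and traditional relational databases


TB = 10**12
# Hypothetical cluster: 1 TB of memory, BI environment handling up to 50 TB.
print(analysis_level(0.5 * TB, 1 * TB, 50 * TB))
print(analysis_level(500 * TB, 1 * TB, 50 * TB))
```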


5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithms differ greatly according to the kinds of data and application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the five most widely used tools according to a 2012 KDnuggets survey of 798 professionals, "What Analytics, Data mining, Big Data software that you used in the past 12 months for a real project?" [112].

– R (30.7 %): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. For computing-intensive tasks, code written in C, C++, and Fortran may be called from the R environment. In addition, skilled users can manipulate R objects directly in C. R is an implementation of the S language, an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and drawing plots. Compared to S, R is more popular since it is open source. R ranked first in the KDnuggets 2012 survey. Furthermore, in a 2012 survey of "Design languages you have used for data mining/analysis in the past year", R was also in first place, ahead of SQL and Java. Due to the popularity of R, database manufacturers such as Teradata and Oracle have released products supporting R.

– Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins with powerful functions for data analysis, such as Analysis ToolPak and Solver Add-in, are included, but they can be used only after users enable them. Excel is also the only commercial software among the top five.

– Rapid-I RapidMiner (26.7 %): RapidMiner is open source software used for data mining, machine learning, and predictive analysis. In a KDnuggets survey in 2011, it was used more frequently than R and ranked first. The data mining and machine learning procedures provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment. The data mining flow is described in XML and displayed through a graphical user interface (GUI). RapidMiner is written in Java. It integrates the learners and evaluation methods of Weka and works with R. RapidMiner's functions are implemented by connecting processes composed of various operators. The entire flow can be regarded as a factory production line, with original data as input and model results as output. The operators can be considered as specific functions with different input and output characteristics.

– KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, open-source platform for data integration, data processing, data analysis, and data mining [113]. It allows users to create data flows or data channels in a visualized manner, to selectively run some or all analytical procedures, and provides analytical results, models, and interactive views. KNIME is written in Java and based on Eclipse, and it provides additional functions as plug-ins. Through plug-ins, users can insert processing modules for files, pictures, and time series, and integrate KNIME with various open source projects, e.g., R and Weka. KNIME covers data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted in a visualized environment. KNIME is designed as a module-based and extensible framework. There is no dependence between its processing units and data containers, which makes it suitable for distributed environments and independent development. In addition, KNIME is easy to extend: developers can readily add new nodes and views.

– Weka/Pentaho (14.8 %): Weka, abbreviated from Waikato Environment for Knowledge Analysis, is free and open-source machine learning and data mining software written in Java. Weka provides functions such as data processing, feature selection, classification, regression, clustering, association rules, and visualization. Pentaho is one of the most popular open-source BI software packages. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, data mining, and other aspects of BI. Weka's data processing algorithms are also integrated in Pentaho and can be called directly.

                  6 Big data applications

In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful value via judgments, suggestions, support, or decisions. However, data analysis involves a wide range of applications, which change frequently and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields: structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

6.1 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically includes large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have been emerging over the past decades. For example, as early as the 1990s, BI had become a prevailing technology for business applications, and network search engines based on massive data mining emerged in the early 21st century. Some potential and influential applications from different fields, together with their data and analysis characteristics, are discussed as follows.

– Evolution of Commercial Applications: The earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. Analytical techniques used in such systems prevailed in the 1990s and were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, score cards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have provided a unique opportunity for organizations to present themselves online and interact directly with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted with text analysis and website mining techniques. As reported in [115], the quantity of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the sensor-based Internet of Things are opening a new generation of innovative applications, which require considerably larger capacity to support location-sensing, people-oriented, and context-aware operation.

– Evolution of Network Applications: The early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to mining email contents and building search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages full of various kinds of data, such as text, images, audio, videos, and interactive contents. Therefore, plenty of advanced technologies for semi-structured or unstructured data have emerged. For example, image analysis can extract useful information from images (e.g., face recognition), and multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share contents.

– Evolution of Scientific Applications: Scientific research in many fields, such as astrophysics, oceanology, genomics, and environmental research, is acquiring massive data with high-throughput sensors and instruments. The US National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, inter-operative analysis software, and data services to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets vary widely in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and comprehensive coverage is not easy, we will focus on the key problems and technologies of data analysis in the following discussions.

6.2 Big data analysis fields


6.2.1 Structured data analysis

Business applications and scientific research may generate massive structured data, whose management and analysis rely on mature commercialized technologies, such as RDBMS, data warehouse, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning based on exact mathematical models and powerful algorithms has been applied to anomaly detection [117] and energy control [118]. By exploiting data characteristics, temporal and spatial mining can extract knowledge structures hidden in high-speed data flows and sensor data [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially in process analysis with event data [121].

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to have greater commercial potential than structured data analysis. Generally, text analysis is a process of extracting useful information and knowledge from unstructured text. Text mining is inter-disciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and data mining in particular. Most text mining systems are based on text representations and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammars [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, image, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which will be discussed in Section 6.2.4. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves mining semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., email, newsgroup management, and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters information for users according to inferred or configured user profiles. The database method aims to model and integrate data on the Web so as to support more complex queries than keyword-based searches.

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the graph of links within a website or among multiple websites. Models are built based on the topological structures provided by hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages. Topic-oriented crawlers are another successful application of these models [127].
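To make the idea of ranking pages from link topology concrete, the following is a minimal sketch of the PageRank power iteration on a tiny hand-made link graph. The damping factor of 0.85, the fixed iteration count, the even redistribution of rank from pages without outgoing links, and the toy graph are assumptions for illustration; this is not the production algorithm described in [125].

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page site: C is linked from A, B, and D.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C, which receives links from three pages, ends up with the highest score, while D, which nothing links to, keeps only the baseline share.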

Web usage mining aims to mine secondary data generated by Web sessions or activities, whereas Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or trades, caches, user queries, bookmark data, mouse clicks and scrolls, and any other kinds of data generated through interaction with the Web. As Web services and Web 2.0 become mature and popular, Web usage data will have increasingly high variety. Web usage mining plays key roles in personalization, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.


6.2.4 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) has been growing at an amazing speed. Multimedia analysis aims to extract useful knowledge from such data and understand its semantics. Because multimedia data is heterogeneous and most of it contains richer information than simple structured data or text data, extracting information is confronted with the huge challenge of the semantic gap. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection.

Audio summarization can be accomplished by extracting the prominent words or phrases from metadata or by synthesizing a new representation. Video summarization identifies the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied to many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a sequence of video frames to represent a video and apply smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.

Multimedia annotation inserts labels to describe the contents of images and videos at both the syntactic and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time and labor intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic gap. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to jointly explore manual and automatic multimedia annotation [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users to conveniently and quickly look up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, and includes shot boundary detection, key frame extraction, and scene segmentation. Based on the result of structural analysis, the second procedure is feature extraction, which mainly involves further mining the features of key frames, objects, texts, and movements; these features are the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find patterns in video contents and put videos into predefined categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos, and relevance feedback is used to optimize the retrieval results.

Multimedia recommendation recommends specific multimedia contents according to users' preferences. It has proven to be an effective approach to providing personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. Content-based methods identify general features of users or their interests and recommend to users other contents with similar features. These methods largely rely on content similarity measurement, but most of them are troubled by limited content analysis and over-specialization. Collaborative-filtering-based methods identify groups with similar interests and recommend contents to group members according to their behavior [132]. Presently, hybrid methods have been introduced, which integrate the advantages of the two aforementioned types of methods to improve recommendation quality [133].
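The collaborative-filtering idea above can be sketched in a few lines: find users whose ratings resemble the target user's, then recommend what those users liked. The choice of cosine similarity, the similarity-weighted scoring, the function names, and the toy ratings are all assumptions made for illustration; the methods surveyed in [132] are considerably more elaborate.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two sparse rating dicts (item -> rating)."""
    common = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in common)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def recommend(ratings, user, top_n=2):
    """Score items the user has not rated by similarity-weighted ratings of other users."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        if sim <= 0:
            continue  # ignore users with no overlap in taste
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hypothetical ratings of video clips on a 1-5 scale.
ratings = {
    "alice": {"clip1": 5, "clip2": 4},
    "bob": {"clip1": 5, "clip2": 5, "clip3": 4},
    "carol": {"clip4": 5},
}
print(recommend(ratings, "alice"))  # ['clip3']
```

Bob's ratings overlap with Alice's, so his other clip is suggested; Carol shares no rated items with Alice, so her clip is not. A content-based method would instead compare features of the clips themselves.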

The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains some text descriptions related to concepts and video examples [134]. In [135], the author proposed a new algorithm for special multimedia event detection using a few positive training examples. The research on video event detection is still in its infancy and mainly focuses on sports or news events, running or abnormal events in surveillance videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graph structures describing the communications between two entities. The content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. From a data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

The research on link-based structural analysis has focused on link prediction, community discovery, social network evolution, and social influence analysis. SNS may be visualized as graphs, in which every vertex corresponds to a user and edges correspond to the relations among users. Since SNS are dynamic networks, new vertices and edges are continually added to the graphs. Link prediction is to predict the possibility of a future connection between two vertices. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex pair and utilizes the existing link information to train binary classifiers that predict future links [139]. Probabilistic methods aim to build models for the connection probabilities among vertices in SNS [140]. Linear-algebraic methods compute the similarity between two vertices from a rank-reduced similarity matrix [141]. A community is represented by a subgraph in which the edges connecting vertices within the subgraph have high density, while the edges between two subgraphs have much lower density [142].
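As a small illustration of link prediction, the sketch below scores every pair of currently unlinked vertices by the Jaccard overlap of their neighbour sets. The Jaccard coefficient is only one simple topological feature, chosen here as an assumption; the text names families of techniques rather than a specific measure, and a feature-based classifier as in [139] would combine several such features and learn from existing links. The toy graph is made up.

```python
from itertools import combinations


def jaccard_scores(adjacency):
    """Score every unlinked vertex pair by the Jaccard overlap of their neighbour sets."""
    scores = {}
    for u, v in combinations(sorted(adjacency), 2):
        if v in adjacency[u]:
            continue  # already linked, nothing to predict
        union = adjacency[u] | adjacency[v]
        if union:
            scores[(u, v)] = len(adjacency[u] & adjacency[v]) / len(union)
    return scores


# Undirected toy graph: user -> set of friends (illustrative data).
graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"bob", "cat", "eve"},
    "eve": {"dan"},
}
scores = jaccard_scores(graph)
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))  # ('ann', 'dan') 0.667
```

Ann and Dan share two of their three distinct neighbours, so that pair is ranked as the most likely future link, whereas Ann and Eve have no common neighbour and score zero.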

Many methods for community detection have been proposed and studied, most of which are topology-based and rely on an objective function that captures the notion of community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. The research on SNS evolution aims to find laws and derive models that explain network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generative methods have been proposed to assist network and system design [147].

Social influence refers to the case when individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, time effects, and characteristics of networks and individuals. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, location, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data must be automatically analyzed within a reasonable time window. Second, social media data contains much noise. For example, the blogosphere contains a large number of spam blogs, and Twitter likewise contains many trivial tweets. Third, SNS are dynamic networks, which vary and are updated frequently and quickly. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge among different media [150].

6.2.6 Mobile data analysis

By April 2013, the Android app market offered more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since the research on mobile analysis has only just started, we will introduce only some recent and representative analysis applications in this section.

With the growing number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities based on geographical locations and communities based on different cultural backgrounds and interests (e.g., the recent WeChat). Traditional network communities or SNS communities lack online interaction among members, and the communities are active only when members are sitting in front of computers. On the contrary, mobile phones can support rich interaction at any time and anywhere. A mobile community is defined as a group of individuals with the same hobbies (e.g., health, safety, and entertainment) who gather together on networks, meet to set a common goal, decide on measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build body area networks for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations. In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop an application for smart phones, which analyzes people's paces when they walk and uses the pace information to unlock the security system [11]. Meanwhile, Robert Delano and Brian Parise from Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with a built-in seismograph in a mobile phone, so as to cope with Parkinson's disease and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, in marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business modes. In sales planning, after comparing massive data, enterprises can optimize their commodity prices. In operations, enterprises can improve their operational efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor cost. In the supply chain, using big data, enterprises may conduct inventory optimization, logistic optimization, and supplier coordination, etc., to mitigate the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that such activities as “Multi-times score accumulation” and “score exchange in shops” are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers’ transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.
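The drop-out warning idea can be illustrated with a minimal sketch. The feature names, weights, and scoring rule below are hypothetical and are not CMB's model; only the final step, targeting the 20 % of customers with the highest estimated drop-out risk, follows the description above.

```python
import math
from dataclasses import dataclass


@dataclass
class Customer:
    customer_id: str
    months_since_last_txn: float  # hypothetical feature
    balance_trend: float          # hypothetical feature: < 0 means a shrinking balance


def churn_score(c: Customer) -> float:
    """Toy logistic score in (0, 1); the weights are illustrative, not fitted."""
    z = 0.8 * c.months_since_last_txn - 1.5 * c.balance_trend - 2.0
    return 1.0 / (1.0 + math.exp(-z))


def top_risk_customers(customers, fraction=0.2):
    """Return the given fraction of customers with the highest churn score."""
    ranked = sorted(customers, key=churn_score, reverse=True)
    k = max(1, math.ceil(fraction * len(ranked)))
    return ranked[:k]


# Synthetic customers, for illustration only.
customers = [
    Customer("a", 0.5, 0.3),
    Customer("b", 6.0, -0.4),
    Customer("c", 1.0, 0.0),
    Customer("d", 3.0, -0.1),
    Customer("e", 0.2, 0.5),
]
targets = top_risk_customers(customers)
print([c.customer_id for c in targets])
```

In practice the score would come from a model trained on historical drop-out records; the ranking and cut-off step stays the same.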

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, along with, more importantly, the age, gender, address, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, market conditions of their brands, and consumers’ behaviors, etc., and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan of Alibaba automatically analyzes and judges whether to lend loans to enterprises through the acquired enterprise transaction data by virtue of big data technology, while manual intervention does not occur in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan with only about 0.3 % bad loans, which is greatly lower than those of other commercial banks.

6.3.2 Application of IoT based big data

IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistic enterprises may have profoundly experienced the application of IoT big data. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so the Headquarter can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project cooperation between the Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and Miami city, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills due to timely identifying and fixing water pipes that were running and leaking this year.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and connections among individuals based on an information network. Big data of online SNS mainly comes from instant messages, online social, micro blog, and shared space, etc., which represents various user activities. The analysis of big data from online SNS uses computational analytical methods provided for understanding relations in the human society by virtue of theories and methods, which involve mathematics, informatics, sociology, and management science, etc., from three dimensions including network structure, group interaction, and information spreading. The application includes network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education, etc. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire values.

– Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preference, emotion, interest, and demand, etc. may be revealed.

– Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies, etc. aggregate relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. The community-based analysis is of vital importance to improve information propagation and for interpersonal relation analysis.
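The notion of a community as a group with close internal relations and loose external relations can be sketched in a few lines. The approach below is a deliberately simple stand-in (drop ties weaker than a threshold, then take connected components), not a full community detection algorithm; the `strength` values and the `min_strength` threshold are hypothetical.

```python
from collections import defaultdict


def communities(edges, min_strength=2):
    """Group users into communities.

    edges: iterable of (user_a, user_b, strength) tuples, where strength is
    a hypothetical count of shared interests or interactions. Ties weaker
    than min_strength are treated as loose external relations and ignored;
    the remaining connected components are returned as communities.
    """
    graph = defaultdict(set)
    users = set()
    for a, b, strength in edges:
        users.update((a, b))
        if strength >= min_strength:
            graph[a].add(b)
            graph[b].add(a)

    seen, result = set(), []
    for start in sorted(users):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        result.append(sorted(component))
    return result


# Synthetic example: two tight groups joined by one weak tie.
edges = [
    ("ann", "bob", 5), ("bob", "cat", 4), ("ann", "cat", 3),
    ("dan", "eve", 6), ("eve", "fay", 2),
    ("cat", "dan", 1),
]
groups = communities(edges)
print(groups)
```

Real SNS graphs usually call for methods that tolerate many weak ties between groups, such as modularity-based or label-propagation algorithms; the thresholding step here only conveys the idea.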

The U.S. Santa Cruz Police Department experimented by applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

Fig. 5 Enabling technologies for online social network-oriented big data

In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, it was found that most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and then get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the U.S. In addition, Global Pulse conducted research that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans. The goal is to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting the sharp growth or drop of the amount of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation from the official statistics of Indonesia matches the number of Tweets about rice price on Twitter, as shown in Fig. 6.
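Two of the analyses above, detecting a sharp growth or drop in topic volume and comparing Tweet counts with an external indicator, can be sketched as follows. This is a minimal illustration and not the Global Pulse method: the trailing-window z-score rule, the thresholds, and all numbers are assumptions, and the data are synthetic.

```python
from statistics import mean, pstdev


def detect_spikes(counts, window=4, threshold=3.0):
    """Return indices whose count deviates from the trailing-window mean
    by more than `threshold` standard deviations (growth or drop)."""
    spikes = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        sigma = pstdev(history) or 1.0  # fall back to 1.0 on a flat history
        if abs(counts[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Synthetic weekly counts of Tweets on one topic.
weekly_tweets = [100, 104, 98, 102, 101, 240, 103, 99]
spikes = detect_spikes(weekly_tweets)
print(spikes)

# Synthetic monthly series: Tweets about rice price vs. food inflation (%).
rice_tweets = [120, 150, 200, 260, 310]
food_inflation = [4.0, 4.6, 5.9, 7.1, 8.0]
r = pearson(rice_tweets, food_inflation)
print(round(r, 3))
```

A high correlation on such series only indicates that the two signals move together; it does not by itself show that Tweets predict the official statistic.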

Generally speaking, the application of big data from online SNS may help to better understand user’s behavior and master the laws of social and economic activities from the following three aspects:

– Early Warning: to rapidly cope with crisis, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotion, and preference of users.

– Real-time Feedback: to acquire groups’ feedback on social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the health care business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final result into an extremely personalized treatment plan to assess the dangerous factors and main treatment plans of patients. Then, doctors may reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or suggesting patients to reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20.

The Mount Sinai Medical Center in the U.S. utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

Fig. 6 The correlation between Tweets about rice price and food price inflation

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with a third-party application with the software development kit (SDK) and open interface.

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units to conduct coordination with mobile networks for distribution of sensing tasks and collection and utilization of sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of Crowdsourcing has been successfully applied to geotagged photographs, positioning and navigation, urban road traffic sensing, market forecast, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach for problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, Crowdsourcing had been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of Crowdsourcing. The main idea of Crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, Crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, Spatial Crowdsourcing becomes a hot topic. The operation framework of Spatial Crowdsourcing is as follows. A user may request the service and resources related to a specified location. Then the mobile users who are willing to participate in the task will move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data will be sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions provided by mobile devices, it can be forecasted that Spatial Crowdsourcing will be more prevailing than traditional Crowdsourcing, e.g., Amazon Turk and Crowdflower.
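The three steps of the Spatial Crowdsourcing framework (location-bound request, a willing mobile user travelling to the location, and delivery of the acquired data to the requester) can be sketched as below. The nearest-willing-worker rule, the `max_km` travel limit, and all class and field names are assumptions made for illustration; the survey does not prescribe an assignment policy.

```python
import math
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    lat: float
    lon: float
    willing: bool = True


@dataclass
class Request:
    requester: str
    lat: float
    lon: float
    media: str  # e.g. "video", "audio", or "picture"


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two latitude/longitude points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def assign(request, workers, max_km=5.0):
    """Step 1-2: pick the nearest willing worker within max_km, or None."""
    in_range = [
        (d, w)
        for w in workers if w.willing
        for d in [haversine_km(w.lat, w.lon, request.lat, request.lon)]
        if d <= max_km
    ]
    return min(in_range, key=lambda dw: dw[0])[1] if in_range else None


def fulfil(request, worker):
    """Step 3: package the acquired data for delivery to the requester."""
    return {"to": request.requester, "from": worker.name,
            "media": request.media, "location": (request.lat, request.lon)}


# Synthetic example.
req = Request("alice", 30.510, 114.410, "picture")
workers = [
    Worker("w1", 30.520, 114.420),
    Worker("w2", 30.511, 114.411, willing=False),  # closest, but unwilling
    Worker("w3", 31.500, 115.000),                 # willing, but too far
]
chosen = assign(req, workers)
print(fulfil(req, chosen))
```

Other policies, such as letting workers self-select tasks or optimizing assignments across many simultaneous requests, fit the same three-step outline.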

6.3.6 Smart grid

Smart Grid is the next generation power grid, constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart Grid related big data are generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nationwide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as Circuit Breaker Monitors and transformers). Smart Grid brings about the following challenges on exploiting big data:

– Grid planning: By analyzing data in the Smart Grid, the regions that have excessively high electrical load or high power outage frequencies can be identified. Even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an “electric map” according to the big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of the block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides effective and visual load forecast for power grid planning in a city. Preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads may be conducted, as displayed in the map.

– Interaction between power generation and power consumption: An ideal power grid shall balance power generation and consumption. However, the traditional power grid is constructed based on the one-directional approach of transmission-transformation-distribution-consumption, which does not allow adjusting the generation capacity according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters are developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which can help the supplier read power utilization data every 15 min rather than every month as in the past. Labor cost for meter reading is greatly reduced. Because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and low periods of power consumption. TXU Energy utilized such price level to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help the realization of time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users.

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable new energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortage. Such energy resources can complement the traditional hydropower and thermal power generations.
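The time-sharing dynamic pricing mentioned under the interaction between power generation and consumption can be illustrated with a small sketch that turns 15-minute smart meter readings into hourly peak, normal, and low price levels. The tariff values, the ±20 % band around the average load, and the function names are hypothetical; this is not TXU Energy's pricing scheme.

```python
def hourly_load(readings_kwh):
    """Sum consecutive 15-minute readings (4 per hour) into hourly totals."""
    if len(readings_kwh) % 4:
        raise ValueError("expected a whole number of hours of 15-min readings")
    return [sum(readings_kwh[i:i + 4]) for i in range(0, len(readings_kwh), 4)]


def time_of_use_prices(hourly, base_price=0.10, peak_factor=1.5,
                       low_factor=0.7, band=0.2):
    """Assign a price per kWh to each hour.

    Hours whose load exceeds the average by more than `band` are priced at
    the peak rate, hours below the average by more than `band` at the low
    rate, and all other hours at the base rate. All rates are illustrative.
    """
    avg = sum(hourly) / len(hourly)
    prices = []
    for load in hourly:
        if load > avg * (1 + band):
            prices.append(round(base_price * peak_factor, 4))
        elif load < avg * (1 - band):
            prices.append(round(base_price * low_factor, 4))
        else:
            prices.append(base_price)
    return prices


# Synthetic readings for three hours (kWh per 15 minutes).
readings = [0.125] * 4 + [0.25] * 4 + [0.5] * 4
hourly = hourly_load(readings)
prices = time_of_use_prices(hourly)
print(hourly, prices)
```

Monthly meter reading cannot support such a scheme, since the supplier would not know in which hours the energy was consumed; the 15-minute granularity is what makes the hourly classification possible.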

7 Conclusion, open issues, and outlook

In this paper, we review the background and state-of-the-art of big data. Firstly, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big-picture to readers of this exciting area.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, which are discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

– Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still not a unified evaluation standard and benchmark to balance the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated when the system is implemented and deployed, which could not horizontally compare advantages and disadvantages of various alternative solutions even before and after the implementation of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes: This includes memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers the advances of algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

The big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data model, and software systems supporting big data, etc., should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor which restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more values.


– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor to improve big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research is advanced, new problems on big data processing arise from the traditional data analysis, including (i) data re-utilization: with the increase of data scale, more values may be mined from re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized, from which more value can be mined; (iii) data exhaust, which means wrong data during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems should be solved:

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from comprehensive utilization of multiple datasets is far higher than the sum value of individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance is the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and from different datasets.

– Big data application: At present, the application of big data is just beginning and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction of big data, etc. are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume is fast growing, there are more severe safety risks, while the traditional data protection methods have already been shown not applicable to big data. In particular, big data safety is confronted with the following security related challenges:

– Big data privacy: Big data privacy includes two aspects: (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc. of users may be more easily acquired, and users may not be aware. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed as a big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher of Skull Security, acquired data in the public pages of Facebook users who failed to modify their privacy settings via an information acquisition tool. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low quality data wastes transmission and storage resources with poor usability. There are a lot of factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though a lot of measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.

– Big data safety mechanism: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data could not meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safety communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants’ data should be enabled on the premise of efficiency assurance.

– Big data application in information security: Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APT (Advanced Persistent Threat) after analysis of big data in the form of log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc. may also be more easily identified through the analysis of big data.
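Three of the data quality dimensions named in the list above (completeness, redundancy, and consistency) can be measured automatically with very little code, as the following sketch suggests. The metric definitions, field names, and the consistency rule are assumptions chosen for illustration; accuracy is left out because it requires a trusted reference to compare against.

```python
def completeness(records, fields):
    """Fraction of required field values that are present (not None or empty)."""
    total = len(records) * len(fields)
    filled = sum(1 for r in records for f in fields if r.get(f) not in (None, ""))
    return filled / total if total else 1.0


def redundancy(records, key_fields):
    """Fraction of records that repeat an earlier record on key_fields."""
    seen, dupes = set(), 0
    for r in records:
        key = tuple(r.get(f) for f in key_fields)
        if key in seen:
            dupes += 1
        seen.add(key)
    return dupes / len(records) if records else 0.0


def consistency(records, rule):
    """Fraction of records that satisfy a caller-supplied consistency rule."""
    if not records:
        return 1.0
    return sum(1 for r in records if rule(r)) / len(records)


# Synthetic smart meter records with hypothetical field names.
records = [
    {"meter": "m1", "ts": 1, "kwh": 0.5},
    {"meter": "m1", "ts": 1, "kwh": 0.5},   # duplicate reading
    {"meter": "m2", "ts": 1, "kwh": None},  # missing value
    {"meter": "m3", "ts": 1, "kwh": -2.0},  # violates the non-negative rule
]

comp = completeness(records, ["meter", "ts", "kwh"])
red = redundancy(records, ["meter", "ts"])
cons = consistency(records, lambda r: r["kwh"] is not None and r["kwh"] >= 0)
print(comp, red, cons)
```

Scores like these only flag suspect records; repairing damaged data, as called for above, needs domain-specific rules or models on top.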

The safety of big data has drawn great attention of researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage of energy efficiency optimization, and processed hardware and software system architectures, etc. Particularly, big data security, including credibility, backup and recovery, completeness maintenance, and security should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the “T” (Technology) was the main concern, while technology drives the development of data. In the big data era, with the prominence of data value and advances in “I” (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact, but also influence everyone’s ways of living and thinking, which is just happening. We cannot predict the future, but may take precautions for possible events to occur in the future.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved a great success, such technologies are expected to fall behind and will be replaced given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have been concerned with better ways to cope with larger-scale, higher diversity, and more complexly structured data. These efforts are represented by (Globally-Distributed Database) Spanner of Google and the fault-tolerant, expandable, distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge values, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more values. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data center, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, impacts of big data on production management, business operation, and decision making, etc. shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., as in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a friendly manner can they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc. are frequently used to visualize results of data analysis. New presentation forms will occur in the future, e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well known that programs consist of data structures and algorithms, and that data structures are used to store data. In the history of program design, the role of data has become increasingly significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed that of business logic. Consequently, programs are gradually being transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods will emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

    – During data analysis, we will try to utilize all data rather than analyzing only a small set of sample data.

    – Rather than insisting on accurate data, we would accept numerous and complicated data.

    – We shall pay greater attention to correlations between things rather than exploring causal relationships.

    – Simple algorithms on big data are more effective than complex algorithms on small data.

    – Analytical results of big data will reduce hasty and subjective factors in decision making, and data scientists will replace "experts".
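The first and third points in the list above can be illustrated with a small numerical sketch. The following Python fragment is not taken from [11]; it assumes a synthetic dataset produced by a hypothetical helper make_dataset, uses a hand-written pearson function, and compares the correlation estimated from all records with the one estimated from a small random sample.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def make_dataset(n, seed=0):
    """Hypothetical synthetic data: y = 0.5 * x + unit Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [0.5 * x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


# "All data": the correlation computed over every record.
xs, ys = make_dataset(100_000)
full_r = pearson(xs, ys)

# "A small set of sample data": the same statistic on 30 random records.
sampler = random.Random(1)
idx = sampler.sample(range(len(xs)), 30)
sample_r = pearson([xs[i] for i in idx], [ys[i] for i in idx])

print(f"correlation over all records: {full_r:.3f}")
print(f"correlation over 30 records:  {sample_r:.3f}")
```

Under the assumed relationship, the population correlation is 0.5 / sqrt(1.25), approximately 0.447. The full-data estimate should fall close to this value, whereas the 30-record estimate may deviate noticeably. Either way, the statistic measures only correlation and says nothing about a causal relationship between the two variables.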

Throughout the history of human society, the demands and willingness of human beings have always been the driving force behind scientific and technological progress. Big data may provide reference answers for human decision making through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

                  References

1. Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

2. Fact sheet: Big data across the federal government (2012). http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf

3. Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper

4. Drowning in numbers - digital data will flood the planet - and help us understand it better (2011). http://www.economist.com/blogs/dailychart/2011/11/bigdata-0

5. Lohr S (2012) The age of big data. New York Times, pp 11

6. Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/thedigitalbreadcrumbs-that-lead-to-big-data

7. Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-searchforanalysts-to-make-sense-of-big-data

8. Big data (2008). http://www.nature.com/news/specials/bigdata/index.html

9. Special online collection: dealing with big data (2011). http://www.sciencemag.org/site/special/data

10. Manyika J, McKinsey Global Institute, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute

11. Mayer-Schönberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Eamon Dolan/Houghton Mifflin Harcourt

12. Laney D (2001) 3-d data management: controlling data volume, velocity and variety. META Group Research Note, 6 February

13. Zikopoulos P, Eaton C et al (2011) Understanding big data: analytics for enterprise class hadoop and streaming data. McGraw-Hill Osborne Media

14. Meijer E (2011) The world according to linq. Communications of the ACM 54(10):45–51

15. Beyer M (2011) Gartner says solving big data challenge involves more than just managing volumes of data. Gartner. http://www.gartner.com/it/page.jsp

16. O R Team (2011) Big data now: current perspectives from O'Reilly Radar. O'Reilly Media

17. Grobelnik M (2012) Big data tutorial. http://videolectures.net/eswc2012_grobelnik_big_data/

18. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2008) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014

19. DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6):85–98


20. Walter T (2009) Teradata past, present, and future. UCI ISG lecture series on scalable data management

21. Ghemawat S, Gobioff H, Leung S-T (2003) The google file system. In: ACM SIGOPS Operating Systems Review, vol 37. ACM, pp 29–43

22. Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

23. Hey AJG, Tansley S, Tolle KM et al (2009) The fourth paradigm: data-intensive scientific discovery

24. Howard JH, Kazar ML, Menees SG, Nichols DA, Satyanarayanan M, Sidebotham RN, West MJ (1988) Scale and performance in a distributed file system. ACM Trans Comput Syst (TOCS) 6(1):51–81

25. Cattell R (2011) Scalable sql and nosql data stores. ACM SIGMOD Record 39(4):12–27

26. Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033

27. Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98

28. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J et al (2012) Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States

29. Sun Y, Chen M, Liu B, Mao S (2013) Far: a fault-avoidant routing method for data center networks with regular topology. In: Proceedings of ACM/IEEE symposium on architectures for networking and communications systems (ANCS'13). ACM

30. Wiki (2013) Applications and organizations using hadoop. http://wiki.apache.org/hadoop/PoweredBy

31. Bahga A, Madisetti VK (2012) Analyzing massive machine maintenance data in a computing cloud. IEEE Transac Parallel Distrib Syst 23(10):1831–1843

32. Gunarathne T, Wu T-L, Choi JY, Bae S-H, Qiu J (2011) Cloud computing paradigms for pleasingly parallel biomedical applications. Concurr Comput Prac Experience 23(17):2338–2354

33. Gantz J, Reinsel D (2010) The digital universe decade - are you ready? External publication of IDC (Analyse the Future) information and data, pp 1–16

34. Bryant RE (2011) Data-intensive scalable computing for scientific applications. Comput Sci Eng 13(6):25–33

35. Wahab MHA, Mohd MNH, Hanafi HF, Mohsin MFM (2008) Data pre-processing on web server logs for generalized association rules mining algorithm. World Acad Sci Eng Technol 48:2008

36. Nanopoulos A, Manolopoulos Y, Zakrzewicz M, Morzy T (2002) Indexing web access-logs for pattern queries. In: Proceedings of the 4th international workshop on web information and data management. ACM, pp 63–68

37. Joshi KP, Joshi A, Yesha Y (2003) On using a warehouse to analyze web logs. Distrib Parallel Databases 13(2):161–180

38. Chandramohan V, Christensen K (2002) A first look at wired sensor networks for video surveillance systems. In: Proceedings LCN 2002, 27th annual IEEE conference on local computer networks. IEEE, pp 728–729

39. Selavo L, Wood A, Cao Q, Sookoor T, Liu H, Srinivasan A, Wu Y, Kang W, Stankovic J, Young D et al (2007) Luster: wireless sensor network for environmental research. In: Proceedings of the 5th international conference on Embedded networked sensor systems. ACM, pp 103–116

40. Barrenetxea G, Ingelrest F, Schaefer G, Vetterli M, Couach O, Parlange M (2008) Sensorscope: out-of-the-box environmental monitoring. In: Information processing in sensor networks, 2008, international conference on IPSN'08. IEEE, pp 332–343

41. Kim Y, Schmid T, Charbiwala ZM, Friedman J, Srivastava MB (2008) Nawms: nonintrusive autonomous water monitoring system. In: Proceedings of the 6th ACM conference on Embedded network sensor systems. ACM, pp 309–322

42. Kim S, Pakzad S, Culler D, Demmel J, Fenves G, Glaser S, Turon M (2007) Health monitoring of civil infrastructures using wireless sensor networks. In: Information Processing in Sensor Networks, 2007, 6th International Symposium on IPSN 2007. IEEE, pp 254–263

43. Ceriotti M, Mottola L, Picco GP, Murphy AL, Guna S, Corra M, Pozzi M, Zonta D, Zanon P (2009) Monitoring heritage buildings with wireless sensor networks: the torre aquila deployment. In: Proceedings of the 2009 International Conference on Information Processing in Sensor Networks. IEEE Computer Society, pp 277–288

44. Tolle G, Polastre J, Szewczyk R, Culler D, Turner N, Tu K, Burgess S, Dawson T, Buonadonna P, Gay D et al (2005) A macroscope in the redwoods. In: Proceedings of the 3rd international conference on embedded networked sensor systems. ACM, pp 51–63

45. Wang F, Liu J (2011) Networked wireless sensor data collection: issues, challenges, and approaches. IEEE Commun Surv Tutor 13(4):673–687

46. Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the 11th international conference on World Wide Web. ACM, pp 124–135

47. Choudhary S, Dincturk ME, Mirtaheri SM, Moosavi A, von Bochmann G, Jourdan G-V, Onut I-V (2012) Crawling rich internet applications: the state of the art. In: CASCON, pp 146–160

48. Ghani N, Dixit S, Wang T-S (2000) On ip-over-wdm integration. IEEE Commun Mag 38(3):72–84

49. Manchester J, Anderson J, Doshi B, Dravida S (1998) Ip over sonet. IEEE Commun Mag 36(5):136–142

50. Jinno M, Takara H, Kozicki B (2009) Dynamic optical mesh networks: drivers, challenges and solutions for the future. In: Optical communication, 2009, 35th European conference on ECOC'09. IEEE, pp 1–4

51. Barroso LA, Hölzle U (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synt Lect Comput Archit 4(1):1–108

52. Armstrong J (2009) Ofdm for optical communications. J Light Technol 27(3):189–204

53. Shieh W (2011) Ofdm for flexible high-speed optical networks. J Light Technol 29(10):1560–1577

54. Cisco data center interconnect design and deployment guide (2010)

55. Greenberg A, Hamilton JR, Jain N, Kandula S, Kim C, Lahiri P, Maltz DA, Patel P, Sengupta S (2009) Vl2: a scalable and flexible data center network. In: ACM SIGCOMM computer communication review, vol 39. ACM, pp 51–62

56. Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S (2009) Bcube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 39(4):63–74

57. Farrington N, Porter G, Radhakrishnan S, Bazzaz HH, Subramanya V, Fainman Y, Papen G, Vahdat A (2011) Helios: a hybrid electrical/optical switch architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 41(4):339–350

58. Abu-Libdeh H, Costa P, Rowstron A, O'Shea G, Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Comput Commun Rev 40(4):51–62

59. Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what's needed for datacenter network operations. IEEE Commun Mag 48(7):32–39


60. Wang G, Andersen DG, Kaminsky M, Papagiannaki K, Ng TS, Kozuch M, Ryan M (2010) c-through: Part-time optics in data centers. In: ACM SIGCOMM Computer Communication Review, vol 40. ACM, pp 327–338

61. Ye X, Yin Y, Yoo SJB, Mejia P, Proietti R, Akella V (2010) Dos: a scalable optical switch for datacenters. In: Proceedings of the 6th ACM/IEEE symposium on architectures for networking and communications systems. ACM, p 24

62. Singla A, Singh A, Ramachandran K, Xu L, Zhang Y (2010) Proteus: a topology malleable data center network. In: Proceedings of the 9th ACM SIGCOMM workshop on hot topics in networks. ACM, p 8

63. Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P (2011) Energy-efficient design of a scalable optical multiplane interconnection architecture. IEEE J Sel Top Quantum Electron 17(2):377–383

64. Kodi AK, Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance computing (hpc) systems. IEEE J Sel Top Quantum Electron 17(2):384–395

65. Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H (2012) Mirror mirror on the ceiling: flexible wireless links for data centers. ACM SIGCOMM Comput Commun Rev 42(4):443–454

66. Lenzerini M (2002) Data integration: a theoretical perspective. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. ACM, pp 233–246

67. Cafarella MJ, Halevy A, Khoussainova N (2009) Data integration for the relational web. Proc VLDB Endowment 2(1):1090–1101

68. Maletic JI, Marcus A (2000) Data cleansing: beyond integrity analysis. In: IQ. Citeseer, pp 200–209

69. Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1-2):83–113

70. Chen H, Ku W-S, Wang H, Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 51–62

71. Zhao Z, Ng W (2012) A model-based approach for rfid data stream cleansing. In: Proceedings of the 21st ACM international conference on information and knowledge management. ACM, pp 862–871

72. Khoussainova N, Balazinska M, Suciu D (2008) Probabilistic event extraction from rfid data. In: Data Engineering, 2008, IEEE 24th international conference on ICDE 2008. IEEE, pp 1480–1482

73. Herbert KG, Wang JTL (2007) Biological data cleaning: a case study. Int J Inf Qual 1(1):60–82

74. Tsai T-H, Lin C-Y (2012) Exploring contextual redundancy in improving object-based video coding for video sensor networks surveillance. IEEE Transac Multmed 14(3):669–682

75. Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 269–278

76. Kamath U, Compton J, Dogan RI, Jong KD, Shehu A (2012) An evolutionary algorithm approach for feature generation from sequence data and its application to dna splice site prediction. IEEE/ACM Transac Comput Biol Bioinforma (TCBB) 9(5):1387–1398

77. Leung K-S, Lee KH, Wang J-F, Ng EYT, Chan HLY, Tsui SKW, Mok TSK, Tse PC-H, Sung JJ-Y (2011) Data mining on dna sequences of hepatitis b virus. IEEE/ACM Transac Comput Biol Bioinforma 8(2):428–440

78. Huang Z, Shen H, Liu J, Zhou X (2011) Effective data co-reduction for multimedia similarity search. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, pp 1021–1032

79. Bleiholder J, Naumann F (2008) Data fusion. ACM Comput Surv (CSUR) 41(1):1

80. Brewer EA (2000) Towards robust distributed systems. In: PODC, p 7

81. Gilbert S, Lynch N (2002) Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59

82. McKusick MK, Quinlan S (2009) Gfs: evolution on fast-forward. ACM Queue 7(7):10

83. Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J (2008) Scope: easy and efficient parallel processing of massive data sets. Proc VLDB Endowment 1(2):1265–1276

84. Beaver D, Kumar S, Li HC, Sobel J, Vajgel P et al (2010) Finding a needle in haystack: facebook's photo storage. In: OSDI, vol 10, pp 1–8

85. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: amazon's highly available key-value store. In: SOSP, vol 7, pp 205–220

86. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In: Proceedings of the twenty-ninth annual ACM symposium on theory of computing. ACM, pp 654–663

87. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2):4

88. Burrows M (2006) The chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th symposium on Operating systems design and implementation. USENIX Association, pp 335–350

89. Lakshman A, Malik P (2009) Cassandra: structured storage system on a p2p network. In: Proceedings of the 28th ACM symposium on principles of distributed computing. ACM, pp 5–5

90. George L (2011) HBase: the definitive guide. O'Reilly Media Inc

91. Judd D (2008) hypertable-0.9.0.4-alpha

92. Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc

93. Crockford D (2006) The application/json media type for javascript object notation (json)

94. Murty J (2009) Programming amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc

95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc

96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 975–986

97. Yang H-C, Parker DS (2009) Traverse: simplified indexing on large map-reduce-merge clusters. In: Database systems for advanced applications. Springer, pp 308–322

98. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with sawzall. Sci Program 13(4):277–298

99. Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U (2009) Building a high-level dataflow system on top of map-reduce: the pig experience. Proceedings VLDB Endowment 2(2):1414–1425


100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629

101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72

102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) Dryadlinq: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14

103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: Parallel and distributed processing, 2008, IEEE international symposium on IPDPS 2008. IEEE, pp 1–11

104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146

105. Bu Y, Bill H, Balazinska M, Ernst MD (2010) Haloop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296

106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative mapreduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing. ACM, pp 810–818

107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2

108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: mapreduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing. ACM, p 7

109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) Ciel: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on Networked systems design and implementation, p 9

110. Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York

111. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

112. What analytics, data mining, big data software you used in the past 12 months for a real project? (2012). http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html

113. Berthold MR, Cebron N, Dill F, Gabriel TR, Kötter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer

114. Sallam RL, Richardson J, Hagerty J, Hostmann B (2011) Magic quadrant for business intelligence platforms. CT, Gartner Group, Stamford

115. Beyond the PC. Special Report on Personal Technology (2011)

116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A et al (2011) The iplant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on Software quality assurance. ACM, pp 70–77

118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers. ACM, pp 277–286

119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Record 34(2):18–26

120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM Sigmod Record 33(1):50–57

121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Transac Manag Inform Syst (TMIS) 3(2):7

122. Manning CD, Schütze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press

123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Transac Neural Netw 13(5):1163–1177

124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11

125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117

126. Konopnicki D, Shmueli O (1995) W3qs: a query system for the world-wide web. In: VLDB, vol 95, pp 54–65

127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640

128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval. ACM, pp 2

129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25

130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19

131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819

132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939

133. Barragáns-Martínez AB, Costa-Montenegro E, Burguillo JC, Rey-López M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend tv programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311

134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91

135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia. ACM, pp 469–478

136. Hirsch JE (2005) An index to quantify an individual's scientific research output. Proc Natl Acad Sci USA 102(46):16569

137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company

138. Aggarwal CC (2011) An introduction to social network data analytics. Springer

139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1046–1054

140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied Computing. ACM, pp 1115–1116


141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Transac Knowl Discov Data (TKDD) 5(2):10

142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on World wide web. ACM, pp 631–640

143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis. ACM, pp 16–25

144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. ACM, pp 315–321

145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 145–158

146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 131–144

147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1007–1016

148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 807–816

149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on Web search and data mining. ACM, pp 657–666

150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360

151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (accessed 5 May 2013)

152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51

153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive rfid, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 2

154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval. ACM, pp 469–478

155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated, multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium. ACM, pp 445–454

156. Tasevski P (2011) Password attacks and generation strategies. Tartu University: Faculty of Mathematics and Computer Sciences



and traffic accidents are more valuable than those only capturing the normal flow of traffic.

3.1.3 Bio-medical data

As a series of high-throughput bio-measurement technologies were developed at the beginning of the 21st century, frontier research in the bio-medicine field has also entered the era of big data. By constructing smart, efficient, and accurate analytical models and theoretical systems for bio-medicine applications, the essential governing mechanism behind complex biological phenomena may be revealed. Such work may not only shape the future development of bio-medicine but also play a leading role in the development of a series of important strategic industries related to the national economy, people's livelihood, and national security, with important applications such as medical care, new drug R&D, and grain production (e.g., transgenic crops).

The completion of the HGP (Human Genome Project) and the continued development of sequencing technology have also led to widespread applications of big data in the field. The masses of data generated by gene sequencing go through specialized analysis according to different application demands, so that they can be combined with clinical gene diagnosis and provide valuable information for early diagnosis and personalized treatment of disease. A single sequencing of human genes may generate 100–600 GB of raw data. In the China National Genebank in Shenzhen, there are 1.3 million samples, including 1.15 million human samples and 150,000 animal, plant, and microorganism samples. By the end of 2013, 10 million traceable biological samples will be stored, and by the end of 2015 this figure will reach 30 million. It is predictable that, with the development of bio-medicine technologies, gene sequencing will become faster and more convenient, so that bio-medical big data will continue to grow.

In addition, data generated from clinical medical care and medical R&D are also rising quickly. For example, the University of Pittsburgh Medical Center (UPMC) has stored 2 TB of such data. Explorys, an American company, provides platforms to collate clinical data, operation and maintenance data, and financial data. At present, information on about 13 million people has been collated, with 44 articles of data at the scale of about 60 TB, which will reach 70 TB in 2013. Practice Fusion, another American company, manages electronic medical records of about 200,000 patients.

Apart from such small and medium-sized enterprises, well-known IT companies such as Google, Microsoft, and IBM have invested extensively in research on, and computational analysis of, methods related to high-throughput biological big data, seeking a share of the huge market known as the "Next Internet." IBM forecast at its 2013 Strategy Conference that, with the sharp increase of medical images and electronic medical records, medical professionals may utilize big data to extract useful clinical information from masses of data, to obtain a medical history and forecast treatment effects, thus improving patient care and reducing cost. It is anticipated that by 2015 the average data volume of every hospital will increase from 167 TB to 665 TB.

3.1.4 Data generation from other fields

As scientific applications are increasing, the scale of datasets is gradually expanding, and the development of some disciplines greatly relies on the analysis of masses of data. Here we examine several such applications. Although they belong to different scientific fields, these applications have similar and increasing demands on data analysis. The first example is related to computational biology. GenBank is a nucleotide sequence database maintained by the U.S. National Center for Biotechnology Information (NCBI). Data in this database may double every 10 months. By August 2009, GenBank had more than 250 billion bases from 150,000 different organisms [34]. The second example is related to astronomy. The Sloan Digital Sky Survey (SDSS), the biggest sky survey project in astronomy, recorded 25 TB of data from 1998 to 2008. As the resolution of the telescope is improved, by 2004 the data volume generated per night will surpass 20 TB. The last application is related to high-energy physics. At the beginning of 2008, the ATLAS experiment of the Large Hadron Collider (LHC) at the European Organization for Nuclear Research generates raw data at 2 PB/s and stores about 10 TB of processed data per year.
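To make the doubling claim concrete, a simple exponential model can be written down. This is an idealization assumed here for illustration, not a fit to GenBank data; N_0 denotes the size at a chosen reference time and t is measured in months.

```latex
N(t) = N_0 \cdot 2^{t/T}, \qquad T = 10 \text{ months}
\\[4pt]
\frac{N(12)}{N_0} = 2^{12/10} \approx 2.3, \qquad
\frac{N(60)}{N_0} = 2^{60/10} = 64
```

Under this assumption, the database would grow by a factor of roughly 2.3 per year and by a factor of 64 over five years.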

In addition, pervasive sensing and computing across natural, commercial, Internet, government, and social environments are generating heterogeneous data with unprecedented complexity. These datasets have unique characteristics in scale, time dimension, and data category. For example, mobile data were recorded with respect to positions, movement, approximation degrees, communications, multimedia, use of applications, and audio environment [108]. According to the application environment and requirements, such datasets can be classified into different categories, so as to select proper and feasible solutions for big data.

3.2 Big data acquisition

As the second phase of the big data system, big data acquisition includes data collection, data transmission, and data pre-processing. During big data acquisition, once we collect the raw data, we shall utilize an efficient transmission mechanism to send it to a proper storage management system to support different analytical applications. The collected datasets may sometimes include much redundant or useless data, which unnecessarily increases storage space and affects the subsequent data analysis. For example, high redundancy is very common among datasets collected by sensors for environment monitoring. Data compression technology can be applied to reduce the redundancy. Therefore, data pre-processing operations are indispensable to ensure efficient data storage and exploitation.
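As a rough illustration of how redundant sensor data responds to compression, the following minimal Python sketch compresses a synthetic, highly repetitive stream of temperature readings with the standard-library zlib module. The readings, the text encoding, and the helper name `compress_readings` are assumptions made for illustration; they are not taken from any system discussed in this survey.

```python
import zlib


def compress_readings(readings):
    """Encode readings as comma-separated text and compress with zlib (DEFLATE)."""
    raw = ",".join(f"{r:.1f}" for r in readings).encode("ascii")
    packed = zlib.compress(raw, 9)  # 9 = highest compression level
    return raw, packed


# Synthetic environment-monitoring data (hypothetical): the temperature
# barely changes, so consecutive samples are almost always identical.
readings = [21.5] * 500 + [21.6] * 500

raw, packed = compress_readings(readings)
print(f"raw: {len(raw)} bytes, compressed: {len(packed)} bytes")

# Compression is lossless: decompressing restores the original bytes.
restored = zlib.decompress(packed)
print("round trip ok:", restored == raw)
```

The gain depends entirely on how repetitive the input is; for noisy or already-compressed data the same call may save little, which is one reason the cost of pre-processing has to be weighed against its benefit.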

3.2.1 Data collection

Data collection utilizes special data collection techniques to acquire raw data from a specific data generation environment. Four common data collection methods are shown as follows.

– Log files: As one widely used data collection method, log files are record files automatically generated by the data source system, so as to record activities in designated file formats for subsequent analysis. Log files are used in nearly all digital devices. For example, web servers record in log files the number of clicks, click rates, visits, and other property records of web users [35]. To capture activities of users at web sites, web servers mainly use the following three log file formats: public log file format (NCSA), expanded log format (W3C), and IIS log format (Microsoft). All three types of log files are in ASCII text format. Databases, rather than text files, may sometimes be used to store log information to improve the query efficiency of the massive log store [36, 37]. There are also other applications of log-file-based data collection, including stock indicators in financial applications and determination of operating states in network monitoring and traffic management.

– Sensing: Sensors are common in daily life to measure physical quantities and transform them into readable digital signals for subsequent processing (and storage). Sensory data may be classified as sound wave, voice, vibration, automobile, chemical, current, weather, pressure, temperature, etc. Sensed information is transferred to a data collection point through wired or wireless networks. For applications that may be easily deployed and managed, e.g., a video surveillance system [38], the wired sensor network is a convenient solution to acquire related information. Sometimes the accurate position of a specific phenomenon is unknown, and sometimes the monitored environment does not have the energy or communication infrastructures. Then wireless communication must be used to enable data transmission among sensor nodes under limited energy and communication capability. In recent years, wireless sensor networks (WSNs) have received considerable interest and have been applied to many applications, such as environmental research [39, 40], water quality monitoring [41], civil engineering [42, 43], and wildlife habitat monitoring [44]. A WSN generally consists of a large number of geographically distributed sensor nodes, each being a micro device powered by battery. Such sensors are deployed at designated positions as required by the application to collect remote sensing data. Once the sensors are deployed, the base station will send control information for network configuration/management or data collection to sensor nodes. Based on such control information, the sensory data is assembled in different sensor nodes and sent back to the base station for further processing. Interested readers are referred to [45] for more detailed discussions.

– Methods for acquiring network data: At present, network data acquisition is accomplished using a combination of web crawler, word segmentation system, task system, and index system, etc. A web crawler is a program used by search engines for downloading and storing web pages [46]. Generally speaking, a web crawler starts from the uniform resource locator (URL) of an initial web page to access other linked web pages, during which it stores and sequences all the retrieved URLs. The web crawler acquires a URL in order of precedence through a URL queue, then downloads the web page, identifies all URLs in the downloaded page, and extracts new URLs to be put in the queue. This process is repeated until the web crawler is stopped. Data acquisition through a web crawler is widely applied in applications based on web pages, such as search engines or web caching. Traditional web page extraction technologies feature multiple efficient solutions, and considerable research has been done in this field. As more advanced web page applications are emerging, some extraction strategies are proposed in [47] to cope with rich Internet applications.

The current network data acquisition technologies mainly include traditional Libpcap-based packet capture technology, zero-copy packet capture technology, as well as some specialized network monitoring software such as Wireshark, SmartSniff, and WinNetCap.

– Libpcap-based packet capture technology: Libpcap (packet capture library) is a widely used network data packet capture function library. It is a general tool that does not depend on any specific system and is mainly used to capture data in the data link layer. It features simplicity, ease of use, and portability, but has a relatively low efficiency. Therefore, under a high-speed network environment, considerable packet losses may occur when Libpcap is used.


– Zero-copy packet capture technology: The so-called zero-copy (ZC) means that no copies between any internal memories occur during packet receiving and sending at a node. In sending, the data packets start directly from the user buffer of applications, pass through the network interfaces, and arrive at an external network. In receiving, the network interfaces send data packets directly to the user buffer. The basic idea of zero-copy is to reduce data copy times, reduce system calls, and reduce CPU load while datagrams are passed from network equipment to user program space. The zero-copy technology first utilizes direct memory access (DMA) technology to directly transmit network datagrams to an address space pre-allocated by the system kernel, so as to avoid the participation of the CPU. In the meanwhile, it maps the internal memory of the datagrams in the system kernel to that of the detection program, or builds a cache region in the user space and maps it to the kernel space. Then the detection program directly accesses the internal memory, so as to reduce internal memory copy from system kernel to user space and reduce the amount of system calls.

– Mobile equipment: At present, mobile devices are more widely used. As mobile device functions become increasingly stronger, they feature more complex and multiple means of data acquisition, as well as a greater variety of data. Mobile devices may acquire geographical location information through positioning systems; acquire audio information through microphones; acquire pictures, videos, streetscapes, two-dimensional barcodes, and other multimedia information through cameras; and acquire user gestures and other body language information through touch screens and gravity sensors. Over the years, wireless operators have improved the service level of the mobile Internet by acquiring and analyzing such information. For example, the iPhone itself is a "mobile spy." It may collect wireless data and geographical location information, and then send such information back to Apple Inc. for processing, of which the user is not aware. Apart from Apple, smart phone operating systems such as Android of Google and Windows Phone of Microsoft can also collect information in a similar manner.
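The URL-queue loop described above for web crawlers can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function name `crawl`, the `fetch_links` callback, the `max_pages` stopping rule, and the in-memory `TOY_WEB` dictionary are all hypothetical stand-ins; a real crawler would download pages over HTTP, parse HTML for links, and respect politeness rules.

```python
from collections import deque


def crawl(seed_url, fetch_links, max_pages=100):
    """Breadth-first crawl; returns URLs in the order they were downloaded.

    fetch_links(url) is a caller-supplied function (an assumption of this
    sketch) that downloads a page and returns the URLs found in it.
    """
    queue = deque([seed_url])  # URL queue, first in first out
    seen = {seed_url}          # every URL ever queued, to avoid revisits
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()          # take the next URL in order of precedence
        visited.append(url)            # "download and store" the page
        for link in fetch_links(url):  # identify URLs in the downloaded page
            if link not in seen:       # only new URLs go into the queue
                seen.add(link)
                queue.append(link)
    return visited


# Toy "web": a dict standing in for real downloads and HTML parsing.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}

order = crawl("a", lambda url: TOY_WEB.get(url, []))
print(order)  # ['a', 'b', 'c', 'd']
```

A first-in-first-out queue gives breadth-first order; replacing it with a priority queue would be one way to model the "order of precedence" when some URLs matter more than others.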

In addition to the aforementioned four data acquisition methods of main data sources, there are many other data collection methods or systems. For example, in scientific experiments, many special tools can be used to collect experimental data, such as magnetic spectrometers and radio telescopes. We may classify data collection methods from different perspectives. From the perspective of data sources, data collection methods can be classified into two categories: collection methods recording through data sources, and collection methods recording through other auxiliary tools.
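Returning to the log-file method listed above, the following Python sketch shows how a single web server log line in the NCSA common log format might be parsed for subsequent analysis. It is a minimal sketch, assuming the usual seven-field layout (host, identity, user, timestamp, request, status, size); the names `CLF_PATTERN`, `parse_clf_line`, and `count_requests` are introduced here for illustration only, and real logs often need a more tolerant parser.

```python
import re
from collections import Counter

# Assumed layout: host ident user [time] "request" status size
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" means no body was sent; treat it as zero bytes.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


def count_requests(lines):
    """Count hits per requested path, skipping lines that fail to parse."""
    hits = Counter()
    for line in lines:
        record = parse_clf_line(line)
        if record is None:
            continue
        parts = record["request"].split()
        if len(parts) >= 2:  # e.g. ["GET", "/index.html", "HTTP/1.0"]
            hits[parts[1]] += 1
    return hits


sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326')
print(parse_clf_line(sample))
print(count_requests([sample, sample, "not a log line"]))
```

Once lines are parsed into records like these, loading them into a database instead of keeping text files is what enables the more efficient queries mentioned in the log-file discussion.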

3.2.2 Data transportation

Upon the completion of raw data collection, data will be transferred to a data storage infrastructure for processing and analysis. As discussed in Section 2.3, big data is mainly stored in a data center. The data layout should be adjusted to improve computing efficiency or facilitate hardware maintenance. In other words, internal data transmission may occur in the data center. Therefore, data transmission consists of two phases: Inter-DCN transmissions and Intra-DCN transmissions.

– Inter-DCN transmissions: Inter-DCN transmissions are from data source to data center, which is generally achieved with the existing physical network infrastructure. Because of the rapid growth of traffic demands, the physical network infrastructure in most regions around the world is constituted by high-volume, high-rate, and cost-effective optical fiber transmission systems. Over the past 20 years, advanced management equipment and technologies have been developed, such as IP-based wavelength division multiplexing (WDM) network architecture, to conduct smart control and management of optical fiber networks [48, 49]. WDM is a technology that multiplexes multiple optical carrier signals with different wavelengths and couples them to the same optical fiber of the optical link. In such technology, lasers with different wavelengths carry different signals. So far, backbone networks have been deployed with WDM optical transmission systems with a single-channel rate of 40 Gb/s. At present, 100 Gb/s commercial interfaces are available, and 100 Gb/s systems (or TB/s systems) will be available in the near future [50]. However, traditional optical transmission technologies are limited by the bandwidth of the electronic bottleneck [51]. Recently, orthogonal frequency-division multiplexing (OFDM), initially designed for wireless systems, has been regarded as one of the main candidate technologies for future high-speed optical transmission. OFDM is a multi-carrier parallel transmission technology. It segments a high-speed data flow to transform it into low-speed sub-data-flows to be transmitted over multiple orthogonal sub-carriers [52]. Compared with the fixed channel spacing of WDM, OFDM allows sub-channel frequency spectrums to overlap with each other [53]. Therefore, it is a flexible and efficient optical networking technology.

– Intra-DCN transmissions: Intra-DCN transmissions are the data communication flows within data centers. Intra-DCN transmissions depend on the communication mechanism within the data center (i.e., on physical connection plates, chips, internal memories of data servers, network architectures of data centers, and communication protocols). A data center consists of multiple integrated server racks interconnected with its internal connection networks. Nowadays, the internal connection networks of most data centers are fat-tree, two-layer or three-layer structures based on multi-commodity network flows [51, 54]. In the two-layer topological structure, the racks are connected by 1 Gbps top-of-rack (TOR) switches, and such TOR switches are then connected with 10 Gbps aggregation switches. The three-layer topological structure adds one layer on top of the two-layer structure; this layer is constituted by 10 Gbps or 100 Gbps core switches that connect the aggregation switches. There are also other topological structures which aim to improve data center networks [55–58]. Because of the inadequacy of electronic packet switches, it is difficult to increase communication bandwidth while keeping energy consumption low. Over the years, due to the huge success achieved by optical technologies, optical interconnection among the networks in data centers has drawn great interest. Optical interconnection is a high-throughput, low-delay, and low-energy-consumption solution. At present, optical technologies are only used for point-to-point links in data centers. Such optical links provide connection for the switches using low-cost multi-mode fiber (MMF) with a 10 Gbps data rate. Optical interconnection (switching in the optical domain) of networks in data centers is a feasible solution, which can provide Tbps-level transmission bandwidth with low energy consumption. Recently, many optical interconnection plans have been proposed for data center networks [59]. Some plans add optical paths to upgrade the existing networks, and other plans completely replace the current switches [59–64]. As a strengthening technology, Zhou et al. in [65] adopt wireless links in the 60 GHz frequency band to strengthen wired links. Network virtualization should also be considered to improve the efficiency and utilization of data center networks.
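The OFDM idea mentioned under Inter-DCN transmissions (splitting one fast data flow into slower sub-flows carried on mutually orthogonal sub-carriers) can be illustrated with a small numerical sketch. The code below is a textbook baseband model built on the discrete Fourier transform, written with the Python standard library only; it is an assumption-laden illustration, not a model of any optical system cited in this survey, and it omits practical elements such as the cyclic prefix, channel effects, and optical modulation. All function names are introduced here for illustration.

```python
import cmath


def subcarrier(k, n):
    """Samples of the k-th complex sub-carrier over one symbol of n samples."""
    return [cmath.exp(2j * cmath.pi * k * t / n) for t in range(n)]


def inner_product(a, b):
    """Normalized inner product; near zero for orthogonal sequences."""
    return sum(x * y.conjugate() for x, y in zip(a, b)) / len(a)


def serial_to_parallel(symbols, n_subcarriers):
    """Split one fast serial stream into n slower parallel sub-streams."""
    return [symbols[k::n_subcarriers] for k in range(n_subcarriers)]


def ofdm_modulate(block):
    """Inverse DFT: place one symbol on each sub-carrier and sum them."""
    n = len(block)
    return [
        sum(block[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def ofdm_demodulate(samples):
    """Forward DFT: project the received samples onto each sub-carrier."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


# Distinct sub-carriers are orthogonal over one symbol period.
print(abs(inner_product(subcarrier(1, 8), subcarrier(2, 8))))

# Four symbols (hypothetical QPSK-like values), one per sub-carrier.
block = [1 + 0j, -1 + 0j, 1j, -1j]
received = ofdm_demodulate(ofdm_modulate(block))
print(all(abs(a - b) < 1e-9 for a, b in zip(block, received)))
```

Because the sub-carriers are orthogonal, each symbol can be recovered by projection even though the sub-carrier spectra overlap, which is the property the text contrasts with the fixed channel spacing of WDM.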

3.2.3 Data pre-processing

Because of the wide variety of data sources, the collected datasets vary with respect to noise, redundancy, and consistency, etc., and it is undoubtedly a waste to store meaningless data. In addition, some analytical methods have stringent requirements on data quality. Therefore, in order to enable effective data analysis, we shall pre-process data under many circumstances to integrate the data from different sources, which not only reduces storage expense but also improves analysis accuracy. Some relational data pre-processing techniques are discussed as follows.

– Integration: Data integration is the cornerstone of modern commercial informatics, which involves the combination of data from different sources and provides users with a uniform view of data [66]. This is a mature research field for traditional databases. Historically, two methods have been widely recognized: data warehouse and data federation. Data warehousing includes a process named ETL (Extract, Transform, and Load). Extraction involves connecting source systems and selecting, collecting, analyzing, and processing necessary data. Transformation is the execution of a series of rules to transform the extracted data into standard formats. Loading means importing extracted and transformed data into the target storage infrastructure. Loading is the most complex procedure among the three, which includes operations such as transformation, copy, clearing, standardization, screening, and data organization. In data federation, a virtual database can be built to query and aggregate data from different data sources, but such a database does not contain data. Instead, it includes information or metadata related to actual data and its positions. These two "storage-reading" approaches do not satisfy the high performance requirements of data flows or search programs and applications. Compared with queries, the data in such flows and search applications is more dynamic and must be processed during data transmission. Generally, data integration methods are accompanied with flow processing engines and search engines [30, 67].

– Cleaning: Data cleaning is a process to identify inaccurate, incomplete, or unreasonable data, and then modify or delete such data to improve data quality. Generally, data cleaning includes five complementary procedures [68]: defining and determining error types, searching and identifying errors, correcting errors, documenting error examples and error types, and modifying data entry procedures to reduce future errors. During cleaning, data formats, completeness, rationality, and restriction shall be inspected. Data cleaning is of vital importance to keep data consistency, and is widely applied in many fields, such as banking, insurance, retail industry, telecommunications, and traffic control.

In e-commerce, most data is electronically collected, which may have serious data quality problems. Classic data quality problems mainly come from software defects, customized errors, or system mis-configuration. The authors in [69] discussed data cleaning in e-commerce by crawlers and regularly re-copying customer and account information.

In [70], the problem of cleaning RFID data was examined. RFID is widely used in many applications, e.g., inventory management and target tracking. However, the original RFID data features low quality, including a lot of abnormal data limited by the physical design and affected by environmental noises. In [71], a probability model was developed to cope with data loss in mobile environments. Khoussainova et al. in [72] proposed a system to automatically correct errors of input data by defining global integrity constraints.

Herbert et al. [73] proposed a framework called BIO-AJAX to standardize biological data so as to conduct further computation and improve search quality. With BIO-AJAX, some errors and repetitions may be eliminated, and common data mining technologies can be executed more effectively.

– Redundancy elimination: Data redundancy refers to data repetitions or surplus, which usually occurs in many datasets. Data redundancy can increase unnecessary data transmission expense and cause defects in storage systems, e.g., waste of storage space, data inconsistency, reduction of data reliability, and data damage. Therefore, various redundancy reduction methods have been proposed, such as redundancy detection, data filtering, and data compression. Such methods may apply to different datasets or application environments. However, redundancy reduction may also bring about certain negative effects. For example, data compression and decompression cause additional computational burden. Therefore, the benefits of redundancy reduction and the cost should be carefully balanced. Data collected from different fields will increasingly appear in image or video formats. It is well known that images and videos contain considerable redundancy, including temporal redundancy, spatial redundancy, statistical redundancy, and sensing redundancy. Video compression is widely used to reduce redundancy in video data, as specified in the many video coding standards (MPEG-2, MPEG-4, H.263, and H.264/AVC). In [74], the authors investigated the problem of video compression in a video surveillance system with a video sensor network. The authors propose a new MPEG-4 based method by investigating the contextual redundancy related to background and foreground in a scene. The low complexity and the low compression ratio of the proposed approach were demonstrated by the evaluation results.

On generalized data transmission or storage, repeated data deletion (i.e., data deduplication) is a special data compression technology, which aims to eliminate repeated data copies [75]. With repeated data deletion, individual data blocks or data segments will be assigned identifiers (e.g., using a hash algorithm) and stored, with the identifiers added to the identification list. As the analysis of repeated data deletion continues, if a new data block has an identifier that is identical to one listed in the identification list, the new data block will be deemed redundant and will be replaced by the corresponding stored data block. Repeated data deletion can greatly reduce storage requirements, which is particularly important to a big data storage system. Apart from the aforementioned data pre-processing methods, specific data objects shall go through some other operations such as feature extraction. Such operation plays an important role in multimedia search and DNA analysis [76–78]. Usually high-dimensional feature vectors (or high-dimensional feature points) are used to describe such data objects, and the system stores the dimensional feature vectors for future retrieval. Data transfer is usually used to process distributed heterogeneous data sources, especially business datasets [79]. As a matter of fact, in consideration of various datasets, it is non-trivial or impossible to build a uniform data pre-processing procedure and technology that is applicable to all types of datasets. The specific features, problem, performance requirements, and other factors of the datasets should be considered, so as to select a proper data pre-processing strategy.
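The repeated data deletion procedure described above (assign each block a hash-based identifier, keep an identification list, and store a block only if its identifier is new) can be sketched as follows. This is a minimal illustration under stated assumptions: fixed-size blocks, SHA-256 as the hash algorithm, and an in-memory dictionary as the identification list. The class name `DedupStore` and its methods are hypothetical and do not come from [75].

```python
import hashlib


class DedupStore:
    """Toy block store that keeps only one copy of each distinct block."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}  # identifier -> stored block (the identification list)

    def put(self, data):
        """Store data and return the list of block identifiers describing it."""
        recipe = []
        for start in range(0, len(data), self.block_size):
            block = data[start:start + self.block_size]
            ident = hashlib.sha256(block).hexdigest()
            if ident not in self.blocks:  # new identifier: store the block
                self.blocks[ident] = block
            recipe.append(ident)          # known identifier: reference only
        return recipe

    def get(self, recipe):
        """Rebuild the original data from its list of identifiers."""
        return b"".join(self.blocks[ident] for ident in recipe)

    def stored_bytes(self):
        """Total size of the unique blocks actually kept."""
        return sum(len(block) for block in self.blocks.values())


store = DedupStore(block_size=4)
data = b"ABCDABCDABCDWXYZ"      # four 4-byte blocks, only two distinct
recipe = store.put(data)
print(len(data), "bytes written,", store.stored_bytes(), "bytes stored")
print(store.get(recipe) == data)
```

This sketch treats equal hash values as equal blocks, which is the usual practical assumption with a cryptographic hash; a system that cannot tolerate even an improbable collision would compare block contents before discarding a copy.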

                    4 Big data storage

The explosive growth of data imposes stricter requirements on storage and management. In this section, we focus on the storage of big data. Big data storage refers to the storage and management of large-scale datasets while achieving reliability and availability of data accessing. We will review important issues including massive storage systems, distributed storage systems, and big data storage mechanisms. On one hand, the storage infrastructure needs to provide information storage service with reliable storage space; on the other hand, it must provide a powerful access interface for query and analysis of a large amount of data.

Traditionally, as auxiliary equipment of a server, a data storage device is used to store, manage, look up, and analyze data with structured RDBMSs. With the sharp growth of data, data storage devices are becoming increasingly important, and many Internet companies pursue large storage capacity to be competitive. Therefore, there is a compelling need for research on data storage.


4.1 Storage system for massive data

Various storage systems have emerged to meet the demands of massive data. Existing massive storage technologies can be classified as Direct Attached Storage (DAS) and network storage, while network storage can be further classified into Network Attached Storage (NAS) and Storage Area Network (SAN).

In DAS, various hard disks are directly connected with servers, and data management is server-centric, such that storage devices are peripheral equipment, each of which takes a certain amount of I/O resource and is managed by individual application software. For this reason, DAS is only suitable for interconnecting servers at a small scale. Moreover, due to its low scalability, DAS will exhibit undesirable efficiency when the storage capacity is increased, i.e., the upgradeability and expandability are greatly limited. Thus, DAS is mainly used in personal computers and small-sized servers.

Network storage utilizes a network to provide users with a unified interface for data access and sharing. Network storage equipment includes special data exchange equipment, disk arrays, tape libraries, and other storage media, as well as special storage software. It is characterized by strong expandability.

NAS is actually auxiliary storage equipment of a network. It is directly connected to a network through a hub or switch using TCP/IP protocols. In NAS, data is transmitted in the form of files. Compared to DAS, the I/O burden at a NAS server is reduced extensively, since the server accesses a storage device indirectly through a network.

While NAS is network-oriented, SAN is especially designed for data storage with a scalable and bandwidth-intensive network, e.g., a high-speed network with optical fiber connections. In SAN, data storage management is relatively independent within a storage local area network, where multipath-based data switching among any internal nodes is utilized to achieve a maximum degree of data sharing and data management.

From the organization of a data storage system, DAS, NAS, and SAN can all be divided into three parts: (i) the disk array, which is the foundation of a storage system and the fundamental guarantee for data storage; (ii) the connection and network sub-systems, which provide connections among one or more disk arrays and servers; and (iii) the storage management software, which handles data sharing, disaster recovery, and other storage management tasks of multiple servers.

4.2 Distributed storage system

The first challenge brought about by big data is how to develop a large-scale distributed storage system for efficient data processing and analysis. To use a distributed system to store massive data, the following factors should be taken into consideration:

– Consistency: a distributed storage system requires multiple servers to cooperatively store data. As there are more servers, the probability of server failures becomes larger. Usually, data is divided into multiple pieces to be stored at different servers to ensure availability in case of server failure. However, server failures and parallel storage may cause inconsistency among different copies of the same data. Consistency refers to assuring that multiple copies of the same data are identical.

– Availability: a distributed storage system operates on multiple sets of servers. As more servers are used, server failures are inevitable. It is desirable that the entire system is not seriously affected by such failures and can still satisfy customers' read and write requests. This property is called availability.

– Partition Tolerance: multiple servers in a distributed storage system are connected by a network. The network could have link/node failures or temporary congestion. The distributed system should have a certain level of tolerance to problems caused by network failures. It is desirable that the distributed storage system still works well when the network is partitioned.

Eric Brewer proposed the CAP theory [80, 81] in 2000, which indicated that a distributed system cannot simultaneously meet the requirements on consistency, availability, and partition tolerance; at most two of the three requirements can be satisfied simultaneously. Seth Gilbert and Nancy Lynch from MIT proved the correctness of the CAP theory in 2002. Since consistency, availability, and partition tolerance cannot be achieved simultaneously, we can have a CA system by ignoring partition tolerance, a CP system by ignoring availability, and an AP system by ignoring consistency, according to different design goals. The three systems are discussed in the following.

CA systems do not have partition tolerance, i.e., they cannot handle network failures. Therefore, CA systems are generally deemed as storage systems with a single server, such as traditional small-scale relational databases. Such systems feature a single copy of data, such that consistency is easily ensured. Availability is guaranteed by the excellent design of relational databases. However, since CA systems cannot handle network failures, they cannot be expanded to use many servers. Therefore, most large-scale storage systems are CP systems or AP systems.

Compared with CA systems, CP systems ensure partition tolerance. Therefore, CP systems can be expanded to become distributed systems. CP systems generally maintain several copies of the same data in order to ensure a level of fault tolerance. CP systems also ensure data consistency, i.e., multiple copies of the same data are guaranteed to be completely identical. However, CP systems cannot ensure sound availability because of the high cost of consistency assurance. Therefore, CP systems are useful for scenarios with moderate load but stringent requirements on data accuracy (e.g., trading data). BigTable and HBase are two popular CP systems.

AP systems also ensure partition tolerance. However, AP systems differ from CP systems in that they also ensure availability, at the price of ensuring only eventual consistency rather than the strong consistency of the previous two types of systems. Therefore, AP systems only apply to scenarios with frequent requests but not very high requirements on accuracy. For example, in online Social Networking Services (SNS) systems, there are many concurrent visits to the data, but a certain amount of data errors is tolerable. Furthermore, because AP systems ensure eventual consistency, accurate data can still be obtained after a certain amount of delay. Therefore, AP systems may also be used under circumstances with no stringent real-time requirements. Dynamo and Cassandra are two popular AP systems.
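To make the difference between CP and AP behavior concrete, the following minimal sketch simulates a three-replica register under a network partition. It is an illustration only: the class `Cluster`, its `mode` flag, the majority rule used for CP, and the newest-write-wins reconciliation used for AP are assumptions introduced here, not the actual mechanisms of BigTable, HBase, Dynamo, or Cassandra.

```python
class Cluster:
    """Toy replicated register (hypothetical); mode is "CP" or "AP"."""

    def __init__(self, names, mode):
        self.mode = mode
        self.clock = 0                              # logical write counter
        self.data = {n: (0, None) for n in names}   # replica -> (version, value)
        self.side = {n: 0 for n in names}           # network side of each replica

    def partition(self, isolated):
        """Cut the replicas in `isolated` off from all the others."""
        self.side = {n: (1 if n in isolated else 0) for n in self.data}

    def heal(self):
        """Reconnect the network; every replica adopts the newest version."""
        self.side = {n: 0 for n in self.data}
        newest = max(self.data.values(), key=lambda pair: pair[0])
        self.data = {n: newest for n in self.data}

    def write(self, via, value):
        """Write through replica `via`; return True if the write is accepted."""
        reachable = [n for n in self.data if self.side[n] == self.side[via]]
        majority = len(self.data) // 2 + 1
        if self.mode == "CP" and len(reachable) < majority:
            return False      # refuse rather than diverge: availability is lost
        self.clock += 1
        for n in reachable:
            self.data[n] = (self.clock, value)
        return True

    def read(self, via):
        return self.data[via][1]


def demo(mode):
    """Partition replica "c" away, write on both sides, then heal."""
    cluster = Cluster(["a", "b", "c"], mode)
    cluster.partition({"c"})
    accepted = (cluster.write("a", "x"), cluster.write("c", "y"))
    during = (cluster.read("a"), cluster.read("c"))
    cluster.heal()
    after = {cluster.read(n) for n in "abc"}
    return accepted, during, after
```

In this toy run, `demo("CP")` rejects the write on the minority side, whereas `demo("AP")` accepts both writes, serves divergent values while the network is partitioned, and converges on a single value once the partition heals.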

4.3 Storage mechanism for big data

Considerable research on big data has promoted the development of storage mechanisms for big data. Existing storage mechanisms for big data may be classified into three bottom-up levels: (i) file systems, (ii) databases, and (iii) programming models.

File systems are the foundation of the applications at upper levels. Google's GFS is an expandable distributed file system that supports large-scale, distributed, data-intensive applications [25]. GFS uses cheap commodity servers to achieve fault tolerance and provides customers with high-performance services. GFS supports large-scale file applications with more frequent reading than writing. However, GFS also has some limitations, such as a single point of failure and poor performance for small files. Such limitations have been overcome by Colossus [82], the successor of GFS.

In addition, other companies and researchers also have their own solutions to meet the different demands for storage of big data. For example, HDFS and KosmosFS are open-source derivatives of GFS. Microsoft developed Cosmos [83] to support its search and advertisement business. Facebook utilizes Haystack [84] to store its large number of small-sized photos. Taobao also developed TFS and FastDFS. In conclusion, distributed file systems have become relatively mature after years of development and business operation. Therefore, we will focus on the other two levels in the rest of this section.

4.3.1 Database technology

Database technology has been evolving for more than 30 years. Various database systems have been developed to handle datasets at different scales and to support various applications. Traditional relational databases cannot meet the challenges on categories and scales brought about by big data. NoSQL databases (i.e., non-traditional relational databases) are becoming more popular for big data storage. NoSQL databases feature flexible schemas, support for simple and easy replication, simple APIs, eventual consistency, and support for large volumes of data. NoSQL databases are becoming the core technology for big data. We will examine the following three main types of NoSQL databases in this section: key-value databases, column-oriented databases, and document-oriented databases, each based on a certain data model.

– Key-value Databases: Key-value databases are constituted by a simple data model in which data is stored as key-value pairs. Every key is unique, and clients may look up values according to their keys. Such databases feature a simple structure, and modern key-value databases are characterized by higher expandability and shorter query response times than relational databases. Over the past few years, many key-value databases have appeared, motivated by Amazon's Dynamo system [85]. We will introduce Dynamo and several other representative key-value databases.

  – Dynamo: Dynamo is a highly available and expandable distributed key-value data storage system. It is used to store and manage the status of some core services in the Amazon e-Commerce Platform that can be realized with key access. The conventional use of relational databases may introduce inefficiencies and limit data scale and availability, while Dynamo resolves these problems with a simple key-object interface, which is constituted by simple read and write operations. Dynamo achieves elasticity and availability through its data partitioning, data replication, and object versioning mechanisms. The Dynamo partitioning scheme relies on Consistent Hashing [86] to divide the load among multiple main storage machines; its main advantage is that the arrival or departure of a node only affects its directly adjacent nodes and does not affect other nodes. Dynamo replicates data to N servers, where N is a configurable parameter, in order to achieve high availability and durability. The Dynamo system also provides eventual consistency, so that updates can be applied asynchronously to all copies.

  – Voldemort: Voldemort is also a key-value storage system, which was initially developed for and is still used by LinkedIn. Keys and values in Voldemort are composite objects constituted by lists and maps. The Voldemort interface includes three simple operations: reading, writing, and deletion, all of which are identified by keys. Voldemort provides asynchronous updating with multi-version concurrency control, but does not ensure data consistency. However, Voldemort supports optimistic locking for consistent multi-record updating: when a conflict happens between an update and any other operation, the update operation quits. The data replication mechanism of Voldemort is the same as that of Dynamo. Voldemort not only stores data in RAM, but also allows data to be inserted into a storage engine. In particular, Voldemort supports two storage engines: Berkeley DB and Random Access Files.

Key-value databases emerged only a few years ago. Deeply influenced by Amazon's Dynamo, other key-value storage systems include Redis, Tokyo Cabinet and Tokyo Tyrant, Memcached and MemcacheDB, Riak, and Scalaris, all of which provide expandability by distributing keys across nodes. Voldemort, Riak, Tokyo Cabinet, and Memcached can utilize attached storage devices to store data in RAM or on disks. The other storage systems store data in RAM and provide disk backup, or rely on replication and recovery to avoid backup.
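The consistent hashing idea described for Dynamo can be sketched as follows. This is a simplified illustration under stated assumptions, not Dynamo's implementation: the names `HashRing`, `ring_position`, and `preference_list` are hypothetical, MD5 is chosen only for convenience, and each server occupies a single point on the ring.

```python
import bisect
import hashlib


def ring_position(name):
    """Map a string to a point on the ring (MD5 used only for illustration)."""
    return int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical ring: each key is replicated on its N clockwise successors."""

    def __init__(self, nodes, n_replicas=3):
        self.n_replicas = n_replicas  # the configurable parameter N
        self.ring = sorted((ring_position(node), node) for node in nodes)

    def add(self, node):
        bisect.insort(self.ring, (ring_position(node), node))

    def remove(self, node):
        self.ring.remove((ring_position(node), node))

    def preference_list(self, key):
        """Return the first N nodes met when walking clockwise from the key."""
        if not self.ring:
            return []
        start = bisect.bisect_left(self.ring, (ring_position(key), ""))
        count = min(self.n_replicas, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]
```

With this layout, removing a server changes the replica set only for keys that were stored on that server; every other key keeps exactly the same N servers, which is the locality property the text attributes to consistent hashing.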

– Column-oriented Databases: Column-oriented databases store and process data according to columns rather than rows. Both columns and rows are segmented across multiple nodes to realize expandability. Column-oriented databases are mainly inspired by Google's BigTable. In this section, we first discuss BigTable and then introduce several derivative tools.

  – BigTable: BigTable is a distributed, structured data storage system, which is designed to process large-scale (PB class) data among thousands of commodity servers [87]. The basic data structure of BigTable is a sparse, distributed, persistent, multi-dimensional sorted map. The map is indexed by a row key, a column key, and a timestamp, and every value in the map is an uninterpreted byte array. Each row key in BigTable is a character string of up to 64 KB. Rows are stored in lexicographical order and continually segmented into Tablets (i.e., units of distribution) for load balance. Thus, reading a short range of rows can be highly efficient, since it only involves communication with a small portion of machines. The columns are grouped according to the prefixes of keys, thus forming column families. These column families are the basic units for access control. The timestamps are 64-bit integers that distinguish different versions of cell values. Clients may flexibly determine the number of cell versions stored. These versions are sorted in descending order of timestamps, so the latest version is always read first.

The BigTable API features the creation and deletion of tables and column families, as well as the modification of metadata of clusters, tables, and column families. Client applications may insert or delete values in BigTable, query values from columns, or browse sub-datasets in a table. BigTable also supports some other characteristics, such as transaction processing within a single row. Users may utilize such features to conduct more complex data processing.

Every BigTable deployment includes three main components: the Master server, the Tablet servers, and the client library. BigTable allows only one Master server to be deployed, which is responsible for assigning Tablets to Tablet servers, detecting added or removed Tablet servers, and conducting load balancing. In addition, the Master can also modify the BigTable schema, e.g., creating tables and column families, and garbage-collect files saved in GFS that have been deleted or are no longer used by a specific BigTable instance. Every Tablet server manages a set of Tablets and is responsible for the reading and writing of its loaded Tablets. When Tablets grow too big, they are segmented by the server. The application client library is used to communicate with BigTable instances.

BigTable is based on many fundamental components of Google, including GFS [25], the cluster management system, the SSTable file format, and Chubby [88]. GFS is used to store data and log files. The cluster management system is responsible for task scheduling, resource sharing, processing of machine failures, and monitoring of machine statuses. The SSTable file format is used to store BigTable data internally, and it provides a persistent, ordered, immutable mapping from keys to values, both of which are arbitrary byte strings. BigTable utilizes Chubby for the following tasks: 1) ensuring there is at most one active Master copy at any time; 2) storing the bootstrap location of BigTable data; 3) discovering Tablet servers; 4) conducting error recovery in case of Tablet server failures; 5) storing BigTable schema information; and 6) storing the access control table.

  – Cassandra: Cassandra is a distributed storage system for managing huge amounts of structured data distributed among multiple commodity servers [89]. The system was developed by Facebook and became an open source tool in 2008. It adopts the ideas and concepts of both Amazon Dynamo and Google BigTable, especially by integrating the distributed system technology of Dynamo with the BigTable data model. Tables in Cassandra take the form of a distributed four-dimensional structured map, where the four dimensions are row, column, column family, and super column. A row is distinguished by a string key of arbitrary length. No matter how many columns are read or written, an operation on a row is atomic. Columns may be grouped into clusters, called column families, which are similar to those in the data model of BigTable. Cassandra provides two kinds of column families: column families and super columns. A super column includes an arbitrary number of columns related to the same name. A column family includes columns and super columns, which may be continuously inserted into the column family during runtime. The partitioning and replication mechanisms of Cassandra are very similar to those of Dynamo, so as to achieve consistency.

  – Derivative tools of BigTable: since the BigTable code is not available under an open source license, some open source projects compete to implement the BigTable concept and develop similar systems, such as HBase and HyperTable.

HBase is a BigTable clone programmed in Java and is a part of Apache's Hadoop MapReduce framework [90]. HBase replaces GFS with HDFS. It writes updated contents into RAM and regularly writes them into files on disks. Row operations are atomic, equipped with row-level locking and transaction processing, which is optional for large scale. Partitioning and distribution are handled transparently, with no need for client-side hashing or a fixed key space.

HyperTable was developed along the lines of BigTable to obtain a high-performance, expandable, distributed storage and processing system for structured and unstructured data [91]. HyperTable relies on distributed file systems, e.g., HDFS, and a distributed lock manager. Its data representation, processing, and partitioning mechanisms are similar to those of BigTable. HyperTable has its own query language, called HyperTable Query Language (HQL), which allows users to create, modify, and query underlying tables.

Since the column-oriented storage databases mainly emulate BigTable, their designs are all similar, except for the concurrency mechanism and several other features. For example, Cassandra emphasizes weak consistency with multi-version concurrency control, while HBase and HyperTable focus on strong consistency through locks or log records.
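The BigTable data model described above, a map from (row key, column key, timestamp) to an uninterpreted byte string, can be illustrated with a small in-memory sketch. The class `TinyTable` and its methods are hypothetical names introduced for illustration; the sketch ignores Tablets, persistence, and distribution, and only mirrors the indexing, the descending timestamp order, and the configurable number of retained versions.

```python
from collections import defaultdict


class TinyTable:
    """Hypothetical in-memory map: (row, column, timestamp) -> bytes."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # (row, column) -> list of (timestamp, value), newest first
        self._cells = defaultdict(list)

    def put(self, row, column, timestamp, value):
        versions = self._cells[(row, column)]
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)  # descending timestamps
        del versions[self.max_versions:]                   # drop the oldest versions

    def get(self, row, column, timestamp=None):
        """Return the newest value, or the newest one at or before `timestamp`."""
        for ts, value in self._cells.get((row, column), []):
            if timestamp is None or ts <= timestamp:
                return value
        return None

    def scan(self, start_row, end_row):
        """Yield (row, column, latest value) for start_row <= row < end_row."""
        for row, column in sorted(self._cells):            # lexicographic row order
            if start_row <= row < end_row:
                yield row, column, self._cells[(row, column)][0][1]
```

Because rows are visited in lexicographic order, a scan over a short range of row keys touches only a contiguous slice of the map, which corresponds to the small number of Tablets that a real deployment would need to contact.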

– Document Databases: Compared with key-value storage, document storage can support more complex data forms. Since documents do not follow a strict schema, there is no need to conduct schema migration. In addition, key-value pairs can still be saved. We will examine three important representatives of document storage systems, i.e., MongoDB, SimpleDB, and CouchDB.

  – MongoDB: MongoDB is an open-source, document-oriented database [92]. MongoDB stores documents as Binary JSON (BSON) objects [93], which are similar to JSON objects. Every document has an ID field as the primary key. Queries in MongoDB are expressed with a syntax similar to JSON. A database driver sends the query as a BSON object to MongoDB. The system allows queries on all documents, including embedded objects and arrays. To enable rapid queries, indexes can be created on the queryable fields of documents. Replication in MongoDB is carried out with log files on the master nodes that record all the high-level operations conducted in the database. During replication, the slaves query the master for all the write operations since the last synchronization and execute the logged operations on their local databases. MongoDB supports horizontal expansion with automatic sharding to distribute data among thousands of nodes, with automatic load balancing and failover.


  – SimpleDB: SimpleDB is a distributed database and a web service of Amazon [94]. Data in SimpleDB is organized into various domains, in which data may be stored, acquired, and queried. Domains include different properties and sets of name/value pairs of items. Data is replicated to different machines at different data centers in order to ensure data safety and improve performance. This system does not support automatic partitioning and thus cannot be expanded with the change of data volume. SimpleDB allows users to query with SQL. It is worth noting that SimpleDB can assure eventual consistency but does not support Multi-Version Concurrency Control (MVCC). Therefore, conflicts therein cannot be detected from the client side.

  – CouchDB: Apache CouchDB is a document-oriented database written in Erlang [95]. Data in CouchDB is organized into documents consisting of fields named by keys/names and values, which are stored and accessed as JSON objects. Every document is provided with a unique identifier. CouchDB allows access to database documents through a RESTful HTTP API. If a document needs to be modified, the client must download the entire document, modify it, and then send it back to the database. After a document is rewritten, its revision identifier is updated. CouchDB utilizes optimistic replication to obtain scalability without a sharding mechanism. Since multiple CouchDB instances may be executed along with other transactions simultaneously, any kind of replication topology can be built. The consistency of CouchDB relies on the replication mechanism. CouchDB supports MVCC with historical hash records.
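The download-modify-send-back cycle and the client-side conflict detection that MVCC makes possible can be sketched with a small in-memory document store. This is not CouchDB's API: the class `DocStore`, the exception `Conflict`, and the revision format (a generation counter followed by a content hash) are assumptions made for illustration.

```python
import hashlib
import json


class Conflict(Exception):
    """Raised when a writer presents a stale revision."""


class DocStore:
    """Hypothetical document store with optimistic, revision-based updates."""

    def __init__(self):
        self._docs = {}  # doc_id -> (revision, document body)

    def get(self, doc_id):
        rev, body = self._docs[doc_id]
        return dict(body, _id=doc_id, _rev=rev)

    def put(self, doc_id, doc, rev=None):
        """Store the whole document; `rev` must match the current revision."""
        current = self._docs.get(doc_id)
        current_rev = current[0] if current else None
        if rev != current_rev:
            raise Conflict(f"{doc_id}: expected {current_rev}, got {rev}")
        # Fields starting with "_" are bookkeeping and not part of the body.
        body = {k: v for k, v in doc.items() if not k.startswith("_")}
        generation = int(current_rev.split("-")[0]) + 1 if current_rev else 1
        digest = hashlib.md5(json.dumps(body, sort_keys=True).encode()).hexdigest()
        new_rev = f"{generation}-{digest}"
        self._docs[doc_id] = (new_rev, body)
        return new_rev
```

A client that read an older revision and tries to write it back receives `Conflict` instead of silently overwriting a newer version, which is the kind of detection that the text notes is unavailable in a system without MVCC.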

4.3.2 Programming models

Big data is generally stored on hundreds or even thousands of commodity servers. Thus, traditional parallel models, such as Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), may not be adequate to support such large-scale parallel programs. Recently, some proposed parallel programming models have effectively improved the performance of NoSQL and reduced the performance gap to relational databases. Therefore, these models have become the cornerstone for the analysis of massive data.

– MapReduce: MapReduce [22] is a simple but powerful programming model for large-scale computing that uses large clusters of commodity PCs to achieve automatic parallel processing and distribution. In MapReduce, the computing model has only two functions, i.e., Map and Reduce, both of which are programmed by users. The Map function processes input key-value pairs and generates intermediate key-value pairs. Then, MapReduce combines all the intermediate values related to the same key and transmits them to the Reduce function, which further compresses the value set into a smaller set. MapReduce has the advantage that it avoids the complicated steps of developing parallel applications, e.g., data scheduling, fault tolerance, and inter-node communication. The user only needs to program the two functions to develop a parallel application. The initial MapReduce framework did not support multiple datasets in a task, which has been mitigated by some recent enhancements [96, 97].

Over the past decades, programmers have become familiar with the advanced declarative language SQL, often used in relational databases for task description and dataset analysis. However, the succinct MapReduce framework only provides two opaque functions, which cannot cover all the common operations. Therefore, programmers have to spend time programming basic functions, which are typically hard to maintain and reuse. In order to improve programming efficiency, some advanced language systems have been proposed, e.g., Sawzall [98] of Google, Pig Latin [99] of Yahoo, Hive [100] of Facebook, and Scope [87] of Microsoft.
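The Map, group-by-key, and Reduce steps can be illustrated with a single-process word count. This is a sketch of the programming model only: the driver `run_mapreduce` and the functions `map_fn` and `reduce_fn` are hypothetical names, and nothing here is distributed or fault tolerant.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: emit an intermediate (word, 1) pair for every word in the line."""
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """Reduce: compress all values of one key into a single total."""
    yield word, sum(counts)


def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)   # combine values of the same key
    output = []
    for out_key in sorted(groups):
        output.extend(reduce_fn(out_key, groups[out_key]))
    return output


records = [(0, "big data big"), (1, "data storage")]
word_counts = run_mapreduce(records, map_fn, reduce_fn)
```

The user supplies only `map_fn` and `reduce_fn`; the grouping in the middle is the part that a real framework would carry out across machines.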

– Dryad: Dryad [101] is a general-purpose distributed execution engine for processing parallel applications on coarse-grained data. The operational structure of Dryad is a directed acyclic graph, in which vertexes represent programs and edges represent data channels. Dryad executes operations on the vertexes in clusters and transmits data via data channels, including files, TCP connections, and shared-memory FIFOs. During operation, resources in the logical operation graph are automatically mapped to physical resources.

The operation structure of Dryad is coordinated by a central program called the job manager, which can be executed in clusters or on workstations through a network. A job manager consists of two parts: 1) application code, which is used to build the job communication graph, and 2) program library code, which is used to arrange available resources. All kinds of data are transmitted directly between vertexes. Therefore, the job manager is only responsible for decision-making and does not obstruct any data transmission.

In Dryad, application developers can flexibly choose any directed acyclic graph to describe the communication modes of the application and to express data transmission mechanisms. In addition, Dryad allows vertexes to use any number of input and output datasets, while MapReduce supports only one input and one output set. DryadLINQ [102] is the advanced language of Dryad and is used to integrate the aforementioned SQL-like language execution environment.
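The idea of a job expressed as a directed acyclic graph, with programs at the vertexes and data flowing along the edges, can be sketched in a few lines. The function `run_dag` is a hypothetical single-process stand-in that uses the standard library module `graphlib` (Python 3.9 or later) to order the vertexes; it does not model Dryad's job manager, channels, or scheduling.

```python
from graphlib import TopologicalSorter


def run_dag(vertices, edges):
    """Run each vertex after all of its predecessors.

    vertices: name -> callable taking a list of input values
    edges:    list of (source, target) pairs, i.e., the data channels
    """
    predecessors = {name: [] for name in vertices}
    for source, target in edges:
        predecessors[target].append(source)
    results = {}
    # static_order() yields every vertex after all of its predecessors.
    for name in TopologicalSorter(predecessors).static_order():
        inputs = [results[p] for p in predecessors[name]]
        results[name] = vertices[name](inputs)
    return results


vertices = {
    "read_a": lambda inputs: [1, 2, 3],
    "read_b": lambda inputs: [4, 5],
    "merge": lambda inputs: [x for part in inputs for x in part],
    "count": lambda inputs: len(inputs[0]),
}
edges = [("read_a", "merge"), ("read_b", "merge"), ("merge", "count")]
results = run_dag(vertices, edges)
```

Here the `merge` vertex consumes two inputs, which illustrates the flexibility of arbitrary graphs compared with the single input and output set of MapReduce.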

– All-Pairs: All-Pairs [103] is a system specially designed for biometrics, bioinformatics, and data mining applications. It focuses on comparing element pairs in two datasets by a given function. All-Pairs can be expressed as a three-tuple (Set A, Set B, and Function F), in which Function F is utilized to compare all elements in Set A and Set B. The comparison result is an output matrix M, which is also called the Cartesian product or cross join of Set A and Set B.

All-Pairs is implemented in four phases: system modeling, distribution of input data, batch job management, and result collection. In Phase I, an approximation model of system performance is built to evaluate how much CPU resource is needed and how to conduct job partitioning. In Phase II, a spanning tree is built for data transmissions, which allows the workload of every partition to retrieve input data effectively. In Phase III, after the data flow is delivered to the proper nodes, the All-Pairs engine builds a batch-processing submission for the jobs in partitions, while sequencing them in the batch processing system and formulating a node running command to acquire data. In the last phase, after the batch processing system completes the jobs, the extraction engine collects the results and combines them in a proper structure, which is generally a single file list in which all results are put in order.
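The All-Pairs abstraction itself, the three-tuple (Set A, Set B, Function F) producing a matrix M, reduces to a few lines when the four distribution phases are left out. The function names `all_pairs` and `hamming` are hypothetical, and the Hamming distance is used only as an example comparison function.

```python
def all_pairs(set_a, set_b, func):
    """Return the matrix M with M[i][j] = func(set_a[i], set_b[j])."""
    return [[func(a, b) for b in set_b] for a in set_a]


def hamming(a, b):
    """Example Function F: number of positions at which two strings differ."""
    return sum(x != y for x, y in zip(a, b))


matrix = all_pairs(["abc", "abd"], ["abc", "xyz"], hamming)
```

The matrix has one row per element of Set A and one column per element of Set B; the phases described above concern how this quadratic amount of work is partitioned and scheduled across machines.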

– Pregel: The Pregel [104] system of Google facilitates the processing of large-sized graphs, e.g., the analysis of network graphs and social networking services. A computational task is expressed by a directed graph constituted by vertexes and directed edges. Every vertex is associated with a modifiable, user-defined value, and every directed edge associated with a source vertex is constituted by a user-defined value and the identifier of a target vertex. When the graph is built, the program conducts iterative calculations, called supersteps, among which global synchronization points are set, until algorithm completion and output completion. In every superstep, vertex computations are parallel, and every vertex executes the same user-defined function to express a given algorithm logic. Every vertex may modify its own status and that of its outgoing edges, receive messages sent in the previous superstep, send messages to other vertexes, and even modify the topological structure of the entire graph. Edges are not provided with corresponding computations. Every vertex may deactivate itself by suspension. When all vertexes are in an inactive status without any message to transmit, the entire program execution is completed. The Pregel program output is a set consisting of the values output from all the vertexes. Generally speaking, the input and output of a Pregel program are isomorphic directed graphs.
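The superstep mechanism can be illustrated with the propagation of the maximum vertex value through a graph, executed sequentially in one process. The function `pregel_max` and its parameters are hypothetical, and the sketch covers only message passing, the synchronization barrier, and deactivation; it is not the Pregel API.

```python
def pregel_max(values, edges, max_supersteps=50):
    """Propagate the largest vertex value along directed edges.

    values: vertex -> initial value
    edges:  vertex -> list of target vertexes
    """
    values = dict(values)
    active = set(values)                  # every vertex starts active
    inbox = {v: [] for v in values}
    superstep = 0
    while superstep < max_supersteps and (active or any(inbox.values())):
        outbox = {v: [] for v in values}
        for v in values:
            messages = inbox[v]
            if v not in active and not messages:
                continue                  # inactive until a message arrives
            new_value = max([values[v]] + messages)
            changed = new_value != values[v]
            values[v] = new_value
            if superstep == 0 or changed:
                for target in edges.get(v, []):
                    outbox[target].append(new_value)
                active.add(v)
            else:
                active.discard(v)         # nothing new to report: deactivate
        inbox = outbox                    # global synchronization point
        superstep += 1
    return values
```

Execution stops when every vertex is inactive and no message is in transit, matching the termination condition stated in the text; the output has the same vertex set as the input.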

Inspired by the above programming models, other research efforts have also focused on programming models for more complex computational tasks, e.g., iterative computations [105, 106], fault-tolerant memory computations [107], incremental computations [108], and flow control decision-making related to data [109].

5 Big data analysis

The analysis of big data mainly involves analytical methods for traditional data and big data, analytical architecture for big data, and software used for mining and analysis of big data. Data analysis is the final and most important phase in the value chain of big data, with the purpose of extracting useful values and providing suggestions or decisions. Different levels of potential value can be generated through the analysis of datasets in different fields [10]. However, data analysis is a broad area, which frequently changes and is extremely complex. In this section, we introduce the methods, architectures, and tools for big data analysis.

5.1 Traditional data analysis

Traditional data analysis means to use proper statistical methods to analyze massive data, to concentrate, extract, and refine useful data hidden in a batch of chaotic datasets, and to identify the inherent law of the subject matter, so as to maximize the value of data. Data analysis plays a huge guidance role in making development plans for a country, understanding customer demands for commerce, and predicting market trends for enterprises. Big data analysis can be deemed as the analysis technique for a special kind of data. Therefore, many traditional data analysis methods may still be utilized for big data analysis. Several representative traditional data analysis methods are examined in the following, many of which are from statistics and computer science.

– Cluster Analysis: a statistical method for grouping objects, specifically, classifying objects according to some features. Cluster analysis is used to differentiate objects with particular features and divide them into categories (clusters) according to these features, such that objects in the same category have high homogeneity while different categories have high heterogeneity. Cluster analysis is an unsupervised learning method that requires no training data.

– Factor Analysis: basically targeted at describing the relation among many elements with only a few factors, i.e., grouping several closely related variables into a factor; the few factors are then used to reveal most of the information in the original data.


– Correlation Analysis: an analytical method for determining the law of relations, such as correlation, correlative dependence, and mutual restriction, among observed phenomena, and accordingly conducting forecast and control. Such relations may be classified into two types: (i) function, reflecting the strict dependence relationship among phenomena, which is also called a definitive dependence relationship; and (ii) correlation, covering undetermined or inexact dependence relations, in which the numerical value of one variable may correspond to several numerical values of the other variable, and such numerical values present a regular fluctuation around their mean values.

– Regression Analysis: a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies dependence relationships among variables hidden by randomness. Regression analysis may turn complex and undetermined correlations among variables into simple and regular ones.

– A/B Testing: also called bucket testing. It is a technique for determining how to improve target variables by comparing test groups. Big data will require a large number of tests to be executed and analyzed.

– Statistical Analysis: Statistical analysis is based on statistical theory, a branch of applied mathematics. In statistical theory, randomness and uncertainty are modeled with probability theory. Statistical analysis can provide a description and an inference for big data. Descriptive statistical analysis can summarize and describe datasets, while inferential statistical analysis can draw conclusions from data subject to random variations. Statistical analysis is widely applied in the economic and medical care fields [110].

                    ndash Data Mining Algorithms Data mining is a processfor extracting hidden unknown but potentially usefulinformation and knowledge from massive incompletenoisy fuzzy and random data In 2006 The IEEE Inter-national Conference on Data Mining Series (ICDM)identified ten most influential data mining algorithmsthrough a strict selection procedure [111] includingC45 k-means SVM Apriori EM Naive Bayes andCart etc These ten algorithms cover classificationclustering regression statistical learning associationanalysis and linking mining all of which are the mostimportant problems in data mining research
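To make the difference between correlation analysis and regression analysis concrete, the following minimal sketch computes a Pearson correlation coefficient and fits a least-squares line using only the Python standard library. The sample values and the function names `pearson` and `fit_line` are illustrative assumptions, not part of the surveyed work.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    a = my - b * mx
    return a, b

# Hypothetical observations: y fluctuates around a linear trend in x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

r = pearson(xs, ys)      # strength of the (inexact) dependence
a, b = fit_line(xs, ys)  # the regular relationship hidden by randomness
print(f"r = {r:.3f}, fitted line: y = {a:.2f} + {b:.2f} * x")
```

The correlation coefficient only measures how strongly the two variables move together, whereas the fitted line gives a usable model for forecasting.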

5.2 Big data analytic methods

In the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data so as to bring value to enterprises and individuals. At present, the main processing methods of big data are as follows.

– Bloom Filter: A Bloom filter consists of a bit array and a series of hash functions. Its principle is to store hash values of data rather than the data itself in the bit array, which is in essence a bitmap index that uses hash functions to conduct lossy compression storage of data. It has such advantages as high space efficiency and high query speed, but also has disadvantages in misrecognition (false positives) and deletion.

– Hashing: A method that essentially transforms data into shorter fixed-length numerical values or index values. Hashing has such advantages as rapid reading, rapid writing, and high query speed, but it is hard to find a sound hash function.

– Index: An index is always an effective method to reduce the expense of disk reading and writing and to improve insertion, deletion, modification, and query speeds, both in traditional relational databases that manage structured data and in other technologies that manage semi-structured and unstructured data. However, an index has the disadvantage of the additional cost of storing index files, which must be maintained dynamically when data is updated.

– Trie: Also called trie tree, a variant of Hash Tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of a trie is to utilize common prefixes of character strings to reduce comparisons on character strings to the greatest extent, so as to improve query efficiency.

– Parallel Computing: Compared to traditional serial computing, parallel computing refers to simultaneously utilizing several computing resources to complete a computation task. Its basic idea is to decompose a problem into parts and assign them to several separate processes to be completed independently, so as to achieve co-processing. Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see a comparison in Table 1).
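As an illustration of the Bloom filter described in the list above, here is a minimal sketch in Python. The class name, the default sizes, and the use of salted SHA-256 digests as the hash family are assumptions made for this example; production filters usually pick the bit-array size and number of hash functions from a target false-positive rate.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: stores hash positions, never the items themselves."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive num_hashes positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # All positions set: "probably present". Any position clear: absent.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for word in ["hadoop", "mapreduce", "dryad"]:
    bf.add(word)

print("hadoop" in bf)  # True: an added item is never reported as absent
print("mpi" in bf)     # normally False, but a false positive is possible
```

Deletion is hard because clearing a bit could also erase evidence of other items that hash to the same position, which is the deletion drawback mentioned above.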

Although parallel computing systems or tools such as MapReduce and Dryad are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages are being developed based on these systems. Such high-level languages include Sawzall, Pig, and Hive, used for MapReduce, as well as Scope and DryadLINQ, used for Dryad.
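The MapReduce programming model can be sketched in a few lines of Python. The example below is not the Hadoop API; it is a single-machine illustration in which a thread pool stands in for worker nodes, and the function names `map_phase`, `shuffle`, and `reduce_phase` are assumptions chosen for this sketch.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(item):
    """Reduce: sum the values collected for one key."""
    key, values = item
    return key, sum(values)

def word_count(chunks, workers=4):
    # Threads only stand in for cluster workers here; the point is the
    # decomposition into independent map and reduce tasks.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
        groups = shuffle(mapped)
        return dict(pool.map(reduce_phase, groups.items()))

chunks = ["big data analysis", "big data storage", "data acquisition"]
counts = word_count(chunks)
print(counts)
```

Because each map task sees only its own split and each reduce task sees only one key, the framework can partition and re-execute tasks automatically, which is the property that high-level languages such as Pig and Hive build on.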

5.3 Architecture for big data analysis

Because of the 4Vs of big data, different analytical architectures should be considered for different application requirements.


Table 1 Comparison of MPI, MapReduce, and Dryad

Aspect | MPI | MapReduce | Dryad
Deployment | Computing nodes and data storage arranged separately (data should be moved to the computing node) | Computing and data storage arranged at the same node (computing should be close to data) | Computing and data storage arranged at the same node (computing should be close to data)
Resource management/scheduling | – | Workqueue (Google), HOD (Yahoo) | Not clear
Low-level programming | MPI API | MapReduce API | Dryad API
High-level programming | – | Pig, Hive, Jaql, ... | Scope, DryadLINQ
Data storage | The local file system, NFS, ... | GFS (Google), HDFS (Hadoop), KFS, Amazon S3, ... | NTFS, Cosmos DFS
Task partitioning | User manually partitions the tasks | Automation | Automation
Communication | Messaging, remote memory access | Files (local FS, DFS) | Files, TCP pipes, shared-memory FIFOs
Fault tolerance | Checkpoint | Task re-execution | Task re-execution

5.3.1 Real-time vs. offline analysis

According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

– Real-time analysis is mainly used in e-commerce and finance. Since data constantly changes, rapid data analysis is needed and analytical results must be returned with a very short delay. The main existing architectures of real-time analysis include (i) parallel processing clusters using traditional relational databases, and (ii) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

– Offline analysis is usually used for applications without high requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally conducts analysis by importing logs into a special platform through data acquisition tools. Under the big data setting, many Internet enterprises utilize an offline analysis architecture based on Hadoop in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool Timetunnel, and Chukwa of Hadoop. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.

5.3.2 Analysis at different levels

Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive-level analysis, which are examined in the following.

– Memory-level analysis is for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays, the memory of a server cluster surpasses hundreds of GB, and even the TB level is common. Therefore, an in-memory database technology may be used, and hot data should reside in memory so as to improve analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis. MongoDB is a representative memory-level analytical architecture. With the development of SSDs (Solid-State Drives), the capacity and performance of memory-level data analysis have been further improved and widely applied.

– BI analysis is for the case when the data scale surpasses the memory level but may be imported into the BI analysis environment. The currently mainstream BI products are provided with data analysis plans that support the level over TB.

– Massive analysis is for the case when the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analysis utilizes HDFS of Hadoop to store data and uses MapReduce for data analysis. Most massive analysis belongs to the offline analysis category.


5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithms differ greatly from each other according to different kinds of data and application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the top five most widely used software packages, according to a survey of 798 professionals conducted by KDnuggets in 2012 on "What Analytics, Data mining, Big Data software that you used in the past 12 months for a real project?" [112].

– R (30.7 %): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. When computing-intensive tasks are executed, code programmed in C, C++, and Fortran may be called in the R environment. In addition, skilled users can directly call R objects in C. R is in fact a realization of the S language, which is an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and drawing plots. Compared to S, R is more popular since it is open source. R ranks first in the KDnuggets 2012 survey. Furthermore, in a 2012 survey of "Design languages you have used for data mining/analysis in the past year", R was also in first place, ahead of SQL and Java. Due to the popularity of R, database manufacturers such as Teradata and Oracle have released products supporting R.

– Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins with powerful functions for data analysis, such as Analysis ToolPak and Solver Add-in, are integrated initially, but such plug-ins can be used only if users enable them. Excel is also the only commercial software among the top five.

– Rapid-I RapidMiner (26.7 %): RapidMiner is open source software used for data mining, machine learning, and predictive analysis. In a KDnuggets investigation in 2011, it was more frequently used than R (ranked first). Data mining and machine learning programs provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment. The data mining flow is described in XML and displayed through a graphical user interface (GUI). RapidMiner is written in Java. It integrates the learners and evaluation methods of Weka, and works with R. Functions of RapidMiner are implemented by connecting processes that include various operators. The entire flow can be deemed a production line of a factory, with original data as input and model results as output. The operators can be considered specific functions with different input and output characteristics.

– KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, open-source platform with rich data integration, data processing, data analysis, and data mining functions [113]. It allows users to create data flows or data channels in a visualized manner, to selectively run some or all analytical procedures, and it provides analytical results, models, and interactive views. KNIME is written in Java and based on Eclipse, and provides more functions as plug-ins. Through plug-in files, users can insert processing modules for files, pictures, and time series, and integrate them into various open source projects, e.g., R and Weka. KNIME controls data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted in a visualized environment. KNIME is designed as a module-based and expandable framework. There is no dependence between its processing units and data containers, making it adaptive to distributed environments and independent development. In addition, it is easy to expand KNIME: developers can effortlessly expand its various nodes and views.

– Weka/Pentaho (14.8 %): Weka, abbreviated from Waikato Environment for Knowledge Analysis, is free and open-source machine learning and data mining software written in Java. Weka provides such functions as data processing, feature selection, classification, regression, clustering, association rules, and visualization. Pentaho is one of the most popular open-source BI software packages. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, data mining, and other aspects of BI. Weka's data processing algorithms are also integrated in Pentaho and can be called directly.

6 Big data applications

In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful value via judgments, suggestions, support, or decisions. However, data analysis involves a wide range of applications, which frequently change and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields: structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

6.1 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically includes large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have emerged over the past decades. For example, as early as the 1990s, BI had become a prevailing technology for business applications, and network search engines based on massive data mining emerged in the early 21st century. Some potential and influential applications from different fields, together with their data and analysis characteristics, are discussed as follows.

– Evolution of Commercial Applications: The earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. Analytical techniques used in such systems were prevalent in the 1990s and were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, score cards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have been providing a unique opportunity for organizations to have an online presence and interact directly with customers. Abundant product and customer information, such as clickstream data, logs, and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted with text analysis and website mining techniques. As reported in [115], the quantity of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the Internet of Things based on sensors are opening a new generation of innovative applications and requiring considerably larger capacity to support location-sensing, people-oriented, and context-aware operation.

– Evolution of Network Applications: The early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to the mining of email contents and the building of search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages, full of various kinds of data such as text, images, audio, videos, and interactive contents. Therefore, plenty of advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share contents.

– Evolution of Scientific Applications: Scientific research in many fields, such as astrophysics, oceanology, genomics, and environmental research, is acquiring massive data with high-throughput sensors and instruments. The US National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, inter-operative analysis software, and data services to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets vary widely in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and comprehensive coverage is not easy, we will focus on the key problems and technologies in data analysis in the following discussions.

6.2 Big data analysis fields


6.2.1 Structured data analysis

Business applications and scientific research may generate massive structured data, whose management and analysis rely on mature commercialized technologies such as RDBMS, data warehouse, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning based on exact mathematical models and powerful algorithms has been applied to anomaly detection [117] and energy control [118]. By exploiting data characteristics, temporal and spatial mining can extract knowledge structures hidden in high-speed data flows and sensor data [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially in process analysis with event data [121].

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to have more business potential than structured data analysis. Generally, text analysis is a process of extracting useful information and knowledge from unstructured text. Text mining is interdisciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and data mining in particular. Most text mining systems are based on text representation and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammars [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.
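As a concrete example of turning unstructured text into a representation that mining algorithms can use, the sketch below computes TF-IDF weights over a toy corpus with the Python standard library. TF-IDF is chosen here as one common representation and is not a method singled out by the survey; the documents, the whitespace tokenizer, and the function names are assumptions for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    """Very rough tokenizer: lowercase, split on whitespace, keep alphabetic tokens."""
    return [t for t in text.lower().split() if t.isalpha()]

def tf_idf(docs):
    """Return one {term: weight} dictionary per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return vectors

docs = [
    "big data analysis extracts value from data",
    "text mining extracts knowledge from text",
    "social media text contains noise",
]
vectors = tf_idf(docs)
for doc, vec in zip(docs, vectors):
    top_terms = sorted(vec, key=vec.get, reverse=True)[:2]
    print(doc, "->", top_terms)
```

Terms that are frequent in one document but rare across the corpus receive the highest weights, which gives later steps such as classification or clustering a numeric vector to work with.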

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, images, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which will be discussed in Section 6.2.4. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves the mining of semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., email and newsgroup management and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to simulate and integrate data in the Web so as to conduct more complex queries than keyword-based searches.

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagrams of links within a website or among multiple websites. Models are built based on topological structures provided with hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages. The topic-oriented crawler is another successful case of utilizing the models [127].
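The idea behind link-structure models such as PageRank can be sketched with a short power iteration in Python. This is a simplified textbook formulation and not the implementation of [125]; the damping factor of 0.85, the fixed iteration count, the uniform treatment of dangling pages, and the four-page graph are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: C is linked to by A, B, and D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
rank = pagerank(links)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 3))
```

Pages gain rank from the pages that link to them, so the heavily linked page C ends up on top while D, which nothing links to, keeps only the baseline share.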

Web usage mining aims to mine auxiliary data generated by Web dialogues or activities, whereas Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or trades, caches, user queries, bookmark data, mouse clicks and scrolls, and any other kinds of data generated through interaction with the Web. As Web services and Web 2.0 are becoming mature and popular, Web usage data will have increasingly high variety. Web usage mining plays key roles in personalized space, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.


6.2.4 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) has been growing at an amazing speed, and analysis is needed to extract useful knowledge from it and understand its semantics. Because multimedia data is heterogeneous and most of it contains richer information than simple structured data or text data, extracting information is confronted with the huge challenge of the semantic gap. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection.

Audio summarization can be accomplished by extracting the prominent words or phrases from metadata or by synthesizing a new representation. Video summarization is to interpret the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied in many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video, and take other smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.

Multimedia annotation inserts labels to describe the contents of images and videos at both the syntactic and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time and labor intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic gap. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to explore manual and automatic multimedia annotation jointly [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information, and assisting users in looking up multimedia resources conveniently and quickly [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis; feature extraction; data mining, classification, and annotation; query; and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, and includes shot boundary detection, key frame extraction, and scene segmentation. Based on the result of structural analysis, the second procedure is feature extraction, which mainly includes further mining the features of key frames, objects, texts, and movements, which are the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find patterns in video contents and put videos into predefined categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos. The retrieval results are then optimized through relevance feedback.

Multimedia recommendation is to recommend specific multimedia contents according to users' preferences. It has proven to be an effective approach to providing personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. The content-based methods identify general features of users or their interests, and recommend to users other contents with similar features. These methods largely rely on content similarity measurement, but most of them are troubled by limited content analysis and overspecialization. The collaborative-filtering-based methods identify groups with similar interests and recommend contents to group members according to their behavior [132]. Presently, a mixed method has been introduced, which integrates the advantages of the aforementioned two types of methods to improve recommendation quality [133].
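The collaborative-filtering idea can be illustrated with a small user-based recommender in Python. This is a minimal sketch rather than any of the cited systems: the rating data, the cosine similarity measure, and the choice of two nearest neighbours are assumptions made for the example.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sparse {item: rating} dictionaries."""
    common = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in common)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def recommend(ratings, user, k=2):
    """Rank items the user has not rated, using the k most similar users."""
    others = [(cosine(ratings[user], r), name)
              for name, r in ratings.items() if name != user]
    neighbours = sorted(others, reverse=True)[:k]
    scores = {}
    for sim, name in neighbours:
        for item, rating in ratings[name].items():
            if item not in ratings[user]:
                # Weight each neighbour's rating by how similar they are.
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical ratings on a 1-5 scale.
ratings = {
    "alice": {"video1": 5, "video2": 4},
    "bob": {"video1": 5, "video2": 5, "video3": 4},
    "carol": {"video1": 1, "video4": 5},
    "dave": {"video2": 4, "video3": 5, "video5": 2},
}
suggestions = recommend(ratings, "alice")
print(suggestions)  # ['video3', 'video5']
```

No content features are inspected at all here; the recommendation comes purely from the behavior of users with similar tastes, which is what separates this family from content-based methods.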

The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains some text descriptions related to concepts and video examples [134]. In [135], the author proposed a new algorithm for special multimedia event detection using a few positive training examples. The research on video event detection is still in its infancy, and mainly focuses on sports or news events, running or abnormal events in monitoring videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graph structures, describing the communications between two entities. The content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. From the data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

The research on link-based structural analysis has always been committed to link prediction, community discovery, social network evolution, and social influence analysis. SNS may be visualized as graphs, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertexes and edges are continually added to the graphs. Link prediction is to predict the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex pair and utilizes the existing link information to generate binary classifiers that predict future links [139]. Probabilistic methods aim to build models for connection probabilities among vertexes in SNS [140]. Linear algebra methods compute the similarity between two vertexes according to the singular similar matrix [141]. A community is represented by a sub-graph matrix, in which edges connecting vertexes within the sub-graph feature high density, while the edges between two sub-graphs feature much lower density [142].
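A simple way to see what link prediction computes is to score unconnected vertex pairs by how much their neighbourhoods overlap. The sketch below uses the Jaccard coefficient, a common baseline that can also serve as one feature in feature-based classification; it is a stand-in for illustration and not the method of [139–141]. The small friendship graph and function name are hypothetical.

```python
from itertools import combinations

def jaccard_scores(edges):
    """Score every unconnected vertex pair by neighbourhood overlap."""
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    scores = {}
    for u, v in combinations(sorted(neighbours), 2):
        if v in neighbours[u]:
            continue  # already linked, nothing to predict
        shared = neighbours[u] & neighbours[v]
        union = neighbours[u] | neighbours[v]
        scores[(u, v)] = len(shared) / len(union)
    return scores

# Hypothetical undirected friendship graph.
edges = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"),
         ("bob", "dan"), ("cat", "dan"), ("dan", "eve")]
scores = jaccard_scores(edges)
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))  # ('ann', 'dan') 0.667
```

Here ann and dan share two of their three combined neighbours, so they are the most likely future link, while ann and eve share none and score zero.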

Many methods for community detection have been proposed and studied, most of which rely on topology-based objective functions that capture the concept of community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. The research on SNS evolution aims to look for laws and deduction models to interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generation methods have been proposed to assist network and system design [147].

Social influence refers to the case when individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, time effects, and characteristics of networks and individuals. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise. For example, the blogosphere contains a large number of spam blogs, and Twitter contains many trivial Tweets. Third, SNS are dynamic networks, which vary and are updated frequently and quickly. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge among different media [150].

6.2.6 Mobile data analysis

By April 2013, Android Apps had provided more than 650,000 applications covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since research on mobile analysis has only just started, we will only introduce some recent and representative analysis applications in this section.

With the growing number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities with geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest Webchat). Traditional network communities or SNS communities are short of online interaction among members, and the communities are active only when members are sitting before computers. On the contrary, mobile phones can support rich interaction at any time and anywhere. A mobile community is defined as a group of individuals with the same hobbies (e.g., health, safety, and entertainment) who gather together on networks, meet to set a common goal, decide on measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build body area networks for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations.

Mobile Netw Appl (2014) 19:171–209

In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop an application for smart phones, which analyzes people's paces when they walk and uses the pace information to unlock the safety system [11]. Meanwhile, Robert Delano and Brian Parise from the Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with a built-in seismograph in a mobile phone, so as to cope with Parkinson's and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, in marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business modes. In sales planning, after comparing massive data, enterprises can optimize their commodity prices. In operations, enterprises can improve their operation efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor cost. In the supply chain, using big data, enterprises may conduct inventory optimization, logistic optimization, supplier coordination, etc., to mitigate the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that such activities as "Multi-times score accumulation" and "score exchange in shops" are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers' transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded and, more importantly, so are the age, gender, address, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, the market conditions of their brands, consumers' behaviors, etc., and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan of Alibaba automatically analyzes and judges whether to lend loans to enterprises through the acquired enterprise transaction data by virtue of big data technology, while manual intervention does not occur in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan with only about 0.3 % bad loans, which is greatly lower than those of other commercial banks.

6.3.2 Application of IoT based big data

IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises may have had profound experience with the application of IoT big data. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project cooperation between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and Miami city, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills this year by identifying and fixing running and leaking water pipes in a timely manner.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and connections among individuals based on an information network. Big data of online SNS mainly comes from instant messages, online social networking, micro blogs, shared spaces, etc., which represent various user activities. The analysis of big data from online SNS uses computational analytical methods to understand relations in human society by virtue of theories and methods involving mathematics, informatics, sociology, management science, etc., from three dimensions: network structure, group interaction, and information spreading. Applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire value.

– Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preference, emotion, interest, demand, etc. may be revealed.

– Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, hobbies, etc. aggregate relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.

The U.S. Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

Fig. 5 Enabling technologies for online social network-oriented big data

In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the U.S. In addition, Global Pulse conducted research that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans. The goal was to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting sharp growth or drops in the amount of topics; 2) observing the weekly and monthly trends of dialogs on Twitter, and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing the ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation in the official statistics of Indonesia matches the number of Tweets about rice prices on Twitter, as shown in Fig. 6.
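Two of the analyses described above, detecting a sharp growth or drop in topic volume and comparing Tweet counts with an external indicator, can be sketched in a few lines of Python. This is an illustration only, not the Global Pulse implementation: the trailing-window z-score rule, the window length, the threshold, and the sample counts are all assumptions chosen for the example.

```python
from math import sqrt
from statistics import mean, pstdev


def detect_spikes(counts, window=4, threshold=3.0):
    """Return indices whose count deviates from the trailing-window mean
    by more than `threshold` standard deviations (sharp growth or drop)."""
    spikes = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu, sigma = mean(history), pstdev(history)
        if sigma == 0:
            continue  # a perfectly flat history gives no scale to compare against
        if abs(counts[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length of at least 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sx * sy)


# Hypothetical weekly Tweet counts for one topic; week 6 is an outlier.
weekly_tweets = [100, 104, 98, 102, 101, 99, 180, 103]
spikes = detect_spikes(weekly_tweets)
```

A correlation such as the one in Fig. 6 would be obtained by calling `pearson` on the monthly Tweet counts and the official inflation series; a coefficient close to 1 indicates that the two series move together, although it does not by itself establish causation.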

Generally speaking, the application of big data from online SNS may help to better understand users' behavior and master the laws of social and economic activities from the following three aspects:

– Early Warning: to rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotion, and preference of users.

– Real-time Feedback: to acquire groups' feedback on some social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the health care business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final result into an extremely personalized treatment plan to assess the dangerous factors and main treatment plans of patients. Then, doctors may reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or suggesting that patients reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20.

The Mount Sinai Medical Center in the U.S. utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through the software development kit (SDK) and open interface.

Fig. 6 The correlation between Tweets about rice price and food price inflation

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units and coordinate through mobile networks to distribute sensing tasks and to collect and utilize sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of Crowdsourcing has been successfully applied to geotagged photographs, positioning and navigation, urban road traffic sensing, market forecasting, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach for problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, Crowdsourcing had been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of Crowdsourcing. The main idea of Crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, Crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, Spatial Crowdsourcing has become a hot topic. The operation framework of Spatial Crowdsourcing is as follows. A user may request the service and resources related to a specified location. Then the mobile users who are willing to participate in the task move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data is sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions they provide, it can be forecast that Spatial Crowdsourcing will become more prevalent than traditional Crowdsourcing, e.g., Amazon Turk and Crowdflower.
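The request, travel, and return steps of Spatial Crowdsourcing can be illustrated with a small task-assignment sketch. The text does not specify how a worker is chosen; the code below assumes, purely for illustration, that the task goes to the nearest willing worker, with distance measured by the haversine formula. The `Worker` class, the coordinates, and the function names are hypothetical.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt


@dataclass
class Worker:
    name: str
    lat: float
    lon: float
    willing: bool  # whether the mobile user agrees to take part


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points on Earth."""
    earth_radius_km = 6371.0
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))


def assign_task(task_lat, task_lon, workers):
    """Pick the nearest willing worker for a location-based request,
    or None when nobody is willing to participate."""
    candidates = [w for w in workers if w.willing]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda w: haversine_km(task_lat, task_lon, w.lat, w.lon),
    )


# Hypothetical workers around a requested location at (30.0, 114.0).
workers = [
    Worker("a", 30.0, 114.0, willing=False),  # closest, but not willing
    Worker("b", 30.1, 114.0, willing=True),
    Worker("c", 31.0, 114.0, willing=True),
]
chosen = assign_task(30.0, 114.0, workers)
```

A real platform would also weigh worker reliability, travel cost, and task deadlines; the nearest-worker rule is only the simplest possible policy.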

6.3.6 Smart grid

Smart Grid is the next generation power grid, constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart Grid related big data are generated from various sources, such as (i) power utilization habits of users; (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nationwide; (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI); (iv) energy market pricing and bidding data; and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as Circuit Breaker Monitors and transformers). Smart Grid brings about the following challenges on exploiting big data:

– Grid planning: By analyzing data in the Smart Grid, regions that have excessively high electrical load or high power outage frequencies can be identified. Even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, maintenance, etc. For example, researchers from the University of California, Los Angeles designed an "electric map" according to the big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of the block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides effective and visual load forecasts for power grid planning in a city. Preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads, as displayed in the map, may be conducted.

– Interaction between power generation and power consumption: An ideal power grid should balance power generation and consumption. However, the traditional power grid is constructed based on the one-directional approach of transmission-transformation-distribution-consumption, which does not allow the generation capacity to be adjusted according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters have been developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which help the supplier read power utilization data every 15 min rather than every month as in the past. Labor cost for meter reading is greatly reduced. Because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and low periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help the realization of time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users.

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable new energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortage. Such energy resources can complement the traditional hydropower and thermal power generation.
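The time-sharing dynamic pricing mentioned in the list above becomes possible once meter readings arrive at 15-minute granularity. The sketch below shows the basic arithmetic under clearly hypothetical assumptions: a two-level tariff, a peak window of 17:00 to 21:00, and made-up prices per kWh. It does not reproduce the tariff of TXU Energy or of any other supplier.

```python
def time_of_use_bill(readings, peak_hours, peak_price, off_peak_price):
    """Compute a bill from smart-meter readings.

    `readings` is a list of (hour_of_day, kwh) pairs, one pair per
    15-minute metering interval; `peak_hours` is any container of hours
    that are charged at `peak_price` instead of `off_peak_price`.
    """
    total = 0.0
    for hour, kwh in readings:
        price = peak_price if hour in peak_hours else off_peak_price
        total += kwh * price
    return total


# Hypothetical tariff: peak from 17:00 up to (not including) 21:00.
PEAK_HOURS = range(17, 21)
PEAK_PRICE = 0.30      # currency units per kWh, assumed
OFF_PEAK_PRICE = 0.10  # currency units per kWh, assumed

# Four 15-minute readings: two at night, two during the evening peak.
evening_usage = [(3, 0.5), (3, 0.5), (18, 1.0), (18, 1.0)]
# The same energy with the evening load shifted to 23:00.
shifted_usage = [(3, 0.5), (3, 0.5), (23, 1.0), (23, 1.0)]

bill_evening = time_of_use_bill(evening_usage, PEAK_HOURS, PEAK_PRICE, OFF_PEAK_PRICE)
bill_shifted = time_of_use_bill(shifted_usage, PEAK_HOURS, PEAK_PRICE, OFF_PEAK_PRICE)
```

Because the shifted profile costs less for the same energy, users have an incentive to move load away from the peak, which is how such pricing smooths the fluctuations of power consumption.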

7 Conclusion, open issues, and outlook

In this paper, we review the background and state-of-the-art of big data. Firstly, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big-picture to readers of this exciting area.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, which are discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

– Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to balance the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated when the system is implemented and deployed, which makes it impossible to horizontally compare the advantages and disadvantages of various alternative solutions, even before and after the implementation of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes: This includes memory mode, data flow mode, PRAM mode, MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor which restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.

– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor to improve big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems on big data processing arise from traditional data analysis, including (i) data re-utilization: with the increase of data scale, more value may be mined from the re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized, from which more value can be mined; (iii) data exhaust, which means wrong data during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved:

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from comprehensive utilization of multiple datasets is far higher than the sum value of the individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance is the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and from different datasets.

– Big data application: At present, the application of big data is just beginning, and we should explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, human-computer interaction of big data, etc. are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume is growing fast, there are more severe safety risks, while the traditional data protection methods have already been shown not to be applicable to big data. In particular, big data safety is confronted with the following security related challenges.

– Big data privacy: Big data privacy includes two aspects: (i) Protection of personal privacy during data acquisition: personal interests, habits, body properties, etc. of users may be more easily acquired, and users may not be aware. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed a big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher of Skull Security, used an information acquisition tool to acquire data from the public pages of Facebook users who had failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low quality data wastes transmission and storage resources and has poor usability. There are a lot of factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though a lot of measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.

– Big data safety mechanism: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data could not meet the demands of big data, so efficient big data cryptography approaches should be developed. Effective schemes for safety management, access control, and safety communications should be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data application in information security: Big data not only brings challenges to information security but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) by analyzing big data in the form of log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, attack characteristics, etc. may also be identified more easily through the analysis of big data.
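The log analysis mentioned in the last item can be illustrated with a minimal sketch. It assumes a hypothetical log format of one event per line (timestamp, source address, event type) and a simple counting threshold; neither the format nor the threshold rule comes from the survey, and a real Intrusion Detection System would use its own schema and far richer analysis.

```python
from collections import Counter

# Hypothetical IDS log: "<timestamp> <source_ip> <event_type>" per line.
# The format and the sample values are assumptions made for illustration.
SAMPLE_LOG = """\
2014-01-01T00:00:01 10.0.0.5 LOGIN_FAIL
2014-01-01T00:00:02 10.0.0.5 LOGIN_FAIL
2014-01-01T00:00:03 10.0.0.7 LOGIN_OK
2014-01-01T00:00:04 10.0.0.5 LOGIN_FAIL
2014-01-01T00:00:05 10.0.0.9 PORT_SCAN
"""


def suspicious_sources(log_text, event="LOGIN_FAIL", threshold=3):
    """Return source addresses with at least `threshold` occurrences of `event`."""
    counts = Counter()
    for line in log_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip malformed lines
        _timestamp, source, kind = fields
        if kind == event:
            counts[source] += 1
    return sorted(src for src, n in counts.items() if n >= threshold)


print(suspicious_sources(SAMPLE_LOG))
```

At big data scale the same counting step would typically run as a distributed aggregation rather than in a single process; the sketch only shows the shape of the analysis.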

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, energy-efficiency-optimized distributed storage, and hardware and software system architectures for processing, etc. In particular, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.
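The data quality item above names accuracy, completeness, redundancy, and consistency as the main dimensions and calls for automatic detection. The following is a minimal sketch of such a check for three of these dimensions, under stated assumptions: records are plain dictionaries, "accuracy" is approximated by a plausibility range per field, and redundancy is measured as the share of exact duplicates. The function name, field names, and thresholds are hypothetical; consistency is omitted because it requires domain-specific cross-field rules.

```python
def quality_report(records, required, valid_range):
    """Compute simple data quality ratios for a list of dict records.

    required    -- field names that must be present and not None
    valid_range -- {field: (low, high)} plausibility bounds (an assumed
                   proxy for accuracy, not a definition from the survey)
    """
    total = len(records)
    if total == 0:
        raise ValueError("no records to assess")

    # Completeness: share of records with every required field filled in.
    complete = sum(
        all(r.get(f) is not None for f in required) for r in records
    )

    # Accuracy (proxy): share of records whose values fall in plausible ranges.
    plausible = 0
    for r in records:
        ok = True
        for field, (low, high) in valid_range.items():
            value = r.get(field)
            if value is None or not (low <= value <= high):
                ok = False
        plausible += ok

    # Redundancy: share of records that exactly duplicate an earlier one.
    keys = [tuple(r.get(f) for f in required) for r in records]
    duplicates = total - len(set(keys))

    return {
        "completeness": complete / total,
        "accuracy": plausible / total,
        "redundancy": duplicates / total,
    }


# Hypothetical sensor readings used only to exercise the function.
readings = [
    {"id": 1, "temp": 21.5},
    {"id": 2, "temp": None},   # incomplete
    {"id": 3, "temp": 999.0},  # implausible value
    {"id": 1, "temp": 21.5},   # duplicate of the first record
]
print(quality_report(readings, ["id", "temp"], {"temp": (-50.0, 60.0)}))
```

Repairing the flagged records, which the text also calls for, is a separate step and is not attempted here.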

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the “T” (Technology) was the main concern, and technology drove the development of data. In the big data era, with the prominence of data value and advances in “I” (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact but also influence everyone's ways of living and thinking, which is already happening. We cannot predict the future, but we can take precautions for events that may occur.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, they are expected to fall behind and be replaced given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have sought better ways to cope with larger-scale, more diverse, and more complexly structured data. These efforts are represented by Google's globally distributed database Spanner and the fault-tolerant, expandable, distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to those of relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge value, mastering big data means mastering resources. Analysis of the value chain of big data shows that its value comes from the data itself, technologies, and ideas, and that the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data requires exploring innovative technologies and methods in data acquisition, storage, processing, analysis, information security, etc. The impacts of big data on production management, business operation, decision making, etc. should then be examined for modern enterprises from the management perspective. Moreover, applying big data to specific fields requires the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., in text and image editors. In big data applications, mixed data is very useful for decision making. Only when analytical results are displayed in a user-friendly way can they be effectively utilized by users. Reports, histograms, pie charts, regression curves, etc. are frequently used to visualize the results of data analysis. New presentation forms will appear in the future; e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well known that programs consist of data structures and algorithms, and that data structures are used to store data. In the history of program design, the role of data has become increasingly significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed that of business logic. Consequently, programs are gradually being transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods will emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

– During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

– Compared with accurate data, we would like to accept numerous and complicated data.

– We shall pay greater attention to correlations between things rather than exploring causal relationships.

– The simple algorithms of big data are more effective than the complex algorithms of small data.

– Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace “experts”.
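The emphasis on correlation in the list above can be made concrete with a small sketch. It computes the Pearson correlation coefficient over a complete set of paired observations; this is one standard correlation measure chosen here for illustration, not a method prescribed by [11], and the two series are made-up values with hypothetical names.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / (spread_x * spread_y)


# Hypothetical daily counts: search queries for a symptom vs. reported cases.
queries = [120, 150, 90, 200, 170]
cases = [12, 16, 9, 21, 18]
print(round(pearson(queries, cases), 3))
```

A coefficient close to 1 or -1 indicates a strong linear association in the observed data; by itself it says nothing about which quantity, if either, causes the other.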

Throughout the history of human society, the demands and willingness of human beings have always been the driving forces of scientific and technological progress. Big data may provide reference answers for human decision making through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities and based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in the future.

Acknowledgments This work was supported by China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

                    References

1. Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

2. Fact sheet: Big data across the federal government (2012) http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf

3. Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper

4. Drowning in numbers - digital data will flood the planet - and help us understand it better (2011) http://www.economist.com/blogs/dailychart/2011/11/bigdata-0

5. Lohr S (2012) The age of big data. New York Times, pp 11

6. Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/the-digital-breadcrumbs-that-lead-to-big-data

7. Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-search-for-analysts-to-make-sense-of-big-data

8. Big data (2008) http://www.nature.com/news/specials/bigdata/index.html

9. Special online collection: dealing with big data (2011) http://www.sciencemag.org/site/special/data

10. Manyika J, McKinsey Global Institute, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute

11. Mayer-Schonberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Eamon Dolan/Houghton Mifflin Harcourt

12. Laney D (2001) 3-d data management: controlling data volume, velocity and variety. META Group Research Note, 6 February

13. Zikopoulos P, Eaton C et al (2011) Understanding big data: analytics for enterprise class hadoop and streaming data. McGraw-Hill Osborne Media

14. Meijer E (2011) The world according to linq. Communications of the ACM 54(10):45–51

15. Beyer M (2011) Gartner says solving big data challenge involves more than just managing volumes of data. Gartner. http://www.gartner.com/it/page.jsp

16. O R Team (2011) Big data now: current perspectives from O'Reilly Radar. O'Reilly Media

17. Grobelnik M (2012) Big data tutorial. http://videolectures.net/eswc2012_grobelnik_big_data

18. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2008) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014

19. DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6):85–98

20. Walter T (2009) Teradata past, present, and future. UCI ISG lecture series on scalable data management

21. Ghemawat S, Gobioff H, Leung S-T (2003) The google file system. In: ACM SIGOPS Operating Systems Review, vol 37. ACM, pp 29–43

22. Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

23. Hey AJG, Tansley S, Tolle KM et al (2009) The fourth paradigm: data-intensive scientific discovery

24. Howard JH, Kazar ML, Menees SG, Nichols DA, Satyanarayanan M, Sidebotham RN, West MJ (1988) Scale and performance in a distributed file system. ACM Trans Comput Syst (TOCS) 6(1):51–81

25. Cattell R (2011) Scalable sql and nosql data stores. ACM SIGMOD Record 39(4):12–27

26. Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033

27. Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98

28. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J et al (2012) Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States

29. Sun Y, Chen M, Liu B, Mao S (2013) Far: a fault-avoidant routing method for data center networks with regular topology. In: Proceedings of ACM/IEEE symposium on architectures for networking and communications systems (ANCS'13). ACM

30. Wiki (2013) Applications and organizations using hadoop. http://wiki.apache.org/hadoop/PoweredBy

31. Bahga A, Madisetti VK (2012) Analyzing massive machine maintenance data in a computing cloud. IEEE Transac Parallel Distrib Syst 23(10):1831–1843

32. Gunarathne T, Wu T-L, Choi JY, Bae S-H, Qiu J (2011) Cloud computing paradigms for pleasingly parallel biomedical applications. Concurr Comput Prac Experience 23(17):2338–2354

33. Gantz J, Reinsel D (2010) The digital universe decade-are you ready. External publication of IDC (Analyse the Future) information and data, pp 1–16

34. Bryant RE (2011) Data-intensive scalable computing for scientific applications. Comput Sci Eng 13(6):25–33

35. Wahab MHA, Mohd MNH, Hanafi HF, Mohsin MFM (2008) Data pre-processing on web server logs for generalized association rules mining algorithm. World Acad Sci Eng Technol 48:2008

36. Nanopoulos A, Manolopoulos Y, Zakrzewicz M, Morzy T (2002) Indexing web access-logs for pattern queries. In: Proceedings of the 4th international workshop on web information and data management. ACM, pp 63–68

37. Joshi KP, Joshi A, Yesha Y (2003) On using a warehouse to analyze web logs. Distrib Parallel Databases 13(2):161–180

38. Chandramohan V, Christensen K (2002) A first look at wired sensor networks for video surveillance systems. In: Proceedings LCN 2002, 27th annual IEEE conference on local computer networks. IEEE, pp 728–729

39. Selavo L, Wood A, Cao Q, Sookoor T, Liu H, Srinivasan A, Wu Y, Kang W, Stankovic J, Young D et al (2007) Luster: wireless sensor network for environmental research. In: Proceedings of the 5th international conference on Embedded networked sensor systems. ACM, pp 103–116

40. Barrenetxea G, Ingelrest F, Schaefer G, Vetterli M, Couach O, Parlange M (2008) Sensorscope: out-of-the-box environmental monitoring. In: Information processing in sensor networks, 2008, international conference on IPSN'08. IEEE, pp 332–343

41. Kim Y, Schmid T, Charbiwala ZM, Friedman J, Srivastava MB (2008) Nawms: nonintrusive autonomous water monitoring system. In: Proceedings of the 6th ACM conference on Embedded network sensor systems. ACM, pp 309–322

42. Kim S, Pakzad S, Culler D, Demmel J, Fenves G, Glaser S, Turon M (2007) Health monitoring of civil infrastructures using wireless sensor networks. In: Information Processing in Sensor Networks, 2007, 6th International Symposium on IPSN 2007. IEEE, pp 254–263

43. Ceriotti M, Mottola L, Picco GP, Murphy AL, Guna S, Corra M, Pozzi M, Zonta D, Zanon P (2009) Monitoring heritage buildings with wireless sensor networks: the torre aquila deployment. In: Proceedings of the 2009 International Conference on Information Processing in Sensor Networks. IEEE Computer Society, pp 277–288

44. Tolle G, Polastre J, Szewczyk R, Culler D, Turner N, Tu K, Burgess S, Dawson T, Buonadonna P, Gay D et al (2005) A macroscope in the redwoods. In: Proceedings of the 3rd international conference on embedded networked sensor systems. ACM, pp 51–63

45. Wang F, Liu J (2011) Networked wireless sensor data collection: issues, challenges, and approaches. IEEE Commun Surv Tutor 13(4):673–687

46. Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the 11th international conference on World Wide Web. ACM, pp 124–135

47. Choudhary S, Dincturk ME, Mirtaheri SM, Moosavi A, von Bochmann G, Jourdan G-V, Onut I-V (2012) Crawling rich internet applications: the state of the art. In: CASCON, pp 146–160

48. Ghani N, Dixit S, Wang T-S (2000) On ip-over-wdm integration. IEEE Commun Mag 38(3):72–84

49. Manchester J, Anderson J, Doshi B, Dravida S (1998) Ip over sonet. IEEE Commun Mag 36(5):136–142

50. Jinno M, Takara H, Kozicki B (2009) Dynamic optical mesh networks: drivers, challenges and solutions for the future. In: Optical communication, 2009, 35th European conference on ECOC'09. IEEE, pp 1–4

51. Barroso LA, Holzle U (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synt Lect Comput Archit 4(1):1–108

52. Armstrong J (2009) Ofdm for optical communications. J Light Technol 27(3):189–204

53. Shieh W (2011) Ofdm for flexible high-speed optical networks. J Light Technol 29(10):1560–1577

54. Cisco data center interconnect design and deployment guide (2010)

55. Greenberg A, Hamilton JR, Jain N, Kandula S, Kim C, Lahiri P, Maltz DA, Patel P, Sengupta S (2009) Vl2: a scalable and flexible data center network. In: ACM SIGCOMM computer communication review, vol 39. ACM, pp 51–62

56. Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S (2009) Bcube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 39(4):63–74

57. Farrington N, Porter G, Radhakrishnan S, Bazzaz HH, Subramanya V, Fainman Y, Papen G, Vahdat A (2011) Helios: a hybrid electrical/optical switch architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 41(4):339–350

58. Abu-Libdeh H, Costa P, Rowstron A, O'Shea G, Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Comput Commun Rev 40(4):51–62

59. Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what's needed for datacenter network operations. IEEE Commun Mag 48(7):32–39

60. Wang G, Andersen DG, Kaminsky M, Papagiannaki K, Ng TS, Kozuch M, Ryan M (2010) c-through: Part-time optics in data centers. In: ACM SIGCOMM Computer Communication Review, vol 40. ACM, pp 327–338

61. Ye X, Yin Y, Yoo SJB, Mejia P, Proietti R, Akella V (2010) Dos: a scalable optical switch for datacenters. In: Proceedings of the 6th ACM/IEEE symposium on architectures for networking and communications systems. ACM, p 24

62. Singla A, Singh A, Ramachandran K, Xu L, Zhang Y (2010) Proteus: a topology malleable data center network. In: Proceedings of the 9th ACM SIGCOMM workshop on hot topics in networks. ACM, p 8

63. Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P (2011) Energy-efficient design of a scalable optical multiplane interconnection architecture. IEEE J Sel Top Quantum Electron 17(2):377–383

64. Kodi AK, Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance computing (hpc) systems. IEEE J Sel Top Quantum Electron 17(2):384–395

65. Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H (2012) Mirror mirror on the ceiling: flexible wireless links for data centers. ACM SIGCOMM Comput Commun Rev 42(4):443–454

66. Lenzerini M (2002) Data integration: a theoretical perspective. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. ACM, pp 233–246

67. Cafarella MJ, Halevy A, Khoussainova N (2009) Data integration for the relational web. Proc VLDB Endowment 2(1):1090–1101

68. Maletic JI, Marcus A (2000) Data cleansing: beyond integrity analysis. In: IQ. Citeseer, pp 200–209

69. Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1-2):83–113

70. Chen H, Ku W-S, Wang H, Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 51–62

71. Zhao Z, Ng W (2012) A model-based approach for rfid data stream cleansing. In: Proceedings of the 21st ACM international conference on information and knowledge management. ACM, pp 862–871

72. Khoussainova N, Balazinska M, Suciu D (2008) Probabilistic event extraction from rfid data. In: Data Engineering, 2008, IEEE 24th international conference on ICDE 2008. IEEE, pp 1480–1482

73. Herbert KG, Wang JTL (2007) Biological data cleaning: a case study. Int J Inf Qual 1(1):60–82

74. Tsai T-H, Lin C-Y (2012) Exploring contextual redundancy in improving object-based video coding for video sensor networks surveillance. IEEE Transac Multmed 14(3):669–682

75. Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 269–278

76. Kamath U, Compton J, Dogan RI, Jong KD, Shehu A (2012) An evolutionary algorithm approach for feature generation from sequence data and its application to dna splice site prediction. IEEE/ACM Transac Comput Biol Bioinforma (TCBB) 9(5):1387–1398

77. Leung K-S, Lee KH, Wang J-F, Ng EYT, Chan HLY, Tsui SKW, Mok TSK, Tse PC-H, Sung JJ-Y (2011) Data mining on dna sequences of hepatitis b virus. IEEE/ACM Transac Comput Biol Bioinforma 8(2):428–440

78. Huang Z, Shen H, Liu J, Zhou X (2011) Effective data co-reduction for multimedia similarity search. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, pp 1021–1032

79. Bleiholder J, Naumann F (2008) Data fusion. ACM Comput Surv (CSUR) 41(1):1

80. Brewer EA (2000) Towards robust distributed systems. In: PODC, p 7

81. Gilbert S, Lynch N (2002) Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59

82. McKusick MK, Quinlan S (2009) Gfs: evolution on fast-forward. ACM Queue 7(7):10

83. Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J (2008) Scope: easy and efficient parallel processing of massive data sets. Proc VLDB Endowment 1(2):1265–1276

84. Beaver D, Kumar S, Li HC, Sobel J, Vajgel P et al (2010) Finding a needle in haystack: facebook's photo storage. In: OSDI, vol 10, pp 1–8

85. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: amazon's highly available key-value store. In: SOSP, vol 7, pp 205–220

86. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In: Proceedings of the twenty-ninth annual ACM symposium on theory of computing. ACM, pp 654–663

87. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2):4

88. Burrows M (2006) The chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th symposium on Operating systems design and implementation. USENIX Association, pp 335–350

89. Lakshman A, Malik P (2009) Cassandra: structured storage system on a p2p network. In: Proceedings of the 28th ACM symposium on principles of distributed computing. ACM, pp 5–5

90. George L (2011) HBase: the definitive guide. O'Reilly Media Inc

91. Judd D (2008) hypertable-0.9.0.4-alpha

92. Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc

93. Crockford D (2006) The application/json media type for javascript object notation (json)

94. Murty J (2009) Programming amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc

95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc

96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 975–986

97. Yang H-C, Parker DS (2009) Traverse: simplified indexing on large map-reduce-merge clusters. In: Database systems for advanced applications. Springer, pp 308–322

98. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with sawzall. Sci Program 13(4):277–298

99. Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U (2009) Building a high-level dataflow system on top of map-reduce: the pig experience. Proceedings VLDB Endowment 2(2):1414–1425

100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629

101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72

102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) Dryadlinq: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14

103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: Parallel and distributed processing, 2008, IEEE international symposium on IPDPS 2008. IEEE, pp 1–11

104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146

105. Bu Y, Bill H, Balazinska M, Ernst MD (2010) Haloop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296

106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative mapreduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing. ACM, pp 810–818

107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2

108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: mapreduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing. ACM, p 7

109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) Ciel: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on Networked systems design and implementation, p 9

110. Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York

111. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

112. What analytics, data mining, big data software you used in the past 12 months for a real project (2012) http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html

113. Berthold MR, Cebron N, Dill F, Gabriel TR, Kotter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer

114. Sallam RL, Richardson J, Hagerty J, Hostmann B (2011) Magic quadrant for business intelligence platforms. CT, Gartner Group, Stamford

115. Beyond the PC. Special Report on Personal Technology (2011)

116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A et al (2011) The iplant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on Software quality assurance. ACM, pp 70–77

118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers. ACM, pp 277–286

119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Record 34(2):18–26

120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM Sigmod Record 33(1):50–57

121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Transac Manag Inform Syst (TMIS) 3(2):7

122. Manning CD, Schutze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press

123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Transac Neural Netw 13(5):1163–1177

124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11

125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117

126. Konopnicki D, Shmueli O (1995) W3qs: a query system for the world-wide web. In: VLDB, vol 95, pp 54–65

127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640

128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval. ACM, p 2

129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25

130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19

131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819

132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939

133. Barragans-Martínez AB, Costa-Montenegro E, Burguillo JC, Rey-Lopez M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend tv programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311

134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91

135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia. ACM, pp 469–478

136. Hirsch JE (2005) An index to quantify an individual's scientific research output. Proc Natl Acad Sci USA 102(46):16569

137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company

138. Aggarwal CC (2011) An introduction to social network data analytics. Springer

139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1046–1054

140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied Computing. ACM, pp 1115–1116

                    Mobile Netw Appl (2014) 19:171–209

                    141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Transac Knowl Discov Data (TKDD) 5(2):10

                    142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on World wide web. ACM, pp 631–640

                    143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis. ACM, pp 16–25

                    144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. ACM, pp 315–321

                    145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 145–158

                    146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 131–144

                    147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1007–1016

                    148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 807–816

                    149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on Web search and data mining. ACM, pp 657–666

                    150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360

                    151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (accessed 5 May 2013)

                    152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51

                    153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive RFID, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 2

                    154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval. ACM, pp 469–478

                    155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated, multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium. ACM, pp 445–454

                    156. Tasevski P (2011) Password attacks and generation strategies. Tartu University: Faculty of Mathematics and Computer Sciences



                      useless data, which unnecessarily increases storage space and affects the subsequent data analysis. For example, high redundancy is very common among datasets collected by sensors for environment monitoring. Data compression technology can be applied to reduce the redundancy. Therefore, data pre-processing operations are indispensable to ensure efficient data storage and exploitation.

                      3.2.1 Data collection

                      Data collection utilizes special data collection techniques to acquire raw data from a specific data generation environment. Four common data collection methods are as follows.

                      – Log files: As one widely used data collection method, log files are record files automatically generated by the data source system to record activities in designated file formats for subsequent analysis. Log files are used in nearly all digital devices. For example, web servers record in log files the number of clicks, click rates, visits, and other property records of web users [35]. To capture the activities of users at web sites, web servers mainly use the following three log file formats: public log file format (NCSA), expanded log format (W3C), and IIS log format (Microsoft). All three types of log files are in ASCII text format. Databases, rather than text files, may sometimes be used to store log information to improve the query efficiency of the massive log store [36, 37]. There are also other forms of log-based data collection, including stock indicators in financial applications and the determination of operating states in network monitoring and traffic management.

                      – Sensing: Sensors are common in daily life to measure physical quantities and transform them into readable digital signals for subsequent processing (and storage). Sensory data may be classified as sound wave, voice, vibration, automobile, chemical, current, weather, pressure, temperature, etc. Sensed information is transferred to a data collection point through wired or wireless networks. For applications that may be easily deployed and managed, e.g., a video surveillance system [38], the wired sensor network is a convenient solution to acquire related information. Sometimes the accurate position of a specific phenomenon is unknown, and sometimes the monitored environment does not have the energy or communication infrastructure. Then wireless communication must be used to enable data transmission among sensor nodes under limited energy and communication capability. In recent years, wireless sensor networks (WSNs) have received considerable interest and have been applied to many applications, such as environmental research [39, 40], water quality monitoring [41], civil engineering [42, 43], and wildlife habitat monitoring [44]. A WSN generally consists of a large number of geographically distributed sensor nodes, each being a micro device powered by battery. Such sensors are deployed at designated positions as required by the application to collect remote sensing data. Once the sensors are deployed, the base station will send control information for network configuration/management or data collection to the sensor nodes. Based on such control information, the sensory data is assembled in different sensor nodes and sent back to the base station for further processing. Interested readers are referred to [45] for more detailed discussions.

                      – Methods for acquiring network data: At present, network data acquisition is accomplished using a combination of web crawler, word segmentation system, task system, index system, etc. A web crawler is a program used by search engines for downloading and storing web pages [46]. Generally speaking, a web crawler starts from the uniform resource locator (URL) of an initial web page to access other linked web pages, during which it stores and sequences all the retrieved URLs. The web crawler acquires a URL in order of precedence through a URL queue, downloads the web page, identifies all URLs in the downloaded page, and extracts new URLs to be put in the queue. This process is repeated until the web crawler is stopped. Data acquisition through a web crawler is widely applied in applications based on web pages, such as search engines or web caching. Traditional web page extraction technologies feature multiple efficient solutions, and considerable research has been done in this field. As more advanced web page applications are emerging, some extraction strategies are proposed in [47] to cope with rich Internet applications.

                      The current network data acquisition technologies mainly include traditional Libpcap-based packet capture technology, zero-copy packet capture technology, as well as some specialized network monitoring software such as Wireshark, SmartSniff, and WinNetCap.

                      – Libpcap-based packet capture technology: Libpcap (packet capture library) is a widely used network data packet capture function library. It is a general tool that does not depend on any specific system and is mainly used to capture data in the data link layer. It features simplicity, ease of use, and portability, but has a relatively low efficiency. Therefore, under a high-speed network environment, considerable packet losses may occur when Libpcap is used.


                      – Zero-copy packet capture technology: The so-called zero-copy (ZC) means that no copies between any internal memories occur during packet receiving and sending at a node. In sending, the data packets directly start from the user buffer of applications, pass through the network interfaces, and arrive at an external network. In receiving, the network interfaces directly send data packets to the user buffer. The basic idea of zero-copy is to reduce data copy times, reduce system calls, and reduce CPU load while datagrams are passed from network equipment to user program space. The zero-copy technology first utilizes direct memory access (DMA) technology to directly transmit network datagrams to an address space pre-allocated by the system kernel, so as to avoid the participation of the CPU. Meanwhile, it maps the internal memory of the datagrams in the system kernel to that of the detection program, or builds a cache region in the user space and maps it to the kernel space. Then the detection program directly accesses the internal memory, so as to reduce internal memory copies from system kernel to user space and reduce the number of system calls.

                      – Mobile equipment: At present, mobile devices are more widely used. As mobile device functions become increasingly stronger, they feature more complex and multiple means of data acquisition as well as a greater variety of data. Mobile devices may acquire geographical location information through positioning systems; acquire audio information through microphones; acquire pictures, videos, streetscapes, two-dimensional barcodes, and other multimedia information through cameras; and acquire user gestures and other body language information through touch screens and gravity sensors. Over the years, wireless operators have improved the service level of the mobile Internet by acquiring and analyzing such information. For example, the iPhone itself is a "mobile spy." It may collect wireless data and geographical location information, and then send such information back to Apple Inc. for processing, of which the user is not aware. Apart from Apple, smart phone operating systems such as Google's Android and Microsoft's Windows Phone can also collect information in a similar manner.
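The URL-queue loop described under "Methods for acquiring network data" can be illustrated with a minimal sketch. It is not the design of any particular search engine crawler: the `crawl` function, the `LinkExtractor` helper, the `max_pages` limit, and the in-memory `SITE` dictionary (standing in for real HTTP downloads) are all assumptions made for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags found in one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch, max_pages=100):
    """Breadth-first crawl starting from seed_url.

    fetch is a caller-supplied function mapping a URL to its HTML text,
    or None when the page cannot be downloaded (hypothetical interface).
    Returns a dict of URL -> HTML for every page that was stored.
    """
    queue = deque([seed_url])   # URL queue, served in order of arrival
    seen = {seed_url}           # every URL that was ever queued
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html       # download and store the page
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" suffixes.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


# Tiny in-memory "web" so the sketch runs without network access.
SITE = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.test/b": '<a href="/c#top">C</a>',
}
pages = crawl("http://example.test/", SITE.get)
print(sorted(pages))
```

A production crawler would replace `SITE.get` with a real downloader and add politeness rules (rate limits, robots.txt handling), which are outside the scope of this sketch.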

                      In addition to the aforementioned data acquisition methods for the main data sources, there are many other data collection methods or systems. For example, in scientific experiments, many special tools can be used to collect experimental data, such as magnetic spectrometers and radio telescopes. We may classify data collection methods from different perspectives. From the perspective of data sources, data collection methods can be classified into two categories: collection methods recording through data sources and collection methods recording through other auxiliary tools.
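As a concrete illustration of the log-file method, the following sketch parses lines in the NCSA common log format and counts visits per requested path. The regular expression, the `parse_clf_line` helper, and the sample log lines are assumptions for illustration only; real server logs often carry extra fields.

```python
import re
from collections import Counter

# NCSA common log format:
# host ident authuser [timestamp] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf_line(line):
    """Return a dict of fields for one log line, or None if malformed."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A "-" in the size field means no response body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


# Made-up sample data (documentation address range 192.0.2.0/24).
SAMPLE_LOG = """\
192.0.2.1 - - [10/Oct/2013:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326
192.0.2.7 - alice [10/Oct/2013:13:56:01 -0700] "GET /index.html HTTP/1.0" 304 -
this line is malformed
192.0.2.1 - - [10/Oct/2013:13:57:12 -0700] "GET /about.html HTTP/1.0" 200 512
"""

records = [r for r in map(parse_clf_line, SAMPLE_LOG.splitlines()) if r]
# The request field is "METHOD PATH PROTOCOL"; take the path component.
hits_per_path = Counter(r["request"].split()[1] for r in records)
print(hits_per_path.most_common())
```

Malformed lines are skipped rather than raising an error, which is one reasonable choice when log quality varies across sources.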

                      3.2.2 Data transportation

                      Upon the completion of raw data collection, data will be transferred to a data storage infrastructure for processing and analysis. As discussed in Section 2.3, big data is mainly stored in a data center. The data layout should be adjusted to improve computing efficiency or facilitate hardware maintenance. In other words, internal data transmission may occur in the data center. Therefore, data transmission consists of two phases: Inter-DCN (data center network) transmissions and Intra-DCN transmissions.

                      – Inter-DCN transmissions: Inter-DCN transmissions are from data source to data center, which is generally achieved with the existing physical network infrastructure. Because of the rapid growth of traffic demands, the physical network infrastructure in most regions around the world is constituted by high-volume, high-rate, and cost-effective optical fiber transmission systems. Over the past 20 years, advanced management equipment and technologies have been developed, such as IP-based wavelength division multiplexing (WDM) network architecture, to conduct smart control and management of optical fiber networks [48, 49]. WDM is a technology that multiplexes multiple optical carrier signals with different wavelengths and couples them to the same optical fiber of the optical link. In such technology, lasers with different wavelengths carry different signals. So far, the backbone network has been deployed with WDM optical transmission systems with a single-channel rate of 40 Gb/s. At present, 100 Gb/s commercial interfaces are available, and 100 Gb/s systems (or Tb/s systems) will be available in the near future [50]. However, traditional optical transmission technologies are limited by the bandwidth of the electronic bottleneck [51]. Recently, orthogonal frequency-division multiplexing (OFDM), initially designed for wireless systems, is regarded as one of the main candidate technologies for future high-speed optical transmission. OFDM is a multi-carrier parallel transmission technology. It segments a high-speed data flow to transform it into low-speed sub-data-flows to be transmitted over multiple orthogonal sub-carriers [52]. Compared with the fixed channel spacing of WDM, OFDM allows sub-channel frequency spectrums to overlap with each other [53]. Therefore, it is a flexible and efficient optical networking technology.

                      – Intra-DCN transmissions: Intra-DCN transmissions are the data communication flows within data centers. Intra-DCN transmissions depend on the communication mechanism within the data center (i.e., on physical connection plates, chips, internal memories of data servers, network architectures of data centers, and communication protocols). A data center consists of multiple integrated server racks interconnected with its internal connection networks. Nowadays, the internal connection networks of most data centers are fat-tree, two-layer or three-layer structures based on multi-commodity network flows [51, 54]. In the two-layer topological structure, the racks are connected by 1 Gbps top-of-rack (TOR) switches, and such TOR switches are then connected with 10 Gbps aggregation switches. The three-layer topological structure is augmented with one layer on top of the two-layer topological structure, and such layer is constituted by 10 Gbps or 100 Gbps core switches to connect the aggregation switches. There are also other topological structures which aim to improve the data center networks [55–58]. Because of the inadequacy of electronic packet switches, it is difficult to increase communication bandwidths while keeping energy consumption low. Over the years, due to the huge success achieved by optical technologies, the optical interconnection among the networks in data centers has drawn great interest. Optical interconnection is a high-throughput, low-delay, and low-energy-consumption solution. At present, optical technologies are only used for point-to-point links in data centers. Such optical links provide connection for the switches using low-cost multi-mode fiber (MMF) with a 10 Gbps data rate. Optical interconnection (switching in the optical domain) of networks in data centers is a feasible solution, which can provide Tbps-level transmission bandwidth with low energy consumption. Recently, many optical interconnection plans have been proposed for data center networks [59]. Some plans add optical paths to upgrade the existing networks, and other plans completely replace the current switches [59–64]. As a strengthening technology, Zhou et al. in [65] adopt wireless links in the 60 GHz frequency band to strengthen wired links. Network virtualization should also be considered to improve the efficiency and utilization of data center networks.
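The OFDM idea mentioned under Inter-DCN transmissions, splitting one fast serial stream into slow parallel streams carried on orthogonal sub-carriers, can be shown numerically. The sketch below is a baseband toy model, not an optical transmission system: it assumes BPSK symbols, an ideal noiseless channel, and no cyclic prefix, and the function names are illustrative.

```python
import cmath


def idft(symbols):
    """Inverse DFT: places one symbol on each orthogonal sub-carrier."""
    n = len(symbols)
    return [
        sum(symbols[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def dft(samples):
    """Forward DFT: separates the sub-carriers again at the receiver."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def ofdm_modulate(bits, n_subcarriers):
    """Split a serial bit stream into blocks of n_subcarriers BPSK symbols
    and turn each block into one time-domain OFDM symbol."""
    if len(bits) % n_subcarriers != 0:
        raise ValueError("bit count must be a multiple of n_subcarriers")
    blocks = []
    for start in range(0, len(bits), n_subcarriers):
        block = bits[start:start + n_subcarriers]
        bpsk = [1.0 if bit else -1.0 for bit in block]  # 1 -> +1, 0 -> -1
        blocks.append(idft(bpsk))
    return blocks


def ofdm_demodulate(blocks):
    """Recover the serial bit stream from the time-domain OFDM symbols."""
    bits = []
    for samples in blocks:
        for value in dft(samples):
            bits.append(1 if value.real > 0 else 0)
    return bits


bits = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1]
tx = ofdm_modulate(bits, n_subcarriers=8)
rx = ofdm_demodulate(tx)
print(rx == bits)
```

Because the sub-carriers are mutually orthogonal over one symbol period, the receiver separates them exactly even though their spectra overlap, which is the property the text contrasts with the fixed channel spacing of WDM.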

                      3.2.3 Data pre-processing

                      Because of the wide variety of data sources, the collected datasets vary with respect to noise, redundancy, consistency, etc., and it is undoubtedly a waste to store meaningless data. In addition, some analytical methods have strict requirements on data quality. Therefore, in order to enable effective data analysis, we shall pre-process data under many circumstances to integrate the data from different sources, which can not only reduce storage expense but also improve analysis accuracy. Some relational data pre-processing techniques are discussed as follows.

                      – Integration: Data integration is the cornerstone of modern commercial informatics, which involves the combination of data from different sources and provides users with a uniform view of data [66]. This is a mature research field for traditional databases. Historically, two methods have been widely recognized: data warehouse and data federation. Data warehousing includes a process named ETL (Extract, Transform, and Load). Extraction involves connecting source systems, and selecting, collecting, analyzing, and processing necessary data. Transformation is the execution of a series of rules to transform the extracted data into standard formats. Loading means importing extracted and transformed data into the target storage infrastructure. Loading is the most complex procedure among the three, which includes operations such as transformation, copy, clearing, standardization, screening, and data organization. In data federation, a virtual database can be built to query and aggregate data from different data sources, but such a database does not contain data. Instead, it includes information or metadata related to the actual data and its positions. These two "storage-reading" approaches do not satisfy the high performance requirements of data flows or search programs and applications. Compared with queries, data in these two approaches is more dynamic and must be processed during data transmission. Generally, data integration methods are accompanied by flow processing engines and search engines [30, 67].

                      – Cleaning: Data cleaning is a process to identify inaccurate, incomplete, or unreasonable data, and then modify or delete such data to improve data quality. Generally, data cleaning includes five complementary procedures [68]: defining and determining error types, searching and identifying errors, correcting errors, documenting error examples and error types, and modifying data entry procedures to reduce future errors. During cleaning, data formats, completeness, rationality, and restriction shall be inspected. Data cleaning is of vital importance to keep data consistent, and is widely applied in many fields, such as banking, insurance, retail industry, telecommunications, and traffic control.

                      In e-commerce, most data is electronically collected, which may have serious data quality problems. Classic data quality problems mainly come from software defects, customized errors, or system misconfiguration. Authors in [69] discussed data cleaning in e-commerce by crawlers and regularly re-copying customer and account information.

                      In [70], the problem of cleaning RFID data was examined. RFID is widely used in many applications, e.g., inventory management and target tracking. However, the original RFID data features low quality, which includes a lot of abnormal data limited by the physical design and affected by environmental noises. In [71], a probability model was developed to cope with data loss in mobile environments. Khoussainova et al. in [72] proposed a system to automatically correct errors of input data by defining global integrity constraints.

                      Herbert et al. [73] proposed a framework called BIO-AJAX to standardize biological data so as to conduct further computation and improve search quality. With BIO-AJAX, some errors and repetitions may be eliminated, and common data mining technologies can be executed more effectively.

                      – Redundancy elimination: Data redundancy refers to data repetitions or surplus, which usually occurs in many datasets. Data redundancy can increase unnecessary data transmission expense and cause defects in storage systems, e.g., waste of storage space, data inconsistency, reduction of data reliability, and data damage. Therefore, various redundancy reduction methods have been proposed, such as redundancy detection, data filtering, and data compression. Such methods may apply to different datasets or application environments. However, redundancy reduction may also bring about certain negative effects. For example, data compression and decompression cause additional computational burden. Therefore, the benefits of redundancy reduction and the cost should be carefully balanced. Data collected from different fields will increasingly appear in image or video formats. It is well known that images and videos contain considerable redundancy, including temporal redundancy, spatial redundancy, statistical redundancy, and sensing redundancy. Video compression is widely used to reduce redundancy in video data, as specified in the many video coding standards (MPEG-2, MPEG-4, H.263, and H.264/AVC). In [74], the authors investigated the problem of video compression in a video surveillance system with a video sensor network. The authors propose a new MPEG-4 based method by investigating the contextual redundancy related to background and foreground in a scene. The low complexity and the low compression ratio of the proposed approach were demonstrated by the evaluation results.

                      For generalized data transmission or storage, repeated data deletion (deduplication) is a special data compression technology, which aims to eliminate repeated data copies [75]. With repeated data deletion, individual data blocks or data segments will be assigned identifiers (e.g., using a hash algorithm) and stored, with the identifiers added to the identification list. As the analysis of repeated data deletion continues, if a new data block has an identifier that is identical to one listed in the identification list, the new data block will be deemed redundant and will be replaced by the corresponding stored data block. Repeated data deletion can greatly reduce storage requirements, which is particularly important to a big data storage system. Apart from the aforementioned data pre-processing methods, specific data objects shall go through some other operations such as feature extraction. Such operation plays an important role in multimedia search and DNA analysis [76–78]. Usually, high-dimensional feature vectors (or high-dimensional feature points) are used to describe such data objects, and the system stores the high-dimensional feature vectors for future retrieval. Data transfer is usually used to process distributed heterogeneous data sources, especially business datasets [79]. As a matter of fact, in consideration of various datasets, it is non-trivial or impossible to build a uniform data pre-processing procedure and technology that is applicable to all types of datasets. The specific features, problems, performance requirements, and other factors of the datasets should be considered, so as to select a proper data pre-processing strategy.
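The repeated data deletion procedure described above (assign each block a hash identifier, keep an identification list, and store a block only when its identifier is new) can be sketched as follows. The `DedupStore` class, the fixed block size, and the choice of SHA-256 are assumptions for illustration; practical systems often use variable-size chunking and persistent indexes.

```python
import hashlib


class DedupStore:
    """Toy block store that keeps each distinct block only once."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        # identifier -> block bytes; the keys play the role of the
        # identification list mentioned in the text.
        self.blocks = {}

    def put(self, data):
        """Store data and return its recipe: the ordered block identifiers."""
        recipe = []
        for start in range(0, len(data), self.block_size):
            block = data[start:start + self.block_size]
            identifier = hashlib.sha256(block).hexdigest()
            if identifier not in self.blocks:
                self.blocks[identifier] = block   # new block: store it
            # A known identifier means the block is redundant, so only
            # the reference is recorded.
            recipe.append(identifier)
        return recipe

    def get(self, recipe):
        """Rebuild the original data from a recipe."""
        return b"".join(self.blocks[identifier] for identifier in recipe)

    def stored_bytes(self):
        """Physical bytes held after deduplication."""
        return sum(len(block) for block in self.blocks.values())


store = DedupStore(block_size=4)
first = store.put(b"AAAABBBBAAAACCCC")   # 16 logical bytes
second = store.put(b"BBBBCCCCDDDD")      # 12 logical bytes
print(store.stored_bytes())              # physical bytes kept
```

In this example 28 logical bytes are written but only four distinct 4-byte blocks are kept. The trade-off noted in the text still applies: hashing every block adds computational cost in exchange for the storage saving.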

                      4 Big data storage

                      The explosive growth of data imposes stricter requirements on storage and management. In this section, we focus on the storage of big data. Big data storage refers to the storage and management of large-scale datasets while achieving reliability and availability of data access. We will review important issues including massive storage systems, distributed storage systems, and big data storage mechanisms. On one hand, the storage infrastructure needs to provide information storage service with reliable storage space; on the other hand, it must provide a powerful access interface for query and analysis of a large amount of data.

                      Traditionally, as auxiliary equipment of a server, a data storage device is used to store, manage, look up, and analyze data with structured RDBMSs. With the sharp growth of data, data storage devices are becoming increasingly important, and many Internet companies pursue large storage capacity to be competitive. Therefore, there is a compelling need for research on data storage.


                      4.1 Storage system for massive data

                      Various storage systems have emerged to meet the demands of massive data. Existing massive storage technologies can be classified as Direct Attached Storage (DAS) and network storage, while network storage can be further classified into Network Attached Storage (NAS) and Storage Area Network (SAN).

                      In DAS, various hard disks are directly connected with servers, and data management is server-centric, such that storage devices are peripheral equipment, each of which takes a certain amount of I/O resource and is managed by individual application software. For this reason, DAS is only suitable for interconnecting servers on a small scale. However, due to its low scalability, DAS will exhibit undesirable efficiency when the storage capacity is increased, i.e., the upgradeability and expandability are greatly limited. Thus, DAS is mainly used in personal computers and small-sized servers.

                      Network storage utilizes a network to provide users with a unified interface for data access and sharing. Network storage equipment includes special data exchange equipment, disk arrays, tape libraries, and other storage media, as well as special storage software. It is characterized by strong expandability.

                      NAS is actually auxiliary storage equipment of a network. It is directly connected to a network through a hub or switch using TCP/IP protocols. In NAS, data is transmitted in the form of files. Compared to DAS, the I/O burden at a NAS server is reduced extensively, since the server accesses a storage device indirectly through a network.

While NAS is network-oriented, SAN is especially designed for data storage with a scalable and bandwidth-intensive network, e.g., a high-speed network with optical fiber connections. In SAN, data storage management is relatively independent within a storage local area network, where multipath-based data switching among any internal nodes is utilized to achieve a maximum degree of data sharing and data management.

From the organization of a data storage system, DAS, NAS, and SAN can all be divided into three parts: (i) the disk array, which is the foundation of a storage system and the fundamental guarantee for data storage; (ii) connection and network sub-systems, which provide connections among one or more disk arrays and servers; and (iii) storage management software, which handles data sharing, disaster recovery, and other storage management tasks of multiple servers.

4.2 Distributed storage system

The first challenge brought about by big data is how to develop a large-scale distributed storage system for efficient data processing and analysis. To use a distributed system to store massive data, the following factors should be taken into consideration:

– Consistency: a distributed storage system requires multiple servers to cooperatively store data. As there are more servers, the probability of server failures is larger. Usually, data is divided into multiple pieces to be stored at different servers to ensure availability in case of server failure. However, server failures and parallel storage may cause inconsistency among different copies of the same data. Consistency refers to assuring that multiple copies of the same data are identical.

– Availability: a distributed storage system operates on multiple sets of servers. As more servers are used, server failures are inevitable. It is desirable that the entire system is not seriously affected and can still satisfy customers' read and write requests. This property is called availability.

– Partition Tolerance: multiple servers in a distributed storage system are connected by a network. The network could have link/node failures or temporary congestion. The distributed system should have a certain level of tolerance to problems caused by network failures. It is desirable that the distributed storage still works well when the network is partitioned.

Eric Brewer proposed the CAP theory [80, 81] in 2000, which indicated that a distributed system could not simultaneously meet the requirements on consistency, availability, and partition tolerance; at most two of the three requirements can be satisfied simultaneously. Seth Gilbert and Nancy Lynch from MIT proved the correctness of the CAP theory in 2002. Since consistency, availability, and partition tolerance cannot be achieved simultaneously, we can have a CA system by ignoring partition tolerance, a CP system by ignoring availability, and an AP system that ignores consistency, according to different design goals. The three systems are discussed in the following.

CA systems do not have partition tolerance, i.e., they cannot handle network failures. Therefore, CA systems are generally deemed as storage systems with a single server, such as traditional small-scale relational databases. Such systems feature a single copy of data, such that consistency is easily ensured. Availability is guaranteed by the excellent design of relational databases. However, since CA systems cannot handle network failures, they cannot be expanded to use many servers. Therefore, most large-scale storage systems are CP systems or AP systems.

Compared with CA systems, CP systems ensure partition tolerance. Therefore, CP systems can be expanded to become distributed systems. CP systems generally maintain several copies of the same data in order to ensure a level of fault tolerance. CP systems also ensure data consistency, i.e., multiple copies of the same data are guaranteed to be completely identical. However, CP systems cannot ensure sound availability because of the high cost of consistency assurance. Therefore, CP systems are useful for scenarios with moderate load but stringent requirements on data accuracy (e.g., trading data). BigTable and HBase are two popular CP systems.

AP systems also ensure partition tolerance. However, AP systems differ from CP systems in that they also ensure availability. In exchange, AP systems only ensure eventual consistency rather than the strong consistency of the previous two systems. Therefore, AP systems only apply to scenarios with frequent requests but not very high requirements on accuracy. For example, in online Social Networking Services (SNS) systems, there are many concurrent visits to the data, but a certain amount of data errors are tolerable. Furthermore, because AP systems ensure eventual consistency, accurate data can still be obtained after a certain amount of delay. Therefore, AP systems may also be used under circumstances with no stringent real-time requirements. Dynamo and Cassandra are two popular AP systems.
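The difference between eventual and strong consistency described above can be illustrated with a toy sketch. It is not taken from Dynamo or Cassandra: the `Replica` class, the integer version numbers, the last-writer-wins rule, and the `anti_entropy` function are all assumptions made for illustration. A write is accepted by one replica immediately (availability), another replica may return stale data for a while, and all copies agree once synchronization has run.

```python
class Replica:
    """One copy of the data: key -> (version, value). Hypothetical toy class."""

    def __init__(self, name):
        self.name = name
        self.store = {}

    def write(self, key, value, version):
        current = self.store.get(key)
        # Keep the entry with the highest version (last-writer-wins, an assumption).
        if current is None or version > current[0]:
            self.store[key] = (version, value)

    def read(self, key):
        entry = self.store.get(key)
        return None if entry is None else entry[1]


def anti_entropy(replicas):
    """Propagate every entry to every replica; each replica keeps the newest version."""
    merged = {}
    for replica in replicas:
        for key, (version, value) in replica.store.items():
            if key not in merged or version > merged[key][0]:
                merged[key] = (version, value)
    for replica in replicas:
        for key, (version, value) in merged.items():
            replica.write(key, value, version)


a, b = Replica("a"), Replica("b")
a.write("cart", ["book"], version=1)   # accepted by replica a right away
stale = b.read("cart")                 # replica b has not seen the write yet
anti_entropy([a, b])                   # asynchronous synchronization, run later
fresh = b.read("cart")                 # both replicas now return the same value
```

A CP system would instead refuse or delay the read on replica `b` until it could guarantee the latest value, trading availability for consistency.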

4.3 Storage mechanism for big data

Considerable research on big data promotes the development of storage mechanisms for big data. Existing storage mechanisms of big data may be classified into three bottom-up levels: (i) file systems, (ii) databases, and (iii) programming models.

File systems are the foundation of the applications at upper levels. Google's GFS is an expandable distributed file system that supports large-scale, distributed, data-intensive applications [25]. GFS uses cheap commodity servers to achieve fault tolerance and provides customers with high-performance services. GFS supports large-scale file applications with more frequent reading than writing. However, GFS also has some limitations, such as a single point of failure and poor performance for small files. Such limitations have been overcome by Colossus [82], the successor of GFS.

In addition, other companies and researchers also have their own solutions to meet the different demands for storage of big data. For example, HDFS and Kosmosfs are open-source derivatives of GFS. Microsoft developed Cosmos [83] to support its search and advertisement business. Facebook utilizes Haystack [84] to store its large amount of small-sized photos. Taobao also developed TFS and FastDFS. In conclusion, distributed file systems have become relatively mature after years of development and business operation. Therefore, we will focus on the other two levels in the rest of this section.

4.3.1 Database technology

Database technology has been evolving for more than 30 years. Various database systems have been developed to handle datasets at different scales and support various applications. Traditional relational databases cannot meet the challenges on categories and scales brought about by big data. NoSQL databases (i.e., non-traditional relational databases) are becoming more popular for big data storage. NoSQL databases feature flexible modes, support for simple and easy copy, a simple API, eventual consistency, and support for large volumes of data. NoSQL databases are becoming the core technology for big data. We will examine the following three main NoSQL databases in this section: key-value databases, column-oriented databases, and document-oriented databases, each based on certain data models.

– Key-value Databases: Key-value databases are constituted by a simple data model, and data is stored corresponding to key-values. Every key is unique, and customers may look up values according to the keys. Such databases feature a simple structure, and modern key-value databases are characterized by higher expandability and shorter query response times than those of relational databases. Over the past few years, many key-value databases have appeared, motivated by Amazon's Dynamo system [85]. We will introduce Dynamo and several other representative key-value databases.

  – Dynamo: Dynamo is a highly available and expandable distributed key-value data storage system. It is used to store and manage the status of some core services, which can be realized with key access, in the Amazon e-Commerce Platform. The public mode of relational databases may generate invalid data and limit data scale and availability, while Dynamo can resolve these problems with a simple key-object interface, which is constituted by simple read and write operations. Dynamo achieves elasticity and availability through its data partition, data copy, and object versioning mechanisms. The Dynamo partition plan relies on Consistent Hashing [86] to divide the load among multiple main storage machines; its main advantage is that the arrival or departure of a node only affects directly adjacent nodes and does not affect other nodes. Dynamo copies data to N sets of servers, in which N is a configurable parameter, in order to achieve high availability and durability. The Dynamo system also provides eventual consistency, so as to conduct asynchronous updates on all copies.

  – Voldemort: Voldemort is also a key-value storage system, which was initially developed for and is still used by LinkedIn. Keys and values in Voldemort are composite objects constituted by lists and maps. The Voldemort interface includes three simple operations: reading, writing, and deletion, all of which are specified by keys. Voldemort provides asynchronous updating with multi-version concurrency control, but does not ensure data consistency. However, Voldemort supports optimistic locking for consistent multi-record updating: when a conflict happens between the update and any other operation, the update operation will quit. The data copy mechanism of Voldemort is the same as that of Dynamo. Voldemort not only stores data in RAM but also allows data to be inserted into a storage engine. In particular, Voldemort supports two storage engines: Berkeley DB and Random Access Files.

The key-value database emerged a few years ago. Deeply influenced by Amazon Dynamo DB, other key-value storage systems include Redis, Tokyo Cabinet and Tokyo Tyrant, Memcached and MemcacheDB, Riak, and Scalaris, all of which provide expandability by distributing keys among nodes. Voldemort, Riak, Tokyo Cabinet, and Memcached can utilize attached storage devices to store data in RAM or on disks. Other storage systems store data in RAM and provide disk backup, or rely on copy and recovery to avoid backup.
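The consistent hashing scheme that Dynamo relies on for partitioning and N-way copying can be sketched as follows. This is a minimal illustration under stated assumptions, not Dynamo's implementation: the `HashRing` class, the use of MD5 to place nodes and keys on the ring, the node names, and the omission of virtual nodes are all choices made here for brevity.

```python
import bisect
import hashlib


def ring_position(label: str) -> int:
    """Map a string to a position on the hash ring (MD5 is an arbitrary choice)."""
    return int(hashlib.md5(label.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical toy ring: one position per node, no virtual nodes."""

    def __init__(self, nodes=()):
        self._positions = []   # sorted ring positions
        self._owner = {}       # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        pos = ring_position(node)
        bisect.insort(self._positions, pos)
        self._owner[pos] = node

    def remove(self, node: str) -> None:
        pos = ring_position(node)
        self._positions.remove(pos)
        del self._owner[pos]

    def preference_list(self, key: str, n: int = 1):
        """Return the first n distinct nodes clockwise from the key's position."""
        if not self._positions:
            return []
        start = bisect.bisect_right(self._positions, ring_position(key))
        count = min(n, len(self._positions))
        size = len(self._positions)
        return [self._owner[self._positions[(start + i) % size]] for i in range(count)]


ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.preference_list(k)[0] for k in keys}
replicas = ring.preference_list("user:1", n=3)   # N = 3 copies of one key

ring.remove("node-c")                            # a node leaves the ring
after = {k: ring.preference_list(k)[0] for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Only keys that were owned by the departed node change owner; every other key stays where it was, which is the property the text attributes to consistent hashing.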

– Column-oriented Databases: Column-oriented databases store and process data according to columns rather than rows. Both columns and rows are segmented across multiple nodes to realize expandability. Column-oriented databases are mainly inspired by Google's BigTable. In this section, we first discuss BigTable and then introduce several derivative tools.

  – BigTable: BigTable is a distributed, structured data storage system, which is designed to process large-scale (PB class) data among thousands of commercial servers [87]. The basic data structure of BigTable is a sparse, distributed, persistent, multi-dimensional sorted map. The indexes of the map are the row key, column key, and timestamp, and every value in the map is an uninterpreted byte array. Each row key in BigTable is a character string of up to 64 KB. Rows are stored in lexicographical order and continually segmented into Tablets (i.e., units of distribution) for load balance. Thus, reading a short range of rows can be highly efficient, since it only involves communication with a small portion of machines. The columns are grouped according to the prefixes of their keys, thus forming column families. These column families are the basic units for access control. The timestamps are 64-bit integers that distinguish different editions of cell values. Clients may flexibly determine the number of cell editions stored. These editions are sequenced in descending order of timestamps, so the latest edition will always be read first.

The BigTable API features the creation and deletion of tables and column families, as well as modification of the metadata of clusters, tables, and column families. Client applications may insert or delete values in BigTable, query values from columns, or browse sub-datasets in a table. BigTable also supports some other characteristics, such as transaction processing in a single row. Users may utilize such features to conduct more complex data processing.

Every BigTable deployment includes three main components: the Master server, Tablet servers, and the client library. BigTable only allows one Master server to be deployed, which is responsible for assigning Tablets to Tablet servers, detecting added or removed Tablet servers, and conducting load balance. In addition, it can also modify the BigTable schema, e.g., creating tables and column families, and garbage-collect deleted or disabled files stored in GFS that were used by specific BigTable instances. Every Tablet server manages a set of Tablets and is responsible for the reading and writing of its loaded Tablets. When Tablets grow too big, they are segmented by the server. The application client library is used to communicate with BigTable instances.

BigTable is based on many fundamental components of Google, including GFS [25], a cluster management system, the SSTable file format, and Chubby [88]. GFS is used to store data and log files. The cluster management system is responsible for task scheduling, resource sharing, processing of machine failures, and monitoring of machine statuses. The SSTable file format is used to store BigTable data internally, and it provides a persistent, ordered, and immutable mapping from keys to values, both of which are arbitrary byte strings. BigTable utilizes Chubby for the following tasks: 1) ensure there is at most one active Master copy at any time; 2) store the bootstrap location of BigTable data; 3) look up Tablet servers; 4) conduct error recovery in case of Tablet server failures; 5) store BigTable schema information; 6) store the access control table.

  – Cassandra: Cassandra is a distributed storage system for managing huge amounts of structured data distributed among multiple commercial servers [89]. The system was developed by Facebook and became an open source tool in 2008. It adopts the ideas and concepts of both Amazon Dynamo and Google BigTable, especially integrating the distributed system technology of Dynamo with the BigTable data model. Tables in Cassandra are in the form of a distributed four-dimensional structured mapping, where the four dimensions are row, column, column family, and super column. A row is distinguished by a string key of arbitrary length. No matter how many columns are to be read or written, the operation on a row is atomic. Columns may constitute clusters, which are called column families and are similar to the data model of BigTable. Cassandra provides two kinds of column families: column families and super columns. A super column includes an arbitrary number of columns related to the same name. A column family includes columns and super columns, which may be continuously inserted into the column family during runtime. The partition and copy mechanisms of Cassandra are very similar to those of Dynamo, so as to achieve consistency.

  – Derivative tools of BigTable: since the BigTable code cannot be obtained through an open source license, some open source projects compete to implement the BigTable concept and develop similar systems, such as HBase and Hypertable.

HBase is a BigTable clone programmed in Java and is part of Apache's Hadoop MapReduce framework [90]. HBase replaces GFS with HDFS. It writes updated contents into RAM and regularly writes them into files on disks. Row operations are atomic, equipped with row-level locking and transaction processing, which is optional for large-scale deployments. Partition and distribution are handled transparently and leave room for client hashing or fixed keys.

HyperTable was developed along the lines of BigTable to obtain a high-performance, expandable, distributed storage and processing system for structured and unstructured data [91]. HyperTable relies on distributed file systems, e.g., HDFS, and a distributed lock manager. Its data representation, processing, and partition mechanisms are similar to those of BigTable. HyperTable has its own query language, called HyperTable Query Language (HQL), which allows users to create, modify, and query underlying tables.

Since the column-oriented storage databases mainly emulate BigTable, their designs are all similar, except for the concurrency mechanism and several other features. For example, Cassandra emphasizes weak consistency with concurrent control of multiple editions, while HBase and HyperTable focus on strong consistency through locks or log records.
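The BigTable data model summarized above (a sorted map indexed by row key, column key, and timestamp, with versions returned newest first) can be sketched in a few lines. This is an in-memory toy under stated assumptions, not BigTable's or HBase's API: the `SparseTable` class, its method names, and the sample row and column keys are hypothetical.

```python
from collections import defaultdict


class SparseTable:
    """Toy map: (row key, column key, timestamp) -> uninterpreted bytes."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # row -> column -> list of (timestamp, value), newest first
        self._rows = defaultdict(dict)

    def put(self, row, column, timestamp, value: bytes):
        versions = self._rows[row].setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]          # drop the oldest editions

    def get(self, row, column, timestamp=None):
        """Return the newest value, or the newest one at or before `timestamp`."""
        for ts, value in self._rows.get(row, {}).get(column, []):
            if timestamp is None or ts <= timestamp:
                return value
        return None

    def scan(self, start_row, end_row):
        """Yield (row, columns) for start_row <= row < end_row in lexicographic order."""
        for row in sorted(self._rows):
            if start_row <= row < end_row:
                yield row, self._rows[row]


table = SparseTable(max_versions=2)
for ts, body in [(1, b"v1"), (2, b"v2"), (3, b"v3")]:
    table.put("com.example.www", "contents:", ts, body)
table.put("org.example", "contents:", 1, b"other")

latest = table.get("com.example.www", "contents:")
as_of_2 = table.get("com.example.www", "contents:", timestamp=2)
dropped = table.get("com.example.www", "contents:", timestamp=1)
com_rows = [row for row, _ in table.scan("com", "com~")]
```

Because rows are kept in lexicographic order, a scan over a short key range touches only neighboring rows, which mirrors why short range reads involve few machines in the real system.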

– Document Databases: Compared with key-value storage, document storage can support more complex data forms. Since documents do not follow strict modes, there is no need to conduct mode migration. In addition, key-value pairs can still be saved. We will examine three important representatives of document storage systems, i.e., MongoDB, SimpleDB, and CouchDB.

  – MongoDB: MongoDB is an open-source, document-oriented database [92]. MongoDB stores documents as Binary JSON (BSON) objects [93], which are similar to objects. Every document has an ID field as the primary key. Queries in MongoDB are expressed with a syntax similar to JSON. A database driver sends the query as a BSON object to MongoDB. The system allows queries on all documents, including embedded objects and arrays. To enable rapid queries, indexes can be created on the queryable fields of documents. The copy operation in MongoDB is executed with log files in the main nodes, which record all the high-level operations conducted in the database. During copying, the slaves query the master for all the writing operations since the last synchronization and execute the operations in the log files in their local databases. MongoDB supports horizontal expansion with automatic sharding to distribute data among thousands of nodes, with automatic load balancing and failover.

  – SimpleDB: SimpleDB is a distributed database and a web service of Amazon [94]. Data in SimpleDB is organized into various domains, in which data may be stored, acquired, and queried. Domains include different properties and name/value pair sets of items. Data is copied to different machines at different data centers in order to ensure data safety and improve performance. This system does not support automatic partition and thus cannot be expanded with the change of data volume. SimpleDB allows users to query with SQL. It is worth noting that SimpleDB can assure eventual consistency but does not support Multi-Version Concurrency Control (MVCC). Therefore, conflicts therein cannot be detected from the client side.

  – CouchDB: Apache CouchDB is a document-oriented database written in Erlang [95]. Data in CouchDB is organized into documents consisting of fields named by keys/names and values, which are stored and accessed as JSON objects. Every document is provided with a unique identifier. CouchDB allows access to database documents through a RESTful HTTP API. If a document needs to be modified, the client must download the entire document, modify it, and then send it back to the database. After a document is rewritten once, the identifier will be updated. CouchDB utilizes optimistic replication to obtain scalability without a sharding mechanism. Since various CouchDBs may be executed along with other transactions simultaneously, any kind of replication topology can be built. The consistency of CouchDB relies on the copying mechanism. CouchDB supports MVCC with historical hash records.
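The download, modify, and send-back cycle described for CouchDB, together with version-based conflict detection, can be sketched with a toy in-memory store. This is not CouchDB's HTTP API: the `DocumentStore` class, the `ConflictError` exception, and the integer revision counter are assumptions made for illustration (the text above mentions hash-based revision records instead).

```python
import copy


class ConflictError(Exception):
    """Raised when a client tries to update a stale revision."""


class DocumentStore:
    """Hypothetical in-memory document store with per-document revisions."""

    def __init__(self):
        self._docs = {}   # doc_id -> (revision, document)

    def get(self, doc_id):
        revision, doc = self._docs[doc_id]
        return revision, copy.deepcopy(doc)      # the client works on its own copy

    def put(self, doc_id, doc, expected_revision=None):
        current = self._docs.get(doc_id)
        current_revision = current[0] if current else None
        if current_revision != expected_revision:
            raise ConflictError(
                f"{doc_id}: expected {expected_revision}, found {current_revision}")
        new_revision = 1 if current_revision is None else current_revision + 1
        self._docs[doc_id] = (new_revision, copy.deepcopy(doc))
        return new_revision


store = DocumentStore()
rev1 = store.put("user:1", {"name": "Ann", "tags": []})

# Two clients download the same revision of the whole document.
rev_a, doc_a = store.get("user:1")
rev_b, doc_b = store.get("user:1")

doc_a["tags"].append("admin")
rev2 = store.put("user:1", doc_a, expected_revision=rev_a)   # first writer succeeds

doc_b["name"] = "Bob"
try:
    store.put("user:1", doc_b, expected_revision=rev_b)      # stale revision
    conflict = False
except ConflictError:
    conflict = True
```

The second client has to fetch the document again and reapply its change, which is how a revision check lets conflicts surface on the client side.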

Big data are generally stored in hundreds and even thousands of commercial servers. Thus, the traditional parallel models, such as Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), may not be adequate to support such large-scale parallel programs. Recently, some proposed parallel programming models have effectively improved the performance of NoSQL and reduced the performance gap to relational databases. Therefore, these models have become the cornerstone for the analysis of massive data.

– MapReduce: MapReduce [22] is a simple but powerful programming model for large-scale computing, using a large number of clusters of commercial PCs to achieve automatic parallel processing and distribution. The MapReduce computing model has only two functions, i.e., Map and Reduce, both of which are programmed by users. The Map function processes input key-value pairs and generates intermediate key-value pairs. Then, MapReduce combines all the intermediate values related to the same key and transmits them to the Reduce function, which further compresses the value set into a smaller set. MapReduce has the advantage that it avoids the complicated steps for developing parallel applications, e.g., data scheduling, fault tolerance, and inter-node communications. The user only needs to program the two functions to develop a parallel application. The initial MapReduce framework did not support multiple datasets in a task, which has been mitigated by some recent enhancements [96, 97].

Over the past decades, programmers have become familiar with the advanced declarative language SQL, often used in relational databases for task description and dataset analysis. However, the succinct MapReduce framework only provides two nontransparent functions, which cannot cover all the common operations. Therefore, programmers have to spend time on programming the basic functions, which are typically hard to maintain and reuse. In order to improve programming efficiency, some advanced language systems have been proposed, e.g., Sawzall [98] of Google, Pig Latin [99] of Yahoo, Hive [100] of Facebook, and Scope [87] of Microsoft.
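The Map, group-by-key, and Reduce flow described above can be sketched in a single process. This is only an illustration of the programming model, not the Hadoop or Google API: the `map_reduce` driver and the word-count functions are hypothetical names, and there is no distribution, scheduling, or fault tolerance here.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Run map, group the intermediate values by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def word_count_map(doc_id, text):
    """Map: emit an intermediate (word, 1) pair for every word in the document."""
    for word in text.lower().split():
        yield word, 1


def word_count_reduce(word, counts):
    """Reduce: compress all values of one key into a single count."""
    return sum(counts)


documents = [("d1", "big data needs big storage"), ("d2", "data analysis")]
counts = map_reduce(documents, word_count_map, word_count_reduce)
```

The user writes only `word_count_map` and `word_count_reduce`; in a real framework the driver's grouping step is where partitioning and inter-node communication would happen.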

– Dryad: Dryad [101] is a general-purpose distributed execution engine for processing parallel applications of coarse-grained data. The operational structure of Dryad is a directed acyclic graph, in which vertexes represent programs and edges represent data channels. Dryad executes operations on the vertexes in clusters and transmits data via data channels, including files, TCP connections, and shared-memory FIFOs. During operation, resources in a logic operation graph are automatically mapped to physical resources.

The operation structure of Dryad is coordinated by a central program called the job manager, which can be executed in clusters or on workstations through a network. A job manager consists of two parts: 1) application codes, which are used to build a job communication graph, and 2) program library codes, which are used to arrange available resources. All kinds of data are directly transmitted between vertexes. Therefore, the job manager is only responsible for decision-making and does not obstruct any data transmission.

In Dryad, application developers can flexibly choose any directed acyclic graph to describe the communication modes of the application and express data transmission mechanisms. In addition, Dryad allows vertexes to use any amount of input and output data, while MapReduce supports only one input and output set. DryadLINQ [102] is the advanced language of Dryad and is used to integrate the aforementioned SQL-like language execution environment.
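The idea of a job expressed as a directed acyclic graph, with vertexes as programs and edges as data channels, can be sketched as below. This is a single-process illustration, not Dryad's API: the `run_dag` function, the vertex names, and the in-memory lists standing in for channels are assumptions, and the standard-library `graphlib` module (Python 3.9 or later) supplies the topological ordering.

```python
from graphlib import TopologicalSorter


def run_dag(vertices, edges, sources):
    """Execute a job graph in dependency order.

    vertices: name -> function taking a list of inputs
    edges:    list of (producer, consumer) pairs, i.e. the data channels
    sources:  name -> input for vertexes that have no incoming edge
    """
    predecessors = {name: [] for name in vertices}
    for producer, consumer in edges:
        predecessors[consumer].append(producer)

    outputs = {}
    for name in TopologicalSorter(predecessors).static_order():
        if predecessors[name]:
            # One input per incoming channel, in the order the edges were listed.
            inputs = [outputs[p] for p in predecessors[name]]
        else:
            inputs = sources[name]
        outputs[name] = vertices[name](inputs)
    return outputs


vertices = {
    "read_a": lambda inputs: list(inputs),
    "read_b": lambda inputs: list(inputs),
    "square": lambda inputs: [x * x for x in inputs[0]],
    "join": lambda inputs: sorted(inputs[0] + inputs[1]),   # two input channels
}
edges = [("read_a", "square"), ("square", "join"), ("read_b", "join")]
sources = {"read_a": [1, 2, 3], "read_b": [10, 20]}

result = run_dag(vertices, edges, sources)
```

The `join` vertex consumes two inputs, which illustrates the contrast drawn in the text with the single input and output set of MapReduce.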

– All-Pairs: All-Pairs [103] is a system specially designed for biometrics, bio-informatics, and data mining applications. It focuses on comparing element pairs in two datasets by a given function. All-Pairs can be expressed as a three-tuple (Set A, Set B, and Function F), in which Function F is utilized to compare all elements in Set A and Set B. The comparison result is an output matrix M, which is also called the Cartesian product or cross join of Set A and Set B.

All-Pairs is implemented in four phases: system modeling, distribution of input data, batch job management, and result collection. In Phase I, an approximation model of system performance is built to evaluate how much CPU resource is needed and how to conduct job partition. In Phase II, a spanning tree is built for data transmissions, which lets the workload of every partition retrieve input data effectively. In Phase III, after the data flow is delivered to proper nodes, the All-Pairs engine builds a batch-processing submission for jobs in partitions, while sequencing them in the batch processing system and formulating a node running command to acquire data. In the last phase, after the job completion of the batch processing system, the extraction engine collects results and combines them in a proper structure, which is generally a single file list in which all results are put in order.
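The three-tuple (Set A, Set B, Function F) and its output matrix M can be written down directly. The sketch below computes M[i][j] = F(A[i], B[j]) in one process; the `all_pairs` and `hamming` names and the sample strings are assumptions for illustration, and none of the four distributed phases is modeled.

```python
def all_pairs(set_a, set_b, compare):
    """Return the matrix M with M[i][j] = compare(set_a[i], set_b[j])."""
    return [[compare(a, b) for b in set_b] for a in set_a]


def hamming(a: str, b: str) -> int:
    """Count the positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


set_a = ["ACGT", "AAAA"]
set_b = ["ACGA", "TTTT", "AAAT"]
matrix = all_pairs(set_a, set_b, hamming)
# matrix == [[1, 3, 2], [2, 4, 1]]
```

The work grows with the product of the two set sizes, which is why the real system spends effort on modeling, partitioning, and distributing input data before running the comparisons.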

– Pregel: The Pregel [104] system of Google facilitates the processing of large-sized graphs, e.g., analysis of network graphs and social networking services. A computational task is expressed by a directed graph constituted by vertexes and directed edges. Every vertex is associated with a modifiable, user-defined value, and every directed edge, which is associated with a source vertex, is constituted by a user-defined value and the identifier of a target vertex. When the graph is built, the program conducts iterative calculations, called supersteps, among which global synchronization points are set, until algorithm completion and output completion. In every superstep, vertex computations are parallel, and every vertex executes the same user-defined function to express a given algorithm logic. Every vertex may modify its own status and that of its output edges, receive messages sent in the previous superstep, send messages to other vertexes, and even modify the topological structure of the entire graph. Edges are not provided with corresponding computations. Every vertex may deactivate itself by suspension. When all vertexes are in an inactive status without any message to transmit, the entire program execution is completed. The Pregel program output is a set consisting of the values output from all the vertexes. Generally speaking, the input and output of a Pregel program are isomorphic directed graphs.
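The superstep model described above can be sketched with a small vertex-centric computation that propagates the maximum value through a graph. This is a sequential toy, not Google's Pregel API: the `pregel_max` function, its arguments, and the sample graph are hypothetical, and each loop iteration stands in for one globally synchronized superstep.

```python
def pregel_max(values, out_edges, max_supersteps=50):
    """Propagate the largest vertex value along directed edges.

    values:    vertex -> initial user-defined value
    out_edges: vertex -> list of target vertexes
    """
    values = dict(values)
    # Superstep 0: every vertex is active and sends its value to its neighbors.
    inbox = {v: [] for v in values}
    for v, targets in out_edges.items():
        for t in targets:
            inbox[t].append(values[v])
    supersteps = 1

    while any(inbox.values()) and supersteps < max_supersteps:
        outbox = {v: [] for v in values}
        for v, messages in inbox.items():
            if not messages:
                continue                        # no message: the vertex stays inactive
            best = max(messages)
            if best > values[v]:                # value changed: notify the neighbors
                values[v] = best
                for t in out_edges.get(v, []):
                    outbox[t].append(best)
            # otherwise the vertex sends nothing and becomes inactive again
        inbox = outbox                          # global synchronization point
        supersteps += 1
    return values, supersteps


graph = {"a": ["b"], "b": ["a", "c"], "c": ["d"], "d": ["c"]}
initial = {"a": 3, "b": 6, "c": 2, "d": 1}
final_values, steps = pregel_max(initial, graph)
```

Execution stops when no vertex has a message left to process, matching the termination condition in the text; the output keeps the same vertex set as the input, with updated values.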

Inspired by the above programming models, other research has also focused on programming modes for more complex computational tasks, e.g., iterative computations [105, 106], fault-tolerant memory computations [107], incremental computations [108], and flow control decision-making related to data [109].

5 Big data analysis

The analysis of big data mainly involves analytical methods for traditional data and big data, analytical architecture for big data, and software used for mining and analysis of big data. Data analysis is the final and the most important phase in the value chain of big data, with the purpose of extracting useful values and providing suggestions or decisions. Different levels of potential values can be generated through the analysis of datasets in different fields [10]. However, data analysis is a broad area, which frequently changes and is extremely complex. In this section, we introduce the methods, architectures, and tools for big data analysis.

5.1 Traditional data analysis

Traditional data analysis means to use proper statistical methods to analyze massive data, to concentrate, extract, and refine useful data hidden in a batch of chaotic datasets, and to identify the inherent law of the subject matter, so as to maximize the value of data. Data analysis plays a huge guidance role in making development plans for a country, understanding customer demands for commerce, and predicting market trends for enterprises. Big data analysis can be deemed as the analysis technique for a special kind of data. Therefore, many traditional data analysis methods may still be utilized for big data analysis. Several representative traditional data analysis methods are examined in the following, many of which are from statistics and computer science.

– Cluster Analysis: a statistical method for grouping objects, specifically classifying objects according to some features. Cluster analysis is used to differentiate objects with particular features and divide them into some categories (clusters) according to these features, such that objects in the same category have high homogeneity while different categories have high heterogeneity. Cluster analysis is an unsupervised learning method without training data.

– Factor Analysis: basically targeted at describing the relation among many elements with only a few factors, i.e., grouping several closely related variables into a factor; the few factors are then used to reveal most of the information of the original data.

– Correlation Analysis: an analytical method for determining the law of relations, such as correlation, correlative dependence, and mutual restriction, among observed phenomena and accordingly conducting forecast and control. Such relations may be classified into two types: (i) function, reflecting the strict dependence relationship among phenomena, which is also called a definitive dependence relationship; (ii) correlation, i.e., some undetermined or inexact dependence relations, in which the numerical value of a variable may correspond to several numerical values of the other variable, and such numerical values present a regular fluctuation surrounding their mean values.

– Regression Analysis: a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies dependence relationships among variables hidden by randomness. Regression analysis may make complex and undetermined correlations among variables simple and regular.

– A/B Testing: also called bucket testing. It is a technology for determining how to improve target variables by comparing tested groups. Big data will require a large number of tests to be executed and analyzed.

– Statistical Analysis: Statistical analysis is based on statistical theory, a branch of applied mathematics. In statistical theory, randomness and uncertainty are modeled with probability theory. Statistical analysis can provide a description and an inference for big data. Descriptive statistical analysis can summarize and describe datasets, while inferential statistical analysis can draw conclusions from data subject to random variations. Statistical analysis is widely applied in the economic and medical care fields [110].

– Data Mining Algorithms: Data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. In 2006, the IEEE International Conference on Data Mining Series (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [111], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, and CART, etc. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and link mining, all of which are among the most important problems in data mining research.
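As a small worked example of the correlation and regression analyses listed above, the sketch below fits a straight line by ordinary least squares and computes the Pearson correlation coefficient. The choice of these two specific formulas, the function names, and the sample data are assumptions made for illustration; the survey itself does not prescribe a particular method.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


# Made-up observations that roughly follow y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = fit_line(xs, ys)
r = pearson_r(xs, ys)
```

Here the fitted line turns a noisy set of observations into a simple, regular relationship, while `r` close to 1 quantifies how strongly the two variables move together.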

5.2 Big data analytic methods

At the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data so as to bring value to enterprises and individuals. At present, the main processing methods of big data are as follows.

– Bloom Filter: A Bloom filter consists of a bit array and a series of hash functions. Its principle is to store hash values of data rather than the data itself in the bit array, which is in essence a bitmap index that uses hash functions to conduct lossy compression storage of data. It has such advantages as high space efficiency and high query speed, but also has some disadvantages in misrecognition (false positives) and deletion.

– Hashing: a method that essentially transforms data into shorter fixed-length numerical values or index values. Hashing has such advantages as rapid reading, rapid writing, and high query speed, but it is hard to find a sound hash function.

– Index: An index is always an effective method to reduce the expense of disk reading and writing and to improve insertion, deletion, modification, and query speeds, both in traditional relational databases that manage structured data and in other technologies that manage semi-structured and unstructured data. However, an index has the disadvantage of the additional cost of storing index files, which must be maintained dynamically when data is updated.

– Trie: also called trie tree, a variant of the hash tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of a trie is to utilize common prefixes of character strings to reduce comparisons on character strings to the greatest extent, so as to improve query efficiency.

– Parallel Computing: Compared to traditional serial computing, parallel computing refers to simultaneously utilizing several computing resources to complete a computation task. Its basic idea is to decompose a problem into parts and assign them to several separate processes to be independently completed, so as to achieve co-processing. Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see a comparison in Table 1).
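Two of the structures in the list above, the Bloom filter and the trie, are compact enough to sketch directly. The code below is a minimal illustration under stated assumptions: the bit-array size, the number of hash functions, and the use of salted SHA-256 digests as the hash family are choices made for this example, and the class and method names are hypothetical.

```python
import hashlib


class BloomFilter:
    """Bit array plus k hash functions; stores hash positions, not the items."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive k positions by salting one hash function (illustrative choice).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means definitely absent; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(item))


class Trie:
    """Prefix tree that shares common prefixes and counts word frequencies."""

    def __init__(self):
        self.children = {}
        self.count = 0  # how many inserted words end exactly at this node

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.count += 1

    def frequency(self, word):
        node = self
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return 0
        return node.count


bf = BloomFilter()
for key in ["hadoop", "hive", "pig"]:
    bf.add(key)

trie = Trie()
for word in ["data", "data", "date", "dryad"]:
    trie.insert(word)
```

The Bloom filter never stores the keys themselves, which is why it cannot list or delete them and why `might_contain` can only answer "definitely not present" or "possibly present". The trie stores the shared prefix "da" of "data" and "date" once, so a frequency lookup costs one step per character.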

Although parallel computing systems or tools such as MapReduce and Dryad are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages are being developed based on these systems. Such high-level languages include Sawzall, Pig, and Hive, used for MapReduce, as well as Scope and DryadLINQ, used for Dryad.
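To make the MapReduce programming model mentioned above more concrete, the following sketch simulates its three stages (map, shuffle, reduce) for a word count inside a single Python process. It is not the Hadoop or Google API; the function names are hypothetical, and a real deployment would distribute the map and reduce calls over many nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # In a cluster, each map_phase and reduce_phase call could run on a
    # different node; here they run sequentially in one process.
    emitted = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())


counts = word_count(["big data analysis", "Big data tools", "data mining"])
```

The user only writes the map and reduce functions; partitioning, grouping, and re-execution of failed tasks are left to the framework, which is the "automation" that Table 1 attributes to MapReduce and Dryad.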

5.3 Architecture for big data analysis

Because of the 4Vs of big data, different analytical architectures shall be considered for different application requirements.


Table 1 Comparison of MPI, MapReduce, and Dryad

Aspect | MPI | MapReduce | Dryad
Deployment | Computing node and data storage arranged separately (data should be moved to the computing node) | Computing and data storage arranged at the same node (computing should be close to data) | Computing and data storage arranged at the same node (computing should be close to data)
Resource management/scheduling | – | Workqueue (Google), HOD (Yahoo) | Not clear
Low-level programming | MPI API | MapReduce API | Dryad API
High-level programming | – | Pig, Hive, Jaql, ... | Scope, DryadLINQ
Data storage | The local file system, NFS, ... | GFS (Google), HDFS (Hadoop), KFS, Amazon S3, ... | NTFS, Cosmos DFS
Task partitioning | User manually partitions the tasks | Automation | Automation
Communication | Messaging, remote memory access | Files (local FS, DFS) | Files, TCP pipes, shared-memory FIFOs
Fault tolerance | Checkpoint | Task re-execution | Task re-execution

5.3.1 Real-time vs. offline analysis

According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

– Real-time analysis: mainly used in e-commerce and finance. Since data constantly changes, rapid data analysis is needed and analytical results must be returned with a very short delay. The main existing architectures of real-time analysis include (i) parallel processing clusters using traditional relational databases and (ii) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

– Offline analysis: usually used for applications without high requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally conducts analysis by importing logs into a special platform through data acquisition tools. Under the big data setting, many Internet enterprises utilize offline analysis architectures based on Hadoop in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples of acquisition tools include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool Timetunnel, and Chukwa of Hadoop. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.

5.3.2 Analysis at different levels

Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive-level analysis, which are examined in the following.

– Memory-level analysis: for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays, the memory of a server cluster surpasses hundreds of GB, and even the TB level is common. Therefore, an in-memory database technology may be used, and hot data shall reside in memory so as to improve analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis. MongoDB is a representative memory-level analytical architecture. With the development of SSDs (solid-state drives), the capacity and performance of memory-level data analysis have been further improved and widely applied.

– BI analysis: for the case where the data scale surpasses the memory level but the data may still be imported into the BI analysis environment. The current mainstream BI products are provided with data analysis plans that support data volumes above the TB level.

– Massive analysis: for the case where the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analysis utilizes HDFS of Hadoop to store data and MapReduce for data analysis. Most massive analysis belongs to the offline analysis category.


5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithms differ greatly from each other according to different kinds of data and application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the five most widely used tools according to the survey "What Analytics, Data mining, Big Data software that you used in the past 12 months for a real project?" of 798 professionals, conducted by KDnuggets in 2012 [112].

– R (30.7 %): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. When computing-intensive tasks are executed, code programmed in C, C++, and Fortran may be called in the R environment. In addition, skilled users can directly call R objects in C. Actually, R is a realization of the S language, which is an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and drawing plots. Compared to S, R is more popular since it is open source. R ranked first in the KDnuggets 2012 survey. Furthermore, in a 2012 survey of "Design languages you have used for data mining/analysis in the past year", R was also in first place, ahead of SQL and Java. Due to the popularity of R, database manufacturers such as Teradata and Oracle have released products supporting R.

– Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins with powerful functions for data analysis, such as Analysis ToolPak and Solver Add-in, are integrated initially, but such plug-ins can be used only if users enable them. Excel is also the only commercial software among the top five.

– Rapid-I RapidMiner (26.7 %): RapidMiner is open source software used for data mining, machine learning, and predictive analysis. In a KDnuggets investigation in 2011, it was more frequently used than R (ranked Top 1). Data mining and machine learning programs provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment. The data mining flow is described in XML and displayed through a graphical user interface (GUI). RapidMiner is written in Java. It integrates the learners and evaluation methods of Weka, and works with R. Functions of RapidMiner are implemented by connecting processes that consist of various operators. The entire flow can be deemed a production line of a factory, with original data as input and model results as output. The operators can be considered specific functions with different input and output characteristics.

– KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, and open-source-rich platform for data integration, data processing, data analysis, and data mining [113]. It allows users to create data flows or data channels in a visualized manner, to selectively run some or all analytical procedures, and provides analytical results, models, and interactive views. KNIME is written in Java and, being based on Eclipse, provides additional functions as plug-ins. Through plug-in files, users can insert processing modules for files, pictures, and time series, and integrate them into various open source projects, e.g., R and Weka. KNIME controls data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted in a visualized environment. KNIME is designed as a module-based and expandable framework. There is no dependence between its processing units and data containers, which makes it suitable for distributed environments and independent development. In addition, KNIME is easy to expand: developers can readily extend its various nodes and views.

– Weka/Pentaho (14.8 %): Weka, abbreviated from Waikato Environment for Knowledge Analysis, is free and open-source machine learning and data mining software written in Java. Weka provides such functions as data processing, feature selection, classification, regression, clustering, association rules, and visualization. Pentaho is one of the most popular open-source BI software packages. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, data mining, and all other aspects of BI. Weka's data processing algorithms are also integrated in Pentaho and can be directly called.

6 Big data applications

In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful value via judgments, suggestions, supports, or decisions. However, data analysis involves a wide range of applications, which frequently change and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields, including structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

6.1 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically includes large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have emerged over the past decades. For example, as early as the 1990s, BI had become a prevailing technology for business applications, and network search engines based on massive data mining emerged in the early 21st century. Some potential and influential applications from different fields, together with their data and analysis characteristics, are discussed as follows.

– Evolution of Commercial Applications: The earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. Analytical techniques used in such systems were prevalent in the 1990s and were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, score cards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have been providing a unique opportunity for organizations to have an online presence and directly interact with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted by text analysis and website mining techniques. As reported in [115], the quantity of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the Internet of Things based on sensors are opening a new generation of innovative applications and require considerably larger capacity to support location sensing, people-oriented, and context-aware operation.

– Evolution of Network Applications: The early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to the mining of email contents and the building of search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages, full of various kinds of data such as text, images, audio, videos, and interactive content. Therefore, plenty of advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share content.

– Evolution of Scientific Applications: Scientific research in many fields, such as astrophysics, oceanology, genomics, and environmental research, is acquiring massive data with high-throughput sensors and instruments. The US National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, inter-operative analysis software, and data services to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets have high variety in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and it is not easy to cover it comprehensively, we will focus on the key problems and technologies in data analysis in the following discussions.

6.2 Big data analysis fields


6.2.1 Structured data analysis

Business applications and scientific research may generate massive structured data, of which the management and analysis rely on mature commercialized technologies, such as RDBMS, data warehouse, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning based on exact mathematical models and powerful algorithms has been applied to anomaly detection [117] and energy control [118]. Exploiting data characteristics, time and space mining can extract knowledge structures hidden in high-speed data flows and sensor data [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially in process analysis with event data [121].

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to have more business potential than structured data analysis. Generally, text analysis is a process to extract useful information and knowledge from unstructured text. Text mining is inter-disciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and data mining in particular. Most text mining systems are based on text representations and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammars [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.
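Before techniques such as classification or clustering can be applied, unstructured text is usually turned into a numerical representation. The sketch below shows one common choice, TF-IDF weighting over whitespace tokens; the survey does not prescribe this scheme, and the tiny corpus and the function name `tf_idf` are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    weight = (relative term frequency) * log(N / document frequency)
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights


# Hypothetical three-document corpus.
docs = [
    "big data analysis",
    "text data mining",
    "web data mining",
]
vectors = tf_idf(docs)
```

A term that occurs in every document, such as "data" here, receives weight zero, while a term unique to one document receives the highest weight, which is what makes the vectors useful for distinguishing documents.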

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, image, audio, video, code, metadata, and hyperlink. The research on image, audio, and video mining has recently been called multimedia analysis, which will be discussed in Section 6.2.4. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves the mining of semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., email, newsgroup management, and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to simulate and integrate data on the Web so as to conduct more complex queries than keyword-based searches.

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagram of links within a website or among multiple websites. Models are built based on the topological structures given by hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of the models to look up relevant website pages. The topic-oriented crawler is another successful case of utilizing the models [127].
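As an illustration of how a link-structure model can rank pages, the following is a minimal power-iteration sketch of PageRank on a toy link graph. The damping factor of 0.85, the fixed iteration count, the handling of pages without out-links, and the four-page site are all assumptions of this example rather than details given in the survey.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over {page: [pages it links to]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small base share (random jump).
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its damped rank equally to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without out-links spreads its rank over all pages.
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page site.
links = {
    "home": ["docs", "blog"],
    "docs": ["home"],
    "blog": ["home", "docs"],
    "orphan": ["home"],
}
ranks = pagerank(links)
```

In this toy graph "home" is linked from every other page and therefore ends up with the highest rank, while "orphan", which no page links to, keeps only the base share.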

Web usage mining aims to mine auxiliary data generated by Web dialogues or activities, whereas Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or trades, caches, user queries, bookmark data, mouse clicks and scrolls, and any other kinds of data generated through interaction with the Web. As Web services and Web 2.0 become mature and popular, Web usage data will have increasingly high variety. Web usage mining plays key roles in personalized space, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.


6.2.4 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) has been growing at an amazing speed, and multimedia analysis aims to extract useful knowledge from it and understand its semantics. Because multimedia data is heterogeneous and most of it contains richer information than simple structured data or text data, information extraction is confronted with the huge challenge of the semantic gap. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection.

Audio summarization can be accomplished by extracting the prominent words or phrases from metadata or by synthesizing a new representation. Video summarization is to interpret the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied to many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video and apply smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.

Multimedia annotation inserts labels to describe the contents of images and videos at both the syntax and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time and labor intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic gap. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to jointly explore both manual and automatic multimedia annotation [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users to conveniently and quickly look up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, and includes shot boundary detection, key frame extraction, and scene segmentation. Based on the result of structural analysis, the second procedure is feature extraction, which mainly includes further mining the features of key frames, objects, texts, and movements, which are the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find patterns in video contents and put videos into predefined categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos. The retrieval results are then refined through relevance feedback.
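The query-and-retrieval step can be sketched as a nearest-neighbor search over extracted feature vectors. The example below uses cosine similarity as the similarity measure; the survey does not name a specific measure, and the three-dimensional feature vectors, video identifiers, and function names are hypothetical.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(index, query_vector, top_k=2):
    """Return the ids of the top_k indexed videos most similar to the query."""
    scored = [(cosine_similarity(vec, query_vector), vid) for vid, vec in index.items()]
    scored.sort(reverse=True)
    return [vid for _, vid in scored[:top_k]]


# Hypothetical index: video id -> feature vector produced by feature extraction.
index = {
    "news_clip": [0.9, 0.1, 0.0],
    "soccer_match": [0.1, 0.8, 0.3],
    "tennis_final": [0.0, 0.7, 0.6],
}
results = retrieve(index, [0.0, 1.0, 0.4])
```

A production system would replace the linear scan with an index structure and would adjust the ranking using feedback from the user, but the core comparison between query features and stored features is the same.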

Multimedia recommendation is to recommend specific multimedia contents according to users' preferences. It has proven to be an effective approach to providing personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. Content-based methods identify general features of users or of the content they are interested in, and recommend to users other content with similar features. These methods largely rely on content similarity measurement, but most of them are troubled by limited content analysis and over-specialization. Collaborative-filtering-based methods identify groups with similar interests and recommend contents to group members according to their behavior [132]. Presently, a mixed method has been introduced, which integrates the advantages of the aforementioned two types of methods to improve recommendation quality [133].
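The collaborative-filtering idea can be illustrated with a minimal user-based recommender: users with similar rating histories are found, and items they rated that the target user has not seen are scored by similarity-weighted ratings. This is a simplified sketch, not the method of [132] or [133]; the rating data, user names, and function names are invented for the example.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two sparse {item: rating} dicts."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(r * r for r in u.values()))
    norm_v = sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)


def recommend(ratings, user, top_k=1):
    """Rank items the user has not rated by similarity-weighted ratings of others."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical ratings on a 1-5 scale.
ratings = {
    "alice": {"clip_a": 5, "clip_b": 4},
    "bob": {"clip_a": 5, "clip_b": 5, "clip_c": 4},
    "carol": {"clip_d": 5, "clip_b": 1},
}
suggestion = recommend(ratings, "alice")
```

Because the preferences of "bob" closely match those of "alice", his rating of "clip_c" outweighs the rating "carol" gave to "clip_d", even though the latter is numerically higher.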

The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains some text descriptions related to concepts and video examples [134]. In [135], the author proposed a new algorithm for special multimedia event detection using a few positive training examples. The research on video event detection is still in its infancy and mainly focuses on sports or news events, running or abnormal events in monitoring videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graph structures describing the communications between two entities. The content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. In accordance with the data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

The research on link-based structural analysis has always been committed to link prediction, community discovery, social network evolution, and social influence analysis. A social networking service (SNS) may be visualized as a graph in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertices and edges are continually added to the graphs. Link prediction is to predict the possibility of a future connection between two vertices. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and utilizes the existing link information to generate binary classifiers that predict future links [139]. Probabilistic methods aim to build models for connection probabilities among vertices in SNS [140]. Linear algebra methods compute the similarity between two vertices from a rank-reduced similarity matrix [141]. A community is represented by a sub-graph in which the edges connecting vertices within the sub-graph have high density, while the edges between two sub-graphs have much lower density [142].
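A very simple form of link prediction scores each pair of unconnected vertices by how much their neighborhoods overlap. The sketch below uses the Jaccard coefficient of the neighbor sets; it is a neighborhood heuristic chosen for brevity (such a score could also serve as one input feature for a classifier), not one of the specific methods cited above, and the small friendship graph is hypothetical.

```python
from itertools import combinations


def jaccard_scores(edges):
    """Score every unconnected vertex pair by the Jaccard overlap of neighbor sets."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    scores = {}
    for u, v in combinations(sorted(neighbors), 2):
        if v in neighbors[u]:
            continue  # the pair is already linked
        union = neighbors[u] | neighbors[v]
        scores[(u, v)] = len(neighbors[u] & neighbors[v]) / len(union)
    return scores


# Hypothetical undirected friendship graph.
edges = [
    ("ann", "bob"), ("ann", "cat"), ("bob", "cat"),
    ("bob", "dan"), ("cat", "dan"), ("dan", "eve"),
]
scores = jaccard_scores(edges)
best_pair = max(scores, key=scores.get)
```

Here "ann" and "dan" share two of their three distinct neighbors, so this heuristic rates them as the most likely future link.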

Many methods for community detection have been proposed and studied, most of which are topology-based and rely on an objective function that captures the concept of community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. Research on social network evolution aims to find laws and derivation models that explain network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generative methods have been proposed to assist network and system design [147].

Social influence refers to the case when individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, time effects, and characteristics of networks and individuals. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise. For example, the blogosphere contains a large number of spam blogs, and Twitter likewise contains many trivial tweets. Third, SNS are dynamic networks, which vary and are updated frequently and quickly. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge among different media [150].

6.2.6 Mobile data analysis

By April 2013, Android Apps offered more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since the research on mobile analysis has just started, we will only introduce some recent and representative analysis applications in this section.

With the growing number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities with geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest Webchat). Traditional network communities or SNS communities are short of online interaction among members, and the communities are active only when members are sitting before computers. On the contrary, mobile phones can support rich interaction at any time and anywhere. A mobile community is defined as a group of individuals with the same hobbies (i.e., health, safety, and entertainment, etc.) who gather together on networks, meet to set a common goal, decide measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, the progress in wireless sensors, mobile communication technology, and stream processing enables people to build a body area network for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations, etc. In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop an application for smart phones, which analyzes people's gait when they walk and uses the gait information for unlocking the safety system [11]. In the meanwhile, Robert Delano and Brian Parise from Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with a built-in seismograph in a mobile phone, so as to cope with Parkinson's disease and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data application. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, in marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business modes. In sales planning, after comparison of massive data, enterprises can optimize their commodity prices. In operation, enterprises can improve their operation efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor cost. In the supply chain, using big data, enterprises may conduct inventory optimization, logistic optimization, and supplier coordination, etc., to mitigate the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has been rapidly developed. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that such activities as “Multi-times score accumulation” and “score exchange in shops” are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers' transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.
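The targeting step described above can be sketched as follows, assuming that a drop-out warning model has already produced a risk score per customer. The scores, customer IDs, and the function name `top_fraction_at_risk` are hypothetical; the sketch only shows the ranking and cut-off, not how CMB's actual model works.

```python
def top_fraction_at_risk(churn_scores, fraction=0.2):
    """Return the customer IDs in the top `fraction` by predicted drop-out score.

    churn_scores: dict mapping customer ID to a model score (higher = riskier).
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Number of customers to target: nearest integer, but at least one.
    k = max(1, round(len(churn_scores) * fraction))
    ranked = sorted(churn_scores, key=churn_scores.get, reverse=True)
    return ranked[:k]

# Hypothetical scores for ten customers.
scores = {
    "c01": 0.91, "c02": 0.12, "c03": 0.33, "c04": 0.05, "c05": 0.48,
    "c06": 0.27, "c07": 0.85, "c08": 0.19, "c09": 0.61, "c10": 0.08,
}
targets = top_fraction_at_risk(scores)  # customers to offer retention products
```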

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction time, commodity prices, and purchase quantities are recorded, along with, more importantly, the age, gender, address, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, market conditions of their brands, and consumers' behaviors, etc., and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan of Alibaba automatically analyzes and judges whether to lend loans to enterprises through the acquired enterprise transaction data by virtue of big data technology, while manual intervention does not occur in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan with only about 0.3 % bad loans, which is greatly lower than those of other commercial banks.

6.3.2 Application of IoT based big data

IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistic enterprises may have profoundly experienced the application of IoT big data. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so the Headquarter can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project cooperation between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and Miami city, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills this year by timely identifying and fixing water pipes that were running and leaking.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and connections among individuals based on an information network. Big data of online SNS mainly comes from instant messages, online social, micro blog, and shared space, etc., which represents various user activities. The analysis of big data from online SNS uses computational analytical methods for understanding relations in the human society by virtue of theories and methods from mathematics, informatics, sociology, and management science, etc., along three dimensions: network structure, group interaction, and information spreading. The application includes network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education, etc. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire values.

– Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preference, emotion, interest, and demand, etc. may be revealed.

– Structure-based Applications: In SNS, users are represented as nodes, while social relation, interest, and hobbies, etc. aggregate relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. The community-based analysis is of vital importance to improve information propagation and for interpersonal relation analysis.

The US Santa Cruz Police Department experimented by applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

Fig. 5 Enabling technologies for online social network-oriented big data

In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, it was found that most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and then get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted research that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loan. The goal is to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting the sharp growth or drop of the amount of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation from the official statistics of Indonesia matches the number of Tweets about rice price on Twitter, as shown in Fig. 6.
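Two of the analyses listed above, detecting a sharp growth or drop in topic volume and comparing Tweet counts with an official indicator, can be sketched with elementary statistics. This is a minimal illustration under stated assumptions: the trailing-window z-score rule, the window length, the threshold, and all sample numbers are hypothetical and are not taken from the Global Pulse project.

```python
from statistics import mean, pstdev

def detect_spikes(counts, window=7, threshold=3.0):
    """Return indices whose count deviates sharply from the trailing window.

    A point is flagged when it lies more than `threshold` standard
    deviations away from the mean of the previous `window` points.
    """
    flagged = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu, sigma = mean(history), pstdev(history)
        if sigma == 0:
            # Flat history: any change at all counts as a deviation.
            if counts[i] != mu:
                flagged.append(i)
        elif abs(counts[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)

# Hypothetical daily Tweet counts on one topic: a surge at index 9, a drop at 17.
daily = [100, 104, 98, 101, 97, 103, 99, 102, 100, 250,
         101, 99, 100, 102, 98, 101, 100, 40]
spikes = detect_spikes(daily)

# Hypothetical monthly Tweet volume vs. an official inflation figure (percent).
tweets = [120, 150, 170, 160, 210, 260]
inflation = [4.1, 4.6, 5.0, 4.9, 5.8, 6.5]
r = pearson(tweets, inflation)
```

Note that a trailing window stays contaminated for `window` points after a spike, so a second anomaly occurring within that period may be missed; a robust estimator such as the median would be one way to reduce this effect.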

Generally speaking, the application of big data from online SNS may help to better understand users' behavior and master the laws of social and economic activities from the following three aspects:

– Early Warning: to rapidly cope with crisis, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotion, and preference of users.

– Real-time Feedback: to acquire groups' feedback on some social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has great potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the health care business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final result into an extremely personalized treatment plan to assess the dangerous factors and main treatment plans of patients. Then, doctors may reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or by suggesting that patients reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20.

The Mount Sinai Medical Center in the US utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with a third-party application with the software development kit (SDK) and open interface.

Fig. 6 The correlation between Tweets about rice price and food price inflation

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units to conduct coordination with mobile networks for distribution of sensed tasks and collection and utilization of sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of Crowdsourcing has been successfully applied to geotagged photograph, positioning and navigation, urban road traffic sensing, market forecast, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach for problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, Crowdsourcing has been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of Crowdsourcing. The main idea of Crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, Crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, Spatial Crowdsourcing becomes a hot topic. The operation framework of Spatial Crowdsourcing is as follows. A user may request the service and resources related to a specified location. Then the mobile users who are willing to participate in the task will move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data will be sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions provided by mobile devices, it can be forecasted that Spatial Crowdsourcing will be more prevailing than traditional Crowdsourcing, e.g., Amazon Turk and Crowdflower.
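The request-and-assign step of this framework can be sketched as below. The survey does not specify how participants are chosen; picking the nearest willing worker within a travel radius is an assumption made for illustration, and the `Worker` type, `assign_task` function, coordinates, and radius are all hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class Worker:
    worker_id: str
    lat: float
    lon: float
    willing: bool  # whether the mobile user opted in to this task

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

def assign_task(task_lat, task_lon, workers, max_km=5.0):
    """Pick the nearest willing worker within max_km, or None if there is none."""
    candidates = []
    for w in workers:
        if not w.willing:
            continue
        d = haversine_km(task_lat, task_lon, w.lat, w.lon)
        if d <= max_km:
            candidates.append((d, w))
    if not candidates:
        return None
    return min(candidates, key=lambda pair: pair[0])[1]

# Hypothetical request for data at one location and three nearby mobile users.
workers = [
    Worker("w1", 30.521, 114.312, willing=False),  # closest, but not participating
    Worker("w2", 30.530, 114.320, willing=True),
    Worker("w3", 30.600, 114.400, willing=True),   # willing, but too far away
]
chosen = assign_task(30.520, 114.310, workers)
```

After assignment, the chosen worker would travel to the location, capture the requested data, and return it to the requester; those steps are outside the scope of this sketch.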

6.3.6 Smart grid

Smart Grid is the next generation power grid constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart Grid related big data are generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, which are measured by phasor measurement units (PMU) deployed nation-wide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as Circuit Breaker Monitors and transformers). Smart Grid brings about the following challenges on exploiting big data:

– Grid planning: By analyzing data in the Smart Grid, the regions that have excessively high electrical load or high power outage frequencies can be identified. Even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from University of California, Los Angeles designed an “electric map” according to the big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of the block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides effective and visual load forecast for power grid planning in a city. Preferential transformation on the power grid facilities in blocks with high power outage frequencies and serious overloads may be conducted, as displayed in the map.

– Interaction between power generation and power consumption: An ideal power grid shall balance power generation and consumption. However, the traditional power grid is constructed based on the one-directional approach of transmission-transformation-distribution-consumption, which does not allow adjusting the generation capacity according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters are developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which can help the supplier read power utilization data every 15 min rather than every month as in the past. Labor cost for meter reading is greatly reduced. Because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and low periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help the realization of time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users.

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable new energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortage. Such energy resources can complement the traditional hydropower and thermal power generations.
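The time-sharing dynamic pricing mentioned in the list above becomes possible once consumption is read every 15 min. The following minimal sketch bills one day of such readings under a two-tier tariff. The peak window, the prices, and the load profiles are hypothetical values chosen for illustration; they do not describe TXU Energy's actual tariff.

```python
def time_of_use_bill(readings_kwh, peak_hours=range(17, 21),
                     peak_price=0.30, off_peak_price=0.10):
    """Bill one day of 15-minute smart-meter readings under a two-tier tariff.

    readings_kwh: 96 values, the energy used in each 15-minute interval,
    starting at 00:00. Prices are per kWh (hypothetical currency units).
    """
    if len(readings_kwh) != 96:
        raise ValueError("expected 96 fifteen-minute readings for one day")
    total = 0.0
    for slot, kwh in enumerate(readings_kwh):
        hour = slot // 4  # four 15-minute slots per hour
        price = peak_price if hour in peak_hours else off_peak_price
        total += kwh * price
    return total

# Flat profile: 0.25 kWh in every interval, 24 kWh over the day.
flat = [0.25] * 96

# Same daily energy, but half of the 17:00-21:00 load moved to 00:00-04:00.
shifted = list(flat)
for slot in range(17 * 4, 21 * 4):
    shifted[slot] = 0.125
for slot in range(0, 4 * 4):
    shifted[slot] = 0.375

bill_flat = time_of_use_bill(flat)
bill_shifted = time_of_use_bill(shifted)
```

With these assumed prices, the consumer who shifts load away from the peak window pays less for the same energy, which is the incentive that lets a supplier flatten peak and low fluctuations.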

7 Conclusion, open issues, and outlook

In this paper, we review the background and state-of-the-art of big data. Firstly, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big-picture to readers of this exciting area.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, which are discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

– Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to measure the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated when the system is implemented and deployed, so the advantages and disadvantages of various alternative solutions cannot be compared horizontally, even before and after the implementation of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes: This includes memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers the advances of algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

The big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data model, and software systems supporting big data, etc. should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor which restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more values.

– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor to improve big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems on big data processing arise from the traditional data analysis, including (i) data re-utilization: with the increase of data scale, more values may be mined from re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized, from which more value can be mined; (iii) data exhaust, which means wrong data during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems should be solved:

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from comprehensive utilization of multiple datasets is far higher than the sum value of individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance is the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and from different datasets.

– Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction of big data, etc. are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume is fast growing, there are more severe safety risks, while the traditional data protection methods have already been shown not applicable to big data. In particular, big data safety is confronted with the following security related challenges:

– Big data privacy: Big data privacy includes two aspects: (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc. of users may be more easily acquired, and users may not be aware. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed as a big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher of Skull Security, used an information acquisition tool to acquire data in the public pages of Facebook users who failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low quality data wastes transmission and storage resources with poor usability. There are a lot of factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though a lot of measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.

– Big data safety mechanism: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data could not meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safety communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data application in information security: Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APT (Advanced Persistent Threat) after analysis of big data in the form of log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc. may also be more easily identified through the analysis of big data.
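Two of the data quality dimensions named in the list above, completeness and redundancy, lend themselves to simple automatic checks. The sketch below is one possible way to quantify them for a batch of records; the metric definitions, function names, and sample sensor records are assumptions made for illustration and are not a standard proposed by the survey.

```python
def completeness(records, required_fields):
    """Fraction of required field values that are present (not None or empty)."""
    total = len(records) * len(required_fields)
    if total == 0:
        return 1.0
    present = sum(
        1
        for record in records
        for field in required_fields
        if record.get(field) not in (None, "")
    )
    return present / total

def redundancy(records, key_fields):
    """Fraction of records that duplicate an earlier record on key_fields."""
    if not records:
        return 0.0
    seen = set()
    duplicates = 0
    for record in records:
        key = tuple(record.get(field) for field in key_fields)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates / len(records)

# Hypothetical sensor readings: one missing value and one exact duplicate.
readings = [
    {"id": 1, "temp": 21.5},
    {"id": 2, "temp": None},
    {"id": 1, "temp": 21.5},
    {"id": 3, "temp": 19.0},
]
score_complete = completeness(readings, ["id", "temp"])
score_redundant = redundancy(readings, ["id", "temp"])
```

Accuracy and consistency usually need reference data or cross-field rules, so they are left out of this sketch.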

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, energy-efficient distributed storage, and hardware and software system architectures for processing, etc. In particular, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.
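The log-analysis opportunity mentioned under information security above can be sketched in a few lines. The example below is a toy heuristic and not a real APT detector: it assumes a hypothetical whitespace-separated log format (`date source_address signature`) and flags sources that raise alerts on several distinct days, a crude proxy for persistence. The function name, the format, and the threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def persistent_sources(log_lines: Iterable[str], min_days: int = 3) -> List[str]:
    """Return sources that raised alerts on at least min_days distinct dates."""
    days_by_source: Dict[str, Set[str]] = defaultdict(set)
    for line in log_lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip lines that do not match the assumed format
        date, source = parts[0], parts[1]
        days_by_source[source].add(date)
    return sorted(s for s, days in days_by_source.items() if len(days) >= min_days)


# Hypothetical IDS alert log; addresses and signatures are made up.
sample_log = [
    "2014-01-01 10.0.0.5 port-scan",
    "2014-01-01 10.0.0.9 port-scan",
    "2014-01-02 10.0.0.5 login-failure",
    "2014-01-02 10.0.0.5 login-failure",
    "2014-01-03 10.0.0.5 data-exfil-suspect",
    "garbled line",
]

print(persistent_sources(sample_log))  # ['10.0.0.5']
```

At big data scale the same grouping would run as a distributed aggregation over the log store rather than in a single process, but the logic stays the same.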

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the “T” (Technology) was the main concern, and technology drove the development of data. In the big data era, with the prominence of data value and advances in “I” (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact but will also influence everyone's ways of living and thinking, which is already happening. We cannot predict the future, but we may take precautions for events that may occur.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have sought better ways to cope with larger-scale, more diverse, and more complexly structured data. These efforts are represented by Google's globally distributed database Spanner and its fault-tolerant, expandable, distributed relational database F1. In the future, big data storage technology will employ distributed databases, support transaction mechanisms similar to those of relational databases, and handle data effectively through grammars similar to SQL.

– Data resource performance: Since big data contains huge value, mastering big data means mastering resources. Analysis of the big data value chain shows that its value comes from the data itself, technologies, and ideas, and that the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting out and assigning the rights to use their data.

– Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data requires exploring innovative technologies and methods for data acquisition, storage, processing, analysis, information security, etc. The impacts of big data on production management, business operation, decision making, etc. should then be examined for modern enterprises from the management perspective. Moreover, applying big data to specific fields requires the participation of interdisciplinary talent.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., in text and image editors. In big data applications, mixed data is very useful for decision making. Analytical results can be used effectively only when they are displayed in a user-friendly way. Reports, histograms, pie charts, regression curves, etc. are frequently used to visualize the results of data analysis. New presentation forms will emerge in the future; e.g., Microsoft Renlifang, a social search engine, uses relational diagrams to express interpersonal relationships.

– Data-oriented: It is well known that programs consist of data structures and algorithms, and that data structures are used to store data. The history of program design shows that the role of data has become increasingly significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed that of business logic. Consequently, programs are gradually shifting from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods will emerge and will have a far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking: Big data and its analysis will gradually and profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

– During data analysis, we will try to utilize all data rather than analyzing only a small set of sample data.

– Rather than insisting on accurate data, we will be willing to accept numerous and complicated data.

– We shall pay greater attention to correlations between things rather than exploring causal relationships.

– Simple algorithms on big data are more effective than complex algorithms on small data.

– Analytical results of big data will reduce hasty and subjective factors in decision making, and data scientists will replace “experts.”
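The first point in the list above, using all data rather than a small sample, can be illustrated with a deliberately simple sketch. The data below is synthetic and the threshold is arbitrary; both are assumptions chosen so that the effect is easy to see. The example shows one way a small sample can miss a rare event that the full dataset reveals.

```python
from typing import Sequence


def event_rate(values: Sequence[int], threshold: int) -> float:
    """Fraction of values strictly greater than threshold."""
    return sum(1 for v in values if v > threshold) / len(values)


# Synthetic "population": 100,000 readings; only the 10 largest count as rare events.
population = list(range(1, 100_001))
threshold = 99_990

# A systematic sample of 100 readings (every 1,000th value).
sample = population[::1000]

print(event_rate(population, threshold))  # 0.0001
print(event_rate(sample, threshold))      # 0.0
```

The sample contains no rare events at all, so any analysis built on it would conclude that they never occur. A well-designed random sample can still estimate common quantities accurately; the sketch only concerns events that are rare relative to the sample size.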

Throughout the history of human society, the demands and willingness of human beings have always been the driving force of scientific and technological progress. Big data may provide reference answers for human decision making through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for it. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities and based on big data will attract increasing attention and will certainly cause enormous transformations of social activities in the future.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

                      References

1. Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

2. Fact sheet: Big data across the federal government (2012). http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf

3. Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper

4. Drowning in numbers - digital data will flood the planet - and help us understand it better (2011). http://www.economist.com/blogs/dailychart/2011/11/bigdata-0

5. Lohr S (2012) The age of big data. New York Times, pp 11

6. Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/thedigitalbreadcrumbs-that-lead-to-big-data

7. Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-searchforanalysts-to-make-sense-of-big-data

8. Big data (2008). http://www.nature.com/news/specials/bigdata/index.html

9. Special online collection: dealing with big data (2011). http://www.sciencemag.org/site/special/data

10. Manyika J, McKinsey Global Institute, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute

11. Mayer-Schönberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Eamon Dolan/Houghton Mifflin Harcourt

12. Laney D (2001) 3-d data management: controlling data volume, velocity and variety. META Group Research Note, 6 February

13. Zikopoulos P, Eaton C, et al (2011) Understanding big data: analytics for enterprise class hadoop and streaming data. McGraw-Hill Osborne Media

14. Meijer E (2011) The world according to linq. Communications of the ACM 54(10):45–51

15. Beyer M (2011) Gartner says solving big data challenge involves more than just managing volumes of data. Gartner. http://www.gartner.com/it/page.jsp

16. O R Team (2011) Big data now: current perspectives from O'Reilly Radar. O'Reilly Media

17. Grobelnik M (2012) Big data tutorial. http://videolectures.net/eswc2012grobelnikbigdata

18. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2008) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014

19. DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6):85–98


20. Walter T (2009) Teradata past, present, and future. UCI ISG lecture series on scalable data management

21. Ghemawat S, Gobioff H, Leung S-T (2003) The google file system. In: ACM SIGOPS Operating Systems Review, vol 37. ACM, pp 29–43

22. Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

23. Hey AJG, Tansley S, Tolle KM, et al (2009) The fourth paradigm: data-intensive scientific discovery

24. Howard JH, Kazar ML, Menees SG, Nichols DA, Satyanarayanan M, Sidebotham RN, West MJ (1988) Scale and performance in a distributed file system. ACM Trans Comput Syst (TOCS) 6(1):51–81

25. Cattell R (2011) Scalable sql and nosql data stores. ACM SIGMOD Record 39(4):12–27

26. Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033

27. Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98

28. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J, et al (2012) Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States

29. Sun Y, Chen M, Liu B, Mao S (2013) Far: a fault-avoidant routing method for data center networks with regular topology. In: Proceedings of ACM/IEEE symposium on architectures for networking and communications systems (ANCS'13). ACM

30. Wiki (2013) Applications and organizations using hadoop. http://wiki.apache.org/hadoop/PoweredBy

31. Bahga A, Madisetti VK (2012) Analyzing massive machine maintenance data in a computing cloud. IEEE Transac Parallel Distrib Syst 23(10):1831–1843

32. Gunarathne T, Wu T-L, Choi JY, Bae S-H, Qiu J (2011) Cloud computing paradigms for pleasingly parallel biomedical applications. Concurr Comput Prac Experience 23(17):2338–2354

33. Gantz J, Reinsel D (2010) The digital universe decade-are you ready. External publication of IDC (Analyse the Future) information and data, pp 1–16

34. Bryant RE (2011) Data-intensive scalable computing for scientific applications. Comput Sci Eng 13(6):25–33

35. Wahab MHA, Mohd MNH, Hanafi HF, Mohsin MFM (2008) Data pre-processing on web server logs for generalized association rules mining algorithm. World Acad Sci Eng Technol 48:2008

36. Nanopoulos A, Manolopoulos Y, Zakrzewicz M, Morzy T (2002) Indexing web access-logs for pattern queries. In: Proceedings of the 4th international workshop on web information and data management. ACM, pp 63–68

37. Joshi KP, Joshi A, Yesha Y (2003) On using a warehouse to analyze web logs. Distrib Parallel Databases 13(2):161–180

38. Chandramohan V, Christensen K (2002) A first look at wired sensor networks for video surveillance systems. In: Proceedings LCN 2002, 27th annual IEEE conference on local computer networks. IEEE, pp 728–729

39. Selavo L, Wood A, Cao Q, Sookoor T, Liu H, Srinivasan A, Wu Y, Kang W, Stankovic J, Young D, et al (2007) Luster: wireless sensor network for environmental research. In: Proceedings of the 5th international conference on Embedded networked sensor systems. ACM, pp 103–116

40. Barrenetxea G, Ingelrest F, Schaefer G, Vetterli M, Couach O, Parlange M (2008) Sensorscope: out-of-the-box environmental monitoring. In: Information processing in sensor networks, 2008, international conference on IPSN'08. IEEE, pp 332–343

41. Kim Y, Schmid T, Charbiwala ZM, Friedman J, Srivastava MB (2008) Nawms: nonintrusive autonomous water monitoring system. In: Proceedings of the 6th ACM conference on Embedded network sensor systems. ACM, pp 309–322

42. Kim S, Pakzad S, Culler D, Demmel J, Fenves G, Glaser S, Turon M (2007) Health monitoring of civil infrastructures using wireless sensor networks. In: Information Processing in Sensor Networks 2007, 6th International Symposium on IPSN 2007. IEEE, pp 254–263

43. Ceriotti M, Mottola L, Picco GP, Murphy AL, Guna S, Corra M, Pozzi M, Zonta D, Zanon P (2009) Monitoring heritage buildings with wireless sensor networks: the torre aquila deployment. In: Proceedings of the 2009 International Conference on Information Processing in Sensor Networks. IEEE Computer Society, pp 277–288

44. Tolle G, Polastre J, Szewczyk R, Culler D, Turner N, Tu K, Burgess S, Dawson T, Buonadonna P, Gay D, et al (2005) A macroscope in the redwoods. In: Proceedings of the 3rd international conference on embedded networked sensor systems. ACM, pp 51–63

45. Wang F, Liu J (2011) Networked wireless sensor data collection: issues, challenges, and approaches. IEEE Commun Surv Tutor 13(4):673–687

46. Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the 11th international conference on World Wide Web. ACM, pp 124–135

47. Choudhary S, Dincturk ME, Mirtaheri SM, Moosavi A, von Bochmann G, Jourdan G-V, Onut I-V (2012) Crawling rich internet applications: the state of the art. In: CASCON, pp 146–160

48. Ghani N, Dixit S, Wang T-S (2000) On ip-over-wdm integration. IEEE Commun Mag 38(3):72–84

49. Manchester J, Anderson J, Doshi B, Dravida S (1998) Ip over sonet. IEEE Commun Mag 36(5):136–142

50. Jinno M, Takara H, Kozicki B (2009) Dynamic optical mesh networks: drivers, challenges and solutions for the future. In: Optical communication, 2009, 35th European conference on ECOC'09. IEEE, pp 1–4

51. Barroso LA, Hölzle U (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synt Lect Comput Archit 4(1):1–108

52. Armstrong J (2009) Ofdm for optical communications. J Light Technol 27(3):189–204

53. Shieh W (2011) Ofdm for flexible high-speed optical networks. J Light Technol 29(10):1560–1577

54. Cisco data center interconnect design and deployment guide (2010)

55. Greenberg A, Hamilton JR, Jain N, Kandula S, Kim C, Lahiri P, Maltz DA, Patel P, Sengupta S (2009) Vl2: a scalable and flexible data center network. In: ACM SIGCOMM computer communication review, vol 39. ACM, pp 51–62

56. Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S (2009) Bcube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 39(4):63–74

57. Farrington N, Porter G, Radhakrishnan S, Bazzaz HH, Subramanya V, Fainman Y, Papen G, Vahdat A (2011) Helios: a hybrid electrical/optical switch architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 41(4):339–350

58. Abu-Libdeh H, Costa P, Rowstron A, O'Shea G, Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Comput Commun Rev 40(4):51–62

59. Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what's needed for datacenter network operations. IEEE Commun Mag 48(7):32–39


60. Wang G, Andersen DG, Kaminsky M, Papagiannaki K, Ng TS, Kozuch M, Ryan M (2010) c-through: Part-time optics in data centers. In: ACM SIGCOMM Computer Communication Review, vol 40. ACM, pp 327–338

61. Ye X, Yin Y, Yoo SJB, Mejia P, Proietti R, Akella V (2010) Dos: a scalable optical switch for datacenters. In: Proceedings of the 6th ACM/IEEE symposium on architectures for networking and communications systems. ACM, p 24

62. Singla A, Singh A, Ramachandran K, Xu L, Zhang Y (2010) Proteus: a topology malleable data center network. In: Proceedings of the 9th ACM SIGCOMM workshop on hot topics in networks. ACM, p 8

63. Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P (2011) Energy-efficient design of a scalable optical multiplane interconnection architecture. IEEE J Sel Top Quantum Electron 17(2):377–383

64. Kodi AK, Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance computing (hpc) systems. IEEE J Sel Top Quantum Electron 17(2):384–395

65. Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H (2012) Mirror mirror on the ceiling: flexible wireless links for data centers. ACM SIGCOMM Comput Commun Rev 42(4):443–454

66. Lenzerini M (2002) Data integration: a theoretical perspective. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. ACM, pp 233–246

67. Cafarella MJ, Halevy A, Khoussainova N (2009) Data integration for the relational web. Proc VLDB Endowment 2(1):1090–1101

68. Maletic JI, Marcus A (2000) Data cleansing: beyond integrity analysis. In: IQ. Citeseer, pp 200–209

69. Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1-2):83–113

70. Chen H, Ku W-S, Wang H, Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 51–62

71. Zhao Z, Ng W (2012) A model-based approach for rfid data stream cleansing. In: Proceedings of the 21st ACM international conference on information and knowledge management. ACM, pp 862–871

72. Khoussainova N, Balazinska M, Suciu D (2008) Probabilistic event extraction from rfid data. In: Data Engineering, 2008. IEEE 24th international conference on ICDE 2008. IEEE, pp 1480–1482

73. Herbert KG, Wang JTL (2007) Biological data cleaning: a case study. Int J Inf Qual 1(1):60–82

74. Tsai T-H, Lin C-Y (2012) Exploring contextual redundancy in improving object-based video coding for video sensor networks surveillance. IEEE Transac Multmed 14(3):669–682

75. Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 269–278

76. Kamath U, Compton J, Dogan RI, Jong KD, Shehu A (2012) An evolutionary algorithm approach for feature generation from sequence data and its application to dna splice site prediction. IEEE/ACM Transac Comput Biol Bioinforma (TCBB) 9(5):1387–1398

77. Leung K-S, Lee KH, Wang J-F, Ng EYT, Chan HLY, Tsui SKW, Mok TSK, Tse PC-H, Sung JJ-Y (2011) Data mining on dna sequences of hepatitis b virus. IEEE/ACM Transac Comput Biol Bioinforma 8(2):428–440

78. Huang Z, Shen H, Liu J, Zhou X (2011) Effective data co-reduction for multimedia similarity search. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, pp 1021–1032

79. Bleiholder J, Naumann F (2008) Data fusion. ACM Comput Surv (CSUR) 41(1):1

80. Brewer EA (2000) Towards robust distributed systems. In: PODC, p 7

81. Gilbert S, Lynch N (2002) Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59

82. McKusick MK, Quinlan S (2009) Gfs: evolution on fast-forward. ACM Queue 7(7):10

83. Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J (2008) Scope: easy and efficient parallel processing of massive data sets. Proc VLDB Endowment 1(2):1265–1276

84. Beaver D, Kumar S, Li HC, Sobel J, Vajgel P, et al (2010) Finding a needle in haystack: facebook's photo storage. In: OSDI, vol 10, pp 1–8

85. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: amazon's highly available key-value store. In: SOSP, vol 7, pp 205–220

86. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In: Proceedings of the twenty-ninth annual ACM symposium on theory of computing. ACM, pp 654–663

87. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2):4

88. Burrows M (2006) The chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th symposium on Operating systems design and implementation. USENIX Association, pp 335–350

89. Lakshman A, Malik P (2009) Cassandra: structured storage system on a p2p network. In: Proceedings of the 28th ACM symposium on principles of distributed computing. ACM, pp 5–5

90. George L (2011) HBase: the definitive guide. O'Reilly Media Inc

91. Judd D (2008) hypertable-0.9.0.4-alpha

92. Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc

93. Crockford D (2006) The application/json media type for javascript object notation (json)

94. Murty J (2009) Programming amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc

95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc

96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 975–986

97. Yang H-C, Parker DS (2009) Traverse: simplified indexing on large map-reduce-merge clusters. In: Database systems for advanced applications. Springer, pp 308–322

98. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with sawzall. Sci Program 13(4):277–298

99. Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U (2009) Building a high-level dataflow system on top of map-reduce: the pig experience. Proceedings VLDB Endowment 2(2):1414–1425


100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629

101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72

102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) Dryadlinq: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14

103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: Parallel and distributed processing, 2008. IEEE international symposium on IPDPS 2008. IEEE, pp 1–11

104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146

105. Bu Y, Bill H, Balazinska M, Ernst MD (2010) Haloop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296

106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative mapreduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing. ACM, pp 810–818

107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2

108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: mapreduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing. ACM, p 7

109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) Ciel: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on Networked systems design and implementation, p 9

110. Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York

111. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY, et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

112. What analytics, data mining, big data software you used in the past 12 months for a real project? (2012). http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html

113. Berthold MR, Cebron N, Dill F, Gabriel TR, Kötter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer

114. Sallam RL, Richardson J, Hagerty J, Hostmann B (2011) Magic quadrant for business intelligence platforms. CT, Gartner Group, Stamford

115. Beyond the PC. Special Report on Personal Technology (2011)

116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A, et al (2011) The iplant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on Software quality assurance. ACM, pp 70–77

118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers. ACM, pp 277–286

119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Record 34(2):18–26

120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM Sigmod Record 33(1):50–57

121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Transac Manag Inform Syst (TMIS) 3(2):7

122. Manning CD, Schütze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press

123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Transac Neural Netw 13(5):1163–1177

124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11

125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117

126. Konopnicki D, Shmueli O (1995) W3qs: a query system for the world-wide web. In: VLDB, vol 95, pp 54–65

127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640

128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval. ACM, pp 2

129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25

130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19

131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819

132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939

133. Barragáns-Martínez AB, Costa-Montenegro E, Burguillo JC, Rey-López M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend tv programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311

134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91

135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia. ACM, pp 469–478

136. Hirsch JE (2005) An index to quantify an individual's scientific research output. Proc Natl Acad Sci USA 102(46):16569

137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company

138. Aggarwal CC (2011) An introduction to social network data analytics. Springer

139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1046–1054

140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied Computing. ACM, pp 1115–1116


141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Transac Knowl Discov Data (TKDD) 5(2):10

142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on World wide web. ACM, pp 631–640

143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis. ACM, pp 16–25

144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. ACM, pp 315–321

145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 145–158

146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 131–144

147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1007–1016

148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 807–816

149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on Web search and data mining. ACM, pp 657–666

150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360

151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (Accessed 5 May 2013)

152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51

153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive RFID, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 2

154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval. ACM, pp 469–478

155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated, multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium. ACM, pp 445–454

156. Tasevski P (2011) Password attacks and generation strategies. Tartu University: Faculty of Mathematics and Computer Sciences


• Big Data: A Survey
  • Abstract
  • Background
    • Dawn of big data era
    • Definition and features of big data
    • Big data value
    • The development of big data
    • Challenges of big data
  • Related technologies
    • Relationship between cloud computing and big data
    • Relationship between IoT and big data
    • Data center
    • Relationship between hadoop and big data
  • Big data generation and acquisition
    • Data generation
      • Enterprise data
      • IoT data
      • Bio-medical data
      • Data generation from other fields
    • Big data acquisition
      • Data collection
      • Data transportation
      • Data pre-processing
  • Big data storage
    • Storage system for massive data
    • Distributed storage system
    • Storage mechanism for big data
      • Database technology
  • Big data analysis
    • Traditional data analysis
    • Big data analytic methods
    • Architecture for big data analysis
      • Real-time vs offline analysis
      • Analysis at different levels
      • Analysis with different complexity
    • Tools for big data mining and analysis
  • Big data applications
    • Key applications of big data
      • Application evolutions
      • Structured data analysis
      • Text data analysis
      • Web data analysis
      • Multimedia data analysis
      • Network data analysis
      • Mobile data analysis
    • Key applications of big data
      • Application of big data in enterprises
      • Application of IoT based big data
      • Application of online social network-oriented big data
      • Applications of healthcare and medical big data
      • Collective intelligence
      • Smart grid
  • Conclusion, open issues, and outlook
    • Open issues
      • Theoretical research
      • Technology development
      • Practical implications
      • Data security
    • Outlook
  • Acknowledgments
  • References

– Zero-copy packet capture technology: Zero-copy (ZC) means that no copies between internal memories occur while a node receives or sends packets. When sending, data packets start directly from the user buffer of the application, pass through the network interfaces, and arrive at an external network. When receiving, the network interfaces send data packets directly to the user buffer. The basic idea of zero-copy is to reduce the number of data copies, the number of system calls, and the CPU load while datagrams are passed from network equipment to user program space. The zero-copy technology first uses direct memory access (DMA) to transmit network datagrams directly to an address space pre-allocated by the system kernel, so as to avoid the participation of the CPU. Meanwhile, it either maps the kernel memory holding the datagrams into the address space of the detection program, or builds a cache region in user space and maps it into kernel space. The detection program then accesses that memory directly, which reduces memory copies from the system kernel to user space and reduces the number of system calls.

– Mobile equipment: Mobile devices are now in widespread use. As their functions grow more powerful, they offer more complex and varied means of data acquisition and capture a wider variety of data. Mobile devices may acquire geographical location information through positioning systems; audio information through microphones; pictures, videos, streetscapes, two-dimensional barcodes, and other multimedia information through cameras; and user gestures and other body-language information through touch screens and gravity sensors. Over the years, wireless operators have improved the service level of the mobile Internet by acquiring and analyzing such information. For example, the iPhone itself is a “mobile spy”: it may collect wireless data and geographical location information and send it back to Apple Inc. for processing, without the user being aware of it. Apart from Apple, smartphone operating systems such as Google's Android and Microsoft's Windows Phone can collect information in a similar manner.
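As a loose user-space analogy to the zero-copy idea described above, the following Python sketch receives data straight into a pre-allocated buffer and hands consumers views of that buffer instead of copies. It is only an illustration under stated assumptions: real zero-copy packet capture relies on DMA and kernel/user memory mapping, which this sketch does not perform, and the function name receive_without_extra_copies and the sample payload are hypothetical.

```python
import socket


def receive_without_extra_copies(sock, buf, nbytes):
    """Fill a pre-allocated buffer in place and return a view of the filled part.

    Hypothetical helper: recv_into() writes into the caller's buffer, so no
    intermediate bytes object is created for each read.
    """
    view = memoryview(buf)
    received = 0
    while received < nbytes:
        n = sock.recv_into(view[received:nbytes])
        if n == 0:  # peer closed the connection
            break
        received += n
    return view[:received]


# A connected socket pair stands in for the network interface (assumption).
sender, receiver = socket.socketpair()
payload = b"packet-0001|packet-0002"
sender.sendall(payload)
sender.close()

buffer = bytearray(64)  # pre-allocated receive buffer
packet_view = receive_without_extra_copies(receiver, buffer, len(payload))
receiver.close()

# Slicing a memoryview shares memory with the buffer rather than copying it.
header = packet_view[:11]
buffer[0] = ord("P")  # a change to the buffer is visible through the view
print(bytes(header))
```

The final print shows that the view reflects the change made to the underlying buffer, which is the sharing behaviour that a mapped capture buffer provides to a detection program.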

In addition to the aforementioned three data acquisition methods for the main data sources, there are many other data collection methods and systems. For example, in scientific experiments, many special tools can be used to collect experimental data, such as magnetic spectrometers and radio telescopes. We may classify data collection methods from different perspectives. From the perspective of data sources, data collection methods can be classified into two categories: collection methods that record through the data sources themselves, and collection methods that record through other auxiliary tools.

3.2.2 Data transportation

Upon the completion of raw data collection, data will be transferred to a data storage infrastructure for processing and analysis. As discussed in Section 2.3, big data is mainly stored in a data center. The data layout should be adjusted to improve computing efficiency or facilitate hardware maintenance; in other words, internal data transmission may occur in the data center. Therefore, data transmission consists of two phases: Inter-DCN transmissions and Intra-DCN transmissions.

– Inter-DCN transmissions: Inter-DCN transmissions are from the data source to the data center, and are generally achieved with the existing physical network infrastructure. Because of the rapid growth of traffic demands, the physical network infrastructure in most regions around the world is constituted by high-volume, high-rate, and cost-effective optical fiber transmission systems. Over the past 20 years, advanced management equipment and technologies have been developed, such as IP-based wavelength division multiplexing (WDM) network architecture, to conduct smart control and management of optical fiber networks [48, 49]. WDM is a technology that multiplexes multiple optical carrier signals with different wavelengths and couples them to the same optical fiber of the optical link. In such technology, lasers with different wavelengths carry different signals. So far, backbone networks have been deployed with WDM optical transmission systems with a single-channel rate of 40 Gb/s. At present, 100 Gb/s commercial interfaces are available, and 100 Gb/s systems (or Tb/s systems) will be available in the near future [50]. However, traditional optical transmission technologies are limited by the bandwidth of the electronic bottleneck [51]. Recently, orthogonal frequency-division multiplexing (OFDM), initially designed for wireless systems, has come to be regarded as one of the main candidate technologies for future high-speed optical transmission. OFDM is a multi-carrier parallel transmission technology. It segments a high-speed data flow into low-speed sub-data-flows that are transmitted over multiple orthogonal sub-carriers [52]. Compared with the fixed channel spacing of WDM, OFDM allows sub-channel frequency spectrums to overlap with each other [53]. Therefore, it is a flexible and efficient optical networking technology.

– Intra-DCN transmissions: Intra-DCN transmissions are the data communication flows within data centers. They depend on the communication mechanisms within the data center (i.e., on physical connection plates, chips, internal memories of data servers, network architectures of data centers, and communication protocols). A data center consists of multiple integrated server racks interconnected by its internal connection networks. Nowadays, the internal connection networks of most data centers are fat-tree, two-layer, or three-layer structures based on multi-commodity network flows [51, 54]. In the two-layer topological structure, the racks are connected by 1 Gbps top-of-rack (TOR) switches, which are in turn connected to 10 Gbps aggregation switches. The three-layer topological structure adds one layer on top of the two-layer structure; this layer consists of 10 Gbps or 100 Gbps core switches that connect the aggregation switches. There are also other topological structures that aim to improve data center networks [55–58]. Because of the inadequacy of electronic packet switches, it is difficult to increase communication bandwidth while keeping energy consumption low. Over the years, owing to the huge success of optical technologies, optical interconnection among the networks in data centers has drawn great interest. Optical interconnection is a high-throughput, low-delay, and low-energy-consumption solution. At present, optical technologies are used only for point-to-point links in data centers. Such optical links connect the switches using low-cost multi-mode fiber (MMF) with a 10 Gbps data rate. Optical interconnection (switching in the optical domain) of networks in data centers is a feasible solution that can provide Tbps-level transmission bandwidth with low energy consumption. Recently, many optical interconnection plans have been proposed for data center networks [59]. Some plans add optical paths to upgrade the existing networks, and other plans completely replace the current switches [59–64]. As a strengthening technology, Zhou et al. [65] adopt wireless links in the 60 GHz frequency band to strengthen wired links. Network virtualization should also be considered to improve the efficiency and utilization of data center networks.
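The OFDM principle mentioned under inter-DCN transmissions (splitting one high-speed flow into parallel low-speed sub-flows carried on orthogonal sub-carriers) can be illustrated with a small Python sketch. It is an idealized toy under several assumptions that are not taken from the cited works: BPSK symbol mapping, a plain discrete Fourier transform as the modulator, no cyclic prefix, and a noiseless channel. All function names are hypothetical.

```python
import cmath


def idft(symbols):
    """Inverse DFT: places one symbol on each orthogonal sub-carrier."""
    n = len(symbols)
    return [
        sum(symbols[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def dft(samples):
    """Forward DFT: separates the sub-carriers again at the receiver."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def subcarrier_inner_product(k1, k2, n):
    """Inner product of two sub-carriers over one symbol period."""
    return sum(
        cmath.exp(2j * cmath.pi * k1 * t / n) * cmath.exp(-2j * cmath.pi * k2 * t / n)
        for t in range(n)
    )


def ofdm_modulate(bits, n_subcarriers):
    """Serial-to-parallel conversion: each block of bits becomes one OFDM symbol."""
    if len(bits) % n_subcarriers != 0:
        raise ValueError("bit count must be a multiple of the sub-carrier count")
    blocks = []
    for start in range(0, len(bits), n_subcarriers):
        block = bits[start:start + n_subcarriers]
        bpsk = [complex(1 if b else -1, 0) for b in block]  # BPSK mapping (assumption)
        blocks.append(idft(bpsk))
    return blocks


def ofdm_demodulate(blocks):
    """Recover the bit stream from the received OFDM symbols."""
    bits = []
    for block in blocks:
        bits.extend(1 if s.real > 0 else 0 for s in dft(block))
    return bits


bits = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1]
tx_symbols = ofdm_modulate(bits, n_subcarriers=8)
rx_bits = ofdm_demodulate(tx_symbols)
print(rx_bits == bits)
```

Because distinct sub-carriers have a zero inner product over one symbol period, the receiver can separate them even though their spectra overlap, which is the property contrasted above with the fixed channel spacing of WDM.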

3.2.3 Data pre-processing

Because of the wide variety of data sources, the collected datasets vary with respect to noise, redundancy, consistency, etc., and it is undoubtedly a waste to store meaningless data. In addition, some analytical methods have stringent requirements on data quality. Therefore, in order to enable effective data analysis, we shall pre-process data under many circumstances to integrate the data from different sources, which not only reduces storage expense but also improves analysis accuracy. Some related data pre-processing techniques are discussed below.

– Integration: Data integration is the cornerstone of modern commercial informatics. It involves combining data from different sources and providing users with a uniform view of the data [66]. This is a mature research field for traditional databases. Historically, two methods have been widely recognized: data warehouse and data federation. Data warehousing includes a process named ETL (Extract, Transform, and Load). Extraction involves connecting to source systems and selecting, collecting, analyzing, and processing the necessary data. Transformation is the execution of a series of rules to transform the extracted data into standard formats. Loading means importing the extracted and transformed data into the target storage infrastructure. Loading is the most complex of the three procedures, and includes operations such as transformation, copy, clearing, standardization, screening, and data organization. In data federation, a virtual database can be built to query and aggregate data from different data sources, but such a database does not contain data. Instead, it holds information or metadata about the actual data and its location. These two “storage-reading” approaches do not satisfy the high performance requirements of data flows or search programs and applications. Compared with queries, data in such approaches is more dynamic and must be processed during data transmission. Generally, data integration methods are accompanied by flow processing engines and search engines [30, 67].

– Cleaning: Data cleaning is a process to identify inaccurate, incomplete, or unreasonable data, and then to modify or delete such data to improve data quality. Generally, data cleaning includes five complementary procedures [68]: defining and determining error types, searching for and identifying errors, correcting errors, documenting error examples and error types, and modifying data entry procedures to reduce future errors. During cleaning, data formats, completeness, rationality, and restrictions shall be inspected. Data cleaning is of vital importance for keeping data consistent, and is widely applied in many fields, such as banking, insurance, retail, telecommunications, and traffic control.

In e-commerce, most data is collected electronically and may have serious data quality problems. Classic data quality problems mainly come from software defects, customized errors, or system misconfiguration. The authors in [69] discussed data cleaning in e-commerce by crawlers and regularly re-copying customer and account information.

In [70], the problem of cleaning RFID data was examined. RFID is widely used in many applications, e.g., inventory management and target tracking. However, raw RFID data is of low quality and includes a lot of abnormal data, owing to limits of the physical design and to environmental noise. In [71], a probability model was developed to cope with data loss in mobile environments. Khoussainova et al. [72] proposed a system that automatically corrects errors in input data by defining global integrity constraints.

Herbert et al. [73] proposed a framework called BIO-AJAX to standardize biological data so as to conduct further computation and improve search quality. With BIO-AJAX, some errors and repetitions may be eliminated, and common data mining technologies can be executed more effectively.

– Redundancy elimination: Data redundancy refers to data repetition or surplus, which occurs in many datasets. Data redundancy increases unnecessary data transmission expense and causes defects in storage systems, e.g., waste of storage space, data inconsistency, reduced data reliability, and data damage. Therefore, various redundancy reduction methods have been proposed, such as redundancy detection, data filtering, and data compression. Such methods may apply to different datasets or application environments. However, redundancy reduction may also bring about certain negative effects. For example, data compression and decompression cause additional computational burden. Therefore, the benefits of redundancy reduction should be carefully balanced against its cost. Data collected from different fields will increasingly appear in image or video formats. It is well known that images and videos contain considerable redundancy, including temporal redundancy, spatial redundancy, statistical redundancy, and sensing redundancy. Video compression is widely used to reduce redundancy in video data, as specified in the many video coding standards (MPEG-2, MPEG-4, H.263, and H.264/AVC). In [74], the authors investigated the problem of video compression in a video surveillance system with a video sensor network. The authors propose a new MPEG-4 based method by investigating the contextual redundancy related to background and foreground in a scene. The low complexity and the low compression ratio of the proposed approach were demonstrated by the evaluation results.

In generalized data transmission or storage, repeated data deletion (i.e., data deduplication) is a special data compression technology that aims to eliminate repeated data copies [75]. With repeated data deletion, individual data blocks or data segments are assigned identifiers (e.g., using a hash algorithm) and stored, with the identifiers added to an identification list. As the analysis of repeated data deletion continues, if a new data block has an identifier identical to one in the identification list, the new data block is deemed redundant and is replaced by a reference to the corresponding stored data block. Repeated data deletion can greatly reduce storage requirements, which is particularly important to a big data storage system. Apart from the aforementioned data pre-processing methods, specific data objects shall go through some other operations, such as feature extraction. Such operations play an important role in multimedia search and DNA analysis [76–78]. Usually, high-dimensional feature vectors (or high-dimensional feature points) are used to describe such data objects, and the system stores the feature vectors for future retrieval. Data transfer is usually used to process distributed heterogeneous data sources, especially business datasets [79]. As a matter of fact, given the variety of datasets, it is non-trivial or even impossible to build a uniform data pre-processing procedure and technology that is applicable to all types of datasets. The specific features, problem, performance requirements, and other factors of the datasets should be considered so as to select a proper data pre-processing strategy.
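The repeated data deletion procedure described above (assign each block a hash identifier, keep an identification list, and store a block only the first time its identifier is seen) can be sketched in a few lines of Python. This is a minimal illustration, assuming fixed-size blocks and SHA-256 identifiers; the class name DedupStore and the tiny block size are hypothetical choices for the example, not the design of any system cited here.

```python
import hashlib


class DedupStore:
    """Toy block store that keeps one copy of each distinct block."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        # Identification list: identifier -> the single stored copy of the block.
        self.blocks = {}

    def put(self, data):
        """Store data and return its recipe, the ordered list of block identifiers."""
        recipe = []
        for start in range(0, len(data), self.block_size):
            block = data[start:start + self.block_size]
            identifier = hashlib.sha256(block).hexdigest()
            if identifier not in self.blocks:
                self.blocks[identifier] = block  # first occurrence: store the block
            recipe.append(identifier)  # repeats keep only a reference
        return recipe

    def get(self, recipe):
        """Rebuild the original data from a recipe."""
        return b"".join(self.blocks[identifier] for identifier in recipe)

    def stored_bytes(self):
        """Number of bytes physically kept after deduplication."""
        return sum(len(block) for block in self.blocks.values())


store = DedupStore(block_size=4)
data = b"ABCDABCDABCDEFGH"
recipe = store.put(data)
print(len(data), store.stored_bytes())
```

In this example the 16 input bytes contain only two distinct 4-byte blocks, so 8 bytes are stored, while the recipe still allows the original data to be rebuilt exactly.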

                        4 Big data storage

The explosive growth of data imposes stricter requirements on storage and management. In this section, we focus on the storage of big data. Big data storage refers to the storage and management of large-scale datasets while achieving reliability and availability of data access. We will review important issues including massive storage systems, distributed storage systems, and big data storage mechanisms. On one hand, the storage infrastructure needs to provide an information storage service with reliable storage space; on the other hand, it must provide a powerful access interface for query and analysis of a large amount of data.

Traditionally, as auxiliary equipment of a server, a data storage device is used to store, manage, look up, and analyze data with structured RDBMSs. With the sharp growth of data, data storage devices are becoming increasingly important, and many Internet companies pursue large storage capacity to be competitive. Therefore, there is a compelling need for research on data storage.


4.1 Storage system for massive data

Various storage systems have emerged to meet the demands of massive data. Existing massive storage technologies can be classified as Direct Attached Storage (DAS) and network storage, while network storage can be further classified into Network Attached Storage (NAS) and Storage Area Network (SAN).

In DAS, various hard disks are directly connected to servers, and data management is server-centric, such that storage devices are peripheral equipment, each of which takes a certain amount of I/O resource and is managed by individual application software. For this reason, DAS is only suitable for interconnecting servers at a small scale. Moreover, due to its low scalability, DAS exhibits undesirable efficiency when the storage capacity is increased, i.e., its upgradeability and expandability are greatly limited. Thus, DAS is mainly used in personal computers and small-sized servers.

Network storage utilizes a network to provide users with a unified interface for data access and sharing. Network storage equipment includes special data exchange equipment, disk arrays, tape libraries, and other storage media, as well as special storage software. It is characterized by strong expandability.

NAS is actually auxiliary storage equipment of a network. It is directly connected to a network through a hub or switch using TCP/IP protocols. In NAS, data is transmitted in the form of files. Compared to DAS, the I/O burden at a NAS server is reduced extensively, since the server accesses a storage device indirectly through a network.

While NAS is network-oriented, SAN is especially designed for data storage with a scalable and bandwidth-intensive network, e.g., a high-speed network with optical fiber connections. In SAN, data storage management is relatively independent within a storage local area network, where multipath-based data switching among any internal nodes is utilized to achieve a maximum degree of data sharing and data management.

In terms of the organization of a data storage system, DAS, NAS, and SAN can all be divided into three parts: (i) the disk array, which is the foundation of a storage system and the fundamental guarantee for data storage; (ii) connection and network sub-systems, which provide connections among one or more disk arrays and servers; and (iii) storage management software, which handles data sharing, disaster recovery, and other storage management tasks of multiple servers.

4.2 Distributed storage system

The first challenge brought about by big data is how to develop a large-scale distributed storage system for efficient data processing and analysis. To use a distributed system to store massive data, the following factors should be taken into consideration:

– Consistency: A distributed storage system requires multiple servers to cooperatively store data. As the number of servers grows, the probability of server failures becomes larger. Usually, data is divided into multiple pieces that are stored at different servers to ensure availability in case of server failure. However, server failures and parallel storage may cause inconsistency among different copies of the same data. Consistency refers to assuring that multiple copies of the same data are identical.

– Availability: A distributed storage system operates on multiple sets of servers. As more servers are used, server failures are inevitable. It is desirable that the entire system is not seriously affected by such failures and can still satisfy customers' read and write requests. This property is called availability.

– Partition tolerance: Multiple servers in a distributed storage system are connected by a network. The network could have link/node failures or temporary congestion. The distributed system should have a certain level of tolerance to problems caused by network failures. It is desirable that the distributed storage still works well when the network is partitioned.

Eric Brewer proposed the CAP theory [80, 81] in 2000, which indicated that a distributed system could not simultaneously meet the requirements on consistency, availability, and partition tolerance; at most two of the three requirements can be satisfied simultaneously. Seth Gilbert and Nancy Lynch from MIT proved the correctness of the CAP theory in 2002. Since consistency, availability, and partition tolerance cannot be achieved simultaneously, we can have a CA system by ignoring partition tolerance, a CP system by ignoring availability, and an AP system by ignoring consistency, according to different design goals. The three systems are discussed in the following.

CA systems do not have partition tolerance, i.e., they cannot handle network failures. Therefore, CA systems are generally deemed as storage systems with a single server, such as traditional small-scale relational databases. Such systems keep a single copy of data, so consistency is easily ensured. Availability is guaranteed by the excellent design of relational databases. However, since CA systems cannot handle network failures, they cannot be expanded to use many servers. Therefore, most large-scale storage systems are CP systems and AP systems.

Compared with CA systems, CP systems ensure partition tolerance. Therefore, CP systems can be expanded to become distributed systems. CP systems generally maintain several copies of the same data in order to ensure a level of fault tolerance. CP systems also ensure data consistency, i.e., multiple copies of the same data are guaranteed to be completely identical. However, CP systems cannot ensure sound availability because of the high cost of consistency assurance. Therefore, CP systems are useful for scenarios with moderate load but stringent requirements on data accuracy (e.g., trading data). BigTable and HBase are two popular CP systems.

AP systems also ensure partition tolerance. However, AP systems differ from CP systems in that they also ensure availability, while ensuring only eventual consistency rather than the strong consistency of the previous two systems. Therefore, AP systems only apply to scenarios with frequent requests but not very high requirements on accuracy. For example, in online Social Networking Services (SNS) systems, there are many concurrent visits to the data, but a certain amount of data errors are tolerable. Furthermore, because AP systems ensure eventual consistency, accurate data can still be obtained after a certain amount of delay. Therefore, AP systems may also be used under circumstances with no stringent real-time requirements. Dynamo and Cassandra are two popular AP systems.
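The difference between CP and AP behaviour during a network partition can be made concrete with a toy two-replica simulation in Python. This is only a sketch under simplifying assumptions that do not come from the systems named above: the partition is a boolean flag, a single shared counter stands in for timestamps, and replicas are reconciled by a last-writer-wins rule. The class names CPStore and APStore are hypothetical.

```python
class Replica:
    """One server holding key -> (version, value) records."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class CPStore:
    """Keeps replicas identical by refusing writes during a partition."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.partitioned = False
        self.version = 0

    def write(self, key, value):
        if self.partitioned:
            return False  # unavailable: not every copy can be updated
        self.version += 1
        for replica in self.replicas:
            replica.data[key] = (self.version, value)
        return True


class APStore:
    """Always accepts writes; replicas converge after the partition heals."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.partitioned = False
        self.clock = 0  # shared logical clock (simplifying assumption)

    def write(self, replica_index, key, value):
        self.clock += 1
        record = (self.clock, value)
        # During a partition only the contacted replica sees the write.
        targets = [self.replicas[replica_index]] if self.partitioned else self.replicas
        for replica in targets:
            replica.data[key] = record
        return True

    def heal(self):
        """End the partition and merge replicas, keeping the newest record per key."""
        self.partitioned = False
        merged = {}
        for replica in self.replicas:
            for key, record in replica.data.items():
                if key not in merged or record[0] > merged[key][0]:
                    merged[key] = record
        for replica in self.replicas:
            replica.data = dict(merged)


cp = CPStore([Replica("a"), Replica("b")])
cp_ok_before = cp.write("balance", 100)
cp.partitioned = True
cp_ok_during = cp.write("balance", 90)  # rejected, copies stay identical

ap = APStore([Replica("a"), Replica("b")])
ap.write(0, "status", "hello")
ap.partitioned = True
ap.write(0, "status", "at lunch")  # reaches replica a only
ap.write(1, "status", "back")      # reaches replica b only
diverged = ap.replicas[0].data != ap.replicas[1].data
ap.heal()
converged = ap.replicas[0].data == ap.replicas[1].data
print(cp_ok_before, cp_ok_during, diverged, converged)
```

The CP store gives up availability while the partition lasts, whereas the AP store keeps answering and temporarily serves different values from different replicas before converging, which matches the trade-off described for trading data versus SNS workloads.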

                        4.3 Storage mechanism for big data

                        Considerable research on big data promotes the development of storage mechanisms for big data. Existing storage mechanisms of big data may be classified into three bottom-up levels: (i) file systems, (ii) databases, and (iii) programming models.

                        File systems are the foundation of the applications at upper levels. Google's GFS is an expandable distributed file system that supports large-scale, distributed, data-intensive applications [25]. GFS uses cheap commodity servers to achieve fault tolerance and provides customers with high-performance services. GFS supports large-scale file applications in which reading is more frequent than writing. However, GFS also has some limitations, such as a single point of failure and poor performance for small files. Such limitations have been overcome by Colossus [82], the successor of GFS.

                        In addition, other companies and researchers also have their own solutions to meet the different demands for storage of big data. For example, HDFS and Kosmosfs are open-source derivatives of GFS. Microsoft developed Cosmos [83] to support its search and advertisement business. Facebook utilizes Haystack [84] to store its large number of small-sized photos. Taobao also developed TFS and FastDFS. In conclusion, distributed file systems have become relatively mature after years of development and business operation. Therefore, we will focus on the other two levels in the rest of this section.

                        4.3.1 Database technology

                        Database technology has been evolving for more than 30 years. Various database systems have been developed to handle datasets at different scales and support various applications. Traditional relational databases cannot meet the challenges on categories and scales brought about by big data. NoSQL databases (i.e., databases other than traditional relational ones) are becoming more popular for big data storage. NoSQL databases feature flexible schemas, support for simple and easy replication, simple APIs, eventual consistency, and support for large data volumes. NoSQL databases are becoming a core technology for big data. We will examine the following three main kinds of NoSQL databases in this section: key-value databases, column-oriented databases, and document-oriented databases, each based on a certain data model.

                        – Key-value Databases: Key-value databases are built on a simple data model in which data are stored as key-value pairs. Every key is unique, and clients look up values by their keys. Such databases feature a simple structure, and modern key-value databases are characterized by higher expandability and shorter query response times than relational databases. Over the past few years, many key-value databases have appeared, motivated by Amazon's Dynamo system [85]. We will introduce Dynamo and several other representative key-value databases.

                        – Dynamo: Dynamo is a highly available and expandable distributed key-value data storage system. It is used to store and manage the status of some core services in the Amazon e-Commerce Platform that can be realized with key access. The conventional pattern of relational databases may generate invalid data and limit data scale and availability, while Dynamo resolves these problems with a simple key-object interface consisting of simple read and write operations. Dynamo achieves elasticity and availability through its data partitioning, data replication, and object versioning mechanisms. Dynamo's partitioning scheme relies on Consistent Hashing [86] to divide the load among multiple main storage machines; its main advantage is that the arrival or departure of a node only affects its directly adjacent nodes and does not affect other nodes. Dynamo replicates data to N servers, where N is a configurable parameter, in order to achieve high availability and durability. Dynamo also provides eventual consistency, so that updates can be propagated to all copies asynchronously.

                        – Voldemort: Voldemort is also a key-value storage system, which was initially developed for and is still used by LinkedIn. Keys and values in Voldemort can be composite objects consisting of lists and maps. The Voldemort interface includes three simple operations: reading, writing, and deletion, all of which are addressed by keys. Voldemort provides asynchronous updates with multi-version concurrency control but does not ensure data consistency. However, Voldemort supports optimistic locking for consistent multi-record updates: when a conflict occurs between an update and any other operation, the update operation is aborted. The data replication mechanism of Voldemort is the same as that of Dynamo. Voldemort not only stores data in RAM but also allows data to be placed in a storage engine. In particular, Voldemort supports two storage engines: Berkeley DB and Random Access Files.

                        The key-value database emerged a few years ago. Deeply influenced by Amazon's Dynamo, other key-value storage systems include Redis, Tokyo Cabinet and Tokyo Tyrant, Memcached and MemcacheDB, Riak, and Scalaris, all of which provide expandability by distributing keys across nodes. Voldemort, Riak, Tokyo Cabinet, and Memcached can utilize attached storage devices to store data in RAM or on disks. The other storage systems store data in RAM and provide disk backup, or rely on replication and recovery to avoid backup.
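
                        The consistent-hashing partitioning and N-way replication described above for Dynamo can be illustrated with a minimal sketch. It is not Dynamo's implementation: the class `HashRing`, the method `preference_list`, the use of MD5, and the omission of virtual nodes are all assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a ring position (assumption: MD5 read as an integer)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hashing ring: one position per node, no virtual nodes."""

    def __init__(self, nodes=()):
        self._positions = []  # sorted ring positions
        self._owner = {}      # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        pos = _hash(node)
        bisect.insort(self._positions, pos)
        self._owner[pos] = node

    def remove(self, node: str) -> None:
        pos = _hash(node)
        self._positions.remove(pos)
        del self._owner[pos]

    def preference_list(self, key: str, n: int = 3):
        """Return up to n distinct nodes met when walking clockwise from the key."""
        start = bisect.bisect_right(self._positions, _hash(key))
        size = len(self._positions)
        return [
            self._owner[self._positions[(start + i) % size]]
            for i in range(min(n, size))
        ]


ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
replicas = ring.preference_list("shopping-cart:42", n=3)  # N = 3 copies
```

                        Because a key is owned by the first node clockwise from its hash, removing a node only reassigns the keys that node held; every other key keeps its primary owner.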

                        – Column-oriented Databases: Column-oriented databases store and process data by columns rather than rows. Both columns and rows are segmented across multiple nodes to realize expandability. Column-oriented databases are mainly inspired by Google's BigTable. In this section, we first discuss BigTable and then introduce several derivative tools.

                        – BigTable: BigTable is a distributed, structured data storage system designed to process large-scale (PB-class) data across thousands of commercial servers [87]. The basic data structure of BigTable is a sparse, distributed, persistent, multi-dimensional sorted map. The map is indexed by row key, column key, and timestamp, and every value in the map is an uninterpreted byte array. Each row key in BigTable is an arbitrary string of up to 64 KB. Rows are stored in lexicographic order and are continually segmented into Tablets (i.e., units of distribution) for load balancing. Thus, reading a short range of rows can be highly efficient, since it only involves communication with a small portion of machines. The columns are grouped according to the prefixes of their keys, thus forming column families. These column families are the basic units for access control. The timestamps are 64-bit integers used to distinguish different versions of cell values. Clients may flexibly determine the number of cell versions stored. These versions are sorted in descending order of timestamps, so the latest version is always read first.

                        The BigTable API features the creation and deletion of tables and column families, as well as modification of the metadata of clusters, tables, and column families. Client applications may insert or delete values in BigTable, query values from columns, or browse sub-datasets in a table. BigTable also supports some other features, such as transaction processing within a single row. Users may utilize such features to conduct more complex data processing.

                        Every BigTable deployment includes three main components: the Master server, the Tablet servers, and the client library. BigTable allows only one Master server to be assigned, which is responsible for assigning Tablets to Tablet servers, detecting added or removed Tablet servers, and conducting load balancing. In addition, it can also modify the BigTable schema, e.g., creating tables and column families, and garbage-collect files saved in GFS, including deleted or disabled files used in specific BigTable instances. Every Tablet server manages a set of Tablets and is responsible for reading and writing its loaded Tablets. When Tablets grow too big, they are segmented by the server. The application client library is used to communicate with BigTable instances.

                        BigTable is based on many fundamental components of Google, including GFS [25], the cluster management system, the SSTable file format, and Chubby [88]. GFS is used to store data and log files. The cluster management system is responsible for task scheduling, resource sharing, handling of machine failures, and monitoring of machine statuses. The SSTable file format is used to store BigTable data internally, and it provides a persistent, ordered, and immutable mapping from keys to values, both of which are arbitrary byte strings. BigTable utilizes Chubby for the following tasks: 1) ensuring there is at most one active Master copy at any time; 2) storing the bootstrap location of BigTable data; 3) looking up Tablet servers; 4) conducting error recovery in case of Tablet server failures; 5) storing BigTable schema information; and 6) storing the access control table.

                        – Cassandra: Cassandra is a distributed storage system for managing huge amounts of structured data distributed among multiple commercial servers [89]. The system was developed by Facebook and became an open source tool in 2008. It adopts the ideas and concepts of both Amazon Dynamo and Google BigTable, in particular integrating the distributed system technology of Dynamo with the BigTable data model. Tables in Cassandra take the form of a distributed four-dimensional structured map, where the four dimensions are row, column, column family, and super column. A row is identified by a string key of arbitrary length. No matter how many columns are read or written, an operation on a row is atomic. Columns may be grouped into clusters, which are called column families and are similar to those in the BigTable data model. Cassandra provides two kinds of column families: column families and super columns. A super column contains an arbitrary number of columns associated with the same name. A column family includes columns and super columns, which may be continuously inserted into the column family at runtime. The partitioning and replication mechanisms of Cassandra are very similar to those of Dynamo, so as to achieve consistency.

                        – Derivative tools of BigTable: Since the BigTable code is not available under an open source license, some open source projects compete to implement the BigTable concept and develop similar systems, such as HBase and Hypertable.

                        HBase is a BigTable clone programmed in Java and is a part of Apache's Hadoop MapReduce framework [90]. HBase replaces GFS with HDFS. It writes updated contents into RAM and regularly writes them into files on disk. Row operations are atomic, equipped with row-level locking and transaction processing (transaction support at a larger scale is optional). Partitioning and distribution are handled transparently, with room for client-side hashing or fixed keys.

                        HyperTable was developed along the lines of BigTable to obtain a high-performance, expandable, distributed storage and processing system for structured and unstructured data [91]. HyperTable relies on a distributed file system, e.g., HDFS, and a distributed lock manager. Its data representation, processing, and partitioning mechanisms are similar to those of BigTable. HyperTable has its own query language, called HyperTable Query Language (HQL), which allows users to create, modify, and query the underlying tables.

                        Since column-oriented storage databases mainly emulate BigTable, their designs are all similar except for the concurrency mechanism and several other features. For example, Cassandra emphasizes weak consistency with multi-version concurrency control, while HBase and HyperTable focus on strong consistency through locks or log records.
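
                        The BigTable data model described above, a sorted map from (row key, column key, timestamp) to an uninterpreted byte array, can be sketched in a few lines. This is a single-process illustration only; `ToyTable`, its method names, and the `max_versions` parameter are hypothetical and do not correspond to any BigTable, HBase, or HyperTable API.

```python
from collections import defaultdict


class ToyTable:
    """In-memory sketch of a (row, column, timestamp) -> bytes map."""

    def __init__(self, max_versions: int = 3):
        self.max_versions = max_versions
        # row key -> column key ("family:qualifier") -> [(timestamp, value), ...]
        self._rows = defaultdict(dict)

    def put(self, row: str, column: str, timestamp: int, value: bytes) -> None:
        versions = self._rows[row].setdefault(column, [])
        versions.append((timestamp, value))
        # Keep versions newest-first and drop those beyond the configured limit.
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]

    def get(self, row: str, column: str):
        """Return the most recent value of a cell, or None if the cell is absent."""
        versions = self._rows.get(row, {}).get(column)
        return versions[0][1] if versions else None

    def scan(self, start_row: str, end_row: str):
        """Yield (row, columns) for start_row <= row < end_row, in lexicographic order."""
        for row in sorted(self._rows):
            if start_row <= row < end_row:
                yield row, self._rows[row]


table = ToyTable(max_versions=2)
table.put("com.cnn.www", "contents:", 1, b"<html>v1</html>")
table.put("com.cnn.www", "contents:", 2, b"<html>v2</html>")
latest = table.get("com.cnn.www", "contents:")  # newest version is read first
```

                        In a real system the sorted row space would be split into Tablets served by different machines; here `scan` only mimics the property that a short row range touches a contiguous part of the key space.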

                        – Document Databases: Compared with key-value storage, document storage can support more complex data forms. Since documents do not follow a strict schema, there is no need to conduct schema migration. In addition, key-value pairs can still be saved. We will examine three important representatives of document storage systems, i.e., MongoDB, SimpleDB, and CouchDB.

                        – MongoDB: MongoDB is an open-source, document-oriented database [92]. MongoDB stores documents as Binary JSON (BSON) objects [93], which are similar to JSON objects. Every document has an ID field as the primary key. Queries in MongoDB are expressed with a syntax similar to JSON. A database driver sends the query as a BSON object to MongoDB. The system allows queries on all documents, including embedded objects and arrays. To enable rapid queries, indexes can be created on the queryable fields of documents. Replication in MongoDB is carried out with log files on the master nodes that record all the high-level operations conducted in the database. During replication, the slaves query the master for all the write operations since the last synchronization and execute the operations from the log files on their local databases. MongoDB supports horizontal expansion with automatic sharding to distribute data among thousands of nodes, with automatic load balancing and failover.


                        – SimpleDB: SimpleDB is a distributed database and a web service of Amazon [94]. Data in SimpleDB are organized into various domains, in which data may be stored, acquired, and queried. Domains include different properties and sets of name/value pairs of items. Data are replicated to different machines at different data centers in order to ensure data safety and improve performance. This system does not support automatic partitioning and thus cannot be expanded as the data volume changes. SimpleDB allows users to query with SQL. It is worth noting that SimpleDB can assure eventual consistency but does not support Multi-Version Concurrency Control (MVCC). Therefore, conflicts therein cannot be detected from the client side.

                        – CouchDB: Apache CouchDB is a document-oriented database written in Erlang [95]. Data in CouchDB are organized into documents consisting of fields named by keys/names and values, which are stored and accessed as JSON objects. Every document is provided with a unique identifier. CouchDB allows access to database documents through a RESTful HTTP API. If a document needs to be modified, the client must download the entire document, modify it, and then send it back to the database. Each time a document is rewritten, its revision identifier is updated. CouchDB utilizes optimistic replication to obtain scalability without a sharding mechanism. Since various CouchDB instances may run alongside other transactions simultaneously, any kind of replication topology can be built. The consistency of CouchDB relies on the replication mechanism. CouchDB supports MVCC with historical hash records.
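
                        The download-modify-resubmit cycle and the revision-based conflict detection described above for CouchDB can be illustrated with a toy in-memory store. The sketch below is not the CouchDB HTTP API: `ToyDocumentStore`, `ConflictError`, and the revision format (a counter plus a content hash) are assumptions chosen only to show how a stale update is rejected.

```python
import hashlib
import json


class ConflictError(Exception):
    """Raised when an update is based on a stale revision."""


class ToyDocumentStore:
    def __init__(self):
        self._docs = {}  # document id -> (revision string, document body)

    @staticmethod
    def _next_rev(prev_rev, body):
        # Hypothetical revision format: "<generation>-<hash of the content>".
        generation = int(prev_rev.split("-")[0]) + 1 if prev_rev else 1
        payload = json.dumps(body, sort_keys=True).encode("utf-8")
        return f"{generation}-{hashlib.md5(payload).hexdigest()}"

    def get(self, doc_id):
        """Return (revision, copy of the whole document)."""
        rev, body = self._docs[doc_id]
        return rev, dict(body)

    def put(self, doc_id, body, rev=None):
        """Store the whole document; rev must match the current revision."""
        current = self._docs.get(doc_id)
        current_rev = current[0] if current else None
        if rev != current_rev:
            raise ConflictError(f"expected revision {current_rev}, got {rev}")
        new_rev = self._next_rev(current_rev, body)
        self._docs[doc_id] = (new_rev, dict(body))
        return new_rev


store = ToyDocumentStore()
first_rev = store.put("user:1", {"name": "Ada"})
rev, doc = store.get("user:1")       # download the entire document
doc["city"] = "Wuhan"                # modify it locally
second_rev = store.put("user:1", doc, rev)  # send it back with its revision
```

                        A second client that still holds `first_rev` and tries to write would receive a `ConflictError`, which is the kind of client-side conflict detection that the text notes SimpleDB lacks.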

                        4.3.2 Programming models

                        Big data are generally stored on hundreds or even thousands of commercial servers. Thus, traditional parallel models, such as Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), may not be adequate to support such large-scale parallel programs. Recently, some proposed parallel programming models have effectively improved the performance of NoSQL and reduced the performance gap to relational databases. Therefore, these models have become the cornerstone for the analysis of massive data.

                        – MapReduce: MapReduce [22] is a simple but powerful programming model for large-scale computing that uses large clusters of commercial PCs to achieve automatic parallel processing and distribution. In MapReduce, the computing model has only two functions, i.e., Map and Reduce, both of which are programmed by users. The Map function processes input key-value pairs and generates intermediate key-value pairs. Then, MapReduce combines all the intermediate values related to the same key and transmits them to the Reduce function, which further compresses the value set into a smaller set. MapReduce has the advantage that it avoids the complicated steps of developing parallel applications, e.g., data scheduling, fault tolerance, and inter-node communication. The user only needs to program the two functions to develop a parallel application. The initial MapReduce framework did not support multiple datasets in a task, which has been mitigated by some recent enhancements [96, 97].

                        Over the past decades, programmers have become familiar with SQL, the advanced declarative language often used in relational databases for task description and dataset analysis. However, the succinct MapReduce framework only provides two opaque functions, which cannot cover all the common operations. Therefore, programmers have to spend time programming basic functions, which are typically hard to maintain and reuse. In order to improve programming efficiency, some advanced language systems have been proposed, e.g., Sawzall [98] of Google, Pig Latin [99] of Yahoo, Hive [100] of Facebook, and Scope [87] of Microsoft.

                        – Dryad: Dryad [101] is a general-purpose distributed execution engine for processing coarse-grained data-parallel applications. The operational structure of Dryad is a directed acyclic graph, in which vertexes represent programs and edges represent data channels. Dryad executes operations on the vertexes in clusters and transmits data via data channels, including files, TCP connections, and shared-memory FIFOs. During operation, resources in a logical operation graph are automatically mapped to physical resources.

                        The operational structure of Dryad is coordinated by a central program called the job manager, which can be executed in clusters or on workstations through the network. A job manager consists of two parts: 1) application code, which is used to build a job communication graph, and 2) program library code, which is used to arrange available resources. All kinds of data are transmitted directly between vertexes. Therefore, the job manager is only responsible for decision-making and does not obstruct any data transmission.

                        In Dryad, application developers can flexibly choose any directed acyclic graph to describe the communication modes of the application and to express data transmission mechanisms. In addition, Dryad allows vertexes to use any number of input and output datasets, while MapReduce supports only one input and one output set. DryadLINQ [102] is the advanced language of Dryad and is used to integrate the aforementioned SQL-like language execution environment.

                        – All-Pairs: All-Pairs [103] is a system specially designed for biometrics, bioinformatics, and data mining applications. It focuses on comparing element pairs in two datasets by a given function. All-Pairs can be expressed as a three-tuple (Set A, Set B, and Function F), in which Function F is utilized to compare all elements in Set A and Set B. The comparison result is an output matrix M, which is also called the Cartesian product or cross join of Set A and Set B.

                        All-Pairs is implemented in four phases: system modeling, distribution of input data, batch job management, and result collection. In Phase I, an approximation model of system performance is built to evaluate how much CPU resource is needed and how to conduct job partitioning. In Phase II, a spanning tree is built for data transmission, which allows the workload of every partition to retrieve input data effectively. In Phase III, after the data flow is delivered to the proper nodes, the All-Pairs engine builds a batch-processing submission for the jobs in the partitions, queues them in the batch processing system, and formulates a node running command to acquire data. In the last phase, after the batch processing system completes the jobs, the extraction engine collects the results and combines them in a proper structure, which is generally a single file list in which all results are put in order.

                        – Pregel: The Pregel [104] system of Google facilitates the processing of large-sized graphs, e.g., analysis of network graphs and social networking services. A computational task is expressed by a directed graph consisting of vertexes and directed edges. Every vertex is associated with a modifiable, user-defined value, and every directed edge is associated with a source vertex and consists of a user-defined value and the identifier of a target vertex. Once the graph is built, the program conducts iterative calculations, called supersteps, which are separated by global synchronization points, until the algorithm completes and the output is produced. In every superstep, vertex computations run in parallel, and every vertex executes the same user-defined function that expresses the logic of a given algorithm. Every vertex may modify its own status and that of its output edges, receive messages sent in the previous superstep, send messages to other vertexes, and even modify the topological structure of the entire graph. Edges are not provided with corresponding computations. Every vertex may deactivate itself by suspending (voting to halt). When all vertexes are inactive and there are no messages to transmit, the entire program execution is completed. The Pregel program output is a set consisting of the values output by all the vertexes. Generally speaking, the input and output of a Pregel program are isomorphic directed graphs.
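
                        To make the Map and Reduce roles described above for MapReduce concrete, the following single-process sketch counts words. It only imitates the data flow (map, group by key, reduce); `run_mapreduce`, `map_fn`, and `reduce_fn` are illustrative names, and none of the scheduling, distribution, or fault tolerance of a real framework is modeled.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: emit an intermediate (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: compress all values for one key into a single total."""
    return word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Stand-in for the framework: run the mapper, group values by key, run the reducer."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))


lines = [(0, "big data is big"), (1, "data about data")]
word_counts = run_mapreduce(lines, map_fn, reduce_fn)
# word_counts == {"about": 1, "big": 2, "data": 3, "is": 1}
```

                        The user writes only `map_fn` and `reduce_fn`; everything inside `run_mapreduce` corresponds to work that the framework performs on the user's behalf.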

                        Inspired by the above programming models, other research has also focused on programming models for more complex computational tasks, e.g., iterative computations [105, 106], fault-tolerant memory computations [107], incremental computations [108], and flow control decision-making related to data [109].
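
                        The vertex-centric superstep model described above for Pregel can also be sketched on a single machine. The example propagates the largest vertex value along directed edges; the function `propagate_max` and its sequential loop are assumptions for illustration and do not reflect the Pregel API or its distributed execution.

```python
def propagate_max(values, out_edges):
    """Run supersteps until all vertexes are inactive and no messages remain.

    values:    {vertex: initial value}
    out_edges: {vertex: [target vertexes]}
    Returns (final values, number of supersteps executed).
    """
    values = dict(values)
    inbox = {v: [] for v in values}
    active = set(values)  # every vertex is active in the first superstep
    superstep = 0
    while active:
        outbox = {v: [] for v in values}
        for v in active:
            # The same user-defined logic runs on every active vertex.
            best = max([values[v]] + inbox[v])
            if superstep == 0 or best > values[v]:
                values[v] = best
                for target in out_edges.get(v, []):
                    outbox[target].append(best)
            # Otherwise the vertex sends nothing, i.e., it votes to halt.
        # Global synchronization point: messages are delivered in the next superstep.
        inbox = outbox
        active = {v for v, messages in inbox.items() if messages}
        superstep += 1
    return values, superstep


graph = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["a"]}
final_values, steps = propagate_max({"a": 3, "b": 6, "c": 2, "d": 1}, graph)
# On this ring every vertex ends with the maximum value, 6.
```

                        A vertex that receives a message becomes active again in the following superstep, and the loop stops once no vertex has pending messages, mirroring the termination rule given above.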

                        5 Big data analysis

                        The analysis of big data mainly involves analytical methods for traditional data and big data, analytical architectures for big data, and software used for mining and analysis of big data. Data analysis is the final and most important phase in the value chain of big data, with the purpose of extracting useful values and providing suggestions or decisions. Different levels of potential value can be generated through the analysis of datasets in different fields [10]. However, data analysis is a broad area, which changes frequently and is extremely complex. In this section, we introduce the methods, architectures, and tools for big data analysis.

                        5.1 Traditional data analysis

                        Traditional data analysis means using proper statistical methods to analyze massive data, to concentrate, extract, and refine useful data hidden in a batch of chaotic datasets, and to identify the inherent law of the subject matter, so as to maximize the value of data. Data analysis plays a huge guiding role in making development plans for a country, understanding customer demands for commerce, and predicting market trends for enterprises. Big data analysis can be deemed as the analysis technique for a special kind of data. Therefore, many traditional data analysis methods may still be utilized for big data analysis. Several representative traditional data analysis methods are examined in the following, many of which are from statistics and computer science.

                        – Cluster Analysis: a statistical method for grouping objects, and specifically, for classifying objects according to some features. Cluster analysis is used to differentiate objects with particular features and divide them into categories (clusters) according to these features, such that objects in the same category have high homogeneity while different categories have high heterogeneity. Cluster analysis is an unsupervised learning method that requires no training data.

                        – Factor Analysis: basically targeted at describing the relations among many elements with only a few factors, i.e., grouping several closely related variables into a factor; the few factors are then used to reveal most of the information in the original data.


                        – Correlation Analysis: an analytical method for determining the law of relations, such as correlation, correlative dependence, and mutual restriction, among observed phenomena, and accordingly conducting forecast and control. Such relations may be classified into two types: (i) function, reflecting a strict dependence relationship among phenomena, which is also called a definitive dependence relationship; and (ii) correlation, covering undetermined or inexact dependence relations, in which a numerical value of one variable may correspond to several numerical values of the other variable, and such numerical values fluctuate regularly around their mean values.

                        – Regression Analysis: a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies dependence relationships among variables that are hidden by randomness. Regression analysis may turn complex and undetermined correlations among variables into simple and regular ones.

                        – A/B Testing: also called bucket testing. It is a technique for determining how to improve target variables by comparing the tested groups. Big data will require a large number of tests to be executed and analyzed.

                        – Statistical Analysis: Statistical analysis is based on statistical theory, a branch of applied mathematics. In statistical theory, randomness and uncertainty are modeled with probability theory. Statistical analysis can provide a description and an inference for big data. Descriptive statistical analysis can summarize and describe datasets, while inferential statistical analysis can draw conclusions from data subject to random variations. Statistical analysis is widely applied in the economic and medical care fields [110].

                        – Data Mining Algorithms: Data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. In 2006, the IEEE International Conference on Data Mining Series (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [111], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, and CART. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and link mining, all of which are among the most important problems in data mining research.
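
                        As a small worked illustration of the correlation and regression analysis entries above, the sketch below computes the Pearson correlation coefficient and an ordinary least-squares line for two variables. The function names `pearson_r` and `fit_line` and the sample data are hypothetical; a single predictor is assumed for brevity.

```python
from math import sqrt


def _centered_sums(xs, ys):
    """Return (sum of x*y deviations, sum of squared x deviations, same for y, means)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy, sxx, syy, mean_x, mean_y


def pearson_r(xs, ys):
    """Correlation analysis: strength of the linear relation, between -1 and 1."""
    sxy, sxx, syy, _, _ = _centered_sums(xs, ys)
    return sxy / sqrt(sxx * syy)


def fit_line(xs, ys):
    """Regression analysis: least-squares fit of y = slope * x + intercept."""
    sxy, sxx, _, mean_x, mean_y = _centered_sums(xs, ys)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


ad_spend = [1.0, 2.0, 3.0, 4.0]   # made-up observations
revenue = [3.1, 4.9, 7.2, 8.8]
r = pearson_r(ad_spend, revenue)
slope, intercept = fit_line(ad_spend, revenue)
```

                        A coefficient close to 1 or -1 indicates a nearly functional (definitive) dependence, while values in between correspond to the inexact dependence relations that the text calls correlation.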

                        5.2 Big data analytic methods

                        At the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data so as to bring value to enterprises and individuals. At present, the main processing methods of big data are as follows.

                        – Bloom Filter: A Bloom filter consists of a series of hash functions. The principle of the Bloom filter is to store hash values of data rather than the data itself by utilizing a bit array, which is in essence a bitmap index that uses hash functions to conduct lossy compression storage of data. It has such advantages as high space efficiency and high query speed, but also has some disadvantages in misrecognition (false positives) and deletion.

                        – Hashing: a method that essentially transforms data into shorter fixed-length numerical values or index values. Hashing has such advantages as rapid reading and writing and high query speed, but it is hard to find a sound hash function.

                        – Index: An index is always an effective method to reduce the expense of disk reading and writing and to improve insertion, deletion, modification, and query speeds, both in traditional relational databases that manage structured data and in other technologies that manage semi-structured and unstructured data. However, an index has the disadvantage of the additional cost of storing index files, which must be maintained dynamically when data are updated.

                        – Trie: also called trie tree, a variant of the hash tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of the trie is to utilize common prefixes of character strings to reduce comparisons on character strings to the greatest extent, so as to improve query efficiency.

                        – Parallel Computing: Compared to traditional serial computing, parallel computing refers to simultaneously utilizing several computing resources to complete a computation task. Its basic idea is to decompose a problem into parts and assign them to several separate processes to be completed independently, so as to achieve co-processing. Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see a comparison in Table 1).
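
                        The Bloom filter entry in the list above can be made concrete with a minimal sketch. The class name `BloomFilter`, the default sizes, and the choice of deriving the hash functions by salting SHA-256 with an index are assumptions for illustration; production filters size the bit array from the expected item count and target error rate.

```python
import hashlib


class BloomFilter:
    """Bit array plus k hash functions; stores hash positions, not the data."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


seen = BloomFilter()
for url in ["example.org/a", "example.org/b"]:
    seen.add(url)
already_crawled = "example.org/a" in seen  # no false negatives for added items
```

                        The two drawbacks named in the text are visible here: membership tests can report an item that was never added, and a bit cannot be cleared on deletion because other items may share it.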

                        Although parallel computing systems or tools such as MapReduce and Dryad are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages are being developed based on these systems. Such high-level languages include Sawzall, Pig, and Hive, used for MapReduce, as well as Scope and DryadLINQ, used for Dryad.
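
                        Returning to the trie entry in the list of methods above, the sketch below shows how shared prefixes support both word frequency statistics and prefix retrieval. The class and method names (`Trie`, `frequency`, `with_prefix`) are hypothetical, and the structure is kept in memory for simplicity.

```python
class TrieNode:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}  # next character -> child node
        self.count = 0      # how many times a word ending here was inserted


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word: str) -> None:
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1

    def _walk(self, text: str):
        """Follow text from the root; return the node reached, or None."""
        node = self.root
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def frequency(self, word: str) -> int:
        node = self._walk(word)
        return node.count if node else 0

    def with_prefix(self, prefix: str):
        """Return all stored words that start with prefix, in sorted order."""
        start = self._walk(prefix)
        if start is None:
            return []
        found, stack = [], [(start, prefix)]
        while stack:
            node, text = stack.pop()
            if node.count:
                found.append(text)
            for ch, child in node.children.items():
                stack.append((child, text + ch))
        return sorted(found)


index = Trie()
for token in ["data", "database", "data", "day", "big"]:
    index.insert(token)
```

                        Each lookup costs one step per character of the query, independent of how many words are stored, because words sharing a prefix share the same path from the root.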

                        5.3 Architecture for big data analysis

                        Because of the 4Vs of big data, different analytical architectures shall be considered for different application requirements.


                        Table 1 Comparison of MPI, MapReduce, and Dryad

                        | | MPI | MapReduce | Dryad |
                        |---|---|---|---|
                        | Deployment | Computing node and data storage arranged separately (data should be moved to the computing node) | Computing and data storage arranged at the same node (computing should be close to data) | Computing and data storage arranged at the same node (computing should be close to data) |
                        | Resource management/scheduling | – | Workqueue (Google), HOD (Yahoo) | Not clear |
                        | Low-level programming | MPI API | MapReduce API | Dryad API |
                        | High-level programming | – | Pig, Hive, Jaql, ... | Scope, DryadLINQ |
                        | Data storage | The local file system, NFS, ... | GFS (Google), HDFS (Hadoop), KFS, Amazon S3, ... | NTFS, Cosmos DFS |
                        | Task partitioning | User manually partitions the tasks | Automation | Automation |
                        | Communication | Messaging, remote memory access | Files (local FS, DFS) | Files, TCP pipes, shared-memory FIFOs |
                        | Fault tolerance | Checkpoint | Task re-execution | Task re-execution |

                        5.3.1 Real-time vs. offline analysis

                        According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

                        – Real-time analysis: mainly used in e-commerce and finance. Since data constantly change, rapid data analysis is needed and analytical results shall be returned with a very short delay. The main existing architectures of real-time analysis include (i) parallel processing clusters using traditional relational databases, and (ii) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

                        – Offline analysis: usually used for applications without high requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally conducts analysis by importing logs into a special platform through data acquisition tools. Under the big data setting, many Internet enterprises utilize offline analysis architectures based on Hadoop in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples of such acquisition tools include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool Timetunnel, and Chukwa of Hadoop. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.

5.3.2 Analysis at different levels

Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive-level analysis, which are examined in the following.

– Memory-level analysis is for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays, the memory of a server cluster surpasses hundreds of GB, and even the TB level is common. Therefore, an in-memory database technology may be used, and hot data shall reside in memory so as to improve the analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis. MongoDB is a representative memory-level analytical architecture. With the development of SSDs (Solid-State Drives), the capacity and performance of memory-level data analysis have been further improved, and such analysis has been widely applied.

– BI analysis is for the case when the data scale surpasses the memory level but may be imported into the BI analysis environment. The current mainstream BI products provide data analysis plans that support data volumes above the TB level.

– Massive analysis is for the case when the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analysis utilizes HDFS of Hadoop to store data and MapReduce to analyze it. Most massive analysis belongs to the offline analysis category.
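The MapReduce style of analysis mentioned for the massive level can be illustrated with a small word count. This is only a single-process sketch of the map, shuffle, and reduce steps; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are illustrative and are not part of Hadoop's API, and a real job would distribute each phase across a cluster reading from HDFS.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for record in records:
        for word in record.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse the values of each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(records):
    """Run the three phases in sequence on an in-memory list of records."""
    return reduce_phase(shuffle_phase(map_phase(records)))


if __name__ == "__main__":
    print(word_count(["big data", "Big analysis"]))
```

Because each phase only sees keys and values, the map and reduce functions could in principle run on different machines, which is what makes the model suitable for offline analysis of very large datasets.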

5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithms differ greatly from each other according to different kinds of data and application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.
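As a minimal sketch of a computation that is amenable to parallel processing, the mean of a dataset can be obtained by splitting the data into chunks, computing a partial (sum, count) per chunk, and combining the partial results. The helper names `chunked` and `parallel_mean` are hypothetical, and threads are used here only to keep the example self-contained; a real deployment would more likely use separate processes or a cluster framework.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(values, n_chunks):
    """Split `values` into at most `n_chunks` contiguous slices."""
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def partial_stats(chunk):
    """Work done independently on each chunk: its sum and its length."""
    return sum(chunk), len(chunk)


def parallel_mean(values, n_workers=4):
    """Combine per-chunk partial results into the global mean."""
    if not values:
        raise ValueError("values must be non-empty")
    chunks = chunked(values, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_stats, chunks))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


if __name__ == "__main__":
    print(parallel_mean(list(range(1, 11))))
```

The design point is that the per-chunk work shares no state, so the only sequential step is the final, cheap combination of the partial results.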

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the top five most widely used software packages, according to a 2012 KDnuggets survey of 798 professionals that asked "What Analytics, Data mining, Big Data software that you used in the past 12 months for a real project?" [112].

– R (30.7 %): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. When computing-intensive tasks are executed, code programmed in C, C++, and Fortran may be called in the R environment. In addition, skilled users can directly call R objects in C. Actually, R is an implementation of the S language, which is an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and drawing plots. Compared to S, R is more popular since it is open source. R ranks first in the KDnuggets 2012 survey. Furthermore, in a 2012 survey of "Design languages you have used for data mining/analysis in the past year", R was also in first place, ahead of SQL and Java. Due to the popularity of R, database manufacturers such as Teradata and Oracle have released products supporting R.

– Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins with powerful functions for data analysis, such as Analysis ToolPak and Solver Add-in, are integrated initially, but such plug-ins can be used only if users enable them. Excel is also the only commercial software among the top five.

– Rapid-I RapidMiner (26.7 %): RapidMiner is an open source software used for data mining, machine learning, and predictive analysis. In a KDnuggets survey in 2011, it was more frequently used than R (ranked first). Data mining and machine learning programs provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment. The data mining flow is described in XML and displayed through a graphical user interface (GUI). RapidMiner is written in Java. It integrates the learners and evaluation methods of Weka and works with R. Functions of RapidMiner are implemented by connecting processes that consist of various operators. The entire flow can be deemed as a production line of a factory, with original data as input and model results as output. The operators can be considered as specific functions with different input and output characteristics.

– KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, and open-source platform for data integration, data processing, data analysis, and data mining [113]. It allows users to create data flows or data channels in a visualized manner and to selectively run some or all analytical procedures, and it provides analytical results, models, and interactive views. KNIME is written in Java and, being based on Eclipse, provides more functions as plug-ins. Through plug-in files, users can insert processing modules for files, pictures, and time series, and integrate them into various open source projects, e.g., R and Weka. KNIME controls data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted in a visualized environment. KNIME is designed as a module-based and expandable framework. There is no dependence between its processing units and data containers, making it adaptive to distributed environments and independent development. In addition, KNIME is easy to expand: developers can effortlessly extend its various nodes and views.

– Weka/Pentaho (14.8 %): Weka, abbreviated from Waikato Environment for Knowledge Analysis, is a free and open-source machine learning and data mining software written in Java. Weka provides such functions as data processing, feature selection, classification, regression, clustering, association rules, and visualization. Pentaho is one of the most popular open-source BI software packages. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, data mining, and other aspects of BI. Weka's data processing algorithms are also integrated in Pentaho and can be directly called.
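The "production line" view of a data mining flow described for RapidMiner in the list above, where connected operators pass data from one to the next, can be sketched as plain function composition. This is not RapidMiner's or KNIME's API; the operators `drop_missing`, `scale_feature`, and `mean_model` and the runner `run_flow` are hypothetical names chosen for illustration.

```python
from functools import reduce


def drop_missing(rows):
    """Pre-processing operator: discard rows that contain a None value."""
    return [row for row in rows if None not in row.values()]


def scale_feature(name, factor):
    """Operator factory: returns an operator that rescales one column."""
    def operator(rows):
        return [{**row, name: row[name] * factor} for row in rows]
    return operator


def mean_model(name):
    """Modeling operator: 'fits' a trivial model, the mean of one column."""
    def operator(rows):
        return {"feature": name, "mean": sum(r[name] for r in rows) / len(rows)}
    return operator


def run_flow(operators, data):
    """Feed the output of each operator into the next one, in order."""
    return reduce(lambda current, op: op(current), operators, data)


if __name__ == "__main__":
    raw = [{"x": 1.0}, {"x": None}, {"x": 3.0}]
    flow = [drop_missing, scale_feature("x", 10), mean_model("x")]
    print(run_flow(flow, raw))
```

Each operator has its own input and output type (rows in, rows out, or rows in, model out), which mirrors the idea of operators as functions with different input and output characteristics.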

6 Big data applications

In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful values via judgments, suggestions, supports, or decisions. However, data analysis involves a wide range of applications, which frequently change and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields, including structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

6.1 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically includes large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have emerged in the past decades. For example, as early as the 1990s, BI became a prevailing technology for business applications, and network search engines based on massive data mining emerged in the early 21st century. Some potential and influential applications from different fields and their data and analysis characteristics are discussed as follows.

– Evolution of Commercial Applications: The earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. Analytical techniques used in such systems prevailed in the 1990s and were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, score cards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have been providing a unique opportunity for organizations to have an online presence and directly interact with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted by text analysis and website mining techniques. As reported in [115], the quantity of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the Internet of Things based on sensors are opening a new generation of innovative applications and requiring considerably larger capacity to support location-sensing, people-oriented, and context-aware operation.

– Evolution of Network Applications: The early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to the mining of email contents and building search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages, full of various kinds of data, such as text, images, audio, videos, and interactive contents. Therefore, plenty of advanced technologies for semi-structured or unstructured data have emerged. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share contents.

– Evolution of Scientific Applications: Scientific research in many fields, such as astrophysics, oceanology, genomics, and environmental research, is acquiring massive data with high-throughput sensors and instruments. The US National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, inter-operative analysis software, and data services to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets have high variety in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and it is not easy to have a comprehensive coverage, we will focus on the key problems and technologies in data analysis in the following discussions.

6.2 Big data analysis fields

6.2.1 Structured data analysis

Business applications and scientific research may generate massive structured data, of which the management and analysis rely on mature commercialized technologies, such as RDBMS, data warehouse, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning based on exact mathematical models and powerful algorithms has been applied to anomaly detection [117] and energy control [118]. Exploiting data characteristics, time and space mining can extract knowledge structures hidden in high-speed data flows and sensors [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially in process analysis with event data [121].

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to have more business potential than structured data analysis. Generally, text analysis is a process to extract useful information and knowledge from unstructured text. Text mining is inter-disciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and data mining in particular. Most text mining systems are based on text expressions and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammar [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.
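To make the idea of extracting information from unstructured text concrete, the following sketch turns raw documents into weighted term vectors using TF-IDF. TF-IDF is a common information retrieval weighting that is not specifically named in the text above, so it is used here as an assumption; the tokenizer is deliberately naive, and `tokenize` and `tf_idf` are illustrative names.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (tok.strip(".,;:!?()\"'") for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def tf_idf(documents):
    """Return one {term: weight} dict per input document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


if __name__ == "__main__":
    docs = ["big data analysis", "big data storage"]
    for vector in tf_idf(docs):
        print(vector)
```

Terms that occur in every document receive a weight of zero, so the vectors emphasize the words that distinguish one document from another; such vectors are a typical input to classification and clustering.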

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, image, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which will be discussed in Section 6.2.4. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves the mining of semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., email, newsgroup management, and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to simulate and integrate data in the Web so as to conduct more complex queries than searches based on keywords.

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagrams linked in a website or among multiple websites. Models are built based on topological structures provided with hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of the models to look up relevant website pages. The topic-oriented crawler is another successful case utilizing the models [127].
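As an illustration of a link-structure model, the sketch below computes PageRank-style scores by power iteration over a small hyperlink graph. It follows the commonly described formulation with a damping factor; the value 0.85, the fixed iteration count, and the handling of pages without outgoing links are assumptions of this sketch and are not details taken from [125].

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict {page: [pages it links to]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {
            p: (1.0 - damping) / n + damping * dangling / n for p in pages
        }
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    for page, score in sorted(pagerank(graph).items()):
        print(page, round(score, 4))
```

A page scores highly when it is linked from pages that themselves score highly, which is how such a model can rank relevant pages using topology alone.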

Web usage mining aims to mine auxiliary data generated by Web dialogues or activities, whereas Web content mining and Web structure mining use the master Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or trades, cache, user queries, bookmark data, mouse clicks and scrolls, and any other kinds of data generated through interaction with the Web. As Web services and Web 2.0 are becoming mature and popular, Web usage data will have increasingly high variety. Web usage mining plays key roles in personalized space, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.

6.2.4 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) have been growing at an amazing speed, and analysis is used to extract useful knowledge from such data and understand its semantics. Because multimedia data is heterogeneous and most of such data contains richer information than simple structured data or text data, extracting information is confronted with the huge challenge of the semantic gap. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection.

Audio summarization can be accomplished by extracting the prominent words or phrases from metadata or by synthesizing a new representation. Video summarization interprets the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied to many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video and apply additional smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of extracted features from the video.
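A very simple form of static summarization by key frames can be sketched as follows: a frame is kept as a key frame when it differs enough from the most recent key frame. This is an illustrative heuristic and not the method of any system cited above; frames are assumed to be flat lists of pixel intensities of equal length, and `frame_difference`, `select_key_frames`, and the threshold value are hypothetical.

```python
def frame_difference(a, b):
    """Mean absolute difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_key_frames(frames, threshold):
    """Return indices of frames that differ enough from the last key frame."""
    if not frames:
        return []
    key_indices = [0]  # the first frame always opens the summary
    for index in range(1, len(frames)):
        last_key = frames[key_indices[-1]]
        if frame_difference(frames[index], last_key) > threshold:
            key_indices.append(index)
    return key_indices


if __name__ == "__main__":
    video = [
        [0, 0, 0, 0],
        [0, 1, 0, 0],  # nearly identical to the first frame
        [9, 9, 9, 9],  # abrupt change of content
        [9, 9, 8, 9],
        [0, 0, 0, 0],  # content changes back
    ]
    print(select_key_frames(video, threshold=2.0))
```

The simplicity of this kind of rule matches the text's remark that static methods are easy to apply, and it also hints at why their quality is limited: the rule knows nothing about what the frames depict.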

Multimedia annotation inserts labels to describe contents of images and videos at both syntax and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time and labor intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic gap. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to simultaneously explore both manual and automatic multimedia annotation [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users to conveniently and quickly look up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, including shot boundary detection, key frame extraction, and scene segmentation. According to the result of structural analysis, the second procedure is feature extraction, which mainly includes further mining the features of key frames, objects, texts, and movements, which are the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find the patterns of video contents and put videos into predefined categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos. The retrieval results are then optimized through relevance feedback.
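The query and retrieval step described above can be sketched as a nearest-neighbor lookup over extracted feature vectors. Cosine similarity is used here as one possible similarity measurement; the text does not prescribe a specific measure, and the index layout (a dict from a video identifier to a feature vector) and the names `cosine_similarity` and `retrieve` are assumptions of this sketch.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(index, query_features, top_k=3):
    """Return the ids of the `top_k` indexed videos most similar to the query."""
    scored = [
        (cosine_similarity(features, query_features), video_id)
        for video_id, features in index.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [video_id for _, video_id in scored[:top_k]]


if __name__ == "__main__":
    video_index = {"v1": [1, 0, 0], "v2": [0, 1, 0], "v3": [1, 1, 0]}
    print(retrieve(video_index, [1, 0.1, 0], top_k=2))
```

In a fuller system, relevance feedback from the user on the returned candidates would be used to adjust the query vector or the feature weights before the next lookup.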

Multimedia recommendation is to recommend specific multimedia contents according to users' preferences. It has proven to be an effective approach to provide personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. The content-based methods identify general features of users or their interests and recommend other contents with similar features to users. These methods largely rely on content similarity measurement, but most of them are troubled by limited content analysis and over-specialization. The collaborative-filtering-based methods identify groups with similar interests and recommend contents to group members according to their behavior [132]. Presently, a mixed method has been introduced, which integrates advantages of the aforementioned two types of methods to improve recommendation quality [133].
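The collaborative-filtering idea, recommending to a user what similar users liked, can be sketched with a small user-based recommender. The rating data, the use of cosine similarity between rating vectors, and the names `user_similarity` and `recommend` are assumptions made for illustration and do not reproduce the methods of [132] or [133].

```python
import math


def user_similarity(ratings_a, ratings_b):
    """Cosine similarity between two users' {item: rating} dicts."""
    shared = set(ratings_a) & set(ratings_b)
    if not shared:
        return 0.0
    dot = sum(ratings_a[item] * ratings_b[item] for item in shared)
    norm_a = math.sqrt(sum(r * r for r in ratings_a.values()))
    norm_b = math.sqrt(sum(r * r for r in ratings_b.values()))
    return dot / (norm_a * norm_b)


def recommend(ratings, user, top_n=2):
    """Rank items the user has not rated by similarity-weighted ratings."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        weight = user_similarity(ratings[user], other_ratings)
        if weight <= 0.0:
            continue  # users with nothing in common contribute nothing
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + weight * rating
    ranked = sorted(scores, key=lambda item: scores[item], reverse=True)
    return ranked[:top_n]


if __name__ == "__main__":
    data = {
        "ann": {"clip1": 5, "clip2": 4},
        "bob": {"clip1": 5, "clip2": 4, "clip3": 5},
        "cat": {"clip4": 5},
    }
    print(recommend(data, "ann"))
```

Unlike a content-based method, this sketch never inspects the clips themselves; it relies only on overlapping behavior, which is why a new user or item with no ratings cannot be served by it alone.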

The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains some text descriptions related to concepts and video examples [134]. In [135], the author proposed a new algorithm for special multimedia event detection using a few positive training examples. The research on video event detection is still in its infancy and mainly focuses on sports or news events, running or abnormal events in monitoring videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graph structures, describing the communications between two entities. The content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. In accordance with the data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

The research on link-based structural analysis has always been committed to link prediction, community discovery, social network evolution, and social influence analysis. SNS may be visualized as graphs, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertices and edges are continually added to the graphs. Link prediction is to predict the possibility of a future connection between two vertices. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and utilizes the existing link information to generate binary classifiers to predict future links [139]. Probabilistic methods aim to build models for connection probabilities among vertices in SNS [140]. Linear algebra computes the similarity between two vertices according to the singular similar matrix [141]. A community is represented by a sub-graph, in which edges connecting vertices within the sub-graph feature high density, while the edges between two sub-graphs feature much lower density [142].
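Link prediction can be illustrated with a simple neighborhood-based score: two unconnected users who share many neighbors are considered likely to connect. The Jaccard coefficient used below is a swapped-in, widely used similarity feature of the kind that could feed a feature-based classifier; it is not the specific method of [139], [140], or [141], and the function names are illustrative.

```python
from itertools import combinations


def build_adjacency(edges):
    """Turn an undirected edge list into a {vertex: set(neighbors)} mapping."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def jaccard_score(adjacency, u, v):
    """Shared neighbors of u and v divided by all their neighbors."""
    union = adjacency[u] | adjacency[v]
    if not union:
        return 0.0
    return len(adjacency[u] & adjacency[v]) / len(union)


def predict_links(edges, top_k=3):
    """Score every currently unconnected pair and return the best candidates."""
    adjacency = build_adjacency(edges)
    candidates = [
        (jaccard_score(adjacency, u, v), u, v)
        for u, v in combinations(sorted(adjacency), 2)
        if v not in adjacency[u]
    ]
    candidates.sort(key=lambda candidate: candidate[0], reverse=True)
    return candidates[:top_k]


if __name__ == "__main__":
    friendships = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
    for score, u, v in predict_links(friendships):
        print(u, v, round(score, 2))
```

Scoring all pairs is quadratic in the number of vertices, so on a large SNS graph the candidate set would normally be restricted, for example to pairs that are two hops apart.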

Many methods for community detection have been proposed and studied, most of which are topology-based target functions relying on the concept of capturing community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. The research on SNS evolution aims to look for laws and deduction models to interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generation methods have been proposed to assist network and system design [147].

Social influence refers to the case when individuals change their behavior under the influence of others. The strength of social influence depends on the relation among individuals, network distances, time effect, and characteristics of networks and individuals. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise. For example, the blogosphere contains a large number of spam blogs, and Twitter likewise contains many trivial tweets. Third, SNS are dynamic networks, which vary and are updated frequently and quickly. The existing research on social media analysis is still in its infancy. Considering that SNS contains massive information, transfer learning in heterogeneous networks aims to transfer knowledge information among different media [150].

6.2.6 Mobile data analysis

By April 2013, Android Apps had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since the research on mobile analysis has just started, we will only introduce some recent and representative analysis applications in this section.

With the growth of the number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities with geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest Webchat). Traditional network communities or SNS communities lack online interaction among members, and the communities are active only when members are sitting before computers. On the contrary, mobile phones can support rich interaction at any time and anywhere. A mobile community is defined as a group of individuals with the same hobbies (i.e., health, safety, entertainment, etc.) who gather together on networks, meet to set a common goal, decide measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, the progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build body area networks for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations. In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop an application for smart phones, which analyzes people's gait when they walk and uses the gait information for unlocking the safety system [11]. In the meanwhile, Robert Delano and Brian Parise from Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with a built-in seismograph in a mobile phone, so as to cope with Parkinson's and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data application. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, in marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business modes. In sales planning, after comparison of massive data, enterprises can optimize their commodity prices. In operation, enterprises can improve their operation efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor cost. In the supply chain, using big data, enterprises may conduct inventory optimization, logistic optimization, and supplier coordination to mitigate the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has been rapidly developed. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that such activities as "Multi-times score accumulation" and "score exchange in shops" are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers' transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.
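The selection step of such a drop-out warning model can be sketched as follows: given a drop-out score per customer, pick the fraction of customers with the highest scores for a retention offer. The scores here are made-up inputs, and the function name `customers_to_retain` is hypothetical; how CMB actually computes its scores is not described in the text.

```python
import math


def customers_to_retain(churn_scores, fraction=0.2):
    """Return ids of the `fraction` of customers with the highest scores."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Round up so that a non-empty customer base always yields someone.
    n_selected = math.ceil(len(churn_scores) * fraction)
    ranked = sorted(churn_scores, key=churn_scores.get, reverse=True)
    return ranked[:n_selected]


if __name__ == "__main__":
    scores = {"c1": 0.9, "c2": 0.1, "c3": 0.4, "c4": 0.7, "c5": 0.2}
    print(customers_to_retain(scores))
```

The ranking depends only on the relative order of the scores, so any model that orders customers by drop-out likelihood could supply the input.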

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction time, commodity prices, and purchase quantities are recorded, along with, more importantly, the age, gender, address, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, market conditions of their brands, and consumers' behaviors, and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan of Alibaba automatically analyzes and judges whether to lend loans to enterprises through the acquired enterprise transaction data by virtue of big data technology, while manual intervention does not occur in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan with only about 0.3 % bad loans, which is greatly lower than those of other commercial banks.

6.3.2 Application of IoT based big data

IoT is not only an important source of big data, but also one of the main markets for big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises may have the most profound experience with the application of IoT big data. For example, UPS trucks are equipped with sensors, wireless adapters, and GPS, so that headquarters can track truck positions and prevent engine failures. This system also helps UPS to supervise and manage its employees and to optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city cooperation project between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and the city of Miami, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The smart city application brings benefits in many respects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills this year by identifying and fixing running and leaking water pipes in a timely manner.

Mobile Netw Appl (2014) 19:171–209

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and connections among individuals based on an information network. Big data of online SNS mainly comes from instant messaging, online social networking, micro blogs, and shared spaces, etc., which represent various user activities. The analysis of big data from online SNS uses computational analytical methods for understanding relations in human society, drawing on theories and methods from mathematics, informatics, sociology, and management science, along three dimensions: network structure, group interaction, and information spreading. Applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire value.

– Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preference, emotion, interest, and demand, etc. may be revealed.

– Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies, etc. aggregate relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.
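The notion of a community (close internal relations, loose external ones) can be made concrete by counting edges. The sketch below is a minimal illustration, not a community detection algorithm: the toy graph and the function name `community_cohesion` are assumptions introduced here, and it only checks how cohesive a given candidate group is.

```python
def community_cohesion(edges, members):
    """Count internal and boundary edges for a candidate community.

    edges: iterable of undirected (u, v) pairs (users connected by some relation).
    members: node IDs proposed as one community.
    Returns (internal, boundary): edges with both ends inside the group,
    and edges with exactly one end inside it.
    """
    members = set(members)
    internal = boundary = 0
    for u, v in edges:
        inside = (u in members) + (v in members)
        if inside == 2:
            internal += 1
        elif inside == 1:
            boundary += 1
    return internal, boundary


# Toy network: two triangles {a, b, c} and {d, e, f} joined by the edge c-d.
edges = [("a", "b"), ("a", "c"), ("b", "c"),
         ("c", "d"),
         ("d", "e"), ("d", "f"), ("e", "f")]
internal, boundary = community_cohesion(edges, {"a", "b", "c"})
print(internal, boundary)
```

A group with many internal edges and few boundary edges matches the description of a community given above; real analyses would search for such groups rather than test a given one.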

The US Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

Fig. 5 Enabling technologies for online social network-oriented big data

In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted research that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans. The goal is to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting sharp growth or drops in the volume of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation in the official statistics of Indonesia matches the number of Tweets about rice prices on Twitter, as shown in Fig. 6.
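The first aspect above, detecting sharp growth or drops in topic volume, can be sketched as a comparison against a trailing average. This is a minimal illustration under assumptions: the window length, the 50 % threshold, and the daily counts are invented here, and the source does not state which method the Global Pulse project used.

```python
def flag_abnormal_days(counts, window=3, threshold=0.5):
    """Flag days whose message count deviates sharply from the trailing average.

    counts: daily message counts for one topic.
    window: number of preceding days used as the baseline (assumed value).
    threshold: relative change treated as "sharp"; 0.5 means +/-50 % (assumed value).
    Returns a list of (day_index, relative_change) pairs.
    """
    flagged = []
    for i in range(window, len(counts)):
        baseline = sum(counts[i - window:i]) / window
        if baseline == 0:
            continue  # no meaningful baseline to compare against
        change = (counts[i] - baseline) / baseline
        if abs(change) >= threshold:
            flagged.append((i, round(change, 2)))
    return flagged


# Made-up counts: a spike on day 4 and a drop on day 8.
daily = [100, 100, 100, 100, 300, 100, 100, 100, 20]
abnormal = flag_abnormal_days(daily)
print(abnormal)
```

Days that follow a spike are compared against an inflated baseline, so a production system would likely use a more robust baseline such as a median.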

Generally speaking, the application of big data from online SNS may help to better understand users’ behavior and master the laws of social and economic activities from the following three aspects:

– Early Warning: to rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotion, and preference of users.

– Real-time Feedback: to acquire groups’ feedback on some social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the health care business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final result into a highly personalized treatment plan to assess the risk factors and main treatment plans of patients. Doctors may then reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or by suggesting that patients reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20.

The Mount Sinai Medical Center in the US utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through the software development kit (SDK) and open interface.

Fig. 6 The correlation between Tweets about rice price and food price inflation

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units and coordinate through mobile networks to distribute sensing tasks and to collect and utilize sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of Crowdsourcing has been successfully applied to geotagged photographs, positioning and navigation, urban road traffic sensing, market forecasting, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach for problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, Crowdsourcing was applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of Crowdsourcing. The main idea of Crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, Crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, Spatial Crowdsourcing has become a hot topic. The operation framework of Spatial Crowdsourcing is as follows. A user may request the service and resources related to a specified location. Then the mobile users who are willing to participate in the task move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data are sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions they provide, it can be forecast that Spatial Crowdsourcing will become more prevalent than traditional Crowdsourcing, e.g., Amazon Turk and Crowdflower.
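The request-and-dispatch step of this framework can be sketched as picking the nearest willing participant. The sketch is illustrative only: the worker records, the 5 km radius, and the function names are assumptions made here, and real platforms use their own assignment policies.

```python
from math import radians, sin, cos, asin, sqrt


def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def assign_task(task_location, workers, max_km=5.0):
    """Return the ID of the nearest willing worker within max_km, or None.

    workers: list of dicts with hypothetical keys "id", "location", "willing".
    """
    in_range = []
    for w in workers:
        if not w["willing"]:
            continue  # only users who agree to participate are considered
        distance = haversine_km(task_location, w["location"])
        if distance <= max_km:
            in_range.append((distance, w["id"]))
    return min(in_range)[1] if in_range else None


# Made-up workers around a requested location.
task = (30.0, 114.0)
workers = [
    {"id": "w1", "location": (30.01, 114.0), "willing": True},    # about 1.1 km away
    {"id": "w2", "location": (30.001, 114.0), "willing": False},  # closest, but unwilling
    {"id": "w3", "location": (30.5, 114.0), "willing": True},     # about 55 km away
]
chosen = assign_task(task, workers)
print(chosen)
```

The chosen worker would then travel to the location, capture the data, and return it to the requester.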

6.3.6 Smart grid

Smart Grid is the next generation power grid, constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart Grid related big data are generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nationwide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as Circuit Breaker Monitors and transformers). Smart Grid brings about the following challenges on exploiting big data.

– Grid planning: By analyzing data in the Smart Grid, regions with excessively high electrical load or high power outage frequencies can be identified. Even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an “electric map” according to big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to show the power consumption of every block at the moment. It can even compare the power consumption of a block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides an effective and visual load forecast for power grid planning in a city. Preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads, as displayed in the map, may then be conducted.

– Interaction between power generation and power consumption: An ideal power grid shall balance power generation and consumption. However, the traditional power grid is constructed on the one-directional approach of transmission-transformation-distribution-consumption, which does not allow the generation capacity to be adjusted according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters have been developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which help the supplier read power utilization data every 15 min rather than every month as in the past. The labor cost of meter reading is greatly reduced. Because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and low periods of power consumption. TXU Energy utilized such price levels to smooth the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help realize time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users.

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable energy sources can be efficiently managed: the electricity they generate can be allocated to regions with electricity shortages. Such energy resources can complement traditional hydropower and thermal power generation.
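The time-sharing pricing mentioned in the list above relies on meter readings taken every 15 min. A minimal sketch of a two-level tariff over such readings follows; the peak window (17:00 to 21:00), both rates, and the load profiles are assumptions chosen for illustration and are not TXU Energy's actual tariff.

```python
def time_of_use_bill(readings_kwh, peak_hours=range(17, 21),
                     peak_rate=0.30, offpeak_rate=0.10):
    """Bill one day of 15-minute meter readings under a two-level tariff.

    readings_kwh: 96 energy values in kWh, one per 15-minute slot from midnight.
    peak_hours, peak_rate, offpeak_rate: illustrative tariff parameters (assumed).
    """
    if len(readings_kwh) != 96:
        raise ValueError("expected 96 readings (24 h x 4 per hour)")
    total = 0.0
    for slot, kwh in enumerate(readings_kwh):
        hour = slot // 4  # four 15-minute slots per hour
        rate = peak_rate if hour in peak_hours else offpeak_rate
        total += kwh * rate
    return round(total, 2)


# Two made-up households, each using 24 kWh in total over the day.
flat = [0.25] * 96
evening_heavy = [0.875 if 68 <= slot < 84 else 0.125 for slot in range(96)]

flat_bill = time_of_use_bill(flat)
evening_bill = time_of_use_bill(evening_heavy)
print(flat_bill, evening_bill)
```

With the same daily energy, the household that concentrates its use in the peak window pays more, which is the incentive that smooths peak and low fluctuations.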

7 Conclusion, open issues, and outlook

In this paper, we review the background and state-of-the-art of big data. Firstly, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture to readers of this exciting area.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, which are discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined, and the existing models are not strictly verified.

– Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to measure the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated when the system is implemented and deployed, so the advantages and disadvantages of alternative solutions cannot be compared horizontally, even before and after the implementation of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes: This includes memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.
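Of the computing modes listed above, the MR (MapReduce) mode is the easiest to illustrate in a few lines. The following single-process sketch mimics the map, shuffle, and reduce phases for a word count; the function names and input records are assumptions made for illustration, and it is not how Hadoop itself is implemented or invoked.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run the three phases in sequence on an in-memory list of records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = run_job(["big data survey", "big data analysis", "data storage"])
print(counts)
```

In a distributed setting the shuffle moves data between machines, which is where the data transfer bottleneck noted above arises.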

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data model, and software systems supporting big data, etc. should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor that restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.

– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor in improving big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems in big data processing arise from traditional data analysis, including (i) data re-utilization: with the increase of data scale, more value may be mined from re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized, from which more value can be mined; and (iii) data exhaust, which means wrong data during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved.

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from comprehensive utilization of multiple datasets is far higher than the sum of the values of individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance is the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and from different datasets.

– Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication; big data applications in small and medium-sized businesses; big data applications in government departments; big data services; and human-computer interaction of big data, etc. are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume is fast growing, there are more severe safety risks, while the traditional data protection methods have already been shown not applicable to big data. In particular, big data safety is confronted with the following security related challenges.

– Big data privacy: Big data privacy includes two aspects: (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc. of users may be more easily acquired, and users may not be aware. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed a big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher of Skull Security, used an information acquisition tool to acquire data from the public pages of Facebook users who had failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low quality data wastes transmission and storage resources and has poor usability. Many factors may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though many measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.

– Big data safety mechanism: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods designed for small and medium-scale data cannot meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safety communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants’ data should be enabled on the premise of efficiency assurance.

– Big data application in information security: Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) after analysis of big data in the form of log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc. may also be more easily identified through the analysis of big data.
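Two of the data quality dimensions named in the list above, completeness and redundancy, lend themselves to a simple automatic check. The sketch below is one possible formulation under assumptions: the record layout, the field names, and the two ratio definitions are chosen here for illustration, and accuracy and consistency are not covered.

```python
def quality_report(records, required_fields):
    """Report completeness and redundancy for a list of flat dict records.

    completeness: share of records in which every required field is present and non-empty.
    redundancy: share of records that exactly duplicate an earlier record.
    Both definitions are illustrative assumptions, not a standard metric.
    """
    total = len(records)
    if total == 0:
        return {"completeness": 1.0, "redundancy": 0.0}
    complete = sum(
        all(r.get(field) not in (None, "") for field in required_fields)
        for r in records
    )
    # Records are treated as duplicates when all key/value pairs match.
    distinct = len({tuple(sorted(r.items())) for r in records})
    return {
        "completeness": complete / total,
        "redundancy": (total - distinct) / total,
    }


# Made-up records: one with an empty field, one exact duplicate.
records = [
    {"id": 1, "city": "Wuhan"},
    {"id": 2, "city": ""},
    {"id": 1, "city": "Wuhan"},
    {"id": 3, "city": "Auburn"},
]
report = quality_report(records, required_fields=["id", "city"])
print(report)
```

Such ratios could feed the automatic detection step called for above; repairing damaged records would need additional domain rules.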

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage of energy efficiency optimization, and processed hardware and software system architectures, etc. Particularly, big data security, including credibility, backup and recovery, completeness maintenance, and security should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the “T” (Technology) was the main concern, while technology drove the development of data. In the big data era, with the prominence of data value and advances in “I” (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact, but also influence everyone’s ways of living and thinking, which is happening right now. We cannot predict the future, but we may take precautions against possible future events.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have been concerned with better ways to cope with larger-scale, higher-diversity, and more complexly structured data. These efforts are represented by Google’s globally-distributed database Spanner and the fault-tolerant, expandable, distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge values, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc. shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a user-friendly way can they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc. are frequently used to visualize results of data analysis. New presentation forms will emerge in the future, e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it is observed that the role of data is becoming increasingly significant. In the small scale data era, in which logic is more complex than data, program design is mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed that of business logic. Consequently, programs are gradually being transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods are certain to emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

  – During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

  – Compared with accurate data, we would like to accept numerous and complicated data.

  – We shall pay greater attention to correlations between things rather than exploring causal relationships.

  – The simple algorithms of big data are more effective than complex algorithms of small data.

  – Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace “experts”.

Throughout the history of human society, the demands and willingness of human beings have always been the driving forces of scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will draw increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments: This work was supported by China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao’s research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

                        References

1. Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

2. Fact sheet: Big data across the federal government (2012). http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf

3. Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper

4. Drowning in numbers - digital data will flood the planet - and help us understand it better (2011). http://www.economist.com/blogs/dailychart/2011/11/bigdata-0

5. Lohr S (2012) The age of big data. New York Times, pp 11

6. Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/thedigitalbreadcrumbs-that-lead-to-big-data

7. Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-searchforanalysts-to-make-sense-of-big-data

8. Big data (2008). http://www.nature.com/news/specials/bigdata/index.html

9. Special online collection: dealing with big data (2011). http://www.sciencemag.org/site/special/data

10. Manyika J, McKinsey Global Institute, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute

11. Mayer-Schonberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Eamon Dolan/Houghton Mifflin Harcourt

12. Laney D (2001) 3-d data management: controlling data volume, velocity and variety. META Group Research Note, 6 February

13. Zikopoulos P, Eaton C et al (2011) Understanding big data: analytics for enterprise class hadoop and streaming data. McGraw-Hill Osborne Media

14. Meijer E (2011) The world according to linq. Commun ACM 54(10):45–51

15. Beyer M (2011) Gartner says solving big data challenge involves more than just managing volumes of data. Gartner. http://www.gartner.com/it/page.jsp

16. O R Team (2011) Big data now: current perspectives from O’Reilly Radar. O’Reilly Media

17. Grobelnik M (2012) Big data tutorial. http://videolectures.net/eswc2012grobelnikbigdata

18. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2008) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014

19. DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6):85–98

Mobile Netw Appl (2014) 19:171–209

20. Walter T (2009) Teradata past, present, and future. UCI ISG lecture series on scalable data management

21. Ghemawat S, Gobioff H, Leung S-T (2003) The google file system. In: ACM SIGOPS Operating Systems Review, vol 37. ACM, pp 29–43

22. Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

23. Hey AJG, Tansley S, Tolle KM et al (2009) The fourth paradigm: data-intensive scientific discovery

24. Howard JH, Kazar ML, Menees SG, Nichols DA, Satyanarayanan M, Sidebotham RN, West MJ (1988) Scale and performance in a distributed file system. ACM Trans Comput Syst (TOCS) 6(1):51–81

25. Cattell R (2011) Scalable sql and nosql data stores. ACM SIGMOD Record 39(4):12–27

26. Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033

27. Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98

28. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J et al (2012) Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States

29. Sun Y, Chen M, Liu B, Mao S (2013) Far: a fault-avoidant routing method for data center networks with regular topology. In: Proceedings of ACM/IEEE symposium on architectures for networking and communications systems (ANCS’13). ACM

30. Wiki (2013) Applications and organizations using hadoop. http://wiki.apache.org/hadoop/PoweredBy

31. Bahga A, Madisetti VK (2012) Analyzing massive machine maintenance data in a computing cloud. IEEE Transac Parallel Distrib Syst 23(10):1831–1843

32. Gunarathne T, Wu T-L, Choi JY, Bae S-H, Qiu J (2011) Cloud computing paradigms for pleasingly parallel biomedical applications. Concurr Comput Prac Experience 23(17):2338–2354

33. Gantz J, Reinsel D (2010) The digital universe decade - are you ready. External publication of IDC (Analyse the Future) information and data, pp 1–16

34. Bryant RE (2011) Data-intensive scalable computing for scientific applications. Comput Sci Eng 13(6):25–33

35. Wahab MHA, Mohd MNH, Hanafi HF, Mohsin MFM (2008) Data pre-processing on web server logs for generalized association rules mining algorithm. World Acad Sci Eng Technol 48:2008

36. Nanopoulos A, Manolopoulos Y, Zakrzewicz M, Morzy T (2002) Indexing web access-logs for pattern queries. In: Proceedings of the 4th international workshop on web information and data management. ACM, pp 63–68

37. Joshi KP, Joshi A, Yesha Y (2003) On using a warehouse to analyze web logs. Distrib Parallel Databases 13(2):161–180

38. Chandramohan V, Christensen K (2002) A first look at wired sensor networks for video surveillance systems. In: Proceedings LCN 2002, 27th annual IEEE conference on local computer networks. IEEE, pp 728–729

39. Selavo L, Wood A, Cao Q, Sookoor T, Liu H, Srinivasan A, Wu Y, Kang W, Stankovic J, Young D et al (2007) Luster: wireless sensor network for environmental research. In: Proceedings of the 5th international conference on Embedded networked sensor systems. ACM, pp 103–116

40. Barrenetxea G, Ingelrest F, Schaefer G, Vetterli M, Couach O, Parlange M (2008) Sensorscope: out-of-the-box environmental monitoring. In: Information processing in sensor networks, 2008, international conference on IPSN’08. IEEE, pp 332–343

41. Kim Y, Schmid T, Charbiwala ZM, Friedman J, Srivastava MB (2008) Nawms: nonintrusive autonomous water monitoring system. In: Proceedings of the 6th ACM conference on Embedded network sensor systems. ACM, pp 309–322

42. Kim S, Pakzad S, Culler D, Demmel J, Fenves G, Glaser S, Turon M (2007) Health monitoring of civil infrastructures using wireless sensor networks. In: Information Processing in Sensor Networks, 2007, 6th International Symposium on IPSN 2007. IEEE, pp 254–263

43. Ceriotti M, Mottola L, Picco GP, Murphy AL, Guna S, Corra M, Pozzi M, Zonta D, Zanon P (2009) Monitoring heritage buildings with wireless sensor networks: the torre aquila deployment. In: Proceedings of the 2009 International Conference on Information Processing in Sensor Networks. IEEE Computer Society, pp 277–288

44. Tolle G, Polastre J, Szewczyk R, Culler D, Turner N, Tu K, Burgess S, Dawson T, Buonadonna P, Gay D et al (2005) A macroscope in the redwoods. In: Proceedings of the 3rd international conference on embedded networked sensor systems. ACM, pp 51–63

45. Wang F, Liu J (2011) Networked wireless sensor data collection: issues, challenges, and approaches. IEEE Commun Surv Tutor 13(4):673–687

46. Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the 11th international conference on World Wide Web. ACM, pp 124–135

47. Choudhary S, Dincturk ME, Mirtaheri SM, Moosavi A, von Bochmann G, Jourdan G-V, Onut I-V (2012) Crawling rich internet applications: the state of the art. In: CASCON, pp 146–160

48. Ghani N, Dixit S, Wang T-S (2000) On ip-over-wdm integration. IEEE Commun Mag 38(3):72–84

49. Manchester J, Anderson J, Doshi B, Dravida S (1998) Ip over sonet. IEEE Commun Mag 36(5):136–142

50. Jinno M, Takara H, Kozicki B (2009) Dynamic optical mesh networks: drivers, challenges and solutions for the future. In: Optical communication, 2009, 35th European conference on ECOC’09. IEEE, pp 1–4

51. Barroso LA, Holzle U (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synt Lect Comput Archit 4(1):1–108

52. Armstrong J (2009) Ofdm for optical communications. J Light Technol 27(3):189–204

53. Shieh W (2011) Ofdm for flexible high-speed optical networks. J Light Technol 29(10):1560–1577

54. Cisco data center interconnect design and deployment guide (2010)

55. Greenberg A, Hamilton JR, Jain N, Kandula S, Kim C, Lahiri P, Maltz DA, Patel P, Sengupta S (2009) Vl2: a scalable and flexible data center network. In: ACM SIGCOMM computer communication review, vol 39. ACM, pp 51–62

56. Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S (2009) Bcube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 39(4):63–74

57. Farrington N, Porter G, Radhakrishnan S, Bazzaz HH, Subramanya V, Fainman Y, Papen G, Vahdat A (2011) Helios: a hybrid electrical/optical switch architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 41(4):339–350

58. Abu-Libdeh H, Costa P, Rowstron A, O’Shea G, Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Comput Commun Rev 40(4):51–62

59. Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what’s needed for datacenter network operations. IEEE Commun Mag 48(7):32–39

60. Wang G, Andersen DG, Kaminsky M, Papagiannaki K, Ng TS, Kozuch M, Ryan M (2010) c-through: Part-time optics in data centers. In: ACM SIGCOMM Computer Communication Review, vol 40. ACM, pp 327–338

61. Ye X, Yin Y, Yoo SJB, Mejia P, Proietti R, Akella V (2010) Dos: a scalable optical switch for datacenters. In: Proceedings of the 6th ACM/IEEE symposium on architectures for networking and communications systems. ACM, p 24

62. Singla A, Singh A, Ramachandran K, Xu L, Zhang Y (2010) Proteus: a topology malleable data center network. In: Proceedings of the 9th ACM SIGCOMM workshop on hot topics in networks. ACM, p 8

63. Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P (2011) Energy-efficient design of a scalable optical multiplane interconnection architecture. IEEE J Sel Top Quantum Electron 17(2):377–383

64. Kodi AK, Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance computing (hpc) systems. IEEE J Sel Top Quantum Electron 17(2):384–395

65. Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H (2012) Mirror mirror on the ceiling: flexible wireless links for data centers. ACM SIGCOMM Comput Commun Rev 42(4):443–454

66. Lenzerini M (2002) Data integration: a theoretical perspective. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. ACM, pp 233–246

67. Cafarella MJ, Halevy A, Khoussainova N (2009) Data integration for the relational web. Proc VLDB Endowment 2(1):1090–1101

68. Maletic JI, Marcus A (2000) Data cleansing: beyond integrity analysis. In: IQ. Citeseer, pp 200–209

69. Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1-2):83–113

70. Chen H, Ku W-S, Wang H, Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 51–62

71. Zhao Z, Ng W (2012) A model-based approach for rfid data stream cleansing. In: Proceedings of the 21st ACM international conference on information and knowledge management. ACM, pp 862–871

72. Khoussainova N, Balazinska M, Suciu D (2008) Probabilistic event extraction from rfid data. In: Data Engineering, 2008, IEEE 24th international conference on ICDE 2008. IEEE, pp 1480–1482

73. Herbert KG, Wang JTL (2007) Biological data cleaning: a case study. Int J Inf Qual 1(1):60–82

74. Tsai T-H, Lin C-Y (2012) Exploring contextual redundancy in improving object-based video coding for video sensor networks surveillance. IEEE Transac Multmed 14(3):669–682

75. Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 269–278

76. Kamath U, Compton J, Dogan RI, Jong KD, Shehu A (2012) An evolutionary algorithm approach for feature generation from sequence data and its application to dna splice site prediction. IEEE/ACM Transac Comput Biol Bioinforma (TCBB) 9(5):1387–1398

77. Leung K-S, Lee KH, Wang J-F, Ng EYT, Chan HLY, Tsui SKW, Mok TSK, Tse PC-H, Sung JJ-Y (2011) Data mining on dna sequences of hepatitis b virus. IEEE/ACM Transac Comput Biol Bioinforma 8(2):428–440

78. Huang Z, Shen H, Liu J, Zhou X (2011) Effective data co-reduction for multimedia similarity search. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, pp 1021–1032

79. Bleiholder J, Naumann F (2008) Data fusion. ACM Comput Surv (CSUR) 41(1):1

80. Brewer EA (2000) Towards robust distributed systems. In: PODC, p 7

81. Gilbert S, Lynch N (2002) Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59

82. McKusick MK, Quinlan S (2009) Gfs: evolution on fast-forward. ACM Queue 7(7):10

83. Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J (2008) Scope: easy and efficient parallel processing of massive data sets. Proc VLDB Endowment 1(2):1265–1276

84. Beaver D, Kumar S, Li HC, Sobel J, Vajgel P et al (2010) Finding a needle in haystack: facebook’s photo storage. In: OSDI, vol 10, pp 1–8

85. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: amazon’s highly available key-value store. In: SOSP, vol 7, pp 205–220

86. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In: Proceedings of the twenty-ninth annual ACM symposium on theory of computing. ACM, pp 654–663

87. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2):4

88. Burrows M (2006) The chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th symposium on Operating systems design and implementation. USENIX Association, pp 335–350

89. Lakshman A, Malik P (2009) Cassandra: structured storage system on a p2p network. In: Proceedings of the 28th ACM symposium on principles of distributed computing. ACM, pp 5–5

90. George L (2011) HBase: the definitive guide. O’Reilly Media Inc

91. Judd D (2008) hypertable-0.9.0.4-alpha

92. Chodorow K (2013) MongoDB: the definitive guide. O’Reilly Media Inc

93. Crockford D (2006) The application/json media type for javascript object notation (json)

94. Murty J (2009) Programming amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O’Reilly Media Inc

95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O’Reilly Media Inc

96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 975–986

97. Yang H-C, Parker DS (2009) Traverse: simplified indexing on large map-reduce-merge clusters. In: Database systems for advanced applications. Springer, pp 308–322

98. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with sawzall. Sci Program 13(4):277–298

99. Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U (2009) Building a high-level dataflow system on top of map-reduce: the pig experience. Proc VLDB Endowment 2(2):1414–1425

100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629

101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72

102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) Dryadlinq: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14

103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: Parallel and distributed processing, 2008, IEEE international symposium on IPDPS 2008. IEEE, pp 1–11

104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146

105. Bu Y, Bill H, Balazinska M, Ernst MD (2010) Haloop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296

106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative mapreduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing. ACM, pp 810–818

107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2

108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: mapreduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing. ACM, p 7

109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) Ciel: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on Networked systems design and implementation, p 9

110. Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York

111. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

112. What analytics, data mining, big data software you used in the past 12 months for a real project (2012). http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html

113. Berthold MR, Cebron N, Dill F, Gabriel TR, Kotter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer

114. Sallam RL, Richardson J, Hagerty J, Hostmann B (2011) Magic quadrant for business intelligence platforms. Gartner Group, Stamford, CT

115. Beyond the PC. Special Report on Personal Technology (2011)

116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A et al (2011) The iplant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on Software quality assurance. ACM, pp 70–77

118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers. ACM, pp 277–286

119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Record 34(2):18–26

120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM Sigmod Record 33(1):50–57

121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Transac Manag Inform Syst (TMIS) 3(2):7

122. Manning CD, Schutze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press

123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Transac Neural Netw 13(5):1163–1177

124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11

125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117

126. Konopnicki D, Shmueli O (1995) W3qs: a query system for the world-wide web. In: VLDB, vol 95, pp 54–65

127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640

128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval. ACM, p 2

129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25

130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19

131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819

132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939

133. Barragans-Martinez AB, Costa-Montenegro E, Burguillo JC, Rey-Lopez M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend tv programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311

134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91

135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia. ACM, pp 469–478

136. Hirsch JE (2005) An index to quantify an individual’s scientific research output. Proc Natl Acad Sci USA 102(46):16569

137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company

138. Aggarwal CC (2011) An introduction to social network data analytics. Springer

139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1046–1054

140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied Computing. ACM, pp 1115–1116

141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Transac Knowl Discov Data (TKDD) 5(2):10

142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on World wide web. ACM, pp 631–640

143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis. ACM, pp 16–25

144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. ACM, pp 315–321

145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 145–158

146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 131–144

147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1007–1016

148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 807–816

149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on Web search and data mining. ACM, pp 657–666

150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360

151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (last accessed 5 May 2013)

152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51

153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive rfid, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 2

154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval. ACM, pp 469–478

155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated, multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium. ACM, pp 445–454

156. Tasevski P (2011) Password attacks and generation strategies. Tartu University, Faculty of Mathematics and Computer Sciences


mechanism within the data center (i.e., on physical connection plates, chips, internal memories of data servers, network architectures of data centers, and communication protocols). A data center consists of multiple integrated server racks interconnected by its internal connection networks. Nowadays, the internal connection networks of most data centers are fat-tree, two-layer, or three-layer structures based on multi-commodity network flows [51, 54]. In the two-layer topological structure, the racks are connected by 1 Gbps top-of-rack (TOR) switches, and these TOR switches are in turn connected to 10 Gbps aggregation switches. The three-layer topological structure adds one layer on top of the two-layer structure; this layer consists of 10 Gbps or 100 Gbps core switches that connect the aggregation switches. There are also other topological structures that aim to improve data center networks [55–58]. Because of the inadequacy of electronic packet switches, it is difficult to increase communication bandwidth while keeping energy consumption low. Over the years, owing to the huge success of optical technologies, optical interconnection among the networks in data centers has drawn great interest. Optical interconnection is a high-throughput, low-delay, and low-energy-consumption solution. At present, optical technologies are used only for point-to-point links in data centers. Such optical links connect the switches using low-cost multi-mode fiber (MMF) with a 10 Gbps data rate. Optical interconnection (switching in the optical domain) of networks in data centers is a feasible solution that can provide Tbps-level transmission bandwidth with low energy consumption. Recently, many optical interconnection plans have been proposed for data center networks [59]. Some plans add optical paths to upgrade the existing networks, and other plans completely replace the current switches [59–64]. As a strengthening technology, Zhou et al. in [65] adopt wireless links in the 60 GHz frequency band to strengthen wired links. Network virtualization should also be considered to improve the efficiency and utilization of data center networks.

3.2.3 Data pre-processing

Because of the wide variety of data sources, the collected datasets vary with respect to noise, redundancy, consistency, etc., and it is undoubtedly a waste to store meaningless data. In addition, some analytical methods have stringent requirements on data quality. Therefore, to enable effective data analysis, data should under many circumstances be pre-processed to integrate the data from different sources, which not only reduces storage expense but also improves analysis accuracy. Some relational data pre-processing techniques are discussed as follows.

– Integration: data integration is the cornerstone of modern commercial informatics. It involves combining data from different sources and provides users with a uniform view of the data [66]. This is a mature research field for traditional databases. Historically, two methods have been widely recognized: data warehouse and data federation. Data warehousing includes a process named ETL (Extract, Transform, and Load). Extraction involves connecting to source systems and selecting, collecting, analyzing, and processing the necessary data. Transformation is the execution of a series of rules to transform the extracted data into standard formats. Loading means importing the extracted and transformed data into the target storage infrastructure. Loading is the most complex of the three procedures; it includes operations such as transformation, copy, clearing, standardization, screening, and data organization. With data federation, a virtual database can be built to query and aggregate data from different data sources, but such a database does not contain data. Instead, it holds information or metadata related to the actual data and its positions. These two "storage-reading" approaches do not satisfy the high performance requirements of data flows or search programs and applications. Compared with queries, data in these two approaches is more dynamic and must be processed during data transmission. Generally, data integration methods are accompanied by flow processing engines and search engines [30, 67].
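The three ETL steps described above can be illustrated with a minimal sketch. The two source layouts, the field names, and the `customer` table are hypothetical and chosen only for illustration; an in-memory SQLite database stands in for the target storage infrastructure.

```python
import csv
import io
import sqlite3

# Two hypothetical source systems with different layouts (illustrative data).
CRM_CSV = "id,name,country\n1,Alice,us\n2,Bob,DE\n"
SHOP_ROWS = [{"customer": "carol", "nation": "fr", "cid": 3}]


def extract():
    """Extract: connect to each source and collect the necessary records."""
    crm = list(csv.DictReader(io.StringIO(CRM_CSV)))
    return crm, SHOP_ROWS


def transform(crm, shop):
    """Transform: apply rules that map every record to one standard format."""
    rows = [(int(r["id"]), r["name"].title(), r["country"].upper()) for r in crm]
    rows += [(r["cid"], r["customer"].title(), r["nation"].upper()) for r in shop]
    return sorted(rows)


def load(rows):
    """Load: import the transformed rows into the target store."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, country TEXT)"
    )
    db.executemany("INSERT INTO customer VALUES (?, ?, ?)", rows)
    db.commit()
    return db


warehouse = load(transform(*extract()))
result = warehouse.execute(
    "SELECT id, name, country FROM customer ORDER BY id"
).fetchall()
```

After loading, every record follows the same (id, name, country) layout regardless of which source it came from, which is the uniform view that data integration aims for.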

– Cleaning: data cleaning is a process to identify inaccurate, incomplete, or unreasonable data and then modify or delete such data to improve data quality. Generally, data cleaning includes five complementary procedures [68]: defining and determining error types, searching for and identifying errors, correcting errors, documenting error examples and error types, and modifying data entry procedures to reduce future errors. During cleaning, data formats, completeness, rationality, and restrictions shall be inspected. Data cleaning is of vital importance for maintaining data consistency and is widely applied in many fields, such as banking, insurance, retail, telecommunications, and traffic control.

In e-commerce, most data is collected electronically and may have serious data quality problems. Classic data quality problems mainly come from software defects, customized errors, or system misconfiguration. The authors in [69] discussed data cleaning in e-commerce by crawlers and by regularly re-copying customer and account information.

In [70], the problem of cleaning RFID data was examined. RFID is widely used in many applications, e.g., inventory management and target tracking. However, raw RFID data is of low quality and includes a lot of abnormal data, because it is limited by the physical design and affected by environmental noise. In [71], a probability model was developed to cope with data loss in mobile environments. Khoussainova et al. in [72] proposed a system that automatically corrects errors in input data by defining global integrity constraints.

Herbert et al. [73] proposed a framework called BIO-AJAX to standardize biological data so as to conduct further computation and improve search quality. With BIO-AJAX, some errors and repetitions may be eliminated, and common data mining technologies can be executed more effectively.

– Redundancy elimination: data redundancy refers to data repetition or surplus, which occurs in many datasets. Data redundancy can increase unnecessary data transmission expense and cause defects in storage systems, e.g., waste of storage space, data inconsistency, reduced data reliability, and data damage. Therefore, various redundancy reduction methods have been proposed, such as redundancy detection, data filtering, and data compression. Such methods may apply to different datasets or application environments. However, redundancy reduction may also bring about certain negative effects. For example, data compression and decompression cause additional computational burden. Therefore, the benefits of redundancy reduction and its cost should be carefully balanced. Data collected from different fields will increasingly appear in image or video formats. It is well known that images and videos contain considerable redundancy, including temporal redundancy, spatial redundancy, statistical redundancy, and sensing redundancy. Video compression is widely used to reduce redundancy in video data, as specified in the many video coding standards (MPEG-2, MPEG-4, H.263, and H.264/AVC). In [74], the authors investigated the problem of video compression in a video surveillance system with a video sensor network. The authors propose a new MPEG-4 based method by investigating the contextual redundancy related to background and foreground in a scene. The low complexity and the low compression ratio of the proposed approach were demonstrated by the evaluation results.

For generalized data transmission or storage, repeated data deletion is a special data compression technology that aims to eliminate repeated data copies [75]. With repeated data deletion, individual data blocks or data segments are assigned identifiers (e.g., using a hash algorithm) and stored, with the identifiers added to an identification list. As the analysis of repeated data deletion continues, if a new data block has an identifier identical to one already in the identification list, the new data block is deemed redundant and is replaced by the corresponding stored data block. Repeated data deletion can greatly reduce storage requirements, which is particularly important for a big data storage system. Apart from the aforementioned data pre-processing methods, specific data objects shall go through some other operations, such as feature extraction. Such operations play an important role in multimedia search and DNA analysis [76–78]. Usually, high-dimensional feature vectors (or high-dimensional feature points) are used to describe such data objects, and the system stores the feature vectors for future retrieval. Data transfer is usually used to process distributed heterogeneous data sources, especially business datasets [79]. As a matter of fact, given the variety of datasets, it is non-trivial or even impossible to build a uniform data pre-processing procedure and technology that is applicable to all types of datasets. The specific features, problem, performance requirements, and other factors of the datasets should be considered so as to select a proper data pre-processing strategy.
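The repeated data deletion scheme described above (identifiers computed by a hash algorithm and kept in an identification list) can be sketched as follows. This is a minimal illustration under stated assumptions: the class name `DedupStore`, the fixed block size, and the choice of SHA-256 are hypothetical and not taken from [75].

```python
import hashlib


class DedupStore:
    """Toy block store that keeps each distinct block only once."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}  # identification list: identifier -> stored block

    def put(self, data: bytes):
        """Split data into fixed-size blocks; return the list of identifiers."""
        recipe = []
        for start in range(0, len(data), self.block_size):
            block = data[start:start + self.block_size]
            ident = hashlib.sha256(block).hexdigest()
            # A block whose identifier is already listed is redundant:
            # only the identifier is recorded, the block is not stored again.
            self.blocks.setdefault(ident, block)
            recipe.append(ident)
        return recipe

    def get(self, recipe):
        """Rebuild the original data from its list of identifiers."""
        return b"".join(self.blocks[ident] for ident in recipe)


store = DedupStore(block_size=4)
recipe_a = store.put(b"AAAABBBBAAAA")
recipe_b = store.put(b"BBBBCCCC")
```

Here five blocks are written but only three distinct blocks are stored, and both inputs can still be rebuilt from their identifier lists. The hashing on every write is the kind of extra computational cost that has to be weighed against the storage saved.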

                          4 Big data storage

The explosive growth of data imposes stricter requirements on storage and management. In this section, we focus on the storage of big data. Big data storage refers to the storage and management of large-scale datasets while achieving reliability and availability of data access. We will review important issues including massive storage systems, distributed storage systems, and big data storage mechanisms. On one hand, the storage infrastructure needs to provide an information storage service with reliable storage space; on the other hand, it must provide a powerful access interface for the query and analysis of a large amount of data.

Traditionally, as auxiliary equipment of servers, data storage devices are used to store, manage, look up, and analyze data with structured RDBMSs. With the sharp growth of data, data storage devices are becoming increasingly important, and many Internet companies pursue large storage capacity to be competitive. Therefore, there is a compelling need for research on data storage.


4.1 Storage system for massive data

Various storage systems have emerged to meet the demands of massive data. Existing massive storage technologies can be classified as Direct Attached Storage (DAS) and network storage, while network storage can be further classified into Network Attached Storage (NAS) and Storage Area Network (SAN).

In DAS, various hard disks are directly connected to servers, and data management is server-centric, such that storage devices are peripheral equipment, each of which takes a certain amount of I/O resources and is managed by individual application software. For this reason, DAS is only suitable for interconnecting servers at a small scale. Moreover, due to its low scalability, DAS exhibits undesirable efficiency when the storage capacity is increased, i.e., upgradeability and expandability are greatly limited. Thus, DAS is mainly used in personal computers and small-sized servers.

Network storage utilizes a network to provide users with a unified interface for data access and sharing. Network storage equipment includes special data exchange equipment, disk arrays, tape libraries, and other storage media, as well as special storage software. It is characterized by strong expandability.

NAS is actually auxiliary storage equipment of a network. It is directly connected to a network through a hub or switch using TCP/IP protocols. In NAS, data is transmitted in the form of files. Compared to DAS, the I/O burden at a NAS server is reduced extensively, since the server accesses a storage device indirectly through a network.

While NAS is network-oriented, SAN is especially designed for data storage with a scalable and bandwidth-intensive network, e.g., a high-speed network with optical fiber connections. In SAN, data storage management is relatively independent within a storage local area network, where multipath-based data switching among any internal nodes is utilized to achieve a maximum degree of data sharing and data management.

From the organization of a data storage system, DAS, NAS, and SAN can all be divided into three parts: (i) the disk array, which is the foundation of a storage system and the fundamental guarantee for data storage; (ii) connection and network sub-systems, which provide connection among one or more disk arrays and servers; (iii) storage management software, which handles data sharing, disaster recovery, and other storage management tasks of multiple servers.

4.2 Distributed storage system

The first challenge brought about by big data is how to develop a large-scale distributed storage system for efficient data processing and analysis. To use a distributed system to store massive data, the following factors should be taken into consideration:

– Consistency: a distributed storage system requires multiple servers to cooperatively store data. As there are more servers, the probability of server failures becomes larger. Usually, data is divided into multiple pieces to be stored at different servers to ensure availability in case of server failure. However, server failures and parallel storage may cause inconsistency among different copies of the same data. Consistency refers to assuring that multiple copies of the same data are identical.

– Availability: a distributed storage system operates on multiple sets of servers. As more servers are used, server failures are inevitable. It is desirable that the entire system not be seriously affected, so that it can still satisfy customers' requests for reading and writing. This property is called availability.

– Partition tolerance: multiple servers in a distributed storage system are connected by a network. The network could have link/node failures or temporary congestion. The distributed system should have a certain level of tolerance to problems caused by network failures. It is desirable that the distributed storage still works well when the network is partitioned.

Eric Brewer proposed the CAP theory [80, 81] in 2000, which indicates that a distributed system cannot simultaneously meet the requirements of consistency, availability, and partition tolerance; at most two of the three requirements can be satisfied simultaneously. Seth Gilbert and Nancy Lynch from MIT proved the correctness of the CAP theory in 2002. Since consistency, availability, and partition tolerance cannot all be achieved simultaneously, we can have a CA system by ignoring partition tolerance, a CP system by ignoring availability, or an AP system by ignoring consistency, according to different design goals. The three systems are discussed in the following.

CA systems do not have partition tolerance, i.e., they cannot handle network failures. Therefore, CA systems are generally deemed to be storage systems with a single server, such as traditional small-scale relational databases. Such systems keep a single copy of data, so consistency is easily ensured. Availability is guaranteed by the excellent design of relational databases. However, since CA systems cannot handle network failures, they cannot be expanded to use many servers. Therefore, most large-scale storage systems are CP systems or AP systems.

Compared with CA systems, CP systems ensure partition tolerance. Therefore, CP systems can be expanded to become distributed systems. CP systems generally maintain several copies of the same data in order to ensure a level of fault tolerance. CP systems also ensure data consistency, i.e., multiple copies of the same data are guaranteed to be completely identical. However, CP systems cannot ensure sound availability because of the high cost of consistency assurance. Therefore, CP systems are useful for scenarios with moderate load but stringent requirements on data accuracy (e.g., trading data). BigTable and HBase are two popular CP systems.

AP systems also ensure partition tolerance. However, AP systems differ from CP systems in that they also ensure availability. In exchange, AP systems only ensure eventual consistency rather than the strong consistency of the previous two systems. Therefore, AP systems only apply to scenarios with frequent requests but not very high requirements on accuracy. For example, in online Social Networking Services (SNS) systems, there are many concurrent visits to the data, but a certain amount of data errors is tolerable. Furthermore, because AP systems ensure eventual consistency, accurate data can still be obtained after a certain amount of delay. Therefore, AP systems may also be used in circumstances with no stringent real-time requirements. Dynamo and Cassandra are two popular AP systems.
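The difference between CP and AP behavior during a network partition can be made concrete with a toy model. The sketch below is only an illustration: the class `TinyStore`, the two-replica setup, and the last-writer-wins reconciliation rule are assumptions made for this example and are not how any of the systems named above are implemented.

```python
class Replica:
    """One copy of a single value, tagged with the version that wrote it."""

    def __init__(self):
        self.value = None
        self.version = 0


class TinyStore:
    """Two replicas of one value; `mode` is "CP" or "AP" (toy model)."""

    def __init__(self, mode):
        self.mode = mode
        self.replicas = [Replica(), Replica()]
        self.partitioned = False
        self.clock = 0  # logical clock used to order writes

    def write(self, replica_id, value):
        self.clock += 1
        if self.partitioned:
            if self.mode == "CP":
                return False  # reject the write: consistency over availability
            targets = [self.replicas[replica_id]]  # AP: accept on one side only
        else:
            targets = self.replicas
        for replica in targets:
            replica.value, replica.version = value, self.clock
        return True

    def read(self, replica_id):
        return self.replicas[replica_id].value

    def heal(self):
        """End the partition and reconcile copies (last writer wins)."""
        self.partitioned = False
        newest = max(self.replicas, key=lambda r: r.version)
        for replica in self.replicas:
            replica.value, replica.version = newest.value, newest.version


cp = TinyStore("CP")
cp.write(0, "a")
cp.partitioned = True
cp_accepted = cp.write(0, "b")  # refused while partitioned

ap = TinyStore("AP")
ap.write(0, "a")
ap.partitioned = True
ap_accepted = ap.write(0, "b")  # accepted by replica 0 only
stale = ap.read(1)              # replica 1 still serves the old value
ap.heal()
healed = ap.read(1)             # copies converge after the partition ends
```

The CP store stays consistent but refuses the write, while the AP store stays available, serves a stale value for a while, and converges once the partition heals, which is the eventual consistency described above.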

4.3 Storage mechanism for big data

Considerable research on big data promotes the development of storage mechanisms for big data. Existing storage mechanisms for big data may be classified into three bottom-up levels: (i) file systems, (ii) databases, and (iii) programming models.

File systems are the foundation of the applications at upper levels. Google's GFS is an expandable distributed file system to support large-scale, distributed, data-intensive applications [25]. GFS uses cheap commodity servers to achieve fault tolerance and provides customers with high-performance services. GFS supports large-scale file applications with more frequent reading than writing. However, GFS also has some limitations, such as a single point of failure and poor performance for small files. Such limitations have been overcome by Colossus [82], the successor of GFS.

In addition, other companies and researchers also have their own solutions to meet the different demands for storage of big data. For example, HDFS and Kosmosfs are open source derivatives of GFS. Microsoft developed Cosmos [83] to support its search and advertisement business. Facebook utilizes Haystack [84] to store its large amount of small-sized photos. Taobao also developed TFS and FastDFS. In conclusion, distributed file systems have become relatively mature after years of development and business operation. Therefore, we will focus on the other two levels in the rest of this section.

4.3.1 Database technology

Database technology has been evolving for more than 30 years. Various database systems have been developed to handle datasets at different scales and to support various applications. Traditional relational databases cannot meet the challenges on categories and scales brought about by big data. NoSQL databases (i.e., non-traditional relational databases) are becoming more popular for big data storage. NoSQL databases feature flexible schemas, support for simple and easy copy, simple APIs, eventual consistency, and support for large-volume data. NoSQL databases are becoming the core technology for big data. We will examine the following three main NoSQL databases in this section: key-value databases, column-oriented databases, and document-oriented databases, each based on certain data models.

– Key-value databases: key-value databases are constituted by a simple data model in which data is stored as key-value pairs. Every key is unique, and customers may query values according to their keys. Such databases feature a simple structure, and modern key-value databases are characterized by higher expandability and shorter query response times than relational databases. Over the past few years, many key-value databases have appeared, motivated by Amazon's Dynamo system [85]. We will introduce Dynamo and several other representative key-value databases.

– Dynamo: Dynamo is a highly available and expandable distributed key-value data storage system. It is used to store and manage the status of some core services in the Amazon e-Commerce Platform, which can be realized with key access. The common pattern of using relational databases may generate invalid data and limit data scale and availability, while Dynamo can resolve these problems with a simple key-object interface constituted by simple reading and writing operations. Dynamo achieves elasticity and availability through its data partition, data copy, and object edition mechanisms. Dynamo's partition plan relies on consistent hashing [86] to divide the load among multiple main storage machines; its main advantage is that node passing affects only the directly adjacent nodes and does not affect other nodes. Dynamo copies data to N sets of servers, where N is a configurable parameter, in order to achieve high availability and durability. The Dynamo system also provides eventual consistency, so as to conduct asynchronous updates on all copies.

– Voldemort: Voldemort is also a key-value storage system, which was initially developed for and is still used by LinkedIn. Keys and values in Voldemort are composite objects constituted by tables and images. The Voldemort interface includes three simple operations: reading, writing, and deletion, all of which are addressed by keys. Voldemort provides asynchronous updating with concurrency control of multiple editions but does not ensure data consistency. However, Voldemort supports optimistic locking for consistent multi-record updating. When a conflict happens between an update and any other operation, the update operation quits. The data copy mechanism of Voldemort is the same as that of Dynamo. Voldemort not only stores data in RAM but also allows data to be inserted into a storage engine. In particular, Voldemort supports two storage engines: Berkeley DB and Random Access Files.

Key-value databases emerged only a few years ago. Deeply influenced by Amazon Dynamo DB, other key-value storage systems include Redis, Tokyo Cabinet and Tokyo Tyrant, Memcached and Memcache DB, Riak, and Scalaris, all of which provide expandability by distributing keys among nodes. Voldemort, Riak, Tokyo Cabinet, and Memcached can utilize attached storage devices to store data in RAM or on disks. Other storage systems store data in RAM and provide disk backup, or rely on copy and recovery to avoid backup.
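The consistent hashing idea that Dynamo's partition plan relies on, together with copying each key to N servers, can be sketched as follows. This is a simplified illustration and not Dynamo's implementation: the class `HashRing`, the use of SHA-256, a single ring position per node, and the node names are all assumptions made for this example.

```python
import bisect
import hashlib


def ring_position(name: str) -> int:
    """Map a node name or a key to a position on the hash ring."""
    return int(hashlib.sha256(name.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=()):
        self._positions = []  # sorted ring positions of all nodes
        self._owner = {}      # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        pos = ring_position(node)
        bisect.insort(self._positions, pos)
        self._owner[pos] = node

    def remove(self, node):
        pos = ring_position(node)
        self._positions.remove(pos)
        del self._owner[pos]

    def lookup(self, key):
        """Return the first node clockwise from the key's position."""
        return self.preference_list(key, 1)[0]

    def preference_list(self, key, n):
        """Return the n nodes clockwise from the key (its replica holders)."""
        start = bisect.bisect_right(self._positions, ring_position(key))
        count = min(n, len(self._positions))
        size = len(self._positions)
        return [self._owner[self._positions[(start + i) % size]]
                for i in range(count)]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"key-{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}

ring.add("node-d")  # a new node joins the ring
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Every key in `moved` now belongs to the new node, and all of them were taken from a single neighbor on the ring; keys owned by the other nodes are untouched. Production systems typically place each physical node at many ring positions to even out the load, which this sketch omits.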

– Column-oriented databases: column-oriented databases store and process data according to columns rather than rows. Both columns and rows are segmented across multiple nodes to realize expandability. Column-oriented databases are mainly inspired by Google's BigTable. In this section, we first discuss BigTable and then introduce several derivative tools.

– BigTable: BigTable is a distributed, structured data storage system designed to process large-scale (PB class) data among thousands of commercial servers [87]. The basic data structure of BigTable is a multi-dimensional sorted map with sparse, distributed, and persistent storage. The indexes of the map are the row key, column key, and timestamp, and every value in the map is an uninterpreted byte array. Each row key in BigTable is a character string of up to 64 KB. Rows are stored in lexicographical order and continually segmented into Tablets (i.e., units of distribution) for load balance. Thus, reading a short range of rows can be highly efficient, since it only involves communication with a small portion of machines. The columns are grouped according to the prefixes of their keys, thus forming column families. These column families are the basic units for access control. The timestamps are 64-bit integers that distinguish different editions of cell values. Clients may flexibly determine the number of cell editions stored. These editions are sequenced in descending order of timestamps, so the latest edition is always read first.

The BigTable API features the creation and deletion of tables and column families, as well as the modification of metadata of clusters, tables, and column families. Client applications may insert or delete values in BigTable, query values from columns, or browse sub-datasets in a table. BigTable also supports some other characteristics, such as transaction processing within a single row. Users may utilize such features to conduct more complex data processing.

Every BigTable implementation includes three main components: the Master server, Tablet servers, and the client library. BigTable allows only one Master server, which is responsible for assigning Tablets to Tablet servers, detecting added or removed Tablet servers, and conducting load balance. In addition, it handles BigTable schema modifications, e.g., creating tables and column families, and collects garbage saved in GFS, i.e., deleted or disabled files used by specific BigTable instances. Every Tablet server manages a set of Tablets and is responsible for the reading and writing of its loaded Tablets. When Tablets become too big, they are segmented by the server. The application client library is used to communicate with BigTable instances.

BigTable is based on many fundamental components of Google, including GFS [25], the cluster management system, the SSTable file format, and Chubby [88]. GFS is used to store data and log files. The cluster management system is responsible for task scheduling, resource sharing, processing of machine failures, and monitoring of machine statuses. The SSTable file format is used to store BigTable data internally, and it provides a persistent, ordered, and immutable mapping between keys and values, both of which are arbitrary byte strings. BigTable utilizes Chubby for the following tasks: 1) ensure there is at most one active Master copy at any time; 2) store the bootstrap location of BigTable data; 3) look up Tablet servers; 4) conduct error recovery in case of Tablet server failures; 5) store BigTable schema information; 6) store the access control table.
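The BigTable data model described above, i.e., a sparse map indexed by row key, column key, and timestamp, with rows kept in lexicographical order and cell editions kept newest first, can be sketched in a few lines. This is a single-process toy and not Google's implementation: the class `TinyTable`, its method names, and the example row and column names are hypothetical.

```python
from collections import defaultdict


class TinyTable:
    """Toy model of a (row key, column key, timestamp) -> bytes map."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # (row, column) -> list of (timestamp, value), newest first
        self._cells = defaultdict(list)

    def put(self, row, column, timestamp, value: bytes):
        versions = self._cells[(row, column)]
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]  # keep only the newest editions

    def get(self, row, column, timestamp=None):
        """Return the latest value, or the latest one at or before timestamp."""
        for ts, value in self._cells.get((row, column), []):
            if timestamp is None or ts <= timestamp:
                return value
        return None

    def scan(self, start_row, end_row):
        """Yield (row, column, latest value) for start_row <= row < end_row,
        in lexicographical row order."""
        keys = sorted(k for k in self._cells if start_row <= k[0] < end_row)
        for row, column in keys:
            yield row, column, self._cells[(row, column)][0][1]


table = TinyTable(max_versions=2)
table.put("com.cnn.www", "contents:", 1, b"v1")
table.put("com.cnn.www", "contents:", 2, b"v2")
table.put("com.cnn.www", "contents:", 3, b"v3")
table.put("com.bbc.www", "contents:", 1, b"bbc")

latest = table.get("com.cnn.www", "contents:")
as_of_2 = table.get("com.cnn.www", "contents:", timestamp=2)
as_of_1 = table.get("com.cnn.www", "contents:", timestamp=1)  # edition dropped
rows = list(table.scan("com.a", "com.d"))
```

Because rows are ordered by key, a scan over a short key range touches only neighboring rows; in a distributed deployment, such a range would fall within a small number of Tablets.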

– Cassandra: Cassandra is a distributed storage system to manage the huge amount of structured data distributed among multiple commercial servers [89]. The system was developed by Facebook and became an open source tool in 2008. It adopts the ideas and concepts of both Amazon Dynamo and Google BigTable, especially integrating the distributed system technology of Dynamo with the BigTable data model. Tables in Cassandra take the form of a distributed four-dimensional structured map, where the four dimensions are row, column, column family, and super column. A row is distinguished by a string key of arbitrary length. No matter how many columns are read or written, an operation on a row is atomic. Columns may constitute clusters, which are called column families and are similar to those in the data model of BigTable. Cassandra provides two kinds of column families: column families and super columns. A super column includes an arbitrary number of columns related to the same name. A column family includes columns and super columns, which may be continuously inserted into the column family during runtime. The partition and copy mechanisms of Cassandra are very similar to those of Dynamo, so as to achieve consistency.

– Derivative tools of BigTable: since the BigTable code cannot be obtained through an open source license, some open source projects compete to implement the BigTable concept and develop similar systems, such as HBase and Hypertable.

HBase is a BigTable clone programmed in Java and is a part of Hadoop, Apache's MapReduce framework [90]. HBase replaces GFS with HDFS. It writes updated contents into RAM and regularly writes them into files on disks. Row operations are atomic operations equipped with row-level locking and transaction processing, which is optional for large scale. Partition and distribution are transparently operated and have space for client hash or fixed key.

HyperTable was developed, similar to BigTable, to obtain a set of high-performance, expandable, distributed storage and processing systems for structured and unstructured data [91]. HyperTable relies on distributed file systems, e.g., HDFS, and a distributed lock manager. Its data representation, processing, and partition mechanisms are similar to those in BigTable. HyperTable has its own query language, called HyperTable Query Language (HQL), which allows users to create, modify, and query underlying tables.

Since the column-oriented storage databases mainly emulate BigTable, their designs are all similar, except for the concurrency mechanism and several other features. For example, Cassandra emphasizes weak consistency with multi-version concurrency control, while HBase and Hypertable focus on strong consistency through locks or log records.

- Document Database: Compared with key-value storage, document storage can support more complex data forms. Since documents do not follow a strict schema, there is no need to conduct schema migration. In addition, key-value pairs can still be saved. We will examine three important representatives of document storage systems, i.e., MongoDB, SimpleDB, and CouchDB.

- MongoDB: MongoDB is an open-source, document-oriented database [92]. MongoDB stores documents as Binary JSON (BSON) objects [93], which are similar to objects. Every document has an ID field as the primary key. Queries in MongoDB are expressed with a syntax similar to JSON. A database driver sends the query as a BSON object to MongoDB. The system allows queries on all documents, including embedded objects and arrays. To enable rapid queries, indexes can be created on the queryable fields of documents. Replication in MongoDB is executed with log files in the master nodes, which record all the high-level operations conducted in the database. During replication, the slaves query the master for all the write operations since the last synchronization and execute the operations in the log files on their local databases. MongoDB supports horizontal scaling with automatic sharding to distribute data among thousands of nodes, with automatic load balancing and failover.


- SimpleDB: SimpleDB is a distributed database and a web service of Amazon [94]. Data in SimpleDB is organized into various domains, in which data may be stored, acquired, and queried. Domains contain items with different properties and sets of name/value pairs. Data is replicated to different machines at different data centers in order to ensure data safety and improve performance. The system does not support automatic partitioning and thus cannot scale with changes in data volume. SimpleDB allows users to query with SQL-like statements. It is worth noting that SimpleDB can assure eventual consistency but does not support Multi-Version Concurrency Control (MVCC). Therefore, conflicts cannot be detected from the client side.

- CouchDB: Apache CouchDB is a document-oriented database written in Erlang [95]. Data in CouchDB is organized into documents consisting of fields named by keys/names and values, which are stored and accessed as JSON objects. Every document is provided with a unique identifier. CouchDB allows access to database documents through a RESTful HTTP API. If a document needs to be modified, the client must download the entire document, modify it, and then send it back to the database. Each time a document is rewritten, its revision identifier is updated. CouchDB utilizes optimistic replication to obtain scalability without a sharding mechanism. Since multiple CouchDB instances may run simultaneously alongside other transactions, any kind of replication topology can be built. The consistency of CouchDB relies on the replication mechanism. CouchDB supports MVCC with historical hash records.
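The download, modify, and send-back cycle described for CouchDB, combined with hash-based revision records, amounts to optimistic concurrency control. The following is a minimal in-memory sketch of that idea only; the class `TinyDocStore`, its methods, and the revision format are hypothetical names chosen for illustration and are not CouchDB's actual API.

```python
import hashlib
import json


class ConflictError(Exception):
    """Raised when a write is based on a stale revision."""


class TinyDocStore:
    """Toy document store with revision checks (illustrative, not CouchDB)."""

    def __init__(self):
        self._docs = {}  # doc_id -> (revision, body)

    @staticmethod
    def _next_rev(prev_rev, body):
        # Assumed revision format: "<generation>-<hash prefix of the body>".
        generation = int(prev_rev.split("-")[0]) + 1 if prev_rev else 1
        payload = json.dumps(body, sort_keys=True).encode("utf-8")
        return f"{generation}-{hashlib.sha256(payload).hexdigest()[:8]}"

    def get(self, doc_id):
        rev, body = self._docs[doc_id]
        return rev, dict(body)  # the client receives a copy of the whole document

    def put(self, doc_id, body, base_rev=None):
        current = self._docs.get(doc_id)
        current_rev = current[0] if current else None
        if base_rev != current_rev:
            # Another writer updated the document first: reject, do not overwrite.
            raise ConflictError(f"stale revision {base_rev!r}, current is {current_rev!r}")
        new_rev = self._next_rev(current_rev, body)
        self._docs[doc_id] = (new_rev, dict(body))
        return new_rev


store = TinyDocStore()
rev1 = store.put("user-1", {"name": "Ada"})
rev, doc = store.get("user-1")
doc["city"] = "London"
rev2 = store.put("user-1", doc, base_rev=rev)  # succeeds, the revision advances
```

A client that still holds `rev1` and tries to write after this point gets a `ConflictError`, which is how the conflict becomes visible on the client side.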

Big data are generally stored in hundreds or even thousands of commodity servers. Thus, the traditional parallel models, such as Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), may not be adequate to support such large-scale parallel programs. Recently, some proposed parallel programming models have effectively improved the performance of NoSQL and reduced the performance gap to relational databases. Therefore, these models have become the cornerstone for the analysis of massive data.

- MapReduce: MapReduce [22] is a simple but powerful programming model for large-scale computing, using large clusters of commodity PCs to achieve automatic parallel processing and distribution. The MapReduce computing model has only two functions, i.e., Map and Reduce, both of which are programmed by users. The Map function processes input key-value pairs and generates intermediate key-value pairs. MapReduce then combines all the intermediate values related to the same key and transmits them to the Reduce function, which further compresses the value set into a smaller set. MapReduce has the advantage that it avoids the complicated steps of developing parallel applications, e.g., data scheduling, fault tolerance, and inter-node communications. The user only needs to program the two functions to develop a parallel application. The initial MapReduce framework did not support multiple datasets in a task, which has been mitigated by some recent enhancements [96, 97].

Over the past decades, programmers have become familiar with the advanced declarative language SQL, often used in relational databases for task description and dataset analysis. However, the succinct MapReduce framework only provides two nontransparent functions, which cannot cover all the common operations. Therefore, programmers have to spend time programming basic functions, which are typically hard to maintain and reuse. In order to improve programming efficiency, some advanced language systems have been proposed, e.g., Sawzall [98] of Google, Pig Latin [99] of Yahoo, Hive [100] of Facebook, and Scope [87] of Microsoft.

- Dryad: Dryad [101] is a general-purpose distributed execution engine for processing coarse-grained data-parallel applications. The operational structure of Dryad is a directed acyclic graph, in which vertices represent programs and edges represent data channels. Dryad executes operations on the vertices in clusters and transmits data via data channels, including files, TCP connections, and shared-memory FIFOs. During operation, resources in the logical operation graph are automatically mapped to physical resources.

The operation structure of Dryad is coordinated by a central program called the job manager, which can be executed in clusters or on workstations through the network. A job manager consists of two parts: 1) application code, which is used to build a job communication graph, and 2) program library code, which is used to arrange available resources. All kinds of data are transmitted directly between vertices. Therefore, the job manager is only responsible for decision-making and does not obstruct any data transmission.

In Dryad, application developers can flexibly choose any directed acyclic graph to describe the communication modes of the application and express data transmission mechanisms. In addition, Dryad allows vertices to use any number of input and output datasets, while MapReduce supports only one input set and one output set.


DryadLINQ [102] is the advanced language of Dryad and is used to integrate the aforementioned SQL-like language execution environment.

- All-Pairs: All-Pairs [103] is a system specially designed for biometrics, bioinformatics, and data mining applications. It focuses on comparing element pairs in two datasets by a given function. All-Pairs can be expressed as a three-tuple (Set A, Set B, Function F), in which Function F is used to compare all elements in Set A with all elements in Set B. The comparison result is an output matrix M, which is also called the Cartesian product or cross join of Set A and Set B.

All-Pairs is implemented in four phases: system modeling, distribution of input data, batch job management, and result collection. In Phase I, an approximation model of system performance is built to evaluate how much CPU resource is needed and how to conduct job partitioning. In Phase II, a spanning tree is built for data transmission, which allows the workload of every partition to retrieve input data effectively. In Phase III, after the data flow is delivered to the proper nodes, the All-Pairs engine builds a batch-processing submission for the jobs in partitions, sequences them in the batch processing system, and formulates a node running command to acquire data. In the last phase, after the batch processing system completes the jobs, the extraction engine collects the results and combines them in a proper structure, which is generally a single file list in which all results are put in order.

- Pregel: The Pregel [104] system of Google facilitates the processing of large graphs, e.g., the analysis of network graphs and social networking services. A computational task is expressed by a directed graph consisting of vertices and directed edges. Every vertex is associated with a modifiable, user-defined value, and every directed edge leaving a source vertex consists of a user-defined value and the identifier of a target vertex. Once the graph is built, the program conducts iterative calculations called supersteps, separated by global synchronization points, until the algorithm completes and the output is produced. In every superstep, vertex computations run in parallel, and every vertex executes the same user-defined function that expresses the logic of a given algorithm. Every vertex may modify its own state and that of its outgoing edges, receive messages sent in the previous superstep, send messages to other vertices, and even modify the topological structure of the entire graph. Edges have no computations of their own. Any vertex may deactivate itself by suspension. When all vertices are inactive and there are no messages left to transmit, the entire program execution is completed. The output of a Pregel program is a set consisting of the values output by all the vertices. Generally speaking, the input and output of a Pregel program are isomorphic directed graphs.
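The superstep model described above can be illustrated with a small single-process simulation. This is only a sketch under simplifying assumptions: the function name `pregel_max` is hypothetical, vertices run sequentially rather than in parallel, every vertex suspends itself at the end of each superstep, and the per-vertex function simply propagates the largest value in the graph.

```python
def pregel_max(graph, values):
    """Propagate the maximum vertex value through a directed graph.

    graph:  dict mapping each vertex to a list of target vertices
            (every target must also appear as a key).
    values: dict mapping each vertex to its initial numeric value.
    Returns (final_values, number_of_supersteps).
    """
    values = dict(values)
    inbox = {v: [] for v in graph}   # messages delivered at the last barrier
    active = set(graph)              # every vertex is active in superstep 0
    superstep = 0

    while active or any(inbox.values()):
        outbox = {v: [] for v in graph}
        # An incoming message reactivates a suspended vertex.
        to_run = active | {v for v, msgs in inbox.items() if msgs}
        for v in to_run:
            # The same user-defined function runs on every vertex.
            new_value = max([values[v]] + inbox[v])
            changed = new_value != values[v]
            values[v] = new_value
            if superstep == 0 or changed:
                for target in graph[v]:
                    outbox[target].append(new_value)
        active = set()    # each vertex suspends itself after computing
        inbox = outbox    # global synchronization point between supersteps
        superstep += 1

    return values, superstep


example_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
final_values, steps = pregel_max(example_graph, {"a": 3, "b": 6, "c": 2})
```

The run ends exactly when no vertex is active and no message is in flight, which mirrors the termination condition stated in the text.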

Inspired by the above programming models, other research has also focused on programming models for more complex computational tasks, e.g., iterative computations [105, 106], fault-tolerant memory computations [107], incremental computations [108], and data-dependent flow control decision-making [109].
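As a concrete reference point for the programming models in this subsection, the Map and Reduce functions described earlier can be sketched in a few lines. This is a single-process illustration only: `run_mapreduce`, `map_fn`, and `reduce_fn` are hypothetical names, and the scheduling, distribution, and fault tolerance that a real framework supplies are deliberately left out.

```python
from collections import defaultdict


def map_fn(_doc_id, text):
    # Map: emit an intermediate (word, 1) pair for every word in the input value.
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    # Reduce: compress all values for one key into a smaller set (here, one sum).
    yield word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Run mapper and reducer over (key, value) records in one process."""
    intermediate = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            # Group intermediate values by key (the "shuffle" step).
            intermediate[out_key].append(out_value)

    results = {}
    for out_key in sorted(intermediate):
        for final_key, final_value in reducer(out_key, intermediate[out_key]):
            results[final_key] = final_value
    return results


word_counts = run_mapreduce(
    [("doc1", "big data big"), ("doc2", "data center")],
    map_fn,
    reduce_fn,
)
```

The user writes only `map_fn` and `reduce_fn`; everything in `run_mapreduce` stands in for what the framework would do across many machines.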

5 Big data analysis

The analysis of big data mainly involves analytical methods for traditional data and big data, analytical architecture for big data, and software used for mining and analysis of big data. Data analysis is the final and most important phase in the value chain of big data, with the purpose of extracting useful values and providing suggestions or decisions. Different levels of potential value can be generated through the analysis of datasets in different fields [10]. However, data analysis is a broad area, which frequently changes and is extremely complex. In this section, we introduce the methods, architectures, and tools for big data analysis.

5.1 Traditional data analysis

Traditional data analysis means using proper statistical methods to analyze massive data, to concentrate, extract, and refine useful data hidden in a batch of chaotic datasets, and to identify the inherent law of the subject matter, so as to maximize the value of data. Data analysis plays a huge guidance role in making development plans for a country, understanding customer demands for commerce, and predicting market trends for enterprises. Big data analysis can be deemed as the analysis technique for a special kind of data. Therefore, many traditional data analysis methods may still be utilized for big data analysis. Several representative traditional data analysis methods are examined in the following, many of which are from statistics and computer science.

- Cluster Analysis: a statistical method for grouping objects, and specifically, classifying objects according to some features. Cluster analysis is used to differentiate objects with particular features and divide them into categories (clusters) according to these features, such that objects in the same category have high homogeneity while different categories have high heterogeneity. Cluster analysis is an unsupervised learning method that requires no training data.

- Factor Analysis: basically targeted at describing the relation among many elements with only a few factors, i.e., grouping several closely related variables into a factor, and the few factors are then used to reveal most of the information in the original data.


- Correlation Analysis: an analytical method for determining the law of relations, such as correlation, correlative dependence, and mutual restriction, among observed phenomena, and accordingly conducting forecast and control. Such relations may be classified into two types: (i) function, reflecting a strict dependence relationship among phenomena, which is also called a definitive dependence relationship; and (ii) correlation, covering undetermined or inexact dependence relations, in which a numerical value of one variable may correspond to several numerical values of the other variable, and such numerical values fluctuate regularly around their mean values.

- Regression Analysis: a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies dependence relationships among variables hidden by randomness. Regression analysis can turn complex and undetermined correlations among variables into simple and regular ones.

- A/B Testing: also called bucket testing. It is a technique for determining how to improve target variables by comparing tested groups. Big data will require a large number of tests to be executed and analyzed.

- Statistical Analysis: Statistical analysis is based on statistical theory, a branch of applied mathematics. In statistical theory, randomness and uncertainty are modeled with probability theory. Statistical analysis can provide a description and an inference for big data. Descriptive statistical analysis can summarize and describe datasets, while inferential statistical analysis can draw conclusions from data subject to random variations. Statistical analysis is widely applied in the economic and medical care fields [110].

- Data Mining Algorithms: Data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. In 2006, the IEEE International Conference on Data Mining (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [111], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, and CART. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and link mining, all of which are among the most important problems in data mining research.
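Two of the traditional methods listed above, correlation analysis and regression analysis, are easy to make concrete. The sketch below computes the Pearson correlation coefficient and an ordinary least-squares line for one explanatory variable; the function names and the sample data are illustrative assumptions, and real analyses would normally rely on a statistics library.

```python
from math import sqrt


def _centered_sums(xs, ys):
    # Shared sums of squared and cross deviations from the means.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return mean_x, mean_y, sxy, sxx, syy


def pearson_r(xs, ys):
    """Correlation analysis: strength of the linear relation, in [-1, 1]."""
    _, _, sxy, sxx, syy = _centered_sums(xs, ys)
    return sxy / sqrt(sxx * syy)


def fit_line(xs, ys):
    """Regression analysis: least-squares slope and intercept of y on x."""
    mean_x, mean_y, sxy, sxx, _ = _centered_sums(xs, ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
r = pearson_r(xs, ys)
slope, intercept = fit_line(xs, ys)
```

For this exact linear data the coefficient is 1 and the fitted line recovers slope 2 and intercept 1, which corresponds to the "function" type of relation; noisy data would give a coefficient strictly between -1 and 1.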

5.2 Big data analytic methods

At the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data so as to bring value to enterprises and individuals. At present, the main processing methods for big data are as follows.

- Bloom Filter: A Bloom filter consists of a series of hash functions. The principle of a Bloom filter is to store hash values of data, rather than the data itself, in a bit array, which is in essence a bitmap index that uses hash functions to conduct lossy compressed storage of data. It has such advantages as high space efficiency and high query speed, but also has some disadvantages, namely misrecognition (false positives) and difficulty of deletion.

- Hashing: a method that essentially transforms data into shorter fixed-length numerical values or index values. Hashing has such advantages as rapid reading and writing and high query speed, but it is hard to find a sound hash function.

- Index: An index is always an effective method to reduce the expense of disk reading and writing and to improve insertion, deletion, modification, and query speeds, both in traditional relational databases that manage structured data and in other technologies that manage semi-structured and unstructured data. However, an index has the disadvantage of the additional cost of storing index files, which must be maintained dynamically when data is updated.

- Trie: also called trie tree, a variant of the hash tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of a trie is to utilize common prefixes of character strings to reduce comparisons on character strings to the greatest extent, so as to improve query efficiency.

- Parallel Computing: Compared to traditional serial computing, parallel computing refers to simultaneously utilizing several computing resources to complete a computation task. Its basic idea is to decompose a problem into parts and assign them to several separate processes to be completed independently, so as to achieve co-processing. Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see a comparison in Table 1).
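Of the methods in the list above, the Bloom filter is the one whose trade-offs are easiest to see in code. The following is a minimal sketch, assuming a fixed bit-array size and deriving the hash functions from SHA-256 with different seeds; the class name, default sizes, and hashing scheme are illustrative choices rather than a prescribed design.

```python
import hashlib


class BloomFilter:
    """Stores hash positions of items in a bit array, not the items themselves."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by seeding one hash function.
        data = str(item).encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


seen = BloomFilter()
for name in ["hadoop", "dryad", "pregel"]:
    seen.add(name)
```

Membership tests never miss an added item, but an item that was never added can occasionally test positive, and clearing bits to delete one item could erase evidence of another. These are the misrecognition and deletion drawbacks noted in the list.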

Although parallel computing systems or tools such as MapReduce and Dryad are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages are being developed based on these systems. Such high-level languages include Sawzall, Pig, and Hive, used for MapReduce, as well as Scope and DryadLINQ, used for Dryad.
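To close this subsection with one more of the listed methods, the trie's use of shared prefixes for rapid retrieval and word frequency statistics can be sketched as follows. The class and method names are hypothetical, and the structure is kept deliberately small.

```python
class Trie:
    """Prefix tree: words with a common prefix share the same path of nodes."""

    def __init__(self):
        self.children = {}  # next character -> child node
        self.count = 0      # how many times a word ending here was inserted

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.count += 1

    def _find(self, prefix):
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def frequency(self, word):
        # Word frequency statistics: one step per character of the word.
        node = self._find(word)
        return node.count if node else 0

    def with_prefix(self, prefix):
        # Rapid retrieval: walk the prefix once, then collect the subtree.
        start = self._find(prefix)
        if start is None:
            return []
        found, stack = [], [(start, prefix)]
        while stack:
            node, text = stack.pop()
            if node.count:
                found.append(text)
            for ch, child in node.children.items():
                stack.append((child, text + ch))
        return sorted(found)


index = Trie()
for token in ["data", "data", "database", "day"]:
    index.insert(token)
```

A lookup costs one step per character of the query, independent of how many words are stored, because the shared prefix "da" is compared only once for all three distinct words.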

5.3 Architecture for big data analysis

Because of the 4Vs of big data, different analytical architectures shall be considered for different application requirements.


Table 1 Comparison of MPI, MapReduce, and Dryad

                          MPI                           MapReduce                     Dryad
Deployment                Computing nodes and data      Computing and data storage    Computing and data storage
                          storage arranged separately   arranged at the same node     arranged at the same node
                          (data should be moved to      (computing should be          (computing should be
                          the computing node)           close to data)                close to data)
Resource management/      -                             Workqueue (Google),           Not clear
scheduling                                              HOD (Yahoo)
Low-level programming     MPI API                       MapReduce API                 Dryad API
High-level programming    -                             Pig, Hive, Jaql, ...          Scope, DryadLINQ
Data storage              The local file system,        GFS (Google), HDFS (Hadoop),  NTFS, Cosmos DFS
                          NFS, ...                      KFS, Amazon S3, ...
Task partitioning         User manually partitions      Automation                    Automation
                          the tasks
Communication             Messaging, remote             Files (local FS, DFS)         Files, TCP pipes,
                          memory access                                               shared-memory FIFOs
Fault tolerance           Checkpoint                    Task re-execution             Task re-execution

5.3.1 Real-time vs. offline analysis

According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

- Real-time analysis: mainly used in e-commerce and finance. Since data constantly changes, rapid data analysis is needed and analytical results must be returned with a very short delay. The main existing architectures for real-time analysis include (i) parallel processing clusters using traditional relational databases, and (ii) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

- Offline analysis: usually used for applications without high requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally conducts analysis by importing logs into a special platform through data acquisition tools. In the big data setting, many Internet enterprises utilize an offline analysis architecture based on Hadoop in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples of such acquisition tools include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool Timetunnel, and Chukwa of Hadoop. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.

5.3.2 Analysis at different levels

Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive-level analysis, which are examined in the following.

- Memory-level analysis: for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays, the memory of a server cluster can surpass hundreds of GB, and even the TB level is common. Therefore, an in-memory database technology may be used, with hot data residing in memory so as to improve analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis. MongoDB is a representative memory-level analytical architecture. With the development of SSDs (Solid-State Drives), the capacity and performance of memory-level data analysis have been further improved and widely applied.

- BI analysis: for the case when the data scale surpasses the memory level but the data may still be imported into the BI analysis environment. Current mainstream BI products provide data analysis plans that support data volumes above the TB level.

- Massive analysis: for the case when the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analysis utilizes HDFS of Hadoop to store data and MapReduce for data analysis. Most massive analysis belongs to the offline analysis category.
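The three levels above form a simple decision rule driven by data volume. The sketch below encodes that rule; the function name and both capacity parameters are assumptions made for illustration, since the text gives no fixed thresholds and real limits depend on the cluster and the BI product in use.

```python
GB = 1024 ** 3
TB = 1024 ** 4


def choose_analysis_level(data_bytes, cluster_memory_bytes, bi_capacity_bytes):
    """Pick an analysis level from the data volume.

    cluster_memory_bytes and bi_capacity_bytes are deployment-specific
    limits supplied by the caller (hypothetical parameters).
    """
    if data_bytes < cluster_memory_bytes:
        return "memory-level"   # hot data can reside entirely in memory
    if data_bytes <= bi_capacity_bytes:
        return "BI"             # too large for memory, still importable into BI
    return "massive"            # beyond BI products and relational databases


# Hypothetical deployment: 512 GB of cluster memory, BI tooling good for 20 TB.
level_small = choose_analysis_level(200 * GB, 512 * GB, 20 * TB)
level_medium = choose_analysis_level(5 * TB, 512 * GB, 20 * TB)
level_large = choose_analysis_level(300 * TB, 512 * GB, 20 * TB)
```

Under these assumed limits, the three calls fall into the memory-level, BI, and massive categories respectively.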


5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithms differ greatly according to the kinds of data and the application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the five most widely used software tools, according to a 2012 KDnuggets survey of 798 professionals that asked "What Analytics, Data mining, Big Data software that you used in the past 12 months for a real project?" [112].

- R (30.7 %): R, an open source programming language and software environment, is designed for data mining, analysis, and visualization. When computing-intensive tasks are executed, code written in C, C++, and Fortran may be called from the R environment. In addition, skilled users can directly call R objects in C. R is in fact an implementation of the S language, an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and drawing plots. Compared to S, R is more popular since it is open source. R ranked first in the KDnuggets 2012 survey. Furthermore, in a 2012 survey of "Design languages you have used for data mining/analysis in the past year", R was also in first place, ahead of SQL and Java. Due to the popularity of R, database manufacturers such as Teradata and Oracle have released products supporting R.

- Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins with powerful functions for data analysis, such as Analysis ToolPak and Solver Add-in, are included from the start, but such plug-ins can be used only if users enable them. Excel is also the only commercial software among the top five.

- Rapid-I RapidMiner (26.7 %): RapidMiner is open source software used for data mining, machine learning, and predictive analysis. In a KDnuggets survey in 2011, it was used more frequently than R and ranked first. Data mining and machine learning programs provided by RapidMiner include Extract, Transform, and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment. The data mining flow is described in XML and displayed through a graphical user interface (GUI). RapidMiner is written in Java. It integrates the learners and evaluation methods of Weka and works with R. Functions of RapidMiner are implemented by connecting processes that include various operators. The entire flow can be regarded as a factory production line, with original data as input and model results as output. The operators can be considered as specific functions with different input and output characteristics.

- KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, and open-source platform rich in data integration, data processing, data analysis, and data mining functions [113]. It allows users to create data flows or data channels in a visualized manner, to selectively run some or all analytical procedures, and provides analytical results, models, and interactive views. KNIME is written in Java and, being based on Eclipse, provides more functions as plug-ins. Through plug-in files, users can insert processing modules for files, pictures, and time series, and integrate them into various open source projects, e.g., R and Weka. KNIME controls data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted in a visualized environment. KNIME is designed as a module-based and expandable framework. There is no dependence between its processing units and data containers, making it adaptive to distributed environments and independent development. In addition, it is easy to expand KNIME: developers can effortlessly extend its various nodes and views.

- Weka/Pentaho (14.8 %): Weka, short for Waikato Environment for Knowledge Analysis, is free and open-source machine learning and data mining software written in Java. Weka provides such functions as data processing, feature selection, classification, regression, clustering, association rules, and visualization. Pentaho is one of the most popular open-source BI software packages. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, data mining, and other aspects of BI. Weka's data processing algorithms are also integrated in Pentaho and can be called directly.

                          6 Big data applications

In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful values via judgments, suggestions, supports, or decisions. However, data analysis involves a wide range of applications, which frequently change and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields, including structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

6.1 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically includes large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have emerged over the past decades. For example, as early as the 1990s, BI became a prevailing technology for business applications, and network search engines based on massive data mining emerged in the early 21st century. Some potential and influential applications from different fields, along with their data and analysis characteristics, are discussed as follows.

- Evolution of Commercial Applications: The earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. Analytical techniques used in such systems prevailed in the 1990s and were intuitive and simple, e.g., in the form of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, score cards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have been providing a unique opportunity for organizations to have an online presence and interact directly with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted with text analysis and website mining techniques. As reported in [115], the quantity of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the Internet of Things based on sensors are opening a new generation of innovative applications, which require considerably larger capacity to support location-sensing, people-oriented, and context-aware operation.

- Evolution of Network Applications: The early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to the mining of email contents and building search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages, full of various kinds of data, such as text, images, audio, videos, and interactive contents. Therefore, plenty of advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share contents.

- Evolution of Scientific Applications: Scientific research in many fields, such as astrophysics, oceanology, genomics, and environmental research, is acquiring massive data with high-throughput sensors and instruments. The US National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, inter-operative analysis software, and data services to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets vary widely in form, including specification or reference data, experimental data, simulation or model data, observation data, and other derived data.

6.2 Big data analysis fields

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and it is not easy to achieve comprehensive coverage, we will focus on the key problems and technologies in data analysis in the following discussions.

                          194 Mobile Netw Appl (2014) 19171ndash209

6.2.1 Structured data analysis

Business applications and scientific research may generate massive structured data, whose management and analysis rely on mature commercialized technologies such as RDBMS, data warehouses, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning based on exact mathematical models and powerful algorithms has been applied to anomaly detection [117] and energy control [118]. Exploiting data characteristics, time and space mining can extract knowledge structures hidden in high-speed data flows and sensors [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially in process analysis with event data [121].

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to have more business potential than structured data analysis. Generally, text analysis is a process of extracting useful information and knowledge from unstructured text. Text mining is interdisciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and, in particular, data mining. Most text mining systems are based on text expressions and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammars [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, images, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which will be discussed in Section 6.2.4. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves mining semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., email, newsgroup management, and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to model and integrate data on the Web so as to conduct more complex queries than keyword-based searches.

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagrams of links within a website or among multiple websites. Models are built based on topological structures provided with hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of the models to look up relevant website pages. The topic-oriented crawler is another successful case that utilizes the models [127].
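To illustrate how a link-structure model can rank pages, the following minimal sketch runs a power-iteration form of PageRank on a toy link graph. The graph `toy_web`, the damping factor of 0.85, the iteration count, and the treatment of pages without outlinks are assumptions made for this example; it is not claimed to reproduce the algorithm of [125] in detail.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict mapping page -> list of outlinked pages."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (random-jump term).
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page splits its damped rank evenly among its outlinks.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Assumption: a page without outlinks spreads its rank over all pages.
                share = damping * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: C is linked from A, B, and D.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
```

In this toy graph, page C collects links from three pages and therefore ends up with the largest score, while D, which nobody links to, keeps only the baseline share.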

Web usage mining aims to mine auxiliary data generated by Web dialogues or activities, whereas Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers and proxy servers, browsers’ history records, user profiles, registration data, user sessions or transactions, caches, user queries, bookmark data, mouse clicks and scrolls, and any other kinds of data generated through interaction with the Web. As Web services and Web 2.0 are becoming mature and popular, Web usage data will have increasingly high variety. Web usage mining plays key roles in personalized space, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.
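A common first step when mining access logs is to group raw requests into user sessions. The sketch below does this with an inactivity timeout. The event layout `(user, timestamp_seconds, url)`, the 30-minute timeout, and the sample log are assumptions chosen for illustration, not something prescribed by the survey.

```python
from collections import defaultdict


def sessionize(events, timeout=1800):
    """Group (user, timestamp_seconds, url) events into per-user sessions.

    A new session starts whenever a user is inactive for more than `timeout` seconds.
    """
    by_user = defaultdict(list)
    for user, ts, url in sorted(events, key=lambda e: (e[0], e[1])):
        by_user[user].append((ts, url))

    sessions = defaultdict(list)
    for user, hits in by_user.items():
        current = [hits[0][1]]
        for (prev_ts, _), (ts, url) in zip(hits, hits[1:]):
            if ts - prev_ts > timeout:
                sessions[user].append(current)
                current = []
            current.append(url)
        sessions[user].append(current)
    return dict(sessions)


# Hypothetical access log entries.
log = [
    ("u1", 0, "/home"),
    ("u1", 60, "/item/7"),
    ("u1", 4000, "/home"),
    ("u2", 30, "/search"),
]
sessions = sessionize(log)
```

Here user `u1` produces two sessions because more than 30 minutes pass between the second and third request.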


6.2.4 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) have been growing at an amazing speed. Multimedia analysis extracts useful knowledge from such data and helps to understand its semantics. Because multimedia data is heterogeneous and most of it contains richer information than simple structured data or text data, extracting information is confronted with the huge challenge of the semantic gap. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection.

Audio summarization can be accomplished by extracting the prominent words or phrases from metadata or by synthesizing a new representation. Video summarization interprets the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied to many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video and apply additional smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.

Multimedia annotation inserts labels to describe the contents of images and videos at both the syntactic and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time- and labor-intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic gap. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to explore manual and automatic multimedia annotation simultaneously [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users in conveniently and quickly looking up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, and includes shot boundary detection, key frame extraction, and scene segmentation. Building on the result of structural analysis, the second procedure is feature extraction, which mainly mines the features of key frames, objects, texts, and movements; these features are the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find the patterns of video contents and put videos into predefined categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos. The retrieval results are then refined through relevance feedback.
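The query step can be pictured as a nearest-neighbor search over extracted feature vectors. The sketch below ranks indexed videos by cosine similarity to a query vector. Cosine similarity is only one possible similarity measure, and the three-dimensional feature vectors and video names are hypothetical values chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query, index, top_k=2):
    """Return the names of the top_k indexed videos most similar to the query."""
    scored = sorted(index.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [name for name, _ in scored[:top_k]]


# Hypothetical index: video name -> feature vector from the extraction step.
index = {
    "news_clip": [0.9, 0.1, 0.0],
    "soccer_match": [0.1, 0.8, 0.3],
    "interview": [0.7, 0.2, 0.1],
}
results = retrieve([1.0, 0.0, 0.0], index)
```

A real system would use far higher-dimensional features and an approximate index, but the ranking principle is the same.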

Multimedia recommendation recommends specific multimedia contents according to users’ preferences. It has proven to be an effective approach to providing personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. Content-based methods identify general features of users or their interests and recommend other contents with similar features to them. These methods largely rely on content similarity measurement, but most of them are troubled by limited content analysis and over-specialization. Collaborative-filtering-based methods identify groups with similar interests and recommend contents to group members according to their behavior [132]. Presently, hybrid methods have been introduced, which integrate the advantages of the aforementioned two types of methods to improve recommendation quality [133].
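The collaborative-filtering idea can be sketched in a few lines: users who rated shared items similarly are treated as a group, and items liked by similar users are suggested. The version below is a simple user-based variant with cosine similarity and a similarity-weighted average; the rating scale, user names, and clip names are hypothetical, and the methods in [132] may differ substantially.

```python
import math


def similarity(a, b):
    """Cosine similarity over the items that both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def recommend(user, ratings, top_k=1):
    """Rank items the user has not rated by similarity-weighted average rating."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = similarity(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    predicted = {i: scores[i] / weights[i] for i in scores if weights[i] > 0}
    return sorted(predicted, key=predicted.get, reverse=True)[:top_k]


# Hypothetical ratings on a 1-5 scale.
ratings = {
    "ann": {"clip_a": 5, "clip_b": 1},
    "bob": {"clip_a": 5, "clip_b": 1, "clip_c": 5, "clip_d": 1},
    "cat": {"clip_a": 1, "clip_b": 5, "clip_c": 1, "clip_d": 5},
}
suggestion = recommend("ann", ratings)
```

Because `ann` rates the shared clips exactly like `bob`, the clip that `bob` liked most among those `ann` has not seen is ranked first.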

The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains text descriptions related to concepts and video examples [134]. In [135], the author proposed a new algorithm for special multimedia event detection using a few positive training examples. The research on video event detection is still in its infancy and mainly focuses on sports or news events, running or abnormal events in surveillance videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services (SNS), including Twitter, Facebook, and LinkedIn, have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graph structures describing the communications between two entities. The content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. From a data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

The research on link-based structural analysis has focused on link prediction, community discovery, social network evolution, and social influence analysis. SNS may be visualized as graphs, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertices and edges are continually added to the graphs. Link prediction predicts the possibility of a future connection between two vertices. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and utilizes the existing link information to generate binary classifiers that predict future links [139]. Probabilistic methods aim to build models for connection probabilities among vertices in SNS [140]. Linear algebra methods compute the similarity between two vertices according to a reduced-rank similarity matrix [141]. A community is represented by a sub-graph in which the edges connecting vertices within the sub-graph feature high density, while the edges between two sub-graphs feature much lower density [142].
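As a small illustration of link prediction, the sketch below scores every pair of users who are not yet connected by the overlap of their neighborhoods (the Jaccard coefficient). This is a simple topological feature of the kind a feature-based classifier might consume; it is swapped in here for brevity and is not one of the specific methods of [139–141]. The friendship graph is hypothetical.

```python
from itertools import combinations


def jaccard_scores(graph):
    """Score each non-adjacent vertex pair by |N(u) & N(v)| / |N(u) | N(v)|."""
    scores = {}
    for u, v in combinations(sorted(graph), 2):
        if v in graph[u]:
            continue  # already linked, nothing to predict
        union = graph[u] | graph[v]
        if union:
            scores[(u, v)] = len(graph[u] & graph[v]) / len(union)
    return scores


# Hypothetical undirected friendship graph: vertex -> set of neighbors.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}
scores = jaccard_scores(graph)
best = max(scores, key=scores.get)
```

Users `a` and `d` share two of their three distinct neighbors, so this pair receives the highest score and would be the first candidate for a future link.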

Many methods for community detection have been proposed and studied, most of which are topology-based and rely on objective functions that capture the notion of community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. The research on SNS evolution aims to look for laws and deduction models to interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generation methods have been proposed to assist network and system design [147].

Social influence refers to the case when individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, time effects, and the characteristics of networks and individuals. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise. For example, the blogosphere contains a large number of spam blogs, and Twitter likewise contains many trivial Tweets. Third, SNS are dynamic networks, which vary and are updated frequently and quickly. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge among different media [150].

6.2.6 Mobile data analysis

By April 2013, Android Apps had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since the research on mobile analysis has just started, we will only introduce some recent and representative analysis applications in this section.

With the growing number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities based on geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest Webchat). Traditional network communities or SNS communities lack online interaction among members, and the communities are active only when members are sitting in front of computers. By contrast, mobile phones can support rich interaction at any time and anywhere. A mobile community is defined as a group of individuals with the same hobbies (e.g., health, safety, and entertainment) who gather together on networks, meet to set a common goal, decide on measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build body area networks for real-time monitoring of people’s health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations. In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop a smartphone application that analyzes people’s gait when they walk and uses the gait information to unlock the security system [11]. Meanwhile, Robert Delano and Brian Parise from the Georgia Institute of Technology developed an application called iTrem, which monitors body tremors with the built-in seismograph of a mobile phone so as to help manage Parkinson’s disease and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, in marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business models. In sales planning, after comparing massive data, enterprises can optimize their commodity prices. In operations, enterprises can improve their operational efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor costs. In the supply chain, using big data, enterprises may conduct inventory optimization, logistics optimization, supplier coordination, etc., to narrow the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that such activities as “Multi-times score accumulation” and “score exchange in shops” are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers’ transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, along with, more importantly, the age, gender, address, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, the market conditions of their brands, consumers’ behaviors, etc., and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more favorable prices. The credit loan service of Alibaba automatically analyzes and judges whether to lend to enterprises based on the acquired enterprise transaction data by virtue of big data technology, with no manual intervention in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan with only about 0.3 % bad loans, which is far lower than those of other commercial banks.

6.3.2 Application of IoT based big data

IoT is not only an important source of big data, but also one of the main markets for big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises may have the most profound experience with the application of IoT big data. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project, a cooperation between Miami-Dade County in Florida and IBM, closely connects 35 types of key county government departments and Miami city, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills this year by promptly identifying and fixing water pipes that were running and leaking.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and connections among individuals based on an information network. Big data of online SNS mainly comes from instant messages, online social networking, micro blogs, shared spaces, etc., which represent various user activities. The analysis of big data from online SNS uses computational analytical methods to understand relations in human society by virtue of theories and methods that involve mathematics, informatics, sociology, management science, etc., from three dimensions: network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, online education, etc. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following; they mainly mine and analyze content information and structural information to acquire value.

– Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preference, emotion, interest, demand, etc., may be revealed.

– Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, hobbies, etc., aggregate relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.

The US Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime patterns, and even predict the crime rates in major regions [11].

Fig. 5 Enabling technologies for online social network-oriented big data

In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, it was found that most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and then get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted research that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans. The goal is to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting sharp growth or drops in the volume of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change in food price inflation from the official statistics of Indonesia matches the number of Tweets about rice prices on Twitter, as shown in Fig. 6.
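The first aspect, detecting sharp growth or drops in topic volume, can be sketched as a rolling z-score test on daily mention counts. The window length, the threshold of three standard deviations, and the sample counts below are assumptions for illustration only; the source does not describe the detection method that Global Pulse actually used.

```python
from statistics import mean, stdev


def detect_spikes(counts, window=5, threshold=3.0):
    """Return indices whose count deviates sharply from the trailing window."""
    flagged = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            # Flat history: any change at all counts as abnormal.
            if counts[i] != mu:
                flagged.append(i)
            continue
        if abs(counts[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Hypothetical daily counts of Tweets mentioning one topic.
daily_mentions = [100, 104, 98, 101, 97, 103, 99, 240, 102, 100]
spikes = detect_spikes(daily_mentions)
```

Only the day with 240 mentions is flagged. The following days are not, because the spike itself inflates the trailing standard deviation, which is a known limitation of this simple scheme.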

Generally speaking, the application of big data from online SNS may help to better understand users’ behavior and master the laws of social and economic activities from the following three aspects:

– Early Warning: to rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotion, and preference of users.

– Real-time Feedback: to acquire groups’ feedback on some social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the health care business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final result into a highly personalized treatment plan to assess the risk factors and main treatment plans of patients. Doctors may then reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose five pounds of weight, or by suggesting that patients reduce the total triglycerides in their bodies if the sugar content in their bodies is over 20

The Mount Sinai Medical Center in the US utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through the software development kit (SDK) and open interface.

Fig. 6 The correlation between Tweets about rice price and food price inflation

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue in mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units and coordinate through mobile networks to distribute sensing tasks and to collect and utilize sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of Crowdsourcing has been successfully applied to geotagged photographs, positioning and navigation, urban road traffic sensing, market forecasting, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach to problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, Crowdsourcing had been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of Crowdsourcing. The main idea of Crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, Crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, Spatial Crowdsourcing has become a hot topic. The operation framework of Spatial Crowdsourcing is as follows. A user may request the service and resources related to a specified location. Then the mobile users who are willing to participate in the task move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data are sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions they provide, it can be forecast that Spatial Crowdsourcing will become more prevalent than traditional Crowdsourcing platforms, e.g., Amazon Mechanical Turk and Crowdflower.
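The assignment step of this framework can be illustrated with a minimal Python sketch that selects the nearest willing mobile user for a location-based request. The `Worker` record, the `assign_task` function, and the 5 km search radius are hypothetical names and values introduced only for illustration; they are not taken from any particular Spatial Crowdsourcing system, and real platforms may use quite different matching strategies.

```python
import math
from dataclasses import dataclass


@dataclass
class Worker:
    """Hypothetical record for a mobile user (illustrative assumption)."""
    name: str
    lat: float
    lon: float
    willing: bool  # whether the user volunteers for the task


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def assign_task(task_lat, task_lon, workers, max_km=5.0):
    """Return the nearest willing worker within max_km, or None if there is none."""
    candidates = []
    for w in workers:
        if not w.willing:
            continue  # only voluntary participants are considered
        d = haversine_km(task_lat, task_lon, w.lat, w.lon)
        if d <= max_km:
            candidates.append((d, w))
    if not candidates:
        return None
    # Compare on distance only, so Worker objects never need to be ordered.
    return min(candidates, key=lambda pair: pair[0])[1]
```

The selected worker would then travel to the location, acquire the data, and return it to the requester; those steps depend on the platform and are left out of the sketch.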

6.3.6 Smart grid

Smart Grid is the next-generation power grid, constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart Grid related big data are generated from various sources, such as (i) power utilization habits of users; (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nationwide; (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI); (iv) energy market pricing and bidding data; (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as circuit breaker monitors and transformers). Smart Grid brings about the following challenges on exploiting big data.

– Grid planning: By analyzing data in the Smart Grid, regions that have excessively high electrical load or high power outage frequencies can be identified. Even transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance. For example, researchers from the University of California, Los Angeles designed an “electric map” according to big data theory, making a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to show the power consumption of every block at the moment. It can even compare the power consumption of a block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides effective and visual load forecasts for power grid planning in a city. Power grid facilities in blocks with high power outage frequencies and serious overloads, as displayed in the map, may be given priority for transformation.

– Interaction between power generation and power consumption: An ideal power grid should balance power generation and consumption. However, the traditional power grid is constructed on a one-directional approach of transmission-transformation-distribution-consumption, which does not allow the generation capacity to be adjusted according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters have been developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which help the supplier read power utilization data every 15 min rather than every month as in the past. Labor cost for meter reading is greatly reduced. Because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and low periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help realize time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users.

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortages. Such energy resources can complement traditional hydropower and thermal power generation.
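To make the idea of pricing by peak and low periods concrete, the following Python sketch aggregates 15-minute smart meter readings by hour of day and computes a bill under a two-level tariff. It is only an illustration under stated assumptions: the peak window (17:00 to 20:59), both prices, and the function names are hypothetical and do not describe the tariff of TXU Energy or any other supplier.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical tariff; the window and prices are illustrative assumptions.
PEAK_HOURS = range(17, 21)   # 17:00-20:59
PEAK_PRICE = 0.30            # currency units per kWh (assumed)
OFF_PEAK_PRICE = 0.10        # currency units per kWh (assumed)


def hourly_load(readings):
    """Sum 15-minute (timestamp, kWh) readings into totals per hour of day."""
    totals = defaultdict(float)
    for timestamp, kwh in readings:
        totals[timestamp.hour] += kwh
    return dict(totals)


def peak_hour(readings):
    """Return the hour of day with the highest total consumption."""
    load = hourly_load(readings)
    return max(load, key=load.get)


def time_of_use_bill(readings):
    """Charge each reading at the peak or off-peak price of its hour."""
    bill = 0.0
    for timestamp, kwh in readings:
        price = PEAK_PRICE if timestamp.hour in PEAK_HOURS else OFF_PEAK_PRICE
        bill += kwh * price
    return bill


# Example input: one evening hour and one night hour of 15-minute readings.
sample = (
    [(datetime(2014, 1, 1, 18, m), 0.5) for m in (0, 15, 30, 45)]
    + [(datetime(2014, 1, 1, 2, m), 0.25) for m in (0, 15, 30, 45)]
)
```

With monthly readings, neither the hourly load profile nor the time-dependent bill could be computed, which is why the finer sampling interval matters for dynamic pricing.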

7 Conclusion, open issues, and outlook

In this paper, we review the background and state-of-the-art of big data. Firstly, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. Finally, we review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture of this exciting area to readers.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, which are discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

– Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to gauge the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated once the system is implemented and deployed, so the advantages and disadvantages of alternative solutions cannot be compared horizontally, even before and after the implementation of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes: This includes memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor that restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.


– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor in improving big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems in big data processing arise from traditional data analysis, including (i) data re-utilization: with the increase of data scale, more value may be mined from re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized so that more value can be mined; (iii) data exhaust, which means wrong data during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved.

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from comprehensive utilization of multiple datasets is far higher than the sum of the values of the individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and from different datasets.

– Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication; big data applications in small and medium-sized businesses; big data applications in government departments; big data services; and human-computer interaction of big data, etc., are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume is growing fast, there are more severe safety risks, while the traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security-related challenges.

– Big data privacy: Big data privacy includes two aspects: (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc., of users may be more easily acquired, and users may not be aware of it. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed a big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher at Skull Security, used an information acquisition tool to acquire data from the public pages of Facebook users who had failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low-quality data wastes transmission and storage resources and has poor usability. Many factors may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though many measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair damaged data need to be investigated.

– Big data safety mechanism: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data cannot meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safe communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants’ data should be enabled on the premise of efficiency assurance.

– Big data application in information security: Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) after analysis of big data in the form of log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc., may also be more easily identified through the analysis of big data.
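Three of the data quality dimensions named in the list above (completeness, redundancy, and consistency) can be measured automatically on a batch of records. The Python sketch below is one simple way to do so; the metric definitions, the function names, and the sample records are our own illustrative assumptions rather than an established standard, and accuracy is omitted because it requires ground-truth values to compare against.

```python
def completeness(records, fields):
    """Fraction of expected field values that are present (not None)."""
    total = len(records) * len(fields)
    if total == 0:
        return 1.0
    present = sum(1 for r in records for f in fields if r.get(f) is not None)
    return present / total


def redundancy(records, fields):
    """Fraction of records that repeat an earlier record on the given fields."""
    if not records:
        return 0.0
    seen = set()
    duplicates = 0
    for r in records:
        key = tuple(r.get(f) for f in fields)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates / len(records)


def consistency(records, rule):
    """Fraction of records that satisfy a caller-supplied validity rule."""
    if not records:
        return 1.0
    return sum(1 for r in records if rule(r)) / len(records)


# Hypothetical meter records used only to exercise the functions.
sample_records = [
    {"id": 1, "kwh": 2.0},
    {"id": 1, "kwh": 2.0},   # exact duplicate of the first record
    {"id": 2, "kwh": None},  # missing measurement
    {"id": 3, "kwh": -1.0},  # violates the non-negativity rule below
]


def non_negative(record):
    """Assumed rule: a measurement, if present, must not be negative."""
    return record["kwh"] is None or record["kwh"] >= 0
```

Scores of this kind only detect quality problems; repairing the damaged records, as called for above, remains a separate and harder task.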

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy efficiency optimization, and hardware and software system architectures for processing, etc. Particularly, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the “T” (Technology) was the main concern, while technology drives the development of data. In the big data era, with the prominence of data value and advances in “I” (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact but also influence everyone’s ways of living and thinking, which is happening right now. We cannot predict the future, but we may take precautions for possible events to occur in the future.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have been concerned with better ways to cope with larger-scale, higher-diversity, and more complexly structured data. These efforts are represented by Google’s globally distributed database Spanner and the fault-tolerant, expandable, distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, impacts of big data on production management, business operation, and decision making, etc., shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., as in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a friendly manner can they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc., are frequently used to visualize results of data analysis. New presentation forms will emerge in the future, e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it is observed that the role of data is becoming increasingly significant. In the small-scale data era, in which logic is more complex than data, program design is mainly process-oriented. As business data becomes more complex, object-oriented design methods are developed. Nowadays, the complexity of business data has far surpassed business logic. Consequently, programs are gradually transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods are certain to emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

  – During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

  – Compared with accurate data, we would like to accept numerous and complicated data.

  – We shall pay greater attention to correlations between things rather than exploring causal relationships.

  – The simple algorithms of big data are more effective than the complex algorithms of small data.

  – Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace “experts.”

Throughout the history of human society, the demands and willingness of human beings are always the source powers to promote scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao’s research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

                          References

1. Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

2. Fact sheet: Big data across the federal government (2012) http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf

3. Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper

4. Drowning in numbers - digital data will flood the planet - and help us understand it better (2011) http://www.economist.com/blogs/dailychart/2011/11/bigdata-0

5. Lohr S (2012) The age of big data. New York Times, pp 11

6. Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/thedigitalbreadcrumbs-that-lead-to-big-data

7. Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-searchforanalysts-to-make-sense-of-big-data

8. Big data (2008) http://www.nature.com/news/specials/bigdata/index.html

9. Special online collection: dealing with big data (2011) http://www.sciencemag.org/site/special/data

10. Manyika J, McKinsey Global Institute, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute

11. Mayer-Schönberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Eamon Dolan/Houghton Mifflin Harcourt

12. Laney D (2001) 3-d data management: controlling data volume, velocity and variety. META Group Research Note, 6 February

13. Zikopoulos P, Eaton C, et al (2011) Understanding big data: analytics for enterprise class hadoop and streaming data. McGraw-Hill Osborne Media

14. Meijer E (2011) The world according to linq. Communications of the ACM 54(10):45–51

15. Beyer M (2011) Gartner says solving big data challenge involves more than just managing volumes of data. Gartner. http://www.gartner.com/it/page.jsp

16. O R Team (2011) Big data now: current perspectives from O’Reilly Radar. O’Reilly Media

17. Grobelnik M (2012) Big data tutorial. http://videolectures.net/eswc2012grobelnikbigdata

18. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2008) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014

19. DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6):85–98


20. Walter T (2009) Teradata past, present, and future. UCI ISG lecture series on scalable data management

21. Ghemawat S, Gobioff H, Leung S-T (2003) The google file system. In: ACM SIGOPS Operating Systems Review, vol 37. ACM, pp 29–43

22. Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

23. Hey AJG, Tansley S, Tolle KM, et al (2009) The fourth paradigm: data-intensive scientific discovery

24. Howard JH, Kazar ML, Menees SG, Nichols DA, Satyanarayanan M, Sidebotham RN, West MJ (1988) Scale and performance in a distributed file system. ACM Trans Comput Syst (TOCS) 6(1):51–81

25. Cattell R (2011) Scalable sql and nosql data stores. ACM SIGMOD Record 39(4):12–27

26. Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033

27. Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98

28. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J, et al (2012) Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States

29. Sun Y, Chen M, Liu B, Mao S (2013) Far: a fault-avoidant routing method for data center networks with regular topology. In: Proceedings of ACM/IEEE symposium on architectures for networking and communications systems (ANCS’13). ACM

30. Wiki (2013) Applications and organizations using hadoop. http://wiki.apache.org/hadoop/PoweredBy

31. Bahga A, Madisetti VK (2012) Analyzing massive machine maintenance data in a computing cloud. IEEE Transac Parallel Distrib Syst 23(10):1831–1843

32. Gunarathne T, Wu T-L, Choi JY, Bae S-H, Qiu J (2011) Cloud computing paradigms for pleasingly parallel biomedical applications. Concurr Comput Prac Experience 23(17):2338–2354

33. Gantz J, Reinsel D (2010) The digital universe decade-are you ready. External publication of IDC (Analyse the Future) information and data, pp 1–16

34. Bryant RE (2011) Data-intensive scalable computing for scientific applications. Comput Sci Eng 13(6):25–33

35. Wahab MHA, Mohd MNH, Hanafi HF, Mohsin MFM (2008) Data pre-processing on web server logs for generalized association rules mining algorithm. World Acad Sci Eng Technol 48:2008

36. Nanopoulos A, Manolopoulos Y, Zakrzewicz M, Morzy T (2002) Indexing web access-logs for pattern queries. In: Proceedings of the 4th international workshop on web information and data management. ACM, pp 63–68

37. Joshi KP, Joshi A, Yesha Y (2003) On using a warehouse to analyze web logs. Distrib Parallel Databases 13(2):161–180

38. Chandramohan V, Christensen K (2002) A first look at wired sensor networks for video surveillance systems. In: Proceedings LCN 2002, 27th annual IEEE conference on local computer networks. IEEE, pp 728–729

                          39 Selavo L Wood A Cao Q Sookoor T Liu H Srinivasan A WuY Kang W Stankovic J Young D et al (2007) Luster wirelesssensor network for environmental research In Proceedings ofthe 5th international conference on Embedded networked sensorsystems ACM pp 103ndash116

                          40 Barrenetxea G Ingelrest F Schaefer G Vetterli M CouachO Parlange M (2008) Sensorscope out-of-the-box environmen-tal monitoring In Information processing in sensor networks2008 international conference on IPSNrsquo08 IEEE pp 332ndash343

                          41 Kim Y Schmid T Charbiwala ZM Friedman J Srivastava MB(2008) Nawms nonintrusive autonomous water monitoring sys-tem In Proceedings of the 6th ACM conference on Embeddednetwork sensor systems ACM pp 309ndash322

                          42 Kim S Pakzad S Culler D Demmel J Fenves G Glaser STuron M (2007) Health monitoring of civil infrastructures usingwireless sensor networks In Information Processing in SensorNetworks 2007 6th International Symposium on IPSN 2007IEEE pp 254ndash263

                          43 Ceriotti M Mottola L Picco GP Murphy AL Guna S Corra MPozzi M Zonta D Zanon P (2009) Monitoring heritage build-ings with wireless sensor networks the torre aquila deploymentIn Proceedings of the 2009 International Conference on Infor-mation Processing in Sensor Networks IEEE Computer Societypp 277ndash288

                          44 Tolle G Polastre J Szewczyk R Culler D Turner N Tu KBurgess S Dawson T Buonadonna P Gay D et al (2005) Amacroscope in the redwoods In Proceedings of the 3rd interna-tional conference on embedded networked sensor systems ACMpp 51ndash63

                          45 Wang F Liu J (2011) Networked wireless sensor data collectionissues challenges and approaches IEEE Commun Surv Tutor13(4)673ndash687

                          46 Cho J Garcia-Molina H (2002) Parallel crawlers In Proceedingsof the 11th international conference on World Wide Web ACMpp 124ndash135

                          47 Choudhary S Dincturk ME Mirtaheri SM Moosavi A vonBochmann G Jourdan G-V Onut I-V (2012) Crawling rich inter-net applications the state of the art In CASCON pp 146ndash160

                          48 Ghani N Dixit S Wang T-S (2000) On ip-over-wdm integrationIEEE Commun Mag 38(3)72ndash84

                          49 Manchester J Anderson J Doshi B Dravida S Ip over sonet(1998) IEEE Commun Mag 36(5)136ndash142

                          50 Jinno M Takara H Kozicki B (2009) Dynamic optical mesh net-works drivers challenges and solutions for the future In Opticalcommunication 2009 35th European conference on ECOCrsquo09IEEE pp 1ndash4

51. Barroso LA, Holzle U (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synt Lect Comput Archit 4(1):1–108

52. Armstrong J (2009) Ofdm for optical communications. J Light Technol 27(3):189–204

53. Shieh W (2011) Ofdm for flexible high-speed optical networks. J Light Technol 29(10):1560–1577

54. Cisco data center interconnect design and deployment guide (2010)

55. Greenberg A, Hamilton JR, Jain N, Kandula S, Kim C, Lahiri P, Maltz DA, Patel P, Sengupta S (2009) Vl2: a scalable and flexible data center network. In: ACM SIGCOMM computer communication review, vol 39. ACM, pp 51–62

56. Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S (2009) Bcube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 39(4):63–74

57. Farrington N, Porter G, Radhakrishnan S, Bazzaz HH, Subramanya V, Fainman Y, Papen G, Vahdat A (2011) Helios: a hybrid electrical/optical switch architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 41(4):339–350

58. Abu-Libdeh H, Costa P, Rowstron A, O'Shea G, Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Comput Commun Rev 40(4):51–62

59. Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what's needed for datacenter network operations. IEEE Commun Mag 48(7):32–39

60. Wang G, Andersen DG, Kaminsky M, Papagiannaki K, Ng TS, Kozuch M, Ryan M (2010) c-through: Part-time optics in data centers. In: ACM SIGCOMM Computer Communication Review, vol 40. ACM, pp 327–338

61. Ye X, Yin Y, Yoo SJB, Mejia P, Proietti R, Akella V (2010) Dos: a scalable optical switch for datacenters. In: Proceedings of the 6th ACM/IEEE symposium on architectures for networking and communications systems. ACM, p 24

62. Singla A, Singh A, Ramachandran K, Xu L, Zhang Y (2010) Proteus: a topology malleable data center network. In: Proceedings of the 9th ACM SIGCOMM workshop on hot topics in networks. ACM, p 8

63. Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P (2011) Energy-efficient design of a scalable optical multiplane interconnection architecture. IEEE J Sel Top Quantum Electron 17(2):377–383

64. Kodi AK, Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance computing (hpc) systems. IEEE J Sel Top Quantum Electron 17(2):384–395

65. Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H (2012) Mirror mirror on the ceiling: flexible wireless links for data centers. ACM SIGCOMM Comput Commun Rev 42(4):443–454

66. Lenzerini M (2002) Data integration: a theoretical perspective. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. ACM, pp 233–246

67. Cafarella MJ, Halevy A, Khoussainova N (2009) Data integration for the relational web. Proc VLDB Endowment 2(1):1090–1101

68. Maletic JI, Marcus A (2000) Data cleansing: beyond integrity analysis. In: IQ. Citeseer, pp 200–209

69. Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1-2):83–113

70. Chen H, Ku W-S, Wang H, Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 51–62

71. Zhao Z, Ng W (2012) A model-based approach for rfid data stream cleansing. In: Proceedings of the 21st ACM international conference on information and knowledge management. ACM, pp 862–871

72. Khoussainova N, Balazinska M, Suciu D (2008) Probabilistic event extraction from rfid data. In: Data Engineering, 2008. IEEE 24th international conference on, ICDE 2008. IEEE, pp 1480–1482

73. Herbert KG, Wang JTL (2007) Biological data cleaning: a case study. Int J Inf Qual 1(1):60–82

74. Tsai T-H, Lin C-Y (2012) Exploring contextual redundancy in improving object-based video coding for video sensor networks surveillance. IEEE Transac Multmed 14(3):669–682

75. Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 269–278

76. Kamath U, Compton J, Dogan RI, Jong KD, Shehu A (2012) An evolutionary algorithm approach for feature generation from sequence data and its application to dna splice site prediction. IEEE/ACM Transac Comput Biol Bioinforma (TCBB) 9(5):1387–1398

77. Leung K-S, Lee KH, Wang J-F, Ng EYT, Chan HLY, Tsui SKW, Mok TSK, Tse PC-H, Sung JJ-Y (2011) Data mining on dna sequences of hepatitis b virus. IEEE/ACM Transac Comput Biol Bioinforma 8(2):428–440

78. Huang Z, Shen H, Liu J, Zhou X (2011) Effective data co-reduction for multimedia similarity search. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, pp 1021–1032

79. Bleiholder J, Naumann F (2008) Data fusion. ACM Comput Surv (CSUR) 41(1):1

80. Brewer EA (2000) Towards robust distributed systems. In: PODC, p 7

81. Gilbert S, Lynch N (2002) Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59

82. McKusick MK, Quinlan S (2009) Gfs: evolution on fast-forward. ACM Queue 7(7):10

83. Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J (2008) Scope: easy and efficient parallel processing of massive data sets. Proc VLDB Endowment 1(2):1265–1276

84. Beaver D, Kumar S, Li HC, Sobel J, Vajgel P et al (2010) Finding a needle in haystack: facebook's photo storage. In: OSDI, vol 10, pp 1–8

85. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: amazon's highly available key-value store. In: SOSP, vol 7, pp 205–220

86. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In: Proceedings of the twenty-ninth annual ACM symposium on theory of computing. ACM, pp 654–663

87. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2):4

88. Burrows M (2006) The chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th symposium on Operating systems design and implementation. USENIX Association, pp 335–350

89. Lakshman A, Malik P (2009) Cassandra: structured storage system on a p2p network. In: Proceedings of the 28th ACM symposium on principles of distributed computing. ACM, pp 5–5

90. George L (2011) HBase: the definitive guide. O'Reilly Media Inc

91. Judd D (2008) hypertable-0.9.0.4-alpha

92. Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc

93. Crockford D (2006) The application/json media type for javascript object notation (json)

94. Murty J (2009) Programming amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc

95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc

96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 975–986

97. Yang H-C, Parker DS (2009) Traverse: simplified indexing on large map-reduce-merge clusters. In: Database systems for advanced applications. Springer, pp 308–322

98. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with sawzall. Sci Program 13(4):277–298

99. Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U (2009) Building a high-level dataflow system on top of map-reduce: the pig experience. Proceedings VLDB Endowment 2(2):1414–1425

100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629

101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72

102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) Dryadlinq: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14

103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: Parallel and distributed processing, 2008. IEEE international symposium on, IPDPS 2008. IEEE, pp 1–11

104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146

105. Bu Y, Bill H, Balazinska M, Ernst MD (2010) Haloop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296

106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative mapreduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing. ACM, pp 810–818

107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2

108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: mapreduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing. ACM, p 7

109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) Ciel: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on Networked systems design and implementation, p 9

110. Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York

111. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

112. What analytics, data mining, big data software you used in the past 12 months for a real project (2012) http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html

113. Berthold MR, Cebron N, Dill F, Gabriel TR, Kotter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer

114. Sallam RL, Richardson J, Hagerty J, Hostmann B (2011) Magic quadrant for business intelligence platforms. CT, Gartner Group, Stamford

115. Beyond the PC. Special Report on Personal Technology (2011)

116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A et al (2011) The iplant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on Software quality assurance. ACM, pp 70–77

118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers. ACM, pp 277–286

119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Record 34(2):18–26

120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM Sigmod Record 33(1):50–57

121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Transac Manag Inform Syst (TMIS) 3(2):7

122. Manning CD, Schutze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press

123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Transac Neural Netw 13(5):1163–1177

124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11

125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117

126. Konopnicki D, Shmueli O (1995) W3qs: a query system for the world-wide web. In: VLDB, vol 95, pp 54–65

127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640

128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval. ACM, p 2

129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25

130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19

131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819

132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939

133. Barragans-Martinez AB, Costa-Montenegro E, Burguillo JC, Rey-Lopez M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend tv programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311

134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91

135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia. ACM, pp 469–478

136. Hirsch JE (2005) An index to quantify an individual's scientific research output. Proc Natl Acad Sci USA 102(46):16569

137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company

138. Aggarwal CC (2011) An introduction to social network data analytics. Springer

139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1046–1054

140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied Computing. ACM, pp 1115–1116

141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Transac Knowl Discov Data (TKDD) 5(2):10

142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on World wide web. ACM, pp 631–640

143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis. ACM, pp 16–25

144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. ACM, pp 315–321

145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 145–158

146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 131–144

147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1007–1016

148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 807–816

149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on Web search and data mining. ACM, pp 657–666

150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360

151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (last accessed 5 May 2013)

152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51

153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive rfid, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 2

154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval. ACM, pp 469–478

155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated, multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium. ACM, pp 445–454

156. Tasevski P (2011) Password attacks and generation strategies. Tartu University: Faculty of Mathematics and Computer Sciences

in e-commerce by crawlers and regularly re-copying customer and account information.

In [70], the problem of cleaning RFID data was examined. RFID is widely used in many applications, e.g., inventory management and target tracking. However, raw RFID data is of low quality and contains many abnormal readings, owing to the limits of the physical design and to environmental noise. In [71], a probability model was developed to cope with data loss in mobile environments. Khoussainova et al. [72] proposed a system that automatically corrects errors in input data by defining global integrity constraints.

Herbert et al. [73] proposed a framework called BIO-AJAX to standardize biological data so as to support further computation and improve search quality. With BIO-AJAX, some errors and repetitions may be eliminated, and common data mining technologies can be executed more effectively.

– Redundancy elimination: data redundancy refers to repeated or surplus data, which occurs in many datasets. Data redundancy can increase unnecessary data transmission expense and cause defects in storage systems, e.g., wasted storage space, data inconsistency, reduced data reliability, and data damage. Therefore, various redundancy reduction methods have been proposed, such as redundancy detection, data filtering, and data compression. Such methods may apply to different datasets or application environments. However, redundancy reduction may also bring certain negative effects. For example, data compression and decompression cause additional computational burden. Therefore, the benefits of redundancy reduction and its cost should be carefully balanced. Data collected from different fields will increasingly appear in image or video formats. It is well known that images and videos contain considerable redundancy, including temporal redundancy, spatial redundancy, statistical redundancy, and sensing redundancy. Video compression is widely used to reduce redundancy in video data, as specified in the many video coding standards (MPEG-2, MPEG-4, H.263, and H.264/AVC). In [74], the authors investigated the problem of video compression in a video surveillance system with a video sensor network. The authors proposed a new MPEG-4 based method that exploits the contextual redundancy related to background and foreground in a scene. The low complexity and the low compression ratio of the proposed approach were demonstrated by the evaluation results.
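The balance between the benefit of compression and its computational cost can be measured directly. The following is a minimal Python sketch, not a method taken from the cited works: it assumes the standard-library zlib codec as a stand-in for any compressor, and the function name compression_tradeoff and the sample payload are hypothetical.

```python
import time
import zlib


def compression_tradeoff(data: bytes, level: int = 6) -> dict:
    """Compress and decompress `data` once, reporting size savings and time spent."""
    start = time.perf_counter()
    packed = zlib.compress(data, level)
    compress_seconds = time.perf_counter() - start

    start = time.perf_counter()
    restored = zlib.decompress(packed)
    decompress_seconds = time.perf_counter() - start

    # Lossless compression must return exactly the original bytes.
    if restored != data:
        raise ValueError("round trip did not reproduce the input")

    return {
        "original_bytes": len(data),
        "compressed_bytes": len(packed),
        # Original size divided by compressed size; larger means more savings.
        "ratio": len(data) / len(packed),
        "compress_seconds": compress_seconds,
        "decompress_seconds": decompress_seconds,
    }


if __name__ == "__main__":
    # Hypothetical, highly repetitive payload standing in for redundant sensor data.
    payload = b"sensor=21.5;" * 10000
    for level in (1, 6, 9):
        print(level, compression_tradeoff(payload, level))
```

Comparing the reported sizes and timings across compression levels gives one concrete way to decide whether the storage or transmission savings justify the extra computation for a given dataset.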

For generalized data transmission or storage, repeated data deletion (deduplication) is a special data compression technology that aims to eliminate repeated data copies [75]. With repeated data deletion, individual data blocks or data segments are assigned identifiers (e.g., using a hash algorithm) and stored, with the identifiers added to an identification list. As the analysis of repeated data deletion continues, if a new data block has an identifier that is identical to one in the identification list, the new data block is deemed redundant and is replaced by a reference to the corresponding stored data block. Repeated data deletion can greatly reduce storage requirements, which is particularly important to a big data storage system.

Apart from the aforementioned data pre-processing methods, specific data objects must go through some other operations, such as feature extraction. Such operations play an important role in multimedia search and DNA analysis [76–78]. Usually, high-dimensional feature vectors (or high-dimensional feature points) are used to describe such data objects, and the system stores the high-dimensional feature vectors for future retrieval. Data transfer is usually used to process distributed heterogeneous data sources, especially business datasets [79]. In fact, given the variety of datasets, it is non-trivial, if not impossible, to build a uniform data pre-processing procedure and technology that is applicable to all types of datasets. The specific features, problem, performance requirements, and other factors of each dataset should be considered so as to select a proper data pre-processing strategy.
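The repeated data deletion procedure described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the scheme of [75]: it assumes fixed-size blocks and SHA-256 as the hash algorithm, and the class name DedupStore and its methods are hypothetical.

```python
import hashlib
from typing import Dict, List


class DedupStore:
    """Toy block store that keeps one copy of each distinct data block."""

    def __init__(self, block_size: int = 4096) -> None:
        self.block_size = block_size
        # The identification list: identifier -> the single stored copy.
        self.blocks: Dict[str, bytes] = {}

    def put(self, data: bytes) -> List[str]:
        """Split data into blocks and return the identifiers needed to rebuild it."""
        recipe = []
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            identifier = hashlib.sha256(block).hexdigest()
            # Store the block only if its identifier has not been seen before;
            # a repeated block is represented by its identifier alone.
            if identifier not in self.blocks:
                self.blocks[identifier] = block
            recipe.append(identifier)
        return recipe

    def get(self, recipe: List[str]) -> bytes:
        """Rebuild the original data from a list of identifiers."""
        return b"".join(self.blocks[identifier] for identifier in recipe)

    def stored_bytes(self) -> int:
        """Number of bytes physically kept after deduplication."""
        return sum(len(block) for block in self.blocks.values())


if __name__ == "__main__":
    store = DedupStore(block_size=4)
    recipe = store.put(b"abcdabcdabcdwxyz")
    print(len(recipe), "blocks referenced,", store.stored_bytes(), "bytes stored")
```

In this sketch the 16-byte input is referenced as four blocks, but only the two distinct blocks are kept. Real systems often use variable-size chunking and must also handle hash collisions and reference counting, which are omitted here.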

                            4 Big data storage

The explosive growth of data imposes stricter requirements on storage and management. In this section, we focus on the storage of big data. Big data storage refers to the storage and management of large-scale datasets while achieving reliability and availability of data access. We will review important issues, including massive storage systems, distributed storage systems, and big data storage mechanisms. On one hand, the storage infrastructure needs to provide an information storage service with reliable storage space; on the other hand, it must provide a powerful access interface for the query and analysis of large amounts of data.

Traditionally, as auxiliary equipment of servers, data storage devices are used to store, manage, look up, and analyze data with structured RDBMSs. With the sharp growth of data, data storage devices are becoming increasingly important, and many Internet companies pursue large storage capacity to be competitive. Therefore, there is a compelling need for research on data storage.

4.1 Storage system for massive data

Various storage systems have emerged to meet the demands of massive data. Existing massive storage technologies can be classified as Direct Attached Storage (DAS) and network storage, while network storage can be further classified into Network Attached Storage (NAS) and Storage Area Network (SAN).

In DAS, various hard disks are directly connected to servers, and data management is server-centric, such that storage devices are peripheral equipment, each of which takes a certain amount of I/O resources and is managed by individual application software. For this reason, DAS is only suitable for interconnecting servers at a small scale. Owing to its low scalability, DAS exhibits undesirable efficiency when the storage capacity is increased, i.e., upgradeability and expandability are greatly limited. Thus, DAS is mainly used in personal computers and small-sized servers.

Network storage utilizes a network to provide users with a unified interface for data access and sharing. Network storage equipment includes special data exchange equipment, disk arrays, tape libraries, and other storage media, as well as special storage software. It is characterized by strong expandability.

NAS is actually auxiliary storage equipment of a network. It is directly connected to a network through a hub or switch using TCP/IP protocols. In NAS, data is transmitted in the form of files. Compared to DAS, the I/O burden at a NAS server is reduced extensively, since the server accesses a storage device indirectly through a network.

While NAS is network-oriented, SAN is especially designed for data storage with a scalable and bandwidth-intensive network, e.g., a high-speed network with optical fiber connections. In SAN, data storage management is relatively independent within a storage local area network, where multipath-based data switching among any internal nodes is utilized to achieve a maximum degree of data sharing and data management.

From the organization of a data storage system, DAS, NAS, and SAN can all be divided into three parts: (i) the disk array, which is the foundation of a storage system and the fundamental guarantee for data storage; (ii) connection and network sub-systems, which provide connection among one or more disk arrays and servers; (iii) storage management software, which handles data sharing, disaster recovery, and other storage management tasks of multiple servers.

4.2 Distributed storage system

The first challenge brought about by big data is how to develop a large-scale distributed storage system for efficient data processing and analysis. To use a distributed system to store massive data, the following factors should be taken into consideration:

– Consistency: a distributed storage system requires multiple servers to cooperatively store data. As there are more servers, the probability of server failures will be larger. Usually data is divided into multiple pieces to be stored at different servers to ensure availability in case of server failure. However, server failures and parallel storage may cause inconsistency among different copies of the same data. Consistency refers to assuring that multiple copies of the same data are identical.

– Availability: a distributed storage system operates on multiple sets of servers. As more servers are used, server failures are inevitable. It is desirable that the entire system is not seriously affected and can still satisfy customers' requests for reading and writing. This property is called availability.

– Partition Tolerance: multiple servers in a distributed storage system are connected by a network. The network could have link/node failures or temporary congestion. The distributed system should have a certain level of tolerance to problems caused by network failures. It is desirable that the distributed storage still works well when the network is partitioned.

Eric Brewer proposed the CAP theory [80, 81] in 2000, which indicated that a distributed system could not simultaneously meet the requirements on consistency, availability, and partition tolerance; at most two of the three requirements can be satisfied simultaneously. Seth Gilbert and Nancy Lynch from MIT proved the correctness of the CAP theory in 2002. Since consistency, availability, and partition tolerance cannot be achieved simultaneously, we can have a CA system by ignoring partition tolerance, a CP system by ignoring availability, and an AP system that ignores consistency, according to different design goals. The three systems are discussed in the following.

CA systems do not have partition tolerance, i.e., they cannot handle network failures. Therefore, CA systems are generally deemed as storage systems with a single server, such as traditional small-scale relational databases. Such systems feature a single copy of data, such that consistency is easily ensured. Availability is guaranteed by the excellent design of relational databases. However, since CA systems cannot handle network failures, they cannot be expanded to use many servers. Therefore, most large-scale storage systems are CP systems and AP systems.

Compared with CA systems, CP systems ensure partition tolerance. Therefore, CP systems can be expanded to become distributed systems. CP systems generally maintain several copies of the same data in order to ensure a level of fault tolerance. CP systems also ensure data consistency, i.e., multiple copies of the same data are guaranteed to be completely identical. However, CP systems cannot ensure sound availability because of the high cost of consistency assurance. Therefore, CP systems are useful for scenarios with moderate load but stringent requirements on data accuracy (e.g., trading data). BigTable and HBase are two popular CP systems.

AP systems also ensure partition tolerance. However, AP systems differ from CP systems in that they also ensure availability, at the price of providing only eventual consistency rather than the strong consistency of the previous two systems. Therefore, AP systems only apply to scenarios with frequent requests but not very high requirements on accuracy. For example, in online Social Networking Services (SNS) systems, there are many concurrent visits to the data, but a certain amount of data errors is tolerable. Furthermore, because AP systems ensure eventual consistency, accurate data can still be obtained after a certain amount of delay. Therefore, AP systems may also be used under circumstances with no stringent real-time requirements. Dynamo and Cassandra are two popular AP systems.
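The difference between CP and AP behavior during a network partition can be illustrated with a toy model. The sketch below is only an illustration under simplifying assumptions: the class names `Replica` and `TwoReplicaStore`, the two-replica layout, and the "higher version wins" reconciliation are all hypothetical and are not taken from BigTable, Dynamo, or any other system named above.

```python
class Replica:
    """One copy of the data, held by a single server."""

    def __init__(self):
        self.value = None
        self.version = 0


class TwoReplicaStore:
    """Toy store with two replicas; mode is "CP" or "AP" (hypothetical names)."""

    def __init__(self, mode):
        self.mode = mode
        self.replicas = [Replica(), Replica()]
        self.partitioned = False  # True when the replicas cannot reach each other

    def write(self, replica_id, value):
        local = self.replicas[replica_id]
        other = self.replicas[1 - replica_id]
        if self.partitioned:
            if self.mode == "CP":
                # Refuse the write: copies stay identical, availability is lost.
                return False
            # AP: accept locally; the copies diverge until the partition heals.
            local.value, local.version = value, local.version + 1
            return True
        # No partition: update both copies together.
        version = max(local.version, other.version) + 1
        for replica in self.replicas:
            replica.value, replica.version = value, version
        return True

    def read(self, replica_id):
        return self.replicas[replica_id].value

    def heal(self):
        """End the partition and reconcile: the higher version wins
        (ties are resolved arbitrarily in this sketch)."""
        self.partitioned = False
        newest = max(self.replicas, key=lambda r: r.version)
        for replica in self.replicas:
            replica.value, replica.version = newest.value, newest.version
```

In CP mode a write issued during a partition is rejected, so both replicas keep returning the same value. In AP mode the write succeeds, the two replicas temporarily disagree, and `heal()` brings them back together, which is the eventual consistency described above.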

4.3 Storage mechanism for big data

Considerable research on big data promotes the development of storage mechanisms for big data. Existing storage mechanisms of big data may be classified into three bottom-up levels: (i) file systems, (ii) databases, and (iii) programming models.

File systems are the foundation of the applications at upper levels. Google's GFS is an expandable distributed file system to support large-scale, distributed, data-intensive applications [25]. GFS uses cheap commodity servers to achieve fault-tolerance and provides customers with high-performance services. GFS supports large-scale file applications with more frequent reading than writing. However, GFS also has some limitations, such as a single point of failure and poor performance for small files. Such limitations have been overcome by Colossus [82], the successor of GFS.

In addition, other companies and researchers also have their solutions to meet the different demands for storage of big data. For example, HDFS and Kosmosfs are open-source derivatives of the GFS design. Microsoft developed Cosmos [83] to support its search and advertisement business. Facebook utilizes Haystack [84] to store a large amount of small-sized photos. Taobao also developed TFS and FastDFS. In conclusion, distributed file systems have become relatively mature after years of development and business operation. Therefore, we will focus on the other two levels in the rest of this section.

4.3.1 Database technology

Database technology has been evolving for more than 30 years. Various database systems have been developed to handle datasets at different scales and support various applications. Traditional relational databases cannot meet the challenges on categories and scales brought about by big data. NoSQL databases (i.e., non-traditional relational databases) are becoming more popular for big data storage. NoSQL databases feature flexible schemas, support for simple and easy replication, simple APIs, eventual consistency, and support for large volumes of data. NoSQL databases are becoming the core technology for big data. We will examine the following three main NoSQL databases in this section: key-value databases, column-oriented databases, and document-oriented databases, each based on certain data models.

– Key-value Databases: Key-value databases are constituted by a simple data model in which data is stored as key-value pairs. Every key is unique, and customers may look up values according to their keys. Such databases feature a simple structure, and modern key-value databases are characterized by higher expandability and shorter query response times than those of relational databases. Over the past few years, many key-value databases have appeared, motivated by Amazon's Dynamo system [85]. We will introduce Dynamo and several other representative key-value databases.

– Dynamo: Dynamo is a highly available and expandable distributed key-value data storage system. It is used to store and manage the status of some core services of the Amazon e-Commerce Platform that can be realized with key access. The public mode of relational databases may generate invalid data and limit data scale and availability, while Dynamo can resolve these problems with a simple key-object interface, which is constituted by simple reading and writing operations. Dynamo achieves elasticity and availability through its data partitioning, data replication, and object versioning mechanisms. Dynamo's partitioning scheme relies on Consistent Hashing [86] to divide the load among multiple main storage machines; its main advantage is that the arrival or departure of a node only affects its directly adjacent nodes and does not affect other nodes. Dynamo replicates data to N servers, where N is a configurable parameter, in order to achieve high availability and durability. Dynamo also provides eventual consistency, so that updates can be propagated asynchronously to all copies.

– Voldemort: Voldemort is also a key-value storage system, which was initially developed for and is still used by LinkedIn. Keys and values in Voldemort are composite objects constituted by lists and maps. The Voldemort interface includes three simple operations: reading, writing, and deletion, all of which are addressed by keys. Voldemort provides multi-version concurrency control with asynchronous updating, but does not ensure data consistency. However, Voldemort supports optimistic locking for consistent multi-record updating: when a conflict happens between the update and any other operation, the update operation will quit. The data replication mechanism of Voldemort is the same as that of Dynamo. Voldemort not only stores data in RAM, but also allows data to be inserted into a storage engine. In particular, Voldemort supports two storage engines, Berkeley DB and Random Access Files.

The key-value database emerged a few years ago. Deeply influenced by Amazon's Dynamo, other key-value storage systems include Redis, Tokyo Cabinet and Tokyo Tyrant, Memcached and MemcacheDB, Riak, and Scalaris, all of which provide expandability by distributing keys among nodes. Voldemort, Riak, Tokyo Cabinet, and Memcached can utilize attached storage devices to store data in RAM or on disks. Other storage systems store data in RAM and provide disk backup, or rely on replication and recovery to avoid backup.
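The consistent hashing idea used by Dynamo for partitioning and N-way replication can be sketched in a few lines. This is a minimal illustration, not Dynamo's implementation: the class name `HashRing`, the use of MD5 to place nodes and keys on the ring, and the absence of virtual nodes are all assumptions made for brevity.

```python
import bisect
import hashlib


def ring_position(name):
    """Map a string to a point on the ring (MD5 is an arbitrary choice here)."""
    return int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical ring with one point per node and N replicas per key."""

    def __init__(self, nodes, n_replicas=3):
        self.n_replicas = n_replicas
        self.points = sorted((ring_position(node), node) for node in nodes)

    def add_node(self, node):
        bisect.insort(self.points, (ring_position(node), node))

    def remove_node(self, node):
        self.points.remove((ring_position(node), node))

    def preference_list(self, key):
        """Return the N distinct nodes that follow the key clockwise on the ring."""
        if not self.points:
            return []
        positions = [position for position, _ in self.points]
        start = bisect.bisect_right(positions, ring_position(key))
        count = min(self.n_replicas, len(self.points))
        return [self.points[(start + i) % len(self.points)][1] for i in range(count)]
```

With this layout, removing a node changes the placement only of keys whose preference list contained that node; every other key keeps the same servers, which is the locality property described for Dynamo above.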

– Column-oriented Databases: Column-oriented databases store and process data according to columns rather than rows. Both columns and rows are segmented across multiple nodes to realize expandability. Column-oriented databases are mainly inspired by Google's BigTable. In this section, we first discuss BigTable and then introduce several derivative tools.

– BigTable: BigTable is a distributed, structured data storage system, which is designed to process large-scale (PB class) data among thousands of commercial servers [87]. The basic data structure of BigTable is a sparse, distributed, persistent, multi-dimensional sorted map. The map is indexed by row key, column key, and timestamp, and every value in the map is an uninterpreted byte array. Each row key in BigTable is a character string of up to 64 KB. Rows are stored in lexicographical order and continually segmented into Tablets (i.e., units of distribution) for load balance. Thus, reading a short range of rows can be highly efficient, since it only involves communication with a small portion of machines. The columns are grouped according to the prefixes of their keys, thus forming column families. These column families are the basic units for access control. The timestamps are 64-bit integers that distinguish different versions of cell values. Clients may flexibly determine the number of cell versions stored. These versions are sequenced in descending order of timestamps, so the latest version will always be read first.

The BigTable API features the creation and deletion of Tablets and column families, as well as modification of the metadata of clusters, tables, and column families. Client applications may insert or delete values of BigTable, query values from columns, or browse sub-datasets in a table. BigTable also supports some other characteristics, such as transaction processing in a single row. Users may utilize such features to conduct more complex data processing.

Every BigTable implementation includes three main components: the Master server, Tablet servers, and the client library. BigTable allows only one Master server, which is responsible for assigning Tablets to Tablet servers, detecting added or removed Tablet servers, and balancing load. In addition, the Master handles changes to the BigTable schema, e.g., creating tables and column families, and collects garbage saved in GFS, i.e., deleted or disabled files that were used by specific BigTable instances. Every Tablet server manages a set of Tablets and is responsible for the reading and writing of its loaded Tablets. When Tablets become too big, they are segmented by the server. The application client library is used to communicate with BigTable instances.

BigTable is based on many fundamental components of Google, including GFS [25], the cluster management system, the SSTable file format, and Chubby [88]. GFS is used to store data and log files. The cluster management system is responsible for task scheduling, resource sharing, processing of machine failures, and monitoring of machine statuses. The SSTable file format is used to store BigTable data internally, and it provides a persistent, ordered, and immutable mapping from keys to values, both of which are arbitrary byte strings. BigTable utilizes Chubby for the following tasks: 1) ensure there is at most one active Master copy at any time; 2) store the bootstrap location of BigTable data; 3) look up Tablet servers; 4) conduct error recovery in case of Tablet server failures; 5) store BigTable schema information; 6) store the access control table.

– Cassandra: Cassandra is a distributed storage system to manage huge amounts of structured data distributed among multiple commercial servers [89]. The system was developed by Facebook and became an open source tool in 2008. It adopts the ideas and concepts of both Amazon Dynamo and Google BigTable, especially integrating the distributed system technology of Dynamo with the BigTable data model. Tables in Cassandra are in the form of a distributed four-dimensional structured map, where the four dimensions are row, column, column family, and super column. A row is identified by a string key of arbitrary length. No matter how many columns are read or written, an operation on a row is atomic. Columns may be grouped into clusters, which are called column families and are similar to those of the BigTable data model. Cassandra provides two kinds of column families: column families and super columns. A super column includes an arbitrary number of columns related to the same name. A column family includes columns and super columns, which may be continuously inserted into the column family during runtime. The partitioning and replication mechanisms of Cassandra are very similar to those of Dynamo, so as to achieve consistency.

– Derivative tools of BigTable: since the BigTable code cannot be obtained through an open source license, some open source projects compete to implement the BigTable concept and develop similar systems, such as HBase and Hypertable.

HBase is a BigTable clone programmed in Java and is a part of Apache's Hadoop MapReduce framework [90]. HBase replaces GFS with HDFS. It writes updated contents into RAM and regularly writes them into files on disks. Row operations are atomic, equipped with row-level locking and transaction processing; transactions of wider scope are optional. Partitioning and distribution are handled transparently, with no need for client-side hashing or a fixed key space.

HyperTable was developed along the lines of BigTable to obtain a set of high-performance, expandable, distributed storage and processing systems for structured and unstructured data [91]. HyperTable relies on distributed file systems, e.g., HDFS, and a distributed lock manager. Its data representation, processing, and partitioning mechanisms are similar to those in BigTable. HyperTable has its own query language, called HyperTable query language (HQL), and allows users to create, modify, and query underlying tables.

Since the column-oriented storage databases mainly emulate BigTable, their designs are all similar, except for the concurrency mechanism and several other features. For example, Cassandra emphasizes weak consistency with multi-version concurrency control, while HBase and HyperTable focus on strong consistency through locks or log records.
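The BigTable data model described above, a sorted map from (row key, column key, timestamp) to an uninterpreted byte string with a bounded number of versions per cell, can be mimicked in memory. The sketch below is an illustration only: the class `SparseTable`, its method names, and the `max_versions` parameter are hypothetical and do not correspond to the API of BigTable, HBase, or HyperTable.

```python
from collections import defaultdict


class SparseTable:
    """(row key, column key, timestamp) -> bytes, newest versions first."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # cells[row][column] is a list of (timestamp, value), newest first.
        self.cells = defaultdict(dict)

    def put(self, row, column, timestamp, value):
        versions = self.cells[row].setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda pair: pair[0], reverse=True)
        del versions[self.max_versions:]  # keep only the newest versions

    def get(self, row, column, timestamp=None):
        """Return the newest value, or the newest one at or before timestamp."""
        for ts, value in self.cells.get(row, {}).get(column, []):
            if timestamp is None or ts <= timestamp:
                return value
        return None

    def scan(self, start_row, end_row):
        """Yield rows in lexicographic order with start_row <= row < end_row."""
        for row in sorted(self.cells):
            if start_row <= row < end_row:
                latest = {col: vers[0][1] for col, vers in self.cells[row].items()}
                yield row, latest
```

Because rows are visited in lexicographic order, a scan over a short key range touches only neighboring rows, which is the reason range reads map onto a small number of Tablets in the real system.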

– Document Databases: Compared with key-value storage, document storage can support more complex data forms. Since documents do not follow strict schemas, there is no need to conduct schema migration. In addition, key-value pairs can still be saved. We will examine three important representatives of document storage systems, i.e., MongoDB, SimpleDB, and CouchDB.

– MongoDB: MongoDB is an open-source, document-oriented database [92]. MongoDB stores documents as Binary JSON (BSON) objects [93], which are similar to JSON objects. Every document has an ID field as the primary key. A query in MongoDB is expressed with syntax similar to JSON. A database driver sends the query as a BSON object to MongoDB. The system allows queries on all documents, including embedded objects and arrays. To enable rapid queries, indexes can be created on the queryable fields of documents. Replication in MongoDB is executed with log files on the master nodes, which record all the high-level operations conducted in the database. During replication, the slaves query the master for all the writing operations since the last synchronization and execute the operations from the log files in their local databases. MongoDB supports horizontal expansion with automatic sharding to distribute data among thousands of nodes, with automatic load balancing and failover.


– SimpleDB: SimpleDB is a distributed database and a web service of Amazon [94]. Data in SimpleDB is organized into various domains, in which data may be stored, acquired, and queried. Domains include different properties and name/value pair sets of items. Data is replicated to different machines at different data centers in order to ensure data safety and improve performance. This system does not support automatic partitioning and thus cannot be expanded as the data volume changes. SimpleDB allows users to query with SQL. It is worth noting that SimpleDB can assure eventual consistency but does not support Multi-Version Concurrency Control (MVCC). Therefore, conflicts cannot be detected from the client side.

– CouchDB: Apache CouchDB is a document-oriented database written in Erlang [95]. Data in CouchDB is organized into documents consisting of fields named by keys/names and values, which are stored and accessed as JSON objects. Every document is provided with a unique identifier. CouchDB allows access to database documents through a RESTful HTTP API. If a document needs to be modified, the client must download the entire document, modify it, and then send it back to the database. Each time a document is rewritten, its identifier is updated. CouchDB utilizes optimistic replication to obtain scalability without a sharding mechanism. Since multiple CouchDB instances may be executed along with other transactions simultaneously, any kind of replication topology can be built. The consistency of CouchDB relies on the replication mechanism. CouchDB supports MVCC with historical hash records.
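The download, modify, and send-back cycle with MVCC described for CouchDB can be illustrated with a small in-memory store that rejects writes based on a stale revision. This is a hedged sketch only: `DocumentStore`, `ConflictError`, and the "generation-hash" revision string are assumptions loosely modeled on the description above, not CouchDB's actual API or storage format.

```python
import hashlib
import json


class ConflictError(Exception):
    """Raised when a write is based on an outdated revision."""


class DocumentStore:
    """Hypothetical in-memory document store with revision checks."""

    def __init__(self):
        self.docs = {}  # doc_id -> (revision, document body)

    @staticmethod
    def _next_rev(old_rev, body):
        # Revision = generation counter plus a hash of the document content.
        generation = int(old_rev.split("-")[0]) + 1 if old_rev else 1
        payload = json.dumps(body, sort_keys=True).encode("utf-8")
        return "%d-%s" % (generation, hashlib.md5(payload).hexdigest())

    def get(self, doc_id):
        rev, body = self.docs[doc_id]
        return rev, dict(body)  # the client works on its own copy

    def put(self, doc_id, body, rev=None):
        current_rev = self.docs[doc_id][0] if doc_id in self.docs else None
        if rev != current_rev:
            raise ConflictError("expected %s, got %s" % (current_rev, rev))
        new_rev = self._next_rev(current_rev, body)
        self.docs[doc_id] = (new_rev, dict(body))
        return new_rev
```

A client that read revision 1 and tries to write after another client has already produced revision 2 receives a `ConflictError` and must fetch the document again. A store without such revision tracking could not report this kind of conflict to the client.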

4.3.2 Programming model

Big data are generally stored in hundreds and even thousands of commercial servers. Thus, traditional parallel models, such as Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), may not be adequate to support such large-scale parallel programs. Recently, some proposed parallel programming models have effectively improved the performance of NoSQL and reduced the performance gap to relational databases. Therefore, these models have become the cornerstone for the analysis of massive data.

– MapReduce: MapReduce [22] is a simple but powerful programming model for large-scale computing that uses large clusters of commercial PCs to achieve automatic parallel processing and distribution. The MapReduce computing model has only two functions, i.e., Map and Reduce, both of which are programmed by users. The Map function processes input key-value pairs and generates intermediate key-value pairs. Then, MapReduce combines all the intermediate values related to the same key and transmits them to the Reduce function, which further compresses the value set into a smaller set. MapReduce has the advantage that it avoids the complicated steps for developing parallel applications, e.g., data scheduling, fault-tolerance, and inter-node communications. The user only needs to program the two functions to develop a parallel application. The initial MapReduce framework did not support multiple datasets in a task, which has been mitigated by some recent enhancements [96, 97].

Over the past decades, programmers have become familiar with the advanced declarative language of SQL, often used in relational databases, for task description and dataset analysis. However, the succinct MapReduce framework only provides two nontransparent functions, which cannot cover all the common operations. Therefore, programmers have to spend time on programming the basic functions, which are typically hard to maintain and reuse. In order to improve programming efficiency, some advanced language systems have been proposed, e.g., Sawzall [98] of Google, Pig Latin [99] of Yahoo, Hive [100] of Facebook, and Scope [87] of Microsoft.

– Dryad: Dryad [101] is a general-purpose distributed execution engine for processing parallel applications of coarse-grained data. The operational structure of Dryad is a directed acyclic graph, in which vertexes represent programs and edges represent data channels. Dryad executes operations on the vertexes in clusters and transmits data via data channels, including files, TCP connections, and shared-memory FIFOs. During operation, resources in a logic operation graph are automatically mapped to physical resources.

The operation structure of Dryad is coordinated by a central program called the job manager, which can be executed in clusters or on workstations through a network. A job manager consists of two parts: 1) application codes, which are used to build a job communication graph, and 2) program library codes, which are used to arrange available resources. All kinds of data are directly transmitted between vertexes. Therefore, the job manager is only responsible for decision-making and does not obstruct any data transmission.

In Dryad, application developers can flexibly choose any directed acyclic graph to describe the communication modes of the application and express data transmission mechanisms. In addition, Dryad allows vertexes to use any amount of input and output data, while MapReduce supports only one input and output set. DryadLINQ [102] is the advanced language of Dryad and is used to integrate the aforementioned SQL-like language execution environment.

– All-Pairs: All-Pairs [103] is a system specially designed for biometrics, bio-informatics, and data mining applications. It focuses on comparing element pairs in two datasets by a given function. All-Pairs can be expressed as a three-tuple (Set A, Set B, and Function F), in which Function F is utilized to compare all elements in Set A and Set B. The comparison result is an output matrix M, which is also called the Cartesian product or cross join of Set A and Set B.

All-Pairs is implemented in four phases: system modeling, distribution of input data, batch job management, and result collection. In Phase I, an approximation model of system performance is built to evaluate how much CPU resource is needed and how to conduct job partition. In Phase II, a spanning tree is built for data transmissions, which lets the workload of every partition retrieve input data effectively. In Phase III, after the data flow is delivered to proper nodes, the All-Pairs engine builds a batch-processing submission for jobs in partitions, while sequencing them in the batch processing system and formulating a node running command to acquire data. In the last phase, after the job completion of the batch processing system, the extraction engine collects results and combines them in a proper structure, which is generally a single file list in which all results are put in order.

– Pregel: The Pregel [104] system of Google facilitates the processing of large-sized graphs, e.g., analysis of network graphs and social networking services. A computational task is expressed by a directed graph constituted by vertexes and directed edges. Every vertex is associated with a modifiable, user-defined value, and every directed edge is associated with a source vertex and consists of a user-defined value and the identifier of a target vertex. When the graph is built, the program conducts iterative calculations, which are called supersteps and among which global synchronization points are set, until algorithm completion and output completion. In every superstep, vertex computations are parallel, and every vertex executes the same user-defined function to express a given algorithm logic. Every vertex may modify its own status and that of its output edges, receive messages sent in the previous superstep, send messages to other vertexes, and even modify the topological structure of the entire graph. Edges are not provided with corresponding computations. Every vertex may deactivate itself by suspension (voting to halt). When all vertexes are in an inactive status without any message to transmit, the entire program execution is completed. The Pregel program output is a set consisting of the values output from all the vertexes. Generally speaking, the input and output of a Pregel program are isomorphic directed graphs.
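The superstep loop described for Pregel can be sketched on a single machine. The code below is an illustration under assumptions: `run_supersteps` and `propagate_max` are hypothetical names, the graph is a plain dictionary whose edge targets must themselves be keys, and vertexes are processed sequentially rather than in parallel. The example computation spreads the largest vertex value through the graph.

```python
def run_supersteps(graph, values, compute, max_supersteps=50):
    """graph: vertex -> list of target vertexes; values: vertex -> value."""
    values = dict(values)
    active = set(graph)
    inbox = {v: [] for v in graph}
    for superstep in range(max_supersteps):
        if not active and not any(inbox.values()):
            break  # every vertex is inactive and no message is in transit
        outbox = {v: [] for v in graph}
        next_active = set()
        for v in graph:
            messages = inbox[v]
            if v not in active and not messages:
                continue  # a halted vertex stays idle until a message arrives
            new_value, outgoing, halt = compute(superstep, values[v], messages, graph[v])
            values[v] = new_value
            for target, message in outgoing:
                outbox[target].append(message)
            if not halt:
                next_active.add(v)
        # Global synchronization point: messages become visible next superstep.
        inbox, active = outbox, next_active
    return values


def propagate_max(superstep, value, messages, out_edges):
    """Adopt the largest value seen; forward it only when it changed."""
    best = max([value] + messages)
    changed = superstep == 0 or best > value
    outgoing = [(target, best) for target in out_edges] if changed else []
    return best, outgoing, True  # always vote to halt; messages reactivate
```

For a connected example such as `run_supersteps({"a": ["b"], "b": ["a", "c"], "c": ["b"]}, {"a": 3, "b": 6, "c": 2}, propagate_max)`, every vertex ends with the value 6 after a few supersteps, and the loop stops once no vertex is active and no message is pending.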

Inspired by the above programming models, other research has also focused on programming models for more complex computational tasks, e.g., iterative computations [105, 106], fault-tolerant memory computations [107], incremental computations [108], and flow control decision-making related to data [109].
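Of the models reviewed in this subsection, MapReduce is the simplest to sketch. The following single-process illustration shows the Map, group-by-key, and Reduce steps on a word count task; the function names are hypothetical, and the scheduling, fault tolerance, and inter-node communication handled by a real framework are omitted.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Run map_fn over (key, value) records, group by key, then reduce."""
    intermediate = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            # Shuffle step: collect all intermediate values of the same key.
            intermediate[out_key].append(out_value)
    return {key: reduce_fn(key, vals) for key, vals in sorted(intermediate.items())}


def word_count_map(doc_id, text):
    """Map: emit (word, 1) for every word in the document."""
    for word in text.lower().split():
        yield word, 1


def word_count_reduce(word, counts):
    """Reduce: compress the list of counts into a single total."""
    return sum(counts)
```

Calling `map_reduce([("d1", "big data big"), ("d2", "data storage")], word_count_map, word_count_reduce)` returns a dictionary of word totals. The user writes only the two small functions, which is the programming convenience attributed to MapReduce above.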

5 Big data analysis

The analysis of big data mainly involves analytical methods for traditional data and big data, analytical architecture for big data, and software used for mining and analysis of big data. Data analysis is the final and the most important phase in the value chain of big data, with the purpose of extracting useful values and providing suggestions or decisions. Different levels of potential values can be generated through the analysis of datasets in different fields [10]. However, data analysis is a broad area, which frequently changes and is extremely complex. In this section, we introduce the methods, architectures, and tools for big data analysis.

5.1 Traditional data analysis

Traditional data analysis means to use proper statistical methods to analyze massive data, to concentrate, extract, and refine useful data hidden in a batch of chaotic datasets, and to identify the inherent law of the subject matter, so as to maximize the value of data. Data analysis plays a huge guidance role in making development plans for a country, understanding customer demands for commerce, and predicting market trends for enterprises. Big data analysis can be deemed as the analysis technique for a special kind of data. Therefore, many traditional data analysis methods may still be utilized for big data analysis. Several representative traditional data analysis methods are examined in the following, many of which are from statistics and computer science.

– Cluster Analysis: a statistical method for grouping objects, and specifically, classifying objects according to some features. Cluster analysis is used to differentiate objects with particular features and divide them into some categories (clusters) according to these features, such that objects in the same category will have high homogeneity while different categories will have high heterogeneity. Cluster analysis is an unsupervised learning method without training data.

– Factor Analysis: basically targeted at describing the relation among many elements with only a few factors, i.e., grouping several closely related variables into a factor; the few factors are then used to reveal most of the information of the original data.


– Correlation Analysis: an analytical method for determining the law of relations, such as correlation, correlative dependence, and mutual restriction, among observed phenomena and accordingly conducting forecast and control. Such relations may be classified into two types: (i) function, reflecting the strict dependence relationship among phenomena, which is also called a definitive dependence relationship; (ii) correlation, covering undetermined or inexact dependence relations, in which the numerical value of a variable may correspond to several numerical values of the other variable, and such numerical values present a regular fluctuation surrounding their mean values.

– Regression Analysis: a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies dependence relationships among variables hidden by randomness. Regression analysis may make complex and undetermined correlations among variables simple and regular.

– A/B Testing: also called bucket testing. It is a technology for determining how to improve target variables by comparing tested groups. Big data will require a large number of tests to be executed and analyzed.

– Statistical Analysis: Statistical analysis is based on statistical theory, a branch of applied mathematics. In statistical theory, randomness and uncertainty are modeled with probability theory. Statistical analysis can provide a description and an inference for big data. Descriptive statistical analysis can summarize and describe datasets, while inferential statistical analysis can draw conclusions from data subject to random variations. Statistical analysis is widely applied in the economic and medical care fields [110].

– Data Mining Algorithms: Data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. In 2006, the IEEE International Conference on Data Mining Series (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [111], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, and CART, etc. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and link mining, all of which are among the most important problems in data mining research.
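As an illustration of cluster analysis with k-means, one of the algorithms named above, the following is a minimal sketch rather than a reference implementation. It assumes points are equal-length numeric tuples, uses squared Euclidean distance, and seeds the centroids with a farthest-first rule; that seeding rule and all function names are choices made here for reproducibility, not something prescribed by the survey.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first(points, k):
    """Deterministic seeding (an assumption of this sketch): start from the
    first point, then repeatedly add the point farthest from the chosen set."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: squared_distance(p, centroids[i]))
        for p in points
    ]


def kmeans(points, k, max_iterations=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    centroids = farthest_first(points, k)
    for _ in range(max_iterations):
        labels = assign(points, centroids)
        updated = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(centroids[i])  # keep the centroid of an empty cluster
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


if __name__ == "__main__":
    data = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
    centers, labels = kmeans(data, 2)
    print(centers, labels)
```

No labels are supplied to the algorithm, which matches the description of cluster analysis as a method without training data: the two groups emerge only from the distances between points.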

5.2 Big data analytic methods

In the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data so as to bring value to enterprises and individuals. At present, the main processing methods of big data are as follows.

– Bloom Filter: A Bloom Filter consists of a series of Hash functions. The principle of the Bloom Filter is to store Hash values of data, rather than the data itself, by utilizing a bit array, which is in essence a bitmap index that uses Hash functions to conduct lossy compression storage of data. It has such advantages as high space efficiency and high query speed, but also has some disadvantages in misrecognition (false positives) and deletion.

– Hashing: a method that essentially transforms data into shorter fixed-length numerical values or index values. Hashing has such advantages as rapid reading, writing, and high query speed, but it is hard to find a sound Hash function.

– Index: an index is always an effective method to reduce the expense of disk reading and writing, and to improve insertion, deletion, modification, and query speeds, in both traditional relational databases that manage structured data and other technologies that manage semi-structured and unstructured data. However, an index has the disadvantage of the additional cost of storing index files, which should be maintained dynamically when data is updated.

– Trie: also called trie tree, a variant of Hash Tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of the Trie is to utilize common prefixes of character strings to reduce comparisons on character strings to the greatest extent, so as to improve query efficiency.

– Parallel Computing: compared to traditional serial computing, parallel computing refers to simultaneously utilizing several computing resources to complete a computation task. Its basic idea is to decompose a problem into parts and assign them to several separate processes to be independently completed, so as to achieve co-processing. Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see a comparison in Table 1).
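The Bloom Filter item in the list above can be made concrete with a small sketch. It is illustrative only: the class name, the default sizes, and the use of salted SHA-256 digests as the family of Hash functions are assumptions made for this example. Only bits are stored, never the items themselves, so a lookup can report an item that was never added (a false positive), and clearing bits to delete one item could erase evidence of another.

```python
import hashlib


class BloomFilter:
    """Bit array plus several hash functions; hypothetical parameters."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # all bits start at 0

    def _positions(self, item):
        """Derive num_hashes bit positions by salting SHA-256 with an index."""
        data = item.encode("utf-8")
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        """Set the bit at every position derived from the item."""
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        """False means definitely absent; True means possibly present."""
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


if __name__ == "__main__":
    seen = BloomFilter()
    seen.add("alice")
    print(seen.might_contain("alice"))    # always True for an added item
    print(seen.might_contain("mallory"))  # usually False, but may be a false positive
```

The space saving comes from the fixed-size bit array: memory use depends on `num_bits`, not on the number or length of the stored items.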

Although parallel computing systems or tools such as MapReduce or Dryad are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages are being developed based on these systems. Such high-level languages include Sawzall, Pig, and Hive used for MapReduce, as well as Scope and DryadLINQ used for Dryad.
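To make the MapReduce programming model more tangible, the sketch below walks through its three logical steps (map, shuffle, reduce) on a word-count task. It is a single-process illustration and does not use the Hadoop or Google MapReduce APIs; the function names are hypothetical. In a real deployment, the map and reduce calls would run on many nodes and the shuffle would move data between them.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three steps serially; each map and reduce call is independent
    of the others, which is what makes the model easy to parallelize."""
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big analysis"]))
```

High-level languages such as Pig and Hive let users express a computation like this declaratively, leaving the decomposition into map and reduce steps to the system.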

5.3 Architecture for big data analysis

Because of the 4Vs of big data, different analytical architectures shall be considered for different application requirements.


Table 1 Comparison of MPI, MapReduce and Dryad

| | MPI | MapReduce | Dryad |
|---|---|---|---|
| Deployment | Computing node and data storage arranged separately (data should be moved to the computing node) | Computing and data storage arranged at the same node (computing should be close to data) | Computing and data storage arranged at the same node (computing should be close to data) |
| Resource management/scheduling | – | Workqueue (Google), HOD (Yahoo) | Not clear |
| Low level programming | MPI API | MapReduce API | Dryad API |
| High level programming | – | Pig, Hive, Jaql, ... | Scope, DryadLINQ |
| Data storage | The local file system, NFS, ... | GFS (Google), HDFS (Hadoop), KFS, Amazon S3, ... | NTFS, Cosmos DFS |
| Task partitioning | User manually partitions the tasks | Automation | Automation |
| Communication | Messaging, remote memory access | Files (local FS, DFS) | Files, TCP pipes, shared-memory FIFOs |
| Fault-tolerant | Checkpoint | Task re-execute | Task re-execute |

5.3.1 Real-time vs. offline analysis

According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

– Real-time analysis: mainly used in e-commerce and finance. Since data constantly changes, rapid data analysis is needed and analytical results shall be returned with a very short delay. The main existing architectures of real-time analysis include (i) parallel processing clusters using traditional relational databases, and (ii) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

– Offline analysis: usually used for applications without high requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally conducts analysis by importing logs into a special platform through data acquisition tools. Under the big data setting, many Internet enterprises utilize the offline analysis architecture based on Hadoop in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool Timetunnel, and Chukwa of Hadoop, etc. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.

5.3.2 Analysis at different levels

Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive-level analysis, which are examined in the following.

– Memory-level analysis: for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays, the memory of a server cluster surpasses hundreds of GB, while even the TB level is common. Therefore, an in-memory database technology may be used, and hot data shall reside in memory so as to improve the analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis. MongoDB is a representative memory-level analytical architecture. With the development of SSDs (Solid-State Drives), the capacity and performance of memory-level data analysis have been further improved and widely applied.

– BI analysis: for the case when the data scale surpasses the memory level but may be imported into the BI analysis environment. The current mainstream BI products are provided with data analysis plans that support data volumes above the TB level.

– Massive analysis: for the case when the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analysis utilizes HDFS of Hadoop to store data and uses MapReduce for data analysis. Most massive analysis belongs to the offline analysis category.


5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithms differ greatly from each other according to different kinds of data and application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the five most widely used software packages, according to a survey of 798 professionals made by KDNuggets in 2012 asking "What Analytics, Data mining, Big Data software that you used in the past 12 months for a real project?" [112].

– R (30.7 %): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. While computing-intensive tasks are executed, code programmed with C, C++, and Fortran may be called in the R environment. In addition, skilled users can directly call R objects in C. Actually, R is a realization of the S language, which is an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and drawing plots. Compared to S, R is more popular since it is open source. R ranks first in the KDNuggets 2012 survey. Furthermore, in a 2012 survey of "Design languages you have used for data mining/analysis in the past year", R was also in the first place, ahead of SQL and Java. Due to the popularity of R, database manufacturers such as Teradata and Oracle have released products supporting R.

– Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins with powerful functions for data analysis, such as Analysis ToolPak and Solver Add-in, are integrated initially, but such plug-ins can be used only if users enable them. Excel is also the only commercial software among the top five.

– Rapid-I RapidMiner (26.7 %): RapidMiner is an open source software used for data mining, machine learning, and predictive analysis. In an investigation by KDnuggets in 2011, it was more frequently used than R (ranked Top 1). Data mining and machine learning programs provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment. The data mining flow is described in XML and displayed through a graphical user interface (GUI). RapidMiner is written in Java. It integrates the learners and evaluation methods of Weka, and works with R. Functions of RapidMiner are implemented by connecting processes that include various operators. The entire flow can be deemed as a production line of a factory, with original data as input and model results as output. The operators can be considered as specific functions with different input and output characteristics.

– KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, and open-source-rich data integration, data processing, data analysis, and data mining platform [113]. It allows users to create data flows or data channels in a visualized manner, to selectively run some or all analytical procedures, and provides analytical results, models, and interactive views. KNIME is written in Java and, based on Eclipse, provides more functions as plug-ins. Through plug-in files, users can insert processing modules for files, pictures, and time series, and integrate them into various open source projects, e.g., R and Weka. KNIME controls data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted in a visualized environment. KNIME is designed as a module-based and expandable framework. There is no dependence between its processing units and data containers, making it adaptive to the distributed environment and independent development. In addition, it is easy to expand KNIME: developers can effortlessly expand its various nodes and views.

– Weka/Pentaho (14.8 %): Weka, abbreviated from Waikato Environment for Knowledge Analysis, is a free and open-source machine learning and data mining software written in Java. Weka provides such functions as data processing, feature selection, classification, regression, clustering, association rules, and visualization, etc. Pentaho is one of the most popular open-source BI software packages. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, and data mining, etc., i.e., all aspects of BI. Weka's data processing algorithms are also integrated in Pentaho and can be directly called.

6 Big data applications

In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful values via judgments, suggestions, supports, or decisions. However, data analysis involves a wide range of applications, which frequently change and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields, including structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

6.1 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically includes large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have emerged over the past decades. For example, as early as the 1990s, BI became a prevailing technology for business applications, and network search engines based on massive data mining processing emerged in the early 21st century. Some potential and influential applications from different fields, and their data and analysis characteristics, are discussed as follows.

– Evolution of Commercial Applications: The earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. Analytical techniques used in such systems were prevailing in the 1990s and were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, score cards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have been providing a unique opportunity for organizations to have an online display and directly interact with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted by text analysis and website mining techniques. As reported in [115], the quantity of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the Internet of Things based on sensors are opening a new generation of innovative applications, and require considerably larger capacity for supporting location-sensing, people-oriented, and context-aware operation.

– Evolution of Network Applications: The early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to the mining of email contents and to building search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages, full of various kinds of data, such as text, images, audio, videos, and interactive contents, etc. Therefore, plenty of advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share contents.

– Evolution of Scientific Applications: Scientific research in many fields, such as astrophysics, oceanology, genomics, and environmental research, is acquiring massive data with high-throughput sensors and instruments. The U.S. National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, coordination environment, virtual machine resources, inter-operative analysis software, and data service to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets have high varieties in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and it is not easy to have a comprehensive coverage, we will focus on the key problems and technologies in data analysis in the following discussions.

6.2 Big data analysis fields


6.2.1 Structured data analysis

Business applications and scientific research may generate massive structured data, of which the management and analysis rely on mature commercialized technologies, such as RDBMS, data warehouse, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning based on exact mathematical models and powerful algorithms has been applied to anomaly detection [117] and energy control [118]. Exploiting data characteristics, time and space mining can extract knowledge structures hidden in high-speed data flows and sensors [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially in process analysis with event data [121].

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to feature more business-based potential than structured data analysis. Generally, text analysis is a process to extract useful information and knowledge from unstructured text. Text mining is inter-disciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and data mining in particular. Most text mining systems are based on text expressions and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammar [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including database, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, image, audio, video, code, metadata, and hyperlink. The research on image, audio, and video mining has recently been called multimedia analysis, which will be discussed in Section 6.2.4. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves the mining of semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., email, newsgroup management, and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to simulate and integrate data in the Web, so as to conduct more complex queries than searches based on key words.

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagrams linked in a website or among multiple websites. Models are built based on topological structures provided with hyperlinks, with or without link description. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of the models to look up relevant website pages. The topic-oriented crawler is another successful case of utilizing the models [127].
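As an illustration of how a link-structure model can rank pages, the following is a minimal power-iteration sketch of the PageRank idea. It is not the implementation from [125]: the damping factor of 0.85 is the conventional textbook value, the fixed iteration count and the even redistribution of rank from pages without out-links are simplifying assumptions, and the function and variable names are hypothetical.

```python
def pagerank(links, damping=0.85, iterations=100):
    """links maps each page to the list of pages it links to.
    Returns a dict of page -> rank; the ranks sum to 1."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank equally to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no out-links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    for page, score in sorted(pagerank(web).items(), key=lambda kv: -kv[1]):
        print(page, round(score, 4))
```

In the small example graph, page "c" is linked to by both other pages and therefore ends up with the highest rank, which shows how the topology alone, without any page content, drives the ordering.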

Web usage mining aims to mine auxiliary data generated by Web dialogues or activities, whereas Web content mining and Web structure mining use the master Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or trades, cache, user queries, bookmark data, mouse clicks and scrolls, and any other kinds of data generated through interaction with the Web. As Web services and Web 2.0 are becoming mature and popular, Web usage data will have increasingly high variety. Web usage mining plays key roles in personalized space, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.


6.2.4 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) has been growing at an amazing speed; analysis is used to extract useful knowledge from it and to understand its semantics. Because multimedia data is heterogeneous and most of such data contains richer information than simple structured data or text data, extracting information is confronted with the huge challenge of semantic differences. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection, etc.

Audio summarization can be accomplished by extracting the prominent words or phrases from metadata, or by synthesizing a new representation. Video summarization is to interpret the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied to many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video, and take other smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.

Multimedia annotation inserts labels to describe the contents of images and videos at both syntax and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time and labor intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic difference. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to synchronously explore both manual and automatic multimedia annotation [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information, and assisting users to conveniently and quickly look up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, including shot boundary detection, key frame extraction, and scene segmentation, etc. According to the result of structural analysis, the second procedure is feature extraction, which mainly includes further mining the features of key frames, objects, texts, and movements, which are the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find the modes of video contents and put videos into scheduled categories so as to generate video indexes. Upon receiving a query, the system will use a similarity measurement method to look up a candidate video. The retrieval result is then optimized through relevance feedback.

Multimedia recommendation is to recommend specific multimedia contents according to users' preferences. It has proven to be an effective approach to providing personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. The content-based methods identify general features of users or their interests, and recommend to users other contents with similar features. These methods largely rely on content similarity measurement, but most of them are troubled by analysis limitations and excessive specifications. The collaborative-filtering-based methods identify groups with similar interests and recommend contents to group members according to their behavior [132]. Presently, a mixed method has been introduced, which integrates the advantages of the aforementioned two types of methods to improve recommendation quality [133].
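The collaborative-filtering idea described above can be sketched in a few lines. The example below is a user-based variant under stated assumptions: ratings are held in nested dictionaries, similarity between users is measured with cosine similarity, and unseen items are scored by similarity-weighted ratings. These choices, and all names and sample data, are illustrative and are not taken from [132].

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two users' rating dicts (item -> rating)."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[item] * v[item] for item in common)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def recommend(ratings, user, top_n=3):
    """Score items the user has not rated by the ratings of similar users."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        similarity = cosine(ratings[user], other_ratings)
        if similarity <= 0:
            continue  # ignore users with no overlap in taste
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + similarity * rating
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


if __name__ == "__main__":
    sample = {
        "ann": {"film_a": 5, "film_b": 4},
        "ben": {"film_a": 5, "film_b": 5, "film_c": 4},
        "cat": {"film_a": 1, "film_d": 5},
    }
    print(recommend(sample, "ann"))
```

Note that no content features of the films are used: the recommendation for "ann" comes entirely from the behavior of users whose ratings resemble hers, which is what distinguishes this approach from the content-based methods.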

The U.S. National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains some text descriptions related to concepts and video examples [134]. In [135], the author proposed a new algorithm for special multimedia event detection using a few positive training examples. The research on video event detection is still in its infancy, and mainly focuses on sports or news events, running or abnormal events in monitoring videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, etc., have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graph structures, describing the communications between two entities. The content data contains text, image, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. In accordance with the data-centered perspective, the existing research on social networking service (SNS) contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

The research on link-based structural analysis has always been committed to link prediction, community discovery, social network evolution, and social influence analysis, etc. SNS may be visualized as graphs, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertexes and edges are continually added to the graphs. Link prediction is to predict the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and utilizes the existing link information to generate binary classifiers to predict future links [139]. Probabilistic methods aim to build models for connection probabilities among vertexes in SNS [140]. Linear algebra computes the similarity between two vertexes according to the singular similar matrix [141]. A community is represented by a sub-graph matrix, in which edges connecting vertexes in the sub-graph feature high density, while the edges between two sub-graphs feature much lower density [142].
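To give a feel for the kind of feature that a feature-based link predictor might use, the sketch below scores every pair of currently unconnected vertexes by the Jaccard coefficient of their neighbor sets. This particular feature is an assumption chosen for illustration and is not claimed to be the method of [139]; a full predictor would feed such features, together with known links, into a binary classifier.

```python
from itertools import combinations


def jaccard_link_scores(edges):
    """edges is an iterable of undirected (u, v) pairs.
    Returns {(u, v): score} for each unconnected pair, with u < v."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    scores = {}
    for u, v in combinations(sorted(neighbors), 2):
        if v in neighbors[u]:
            continue  # already linked, nothing to predict
        shared = neighbors[u] & neighbors[v]
        union = neighbors[u] | neighbors[v]
        # Fraction of the two users' combined contacts that they have in common.
        scores[(u, v)] = len(shared) / len(union)
    return scores


if __name__ == "__main__":
    graph = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
    for pair, score in jaccard_link_scores(graph).items():
        print(pair, score)
```

A higher score suggests that two users share a larger part of their neighborhoods, which is one intuitive signal that an edge between them may appear as the network grows.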

Many methods for community detection have been proposed and studied, most of which are topology-based target functions relying on the concept of capturing community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. The research on SNS aims to look for a law and a deduction model to interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generation methods are proposed to assist network and system design [147].
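
The density-based view of a community (dense edges inside a sub-graph, sparse edges between sub-graphs) can be checked numerically for a given partition. The following is a minimal sketch under that assumption; it only evaluates a partition that is already known and is not the detection method of [143]. The function name and the toy graph are hypothetical.

```python
def edge_densities(edges, community_of):
    """Return (intra_density, inter_density) for a vertex-to-community map."""
    members = {}
    for vertex, label in community_of.items():
        members.setdefault(label, set()).add(vertex)

    intra_edges = sum(1 for u, v in edges if community_of[u] == community_of[v])
    inter_edges = len(edges) - intra_edges

    # Count the possible vertex pairs inside communities and across them.
    n = len(community_of)
    intra_pairs = sum(len(m) * (len(m) - 1) // 2 for m in members.values())
    inter_pairs = n * (n - 1) // 2 - intra_pairs

    intra = intra_edges / intra_pairs if intra_pairs else 0.0
    inter = inter_edges / inter_pairs if inter_pairs else 0.0
    return intra, inter


# Hypothetical example: two triangles joined by a single bridge edge.
toy_edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
toy_partition = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
intra, inter = edge_densities(toy_edges, toy_partition)
```

For this partition every possible internal edge is present, while only one of the nine possible cross-community edges exists, which matches the qualitative description of a community.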

Social influence refers to the case when individuals change their behavior under the influence of others. The strength of social influence depends on the relation among individuals, network distances, time effect, and characteristics of networks and individuals, etc. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise. For example, the blogosphere contains a large number of spam blogs, and Twitter likewise contains many trivial Tweets. Third, SNS are dynamic networks, which are frequently and quickly varying and updated. The existing research on social media analysis is still in its infancy. Considering that SNS contains massive information, transfer learning in heterogeneous networks aims to transfer knowledge information among different media [150].

6.2.6 Mobile data analysis

By April 2013, Android Apps had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since the research on mobile analysis has only just started, we will only introduce some recent and representative analysis applications in this section.

With the growth in the number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities with geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest Webchat). Traditional network communities or SNS communities are short of online interaction among members, and the communities are active only when members are sitting in front of computers. In contrast, mobile phones can support rich interaction at any time and anywhere. A mobile community is defined as a group of individuals with the same hobbies (e.g., health, safety, and entertainment) who gather together on networks, meet to set a common goal, decide measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, the progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build body area networks for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations, etc. In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop an application for smart phones, which analyzes paces when people walk and uses the pace information for unlocking the safety system [11]. Meanwhile, Robert Delano and Brian Parise from Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with a built-in seismograph in a mobile phone, so as to cope with Parkinson's disease and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, on marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business modes. On sales planning, after comparison of massive data, enterprises can optimize their commodity prices. On operation, enterprises can improve their operation efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor cost. On supply chain, using big data, enterprises may conduct inventory optimization, logistic optimization, and supplier coordination, etc., to mitigate the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that such activities as “Multi-times score accumulation” and “score exchange in shops” are effective for attracting quality customers. By building a customer drop out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out, so as to retain them. As a result, the drop out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers' transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao, and the corresponding transaction time, commodity prices, and purchase quantities are recorded every day, along with, more importantly, the age, gender, address, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, market conditions of their brands, and consumers' behaviors, etc., and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan of Alibaba automatically analyzes and judges whether to lend loans to enterprises through the acquired enterprise transaction data by virtue of big data technology, while manual intervention does not occur in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan with only about 0.3 % bad loans, which is greatly lower than those of other commercial banks.

6.3.2 Application of IoT based big data

IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistic enterprises may have profound experience with the application of IoT big data. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project cooperation between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and Miami city, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills this year by promptly identifying and fixing water pipes that were running and leaking.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and connections among individuals based on an information network. Big data of online SNS mainly comes from instant messages, online social networking, micro blogs, and shared space, etc., which represents various user activities. The analysis of big data from online SNS uses computational analytical methods for understanding relations in human society by virtue of theories and methods that involve mathematics, informatics, sociology, and management science, etc., from three dimensions including network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education, etc. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire values.

– Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preference, emotion, interest, and demand, etc. may be revealed.

– Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies, etc. aggregate relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.

The U.S. Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

Fig. 5 Enabling technologies for online social network-oriented big data

In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, it was found that most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and then get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the U.S. In addition, Global Pulse conducted research that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans. The goal is to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting sharp growth or drops in the amount of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation in the official statistics of Indonesia matches the number of Tweets about rice price on Twitter, as shown in Fig. 6.
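
Two of the analyses attributed to the Global Pulse project, detecting sharp growth or drops in topic volume and relating Tweet counts to an external indicator, can be sketched with basic statistics. The example below is only an illustration under stated assumptions: the trailing-window z-score rule, the threshold values, the function names, and all numbers are hypothetical and are not taken from the project.

```python
from statistics import mean, pstdev


def detect_spikes(counts, window=4, threshold=3.0, min_sigma=1.0):
    """Flag indices whose count deviates sharply from the trailing window."""
    flagged = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        # Floor the spread so a perfectly flat history does not divide by zero.
        sigma = max(pstdev(history), min_sigma)
        if abs(counts[i] - mu) / sigma >= threshold:
            flagged.append(i)
    return flagged


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Made-up weekly Tweet counts for one topic; week 5 is an obvious burst.
weekly_counts = [100, 104, 98, 102, 101, 180, 103, 99]
spikes = detect_spikes(weekly_counts)

# Made-up monthly Tweet counts and a made-up inflation series (percent).
tweets_per_month = [120, 150, 210, 260, 240]
inflation = [4.1, 4.6, 5.8, 6.9, 6.4]
r = pearson(tweets_per_month, inflation)
```

A correlation close to 1 on real data would correspond to the kind of match between Tweet volume and official statistics described above, although correlation alone does not establish that one series predicts the other.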

Generally speaking, the application of big data from online SNS may help to better understand users' behavior and master the laws of social and economic activities from the following three aspects:

– Early Warning: to rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotion, and preference of users.

– Real-time Feedback: to acquire groups' feedback on some social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the health care business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final result into a highly personalized treatment plan to assess the risk factors and main treatment plans of patients. Then, doctors may reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or by suggesting that patients reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20.

The Mount Sinai Medical Center in the U.S. utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through the software development kit (SDK) and open interface.

Fig. 6 The correlation between Tweets about rice price and food price inflation

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units and coordinate over mobile networks for the distribution of sensing tasks and the collection and utilization of sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of Crowdsourcing has been successfully applied to geotagged photographs, positioning and navigation, urban road traffic sensing, market forecast, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach for problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, Crowdsourcing had been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of Crowdsourcing. The main idea of Crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, Crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, Spatial Crowdsourcing becomes a hot topic. The operation framework of Spatial Crowdsourcing is as follows. A user may request the service and resources related to a specified location. Then the mobile users who are willing to participate in the task will move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data will be sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions provided by mobile devices, it can be forecasted that Spatial Crowdsourcing will be more prevailing than traditional Crowdsourcing, e.g., Amazon Turk and Crowdflower.
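
The first two steps of the Spatial Crowdsourcing framework (a location-bound request, then willing mobile users moving to that location) suggest a simple matching step. The sketch below assumes a nearest-first policy with a distance cutoff, which is one possible choice and is not described in the survey; the worker records, field names, and coordinates are hypothetical.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def assign_task(task_location, workers, max_km, needed=1):
    """Pick the nearest willing workers within max_km of the task location."""
    candidates = [
        (haversine_km(task_location, w["location"]), w["id"])
        for w in workers
        if w["willing"]
    ]
    in_range = sorted(c for c in candidates if c[0] <= max_km)
    return [worker_id for _, worker_id in in_range[:needed]]


# Hypothetical request and worker pool.
task = (30.0, 114.0)
pool = [
    {"id": "w1", "location": (30.01, 114.0), "willing": True},
    {"id": "w2", "location": (30.001, 114.0), "willing": False},
    {"id": "w3", "location": (30.05, 114.0), "willing": True},
    {"id": "w4", "location": (31.0, 114.0), "willing": True},
]
chosen = assign_task(task, pool, max_km=10, needed=2)
```

Here the closest user is skipped because that user is unwilling, and the most distant willing user is excluded by the cutoff, so the task goes to w1 and w3.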

6.3.6 Smart grid

Smart Grid is the next generation power grid, constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart Grid related big data are generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, which are measured by phasor measurement units (PMU) deployed nationwide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as Circuit Breaker Monitors and transformers). Smart Grid brings about the following challenges on exploiting big data.

– Grid planning: By analyzing data in the Smart Grid, the regions that have excessively high electrical load or high power outage frequencies can be identified. Even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an “electric map” according to the big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of the block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides effective and visual load forecast for power grid planning in a city. Preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads may be conducted, as displayed in the map.

– Interaction between power generation and power consumption: An ideal power grid shall balance power generation and consumption. However, the traditional power grid is constructed based on the one-directional approach of transmission-transformation-distribution-consumption, which does not allow adjusting the generation capacity according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters are developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which can help the supplier read power utilization data every 15 min rather than every month as in the past. Labor cost for meter reading is greatly reduced. Because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and low periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help the realization of time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users.

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable new energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortage. Such energy resources can complement the traditional hydropower and thermal power generations.
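
The time-sharing dynamic pricing mentioned in the second item above depends on meter readings taken every 15 min, i.e., 96 readings per day. The sketch below shows how such readings could be billed at different peak and off-peak rates. The rates, the peak window, and the load profiles are hypothetical values chosen for illustration and are not TXU Energy tariffs.

```python
def time_of_use_cost(readings_kwh, peak_hours, peak_rate, offpeak_rate):
    """Cost of one day of 15-minute interval readings (96 values, in kWh)."""
    if len(readings_kwh) != 96:
        raise ValueError("expected 96 fifteen-minute readings for one day")
    cost = 0.0
    for slot, kwh in enumerate(readings_kwh):
        hour = slot // 4  # four 15-minute slots per hour
        rate = peak_rate if hour in peak_hours else offpeak_rate
        cost += kwh * rate
    return cost


# Hypothetical tariff: 17:00-21:00 is the peak window.
peak_window = set(range(17, 21))

# Flat profile: 0.25 kWh in every slot (24 kWh per day).
flat = [0.25] * 96

# Same daily energy, but the peak-window load is moved to 00:00-04:00.
shifted = list(flat)
for slot in range(17 * 4, 21 * 4):
    shifted[slot] = 0.0
for slot in range(0, 4 * 4):
    shifted[slot] = 0.5

flat_cost = time_of_use_cost(flat, peak_window, peak_rate=0.30, offpeak_rate=0.10)
shifted_cost = time_of_use_cost(shifted, peak_window, peak_rate=0.30, offpeak_rate=0.10)
```

With monthly readings only the total energy is known, so the two profiles would be billed identically; with interval data the shifted profile costs less, which is the incentive that flattens peaks.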

7 Conclusion, open issues, and outlook

In this paper, we review the background and state-of-the-art of big data. Firstly, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. Finally, we review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture to readers of this exciting area.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, which are discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

– Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to balance the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated when the system is implemented and deployed, so the advantages and disadvantages of various alternative solutions cannot be horizontally compared, even before and after the implementation of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes: This includes memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.
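
Assuming that "MR mode" in the last item refers to the MapReduce model, the data-intensive style it stands for can be illustrated with a single-process sketch of the map, shuffle, and reduce phases. This is a teaching example with hypothetical helper names; it does not use Hadoop or any distributed runtime, and it says nothing about the data transfer cost that dominates real deployments.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and stream out (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    # Emit (word, 1) for every word in the input line.
    for word in line.lower().split():
        yield word, 1


def sum_reducer(_key, values):
    return sum(values)


lines = ["big data survey", "big data analysis", "data storage"]
counts = reduce_phase(shuffle(map_phase(lines, word_mapper)), sum_reducer)
```

In a distributed setting the shuffle step is where intermediate pairs cross the network, which is why data transfer is described above as a main bottleneck.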

7.1.2 Technology development

The big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data model, and software systems supporting big data, etc., should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor which restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.

– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor to improve big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research is advanced, new problems on big data processing arise from the traditional data analysis, including (i) data re-utilization: with the increase of data scale, more value may be mined from re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized, from which more value can be mined; (iii) data exhaust, which means wrong data during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems should be solved:

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from comprehensive utilization of multiple datasets is far higher than the sum value of individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance is the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and from different datasets.

– Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction of big data, etc. are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume is fast growing, there are more severe safety risks, while the traditional data protection methods have already been shown not applicable to big data. In particular, big data safety is confronted with the following security related challenges.

– Big data privacy: Big data privacy includes two aspects: (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc. of users may be more easily acquired, and users may not be aware. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed as a big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher of Skull Security, used an information acquisition tool to acquire data in the public pages of Facebook users who failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality. Data quality influences big data utilization. Low-quality data wastes transmission and storage resources and has poor usability. Many factors may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though many measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair damaged data need to be investigated.

– Big data safety mechanism. Big data brings challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods, which were designed for small and medium-scale data, cannot meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safe communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, the isolation, confidentiality, completeness, availability, controllability, and traceability of tenants’ data should be enabled on the premise of efficiency assurance.

– Big data application in information security. Big data not only brings challenges to information security but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) by analyzing big data in the form of log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, attack characteristics, etc. may also be more easily identified through the analysis of big data.
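As a minimal sketch of the log-analysis idea in the last item above, the following Python snippet counts alert records per source address and flags the sources that reach a threshold. The record layout (`<timestamp> <source_ip> <event_type>`), the `ALERT` label, the function name `flag_suspicious_sources`, and the threshold `min_alerts` are all assumptions made for illustration; they do not come from any particular Intrusion Detection System, and real APT detection would need far richer features than a simple count.

```python
from collections import Counter


def flag_suspicious_sources(log_lines, min_alerts=3):
    """Count ALERT records per source and return those reaching min_alerts.

    The result is a list of (source, count) pairs, highest count first.
    """
    counts = Counter()
    for line in log_lines:
        parts = line.split()
        # Hypothetical record layout: <timestamp> <source_ip> <event_type>
        if len(parts) != 3:
            continue  # skip malformed records rather than guess their meaning
        _, source, event = parts
        if event == "ALERT":
            counts[source] += 1
    return [(src, n) for src, n in counts.most_common() if n >= min_alerts]


# Illustrative records only; not taken from any real system.
sample_log = [
    "2014-01-01T00:00:01 10.0.0.5 ALERT",
    "2014-01-01T00:00:02 10.0.0.5 ALERT",
    "2014-01-01T00:00:03 10.0.0.7 OK",
    "2014-01-01T00:00:04 10.0.0.5 ALERT",
    "2014-01-01T00:00:05 10.0.0.9 ALERT",
    "malformed line",
]

print(flag_suspicious_sources(sample_log))  # prints [('10.0.0.5', 3)]
```

A fixed threshold is the simplest possible rule; at big data scale the same per-source aggregation would typically run as a distributed job, and the threshold would be replaced by a statistical or learned model.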

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy efficiency optimization, and processing hardware and software system architectures, etc. In particular, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.
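To illustrate the data quality dimensions named in the list above, the sketch below computes two of them, completeness and redundancy, for a small list of records. The function name `quality_report`, the record fields, and the exact metric definitions are hypothetical choices for this example rather than standard measures; accuracy and consistency would require reference data or domain rules and are left out.

```python
def quality_report(records, required_fields):
    """Return completeness and redundancy ratios for a list of dict records.

    completeness: share of required field values that are present and non-empty.
    redundancy:   share of records that duplicate an earlier record
                  (records are compared on the required fields only).
    """
    total = len(records)
    if total == 0 or not required_fields:
        return {"completeness": 1.0, "redundancy": 0.0}

    filled = sum(
        1
        for rec in records
        for field in required_fields
        if rec.get(field) not in (None, "")
    )
    completeness = filled / (total * len(required_fields))

    distinct = {tuple(rec.get(f) for f in required_fields) for rec in records}
    redundancy = 1 - len(distinct) / total

    return {"completeness": completeness, "redundancy": redundancy}


# Illustrative records only.
records = [
    {"id": 1, "city": "Wuhan"},
    {"id": 1, "city": "Wuhan"},   # exact duplicate of the first record
    {"id": 2, "city": ""},        # missing value
    {"id": 3, "city": "Auburn"},
]

report = quality_report(records, ["id", "city"])
print(report)  # prints {'completeness': 0.875, 'redundancy': 0.25}
```

Scores like these only detect problems; repairing damaged data, which the text also calls for, needs additional rules or models.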

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the “T” (Technology) was the main concern, while technology drove the development of data. In the big data era, with the prominence of data value and advances in “I” (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact but also influence everyone’s ways of living and thinking, which is already happening. We cannot predict the future, but we may take precautions for events that may occur in the future.

– Data with a larger scale, higher diversity, and more complex structures. Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced, given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have been concerned with better ways to cope with larger-scale, higher-diversity, and more complexly structured data. These efforts are represented by Google’s globally-distributed database Spanner and the fault-tolerant, expandable, distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to those of relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance. Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and that the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science. Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc. shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization. In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a friendly manner can they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc. are frequently used to visualize the results of data analysis. New presentation forms will emerge in the future; e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

– Data-oriented. It is well known that programs consist of data structures and algorithms, and that data structures are used to store data. In the history of program design, the role of data has become increasingly significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed that of business logic. Consequently, programs are gradually being transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods will emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking. Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

    – During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

    – Compared with accurate data, we would like to accept numerous and complicated data.

    – We shall pay greater attention to correlations between things rather than exploring causal relationships.

    – The simple algorithms of big data are more effective than the complex algorithms of small data.

    – Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace “experts”.

Throughout the history of human society, the demands and willingness of human beings have always been the driving force of scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao’s research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

                            References

1. Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

2. Fact sheet: Big data across the federal government (2012). http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf

3. Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper

4. Drowning in numbers - digital data will flood the planet - and help us understand it better (2011). http://www.economist.com/blogs/dailychart/2011/11/bigdata-0

5. Lohr S (2012) The age of big data. New York Times, pp 11

6. Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/the-digital-breadcrumbs-that-lead-to-big-data

7. Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-search-for-analysts-to-make-sense-of-big-data

8. Big data (2008). http://www.nature.com/news/specials/bigdata/index.html

9. Special online collection: dealing with big data (2011). http://www.sciencemag.org/site/special/data

10. Manyika J, McKinsey Global Institute, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute

11. Mayer-Schönberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Eamon Dolan/Houghton Mifflin Harcourt

12. Laney D (2001) 3-d data management: controlling data volume, velocity and variety. META Group Research Note, 6 February

13. Zikopoulos P, Eaton C et al (2011) Understanding big data: analytics for enterprise class hadoop and streaming data. McGraw-Hill Osborne Media

14. Meijer E (2011) The world according to linq. Communications of the ACM 54(10):45–51

15. Beyer M (2011) Gartner says solving big data challenge involves more than just managing volumes of data. Gartner. http://www.gartner.com/it/page.jsp

16. O R Team (2011) Big data now: current perspectives from O’Reilly Radar. O’Reilly Media

17. Grobelnik M (2012) Big data tutorial. http://videolectures.net/eswc2012_grobelnik_big_data/

18. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2008) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014

19. DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6):85–98


20. Walter T (2009) Teradata past, present, and future. UCI ISG lecture series on scalable data management

21. Ghemawat S, Gobioff H, Leung S-T (2003) The google file system. In: ACM SIGOPS Operating Systems Review, vol 37. ACM, pp 29–43

22. Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

23. Hey AJG, Tansley S, Tolle KM et al (2009) The fourth paradigm: data-intensive scientific discovery

24. Howard JH, Kazar ML, Menees SG, Nichols DA, Satyanarayanan M, Sidebotham RN, West MJ (1988) Scale and performance in a distributed file system. ACM Trans Comput Syst (TOCS) 6(1):51–81

25. Cattell R (2011) Scalable sql and nosql data stores. ACM SIGMOD Record 39(4):12–27

26. Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033

27. Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98

28. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J et al (2012) Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States

29. Sun Y, Chen M, Liu B, Mao S (2013) Far: a fault-avoidant routing method for data center networks with regular topology. In: Proceedings of ACM/IEEE symposium on architectures for networking and communications systems (ANCS’13). ACM

30. Wiki (2013) Applications and organizations using hadoop. http://wiki.apache.org/hadoop/PoweredBy

31. Bahga A, Madisetti VK (2012) Analyzing massive machine maintenance data in a computing cloud. IEEE Transac Parallel Distrib Syst 23(10):1831–1843

32. Gunarathne T, Wu T-L, Choi JY, Bae S-H, Qiu J (2011) Cloud computing paradigms for pleasingly parallel biomedical applications. Concurr Comput Prac Experience 23(17):2338–2354

33. Gantz J, Reinsel D (2010) The digital universe decade-are you ready. External publication of IDC (Analyse the Future) information and data, pp 1–16

34. Bryant RE (2011) Data-intensive scalable computing for scientific applications. Comput Sci Eng 13(6):25–33

35. Wahab MHA, Mohd MNH, Hanafi HF, Mohsin MFM (2008) Data pre-processing on web server logs for generalized association rules mining algorithm. World Acad Sci Eng Technol 48:2008

36. Nanopoulos A, Manolopoulos Y, Zakrzewicz M, Morzy T (2002) Indexing web access-logs for pattern queries. In: Proceedings of the 4th international workshop on web information and data management. ACM, pp 63–68

37. Joshi KP, Joshi A, Yesha Y (2003) On using a warehouse to analyze web logs. Distrib Parallel Databases 13(2):161–180

38. Chandramohan V, Christensen K (2002) A first look at wired sensor networks for video surveillance systems. In: Proceedings LCN 2002, 27th annual IEEE conference on local computer networks. IEEE, pp 728–729

39. Selavo L, Wood A, Cao Q, Sookoor T, Liu H, Srinivasan A, Wu Y, Kang W, Stankovic J, Young D et al (2007) Luster: wireless sensor network for environmental research. In: Proceedings of the 5th international conference on Embedded networked sensor systems. ACM, pp 103–116

40. Barrenetxea G, Ingelrest F, Schaefer G, Vetterli M, Couach O, Parlange M (2008) Sensorscope: out-of-the-box environmental monitoring. In: Information processing in sensor networks, 2008, international conference on IPSN’08. IEEE, pp 332–343

41. Kim Y, Schmid T, Charbiwala ZM, Friedman J, Srivastava MB (2008) Nawms: nonintrusive autonomous water monitoring system. In: Proceedings of the 6th ACM conference on Embedded network sensor systems. ACM, pp 309–322

42. Kim S, Pakzad S, Culler D, Demmel J, Fenves G, Glaser S, Turon M (2007) Health monitoring of civil infrastructures using wireless sensor networks. In: Information Processing in Sensor Networks, 2007, 6th International Symposium on IPSN 2007. IEEE, pp 254–263

43. Ceriotti M, Mottola L, Picco GP, Murphy AL, Guna S, Corra M, Pozzi M, Zonta D, Zanon P (2009) Monitoring heritage buildings with wireless sensor networks: the torre aquila deployment. In: Proceedings of the 2009 International Conference on Information Processing in Sensor Networks. IEEE Computer Society, pp 277–288

44. Tolle G, Polastre J, Szewczyk R, Culler D, Turner N, Tu K, Burgess S, Dawson T, Buonadonna P, Gay D et al (2005) A macroscope in the redwoods. In: Proceedings of the 3rd international conference on embedded networked sensor systems. ACM, pp 51–63

45. Wang F, Liu J (2011) Networked wireless sensor data collection: issues, challenges, and approaches. IEEE Commun Surv Tutor 13(4):673–687

46. Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the 11th international conference on World Wide Web. ACM, pp 124–135

47. Choudhary S, Dincturk ME, Mirtaheri SM, Moosavi A, von Bochmann G, Jourdan G-V, Onut I-V (2012) Crawling rich internet applications: the state of the art. In: CASCON, pp 146–160

48. Ghani N, Dixit S, Wang T-S (2000) On ip-over-wdm integration. IEEE Commun Mag 38(3):72–84

49. Manchester J, Anderson J, Doshi B, Dravida S (1998) Ip over sonet. IEEE Commun Mag 36(5):136–142

50. Jinno M, Takara H, Kozicki B (2009) Dynamic optical mesh networks: drivers, challenges and solutions for the future. In: Optical communication, 2009, 35th European conference on ECOC’09. IEEE, pp 1–4

51. Barroso LA, Hölzle U (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synt Lect Comput Archit 4(1):1–108

52. Armstrong J (2009) Ofdm for optical communications. J Light Technol 27(3):189–204

53. Shieh W (2011) Ofdm for flexible high-speed optical networks. J Light Technol 29(10):1560–1577

54. Cisco data center interconnect design and deployment guide (2010)

55. Greenberg A, Hamilton JR, Jain N, Kandula S, Kim C, Lahiri P, Maltz DA, Patel P, Sengupta S (2009) Vl2: a scalable and flexible data center network. In: ACM SIGCOMM computer communication review, vol 39. ACM, pp 51–62

56. Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S (2009) Bcube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 39(4):63–74

57. Farrington N, Porter G, Radhakrishnan S, Bazzaz HH, Subramanya V, Fainman Y, Papen G, Vahdat A (2011) Helios: a hybrid electrical/optical switch architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 41(4):339–350

58. Abu-Libdeh H, Costa P, Rowstron A, O’Shea G, Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Comput Commun Rev 40(4):51–62

59. Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what’s needed for datacenter network operations. IEEE Commun Mag 48(7):32–39


60. Wang G, Andersen DG, Kaminsky M, Papagiannaki K, Ng TS, Kozuch M, Ryan M (2010) c-through: part-time optics in data centers. In: ACM SIGCOMM Computer Communication Review, vol 40. ACM, pp 327–338

61. Ye X, Yin Y, Yoo SJB, Mejia P, Proietti R, Akella V (2010) Dos: a scalable optical switch for datacenters. In: Proceedings of the 6th ACM/IEEE symposium on architectures for networking and communications systems. ACM, p 24

62. Singla A, Singh A, Ramachandran K, Xu L, Zhang Y (2010) Proteus: a topology malleable data center network. In: Proceedings of the 9th ACM SIGCOMM workshop on hot topics in networks. ACM, p 8

63. Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P (2011) Energy-efficient design of a scalable optical multiplane interconnection architecture. IEEE J Sel Top Quantum Electron 17(2):377–383

64. Kodi AK, Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance computing (hpc) systems. IEEE J Sel Top Quantum Electron 17(2):384–395

65. Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H (2012) Mirror mirror on the ceiling: flexible wireless links for data centers. ACM SIGCOMM Comput Commun Rev 42(4):443–454

66. Lenzerini M (2002) Data integration: a theoretical perspective. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. ACM, pp 233–246

67. Cafarella MJ, Halevy A, Khoussainova N (2009) Data integration for the relational web. Proc VLDB Endowment 2(1):1090–1101

68. Maletic JI, Marcus A (2000) Data cleansing: beyond integrity analysis. In: IQ. Citeseer, pp 200–209

69. Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1-2):83–113

70. Chen H, Ku W-S, Wang H, Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 51–62

71. Zhao Z, Ng W (2012) A model-based approach for rfid data stream cleansing. In: Proceedings of the 21st ACM international conference on information and knowledge management. ACM, pp 862–871

72. Khoussainova N, Balazinska M, Suciu D (2008) Probabilistic event extraction from rfid data. In: Data Engineering, 2008, IEEE 24th international conference on ICDE 2008. IEEE, pp 1480–1482

73. Herbert KG, Wang JTL (2007) Biological data cleaning: a case study. Int J Inf Qual 1(1):60–82

74. Tsai T-H, Lin C-Y (2012) Exploring contextual redundancy in improving object-based video coding for video sensor networks surveillance. IEEE Transac Multmed 14(3):669–682

75. Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 269–278

76. Kamath U, Compton J, Dogan RI, Jong KD, Shehu A (2012) An evolutionary algorithm approach for feature generation from sequence data and its application to dna splice site prediction. IEEE/ACM Transac Comput Biol Bioinforma (TCBB) 9(5):1387–1398

77. Leung K-S, Lee KH, Wang J-F, Ng EYT, Chan HLY, Tsui SKW, Mok TSK, Tse PC-H, Sung JJ-Y (2011) Data mining on dna sequences of hepatitis b virus. IEEE/ACM Transac Comput Biol Bioinforma 8(2):428–440

78. Huang Z, Shen H, Liu J, Zhou X (2011) Effective data co-reduction for multimedia similarity search. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, pp 1021–1032

79. Bleiholder J, Naumann F (2008) Data fusion. ACM Comput Surv (CSUR) 41(1):1

80. Brewer EA (2000) Towards robust distributed systems. In: PODC, p 7

81. Gilbert S, Lynch N (2002) Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59

82. McKusick MK, Quinlan S (2009) Gfs: evolution on fast-forward. ACM Queue 7(7):10

83. Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J (2008) Scope: easy and efficient parallel processing of massive data sets. Proc VLDB Endowment 1(2):1265–1276

84. Beaver D, Kumar S, Li HC, Sobel J, Vajgel P et al (2010) Finding a needle in haystack: facebook’s photo storage. In: OSDI, vol 10, pp 1–8

85. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: amazon’s highly available key-value store. In: SOSP, vol 7, pp 205–220

86. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In: Proceedings of the twenty-ninth annual ACM symposium on theory of computing. ACM, pp 654–663

87. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2):4

88. Burrows M (2006) The chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th symposium on Operating systems design and implementation. USENIX Association, pp 335–350

89. Lakshman A, Malik P (2009) Cassandra: structured storage system on a p2p network. In: Proceedings of the 28th ACM symposium on principles of distributed computing. ACM, pp 5–5

90. George L (2011) HBase: the definitive guide. O’Reilly Media Inc

91. Judd D (2008) hypertable-0.9.0.4-alpha

92. Chodorow K (2013) MongoDB: the definitive guide. O’Reilly Media Inc

93. Crockford D (2006) The application/json media type for javascript object notation (json)

94. Murty J (2009) Programming amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O’Reilly Media Inc

95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O’Reilly Media Inc

96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 975–986

97. Yang H-C, Parker DS (2009) Traverse: simplified indexing on large map-reduce-merge clusters. In: Database systems for advanced applications. Springer, pp 308–322

98. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with sawzall. Sci Program 13(4):277–298

99. Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U (2009) Building a high-level dataflow system on top of map-reduce: the pig experience. Proceedings VLDB Endowment 2(2):1414–1425


100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629

101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72

102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) Dryadlinq: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14

103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: Parallel and distributed processing, 2008, IEEE international symposium on IPDPS 2008. IEEE, pp 1–11

104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146

105. Bu Y, Bill H, Balazinska M, Ernst MD (2010) Haloop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296

106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative mapreduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing. ACM, pp 810–818

107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2

108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: mapreduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing. ACM, p 7

109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) Ciel: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on Networked systems design and implementation, p 9

110. Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York

111. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

112. What analytics, data mining, big data software you used in the past 12 months for a real project (2012). http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html

113. Berthold MR, Cebron N, Dill F, Gabriel TR, Kotter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer

114. Sallam RL, Richardson J, Hagerty J, Hostmann B (2011) Magic quadrant for business intelligence platforms. CT, Gartner Group, Stamford

115. Beyond the PC. Special Report on Personal Technology (2011)

116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A et al (2011) The iplant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on Software quality assurance. ACM, pp 70–77

118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers. ACM, pp 277–286

119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Record 34(2):18–26

120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM Sigmod Record 33(1):50–57

121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Transac Manag Inform Syst (TMIS) 3(2):7

122. Manning CD, Schütze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press

123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Transac Neural Netw 13(5):1163–1177

124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11

125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117

126. Konopnicki D, Shmueli O (1995) W3qs: a query system for the world-wide web. In: VLDB, vol 95, pp 54–65

127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640

128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval. ACM, p 2

129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25

130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19

131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819

132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939

133. Barragáns-Martínez AB, Costa-Montenegro E, Burguillo JC, Rey-López M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend tv programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311

                            134 Naphade M Smith JR Tesic J Chang S-F Hsu W Kennedy LHauptmann A Curtis J (2006) Large-scale concept ontology formultimedia IEEE Multimedia 13(3)86ndash91

                            135 Ma Z Yang Y Cai Y Sebe N Hauptmann AG (2012) Knowl-edge adaptation for ad hoc multimedia event detection withfew exemplars In Proceedings of the 20th ACM internationalconference on multimedia ACM pp 469ndash478

                            136 Hirsch JE (2005) An index to quantify an individualrsquos scientificresearch output Proc Natl Acad Sci USA 102(46)16569

                            137 Watts DJ (2004) Six degrees the science of a connected ageWW Norton amp Company

                            138 Aggarwal CC (2011) An introduction to social network dataanalytics Springer

                            139 Scellato S Noulas A Mascolo C (2011) Exploiting place fea-tures in link prediction on location-based social networks InProceedings of the 17th ACM SIGKDD international conferenceon Knowledge discovery and data mining ACM pp 1046ndash1054

                            140 Ninagawa A Eguchi K (2010) Link prediction using proba-bilistic group models of network structure In Proceedings ofthe 2010 ACM symposium on applied Computing ACM pp1115ndash1116

Mobile Netw Appl (2014) 19:171–209



4.1 Storage system for massive data

Various storage systems have emerged to meet the demands of massive data. Existing massive storage technologies can be classified as Direct Attached Storage (DAS) and network storage, while network storage can be further classified into Network Attached Storage (NAS) and Storage Area Network (SAN).

In DAS, various hard disks are directly connected to servers, and data management is server-centric, such that storage devices are peripheral equipment, each of which takes a certain amount of I/O resources and is managed by individual application software. For this reason, DAS is only suitable for interconnecting servers on a small scale. Moreover, due to its low scalability, DAS exhibits undesirable efficiency when the storage capacity is increased, i.e., its upgradeability and expandability are greatly limited. Thus, DAS is mainly used in personal computers and small-sized servers.

Network storage utilizes a network to provide users with a unified interface for data access and sharing. Network storage equipment includes special data exchange equipment, disk arrays, tape libraries, and other storage media, as well as special storage software. It is characterized by strong expandability.

NAS is essentially auxiliary storage equipment attached to a network. It is directly connected to the network through a hub or switch and communicates via TCP/IP protocols. In NAS, data is transmitted in the form of files. Compared to DAS, the I/O burden at a NAS server is reduced extensively, since the server accesses a storage device indirectly through a network.

While NAS is network-oriented, SAN is especially designed for data storage with a scalable and bandwidth-intensive network, e.g., a high-speed network with optical fiber connections. In SAN, data storage management is relatively independent within a storage local area network, where multipath-based data switching among any internal nodes is utilized to achieve a maximum degree of data sharing and data management.

From the organization of a data storage system, DAS, NAS, and SAN can all be divided into three parts: (i) the disk array, which is the foundation of a storage system and the fundamental guarantee for data storage; (ii) the connection and network subsystems, which provide connections among one or more disk arrays and servers; and (iii) the storage management software, which handles data sharing, disaster recovery, and other storage management tasks of multiple servers.

4.2 Distributed storage system

The first challenge brought about by big data is how to develop a large-scale distributed storage system for efficient data processing and analysis. To use a distributed system to store massive data, the following factors should be taken into consideration:

– Consistency: a distributed storage system requires multiple servers to cooperatively store data. As there are more servers, the probability of server failures becomes larger. Usually, data is divided into multiple pieces that are stored at different servers to ensure availability in case of server failure. However, server failures and parallel storage may cause inconsistency among different copies of the same data. Consistency refers to assuring that multiple copies of the same data are identical.

– Availability: a distributed storage system operates on multiple sets of servers. As more servers are used, server failures are inevitable. It is desirable that the entire system is not seriously affected and can still satisfy customers' read and write requests. This property is called availability.

– Partition tolerance: multiple servers in a distributed storage system are connected by a network. The network could have link/node failures or temporary congestion. The distributed system should have a certain level of tolerance to problems caused by network failures. It is desirable that the distributed storage still works well when the network is partitioned.

Eric Brewer proposed the CAP theory [80, 81] in 2000, which states that a distributed system cannot simultaneously meet the requirements of consistency, availability, and partition tolerance; at most two of the three requirements can be satisfied simultaneously. Seth Gilbert and Nancy Lynch from MIT proved the correctness of the CAP theory in 2002. Since consistency, availability, and partition tolerance cannot be achieved simultaneously, we can have a CA system by ignoring partition tolerance, a CP system by ignoring availability, or an AP system by ignoring consistency, according to different design goals. The three systems are discussed in the following.

CA systems do not have partition tolerance, i.e., they cannot handle network failures. Therefore, CA systems are generally deemed to be storage systems with a single server, such as traditional small-scale relational databases. Such systems keep a single copy of data, so consistency is easily ensured. Availability is guaranteed by the excellent design of relational databases. However, since CA systems cannot handle network failures, they cannot be expanded to use many servers. Therefore, most large-scale storage systems are CP systems or AP systems.

Compared with CA systems, CP systems ensure partition tolerance. Therefore, CP systems can be expanded to become distributed systems. CP systems generally maintain several copies of the same data in order to ensure a level of fault tolerance. CP systems also ensure data consistency, i.e., multiple copies of the same data are guaranteed to be completely identical. However, CP systems cannot ensure sound availability because of the high cost of consistency assurance. Therefore, CP systems are useful for scenarios with moderate load but stringent requirements on data accuracy (e.g., trading data). BigTable and HBase are two popular CP systems.

AP systems also ensure partition tolerance. However, AP systems differ from CP systems in that they also ensure availability, while providing only eventual consistency rather than the strong consistency of the previous two systems. Therefore, AP systems only apply to scenarios with frequent requests but not very high requirements on accuracy. For example, in online Social Networking Services (SNS) systems, there are many concurrent visits to the data, but a certain amount of data errors is tolerable. Furthermore, because AP systems ensure eventual consistency, accurate data can still be obtained after a certain delay. Therefore, AP systems may also be used under circumstances with no stringent real-time requirements. Dynamo and Cassandra are two popular AP systems.
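The difference between CP and AP behavior can be illustrated with a minimal sketch. It assumes a toy model that is not part of the surveyed systems: two replicas of a single value, a link that can be cut, and a shared logical clock used only to order writes. The names `ReplicaPair`, `write`, `read`, and `heal` are hypothetical. In CP mode a write during a partition is refused; in AP mode it is accepted locally and the copies are reconciled (newest write wins) once the partition heals.

```python
class Replica:
    """One copy of a single value, tagged with a logical timestamp."""
    def __init__(self):
        self.value = None
        self.stamp = 0


class ReplicaPair:
    """Two replicas joined by a link that can be cut (a network partition)."""
    def __init__(self, mode):
        assert mode in ("CP", "AP")
        self.mode = mode
        self.replicas = [Replica(), Replica()]
        self.partitioned = False
        self.clock = 0  # shared logical clock: a simplification of this toy model

    def write(self, node, value):
        """Try to write through replica `node`; return True if accepted."""
        self.clock += 1
        if self.partitioned:
            if self.mode == "CP":
                # CP: refuse the write rather than let the copies diverge.
                return False
            # AP: accept locally; the other copy is stale until heal().
            targets = [self.replicas[node]]
        else:
            targets = self.replicas
        for replica in targets:
            replica.value, replica.stamp = value, self.clock
        return True

    def read(self, node):
        return self.replicas[node].value

    def heal(self):
        """Restore the link and reconcile: the newest write wins."""
        self.partitioned = False
        newest = max(self.replicas, key=lambda r: r.stamp)
        for replica in self.replicas:
            replica.value, replica.stamp = newest.value, newest.stamp


cp, ap = ReplicaPair("CP"), ReplicaPair("AP")
for system in (cp, ap):
    system.write(0, "v1")
    system.partitioned = True

print(cp.write(0, "v2"))  # False: consistency kept, availability lost
print(ap.write(0, "v2"))  # True: still available
print(ap.read(1))         # v1: a stale read during the partition
ap.heal()
print(ap.read(1))         # v2: the copies converge eventually
```

The stale read in AP mode corresponds to the tolerable data errors mentioned above, and the convergence after `heal()` corresponds to eventual consistency.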

4.3 Storage mechanism for big data

Considerable research on big data promotes the development of storage mechanisms for big data. Existing storage mechanisms for big data may be classified into three bottom-up levels: (i) file systems, (ii) databases, and (iii) programming models.

File systems are the foundation of the applications at upper levels. Google's GFS is an expandable distributed file system that supports large-scale, distributed, data-intensive applications [25]. GFS uses cheap commodity servers to achieve fault tolerance and provides customers with high-performance services. GFS supports large-scale file applications with more frequent reading than writing. However, GFS also has some limitations, such as a single point of failure and poor performance for small files. Such limitations have been overcome by Colossus [82], the successor of GFS.

In addition, other companies and researchers also have their own solutions to meet the different demands for storage of big data. For example, HDFS and KosmosFS are open-source derivatives of GFS. Microsoft developed Cosmos [83] to support its search and advertisement business. Facebook utilizes Haystack [84] to store its large amount of small-sized photos. Taobao also developed TFS and FastDFS. In conclusion, distributed file systems have become relatively mature after years of development and business operation. Therefore, we will focus on the other two levels in the rest of this section.

4.3.1 Database technology

Database technology has been evolving for more than 30 years. Various database systems have been developed to handle datasets at different scales and to support various applications. Traditional relational databases cannot meet the challenges on categories and scales brought about by big data. NoSQL databases (i.e., databases other than traditional relational databases) are becoming more popular for big data storage. NoSQL databases feature flexible schemas, support for simple and easy replication, simple APIs, eventual consistency, and support for large data volumes. NoSQL databases are becoming the core technology for big data. We will examine the following three main types of NoSQL databases in this section: key-value databases, column-oriented databases, and document-oriented databases, each based on certain data models.

– Key-value databases: Key-value databases are built on a simple data model in which data is stored as key-value pairs. Every key is unique, and clients retrieve values by their keys. Such databases feature a simple structure, and modern key-value databases are characterized by higher expandability and shorter query response times than relational databases. Over the past few years, many key-value databases have appeared, motivated by Amazon's Dynamo system [85]. We will introduce Dynamo and several other representative key-value databases.

  – Dynamo: Dynamo is a highly available and expandable distributed key-value data storage system. It is used to store and manage the status of some core services on the Amazon e-commerce platform that can be realized with key access. The conventional use of relational databases may generate invalid data and limit data scale and availability, while Dynamo resolves these problems with a simple key-object interface consisting of simple read and write operations. Dynamo achieves elasticity and availability through data partitioning, data replication, and object versioning mechanisms. Dynamo's partitioning scheme relies on consistent hashing [86] to divide the load among multiple main storage machines; its main advantage is that the arrival or departure of a node only affects directly adjacent nodes and does not affect other nodes. Dynamo replicates data to N servers, where N is a configurable parameter, in order to achieve high availability and durability. Dynamo also provides eventual consistency, so that updates can be propagated to all replicas asynchronously.

  – Voldemort: Voldemort is also a key-value storage system, which was initially developed for and is still used by LinkedIn. Keys and values in Voldemort are composite objects constituted by lists and maps. The Voldemort interface includes three simple operations, reading, writing, and deletion, all of which are identified by keys. Voldemort provides asynchronously updated multi-version concurrency control but does not ensure data consistency. However, Voldemort supports optimistic locking for consistent multi-record updating: when a conflict happens between an update and any other operation, the update operation is aborted. The data replication mechanism of Voldemort is the same as that of Dynamo. Voldemort not only stores data in RAM but also allows data to be inserted into a storage engine. In particular, Voldemort supports two storage engines, Berkeley DB and Random Access Files.

  The key-value database emerged a few years ago. Deeply influenced by Amazon's Dynamo, other key-value storage systems include Redis, Tokyo Cabinet and Tokyo Tyrant, Memcached and MemcacheDB, Riak, and Scalaris, all of which provide expandability by distributing keys among nodes. Voldemort, Riak, Tokyo Cabinet, and Memcached can utilize attached storage devices to store data in RAM or on disks. Other storage systems store data in RAM and provide disk backup, or rely on replication and recovery to avoid backup.
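The partitioning and replication scheme described for Dynamo can be sketched as a hash ring. The following is a minimal illustration and not Dynamo's actual implementation: the class `HashRing`, its method names, the use of MD5 for ring positions, and the one-position-per-node layout (no virtual nodes) are all assumptions made for this example. A key is stored on the first `n_replicas` distinct nodes found clockwise from the key's position.

```python
import bisect
import hashlib


def ring_position(name: str) -> int:
    """Map a node name or a key to a position on the ring (MD5 is an arbitrary choice)."""
    return int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), n_replicas=3):
        self.n_replicas = n_replicas  # the configurable parameter N
        self._points = []             # sorted ring positions
        self._owner = {}              # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        point = ring_position(node)
        bisect.insort(self._points, point)
        self._owner[point] = node

    def remove(self, node):
        point = ring_position(node)
        self._points.remove(point)
        del self._owner[point]

    def preference_list(self, key):
        """Return the nodes that hold `key`: the first N nodes clockwise from it."""
        if not self._points:
            return []
        start = bisect.bisect_right(self._points, ring_position(key))
        count = min(self.n_replicas, len(self._points))
        return [self._owner[self._points[(start + i) % len(self._points)]]
                for i in range(count)]


ring = HashRing(["node-a", "node-b", "node-c", "node-d"], n_replicas=3)
keys = [f"cart:{i}" for i in range(1000)]
before = {k: ring.preference_list(k)[0] for k in keys}

ring.add("node-e")
after = {k: ring.preference_list(k)[0] for k in keys}

moved = [k for k in keys if before[k] != after[k]]
# Every key whose first node changed now starts at the new node;
# keys in all other ring segments keep their first node.
print(len(moved), all(after[k] == "node-e" for k in moved))
```

The check at the end reflects the property stated above: when a node joins, only the ring segment directly adjacent to it is reassigned.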

– Column-oriented databases: Column-oriented databases store and process data by columns rather than rows. Both columns and rows are segmented across multiple nodes to realize expandability. Column-oriented databases are mainly inspired by Google's BigTable. In this section, we first discuss BigTable and then introduce several derivative tools.

  – BigTable: BigTable is a distributed, structured data storage system designed to process large-scale (PB-class) data across thousands of commodity servers [87]. The basic data structure of BigTable is a sparse, distributed, persistent, multi-dimensional sorted map. The map is indexed by row key, column key, and timestamp, and every value in the map is an uninterpreted byte array. Each row key in BigTable is a character string of up to 64 KB. Rows are stored in lexicographic order and continually segmented into Tablets (i.e., units of distribution) for load balancing. Thus, reading a short range of rows can be highly efficient, since it only involves communication with a small portion of the machines. Columns are grouped according to the prefixes of their keys, thus forming column families. These column families are the basic units for access control. Timestamps are 64-bit integers that distinguish different versions of cell values. Clients may flexibly determine the number of cell versions stored. These versions are sorted in descending order of timestamps, so the latest version is always read first.

    The BigTable API features the creation and deletion of tables and column families, as well as modification of the metadata of clusters, tables, and column families. Client applications may insert or delete values in BigTable, query values from columns, or browse sub-datasets in a table. BigTable also supports some other features, such as transaction processing within a single row. Users may utilize such features to conduct more complex data processing.

    A BigTable deployment includes three main components: the Master server, Tablet servers, and the client library. BigTable allows only one Master server, which is responsible for assigning Tablets to Tablet servers, detecting added or removed Tablet servers, and balancing load. In addition, it handles BigTable schema changes, e.g., creating tables and column families, and garbage-collects files stored in GFS that have been deleted or disabled for use in the specific BigTable instance. Every Tablet server manages a set of Tablets and is responsible for the reading and writing of its loaded Tablets. When Tablets grow too big, they are segmented by the server. The application client library is used to communicate with BigTable instances.

    BigTable is based on many fundamental components of Google, including GFS [25], the cluster management system, the SSTable file format, and Chubby [88]. GFS is used to store data and log files. The cluster management system is responsible for task scheduling, resource sharing, processing of machine failures, and monitoring of machine statuses. The SSTable file format is used to store BigTable data internally, and it provides a persistent, ordered, and immutable mapping from keys to values, both of which are arbitrary byte strings. BigTable utilizes Chubby for the following tasks: 1) ensuring that there is at most one active Master copy at any time; 2) storing the bootstrap location of BigTable data; 3) looking up Tablet servers; 4) conducting error recovery in case of Tablet server failures; 5) storing BigTable schema information; 6) storing the access control table.

  – Cassandra: Cassandra is a distributed storage system for managing huge amounts of structured data distributed among many commodity servers [89]. The system was developed by Facebook and became an open source tool in 2008. It adopts the ideas and concepts of both Amazon's Dynamo and Google's BigTable, in particular integrating the distributed system technology of Dynamo with the BigTable data model. Tables in Cassandra take the form of a distributed four-dimensional structured map, where the four dimensions are row, column, column family, and super column. A row is identified by a string key of arbitrary length. No matter how many columns are read or written, an operation on a row is atomic. Columns may be grouped into clusters called column families, which are similar to those in the BigTable data model. Cassandra provides two kinds of column families: column families and super columns. A super column includes an arbitrary number of columns related to the same name. A column family includes columns and super columns, which may be continuously inserted into the column family during runtime. The partitioning and replication mechanisms of Cassandra are very similar to those of Dynamo, so as to achieve consistency.

  – Derivative tools of BigTable: since the BigTable code cannot be obtained through an open source license, some open source projects compete to implement the BigTable concept and develop similar systems, such as HBase and Hypertable.

    HBase is a clone of BigTable written in Java and is part of Apache's Hadoop MapReduce framework [90]. HBase replaces GFS with HDFS. It writes updated contents into RAM and regularly writes them into files on disks. Row operations are atomic, equipped with row-level locking and transaction processing, which is optional for large scale. Partitioning and distribution are transparent, with no need for client-side hashing or a fixed key space.

    HyperTable was developed, similarly to BigTable, to provide a set of high-performance, expandable, distributed storage and processing systems for structured and unstructured data [91]. HyperTable relies on a distributed file system, e.g., HDFS, and a distributed lock manager. Its data representation, processing, and partitioning mechanisms are similar to those of BigTable. HyperTable has its own query language, called HyperTable Query Language (HQL), which allows users to create, modify, and query underlying tables.

  Since column-oriented databases mainly emulate BigTable, their designs are all similar, except for the concurrency mechanism and several other features. For example, Cassandra emphasizes weak consistency with multi-version concurrency control, while HBase and HyperTable focus on strong consistency through locks or log records.
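The BigTable data model described above (a map from row key, column key, and timestamp to an uninterpreted byte string, with rows in lexicographic order and cell versions kept newest first) can be sketched in memory as follows. This is only an illustration under stated assumptions: the class `SparseTable`, its methods, the `max_versions` parameter (assumed to be at least 1), and the example row and column names are hypothetical and do not correspond to the BigTable API.

```python
from collections import defaultdict


class SparseTable:
    """Toy (row key, column key, timestamp) -> bytes map with versioned cells."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions  # number of cell versions to keep
        # row key -> column key -> list of (timestamp, value), newest first
        self._rows = defaultdict(dict)

    def put(self, row, column, timestamp, value: bytes):
        versions = self._rows[row].setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)  # descending timestamps
        del versions[self.max_versions:]                   # drop the oldest versions

    def get(self, row, column, timestamp=None):
        """Return the newest value, or the newest one not later than `timestamp`."""
        for ts, value in self._rows.get(row, {}).get(column, []):
            if timestamp is None or ts <= timestamp:
                return value
        return None

    def scan(self, start_row, end_row):
        """Yield (row, column, latest value) for start_row <= row < end_row, in row order."""
        for row in sorted(self._rows):
            if start_row <= row < end_row:
                for column in sorted(self._rows[row]):
                    yield row, column, self._rows[row][column][0][1]


table = SparseTable(max_versions=2)
# Column keys use a "family:qualifier" form, so columns sharing a prefix form a family.
table.put("com.cnn.www", "contents:", 1, b"<html>v1</html>")
table.put("com.cnn.www", "contents:", 2, b"<html>v2</html>")
table.put("com.cnn.www", "contents:", 3, b"<html>v3</html>")
table.put("com.cnn.www", "anchor:cnnsi.com", 3, b"CNN")

print(table.get("com.cnn.www", "contents:"))               # b'<html>v3</html>'
print(table.get("com.cnn.www", "contents:", timestamp=2))  # b'<html>v2</html>'
print(table.get("com.cnn.www", "contents:", timestamp=1))  # None: version 1 was dropped
```

A real system would split the sorted rows into Tablets and spread them over servers; here `scan` only shows why a short range of adjacent rows can be read together.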

– Document databases: Compared with key-value storage, document storage can support more complex data forms. Since documents do not follow strict schemas, there is no need to conduct schema migration. In addition, key-value pairs can still be saved. We will examine three important representatives of document storage systems, i.e., MongoDB, SimpleDB, and CouchDB.

  – MongoDB: MongoDB is an open-source, document-oriented database [92]. MongoDB stores documents as Binary JSON (BSON) objects [93], which are similar to JSON objects. Every document has an ID field as the primary key. Queries in MongoDB are expressed with a syntax similar to JSON. A database driver sends the query as a BSON object to MongoDB. The system allows queries on all documents, including embedded objects and arrays. To enable rapid queries, indexes can be created on the queryable fields of documents. Replication in MongoDB is carried out through a log file on the master node that records all the high-level operations conducted in the database. During replication, the slaves query the master for all write operations since their last synchronization and execute the logged operations on their local databases. MongoDB supports horizontal scaling with automatic sharding to distribute data among thousands of nodes, with automatic load balancing and failover.


– SimpleDB: SimpleDB is a distributed database and a web service of Amazon [94]. Data in SimpleDB is organized into various domains, in which data may be stored, acquired, and queried. Domains include different properties and name/value pair sets of projects. Data is copied to different machines at different data centers in order to ensure data safety and improve performance. This system does not support automatic partitioning and thus cannot be expanded as the data volume changes. SimpleDB allows users to query with SQL. It is worth noting that SimpleDB can assure eventual consistency but does not support Multi-Version Concurrency Control (MVCC). Therefore, conflicts therein cannot be detected from the client side.

– CouchDB: Apache CouchDB is a document-oriented database written in Erlang [95]. Data in CouchDB is organized into documents consisting of fields named by keys/names and values, which are stored and accessed as JSON objects. Every document is provided with a unique identifier. CouchDB allows access to database documents through a RESTful HTTP API. If a document needs to be modified, the client must download the entire document, modify it, and then send it back to the database. After a document is rewritten, the identifier is updated. CouchDB utilizes optimistic replication to obtain scalability without a sharding mechanism. Since various CouchDBs may be executed along with other transactions simultaneously, any kind of replication topology can be built. The consistency of CouchDB relies on the replication mechanism. CouchDB supports MVCC with historical hash records.
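The download, modify, and send-back cycle described for CouchDB, together with MVCC-style conflict detection, can be illustrated with a small in-memory sketch. This is not CouchDB's API: the class `RevisionedStore`, its methods, the `ConflictError` exception, and the revision format are hypothetical names chosen for illustration. The only idea shown is that a write must name the revision it was based on, so a write based on an outdated revision is rejected.

```python
import hashlib
import json


class ConflictError(Exception):
    """Raised when a write is based on an outdated revision."""


class RevisionedStore:
    """Toy document store with per-document revisions (hypothetical, not CouchDB's API)."""

    def __init__(self):
        self._docs = {}  # doc_id -> (revision, body)

    @staticmethod
    def _next_rev(prev_rev, body):
        # Revision = generation counter plus a hash of the new content.
        generation = int(prev_rev.split("-")[0]) + 1 if prev_rev else 1
        digest = hashlib.md5(json.dumps(body, sort_keys=True).encode()).hexdigest()
        return f"{generation}-{digest}"

    def get(self, doc_id):
        rev, body = self._docs[doc_id]
        return rev, dict(body)  # a copy: the client edits the whole document

    def put(self, doc_id, body, base_rev=None):
        current = self._docs.get(doc_id)
        current_rev = current[0] if current else None
        if current_rev != base_rev:
            raise ConflictError(f"{doc_id}: stored {current_rev}, write based on {base_rev}")
        new_rev = self._next_rev(current_rev, body)
        self._docs[doc_id] = (new_rev, dict(body))
        return new_rev


store = RevisionedStore()
rev1 = store.put("user:1", {"name": "Ada"})

# Download the entire document, modify it, and send it back.
rev, doc = store.get("user:1")
doc["city"] = "London"
rev2 = store.put("user:1", doc, base_rev=rev)

# A second client still holding the first revision is rejected.
stale_write_rejected = False
try:
    store.put("user:1", {"name": "Bob"}, base_rev=rev1)
except ConflictError:
    stale_write_rejected = True
```

In this sketch the conflict is detected at write time, which is the property the text says SimpleDB lacks on the client side.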

Big data are generally stored in hundreds and even thousands of commercial servers. Thus, the traditional parallel models, such as Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), may not be adequate to support such large-scale parallel programs. Recently, some proposed parallel programming models have effectively improved the performance of NoSQL and reduced the performance gap to relational databases. Therefore, these models have become the cornerstone for the analysis of massive data.

– MapReduce: MapReduce [22] is a simple but powerful programming model for large-scale computing that uses a large number of clusters of commercial PCs to achieve automatic parallel processing and distribution. In MapReduce, the computing model has only two functions, i.e., Map and Reduce, both of which are programmed by users. The Map function processes input key-value pairs and generates intermediate key-value pairs. Then, MapReduce combines all the intermediate values related to the same key and transmits them to the Reduce function, which further compresses the value set into a smaller set. MapReduce has the advantage that it avoids the complicated steps of developing parallel applications, e.g., data scheduling, fault tolerance, and inter-node communications. The user only needs to program the two functions to develop a parallel application. The initial MapReduce framework did not support multiple datasets in a task, which has been mitigated by some recent enhancements [96, 97].

Over the past decades, programmers have become familiar with the advanced declarative language SQL, often used in relational databases for task description and dataset analysis. However, the succinct MapReduce framework only provides two nontransparent functions, which cannot cover all the common operations. Therefore, programmers have to spend time on programming the basic functions, which are typically hard to maintain and reuse. In order to improve programming efficiency, some advanced language systems have been proposed, e.g., Sawzall [98] of Google, Pig Latin [99] of Yahoo, Hive [100] of Facebook, and Scope [87] of Microsoft.

– Dryad: Dryad [101] is a general-purpose distributed execution engine for processing parallel applications on coarse-grained data. The operational structure of Dryad is a directed acyclic graph, in which vertexes represent programs and edges represent data channels. Dryad executes operations on the vertexes in clusters and transmits data via data channels, including files, TCP connections, and shared-memory FIFOs. During operation, resources in a logical operation graph are automatically mapped to physical resources.

The operation structure of Dryad is coordinated by a central program called the job manager, which can be executed in clusters or on workstations through the network. A job manager consists of two parts: 1) application code, which is used to build a job communication graph, and 2) program library code, which is used to arrange available resources. All kinds of data are directly transmitted between vertexes. Therefore, the job manager is only responsible for decision-making and does not obstruct any data transmission.

In Dryad, application developers can flexibly choose any directed acyclic graph to describe the communication modes of the application and express data transmission mechanisms. In addition, Dryad allows vertexes to use any amount of input and output data, while MapReduce supports only one input and output set.


DryadLINQ [102] is the advanced language of Dryad and is used to integrate the aforementioned SQL-like language execution environment.

– All-Pairs: All-Pairs [103] is a system specially designed for biometrics, bioinformatics, and data mining applications. It focuses on comparing element pairs in two datasets by a given function. All-Pairs can be expressed as a three-tuple (Set A, Set B, and Function F), in which Function F is utilized to compare all elements in Set A and Set B. The comparison result is an output matrix M, which is also called the Cartesian product or cross join of Set A and Set B.

All-Pairs is implemented in four phases: system modeling, distribution of input data, batch job management, and result collection. In Phase I, an approximation model of system performance is built to evaluate how much CPU resource is needed and how to conduct job partition. In Phase II, a spanning tree is built for data transmissions, which lets the workload of every partition retrieve input data effectively. In Phase III, after the data flow is delivered to proper nodes, the All-Pairs engine builds a batch-processing submission for jobs in partitions, while sequencing them in the batch processing system and formulating a node running command to acquire data. In the last phase, after the batch processing system completes the jobs, the extraction engine collects results and combines them in a proper structure, which is generally a single file list in which all results are put in order.

– Pregel: The Pregel [104] system of Google facilitates the processing of large-sized graphs, e.g., analysis of network graphs and social networking services. A computational task is expressed by a directed graph constituted by vertexes and directed edges. Every vertex is associated with a modifiable, user-defined value, and every directed edge is associated with a source vertex and consists of a user-defined value and the identifier of a target vertex. When the graph is built, the program conducts iterative calculations, called supersteps, among which global synchronization points are set, until the algorithm completes and the output is produced. In every superstep, vertex computations are parallel, and every vertex executes the same user-defined function to express a given algorithm logic. Every vertex may modify its own status and that of its output edges, receive messages sent in the previous superstep, send messages to other vertexes, and even modify the topological structure of the entire graph. Edges are not provided with corresponding computations. Every vertex may deactivate itself by suspension. When all vertexes are in an inactive status without any message to transmit, the entire program execution is completed. The Pregel program output is a set consisting of the values output from all the vertexes. Generally speaking, the input and output of a Pregel program are isomorphic directed graphs.
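The superstep loop described for Pregel can be sketched in a few lines of single-process Python. This is only an illustration of the vertex-centric model, not Pregel's API: the function `pregel_max`, its arguments, and the example graph are assumptions made for this sketch. Each vertex keeps the largest value it has seen, forwards it to its out-neighbors when it changes, and then becomes inactive until a new message arrives.

```python
def pregel_max(graph, values):
    """Propagate the maximum vertex value through a directed graph.

    graph:  dict mapping each vertex to the list of its out-neighbors
    values: dict mapping each vertex to its initial value
    Returns the final values and the number of supersteps executed.
    """
    values = dict(values)
    active = set(graph)                 # every vertex starts active
    inbox = {v: [] for v in graph}      # messages delivered in this superstep
    superstep = 0

    while active:
        outbox = {v: [] for v in graph}
        for v in active:
            # The same user-defined computation runs on every active vertex.
            new_value = max([values[v]] + inbox[v])
            changed = new_value > values[v]
            values[v] = new_value
            if superstep == 0 or changed:
                for target in graph[v]:
                    outbox[target].append(new_value)
            # The vertex now halts; only an incoming message reactivates it.

        # Global synchronization point: messages become visible next superstep.
        inbox = outbox
        active = {v for v in graph if inbox[v]}
        superstep += 1

    return values, superstep


example_graph = {"a": ["b"], "b": ["a", "d"], "c": ["b", "d"], "d": ["c"]}
initial_values = {"a": 3, "b": 6, "c": 2, "d": 1}
final_values, supersteps = pregel_max(example_graph, initial_values)
```

The loop ends exactly when no vertex is active and no message is in flight, which mirrors the termination condition stated above. In a real deployment the inner loop over vertexes would be distributed across workers.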

Inspired by the above programming models, other research efforts have also focused on programming models for more complex computational tasks, e.g., iterative computations [105, 106], fault-tolerant memory computations [107], incremental computations [108], and flow control decision-making related to data [109].
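As a concrete reference point for the MapReduce model discussed above, the following is a minimal single-process sketch of the Map, group-by-key, and Reduce steps, using word counting as the task. The helper `run_mapreduce` and the two user functions are hypothetical names for this sketch. A real framework would run the same two user functions across many machines and would handle scheduling, fault tolerance, and communication itself.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: emit an intermediate (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: compress all values of one key into a single count."""
    return word, sum(counts)


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, group intermediate values by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)  # grouping by intermediate key
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


lines = [(0, "big data big value"), (1, "data value chain")]
word_counts = run_mapreduce(lines, map_fn, reduce_fn)
```

Only `map_fn` and `reduce_fn` are task-specific, which is the sense in which the user "only needs to program the two functions".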

5 Big data analysis

The analysis of big data mainly involves analytical methods for traditional data and big data, analytical architecture for big data, and software used for mining and analysis of big data. Data analysis is the final and the most important phase in the value chain of big data, with the purpose of extracting useful values and providing suggestions or decisions. Different levels of potential values can be generated through the analysis of datasets in different fields [10]. However, data analysis is a broad area, which frequently changes and is extremely complex. In this section, we introduce the methods, architectures, and tools for big data analysis.

5.1 Traditional data analysis

Traditional data analysis means using proper statistical methods to analyze massive data, to concentrate, extract, and refine useful data hidden in a batch of chaotic datasets, and to identify the inherent law of the subject matter, so as to maximize the value of data. Data analysis plays a huge guidance role in making development plans for a country, understanding customer demands for commerce, and predicting market trends for enterprises. Big data analysis can be deemed as the analysis technique for a special kind of data. Therefore, many traditional data analysis methods may still be utilized for big data analysis. Several representative traditional data analysis methods are examined in the following, many of which are from statistics and computer science.

– Cluster Analysis: a statistical method for grouping objects, and specifically, classifying objects according to some features. Cluster analysis is used to differentiate objects with particular features and divide them into some categories (clusters) according to these features, such that objects in the same category have high homogeneity while different categories have high heterogeneity. Cluster analysis is an unsupervised learning method without training data.

– Factor Analysis: basically targeted at describing the relation among many elements with only a few factors, i.e., grouping several closely related variables into a factor; the few factors are then used to reveal most of the information of the original data.


– Correlation Analysis: an analytical method for determining the law of relations, such as correlation, correlative dependence, and mutual restriction, among observed phenomena, and accordingly conducting forecast and control. Such relations may be classified into two types: (i) function, reflecting the strict dependence relationship among phenomena, which is also called a definitive dependence relationship; (ii) correlation, i.e., undetermined or inexact dependence relations, in which a numerical value of one variable may correspond to several numerical values of the other variable, and such numerical values present a regular fluctuation around their mean values.

– Regression Analysis: a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies dependence relationships among variables hidden by randomness. Regression analysis may make complex and undetermined correlations among variables simple and regular.

– A/B Testing: also called bucket testing. It is a technology for determining how to improve target variables by comparing tested groups. Big data will require a large number of tests to be executed and analyzed.

– Statistical Analysis: Statistical analysis is based on statistical theory, a branch of applied mathematics. In statistical theory, randomness and uncertainty are modeled with probability theory. Statistical analysis can provide a description and an inference for big data. Descriptive statistical analysis can summarize and describe datasets, while inferential statistical analysis can draw conclusions from data subject to random variations. Statistical analysis is widely applied in the economic and medical care fields [110].

– Data Mining Algorithms: Data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. In 2006, the IEEE International Conference on Data Mining Series (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [111], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, and CART, etc. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and link mining, all of which are the most important problems in data mining research.
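To make the distinction between correlation analysis and regression analysis in the list above more tangible, the sketch below computes the Pearson correlation coefficient and a least-squares line for a small sample. Both are standard textbook formulas rather than anything specific to this survey; the function names and the sample data are assumptions chosen for illustration.

```python
import math


def _centered_sums(xs, ys):
    """Return the covariance and variance sums around the sample means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return mean_x, mean_y, cov, var_x, var_y


def pearson_r(xs, ys):
    """Correlation analysis: strength of the linear relation, in [-1, 1]."""
    _, _, cov, var_x, var_y = _centered_sums(xs, ys)
    return cov / math.sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Regression analysis: least-squares slope and intercept of y on x."""
    mean_x, mean_y, cov, var_x, _ = _centered_sums(xs, ys)
    slope = cov / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations: y is roughly 2x with some random fluctuation.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(xs, ys)
slope, intercept = fit_line(xs, ys)
```

The coefficient `r` only says how strongly the two variables move together, whereas the fitted line gives a rule that can be used for forecasting, which is the role the text assigns to regression.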

5.2 Big data analytic methods

In the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data so as to bring value to enterprises and individuals. At present, the main processing methods of big data are as follows.

– Bloom Filter: A Bloom filter consists of a series of hash functions. The principle of the Bloom filter is to store hash values of data rather than the data itself by utilizing a bit array, which is in essence a bitmap index that uses hash functions to conduct lossy compression storage of data. It has such advantages as high space efficiency and high query speed, but also has some disadvantages in misrecognition and deletion.

– Hashing: a method that essentially transforms data into shorter fixed-length numerical values or index values. Hashing has such advantages as rapid reading, writing, and high query speed, but it is hard to find a sound hash function.

– Index: An index is always an effective method to reduce the expense of disk reading and writing, and to improve insertion, deletion, modification, and query speeds, in both traditional relational databases that manage structured data and other technologies that manage semi-structured and unstructured data. However, an index has the disadvantage of the additional cost of storing index files, which must be maintained dynamically when data is updated.

– Trie: also called trie tree, a variant of Hash Tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of the trie is to utilize common prefixes of character strings to reduce comparisons on character strings to the greatest extent, so as to improve query efficiency.

– Parallel Computing: Compared to traditional serial computing, parallel computing refers to simultaneously utilizing several computing resources to complete a computation task. Its basic idea is to decompose a problem into parts and assign them to several separate processes to be independently completed, so as to achieve co-processing. Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see a comparison in Table 1).
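The Bloom filter in the list above can be sketched as a bit array plus a few hash functions. The class below is a minimal illustration under stated assumptions: the class name, the default sizes, and the choice of deriving several hash functions from SHA-256 with a seed prefix are all made for this sketch and are not taken from the survey.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: stores hash positions of items, not the items."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # the bit array

    def _positions(self, item):
        # Derive several hash functions by prefixing a seed to the item.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


seen = BloomFilter()
for user in ("alice", "bob"):
    seen.add(user)

alice_maybe_seen = "alice" in seen
```

The two disadvantages named in the text are visible in the structure: a lookup can report an item that was never added (misrecognition) because different items may set the same bits, and an item cannot be deleted because clearing its bits could erase other items.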

Although parallel computing systems or tools, such as MapReduce or Dryad, are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages are being developed based on these systems. Such high-level languages include Sawzall, Pig, and Hive, used for MapReduce, as well as Scope and DryadLINQ, used for Dryad.
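Returning to the trie listed among the methods above, its use of common prefixes for retrieval and word frequency statistics can be sketched as follows. The class and method names are hypothetical and chosen for this illustration only.

```python
class TrieNode:
    """One character position; children are keyed by the next character."""

    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}
        self.count = 0  # number of inserted words that end at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            # Words sharing a prefix share the nodes of that prefix.
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1

    def _walk(self, text):
        node = self.root
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def frequency(self, word):
        """Word frequency statistics: how often the exact word was inserted."""
        node = self._walk(word)
        return node.count if node else 0

    def starts_with(self, prefix):
        """Rapid retrieval: is there any inserted word with this prefix?"""
        return self._walk(prefix) is not None


trie = Trie()
for word in ("data", "data", "database", "day"):
    trie.insert(word)
```

A lookup costs one dictionary step per character of the query, independent of how many words are stored, which is the efficiency gain the text attributes to shared prefixes.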

5.3 Architecture for big data analysis

Because of the 4Vs of big data, different analytical architectures shall be considered for different application requirements.


Table 1 Comparison of MPI, MapReduce, and Dryad

| | MPI | MapReduce | Dryad |
| Deployment | Computing node and data storage arranged separately (data should be moved to the computing node) | Computing and data storage arranged at the same node (computing should be close to data) | Computing and data storage arranged at the same node (computing should be close to data) |
| Resource management/scheduling | – | Workqueue (Google), HOD (Yahoo) | Not clear |
| Low-level programming | MPI API | MapReduce API | Dryad API |
| High-level programming | – | Pig, Hive, Jaql, ... | Scope, DryadLINQ |
| Data storage | The local file system, NFS, ... | GFS (Google), HDFS (Hadoop), KFS, Amazon S3, ... | NTFS, Cosmos DFS |
| Task partitioning | User manually partitions the tasks | Automatic | Automatic |
| Communication | Messaging, remote memory access | Files (local FS, DFS) | Files, TCP pipes, shared-memory FIFOs |
| Fault tolerance | Checkpoint | Task re-execution | Task re-execution |

5.3.1 Real-time vs. offline analysis

According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

– Real-time analysis: mainly used in e-commerce and finance. Since data constantly changes, rapid data analysis is needed, and analytical results shall be returned with a very short delay. The main existing architectures of real-time analysis include (i) parallel processing clusters using traditional relational databases, and (ii) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

– Offline analysis: usually used for applications without high requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally conducts analysis by importing logs into a special platform through data acquisition tools. Under the big data setting, many Internet enterprises utilize the offline analysis architecture based on Hadoop in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool Timetunnel, and Chukwa of Hadoop, etc. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.

5.3.2 Analysis at different levels

Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive-level analysis, which are examined in the following.

– Memory-level analysis: for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays, the memory of a server cluster surpasses hundreds of GB, while even the TB level is common. Therefore, an in-memory database technology may be used, and hot data shall reside in memory so as to improve the analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis. MongoDB is a representative memory-level analytical architecture. With the development of SSDs (Solid-State Drives), the capacity and performance of memory-level data analysis have been further improved and widely applied.

– BI analysis: for the case when the data scale surpasses the memory level but may be imported into the BI analysis environment. The current mainstream BI products are provided with data analysis plans to support the level over TB.

– Massive analysis: for the case when the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analysis utilizes HDFS of Hadoop to store data and uses MapReduce for data analysis. Most massive analysis belongs to the offline analysis category.


5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithms differ greatly from each other according to different kinds of data and application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.
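As a small illustration of an analysis that is amenable to parallel processing, the sketch below computes a mean by splitting the data into chunks, summarizing each chunk independently, and combining the partial results. It uses a thread pool from Python's standard library only to keep the example self-contained; the function names and chunk size are assumptions, and a cluster-scale system would distribute the chunks across machines instead.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(values, size):
    """Decompose the problem: split the input into fixed-size chunks."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


def partial_stats(chunk):
    """Independent work on one chunk: its element count and its sum."""
    return len(chunk), sum(chunk)


def parallel_mean(values, chunk_size=1000, workers=4):
    """Combine per-chunk partial results into the global mean."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_stats, chunked(values, chunk_size)))
    count = sum(n for n, _ in partials)
    total = sum(s for _, s in partials)
    return total / count


data = list(range(1, 10001))
mean_value = parallel_mean(data)
```

The mean parallelizes well because the partial results (count and sum) can be merged in any order. Algorithms whose steps depend on each other's intermediate state do not decompose this easily, which is one reason complexity differs so much between analyses.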

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the top five most widely used software, according to a survey of "What Analytics, Data mining, Big Data software that you used in the past 12 months for a real project?" of 798 professionals made by KDNuggets in 2012 [112].

– R (30.7 %): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. When computing-intensive tasks are executed, code programmed in C, C++, and Fortran may be called in the R environment. In addition, skilled users can directly call R objects in C. Actually, R is a realization of the S language, which is an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and drawing plots. Compared to S, R is more popular since it is open source. R ranks top 1 in the KDNuggets 2012 survey. Furthermore, in a survey of "Design languages you have used for data mining/analysis in the past year" in 2012, R was also in first place, defeating SQL and Java. Due to the popularity of R, database manufacturers, such as Teradata and Oracle, have released products supporting R.

– Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins with powerful functions for data analysis, such as Analysis ToolPak and Solver Add-in, are integrated initially, but such plug-ins can be used only if users enable them. Excel is also the only commercial software among the top five.

– Rapid-I RapidMiner (26.7 %): RapidMiner is open source software used for data mining, machine learning, and predictive analysis. In an investigation of KDnuggets in 2011, it was more frequently used than R (ranked Top 1). Data mining and machine learning programs provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment. The data mining flow is described in XML and displayed through a graphical user interface (GUI). RapidMiner is written in Java. It integrates the learners and evaluation methods of Weka and works with R. Functions of RapidMiner are implemented by connecting processes that include various operators. The entire flow can be deemed as a production line of a factory, with original data as input and model results as output. The operators can be considered as specific functions with different input and output characteristics.

– KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, and open-source-rich data integration, data processing, data analysis, and data mining platform [113]. It allows users to create data flows or data channels in a visualized manner, to selectively run some or all analytical procedures, and provides analytical results, models, and interactive views. KNIME was written in Java and, based on Eclipse, provides more functions as plug-ins. Through plug-in files, users can insert processing modules for files, pictures, and time series, and integrate them into various open source projects, e.g., R and Weka. KNIME controls data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted in a visualized environment. KNIME is designed as a module-based and expandable framework. There is no dependence between its processing units and data containers, making it adaptive to the distributed environment and independent development. In addition, it is easy to expand KNIME. Developers can effortlessly expand various nodes and views of KNIME.

– Weka/Pentaho (14.8 %): Weka, abbreviated from Waikato Environment for Knowledge Analysis, is free and open-source machine learning and data mining software written in Java. Weka provides such functions as data processing, feature selection, classification, regression, clustering, association rules, and visualization, etc. Pentaho is one of the most popular open-source BI software. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, and data mining, etc., i.e., all aspects of BI. Weka's data processing algorithms are also integrated in Pentaho and can be directly called.

                              6 Big data applications

In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful values via judgments, suggestions, supports, or decisions. However, data analysis involves a wide range of applications, which frequently change and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields, including structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

6.1 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically includes large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have emerged in the past decades. For example, as early as the 1990s, BI had become a prevailing technology for business applications, and network search engines based on massive data mining processing emerged in the early 21st century. Some potential and influential applications from different fields, and their data and analysis characteristics, are discussed as follows.

– Evolution of Commercial Applications: The earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. Analytical techniques used in such systems were prevailing in the 1990s and were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, score cards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have been providing a unique opportunity for organizations to have an online presence and directly interact with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted by text analysis and website mining techniques. As reported in [115], the quantity of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the Internet of Things based on sensors are opening a new generation of innovative applications and requiring considerably larger capacity for supporting location sensing, people-oriented, and context-aware operation.

– Evolution of Network Applications: The early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to the mining of email contents and building search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages, full of various kinds of data, such as text, images, audio, videos, and interactive contents, etc. Therefore, plenty of advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share contents.

– Evolution of Scientific Applications: Scientific research in many fields, such as astrophysics, oceanology, genomics, and environmental research, is acquiring massive data with high-throughput sensors and instruments. The US National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, inter-operative analysis software, and data services to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets have high variety in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

6.2 Big data analysis fields

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and it is not easy to have a comprehensive coverage, we will focus on the key problems and technologies in data analysis in the following discussions.

                              621 Structured data analysis

Business applications and scientific research may generate massive structured data, whose management and analysis rely on mature commercialized technologies such as RDBMS, data warehouses, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning based on exact mathematical models and powerful algorithms has been applied to anomaly detection [117] and energy control [118]. By exploiting data characteristics, temporal and spatial mining can extract knowledge structures hidden in high-speed data flows and sensor data [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially in process analysis with event data [121].

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to have greater business potential than structured data analysis. Generally, text analysis is a process to extract useful information and knowledge from unstructured text. Text mining is interdisciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and data mining in particular. Most text mining systems are based on text representation and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammars [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, images, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which will be discussed in Section 6.2.4. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves the mining of semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., email, newsgroup management, and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to model and integrate data on the Web so as to conduct more complex queries than keyword-based searches.

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagrams of links within a website or among multiple websites. Models are built based on topological structures provided with hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages. The topic-oriented crawler is another successful case that utilizes these models [127].
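
To make the link-based idea concrete, the following is a minimal sketch of a PageRank-style power iteration over a hyperlink graph. It is an illustration under stated assumptions and not the exact formulation of [125]: the function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the four-page toy site are all hypothetical choices made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page site in which every other page links to "home".
toy_links = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "contact": ["home"],
}
ranks = pagerank(toy_links)
print(max(ranks, key=ranks.get))
```

In this toy graph the page with the most incoming links receives the highest score, which is the kind of structural signal that link-analysis models exploit when looking up relevant pages.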

Web usage mining aims to mine auxiliary data generated by Web dialogues or activities, whereas Web content mining and Web structure mining use the master Web data. Web usage data includes access logs at Web servers and proxy servers, browsers’ history records, user profiles, registration data, user sessions or transactions, caches, user queries, bookmark data, mouse clicks and scrolls, and any other kinds of data generated through interaction with the Web. As Web services and Web 2.0 become mature and popular, Web usage data will have increasingly high variety. Web usage mining plays key roles in personalized space, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.
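
As a small illustration of how access logs can be prepared for usage mining, the sketch below groups log entries into user sessions. The record layout `(user_id, timestamp, url)`, the function name `sessionize`, and the 30-minute inactivity timeout are assumptions made for this example and are not taken from the text.

```python
def sessionize(events, timeout_seconds=1800):
    """Group (user_id, unix_timestamp, url) records into per-user sessions.

    A new session starts when a user has been inactive for longer than
    timeout_seconds (an assumed heuristic, not prescribed by the survey).
    """
    sessions = {}
    last_seen = {}
    for user, timestamp, url in sorted(events, key=lambda e: (e[0], e[1])):
        user_sessions = sessions.setdefault(user, [])
        if user not in last_seen or timestamp - last_seen[user] > timeout_seconds:
            user_sessions.append([])  # open a new session for this user
        user_sessions[-1].append(url)
        last_seen[user] = timestamp
    return sessions


# Hypothetical access-log excerpt.
toy_log = [
    ("u1", 0, "/home"),
    ("u1", 120, "/item/42"),
    ("u2", 60, "/home"),
    ("u1", 4000, "/cart"),
]
print(sessionize(toy_log))
```

Sessions reconstructed in this way are a typical input for later steps such as personalization or recommendation.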

6.2.4 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) have been growing at an amazing speed, and analysis is used to extract useful knowledge from them and to understand their semantics. Because multimedia data is heterogeneous and most of such data contains richer information than simple structured data or text data, extracting information is confronted with the huge challenge of the semantic gap. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection.

Audio summarization can be accomplished by extracting the prominent words or phrases from metadata or by synthesizing a new representation. Video summarization interprets the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied to many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video and apply additional smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.
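
A static, key-frame-based summary can be sketched as follows. The example assumes, hypothetically, that each frame has already been reduced to a normalized color histogram and that a new key frame is selected whenever the histogram differs sufficiently from the previous key frame. The function name `select_key_frames` and the threshold value are illustrative, and this is not the method of [128].

```python
def select_key_frames(frames, threshold=0.3):
    """Return indices of key frames from a list of normalized histograms."""
    if not frames:
        return []
    key_indices = [0]  # the first frame always opens the summary
    for i in range(1, len(frames)):
        reference = frames[key_indices[-1]]
        # L1 distance between two normalized histograms lies in [0, 2].
        distance = sum(abs(a - b) for a, b in zip(frames[i], reference))
        if distance > threshold:
            key_indices.append(i)
    return key_indices


# Hypothetical three-bin histograms for five consecutive frames.
toy_frames = [
    [0.80, 0.10, 0.10],
    [0.78, 0.12, 0.10],
    [0.20, 0.70, 0.10],
    [0.22, 0.68, 0.10],
    [0.10, 0.10, 0.80],
]
print(select_key_frames(toy_frames))
```

The sketch keeps one representative frame per visually distinct segment, which reflects why such methods are simple but can miss semantic content.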

Multimedia annotation inserts labels to describe the contents of images and videos at both the syntactic and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time and labor intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic gap. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to jointly explore both manual and automatic multimedia annotation [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users to conveniently and quickly look up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, including shot boundary detection, key frame extraction, and scene segmentation. According to the result of structural analysis, the second procedure is feature extraction, which mainly includes further mining the features of key frames, objects, texts, and movements, which are the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find the patterns of video contents and put videos into predefined categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos. The retrieval result is then optimized through relevance feedback.
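
The query and retrieval step can be sketched with a simple similarity measurement. The example below assumes, hypothetically, that every indexed video is represented by a fixed-length feature vector and uses cosine similarity to rank candidates. The names `cosine_similarity`, `retrieve`, and `video_index`, as well as the feature values, are illustrative and are not taken from [130] or [131].

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(index, query_features, top_k=2):
    """Return the ids of the top_k indexed videos most similar to the query."""
    scored = [
        (cosine_similarity(features, query_features), video_id)
        for video_id, features in index.items()
    ]
    scored.sort(reverse=True)
    return [video_id for _, video_id in scored[:top_k]]


# Hypothetical index: three videos with three extracted features each.
video_index = {
    "news_clip": [0.9, 0.1, 0.0],
    "soccer_match": [0.1, 0.8, 0.3],
    "tennis_final": [0.0, 0.7, 0.6],
}
print(retrieve(video_index, [0.0, 0.9, 0.4]))
```

A relevance feedback loop would then adjust the query vector or the ranking based on which candidates the user accepts; that step is omitted here.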

Multimedia recommendation recommends specific multimedia content according to users’ preferences. It has proven to be an effective approach to provide personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. Content-based methods identify general features of users or their interests and recommend other content with similar features to those users. These methods largely rely on content similarity measurement, but most of them are troubled by limited analysis and overspecialization. Collaborative-filtering-based methods identify groups with similar interests and recommend content to group members according to their behavior [132]. Presently, a hybrid method has been introduced, which integrates the advantages of the aforementioned two types of methods to improve recommendation quality [133].
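
The collaborative-filtering idea can be illustrated with a minimal user-based sketch. It assumes a small dictionary of explicit ratings and weights other users' ratings by cosine similarity, with unrated items treated as zero. The function names, the rating scale, and the toy data are hypothetical, and the sketch is not the specific method of [132] or [133].

```python
import math


def similarity(ratings_a, ratings_b):
    """Cosine similarity between two users' rating dicts (missing = 0)."""
    shared = set(ratings_a) & set(ratings_b)
    dot = sum(ratings_a[item] * ratings_b[item] for item in shared)
    norm_a = math.sqrt(sum(r * r for r in ratings_a.values()))
    norm_b = math.sqrt(sum(r * r for r in ratings_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(ratings, user, top_k=2):
    """Rank items the user has not rated by similarity-weighted ratings."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        weight = similarity(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + weight * rating
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:top_k]


# Hypothetical ratings on a 1-5 scale.
toy_ratings = {
    "alice": {"video_a": 5, "video_b": 4},
    "bob": {"video_a": 5, "video_b": 5, "video_c": 4},
    "carol": {"video_d": 5, "video_b": 1},
}
print(recommend(toy_ratings, "alice"))
```

Because the first user's tastes overlap strongly with the second user's, the item liked by the second user is ranked ahead of the item liked by the dissimilar third user.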

The U.S. National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains text descriptions related to concepts and video examples [134]. In [135], the author proposed a new algorithm for special multimedia event detection using a few positive training examples. The research on video event detection is still in its infancy and mainly focuses on sports or news events, running or abnormal events in surveillance videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graph structures describing the communications between two entities. The content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. From the data-centered perspective, the existing research on social networking service (SNS) contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

The research on link-based structural analysis has focused on link prediction, community discovery, social network evolution, and social influence analysis. An SNS may be visualized as a graph in which every vertex corresponds to a user and edges correspond to the relations among users. Since SNS are dynamic networks, new vertices and edges are continually added to the graphs. Link prediction aims to predict the possibility of a future connection between two vertices. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and utilizes the existing link information to generate binary classifiers to predict future links [139]. Probabilistic methods aim to build models for connection probabilities among vertices in SNS [140]. Linear algebra methods compute the similarity between two vertices according to the singular similarity matrix [141]. A community is represented by a sub-graph matrix in which edges connecting vertices within the sub-graph feature high density, while the edges between two sub-graphs feature much lower density [142].
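
As a simple illustration of link prediction, the sketch below scores every pair of unconnected vertices by the Jaccard coefficient of their neighbor sets. This neighborhood-based score is a stand-in chosen for brevity. It could serve as one input feature for a classifier, but it is not the specific technique of [139], [140], or [141]. The function name `jaccard_link_scores` and the toy friendship graph are hypothetical.

```python
from itertools import combinations


def jaccard_link_scores(adjacency):
    """Score unconnected vertex pairs by neighbor-set overlap.

    adjacency maps each vertex to the set of its neighbors in an
    undirected graph. A higher score suggests a more likely future link.
    """
    scores = {}
    for u, v in combinations(sorted(adjacency), 2):
        if v in adjacency[u]:
            continue  # the pair is already connected
        union = adjacency[u] | adjacency[v]
        common = adjacency[u] & adjacency[v]
        scores[(u, v)] = len(common) / len(union) if union else 0.0
    return scores


# Hypothetical undirected friendship graph.
toy_graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"bob", "cat", "eve"},
    "eve": {"dan"},
}
scores = jaccard_link_scores(toy_graph)
print(max(scores, key=scores.get))
```

The highest-scoring pair shares two of three distinct neighbors, so under this heuristic it is the most likely candidate for a new edge.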

Many methods for community detection have been proposed and studied, most of which are topology-based target functions relying on the concept of capturing community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. The research on SNS evolution aims to look for laws and deduction models to interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generation methods have been proposed to assist network and system design [147].
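
The density-based notion of a community used in this discussion can be checked numerically. The following sketch compares the edge density inside a candidate group with the density of edges between two groups. The helper names `internal_density` and `cross_density` and the toy graph are assumptions for illustration, and the sketch only evaluates a given partition rather than detecting communities as in [143].

```python
from itertools import combinations, product


def internal_density(edges, group):
    """Fraction of all possible vertex pairs inside `group` that are linked."""
    pairs = [frozenset(pair) for pair in combinations(sorted(group), 2)]
    if not pairs:
        return 0.0
    return sum(pair in edges for pair in pairs) / len(pairs)


def cross_density(edges, group_a, group_b):
    """Fraction of possible pairs between two disjoint groups that are linked."""
    pairs = [frozenset(pair) for pair in product(group_a, group_b)]
    if not pairs:
        return 0.0
    return sum(pair in edges for pair in pairs) / len(pairs)


# Hypothetical graph: two triangles joined by the single edge (c, d).
toy_edges = {
    frozenset(pair)
    for pair in [("a", "b"), ("b", "c"), ("a", "c"),
                 ("d", "e"), ("e", "f"), ("d", "f"),
                 ("c", "d")]
}
left, right = {"a", "b", "c"}, {"d", "e", "f"}
print(internal_density(toy_edges, left), cross_density(toy_edges, left, right))
```

A partition whose internal densities are much higher than its cross density matches the informal definition of a community; topology-based target functions formalize this comparison in different ways.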

Social influence refers to the case when individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, time effects, and characteristics of networks and individuals. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of content in SNS is considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise. For example, the blogosphere contains a large number of spam blogs, and Twitter likewise contains many trivial Tweets. Third, SNS are dynamic networks, which vary and are updated frequently and quickly. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge among different media [150].

6.2.6 Mobile data analysis

By April 2013, Android Apps had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since the research on mobile analysis has just started, we will only introduce some recent and representative analysis applications in this section.

With the growing number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities based on geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest Webchat). Traditional network communities or SNS communities lack online interaction among members, and the communities are active only when members are sitting before computers. On the contrary, mobile phones can support rich interaction at any time and anywhere. A mobile community is defined as a group of individuals with the same hobbies (i.e., health, safety, entertainment, etc.) who gather together on networks, meet to set a common goal, decide measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, the progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build body area networks for real-time monitoring of people’s health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations. In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop an application for smart phones, which analyzes people’s gait when they walk and uses the gait information to unlock the security system [11]. Meanwhile, Robert Delano and Brian Parise from the Georgia Institute of Technology developed an application called iTrem, which monitors human body tremor with a built-in seismograph in a mobile phone, so as to cope with Parkinson’s disease and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, in marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business models. In sales planning, after comparing massive data, enterprises can optimize their commodity prices. In operations, enterprises can improve their operational efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor costs. In the supply chain, using big data, enterprises may conduct inventory optimization, logistics optimization, and supplier coordination to close the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that such activities as “Multi-times score accumulation” and “score exchange in shops” are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers’ transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, more importantly along with the age, gender, address, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, the market conditions of their brands, and consumers’ behaviors, and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan of Alibaba automatically analyzes and judges whether to lend loans to enterprises through the acquired enterprise transaction data by virtue of big data technology, while manual intervention does not occur in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan with only about 0.3 % bad loans, which is much lower than those of other commercial banks.

6.3.2 Application of IoT based big data

IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises may have the most profound experience with the application of IoT big data. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project cooperation between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and Miami city and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills this year by identifying and fixing running and leaking water pipes in a timely manner.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and connections among individuals based on an information network. Big data of online SNS mainly comes from instant messages, online social networking, micro blogs, and shared spaces, which represent various user activities. The analysis of big data from online SNS uses computational analytical methods to understand relations in human society by virtue of theories and methods from mathematics, informatics, sociology, and management science, along three dimensions: network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire value.

– Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preference, emotion, interest, and demand may be revealed.

– Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies aggregate relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.

The U.S. Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime patterns, and even predict the crime rates in major regions [11].

Fig. 5 Enabling technologies for online social network-oriented big data

In April 2013, Wolfram Alpha, a computing and search engine company, studied the laws of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and then get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the U.S. In addition, Global Pulse conducted research that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans. The goal is to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting the sharp growth or drop of the amount of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation from the official statistics of Indonesia matches the number of Tweets about rice price on Twitter, as shown in Fig. 6.
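
The first of these aspects, detecting a sharp growth or drop in the amount of a topic, can be sketched with a rolling z-score over daily message counts. This is a generic illustration and not the method used by Global Pulse: the function name `detect_spikes`, the seven-day window, the threshold of three standard deviations, and the daily counts are all assumptions made for this example.

```python
from statistics import mean, pstdev


def detect_spikes(counts, window=7, threshold=3.0):
    """Return indices whose count deviates sharply from the trailing window."""
    flagged = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            # Flat history: treat any change at all as a sharp deviation.
            if counts[i] != mu:
                flagged.append(i)
            continue
        if abs(counts[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Hypothetical daily counts of messages mentioning one topic.
daily_mentions = [100, 104, 98, 101, 99, 103, 97, 102, 100, 240, 101]
print(detect_spikes(daily_mentions))
```

Only the day with the abrupt jump is flagged in this toy series. A production system would also need to handle weekly seasonality and the way a spike inflates the statistics of the following window.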

Generally speaking, the application of big data from online SNS may help to better understand users’ behavior and master the laws of social and economic activities from the following three aspects:

– Early Warning: to rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotion, and preference of users.

– Real-time Feedback: to acquire groups’ feedback on social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information value. Big data has great potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the health care business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final result into a highly personalized treatment plan to assess the risk factors and main treatment plans of patients. Then, doctors may reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or by suggesting that patients reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20.

The Mount Sinai Medical Center in the U.S. utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through the software development kit (SDK) and open interface.

Fig. 6 The correlation between Tweets about rice price and food price inflation

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units and coordinate over mobile networks to distribute sensing tasks and to collect and utilize sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of Crowdsourcing has been successfully applied to geotagged photographs, positioning and navigation, urban road traffic sensing, market forecasting, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach for problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, Crowdsourcing had been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of Crowdsourcing. The main idea of Crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, Crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, Spatial Crowdsourcing has become a hot topic. The operation framework of Spatial Crowdsourcing is as follows. A user may request the service and resources related to a specified location. Then, the mobile users who are willing to participate in the task move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data is sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions they provide, it can be forecast that Spatial Crowdsourcing will become more prevalent than traditional Crowdsourcing, e.g., Amazon Turk and Crowdflower.
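
The request step of this framework can be sketched as a simple task assignment. The example below picks the nearest willing mobile user within a maximum travel distance. The record fields (`id`, `location`, `willing`), the 5 km limit, the coordinates, and the function names are hypothetical, and real platforms may use quite different assignment policies.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def assign_task(task_location, workers, max_km=5.0):
    """Return the id of the nearest willing worker within max_km, or None."""
    candidates = [
        (haversine_km(task_location, w["location"]), w["id"])
        for w in workers
        if w["willing"]
    ]
    candidates = [c for c in candidates if c[0] <= max_km]
    return min(candidates)[1] if candidates else None


# Hypothetical mobile users around a requested location.
toy_workers = [
    {"id": "w1", "location": (30.51, 114.31), "willing": True},
    {"id": "w2", "location": (30.5001, 114.3001), "willing": False},
    {"id": "w3", "location": (31.50, 114.30), "willing": True},
]
print(assign_task((30.50, 114.30), toy_workers))
```

The closest user is skipped because that user is not willing to participate, and the distant user is excluded by the travel limit, so the task goes to the remaining nearby volunteer.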

6.3.6 Smart grid

Smart Grid is the next-generation power grid, constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart Grid related big data are generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nationwide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as Circuit Breaker Monitors and transformers). Smart Grid brings about the following challenges on exploiting big data.

– Grid planning: By analyzing data in the Smart Grid, the regions that have excessively high electrical load or high power outage frequencies can be identified. Even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an “electric map” according to big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of the block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides an effective and visual load forecast for power grid planning in a city. Preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads may be conducted, as displayed in the map.

– Interaction between power generation and power consumption: An ideal power grid shall balance power generation and consumption. However, the traditional power grid is constructed based on the one-directional approach of transmission-transformation-distribution-consumption, which does not allow the generation capacity to be adjusted according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters have been developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which help the supplier read power utilization data every 15 min rather than every month as in the past. Labor cost for meter reading is greatly reduced. Because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and low periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help realize time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users.

Mobile Netw Appl (2014) 19:171–209

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortage. Such energy resources can complement the traditional hydropower and thermal power generation.
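As an illustration of the time-sharing pricing idea in the second item of the list above, the following sketch aggregates 15-minute meter readings by hour and charges above-average hours at a premium. It is a toy sketch under stated assumptions: the function names, the `peak_factor` and `off_peak_factor` values, and the "above mean load" rule are hypothetical and are not taken from TXU Energy or from the survey.

```python
from collections import defaultdict


def hourly_load(readings):
    """Sum 15-minute readings, given as (hour_of_day, kWh), into kWh per hour."""
    totals = defaultdict(float)
    for hour, kwh in readings:
        totals[hour] += kwh
    return dict(totals)


def time_of_use_prices(load_by_hour, base_price, peak_factor=1.5, off_peak_factor=0.7):
    """Price hours whose load exceeds the mean at a premium, the rest at a discount.

    The two factors are illustrative assumptions, not real tariff values.
    """
    if not load_by_hour:
        return {}
    mean_load = sum(load_by_hour.values()) / len(load_by_hour)
    return {
        hour: base_price * (peak_factor if load > mean_load else off_peak_factor)
        for hour, load in load_by_hour.items()
    }


# Made-up readings: two 15-minute samples for each of three hours of the day.
readings = [(3, 0.5), (3, 0.5), (18, 2.0), (18, 2.0), (12, 1.0), (12, 1.0)]
load = hourly_load(readings)
prices = time_of_use_prices(load, base_price=0.10)
```

With these readings only the 18:00 hour lies above the mean load, so it alone receives the premium price. An actual tariff would be set from demand forecasts and regulation rather than a single threshold.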

7 Conclusion, open issues, and outlook

In this paper, we review the background and state-of-the-art of big data. First, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. Finally, we review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide readers with a comprehensive overview and big picture of this exciting area.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in its early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, which are discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

– Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to assess the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated after the system is implemented and deployed, so the advantages and disadvantages of alternative solutions cannot be compared side by side, nor can the situation before and after the adoption of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes: This includes memory mode, data flow mode, PRAM mode, and MR (MapReduce) mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.
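To make the MR mode mentioned in the last item concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases applied to word counting. It only illustrates the programming model; it is not how Hadoop or any distributed runtime is implemented, and the function names are chosen for this example.

```python
from collections import defaultdict


def map_phase(documents):
    """Emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine the values of each key; here the combination is a sum."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    return reduce_phase(shuffle(map_phase(documents)))


counts = word_count(["big data", "Big data analysis"])
```

In a real deployment the map and reduce functions run on many machines and the shuffle step moves data across the network, which is where the data transfer bottleneck noted above shows up.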

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, etc., should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor which restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.


– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor to improve big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems in big data processing arise from traditional data analysis, including (i) data re-utilization: with the increase of data scale, more value may be mined from re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized so that more value can be mined; and (iii) data exhaust, which means wrong data collected during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.
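The format conversion problem in the first item of this list can be illustrated with a small sketch that maps two heterogeneous sources, one CSV and one JSON Lines, onto a single common schema. The field names, the `normalize` helper, and the sample data are hypothetical and serve only to show the idea.

```python
import csv
import io
import json


def from_csv(text):
    """Parse CSV text with a header row into a list of dicts."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def from_json_lines(text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def normalize(record, field_map):
    """Rename source fields to the common schema and coerce the reading to float."""
    out = {common: record[source] for source, common in field_map.items()}
    out["value"] = float(out["value"])
    return out


# Two made-up sources that describe the same kind of sensor reading differently.
csv_text = "sensor,reading\ns1,20.5\ns2,21.0\n"
jsonl_text = '{"id": "s3", "val": 19}\n{"id": "s4", "val": "22.5"}\n'

unified = (
    [normalize(r, {"sensor": "sensor_id", "reading": "value"}) for r in from_csv(csv_text)]
    + [normalize(r, {"id": "sensor_id", "val": "value"}) for r in from_json_lines(jsonl_text)]
)
```

Even in this toy case each source needs its own field mapping and type coercion, which hints at why conversion cost grows with the number and diversity of sources.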

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved.

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from comprehensive utilization of multiple datasets is far higher than the sum of the values of individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and from different datasets.

– Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction of big data, etc. are all important research problems.
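The integration and provenance item in the list above can be sketched as a merge that remembers which source supplied each field. This is a deliberately simple sketch: the `integrate` function, the "first source wins" rule for redundant fields, and the sample census and utility records are assumptions made for illustration, not a technique proposed in the survey.

```python
def integrate(datasets, key):
    """Merge records that share `key` and record which source supplied each field."""
    merged = {}
    for source_name, records in datasets.items():
        for record in records:
            entry = merged.setdefault(record[key], {"fields": {}, "provenance": {}})
            for field, value in record.items():
                if field == key:
                    continue
                # Assumed policy: the first source wins, later copies count as redundant.
                if field not in entry["fields"]:
                    entry["fields"][field] = value
                    entry["provenance"][field] = source_name
    return merged


# Made-up sources keyed by city block.
datasets = {
    "census": [{"block": "B1", "income": 52000}],
    "utility": [
        {"block": "B1", "kwh": 310, "income": 50000},
        {"block": "B2", "kwh": 120},
    ],
}
result = integrate(datasets, key="block")
```

Keeping the provenance map next to the merged fields makes it possible to trace a conflicting value, such as the two income figures for block B1, back to its origin.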

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume is growing fast, there are more severe safety risks, while the traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security-related challenges.

– Big data privacy: Big data privacy includes two aspects: (i) protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc. of users may be more easily acquired, and users may not be aware; (ii) personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed a big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher at Skull Security, used an information acquisition tool to collect data from the public pages of Facebook users who had failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low-quality data wastes transmission and storage resources and has poor usability. Many factors may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though a lot of measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.

– Big data safety mechanism: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods designed for small and medium-scale data cannot meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safe communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants’ data should be enabled on the premise of efficiency assurance.

– Big data application in information security: Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) after analysis of big data in the form of log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc. may also be more easily identified through the analysis of big data.
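Two of the data quality dimensions named in the data quality item of this list, completeness and redundancy, can be measured automatically with very little code. The definitions below are one plausible choice among many and are assumptions of this sketch; the survey does not prescribe formulas for them.

```python
def completeness(records, fields):
    """Fraction of expected field values that are present and not None."""
    total = len(records) * len(fields)
    if total == 0:
        return 1.0
    present = sum(1 for r in records for f in fields if r.get(f) is not None)
    return present / total


def redundancy(records, fields):
    """Fraction of records that exactly duplicate an earlier record."""
    if not records:
        return 0.0
    seen = set()
    duplicates = 0
    for r in records:
        signature = tuple(r.get(f) for f in fields)
        if signature in seen:
            duplicates += 1
        else:
            seen.add(signature)
    return duplicates / len(records)


# Made-up readings: one missing value, one missing field, one exact duplicate.
records = [
    {"id": 1, "temp": 20.5},
    {"id": 2, "temp": None},
    {"id": 1, "temp": 20.5},
    {"id": 3},
]
fields = ["id", "temp"]
completeness_score = completeness(records, fields)
redundancy_score = redundancy(records, fields)
```

Accuracy and consistency are harder to score this way because they need a reference value or cross-record rules, which is part of why automatic quality detection remains an open problem.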

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy efficiency optimization, and hardware and software system architectures for processing, etc. In particular, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the “T” (Technology) was the main concern, and technology drove the development of data. In the big data era, with the prominence of data value and advances in the “I” (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact but also influence everyone’s ways of living and thinking, which is already happening. We cannot predict the future, but we may take precautions for possible future events.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced, given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have been seeking better ways to cope with larger-scale, more diverse, and more complexly structured data. These efforts are represented by Spanner, Google’s globally distributed database, and F1, a fault-tolerant, scalable, distributed relational database. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to those of relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc. shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., in text and image editors. In big data applications, mixed data is very useful for decision making. Only when analytical results are displayed in a user-friendly way can they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc. are frequently used to visualize the results of data analysis. New presentation forms will emerge in the future; e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it is observed that the role of data has become increasingly significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed that of business logic. Consequently, programs are gradually being transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods will emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

– During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

– Compared with accurate data, we would like to accept numerous and complicated data.

– We shall pay greater attention to correlations between things rather than exploring causal relationships.

– The simple algorithms of big data are more effective than the complex algorithms of small data.

– Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace “experts.”

Throughout the history of human society, the demands and willingness of human beings have always been the driving forces of scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao’s research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

                              References

1. Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

2. Fact sheet: Big data across the federal government (2012). http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf

3. Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper

4. Drowning in numbers - digital data will flood the planet - and help us understand it better (2011). http://www.economist.com/blogs/dailychart/2011/11/bigdata-0

5. Lohr S (2012) The age of big data. New York Times, pp 11

6. Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/thedigitalbreadcrumbs-that-lead-to-big-data

7. Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-searchforanalysts-to-make-sense-of-big-data

8. Big data (2008). http://www.nature.com/news/specials/bigdata/index.html

9. Special online collection: dealing with big data (2011). http://www.sciencemag.org/site/special/data

10. Manyika J, McKinsey Global Institute, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute

11. Mayer-Schonberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Eamon Dolan/Houghton Mifflin Harcourt

12. Laney D (2001) 3-d data management: controlling data volume, velocity and variety. META Group Research Note, 6 February

13. Zikopoulos P, Eaton C, et al (2011) Understanding big data: analytics for enterprise class hadoop and streaming data. McGraw-Hill Osborne Media

14. Meijer E (2011) The world according to linq. Communications of the ACM 54(10):45–51

15. Beyer M (2011) Gartner says solving big data challenge involves more than just managing volumes of data. Gartner. http://www.gartner.com/it/page.jsp

16. O R Team (2011) Big data now: current perspectives from O’Reilly Radar. O’Reilly Media

17. Grobelnik M (2012) Big data tutorial. http://videolectures.net/eswc2012grobelnikbigdata

18. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2008) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014

19. DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6):85–98


20. Walter T (2009) Teradata past, present, and future. UCI ISG lecture series on scalable data management

21. Ghemawat S, Gobioff H, Leung S-T (2003) The google file system. In: ACM SIGOPS Operating Systems Review, vol 37. ACM, pp 29–43

22. Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

23. Hey AJG, Tansley S, Tolle KM, et al (2009) The fourth paradigm: data-intensive scientific discovery

24. Howard JH, Kazar ML, Menees SG, Nichols DA, Satyanarayanan M, Sidebotham RN, West MJ (1988) Scale and performance in a distributed file system. ACM Trans Comput Syst (TOCS) 6(1):51–81

25. Cattell R (2011) Scalable sql and nosql data stores. ACM SIGMOD Record 39(4):12–27

26. Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033

27. Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98

28. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J, et al (2012) Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States

29. Sun Y, Chen M, Liu B, Mao S (2013) Far: a fault-avoidant routing method for data center networks with regular topology. In: Proceedings of ACM/IEEE symposium on architectures for networking and communications systems (ANCS’13). ACM

30. Wiki (2013) Applications and organizations using hadoop. http://wiki.apache.org/hadoop/PoweredBy

31. Bahga A, Madisetti VK (2012) Analyzing massive machine maintenance data in a computing cloud. IEEE Transac Parallel Distrib Syst 23(10):1831–1843

32. Gunarathne T, Wu T-L, Choi JY, Bae S-H, Qiu J (2011) Cloud computing paradigms for pleasingly parallel biomedical applications. Concurr Comput Prac Experience 23(17):2338–2354

33. Gantz J, Reinsel D (2010) The digital universe decade-are you ready. External publication of IDC (Analyse the Future) information and data, pp 1–16

34. Bryant RE (2011) Data-intensive scalable computing for scientific applications. Comput Sci Eng 13(6):25–33

35. Wahab MHA, Mohd MNH, Hanafi HF, Mohsin MFM (2008) Data pre-processing on web server logs for generalized association rules mining algorithm. World Acad Sci Eng Technol 48:2008

36. Nanopoulos A, Manolopoulos Y, Zakrzewicz M, Morzy T (2002) Indexing web access-logs for pattern queries. In: Proceedings of the 4th international workshop on web information and data management. ACM, pp 63–68

37. Joshi KP, Joshi A, Yesha Y (2003) On using a warehouse to analyze web logs. Distrib Parallel Databases 13(2):161–180

38. Chandramohan V, Christensen K (2002) A first look at wired sensor networks for video surveillance systems. In: Proceedings LCN 2002, 27th annual IEEE conference on local computer networks. IEEE, pp 728–729

39. Selavo L, Wood A, Cao Q, Sookoor T, Liu H, Srinivasan A, Wu Y, Kang W, Stankovic J, Young D, et al (2007) Luster: wireless sensor network for environmental research. In: Proceedings of the 5th international conference on Embedded networked sensor systems. ACM, pp 103–116

40. Barrenetxea G, Ingelrest F, Schaefer G, Vetterli M, Couach O, Parlange M (2008) Sensorscope: out-of-the-box environmental monitoring. In: Information processing in sensor networks, 2008, international conference on IPSN’08. IEEE, pp 332–343

41. Kim Y, Schmid T, Charbiwala ZM, Friedman J, Srivastava MB (2008) Nawms: nonintrusive autonomous water monitoring system. In: Proceedings of the 6th ACM conference on Embedded network sensor systems. ACM, pp 309–322

42. Kim S, Pakzad S, Culler D, Demmel J, Fenves G, Glaser S, Turon M (2007) Health monitoring of civil infrastructures using wireless sensor networks. In: Information Processing in Sensor Networks 2007, 6th International Symposium on IPSN 2007. IEEE, pp 254–263

43. Ceriotti M, Mottola L, Picco GP, Murphy AL, Guna S, Corra M, Pozzi M, Zonta D, Zanon P (2009) Monitoring heritage buildings with wireless sensor networks: the torre aquila deployment. In: Proceedings of the 2009 International Conference on Information Processing in Sensor Networks. IEEE Computer Society, pp 277–288

44. Tolle G, Polastre J, Szewczyk R, Culler D, Turner N, Tu K, Burgess S, Dawson T, Buonadonna P, Gay D, et al (2005) A macroscope in the redwoods. In: Proceedings of the 3rd international conference on embedded networked sensor systems. ACM, pp 51–63

45. Wang F, Liu J (2011) Networked wireless sensor data collection: issues, challenges, and approaches. IEEE Commun Surv Tutor 13(4):673–687

46. Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the 11th international conference on World Wide Web. ACM, pp 124–135

47. Choudhary S, Dincturk ME, Mirtaheri SM, Moosavi A, von Bochmann G, Jourdan G-V, Onut I-V (2012) Crawling rich internet applications: the state of the art. In: CASCON, pp 146–160

48. Ghani N, Dixit S, Wang T-S (2000) On ip-over-wdm integration. IEEE Commun Mag 38(3):72–84

49. Manchester J, Anderson J, Doshi B, Dravida S (1998) Ip over sonet. IEEE Commun Mag 36(5):136–142

50. Jinno M, Takara H, Kozicki B (2009) Dynamic optical mesh networks: drivers, challenges and solutions for the future. In: Optical communication, 2009, 35th European conference on ECOC’09. IEEE, pp 1–4

51. Barroso LA, Holzle U (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synt Lect Comput Archit 4(1):1–108

52. Armstrong J (2009) Ofdm for optical communications. J Light Technol 27(3):189–204

53. Shieh W (2011) Ofdm for flexible high-speed optical networks. J Light Technol 29(10):1560–1577

54. Cisco data center interconnect design and deployment guide (2010)

55. Greenberg A, Hamilton JR, Jain N, Kandula S, Kim C, Lahiri P, Maltz DA, Patel P, Sengupta S (2009) Vl2: a scalable and flexible data center network. In: ACM SIGCOMM computer communication review, vol 39. ACM, pp 51–62

56. Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S (2009) Bcube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 39(4):63–74

57. Farrington N, Porter G, Radhakrishnan S, Bazzaz HH, Subramanya V, Fainman Y, Papen G, Vahdat A (2011) Helios: a hybrid electrical/optical switch architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 41(4):339–350

58. Abu-Libdeh H, Costa P, Rowstron A, O’Shea G, Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Comput Commun Rev 40(4):51–62

59. Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what’s needed for datacenter network operations. IEEE Commun Mag 48(7):32–39

                              60. Wang G, Andersen DG, Kaminsky M, Papagiannaki K, Ng TS, Kozuch M, Ryan M (2010) c-Through: part-time optics in data centers. In: ACM SIGCOMM Computer Communication Review, vol 40. ACM, pp 327–338

                              61. Ye X, Yin Y, Yoo SJB, Mejia P, Proietti R, Akella V (2010) DOS: a scalable optical switch for datacenters. In: Proceedings of the 6th ACM/IEEE symposium on architectures for networking and communications systems. ACM, p 24

                              62. Singla A, Singh A, Ramachandran K, Xu L, Zhang Y (2010) Proteus: a topology malleable data center network. In: Proceedings of the 9th ACM SIGCOMM workshop on hot topics in networks. ACM, p 8

                              63. Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P (2011) Energy-efficient design of a scalable optical multiplane interconnection architecture. IEEE J Sel Top Quantum Electron 17(2):377–383

                              64. Kodi AK, Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance computing (HPC) systems. IEEE J Sel Top Quantum Electron 17(2):384–395

                              65. Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H (2012) Mirror mirror on the ceiling: flexible wireless links for data centers. ACM SIGCOMM Comput Commun Rev 42(4):443–454

                              66. Lenzerini M (2002) Data integration: a theoretical perspective. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. ACM, pp 233–246

                              67. Cafarella MJ, Halevy A, Khoussainova N (2009) Data integration for the relational web. Proc VLDB Endowment 2(1):1090–1101

                              68. Maletic JI, Marcus A (2000) Data cleansing: beyond integrity analysis. In: IQ. Citeseer, pp 200–209

                              69. Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1-2):83–113

                              70. Chen H, Ku W-S, Wang H, Sun M-T (2010) Leveraging spatio-temporal redundancy for RFID data cleansing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 51–62

                              71. Zhao Z, Ng W (2012) A model-based approach for RFID data stream cleansing. In: Proceedings of the 21st ACM international conference on information and knowledge management. ACM, pp 862–871

                              72. Khoussainova N, Balazinska M, Suciu D (2008) Probabilistic event extraction from RFID data. In: Data Engineering, 2008. IEEE 24th international conference on ICDE 2008. IEEE, pp 1480–1482

                              73. Herbert KG, Wang JTL (2007) Biological data cleaning: a case study. Int J Inf Qual 1(1):60–82

                              74. Tsai T-H, Lin C-Y (2012) Exploring contextual redundancy in improving object-based video coding for video sensor networks surveillance. IEEE Transac Multmed 14(3):669–682

                              75. Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 269–278

                              76. Kamath U, Compton J, Dogan RI, Jong KD, Shehu A (2012) An evolutionary algorithm approach for feature generation from sequence data and its application to DNA splice site prediction. IEEE/ACM Transac Comput Biol Bioinforma (TCBB) 9(5):1387–1398

                              77. Leung K-S, Lee KH, Wang J-F, Ng EYT, Chan HLY, Tsui SKW, Mok TSK, Tse PC-H, Sung JJ-Y (2011) Data mining on DNA sequences of hepatitis B virus. IEEE/ACM Transac Comput Biol Bioinforma 8(2):428–440

                              78. Huang Z, Shen H, Liu J, Zhou X (2011) Effective data co-reduction for multimedia similarity search. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, pp 1021–1032

                              79. Bleiholder J, Naumann F (2008) Data fusion. ACM Comput Surv (CSUR) 41(1):1

                              80. Brewer EA (2000) Towards robust distributed systems. In: PODC, p 7

                              81. Gilbert S, Lynch N (2002) Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59

                              82. McKusick MK, Quinlan S (2009) GFS: evolution on fast-forward. ACM Queue 7(7):10

                              83. Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J (2008) SCOPE: easy and efficient parallel processing of massive data sets. Proc VLDB Endowment 1(2):1265–1276

                              84. Beaver D, Kumar S, Li HC, Sobel J, Vajgel P et al (2010) Finding a needle in Haystack: Facebook's photo storage. In: OSDI, vol 10, pp 1–8

                              85. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: Amazon's highly available key-value store. In: SOSP, vol 7, pp 205–220

                              86. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In: Proceedings of the twenty-ninth annual ACM symposium on theory of computing. ACM, pp 654–663

                              87. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2):4

                              88. Burrows M (2006) The Chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th symposium on Operating systems design and implementation. USENIX Association, pp 335–350

                              89. Lakshman A, Malik P (2009) Cassandra: structured storage system on a P2P network. In: Proceedings of the 28th ACM symposium on principles of distributed computing. ACM, pp 5–5

                              90. George L (2011) HBase: the definitive guide. O'Reilly Media Inc

                              91. Judd D (2008) hypertable-0.9.0.4-alpha

                              92. Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc

                              93. Crockford D (2006) The application/json media type for JavaScript object notation (JSON)

                              94. Murty J (2009) Programming Amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc

                              95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc

                              96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in MapReduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 975–986

                              97. Yang H-C, Parker DS (2009) Traverse: simplified indexing on large map-reduce-merge clusters. In: Database systems for advanced applications. Springer, pp 308–322

                              98. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with Sawzall. Sci Program 13(4):277–298

                              99. Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U (2009) Building a high-level dataflow system on top of map-reduce: the Pig experience. Proceedings VLDB Endowment 2(2):1414–1425

                              100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629

                              101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72

                              102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14

                              103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: Parallel and distributed processing, 2008. IEEE international symposium on IPDPS 2008. IEEE, pp 1–11

                              104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146

                              105. Bu Y, Bill H, Balazinska M, Ernst MD (2010) HaLoop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296

                              106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative MapReduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing. ACM, pp 810–818

                              107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2

                              108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: MapReduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing. ACM, p 7

                              109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) CIEL: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on Networked systems design and implementation, p 9

                              110. Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York

                              111. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

                              112. What analytics, data mining, big data software you used in the past 12 months for a real project? (2012) http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html

                              113. Berthold MR, Cebron N, Dill F, Gabriel TR, Kötter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer

                              114. Sallam RL, Richardson J, Hagerty J, Hostmann B (2011) Magic quadrant for business intelligence platforms. Gartner Group, Stamford, CT

                              115. Beyond the PC. Special Report on Personal Technology (2011)

                              116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A et al (2011) The iPlant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

                              117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on Software quality assurance. ACM, pp 70–77

                              118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers. ACM, pp 277–286

                              119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Record 34(2):18–26

                              120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM Sigmod Record 33(1):50–57

                              121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Transac Manag Inform Syst (TMIS) 3(2):7

                              122. Manning CD, Schütze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press

                              123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Transac Neural Netw 13(5):1163–1177

                              124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11

                              125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117

                              126. Konopnicki D, Shmueli O (1995) W3QS: a query system for the world-wide web. In: VLDB, vol 95, pp 54–65

                              127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640

                              128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval. ACM, p 2

                              129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25

                              130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19

                              131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819

                              132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939

                              133. Barragáns-Martínez AB, Costa-Montenegro E, Burguillo JC, Rey-López M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend TV programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311

                              134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91

                              135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia. ACM, pp 469–478

                              136. Hirsch JE (2005) An index to quantify an individual's scientific research output. Proc Natl Acad Sci USA 102(46):16569

                              137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company

                              138. Aggarwal CC (2011) An introduction to social network data analytics. Springer

                              139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1046–1054

                              140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied Computing. ACM, pp 1115–1116

                              141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Transac Knowl Discov Data (TKDD) 5(2):10

                              142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on World wide web. ACM, pp 631–640

                              143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis. ACM, pp 16–25

                              144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. ACM, pp 315–321

                              145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 145–158

                              146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using Google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 131–144

                              147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1007–1016

                              148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 807–816

                              149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on Web search and data mining. ACM, pp 657–666

                              150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360

                              151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (accessed 5 May 2013)

                              152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51

                              153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive RFID, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 2

                              154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval. ACM, pp 469–478

                              155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated, multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium. ACM, pp 445–454

                              156. Tasevski P (2011) Password attacks and generation strategies. Tartu University: Faculty of Mathematics and Computer Sciences

                                level of fault tolerance. CP systems also ensure data consistency, i.e., multiple copies of the same data are guaranteed to be completely identical. However, CP systems cannot ensure sound availability because of the high cost of consistency assurance. Therefore, CP systems are useful for scenarios with moderate load but stringent requirements on data accuracy (e.g., trading data). BigTable and HBase are two popular CP systems.

                                AP systems also ensure partition tolerance. However, AP systems differ from CP systems in that they also ensure availability. In exchange, AP systems only ensure eventual consistency rather than the strong consistency of the previous two systems. Therefore, AP systems only apply to scenarios with frequent requests but not very high requirements on accuracy. For example, in online Social Networking Services (SNS) systems, there are many concurrent visits to the data, but a certain amount of data errors is tolerable. Furthermore, because AP systems ensure eventual consistency, accurate data can still be obtained after a certain amount of delay. Therefore, AP systems may also be used in circumstances with no stringent real-time requirements. Dynamo and Cassandra are two popular AP systems.
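
To make the notion of eventual consistency concrete, the following is a minimal Python sketch of a replicated store in which a write is acknowledged by one replica and propagated to the others later. It is an illustration only and not the mechanism of any particular system: the class names, the explicit `sync` step, and the last-write-wins rule based on a single logical clock are all assumptions made for this example.

```python
class Replica:
    """One copy of the data; maps key -> (version, value)."""

    def __init__(self):
        self.data = {}

    def apply(self, key, version, value):
        # Last-write-wins (an assumption of this sketch): keep the newest version.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


class EventuallyConsistentStore:
    """Toy AP-style store: writes succeed on one replica, others catch up later."""

    def __init__(self, num_replicas=3):
        self.replicas = [Replica() for _ in range(num_replicas)]
        self.pending = []  # queued updates: (replica_index, key, version, value)
        self.clock = 0     # single logical clock, used only to order writes

    def write(self, key, value, replica=0):
        self.clock += 1
        self.replicas[replica].apply(key, self.clock, value)
        # Queue the update for every other replica instead of waiting for them.
        for i in range(len(self.replicas)):
            if i != replica:
                self.pending.append((i, key, self.clock, value))

    def read(self, key, replica=0):
        # May return stale data if the replica has not been synchronized yet.
        return self.replicas[replica].read(key)

    def sync(self):
        # Deliver all queued updates; afterwards every replica agrees.
        for i, key, version, value in self.pending:
            self.replicas[i].apply(key, version, value)
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("likes", 10, replica=0)
stale = store.read("likes", replica=2)   # not yet propagated
store.sync()
fresh = store.read("likes", replica=2)   # now matches replica 0
```

Reads served before `sync` may miss recent writes, which corresponds to the tolerable data errors mentioned above; after the delay, all replicas return the same value.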

                                4.3 Storage mechanism for big data

                                Considerable research on big data has promoted the development of storage mechanisms for big data. Existing storage mechanisms for big data may be classified into three bottom-up levels: (i) file systems, (ii) databases, and (iii) programming models.

                                File systems are the foundation of the applications at upper levels. Google's GFS is a scalable distributed file system that supports large-scale, distributed, data-intensive applications [25]. GFS uses cheap commodity servers to achieve fault tolerance and provides customers with high-performance services. GFS supports large-scale file applications with more frequent reading than writing. However, GFS also has some limitations, such as a single point of failure and poor performance for small files. Such limitations have been overcome by Colossus [82], the successor of GFS.

                                In addition, other companies and researchers also have their own solutions to meet the different demands for storage of big data. For example, HDFS and KosmosFS are open-source derivatives of GFS. Microsoft developed Cosmos [83] to support its search and advertisement business. Facebook utilizes Haystack [84] to store its large amount of small-sized photos. Taobao also developed TFS and FastDFS. In conclusion, distributed file systems have become relatively mature after years of development and business operation. Therefore, we will focus on the other two levels in the rest of this section.

                                4.3.1 Database technology

                                Database technology has been evolving for more than 30 years. Various database systems have been developed to handle datasets at different scales and to support various applications. Traditional relational databases cannot meet the challenges of variety and scale brought about by big data. NoSQL databases (i.e., databases that are not traditional relational databases) are becoming more popular for big data storage. NoSQL databases feature flexible schemas, support for simple and easy replication, simple APIs, eventual consistency, and support for large data volumes. NoSQL databases are becoming the core technology for big data. We will examine the following three main types of NoSQL databases in this section: key-value databases, column-oriented databases, and document-oriented databases, each based on a certain data model.

                                – Key-value Databases: Key-value databases are built on a simple data model in which data is stored as key-value pairs. Every key is unique, and clients look up values by their keys. Such databases feature a simple structure, and modern key-value databases are characterized by higher scalability and shorter query response times than relational databases. Over the past few years, many key-value databases have appeared, motivated by Amazon's Dynamo system [85]. We will introduce Dynamo and several other representative key-value databases.

                                – Dynamo: Dynamo is a highly available and scalable distributed key-value data storage system. It is used to store and manage the state of some core services of the Amazon e-commerce platform that can be implemented with key-based access. The common pattern of using relational databases may generate invalid data and limit data scale and availability, while Dynamo resolves these problems with a simple key-object interface consisting of simple read and write operations. Dynamo achieves elasticity and availability through its data partitioning, data replication, and object versioning mechanisms. Dynamo's partitioning scheme relies on Consistent Hashing [86] to divide the load among multiple main storage machines; its main advantage is that the arrival or departure of a node only affects its directly adjacent nodes and does not affect other nodes. Dynamo replicates data to N servers, where N is a configurable parameter, in order to achieve high availability and durability. Dynamo also provides eventual consistency, so that updates can be propagated asynchronously to all replicas.

                                – Voldemort: Voldemort is also a key-value storage system, which was initially developed for and is still used by LinkedIn. Keys and values in Voldemort are composite objects constituted by tables and images. The Voldemort interface includes three simple operations: reading, writing, and deletion, all of which are addressed by key. Voldemort provides asynchronously updated, multi-version concurrency control but does not ensure data consistency. However, Voldemort supports optimistic locking for consistent multi-record updating: when a conflict occurs between an update and any other operation, the update is aborted. The data replication mechanism of Voldemort is the same as that of Dynamo. Voldemort not only stores data in RAM but also allows data to be inserted into a storage engine. In particular, Voldemort supports two storage engines: Berkeley DB and Random Access Files.

The key-value database emerged a few years ago. Deeply influenced by Amazon Dynamo, other key-value storage systems include Redis, Tokyo Cabinet and Tokyo Tyrant, Memcached and Memcache DB, Riak, and Scalaris, all of which provide expandability by distributing keys among nodes. Voldemort, Riak, Tokyo Cabinet, and Memcached can utilize attached storage devices to store data in RAM or on disks. Other storage systems store data in RAM and provide disk backup, or rely on copy and recovery to avoid backup.
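The consistent hashing scheme that Dynamo relies on can be illustrated with a minimal sketch. The class and function names below are hypothetical, MD5 is an arbitrary choice of hash function, and the sketch places each node at a single ring position, so it should be read as an illustration of the idea under these assumptions rather than as Dynamo's implementation. A key is assigned to the first node found clockwise from its hash, and the next N distinct nodes form its replica list.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map a string to a position on the hash ring (MD5 is an arbitrary choice)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class ConsistentHashRing:
    """Toy hash ring with one position per node (hypothetical helper)."""

    def __init__(self, nodes=()):
        self._positions = []  # sorted ring positions
        self._owner = {}      # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        pos = ring_position(node)
        bisect.insort(self._positions, pos)
        self._owner[pos] = node

    def remove_node(self, node: str) -> None:
        pos = ring_position(node)
        self._positions.remove(pos)
        del self._owner[pos]

    def preference_list(self, key: str, n: int = 1):
        """Return the first n nodes found clockwise from the key's position."""
        if not self._positions:
            return []
        start = bisect.bisect_right(self._positions, ring_position(key))
        count = min(n, len(self._positions))
        size = len(self._positions)
        return [self._owner[self._positions[(start + i) % size]]
                for i in range(count)]


ring = ConsistentHashRing(["node-a", "node-b", "node-c", "node-d"])
replicas = ring.preference_list("user:42", n=3)  # N = 3 copies of this key
```

When a node leaves the ring, only the keys it owned move to its clockwise neighbor; all other keys keep their owner, which is the property described above.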

– Column-oriented Database: The column-oriented databases store and process data according to columns rather than rows. Both columns and rows are segmented across multiple nodes to realize expandability. The column-oriented databases are mainly inspired by Google's BigTable. In this section, we first discuss BigTable and then introduce several derivative tools.

– BigTable: BigTable is a distributed, structured data storage system, which is designed to process large-scale (PB class) data among thousands of commercial servers [87]. The basic data structure of BigTable is a multi-dimensional sorted map with sparse, distributed, and persistent storage. Indexes of the map are row key, column key, and timestamp, and every value in the map is an uninterpreted byte array. Each row key in BigTable is a character string of up to 64 KB. Rows are stored in lexicographical order and continually segmented into Tablets (i.e., units of distribution) for load balance. Thus, reading a short range of rows can be highly efficient, since it only involves communication with a small portion of machines. The columns are grouped according to the prefixes of keys, thus forming column families. These column families are the basic units for access control. The timestamps are 64-bit integers used to distinguish different versions of cell values. Clients may flexibly determine the number of cell versions stored. These versions are sorted in descending order of timestamps, so the latest version will always be read first.

The BigTable API features the creation and deletion of tables and column families, as well as modification of metadata of clusters, tables, and column families. Client applications may insert or delete values in BigTable, query values from columns, or browse sub-datasets in a table. BigTable also supports some other characteristics, such as transaction processing in a single row. Users may utilize such features to conduct more complex data processing.

Every BigTable implementation includes three main components: the Master server, Tablet servers, and the client library. BigTable allows only one Master server, which is responsible for assigning Tablets to Tablet servers, detecting added or removed Tablet servers, and conducting load balancing. In addition, the Master can modify the BigTable schema, e.g., creating tables and column families, and garbage-collects files saved in GFS that have been deleted or are no longer used by a specific BigTable instance. Every Tablet server manages a set of Tablets and is responsible for the reading and writing of its loaded Tablets. When Tablets grow too big, they are segmented by the server. The application client library is used to communicate with BigTable instances.

BigTable is based on many fundamental components of Google, including GFS [25], the cluster management system, the SSTable file format, and Chubby [88]. GFS is used to store data and log files. The cluster management system is responsible for task scheduling, resource sharing, processing of machine failures, and monitoring of machine statuses. The SSTable file format is used to store BigTable data internally, and it provides a persistent, sorted, and immutable mapping from keys to values, both of which are arbitrary byte strings. BigTable utilizes Chubby for the following tasks: 1) ensure there is at most one active Master copy at any time; 2) store the bootstrap location of BigTable data; 3) look up Tablet servers; 4) conduct error recovery in case of Tablet server failures; 5) store BigTable schema information; 6) store the access control table.

– Cassandra: Cassandra is a distributed storage system to manage the huge amount of structured data distributed among multiple commercial servers [89]. The system was developed by Facebook and became an open source tool in 2008. It adopts the ideas and concepts of both Amazon Dynamo and Google BigTable, especially integrating the distributed system technology of Dynamo with the BigTable data model. Tables in Cassandra are in the form of a distributed four-dimensional structured map, where the four dimensions are row, column, column family, and super column. A row is distinguished by a string key of arbitrary length. No matter how many columns are read or written, the operation on a row is atomic. Columns may be grouped into clusters, which are called column families and are similar to the data model of BigTable. Cassandra provides two kinds of column families: column families and super columns. A super column includes an arbitrary number of columns related to the same name. A column family includes columns and super columns, which may be continuously inserted into the column family during runtime. The partition and copy mechanisms of Cassandra are very similar to those of Dynamo, so as to achieve consistency.

– Derivative tools of BigTable: Since the BigTable code cannot be obtained through an open source license, some open source projects compete to implement the BigTable concept and develop similar systems, such as HBase and HyperTable.

HBase is a BigTable clone programmed in Java and is a part of Apache's Hadoop MapReduce framework [90]. HBase replaces GFS with HDFS. It writes updated contents into RAM and regularly writes them into files on disks. Row operations are atomic and equipped with row-level locking, and transaction processing is optional for larger scopes. Partition and distribution are operated transparently, with no need for client-side hashing or a fixed key space.

HyperTable was developed, similar to BigTable, to obtain a set of high-performance, expandable, distributed storage and processing systems for structured and unstructured data [91]. HyperTable relies on distributed file systems, e.g., HDFS, and a distributed lock manager. Its data representation, processing, and partition mechanisms are similar to those in BigTable. HyperTable has its own query language, called HyperTable query language (HQL), and allows users to create, modify, and query underlying tables.

Since the column-oriented storage databases mainly emulate BigTable, their designs are all similar, except for the concurrency mechanism and several other features. For example, Cassandra emphasizes weak consistency with concurrency control of multiple versions, while HBase and HyperTable focus on strong consistency through locks or log records.
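The BigTable data model described above, a sparse sorted map from (row key, column key, timestamp) to an uninterpreted byte string, can be sketched in a few lines. This is a single-process toy under stated assumptions: the class name `SparseTable`, its methods, and the `max_versions` parameter are hypothetical and are not part of the BigTable API; Tablets, column-family access control, and persistence are omitted.

```python
from collections import defaultdict


class SparseTable:
    """Toy (row key, column key, timestamp) -> bytes map; all names are hypothetical."""

    def __init__(self, max_versions: int = 3):
        self.max_versions = max_versions  # number of cell versions kept per cell
        # row key -> column key -> list of (timestamp, value), newest first
        self._rows = defaultdict(dict)

    def put(self, row: str, column: str, timestamp: int, value: bytes) -> None:
        versions = self._rows[row].setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)  # descending timestamps
        del versions[self.max_versions:]                   # drop the oldest versions

    def get(self, row: str, column: str):
        """Return the latest value of a cell, or None if the cell is absent."""
        versions = self._rows.get(row, {}).get(column)
        return versions[0][1] if versions else None

    def scan(self, start_row: str, end_row: str):
        """Yield row keys in lexicographic order within [start_row, end_row)."""
        for row in sorted(self._rows):
            if start_row <= row < end_row:
                yield row


table = SparseTable(max_versions=2)
table.put("com.example.www", "contents:html", 1, b"<html>v1</html>")
table.put("com.example.www", "contents:html", 2, b"<html>v2</html>")
latest = table.get("com.example.www", "contents:html")  # the newest version
```

Because row keys are kept in lexicographic order, a scan over a short key range touches only neighboring rows, which mirrors why such reads involve only a small portion of machines in the distributed system.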

– Document Database: Compared with key-value storage, document storage can support more complex data forms. Since documents do not follow a strict schema, there is no need to conduct schema migration. In addition, key-value pairs can still be saved. We will examine three important representatives of document storage systems, i.e., MongoDB, SimpleDB, and CouchDB.

– MongoDB: MongoDB is an open-source, document-oriented database [92]. MongoDB stores documents as Binary JSON (BSON) objects [93], which are similar to objects. Every document has an ID field as the primary key. Queries in MongoDB are expressed with a syntax similar to JSON. A database driver sends the query as a BSON object to MongoDB. The system allows queries on all documents, including embedded objects and arrays. To enable rapid queries, indexes can be created on the queryable fields of documents. The copy operation in MongoDB can be executed with log files in the main nodes that support all the high-level operations conducted in the database. During copying, the slave nodes query the master for all the writing operations since the last synchronization and execute the operations in the log files on their local databases. MongoDB supports horizontal expansion with automatic sharding to distribute data among thousands of nodes, with automatic load balancing and failover.


– SimpleDB: SimpleDB is a distributed database and a web service of Amazon [94]. Data in SimpleDB is organized into various domains, in which data may be stored, acquired, and queried. Domains include different properties and name/value pair sets of items. Data is copied to different machines at different data centers in order to ensure data safety and improve performance. This system does not support automatic partitioning and thus cannot be expanded with the change of data volume. SimpleDB allows users to query with SQL. It is worth noting that SimpleDB can assure eventual consistency but does not support Multi-Version Concurrency Control (MVCC). Therefore, conflicts therein cannot be detected from the client side.

– CouchDB: Apache CouchDB is a document-oriented database written in Erlang [95]. Data in CouchDB is organized into documents consisting of fields named by keys/names and values, which are stored and accessed as JSON objects. Every document is provided with a unique identifier. CouchDB allows access to database documents through a RESTful HTTP API. If a document needs to be modified, the client must download the entire document, modify it, and then send it back to the database. After a document is rewritten once, its revision identifier will be updated. CouchDB utilizes optimistic replication to obtain scalability without a sharding mechanism. Since various CouchDB instances may be executed along with other transactions simultaneously, any kind of replication topology can be built. The consistency of CouchDB relies on the replication mechanism. CouchDB supports MVCC with historical hash records.
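The difference between a store with revision-based concurrency control (as described for CouchDB) and one without it (as described for SimpleDB) can be made concrete with a small in-memory sketch. The names `RevisionedStore` and `ConflictError` are hypothetical, and the revision token format (a generation counter followed by a content hash) is an assumption chosen for illustration; the sketch does not reproduce the CouchDB HTTP API.

```python
import hashlib
import json


class ConflictError(Exception):
    """Raised when an update is based on a stale revision."""


class RevisionedStore:
    """Toy document store with optimistic, revision-checked updates (hypothetical)."""

    def __init__(self):
        self._docs = {}  # document id -> (revision, body)

    @staticmethod
    def _next_rev(prev_rev, body):
        generation = 1 if prev_rev is None else int(prev_rev.split("-")[0]) + 1
        payload = json.dumps(body, sort_keys=True).encode("utf-8")
        return f"{generation}-{hashlib.md5(payload).hexdigest()}"

    def get(self, doc_id):
        rev, body = self._docs[doc_id]
        return rev, dict(body)  # the client receives the whole document

    def put(self, doc_id, body, rev=None):
        current = self._docs.get(doc_id)
        current_rev = current[0] if current else None
        if rev != current_rev:  # someone else updated the document first
            raise ConflictError(f"expected {current_rev}, got {rev}")
        new_rev = self._next_rev(current_rev, body)
        self._docs[doc_id] = (new_rev, dict(body))
        return new_rev


store = RevisionedStore()
rev1 = store.put("doc-1", {"title": "Big Data"})
rev2 = store.put("doc-1", {"title": "Big Data: A Survey"}, rev=rev1)
```

A second writer that still holds `rev1` would receive a `ConflictError` instead of silently overwriting the newer document, which is how the client side can detect conflicts.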

Big data are generally stored in hundreds and even thousands of commercial servers. Thus, the traditional parallel models, such as Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), may not be adequate to support such large-scale parallel programs. Recently, some proposed parallel programming models effectively improve the performance of NoSQL and reduce the performance gap to relational databases. Therefore, these models have become the cornerstone for the analysis of massive data.

– MapReduce: MapReduce [22] is a simple but powerful programming model for large-scale computing, using a large number of clusters of commercial PCs to achieve automatic parallel processing and distribution. The MapReduce computing model has only two functions, i.e., Map and Reduce, both of which are programmed by users. The Map function processes input key-value pairs and generates intermediate key-value pairs. Then, MapReduce combines all the intermediate values related to the same key and transmits them to the Reduce function, which further compresses the value set into a smaller set. MapReduce has the advantage that it avoids the complicated steps for developing parallel applications, e.g., data scheduling, fault tolerance, and inter-node communications. The user only needs to program the two functions to develop a parallel application. The initial MapReduce framework did not support multiple datasets in a task, which has been mitigated by some recent enhancements [96, 97].

Over the past decades, programmers have become familiar with the advanced declarative language of SQL, often used in relational databases for task description and dataset analysis. However, the succinct MapReduce framework only provides two nontransparent functions, which cannot cover all the common operations. Therefore, programmers have to spend time on programming the basic functions, which are typically hard to maintain and reuse. In order to improve programming efficiency, some advanced language systems have been proposed, e.g., Sawzall [98] of Google, Pig Latin [99] of Yahoo, Hive [100] of Facebook, and Scope [87] of Microsoft.
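The two-function model can be illustrated with the familiar word count task. The sketch below runs in a single process and only mimics the map, group-by-key, and reduce steps; the driver `run_mapreduce` and the function names are hypothetical and stand in for what a real framework would do across many machines, including scheduling and fault tolerance.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: emit an intermediate (word, 1) pair for every word in a line."""
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """Reduce: compress all values of one key into a single count."""
    yield word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for the framework (hypothetical driver)."""
    groups = defaultdict(list)
    for key, value in records:
        for k, v in mapper(key, value):
            groups[k].append(v)  # combine values related to the same key
    results = {}
    for k in sorted(groups):
        for out_key, out_value in reducer(k, groups[k]):
            results[out_key] = out_value
    return results


lines = [(0, "big data big"), (1, "Data center")]
counts = run_mapreduce(lines, map_fn, reduce_fn)
# counts == {"big": 2, "center": 1, "data": 2}
```

The user writes only `map_fn` and `reduce_fn`; everything inside the driver is what the framework hides from the programmer.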

– Dryad: Dryad [101] is a general-purpose distributed execution engine for processing parallel applications of coarse-grained data. The operational structure of Dryad is a directed acyclic graph, in which vertexes represent programs and edges represent data channels. Dryad executes operations on the vertexes in clusters and transmits data via data channels, including files, TCP connections, and shared-memory FIFOs. During operation, resources in a logic operation graph are automatically mapped to physical resources.

The operation structure of Dryad is coordinated by a central program called the job manager, which can be executed in clusters or on workstations through a network. A job manager consists of two parts: 1) application code, which is used to build a job communication graph, and 2) program library code, which is used to arrange available resources. All kinds of data are directly transmitted between vertexes. Therefore, the job manager is only responsible for decision-making and does not obstruct any data transmission.

In Dryad, application developers can flexibly choose any directed acyclic graph to describe the communication modes of the application and express data transmission mechanisms. In addition, Dryad allows vertexes to use any amount of input and output data, while MapReduce supports only one input and output set. DryadLINQ [102] is the advanced language of Dryad and is used to integrate the aforementioned SQL-like language execution environment.

– All-Pairs: All-Pairs [103] is a system specially designed for biometrics, bioinformatics, and data mining applications. It focuses on comparing element pairs in two datasets by a given function. All-Pairs can be expressed as a three-tuple (Set A, Set B, and Function F), in which Function F is utilized to compare all elements in Set A and Set B. The comparison result is an output matrix M, which is also called the Cartesian product or cross join of Set A and Set B.

All-Pairs is implemented in four phases: system modeling, distribution of input data, batch job management, and result collection. In Phase I, an approximation model of system performance is built to evaluate how much CPU resource is needed and how to conduct job partition. In Phase II, a spanning tree is built for data transmissions, which lets the workload of every partition retrieve input data effectively. In Phase III, after the data flow is delivered to proper nodes, the All-Pairs engine builds a batch-processing submission for jobs in partitions, while sequencing them in the batch processing system and formulating a node running command to acquire data. In the last phase, after the job completion of the batch processing system, the extraction engine collects results and combines them in a proper structure, which is generally a single file list in which all results are put in order.
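The abstraction behind the three-tuple (Set A, Set B, Function F) is simply M[i][j] = F(A[i], B[j]). A minimal sequential sketch follows; the function names are hypothetical, the Hamming distance is only an illustrative choice of F, and the four distributed phases described above are deliberately left out.

```python
def all_pairs(set_a, set_b, compare):
    """Return the matrix M with M[i][j] = compare(set_a[i], set_b[j])."""
    return [[compare(a, b) for b in set_b] for a in set_a]


def hamming(a: str, b: str) -> int:
    """Illustrative comparison function F: differing positions of two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


matrix = all_pairs(["0000", "1111"], ["0011", "1111", "0000"], hamming)
# matrix == [[2, 4, 0], [2, 0, 4]]
```

The cost grows with the product of the two set sizes, which is why the real system needs performance modeling and job partitioning before it distributes the work.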

– Pregel: The Pregel [104] system of Google facilitates the processing of large-sized graphs, e.g., analysis of network graphs and social networking services. A computational task is expressed by a directed graph constituted by vertexes and directed edges. Every vertex is related to a modifiable, user-defined value, and every directed edge related to a source vertex is constituted by a user-defined value and the identifier of a target vertex. When the graph is built, the program conducts iterative calculations, which are called supersteps and are separated by global synchronization points, until the algorithm completes and the output is produced. In every superstep, vertex computations are parallel, and every vertex executes the same user-defined function to express a given algorithm logic. Every vertex may modify its own status and that of its output edges, receive messages sent in the previous superstep, send messages to other vertexes, and even modify the topological structure of the entire graph. Edges are not provided with corresponding computations. Every vertex may deactivate itself by voting to halt. When all vertexes are in an inactive status without any message to transmit, the entire program execution is completed. The Pregel program output is a set consisting of the values output from all the vertexes. Generally speaking, the input and output of a Pregel program are isomorphic directed graphs.
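The superstep model can be illustrated with a small, single-process sketch that propagates the maximum vertex value through a directed graph. The function name `pregel_max`, its parameters, and the example graph are hypothetical; the loop only imitates the vertex-centric style (receive messages, update the value, send messages, vote to halt) and is not the Pregel API.

```python
def pregel_max(values, out_edges, max_supersteps=50):
    """Propagate the maximum value along directed edges, superstep by superstep."""
    values = dict(values)
    active = set(values)                  # all vertexes start active
    inbox = {v: [] for v in values}
    superstep = 0
    while superstep < max_supersteps and (active or any(inbox.values())):
        outbox = {v: [] for v in values}
        for v in values:
            messages = inbox[v]
            if v not in active and not messages:
                continue                  # a halted vertex without mail stays idle
            changed = False
            if messages and max(messages) > values[v]:
                values[v] = max(messages)
                changed = True
            if superstep == 0 or changed:
                for target in out_edges.get(v, []):
                    outbox[target].append(values[v])
                active.add(v)
            else:
                active.discard(v)         # vote to halt
        inbox = outbox                    # global synchronization point
        superstep += 1
    return values, superstep


graph = {"a": ["b"], "b": ["a", "c"], "c": ["d"], "d": ["c"]}
final_values, steps = pregel_max({"a": 3, "b": 6, "c": 2, "d": 1}, graph)
# every vertex ends with the value 6, which is reachable from vertex "b"
```

The computation stops once every vertex has voted to halt and no messages are in flight, matching the termination condition described above.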

Inspired by the above programming models, other research has also focused on programming models for more complex computational tasks, e.g., iterative computations [105, 106], fault-tolerant memory computations [107], incremental computations [108], and flow control decision-making related to data [109].

5 Big data analysis

The analysis of big data mainly involves analytical methods for traditional data and big data, analytical architecture for big data, and software used for mining and analysis of big data. Data analysis is the final and the most important phase in the value chain of big data, with the purpose of extracting useful values and providing suggestions or decisions. Different levels of potential values can be generated through the analysis of datasets in different fields [10]. However, data analysis is a broad area, which frequently changes and is extremely complex. In this section, we introduce the methods, architectures, and tools for big data analysis.

5.1 Traditional data analysis

Traditional data analysis means to use proper statistical methods to analyze massive data, to concentrate, extract, and refine useful data hidden in a batch of chaotic datasets, and to identify the inherent law of the subject matter, so as to maximize the value of data. Data analysis plays a huge guidance role in making development plans for a country, understanding customer demands for commerce, and predicting market trends for enterprises. Big data analysis can be deemed as the analysis technique for a special kind of data. Therefore, many traditional data analysis methods may still be utilized for big data analysis. Several representative traditional data analysis methods are examined in the following, many of which are from statistics and computer science.

– Cluster Analysis: a statistical method for grouping objects, and specifically, classifying objects according to some features. Cluster analysis is used to differentiate objects with particular features and divide them into some categories (clusters) according to these features, such that objects in the same category will have high homogeneity while different categories will have high heterogeneity. Cluster analysis is an unsupervised learning method without training data.

– Factor Analysis: basically targeted at describing the relation among many elements with only a few factors, i.e., grouping several closely related variables into a factor; the few factors are then used to reveal most of the information in the original data.


– Correlation Analysis: an analytical method for determining the law of relations, such as correlation, correlative dependence, and mutual restriction, among observed phenomena, and accordingly conducting forecast and control. Such relations may be classified into two types: (i) function, reflecting the strict dependence relationship among phenomena, which is also called a definitive dependence relationship; (ii) correlation, i.e., some undetermined or inexact dependence relations, in which the numerical value of a variable may correspond to several numerical values of the other variable, and such numerical values present a regular fluctuation surrounding their mean values.

– Regression Analysis: a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies dependence relationships among variables hidden by randomness. Regression analysis may turn complex and undetermined correlations among variables into simple and regular ones.

– A/B Testing: also called bucket testing. It is a technology for determining how to improve target variables by comparing the tested groups. Big data will require a large number of tests to be executed and analyzed.

– Statistical Analysis: Statistical analysis is based on statistical theory, a branch of applied mathematics. In statistical theory, randomness and uncertainty are modeled with probability theory. Statistical analysis can provide a description and an inference for big data. Descriptive statistical analysis can summarize and describe datasets, while inferential statistical analysis can draw conclusions from data subject to random variations. Statistical analysis is widely applied in the economic and medical care fields [110].

– Data Mining Algorithms: Data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. In 2006, the IEEE International Conference on Data Mining Series (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [111], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, and CART, etc. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and link mining, all of which are among the most important problems in data mining research.
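Two of the traditional methods listed above, correlation analysis and regression analysis, can be illustrated on a small sample. The sketch below computes the Pearson correlation coefficient and an ordinary least-squares line in plain Python; the function names and the sample data are hypothetical, and in practice a statistics package would normally be used instead.

```python
from math import sqrt


def mean(xs):
    return sum(xs) / len(xs)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(sxx * syy)


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]                   # exactly y = 2x + 1
slope, intercept = fit_line(xs, ys)  # slope 2.0, intercept 1.0
r = pearson(xs, ys)                  # 1.0 for a perfectly linear relation
```

With noisy observations the coefficient falls below 1 in absolute value, and the fitted line describes the regular trend hidden by randomness.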

5.2 Big data analytic methods

In the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data, so as to bring value to enterprises and individuals. At present, the main processing methods of big data are as follows.

– Bloom Filter: A Bloom Filter consists of a series of hash functions. The principle of the Bloom Filter is to store hash values of data, rather than the data itself, by utilizing a bit array, which is in essence a bitmap index that uses hash functions to conduct lossy compression storage of data. It has such advantages as high space efficiency and high query speed, but also has some disadvantages in misrecognition and deletion.

– Hashing: a method that essentially transforms data into shorter fixed-length numerical values or index values. Hashing has such advantages as rapid reading, writing, and high query speed, but it is hard to find a sound hash function.

– Index: an index is always an effective method to reduce the expense of disk reading and writing and to improve insertion, deletion, modification, and query speeds, both in traditional relational databases that manage structured data and in other technologies that manage semi-structured and unstructured data. However, an index has the disadvantage of the additional cost of storing index files, which should be maintained dynamically when data is updated.

– Trie: also called trie tree, a variant of Hash Tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of the trie is to utilize common prefixes of character strings to reduce comparisons on character strings to the greatest extent, so as to improve query efficiency.

– Parallel Computing: compared to traditional serial computing, parallel computing refers to simultaneously utilizing several computing resources to complete a computation task. Its basic idea is to decompose a problem and assign its parts to several separate processes to be independently completed, so as to achieve co-processing. Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see a comparison in Table 1).
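The Bloom Filter in the list above can be sketched with a bit array and a few hash functions. The class below is a minimal illustration: the parameter names `num_bits` and `num_hashes` are hypothetical, and deriving several hash functions from SHA-256 with different seeds is an assumption made for brevity rather than a recommended design.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over a bit array (illustrative parameters)."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive several hash functions by prefixing the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


seen = BloomFilter()
seen.add("alice@example.com")
present = "alice@example.com" in seen  # True: added items are never missed
```

Only hash positions are stored, so an item that was never added may occasionally be reported as present (misrecognition), and clearing bits to delete one item could also erase bits shared with others.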

Although parallel computing systems or tools, such as MapReduce or Dryad, are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages are being developed based on these systems. Such high-level languages include Sawzall, Pig, and Hive used for MapReduce, as well as Scope and DryadLINQ used for Dryad.
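Among the methods listed in this subsection, the trie lends itself to a compact sketch covering both of its stated uses, rapid retrieval by prefix and word frequency statistics. The class and method names below are hypothetical and chosen for illustration.

```python
class Trie:
    """Prefix tree that also counts how often each word was inserted."""

    def __init__(self):
        self.children = {}  # character -> child node
        self.count = 0      # number of inserted words ending at this node

    def insert(self, word: str) -> None:
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.count += 1

    def _find(self, prefix: str):
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def frequency(self, word: str) -> int:
        node = self._find(word)
        return node.count if node is not None else 0

    def starts_with(self, prefix: str):
        """Return all stored words sharing the prefix, in sorted order."""
        node = self._find(prefix)
        if node is None:
            return []
        results, stack = [], [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.count:
                results.append(text)
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return sorted(results)


trie = Trie()
for word in ["data", "data", "database", "date", "hadoop"]:
    trie.insert(word)
# trie.frequency("data") == 2
# trie.starts_with("dat") == ["data", "database", "date"]
```

Words that share a prefix share the same path from the root, so a lookup compares each character of the query only once regardless of how many words are stored.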

5.3 Architecture for big data analysis

Because of the 4Vs of big data, different analytical architectures shall be considered for different application requirements.


Table 1 Comparison of MPI, MapReduce, and Dryad

| | MPI | MapReduce | Dryad |
|---|---|---|---|
| Deployment | Computing node and data storage arranged separately (data should be moved to the computing node) | Computing and data storage arranged at the same node (computing should be close to data) | Computing and data storage arranged at the same node (computing should be close to data) |
| Resource management/scheduling | – | Workqueue (Google), HOD (Yahoo) | Not clear |
| Low-level programming | MPI API | MapReduce API | Dryad API |
| High-level programming | – | Pig, Hive, Jaql, ... | Scope, DryadLINQ |
| Data storage | The local file system, NFS, ... | GFS (Google), HDFS (Hadoop), KFS, Amazon S3, ... | NTFS, Cosmos DFS |
| Task partitioning | User manually partitions the tasks | Automation | Automation |
| Communication | Messaging, remote memory access | Files (local FS, DFS) | Files, TCP pipes, shared-memory FIFOs |
| Fault-tolerant | Checkpoint | Task re-execute | Task re-execute |

5.3.1 Real-time vs. offline analysis

According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

– Real-time analysis: is mainly used in E-commerce and finance. Since data constantly changes, rapid data analysis is needed and analytical results shall be returned with a very short delay. The main existing architectures of real-time analysis include (i) parallel processing clusters using traditional relational databases, and (ii) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

– Offline analysis: is usually used for applications without high requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally conducts analysis by importing logs into a special platform through data acquisition tools. Under the big data setting, many Internet enterprises utilize the offline analysis architecture based on Hadoop in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool Timetunnel, and Chukwa of Hadoop, etc. These tools can meet the demands of data acquisition and transmission with hundreds of MB per second.

5.3.2 Analysis at different levels

Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive-level analysis, which are examined in the following.

– Memory-level analysis: is for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays, the memory of a server cluster surpasses hundreds of GB, and even the TB level is common. Therefore, an in-memory database technology may be used, and hot data shall reside in the memory so as to improve the analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis. MongoDB is a representative memory-level analytical architecture. With the development of SSD (Solid-State Drive), the capacity and performance of memory-level data analysis have been further improved and widely applied.

– BI analysis: is for the case when the data scale surpasses the memory level but may be imported into the BI analysis environment. The current mainstream BI products are provided with data analysis plans that support data volumes over the TB level.

– Massive analysis: is for the case when the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analyses utilize HDFS of Hadoop to store data and use MapReduce for data analysis. Most massive analysis belongs to the offline analysis category.


5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithms differ greatly from each other according to different kinds of data and application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the top five most widely used software packages, according to a survey of "What Analytics, Data mining, Big Data software that you used in the past 12 months for a real project?" of 798 professionals made by KDnuggets in 2012 [112].

– R (30.7 %): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. When computing-intensive tasks are executed, code programmed with C, C++, and Fortran may be called in the R environment. In addition, skilled users can directly call R objects in C. Actually, R is a realization of the S language, which is an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and drawing plots. Compared to S, R is more popular since it is open source. R ranks top 1 in the KDnuggets 2012 survey. Furthermore, in a survey of "Design languages you have used for data mining/analysis in the past year" in 2012, R was also in the first place, defeating SQL and Java. Due to the popularity of R, database manufacturers, such as Teradata and Oracle, have released products supporting R.

– Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins, such as Analysis ToolPak and Solver Add-in, with powerful functions for data analysis are integrated initially, but such plug-ins can be used only if users enable them. Excel is also the only commercial software among the top five.

– Rapid-I RapidMiner (26.7%): RapidMiner is open source software used for data mining, machine learning, and predictive analysis. In the 2011 KDnuggets survey, it was used more frequently than R and ranked first. The data mining and machine learning procedures provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment. The data mining flow is described in XML and displayed through a graphical user interface (GUI). RapidMiner is written in Java. It integrates the learners and evaluation methods of Weka and works with R. RapidMiner functions are implemented by connecting processes that consist of various operators. The entire flow can be regarded as a factory production line, with original data as input and model results as output. The operators can be considered as specific functions with different input and output characteristics.

– KNIME (21.8%): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, and open-source platform rich in data integration, data processing, data analysis, and data mining functions [113]. It allows users to create data flows or data pipelines visually, to selectively run some or all analytical procedures, and to obtain analytical results, models, and interactive views. KNIME is written in Java and based on Eclipse, and it provides additional functions as plug-ins. Through plug-ins, users can insert processing modules for files, pictures, and time series, and integrate them with various open source projects, e.g., R and Weka. KNIME covers data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted in a visual environment. KNIME is designed as a module-based and extensible framework. Its processing units and data containers do not depend on each other, which makes it suitable for distributed environments and independent development. In addition, KNIME is easy to extend: developers can readily add new nodes and views.

– Weka/Pentaho (14.8%): Weka, short for Waikato Environment for Knowledge Analysis, is free and open-source machine learning and data mining software written in Java. Weka provides functions such as data processing, feature selection, classification, regression, clustering, association rules, and visualization. Pentaho is one of the most popular open-source BI software suites. It includes a web server platform and several tools that support reporting, analysis, charting, data integration, data mining, and other aspects of BI. Weka's data processing algorithms are also integrated in Pentaho and can be called directly.

                                6 Big data applications

In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful value via judgments, suggestions, support, or decisions.


However, data analysis involves a wide range of applications, which frequently change and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields: structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

6.1 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically involves large-scale and complex programs under specific analytical methods. In fact, data-driven applications have been emerging over the past decades. For example, as early as the 1990s, BI had become a prevailing technology for business applications, and web search engines based on massive data mining emerged in the early 21st century. Some potential and influential applications from different fields, together with their data and analysis characteristics, are discussed as follows.

– Evolution of Commercial Applications: The earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. The analytical techniques used in such systems prevailed in the 1990s and were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, score cards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have been providing a unique opportunity for organizations to present themselves online and interact directly with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer transaction analysis, product suggestions, and market structure analysis can be conducted with text analysis and website mining techniques. As reported in [115], the number of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the sensor-based Internet of Things are opening a new generation of innovative applications and require considerably larger capacity to support location sensing, people-oriented, and context-aware operation.

– Evolution of Network Applications: The early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to mining email contents and building search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages full of various kinds of data, such as text, images, audio, videos, and interactive content. Therefore, plenty of advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share content.

– Evolution of Scientific Applications: Scientific research in many fields, such as astrophysics, oceanology, genomics, and environmental research, is acquiring massive data with high-throughput sensors and instruments. The US National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, inter-operative analysis software, and data services to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets vary widely in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and comprehensive coverage is not easy, we focus on the key problems and technologies in data analysis in the following discussions.

6.2 Big data analysis fields


6.2.1 Structured data analysis

Business applications and scientific research may generate massive structured data, whose management and analysis rely on mature commercial technologies such as RDBMS, data warehouses, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning based on exact mathematical models and powerful algorithms has been applied to anomaly detection [117] and energy control [118]. Exploiting data characteristics, temporal and spatial mining can extract knowledge structures hidden in high-speed data streams and sensor data [119]. Driven by privacy concerns in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially in process analysis with event data [121].

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to have more business potential than structured data analysis. Generally, text analysis is the process of extracting useful information and knowledge from unstructured text. Text mining is inter-disciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and data mining in particular. Most text mining systems are based on text representations and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammars [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.
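As a minimal illustration of how unstructured text can be turned into numeric features for the mining techniques listed above, the following sketch computes TF-IDF term weights with the Python standard library only. The tokenizer, the sample documents, and the function names are illustrative assumptions and do not correspond to any system surveyed here.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def tf_idf(documents):
    """Return one {term: weight} dict per document.

    tf  = term count / number of tokens in the document
    idf = log(N / number of documents containing the term)
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus
docs = [
    "big data analysis of email text",
    "text mining of web pages",
    "opinion mining of social media text",
]
scores = tf_idf(docs)
```

Terms that occur in every document (here "text" and "of") receive a weight of zero, while terms specific to one document receive the highest weights, which is why such weights are a common starting point for classification and clustering.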

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, images, audio, video, code, metadata, and hyperlinks. Research on image, audio, and video mining has recently been called multimedia analysis, which is discussed in Section 6.2.4. Since most Web content data is unstructured text data, research on Web data analysis mainly centers on text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves mining semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., in email and newsgroup management and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. The information retrieval method mainly assists in or improves information lookup, or filters information for users according to inferred or configured profiles. The database method aims to model and integrate data on the Web so as to conduct more complex queries than keyword-based searches.

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagrams of links within a website or among multiple websites. Models are built from the topological structure provided by hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages. The topic-oriented crawler is another successful application of these models [127].
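To make the idea of a link-structure model concrete, the sketch below implements a simplified PageRank by power iteration. It is a teaching sketch under stated assumptions, not the production algorithm of [125]: the damping factor of 0.85, the fixed number of iterations, the uniform treatment of pages without outgoing links, and the toy link graph are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Pages without outgoing links spread their rank evenly over all pages.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the same baseline share of rank
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical four-page web: C is linked from A, B, and D
toy_web = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
ranks = pagerank(toy_web)
best_page = max(ranks, key=ranks.get)
```

In this toy graph the page with the most incoming links (C) obtains the highest rank, and the page that nobody links to (D) obtains the lowest, which matches the intuition that hyperlink topology carries relevance information.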

Web usage mining aims to mine auxiliary data generated by Web dialogues or activities, whereas Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or transactions, cache, user queries, bookmark data, mouse clicks and scrolls, and any other kind of data generated through interaction with the Web. As Web services and Web 2.0 become mature and popular, Web usage data will show increasingly high variety. Web usage mining plays key roles in personalized spaces, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.
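A typical first step in Web usage mining is to group raw access-log records into user sessions. The following sketch shows one simple way to do so; the record layout (user id, timestamp in seconds, URL) and the 30-minute inactivity timeout are assumptions made for illustration, not a format prescribed by the works cited above.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds of inactivity that end a session (assumed)

def sessionize(records, timeout=SESSION_TIMEOUT):
    """Group (user, timestamp, url) records into per-user sessions.

    A new session starts whenever the gap between two consecutive
    requests of the same user exceeds the timeout.
    """
    by_user = defaultdict(list)
    for user, timestamp, url in records:
        by_user[user].append((timestamp, url))
    sessions = defaultdict(list)
    for user, visits in by_user.items():
        visits.sort()  # order each user's requests by time
        current = []
        last_time = None
        for timestamp, url in visits:
            if last_time is not None and timestamp - last_time > timeout:
                sessions[user].append(current)
                current = []
            current.append(url)
            last_time = timestamp
        if current:
            sessions[user].append(current)
    return dict(sessions)

# Hypothetical access log
log = [
    ("u1", 0, "/home"),
    ("u1", 120, "/item/42"),
    ("u2", 200, "/home"),
    ("u1", 4000, "/home"),
    ("u1", 4100, "/cart"),
]
user_sessions = sessionize(log)
```

The resulting sessions can then serve as input to pattern mining or to the recommender systems mentioned above.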


6.2.4 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) have been growing at an amazing speed, and multimedia analysis aims to extract useful knowledge from such data and understand its semantics. Because multimedia data is heterogeneous and most of it contains richer information than simple structured data or text data, information extraction is confronted with the huge challenge of the semantic gap. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection.

Audio summarization can be accomplished by extracting the prominent words or phrases from metadata or by synthesizing a new representation. Video summarization aims to present the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied in many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video and apply smoothing measures to make the final summary look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.

Multimedia annotation inserts labels to describe the contents of images and videos at both the syntactic and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time- and labor-intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic gap. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to jointly explore manual and automatic multimedia annotation [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users to conveniently and quickly look up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis; feature extraction; data mining; classification and annotation; and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, and includes shot boundary detection, key frame extraction, and scene segmentation. Based on the result of structural analysis, the second procedure is feature extraction, which mainly mines the features of key frames, objects, text, and motion; these features are the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find patterns in video contents and assign videos to predefined categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos. The retrieval results are then optimized through relevance feedback.
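The query step described above can be sketched as a nearest-neighbor lookup over extracted feature vectors. The example below uses cosine similarity as the similarity measure, which is one common choice rather than the measure prescribed by [131]; the three-dimensional feature vectors and the video identifiers are hypothetical stand-ins for real key-frame features.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(index, query, top_k=2):
    """Return the top_k video ids ranked by similarity to the query vector."""
    ranked = sorted(
        index,
        key=lambda video_id: cosine_similarity(index[video_id], query),
        reverse=True,
    )
    return ranked[:top_k]

# Hypothetical index: video id -> feature vector derived from key frames
video_index = {
    "news_01": [0.9, 0.1, 0.0],
    "sports_07": [0.1, 0.8, 0.3],
    "sports_12": [0.0, 0.7, 0.6],
}
query_features = [0.0, 0.9, 0.4]
results = retrieve(video_index, query_features)
```

A real system would replace the linear scan with an index structure and would refine the ranking with relevance feedback from the user.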

Multimedia recommendation aims to recommend specific multimedia contents according to users' preferences. It has proven to be an effective approach to providing personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. Content-based methods identify general features of users or their interests and recommend to users other contents with similar features. These methods largely rely on content similarity measurement, but most of them are troubled by limited content analysis and overspecialization. Collaborative-filtering-based methods identify groups with similar interests and recommend contents to group members according to their behavior [132]. More recently, hybrid methods have been introduced, which integrate the advantages of the two aforementioned types of methods to improve recommendation quality [133].
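The collaborative-filtering idea can be illustrated with a small user-based sketch: users with similar rating behavior are weighted more heavily when predicting ratings for items a user has not yet seen. This is a simplified neighborhood method under assumed data, not the specific approach of [132]; the users, clips, and ratings are hypothetical.

```python
import math

def similarity(ratings_a, ratings_b):
    """Cosine similarity over the items that both users have rated."""
    shared = set(ratings_a) & set(ratings_b)
    if not shared:
        return 0.0
    dot = sum(ratings_a[i] * ratings_b[i] for i in shared)
    norm_a = math.sqrt(sum(ratings_a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(ratings_b[i] ** 2 for i in shared))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def recommend(ratings, user, top_k=1):
    """Rank unseen items by the similarity-weighted ratings of other users."""
    scores = {}
    weights = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = similarity(ratings[user], other_ratings)
        if sim <= 0:
            continue
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    predicted = {item: scores[item] / weights[item] for item in scores}
    return sorted(predicted, key=predicted.get, reverse=True)[:top_k]

# Hypothetical ratings on a 1-5 scale
ratings = {
    "alice": {"clip_a": 5, "clip_b": 1},
    "bob": {"clip_a": 5, "clip_b": 1, "clip_c": 5, "clip_d": 1},
    "carol": {"clip_a": 1, "clip_b": 5, "clip_c": 1, "clip_d": 5},
}
suggestions = recommend(ratings, "alice")
```

Because alice's ratings agree with bob's and disagree with carol's, bob's preferences dominate the prediction and the clip that bob rated highly is suggested first.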

The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains text descriptions related to concepts and video examples [134]. In [135], the author proposed a new algorithm for special multimedia event detection using only a few positive training examples. Research on video event detection is still in its infancy and mainly focuses on sports or news events, running or abnormal events in surveillance videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the online social network analysis that emerged at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graph structures describing the communications between two entities. The content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. From the data-centered perspective, the existing research on social networking services can be classified into two categories: link-based structural analysis and content-based analysis [138].

Research on link-based structural analysis has focused on link prediction, community discovery, social network evolution, and social influence analysis. SNS may be visualized as graphs, in which every vertex corresponds to a user and edges correspond to the relations among users. Since SNS are dynamic networks, new vertices and edges are continually added to the graphs. Link prediction aims to predict the possibility of a future connection between two vertices. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and utilizes the existing link information to train binary classifiers that predict future links [139]. Probabilistic methods aim to build models for connection probabilities among vertices in SNS [140]. Linear algebra methods compute the similarity between two vertices according to a rank-reduced similarity matrix [141]. A community is represented by a sub-graph in which the edges connecting vertices within the sub-graph feature high density, while the edges between two sub-graphs feature much lower density [142].
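As a small illustration of the features that a feature-based link predictor might use, the sketch below computes the number of common neighbors and the Jaccard coefficient for every unconnected vertex pair and ranks the pairs accordingly. Ranking by these two scores alone is a deliberate simplification and not the classifier of [139]; the friendship graph is hypothetical.

```python
from itertools import combinations

def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) edges."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency

def link_features(adjacency, u, v):
    """Common-neighbor count and Jaccard coefficient for the pair (u, v)."""
    common = adjacency[u] & adjacency[v]
    union = adjacency[u] | adjacency[v]
    jaccard = len(common) / len(union) if union else 0.0
    return len(common), jaccard

def rank_candidate_links(edges):
    """Rank currently unconnected pairs by Jaccard, then by common neighbors."""
    adjacency = build_adjacency(edges)
    candidates = []
    for u, v in combinations(sorted(adjacency), 2):
        if v not in adjacency[u]:
            common, jaccard = link_features(adjacency, u, v)
            candidates.append((jaccard, common, u, v))
    candidates.sort(key=lambda c: (-c[0], -c[1], c[2], c[3]))
    return [(u, v, common, jaccard) for jaccard, common, u, v in candidates]

# Hypothetical friendship graph
friendships = [
    ("ann", "bob"), ("ann", "cat"), ("bob", "cat"),
    ("bob", "dan"), ("cat", "dan"), ("dan", "eve"),
]
ranking = rank_candidate_links(friendships)
```

Here the pair (ann, dan) ranks first because the two users share two of their three combined neighbors. In a full system, such scores would be combined with other features and fed to a trained binary classifier.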

Many methods for community detection have been proposed and studied, most of which are topology-based and rely on objective functions that capture the notion of community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective community detection method for large-scale SNS [143]. Research on SNS evolution aims to look for laws and deduction models to interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generative methods have been proposed to assist network and system design [147].
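One widely used example of a topology-based objective function is Newman's modularity, which rewards partitions with many intra-community edges relative to what the vertex degrees alone would suggest. The sketch below evaluates modularity for a given partition; it is offered only as an illustration of such an objective and is not the method of [143]. It assumes an undirected, unweighted graph with at least one edge and a partition that covers every vertex.

```python
def modularity(edges, communities):
    """Newman modularity Q of a partition of an undirected, unweighted graph.

    Q = sum over communities c of (e_c / m) - (d_c / (2 m)) ** 2, where
    m is the number of edges, e_c the number of edges inside c, and
    d_c the total degree of the vertices in c.
    """
    m = len(edges)
    membership = {v: idx for idx, group in enumerate(communities) for v in group}
    internal = [0] * len(communities)
    degree = [0] * len(communities)
    for u, v in edges:
        degree[membership[u]] += 1
        degree[membership[v]] += 1
        if membership[u] == membership[v]:
            internal[membership[u]] += 1
    return sum(e / m - (d / (2 * m)) ** 2 for e, d in zip(internal, degree))

# Hypothetical graph: two triangles joined by a single bridge edge
graph = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
good_split = modularity(graph, [{1, 2, 3}, {4, 5, 6}])
poor_split = modularity(graph, [{1, 2, 4}, {3, 5, 6}])
```

The partition that follows the two triangles scores higher than the one that cuts across them, so a detection method that maximizes this objective would prefer the former.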

Social influence refers to the case when individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, temporal effects, and characteristics of networks and individuals. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, location, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise. For example, the blogosphere contains a large number of spam blogs, and Twitter contains many trivial tweets. Third, SNS are dynamic networks, which vary and are updated frequently and quickly. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge among different media [150].

6.2.6 Mobile data analysis

By April 2013, Android Apps had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, mobility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since research on mobile analysis has only just started, we only introduce some recent and representative analysis applications in this section.

With the growing number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities based on geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest Webchat). Traditional network communities or SNS communities lack online interaction among members, and the communities are active only when members are sitting in front of computers. On the contrary, mobile phones can support rich interaction at any time and anywhere. A mobile community is defined as a group of individuals with the same hobbies (e.g., health, safety, and entertainment) who gather together on networks, meet to set a common goal, decide on measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build body area networks for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, and physiological relations.


In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop a smartphone application that analyzes people's gait when they walk and uses the gait information to unlock the security system [11]. Meanwhile, Robert Delano and Brian Parise from the Georgia Institute of Technology developed an application called iTrem, which monitors body tremors with a built-in seismograph in a mobile phone so as to help cope with Parkinson's disease and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, in marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business models. In sales planning, after comparing massive data, enterprises can optimize their commodity prices. In operations, enterprises can improve their operational efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor costs. In the supply chain, using big data, enterprises may conduct inventory optimization, logistics optimization, and supplier coordination to narrow the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that such activities as “Multi-times score accumulation” and “score exchange in shops” are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20% of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15% and 7%, respectively. By analyzing customers' transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.
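The selection step of such a warning model can be sketched as follows: given a drop-out score per customer, the customers in the highest-scoring fraction are targeted for retention offers. How CMB actually computes its scores is not described here, so the scores below are hypothetical values assumed to come from some upstream model, and the function name is illustrative.

```python
import math

def top_risk_customers(churn_scores, fraction=0.2):
    """Return the ids of the given fraction of customers with the highest scores."""
    ranked = sorted(churn_scores, key=churn_scores.get, reverse=True)
    cutoff = math.ceil(len(ranked) * fraction)
    return ranked[:cutoff]

# Hypothetical drop-out scores in [0, 1] from an upstream warning model
churn_scores = {
    "c01": 0.91, "c02": 0.15, "c03": 0.47, "c04": 0.78, "c05": 0.05,
    "c06": 0.33, "c07": 0.62, "c08": 0.88, "c09": 0.21, "c10": 0.09,
}
retention_targets = top_risk_customers(churn_scores)
```

With ten customers and a fraction of 20%, the two customers with the highest scores are selected.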

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, and, more importantly, so are the ages, genders, addresses, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, the market conditions of their brands, and consumers' behaviors, and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more favorable prices. The credit loan service of Alibaba automatically analyzes and judges whether to grant loans to enterprises based on the acquired enterprise transaction data by virtue of big data technology, without manual intervention in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion with only about 0.3% bad loans, which is much lower than the ratios of other commercial banks.

6.3.2 Application of IoT based big data

IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises may have profound experience with the application of IoT big data. For example, UPS trucks are equipped with sensors, wireless adapters, and GPS, so that the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city cooperation project between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and the city of Miami, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The smart city application brings benefits in many aspects to Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills this year by identifying and fixing running and leaking water pipes in a timely manner.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and the connections among individuals based on an information network. Big data of online SNS mainly comes from instant messages, online social networking, micro blogs, and shared spaces, which represent various user activities. The analysis of big data from online SNS uses computational analytical methods for understanding relations in human society, by virtue of theories and methods that involve mathematics, informatics, sociology, and management science, from three dimensions: network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire value.

– Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preferences, emotions, interests, and demands, etc. may be revealed.

– Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies, etc. aggregate relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.
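The community notion above (dense relations inside a group, sparse relations to the outside) can be made concrete by comparing how many of a group's connections stay inside the group with how many leave it. The following is a minimal sketch only: the names `build_graph` and `boundary_ratio` and the toy edge list in the usage note are assumptions introduced for illustration, and the score is a simplified conductance-style ratio, not a complete community detection algorithm.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency map from (user, user) relation pairs."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)
    return graph


def boundary_ratio(graph, group):
    """Fraction of the group's edge endpoints that lead outside the group.

    Values near 0 mean close internal relations and loose external ones,
    i.e. the group behaves like a community in the sense described above.
    """
    group = set(group)
    outgoing = 0  # endpoints that cross the group boundary
    total = 0     # all endpoints attached to group members
    for member in group:
        for neighbour in graph[member]:
            total += 1
            if neighbour not in group:
                outgoing += 1
    return outgoing / total if total else 0.0
```

For example, with two triangles of users joined by a single relation, either triangle scores low (one of seven endpoints leaves the group), whereas a group made of only the two bridging users scores much higher, so the triangles are the more plausible communities.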

The U.S. Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American Facebook users. According to the analysis, most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. These results are highly consistent with U.S. demographic census data. In addition, Global Pulse conducted research that revealed some laws in social and economic activities using SNS data. The project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans. The goal was to better understand public behavior and concerns. The project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting the sharp growth

Fig. 5 Enabling technologies for online social network-oriented big data


or drop in the volume of topics; 2) observing the weekly and monthly trends of dialogs on Twitter, and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation in the official statistics of Indonesia matches the number of Tweets about rice prices on Twitter, as shown in Fig. 6.
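Two of the analyses listed above, detecting a sharp growth or drop in topic volume and checking whether a topic's volume tracks an external indicator, can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the method used by the Global Pulse project: the trailing-window z-score rule, the default `window` and `threshold` values, and the function names `detect_spikes` and `pearson` are all hypothetical choices made here.

```python
from statistics import mean, pstdev


def detect_spikes(counts, window=7, threshold=3.0):
    """Return indices whose count deviates sharply from the trailing window.

    A point is flagged when it lies more than `threshold` standard
    deviations above or below the mean of the previous `window` points.
    """
    flagged = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            # Perfectly flat history: any change at all is treated as abnormal.
            if counts[i] != mu:
                flagged.append(i)
        elif abs(counts[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


def pearson(xs, ys):
    """Pearson correlation between two equally long, non-constant series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5
```

Here `detect_spikes` would be applied to a daily or weekly count of messages on one topic, and `pearson` to a pair of aligned series such as monthly message counts and an official price index. A high correlation only indicates that the two series move together; it does not establish which one drives the other.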

Generally speaking, the application of big data from online SNS may help to better understand users' behavior and master the laws of social and economic activities from the following three aspects:

– Early Warning: to rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotion, and preference of users.

– Real-time Feedback: to acquire groups' feedback on social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the health care business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final result into an extremely personalized treatment plan to assess the risk factors and main treatment plans of patients. Doctors may then reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients lose five pounds, or by suggesting that patients reduce the total triglycerides in their bodies if the sugar content in their bodies is over 20.

The Mount Sinai Medical Center in the U.S. utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and

Fig. 6 The correlation between Tweets about rice price and food price inflation


imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through the software development kit (SDK) and open interface.

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units and coordinate through mobile networks to distribute sensing tasks and to collect and utilize sensed data. It can help us complete large-scale and complex social sensing tasks. Participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of Crowdsourcing has been successfully applied to geotagged photographs, positioning and navigation, urban road traffic sensing, market forecasting, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach to problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, Crowdsourcing was applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of Crowdsourcing. The main idea of Crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need to intentionally deploy sensing modules or employ professionals, Crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, Spatial Crowdsourcing has become a hot topic. The operation framework of Spatial Crowdsourcing is as follows. A user may request a service and resources related to a specified location. Then the mobile users who are willing to participate in the task move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data are sent to the service requester. With the rapid growth of mobile devices and their increasingly powerful functions, it can be forecast that Spatial Crowdsourcing will become more prevalent than traditional Crowdsourcing, e.g., Amazon Mechanical Turk and Crowdflower.
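The request-and-dispatch step of the framework described above can be sketched as choosing, among the users willing to participate, the one closest to the requested location. This is only a minimal sketch under simplifying assumptions: straight-line (great-circle) distance stands in for travel effort, each task goes to a single worker, and the names `haversine_km` and `assign_task` as well as the `max_km` cut-off are hypothetical and not taken from any particular Spatial Crowdsourcing platform.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def assign_task(task_location, willing_workers, max_km):
    """Pick the nearest willing worker within max_km of the task location.

    `willing_workers` maps a worker id to that worker's current
    (latitude, longitude). Returns None when nobody is close enough.
    """
    best_id, best_dist = None, max_km
    for worker_id, location in willing_workers.items():
        dist = haversine_km(task_location, location)
        if dist <= best_dist:
            best_id, best_dist = worker_id, dist
    return best_id
```

A real system would also have to handle many simultaneous requests, worker capacity, and incentives, which this sketch leaves out.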

6.3.6 Smart grid

Smart Grid is the next generation power grid, constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart Grid related big data are generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nationwide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as Circuit Breaker Monitors and transformers). Smart Grid brings about the following challenges on exploiting big data.

– Grid planning: By analyzing data in the Smart Grid, regions with excessively high electrical load or high power outage frequencies can be identified. Even transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an “electric map” according to big data theory, and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of a block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides an effective and visual load forecast for power grid planning in a city. Power grid facilities in blocks with high power outage frequencies and serious overloads, as displayed in the map, may be given priority for transformation.

– Interaction between power generation and power consumption: An ideal power grid shall balance power generation and consumption. However, the traditional power grid is constructed on a one-directional approach of transmission-transformation-distribution-consumption, which does not allow the generation capacity to be adjusted according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters have been developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which help the supplier read power utilization data every 15 minutes rather than every month as in the past. Labor costs for meter reading are greatly reduced. Because power utilization data (a source of big data) are acquired and analyzed frequently and rapidly, power supply companies can adjust the electricity price


according to peak and low periods of power consumption. TXU Energy utilized such price levels to smooth the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help realize time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users.

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortage. Such energy resources can complement traditional hydropower and thermal power generation.
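The time-sharing pricing mentioned in the second item above depends on meter readings that are fine-grained enough to separate peak from off-peak consumption. The sketch below shows the arithmetic on 15-minute readings. It is illustrative only: the function name `time_of_use_bill`, the choice of peak hours, and the two rates are hypothetical and do not describe TXU Energy's actual tariff.

```python
def time_of_use_bill(readings, peak_hours, peak_rate, off_peak_rate):
    """Price interval meter readings under a two-level time-of-use tariff.

    `readings` is a list of (hour_of_day, kwh) pairs, one per 15-minute
    interval; `peak_hours` is the set of hours billed at `peak_rate`.
    Returns (peak_kwh, off_peak_kwh, total_cost).
    """
    peak_kwh = sum(kwh for hour, kwh in readings if hour in peak_hours)
    off_peak_kwh = sum(kwh for hour, kwh in readings if hour not in peak_hours)
    total_cost = peak_kwh * peak_rate + off_peak_kwh * off_peak_rate
    return peak_kwh, off_peak_kwh, total_cost
```

With a single monthly reading only the sum of the two energy totals is known, so a split price of this kind cannot be applied; this is why the 15-minute data matter for dynamic pricing.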

7 Conclusion, open issues, and outlook

In this paper, we review the background and state-of-the-art of big data. Firstly, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture to readers of this exciting area.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, which are discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

– Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to weigh the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated once the system is implemented and deployed, which does not allow a horizontal comparison of the advantages and disadvantages of alternative solutions, even before and after the implementation of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, effectively and rigorously evaluating data quality is also an urgent problem.

– Evolution of big data computing modes: This includes memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, etc. should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor that restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.


– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor in improving big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems in big data processing arise from traditional data analysis, including (i) data re-utilization: with the increase of data scale, more value may be mined from re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized, from which more value can be mined; and (iii) data exhaust, which means wrong data during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved.

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from comprehensive utilization of multiple datasets is far higher than the sum of the values of individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and from different datasets.

– Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication; big data applications in small and medium-sized businesses; big data applications in government departments; big data services; and human-computer interaction of big data, etc. are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume grows fast, there are more severe safety risks, while traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security-related challenges.

– Big data privacy: Big data privacy includes two aspects: (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc. of users may be more easily acquired, and users may not be aware. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed a big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher at Skull Security, used an information acquisition tool to collect data from the public pages of Facebook users who had failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low quality data wastes transmission and storage resources and has poor usability. Many factors may restrict data quality; for example, generation, acquisition, and transmission may all influence data


quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though many measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair damaged data need to be investigated.

– Big data safety mechanism: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods designed for small and medium-scale data cannot meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safety communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data application in information security: Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) by analyzing big data in the form of log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc. may also be more easily identified through the analysis of big data.
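Two of the data quality dimensions named in the list above, completeness and redundancy, lend themselves to simple automatic checks. The following is a minimal sketch of such a check over a batch of records. The function name `quality_report`, the record layout, and the exact definitions used here (a record is complete when every required field is present and non-empty; redundancy is the share of exact duplicate records) are assumptions made for illustration, and accuracy and consistency would need domain-specific rules that are not shown.

```python
def quality_report(records, required_fields):
    """Compute simple completeness and redundancy ratios for a batch.

    `records` is a list of flat dicts with hashable values.
    completeness: share of records with every required field filled in.
    redundancy: share of records that exactly duplicate an earlier record.
    """
    total = len(records)
    if total == 0:
        return {"completeness": 1.0, "redundancy": 0.0}
    complete = sum(
        all(record.get(field) not in (None, "") for field in required_fields)
        for record in records
    )
    # Records are compared as sorted (key, value) tuples so that key order
    # inside a dict does not matter.
    distinct = len({tuple(sorted(record.items())) for record in records})
    return {
        "completeness": complete / total,
        "redundancy": 1 - distinct / total,
    }
```

Ratios like these could be tracked per data source over time, so that a sudden drop in completeness or rise in redundancy points to a problem in generation, acquisition, or transmission.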

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy efficiency optimization, and hardware and software system architectures for processing, etc. Particularly, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the “T” (Technology) was the main concern, while technology drove the development of data. In the big data era, with the prominence of data value and advances in “I” (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact, but also influence everyone's ways of living and thinking, which is already happening. We cannot predict the future, but may take precautions for events that may occur.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced, given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have been concerned with better ways to cope with larger-scale, higher-diversity, and more complexly structured data. These efforts are represented by Google's globally-distributed database Spanner and the fault-tolerant, expandable, distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc. shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a friendly manner can they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc. are frequently used to visualize results of data analysis. New presentation forms will emerge in the future, e.g., Microsoft Renlifang, a social search engine,


utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it is observed that the role of data is becoming increasingly significant. In the small-scale data era, in which logic is more complex than data, program design is mainly process-oriented. As business data becomes more complex, object-oriented design methods have been developed. Nowadays, the complexity of business data has far surpassed that of business logic. Consequently, programs are gradually being transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods are certain to emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

– During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

– Compared with accurate data, we would like to accept numerous and complicated data.

– We shall pay greater attention to correlations between things rather than exploring causal relationships.

– The simple algorithms of big data are more effective than complex algorithms of small data.

– Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace “experts.”

Throughout the history of human society, the demands and willingness of human beings have always been the driving forces of scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

                                References

1. Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

2. Fact sheet: Big data across the federal government (2012). http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf

3. Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper

4. Drowning in numbers - digital data will flood the planet - and help us understand it better (2011). http://www.economist.com/blogs/dailychart/2011/11/bigdata-0

5. Lohr S (2012) The age of big data. New York Times, pp 11

6. Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/the-digital-breadcrumbs-that-lead-to-big-data

                                7 Yuki N The search for analysts to make sense of big data (2011)httpwwwnprorg20111130142893065the-searchforanalysts-to-make-sense-of-big-data

                                8 Big data (2008) httpwwwnaturecomnewsspecialsbigdataindexhtml

                                9 Special online collection dealing with big data (2011) httpwwwsciencemagorgsitespecialdata

                                10 Manyika J McKinsey Global Institute Chui M Brown BBughin J Dobbs R Roxburgh C Byers AH (2011) Big datathe next frontier for innovation competition and productivityMcKinsey Global Institute

                                11 Mayer-Schonberger V Cukier K (2013) Big data a revolu-tion that will transform how we live work and think EamonDolanHoughton Mifflin Harcourt

                                12 Laney D (2001) 3-d data management controlling data volumevelocity and variety META Group Research Note 6 February

                                13 Zikopoulos P Eaton C et al (2011) Understanding big data ana-lytics for enterprise class hadoop and streaming data McGraw-Hill Osborne Media

                                14 Meijer E (2011) The world according to linq Communicationsof the ACM 54(10)45ndash51

                                15 Beyer M (2011) Gartner says solving big data challenge involvesmore than just managing volumes of data Gartner httpwwwgartnercomitpagejsp

                                16 O R Team (2011) Big data now current perspectives fromOReilly Radar OReilly Media

                                17 Grobelnik M (2012) Big data tutorial httpvideolecturesneteswc2012grobelnikbigdata

                                18 Ginsberg J Mohebbi MH Patel RS Brammer L Smolinski MSBrilliant L (2008) Detecting influenza epidemics using searchengine query data Nature 457(7232)1012ndash1014

                                19 DeWitt D Gray J (1992) Parallel database systems the future ofhigh performance database systems Commun ACM 35(6)85ndash98

                                Mobile Netw Appl (2014) 19171ndash209 205

                                20 Walter T (2009) Teradata past present and future UCI ISGlecture series on scalable data management

                                21 Ghemawat S Gobioff H Leung S-T (2003) The google file sys-tem In ACM SIGOPS Operating Systems Review vol 37 ACMpp 29ndash43

                                22 Dean J Ghemawat S (2008) Mapreduce simplified data process-ing on large clusters Commun ACM 51(1)107ndash113

                                23 Hey AJG Tansley S Tolle KM et al (2009) The fourth paradigmdata-intensive scientific discovery

                                24 Howard JH Kazar ML Menees SG Nichols DASatyanarayanan M Sidebotham RN West MJ (1988) Scale andperformance in a distributed file system ACM Trans ComputSyst (TOCS) 6(1)51ndash81

                                25 Cattell R (2011) Scalable sql and nosql data stores ACM SIG-MOD Record 39(4)12ndash27

                                26 Labrinidis A Jagadish HV (2012) Challenges and opportunitieswith big data Proc VLDB Endowment 5(12)2032ndash2033

                                27 Chaudhuri S Dayal U Narasayya V (2011) An overviewof business intelligence technology Commun ACM 54(8)88ndash98

                                28 Agrawal D Bernstein P Bertino E Davidson S Dayal UFranklin M Gehrke J Haas L Halevy A Han J et al (2012) Chal-lenges and opportunities with big data A community white paperdeveloped by leading researches across the United States

                                29 Sun Y Chen M Liu B Mao S (2013) Far a fault-avoidantrouting method for data center networks with regular topologyIn Proceedings of ACMIEEE symposium on architectures fornetworking and communications systems (ANCSrsquo13) ACM

                                30 Wiki (2013) Applications and organizations using hadoophttpwikiapacheorghadoopPoweredBy

                                31 Bahga A Madisetti VK (2012) Analyzing massive machinemaintenance data in a computing cloud IEEE Transac ParallelDistrib Syst 23(10)1831ndash1843

                                32 Gunarathne T Wu T-L Choi JY Bae S-H Qiu J (2011)Cloud computing paradigms for pleasingly parallel biomedicalapplications Concurr Comput Prac Experience 23(17)2338ndash2354

                                33 Gantz J Reinsel D (2010) The digital universe decade-are youready External publication of IDC (Analyse the Future) informa-tion and data pp 1ndash16

                                34 Bryant RE (2011) Data-intensive scalable computing for scien-tific applications Comput Sci Eng 13(6)25ndash33

                                35 Wahab MHA Mohd MNH Hanafi HF Mohsin MFM (2008)Data pre-processing on web server logs for generalized asso-ciation rules mining algorithm World Acad Sci Eng Technol482008

                                36 Nanopoulos A Manolopoulos Y Zakrzewicz M Morzy T(2002) Indexing web access-logs for pattern queries In Proceed-ings of the 4th international workshop on web information anddata management ACM pp 63ndash68

                                37 Joshi KP Joshi A Yesha Y (2003) On using a warehouse toanalyze web logs Distrib Parallel Databases 13(2)161ndash180

                                38 Chandramohan V Christensen K (2002) A first look at wiredsensor networks for video surveillance systems In Proceed-ings LCN 2002 27th annual IEEE conference on local computernetworks IEEE pp 728ndash729

                                39 Selavo L Wood A Cao Q Sookoor T Liu H Srinivasan A WuY Kang W Stankovic J Young D et al (2007) Luster wirelesssensor network for environmental research In Proceedings ofthe 5th international conference on Embedded networked sensorsystems ACM pp 103ndash116

                                40 Barrenetxea G Ingelrest F Schaefer G Vetterli M CouachO Parlange M (2008) Sensorscope out-of-the-box environmen-tal monitoring In Information processing in sensor networks2008 international conference on IPSNrsquo08 IEEE pp 332ndash343

                                41 Kim Y Schmid T Charbiwala ZM Friedman J Srivastava MB(2008) Nawms nonintrusive autonomous water monitoring sys-tem In Proceedings of the 6th ACM conference on Embeddednetwork sensor systems ACM pp 309ndash322

                                42 Kim S Pakzad S Culler D Demmel J Fenves G Glaser STuron M (2007) Health monitoring of civil infrastructures usingwireless sensor networks In Information Processing in SensorNetworks 2007 6th International Symposium on IPSN 2007IEEE pp 254ndash263

                                43 Ceriotti M Mottola L Picco GP Murphy AL Guna S Corra MPozzi M Zonta D Zanon P (2009) Monitoring heritage build-ings with wireless sensor networks the torre aquila deploymentIn Proceedings of the 2009 International Conference on Infor-mation Processing in Sensor Networks IEEE Computer Societypp 277ndash288

                                44 Tolle G Polastre J Szewczyk R Culler D Turner N Tu KBurgess S Dawson T Buonadonna P Gay D et al (2005) Amacroscope in the redwoods In Proceedings of the 3rd interna-tional conference on embedded networked sensor systems ACMpp 51ndash63

                                45 Wang F Liu J (2011) Networked wireless sensor data collectionissues challenges and approaches IEEE Commun Surv Tutor13(4)673ndash687

                                46 Cho J Garcia-Molina H (2002) Parallel crawlers In Proceedingsof the 11th international conference on World Wide Web ACMpp 124ndash135

                                47 Choudhary S Dincturk ME Mirtaheri SM Moosavi A vonBochmann G Jourdan G-V Onut I-V (2012) Crawling rich inter-net applications the state of the art In CASCON pp 146ndash160

                                48 Ghani N Dixit S Wang T-S (2000) On ip-over-wdm integrationIEEE Commun Mag 38(3)72ndash84

                                49 Manchester J Anderson J Doshi B Dravida S Ip over sonet(1998) IEEE Commun Mag 36(5)136ndash142

                                50 Jinno M Takara H Kozicki B (2009) Dynamic optical mesh net-works drivers challenges and solutions for the future In Opticalcommunication 2009 35th European conference on ECOCrsquo09IEEE pp 1ndash4

                                51 Barroso LA Holzle U (2009) The datacenter as a computer anintroduction to the design of warehouse-scale machines SyntLect Comput Archit 4(1)1ndash108

                                52 Armstrong J (2009) Ofdm for optical communications J LightTechnol 27(3)189ndash204

                                53 Shieh W (2011) Ofdm for flexible high-speed optical networksJ Light Technol 29(10)1560ndash1577

                                54 Cisco data center interconnect design and deployment guide(2010)

                                55 Greenberg A Hamilton JR Jain N Kandula S Kim C LahiriP Maltz DA Patel P Sengupta S (2009) Vl2 a scalable andflexible data center network In ACM SIGCOMM computercommunication review vol 39 ACM pp 51ndash62

                                56 Guo C Lu G Li D Wu H Zhang X Shi Y Tian C Zhang YLu S (2009) Bcube a high performance server-centric networkarchitecture for modular data centers ACM SIGCOMM ComputCommun Rev 39(4)63ndash74

                                57 Farrington N Porter G Radhakrishnan S Bazzaz HHSubramanya V Fainman Y Papen G Vahdat A (2011) Heliosa hybrid electricaloptical switch architecture for modular datacenters ACM SIGCOMM Comput Commun Rev 41(4)339ndash350

                                58 Abu-Libdeh H Costa P Rowstron A OrsquoShea G Donnelly A(2010) Symbiotic routing in future data centers ACM SIG-COMM Comput Commun Rev 40(4)51ndash62

                                59 Lam C Liu H Koley B Zhao X Kamalov V Gill V Fiberoptic communication technologies whatrsquos needed for datacenternetwork operations (2010) IEEE Commun Mag 48(7)32ndash39

                                206 Mobile Netw Appl (2014) 19171ndash209

                                60 Wang G Andersen DG Kaminsky M Papagiannaki K NgTS Kozuch M Ryan M (2010) c-through Part-time optics indata centers In ACM SIGCOMM Computer CommunicationReview vol 40 ACM pp 327ndash338

                                61 Ye X Yin Y Yoo SJB Mejia P Proietti R Akella V (2010) Dosa scalable optical switch for datacenters In Proceedings of the6th ACMIEEE symposium on architectures for networking andcommunications systems ACM p 24

                                62 Singla A Singh A Ramachandran K Xu L Zhang Y (2010) Pro-teus a topology malleable data center network In Proceedingsof the 9th ACM SIGCOMM workshop on hot topics in networksACM p 8

                                63 Liboiron-Ladouceur O Cerutti I Raponi PG Andriolli NCastoldi P (2011) Energy-efficient design of a scalable opti-cal multiplane interconnection architecture IEEE J Sel TopQuantum Electron 17(2)377ndash383

                                64 Kodi AK Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance comput-ing (hpc) systems IEEE J Sel Top Quantum Electron 17(2)384ndash395

                                65 Zhou X Zhang Z Zhu Y Li Y Kumar S Vahdat A Zhao BYZheng H (2012) Mirror mirror on the ceiling flexible wirelesslinks for data centers ACM SIGCOMM Comput Commun Rev42(4)443ndash454

                                66 Lenzerini M (2002) Data integration a theoretical perspectiveIn Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems ACMpp 233ndash246

                                67 Cafarella MJ Halevy A Khoussainova N (2009) Data integra-tion for the relational web Proc VLDB Endowment 2(1)1090ndash1101

                                68 Maletic JI Marcus A (2000) Data cleansing beyond integrityanalysis In IQ Citeseer pp 200ndash209

                                69 Kohavi R Mason L Parekh R Zheng Z (2004) Lessons andchallenges from mining retail e-commerce data Mach Learn57(1-2)83ndash113

                                70 Chen H Ku W-S Wang H Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing In Proceedings ofthe 2010 ACM SIGMOD international conference on manage-ment of data ACM pp 51ndash62

                                71 Zhao Z Ng W (2012) A model-based approach for rfid datastream cleansing In Proceedings of the 21st ACM internationalconference on information and knowledge management ACMpp 862ndash871

                                72 Khoussainova N Balazinska M Suciu D (2008) Probabilisticevent extraction from rfid data In Data Engineering 2008IEEE 24th international conference on ICDE 2008 IEEE pp1480ndash1482

                                73 Herbert KG Wang JTL (2007) Biological data cleaning a casestudy Int J Inf Qual 1(1)60ndash82

                                74 Tsai T-H Lin C-Y (2012) Exploring contextual redundancy inimproving object-based video coding for video sensor networkssurveillance IEEE Transac Multmed 14(3)669ndash682

                                75 Sarawagi S Bhamidipaty A (2002) Interactive deduplicationusing active learning In Proceedings of the eighth ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 269ndash278

                                76 Kamath U Compton J Dogan RI Jong KD Shehu A (2012)An evolutionary algorithm approach for feature generation fromsequence data and its application to dna splice site predic-tion IEEEACM Transac Comput Biol Bioinforma (TCBB)9(5)1387ndash1398

                                77 Leung K-S Lee KH Wang J-F Ng EYT Chan HLY Tsui SKWMok TSK Tse PC-H Sung JJ-Y (2011) Data mining on dnasequences of hepatitis b virus IEEEACM Transac Comput BiolBioinforma 8(2)428ndash440

                                78 Huang Z Shen H Liu J Zhou X (2011) Effective data co-reduction for multimedia similarity search In Proceedings of the2011 ACM SIGMOD International Conference on Managementof data ACM pp 1021ndash1032

                                79 Bleiholder J Naumann F (2008) Data fusion ACM Comput Surv(CSUR) 41(1)1

                                80 Brewer EA (2000) Towards robust distributed systems InPODC p 7

                                81 Gilbert S Lynch N (2002) Brewerrsquos conjecture and the feasibil-ity of consistent available partition-tolerant web services ACMSIGACT News 33(2)51ndash59

                                82 McKusick MK Quinlan S (2009) Gfs eqvolution on fast-forward ACM Queue 7(7)10

                                83 Chaiken R Jenkins B Larson P-A Ramsey B Shakib D WeaverS Zhou J (2008) Scope easy and efficient parallel process-ing of massive data sets Proc VLDB Endowment 1(2)1265ndash1276

                                84 Beaver D Kumar S Li HC Sobel J Vajgel P et al (2010) Findinga needle in haystack facebookrsquos photo storage In OSDI vol 10pp 1ndash8

                                85 DeCandia G Hastorun D Jampani M Kakulapati G LakshmanA Pilchin A Sivasubramanian S Vosshall P Vogels W (2007)Dynamo amazonrsquos highly available key-value store In SOSPvol 7 pp 205ndash220

                                86 Karger D Lehman E Leighton T Panigrahy R Levine MLewin D (1997) Consistent hashing and random trees distributedcaching protocols for relieving hot spots on the world wide webIn Proceedings of the twenty-ninth annual ACM symposium ontheory of computing ACM pp 654ndash663

                                87 Chang F Dean J Ghemawat S Hsieh WC Wallach DA BurrowsM Chandra T Fikes A Gruber RE (2008) Bigtable a distributedstorage system for structured data ACM Trans Comput Syst(TOCS) 26(2)4

                                88 Burrows M (2006) The chubby lock service for loosely-coupleddistributed systems In Proceedings of the 7th symposium onOperating systems design and implementation USENIX Associ-ation pp 335ndash350

                                89 Lakshman A Malik P (2009) Cassandra structured storagesystem on a p2p network In Proceedings of the 28th ACMsymposium on principles of distributed computing ACMpp 5ndash5

                                90 George L (2011) HBase the definitive guide OrsquoReilly Media Inc91 Judd D (2008) hypertable-09 04-alpha92 Chodorow K (2013) MongoDB the definitive guide OrsquoReilly

                                Media Inc93 Crockford D (2006) The applicationjson media type for

                                javascript object notation (json)94 Murty J (2009) Programming amazon web services S3 EC2

                                SQS FPS and SimpleDB OrsquoReilly Media Inc95 Anderson JC Lehnardt J Slater N (2010) CouchDB the defini-

                                tive guide OrsquoReilly Media Inc96 Blanas S Patel JM Ercegovac V Rao J Shekita EJ Tian Y

                                (2010) A comparison of join algorithms for log processing inmapreduce In Proceedings of the 2010 ACM SIGMOD inter-national conference on management of data ACM pp 975ndash986

                                97 Yang H-C Parker DS (2009) Traverse simplified indexingon large map-reduce-merge clusters In Database systems foradvanced applications Springer pp 308ndash322

                                98 Pike R Dorward S Griesemer R Quinlan S (2005) Interpretingthe data parallel analysis with sawzall Sci Program 13(4)277ndash298

                                99 Gates AF Natkovich O Chopra S Kamath P NarayanamurthySM Olston C Reed B Srinivasan S Srivastava U (2009) Build-ing a high-level dataflow system on top of map-reduce the pigexperience Proceedings VLDB Endowment 2(2)1414ndash1425

                                Mobile Netw Appl (2014) 19171ndash209 207

                                100 Thusoo A Sarma JS Jain N Shao Z Chakka P Anthony SLiu H Wyckoff P Murthy R (2009) Hive a warehousing solu-tion over a map-reduce framework Proc VLDB Endowment2(2)1626ndash1629

                                101 Isard M Budiu M Yu Y Birrell A Fetterly D (2007) Dryad dis-tributed data-parallel programs from sequential building blocksACM SIGOPS Oper Syst Rev 41(3)59ndash72

                                102 Yu Y Isard M Fetterly D Budiu M Erlingsson U Gunda PKCurrey J (2008) Dryadlinq a system for general-purpose dis-tributed data-parallel computing using a high-level language InOSDI vol 8 pp 1ndash14

                                103 Moretti C Bulosan J Thain D Flynn PJ (2008) All-pairs anabstraction for data-intensive cloud computing In Parallel anddistributed processing 2008 IEEE international symposium onIPDPS 2008 IEEE pp 1ndash11

                                104 Malewicz G Austern MH Bik AJC Dehnert JC Horn I LeiserN Czajkowski G (2010) Pregel a system for large-scale graphprocessing In Proceedings of the 2010 ACM SIGMOD interna-tional conference on management of data ACM pp 135ndash146

                                105 Bu Y Bill H Balazinska M Ernst MD (2010) Haloop effi-cient iterative data processing on large clusters Proc VLDBEndowment 3(1-2)285ndash296

                                106 Ekanayake J Li H Zhang B Gunarathne T Bae S-H Qiu JFox G (2010) Twister a runtime for iterative mapreduce InProceedings of the 19th ACM international symposium on highperformance distributed computing ACM pp 810ndash818

                                107 Zaharia M Chowdhury M Das T Dave A Ma J McCauleyM Franklin M Shenker S Stoica I (2012) Resilient distributeddatasets a fault-tolerant abstraction for in-memory cluster com-puting In Proceedings of the 9th USENIX conference onnetworked systems design and implementation USENIX Asso-ciation pp 2ndash2

                                108 Bhatotia P Wieder A Rodrigues R Acar UA Pasquin R (2011)Incoop mapreduce for incremental computations In Proceed-ings of the 2nd ACM symposium on cloud computing ACMp 7

                                109 Murray DG Schwarzkopf M Smowton C Smith SMadhavapeddy A Hand S (2011) Ciel a universal executionengine for distributed data-flow computing In Proceedings ofthe 8th USENIX conference on Networked systems design andimplementation p 9

                                110 Anderson TW (1958) An introduction to multivariate statisticalanalysis vol 2 Wiley New York

                                111 Wu X Kumar V Quinlan JR Ghosh J Yang Q Motoda HMcLachlan GJ Ng A Liu B Philip SY et al (2008) Top 10algorithms in data mining Knowl Inf Syst 14(1)1ndash37

                                112 What analytics data mining big data software you used in thepast 12 months for a real project (2012) httpwwwkdnuggetscompolls2012analytics-data-mining-big-data-softwarehtml

                                113 Berthold MR Cebron N Dill F Gabriel TR Kotter T MeinlT Ohl P Sieb C Thiel K Wiswedel B (2008) KNIME theKonstanz information miner Springer

                                114 Sallam RL Richardson J Hagerty J Hostmann B (2011) Magicquadrant for business intelligence platforms CT Gartner GroupStamford

                                115 Beyond the PC Special Report on Personal Technology (2011)116 Goff SA Vaughn M McKay S Lyons E Stapleton AE Gessler

                                D Matasci N Wang L Hanlon M Lenards A et al (2011) Theiplant collaborative cyberinfrastructure for plant biology FrontPlant Sci 34(2)1ndash16 doi103389fpls201100034

                                117 Baah GK Gray A Harrold MJ (2006) On-line anomaly detectionof deployed software a statistical machine learning approachIn Proceedings of the 3rd international workshop on Softwarequality assurance ACM pp 70ndash77

                                118 Moeng M Melhem R (2010) Applying statistical machine learn-ing to multicore voltage amp frequency scaling In Proceedings of

                                the 7th ACM international conference on computing frontiersACM pp 277ndash286

                                119 Gaber MM Zaslavsky A Krishnaswamy S (2005) Mining datastreams a review ACM Sigmod Record 34(2)18ndash26

                                120 Verykios VS Bertino E Fovino IN Provenza LP Saygin YTheodoridis Y (2004) State-of-the-art in privacy preserving datamining ACM Sigmod Record 33(1)50ndash57

                                121 van der Aalst W (2012) Process mining overview and opportu-nities ACM Transac Manag Inform Syst (TMIS) 3(2)7

                                122 Manning CD Schutze H (1999) Foundations of statistical naturallanguage processing vol 999 MIT Press

                                123 Pal SK Talwar V Mitra P (2002) Web mining in soft computingframework relevance state of the art and future directions IEEETransac Neural Netw 13(5)1163ndash1177

                                124 Chakrabarti S (2000) Data mining for hypertext a tutorial surveyACM SIGKDD Explor Newsl 1(2)1ndash11

                                125 Brin S Page L (1998) The anatomy of a large-scale hypertextualweb search engine Comput Netw ISDN Syst 30(1)107ndash117

                                126 Konopnicki D Shmueli O (1995) W3qs a query system for theworld-wide web In VLDB vol 95 pp 54ndash65

                                127 Chakrabarti S Van den Berg M Dom B (1999) Focused crawl-ing a new approach to topic-specific web resource discoveryComput Netw 31(11)1623ndash1640

                                128 Ding D Metze F Rawat S Schulam PF Burger S YounessianE Bao L Christel MG Hauptmann A (2012) Beyond audio andvideo retrieval towards multimedia summarization In Proceed-ings of the 2nd ACM international conference on multimediaretrieval ACM pp 2

                                129 Wang M Ni B Hua X-S Chua T-S (2012) Assistive tag-ging a survey of multimedia tagging with human-computer jointexploration ACM Comput Surv (CSUR) 44(4)25

                                130 Lew MS Sebe N Djeraba C Jain R (2006) Content-based multi-media information retrieval state of the art and challenges ACMTrans Multimed Comput Commun Appl (TOMCCAP) 2(1)1ndash19

                                131 Hu W Xie N Li L Zeng X Maybank S (2011) A survey onvisual content-based video indexing and retrieval IEEE TransSyst Man Cybern Part C Appl Rev 41(6)797ndash819

                                132 Park Y-J Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recom-mendation Expert Syst Appl 36(2)1932ndash1939

                                133 Barragans-Martınez AB Costa-Montenegro E Burguillo JCRey-Lopez M Mikic-Fonte FA Peleteiro A (2010) A hybridcontent-based and item-based collaborative filtering approach torecommend tv programs enhanced with singular value decompo-sition Inf Sci 180(22)4290ndash4311

                                134 Naphade M Smith JR Tesic J Chang S-F Hsu W Kennedy LHauptmann A Curtis J (2006) Large-scale concept ontology formultimedia IEEE Multimedia 13(3)86ndash91

                                135 Ma Z Yang Y Cai Y Sebe N Hauptmann AG (2012) Knowl-edge adaptation for ad hoc multimedia event detection withfew exemplars In Proceedings of the 20th ACM internationalconference on multimedia ACM pp 469ndash478

                                136 Hirsch JE (2005) An index to quantify an individualrsquos scientificresearch output Proc Natl Acad Sci USA 102(46)16569

                                137 Watts DJ (2004) Six degrees the science of a connected ageWW Norton amp Company

                                138 Aggarwal CC (2011) An introduction to social network dataanalytics Springer

                                139 Scellato S Noulas A Mascolo C (2011) Exploiting place fea-tures in link prediction on location-based social networks InProceedings of the 17th ACM SIGKDD international conferenceon Knowledge discovery and data mining ACM pp 1046ndash1054

                                140 Ninagawa A Eguchi K (2010) Link prediction using proba-bilistic group models of network structure In Proceedings ofthe 2010 ACM symposium on applied Computing ACM pp1115ndash1116

                                208 Mobile Netw Appl (2014) 19171ndash209

                                141 Dunlavy DM Kolda TG Acar E (2011) Temporal link predic-tion using matrix and tensor factorizations ACM Transac KnowlDiscov Data (TKDD) 5(2)10

                                142 Leskovec J Lang KJ Mahoney M (2010) Empirical comparisonof algorithms for network community detection In Proceedingsof the 19th international conference on World wide web ACMpp 631ndash640

                                143 Du N Wu B Pei X Wang B Xu L (2007) Community detec-tion in large-scale social networks In Proceedings of the 9thWebKDD and 1st SNA-KDD 2007 workshop on Web mining andsocial network analysis ACM pp 16ndash25

                                144 Garg S Gupta T Carlsson N Mahanti A (2009) Evolution of anonline social aggregation network an empirical study In Pro-ceedings of the 9th ACM SIGCOMM conference on Internetmeasurement conference ACM pp 315ndash321

                                145 Allamanis M Scellato S Mascolo C (2012) Evolution of alocation-based online social network analysis and models InProceedings of the 2012 ACM conference on Internet measure-ment conference ACM pp 145ndash158

                                146 Gong NZ Xu W Huang L Mittal P Stefanov E Sekar V SongD (2012) Evolution of social-attribute networks measurementsmodeling and implications using google+ In Proceedings ofthe 2012 ACM conference on Internet measurement conferenceACM pp 131ndash144

                                147 Zheleva E Sharara H Getoor L (2009) Co-evolution of socialand affiliation networks In Proceedings of the 15th ACMSIGKDD international conference on Knowledge discovery anddata mining ACM pp 1007ndash1016

                                148 Tang J Sun J Wang C Yang Z (2009) Social influence anal-ysis in large-scale networks In Proceedings of the 15th ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 807ndash816

                                149 Li Y Chen W Wang Y Zhang Z-L (2013) Influence diffusiondynamics and influence maximization in social networks withfriend and foe relationships In Proceedings of the sixth ACMinternational conference on Web search and data mining ACMpp 657ndash666

                                150 Dai W Chen Y Xue G-R Yang Q Yu Y (2008) Translatedlearning transfer learning across different feature spaces InAdvances in neural information processing systems pp 353ndash360

                                151 Cisco Visual Networking Index (2013) Global mobile data traf-fic forecast update 2012ndash2017 httpwwwciscocomenUSsolutionscollateralns341ns525ns537ns705ns827white paperc11-520862html (Son erisim 5 Mayıs 2013)

                                152 Rhee Y Lee J (2009) On modeling a model of mobile com-munity designing user interfaces to support group interactionInteractions 16(6)46ndash51

                                153 Han J Lee J-G Gonzalez H Li X (2008) Mining massive rfidtrajectory and traffic data sets In Proceedings of the 14th ACMSIGKDD international conference on knowledge discovery anddata mining ACM p 2

                                154 Garg MK Kim D-J Turaga DS Prabhakaran B (2010) Mul-timodal analysis of body sensor network data streams forreal-time healthcare In Proceedings of the international con-ference on multimedia information retrieval ACM pp 469ndash478

                                155 Park Y Ghosh J (2012) A probabilistic imputation frameworkfor predictive analysis using variably aggregated multi-sourcehealthcare data In Proceedings of the 2nd ACM SIGHITinternational health informatics symposium ACM pp 445ndash454

                                156 Tasevski P (2011) Password attacks and generation strate-gies Tartu University Faculty of Mathematics and ComputerSciences

                                Mobile Netw Appl (2014) 19171ndash209 209

                                                                          • Smart grid
                                                                              • Conclusion, open issues, and outlook
                                                                                • Open issues
                                                                                  • Theoretical research
                                                                                  • Technology development
                                                                                  • Practical implications
                                                                                  • Data security
                                                                                    • Outlook
                                                                                      • Acknowledgments
                                                                                      • References

high availability and durability. The Dynamo system also provides eventual consistency, so that updates can be applied asynchronously to all replicas.

– Voldemort: Voldemort is also a key-value storage system, which was initially developed for, and is still used by, LinkedIn. Keys and values in Voldemort are composite objects built from lists and maps. The Voldemort interface includes three simple operations: reading, writing, and deletion, all of which are identified by keys. Voldemort provides multi-version concurrency control (MVCC) with asynchronous updates, but does not ensure data consistency. However, Voldemort supports optimistic locking for consistent multi-record updates: when an update conflicts with any other operation, the update is aborted. The data replication mechanism of Voldemort is the same as that of Dynamo. Voldemort not only stores data in RAM but also allows data to be inserted into a storage engine. In particular, Voldemort supports two storage engines: Berkeley DB and Random Access Files.

Key-value databases emerged a few years ago. Deeply influenced by Amazon Dynamo, other key-value storage systems include Redis, Tokyo Cabinet and Tokyo Tyrant, Memcached and MemcacheDB, Riak, and Scalaris, all of which provide scalability by distributing keys across nodes. Voldemort, Riak, Tokyo Cabinet, and Memcached can use attached storage devices to store data in RAM or on disk. The other storage systems store data in RAM and provide disk backup, or rely on replication and recovery to avoid the need for backup.
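The idea of distributing keys across nodes can be illustrated with a small consistent-hashing ring. This is a minimal sketch under stated assumptions, not the partitioning code of any system named above: the class name `HashRing`, the node names, the replica count, and the use of MD5 are all chosen here purely for illustration.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a position on the ring (MD5 is an illustrative choice)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hashing ring: a key is owned by the next node clockwise."""

    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        # Sorted ring positions, each paired with the node placed at that position.
        self._ring = sorted((_hash(n), n) for n in nodes)
        self._positions = [pos for pos, _ in self._ring]

    def owners(self, key):
        """Return the nodes storing `key`: the owner followed by its ring successors."""
        start = bisect.bisect_right(self._positions, _hash(key)) % len(self._ring)
        count = min(self.replicas, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


ring = HashRing(["node-a", "node-b", "node-c", "node-d"], replicas=2)
print(ring.owners("user:42"))  # two distinct nodes; which ones depends on the hash values
```

Because only the keys between a removed node and its predecessor change owner, adding or removing a node moves a small fraction of the data, which is the property that makes this style of partitioning attractive for scalable stores.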

– Column-oriented Database: Column-oriented databases store and process data by columns rather than by rows. Both columns and rows are partitioned across multiple nodes to achieve scalability. Column-oriented databases are mainly inspired by Google's BigTable. In this section, we first discuss BigTable and then introduce several derivative tools.

– BigTable: BigTable is a distributed, structured data storage system designed to process large-scale (PB-class) data across thousands of commodity servers [87]. The basic data structure of BigTable is a sparse, distributed, persistent, multi-dimensional sorted map. The map is indexed by row key, column key, and timestamp, and every value in the map is an uninterpreted byte array. Each row key in BigTable is an arbitrary string of up to 64 KB. Rows are stored in lexicographic order and continually partitioned into Tablets (i.e., units of distribution) for load balancing. Thus, reading a short range of rows can be highly efficient, since it involves communication with only a small number of machines. The columns are grouped according to the prefixes of their keys, thus forming column families. These column families are the basic units for access control. The timestamps are 64-bit integers that distinguish different versions of a cell value. Clients may flexibly determine the number of cell versions stored. These versions are sorted in descending order of timestamp, so the latest version is always read first.

The BigTable API supports the creation and deletion of tables and column families, as well as modification of the metadata of clusters, tables, and column families. Client applications may insert or delete values in BigTable, query values from columns, or browse subsets of the data in a table. BigTable also supports some other features, such as transaction processing within a single row. Users may utilize such features to conduct more complex data processing.

Every BigTable deployment includes three main components: a Master server, Tablet servers, and a client library. BigTable allows only one active Master server, which is responsible for assigning Tablets to Tablet servers, detecting added or removed Tablet servers, and balancing load. In addition, it handles schema changes, e.g., creating tables and column families, and garbage-collects deleted or obsolete files stored in GFS that belong to a specific BigTable instance. Every Tablet server manages a set of Tablets and handles reads and writes for the Tablets it has loaded. When a Tablet grows too large, the server splits it. The client library is used by applications to communicate with BigTable instances.

BigTable is built on many fundamental components of Google's infrastructure, including GFS [25], a cluster management system, the SSTable file format, and Chubby [88]. GFS is used to store data and log files. The cluster management system is responsible for task scheduling, resource sharing, handling of machine failures, and monitoring of machine status. The SSTable file format is used to store BigTable data internally,


and it provides a persistent, ordered, immutable mapping from keys to values, where both keys and values are arbitrary byte strings. BigTable utilizes Chubby for the following tasks: 1) ensuring that there is at most one active Master at any time; 2) storing the bootstrap location of BigTable data; 3) discovering Tablet servers; 4) conducting error recovery in case of Tablet server failures; 5) storing BigTable schema information; and 6) storing the access control lists.

– Cassandra: Cassandra is a distributed storage system for managing huge amounts of structured data distributed among multiple commodity servers [89]. The system was developed by Facebook and became an open source tool in 2008. It adopts the ideas and concepts of both Amazon Dynamo and Google BigTable, in particular integrating the distributed system technology of Dynamo with the BigTable data model. Tables in Cassandra take the form of a distributed four-dimensional structured map, where the four dimensions are row, column, column family, and super column. A row is identified by a string key of arbitrary length. Regardless of the number of columns to be read or written, an operation on a row is atomic. Columns may be grouped into sets called column families, similar to the data model of BigTable. Cassandra provides two kinds of column families: column families and super columns. A super column contains an arbitrary number of columns associated with the same name. A column family contains columns and super columns, which may be continuously inserted into the column family at runtime. The partitioning and replication mechanisms of Cassandra are very similar to those of Dynamo, so as to achieve consistency.

– Derivative tools of BigTable: since the BigTable code cannot be obtained under an open source license, several open source projects, such as HBase and HyperTable, have competed to implement the BigTable concept in similar systems.

HBase is a BigTable clone written in Java and is part of Apache's Hadoop MapReduce framework [90]. HBase replaces GFS with HDFS. It writes updated contents into RAM and regularly flushes them into files on disk. Row operations are atomic, with row-level locking and transaction processing, which is optional for large-scale deployments. Partitioning and distribution are handled transparently, with no need for client-side hashing or a fixed key space.

HyperTable was developed along the lines of BigTable to provide a set of high-performance, scalable, distributed storage and processing systems for structured and unstructured data [91]. HyperTable relies on distributed file systems, e.g., HDFS, and a distributed lock manager. Its data representation, processing, and partitioning mechanisms are similar to those of BigTable. HyperTable has its own query language, called HyperTable Query Language (HQL), which allows users to create, modify, and query the underlying tables.

Since the column-oriented storage databases mainly emulate BigTable, their designs are all similar, except for the concurrency mechanism and several other features. For example, Cassandra emphasizes weak consistency with multi-version concurrency control, while HBase and HyperTable focus on strong consistency through locks or log records.
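The data model shared by these systems, a map from (row key, column key, timestamp) to an uninterpreted byte string with versions kept newest first, can be sketched in a few lines. The following is a toy in-memory model, not the API of BigTable, HBase, or HyperTable; the class name `SparseTable`, its methods, and the `max_versions` parameter are assumptions introduced for illustration.

```python
from collections import defaultdict


class SparseTable:
    """Toy in-memory model of a (row, column, timestamp) -> bytes sorted map."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # row key -> column key ("family:qualifier") -> list of (timestamp, value)
        self._rows = defaultdict(dict)

    def put(self, row, column, timestamp, value: bytes):
        versions = self._rows[row].setdefault(column, [])
        versions.append((timestamp, value))
        # Keep versions in descending timestamp order and drop the oldest extras.
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]

    def get(self, row, column):
        """Return the most recent value of a cell, or None if the cell is empty."""
        versions = self._rows.get(row, {}).get(column)
        return versions[0][1] if versions else None

    def scan(self, start_row, end_row):
        """Yield (row, columns) for row keys in [start_row, end_row), in lexicographic order."""
        for row in sorted(self._rows):
            if start_row <= row < end_row:
                yield row, self._rows[row]


table = SparseTable(max_versions=2)
table.put("com.example/index", "contents:html", 1, b"v1")
table.put("com.example/index", "contents:html", 3, b"v3")
table.put("com.example/index", "contents:html", 2, b"v2")
print(table.get("com.example/index", "contents:html"))  # b'v3'
```

Sorting row keys lexicographically is what makes short range scans cheap in the real systems: neighbouring rows fall in the same partition, so a scan touches few servers.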

– Document Database: Compared with key-value storage, document storage can support more complex data forms. Since documents do not follow a strict schema, there is no need to conduct schema migration. In addition, key-value pairs can still be saved. We will examine three important representatives of document storage systems, i.e., MongoDB, SimpleDB, and CouchDB.

– MongoDB: MongoDB is an open-source, document-oriented database [92]. MongoDB stores documents as Binary JSON (BSON) objects [93], which are similar to JSON objects. Every document has an ID field as the primary key. Queries in MongoDB are expressed with a syntax similar to JSON. A database driver sends the query as a BSON object to MongoDB. The system allows queries on all documents, including embedded objects and arrays. To enable rapid queries, indexes can be created on the queryable fields of documents. Replication in MongoDB is carried out with a log file on the master node that records all the high-level operations conducted in the database. During replication, the slaves query the master for all write operations since their last synchronization and execute the logged operations on their local databases. MongoDB supports horizontal scaling with automatic sharding, which distributes data among thousands of nodes with automatic load balancing and failover.


– SimpleDB: SimpleDB is a distributed database offered as a web service by Amazon [94]. Data in SimpleDB is organized into various domains, in which data may be stored, acquired, and queried. Domains contain items with different properties and sets of name/value pairs. Data is replicated to different machines at different data centers in order to ensure data safety and improve performance. This system does not support automatic partitioning and thus cannot be expanded as the data volume changes. SimpleDB allows users to query with SQL. It is worth noting that SimpleDB can assure eventual consistency but does not support Multi-Version Concurrency Control (MVCC). Therefore, conflicts cannot be detected from the client side.

– CouchDB: Apache CouchDB is a document-oriented database written in Erlang [95]. Data in CouchDB is organized into documents consisting of fields named by keys/names and values, which are stored and accessed as JSON objects. Every document is provided with a unique identifier. CouchDB allows access to database documents through a RESTful HTTP API. If a document needs to be modified, the client must download the entire document, modify it, and then send it back to the database. Each time a document is rewritten, its revision identifier is updated. CouchDB uses optimistic replication to obtain scalability without a sharding mechanism. Since multiple CouchDB instances may process transactions simultaneously, any kind of replication topology can be built. The consistency of CouchDB relies on the replication mechanism. CouchDB supports MVCC with historical hash records.
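The download-modify-upload cycle with a changing revision marker is a form of optimistic concurrency control, and it can be sketched with a small in-memory store. This is an illustration of the idea only, not CouchDB's HTTP API: the names `DocStore` and `ConflictError` are hypothetical, and the revision format (a generation counter plus an MD5 hash of the content) is an assumption made for the example.

```python
import hashlib
import json


class ConflictError(Exception):
    """Raised when an update is based on a stale revision."""


class DocStore:
    """Hypothetical in-memory document store with revision-checked updates."""

    def __init__(self):
        self._docs = {}  # document id -> (revision, document body)

    @staticmethod
    def _next_rev(prev_rev, body):
        # Revision = generation counter plus a hash of the content (illustrative scheme).
        generation = int(prev_rev.split("-")[0]) + 1 if prev_rev else 1
        digest = hashlib.md5(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()
        return f"{generation}-{digest}"

    def get(self, doc_id):
        rev, body = self._docs[doc_id]
        return rev, dict(body)  # hand out a copy, as a client would download the document

    def put(self, doc_id, body, rev=None):
        current = self._docs.get(doc_id)
        current_rev = current[0] if current else None
        if rev != current_rev:
            # Someone else wrote the document after this client read it.
            raise ConflictError(f"expected revision {current_rev}, got {rev}")
        new_rev = self._next_rev(current_rev, body)
        self._docs[doc_id] = (new_rev, dict(body))
        return new_rev


store = DocStore()
first = store.put("doc-1", {"title": "draft"})
rev, doc = store.get("doc-1")
doc["title"] = "final"
store.put("doc-1", doc, rev)            # succeeds: based on the current revision
try:
    store.put("doc-1", {"title": "stale"}, first)
except ConflictError as err:
    print("conflict:", err)
```

A writer never blocks a reader in this scheme; a conflicting writer simply finds out at upload time and must re-read the document before retrying.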

Big data are generally stored in hundreds or even thousands of commodity servers. Thus, traditional parallel models, such as Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), may not be adequate to support such large-scale parallel programs. Recently, some proposed parallel programming models have effectively improved the performance of NoSQL and reduced the performance gap to relational databases. Therefore, these models have become the cornerstone for the analysis of massive data.

– MapReduce: MapReduce [22] is a simple but powerful programming model for large-scale computing that uses large clusters of commodity PCs to achieve automatic parallel processing and distribution. The MapReduce computing model has only two functions, i.e., Map and Reduce, both of which are programmed by users. The Map function processes input key-value pairs and generates intermediate key-value pairs. Then, MapReduce combines all the intermediate values related to the same key and transmits them to the Reduce function, which further compresses the value set into a smaller set. MapReduce has the advantage that it avoids the complicated steps of developing parallel applications, e.g., data scheduling, fault tolerance, and inter-node communication. The user only needs to program the two functions to develop a parallel application. The initial MapReduce framework did not support multiple datasets in a task, which has been mitigated by some recent enhancements [96, 97].

Over the past decades, programmers have become familiar with the high-level declarative language SQL, often used in relational databases for task description and dataset analysis. However, the succinct MapReduce framework only provides two opaque functions, which cannot cover all the common operations. Therefore, programmers have to spend time programming basic functions, which are typically hard to maintain and reuse. In order to improve programming efficiency, some high-level language systems have been proposed, e.g., Sawzall [98] of Google, Pig Latin [99] of Yahoo, Hive [100] of Facebook, and Scope [87] of Microsoft.
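The division of labor between the two user-supplied functions and the framework can be illustrated with the classic word-count task. The sketch below is a single-process stand-in written for this purpose, not Hadoop's or Google's API: `run_mapreduce`, `map_fn`, and `reduce_fn` are hypothetical names, and the grouping step runs in memory rather than across machines.

```python
from collections import defaultdict


def map_fn(_, line):
    """Map: emit an intermediate (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: collapse all values for one key into a single total."""
    return word, sum(counts)


def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)  # the grouping ("shuffle") step
    return dict(reduce_fn(k, v) for k, v in sorted(groups.items()))


lines = [(0, "big data big value"), (1, "data storage")]
print(run_mapreduce(lines, map_fn, reduce_fn))
# {'big': 2, 'data': 2, 'storage': 1, 'value': 1}
```

Everything outside `map_fn` and `reduce_fn` is what a real framework takes over: splitting the input, moving intermediate pairs between nodes, and re-running failed tasks.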

– Dryad: Dryad [101] is a general-purpose distributed execution engine for processing coarse-grained data-parallel applications. The operational structure of Dryad is a directed acyclic graph, in which vertexes represent programs and edges represent data channels. Dryad executes operations on the vertexes in clusters and transmits data via data channels, including files, TCP connections, and shared-memory FIFOs. During operation, resources in a logical operation graph are automatically mapped to physical resources.

The operation structure of Dryad is coordinated by a central program called the job manager, which can run in the cluster or on a workstation connected through the network. A job manager consists of two parts: 1) application code, which is used to build a job communication graph, and 2) library code, which is used to schedule available resources. All data is transmitted directly between vertexes. Therefore, the job manager is only responsible for decision-making and does not obstruct any data transmission.

In Dryad, application developers can flexibly choose any directed acyclic graph to describe the communication patterns of the application and express data transmission mechanisms. In addition, Dryad allows vertexes to use any number of input and output datasets, while MapReduce supports only one input set and one output set.


DryadLINQ [102] is the high-level language of Dryad and is used to integrate the aforementioned SQL-like language execution environment.

– All-Pairs: All-Pairs [103] is a system specially designed for biometrics, bio-informatics, and data mining applications. It focuses on comparing element pairs in two datasets by a given function. All-Pairs can be expressed as a three-tuple (Set A, Set B, Function F), in which Function F is used to compare all elements in Set A with all elements in Set B. The comparison result is an output matrix M, which is also called the Cartesian product or cross join of Set A and Set B.

All-Pairs is implemented in four phases: system modeling, distribution of input data, batch job management, and result collection. In Phase I, an approximate model of system performance is built to evaluate how much CPU resource is needed and how to partition the job. In Phase II, a spanning tree is built for data transmission, so that the workload of every partition can retrieve input data efficiently. In Phase III, after the data flow is delivered to the proper nodes, the All-Pairs engine builds a batch-processing submission for the jobs in each partition, queues them in the batch processing system, and formulates a node running command to acquire data. In the last phase, after the batch processing system completes the jobs, the extraction engine collects the results and combines them in a proper structure, which is generally a single file list in which all results are put in order.
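The abstraction itself, an output matrix M with M[i][j] = F(A[i], B[j]), can be written down directly. The sketch below is a single-machine illustration and not the All-Pairs engine: the function names `all_pairs` and `hamming`, the thread pool, and the one-job-per-row split are assumptions; the modeling, data distribution, and batch scheduling phases of the real system are omitted.

```python
from concurrent.futures import ThreadPoolExecutor


def all_pairs(set_a, set_b, func, workers=4):
    """Return the matrix M where M[i][j] = func(set_a[i], set_b[j])."""

    def compare_row(a):
        # One job per row of the output matrix; a real system would batch jobs per node.
        return [func(a, b) for b in set_b]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so row i of the result corresponds to set_a[i].
        return list(pool.map(compare_row, set_a))


def hamming(x, y):
    """Illustrative comparison function: number of differing positions."""
    return sum(cx != cy for cx, cy in zip(x, y))


matrix = all_pairs(["ACGT", "AAAA"], ["ACGA", "TTTT", "AAAA"], hamming)
print(matrix)  # [[1, 3, 3], [2, 4, 0]]
```

The work grows with |A| x |B|, which is why the real system spends effort on modeling and data placement before any comparison is run.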

– Pregel: The Pregel [104] system of Google facilitates the processing of large graphs, e.g., analysis of network graphs and social networking services. A computational task is expressed by a directed graph consisting of vertexes and directed edges. Every vertex is associated with a modifiable, user-defined value, and every directed edge is associated with its source vertex and consists of a user-defined value and the identifier of a target vertex. Once the graph is built, the program conducts iterative calculations, called supersteps, which are separated by global synchronization points, until the algorithm terminates and the output is produced. In every superstep, vertex computations run in parallel, and every vertex executes the same user-defined function expressing the logic of a given algorithm. Every vertex may modify its own state and that of its outgoing edges, receive messages sent in the previous superstep, send messages to other vertexes, and even modify the topological structure of the entire graph. Edges have no computations of their own. A vertex may deactivate itself by suspending its function. When all vertexes are inactive and there are no messages left to transmit, the entire program execution is completed. The output of a Pregel program is the set of values output by all the vertexes. Generally speaking, the input and output of a Pregel program are isomorphic directed graphs.
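The superstep loop can be made concrete with a small example in which every vertex learns the largest value in the graph. This is a sequential sketch of the vertex-centric model under simplifying assumptions, not Google's implementation: the function name `pregel_max` is hypothetical, every edge target must also appear as a key of `graph`, and each vertex is treated as halting after every superstep until a message wakes it up.

```python
def pregel_max(graph, values):
    """Propagate the maximum vertex value through a directed graph in supersteps.

    graph:  vertex -> list of target vertexes (outgoing edges)
    values: vertex -> initial value
    Returns the final values and the number of supersteps executed.
    """
    values = dict(values)
    inbox = {v: [] for v in graph}
    superstep = 0
    while superstep == 0 or any(inbox.values()):
        outbox = {v: [] for v in graph}
        for v in graph:
            # After superstep 0, a vertex only runs when a message reactivates it.
            if superstep > 0 and not inbox[v]:
                continue
            new_value = max([values[v]] + inbox[v])
            changed = new_value != values[v]
            values[v] = new_value
            if superstep == 0 or changed:
                for target in graph[v]:
                    outbox[target].append(new_value)
        inbox = outbox  # global synchronization point between supersteps
        superstep += 1
    return values, superstep


graph = {"a": ["b"], "b": ["c"], "c": ["a"]}
final_values, steps = pregel_max(graph, {"a": 3, "b": 6, "c": 2})
print(final_values)  # {'a': 6, 'b': 6, 'c': 6}
```

Messages sent in one superstep become visible only in the next, so the result does not depend on the order in which vertexes are visited inside a superstep.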

Inspired by the above programming models, other research has also focused on programming models for more complex computational tasks, e.g., iterative computations [105, 106], fault-tolerant memory computations [107], incremental computations [108], and flow control decision-making related to data [109].

5 Big data analysis

The analysis of big data mainly involves analytical methods for traditional data and big data, analytical architecture for big data, and software used for mining and analysis of big data. Data analysis is the final and most important phase in the value chain of big data, with the purpose of extracting useful value and providing suggestions or decisions. Different levels of potential value can be generated through the analysis of datasets in different fields [10]. However, data analysis is a broad area that changes frequently and is extremely complex. In this section, we introduce the methods, architectures, and tools for big data analysis.

5.1 Traditional data analysis

Traditional data analysis means using proper statistical methods to analyze massive data, to concentrate, extract, and refine useful data hidden in a batch of chaotic datasets, and to identify the inherent law of the subject matter, so as to maximize the value of data. Data analysis plays a major guiding role in making development plans for a country, understanding customer demands in commerce, and predicting market trends for enterprises. Big data analysis can be regarded as the analysis technique for a special kind of data. Therefore, many traditional data analysis methods may still be utilized for big data analysis. Several representative traditional data analysis methods are examined in the following, many of which come from statistics and computer science.

– Cluster Analysis: a statistical method for grouping objects, specifically, classifying objects according to some features. Cluster analysis is used to differentiate objects with particular features and divide them into categories (clusters) according to these features, such that objects in the same category have high homogeneity while different categories have high heterogeneity. Cluster analysis is an unsupervised learning method that needs no training data.

– Factor Analysis: basically targeted at describing the relations among many elements with only a few factors, i.e., grouping several closely related variables into a factor; the few factors are then used to reveal most of the information in the original data.


– Correlation Analysis: an analytical method for determining the law of relations, such as correlation, correlative dependence, and mutual restriction, among observed phenomena, and accordingly conducting forecast and control. Such relations may be classified into two types: (i) function, reflecting a strict dependence relationship among phenomena, which is also called a definitive dependence relationship; and (ii) correlation, an undetermined or inexact dependence relationship in which one numerical value of a variable may correspond to several numerical values of the other variable, and such numerical values fluctuate regularly around their mean values.

– Regression Analysis: a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies dependence relationships among variables that are hidden by randomness. Regression analysis may turn complex and undetermined correlations among variables into simple and regular ones.

– A/B Testing: also called bucket testing. It is a technique for determining how to improve target variables by comparing test groups. Big data will require a large number of tests to be executed and analyzed.

– Statistical Analysis: Statistical analysis is based on statistical theory, a branch of applied mathematics. In statistical theory, randomness and uncertainty are modeled with probability theory. Statistical analysis can provide a description of and an inference for big data. Descriptive statistical analysis can summarize and describe datasets, while inferential statistical analysis can draw conclusions from data subject to random variations. Statistical analysis is widely applied in the economic and medical care fields [110].

– Data Mining Algorithms: Data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. In 2006, the IEEE International Conference on Data Mining Series (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [111], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, and CART. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and link mining, all of which are among the most important problems in data mining research.
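Two of the methods listed above, correlation analysis and regression analysis, reduce to short formulas in the simplest two-variable case. The sketch below computes the Pearson correlation coefficient and an ordinary least-squares line in plain Python; the function names `pearson_r` and `fit_line` and the sample data are assumptions made for illustration, and the code presumes that neither variable is constant.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def fit_line(xs, ys):
    """Ordinary least squares for the model y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up observations that lie close to the line y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = fit_line(xs, ys)
print(pearson_r(xs, ys), slope, intercept)  # r close to 1; slope about 1.99, intercept about 0.05
```

The correlation coefficient measures how strongly the two variables move together, while the fitted line turns that inexact relationship into a simple rule that can be used for forecasting.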

5.2 Big data analytic methods

At the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data so as to bring value to enterprises and individuals. At present, the main processing methods for big data are as follows.

– Bloom Filter: A Bloom Filter consists of a series of hash functions. The principle of the Bloom Filter is to store hash values of data, rather than the data itself, in a bit array, which is in essence a bitmap index that uses hash functions to conduct lossy compression storage of data. It has advantages such as high space efficiency and high query speed, but also has disadvantages such as misrecognition (false positives) and difficulty with deletion.

– Hashing: a method that essentially transforms data into shorter fixed-length numerical values or index values. Hashing has advantages such as rapid reading and writing and high query speed, but it is hard to find a sound hash function.

– Index: an index is always an effective method to reduce the expense of disk reading and writing and to improve insertion, deletion, modification, and query speeds, both in traditional relational databases that manage structured data and in other technologies that manage semi-structured and unstructured data. However, an index has the disadvantage of the additional cost of storing index files, which must be maintained dynamically when data is updated.

– Trie: also called trie tree, a variant of the hash tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of the Trie is to utilize common prefixes of character strings to reduce comparisons on character strings to the greatest extent, so as to improve query efficiency.

– Parallel Computing: compared to traditional serial computing, parallel computing refers to simultaneously utilizing several computing resources to complete a computation task. Its basic idea is to decompose a problem into parts and assign them to several separate processes to be completed independently, so as to achieve co-processing. Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see a comparison in Table 1).

Although parallel computing systems or tools, such as MapReduce or Dryad, are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages are being developed based on these systems. Such high-level languages include Sawzall, Pig, and Hive, used for MapReduce, as well as Scope and DryadLINQ, used for Dryad.
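Of the methods listed in this subsection, the Bloom Filter is the easiest to show in full. The sketch below stores only hash-derived bit positions in a bit array and answers membership queries. It is a minimal illustration, not a production design: the class name `BloomFilter`, the default sizes, and the trick of deriving several hash functions by salting SHA-256 with an index are assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, possible false positives, no deletion."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # the bit array, packed into bytes

    def _positions(self, item: str):
        # Derive several bit positions by salting one hash function with the index i.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str):
        # Present only if every derived bit is set; unrelated items may collide.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


seen = BloomFilter()
seen.add("alice@example.com")
print("alice@example.com" in seen)  # True
print("bob@example.com" in seen)    # almost certainly False, but a false positive is possible
```

Deletion is hard because clearing a bit could also erase evidence of other items that hash to the same position, which is the disadvantage noted in the list above.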

5.3 Architecture for big data analysis

Because of the 4Vs of big data, different analytical architectures shall be considered for different application requirements.


Table 1 Comparison of MPI, MapReduce, and Dryad

                          MPI                           MapReduce                     Dryad
Deployment                Computing node and data       Computing and data storage    Computing and data storage
                          storage arranged separately   arranged at the same node     arranged at the same node
                          (data should be moved to      (computing should be          (computing should be
                          the computing node)           close to data)                close to data)
Resource management/      –                             Workqueue (Google),           Not clear
scheduling                                              HOD (Yahoo)
Low-level programming     MPI API                       MapReduce API                 Dryad API
High-level programming    –                             Pig, Hive, Jaql, ...          Scope, DryadLINQ
Data storage              The local file system,        GFS (Google), HDFS (Hadoop),  NTFS, Cosmos DFS
                          NFS, ...                      KFS, Amazon S3, ...
Task partitioning         User manually partitions      Automatic                     Automatic
                          the tasks
Communication             Messaging, remote             Files (local FS, DFS)         Files, TCP pipes,
                          memory access                                               shared-memory FIFOs
Fault tolerance           Checkpoint                    Task re-execution             Task re-execution

5.3.1 Real-time vs. offline analysis

According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

– Real-time analysis: mainly used in e-commerce and finance. Since data constantly changes, rapid data analysis is needed and analytical results must be returned with a very short delay. The main existing architectures for real-time analysis include (i) parallel processing clusters using traditional relational databases, and (ii) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

– Offline analysis: usually used for applications without high requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally works by importing logs into a special platform through data acquisition tools. Under the big data setting, many Internet enterprises utilize an offline analysis architecture based on Hadoop in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples of such acquisition tools include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool Timetunnel, and Chukwa of Hadoop. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.

5.3.2 Analysis at different levels

Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive-level analysis, which are examined in the following.

– Memory-level analysis is for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays, the memory of a server cluster surpasses hundreds of GB, and even the TB level is common. Therefore, an in-memory database technology may be used, and hot data should reside in memory so as to improve analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis. MongoDB is a representative memory-level analytical architecture. With the development of SSDs (Solid-State Drives), the capacity and performance of memory-level data analysis have been further improved, and such analysis is now widely applied.

– BI analysis is for the case where the data scale surpasses the memory level but the data can still be imported into the BI analysis environment. Current mainstream BI products provide data analysis plans that support data volumes above the TB level.

– Massive analysis is for the case where the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analysis uses HDFS of Hadoop to store data and MapReduce to analyze it. Most massive analysis belongs to the offline analysis category.
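The map, shuffle, and reduce steps behind MapReduce-style massive analysis can be sketched as follows. This is a minimal single-process illustration under stated assumptions, not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, run_job) and the sample log records are hypothetical, and a real job would distribute each phase across the cluster and read its input from HDFS.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (token, 1) pair for every whitespace-separated token."""
    return [(token.lower(), 1) for token in record.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single count."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on an in-memory input."""
    emitted = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())


# Hypothetical log lines standing in for files stored in a DFS.
logs = ["GET /index", "GET /cart", "POST /cart"]
counts = run_job(logs)
```

Because each map call sees one record and each reduce call sees one key, both phases can run in parallel on different machines, which is what makes the model suitable for data volumes beyond a single server.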


5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithms differ greatly according to the kind of data and the application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the five most widely used software packages, according to a 2012 KDNuggets survey of 798 professionals on "What Analytics, Data mining, Big Data software that you used in the past 12 months for a real project" [112].

– R (30.7 %): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. When computing-intensive tasks are executed, code written in C, C++, and Fortran may be called in the R environment. In addition, skilled users can directly call R objects in C. R is a realization of the S language, an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and drawing plots. Compared to S, R is more popular since it is open source. R ranked first in the KDNuggets 2012 survey. Furthermore, in a 2012 survey of "Design languages you have used for data mining/analysis in the past year", R was also in first place, ahead of SQL and Java. Due to the popularity of R, database manufacturers such as Teradata and Oracle have released products supporting R.

– Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins with powerful data analysis functions, such as Analysis ToolPak and Solver Add-in, are included, but they can be used only after users enable them. Excel is also the only commercial software among the top five.

– Rapid-I RapidMiner (26.7 %): RapidMiner is open source software used for data mining, machine learning, and predictive analysis. In a KDnuggets investigation in 2011, it was used more frequently than R (ranking first). The data mining and machine learning procedures provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment. The data mining flow is described in XML and displayed through a graphical user interface (GUI). RapidMiner is written in Java. It integrates the learners and evaluation methods of Weka, and works with R. Functions of RapidMiner are implemented by connecting processes that include various operators. The entire flow can be viewed as a factory production line, with original data as input and model results as output. The operators can be considered as specific functions with different input and output characteristics.

– KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, and open-source-rich platform for data integration, data processing, data analysis, and data mining [113]. It allows users to create data flows or data channels visually, to selectively run some or all analytical procedures, and to inspect analytical results, models, and interactive views. KNIME is written in Java and, being based on Eclipse, provides more functions as plug-ins. Through plug-ins, users can insert processing modules for files, pictures, and time series, and integrate them with various open source projects, e.g., R and Weka. KNIME covers data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted in a visual environment. KNIME is designed as a module-based and expandable framework. There is no dependence between its processing units and data containers, which makes it suitable for distributed environments and independent development. In addition, KNIME is easy to expand: developers can readily extend its nodes and views.

– Weka/Pentaho (14.8 %): Weka, abbreviated from Waikato Environment for Knowledge Analysis, is free and open-source machine learning and data mining software written in Java. Weka provides functions such as data processing, feature selection, classification, regression, clustering, association rules, and visualization. Pentaho is one of the most popular open-source BI software packages. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, data mining, and other aspects of BI. Weka's data processing algorithms are also integrated in Pentaho and can be called directly.

6 Big data applications

In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful values via judgments, suggestions, supports, or decisions. However, data analysis involves a wide range of applications, which frequently change and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields, including structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

6.1 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically includes large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have emerged over the past decades. For example, as early as the 1990s, BI had become a prevailing technology for business applications, and network search engines based on massive data mining emerged in the early 21st century. Some potential and influential applications from different fields, together with their data and analysis characteristics, are discussed as follows.

– Evolution of Commercial Applications: The earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. The analytical techniques used in such systems prevailed in the 1990s and were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, score cards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have been providing a unique opportunity for organizations to have online display and interact directly with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted with text analysis and website mining techniques. As reported in [115], the quantity of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the sensor-based Internet of Things are opening a new generation of innovative applications and require considerably larger capacity to support location-sensing, people-oriented, and context-aware operation.

– Evolution of Network Applications: The early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to mining email contents and building search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages full of various kinds of data, such as text, images, audio, videos, and interactive content. Therefore, plenty of advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share content.

– Evolution of Scientific Applications: Scientific research in many fields, such as astrophysics, oceanology, genomics, and environmental research, is acquiring massive data with high-throughput sensors and instruments. The US National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, inter-operative analysis software, and data services to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets vary widely in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and comprehensive coverage is not easy, we will focus on the key problems and technologies in data analysis in the following discussions.

6.2 Big data analysis fields


6.2.1 Structured data analysis

Business applications and scientific research may generate massive structured data, of which the management and analysis rely on mature commercialized technologies, such as RDBMS, data warehouse, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning based on exact mathematical models and powerful algorithms has been applied to anomaly detection [117] and energy control [118]. By exploiting data characteristics, temporal and spatial mining can extract knowledge structures hidden in high-speed data flows and sensors [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially in process analysis with event data [121].

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to have more business potential than structured data analysis. Generally, text analysis is a process of extracting useful information and knowledge from unstructured text. Text mining is interdisciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and data mining in particular. Most text mining systems are based on text representation and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammars [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the part of the Web being mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, image, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which will be discussed in Section 6.2.4. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves mining semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., in email and newsgroup management and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to model and integrate data on the Web so as to conduct more complex queries than keyword-based searches.

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the graph of links within a website or among multiple websites. Models are built based on the topological structure provided by hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages. Topic-oriented crawlers are another successful application of such models [127].
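As an illustration of how a link-structure model can rank pages, the following is a minimal power-iteration sketch in the spirit of PageRank. It is a simplified textbook formulation, not the production algorithm of [125]; the damping factor of 0.85, the fixed iteration count, the handling of dangling pages, and the four-page example graph are assumptions made for this sketch.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages by link structure.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (random jump).
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank evenly to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page site: C is linked to by A, B, and D.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
best_page = max(ranks, key=ranks.get)
```

In this toy graph the page with the most incoming links from well-ranked pages ends up with the highest score, which is the intuition behind using hyperlink topology to look up relevant pages.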

Web usage mining aims to mine auxiliary data generated by Web sessions or activities, whereas Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or transactions, cache, user queries, bookmark data, mouse clicks and scrolls, and any other kinds of data generated through interaction with the Web. As Web services and Web 2.0 are becoming mature and popular, Web usage data will have increasingly high variety. Web usage mining plays key roles in personalized space, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.
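A common first step when mining access logs is to group each user's requests into sessions. The sketch below shows one simple way to do this, assuming a hypothetical log format of (user, timestamp in seconds, URL) tuples and a 30-minute inactivity timeout; both the format and the timeout are illustrative assumptions rather than anything prescribed by the surveyed work.

```python
def sessionize(events, timeout=1800):
    """Split each user's requests into sessions.

    events is a list of (user, timestamp_seconds, url) tuples. A new
    session starts when a user has been inactive for more than timeout
    seconds. Returns a dict mapping user to a list of sessions, where
    each session is a list of URLs in visit order.
    """
    sessions = {}
    last_seen = {}
    for user, timestamp, url in sorted(events, key=lambda e: (e[0], e[1])):
        is_new_user = user not in sessions
        if is_new_user or timestamp - last_seen[user] > timeout:
            # Open a fresh session for this user.
            sessions.setdefault(user, []).append([])
        sessions[user][-1].append(url)
        last_seen[user] = timestamp
    return sessions


# Hypothetical access-log entries.
log = [
    ("u1", 0, "/home"),
    ("u1", 60, "/item/1"),
    ("u1", 4000, "/home"),
    ("u2", 10, "/search"),
]
user_sessions = sessionize(log)
```

The resulting sessions can then feed downstream usage-mining tasks such as frequent-path analysis or recommendation.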


6.2.4 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) has been growing at an amazing speed, and multimedia analysis aims to extract useful knowledge from it and understand its semantics. Because multimedia data is heterogeneous and most of it contains richer information than simple structured data or text data, extracting information is confronted with the huge challenge of the semantic difference. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection.

Audio summarization can be accomplished by extracting the prominent words or phrases from metadata or by synthesizing a new representation. Video summarization interprets the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods use a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied in many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video and apply smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.

Multimedia annotation inserts labels to describe the contents of images and videos at both the syntactic and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time and labor intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic difference. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to explore manual and automatic multimedia annotation simultaneously [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users to conveniently and quickly look up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, and includes shot boundary detection, key frame extraction, and scene segmentation. Based on the result of structural analysis, the second procedure is feature extraction, which mainly mines the features of key frames, objects, texts, and movements; these features are the foundation of video indexing and retrieval. Data mining, classification, and annotation use the extracted features to find patterns in video content and assign videos to predefined categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos. The retrieval results are then optimized through relevance feedback.
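The query-and-retrieval step can be illustrated with a small similarity search over extracted feature vectors. The sketch below uses cosine similarity as the similarity measurement, which is one common choice and an assumption here; the three-dimensional feature vectors, the video names, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_features, index, top_k=2):
    """Return the top_k video ids whose features best match the query."""
    scored = [
        (cosine_similarity(query_features, features), video_id)
        for video_id, features in index.items()
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [video_id for _, video_id in scored[:top_k]]


# Hypothetical index: video id -> feature vector from feature extraction.
video_index = {
    "news_clip": [0.9, 0.1, 0.0],
    "soccer_match": [0.1, 0.8, 0.3],
    "tennis_match": [0.0, 0.7, 0.5],
}
results = retrieve([0.0, 0.9, 0.2], video_index)
```

A real system would use high-dimensional features and an approximate nearest-neighbor index instead of scoring every video, but the ranking principle is the same.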

Multimedia recommendation recommends specific multimedia content according to users' preferences. It has proven to be an effective approach to providing personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. Content-based methods identify general features of users or of the content they are interested in, and recommend to users other content with similar features. These methods largely rely on content similarity measurement, but most of them suffer from limited content analysis and over-specialization. Collaborative-filtering-based methods identify groups with similar interests and recommend content to group members according to their behavior [132]. Presently, a mixed method has been introduced, which integrates the advantages of the two aforementioned types of methods to improve recommendation quality [133].
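The collaborative-filtering idea can be sketched as follows: users with similar rating histories are treated as a group, and items liked by similar users are recommended. This is a minimal user-based variant chosen for illustration, not the method of [132]; the rating data, user names, and function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse rating dicts (item -> rating)."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[item] * v[item] for item in shared)
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)


def recommend(target, ratings, top_k=1):
    """Recommend unseen items, weighting each rating by user similarity."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == target:
            continue
        similarity = cosine(ratings[target], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[target]:
                scores[item] = scores.get(item, 0.0) + similarity * rating
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k]


# Hypothetical ratings of video clips on a 1-5 scale.
ratings = {
    "alice": {"clip_a": 5, "clip_b": 4},
    "bob": {"clip_a": 5, "clip_b": 5, "clip_c": 4},
    "carol": {"clip_d": 5, "clip_e": 3},
}
suggestions = recommend("alice", ratings)
```

Here alice and bob rate the same clips similarly, so bob's additional clip is suggested to alice, while carol's clips receive no weight because she shares no rated items with alice.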

The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains text descriptions related to concepts and video examples [134]. In [135], the author proposed a new algorithm for special multimedia event detection using a few positive training examples. The research on video event detection is still in its infancy, and mainly focuses on sports or news events, running or abnormal events in monitoring videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graph structures describing the communications between two entities. The content data contains text, image, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. From the data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

The research on link-based structural analysis has focused on link prediction, community discovery, social network evolution, and social influence analysis. SNS may be visualized as graphs, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertices and edges are continually added to the graphs. Link prediction predicts the possibility of a future connection between two vertices. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and uses the existing link information to generate binary classifiers that predict future links [139]. Probabilistic methods aim to build models for connection probabilities among vertices in SNS [140]. Linear algebra methods compute the similarity between two vertices according to the singular similar matrix [141]. A community is represented by a sub-graph in which the edges connecting vertices within the sub-graph have high density, while the edges between two sub-graphs have much lower density [142].
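To make link prediction concrete, the sketch below scores every unconnected pair of vertices by the Jaccard overlap of their neighbor sets and proposes the highest-scoring pair as the most likely future link. Neighborhood overlap is only one simple feature that a feature-based classifier might use; it is swapped in here for illustration and is not the specific method of [139–141]. The friendship graph and function names are hypothetical.

```python
from itertools import combinations


def jaccard_score(graph, u, v):
    """Share of common neighbors among all neighbors of u and v."""
    neighbors_u, neighbors_v = graph[u], graph[v]
    union = neighbors_u | neighbors_v
    if not union:
        return 0.0
    return len(neighbors_u & neighbors_v) / len(union)


def predict_links(graph, top_k=1):
    """Return the top_k currently unconnected pairs with highest score."""
    candidates = []
    for u, v in combinations(sorted(graph), 2):
        if v not in graph[u]:
            candidates.append((jaccard_score(graph, u, v), u, v))
    candidates.sort(key=lambda c: c[0], reverse=True)
    return [(u, v) for _, u, v in candidates[:top_k]]


# Hypothetical undirected friendship graph: vertex -> set of neighbors.
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"bob", "cat"},
    "eve": set(),
}
predicted = predict_links(friends)
```

In this toy graph ann and dan share all their neighbors, so they form the top predicted link, whereas the isolated vertex eve scores zero against everyone.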

Many methods for community detection have been proposed and studied, most of which rely on topology-based objective functions designed to capture the notion of community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. The research on SNS evolution aims to find laws and derivation models that explain network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generative methods have been proposed to assist network and system design [147].

Social influence refers to the case where individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, time effects, and the characteristics of networks and individuals. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of content in SNS is considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, location, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise. For example, the blogosphere contains a large number of spam blogs, and Twitter likewise contains many trivial Tweets. Third, SNS are dynamic networks, which vary and are updated frequently and quickly. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge among different media [150].

6.2.6 Mobile data analysis

By April 2013, Android Apps had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since the research on mobile analysis has just started, we will only introduce some recent and representative analysis applications in this section.

With the growing number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities based on geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest Webchat). Traditional network communities or SNS communities lack online interaction among members, and the communities are active only when members are sitting in front of computers. On the contrary, mobile phones can support rich interaction at any time and anywhere. A mobile community is defined as a group of individuals with the same hobbies (e.g., health, safety, and entertainment) who gather together on networks, meet to set a common goal, decide on measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build body area networks for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations. In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop an application for smart phones, which analyzes people's paces when they walk and uses the pace information for unlocking the safety system [11]. Meanwhile, Robert Delano and Brian Parise from Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with a built-in seismograph in a mobile phone, so as to cope with Parkinson's disease and other nervous system diseases [11].

                                  63 Key applications of big data

                                  631 Application of big data in enterprises

                                  At present big data mainly comes from and is mainly usedin enterprises while BI and OLAP can be regarded as thepredecessors of big data application The application of bigdata in enterprises can enhance their production efficiencyand competitiveness in many aspects In particular on mar-keting with correlation analysis of big data enterprises canmore accurately predict the consumer behavior and findnew business modes On sales planning after comparisonof massive data enterprises can optimize their commodityprices On operation enterprises can improve their opera-tion efficiency and satisfaction optimize the labor forceaccurately forecast personnel allocation requirements avoidexcess production capacity and reduce labor cost On sup-ply chain using big data enterprises may conduct inventoryoptimization logistic optimization and supplier coordina-tion etc to mitigate the gap between supply and demandcontrol budgets and improve services

                                  In finance the application of big data in enterpriseshas been rapidly developed For example China MerchantsBank (CMB) utilizes data analysis to recognize that suchactivities as ldquoMulti-times score accumulationrdquo and ldquoscoreexchange in shopsrdquo are effective for attracting quality cus-tomers By building a customer drop out warning model thebank can sell high-yield financial products to the top 20 customers who are most likely to drop out so as to retainthem As a result the drop out ratios of customers with GoldCards and Sunflower Cards have been reduced by 15 and 7 respectively By analyzing customersrsquo transactionrecords potential small business customers can be effi-ciently identified By utilizing remote banking and the cloudreferral platform to implement cross-selling considerableperformance gains were achieved

The most classic application is arguably in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, along with, more importantly, the age, gender, address, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform through which merchants can be aware of the macroscopic industrial status of the platform, the market conditions of their brands, and consumers' behaviors, and make production and inventory decisions accordingly. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan service of Alibaba uses big data technology to automatically analyze acquired enterprise transaction data and judge whether to lend to an enterprise, with no manual intervention in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan with only about 0.3% bad loans, which is much lower than the ratio at other commercial banks.

6.3.2 Application of IoT based big data

IoT is not only an important source of big data, but also one of the main markets for big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises may have the most profound experience with the application of IoT big data. For example, UPS trucks are equipped with sensors, wireless adapters, and GPS, so that headquarters can track truck positions and prevent engine failures. The system also helps UPS supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project, a cooperation between Miami-Dade County in Florida and IBM, closely connects 35 types of key county government departments and the city of Miami, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The smart city application brings benefits in many respects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills this year by promptly identifying and fixing water pipes that were running and leaking.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and the connections among individuals, based on an information network. Big data of online SNS mainly comes from instant messages, online social networking, micro blogs, and shared spaces, which represent various user activities. The analysis of big data from online SNS uses computational analytical methods to understand relations in human society, drawing on theories and methods from mathematics, informatics, sociology, and management science, along three dimensions: network structure, group interaction, and information spreading. Applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following; they mainly mine and analyze content information and structural information to acquire value.

– Content-based applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preference, emotion, interest, and demand may be revealed.

– Structure-based applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies aggregate relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.
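One way to make "close internal relations but loose external relations" measurable is Newman's modularity. The survey does not name a specific measure, so modularity is used here as an assumed stand-in, and the small graph and the function name `modularity` are hypothetical. For an undirected, unweighted graph with m edges, the score of a partition is the sum over communities of (internal edges / m) minus (total degree / 2m) squared.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman modularity Q of a partition of an undirected, unweighted graph.

    edges:        list of (u, v) pairs, each undirected edge listed once.
    community_of: dict mapping every node to its community label.
    """
    m = len(edges)
    if m == 0:
        raise ValueError("graph has no edges")
    internal = defaultdict(int)  # edges with both ends inside the community
    degree = defaultdict(int)    # sum of node degrees per community
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Two triangles (a, b, c) and (d, e, f) joined by the single edge c-d.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]

two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in "abcdef"}

q_two = modularity(edges, two_groups)  # 5/14: the triangles are real communities
q_one = modularity(edges, one_group)   # 0: lumping everyone together says nothing
```

A community detection algorithm would search for the partition that maximizes this score; the sketch only evaluates partitions that are given to it.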

The Santa Cruz Police Department in the US experimented with applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

Fig. 5 Enabling technologies for online social network-oriented big data

In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted research that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans. The goal was to better understand public behavior and concerns. The project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting sharp growth or drops in the volume of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogs. As a classic example, the project discovered that the change of food price inflation in the official statistics of Indonesia matches the number of Tweets about the rice price on Twitter, as shown in Fig. 6.
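Two of the analyses described above, detecting a sharp change in topic volume and comparing tweet volume with an external indicator, can be sketched in a few lines. This is not the Global Pulse method, which the survey does not detail: the trailing-window z-score rule, the thresholds, the function names, and all numbers below are assumptions made for illustration.

```python
from math import sqrt
from statistics import mean, pstdev


def sharp_changes(counts, window=4, threshold=3.0):
    """Indices whose count deviates sharply from the preceding `window` values."""
    flagged = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu, sigma = mean(history), pstdev(history)
        deviation = abs(counts[i] - mu)
        # A perfectly flat history has sigma == 0; any change is then flagged.
        if (sigma == 0 and deviation > 0) or (sigma > 0 and deviation / sigma > threshold):
            flagged.append(i)
    return flagged


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical weekly tweet counts for one topic; week 5 is an obvious burst.
weekly_tweets = [100, 104, 98, 102, 101, 240, 99, 103]
spikes = sharp_changes(weekly_tweets)

# Hypothetical monthly series: tweets about a price and an official price index.
tweet_volume = [120, 150, 210, 260, 240, 300]
price_index = [2.1, 2.4, 3.0, 3.6, 3.3, 4.0]
r = pearson(tweet_volume, price_index)
```

A high correlation between the two series only shows that they move together; it does not by itself establish that one predicts or causes the other.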

Generally speaking, the application of big data from online SNS may help to better understand users' behavior and master the laws of social and economic activities from the following three aspects:

– Early warning: to rapidly cope with a crisis, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotion, and preference of users.

– Real-time feedback: to acquire groups' feedback on certain social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the health care business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients over three consecutive years. In addition, it summarized the final result into a highly personalized treatment plan to assess the risk factors and main treatment plans of patients. Doctors may then reduce morbidity by 50% in the next 10 years by prescribing statins and helping patients lose five pounds of weight, or by suggesting that patients reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20.

The Mount Sinai Medical Center in the US uses technologies from Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through the software development kit (SDK) and open interfaces.

Fig. 6 The correlation between Tweets about rice price and food price inflation

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue in mobile computing. In crowd sensing, a large number of general users use mobile devices as basic sensing units and coordinate through mobile networks to distribute sensing tasks and to collect and utilize sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need professional skills. Crowd sensing in the form of Crowdsourcing has been successfully applied to geotagged photographs, positioning and navigation, urban road traffic sensing, market forecasting, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach to problem solving, takes a large number of general users as its foundation and distributes tasks in a free and voluntary manner. In fact, Crowdsourcing was applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of Crowdsourcing. The main idea of Crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or would not accomplish alone. With no need to intentionally deploy sensing modules or employ professionals, Crowdsourcing can broaden the scope of a sensing system to the city scale and even larger scales.

In the big data era, Spatial Crowdsourcing has become a hot topic. Its operational framework is as follows. A user requests a service or resources related to a specified location. Mobile users who are willing to participate in the task then move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data is sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions they provide, it can be forecast that Spatial Crowdsourcing will become more prevalent than traditional Crowdsourcing, e.g., Amazon Turk and Crowdflower.
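The request-and-dispatch step of this framework can be sketched as a nearest-worker assignment. The survey does not prescribe an assignment rule, so the rule used here (pick the closest willing worker within a maximum travel distance), the record layout, and every name and coordinate are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def assign_worker(task, workers, max_km=5.0):
    """Return the ID of the closest willing worker within max_km, or None."""
    candidates = []
    for w in workers:
        if not w["willing"]:
            continue  # only volunteers take part in the task
        d = haversine_km(task["lat"], task["lon"], w["lat"], w["lon"])
        if d <= max_km:
            candidates.append((d, w["id"]))
    # min() compares distance first, then ID, so ties resolve deterministically.
    return min(candidates)[1] if candidates else None


# Hypothetical request for a photo at a given location.
task = {"lat": 30.0, "lon": 114.0, "data": "picture"}
workers = [
    {"id": "w1", "lat": 30.01, "lon": 114.00, "willing": True},    # about 1.1 km away
    {"id": "w2", "lat": 30.001, "lon": 114.00, "willing": False},  # closest, but declined
    {"id": "w3", "lat": 30.02, "lon": 114.02, "willing": True},    # about 2.9 km away
]
chosen = assign_worker(task, workers)
```

A real platform would also have to consider worker capacity, deadlines, and the reward for each task; the sketch covers only the spatial matching.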

6.3.6 Smart grid

Smart Grid is the next generation power grid, constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart Grid related big data are generated from various sources, such as (i) power utilization habits of users; (ii) phasor measurement data, measured by phasor measurement units (PMUs) deployed nationwide; (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI); (iv) energy market pricing and bidding data; and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as circuit breaker monitors and transformers). Smart Grid brings about the following challenges in exploiting big data:

– Grid planning: By analyzing data in the Smart Grid, regions with excessively high electrical load or high power outage frequencies can be identified. Even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance. For example, researchers from the University of California, Los Angeles designed an "electric map" according to big data theory, making a map of California by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to show the power consumption of every block at the moment. It can even compare the power consumption of a block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides an effective and visual load forecast for power grid planning in a city. Power grid facilities in blocks with high power outage frequencies and serious overloads, as displayed in the map, may be given priority for transformation.

– Interaction between power generation and power consumption: An ideal power grid should balance power generation and consumption. However, the traditional power grid is built on a one-directional approach of transmission-transformation-distribution-consumption, which does not allow the generation capacity to be adjusted according to the demand for power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters have been developed to improve power supply efficiency. TXU Energy has had several successful deployments of smart electric meters, which help the supplier read power utilization data every 15 min rather than every month as in the past. The labor cost of meter reading is greatly reduced. Because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and low periods of power consumption. TXU Energy used such price levels to stabilize the peak and low fluctuations of power consumption. In fact, the application of big data in the smart grid can help realize time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users.

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortages. Such energy resources can complement traditional hydropower and thermal power generation.
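The time-sharing dynamic pricing mentioned in the list above can be sketched with one day of 15-minute meter readings. The tariff used here (a peak window from 17:00 to 21:00 and the two rates) is hypothetical and is not TXU Energy's actual pricing; the function name and the load profiles are likewise assumptions.

```python
INTERVALS_PER_DAY = 96  # 24 hours x four 15-minute readings


def time_of_use_bill(readings_kwh, peak_hours=range(17, 21),
                     peak_rate=0.30, off_peak_rate=0.10):
    """Daily bill for 96 readings (kWh per 15-minute interval, starting at 00:00)."""
    if len(readings_kwh) != INTERVALS_PER_DAY:
        raise ValueError("expected one reading per 15-minute interval of the day")
    total = 0.0
    for i, kwh in enumerate(readings_kwh):
        hour = i // 4  # four readings per hour
        rate = peak_rate if hour in peak_hours else off_peak_rate
        total += kwh * rate
    return total


# Flat profile: 0.25 kWh in every interval, 24 kWh over the day.
flat = [0.25] * INTERVALS_PER_DAY
flat_bill = time_of_use_bill(flat)

# Same 24 kWh, but 2 kWh moved out of the peak window into the early morning.
shifted = list(flat)
for i in range(17 * 4, 21 * 4):
    shifted[i] = 0.125
for i in range(0, 16):
    shifted[i] = 0.375
shifted_bill = time_of_use_bill(shifted)
```

With these assumed rates, the user who shifts load away from the peak pays less for the same energy, while the supplier sees a flatter demand curve, which is the win-win situation the text refers to.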

7 Conclusion, open issues, and outlook

In this paper, we review the background and state-of-the-art of big data. Firstly, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. Finally, we review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture to readers of this exciting area.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but current research is still at an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, which are discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined, and the existing models are not strictly verified.

– Standardization of big data: An evaluation system for data quality and an evaluation standard/benchmark for data computing efficiency should be developed. Many solutions for big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to assess the computing efficiency of big data with rigorous mathematical methods. Performance can only be evaluated once the system is implemented and deployed, which makes it impossible to compare the advantages and disadvantages of alternative solutions side by side, even before and after the implementation of big data. In addition, since data quality is an important basis for data preprocessing, simplification, and screening, effectively and rigorously evaluating data quality is also an urgent problem.

– Evolution of big data computing modes: This includes the memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor that restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.

– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor in improving big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems in big data processing arise from traditional data analysis, including (i) data re-utilization: with the increase of data scale, more value may be mined from re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized, from which more value can be mined; and (iii) data exhaust, which means wrong data generated during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved.

– Big data management: The emergence of big data brings new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management for mobile and pervasive computing, data management for SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning.

– Integration and provenance of big data: As discussed, the value acquired from comprehensive utilization of multiple datasets is far higher than the sum of the values of individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information that features different standards and comes from different datasets.

– Big data application: At present, the application of big data is just beginning, and we should explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication; big data applications in small and medium-sized businesses; big data applications in government departments; big data services; and human-computer interaction of big data are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume grows fast, there are more severe safety risks, while traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security-related challenges.

– Big data privacy: Big data privacy includes two aspects. (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties of users may be more easily acquired, and users may not be aware of it. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed a big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher at Skull Security, used an information acquisition tool to acquire data from the public pages of Facebook users who had failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low quality data wastes transmission and storage resources and has poor usability. Many factors may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though many measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair damaged data need to be investigated.

– Big data safety mechanism: Big data brings challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods designed for small and medium-scale data cannot meet the demands of big data, so efficient big data cryptography approaches should be developed. Effective schemes for safety management, access control, and safe communications should be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data application in information security: Big data not only brings challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) by analyzing big data in the form of log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, and attack characteristics may also be more easily identified through the analysis of big data.
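Three of the data quality dimensions named in the list above (completeness, redundancy, and consistency) can be detected automatically with very little code. The definitions below are one possible reading and are not taken from the survey; the function `quality_report`, the record layout, and the age rule are hypothetical. Accuracy is left out because it needs trusted reference data to compare against.

```python
def quality_report(records, fields, rules=()):
    """Simple data quality indicators, each in the range 0..1.

    completeness: share of expected field values that are present (not None)
    redundancy:   share of records that duplicate another record
    consistency:  share of records that satisfy every rule in `rules`
    """
    n = len(records)
    if n == 0 or not fields:
        raise ValueError("need at least one record and one field")
    filled = sum(1 for r in records for f in fields if r.get(f) is not None)
    distinct = {tuple(r.get(f) for f in fields) for r in records}
    consistent = sum(1 for r in records if all(rule(r) for rule in rules))
    return {
        "completeness": filled / (n * len(fields)),
        "redundancy": 1 - len(distinct) / n,
        "consistency": consistent / n,
    }


def plausible_age(record):
    """Hypothetical consistency rule: an age must be present and between 0 and 120."""
    age = record.get("age")
    return age is not None and 0 <= age <= 120


# Hypothetical records: one implausible age, one missing city, one exact duplicate.
records = [
    {"id": 1, "age": 34, "city": "Wuhan"},
    {"id": 2, "age": -5, "city": "Beijing"},
    {"id": 3, "age": 51, "city": None},
    {"id": 1, "age": 34, "city": "Wuhan"},
]
report = quality_report(records, ["id", "age", "city"], rules=[plausible_age])
```

Indicators like these only detect problems; repairing the damaged data, which the text also calls for, is a separate and harder step.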

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy efficiency optimization, and hardware and software system architectures for processing. In particular, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the "T" (Technology) was the main concern, while technology drove the development of data. In the big data era, with the prominence of data value and advances in "I" (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact, but also influence everyone's ways of living and thinking, which is already happening. We cannot predict the future, but we may take precautions for possible events to occur in the future.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have sought better ways to cope with larger-scale, more diverse, and more complexly structured data. These efforts are represented by Google's globally-distributed database Spanner and the fault-tolerant, expandable, distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to those of relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, but also forces the cross fusion of many disciplines. The development of big data should explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security. Then, the impacts of big data on production management, business operation, and decision making should be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a friendly manner can they be effectively utilized by users. Reports, histograms, pie charts, and regression curves are frequently used to visualize the results of data analysis. New presentation forms will emerge in the future; e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, the role of data has become increasingly significant. In the small-scale data era, in which logic is more complex than data, program design is mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed that of business logic. Consequently, programs are gradually transforming from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods will emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

                                  – Big data triggers the revolution of thinking. Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

                                  – During data analysis, we will try to utilize all data rather than analyzing only a small set of sample data.

                                  – We will accept numerous and complicated data rather than insisting on accurate data.

                                  – We shall pay greater attention to correlations between things rather than exploring causal relationships.

                                  – Simple algorithms applied to big data are more effective than complex algorithms applied to small data.

                                  – Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace “experts”.

                                  Throughout the history of human society, the demands and willingness of human beings have always been the driving forces of scientific and technological progress. Big data may provide reference answers for human decision making through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extensible and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities and based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in the future.

                                  Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao’s research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

                                  References

                                  1. Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

                                  2. Fact sheet: Big data across the federal government (2012). http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf

                                  3. Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper

                                  4. Drowning in numbers - digital data will flood the planet - and help us understand it better (2011). http://www.economist.com/blogs/dailychart/2011/11/bigdata-0

                                  5. Lohr S (2012) The age of big data. New York Times, pp 11

                                  6. Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/thedigitalbreadcrumbs-that-lead-to-big-data

                                  7. Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-searchforanalysts-to-make-sense-of-big-data

                                  8. Big data (2008). http://www.nature.com/news/specials/bigdata/index.html

                                  9. Special online collection: dealing with big data (2011). http://www.sciencemag.org/site/special/data

                                  10. Manyika J, McKinsey Global Institute, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute

                                  11. Mayer-Schönberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Eamon Dolan/Houghton Mifflin Harcourt

                                  12. Laney D (2001) 3-d data management: controlling data volume, velocity and variety. META Group Research Note, 6 February

                                  13. Zikopoulos P, Eaton C et al (2011) Understanding big data: analytics for enterprise class hadoop and streaming data. McGraw-Hill Osborne Media

                                  14. Meijer E (2011) The world according to linq. Communications of the ACM 54(10):45–51

                                  15. Beyer M (2011) Gartner says solving big data challenge involves more than just managing volumes of data. Gartner. http://www.gartner.com/it/page.jsp

                                  16. O. R. Team (2011) Big data now: current perspectives from O’Reilly Radar. O’Reilly Media

                                  17. Grobelnik M (2012) Big data tutorial. http://videolectures.net/eswc2012grobelnikbigdata

                                  18. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2008) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014

                                  19. DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6):85–98

                                  20. Walter T (2009) Teradata past, present, and future. UCI ISG lecture series on scalable data management

                                  21. Ghemawat S, Gobioff H, Leung S-T (2003) The google file system. In: ACM SIGOPS Operating Systems Review, vol 37. ACM, pp 29–43

                                  22. Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

                                  23. Hey AJG, Tansley S, Tolle KM et al (2009) The fourth paradigm: data-intensive scientific discovery

                                  24. Howard JH, Kazar ML, Menees SG, Nichols DA, Satyanarayanan M, Sidebotham RN, West MJ (1988) Scale and performance in a distributed file system. ACM Trans Comput Syst (TOCS) 6(1):51–81

                                  25. Cattell R (2011) Scalable sql and nosql data stores. ACM SIGMOD Record 39(4):12–27

                                  26. Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033

                                  27. Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98

                                  28. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J et al (2012) Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States

                                  29. Sun Y, Chen M, Liu B, Mao S (2013) Far: a fault-avoidant routing method for data center networks with regular topology. In: Proceedings of ACM/IEEE symposium on architectures for networking and communications systems (ANCS’13). ACM

                                  30. Wiki (2013) Applications and organizations using hadoop. http://wiki.apache.org/hadoop/PoweredBy

                                  31. Bahga A, Madisetti VK (2012) Analyzing massive machine maintenance data in a computing cloud. IEEE Transac Parallel Distrib Syst 23(10):1831–1843

                                  32. Gunarathne T, Wu T-L, Choi JY, Bae S-H, Qiu J (2011) Cloud computing paradigms for pleasingly parallel biomedical applications. Concurr Comput Prac Experience 23(17):2338–2354

                                  33. Gantz J, Reinsel D (2010) The digital universe decade-are you ready. External publication of IDC (Analyse the Future) information and data, pp 1–16

                                  34. Bryant RE (2011) Data-intensive scalable computing for scientific applications. Comput Sci Eng 13(6):25–33

                                  35. Wahab MHA, Mohd MNH, Hanafi HF, Mohsin MFM (2008) Data pre-processing on web server logs for generalized association rules mining algorithm. World Acad Sci Eng Technol 48:2008

                                  36. Nanopoulos A, Manolopoulos Y, Zakrzewicz M, Morzy T (2002) Indexing web access-logs for pattern queries. In: Proceedings of the 4th international workshop on web information and data management. ACM, pp 63–68

                                  37. Joshi KP, Joshi A, Yesha Y (2003) On using a warehouse to analyze web logs. Distrib Parallel Databases 13(2):161–180

                                  38. Chandramohan V, Christensen K (2002) A first look at wired sensor networks for video surveillance systems. In: Proceedings LCN 2002, 27th annual IEEE conference on local computer networks. IEEE, pp 728–729

                                  39. Selavo L, Wood A, Cao Q, Sookoor T, Liu H, Srinivasan A, Wu Y, Kang W, Stankovic J, Young D et al (2007) Luster: wireless sensor network for environmental research. In: Proceedings of the 5th international conference on Embedded networked sensor systems. ACM, pp 103–116

                                  40. Barrenetxea G, Ingelrest F, Schaefer G, Vetterli M, Couach O, Parlange M (2008) Sensorscope: out-of-the-box environmental monitoring. In: Information processing in sensor networks, 2008, international conference on IPSN’08. IEEE, pp 332–343

                                  41. Kim Y, Schmid T, Charbiwala ZM, Friedman J, Srivastava MB (2008) Nawms: nonintrusive autonomous water monitoring system. In: Proceedings of the 6th ACM conference on Embedded network sensor systems. ACM, pp 309–322

                                  42. Kim S, Pakzad S, Culler D, Demmel J, Fenves G, Glaser S, Turon M (2007) Health monitoring of civil infrastructures using wireless sensor networks. In: Information Processing in Sensor Networks, 2007, 6th International Symposium on IPSN 2007. IEEE, pp 254–263

                                  43. Ceriotti M, Mottola L, Picco GP, Murphy AL, Guna S, Corra M, Pozzi M, Zonta D, Zanon P (2009) Monitoring heritage buildings with wireless sensor networks: the torre aquila deployment. In: Proceedings of the 2009 International Conference on Information Processing in Sensor Networks. IEEE Computer Society, pp 277–288

                                  44. Tolle G, Polastre J, Szewczyk R, Culler D, Turner N, Tu K, Burgess S, Dawson T, Buonadonna P, Gay D et al (2005) A macroscope in the redwoods. In: Proceedings of the 3rd international conference on embedded networked sensor systems. ACM, pp 51–63

                                  45. Wang F, Liu J (2011) Networked wireless sensor data collection: issues, challenges, and approaches. IEEE Commun Surv Tutor 13(4):673–687

                                  46. Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the 11th international conference on World Wide Web. ACM, pp 124–135

                                  47. Choudhary S, Dincturk ME, Mirtaheri SM, Moosavi A, von Bochmann G, Jourdan G-V, Onut I-V (2012) Crawling rich internet applications: the state of the art. In: CASCON, pp 146–160

                                  48. Ghani N, Dixit S, Wang T-S (2000) On ip-over-wdm integration. IEEE Commun Mag 38(3):72–84

                                  49. Manchester J, Anderson J, Doshi B, Dravida S (1998) Ip over sonet. IEEE Commun Mag 36(5):136–142

                                  50. Jinno M, Takara H, Kozicki B (2009) Dynamic optical mesh networks: drivers, challenges and solutions for the future. In: Optical communication, 2009, 35th European conference on ECOC’09. IEEE, pp 1–4

                                  51. Barroso LA, Hölzle U (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synt Lect Comput Archit 4(1):1–108

                                  52. Armstrong J (2009) Ofdm for optical communications. J Light Technol 27(3):189–204

                                  53. Shieh W (2011) Ofdm for flexible high-speed optical networks. J Light Technol 29(10):1560–1577

                                  54. Cisco data center interconnect design and deployment guide (2010)

                                  55. Greenberg A, Hamilton JR, Jain N, Kandula S, Kim C, Lahiri P, Maltz DA, Patel P, Sengupta S (2009) Vl2: a scalable and flexible data center network. In: ACM SIGCOMM computer communication review, vol 39. ACM, pp 51–62

                                  56. Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S (2009) Bcube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 39(4):63–74

                                  57. Farrington N, Porter G, Radhakrishnan S, Bazzaz HH, Subramanya V, Fainman Y, Papen G, Vahdat A (2011) Helios: a hybrid electrical/optical switch architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 41(4):339–350

                                  58. Abu-Libdeh H, Costa P, Rowstron A, O’Shea G, Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Comput Commun Rev 40(4):51–62

                                  59. Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what’s needed for datacenter network operations. IEEE Commun Mag 48(7):32–39

                                  60. Wang G, Andersen DG, Kaminsky M, Papagiannaki K, Ng TS, Kozuch M, Ryan M (2010) c-through: Part-time optics in data centers. In: ACM SIGCOMM Computer Communication Review, vol 40. ACM, pp 327–338

                                  61. Ye X, Yin Y, Yoo SJB, Mejia P, Proietti R, Akella V (2010) Dos: a scalable optical switch for datacenters. In: Proceedings of the 6th ACM/IEEE symposium on architectures for networking and communications systems. ACM, p 24

                                  62. Singla A, Singh A, Ramachandran K, Xu L, Zhang Y (2010) Proteus: a topology malleable data center network. In: Proceedings of the 9th ACM SIGCOMM workshop on hot topics in networks. ACM, p 8

                                  63. Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P (2011) Energy-efficient design of a scalable optical multiplane interconnection architecture. IEEE J Sel Top Quantum Electron 17(2):377–383

                                  64. Kodi AK, Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance computing (hpc) systems. IEEE J Sel Top Quantum Electron 17(2):384–395

                                  65. Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H (2012) Mirror mirror on the ceiling: flexible wireless links for data centers. ACM SIGCOMM Comput Commun Rev 42(4):443–454

                                  66. Lenzerini M (2002) Data integration: a theoretical perspective. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. ACM, pp 233–246

                                  67. Cafarella MJ, Halevy A, Khoussainova N (2009) Data integration for the relational web. Proc VLDB Endowment 2(1):1090–1101

                                  68. Maletic JI, Marcus A (2000) Data cleansing: beyond integrity analysis. In: IQ. Citeseer, pp 200–209

                                  69. Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1-2):83–113

                                  70. Chen H, Ku W-S, Wang H, Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 51–62

                                  71. Zhao Z, Ng W (2012) A model-based approach for rfid data stream cleansing. In: Proceedings of the 21st ACM international conference on information and knowledge management. ACM, pp 862–871

                                  72. Khoussainova N, Balazinska M, Suciu D (2008) Probabilistic event extraction from rfid data. In: Data Engineering, 2008, IEEE 24th international conference on ICDE 2008. IEEE, pp 1480–1482

                                  73. Herbert KG, Wang JTL (2007) Biological data cleaning: a case study. Int J Inf Qual 1(1):60–82

                                  74. Tsai T-H, Lin C-Y (2012) Exploring contextual redundancy in improving object-based video coding for video sensor networks surveillance. IEEE Transac Multmed 14(3):669–682

                                  75. Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 269–278

                                  76. Kamath U, Compton J, Dogan RI, Jong KD, Shehu A (2012) An evolutionary algorithm approach for feature generation from sequence data and its application to dna splice site prediction. IEEE/ACM Transac Comput Biol Bioinforma (TCBB) 9(5):1387–1398

                                  77. Leung K-S, Lee KH, Wang J-F, Ng EYT, Chan HLY, Tsui SKW, Mok TSK, Tse PC-H, Sung JJ-Y (2011) Data mining on dna sequences of hepatitis b virus. IEEE/ACM Transac Comput Biol Bioinforma 8(2):428–440

                                  78. Huang Z, Shen H, Liu J, Zhou X (2011) Effective data co-reduction for multimedia similarity search. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, pp 1021–1032

                                  79. Bleiholder J, Naumann F (2008) Data fusion. ACM Comput Surv (CSUR) 41(1):1

                                  80. Brewer EA (2000) Towards robust distributed systems. In: PODC, p 7

                                  81. Gilbert S, Lynch N (2002) Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59

                                  82. McKusick MK, Quinlan S (2009) Gfs: evolution on fast-forward. ACM Queue 7(7):10

                                  83. Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J (2008) Scope: easy and efficient parallel processing of massive data sets. Proc VLDB Endowment 1(2):1265–1276

                                  84. Beaver D, Kumar S, Li HC, Sobel J, Vajgel P et al (2010) Finding a needle in haystack: facebook’s photo storage. In: OSDI, vol 10, pp 1–8

                                  85. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: amazon’s highly available key-value store. In: SOSP, vol 7, pp 205–220

                                  86. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In: Proceedings of the twenty-ninth annual ACM symposium on theory of computing. ACM, pp 654–663

                                  87. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2):4

                                  88. Burrows M (2006) The chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th symposium on Operating systems design and implementation. USENIX Association, pp 335–350

                                  89. Lakshman A, Malik P (2009) Cassandra: structured storage system on a p2p network. In: Proceedings of the 28th ACM symposium on principles of distributed computing. ACM, pp 5–5

                                  90. George L (2011) HBase: the definitive guide. O’Reilly Media Inc

                                  91. Judd D (2008) hypertable-0.9.0.4-alpha

                                  92. Chodorow K (2013) MongoDB: the definitive guide. O’Reilly Media Inc

                                  93. Crockford D (2006) The application/json media type for javascript object notation (json)

                                  94. Murty J (2009) Programming amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O’Reilly Media Inc

                                  95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O’Reilly Media Inc

                                  96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 975–986

                                  97. Yang H-C, Parker DS (2009) Traverse: simplified indexing on large map-reduce-merge clusters. In: Database systems for advanced applications. Springer, pp 308–322

                                  98. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with sawzall. Sci Program 13(4):277–298

                                  99. Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U (2009) Building a high-level dataflow system on top of map-reduce: the pig experience. Proceedings VLDB Endowment 2(2):1414–1425

                                  100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629

                                  101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72

                                  102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) Dryadlinq: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14

                                  103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: Parallel and distributed processing, 2008, IEEE international symposium on IPDPS 2008. IEEE, pp 1–11

                                  104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146

                                  105. Bu Y, Bill H, Balazinska M, Ernst MD (2010) Haloop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296

                                  106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative mapreduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing. ACM, pp 810–818

                                  107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2

                                  108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: mapreduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing. ACM, p 7

                                  109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) Ciel: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on Networked systems design and implementation, p 9

                                  110. Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York

                                  111. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

                                  112. What analytics, data mining, big data software you used in the past 12 months for a real project (2012). http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html

                                  113. Berthold MR, Cebron N, Dill F, Gabriel TR, Kötter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer

                                  114. Sallam RL, Richardson J, Hagerty J, Hostmann B (2011) Magic quadrant for business intelligence platforms. CT, Gartner Group, Stamford

                                  115. Beyond the PC. Special Report on Personal Technology (2011)

                                  116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A et al (2011) The iplant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

                                  117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on Software quality assurance. ACM, pp 70–77

                                  118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers. ACM, pp 277–286

                                  119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Record 34(2):18–26

                                  120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM Sigmod Record 33(1):50–57

                                  121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Transac Manag Inform Syst (TMIS) 3(2):7

                                  122. Manning CD, Schütze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press

                                  123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Transac Neural Netw 13(5):1163–1177

                                  124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11

                                  125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117

                                  126. Konopnicki D, Shmueli O (1995) W3qs: a query system for the world-wide web. In: VLDB, vol 95, pp 54–65

                                  127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640

                                  128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval. ACM, pp 2

                                  129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25

                                  130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19

                                  131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819

                                  132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939

                                  133. Barragáns-Martínez AB, Costa-Montenegro E, Burguillo JC, Rey-López M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend tv programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311

                                  134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91

                                  135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia. ACM, pp 469–478

                                  136. Hirsch JE (2005) An index to quantify an individual’s scientific research output. Proc Natl Acad Sci USA 102(46):16569

                                  137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company

                                  138. Aggarwal CC (2011) An introduction to social network data analytics. Springer

                                  139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1046–1054

                                  140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied Computing. ACM, pp 1115–1116

                                  141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Transac Knowl Discov Data (TKDD) 5(2):10

                                  142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on World wide web. ACM, pp 631–640

                                  143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis. ACM, pp 16–25

                                  144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. ACM, pp 315–321

                                  145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 145–158

                                  146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 131–144

                                  147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1007–1016

                                  148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 807–816

                                  149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on Web search and data mining. ACM, pp 657–666

                                  150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360

                                  151 Cisco Visual Networking Index (2013) Global mobile data traf-fic forecast update 2012ndash2017 httpwwwciscocomenUSsolutionscollateralns341ns525ns537ns705ns827white paperc11-520862html (Son erisim 5 Mayıs 2013)

                                  152 Rhee Y Lee J (2009) On modeling a model of mobile com-munity designing user interfaces to support group interactionInteractions 16(6)46ndash51

                                  153 Han J Lee J-G Gonzalez H Li X (2008) Mining massive rfidtrajectory and traffic data sets In Proceedings of the 14th ACMSIGKDD international conference on knowledge discovery anddata mining ACM p 2

                                  154 Garg MK Kim D-J Turaga DS Prabhakaran B (2010) Mul-timodal analysis of body sensor network data streams forreal-time healthcare In Proceedings of the international con-ference on multimedia information retrieval ACM pp 469ndash478

                                  155 Park Y Ghosh J (2012) A probabilistic imputation frameworkfor predictive analysis using variably aggregated multi-sourcehealthcare data In Proceedings of the 2nd ACM SIGHITinternational health informatics symposium ACM pp 445ndash454

                                  156 Tasevski P (2011) Password attacks and generation strate-gies Tartu University Faculty of Mathematics and ComputerSciences

Mobile Netw Appl (2014) 19:171–209

• Big Data: A Survey
  • Abstract
  • Background
    • Dawn of big data era
    • Definition and features of big data
    • Big data value
    • The development of big data
    • Challenges of big data
  • Related technologies
    • Relationship between cloud computing and big data
    • Relationship between IoT and big data
    • Data center
    • Relationship between hadoop and big data
  • Big data generation and acquisition
    • Data generation
      • Enterprise data
      • IoT data
      • Bio-medical data
      • Data generation from other fields
    • Big data acquisition
      • Data collection
      • Data transportation
      • Data pre-processing
  • Big data storage
    • Storage system for massive data
    • Distributed storage system
    • Storage mechanism for big data
      • Database technology
  • Big data analysis
    • Traditional data analysis
    • Big data analytic methods
    • Architecture for big data analysis
      • Real-time vs offline analysis
      • Analysis at different levels
      • Analysis with different complexity
    • Tools for big data mining and analysis
  • Big data applications
    • Key applications of big data
      • Application evolutions
      • Structured data analysis
      • Text data analysis
      • Web data analysis
      • Multimedia data analysis
      • Network data analysis
      • Mobile data analysis
    • Key applications of big data
      • Application of big data in enterprises
      • Application of IoT based big data
      • Application of online social network-oriented big data
      • Applications of healthcare and medical big data
      • Collective intelligence
      • Smart grid
  • Conclusion, open issues, and outlook
    • Open issues
      • Theoretical research
      • Technology development
      • Practical implications
      • Data security
    • Outlook
  • Acknowledgments
  • References

and it provides a mapping between persistent, sequenced, and immutable keys and values, both of which are arbitrary byte strings. BigTable uses Chubby for the following tasks: 1) ensuring that there is at most one active Master copy at any time; 2) storing the bootstrap location of BigTable data; 3) looking up Tablet servers; 4) conducting error recovery in case of Tablet server failures; 5) storing BigTable schema information; and 6) storing the access control table.

– Cassandra: Cassandra is a distributed storage system for managing huge amounts of structured data spread across many commodity servers [89]. The system was developed by Facebook and became an open source tool in 2008. It adopts the ideas and concepts of both Amazon Dynamo and Google BigTable, in particular combining the distributed system technology of Dynamo with the BigTable data model. Tables in Cassandra take the form of a distributed four-dimensional structured map, where the four dimensions are row, column, column family, and super column. A row is identified by a string key of arbitrary length. No matter how many columns are read or written, an operation on a row is atomic. Columns may be grouped into clusters, called column families, which are similar to those in the BigTable data model. Cassandra provides two kinds of column families: column families and super columns. A super column contains an arbitrary number of columns associated with the same name. A column family contains columns and super columns, which may be added to the column family continuously at runtime. The partitioning and replication mechanisms of Cassandra are very similar to those of Dynamo, so as to achieve consistency.

– Derivative tools of BigTable: Since the BigTable code is not available under an open source license, several open source projects compete to implement the BigTable concept in similar systems, such as HBase and Hypertable.

HBase is a BigTable clone written in Java and is part of Apache's Hadoop MapReduce framework [90]. HBase replaces GFS with HDFS. It writes updates into RAM and periodically flushes them to files on disk. Row operations are atomic, with row-level locking and transaction processing; support for transactions of larger scope is optional. Partitioning and distribution are handled transparently, with no need for client-side hashing or a fixed keyspace.

HyperTable was developed along the lines of BigTable to provide a set of high-performance, scalable, distributed storage and processing systems for structured and unstructured data [91]. HyperTable relies on a distributed file system, e.g., HDFS, and a distributed lock manager. Its data representation, processing, and partitioning mechanisms are similar to those of BigTable. HyperTable has its own query language, called HyperTable query language (HQL), which allows users to create, modify, and query the underlying tables.

Since the column-oriented storage databases mainly emulate BigTable, their designs are all similar, except for the concurrency mechanism and several other features. For example, Cassandra emphasizes weak consistency based on multi-version concurrency control, while HBase and HyperTable focus on strong consistency through locks or log records.

– Document Database: Compared with key-value storage, document storage can support more complex data forms. Since documents do not follow a strict schema, there is no need for schema migration. In addition, key-value pairs can still be saved. We will examine three important representatives of document storage systems, i.e., MongoDB, SimpleDB, and CouchDB.

– MongoDB: MongoDB is an open-source, document-oriented database [92]. MongoDB stores documents as Binary JSON (BSON) objects [93], which are similar to objects. Every document has an ID field as the primary key. Queries in MongoDB are expressed in a JSON-like syntax. A database driver sends the query as a BSON object to MongoDB. The system allows queries on all documents, including embedded objects and arrays. To enable rapid queries, indexes can be created on the queryable fields of documents. Replication in MongoDB is carried out through log files on the master nodes, which record all the high-level operations conducted in the database. During replication, the slaves query the master for all write operations since the last synchronization and execute the operations from the log files in their local databases. MongoDB supports horizontal scaling with automatic sharding to distribute data among thousands of nodes, with automatic load balancing and failover.

– SimpleDB: SimpleDB is a distributed database offered as an Amazon web service [94]. Data in SimpleDB is organized into domains, in which data may be stored, acquired, and queried. Domains contain different properties and sets of name/value pairs for items. Data is replicated to different machines at different data centers in order to ensure data safety and improve performance. The system does not support automatic partitioning and thus cannot scale as the data volume changes. SimpleDB allows users to query with SQL. It is worth noting that SimpleDB provides eventual consistency but does not support Multi-Version Concurrency Control (MVCC). Therefore, conflicts cannot be detected from the client side.

– CouchDB: Apache CouchDB is a document-oriented database written in Erlang [95]. Data in CouchDB is organized into documents consisting of fields named by keys/names and values, which are stored and accessed as JSON objects. Every document has a unique identifier. CouchDB provides access to database documents through a RESTful HTTP API. To modify a document, the client must download the entire document, modify it, and then send it back to the database. Each time a document is rewritten, its revision identifier is updated. CouchDB relies on optimistic replication to obtain scalability without a sharding mechanism. Since multiple CouchDB instances may run concurrently with other transactions, any kind of replication topology can be built. The consistency of CouchDB relies on the replication mechanism. CouchDB supports MVCC with historical hash records.

Big data are generally stored on hundreds or even thousands of commodity servers. Thus, traditional parallel models, such as Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), may not be adequate to support such large-scale parallel programs. Recently, several parallel programming models have been proposed that effectively improve the performance of NoSQL and reduce the performance gap to relational databases. These models have therefore become the cornerstone for the analysis of massive data.

– MapReduce: MapReduce [22] is a simple but powerful programming model for large-scale computing that uses large clusters of commodity PCs to achieve automatic parallel processing and distribution. The MapReduce computing model has only two functions, i.e., Map and Reduce, both of which are programmed by users. The Map function processes input key-value pairs and generates intermediate key-value pairs. MapReduce then groups all the intermediate values associated with the same key and passes them to the Reduce function, which further compresses the value set into a smaller set. The advantage of MapReduce is that it hides the complicated steps of developing parallel applications, e.g., data scheduling, fault tolerance, and inter-node communication. The user only needs to program the two functions to develop a parallel application. The initial MapReduce framework did not support multiple datasets in a task, a limitation that has been mitigated by some recent enhancements [96, 97].

Over the past decades, programmers have become familiar with the high-level declarative language SQL, often used in relational databases for task description and dataset analysis. However, the succinct MapReduce framework provides only two opaque functions, which cannot cover all the common operations. Programmers therefore have to spend time implementing basic functions, which are typically hard to maintain and reuse. To improve programming efficiency, some high-level language systems have been proposed, e.g., Sawzall [98] of Google, Pig Latin [99] of Yahoo, Hive [100] of Facebook, and Scope [87] of Microsoft.

– Dryad: Dryad [101] is a general-purpose distributed execution engine for coarse-grained data-parallel applications. The operational structure of Dryad is a directed acyclic graph, in which vertexes represent programs and edges represent data channels. Dryad executes operations on the vertexes in clusters and transmits data via data channels, including files, TCP connections, and shared-memory FIFOs. During operation, resources in a logical operation graph are automatically mapped to physical resources.

The operational structure of Dryad is coordinated by a central program called the job manager, which can run in the cluster or on a workstation connected through the network. A job manager consists of two parts: 1) application code, which is used to build a job communication graph, and 2) library code, which is used to schedule available resources. All data are transmitted directly between vertexes. The job manager is therefore only responsible for decision-making and does not obstruct any data transmission.

In Dryad, application developers can flexibly choose any directed acyclic graph to describe the communication patterns of the application and express the data transmission mechanisms. In addition, Dryad allows vertexes to use any number of inputs and outputs, while MapReduce supports only one input set and one output set.

DryadLINQ [102] is the high-level language of Dryad and is used to integrate the aforementioned SQL-like language execution environment.

– All-Pairs: All-Pairs [103] is a system specially designed for biometrics, bio-informatics, and data mining applications. It focuses on comparing element pairs in two datasets by a given function. An All-Pairs problem can be expressed as a three-tuple (Set A, Set B, and Function F), in which Function F is used to compare all elements in Set A and Set B. The comparison result is an output matrix M, which is also called the Cartesian product or cross join of Set A and Set B.

All-Pairs is implemented in four phases: system modeling, distribution of input data, batch job management, and result collection. In Phase I, an approximate model of system performance is built to estimate how much CPU resource is needed and how to partition the job. In Phase II, a spanning tree is built for data transmission, so that the workload of every partition retrieves its input data efficiently. In Phase III, after the data flow is delivered to the proper nodes, the All-Pairs engine builds a batch-processing submission for the jobs in the partitions, sequences them in the batch processing system, and formulates a node running command to acquire data. In the last phase, after the batch processing system completes the jobs, the extraction engine collects the results and combines them in a proper structure, which is generally a single file list in which all results are put in order.

– Pregel: The Pregel [104] system of Google facilitates the processing of large-sized graphs, e.g., analysis of network graphs and social networking services. A computational task is expressed by a directed graph consisting of vertexes and directed edges. Every vertex is associated with a modifiable, user-defined value, and every directed edge, which is associated with its source vertex, consists of a user-defined value and the identifier of a target vertex. Once the graph is built, the program conducts iterative calculations, called supersteps, separated by global synchronization points, until the algorithm completes and the output is produced. In every superstep, vertex computations run in parallel, and every vertex executes the same user-defined function, which expresses the logic of a given algorithm. Every vertex may modify its own state and that of its outgoing edges, receive messages sent in the previous superstep, send messages to other vertexes, and even modify the topological structure of the entire graph. Edges have no computations of their own. Every vertex may deactivate itself by voting to halt. When all vertexes are inactive and there are no messages in transit, the entire program execution is completed. The output of a Pregel program is the set of values output by all the vertexes. Generally speaking, the input and output of a Pregel program are isomorphic directed graphs.
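The superstep mechanism of Pregel can be illustrated with a minimal single-machine sketch. It is not Google's implementation: the function name `pregel_max`, the dictionary-based graph, and the choice of algorithm (propagating the maximum vertex value) are assumptions made only for illustration. Every vertex runs the same logic, halts after each superstep, and is reactivated only by an incoming message.

```python
def pregel_max(values, edges, max_supersteps=50):
    """Propagate the largest vertex value along directed edges.

    values: dict mapping vertex -> initial value
    edges:  dict mapping vertex -> list of target vertexes
    (hypothetical toy interface, single process, no real parallelism)
    """
    values = dict(values)
    inbox = {v: [] for v in values}   # messages sent in the previous superstep
    active = set(values)              # every vertex is active in superstep 0
    superstep = 0
    while active and superstep < max_supersteps:
        outbox = {v: [] for v in values}
        for v in active:
            # The same user-defined logic runs on every active vertex.
            new_value = max([values[v]] + inbox[v])
            changed = new_value > values[v]
            values[v] = new_value
            if superstep == 0 or changed:
                for target in edges.get(v, []):
                    outbox[target].append(new_value)
            # The vertex now votes to halt; only a message reactivates it.
        # Global synchronization point: messages become visible next superstep.
        inbox = outbox
        active = {v for v, msgs in inbox.items() if msgs}
        superstep += 1
    return values


graph_values = {"a": 3, "b": 6, "c": 2, "d": 1}
graph_edges = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
result = pregel_max(graph_values, graph_edges)
```

The loop ends when no vertex is active and no message is in transit, which mirrors the termination condition described above.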

Inspired by the above programming models, other research has also focused on programming models for more complex computational tasks, e.g., iterative computations [105, 106], fault-tolerant in-memory computations [107], incremental computations [108], and data-dependent flow control decision-making [109].
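Among the models above, the division of labor between Map and Reduce in MapReduce is the easiest to show in a few lines. The following is a single-process sketch under stated assumptions: `run_mapreduce`, `map_fn`, and `reduce_fn` are hypothetical names, word counting is chosen only as a familiar example, and the scheduling, distribution, and fault tolerance that a real framework provides are omitted.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: emit an intermediate (word, 1) pair for every word in a line."""
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """Reduce: compress all values of one key into a single count."""
    return word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Toy driver: apply the mapper, group values by key, apply the reducer."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)   # group by intermediate key
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))


lines = [(0, "big data needs big clusters"), (1, "data moves to clusters")]
word_counts = run_mapreduce(lines, map_fn, reduce_fn)
```

Only `map_fn` and `reduce_fn` are specific to the task; the driver stays the same for any other job, which is the property that makes the model attractive.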

5 Big data analysis

The analysis of big data mainly involves analytical methods for traditional data and big data, analytical architecture for big data, and software used for mining and analysis of big data. Data analysis is the final and most important phase in the value chain of big data, with the purpose of extracting useful values and providing suggestions or decisions. Different levels of potential value can be generated through the analysis of datasets in different fields [10]. However, data analysis is a broad area, which changes frequently and is extremely complex. In this section, we introduce the methods, architectures, and tools for big data analysis.

5.1 Traditional data analysis

Traditional data analysis means using proper statistical methods to analyze massive data, to concentrate, extract, and refine useful data hidden in a batch of chaotic datasets, and to identify the inherent law of the subject matter, so as to maximize the value of data. Data analysis plays a huge guidance role in making development plans for a country, understanding customer demands for commerce, and predicting market trends for enterprises. Big data analysis can be deemed as the analysis technique for a special kind of data. Therefore, many traditional data analysis methods may still be utilized for big data analysis. Several representative traditional data analysis methods are examined in the following, many of which are from statistics and computer science.

– Cluster Analysis: a statistical method for grouping objects, specifically for classifying objects according to some features. Cluster analysis is used to differentiate objects with particular features and divide them into categories (clusters) according to these features, such that objects in the same category have high homogeneity while different categories have high heterogeneity. Cluster analysis is an unsupervised learning method that requires no training data.

– Factor Analysis: aims to describe the relations among many elements with only a few factors, i.e., several closely related variables are grouped into a factor, and the few factors are then used to reveal most of the information in the original data.

– Correlation Analysis: an analytical method for determining the law of relations, such as correlation, correlative dependence, and mutual restriction, among observed phenomena, and accordingly conducting forecast and control. Such relations may be classified into two types: (i) function, reflecting a strict dependence relationship among phenomena, which is also called a definitive dependence relationship; (ii) correlation, an undetermined or inexact dependence relation, in which a numerical value of one variable may correspond to several numerical values of the other variable, and such numerical values fluctuate regularly around their mean values.

– Regression Analysis: a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies dependence relationships among variables that are hidden by randomness. Regression analysis may turn complex and undetermined correlations among variables into simple and regular ones.

– A/B Testing: also called bucket testing. It is a technique for determining how to improve target variables by comparing tested groups. Big data will require a large number of tests to be executed and analyzed.

– Statistical Analysis: Statistical analysis is based on statistical theory, a branch of applied mathematics. In statistical theory, randomness and uncertainty are modeled with probability theory. Statistical analysis can provide description and inference for big data. Descriptive statistical analysis can summarize and describe datasets, while inferential statistical analysis can draw conclusions from data subject to random variations. Statistical analysis is widely applied in the economic and medical care fields [110].

– Data Mining Algorithms: Data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. In 2006, the IEEE International Conference on Data Mining Series (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [111], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, and CART. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and link mining, all of which are among the most important problems in data mining research.
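As an illustration of cluster analysis, the k-means algorithm named above can be sketched in plain Python. This is a minimal version under assumptions of our own: the function names, the explicit `initial_centroids` argument, and the toy data are hypothetical, and practical implementations choose initial centroids more carefully because a poor choice may lead to a poor local optimum.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, initial_centroids, max_iterations=100):
    """Group points into clusters without any training labels."""
    centroids = [tuple(c) for c in initial_centroids]
    assignment = []
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest centroid.
        assignment = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(point, centroids[c]))
            for point in points
        ]
        # Update step: move every centroid to the mean of its members.
        new_centroids = []
        for c, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(old)   # keep an empty cluster in place
        if new_centroids == centroids:
            break                           # converged
        centroids = new_centroids
    return centroids, assignment


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, assignment = kmeans(points, [points[0], points[3]])
```

Points that end up with the same index in `assignment` form one cluster, i.e., a group with high internal homogeneity in the sense described above.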

5.2 Big data analytic methods

At the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data so as to bring value to enterprises and individuals. At present, the main processing methods for big data are as follows.

– Bloom Filter: A Bloom filter consists of a series of hash functions. Its principle is to store hash values of data, rather than the data itself, in a bit array, which is in essence a bitmap index that uses hash functions to conduct lossy compression storage of data. It has advantages such as high space efficiency and high query speed, but it also has disadvantages: it may misrecognize elements (false positives), and it does not support deletion.

– Hashing: a method that essentially transforms data into shorter, fixed-length numerical values or index values. Hashing has advantages such as rapid reading and writing and high query speed, but it is hard to find a sound hash function.

– Index: An index is always an effective method to reduce the expense of disk reading and writing and to improve insertion, deletion, modification, and query speeds, both in traditional relational databases that manage structured data and in other technologies that manage semi-structured and unstructured data. However, an index has the disadvantage of the additional cost of storing index files, which must be maintained dynamically when data is updated.

– Trie: also called a trie tree, a variant of the hash tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of a trie is to use common prefixes of character strings to reduce comparisons of character strings to the greatest extent, so as to improve query efficiency.

– Parallel Computing: Compared to traditional serial computing, parallel computing refers to simultaneously using several computing resources to complete a computation task. Its basic idea is to decompose a problem into parts and assign them to several separate processes to be completed independently, so as to achieve co-processing. Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see a comparison in Table 1).
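Of the methods listed above, the Bloom filter is compact enough to sketch directly. The class below is a minimal illustration, not a production design: the name `BloomFilter`, the default sizes, and the use of salted SHA-256 digests as the series of hash functions are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Bit array plus several hash functions; stores hashes, not the data."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)   # one byte per bit, for readability

    def _positions(self, item):
        # Derive num_hashes positions by salting one hash function (assumption).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means definitely absent; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
seen.add("hadoop")
seen.add("dryad")
```

An added item is always reported as present, whereas an item that was never added is usually, but not always, reported as absent. Because several items may share a bit, clearing bits to delete one item could silently remove others, which is why this basic form does not support deletion.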

Although parallel computing systems or tools such as MapReduce or Dryad are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages are being developed on top of these systems. Such high-level languages include Sawzall, Pig, and Hive for MapReduce, as well as Scope and DryadLINQ for Dryad.

5.3 Architecture for big data analysis

Because of the 4Vs of big data, different analytical architectures should be considered for different application requirements.

Table 1 Comparison of MPI, MapReduce, and Dryad

| | MPI | MapReduce | Dryad |
|---|---|---|---|
| Deployment | Computing node and data storage arranged separately (data should be moved to the computing node) | Computing and data storage arranged at the same node (computing should be close to data) | Computing and data storage arranged at the same node (computing should be close to data) |
| Resource management/scheduling | – | Workqueue (Google), HOD (Yahoo) | Not clear |
| Low-level programming | MPI API | MapReduce API | Dryad API |
| High-level programming | – | Pig, Hive, Jaql, ... | Scope, DryadLINQ |
| Data storage | The local file system, NFS, ... | GFS (Google), HDFS (Hadoop), KFS, Amazon S3, ... | NTFS, Cosmos DFS |
| Task partitioning | User manually partitions the tasks | Automatic | Automatic |
| Communication | Messaging, remote memory access | Files (local FS, DFS) | Files, TCP pipes, shared-memory FIFOs |
| Fault tolerance | Checkpoint | Task re-execution | Task re-execution |

5.3.1 Real-time vs. offline analysis

According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

– Real-time analysis: mainly used in e-commerce and finance. Since data constantly changes, rapid data analysis is needed and analytical results must be returned with very short delay. The main existing architectures for real-time analysis include (i) parallel processing clusters using traditional relational databases, and (ii) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

– Offline analysis: usually used for applications without high requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally conducts analysis by importing logs into a special platform through data acquisition tools. In the big data setting, many Internet enterprises use an offline analysis architecture based on Hadoop in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool Timetunnel, and Chukwa of Hadoop. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.

                                    532 Analysis at different levels

                                    Big data analysis can also be classified into memory levelanalysis Business Intelligence (BI) level analysis and mas-sive level analysis which are examined in the following

                                    ndash Memory-level analysis is for the case where the totaldata volume is smaller than the maximum memory ofa cluster Nowadays the memory of server cluster sur-passes hundreds of GB while even the TB level iscommon Therefore an internal database technologymay be used and hot data shall reside in the memory soas to improve the analytical efficiency Memory-levelanalysis is extremely suitable for real-time analysisMongoDB is a representative memory-level analyticalarchitecture With the development of SSD (Solid-StateDrive) the capacity and performance of memory-leveldata analysis has been further improved and widelyapplied

– BI analysis is for the case when the data scale surpasses the memory level but may be imported into the BI analysis environment. Current mainstream BI products provide data analysis plans that support data volumes above the TB level.

– Massive analysis is for the case when the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analysis utilizes HDFS of Hadoop to store data and MapReduce to analyze it. Most massive analysis belongs to the offline analysis category.
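To make the MapReduce style of massive analysis concrete, the following is a minimal single-process sketch of the map, shuffle, and reduce steps applied to word counting. It only illustrates the programming model; it does not use Hadoop or HDFS, and all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input record and yield (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle_phase(pairs):
    """Group values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_count_mapper(line):
    """Emit (word, 1) for every whitespace-separated word in a line."""
    for word in line.lower().split():
        yield word, 1


def word_count_reducer(word, counts):
    """Sum the partial counts collected for one word."""
    return sum(counts)


def run_word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = map_phase(lines, word_count_mapper)
    return reduce_phase(shuffle_phase(pairs), word_count_reducer)
```

In a real deployment the input lines would be read from distributed storage and the three phases would run on different machines; here they run in one process so that the data flow is easy to follow.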

5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithms differs greatly according to the kind of data and the application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.
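As an illustration of an analysis that is amenable to parallel processing, the sketch below computes the mean and variance of a dataset by letting each worker summarize one chunk and then merging the partial results. It is a sketch under stated assumptions: the thread pool merely stands in for the nodes of a cluster (CPython threads do not speed up CPU-bound work), and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, n_chunks):
    """Split a list into at most n_chunks contiguous, nearly equal parts."""
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def partial_stats(chunk):
    """Local step: return (count, sum, sum of squares) for one chunk."""
    return len(chunk), sum(chunk), sum(x * x for x in chunk)


def parallel_mean_variance(values, n_workers=4):
    """Merge per-chunk statistics into the global mean and population variance."""
    if not values:
        raise ValueError("values must not be empty")
    chunks = split_into_chunks(values, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(partial_stats, chunks))
    count = sum(p[0] for p in parts)
    total = sum(p[1] for p in parts)
    total_sq = sum(p[2] for p in parts)
    mean = total / count
    variance = total_sq / count - mean * mean
    return mean, variance
```

The analysis parallelizes well because the three partial statistics are additive: each chunk can be processed independently and only three numbers per chunk need to be exchanged.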

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the five most widely used software packages according to the survey "What Analytics, Data mining, Big Data software have you used in the past 12 months for a real project?" of 798 professionals, conducted by KDnuggets in 2012 [112].

– R (30.7 %): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. For computing-intensive tasks, code written in C, C++, and Fortran may be called from the R environment. In addition, skilled users can directly manipulate R objects from C. R is an implementation of the S language, an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and plotting. Compared with S, R is more popular because it is open source. R ranked first in the KDnuggets 2012 survey. Furthermore, in a 2012 survey of "Design languages you have used for data mining/analysis in the past year", R was also in first place, ahead of SQL and Java. Owing to the popularity of R, database manufacturers such as Teradata and Oracle have released products supporting R.

– Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins with powerful data analysis functions, such as Analysis ToolPak and Solver Add-in, are installed with it, but they can be used only after users enable them. Excel is also the only commercial software among the top five.

– Rapid-I RapidMiner (26.7 %): RapidMiner is open source software used for data mining, machine learning, and predictive analysis. In a KDnuggets investigation in 2011, it was used more frequently than R (ranked Top 1). The data mining and machine learning procedures provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment. The data mining flow is described in XML and displayed through a graphical user interface (GUI). RapidMiner is written in Java. It integrates the learners and evaluation methods of Weka and works with R. Functions of RapidMiner are implemented by connecting processes that consist of various operators. The entire flow can be regarded as a factory production line, with original data as input and model results as output. The operators can be considered as specific functions with different input and output characteristics.

– KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, open source platform for data integration, data processing, data analysis, and data mining [113]. It allows users to create data flows or data channels visually and to selectively run some or all analytical procedures, and it provides analytical results, models, and interactive views. KNIME is written in Java and, being based on Eclipse, provides additional functions as plug-ins. Through plug-ins, users can insert processing modules for files, pictures, and time series, and integrate them with various open source projects, e.g., R and Weka. KNIME covers data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted in a visual environment. KNIME is designed as a module-based and extensible framework. There is no dependence between its processing units and data containers, which makes it suited to distributed environments and independent development. In addition, KNIME is easy to extend: developers can readily add new nodes and views.

– Weka/Pentaho (14.8 %): Weka, short for Waikato Environment for Knowledge Analysis, is free and open source machine learning and data mining software written in Java. Weka provides functions such as data processing, feature selection, classification, regression, clustering, association rules, and visualization. Pentaho is one of the most popular open source BI software suites. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, and data mining, i.e., all aspects of BI. Weka's data processing algorithms are also integrated in Pentaho and can be called directly.
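The operator-chain view used by flow-based tools such as RapidMiner and KNIME, in which raw data enters one end of a "production line" and a model result leaves the other, can be sketched in a few lines. This is not the API of either tool; the operators and the `chain` helper below are hypothetical and only illustrate the idea that each operator is a function with a defined input and output.

```python
def chain(operators):
    """Connect operators into one flow: the output of each feeds the next."""
    def run(data):
        for operator in operators:
            data = operator(data)
        return data
    return run


def drop_missing(rows):
    """Pre-processing operator: discard rows that contain a missing value."""
    return [row for row in rows if None not in row.values()]


def scale_feature(name):
    """Build an operator that rescales one numeric column to [0, 1]."""
    def operator(rows):
        values = [row[name] for row in rows]
        low, high = min(values), max(values)
        span = (high - low) or 1.0  # avoid division by zero for constant columns
        return [{**row, name: (row[name] - low) / span} for row in rows]
    return operator


def mean_model(target):
    """Build a modeling operator: a trivial model that predicts the mean target."""
    def operator(rows):
        prediction = sum(row[target] for row in rows) / len(rows)
        return {"target": target, "prediction": prediction}
    return operator
```

A flow is then assembled by listing operators in order, for example `chain([drop_missing, scale_feature("x"), mean_model("y")])`, which mirrors how the graphical tools let users wire nodes together.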

                                    6 Big data applications

In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful value via judgments, suggestions, support, or decisions.

However, data analysis involves a wide range of applications, which change frequently and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields: structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

6.1 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically includes large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have been emerging over the past decades. For example, as early as the 1990s, BI had become a prevailing technology for business applications, and network search engines based on massive data mining emerged in the early 21st century. Some potential and influential applications from different fields, together with their data and analysis characteristics, are discussed as follows.

– Evolution of Commercial Applications: The earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. The analytical techniques used in such systems prevailed in the 1990s and were intuitive and simple, e.g., reports, dashboards, conditional queries, search-based business intelligence, online transaction processing, interactive visualization, score cards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have provided a unique opportunity for organizations to establish an online presence and interact directly with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted with text analysis and website mining techniques. As reported in [115], the number of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the Internet of Things based on sensors are opening a new generation of innovative applications and require considerably larger capacity to support location-sensing, people-oriented, and context-aware operation.

– Evolution of Network Applications: The early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to mining email contents and building search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages full of various kinds of data, such as text, images, audio, videos, and interactive contents. Therefore, plenty of advanced technologies for semi-structured or unstructured data have emerged. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share contents.

– Evolution of Scientific Applications: Scientific research in many fields, such as astrophysics, oceanology, genomics, and environmental research, is acquiring massive data with high-throughput sensors and instruments. The US National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, inter-operative analysis software, and data services to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets are highly varied in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and comprehensive coverage is not easy, we focus on the key problems and technologies of data analysis in the following discussions.

6.2 Big data analysis fields

6.2.1 Structured data analysis

Business applications and scientific research may generate massive structured data, whose management and analysis rely on mature commercialized technologies such as RDBMS, data warehouse, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning based on exact mathematical models and powerful algorithms has been applied to anomaly detection [117] and energy control [118]. By exploiting data characteristics, time and space mining can extract knowledge structures hidden in high-speed data flows and sensor data [119]. Driven by privacy concerns in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially in process analysis with event data [121].

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to have more business potential than structured data analysis. Generally, text analysis is the process of extracting useful information and knowledge from unstructured text. Text mining is interdisciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and data mining in particular. Most text mining systems are based on text representation and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammars [122]. NLP-based techniques that have been applied to text mining include information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.
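As a small example of turning unstructured text into a representation that mining algorithms can use, the sketch below computes TF-IDF weights for a handful of documents. TF-IDF is a common text representation that we introduce here only for illustration; it is not singled out in the discussion above, and the tokenizer is deliberately naive.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(documents):
    """Return one {term: weight} dictionary per document.

    weight = (term frequency in the document) * log(N / document frequency)
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors
```

Terms that occur in every document receive a weight of zero, while terms concentrated in a few documents are weighted up, which is why such vectors are a usual starting point for classification, clustering, and retrieval.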

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the part of the Web being mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, image, audio, video, code, metadata, and hyperlinks. Research on image, audio, and video mining has recently been called multimedia analysis, which is discussed in Section 6.2.4. Since most Web content data is unstructured text, research on Web data analysis mainly centers on text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves mining the semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., in email, newsgroup management, and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to model and integrate data on the Web so as to support more complex queries than keyword-based searches.

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagram of links within a website or among multiple websites. Models are built on the topological structure formed by hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages. The topic-oriented crawler is another successful case of utilizing the models [127].
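To show how a link-structure model can rank pages, the following is a simplified power-iteration sketch in the spirit of PageRank. It is an illustration under stated assumptions rather than the exact formulation of [125]: the damping factor of 0.85, the fixed number of iterations, and the even redistribution of rank from pages without outgoing links are all choices made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages from a {page: [pages it links to]} dictionary."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on evenly to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page without out-links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

With `links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}`, page `c` collects links from three pages and ends up with the highest rank, while `d`, which nobody links to, keeps only the baseline share.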

Web usage mining aims to mine auxiliary data generated by Web dialogues or activities, whereas Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or transactions, caches, user queries, bookmark data, mouse clicks and scrolls, and any other kind of data generated through interaction with the Web. As Web services and Web 2.0 become mature and popular, Web usage data will have increasingly high variety. Web usage mining plays key roles in personalized space, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.
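A typical first step in mining access logs is to group raw page requests into user sessions. The sketch below does this with an inactivity timeout; the 30-minute threshold and the `(user_id, timestamp, url)` record layout are assumptions made for illustration and are not prescribed by the discussion above.

```python
from collections import defaultdict


def sessionize(events, timeout=1800):
    """Group (user_id, timestamp_seconds, url) records into per-user sessions.

    A new session starts whenever the gap between two consecutive requests
    of the same user exceeds `timeout` seconds.
    """
    hits_by_user = defaultdict(list)
    for user, timestamp, url in events:
        hits_by_user[user].append((timestamp, url))

    sessions = {}
    for user, hits in hits_by_user.items():
        hits.sort()  # order each user's requests by time
        user_sessions = []
        current = []
        last_timestamp = None
        for timestamp, url in hits:
            if last_timestamp is not None and timestamp - last_timestamp > timeout:
                user_sessions.append(current)
                current = []
            current.append(url)
            last_timestamp = timestamp
        if current:
            user_sessions.append(current)
        sessions[user] = user_sessions
    return sessions
```

The resulting sessions (ordered lists of visited URLs) are the kind of input on which navigation-pattern mining or recommendation can subsequently be built.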

6.2.4 Multimedia data analysis

Multimedia data (mainly images, audio, and videos) has been growing at an amazing speed, and multimedia analysis extracts useful knowledge from such data and interprets its semantics. Because multimedia data is heterogeneous and mostly contains richer information than simple structured data or text data, information extraction is confronted with the huge challenge of the semantic gap. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection.

Audio summarization can be accomplished by extracting the prominent words or phrases from metadata or by synthesizing a new representation. Video summarization interprets the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods use a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied in many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video and apply smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.

Multimedia annotation inserts labels that describe the contents of images and videos at both the syntactic and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time and labor intensive, automatic annotation without human intervention is highly appealing. The main challenge for automatic multimedia annotation is the semantic gap. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to jointly explore manual and automatic multimedia annotation [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users to look up multimedia resources conveniently and quickly [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, and includes shot boundary detection, key frame extraction, and scene segmentation. Based on the result of structural analysis, the second procedure is feature extraction, which mainly mines the features of key frames, objects, texts, and movements; these features are the foundation of video indexing and retrieval. Data mining, classification, and annotation use the extracted features to find patterns in video contents and assign videos to predefined categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos. The retrieval results are then optimized through relevance feedback.

Multimedia recommendation recommends specific multimedia contents according to users' preferences. It has proven to be an effective approach to providing personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. Content-based methods identify general features of users or their interests and recommend other contents with similar features. These methods largely rely on content similarity measurement, but most of them are troubled by limited content analysis and over-specialization. Collaborative-filtering-based methods identify groups with similar interests and recommend contents to group members according to their behavior [132]. More recently, hybrid methods have been introduced, which integrate the advantages of the two aforementioned types of methods to improve recommendation quality [133].
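The collaborative-filtering idea, i.e., finding users with similar interests and recommending what they liked, can be sketched as follows. This is a minimal user-based variant with cosine similarity chosen for illustration; it is not the specific method of [132], and the rating data layout is hypothetical.

```python
import math


def cosine(ratings_a, ratings_b):
    """Cosine similarity between two {item: rating} dictionaries."""
    common = set(ratings_a) & set(ratings_b)
    dot = sum(ratings_a[item] * ratings_b[item] for item in common)
    norm_a = math.sqrt(sum(r * r for r in ratings_a.values()))
    norm_b = math.sqrt(sum(r * r for r in ratings_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(ratings, user, top_n=3):
    """Recommend unseen items, scored by similarity-weighted neighbor ratings."""
    target = ratings[user]
    weighted_sum = {}
    weight_total = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        similarity = cosine(target, other_ratings)
        if similarity <= 0:
            continue  # users without overlapping taste contribute nothing
        for item, rating in other_ratings.items():
            if item in target:
                continue  # only recommend items the user has not rated yet
            weighted_sum[item] = weighted_sum.get(item, 0.0) + similarity * rating
            weight_total[item] = weight_total.get(item, 0.0) + similarity
    predicted = {item: weighted_sum[item] / weight_total[item] for item in weighted_sum}
    # Highest predicted rating first; ties broken alphabetically for determinism.
    return sorted(predicted, key=lambda item: (-predicted[item], item))[:top_n]
```

A content-based method would instead compare feature vectors of the items themselves, and a hybrid method would combine both scores.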

The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains text descriptions related to concepts and video examples [134]. In [135], the author proposed a new algorithm for special multimedia event detection using only a few positive training examples. Research on video event detection is still in its infancy and mainly focuses on sports or news events, running or abnormal events in surveillance videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the online social network analysis that emerged at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, have become increasingly popular over the years. Such online social network services generally contain massive linked data and content data. The linked data is mainly in the form of graph structures describing the communications between two entities. The content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. From the data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

Research on link-based structural analysis has focused on link prediction, community discovery, social network evolution, and social influence analysis. SNS may be visualized as graphs in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertices and edges are continually added to the graphs. Link prediction predicts the possibility of a future connection between two vertices. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for vertices and uses the existing link information to train binary classifiers that predict future links [139]. Probabilistic methods aim to build models of the connection probabilities among vertices in SNS [140]. Linear algebra methods compute the similarity between two vertices from a reduced-rank similarity matrix [141]. A community is represented by a sub-graph in which the edges connecting vertices within the sub-graph have high density, while the edges between two sub-graphs have much lower density [142].
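As a simple illustration of link prediction and of the density notion behind communities, the sketch below scores unconnected vertex pairs by the Jaccard overlap of their neighborhoods and measures the edge density of a vertex group. The Jaccard score is a basic neighborhood heuristic chosen for this example (it could serve as one input feature of a classifier); it is not one of the specific methods of [139–141], and the helper names are hypothetical.

```python
def neighbors(edges):
    """Build an undirected adjacency map {vertex: set of neighbors}."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def jaccard_score(adjacency, u, v):
    """Shared neighbors divided by all neighbors of the two vertices."""
    union = adjacency[u] | adjacency[v]
    if not union:
        return 0.0
    return len(adjacency[u] & adjacency[v]) / len(union)


def predict_links(edges, top_n=3):
    """Return the top_n unconnected pairs as (score, u, v), best first."""
    adjacency = neighbors(edges)
    vertices = sorted(adjacency)
    candidates = []
    for i, u in enumerate(vertices):
        for v in vertices[i + 1:]:
            if v not in adjacency[u]:
                candidates.append((jaccard_score(adjacency, u, v), u, v))
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))
    return candidates[:top_n]


def density(adjacency, group):
    """Fraction of possible edges inside `group` that actually exist."""
    group = set(group)
    n = len(group)
    if n < 2:
        return 0.0
    internal_edges = sum(len(adjacency[u] & group) for u in group) / 2
    return internal_edges / (n * (n - 1) / 2)
```

For the edge list `[("a","b"), ("a","c"), ("b","c"), ("b","d"), ("c","d"), ("d","e")]`, the pair `("a", "d")` shares two of three neighbors and is the strongest predicted link, and the group `{"a", "b", "c"}` has density 1.0, which is the kind of dense sub-graph that community detection looks for.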

Many methods for community detection have been proposed and studied, most of which are topology-based and rely on an objective function that captures the notion of community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. Research on SNS evolution aims to find laws and derive models that explain network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generative methods have been proposed to assist network and system design [147].

Social influence refers to the case where individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, time effects, and the characteristics of networks and individuals. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is taken into account, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise. For example, the blogosphere contains a large number of spam blogs, and Twitter contains many trivial Tweets. Third, SNS are dynamic networks that vary and are updated frequently and quickly. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge among different media [150].

6.2.6 Mobile data analysis

By April 2013, Android Apps had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since research on mobile analysis has only just started, we only introduce some recent and representative analysis applications in this section.

With the growing number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities based on geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest Webchat). Traditional network communities or SNS communities lack online interaction among members, and the communities are active only when members are sitting in front of computers. By contrast, mobile phones can support rich interaction at any time and anywhere. A mobile community is defined as a group of individuals with the same hobbies (i.e., health, safety, and entertainment, etc.) who gather together on networks, meet to set a common goal, decide on measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build body area networks for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, and physiological relations. In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop a smartphone application that analyzes people's paces when they walk and uses the pace information to unlock the security system [11]. Meanwhile, Robert Delano and Brian Parise from the Georgia Institute of Technology developed an application called iTrem, which monitors human body tremors with the built-in seismograph of a mobile phone so as to cope with Parkinson's disease and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business models. In sales planning, after comparing massive data, enterprises can optimize their commodity prices. In operations, enterprises can improve their operational efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor costs. In the supply chain, using big data, enterprises may conduct inventory optimization, logistics optimization, and supplier coordination to close the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) used data analysis to recognize that activities such as "multi-times score accumulation" and "score exchange in shops" are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers' transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.
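The selection step of such a drop-out warning model can be sketched as follows: given a risk score per customer, pick the top 20 % for a retention offer. The scores are assumed to come from some previously trained model; the customer identifiers, the scores, and the function name are hypothetical, and the bank's actual model is not described in the source.

```python
def customers_to_retain(churn_scores, percent=20):
    """Return the `percent` % of customers with the highest drop-out risk.

    churn_scores maps a customer id to a risk score (higher = more likely
    to drop out). The group size is rounded up, so a non-empty customer
    base always yields at least one customer.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    # Highest risk first; ties broken by customer id for determinism.
    ranked = sorted(churn_scores, key=lambda c: (-churn_scores[c], c))
    k = -(-len(ranked) * percent // 100)  # ceiling of len * percent / 100
    return ranked[:k]
```

Integer arithmetic is used for the group size to avoid floating-point surprises when the customer count is not a multiple of five.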

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, along with, more importantly, the age, gender, address, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, the market conditions of their brands, and consumers' behaviors, and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more favorable prices. The credit loan of Alibaba uses big data technology to automatically analyze the acquired enterprise transaction data and judge whether to lend to enterprises, with no manual intervention in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan with only about 0.3 % bad loans, which is much lower than the ratios of other commercial banks.

6.3.2 Application of IoT-based big data

IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises may have the most profound experience with the application of IoT big data. For example, UPS trucks are equipped with sensors, wireless adapters, and GPS, so that headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project cooperation between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and Miami city, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills this year by identifying and fixing running and leaking water pipes in a timely manner.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and connections among individuals based on an

198 Mobile Netw Appl (2014) 19:171–209

information network. Big data of online SNS mainly comes from instant messages, online social networking, micro blogs, and shared spaces, etc., which represent various user activities. The analysis of big data from online SNS uses computational analytical methods for understanding relations in human society by virtue of theories and methods from mathematics, informatics, sociology, and management science, etc., along three dimensions: network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education, etc. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire value.

– Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preference, emotion, interest, and demand, etc. may be revealed.

– Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies, etc. aggregate relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.
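The notion of a community as a group with close internal relations but loose external relations can be made concrete with a small sketch. It counts the edges that stay inside a candidate group versus those that cross its boundary. The function name `community_cohesion`, the toy graph, and the use of a simple internal-edge ratio are illustrative assumptions, not a method prescribed by the survey.

```python
def community_cohesion(edges, members):
    """Count edges inside a candidate community versus edges crossing its boundary.

    edges: iterable of (u, v) pairs describing undirected social relations.
    members: the users assumed to form one community.
    Returns (internal_edges, external_edges, internal_ratio).
    """
    members = set(members)
    internal = external = 0
    for u, v in edges:
        endpoints_inside = (u in members) + (v in members)
        if endpoints_inside == 2:
            internal += 1      # both users belong to the community
        elif endpoints_inside == 1:
            external += 1      # the relation leaves the community
    total = internal + external
    ratio = internal / total if total else 0.0
    return internal, external, ratio

# Toy network: two tightly knit triangles joined by a single relation (c, d).
edges = [("a", "b"), ("a", "c"), ("b", "c"),
         ("c", "d"),
         ("d", "e"), ("d", "f"), ("e", "f")]
internal, external, ratio = community_cohesion(edges, {"a", "b", "c"})
print(internal, external, ratio)  # 3 1 0.75
```

A ratio close to 1 indicates that the group matches the description above; full community detection algorithms search for such groups automatically rather than testing a given one.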

The US Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, it was found that most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted research that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans. The goal is to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting sharp growth or drops in the volume of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation in the official statistics of Indonesia matches the number of Tweets about rice prices on Twitter, as shown in Fig. 6.

Fig. 5 Enabling technologies for online social network-oriented big data
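Aspect 1) above, detecting sharp growth or drops in topic volume, can be sketched as follows. The sketch assumes daily message counts for one topic and flags a day when its count deviates from the trailing-window mean by more than a chosen number of standard deviations. The window length, the threshold, the counts, and the name `detect_spikes` are illustrative assumptions and do not describe the Global Pulse project's actual method.

```python
from statistics import mean, stdev

def detect_spikes(counts, window=7, threshold=3.0):
    """Flag days whose topic volume departs sharply from the recent past.

    counts: daily message counts for one topic (hypothetical data).
    window: number of preceding days used as the baseline.
    threshold: minimum absolute z-score for a day to be flagged.
    Returns a list of (day_index, z_score) tuples.
    """
    flagged = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        sigma = stdev(history)
        if sigma == 0:
            continue  # flat baseline: the z-score is undefined, so skip the day
        z = (counts[i] - mean(history)) / sigma
        if abs(z) > threshold:
            flagged.append((i, z))
    return flagged

# Ten days of made-up counts with one obvious burst on day 8.
daily_counts = [100, 104, 98, 101, 97, 103, 99, 102, 180, 100]
spikes = detect_spikes(daily_counts)
print([day for day, _ in spikes])  # [8]
```

Because the baseline is a trailing window, the day after a burst is compared against a history that includes the burst, which is why only day 8 is flagged in this example.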

Generally speaking, the application of big data from online SNS may help to better understand users’ behavior and master the laws of social and economic activities from the following three aspects:

– Early Warning: to rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotion, and preference of users.

– Real-time Feedback: to acquire groups’ feedback on social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the health care business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final result into an extremely personalized treatment plan to assess the risk factors and main treatment plans of patients. Doctors may then reduce morbidity by 50% in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or by suggesting that patients reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20.

The Mount Sinai Medical Center in the US utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through the software development kit (SDK) and open interfaces.

Fig. 6 The correlation between Tweets about rice price and food price inflation

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units and coordinate with mobile networks for the distribution of sensing tasks and the collection and utilization of sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of Crowdsourcing has been successfully applied to geotagged photographs, positioning and navigation, urban road traffic sensing, market forecasting, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach to problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, Crowdsourcing had been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of Crowdsourcing. The main idea of Crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, Crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, Spatial Crowdsourcing has become a hot topic. The operation framework of Spatial Crowdsourcing is as follows. A user may request a service and resources related to a specified location. Then the mobile users who are willing to participate in the task move to the specified location to acquire the related data (such as video, audio, or pictures). Finally, the acquired data is sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions they provide, it can be forecast that Spatial Crowdsourcing will become more prevalent than traditional Crowdsourcing, e.g., Amazon Turk and Crowdflower.
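The request-and-dispatch step of this framework can be sketched under simple assumptions: each mobile user reports a position and a willingness flag, and the task goes to the closest willing user within a maximum travel distance. The nearest-worker rule, the 5 km limit, the coordinates, and the names `haversine_km` and `assign_task` are hypothetical choices for illustration; real Spatial Crowdsourcing platforms may use quite different assignment strategies.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))

def assign_task(task_location, workers, max_km=5.0):
    """Return the id of the closest willing worker within max_km, or None if there is none."""
    candidates = []
    for worker in workers:
        if not worker["willing"]:
            continue  # only volunteers take part in the task
        distance = haversine_km(task_location, worker["location"])
        if distance <= max_km:
            candidates.append((distance, worker["id"]))
    return min(candidates)[1] if candidates else None

# Arbitrary example coordinates; w1 is nearest but unwilling, w3 is too far away.
task = (30.5150, 114.4140)
workers = [
    {"id": "w1", "location": (30.5160, 114.4150), "willing": False},
    {"id": "w2", "location": (30.5200, 114.4200), "willing": True},
    {"id": "w3", "location": (30.6000, 114.3000), "willing": True},
]
chosen = assign_task(task, workers)
print(chosen)  # w2
```

The selected user would then travel to the location, capture the requested data, and return it to the requester, as described in the paragraph above.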

6.3.6 Smart grid

Smart Grid is the next-generation power grid, constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart Grid related big data are generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nationwide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as Circuit Breaker Monitors and transformers). Smart Grid brings about the following challenges on exploiting big data:

– Grid planning: By analyzing data in the Smart Grid, regions with excessively high electrical load or high power outage frequencies can be identified. Even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an “electric map” according to big data theory, making a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of the block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides effective and visual load forecasts for power grid planning in a city. Preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads, as displayed in the map, may then be conducted.

– Interaction between power generation and power consumption: An ideal power grid shall balance power generation and consumption. However, the traditional power grid is constructed based on the one-directional approach of transmission-transformation-distribution-consumption, which does not allow the generation capacity to be adjusted according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters were developed to improve power supply efficiency. TXU Energy has had several successful deployments of smart electric meters, which help the supplier read power utilization data every 15 min rather than every month as in the past. Labor cost for meter reading is greatly reduced. Because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and low periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help realize time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users.

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortages. Such energy resources can complement traditional hydropower and thermal power generation.
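The time-sharing dynamic pricing mentioned in the second item above can be sketched from 15-minute smart meter readings. The sketch aggregates the readings into hourly load and charges a higher price in the highest-load hours. The base price, the peak multiplier, the share of hours treated as peak, and the function names are all illustrative assumptions, not the tariff design of TXU Energy or any other supplier.

```python
def hourly_load(readings_kwh):
    """Sum consecutive groups of four 15-minute readings into hourly totals."""
    if len(readings_kwh) % 4:
        raise ValueError("expected a whole number of hours (4 readings per hour)")
    return [sum(readings_kwh[i:i + 4]) for i in range(0, len(readings_kwh), 4)]

def time_of_use_prices(hourly, base_price=0.10, peak_multiplier=1.5, peak_share=0.25):
    """Assign a per-kWh price to each hour: higher in peak hours, base price otherwise.

    peak_share: fraction of hours (those with the highest load) treated as peak.
    All price parameters are hypothetical.
    """
    n_peak = max(1, round(len(hourly) * peak_share))
    by_load = sorted(range(len(hourly)), key=lambda h: hourly[h], reverse=True)
    peak_hours = set(by_load[:n_peak])
    return [base_price * peak_multiplier if h in peak_hours else base_price
            for h in range(len(hourly))]

# Four hours of made-up 15-minute readings (kWh); the third hour is the peak.
readings = [1, 1, 1, 1,  2, 2, 2, 2,  5, 5, 5, 5,  1, 1, 1, 1]
load = hourly_load(readings)
prices = time_of_use_prices(load)
print(load)                       # [4, 8, 20, 4]
print(prices.index(max(prices)))  # 2
```

With monthly readings only one total per customer is available, so a price schedule of this kind cannot be computed; the 15-minute granularity is what makes it possible.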

7 Conclusion, open issues, and outlook

In this paper, we review the background and state-of-the-art of big data. First, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture to readers of this exciting area.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, which are discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

– Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to assess the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated once the system is implemented and deployed, which makes it impossible to horizontally compare the advantages and disadvantages of various alternative solutions, or even to compare performance before and after the adoption of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes: This includes memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, etc., should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor that restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.


– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor in improving big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems in big data processing arise beyond traditional data analysis, including (i) data re-utilization: with the increase of data scale, more value may be mined from the re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized so that more value can be mined; and (iii) data exhaust, which means wrong data collected during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.
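The idea of a data life cycle and a rate of depreciation, raised in the item on real-time performance above, can be illustrated with a deliberately simple model. The sketch assumes that the value of a data item halves after every fixed period (exponential decay) and that an item reaches the end of its life cycle once its value falls below a chosen fraction of the original. Both the decay model and the parameters are assumptions made for illustration; the survey does not prescribe a particular depreciation model.

```python
def data_value(initial_value, age_hours, half_life_hours):
    """Remaining value of a data item under an assumed exponential depreciation model."""
    if half_life_hours <= 0:
        raise ValueError("half_life_hours must be positive")
    return initial_value * 0.5 ** (age_hours / half_life_hours)

def is_expired(age_hours, half_life_hours, min_fraction=0.1):
    """True once the item retains less than min_fraction of its original value."""
    return data_value(1.0, age_hours, half_life_hours) < min_fraction

# A reading worth 100 units when fresh, with an assumed half-life of 6 hours.
print(data_value(100, 6, 6))   # 50.0
print(is_expired(12, 6))       # False (25% of the value remains)
print(is_expired(24, 6))       # True  (6.25% of the value remains)
```

A real-time system could use such a rule to decide which items to keep in fast storage and which to archive, with the half-life chosen per application.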

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved:

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from comprehensive utilization of multiple datasets is far higher than the sum of the values of the individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and from different datasets.

– Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction of big data, etc. are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume grows fast, there are more severe safety risks, while traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security-related challenges:

– Big data privacy: Big data privacy includes two aspects: (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc. of users may be more easily acquired, and users may not be aware. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed a big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher at Skull Security, used an information acquisition tool to acquire data from the public pages of Facebook users who had failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low-quality data wastes transmission and storage resources and has poor usability. Many factors may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though many measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.

– Big data safety mechanism: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods designed for small and medium-scale data cannot meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safe communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants’ data should be enabled on the premise of efficiency assurance.

– Big data application in information security: Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) by analyzing big data in the form of log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc. may also be more easily identified through the analysis of big data.
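Two of the data quality dimensions named in the list above, completeness and redundancy, lend themselves to a small automatic check. The sketch below measures the share of required fields that are filled and the share of records that duplicate an earlier one. The record layout, the field names, and the two metric definitions are illustrative assumptions; accuracy and consistency would need reference data or rules that are not modeled here.

```python
def completeness(records, required_fields):
    """Fraction of required field slots that hold a non-empty value."""
    slots = len(records) * len(required_fields)
    if slots == 0:
        return 1.0
    filled = sum(1 for record in records for field in required_fields
                 if record.get(field) not in (None, ""))
    return filled / slots

def redundancy(records, key_fields):
    """Fraction of records that repeat an earlier record on the given key fields."""
    seen = set()
    duplicates = 0
    for record in records:
        key = tuple(record.get(field) for field in key_fields)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates / len(records) if records else 0.0

# Made-up sensor records: one missing reading and one exact duplicate.
records = [
    {"id": 1, "reading": 20.5},
    {"id": 2, "reading": None},
    {"id": 1, "reading": 20.5},
    {"id": 3, "reading": 19.8},
]
print(completeness(records, ["id", "reading"]))  # 0.875
print(redundancy(records, ["id", "reading"]))    # 0.25
```

Scores like these could be tracked per data source to decide where cleaning or repair effort is most needed.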

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, energy-efficient distributed storage, and hardware and software system architectures for processing, etc. In particular, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the “T” (Technology) was the main concern, and technology drove the development of data. In the big data era, with the prominence of data value and advances in “I” (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact, but also influence everyone’s ways of living and thinking, which is happening right now. We cannot predict the future, but we may take precautions for possible events to occur in the future.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have sought better ways to cope with larger-scale, more diverse, and more complexly structured data. These efforts are represented by Google’s globally distributed database Spanner and its fault-tolerant, expandable, distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc. shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a user-friendly way can they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc. are frequently used to visualize results of data analysis. New presentation forms will emerge in the future; e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it is observed that the role of data is becoming increasingly significant. In the small-scale data era, in which logic is more complex than data, program design is mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed that of business logic. Consequently, programs are gradually being transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods are certain to emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

  – During data analysis, we will try to utilize all data rather than analyzing only a small set of sample data.

  – Instead of insisting on accurate data, we will be willing to accept numerous and complicated data.

  – We shall pay greater attention to correlations between things rather than exploring causal relationships.

  – Simple algorithms applied to big data are more effective than complex algorithms applied to small data.

  – Analytical results of big data will reduce hasty and subjective factors in decision making, and data scientists will replace "experts".

Throughout the history of human society, the demands and willingness of human beings have always been the driving forces of scientific and technological progress. Big data may provide reference answers for human decision making through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for it. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities will attract increasing attention and will certainly cause enormous transformations of social activities in the future.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

                                    References

1. Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

2. Fact sheet: Big data across the federal government (2012) http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf

3. Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper

4. Drowning in numbers - digital data will flood the planet - and help us understand it better (2011) http://www.economist.com/blogs/dailychart/2011/11/bigdata-0

5. Lohr S (2012) The age of big data. New York Times, pp 11

6. Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/thedigitalbreadcrumbs-that-lead-to-big-data

7. Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-searchforanalysts-to-make-sense-of-big-data

8. Big data (2008) http://www.nature.com/news/specials/bigdata/index.html

9. Special online collection: dealing with big data (2011) http://www.sciencemag.org/site/special/data

10. Manyika J, McKinsey Global Institute, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute

11. Mayer-Schönberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Eamon Dolan/Houghton Mifflin Harcourt

12. Laney D (2001) 3-D data management: controlling data volume, velocity and variety. META Group Research Note, 6 February

13. Zikopoulos P, Eaton C et al (2011) Understanding big data: analytics for enterprise class Hadoop and streaming data. McGraw-Hill Osborne Media

14. Meijer E (2011) The world according to LINQ. Commun ACM 54(10):45–51

15. Beyer M (2011) Gartner says solving big data challenge involves more than just managing volumes of data. Gartner. http://www.gartner.com/it/page.jsp

16. O R Team (2011) Big data now: current perspectives from O'Reilly Radar. O'Reilly Media

17. Grobelnik M (2012) Big data tutorial. http://videolectures.net/eswc2012grobelnikbigdata

18. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2008) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014

19. DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6):85–98

20. Walter T (2009) Teradata past, present, and future. UCI ISG lecture series on scalable data management

21. Ghemawat S, Gobioff H, Leung S-T (2003) The Google file system. In: ACM SIGOPS Operating Systems Review, vol 37. ACM, pp 29–43

22. Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

23. Hey AJG, Tansley S, Tolle KM et al (2009) The fourth paradigm: data-intensive scientific discovery

24. Howard JH, Kazar ML, Menees SG, Nichols DA, Satyanarayanan M, Sidebotham RN, West MJ (1988) Scale and performance in a distributed file system. ACM Trans Comput Syst (TOCS) 6(1):51–81

25. Cattell R (2011) Scalable SQL and NoSQL data stores. ACM SIGMOD Record 39(4):12–27

26. Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033

27. Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98

28. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J et al (2012) Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States

29. Sun Y, Chen M, Liu B, Mao S (2013) FAR: a fault-avoidant routing method for data center networks with regular topology. In: Proceedings of ACM/IEEE symposium on architectures for networking and communications systems (ANCS'13). ACM

30. Wiki (2013) Applications and organizations using Hadoop. http://wiki.apache.org/hadoop/PoweredBy

31. Bahga A, Madisetti VK (2012) Analyzing massive machine maintenance data in a computing cloud. IEEE Transac Parallel Distrib Syst 23(10):1831–1843

32. Gunarathne T, Wu T-L, Choi JY, Bae S-H, Qiu J (2011) Cloud computing paradigms for pleasingly parallel biomedical applications. Concurr Comput Prac Experience 23(17):2338–2354

33. Gantz J, Reinsel D (2010) The digital universe decade - are you ready. External publication of IDC (Analyse the Future) information and data, pp 1–16

34. Bryant RE (2011) Data-intensive scalable computing for scientific applications. Comput Sci Eng 13(6):25–33

35. Wahab MHA, Mohd MNH, Hanafi HF, Mohsin MFM (2008) Data pre-processing on web server logs for generalized association rules mining algorithm. World Acad Sci Eng Technol 48:2008

36. Nanopoulos A, Manolopoulos Y, Zakrzewicz M, Morzy T (2002) Indexing web access-logs for pattern queries. In: Proceedings of the 4th international workshop on web information and data management. ACM, pp 63–68

37. Joshi KP, Joshi A, Yesha Y (2003) On using a warehouse to analyze web logs. Distrib Parallel Databases 13(2):161–180

38. Chandramohan V, Christensen K (2002) A first look at wired sensor networks for video surveillance systems. In: Proceedings LCN 2002, 27th annual IEEE conference on local computer networks. IEEE, pp 728–729

39. Selavo L, Wood A, Cao Q, Sookoor T, Liu H, Srinivasan A, Wu Y, Kang W, Stankovic J, Young D et al (2007) Luster: wireless sensor network for environmental research. In: Proceedings of the 5th international conference on embedded networked sensor systems. ACM, pp 103–116

40. Barrenetxea G, Ingelrest F, Schaefer G, Vetterli M, Couach O, Parlange M (2008) SensorScope: out-of-the-box environmental monitoring. In: Information processing in sensor networks, 2008, international conference on IPSN'08. IEEE, pp 332–343

41. Kim Y, Schmid T, Charbiwala ZM, Friedman J, Srivastava MB (2008) NAWMS: nonintrusive autonomous water monitoring system. In: Proceedings of the 6th ACM conference on embedded network sensor systems. ACM, pp 309–322

42. Kim S, Pakzad S, Culler D, Demmel J, Fenves G, Glaser S, Turon M (2007) Health monitoring of civil infrastructures using wireless sensor networks. In: Information processing in sensor networks, 2007, 6th international symposium on IPSN 2007. IEEE, pp 254–263

43. Ceriotti M, Mottola L, Picco GP, Murphy AL, Guna S, Corra M, Pozzi M, Zonta D, Zanon P (2009) Monitoring heritage buildings with wireless sensor networks: the Torre Aquila deployment. In: Proceedings of the 2009 international conference on information processing in sensor networks. IEEE Computer Society, pp 277–288

44. Tolle G, Polastre J, Szewczyk R, Culler D, Turner N, Tu K, Burgess S, Dawson T, Buonadonna P, Gay D et al (2005) A macroscope in the redwoods. In: Proceedings of the 3rd international conference on embedded networked sensor systems. ACM, pp 51–63

45. Wang F, Liu J (2011) Networked wireless sensor data collection: issues, challenges, and approaches. IEEE Commun Surv Tutor 13(4):673–687

46. Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the 11th international conference on World Wide Web. ACM, pp 124–135

47. Choudhary S, Dincturk ME, Mirtaheri SM, Moosavi A, von Bochmann G, Jourdan G-V, Onut I-V (2012) Crawling rich internet applications: the state of the art. In: CASCON, pp 146–160

48. Ghani N, Dixit S, Wang T-S (2000) On IP-over-WDM integration. IEEE Commun Mag 38(3):72–84

49. Manchester J, Anderson J, Doshi B, Dravida S (1998) IP over SONET. IEEE Commun Mag 36(5):136–142

50. Jinno M, Takara H, Kozicki B (2009) Dynamic optical mesh networks: drivers, challenges and solutions for the future. In: Optical communication, 2009, 35th European conference on ECOC'09. IEEE, pp 1–4

51. Barroso LA, Hölzle U (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synt Lect Comput Archit 4(1):1–108

52. Armstrong J (2009) OFDM for optical communications. J Light Technol 27(3):189–204

53. Shieh W (2011) OFDM for flexible high-speed optical networks. J Light Technol 29(10):1560–1577

54. Cisco data center interconnect design and deployment guide (2010)

55. Greenberg A, Hamilton JR, Jain N, Kandula S, Kim C, Lahiri P, Maltz DA, Patel P, Sengupta S (2009) VL2: a scalable and flexible data center network. In: ACM SIGCOMM computer communication review, vol 39. ACM, pp 51–62

56. Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S (2009) BCube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 39(4):63–74

57. Farrington N, Porter G, Radhakrishnan S, Bazzaz HH, Subramanya V, Fainman Y, Papen G, Vahdat A (2011) Helios: a hybrid electrical/optical switch architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 41(4):339–350

58. Abu-Libdeh H, Costa P, Rowstron A, O'Shea G, Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Comput Commun Rev 40(4):51–62

59. Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what's needed for datacenter network operations. IEEE Commun Mag 48(7):32–39

60. Wang G, Andersen DG, Kaminsky M, Papagiannaki K, Ng TS, Kozuch M, Ryan M (2010) c-Through: part-time optics in data centers. In: ACM SIGCOMM Computer Communication Review, vol 40. ACM, pp 327–338

61. Ye X, Yin Y, Yoo SJB, Mejia P, Proietti R, Akella V (2010) DOS: a scalable optical switch for datacenters. In: Proceedings of the 6th ACM/IEEE symposium on architectures for networking and communications systems. ACM, p 24

62. Singla A, Singh A, Ramachandran K, Xu L, Zhang Y (2010) Proteus: a topology malleable data center network. In: Proceedings of the 9th ACM SIGCOMM workshop on hot topics in networks. ACM, p 8

63. Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P (2011) Energy-efficient design of a scalable optical multiplane interconnection architecture. IEEE J Sel Top Quantum Electron 17(2):377–383

64. Kodi AK, Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance computing (HPC) systems. IEEE J Sel Top Quantum Electron 17(2):384–395

65. Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H (2012) Mirror mirror on the ceiling: flexible wireless links for data centers. ACM SIGCOMM Comput Commun Rev 42(4):443–454

66. Lenzerini M (2002) Data integration: a theoretical perspective. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. ACM, pp 233–246

67. Cafarella MJ, Halevy A, Khoussainova N (2009) Data integration for the relational web. Proc VLDB Endowment 2(1):1090–1101

68. Maletic JI, Marcus A (2000) Data cleansing: beyond integrity analysis. In: IQ. Citeseer, pp 200–209

69. Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1-2):83–113

70. Chen H, Ku W-S, Wang H, Sun M-T (2010) Leveraging spatio-temporal redundancy for RFID data cleansing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 51–62

71. Zhao Z, Ng W (2012) A model-based approach for RFID data stream cleansing. In: Proceedings of the 21st ACM international conference on information and knowledge management. ACM, pp 862–871

72. Khoussainova N, Balazinska M, Suciu D (2008) Probabilistic event extraction from RFID data. In: Data Engineering, 2008, IEEE 24th international conference on ICDE 2008. IEEE, pp 1480–1482

73. Herbert KG, Wang JTL (2007) Biological data cleaning: a case study. Int J Inf Qual 1(1):60–82

74. Tsai T-H, Lin C-Y (2012) Exploring contextual redundancy in improving object-based video coding for video sensor networks surveillance. IEEE Transac Multmed 14(3):669–682

75. Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 269–278

76. Kamath U, Compton J, Dogan RI, Jong KD, Shehu A (2012) An evolutionary algorithm approach for feature generation from sequence data and its application to DNA splice site prediction. IEEE/ACM Transac Comput Biol Bioinforma (TCBB) 9(5):1387–1398

77. Leung K-S, Lee KH, Wang J-F, Ng EYT, Chan HLY, Tsui SKW, Mok TSK, Tse PC-H, Sung JJ-Y (2011) Data mining on DNA sequences of hepatitis B virus. IEEE/ACM Transac Comput Biol Bioinforma 8(2):428–440

78. Huang Z, Shen H, Liu J, Zhou X (2011) Effective data co-reduction for multimedia similarity search. In: Proceedings of the 2011 ACM SIGMOD international conference on management of data. ACM, pp 1021–1032

79. Bleiholder J, Naumann F (2008) Data fusion. ACM Comput Surv (CSUR) 41(1):1

80. Brewer EA (2000) Towards robust distributed systems. In: PODC, p 7

81. Gilbert S, Lynch N (2002) Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59

82. McKusick MK, Quinlan S (2009) GFS: evolution on fast-forward. ACM Queue 7(7):10

83. Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J (2008) SCOPE: easy and efficient parallel processing of massive data sets. Proc VLDB Endowment 1(2):1265–1276

84. Beaver D, Kumar S, Li HC, Sobel J, Vajgel P et al (2010) Finding a needle in Haystack: Facebook's photo storage. In: OSDI, vol 10, pp 1–8

85. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: Amazon's highly available key-value store. In: SOSP, vol 7, pp 205–220

86. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In: Proceedings of the twenty-ninth annual ACM symposium on theory of computing. ACM, pp 654–663

87. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2):4

88. Burrows M (2006) The Chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th symposium on operating systems design and implementation. USENIX Association, pp 335–350

89. Lakshman A, Malik P (2009) Cassandra: structured storage system on a P2P network. In: Proceedings of the 28th ACM symposium on principles of distributed computing. ACM, pp 5–5

90. George L (2011) HBase: the definitive guide. O'Reilly Media Inc

91. Judd D (2008) hypertable-0.9.0.4-alpha

92. Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc

93. Crockford D (2006) The application/json media type for JavaScript object notation (JSON)

94. Murty J (2009) Programming Amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc

95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc

96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in MapReduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 975–986

97. Yang H-C, Parker DS (2009) Traverse: simplified indexing on large map-reduce-merge clusters. In: Database systems for advanced applications. Springer, pp 308–322

98. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with Sawzall. Sci Program 13(4):277–298

99. Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U (2009) Building a high-level dataflow system on top of map-reduce: the Pig experience. Proceedings VLDB Endowment 2(2):1414–1425

100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629

101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72

102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14

103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: Parallel and distributed processing, 2008, IEEE international symposium on IPDPS 2008. IEEE, pp 1–11

104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146

105. Bu Y, Bill H, Balazinska M, Ernst MD (2010) HaLoop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296

106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative MapReduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing. ACM, pp 810–818

107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2

108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: MapReduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing. ACM, p 7

109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) CIEL: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on networked systems design and implementation, p 9

110. Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York

111. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

112. What analytics, data mining, big data software you used in the past 12 months for a real project? (2012) http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html

113. Berthold MR, Cebron N, Dill F, Gabriel TR, Kötter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer

114. Sallam RL, Richardson J, Hagerty J, Hostmann B (2011) Magic quadrant for business intelligence platforms. Gartner Group, Stamford, CT

115. Beyond the PC. Special Report on Personal Technology (2011)

116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A et al (2011) The iPlant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on software quality assurance. ACM, pp 70–77

118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers. ACM, pp 277–286

119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM SIGMOD Record 34(2):18–26

120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM SIGMOD Record 33(1):50–57

121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Transac Manag Inform Syst (TMIS) 3(2):7

122. Manning CD, Schütze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press

123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Transac Neural Netw 13(5):1163–1177

124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11

125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117

126. Konopnicki D, Shmueli O (1995) W3QS: a query system for the world-wide web. In: VLDB, vol 95, pp 54–65

127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640

128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval. ACM, p 2

129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25

130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19

131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819

132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939

133. Barragáns-Martínez AB, Costa-Montenegro E, Burguillo JC, Rey-López M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend TV programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311

134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91

135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia. ACM, pp 469–478

136. Hirsch JE (2005) An index to quantify an individual's scientific research output. Proc Natl Acad Sci USA 102(46):16569

137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company

138. Aggarwal CC (2011) An introduction to social network data analytics. Springer

139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 1046–1054

140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied computing. ACM, pp 1115–1116

141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Transac Knowl Discov Data (TKDD) 5(2):10

142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on World Wide Web. ACM, pp 631–640

143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on web mining and social network analysis. ACM, pp 16–25

144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. ACM, pp 315–321

145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 145–158

146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using Google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 131–144

147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 1007–1016

148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 807–816

149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on web search and data mining. ACM, pp 657–666

150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360

151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (last accessed 5 May 2013)

152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51

153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive RFID, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 2

154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval. ACM, pp 469–478

155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated, multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium. ACM, pp 445–454

156. Tasevski P (2011) Password attacks and generation strategies. Tartu University: Faculty of Mathematics and Computer Sciences



– SimpleDB: SimpleDB is a distributed database and a web service of Amazon [94]. Data in SimpleDB is organized into various domains in which data may be stored, acquired, and queried. Domains include different properties and name/value pair sets of projects. Data is copied to different machines at different data centers in order to ensure data safety and improve performance. This system does not support automatic partitioning and thus cannot be expanded as the data volume changes. SimpleDB allows users to query with SQL. It is worth noting that SimpleDB can assure eventual consistency but does not support Multi-Version Concurrency Control (MVCC). Therefore, conflicts therein cannot be detected from the client side.

– CouchDB: Apache CouchDB is a document-oriented database written in Erlang [95]. Data in CouchDB is organized into documents consisting of fields named by keys/names and values, which are stored and accessed as JSON objects. Every document is provided with a unique identifier. CouchDB allows access to database documents through a RESTful HTTP API. If a document needs to be modified, the client must download the entire document, modify it, and then send it back to the database. After a document is rewritten once, the identifier will be updated. CouchDB utilizes optimal copying to obtain scalability without a sharing mechanism. Since various CouchDBs may be executed along with other transactions simultaneously, any kind of replication topology can be built. The consistency of CouchDB relies on the copying mechanism. CouchDB supports MVCC with historical hash records.

Big data are generally stored in hundreds and even thousands of commercial servers. Thus, the traditional parallel models, such as Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), may not be adequate to support such large-scale parallel programs. Recently, some proposed parallel programming models have effectively improved the performance of NoSQL and reduced the performance gap to relational databases. Therefore, these models have become the cornerstone for the analysis of massive data.

– MapReduce: MapReduce [22] is a simple but powerful programming model for large-scale computing that uses large clusters of commercial PCs to achieve automatic parallel processing and distribution. The MapReduce computing model has only two functions, i.e., Map and Reduce, both of which are programmed by users. The Map function processes input key-value pairs and generates intermediate key-value pairs. Then, MapReduce combines all the intermediate values related to the same key and transmits them to the Reduce function, which further compresses the value set into a smaller set. MapReduce has the advantage that it avoids the complicated steps for developing parallel applications, e.g., data scheduling, fault-tolerance, and inter-node communications. The user only needs to program the two functions to develop a parallel application. The initial MapReduce framework did not support multiple datasets in a task, which has been mitigated by some recent enhancements [96, 97].

Over the past decades, programmers have become familiar with the advanced declarative language of SQL, often used in relational databases, for task description and dataset analysis. However, the succinct MapReduce framework only provides two nontransparent functions, which cannot cover all the common operations. Therefore, programmers have to spend time on programming the basic functions, which are typically hard to maintain and reuse. In order to improve programming efficiency, some advanced language systems have been proposed, e.g., Sawzall [98] of Google, Pig Latin [99] of Yahoo, Hive [100] of Facebook, and Scope [87] of Microsoft.

– Dryad: Dryad [101] is a general-purpose distributed execution engine for processing parallel applications of coarse-grained data. The operational structure of Dryad is a directed acyclic graph, in which vertexes represent programs and edges represent data channels. Dryad executes operations on the vertexes in clusters and transmits data via data channels, including files, TCP connections, and shared-memory FIFOs. During operation, resources in a logic operation graph are automatically mapped to physical resources.

The operation structure of Dryad is coordinated by a central program called the job manager, which can be executed in clusters or on workstations through the network. A job manager consists of two parts: 1) application codes, which are used to build a job communication graph, and 2) program library codes, which are used to arrange available resources. All kinds of data are directly transmitted between vertexes. Therefore, the job manager is only responsible for decision-making and does not obstruct any data transmission.

In Dryad, application developers can flexibly choose any directed acyclic graph to describe the communication modes of the application and express data transmission mechanisms. In addition, Dryad allows vertexes to use any amount of input and output data, while MapReduce supports only one input and output set. DryadLINQ [102] is the advanced language of Dryad and is used to integrate the aforementioned SQL-like language execution environment.

– All-Pairs: All-Pairs [103] is a system specially designed for biometrics, bio-informatics, and data mining applications. It focuses on comparing element pairs in two datasets by a given function. All-Pairs can be expressed as a three-tuple (Set A, Set B, and Function F), in which Function F is utilized to compare all elements in Set A and Set B. The comparison result is an output matrix M, which is also called the Cartesian product or cross join of Set A and Set B.

All-Pairs is implemented in four phases: system modeling, distribution of input data, batch job management, and result collection. In Phase I, an approximation model of system performance is built to evaluate how much CPU resource is needed and how to conduct job partition. In Phase II, a spanning tree is built for data transmissions, which lets the workload of every partition retrieve input data effectively. In Phase III, after the data flow is delivered to proper nodes, the All-Pairs engine builds a batch-processing submission for jobs in partitions, while sequencing them in the batch processing system and formulating a node running command to acquire data. In the last phase, after the job completion of the batch processing system, the extraction engine collects results and combines them in a proper structure, which is generally a single file list in which all results are put in order.

– Pregel: The Pregel [104] system of Google facilitates the processing of large-sized graphs, e.g., analysis of network graphs and social networking services. A computational task is expressed by a directed graph constituted by vertexes and directed edges. Every vertex is associated with a modifiable, user-defined value, and every directed edge is associated with its source vertex and consists of a user-defined value and the identifier of a target vertex. When the graph is built, the program conducts iterative calculations, called supersteps, among which global synchronization points are set, until algorithm completion and output completion. In every superstep, vertex computations are parallel, and every vertex executes the same user-defined function to express a given algorithm logic. Every vertex may modify its own status and that of its output edges, receive messages sent in the previous superstep, send messages to other vertexes, and even modify the topological structure of the entire graph. Edges are not provided with corresponding computations. The function of any vertex may be deactivated by suspension. When all vertexes are in an inactive status without any message to transmit, the entire program execution is completed. The Pregel program output is a set consisting of the values output from all the vertexes. Generally speaking, the input and output of a Pregel program are isomorphic directed graphs.
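The superstep mechanism described for Pregel can be illustrated with a small single-machine simulation. The sketch below is not Pregel's API; the function name `pregel_max` and the dictionary-based graph representation are assumptions made for illustration. It propagates the largest vertex value through a directed graph: each vertex reads the messages sent in the previous superstep, updates its value, forwards the new value along its outgoing edges only when it changed, and then becomes inactive until a message wakes it.

```python
from collections import defaultdict


def pregel_max(values, edges, max_supersteps=30):
    """Propagate the maximum vertex value through a directed graph.

    values: dict mapping vertex -> initial value
    edges:  dict mapping vertex -> list of target vertexes (outgoing edges);
            every target is assumed to be a key of `values`
    """
    values = dict(values)          # work on a copy
    active = set(values)           # every vertex starts active
    inbox = defaultdict(list)      # messages to be read in the current superstep

    for superstep in range(max_supersteps):
        if not active and not inbox:
            break                  # all vertexes inactive, no messages in flight
        outbox = defaultdict(list)
        # A vertex runs if it is active or was woken by an incoming message.
        for v in active | set(inbox):
            new_value = max([values[v]] + inbox.get(v, []))
            changed = new_value != values[v]
            values[v] = new_value
            # Send along outgoing edges in the first superstep or after a change.
            if superstep == 0 or changed:
                for target in edges.get(v, []):
                    outbox[target].append(new_value)
        active = set()             # every vertex halts; messages reactivate it
        inbox = outbox             # global synchronization point
    return values


ring = {"a": ["b"], "b": ["c"], "c": ["a"]}
result = pregel_max({"a": 3, "b": 6, "c": 2}, ring)
```

In this sketch the loop over vertexes runs sequentially; in a real deployment the per-vertex computations within a superstep would run in parallel across machines, with the swap of `inbox` and `outbox` playing the role of the synchronization barrier.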

Inspired by the above programming models, other research has also focused on programming models for more complex computational tasks, e.g., iterative computations [105, 106], fault-tolerant memory computations [107], incremental computations [108], and flow control decision-making related to data [109].
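Of the programming models described above, MapReduce is the simplest to sketch in a few lines. The following single-process illustration is a toy, not the Hadoop or Google API; the names `run_mapreduce`, `map_fn`, and `reduce_fn` are assumptions. It shows the three steps from the description: the Map function emits intermediate key-value pairs, the framework groups values by key, and the Reduce function compresses each value set into a smaller result.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: emit an intermediate (word, 1) pair for every word in the line."""
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """Reduce: compress all values of one key into a single count."""
    return word, sum(counts)


def run_mapreduce(records, map_fn, reduce_fn):
    """Toy driver: apply map, group intermediate values by key, apply reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):
            groups[k].append(v)      # the "shuffle" step groups by key
    return dict(reduce_fn(k, vs) for k, vs in groups.items())


docs = [(0, "big data big value"), (1, "data analysis")]
word_counts = run_mapreduce(docs, map_fn, reduce_fn)
```

The user writes only `map_fn` and `reduce_fn`; scheduling, grouping, and (in a real framework) fault tolerance and inter-node communication stay inside the driver.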

5 Big data analysis

The analysis of big data mainly involves analytical methods for traditional data and big data, analytical architecture for big data, and software used for mining and analysis of big data. Data analysis is the final and the most important phase in the value chain of big data, with the purpose of extracting useful values and providing suggestions or decisions. Different levels of potential values can be generated through the analysis of datasets in different fields [10]. However, data analysis is a broad area, which frequently changes and is extremely complex. In this section, we introduce the methods, architectures, and tools for big data analysis.

5.1 Traditional data analysis

Traditional data analysis means to use proper statistical methods to analyze massive data, to concentrate, extract, and refine useful data hidden in a batch of chaotic datasets, and to identify the inherent law of the subject matter, so as to maximize the value of data. Data analysis plays a huge guidance role in making development plans for a country, understanding customer demands for commerce, and predicting market trends for enterprises. Big data analysis can be deemed as the analysis technique for a special kind of data. Therefore, many traditional data analysis methods may still be utilized for big data analysis. Several representative traditional data analysis methods are examined in the following, many of which are from statistics and computer science.

– Cluster Analysis: a statistical method for grouping objects, and specifically, classifying objects according to some features. Cluster analysis is used to differentiate objects with particular features and divide them into some categories (clusters) according to these features, such that objects in the same category will have high homogeneity while different categories will have high heterogeneity. Cluster analysis is an unsupervised learning method without training data.

– Factor Analysis: basically targeted at describing the relation among many elements with only a few factors, i.e., grouping several closely related variables into a factor; the few factors are then used to reveal most of the information of the original data.

– Correlation Analysis: an analytical method for determining the law of relations, such as correlation, correlative dependence, and mutual restriction, among observed phenomena and accordingly conducting forecast and control. Such relations may be classified into two types: (i) function, reflecting the strict dependence relationship among phenomena, which is also called a definitive dependence relationship; (ii) correlation, an undetermined or inexact dependence relation, in which the numerical value of one variable may correspond to several numerical values of the other variable, and such numerical values fluctuate regularly around their mean values.

– Regression Analysis: a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies dependence relationships among variables hidden by randomness. Regression analysis may make complex and undetermined correlations among variables simple and regular.

– A/B Testing: also called bucket testing. It is a technique for determining how to improve target variables by comparing the tested groups. Big data will require a large number of tests to be executed and analyzed.

– Statistical Analysis: Statistical analysis is based on statistical theory, a branch of applied mathematics. In statistical theory, randomness and uncertainty are modeled with probability theory. Statistical analysis can provide a description and an inference for big data. Descriptive statistical analysis can summarize and describe datasets, while inferential statistical analysis can draw conclusions from data subject to random variations. Statistical analysis is widely applied in the economic and medical care fields [110].

– Data Mining Algorithms: Data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. In 2006, the IEEE International Conference on Data Mining (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [111], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, and CART, etc. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and link mining, all of which are among the most important problems in data mining research.
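Two of the traditional methods listed above, correlation analysis and regression analysis, reduce to short formulas in the simplest case of two variables. The sketch below is a minimal illustration under that assumption (one explanatory variable, ordinary least squares); the function names `pearson_r` and `fit_line` are chosen here for illustration and do not come from the survey.

```python
from math import sqrt


def _centered_sums(xs, ys):
    """Return (sxy, sxx, syy, mean_x, mean_y) for two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy, sxx, syy, mx, my


def pearson_r(xs, ys):
    """Correlation analysis: strength of the linear relation, in [-1, 1]."""
    sxy, sxx, syy, _, _ = _centered_sums(xs, ys)
    return sxy / sqrt(sxx * syy)


def fit_line(xs, ys):
    """Regression analysis: least-squares slope and intercept of y on x."""
    sxy, sxx, _, mx, my = _centered_sums(xs, ys)
    slope = sxy / sxx
    return slope, my - slope * mx


xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]            # exactly y = 2x + 1
slope, intercept = fit_line(xs, ys)
r = pearson_r(xs, ys)
```

Here the data follow a strict functional relation, so the correlation coefficient is 1; with noisy observations the same code returns the best-fitting line and a coefficient of smaller magnitude.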

5.2 Big data analytic methods

In the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data so as to bring value to enterprises and individuals. At present, the main processing methods of big data are as follows.

– Bloom Filter: A Bloom Filter consists of a series of hash functions. The principle of the Bloom Filter is to store hash values of data rather than the data itself by utilizing a bit array, which is in essence a bitmap index that uses hash functions to conduct lossy compression storage of data. It has such advantages as high space efficiency and high query speed, but also has some disadvantages in misrecognition (false positives) and deletion.

– Hashing: a method that essentially transforms data into shorter fixed-length numerical values or index values. Hashing has such advantages as rapid reading and writing and high query speed, but it is hard to find a sound hash function.

– Index: An index is always an effective method to reduce the expense of disk reading and writing and to improve insertion, deletion, modification, and query speeds, both in traditional relational databases that manage structured data and in other technologies that manage semi-structured and unstructured data. However, an index has the disadvantage of the additional cost of storing index files, which must be maintained dynamically when data is updated.

– Trie: also called trie tree, a variant of the hash tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of the trie is to utilize common prefixes of character strings to reduce comparisons on character strings to the greatest extent, so as to improve query efficiency.

– Parallel Computing: Compared to traditional serial computing, parallel computing refers to simultaneously utilizing several computing resources to complete a computation task. Its basic idea is to decompose a problem into parts and assign them to several separate processes to be independently completed, so as to achieve co-processing. Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see a comparison in Table 1).
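The Bloom Filter item in the list above can be made concrete with a short sketch. The class below is an illustrative assumption, not a reference implementation: the bit-array size, the number of hash functions, and the choice of deriving the hash functions from seeded SHA-256 digests are all picked for brevity. It stores only bit positions derived from hash values, never the items themselves, which is why lookups can yield false positives and individual items cannot be deleted.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over a fixed-size bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)   # all bits start at 0

    def _positions(self, item):
        """Yield one bit position per hash function (seeded SHA-256 here)."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bf = BloomFilter()
bf.add("hadoop")
bf.add("mapreduce")
```

After the two insertions, `"hadoop" in bf` is guaranteed to be true, while a membership test for an item that was never added is usually, but not always, false.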

Although parallel computing systems or tools such as MapReduce or Dryad are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages are being developed based on these systems. Such high-level languages include Sawzall, Pig, and Hive used for MapReduce, as well as Scope and DryadLINQ used for Dryad.
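Returning to the trie from the list of methods above, its two stated uses, rapid retrieval and word frequency statistics, can be sketched as follows. The class and method names are assumptions for illustration. Words that share a prefix share the same path from the root, so a lookup costs one step per character regardless of how many words are stored.

```python
class Trie:
    """Minimal trie node; the root node represents the empty prefix."""

    def __init__(self):
        self.children = {}   # character -> child Trie node
        self.count = 0       # number of inserted words ending at this node

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.count += 1

    def _find(self, prefix):
        """Return the node reached by following prefix, or None."""
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def frequency(self, word):
        """Word frequency statistics: how often word was inserted."""
        node = self._find(word)
        return node.count if node is not None else 0

    def has_prefix(self, prefix):
        """Rapid retrieval: does any stored word start with prefix?"""
        return self._find(prefix) is not None


trie = Trie()
for w in ["data", "data", "database", "dryad"]:
    trie.insert(w)
```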

5.3 Architecture for big data analysis

Because of the 4Vs of big data, different analytical architectures shall be considered for different application requirements.


Table 1 Comparison of MPI, MapReduce, and Dryad

| | MPI | MapReduce | Dryad |
|---|---|---|---|
| Deployment | Computing node and data storage arranged separately (data should be moved to computing node) | Computing and data storage arranged at the same node (computing should be close to data) | Computing and data storage arranged at the same node (computing should be close to data) |
| Resource management / scheduling | – | Workqueue (Google), HOD (Yahoo) | Not clear |
| Low-level programming | MPI API | MapReduce API | Dryad API |
| High-level programming | – | Pig, Hive, Jaql, ··· | Scope, DryadLINQ |
| Data storage | The local file system, NFS, ··· | GFS (Google), HDFS (Hadoop), KFS, Amazon S3, ··· | NTFS, Cosmos DFS |
| Task partitioning | User manually partitions the tasks | Automation | Automation |
| Communication | Messaging, remote memory access | Files (local FS, DFS) | Files, TCP pipes, shared-memory FIFOs |
| Fault-tolerant | Checkpoint | Task re-execute | Task re-execute |

5.3.1 Real-time vs. offline analysis

According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

– Real-time analysis: mainly used in e-commerce and finance. Since data constantly changes, rapid data analysis is needed and analytical results shall be returned with a very short delay. The main existing architectures of real-time analysis include (i) parallel processing clusters using traditional relational databases, and (ii) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

– Offline analysis: usually used for applications without high requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally conducts analysis by importing logs into a special platform through data acquisition tools. Under the big data setting, many Internet enterprises utilize the offline analysis architecture based on Hadoop in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples of such acquisition tools include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool Timetunnel, and Chukwa of Hadoop, etc. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.

5.3.2 Analysis at different levels

Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive-level analysis, which are examined in the following.

– Memory-level analysis: for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays, the memory of a server cluster surpasses hundreds of GB, and even the TB level is common. Therefore, in-memory database technology may be used, and hot data shall reside in memory so as to improve analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis. MongoDB is a representative memory-level analytical architecture. With the development of SSDs (Solid-State Drives), the capacity and performance of memory-level data analysis have been further improved and it has been widely applied.

– BI analysis: for the case when the data scale surpasses the memory level but may be imported into the BI analysis environment. The currently mainstream BI products are provided with data analysis plans that support data volumes above the TB level.

– Massive analysis: for the case when the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analysis utilizes HDFS of Hadoop to store data and uses MapReduce for data analysis. Most massive analysis belongs to the offline analysis category.


5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithms differ greatly from each other according to different kinds of data and application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.
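As a minimal sketch of an application that is amenable to parallel processing, the example below decomposes a sum of squares into independent chunks, computes each partial result in a separate worker, and combines the partial results at the end. The function names are assumptions for illustration, and a thread pool from the Python standard library is used only to keep the example short and portable; for CPU-bound work in CPython a process pool or a cluster framework would be the more realistic choice.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(data, n_chunks):
    """Split data into at most n_chunks contiguous pieces of similar size."""
    size = max(1, -(-len(data) // n_chunks))   # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def partial_sum_of_squares(chunk):
    """Work done independently by each worker on its own chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, workers=4):
    """Decompose, compute partial results concurrently, then combine."""
    chunks = chunked(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum_of_squares, chunks))


total = parallel_sum_of_squares(list(range(10)))
```

The decomposition works here because the partial sums do not depend on each other; algorithms whose steps depend on earlier results need a different design.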

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the top five most widely used software, according to a survey of "What Analytics, Data mining, Big Data software that you used in the past 12 months for a real project?" of 798 professionals made by KDnuggets in 2012 [112].

– R (30.7 %): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. When computing-intensive tasks are executed, code programmed in C, C++, and Fortran may be called in the R environment. In addition, skilled users can directly call R objects in C. Actually, R is a realization of the S language, which is an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and drawing plots. Compared to S, R is more popular since it is open source. R ranked first in the KDnuggets 2012 survey. Furthermore, in a survey of "Design languages you have used for data mining/analysis in the past year" in 2012, R was also in first place, ahead of SQL and Java. Due to the popularity of R, database manufacturers such as Teradata and Oracle have released products supporting R.

– Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins with powerful functions for data analysis, such as Analysis ToolPak and Solver Add-in, are integrated initially, but such plug-ins can be used only if users enable them. Excel is also the only commercial software among the top five.

– Rapid-I RapidMiner (26.7 %): RapidMiner is an open source software used for data mining, machine learning, and predictive analysis. In an investigation by KDnuggets in 2011, it was more frequently used than R (ranked first). Data mining and machine learning programs provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment. The data mining flow is described in XML and displayed through a graphical user interface (GUI). RapidMiner is written in Java. It integrates the learners and evaluation methods of Weka, and works with R. Functions of RapidMiner are implemented by connecting processes that include various operators. The entire flow can be deemed as a production line of a factory, with original data as input and model results as output. The operators can be considered as specific functions with different input and output characteristics.

– KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, and open-source platform for data integration, data processing, data analysis, and data mining [113]. It allows users to create data flows or data channels in a visual manner, to selectively run some or all analytical procedures, and provides analytical results, models, and interactive views. KNIME is written in Java and, being based on Eclipse, provides additional functions as plug-ins. Through plug-ins, users can insert processing modules for files, pictures, and time series, and integrate with various open source projects, e.g., R and Weka. KNIME controls data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted in a visual environment. KNIME is designed as a module-based and extensible framework. There is no dependence between its processing units and data containers, making it suitable for distributed environments and independent development. In addition, KNIME is easy to extend: developers can readily add new nodes and views.

– Weka/Pentaho (14.8 %): Weka, short for Waikato Environment for Knowledge Analysis, is free and open-source machine learning and data mining software written in Java. Weka provides functions such as data processing, feature selection, classification, regression, clustering, association rules, and visualization. Pentaho is one of the most popular open-source BI software packages. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, data mining, and other aspects of BI. Weka’s data processing algorithms are also integrated in Pentaho and can be called directly.

                                      6 Big data applications

In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful value via judgments, suggestions, support, or decisions.

Mobile Netw Appl (2014) 19:171–209 193

However, data analysis involves a wide range of applications, which frequently change and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields, including structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

6.1 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically includes large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have emerged over the past decades. For example, as early as the 1990s, BI became a prevailing technology for business applications, and network search engines based on massive data mining emerged in the early 21st century. Some potential and influential applications from different fields, together with their data and analysis characteristics, are discussed as follows.

– Evolution of Commercial Applications: The earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. Analytical techniques used in such systems prevailed in the 1990s and were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, score cards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have been providing a unique opportunity for organizations to have an online presence and directly interact with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted with text analysis and website mining techniques. As reported in [115], the number of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the sensor-based Internet of Things are opening a new generation of innovative applications and require considerably larger capacity to support location-sensing, people-oriented, and context-aware operation.

– Evolution of Network Applications: The early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to mining email contents and building search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages full of various kinds of data, such as text, images, audio, videos, and interactive content. Therefore, plenty of advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share content.

– Evolution of Scientific Applications: Scientific research in many fields, such as astrophysics, oceanology, genomics, and environmental research, is acquiring massive data with high-throughput sensors and instruments. The US National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, inter-operative analysis software, and data services to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets vary widely in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and comprehensive coverage is not easy, we will focus on the key problems and technologies in data analysis in the following discussions.

6.2 Big data analysis fields


6.2.1 Structured data analysis

Business applications and scientific research may generate massive structured data, the management and analysis of which rely on mature commercialized technologies, such as RDBMS, data warehouse, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning based on exact mathematical models and powerful algorithms has been applied to anomaly detection [117] and energy control [118]. Exploiting data characteristics, time and space mining can extract knowledge structures hidden in high-speed data flows and sensors [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially in process analysis with event data [121].

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to have more business potential than structured data analysis. Generally, text analysis is a process of extracting useful information and knowledge from unstructured text. Text mining is inter-disciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and data mining in particular. Most text mining systems are based on text representations and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammars [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, image, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which will be discussed in Section 6.2.4. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves the mining of semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., email, newsgroup management, and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to model and integrate data on the Web so as to support more complex queries than keyword-based searches.

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagrams of links within a website or among multiple websites. Models are built based on topological structures provided by hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages. The topic-oriented crawler is another successful case that utilizes the models [127].
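To make the idea of ranking pages from hyperlink topology more concrete, the following is a minimal sketch of a textbook power-iteration form of PageRank. It is an illustration under stated assumptions, not the implementation described in [125]: the damping factor of 0.85, the fixed iteration count, the even redistribution of rank from pages without outgoing links, and the `toy_web` graph are all choices made here for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping page -> list of linked pages.

    `damping` and `iterations` are illustrative defaults, not values from the survey.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            if targets:
                # Each page passes an equal share of its rank along every outgoing link.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page site used only for illustration.
toy_web = {
    "home": ["docs", "blog"],
    "docs": ["home"],
    "blog": ["home", "docs"],
    "orphan": ["home"],
}
ranks = pagerank(toy_web)
top_page = max(ranks, key=ranks.get)
```

In this toy graph, "home" receives links from every other page and therefore ends up with the highest score, while "orphan", which nothing links to, keeps only the baseline share.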

Web usage mining aims to mine auxiliary data generated by Web dialogues or activities, whereas Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers and proxy servers, browsers’ history records, user profiles, registration data, user sessions or trades, cache, user queries, bookmark data, mouse clicks and scrolls, and any other kinds of data generated through interaction with the Web. As Web services and Web 2.0 are becoming mature and popular, Web usage data will have increasingly high variety. Web usage mining plays key roles in personalized space, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.
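As an illustration of how raw access logs might be turned into the user sessions mentioned above, the sketch below groups log records per user and starts a new session after a period of inactivity. The record layout `(user, timestamp_seconds, url)`, the 30-minute timeout, and the sample `log` are assumptions introduced for this example; real logs would first need parsing and user identification.

```python
from collections import defaultdict


def sessionize(records, timeout=1800):
    """Split (user, timestamp_seconds, url) records into per-user sessions.

    A new session starts when the gap since the user's previous request
    exceeds `timeout` seconds (an illustrative threshold).
    """
    by_user = defaultdict(list)
    for user, ts, url in records:
        by_user[user].append((ts, url))

    sessions = []
    for user, hits in by_user.items():
        hits.sort()  # order each user's requests by time
        current = [hits[0][1]]
        for (prev_ts, _), (ts, url) in zip(hits, hits[1:]):
            if ts - prev_ts > timeout:
                sessions.append((user, current))
                current = []
            current.append(url)
        sessions.append((user, current))
    return sessions


# Hypothetical access-log records used only for illustration.
log = [
    ("u1", 0, "/home"), ("u1", 60, "/item/42"), ("u2", 100, "/home"),
    ("u1", 4000, "/home"), ("u1", 4100, "/cart"),
]
sessions = sessionize(log)
```

Here user "u1" is split into two sessions because more than 1800 seconds pass between the second and third request; the resulting page sequences are the kind of input that usage-pattern mining or recommendation could then work on.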


6.2.4 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) have been growing at an amazing speed. Multimedia analysis extracts useful knowledge from such data and interprets its semantics. Because multimedia data is heterogeneous and most of it contains richer information than simple structured data or text data, extracting information faces the huge challenge of the semantic gap. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection.

Audio summarization can be accomplished by extracting the prominent words or phrases from metadata or by synthesizing a new representation. Video summarization identifies the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods use a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied in many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video and apply additional smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.

Multimedia annotation inserts labels to describe the contents of images and videos at both syntactic and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time and labor intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic gap. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to explore manual and automatic multimedia annotation simultaneously [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users in conveniently and quickly looking up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, including shot boundary detection, key frame extraction, and scene segmentation. Based on the result of structural analysis, the second procedure is feature extraction, which mainly includes further mining the features of key frames, objects, texts, and movements, which are the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find patterns in video contents and put videos into predefined categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos. The retrieval results are then optimized through relevance feedback.
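The query-and-retrieval step can be sketched as a nearest-neighbor lookup over extracted feature vectors. The example below assumes that each indexed video has already been reduced to a fixed-length numeric feature vector and uses cosine similarity as the similarity measurement; the feature values, the video identifiers, and the choice of cosine similarity are assumptions made for illustration and are not prescribed by [130, 131].

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(index, query, top_k=2):
    """Return the ids of the top_k indexed videos most similar to the query vector."""
    scored = sorted(index.items(), key=lambda kv: cosine(kv[1], query), reverse=True)
    return [video_id for video_id, _ in scored[:top_k]]


# Hypothetical index: video id -> feature vector produced by an earlier extraction step.
index = {
    "news_clip": [0.9, 0.1, 0.0],
    "soccer_match": [0.1, 0.8, 0.3],
    "tennis_final": [0.0, 0.7, 0.5],
}
results = retrieve(index, query=[0.0, 0.9, 0.2])
```

With these toy vectors the two sports videos are returned ahead of the news clip. A relevance feedback loop would then adjust the query vector or the ranking based on which candidates the user marks as relevant.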

Multimedia recommendation recommends specific multimedia content according to users’ preferences. It has proven to be an effective approach to providing personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. The content-based methods identify general features of users or their interests and recommend other content with similar features to users. These methods largely rely on content similarity measurement, but most of them are troubled by limited analysis and over-specialization. The collaborative-filtering-based methods identify groups with similar interests and recommend content to group members according to their behavior [132]. Presently, hybrid methods have been introduced, which integrate the advantages of the aforementioned two types of methods to improve recommendation quality [133].
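The collaborative-filtering idea can be illustrated with a small user-based sketch: users whose rating vectors are similar to the target user's contribute weighted votes for items the target user has not yet seen. The rating scale, the user and clip names, and the use of cosine similarity with a similarity-weighted average are assumptions for this example rather than the specific method of [132].

```python
import math


def similarity(a, b):
    """Cosine similarity between two users' rating dicts (missing items count as 0)."""
    shared = set(a) & set(b)
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(r * r for r in a.values()))
    norm_b = math.sqrt(sum(r * r for r in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(ratings, user, top_k=2):
    """Rank items unseen by `user` using a similarity-weighted average of other users' ratings."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = similarity(ratings[user], other_ratings)
        if sim <= 0:
            continue  # ignore users with no overlap
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    predicted = {item: scores[item] / weights[item] for item in scores}
    return sorted(predicted, key=predicted.get, reverse=True)[:top_k]


# Hypothetical ratings on a 1-5 scale.
ratings = {
    "ann": {"clip_a": 5, "clip_b": 4},
    "bob": {"clip_a": 5, "clip_b": 5, "clip_c": 5, "clip_d": 1},
    "eve": {"clip_a": 1, "clip_d": 5},
}
suggestions = recommend(ratings, "ann")
```

Because "bob" rates the shared clips much like "ann" does, his high rating for "clip_c" dominates, and "clip_c" is suggested ahead of "clip_d". A content-based or hybrid system would additionally compare features of the clips themselves.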

The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains text descriptions related to concepts and video examples [134]. In [135], the author proposed a new algorithm for special multimedia event detection using a few positive training examples. The research on video event detection is still in its infancy and mainly focuses on sports or news events, running or abnormal events in surveillance videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graph structures describing the communications between two entities. The content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges


and opportunities for data analysis. From the data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

The research on link-based structural analysis has focused on link prediction, community discovery, social network evolution, and social influence analysis. SNS may be visualized as graphs, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertices and edges are continually added to the graphs. Link prediction is to predict the possibility of a future connection between two vertices. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and utilizes the existing link information to train binary classifiers that predict future links [139]. Probabilistic methods aim to build models for connection probabilities among vertices in SNS [140]. Linear algebra methods compute the similarity between two vertices according to the singular similarity matrix [141]. A community is represented by a sub-graph in which the edges connecting vertices within the sub-graph feature high density, while the edges between two sub-graphs feature much lower density [142].
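As a simple illustration of link prediction on such a graph, the sketch below scores every pair of unconnected vertices by the Jaccard overlap of their neighborhoods. This neighborhood-overlap score is one commonly used structural feature and is swapped in here for brevity; it is not the classifier of [139], the probabilistic model of [140], or the linear algebra method of [141]. The user names and edges are hypothetical.

```python
from itertools import combinations


def build_graph(edges):
    """Build an undirected adjacency map from a list of (u, v) edges."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def jaccard_scores(graph):
    """Score each unconnected vertex pair by |common neighbors| / |all neighbors|."""
    scores = {}
    for u, v in combinations(sorted(graph), 2):
        if v in graph[u]:
            continue  # already linked, nothing to predict
        union = graph[u] | graph[v]
        scores[(u, v)] = len(graph[u] & graph[v]) / len(union) if union else 0.0
    return scores


# Hypothetical friendship edges used only for illustration.
edges = [
    ("amy", "bo"), ("amy", "cy"), ("bo", "cy"),
    ("bo", "dee"), ("cy", "dee"), ("dee", "eli"),
]
graph = build_graph(edges)
scores = jaccard_scores(graph)
best_pair = max(scores, key=scores.get)
```

In this toy network "amy" and "dee" share two of their three combined neighbors, so they are the most likely future link. In a feature-based classifier, a score like this would be one input feature among several.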

Many methods for community detection have been proposed and studied, most of which are topology-based objective functions relying on the concept of capturing community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. The research on SNS evolution aims to look for laws and deduction models to interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generation methods have been proposed to assist network and system design [147].
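The notion of a community as a sub-graph that is dense inside and sparsely connected to other sub-graphs can be checked numerically for a candidate partition. The following sketch computes the fraction of possible edges that are actually present within one group and between two groups. It only evaluates a given grouping and is not a detection algorithm such as the one in [143]; the graph, the two groups, and the assumption of disjoint groups with each undirected edge listed once are illustrative.

```python
def edge_density(edges, group_a, group_b=None):
    """Fraction of possible edges present inside group_a, or between group_a and group_b.

    Assumes an undirected graph with each edge listed once and disjoint groups.
    """
    a = set(group_a)
    if group_b is None:
        possible = len(a) * (len(a) - 1) // 2
        present = sum(1 for u, v in edges if u in a and v in a)
    else:
        b = set(group_b)
        possible = len(a) * len(b)
        present = sum(1 for u, v in edges if (u in a and v in b) or (u in b and v in a))
    return present / possible if possible else 0.0


# Hypothetical graph: two triangles joined by a single bridge edge (3, 4).
edges = [(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (3, 4)]
left, right = [1, 2, 3], [4, 5, 6]

inside_left = edge_density(edges, left)
inside_right = edge_density(edges, right)
between = edge_density(edges, left, right)
```

Both triangles are fully connected internally while only one of the nine possible cross edges exists, which matches the dense-inside, sparse-between description. Objective functions used in community detection typically formalize this kind of contrast.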

Social influence refers to the case when individuals change their behavior under the influence of others. The strength of social influence depends on the relation among individuals, network distances, time effects, and characteristics of networks and individuals. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise. For example, the blogosphere contains a large number of spam blogs, and Twitter likewise contains many trivial Tweets. Third, SNS are dynamic networks, which vary and are updated frequently and quickly. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge among different media [150].

6.2.6 Mobile data analysis

By April 2013, Android Apps had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since the research on mobile analysis has just started, we will only introduce some recent and representative analysis applications in this section.

With the growth in the number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities based on geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest Webchat). Traditional network communities or SNS communities lack online interaction among members, and the communities are active only when members are sitting in front of computers. In contrast, mobile phones can support rich interaction at any time and anywhere. A mobile community is defined as a group of individuals with the same hobbies (e.g., health, safety, and entertainment) who gather together on networks, agree on a common goal, decide on measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build body area networks for real-time monitoring of people’s health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations.


In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop a smartphone application that analyzes people’s gait when they walk and uses the gait information to unlock the security system [11]. Meanwhile, Robert Delano and Brian Parise from the Georgia Institute of Technology developed an application called iTrem, which monitors body tremors with the built-in seismograph of a mobile phone, so as to cope with Parkinson’s disease and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, in marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business models. In sales planning, after comparing massive data, enterprises can optimize their commodity prices. In operations, enterprises can improve their operational efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor costs. In the supply chain, using big data, enterprises may conduct inventory optimization, logistics optimization, and supplier coordination to mitigate the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that such activities as “Multi-times score accumulation” and “score exchange in shops” are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers’ transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, along with, more importantly, the age, gender, address, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, the market conditions of their brands, and consumers’ behaviors, and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more favorable prices. The credit loan of Alibaba automatically analyzes and judges whether to grant loans to enterprises through the acquired enterprise transaction data by virtue of big data technology, with no manual intervention in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan with only about 0.3 % bad loans, which is much lower than those of other commercial banks.

6.3.2 Application of IoT based big data

IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises may have had profound experience with the application of IoT big data. For example, UPS trucks are equipped with sensors, wireless adapters, and GPS, so that the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project cooperation between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and Miami city, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills this year by identifying and fixing running and leaking water pipes in a timely manner.

6.3.3 Application of online social network-oriented big data

                                      Online SNS is a social structure constituted by social indi-viduals and connections among individuals based on an


information network. Big data of online SNS mainly comes from instant messages, online social networks, micro blogs, and shared spaces, which represent various user activities. The analysis of big data from online SNS uses computational analytical methods for understanding relations in human society, by virtue of theories and methods involving mathematics, informatics, sociology, and management science, from three dimensions: network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire values.

– Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preferences, emotions, interests, demands, etc. may be revealed.

– Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies, etc. aggregate relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.
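The notion of a community as a group with close internal relations but loose external relations can be made concrete with a small sketch. The heuristic below is an illustrative assumption rather than a method described in this survey: it keeps an edge only when its two endpoints share at least one common neighbor, then reports the connected components of what remains. The function name `find_communities`, the parameter `min_shared`, and the toy network are all hypothetical.

```python
from collections import defaultdict


def find_communities(edges, min_shared=1):
    """Group nodes into communities with a simple embeddedness heuristic.

    An edge is kept only if its endpoints share at least `min_shared`
    common neighbors; communities are the connected components of the
    remaining "strong tie" graph.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    # Keep only edges whose endpoints have enough common neighbors.
    strong = {node: set() for node in neighbors}
    for u, v in edges:
        if len(neighbors[u] & neighbors[v]) >= min_shared:
            strong[u].add(v)
            strong[v].add(u)

    # Connected components of the strong-tie graph (depth-first search).
    seen, communities = set(), []
    for start in sorted(strong):
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            if node not in members:
                members.add(node)
                stack.extend(strong[node] - members)
        seen |= members
        communities.append(sorted(members))
    return communities


# Hypothetical toy network: two friend triangles joined by one weak link.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
print(find_communities(edges))  # [['a', 'b', 'c'], ['d', 'e', 'f']]
```

The bridge between `c` and `d` is dropped because the two users have no friend in common, which is one simple way to express "loose external relations". Production systems typically rely on more robust algorithms, such as modularity optimization.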

The U.S. Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and then get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the U.S. In addition, Global Pulse conducted research that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans. The goal is to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting the sharp growth

Fig. 5 Enabling technologies for online social network-oriented big data


or drop in the volume of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation in the official statistics of Indonesia matches the number of Tweets about rice price on Twitter, as shown in Fig. 6.
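Two of the analyses listed above, detecting a sharp growth or drop in topic volume and comparing a Tweet series with an official indicator, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Global Pulse implementation: the trailing-window z-score rule, the default `window` and `threshold` values, and the sample counts are all hypothetical.

```python
from statistics import mean, stdev


def detect_spikes(counts, window=4, threshold=3.0):
    """Return indices whose value deviates from the trailing-window mean
    by more than `threshold` sample standard deviations."""
    spikes = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            # Flat history: any change at all counts as abnormal.
            if counts[i] != mu:
                spikes.append(i)
            continue
        if abs(counts[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Synthetic weekly message counts for one topic (not real Twitter data).
weekly_counts = [100, 104, 98, 102, 101, 99, 400, 103]
print(detect_spikes(weekly_counts))  # [6]
```

A correlation such as the one in Fig. 6 would then be computed by passing the monthly Tweet counts and the official inflation series to `pearson`. A high coefficient indicates that the two series move together, not that one causes the other.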

Generally speaking, the application of big data from online SNS may help to better understand users' behavior and master the laws of social and economic activities from the following three aspects:

– Early Warning: to rapidly cope with a crisis, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotion, and preference of users.

– Real-time Feedback: to acquire groups' feedback on some social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the health care business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients over three consecutive years. In addition, it summarized the final result into an extremely personalized treatment plan to assess the risk factors and main treatment plans of patients. Doctors may then reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients lose five pounds of weight, or by suggesting that patients reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20.

The Mount Sinai Medical Center in the U.S. utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and

Fig. 6 The correlation between Tweets about rice price and food price inflation


imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through the software development kit (SDK) and open interface.

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units and coordinate through mobile networks to distribute sensing tasks and to collect and utilize sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of Crowdsourcing has been successfully applied to geotagged photographs, positioning and navigation, urban road traffic sensing, market forecast, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach for problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, Crowdsourcing had been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of Crowdsourcing. The main idea of Crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, Crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, Spatial Crowdsourcing has become a hot topic. The operation framework of Spatial Crowdsourcing is as follows. A user may request a service and resources related to a specified location. Then the mobile users who are willing to participate in the task will move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data will be sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions they provide, it can be forecast that Spatial Crowdsourcing will be more prevalent than traditional Crowdsourcing, e.g., Amazon Mechanical Turk and Crowdflower.
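The request, move, and return steps of the Spatial Crowdsourcing framework can be sketched as a simple task-assignment routine. The sketch below assumes, purely for illustration, that a platform picks the nearest willing worker within a maximum travel distance; the record fields (`id`, `lat`, `lon`, `willing`), the function names, and the 5 km default are hypothetical and do not come from any specific platform.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def assign_task(task, workers, max_km=5.0):
    """Return the id of the nearest willing worker within `max_km`,
    or None if nobody qualifies."""
    in_range = []
    for worker in workers:
        if not worker["willing"]:
            continue
        dist = haversine_km(task["lat"], task["lon"],
                            worker["lat"], worker["lon"])
        if dist <= max_km:
            in_range.append((dist, worker["id"]))
    return min(in_range)[1] if in_range else None


# Hypothetical request and worker positions.
task = {"lat": 0.0, "lon": 0.0}
workers = [
    {"id": "w1", "lat": 0.0, "lon": 0.01, "willing": True},    # about 1.1 km away
    {"id": "w2", "lat": 0.0, "lon": 0.001, "willing": False},  # closest, but unwilling
    {"id": "w3", "lat": 1.0, "lon": 1.0, "willing": True},     # too far
]
print(assign_task(task, workers))  # w1
```

Real platforms would also need to handle worker availability over time, incentives, and the quality of the returned data, none of which is modeled here.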

6.3.6 Smart grid

Smart Grid is the next-generation power grid, constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart Grid related big data are generated from various sources, such as (i) power utilization habits of users; (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nationwide; (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI); (iv) energy market pricing and bidding data; and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as Circuit Breaker Monitors and transformers). Smart Grid brings about the following challenges on exploiting big data:

– Grid planning: By analyzing data in the Smart Grid, regions that have excessively high electrical load or high power outage frequencies can be identified. Even transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an “electric map” according to big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to show the power consumption of every block at the moment. It can even compare the power consumption of a block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides effective and visual load forecasts for power grid planning in a city. Priority upgrades of power grid facilities in blocks with high power outage frequencies and serious overloads may be conducted, as displayed in the map.

– Interaction between power generation and power consumption: An ideal power grid shall balance power generation and consumption. However, the traditional power grid is constructed based on a one-directional approach of transmission-transformation-distribution-consumption, which does not allow the generation capacity to be adjusted according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters are developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which can help the supplier read power utilization data every 15 min rather than every month as in the past. Labor cost for meter reading is greatly reduced. Because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price


according to peak and low periods of power consumption. TXU Energy utilized such price levels to smooth the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help realize time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users.

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortage. Such energy resources can complement the traditional hydropower and thermal power generation.
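The time-sharing dynamic pricing mentioned in the list above, where prices follow peak and low periods derived from 15-minute smart meter readings, can be sketched as follows. This is a minimal illustration under stated assumptions and not the scheme of any utility: the thresholds of 120 % and 80 % of the average hourly load, the base price, and the markup and discount factors are all hypothetical.

```python
from collections import defaultdict


def hourly_load(readings):
    """Aggregate (hour_of_day, kwh) pairs from 15-minute readings
    into total consumption per hour."""
    totals = defaultdict(float)
    for hour, kwh in readings:
        totals[hour] += kwh
    return dict(totals)


def time_of_use_prices(load_by_hour, base_price=0.10,
                       peak_markup=0.5, offpeak_discount=0.3):
    """Assign a price per kWh to each hour: higher in peak hours,
    lower in low-demand hours, base price otherwise."""
    average = sum(load_by_hour.values()) / len(load_by_hour)
    prices = {}
    for hour, load in load_by_hour.items():
        if load > 1.2 * average:        # peak period (assumed threshold)
            price = base_price * (1 + peak_markup)
        elif load < 0.8 * average:      # low period (assumed threshold)
            price = base_price * (1 - offpeak_discount)
        else:
            price = base_price
        prices[hour] = round(price, 4)
    return prices


# Synthetic readings: four 15-minute intervals for each of three hours.
readings = ([(3, 0.5)] * 4) + ([(12, 1.0)] * 4) + ([(19, 1.5)] * 4)
load = hourly_load(readings)
print(time_of_use_prices(load))  # {3: 0.07, 12: 0.1, 19: 0.15}
```

Raising the price in the evening peak and lowering it at night gives users an incentive to shift consumption, which is the smoothing effect described for TXU Energy.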

7 Conclusion, open issues, and outlook

In this paper, we review the background and state of the art of big data. First, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. Finally, we review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide readers with a comprehensive overview and big picture of this exciting area.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in its early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, which are discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

– Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to assess the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated once the system is implemented and deployed, which makes it impossible to horizontally compare the advantages and disadvantages of various alternative solutions, even before and after the implementation of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes: This includes memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.
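As an illustration of the MR (MapReduce) mode named in the list above, the following single-process sketch walks through the map, shuffle, and reduce phases for a word count. It only mimics the programming model: the function names are hypothetical, and a real framework such as Hadoop would run the phases in parallel across many machines and move the intermediate pairs over the network.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle(mapped))


print(word_count(["big data", "Big value"]))  # {'big': 2, 'data': 1, 'value': 1}
```

The shuffle step is where data moves between workers in a distributed setting, which is why data transfer, rather than computation, tends to become the bottleneck in this mode.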

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, etc., should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor that restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.


– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor in improving big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems in big data processing arise from traditional data analysis, including (i) data re-utilization: with the increase of data scale, more value may be mined from re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized, from which more value can be mined; and (iii) data exhaust, which means wrong data produced during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.
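The idea of a data life cycle with a rate of depreciation, raised under real-time performance above, is left open in the text. One possible way to make it concrete, offered here only as an assumption, is an exponential decay in which a data item loses half of its value after a fixed half-life and is treated as expired below a cutoff. The function names, the half-life model itself, and the 5 % cutoff are hypothetical.

```python
def depreciated_value(initial_value, age_seconds, half_life_seconds):
    """Value of a data item after `age_seconds`, assuming it halves
    every `half_life_seconds` (an assumed model, not from the survey)."""
    if half_life_seconds <= 0:
        raise ValueError("half_life_seconds must be positive")
    return initial_value * 0.5 ** (age_seconds / half_life_seconds)


def is_expired(age_seconds, half_life_seconds, min_fraction=0.05):
    """True once the remaining value fraction falls below `min_fraction`."""
    return depreciated_value(1.0, age_seconds, half_life_seconds) < min_fraction


# A reading worth 100 units with a one-minute half-life.
print(depreciated_value(100, 60, 60))   # 50.0
print(is_expired(300, 60))              # True: about 3 % of the value remains
```

Under such a model, a stream processor could discard or down-weight expired items, but the appropriate decay law and half-life would have to be determined per application.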

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved.

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from comprehensive utilization of multiple datasets is far higher than the sum of the values of the individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and from different datasets.

– Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication; big data applications in small and medium-sized businesses; big data applications in government departments; big data services; and human-computer interaction of big data, etc. are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume is growing fast, there are more severe safety risks, while the traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security-related challenges.

– Big data privacy: Big data privacy includes two aspects: (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc. of users may be more easily acquired, and users may not be aware. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed a big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher of Skull Security, used an information acquisition tool to acquire data from the public pages of Facebook users who had failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low-quality data wastes transmission and storage resources and has poor usability. There are many factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data


quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though many measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.

– Big data safety mechanism: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data cannot meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safe communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data application in information security: Big data not only brings about challenges to information security but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) by analyzing big data in the form of log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc. may also be more easily identified through the analysis of big data.
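Two of the data quality dimensions named in the list above, completeness and redundancy, lend themselves to simple automatic measurement. The sketch below is one possible formulation under stated assumptions, not a standard defined in this survey: completeness is taken as the share of required fields that are filled, and redundancy as the share of records that duplicate an earlier record on the chosen key fields. The function names and sample records are hypothetical.

```python
def completeness(records, required_fields):
    """Fraction of required field slots that hold a non-empty value."""
    total = len(records) * len(required_fields)
    if total == 0:
        return 1.0
    filled = sum(
        1
        for record in records
        for field in required_fields
        if record.get(field) not in (None, "")
    )
    return filled / total


def redundancy(records, key_fields):
    """Fraction of records that duplicate another record on `key_fields`."""
    if not records:
        return 0.0
    distinct = {tuple(record.get(f) for f in key_fields) for record in records}
    return 1 - len(distinct) / len(records)


# Hypothetical records with two missing names and one exact duplicate.
records = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": ""},
    {"id": 1, "name": "a"},
    {"id": 3, "name": None},
]
print(completeness(records, ["id", "name"]))  # 0.75
print(redundancy(records, ["id", "name"]))    # 0.25
```

Accuracy and consistency are harder to score automatically, because they require reference data or cross-field rules that depend on the application.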

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy efficiency optimization, and hardware and software system architectures for processing, etc. In particular, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the “T” (Technology) was the main concern, while technology drives the development of data. In the big data era, with the prominence of data value and advances in “I” (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact but also influence everyone's ways of living and thinking, which is happening right now. We cannot predict the future, but we may take precautions for possible events to occur in the future.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced, given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have been concerned with better ways to cope with larger-scale, higher-diversity, and more complexly structured data. These efforts are represented by Google's globally-distributed database Spanner and the fault-tolerant, expandable distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, impacts of big data on production management, business operation, and decision making, etc. shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a friendly manner can they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc. are frequently used to visualize results of data analysis. New presentation forms will emerge in the future, e.g., Microsoft Renlifang, a social search engine,


utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it is observed that the role of data is becoming increasingly significant. In the small-scale data era, in which logic is more complex than data, program design is mainly process-oriented. As business data becomes more complex, object-oriented design methods are developed. Nowadays, the complexity of business data has far surpassed that of business logic. Consequently, programs are gradually transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods will emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

– During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

– Compared with accurate data, we would like to accept numerous and complicated data.

– We shall pay greater attention to correlations between things rather than exploring causal relationships.

– The simple algorithms of big data are more effective than complex algorithms of small data.

– Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace “experts”.

Throughout the history of human society, the demands and willingness of human beings have always been the source of power that promotes scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will attract increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.


• Big Data: A Survey
  • Abstract
  • Background
    • Dawn of big data era
    • Definition and features of big data
    • Big data value
    • The development of big data
    • Challenges of big data
  • Related technologies
    • Relationship between cloud computing and big data
    • Relationship between IoT and big data
    • Data center
    • Relationship between Hadoop and big data
  • Big data generation and acquisition
    • Data generation
      • Enterprise data
      • IoT data
      • Bio-medical data
      • Data generation from other fields
    • Big data acquisition
      • Data collection
      • Data transportation
      • Data pre-processing
  • Big data storage
    • Storage system for massive data
    • Distributed storage system
    • Storage mechanism for big data
      • Database technology
  • Big data analysis
    • Traditional data analysis
    • Big data analytic methods
    • Architecture for big data analysis
      • Real-time vs. offline analysis
      • Analysis at different levels
      • Analysis with different complexity
    • Tools for big data mining and analysis
  • Big data applications
    • Application evolutions
    • Big data analysis fields
      • Structured data analysis
      • Text data analysis
      • Web data analysis
      • Multimedia data analysis
      • Network data analysis
      • Mobile data analysis
    • Key applications of big data
      • Application of big data in enterprises
      • Application of IoT based big data
      • Application of online social network-oriented big data
      • Applications of healthcare and medical big data
      • Collective intelligence
      • Smart grid
  • Conclusion, open issues, and outlook
    • Open issues
      • Theoretical research
      • Technology development
      • Practical implications
      • Data security
    • Outlook
  • Acknowledgments
  • References

DryadLINQ [102] is the advanced language of Dryad and is used to integrate the aforementioned SQL-like language execution environment.

– All-Pairs: All-Pairs [103] is a system specially designed for biometrics, bio-informatics, and data mining applications. It focuses on comparing element pairs in two datasets by a given function. An All-Pairs problem can be expressed as a three-tuple (Set A, Set B, and Function F), in which Function F is used to compare every element of Set A with every element of Set B. The comparison result is an output matrix M, which is also called the Cartesian product or cross join of Set A and Set B.

All-Pairs is implemented in four phases: system modeling, distribution of input data, batch job management, and result collection. In Phase I, an approximation model of system performance is built to evaluate how much CPU resource is needed and how to partition the job. In Phase II, a spanning tree is built for data transmissions, so that the workload of every partition can retrieve its input data efficiently. In Phase III, after the data flow is delivered to the proper nodes, the All-Pairs engine builds a batch-processing submission for the jobs in each partition, sequences them in the batch processing system, and formulates a node running command to acquire data. In the last phase, after the batch processing system completes the jobs, the extraction engine collects the results and combines them in a proper structure, which is generally a single file list in which all results are put in order.

– Pregel: The Pregel [104] system of Google facilitates the processing of large-sized graphs, e.g., analysis of network graphs and social networking services. A computational task is expressed by a directed graph constituted by vertexes and directed edges. Every vertex is associated with a modifiable, user-defined value, and every directed edge is associated with its source vertex and consists of a user-defined value and the identifier of a target vertex. When the graph is built, the program conducts iterative calculations, called supersteps, separated by global synchronization points, until the algorithm completes and the output is produced. In every superstep, vertex computations run in parallel, and every vertex executes the same user-defined function, which expresses the logic of a given algorithm. Every vertex may modify its own status and that of its outgoing edges, receive messages sent in the previous superstep, send messages to other vertexes, and even modify the topological structure of the entire graph. Edges have no computations of their own. A vertex can deactivate itself by suspension (voting to halt). When all vertexes are inactive and no message remains to be transmitted, the entire program execution is completed. The Pregel program output is a set consisting of the values output from all the vertexes. Generally speaking, the input and output of a Pregel program are isomorphic directed graphs.
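The superstep mechanism described for Pregel can be illustrated with a minimal single-process sketch. It is not Google's implementation or API: the function name `pregel_max`, the dictionary-based graph format, and the `max_supersteps` guard are assumptions made for illustration. Every vertex runs the same logic (keep the largest value seen so far), messages sent in one superstep are delivered in the next, and a vertex that learns nothing new votes to halt.

```python
from collections import defaultdict


def pregel_max(values, out_edges, max_supersteps=50):
    """Propagate the maximum vertex value through a directed graph.

    values:    dict vertex -> initial value (hypothetical input format)
    out_edges: dict vertex -> list of target vertexes
    Returns the final vertex values and the number of supersteps executed.
    """
    values = dict(values)
    active = set(values)           # every vertex starts in the active state
    inbox = defaultdict(list)      # messages sent during the previous superstep
    superstep = 0
    while (active or inbox) and superstep < max_supersteps:
        outbox = defaultdict(list)
        # A halted vertex is woken up again when a message arrives for it.
        for v in active | set(inbox):
            received = inbox.get(v, [])
            new_value = max([values[v]] + received)
            changed = new_value != values[v]
            values[v] = new_value
            if superstep == 0 or changed:
                # Send the current value along every outgoing edge.
                for target in out_edges.get(v, []):
                    outbox[target].append(new_value)
                active.add(v)
            else:
                active.discard(v)  # vote to halt
        inbox = outbox             # global synchronization point
        superstep += 1
    return values, superstep


if __name__ == "__main__":
    graph = {"a": ["b"], "b": ["a", "d"], "c": ["b", "d"], "d": ["c"]}
    final, steps = pregel_max({"a": 3, "b": 6, "c": 2, "d": 1}, graph)
    print(final, steps)
```

In a real deployment the vertexes would be partitioned across machines and the loop body would run in parallel; the sketch only shows the message passing, the synchronization barrier, and the halting condition.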

Inspired by the above programming models, other research has also focused on programming models for more complex computational tasks, e.g., iterative computations [105, 106], fault-tolerant memory computations [107], incremental computations [108], and flow control decision-making related to data [109].

5 Big data analysis

The analysis of big data mainly involves analytical methods for traditional data and big data, analytical architecture for big data, and software used for mining and analysis of big data. Data analysis is the final and the most important phase in the value chain of big data, with the purpose of extracting useful values and providing suggestions or decisions. Different levels of potential values can be generated through the analysis of datasets in different fields [10]. However, data analysis is a broad area, which frequently changes and is extremely complex. In this section, we introduce the methods, architectures, and tools for big data analysis.

5.1 Traditional data analysis

Traditional data analysis means to use proper statistical methods to analyze massive data, to concentrate, extract, and refine useful data hidden in a batch of chaotic datasets, and to identify the inherent law of the subject matter, so as to maximize the value of data. Data analysis plays a huge guidance role in making development plans for a country, understanding customer demands for commerce, and predicting market trends for enterprises. Big data analysis can be deemed as the analysis technique for a special kind of data. Therefore, many traditional data analysis methods may still be utilized for big data analysis. Several representative traditional data analysis methods are examined in the following, many of which are from statistics and computer science.

– Cluster Analysis: a statistical method for grouping objects, and specifically, classifying objects according to some features. Cluster analysis is used to differentiate objects with particular features and divide them into some categories (clusters) according to these features, such that objects in the same category have high homogeneity while different categories have high heterogeneity. Cluster analysis is an unsupervised learning method that requires no labeled training data.

– Factor Analysis: basically targeted at describing the relation among many elements with only a few factors, i.e., grouping several closely related variables into a factor; the few factors are then used to reveal most of the information in the original data.


– Correlation Analysis: an analytical method for determining the law of relations, such as correlation, correlative dependence, and mutual restriction, among observed phenomena, and accordingly conducting forecast and control. Such relations may be classified into two types: (i) function, reflecting the strict dependence relationship among phenomena, which is also called a definitive dependence relationship; (ii) correlation, i.e., undetermined or inexact dependence relations, in which the numerical value of a variable may correspond to several numerical values of the other variable, and such numerical values present a regular fluctuation surrounding their mean values.

– Regression Analysis: a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies dependence relationships among variables hidden by randomness. Regression analysis may turn complex and undetermined correlations among variables into simple and regular ones.

– A/B Testing: also called bucket testing. It is a technology for determining how to improve target variables by comparing the tested group with a control group. Big data will require a large number of tests to be executed and analyzed.

– Statistical Analysis: Statistical analysis is based on statistical theory, a branch of applied mathematics. In statistical theory, randomness and uncertainty are modeled with probability theory. Statistical analysis can provide a description and an inference for big data. Descriptive statistical analysis can summarize and describe datasets, while inferential statistical analysis can draw conclusions from data subject to random variations. Statistical analysis is widely applied in the economic and medical care fields [110].

– Data Mining Algorithms: Data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. In 2006, the IEEE International Conference on Data Mining Series (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [111], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, and CART, etc. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and link mining, all of which are among the most important problems in data mining research.
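As a concrete illustration of cluster analysis, the following is a minimal sketch of k-means, one of the algorithms named above. It is a simplified version under stated assumptions: the function name `kmeans` is hypothetical, the first k points are taken as initial centroids (practical implementations usually choose them more carefully), and Euclidean distance is used as the similarity measure.

```python
import math


def kmeans(points, k, iterations=100):
    """Group points into k clusters by alternating assignment and update steps."""
    # Assumption: the first k points serve as initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break                      # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment


if __name__ == "__main__":
    data = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.8, 1.1), (7.9, 8.3)]
    centers, labels = kmeans(data, k=2)
    print(centers, labels)
```

No labels are supplied to the function; the grouping emerges only from the distances between points, which is what makes the method unsupervised.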

5.2 Big data analytic methods

In the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data, so as to bring value to enterprises and individuals. At present, the main processing methods of big data are as follows.

– Bloom Filter: A Bloom Filter consists of a series of Hash functions. The principle of the Bloom Filter is to store Hash values of data rather than the data itself by utilizing a bit array, which is in essence a bitmap index that uses Hash functions to conduct lossy compression storage of data. It has such advantages as high space efficiency and high query speed, but also has some disadvantages in misrecognition (false positives) and deletion (stored items cannot simply be removed).

– Hashing: a method that essentially transforms data into shorter fixed-length numerical values or index values. Hashing has such advantages as rapid reading, writing, and high query speed, but it is hard to find a sound Hash function.

– Index: an index is always an effective method to reduce the expense of disk reading and writing, and to improve insertion, deletion, modification, and query speeds, in both traditional relational databases that manage structured data and other technologies that manage semi-structured and unstructured data. However, an index has the disadvantage of the additional cost of storing index files, which must be maintained dynamically when data is updated.

– Trie: also called trie tree, a variant of Hash Tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of the Trie is to utilize common prefixes of character strings to reduce comparisons on character strings to the greatest extent, so as to improve query efficiency.

– Parallel Computing: compared to traditional serial computing, parallel computing refers to simultaneously utilizing several computing resources to complete a computation task. Its basic idea is to decompose a problem into parts and assign them to several separate processes to be independently completed, so as to achieve co-processing. Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see a comparison in Table 1).
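To make the Bloom Filter item in the list above concrete, the following is a minimal sketch of the bit-array idea. The class name `BloomFilter`, the default sizes, and the use of seeded SHA-256 digests as the family of Hash functions are all assumptions chosen for illustration, not a prescribed design.

```python
import hashlib


class BloomFilter:
    """Set membership with a bit array: no false negatives, possible false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)   # all bits start at 0

    def _positions(self, item):
        # Derive several hash values by prefixing the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        # Only the hashed bit positions are stored, never the item itself.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


if __name__ == "__main__":
    seen = BloomFilter()
    seen.add("alice")
    print(seen.might_contain("alice"), seen.might_contain("bob"))
```

The two disadvantages noted above are visible in the sketch: two different items may map to the same bits, which causes misrecognition, and clearing a bit to delete one item could silently remove others that share it.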

Although parallel computing systems or tools such as MapReduce or Dryad are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages are being developed based on these systems. Such high-level languages include Sawzall, Pig, and Hive, used for MapReduce, as well as Scope and DryadLINQ, used for Dryad.
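To give a feel for what programming directly against the low-level model involves, the sketch below simulates the map, shuffle, and reduce steps of a word count in a single Python process. It does not use the Hadoop or Google MapReduce APIs; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical, and a real system would run the map and reduce calls on many nodes in parallel.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values that share one key."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big", "data analysis"]))
```

Even this small task has to be phrased as explicit map and reduce functions, which is the kind of effort that the high-level languages mentioned above aim to hide behind declarative queries.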

5.3 Architecture for big data analysis

Because of the 4Vs of big data, different analytical architectures shall be considered for different application requirements.


Table 1 Comparison of MPI, MapReduce and Dryad

                        MPI                           MapReduce                     Dryad
Deployment              Computing node and data       Computing and data storage    Computing and data storage
                        storage arranged separately   arranged at the same node     arranged at the same node
                        (data should be moved to      (computing should be          (computing should be
                        computing node)               close to data)                close to data)
Resource management/    –                             Workqueue (Google),           Not clear
scheduling                                            HOD (Yahoo)
Low-level programming   MPI API                       MapReduce API                 Dryad API
High-level programming  –                             Pig, Hive, Jaql, ···          Scope, DryadLINQ
Data storage            The local file system,        GFS (Google), HDFS (Hadoop),  NTFS, Cosmos DFS
                        NFS, ···                      KFS, Amazon S3, ···
Task partitioning       Users manually partition      Automation                    Automation
                        the tasks
Communication           Messaging, remote             Files (local FS, DFS)         Files, TCP pipes,
                        memory access                                               shared-memory FIFOs
Fault-tolerant          Checkpoint                    Task re-execute               Task re-execute

5.3.1 Real-time vs. offline analysis

According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

– Real-time analysis: is mainly used in E-commerce and finance. Since data constantly changes, rapid data analysis is needed and analytical results shall be returned with a very short delay. The main existing architectures of real-time analysis include (i) parallel processing clusters using traditional relational databases, and (ii) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

– Offline analysis: is usually used for applications without high requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally conducts analysis by importing logs into a special platform through data acquisition tools. Under the big data setting, many Internet enterprises utilize the offline analysis architecture based on Hadoop in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples of such data acquisition tools include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool Timetunnel, and Chukwa of Hadoop, etc. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.

5.3.2 Analysis at different levels

Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive-level analysis, which are examined in the following.

– Memory-level analysis: is for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays, the memory of a server cluster surpasses hundreds of GB, and even the TB level is common. Therefore, an in-memory database technology may be used, and hot data shall reside in the memory so as to improve the analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis. MongoDB is a representative memory-level analytical architecture. With the development of SSD (Solid-State Drive), the capacity and performance of memory-level data analysis have been further improved and widely applied.

– BI analysis: is for the case when the data scale surpasses the memory level but may be imported into the BI analysis environment. The currently mainstream BI products are provided with data analysis plans to support the level over TB.

– Massive analysis: is for the case when the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analysis utilizes HDFS of Hadoop to store data and uses MapReduce for data analysis. Most massive analysis belongs to the offline analysis category.
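The three levels above amount to a simple decision rule on data volume, which can be sketched as follows. The function name `choose_analysis_level` and both capacity parameters are hypothetical; the text gives no exact thresholds, so the caller must supply values that describe the actual cluster and BI product.

```python
TB = 1024 ** 4  # bytes in one terabyte (binary convention, chosen for the example)


def choose_analysis_level(data_bytes, cluster_memory_bytes, bi_capacity_bytes):
    """Pick an analysis level from the data volume (illustrative thresholds only)."""
    if data_bytes < cluster_memory_bytes:
        return "memory-level"      # the whole dataset can reside in cluster memory
    if data_bytes <= bi_capacity_bytes:
        return "BI"                # too big for memory, still importable into BI tools
    return "massive"               # beyond BI products and relational databases


if __name__ == "__main__":
    # Assumed capacities: 2 TB of cluster memory, 50 TB supported by the BI product.
    for size in (0.5 * TB, 10 * TB, 500 * TB):
        print(size / TB, "TB ->", choose_analysis_level(size, 2 * TB, 50 * TB))
```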


5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithms differ greatly from each other according to different kinds of data and application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the top five most widely used software, according to a survey of “What Analytics, Data mining, Big Data software that you used in the past 12 months for a real project?” of 798 professionals made by KDNuggets in 2012 [112].

– R (30.7 %): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. When computing-intensive tasks are executed, code programmed with C, C++, and Fortran may be called in the R environment. In addition, skilled users can directly call R objects in C. Actually, R is a realization of the S language, which is an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and drawing plots. Compared to S, R is more popular since it is open source. R ranks top 1 in the KDNuggets 2012 survey. Furthermore, in a survey of “Design languages you have used for data mining/analysis in the past year” in 2012, R was also in the first place, defeating SQL and Java. Due to the popularity of R, database manufacturers, such as Teradata and Oracle, have released products supporting R.

– Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins with powerful functions for data analysis, such as Analysis ToolPak and Solver Add-in, are integrated initially, but such plug-ins can be used only if users enable them. Excel is also the only commercial software among the top five.

– Rapid-I RapidMiner (26.7 %): RapidMiner is an open source software used for data mining, machine learning, and predictive analysis. In an investigation of KDnuggets in 2011, it was more frequently used than R (ranked Top 1). Data mining and machine learning programs provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment. The data mining flow is described in XML and displayed through a graphic user interface (GUI). RapidMiner is written in Java. It integrates the learner and evaluation method of Weka, and works with R. Functions of RapidMiner are implemented by connecting processes that include various operators. The entire flow can be deemed as a production line of a factory, with original data as input and model results as output. The operators can be considered as specific functions with different input and output characteristics.

– KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, and open-source-rich data integration, data processing, data analysis, and data mining platform [113]. It allows users to create data flows or data channels in a visualized manner, to selectively run some or all analytical procedures, and provides analytical results, models, and interactive views. KNIME was written in Java and, based on Eclipse, provides more functions as plug-ins. Through plug-in files, users can insert processing modules for files, pictures, and time series, and integrate them into various open source projects, e.g., R and Weka. KNIME controls data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted under a visualized environment. KNIME is designed as a module-based and expandable framework. There is no dependence between its processing units and data containers, making it adaptive to the distributed environment and independent development. In addition, it is easy to expand KNIME. Developers can effortlessly expand various nodes and views of KNIME.

– Weka/Pentaho (14.8 %): Weka, abbreviated from Waikato Environment for Knowledge Analysis, is a free and open-source machine learning and data mining software written in Java. Weka provides such functions as data processing, feature selection, classification, regression, clustering, association rule, and visualization, etc. Pentaho is one of the most popular open-source BI software. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, and data mining, etc., i.e., all aspects of BI. Weka's data processing algorithms are also integrated in Pentaho and can be directly called.

                                        6 Big data applications

In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful values via judgments, suggestions, supports, or decisions. However, data analysis involves a wide range of applications, which frequently change and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields, including structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

6.1 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically includes large-scale and complex programs under specific analytical methods. As a matter of fact, data driven applications have emerged in the past decades. For example, as early as the 1990s, BI had become a prevailing technology for business applications, and network search engines based on massive data mining processing emerged in the early 21st century. Some potential and influential applications from different fields, and their data and analysis characteristics, are discussed as follows.

– Evolution of Commercial Applications: The earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. Analytical techniques used in such systems were prevailing in the 1990s and were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, score cards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have been providing a unique opportunity for organizations to have an online display and directly interact with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted by text analysis and website mining techniques. As reported in [115], the quantity of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the Internet of Things based on sensors are opening a new generation of innovative applications, and requiring considerably larger capacity for supporting location-sensing, people-oriented, and context-aware operation.

– Evolution of Network Applications: The early generation of the Internet mainly provided email and the WWW services. Text analysis, data mining, and webpage analysis have been applied to the mining of email contents and building search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages, full of various kinds of data, such as text, images, audio, videos, and interactive contents, etc. Therefore, plenty of advanced technologies used for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share contents.

– Evolution of Scientific Applications: Scientific research in many fields, such as astrophysics, oceanology, genomics, and environmental research, is acquiring massive data with high-throughput sensors and instruments. The U.S. National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, coordination environment, virtual machine resources, inter-operative analysis software, and data service to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets have high varieties in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

6.2 Big data analysis fields

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and it is not easy to have a comprehensive coverage, we will focus on the key problems and technologies in data analysis in the following discussions.


                                        621 Structured data analysis

                                        Business applications and scientific research may generatemassive structured data of which the management and anal-ysis rely on mature commercialized technologies such asRDBMS data warehouse OLAP and BPM (Business Pro-cess Management) [28] Data analysis is mainly based ondata mining and statistical analysis both of which have beenwell studied over the past 30 years

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning based on exact mathematical models and powerful algorithms has been applied to anomaly detection [117] and energy control [118]. By exploiting data characteristics, time and space mining can extract knowledge structures hidden in high-speed data flows and sensors [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially in process analysis with event data [121].

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to have more business potential than structured data analysis. Generally, text analysis is a process of extracting useful information and knowledge from unstructured text. Text mining is interdisciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and, in particular, data mining. Most text mining systems are based on text expressions and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammars [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the different parts being mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, image, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which will be discussed in Section 6.2.4. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves the mining of semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., email, newsgroup management, and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to simulate and integrate data on the Web so as to conduct more complex queries than keyword-based searches.

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagrams of links within a website or among multiple websites. Models are built based on topological structures provided with hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of the models to look up relevant website pages. The topic-oriented crawler is another successful case that utilizes the models [127].
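To make the idea of ranking pages from link topology concrete, the following is a minimal sketch of a PageRank-style power iteration. It is an illustration under stated assumptions, not the implementation of [125]: the function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the toy link graph are all chosen here for demonstration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages from hyperlink structure alone.

    links: dict mapping a page to the list of pages it links to
    (a hypothetical toy representation of a crawled link graph).
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (random jump).
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank evenly to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page (no out-links): spread its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_links)
best_page = max(ranks, key=ranks.get)
```

In this toy graph, page `c` is linked to by three other pages and therefore ends up with the highest score, while `d`, which nobody links to, keeps only the baseline share.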

Web usage mining aims to mine auxiliary data generated by Web dialogues or activities, whereas Web content mining and Web structure mining use the master Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or trades, caches, user queries, bookmark data, mouse clicks and scrolls, and any other kinds of data generated through interaction with the Web. As Web services and Web 2.0 are becoming mature and popular, Web usage data will have increasingly high variety. Web usage mining plays key roles in personalized space, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.


6.2.4 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) have been growing at an amazing speed, and analysis is used to extract useful knowledge from such data and to understand its semantics. Because multimedia data is heterogeneous and most of it contains richer information than simple structured data or text data, extracting information is confronted with the huge challenge of the semantic gap. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection.

Audio summarization can be accomplished by extracting the prominent words or phrases from metadata or by synthesizing a new representation. Video summarization interprets the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied in many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video and apply additional smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.

Multimedia annotation inserts labels to describe the contents of images and videos at both the syntactic and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time and labor intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic gap. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to explore manual and automatic multimedia annotation simultaneously [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users to conveniently and quickly look up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, including shot boundary detection, key frame extraction, and scene segmentation. Based on the result of structural analysis, the second procedure is feature extraction, which mainly includes further mining the features of key frames, objects, texts, and movements; these features are the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find the patterns of video contents and put videos into predefined categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos. The retrieval result is then optimized through relevance feedback.
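The query step above, looking up candidate videos with a similarity measurement, can be sketched as follows. This is only an illustration: cosine similarity is one common choice and is assumed here rather than taken from [130, 131], and the three-dimensional feature vectors, the `video_index` dictionary, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query_features, index, top_k=2):
    """Return the ids of the top_k indexed videos most similar to the query."""
    ranked = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query_features, item[1]),
        reverse=True,
    )
    return [video_id for video_id, _ in ranked[:top_k]]


# Hypothetical index: video id -> feature vector produced by feature extraction.
video_index = {
    "news": [0.9, 0.1, 0.0],
    "sports": [0.1, 0.9, 0.2],
    "music": [0.0, 0.2, 0.9],
}
candidates = retrieve([1.0, 0.2, 0.0], video_index, top_k=2)
```

A real system would replace the linear scan with an index structure suited to high-dimensional vectors, and would use the relevance feedback of users to re-rank `candidates`.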

Multimedia recommendation recommends specific multimedia contents according to users' preferences. It has proven to be an effective approach to providing personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. The content-based methods identify general features of users or of their interests and recommend other contents with similar features to users. These methods largely rely on content similarity measurement, but most of them are troubled by limited content analysis and over-specialization. The collaborative-filtering-based methods identify groups with similar interests and recommend contents to group members according to their behavior [132]. Presently, a hybrid method has been introduced, which integrates the advantages of the aforementioned two types of methods to improve recommendation quality [133].
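As a rough illustration of the collaborative-filtering idea, the sketch below scores unseen items for a user by the ratings of other users, weighted by how similar their rating histories are. It is a minimal user-based variant written for this example and is not the method of [132] or [133]; the rating data, the 1 to 5 scale, and the function names are assumptions.

```python
import math


def rating_similarity(a, b):
    """Cosine similarity between two users' rating dictionaries."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[item] * b[item] for item in common)
    norm_a = math.sqrt(sum(r * r for r in a.values()))
    norm_b = math.sqrt(sum(r * r for r in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(user, ratings, top_k=2):
    """Rank items the user has not rated by similarity-weighted ratings."""
    seen = ratings[user]
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = rating_similarity(seen, other_ratings)
        for item, rating in other_ratings.items():
            if item not in seen:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical ratings on a 1-5 scale.
ratings = {
    "ann": {"clip1": 5, "clip2": 4},
    "bob": {"clip1": 5, "clip2": 5, "clip3": 4},
    "cat": {"clip1": 1, "clip4": 5},
}
suggestions = recommend("ann", ratings)
```

Here `bob` rates the shared clips much like `ann` does, so his `clip3` outranks `clip4`, which comes from the dissimilar user `cat`. A content-based method would instead compare features of the clips themselves.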

The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains text descriptions related to concepts and video examples [134]. In [135], the author proposed a new algorithm for special multimedia event detection using a few positive training examples. The research on video event detection is still in its infancy and mainly focuses on sports or news events, running or abnormal events in monitoring videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graph structures describing the communications between two entities. The content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. From the data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

The research on link-based structural analysis has focused on link prediction, community discovery, social network evolution, and social influence analysis. SNS may be visualized as graphs, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertexes and edges are continually added to the graphs. Link prediction predicts the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and utilizes the existing link information to generate binary classifiers to predict future links [139]. Probabilistic methods aim to build models for connection probabilities among vertexes in SNS [140]. Linear algebra methods compute the similarity between two vertexes according to the singular similar matrix [141]. A community is represented by a sub-graph in which edges connecting vertexes within the sub-graph feature high density, while the edges between two sub-graphs feature much lower density [142].
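A simple way to see how existing links can hint at future ones is to score each unconnected pair of vertexes by the overlap of their neighborhoods. The sketch below uses the Jaccard coefficient for this purpose; it is a baseline chosen for illustration and is not the classifier of [139] or the models of [140, 141], although such a score could serve as one input feature to a feature-based classifier. The edge list and function names are hypothetical.

```python
from itertools import combinations


def build_graph(edges):
    """Build an undirected adjacency map: vertex -> set of neighbors."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def jaccard(graph, u, v):
    """Shared neighbors divided by all neighbors of the two vertexes."""
    union = graph[u] | graph[v]
    if not union:
        return 0.0
    return len(graph[u] & graph[v]) / len(union)


def predict_links(edges, top_k=3):
    """Rank currently unconnected vertex pairs by neighborhood overlap."""
    graph = build_graph(edges)
    candidates = [
        (u, v)
        for u, v in combinations(sorted(graph), 2)
        if v not in graph[u]
    ]
    candidates.sort(key=lambda pair: jaccard(graph, *pair), reverse=True)
    return candidates[:top_k]


# Hypothetical friendship edges in a small social graph.
friendships = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
likely_links = predict_links(friendships, top_k=1)
```

In this toy graph, users `a` and `d` share two of their three combined neighbors, so the pair is ranked as the most likely future connection.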

Many methods for community detection have been proposed and studied, most of which are topology-based target functions relying on the concept of capturing community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. The research on SNS evolution aims to look for laws and deduction models to interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generation methods have been proposed to assist network and system design [147].

Social influence refers to the case when individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, time effects, and characteristics of networks and individuals. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise. For example, the blogosphere contains a large number of spam blogs, and Twitter likewise contains many trivial Tweets. Third, SNS are dynamic networks, which vary and are updated frequently and quickly. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge among different media [150].

6.2.6 Mobile data analysis

By April 2013, Android Apps had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since the research on mobile analysis has only just started, we will only introduce some recent and representative analysis applications in this section.

With the growth in the number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities with geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest Webchat). Traditional network communities or SNS communities lack online interaction among members, and the communities are active only when members are sitting in front of computers. On the contrary, mobile phones can support rich interaction at any time and anywhere. A mobile community is defined as a group of individuals with the same hobbies (i.e., health, safety, and entertainment, etc.) who gather together on networks, meet to set a common goal, decide measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, the progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build body area networks for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations. In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop an application for smart phones, which analyzes people's paces when they walk and uses the pace information for unlocking the safety system [11]. In the meanwhile, Robert Delano and Brian Parise from the Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with a built-in seismograph in a mobile phone so as to cope with Parkinson's and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, in marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business modes. In sales planning, after comparison of massive data, enterprises can optimize their commodity prices. In operations, enterprises can improve their operation efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor cost. In the supply chain, using big data, enterprises may conduct inventory optimization, logistics optimization, and supplier coordination to mitigate the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that such activities as “Multi-times score accumulation” and “score exchange in shops” are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers' transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, along with, more importantly, the age, gender, address, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, the market conditions of their brands, and consumers' behaviors, and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan of Alibaba automatically analyzes and judges whether to grant loans to enterprises through the acquired enterprise transaction data by virtue of big data technology, while manual intervention does not occur in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan with only about 0.3 % bad loans, which is much lower than those of other commercial banks.

6.3.2 Application of IoT based big data

IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises may have had profound experience with the application of IoT big data. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project cooperation between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and Miami city, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills this year by identifying and fixing running and leaking water pipes in a timely manner.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and connections among individuals based on an information network. Big data of online SNS mainly comes from instant messages, online social networking, micro blogs, and shared spaces, which represent various user activities. The analysis of big data from online SNS uses computational analytical methods for understanding relations in human society by virtue of theories and methods that involve mathematics, informatics, sociology, and management science, from three dimensions including network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire value.

– Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preferences, emotions, interests, and demands may be revealed.

– Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies aggregate relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.

The US Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

Fig. 5 Enabling technologies for online social network-oriented big data

In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, it was found that most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and then get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted research that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans. The goal is to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting the sharp growth or drop of the amount of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation from the official statistics of Indonesia matches the number of Tweets about rice price on Twitter, as shown in Fig. 6.
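The first aspect, detecting a sharp growth or drop in the volume of a topic, can be sketched with a simple rolling z-score over daily message counts. This is a generic technique assumed here for illustration and is not described as the method used by the Global Pulse project; the window length, the threshold, the floor on the standard deviation, and the sample counts are all hypothetical.

```python
from statistics import mean, stdev


def detect_volume_anomalies(counts, window=7, threshold=3.0):
    """Flag days whose count deviates sharply from the preceding window.

    Returns a list of (day_index, "growth" | "drop") tuples.
    The window must contain at least two days for stdev to be defined.
    """
    anomalies = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        # Floor the spread at one message so a flat history cannot
        # cause a division by zero.
        sigma = max(stdev(history), 1.0)
        z = (counts[i] - mu) / sigma
        if abs(z) >= threshold:
            anomalies.append((i, "growth" if z > 0 else "drop"))
    return anomalies


# Hypothetical daily counts of Tweets mentioning one topic.
daily_counts = [100, 102, 98, 101, 99, 100, 103, 250,
                101, 100, 99, 102, 98, 100, 101, 20]
flagged = detect_volume_anomalies(daily_counts)
```

On this sample, day 7 is flagged as a sharp growth and day 15 as a sharp drop. Note that a spike inflates the spread of the following windows, so a production system would typically use a more robust baseline such as a median-based estimate.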

Generally speaking, the application of big data from online SNS may help to better understand users' behavior and master the laws of social and economic activities from the following three aspects:

– Early Warning: to rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotion, and preference of users.

– Real-time Feedback: to acquire groups' feedback on some social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the health care business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final result into an extremely personalized treatment plan to assess the risk factors and main treatment plans of patients. Then, doctors may reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or by suggesting that patients reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20.

The Mount Sinai Medical Center in the US utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through the software development kit (SDK) and open interface.

Fig. 6 The correlation between Tweets about rice price and food price inflation

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units and coordinate with mobile networks for the distribution of sensing tasks and the collection and utilization of sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of Crowdsourcing has been successfully applied to geotagged photographs, positioning and navigation, urban road traffic sensing, market forecasts, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach for problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, Crowdsourcing had been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of Crowdsourcing. The main idea of Crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, Crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, Spatial Crowdsourcing has become a hot topic. The operation framework of Spatial Crowdsourcing is as follows. A user may request the service and resources related to a specified location. Then, the mobile users who are willing to participate in the task will move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data will be sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions they provide, it can be forecast that Spatial Crowdsourcing will become more prevalent than traditional Crowdsourcing, e.g., Amazon Turk and Crowdflower.
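The request, travel, and return steps of this framework can be sketched as a small task-assignment routine. The text does not say how participants are selected, so choosing the nearest willing workers is purely an assumption of this example, as are the planar coordinates, the worker records, and the function names.

```python
import math


def distance(a, b):
    """Straight-line distance between two (x, y) positions."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def assign_task(task_location, workers, max_workers=2):
    """Pick the nearest willing workers for a location-bound task.

    workers: dict mapping a worker name to (position, is_willing).
    """
    willing = [
        (name, position)
        for name, (position, is_willing) in workers.items()
        if is_willing
    ]
    willing.sort(key=lambda worker: distance(worker[1], task_location))
    return [name for name, _ in willing[:max_workers]]


# Hypothetical mobile users with positions and willingness to participate.
mobile_users = {
    "w1": ((0, 0), True),
    "w2": ((5, 5), True),
    "w3": ((1, 1), False),  # closest, but not willing to participate
    "w4": ((2, 1), True),
}
assigned = assign_task((1, 1), mobile_users)
```

The selected users would then travel to the requested location, capture the data, and send it back to the requester. Worker `w3` is skipped despite being on the spot, which reflects the voluntary nature of participation.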

6.3.6 Smart grid

Smart Grid is the next-generation power grid, constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart Grid related big data are generated from various sources, such as (i) power utilization habits of users; (ii) phasor measurement data, measured by phasor measurement units (PMUs) deployed nationwide; (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI); (iv) energy market pricing and bidding data; and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as Circuit Breaker Monitors and transformers). Smart Grid brings about the following challenges in exploiting big data.

– Grid planning. By analyzing data in the Smart Grid, regions with excessively high electrical load or high power outage frequency can be identified. Even transmission lines with a high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance. For example, researchers from the University of California, Los Angeles designed an “electric map” according to big data theory, building a map of California by integrating census information with real-time power utilization information provided by electric power companies. The map takes a block as its unit and shows the current power consumption of every block. It can even compare the power consumption of a block with the average income per capita and building types, so as to reveal more accurate power usage habits of the various groups in the community. This map provides an effective, visual load forecast for power grid planning in a city. The power grid facilities in blocks that the map shows to have high power outage frequencies and serious overloads can be given priority for transformation.

– Interaction between power generation and power consumption. An ideal power grid should balance power generation and consumption. However, the traditional power grid is built on a one-directional approach of transmission-transformation-distribution-consumption, which does not allow the generation capacity to be adjusted according to the demand for power, thus leading to electric energy redundancy and waste. Therefore, smart electric meters have been developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which help the supplier read power utilization data every 15 min rather than every month as in the past. The labor cost of meter reading is greatly reduced. Because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and low periods of power consumption. TXU Energy used such price levels to smooth the peak and low fluctuations of power consumption. In fact, the application of big data in the smart grid can help realize time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users.

– The access of intermittent renewable energy. At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable energy sources can be efficiently managed: the electricity they generate can be allocated to regions with electricity shortages. Such energy resources can complement traditional hydropower and thermal power generation.
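The time-sharing dynamic pricing mentioned in the second item can be illustrated with 15-minute meter readings. The following is a minimal sketch under stated assumptions: the peak window (17:00 to 20:59), both tariff values, and the function names are hypothetical and are not taken from TXU Energy or any real tariff.

```python
from datetime import datetime

PEAK_HOURS = range(17, 21)  # assumed peak window: hours 17, 18, 19, 20
PEAK_RATE = 0.30            # assumed price per kWh during the peak window
OFF_PEAK_RATE = 0.10        # assumed price per kWh at all other times


def rate_for(ts: datetime) -> float:
    """Price per kWh that applies to the interval starting at ts."""
    return PEAK_RATE if ts.hour in PEAK_HOURS else OFF_PEAK_RATE


def bill(readings):
    """Total charge for (interval start, kWh in that 15-minute interval) pairs."""
    return sum(kwh * rate_for(ts) for ts, kwh in readings)


def peak_share(readings):
    """Fraction of total consumption that falls inside the peak window."""
    readings = list(readings)
    total = sum(kwh for _, kwh in readings)
    peak = sum(kwh for ts, kwh in readings if ts.hour in PEAK_HOURS)
    return peak / total if total else 0.0
```

With monthly readings only the total is known, so neither `bill` nor `peak_share` could distinguish peak from off-peak use; the 15-minute granularity is what makes a time-dependent price possible.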

7 Conclusion, open issues, and outlook

In this paper, we review the background and state of the art of big data. First, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. Finally, we review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide readers with a comprehensive overview and big picture of this exciting area.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions for big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but current research is still at an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, which are discussed below.

– Fundamental problems of big data. There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

– Standardization of big data. An evaluation system for data quality and an evaluation standard/benchmark for data computing efficiency should be developed. Many big data solutions claim to improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard or benchmark that weighs the computing efficiency of big data with rigorous mathematical methods. Performance can only be evaluated once the system is implemented and deployed, so the advantages and disadvantages of alternative solutions cannot be compared side by side, even before and after the implementation of big data. In addition, since data quality is an important basis for data preprocessing, simplification, and screening, effectively and rigorously evaluating data quality is also an urgent problem.

– Evolution of big data computing modes. This includes memory mode, data flow mode, PRAM mode, and MR (MapReduce) mode, etc. The emergence of big data triggers advances in algorithm design, which has shifted from a computing-intensive approach to a data-intensive approach. Data transfer has become a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, should be fully investigated.

– Format conversion of big data. Because data sources are wide and diverse, heterogeneity is always a characteristic of big data, as well as a key factor restricting the efficiency of data format conversion. If such format conversion can be made more efficient, big data applications may create more value.

– Big data transfer. Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which makes it the bottleneck of big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor in improving big data computing.

– Real-time performance of big data. The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models for real-time and online applications will influence the analysis results of big data.

– Processing of big data. As big data research advances, new problems in big data processing arise from traditional data analysis, including (i) data re-utilization: as data scale increases, more value may be mined from the re-utilization of existing data; (ii) data re-organization: datasets from different businesses can be re-organized, from which more value can be mined; and (iii) data exhaust, which means wrong data collected during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved.

– Big data management. The emergence of big data brings new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management for mobile and pervasive computing, data management for SNS, and distributed data management.

– Searching, mining, and analysis of big data. Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning.

– Integration and provenance of big data. As discussed, the value acquired from the comprehensive utilization of multiple datasets is far higher than the sum of the values of the individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information that follows different standards and comes from different datasets.

– Big data application. At present, the application of big data is just beginning, and we should explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication; big data applications in small and medium-sized businesses; big data applications in government departments; big data services; and human-computer interaction of big data are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume grows quickly, safety risks become more severe, while traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security-related challenges.

– Big data privacy. Big data privacy includes two aspects: (i) protection of personal privacy during data acquisition: personal interests, habits, body properties, etc. of users may be more easily acquired, and users may not be aware of it; (ii) personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is regarded as the big data company with the most SNS data at present. According to a report [156], Ron Bowes, a researcher at Skull Security, used an information acquisition tool to collect data from the public pages of Facebook users who had not modified their privacy settings. Ron Bowes packaged the data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may allow private information to be mined from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality. Data quality influences big data utilization. Low-quality data wastes transmission and storage resources and has poor usability. Many factors may restrict data quality; for example, generation, acquisition, and transmission may all influence it. Data quality is mainly manifested in accuracy, completeness, redundancy, and consistency. Even though many measures have been taken to improve data quality, the related problems have not yet been well addressed. Therefore, effective methods to automatically detect data quality and repair damaged data need to be investigated.

– Big data safety mechanism. Big data brings challenges to data encryption because of its large scale and high diversity. The performance of previous encryption methods, designed for small and medium-scale data, cannot meet the demands of big data, so efficient big data cryptography approaches should be developed. Effective schemes for safety management, access control, and safe communications should be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, the isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be ensured on the premise of efficiency assurance.

– Big data application in information security. Big data not only brings challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) by analyzing big data in the form of the log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, and attack characteristics may also be more easily identified through the analysis of big data.
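Three of the four quality dimensions named in the data quality item (completeness, redundancy, and consistency) lend themselves to simple automatic checks. The sketch below is one possible formulation, not an established standard: the dictionary-based record layout, the function names, and the caller-supplied consistency rule are all assumptions made for illustration. Accuracy is omitted because it requires ground truth to compare against.

```python
def completeness(records, fields):
    """Fraction of required field slots that are present and not None."""
    slots = len(records) * len(fields)
    if slots == 0:
        return 1.0
    filled = sum(1 for r in records for f in fields if r.get(f) is not None)
    return filled / slots


def redundancy(records, key_fields):
    """Fraction of records that repeat an earlier record on key_fields."""
    if not records:
        return 0.0
    seen = set()
    dupes = 0
    for r in records:
        key = tuple(r.get(f) for f in key_fields)
        if key in seen:
            dupes += 1
        else:
            seen.add(key)
    return dupes / len(records)


def consistency(records, rule):
    """Fraction of records that satisfy a caller-supplied rule."""
    if not records:
        return 1.0
    return sum(1 for r in records if rule(r)) / len(records)
```

For example, with hypothetical smart-meter records keyed by `meter` and `ts`, a rule such as `lambda r: r["kwh"] is None or r["kwh"] >= 0` would flag negative consumption values as inconsistent. Repairing the flagged records is a separate and harder problem that this sketch does not address.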

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, energy-efficient distributed storage, and hardware and software system architectures for processing. In particular, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the “T” (Technology) was the main concern, and technology drove the development of data. In the big data era, with the prominence of data value and advances in the “I” (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact but also influence everyone's ways of living and thinking, which is already happening. We cannot predict the future, but we can take precautions against events that may occur.

– Data with a larger scale, higher diversity, and more complex structures. Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced, given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have sought better ways to cope with larger-scale, more diverse, and more complexly structured data. These efforts are represented by Spanner, Google's globally distributed database, and F1, a fault-tolerant, expandable, distributed relational database. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to those of relational databases, and handle data effectively through grammars similar to SQL.

– Data resource performance. Since big data contains huge value, mastering big data means mastering resources. Analysis of the value chain of big data shows that its value comes from the data itself, technologies, and ideas, and that the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting out and assigning the rights to use their data.

– Big data promotes the cross fusion of science. Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, but also forces the cross fusion of many disciplines. The development of big data should explore innovative technologies and methods for data acquisition, storage, processing, analysis, and information security. Then, the impacts of big data on production management, business operation, and decision making should be examined for modern enterprises from the management perspective. Moreover, applying big data to specific fields requires the participation of interdisciplinary talent.

– Visualization. In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., in text and image editors. In big data applications, mixed data is very useful for decision making. Only when analytical results are displayed in a friendly way can they be effectively used. Reports, histograms, pie charts, and regression curves are frequently used to visualize the results of data analysis. New presentation forms will appear in the future; e.g., Microsoft Renlifang, a social search engine, uses relational diagrams to express interpersonal relationships.

– Data-oriented. It is well known that programs consist of data structures and algorithms, and that data structures are used to store data. In the history of program design, the role of data has become increasingly significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed that of business logic. Consequently, programs are gradually being transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods will emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking. Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

  – During data analysis, we will try to utilize all data rather than analyzing only a small set of sample data.

  – Compared with accurate data, we would like to accept numerous and complicated data.

  – We shall pay greater attention to correlations between things rather than exploring causal relationships.

  – The simple algorithms of big data are more effective than the complex algorithms of small data.

  – Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace “experts.”

Throughout the history of human society, the demands and willingness of human beings have always been the driving forces of scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

                                        References

1. Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

2. Fact sheet: Big data across the federal government (2012). http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf

3. Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper

4. Drowning in numbers - digital data will flood the planet - and help us understand it better (2011). http://www.economist.com/blogs/dailychart/2011/11/bigdata-0

5. Lohr S (2012) The age of big data. New York Times, pp 11

6. Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/the-digital-breadcrumbs-that-lead-to-big-data

7. Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-search-for-analysts-to-make-sense-of-big-data

8. Big data (2008). http://www.nature.com/news/specials/bigdata/index.html

9. Special online collection: dealing with big data (2011). http://www.sciencemag.org/site/special/data

10. Manyika J, McKinsey Global Institute, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute

11. Mayer-Schönberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Eamon Dolan/Houghton Mifflin Harcourt

12. Laney D (2001) 3-D data management: controlling data volume, velocity and variety. META Group Research Note, 6 February

13. Zikopoulos P, Eaton C et al (2011) Understanding big data: analytics for enterprise class Hadoop and streaming data. McGraw-Hill Osborne Media

14. Meijer E (2011) The world according to LINQ. Communications of the ACM 54(10):45–51

15. Beyer M (2011) Gartner says solving big data challenge involves more than just managing volumes of data. Gartner. http://www.gartner.com/it/page.jsp

16. O. R. Team (2011) Big data now: current perspectives from O'Reilly Radar. O'Reilly Media

17. Grobelnik M (2012) Big data tutorial. http://videolectures.net/eswc2012_grobelnik_big_data

18. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2008) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014

19. DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6):85–98

20. Walter T (2009) Teradata past, present, and future. UCI ISG lecture series on scalable data management

21. Ghemawat S, Gobioff H, Leung S-T (2003) The Google file system. In: ACM SIGOPS Operating Systems Review, vol 37. ACM, pp 29–43

22. Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

23. Hey AJG, Tansley S, Tolle KM et al (2009) The fourth paradigm: data-intensive scientific discovery

24. Howard JH, Kazar ML, Menees SG, Nichols DA, Satyanarayanan M, Sidebotham RN, West MJ (1988) Scale and performance in a distributed file system. ACM Trans Comput Syst (TOCS) 6(1):51–81

25. Cattell R (2011) Scalable SQL and NoSQL data stores. ACM SIGMOD Record 39(4):12–27

26. Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033

27. Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98

28. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J et al (2012) Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States

29. Sun Y, Chen M, Liu B, Mao S (2013) FAR: a fault-avoidant routing method for data center networks with regular topology. In: Proceedings of ACM/IEEE symposium on architectures for networking and communications systems (ANCS'13). ACM

30. Wiki (2013) Applications and organizations using Hadoop. http://wiki.apache.org/hadoop/PoweredBy

31. Bahga A, Madisetti VK (2012) Analyzing massive machine maintenance data in a computing cloud. IEEE Transac Parallel Distrib Syst 23(10):1831–1843

32. Gunarathne T, Wu T-L, Choi JY, Bae S-H, Qiu J (2011) Cloud computing paradigms for pleasingly parallel biomedical applications. Concurr Comput Prac Experience 23(17):2338–2354

33. Gantz J, Reinsel D (2010) The digital universe decade - are you ready? External publication of IDC (Analyse the Future) information and data, pp 1–16

34. Bryant RE (2011) Data-intensive scalable computing for scientific applications. Comput Sci Eng 13(6):25–33

35. Wahab MHA, Mohd MNH, Hanafi HF, Mohsin MFM (2008) Data pre-processing on web server logs for generalized association rules mining algorithm. World Acad Sci Eng Technol 48:2008

36. Nanopoulos A, Manolopoulos Y, Zakrzewicz M, Morzy T (2002) Indexing web access-logs for pattern queries. In: Proceedings of the 4th international workshop on web information and data management. ACM, pp 63–68

37. Joshi KP, Joshi A, Yesha Y (2003) On using a warehouse to analyze web logs. Distrib Parallel Databases 13(2):161–180

38. Chandramohan V, Christensen K (2002) A first look at wired sensor networks for video surveillance systems. In: Proceedings LCN 2002, 27th annual IEEE conference on local computer networks. IEEE, pp 728–729

39. Selavo L, Wood A, Cao Q, Sookoor T, Liu H, Srinivasan A, Wu Y, Kang W, Stankovic J, Young D et al (2007) LUSTER: wireless sensor network for environmental research. In: Proceedings of the 5th international conference on Embedded networked sensor systems. ACM, pp 103–116

40. Barrenetxea G, Ingelrest F, Schaefer G, Vetterli M, Couach O, Parlange M (2008) SensorScope: out-of-the-box environmental monitoring. In: Information processing in sensor networks, 2008, international conference on, IPSN'08. IEEE, pp 332–343

                                        41 Kim Y Schmid T Charbiwala ZM Friedman J Srivastava MB(2008) Nawms nonintrusive autonomous water monitoring sys-tem In Proceedings of the 6th ACM conference on Embeddednetwork sensor systems ACM pp 309ndash322

                                        42 Kim S Pakzad S Culler D Demmel J Fenves G Glaser STuron M (2007) Health monitoring of civil infrastructures usingwireless sensor networks In Information Processing in SensorNetworks 2007 6th International Symposium on IPSN 2007IEEE pp 254ndash263

                                        43 Ceriotti M Mottola L Picco GP Murphy AL Guna S Corra MPozzi M Zonta D Zanon P (2009) Monitoring heritage build-ings with wireless sensor networks the torre aquila deploymentIn Proceedings of the 2009 International Conference on Infor-mation Processing in Sensor Networks IEEE Computer Societypp 277ndash288

                                        44 Tolle G Polastre J Szewczyk R Culler D Turner N Tu KBurgess S Dawson T Buonadonna P Gay D et al (2005) Amacroscope in the redwoods In Proceedings of the 3rd interna-tional conference on embedded networked sensor systems ACMpp 51ndash63

                                        45 Wang F Liu J (2011) Networked wireless sensor data collectionissues challenges and approaches IEEE Commun Surv Tutor13(4)673ndash687

46. Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the 11th international conference on World Wide Web. ACM, pp 124–135

47. Choudhary S, Dincturk ME, Mirtaheri SM, Moosavi A, von Bochmann G, Jourdan G-V, Onut I-V (2012) Crawling rich internet applications: the state of the art. In: CASCON, pp 146–160

48. Ghani N, Dixit S, Wang T-S (2000) On ip-over-wdm integration. IEEE Commun Mag 38(3):72–84

49. Manchester J, Anderson J, Doshi B, Dravida S (1998) Ip over sonet. IEEE Commun Mag 36(5):136–142

50. Jinno M, Takara H, Kozicki B (2009) Dynamic optical mesh networks: drivers, challenges and solutions for the future. In: Optical communication, 2009, 35th European conference on ECOC'09. IEEE, pp 1–4

51. Barroso LA, Holzle U (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synt Lect Comput Archit 4(1):1–108

52. Armstrong J (2009) Ofdm for optical communications. J Light Technol 27(3):189–204

53. Shieh W (2011) Ofdm for flexible high-speed optical networks. J Light Technol 29(10):1560–1577

54. Cisco data center interconnect design and deployment guide (2010)

55. Greenberg A, Hamilton JR, Jain N, Kandula S, Kim C, Lahiri P, Maltz DA, Patel P, Sengupta S (2009) Vl2: a scalable and flexible data center network. In: ACM SIGCOMM computer communication review, vol 39. ACM, pp 51–62

56. Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S (2009) Bcube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 39(4):63–74

57. Farrington N, Porter G, Radhakrishnan S, Bazzaz HH, Subramanya V, Fainman Y, Papen G, Vahdat A (2011) Helios: a hybrid electrical/optical switch architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 41(4):339–350

58. Abu-Libdeh H, Costa P, Rowstron A, O'Shea G, Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Comput Commun Rev 40(4):51–62

59. Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what's needed for datacenter network operations. IEEE Commun Mag 48(7):32–39

Mobile Netw Appl (2014) 19:171–209

60. Wang G, Andersen DG, Kaminsky M, Papagiannaki K, Ng TS, Kozuch M, Ryan M (2010) c-through: Part-time optics in data centers. In: ACM SIGCOMM Computer Communication Review, vol 40. ACM, pp 327–338

61. Ye X, Yin Y, Yoo SJB, Mejia P, Proietti R, Akella V (2010) Dos: a scalable optical switch for datacenters. In: Proceedings of the 6th ACM/IEEE symposium on architectures for networking and communications systems. ACM, p 24

62. Singla A, Singh A, Ramachandran K, Xu L, Zhang Y (2010) Proteus: a topology malleable data center network. In: Proceedings of the 9th ACM SIGCOMM workshop on hot topics in networks. ACM, p 8

63. Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P (2011) Energy-efficient design of a scalable optical multiplane interconnection architecture. IEEE J Sel Top Quantum Electron 17(2):377–383

64. Kodi AK, Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance computing (hpc) systems. IEEE J Sel Top Quantum Electron 17(2):384–395

65. Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H (2012) Mirror mirror on the ceiling: flexible wireless links for data centers. ACM SIGCOMM Comput Commun Rev 42(4):443–454

66. Lenzerini M (2002) Data integration: a theoretical perspective. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. ACM, pp 233–246

67. Cafarella MJ, Halevy A, Khoussainova N (2009) Data integration for the relational web. Proc VLDB Endowment 2(1):1090–1101

68. Maletic JI, Marcus A (2000) Data cleansing: beyond integrity analysis. In: IQ. Citeseer, pp 200–209

69. Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1-2):83–113

70. Chen H, Ku W-S, Wang H, Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 51–62

71. Zhao Z, Ng W (2012) A model-based approach for rfid data stream cleansing. In: Proceedings of the 21st ACM international conference on information and knowledge management. ACM, pp 862–871

72. Khoussainova N, Balazinska M, Suciu D (2008) Probabilistic event extraction from rfid data. In: Data Engineering, 2008, IEEE 24th international conference on ICDE 2008. IEEE, pp 1480–1482

73. Herbert KG, Wang JTL (2007) Biological data cleaning: a case study. Int J Inf Qual 1(1):60–82

74. Tsai T-H, Lin C-Y (2012) Exploring contextual redundancy in improving object-based video coding for video sensor networks surveillance. IEEE Transac Multmed 14(3):669–682

75. Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 269–278

76. Kamath U, Compton J, Dogan RI, Jong KD, Shehu A (2012) An evolutionary algorithm approach for feature generation from sequence data and its application to dna splice site prediction. IEEE/ACM Transac Comput Biol Bioinforma (TCBB) 9(5):1387–1398

77. Leung K-S, Lee KH, Wang J-F, Ng EYT, Chan HLY, Tsui SKW, Mok TSK, Tse PC-H, Sung JJ-Y (2011) Data mining on dna sequences of hepatitis b virus. IEEE/ACM Transac Comput Biol Bioinforma 8(2):428–440

78. Huang Z, Shen H, Liu J, Zhou X (2011) Effective data co-reduction for multimedia similarity search. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, pp 1021–1032

79. Bleiholder J, Naumann F (2008) Data fusion. ACM Comput Surv (CSUR) 41(1):1

80. Brewer EA (2000) Towards robust distributed systems. In: PODC, p 7

81. Gilbert S, Lynch N (2002) Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59

82. McKusick MK, Quinlan S (2009) Gfs: evolution on fast-forward. ACM Queue 7(7):10

83. Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J (2008) Scope: easy and efficient parallel processing of massive data sets. Proc VLDB Endowment 1(2):1265–1276

84. Beaver D, Kumar S, Li HC, Sobel J, Vajgel P et al (2010) Finding a needle in haystack: facebook's photo storage. In: OSDI, vol 10, pp 1–8

85. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: amazon's highly available key-value store. In: SOSP, vol 7, pp 205–220

86. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In: Proceedings of the twenty-ninth annual ACM symposium on theory of computing. ACM, pp 654–663

87. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2):4

88. Burrows M (2006) The chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th symposium on Operating systems design and implementation. USENIX Association, pp 335–350

89. Lakshman A, Malik P (2009) Cassandra: structured storage system on a p2p network. In: Proceedings of the 28th ACM symposium on principles of distributed computing. ACM, pp 5–5

90. George L (2011) HBase: the definitive guide. O'Reilly Media Inc

91. Judd D (2008) hypertable-0.9.0.4-alpha

92. Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc

93. Crockford D (2006) The application/json media type for javascript object notation (json)

94. Murty J (2009) Programming amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc

95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc

96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 975–986

97. Yang H-C, Parker DS (2009) Traverse: simplified indexing on large map-reduce-merge clusters. In: Database systems for advanced applications. Springer, pp 308–322

98. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with sawzall. Sci Program 13(4):277–298

99. Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U (2009) Building a high-level dataflow system on top of map-reduce: the pig experience. Proceedings VLDB Endowment 2(2):1414–1425

100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629

101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72

102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) Dryadlinq: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14

103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: Parallel and distributed processing, 2008, IEEE international symposium on IPDPS 2008. IEEE, pp 1–11

104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146

105. Bu Y, Bill H, Balazinska M, Ernst MD (2010) Haloop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296

106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative mapreduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing. ACM, pp 810–818

107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2

108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: mapreduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing. ACM, p 7

109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) Ciel: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on Networked systems design and implementation, p 9

110. Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York

111. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

112. What analytics, data mining, big data software you used in the past 12 months for a real project? (2012) http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html

113. Berthold MR, Cebron N, Dill F, Gabriel TR, Kotter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer

114. Sallam RL, Richardson J, Hagerty J, Hostmann B (2011) Magic quadrant for business intelligence platforms. Gartner Group, Stamford, CT

115. Beyond the PC. Special Report on Personal Technology (2011)

116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A et al (2011) The iplant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on Software quality assurance. ACM, pp 70–77

118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers. ACM, pp 277–286

119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Record 34(2):18–26

120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM Sigmod Record 33(1):50–57

121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Transac Manag Inform Syst (TMIS) 3(2):7

122. Manning CD, Schutze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press

123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Transac Neural Netw 13(5):1163–1177

124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11

125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117

126. Konopnicki D, Shmueli O (1995) W3qs: a query system for the world-wide web. In: VLDB, vol 95, pp 54–65

127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640

128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval. ACM, p 2

129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25

130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19

131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819

132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939

133. Barragans-Martinez AB, Costa-Montenegro E, Burguillo JC, Rey-Lopez M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend tv programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311

134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91

135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia. ACM, pp 469–478

136. Hirsch JE (2005) An index to quantify an individual's scientific research output. Proc Natl Acad Sci USA 102(46):16569

137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company

138. Aggarwal CC (2011) An introduction to social network data analytics. Springer

139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1046–1054

140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied Computing. ACM, pp 1115–1116

141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Transac Knowl Discov Data (TKDD) 5(2):10

142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on World wide web. ACM, pp 631–640

143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis. ACM, pp 16–25

144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. ACM, pp 315–321

145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 145–158

146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 131–144

147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1007–1016

148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 807–816

149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on Web search and data mining. ACM, pp 657–666

150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360

151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (Accessed 5 May 2013)

152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51

153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive rfid, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 2

154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval. ACM, pp 469–478

155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated, multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium. ACM, pp 445–454

156. Tasevski P (2011) Password attacks and generation strategies. Tartu University: Faculty of Mathematics and Computer Sciences

                                        • Big Data A Survey
                                          • Abstract
                                          • Background
                                            • Dawn of big data era
                                            • Definition and features of big data
                                            • Big data value
                                            • The development of big data
                                            • Challenges of big data
                                              • Related technologies
                                                • Relationship between cloud computing and big data
                                                • Relationship between IoT and big data
                                                • Data center
                                                • Relationship between hadoop and big data
                                                  • Big data generation and acquisition
                                                    • Data generation
                                                      • Enterprise data
                                                      • IoT data
                                                      • Bio-medical data
                                                      • Data generation from other fields
                                                        • Big data acquisition
                                                          • Data collection
                                                          • Data transportation
                                                          • Data pre-processing
                                                              • Big data storage
                                                                • Storage system for massive data
                                                                • Distributed storage system
                                                                • Storage mechanism for big data
                                                                  • Database technology
                                                                    • Traditional data analysis
                                                                    • Big data analytic methods
                                                                    • Architecture for big data analysis
                                                                      • Real-time vs offline analysis
                                                                      • Analysis at different levels
                                                                      • Analysis with different complexity
                                                                        • Tools for big data mining and analysis
                                                                          • Big data applications
                                                                            • Key applications of big data
                                                                              • Application evolutions
                                                                              • Structured data analysis
                                                                              • Text data analysis
                                                                              • Web data analysis
                                                                              • Multimedia data analysis
                                                                              • Network data analysis
                                                                              • Mobile data analysis
                                                                                • Key applications of big data
                                                                                  • Application of big data in enterprises
                                                                                  • Application of IoT based big data
                                                                                  • Application of online social network-oriented big data
                                                                                  • Applications of healthcare and medical big data
                                                                                  • Collective intelligence
                                                                                  • Smart grid
                                                                                      • Conclusion open issues and outlook
                                                                                        • Open issues
                                                                                          • Theoretical research
                                                                                          • Technology development
                                                                                          • Practical implications
                                                                                          • Data security
                                                                                            • Outlook
                                                                                              • Acknowledgments
                                                                                              • References

– Correlation Analysis: an analytical method for determining the law of relations, such as correlation, correlative dependence, and mutual restriction, among observed phenomena, and accordingly conducting forecast and control. Such relations may be classified into two types: (i) function, reflecting the strict dependence relationship among phenomena, which is also called a definitive dependence relationship; (ii) correlation, in which the dependence is undetermined or inexact: a numerical value of one variable may correspond to several numerical values of the other variable, and such numerical values fluctuate regularly around their mean values.

– Regression Analysis: a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experiments or observed data, regression analysis identifies dependence relationships among variables hidden by randomness. Regression analysis may turn complex and undetermined correlations among variables into simple and regular ones.

– A/B Testing: also called bucket testing. It is a technique for determining how to improve target variables by comparing the tested groups. Big data will require a large number of tests to be executed and analyzed.

– Statistical Analysis: statistical analysis is based on statistical theory, a branch of applied mathematics. In statistical theory, randomness and uncertainty are modeled with probability theory. Statistical analysis can provide a description and an inference for big data. Descriptive statistical analysis can summarize and describe datasets, while inferential statistical analysis can draw conclusions from data subject to random variations. Statistical analysis is widely applied in the economic and medical care fields [110].

– Data Mining Algorithms: data mining is a process for extracting hidden, unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. In 2006, the IEEE International Conference on Data Mining Series (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [111], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, and CART. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and link mining, all of which are among the most important problems in data mining research.
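
As a small illustration of the correlation and regression items above, the following sketch computes a Pearson correlation coefficient and an ordinary least-squares line for two variables. It is a minimal example under stated assumptions: the function names `pearson_r` and `fit_line` and the sample data are hypothetical, and only a single explanatory variable is used, whereas regression analysis in general may involve several.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations: ys fluctuates around a linear trend in xs.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(xs, ys)
slope, intercept = fit_line(xs, ys)
print(f"r = {r:.4f}, y ~ {slope:.2f} * x + {intercept:.2f}")
```

A value of `r` close to 1 indicates a strong but inexact (correlation-type) dependence, while the fitted line is the simple, regular relationship that regression extracts from the noisy observations.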

5.2 Big data analytic methods

At the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data so as to bring value to enterprises and individuals. At present, the main processing methods of big data are as follows.

                                          ndash Bloom Filter Bloom Filter consists of a series of Hashfunctions The principle of Bloom Filter is to store Hashvalues of data other than data itself by utilizing a bitarray which is in essence a bitmap index that uses Hashfunctions to conduct lossy compression storage of dataIt has such advantages as high space efficiency andhigh query speed but also has some disadvantages inmisrecognition and deletion

– Hashing: Hashing is a method that essentially transforms data into shorter fixed-length numerical values or index values. Hashing has such advantages as rapid reading, writing, and high query speed, but it is hard to find a sound hash function.

– Index: An index is generally an effective method to reduce the expense of disk reading and writing, and to improve insertion, deletion, modification, and query speeds, in both traditional relational databases that manage structured data and other technologies that manage semi-structured and unstructured data. However, an index has the disadvantage of the additional cost of storing index files, which must be maintained dynamically when data is updated.

– Trie: Also called a trie tree, the trie is a variant of the hash tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of the trie is to utilize common prefixes of character strings to reduce string comparisons to the greatest extent, so as to improve query efficiency.

– Parallel Computing: Compared to traditional serial computing, parallel computing refers to simultaneously utilizing several computing resources to complete a computation task. Its basic idea is to decompose a problem into parts and assign them to several separate processes to be completed independently, so as to achieve co-processing. Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad (see a comparison in Table 1).
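The Bloom filter described in the list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the class name, the default sizes, and the use of salted SHA-256 digests as the "series of hash functions" are choices made for the example, not part of the survey.

```python
import hashlib


class BloomFilter:
    """Bit array plus several hash functions; stores hashes, not the data itself."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Assumption: derive each hash function by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


bf = BloomFilter()
bf.add("alice")
bf.add("bob")
```

Membership tests never miss an inserted item, but they may report an item that was never added (the misrecognition noted above). Because bits are shared between items, clearing bits to delete one item could erase another, which is why deletion is a weakness of the plain structure.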

Although parallel computing systems or tools such as MapReduce or Dryad are useful for big data analysis, they are low-level tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages are being developed based on these systems. Such high-level languages include Sawzall, Pig, and Hive used for MapReduce, as well as Scope and DryadLINQ used for Dryad.
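The decompose-and-combine idea behind MapReduce can be illustrated with a small word count. The sketch below only imitates the programming model on one machine and is not the Hadoop or Google MapReduce API: the function names are hypothetical, and a thread pool stands in for the cluster workers.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(chunks, workers=4):
    # Each split is mapped independently, which is what makes the job parallelizable.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big", "data analysis"])
```

Even in this toy form the user must write the map and reduce functions by hand, which hints at why higher-level languages such as Pig and Hive were layered on top.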

5.3 Architecture for big data analysis

Because of the 4Vs of big data, different analytical architectures should be considered for different application requirements.


Table 1 Comparison of MPI, MapReduce, and Dryad

| | MPI | MapReduce | Dryad |
|---|---|---|---|
| Deployment | Computing node and data storage arranged separately (data should be moved to the computing node) | Computing and data storage arranged at the same node (computing should be close to data) | Computing and data storage arranged at the same node (computing should be close to data) |
| Resource management / scheduling | – | Workqueue (Google), HOD (Yahoo) | Not clear |
| Low-level programming | MPI API | MapReduce API | Dryad API |
| High-level programming | – | Pig, Hive, Jaql, ... | Scope, DryadLINQ |
| Data storage | The local file system, NFS, ... | GFS (Google), HDFS (Hadoop), KFS, Amazon S3, ... | NTFS, Cosmos DFS |
| Task partitioning | User manually partitions the tasks | Automation | Automation |
| Communication | Messaging, remote memory access | Files (local FS, DFS) | Files, TCP pipes, shared-memory FIFOs |
| Fault tolerance | Checkpoint | Task re-execution | Task re-execution |

5.3.1 Real-time vs. offline analysis

According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

– Real-time analysis is mainly used in e-commerce and finance. Since data constantly changes, rapid data analysis is needed and analytical results must be returned with a very short delay. The main existing architectures of real-time analysis include (i) parallel processing clusters using traditional relational databases, and (ii) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

– Offline analysis is usually used for applications without high requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally conducts analysis by importing logs into a special platform through data acquisition tools. In the big data setting, many Internet enterprises utilize an offline analysis architecture based on Hadoop in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples of such acquisition tools include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool Timetunnel, and Chukwa of Hadoop. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.

5.3.2 Analysis at different levels

Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive-level analysis, which are examined in the following.

– Memory-level analysis is for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays, the memory of a server cluster surpasses hundreds of GB, and even the TB level is common. Therefore, an in-memory database technology may be used, and hot data should reside in memory so as to improve analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis. MongoDB is a representative memory-level analytical architecture. With the development of SSDs (Solid-State Drives), the capacity and performance of memory-level data analysis have been further improved, and such analysis has been widely applied.

– BI analysis is for the case when the data scale surpasses the memory level but the data can still be imported into the BI analysis environment. Current mainstream BI products provide data analysis solutions that support data at the TB level and above.

– Massive analysis is for the case when the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analyses utilize HDFS of Hadoop to store data and use MapReduce for data analysis. Most massive analysis belongs to the offline analysis category.
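The three levels above amount to a simple decision rule on data volume. The sketch below encodes that rule; the function name and both capacity parameters are hypothetical, since the text gives no exact thresholds beyond "smaller than the maximum memory of a cluster" and "beyond the capacities of BI products".

```python
GB = 1024 ** 3
TB = 1024 ** 4


def choose_analysis_level(data_bytes, cluster_memory_bytes, bi_capacity_bytes):
    """Pick an analysis level from the data volume (capacities are assumptions)."""
    if data_bytes < cluster_memory_bytes:
        return "memory-level"  # data fits in cluster memory
    if data_bytes <= bi_capacity_bytes:
        return "BI"            # too large for memory, still importable into BI tools
    return "massive"           # beyond BI products and traditional RDBMSs


# Hypothetical cluster: 512 GB of memory, BI environment good for up to 10 TB.
levels = [
    choose_analysis_level(size, 512 * GB, 10 * TB)
    for size in (50 * GB, 2 * TB, 500 * TB)
]
```

In practice the boundaries also depend on workload and latency requirements, so a rule of this kind is only a first approximation.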


5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithms differ greatly from each other according to different kinds of data and application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the five most widely used software tools, according to a 2012 KDnuggets survey of 798 professionals asking "What Analytics, Data mining, Big Data software that you used in the past 12 months for a real project?" [112].

– R (30.7 %): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. When computing-intensive tasks are executed, code written in C, C++, and Fortran may be called in the R environment. In addition, skilled users can directly call R objects in C. R is an implementation of the S language, an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and drawing plots. Compared to S, R is more popular since it is open source. R ranked first in the KDnuggets 2012 survey. Furthermore, in a 2012 survey of "Design languages you have used for data mining/analysis in the past year", R was also in first place, ahead of SQL and Java. Due to the popularity of R, database manufacturers such as Teradata and Oracle have released products supporting R.

– Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins with powerful functions for data analysis, such as Analysis ToolPak and Solver Add-in, are integrated initially, but such plug-ins can be used only if users enable them. Excel is also the only commercial software among the top five.

– Rapid-I RapidMiner (26.7 %): RapidMiner is open source software used for data mining, machine learning, and predictive analysis. In a KDnuggets survey in 2011, it was used more frequently than R and ranked first. Data mining and machine learning programs provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment. The data mining flow is described in XML and displayed through a graphical user interface (GUI). RapidMiner is written in Java. It integrates the learners and evaluation methods of Weka, and works with R. Functions of RapidMiner are implemented by connecting processes that include various operators. The entire flow can be regarded as a factory production line, with original data as input and model results as output. The operators can be considered as specific functions with different input and output characteristics.

– KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, and open-source platform for data integration, data processing, data analysis, and data mining [113]. It allows users to create data flows or data channels in a visualized manner, to selectively run some or all analytical procedures, and provides analytical results, models, and interactive views. KNIME is written in Java and, being based on Eclipse, provides additional functions as plug-ins. Through plug-ins, users can insert processing modules for files, pictures, and time series, and integrate them with various open source projects, e.g., R and Weka. KNIME controls data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted in a visualized environment. KNIME is designed as a module-based and expandable framework. There is no dependence between its processing units and data containers, making it well suited to distributed environments and independent development. In addition, KNIME is easy to extend: developers can readily add new nodes and views.

– Weka/Pentaho (14.8 %): Weka, abbreviated from Waikato Environment for Knowledge Analysis, is free and open-source machine learning and data mining software written in Java. Weka provides such functions as data processing, feature selection, classification, regression, clustering, association rules, and visualization. Pentaho is one of the most popular open-source BI software packages. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, data mining, and other aspects of BI. Weka's data processing algorithms are also integrated in Pentaho and can be called directly.

6 Big data applications

In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful value via judgments, suggestions, support, or decisions. However, data analysis involves a wide range of applications, which frequently change and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields, including structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

6.1 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically includes large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have emerged over the past decades. For example, as early as the 1990s, BI had become a prevailing technology for business applications, and network search engines based on massive data mining emerged in the early 21st century. Some potential and influential applications from different fields, together with their data and analysis characteristics, are discussed as follows.

– Evolution of Commercial Applications: The earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. Analytical techniques used in such systems were prevalent in the 1990s and were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, score cards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have been providing a unique opportunity for organizations to have an online presence and directly interact with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted with text analysis and website mining techniques. As reported in [115], the number of mobile phones and tablet PCs surpassed that of laptops and PCs for the first time in 2011. Mobile phones and the sensor-based Internet of Things are opening a new generation of innovative applications and require considerably greater capacity to support location-sensing, people-oriented, and context-aware operation.

– Evolution of Network Applications: The early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to the mining of email contents and the building of search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages, full of various kinds of data such as text, images, audio, videos, and interactive contents. Therefore, plenty of advanced technologies for semi-structured or unstructured data have emerged. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share contents.

– Evolution of Scientific Applications: Scientific research in many fields, such as astrophysics, oceanology, genomics, and environmental research, is acquiring massive data with high-throughput sensors and instruments. The US National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, inter-operative analysis software, and data services to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets vary widely in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and comprehensive coverage is not easy, we will focus on the key problems and technologies of data analysis in the following discussions.

6.2 Big data analysis fields


6.2.1 Structured data analysis

Business applications and scientific research may generate massive structured data, of which the management and analysis rely on mature commercialized technologies, such as RDBMS, data warehouse, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning based on exact mathematical models and powerful algorithms has been applied to anomaly detection [117] and energy control [118]. Exploiting data characteristics, temporal and spatial mining can extract knowledge structures hidden in high-speed data flows and sensor data [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially in process analysis with event data [121].

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to have more business potential than structured data analysis. Generally, text analysis is a process to extract useful information and knowledge from unstructured text. Text mining is inter-disciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and data mining in particular. Most text mining systems are based on text representations and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammars [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.
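As a small illustration of how unstructured text can be turned into a form that mining algorithms can use, the sketch below computes TF-IDF weights over a toy corpus. TF-IDF is not named in the survey; it is used here only as an assumed example of a text representation, with whitespace tokenization and an unsmoothed logarithmic IDF as simplifying choices.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document (whitespace tokenization)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


docs = ["big data analysis", "big data storage", "text mining"]
vectors = tf_idf(docs)
```

Terms shared by many documents (here "big" and "data") receive lower weights than terms specific to one document (here "analysis"), which is the property that classification and clustering steps usually rely on.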

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, image, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which will be discussed in Section 6.2.4. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves the mining of semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., email, newsgroup management, and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to model and integrate data on the Web so as to conduct more complex queries than keyword-based searches.

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the graph of links within a website or among multiple websites. Models are built based on topological structures provided by hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages. The topic-oriented crawler is another successful case of utilizing the models [127].
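To show how a link-structure model can rank pages, the following sketch runs the textbook power-iteration form of PageRank on a toy link graph. It is a simplified illustration rather than the implementation of [125]: the damping factor of 0.85, the fixed iteration count, the even redistribution of rank from pages without outgoing links, and the example graph are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a baseline share from the random-jump term.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page site: most links point at page "c".
toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_links)
```

Page "c" ends up with the highest score because three pages link to it, while "d", which nobody links to, keeps only the baseline share.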

Web usage mining aims to mine secondary data generated by Web sessions or activities, whereas Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or trades, caches, user queries, bookmark data, mouse clicks and scrolls, and any other kinds of data generated through interaction with the Web. As Web services and Web 2.0 are becoming mature and popular, Web usage data will have increasingly high variety. Web usage mining plays key roles in personalization, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.


6.2.4 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) have been growing at an amazing speed, and multimedia analysis aims to extract useful knowledge from such data and understand its semantics. Because multimedia data is heterogeneous and most of it contains richer information than simple structured data or text data, extracting information is confronted with the huge challenge of the semantic gap. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection.

Audio summarization can be accomplished by extracting the prominent words or phrases from metadata or by synthesizing a new representation. Video summarization is to interpret the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied to many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video and apply smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.

Multimedia annotation inserts labels to describe the contents of images and videos at both the syntactic and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time and labor intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic gap. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to jointly explore both manual and automatic multimedia annotation [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information, and assisting users to conveniently and quickly look up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, and includes shot boundary detection, key frame extraction, and scene segmentation. Based on the result of structural analysis, the second procedure is feature extraction, which mainly includes further mining the features of key frames, objects, texts, and movements, which are the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find the patterns of video contents and put videos into predefined categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos. The retrieval results are then refined through relevance feedback.

Multimedia recommendation is to recommend specific multimedia contents according to users' preferences. It has proven to be an effective approach to providing personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. The content-based methods identify general features of users or their interests, and recommend to users other contents with similar features. These methods largely rely on content similarity measurement, but most of them are troubled by limited content analysis and over-specialization. The collaborative-filtering-based methods identify groups with similar interests and recommend contents to group members according to their behavior [132]. Presently, a mixed method has been introduced, which integrates the advantages of the aforementioned two types of methods to improve recommendation quality [133].
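The collaborative-filtering idea can be sketched as follows: find users whose ratings resemble those of the target user, then recommend items those users rated highly. This is a minimal user-based variant with cosine similarity, chosen for illustration; it is not the method of [132], and the function names and the toy rating data are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two {item: rating} dicts."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[item] * b[item] for item in common)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(ratings, user, top_n=3):
    """Rank items the user has not rated by similarity-weighted average rating."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        similarity = cosine_similarity(ratings[user], other_ratings)
        if similarity <= 0.0:
            continue
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + similarity * rating
                weights[item] = weights.get(item, 0.0) + similarity
    predicted = {item: scores[item] / weights[item] for item in scores}
    # Highest predicted rating first; ties broken alphabetically.
    return sorted(predicted, key=lambda item: (-predicted[item], item))[:top_n]


toy_ratings = {
    "ann": {"film_a": 5, "film_b": 4},
    "bob": {"film_a": 5, "film_b": 4, "film_c": 5},
    "cat": {"film_b": 1, "film_d": 2},
}
suggestions = recommend(toy_ratings, "ann")
```

A content-based system would instead compare item features with the user's profile, and the mixed methods mentioned above combine both signals.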

For multimedia event detection, the US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains some text description related to concepts and video examples [134]. In [135], the author proposed a new algorithm for special multimedia event detection using a few positive training examples. The research on video event detection is still in its infancy and mainly focuses on sports or news events, running or abnormal events in surveillance videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graph structures, describing the communications between two entities. The content data contains text, image, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. From the data-centered perspective, the existing research on social networking service (SNS) contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

The research on link-based structural analysis has focused on link prediction, community discovery, social network evolution, and social influence analysis. SNS may be visualized as graphs, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertexes and edges are continually added to the graphs. Link prediction is to predict the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex pair and utilizes the existing link information to train binary classifiers that predict future links [139]. Probabilistic methods aim to build models of connection probabilities among vertexes in SNS [140]. Linear algebra methods compute the similarity between two vertexes based on a rank-reduced similarity matrix [141]. A community is represented by a sub-graph, in which edges connecting vertexes within the sub-graph feature high density, while the edges between two sub-graphs feature much lower density [142].
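As a simple illustration of link prediction, the sketch below scores every unconnected pair of vertexes by the Jaccard coefficient of their neighbor sets. Neighborhood overlap is only one possible feature and is not claimed to be the technique of [139], [140], or [141]; the function names and the toy graph are hypothetical.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map {vertex: set of neighbors}."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def jaccard_scores(edges):
    """Score each unconnected pair by |common neighbors| / |all neighbors|."""
    adjacency = build_adjacency(edges)
    scores = {}
    for u, v in combinations(sorted(adjacency), 2):
        if v in adjacency[u]:
            continue  # already linked, nothing to predict
        union = adjacency[u] | adjacency[v]
        common = adjacency[u] & adjacency[v]
        scores[(u, v)] = len(common) / len(union) if union else 0.0
    return scores


# Hypothetical friendship graph.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
link_scores = jaccard_scores(toy_edges)
best_pair = max(link_scores, key=link_scores.get)
```

Here the pair ("a", "d") scores highest because the two users share both of a's neighbors. In a feature-based classifier, a score like this would typically be one input feature among several.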

Many methods for community detection have been proposed and studied, most of which rely on topology-based objective functions that capture the notion of community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. The research on SNS evolution aims to look for laws and derivation models to interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generative methods have been proposed to assist network and system design [147].

Social influence refers to the case in which individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, time effects, and the characteristics of networks and individuals. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of content in SNS is considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contain much noise. For example, the blogosphere contains a large number of spam blogs, and Twitter contains many trivial Tweets. Third, SNS are dynamic networks, which vary and are updated frequently and quickly. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge among different media [150].

6.2.6 Mobile data analysis

By April 2013, Android Apps had provided more than 650,000 applications covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since this research has only just started, we introduce only some recent and representative analysis applications in this section.

With the growing number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities with geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest Webchat). Traditional network communities or SNS communities lack online interaction among members, and such communities are active only when members are sitting in front of computers. In contrast, mobile phones can support rich interaction at any time and anywhere. A mobile community is defined as a group of individuals with the same hobbies (e.g., health, safety, and entertainment) who gather on networks, meet to set a common goal, decide through consultation on measures to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build body area networks for real-time monitoring of people’s health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, and physiological relations. In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize.

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop a smart phone application that analyzes a person’s pace while walking and uses the pace information to unlock the safety system [11]. Meanwhile, Robert Delano and Brian Parise from Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with a built-in seismograph in a mobile phone, so as to cope with Parkinson’s disease and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business modes. In sales planning, after comparing massive data, enterprises can optimize their commodity prices. In operations, enterprises can improve their operational efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor costs. In the supply chain, enterprises may use big data to conduct inventory optimization, logistics optimization, and supplier coordination, so as to narrow the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that such activities as “Multi-times score accumulation” and “score exchange in shops” are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers’ transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.
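The text does not describe how CMB’s warning model is built. Purely to illustrate the selection step, the sketch below assumes that some model has already produced a drop-out score per customer and picks the top 20 % for a retention offer. The function name, the score dictionary, and the customer identifiers are all hypothetical.

```python
import math


def top_fraction_at_risk(churn_scores, fraction=0.20):
    """Return the customer ids with the highest drop-out scores.

    churn_scores: dict mapping customer id -> score in [0, 1] (assumed to
    come from some upstream warning model, which is not shown here).
    fraction: share of customers to target; rounded up to a whole customer.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    k = math.ceil(len(churn_scores) * fraction)
    # Highest score first; ties broken by customer id for determinism.
    ranked = sorted(churn_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [customer_id for customer_id, _ in ranked[:k]]


# Hypothetical scores for ten customers c0..c9.
scores = {f"c{i}": i / 10 for i in range(10)}
at_risk = top_fraction_at_risk(scores)
print(at_risk)
```

Rounding up ensures that a non-empty customer base always yields at least one customer to contact.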

The most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, along with, more importantly, the age, gender, address, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, the market conditions of their brands, and consumers’ behaviors, and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan of Alibaba uses big data technology to automatically analyze acquired enterprise transaction data and judge whether to lend to enterprises, with no manual intervention in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan with only about 0.3 % bad loans, which is much lower than that of other commercial banks.

6.3.2 Application of IoT based big data

IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises may have profound experience with the application of IoT big data. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project cooperation between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and Miami city, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills this year by promptly identifying and fixing water pipes that were running and leaking.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and the connections among individuals based on an information network. Big data of online SNS mainly comes from instant messages, online social networks, micro blogs, and shared spaces, which represent various user activities. The analysis of big data from online SNS uses computational analytical methods for understanding relations in human society, drawing on theories and methods from mathematics, informatics, sociology, and management science, along three dimensions: network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire value.

– Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preference, emotion, interest, and demand may be revealed.

– Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies aggregate relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.

The US Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

Fig. 5 Enabling technologies for online social network-oriented big data

In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted research that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans. The goal is to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting sharp growth or drops in the amount of topics; 2) observing the weekly and monthly trends of dialogs on Twitter, and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change in food price inflation in the official statistics of Indonesia matches the number of Tweets about the rice price on Twitter, as shown in Fig. 6.
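Aspect 1), detecting sharp growth or drops in topic volume, can be sketched with a simple trailing-window z-score. This is an assumed, generic technique and not the method actually used by Global Pulse; the window length, the threshold, and the daily counts below are hypothetical.

```python
from statistics import mean, pstdev


def detect_spikes(counts, window=7, threshold=3.0):
    """Return indices whose count deviates sharply from the recent past.

    A point is flagged when it lies more than `threshold` standard
    deviations from the mean of the preceding `window` observations.
    """
    flagged = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            # Perfectly flat history: any change at all is a deviation.
            deviates = counts[i] != mu
        else:
            deviates = abs(counts[i] - mu) / sigma > threshold
        if deviates:
            flagged.append(i)
    return flagged


# Hypothetical daily counts of Tweets on one topic; day 8 is a burst.
daily_counts = [100, 104, 98, 101, 99, 103, 97, 102, 240, 100]
spikes = detect_spikes(daily_counts)
print(spikes)
```

Because the burst itself enters the trailing window afterwards, the day that follows it is not flagged here; a production detector would probably use a more robust baseline such as the median.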

Generally speaking, the application of big data from online SNS may help to better understand users’ behavior and master the laws of social and economic activities in the following three aspects:

– Early Warning: rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotion, and preference of users.

– Real-time Feedback: acquire groups’ feedback on social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the health care business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims, covering a series of metabolic syndrome detection tests of patients over three consecutive years. In addition, it summarized the final result into an extremely personalized treatment plan to assess the risk factors and main treatment plans of patients. Doctors may then reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose five pounds of weight, or by suggesting that patients reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20.

The Mount Sinai Medical Center in the US utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with a third-party application through the software development kit (SDK) and open interface.

Fig. 6 The correlation between Tweets about rice price and food price inflation

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units and coordinate through mobile networks to distribute sensing tasks and to collect and utilize sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of Crowdsourcing has been successfully applied to geotagged photographs, positioning and navigation, urban road traffic sensing, market forecasts, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach for problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, Crowdsourcing had been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of Crowdsourcing. The main idea of Crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, Crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, Spatial Crowdsourcing has become a hot topic. The operation framework of Spatial Crowdsourcing is as follows. A user may request a service and resources related to a specified location. Then the mobile users who are willing to participate in the task move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data are sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions they provide, it can be forecast that Spatial Crowdsourcing will become more prevalent than traditional Crowdsourcing, e.g., Amazon Mechanical Turk and Crowdflower.
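The matching step of this framework can be sketched as follows. The assignment rule (nearest willing worker within a travel limit), the planar coordinates, and all names are assumptions made for illustration; real platforms may use road distances, worker reputation, or batch assignment instead.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Worker:
    name: str
    x: float        # planar coordinates in km (assumed, for simplicity)
    y: float
    willing: bool   # whether the worker accepts tasks right now


def assign_task(task_xy, workers, max_distance):
    """Pick the nearest willing worker within max_distance, or None."""
    tx, ty = task_xy
    candidates = [
        (hypot(w.x - tx, w.y - ty), w.name)
        for w in workers
        if w.willing
    ]
    in_range = [c for c in candidates if c[0] <= max_distance]
    if not in_range:
        return None
    # min() compares distance first, then name, so ties are deterministic.
    return min(in_range)[1]


# Hypothetical request: take a photo at location (1.2, 1.0).
workers = [
    Worker("ann", 0.0, 0.0, True),
    Worker("bob", 1.0, 1.0, False),  # closest, but not willing
    Worker("cai", 2.0, 2.0, True),
]
chosen = assign_task((1.2, 1.0), workers, max_distance=2.0)
print(chosen)
```

After assignment, the chosen worker would travel to the location, capture the data, and return it to the requester, which corresponds to the final steps described above.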

6.3.6 Smart grid

Smart Grid is the next generation power grid, constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart Grid related big data are generated from various sources, such as (i) power utilization habits of users; (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nationwide; (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI); (iv) energy market pricing and bidding data; and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as Circuit Breaker Monitors and transformers). Smart Grid brings about the following challenges on exploiting big data:

– Grid planning: By analyzing data in the Smart Grid, regions with excessively high electrical load or high power outage frequencies can be identified. Even transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance. For example, researchers from the University of California, Los Angeles designed an “electric map” according to big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of the block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides an effective and visual load forecast for power grid planning in a city. Power grid facilities in blocks with high power outage frequencies and serious overloads, as displayed in the map, may be given priority for transformation.

– Interaction between power generation and power consumption: An ideal power grid should balance power generation and consumption. However, the traditional power grid is constructed based on a one-directional approach of transmission-transformation-distribution-consumption, which does not allow the generation capacity to be adjusted according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters have been developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which help the supplier read power utilization data every 15 min rather than every month as in the past. Labor cost for meter reading is greatly reduced. Because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and low periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help realize time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users.

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable energy sources can be efficiently managed, and the electricity they generate can be allocated to regions with electricity shortages. Such energy resources can complement the traditional hydropower and thermal power generation.
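The time-sharing dynamic pricing mentioned in the second item above can be illustrated with a small sketch. It aggregates 15-minute meter readings into hourly load and assigns a peak, normal, or off-peak price to each hour. The tariff rule, the multipliers, and the 20 % band around the average load are hypothetical and are not taken from TXU Energy or any real tariff.

```python
def hourly_load(readings_kwh, per_hour=4):
    """Sum consecutive 15-minute readings (4 per hour) into hourly totals."""
    if len(readings_kwh) % per_hour:
        raise ValueError("readings must cover whole hours")
    return [
        sum(readings_kwh[i:i + per_hour])
        for i in range(0, len(readings_kwh), per_hour)
    ]


def time_of_use_prices(hourly, base_price,
                       peak_factor=1.5, off_peak_factor=0.7, band=0.2):
    """Assign a price per hour relative to the average hourly load.

    Hours more than `band` above average are priced as peak, hours more
    than `band` below average as off-peak, and all others at base price.
    """
    avg = sum(hourly) / len(hourly)
    prices = []
    for load in hourly:
        if load > avg * (1 + band):
            prices.append(base_price * peak_factor)
        elif load < avg * (1 - band):
            prices.append(base_price * off_peak_factor)
        else:
            prices.append(base_price)
    return prices


# Hypothetical three hours of 15-minute readings in kWh.
readings = [1, 1, 1, 1, 2, 2, 2, 2, 4, 4, 4, 4]
hourly = hourly_load(readings)
prices = time_of_use_prices(hourly, base_price=0.10)
print(hourly, [round(p, 2) for p in prices])
```

With monthly readings such a schedule could not be computed at all, which is why the finer 15-minute granularity matters for dynamic pricing.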

7 Conclusion, open issues, and outlook

In this paper, we reviewed the background and state-of-the-art of big data. First, we introduced the general background of big data and reviewed related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focused on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduced the general background, discussed the technical challenges, and reviewed the latest advances. Finally, we reviewed several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture of this exciting area to readers.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still at an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, which are discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

– Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim that they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to assess the computing efficiency of big data with rigorous mathematical methods. Performance can only be evaluated after a system is implemented and deployed, which makes it impossible to horizontally compare the advantages and disadvantages of alternative solutions, or even to compare performance before and after the adoption of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, effectively and rigorously evaluating data quality is also an urgent problem.

– Evolution of big data computing modes: This includes memory mode, data flow mode, PRAM mode, and MR mode. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor that restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.

– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor in improving big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems on big data processing arise from traditional data analysis, including (i) data re-utilization: with the increase of data scale, more value may be mined from the re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized, from which more value can be mined; and (iii) data exhaust, which refers to erroneous data collected during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved.

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning.

– Integration and provenance of big data: As discussed, the value acquired from comprehensive utilization of multiple datasets is far higher than the sum of the values of the individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information that follows different standards and comes from different datasets.

– Big data application: At present, the application of big data is just beginning, and we should explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication; big data applications in small and medium-sized businesses; big data applications in government departments; big data services; and human-computer interaction of big data are all important research problems.

                                          714 Data security

                                          In IT safety and privacy are always two key concerns Inthe big data era as data volume is fast growing there aremore severe safety risks while the traditional data protec-tion methods have already been shown not applicable to bigdata In particular big data safety is confronted with thefollowing security related challenges

– Big data privacy. Big data privacy includes two aspects: (i) Protection of personal privacy during data acquisition: personal interests, habits, body properties, etc. of users may be more easily acquired, and users may not be aware of it. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed a big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher at Skull Security, used an information acquisition tool to collect data from the public pages of Facebook users who had failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality. Data quality influences big data utilization. Low-quality data wastes transmission and storage resources and has poor usability. Many factors may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though many measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.

– Big data safety mechanism. Big data brings challenges to data encryption due to its large scale and high diversity. Previous encryption methods, designed for small and medium-scale data, cannot meet the performance demands of big data, so efficient big data cryptography approaches should be developed. Effective schemes for safety management, access control, and safe communications should be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data application in information security. Big data not only brings challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) by analyzing big data in the form of log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, attack characteristics, etc. may also be more easily identified through the analysis of big data.
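As a small illustration of the data quality challenge listed above, the following sketch computes three of the named dimensions (completeness, redundancy, and consistency) over a toy dataset. The metric definitions, function names, and sample rows are assumptions made for illustration; accuracy is omitted because it requires ground truth, and real systems would use richer definitions.

```python
def completeness(rows, fields):
    """Share of (row, field) cells that hold a non-missing value."""
    total = len(rows) * len(fields)
    filled = sum(1 for r in rows for f in fields if r.get(f) not in (None, ""))
    return filled / total if total else 1.0


def redundancy(rows, fields):
    """Share of rows that exactly repeat an earlier row on the given fields."""
    seen, duplicates = set(), 0
    for r in rows:
        signature = tuple(r.get(f) for f in fields)
        if signature in seen:
            duplicates += 1
        else:
            seen.add(signature)
    return duplicates / len(rows) if rows else 0.0


def consistency(rows, key, attr):
    """Share of keys whose non-missing values for `attr` all agree."""
    values_by_key = {}
    for r in rows:
        value = r.get(attr)
        if value not in (None, ""):
            values_by_key.setdefault(r.get(key), set()).add(value)
    if not values_by_key:
        return 1.0
    agreeing = sum(1 for vs in values_by_key.values() if len(vs) == 1)
    return agreeing / len(values_by_key)


# Hypothetical sensor readings with one duplicate, one gap, and one conflict.
rows = [
    {"id": 1, "city": "Wuhan", "temp": 21.5},
    {"id": 1, "city": "Wuhan", "temp": 21.5},    # exact duplicate
    {"id": 2, "city": "Beijing", "temp": None},  # missing value
    {"id": 2, "city": "Peking", "temp": 18.0},   # conflicting city for id 2
]
fields = ["id", "city", "temp"]

c = completeness(rows, fields)
d = redundancy(rows, fields)
k = consistency(rows, "id", "city")
print(c, d, k)
```

Scores of this kind only detect problems; repairing the damaged records, as the text notes, remains a separate and harder task.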

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy efficiency optimization, and hardware and software system architectures for processing. In particular, aspects of big data security such as credibility, backup and recovery, and completeness maintenance should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the "T" (Technology) was the main concern, while technology drove the development of data. In the big data era, with the prominence of data value and advances in "I" (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact but also influence everyone's ways of living and thinking, which is already happening. We cannot predict the future, but we may take precautions for events that may occur.

– Data with a larger scale, higher diversity, and more complex structures. Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have sought better ways to cope with larger-scale, more diverse, and more complexly structured data. These efforts are represented by Google's globally distributed database Spanner and the fault-tolerant, expandable, distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to those of relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance. Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science. Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, information security, etc. Then, the impacts of big data on production management, business operation, decision making, etc. shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization. In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., in text and image editors. In big data applications, mixed data is very useful for decision making. Only when analytical results are displayed in a user-friendly way can they be effectively utilized by users. Reports, histograms, pie charts, regression curves, etc. are frequently used to visualize the results of data analysis. New presentation forms will emerge in the future; e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

– Data-oriented. It is well known that programs consist of data structures and algorithms, and that data structures are used to store data. In the history of program design, the role of data has become increasingly significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed that of business logic. Consequently, programs are gradually being transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods will emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking. Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

    – During data analysis, we will try to utilize all data rather than analyzing only a small set of sample data.

    – Compared with accurate data, we would like to accept numerous and complicated data.

    – We shall pay greater attention to correlations between things rather than exploring causal relationships.

    – The simple algorithms of big data are more effective than the complex algorithms of small data.

    – Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace "experts".

Throughout the history of human society, the demands and willingness of human beings have always been the driving forces of scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments. This work was supported by China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

                                          References

1. Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

2. Fact sheet: Big data across the federal government (2012). http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf

3. Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper

4. Drowning in numbers - digital data will flood the planet - and help us understand it better (2011). http://www.economist.com/blogs/dailychart/2011/11/bigdata-0

5. Lohr S (2012) The age of big data. New York Times, pp 11

6. Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/the-digital-breadcrumbs-that-lead-to-big-data

7. Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-search-for-analysts-to-make-sense-of-big-data

8. Big data (2008). http://www.nature.com/news/specials/bigdata/index.html

9. Special online collection: dealing with big data (2011). http://www.sciencemag.org/site/special/data

10. Manyika J, McKinsey Global Institute, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute

11. Mayer-Schonberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Eamon Dolan/Houghton Mifflin Harcourt

12. Laney D (2001) 3-d data management: controlling data volume, velocity and variety. META Group Research Note, 6 February

13. Zikopoulos P, Eaton C et al (2011) Understanding big data: analytics for enterprise class hadoop and streaming data. McGraw-Hill Osborne Media

14. Meijer E (2011) The world according to linq. Communications of the ACM 54(10):45–51

15. Beyer M (2011) Gartner says solving big data challenge involves more than just managing volumes of data. Gartner. http://www.gartner.com/it/page.jsp

16. O R Team (2011) Big data now: current perspectives from O'Reilly Radar. O'Reilly Media

17. Grobelnik M (2012) Big data tutorial. http://videolectures.net/eswc2012grobelnikbigdata

18. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2008) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014

19. DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6):85–98


20. Walter T (2009) Teradata past, present, and future. UCI ISG lecture series on scalable data management

21. Ghemawat S, Gobioff H, Leung S-T (2003) The google file system. In: ACM SIGOPS Operating Systems Review, vol 37. ACM, pp 29–43

22. Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

23. Hey AJG, Tansley S, Tolle KM et al (2009) The fourth paradigm: data-intensive scientific discovery

24. Howard JH, Kazar ML, Menees SG, Nichols DA, Satyanarayanan M, Sidebotham RN, West MJ (1988) Scale and performance in a distributed file system. ACM Trans Comput Syst (TOCS) 6(1):51–81

25. Cattell R (2011) Scalable sql and nosql data stores. ACM SIGMOD Record 39(4):12–27

26. Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033

27. Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98

28. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J et al (2012) Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States

29. Sun Y, Chen M, Liu B, Mao S (2013) Far: a fault-avoidant routing method for data center networks with regular topology. In: Proceedings of ACM/IEEE symposium on architectures for networking and communications systems (ANCS'13). ACM

30. Wiki (2013) Applications and organizations using hadoop. http://wiki.apache.org/hadoop/PoweredBy

31. Bahga A, Madisetti VK (2012) Analyzing massive machine maintenance data in a computing cloud. IEEE Transac Parallel Distrib Syst 23(10):1831–1843

32. Gunarathne T, Wu T-L, Choi JY, Bae S-H, Qiu J (2011) Cloud computing paradigms for pleasingly parallel biomedical applications. Concurr Comput Prac Experience 23(17):2338–2354

33. Gantz J, Reinsel D (2010) The digital universe decade-are you ready. External publication of IDC (Analyse the Future) information and data, pp 1–16

34. Bryant RE (2011) Data-intensive scalable computing for scientific applications. Comput Sci Eng 13(6):25–33

35. Wahab MHA, Mohd MNH, Hanafi HF, Mohsin MFM (2008) Data pre-processing on web server logs for generalized association rules mining algorithm. World Acad Sci Eng Technol 48:2008

36. Nanopoulos A, Manolopoulos Y, Zakrzewicz M, Morzy T (2002) Indexing web access-logs for pattern queries. In: Proceedings of the 4th international workshop on web information and data management. ACM, pp 63–68

37. Joshi KP, Joshi A, Yesha Y (2003) On using a warehouse to analyze web logs. Distrib Parallel Databases 13(2):161–180

38. Chandramohan V, Christensen K (2002) A first look at wired sensor networks for video surveillance systems. In: Proceedings LCN 2002, 27th annual IEEE conference on local computer networks. IEEE, pp 728–729

39. Selavo L, Wood A, Cao Q, Sookoor T, Liu H, Srinivasan A, Wu Y, Kang W, Stankovic J, Young D et al (2007) Luster: wireless sensor network for environmental research. In: Proceedings of the 5th international conference on Embedded networked sensor systems. ACM, pp 103–116

40. Barrenetxea G, Ingelrest F, Schaefer G, Vetterli M, Couach O, Parlange M (2008) Sensorscope: out-of-the-box environmental monitoring. In: Information processing in sensor networks, 2008, international conference on IPSN'08. IEEE, pp 332–343

41. Kim Y, Schmid T, Charbiwala ZM, Friedman J, Srivastava MB (2008) Nawms: nonintrusive autonomous water monitoring system. In: Proceedings of the 6th ACM conference on Embedded network sensor systems. ACM, pp 309–322

42. Kim S, Pakzad S, Culler D, Demmel J, Fenves G, Glaser S, Turon M (2007) Health monitoring of civil infrastructures using wireless sensor networks. In: Information Processing in Sensor Networks 2007, 6th International Symposium on IPSN 2007. IEEE, pp 254–263

43. Ceriotti M, Mottola L, Picco GP, Murphy AL, Guna S, Corra M, Pozzi M, Zonta D, Zanon P (2009) Monitoring heritage buildings with wireless sensor networks: the torre aquila deployment. In: Proceedings of the 2009 International Conference on Information Processing in Sensor Networks. IEEE Computer Society, pp 277–288

44. Tolle G, Polastre J, Szewczyk R, Culler D, Turner N, Tu K, Burgess S, Dawson T, Buonadonna P, Gay D et al (2005) A macroscope in the redwoods. In: Proceedings of the 3rd international conference on embedded networked sensor systems. ACM, pp 51–63

45. Wang F, Liu J (2011) Networked wireless sensor data collection: issues, challenges, and approaches. IEEE Commun Surv Tutor 13(4):673–687

46. Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the 11th international conference on World Wide Web. ACM, pp 124–135

47. Choudhary S, Dincturk ME, Mirtaheri SM, Moosavi A, von Bochmann G, Jourdan G-V, Onut I-V (2012) Crawling rich internet applications: the state of the art. In: CASCON, pp 146–160

48. Ghani N, Dixit S, Wang T-S (2000) On ip-over-wdm integration. IEEE Commun Mag 38(3):72–84

49. Manchester J, Anderson J, Doshi B, Dravida S (1998) Ip over sonet. IEEE Commun Mag 36(5):136–142

50. Jinno M, Takara H, Kozicki B (2009) Dynamic optical mesh networks: drivers, challenges and solutions for the future. In: Optical communication, 2009, 35th European conference on ECOC'09. IEEE, pp 1–4

51. Barroso LA, Holzle U (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synt Lect Comput Archit 4(1):1–108

52. Armstrong J (2009) Ofdm for optical communications. J Light Technol 27(3):189–204

53. Shieh W (2011) Ofdm for flexible high-speed optical networks. J Light Technol 29(10):1560–1577

54. Cisco data center interconnect design and deployment guide (2010)

55. Greenberg A, Hamilton JR, Jain N, Kandula S, Kim C, Lahiri P, Maltz DA, Patel P, Sengupta S (2009) Vl2: a scalable and flexible data center network. In: ACM SIGCOMM computer communication review, vol 39. ACM, pp 51–62

56. Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S (2009) Bcube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 39(4):63–74

57. Farrington N, Porter G, Radhakrishnan S, Bazzaz HH, Subramanya V, Fainman Y, Papen G, Vahdat A (2011) Helios: a hybrid electrical/optical switch architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 41(4):339–350

58. Abu-Libdeh H, Costa P, Rowstron A, O'Shea G, Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Comput Commun Rev 40(4):51–62

59. Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what's needed for datacenter network operations. IEEE Commun Mag 48(7):32–39


60. Wang G, Andersen DG, Kaminsky M, Papagiannaki K, Ng TS, Kozuch M, Ryan M (2010) c-through: Part-time optics in data centers. In: ACM SIGCOMM Computer Communication Review, vol 40. ACM, pp 327–338

61. Ye X, Yin Y, Yoo SJB, Mejia P, Proietti R, Akella V (2010) Dos: a scalable optical switch for datacenters. In: Proceedings of the 6th ACM/IEEE symposium on architectures for networking and communications systems. ACM, p 24

62. Singla A, Singh A, Ramachandran K, Xu L, Zhang Y (2010) Proteus: a topology malleable data center network. In: Proceedings of the 9th ACM SIGCOMM workshop on hot topics in networks. ACM, p 8

63. Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P (2011) Energy-efficient design of a scalable optical multiplane interconnection architecture. IEEE J Sel Top Quantum Electron 17(2):377–383

64. Kodi AK, Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance computing (hpc) systems. IEEE J Sel Top Quantum Electron 17(2):384–395

65. Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H (2012) Mirror mirror on the ceiling: flexible wireless links for data centers. ACM SIGCOMM Comput Commun Rev 42(4):443–454

66. Lenzerini M (2002) Data integration: a theoretical perspective. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. ACM, pp 233–246

67. Cafarella MJ, Halevy A, Khoussainova N (2009) Data integration for the relational web. Proc VLDB Endowment 2(1):1090–1101

68. Maletic JI, Marcus A (2000) Data cleansing: beyond integrity analysis. In: IQ. Citeseer, pp 200–209

69. Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1-2):83–113

70. Chen H, Ku W-S, Wang H, Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 51–62

71. Zhao Z, Ng W (2012) A model-based approach for rfid data stream cleansing. In: Proceedings of the 21st ACM international conference on information and knowledge management. ACM, pp 862–871

72. Khoussainova N, Balazinska M, Suciu D (2008) Probabilistic event extraction from rfid data. In: Data Engineering, 2008, IEEE 24th international conference on ICDE 2008. IEEE, pp 1480–1482

73. Herbert KG, Wang JTL (2007) Biological data cleaning: a case study. Int J Inf Qual 1(1):60–82

74. Tsai T-H, Lin C-Y (2012) Exploring contextual redundancy in improving object-based video coding for video sensor networks surveillance. IEEE Transac Multmed 14(3):669–682

75. Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 269–278

76. Kamath U, Compton J, Dogan RI, Jong KD, Shehu A (2012) An evolutionary algorithm approach for feature generation from sequence data and its application to dna splice site prediction. IEEE/ACM Transac Comput Biol Bioinforma (TCBB) 9(5):1387–1398

77. Leung K-S, Lee KH, Wang J-F, Ng EYT, Chan HLY, Tsui SKW, Mok TSK, Tse PC-H, Sung JJ-Y (2011) Data mining on dna sequences of hepatitis b virus. IEEE/ACM Transac Comput Biol Bioinforma 8(2):428–440

78. Huang Z, Shen H, Liu J, Zhou X (2011) Effective data co-reduction for multimedia similarity search. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, pp 1021–1032

79. Bleiholder J, Naumann F (2008) Data fusion. ACM Comput Surv (CSUR) 41(1):1

80. Brewer EA (2000) Towards robust distributed systems. In: PODC, p 7

81. Gilbert S, Lynch N (2002) Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59

82. McKusick MK, Quinlan S (2009) Gfs: evolution on fast-forward. ACM Queue 7(7):10

83. Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J (2008) Scope: easy and efficient parallel processing of massive data sets. Proc VLDB Endowment 1(2):1265–1276

84. Beaver D, Kumar S, Li HC, Sobel J, Vajgel P et al (2010) Finding a needle in haystack: facebook's photo storage. In: OSDI, vol 10, pp 1–8

85. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: amazon's highly available key-value store. In: SOSP, vol 7, pp 205–220

86. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In: Proceedings of the twenty-ninth annual ACM symposium on theory of computing. ACM, pp 654–663

87. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2):4

88. Burrows M (2006) The chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th symposium on Operating systems design and implementation. USENIX Association, pp 335–350

89. Lakshman A, Malik P (2009) Cassandra: structured storage system on a p2p network. In: Proceedings of the 28th ACM symposium on principles of distributed computing. ACM, pp 5–5

90. George L (2011) HBase: the definitive guide. O'Reilly Media Inc

91. Judd D (2008) hypertable-0.9.0.4-alpha

92. Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc

93. Crockford D (2006) The application/json media type for javascript object notation (json)

94. Murty J (2009) Programming amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc

95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc

96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 975–986

97. Yang H-C, Parker DS (2009) Traverse: simplified indexing on large map-reduce-merge clusters. In: Database systems for advanced applications. Springer, pp 308–322

98. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with sawzall. Sci Program 13(4):277–298

99. Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U (2009) Building a high-level dataflow system on top of map-reduce: the pig experience. Proceedings VLDB Endowment 2(2):1414–1425


100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629

101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72

102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) Dryadlinq: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14

103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: Parallel and distributed processing, 2008. IEEE international symposium on IPDPS 2008. IEEE, pp 1–11

104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146

105. Bu Y, Howe B, Balazinska M, Ernst MD (2010) Haloop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1–2):285–296

106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative mapreduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing. ACM, pp 810–818

107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2

108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: mapreduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing. ACM, p 7

109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) Ciel: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on networked systems design and implementation, p 9

110. Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York

111. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Yu PS et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

112. What analytics, data mining, big data software you used in the past 12 months for a real project? (2012) http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html

113. Berthold MR, Cebron N, Dill F, Gabriel TR, Kötter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer

114. Sallam RL, Richardson J, Hagerty J, Hostmann B (2011) Magic quadrant for business intelligence platforms. CT, Gartner Group, Stamford

115. Beyond the PC. Special Report on Personal Technology (2011)

116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A et al (2011) The iPlant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on software quality assurance. ACM, pp 70–77

118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers. ACM, pp 277–286

119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Record 34(2):18–26

120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM Sigmod Record 33(1):50–57

121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Transac Manag Inform Syst (TMIS) 3(2):7

122. Manning CD, Schütze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press

123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Transac Neural Netw 13(5):1163–1177

124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11

125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117

126. Konopnicki D, Shmueli O (1995) W3QS: a query system for the world-wide web. In: VLDB, vol 95, pp 54–65

127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640

128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval. ACM, p 2

129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25

130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19

131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819

132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939

133. Barragáns-Martínez AB, Costa-Montenegro E, Burguillo JC, Rey-López M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend tv programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311

134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91

135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia. ACM, pp 469–478

136. Hirsch JE (2005) An index to quantify an individual's scientific research output. Proc Natl Acad Sci USA 102(46):16569

137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company

138. Aggarwal CC (2011) An introduction to social network data analytics. Springer

139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 1046–1054

140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied computing. ACM, pp 1115–1116

141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Transac Knowl Discov Data (TKDD) 5(2):10

142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on world wide web. ACM, pp 631–640

143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on web mining and social network analysis. ACM, pp 16–25

144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. ACM, pp 315–321

145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 145–158

146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 131–144

147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 1007–1016

148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 807–816

149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on web search and data mining. ACM, pp 657–666

150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360

151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (last accessed 5 May 2013)

152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51

153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive rfid, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 2

154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval. ACM, pp 469–478

155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated, multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium. ACM, pp 445–454

156. Tasevski P (2011) Password attacks and generation strategies. Tartu University: Faculty of Mathematics and Computer Sciences

Table 1 Comparison of MPI, MapReduce and Dryad

                        MPI                            MapReduce                       Dryad
Deployment              Computing node and data        Computing and data storage      Computing and data storage
                        storage arranged separately    arranged at the same node       arranged at the same node
                        (data should be moved to       (computing should be close      (computing should be close
                        the computing node)            to data)                        to data)
Resource management/    –                              Workqueue (Google),             Not clear
scheduling                                             HOD (Yahoo)
Low level programming   MPI API                        MapReduce API                   Dryad API
High level programming  –                              Pig, Hive, Jaql, ···            Scope, DryadLINQ
Data storage            The local file system,         GFS (Google), HDFS (Hadoop),    NTFS, Cosmos DFS
                        NFS, ···                       KFS, Amazon S3, ···
Task partitioning       User manually partitions       Automation                      Automation
                        the tasks
Communication           Messaging, remote              Files (local FS, DFS)           Files, TCP pipes,
                        memory access                                                  shared-memory FIFOs
Fault-tolerant          Checkpoint                     Task re-execute                 Task re-execute

5.3.1 Real-time vs. offline analysis

According to timeliness requirements, big data analysis can be classified into real-time analysis and offline analysis.

– Real-time analysis is mainly used in e-commerce and finance. Since data changes constantly, rapid data analysis is needed, and analytical results must be returned with a very short delay. The main existing architectures for real-time analysis include (i) parallel processing clusters using traditional relational databases, and (ii) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

– Offline analysis is usually used for applications without strict requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally conducts analysis by importing logs into a special platform through data acquisition tools. Under the big data setting, many Internet enterprises utilize the offline analysis architecture based on Hadoop in order to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples include Facebook's open source tool Scribe, LinkedIn's open source tool Kafka, Taobao's open source tool Timetunnel, and Chukwa of Hadoop. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.

5.3.2 Analysis at different levels

Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive-level analysis, which are examined in the following.

– Memory-level analysis is for the case where the total data volume is smaller than the maximum memory of a cluster. Nowadays, the memory of a server cluster surpasses hundreds of GB, and even the TB level is common. Therefore, an in-memory database technology may be used, and hot data should reside in memory so as to improve analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis. MongoDB is a representative memory-level analytical architecture. With the development of SSDs (Solid-State Drives), the capacity and performance of memory-level data analysis have been further improved, and it has been widely applied.

– BI analysis is for the case when the data scale surpasses the memory level but may be imported into the BI analysis environment. Current mainstream BI products provide data analysis plans that support data volumes above the TB level.

– Massive analysis is for the case when the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analysis utilizes HDFS of Hadoop to store data and MapReduce for data analysis. Most massive analysis belongs to the offline analysis category.
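The three levels above amount to a capacity check on the data volume. The following is a minimal sketch of that decision, not a rule stated in the survey: the function name `choose_analysis_level` and both capacity parameters are hypothetical, and the caller must supply the thresholds because the text gives no fixed numbers.

```python
GB = 1024 ** 3
TB = 1024 ** 4


def choose_analysis_level(data_bytes: int,
                          cluster_memory_bytes: int,
                          bi_capacity_bytes: int) -> str:
    """Pick an analysis level from the data volume.

    Both capacity arguments are assumptions supplied by the caller;
    the survey names the levels but gives no numeric thresholds.
    """
    if min(data_bytes, cluster_memory_bytes, bi_capacity_bytes) < 0:
        raise ValueError("sizes must be non-negative")
    if data_bytes < cluster_memory_bytes:
        # The whole data set fits in the memory of the cluster.
        return "memory-level"
    if data_bytes <= bi_capacity_bytes:
        # Too large for memory, but still importable into a BI environment.
        return "BI-level"
    # Beyond BI products and traditional relational databases.
    return "massive-level"
```

For instance, with an assumed 1 TB of cluster memory and an assumed 50 TB BI capacity, a 200 GB data set falls in the memory level, while a 500 TB data set falls in the massive level.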

5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithms differ greatly from each other according to different kinds of data and application demands. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.
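As an illustration of a parallel processing model, the sketch below splits the input into partitions, counts words in each partition concurrently, and merges the partial results. It is a single-machine illustration under stated assumptions: the function names are hypothetical, and the thread pool only demonstrates the split/process/merge shape (CPython threads do not speed up CPU-bound work; a real deployment would use processes or a cluster framework).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(lines):
    """Per-partition step: count the words in one partition of the input."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def parallel_word_count(lines, partitions=4):
    """Split the input, count each partition concurrently, then merge."""
    if partitions < 1:
        raise ValueError("partitions must be >= 1")
    lines = list(lines)
    # Round-robin partitioning keeps the chunks roughly equal in size.
    chunks = [lines[i::partitions] for i in range(partitions)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=partitions) as pool:
        # Merge step: combine the per-partition counters into one result.
        for partial in pool.map(count_words, chunks):
            total.update(partial)
    return total
```

The approach works here because word counting is associative: partial counts can be merged in any order, which is the property that makes an application amenable to this kind of parallel processing.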

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the top five most widely used software packages, according to a 2012 KDNuggets survey of 798 professionals that asked "What Analytics, Data mining, Big Data software that you used in the past 12 months for a real project?" [112].

– R (30.7 %): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. When computing-intensive tasks are executed, code written in C, C++, and Fortran may be called from the R environment. In addition, skilled users can directly call R objects in C. R is a realization of the S language, which is an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and drawing plots. Compared to S, R is more popular since it is open source. R ranks first in the KDNuggets 2012 survey. Furthermore, in a 2012 survey of "Design languages you have used for data mining/analysis in the past year", R was also in first place, ahead of SQL and Java. Due to the popularity of R, database manufacturers such as Teradata and Oracle have released products supporting R.

– Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins with powerful functions for data analysis, such as Analysis ToolPak and Solver Add-in, are integrated initially, but such plug-ins can be used only if users enable them. Excel is also the only commercial software among the top five.

– Rapid-I RapidMiner (26.7 %): RapidMiner is an open source software used for data mining, machine learning, and predictive analysis. In a KDnuggets investigation in 2011, it was more frequently used than R (ranked Top 1). Data mining and machine learning programs provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment. The data mining flow is described in XML and displayed through a graphical user interface (GUI). RapidMiner is written in Java. It integrates the learner and evaluation method of Weka, and works with R. Functions of RapidMiner are implemented by connecting processes that include various operators. The entire flow can be deemed as a production line of a factory, with original data input and model results output. The operators can be considered as specific functions with different input and output characteristics.

– KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, and open-source-rich data integration, data processing, data analysis, and data mining platform [113]. It allows users to create data flows or data channels in a visualized manner, to selectively run some or all analytical procedures, and provides analytical results, models, and interactive views. KNIME is written in Java and, being based on Eclipse, provides more functions as plug-ins. Through plug-in files, users can insert processing modules for files, pictures, and time series, and integrate them into various open source projects, e.g., R and Weka. KNIME controls data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted in a visualized environment. KNIME is designed as a module-based and expandable framework. There is no dependence between its processing units and data containers, making it adaptive to the distributed environment and independent development. In addition, it is easy to expand KNIME: developers can effortlessly extend its various nodes and views.

– Weka/Pentaho (14.8 %): Weka, abbreviated from Waikato Environment for Knowledge Analysis, is a free and open-source machine learning and data mining software written in Java. Weka provides such functions as data processing, feature selection, classification, regression, clustering, association rules, and visualization. Pentaho is one of the most popular open-source BI software packages. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, data mining, and other aspects of BI. Weka's data processing algorithms are also integrated in Pentaho and can be directly called.
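Several of the tools above (RapidMiner and KNIME in particular) describe an analysis as a flow of connected operators, each with its own input and output. The sketch below illustrates that idea only; it does not use or imitate the API of any of these tools, and every name in it (`Operator`, `drop_missing`, `scale_first_column`, `run_process`) is hypothetical.

```python
from functools import reduce
from typing import Callable, Iterable

# An operator is modeled as a function from a list of rows to a list of rows.
Operator = Callable[[list], list]


def drop_missing(rows: list) -> list:
    """Pre-processing operator: discard rows that contain None."""
    return [row for row in rows if None not in row]


def scale_first_column(rows: list) -> list:
    """Transformation operator: rescale column 0 to the range [0, 1]."""
    if not rows:
        return rows
    values = [row[0] for row in rows]
    low, high = min(values), max(values)
    span = (high - low) or 1  # avoid division by zero for a constant column
    return [((row[0] - low) / span, *row[1:]) for row in rows]


def run_process(operators: Iterable[Operator], rows: list) -> list:
    """Feed the output of each operator into the next, like a production line."""
    return reduce(lambda data, op: op(data), operators, rows)
```

Because each operator has the same shape (rows in, rows out), operators can be reordered or replaced without changing the rest of the flow, which is the property the production-line analogy points at.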

6 Big data applications

In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful values via judgments, suggestions, supports, or decisions. However, data analysis involves a wide range of applications, which frequently change and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields, including structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

6.1 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically includes large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have emerged over the past decades. For example, as early as the 1990s, BI had become a prevailing technology for business applications, and network search engines based on massive data mining emerged in the early 21st century. Some potential and influential applications from different fields, and their data and analysis characteristics, are discussed as follows.

– Evolution of Commercial Applications: The earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. Analytical techniques used in such systems prevailed in the 1990s and were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, score cards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have been providing a unique opportunity for organizations to have an online presence and interact directly with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted by text analysis and website mining techniques. As reported in [115], the quantity of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the Internet of Things based on sensors are opening a new generation of innovative applications, and require considerably larger capacity to support location-sensing, people-oriented, and context-aware operation.

– Evolution of Network Applications: The early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to the mining of email contents and building search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages, full of various kinds of data, such as text, images, audio, videos, and interactive contents. Therefore, plenty of advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share contents.

– Evolution of Scientific Applications: Scientific research in many fields, such as astrophysics, oceanology, genomics, and environmental research, is acquiring massive data with high-throughput sensors and instruments. The U.S. National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, coordination environment, virtual machine resources, inter-operative analysis software, and data service to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets have high varieties in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and it is not easy to have a comprehensive coverage, we will focus on the key problems and technologies in data analysis in the following discussions.

6.2 Big data analysis fields


6.2.1 Structured data analysis

Business applications and scientific research may generate massive structured data, whose management and analysis rely on mature commercialized technologies, such as RDBMS, data warehouse, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning based on exact mathematical models and powerful algorithms has been applied to anomaly detection [117] and energy control [118]. Exploiting data characteristics, temporal and spatial mining can extract knowledge structures hidden in high-speed data streams and sensor data [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially in process analysis with event data [121].

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to have more business potential than structured data analysis. Generally, text analysis is a process of extracting useful information and knowledge from unstructured text. Text mining is interdisciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and data mining in particular. Most text mining systems are based on text representations and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammars [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, image, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which is discussed in Section 6.2.4. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves the mining of semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., email and newsgroup management and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters information for users according to inferred or user-configured profiles. The database method aims to model and integrate data on the Web so as to conduct more complex queries than keyword-based searches.

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagram of links within a website or among multiple websites. Models are built based on the topological structures formed by hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages. The topic-oriented crawler is another successful case that utilizes the models [127].
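To make the link-structure idea concrete, the following is a minimal sketch of a PageRank-style power iteration over a toy link graph. The graph, the damping factor of 0.85, the iteration count, and the function name `pagerank` are illustrative assumptions and are not taken from [125].

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict {page: [pages it links to]}."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            for q in targets:
                # Each page passes an equal share of its rank to every page it links to.
                new_rank[q] += damping * rank[p] / len(targets)
        rank = new_rank
    return rank


# Hypothetical four-page site in which every other page links to "home".
toy_links = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "contact": ["home"],
}
ranks = pagerank(toy_links)
```

In this toy graph the page with the most incoming links ends up with the highest score, while a page that nobody links to keeps only the baseline share.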

Web usage mining aims to mine the auxiliary data generated by Web sessions or activities, whereas Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or transactions, cache, user queries, bookmark data, mouse clicks and scrolls, and any other kinds of data generated through interaction with the Web. As Web services and Web 2.0 are becoming mature and popular, Web usage data will have increasingly high variety. Web usage mining plays key roles in personalized space, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.


6.2.4 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) have been growing at an amazing speed, and analysis is needed to extract useful knowledge from them and understand their semantics. Because multimedia data is heterogeneous and most of it contains richer information than simple structured data or text data, extracting information is confronted with the huge challenge of the semantic gap. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection.

Audio summarization can be accomplished by extracting the prominent words or phrases from metadata or by synthesizing a new representation. Video summarization is to interpret the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied to many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video and take other smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.

Multimedia annotation inserts labels to describe the contents of images and videos at both syntactic and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time and labor intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic gap. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to jointly explore manual and automatic multimedia annotation [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users to conveniently and quickly look up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, including shot boundary detection, key frame extraction, and scene segmentation. Based on the result of structural analysis, the second procedure is feature extraction, which mainly includes further mining the features of key frames, objects, texts, and movements, which are the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find the patterns of video contents and put videos into predefined categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos. The retrieval result is then optimized through relevance feedback.
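The structural-analysis step (boundary detection followed by key frame extraction) can be sketched as follows. This is only an illustration under stated assumptions: each frame is assumed to be already reduced to a small color histogram, boundaries are declared where the normalized histogram difference exceeds a hypothetical threshold of 0.5, and the middle frame of each segment is taken as its key frame. None of these choices comes from [131].

```python
def shot_boundaries(histograms, threshold=0.5):
    """Return frame indices whose histogram differs sharply from the previous frame."""
    boundaries = []
    for i in range(1, len(histograms)):
        prev, cur = histograms[i - 1], histograms[i]
        total = sum(prev) + sum(cur)
        # Normalized L1 distance: 0 for identical histograms, 1 for disjoint ones.
        diff = sum(abs(a - b) for a, b in zip(prev, cur)) / total if total else 0.0
        if diff > threshold:
            boundaries.append(i)
    return boundaries


def key_frames(histograms, boundaries):
    """Pick the middle frame of every segment as its key frame (a simple heuristic)."""
    starts = [0] + boundaries
    ends = boundaries + [len(histograms)]
    return [(s + e - 1) // 2 for s, e in zip(starts, ends)]


# Hypothetical 3-bin histograms for seven frames forming three visually distinct segments.
toy_histograms = [
    [8, 1, 1], [7, 2, 1], [8, 1, 1],
    [1, 1, 8], [1, 2, 7], [2, 1, 7],
    [1, 8, 1],
]
cuts = shot_boundaries(toy_histograms)
frames = key_frames(toy_histograms, cuts)
```

The selected key frames would then feed the feature extraction procedure; real systems use far richer features and adaptive thresholds.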

Multimedia recommendation is to recommend specific multimedia contents according to users' preferences. It has proven to be an effective approach to providing personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. The content-based methods identify general features of users or their interests, and recommend to users other contents with similar features. These methods largely rely on content similarity measurement, but most of them are troubled by limited content analysis and over-specialization. The collaborative-filtering-based methods identify groups with similar interests and recommend contents to group members according to their behavior [132]. Presently, hybrid methods have been introduced, which integrate the advantages of the aforementioned two types of methods to improve recommendation quality [133].
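As an illustration of the collaborative-filtering idea, the sketch below predicts a user's interest in unseen items from the ratings of users with similar taste. The rating data, the use of cosine similarity, and the function names are assumptions made for this example and are not the specific method of [132].

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two sparse rating dicts {item: rating}."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(r * r for r in u.values()))
    norm_v = sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)


def recommend(ratings, user, top_n=2):
    """Rank items the user has not rated by similarity-weighted ratings of other users."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        if sim <= 0.0:
            continue
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    predicted = {item: scores[item] / weights[item] for item in scores}
    return sorted(predicted, key=predicted.get, reverse=True)[:top_n]


# Hypothetical ratings of video clips on a 1-5 scale.
toy_ratings = {
    "ann": {"clip_a": 5, "clip_b": 4},
    "bob": {"clip_a": 5, "clip_b": 5, "clip_c": 5, "clip_d": 1},
    "cat": {"clip_a": 1, "clip_d": 5},
}
suggestions = recommend(toy_ratings, "ann")
```

Here "ann" is most similar to "bob", so the clip that "bob" rated highly is ranked first. A content-based method would instead compare item features, and a hybrid method would combine both signals.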

The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains text descriptions related to concepts and video examples [134]. In [135], the author proposed a new algorithm for special multimedia event detection using a few positive training examples. The research on video event detection is still in its infancy and mainly focuses on sports or news events, running or abnormal events in surveillance videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graph structures, describing the communications between two entities. The content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. From the data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

The research on link-based structural analysis has focused on link prediction, community discovery, social network evolution, and social influence analysis. SNS may be visualized as graphs, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertexes and edges are continually added to the graphs. Link prediction is to predict the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and utilizes the existing link information to train binary classifiers that predict future links [139]. Probabilistic methods aim to build models for connection probabilities among vertexes in SNS [140]. Linear algebraic methods compute the similarity between two vertexes according to a rank-reduced similarity matrix [141]. A community is represented by a sub-graph in which the edges connecting vertexes within the sub-graph feature high density, while the edges between two sub-graphs feature much lower density [142].
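A small sketch may help to show what a topological feature for link prediction looks like. The example below scores every currently unlinked vertex pair by the Jaccard overlap of the two neighbor sets; such a score could serve as one input feature to a classifier. Here it is simply used to rank candidate links, which is a simplification and not the classifier of [139]. The toy graph and function name are hypothetical.

```python
from itertools import combinations


def jaccard_link_scores(edges):
    """Score every unlinked vertex pair by the Jaccard overlap of their neighbor sets."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    scores = {}
    for u, v in combinations(sorted(neighbors), 2):
        if v in neighbors[u]:
            continue  # the pair is already linked
        union = neighbors[u] | neighbors[v]
        scores[(u, v)] = len(neighbors[u] & neighbors[v]) / len(union)
    return scores


# Hypothetical friendship graph with five users.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
scores = jaccard_link_scores(toy_edges)
best_pair = max(scores, key=scores.get)
```

Users "a" and "d" share two of their three distinct neighbors, so this pair receives the highest score among the unlinked pairs.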

Many methods for community detection have been proposed and studied, most of which rely on topology-based objective functions that capture the notion of community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. The research on SNS evolution aims to look for laws and deduction models to interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generation methods have been proposed to assist network and system design [147].
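One widely used topology-based objective function is modularity, which rewards partitions whose communities contain more internal edges than expected from vertex degrees alone. It is shown here purely as an example of such a function; the survey does not single it out, and the toy graph and names are assumptions.

```python
def modularity(edges, community_of):
    """Modularity Q = sum over communities of (e_c / m) - (d_c / (2m))**2.

    e_c is the number of edges inside community c, d_c the total degree of its
    vertexes, and m the number of edges in the undirected graph.
    """
    m = len(edges)
    intra, degree = {}, {}
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
    return sum(intra.get(c, 0) / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Hypothetical graph: two triangles joined by the single edge (c, d).
toy_graph = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"),
             ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {v: 0 for v in "abcdef"}
q_split = modularity(toy_graph, two_groups)
q_merged = modularity(toy_graph, one_group)
```

Splitting the graph into its two triangles yields a clearly positive score, whereas putting all vertexes in one community yields zero, which matches the intuition of dense sub-graphs connected by sparse links.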

Social influence refers to the case when individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, time effects, and the characteristics of networks and individuals. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise. For example, the blogosphere contains a large number of spam blogs, and Twitter contains many trivial Tweets. Third, SNS are dynamic networks, which vary and are updated frequently and quickly. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge among different media [150].

6.2.6 Mobile data analysis

By April 2013, Android Apps had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since the research on mobile analysis has just started, we will only introduce some recent and representative analysis applications in this section.

With the growing number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities based on geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest Webchat). Traditional network communities or SNS communities are short of online interaction among members, and the communities are active only when members are sitting in front of computers. On the contrary, mobile phones can support rich interaction at any time and anywhere. A mobile community is defined as a group of individuals with the same hobbies (i.e., health, safety, and entertainment, etc.) who gather together on networks, meet to set a common goal, decide on measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, the progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build body area networks for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations. In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize them.

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop an application for smart phones, which analyzes people's gait when they walk and uses the gait information to unlock the security system [11]. Meanwhile, Robert Delano and Brian Parise from Georgia Institute of Technology developed an application called iTrem, which monitors human body tremors with a built-in seismograph in a mobile phone, so as to cope with Parkinson's disease and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, in marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business models. In sales planning, after comparing massive data, enterprises can optimize their commodity prices. In operations, enterprises can improve their operational efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor costs. In the supply chain, using big data, enterprises may conduct inventory optimization, logistics optimization, and supplier coordination to narrow the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that such activities as "Multi-times score accumulation" and "score exchange in shops" are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers' transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, along with, more importantly, the age, gender, address, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, the market conditions of their brands, and consumers' behaviors, and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more favorable prices. The credit loan of Alibaba automatically analyzes and judges whether to grant loans to enterprises through the acquired enterprise transaction data by virtue of big data technology, while manual intervention does not occur in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan with only about 0.3 % bad loans, which is much lower than those of other commercial banks.

6.3.2 Application of IoT based big data

IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises may have profound experience with the application of IoT big data. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project cooperation between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and Miami city, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills this year by identifying and fixing running and leaking water pipes in a timely manner.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and connections among individuals based on an information network. Big data of online SNS mainly comes from instant messaging, online social networking, micro blogs, and shared spaces, which represent various user activities. The analysis of big data from online SNS uses computational analytical methods to understand relations in human society, by virtue of theories and methods from mathematics, informatics, sociology, and management science, along three dimensions: network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire value.

– Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preference, emotion, interest, and demand may be revealed.

– Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies aggregate relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.

The US Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime patterns, and even predict the crime rates in major regions [11].

Fig. 5 Enabling technologies for online social network-oriented big data

In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and then get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted research that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans. The goal is to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting sharp growth or drops in the volume of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing the ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation in the official statistics of Indonesia matches the number of Tweets about the rice price on Twitter, as shown in Fig. 6.
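The first of these aspects, detecting a sharp growth or drop in topic volume, can be sketched with a simple rolling z-score. The daily counts, the 7-day window, and the threshold of three standard deviations are hypothetical choices for illustration and do not describe the actual method of the Global Pulse project.

```python
from statistics import mean, stdev


def flag_volume_anomalies(counts, window=7, threshold=3.0):
    """Return indices whose count deviates strongly from the trailing-window mean."""
    flagged = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            continue  # flat history: no scale to judge a deviation against
        if abs(counts[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Hypothetical daily numbers of Tweets mentioning one topic.
daily_mentions = [100, 104, 98, 101, 99, 103, 97, 102, 100,
                  310, 105, 99, 101, 98, 102, 100, 103, 20]
anomalies = flag_volume_anomalies(daily_mentions)
```

The sketch flags both the surge and the later collapse. Note that a spike inflates the standard deviation of the windows that contain it, so a second anomaly shortly after the first could be masked; a robust statistic such as the median absolute deviation would reduce this effect.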

Generally speaking, the application of big data from online SNS may help to better understand users' behavior and master the laws of social and economic activities from the following three aspects:

– Early Warning: to rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotion, and preference of users.

– Real-time Feedback: to acquire groups' feedback on social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the health care business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final result into a highly personalized treatment plan to assess the risk factors and main treatment plans of patients. Then, doctors may reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or by suggesting that patients reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20.

The Mount Sinai Medical Center in the US utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through the software development kit (SDK) and open interface.

Fig. 6 The correlation between Tweets about rice price and food price inflation

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units and coordinate through mobile networks for the distribution of sensing tasks and the collection and utilization of sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of Crowdsourcing has been successfully applied to geotagged photograph positioning and navigation, urban road traffic sensing, market forecast, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach to problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. In fact, Crowdsourcing was applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of Crowdsourcing. The main idea of Crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need to deliberately deploy sensing modules or employ professionals, Crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, Spatial Crowdsourcing has become a hot topic. The operation framework of Spatial Crowdsourcing is as follows. A user may request a service and resources related to a specified location. Then, the mobile users who are willing to participate in the task move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data are sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions they provide, it can be forecast that Spatial Crowdsourcing will become more prevalent than traditional Crowdsourcing, e.g., Amazon Mechanical Turk and CrowdFlower.
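
The request, move, and return steps of the operation framework above can be illustrated with a minimal Python sketch. It only covers the selection of participants for a location-bound task. The names `Worker`, `haversine_km`, and `assign_workers`, the 5 km limit, and the coordinates are illustrative assumptions and are not taken from any system discussed in this survey.

```python
import math
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    lat: float
    lon: float
    willing: bool  # hypothetical flag: the user agreed to take part


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def assign_workers(task_lat, task_lon, workers, k=2, max_km=5.0):
    """Pick up to k willing workers within max_km of the task, nearest first."""
    in_range = []
    for w in workers:
        if not w.willing:
            continue
        dist = haversine_km(task_lat, task_lon, w.lat, w.lon)
        if dist <= max_km:
            in_range.append((dist, w.name))
    in_range.sort()
    return [name for _, name in in_range[:k]]


workers = [
    Worker("alice", 30.510, 114.410, True),
    Worker("bob", 30.520, 114.400, False),   # unwilling, skipped
    Worker("carol", 30.600, 114.300, True),  # roughly 15 km away, out of range
    Worker("dave", 30.515, 114.415, True),
]
chosen = assign_workers(30.511, 114.411, workers)
print(chosen)  # ['alice', 'dave']
```

A real platform would also have to handle task pricing, worker movement cost, and validation of the returned data, none of which is modeled here.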

6.3.6 Smart grid

Smart Grid is the next-generation power grid, constituted by traditional energy networks integrated with computation, communications, and control for optimized generation,

supply, and consumption of electric energy. Smart Grid related big data are generated from various sources, such as (i) power utilization habits of users; (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nationwide; (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI); (iv) energy market pricing and bidding data; and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as circuit breaker monitors and transformers). Smart Grid brings about the following challenges on exploiting big data.

– Grid planning: By analyzing data in the Smart Grid, regions that have excessively high electrical load or high power outage frequencies can be identified. Even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance. For example, researchers from the University of California, Los Angeles designed an “electric map” according to big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of the block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides an effective and visual load forecast for power grid planning in a city. Power grid facilities in blocks with high power outage frequencies and serious overloads, as displayed in the map, may be given priority for transformation.

– Interaction between power generation and power consumption: An ideal power grid shall balance power generation and consumption. However, the traditional power grid is constructed based on a one-directional approach of transmission-transformation-distribution-consumption, which does not allow the generation capacity to be adjusted according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters have been developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which help the supplier read power utilization data every 15 min rather than every month as in the past. Labor cost for meter reading is greatly reduced. Because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price


according to peak and low periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and low fluctuations of power consumption. In fact, the application of big data in the smart grid can help realize time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users.

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids are effectively analyzed, such intermittent renewable energy sources can be efficiently managed: the electricity they generate can be allocated to regions with electricity shortages. Such energy resources can complement traditional hydropower and thermal power generation.
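
To make the idea of time-sharing dynamic pricing from 15-min meter readings more concrete, the following Python sketch aggregates quarter-hour readings into hourly load and assigns a peak or off-peak price to each hour. It is only an illustration under stated assumptions: the rule (an hour is "peak" when its load exceeds the mean), the base price, and the markup and discount values are hypothetical and do not describe the pricing scheme of TXU Energy or any other supplier.

```python
from collections import defaultdict


def hourly_load(readings):
    """Sum 15-minute readings into per-hour totals.

    readings: iterable of (hour_of_day, watt_hours) pairs, four per hour.
    """
    totals = defaultdict(float)
    for hour, wh in readings:
        totals[hour] += wh
    return dict(totals)


def time_of_use_prices(load_by_hour, base_price=0.10,
                       peak_markup=0.5, off_peak_discount=0.3):
    """Price each hour: above-mean load is 'peak', everything else 'off-peak'.

    All numeric defaults are illustrative assumptions.
    """
    mean_load = sum(load_by_hour.values()) / len(load_by_hour)
    prices = {}
    for hour, load in load_by_hour.items():
        factor = 1 + peak_markup if load > mean_load else 1 - off_peak_discount
        prices[hour] = round(base_price * factor, 4)
    return prices


# Three sample hours with four quarter-hour readings each (made-up values).
readings = [(3, 200)] * 4 + [(12, 500)] * 4 + [(19, 1000)] * 4
load = hourly_load(readings)
prices = time_of_use_prices(load)
print(load)    # {3: 800.0, 12: 2000.0, 19: 4000.0}
print(prices)  # {3: 0.07, 12: 0.07, 19: 0.15}
```

With monthly readings, only one total per user would be available and no such hourly distinction could be made, which is why the finer sampling interval matters for this kind of pricing.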

7 Conclusion, open issues, and outlook

In this paper, we review the background and state-of-the-art of big data. Firstly, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture to readers of this exciting area.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, which are discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

– Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to assess the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated after the system is implemented and deployed, which makes it impossible to horizontally compare the advantages and disadvantages of alternative solutions, even before and after the implementation of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes: This includes memory mode, data flow mode, PRAM mode, and MR (MapReduce) mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.
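
As a small illustration of the MR mode mentioned above, the following single-process Python sketch mimics the map, shuffle, and reduce phases for a word count. It is a teaching sketch only and does not use Hadoop or any real MapReduce framework. The function names are our own. In a cluster, the shuffle step is where intermediate pairs are moved between machines, which is one place where the data transfer cost noted above appears.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data", "Big data analysis"])
print(counts)  # {'big': 2, 'data': 2, 'analysis': 1}
```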

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor that restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.


– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor in improving big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems in big data processing arise from traditional data analysis, including (i) data re-utilization: with the increase of data scale, more value may be mined from re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized so that more value can be mined; (iii) data exhaust, which means wrong data produced during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved.

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from comprehensive utilization of multiple datasets is far higher than the sum value of

individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and from different datasets.

– Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication; big data applications in small and medium-sized businesses; big data applications in government departments; big data services; and human-computer interaction of big data, etc. are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume is growing fast, there are more severe safety risks, while traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security-related challenges.

– Big data privacy: Big data privacy includes two aspects: (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc. of users may be more easily acquired, and users may not be aware. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed a big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher of Skull Security, used an information acquisition tool to acquire data from the public pages of Facebook users who had failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low-quality data wastes transmission and storage resources and has poor usability. Many factors may restrict data quality; for example, generation, acquisition, and transmission may all influence data


quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though many measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.

– Big data safety mechanism: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data cannot meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safe communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants’ data should be enabled on the premise of efficiency assurance.

– Big data application in information security: Big data not only brings about challenges to information security but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) by analyzing big data in the form of log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc. may also be more easily identified through the analysis of big data.
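
Two of the data quality dimensions named in this subsection, completeness and redundancy, can be measured automatically in a straightforward way. The Python sketch below is a minimal illustration under our own assumptions: records are flat dictionaries with hashable values, a record is "complete" when every required field is present and non-empty, and redundancy is the fraction of records that exactly duplicate an earlier one. Accuracy and consistency usually need reference data or domain rules and are not covered here. The function name `quality_report` is hypothetical.

```python
def quality_report(records, required_fields):
    """Return completeness and redundancy ratios for a list of dict records."""
    total = len(records)
    if total == 0:
        return {"completeness": 1.0, "redundancy": 0.0}

    # A record is complete if no required field is missing, None, or empty.
    complete = sum(
        all(rec.get(field) not in (None, "") for field in required_fields)
        for rec in records
    )

    # Exact duplicates collapse to one entry in the set of unique records.
    unique = {tuple(sorted(rec.items())) for rec in records}

    return {
        "completeness": complete / total,
        "redundancy": 1 - len(unique) / total,
    }


records = [
    {"id": 1, "temp": 20.5},
    {"id": 2, "temp": None},   # incomplete: missing measurement
    {"id": 1, "temp": 20.5},   # exact duplicate of the first record
    {"id": 3, "temp": 19.0},
]
report = quality_report(records, required_fields=["id", "temp"])
print(report)  # {'completeness': 0.75, 'redundancy': 0.25}
```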

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage of energy efficiency optimization, and processed hardware and software system architectures, etc. In particular, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the “T” (Technology) was the main concern, while technology drove the development of data. In the big data era, with the prominence of data value and advances in “I” (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact but also influence everyone’s ways of living and thinking, which is already happening. We could

not predict the future, but we may take precautions against events that may occur in the future.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have been concerned with better ways to cope with larger-scale, higher-diversity, and more complexly structured data. These efforts are represented by Spanner, the globally-distributed database of Google, and F1, a fault-tolerant, expandable, distributed relational database. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc. shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a user-friendly way can they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc. are frequently used to visualize results of data analysis. New presentation forms will emerge in the future, e.g., Microsoft Renlifang, a social search engine,


utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it is observed that the role of data is becoming increasingly significant. In the small-scale data era, in which logic is more complex than data, program design is mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed that of business logic. Consequently, programs are gradually being transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods are certain to emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

– During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

– Compared with accurate data, we would like to accept numerous and complicated data.

– We shall pay greater attention to correlations between things rather than exploring causal relationships.

– The simple algorithms of big data are more effective than complex algorithms of small data.

– Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace “experts.”

Throughout the history of human society, the demands and willingness of human beings have always been the driving forces of scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, development of mobile sensing technology, and progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will be

of increasing concern and will certainly cause enormous transformations of social activities in future society.

Acknowledgments This work was supported by China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao’s research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

                                            References

1. Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

2. Fact sheet: Big data across the federal government (2012). http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf

3. Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper

4. Drowning in numbers - digital data will flood the planet - and help us understand it better (2011). http://www.economist.com/blogs/dailychart/2011/11/bigdata-0

5. Lohr S (2012) The age of big data. New York Times, pp 11

6. Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/thedigitalbreadcrumbs-that-lead-to-big-data

7. Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-searchforanalysts-to-make-sense-of-big-data

8. Big data (2008). http://www.nature.com/news/specials/bigdata/index.html

9. Special online collection: dealing with big data (2011). http://www.sciencemag.org/site/special/data

10. Manyika J, McKinsey Global Institute, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute

11. Mayer-Schönberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Eamon Dolan/Houghton Mifflin Harcourt

12. Laney D (2001) 3-d data management: controlling data volume, velocity and variety. META Group Research Note, 6 February

13. Zikopoulos P, Eaton C et al (2011) Understanding big data: analytics for enterprise class hadoop and streaming data. McGraw-Hill Osborne Media

14. Meijer E (2011) The world according to linq. Communications of the ACM 54(10):45–51

15. Beyer M (2011) Gartner says solving big data challenge involves more than just managing volumes of data. Gartner. http://www.gartner.com/it/page.jsp

16. O. R. Team (2011) Big data now: current perspectives from O'Reilly Radar. O'Reilly Media

17. Grobelnik M (2012) Big data tutorial. http://videolectures.net/eswc2012grobelnikbigdata

18. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2008) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014

19. DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6):85–98


20. Walter T (2009) Teradata past, present, and future. UCI ISG lecture series on scalable data management

21. Ghemawat S, Gobioff H, Leung S-T (2003) The google file system. In: ACM SIGOPS Operating Systems Review, vol 37. ACM, pp 29–43

22. Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

23. Hey AJG, Tansley S, Tolle KM et al (2009) The fourth paradigm: data-intensive scientific discovery

24. Howard JH, Kazar ML, Menees SG, Nichols DA, Satyanarayanan M, Sidebotham RN, West MJ (1988) Scale and performance in a distributed file system. ACM Trans Comput Syst (TOCS) 6(1):51–81

25. Cattell R (2011) Scalable sql and nosql data stores. ACM SIGMOD Record 39(4):12–27

26. Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033

27. Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98

28. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J et al (2012) Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States

29. Sun Y, Chen M, Liu B, Mao S (2013) Far: a fault-avoidant routing method for data center networks with regular topology. In: Proceedings of ACM/IEEE symposium on architectures for networking and communications systems (ANCS’13). ACM

30. Wiki (2013) Applications and organizations using hadoop. http://wiki.apache.org/hadoop/PoweredBy

31. Bahga A, Madisetti VK (2012) Analyzing massive machine maintenance data in a computing cloud. IEEE Transac Parallel Distrib Syst 23(10):1831–1843

32. Gunarathne T, Wu T-L, Choi JY, Bae S-H, Qiu J (2011) Cloud computing paradigms for pleasingly parallel biomedical applications. Concurr Comput Prac Experience 23(17):2338–2354

33. Gantz J, Reinsel D (2010) The digital universe decade-are you ready. External publication of IDC (Analyse the Future) information and data, pp 1–16

34. Bryant RE (2011) Data-intensive scalable computing for scientific applications. Comput Sci Eng 13(6):25–33

35. Wahab MHA, Mohd MNH, Hanafi HF, Mohsin MFM (2008) Data pre-processing on web server logs for generalized association rules mining algorithm. World Acad Sci Eng Technol 48:2008

36. Nanopoulos A, Manolopoulos Y, Zakrzewicz M, Morzy T (2002) Indexing web access-logs for pattern queries. In: Proceedings of the 4th international workshop on web information and data management. ACM, pp 63–68

37. Joshi KP, Joshi A, Yesha Y (2003) On using a warehouse to analyze web logs. Distrib Parallel Databases 13(2):161–180

38. Chandramohan V, Christensen K (2002) A first look at wired sensor networks for video surveillance systems. In: Proceedings LCN 2002, 27th annual IEEE conference on local computer networks. IEEE, pp 728–729

39. Selavo L, Wood A, Cao Q, Sookoor T, Liu H, Srinivasan A, Wu Y, Kang W, Stankovic J, Young D et al (2007) Luster: wireless sensor network for environmental research. In: Proceedings of the 5th international conference on Embedded networked sensor systems. ACM, pp 103–116

40. Barrenetxea G, Ingelrest F, Schaefer G, Vetterli M, Couach O, Parlange M (2008) Sensorscope: out-of-the-box environmental monitoring. In: Information processing in sensor networks, 2008, international conference on IPSN’08. IEEE, pp 332–343

                                            41 Kim Y Schmid T Charbiwala ZM Friedman J Srivastava MB(2008) Nawms nonintrusive autonomous water monitoring sys-tem In Proceedings of the 6th ACM conference on Embeddednetwork sensor systems ACM pp 309ndash322

                                            42 Kim S Pakzad S Culler D Demmel J Fenves G Glaser STuron M (2007) Health monitoring of civil infrastructures usingwireless sensor networks In Information Processing in SensorNetworks 2007 6th International Symposium on IPSN 2007IEEE pp 254ndash263

                                            43 Ceriotti M Mottola L Picco GP Murphy AL Guna S Corra MPozzi M Zonta D Zanon P (2009) Monitoring heritage build-ings with wireless sensor networks the torre aquila deploymentIn Proceedings of the 2009 International Conference on Infor-mation Processing in Sensor Networks IEEE Computer Societypp 277ndash288

                                            44 Tolle G Polastre J Szewczyk R Culler D Turner N Tu KBurgess S Dawson T Buonadonna P Gay D et al (2005) Amacroscope in the redwoods In Proceedings of the 3rd interna-tional conference on embedded networked sensor systems ACMpp 51ndash63

45. Wang F, Liu J (2011) Networked wireless sensor data collection: issues, challenges, and approaches. IEEE Commun Surv Tutor 13(4):673–687

46. Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the 11th international conference on World Wide Web. ACM, pp 124–135

47. Choudhary S, Dincturk ME, Mirtaheri SM, Moosavi A, von Bochmann G, Jourdan G-V, Onut I-V (2012) Crawling rich internet applications: the state of the art. In: CASCON, pp 146–160

48. Ghani N, Dixit S, Wang T-S (2000) On ip-over-wdm integration. IEEE Commun Mag 38(3):72–84

49. Manchester J, Anderson J, Doshi B, Dravida S (1998) Ip over sonet. IEEE Commun Mag 36(5):136–142

50. Jinno M, Takara H, Kozicki B (2009) Dynamic optical mesh networks: drivers, challenges and solutions for the future. In: Optical communication, 2009, 35th European conference on ECOC'09. IEEE, pp 1–4

51. Barroso LA, Hölzle U (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synt Lect Comput Archit 4(1):1–108

52. Armstrong J (2009) Ofdm for optical communications. J Light Technol 27(3):189–204

53. Shieh W (2011) Ofdm for flexible high-speed optical networks. J Light Technol 29(10):1560–1577

54. Cisco data center interconnect design and deployment guide (2010)

55. Greenberg A, Hamilton JR, Jain N, Kandula S, Kim C, Lahiri P, Maltz DA, Patel P, Sengupta S (2009) Vl2: a scalable and flexible data center network. In: ACM SIGCOMM computer communication review, vol 39. ACM, pp 51–62

56. Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S (2009) Bcube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 39(4):63–74

57. Farrington N, Porter G, Radhakrishnan S, Bazzaz HH, Subramanya V, Fainman Y, Papen G, Vahdat A (2011) Helios: a hybrid electrical/optical switch architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 41(4):339–350

58. Abu-Libdeh H, Costa P, Rowstron A, O'Shea G, Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Comput Commun Rev 40(4):51–62

59. Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what's needed for datacenter network operations. IEEE Commun Mag 48(7):32–39


60. Wang G, Andersen DG, Kaminsky M, Papagiannaki K, Ng TS, Kozuch M, Ryan M (2010) c-through: Part-time optics in data centers. In: ACM SIGCOMM Computer Communication Review, vol 40. ACM, pp 327–338

61. Ye X, Yin Y, Yoo SJB, Mejia P, Proietti R, Akella V (2010) Dos: a scalable optical switch for datacenters. In: Proceedings of the 6th ACM/IEEE symposium on architectures for networking and communications systems. ACM, p 24

62. Singla A, Singh A, Ramachandran K, Xu L, Zhang Y (2010) Proteus: a topology malleable data center network. In: Proceedings of the 9th ACM SIGCOMM workshop on hot topics in networks. ACM, p 8

63. Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P (2011) Energy-efficient design of a scalable optical multiplane interconnection architecture. IEEE J Sel Top Quantum Electron 17(2):377–383

64. Kodi AK, Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance computing (hpc) systems. IEEE J Sel Top Quantum Electron 17(2):384–395

65. Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H (2012) Mirror mirror on the ceiling: flexible wireless links for data centers. ACM SIGCOMM Comput Commun Rev 42(4):443–454

66. Lenzerini M (2002) Data integration: a theoretical perspective. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. ACM, pp 233–246

67. Cafarella MJ, Halevy A, Khoussainova N (2009) Data integration for the relational web. Proc VLDB Endowment 2(1):1090–1101

68. Maletic JI, Marcus A (2000) Data cleansing: beyond integrity analysis. In: IQ. Citeseer, pp 200–209

69. Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1-2):83–113

70. Chen H, Ku W-S, Wang H, Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 51–62

71. Zhao Z, Ng W (2012) A model-based approach for rfid data stream cleansing. In: Proceedings of the 21st ACM international conference on information and knowledge management. ACM, pp 862–871

72. Khoussainova N, Balazinska M, Suciu D (2008) Probabilistic event extraction from rfid data. In: Data Engineering, 2008. IEEE 24th international conference on ICDE 2008. IEEE, pp 1480–1482

73. Herbert KG, Wang JTL (2007) Biological data cleaning: a case study. Int J Inf Qual 1(1):60–82

74. Tsai T-H, Lin C-Y (2012) Exploring contextual redundancy in improving object-based video coding for video sensor networks surveillance. IEEE Transac Multmed 14(3):669–682

75. Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 269–278

76. Kamath U, Compton J, Dogan RI, Jong KD, Shehu A (2012) An evolutionary algorithm approach for feature generation from sequence data and its application to dna splice site prediction. IEEE/ACM Transac Comput Biol Bioinforma (TCBB) 9(5):1387–1398

77. Leung K-S, Lee KH, Wang J-F, Ng EYT, Chan HLY, Tsui SKW, Mok TSK, Tse PC-H, Sung JJ-Y (2011) Data mining on dna sequences of hepatitis b virus. IEEE/ACM Transac Comput Biol Bioinforma 8(2):428–440

78. Huang Z, Shen H, Liu J, Zhou X (2011) Effective data co-reduction for multimedia similarity search. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, pp 1021–1032

79. Bleiholder J, Naumann F (2008) Data fusion. ACM Comput Surv (CSUR) 41(1):1

80. Brewer EA (2000) Towards robust distributed systems. In: PODC, p 7

81. Gilbert S, Lynch N (2002) Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59

82. McKusick MK, Quinlan S (2009) Gfs: evolution on fast-forward. ACM Queue 7(7):10

83. Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J (2008) Scope: easy and efficient parallel processing of massive data sets. Proc VLDB Endowment 1(2):1265–1276

84. Beaver D, Kumar S, Li HC, Sobel J, Vajgel P et al (2010) Finding a needle in haystack: facebook's photo storage. In: OSDI, vol 10, pp 1–8

85. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: amazon's highly available key-value store. In: SOSP, vol 7, pp 205–220

86. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In: Proceedings of the twenty-ninth annual ACM symposium on theory of computing. ACM, pp 654–663

87. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2):4

88. Burrows M (2006) The chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th symposium on Operating systems design and implementation. USENIX Association, pp 335–350

89. Lakshman A, Malik P (2009) Cassandra: structured storage system on a p2p network. In: Proceedings of the 28th ACM symposium on principles of distributed computing. ACM, pp 5–5

90. George L (2011) HBase: the definitive guide. O'Reilly Media Inc

91. Judd D (2008) hypertable-0.9.0.4-alpha

92. Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc

93. Crockford D (2006) The application/json media type for javascript object notation (json)

94. Murty J (2009) Programming amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc

95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc

96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 975–986

97. Yang H-C, Parker DS (2009) Traverse: simplified indexing on large map-reduce-merge clusters. In: Database systems for advanced applications. Springer, pp 308–322

98. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with sawzall. Sci Program 13(4):277–298

99. Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U (2009) Building a high-level dataflow system on top of map-reduce: the pig experience. Proceedings VLDB Endowment 2(2):1414–1425


100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629

101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72

102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) Dryadlinq: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14

103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: Parallel and distributed processing, 2008. IEEE international symposium on IPDPS 2008. IEEE, pp 1–11

104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146

105. Bu Y, Bill H, Balazinska M, Ernst MD (2010) Haloop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296

106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative mapreduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing. ACM, pp 810–818

107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2

108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: mapreduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing. ACM, p 7

109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) Ciel: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on Networked systems design and implementation, p 9

110. Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York

111. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

112. What analytics, data mining, big data software you used in the past 12 months for a real project? (2012) http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html

113. Berthold MR, Cebron N, Dill F, Gabriel TR, Kötter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer

114. Sallam RL, Richardson J, Hagerty J, Hostmann B (2011) Magic quadrant for business intelligence platforms. CT, Gartner Group, Stamford

115. Beyond the PC. Special Report on Personal Technology (2011)

116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A et al (2011) The iplant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on Software quality assurance. ACM, pp 70–77

118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers. ACM, pp 277–286

119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Record 34(2):18–26

120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM Sigmod Record 33(1):50–57

121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Transac Manag Inform Syst (TMIS) 3(2):7

122. Manning CD, Schütze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press

123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Transac Neural Netw 13(5):1163–1177

124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11

125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117

126. Konopnicki D, Shmueli O (1995) W3qs: a query system for the world-wide web. In: VLDB, vol 95, pp 54–65

127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640

128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval. ACM, p 2

129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25

130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19

131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819

132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939

133. Barragáns-Martínez AB, Costa-Montenegro E, Burguillo JC, Rey-López M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend tv programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311

134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91

135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia. ACM, pp 469–478

136. Hirsch JE (2005) An index to quantify an individual's scientific research output. Proc Natl Acad Sci USA 102(46):16569

137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company

138. Aggarwal CC (2011) An introduction to social network data analytics. Springer

139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1046–1054

140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied Computing. ACM, pp 1115–1116


141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Transac Knowl Discov Data (TKDD) 5(2):10

142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on World wide web. ACM, pp 631–640

143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis. ACM, pp 16–25

144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. ACM, pp 315–321

145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 145–158

146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 131–144

147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1007–1016

148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 807–816

149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on Web search and data mining. ACM, pp 657–666

150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360

151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (Last accessed 5 May 2013)

152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51

153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive rfid, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 2

154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval. ACM, pp 469–478

155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated, multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium. ACM, pp 445–454

156. Tasevski P (2011) Password attacks and generation strategies. Tartu University: Faculty of Mathematics and Computer Sciences



5.3.3 Analysis with different complexity

The time and space complexity of data analysis algorithms varies greatly with the kind of data and the demands of the application. For example, for applications that are amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.
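As a rough illustration of such a parallel processing model, the following Python sketch partitions the input, processes each partition independently (a map step), and merges the partial results (a reduce step). The function names and the word-count task are hypothetical choices made for this example and do not come from the survey; a thread pool is used only to keep the sketch portable, and `ProcessPoolExecutor` exposes the same interface for CPU-bound work.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, n_chunks):
    """Partition a list into at most n_chunks contiguous slices."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def count_words(chunk):
    """Map step: count words in one partition, independently of the others."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def parallel_word_count(records, n_workers=4):
    """Run the map step on each partition concurrently, then merge (reduce)."""
    chunks = split_into_chunks(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # Counter.update adds counts together
    return total


if __name__ == "__main__":
    logs = ["big data analysis", "data mining tools", "big data storage"]
    print(parallel_word_count(logs, n_workers=2).most_common(2))
```

The approach only pays off when the per-partition work is independent, as it is for counting; algorithms with dependencies between records need a different decomposition.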

5.4 Tools for big data mining and analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and open source software. In this section, we briefly review the five most widely used software packages, according to a 2012 KDnuggets survey of 798 professionals that asked "What Analytics, Data mining, Big Data software that you used in the past 12 months for a real project?" [112].

                                              – R (30.7%): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. For computing-intensive tasks, code written in C, C++, and Fortran may be called from the R environment. In addition, skilled users can directly call R objects in C. R is an implementation of the S language, an interpreted language developed by AT&T Bell Labs and used for data exploration, statistical analysis, and plotting. Compared with S, R is more popular since it is open source. R ranked first in the KDnuggets 2012 survey. Furthermore, in a 2012 survey on “Design languages you have used for data mining/analysis in the past year”, R was also in first place, ahead of SQL and Java. Owing to the popularity of R, database manufacturers such as Teradata and Oracle have released products supporting R.

                                              – Excel (29.8%): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities. When Excel is installed, some advanced plug-ins with powerful data analysis functions, such as Analysis ToolPak and Solver Add-in, are included from the start, but they can be used only after users enable them. Excel is also the only commercial software among the top five.

                                              – Rapid-I RapidMiner (26.7%): RapidMiner is open source software used for data mining, machine learning, and predictive analysis. In a KDnuggets survey in 2011, it was used more frequently than R and ranked first. The data mining and machine learning procedures provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment. The data mining flow is described in XML and displayed through a graphical user interface (GUI). RapidMiner is written in Java. It integrates the learners and evaluation methods of Weka, and works with R. Functions of RapidMiner are implemented by connecting processes that consist of various operators. The entire flow can be viewed as a factory production line, with original data as input and model results as output. The operators can be regarded as specific functions with different input and output characteristics.

                                              – KNIME (21.8%): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, open-source platform rich in functions for data integration, data processing, data analysis, and data mining [113]. It allows users to create data flows or data channels in a visual manner and to selectively run some or all analytical procedures, and it provides analytical results, models, and interactive views. KNIME is written in Java and, being based on Eclipse, provides further functions as plug-ins. Through plug-ins, users can insert processing modules for files, pictures, and time series, and integrate them into various open source projects, e.g., R and Weka. KNIME covers data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization. The entire development process is conducted in a visual environment. KNIME is designed as a module-based and expandable framework. There is no dependence between its processing units and data containers, which makes it suitable for distributed environments and independent development. In addition, KNIME is easy to extend: developers can readily add new nodes and views.

                                              – Weka/Pentaho (14.8%): Weka, abbreviated from Waikato Environment for Knowledge Analysis, is free and open-source machine learning and data mining software written in Java. Weka provides functions such as data processing, feature selection, classification, regression, clustering, association rules, and visualization. Pentaho is one of the most popular open-source BI software packages. It includes a web server platform and several tools to support reporting, analysis, charting, data integration, data mining, and other aspects of BI. Weka’s data processing algorithms are also integrated in Pentaho and can be called directly.
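The “production line” view of a data mining flow described for RapidMiner above, in which operators with defined inputs and outputs are connected in sequence, can be sketched in a few lines of Python. This is only an illustration of the concept: it does not use or imitate the RapidMiner or KNIME APIs, and the operator names (`drop_missing`, `scale`, `label`) and the record layout are hypothetical.

```python
from functools import reduce

def make_pipeline(*operators):
    """Chain operators so that each one's output feeds the next one's input."""
    return lambda data: reduce(lambda acc, op: op(acc), operators, data)

# Hypothetical operators: each takes a list of records and returns a new list.
def drop_missing(rows):
    return [r for r in rows if r.get("value") is not None]

def scale(rows):
    top = max(r["value"] for r in rows)
    return [{**r, "value": r["value"] / top} for r in rows]

def label(rows):
    return [{**r, "high": r["value"] >= 0.5} for r in rows]

flow = make_pipeline(drop_missing, scale, label)
result = flow([
    {"id": 1, "value": 10},
    {"id": 2, "value": None},
    {"id": 3, "value": 4},
])
```

Raw records enter at one end and labeled results leave at the other. Swapping or reordering operators changes the flow without touching the operators themselves.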

                                              6 Big data applications

                                              In the previous section, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful value via judgments, suggestions, support, or decisions. However, data analysis involves a wide range of applications, which change frequently and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields, including structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

                                              6.1 Application evolutions

                                              Recently, big data analysis has been proposed as an advanced analytical technology, which typically includes large-scale and complex programs under specific analytical methods. In fact, data-driven applications have been emerging over the past decades. For example, as early as the 1990s, BI became a prevailing technology for business applications, and network search engines based on massive data mining emerged in the early 21st century. Some potential and influential applications from different fields, together with their data and analysis characteristics, are discussed as follows.

                                              – Evolution of Commercial Applications: The earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. Analytical techniques used in such systems prevailed in the 1990s and were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, score cards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have been providing a unique opportunity for organizations to have an online presence and interact directly with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted with text analysis and website mining techniques. As reported in [115], the number of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the sensor-based Internet of Things are opening a new generation of innovative applications, and require considerably larger capacity to support location sensing, people-oriented, and context-aware operation.

                                              – Evolution of Network Applications: The early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to mining email contents and building search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages, full of various kinds of data, such as text, images, audio, videos, and interactive contents. Therefore, plenty of advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share contents.

                                              – Evolution of Scientific Applications: Scientific research in many fields, such as astrophysics, oceanology, genomics, and environmental research, is acquiring massive data with high-throughput sensors and instruments. The US National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, inter-operative analysis software, and data services to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets vary widely in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

                                              As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and comprehensive coverage is not easy, we focus on the key problems and technologies in data analysis in the following discussions.

                                              6.2 Big data analysis fields


                                              6.2.1 Structured data analysis

                                              Business applications and scientific research may generate massive structured data, whose management and analysis rely on mature commercial technologies, such as RDBMS, data warehouses, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

                                              However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning based on exact mathematical models and powerful algorithms has been applied to anomaly detection [117] and energy control [118]. Exploiting data characteristics, time and space mining can extract knowledge structures hidden in high-speed data flows and sensor data [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially in process analysis with event data [121].

                                              6.2.2 Text data analysis

                                              The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to have more business potential than structured data analysis. Generally, text analysis is a process of extracting useful information and knowledge from unstructured text. Text mining is inter-disciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and data mining in particular. Most text mining systems are based on text representation and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammars [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.
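To make the notion of a text representation concrete, the sketch below turns a few short documents into TF-IDF weighted bags of words using only the Python standard library. TF-IDF is chosen here as one widely used representation and is an assumption of this illustration, since the survey does not single out a particular scheme. The tokenizer is deliberately naive.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())

def tf_idf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    weights = []
    for toks in tokenized:
        tf = Counter(toks)
        weights.append(
            {t: (c / len(toks)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return weights

docs = ["big data analysis", "text data mining", "web data analysis"]
weights = tf_idf(docs)
```

A term that occurs in every document, such as “data” here, receives weight zero, while terms specific to one document receive the highest weights. Downstream tasks such as classification or clustering can then operate on these vectors.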

                                              6.2.3 Web data analysis

                                              Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

                                              Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, image, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which is discussed in Section 6.2.4. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves the mining of semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., email, newsgroup management, and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters information for users according to inferred or configured profiles. The database method aims to model and integrate data on the Web so as to conduct more complex queries than keyword-based searches.

                                              Web structure mining involves models for discovering Web link structures. Here, the structure refers to the graph of links within a website or among multiple websites. Models are built on the topology of hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages. The topic-oriented crawler is another successful case that utilizes these models [127].
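As an illustration of a link-structure model, the following sketch computes PageRank-style scores by power iteration over a small hyperlink graph. It follows the commonly described formulation with a damping factor. The damping value of 0.85, the fixed iteration count, the handling of pages without out-links, and the toy graph are assumptions of this sketch, not details taken from [125].

```python
def pagerank(links, damping=0.85, iters=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        # Every page receives a small baseline share.
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
```

In this toy graph page `c` collects links from three other pages and ends up with the highest score, while `d`, which nobody links to, ends up with the lowest.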

                                              Web usage mining aims to mine auxiliary data generated by Web sessions or activities, whereas Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers and proxy servers, browsers’ history records, user profiles, registration data, user sessions or trades, caches, user queries, bookmark data, mouse clicks and scrolls, and any other kinds of data generated through interaction with the Web. As Web services and Web 2.0 are becoming mature and popular, Web usage data will have increasingly high variety. Web usage mining plays key roles in personalized space, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.
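A typical first step when mining access logs is to group raw page requests into user sessions. The sketch below does this with an idle-time threshold. The 30-minute gap is a common heuristic assumed here, and the `(user, timestamp, url)` record layout is hypothetical rather than a real server log format.

```python
from collections import defaultdict

def sessionize(events, gap=1800):
    """Group (user, timestamp_seconds, url) events into per-user sessions.

    A new session starts when a user is idle for more than `gap` seconds.
    """
    by_user = defaultdict(list)
    for user, ts, url in sorted(events, key=lambda e: (e[0], e[1])):
        by_user[user].append((ts, url))
    sessions = []
    for user, visits in by_user.items():
        current = [visits[0][1]]
        for (prev_ts, _), (ts, url) in zip(visits, visits[1:]):
            if ts - prev_ts > gap:
                sessions.append((user, current))
                current = []
            current.append(url)
        sessions.append((user, current))
    return sessions

log = [
    ("u1", 0, "/home"),
    ("u1", 60, "/cart"),
    ("u2", 30, "/home"),
    ("u1", 4000, "/home"),
]
sessions = sessionize(log)
```

The resulting sessions can then feed later steps, such as the collaborative recommender systems mentioned above.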


                                              6.2.4 Multimedia data analysis

                                              Multimedia data (mainly including images, audio, and videos) has been growing at an amazing speed, and multimedia analysis aims to extract useful knowledge from it and understand its semantics. Because multimedia data is heterogeneous and most of it contains richer information than simple structured data or text data, information extraction faces the huge challenge of the semantic gap. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection.

                                              Audio summarization can be accomplished by extracting the prominent words or phrases from metadata or by synthesizing a new representation. Video summarization interprets the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods use a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied in many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video, and apply smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.

                                              Multimedia annotation inserts labels to describe the contents of images and videos at both the syntax and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time and labor intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic gap. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to explore manual and automatic multimedia annotation jointly [129].

                                              Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information, and assisting users to conveniently and quickly look up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, and includes shot boundary detection, key frame extraction, and scene segmentation. Based on the result of structural analysis, the second procedure is feature extraction, which mainly consists of further mining the features of key frames, objects, texts, and movements, which are the foundation of video indexing and retrieval. Data mining, classification, and annotation use the extracted features to find patterns in video contents and put videos into predefined categories, so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos. Relevance feedback is then used to optimize the retrieval results.
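The query and retrieval step can be illustrated with a small similarity search over precomputed feature vectors. Cosine similarity is used here as one common similarity measure, which is an assumption of this sketch because the survey does not name a specific one. The clip names and three-dimensional feature vectors are made up for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(index, query, k=2):
    """Rank indexed items by similarity to the query feature vector."""
    scored = sorted(
        index.items(), key=lambda kv: cosine(kv[1], query), reverse=True
    )
    return [name for name, _ in scored[:k]]

# Hypothetical index: one feature vector per video clip.
index = {
    "clip_a": [0.9, 0.1, 0.0],
    "clip_b": [0.1, 0.8, 0.1],
    "clip_c": [0.7, 0.2, 0.1],
}
top = retrieve(index, [1.0, 0.0, 0.0])
```

A real system would replace the linear scan with an index structure and would refine the ranking using relevance feedback from the user.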

                                              Multimedia recommendation aims to recommend specific multimedia contents according to users’ preferences. It has proven to be an effective approach to providing personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. Content-based methods identify general features of users or their interests, and recommend to users other contents with similar features. These methods largely rely on content similarity measurement, but most of them suffer from limited content analysis and over-specialization. Collaborative-filtering-based methods identify groups with similar interests and recommend contents to group members according to their behavior [132]. Presently, a mixed method has been introduced, which integrates the advantages of the aforementioned two types of methods to improve recommendation quality [133].
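The collaborative-filtering idea can be sketched as follows: users with similar rating histories are treated as a group, and items rated by similar users are suggested to a user who has not seen them yet. This is a minimal user-based variant written for illustration and is not the method of [132]. The user names, item names, and ratings are invented, and ratings are assumed to be positive numbers.

```python
import math

def similarity(a, b):
    """Cosine similarity over the items that both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    na = math.sqrt(sum(a[i] ** 2 for i in shared))
    nb = math.sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (na * nb)

def recommend(ratings, user, k=1):
    """Score unseen items by similarity-weighted ratings of other users."""
    scores = {}
    for other, theirs in ratings.items():
        if other == user:
            continue
        w = similarity(ratings[user], theirs)
        for item, r in theirs.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + w * r
    return sorted(scores, key=scores.get, reverse=True)[:k]

ratings = {
    "ann": {"film1": 5, "film2": 1},
    "bob": {"film1": 5, "film2": 1, "film3": 5, "film4": 1},
    "cat": {"film1": 1, "film2": 5, "film4": 5},
}
picks = recommend(ratings, "ann")
```

Because `ann` and `bob` agree on the films they both rated, `bob`'s high rating for `film3` dominates the suggestion. A content-based method would instead compare features of the films themselves.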

                                              The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains text descriptions related to concepts and video examples [134]. In [135], the author proposed a new algorithm for special multimedia event detection using a few positive training examples. Research on video event detection is still in its infancy, and mainly focuses on sports or news events, running or abnormal events in surveillance videos, and other similar events with repetitive patterns.

                                              6.2.5 Network data analysis

                                              Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graph structures, describing the communications between two entities. The content data contains text, image, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. From the data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

                                              The research on link-based structural analysis has focused on link prediction, community discovery, social network evolution, and social influence analysis. SNS may be visualized as graphs, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertexes and edges are continually added to the graphs. Link prediction aims to predict the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and uses the existing link information to generate binary classifiers that predict future links [139]. Probabilistic methods aim to build models for connection probabilities among vertexes in SNS [140]. Linear algebra methods compute the similarity between two vertexes according to rank-reduced similarity matrices [141]. A community is represented by a sub-graph in which the edges connecting vertexes inside the sub-graph have high density, while the edges between two sub-graphs have much lower density [142].
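To give a feel for link prediction, the sketch below scores every pair of currently unconnected vertexes by the Jaccard overlap of their neighbor sets and returns the most likely future links. This neighborhood heuristic is a simple, commonly used baseline that could serve as one input feature for the feature-based classifiers mentioned above. It is not the method of [139], [140], or [141], and the toy edge list is invented.

```python
from itertools import combinations

def build_graph(edges):
    """Build an undirected adjacency map from a list of (u, v) edges."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph

def predict_links(edges, k=1):
    """Rank non-adjacent vertex pairs by Jaccard overlap of their neighbors."""
    graph = build_graph(edges)
    scores = {}
    for u, v in combinations(sorted(graph), 2):
        if v in graph[u]:
            continue  # already connected
        union = graph[u] | graph[v]
        scores[(u, v)] = len(graph[u] & graph[v]) / len(union)
    return sorted(scores, key=scores.get, reverse=True)[:k]

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
best = predict_links(edges)
```

Here `a` and `d` share two of their three distinct neighbors, so the pair is ranked as the most likely new link.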

                                              Many methods for community detection have been proposed and studied, most of which rely on topology-based objective functions that capture the notion of community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. The research on SNS evolution aims to find laws and derive models that explain network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generation methods have been proposed to assist network and system design [147].

                                              Social influence refers to the case in which individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, time effects, and the characteristics of networks and individuals. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the propagation of contents in SNS is considered, the performance of link-based structural analysis may be further improved.

                                              Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise. For example, the blogosphere contains a large number of spam blogs, and Twitter contains a large number of trivial tweets. Third, SNS are dynamic networks, which vary and are updated frequently and quickly. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge among different media [150].

                                              6.2.6 Mobile data analysis

                                              By April 2013, Android Apps had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since the research on mobile analysis has only just started, we introduce only some recent and representative analysis applications in this section.

                                              With the growing number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities based on geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest Webchat). Traditional network communities or SNS communities lack online interaction among members, and the communities are active only when members are sitting in front of computers. In contrast, mobile phones can support rich interaction at any time and anywhere. A mobile community is defined as a group of individuals with the same hobbies (e.g., health, safety, and entertainment) who gather on networks, meet to set a common goal, decide on measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

                                              Recently, progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build body area networks for real-time monitoring of people’s health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations. In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize

                                              Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop an application for smart phones that analyzes people’s gait as they walk and uses the gait information to unlock the security system [11]. Meanwhile, Robert Delano and Brian Parise from the Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with a built-in seismograph in a mobile phone, so as to cope with Parkinson’s disease and other nervous system diseases [11].

                                              6.3 Key applications of big data

                                              6.3.1 Application of big data in enterprises

                                              At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, in marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business models. In sales planning, after comparing massive data, enterprises can optimize their commodity prices. In operations, enterprises can improve their operational efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor costs. In the supply chain, using big data, enterprises may conduct inventory optimization, logistics optimization, and supplier coordination to narrow the gap between supply and demand, control budgets, and improve services.

                                              In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) uses data analysis to recognize that activities such as “Multi-times score accumulation” and “score exchange in shops” are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20% of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15% and 7%, respectively. By analyzing customers’ transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.
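The targeting step of such a drop-out warning model can be sketched very simply: given a risk score per customer, select the top 20% for a retention offer. The scoring model itself is out of scope here. The customer identifiers and risk scores below are invented, and nothing in this sketch reflects CMB's actual system.

```python
import math

def top_fraction(scores, fraction=0.2):
    """Return the customers whose predicted drop-out risk is in the top `fraction`."""
    k = math.ceil(len(scores) * fraction)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]

# Hypothetical risk scores in [0, 1], e.g. produced by an upstream churn model.
risk = {
    "c01": 0.91, "c02": 0.15, "c03": 0.48, "c04": 0.77, "c05": 0.05,
    "c06": 0.33, "c07": 0.62, "c08": 0.20, "c09": 0.85, "c10": 0.10,
}
targets = top_fraction(risk)
```

With ten customers and a 20% cut-off, the two highest-risk customers are selected for the offer.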

                                              Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, along with, more importantly, the age, gender, address, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, the market conditions of their brands, and consumers’ behaviors, and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan of Alibaba uses big data technology to automatically analyze the acquired enterprise transaction data and judge whether to grant loans to enterprises, with no manual intervention in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan with only about 0.3% bad loans, which is much lower than that of other commercial banks.

                                              6.3.2 Application of IoT based big data

                                              IoT is not only an important source of big data, but also one of the main markets for big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises may have the most profound experience with the application of IoT big data. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so that headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project cooperation between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and Miami city, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills this year by identifying and fixing leaking water pipes in a timely manner.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and connections among individuals based on an information network. Big data of online SNS mainly comes from instant messages, online social networking, micro blogs, and shared spaces, etc., which represent various user activities. The analysis of big data from online SNS uses computational analytical methods for understanding relations in human society, by virtue of theories and methods that involve mathematics, informatics, sociology, and management science, etc., from three dimensions: network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education, etc. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire value.

– Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preferences, emotions, interests, and demands, etc. may be revealed.

– Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies, etc. aggregate relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.
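To make the notion of a community concrete, the sketch below groups users with a deliberately simple heuristic that is not taken from the survey: an edge is treated as a close relation only if its two endpoints share at least one common neighbor, and communities are the connected components that remain once the loose edges are ignored. The user names and edges are hypothetical.

```python
from collections import defaultdict

def communities(edges):
    """Group nodes into communities using a common-neighbor heuristic."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    # Keep only "close" relations: endpoints with at least one shared neighbor.
    strong = defaultdict(set)
    for u, v in edges:
        if neighbors[u] & neighbors[v]:
            strong[u].add(v)
            strong[v].add(u)

    # Connected components over the close relations (depth-first search).
    seen, groups = set(), []
    for start in sorted(neighbors):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(strong[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups

# Two hypothetical friend triangles joined by a single loose tie (C-D).
edges = [("A", "B"), ("B", "C"), ("A", "C"),
         ("C", "D"),
         ("D", "E"), ("E", "F"), ("D", "F")]
found = communities(edges)
```

On this toy graph the bridge between C and D is dropped because the two users have no friend in common, leaving the two triangles as separate communities. Production systems would normally use an established community detection algorithm instead of this heuristic.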

The U.S. Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

Fig. 5 Enabling technologies for online social network-oriented big data

In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the U.S. In addition, Global Pulse conducted research that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans. The goal is to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting sharp growth or drops in the amount of topics; 2) observing the weekly and monthly trends of dialogs on Twitter, and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation from the official statistics of Indonesia matches the number of Tweets about rice price on Twitter, as shown in Fig. 6.
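Two of the analyses described above, detecting a sharp change in topic volume and comparing a topic's volume with an external indicator, can be sketched with basic statistics. This is an assumed illustration rather than the Global Pulse method: the rolling z-score rule, the window size, the threshold, and the weekly counts are all hypothetical choices.

```python
from statistics import mean, pstdev

def sharp_changes(counts, window=4, threshold=3.0):
    """Indices whose count deviates from the preceding window by >= threshold sigmas."""
    flagged = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu, sigma = mean(history), pstdev(history)
        if sigma == 0:
            continue  # flat history: the z-score is undefined, so skip
        if abs(counts[i] - mu) / sigma >= threshold:
            flagged.append(i)
    return flagged

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sx * sy)

# Hypothetical weekly counts of messages on one topic.
weekly_counts = [100, 104, 98, 102, 101, 250, 103, 99]
spikes = sharp_changes(weekly_counts)
```

Here only the sixth value (index 5) is flagged. The `pearson` helper could then be applied to a topic's message counts and an official statistic sampled over the same periods; a coefficient near 1 would indicate that the two series move together, which is a correlation and not evidence of causation.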

Generally speaking, the application of big data from online SNS may help to better understand users’ behavior and master the laws of social and economic activities from the following three aspects:

– Early Warning: to rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotion, and preference of users.

– Real-time Feedback: to acquire groups’ feedback on some social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has great potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the health care business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final result into an extremely personalized treatment plan to assess the risk factors and main treatment plans of patients. Then, doctors may reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or suggesting that patients reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20.

The Mount Sinai Medical Center in the U.S. utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through the software development kit (SDK) and open interface.

Fig. 6 The correlation between Tweets about rice price and food price inflation

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units to coordinate with mobile networks for the distribution of sensing tasks and the collection and utilization of sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of Crowdsourcing has been successfully applied to geotagged photographs, positioning and navigation, urban road traffic sensing, market forecasting, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach for problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, Crowdsourcing had been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of Crowdsourcing. The main idea of Crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, Crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, Spatial Crowdsourcing has become a hot topic. The operation framework of Spatial Crowdsourcing is as follows. A user may request the service and resources related to a specified location. Then the mobile users who are willing to participate in the task will move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data will be sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions they provide, it can be forecast that Spatial Crowdsourcing will become more prevalent than traditional Crowdsourcing, e.g., Amazon Turk and Crowdflower.
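The request-and-assign step of this framework can be sketched as below. The nearest-willing-worker rule, the 5 km limit, and the worker records are assumptions made for illustration; the survey does not prescribe an assignment policy. Distances use the standard haversine formula with a mean Earth radius of 6371 km.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def assign_task(task_location, workers, max_km=5.0):
    """Return the id of the nearest willing worker within max_km, or None."""
    in_range = []
    for worker in workers:
        if not worker["willing"]:
            continue
        distance = haversine_km(task_location, worker["location"])
        if distance <= max_km:
            in_range.append((distance, worker["id"]))
    return min(in_range)[1] if in_range else None

# Hypothetical task location and mobile users.
task = (30.0, 114.0)
workers = [
    {"id": "w1", "location": (30.01, 114.0), "willing": True},    # about 1.1 km away
    {"id": "w2", "location": (30.001, 114.0), "willing": False},  # closest, but unwilling
    {"id": "w3", "location": (30.03, 114.0), "willing": True},    # about 3.3 km away
]
chosen = assign_task(task, workers)
```

In this example the closest user declines, so the task goes to `w1`; with a tighter distance limit no user qualifies and the function returns `None`.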

6.3.6 Smart grid

Smart Grid is the next generation power grid, constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart Grid related big data are generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nationwide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as circuit breaker monitors and transformers). Smart Grid brings about the following challenges on exploiting big data:

– Grid planning: By analyzing data in the Smart Grid, regions that have excessively high electrical load or high power outage frequencies can be identified. Even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an “electric map” according to the big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of the block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides effective and visual load forecasts for power grid planning in a city. Preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads, as displayed in the map, may be conducted.

– Interaction between power generation and power consumption: An ideal power grid shall balance power generation and consumption. However, the traditional power grid is constructed based on a one-directional approach of transmission-transformation-distribution-consumption, which does not allow the generation capacity to be adjusted according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters are developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which can help the supplier read power utilization data every 15 min rather than every month as in the past. Labor cost for meter reading is greatly reduced. Because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and low periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help the realization of time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users.

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable new energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortage. Such energy resources can complement the traditional hydropower and thermal power generations.
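The time-sharing pricing idea in the list above depends on having interval readings rather than one monthly total. The sketch below aggregates 15-minute readings into hourly load and prices them at a peak or off-peak rate. The peak hours, the rates, and the readings are hypothetical values chosen for illustration and do not reflect TXU Energy's tariffs.

```python
from collections import defaultdict

def hourly_load(readings):
    """Sum 15-minute meter readings, given as (hour_of_day, kWh), into kWh per hour."""
    load = defaultdict(float)
    for hour, kwh in readings:
        load[hour] += kwh
    return dict(load)

def time_of_use_bill(readings, peak_hours, peak_rate, off_peak_rate):
    """Price each interval at the peak or off-peak rate (currency units per kWh)."""
    total = 0.0
    for hour, kwh in readings:
        rate = peak_rate if hour in peak_hours else off_peak_rate
        total += kwh * rate
    return total

# Hypothetical readings: two intervals at 03:00 and two at 18:00.
readings = [(3, 0.5), (3, 0.5), (18, 1.0), (18, 1.0)]
load = hourly_load(readings)
bill = time_of_use_bill(readings, peak_hours={17, 18, 19, 20},
                        peak_rate=0.30, off_peak_rate=0.10)
```

With a single monthly reading, the 3 kWh in this example could only be billed at one flat rate; interval data lets the 2 kWh consumed during the evening peak be priced differently from the 1 kWh consumed at night.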

7 Conclusion, open issues, and outlook

In this paper, we review the background and state-of-the-art of big data. Firstly, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big-picture to readers of this exciting area.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, which are discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

– Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to weigh the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated when the system is implemented and deployed, which makes it impossible to horizontally compare the advantages and disadvantages of various alternative solutions, even before and after the implementation of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes: This includes memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data model, and software systems supporting big data, etc., should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor that restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.


– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor to improve big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems in big data processing arise from traditional data analysis, including (i) data re-utilization: with the increase of data scale, more value may be mined from the re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized, from which more value can be mined; and (iii) data exhaust, which means wrong data during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved.

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from comprehensive utilization of multiple datasets is far higher than the sum of the values of individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance is the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and from different datasets.

– Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction of big data, etc. are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume is growing fast, there are more severe safety risks, while the traditional data protection methods have already been shown to be not applicable to big data. In particular, big data safety is confronted with the following security related challenges.

– Big data privacy: Big data privacy includes two aspects: (i) protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc. of users may be more easily acquired, and users may not be aware; (ii) personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed as a big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher of Skull Security, acquired data in the public pages of Facebook users who failed to modify their privacy settings via an information acquisition tool. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low quality data wastes transmission and storage resources with poor usability. There are a lot of factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though a lot of measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.

– Big data safety mechanism: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data could not meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safety communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants’ data should be enabled on the premise of efficiency assurance.

– Big data application in information security: Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) after analysis of big data in the form of log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc. may also be more easily identified through the analysis of big data.
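Returning to the data quality item in the list above, two of the named dimensions, completeness and redundancy, lend themselves to simple automatic checks. The definitions below are one possible reading, not a standard from the survey: completeness is taken as the share of expected field values that are present, and redundancy as the share of records that repeat an earlier record on chosen key fields. The records are hypothetical.

```python
def completeness(records, fields):
    """Share of expected field values that are present (neither None nor empty)."""
    total = len(records) * len(fields)
    if total == 0:
        return 1.0
    present = sum(1 for record in records for field in fields
                  if record.get(field) not in (None, ""))
    return present / total

def redundancy(records, key_fields):
    """Share of records that duplicate an earlier record on key_fields."""
    if not records:
        return 0.0
    seen, duplicates = set(), 0
    for record in records:
        key = tuple(record.get(field) for field in key_fields)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates / len(records)

# Hypothetical records with one incomplete row and one repeated id.
records = [
    {"id": 1, "name": "a", "age": 30},
    {"id": 2, "name": "", "age": None},
    {"id": 1, "name": "a", "age": 30},
    {"id": 3, "name": "c", "age": 41},
]
quality = {
    "completeness": completeness(records, ["id", "name", "age"]),
    "redundancy": redundancy(records, ["id"]),
}
```

For these four records, 10 of the 12 expected values are present and one record repeats an earlier id. Accuracy and consistency generally need reference data or cross-field rules and are not covered by this sketch.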

The safety of big data has drawn great attention of researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage of energy efficiency optimization, and processed hardware and software system architectures, etc. Particularly, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the “T” (Technology) was the main concern, while technology drives the development of data. In the big data era, with the prominence of data value and advances in “I” (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact, but also influence everyone’s ways of living and thinking, which is just happening. We cannot predict the future, but may take precautions for possible events to occur in the future.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved a great success, such technologies are expected to fall behind and will be replaced given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have been concerned with better ways to cope with larger-scale, higher diversity, and more complexly structured data. These efforts are represented by the globally-distributed database Spanner of Google and the fault-tolerant, expandable, distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge values, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data center, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, impacts of big data on production management, business operation, and decision making, etc. shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., as in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a friendly manner can they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc. are frequently used to visualize results of data analysis. New presentation forms will occur in the future, e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well-known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it is observed that the role of data is becoming increasingly more significant. In the small scale data era, in which logic is more complex than data, program design is mainly process-oriented. As business data is becoming more complex, object-oriented design methods are developed. Nowadays, the complexity of business data has far surpassed business logic. Consequently, programs are gradually transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods are certain to emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking. Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

  – During data analysis, we will try to use all the data rather than analyzing only a small set of sample data.

  – We will be willing to accept numerous and complicated data rather than insisting on accurate data.

  – We shall pay greater attention to correlations between things rather than exploring causal relationships.

  – Simple algorithms on big data are more effective than complex algorithms on small data.

  – Analytical results of big data will reduce hasty and subjective factors in decision making, and data scientists will replace “experts.”

Throughout the history of human society, the demands and willingness of human beings have always been the driving force behind scientific and technological progress. Big data may provide reference answers for human decision making through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread use of big data. Big data is more like an extendable and expandable human brain than a substitute for it. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities and based on big data will attract increasing attention and will certainly cause enormous transformations of social activities in future society.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao’s research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

                                              References

1. Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

2. Fact sheet: Big data across the federal government (2012). http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf

3. Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper

4. Drowning in numbers - digital data will flood the planet - and help us understand it better (2011). http://www.economist.com/blogs/dailychart/2011/11/bigdata-0

5. Lohr S (2012) The age of big data. New York Times, pp 11

6. Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/thedigitalbreadcrumbs-that-lead-to-big-data

7. Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-searchforanalysts-to-make-sense-of-big-data

8. Big data (2008). http://www.nature.com/news/specials/bigdata/index.html

9. Special online collection: dealing with big data (2011). http://www.sciencemag.org/site/special/data

10. Manyika J, McKinsey Global Institute, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute

11. Mayer-Schönberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Eamon Dolan/Houghton Mifflin Harcourt

12. Laney D (2001) 3-d data management: controlling data volume, velocity and variety. META Group Research Note, 6 February

13. Zikopoulos P, Eaton C et al (2011) Understanding big data: analytics for enterprise class hadoop and streaming data. McGraw-Hill Osborne Media

14. Meijer E (2011) The world according to linq. Communications of the ACM 54(10):45–51

15. Beyer M (2011) Gartner says solving big data challenge involves more than just managing volumes of data. Gartner. http://www.gartner.com/it/page.jsp

16. O R Team (2011) Big data now: current perspectives from O’Reilly Radar. O’Reilly Media

17. Grobelnik M (2012) Big data tutorial. http://videolectures.net/eswc2012_grobelnik_big_data

18. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2008) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014

19. DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6):85–98

20. Walter T (2009) Teradata past, present, and future. UCI ISG lecture series on scalable data management

21. Ghemawat S, Gobioff H, Leung S-T (2003) The google file system. In: ACM SIGOPS Operating Systems Review, vol 37. ACM, pp 29–43

22. Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

23. Hey AJG, Tansley S, Tolle KM et al (2009) The fourth paradigm: data-intensive scientific discovery

24. Howard JH, Kazar ML, Menees SG, Nichols DA, Satyanarayanan M, Sidebotham RN, West MJ (1988) Scale and performance in a distributed file system. ACM Trans Comput Syst (TOCS) 6(1):51–81

25. Cattell R (2011) Scalable sql and nosql data stores. ACM SIGMOD Record 39(4):12–27

26. Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033

27. Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98

28. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J et al (2012) Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States

29. Sun Y, Chen M, Liu B, Mao S (2013) Far: a fault-avoidant routing method for data center networks with regular topology. In: Proceedings of ACM/IEEE symposium on architectures for networking and communications systems (ANCS’13). ACM

30. Wiki (2013) Applications and organizations using hadoop. http://wiki.apache.org/hadoop/PoweredBy

31. Bahga A, Madisetti VK (2012) Analyzing massive machine maintenance data in a computing cloud. IEEE Transac Parallel Distrib Syst 23(10):1831–1843

32. Gunarathne T, Wu T-L, Choi JY, Bae S-H, Qiu J (2011) Cloud computing paradigms for pleasingly parallel biomedical applications. Concurr Comput Prac Experience 23(17):2338–2354

33. Gantz J, Reinsel D (2010) The digital universe decade - are you ready? External publication of IDC (Analyse the Future) information and data, pp 1–16

34. Bryant RE (2011) Data-intensive scalable computing for scientific applications. Comput Sci Eng 13(6):25–33

35. Wahab MHA, Mohd MNH, Hanafi HF, Mohsin MFM (2008) Data pre-processing on web server logs for generalized association rules mining algorithm. World Acad Sci Eng Technol 48:2008

36. Nanopoulos A, Manolopoulos Y, Zakrzewicz M, Morzy T (2002) Indexing web access-logs for pattern queries. In: Proceedings of the 4th international workshop on web information and data management. ACM, pp 63–68

37. Joshi KP, Joshi A, Yesha Y (2003) On using a warehouse to analyze web logs. Distrib Parallel Databases 13(2):161–180

38. Chandramohan V, Christensen K (2002) A first look at wired sensor networks for video surveillance systems. In: Proceedings LCN 2002, 27th annual IEEE conference on local computer networks. IEEE, pp 728–729

39. Selavo L, Wood A, Cao Q, Sookoor T, Liu H, Srinivasan A, Wu Y, Kang W, Stankovic J, Young D et al (2007) Luster: wireless sensor network for environmental research. In: Proceedings of the 5th international conference on Embedded networked sensor systems. ACM, pp 103–116

40. Barrenetxea G, Ingelrest F, Schaefer G, Vetterli M, Couach O, Parlange M (2008) Sensorscope: out-of-the-box environmental monitoring. In: Information processing in sensor networks, 2008, international conference on IPSN’08. IEEE, pp 332–343

41. Kim Y, Schmid T, Charbiwala ZM, Friedman J, Srivastava MB (2008) Nawms: nonintrusive autonomous water monitoring system. In: Proceedings of the 6th ACM conference on Embedded network sensor systems. ACM, pp 309–322

42. Kim S, Pakzad S, Culler D, Demmel J, Fenves G, Glaser S, Turon M (2007) Health monitoring of civil infrastructures using wireless sensor networks. In: Information Processing in Sensor Networks, 2007, 6th International Symposium on IPSN 2007. IEEE, pp 254–263

43. Ceriotti M, Mottola L, Picco GP, Murphy AL, Guna S, Corra M, Pozzi M, Zonta D, Zanon P (2009) Monitoring heritage buildings with wireless sensor networks: the torre aquila deployment. In: Proceedings of the 2009 International Conference on Information Processing in Sensor Networks. IEEE Computer Society, pp 277–288

44. Tolle G, Polastre J, Szewczyk R, Culler D, Turner N, Tu K, Burgess S, Dawson T, Buonadonna P, Gay D et al (2005) A macroscope in the redwoods. In: Proceedings of the 3rd international conference on embedded networked sensor systems. ACM, pp 51–63

45. Wang F, Liu J (2011) Networked wireless sensor data collection: issues, challenges, and approaches. IEEE Commun Surv Tutor 13(4):673–687

46. Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the 11th international conference on World Wide Web. ACM, pp 124–135

47. Choudhary S, Dincturk ME, Mirtaheri SM, Moosavi A, von Bochmann G, Jourdan G-V, Onut I-V (2012) Crawling rich internet applications: the state of the art. In: CASCON, pp 146–160

48. Ghani N, Dixit S, Wang T-S (2000) On ip-over-wdm integration. IEEE Commun Mag 38(3):72–84

49. Manchester J, Anderson J, Doshi B, Dravida S (1998) Ip over sonet. IEEE Commun Mag 36(5):136–142

50. Jinno M, Takara H, Kozicki B (2009) Dynamic optical mesh networks: drivers, challenges and solutions for the future. In: Optical communication, 2009, 35th European conference on ECOC’09. IEEE, pp 1–4

51. Barroso LA, Hölzle U (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synt Lect Comput Archit 4(1):1–108

52. Armstrong J (2009) Ofdm for optical communications. J Light Technol 27(3):189–204

53. Shieh W (2011) Ofdm for flexible high-speed optical networks. J Light Technol 29(10):1560–1577

54. Cisco data center interconnect design and deployment guide (2010)

55. Greenberg A, Hamilton JR, Jain N, Kandula S, Kim C, Lahiri P, Maltz DA, Patel P, Sengupta S (2009) Vl2: a scalable and flexible data center network. In: ACM SIGCOMM computer communication review, vol 39. ACM, pp 51–62

56. Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S (2009) Bcube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 39(4):63–74

57. Farrington N, Porter G, Radhakrishnan S, Bazzaz HH, Subramanya V, Fainman Y, Papen G, Vahdat A (2011) Helios: a hybrid electrical/optical switch architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 41(4):339–350

58. Abu-Libdeh H, Costa P, Rowstron A, O’Shea G, Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Comput Commun Rev 40(4):51–62

59. Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what’s needed for datacenter network operations. IEEE Commun Mag 48(7):32–39

60. Wang G, Andersen DG, Kaminsky M, Papagiannaki K, Ng TS, Kozuch M, Ryan M (2010) c-through: part-time optics in data centers. In: ACM SIGCOMM Computer Communication Review, vol 40. ACM, pp 327–338

61. Ye X, Yin Y, Yoo SJB, Mejia P, Proietti R, Akella V (2010) Dos: a scalable optical switch for datacenters. In: Proceedings of the 6th ACM/IEEE symposium on architectures for networking and communications systems. ACM, p 24

62. Singla A, Singh A, Ramachandran K, Xu L, Zhang Y (2010) Proteus: a topology malleable data center network. In: Proceedings of the 9th ACM SIGCOMM workshop on hot topics in networks. ACM, p 8

63. Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P (2011) Energy-efficient design of a scalable optical multiplane interconnection architecture. IEEE J Sel Top Quantum Electron 17(2):377–383

64. Kodi AK, Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance computing (hpc) systems. IEEE J Sel Top Quantum Electron 17(2):384–395

65. Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H (2012) Mirror mirror on the ceiling: flexible wireless links for data centers. ACM SIGCOMM Comput Commun Rev 42(4):443–454

66. Lenzerini M (2002) Data integration: a theoretical perspective. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. ACM, pp 233–246

67. Cafarella MJ, Halevy A, Khoussainova N (2009) Data integration for the relational web. Proc VLDB Endowment 2(1):1090–1101

68. Maletic JI, Marcus A (2000) Data cleansing: beyond integrity analysis. In: IQ. Citeseer, pp 200–209

69. Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1-2):83–113

70. Chen H, Ku W-S, Wang H, Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 51–62

71. Zhao Z, Ng W (2012) A model-based approach for rfid data stream cleansing. In: Proceedings of the 21st ACM international conference on information and knowledge management. ACM, pp 862–871

72. Khoussainova N, Balazinska M, Suciu D (2008) Probabilistic event extraction from rfid data. In: Data Engineering, 2008, IEEE 24th international conference on ICDE 2008. IEEE, pp 1480–1482

73. Herbert KG, Wang JTL (2007) Biological data cleaning: a case study. Int J Inf Qual 1(1):60–82

74. Tsai T-H, Lin C-Y (2012) Exploring contextual redundancy in improving object-based video coding for video sensor networks surveillance. IEEE Transac Multimed 14(3):669–682

75. Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 269–278

76. Kamath U, Compton J, Dogan RI, Jong KD, Shehu A (2012) An evolutionary algorithm approach for feature generation from sequence data and its application to dna splice site prediction. IEEE/ACM Transac Comput Biol Bioinforma (TCBB) 9(5):1387–1398

77. Leung K-S, Lee KH, Wang J-F, Ng EYT, Chan HLY, Tsui SKW, Mok TSK, Tse PC-H, Sung JJ-Y (2011) Data mining on dna sequences of hepatitis b virus. IEEE/ACM Transac Comput Biol Bioinforma 8(2):428–440

78. Huang Z, Shen H, Liu J, Zhou X (2011) Effective data co-reduction for multimedia similarity search. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, pp 1021–1032

79. Bleiholder J, Naumann F (2008) Data fusion. ACM Comput Surv (CSUR) 41(1):1

80. Brewer EA (2000) Towards robust distributed systems. In: PODC, p 7

81. Gilbert S, Lynch N (2002) Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59

82. McKusick MK, Quinlan S (2009) Gfs: evolution on fast-forward. ACM Queue 7(7):10

83. Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J (2008) Scope: easy and efficient parallel processing of massive data sets. Proc VLDB Endowment 1(2):1265–1276

84. Beaver D, Kumar S, Li HC, Sobel J, Vajgel P et al (2010) Finding a needle in haystack: facebook’s photo storage. In: OSDI, vol 10, pp 1–8

85. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: amazon’s highly available key-value store. In: SOSP, vol 7, pp 205–220

86. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In: Proceedings of the twenty-ninth annual ACM symposium on theory of computing. ACM, pp 654–663

87. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2):4

88. Burrows M (2006) The chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th symposium on Operating systems design and implementation. USENIX Association, pp 335–350

89. Lakshman A, Malik P (2009) Cassandra: structured storage system on a p2p network. In: Proceedings of the 28th ACM symposium on principles of distributed computing. ACM, pp 5–5

90. George L (2011) HBase: the definitive guide. O’Reilly Media Inc

91. Judd D (2008) hypertable-0.9.0.4-alpha

92. Chodorow K (2013) MongoDB: the definitive guide. O’Reilly Media Inc

93. Crockford D (2006) The application/json media type for javascript object notation (json)

94. Murty J (2009) Programming amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O’Reilly Media Inc

95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O’Reilly Media Inc

96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 975–986

97. Yang H-C, Parker DS (2009) Traverse: simplified indexing on large map-reduce-merge clusters. In: Database systems for advanced applications. Springer, pp 308–322

98. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with sawzall. Sci Program 13(4):277–298

99. Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U (2009) Building a high-level dataflow system on top of map-reduce: the pig experience. Proceedings VLDB Endowment 2(2):1414–1425

100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629

101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72

102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) Dryadlinq: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14

103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: Parallel and distributed processing, 2008, IEEE international symposium on IPDPS 2008. IEEE, pp 1–11

104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146

105. Bu Y, Bill H, Balazinska M, Ernst MD (2010) Haloop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296

106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative mapreduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing. ACM, pp 810–818

107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2

108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: mapreduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing. ACM, p 7

109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) Ciel: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on Networked systems design and implementation, p 9

110. Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York

111. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

112. What analytics, data mining, big data software you used in the past 12 months for a real project? (2012). http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html

113. Berthold MR, Cebron N, Dill F, Gabriel TR, Kötter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer

114. Sallam RL, Richardson J, Hagerty J, Hostmann B (2011) Magic quadrant for business intelligence platforms. CT, Gartner Group, Stamford

115. Beyond the PC. Special Report on Personal Technology (2011)

116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A et al (2011) The iplant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on Software quality assurance. ACM, pp 70–77

118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers. ACM, pp 277–286

119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Record 34(2):18–26

120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM Sigmod Record 33(1):50–57

121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Transac Manag Inform Syst (TMIS) 3(2):7

122. Manning CD, Schütze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press

123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Transac Neural Netw 13(5):1163–1177

124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11

125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117

126. Konopnicki D, Shmueli O (1995) W3qs: a query system for the world-wide web. In: VLDB, vol 95, pp 54–65

127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640

128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval. ACM, p 2

129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25

130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19

131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819

132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939

133. Barragáns-Martínez AB, Costa-Montenegro E, Burguillo JC, Rey-López M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend tv programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311

134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91

135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia. ACM, pp 469–478

136. Hirsch JE (2005) An index to quantify an individual’s scientific research output. Proc Natl Acad Sci USA 102(46):16569

137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company

138. Aggarwal CC (2011) An introduction to social network data analytics. Springer

139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1046–1054

140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied Computing. ACM, pp 1115–1116

141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Transac Knowl Discov Data (TKDD) 5(2):10

142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on World wide web. ACM, pp 631–640

143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis. ACM, pp 16–25

144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. ACM, pp 315–321

145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 145–158

146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 131–144

147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1007–1016

148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 807–816

149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on Web search and data mining. ACM, pp 657–666

150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360

151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (Last accessed 5 May 2013)

152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51

153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive rfid, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 2

154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval. ACM, pp 469–478

155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated, multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium. ACM, pp 445–454

156. Tasevski P (2011) Password attacks and generation strategies. Tartu University: Faculty of Mathematics and Computer Sciences



However, data analysis involves a wide range of applications, which change frequently and are extremely complex. In this section, we first review the evolution of data sources. We then examine six of the most important data analysis fields, including structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. Finally, we introduce several key application fields of big data.

6.1 Application evolutions

Recently, big data analysis has been proposed as an advanced analytical technology, which typically includes large-scale and complex programs under specific analytical methods. As a matter of fact, data-driven applications have emerged over the past decades. For example, as early as the 1990s, BI became a prevailing technology for business applications, and network search engines based on massive data mining emerged in the early 21st century. Some potential and influential applications from different fields, together with their data and analysis characteristics, are discussed as follows.

– Evolution of Commercial Applications: The earliest business data was generally structured data, which was collected by companies from legacy systems and then stored in RDBMSs. Analytical techniques used in such systems were prevalent in the 1990s and were intuitive and simple, e.g., in the forms of reports, dashboards, queries with conditions, search-based business intelligence, online transaction processing, interactive visualization, score cards, predictive modeling, and data mining [114]. Since the beginning of the 21st century, networks and the World Wide Web (WWW) have provided a unique opportunity for organizations to have an online presence and interact directly with customers. Abundant product and customer information, such as clickstream data logs and user behavior, can be acquired from the WWW. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted with text analysis and website mining techniques. As reported in [115], the number of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the Internet of Things based on sensors are opening a new generation of innovative applications and require considerably larger capacity to support location sensing, people-oriented, and context-aware operation.

– Evolution of Network Applications: The early generation of the Internet mainly provided email and WWW services. Text analysis, data mining, and webpage analysis have been applied to mining email contents and building search engines. Nowadays, most applications are web-based, regardless of their field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform for interconnected pages full of various kinds of data, such as text, images, audio, videos, and interactive content. Therefore, a wealth of advanced technologies for semi-structured or unstructured data has emerged. For example, image analysis can extract useful information from images (e.g., face recognition). Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share content.

– Evolution of Scientific Applications: Scientific research in many fields, such as astrophysics, oceanology, genomics, and environmental research, is acquiring massive data with high-throughput sensors and instruments. The US National Science Foundation (NSF) has recently announced the BIGDATA program to promote efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific research disciplines have developed big data platforms and obtained useful outcomes. For example, in biology, iPlant [116] applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, inter-operative analysis software, and data services to assist researchers, educators, and students in enriching plant sciences. The iPlant datasets vary widely in form, including specification or reference data, experimental data, analog or model data, observation data, and other derived data.

As discussed, we can divide data analysis research into six key technical fields, i.e., structured data analysis, text data analysis, web data analysis, multimedia data analysis, network data analysis, and mobile data analysis. Such a classification aims to emphasize data characteristics, but some of the fields may utilize similar basic technologies. Since data analysis has a broad scope and comprehensive coverage is not easy, we will focus on the key problems and technologies in data analysis in the following discussions.

6.2 Big data analysis fields


6.2.1 Structured data analysis

Business applications and scientific research may generate massive structured data, whose management and analysis rely on mature commercialized technologies, such as RDBMS, data warehouse, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning based on exact mathematical models and powerful algorithms has been applied to anomaly detection [117] and energy control [118]. Exploiting data characteristics, time and space mining can extract knowledge structures hidden in high-speed data flows and sensor data [119]. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially in process analysis with event data [121].

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to have more business potential than structured data analysis. Generally, text analysis is a process of extracting useful information and knowledge from unstructured text. Text mining is inter-disciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and data mining in particular. Most text mining systems are based on text representations and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Some common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammars [122]. Some NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the different parts to be mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, image, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which is discussed in Section 6.2.4. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves the mining of semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., in email and newsgroup management and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters information for users according to inferred or configured user profiles. The database method aims to model and integrate data on the Web so as to conduct more complex queries than keyword-based searches.

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagram of links within a website or among multiple websites. Models are built on the topological structure provided by hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages. The topic-oriented crawler is another successful application of such models [127].
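
To make the idea of a link-structure model concrete, the following minimal sketch ranks pages by power iteration in the spirit of PageRank [125]. It is an illustration under stated assumptions rather than the algorithm as deployed by any search engine: the function name pagerank, the damping factor of 0.85, the fixed number of iterations, and the toy link graph are all hypothetical choices made here.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages by power iteration.

    links maps each page to the list of pages it links to (hypothetical input
    format). Returns a dict mapping page -> score; the scores sum to 1.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score evenly to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its score uniformly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy graph: page "c" is linked from "a", "b", and "d"; nothing links to "d".
toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
toy_rank = pagerank(toy_links)
for page in sorted(toy_rank, key=toy_rank.get, reverse=True):
    print(page, round(toy_rank[page], 3))
```

In this toy graph the page with the most incoming links ("c") receives the highest score and the page with no incoming links ("d") the lowest, which is the kind of relevance signal that link-based models contribute to page lookup.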

Web usage mining aims to mine auxiliary data generated by Web dialogues or activities, whereas Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or transactions, caches, user queries, bookmark data, mouse clicks and scrolls, and any other kinds of data generated through interaction with the Web. As Web services and Web 2.0 become mature and popular, Web usage data will have increasingly high variety. Web usage mining plays key roles in personalized space, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.
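
As an illustration of how raw access-log records can be turned into the user sessions mentioned above, the sketch below groups page requests per user and starts a new session after a period of inactivity. The record layout (user, timestamp in seconds, URL), the function name sessionize, and the 30-minute timeout are assumptions made for this example; the survey does not prescribe them.

```python
from collections import defaultdict


def sessionize(events, timeout=1800):
    """Group (user, timestamp_seconds, url) records into per-user sessions.

    A new session starts whenever the gap between two consecutive requests
    of the same user exceeds `timeout` seconds (an assumed heuristic).
    Returns a dict mapping user -> list of sessions, each a list of URLs.
    """
    hits_by_user = defaultdict(list)
    for user, timestamp, url in events:
        hits_by_user[user].append((timestamp, url))

    sessions = {}
    for user, hits in hits_by_user.items():
        hits.sort()  # order each user's requests by time
        user_sessions = []
        last_timestamp = None
        for timestamp, url in hits:
            if last_timestamp is None or timestamp - last_timestamp > timeout:
                user_sessions.append([])  # inactivity gap: open a new session
            user_sessions[-1].append(url)
            last_timestamp = timestamp
        sessions[user] = user_sessions
    return sessions


# Hypothetical log records: (user, seconds since start of log, requested URL).
log = [
    ("u1", 0, "/home"),
    ("u1", 600, "/cart"),
    ("u1", 4000, "/home"),
    ("u2", 100, "/search"),
]
print(sessionize(log))
# {'u1': [['/home', '/cart'], ['/home']], 'u2': [['/search']]}
```

Sessions of this kind are a typical intermediate representation from which usage patterns, such as frequently co-visited pages, can then be mined.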


6.2.4 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) have been growing at an amazing speed. Multimedia analysis aims to extract useful knowledge from such data and understand its semantics. Because multimedia data is heterogeneous and most of it contains richer information than simple structured data or text data, extracting information is confronted with the huge challenge of the semantic gap. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection.

Audio summarization can be accomplished by extracting the prominent words or phrases from metadata or by synthesizing a new representation. Video summarization aims to identify the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied to many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video and apply smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.

Multimedia annotation inserts labels to describe the contents of images and videos at both syntactic and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time- and labor-intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic gap. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to jointly explore manual and automatic multimedia annotation [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users to conveniently and quickly look up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, and includes shot boundary detection, key frame extraction, and scene segmentation. Based on the result of structural analysis, the second procedure is feature extraction, which mainly includes further mining the features of key frames, objects, texts, and movements, which are the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find the patterns of video contents and put videos into predefined categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos. The retrieval results are then refined through relevance feedback.

Multimedia recommendation aims to recommend specific multimedia contents according to users' preferences. It has proven to be an effective approach to providing personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. The content-based methods identify general features of users or their interests and recommend to users other contents with similar features. These methods largely rely on content similarity measurement, but most of them are troubled by limited content analysis and over-specialization. The collaborative-filtering-based methods identify groups with similar interests and recommend contents to group members according to their behavior [132]. More recently, hybrid methods have been introduced, which integrate the advantages of the two aforementioned types of methods to improve recommendation quality [133].
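
The collaborative-filtering idea described above can be sketched in a few lines. The example below is a minimal user-based variant, assuming explicit numeric ratings and cosine similarity between users; the function names cosine and recommend, the rating scale, and the toy data are hypothetical and are not taken from [132] or [133].

```python
import math


def cosine(ratings_a, ratings_b):
    """Cosine similarity between two users' rating dicts (item -> rating)."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    dot = sum(ratings_a[item] * ratings_b[item] for item in common)
    norm_a = math.sqrt(sum(value * value for value in ratings_a.values()))
    norm_b = math.sqrt(sum(value * value for value in ratings_b.values()))
    return dot / (norm_a * norm_b)


def recommend(ratings, user, top_n=3):
    """Recommend items the user has not rated, ranked by the
    similarity-weighted average rating given by other users."""
    target = ratings[user]
    weighted_sum = {}
    weight_total = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        similarity = cosine(target, other_ratings)
        if similarity <= 0:
            continue  # ignore users with no overlap in taste
        for item, rating in other_ratings.items():
            if item in target:
                continue  # only score items that are new to the user
            weighted_sum[item] = weighted_sum.get(item, 0.0) + similarity * rating
            weight_total[item] = weight_total.get(item, 0.0) + similarity
    predicted = {item: weighted_sum[item] / weight_total[item] for item in weighted_sum}
    # Highest predicted rating first; ties broken alphabetically.
    return sorted(predicted, key=lambda item: (-predicted[item], item))[:top_n]


# Hypothetical ratings of TV genres on a 1-5 scale.
toy_ratings = {
    "ann": {"news": 5, "sports": 1},
    "bob": {"news": 5, "sports": 1, "drama": 5, "quiz": 1},
    "cat": {"news": 1, "sports": 5, "drama": 1, "quiz": 5},
}
print(recommend(toy_ratings, "ann"))  # ['drama', 'quiz']
```

Because ann's ratings resemble bob's far more than cat's, bob's preferences dominate the prediction and "drama" is ranked first. A content-based method would instead compare item features, and a hybrid method would combine both signals.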

The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains text descriptions related to concepts and video examples [134]. In [135], the authors proposed a new algorithm for ad hoc multimedia event detection using a few positive training examples. The research on video event detection is still in its infancy and mainly focuses on sports or news events, running or abnormal events in surveillance videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services, including Twitter, Facebook, and LinkedIn, have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graph structures describing the communications between two entities. The content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. From the data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

The research on link-based structural analysis has focused on link prediction, community discovery, social network evolution, and social influence analysis. SNS may be visualized as graphs, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertexes and edges are continually added to the graphs. Link prediction aims to predict the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and utilizes the existing link information to train binary classifiers that predict future links [139]. Probabilistic methods aim to build models for connection probabilities among vertexes in SNS [140]. Linear algebra methods compute the similarity between two vertexes from a reduced-rank similarity matrix [141]. A community is represented by a sub-graph in which the edges connecting vertexes within the sub-graph have high density, while the edges between two sub-graphs have much lower density [142].
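
As a simple illustration of link prediction on such a graph, the sketch below scores every pair of unconnected vertexes by the Jaccard coefficient of their neighbor sets, a common topological feature that could feed a feature-based classifier. It is a stand-in chosen for brevity, not the method of [139], [140], or [141]; the function names and the toy friendship graph are hypothetical.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map: vertex -> set of neighbors."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def jaccard_scores(edges):
    """Score each unconnected vertex pair that shares at least one neighbor.

    The score is |N(u) & N(v)| / |N(u) | N(v)|, where N(x) is the neighbor
    set of x. Higher scores suggest a more likely future link.
    """
    adjacency = build_adjacency(edges)
    scores = {}
    for u, v in combinations(sorted(adjacency), 2):
        if v in adjacency[u]:
            continue  # already connected, nothing to predict
        common = adjacency[u] & adjacency[v]
        if common:
            scores[(u, v)] = len(common) / len(adjacency[u] | adjacency[v])
    return scores


# Hypothetical friendship graph.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
toy_scores = jaccard_scores(toy_edges)
for pair in sorted(toy_scores, key=lambda p: (-toy_scores[p], p)):
    print(pair, round(toy_scores[pair], 3))
```

Here the pair ("a", "d") obtains the highest score because the two users share two of their three distinct neighbors. In a feature-based classifier, such scores would be combined with other features and trained against links that actually appeared later.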

Many methods for community detection have been proposed and studied, most of which are topology-based objective functions relying on the concept of capturing community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. The research on SNS evolution aims to look for laws and deduction models to interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generation methods have been proposed to assist network and system design [147].

Social influence refers to the case when individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, time effects, and characteristics of networks and individuals. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise. For example, the blogosphere contains a large number of spam blogs, and Twitter likewise contains many trivial Tweets. Third, SNS are dynamic networks, which change and are updated frequently and quickly. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge among different media [150].

6.2.6 Mobile data analysis

By April 2013, Android Apps had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since the research on mobile analysis has just started, we will only introduce some recent and representative analysis applications in this section.

With the growth in the number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities with geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest Webchat). Traditional network communities or SNS communities lack online interaction among members, and the communities are active only when members are sitting in front of computers. On the contrary, mobile phones can support rich interaction at any time and anywhere. A mobile community is defined as a group of individuals with the same hobbies (i.e., health, safety, and entertainment, etc.) who gather together on networks, meet to set a common goal, decide measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build body area networks for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations, etc.


In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop an application for smart phones, which analyzes people's pace when they walk and uses the pace information for unlocking the safety system [11]. In the meanwhile, Robert Delano and Brian Parise from Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with a built-in seismograph in a mobile phone, so as to cope with Parkinson's disease and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, in marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business modes. In sales planning, after comparison of massive data, enterprises can optimize their commodity prices. In operations, enterprises can improve their operation efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor cost. In the supply chain, using big data, enterprises may conduct inventory optimization, logistic optimization, and supplier coordination, etc., to mitigate the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that such activities as "Multi-times score accumulation" and "score exchange in shops" are effective for attracting quality customers. By building a customer drop out warning model, the bank can sell high-yield financial products to the top 20% of customers who are most likely to drop out, so as to retain them. As a result, the drop out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15% and 7%, respectively. By analyzing customers' transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.
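The selection step of such a drop out warning model can be sketched as follows. This is only an illustration, not the bank's actual method: the function name `highest_risk_customers`, the score dictionary, and the default share of 20 percent are assumptions, and the scores are taken as given (how they are produced is outside the sketch).

```python
def highest_risk_customers(scores, top_percent=20):
    """Return the IDs of the customers with the highest drop out scores.

    `scores` maps a (hypothetical) customer ID to a model score, where a
    higher score means the customer is more likely to drop out.
    `top_percent` is an assumed, tunable share of the customer base.
    """
    if not 0 < top_percent <= 100:
        raise ValueError("top_percent must be in (0, 100]")
    # Integer ceiling division avoids floating-point rounding surprises.
    k = -(-len(scores) * top_percent // 100)
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [customer_id for customer_id, _ in ranked[:k]]


# Hypothetical scores for ten customers, c0 (lowest risk) to c9 (highest).
example_scores = {f"c{i}": i / 10 for i in range(10)}
targets = highest_risk_customers(example_scores)
```

With ten customers and the default share, `targets` contains the two highest-scoring IDs; these would be the candidates for a retention offer.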

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted in Taobao, and the corresponding transaction time, commodity prices, and purchase quantities are recorded every day, and, more importantly, so are the age, gender, address, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, market conditions of their brands, and consumers' behaviors, etc., and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan of Alibaba automatically analyzes and judges whether to lend loans to enterprises through the acquired enterprise transaction data by virtue of big data technology, while manual intervention does not occur in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan with only about 0.3% bad loans, which is much lower than those of other commercial banks.

6.3.2 Application of IoT based big data

IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises may have the most profound experience with the application of IoT big data. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project cooperation between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and Miami city, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills this year by identifying and fixing running and leaking water pipes in a timely manner.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and connections among individuals based on an information network. Big data of online SNS mainly comes from instant messages, online social, micro blog, and shared space, etc., which represent various user activities. The analysis of big data from online SNS uses computational analytical methods for understanding relations in human society, by virtue of theories and methods that involve mathematics, informatics, sociology, and management science, etc., from three dimensions including network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education, etc. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire values.

– Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preference, emotion, interest, and demand, etc. may be revealed.

– Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies, etc. aggregate relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.

The US Santa Cruz Police Department experimented by applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

Fig. 5 Enabling technologies for online social network-oriented big data

In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, it was found that most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and then get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted research that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans. The goal is to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting the sharp growth or drop in the volume of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation from the official statistics of Indonesia matches the number of Tweets about rice prices on Twitter, as shown in Fig. 6.
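Two of the analyses described above, detecting a sharp growth or drop in topic volume and comparing Tweet counts with an external indicator, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method used by Global Pulse: the trailing-window z-score rule, the function names, the window size, the threshold, and the sample counts are all hypothetical.

```python
from statistics import mean, stdev


def detect_volume_anomalies(counts, window=4, threshold=3.0):
    """Flag positions whose count deviates sharply from the recent past.

    A position is flagged when it lies more than `threshold` sample
    standard deviations from the mean of the previous `window` counts.
    """
    flagged = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: any change at all is treated as a deviation.
            if counts[i] != mu:
                flagged.append(i)
        elif abs(counts[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical weekly Tweet counts for one topic, with one burst.
weekly_counts = [100, 104, 98, 102, 101, 99, 400, 103]
bursts = detect_volume_anomalies(weekly_counts)
```

A correlation such as the one in Fig. 6 would correspond to calling `pearson` on a Tweet-count series and an inflation series sampled at the same dates; real data would also need alignment and detrending, which this sketch omits.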

Generally speaking, the application of big data from online SNS may help to better understand users' behavior and master the laws of social and economic activities from the following three aspects:

– Early Warning: to rapidly cope with a crisis, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotion, and preference of users.

– Real-time Feedback: to acquire groups' feedback on some social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the health care business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final result into a highly personalized treatment plan to assess the risk factors and main treatment plans of patients. Then, doctors may reduce morbidity by 50% in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or suggesting that patients reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20.

The Mount Sinai Medical Center in the US utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

Fig. 6 The correlation between Tweets about rice price and food price inflation

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with a third-party application with the software development kit (SDK) and open interface.

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units and coordinate with mobile networks for the distribution of sensing tasks and the collection and utilization of sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of Crowdsourcing has been successfully applied to geotagged photographs, positioning and navigation, urban road traffic sensing, market forecast, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach for problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, Crowdsourcing had been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of Crowdsourcing. The main idea of Crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, Crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, Spatial Crowdsourcing becomes a hot topic. The operation framework of Spatial Crowdsourcing is as follows. A user may request the service and resources related to a specified location. Then the mobile users who are willing to participate in the task will move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data will be sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions provided by mobile devices, it can be forecasted that Spatial Crowdsourcing will be more prevailing than traditional Crowdsourcing, e.g., Amazon Turk and Crowdflower.
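The request, participation, and return steps of this framework can be illustrated with a small selection routine. This is a sketch under simplifying assumptions rather than a description of any deployed platform: the `Worker` type, the planar coordinates, the nearest-first rule, and all parameter names are hypothetical.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Worker:
    name: str
    x: float       # assumed planar coordinates, e.g. km on a city grid
    y: float
    willing: bool  # whether the user has opted in to the task


def select_workers(workers, task_xy, k=1, max_dist=float("inf")):
    """Pick up to `k` willing workers closest to the requested location.

    Workers farther than `max_dist` are skipped, modelling the idea that
    only users who can reasonably move to the location take part.
    """
    tx, ty = task_xy
    candidates = [
        (hypot(w.x - tx, w.y - ty), w.name)
        for w in workers
        if w.willing
    ]
    in_range = sorted(c for c in candidates if c[0] <= max_dist)
    return [name for _, name in in_range[:k]]


workers = [
    Worker("a", 0, 0, True),
    Worker("b", 3, 4, True),
    Worker("c", 1, 1, False),  # nearest, but not willing to participate
    Worker("d", 10, 10, True),
]
chosen = select_workers(workers, task_xy=(1, 1), k=2)
```

A real system would use geographic distance and account for travel time, incentives, and data quality; the nearest-first rule here is only the simplest possible policy.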

6.3.6 Smart grid

Smart Grid is the next generation power grid, constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart Grid related big data are generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, which are measured by phasor measurement units (PMU) deployed nationwide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as Circuit Breaker Monitors and transformers). Smart Grid brings about the following challenges on exploiting big data:

– Grid planning: By analyzing data in the Smart Grid, the regions that have excessively high electrical load or high power outage frequencies can be identified. Even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from University of California, Los Angeles designed an "electric map" according to the big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of the block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides effective and visual load forecast for power grid planning in a city. Power grid facilities in blocks with high power outage frequencies and serious overloads, as displayed in the map, may be given priority for transformation.

– Interaction between power generation and power consumption: An ideal power grid shall balance power generation and consumption. However, the traditional power grid is constructed based on the one-directional approach of transmission-transformation-distribution-consumption, which does not allow the generation capacity to be adjusted according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters are developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which can help the supplier read power utilization data every 15 min rather than every month as in the past. Labor cost for meter reading is greatly reduced. Because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and low periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help the realization of time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users.

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable new energy sources can be efficiently managed, and the electricity generated by the new energy resources can be allocated to regions with electricity shortage. Such energy resources can complement the traditional hydropower and thermal power generations.
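The time-sharing dynamic pricing idea mentioned in the second item above can be made concrete with a small sketch that turns 15-minute meter readings into hourly loads and assigns a higher price to above-average hours. It is only an illustration: the function names, the above-average rule for "peak", and the price factors are assumptions, not the tariff scheme of TXU Energy or any other supplier.

```python
def hourly_load(readings_kwh):
    """Sum 15-minute meter readings (four per hour) into hourly totals."""
    if len(readings_kwh) % 4 != 0:
        raise ValueError("expected four readings per hour")
    return [sum(readings_kwh[i:i + 4]) for i in range(0, len(readings_kwh), 4)]


def time_of_use_prices(hourly, base_price=1.0,
                       peak_factor=1.5, off_peak_factor=0.75):
    """Assign a per-hour price: hours above the average load count as peak.

    The factors are placeholder values chosen for the example.
    """
    average = sum(hourly) / len(hourly)
    return [
        base_price * (peak_factor if load > average else off_peak_factor)
        for load in hourly
    ]


# Hypothetical readings covering three hours, in kWh per 15 minutes.
readings = [1, 1, 1, 1, 2, 2, 2, 2, 5, 5, 5, 5]
loads = hourly_load(readings)
prices = time_of_use_prices(loads)
```

Here the third hour carries most of the load and receives the peak price, which is the signal that would encourage users to shift consumption to the cheaper hours.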

7 Conclusion, open issues, and outlook

In this paper, we review the background and state-of-the-art of big data. Firstly, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture of this exciting area to readers.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still at an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, which are discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

– Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to assess the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated when the system is implemented and deployed, so the advantages and disadvantages of various alternative solutions cannot be compared side by side, even before and after the implementation of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes: This includes memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers the advances of algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

The big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data model, and software systems supporting big data, etc., should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor which restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.


– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor to improve big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems on big data processing arise from traditional data analysis, including (i) data re-utilization: with the increase of data scale, more value may be mined from re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized, from which more value can be mined; and (iii) data exhaust, which means wrong data during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved.

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from comprehensive utilization of multiple datasets is far higher than the sum of the values of individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and from different datasets.

– Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction of big data, etc. are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume is growing fast, there are more severe safety risks, while the traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security related challenges.

– Big data privacy: Big data privacy includes two aspects: (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc. of users may be more easily acquired, and users may not be aware. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed as a big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher of Skull Security, used an information acquisition tool to acquire data from the public pages of Facebook users who had failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low quality data wastes transmission and storage resources and has poor usability. There are a lot of factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though a lot of measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.

– Big data safety mechanism: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data could not meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safety communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data application in information security: Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APT (Advanced Persistent Threat) after analysis of big data in the form of log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc. may also be more easily identified through the analysis of big data.
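Two of the data quality dimensions listed above, completeness and redundancy, lend themselves to simple automatic checks. The sketch below is one possible formulation and not a standard benchmark: the function names, the record layout (a list of dictionaries), and the exact definitions of the two ratios are assumptions made for illustration.

```python
def completeness(records, fields):
    """Fraction of expected (record, field) cells that hold a value.

    A cell counts as missing when the field is absent or set to None.
    """
    total = len(records) * len(fields)
    if total == 0:
        return 1.0
    filled = sum(
        1 for record in records for field in fields
        if record.get(field) is not None
    )
    return filled / total


def redundancy(records, fields):
    """Fraction of records that exactly repeat an earlier record."""
    if not records:
        return 0.0
    seen = set()
    duplicates = 0
    for record in records:
        key = tuple(record.get(field) for field in fields)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates / len(records)


# Hypothetical sensor records with one missing value, one missing field,
# and one exact duplicate.
records = [
    {"id": 1, "temp": 20.5},
    {"id": 2, "temp": None},
    {"id": 1, "temp": 20.5},
    {"id": 3},
]
fields = ["id", "temp"]
quality = (completeness(records, fields), redundancy(records, fields))
```

Accuracy and consistency are harder to score automatically because they need reference data or cross-field rules, so they are left out of this sketch.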

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, energy-efficient distributed storage, and hardware and software system architectures for processing, etc. In particular, big data security, including credibility, backup and recovery, and completeness maintenance, should be further investigated.
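The data quality item in the list of open issues above names completeness, redundancy, and consistency as manifestations of data quality. As a rough sketch of what automatic detection could look like, the following computes three simple indicators over a list of records. The function names, the record layout, and the temperature rule are hypothetical and chosen only for illustration. Accuracy is left out because it requires ground truth.

```python
from collections import Counter


def completeness(records, fields):
    """Fraction of (record, field) cells that are present and not None."""
    total = len(records) * len(fields)
    if total == 0:
        return 1.0
    filled = sum(1 for r in records for f in fields if r.get(f) is not None)
    return filled / total


def redundancy(records, key_fields):
    """Fraction of records that duplicate an earlier record on key_fields."""
    if not records:
        return 0.0
    counts = Counter(tuple(r.get(f) for f in key_fields) for r in records)
    duplicates = sum(c - 1 for c in counts.values())
    return duplicates / len(records)


def inconsistent(records, rule):
    """Records that violate a caller-supplied consistency rule."""
    return [r for r in records if not rule(r)]


# Hypothetical sensor readings, invented for illustration only.
readings = [
    {"sensor": "s1", "ts": 1, "temp_c": 21.5},
    {"sensor": "s1", "ts": 1, "temp_c": 21.5},    # duplicate of the first record
    {"sensor": "s2", "ts": 1, "temp_c": None},    # missing value
    {"sensor": "s3", "ts": 2, "temp_c": -400.0},  # below absolute zero
]


def plausible_temperature(r):
    # Missing values count against completeness, not consistency.
    return r["temp_c"] is None or r["temp_c"] >= -273.15


print(completeness(readings, ["sensor", "ts", "temp_c"]))
print(redundancy(readings, ["sensor", "ts"]))
print(inconsistent(readings, plausible_temperature))
```

Repair is deliberately not attempted here; detecting a violation is much easier than deciding what the correct value should have been.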

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the “T” (Technology) was the main concern, while technology drove the development of data. In the big data era, with the prominence of data value and advances in “I” (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact but also influence everyone's ways of living and thinking, which is already happening. We cannot predict the future, but we may take precautions for possible future events.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced, given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have sought better ways to cope with larger-scale, more diverse, and more complexly structured data. These efforts are represented by Google's globally distributed database Spanner and the fault-tolerant, scalable distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to those of relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and that the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc., shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., in text and image editors. In big data applications, mixed data is very useful for decision making. Only when analytical results are displayed in a user-friendly way can they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc., are frequently used to visualize the results of data analysis. New presentation forms will appear in the future; e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well known that programs consist of data structures and algorithms, and that data structures are used to store data. In the history of program design, the role of data has become increasingly significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed that of business logic. Consequently, programs are gradually being transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods will emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

    – During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

    – Rather than insisting on accurate data, we will accept numerous and complicated data.

    – We shall pay greater attention to correlations between things rather than exploring causal relationships.

    – Simple algorithms on big data are more effective than complex algorithms on small data.

    – Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace “experts”.
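The first point of this list, using all data rather than a small sample, can be illustrated with a deliberately constructed example. The dataset, the event positions, and the systematic 1% sampling scheme below are invented for illustration. They are chosen so that the sample misses every rare event, which a random sample would not always do.

```python
def event_rate(records):
    """Fraction of records flagged (value 1) as the event of interest."""
    if not records:
        return 0.0
    return sum(records) / len(records)


# Synthetic data: 10,000 records, of which 10 are rare events
# (at positions 7, 1007, 2007, ..., 9007).
population = [1 if i % 1000 == 7 else 0 for i in range(10_000)]

# A systematic 1% sample: every 100th record, starting at position 0.
sample = population[::100]

print(event_rate(population))  # rate computed over all data
print(event_rate(sample))      # rate estimated from the sample
```

Here the sample-based estimate reports no events at all, while the full scan finds all ten. The example says nothing about how often this happens in practice; it only shows why an analysis of all data cannot miss what a sample may.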

Throughout the history of human society, the demands and willingness of human beings have always been the driving forces of scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will attract increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

                                                References

1. Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

2. Fact sheet: Big data across the federal government (2012). http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf

3. Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper

4. Drowning in numbers - digital data will flood the planet - and help us understand it better (2011). http://www.economist.com/blogs/dailychart/2011/11/bigdata-0

5. Lohr S (2012) The age of big data. New York Times, pp 11

6. Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/thedigitalbreadcrumbs-that-lead-to-big-data

7. Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-searchforanalysts-to-make-sense-of-big-data

8. Big data (2008). http://www.nature.com/news/specials/bigdata/index.html

9. Special online collection: dealing with big data (2011). http://www.sciencemag.org/site/special/data

10. Manyika J, McKinsey Global Institute, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute

11. Mayer-Schönberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Eamon Dolan/Houghton Mifflin Harcourt

12. Laney D (2001) 3-d data management: controlling data volume, velocity and variety. META Group Research Note, 6 February

13. Zikopoulos P, Eaton C et al (2011) Understanding big data: analytics for enterprise class hadoop and streaming data. McGraw-Hill Osborne Media

14. Meijer E (2011) The world according to linq. Communications of the ACM 54(10):45–51

15. Beyer M (2011) Gartner says solving big data challenge involves more than just managing volumes of data. Gartner. http://www.gartner.com/it/page.jsp

16. O R Team (2011) Big data now: current perspectives from O'Reilly Radar. O'Reilly Media

17. Grobelnik M (2012) Big data tutorial. http://videolectures.net/eswc2012_grobelnik_big_data/

18. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2008) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014

19. DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6):85–98


20. Walter T (2009) Teradata past, present, and future. UCI ISG lecture series on scalable data management

21. Ghemawat S, Gobioff H, Leung S-T (2003) The google file system. In: ACM SIGOPS Operating Systems Review, vol 37. ACM, pp 29–43

22. Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

23. Hey AJG, Tansley S, Tolle KM et al (2009) The fourth paradigm: data-intensive scientific discovery

24. Howard JH, Kazar ML, Menees SG, Nichols DA, Satyanarayanan M, Sidebotham RN, West MJ (1988) Scale and performance in a distributed file system. ACM Trans Comput Syst (TOCS) 6(1):51–81

25. Cattell R (2011) Scalable sql and nosql data stores. ACM SIGMOD Record 39(4):12–27

26. Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033

27. Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98

28. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J et al (2012) Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States

29. Sun Y, Chen M, Liu B, Mao S (2013) Far: a fault-avoidant routing method for data center networks with regular topology. In: Proceedings of ACM/IEEE symposium on architectures for networking and communications systems (ANCS'13). ACM

30. Wiki (2013) Applications and organizations using hadoop. http://wiki.apache.org/hadoop/PoweredBy

31. Bahga A, Madisetti VK (2012) Analyzing massive machine maintenance data in a computing cloud. IEEE Transac Parallel Distrib Syst 23(10):1831–1843

32. Gunarathne T, Wu T-L, Choi JY, Bae S-H, Qiu J (2011) Cloud computing paradigms for pleasingly parallel biomedical applications. Concurr Comput Prac Experience 23(17):2338–2354

33. Gantz J, Reinsel D (2010) The digital universe decade-are you ready. External publication of IDC (Analyse the Future) information and data, pp 1–16

34. Bryant RE (2011) Data-intensive scalable computing for scientific applications. Comput Sci Eng 13(6):25–33

35. Wahab MHA, Mohd MNH, Hanafi HF, Mohsin MFM (2008) Data pre-processing on web server logs for generalized association rules mining algorithm. World Acad Sci Eng Technol 48:2008

36. Nanopoulos A, Manolopoulos Y, Zakrzewicz M, Morzy T (2002) Indexing web access-logs for pattern queries. In: Proceedings of the 4th international workshop on web information and data management. ACM, pp 63–68

37. Joshi KP, Joshi A, Yesha Y (2003) On using a warehouse to analyze web logs. Distrib Parallel Databases 13(2):161–180

38. Chandramohan V, Christensen K (2002) A first look at wired sensor networks for video surveillance systems. In: Proceedings LCN 2002, 27th annual IEEE conference on local computer networks. IEEE, pp 728–729

39. Selavo L, Wood A, Cao Q, Sookoor T, Liu H, Srinivasan A, Wu Y, Kang W, Stankovic J, Young D et al (2007) Luster: wireless sensor network for environmental research. In: Proceedings of the 5th international conference on Embedded networked sensor systems. ACM, pp 103–116

40. Barrenetxea G, Ingelrest F, Schaefer G, Vetterli M, Couach O, Parlange M (2008) Sensorscope: out-of-the-box environmental monitoring. In: Information processing in sensor networks, 2008, international conference on IPSN'08. IEEE, pp 332–343

41. Kim Y, Schmid T, Charbiwala ZM, Friedman J, Srivastava MB (2008) Nawms: nonintrusive autonomous water monitoring system. In: Proceedings of the 6th ACM conference on Embedded network sensor systems. ACM, pp 309–322

42. Kim S, Pakzad S, Culler D, Demmel J, Fenves G, Glaser S, Turon M (2007) Health monitoring of civil infrastructures using wireless sensor networks. In: Information Processing in Sensor Networks 2007, 6th International Symposium on IPSN 2007. IEEE, pp 254–263

43. Ceriotti M, Mottola L, Picco GP, Murphy AL, Guna S, Corra M, Pozzi M, Zonta D, Zanon P (2009) Monitoring heritage buildings with wireless sensor networks: the torre aquila deployment. In: Proceedings of the 2009 International Conference on Information Processing in Sensor Networks. IEEE Computer Society, pp 277–288

44. Tolle G, Polastre J, Szewczyk R, Culler D, Turner N, Tu K, Burgess S, Dawson T, Buonadonna P, Gay D et al (2005) A macroscope in the redwoods. In: Proceedings of the 3rd international conference on embedded networked sensor systems. ACM, pp 51–63

45. Wang F, Liu J (2011) Networked wireless sensor data collection: issues, challenges, and approaches. IEEE Commun Surv Tutor 13(4):673–687

46. Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the 11th international conference on World Wide Web. ACM, pp 124–135

47. Choudhary S, Dincturk ME, Mirtaheri SM, Moosavi A, von Bochmann G, Jourdan G-V, Onut I-V (2012) Crawling rich internet applications: the state of the art. In: CASCON, pp 146–160

48. Ghani N, Dixit S, Wang T-S (2000) On ip-over-wdm integration. IEEE Commun Mag 38(3):72–84

49. Manchester J, Anderson J, Doshi B, Dravida S (1998) Ip over sonet. IEEE Commun Mag 36(5):136–142

50. Jinno M, Takara H, Kozicki B (2009) Dynamic optical mesh networks: drivers, challenges and solutions for the future. In: Optical communication, 2009, 35th European conference on ECOC'09. IEEE, pp 1–4

51. Barroso LA, Hölzle U (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synt Lect Comput Archit 4(1):1–108

52. Armstrong J (2009) Ofdm for optical communications. J Light Technol 27(3):189–204

53. Shieh W (2011) Ofdm for flexible high-speed optical networks. J Light Technol 29(10):1560–1577

54. Cisco data center interconnect design and deployment guide (2010)

55. Greenberg A, Hamilton JR, Jain N, Kandula S, Kim C, Lahiri P, Maltz DA, Patel P, Sengupta S (2009) Vl2: a scalable and flexible data center network. In: ACM SIGCOMM computer communication review, vol 39. ACM, pp 51–62

56. Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S (2009) Bcube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 39(4):63–74

57. Farrington N, Porter G, Radhakrishnan S, Bazzaz HH, Subramanya V, Fainman Y, Papen G, Vahdat A (2011) Helios: a hybrid electrical/optical switch architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 41(4):339–350

58. Abu-Libdeh H, Costa P, Rowstron A, O'Shea G, Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Comput Commun Rev 40(4):51–62

59. Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what's needed for datacenter network operations. IEEE Commun Mag 48(7):32–39


60. Wang G, Andersen DG, Kaminsky M, Papagiannaki K, Ng TS, Kozuch M, Ryan M (2010) c-through: part-time optics in data centers. In: ACM SIGCOMM Computer Communication Review, vol 40. ACM, pp 327–338

61. Ye X, Yin Y, Yoo SJB, Mejia P, Proietti R, Akella V (2010) Dos: a scalable optical switch for datacenters. In: Proceedings of the 6th ACM/IEEE symposium on architectures for networking and communications systems. ACM, p 24

62. Singla A, Singh A, Ramachandran K, Xu L, Zhang Y (2010) Proteus: a topology malleable data center network. In: Proceedings of the 9th ACM SIGCOMM workshop on hot topics in networks. ACM, p 8

63. Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P (2011) Energy-efficient design of a scalable optical multiplane interconnection architecture. IEEE J Sel Top Quantum Electron 17(2):377–383

64. Kodi AK, Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance computing (hpc) systems. IEEE J Sel Top Quantum Electron 17(2):384–395

65. Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H (2012) Mirror mirror on the ceiling: flexible wireless links for data centers. ACM SIGCOMM Comput Commun Rev 42(4):443–454

66. Lenzerini M (2002) Data integration: a theoretical perspective. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. ACM, pp 233–246

67. Cafarella MJ, Halevy A, Khoussainova N (2009) Data integration for the relational web. Proc VLDB Endowment 2(1):1090–1101

68. Maletic JI, Marcus A (2000) Data cleansing: beyond integrity analysis. In: IQ. Citeseer, pp 200–209

69. Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1-2):83–113

70. Chen H, Ku W-S, Wang H, Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 51–62

71. Zhao Z, Ng W (2012) A model-based approach for rfid data stream cleansing. In: Proceedings of the 21st ACM international conference on information and knowledge management. ACM, pp 862–871

72. Khoussainova N, Balazinska M, Suciu D (2008) Probabilistic event extraction from rfid data. In: Data Engineering, 2008, IEEE 24th international conference on ICDE 2008. IEEE, pp 1480–1482

73. Herbert KG, Wang JTL (2007) Biological data cleaning: a case study. Int J Inf Qual 1(1):60–82

74. Tsai T-H, Lin C-Y (2012) Exploring contextual redundancy in improving object-based video coding for video sensor networks surveillance. IEEE Transac Multmed 14(3):669–682

75. Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 269–278

76. Kamath U, Compton J, Dogan RI, Jong KD, Shehu A (2012) An evolutionary algorithm approach for feature generation from sequence data and its application to dna splice site prediction. IEEE/ACM Transac Comput Biol Bioinforma (TCBB) 9(5):1387–1398

77. Leung K-S, Lee KH, Wang J-F, Ng EYT, Chan HLY, Tsui SKW, Mok TSK, Tse PC-H, Sung JJ-Y (2011) Data mining on dna sequences of hepatitis b virus. IEEE/ACM Transac Comput Biol Bioinforma 8(2):428–440

78. Huang Z, Shen H, Liu J, Zhou X (2011) Effective data co-reduction for multimedia similarity search. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, pp 1021–1032

79. Bleiholder J, Naumann F (2008) Data fusion. ACM Comput Surv (CSUR) 41(1):1

80. Brewer EA (2000) Towards robust distributed systems. In: PODC, p 7

81. Gilbert S, Lynch N (2002) Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59

82. McKusick MK, Quinlan S (2009) Gfs: evolution on fast-forward. ACM Queue 7(7):10

83. Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J (2008) Scope: easy and efficient parallel processing of massive data sets. Proc VLDB Endowment 1(2):1265–1276

84. Beaver D, Kumar S, Li HC, Sobel J, Vajgel P et al (2010) Finding a needle in haystack: facebook's photo storage. In: OSDI, vol 10, pp 1–8

85. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: amazon's highly available key-value store. In: SOSP, vol 7, pp 205–220

86. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In: Proceedings of the twenty-ninth annual ACM symposium on theory of computing. ACM, pp 654–663

87. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2):4

88. Burrows M (2006) The chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th symposium on Operating systems design and implementation. USENIX Association, pp 335–350

89. Lakshman A, Malik P (2009) Cassandra: structured storage system on a p2p network. In: Proceedings of the 28th ACM symposium on principles of distributed computing. ACM, pp 5–5

90. George L (2011) HBase: the definitive guide. O'Reilly Media Inc

91. Judd D (2008) hypertable-0.9.0.4-alpha

92. Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc

93. Crockford D (2006) The application/json media type for javascript object notation (json)

94. Murty J (2009) Programming amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc

95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc

96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 975–986

97. Yang H-C, Parker DS (2009) Traverse: simplified indexing on large map-reduce-merge clusters. In: Database systems for advanced applications. Springer, pp 308–322

98. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with sawzall. Sci Program 13(4):277–298

99. Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U (2009) Building a high-level dataflow system on top of map-reduce: the pig experience. Proceedings VLDB Endowment 2(2):1414–1425


100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629

101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72

102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) Dryadlinq: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14

103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: Parallel and distributed processing, 2008, IEEE international symposium on IPDPS 2008. IEEE, pp 1–11

104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146

105. Bu Y, Bill H, Balazinska M, Ernst MD (2010) Haloop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296

106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative mapreduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing. ACM, pp 810–818

107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2

108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: mapreduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing. ACM, p 7

109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) Ciel: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on Networked systems design and implementation, p 9

110. Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York

111. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

112. What analytics, data mining, big data software you used in the past 12 months for a real project (2012). http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html

113. Berthold MR, Cebron N, Dill F, Gabriel TR, Kötter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer

114. Sallam RL, Richardson J, Hagerty J, Hostmann B (2011) Magic quadrant for business intelligence platforms. CT, Gartner Group, Stamford

                                                115 Beyond the PC Special Report on Personal Technology (2011)116 Goff SA Vaughn M McKay S Lyons E Stapleton AE Gessler

                                                D Matasci N Wang L Hanlon M Lenards A et al (2011) Theiplant collaborative cyberinfrastructure for plant biology FrontPlant Sci 34(2)1ndash16 doi103389fpls201100034

117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on Software quality assurance. ACM, pp 70–77

118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers. ACM, pp 277–286

119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Record 34(2):18–26

120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM Sigmod Record 33(1):50–57

121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Transac Manag Inform Syst (TMIS) 3(2):7

122. Manning CD, Schütze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press

123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Transac Neural Netw 13(5):1163–1177

124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11

125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117

126. Konopnicki D, Shmueli O (1995) W3QS: a query system for the world-wide web. In: VLDB, vol 95, pp 54–65

127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640

128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval. ACM, p 2

129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25

130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19

131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819

132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939

133. Barragáns-Martínez AB, Costa-Montenegro E, Burguillo JC, Rey-López M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend TV programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311

134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91

135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia. ACM, pp 469–478

136. Hirsch JE (2005) An index to quantify an individual's scientific research output. Proc Natl Acad Sci USA 102(46):16569

137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company

138. Aggarwal CC (2011) An introduction to social network data analytics. Springer

139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1046–1054

140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied Computing. ACM, pp 1115–1116


141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Transac Knowl Discov Data (TKDD) 5(2):10

142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on World wide web. ACM, pp 631–640

143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis. ACM, pp 16–25

144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. ACM, pp 315–321

145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 145–158

146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using Google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 131–144

147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1007–1016

148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 807–816

149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on Web search and data mining. ACM, pp 657–666

150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360

151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (Accessed 5 May 2013)

152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51

153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive RFID, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 2

154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval. ACM, pp 469–478

155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated, multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium. ACM, pp 445–454

156. Tasevski P (2011) Password attacks and generation strategies. Tartu University: Faculty of Mathematics and Computer Sciences



6.2.1 Structured data analysis

Business applications and scientific research may generate massive structured data, whose management and analysis rely on mature commercial technologies such as RDBMS, data warehouses, OLAP, and BPM (Business Process Management) [28]. Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

However, data analysis is still a very active research field, and new application demands drive the development of new methods. For example, statistical machine learning, based on exact mathematical models and powerful algorithms, has been applied to anomaly detection [117] and energy control [118]. By exploiting data characteristics, temporal and spatial mining can extract knowledge structures hidden in high-speed data streams and sensor data [119]. Driven by privacy concerns in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field [120]. Over the past decade, process mining has become a new research field, especially for process analysis with event data [121].

6.2.2 Text data analysis

The most common format of information storage is text, e.g., emails, business documents, web pages, and social media. Therefore, text analysis is deemed to have more business potential than structured data analysis. Generally, text analysis is the process of extracting useful information and knowledge from unstructured text. Text mining is interdisciplinary, involving information retrieval, machine learning, statistics, computational linguistics, and, in particular, data mining. Most text mining systems are based on text representation and natural language processing (NLP), with more emphasis on the latter. NLP allows computers to analyze, interpret, and even generate text. Common NLP methods include lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammars [122]. Several NLP-based techniques have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining.

6.2.3 Web data analysis

Web data analysis has emerged as an active research field. It aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including databases, information retrieval, NLP, and text mining. According to the part of the Web being mined, we classify Web data analysis into three related fields: Web content mining, Web structure mining, and Web usage mining [123].

Web content mining is the process of discovering useful knowledge in Web pages, which generally involve several types of data, such as text, image, audio, video, code, metadata, and hyperlinks. The research on image, audio, and video mining has recently been called multimedia analysis, which is discussed in Section 6.2.4. Since most Web content data is unstructured text, the research on Web data analysis mainly centers on text and hypertext. Text mining is discussed in Section 6.2.2, while hypertext mining involves mining semi-structured HTML files that contain hyperlinks. Supervised learning and classification play important roles in hyperlink mining, e.g., email and newsgroup management and Web catalogue maintenance [124]. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters information for users according to inferred or configured profiles. The database method aims to model and integrate data on the Web so as to conduct more complex queries than keyword-based searches.

Web structure mining involves models for discovering Web link structures. Here, structure refers to the schema of links within a website or among multiple websites. Models are built on the topological structure formed by hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank [125] and CLEVER [126] make full use of such models to look up relevant website pages. The topic-oriented crawler is another successful case that utilizes these models [127].
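To make the idea of ranking pages from link topology concrete, the following is a minimal sketch of a PageRank-style power iteration in Python. It is an illustration under stated assumptions, not the implementation described in [125]: the damping factor of 0.85, the fixed iteration count, the even redistribution of rank from pages without out-links, and the toy graph `toy_web` are all choices made here for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}.

    The damping factor and iteration count are illustrative defaults.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for source, targets in links.items():
            if not targets:
                continue
            # Each page passes an equal share of its rank along every out-link.
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" receives links from "a", "b", and "d".
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
ordering = sorted(ranks, key=ranks.get, reverse=True)
print(ordering)  # ['c', 'a', 'b', 'd']
```

In this toy graph the page with the most incoming links ranks first, and the page that nothing links to ranks last, which is the kind of topology-derived relevance signal the paragraph above refers to.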

Web usage mining aims to mine secondary data generated by Web sessions or activities, whereas Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers and proxy servers, browsers' history records, user profiles, registration data, user sessions or transactions, caches, user queries, bookmark data, mouse clicks and scrolls, and any other kinds of data generated through interaction with the Web. As Web services and Web 2.0 become mature and popular, Web usage data will have increasingly high variety. Web usage mining plays key roles in personalization, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by utilizing the different preferences of users.


6.2.4 Multimedia data analysis

Multimedia data (mainly images, audio, and video) have been growing at an amazing speed, and analysis is needed to extract useful knowledge from them and to understand their semantics. Because multimedia data is heterogeneous and mostly contains richer information than simple structured data or text data, information extraction is confronted with the huge challenge of the semantic gap. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection.

Audio summarization can be accomplished by extracting the prominent words or phrases from metadata or by synthesizing a new representation. Video summarization interprets the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods use a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied to many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video and apply smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.

Multimedia annotation inserts labels to describe the contents of images and videos at both syntactic and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time and labor intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic gap. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to jointly explore manual and automatic multimedia annotation [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users to conveniently and quickly look up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, and includes shot boundary detection, key frame extraction, and scene segmentation. Based on the result of structural analysis, the second procedure is feature extraction, which mainly includes further mining the features of key frames, objects, texts, and movements, which are the foundation of video indexing and retrieval. Data mining, classification, and annotation use the extracted features to find patterns in video contents and put videos into predefined categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos. Relevance feedback is then used to optimize the retrieval results.
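The query-and-retrieval step described above can be sketched as a nearest-neighbor lookup over extracted feature vectors. The Python example below is only a minimal illustration: cosine similarity is one possible similarity measure (the text does not name a specific one), and the index `video_index`, its three-dimensional feature vectors, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(index, query_features, top_k=2):
    """Return the ids of the top_k indexed videos most similar to the query."""
    scored = [(cosine_similarity(features, query_features), video_id)
              for video_id, features in index.items()]
    # Highest similarity first; ties are broken by video id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [video_id for _, video_id in scored[:top_k]]


# Hypothetical index: each video is represented by features of its key frames.
video_index = {
    "news_clip": [0.9, 0.1, 0.0],
    "soccer_match": [0.1, 0.8, 0.3],
    "tennis_final": [0.0, 0.7, 0.6],
}
results = retrieve(video_index, [0.0, 1.0, 0.4])
print(results)  # ['soccer_match', 'tennis_final']
```

A real system would replace the hand-written vectors with features produced by the extraction procedure and would typically use an approximate index rather than a linear scan.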

Multimedia recommendation recommends specific multimedia contents according to users' preferences. It has proven to be an effective approach to providing personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. Content-based methods identify general features of users or of the contents they are interested in, and recommend other contents with similar features. These methods largely rely on content similarity measurement, but most of them suffer from limited content analysis and overspecialization. Collaborative-filtering-based methods identify groups with similar interests and recommend contents to group members according to their behavior [132]. More recently, hybrid methods have been introduced that integrate the advantages of the two types of methods to improve recommendation quality [133].
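As an illustration of the collaborative-filtering idea, the following Python sketch predicts a user's interest in unseen items from the ratings of similar users. It is a simplified user-based variant assumed here for the example, not the method of [132] or [133]; the `ratings` data, the use of cosine similarity over co-rated items, and the function names are hypothetical.

```python
import math


def cosine(ratings_a, ratings_b):
    """Cosine similarity between two users' rating dicts {item: rating}."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    dot = sum(ratings_a[item] * ratings_b[item] for item in common)
    norm_a = math.sqrt(sum(r * r for r in ratings_a.values()))
    norm_b = math.sqrt(sum(r * r for r in ratings_b.values()))
    return dot / (norm_a * norm_b)


def recommend(ratings, user, top_k=1):
    """Rank items the user has not rated by similarity-weighted average rating."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        similarity = cosine(ratings[user], other_ratings)
        if similarity <= 0.0:
            continue
        for item, rating in other_ratings.items():
            if item in ratings[user]:
                continue  # only recommend items the user has not seen
            scores[item] = scores.get(item, 0.0) + similarity * rating
            weights[item] = weights.get(item, 0.0) + similarity
    predicted = {item: scores[item] / weights[item] for item in scores}
    ranked = sorted(predicted, key=lambda item: (-predicted[item], item))
    return ranked[:top_k]


# Hypothetical ratings on a 1-5 scale; alice's tastes resemble bob's.
ratings = {
    "alice": {"drama": 5, "news": 1},
    "bob": {"drama": 5, "news": 1, "thriller": 5, "cartoon": 1},
    "carol": {"drama": 1, "news": 5, "thriller": 1, "cartoon": 5},
}
suggestion = recommend(ratings, "alice")
print(suggestion)  # ['thriller']
```

Because bob's ratings are closer to alice's than carol's are, bob's preference for the thriller dominates the weighted average, which is the group-behavior effect described in the paragraph above.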

The US National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains text descriptions of concepts and video examples [134]. In [135], the authors proposed a new algorithm for ad hoc multimedia event detection using a few positive training examples. The research on video event detection is still in its infancy and mainly focuses on sports or news events, running or abnormal events in surveillance videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services (SNS), including Twitter, Facebook, and LinkedIn, have become increasingly popular over the years. Such services generally include massive linked data and content data. The linked data is mainly in the form of graph structures describing the communications between two entities. The content data contains text, image, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. From a data-centered perspective, the existing research in the context of social networking services can be classified into two categories: link-based structural analysis and content-based analysis [138].

The research on link-based structural analysis has focused on link prediction, community discovery, social network evolution, and social influence analysis. SNS may be visualized as graphs in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertexes and edges are continually added to the graphs. Link prediction aims to predict the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and uses the existing link information to train binary classifiers that predict future links [139]. Probabilistic methods aim to build models for connection probabilities among vertexes in SNS [140]. Linear algebra methods compute the similarity between two vertexes from a rank-reduced similarity matrix [141]. A community is represented by a sub-graph in which the edges connecting vertexes within the sub-graph feature high density, while the edges between two sub-graphs feature much lower density [142].
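As a small illustration of link prediction, the sketch below scores every pair of users that is not yet connected by the Jaccard coefficient of their neighbor sets. This is one simple topological feature of the kind a feature-based classifier might consume; it is an assumption chosen for the example rather than the method of [139], [140], or [141], and the `friendships` graph and function names are hypothetical.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map {vertex: set of neighbors}."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def jaccard_scores(edges):
    """Score each unlinked vertex pair by |common neighbors| / |all neighbors|."""
    adjacency = build_adjacency(edges)
    scores = {}
    for u, v in combinations(sorted(adjacency), 2):
        if v in adjacency[u]:
            continue  # the pair is already linked, nothing to predict
        union = adjacency[u] | adjacency[v]
        common = adjacency[u] & adjacency[v]
        scores[(u, v)] = len(common) / len(union) if union else 0.0
    return scores


# Hypothetical friendship graph.
friendships = [
    ("ann", "bob"), ("ann", "cat"), ("bob", "cat"),
    ("bob", "dan"), ("cat", "dan"), ("dan", "eve"),
]
scores = jaccard_scores(friendships)
best_pair = max(scores, key=scores.get)
print(best_pair)  # ('ann', 'dan')
```

Here "ann" and "dan" share two of their three distinct neighbors, so that pair receives the highest score and would be predicted as the most likely future link.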

Many methods for community detection have been proposed and studied, most of which are topology-based and rely on an objective function that captures the notion of community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. Research on SNS evolution aims to find laws and derive models that explain network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generative models have been proposed to assist network and system design [147].

Social influence refers to the case in which individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, time effects, and the characteristics of networks and individuals. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, location information, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise. For example, the blogosphere contains a large number of spam blogs, and Twitter likewise contains many trivial Tweets. Third, SNS are dynamic networks, which change and are updated frequently and quickly. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge among different media [150].

6.2.6 Mobile data analysis

By April 2013, more than 650,000 Android applications were available, covering nearly all categories. By the end of 2012, the monthly mobile data traffic had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, mobility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since the research on mobile analysis has only just started, we introduce only some recent and representative analysis applications in this section.

With the growing number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities based on geographical location and communities based on different cultural backgrounds and interests (e.g., the latest Webchat). Traditional network communities or SNS communities lack online interaction among members, and the communities are active only when members are sitting in front of computers. In contrast, mobile phones can support rich interaction at any time and anywhere. A mobile community is defined as a group of individuals with the same interests (e.g., health, safety, and entertainment) who gather on networks, agree on a common goal, decide through consultation on measures to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build body area networks for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations, etc.

Mobile Netw Appl (2014) 19:171–209

In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop an application for smart phones which analyzes people's paces when they walk and uses the pace information to unlock the safety system [11]. Meanwhile, Robert Delano and Brian Parise from Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with a built-in seismograph in a mobile phone, so as to cope with Parkinson's and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data application. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In particular, in marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business modes. In sales planning, after comparison of massive data, enterprises can optimize their commodity prices. In operations, enterprises can improve their operation efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor cost. In the supply chain, using big data, enterprises may conduct inventory optimization, logistic optimization, supplier coordination, etc., to mitigate the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that such activities as "Multi-times score accumulation" and "score exchange in shops" are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20% of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15% and 7%, respectively. By analyzing customers' transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.
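The retention step described above, targeting the fraction of customers most likely to drop out, can be sketched as a simple ranking. This is only an illustration: the customer IDs, the scores, and the function name `top_fraction_at_risk` are hypothetical, and the source does not describe how CMB's warning model actually computes its scores.

```python
import math


def top_fraction_at_risk(churn_scores, fraction=0.2):
    """Return the customer IDs with the highest drop-out scores.

    churn_scores: dict mapping customer ID -> estimated drop-out probability
    (assumed to be the output of some warning model, not specified here).
    fraction: share of customers to target, e.g. 0.2 for the top 20%.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    k = math.ceil(len(churn_scores) * fraction)
    # Sort customer IDs by descending score and keep the first k.
    ranked = sorted(churn_scores, key=churn_scores.get, reverse=True)
    return ranked[:k]


# Hypothetical scores for five customers.
scores = {"c1": 0.05, "c2": 0.81, "c3": 0.40, "c4": 0.12, "c5": 0.66}
at_risk_half = top_fraction_at_risk(scores, fraction=0.5)
```

Rounding the cut-off up with `math.ceil` is a design choice that guarantees at least one customer is selected from any non-empty portfolio.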

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, along with, more importantly, the age, gender, address, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, the market conditions of their brands, consumers' behaviors, etc., and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more preferable prices. The credit loan of Alibaba automatically analyzes and judges whether to lend to enterprises through the acquired enterprise transaction data by virtue of big data technology, while manual intervention does not occur in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan with only about 0.3% bad loans, which is much lower than those of other commercial banks.

6.3.2 Application of IoT based big data

IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistic enterprises may have profound experience with the application of IoT big data. For example, trucks of UPS are equipped with sensors, wireless adapters, and GPS, so the headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project cooperation between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and Miami city, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills this year by identifying and fixing running and leaking water pipes in a timely manner.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and connections among individuals based on an information network. Big data of online SNS mainly comes from instant messages, online social micro blogs, shared space, etc., which represent various user activities. The analysis of big data from online SNS uses computational analytical methods for understanding relations in human society by virtue of theories and methods that involve mathematics, informatics, sociology, management science, etc., from three dimensions: network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, online education, etc. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire value.

– Content-based Applications. Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preference, emotion, interest, demand, etc. may be revealed.

– Structure-based Applications. In SNS, users are represented as nodes, while social relations, interests, hobbies, etc. aggregate relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.
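The notion of a community given above, close relations inside and loose relations outside, can be made concrete by counting edges. The sketch below is a minimal illustration rather than a community detection algorithm: it only checks a candidate group that is already given. The graph, the node names, and the function name `community_edge_counts` are hypothetical.

```python
def community_edge_counts(edges, members):
    """Count edges inside a candidate community and edges crossing its boundary.

    edges: iterable of (u, v) pairs describing an undirected user graph.
    members: the users proposed as one community.
    Returns (internal, external).
    """
    members = set(members)
    internal = 0
    external = 0
    for u, v in edges:
        endpoints_inside = (u in members) + (v in members)
        if endpoints_inside == 2:
            internal += 1   # both users belong to the group
        elif endpoints_inside == 1:
            external += 1   # the edge leaves the group
    return internal, external


# Two triangles joined by a single edge (c, d).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("c", "d"),
         ("d", "e"), ("e", "f"), ("d", "f")]
internal, external = community_edge_counts(edges, {"a", "b", "c"})
```

A group with many internal edges and few external ones, as in this toy graph, matches the informal definition in the text.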

The U.S. Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

Fig. 5 Enabling technologies for online social network-oriented big data

In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, it was found that most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and then get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the U.S. In addition, Global Pulse conducted research that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans. The goal is to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting the sharp growth or drop of the amount of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation from the official statistics of Indonesia matches the number of Tweets about rice price on Twitter, as shown in Fig. 6.
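Two of the analyses listed above, detecting a sharp growth or drop in topic volume and comparing Tweet counts with an external indicator, can be sketched as follows. This is a minimal illustration under our own assumptions: the trailing-window z-score rule, the window and threshold values, and the sample counts are hypothetical and are not taken from the Global Pulse project.

```python
from statistics import mean, pstdev


def flag_abrupt_changes(counts, window=4, threshold=3.0):
    """Return indices where a count deviates sharply from its recent history.

    A point is flagged when it lies more than `threshold` standard deviations
    from the mean of the preceding `window` observations.
    """
    flagged = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            # Flat history: any change at all counts as abrupt.
            if counts[i] != mu:
                flagged.append(i)
        elif abs(counts[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


def pearson(xs, ys):
    """Pearson correlation of two equal-length series with non-zero variance."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have equal length of at least 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical weekly counts of Tweets on one topic, with one burst.
weekly_tweets = [100, 104, 98, 102, 101, 250, 103]
bursts = flag_abrupt_changes(weekly_tweets)
```

In practice the `pearson` helper would be applied to a Tweet-count series and an official price series sampled over the same periods; a high value indicates co-movement only and does not by itself establish that one series predicts the other.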

Generally speaking, the application of big data from online SNS may help to better understand users' behavior and master the laws of social and economic activities from the following three aspects:

– Early Warning: to rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotion, and preference of users.

– Real-time Feedback: to acquire groups' feedback on some social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the health care business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final result into an extremely personalized treatment plan to assess the risk factors and main treatment plans of patients. Then, doctors may reduce morbidity by 50% in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or by suggesting that patients reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20.

The Mount Sinai Medical Center in the U.S. utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with a third-party application with the software development kit (SDK) and open interface.

Fig. 6 The correlation between Tweets about rice price and food price inflation

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units to coordinate with mobile networks for the distribution of sensing tasks and the collection and utilization of sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of Crowdsourcing has been successfully applied to geotagged photographs, positioning and navigation, urban road traffic sensing, market forecast, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach for problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, Crowdsourcing was applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of Crowdsourcing. The main idea of Crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, Crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, Spatial Crowdsourcing has become a hot topic. The operation framework of Spatial Crowdsourcing is as follows. A user may request the service and resources related to a specified location. Then the mobile users who are willing to participate in the task will move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data will be sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions provided by mobile devices, it can be forecasted that Spatial Crowdsourcing will be more prevalent than traditional Crowdsourcing, e.g., Amazon Turk and Crowdflower.
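The operation framework above leaves open how a task is matched to a participant. One simple possibility, sketched here purely as an assumption, is to pick the nearest willing worker by great-circle distance. The worker records, coordinates, and function names are hypothetical; real platforms may use entirely different assignment policies.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def assign_nearest_worker(task_location, workers):
    """Return the ID of the closest willing worker, or None if nobody is willing.

    task_location: (lat, lon) of the requested location.
    workers: list of (worker_id, (lat, lon), willing) tuples.
    """
    candidates = [
        (haversine_km(*task_location, *location), worker_id)
        for worker_id, location, willing in workers
        if willing
    ]
    if not candidates:
        return None
    return min(candidates)[1]


# Hypothetical request and worker positions.
task = (30.0, 114.0)
workers = [
    ("w1", (30.0, 114.1), True),
    ("w2", (30.0, 114.01), False),  # closest, but not willing to participate
    ("w3", (31.0, 114.0), True),
]
chosen = assign_nearest_worker(task, workers)
```

Filtering on willingness before ranking by distance mirrors the text: only users who agree to participate move to the specified location.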

6.3.6 Smart grid

Smart Grid is the next generation power grid, constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart Grid related big data are generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nationwide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as Circuit Breaker Monitors and transformers). Smart Grid brings about the following challenges on exploiting big data:

– Grid planning. By analyzing data in the Smart Grid, regions that have excessively high electrical load or high power outage frequencies can be identified. Even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, maintenance, etc. For example, researchers from the University of California, Los Angeles designed an "electric map" according to big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of the block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides effective and visual load forecast for power grid planning in a city. Preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads may be conducted, as displayed in the map.

– Interaction between power generation and power consumption. An ideal power grid shall balance power generation and consumption. However, the traditional power grid is constructed based on a one-directional approach of transmission-transformation-distribution-consumption, which does not allow the generation capacity to be adjusted according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters are developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which can help the supplier read power utilization data every 15 min rather than every month as in the past. Labor cost for meter reading is greatly reduced. Because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and low periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help the realization of time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users.

– The access of intermittent renewable energy. At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable new energy sources can be efficiently managed, and the electricity generated by the new energy resources can be allocated to regions with electricity shortages. Such energy resources can complement traditional hydropower and thermal power generation.
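The time-sharing pricing mentioned in the second item above can be illustrated with 15-minute smart meter readings. The sketch below is an assumption-laden example: the peak window (17:00 to 21:00), the two prices, and the function name `time_of_use_bill` are hypothetical, and the source does not describe the tariff TXU Energy actually applied.

```python
def time_of_use_bill(readings_kwh, peak_price, off_peak_price,
                     peak_hours=range(17, 21)):
    """Compute one day's bill from 96 fifteen-minute energy readings.

    readings_kwh: energy used in each 15-minute interval, starting at 00:00.
    peak_hours: hours of the day (0-23) charged at the peak price.
    """
    if len(readings_kwh) != 96:
        raise ValueError("expected 96 readings (24 h x 4 intervals per hour)")
    total = 0.0
    for index, kwh in enumerate(readings_kwh):
        hour = index // 4  # four intervals per hour
        price = peak_price if hour in peak_hours else off_peak_price
        total += kwh * price
    return total


# A flat hypothetical load of 0.25 kWh per interval (1 kW constant demand).
flat_day = [0.25] * 96
bill = time_of_use_bill(flat_day, peak_price=0.30, off_peak_price=0.10)
```

With monthly readings only a single flat rate is possible; the finer 15-minute resolution is what makes a price that varies between peak and low periods computable at all.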

7 Conclusion, open issues, and outlook

In this paper, we review the background and state-of-the-art of big data. Firstly, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture to readers of this exciting area.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, which are discussed below.

– Fundamental problems of big data. There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

– Standardization of big data. An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still not a unified evaluation standard and benchmark to balance the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated when the system is implemented and deployed, which makes it impossible to horizontally compare the advantages and disadvantages of various alternative solutions, even before and after the implementation of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes. This includes memory mode, data flow mode, PRAM mode, MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, etc., should be fully investigated.

– Format conversion of big data. Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor which restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.


– Big data transfer. Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor to improve big data computing.

– Real-time performance of big data. The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data. As big data research advances, new problems in big data processing arise from traditional data analysis, including (i) data re-utilization: with the increase of data scale, more value may be mined from re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized, from which more value can be mined; (iii) data exhaust, which means wrong data during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems should be solved:

– Big data management. The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

– Searching, mining, and analysis of big data. Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, machine learning, etc.

– Integration and provenance of big data. As discussed, the value acquired from comprehensive utilization of multiple datasets is far higher than the sum value of individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance is the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and from different datasets.

– Big data application. At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, human-computer interaction of big data, etc. are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume is fast growing, there are more severe safety risks, while the traditional data protection methods have already been shown not applicable to big data. In particular, big data safety is confronted with the following security related challenges:

– Big data privacy. Big data privacy includes two aspects: (i) Protection of personal privacy during data acquisition: personal interests, habits, body properties, etc. of users may be more easily acquired, and users may not be aware. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed a big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher of Skull Security, used an information acquisition tool to acquire data in the public pages of Facebook users who had failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality. Data quality influences big data utilization. Low quality data wastes transmission and storage resources with poor usability. There are a lot of factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though a lot of measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.

                                                  ndash Big data safety mechanism Big data brings about chal-lenges to data encryption due to its large scale andhigh diversity The performance of previous encryp-tion methods on small and medium-scale data couldnot meet the demands of big data so efficient big datacryptography approaches shall be developed Effec-tive schemes for safety management access controland safety communications shall be investigated forstructured semi-structured and unstructured data Inaddition under the multi-tenant mode isolation con-fidentiality completeness availability controllabilityand traceability of tenantsrsquo data should be enabled onthe premise of efficiency assurance

– Big data application in information security. Big data not only brings challenges to information security but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) by analyzing big data in the form of log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, attack characteristics, etc. may also be more easily identified through the analysis of big data.
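The log-analysis idea in the last item can be illustrated with a minimal sketch. The log format, the event name `LOGIN_FAIL`, the threshold, and the function name `suspicious_sources` are all assumptions made for illustration; they do not come from any real Intrusion Detection System, and a production analysis would run on a distributed platform rather than in a single process.

```python
from collections import Counter

# Hypothetical log format (an assumption, not from any real IDS):
# "<timestamp> <source_ip> <event_type>"
SAMPLE_LOG = """\
2014-01-01T00:00:01 10.0.0.5 LOGIN_FAIL
2014-01-01T00:00:02 10.0.0.5 LOGIN_FAIL
2014-01-01T00:00:03 10.0.0.9 LOGIN_OK
2014-01-01T00:00:04 10.0.0.5 LOGIN_FAIL
2014-01-01T00:00:05 10.0.0.7 LOGIN_FAIL
"""


def suspicious_sources(log_text, event="LOGIN_FAIL", threshold=3):
    """Return the source IPs whose count of `event` records reaches `threshold`."""
    counts = Counter()
    for line in log_text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip malformed records
        _, source_ip, event_type = parts
        if event_type == event:
            counts[source_ip] += 1
    return sorted(ip for ip, n in counts.items() if n >= threshold)


print(suspicious_sources(SAMPLE_LOG))  # ['10.0.0.5']
```

Counting per source is only the simplest possible aggregation; identifying APTs would require correlating many such signals over long time windows.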

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, energy-efficiency-optimized distributed storage, and hardware and software system architectures for processing, etc. In particular, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.
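Of the open issues above, data quality lends itself to a small worked example. The sketch below computes two of the four dimensions named in the text, completeness and redundancy, for a list of records; accuracy and consistency need domain knowledge and are not covered. The record layout, the field names, and the functions `quality_report` and `clean` are hypothetical, and dropping bad records is only a crude stand-in for the automatic repair the text calls for.

```python
def quality_report(records, required_fields):
    """Compute simple completeness and redundancy ratios for a list of dicts."""
    total = len(records)
    if total == 0:
        return {"completeness": 1.0, "redundancy": 0.0}
    # A record is complete when every required field has a non-empty value.
    complete = sum(
        all(r.get(f) not in (None, "") for f in required_fields) for r in records
    )
    # Records count as duplicates when all required fields are identical.
    distinct = {tuple(r.get(f) for f in required_fields) for r in records}
    return {
        "completeness": complete / total,
        "redundancy": 1 - len(distinct) / total,
    }


def clean(records, required_fields):
    """Drop exact duplicates and incomplete records, preserving order."""
    seen, cleaned = set(), []
    for r in records:
        key = tuple(r.get(f) for f in required_fields)
        if key in seen or any(v in (None, "") for v in key):
            continue
        seen.add(key)
        cleaned.append(r)
    return cleaned


readings = [
    {"id": 1, "temp": 20.5},
    {"id": 1, "temp": 20.5},   # duplicate
    {"id": 2, "temp": None},   # incomplete
    {"id": 3, "temp": 19.0},
]
print(quality_report(readings, ["id", "temp"]))
# {'completeness': 0.75, 'redundancy': 0.25}
print(clean(readings, ["id", "temp"]))
# [{'id': 1, 'temp': 20.5}, {'id': 3, 'temp': 19.0}]
```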

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the “T” (Technology) was the main concern, and technology drove the development of data. In the big data era, with the prominence of data value and advances in the “I” (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact but also influence everyone's ways of living and thinking, which is already happening. We cannot predict the future, but we can take precautions against possible future events.

– Data with a larger scale, higher diversity, and more complex structures. Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have sought better ways to cope with larger-scale, more diverse, and more complexly structured data. These efforts are represented by Google's globally distributed database Spanner and the fault-tolerant, expandable, distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to those of relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance. Since big data contains huge value, mastering big data means mastering resources. Analysis of the value chain of big data shows that its value comes from the data itself, technologies, and ideas, and that the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting out and assigning the rights to use their data.

– Big data promotes the cross fusion of science. Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data should explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, information security, etc. The impacts of big data on production management, business operation, decision making, etc. should then be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization. In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., in text and image editors. In big data applications, mixed data is very useful for decision making. Analytical results can be effectively utilized by users only when they are displayed in a user-friendly way. Reports, histograms, pie charts, regression curves, etc. are frequently used to visualize the results of data analysis. New presentation forms will emerge in the future; e.g., Microsoft Renlifang, a social search engine, uses relational diagrams to express interpersonal relationships.

– Data-oriented. It is well known that programs consist of data structures and algorithms, and that data structures are used to store data. In the history of program design, the role of data has become increasingly significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed that of business logic. Consequently, programs are gradually being transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods will emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking. Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

    – During data analysis, we will try to utilize all data rather than analyzing only a small set of sample data.

    – Rather than insisting on accurate data, we will be willing to accept numerous and complicated data.

    – We shall pay greater attention to correlations between things rather than exploring causal relationships.

    – Simple algorithms on big data are more effective than complex algorithms on small data.

    – Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace “experts.”
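The first outlook item above anticipates storage that supports transaction mechanisms and SQL-like grammars. The sketch below shows only what that programming interface looks like: an atomic transfer that either commits both updates or rolls both back. It uses Python's built-in `sqlite3` module, a single-node embedded database chosen purely as a stand-in; it is not distributed and is unrelated to Spanner or F1. The `account` table and the helper functions are hypothetical.

```python
import sqlite3


def open_demo_db():
    """Create an in-memory database with a hypothetical `account` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE account ("
        "id INTEGER PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany(
        "INSERT INTO account (id, balance) VALUES (?, ?)", [(1, 100), (2, 50)]
    )
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move `amount` from `src` to `dst` atomically; return True on commit."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            # Violates the CHECK constraint when funds are insufficient,
            # which also undoes the credit made just above.
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return {account id: balance} for all accounts."""
    return dict(conn.execute("SELECT id, balance FROM account ORDER BY id"))


conn = open_demo_db()
print(transfer(conn, 1, 2, 30), balances(conn))   # True {1: 70, 2: 80}
print(transfer(conn, 1, 2, 500), balances(conn))  # False {1: 70, 2: 80}
```

The second call fails on the debit, and the earlier credit is rolled back with it. Providing this all-or-nothing behaviour across many machines is the hard part that distributed databases have to solve.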

Throughout the history of human society, the demands and willingness of human beings have always been the driving forces of scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will attract increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

                                                  References

1. Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

2. Fact sheet: Big data across the federal government (2012). http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf

3. Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper

4. Drowning in numbers - digital data will flood the planet - and help us understand it better (2011). http://www.economist.com/blogs/dailychart/2011/11/bigdata-0

5. Lohr S (2012) The age of big data. New York Times, pp 11

6. Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/thedigitalbreadcrumbs-that-lead-to-big-data

7. Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-searchforanalysts-to-make-sense-of-big-data

8. Big data (2008). http://www.nature.com/news/specials/bigdata/index.html

9. Special online collection: dealing with big data (2011). http://www.sciencemag.org/site/special/data

10. Manyika J, McKinsey Global Institute, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute

11. Mayer-Schönberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Eamon Dolan/Houghton Mifflin Harcourt

12. Laney D (2001) 3-d data management: controlling data volume, velocity and variety. META Group Research Note, 6 February

13. Zikopoulos P, Eaton C, et al (2011) Understanding big data: analytics for enterprise class hadoop and streaming data. McGraw-Hill Osborne Media

14. Meijer E (2011) The world according to linq. Communications of the ACM 54(10):45–51

15. Beyer M (2011) Gartner says solving big data challenge involves more than just managing volumes of data. Gartner. http://www.gartner.com/it/page.jsp

16. O R Team (2011) Big data now: current perspectives from O'Reilly Radar. O'Reilly Media

17. Grobelnik M (2012) Big data tutorial. http://videolectures.net/eswc2012_grobelnik_big_data

18. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2008) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014

19. DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6):85–98

20. Walter T (2009) Teradata past, present, and future. UCI ISG lecture series on scalable data management

21. Ghemawat S, Gobioff H, Leung S-T (2003) The google file system. In: ACM SIGOPS Operating Systems Review, vol 37. ACM, pp 29–43

22. Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

23. Hey AJG, Tansley S, Tolle KM, et al (2009) The fourth paradigm: data-intensive scientific discovery

24. Howard JH, Kazar ML, Menees SG, Nichols DA, Satyanarayanan M, Sidebotham RN, West MJ (1988) Scale and performance in a distributed file system. ACM Trans Comput Syst (TOCS) 6(1):51–81

25. Cattell R (2011) Scalable sql and nosql data stores. ACM SIGMOD Record 39(4):12–27

26. Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033

27. Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98

28. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J, et al (2012) Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States

29. Sun Y, Chen M, Liu B, Mao S (2013) Far: a fault-avoidant routing method for data center networks with regular topology. In: Proceedings of ACM/IEEE symposium on architectures for networking and communications systems (ANCS'13). ACM

30. Wiki (2013) Applications and organizations using hadoop. http://wiki.apache.org/hadoop/PoweredBy

31. Bahga A, Madisetti VK (2012) Analyzing massive machine maintenance data in a computing cloud. IEEE Transac Parallel Distrib Syst 23(10):1831–1843

32. Gunarathne T, Wu T-L, Choi JY, Bae S-H, Qiu J (2011) Cloud computing paradigms for pleasingly parallel biomedical applications. Concurr Comput Prac Experience 23(17):2338–2354

33. Gantz J, Reinsel D (2010) The digital universe decade - are you ready? External publication of IDC (Analyse the Future) information and data, pp 1–16

34. Bryant RE (2011) Data-intensive scalable computing for scientific applications. Comput Sci Eng 13(6):25–33

35. Wahab MHA, Mohd MNH, Hanafi HF, Mohsin MFM (2008) Data pre-processing on web server logs for generalized association rules mining algorithm. World Acad Sci Eng Technol 48:2008

36. Nanopoulos A, Manolopoulos Y, Zakrzewicz M, Morzy T (2002) Indexing web access-logs for pattern queries. In: Proceedings of the 4th international workshop on web information and data management. ACM, pp 63–68

37. Joshi KP, Joshi A, Yesha Y (2003) On using a warehouse to analyze web logs. Distrib Parallel Databases 13(2):161–180

38. Chandramohan V, Christensen K (2002) A first look at wired sensor networks for video surveillance systems. In: Proceedings LCN 2002, 27th annual IEEE conference on local computer networks. IEEE, pp 728–729

39. Selavo L, Wood A, Cao Q, Sookoor T, Liu H, Srinivasan A, Wu Y, Kang W, Stankovic J, Young D, et al (2007) Luster: wireless sensor network for environmental research. In: Proceedings of the 5th international conference on Embedded networked sensor systems. ACM, pp 103–116

40. Barrenetxea G, Ingelrest F, Schaefer G, Vetterli M, Couach O, Parlange M (2008) Sensorscope: out-of-the-box environmental monitoring. In: Information processing in sensor networks, 2008, international conference on, IPSN'08. IEEE, pp 332–343

41. Kim Y, Schmid T, Charbiwala ZM, Friedman J, Srivastava MB (2008) Nawms: nonintrusive autonomous water monitoring system. In: Proceedings of the 6th ACM conference on Embedded network sensor systems. ACM, pp 309–322

42. Kim S, Pakzad S, Culler D, Demmel J, Fenves G, Glaser S, Turon M (2007) Health monitoring of civil infrastructures using wireless sensor networks. In: Information Processing in Sensor Networks, 2007, 6th International Symposium on, IPSN 2007. IEEE, pp 254–263

43. Ceriotti M, Mottola L, Picco GP, Murphy AL, Guna S, Corra M, Pozzi M, Zonta D, Zanon P (2009) Monitoring heritage buildings with wireless sensor networks: the torre aquila deployment. In: Proceedings of the 2009 International Conference on Information Processing in Sensor Networks. IEEE Computer Society, pp 277–288

44. Tolle G, Polastre J, Szewczyk R, Culler D, Turner N, Tu K, Burgess S, Dawson T, Buonadonna P, Gay D, et al (2005) A macroscope in the redwoods. In: Proceedings of the 3rd international conference on embedded networked sensor systems. ACM, pp 51–63

45. Wang F, Liu J (2011) Networked wireless sensor data collection: issues, challenges, and approaches. IEEE Commun Surv Tutor 13(4):673–687

46. Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the 11th international conference on World Wide Web. ACM, pp 124–135

47. Choudhary S, Dincturk ME, Mirtaheri SM, Moosavi A, von Bochmann G, Jourdan G-V, Onut I-V (2012) Crawling rich internet applications: the state of the art. In: CASCON, pp 146–160

48. Ghani N, Dixit S, Wang T-S (2000) On ip-over-wdm integration. IEEE Commun Mag 38(3):72–84

49. Manchester J, Anderson J, Doshi B, Dravida S (1998) Ip over sonet. IEEE Commun Mag 36(5):136–142

50. Jinno M, Takara H, Kozicki B (2009) Dynamic optical mesh networks: drivers, challenges and solutions for the future. In: Optical communication, 2009, 35th European conference on, ECOC'09. IEEE, pp 1–4

51. Barroso LA, Hölzle U (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synt Lect Comput Archit 4(1):1–108

52. Armstrong J (2009) Ofdm for optical communications. J Light Technol 27(3):189–204

53. Shieh W (2011) Ofdm for flexible high-speed optical networks. J Light Technol 29(10):1560–1577

54. Cisco data center interconnect design and deployment guide (2010)

55. Greenberg A, Hamilton JR, Jain N, Kandula S, Kim C, Lahiri P, Maltz DA, Patel P, Sengupta S (2009) Vl2: a scalable and flexible data center network. In: ACM SIGCOMM computer communication review, vol 39. ACM, pp 51–62

56. Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S (2009) Bcube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 39(4):63–74

57. Farrington N, Porter G, Radhakrishnan S, Bazzaz HH, Subramanya V, Fainman Y, Papen G, Vahdat A (2011) Helios: a hybrid electrical/optical switch architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 41(4):339–350

58. Abu-Libdeh H, Costa P, Rowstron A, O'Shea G, Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Comput Commun Rev 40(4):51–62

59. Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what's needed for datacenter network operations. IEEE Commun Mag 48(7):32–39

60. Wang G, Andersen DG, Kaminsky M, Papagiannaki K, Ng TS, Kozuch M, Ryan M (2010) c-through: Part-time optics in data centers. In: ACM SIGCOMM Computer Communication Review, vol 40. ACM, pp 327–338

61. Ye X, Yin Y, Yoo SJB, Mejia P, Proietti R, Akella V (2010) Dos: a scalable optical switch for datacenters. In: Proceedings of the 6th ACM/IEEE symposium on architectures for networking and communications systems. ACM, p 24

62. Singla A, Singh A, Ramachandran K, Xu L, Zhang Y (2010) Proteus: a topology malleable data center network. In: Proceedings of the 9th ACM SIGCOMM workshop on hot topics in networks. ACM, p 8

63. Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P (2011) Energy-efficient design of a scalable optical multiplane interconnection architecture. IEEE J Sel Top Quantum Electron 17(2):377–383

64. Kodi AK, Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance computing (hpc) systems. IEEE J Sel Top Quantum Electron 17(2):384–395

65. Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H (2012) Mirror mirror on the ceiling: flexible wireless links for data centers. ACM SIGCOMM Comput Commun Rev 42(4):443–454

66. Lenzerini M (2002) Data integration: a theoretical perspective. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. ACM, pp 233–246

67. Cafarella MJ, Halevy A, Khoussainova N (2009) Data integration for the relational web. Proc VLDB Endowment 2(1):1090–1101

68. Maletic JI, Marcus A (2000) Data cleansing: beyond integrity analysis. In: IQ. Citeseer, pp 200–209

69. Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1-2):83–113

70. Chen H, Ku W-S, Wang H, Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 51–62

71. Zhao Z, Ng W (2012) A model-based approach for rfid data stream cleansing. In: Proceedings of the 21st ACM international conference on information and knowledge management. ACM, pp 862–871

72. Khoussainova N, Balazinska M, Suciu D (2008) Probabilistic event extraction from rfid data. In: Data Engineering, 2008, IEEE 24th international conference on, ICDE 2008. IEEE, pp 1480–1482

73. Herbert KG, Wang JTL (2007) Biological data cleaning: a case study. Int J Inf Qual 1(1):60–82

74. Tsai T-H, Lin C-Y (2012) Exploring contextual redundancy in improving object-based video coding for video sensor networks surveillance. IEEE Transac Multmed 14(3):669–682

75. Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 269–278

76. Kamath U, Compton J, Dogan RI, Jong KD, Shehu A (2012) An evolutionary algorithm approach for feature generation from sequence data and its application to dna splice site prediction. IEEE/ACM Transac Comput Biol Bioinforma (TCBB) 9(5):1387–1398

77. Leung K-S, Lee KH, Wang J-F, Ng EYT, Chan HLY, Tsui SKW, Mok TSK, Tse PC-H, Sung JJ-Y (2011) Data mining on dna sequences of hepatitis b virus. IEEE/ACM Transac Comput Biol Bioinforma 8(2):428–440

78. Huang Z, Shen H, Liu J, Zhou X (2011) Effective data co-reduction for multimedia similarity search. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, pp 1021–1032

79. Bleiholder J, Naumann F (2008) Data fusion. ACM Comput Surv (CSUR) 41(1):1

80. Brewer EA (2000) Towards robust distributed systems. In: PODC, p 7

81. Gilbert S, Lynch N (2002) Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59

82. McKusick MK, Quinlan S (2009) Gfs: evolution on fast-forward. ACM Queue 7(7):10

83. Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J (2008) Scope: easy and efficient parallel processing of massive data sets. Proc VLDB Endowment 1(2):1265–1276

84. Beaver D, Kumar S, Li HC, Sobel J, Vajgel P, et al (2010) Finding a needle in haystack: facebook's photo storage. In: OSDI, vol 10, pp 1–8

85. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: amazon's highly available key-value store. In: SOSP, vol 7, pp 205–220

86. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In: Proceedings of the twenty-ninth annual ACM symposium on theory of computing. ACM, pp 654–663

87. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2):4

88. Burrows M (2006) The chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th symposium on Operating systems design and implementation. USENIX Association, pp 335–350

89. Lakshman A, Malik P (2009) Cassandra: structured storage system on a p2p network. In: Proceedings of the 28th ACM symposium on principles of distributed computing. ACM, pp 5–5

90. George L (2011) HBase: the definitive guide. O'Reilly Media Inc

91. Judd D (2008) hypertable-0.9.0.4-alpha

92. Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc

93. Crockford D (2006) The application/json media type for javascript object notation (json)

94. Murty J (2009) Programming amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc

95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc

96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 975–986

97. Yang H-C, Parker DS (2009) Traverse: simplified indexing on large map-reduce-merge clusters. In: Database systems for advanced applications. Springer, pp 308–322

98. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with sawzall. Sci Program 13(4):277–298

99. Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U (2009) Building a high-level dataflow system on top of map-reduce: the pig experience. Proceedings VLDB Endowment 2(2):1414–1425

100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629

101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72

102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) Dryadlinq: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14

103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: Parallel and distributed processing, 2008, IEEE international symposium on, IPDPS 2008. IEEE, pp 1–11

104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146

105. Bu Y, Bill H, Balazinska M, Ernst MD (2010) Haloop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296

106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative mapreduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing. ACM, pp 810–818

107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2

108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: mapreduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing. ACM, p 7

109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) Ciel: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on Networked systems design and implementation, p 9

                                                  110 Anderson TW (1958) An introduction to multivariate statisticalanalysis vol 2 Wiley New York

                                                  111 Wu X Kumar V Quinlan JR Ghosh J Yang Q Motoda HMcLachlan GJ Ng A Liu B Philip SY et al (2008) Top 10algorithms in data mining Knowl Inf Syst 14(1)1ndash37

                                                  112 What analytics data mining big data software you used in thepast 12 months for a real project (2012) httpwwwkdnuggetscompolls2012analytics-data-mining-big-data-softwarehtml

                                                  113 Berthold MR Cebron N Dill F Gabriel TR Kotter T MeinlT Ohl P Sieb C Thiel K Wiswedel B (2008) KNIME theKonstanz information miner Springer

                                                  114 Sallam RL Richardson J Hagerty J Hostmann B (2011) Magicquadrant for business intelligence platforms CT Gartner GroupStamford

115. Beyond the PC. Special Report on Personal Technology (2011)

116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A et al (2011) The iPlant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on Software quality assurance. ACM, pp 70–77

118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers. ACM, pp 277–286

119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Record 34(2):18–26

120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM Sigmod Record 33(1):50–57

121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Transac Manag Inform Syst (TMIS) 3(2):7

122. Manning CD, Schütze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press

123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Transac Neural Netw 13(5):1163–1177

124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11

125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117

126. Konopnicki D, Shmueli O (1995) W3QS: a query system for the world-wide web. In: VLDB, vol 95, pp 54–65

127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640

128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval. ACM, p 2

129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25

130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19

131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819

132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939

133. Barragáns-Martínez AB, Costa-Montenegro E, Burguillo JC, Rey-López M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend tv programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311

134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91

135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia. ACM, pp 469–478

136. Hirsch JE (2005) An index to quantify an individual's scientific research output. Proc Natl Acad Sci USA 102(46):16569

137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company

138. Aggarwal CC (2011) An introduction to social network data analytics. Springer

139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1046–1054

140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied Computing. ACM, pp 1115–1116


141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Transac Knowl Discov Data (TKDD) 5(2):10

142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on World wide web. ACM, pp 631–640

143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis. ACM, pp 16–25

144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. ACM, pp 315–321

145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 145–158

146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 131–144

147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1007–1016

148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 807–816

149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on Web search and data mining. ACM, pp 657–666

150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360

151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (Accessed 5 May 2013)

152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51

153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive RFID, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 2

154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval. ACM, pp 469–478

155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated, multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium. ACM, pp 445–454

156. Tasevski P (2011) Password attacks and generation strategies. Tartu University: Faculty of Mathematics and Computer Sciences



6.2.4 Multimedia data analysis

Multimedia data (mainly including images, audio, and videos) have been growing at an amazing speed, and analysis is used to extract useful knowledge from them and to understand their semantics. Because multimedia data is heterogeneous and most of it contains richer information than simple structured data or text data, extracting information is confronted with the huge challenge of the semantic gap. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection.

Audio summarization can be accomplished by extracting the prominent words or phrases from metadata or by synthesizing a new representation. Video summarization interprets the most important or representative video content sequence, and it can be static or dynamic. Static video summarization methods use a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied in many business applications (e.g., by Yahoo, AltaVista, and Google), but their performance is poor. Dynamic summarization methods use a series of video frames to represent a video and apply smoothing measures to make the final summarization look more natural. In [128], the authors propose a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.
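The idea behind static, key-frame-based summarization can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of [128] or of any commercial system: each frame is assumed to be already described by a normalized color histogram, and the function names, the L1 distance, and the threshold value are all hypothetical choices made for the example.

```python
from typing import List, Sequence


def histogram_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """L1 distance between two normalized histograms of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b))


def select_key_frames(histograms: List[Sequence[float]],
                      threshold: float = 0.5) -> List[int]:
    """Return indices of frames that differ enough from the last key frame.

    Assumption: a frame becomes a new key frame when its histogram is more
    than `threshold` away from the most recently selected key frame.
    """
    if not histograms:
        return []
    key_frames = [0]  # the first frame always opens the summary
    for index in range(1, len(histograms)):
        last_key = histograms[key_frames[-1]]
        if histogram_distance(histograms[index], last_key) > threshold:
            key_frames.append(index)
    return key_frames


# Toy input: six frames described by 3-bin normalized color histograms.
frames = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.7, 0.2],
    [0.1, 0.1, 0.8],
    [0.2, 0.1, 0.7],
]
print(select_key_frames(frames, threshold=0.5))
```

On this toy input the three visually distinct segments each contribute one key frame, which shows why such methods are simple but coarse: everything between two key frames is discarded.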

Multimedia annotation inserts labels to describe the contents of images and videos at both syntactic and semantic levels. With such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time and labor intensive, automatic annotation without any human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic gap. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to jointly explore both manual and automatic multimedia annotation [129].

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users to conveniently and quickly look up multimedia resources [130]. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval [131]. Structural analysis aims to segment a video into several semantic structural elements, and includes shot boundary detection, key frame extraction, and scene segmentation. Based on the result of structural analysis, the second procedure is feature extraction, which mainly mines the features of key frames, objects, texts, and movements; these features are the foundation of video indexing and retrieval. Data mining, classification, and annotation use the extracted features to find patterns in video contents and assign videos to predefined categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos. The retrieval results are then refined through relevance feedback.
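The query-and-retrieval step described above can be sketched as a ranking of indexed feature vectors by similarity to a query vector. This is a minimal sketch, assuming each video has already been reduced to a fixed-length feature vector; the cosine measure, the `retrieve` function, and the sample index are illustrative assumptions rather than anything prescribed by [130] or [131].

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(index: Dict[str, Sequence[float]],
             query: Sequence[float],
             top_k: int = 2) -> List[Tuple[str, float]]:
    """Rank indexed videos by similarity to the query and keep the best top_k."""
    scored = [(video_id, cosine_similarity(features, query))
              for video_id, features in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical index: video id -> feature vector produced by earlier procedures.
video_index = {
    "news_clip": [0.9, 0.1, 0.0],
    "soccer_match": [0.1, 0.9, 0.2],
    "cooking_show": [0.0, 0.2, 0.9],
}
results = retrieve(video_index, query=[0.2, 0.8, 0.1], top_k=2)
for video_id, score in results:
    print(video_id, round(score, 3))
```

Relevance feedback would then adjust the query vector (or the ranking) based on which returned candidates the user accepts; that loop is omitted here.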

Multimedia recommendation recommends specific multimedia contents according to users' preferences. It has proven to be an effective approach to providing personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. Content-based methods identify general features of users or their interests and recommend other contents with similar features to them. These methods largely rely on content similarity measurement, but most of them are troubled by limited content analysis and over-specialization. Collaborative-filtering-based methods identify groups with similar interests and recommend contents to group members according to their behavior [132]. More recently, hybrid methods have been introduced, which integrate the advantages of the two aforementioned types of methods to improve recommendation quality [133].
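A collaborative-filtering-based recommender of the kind described above can be sketched in a few lines. The sketch assumes a user-based variant with cosine similarity over co-rated items; the rating data, user names, and function names are hypothetical, and this is not the specific model of [132] or [133].

```python
import math
from typing import Dict, List

Ratings = Dict[str, Dict[str, float]]


def user_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity computed over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[item] * b[item] for item in shared)
    norm_a = math.sqrt(sum(a[item] ** 2 for item in shared))
    norm_b = math.sqrt(sum(b[item] ** 2 for item in shared))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(ratings: Ratings, user: str, top_k: int = 2) -> List[str]:
    """Score unseen items by similarity-weighted ratings of other users."""
    scores: Dict[str, float] = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        weight = user_similarity(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + weight * rating
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:top_k]


# Hypothetical ratings on a 1-5 scale.
ratings = {
    "alice": {"clip_a": 5, "clip_b": 4},
    "bob": {"clip_a": 5, "clip_b": 5, "clip_c": 5},
    "carol": {"clip_a": 1, "clip_b": 5, "clip_d": 4},
}
print(recommend(ratings, "alice"))
```

A hybrid method in the sense of [133] would combine such behavior-based scores with content similarity between items, which helps when few users have rated a new item.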

The U.S. National Institute of Standards and Technology (NIST) initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains text descriptions related to concepts and video examples [134]. In [135], the authors proposed a new algorithm for ad hoc multimedia event detection using a few positive training examples. The research on video event detection is still in its infancy and mainly focuses on sports or news events, running or abnormal events in surveillance videos, and other similar events with repetitive patterns.

6.2.5 Network data analysis

Network data analysis evolved from the initial quantitative analysis [136] and sociological network analysis [137] into the emerging online social network analysis at the beginning of the 21st century. Many online social networking services (SNS), including Twitter, Facebook, and LinkedIn, have become increasingly popular over the years. Such online social network services generally include massive linked data and content data. The linked data is mainly in the form of graph structures describing the communications between two entities. The content data contains text, images, and other network multimedia data. The rich content in such networks brings about both unprecedented challenges and opportunities for data analysis. From the data-centered perspective, the existing research on social networking services can be classified into two categories: link-based structural analysis and content-based analysis [138].

The research on link-based structural analysis has focused on link prediction, community discovery, social network evolution, and social influence analysis. An SNS may be visualized as a graph in which every vertex corresponds to a user and edges correspond to the relationships among users. Since SNS are dynamic networks, new vertices and edges are continually added to the graphs. Link prediction predicts the possibility of a future connection between two vertices. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex and uses the existing link information to train binary classifiers that predict future links [139]. Probabilistic methods aim to build models for connection probabilities among vertices in SNS [140]. Linear algebra methods compute the similarity between two vertices according to a reduced-rank similarity matrix [141]. A community is represented by a sub-graph in which the edges connecting vertices within the sub-graph have high density, while the edges between two sub-graphs have much lower density [142].
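As a small illustration of link prediction, the sketch below scores every pair of not-yet-connected vertices by the Jaccard coefficient of their neighbor sets. This neighborhood score is a swapped-in, commonly used feature that could feed a feature-based classifier; it is not the specific method of [139], [140], or [141], and the graph, user names, and function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple

Graph = Dict[str, Set[str]]


def build_graph(edges: Iterable[Tuple[str, str]]) -> Graph:
    """Build an undirected adjacency map from a list of edges."""
    graph: Graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def jaccard(graph: Graph, u: str, v: str) -> float:
    """Shared neighbors divided by all neighbors of the two vertices."""
    union = graph[u] | graph[v]
    if not union:
        return 0.0
    return len(graph[u] & graph[v]) / len(union)


def rank_candidate_links(graph: Graph) -> List[Tuple[Tuple[str, str], float]]:
    """Score every currently unconnected pair, most likely future link first."""
    candidates = []
    for u, v in combinations(sorted(graph), 2):
        if v not in graph[u]:
            candidates.append(((u, v), jaccard(graph, u, v)))
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return candidates


# Hypothetical friendship graph.
graph = build_graph([
    ("ann", "bob"), ("ann", "cat"), ("bob", "cat"),
    ("bob", "dan"), ("cat", "dan"), ("dan", "eve"),
])
ranked = rank_candidate_links(graph)
for pair, score in ranked:
    print(pair, round(score, 3))
```

Here the pair sharing two of three neighbors ranks first, while a pair with no common neighbor gets a score of zero. In a feature-based classifier, such scores would be one input among several rather than the final prediction.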

Many methods for community detection have been proposed and studied, most of which are topology-based and rely on objective functions that capture the concept of community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. The research on SNS evolution aims to look for laws and deduction models to interpret network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generation methods are proposed to assist network and system design [147].
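The density-based notion of a community used above (dense edges inside a sub-graph, sparse edges between sub-graphs) can be checked directly on a candidate partition. The following is a minimal sketch with hypothetical function names and a toy graph; it only evaluates a given partition and is not a community detection algorithm such as the one in [143].

```python
from itertools import combinations
from typing import FrozenSet, Iterable, Set, Tuple

EdgeSet = Set[FrozenSet[int]]


def edge_set(edges: Iterable[Tuple[int, int]]) -> EdgeSet:
    """Store undirected edges as frozensets so (u, v) equals (v, u)."""
    return {frozenset(edge) for edge in edges}


def internal_density(edges: EdgeSet, community: Set[int]) -> float:
    """Fraction of possible vertex pairs inside the community that are linked."""
    pairs = list(combinations(sorted(community), 2))
    if not pairs:
        return 0.0
    present = sum(1 for u, v in pairs if frozenset((u, v)) in edges)
    return present / len(pairs)


def cross_density(edges: EdgeSet, a: Set[int], b: Set[int]) -> float:
    """Fraction of possible pairs between two communities that are linked."""
    possible = len(a) * len(b)
    if possible == 0:
        return 0.0
    present = sum(1 for u in a for v in b if frozenset((u, v)) in edges)
    return present / possible


# Toy graph: two triangles joined by a single bridge edge (3, 4).
edges = edge_set([(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (3, 4)])
left, right = {1, 2, 3}, {4, 5, 6}
print(internal_density(edges, left), internal_density(edges, right))
print(cross_density(edges, left, right))
```

Topology-based detection methods typically search for the partition that optimizes an objective built from quantities like these two densities.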

Social influence refers to the case when individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, time effects, and characteristics of networks and individuals. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of contents in SNS is considered, the performance of link-based structural analysis may be further improved.
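One way to quantify the influence of an individual is to simulate how far a behavior spreads from that individual through the network. The sketch below uses the independent cascade model, a standard diffusion model chosen here as an assumption; the text does not name a model, and this is not the method of [148] or [149]. The graph, the uniform activation probability, and all function names are hypothetical.

```python
import random
from typing import Dict, List, Set


def simulate_cascade(graph: Dict[str, List[str]], seeds: List[str],
                     probability: float, rng: random.Random) -> Set[str]:
    """Run one independent cascade and return the set of activated users.

    Each newly activated user gets a single chance to activate each
    inactive follower, succeeding with the given probability.
    """
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for node in frontier:
            for neighbor in graph.get(node, []):
                if neighbor not in active and rng.random() < probability:
                    active.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return active


def expected_spread(graph: Dict[str, List[str]], seeds: List[str],
                    probability: float, runs: int = 1000,
                    seed: int = 0) -> float:
    """Monte Carlo estimate of the average number of activated users."""
    rng = random.Random(seed)
    total = sum(len(simulate_cascade(graph, seeds, probability, rng))
                for _ in range(runs))
    return total / runs


# Hypothetical directed "who influences whom" graph.
influence_graph = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": [],
    "e": ["a"],
}
print(expected_spread(influence_graph, ["a"], probability=0.5))
print(expected_spread(influence_graph, ["e"], probability=0.5))
```

Comparing the estimated spread of different seed users gives a simple quantitative influence ranking, which is the kind of measurement that marketing and recommendation applications can build on.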

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise. For example, the blogosphere contains a large number of spam blogs, and Twitter likewise contains many trivial tweets. Third, SNS are dynamic networks, which change and are updated frequently and quickly. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge among different media [150].

6.2.6 Mobile data analysis

By April 2013, Android Apps had provided more than 650,000 applications, covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has been started in different fields. Since the research on mobile analysis has only just started, we will only introduce some recent and representative analysis applications in this section.

With the growing number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities based on geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest WeChat). Traditional network communities or SNS communities lack online interaction among members, and the communities are active only when members are sitting in front of computers. In contrast, mobile phones can support rich interaction at any time and anywhere. A mobile community is defined as a group of individuals with the same hobbies (e.g., health, safety, and entertainment) who gather together on networks, meet to set a common goal, decide on measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build body area networks for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations. In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly aggregated characteristics related to health are available, Park et al. in [155] examined approaches to better utilize such data.

Researchers from Gjøvik University College in Norway and Derawi Biometrics collaborated to develop an application for smart phones, which analyzes people's gait when they walk and uses the gait information to unlock the security system [11]. Meanwhile, Robert Delano and Brian Parise from the Georgia Institute of Technology developed an application called iTrem, which monitors human body tremors with the built-in accelerometer of a mobile phone, so as to cope with Parkinson's disease and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business models. In sales planning, after comparing massive data, enterprises can optimize their commodity prices. In operations, enterprises can improve their operational efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor costs. In the supply chain, using big data, enterprises may conduct inventory optimization, logistics optimization, and supplier coordination to close the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) uses data analysis to recognize that such activities as "Multi-times score accumulation" and "score exchange in shops" are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers' transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.
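The selection step of such a drop-out warning workflow can be sketched as follows. The sketch assumes that some model has already produced a drop-out score per customer; how CMB computes its scores is not described in the text, so the scores, customer identifiers, and the `top_fraction_at_risk` function are purely hypothetical.

```python
import math
from typing import Dict, List


def top_fraction_at_risk(churn_scores: Dict[str, float],
                         fraction: float = 0.2) -> List[str]:
    """Return the customers with the highest drop-out scores.

    `fraction` is the share of the customer base to target (0.2 = top 20 %).
    The group size is rounded up so that a non-empty base yields at least
    one customer.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    count = math.ceil(len(churn_scores) * fraction)
    # Highest score first; ties are broken by customer id for determinism.
    ranked = sorted(churn_scores, key=lambda cid: (-churn_scores[cid], cid))
    return ranked[:count]


# Hypothetical model output: customer id -> estimated drop-out probability.
scores = {
    "c01": 0.10, "c02": 0.35, "c03": 0.81, "c04": 0.22, "c05": 0.05,
    "c06": 0.47, "c07": 0.93, "c08": 0.18, "c09": 0.60, "c10": 0.29,
}
targets = top_fraction_at_risk(scores, fraction=0.2)
print(targets)  # customers to approach with retention offers
```

Targeting only the highest-scoring fraction keeps the retention offer focused on customers for whom it is most likely to matter, which is the rationale the text gives for the top 20 % cut-off.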

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded, along with, more importantly, the age, gender, address, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, the market conditions of their brands, and consumers' behaviors, and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more favorable prices. The credit loan of Alibaba automatically analyzes and judges whether to lend to enterprises based on the acquired enterprise transaction data by virtue of big data technology, and no manual intervention occurs in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan with only about 0.3 % bad loans, which is much lower than the ratios of other commercial banks.

6.3.2 Application of IoT based big data

IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises may have the most profound experience with the application of IoT big data. For example, the trucks of UPS are equipped with sensors, wireless adapters, and GPS, so that headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project cooperation between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and Miami city, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills this year by identifying and fixing running and leaking water pipes in a timely manner.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and the connections among individuals based on an information network. Big data of online SNS mainly comes from instant messages, online social networking, micro blogs, and shared spaces, etc., which represent various user activities. The analysis of big data from online SNS uses computational analytical methods to understand relations in human society, by virtue of theories and methods that involve mathematics, informatics, sociology, and management science, etc., from three dimensions: network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education, etc. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire value.

– Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preference, emotion, interest, and demand, etc. may be revealed.

– Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies, etc. aggregate relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.
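The notion of a community as a group with close internal relations but loose external relations can be made concrete by comparing edge densities. The following is a minimal sketch rather than a method taken from the surveyed work: the toy friendship graph, the node names, and the helper `edge_densities` are hypothetical, and the graph is assumed to be undirected.

```python
from itertools import combinations


def edge_densities(edges, community, all_nodes):
    """Return (internal, external) edge density for a candidate community.

    internal: fraction of member-member pairs that are connected.
    external: fraction of member-outsider pairs that are connected.
    Assumes at least two members and at least one outsider.
    """
    edge_set = {frozenset(e) for e in edges}
    members = set(community)
    outsiders = set(all_nodes) - members

    internal_pairs = [frozenset(p) for p in combinations(sorted(members), 2)]
    external_pairs = [frozenset((m, o)) for m in members for o in outsiders]

    internal = sum(p in edge_set for p in internal_pairs) / len(internal_pairs)
    external = sum(p in edge_set for p in external_pairs) / len(external_pairs)
    return internal, external


# Hypothetical toy graph: two tight triangles joined by a single edge.
nodes = ["a", "b", "c", "x", "y", "z"]
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("x", "y"), ("y", "z"), ("x", "z"),
         ("c", "x")]

internal, external = edge_densities(edges, ["a", "b", "c"], nodes)
print(f"internal density = {internal:.2f}, external density = {external:.2f}")
```

A candidate group whose internal density is much higher than its external density matches the informal definition above. Practical community detection algorithms optimize related measures over the whole graph instead of testing one hand-picked group.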

The US Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

Fig. 5 Enabling technologies for online social network-oriented big data

In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing the social data of more than one million American users of Facebook. According to the analysis, most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted research that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans. The goal is to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting sharp growth or drops in the amount of topics; 2) observing the weekly and monthly trends of dialogs on Twitter, and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation from the official statistics of Indonesia matches the number of Tweets about the rice price on Twitter, as shown in Fig. 6.
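Two of the analyses described for the Global Pulse project, detecting sharp growth or drops in topic volume and comparing topic volume with an external indicator, can be illustrated with a short sketch. This is not the project's actual method: the rolling-window rule, the threshold, the function names, and the weekly numbers below are all made up for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def detect_spikes(counts, window=4, threshold=3.0):
    """Flag indices whose count deviates sharply from the preceding window."""
    spikes = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # A point is abnormal if it lies more than `threshold` standard
        # deviations away from the mean of the previous `window` points.
        if sigma > 0 and abs(counts[i] - mu) > threshold * sigma:
            spikes.append(i)
    return spikes


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up weekly series: tweet counts on a topic and an official indicator.
tweets_about_rice = [100, 104, 98, 102, 101, 180, 175, 99]
food_inflation_pct = [5.0, 5.1, 4.9, 5.0, 5.1, 7.9, 7.6, 5.0]

print("abnormal weeks:", detect_spikes(tweets_about_rice))
print("correlation:", round(pearson(tweets_about_rice, food_inflation_pct), 3))
```

With these invented numbers, only the first week of the jump is flagged, because the following week is compared against a window that already contains the jump. A high correlation between the two series would only indicate that they move together, not that one causes the other.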

Generally speaking, the application of big data from online SNS may help to better understand users’ behavior and master the laws of social and economic activities from the following three aspects:

– Early Warning: to rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotion, and preference of users.

– Real-time Feedback: to acquire groups’ feedback on social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the health care business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final result into an extremely personalized treatment plan to assess the risk factors and main treatment plans of patients. Then, doctors may reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or by suggesting that patients reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20.

The Mount Sinai Medical Center in the US utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

Fig. 6 The correlation between Tweets about rice price and food price inflation

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through the software development kit (SDK) and open interface.

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units and coordinate through mobile networks for the distribution of sensing tasks and the collection and utilization of sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of Crowdsourcing has been successfully applied to geotagged photographs, positioning and navigation, urban road traffic sensing, market forecasts, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach to problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, Crowdsourcing had been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of Crowdsourcing. The main idea of Crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, Crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, Spatial Crowdsourcing has become a hot topic. The operation framework of Spatial Crowdsourcing is as follows. A user may request the service and resources related to a specified location. Then the mobile users who are willing to participate in the task will move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data will be sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions provided by mobile devices, it can be forecast that Spatial Crowdsourcing will become more prevalent than traditional Crowdsourcing, e.g., Amazon Turk and Crowdflower.
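The request, move, and return workflow of Spatial Crowdsourcing implies a matching step between a location-bound task and the mobile users willing to serve it. The sketch below shows one simple way to do that matching, assuming a nearest-willing-worker policy with a distance cap. The `Worker` class, the `assign_task` helper, the 5 km limit, and the coordinates are hypothetical and not drawn from any specific platform.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


@dataclass
class Worker:
    name: str
    lat: float
    lon: float
    willing: bool


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def assign_task(task_lat, task_lon, workers, max_km=5.0):
    """Return the nearest willing worker within max_km, or None."""
    in_range = []
    for w in workers:
        if not w.willing:
            continue
        d = haversine_km(task_lat, task_lon, w.lat, w.lon)
        if d <= max_km:
            in_range.append((d, w))
    if not in_range:
        return None
    return min(in_range, key=lambda pair: pair[0])[1]


# Hypothetical workers around a task located at (30.50, 114.40).
workers = [
    Worker("A", 30.51, 114.41, willing=True),     # close and willing
    Worker("B", 30.501, 114.401, willing=False),  # closest, but not willing
    Worker("C", 30.60, 114.50, willing=True),     # willing, but too far away
]

chosen = assign_task(30.50, 114.40, workers)
print(chosen.name if chosen else "no worker available")
```

A real platform would also have to account for travel time, worker reliability, and many simultaneous tasks, which turns the matching into an optimization problem instead of a single nearest-neighbour lookup.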

6.3.6 Smart grid

Smart Grid is the next generation power grid, constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart Grid related big data are generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nationwide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as Circuit Breaker Monitors and transformers). Smart Grid brings about the following challenges on exploiting big data.

– Grid planning: By analyzing data in the Smart Grid, regions that have excessively high electrical loads or high power outage frequencies can be identified. Even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an “electric map” according to big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of a block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides an effective and visual load forecast for power grid planning in a city. Preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads, as displayed in the map, may be conducted.

– Interaction between power generation and power consumption: An ideal power grid shall balance power generation and consumption. However, the traditional power grid is constructed based on the one-directional approach of transmission-transformation-distribution-consumption, which does not allow the generation capacity to be adjusted according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters are developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which help the supplier read power utilization data every 15 min rather than every month as in the past. Labor costs for meter reading are greatly reduced. Because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and low periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help realize time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users.

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable new energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortages. Such energy resources can complement traditional hydropower and thermal power generation.
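The time-sharing dynamic pricing mentioned in the item on the interaction between power generation and consumption can be sketched as a two-step computation: aggregate 15-minute smart meter readings into hourly load, then price the highest-load hours above the base rate. This is only an illustrative sketch. The function names, the tariff parameters, and the readings are hypothetical and do not describe the pricing scheme of TXU Energy or any other supplier.

```python
from collections import defaultdict


def hourly_load(readings):
    """Aggregate (minute_of_day, kWh) 15-minute readings into kWh per hour."""
    load = defaultdict(float)
    for minute_of_day, kwh in readings:
        load[minute_of_day // 60] += kwh
    return dict(load)


def time_of_use_prices(load_by_hour, base_price=0.10,
                       peak_markup=0.5, peak_share=0.25):
    """Mark the highest-load hours as peak and price them above base_price.

    peak_share is the fraction of observed hours treated as peak hours.
    """
    ranked = sorted(load_by_hour, key=load_by_hour.get, reverse=True)
    n_peak = max(1, round(len(ranked) * peak_share))
    peak_hours = set(ranked[:n_peak])
    return {
        hour: base_price * (1 + peak_markup) if hour in peak_hours else base_price
        for hour in sorted(load_by_hour)
    }


# Hypothetical evening readings: four 15-minute values per hour, 17:00-20:59.
per_reading_kwh = {17: 0.5, 18: 1.0, 19: 0.75, 20: 0.25}
readings = [(hour * 60 + quarter * 15, kwh)
            for hour, kwh in per_reading_kwh.items()
            for quarter in range(4)]

load = hourly_load(readings)
prices = time_of_use_prices(load)
for hour in sorted(load):
    print(f"{hour:02d}:00  load={load[hour]:.2f} kWh  price={prices[hour]:.3f}")
```

With monthly meter readings this computation is impossible, since peak hours cannot be distinguished at all. The 15-minute granularity is what makes the hourly ranking, and hence the differentiated price, available.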

7 Conclusion, open issues, and outlook

In this paper, we review the background and state-of-the-art of big data. Firstly, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture to readers of this exciting area.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still at an early stage. Considerable research efforts are needed to improve the efficiency of the display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, which are discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

– Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to weigh the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated once the system is implemented and deployed, so the advantages and disadvantages of alternative solutions cannot be compared horizontally, even before and after the implementation of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes: This includes memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, etc. should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor that restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.

– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor in improving big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems in big data processing arise from traditional data analysis, including (i) data re-utilization: with the increase of data scale, more value may be mined from the re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized, from which more value can be mined; and (iii) data exhaust, which means wrong data during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems should be solved.

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from the comprehensive utilization of multiple datasets is far higher than the sum of the values of individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and from different datasets.

– Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction of big data, etc. are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume is growing fast, there are more severe safety risks, while the traditional data protection methods have already been shown not to be applicable to big data. In particular, big data safety is confronted with the following security related challenges.

– Big data privacy: Big data privacy includes two aspects: (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc. of users may be more easily acquired, and users may not be aware. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed a big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher of Skull Security, used an information acquisition tool to acquire data from the public pages of Facebook users who had failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low quality data wastes transmission and storage resources and has poor usability. There are a lot of factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though a lot of measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.

– Big data safety mechanism: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data could not meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safe communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants’ data should be enabled on the premise of efficiency assurance.

– Big data application in information security: Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) after analysis of big data in the form of log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc. may also be more easily identified through the analysis of big data.
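Two of the data quality dimensions named in the item on data quality above, completeness and redundancy, lend themselves to simple automatic checks. The sketch below shows one possible way to score them, under the assumption that records are flat dictionaries. The metric definitions, the function names, and the sensor records are illustrative assumptions, not a standard proposed in this survey.

```python
def completeness(records, fields):
    """Share of expected field values that are present (not None or empty)."""
    total = len(records) * len(fields)
    present = sum(
        1 for r in records for f in fields if r.get(f) not in (None, "")
    )
    return present / total if total else 1.0


def redundancy(records, key_fields):
    """Share of records that repeat an earlier record on key_fields."""
    seen = set()
    duplicates = 0
    for r in records:
        key = tuple(r.get(f) for f in key_fields)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates / len(records) if records else 0.0


# Hypothetical sensor readings with one duplicate and one missing value.
records = [
    {"sensor": "s1", "ts": 1, "value": 20.5},
    {"sensor": "s1", "ts": 1, "value": 20.5},  # same sensor and timestamp
    {"sensor": "s2", "ts": 1, "value": None},  # missing measurement
    {"sensor": "s3", "ts": 2, "value": 19.0},
]

print("completeness:", round(completeness(records, ["sensor", "ts", "value"]), 3))
print("redundancy:", redundancy(records, ["sensor", "ts"]))
```

Accuracy and consistency are harder to score this way, because they require a reference value or a rule relating several fields. This is part of why automatic detection and repair remain open problems.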

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage of energy efficiency optimization, and processed hardware and software system architectures, etc. Particularly, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the “T” (Technology) was the main concern, while technology drives the development of data. In the big data era, with the prominence of data value and advances in “I” (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact, but also influence everyone’s ways of living and thinking, which is just happening. We cannot predict the future, but may take precautions for possible events to occur in the future.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced, given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have been concerned with better ways to cope with larger-scale, higher-diversity, and more complexly structured data. These efforts are represented by Spanner, the globally-distributed database of Google, and F1, a fault-tolerant, expandable, distributed relational database. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc. shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a user-friendly way can they be effectively utilized by users. Reports, histograms, pie charts, and regression curves are frequently used to visualize the results of data analysis. New presentation forms will appear in the future, e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.
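As a minimal illustration of one of the presentation forms named above, the following sketch renders a histogram as plain text. It is not taken from any system discussed in this survey; the function name `text_histogram`, the bin width, and the sample values are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def text_histogram(values, bin_width=10, bar_char="#"):
    """Return one text line per non-empty bin, ordered by bin start."""
    # Map every value to the lower edge of its bin and count occurrences.
    counts = Counter((v // bin_width) * bin_width for v in values)
    lines = []
    for start in sorted(counts):
        label = f"{start:>3}-{start + bin_width - 1:<3}"
        lines.append(f"{label} | {bar_char * counts[start]} ({counts[start]})")
    return lines


# Hypothetical analysis result: response times in milliseconds.
response_times_ms = [12, 15, 18, 21, 22, 25, 27, 29, 31, 34, 45, 47, 52]
histogram_lines = text_histogram(response_times_ms)
print("\n".join(histogram_lines))
```

Even this simple form shows at a glance that most of the hypothetical values fall in the 20 to 29 ms bin, which is much harder to see in the raw list.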

– Data-oriented: It is well known that programs consist of data structures and algorithms, and that data structures are used to store data. In the history of program design, it is observed that the role of data is becoming increasingly significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed that of business logic. Consequently, programs are gradually being transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods will emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

  – During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

  – Rather than insisting on accurate data, we will be willing to accept numerous and complicated data.

  – We shall pay greater attention to correlations between things rather than exploring causal relationships.

  – Simple algorithms on big data are more effective than complex algorithms on small data.

  – Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace "experts".

Throughout the history of human society, the demands and willingness of human beings have always been the source of power that promotes scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

                                                    References

1. Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

2. Fact sheet: Big data across the federal government (2012). http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf

3. Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper

4. Drowning in numbers - digital data will flood the planet - and help us understand it better (2011). http://www.economist.com/blogs/dailychart/2011/11/bigdata-0

5. Lohr S (2012) The age of big data. New York Times, pp 11

6. Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/thedigitalbreadcrumbs-that-lead-to-big-data

7. Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-searchforanalysts-to-make-sense-of-big-data

8. Big data (2008). http://www.nature.com/news/specials/bigdata/index.html

9. Special online collection: dealing with big data (2011). http://www.sciencemag.org/site/special/data

10. Manyika J, McKinsey Global Institute, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute

11. Mayer-Schönberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Eamon Dolan/Houghton Mifflin Harcourt

12. Laney D (2001) 3-D data management: controlling data volume, velocity and variety. META Group Research Note, 6 February

13. Zikopoulos P, Eaton C, et al (2011) Understanding big data: analytics for enterprise class Hadoop and streaming data. McGraw-Hill Osborne Media

14. Meijer E (2011) The world according to LINQ. Communications of the ACM 54(10):45–51

15. Beyer M (2011) Gartner says solving big data challenge involves more than just managing volumes of data. Gartner. http://www.gartner.com/it/page.jsp

16. O R Team (2011) Big data now: current perspectives from O'Reilly Radar. O'Reilly Media

17. Grobelnik M (2012) Big data tutorial. http://videolectures.net/eswc2012_grobelnik_big_data

18. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2008) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014

19. DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6):85–98

20. Walter T (2009) Teradata past, present, and future. UCI ISG lecture series on scalable data management

21. Ghemawat S, Gobioff H, Leung S-T (2003) The Google file system. In: ACM SIGOPS Operating Systems Review, vol 37. ACM, pp 29–43

22. Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

23. Hey AJG, Tansley S, Tolle KM, et al (2009) The fourth paradigm: data-intensive scientific discovery

24. Howard JH, Kazar ML, Menees SG, Nichols DA, Satyanarayanan M, Sidebotham RN, West MJ (1988) Scale and performance in a distributed file system. ACM Trans Comput Syst (TOCS) 6(1):51–81

25. Cattell R (2011) Scalable SQL and NoSQL data stores. ACM SIGMOD Record 39(4):12–27

26. Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033

27. Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98

28. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J, et al (2012) Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States

29. Sun Y, Chen M, Liu B, Mao S (2013) FAR: a fault-avoidant routing method for data center networks with regular topology. In: Proceedings of ACM/IEEE symposium on architectures for networking and communications systems (ANCS'13). ACM

30. Wiki (2013) Applications and organizations using Hadoop. http://wiki.apache.org/hadoop/PoweredBy

31. Bahga A, Madisetti VK (2012) Analyzing massive machine maintenance data in a computing cloud. IEEE Transac Parallel Distrib Syst 23(10):1831–1843

32. Gunarathne T, Wu T-L, Choi JY, Bae S-H, Qiu J (2011) Cloud computing paradigms for pleasingly parallel biomedical applications. Concurr Comput Prac Experience 23(17):2338–2354

33. Gantz J, Reinsel D (2010) The digital universe decade - are you ready? External publication of IDC (Analyse the Future) information and data, pp 1–16

34. Bryant RE (2011) Data-intensive scalable computing for scientific applications. Comput Sci Eng 13(6):25–33

35. Wahab MHA, Mohd MNH, Hanafi HF, Mohsin MFM (2008) Data pre-processing on web server logs for generalized association rules mining algorithm. World Acad Sci Eng Technol 48:2008

36. Nanopoulos A, Manolopoulos Y, Zakrzewicz M, Morzy T (2002) Indexing web access-logs for pattern queries. In: Proceedings of the 4th international workshop on web information and data management. ACM, pp 63–68

37. Joshi KP, Joshi A, Yesha Y (2003) On using a warehouse to analyze web logs. Distrib Parallel Databases 13(2):161–180

38. Chandramohan V, Christensen K (2002) A first look at wired sensor networks for video surveillance systems. In: Proceedings LCN 2002, 27th annual IEEE conference on local computer networks. IEEE, pp 728–729

39. Selavo L, Wood A, Cao Q, Sookoor T, Liu H, Srinivasan A, Wu Y, Kang W, Stankovic J, Young D, et al (2007) Luster: wireless sensor network for environmental research. In: Proceedings of the 5th international conference on embedded networked sensor systems. ACM, pp 103–116

40. Barrenetxea G, Ingelrest F, Schaefer G, Vetterli M, Couach O, Parlange M (2008) SensorScope: out-of-the-box environmental monitoring. In: Information processing in sensor networks, 2008, international conference on IPSN'08. IEEE, pp 332–343

41. Kim Y, Schmid T, Charbiwala ZM, Friedman J, Srivastava MB (2008) NAWMS: nonintrusive autonomous water monitoring system. In: Proceedings of the 6th ACM conference on embedded network sensor systems. ACM, pp 309–322

42. Kim S, Pakzad S, Culler D, Demmel J, Fenves G, Glaser S, Turon M (2007) Health monitoring of civil infrastructures using wireless sensor networks. In: Information processing in sensor networks, 2007, 6th international symposium on IPSN 2007. IEEE, pp 254–263

43. Ceriotti M, Mottola L, Picco GP, Murphy AL, Guna S, Corra M, Pozzi M, Zonta D, Zanon P (2009) Monitoring heritage buildings with wireless sensor networks: the Torre Aquila deployment. In: Proceedings of the 2009 international conference on information processing in sensor networks. IEEE Computer Society, pp 277–288

44. Tolle G, Polastre J, Szewczyk R, Culler D, Turner N, Tu K, Burgess S, Dawson T, Buonadonna P, Gay D, et al (2005) A macroscope in the redwoods. In: Proceedings of the 3rd international conference on embedded networked sensor systems. ACM, pp 51–63

45. Wang F, Liu J (2011) Networked wireless sensor data collection: issues, challenges, and approaches. IEEE Commun Surv Tutor 13(4):673–687

46. Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the 11th international conference on World Wide Web. ACM, pp 124–135

47. Choudhary S, Dincturk ME, Mirtaheri SM, Moosavi A, von Bochmann G, Jourdan G-V, Onut I-V (2012) Crawling rich internet applications: the state of the art. In: CASCON, pp 146–160

48. Ghani N, Dixit S, Wang T-S (2000) On IP-over-WDM integration. IEEE Commun Mag 38(3):72–84

49. Manchester J, Anderson J, Doshi B, Dravida S (1998) IP over SONET. IEEE Commun Mag 36(5):136–142

50. Jinno M, Takara H, Kozicki B (2009) Dynamic optical mesh networks: drivers, challenges and solutions for the future. In: Optical communication, 2009, 35th European conference on ECOC'09. IEEE, pp 1–4

51. Barroso LA, Hölzle U (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synt Lect Comput Archit 4(1):1–108

52. Armstrong J (2009) OFDM for optical communications. J Light Technol 27(3):189–204

53. Shieh W (2011) OFDM for flexible high-speed optical networks. J Light Technol 29(10):1560–1577

54. Cisco data center interconnect design and deployment guide (2010)

55. Greenberg A, Hamilton JR, Jain N, Kandula S, Kim C, Lahiri P, Maltz DA, Patel P, Sengupta S (2009) VL2: a scalable and flexible data center network. In: ACM SIGCOMM computer communication review, vol 39. ACM, pp 51–62

56. Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S (2009) BCube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 39(4):63–74

57. Farrington N, Porter G, Radhakrishnan S, Bazzaz HH, Subramanya V, Fainman Y, Papen G, Vahdat A (2011) Helios: a hybrid electrical/optical switch architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 41(4):339–350

58. Abu-Libdeh H, Costa P, Rowstron A, O'Shea G, Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Comput Commun Rev 40(4):51–62

59. Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what's needed for datacenter network operations. IEEE Commun Mag 48(7):32–39

60. Wang G, Andersen DG, Kaminsky M, Papagiannaki K, Ng TS, Kozuch M, Ryan M (2010) c-Through: part-time optics in data centers. In: ACM SIGCOMM computer communication review, vol 40. ACM, pp 327–338

61. Ye X, Yin Y, Yoo SJB, Mejia P, Proietti R, Akella V (2010) DOS: a scalable optical switch for datacenters. In: Proceedings of the 6th ACM/IEEE symposium on architectures for networking and communications systems. ACM, p 24

62. Singla A, Singh A, Ramachandran K, Xu L, Zhang Y (2010) Proteus: a topology malleable data center network. In: Proceedings of the 9th ACM SIGCOMM workshop on hot topics in networks. ACM, p 8

63. Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P (2011) Energy-efficient design of a scalable optical multiplane interconnection architecture. IEEE J Sel Top Quantum Electron 17(2):377–383

64. Kodi AK, Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance computing (HPC) systems. IEEE J Sel Top Quantum Electron 17(2):384–395

65. Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H (2012) Mirror mirror on the ceiling: flexible wireless links for data centers. ACM SIGCOMM Comput Commun Rev 42(4):443–454

66. Lenzerini M (2002) Data integration: a theoretical perspective. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. ACM, pp 233–246

67. Cafarella MJ, Halevy A, Khoussainova N (2009) Data integration for the relational web. Proc VLDB Endowment 2(1):1090–1101

68. Maletic JI, Marcus A (2000) Data cleansing: beyond integrity analysis. In: IQ. Citeseer, pp 200–209

69. Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1-2):83–113

70. Chen H, Ku W-S, Wang H, Sun M-T (2010) Leveraging spatio-temporal redundancy for RFID data cleansing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 51–62

71. Zhao Z, Ng W (2012) A model-based approach for RFID data stream cleansing. In: Proceedings of the 21st ACM international conference on information and knowledge management. ACM, pp 862–871

72. Khoussainova N, Balazinska M, Suciu D (2008) Probabilistic event extraction from RFID data. In: Data engineering, 2008, IEEE 24th international conference on ICDE 2008. IEEE, pp 1480–1482

73. Herbert KG, Wang JTL (2007) Biological data cleaning: a case study. Int J Inf Qual 1(1):60–82

74. Tsai T-H, Lin C-Y (2012) Exploring contextual redundancy in improving object-based video coding for video sensor networks surveillance. IEEE Transac Multmed 14(3):669–682

75. Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 269–278

76. Kamath U, Compton J, Dogan RI, Jong KD, Shehu A (2012) An evolutionary algorithm approach for feature generation from sequence data and its application to DNA splice site prediction. IEEE/ACM Transac Comput Biol Bioinforma (TCBB) 9(5):1387–1398

77. Leung K-S, Lee KH, Wang J-F, Ng EYT, Chan HLY, Tsui SKW, Mok TSK, Tse PC-H, Sung JJ-Y (2011) Data mining on DNA sequences of hepatitis B virus. IEEE/ACM Transac Comput Biol Bioinforma 8(2):428–440

78. Huang Z, Shen H, Liu J, Zhou X (2011) Effective data co-reduction for multimedia similarity search. In: Proceedings of the 2011 ACM SIGMOD international conference on management of data. ACM, pp 1021–1032

79. Bleiholder J, Naumann F (2008) Data fusion. ACM Comput Surv (CSUR) 41(1):1

80. Brewer EA (2000) Towards robust distributed systems. In: PODC, p 7

81. Gilbert S, Lynch N (2002) Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59

82. McKusick MK, Quinlan S (2009) GFS: evolution on fast-forward. ACM Queue 7(7):10

83. Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J (2008) SCOPE: easy and efficient parallel processing of massive data sets. Proc VLDB Endowment 1(2):1265–1276

84. Beaver D, Kumar S, Li HC, Sobel J, Vajgel P, et al (2010) Finding a needle in Haystack: Facebook's photo storage. In: OSDI, vol 10, pp 1–8

85. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: Amazon's highly available key-value store. In: SOSP, vol 7, pp 205–220

86. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In: Proceedings of the twenty-ninth annual ACM symposium on theory of computing. ACM, pp 654–663

87. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2):4

88. Burrows M (2006) The Chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th symposium on operating systems design and implementation. USENIX Association, pp 335–350

89. Lakshman A, Malik P (2009) Cassandra: structured storage system on a P2P network. In: Proceedings of the 28th ACM symposium on principles of distributed computing. ACM, pp 5–5

90. George L (2011) HBase: the definitive guide. O'Reilly Media Inc

91. Judd D (2008) Hypertable-0.9.0.4-alpha

92. Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc

93. Crockford D (2006) The application/json media type for JavaScript Object Notation (JSON)

94. Murty J (2009) Programming Amazon Web Services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc

95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc

96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in MapReduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 975–986

97. Yang H-C, Parker DS (2009) Traverse: simplified indexing on large map-reduce-merge clusters. In: Database systems for advanced applications. Springer, pp 308–322

98. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with Sawzall. Sci Program 13(4):277–298

99. Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U (2009) Building a high-level dataflow system on top of map-reduce: the Pig experience. Proceedings VLDB Endowment 2(2):1414–1425

100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629

101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72

102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14

103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: Parallel and distributed processing, 2008, IEEE international symposium on IPDPS 2008. IEEE, pp 1–11

104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146

105. Bu Y, Bill H, Balazinska M, Ernst MD (2010) HaLoop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296

106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative MapReduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing. ACM, pp 810–818

107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2

108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: MapReduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing. ACM, p 7

109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) CIEL: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on networked systems design and implementation, p 9

110. Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York

111. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY, et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

112. What analytics, data mining, big data software you used in the past 12 months for a real project (2012). http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html

113. Berthold MR, Cebron N, Dill F, Gabriel TR, Kötter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer

114. Sallam RL, Richardson J, Hagerty J, Hostmann B (2011) Magic quadrant for business intelligence platforms. CT, Gartner Group, Stamford

115. Beyond the PC. Special report on personal technology (2011)

116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A, et al (2011) The iPlant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on software quality assurance. ACM, pp 70–77

118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers. ACM, pp 277–286

119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Record 34(2):18–26

120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM Sigmod Record 33(1):50–57

121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Transac Manag Inform Syst (TMIS) 3(2):7

122. Manning CD, Schütze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press

123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Transac Neural Netw 13(5):1163–1177

124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11

125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117

126. Konopnicki D, Shmueli O (1995) W3QS: a query system for the world-wide web. In: VLDB, vol 95, pp 54–65

127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640

128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval. ACM, p 2

129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25

130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19

131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819

132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939

133. Barragáns-Martínez AB, Costa-Montenegro E, Burguillo JC, Rey-López M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend TV programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311

134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91

135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia. ACM, pp 469–478

136. Hirsch JE (2005) An index to quantify an individual's scientific research output. Proc Natl Acad Sci USA 102(46):16569

137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company

138. Aggarwal CC (2011) An introduction to social network data analytics. Springer

139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 1046–1054

140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied computing. ACM, pp 1115–1116


141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Transac Knowl Discov Data (TKDD) 5(2):10

142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on World wide web. ACM, pp 631–640

143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis. ACM, pp 16–25

144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. ACM, pp 315–321

145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 145–158

146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 131–144

147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1007–1016

148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 807–816

149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on Web search and data mining. ACM, pp 657–666

150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360

151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (accessed 5 May 2013)

152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51

153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive RFID, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 2

154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval. ACM, pp 469–478

155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated, multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium. ACM, pp 445–454

156. Tasevski P (2011) Password attacks and generation strategies. Tartu University: Faculty of Mathematics and Computer Sciences



and opportunities for data analysis. From the data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis [138].

Research on link-based structural analysis has focused on link prediction, community discovery, social network evolution, and social influence analysis. An SNS may be visualized as a graph in which every vertex corresponds to a user and edges correspond to the relationships among users. Since SNS are dynamic networks, new vertices and edges are continually added to the graphs. Link prediction estimates the likelihood of a future connection between two vertices. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for vertices and uses the existing link information to train binary classifiers that predict future links [139]. Probabilistic methods aim to build models of the connection probabilities among vertices in SNS [140]. Linear-algebraic methods compute the similarity between two vertices from a rank-reduced similarity matrix [141]. A community is represented by a sub-graph in which the edges connecting vertices inside the sub-graph have high density, while the edges between two sub-graphs have much lower density [142].
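As a minimal sketch of the kind of neighborhood features a feature-based link predictor might start from, the following Python snippet scores every unconnected vertex pair by its number of common neighbors and its Jaccard similarity. The toy graph, the function names, and the choice of these two features are illustrative assumptions; they are not taken from [139–141].

```python
from itertools import combinations

def build_adjacency(edges):
    """Build an undirected adjacency map {vertex: set(neighbors)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def link_scores(adj):
    """Score every currently unconnected pair of vertices.

    Returns (u, v, common_neighbors, jaccard) tuples, sorted so that
    the pairs with the strongest neighborhood overlap come first.
    """
    scores = []
    for u, v in combinations(sorted(adj), 2):
        if v in adj[u]:
            continue  # already linked, nothing to predict
        common = adj[u] & adj[v]
        union = adj[u] | adj[v]
        jaccard = len(common) / len(union) if union else 0.0
        scores.append((u, v, len(common), jaccard))
    scores.sort(key=lambda s: (-s[2], -s[3], s[0], s[1]))
    return scores

# Hypothetical toy friendship graph (each undirected edge listed once).
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
adj = build_adjacency(edges)
ranked = link_scores(adj)
for u, v, common, jaccard in ranked:
    print(f"{u}-{v}: common={common}, jaccard={jaccard:.2f}")
```

In a feature-based classifier, scores like these would be only some of the input features; a binary classifier trained on past link formation would then decide which pairs to predict.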

Many methods for community detection have been proposed and studied, most of which are topology-based and rely on objective functions that capture the concept of community structure. Du et al. utilized the property of overlapping communities in real life to propose an effective large-scale SNS community detection method [143]. Research on SNS evolution aims to find laws and derive models that explain network evolution. Some empirical studies found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution [144–146], and some generation methods have been proposed to assist network and system design [147].
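The density-based notion of a community (dense inside, sparse between) can be made concrete with a short Python sketch. It only measures edge density for given vertex groups; it is not a detection algorithm and not the method of [142] or [143]. The graph and the function name are hypothetical.

```python
def edge_density(edges, group_a, group_b=None):
    """Fraction of possible edges that are present.

    With one group: density of edges inside that group.
    With two disjoint groups: density of edges running between them.
    Assumes an undirected graph with each edge listed exactly once.
    """
    a = set(group_a)
    if group_b is None:
        possible = len(a) * (len(a) - 1) // 2
        present = sum(1 for u, v in edges if u in a and v in a)
    else:
        b = set(group_b)
        possible = len(a) * len(b)
        present = sum(1 for u, v in edges
                      if (u in a and v in b) or (u in b and v in a))
    return present / possible if possible else 0.0

# Hypothetical graph: two triangles joined by a single bridge edge.
edges = [(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (3, 4)]
left, right = {1, 2, 3}, {4, 5, 6}

intra_left = edge_density(edges, left)     # edges inside the left group
intra_right = edge_density(edges, right)   # edges inside the right group
inter = edge_density(edges, left, right)   # edges between the two groups
print(intra_left, intra_right, inter)
```

A partition whose intra-group densities are much higher than its inter-group density matches the description of a community given above; detection methods search for such partitions.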

Social influence refers to the case when individuals change their behavior under the influence of others. The strength of social influence depends on the relationships among individuals, network distances, time effects, and the characteristics of networks and individuals, etc. Marketing, advertisement, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others [148, 149]. Generally, if the proliferation of content in SNS is considered, the performance of link-based structural analysis may be further improved.

Content-based analysis in SNS is also known as social media analysis. Social media include text, multimedia, positioning, and comments. However, social media analysis is confronted with unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time window. Second, social media data contains much noise. For example, the blogosphere contains a large number of spam blogs, and Twitter contains many trivial Tweets. Third, SNS are dynamic networks that change and are updated frequently and quickly. The existing research on social media analysis is still in its infancy. Considering that SNS contain massive information, transfer learning in heterogeneous networks aims to transfer knowledge among different media [150].

6.2.6 Mobile data analysis

By April 2013, Android Apps had provided more than 650,000 applications covering nearly all categories. By the end of 2012, the monthly mobile data flow had reached 885 PB [151]. The massive data and abundant applications call for mobile analysis, but also bring about a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, moving flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has started in different fields. Since this research is still at an early stage, we only introduce some recent and representative analysis applications in this section.

With the growth in the number of mobile users and improved performance, mobile phones are now useful for building and maintaining communities, such as communities with geographical locations and communities based on different cultural backgrounds and interests (e.g., the latest Webchat). Traditional network communities or SNS communities lack online interaction among members, and the communities are active only when members are sitting in front of computers. In contrast, mobile phones can support rich interaction at any time and anywhere. A mobile community is defined as a group of individuals with the same hobbies (i.e., health, safety, and entertainment, etc.) who gather together on networks, meet to set a common goal, decide measures through consultation to achieve the goal, and start to implement their plan [152]. In [153], the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

Recently, progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build body area networks for real-time monitoring of people's health. Generally, medical data from various sensors have different characteristics in terms of attributes, time and space relations, as well as physiological relations, etc.


In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop a smartphone application that analyzes people's walking pace (gait) and uses this information to unlock the security system [11]. Meanwhile, Robert Delano and Brian Parise from the Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with a built-in seismograph in a mobile phone, so as to cope with Parkinson's and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and find new business modes. In sales planning, after comparing massive data, enterprises can optimize their commodity prices. In operations, enterprises can improve their operation efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor costs. In the supply chain, using big data, enterprises may conduct inventory optimization, logistics optimization, and supplier coordination, etc., to mitigate the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that such activities as “Multi-times score accumulation” and “score exchange in shops” are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20 % of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers' transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.
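The text does not describe how the bank's warning model works, so the following Python sketch covers only the last step: given drop-out scores from some upstream model, select the highest-scoring fraction of customers for a retention offer. The customer ids, scores, and function name are hypothetical.

```python
import math

def select_retention_targets(scores, fraction=0.2):
    """Return the customer ids with the highest drop-out scores.

    scores   : dict mapping customer id -> estimated drop-out probability
    fraction : share of customers to target (0.2 means the top 20 %)
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    k = math.ceil(len(scores) * fraction)
    # Highest score first; ties are broken by id for a stable result.
    ranked = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return ranked[:k]

# Hypothetical scores produced by an upstream warning model.
scores = {"c01": 0.91, "c02": 0.12, "c03": 0.55, "c04": 0.78, "c05": 0.05,
          "c06": 0.33, "c07": 0.64, "c08": 0.08, "c09": 0.49, "c10": 0.21}
targets = select_retention_targets(scores)
print(targets)
```

Rounding up with `math.ceil` is a design choice made here so that small customer pools still yield at least one target; a real campaign might instead use a fixed score threshold.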

Obviously, the most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction time, commodity prices, and purchase quantities are recorded, along with, more importantly, the age, gender, address, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, market conditions of their brands, and consumers' behaviors, etc., and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more favorable prices. The credit loan of Alibaba automatically analyzes and judges whether to lend to enterprises through the acquired enterprise transaction data by virtue of big data technology, while manual intervention does not occur in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan with only about 0.3 % bad loans, which is much lower than those of other commercial banks.

6.3.2 Application of IoT based big data

IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises may have the most profound experience with the application of IoT big data. For example, UPS trucks are equipped with sensors, wireless adapters, and GPS, so that headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project cooperation between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and Miami city, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings about benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills this year by identifying and fixing running and leaking water pipes in a timely manner.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and connections among individuals based on an information network. Big data of online SNS mainly comes from instant messages, online social networking, micro blogs, and shared spaces, etc., which represent various user activities. The analysis of big data from online SNS uses computational analytical methods for understanding relations in human society, by virtue of theories and methods that involve mathematics, informatics, sociology, and management science, etc., from three dimensions: network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education, etc. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire values.

– Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preference, emotion, interest, and demand, etc. may be revealed.

– Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies, etc. aggregate relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.

The US Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

Fig. 5 Enabling technologies for online social network-oriented big data

In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted research that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans. The goal is to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting sharp growth or drops in the volume of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation from the official statistics of Indonesia matches the number of Tweets about rice prices on Twitter, as shown in Fig. 6.
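The first aspect, detecting sharp growth or drops in topic volume, can be illustrated with a simple rolling z-score check. This is a generic stand-in technique, not the method Global Pulse used (which the text does not specify); the daily counts, window size, and threshold below are hypothetical.

```python
from statistics import mean, pstdev

def detect_spikes(counts, window=7, threshold=3.0):
    """Flag indices whose count deviates sharply from the trailing window.

    counts    : list of per-period message counts for one topic
    window    : number of preceding periods used as the baseline
    threshold : how many baseline standard deviations count as "sharp"
    """
    flagged = []
    for i in range(window, len(counts)):
        baseline = counts[i - window:i]
        mu = mean(baseline)
        sigma = pstdev(baseline)
        if sigma == 0:
            # Flat baseline: any change at all is a deviation.
            deviates = counts[i] != mu
        else:
            deviates = abs(counts[i] - mu) / sigma >= threshold
        if deviates:
            flagged.append(i)
    return flagged

# Hypothetical daily counts of Tweets mentioning one topic:
# a sharp drop on day 7 and a sharp rise on day 15.
daily = [100, 104, 98, 101, 97, 103, 99, 35,
         100, 102, 98, 101, 99, 103, 100, 260]
spikes = detect_spikes(daily)
print(spikes)
```

One limitation of this sketch is that an outlier inflates the baseline's standard deviation for the next `window` periods, so a second anomaly shortly after the first can be masked; a median-based baseline would be more robust.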

Generally speaking, the application of big data from online SNS may help to better understand users' behavior and master the laws of social and economic activities from the following three aspects:

– Early Warning: to rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotion, and preference of users.

– Real-time Feedback: to acquire groups' feedback on social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the health care business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients over three consecutive years. In addition, it summarized the final result into an extremely personalized treatment plan to assess the risk factors and main treatment plans of patients. Doctors may then reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose five pounds of weight, or by suggesting that patients reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20.

The Mount Sinai Medical Center in the US utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

Fig. 6 The correlation between Tweets about rice price and food price inflation

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through the software development kit (SDK) and open interface.

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue in mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units and coordinate with mobile networks for the distribution of sensing tasks and the collection and utilization of sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of Crowdsourcing has been successfully applied to geotagged photographs, positioning and navigation, urban road traffic sensing, market forecasting, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach for problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, Crowdsourcing had been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of Crowdsourcing. The main idea of Crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, Crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, Spatial Crowdsourcing has become a hot topic. The operation framework of Spatial Crowdsourcing is as follows. A user requests a service and resources related to a specified location. Then the mobile users who are willing to participate in the task move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data is sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions they provide, it can be forecast that Spatial Crowdsourcing will become more prevalent than traditional Crowdsourcing, e.g., Amazon Mechanical Turk and Crowdflower.
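The request, move, and return steps of Spatial Crowdsourcing can be illustrated with a minimal Python sketch. It is not taken from any particular platform: the `Worker` record, the `assign_task` function, the nearest-first selection rule, and the 5 km radius are all assumptions made for illustration. Distance is computed with the standard haversine formula.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


@dataclass
class Worker:
    """Hypothetical record for a mobile user registered with the service."""
    name: str
    lat: float
    lon: float
    willing: bool  # whether the user volunteers for tasks


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def assign_task(task_lat, task_lon, workers, k=1, max_km=5.0):
    """Return up to k willing workers within max_km of the task, nearest first.

    Nearest-first selection is an assumed policy; real systems may also
    weigh reliability, cost, or expected travel time.
    """
    in_range = []
    for w in workers:
        if not w.willing:
            continue
        d = haversine_km(task_lat, task_lon, w.lat, w.lon)
        if d <= max_km:
            in_range.append((d, w))
    in_range.sort(key=lambda pair: pair[0])
    return [w for _, w in in_range[:k]]
```

A requester would call `assign_task` with the task location; the selected workers then travel to that location, capture the data, and send it back.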

6.3.6 Smart grid

Smart Grid is the next-generation power grid, constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart Grid related big data are generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nationwide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as circuit breaker monitors and transformers). Smart Grid brings about the following challenges on exploiting big data.

– Grid planning: By analyzing data in the Smart Grid, regions with excessively high electrical load or high power outage frequencies can be identified. Even transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance. For example, researchers from the University of California, Los Angeles designed an “electric map” according to big data theory, making a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit and shows the current power consumption of every block. It can even compare the power consumption of a block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides effective and visual load forecasts for power grid planning in a city. Power grid facilities in blocks with high power outage frequencies and serious overloads, as displayed in the map, may be given priority for transformation.

– Interaction between power generation and power consumption: An ideal power grid shall balance power generation and consumption. However, the traditional power grid is constructed based on a one-directional approach of transmission-transformation-distribution-consumption, which does not allow the generation capacity to be adjusted according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters have been developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which help the supplier read power utilization data every 15 min rather than every month as in the past. Labor cost for meter reading is greatly reduced. Because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and low periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help realize time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users.

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortage. Such energy resources can complement traditional hydropower and thermal power generation.
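The time-sharing dynamic pricing mentioned in the second item above can be sketched in a few lines of Python. The sketch assumes readings arrive as (interval start, kWh) pairs at 15-minute granularity; the peak window and both rates are made-up values for illustration, not tariffs of TXU Energy or any other supplier, and the function names are hypothetical.

```python
from datetime import datetime, timedelta

# All tariff values below are illustrative assumptions, not figures from any utility.
PEAK_HOURS = range(17, 21)   # assumed peak window: 17:00 to 20:59
PEAK_RATE = 0.30             # assumed price per kWh in the peak window
OFF_PEAK_RATE = 0.10         # assumed price per kWh at all other times


def rate_for(interval_start):
    """Return the price per kWh that applies to a metering interval."""
    return PEAK_RATE if interval_start.hour in PEAK_HOURS else OFF_PEAK_RATE


def bill(readings):
    """Total cost for (interval_start, kwh) pairs from 15-minute meter readings."""
    return sum(kwh * rate_for(start) for start, kwh in readings)


def peak_share(readings):
    """Fraction of total consumption that falls inside the peak window."""
    total = sum(kwh for _, kwh in readings)
    peak = sum(kwh for start, kwh in readings if start.hour in PEAK_HOURS)
    return peak / total if total else 0.0


def flat_day(kwh_per_interval, day=datetime(2014, 1, 1)):
    """Build one day of 96 synthetic 15-minute readings with constant usage."""
    return [(day + timedelta(minutes=15 * i), kwh_per_interval) for i in range(96)]


if __name__ == "__main__":
    readings = flat_day(0.25)
    print(round(bill(readings), 2), round(peak_share(readings), 3))
```

With monthly readings only the total is known, so a supplier cannot tell peak usage from off-peak usage; interval readings make a quantity such as `peak_share` observable, which is what allows the price to follow demand.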

7 Conclusion, open issues, and outlook

In this paper, we review the background and state-of-the-art of big data. First, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. Finally, we review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide readers with a comprehensive overview and big picture of this exciting area.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, which are discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

– Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to assess the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated once the system is implemented and deployed, so the advantages and disadvantages of alternative solutions cannot be compared side by side, nor can a system be compared before and after the adoption of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes: This includes memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.
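To make the MR (MapReduce [22]) mode named in the last item concrete, the following is a toy, single-process Python sketch of its three stages: map, shuffle, and reduce. It only illustrates the programming model; it does not use Hadoop or any distributed runtime, and the function names are chosen here for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three stages in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

In a real deployment the map and reduce calls run on many machines and the shuffle moves data across the network, which is one place where the data transfer bottleneck noted above appears.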

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor that restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.

– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor in improving big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems in big data processing arise beyond traditional data analysis, including (i) data re-utilization: with the increase of data scale, more value may be mined from re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized, from which more value can be mined; (iii) data exhaust, which means wrong data collected during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.
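The item on real-time performance speaks of a rate of depreciation of data without fixing a model. One possible reading, shown here purely as an assumption, is exponential decay with a half-life: a data item loses half of its remaining value every `half_life_seconds`. The function names and the 5% expiry threshold below are hypothetical.

```python
def data_value(initial_value, age_seconds, half_life_seconds):
    """Remaining value of a data item under assumed exponential depreciation."""
    if half_life_seconds <= 0:
        raise ValueError("half_life_seconds must be positive")
    return initial_value * 0.5 ** (age_seconds / half_life_seconds)


def is_expired(age_seconds, half_life_seconds, threshold=0.05):
    """Treat data as past its life cycle once less than `threshold` of its value remains."""
    return data_value(1.0, age_seconds, half_life_seconds) < threshold
```

Under this assumed model, a reading with a 60 s half-life retains half of its value after one minute and would be treated as expired after five minutes, since 0.5 ** 5 is below the 5% threshold.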

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved.

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning.

– Integration and provenance of big data: As discussed, the value acquired from comprehensive utilization of multiple datasets is far higher than the sum of the values of the individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and from different datasets.

– Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction of big data are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume grows fast, safety risks become more severe, while traditional data protection methods have been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security-related challenges.

– Big data privacy: Big data privacy includes two aspects: (i) protection of personal privacy during data acquisition: personal interests, habits, body properties, etc. of users may be more easily acquired, and users may not be aware of it; (ii) personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed a big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher at Skull Security, used an information acquisition tool to collect data from the public pages of Facebook users who had failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low-quality data wastes transmission and storage resources and has poor usability. Many factors may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though many measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.

– Big data safety mechanism: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods designed for small and medium-scale data cannot meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safe communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants’ data should be enabled on the premise of efficiency assurance.

– Big data application in information security: Big data not only brings about challenges to information security but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) through analysis of big data in the form of log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, and attack characteristics may also be more easily identified through the analysis of big data.
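Three of the four data quality dimensions named in the data quality item above (completeness, redundancy, and consistency) can be measured automatically on a batch of records; accuracy needs a ground-truth reference and is left out. The Python sketch below shows one simple way to define them. The metric definitions, the function names, and the sample records in the usage note are illustrative assumptions, not a standard.

```python
def completeness(records, fields):
    """Fraction of required field values that are present (not None or empty)."""
    total = len(records) * len(fields)
    if total == 0:
        return 1.0
    present = sum(
        1 for record in records for field in fields
        if record.get(field) not in (None, "")
    )
    return present / total


def redundancy(records, key_fields):
    """Fraction of records that repeat an earlier record on key_fields."""
    if not records:
        return 0.0
    seen = set()
    duplicates = 0
    for record in records:
        key = tuple(record.get(field) for field in key_fields)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates / len(records)


def consistency(records, rule):
    """Fraction of records that satisfy rule, a predicate over one record."""
    if not records:
        return 1.0
    return sum(1 for record in records if rule(record)) / len(records)
```

For instance, on hypothetical smart meter records with fields `id` and `kwh`, a consistency rule could require that `kwh` is either missing or non-negative; records that fail it are candidates for automatic repair or removal.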

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, energy-efficient distributed storage, and hardware and software system architectures for processing. In particular, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the “T” (Technology) was the main concern, and technology drove the development of data. In the big data era, with the prominence of data value and advances in “I” (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact but also influence everyone’s ways of living and thinking, which is already happening. We cannot predict the future, but we may take precautions for possible future events.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced, given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have been seeking better ways to cope with larger-scale, higher-diversity, and more complexly structured data. These efforts are represented by Google’s globally distributed database Spanner and the fault-tolerant, expandable, distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security. Then, the impacts of big data on production management, business operation, decision making, etc. shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., in text and image editors. In big data applications, mixed data is very useful for decision making. Only when analytical results are displayed in a user-friendly way can they be effectively utilized by users. Reports, histograms, pie charts, regression curves, etc. are frequently used to visualize results of data analysis. New presentation forms will emerge in the future; e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it is observed that the role of data is becoming increasingly significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed that of business logic. Consequently, programs are gradually transforming from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods will emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

    – During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

    – Compared with accurate data, we would like to accept numerous and complicated data.

    – We shall pay greater attention to correlations between things rather than exploring causal relationships.

    – The simple algorithms of big data are more effective than complex algorithms of small data.

    – Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace “experts.”

Throughout the history of human society, the demands and willingness of human beings have always been the driving forces of scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao’s research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

                                                      References

1. Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

2. Fact sheet: Big data across the federal government (2012). http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf

3. Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper

4. Drowning in numbers - digital data will flood the planet - and help us understand it better (2011). http://www.economist.com/blogs/dailychart/2011/11/bigdata-0

5. Lohr S (2012) The age of big data. New York Times, pp 11

6. Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/thedigitalbreadcrumbs-that-lead-to-big-data

7. Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-searchforanalysts-to-make-sense-of-big-data

8. Big data (2008). http://www.nature.com/news/specials/bigdata/index.html

9. Special online collection: dealing with big data (2011). http://www.sciencemag.org/site/special/data

10. Manyika J, McKinsey Global Institute, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute

11. Mayer-Schönberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Eamon Dolan/Houghton Mifflin Harcourt

12. Laney D (2001) 3-d data management: controlling data volume, velocity and variety. META Group Research Note, 6 February

13. Zikopoulos P, Eaton C et al (2011) Understanding big data: analytics for enterprise class hadoop and streaming data. McGraw-Hill Osborne Media

14. Meijer E (2011) The world according to linq. Communications of the ACM 54(10):45–51

15. Beyer M (2011) Gartner says solving big data challenge involves more than just managing volumes of data. Gartner. http://www.gartner.com/it/page.jsp

16. O R Team (2011) Big data now: current perspectives from O’Reilly Radar. O’Reilly Media

17. Grobelnik M (2012) Big data tutorial. http://videolectures.net/eswc2012grobelnikbigdata

18. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2008) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014

19. DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6):85–98

20. Walter T (2009) Teradata past, present, and future. UCI ISG lecture series on scalable data management

21. Ghemawat S, Gobioff H, Leung S-T (2003) The google file system. In: ACM SIGOPS Operating Systems Review, vol 37. ACM, pp 29–43

22. Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

23. Hey AJG, Tansley S, Tolle KM et al (2009) The fourth paradigm: data-intensive scientific discovery

24. Howard JH, Kazar ML, Menees SG, Nichols DA, Satyanarayanan M, Sidebotham RN, West MJ (1988) Scale and performance in a distributed file system. ACM Trans Comput Syst (TOCS) 6(1):51–81

25. Cattell R (2011) Scalable sql and nosql data stores. ACM SIGMOD Record 39(4):12–27

26. Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033

27. Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98

28. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J et al (2012) Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States

29. Sun Y, Chen M, Liu B, Mao S (2013) Far: a fault-avoidant routing method for data center networks with regular topology. In: Proceedings of ACM/IEEE symposium on architectures for networking and communications systems (ANCS’13). ACM

30. Wiki (2013) Applications and organizations using hadoop. http://wiki.apache.org/hadoop/PoweredBy

31. Bahga A, Madisetti VK (2012) Analyzing massive machine maintenance data in a computing cloud. IEEE Transac Parallel Distrib Syst 23(10):1831–1843

32. Gunarathne T, Wu T-L, Choi JY, Bae S-H, Qiu J (2011) Cloud computing paradigms for pleasingly parallel biomedical applications. Concurr Comput Prac Experience 23(17):2338–2354

33. Gantz J, Reinsel D (2010) The digital universe decade - are you ready? External publication of IDC (Analyse the Future) information and data, pp 1–16

34. Bryant RE (2011) Data-intensive scalable computing for scientific applications. Comput Sci Eng 13(6):25–33

35. Wahab MHA, Mohd MNH, Hanafi HF, Mohsin MFM (2008) Data pre-processing on web server logs for generalized association rules mining algorithm. World Acad Sci Eng Technol 48:2008

36. Nanopoulos A, Manolopoulos Y, Zakrzewicz M, Morzy T (2002) Indexing web access-logs for pattern queries. In: Proceedings of the 4th international workshop on web information and data management. ACM, pp 63–68

37. Joshi KP, Joshi A, Yesha Y (2003) On using a warehouse to analyze web logs. Distrib Parallel Databases 13(2):161–180

38. Chandramohan V, Christensen K (2002) A first look at wired sensor networks for video surveillance systems. In: Proceedings LCN 2002, 27th annual IEEE conference on local computer networks. IEEE, pp 728–729

39. Selavo L, Wood A, Cao Q, Sookoor T, Liu H, Srinivasan A, Wu Y, Kang W, Stankovic J, Young D et al (2007) Luster: wireless sensor network for environmental research. In: Proceedings of the 5th international conference on Embedded networked sensor systems. ACM, pp 103–116

40. Barrenetxea G, Ingelrest F, Schaefer G, Vetterli M, Couach O, Parlange M (2008) Sensorscope: out-of-the-box environmental monitoring. In: Information processing in sensor networks, 2008, international conference on IPSN’08. IEEE, pp 332–343

41. Kim Y, Schmid T, Charbiwala ZM, Friedman J, Srivastava MB (2008) Nawms: nonintrusive autonomous water monitoring system. In: Proceedings of the 6th ACM conference on Embedded network sensor systems. ACM, pp 309–322

42. Kim S, Pakzad S, Culler D, Demmel J, Fenves G, Glaser S, Turon M (2007) Health monitoring of civil infrastructures using wireless sensor networks. In: Information Processing in Sensor Networks, 2007, 6th International Symposium on IPSN 2007. IEEE, pp 254–263

                                                      43 Ceriotti M Mottola L Picco GP Murphy AL Guna S Corra MPozzi M Zonta D Zanon P (2009) Monitoring heritage build-ings with wireless sensor networks the torre aquila deploymentIn Proceedings of the 2009 International Conference on Infor-mation Processing in Sensor Networks IEEE Computer Societypp 277ndash288

                                                      44 Tolle G Polastre J Szewczyk R Culler D Turner N Tu KBurgess S Dawson T Buonadonna P Gay D et al (2005) Amacroscope in the redwoods In Proceedings of the 3rd interna-tional conference on embedded networked sensor systems ACMpp 51ndash63

                                                      45 Wang F Liu J (2011) Networked wireless sensor data collectionissues challenges and approaches IEEE Commun Surv Tutor13(4)673ndash687

                                                      46 Cho J Garcia-Molina H (2002) Parallel crawlers In Proceedingsof the 11th international conference on World Wide Web ACMpp 124ndash135

                                                      47 Choudhary S Dincturk ME Mirtaheri SM Moosavi A vonBochmann G Jourdan G-V Onut I-V (2012) Crawling rich inter-net applications the state of the art In CASCON pp 146ndash160

                                                      48 Ghani N Dixit S Wang T-S (2000) On ip-over-wdm integrationIEEE Commun Mag 38(3)72ndash84

                                                      49 Manchester J Anderson J Doshi B Dravida S Ip over sonet(1998) IEEE Commun Mag 36(5)136ndash142

                                                      50 Jinno M Takara H Kozicki B (2009) Dynamic optical mesh net-works drivers challenges and solutions for the future In Opticalcommunication 2009 35th European conference on ECOCrsquo09IEEE pp 1ndash4

                                                      51 Barroso LA Holzle U (2009) The datacenter as a computer anintroduction to the design of warehouse-scale machines SyntLect Comput Archit 4(1)1ndash108

                                                      52 Armstrong J (2009) Ofdm for optical communications J LightTechnol 27(3)189ndash204

                                                      53 Shieh W (2011) Ofdm for flexible high-speed optical networksJ Light Technol 29(10)1560ndash1577

                                                      54 Cisco data center interconnect design and deployment guide(2010)

                                                      55 Greenberg A Hamilton JR Jain N Kandula S Kim C LahiriP Maltz DA Patel P Sengupta S (2009) Vl2 a scalable andflexible data center network In ACM SIGCOMM computercommunication review vol 39 ACM pp 51ndash62

                                                      56 Guo C Lu G Li D Wu H Zhang X Shi Y Tian C Zhang YLu S (2009) Bcube a high performance server-centric networkarchitecture for modular data centers ACM SIGCOMM ComputCommun Rev 39(4)63ndash74

                                                      57 Farrington N Porter G Radhakrishnan S Bazzaz HHSubramanya V Fainman Y Papen G Vahdat A (2011) Heliosa hybrid electricaloptical switch architecture for modular datacenters ACM SIGCOMM Comput Commun Rev 41(4)339ndash350

                                                      58 Abu-Libdeh H Costa P Rowstron A OrsquoShea G Donnelly A(2010) Symbiotic routing in future data centers ACM SIG-COMM Comput Commun Rev 40(4)51ndash62

                                                      59 Lam C Liu H Koley B Zhao X Kamalov V Gill V Fiberoptic communication technologies whatrsquos needed for datacenternetwork operations (2010) IEEE Commun Mag 48(7)32ndash39

                                                      206 Mobile Netw Appl (2014) 19171ndash209

                                                      60 Wang G Andersen DG Kaminsky M Papagiannaki K NgTS Kozuch M Ryan M (2010) c-through Part-time optics indata centers In ACM SIGCOMM Computer CommunicationReview vol 40 ACM pp 327ndash338

                                                      61 Ye X Yin Y Yoo SJB Mejia P Proietti R Akella V (2010) Dosa scalable optical switch for datacenters In Proceedings of the6th ACMIEEE symposium on architectures for networking andcommunications systems ACM p 24

                                                      62 Singla A Singh A Ramachandran K Xu L Zhang Y (2010) Pro-teus a topology malleable data center network In Proceedingsof the 9th ACM SIGCOMM workshop on hot topics in networksACM p 8

                                                      63 Liboiron-Ladouceur O Cerutti I Raponi PG Andriolli NCastoldi P (2011) Energy-efficient design of a scalable opti-cal multiplane interconnection architecture IEEE J Sel TopQuantum Electron 17(2)377ndash383

                                                      64 Kodi AK Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance comput-ing (hpc) systems IEEE J Sel Top Quantum Electron 17(2)384ndash395

                                                      65 Zhou X Zhang Z Zhu Y Li Y Kumar S Vahdat A Zhao BYZheng H (2012) Mirror mirror on the ceiling flexible wirelesslinks for data centers ACM SIGCOMM Comput Commun Rev42(4)443ndash454

                                                      66 Lenzerini M (2002) Data integration a theoretical perspectiveIn Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems ACMpp 233ndash246

                                                      67 Cafarella MJ Halevy A Khoussainova N (2009) Data integra-tion for the relational web Proc VLDB Endowment 2(1)1090ndash1101

                                                      68 Maletic JI Marcus A (2000) Data cleansing beyond integrityanalysis In IQ Citeseer pp 200ndash209

                                                      69 Kohavi R Mason L Parekh R Zheng Z (2004) Lessons andchallenges from mining retail e-commerce data Mach Learn57(1-2)83ndash113

                                                      70 Chen H Ku W-S Wang H Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing In Proceedings ofthe 2010 ACM SIGMOD international conference on manage-ment of data ACM pp 51ndash62

                                                      71 Zhao Z Ng W (2012) A model-based approach for rfid datastream cleansing In Proceedings of the 21st ACM internationalconference on information and knowledge management ACMpp 862ndash871

                                                      72 Khoussainova N Balazinska M Suciu D (2008) Probabilisticevent extraction from rfid data In Data Engineering 2008IEEE 24th international conference on ICDE 2008 IEEE pp1480ndash1482

                                                      73 Herbert KG Wang JTL (2007) Biological data cleaning a casestudy Int J Inf Qual 1(1)60ndash82

                                                      74 Tsai T-H Lin C-Y (2012) Exploring contextual redundancy inimproving object-based video coding for video sensor networkssurveillance IEEE Transac Multmed 14(3)669ndash682

                                                      75 Sarawagi S Bhamidipaty A (2002) Interactive deduplicationusing active learning In Proceedings of the eighth ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 269ndash278

                                                      76 Kamath U Compton J Dogan RI Jong KD Shehu A (2012)An evolutionary algorithm approach for feature generation fromsequence data and its application to dna splice site predic-tion IEEEACM Transac Comput Biol Bioinforma (TCBB)9(5)1387ndash1398

                                                      77 Leung K-S Lee KH Wang J-F Ng EYT Chan HLY Tsui SKWMok TSK Tse PC-H Sung JJ-Y (2011) Data mining on dnasequences of hepatitis b virus IEEEACM Transac Comput BiolBioinforma 8(2)428ndash440

                                                      78 Huang Z Shen H Liu J Zhou X (2011) Effective data co-reduction for multimedia similarity search In Proceedings of the2011 ACM SIGMOD International Conference on Managementof data ACM pp 1021ndash1032

                                                      79 Bleiholder J Naumann F (2008) Data fusion ACM Comput Surv(CSUR) 41(1)1

                                                      80 Brewer EA (2000) Towards robust distributed systems InPODC p 7

                                                      81 Gilbert S Lynch N (2002) Brewerrsquos conjecture and the feasibil-ity of consistent available partition-tolerant web services ACMSIGACT News 33(2)51ndash59

                                                      82 McKusick MK Quinlan S (2009) Gfs eqvolution on fast-forward ACM Queue 7(7)10

                                                      83 Chaiken R Jenkins B Larson P-A Ramsey B Shakib D WeaverS Zhou J (2008) Scope easy and efficient parallel process-ing of massive data sets Proc VLDB Endowment 1(2)1265ndash1276

                                                      84 Beaver D Kumar S Li HC Sobel J Vajgel P et al (2010) Findinga needle in haystack facebookrsquos photo storage In OSDI vol 10pp 1ndash8

                                                      85 DeCandia G Hastorun D Jampani M Kakulapati G LakshmanA Pilchin A Sivasubramanian S Vosshall P Vogels W (2007)Dynamo amazonrsquos highly available key-value store In SOSPvol 7 pp 205ndash220

                                                      86 Karger D Lehman E Leighton T Panigrahy R Levine MLewin D (1997) Consistent hashing and random trees distributedcaching protocols for relieving hot spots on the world wide webIn Proceedings of the twenty-ninth annual ACM symposium ontheory of computing ACM pp 654ndash663

                                                      87 Chang F Dean J Ghemawat S Hsieh WC Wallach DA BurrowsM Chandra T Fikes A Gruber RE (2008) Bigtable a distributedstorage system for structured data ACM Trans Comput Syst(TOCS) 26(2)4

                                                      88 Burrows M (2006) The chubby lock service for loosely-coupleddistributed systems In Proceedings of the 7th symposium onOperating systems design and implementation USENIX Associ-ation pp 335ndash350

                                                      89 Lakshman A Malik P (2009) Cassandra structured storagesystem on a p2p network In Proceedings of the 28th ACMsymposium on principles of distributed computing ACMpp 5ndash5

                                                      90 George L (2011) HBase the definitive guide OrsquoReilly Media Inc91 Judd D (2008) hypertable-09 04-alpha92 Chodorow K (2013) MongoDB the definitive guide OrsquoReilly

                                                      Media Inc93 Crockford D (2006) The applicationjson media type for

                                                      javascript object notation (json)94 Murty J (2009) Programming amazon web services S3 EC2

                                                      SQS FPS and SimpleDB OrsquoReilly Media Inc95 Anderson JC Lehnardt J Slater N (2010) CouchDB the defini-

                                                      tive guide OrsquoReilly Media Inc96 Blanas S Patel JM Ercegovac V Rao J Shekita EJ Tian Y

                                                      (2010) A comparison of join algorithms for log processing inmapreduce In Proceedings of the 2010 ACM SIGMOD inter-national conference on management of data ACM pp 975ndash986

                                                      97 Yang H-C Parker DS (2009) Traverse simplified indexingon large map-reduce-merge clusters In Database systems foradvanced applications Springer pp 308ndash322

                                                      98 Pike R Dorward S Griesemer R Quinlan S (2005) Interpretingthe data parallel analysis with sawzall Sci Program 13(4)277ndash298

                                                      99 Gates AF Natkovich O Chopra S Kamath P NarayanamurthySM Olston C Reed B Srinivasan S Srivastava U (2009) Build-ing a high-level dataflow system on top of map-reduce the pigexperience Proceedings VLDB Endowment 2(2)1414ndash1425

                                                      Mobile Netw Appl (2014) 19171ndash209 207

                                                      100 Thusoo A Sarma JS Jain N Shao Z Chakka P Anthony SLiu H Wyckoff P Murthy R (2009) Hive a warehousing solu-tion over a map-reduce framework Proc VLDB Endowment2(2)1626ndash1629

                                                      101 Isard M Budiu M Yu Y Birrell A Fetterly D (2007) Dryad dis-tributed data-parallel programs from sequential building blocksACM SIGOPS Oper Syst Rev 41(3)59ndash72

                                                      102 Yu Y Isard M Fetterly D Budiu M Erlingsson U Gunda PKCurrey J (2008) Dryadlinq a system for general-purpose dis-tributed data-parallel computing using a high-level language InOSDI vol 8 pp 1ndash14

                                                      103 Moretti C Bulosan J Thain D Flynn PJ (2008) All-pairs anabstraction for data-intensive cloud computing In Parallel anddistributed processing 2008 IEEE international symposium onIPDPS 2008 IEEE pp 1ndash11

                                                      104 Malewicz G Austern MH Bik AJC Dehnert JC Horn I LeiserN Czajkowski G (2010) Pregel a system for large-scale graphprocessing In Proceedings of the 2010 ACM SIGMOD interna-tional conference on management of data ACM pp 135ndash146

                                                      105 Bu Y Bill H Balazinska M Ernst MD (2010) Haloop effi-cient iterative data processing on large clusters Proc VLDBEndowment 3(1-2)285ndash296

                                                      106 Ekanayake J Li H Zhang B Gunarathne T Bae S-H Qiu JFox G (2010) Twister a runtime for iterative mapreduce InProceedings of the 19th ACM international symposium on highperformance distributed computing ACM pp 810ndash818

                                                      107 Zaharia M Chowdhury M Das T Dave A Ma J McCauleyM Franklin M Shenker S Stoica I (2012) Resilient distributeddatasets a fault-tolerant abstraction for in-memory cluster com-puting In Proceedings of the 9th USENIX conference onnetworked systems design and implementation USENIX Asso-ciation pp 2ndash2

                                                      108 Bhatotia P Wieder A Rodrigues R Acar UA Pasquin R (2011)Incoop mapreduce for incremental computations In Proceed-ings of the 2nd ACM symposium on cloud computing ACMp 7

                                                      109 Murray DG Schwarzkopf M Smowton C Smith SMadhavapeddy A Hand S (2011) Ciel a universal executionengine for distributed data-flow computing In Proceedings ofthe 8th USENIX conference on Networked systems design andimplementation p 9

                                                      110 Anderson TW (1958) An introduction to multivariate statisticalanalysis vol 2 Wiley New York

                                                      111 Wu X Kumar V Quinlan JR Ghosh J Yang Q Motoda HMcLachlan GJ Ng A Liu B Philip SY et al (2008) Top 10algorithms in data mining Knowl Inf Syst 14(1)1ndash37

                                                      112 What analytics data mining big data software you used in thepast 12 months for a real project (2012) httpwwwkdnuggetscompolls2012analytics-data-mining-big-data-softwarehtml

                                                      113 Berthold MR Cebron N Dill F Gabriel TR Kotter T MeinlT Ohl P Sieb C Thiel K Wiswedel B (2008) KNIME theKonstanz information miner Springer

                                                      114 Sallam RL Richardson J Hagerty J Hostmann B (2011) Magicquadrant for business intelligence platforms CT Gartner GroupStamford

                                                      115 Beyond the PC Special Report on Personal Technology (2011)116 Goff SA Vaughn M McKay S Lyons E Stapleton AE Gessler

                                                      D Matasci N Wang L Hanlon M Lenards A et al (2011) Theiplant collaborative cyberinfrastructure for plant biology FrontPlant Sci 34(2)1ndash16 doi103389fpls201100034

                                                      117 Baah GK Gray A Harrold MJ (2006) On-line anomaly detectionof deployed software a statistical machine learning approachIn Proceedings of the 3rd international workshop on Softwarequality assurance ACM pp 70ndash77

                                                      118 Moeng M Melhem R (2010) Applying statistical machine learn-ing to multicore voltage amp frequency scaling In Proceedings of

                                                      the 7th ACM international conference on computing frontiersACM pp 277ndash286

                                                      119 Gaber MM Zaslavsky A Krishnaswamy S (2005) Mining datastreams a review ACM Sigmod Record 34(2)18ndash26

                                                      120 Verykios VS Bertino E Fovino IN Provenza LP Saygin YTheodoridis Y (2004) State-of-the-art in privacy preserving datamining ACM Sigmod Record 33(1)50ndash57

                                                      121 van der Aalst W (2012) Process mining overview and opportu-nities ACM Transac Manag Inform Syst (TMIS) 3(2)7

                                                      122 Manning CD Schutze H (1999) Foundations of statistical naturallanguage processing vol 999 MIT Press

                                                      123 Pal SK Talwar V Mitra P (2002) Web mining in soft computingframework relevance state of the art and future directions IEEETransac Neural Netw 13(5)1163ndash1177

                                                      124 Chakrabarti S (2000) Data mining for hypertext a tutorial surveyACM SIGKDD Explor Newsl 1(2)1ndash11

                                                      125 Brin S Page L (1998) The anatomy of a large-scale hypertextualweb search engine Comput Netw ISDN Syst 30(1)107ndash117

                                                      126 Konopnicki D Shmueli O (1995) W3qs a query system for theworld-wide web In VLDB vol 95 pp 54ndash65

                                                      127 Chakrabarti S Van den Berg M Dom B (1999) Focused crawl-ing a new approach to topic-specific web resource discoveryComput Netw 31(11)1623ndash1640

                                                      128 Ding D Metze F Rawat S Schulam PF Burger S YounessianE Bao L Christel MG Hauptmann A (2012) Beyond audio andvideo retrieval towards multimedia summarization In Proceed-ings of the 2nd ACM international conference on multimediaretrieval ACM pp 2

                                                      129 Wang M Ni B Hua X-S Chua T-S (2012) Assistive tag-ging a survey of multimedia tagging with human-computer jointexploration ACM Comput Surv (CSUR) 44(4)25

                                                      130 Lew MS Sebe N Djeraba C Jain R (2006) Content-based multi-media information retrieval state of the art and challenges ACMTrans Multimed Comput Commun Appl (TOMCCAP) 2(1)1ndash19

                                                      131 Hu W Xie N Li L Zeng X Maybank S (2011) A survey onvisual content-based video indexing and retrieval IEEE TransSyst Man Cybern Part C Appl Rev 41(6)797ndash819

                                                      132 Park Y-J Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recom-mendation Expert Syst Appl 36(2)1932ndash1939

                                                      133 Barragans-Martınez AB Costa-Montenegro E Burguillo JCRey-Lopez M Mikic-Fonte FA Peleteiro A (2010) A hybridcontent-based and item-based collaborative filtering approach torecommend tv programs enhanced with singular value decompo-sition Inf Sci 180(22)4290ndash4311

                                                      134 Naphade M Smith JR Tesic J Chang S-F Hsu W Kennedy LHauptmann A Curtis J (2006) Large-scale concept ontology formultimedia IEEE Multimedia 13(3)86ndash91

                                                      135 Ma Z Yang Y Cai Y Sebe N Hauptmann AG (2012) Knowl-edge adaptation for ad hoc multimedia event detection withfew exemplars In Proceedings of the 20th ACM internationalconference on multimedia ACM pp 469ndash478

                                                      136 Hirsch JE (2005) An index to quantify an individualrsquos scientificresearch output Proc Natl Acad Sci USA 102(46)16569

                                                      137 Watts DJ (2004) Six degrees the science of a connected ageWW Norton amp Company

                                                      138 Aggarwal CC (2011) An introduction to social network dataanalytics Springer

                                                      139 Scellato S Noulas A Mascolo C (2011) Exploiting place fea-tures in link prediction on location-based social networks InProceedings of the 17th ACM SIGKDD international conferenceon Knowledge discovery and data mining ACM pp 1046ndash1054

                                                      140 Ninagawa A Eguchi K (2010) Link prediction using proba-bilistic group models of network structure In Proceedings ofthe 2010 ACM symposium on applied Computing ACM pp1115ndash1116

                                                      208 Mobile Netw Appl (2014) 19171ndash209

                                                      141 Dunlavy DM Kolda TG Acar E (2011) Temporal link predic-tion using matrix and tensor factorizations ACM Transac KnowlDiscov Data (TKDD) 5(2)10

                                                      142 Leskovec J Lang KJ Mahoney M (2010) Empirical comparisonof algorithms for network community detection In Proceedingsof the 19th international conference on World wide web ACMpp 631ndash640

                                                      143 Du N Wu B Pei X Wang B Xu L (2007) Community detec-tion in large-scale social networks In Proceedings of the 9thWebKDD and 1st SNA-KDD 2007 workshop on Web mining andsocial network analysis ACM pp 16ndash25

                                                      144 Garg S Gupta T Carlsson N Mahanti A (2009) Evolution of anonline social aggregation network an empirical study In Pro-ceedings of the 9th ACM SIGCOMM conference on Internetmeasurement conference ACM pp 315ndash321

                                                      145 Allamanis M Scellato S Mascolo C (2012) Evolution of alocation-based online social network analysis and models InProceedings of the 2012 ACM conference on Internet measure-ment conference ACM pp 145ndash158

                                                      146 Gong NZ Xu W Huang L Mittal P Stefanov E Sekar V SongD (2012) Evolution of social-attribute networks measurementsmodeling and implications using google+ In Proceedings ofthe 2012 ACM conference on Internet measurement conferenceACM pp 131ndash144

                                                      147 Zheleva E Sharara H Getoor L (2009) Co-evolution of socialand affiliation networks In Proceedings of the 15th ACMSIGKDD international conference on Knowledge discovery anddata mining ACM pp 1007ndash1016

                                                      148 Tang J Sun J Wang C Yang Z (2009) Social influence anal-ysis in large-scale networks In Proceedings of the 15th ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 807ndash816

                                                      149 Li Y Chen W Wang Y Zhang Z-L (2013) Influence diffusiondynamics and influence maximization in social networks withfriend and foe relationships In Proceedings of the sixth ACMinternational conference on Web search and data mining ACMpp 657ndash666

                                                      150 Dai W Chen Y Xue G-R Yang Q Yu Y (2008) Translatedlearning transfer learning across different feature spaces InAdvances in neural information processing systems pp 353ndash360

                                                      151 Cisco Visual Networking Index (2013) Global mobile data traf-fic forecast update 2012ndash2017 httpwwwciscocomenUSsolutionscollateralns341ns525ns537ns705ns827white paperc11-520862html (Son erisim 5 Mayıs 2013)

                                                      152 Rhee Y Lee J (2009) On modeling a model of mobile com-munity designing user interfaces to support group interactionInteractions 16(6)46ndash51

                                                      153 Han J Lee J-G Gonzalez H Li X (2008) Mining massive rfidtrajectory and traffic data sets In Proceedings of the 14th ACMSIGKDD international conference on knowledge discovery anddata mining ACM p 2

                                                      154 Garg MK Kim D-J Turaga DS Prabhakaran B (2010) Mul-timodal analysis of body sensor network data streams forreal-time healthcare In Proceedings of the international con-ference on multimedia information retrieval ACM pp 469ndash478

                                                      155 Park Y Ghosh J (2012) A probabilistic imputation frameworkfor predictive analysis using variably aggregated multi-sourcehealthcare data In Proceedings of the 2nd ACM SIGHITinternational health informatics symposium ACM pp 445ndash454

                                                      156 Tasevski P (2011) Password attacks and generation strate-gies Tartu University Faculty of Mathematics and ComputerSciences

                                                      Mobile Netw Appl (2014) 19171ndash209 209

                                                                                                          • Outlook
                                                                                                            • Acknowledgments
                                                                                                            • References

In addition, such datasets involve privacy and safety protection. In [154], Garg et al. introduce a multi-modal transport analysis mechanism of raw data for real-time monitoring of health. Under the circumstance that only highly comprehensive characteristics related to health are available, Park et al. in [155] examined approaches to better utilize

Researchers from Gjovik University College in Norway and Derawi Biometrics collaborated to develop an application for smart phones that analyzes people's pace when they walk and uses the pace information to unlock the phone's security system [11]. Meanwhile, Robert Delano and Brian Parise from Georgia Institute of Technology developed an application called iTrem, which monitors human body trembling with a built-in seismograph in a mobile phone, so as to cope with Parkinson's disease and other nervous system diseases [11].

6.3 Key applications of big data

6.3.1 Application of big data in enterprises

At present, big data mainly comes from and is mainly used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many respects. In marketing, correlation analysis of big data lets enterprises predict consumer behavior more accurately and find new business modes. In sales planning, enterprises can optimize their commodity prices after comparing massive data. In operations, enterprises can improve their operational efficiency and satisfaction, optimize the labor force, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor costs. In the supply chain, enterprises may use big data for inventory optimization, logistics optimization, and supplier coordination, etc., to narrow the gap between supply and demand, control budgets, and improve services.

In finance, the application of big data in enterprises has developed rapidly. For example, China Merchants Bank (CMB) used data analysis to recognize that activities such as “multi-times score accumulation” and “score exchange in shops” are effective for attracting quality customers. By building a customer drop-out warning model, the bank can sell high-yield financial products to the top 20% of customers who are most likely to drop out, so as to retain them. As a result, the drop-out ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15% and 7%, respectively. By analyzing customers' transaction records, potential small business customers can be efficiently identified. By utilizing remote banking and the cloud referral platform to implement cross-selling, considerable performance gains were achieved.
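The selection step of such a drop-out warning model can be sketched as follows. This is only an illustration: the `Customer` type, the `churn_score` field, and the `top_risk_customers` helper are hypothetical names, and the scores are assumed to come from some previously trained model that the source does not describe.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class Customer:
    customer_id: str
    churn_score: float  # hypothetical model output; higher means more likely to drop out


def top_risk_customers(customers, fraction=0.2):
    """Return the given fraction of customers with the highest drop-out scores."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Rank from most to least likely to drop out, then keep the top slice.
    ranked = sorted(customers, key=lambda c: c.churn_score, reverse=True)
    k = ceil(len(ranked) * fraction)
    return ranked[:k]


if __name__ == "__main__":
    sample = [
        Customer("a", 0.91), Customer("b", 0.12), Customer("c", 0.55),
        Customer("d", 0.78), Customer("e", 0.30),
    ]
    # The customers returned here would be the ones offered retention products.
    print([c.customer_id for c in top_risk_customers(sample)])
```

Rounding up with `ceil` means a small customer base still yields at least one target; a real deployment would choose the cut-off according to business constraints.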

The most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded together with, more importantly, the age, gender, address, and even hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, market conditions of their brands, and consumers' behaviors, etc., and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more favorable prices. The credit loan service of Alibaba automatically analyzes and judges whether to lend to enterprises based on the acquired enterprise transaction data by virtue of big data technology, with no manual intervention in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion with only about 0.3% bad loans, which is much lower than those of other commercial banks.

6.3.2 Application of IoT based big data

IoT is not only an important source of big data, but also one of the main markets of big data applications. Because of the high variety of objects, the applications of IoT also evolve endlessly.

Logistics enterprises may have the most profound experience with the application of IoT big data. For example, UPS trucks are equipped with sensors, wireless adapters, and GPS, so that headquarters can track truck positions and prevent engine failures. Meanwhile, this system also helps UPS to supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified for UPS trucks are derived from their past driving experience. In 2011, UPS drivers drove nearly 48.28 million km less.

Smart city is a hot research area based on the application of IoT data. For example, the smart city project cooperation between Miami-Dade County in Florida and IBM closely connects 35 types of key county government departments and Miami city, and helps government leaders obtain better information support in decision making for managing water resources, reducing traffic jams, and improving public safety. The application of smart city brings benefits in many aspects for Dade County. For instance, the Department of Park Management of Dade County saved one million USD in water bills this year by promptly identifying and fixing water pipes that were running and leaking.

6.3.3 Application of online social network-oriented big data

Online SNS is a social structure constituted by social individuals and connections among individuals based on an

198 Mobile Netw Appl (2014) 19:171–209

information network. Big data of online SNS mainly comes from instant messages, online social, micro blog, and shared space, etc., which represent various user activities. The analysis of big data from online SNS uses computational analytical methods for understanding relations in human society by virtue of theories and methods involving mathematics, informatics, sociology, and management science, etc., from three dimensions: network structure, group interaction, and information spreading. The applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education, etc. Fig. 5 illustrates the technical framework of the application of big data of online SNS. Classic applications of big data from online SNS are introduced in the following, which mainly mine and analyze content information and structural information to acquire value.

– Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preference, emotion, interest, and demand, etc., may be revealed.

– Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies, etc., aggregate relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.
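The notion of a community as a group with close internal but loose external relations can be made concrete with a small sketch. It assumes an undirected graph given as a list of user pairs and a candidate group of users; the function name `community_cohesion` and the cohesion ratio it reports are illustrative choices, not a method taken from the survey.

```python
def community_cohesion(edges, members):
    """Count edges inside a candidate group versus edges leaving it.

    edges:   iterable of (user_a, user_b) pairs describing an undirected graph
    members: users that form the candidate community
    Returns (internal_edges, external_edges, cohesion), where cohesion is the
    share of the group's edges that stay inside the group.
    """
    members = set(members)
    internal = external = 0
    for u, v in edges:
        endpoints_inside = (u in members) + (v in members)
        if endpoints_inside == 2:
            internal += 1   # both users belong to the group
        elif endpoints_inside == 1:
            external += 1   # the edge crosses the group boundary
    touching = internal + external
    cohesion = internal / touching if touching else 0.0
    return internal, external, cohesion


if __name__ == "__main__":
    # Two tightly knit triangles joined by a single bridge edge.
    graph = [("a", "b"), ("b", "c"), ("a", "c"),
             ("c", "d"),
             ("d", "e"), ("e", "f"), ("d", "f")]
    print(community_cohesion(graph, {"a", "b", "c"}))
```

A cohesion value close to 1 indicates that the group behaves like a community in the sense described above; community detection algorithms search for partitions that score well on measures of this kind.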

The US Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

Fig. 5 Enabling technologies for online social network-oriented big data

In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the US. In addition, Global Pulse conducted research that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans. The goal is to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting sharp growth or drop in the amount of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation from the official statistics of Indonesia matches the number of Tweets about rice price on Twitter, as shown in Fig. 6.
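The first of these aspects, detecting sharp growth or drop in topic volume, can be sketched with a trailing-window z-score. This is a generic technique chosen for illustration and is not claimed to be the method used by Global Pulse; the function name, the window length, and the threshold are all assumptions.

```python
from statistics import mean, pstdev


def detect_volume_anomalies(counts, window=7, threshold=3.0):
    """Return indices whose count deviates sharply from the trailing window.

    counts:    message counts per period (for example, per day) for one topic
    window:    number of preceding periods used as the baseline
    threshold: number of standard deviations treated as a sharp change
    """
    if window < 2:
        raise ValueError("window must be at least 2")
    flagged = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            # Perfectly flat baseline: any change at all counts as a deviation.
            is_anomaly = counts[i] != mu
        else:
            is_anomaly = abs(counts[i] - mu) / sigma >= threshold
        if is_anomaly:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    daily_mentions = [100, 102, 98, 101, 99, 100, 103, 250, 101]
    print(detect_volume_anomalies(daily_mentions))
```

Because the baseline is a plain trailing window, a large spike inflates the baseline for the following periods; a more robust variant could use the median and median absolute deviation instead.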

Generally speaking, the application of big data from online SNS may help to better understand users' behavior and master the laws of social and economic activities from the following three aspects:

– Early Warning: to rapidly cope with crises, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotion, and preference of users.

– Real-time Feedback: to acquire groups' feedback on some social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has great potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the health care business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final result into an extremely personalized treatment plan to assess the risk factors and main treatment plans of patients. Then, doctors may reduce morbidity by 50% in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or suggesting patients to reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20.

The Mount Sinai Medical Center in the US utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through the software development kit (SDK) and open interface.

Fig. 6 The correlation between Tweets about rice price and food price inflation

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units and coordinate through mobile networks to distribute sensing tasks and to collect and utilize sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of Crowdsourcing has been successfully applied to geotagged photographs, positioning and navigation, urban road traffic sensing, market forecast, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach for problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, Crowdsourcing had been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of Crowdsourcing. The main idea of Crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, Crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, Spatial Crowdsourcing has become a hot topic. The operation framework of Spatial Crowdsourcing is as follows. A user may request the service and resources related to a specified location. Then the mobile users who are willing to participate in the task will move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data will be sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions provided by mobile devices, it can be forecasted that Spatial Crowdsourcing will be more prevalent than traditional Crowdsourcing, e.g., Amazon Mechanical Turk and Crowdflower.
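The request-and-dispatch step of this framework can be sketched as picking the nearest willing worker for a location-bound task. The worker record layout, the `assign_task` helper, and the 5 km default radius are assumptions made for illustration; the survey does not prescribe an assignment policy. Distances use the standard haversine great-circle formula.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def assign_task(task_location, workers, max_km=5.0):
    """Return the id of the closest willing worker within max_km, or None.

    task_location: (lat, lon) of the requested location
    workers:       list of dicts with keys "id", "lat", "lon", "willing"
    """
    lat, lon = task_location
    candidates = []
    for w in workers:
        if not w["willing"]:
            continue  # only users who opted in can be dispatched
        distance = haversine_km(lat, lon, w["lat"], w["lon"])
        if distance <= max_km:
            candidates.append((distance, w["id"]))
    return min(candidates)[1] if candidates else None


if __name__ == "__main__":
    crowd = [
        {"id": "w1", "lat": 30.010, "lon": 114.0, "willing": True},
        {"id": "w2", "lat": 30.001, "lon": 114.0, "willing": False},
        {"id": "w3", "lat": 30.500, "lon": 114.0, "willing": True},
    ]
    print(assign_task((30.0, 114.0), crowd))
```

A production system would also account for worker travel time, task deadlines, and load balancing across workers, none of which are modelled here.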

6.3.6 Smart grid

Smart Grid is the next generation power grid constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart Grid related big data are generated from various sources, such as (i) power utilization habits of users; (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nationwide; (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI); (iv) energy market pricing and bidding data; and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as Circuit Breaker Monitors and transformers). Smart Grid brings about the following challenges on exploiting big data:

– Grid planning: By analyzing data in the Smart Grid, the regions that have excessively high electrical load or high power outage frequencies can be identified. Even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an “electric map” according to the big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of the block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides effective and visual load forecast for power grid planning in a city. Preferential transformation on the power grid facilities in blocks with high power outage frequencies and serious overloads may be conducted, as displayed in the map.

– Interaction between power generation and power consumption: An ideal power grid shall balance power generation and consumption. However, the traditional power grid is constructed based on the one-directional approach of transmission-transformation-distribution-consumption, which does not allow adjusting the generation capacity according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters are developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which can help suppliers read power utilization data every 15 min rather than every month as in the past. Labor cost for meter reading is greatly reduced. Because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and low periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help the realization of time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users.

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable new energy sources can be efficiently managed, and the electricity generated by the new energy resources can be allocated to regions with electricity shortage. Such energy resources can complement the traditional hydropower and thermal power generations.
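The time-sharing dynamic pricing mentioned in the second item above can be illustrated with a small sketch that turns 15-minute meter readings into peak, normal, and low periods and prices them differently. The function names, the 1.2 and 0.8 classification factors, and the price multipliers are illustrative assumptions and do not describe TXU Energy's actual tariff.

```python
from collections import defaultdict


def hourly_load_profile(readings):
    """Sum 15-minute readings into total consumption per hour of day.

    readings: iterable of (hour_of_day, kwh) pairs, four per hour per meter
    """
    totals = defaultdict(float)
    for hour, kwh in readings:
        totals[hour] += kwh
    return dict(totals)


def classify_periods(profile, peak_factor=1.2, low_factor=0.8):
    """Label each hour as "peak", "low", or "normal" relative to the average load."""
    if not profile:
        raise ValueError("profile must not be empty")
    average = sum(profile.values()) / len(profile)
    labels = {}
    for hour, load in profile.items():
        if load >= peak_factor * average:
            labels[hour] = "peak"
        elif load <= low_factor * average:
            labels[hour] = "low"
        else:
            labels[hour] = "normal"
    return labels


def tariff(labels, base_price, peak_multiplier=1.5, low_multiplier=0.7):
    """Map each hour to a price: higher at peak, lower in low-demand periods."""
    multipliers = {"peak": peak_multiplier, "low": low_multiplier, "normal": 1.0}
    return {hour: base_price * multipliers[label] for hour, label in labels.items()}


if __name__ == "__main__":
    meter = [(3, 0.5)] * 4 + [(12, 1.0)] * 4 + [(19, 1.5)] * 4
    labels = classify_periods(hourly_load_profile(meter))
    print(labels)
    print(tariff(labels, base_price=0.10))
```

Raising the price in peak hours and lowering it in low hours gives consumers an incentive to shift load, which is the stabilizing effect the text attributes to such pricing.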

7 Conclusion, open issues, and outlook

In this paper, we review the background and state-of-the-art of big data. Firstly, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture to readers of this exciting area.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, which are discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined, and the existing models are not strictly verified.

– Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to balance the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated when the system is implemented and deployed, so the advantages and disadvantages of alternative solutions cannot be compared horizontally, even before and after the implementation of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes: This includes memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, etc., should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor that restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.


– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor to improve big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems on big data processing arise from traditional data analysis, including (i) data re-utilization: with the increase of data scale, more value may be mined from re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized, from which more value can be mined; (iii) data exhaust, which means wrong data during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved.

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from comprehensive utilization of multiple datasets is far higher than the sum value of individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance is the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and from different datasets.

– Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication; big data applications in small and medium-sized businesses; big data applications in government departments; big data services; and human-computer interaction of big data, etc., are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume is fast growing, there are more severe safety risks, while the traditional data protection methods have already been shown not applicable to big data. In particular, big data safety is confronted with the following security related challenges.

– Big data privacy: Big data privacy includes two aspects: (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc., of users may be more easily acquired, and users may not be aware. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed as a big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher of Skull Security, acquired data in the public pages of Facebook users who failed to modify their privacy settings via an information acquisition tool. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low-quality data wastes transmission and storage resources and has poor usability. Many factors may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though many measures have been taken to improve data quality, the related problems have not yet been well addressed. Therefore, effective methods to automatically detect data quality and repair damaged data need to be investigated.

– Big data safety mechanism: Big data brings challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods, designed for small and medium-scale data, cannot meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safe communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, the isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data application in information security: Big data not only brings challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) by analyzing big data in the form of log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, attack characteristics, etc. may also be more easily identified through the analysis of big data.
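
As a minimal sketch of the log analysis idea in the last item above, the following Python fragment counts failed logins per source address and flags the addresses that reach a threshold. The three-field log format, the event name LOGIN_FAIL, the function name suspicious_sources, and the threshold value are all assumptions made for illustration; they are not taken from any particular Intrusion Detection System, and detecting a real APT would require far richer features than a simple count.

```python
from collections import Counter

# Hypothetical, simplified log format: "<timestamp> <source_ip> <event>".
# Real IDS logs are richer; this layout is an assumption for illustration.
SAMPLE_LOG = """\
2014-01-01T00:00:01 10.0.0.5 LOGIN_FAIL
2014-01-01T00:00:02 10.0.0.5 LOGIN_FAIL
2014-01-01T00:00:03 10.0.0.7 LOGIN_FAIL
2014-01-01T00:00:04 10.0.0.5 LOGIN_OK
2014-01-01T00:00:05 10.0.0.5 LOGIN_FAIL
corrupted line
"""


def suspicious_sources(log_text, event="LOGIN_FAIL", threshold=3):
    """Return {source_ip: count} for sources with at least `threshold` events."""
    counts = Counter()
    for line in log_text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip lines that do not match the assumed format
        _, source_ip, kind = parts
        if kind == event:
            counts[source_ip] += 1
    return {ip: n for ip, n in counts.items() if n >= threshold}


if __name__ == "__main__":
    print(suspicious_sources(SAMPLE_LOG))
```

On a cluster, the same per-source counting could be expressed as a map step (emit the source address for each matching line) and a reduce step (sum per address); the single-process version is kept here only for readability.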

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, energy-efficient distributed storage, and hardware and software system architectures for processing. In particular, big data security, including credibility, backup and recovery, and completeness maintenance, should be further investigated.
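
The data quality dimensions named in the list above (completeness, redundancy, and consistency) can be made concrete with a small sketch. The record layout, the field names, the function name quality_report, and the specific metric definitions below are assumptions chosen for illustration rather than definitions from the literature; accuracy is omitted because it requires a ground truth to compare against.

```python
from collections import defaultdict

# Hypothetical records; the fields "id", "city" and "age" are illustrative only.
RECORDS = [
    {"id": 1, "city": "Wuhan", "age": 30},
    {"id": 1, "city": "Wuhan", "age": 30},    # exact duplicate of the first record
    {"id": 2, "city": "", "age": 41},         # incomplete: empty city
    {"id": 3, "city": "Auburn", "age": 25},
    {"id": 3, "city": "Beijing", "age": 25},  # conflicts with the previous record
]


def quality_report(records, required_fields, key_field, check_field):
    """Compute three simple quality indicators, each in the range [0, 1]."""
    total = len(records)
    if total == 0:
        raise ValueError("no records to assess")

    # Completeness: share of records with every required field filled in.
    complete = sum(
        1 for r in records
        if all(r.get(f) not in (None, "") for f in required_fields)
    )

    # Redundancy: share of records that exactly repeat an earlier record.
    seen, duplicates = set(), 0
    for r in records:
        fingerprint = tuple(sorted(r.items()))
        if fingerprint in seen:
            duplicates += 1
        seen.add(fingerprint)

    # Consistency: share of keys whose records agree on `check_field`.
    values_per_key = defaultdict(set)
    for r in records:
        values_per_key[r[key_field]].add(r.get(check_field))
    consistent = sum(1 for v in values_per_key.values() if len(v) == 1)

    return {
        "completeness": complete / total,
        "redundancy": duplicates / total,
        "consistency": consistent / len(values_per_key),
    }


if __name__ == "__main__":
    print(quality_report(RECORDS, ["id", "city", "age"], "id", "city"))
```

Indicators of this kind only detect problems; repairing damaged data, which the text above also calls for, would need additional rules or models.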

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the “T” (Technology) was the main concern, and technology drove the development of data. In the big data era, with the prominence of data value and advances in “I” (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact but also influence everyone's ways of living and thinking, which is happening right now. We cannot predict the future, but we can take precautions for events that may occur.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced, given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have sought better ways to cope with larger-scale, more diverse, and more complexly structured data. These efforts are represented by Google's globally distributed database Spanner and the fault-tolerant, expandable, distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to those of relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge value, mastering big data means mastering resources. An analysis of the value chain of big data shows that its value comes from the data itself, technologies, and ideas, and that the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting out and assigning the rights to use their data.

– Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, information security, etc. Then, the impacts of big data on production management, business operation, decision making, etc. shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., in text and image editors. In big data applications, mixed data is very useful for decision making. Only when analytical results are displayed in a user-friendly way can they be effectively utilized by users. Reports, histograms, pie charts, regression curves, etc. are frequently used to visualize the results of data analysis. New presentation forms will emerge in the future; e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well known that programs consist of data structures and algorithms, and that data structures are used to store data. In the history of program design, the role of data has become increasingly significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed that of business logic. Consequently, programs are gradually being transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods will emerge, which will have a far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

  – During data analysis, we will try to utilize all data rather than analyzing only a small set of sample data.

  – Compared with accurate data, we would like to accept numerous and complicated data.

  – We shall pay greater attention to correlations between things rather than exploring causal relationships.

  – The simple algorithms of big data are more effective than the complex algorithms of small data.

  – Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace “experts.”

Throughout the history of human society, the demands and willingness of human beings have always been the driving forces of scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities and based on big data will attract increasing attention and will certainly cause enormous transformations of social activities in future society.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

                                                        References

1. Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

2. Fact sheet: Big data across the federal government (2012). http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf

3. Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper

4. Drowning in numbers - digital data will flood the planet - and help us understand it better (2011). http://www.economist.com/blogs/dailychart/2011/11/bigdata-0

5. Lohr S (2012) The age of big data. New York Times, pp 11

6. Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/thedigitalbreadcrumbs-that-lead-to-big-data

7. Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-searchforanalysts-to-make-sense-of-big-data

8. Big data (2008). http://www.nature.com/news/specials/bigdata/index.html

9. Special online collection: dealing with big data (2011). http://www.sciencemag.org/site/special/data

10. Manyika J, McKinsey Global Institute, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute

11. Mayer-Schönberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Eamon Dolan/Houghton Mifflin Harcourt

12. Laney D (2001) 3-d data management: controlling data volume, velocity and variety. META Group Research Note, 6 February

13. Zikopoulos P, Eaton C et al (2011) Understanding big data: analytics for enterprise class hadoop and streaming data. McGraw-Hill Osborne Media

14. Meijer E (2011) The world according to linq. Communications of the ACM 54(10):45–51

15. Beyer M (2011) Gartner says solving big data challenge involves more than just managing volumes of data. Gartner. http://www.gartner.com/it/page.jsp

16. O R Team (2011) Big data now: current perspectives from O'Reilly Radar. O'Reilly Media

17. Grobelnik M (2012) Big data tutorial. http://videolectures.net/eswc2012grobelnikbigdata

18. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2008) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014

19. DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6):85–98

20. Walter T (2009) Teradata past, present, and future. UCI ISG lecture series on scalable data management

21. Ghemawat S, Gobioff H, Leung S-T (2003) The google file system. In: ACM SIGOPS Operating Systems Review, vol 37. ACM, pp 29–43

22. Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

23. Hey AJG, Tansley S, Tolle KM et al (2009) The fourth paradigm: data-intensive scientific discovery

24. Howard JH, Kazar ML, Menees SG, Nichols DA, Satyanarayanan M, Sidebotham RN, West MJ (1988) Scale and performance in a distributed file system. ACM Trans Comput Syst (TOCS) 6(1):51–81

25. Cattell R (2011) Scalable sql and nosql data stores. ACM SIGMOD Record 39(4):12–27

26. Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033

27. Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98

28. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J et al (2012) Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States

29. Sun Y, Chen M, Liu B, Mao S (2013) Far: a fault-avoidant routing method for data center networks with regular topology. In: Proceedings of ACM/IEEE symposium on architectures for networking and communications systems (ANCS'13). ACM

30. Wiki (2013) Applications and organizations using hadoop. http://wiki.apache.org/hadoop/PoweredBy

31. Bahga A, Madisetti VK (2012) Analyzing massive machine maintenance data in a computing cloud. IEEE Transac Parallel Distrib Syst 23(10):1831–1843

32. Gunarathne T, Wu T-L, Choi JY, Bae S-H, Qiu J (2011) Cloud computing paradigms for pleasingly parallel biomedical applications. Concurr Comput Prac Experience 23(17):2338–2354

33. Gantz J, Reinsel D (2010) The digital universe decade - are you ready? External publication of IDC (Analyse the Future) information and data, pp 1–16

34. Bryant RE (2011) Data-intensive scalable computing for scientific applications. Comput Sci Eng 13(6):25–33

35. Wahab MHA, Mohd MNH, Hanafi HF, Mohsin MFM (2008) Data pre-processing on web server logs for generalized association rules mining algorithm. World Acad Sci Eng Technol 48:2008

36. Nanopoulos A, Manolopoulos Y, Zakrzewicz M, Morzy T (2002) Indexing web access-logs for pattern queries. In: Proceedings of the 4th international workshop on web information and data management. ACM, pp 63–68

37. Joshi KP, Joshi A, Yesha Y (2003) On using a warehouse to analyze web logs. Distrib Parallel Databases 13(2):161–180

38. Chandramohan V, Christensen K (2002) A first look at wired sensor networks for video surveillance systems. In: Proceedings LCN 2002, 27th annual IEEE conference on local computer networks. IEEE, pp 728–729

39. Selavo L, Wood A, Cao Q, Sookoor T, Liu H, Srinivasan A, Wu Y, Kang W, Stankovic J, Young D et al (2007) Luster: wireless sensor network for environmental research. In: Proceedings of the 5th international conference on Embedded networked sensor systems. ACM, pp 103–116

40. Barrenetxea G, Ingelrest F, Schaefer G, Vetterli M, Couach O, Parlange M (2008) Sensorscope: out-of-the-box environmental monitoring. In: Information processing in sensor networks, 2008, international conference on IPSN'08. IEEE, pp 332–343

41. Kim Y, Schmid T, Charbiwala ZM, Friedman J, Srivastava MB (2008) Nawms: nonintrusive autonomous water monitoring system. In: Proceedings of the 6th ACM conference on Embedded network sensor systems. ACM, pp 309–322

42. Kim S, Pakzad S, Culler D, Demmel J, Fenves G, Glaser S, Turon M (2007) Health monitoring of civil infrastructures using wireless sensor networks. In: Information Processing in Sensor Networks, 2007, 6th International Symposium on IPSN 2007. IEEE, pp 254–263

43. Ceriotti M, Mottola L, Picco GP, Murphy AL, Guna S, Corra M, Pozzi M, Zonta D, Zanon P (2009) Monitoring heritage buildings with wireless sensor networks: the torre aquila deployment. In: Proceedings of the 2009 International Conference on Information Processing in Sensor Networks. IEEE Computer Society, pp 277–288

44. Tolle G, Polastre J, Szewczyk R, Culler D, Turner N, Tu K, Burgess S, Dawson T, Buonadonna P, Gay D et al (2005) A macroscope in the redwoods. In: Proceedings of the 3rd international conference on embedded networked sensor systems. ACM, pp 51–63

45. Wang F, Liu J (2011) Networked wireless sensor data collection: issues, challenges, and approaches. IEEE Commun Surv Tutor 13(4):673–687

46. Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the 11th international conference on World Wide Web. ACM, pp 124–135

47. Choudhary S, Dincturk ME, Mirtaheri SM, Moosavi A, von Bochmann G, Jourdan G-V, Onut I-V (2012) Crawling rich internet applications: the state of the art. In: CASCON, pp 146–160

48. Ghani N, Dixit S, Wang T-S (2000) On ip-over-wdm integration. IEEE Commun Mag 38(3):72–84

49. Manchester J, Anderson J, Doshi B, Dravida S (1998) Ip over sonet. IEEE Commun Mag 36(5):136–142

50. Jinno M, Takara H, Kozicki B (2009) Dynamic optical mesh networks: drivers, challenges and solutions for the future. In: Optical communication, 2009, 35th European conference on ECOC'09. IEEE, pp 1–4

51. Barroso LA, Holzle U (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synt Lect Comput Archit 4(1):1–108

52. Armstrong J (2009) Ofdm for optical communications. J Light Technol 27(3):189–204

53. Shieh W (2011) Ofdm for flexible high-speed optical networks. J Light Technol 29(10):1560–1577

54. Cisco data center interconnect design and deployment guide (2010)

55. Greenberg A, Hamilton JR, Jain N, Kandula S, Kim C, Lahiri P, Maltz DA, Patel P, Sengupta S (2009) Vl2: a scalable and flexible data center network. In: ACM SIGCOMM computer communication review, vol 39. ACM, pp 51–62

56. Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S (2009) Bcube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 39(4):63–74

57. Farrington N, Porter G, Radhakrishnan S, Bazzaz HH, Subramanya V, Fainman Y, Papen G, Vahdat A (2011) Helios: a hybrid electrical/optical switch architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 41(4):339–350

58. Abu-Libdeh H, Costa P, Rowstron A, O'Shea G, Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Comput Commun Rev 40(4):51–62

59. Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what's needed for datacenter network operations. IEEE Commun Mag 48(7):32–39

60. Wang G, Andersen DG, Kaminsky M, Papagiannaki K, Ng TS, Kozuch M, Ryan M (2010) c-through: Part-time optics in data centers. In: ACM SIGCOMM Computer Communication Review, vol 40. ACM, pp 327–338

61. Ye X, Yin Y, Yoo SJB, Mejia P, Proietti R, Akella V (2010) Dos: a scalable optical switch for datacenters. In: Proceedings of the 6th ACM/IEEE symposium on architectures for networking and communications systems. ACM, p 24

62. Singla A, Singh A, Ramachandran K, Xu L, Zhang Y (2010) Proteus: a topology malleable data center network. In: Proceedings of the 9th ACM SIGCOMM workshop on hot topics in networks. ACM, p 8

63. Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P (2011) Energy-efficient design of a scalable optical multiplane interconnection architecture. IEEE J Sel Top Quantum Electron 17(2):377–383

64. Kodi AK, Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance computing (hpc) systems. IEEE J Sel Top Quantum Electron 17(2):384–395

65. Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H (2012) Mirror mirror on the ceiling: flexible wireless links for data centers. ACM SIGCOMM Comput Commun Rev 42(4):443–454

66. Lenzerini M (2002) Data integration: a theoretical perspective. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. ACM, pp 233–246

67. Cafarella MJ, Halevy A, Khoussainova N (2009) Data integration for the relational web. Proc VLDB Endowment 2(1):1090–1101

68. Maletic JI, Marcus A (2000) Data cleansing: beyond integrity analysis. In: IQ. Citeseer, pp 200–209

69. Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1-2):83–113

70. Chen H, Ku W-S, Wang H, Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 51–62

71. Zhao Z, Ng W (2012) A model-based approach for rfid data stream cleansing. In: Proceedings of the 21st ACM international conference on information and knowledge management. ACM, pp 862–871

72. Khoussainova N, Balazinska M, Suciu D (2008) Probabilistic event extraction from rfid data. In: Data Engineering, 2008. IEEE 24th international conference on ICDE 2008. IEEE, pp 1480–1482

73. Herbert KG, Wang JTL (2007) Biological data cleaning: a case study. Int J Inf Qual 1(1):60–82

74. Tsai T-H, Lin C-Y (2012) Exploring contextual redundancy in improving object-based video coding for video sensor networks surveillance. IEEE Transac Multmed 14(3):669–682

75. Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 269–278

76. Kamath U, Compton J, Dogan RI, Jong KD, Shehu A (2012) An evolutionary algorithm approach for feature generation from sequence data and its application to dna splice site prediction. IEEE/ACM Transac Comput Biol Bioinforma (TCBB) 9(5):1387–1398

77. Leung K-S, Lee KH, Wang J-F, Ng EYT, Chan HLY, Tsui SKW, Mok TSK, Tse PC-H, Sung JJ-Y (2011) Data mining on dna sequences of hepatitis b virus. IEEE/ACM Transac Comput Biol Bioinforma 8(2):428–440

78. Huang Z, Shen H, Liu J, Zhou X (2011) Effective data co-reduction for multimedia similarity search. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, pp 1021–1032

79. Bleiholder J, Naumann F (2008) Data fusion. ACM Comput Surv (CSUR) 41(1):1

80. Brewer EA (2000) Towards robust distributed systems. In: PODC, p 7

81. Gilbert S, Lynch N (2002) Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59

82. McKusick MK, Quinlan S (2009) Gfs: evolution on fast-forward. ACM Queue 7(7):10

83. Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J (2008) Scope: easy and efficient parallel processing of massive data sets. Proc VLDB Endowment 1(2):1265–1276

84. Beaver D, Kumar S, Li HC, Sobel J, Vajgel P et al (2010) Finding a needle in haystack: facebook's photo storage. In: OSDI, vol 10, pp 1–8

85. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: amazon's highly available key-value store. In: SOSP, vol 7, pp 205–220

86. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In: Proceedings of the twenty-ninth annual ACM symposium on theory of computing. ACM, pp 654–663

87. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2):4

88. Burrows M (2006) The chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th symposium on Operating systems design and implementation. USENIX Association, pp 335–350

89. Lakshman A, Malik P (2009) Cassandra: structured storage system on a p2p network. In: Proceedings of the 28th ACM symposium on principles of distributed computing. ACM, pp 5–5

90. George L (2011) HBase: the definitive guide. O'Reilly Media Inc

91. Judd D (2008) hypertable-0.9.0.4-alpha

92. Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc

93. Crockford D (2006) The application/json media type for javascript object notation (json)

94. Murty J (2009) Programming amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc

95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc

96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 975–986

97. Yang H-C, Parker DS (2009) Traverse: simplified indexing on large map-reduce-merge clusters. In: Database systems for advanced applications. Springer, pp 308–322

98. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with sawzall. Sci Program 13(4):277–298

99. Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U (2009) Building a high-level dataflow system on top of map-reduce: the pig experience. Proceedings VLDB Endowment 2(2):1414–1425

100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629

101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72

102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) Dryadlinq: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14

103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: Parallel and distributed processing, 2008. IEEE international symposium on IPDPS 2008. IEEE, pp 1–11

104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146

105. Bu Y, Bill H, Balazinska M, Ernst MD (2010) Haloop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296

106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative mapreduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing. ACM, pp 810–818

107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2

108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: mapreduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing. ACM, p 7

109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) Ciel: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on Networked systems design and implementation, p 9

                                                        110 Anderson TW (1958) An introduction to multivariate statisticalanalysis vol 2 Wiley New York

                                                        111 Wu X Kumar V Quinlan JR Ghosh J Yang Q Motoda HMcLachlan GJ Ng A Liu B Philip SY et al (2008) Top 10algorithms in data mining Knowl Inf Syst 14(1)1ndash37

                                                        112 What analytics data mining big data software you used in thepast 12 months for a real project (2012) httpwwwkdnuggetscompolls2012analytics-data-mining-big-data-softwarehtml

                                                        113 Berthold MR Cebron N Dill F Gabriel TR Kotter T MeinlT Ohl P Sieb C Thiel K Wiswedel B (2008) KNIME theKonstanz information miner Springer

                                                        114 Sallam RL Richardson J Hagerty J Hostmann B (2011) Magicquadrant for business intelligence platforms CT Gartner GroupStamford

115. Beyond the PC. Special Report on Personal Technology (2011)

116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A et al (2011) The iplant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on software quality assurance. ACM, pp 70–77

118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers. ACM, pp 277–286

119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Record 34(2):18–26

120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM Sigmod Record 33(1):50–57

121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Transac Manag Inform Syst (TMIS) 3(2):7

122. Manning CD, Schutze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press

123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Transac Neural Netw 13(5):1163–1177

124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11

125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117

126. Konopnicki D, Shmueli O (1995) W3qs: a query system for the world-wide web. In: VLDB, vol 95, pp 54–65

127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640

128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval. ACM, pp 2

129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25

130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19

131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819

132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939

133. Barragans-Martínez AB, Costa-Montenegro E, Burguillo JC, Rey-Lopez M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend tv programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311

134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91

135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia. ACM, pp 469–478

136. Hirsch JE (2005) An index to quantify an individual's scientific research output. Proc Natl Acad Sci USA 102(46):16569

137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company

138. Aggarwal CC (2011) An introduction to social network data analytics. Springer

139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 1046–1054

140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied computing. ACM, pp 1115–1116

141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Transac Knowl Discov Data (TKDD) 5(2):10

142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on World wide web. ACM, pp 631–640

143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on web mining and social network analysis. ACM, pp 16–25

144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. ACM, pp 315–321

145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 145–158

146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 131–144

147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 1007–1016

148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 807–816

149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on web search and data mining. ACM, pp 657–666

150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360

151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (last accessed 5 May 2013)

152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51

153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive rfid, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 2

154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval. ACM, pp 469–478

155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated, multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium. ACM, pp 445–454

156. Tasevski P (2011) Password attacks and generation strategies. Tartu University, Faculty of Mathematics and Computer Sciences


information network. Big data from online SNS mainly comes from instant messages, online social services, micro blogs, and shared spaces, which reflect various user activities. The analysis of big data from online SNS uses computational analytical methods to understand relations in human society, drawing on theories and methods from mathematics, informatics, sociology, and management science, along three dimensions: network structure, group interaction, and information spreading. Applications include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education. Fig. 5 illustrates the technical framework of the application of big data from online SNS. Classic applications of big data from online SNS are introduced in the following; they mainly mine and analyze content information and structural information to acquire value.

– Content-based Applications: Language and text are the two most important forms of presentation in SNS. Through the analysis of language and text, user preferences, emotions, interests, and demands may be revealed.

– Structure-based Applications: In SNS, users are represented as nodes, while social relations, interests, and hobbies aggregate relations among users into a clustered structure. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for interpersonal relation analysis.
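The notion of a community as a group with close internal relations but loose external relations can be made concrete with a small measurement. The following is a minimal sketch, not a method taken from the surveyed work: the function name `edge_densities` and the toy graph are illustrative assumptions. It compares how many of the possible edges inside a candidate group actually exist with how many of the possible edges leaving the group exist.

```python
def edge_densities(edges, group, all_nodes):
    """Return (internal_density, external_density) for a candidate community.

    edges: iterable of undirected (u, v) pairs; group: nodes in the candidate
    community; all_nodes: every node in the graph. All names are hypothetical.
    """
    group = set(group)
    outside = set(all_nodes) - group
    edge_set = {frozenset(e) for e in edges}
    # Edges with both endpoints inside the group.
    internal = sum(1 for e in edge_set if len(e) == 2 and e <= group)
    # Edges with exactly one endpoint inside the group.
    external = sum(1 for e in edge_set if len(e) == 2 and len(e & group) == 1)
    max_internal = len(group) * (len(group) - 1) // 2
    max_external = len(group) * len(outside)
    internal_density = internal / max_internal if max_internal else 0.0
    external_density = external / max_external if max_external else 0.0
    return internal_density, external_density


# Toy graph (made-up data): two triangles joined by a single edge c-d.
edges = [("a", "b"), ("a", "c"), ("b", "c"),
         ("d", "e"), ("d", "f"), ("e", "f"),
         ("c", "d")]
nodes = {"a", "b", "c", "d", "e", "f"}
internal, external = edge_densities(edges, {"a", "b", "c"}, nodes)
print(internal, external)
```

A group whose internal density is much higher than its external density matches the informal description of a community given above; practical community detection algorithms search for such groups rather than checking a given one.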

The U.S. Santa Cruz Police Department experimented with applying data for predictive analysis. By analyzing SNS, the police department can discover crime trends and crime modes, and even predict the crime rates in major regions [11].

Fig. 5 Enabling technologies for online social network-oriented big data

In April 2013, Wolfram Alpha, a computing and search engine company, studied the law of social behavior by analyzing social data of more than one million American users of Facebook. According to the analysis, most Facebook users fall in love in their early 20s, get engaged when they are about 27 years old, and get married when they are about 30 years old. Finally, their marriage relationships exhibit slow changes between 30 and 60 years old. Such research results are highly consistent with the demographic census data of the U.S. In addition, Global Pulse conducted research that revealed some laws in social and economic activities using SNS data. This project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans. The goal is to better understand public behavior and concerns. This project analyzed SNS big data from several aspects: 1) predicting the occurrence of abnormal events by detecting sharp growth or drops in the number of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation in the official statistics of Indonesia matches the number of Tweets about the rice price on Twitter, as shown in Fig. 6.
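Two of the analysis aspects listed above, detecting sharp growth or drops in topic volume and relating tweet counts to an external indicator, can be illustrated in a few lines. This is only a sketch under stated assumptions: the trailing-window rule, the thresholds, the function names, and all numbers are made up for illustration and are not the Global Pulse project's actual method or data.

```python
from statistics import mean


def sharp_changes(counts, window=3, ratio=2.0):
    """Return indices whose count is at least `ratio` times, or at most
    1/`ratio` times, the mean of the preceding `window` counts."""
    flagged = []
    for i in range(window, len(counts)):
        baseline = mean(counts[i - window:i])
        if baseline == 0:
            continue  # no meaningful baseline to compare against
        if counts[i] >= ratio * baseline or counts[i] <= baseline / ratio:
            flagged.append(i)
    return flagged


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical weekly tweet counts for one topic.
weekly_tweets = [100, 110, 105, 300, 120, 115, 40]
print(sharp_changes(weekly_tweets))

# Hypothetical monthly series: tweets about a price vs. an inflation index.
tweets_about_price = [1, 2, 3, 4]
inflation_index = [2, 4, 6, 8]
print(pearson(tweets_about_price, inflation_index))
```

A correlation close to 1 between such series would correspond to the kind of match between Tweets and official statistics that the project reported, although correlation alone does not establish a causal link.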

Generally speaking, the application of big data from online SNS may help to better understand users' behavior and master the laws of social and economic activities from the following three aspects:

– Early Warning: to rapidly cope with a crisis, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotion, and preference of users.

– Real-time Feedback: to acquire groups' feedback on social activities based on real-time monitoring.

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has unlimited potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the health care business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final result into a highly personalized treatment plan to assess the risk factors and main treatment plans of patients. Doctors may then reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose five pounds of weight, or by suggesting that patients reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20.

The Mount Sinai Medical Center in the U.S. utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through the software development kit (SDK) and open interface.

Fig. 6 The correlation between Tweets about rice price and food price inflation

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units and coordinate with mobile networks to distribute sensing tasks and to collect and utilize sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of Crowdsourcing has been successfully applied to geotagged photographs, positioning and navigation, urban road traffic sensing, market forecast, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach for problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, Crowdsourcing had been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of Crowdsourcing. The main idea of Crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, Crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, Spatial Crowdsourcing has become a hot topic. The operation framework of Spatial Crowdsourcing is as follows. A user may request a service and resources related to a specified location. Then the mobile users who are willing to participate in the task move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data are sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions they provide, it can be forecasted that Spatial Crowdsourcing will become more prevalent than traditional Crowdsourcing, e.g., Amazon Mechanical Turk and Crowdflower.
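The request, move, and return steps of this framework can be sketched as a simple assignment rule. The code below is a minimal illustration only: choosing the nearest willing worker within a distance limit is an assumed policy, and the names `assign_task`, `haversine_km`, the worker records, and the coordinates are all hypothetical.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def assign_task(task_location, workers, max_km=5.0):
    """Return the id of the nearest willing worker within max_km, or None."""
    in_range = []
    for worker in workers:
        if not worker["willing"]:
            continue  # only users who agree to participate are considered
        distance = haversine_km(task_location, worker["location"])
        if distance <= max_km:
            in_range.append((distance, worker["id"]))
    return min(in_range)[1] if in_range else None


# Made-up request location and worker positions.
task = (30.51, 114.41)
workers = [
    {"id": "w1", "location": (30.52, 114.41), "willing": True},
    {"id": "w2", "location": (30.511, 114.411), "willing": False},
    {"id": "w3", "location": (30.60, 114.30), "willing": True},
]
print(assign_task(task, workers))
```

In this toy setup the closest user is skipped because that user is unwilling, which reflects the voluntary nature of participation described above. Real systems would also have to account for travel time, rewards, and data quality.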

6.3.6 Smart grid

Smart Grid is the next-generation power grid, constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart Grid related big data are generated from various sources, such as (i) power utilization habits of users; (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nationwide; (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI); (iv) energy market pricing and bidding data; and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as Circuit Breaker Monitors and transformers). Smart Grid brings about the following challenges on exploiting big data:

– Grid planning: By analyzing data in the Smart Grid, regions with excessively high electrical load or high power outage frequencies can be identified. Even transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance. For example, researchers from the University of California, Los Angeles designed an "electric map" according to big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of a block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides effective and visual load forecasts for power grid planning in a city. Power grid facilities in blocks with high power outage frequencies and serious overloads, as displayed in the map, may be given priority for transformation.

– Interaction between power generation and power consumption: An ideal power grid shall balance power generation and consumption. However, the traditional power grid is constructed based on a one-directional approach of transmission-transformation-distribution-consumption, which does not allow the generation capacity to be adjusted according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters are developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which help the supplier read power utilization data every 15 min rather than every month as in the past. Labor cost for meter reading is greatly reduced. Because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and low periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help the realization of time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users.

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the generation capacities of these resources are closely tied to climate conditions that are random and intermittent, connecting them to power grids is challenging. If the big data of power grids is analyzed effectively, such intermittent renewable energy sources can be managed efficiently: the electricity they generate can be allocated to regions with electricity shortages. Such energy resources can complement traditional hydropower and thermal power generation.
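The time-sharing dynamic pricing idea described above can be illustrated with a minimal sketch. It assumes 15-minute meter readings given as hypothetical `(hour_of_day, kWh)` pairs, an illustrative rule that marks an hour as peak when its load exceeds 1.2 times the mean hourly load, and made-up prices; none of these values come from TXU Energy or from any real tariff.

```python
from collections import defaultdict

# Assumed, illustrative tariff values (not taken from any real supplier).
PEAK_PRICE = 0.30      # price per kWh during peak hours
OFF_PEAK_PRICE = 0.10  # price per kWh during all other hours


def hourly_load(readings):
    """Sum 15-minute readings, given as (hour_of_day, kWh) pairs, into kWh per hour."""
    load = defaultdict(float)
    for hour, kwh in readings:
        load[hour] += kwh
    return dict(load)


def peak_hours(load, threshold_ratio=1.2):
    """Return the hours whose load exceeds threshold_ratio times the mean hourly load.

    The threshold rule is an assumption made for this sketch.
    """
    mean = sum(load.values()) / len(load)
    return {hour for hour, kwh in load.items() if kwh > threshold_ratio * mean}


def bill(readings, peaks):
    """Price each reading at the peak or off-peak rate and return the total cost."""
    total = 0.0
    for hour, kwh in readings:
        price = PEAK_PRICE if hour in peaks else OFF_PEAK_PRICE
        total += kwh * price
    return total
```

A supplier-side analysis would first call `hourly_load` on the collected readings, derive the peak set with `peak_hours`, and then price consumption with `bill`. The point of the sketch is only that frequent readings make the peak set observable, which monthly readings cannot do.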

7 Conclusion, open issues, and outlook

In this paper, we review the background and state-of-the-art of big data. First, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. Finally, we review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide readers with a comprehensive overview and big picture of this exciting area.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data faces many challenges, but current research is still at an early stage. Considerable research effort is needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, which are discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined, and the existing models are not strictly verified.

– Standardization of big data: An evaluation system for data quality and an evaluation standard/benchmark for data computing efficiency should be developed. Many big data solutions claim to improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard or benchmark that measures the computing efficiency of big data with rigorous mathematical methods. Performance can only be evaluated once a system is implemented and deployed, so the advantages and disadvantages of alternative solutions cannot be compared side by side, even before and after the implementation of big data. In addition, since data quality is an important basis for data preprocessing, simplification, and screening, evaluating data quality effectively and rigorously is also an urgent problem.

– Evolution of big data computing modes: This includes the memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data has triggered advances in algorithm design, which has shifted from a computing-intensive approach to a data-intensive approach. Data transfer has become a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, should be fully investigated.

– Format conversion of big data: Because data sources are wide and diverse, heterogeneity is always a characteristic of big data, as well as a key factor that restricts the efficiency of data format conversion. If such format conversion can be made more efficient, big data applications may create more value.


– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which makes it the bottleneck of big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is therefore a key factor in improving big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems in big data processing arise from traditional data analysis, including (i) data re-utilization: as data scale increases, more value may be mined from the re-utilization of existing data; (ii) data re-organization: datasets from different businesses can be re-organized so that more value can be mined from them; and (iii) data exhaust, which means wrong data collected during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.
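The “rate of depreciation of data” mentioned under real-time performance is not given a formula in this survey. As a minimal sketch, we assume here that the value of a record decays exponentially with its age, parameterized by a half-life. The decay model, the function names, and the 5 % cutoff are all assumptions made for illustration only.

```python
def data_value(initial_value, age_seconds, half_life_seconds):
    """Remaining value of a record under an assumed exponential-decay model.

    The value halves every half_life_seconds; this model is an assumption,
    not a definition taken from the survey.
    """
    if half_life_seconds <= 0:
        raise ValueError("half_life_seconds must be positive")
    return initial_value * 0.5 ** (age_seconds / half_life_seconds)


def is_expired(age_seconds, half_life_seconds, min_fraction=0.05):
    """Treat a record as past its life cycle once its value drops below min_fraction."""
    return data_value(1.0, age_seconds, half_life_seconds) < min_fraction
```

Under such a model, a real-time application could discard or archive records for which `is_expired` returns true, while an offline application would simply choose a much longer half-life.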

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved.

– Big data management: The emergence of big data brings new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from the comprehensive utilization of multiple datasets is far higher than the sum of the values of the individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration faces many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information that follows different standards and comes from different datasets.

– Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication; big data applications in small and medium-sized businesses; big data applications in government departments; big data services; and human-computer interaction of big data, etc., are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume grows fast, safety risks become more severe, while traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety faces the following security-related challenges.

– Big data privacy: Big data privacy includes two aspects: (i) protection of personal privacy during data acquisition: personal interests, habits, and physical attributes, etc., of users may be acquired more easily, and users may not be aware of it; (ii) personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed a big data company with the most SNS data at present. According to a report [156], Ron Bowes, a researcher at Skull Security, used an information acquisition tool to collect data from the public pages of Facebook users who had failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low-quality data wastes transmission and storage resources and has poor usability. Many factors may restrict data quality; for example, generation, acquisition, and transmission may all influence it. Data quality is mainly manifested in accuracy, completeness, redundancy, and consistency. Even though many measures have been taken to improve data quality, the related problems have not yet been well addressed. Therefore, effective methods to automatically detect data quality and repair damaged data need to be investigated.

– Big data safety mechanism: Big data brings challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods, designed for small and medium-scale data, cannot meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safe communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data application in information security: Big data not only brings challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) by analyzing big data in the form of the log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc., may also be identified more easily through the analysis of big data.
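Three of the data quality dimensions named in the list above (completeness, redundancy, and consistency) can be measured automatically with very little code. The sketch below is one possible formulation under our own assumptions: records are plain dictionaries, a missing value is `None`, duplicates are defined on caller-chosen key fields, and consistency is checked against a caller-supplied rule. Accuracy is omitted because it requires ground-truth data to compare against.

```python
def completeness(records, fields):
    """Fraction of expected field values that are present (not None)."""
    total = len(records) * len(fields)
    if total == 0:
        return 1.0
    present = sum(1 for r in records for f in fields if r.get(f) is not None)
    return present / total


def redundancy(records, key_fields):
    """Fraction of records that duplicate an earlier record on key_fields."""
    if not records:
        return 0.0
    seen = set()
    duplicates = 0
    for r in records:
        key = tuple(r.get(f) for f in key_fields)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates / len(records)


def consistency(records, rule):
    """Fraction of records that satisfy a caller-supplied validity rule."""
    if not records:
        return 1.0
    return sum(1 for r in records if rule(r)) / len(records)
```

For example, with hypothetical meter records carrying an `id` and a `kwh` field, a rule such as `lambda r: r["kwh"] is None or r["kwh"] >= 0` would flag negative consumption values as inconsistent. Repairing the flagged records is a separate problem that this sketch does not address.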

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy efficiency optimization, and processing hardware and software system architectures, etc. In particular, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the “T” (Technology) was the main concern, while technology drove the development of data. In the big data era, with the prominence of data value and advances in “I” (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact but also influence everyone's ways of living and thinking, which is already happening. We cannot predict the future, but we can take precautions against possible future events.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced, given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have been looking for better ways to cope with larger-scale, more diverse, and more complexly structured data. These efforts are represented by Google's globally distributed database Spanner and the fault-tolerant, expandable, distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to those of relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge value, mastering big data means mastering resources. Analysis of the value chain of big data shows that its value comes from the data itself, technologies, and ideas, and that the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting out and assigning the rights to use their data.

– Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc., shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., in text and image editors. In big data applications, mixed data is very useful for decision making. Analytical results can be effectively utilized by users only when they are displayed in a friendly way. Reports, histograms, pie charts, and regression curves, etc., are frequently used to visualize the results of data analysis. New presentation forms will appear in the future; e.g., Microsoft Renlifang, a social search engine, uses relational diagrams to express interpersonal relationships.

– Data-oriented: It is well known that programs consist of data structures and algorithms, and that data structures are used to store data. In the history of program design, the role of data has become increasingly significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed that of business logic. Consequently, programs are gradually being transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods will emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

– During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

– Compared with accurate data, we would like to accept numerous and complicated data.

– We shall pay greater attention to correlations between things rather than exploring causal relationships.

– The simple algorithms of big data are more effective than the complex algorithms of small data.

– Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace “experts.”

Throughout the history of human society, the demands and willingness of human beings have always been the driving force behind scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

                                                          References

1. Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

2. Fact sheet: Big data across the federal government (2012). http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf

3. Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper

4. Drowning in numbers - digital data will flood the planet - and help us understand it better (2011). http://www.economist.com/blogs/dailychart/2011/11/bigdata-0

5. Lohr S (2012) The age of big data. New York Times, pp 11

6. Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/thedigitalbreadcrumbs-that-lead-to-big-data

7. Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-searchforanalysts-to-make-sense-of-big-data

8. Big data (2008). http://www.nature.com/news/specials/bigdata/index.html

9. Special online collection: dealing with big data (2011). http://www.sciencemag.org/site/special/data

10. Manyika J, McKinsey Global Institute, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute

11. Mayer-Schonberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Eamon Dolan/Houghton Mifflin Harcourt

12. Laney D (2001) 3-d data management: controlling data volume, velocity and variety. META Group Research Note, 6 February

13. Zikopoulos P, Eaton C et al (2011) Understanding big data: analytics for enterprise class hadoop and streaming data. McGraw-Hill Osborne Media

14. Meijer E (2011) The world according to linq. Communications of the ACM 54(10):45–51

15. Beyer M (2011) Gartner says solving big data challenge involves more than just managing volumes of data. Gartner. http://www.gartner.com/it/page.jsp

16. O R Team (2011) Big data now: current perspectives from O'Reilly Radar. O'Reilly Media

17. Grobelnik M (2012) Big data tutorial. http://videolectures.net/eswc2012grobelnikbigdata

18. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2008) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014

19. DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6):85–98


20. Walter T (2009) Teradata past, present, and future. UCI ISG lecture series on scalable data management

21. Ghemawat S, Gobioff H, Leung S-T (2003) The google file system. In: ACM SIGOPS Operating Systems Review, vol 37. ACM, pp 29–43

22. Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

23. Hey AJG, Tansley S, Tolle KM et al (2009) The fourth paradigm: data-intensive scientific discovery

24. Howard JH, Kazar ML, Menees SG, Nichols DA, Satyanarayanan M, Sidebotham RN, West MJ (1988) Scale and performance in a distributed file system. ACM Trans Comput Syst (TOCS) 6(1):51–81

25. Cattell R (2011) Scalable sql and nosql data stores. ACM SIGMOD Record 39(4):12–27

26. Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033

27. Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98

28. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J et al (2012) Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States

29. Sun Y, Chen M, Liu B, Mao S (2013) Far: a fault-avoidant routing method for data center networks with regular topology. In: Proceedings of ACM/IEEE symposium on architectures for networking and communications systems (ANCS'13). ACM

30. Wiki (2013) Applications and organizations using hadoop. http://wiki.apache.org/hadoop/PoweredBy

31. Bahga A, Madisetti VK (2012) Analyzing massive machine maintenance data in a computing cloud. IEEE Transac Parallel Distrib Syst 23(10):1831–1843

32. Gunarathne T, Wu T-L, Choi JY, Bae S-H, Qiu J (2011) Cloud computing paradigms for pleasingly parallel biomedical applications. Concurr Comput Prac Experience 23(17):2338–2354

33. Gantz J, Reinsel D (2010) The digital universe decade - are you ready. External publication of IDC (Analyse the Future) information and data, pp 1–16

34. Bryant RE (2011) Data-intensive scalable computing for scientific applications. Comput Sci Eng 13(6):25–33

35. Wahab MHA, Mohd MNH, Hanafi HF, Mohsin MFM (2008) Data pre-processing on web server logs for generalized association rules mining algorithm. World Acad Sci Eng Technol 48:2008

36. Nanopoulos A, Manolopoulos Y, Zakrzewicz M, Morzy T (2002) Indexing web access-logs for pattern queries. In: Proceedings of the 4th international workshop on web information and data management. ACM, pp 63–68

37. Joshi KP, Joshi A, Yesha Y (2003) On using a warehouse to analyze web logs. Distrib Parallel Databases 13(2):161–180

38. Chandramohan V, Christensen K (2002) A first look at wired sensor networks for video surveillance systems. In: Proceedings LCN 2002, 27th annual IEEE conference on local computer networks. IEEE, pp 728–729

39. Selavo L, Wood A, Cao Q, Sookoor T, Liu H, Srinivasan A, Wu Y, Kang W, Stankovic J, Young D et al (2007) Luster: wireless sensor network for environmental research. In: Proceedings of the 5th international conference on Embedded networked sensor systems. ACM, pp 103–116

40. Barrenetxea G, Ingelrest F, Schaefer G, Vetterli M, Couach O, Parlange M (2008) Sensorscope: out-of-the-box environmental monitoring. In: Information processing in sensor networks, 2008, international conference on IPSN'08. IEEE, pp 332–343

41. Kim Y, Schmid T, Charbiwala ZM, Friedman J, Srivastava MB (2008) Nawms: nonintrusive autonomous water monitoring system. In: Proceedings of the 6th ACM conference on Embedded network sensor systems. ACM, pp 309–322

42. Kim S, Pakzad S, Culler D, Demmel J, Fenves G, Glaser S, Turon M (2007) Health monitoring of civil infrastructures using wireless sensor networks. In: Information Processing in Sensor Networks, 2007, 6th International Symposium on IPSN 2007. IEEE, pp 254–263

43. Ceriotti M, Mottola L, Picco GP, Murphy AL, Guna S, Corra M, Pozzi M, Zonta D, Zanon P (2009) Monitoring heritage buildings with wireless sensor networks: the torre aquila deployment. In: Proceedings of the 2009 International Conference on Information Processing in Sensor Networks. IEEE Computer Society, pp 277–288

44. Tolle G, Polastre J, Szewczyk R, Culler D, Turner N, Tu K, Burgess S, Dawson T, Buonadonna P, Gay D et al (2005) A macroscope in the redwoods. In: Proceedings of the 3rd international conference on embedded networked sensor systems. ACM, pp 51–63

45. Wang F, Liu J (2011) Networked wireless sensor data collection: issues, challenges, and approaches. IEEE Commun Surv Tutor 13(4):673–687

46. Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the 11th international conference on World Wide Web. ACM, pp 124–135

47. Choudhary S, Dincturk ME, Mirtaheri SM, Moosavi A, von Bochmann G, Jourdan G-V, Onut I-V (2012) Crawling rich internet applications: the state of the art. In: CASCON, pp 146–160

48. Ghani N, Dixit S, Wang T-S (2000) On ip-over-wdm integration. IEEE Commun Mag 38(3):72–84

49. Manchester J, Anderson J, Doshi B, Dravida S (1998) Ip over sonet. IEEE Commun Mag 36(5):136–142

50. Jinno M, Takara H, Kozicki B (2009) Dynamic optical mesh networks: drivers, challenges and solutions for the future. In: Optical communication, 2009, 35th European conference on ECOC'09. IEEE, pp 1–4

51. Barroso LA, Holzle U (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synt Lect Comput Archit 4(1):1–108

52. Armstrong J (2009) Ofdm for optical communications. J Light Technol 27(3):189–204

53. Shieh W (2011) Ofdm for flexible high-speed optical networks. J Light Technol 29(10):1560–1577

54. Cisco data center interconnect design and deployment guide (2010)

55. Greenberg A, Hamilton JR, Jain N, Kandula S, Kim C, Lahiri P, Maltz DA, Patel P, Sengupta S (2009) Vl2: a scalable and flexible data center network. In: ACM SIGCOMM computer communication review, vol 39. ACM, pp 51–62

56. Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S (2009) Bcube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 39(4):63–74

57. Farrington N, Porter G, Radhakrishnan S, Bazzaz HH, Subramanya V, Fainman Y, Papen G, Vahdat A (2011) Helios: a hybrid electrical/optical switch architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 41(4):339–350

58. Abu-Libdeh H, Costa P, Rowstron A, O'Shea G, Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Comput Commun Rev 40(4):51–62

59. Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what's needed for datacenter network operations. IEEE Commun Mag 48(7):32–39

                                                          206 Mobile Netw Appl (2014) 19171ndash209

                                                          60 Wang G Andersen DG Kaminsky M Papagiannaki K NgTS Kozuch M Ryan M (2010) c-through Part-time optics indata centers In ACM SIGCOMM Computer CommunicationReview vol 40 ACM pp 327ndash338

                                                          61 Ye X Yin Y Yoo SJB Mejia P Proietti R Akella V (2010) Dosa scalable optical switch for datacenters In Proceedings of the6th ACMIEEE symposium on architectures for networking andcommunications systems ACM p 24

                                                          62 Singla A Singh A Ramachandran K Xu L Zhang Y (2010) Pro-teus a topology malleable data center network In Proceedingsof the 9th ACM SIGCOMM workshop on hot topics in networksACM p 8

                                                          63 Liboiron-Ladouceur O Cerutti I Raponi PG Andriolli NCastoldi P (2011) Energy-efficient design of a scalable opti-cal multiplane interconnection architecture IEEE J Sel TopQuantum Electron 17(2)377ndash383

                                                          64 Kodi AK Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance comput-ing (hpc) systems IEEE J Sel Top Quantum Electron 17(2)384ndash395

                                                          65 Zhou X Zhang Z Zhu Y Li Y Kumar S Vahdat A Zhao BYZheng H (2012) Mirror mirror on the ceiling flexible wirelesslinks for data centers ACM SIGCOMM Comput Commun Rev42(4)443ndash454

                                                          66 Lenzerini M (2002) Data integration a theoretical perspectiveIn Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems ACMpp 233ndash246

                                                          67 Cafarella MJ Halevy A Khoussainova N (2009) Data integra-tion for the relational web Proc VLDB Endowment 2(1)1090ndash1101

                                                          68 Maletic JI Marcus A (2000) Data cleansing beyond integrityanalysis In IQ Citeseer pp 200ndash209

                                                          69 Kohavi R Mason L Parekh R Zheng Z (2004) Lessons andchallenges from mining retail e-commerce data Mach Learn57(1-2)83ndash113

                                                          70 Chen H Ku W-S Wang H Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing In Proceedings ofthe 2010 ACM SIGMOD international conference on manage-ment of data ACM pp 51ndash62

                                                          71 Zhao Z Ng W (2012) A model-based approach for rfid datastream cleansing In Proceedings of the 21st ACM internationalconference on information and knowledge management ACMpp 862ndash871

                                                          72 Khoussainova N Balazinska M Suciu D (2008) Probabilisticevent extraction from rfid data In Data Engineering 2008IEEE 24th international conference on ICDE 2008 IEEE pp1480ndash1482

                                                          73 Herbert KG Wang JTL (2007) Biological data cleaning a casestudy Int J Inf Qual 1(1)60ndash82

                                                          74 Tsai T-H Lin C-Y (2012) Exploring contextual redundancy inimproving object-based video coding for video sensor networkssurveillance IEEE Transac Multmed 14(3)669ndash682

                                                          75 Sarawagi S Bhamidipaty A (2002) Interactive deduplicationusing active learning In Proceedings of the eighth ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 269ndash278

                                                          76 Kamath U Compton J Dogan RI Jong KD Shehu A (2012)An evolutionary algorithm approach for feature generation fromsequence data and its application to dna splice site predic-tion IEEEACM Transac Comput Biol Bioinforma (TCBB)9(5)1387ndash1398

                                                          77 Leung K-S Lee KH Wang J-F Ng EYT Chan HLY Tsui SKWMok TSK Tse PC-H Sung JJ-Y (2011) Data mining on dnasequences of hepatitis b virus IEEEACM Transac Comput BiolBioinforma 8(2)428ndash440

                                                          78 Huang Z Shen H Liu J Zhou X (2011) Effective data co-reduction for multimedia similarity search In Proceedings of the2011 ACM SIGMOD International Conference on Managementof data ACM pp 1021ndash1032

                                                          79 Bleiholder J Naumann F (2008) Data fusion ACM Comput Surv(CSUR) 41(1)1

                                                          80 Brewer EA (2000) Towards robust distributed systems InPODC p 7

                                                          81 Gilbert S Lynch N (2002) Brewerrsquos conjecture and the feasibil-ity of consistent available partition-tolerant web services ACMSIGACT News 33(2)51ndash59

                                                          82 McKusick MK Quinlan S (2009) Gfs eqvolution on fast-forward ACM Queue 7(7)10

                                                          83 Chaiken R Jenkins B Larson P-A Ramsey B Shakib D WeaverS Zhou J (2008) Scope easy and efficient parallel process-ing of massive data sets Proc VLDB Endowment 1(2)1265ndash1276

                                                          84 Beaver D Kumar S Li HC Sobel J Vajgel P et al (2010) Findinga needle in haystack facebookrsquos photo storage In OSDI vol 10pp 1ndash8

                                                          85 DeCandia G Hastorun D Jampani M Kakulapati G LakshmanA Pilchin A Sivasubramanian S Vosshall P Vogels W (2007)Dynamo amazonrsquos highly available key-value store In SOSPvol 7 pp 205ndash220

                                                          86 Karger D Lehman E Leighton T Panigrahy R Levine MLewin D (1997) Consistent hashing and random trees distributedcaching protocols for relieving hot spots on the world wide webIn Proceedings of the twenty-ninth annual ACM symposium ontheory of computing ACM pp 654ndash663

                                                          87 Chang F Dean J Ghemawat S Hsieh WC Wallach DA BurrowsM Chandra T Fikes A Gruber RE (2008) Bigtable a distributedstorage system for structured data ACM Trans Comput Syst(TOCS) 26(2)4

                                                          88 Burrows M (2006) The chubby lock service for loosely-coupleddistributed systems In Proceedings of the 7th symposium onOperating systems design and implementation USENIX Associ-ation pp 335ndash350

                                                          89 Lakshman A Malik P (2009) Cassandra structured storagesystem on a p2p network In Proceedings of the 28th ACMsymposium on principles of distributed computing ACMpp 5ndash5

                                                          90 George L (2011) HBase the definitive guide OrsquoReilly Media Inc91 Judd D (2008) hypertable-09 04-alpha92 Chodorow K (2013) MongoDB the definitive guide OrsquoReilly

                                                          Media Inc93 Crockford D (2006) The applicationjson media type for

                                                          javascript object notation (json)94 Murty J (2009) Programming amazon web services S3 EC2

                                                          SQS FPS and SimpleDB OrsquoReilly Media Inc95 Anderson JC Lehnardt J Slater N (2010) CouchDB the defini-

                                                          tive guide OrsquoReilly Media Inc96 Blanas S Patel JM Ercegovac V Rao J Shekita EJ Tian Y

                                                          (2010) A comparison of join algorithms for log processing inmapreduce In Proceedings of the 2010 ACM SIGMOD inter-national conference on management of data ACM pp 975ndash986

                                                          97 Yang H-C Parker DS (2009) Traverse simplified indexingon large map-reduce-merge clusters In Database systems foradvanced applications Springer pp 308ndash322

                                                          98 Pike R Dorward S Griesemer R Quinlan S (2005) Interpretingthe data parallel analysis with sawzall Sci Program 13(4)277ndash298

                                                          99 Gates AF Natkovich O Chopra S Kamath P NarayanamurthySM Olston C Reed B Srinivasan S Srivastava U (2009) Build-ing a high-level dataflow system on top of map-reduce the pigexperience Proceedings VLDB Endowment 2(2)1414ndash1425

                                                          Mobile Netw Appl (2014) 19171ndash209 207

                                                          100 Thusoo A Sarma JS Jain N Shao Z Chakka P Anthony SLiu H Wyckoff P Murthy R (2009) Hive a warehousing solu-tion over a map-reduce framework Proc VLDB Endowment2(2)1626ndash1629

                                                          101 Isard M Budiu M Yu Y Birrell A Fetterly D (2007) Dryad dis-tributed data-parallel programs from sequential building blocksACM SIGOPS Oper Syst Rev 41(3)59ndash72

                                                          102 Yu Y Isard M Fetterly D Budiu M Erlingsson U Gunda PKCurrey J (2008) Dryadlinq a system for general-purpose dis-tributed data-parallel computing using a high-level language InOSDI vol 8 pp 1ndash14

                                                          103 Moretti C Bulosan J Thain D Flynn PJ (2008) All-pairs anabstraction for data-intensive cloud computing In Parallel anddistributed processing 2008 IEEE international symposium onIPDPS 2008 IEEE pp 1ndash11

                                                          104 Malewicz G Austern MH Bik AJC Dehnert JC Horn I LeiserN Czajkowski G (2010) Pregel a system for large-scale graphprocessing In Proceedings of the 2010 ACM SIGMOD interna-tional conference on management of data ACM pp 135ndash146

                                                          105 Bu Y Bill H Balazinska M Ernst MD (2010) Haloop effi-cient iterative data processing on large clusters Proc VLDBEndowment 3(1-2)285ndash296

                                                          106 Ekanayake J Li H Zhang B Gunarathne T Bae S-H Qiu JFox G (2010) Twister a runtime for iterative mapreduce InProceedings of the 19th ACM international symposium on highperformance distributed computing ACM pp 810ndash818

                                                          107 Zaharia M Chowdhury M Das T Dave A Ma J McCauleyM Franklin M Shenker S Stoica I (2012) Resilient distributeddatasets a fault-tolerant abstraction for in-memory cluster com-puting In Proceedings of the 9th USENIX conference onnetworked systems design and implementation USENIX Asso-ciation pp 2ndash2

                                                          108 Bhatotia P Wieder A Rodrigues R Acar UA Pasquin R (2011)Incoop mapreduce for incremental computations In Proceed-ings of the 2nd ACM symposium on cloud computing ACMp 7

                                                          109 Murray DG Schwarzkopf M Smowton C Smith SMadhavapeddy A Hand S (2011) Ciel a universal executionengine for distributed data-flow computing In Proceedings ofthe 8th USENIX conference on Networked systems design andimplementation p 9

                                                          110 Anderson TW (1958) An introduction to multivariate statisticalanalysis vol 2 Wiley New York

                                                          111 Wu X Kumar V Quinlan JR Ghosh J Yang Q Motoda HMcLachlan GJ Ng A Liu B Philip SY et al (2008) Top 10algorithms in data mining Knowl Inf Syst 14(1)1ndash37

                                                          112 What analytics data mining big data software you used in thepast 12 months for a real project (2012) httpwwwkdnuggetscompolls2012analytics-data-mining-big-data-softwarehtml

                                                          113 Berthold MR Cebron N Dill F Gabriel TR Kotter T MeinlT Ohl P Sieb C Thiel K Wiswedel B (2008) KNIME theKonstanz information miner Springer

                                                          114 Sallam RL Richardson J Hagerty J Hostmann B (2011) Magicquadrant for business intelligence platforms CT Gartner GroupStamford

                                                          115 Beyond the PC Special Report on Personal Technology (2011)116 Goff SA Vaughn M McKay S Lyons E Stapleton AE Gessler

                                                          D Matasci N Wang L Hanlon M Lenards A et al (2011) Theiplant collaborative cyberinfrastructure for plant biology FrontPlant Sci 34(2)1ndash16 doi103389fpls201100034

                                                          117 Baah GK Gray A Harrold MJ (2006) On-line anomaly detectionof deployed software a statistical machine learning approachIn Proceedings of the 3rd international workshop on Softwarequality assurance ACM pp 70ndash77

                                                          118 Moeng M Melhem R (2010) Applying statistical machine learn-ing to multicore voltage amp frequency scaling In Proceedings of

                                                          the 7th ACM international conference on computing frontiersACM pp 277ndash286

                                                          119 Gaber MM Zaslavsky A Krishnaswamy S (2005) Mining datastreams a review ACM Sigmod Record 34(2)18ndash26

                                                          120 Verykios VS Bertino E Fovino IN Provenza LP Saygin YTheodoridis Y (2004) State-of-the-art in privacy preserving datamining ACM Sigmod Record 33(1)50ndash57

                                                          121 van der Aalst W (2012) Process mining overview and opportu-nities ACM Transac Manag Inform Syst (TMIS) 3(2)7

                                                          122 Manning CD Schutze H (1999) Foundations of statistical naturallanguage processing vol 999 MIT Press

                                                          123 Pal SK Talwar V Mitra P (2002) Web mining in soft computingframework relevance state of the art and future directions IEEETransac Neural Netw 13(5)1163ndash1177

                                                          124 Chakrabarti S (2000) Data mining for hypertext a tutorial surveyACM SIGKDD Explor Newsl 1(2)1ndash11

                                                          125 Brin S Page L (1998) The anatomy of a large-scale hypertextualweb search engine Comput Netw ISDN Syst 30(1)107ndash117

                                                          126 Konopnicki D Shmueli O (1995) W3qs a query system for theworld-wide web In VLDB vol 95 pp 54ndash65

                                                          127 Chakrabarti S Van den Berg M Dom B (1999) Focused crawl-ing a new approach to topic-specific web resource discoveryComput Netw 31(11)1623ndash1640

                                                          128 Ding D Metze F Rawat S Schulam PF Burger S YounessianE Bao L Christel MG Hauptmann A (2012) Beyond audio andvideo retrieval towards multimedia summarization In Proceed-ings of the 2nd ACM international conference on multimediaretrieval ACM pp 2

                                                          129 Wang M Ni B Hua X-S Chua T-S (2012) Assistive tag-ging a survey of multimedia tagging with human-computer jointexploration ACM Comput Surv (CSUR) 44(4)25

                                                          130 Lew MS Sebe N Djeraba C Jain R (2006) Content-based multi-media information retrieval state of the art and challenges ACMTrans Multimed Comput Commun Appl (TOMCCAP) 2(1)1ndash19

                                                          131 Hu W Xie N Li L Zeng X Maybank S (2011) A survey onvisual content-based video indexing and retrieval IEEE TransSyst Man Cybern Part C Appl Rev 41(6)797ndash819

                                                          132 Park Y-J Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recom-mendation Expert Syst Appl 36(2)1932ndash1939

                                                          133 Barragans-Martınez AB Costa-Montenegro E Burguillo JCRey-Lopez M Mikic-Fonte FA Peleteiro A (2010) A hybridcontent-based and item-based collaborative filtering approach torecommend tv programs enhanced with singular value decompo-sition Inf Sci 180(22)4290ndash4311

                                                          134 Naphade M Smith JR Tesic J Chang S-F Hsu W Kennedy LHauptmann A Curtis J (2006) Large-scale concept ontology formultimedia IEEE Multimedia 13(3)86ndash91

                                                          135 Ma Z Yang Y Cai Y Sebe N Hauptmann AG (2012) Knowl-edge adaptation for ad hoc multimedia event detection withfew exemplars In Proceedings of the 20th ACM internationalconference on multimedia ACM pp 469ndash478

                                                          136 Hirsch JE (2005) An index to quantify an individualrsquos scientificresearch output Proc Natl Acad Sci USA 102(46)16569

                                                          137 Watts DJ (2004) Six degrees the science of a connected ageWW Norton amp Company

                                                          138 Aggarwal CC (2011) An introduction to social network dataanalytics Springer

                                                          139 Scellato S Noulas A Mascolo C (2011) Exploiting place fea-tures in link prediction on location-based social networks InProceedings of the 17th ACM SIGKDD international conferenceon Knowledge discovery and data mining ACM pp 1046ndash1054

                                                          140 Ninagawa A Eguchi K (2010) Link prediction using proba-bilistic group models of network structure In Proceedings ofthe 2010 ACM symposium on applied Computing ACM pp1115ndash1116

                                                          208 Mobile Netw Appl (2014) 19171ndash209

                                                          141 Dunlavy DM Kolda TG Acar E (2011) Temporal link predic-tion using matrix and tensor factorizations ACM Transac KnowlDiscov Data (TKDD) 5(2)10

                                                          142 Leskovec J Lang KJ Mahoney M (2010) Empirical comparisonof algorithms for network community detection In Proceedingsof the 19th international conference on World wide web ACMpp 631ndash640

                                                          143 Du N Wu B Pei X Wang B Xu L (2007) Community detec-tion in large-scale social networks In Proceedings of the 9thWebKDD and 1st SNA-KDD 2007 workshop on Web mining andsocial network analysis ACM pp 16ndash25

                                                          144 Garg S Gupta T Carlsson N Mahanti A (2009) Evolution of anonline social aggregation network an empirical study In Pro-ceedings of the 9th ACM SIGCOMM conference on Internetmeasurement conference ACM pp 315ndash321

                                                          145 Allamanis M Scellato S Mascolo C (2012) Evolution of alocation-based online social network analysis and models InProceedings of the 2012 ACM conference on Internet measure-ment conference ACM pp 145ndash158

                                                          146 Gong NZ Xu W Huang L Mittal P Stefanov E Sekar V SongD (2012) Evolution of social-attribute networks measurementsmodeling and implications using google+ In Proceedings ofthe 2012 ACM conference on Internet measurement conferenceACM pp 131ndash144

                                                          147 Zheleva E Sharara H Getoor L (2009) Co-evolution of socialand affiliation networks In Proceedings of the 15th ACMSIGKDD international conference on Knowledge discovery anddata mining ACM pp 1007ndash1016

                                                          148 Tang J Sun J Wang C Yang Z (2009) Social influence anal-ysis in large-scale networks In Proceedings of the 15th ACMSIGKDD international conference on knowledge discovery anddata mining ACM pp 807ndash816

                                                          149 Li Y Chen W Wang Y Zhang Z-L (2013) Influence diffusiondynamics and influence maximization in social networks withfriend and foe relationships In Proceedings of the sixth ACMinternational conference on Web search and data mining ACMpp 657ndash666

                                                          150 Dai W Chen Y Xue G-R Yang Q Yu Y (2008) Translatedlearning transfer learning across different feature spaces InAdvances in neural information processing systems pp 353ndash360

                                                          151 Cisco Visual Networking Index (2013) Global mobile data traf-fic forecast update 2012ndash2017 httpwwwciscocomenUSsolutionscollateralns341ns525ns537ns705ns827white paperc11-520862html (Son erisim 5 Mayıs 2013)

                                                          152 Rhee Y Lee J (2009) On modeling a model of mobile com-munity designing user interfaces to support group interactionInteractions 16(6)46ndash51

                                                          153 Han J Lee J-G Gonzalez H Li X (2008) Mining massive rfidtrajectory and traffic data sets In Proceedings of the 14th ACMSIGKDD international conference on knowledge discovery anddata mining ACM p 2

                                                          154 Garg MK Kim D-J Turaga DS Prabhakaran B (2010) Mul-timodal analysis of body sensor network data streams forreal-time healthcare In Proceedings of the international con-ference on multimedia information retrieval ACM pp 469ndash478

                                                          155 Park Y Ghosh J (2012) A probabilistic imputation frameworkfor predictive analysis using variably aggregated multi-sourcehealthcare data In Proceedings of the 2nd ACM SIGHITinternational health informatics symposium ACM pp 445ndash454

                                                          156 Tasevski P (2011) Password attacks and generation strate-gies Tartu University Faculty of Mathematics and ComputerSciences

                                                          Mobile Netw Appl (2014) 19171ndash209 209


                                                            or drop in the number of topics; 2) observing the weekly and monthly trends of dialogs on Twitter and developing models for the variation in the level of attention on specific topics over time; 3) understanding the transformation trends of user behavior or interest by comparing ratios of different sub-topics; and 4) predicting trends with external indicators involved in Twitter dialogues. As a classic example, the project discovered that the change of food price inflation in the official statistics of Indonesia matches the number of Tweets about rice prices on Twitter, as shown in Fig. 6.
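                                                            One simple way to quantify how closely a tweet-volume series "matches" an official statistic is the Pearson correlation coefficient. The sketch below is only an illustration of that idea, not the project's actual method, which the text does not specify. The function name `pearson` and both data series are assumptions made up for this example; they are not data from the project.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation undefined for a constant series")
    return cov / (sx * sy)

# Made-up illustrative numbers, NOT data from the project.
tweets_per_month = [120, 150, 310, 480, 450, 200]    # tweets mentioning rice prices
food_inflation_pct = [4.1, 4.3, 6.0, 7.8, 7.5, 4.9]  # official food price inflation

r = pearson(tweets_per_month, food_inflation_pct)
print(f"Pearson r = {r:.3f}")
```

A coefficient close to 1 would indicate that the two series rise and fall together. Correlation alone does not show that one series can predict the other.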

                                                            Generally speaking, the application of big data from online SNS may help to better understand users' behavior and master the laws of social and economic activities from the following three aspects:

– Early Warning: to rapidly cope with a crisis, if any, by detecting abnormalities in the usage of electronic devices and services.

– Real-time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotion, and preference of users.

– Real-time Feedback: to acquire groups' feedback on some social activities based on real-time monitoring.
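The early-warning aspect above can be sketched as a simple anomaly detector over a usage time series. This is a minimal illustration under stated assumptions, not a method described in the source: the function name, the trailing-window z-score rule, the window length, and the threshold are all hypothetical choices, and the daily counts are invented.

```python
from statistics import mean, stdev

def flag_anomalies(series, window=7, threshold=3.0):
    """Return indices whose value deviates from the trailing-window mean
    by more than `threshold` sample standard deviations."""
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: any change at all is treated as abnormal.
            if series[i] != mu:
                flagged.append(i)
            continue
        if abs(series[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

# Hypothetical daily counts of messages mentioning some service.
daily_counts = [100, 98, 103, 101, 99, 102, 100, 97, 104, 350, 101]
print(flag_anomalies(daily_counts))
```

Note that a spike inflates the standard deviation of the windows that follow it, so a production detector would likely use a more robust statistic; the sketch only conveys the idea.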

6.3.4 Applications of healthcare and medical big data

Healthcare and medical data are continuously and rapidly growing complex data, containing abundant and diverse information values. Big data has great potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence the health care business.

For example, Aetna Life Insurance Company selected 102 patients from a pool of a thousand patients to complete an experiment in order to help predict the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of detection test results of metabolic syndrome of patients in three consecutive years. In addition, it summarized the final result into a highly personalized treatment plan to assess the risk factors and main treatment plans of patients. Then, doctors may reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients to lose weight by five pounds, or by suggesting that patients reduce the total triglyceride in their bodies if the sugar content in their bodies is over 20.

The Mount Sinai Medical Center in the US utilizes technologies of Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics. Ayasdi uses topological data analysis, a brand-new mathematical research method, to understand data characteristics.

HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information in individual and family medical devices. Presently, health information can be entered and uploaded with mobile smart devices and

Fig. 6 The correlation between Tweets about rice price and food price inflation

200 Mobile Netw Appl (2014) 19:171–209

imported from individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through the software development kit (SDK) and open interfaces.

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly stronger computing and sensing capacities. As a result, crowd sensing is becoming a key issue of mobile computing. In crowd sensing, a large number of general users utilize mobile devices as basic sensing units and coordinate through mobile networks to distribute sensing tasks and to collect and utilize the sensed data. It can help us complete large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need to have professional skills. Crowd sensing in the form of Crowdsourcing has been successfully applied to geotagged photographs, positioning and navigation, urban road traffic sensing, market forecast, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach for problem solving, takes a large number of general users as the foundation and distributes tasks in a free and voluntary manner. As a matter of fact, Crowdsourcing had been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of Crowdsourcing. The main idea of Crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not or do not want to accomplish. With no need for intentionally deploying sensing modules and employing professionals, Crowdsourcing can broaden the scope of a sensing system to reach the city scale and even larger scales.

In the big data era, Spatial Crowdsourcing has become a hot topic. The operation framework of Spatial Crowdsourcing is as follows. A user may request the service and resources related to a specified location. Then the mobile users who are willing to participate in the task will move to the specified location to acquire related data (such as video, audio, or pictures). Finally, the acquired data will be sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions provided by mobile devices, it can be forecast that Spatial Crowdsourcing will be more prevalent than traditional Crowdsourcing, e.g., Amazon Mechanical Turk and CrowdFlower.
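The request-and-dispatch step of this framework can be sketched as follows. This is a hypothetical illustration, not an algorithm from the source: the `Worker` record, the `assign_task` function, the nearest-willing-worker policy, and the 5 km radius are all assumptions made for the example.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt

@dataclass
class Worker:
    name: str
    lat: float
    lon: float
    willing: bool  # whether the mobile user agrees to take part

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    earth_radius_km = 6371.0
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    return 2 * earth_radius_km * asin(sqrt(a))

def assign_task(task_lat, task_lon, workers, max_km=5.0):
    """Pick the nearest willing worker within max_km of the task, or None."""
    in_range = []
    for w in workers:
        if not w.willing:
            continue
        d = haversine_km(task_lat, task_lon, w.lat, w.lon)
        if d <= max_km:
            in_range.append((d, w))
    if not in_range:
        return None
    return min(in_range, key=lambda pair: pair[0])[1]

# Hypothetical workers around a requested location.
workers = [
    Worker("a", 30.010, 114.0, True),
    Worker("b", 30.001, 114.0, False),  # closest, but not willing
    Worker("c", 30.030, 114.0, True),
]
chosen = assign_task(30.0, 114.0, workers)
print(chosen.name if chosen else "no worker available")
```

The selected worker would then travel to the location, capture the data, and return it to the requester; those steps are outside the scope of this sketch.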

6.3.6 Smart grid

Smart Grid is the next generation power grid, constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart Grid related big data are generated from various sources, such as (i) power utilization habits of users, (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nationwide, (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI), (iv) energy market pricing and bidding data, and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as Circuit Breaker Monitors and transformers). Smart Grid brings about the following challenges on exploiting big data:

– Grid planning: By analyzing data in the Smart Grid, the regions that have excessively high electrical load or high power outage frequencies can be identified. Even the transmission lines with high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an "electric map" according to big data theory and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit to demonstrate the power consumption of every block at the moment. It can even compare the power consumption of the block with the average income per capita and building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides an effective and visual load forecast for power grid planning in a city. Preferential transformation of the power grid facilities in blocks with high power outage frequencies and serious overloads may be conducted, as displayed in the map.

– Interaction between power generation and power consumption: An ideal power grid shall balance power generation and consumption. However, the traditional power grid is constructed based on a one-directional approach of transmission-transformation-distribution-consumption, which does not allow the generation capacity to be adjusted according to the demand of power consumption, thus leading to electric energy redundancy and waste. Therefore, smart electric meters have been developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which help the supplier read power utilization data every 15 min rather than every month as in the past. Labor cost for meter reading is greatly reduced. Because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and low periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help the realization of time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users.

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable energy sources can be efficiently managed, and the electricity generated by the new energy resources can be allocated to regions with electricity shortage. Such energy resources can complement the traditional hydropower and thermal power generations.
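The time-sharing dynamic pricing mentioned in the second item above can be sketched as follows. This is a minimal illustration under assumptions, not TXU Energy's actual scheme: the function names, the rule that hours above the mean load count as peak, and the price factors are hypothetical, and the meter readings are invented.

```python
from collections import defaultdict

def hourly_load(readings):
    """Sum 15-minute readings, given as (hour_of_day, kwh), into kWh per hour."""
    totals = defaultdict(float)
    for hour, kwh in readings:
        totals[hour] += kwh
    return dict(totals)

def time_of_use_prices(load_by_hour, base_price=0.10,
                       peak_factor=1.5, off_peak_factor=0.7):
    """Price each hour per kWh: hours above the mean load are treated as peak."""
    mean_load = sum(load_by_hour.values()) / len(load_by_hour)
    prices = {}
    for hour, load in load_by_hour.items():
        factor = peak_factor if load > mean_load else off_peak_factor
        prices[hour] = round(base_price * factor, 4)
    return prices

# Hypothetical smart-meter data: four 15-minute readings for each of three hours.
readings = (
    [(3, 0.2)] * 4      # night
    + [(12, 0.5)] * 4   # midday
    + [(18, 0.9)] * 4   # evening
)
load = hourly_load(readings)
prices = time_of_use_prices(load)
print(prices)
```

Raising the price in high-load hours and lowering it in low-load hours gives users an incentive to shift consumption, which is how such pricing can flatten peak and low fluctuations.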

7 Conclusion, open issues, and outlook

In this paper, we review the background and state-of-the-art of big data. Firstly, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. Finally, we review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture of this exciting area to readers.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still at an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, which are discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

– Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to gauge the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated when the system is implemented and deployed, so the advantages and disadvantages of alternative solutions cannot be compared side by side, even before and after the implementation of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes: This includes memory mode, data flow mode, PRAM mode, and MR mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, etc., should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor which restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.


– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor to improve big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems on big data processing arise from the traditional data analysis, including (i) data re-utilization: with the increase of data scale, more value may be mined from re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized so that more value can be mined from them; (iii) data exhaust, which means wrong data produced during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved.

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from comprehensive utilization of multiple datasets is far higher than the sum of the values of individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance is the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and coming from different datasets.

– Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction of big data, etc., are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume is growing fast, there are more severe safety risks, while the traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security related challenges.

– Big data privacy: Big data privacy includes two aspects. (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc., of users may be more easily acquired, and users may not be aware of it. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed a big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher of Skull Security, used an information acquisition tool to acquire data in the public pages of Facebook users who had failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low quality data wastes transmission and storage resources and has poor usability. There are a lot of factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though a lot of measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.

– Big data safety mechanism: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods designed for small and medium-scale data cannot meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safety communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data application in information security: Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) by analyzing big data in the form of log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc., may also be more easily identified through the analysis of big data.
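The data quality item above names completeness, redundancy, and consistency as dimensions of quality. The sketch below shows one way such dimensions could be measured automatically over a batch of records. It is a hedged illustration only: the metric definitions, the function names, the sample records, and the temperature range rule are assumptions for the example, not definitions given in the source.

```python
def completeness(records, fields):
    """Fraction of expected field values that are present and non-empty."""
    total = len(records) * len(fields)
    if total == 0:
        return 1.0
    filled = sum(
        1 for r in records for f in fields
        if r.get(f) not in (None, "")
    )
    return filled / total

def redundancy(records, key_fields):
    """Fraction of records that repeat an earlier record on key_fields."""
    if not records:
        return 0.0
    seen = set()
    duplicates = 0
    for r in records:
        key = tuple(r.get(f) for f in key_fields)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates / len(records)

def consistency(records, rule):
    """Fraction of records that satisfy a caller-supplied validity rule."""
    if not records:
        return 1.0
    return sum(1 for r in records if rule(r)) / len(records)

# Hypothetical sensor records, for illustration only.
records = [
    {"id": 1, "temp_c": 21.5, "site": "A"},
    {"id": 2, "temp_c": None, "site": "A"},   # missing measurement
    {"id": 1, "temp_c": 21.5, "site": "A"},   # duplicate of the first record
    {"id": 3, "temp_c": 999.0, "site": ""},   # implausible value, missing site
]

def plausible_temperature(r):
    # Assumed rule: a valid reading lies between -50 and 60 degrees Celsius.
    return r["temp_c"] is not None and -50.0 <= r["temp_c"] <= 60.0

print("completeness:", completeness(records, ["id", "temp_c", "site"]))
print("redundancy:", redundancy(records, ["id"]))
print("consistency:", consistency(records, plausible_temperature))
```

Accuracy, the fourth dimension named in the text, is left out because measuring it needs a trusted reference to compare against.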

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy efficiency optimization, and hardware and software system architectures for processing, etc. Particularly, big data security, including credibility, backup and recovery, and completeness maintenance, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the "T" (Technology) was the main concern, and technology drove the development of data. In the big data era, with the prominence of data value and advances in "I" (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact, but will also influence everyone's ways of living and thinking, which is happening right now. We cannot predict the future, but we may take precautions for possible events to occur in the future.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced, given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have been working on better ways to cope with larger-scale, higher diversity, and more complexly structured data. These efforts are represented by Google's globally distributed database Spanner and the fault-tolerant, expandable, distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, impacts of big data on production management, business operation, and decision making, etc., shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., as in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a friendly manner can they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc., are frequently used to visualize results of data analysis. New presentation forms will occur in the future, e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well-known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it is observed that the role of data is becoming increasingly significant. In the small scale data era, in which logic is more complex than data, program design is mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed that of business logic. Consequently, programs are gradually being transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods will emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

– During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

– Compared with accurate data, we would like to accept numerous and complicated data.

– We shall pay greater attention to correlations between things rather than exploring causal relationships.

– The simple algorithms of big data are more effective than the complex algorithms of small data.

– Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace "experts."

Throughout the history of human society, the demands and willingness of human beings have always been the source of power that promotes scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

                                                            References

                                                            1 Gantz J Reinsel D (2011) Extracting value from chaos IDCiView pp 1ndash12

                                                            2 Fact sheet Big data across the federal government (2012) httpwwwwhitehousegovsitesdefaultfilesmicrositesostpbig datafact sheet 3 29 2012pdf

                                                            3 Cukier K (2010) Data data everywhere a special report onmanaging information Economist Newspaper

                                                            4 Drowning in numbers - digital data will flood the planet- and helpus understand it better (2011) httpwwweconomistcomblogsdailychart201111bigdata-0

                                                            5 Lohr S (2012) The age of big data New York Times pp 116 Yuki N (2011) Following digital breadcrumbs to big data gold

                                                            httpwwwnprorg20111129142521910thedigitalbreadcrumbs-that-lead-to-big-data

                                                            7 Yuki N The search for analysts to make sense of big data (2011)httpwwwnprorg20111130142893065the-searchforanalysts-to-make-sense-of-big-data

                                                            8 Big data (2008) httpwwwnaturecomnewsspecialsbigdataindexhtml

                                                            9 Special online collection dealing with big data (2011) httpwwwsciencemagorgsitespecialdata

                                                            10 Manyika J McKinsey Global Institute Chui M Brown BBughin J Dobbs R Roxburgh C Byers AH (2011) Big datathe next frontier for innovation competition and productivityMcKinsey Global Institute

                                                            11 Mayer-Schonberger V Cukier K (2013) Big data a revolu-tion that will transform how we live work and think EamonDolanHoughton Mifflin Harcourt

12. Laney D (2001) 3-d data management: controlling data volume, velocity and variety. META Group Research Note, 6 February

13. Zikopoulos P, Eaton C et al (2011) Understanding big data: analytics for enterprise class hadoop and streaming data. McGraw-Hill Osborne Media

14. Meijer E (2011) The world according to linq. Communications of the ACM 54(10):45–51

15. Beyer M (2011) Gartner says solving big data challenge involves more than just managing volumes of data. Gartner. http://www.gartner.com/it/page.jsp

16. O R Team (2011) Big data now: current perspectives from O'Reilly Radar. O'Reilly Media

17. Grobelnik M (2012) Big data tutorial. http://videolectures.net/eswc2012grobelnikbigdata

18. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2008) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014

19. DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6):85–98

Mobile Netw Appl (2014) 19:171–209

20. Walter T (2009) Teradata past, present, and future. UCI ISG lecture series on scalable data management

21. Ghemawat S, Gobioff H, Leung S-T (2003) The google file system. In: ACM SIGOPS Operating Systems Review, vol 37. ACM, pp 29–43

22. Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

23. Hey AJG, Tansley S, Tolle KM et al (2009) The fourth paradigm: data-intensive scientific discovery

24. Howard JH, Kazar ML, Menees SG, Nichols DA, Satyanarayanan M, Sidebotham RN, West MJ (1988) Scale and performance in a distributed file system. ACM Trans Comput Syst (TOCS) 6(1):51–81

25. Cattell R (2011) Scalable sql and nosql data stores. ACM SIGMOD Record 39(4):12–27

26. Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033

27. Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98

28. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J et al (2012) Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States

29. Sun Y, Chen M, Liu B, Mao S (2013) Far: a fault-avoidant routing method for data center networks with regular topology. In: Proceedings of ACM/IEEE symposium on architectures for networking and communications systems (ANCS'13). ACM

30. Wiki (2013) Applications and organizations using hadoop. http://wiki.apache.org/hadoop/PoweredBy

31. Bahga A, Madisetti VK (2012) Analyzing massive machine maintenance data in a computing cloud. IEEE Transac Parallel Distrib Syst 23(10):1831–1843

32. Gunarathne T, Wu T-L, Choi JY, Bae S-H, Qiu J (2011) Cloud computing paradigms for pleasingly parallel biomedical applications. Concurr Comput Prac Experience 23(17):2338–2354

33. Gantz J, Reinsel D (2010) The digital universe decade - are you ready. External publication of IDC (Analyse the Future) information and data, pp 1–16

34. Bryant RE (2011) Data-intensive scalable computing for scientific applications. Comput Sci Eng 13(6):25–33

35. Wahab MHA, Mohd MNH, Hanafi HF, Mohsin MFM (2008) Data pre-processing on web server logs for generalized association rules mining algorithm. World Acad Sci Eng Technol 48:2008

36. Nanopoulos A, Manolopoulos Y, Zakrzewicz M, Morzy T (2002) Indexing web access-logs for pattern queries. In: Proceedings of the 4th international workshop on web information and data management. ACM, pp 63–68

37. Joshi KP, Joshi A, Yesha Y (2003) On using a warehouse to analyze web logs. Distrib Parallel Databases 13(2):161–180

38. Chandramohan V, Christensen K (2002) A first look at wired sensor networks for video surveillance systems. In: Proceedings LCN 2002, 27th annual IEEE conference on local computer networks. IEEE, pp 728–729

39. Selavo L, Wood A, Cao Q, Sookoor T, Liu H, Srinivasan A, Wu Y, Kang W, Stankovic J, Young D et al (2007) Luster: wireless sensor network for environmental research. In: Proceedings of the 5th international conference on Embedded networked sensor systems. ACM, pp 103–116

40. Barrenetxea G, Ingelrest F, Schaefer G, Vetterli M, Couach O, Parlange M (2008) Sensorscope: out-of-the-box environmental monitoring. In: Information processing in sensor networks, 2008, international conference on IPSN'08. IEEE, pp 332–343

41. Kim Y, Schmid T, Charbiwala ZM, Friedman J, Srivastava MB (2008) Nawms: nonintrusive autonomous water monitoring system. In: Proceedings of the 6th ACM conference on Embedded network sensor systems. ACM, pp 309–322

42. Kim S, Pakzad S, Culler D, Demmel J, Fenves G, Glaser S, Turon M (2007) Health monitoring of civil infrastructures using wireless sensor networks. In: Information Processing in Sensor Networks, 2007, 6th International Symposium on IPSN 2007. IEEE, pp 254–263

43. Ceriotti M, Mottola L, Picco GP, Murphy AL, Guna S, Corra M, Pozzi M, Zonta D, Zanon P (2009) Monitoring heritage buildings with wireless sensor networks: the torre aquila deployment. In: Proceedings of the 2009 International Conference on Information Processing in Sensor Networks. IEEE Computer Society, pp 277–288

44. Tolle G, Polastre J, Szewczyk R, Culler D, Turner N, Tu K, Burgess S, Dawson T, Buonadonna P, Gay D et al (2005) A macroscope in the redwoods. In: Proceedings of the 3rd international conference on embedded networked sensor systems. ACM, pp 51–63

45. Wang F, Liu J (2011) Networked wireless sensor data collection: issues, challenges, and approaches. IEEE Commun Surv Tutor 13(4):673–687

46. Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the 11th international conference on World Wide Web. ACM, pp 124–135

47. Choudhary S, Dincturk ME, Mirtaheri SM, Moosavi A, von Bochmann G, Jourdan G-V, Onut I-V (2012) Crawling rich internet applications: the state of the art. In: CASCON, pp 146–160

48. Ghani N, Dixit S, Wang T-S (2000) On ip-over-wdm integration. IEEE Commun Mag 38(3):72–84

49. Manchester J, Anderson J, Doshi B, Dravida S (1998) Ip over sonet. IEEE Commun Mag 36(5):136–142

50. Jinno M, Takara H, Kozicki B (2009) Dynamic optical mesh networks: drivers, challenges and solutions for the future. In: Optical communication, 2009, 35th European conference on ECOC'09. IEEE, pp 1–4

51. Barroso LA, Holzle U (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synt Lect Comput Archit 4(1):1–108

52. Armstrong J (2009) Ofdm for optical communications. J Light Technol 27(3):189–204

53. Shieh W (2011) Ofdm for flexible high-speed optical networks. J Light Technol 29(10):1560–1577

54. Cisco data center interconnect design and deployment guide (2010)

55. Greenberg A, Hamilton JR, Jain N, Kandula S, Kim C, Lahiri P, Maltz DA, Patel P, Sengupta S (2009) Vl2: a scalable and flexible data center network. In: ACM SIGCOMM computer communication review, vol 39. ACM, pp 51–62

56. Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S (2009) Bcube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 39(4):63–74

57. Farrington N, Porter G, Radhakrishnan S, Bazzaz HH, Subramanya V, Fainman Y, Papen G, Vahdat A (2011) Helios: a hybrid electrical/optical switch architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 41(4):339–350

58. Abu-Libdeh H, Costa P, Rowstron A, O'Shea G, Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Comput Commun Rev 40(4):51–62

59. Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what's needed for datacenter network operations. IEEE Commun Mag 48(7):32–39

60. Wang G, Andersen DG, Kaminsky M, Papagiannaki K, Ng TS, Kozuch M, Ryan M (2010) c-through: Part-time optics in data centers. In: ACM SIGCOMM Computer Communication Review, vol 40. ACM, pp 327–338

61. Ye X, Yin Y, Yoo SJB, Mejia P, Proietti R, Akella V (2010) Dos: a scalable optical switch for datacenters. In: Proceedings of the 6th ACM/IEEE symposium on architectures for networking and communications systems. ACM, p 24

62. Singla A, Singh A, Ramachandran K, Xu L, Zhang Y (2010) Proteus: a topology malleable data center network. In: Proceedings of the 9th ACM SIGCOMM workshop on hot topics in networks. ACM, p 8

63. Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P (2011) Energy-efficient design of a scalable optical multiplane interconnection architecture. IEEE J Sel Top Quantum Electron 17(2):377–383

64. Kodi AK, Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance computing (hpc) systems. IEEE J Sel Top Quantum Electron 17(2):384–395

65. Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H (2012) Mirror mirror on the ceiling: flexible wireless links for data centers. ACM SIGCOMM Comput Commun Rev 42(4):443–454

66. Lenzerini M (2002) Data integration: a theoretical perspective. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. ACM, pp 233–246

67. Cafarella MJ, Halevy A, Khoussainova N (2009) Data integration for the relational web. Proc VLDB Endowment 2(1):1090–1101

68. Maletic JI, Marcus A (2000) Data cleansing: beyond integrity analysis. In: IQ. Citeseer, pp 200–209

69. Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1-2):83–113

70. Chen H, Ku W-S, Wang H, Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 51–62

71. Zhao Z, Ng W (2012) A model-based approach for rfid data stream cleansing. In: Proceedings of the 21st ACM international conference on information and knowledge management. ACM, pp 862–871

72. Khoussainova N, Balazinska M, Suciu D (2008) Probabilistic event extraction from rfid data. In: Data Engineering, 2008, IEEE 24th international conference on ICDE 2008. IEEE, pp 1480–1482

73. Herbert KG, Wang JTL (2007) Biological data cleaning: a case study. Int J Inf Qual 1(1):60–82

74. Tsai T-H, Lin C-Y (2012) Exploring contextual redundancy in improving object-based video coding for video sensor networks surveillance. IEEE Transac Multmed 14(3):669–682

75. Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 269–278

76. Kamath U, Compton J, Dogan RI, Jong KD, Shehu A (2012) An evolutionary algorithm approach for feature generation from sequence data and its application to dna splice site prediction. IEEE/ACM Transac Comput Biol Bioinforma (TCBB) 9(5):1387–1398

77. Leung K-S, Lee KH, Wang J-F, Ng EYT, Chan HLY, Tsui SKW, Mok TSK, Tse PC-H, Sung JJ-Y (2011) Data mining on dna sequences of hepatitis b virus. IEEE/ACM Transac Comput Biol Bioinforma 8(2):428–440

78. Huang Z, Shen H, Liu J, Zhou X (2011) Effective data co-reduction for multimedia similarity search. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, pp 1021–1032

79. Bleiholder J, Naumann F (2008) Data fusion. ACM Comput Surv (CSUR) 41(1):1

80. Brewer EA (2000) Towards robust distributed systems. In: PODC, p 7

81. Gilbert S, Lynch N (2002) Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59

82. McKusick MK, Quinlan S (2009) Gfs: evolution on fast-forward. ACM Queue 7(7):10

83. Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J (2008) Scope: easy and efficient parallel processing of massive data sets. Proc VLDB Endowment 1(2):1265–1276

84. Beaver D, Kumar S, Li HC, Sobel J, Vajgel P et al (2010) Finding a needle in haystack: facebook's photo storage. In: OSDI, vol 10, pp 1–8

85. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: amazon's highly available key-value store. In: SOSP, vol 7, pp 205–220

86. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In: Proceedings of the twenty-ninth annual ACM symposium on theory of computing. ACM, pp 654–663

87. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2):4

88. Burrows M (2006) The chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th symposium on Operating systems design and implementation. USENIX Association, pp 335–350

89. Lakshman A, Malik P (2009) Cassandra: structured storage system on a p2p network. In: Proceedings of the 28th ACM symposium on principles of distributed computing. ACM, pp 5–5

90. George L (2011) HBase: the definitive guide. O'Reilly Media Inc

91. Judd D (2008) hypertable-0.9.0.4-alpha

92. Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc

93. Crockford D (2006) The application/json media type for javascript object notation (json)

94. Murty J (2009) Programming amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc

95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc

96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 975–986

97. Yang H-C, Parker DS (2009) Traverse: simplified indexing on large map-reduce-merge clusters. In: Database systems for advanced applications. Springer, pp 308–322

98. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with sawzall. Sci Program 13(4):277–298

99. Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U (2009) Building a high-level dataflow system on top of map-reduce: the pig experience. Proceedings VLDB Endowment 2(2):1414–1425

100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629

101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72

102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) Dryadlinq: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14

103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: Parallel and distributed processing, 2008, IEEE international symposium on IPDPS 2008. IEEE, pp 1–11

104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146

105. Bu Y, Bill H, Balazinska M, Ernst MD (2010) Haloop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296

106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative mapreduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing. ACM, pp 810–818

107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2

108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: mapreduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing. ACM, p 7

109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) Ciel: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on Networked systems design and implementation, p 9

110. Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York

111. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

112. What analytics, data mining, big data software you used in the past 12 months for a real project (2012). http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html

113. Berthold MR, Cebron N, Dill F, Gabriel TR, Kotter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer

114. Sallam RL, Richardson J, Hagerty J, Hostmann B (2011) Magic quadrant for business intelligence platforms. CT, Gartner Group, Stamford

115. Beyond the PC. Special Report on Personal Technology (2011)

116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A et al (2011) The iplant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on Software quality assurance. ACM, pp 70–77

118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers. ACM, pp 277–286

119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Record 34(2):18–26

120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM Sigmod Record 33(1):50–57

121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Transac Manag Inform Syst (TMIS) 3(2):7

122. Manning CD, Schutze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press

123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Transac Neural Netw 13(5):1163–1177

124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11

125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117

126. Konopnicki D, Shmueli O (1995) W3qs: a query system for the world-wide web. In: VLDB, vol 95, pp 54–65

127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640

128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval. ACM, p 2

129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25

130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19

131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819

132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939

133. Barragans-Martínez AB, Costa-Montenegro E, Burguillo JC, Rey-Lopez M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend tv programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311

134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91

135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia. ACM, pp 469–478

136. Hirsch JE (2005) An index to quantify an individual's scientific research output. Proc Natl Acad Sci USA 102(46):16569

137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company

138. Aggarwal CC (2011) An introduction to social network data analytics. Springer

139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1046–1054

140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied Computing. ACM, pp 1115–1116

141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Transac Knowl Discov Data (TKDD) 5(2):10

142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on World wide web. ACM, pp 631–640

143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis. ACM, pp 16–25

144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. ACM, pp 315–321

145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 145–158

146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 131–144

147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1007–1016

148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 807–816

149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on Web search and data mining. ACM, pp 657–666

150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360

151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (last accessed 5 May 2013)

152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51

153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive RFID, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 2

154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval. ACM, pp 469–478

155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated, multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium. ACM, pp 445–454

156. Tasevski P (2011) Password attacks and generation strategies. Tartu University: Faculty of Mathematics and Computer Sciences



imported from individual medical records by a third-party agency. In addition, it can be integrated with a third-party application through the software development kit (SDK) and open interface.

6.3.5 Collective intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablets have increasingly strong computing and sensing capacities. As a result, crowd sensing is becoming a key issue in mobile computing. In crowd sensing, a large number of general users use mobile devices as basic sensing units and coordinate through mobile networks to distribute sensing tasks and to collect and utilize the sensed data. This helps complete large-scale and complex social sensing tasks, and the participants who complete such tasks do not need professional skills. Crowd sensing in the form of crowdsourcing has been successfully applied to geotagged photographs, positioning and navigation, urban road traffic sensing, market forecasting, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach to problem solving, takes a large number of general users as its foundation and distributes tasks in a free and voluntary manner. As a matter of fact, crowdsourcing had been applied by many companies before the emergence of big data. For example, P&G, BMW, and Audi improved their R&D and design capacities by virtue of crowdsourcing. The main idea of crowdsourcing is to distribute tasks to general users and to complete tasks that individual users could not, or do not want to, accomplish. With no need to intentionally deploy sensing modules or employ professionals, crowdsourcing can broaden the scope of a sensing system to the city scale and even larger scales.

In the big data era, spatial crowdsourcing has become a hot topic. The operation framework of spatial crowdsourcing is as follows. A user requests a service or resources related to a specified location. Mobile users who are willing to participate in the task then move to the specified location to acquire the related data (such as video, audio, or pictures). Finally, the acquired data are sent to the service requester. With the rapid growth of mobile devices and the increasingly powerful functions they provide, it can be forecast that spatial crowdsourcing will become more prevalent than traditional crowdsourcing, e.g., Amazon Mechanical Turk and Crowdflower.
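The spatial crowdsourcing workflow described above (a location-bound request, willing mobile users, data returned to the requester) can be illustrated with a minimal sketch. The survey does not prescribe an assignment rule; choosing the nearest willing worker within a radius is an assumption made here for illustration, and the names `Worker`, `assign_task`, and `max_km` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Worker:
    worker_id: str
    lat: float
    lon: float
    willing: bool  # whether the mobile user has opted in to the task


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def assign_task(task_lat, task_lon, workers, max_km=5.0):
    """Return the closest willing worker within max_km of the task, or None.

    Assumed policy (not from the survey): nearest willing worker wins.
    """
    candidates = [
        (haversine_km(task_lat, task_lon, w.lat, w.lon), w)
        for w in workers
        if w.willing
    ]
    candidates = [(d, w) for d, w in candidates if d <= max_km]
    if not candidates:
        return None
    # Compare on distance only, so Worker objects never need to be ordered.
    return min(candidates, key=lambda c: c[0])[1]
```

In a real platform the assignment would also have to account for worker travel time, task deadlines, and data upload cost; the sketch only captures the location-matching step.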

6.3.6 Smart grid

The Smart Grid is the next-generation power grid, constituted by traditional energy networks integrated with computation, communications, and control for optimized generation, supply, and consumption of electric energy. Smart Grid-related big data are generated from various sources, such as (i) power utilization habits of users; (ii) phasor measurement data, which are measured by phasor measurement units (PMUs) deployed nationwide; (iii) energy consumption data measured by the smart meters in the Advanced Metering Infrastructure (AMI); (iv) energy market pricing and bidding data; and (v) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as circuit breaker monitors and transformers). The Smart Grid brings about the following challenges in exploiting big data.

– Grid planning: By analyzing data in the Smart Grid, regions with excessively high electrical load or high power outage frequency can be identified. Even transmission lines with a high failure probability can be identified. Such analytical results may contribute to grid upgrading, transformation, maintenance, etc. For example, researchers from the University of California, Los Angeles designed an "electric map" according to big data theory, producing a map of California by integrating census information with real-time power utilization information provided by electric power companies. The map takes a block as its unit and shows the power consumption of every block at the current moment. It can even compare the power consumption of a block with the average income per capita and the building types, so as to reveal more accurate power usage habits of all kinds of groups in the community. This map provides an effective and visual load forecast for power grid planning in a city. Power grid facilities in blocks with high power outage frequencies and serious overloads, as displayed in the map, may be given priority for transformation.

– Interaction between power generation and power consumption: An ideal power grid should balance power generation and consumption. However, the traditional power grid is built on a one-directional approach of transmission-transformation-distribution-consumption, which does not allow the generation capacity to be adjusted according to the demand for power, thus leading to electric energy redundancy and waste. Therefore, smart electric meters have been developed to improve power supply efficiency. TXU Energy has several successful deployments of smart electric meters, which help the supplier read power utilization data every 15 min rather than every month as in the past. The labor cost of meter reading is greatly reduced. Because power utilization data (a source of big data) are frequently and rapidly acquired and analyzed, power supply companies can adjust the electricity price according to peak and low periods of power consumption. TXU Energy used such price levels to smooth the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help realize time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users.

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids are effectively analyzed, such intermittent renewable energy sources can be managed efficiently: the electricity they generate can be allocated to regions with electricity shortages. Such energy resources can complement traditional hydropower and thermal power generation.
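As a rough illustration of the time-sharing dynamic pricing mentioned in the list above, the sketch below computes a bill from 15-minute smart meter readings under a two-level tariff. The peak window (17:00 to 20:59) and both prices are hypothetical values chosen for the example; they are not TXU Energy tariffs, and the function names are illustrative.

```python
from datetime import datetime, timedelta

# Hypothetical tariff, assumed for illustration only.
PEAK_HOURS = range(17, 21)   # 17:00 through 20:59
PEAK_PRICE = 0.30            # currency units per kWh during peak hours
OFF_PEAK_PRICE = 0.10        # currency units per kWh otherwise


def bill(readings):
    """Sum the cost of (timestamp, kWh) readings, one per 15-minute interval."""
    total = 0.0
    for ts, kwh in readings:
        price = PEAK_PRICE if ts.hour in PEAK_HOURS else OFF_PEAK_PRICE
        total += kwh * price
    return total


def make_day(start, kwh_per_interval):
    """Build 96 synthetic readings (24 h at 15-minute resolution)."""
    return [
        (start + timedelta(minutes=15 * i), kwh_per_interval)
        for i in range(96)
    ]


if __name__ == "__main__":
    day = make_day(datetime(2014, 1, 1), 0.5)
    print(f"daily cost: {bill(day):.2f}")
```

With monthly readings only the total consumption is known, so a tariff of this kind cannot be applied; the 15-minute resolution is what makes the per-interval price lookup possible.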

7 Conclusion, open issues, and outlook

In this paper, we review the background and state of the art of big data. Firstly, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. Finally, we review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide readers with a comprehensive overview and big picture of this exciting area.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions for big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but current research is still at an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, which are discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

– Standardization of big data: An evaluation system for data quality and an evaluation standard/benchmark for data computing efficiency should be developed. Many big data solutions claim that they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard or benchmark that measures the computing efficiency of big data with rigorous mathematical methods. Performance can only be evaluated once a system is implemented and deployed, so the advantages and disadvantages of alternative solutions cannot be compared side by side, nor can the situation before and after adopting big data be compared. In addition, since data quality is an important basis for data preprocessing, simplification, and screening, evaluating data quality effectively and rigorously is also an urgent problem.

– Evolution of big data computing modes: These include the memory mode, data flow mode, PRAM mode, MR (MapReduce) mode, etc. The emergence of big data has triggered advances in algorithm design, which has shifted from a computing-intensive approach to a data-intensive approach. Data transfer has become a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor that restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.


– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which makes it the bottleneck of big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is therefore a key factor in improving big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems in big data processing arise beyond traditional data analysis, including (i) data re-utilization: as the data scale increases, more value may be mined from the re-utilization of existing data; (ii) data re-organization: datasets from different businesses can be re-organized so that more value can be mined from them; and (iii) data exhaust, which refers to wrong data produced during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved.

– Big data management: The emergence of big data brings new challenges to traditional data management. At present, many research efforts are being made on big data-oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management for mobile and pervasive computing, data management for SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from the comprehensive utilization of multiple datasets is far higher than the sum of the values of the individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as differing data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and it is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information that follows different standards and comes from different datasets.

– Big data application: At present, the application of big data is just beginning, and we should explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication; big data applications in small and medium-sized businesses; big data applications in government departments; big data services; and human-computer interaction for big data are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume grows fast, safety risks become more severe, while traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security-related challenges.

– Big data privacy: Big data privacy includes two aspects. (i) Protection of personal privacy during data acquisition: the personal interests, habits, body properties, etc. of users may be acquired more easily, and users may not be aware of it. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is currently regarded as the big data company with the most SNS data. According to a report [156], Ron Bowes, a researcher at Skull Security, used an information acquisition tool to collect data from the public pages of Facebook users who had failed to modify their privacy settings. Ron Bowes packaged the data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy being mined from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low-quality data wastes transmission and storage resources and has poor usability. Many factors may restrict data quality; for example, generation, acquisition, and transmission may all influence it. Data quality is mainly manifested in accuracy, completeness, redundancy, and consistency. Even though many measures have been taken to improve data quality, the related problems have not yet been well addressed. Therefore, effective methods to automatically assess data quality and repair damaged data need to be investigated.

– Big data safety mechanism: Big data brings challenges to data encryption due to its large scale and high diversity. Encryption methods designed for small and medium-scale data cannot meet the performance demands of big data, so efficient big data cryptography approaches should be developed. Effective schemes for safety management, access control, and safe communications should be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled while ensuring efficiency.

– Big data application in information security: Big data not only brings challenges to information security but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) by analyzing big data in the form of log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, attack characteristics, etc. may also be identified more easily through the analysis of big data.

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy efficiency optimization, and hardware and software system architectures for processing, etc. In particular, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.
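To make the data quality dimensions named above more concrete, the following sketch computes two of them, completeness and redundancy, for a small list of records. The definitions used here (share of expected field values that are present, and share of records that duplicate an earlier record) are simple working assumptions rather than metrics defined in this survey, and the field names are hypothetical.

```python
def completeness(records, fields):
    """Fraction of expected field values that are present (not None).

    Assumed definition: a missing key and an explicit None both count
    as missing.
    """
    total = len(records) * len(fields)
    if total == 0:
        return 1.0
    present = sum(
        1 for r in records for f in fields if r.get(f) is not None
    )
    return present / total


def redundancy(records, fields):
    """Fraction of records that exactly duplicate an earlier record."""
    if not records:
        return 0.0
    seen = set()
    duplicates = 0
    for r in records:
        key = tuple(r.get(f) for f in fields)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates / len(records)


if __name__ == "__main__":
    # Hypothetical smart meter records, for illustration only.
    sample = [
        {"meter_id": 1, "kwh": 2.0},
        {"meter_id": 1, "kwh": 2.0},   # exact duplicate
        {"meter_id": 2, "kwh": None},  # missing value
        {"meter_id": 3},               # missing field
    ]
    cols = ["meter_id", "kwh"]
    print(completeness(sample, cols), redundancy(sample, cols))
```

Accuracy and consistency are harder to score automatically because they need a reference source or domain rules, which is part of why the text calls automatic quality assessment an open problem.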

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the "T" (Technology) was the main concern, and technology drove the development of data. In the big data era, with the prominence of data value and advances in the "I" (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact but will also influence everyone's ways of living and thinking, which is already happening. We cannot predict the future, but we may prepare for events that are likely to occur.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced, given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have been looking for better ways to cope with larger-scale, more diverse, and more complexly structured data. These efforts are represented by Google's globally distributed database Spanner and the fault-tolerant, scalable distributed relational database F1. In the future, big data storage technology will employ distributed databases, support transaction mechanisms similar to those of relational databases, and handle data effectively through grammars similar to SQL.

– Data resource performance: Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and that the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting out and assigning the rights to use their data.

– Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data should explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, information security, etc. Then the impacts of big data on production management, business operation, decision making, etc. should be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., in text and image editors. In big data applications, mixed data is very useful for decision making. Only when analytical results are displayed in a user-friendly way can they be effectively utilized by users. Reports, histograms, pie charts, regression curves, etc. are frequently used to visualize the results of data analysis. New presentation forms will appear in the future; e.g., Microsoft Renlifang, a social search engine, uses relational diagrams to express interpersonal relationships.

– Data-oriented: It is well known that programs consist of data structures and algorithms, and that data structures are used to store data. The history of program design shows that the role of data is becoming increasingly significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed that of business logic. Consequently, programs are gradually being transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods will emerge and will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking. Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

  – During data analysis, we will try to utilize all data rather than analyzing only a small set of sample data.

  – Rather than insisting on accurate data, we will be willing to accept numerous and complicated data.

  – We shall pay greater attention to correlations between things rather than exploring causal relationships.

  – The simple algorithms of big data are more effective than the complex algorithms of small data.

  – Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace "experts".
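The first point in the list above, using all data rather than a small sample, can be illustrated with a minimal Python sketch. It is not taken from [11]. The dataset, its size, the frequency of the rare event, and the function names `make_records` and `event_rate` are all hypothetical and chosen only for illustration.

```python
import random


def make_records(n=100_000, rare_every=10_000):
    """Build a hypothetical dataset: 1 marks a rare event, 0 a normal record."""
    return [1 if i % rare_every == 0 else 0 for i in range(n)]


def event_rate(records):
    """Fraction of records that are rare events."""
    return sum(records) / len(records)


records = make_records()

# Analysis over all data: every rare event is counted.
full_rate = event_rate(records)

# Analysis over a small random sample of 100 records
# (the seed is fixed only to make the run repeatable).
rng = random.Random(0)
sample = rng.sample(records, 100)
sample_rate = event_rate(sample)

print(f"rare events in all data: {sum(records)}")
print(f"rate from all data:      {full_rate:.4%}")
print(f"rate from sample of 100: {sample_rate:.4%}")
```

Because the sample holds only 100 records, its estimate can only be 0 or a multiple of 1 %, so it cannot reproduce the true rate of 0.01 %. Under these assumed numbers, most samples of this size contain no rare event at all.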

Throughout the history of human society, the demands and willingness of human beings have always been the driving forces of scientific and technological progress. Big data may provide reference answers for human decision making through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao’s research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

                                                              References

1. Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

2. Fact sheet: Big data across the federal government (2012) httpwwwwhitehousegovsitesdefaultfilesmicrositesostpbig datafact sheet 3 29 2012pdf

3. Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper

4. Drowning in numbers - digital data will flood the planet - and help us understand it better (2011) httpwwweconomistcomblogsdailychart201111bigdata-0

5. Lohr S (2012) The age of big data. New York Times, pp 11

6. Yuki N (2011) Following digital breadcrumbs to big data gold. httpwwwnprorg20111129142521910thedigitalbreadcrumbs-that-lead-to-big-data

7. Yuki N (2011) The search for analysts to make sense of big data. httpwwwnprorg20111130142893065the-searchforanalysts-to-make-sense-of-big-data

8. Big data (2008) httpwwwnaturecomnewsspecialsbigdataindexhtml

9. Special online collection dealing with big data (2011) httpwwwsciencemagorgsitespecialdata

10. Manyika J, McKinsey Global Institute, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute

11. Mayer-Schonberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Eamon Dolan/Houghton Mifflin Harcourt

12. Laney D (2001) 3-d data management: controlling data volume, velocity and variety. META Group Research Note, 6 February

13. Zikopoulos P, Eaton C et al (2011) Understanding big data: analytics for enterprise class hadoop and streaming data. McGraw-Hill Osborne Media

14. Meijer E (2011) The world according to linq. Communications of the ACM 54(10):45–51

15. Beyer M (2011) Gartner says solving big data challenge involves more than just managing volumes of data. Gartner. httpwwwgartnercomitpagejsp

16. O R Team (2011) Big data now: current perspectives from O’Reilly Radar. O’Reilly Media

17. Grobelnik M (2012) Big data tutorial. httpvideolecturesneteswc2012grobelnikbigdata

18. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2008) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014

19. DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6):85–98


20. Walter T (2009) Teradata past, present, and future. UCI ISG lecture series on scalable data management

21. Ghemawat S, Gobioff H, Leung S-T (2003) The google file system. In: ACM SIGOPS Operating Systems Review, vol 37. ACM, pp 29–43

22. Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

23. Hey AJG, Tansley S, Tolle KM et al (2009) The fourth paradigm: data-intensive scientific discovery

24. Howard JH, Kazar ML, Menees SG, Nichols DA, Satyanarayanan M, Sidebotham RN, West MJ (1988) Scale and performance in a distributed file system. ACM Trans Comput Syst (TOCS) 6(1):51–81

25. Cattell R (2011) Scalable sql and nosql data stores. ACM SIGMOD Record 39(4):12–27

26. Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033

27. Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98

28. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J et al (2012) Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States

29. Sun Y, Chen M, Liu B, Mao S (2013) Far: a fault-avoidant routing method for data center networks with regular topology. In: Proceedings of ACM/IEEE symposium on architectures for networking and communications systems (ANCS’13). ACM

30. Wiki (2013) Applications and organizations using hadoop. httpwikiapacheorghadoopPoweredBy

31. Bahga A, Madisetti VK (2012) Analyzing massive machine maintenance data in a computing cloud. IEEE Transac Parallel Distrib Syst 23(10):1831–1843

32. Gunarathne T, Wu T-L, Choi JY, Bae S-H, Qiu J (2011) Cloud computing paradigms for pleasingly parallel biomedical applications. Concurr Comput Prac Experience 23(17):2338–2354

33. Gantz J, Reinsel D (2010) The digital universe decade - are you ready. External publication of IDC (Analyse the Future) information and data, pp 1–16

34. Bryant RE (2011) Data-intensive scalable computing for scientific applications. Comput Sci Eng 13(6):25–33

35. Wahab MHA, Mohd MNH, Hanafi HF, Mohsin MFM (2008) Data pre-processing on web server logs for generalized association rules mining algorithm. World Acad Sci Eng Technol 48:2008

36. Nanopoulos A, Manolopoulos Y, Zakrzewicz M, Morzy T (2002) Indexing web access-logs for pattern queries. In: Proceedings of the 4th international workshop on web information and data management. ACM, pp 63–68

37. Joshi KP, Joshi A, Yesha Y (2003) On using a warehouse to analyze web logs. Distrib Parallel Databases 13(2):161–180

38. Chandramohan V, Christensen K (2002) A first look at wired sensor networks for video surveillance systems. In: Proceedings LCN 2002, 27th annual IEEE conference on local computer networks. IEEE, pp 728–729

39. Selavo L, Wood A, Cao Q, Sookoor T, Liu H, Srinivasan A, Wu Y, Kang W, Stankovic J, Young D et al (2007) Luster: wireless sensor network for environmental research. In: Proceedings of the 5th international conference on Embedded networked sensor systems. ACM, pp 103–116

40. Barrenetxea G, Ingelrest F, Schaefer G, Vetterli M, Couach O, Parlange M (2008) Sensorscope: out-of-the-box environmental monitoring. In: Information processing in sensor networks 2008, international conference on IPSN’08. IEEE, pp 332–343

41. Kim Y, Schmid T, Charbiwala ZM, Friedman J, Srivastava MB (2008) Nawms: nonintrusive autonomous water monitoring system. In: Proceedings of the 6th ACM conference on Embedded network sensor systems. ACM, pp 309–322

42. Kim S, Pakzad S, Culler D, Demmel J, Fenves G, Glaser S, Turon M (2007) Health monitoring of civil infrastructures using wireless sensor networks. In: Information Processing in Sensor Networks 2007, 6th International Symposium on IPSN 2007. IEEE, pp 254–263

43. Ceriotti M, Mottola L, Picco GP, Murphy AL, Guna S, Corra M, Pozzi M, Zonta D, Zanon P (2009) Monitoring heritage buildings with wireless sensor networks: the torre aquila deployment. In: Proceedings of the 2009 International Conference on Information Processing in Sensor Networks. IEEE Computer Society, pp 277–288

44. Tolle G, Polastre J, Szewczyk R, Culler D, Turner N, Tu K, Burgess S, Dawson T, Buonadonna P, Gay D et al (2005) A macroscope in the redwoods. In: Proceedings of the 3rd international conference on embedded networked sensor systems. ACM, pp 51–63

45. Wang F, Liu J (2011) Networked wireless sensor data collection: issues, challenges, and approaches. IEEE Commun Surv Tutor 13(4):673–687

46. Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the 11th international conference on World Wide Web. ACM, pp 124–135

47. Choudhary S, Dincturk ME, Mirtaheri SM, Moosavi A, von Bochmann G, Jourdan G-V, Onut I-V (2012) Crawling rich internet applications: the state of the art. In: CASCON, pp 146–160

48. Ghani N, Dixit S, Wang T-S (2000) On ip-over-wdm integration. IEEE Commun Mag 38(3):72–84

49. Manchester J, Anderson J, Doshi B, Dravida S (1998) Ip over sonet. IEEE Commun Mag 36(5):136–142

50. Jinno M, Takara H, Kozicki B (2009) Dynamic optical mesh networks: drivers, challenges and solutions for the future. In: Optical communication 2009, 35th European conference on ECOC’09. IEEE, pp 1–4

51. Barroso LA, Holzle U (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synt Lect Comput Archit 4(1):1–108

52. Armstrong J (2009) Ofdm for optical communications. J Light Technol 27(3):189–204

53. Shieh W (2011) Ofdm for flexible high-speed optical networks. J Light Technol 29(10):1560–1577

54. Cisco data center interconnect design and deployment guide (2010)

55. Greenberg A, Hamilton JR, Jain N, Kandula S, Kim C, Lahiri P, Maltz DA, Patel P, Sengupta S (2009) Vl2: a scalable and flexible data center network. In: ACM SIGCOMM computer communication review, vol 39. ACM, pp 51–62

56. Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S (2009) Bcube: a high performance server-centric network architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 39(4):63–74

57. Farrington N, Porter G, Radhakrishnan S, Bazzaz HH, Subramanya V, Fainman Y, Papen G, Vahdat A (2011) Helios: a hybrid electrical/optical switch architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 41(4):339–350

58. Abu-Libdeh H, Costa P, Rowstron A, O’Shea G, Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Comput Commun Rev 40(4):51–62

59. Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what’s needed for datacenter network operations. IEEE Commun Mag 48(7):32–39


60. Wang G, Andersen DG, Kaminsky M, Papagiannaki K, Ng TS, Kozuch M, Ryan M (2010) c-through: part-time optics in data centers. In: ACM SIGCOMM Computer Communication Review, vol 40. ACM, pp 327–338

61. Ye X, Yin Y, Yoo SJB, Mejia P, Proietti R, Akella V (2010) Dos: a scalable optical switch for datacenters. In: Proceedings of the 6th ACM/IEEE symposium on architectures for networking and communications systems. ACM, p 24

62. Singla A, Singh A, Ramachandran K, Xu L, Zhang Y (2010) Proteus: a topology malleable data center network. In: Proceedings of the 9th ACM SIGCOMM workshop on hot topics in networks. ACM, p 8

63. Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P (2011) Energy-efficient design of a scalable optical multiplane interconnection architecture. IEEE J Sel Top Quantum Electron 17(2):377–383

64. Kodi AK, Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance computing (hpc) systems. IEEE J Sel Top Quantum Electron 17(2):384–395

65. Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H (2012) Mirror mirror on the ceiling: flexible wireless links for data centers. ACM SIGCOMM Comput Commun Rev 42(4):443–454

66. Lenzerini M (2002) Data integration: a theoretical perspective. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. ACM, pp 233–246

67. Cafarella MJ, Halevy A, Khoussainova N (2009) Data integration for the relational web. Proc VLDB Endowment 2(1):1090–1101

68. Maletic JI, Marcus A (2000) Data cleansing: beyond integrity analysis. In: IQ. Citeseer, pp 200–209

69. Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1-2):83–113

70. Chen H, Ku W-S, Wang H, Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 51–62

71. Zhao Z, Ng W (2012) A model-based approach for rfid data stream cleansing. In: Proceedings of the 21st ACM international conference on information and knowledge management. ACM, pp 862–871

72. Khoussainova N, Balazinska M, Suciu D (2008) Probabilistic event extraction from rfid data. In: Data Engineering 2008, IEEE 24th international conference on ICDE 2008. IEEE, pp 1480–1482

73. Herbert KG, Wang JTL (2007) Biological data cleaning: a case study. Int J Inf Qual 1(1):60–82

74. Tsai T-H, Lin C-Y (2012) Exploring contextual redundancy in improving object-based video coding for video sensor networks surveillance. IEEE Transac Multmed 14(3):669–682

75. Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 269–278

76. Kamath U, Compton J, Dogan RI, Jong KD, Shehu A (2012) An evolutionary algorithm approach for feature generation from sequence data and its application to dna splice site prediction. IEEE/ACM Transac Comput Biol Bioinforma (TCBB) 9(5):1387–1398

77. Leung K-S, Lee KH, Wang J-F, Ng EYT, Chan HLY, Tsui SKW, Mok TSK, Tse PC-H, Sung JJ-Y (2011) Data mining on dna sequences of hepatitis b virus. IEEE/ACM Transac Comput Biol Bioinforma 8(2):428–440

78. Huang Z, Shen H, Liu J, Zhou X (2011) Effective data co-reduction for multimedia similarity search. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, pp 1021–1032

79. Bleiholder J, Naumann F (2008) Data fusion. ACM Comput Surv (CSUR) 41(1):1

80. Brewer EA (2000) Towards robust distributed systems. In: PODC, p 7

81. Gilbert S, Lynch N (2002) Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59

82. McKusick MK, Quinlan S (2009) Gfs: evolution on fast-forward. ACM Queue 7(7):10

83. Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J (2008) Scope: easy and efficient parallel processing of massive data sets. Proc VLDB Endowment 1(2):1265–1276

84. Beaver D, Kumar S, Li HC, Sobel J, Vajgel P et al (2010) Finding a needle in haystack: facebook’s photo storage. In: OSDI, vol 10, pp 1–8

85. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: amazon’s highly available key-value store. In: SOSP, vol 7, pp 205–220

86. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In: Proceedings of the twenty-ninth annual ACM symposium on theory of computing. ACM, pp 654–663

87. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2):4

88. Burrows M (2006) The chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th symposium on Operating systems design and implementation. USENIX Association, pp 335–350

89. Lakshman A, Malik P (2009) Cassandra: structured storage system on a p2p network. In: Proceedings of the 28th ACM symposium on principles of distributed computing. ACM, pp 5–5

90. George L (2011) HBase: the definitive guide. O’Reilly Media Inc

91. Judd D (2008) hypertable-09 04-alpha

92. Chodorow K (2013) MongoDB: the definitive guide. O’Reilly Media Inc

93. Crockford D (2006) The application/json media type for javascript object notation (json)

94. Murty J (2009) Programming amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O’Reilly Media Inc

95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O’Reilly Media Inc

96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 975–986

97. Yang H-C, Parker DS (2009) Traverse: simplified indexing on large map-reduce-merge clusters. In: Database systems for advanced applications. Springer, pp 308–322

98. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with sawzall. Sci Program 13(4):277–298

99. Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U (2009) Building a high-level dataflow system on top of map-reduce: the pig experience. Proceedings VLDB Endowment 2(2):1414–1425


100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629

101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72

102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) Dryadlinq: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14

103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: Parallel and distributed processing 2008, IEEE international symposium on IPDPS 2008. IEEE, pp 1–11

104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146

105. Bu Y, Bill H, Balazinska M, Ernst MD (2010) Haloop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296

106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative mapreduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing. ACM, pp 810–818

107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2

108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: mapreduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing. ACM, p 7

109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) Ciel: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on Networked systems design and implementation, p 9

110. Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York

111. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

112. What analytics, data mining, big data software you used in the past 12 months for a real project (2012) httpwwwkdnuggetscompolls2012analytics-data-mining-big-data-softwarehtml

113. Berthold MR, Cebron N, Dill F, Gabriel TR, Kotter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer

114. Sallam RL, Richardson J, Hagerty J, Hostmann B (2011) Magic quadrant for business intelligence platforms. CT, Gartner Group, Stamford

115. Beyond the PC. Special Report on Personal Technology (2011)

116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A et al (2011) The iplant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi103389fpls201100034

117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on Software quality assurance. ACM, pp 70–77

118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers. ACM, pp 277–286

119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Record 34(2):18–26

120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM Sigmod Record 33(1):50–57

121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Transac Manag Inform Syst (TMIS) 3(2):7

122. Manning CD, Schutze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press

123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Transac Neural Netw 13(5):1163–1177

124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11

125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117

126. Konopnicki D, Shmueli O (1995) W3qs: a query system for the world-wide web. In: VLDB, vol 95, pp 54–65

127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640

128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval. ACM, p 2

129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25

130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19

131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819

132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939

133. Barragáns-Martínez AB, Costa-Montenegro E, Burguillo JC, Rey-Lopez M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend tv programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311

134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91

135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia. ACM, pp 469–478

136. Hirsch JE (2005) An index to quantify an individual’s scientific research output. Proc Natl Acad Sci USA 102(46):16569

137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company

138. Aggarwal CC (2011) An introduction to social network data analytics. Springer

139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1046–1054

140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied Computing. ACM, pp 1115–1116

208 Mobile Netw Appl (2014) 19:171–209

141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Transac Knowl Discov Data (TKDD) 5(2):10

142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on World wide web. ACM, pp 631–640

143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis. ACM, pp 16–25

144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. ACM, pp 315–321

145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 145–158

146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using Google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 131–144

147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1007–1016

148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 807–816

149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on Web search and data mining. ACM, pp 657–666

150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360

151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (Accessed 5 May 2013)

152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51

153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive RFID, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 2

154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval. ACM, pp 469–478

155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated, multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium. ACM, pp 445–454

156. Tasevski P (2011) Password attacks and generation strategies. Tartu University: Faculty of Mathematics and Computer Sciences



according to peak and low periods of power consumption. TXU Energy utilized such price levels to stabilize the peak and low fluctuations of power consumption. As a matter of fact, the application of big data in the smart grid can help realize time-sharing dynamic pricing, which is a win-win situation for both energy suppliers and users.

– The access of intermittent renewable energy: At present, many new energy resources, such as wind and solar, can be connected to power grids. However, since the power generation capacities of new energy resources are closely related to climate conditions that feature randomness and intermittency, it is challenging to connect them to power grids. If the big data of power grids is effectively analyzed, such intermittent renewable energy sources can be efficiently managed: the electricity generated by the new energy resources can be allocated to regions with electricity shortage. Such energy resources can complement the traditional hydropower and thermal power generation.
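
The time-sharing dynamic pricing idea mentioned above can be illustrated with a minimal sketch. The peak window and the two price levels are hypothetical values chosen for illustration only; they are not taken from TXU Energy or from any real tariff.

```python
from typing import List

# Hypothetical tariff (assumed values, not from the survey or any utility).
PEAK_HOURS = range(17, 21)                      # 17:00-20:59 counts as peak
PRICE_PER_KWH = {"peak": 0.30, "off_peak": 0.10}


def hourly_price(hour: int) -> float:
    """Return the price per kWh that applies to the given hour (0-23)."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    return PRICE_PER_KWH["peak"] if hour in PEAK_HOURS else PRICE_PER_KWH["off_peak"]


def daily_cost(usage_kwh: List[float]) -> float:
    """Cost of one day, given 24 hourly consumption readings in kWh."""
    if len(usage_kwh) != 24:
        raise ValueError("expected 24 hourly readings")
    return sum(kwh * hourly_price(hour) for hour, kwh in enumerate(usage_kwh))


# Two load profiles with the same total consumption (24 kWh):
# one concentrates usage in the peak window, the other shifts it to the night.
evening_heavy = [3.5 if hour in PEAK_HOURS else 0.5 for hour in range(24)]
shifted = [3.5 if hour < 4 else 0.5 for hour in range(24)]

print(round(daily_cost(evening_heavy), 2), round(daily_cost(shifted), 2))
```

Under this assumed tariff, the shifted profile costs less for the same energy, which is the incentive that lets time-based prices flatten peak demand.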

7 Conclusion, open issues, and outlook

In this paper, we review the background and state-of-the-art of big data. Firstly, we introduce the general background of big data and review related technologies, such as cloud computing, IoT, data centers, and Hadoop. Then we focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally review several representative applications of big data, including enterprise management, IoT, social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big picture to readers of this exciting area.

In the remainder of this section, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

7.1 Open issues

The analysis of big data is confronted with many challenges, but the current research is still in an early stage. Considerable research efforts are needed to improve the efficiency of display, storage, and analysis of big data.

7.1.1 Theoretical research

Although big data is a hot research area with great potential in both academia and industry, many important problems remain to be solved, which are discussed below.

– Fundamental problems of big data: There is a compelling need for a rigorous and holistic definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research. This is because big data is not formally and structurally defined and the existing models are not strictly verified.

– Standardization of big data: An evaluation system of data quality and an evaluation standard/benchmark of data computing efficiency should be developed. Many solutions of big data applications claim they can improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard and benchmark to evaluate the computing efficiency of big data with rigorous mathematical methods. The performance can only be evaluated once the system is implemented and deployed, so the advantages and disadvantages of alternative solutions cannot be compared horizontally, even before and after the implementation of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, it is also an urgent problem to effectively and rigorously evaluate data quality.

– Evolution of big data computing modes: This includes memory mode, data flow mode, PRAM mode, and MR (MapReduce) mode, etc. The emergence of big data triggers advances in algorithm design, which has been transformed from a computing-intensive approach into a data-intensive approach. Data transfer has been a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged, and more such models are on the horizon.

7.1.2 Technology development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architecture, big data models, and software systems supporting big data, etc., should be fully investigated.

– Format conversion of big data: Due to wide and diverse data sources, heterogeneity is always a characteristic of big data, as well as a key factor which restricts the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.


– Big data transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is a key factor to improve big data computing.

– Real-time performance of big data: The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data: As big data research advances, new problems on big data processing arise from traditional data analysis, including (i) data re-utilization: with the increase of data scale, more value may be mined from re-utilization of existing data; (ii) data re-organization: datasets in different businesses can be re-organized so that more value can be mined; (iii) data exhaust, which means wrong data during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.
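
As one way to make the "rate of depreciation of data" and "life cycle of data" from the real-time performance item concrete, the following is a minimal sketch that assumes the value of a data item decays exponentially with age. The exponential model, the half-life parameter, and the 5% expiry threshold are illustrative assumptions; the survey does not prescribe any particular depreciation model.

```python
def data_value(initial_value: float, age_hours: float, half_life_hours: float) -> float:
    """Remaining value of a data item, assuming it halves every `half_life_hours`."""
    if half_life_hours <= 0:
        raise ValueError("half_life_hours must be positive")
    if age_hours < 0:
        raise ValueError("age_hours must be non-negative")
    return initial_value * 0.5 ** (age_hours / half_life_hours)


def is_expired(age_hours: float, half_life_hours: float, threshold: float = 0.05) -> bool:
    """Treat the life cycle as over once less than `threshold` of the value remains."""
    return data_value(1.0, age_hours, half_life_hours) < threshold


# A reading with a (hypothetical) 2-hour half-life, inspected at several ages.
for age in (0, 2, 4, 10):
    print(age, data_value(100.0, age, 2.0), is_expired(age, 2.0))
```

A short half-life models data that must be analyzed online (e.g., sensor alarms), while a long one models data that tolerates offline batch analysis.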

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved.

– Big data management: The emergence of big data brings about new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

– Searching, mining, and analysis of big data: Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

– Integration and provenance of big data: As discussed, the value acquired from comprehensive utilization of multiple datasets is far higher than the sum value of individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance is the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information featuring different standards and from different datasets.

– Big data application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction of big data, etc. are all important research problems.

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume is fast growing, there are more severe safety risks, while the traditional data protection methods have already been shown not applicable to big data. In particular, big data safety is confronted with the following security related challenges.

– Big data privacy: Big data privacy includes two aspects: (i) Protection of personal privacy during data acquisition: personal interests, habits, and body properties, etc. of users may be more easily acquired, and users may not be aware. (ii) Personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed as a big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher of Skull Security, used an information acquisition tool to acquire data in the public pages of Facebook users who had failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality: Data quality influences big data utilization. Low quality data wastes transmission and storage resources with poor usability. There are a lot of factors that may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though a lot of measures have been taken to improve data quality, the related problems have not been well addressed yet. Therefore, effective methods to automatically detect data quality and repair some damaged data need to be investigated.

– Big data safety mechanism: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods on small and medium-scale data could not meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for safety management, access control, and safety communications shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data application in information security: Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APT (Advanced Persistent Threat) after analysis of big data in the form of log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc. may also be more easily identified through the analysis of big data.

The safety of big data has drawn great attention of researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage of energy efficiency optimization, and processed hardware and software system architectures, etc. Particularly, big data security, including credibility, backup and recovery, completeness maintenance, and security should be further investigated.
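
Three of the data quality dimensions named in the list above (completeness, redundancy, and consistency) can be measured automatically on tabular records, as the following minimal sketch suggests. The metric definitions, function names, and sample records are assumptions made for illustration; accuracy is left out because it requires ground-truth values to compare against.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def completeness(records: List[Record], fields: List[str]) -> float:
    """Fraction of expected field values that are present (not None or empty)."""
    total = len(records) * len(fields)
    if total == 0:
        return 1.0
    filled = sum(1 for r in records for f in fields if r.get(f) not in (None, ""))
    return filled / total


def redundancy(records: List[Record], fields: List[str]) -> float:
    """Fraction of records that exactly duplicate an earlier record."""
    if not records:
        return 0.0
    seen = set()
    duplicates = 0
    for r in records:
        key = tuple(r.get(f) for f in fields)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates / len(records)


def consistency(records: List[Record], key_field: str, value_field: str) -> float:
    """Fraction of keys whose records all agree on `value_field`."""
    values_by_key: Dict[str, set] = {}
    for r in records:
        key, value = r.get(key_field), r.get(value_field)
        if key is None or value in (None, ""):
            continue  # missing values are counted by completeness, not here
        values_by_key.setdefault(key, set()).add(value)
    if not values_by_key:
        return 1.0
    agreeing = sum(1 for values in values_by_key.values() if len(values) == 1)
    return agreeing / len(values_by_key)


# Hypothetical sample: one exact duplicate, one missing value, one conflict.
sample = [
    {"id": "1", "city": "Wuhan", "zip": "430074"},
    {"id": "1", "city": "Wuhan", "zip": "430074"},
    {"id": "2", "city": "Auburn", "zip": None},
    {"id": "2", "city": "Beijing", "zip": "100084"},
]
columns = ["id", "city", "zip"]
print(completeness(sample, columns), redundancy(sample, columns),
      consistency(sample, "id", "city"))
```

Scores like these only detect problems; repairing damaged data, which the text also calls for, needs additional domain rules or reference data.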

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the “T” (Technology) was the main concern, while technology drove the development of data. In the big data era, with the prominence of data value and advances in “I” (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact but also influence everyone's ways of living and thinking, which is already happening. We cannot predict the future, but we may take precautions against events that could possibly occur.

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have sought better ways to cope with larger-scale, more diverse, and more complexly structured data. These efforts are represented by Spanner, the globally-distributed database of Google, and F1, a fault-tolerant, expandable, distributed relational database. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, impacts of big data on production management, business operation, and decision making, etc. shall be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., as in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a friendly manner can they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc. are frequently used to visualize results of data analysis. New presentation forms will occur in the future, e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well-known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, it is observed that the role of data is becoming increasingly more significant. In the small scale data era, in which logic is more complex than data, program design is mainly process-oriented. As business data is becoming more complex, object-oriented design methods are developed. Nowadays, the complexity of business data has far surpassed business logic. Consequently, programs are gradually transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods are certain to emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

                                                                ndash Big data triggers the revolution of thinking Graduallybig data and its analysis will profoundly influence ourways of thinking In [11] the authors summarize thethinking revolution triggered by big data as follows

                                                                – During data analysis, we will try to utilize all data rather than analyzing only a small set of sample data.

                                                                – Instead of insisting on accurate data, we would accept numerous and complicated data.

                                                                – We shall pay greater attention to correlations between things rather than exploring causal relationships.

                                                                – The simple algorithms of big data are more effective than the complex algorithms of small data.

                                                                – Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace “experts”.

                                                                Throughout the history of human society, the demands and willingness of human beings have always been the source of power that promotes scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will attract increasing attention and will certainly cause enormous transformations of social activities in the future society.

                                                                Acknowledgments This work was supported by the China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

                                                                References

                                                                1. Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

                                                                2. Fact sheet: Big data across the federal government (2012). http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf

                                                                3. Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper

                                                                4. Drowning in numbers - digital data will flood the planet - and help us understand it better (2011). http://www.economist.com/blogs/dailychart/2011/11/bigdata-0

                                                                5. Lohr S (2012) The age of big data. New York Times, pp 11

                                                                6. Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/thedigitalbreadcrumbs-that-lead-to-big-data

                                                                7. Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-searchforanalysts-to-make-sense-of-big-data

                                                                8. Big data (2008). http://www.nature.com/news/specials/bigdata/index.html

                                                                9. Special online collection: dealing with big data (2011). http://www.sciencemag.org/site/special/data

                                                                10. Manyika J, McKinsey Global Institute, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute

                                                                11. Mayer-Schönberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Eamon Dolan/Houghton Mifflin Harcourt

                                                                12. Laney D (2001) 3-D data management: controlling data volume, velocity and variety. META Group Research Note, 6 February

                                                                13. Zikopoulos P, Eaton C et al (2011) Understanding big data: analytics for enterprise class Hadoop and streaming data. McGraw-Hill Osborne Media

                                                                14. Meijer E (2011) The world according to LINQ. Communications of the ACM 54(10):45–51

                                                                15. Beyer M (2011) Gartner says solving big data challenge involves more than just managing volumes of data. Gartner. http://www.gartner.com/it/page.jsp

                                                                16. O. R. Team (2011) Big data now: current perspectives from O'Reilly Radar. O'Reilly Media

                                                                17. Grobelnik M (2012) Big data tutorial. http://videolectures.net/eswc2012grobelnikbigdata

                                                                18. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2008) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014

                                                                19. DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6):85–98

                                                                20. Walter T (2009) Teradata past, present, and future. UCI ISG lecture series on scalable data management

                                                                21. Ghemawat S, Gobioff H, Leung S-T (2003) The Google file system. In: ACM SIGOPS Operating Systems Review, vol 37. ACM, pp 29–43

                                                                22. Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

                                                                23. Hey AJG, Tansley S, Tolle KM et al (2009) The fourth paradigm: data-intensive scientific discovery

                                                                24. Howard JH, Kazar ML, Menees SG, Nichols DA, Satyanarayanan M, Sidebotham RN, West MJ (1988) Scale and performance in a distributed file system. ACM Trans Comput Syst (TOCS) 6(1):51–81

                                                                25. Cattell R (2011) Scalable SQL and NoSQL data stores. ACM SIGMOD Record 39(4):12–27

                                                                26. Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033

                                                                27. Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98

                                                                28. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J et al (2012) Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States

                                                                29. Sun Y, Chen M, Liu B, Mao S (2013) FAR: a fault-avoidant routing method for data center networks with regular topology. In: Proceedings of ACM/IEEE symposium on architectures for networking and communications systems (ANCS'13). ACM

                                                                30. Wiki (2013) Applications and organizations using Hadoop. http://wiki.apache.org/hadoop/PoweredBy

                                                                31. Bahga A, Madisetti VK (2012) Analyzing massive machine maintenance data in a computing cloud. IEEE Transac Parallel Distrib Syst 23(10):1831–1843

                                                                32. Gunarathne T, Wu T-L, Choi JY, Bae S-H, Qiu J (2011) Cloud computing paradigms for pleasingly parallel biomedical applications. Concurr Comput Prac Experience 23(17):2338–2354

                                                                33. Gantz J, Reinsel D (2010) The digital universe decade - are you ready. External publication of IDC (Analyse the Future) information and data, pp 1–16

                                                                34. Bryant RE (2011) Data-intensive scalable computing for scientific applications. Comput Sci Eng 13(6):25–33

                                                                35. Wahab MHA, Mohd MNH, Hanafi HF, Mohsin MFM (2008) Data pre-processing on web server logs for generalized association rules mining algorithm. World Acad Sci Eng Technol 48:2008

                                                                36. Nanopoulos A, Manolopoulos Y, Zakrzewicz M, Morzy T (2002) Indexing web access-logs for pattern queries. In: Proceedings of the 4th international workshop on web information and data management. ACM, pp 63–68

                                                                37. Joshi KP, Joshi A, Yesha Y (2003) On using a warehouse to analyze web logs. Distrib Parallel Databases 13(2):161–180

                                                                38. Chandramohan V, Christensen K (2002) A first look at wired sensor networks for video surveillance systems. In: Proceedings LCN 2002, 27th annual IEEE conference on local computer networks. IEEE, pp 728–729

                                                                39. Selavo L, Wood A, Cao Q, Sookoor T, Liu H, Srinivasan A, Wu Y, Kang W, Stankovic J, Young D et al (2007) LUSTER: wireless sensor network for environmental research. In: Proceedings of the 5th international conference on embedded networked sensor systems. ACM, pp 103–116

                                                                40. Barrenetxea G, Ingelrest F, Schaefer G, Vetterli M, Couach O, Parlange M (2008) SensorScope: out-of-the-box environmental monitoring. In: Information processing in sensor networks, 2008, international conference on, IPSN'08. IEEE, pp 332–343

                                                                41. Kim Y, Schmid T, Charbiwala ZM, Friedman J, Srivastava MB (2008) NAWMS: nonintrusive autonomous water monitoring system. In: Proceedings of the 6th ACM conference on embedded network sensor systems. ACM, pp 309–322

                                                                42. Kim S, Pakzad S, Culler D, Demmel J, Fenves G, Glaser S, Turon M (2007) Health monitoring of civil infrastructures using wireless sensor networks. In: Information processing in sensor networks, 2007, 6th international symposium on, IPSN 2007. IEEE, pp 254–263

                                                                43. Ceriotti M, Mottola L, Picco GP, Murphy AL, Guna S, Corra M, Pozzi M, Zonta D, Zanon P (2009) Monitoring heritage buildings with wireless sensor networks: the Torre Aquila deployment. In: Proceedings of the 2009 international conference on information processing in sensor networks. IEEE Computer Society, pp 277–288

                                                                44. Tolle G, Polastre J, Szewczyk R, Culler D, Turner N, Tu K, Burgess S, Dawson T, Buonadonna P, Gay D et al (2005) A macroscope in the redwoods. In: Proceedings of the 3rd international conference on embedded networked sensor systems. ACM, pp 51–63

                                                                45. Wang F, Liu J (2011) Networked wireless sensor data collection: issues, challenges, and approaches. IEEE Commun Surv Tutor 13(4):673–687

                                                                46. Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the 11th international conference on World Wide Web. ACM, pp 124–135

                                                                47. Choudhary S, Dincturk ME, Mirtaheri SM, Moosavi A, von Bochmann G, Jourdan G-V, Onut I-V (2012) Crawling rich internet applications: the state of the art. In: CASCON, pp 146–160

                                                                48. Ghani N, Dixit S, Wang T-S (2000) On IP-over-WDM integration. IEEE Commun Mag 38(3):72–84

                                                                49. Manchester J, Anderson J, Doshi B, Dravida S (1998) IP over SONET. IEEE Commun Mag 36(5):136–142

                                                                50. Jinno M, Takara H, Kozicki B (2009) Dynamic optical mesh networks: drivers, challenges and solutions for the future. In: Optical communication, 2009, 35th European conference on, ECOC'09. IEEE, pp 1–4

                                                                51. Barroso LA, Hölzle U (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synt Lect Comput Archit 4(1):1–108

                                                                52. Armstrong J (2009) OFDM for optical communications. J Light Technol 27(3):189–204

                                                                53. Shieh W (2011) OFDM for flexible high-speed optical networks. J Light Technol 29(10):1560–1577

                                                                54. Cisco data center interconnect design and deployment guide (2010)

                                                                55. Greenberg A, Hamilton JR, Jain N, Kandula S, Kim C, Lahiri P, Maltz DA, Patel P, Sengupta S (2009) VL2: a scalable and flexible data center network. In: ACM SIGCOMM computer communication review, vol 39. ACM, pp 51–62

                                                                56. Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S (2009) BCube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 39(4):63–74

                                                                57. Farrington N, Porter G, Radhakrishnan S, Bazzaz HH, Subramanya V, Fainman Y, Papen G, Vahdat A (2011) Helios: a hybrid electrical/optical switch architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 41(4):339–350

                                                                58. Abu-Libdeh H, Costa P, Rowstron A, O'Shea G, Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Comput Commun Rev 40(4):51–62

                                                                59. Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what's needed for datacenter network operations. IEEE Commun Mag 48(7):32–39

                                                                60. Wang G, Andersen DG, Kaminsky M, Papagiannaki K, Ng TS, Kozuch M, Ryan M (2010) c-Through: part-time optics in data centers. In: ACM SIGCOMM Computer Communication Review, vol 40. ACM, pp 327–338

                                                                61. Ye X, Yin Y, Yoo SJB, Mejia P, Proietti R, Akella V (2010) DOS: a scalable optical switch for datacenters. In: Proceedings of the 6th ACM/IEEE symposium on architectures for networking and communications systems. ACM, p 24

                                                                62. Singla A, Singh A, Ramachandran K, Xu L, Zhang Y (2010) Proteus: a topology malleable data center network. In: Proceedings of the 9th ACM SIGCOMM workshop on hot topics in networks. ACM, p 8

                                                                63. Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P (2011) Energy-efficient design of a scalable optical multiplane interconnection architecture. IEEE J Sel Top Quantum Electron 17(2):377–383

                                                                64. Kodi AK, Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance computing (HPC) systems. IEEE J Sel Top Quantum Electron 17(2):384–395

                                                                65. Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H (2012) Mirror mirror on the ceiling: flexible wireless links for data centers. ACM SIGCOMM Comput Commun Rev 42(4):443–454

                                                                66. Lenzerini M (2002) Data integration: a theoretical perspective. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. ACM, pp 233–246

                                                                67. Cafarella MJ, Halevy A, Khoussainova N (2009) Data integration for the relational web. Proc VLDB Endowment 2(1):1090–1101

                                                                68. Maletic JI, Marcus A (2000) Data cleansing: beyond integrity analysis. In: IQ. Citeseer, pp 200–209

                                                                69. Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1-2):83–113

                                                                70. Chen H, Ku W-S, Wang H, Sun M-T (2010) Leveraging spatio-temporal redundancy for RFID data cleansing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 51–62

                                                                71. Zhao Z, Ng W (2012) A model-based approach for RFID data stream cleansing. In: Proceedings of the 21st ACM international conference on information and knowledge management. ACM, pp 862–871

                                                                72. Khoussainova N, Balazinska M, Suciu D (2008) Probabilistic event extraction from RFID data. In: Data engineering, 2008, IEEE 24th international conference on, ICDE 2008. IEEE, pp 1480–1482

                                                                73. Herbert KG, Wang JTL (2007) Biological data cleaning: a case study. Int J Inf Qual 1(1):60–82

                                                                74. Tsai T-H, Lin C-Y (2012) Exploring contextual redundancy in improving object-based video coding for video sensor networks surveillance. IEEE Transac Multmed 14(3):669–682

                                                                75. Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 269–278

                                                                76. Kamath U, Compton J, Dogan RI, Jong KD, Shehu A (2012) An evolutionary algorithm approach for feature generation from sequence data and its application to DNA splice site prediction. IEEE/ACM Transac Comput Biol Bioinforma (TCBB) 9(5):1387–1398

                                                                77. Leung K-S, Lee KH, Wang J-F, Ng EYT, Chan HLY, Tsui SKW, Mok TSK, Tse PC-H, Sung JJ-Y (2011) Data mining on DNA sequences of hepatitis B virus. IEEE/ACM Transac Comput Biol Bioinforma 8(2):428–440

                                                                78. Huang Z, Shen H, Liu J, Zhou X (2011) Effective data co-reduction for multimedia similarity search. In: Proceedings of the 2011 ACM SIGMOD international conference on management of data. ACM, pp 1021–1032

                                                                79. Bleiholder J, Naumann F (2008) Data fusion. ACM Comput Surv (CSUR) 41(1):1

                                                                80. Brewer EA (2000) Towards robust distributed systems. In: PODC, p 7

                                                                81. Gilbert S, Lynch N (2002) Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59

                                                                82. McKusick MK, Quinlan S (2009) GFS: evolution on fast-forward. ACM Queue 7(7):10

                                                                83. Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J (2008) SCOPE: easy and efficient parallel processing of massive data sets. Proc VLDB Endowment 1(2):1265–1276

                                                                84. Beaver D, Kumar S, Li HC, Sobel J, Vajgel P et al (2010) Finding a needle in Haystack: Facebook's photo storage. In: OSDI, vol 10, pp 1–8

                                                                85. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: Amazon's highly available key-value store. In: SOSP, vol 7, pp 205–220

                                                                86. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In: Proceedings of the twenty-ninth annual ACM symposium on theory of computing. ACM, pp 654–663

                                                                87. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2):4

                                                                88. Burrows M (2006) The Chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th symposium on operating systems design and implementation. USENIX Association, pp 335–350

                                                                89. Lakshman A, Malik P (2009) Cassandra: structured storage system on a P2P network. In: Proceedings of the 28th ACM symposium on principles of distributed computing. ACM, pp 5–5

                                                                90. George L (2011) HBase: the definitive guide. O'Reilly Media Inc

                                                                91. Judd D (2008) hypertable-0.9.0.4-alpha

                                                                92. Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc

                                                                93. Crockford D (2006) The application/json media type for JavaScript object notation (JSON)

                                                                94. Murty J (2009) Programming Amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc

                                                                95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc

                                                                96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in MapReduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 975–986

                                                                97. Yang H-C, Parker DS (2009) Traverse: simplified indexing on large map-reduce-merge clusters. In: Database systems for advanced applications. Springer, pp 308–322

                                                                98. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with Sawzall. Sci Program 13(4):277–298

                                                                99. Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U (2009) Building a high-level dataflow system on top of map-reduce: the Pig experience. Proceedings VLDB Endowment 2(2):1414–1425

                                                                100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629

                                                                101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72

                                                                102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14

                                                                103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: Parallel and distributed processing, 2008, IEEE international symposium on, IPDPS 2008. IEEE, pp 1–11

                                                                104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146

                                                                105. Bu Y, Bill H, Balazinska M, Ernst MD (2010) HaLoop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296

                                                                106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative MapReduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing. ACM, pp 810–818

                                                                107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2

                                                                108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: MapReduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing. ACM, p 7

                                                                109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) CIEL: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on networked systems design and implementation, p 9

                                                                110. Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York

                                                                111. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

                                                                112. What analytics, data mining, big data software you used in the past 12 months for a real project? (2012). http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html

                                                                113. Berthold MR, Cebron N, Dill F, Gabriel TR, Kötter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer

                                                                114. Sallam RL, Richardson J, Hagerty J, Hostmann B (2011) Magic quadrant for business intelligence platforms. Gartner Group, Stamford, CT

                                                                115. Beyond the PC. Special Report on Personal Technology (2011)

                                                                116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A et al (2011) The iPlant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

                                                                117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on software quality assurance. ACM, pp 70–77

                                                                118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers. ACM, pp 277–286

                                                                119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Record 34(2):18–26

                                                                120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM Sigmod Record 33(1):50–57

                                                                121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Transac Manag Inform Syst (TMIS) 3(2):7

                                                                122. Manning CD, Schütze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press

                                                                123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Transac Neural Netw 13(5):1163–1177

                                                                124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11

                                                                125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117

                                                                126. Konopnicki D, Shmueli O (1995) W3QS: a query system for the world-wide web. In: VLDB, vol 95, pp 54–65

                                                                127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640

                                                                128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval. ACM, p 2

                                                                129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25

                                                                130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19

                                                                131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819

                                                                132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939

                                                                133. Barragáns-Martínez AB, Costa-Montenegro E, Burguillo JC, Rey-López M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend TV programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311

                                                                134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91

                                                                135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia. ACM, pp 469–478

                                                                136. Hirsch JE (2005) An index to quantify an individual's scientific research output. Proc Natl Acad Sci USA 102(46):16569

                                                                137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company

                                                                138. Aggarwal CC (2011) An introduction to social network data analytics. Springer

                                                                139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 1046–1054

                                                                140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied computing. ACM, pp 1115–1116


141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Transac Knowl Discov Data (TKDD) 5(2):10

142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on World wide web. ACM, pp 631–640

143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis. ACM, pp 16–25

144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. ACM, pp 315–321

145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 145–158

146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 131–144

147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1007–1016

148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 807–816

149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on Web search and data mining. ACM, pp 657–666

150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360

151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. httpwwwciscocomenUSsolutionscollateralns341ns525ns537ns705ns827white paperc11-520862html (Last accessed 5 May 2013)

152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51

153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive rfid, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 2

154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval. ACM, pp 469–478

155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated, multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium. ACM, pp 445–454

156. Tasevski P (2011) Password attacks and generation strategies. Tartu University: Faculty of Mathematics and Computer Sciences



– Big data transfer. Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs, which is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is therefore a key factor in improving big data computing.

– Real-time performance of big data. The real-time performance of big data is also a key problem in many application scenarios. Effective means to define the life cycle of data, compute the rate of depreciation of data, and build computing models of real-time and online applications will influence the analysis results of big data.

– Processing of big data. As big data research advances, new problems in big data processing arise beyond those of traditional data analysis, including: (i) data re-utilization: as data scale increases, more value may be mined from the re-utilization of existing data; (ii) data re-organization: datasets from different businesses can be re-organized so that more value can be mined from them; (iii) data exhaust, which means wrong data produced during acquisition. In big data, not only the correct data but also the wrong data should be utilized to generate more value.
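The real-time item above mentions defining a data life cycle and computing a rate of depreciation of data. The following is a minimal sketch of one way this could be expressed, assuming an exponential decay of value with a fixed half-life. The decay model, the function names `data_value` and `is_expired`, and the threshold are illustrative assumptions, not a method proposed in this survey.

```python
import math


def data_value(initial_value: float, age_seconds: float, half_life_seconds: float) -> float:
    """Remaining value of a data item after age_seconds.

    Assumption: value halves every half_life_seconds (exponential decay).
    """
    if half_life_seconds <= 0:
        raise ValueError("half_life_seconds must be positive")
    decay_rate = math.log(2) / half_life_seconds  # depreciation rate per second
    return initial_value * math.exp(-decay_rate * age_seconds)


def is_expired(age_seconds: float, half_life_seconds: float, threshold: float = 0.05) -> bool:
    """Treat the life cycle as ended once the remaining fraction drops below threshold."""
    return data_value(1.0, age_seconds, half_life_seconds) < threshold
```

Under this assumed model, a stream processor could drop or archive items for which `is_expired` returns true, so that only data that is still valuable reaches the online analysis stage.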

7.1.3 Practical implications

Although there are already many successful big data applications, many practical problems remain to be solved:

– Big data management. The emergence of big data brings new challenges to traditional data management. At present, many research efforts are being made on big data oriented database and Internet technologies, storage models and databases suitable for new hardware, heterogeneous and multi-structured data integration, data management of mobile and pervasive computing, data management of SNS, and distributed data management.

– Searching, mining, and analysis of big data. Data processing is always a research hotspot in big data. Related problems include searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning.

– Integration and provenance of big data. As discussed, the value acquired from comprehensive utilization of multiple datasets is far higher than the sum of the values of the individual datasets. Therefore, the integration of different data sources is a timely problem. Data integration is confronted with many challenges, such as different data patterns and a large amount of redundant data. Data provenance describes the process of data generation and evolution over time, and is mainly used to investigate multiple datasets rather than a single dataset. Therefore, it is worth studying how to integrate data provenance information that follows different standards and comes from different datasets.

– Big data application. At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication; big data applications in small and medium-sized businesses; big data applications in government departments; big data services; and human-computer interaction of big data are all important research problems.
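The integration and provenance item in the list above can be illustrated with a small sketch: records from several datasets are merged by a shared key, redundant fields are dropped, and each merged record remembers which datasets contributed to it. The `Record` type, the `integrate` function, the `"id"` key, and the "first source wins" rule are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    key: str
    values: dict
    sources: list = field(default_factory=list)  # provenance: contributing datasets


def integrate(datasets: dict) -> dict:
    """Merge rows from several named datasets by their "id" field.

    Assumption: when two datasets disagree on a field, the first dataset wins
    and the redundant value is ignored.
    """
    merged = {}
    for source_name, rows in datasets.items():
        for row in rows:
            key = row["id"]
            record = merged.setdefault(key, Record(key=key, values={}))
            for field_name, value in row.items():
                if field_name == "id":
                    continue
                # Keep the existing value if another source already supplied it.
                record.values.setdefault(field_name, value)
            if source_name not in record.sources:
                record.sources.append(source_name)
    return merged
```

A real system would also need to reconcile differing schemas and provenance standards, which is exactly the open problem described above; this sketch only shows the bookkeeping.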

7.1.4 Data security

In IT, safety and privacy are always two key concerns. In the big data era, as data volume grows fast, safety risks become more severe, while traditional data protection methods have already been shown to be inapplicable to big data. In particular, big data safety is confronted with the following security-related challenges:

– Big data privacy. Big data privacy includes two aspects: (i) protection of personal privacy during data acquisition: personal interests, habits, and body properties of users may be more easily acquired, and users may not be aware of it; (ii) personal privacy data may also be leaked during storage, transmission, and usage, even if acquired with the permission of users. For example, Facebook is deemed a big data company with the most SNS data currently. According to a report [156], Ron Bowes, a researcher at Skull Security, used an information acquisition tool to acquire data from the public pages of Facebook users who had not modified their privacy settings. Ron Bowes packaged such data into a 2.8 GB package and created a BitTorrent (BT) seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly simple information. Therefore, privacy protection will become a new and challenging problem.

– Data quality. Data quality influences big data utilization. Low-quality data wastes transmission and storage resources and has poor usability. Many factors may restrict data quality; for example, generation, acquisition, and transmission may all influence data quality. Data quality is mainly manifested in accuracy, completeness, redundancy, and consistency. Even though many measures have been taken to improve data quality, the related problems have not yet been well addressed. Therefore, effective methods to automatically detect data quality and repair damaged data need to be investigated.

– Big data safety mechanism. Big data brings challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods designed for small and medium-scale data cannot meet the demands of big data, so efficient big data cryptography approaches should be developed. Effective schemes for safety management, access control, and safe communications should be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants' data should be enabled on the premise of efficiency assurance.

– Big data application in information security. Big data not only brings challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APTs (Advanced Persistent Threats) by analyzing big data in the form of log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, and attack characteristics may also be more easily identified through the analysis of big data.
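As a rough illustration of the log analysis mentioned in the last item, the sketch below counts alert events per source address and reports the sources that reach a threshold. The log layout (`timestamp source event`, separated by whitespace), the `ALERT` event label, and the function name `suspicious_sources` are assumptions made for this example; real Intrusion Detection System logs differ by product.

```python
from collections import Counter


def suspicious_sources(log_lines, threshold=3):
    """Return the sorted source addresses with at least `threshold` ALERT events.

    Assumed line layout: "<timestamp> <source> <event>" (hypothetical format).
    """
    alert_counts = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip malformed lines instead of failing
        source, event = parts[1], parts[2]
        if event == "ALERT":
            alert_counts[source] += 1
    return sorted(src for src, count in alert_counts.items() if count >= threshold)
```

At big data scale the same counting step would typically run as a distributed aggregation rather than a single loop, but the logic per record stays the same.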

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy efficiency optimization, and hardware and software system architectures for processing. In particular, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.
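The data quality item in this section names completeness and redundancy among the dimensions of data quality and calls for automatic detection. A minimal sketch of such a check is given below, assuming records are flat dictionaries; the function name `quality_report` and the two metric definitions are illustrative assumptions rather than standard measures.

```python
def quality_report(rows, required_fields):
    """Compute two simple quality indicators for a list of flat dict records.

    completeness: share of rows in which every required field is present and non-empty.
    redundancy:   share of rows that are exact duplicates of another row.
    """
    total = len(rows)
    if total == 0:
        return {"completeness": 1.0, "redundancy": 0.0}
    complete = sum(
        1
        for row in rows
        if all(row.get(name) not in (None, "") for name in required_fields)
    )
    # Exact duplicates collapse to one entry in the set of canonicalized rows.
    distinct = len({tuple(sorted(row.items())) for row in rows})
    return {
        "completeness": complete / total,
        "redundancy": (total - distinct) / total,
    }
```

Accuracy and consistency usually need reference data or cross-field rules, so they are left out of this sketch.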

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the “T” (Technology) was the main concern, and technology drove the development of data. In the big data era, with the prominence of data value and advances in “I” (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact, but also influence everyone's ways of living and thinking, which is already happening. We cannot predict the future, but we may take precautions for events that could occur:

– Data with a larger scale, higher diversity, and more complex structures. Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have been concerned with better ways to cope with larger-scale, higher-diversity, and more complexly structured data. These efforts are represented by Google's globally-distributed database Spanner and the fault-tolerant, expandable, distributed relational database F1. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to those of relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance. Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science. Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, but also forces the cross fusion of many disciplines. The development of big data requires exploring innovative technologies and methods in data acquisition, storage, processing, analysis, and information security. The impacts of big data on production management, business operation, and decision making should then be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization. In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., in text and image editors. In big data applications, mixed data is very useful for decision making. Only when analytical results are displayed in a friendly way can they be effectively utilized by users. Reports, histograms, pie charts, and regression curves are frequently used to visualize the results of data analysis. New presentation forms will appear in the future; e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

– Data-oriented. It is well known that programs consist of data structures and algorithms, and data structures are used to store data. In the history of program design, the role of data has become increasingly significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed that of business logic. Consequently, programs are gradually being transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods will emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking. Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

    – During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

    – Compared with accurate data, we would like to accept numerous and complicated data.

    – We shall pay greater attention to correlations between things rather than exploring causal relationships.

    – The simple algorithms of big data are more effective than the complex algorithms of small data.

    – Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace “experts”.

Throughout the history of human society, the demands and willingness of human beings have always been the driving forces of scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will receive increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

                                                                  References

1. Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

2. Fact sheet: Big data across the federal government (2012) httpwwwwhitehousegovsitesdefaultfilesmicrositesostpbig datafact sheet 3 29 2012pdf

3. Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper

4. Drowning in numbers - digital data will flood the planet - and help us understand it better (2011) httpwwweconomistcomblogsdailychart201111bigdata-0

5. Lohr S (2012) The age of big data. New York Times, pp 11

6. Yuki N (2011) Following digital breadcrumbs to big data gold. httpwwwnprorg20111129142521910thedigitalbreadcrumbs-that-lead-to-big-data

7. Yuki N (2011) The search for analysts to make sense of big data. httpwwwnprorg20111130142893065the-searchforanalysts-to-make-sense-of-big-data

8. Big data (2008) httpwwwnaturecomnewsspecialsbigdataindexhtml

9. Special online collection: dealing with big data (2011) httpwwwsciencemagorgsitespecialdata

10. Manyika J, McKinsey Global Institute, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute

11. Mayer-Schönberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Eamon Dolan/Houghton Mifflin Harcourt

12. Laney D (2001) 3-d data management: controlling data volume, velocity and variety. META Group Research Note, 6 February

13. Zikopoulos P, Eaton C et al (2011) Understanding big data: analytics for enterprise class hadoop and streaming data. McGraw-Hill Osborne Media

14. Meijer E (2011) The world according to linq. Communications of the ACM 54(10):45–51

15. Beyer M (2011) Gartner says solving big data challenge involves more than just managing volumes of data. Gartner. httpwwwgartnercomitpagejsp

16. O R Team (2011) Big data now: current perspectives from O'Reilly Radar. O'Reilly Media

17. Grobelnik M (2012) Big data tutorial. httpvideolecturesneteswc2012grobelnikbigdata

18. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2008) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014

19. DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6):85–98

20. Walter T (2009) Teradata past, present, and future. UCI ISG lecture series on scalable data management

21. Ghemawat S, Gobioff H, Leung S-T (2003) The google file system. In: ACM SIGOPS Operating Systems Review, vol 37. ACM, pp 29–43

22. Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

23. Hey AJG, Tansley S, Tolle KM et al (2009) The fourth paradigm: data-intensive scientific discovery

24. Howard JH, Kazar ML, Menees SG, Nichols DA, Satyanarayanan M, Sidebotham RN, West MJ (1988) Scale and performance in a distributed file system. ACM Trans Comput Syst (TOCS) 6(1):51–81

25. Cattell R (2011) Scalable sql and nosql data stores. ACM SIGMOD Record 39(4):12–27

26. Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033

27. Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98

28. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J et al (2012) Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States

29. Sun Y, Chen M, Liu B, Mao S (2013) Far: a fault-avoidant routing method for data center networks with regular topology. In: Proceedings of ACM/IEEE symposium on architectures for networking and communications systems (ANCS'13). ACM

30. Wiki (2013) Applications and organizations using hadoop. http://wiki.apache.org/hadoop/PoweredBy

31. Bahga A, Madisetti VK (2012) Analyzing massive machine maintenance data in a computing cloud. IEEE Transac Parallel Distrib Syst 23(10):1831–1843

32. Gunarathne T, Wu T-L, Choi JY, Bae S-H, Qiu J (2011) Cloud computing paradigms for pleasingly parallel biomedical applications. Concurr Comput Prac Experience 23(17):2338–2354

33. Gantz J, Reinsel D (2010) The digital universe decade - are you ready? External publication of IDC (Analyse the Future) information and data, pp 1–16

34. Bryant RE (2011) Data-intensive scalable computing for scientific applications. Comput Sci Eng 13(6):25–33

35. Wahab MHA, Mohd MNH, Hanafi HF, Mohsin MFM (2008) Data pre-processing on web server logs for generalized association rules mining algorithm. World Acad Sci Eng Technol 48:2008

36. Nanopoulos A, Manolopoulos Y, Zakrzewicz M, Morzy T (2002) Indexing web access-logs for pattern queries. In: Proceedings of the 4th international workshop on web information and data management. ACM, pp 63–68

37. Joshi KP, Joshi A, Yesha Y (2003) On using a warehouse to analyze web logs. Distrib Parallel Databases 13(2):161–180

38. Chandramohan V, Christensen K (2002) A first look at wired sensor networks for video surveillance systems. In: Proceedings LCN 2002, 27th annual IEEE conference on local computer networks. IEEE, pp 728–729

39. Selavo L, Wood A, Cao Q, Sookoor T, Liu H, Srinivasan A, Wu Y, Kang W, Stankovic J, Young D et al (2007) Luster: wireless sensor network for environmental research. In: Proceedings of the 5th international conference on Embedded networked sensor systems. ACM, pp 103–116

40. Barrenetxea G, Ingelrest F, Schaefer G, Vetterli M, Couach O, Parlange M (2008) Sensorscope: out-of-the-box environmental monitoring. In: Information processing in sensor networks, 2008, international conference on IPSN'08. IEEE, pp 332–343

41. Kim Y, Schmid T, Charbiwala ZM, Friedman J, Srivastava MB (2008) Nawms: nonintrusive autonomous water monitoring system. In: Proceedings of the 6th ACM conference on Embedded network sensor systems. ACM, pp 309–322

42. Kim S, Pakzad S, Culler D, Demmel J, Fenves G, Glaser S, Turon M (2007) Health monitoring of civil infrastructures using wireless sensor networks. In: Information Processing in Sensor Networks, 2007, 6th International Symposium on IPSN 2007. IEEE, pp 254–263

43. Ceriotti M, Mottola L, Picco GP, Murphy AL, Guna S, Corra M, Pozzi M, Zonta D, Zanon P (2009) Monitoring heritage buildings with wireless sensor networks: the torre aquila deployment. In: Proceedings of the 2009 International Conference on Information Processing in Sensor Networks. IEEE Computer Society, pp 277–288

44. Tolle G, Polastre J, Szewczyk R, Culler D, Turner N, Tu K, Burgess S, Dawson T, Buonadonna P, Gay D et al (2005) A macroscope in the redwoods. In: Proceedings of the 3rd international conference on embedded networked sensor systems. ACM, pp 51–63

45. Wang F, Liu J (2011) Networked wireless sensor data collection: issues, challenges, and approaches. IEEE Commun Surv Tutor 13(4):673–687

46. Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the 11th international conference on World Wide Web. ACM, pp 124–135

47. Choudhary S, Dincturk ME, Mirtaheri SM, Moosavi A, von Bochmann G, Jourdan G-V, Onut I-V (2012) Crawling rich internet applications: the state of the art. In: CASCON, pp 146–160

48. Ghani N, Dixit S, Wang T-S (2000) On ip-over-wdm integration. IEEE Commun Mag 38(3):72–84

49. Manchester J, Anderson J, Doshi B, Dravida S (1998) Ip over sonet. IEEE Commun Mag 36(5):136–142

50. Jinno M, Takara H, Kozicki B (2009) Dynamic optical mesh networks: drivers, challenges and solutions for the future. In: Optical communication, 2009, 35th European conference on ECOC'09. IEEE, pp 1–4

51. Barroso LA, Hölzle U (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synt Lect Comput Archit 4(1):1–108

52. Armstrong J (2009) Ofdm for optical communications. J Light Technol 27(3):189–204

53. Shieh W (2011) Ofdm for flexible high-speed optical networks. J Light Technol 29(10):1560–1577

54. Cisco data center interconnect design and deployment guide (2010)

55. Greenberg A, Hamilton JR, Jain N, Kandula S, Kim C, Lahiri P, Maltz DA, Patel P, Sengupta S (2009) Vl2: a scalable and flexible data center network. In: ACM SIGCOMM computer communication review, vol 39. ACM, pp 51–62

56. Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S (2009) Bcube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 39(4):63–74

57. Farrington N, Porter G, Radhakrishnan S, Bazzaz HH, Subramanya V, Fainman Y, Papen G, Vahdat A (2011) Helios: a hybrid electrical/optical switch architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 41(4):339–350

58. Abu-Libdeh H, Costa P, Rowstron A, O'Shea G, Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Comput Commun Rev 40(4):51–62

59. Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what's needed for datacenter network operations. IEEE Commun Mag 48(7):32–39

60. Wang G, Andersen DG, Kaminsky M, Papagiannaki K, Ng TS, Kozuch M, Ryan M (2010) c-through: part-time optics in data centers. In: ACM SIGCOMM Computer Communication Review, vol 40. ACM, pp 327–338

61. Ye X, Yin Y, Yoo SJB, Mejia P, Proietti R, Akella V (2010) Dos: a scalable optical switch for datacenters. In: Proceedings of the 6th ACM/IEEE symposium on architectures for networking and communications systems. ACM, p 24

62. Singla A, Singh A, Ramachandran K, Xu L, Zhang Y (2010) Proteus: a topology malleable data center network. In: Proceedings of the 9th ACM SIGCOMM workshop on hot topics in networks. ACM, p 8

63. Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P (2011) Energy-efficient design of a scalable optical multiplane interconnection architecture. IEEE J Sel Top Quantum Electron 17(2):377–383

64. Kodi AK, Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance computing (hpc) systems. IEEE J Sel Top Quantum Electron 17(2):384–395

65. Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H (2012) Mirror mirror on the ceiling: flexible wireless links for data centers. ACM SIGCOMM Comput Commun Rev 42(4):443–454

66. Lenzerini M (2002) Data integration: a theoretical perspective. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. ACM, pp 233–246

67. Cafarella MJ, Halevy A, Khoussainova N (2009) Data integration for the relational web. Proc VLDB Endowment 2(1):1090–1101

68. Maletic JI, Marcus A (2000) Data cleansing: beyond integrity analysis. In: IQ. Citeseer, pp 200–209

69. Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1-2):83–113

70. Chen H, Ku W-S, Wang H, Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 51–62

71. Zhao Z, Ng W (2012) A model-based approach for rfid data stream cleansing. In: Proceedings of the 21st ACM international conference on information and knowledge management. ACM, pp 862–871

72. Khoussainova N, Balazinska M, Suciu D (2008) Probabilistic event extraction from rfid data. In: Data Engineering, 2008, IEEE 24th international conference on ICDE 2008. IEEE, pp 1480–1482

73. Herbert KG, Wang JTL (2007) Biological data cleaning: a case study. Int J Inf Qual 1(1):60–82

74. Tsai T-H, Lin C-Y (2012) Exploring contextual redundancy in improving object-based video coding for video sensor networks surveillance. IEEE Transac Multmed 14(3):669–682

75. Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 269–278

76. Kamath U, Compton J, Dogan RI, Jong KD, Shehu A (2012) An evolutionary algorithm approach for feature generation from sequence data and its application to dna splice site prediction. IEEE/ACM Transac Comput Biol Bioinforma (TCBB) 9(5):1387–1398

77. Leung K-S, Lee KH, Wang J-F, Ng EYT, Chan HLY, Tsui SKW, Mok TSK, Tse PC-H, Sung JJ-Y (2011) Data mining on dna sequences of hepatitis b virus. IEEE/ACM Transac Comput Biol Bioinforma 8(2):428–440

78. Huang Z, Shen H, Liu J, Zhou X (2011) Effective data co-reduction for multimedia similarity search. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, pp 1021–1032

79. Bleiholder J, Naumann F (2008) Data fusion. ACM Comput Surv (CSUR) 41(1):1

80. Brewer EA (2000) Towards robust distributed systems. In: PODC, p 7

81. Gilbert S, Lynch N (2002) Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59

82. McKusick MK, Quinlan S (2009) Gfs: evolution on fast-forward. ACM Queue 7(7):10

83. Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J (2008) Scope: easy and efficient parallel processing of massive data sets. Proc VLDB Endowment 1(2):1265–1276

84. Beaver D, Kumar S, Li HC, Sobel J, Vajgel P et al (2010) Finding a needle in haystack: facebook's photo storage. In: OSDI, vol 10, pp 1–8

85. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: amazon's highly available key-value store. In: SOSP, vol 7, pp 205–220

86. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In: Proceedings of the twenty-ninth annual ACM symposium on theory of computing. ACM, pp 654–663

87. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2):4

88. Burrows M (2006) The chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th symposium on Operating systems design and implementation. USENIX Association, pp 335–350

89. Lakshman A, Malik P (2009) Cassandra: structured storage system on a p2p network. In: Proceedings of the 28th ACM symposium on principles of distributed computing. ACM, pp 5–5

90. George L (2011) HBase: the definitive guide. O'Reilly Media Inc

91. Judd D (2008) hypertable-0.9.0.4-alpha

92. Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc

93. Crockford D (2006) The application/json media type for javascript object notation (json)

94. Murty J (2009) Programming amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc

95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc

96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 975–986

97. Yang H-C, Parker DS (2009) Traverse: simplified indexing on large map-reduce-merge clusters. In: Database systems for advanced applications. Springer, pp 308–322

98. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with sawzall. Sci Program 13(4):277–298

99. Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U (2009) Building a high-level dataflow system on top of map-reduce: the pig experience. Proceedings VLDB Endowment 2(2):1414–1425

100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629

101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72

102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) Dryadlinq: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14

103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: Parallel and distributed processing, 2008, IEEE international symposium on IPDPS 2008. IEEE, pp 1–11

104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146

105. Bu Y, Bill H, Balazinska M, Ernst MD (2010) Haloop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296

106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative mapreduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing. ACM, pp 810–818

107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2

108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: mapreduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing. ACM, p 7

109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) Ciel: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on Networked systems design and implementation, p 9

110. Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York

111. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

112. What analytics, data mining, big data software you used in the past 12 months for a real project? (2012) http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html

113. Berthold MR, Cebron N, Dill F, Gabriel TR, Kötter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer

114. Sallam RL, Richardson J, Hagerty J, Hostmann B (2011) Magic quadrant for business intelligence platforms. Gartner Group, Stamford, CT

115. Beyond the PC. Special Report on Personal Technology (2011)

116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A et al (2011) The iplant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on Software quality assurance. ACM, pp 70–77

118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers. ACM, pp 277–286

119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Record 34(2):18–26

120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM Sigmod Record 33(1):50–57

121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Transac Manag Inform Syst (TMIS) 3(2):7

122. Manning CD, Schütze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press

123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Transac Neural Netw 13(5):1163–1177

124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11

125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117

126. Konopnicki D, Shmueli O (1995) W3qs: a query system for the world-wide web. In: VLDB, vol 95, pp 54–65

127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640

128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval. ACM, p 2

129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25

130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19

131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819

132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939

133. Barragáns-Martínez AB, Costa-Montenegro E, Burguillo JC, Rey-López M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend tv programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311

134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91

135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia. ACM, pp 469–478

136. Hirsch JE (2005) An index to quantify an individual's scientific research output. Proc Natl Acad Sci USA 102(46):16569

137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company

138. Aggarwal CC (2011) An introduction to social network data analytics. Springer

139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1046–1054

140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied Computing. ACM, pp 1115–1116

141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Transac Knowl Discov Data (TKDD) 5(2):10

142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on World wide web. ACM, pp 631–640

143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis. ACM, pp 16–25

144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. ACM, pp 315–321

145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 145–158

146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 131–144

147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1007–1016

148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 807–816

149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on Web search and data mining. ACM, pp 657–666

150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360

151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (last accessed 5 May 2013)

152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51

153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive rfid, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 2

154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval. ACM, pp 469–478

155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated, multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium. ACM, pp 445–454

156. Tasevski P (2011) Password attacks and generation strategies. Tartu University, Faculty of Mathematics and Computer Sciences



quality. Data quality is mainly manifested in its accuracy, completeness, redundancy, and consistency. Even though many measures have been taken to improve data quality, the related problems have not yet been well addressed. Therefore, effective methods to automatically detect data quality and repair damaged data need to be investigated.

– Big data safety mechanism: Big data brings about challenges to data encryption due to its large scale and high diversity. The performance of previous encryption methods, designed for small and medium-scale data, cannot meet the demands of big data, so efficient big data cryptography approaches need to be developed. Effective schemes for safety management, access control, and safety communications need to be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, completeness, availability, controllability, and traceability of tenants’ data should be enabled on the premise of efficiency assurance.

– Big data application in information security: Big data not only brings about challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, we may discover potential safety loopholes and APT (Advanced Persistent Threat) through analysis of big data in the form of log files of an Intrusion Detection System. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc., may also be more easily identified through the analysis of big data.
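As a minimal illustration of the log-analysis idea in the last item, the following Python sketch counts suspicious events per source address in a small, made-up log and flags addresses that reach a threshold. The log format, the event names, the function name `suspicious_sources`, and the threshold are all assumptions made for illustration; they are not taken from any particular Intrusion Detection System, and a real deployment would process far larger logs on a distributed platform.

```python
from collections import Counter

# Hypothetical IDS log lines in the form "<timestamp> <source_ip> <event>".
# The format and the event names are assumptions made for this sketch.
SAMPLE_LOG = """\
2014-01-01T00:00:01 10.0.0.5 LOGIN_FAILED
2014-01-01T00:00:02 10.0.0.5 LOGIN_FAILED
2014-01-01T00:00:03 10.0.0.7 LOGIN_OK
2014-01-01T00:00:04 10.0.0.5 LOGIN_FAILED
2014-01-01T00:00:05 10.0.0.9 PORT_SCAN
"""

# Event types treated as suspicious (an illustrative choice).
SUSPICIOUS_EVENTS = {"LOGIN_FAILED", "PORT_SCAN"}


def suspicious_sources(log_text, threshold=3):
    """Return {source_ip: count} for sources with at least `threshold`
    suspicious events in the given log text."""
    counts = Counter()
    for line in log_text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip empty or malformed lines
        _timestamp, source_ip, event = parts
        if event in SUSPICIOUS_EVENTS:
            counts[source_ip] += 1
    return {ip: n for ip, n in counts.items() if n >= threshold}


if __name__ == "__main__":
    print(suspicious_sources(SAMPLE_LOG))
```

The same per-source counting maps naturally onto a MapReduce-style job (emit one pair per suspicious event, then sum per key), which is one reason this kind of analysis is often cited as a fit for big data platforms.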

The safety of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage of energy efficiency optimization, and processed hardware and software system architectures, etc. In particular, big data security, including credibility, backup and recovery, completeness maintenance, and security, should be further investigated.

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the “T” (Technology) was the main concern, and technology drove the development of data. In the big data era, with the prominence of data value and advances in “I” (Information), data will drive the progress of technologies in the near future. Big data will not only have social and economic impact, but will also influence everyone’s ways of living and thinking, which is already happening. We cannot predict the future, but we may take precautions for events that could occur:

– Data with a larger scale, higher diversity, and more complex structures: Although technologies represented by Hadoop have achieved great success, such technologies are expected to fall behind and be replaced, given the rapid development of big data. The theoretical basis of Hadoop emerged as early as 2006. Many researchers have sought better ways to cope with larger-scale, more diverse, and more complexly structured data. These efforts are represented by Spanner, the globally-distributed database of Google, and F1, a fault-tolerant, expandable, distributed relational database. In the future, the storage technology of big data will employ distributed databases, support transaction mechanisms similar to those of relational databases, and effectively handle data through grammars similar to SQL.

– Data resource performance: Since big data contains huge value, mastering big data means mastering resources. Through the analysis of the value chain of big data, it can be seen that its value comes from the data itself, technologies, and ideas, and that the core is data resources. The reorganization and integration of different datasets can create more value. From now on, enterprises that master big data resources may obtain huge benefits by renting and assigning the rights to use their data.

– Big data promotes the cross fusion of science: Big data not only promotes the comprehensive fusion of cloud computing, IoT, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data needs to explore innovative technologies and methods in terms of data acquisition, storage, processing, analysis, and information security, etc. Then, the impacts of big data on production management, business operation, and decision making, etc., need to be examined for modern enterprises from the management perspective. Moreover, the application of big data to specific fields needs the participation of interdisciplinary talents.

– Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., in text and image editors. In big data applications, mixed data is very useful for decision making. Only when the analytical results are displayed in a user-friendly way can they be effectively utilized by users. Reports, histograms, pie charts, and regression curves, etc., are frequently used to visualize the results of data analysis. New presentation forms will appear in the future; e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

– Data-oriented: It is well known that programs consist of data structures and algorithms, and that data structures are used to store data. In the history of program design, the role of data has become increasingly significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed that of business logic. Consequently, programs are gradually being transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods will emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers the revolution of thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

– During data analysis, we will try to utilize all data rather than only analyzing a small set of sample data.

– Compared with accurate data, we would like to accept numerous and complicated data.

– We shall pay greater attention to correlations between things rather than exploring causal relationships.

– The simple algorithms of big data are more effective than complex algorithms of small data.

– Analytical results of big data will reduce hasty and subjective factors during decision making, and data scientists will replace “experts.”

Throughout the history of human society, the demands and willingness of human beings have always been the driving forces of scientific and technological progress. Big data may provide reference answers for human beings to make decisions through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will attract increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao’s research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

                                                                    References

1. Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

2. Fact sheet: Big data across the federal government (2012). http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf

3. Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper

4. Drowning in numbers - digital data will flood the planet - and help us understand it better (2011). http://www.economist.com/blogs/dailychart/2011/11/bigdata-0

5. Lohr S (2012) The age of big data. New York Times, pp 11

6. Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/thedigitalbreadcrumbs-that-lead-to-big-data

7. Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-searchforanalysts-to-make-sense-of-big-data

8. Big data (2008). http://www.nature.com/news/specials/bigdata/index.html

9. Special online collection: dealing with big data (2011). http://www.sciencemag.org/site/special/data

10. Manyika J, McKinsey Global Institute, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute

11. Mayer-Schonberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Eamon Dolan/Houghton Mifflin Harcourt

12. Laney D (2001) 3-d data management: controlling data volume, velocity and variety. META Group Research Note, 6 February

13. Zikopoulos P, Eaton C et al (2011) Understanding big data: analytics for enterprise class hadoop and streaming data. McGraw-Hill Osborne Media

14. Meijer E (2011) The world according to linq. Communications of the ACM 54(10):45–51

15. Beyer M (2011) Gartner says solving big data challenge involves more than just managing volumes of data. Gartner. http://www.gartner.com/it/page.jsp

16. O R Team (2011) Big data now: current perspectives from O’Reilly Radar. O’Reilly Media

17. Grobelnik M (2012) Big data tutorial. http://videolectures.net/eswc2012grobelnikbigdata

18. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2008) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014

19. DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6):85–98


20. Walter T (2009) Teradata past, present, and future. UCI ISG lecture series on scalable data management

21. Ghemawat S, Gobioff H, Leung S-T (2003) The google file system. In: ACM SIGOPS Operating Systems Review, vol 37. ACM, pp 29–43

22. Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

23. Hey AJG, Tansley S, Tolle KM et al (2009) The fourth paradigm: data-intensive scientific discovery

24. Howard JH, Kazar ML, Menees SG, Nichols DA, Satyanarayanan M, Sidebotham RN, West MJ (1988) Scale and performance in a distributed file system. ACM Trans Comput Syst (TOCS) 6(1):51–81

25. Cattell R (2011) Scalable sql and nosql data stores. ACM SIGMOD Record 39(4):12–27

26. Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033

27. Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98

28. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J et al (2012) Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States

29. Sun Y, Chen M, Liu B, Mao S (2013) Far: a fault-avoidant routing method for data center networks with regular topology. In: Proceedings of ACM/IEEE symposium on architectures for networking and communications systems (ANCS’13). ACM

30. Wiki (2013) Applications and organizations using hadoop. http://wiki.apache.org/hadoop/PoweredBy

31. Bahga A, Madisetti VK (2012) Analyzing massive machine maintenance data in a computing cloud. IEEE Transac Parallel Distrib Syst 23(10):1831–1843

32. Gunarathne T, Wu T-L, Choi JY, Bae S-H, Qiu J (2011) Cloud computing paradigms for pleasingly parallel biomedical applications. Concurr Comput Prac Experience 23(17):2338–2354

33. Gantz J, Reinsel D (2010) The digital universe decade-are you ready. External publication of IDC (Analyse the Future) information and data, pp 1–16

34. Bryant RE (2011) Data-intensive scalable computing for scientific applications. Comput Sci Eng 13(6):25–33

35. Wahab MHA, Mohd MNH, Hanafi HF, Mohsin MFM (2008) Data pre-processing on web server logs for generalized association rules mining algorithm. World Acad Sci Eng Technol 48:2008

36. Nanopoulos A, Manolopoulos Y, Zakrzewicz M, Morzy T (2002) Indexing web access-logs for pattern queries. In: Proceedings of the 4th international workshop on web information and data management. ACM, pp 63–68

37. Joshi KP, Joshi A, Yesha Y (2003) On using a warehouse to analyze web logs. Distrib Parallel Databases 13(2):161–180

38. Chandramohan V, Christensen K (2002) A first look at wired sensor networks for video surveillance systems. In: Proceedings LCN 2002, 27th annual IEEE conference on local computer networks. IEEE, pp 728–729

39. Selavo L, Wood A, Cao Q, Sookoor T, Liu H, Srinivasan A, Wu Y, Kang W, Stankovic J, Young D et al (2007) Luster: wireless sensor network for environmental research. In: Proceedings of the 5th international conference on Embedded networked sensor systems. ACM, pp 103–116

40. Barrenetxea G, Ingelrest F, Schaefer G, Vetterli M, Couach O, Parlange M (2008) Sensorscope: out-of-the-box environmental monitoring. In: Information processing in sensor networks, 2008, international conference on IPSN’08. IEEE, pp 332–343

41. Kim Y, Schmid T, Charbiwala ZM, Friedman J, Srivastava MB (2008) Nawms: nonintrusive autonomous water monitoring system. In: Proceedings of the 6th ACM conference on Embedded network sensor systems. ACM, pp 309–322

42. Kim S, Pakzad S, Culler D, Demmel J, Fenves G, Glaser S, Turon M (2007) Health monitoring of civil infrastructures using wireless sensor networks. In: Information Processing in Sensor Networks, 2007, 6th International Symposium on IPSN 2007. IEEE, pp 254–263

43. Ceriotti M, Mottola L, Picco GP, Murphy AL, Guna S, Corra M, Pozzi M, Zonta D, Zanon P (2009) Monitoring heritage buildings with wireless sensor networks: the torre aquila deployment. In: Proceedings of the 2009 International Conference on Information Processing in Sensor Networks. IEEE Computer Society, pp 277–288

44. Tolle G, Polastre J, Szewczyk R, Culler D, Turner N, Tu K, Burgess S, Dawson T, Buonadonna P, Gay D et al (2005) A macroscope in the redwoods. In: Proceedings of the 3rd international conference on embedded networked sensor systems. ACM, pp 51–63

45. Wang F, Liu J (2011) Networked wireless sensor data collection: issues, challenges, and approaches. IEEE Commun Surv Tutor 13(4):673–687

46. Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the 11th international conference on World Wide Web. ACM, pp 124–135

47. Choudhary S, Dincturk ME, Mirtaheri SM, Moosavi A, von Bochmann G, Jourdan G-V, Onut I-V (2012) Crawling rich internet applications: the state of the art. In: CASCON, pp 146–160

48. Ghani N, Dixit S, Wang T-S (2000) On ip-over-wdm integration. IEEE Commun Mag 38(3):72–84

49. Manchester J, Anderson J, Doshi B, Dravida S (1998) Ip over sonet. IEEE Commun Mag 36(5):136–142

50. Jinno M, Takara H, Kozicki B (2009) Dynamic optical mesh networks: drivers, challenges and solutions for the future. In: Optical communication, 2009, 35th European conference on ECOC’09. IEEE, pp 1–4

51. Barroso LA, Holzle U (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synt Lect Comput Archit 4(1):1–108

52. Armstrong J (2009) Ofdm for optical communications. J Light Technol 27(3):189–204

53. Shieh W (2011) Ofdm for flexible high-speed optical networks. J Light Technol 29(10):1560–1577

54. Cisco data center interconnect design and deployment guide (2010)

55. Greenberg A, Hamilton JR, Jain N, Kandula S, Kim C, Lahiri P, Maltz DA, Patel P, Sengupta S (2009) Vl2: a scalable and flexible data center network. In: ACM SIGCOMM computer communication review, vol 39. ACM, pp 51–62

56. Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S (2009) Bcube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 39(4):63–74

57. Farrington N, Porter G, Radhakrishnan S, Bazzaz HH, Subramanya V, Fainman Y, Papen G, Vahdat A (2011) Helios: a hybrid electrical/optical switch architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 41(4):339–350

58. Abu-Libdeh H, Costa P, Rowstron A, O’Shea G, Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Comput Commun Rev 40(4):51–62

59. Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what’s needed for datacenter network operations. IEEE Commun Mag 48(7):32–39


                                                                    60 Wang G Andersen DG Kaminsky M Papagiannaki K NgTS Kozuch M Ryan M (2010) c-through Part-time optics indata centers In ACM SIGCOMM Computer CommunicationReview vol 40 ACM pp 327ndash338

                                                                    61 Ye X Yin Y Yoo SJB Mejia P Proietti R Akella V (2010) Dosa scalable optical switch for datacenters In Proceedings of the6th ACMIEEE symposium on architectures for networking andcommunications systems ACM p 24

62. Singla A, Singh A, Ramachandran K, Xu L, Zhang Y (2010) Proteus: a topology malleable data center network. In: Proceedings of the 9th ACM SIGCOMM workshop on hot topics in networks. ACM, p 8

63. Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P (2011) Energy-efficient design of a scalable optical multiplane interconnection architecture. IEEE J Sel Top Quantum Electron 17(2):377–383

64. Kodi AK, Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance computing (hpc) systems. IEEE J Sel Top Quantum Electron 17(2):384–395

65. Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H (2012) Mirror mirror on the ceiling: flexible wireless links for data centers. ACM SIGCOMM Comput Commun Rev 42(4):443–454

66. Lenzerini M (2002) Data integration: a theoretical perspective. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. ACM, pp 233–246

67. Cafarella MJ, Halevy A, Khoussainova N (2009) Data integration for the relational web. Proc VLDB Endowment 2(1):1090–1101

68. Maletic JI, Marcus A (2000) Data cleansing: beyond integrity analysis. In: IQ. Citeseer, pp 200–209

69. Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1–2):83–113

70. Chen H, Ku W-S, Wang H, Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 51–62

71. Zhao Z, Ng W (2012) A model-based approach for rfid data stream cleansing. In: Proceedings of the 21st ACM international conference on information and knowledge management. ACM, pp 862–871

72. Khoussainova N, Balazinska M, Suciu D (2008) Probabilistic event extraction from rfid data. In: Data Engineering, 2008. IEEE 24th international conference on ICDE 2008. IEEE, pp 1480–1482

73. Herbert KG, Wang JTL (2007) Biological data cleaning: a case study. Int J Inf Qual 1(1):60–82

74. Tsai T-H, Lin C-Y (2012) Exploring contextual redundancy in improving object-based video coding for video sensor networks surveillance. IEEE Transac Multmed 14(3):669–682

75. Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 269–278

76. Kamath U, Compton J, Dogan RI, Jong KD, Shehu A (2012) An evolutionary algorithm approach for feature generation from sequence data and its application to dna splice site prediction. IEEE/ACM Transac Comput Biol Bioinforma (TCBB) 9(5):1387–1398

77. Leung K-S, Lee KH, Wang J-F, Ng EYT, Chan HLY, Tsui SKW, Mok TSK, Tse PC-H, Sung JJ-Y (2011) Data mining on dna sequences of hepatitis b virus. IEEE/ACM Transac Comput Biol Bioinforma 8(2):428–440

78. Huang Z, Shen H, Liu J, Zhou X (2011) Effective data co-reduction for multimedia similarity search. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, pp 1021–1032

79. Bleiholder J, Naumann F (2008) Data fusion. ACM Comput Surv (CSUR) 41(1):1

80. Brewer EA (2000) Towards robust distributed systems. In: PODC, p 7

81. Gilbert S, Lynch N (2002) Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59

82. McKusick MK, Quinlan S (2009) Gfs: evolution on fast-forward. ACM Queue 7(7):10

83. Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J (2008) Scope: easy and efficient parallel processing of massive data sets. Proc VLDB Endowment 1(2):1265–1276

84. Beaver D, Kumar S, Li HC, Sobel J, Vajgel P et al (2010) Finding a needle in haystack: facebook's photo storage. In: OSDI, vol 10, pp 1–8

85. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: amazon's highly available key-value store. In: SOSP, vol 7, pp 205–220

86. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In: Proceedings of the twenty-ninth annual ACM symposium on theory of computing. ACM, pp 654–663

87. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2):4

88. Burrows M (2006) The chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th symposium on Operating systems design and implementation. USENIX Association, pp 335–350

89. Lakshman A, Malik P (2009) Cassandra: structured storage system on a p2p network. In: Proceedings of the 28th ACM symposium on principles of distributed computing. ACM, pp 5–5

90. George L (2011) HBase: the definitive guide. O'Reilly Media Inc

91. Judd D (2008) hypertable-0.9.0.4-alpha

92. Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc

93. Crockford D (2006) The application/json media type for javascript object notation (json)

94. Murty J (2009) Programming amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc

95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc

96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 975–986

97. Yang H-C, Parker DS (2009) Traverse: simplified indexing on large map-reduce-merge clusters. In: Database systems for advanced applications. Springer, pp 308–322

98. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with sawzall. Sci Program 13(4):277–298

99. Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U (2009) Building a high-level dataflow system on top of map-reduce: the pig experience. Proceedings VLDB Endowment 2(2):1414–1425

100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629

101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72

102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) Dryadlinq: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14

103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: Parallel and distributed processing, 2008. IEEE international symposium on IPDPS 2008. IEEE, pp 1–11

104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146

105. Bu Y, Bill H, Balazinska M, Ernst MD (2010) Haloop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1–2):285–296

106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative mapreduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing. ACM, pp 810–818

107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2

108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: mapreduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing. ACM, p 7

109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) Ciel: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on Networked systems design and implementation, p 9

110. Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York

111. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

112. What analytics, data mining, big data software you used in the past 12 months for a real project? (2012) http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html

113. Berthold MR, Cebron N, Dill F, Gabriel TR, Kotter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer

114. Sallam RL, Richardson J, Hagerty J, Hostmann B (2011) Magic quadrant for business intelligence platforms. CT, Gartner Group, Stamford

115. Beyond the PC. Special Report on Personal Technology (2011)

116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A et al (2011) The iplant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on Software quality assurance. ACM, pp 70–77

118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers. ACM, pp 277–286

119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Record 34(2):18–26

120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM Sigmod Record 33(1):50–57

121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Transac Manag Inform Syst (TMIS) 3(2):7

122. Manning CD, Schutze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press

123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Transac Neural Netw 13(5):1163–1177

124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11

125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117

126. Konopnicki D, Shmueli O (1995) W3qs: a query system for the world-wide web. In: VLDB, vol 95, pp 54–65

127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640

128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval. ACM, pp 2

129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25

130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19

131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819

132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939

133. Barragáns-Martínez AB, Costa-Montenegro E, Burguillo JC, Rey-López M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend tv programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311

134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91

135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia. ACM, pp 469–478

136. Hirsch JE (2005) An index to quantify an individual's scientific research output. Proc Natl Acad Sci USA 102(46):16569

137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company

138. Aggarwal CC (2011) An introduction to social network data analytics. Springer

139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1046–1054

140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied Computing. ACM, pp 1115–1116

141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Transac Knowl Discov Data (TKDD) 5(2):10

142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on World wide web. ACM, pp 631–640

143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis. ACM, pp 16–25

144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. ACM, pp 315–321

145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 145–158

146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 131–144

147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1007–1016

148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 807–816

149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on Web search and data mining. ACM, pp 657–666

150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360

151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (Accessed 5 May 2013)

152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51

153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive rfid, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 2

154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval. ACM, pp 469–478

155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated, multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium. ACM, pp 445–454

156. Tasevski P (2011) Password attacks and generation strategies. Tartu University, Faculty of Mathematics and Computer Sciences


utilizes relational diagrams to express interpersonal relationships

– Data-oriented: It is well known that programs consist of data structures and algorithms, and that data structures are used to store data. Over the history of program design, the role of data has become increasingly significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed that of business logic. Consequently, programs are gradually being transformed from algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods will emerge, with far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

– Big data triggers a revolution in thinking: Gradually, big data and its analysis will profoundly influence our ways of thinking. In [11], the authors summarize the thinking revolution triggered by big data as follows:

    – During data analysis, we will try to use all the data rather than analyzing only a small set of sample data.

    – We will be willing to accept numerous and complicated data rather than insisting on accurate data.

    – We shall pay greater attention to correlations between things rather than exploring causal relationships.

    – Simple algorithms on big data are more effective than complex algorithms on small data.

    – Analytical results from big data will reduce hasty and subjective factors in decision making, and data scientists will replace “experts”.

Throughout the history of human society, the demands and willingness of human beings have always been the driving force behind scientific and technological progress. Big data may provide reference answers for human decision making through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of IoT, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only the users and consumers of big data but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities based on big data will attract increasing attention and will certainly cause enormous transformations of social activities in the future society.

Acknowledgments This work was supported by China National Natural Science Foundation (No. 61300224), the Ministry of Science and Technology (MOST), China, the International Science and Technology Collaboration Program (Project No. 2014DFT10070), and the Hubei Provincial Key Project (No. 2013CFA051). Shiwen Mao's research is supported in part by the US NSF under grants CNS-1320664, CNS-1247955, and CNS-0953513, and through the NSF Broadband Wireless Access & Applications Center (BWAC) site at Auburn University.

                                                                      References

1. Gantz J, Reinsel D (2011) Extracting value from chaos. IDC iView, pp 1–12

2. Fact sheet: Big data across the federal government (2012). http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_3_29_2012.pdf

3. Cukier K (2010) Data, data everywhere: a special report on managing information. Economist Newspaper

4. Drowning in numbers - digital data will flood the planet - and help us understand it better (2011). http://www.economist.com/blogs/dailychart/2011/11/bigdata-0

5. Lohr S (2012) The age of big data. New York Times, pp 11

6. Yuki N (2011) Following digital breadcrumbs to big data gold. http://www.npr.org/2011/11/29/142521910/thedigitalbreadcrumbs-that-lead-to-big-data

7. Yuki N (2011) The search for analysts to make sense of big data. http://www.npr.org/2011/11/30/142893065/the-searchforanalysts-to-make-sense-of-big-data

8. Big data (2008). http://www.nature.com/news/specials/bigdata/index.html

9. Special online collection: dealing with big data (2011). http://www.sciencemag.org/site/special/data

10. Manyika J, McKinsey Global Institute, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Byers AH (2011) Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute

11. Mayer-Schonberger V, Cukier K (2013) Big data: a revolution that will transform how we live, work, and think. Eamon Dolan/Houghton Mifflin Harcourt

12. Laney D (2001) 3-d data management: controlling data volume, velocity and variety. META Group Research Note, 6 February

13. Zikopoulos P, Eaton C et al (2011) Understanding big data: analytics for enterprise class hadoop and streaming data. McGraw-Hill Osborne Media

14. Meijer E (2011) The world according to linq. Commun ACM 54(10):45–51

15. Beyer M (2011) Gartner says solving big data challenge involves more than just managing volumes of data. Gartner. http://www.gartner.com/it/page.jsp

16. O R Team (2011) Big data now: current perspectives from O'Reilly Radar. O'Reilly Media

17. Grobelnik M (2012) Big data tutorial. http://videolectures.net/eswc2012grobelnikbigdata

18. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L (2008) Detecting influenza epidemics using search engine query data. Nature 457(7232):1012–1014

19. DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6):85–98


20. Walter T (2009) Teradata past, present, and future. UCI ISG lecture series on scalable data management

21. Ghemawat S, Gobioff H, Leung S-T (2003) The google file system. In: ACM SIGOPS Operating Systems Review, vol 37. ACM, pp 29–43

22. Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

23. Hey AJG, Tansley S, Tolle KM et al (2009) The fourth paradigm: data-intensive scientific discovery

24. Howard JH, Kazar ML, Menees SG, Nichols DA, Satyanarayanan M, Sidebotham RN, West MJ (1988) Scale and performance in a distributed file system. ACM Trans Comput Syst (TOCS) 6(1):51–81

25. Cattell R (2011) Scalable sql and nosql data stores. ACM SIGMOD Record 39(4):12–27

26. Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033

27. Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98

28. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J et al (2012) Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States

29. Sun Y, Chen M, Liu B, Mao S (2013) Far: a fault-avoidant routing method for data center networks with regular topology. In: Proceedings of ACM/IEEE symposium on architectures for networking and communications systems (ANCS'13). ACM

30. Wiki (2013) Applications and organizations using hadoop. http://wiki.apache.org/hadoop/PoweredBy

31. Bahga A, Madisetti VK (2012) Analyzing massive machine maintenance data in a computing cloud. IEEE Transac Parallel Distrib Syst 23(10):1831–1843

32. Gunarathne T, Wu T-L, Choi JY, Bae S-H, Qiu J (2011) Cloud computing paradigms for pleasingly parallel biomedical applications. Concurr Comput Prac Experience 23(17):2338–2354

33. Gantz J, Reinsel D (2010) The digital universe decade - are you ready. External publication of IDC (Analyse the Future) information and data, pp 1–16

34. Bryant RE (2011) Data-intensive scalable computing for scientific applications. Comput Sci Eng 13(6):25–33

35. Wahab MHA, Mohd MNH, Hanafi HF, Mohsin MFM (2008) Data pre-processing on web server logs for generalized association rules mining algorithm. World Acad Sci Eng Technol 48:2008

36. Nanopoulos A, Manolopoulos Y, Zakrzewicz M, Morzy T (2002) Indexing web access-logs for pattern queries. In: Proceedings of the 4th international workshop on web information and data management. ACM, pp 63–68

37. Joshi KP, Joshi A, Yesha Y (2003) On using a warehouse to analyze web logs. Distrib Parallel Databases 13(2):161–180

38. Chandramohan V, Christensen K (2002) A first look at wired sensor networks for video surveillance systems. In: Proceedings LCN 2002, 27th annual IEEE conference on local computer networks. IEEE, pp 728–729

39. Selavo L, Wood A, Cao Q, Sookoor T, Liu H, Srinivasan A, Wu Y, Kang W, Stankovic J, Young D et al (2007) Luster: wireless sensor network for environmental research. In: Proceedings of the 5th international conference on Embedded networked sensor systems. ACM, pp 103–116

40. Barrenetxea G, Ingelrest F, Schaefer G, Vetterli M, Couach O, Parlange M (2008) Sensorscope: out-of-the-box environmental monitoring. In: Information processing in sensor networks, 2008, international conference on IPSN'08. IEEE, pp 332–343

41. Kim Y, Schmid T, Charbiwala ZM, Friedman J, Srivastava MB (2008) Nawms: nonintrusive autonomous water monitoring system. In: Proceedings of the 6th ACM conference on Embedded network sensor systems. ACM, pp 309–322

42. Kim S, Pakzad S, Culler D, Demmel J, Fenves G, Glaser S, Turon M (2007) Health monitoring of civil infrastructures using wireless sensor networks. In: Information Processing in Sensor Networks, 2007, 6th International Symposium on IPSN 2007. IEEE, pp 254–263

43. Ceriotti M, Mottola L, Picco GP, Murphy AL, Guna S, Corra M, Pozzi M, Zonta D, Zanon P (2009) Monitoring heritage buildings with wireless sensor networks: the torre aquila deployment. In: Proceedings of the 2009 International Conference on Information Processing in Sensor Networks. IEEE Computer Society, pp 277–288

44. Tolle G, Polastre J, Szewczyk R, Culler D, Turner N, Tu K, Burgess S, Dawson T, Buonadonna P, Gay D et al (2005) A macroscope in the redwoods. In: Proceedings of the 3rd international conference on embedded networked sensor systems. ACM, pp 51–63

45. Wang F, Liu J (2011) Networked wireless sensor data collection: issues, challenges, and approaches. IEEE Commun Surv Tutor 13(4):673–687

46. Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the 11th international conference on World Wide Web. ACM, pp 124–135

47. Choudhary S, Dincturk ME, Mirtaheri SM, Moosavi A, von Bochmann G, Jourdan G-V, Onut I-V (2012) Crawling rich internet applications: the state of the art. In: CASCON, pp 146–160

48. Ghani N, Dixit S, Wang T-S (2000) On ip-over-wdm integration. IEEE Commun Mag 38(3):72–84

49. Manchester J, Anderson J, Doshi B, Dravida S (1998) Ip over sonet. IEEE Commun Mag 36(5):136–142

50. Jinno M, Takara H, Kozicki B (2009) Dynamic optical mesh networks: drivers, challenges and solutions for the future. In: Optical communication, 2009, 35th European conference on ECOC'09. IEEE, pp 1–4

51. Barroso LA, Holzle U (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synt Lect Comput Archit 4(1):1–108

52. Armstrong J (2009) Ofdm for optical communications. J Light Technol 27(3):189–204

53. Shieh W (2011) Ofdm for flexible high-speed optical networks. J Light Technol 29(10):1560–1577

54. Cisco data center interconnect design and deployment guide (2010)

55. Greenberg A, Hamilton JR, Jain N, Kandula S, Kim C, Lahiri P, Maltz DA, Patel P, Sengupta S (2009) Vl2: a scalable and flexible data center network. In: ACM SIGCOMM computer communication review, vol 39. ACM, pp 51–62

56. Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S (2009) Bcube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 39(4):63–74

57. Farrington N, Porter G, Radhakrishnan S, Bazzaz HH, Subramanya V, Fainman Y, Papen G, Vahdat A (2011) Helios: a hybrid electrical/optical switch architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 41(4):339–350

58. Abu-Libdeh H, Costa P, Rowstron A, O'Shea G, Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Comput Commun Rev 40(4):51–62

59. Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what's needed for datacenter network operations. IEEE Commun Mag 48(7):32–39


60. Wang G, Andersen DG, Kaminsky M, Papagiannaki K, Ng TS, Kozuch M, Ryan M (2010) c-through: Part-time optics in data centers. In: ACM SIGCOMM Computer Communication Review, vol 40. ACM, pp 327–338

61. Ye X, Yin Y, Yoo SJB, Mejia P, Proietti R, Akella V (2010) Dos: a scalable optical switch for datacenters. In: Proceedings of the 6th ACM/IEEE symposium on architectures for networking and communications systems. ACM, p 24

62. Singla A, Singh A, Ramachandran K, Xu L, Zhang Y (2010) Proteus: a topology malleable data center network. In: Proceedings of the 9th ACM SIGCOMM workshop on hot topics in networks. ACM, p 8

63. Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P (2011) Energy-efficient design of a scalable optical multiplane interconnection architecture. IEEE J Sel Top Quantum Electron 17(2):377–383

64. Kodi AK, Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance computing (hpc) systems. IEEE J Sel Top Quantum Electron 17(2):384–395

65. Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H (2012) Mirror mirror on the ceiling: flexible wireless links for data centers. ACM SIGCOMM Comput Commun Rev 42(4):443–454

66. Lenzerini M (2002) Data integration: a theoretical perspective. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. ACM, pp 233–246

67. Cafarella MJ, Halevy A, Khoussainova N (2009) Data integration for the relational web. Proc VLDB Endowment 2(1):1090–1101

68. Maletic JI, Marcus A (2000) Data cleansing: beyond integrity analysis. In: IQ. Citeseer, pp 200–209

69. Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1-2):83–113

70. Chen H, Ku W-S, Wang H, Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 51–62

71. Zhao Z, Ng W (2012) A model-based approach for rfid data stream cleansing. In: Proceedings of the 21st ACM international conference on information and knowledge management. ACM, pp 862–871

72. Khoussainova N, Balazinska M, Suciu D (2008) Probabilistic event extraction from rfid data. In: Data Engineering, 2008, IEEE 24th international conference on ICDE 2008. IEEE, pp 1480–1482

73. Herbert KG, Wang JTL (2007) Biological data cleaning: a case study. Int J Inf Qual 1(1):60–82

74. Tsai T-H, Lin C-Y (2012) Exploring contextual redundancy in improving object-based video coding for video sensor networks surveillance. IEEE Transac Multmed 14(3):669–682

75. Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 269–278

76. Kamath U, Compton J, Dogan RI, Jong KD, Shehu A (2012) An evolutionary algorithm approach for feature generation from sequence data and its application to dna splice site prediction. IEEE/ACM Transac Comput Biol Bioinforma (TCBB) 9(5):1387–1398

77. Leung K-S, Lee KH, Wang J-F, Ng EYT, Chan HLY, Tsui SKW, Mok TSK, Tse PC-H, Sung JJ-Y (2011) Data mining on dna sequences of hepatitis b virus. IEEE/ACM Transac Comput Biol Bioinforma 8(2):428–440

78. Huang Z, Shen H, Liu J, Zhou X (2011) Effective data co-reduction for multimedia similarity search. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, pp 1021–1032

79. Bleiholder J, Naumann F (2008) Data fusion. ACM Comput Surv (CSUR) 41(1):1

80. Brewer EA (2000) Towards robust distributed systems. In: PODC, p 7

81. Gilbert S, Lynch N (2002) Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59

82. McKusick MK, Quinlan S (2009) Gfs: evolution on fast-forward. ACM Queue 7(7):10

83. Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J (2008) Scope: easy and efficient parallel processing of massive data sets. Proc VLDB Endowment 1(2):1265–1276

84. Beaver D, Kumar S, Li HC, Sobel J, Vajgel P et al (2010) Finding a needle in haystack: facebook's photo storage. In: OSDI, vol 10, pp 1–8

85. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: amazon's highly available key-value store. In: SOSP, vol 7, pp 205–220

86. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In: Proceedings of the twenty-ninth annual ACM symposium on theory of computing. ACM, pp 654–663

87. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2):4

88. Burrows M (2006) The chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th symposium on Operating systems design and implementation. USENIX Association, pp 335–350

89. Lakshman A, Malik P (2009) Cassandra: structured storage system on a p2p network. In: Proceedings of the 28th ACM symposium on principles of distributed computing. ACM, pp 5–5

90. George L (2011) HBase: the definitive guide. O'Reilly Media Inc

91. Judd D (2008) hypertable-0.9.0.4-alpha

92. Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc

93. Crockford D (2006) The application/json media type for javascript object notation (json)

94. Murty J (2009) Programming amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc

95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc

96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 975–986

97. Yang H-C, Parker DS (2009) Traverse: simplified indexing on large map-reduce-merge clusters. In: Database systems for advanced applications. Springer, pp 308–322

98. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with sawzall. Sci Program 13(4):277–298

99. Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U (2009) Building a high-level dataflow system on top of map-reduce: the pig experience. Proc VLDB Endowment 2(2):1414–1425


100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629

101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72

102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) Dryadlinq: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14

103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: Parallel and distributed processing, 2008, IEEE international symposium on IPDPS 2008. IEEE, pp 1–11

104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146

105. Bu Y, Bill H, Balazinska M, Ernst MD (2010) Haloop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296

106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative mapreduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing. ACM, pp 810–818

107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2

108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: mapreduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing. ACM, p 7

109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) Ciel: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on Networked systems design and implementation, p 9

110. Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York

111. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

112. What analytics, data mining, big data software you used in the past 12 months for a real project (2012). http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html

113. Berthold MR, Cebron N, Dill F, Gabriel TR, Kotter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer

114. Sallam RL, Richardson J, Hagerty J, Hostmann B (2011) Magic quadrant for business intelligence platforms. Gartner Group, Stamford, CT

115. Beyond the PC. Special Report on Personal Technology (2011)

116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A et al (2011) The iplant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on Software quality assurance. ACM, pp 70–77

118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers. ACM, pp 277–286

119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Record 34(2):18–26

120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM Sigmod Record 33(1):50–57

121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Transac Manag Inform Syst (TMIS) 3(2):7

122. Manning CD, Schutze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press

123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Transac Neural Netw 13(5):1163–1177

124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11

125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117

126. Konopnicki D, Shmueli O (1995) W3qs: a query system for the world-wide web. In: VLDB, vol 95, pp 54–65

127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640

                                                                      128 Ding D Metze F Rawat S Schulam PF Burger S YounessianE Bao L Christel MG Hauptmann A (2012) Beyond audio andvideo retrieval towards multimedia summarization In Proceed-ings of the 2nd ACM international conference on multimediaretrieval ACM pp 2

                                                                      129 Wang M Ni B Hua X-S Chua T-S (2012) Assistive tag-ging a survey of multimedia tagging with human-computer jointexploration ACM Comput Surv (CSUR) 44(4)25

                                                                      130 Lew MS Sebe N Djeraba C Jain R (2006) Content-based multi-media information retrieval state of the art and challenges ACMTrans Multimed Comput Commun Appl (TOMCCAP) 2(1)1ndash19

                                                                      131 Hu W Xie N Li L Zeng X Maybank S (2011) A survey onvisual content-based video indexing and retrieval IEEE TransSyst Man Cybern Part C Appl Rev 41(6)797ndash819

                                                                      132 Park Y-J Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recom-mendation Expert Syst Appl 36(2)1932ndash1939

133. Barragáns-Martínez AB, Costa-Montenegro E, Burguillo JC, Rey-López M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend tv programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311

134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91

135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia. ACM, pp 469–478

136. Hirsch JE (2005) An index to quantify an individual's scientific research output. Proc Natl Acad Sci USA 102(46):16569

137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company

138. Aggarwal CC (2011) An introduction to social network data analytics. Springer

139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1046–1054

140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied Computing. ACM, pp 1115–1116


141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Transac Knowl Discov Data (TKDD) 5(2):10

142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on World wide web. ACM, pp 631–640

143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis. ACM, pp 16–25

144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. ACM, pp 315–321

145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 145–158

146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 131–144

147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1007–1016

148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 807–816

149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on Web search and data mining. ACM, pp 657–666

150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360

151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (Accessed 5 May 2013)

152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51

153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive rfid, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 2

154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval. ACM, pp 469–478

155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated, multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium. ACM, pp 445–454

156. Tasevski P (2011) Password attacks and generation strategies. Tartu University: Faculty of Mathematics and Computer Sciences



20. Walter T (2009) Teradata past, present, and future. UCI ISG lecture series on scalable data management

21. Ghemawat S, Gobioff H, Leung S-T (2003) The google file system. In: ACM SIGOPS Operating Systems Review, vol 37. ACM, pp 29–43

22. Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

23. Hey AJG, Tansley S, Tolle KM et al (2009) The fourth paradigm: data-intensive scientific discovery

24. Howard JH, Kazar ML, Menees SG, Nichols DA, Satyanarayanan M, Sidebotham RN, West MJ (1988) Scale and performance in a distributed file system. ACM Trans Comput Syst (TOCS) 6(1):51–81

25. Cattell R (2011) Scalable sql and nosql data stores. ACM SIGMOD Record 39(4):12–27

26. Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033

27. Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98

28. Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, Franklin M, Gehrke J, Haas L, Halevy A, Han J et al (2012) Challenges and opportunities with big data. A community white paper developed by leading researchers across the United States

29. Sun Y, Chen M, Liu B, Mao S (2013) Far: a fault-avoidant routing method for data center networks with regular topology. In: Proceedings of ACM/IEEE symposium on architectures for networking and communications systems (ANCS'13). ACM

30. Wiki (2013) Applications and organizations using hadoop. http://wiki.apache.org/hadoop/PoweredBy

31. Bahga A, Madisetti VK (2012) Analyzing massive machine maintenance data in a computing cloud. IEEE Transac Parallel Distrib Syst 23(10):1831–1843

32. Gunarathne T, Wu T-L, Choi JY, Bae S-H, Qiu J (2011) Cloud computing paradigms for pleasingly parallel biomedical applications. Concurr Comput Prac Experience 23(17):2338–2354

33. Gantz J, Reinsel D (2010) The digital universe decade-are you ready? External publication of IDC (Analyse the Future) information and data, pp 1–16

34. Bryant RE (2011) Data-intensive scalable computing for scientific applications. Comput Sci Eng 13(6):25–33

35. Wahab MHA, Mohd MNH, Hanafi HF, Mohsin MFM (2008) Data pre-processing on web server logs for generalized association rules mining algorithm. World Acad Sci Eng Technol 48:2008

36. Nanopoulos A, Manolopoulos Y, Zakrzewicz M, Morzy T (2002) Indexing web access-logs for pattern queries. In: Proceedings of the 4th international workshop on web information and data management. ACM, pp 63–68

37. Joshi KP, Joshi A, Yesha Y (2003) On using a warehouse to analyze web logs. Distrib Parallel Databases 13(2):161–180

38. Chandramohan V, Christensen K (2002) A first look at wired sensor networks for video surveillance systems. In: Proceedings LCN 2002, 27th annual IEEE conference on local computer networks. IEEE, pp 728–729

39. Selavo L, Wood A, Cao Q, Sookoor T, Liu H, Srinivasan A, Wu Y, Kang W, Stankovic J, Young D et al (2007) Luster: wireless sensor network for environmental research. In: Proceedings of the 5th international conference on Embedded networked sensor systems. ACM, pp 103–116

40. Barrenetxea G, Ingelrest F, Schaefer G, Vetterli M, Couach O, Parlange M (2008) Sensorscope: out-of-the-box environmental monitoring. In: Information processing in sensor networks, 2008 international conference on IPSN'08. IEEE, pp 332–343

41. Kim Y, Schmid T, Charbiwala ZM, Friedman J, Srivastava MB (2008) Nawms: nonintrusive autonomous water monitoring system. In: Proceedings of the 6th ACM conference on Embedded network sensor systems. ACM, pp 309–322

42. Kim S, Pakzad S, Culler D, Demmel J, Fenves G, Glaser S, Turon M (2007) Health monitoring of civil infrastructures using wireless sensor networks. In: Information Processing in Sensor Networks, 2007, 6th International Symposium on IPSN 2007. IEEE, pp 254–263

43. Ceriotti M, Mottola L, Picco GP, Murphy AL, Guna S, Corra M, Pozzi M, Zonta D, Zanon P (2009) Monitoring heritage buildings with wireless sensor networks: the torre aquila deployment. In: Proceedings of the 2009 International Conference on Information Processing in Sensor Networks. IEEE Computer Society, pp 277–288

44. Tolle G, Polastre J, Szewczyk R, Culler D, Turner N, Tu K, Burgess S, Dawson T, Buonadonna P, Gay D et al (2005) A macroscope in the redwoods. In: Proceedings of the 3rd international conference on embedded networked sensor systems. ACM, pp 51–63

45. Wang F, Liu J (2011) Networked wireless sensor data collection: issues, challenges, and approaches. IEEE Commun Surv Tutor 13(4):673–687

46. Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the 11th international conference on World Wide Web. ACM, pp 124–135

47. Choudhary S, Dincturk ME, Mirtaheri SM, Moosavi A, von Bochmann G, Jourdan G-V, Onut I-V (2012) Crawling rich internet applications: the state of the art. In: CASCON, pp 146–160

48. Ghani N, Dixit S, Wang T-S (2000) On ip-over-wdm integration. IEEE Commun Mag 38(3):72–84

49. Manchester J, Anderson J, Doshi B, Dravida S (1998) Ip over sonet. IEEE Commun Mag 36(5):136–142

50. Jinno M, Takara H, Kozicki B (2009) Dynamic optical mesh networks: drivers, challenges and solutions for the future. In: Optical communication, 2009, 35th European conference on ECOC'09. IEEE, pp 1–4

51. Barroso LA, Hölzle U (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Synt Lect Comput Archit 4(1):1–108

52. Armstrong J (2009) Ofdm for optical communications. J Light Technol 27(3):189–204

53. Shieh W (2011) Ofdm for flexible high-speed optical networks. J Light Technol 29(10):1560–1577

54. Cisco data center interconnect design and deployment guide (2010)

55. Greenberg A, Hamilton JR, Jain N, Kandula S, Kim C, Lahiri P, Maltz DA, Patel P, Sengupta S (2009) Vl2: a scalable and flexible data center network. In: ACM SIGCOMM computer communication review, vol 39. ACM, pp 51–62

56. Guo C, Lu G, Li D, Wu H, Zhang X, Shi Y, Tian C, Zhang Y, Lu S (2009) Bcube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 39(4):63–74

57. Farrington N, Porter G, Radhakrishnan S, Bazzaz HH, Subramanya V, Fainman Y, Papen G, Vahdat A (2011) Helios: a hybrid electrical/optical switch architecture for modular data centers. ACM SIGCOMM Comput Commun Rev 41(4):339–350

58. Abu-Libdeh H, Costa P, Rowstron A, O'Shea G, Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Comput Commun Rev 40(4):51–62

59. Lam C, Liu H, Koley B, Zhao X, Kamalov V, Gill V (2010) Fiber optic communication technologies: what's needed for datacenter network operations. IEEE Commun Mag 48(7):32–39


60. Wang G, Andersen DG, Kaminsky M, Papagiannaki K, Ng TS, Kozuch M, Ryan M (2010) c-through: Part-time optics in data centers. In: ACM SIGCOMM Computer Communication Review, vol 40. ACM, pp 327–338

61. Ye X, Yin Y, Yoo SJB, Mejia P, Proietti R, Akella V (2010) Dos: a scalable optical switch for datacenters. In: Proceedings of the 6th ACM/IEEE symposium on architectures for networking and communications systems. ACM, p 24

62. Singla A, Singh A, Ramachandran K, Xu L, Zhang Y (2010) Proteus: a topology malleable data center network. In: Proceedings of the 9th ACM SIGCOMM workshop on hot topics in networks. ACM, p 8

63. Liboiron-Ladouceur O, Cerutti I, Raponi PG, Andriolli N, Castoldi P (2011) Energy-efficient design of a scalable optical multiplane interconnection architecture. IEEE J Sel Top Quantum Electron 17(2):377–383

64. Kodi AK, Louri A (2011) Energy-efficient and bandwidth-reconfigurable photonic networks for high-performance computing (hpc) systems. IEEE J Sel Top Quantum Electron 17(2):384–395

65. Zhou X, Zhang Z, Zhu Y, Li Y, Kumar S, Vahdat A, Zhao BY, Zheng H (2012) Mirror mirror on the ceiling: flexible wireless links for data centers. ACM SIGCOMM Comput Commun Rev 42(4):443–454

66. Lenzerini M (2002) Data integration: a theoretical perspective. In: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. ACM, pp 233–246

67. Cafarella MJ, Halevy A, Khoussainova N (2009) Data integration for the relational web. Proc VLDB Endowment 2(1):1090–1101

68. Maletic JI, Marcus A (2000) Data cleansing: beyond integrity analysis. In: IQ. Citeseer, pp 200–209

69. Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1-2):83–113

70. Chen H, Ku W-S, Wang H, Sun M-T (2010) Leveraging spatio-temporal redundancy for rfid data cleansing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 51–62

71. Zhao Z, Ng W (2012) A model-based approach for rfid data stream cleansing. In: Proceedings of the 21st ACM international conference on information and knowledge management. ACM, pp 862–871

72. Khoussainova N, Balazinska M, Suciu D (2008) Probabilistic event extraction from rfid data. In: Data Engineering, 2008. IEEE 24th international conference on ICDE 2008. IEEE, pp 1480–1482

73. Herbert KG, Wang JTL (2007) Biological data cleaning: a case study. Int J Inf Qual 1(1):60–82

74. Tsai T-H, Lin C-Y (2012) Exploring contextual redundancy in improving object-based video coding for video sensor networks surveillance. IEEE Transac Multmed 14(3):669–682

75. Sarawagi S, Bhamidipaty A (2002) Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 269–278

76. Kamath U, Compton J, Dogan RI, Jong KD, Shehu A (2012) An evolutionary algorithm approach for feature generation from sequence data and its application to dna splice site prediction. IEEE/ACM Transac Comput Biol Bioinforma (TCBB) 9(5):1387–1398

77. Leung K-S, Lee KH, Wang J-F, Ng EYT, Chan HLY, Tsui SKW, Mok TSK, Tse PC-H, Sung JJ-Y (2011) Data mining on dna sequences of hepatitis b virus. IEEE/ACM Transac Comput Biol Bioinforma 8(2):428–440

78. Huang Z, Shen H, Liu J, Zhou X (2011) Effective data co-reduction for multimedia similarity search. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, pp 1021–1032

79. Bleiholder J, Naumann F (2008) Data fusion. ACM Comput Surv (CSUR) 41(1):1

80. Brewer EA (2000) Towards robust distributed systems. In: PODC, p 7

81. Gilbert S, Lynch N (2002) Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33(2):51–59

82. McKusick MK, Quinlan S (2009) Gfs: evolution on fast-forward. ACM Queue 7(7):10

83. Chaiken R, Jenkins B, Larson P-A, Ramsey B, Shakib D, Weaver S, Zhou J (2008) Scope: easy and efficient parallel processing of massive data sets. Proc VLDB Endowment 1(2):1265–1276

84. Beaver D, Kumar S, Li HC, Sobel J, Vajgel P et al (2010) Finding a needle in haystack: facebook's photo storage. In: OSDI, vol 10, pp 1–8

85. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W (2007) Dynamo: amazon's highly available key-value store. In: SOSP, vol 7, pp 205–220

86. Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In: Proceedings of the twenty-ninth annual ACM symposium on theory of computing. ACM, pp 654–663

87. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst (TOCS) 26(2):4

88. Burrows M (2006) The chubby lock service for loosely-coupled distributed systems. In: Proceedings of the 7th symposium on Operating systems design and implementation. USENIX Association, pp 335–350

89. Lakshman A, Malik P (2009) Cassandra: structured storage system on a p2p network. In: Proceedings of the 28th ACM symposium on principles of distributed computing. ACM, pp 5–5

90. George L (2011) HBase: the definitive guide. O'Reilly Media Inc

91. Judd D (2008) hypertable-0.9.0.4-alpha

92. Chodorow K (2013) MongoDB: the definitive guide. O'Reilly Media Inc

93. Crockford D (2006) The application/json media type for javascript object notation (json)

94. Murty J (2009) Programming amazon web services: S3, EC2, SQS, FPS, and SimpleDB. O'Reilly Media Inc

95. Anderson JC, Lehnardt J, Slater N (2010) CouchDB: the definitive guide. O'Reilly Media Inc

96. Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 975–986

97. Yang H-C, Parker DS (2009) Traverse: simplified indexing on large map-reduce-merge clusters. In: Database systems for advanced applications. Springer, pp 308–322

98. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with sawzall. Sci Program 13(4):277–298

99. Gates AF, Natkovich O, Chopra S, Kamath P, Narayanamurthy SM, Olston C, Reed B, Srinivasan S, Srivastava U (2009) Building a high-level dataflow system on top of map-reduce: the pig experience. Proceedings VLDB Endowment 2(2):1414–1425

100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629

101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72

102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) Dryadlinq: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14

103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: Parallel and distributed processing, 2008. IEEE international symposium on IPDPS 2008. IEEE, pp 1–11

104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146

105. Bu Y, Bill H, Balazinska M, Ernst MD (2010) Haloop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296

106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative mapreduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing. ACM, pp 810–818

107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2

108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: mapreduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing. ACM, p 7

109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) Ciel: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on Networked systems design and implementation, p 9

110. Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York

111. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

112. What analytics, data mining, big data software you used in the past 12 months for a real project (2012) http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html

113. Berthold MR, Cebron N, Dill F, Gabriel TR, Kotter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer

114. Sallam RL, Richardson J, Hagerty J, Hostmann B (2011) Magic quadrant for business intelligence platforms. CT, Gartner Group, Stamford

115. Beyond the PC. Special Report on Personal Technology (2011)

116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A et al (2011) The iplant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on Software quality assurance. ACM, pp 70–77

118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers. ACM, pp 277–286

119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Record 34(2):18–26

120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM Sigmod Record 33(1):50–57

121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Transac Manag Inform Syst (TMIS) 3(2):7

122. Manning CD, Schütze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press

123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Transac Neural Netw 13(5):1163–1177

124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11

125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117

126. Konopnicki D, Shmueli O (1995) W3qs: a query system for the world-wide web. In: VLDB, vol 95, pp 54–65

127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640

128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval. ACM, pp 2

129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25

130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19

131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819

132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939

133. Barragáns-Martínez AB, Costa-Montenegro E, Burguillo JC, Rey-López M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend tv programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311

134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91

135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia. ACM, pp 469–478

136. Hirsch JE (2005) An index to quantify an individual's scientific research output. Proc Natl Acad Sci USA 102(46):16569

137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company

138. Aggarwal CC (2011) An introduction to social network data analytics. Springer

139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1046–1054

140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied Computing. ACM, pp 1115–1116

141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Transac Knowl Discov Data (TKDD) 5(2):10

142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on World wide web. ACM, pp 631–640

143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis. ACM, pp 16–25

144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. ACM, pp 315–321

145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 145–158

146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 131–144

147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1007–1016

148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 807–816

149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on Web search and data mining. ACM, pp 657–666

150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360

151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (accessed 5 May 2013)

152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51

153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive rfid, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 2

154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval. ACM, pp 469–478

155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated, multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium. ACM, pp 445–454

156. Tasevski P (2011) Password attacks and generation strategies. Tartu University, Faculty of Mathematics and Computer Sciences

                                                                          86 Karger D Lehman E Leighton T Panigrahy R Levine MLewin D (1997) Consistent hashing and random trees distributedcaching protocols for relieving hot spots on the world wide webIn Proceedings of the twenty-ninth annual ACM symposium ontheory of computing ACM pp 654ndash663

                                                                          87 Chang F Dean J Ghemawat S Hsieh WC Wallach DA BurrowsM Chandra T Fikes A Gruber RE (2008) Bigtable a distributedstorage system for structured data ACM Trans Comput Syst(TOCS) 26(2)4

                                                                          88 Burrows M (2006) The chubby lock service for loosely-coupleddistributed systems In Proceedings of the 7th symposium onOperating systems design and implementation USENIX Associ-ation pp 335ndash350

                                                                          89 Lakshman A Malik P (2009) Cassandra structured storagesystem on a p2p network In Proceedings of the 28th ACMsymposium on principles of distributed computing ACMpp 5ndash5

                                                                          90 George L (2011) HBase the definitive guide OrsquoReilly Media Inc91 Judd D (2008) hypertable-09 04-alpha92 Chodorow K (2013) MongoDB the definitive guide OrsquoReilly

                                                                          Media Inc93 Crockford D (2006) The applicationjson media type for

                                                                          javascript object notation (json)94 Murty J (2009) Programming amazon web services S3 EC2

                                                                          SQS FPS and SimpleDB OrsquoReilly Media Inc95 Anderson JC Lehnardt J Slater N (2010) CouchDB the defini-

                                                                          tive guide OrsquoReilly Media Inc96 Blanas S Patel JM Ercegovac V Rao J Shekita EJ Tian Y

                                                                          (2010) A comparison of join algorithms for log processing inmapreduce In Proceedings of the 2010 ACM SIGMOD inter-national conference on management of data ACM pp 975ndash986

                                                                          97 Yang H-C Parker DS (2009) Traverse simplified indexingon large map-reduce-merge clusters In Database systems foradvanced applications Springer pp 308ndash322

                                                                          98 Pike R Dorward S Griesemer R Quinlan S (2005) Interpretingthe data parallel analysis with sawzall Sci Program 13(4)277ndash298

                                                                          99 Gates AF Natkovich O Chopra S Kamath P NarayanamurthySM Olston C Reed B Srinivasan S Srivastava U (2009) Build-ing a high-level dataflow system on top of map-reduce the pigexperience Proceedings VLDB Endowment 2(2)1414ndash1425

100. Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment 2(2):1626–1629

101. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev 41(3):59–72

102. Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson U, Gunda PK, Currey J (2008) Dryadlinq: a system for general-purpose distributed data-parallel computing using a high-level language. In: OSDI, vol 8, pp 1–14

103. Moretti C, Bulosan J, Thain D, Flynn PJ (2008) All-pairs: an abstraction for data-intensive cloud computing. In: Parallel and distributed processing, 2008. IEEE international symposium on IPDPS 2008. IEEE, pp 1–11

104. Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146

105. Bu Y, Howe B, Balazinska M, Ernst MD (2010) Haloop: efficient iterative data processing on large clusters. Proc VLDB Endowment 3(1-2):285–296

106. Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S-H, Qiu J, Fox G (2010) Twister: a runtime for iterative mapreduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing. ACM, pp 810–818

107. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin M, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association, pp 2–2

108. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R (2011) Incoop: mapreduce for incremental computations. In: Proceedings of the 2nd ACM symposium on cloud computing. ACM, p 7

109. Murray DG, Schwarzkopf M, Smowton C, Smith S, Madhavapeddy A, Hand S (2011) Ciel: a universal execution engine for distributed data-flow computing. In: Proceedings of the 8th USENIX conference on Networked systems design and implementation, p 9

110. Anderson TW (1958) An introduction to multivariate statistical analysis, vol 2. Wiley, New York

111. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

112. What analytics, data mining, big data software you used in the past 12 months for a real project (2012) http://www.kdnuggets.com/polls/2012/analytics-data-mining-big-data-software.html

113. Berthold MR, Cebron N, Dill F, Gabriel TR, Kotter T, Meinl T, Ohl P, Sieb C, Thiel K, Wiswedel B (2008) KNIME: the Konstanz information miner. Springer

114. Sallam RL, Richardson J, Hagerty J, Hostmann B (2011) Magic quadrant for business intelligence platforms. CT Gartner Group, Stamford

115. Beyond the PC. Special Report on Personal Technology (2011)

116. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A et al (2011) The iplant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 34(2):1–16. doi:10.3389/fpls.2011.00034

117. Baah GK, Gray A, Harrold MJ (2006) On-line anomaly detection of deployed software: a statistical machine learning approach. In: Proceedings of the 3rd international workshop on Software quality assurance. ACM, pp 70–77

118. Moeng M, Melhem R (2010) Applying statistical machine learning to multicore voltage & frequency scaling. In: Proceedings of the 7th ACM international conference on computing frontiers. ACM, pp 277–286

119. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Record 34(2):18–26

120. Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y (2004) State-of-the-art in privacy preserving data mining. ACM Sigmod Record 33(1):50–57

121. van der Aalst W (2012) Process mining: overview and opportunities. ACM Transac Manag Inform Syst (TMIS) 3(2):7

122. Manning CD, Schutze H (1999) Foundations of statistical natural language processing, vol 999. MIT Press

123. Pal SK, Talwar V, Mitra P (2002) Web mining in soft computing framework: relevance, state of the art and future directions. IEEE Transac Neural Netw 13(5):1163–1177

124. Chakrabarti S (2000) Data mining for hypertext: a tutorial survey. ACM SIGKDD Explor Newsl 1(2):1–11

125. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117

126. Konopnicki D, Shmueli O (1995) W3qs: a query system for the world-wide web. In: VLDB, vol 95, pp 54–65

127. Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11):1623–1640

128. Ding D, Metze F, Rawat S, Schulam PF, Burger S, Younessian E, Bao L, Christel MG, Hauptmann A (2012) Beyond audio and video retrieval: towards multimedia summarization. In: Proceedings of the 2nd ACM international conference on multimedia retrieval. ACM, pp 2

129. Wang M, Ni B, Hua X-S, Chua T-S (2012) Assistive tagging: a survey of multimedia tagging with human-computer joint exploration. ACM Comput Surv (CSUR) 44(4):25

130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19

131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819

132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939

133. Barragans-Martínez AB, Costa-Montenegro E, Burguillo JC, Rey-Lopez M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend tv programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311

134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91

135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia. ACM, pp 469–478

136. Hirsch JE (2005) An index to quantify an individual's scientific research output. Proc Natl Acad Sci USA 102(46):16569

137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company

138. Aggarwal CC (2011) An introduction to social network data analytics. Springer

139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1046–1054

140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied Computing. ACM, pp 1115–1116

141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Transac Knowl Discov Data (TKDD) 5(2):10

142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on World wide web. ACM, pp 631–640

143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis. ACM, pp 16–25

144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. ACM, pp 315–321

145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 145–158

146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 131–144

147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1007–1016

148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 807–816

149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on Web search and data mining. ACM, pp 657–666

150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360

151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (Accessed 5 May 2013)

152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51

153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive rfid, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 2

154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval. ACM, pp 469–478

155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium. ACM, pp 445–454

156. Tasevski P (2011) Password attacks and generation strategies. Tartu University, Faculty of Mathematics and Computer Sciences


                                                                            100 Thusoo A Sarma JS Jain N Shao Z Chakka P Anthony SLiu H Wyckoff P Murthy R (2009) Hive a warehousing solu-tion over a map-reduce framework Proc VLDB Endowment2(2)1626ndash1629

                                                                            101 Isard M Budiu M Yu Y Birrell A Fetterly D (2007) Dryad dis-tributed data-parallel programs from sequential building blocksACM SIGOPS Oper Syst Rev 41(3)59ndash72

                                                                            102 Yu Y Isard M Fetterly D Budiu M Erlingsson U Gunda PKCurrey J (2008) Dryadlinq a system for general-purpose dis-tributed data-parallel computing using a high-level language InOSDI vol 8 pp 1ndash14

                                                                            103 Moretti C Bulosan J Thain D Flynn PJ (2008) All-pairs anabstraction for data-intensive cloud computing In Parallel anddistributed processing 2008 IEEE international symposium onIPDPS 2008 IEEE pp 1ndash11

                                                                            104 Malewicz G Austern MH Bik AJC Dehnert JC Horn I LeiserN Czajkowski G (2010) Pregel a system for large-scale graphprocessing In Proceedings of the 2010 ACM SIGMOD interna-tional conference on management of data ACM pp 135ndash146

                                                                            105 Bu Y Bill H Balazinska M Ernst MD (2010) Haloop effi-cient iterative data processing on large clusters Proc VLDBEndowment 3(1-2)285ndash296

                                                                            106 Ekanayake J Li H Zhang B Gunarathne T Bae S-H Qiu JFox G (2010) Twister a runtime for iterative mapreduce InProceedings of the 19th ACM international symposium on highperformance distributed computing ACM pp 810ndash818

                                                                            107 Zaharia M Chowdhury M Das T Dave A Ma J McCauleyM Franklin M Shenker S Stoica I (2012) Resilient distributeddatasets a fault-tolerant abstraction for in-memory cluster com-puting In Proceedings of the 9th USENIX conference onnetworked systems design and implementation USENIX Asso-ciation pp 2ndash2

                                                                            108 Bhatotia P Wieder A Rodrigues R Acar UA Pasquin R (2011)Incoop mapreduce for incremental computations In Proceed-ings of the 2nd ACM symposium on cloud computing ACMp 7

                                                                            109 Murray DG Schwarzkopf M Smowton C Smith SMadhavapeddy A Hand S (2011) Ciel a universal executionengine for distributed data-flow computing In Proceedings ofthe 8th USENIX conference on Networked systems design andimplementation p 9

                                                                            110 Anderson TW (1958) An introduction to multivariate statisticalanalysis vol 2 Wiley New York

                                                                            111 Wu X Kumar V Quinlan JR Ghosh J Yang Q Motoda HMcLachlan GJ Ng A Liu B Philip SY et al (2008) Top 10algorithms in data mining Knowl Inf Syst 14(1)1ndash37

                                                                            112 What analytics data mining big data software you used in thepast 12 months for a real project (2012) httpwwwkdnuggetscompolls2012analytics-data-mining-big-data-softwarehtml

                                                                            113 Berthold MR Cebron N Dill F Gabriel TR Kotter T MeinlT Ohl P Sieb C Thiel K Wiswedel B (2008) KNIME theKonstanz information miner Springer

                                                                            114 Sallam RL Richardson J Hagerty J Hostmann B (2011) Magicquadrant for business intelligence platforms CT Gartner GroupStamford

                                                                            115 Beyond the PC Special Report on Personal Technology (2011)116 Goff SA Vaughn M McKay S Lyons E Stapleton AE Gessler

                                                                            D Matasci N Wang L Hanlon M Lenards A et al (2011) Theiplant collaborative cyberinfrastructure for plant biology FrontPlant Sci 34(2)1ndash16 doi103389fpls201100034

                                                                            117 Baah GK Gray A Harrold MJ (2006) On-line anomaly detectionof deployed software a statistical machine learning approachIn Proceedings of the 3rd international workshop on Softwarequality assurance ACM pp 70ndash77

                                                                            118 Moeng M Melhem R (2010) Applying statistical machine learn-ing to multicore voltage amp frequency scaling In Proceedings of

                                                                            the 7th ACM international conference on computing frontiersACM pp 277ndash286

                                                                            119 Gaber MM Zaslavsky A Krishnaswamy S (2005) Mining datastreams a review ACM Sigmod Record 34(2)18ndash26

                                                                            120 Verykios VS Bertino E Fovino IN Provenza LP Saygin YTheodoridis Y (2004) State-of-the-art in privacy preserving datamining ACM Sigmod Record 33(1)50ndash57

                                                                            121 van der Aalst W (2012) Process mining overview and opportu-nities ACM Transac Manag Inform Syst (TMIS) 3(2)7

                                                                            122 Manning CD Schutze H (1999) Foundations of statistical naturallanguage processing vol 999 MIT Press

                                                                            123 Pal SK Talwar V Mitra P (2002) Web mining in soft computingframework relevance state of the art and future directions IEEETransac Neural Netw 13(5)1163ndash1177

                                                                            124 Chakrabarti S (2000) Data mining for hypertext a tutorial surveyACM SIGKDD Explor Newsl 1(2)1ndash11

                                                                            125 Brin S Page L (1998) The anatomy of a large-scale hypertextualweb search engine Comput Netw ISDN Syst 30(1)107ndash117

                                                                            126 Konopnicki D Shmueli O (1995) W3qs a query system for theworld-wide web In VLDB vol 95 pp 54ndash65

                                                                            127 Chakrabarti S Van den Berg M Dom B (1999) Focused crawl-ing a new approach to topic-specific web resource discoveryComput Netw 31(11)1623ndash1640

                                                                            128 Ding D Metze F Rawat S Schulam PF Burger S YounessianE Bao L Christel MG Hauptmann A (2012) Beyond audio andvideo retrieval towards multimedia summarization In Proceed-ings of the 2nd ACM international conference on multimediaretrieval ACM pp 2

                                                                            129 Wang M Ni B Hua X-S Chua T-S (2012) Assistive tag-ging a survey of multimedia tagging with human-computer jointexploration ACM Comput Surv (CSUR) 44(4)25

130. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl (TOMCCAP) 2(1):1–19

131. Hu W, Xie N, Li L, Zeng X, Maybank S (2011) A survey on visual content-based video indexing and retrieval. IEEE Trans Syst Man Cybern Part C Appl Rev 41(6):797–819

132. Park Y-J, Chang K-N (2009) Individual and group behavior-based customer profile model for personalized product recommendation. Expert Syst Appl 36(2):1932–1939

133. Barragáns-Martínez AB, Costa-Montenegro E, Burguillo JC, Rey-López M, Mikic-Fonte FA, Peleteiro A (2010) A hybrid content-based and item-based collaborative filtering approach to recommend TV programs enhanced with singular value decomposition. Inf Sci 180(22):4290–4311

134. Naphade M, Smith JR, Tesic J, Chang S-F, Hsu W, Kennedy L, Hauptmann A, Curtis J (2006) Large-scale concept ontology for multimedia. IEEE Multimedia 13(3):86–91

135. Ma Z, Yang Y, Cai Y, Sebe N, Hauptmann AG (2012) Knowledge adaptation for ad hoc multimedia event detection with few exemplars. In: Proceedings of the 20th ACM international conference on multimedia. ACM, pp 469–478

136. Hirsch JE (2005) An index to quantify an individual’s scientific research output. Proc Natl Acad Sci USA 102(46):16569

137. Watts DJ (2004) Six degrees: the science of a connected age. WW Norton & Company

138. Aggarwal CC (2011) An introduction to social network data analytics. Springer

139. Scellato S, Noulas A, Mascolo C (2011) Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 1046–1054

140. Ninagawa A, Eguchi K (2010) Link prediction using probabilistic group models of network structure. In: Proceedings of the 2010 ACM symposium on applied computing. ACM, pp 1115–1116


141. Dunlavy DM, Kolda TG, Acar E (2011) Temporal link prediction using matrix and tensor factorizations. ACM Trans Knowl Discov Data (TKDD) 5(2):10

142. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on World Wide Web. ACM, pp 631–640

143. Du N, Wu B, Pei X, Wang B, Xu L (2007) Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on web mining and social network analysis. ACM, pp 16–25

144. Garg S, Gupta T, Carlsson N, Mahanti A (2009) Evolution of an online social aggregation network: an empirical study. In: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference. ACM, pp 315–321

145. Allamanis M, Scellato S, Mascolo C (2012) Evolution of a location-based online social network: analysis and models. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 145–158

146. Gong NZ, Xu W, Huang L, Mittal P, Stefanov E, Sekar V, Song D (2012) Evolution of social-attribute networks: measurements, modeling, and implications using Google+. In: Proceedings of the 2012 ACM conference on Internet measurement conference. ACM, pp 131–144

147. Zheleva E, Sharara H, Getoor L (2009) Co-evolution of social and affiliation networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 1007–1016

148. Tang J, Sun J, Wang C, Yang Z (2009) Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 807–816

149. Li Y, Chen W, Wang Y, Zhang Z-L (2013) Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In: Proceedings of the sixth ACM international conference on web search and data mining. ACM, pp 657–666

150. Dai W, Chen Y, Xue G-R, Yang Q, Yu Y (2008) Translated learning: transfer learning across different feature spaces. In: Advances in neural information processing systems, pp 353–360

151. Cisco Visual Networking Index (2013) Global mobile data traffic forecast update, 2012–2017. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-520862.html (Accessed 5 May 2013)

152. Rhee Y, Lee J (2009) On modeling a model of mobile community: designing user interfaces to support group interaction. Interactions 16(6):46–51

153. Han J, Lee J-G, Gonzalez H, Li X (2008) Mining massive RFID, trajectory, and traffic data sets. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, p 2

154. Garg MK, Kim D-J, Turaga DS, Prabhakaran B (2010) Multimodal analysis of body sensor network data streams for real-time healthcare. In: Proceedings of the international conference on multimedia information retrieval. ACM, pp 469–478

155. Park Y, Ghosh J (2012) A probabilistic imputation framework for predictive analysis using variably aggregated, multi-source healthcare data. In: Proceedings of the 2nd ACM SIGHIT international health informatics symposium. ACM, pp 445–454

156. Tasevski P (2011) Password attacks and generation strategies. Tartu University, Faculty of Mathematics and Computer Sciences


