

A Hybrid Approach for Data Analytics for Internet of Things

Badraddin Alturki
University of Leicester
Leicester, United Kingdom
[email protected]

Stephan Reiff-Marganiec
University of Leicester
Leicester, United Kingdom
[email protected]

Charith Perera
Newcastle University
Newcastle upon Tyne, United Kingdom
[email protected]

ABSTRACT
The vision of the Internet of Things is to allow currently unconnected physical objects to be connected to the internet. The number of internet-connected devices will far exceed the number of human beings in the world, and all of them will produce data. These data will be collected and delivered to the cloud for processing, especially with a view to finding meaningful information on which to act. However, ideally the data needs to be analysed locally to increase privacy, give quick responses to people and reduce the use of network and storage resources. To tackle these problems, distributed data analytics can be used to collect and analyse the data in edge or fog devices. In this paper, we explore a hybrid approach in which in-network level and cloud level processing work together to build effective IoT data analytics, overcoming their respective weaknesses and using their specific strengths. Specifically, we collected raw data locally, extracted features by applying data fusion techniques on resource constrained devices to reduce the data, and then sent the extracted features to the cloud for processing. We evaluated the accuracy and the data consumption over the network and thus show that it is feasible to increase privacy and maintain accuracy while reducing data communication demands.

ACM Classification Keywords
D.2.11. Software Engineering: Software Architectures – Data Abstraction; H.3.4. Information Systems: Systems and Software – Distributed Systems

Author Keywords
Internet of Things; Cloud; Data Analytics; Fog Computing; Edge Computing; Distributed Data Analytics

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

© 2017 ACM. ISBN 978-1-4503-2138-9.

DOI: 10.1145/1235

INTRODUCTION
The Internet of Things (IoT) has become one of the most active areas in computer science and beyond, both for researchers and companies, and it is interpreted by communities in a variety of ways. The cluster of European research projects defined the IoT as allowing "people and things to be connected Anytime, Anyplace, with Anything and Anyone, ideally using Any network and Any service" [18]. This is important for applications that monitor, track and communicate (amongst others) with things remotely for various purposes. Sensors embedded in many everyday physical objects around us play a key role in the IoT. These embedded sensors will include vast sensing capabilities [15] and can send their data through the network to decision points, typically the cloud. After collecting the data, there is a need to analyse them to gain insights that help, automate and speed up decision making [16]. In a data driven economy, the data and insights can be considered the main goods [14]. According to [17], more than 50 billion devices will be connected to the internet in the very near future. However, the greater awareness promised by so many smart things will produce an ever greater volume of data at increasing rates of delivery.

Big data is not a new term in computer science [24]; it originated with large technology companies such as Yahoo, Microsoft and Google. Big data as researched has three key characteristics: volume, variety and velocity [24]. As a result of improvements in electronics, the cost of equipment has decreased dramatically and sensors have become more affordable and are already embedded into many electronic devices. A great amount of data has been generated by these sensors and companies have started storing it. In fact, the predominant business model at the moment seems to revolve around storing and owning data for possible later analytics: most fitness trackers or smart watches send their recorded data to the cloud of the maker. Part of this desire to store comes from the value of data, part from the fact that data analytics is a notable challenge to which solutions are still being explored. However, the IoT moves the game to an entirely new level by increasing the scale of deployed devices dramatically, posing new challenges to gathering, processing, transporting, storing and analysing data.

There are five steps (Collection, Collation, Evaluation, Decide, and Act) in the so-called IoT monitoring cycle [19]. Considering overall IoT systems, we always have devices at the edge and in the network, with the cloud in a central position. Data processing can take place at the in-network level (that is, on edge devices and devices in the network) and at the cloud level [3, 19], and these levels play a role in the various stages.

arXiv:1708.06441v1 [cs.NI] 21 Aug 2017

Most of the research and existing work in the field of big data focuses on cloud computing because of the power it offers in terms of processing and storage. The common way to process the data is to send all data to the cloud and return results after analysis. In addition to the significant power available, processing in the cloud also means that as complete a collection of data as can be obtained is available for analysis. However, processing all streaming raw data in the cloud negatively affects several aspects, such as network traffic, latency (to get actions back to the user), energy consumption and privacy. As the IoT grows, the need to tackle these issues grows.

We suggest that in the longer term there is an opportunity to move the computation as much as possible off the cloud to the fog or edge device side. This means that data analytics should be handled in the device or fog before sending the data to the cloud (possibly going as far as avoiding the cloud altogether for processing of operational data). The cloud would still have a role in longer term backup and also in helping to compute models to guide analytics; in fact, there is no doubt that the cloud has an important role in enabling the IoT since it provides high power processing and storage [25]. A key problem to be tackled is to understand how the accuracy of analytic results is affected if computations and decisions on raw data are made elsewhere in the processing chain and infrastructure.

So, in this paper we propose a hybrid approach that moves some processing off the cloud and allows us to study the savings in data transfer and changes to accuracy. In the proposed work, we fuse and filter data close to the source and then send meaningful higher level data rather than raw data to the cloud. As fewer details are transmitted, some privacy protection is already taking place (not every little move is known, only the general picture); however, further work on studying the privacy angle needs to be undertaken.

This data fusion technique will directly influence the collation and evaluation steps, and hence the crucial question arising is: how can we fuse sensor data locally without harming the accuracy of the overall decision? A secondary question concerns the feasibility of distributing the processing, considering that many network and edge devices have less processing power.

The novel contributions of this paper are:

• We propose a hybrid approach, which moves the computation as much as possible to the fog/edge side of the network.

• We extensively evaluate the approach using the WISDM dataset [13] and five of the most popular data analytics techniques.

• We explore the feasibility of applying these data aggregation techniques on a resource constrained device, specifically a Raspberry Pi 3 Model B.

The rest of this paper is organised as follows: section 2 describes the solution space while section 3 explores our solution. We then evaluate, consider related work and draw conclusions.

SOLUTION SPACE: DATA AGGREGATION
CERP-IoT [17] provides the following characteristics of the IoT: autonomy, intelligence, connectivity, sensing, energy, dynamism, interoperability, privacy and security. The characteristics of data in the IoT are heterogeneity, redundancy, dynamism and variety. Given these characteristics, "data fusion and mining present an efficient way to manipulate, integrate, manage and preserve mass data collected from various things" [23]. Processing IoT data means adding value to the raw data by extracting important aspects and creating meaningful information, an essential element of the IoT [22]. [4] identifies five steps to follow when processing IoT data, namely data collection, data pre-processing, transformation of data, mining and evaluation. In this paper we are specifically interested in data fusion, which fits into the area of data pre-processing and transformation and allows us to reduce the volume of data but increase its value. Data fusion is referred to by other 'synonyms' such as information fusion, decision fusion, data combination, multi-sensor data fusion, sensor fusion and data aggregation. While there is no general agreement on these terms, some differences can be observed: in some cases data fusion is applied to raw sensor data while information fusion is used for already analysed data, meaning that the latter has a higher semantic grade than data fusion [6]. Similarly, data fusion techniques are used to integrate data from a variety of sources to produce more meaningful and effective inferences and associations, whereas data aggregation can be considered a subcomponent of data fusion which summarises the sensor data to remove redundancy [1]. The most common definitions by researchers are as follows:

• Data fusion is defined by the Joint Directors of Laboratories (JDL) workshop [20] as "a multi-level process dealing with the association, correlation, combination of data and information from single and multiple sources to achieve refined position, identify estimates and complete and timely assessments of situations, threats and their significance."

• Hall and Llinas [11] say that "data fusion techniques combine data from multiple sensors and related information from associated databases to achieve improved accuracy and more specific inferences than could be achieved by the use of a single sensor alone."

Data fusion can be classified depending on a variety of attributes as shown in Figure 1 [7]. These attributes are discussed in detail in [1] and generally capture the idea that there are different dimensions such as the abstraction level or the relation between the data items from one or multiple sensors.

Dasarathy's data fusion classification system formalises the attributes just discussed and can be considered one of the most common approaches [9]. Dasarathy's classification focuses on the details of input and output based on the abstraction level. The classification contains five classes as follows [6]:

Figure 1. Data Fusion Classification

Data In-Data Out (DAI-DAO) is the primary method of data fusion in the classification model. It processes the raw data that are collected directly from sensors, resulting in more accurate data. In addition, image and signal processing algorithms can be used at this stage.

Data In-Feature Out (DAI-FEO) processes the raw data to produce features which can depict a structure about the environment.

Feature In-Feature Out (FEI-FEO) processes a collection of features to get more effective feature results.

Feature In-Decision Out (FEI-DEO) processes the features to acquire a collection of decisions.

Decision In-Decision Out (DEI-DEO) processes the decisions to extract more efficient decisions.
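The five classes differ only in the abstraction level of their input and output, which can be written down as a small lookup table. The following is a minimal sketch; the names `Level`, `DASARATHY` and `classify` are hypothetical and introduced here purely for illustration.

```python
from enum import Enum


class Level(Enum):
    """Abstraction levels used by Dasarathy's classification."""
    DATA = 1
    FEATURE = 2
    DECISION = 3


# (input level, output level) for each of the five classes listed above.
DASARATHY = {
    "DAI-DAO": (Level.DATA, Level.DATA),
    "DAI-FEO": (Level.DATA, Level.FEATURE),
    "FEI-FEO": (Level.FEATURE, Level.FEATURE),
    "FEI-DEO": (Level.FEATURE, Level.DECISION),
    "DEI-DEO": (Level.DECISION, Level.DECISION),
}


def classify(input_level, output_level):
    """Return the class name for an input/output pair, or None if no class matches."""
    for name, pair in DASARATHY.items():
        if pair == (input_level, output_level):
            return name
    return None
```

On this reading, a step that turns raw sensor readings into features would fall under DAI-FEO, and a step that turns features into a classification result would fall under FEI-DEO.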

Features are defined as the single measurements that are used to create the training model. In other words, they are the columns of data that are created for the training set [12]. In addition, data fusion can provide the knowledge that is essential in a decision-making process; therefore, the amount of available knowledge or data can affect the final decision at any stage. Many techniques use symbolic information and the data fusion process to determine the uncertainties and restrictions that are part of, or affect, the decision-making process [6]. In other words, the decision can be reached depending on the knowledge of events that is collected from a variety of sources by fusing them.

Informally, our working definition of data fusion is that it aggregates and integrates all sensor data to obtain accurate and meaningful data while eliminating unneeded and useless data.

Having understood what data fusion can achieve, one also needs to consider the architectural aspect of where data fusion is applied. Options include centralised, decentralised, distributed and hierarchical architectures, as follows [6]:

• Centralised architecture: all the data collected from sensors is sent to the cloud for processing, which means that everything is held on one single server. The cloud is known to be capable of processing very large amounts of data effectively. However, in real time scenarios data consumption over the network will be high, which makes the cloud insufficient for effective fusion of the data. This architecture is also very problematic if the data consists of images, such as earth observation imagery, because there will be longer delays in data arrival and this will badly affect the output. Additionally, privacy will be one of the main issues because this architecture receives all the raw data without any prior reduction or aggregation. Finally, energy consumption is important in the IoT because transferring raw data all the time from devices over any network, such as 3G or WiFi, consumes significant amounts of energy.

• Decentralised architecture: there are several nodes in the network and each of them has its own computation capabilities, so there is no single server as in the centralised system. Every node applies data aggregation autonomously on its local data and on data received from peers. One of the major limitations of this architecture is the high communication cost between peers, so scalability may suffer as the number of nodes increases.

• Distributed architecture: sensor readings are processed at the source level before data aggregation is applied in a specific node that is capable of data fusion. This can overcome various issues of the centralised architecture and can reduce communication costs compared with the decentralised architecture.

• Hierarchical architecture: the data fusion step is performed at a variety of levels in the hierarchy, and it can be considered a combination of the distributed and decentralised architectures.

It is not possible to say that one of these architectures is the best, as this often depends on specific requirements and technology. Decentralised and distributed architectures are similar to each other in many ways. However, they differ in where the data is pre-processed. In decentralised architectures the whole data aggregation happens in every node, which produces comprehensive output, whereas in distributed architectures the raw data is first pre-processed at the source to extract features, and then these features are fused. The main advantages of the distributed architecture over the centralised one are reduced processing and communication costs, because it pre-processes the data in a distributed manner before fusing it [6].

It is generally accepted that increasing accuracy and reducing energy usage are major aspects of data fusion [7], so any architecture that is presented needs to consider these aspects. While accuracy is self-explanatory, reducing energy is more difficult, as the energy used is a combination of the costs of storage, transport and processing, with transport being very expensive on wireless transmission technologies.

As there are obvious trade-offs between the different architectures, it seems desirable to formulate solutions which combine the different ideas in ways that reduce the disadvantages and benefit from the advantages of each. Our method presented below attempts to achieve this.

PROPOSED SOLUTION: ADAPTIVE DATA AGGREGATION

Overview
We propose a hybrid approach that moves the computation as much as possible from the cloud to the fog/edge level. An overview of this approach is shown in Figure 2. We begin by applying data fusion techniques to sensor data in IoT devices to minimise the number of data points. Then, we extract features from this data and send them to the edge/fog node. This step is important because it is widely accepted that raw time-series data cannot be efficiently analysed by ordinary classification algorithms. After that, the features are sent to the cloud for training purposes and creating inferences.

Architecture
The architecture of our proposed solution is divided into two main parts. First, the cloud level has the responsibility of data training and creating inferences. Second, the in-network level aggregates the sensor data to reduce the data transmission cost over the network. The aim is to save energy, reduce decision times and increase privacy by analysing and processing data locally while maintaining accuracy as much as possible.

The communications between the nodes or peers can be undertaken in any of the usual ways (such as WiFi, 3G or other solutions). Figure 2 shows the architecture of the system. The distributed processing architecture contains three types of node, namely IoT devices (sensor nodes), fog nodes and cloud nodes, as follows:

• A sensor node is at the lowest level of the system and is typically embedded in physical objects. Sensor nodes are small and cheap, which makes deploying sensors to objects easy and inexpensive. A sensor node senses real-world inputs such as motion, temperature and so on. These sensor nodes are connected to fog nodes via wireless or wired communication.

• A fog node resides next to the sensors or along the communication path to the cloud. It collects sensor data (or data received from a 'downstream' fog node) and applies data fusion techniques to extract features. Fog nodes form the main component of our architecture. A fog node has less power and a less global view of the data than a cloud node, and hence it can only apply less sophisticated data aggregation algorithms. It sends the transformed and fused data to the cloud for further processing and storage if required (ideally the fog node can make the ultimate decision). In an investigation into energy limitation, the authors of [5] found that these devices have restricted energy for particular tasks.

• Cloud nodes reside in the cloud and provide the final processing mechanism, obtaining the transformed data from fog nodes. They mainly apply machine learning algorithms and store the data. The processing power and storage capability of the cloud are high, and this power can be used even more effectively with the presented approach. According to [5], there is no energy limitation on devices in the cloud.

Activity Recognition Using Accelerometer Traces
To validate our architecture we used the WISDM [13] data set, which is a set of accelerometer data from mobile phones (specifically Android based) carried by 36 users performing 6 activities (walking, jogging, climbing upstairs, descending downstairs, sitting and standing). The users carried their phones while they performed these activities for a fixed time.

We divided the data into 10-second chunks. In addition, 43 features are created from the 200 readings within each chunk. The transformed data contains 5418 accelerometer traces from the 36 users, with on average 150.50 traces per user and a standard deviation of 44.73.
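A chunk of 200 readings covering 10 seconds corresponds to 200 / 10 = 20 readings per second. The chunking step can be sketched as below. This is an illustration only: the function name `split_into_windows` is hypothetical, and the sketch assumes non-overlapping windows over a single user's recording of a single activity, with any incomplete trailing window discarded.

```python
def split_into_windows(samples, window_size=200):
    """Split a list of (x, y, z) readings into consecutive, non-overlapping windows.

    Assumption: readings that do not fill a complete window at the end are dropped.
    """
    n_full = len(samples) // window_size
    return [
        samples[i * window_size:(i + 1) * window_size]
        for i in range(n_full)
    ]


# 450 readings yield two full windows; the last 50 readings are dropped.
readings = [(float(i), 0.0, 0.0) for i in range(450)]
windows = split_into_windows(readings)
```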

We conducted 3 sets of experiments. Firstly, we apply analytical algorithms to the transformed data in the cloud to calculate the accuracy of each algorithm and the execution time as a baseline. Secondly, we apply analytical algorithms to the transformed data in a fog gateway to calculate the execution time and to check the feasibility of processing the data on resource constrained devices. Finally, we apply data aggregation algorithms to the raw data to extract features. This final approach minimises the data as much as possible in the fog and then sends the transformed data to the cloud for analysis. We measure the accuracy and execution time as well as the amount of data sent to the cloud.

Our hope was that a similar accuracy could be achieved with the third approach without increasing processing time and while significantly reducing network data transmissions.

EVALUATION AND DISCUSSION

Experimental Set up
As mentioned earlier, it is not possible to apply classification algorithms directly to raw data, which is time series data. Therefore, there is a need to transform raw data into features [13]. In our experiment, we used a Raspberry Pi 3 as an example of a low power fog gateway. The Raspberry Pi has 1GB RAM and runs Raspbian Jessie with Pixel as its operating system. In addition, to simulate the cloud device we used a Linux system with 16GB RAM. We used the Weka tool on both sides and adjusted the heap size in both cloud and fog. In the fog the heap size was 650MB; in the cloud we allowed 8GB RAM for our experiment. Moreover, we used the same data aggregation methods that were used to extract features in [13] to allow for comparability. We ran these data aggregation methods on the Raspberry Pi to generate meaningful features from 200 readings, each of which has x, y and z acceleration information.

Figure 2. Distributed Processing in Internet of Things (diagram labels: Sensors, Fog, Cloud; Smart Thermostat, Smart Activity Monitor, Smart Coffee Machine; Feature Extraction Model; Machine Learning Model)

Using the statistical measurements from [13], we created 43 features: the average of each axis, the standard deviation of each axis, the average absolute difference of each axis, the average resultant acceleration over all axes, the time between peaks of each axis and a binned distribution for every axis (10 equal-sized bins per axis, 30 bins in total).
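The feature count adds up as 3 (averages) + 3 (standard deviations) + 3 (average absolute differences) + 1 (average resultant acceleration) + 3 (times between peaks) + 30 (bins) = 43. The sketch below shows one way such a vector could be computed from a single window of (x, y, z) readings. It is not the extraction code used in [13] or in our experiments: all function names are hypothetical, the bin edges are assumed to span each axis's minimum to maximum within the window, a 50 ms sample period is assumed (200 readings in 10 seconds), and `time_between_peaks` is a deliberately simplified stand-in that averages the spacing of all local maxima.

```python
import math


def mean(values):
    return sum(values) / len(values)


def std(values):
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def avg_abs_diff(values):
    """Average absolute difference between each reading and the mean."""
    m = mean(values)
    return mean([abs(v - m) for v in values])


def binned_distribution(values, bins=10):
    """Fraction of readings in each of `bins` equal-width bins over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # A constant signal (width == 0) falls entirely into the first bin.
        idx = 0 if width == 0 else min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def time_between_peaks(values, sample_period_ms=50.0):
    """Simplified assumption: mean spacing of local maxima, in milliseconds."""
    peaks = [i for i in range(1, len(values) - 1)
             if values[i - 1] < values[i] >= values[i + 1]]
    if len(peaks) < 2:
        return 0.0
    gaps = [b - a for a, b in zip(peaks, peaks[1:])]
    return mean(gaps) * sample_period_ms


def extract_features(window):
    """Turn one window of (x, y, z) readings into a 43-element feature vector."""
    axes = list(zip(*window))                                 # all x, all y, all z
    features = []
    features += [mean(a) for a in axes]                       # 3 features
    features += [std(a) for a in axes]                        # 3 features
    features += [avg_abs_diff(a) for a in axes]               # 3 features
    features.append(mean([math.sqrt(x * x + y * y + z * z)
                          for x, y, z in window]))            # 1 feature
    features += [time_between_peaks(a) for a in axes]         # 3 features
    for a in axes:
        features += binned_distribution(a)                    # 10 per axis, 30 total
    return features


# Synthetic window, used only to exercise the code (not WISDM data).
window = [(math.sin(i / 5), math.cos(i / 5), 1.0) for i in range(200)]
features = extract_features(window)
```

Each window of 600 raw numbers is thereby reduced to 43 numbers, which is the source of the data reduction that the hybrid approach relies on.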

After the data was prepared, we applied five classification methods from the Weka data mining and machine learning tools. The methods include decision tree (J48), logistic regression, multilayer perceptron and naive Bayes. Throughout our experiment we used 10-fold cross validation.
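In 10-fold cross validation the examples are partitioned into ten folds, and each fold serves once as the test set while the other nine are used for training. The sketch below illustrates only this partitioning, for the 5418 transformed traces. It is not Weka's implementation: the function name `k_fold_indices` is hypothetical, and Weka's own fold construction may differ (for example by stratifying on the class label).

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(5418))
fold_sizes = sorted(len(test) for _, test in splits)
```

With 5418 traces the folds cannot all be equal, so eight folds hold 542 traces and two hold 541.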

Results
Figure 3(a) shows the accuracy results of the 5 analysis algorithms that we applied to the transformed data. It is clear from the results that the multilayer perceptron has the highest accuracy percentage (100% being the best result from the in-cloud analysis on raw data).

Figure 3(b) shows the data communication time over the network from fog (Raspberry Pi) to cloud. There are two bars visible: one for raw data and the other for transformed data. During this experiment the upload speed of the internet connection was 1Mbps. The fog-only approach has no data communication cost because the processing happened on the device and no communication to the cloud took place. However, in the cloud approach the raw data communication over the network from fog to cloud is extremely high, whereas in the hybrid approach the transformed data communication over the network from fog to cloud is low. This unsurprising result confirms that we can save significantly on data communications by aggregating and pre-processing data early in the chain.
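The relation between payload size and communication time on a link of known speed is simple arithmetic: time = (bytes × 8) / (bits per second). The sketch below applies it to a single window. The payload sizes are not measurements from our experiment: they assume, purely for illustration, that every value is encoded as a 4-byte float and that protocol overhead is ignored. The function name `transfer_time_seconds` is likewise hypothetical.

```python
def transfer_time_seconds(payload_bytes, link_mbps=1.0):
    """Time to send payload_bytes over a link of link_mbps megabits per second."""
    return (payload_bytes * 8) / (link_mbps * 1_000_000)


# Hypothetical encoding: every number is sent as a 4-byte float, no overhead.
BYTES_PER_VALUE = 4
raw_window_bytes = 200 * 3 * BYTES_PER_VALUE    # 200 readings, 3 axes each
feature_vector_bytes = 43 * BYTES_PER_VALUE     # 43 extracted features

raw_seconds = transfer_time_seconds(raw_window_bytes)
feature_seconds = transfer_time_seconds(feature_vector_bytes)
reduction_factor = raw_window_bytes / feature_vector_bytes
```

Under these assumptions the feature vector is roughly 14 times smaller than the raw window, and the communication time shrinks by the same factor because it is proportional to payload size.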

Figure 3(c) illustrates the execution time of the 5 analytics algorithms in both the cloud and the fog device. The results show that two algorithms (logistic regression and multilayer perceptron) have significant differences between the two sides. As expected, the IoT device (the Raspberry Pi) takes more time than the cloud to execute analytics algorithms because of its resource constraints.

Figure 3(d) shows the total processing time for the three architectures. There are three measurements for each architecture: the execution time of the analytics (ML) algorithms, the execution time of the data transformation process and the data communication time between the local device and the cloud. This graph needs more explanation as the results are more interesting, so the details are as follows:

• Fog (Raspberry PI): The data transformation process is con-ducted locally and it is clear that the processing time ishigher than in cloud. The analytics algorithms have beenprocessed locally and they took much more time than cloudbecause of the processing power. However, data communi-cation (the time to send data to the cloud) is very low as onlyaggregated data is being sent for storage. So, overall pro-cessing time is in the middle of the measured approaches.

• Cloud: Data communication is the time that the raw data takes from the IoT device to the cloud, which is clearly high as all raw data is transmitted. The data transformation process was done in the cloud and, due to the available resources, runs quickly. The analytics algorithms were also run in the cloud and took much less time than in the fog because of the processing power. Overall, however, the significant transmission time makes the cloud the slowest approach in the given setting.

• Hybrid: Here the data transformation process is done locally (in the fog) on the Raspberry Pi, so it takes longer than it would in the cloud. Data communication is the time the transformed data takes from the IoT device to the cloud, which as before is low. Finally, the analytics algorithms were run in the cloud on the transformed data, and they took much less time than locally because of the processing power. Overall, combining these strengths leads to a very good execution time.
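The decomposition described in the list above can be sketched as a small model in which the total processing time of an architecture is the sum of its three stage times. This is only an illustration: the class `StageTimes` and all numeric values are hypothetical placeholders chosen to reproduce the ordering described in the text, not the measurements plotted in Figure 3.(d).

```python
from dataclasses import dataclass


@dataclass
class StageTimes:
    """Per-stage times in seconds for one architecture (hypothetical helper)."""
    transform_s: float      # data transformation (fusion) time
    communication_s: float  # time to send data from the local device to the cloud
    analytics_s: float      # execution time of the analytics (ML) algorithms

    def total_s(self) -> float:
        # Total processing time is the sum of the three measured stages.
        return self.transform_s + self.communication_s + self.analytics_s


# PLACEHOLDER values, NOT measurements from the paper. They only mirror the
# qualitative description: slow local compute, slow raw-data upload.
architectures = {
    "fog": StageTimes(transform_s=20.0, communication_s=10.0, analytics_s=120.0),
    "cloud": StageTimes(transform_s=5.0, communication_s=400.0, analytics_s=30.0),
    "hybrid": StageTimes(transform_s=20.0, communication_s=10.0, analytics_s=30.0),
}

# Rank architectures from fastest to slowest total time.
ranking = sorted(architectures, key=lambda name: architectures[name].total_s())
for name in ranking:
    print(f"{name}: {architectures[name].total_s():.1f} s")
```

With these placeholder inputs the hybrid architecture ranks fastest and the cloud slowest, matching the qualitative ordering reported above; real values would have to be taken from the experiment.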

Figure 3. Data Processing Results (a - d)

The fog and hybrid approaches look similar to each other in most respects. However, they differ in where the machine learning algorithms are applied to the transformed data. In the fog approach the whole processing (data fusion and machine learning algorithms) happens in the node itself, which can be considered a decentralised architecture. In the hybrid approach, by contrast, the raw data is first fused in the fog node to extract features, and these features are then sent to the cloud, where the machine learning algorithms are applied. The major benefit of the hybrid approach over the fog one is that it uses the power of the cloud for the machine learning algorithms, which need more processing power. This step therefore reduces the analytics time, which is one contributor to the overall data processing time.
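The fusion step that runs on the fog node can be sketched as a windowed aggregation that collapses many raw sensor readings into one feature row per window. The following is a minimal sketch under assumptions, not the transformation used in the experiments: the function name `fuse_windows`, the window length of 200 samples, and the choice of mean and standard deviation per axis as features are all illustrative, and the input is synthetic.

```python
from statistics import mean, pstdev


def fuse_windows(samples, window_size=200):
    """Collapse consecutive windows of (x, y, z) readings into one feature row each.

    window_size and the chosen features (mean, population std per axis)
    are assumptions made for illustration only.
    """
    features = []
    # Step through the data in non-overlapping windows; drop a trailing partial window.
    for start in range(0, len(samples) - window_size + 1, window_size):
        window = samples[start:start + window_size]
        row = []
        for axis in zip(*window):  # iterate over the x, y and z columns
            row.append(mean(axis))
            row.append(pstdev(axis))
        features.append(tuple(row))
    return features


# Synthetic accelerometer-like readings (not real data).
raw = [(float(i % 4), 1.0, -float(i % 2)) for i in range(1000)]
features = fuse_windows(raw, window_size=200)
print(len(raw), "raw rows ->", len(features), "feature rows")
```

Only the resulting feature rows would be sent to the cloud, which is why the communicated volume shrinks roughly in proportion to the window length.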

These initial results show that the proposed hybrid approach performs well for the chosen dataset and analytical methods. In particular, the results show that data communication is efficient and provides significant gains.

Observation 1: Data consumption over network. Larger data is more expensive to transmit. The raw data was around 1 million rows, which equals approximately 50 MB. However, after aggregating the data into features using data aggregation algorithms, the number of rows became 5418 and the size became 1.2 MB. This means that very significant savings in data transmission and storage can be made by early aggregation. This observation will gain in importance as the number and quality of sensors increase rapidly, and thus the rate and resolution at which data is delivered grow quickly. By fusing the data locally before sending it to the cloud, we not only reduce the data but also determine which data is meaningful and send only that. This will reduce the energy consumption of fog and sensor devices, which typically gain internet connectivity through 3G, 4G or 5G, so the batteries in the devices will last longer.
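The savings in Observation 1 can be made concrete with a short calculation using the figures reported in the text (about 1 million rows and 50 MB raw, 5418 rows and 1.2 MB after aggregation, 1 Mbps upload speed). The transfer times below are idealised estimates, assuming 1 MB = 10^6 bytes and no protocol overhead; they are not the measured values from Figure 3.(b), and the helper `transfer_time_s` is introduced here only for illustration.

```python
RAW_ROWS = 1_000_000   # approximate raw row count reported in the text
RAW_MB = 50.0          # approximate raw size in MB
FUSED_ROWS = 5418      # rows after aggregation
FUSED_MB = 1.2         # size after aggregation in MB
UPLINK_MBPS = 1.0      # upload speed during the experiment


def transfer_time_s(size_mb, uplink_mbps):
    """Idealised upload time: megabytes -> megabits, divided by link rate."""
    return size_mb * 8 / uplink_mbps


row_reduction = RAW_ROWS / FUSED_ROWS
size_reduction = RAW_MB / FUSED_MB
raw_time = transfer_time_s(RAW_MB, UPLINK_MBPS)
fused_time = transfer_time_s(FUSED_MB, UPLINK_MBPS)

print(f"rows reduced about {row_reduction:.0f}x, size reduced about {size_reduction:.1f}x")
print(f"ideal upload time: raw {raw_time:.0f} s, fused {fused_time:.1f} s")
```

Under these assumptions the raw upload takes on the order of minutes while the fused upload takes seconds, which is consistent with the gap between the two bars described for Figure 3.(b).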

Observation 2: Accuracy. Aggregated data leads to lower accuracy in the results compared to working with raw data. All presented approaches are affected in the same way. Overall, the loss of accuracy is not drastic: the lowest bar is 75%, with the highest being around 93%. The analysis method used clearly has an impact, and factors such as the available local processing power and the use of methods optimised for this localised setting can also influence the accuracy. The right balance in terms of privacy, accuracy, resource cost, energy consumption and data transmission will need to be identified, and our future work will explore this area further.

From these experiments and observations we can conclude that one important aspect of future work is a development that combines the ideas of distributed data aggregation with analysis methods that can also be distributed effectively and run in low-power environments. The fact that such methods can operate on smaller data sets will help, but they will still need to contain a core part based on a global understanding of the data.

Aside. For completeness, we note that we initially attempted to conduct the experiments with a Raspberry Pi Model B with 512 MB RAM; however, it was unable to run some of the Weka toolbox analysis algorithms because of the RAM constraint. Hence we used a slightly more powerful version, as reported above.

RELATED WORK
In recent years, the main cloud providers have been promising new IoT services with various functionalities and advantages. One of the main cloud providers is Microsoft with its Azure Stack [21], which offers a hybrid cloud that allows companies to transfer benefits from their servers while keeping the management of servers for new types of cloud (hybrid cloud). In addition, they provide gateway devices in the cloud and data analytics. Similarly, IBM has an online web analytics system with IBM Digital Analytics. This service tracks and analyses the behaviour of visitors. The data analytics uses high-power servers inside IBM. The IBM PureData system promises fast data analytics and warehousing that combines warehouse, data centres and analytics [8]. Although the above systems promise powerful data analytics approaches, they do not support a fog gateway concept which resides between IoT devices and the cloud. As our results show, uploading high volumes of raw data consumes time and energy. Therefore, a fog gateway concept is important for real-time services to save time, energy and resource cost.

Data fusion is an active area in research and business, particularly with a view to optimised data analytics. Several data fusion techniques that focus on reducing energy consumption are surveyed in [10, 2]. They use a variety of methods, including fuzzy set theory and neural networks, and they succeed in removing redundancies while fusing the data. However, they do not focus on the resource constraints of the devices that embed the sensors. Instead, they assume that these devices work efficiently without a need to pay attention to their limitations. More importantly, these mechanisms send all the data to centralised computation systems, which affects the data communication cost, privacy and energy as well. As our experiments show, sending the raw data to the cloud is not efficient in terms of data communication over the network.

CONCLUSIONS AND FUTURE WORK
This paper presents a hybrid approach in which data is fused in the fog before being sent to the cloud to reduce data communication over the network. The results show that this architecture is successful in reducing the data communication cost over the network without significantly reducing the accuracy of later decision making. We presented the proposed approach and its relevant methods. In addition, we used the WISDM dataset [13] to validate our architecture.

On the basis of the promising findings presented in this paper, future work will involve creating different features with different algorithms for better data aggregation, further reducing data communication while attempting to increase (or at least maintain) accuracy. One particular key piece of work has been alluded to in the results section: the development of analysis methods that can be distributed and executed efficiently on low-power devices. A key strategy here will be exploring the balance between analysis with limited (local) data sets and the availability of a global view. In addition, evaluation of energy consumption and consideration of the positive impact on privacy will be aspects of future work. Furthermore, additional datasets to test and evaluate our hybrid approach will be investigated.

ACKNOWLEDGMENTS
Dr. Charith Perera’s work is funded by EPSRC award number DERC EP/M023001/1 (Digital Economy Research Centre). Badraddin Alturki’s research is funded by the Saudi Arabian Cultural Bureau in London and his scholarship is granted by King Abdul Aziz University.

REFERENCES
1. Ahmed Abdelgawad and Magdy Bayoumi. 2012a. Data fusion in WSN. In Resource-aware data fusion algorithms for wireless sensor networks. Springer, 17–35.

2. Ahmed Abdelgawad and Magdy Bayoumi. 2012b. Resource-Aware data fusion algorithms for wireless sensor networks. Vol. 118. Springer Science & Business Media.

3. Mervat Abu-Elkheir, Mohammad Hayajneh, and Najah Abu Ali. 2013. Data management for the internet of things: Design primitives and solution. Sensors 13, 11 (2013), 15582–15612.

4. Henrique C. M. Andrade, Buğra Gedik, and Deepak S. Turaga. 2014. Fundamentals of Stream Processing: Application Design, Systems, and Analytics. Cambridge University Press. DOI:http://dx.doi.org/10.1017/CBO9781139058940

5. Carsten Bormann, Mehmet Ersue, and A Keranen. 2014. Terminology for constrained-node networks. Technical Report.

6. Federico Castanedo. 2013. A review of data fusion techniques. The Scientific World Journal 2013 (2013).

7. Sakshi Chhabra and Dinesh Singh. 2015. Data Fusion and Data Aggregation/Summarization Techniques in WSNs: A Review. International Journal of Computer Applications 121, 19 (2015).

8. Larry Coyne, Joe Dain, Phil Gilmer, Patrizia Guaitani, Ian Hancock, Antoine Maille, Tony Pearson, Brian Sherman, Christopher Vollmar, and others. 2017. IBM private, public, and hybrid cloud storage solutions. IBM Redbooks.

9. Belur V Dasarathy. 1997. Sensor fusion potential exploitation-innovative architectures and illustrative applications. Proc. IEEE 85, 1 (1997), 24–38.


10. Hevin Rajesh Dhasian and Paramasivan Balasubramanian. 2013. Survey of data aggregation techniques using soft computing in wireless sensor networks. IET Information Security 7, 4 (2013), 336–342.

11. David L Hall and James Llinas. 1997. An introduction to multisensor data fusion. Proc. IEEE 85, 1 (1997), 6–23.

12. Peter Harrington. 2012. Machine learning in action. Vol. 5. Manning Greenwich, CT.

13. Jennifer R Kwapisz, Gary M Weiss, and Samuel A Moore. 2011. Activity recognition using cell phone accelerometers. ACM SigKDD Explorations Newsletter 12, 2 (2011), 74–82.

14. IC Ng. 2014. Engineering a Market for Personal Data: The Hub-of-all-Things (HAT), A Briefing Paper. WMG Service Systems Research Group Working Paper Series (2014).

15. C. Perera, C. H. Liu, and S. Jayawardena. 2015. The Emerging Internet of Things Marketplace From an Industrial Perspective: A Survey. IEEE Transactions on Emerging Topics in Computing 3, 4 (Dec 2015), 585–598. DOI:http://dx.doi.org/10.1109/TETC.2015.2390034

16. Jurgo Preden, Jaanus Kaugerand, Erki Suurjaak, Sergei Astapov, Leo Motus, and Raido Pahtma. 2015. Data to decision: pushing situational information needs to the edge of the network. In Cognitive Methods in Situation Awareness and Decision Support (CogSIMA), 2015 IEEE International Inter-Disciplinary Conference on. IEEE, 158–164.

17. Harald Sundmaeker, Patrick Guillemin, Peter Friess, and Sylvie Woelfflé. 2010. Vision and challenges for realising the Internet of Things. Cluster of European Research Projects on the Internet of Things, European Commission (2010).

18. Ovidiu Vermesan, Peter Friess, Patrick Guillemin, Sergio Gusmeroli, Harald Sundmaeker, Alessandro Bassi, Ignacio Soler Jubert, Margaretha Mazura, Mark Harrison, Markus Eisenhauer, and Pat Doody. 2009. Internet of Things Strategic Research Roadmap. Internet of Things Strategic Research Roadmap (2009), 9–52. DOI:http://dx.doi.org/pdf/IoT_Cluster_Strategic_Research_Agenda_2011.pdf

19. Meisong Wang, Charith Perera, Prem Prakash Jayaraman, Miranda Zhang, Peter Strazdins, and Rajiv Ranjan. 2015. City data fusion: Sensor data fusion in the internet of things. arXiv preprint arXiv:1506.09118 (2015).

20. Franklin E White. 1991. Data fusion lexicon. Technical Report. DTIC Document.

21. J. Woolsey. 2016. Powering the Next Generation Cloud with Azure Stack. (2016).

22. Fatos Xhafa and Leonard Barolli. 2014. Semantics, intelligent processing and services for big data. (2014).

23. Zheng Yan, Jun Liu, Athanasios V Vasilakos, and Laurence T Yang. 2015. Trustworthy data fusion and mining in Internet of Things. Future Generation Computer Systems 49, C (2015), 45–46.

24. Arkady Zaslavsky, Charith Perera, and Dimitrios Georgakopoulos. 2013a. Sensing as a service and big data. arXiv preprint arXiv:1301.0159 (2013).

25. Arkady B. Zaslavsky, Charith Perera, and Dimitrios Georgakopoulos. 2013b. Sensing as a Service and Big Data. CoRR abs/1301.0159 (2013). http://arxiv.org/abs/1301.0159