
Xia et al. Journal of Cloud Computing: Advances, Systems and Applications 2014, 3:8
http://www.journalofcloudcomputing.com/content/3/1/8

RESEARCH Open Access

Secure semantic expansion based search over encrypted cloud data supporting similarity ranking

Zhihua Xia1,2, Yanling Zhu1,2, Xingming Sun1,2* and Lihong Chen1,2

Abstract

With the advent of cloud computing, more and more data are outsourced to the public cloud for economic savings and ease of access. However, private information has to be encrypted to guarantee security, which makes efficient search over encrypted cloud data a great challenge. Existing solutions depend entirely on the submitted query keyword and do not consider the semantics of the keyword. Such search schemes are therefore not intelligent and omit some semantically related documents. To address this deficiency, we propose a semantic expansion based similar search solution over encrypted cloud data. Our solution returns not only the exactly matched files, but also the files containing terms semantically related to the query keyword. In the proposed scheme, a corresponding file metadata is constructed for each file. Then both the encrypted metadata set and the file collection are uploaded to the cloud server. With the metadata set, the cloud server builds the inverted index and constructs a semantic relationship library (SRL) for the keyword set. After receiving a query request, the cloud server first finds the keywords that are semantically related to the query keyword according to the SRL. Then both the query keyword and the expanded words are used to retrieve the files. The result files are returned in order of their total relevance score. Detailed security analysis shows that our solution is privacy-preserving and secure under the previous searchable symmetric encryption (SSE) security definition. Experimental evaluation demonstrates the efficiency and effectiveness of the scheme.

Keywords: Secure; Semantic expansion; Rank; Semantic relationship library; Cloud data

* Correspondence: [email protected] Engineering Center of Network Monitoring, Nanjing University of Information Science & Technology, Nanjing 210044, China
2 School of Computer & Software, Nanjing University of Information Science & Technology, Nanjing 210044, China

© 2014 Xia et al.; licensee Springer. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Introduction
Cloud computing enables cloud customers to enjoy on-demand, high-quality applications and services from a centralized pool of configurable computing resources. This computing model relieves the burden of storage management, allows universal data access independent of geographical location, and avoids capital expenditure on hardware, software, and personnel maintenance [1].

As cloud computing matures, large amounts of sensitive data are being centralized on cloud servers, e.g. personal health records, secret enterprise data, and government documents [1,2]. The straightforward way to protect data privacy is to encrypt sensitive data before outsourcing it. Unfortunately, data encryption, if not done appropriately, may reduce the effectiveness of data utilization. Typically, a user retrieves the files of interest via keyword search instead of retrieving all the files. Such keyword-based search is widely used in daily life, e.g. Google plaintext keyword search. However, these techniques no longer work once the keywords are encrypted.

In recent years, searchable encryption (SE) techniques have been developed for secure outsourced data search [3-8]. Further research focuses on search efficiency [9,10], multi-keyword search [11,12], and secure dynamic updating [13]. But these schemes only support exact keyword search. To enhance search flexibility and usability, some research has been done on fuzzy keyword search [14-18]. These solutions tolerate minor typos and format inconsistencies, such as searching for "million" by carelessly typing "milion", or "datamining" by typing "data-mining". These schemes mainly take the structure of terms into consideration and use edit distance to evaluate similarity. They do not consider terms semantically related to the query keyword, so many related files are omitted. In addition, these fuzzy systems send back all relevant files based solely on the presence or absence of the keyword, and result ranking is not considered.

In this paper, from a new perspective, we propose a similar search solution based on semantic query expansion that also supports similarity ranking. Semantic expansion based similar search improves system usability by returning both the exactly matched files and the files containing terms semantically related to the query keyword. In the proposed scheme, a corresponding file metadata is constructed for each file. Then the encrypted metadata set and file collection are uploaded to the cloud server. With the metadata set, the cloud server builds the inverted index and constructs a semantic relationship library (SRL) for the keyword set. The co-occurrence of terms is used to evaluate the semantic relationship between terms in the SRL. Upon receiving a query request, the cloud server automatically finds the terms that are semantically related to the query keyword according to the semantic relationship values in the SRL. Then both the keyword and the semantically expanded words are used to retrieve files. Finally, the matched files are returned in order of their total relevance score. To ensure both security and final result ranking, we modify the order-preserving encryption primitive to protect the relevance score. Detailed security analysis shows that the solution correctly realizes the goal of semantic search while preserving privacy. Extensive experimental evaluation demonstrates the efficiency and effectiveness of the scheme.

Related work
Early searchable encryption (SE) schemes mainly provide solutions for secure exact keyword search [3-8]. In the symmetric key setting, Song et al. proposed the first SE scheme, where each word in the file is encrypted independently with a two-layered encryption construction [3]. To improve search efficiency, some researchers turned to index techniques. Goh et al. and Chang et al. both proposed similar secure per-file indexes, where an index including the trapdoors of all unique words is constructed for each file [4,6]. Curtmola et al. presented a per-keyword index construction, where each entry of the hash table index contains the trapdoor for a keyword and an encrypted set of file identifiers [7]. To further enhance system usability, other researchers proposed ranked search. Wang et al. proposed a solution for ranked single-keyword search based on a relevance score [9,10]. Cao et al. and Yang et al. proposed schemes for multi-keyword ranked search, where "inner product similarity" is used for result ranking [11,12]. Emil et al. proposed a hierarchical index structure to achieve more secure and effective dynamic updating [13]. As a complementary approach, Boneh et al. proposed the first searchable encryption scheme in the public-key setting [5].

However, all the above schemes support only exact keyword search. Namely, the user's search input must exactly match the keywords contained in the files. As an attempt to enhance search flexibility, fuzzy keyword search over encrypted cloud data has been proposed [14-16,19]. Li et al. and Wang et al. both exploited edit distance as the similarity metric of keywords to construct fuzzy keyword sets as indexes, and used a wildcard-based technique to store the fuzzy keyword sets efficiently [14,15]. Liu proposed a "dictionary-based fuzzy set construction" to further reduce the size of the fuzzy keyword set [17]. Relying on an asymmetric security model, Bringer et al. proposed a fuzzy search scheme based on embedding edit distance into Hamming distance [19]. This scheme does not need an a priori definition of the fuzzy keyword set. Chuah proposed a fuzzy multi-keyword search scheme, where edit distance is also used to evaluate the similarity between terms [16]. In addition, a BedTree index is constructed with the n-gram technique to improve search efficiency. Without constructing a fuzzy keyword set, Jin introduced new measures, e.g. the n-gram Bloom filter and the frequency vector, to approximately measure similarity over encrypted strings [18]. Note that the above fuzzy search systems consider the similarity metric mainly from the structure of keywords, not from their semantic relationship. Thus, practically usable semantic search remains to be addressed in the context of encrypted data search.

In this paper, we propose a ranked semantic expansion based similar search scheme in the symmetric key setting, which takes both semantic search and result ranking into consideration.

Problem formulation
System model
We consider a system model involving three different entities: data owner, data user and cloud server, as illustrated in Figure 1.

Data owner uploads a collection of n text files $F = \{F_1, F_2, F_3, \cdots, F_n\}$ in encrypted form C, together with the encrypted metadata set, to the cloud server. Note that a corresponding file metadata is constructed for each file. Each file in the collection is encrypted with a common symmetric encryption algorithm, e.g. AES.

Data user provides a search trapdoor $T_w$ for keyword w to the cloud server. In this paper, we assume that the authorization between the data owner and the users is appropriately done.

Cloud server first constructs the index and the SRL using the metadata set provided by the data owner, which reduces the computing burden on the owner, e.g. index creation. Upon receiving the request $T_w$, the cloud server automatically expands the query keyword based on the SRL. Then the server searches the index and returns the matching files to the user in ranked order. Finally, the access control mechanism, which is out of the scope of this paper, is employed to manage the capability of the user to decrypt the received files.

Figure 1 Framework of the semantic expansion based similar search over encrypted cloud data. (Diagram labels: Data Owner, Data User, Cloud server, Encrypted files, Encrypted Metadata, Index, SRL, Search Trapdoor, Ranked Result, Search Control, Access Control.)

Threat model
In this paper, we use the same threat model described in previous searchable symmetric encryption (SSE) schemes [6,7,9,11,15,16]. We consider an "honest-but-curious" server in our model. Specifically, the cloud server honestly follows the designated protocol specification, but is "curious" to infer and analyze all data available on the server so as to learn additional information. In other words, the cloud server has no intention to actively modify the stored data or disrupt any other kind of service. We therefore consider the threat model with the following attack capability.

Known background model: In this model, besides the encrypted dataset and metadata set that the owner uploads, the server is assumed to have additional knowledge about the dataset, e.g. its subject and related statistical information. For instance, the server can use keyword frequency statistics to infer keywords.

Design goals
To enable effective and secure ranked semantic expansion search over outsourced cloud data under the aforementioned model, our mechanism should achieve the following design goals.

1) Ranked semantic expansion search: To design a similar search scheme that supports semantic search over encrypted cloud data by expanding the query keyword according to the semantic relationship of terms, and that returns the retrieved files in ranked order.

2) Security guarantee: To prevent the cloud server from learning the plaintext of the data files and keywords. Compared to the existing SSE schemes, the scheme should achieve security that is as strong as possible.

3) Efficiency: To achieve the above goals with minimum communication and computation overhead.

Notation

F: the plaintext file collection, denoted as a set of n data files $F = \{F_1, F_2, \cdots, F_n\}$.
C: the encrypted file collection, stored in the cloud server, denoted as $C = \{c_1, c_2, \cdots, c_n\}$.
id($F_i$): the identifier of file $F_i$ that can help uniquely locate the actual file.
W: the dictionary, i.e., the keyword set extracted from F, denoted as a set of m keywords $W = \{w_1, w_2, \cdots, w_m\}$.
M: the encrypted metadata set, denoted as a set of n file metadata $M = \{M(F_i)\}$, $i = 1, 2, \cdots, n$.
I: the inverted index, including a set of m posting lists $I = \{I(w_i)\}$, $i = 1, 2, \cdots, m$.
$T_w$: the trapdoor generated for a query keyword w by a user.
$S_w$: the semantically expanded keyword set of w; it is a subset of W, denoted as $S_w = \{w'_1, w'_2, \cdots\}$.

Figure 2 An example of semantic relationship library. (Weighted graph whose nodes include Protocol, intranet, Internet, WWW, network, Web, LAN, authentication and host, with edge weights between 0.2 and 1.0.)

Preliminaries
Semantic query expansion
In the domain of plaintext retrieval, automatic query expansion has long been used to improve the recall and precision of retrieval [20]. It uses semantically related words to expand a particular query, so that the query request better matches the user's intent.

The key step of semantic query expansion is to find the semantic relationship between keywords. Some researchers utilized readily available corpus-independent knowledge models [21], e.g. WordNet and EuroWordNet, while others dynamically constructed the semantic relationship from the document collection with technologies such as term clustering [22,23] and the mutual information model [24-26]. Among these technologies, the mutual information model is widely used [24,26-29].

We follow the formula used in [26], which adopted the mutual information model to implement semantic search on the web. The mutual information I(x, y) is defined as

$$I(x, y) \equiv \log_2 \frac{P(x, y)}{p(x)\,p(y)} \qquad (1)$$

Here P(x, y) is the probability of observing x and y together, and p(x) and p(y) are the probabilities of observing x and y independently in the collection. The stronger the semantic relationship between x and y, the larger their co-occurrence degree, and consequently the larger the mutual information I(x, y).

The mutual information is then normalized into a relationship value in the interval [0, 1]. The semantic relationship library is constructed as a weighted graph structure, as shown in Figure 2.
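As a rough illustration of equation 1, the following Python sketch estimates the probabilities from per-file keyword sets and produces SRL edge weights. The paper does not state how the mutual information is normalized, so dividing by the largest positive value is an assumption made here, as are the function name `build_srl` and the choice to drop pairs with non-positive mutual information.

```python
import math
from collections import Counter
from itertools import combinations

def build_srl(docs):
    """Return {(x, y): weight in (0, 1]} from a list of per-file keyword sets.

    Probabilities are estimated as document frequencies (an assumption).
    """
    n = len(docs)
    single = Counter()   # number of files containing each keyword
    pair = Counter()     # number of files containing both keywords
    for keywords in docs:
        keywords = set(keywords)
        single.update(keywords)
        pair.update(combinations(sorted(keywords), 2))

    mi = {}
    for (x, y), count in pair.items():
        p_xy = count / n
        p_x = single[x] / n
        p_y = single[y] / n
        value = math.log2(p_xy / (p_x * p_y))   # equation 1
        if value > 0:                           # keep positively related pairs only
            mi[(x, y)] = value

    if not mi:
        return {}
    top = max(mi.values())
    # Assumed normalization: scale by the largest mutual information value.
    return {key: value / top for key, value in mi.items()}

docs = [{"protocol", "internet"}, {"protocol", "internet"}, {"host"}, {"lan"}]
srl = build_srl(docs)
```

Here "protocol" and "internet" each occur in half of the files and always together, so their mutual information is log2(0.5 / 0.25) = 1, which becomes the only edge in the resulting graph.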

Table 1 An example of inverted index

Keyword $w_i$
File ID            id($F_{i1}$)   id($F_{i2}$)   id($F_{i3}$)   …   id($F_{in_i}$)
Relevance score    $S_{i1}$       $S_{i2}$       $S_{i3}$       …   $S_{in_i}$

Inverted index
The inverted index is a widely used indexing structure in information retrieval. It consists of a list of mappings from keywords to the sets of files that contain each keyword [30]. For the purpose of ranking, a numerical relevance score is computed for each file based on the TF × IDF rule introduced later in subsection "Basic definitions". An example index structure for keyword $w_i$ is shown in Table 1. Here $S_{ij}$ ($j = 1, \cdots, n_i$) denotes the relevance score of file $F_{ij}$ in response to $w_i$, and $n_i$ is the number of files that contain keyword $w_i$.

Order-preserving encryption (OPE)
OPE is a deterministic encryption scheme whose encryption function preserves the numerical ordering of the plaintext space [31,32]. More specifically, a function $f : D = \{1, \cdots, M\} \rightarrow R = \{1, \cdots, N\}$ is order-preserving if, for all $a, b \in D$, $f(a) > f(b)$ whenever $a > b$. Any order-preserving function can be defined as a combination of M out of N ordered items, so the number of such functions is $\binom{N}{M}$. The adversary has to perform exhaustive enumeration, namely search over all possible combinations, to break the encryption. So the number of combinations, which is maximized when M = N/2, should be large enough to ensure security. If the security level is chosen to be $2^{80}$, since $(N/M)^M \leq \binom{N}{M}$, it is suggested to choose M = N/2 > 80.

A plaintext m in domain D is always mapped to a random-sized non-overlapping bucket in range R. Then a ciphertext c is chosen within the bucket depending on the value of some random function.
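The counting argument above can be checked numerically. With M = N/2 the lower bound $(N/M)^M$ equals $2^M$, so M > 80 gives more than $2^{80}$ candidate functions. The sketch below, with hypothetical helper names, compares that bound with the exact count in bits.

```python
import math

def lower_bound_bits(M, N):
    """log2 of the lower bound (N/M)^M on the number of order-preserving functions."""
    return M * math.log2(N / M)

def exact_count_bits(M, N):
    """log2 of C(N, M), the exact number of order-preserving functions from {1..M} to {1..N}."""
    return math.log2(math.comb(N, M))

M, N = 80, 160   # M = N/2, at the boundary of the suggested choice
bound = lower_bound_bits(M, N)
exact = exact_count_bits(M, N)
```

For these values the bound is exactly 80 bits, while the exact count is considerably larger, which is why the bound is a conservative guide for choosing M and N.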

Basic definitions
Ranking function
A ranking function is used to measure the relevance scores of matching files for a given query in information retrieval. The most widely used measure of relevance is the TF × IDF rule. TF (term frequency) measures the importance of a term within a particular file, and is defined as the number of times the term or keyword appears within the file. IDF (inverse document frequency) measures the overall importance of the term within the whole collection, and is defined as the total number of documents in the collection divided by the number of documents that include the term. Note that we focus on single keyword search in our scheme. Thus, without loss of generality, the relevance score of a single keyword can be computed using equation 2, which is widely used in the literature [33]:

$$\mathrm{Score}(w, F_i) = \frac{1}{|F_i|} \cdot \left(1 + \ln f_{i,w}\right) \cdot \ln\left(1 + \frac{n}{f_w}\right) \qquad (2)$$

Here w denotes the query keyword; $f_{i,w}$ is the TF of term w in file $F_i$; $f_w$ denotes the number of files that contain keyword w; n is the number of files in the collection; and $|F_i|$ is the length of file $F_i$, obtained by counting the number of indexed terms in the file.

In our scheme, we first expand the query keyword based on the SRL, and then both the keyword and its semantically related words are used to retrieve the files. The total relevance score of a file $F_d$ used for result ranking is therefore computed with equation 3:

$$\mathrm{TScore}(w, F_d) = \mathrm{Score}_w + \sum_{\forall w'_i \in S_w} \mathrm{Score}_{w'_i} \times R_i \qquad (3)$$

Here $\mathrm{Score}_w$ represents the relevance score of the input keyword, $\mathrm{Score}_{w'_i}$ represents the relevance score of the expanded keyword $w'_i$, and $R_i$ is the value of the semantic relatedness between w and $w'_i$.
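To make equations 2 and 3 concrete, here is a small plaintext sketch in Python. The function and parameter names are illustrative and not taken from the paper; in the actual scheme the scores stored on the server are encrypted.

```python
import math

def score(tf, file_len, n_files, files_with_w):
    """Equation 2: relevance of keyword w to one file.

    tf           -- f_{i,w}, occurrences of w in the file (must be >= 1)
    file_len     -- |F_i|, number of indexed terms in the file
    n_files      -- n, number of files in the collection
    files_with_w -- f_w, number of files containing w
    """
    return (1.0 / file_len) * (1.0 + math.log(tf)) * math.log(1.0 + n_files / files_with_w)

def total_score(score_w, expanded):
    """Equation 3: expanded is an iterable of (Score_{w'_i}, R_i) pairs for one file."""
    return score_w + sum(s * r for s, r in expanded)

# A file with relevance 1.0 for the query keyword and two expanded keywords.
example = total_score(1.0, [(2.0, 0.5), (4.0, 0.25)])   # 1.0 + 1.0 + 1.0
```

A file that does not contain the query keyword itself simply contributes a score of 0 for $\mathrm{Score}_w$ and is ranked only by its expanded terms.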

File metadata
A piece of file metadata is constructed for each file. The file metadata consists of the file ID, the keywords, and the relevance scores (see equation 2) of the keywords with respect to the file. If file $F_i$ contains keyword $w_j$, a tuple $\langle w_j, s_{ji} \rangle$ is inserted into the metadata $M(F_i)$, where $s_{ji}$ represents the relevance score of keyword $w_j$ with respect to file $F_i$. All of the file metadata together constitute the metadata set, which is shown in Figure 3.

Figure 3 An example of metadata set.

Secure Semantic Expansion based Similar Search Scheme
The scheme consists of six algorithms (KeyGen, BuildMD, BuildIndex, BuildSRL, TrapdoorGen, and SearchIndex), which are organized in two phases: Setup and Retrieval.

The setup phase
In this phase, the data owner initializes the public and secret parameters of the system by executing KeyGen, and pre-processes the file collection F using BuildMD to generate the encrypted metadata for each file. The owner then uploads both the encrypted file collection C and the metadata set M to the cloud server. With M received from the data owner, the server constructs the index using BuildIndex and the semantic relationship library using BuildSRL. In addition, the necessary secret parameters, e.g. the trapdoor generation key, should be distributed to a group of authorized users by employing off-the-shelf public key cryptography or broadcast encryption. Details are as follows:

1) The data owner initiates the scheme by calling KeyGen($1^k, 1^l, 1^p$). It takes the security parameters k, l, p as inputs and generates random keys $x \xleftarrow{R} \{0,1\}^k$, $y \xleftarrow{R} \{0,1\}^l$. Finally it outputs the secret key set $K = \{x, y, 1^l, 1^p\}$ used for later encryption, such as trapdoor generation and relevance score encryption.

2) Then the data owner builds the secure metadata for each file in the file collection F by calling BuildMD(K, F). It takes the secret K and the dataset F as inputs and outputs the encrypted metadata set M. The function extracts the keywords in each file and computes the corresponding relevance scores. The keyword in the metadata is encrypted with a collision-resistant hash function $\pi : \{0,1\}^k \times \{0,1\}^* \rightarrow \{0,1\}^p$ ($p > \log m$), where m denotes the size of the keyword set. The relevance score is encrypted with an order-preserving encryption algorithm $OPE : \{0,1\}^l \times \{0,1\}^d \rightarrow \{0,1\}^r$, where d and r respectively represent the bit lengths used to denote all the values in domain D and range R. The details are shown in Algorithm 1. Figure 4 is an example of the encrypted metadata set.

3) When receiving the secure metadata, the server builds the inverted index by calling BuildIndex(M). The function extracts the encrypted keywords and constructs a posting list for each keyword. If keyword $ew_j$ is included in file metadata $M(F_i)$, the element $\{id(F_i) \| es_{ji}\}$ is inserted into the posting list of keyword $ew_j$. The details are given in Algorithm 2. The SRL is also built upon the metadata set, using a common association rules algorithm to mine the co-occurrence relationship of keywords.

An example of the secure inverted index constructed by the cloud server is shown in Figure 5.
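The BuildMD and BuildIndex steps can be sketched as follows. Algorithms 1 and 2 are not reproduced in this text, so this is only an outline under stated assumptions: HMAC-SHA256 stands in for the keyed hash $\pi$, the score encryption is passed in as a callback, and the identity callback used in the example is a placeholder that provides no protection.

```python
import hashlib
import hmac
from collections import defaultdict

def pi(key: bytes, keyword: str) -> str:
    """Keyed hash of a keyword (HMAC-SHA256 is an assumed instantiation of pi)."""
    return hmac.new(key, keyword.encode("utf-8"), hashlib.sha256).hexdigest()

def build_md(key, files, encrypt_score):
    """Owner side. files: {file_id: {keyword: integer relevance score}}.

    Returns {file_id: [(hashed keyword, encrypted score), ...]}.
    """
    return {
        fid: [(pi(key, w), encrypt_score(s, fid)) for w, s in keywords.items()]
        for fid, keywords in files.items()
    }

def build_index(metadata):
    """Server side. Groups metadata entries into one posting list per hashed keyword."""
    index = defaultdict(list)
    for fid, entries in metadata.items():
        for ew, es in entries:
            index[ew].append((fid, es))
    return dict(index)

key = b"demo-key"   # hypothetical key x
files = {"F1": {"protocol": 5, "host": 2}, "F2": {"protocol": 3}}
# Placeholder: identity instead of order-preserving encryption (NOT secure).
metadata = build_md(key, files, lambda s, fid: s)
index = build_index(metadata)
```

Because the server only ever sees hashed keywords and encrypted scores, it can group entries into posting lists without learning the plaintext keywords.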

The retrieval phase
In this phase, the user generates a secure trapdoor for the keyword of interest using TrapdoorGen and submits it to the cloud server. Upon receiving the query trapdoor, the cloud server first automatically expands the query keyword. Then the server searches the index via SearchIndex, and eventually sends back the matched files in a ranked sequence according to the total relevance scores. During the process, little or no information beyond the order of the relevance scores should be leaked. Details are as follows.

1) The user generates a trapdoor $T_w = \pi_x(w)$ for a keyword w of interest by calling TrapdoorGen(w).

2) Upon receiving the trapdoor $T_w$, the server first expands the query keyword to obtain the expanded query trapdoor $T_{w'} = \{\pi_x(w), \pi_x(w'_i)\}$, $\forall w'_i \in S_w$. By calling SearchIndex, the server locates the matching entries of the index via $\pi_x(w)$ and $\pi_x(w'_i)$, which include the file identifiers and the associated order-preserving encrypted relevance scores.

3) The server then computes the total relevance score of each file for the query according to equation 3. In the end, the server sends back the matched files in a ranked sequence, or sends the top-k most relevant files if the user provides the optional value k.
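The three retrieval steps can be sketched on the server side as below. The data layout is an assumption: the index maps a trapdoor to a posting list of (file ID, score) pairs, and the SRL maps a trapdoor to (related trapdoor, relatedness) pairs. The scores are treated as plain numbers for readability; in the scheme they would be order-preserving ciphertexts.

```python
from collections import defaultdict

def search_index(index, srl, trapdoor, k=None):
    """Expand the query via the SRL, accumulate equation 3 per file, and rank.

    index -- {trapdoor: [(file_id, score), ...]}
    srl   -- {trapdoor: [(related_trapdoor, relatedness), ...]}
    k     -- optional number of top results to return
    """
    totals = defaultdict(float)
    for fid, s in index.get(trapdoor, []):
        totals[fid] += s                      # Score_w
    for related, r in srl.get(trapdoor, []):
        for fid, s in index.get(related, []):
            totals[fid] += s * r              # Score_{w'_i} x R_i
    # Highest total first; ties broken by file ID for a stable order.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return ranked if k is None else ranked[:k]

index = {
    "t_protocol": [("F1", 10), ("F2", 4)],
    "t_internet": [("F2", 20), ("F3", 8)],
}
srl = {"t_protocol": [("t_internet", 0.5)]}
results = search_index(index, srl, "t_protocol")
top2 = search_index(index, srl, "t_protocol", k=2)
```

In this example F3 does not contain the query keyword at all, yet it is still returned because it contains a semantically related keyword.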

Towards one-to-many order-preserving encryption
To implement efficient result ranking, we use OPE to encrypt the relevance score. The server can then rank the retrieved files directly according to the encrypted relevance scores. However, the original OPE is a deterministic encryption scheme. If not handled properly, it will leak as much information as any deterministic encryption scheme does [32]. In particular, the statistical information of the scores, such as the distribution slope and value range, can be used to identify the specific keyword in the query [9].

Therefore we need to modify OPE to suit our requirement. The original OPE first maps the plaintext m in domain D to an interval bucket in range R. Then the ciphertext c is chosen in the bucket using m as the random seed for the random selection function. The modified OPE should map the same plaintext score to different ciphertexts and still globally preserve the order of the relevance scores. Thus a one-to-many OPE scheme is desired to reduce the amount of information leakage. More specifically, in the final ciphertext selection process, the unique file ID is introduced as an additional random seed together with the plaintext m. The same plaintext is then no longer deterministically mapped to the same ciphertext, but to a random value within the randomly assigned bucket in range R. Algorithm 3 shows the whole process, where GetCoins(⋅) is a random coin generator and HYGEINV(⋅) is the instance of the HGD(⋅) sampling function in MATLAB. In the process, a plaintext m in domain $D = \{1, \cdots, M\}$ is mapped to a ciphertext c selected in range $R = \{1, \cdots, N\}$, and id(F) denotes the corresponding file ID. In this paper, the one-to-many OPE is denoted as OM-OPE.

The mapping scheme should be as random as possible to eliminate the predictability of the keyword-specific score distribution. Obviously, the larger the range R is, the fewer specific characteristics will be preserved. However, considering the efficiency of the HGD function, the size of range R cannot be unboundedly large. So the range size |R| should be chosen as a proper tradeoff between randomness and efficiency.

To guarantee the security of keywords in the metadata set, the relevance score should be encrypted with $OM\text{-}OPE_y(\cdot)$ instead of $OPE_y(\cdot)$ in Algorithm 1.
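The one-to-many idea can be illustrated with a deliberately simplified toy. It is not Algorithm 3: instead of the keyed, random-sized buckets obtained through HGD sampling, it uses public equal-width buckets, and it uses HMAC-SHA256 (an assumption) as the random function seeded by the plaintext and the file ID. It only shows how the file ID makes equal scores map to different ciphertexts while the order between different scores is kept.

```python
import hashlib
import hmac

def om_ope_encrypt(key: bytes, m: int, file_id: str, M: int, N: int) -> int:
    """Toy one-to-many order-preserving mapping from {1..M} into {1..N}.

    Simplification: buckets are equal-width and not key-dependent, unlike the
    HGD-based bucket assignment described in the text.
    """
    if not 1 <= m <= M:
        raise ValueError("plaintext outside domain")
    if N < M:
        raise ValueError("range must be at least as large as the domain")
    width = N // M                      # size of every bucket
    low = (m - 1) * width + 1           # first ciphertext of the bucket for m
    # The seed depends on both the plaintext and the file ID.
    seed = hmac.new(key, f"{m}|{file_id}".encode("utf-8"), hashlib.sha256).digest()
    offset = int.from_bytes(seed, "big") % width
    return low + offset

key = b"demo-key-y"        # hypothetical key y
M, N = 100, 100 * 2**20    # small domain, much larger range
c_low = om_ope_encrypt(key, 5, "F1", M, N)
c_high = om_ope_encrypt(key, 6, "F2", M, N)
same_score = {om_ope_encrypt(key, 5, f"F{i}", M, N) for i in range(50)}
```

Ciphertexts of different scores fall into disjoint, ordered buckets, so ranking still works, while the 50 files sharing the score 5 spread over many values within one bucket.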

Security analysis
We evaluate the security of the proposed scheme by proving the security guarantee stated above (see Design goals). That is, neither the data files nor the keywords are leaked to the server.

Security analysis for the ranked semantic expansion search
We analyze the solution with respect to the aforementioned search privacy requirements, i.e. keyword privacy and file confidentiality.

• File confidentiality: File confidentiality depends on the inherent security strength of the symmetric encryption scheme, so the file content is well protected.

• Keyword privacy:
1. The query trapdoor is generated using the symmetric encryption scheme, so the privacy of the query keyword depends on the inherent security strength of the symmetric encryption scheme.

2. The proposed scheme introduces some additional information in the index compared to the original SSE, such as the encrypted relevance scores and the relationship values between terms. Thus the privacy of keywords in the index does not depend only on the symmetric encryption scheme. We discuss the security from two aspects.

On one hand, as defined in the threat model, the server may predict the plaintext of a keyword from the score distribution. OM-OPE is therefore used to encrypt the scores, which flattens the distribution of the relevance scores. So the keyword privacy mainly depends on the security of OM-OPE. In the next part, we analyze the security of OM-OPE in detail. As discussed, if the data owner properly enlarges the range R, the relevance scores will be randomly mapped to a sequence of order-preserved numeric values with very few duplicates. So OM-OPE makes it difficult for the adversary to predict the plaintext score distribution, let alone the keywords.

On the other hand, as shown in Table 2, the semantic relationship values between terms have no distinguishing characteristics and therefore cannot be effectively used for statistical analysis. Note that in the previous literature with an inverted index [9], the server can also obtain the co-occurrence degree of terms by recording and analyzing the search results. Thus the leakage of relationship information is not a main security problem to be solved in the current work.

Security analysis for one-to-many OPE
The one-to-many OPE scheme introduces the file ID as an additional seed in the ciphertext selection process. So the same plaintext is not deterministically mapped to the same ciphertext, but to a random value in the assigned bucket in range R. This helps flatten the score distribution of a keyword and protects keyword privacy from statistical attack.

Figure 4 An example of encrypted metadata set.

However, if there are many duplicates of a plaintext m, the ciphertext distribution may not be flattened effectively when the assigned bucket in range R is small. So we should expand the range R properly to ensure few duplicates in the ciphertext range. It will then be difficult for the adversary to determine which points in R belong to the same plaintext score.

In this paper, we use the min-entropy to choose the size of R. It is defined as $H(\sigma) = -\log\left(\max_{\alpha} \Pr[\sigma = \alpha]\right)$, where $\sigma$ is a discrete random variable and $\alpha$ denotes a state of $\sigma$ with the maximum probability. In general, the higher $H(\sigma)$ is, the more difficult it is to predict $\sigma$. If $H(\sigma) \in \omega(\log k)$, the min-entropy of variable $\sigma$ is considered high, where k is the bit length needed to denote all the states of $\sigma$ [8]. We could choose $H(\sigma)$ as $(\log k)^c$ where c > 1 [9]. Then the least size |R| should satisfy equation 4:

$$\left(\log(\log |R|)\right)^c \leq -\log\left(\frac{max}{|R| \cdot \left(\frac{1}{2}\right)^{5 \log M + 12} \cdot \delta}\right) \qquad (4)$$

Figure 5 An example of secure inverted index.

Here max denotes the maximum number of score duplicates within the metadata set, and $\delta$ denotes the total number of scores to be mapped within the metadata set. With $D = \{1, \cdots, M\}$ and M = |D|, the total number of recursive calls of the BinarySearch(⋅) function (line 9 in Algorithm 3) is at most $5 \log M + 12$ on average. If the range size |R| is denoted in bits, namely $k = \log |R|$, we get equation 5. With the established file metadata set, it is easy to determine the proper range size |R|.

$$\frac{max \cdot 2^{5 \log M + 12}}{2^k \cdot \delta} = \frac{max \cdot M^5}{2^{k-12} \cdot \delta} \leq 2^{-(\log k)^c} \qquad (5)$$
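Equation 5 can be solved for the smallest admissible k by a simple search. The sketch below works in the log domain to avoid huge numbers. It assumes that all logarithms are base 2 and uses c = 1.1 as an arbitrary example of a constant greater than 1; neither choice is fixed by the text, and the function names are illustrative.

```python
import math

def satisfies_eq5(k, max_dup, M, delta, c):
    """Check max * M^5 / (2^(k-12) * delta) <= 2^(-(log k)^c), taking log2 of both sides."""
    lhs = math.log2(max_dup) + 5 * math.log2(M) - (k - 12) - math.log2(delta)
    rhs = -(math.log2(k) ** c)
    return lhs <= rhs

def min_range_bits(max_dup, M, delta, c=1.1, k_max=4096):
    """Smallest k = log|R| (in bits) that satisfies equation 5, searched linearly."""
    for k in range(1, k_max + 1):
        if satisfies_eq5(k, max_dup, M, delta, c):
            return k
    raise ValueError("no range size up to k_max satisfies equation 5")

# Hypothetical metadata statistics: 64 duplicates at most, 128 score levels,
# and 50000 scores in total.
k = min_range_bits(max_dup=64, M=128, delta=50000)
```

Each additional bit of range halves the left-hand side, while the right-hand side shrinks only polylogarithmically, so the first k that passes the check is the minimum and every larger k also passes.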

Table 2 An example of semantic relationship between terms

Keyword     Keyword     Similarity
host        network     0.31
lan         ethernet    0.31
protocol    internet    0.31

Figure 6 The time cost for building index. (Axes: size of metadata set (×10^3) versus time of building index (×10 ms).)



As discussed above, if we properly choose the range R, the randomness in the ciphertext selection process will effectively reduce the useful information revealed to the cloud server.

Performance analysis
To evaluate the performance of our proposed scheme, we implemented the secure search system in C++ on a Windows machine with an Intel Core 2 Duo CPU running at 2.93 GHz, 2.94 GHz. The experimental evaluation was conducted on a real data set, the Request for Comments (RFC) database [34], which contains a large number of technical keywords. The overall performance evaluation of our scheme includes the cost of metadata construction, the time necessary for index and SRL construction, and the efficiency of search.

Metadata construction
The main overheads for the data owner are the time cost and storage cost of metadata construction. To build the metadata for each document $F_i$ in the dataset F, we extract the keywords and compute the associated relevance scores, then encrypt the keywords and scores. The time cost of each entry directly depends on the number of keywords in the file, while the overall efficiency is also related to the number of files in the collection. Table 3 lists the metadata construction performance for a dataset of RFC files. Both the metadata size and the construction time listed are average values, which eliminates the differences between the various choices of file set.

Index and SRL construction
In our construction, we scan the whole metadata set to extract the keywords and build the inverted index with the corresponding scores. Figure 6 shows that the time to build the whole index is nearly linear in the size of M, namely the number of documents in the collection. The SRL is also built by scanning the metadata set. For a given support threshold, the number of entries is the main factor affecting efficiency. Figure 7 shows the time cost of building the SRL against the increasing size of M, i.e. of the dataset. In addition, taking into account the abundant computing resources on the server, the performance of building the index and the SRL is efficient in practice.

Table 3 File metadata construction overhead

Number of files    Per-file metadata size    Per-file metadata build time
1000               0.18 KB                   0.28 s
2000               0.20 KB                   0.30 s
3000               0.21 KB                   0.32 s

Figure 7 The time cost for building SRL. (x-axis: size of metadata set, ×10^3)

Search efficiency
The search process includes query extension, fetching the posting lists from the index, calculating the total relevance score, and ranking the results in descending order. Compared with the original ranked search, our approach introduces the cost of keyword extension and the cost of calculating the final relevance score. The size of the semantically expanded keyword set Sw is therefore a factor in query efficiency. Figure 8 shows the average time cost of a query against the size of Sw. With result ranking, top-k search can return the most relevant files more efficiently. In addition, as an evaluation of the overall search performance, Figure 9 shows the average time cost of a query against the number of files. Besides, the index and the SRL could be stored in a tree-based data structure, so that the server does not need to traverse all the keyword entries.

Figure 8 Time cost of query. For query keywords with different sizes of the semantically expanded keyword set, n=1000. (x-axis: size of Sw; y-axis: time of query, ×10 ms; series: return all results, return top-100 results)
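The four search steps listed above (query extension, posting-list lookup, total score calculation, ranking) can be sketched as follows. The names, the `max_expansions` parameter, and the weighted sum used as the total relevance score are assumptions made for the example; the sketch also works on plaintext scores, whereas the server in the scheme operates on order-preserving encrypted values.

```python
import heapq


def search(index, srl, query, k=None, max_expansions=5):
    """Rank files for a query keyword and its semantically related keywords.

    index: {keyword: [(file_id, score), ...]}
    srl:   {keyword: {related_keyword: relatedness}}
    """
    # Step 1: query extension, keeping the most related keywords (the set Sw).
    related = srl.get(query, {})
    expanded = sorted(related, key=related.get, reverse=True)[:max_expansions]
    weights = {query: 1.0}
    weights.update({word: related[word] for word in expanded})

    # Steps 2 and 3: fetch posting lists and accumulate a total score per file.
    # The weighted sum is an assumed combination rule for this sketch.
    totals = {}
    for keyword, weight in weights.items():
        for file_id, score in index.get(keyword, []):
            totals[file_id] = totals.get(file_id, 0.0) + weight * score

    # Step 4: rank in descending order of total score (file id breaks ties).
    def order(item):
        return (-item[1], item[0])

    if k is None:
        return sorted(totals.items(), key=order)
    # Top-k selection avoids fully sorting the candidate set.
    return heapq.nsmallest(k, totals.items(), key=order)
```

The loop over `weights` makes the dependence on the size of Sw explicit: each additional expanded keyword adds one posting-list fetch and one pass of score accumulation.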

Recall factor of the search
Analysis of the search results shows that the overall recall rate is improved and that the query results are more in line with the user’s actual intentions. For example, when a user inputs the keyword ‘protocol’, the files that contain related words such as ‘internet’, ‘network’ and ‘authentication’ are also returned; in addition, the files that include most of these words are ranked higher.
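As a small illustration of the recall claim, the sketch below computes recall as the fraction of relevant files that are retrieved. The file identifiers and the relevant set are hypothetical and are not taken from the experiments.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant files that appear in the retrieved list."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


# Hypothetical example: four files are relevant to the user's intent.
relevant_files = {"f1", "f2", "f3", "f4"}
exact_match = ["f1", "f2"]          # files containing the query keyword itself
expanded = ["f1", "f2", "f3"]       # exact matches plus one semantically related file

exact_recall = recall(exact_match, relevant_files)
expanded_recall = recall(expanded, relevant_files)
```

In this made-up case the expanded query retrieves one more relevant file than the exact match, so its recall is higher.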

Figure 9 The overall query performance. (x-axis: number of files, ×10^3; y-axis: time of query, ×10 ms)

Conclusion
In this paper, as an initial attempt, we propose a secure semantic expansion based similar search scheme over encrypted cloud data. The proposed scheme returns not only the exactly matched files, but also the files including terms semantically related to the query keyword. The encrypted files and metadata set are outsourced to the server by the owner. With the file metadata, the cloud builds the inverted index and constructs the semantic relationship library (SRL) for the keywords. The co-occurrence of terms is used to capture the semantic relationship of keywords in the dictionary, which offers an appropriate semantic distance between terms to accomplish the query keyword extension. We then derive a one-to-many OPE scheme to protect the term frequency while still allowing the total relevance score to be computed. Experimental evaluation demonstrates the efficiency and effectiveness of the scheme.

As our future work, the most practical direction is to further improve the security of our solution. New cryptographic techniques still need to be designed to protect the semantic information while keeping the ability to calculate the relevance score. In addition, we intend to study multi-keyword semantic search schemes that further take into account the semantic relationships between terms, e.g. the position of terms.

Abbreviations
SRL: Semantic relationship library; SE: Searchable encryption; SSE: Searchable symmetric encryption; OPE: Order-preserving encryption; OM-OPE: One-to-many order-preserving encryption.

Competing interests
The authors declare that they have no competing interests.

Authors’ contributions
ZX, YZ and XS proposed the solution of secure ranked semantic keyword search over encrypted cloud data and conducted the experiments. ZX and YZ designed the secure semantic keyword search model using semantic extension. XS and YZ developed the encryption scheme for security. YZ completed the main programming work, and analyzed the experimental results and the security with ZX and LC. ZX and LC also helped revise the manuscript. The submission was approved by all co-authors. All authors read and approved the final manuscript.

Authors’ information
Zhihua Xia received his PhD in computer science and technology from Hunan University, China, in 2011. He works as a lecturer in the School of Computer & Software, Nanjing University of Information Science & Technology. His research interests include steganography and steganalysis, digital forensics, image processing, pattern recognition, and cloud security.

Yanling Zhu is currently studying for her master’s degree in the School of Computer & Software, Nanjing University of Information Science & Technology, China. Her research interests include information security and cloud security.

Xingming Sun received his BS in mathematics from Hunan Normal University, China, in 1984, his MS in computing science from Dalian University of Science and Technology, China, in 1988, and his PhD in computing science from Fudan University, China, in 2001. He is currently a professor in the School of Computer & Software, Nanjing University of Information Science & Technology, China. His research interests include network and information security, digital watermarking, digital forensics, database security, natural language processing, and cloud security.

Lihong Chen is currently studying for her master’s degree in the School of Computer & Software, Nanjing University of Information Science & Technology, China. Her research interests include information security and cloud security.


Acknowledgements
This work is supported by the NSFC (61232016, 61173141, 61173142, 61173136, 61103215, 61373132, 61373133), GYHY201206033, 201301030, 2013DFG12860, BC2013012, Open fund of Jiangsu Engineering Center of Network Monitoring (KJR1308) and PAPD fund.

Received: 30 August 2013 Accepted: 19 May 2014

References1. Ren K, Wang C, Wang Q (2012) Security challenges for the public cloud.

IEEE Internet Comput 16(1):69–732. Kamara S, Lauter K (2010) Cryptographic cloud storage. In: Financial

Cryptography and Data Security. Springer, Berlin/Heidelberg, pp 136–1493. Song DX, Wagner D, Perrig A (2000) Practical techniques for searches on

encrypted data. In: Proceedings of IEEE Symposium on Security and Privacy.IEEE, Berkeley, California, pp 44–55

4. Goh E-J (2003) Secure indexes. Cryptology ePrint Archive, Report 2003/2165. Boneh D, Di Crescenzo G, Ostrovsky R, Persiano G (2004) Public key

encryption with keyword search. In: Advances in Cryptology-Eurocrypt 2004.Springer, Berlin/Heidelberg, pp 506–522

6. Chang Y-C, Mitzenmacher M (2005) Privacy preserving keyword searches onremote encrypted data. In: Applied Cryptography and Network Security.Springer, Berlin/Heidelberg, pp 442–455

7. Curtmola R, Garay J, Kamara S, Ostrovsky R (2006) Searchable symmetricencryption: improved definitions and efficient constructions. In: Proceedingsof the 13th ACM conference on Computer and communications security.ACM, Alexandria, VA, USA, pp 79–88

8. Bellare M, Boldyreva A, O’Neill A (2007) Deterministic and efficientlysearchable encryption. In: Advances in Cryptology-CRYPTO 2007. Springer,Berlin/Heidelberg, pp 535–552

9. Wang C, Cao N, Li J, Ren K, Lou W (2010) Secure ranked keyword searchover encrypted cloud data. In: 30th IEEE International Conference onDistributed Computing Systems (ICDCS). IEEE, Genoa, Italy, pp 253–262

10. Wang C, Cao N, Ren K, Lou W (2012) Enabling secure and efficient rankedkeyword search over outsourced cloud data. IEEE Trans Parallel Distrib Syst23(8):1467–1479

11. Cao N, Wang C, Li M, Ren K, Lou W (2011) Privacy-preserving multi-keywordranked search over encrypted cloud data. In: Proceedings of IEEE INFOCOM.IEEE, Shanghai, China, pp 829–837

12. Yang C, Zhang W, Xu J, Xu J, Yu N (2012) A Fast Privacy-PreservingMulti-keyword Search Scheme on Cloud Data. In: International Conferenceon Cloud and Service Computing (CSC). IEEE, Shanghai, China, pp 104–110

13. Stefanov E, Papamanthou C, Shi E (2014) Practical Dynamic SearchableEncryption with Small Leakage. NDSS ’14, San Diego, CA, USA

14. Wang C, Ren K, Yu S (2012) Urs KMR Achieving usable and privacy-assuredsimilarity search over outsourced cloud data. In: Proceedings of IEEEINFOCOM. IEEE, Orlando, Florida, USA, pp 451–459

15. Li J, Wang Q, Wang C, Cao N, Ren K, Lou W (2010) Fuzzy keyword searchover encrypted data in cloud computing. In: Proceedings of IEEE INFOCOM.IEEE, San Diego, CA, USA, pp 1–5

16. Chuah M, Hu W (2011) Privacy-aware bedtree based solution for fuzzymulti-keyword search over encrypted data. In: 31st International Conferenceon Distributed Computing Systems Workshops (ICDCSW). IEEE, Minneapolis,Minnesota, USA, pp 273–281

17. Liu C, Zhu L, Li L, Tan Y (2011) Fuzzy keyword search on encrypted cloudstorage data with small index. In: IEEE International Conference on CloudComputing and Intelligence Systems (CCIS). IEEE, Beijing, China, pp 269–273

18. Ibrahim A, Jin H, Yassin AA, Zou D (2012) Approximate Keyword-basedSearch over Encrypted Cloud Data. In: IEEE Ninth International Conferenceon e-Business Engineering (ICEBE). IEEE, Hangzhou, China, pp 238–245

19. Bringer J, Chabanne H (2012) Embedding edit distance to enable privatekeyword search. Human-centric Comput Inf Sci 2(1):1–12

20. Xu J, Croft WB (1996) Query expansion using local and global documentanalysis. In: Proceedings of the 19th Annual International ACM SIGIRConference on Research and Development in Information Retrieval. ACM,New York, USA, pp 4–11

21. Fu G, Jones CB, Abdelmoty AI (2005) Ontology-based spatial queryexpansion in information retrieval. In: On the move to meaningful internetsystems 2005: CoopIS, DOA, and ODBASE. Springer, Berlin/Heidelberg,pp 1466–1482

22. Lesk ME (1969) Word-word associations in document retrieval systems.Am Doc 20(1):27–38

23. Minker J, Wilson GA, Zimmerman BH (1972) An evaluation of queryexpansion by the addition of clustered terms for a document retrievalsystem. Information Storage and Retrieval 8(6):329–348

24. Wei J, Bressan S, Ooi BC (2000) Mining term association rules for automaticglobal query expansion: methodology and preliminary results. In:Proceedings of the First International Conference on Web InformationSystems Engineering. IEEE, Hong Kong, China, pp 366–373

25. Pal D, Mitra M, Datta K (2013) Query expansion using term distribution andterm association., arXiv preprint arXiv:13030667

26. Lai L-F, Wu C-C, Lin P-Y, Huang L-T (2011) Developing a fuzzy search enginebased on fuzzy ontology and semantic search. In: IEEE InternationalConference on Fuzzy Systems (FUZZ). IEEE, Taipei, Taiwan, pp 2684–2689

27. Fonseca BM, Golgher PB, De Moura ES, Pôssas B, Ziviani N (2003)Discovering search engine related queries using association rules. J WebEng 2(4):215–227

28. Song M, Song I-Y, Hu X, Allen R (2005) Semantic query expansioncombining association rules with ontologies and information retrievaltechniques. In: Data Warehousing and Knowledge Discovery. Springer,Berlin/Heidelberg, pp 326–335

29. Song M, Song I-Y, Hu X, Allen RB (2007) Integration of association rules andontologies for semantic query expansion. Data Knowl Eng 63(1):63–75

30. Singhal A (2001) Modern information retrieval: a brief overview. IEEE DataEng Bull 24(4):35–43

31. Agrawal R, Kiernan J, Srikant R, Xu Y (2004) Order preserving encryption fornumeric data. In: Proceedings of the 2004 ACM SIGMOD InternationalConference on Management of data. ACM, Paris, France, pp 563–574

32. Boldyreva A, Chenette N, Lee Y, O’neill A (2009) Order-preserving symmetricencryption. In: Advances in Cryptology-EUROCRYPT 2009. Springer, Berlin/Heidelberg, pp 224–241

33. MOFFAT AA, Bell TC (1999) Managing gigabytes: compressing and indexingdocuments and images. Morgan Kaufmann, San Francisco, California, USA

34. Request For Comments Database. http://www.ietf.org/rfc.html. Accessed 27Oct 2013

doi:10.1186/s13677-014-0008-2
Cite this article as: Xia et al.: Secure semantic expansion based search over encrypted cloud data supporting similarity ranking. Journal of Cloud Computing: Advances, Systems and Applications 2014 3:8.
