INFORMATION RETRIEVAL AND SEMANTIC INFERENCE FROM NATURAL LANGUAGE PRIVACY POLICIES by MITRA BOKAEI HOSSEINI, M.Sc. DISSERTATION Presented to the Graduate Faculty of The University of Texas at San Antonio In Partial Fulfillment Of the Requirements For the Degree of DOCTOR OF PHILOSOPHY IN COMPUTER SCIENCE COMMITTEE MEMBERS: Jianwei Niu, Ph.D., Chair Travis Breaux, Ph.D. Xiaoyin Wang, Ph.D. Ravi Sandhu, Ph.D. John Quarles, Ph.D. Jeff Prevost, Ph.D. THE UNIVERSITY OF TEXAS AT SAN ANTONIO College of Sciences Department of Computer Science May 2019
permissions and is described to users when installing an app as well as on the app’s download page
on the Google Play store. Thus, there is a direct relationship between the permissions granted to
an application by a user at installation time and eligible API method calls in the application source
code.
Morphological Variant: information type phrases are frequently variants of a common lexeme,
e.g., “device” is a morphological variant of “mobile device.”
In the definitions above, we assume that noun phrases expressed in text have a corresponding
concept and that the text describes one name for the concept. This relationship between the phrase
and concept is also arbitrary, as noted by Saussure in his theory of the signifier, which is the symbol
that represents a meaning, and the signified, which is the concept or meaning denoted by the sym-
bol [22]. Peirce defines a similar relationship between sign-vehicles and objects, respectively [36].
2.4 Context Free Grammar
A context-free grammar G is a quadruple G = ⟨N, T, R, S⟩, where N is a finite set of non-terminal symbols; T is a finite set of terminal symbols; R is a finite set of productions; and S ∈ N
is the designated start symbol of G. The productions in R are pairs of the form α → β, where
α ∈ N and β ∈ (N ∪ T)*. An empty right-hand side in a production is represented with the symbol ε.
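The quadruple can be written out directly in code. The toy noun-phrase grammar below is a hypothetical example for illustration, not a grammar used in this work:

```python
# A minimal sketch of a context-free grammar G = <N, T, R, S>.
# This toy grammar for noun phrases is an illustrative example only.
N = {"NP", "Adj", "Noun"}                      # non-terminal symbols
T = {"unique", "device", "identifier"}         # terminal symbols
S = "NP"                                       # designated start symbol
R = {                                          # productions: alpha -> beta
    "NP":   [["Adj", "NP"], ["Noun"], ["Noun", "Noun"]],
    "Adj":  [["unique"]],
    "Noun": [["device"], ["identifier"]],
}

def derives(symbols, words):
    """Return True if the sequence of grammar symbols can derive the word list."""
    if not symbols:
        return not words
    head, rest = symbols[0], symbols[1:]
    if head in T:                              # a terminal must match the next word
        return bool(words) and words[0] == head and derives(rest, words[1:])
    return any(derives(body + rest, words) for body in R[head])

print(derives([S], ["unique", "device", "identifier"]))  # True
print(derives([S], ["identifier", "unique"]))            # False
```

Here membership is checked by exhaustive expansion, which suffices for a small grammar without left recursion; practical parsers use chart-based algorithms instead.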
2.5 Semantic Attachments
In the rule-to-rule approach [41], the production rules r from CFG G are extended with semantic
attachments. To construct a semantic attachment, each production r ∈ R, r : α → β1...βn is
associated with a semantic rule α.sem: {f(β1.sem, ..., βn.sem)} to infer semantic ontological
relationships. The semantic attachment α.sem states that the semantic representation assigned to
production r contains a semantic function f that maps the semantic attachments βi.sem to α.sem,
where each βi, 1 ≤ i ≤ n, is a constituent (terminal or non-terminal symbol) in production r.
The semantic attachments for each production rule are shown in brackets {. . . } to the right of the
production's syntactic constituents.
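The rule-to-rule pairing of productions with semantic functions can be sketched as follows; the grammar, the semantic functions, and the parse tree are hypothetical examples chosen for illustration:

```python
# Rule-to-rule sketch: each production r: alpha -> beta1 ... betan is paired
# with a semantic function f over its constituents' .sem values.
# The rules and semantic representations below are illustrative only.
rules = {
    # NP -> Adj Noun   { NP.sem = f(Adj.sem, Noun.sem) }
    ("NP", ("Adj", "Noun")): lambda adj, noun: {"head": noun, "modifier": adj},
    # Adj -> 'unique'      { Adj.sem = 'unique' }
    ("Adj", ("unique",)): lambda: "unique",
    # Noun -> 'identifier' { Noun.sem = 'identifier' }
    ("Noun", ("identifier",)): lambda: "identifier",
}

def sem(tree):
    """Evaluate alpha.sem bottom-up as f(beta1.sem, ..., betan.sem)."""
    head, children = tree
    key = (head, tuple(c if isinstance(c, str) else c[0] for c in children))
    args = [sem(c) for c in children if not isinstance(c, str)]
    return rules[key](*args)

# Parse tree for "unique identifier": (non-terminal, [children]); leaves are strings.
parse = ("NP", [("Adj", ["unique"]), ("Noun", ["identifier"])])
print(sem(parse))  # {'head': 'identifier', 'modifier': 'unique'}
```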
2.6 λ Calculus
λ calculus is a formal system in mathematical logic for expressing computation based on function
abstraction and application using variable binding and substitution. λ calculus consists of con-
structing lambda terms and performing reduction operations on them. The following three rules
give an inductive definition that can be applied to build all syntactically valid lambda terms:
• a variable, x, is itself a valid lambda term.
• if t is a lambda term, and x is a variable, then (λx.t) is a lambda term called a lambda
abstraction.
• if t and s are lambda terms, then (ts) is a lambda term called an application.
The beta reduction (β) rule states that an application of the form (λx.t)s reduces to
the term t[x := s].
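These definitions can be sketched with lambda terms encoded as nested tuples; the encoding and helper names are our own illustrative choices, and substitution is deliberately capture-naive for brevity:

```python
# Lambda terms as nested tuples: a variable is a str, ("lam", x, body) is an
# abstraction (lambda x. body), and ("app", t, s) is an application (t s).
# Substitution is capture-naive: we assume all bound variables are distinct.

def substitute(t, x, s):
    """Return t[x := s]."""
    if isinstance(t, str):
        return s if t == x else t
    if t[0] == "lam":
        _, v, body = t
        return t if v == x else ("lam", v, substitute(body, x, s))
    _, f, a = t
    return ("app", substitute(f, x, s), substitute(a, x, s))

def beta_reduce(t):
    """Repeatedly apply the beta rule: (lambda x. t) s -> t[x := s]."""
    if isinstance(t, str):
        return t
    if t[0] == "lam":
        return ("lam", t[1], beta_reduce(t[2]))
    f, a = beta_reduce(t[1]), beta_reduce(t[2])
    if isinstance(f, tuple) and f[0] == "lam":
        return beta_reduce(substitute(f[2], f[1], a))
    return ("app", f, a)

# (lambda x. lambda y. x) a b  reduces to  a
term = ("app", ("app", ("lam", "x", ("lam", "y", "x")), "a"), "b")
print(beta_reduce(term))  # a
```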
2.6.1 Grounded Theory
Grounded theory is a qualitative inquiry approach that involves applying specific types of codes to data through
a series of coding cycles, leading to the development of a theory grounded in the data [59]. We use
three main strategies [21] of this method throughout this work: (1) coding qualitative data; (2)
memo-writing; and (3) theoretical sampling.
2.6.2 Word Embedding
Word embeddings are distributed representations of words as vectors in some m-dimensional
space that help learning algorithms achieve better performance by grouping similar words together [46], [5]. Each vector dimension represents some feature of the words' semantics in a
corpus.
Currently, the two most popular word embedding models are Global Vectors (GloVe) [52]
and Skip-gram [45]. GloVe trains word embedding vectors by constructing a word-to-word co-occurrence matrix. After filling the matrix with how frequently words co-occur, matrix factorization is used to determine the word embedding vector values for each word. For the Skip-gram
model, a window size is defined before training begins. For each word in the corpus, the surrounding words (identified by the size of the window) are used as context for that word. This
context is then used as input to a neural network that will modify the word's vector values. After
visiting each word in the corpus, words should be grouped together in the vector space of the vocabulary based on their context. The closer together two words are in the vector space, the more
semantically similar they are assumed to be. In this work, we adopt Word2Vec3, an implementation of the Skip-gram model, to construct domain-specific word embeddings, which is discussed in
Section 8.1.1.
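The "closer in vector space means more similar" assumption is commonly operationalized with cosine similarity. In the sketch below, the three-dimensional vectors are fabricated for illustration; real Word2Vec embeddings have hundreds of dimensions learned from a corpus:

```python
# Toy sketch of similarity in an embedding space. The 3-dimensional vectors
# below are made-up illustrative values, not learned embeddings.
from math import sqrt

embeddings = {
    "device":   [0.9, 0.1, 0.2],
    "phone":    [0.8, 0.2, 0.3],
    "location": [0.1, 0.9, 0.7],
}

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm

# "device" should be closer to "phone" than to "location" in this toy space.
print(cosine(embeddings["device"], embeddings["phone"]) >
      cosine(embeddings["device"], embeddings["location"]))  # True
```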
2.6.3 Convolutional Neural Network
In general, convolutional neural networks (CNNs) are a kind of feed-forward network, specialized
in processing data with a grid-like topology [42]. For CNNs, there are usually three major steps in
a convolutional layer. The first step involves applying several convolutions to the input matrix to
produce a set of linear activations [29]. This is done using a sliding window, called a kernel or filter,
that slides over the entire input matrix, thereby performing a convolution for every set of elements.
The second step applies a non-linear function to each linear activation produced by the previous
step (e.g., tanh, relu, etc.). In the third step, different types of pooling functions (e.g., max pooling,
average pooling, etc.) are applied to sets of areas, called neighborhoods, which cover the entire
transformed input. Pooling is done to make the transformed input approximately invariant, which
emphasizes the importance of the existence of a feature in the input over the specific location of
that feature in the input [29]. The matrix result after these three steps is a representation of the
main features of the input.
3https://code.google.com/archive/p/word2vec/
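The three convolutional-layer steps above can be sketched in one dimension for clarity; the signal and kernel values below are arbitrary illustrative numbers, not data from this work:

```python
# Sketch of the three steps of a convolutional layer, in 1-D:
# (1) convolution with a sliding kernel, (2) a non-linearity (ReLU),
# (3) max pooling over neighborhoods. Input values are illustrative only.

def conv1d(xs, kernel):
    """Step 1: slide the kernel over the input, producing linear activations."""
    k = len(kernel)
    return [sum(xs[i + j] * kernel[j] for j in range(k))
            for i in range(len(xs) - k + 1)]

def relu(xs):
    """Step 2: apply a non-linear function to each linear activation."""
    return [max(0.0, x) for x in xs]

def max_pool(xs, size):
    """Step 3: keep the maximum of each neighborhood for approximate invariance."""
    return [max(xs[i:i + size]) for i in range(0, len(xs), size)]

signal = [1.0, -2.0, 3.0, 0.5, -1.0, 2.0]
kernel = [1.0, -1.0]                         # responds to local changes

activations = relu(conv1d(signal, kernel))   # steps 1 and 2
features = max_pool(activations, 2)          # step 3
print(features)                              # [3.0, 2.5, 0.0]
```

The pooled output records that strong activations exist, while discarding their exact positions, which is the invariance property described above.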
Chapter 3: RELATED WORK
3.1 Lexicons, Ontology, and Requirements Analysis
In requirements engineering, lexicons play an important role in reducing ambiguity and improv-
ing the quality of specifications [28]. Boyd et al. examined the design of constrained natural
languages and their reliance on limited vocabulary to reduce ambiguity [9]. They proposed an
automated technique to optimally constrain lexicons by introducing the concept of term replace-
ability. The value of their approach is that the lexicon becomes more easily evolvable over time.
While a lexicon often consists of terminology and definitions, an ontology represents the semantic
relationships between terms, including whether they are hypernyms or synonyms. Breitman and
do Prado Leite describe how an ontology can be used to analyze web application requirements [16].
Breaux et al. [12] utilize an ontology consisting of actors and information types to infer data flow
traces across privacy requirements of different vendors. In that work, an ontology was needed to align
terminology across different domains and vendor applications.
3.2 Ontology in Requirements Traceability to Code
Zimmeck et al. proposed an approach to identify the misalignments between data practices ex-
pressed in privacy policies and mobile app code without considering abstraction in policy state-
ments [73]. Privacy policy annotations in this work yielded three bag-of-words for information
types labeled as device ID, location, and contact information. For example, “IP address” is con-
tained in the bag-of-words with label device ID. Therefore, the policy statement classifier in their
approach labels policy statements using one of the three categories: device ID, location, and con-
tact information. However, ignoring the actual information type in the statement can produce false
negatives in misalignment detection tools. As an example, consider an application that mentions
collecting “IP address” in its policy and calls an Android API method to retrieve “Android ID”
in the app code. Since Zimmeck et al.'s approach classifies both “IP address” and “Android ID” as
device IDs, this would be interpreted as an expected alignment, when in fact it is a misalignment:
the policy and code are inconsistent.
Slavin et al. [60] identify inconsistencies between app code and privacy policies by utilizing a manually
constructed information type ontology. The ontology contains 365 unique information types from
an analysis of 50 privacy policies. However, the cost of setting up the ontology is non-trivial as it
requires manual assignment of semantic relations between information types and handcrafting the
ontology. Identifying misalignments between natural language data practice descriptions and app
code cannot be achieved without addressing ambiguity and abstraction of data type terminology.
To address this challenge, we propose a method to construct a formal ontology that captures the
semantic relationships between information types mentioned in these descriptions.
3.3 Lexical Ontologies
WordNet is a lexical database that contains English words grouped into nouns, verbs, adjectives,
adverbs, and function words [24, 47]. Within each category, the words are organized by their
semantic relations, including hypernymy, meronymy, and synonymy [24]. However, our analysis
shows that only 14% of information types from a privacy policy ontology are found in WordNet,
mainly because the lexicon is populated with multi-word, domain-specific phrases. Therefore,
finding a category of information type along with its subordinate terms can be a challenging task for
a requirements analyst. Our work aims to identify relationships between categories of information
types with respect to hypernymy, meronymy, and synonymy in the privacy domain, which can be
reused in requirements analysis tasks.
3.4 Relationship Extraction and Classification Methods
Snow et al. [62] presented a machine learning approach using hypernym-hyponym pairs in Word-
Net to identify additional pairs in the parsed sentences of the Newswire corpus. This approach
relies on the explicit expression of hypernymy pairs in text. Marti Hearst proposed six lexico-
syntactic patterns to automatically identify hypernymy in text using noun phrases and regular expressions [34]. Evans et al. [23] applied an extended set of 72 Hearst patterns to privacy policies
to extract hypernymy pairs. Pattern sets are limited because they must be manually extended to
address new policies. Chen et al. [19] presented an approach for gathering software terms and
their morphological forms. Terms are limited to those in the software development domain. Roy
et al. [58] presented an approach for inferring relationships between terms in math word problems.
In supervised paradigms, researchers have tried to classify the relationship between a pair of
nominals in sentences by extracting features such as part-of-speech tags, shortest dependency path,
and named entities [17, 31, 48, 70]. Performance among these methods depends on the quality of
designed features [71]. To address this problem, deep learning models have been introduced that
leverage a distributed representation of the words to reduce the number of handcrafted features.
For example, Zeng et al. [71] proposed a deep learning model that captures the semantics of a
sentence containing a pair of nominals by combining word and distance features using a convolu-
tion module. The features were used to classify the relationship between nominal pairs into four
categories. This model extracts features using convolutional neural networks and outperforms the
supervised models that use part-of-speech, stemming, and other lexical features with classifiers,
such as SVM and MaxEnt. Zhou et al. [72] proposed a model that utilizes word embeddings, Bidi-
rectional Long Short Term Memory (BLSTM), and attention-based neural networks for relation
classification. Attention-based neural networks use the results from BLSTM to generate a single
vector representing the sentence's semantics. A softmax classifier is used to classify the relationships
using the sentence vectors; this model outperforms the model presented in [71].
The feature-based and neural network models mentioned above are used to extract the rela-
tionships between the annotated nominals in a given sentence. These approaches are all sentence
dependent and fail to consider the semantic relations between phrases that are not in the same
sentence. Therefore, our proposed work aims to model the semantics of two information types
extracted from a pool of privacy policies and to identify the semantic relations between them.
Chapter 4: ACQUIRING PRIVACY POLICY LEXICON
There is no standard method to build an ontology [67]; however, a general approach includes identifying the ontology purpose and scope; identifying key concepts leading to a lexicon; identifying
relations between lexicon concepts; and formalizing those relations. A lexicon consists of termi-
nology in a domain, whereas ontologies organize terminology by semantic relations [39]. Lexicons
can be constructed using content analysis of source text, which yields an annotated corpus. This
chapter describes our approach to build a privacy policy lexicon. We use this approach to construct
different lexicons, including a platform information lexicon and a user-provided information lexicon.
Section 4.1 describes the general approach to build a lexicon. The platform information lexicon and
user-provided information lexicon are presented in Sections 4.2 and 4.3, respectively.
4.1 Lexicon Construction Approach
The mobile privacy policy lexicon (artifact A in Figure 4.1) was constructed using a combination of
crowdsourcing, content analysis, and natural language processing (NLP). The lexicon construction
method (see Figure 4.1) consists of 4 steps: (1) collecting privacy policies; (2) itemizing paragraphs
in the collected privacy policies; (3) annotating the itemized paragraphs by crowd workers based on
a specific coding frame; (4) employing an entity extractor developed by Bhatia and Breaux [6] to
analyze the annotations and extract information types, which results in an information type lexicon
(artifact A in Figure 4.1). Steps 1-3 are part of a crowdsourced content analysis task based on
Breaux and Schaub [13].
We use this approach to construct different lexicons using various privacy policies for apps in
different domains. We use two different coding frames for annotating the privacy policies that capture platform information and user-provided information. More information about these lexicons
is provided in the following sections.
Figure 4.1: Overview of Lexicon Construction Method
4.2 Platform Information Lexicon
Using the approach discussed in Section 4.1, we constructed the platform information lexicon that
is used by Slavin et al. [60] to identify the inconsistencies between privacy policies and app code
based on the API method calls in the code.
In step 1 (see Figure 4.1), we selected the top 20 mobile apps across each of 69 sub-categories
in Google Play. From this set, we selected apps with privacy policies, removing duplicate poli-
cies when different apps shared the same policy. Next, we selected only policies that match the
following criteria: format (plain text), language (English), and explicit statements for privacy policy, yielding 501 policies, from which we randomly selected 50 policies. In step 2, the 50 policies
were segmented into 120-word paragraphs using the method described by Breaux and Schaub [13],
yielding 5,932 crowd worker annotation tasks with an average of 98 words per task as input to step 3.
In step 3, the annotators select phrases corresponding to one of two category codes in a segmented
paragraph as described below for each annotator task, called a Human Intelligence Task (HIT). An
example HIT is shown in Figure 4.2.
Figure 4.2: Example of Crowd Sourced Policy Annotation Task for Platform Information Types
• Platform Information: any information that the app or another party accesses through the
mobile platform which is not unique to the app.
• Other Information: any other information the app or another party collects, uses, shares or
retains.
These two category codes were chosen because our initial focus is on information types that
are automatically collected by mobile apps and mobile platforms, such as “IP address” and “location information.” The other information code is used to ensure that annotators remain vigilant.
In step 4, we selected only platform information types when two or more annotators agreed on
the annotation to construct the lexicon. This number follows the empirical analysis of Breaux
and Schaub [13], which shows high precision and recall for two or more annotators on the same
HIT. Next, we applied an entity extractor [6] to the selected annotations to itemize the platform
information types into unique entities included in the privacy policy lexicon.
Six privacy experts performed the annotations. The cumulative time to annotate all HITs
was 59.8 hours across all six annotators, yielding a total of 720 annotations in which two or more
annotators agreed on the annotation. The entity extractor reduced these annotations down to 356
unique information type names, which comprise the initial lexicon.
In Chapter 5, we discuss how we utilized this lexicon to construct the platform information ontology.
4.3 User-provided Information Lexicon
Using the approach discussed in Section 4.1, we constructed three different user-provided information lexicons for the finance, health, and dating app domains, which are used to identify the
inconsistencies between privacy policies and user-provided information types collected through
user interface input fields in apps.
To construct the lexicons, we first select five top apps in each of six sub-categories (personal
budget, banks, personal health, insurance-pharmacy, casual and serious dating) of finance, health,
and dating in Google Play, to yield 30 total apps for all three main categories. Next, we segment the
privacy policies into 120-word paragraphs using the method described by [14], which yields annotation tasks from each policy. Figure 4.3 shows an example annotation task, wherein annotators
are asked to annotate phrases based on the following coding frame: User-provided Information;
Automatically Collected Information; and Uncertain or Unclear.
The user-provided information annotations describe types that are explicitly stated in the poli-
cies. However, policies do not always mention how or from whom they collect the information.
For example, in Figure 4.3, it is unclear how “information” is collected. To build the privacy
policy lexicon, we consider both annotations coded as user-provided information, and uncertain
or unclear, in case the policy author described the user-provided collection in an unclear man-
ner. We included the code for automatically collected information to ensure that annotators pay
close attention to how information collection is described in the policy, since it is disjoint from
user-provided information.
We collect annotations by recruiting five crowd workers from Amazon Mechanical Turk (AMT)
to annotate each 120-word paragraph of the combined 30 privacy policies. Because this annotation
task differed from [14], we also collected annotations for the same tasks from six privacy experts
to evaluate the crowd worker lexicon. Among all annotations collected, we only add annotations
Figure 4.3: Example of Crowd Sourced Policy Annotation Task for User-provided Information Types
to the lexicon where two or more annotators agreed on the annotation. This decision follows the
study which shows high precision and recall for two or more annotators [14]. In the next step,
we applied an entity extractor [6] to the selected annotations to itemize the information types into
unique entities. Finally, the unique information types are added to the finance, health, or dating
lexicon depending on which sub-category they belong to.
Table 4.1 shows the total HITs to collect information type annotations, average word count per
HIT, total annotations collected from crowd workers and privacy experts, total unique information
types extracted, and combined annotation time for crowd workers and privacy experts.
Overall, the average time to extract an information type from a privacy policy in health, finance,
and dating is 10.6 minutes, 7.0 minutes, and 8.4 minutes, respectively. This time includes the
additional time from privacy expert annotations needed to evaluate the method.
The lexicon quality is measured by the consensus between privacy expert and crowd worker
annotations as measured by extracted, unique information types. In health, the privacy experts
and crowd workers agreed on 105/198 unique information types. In addition, the privacy experts
Table 4.1: User-provided Information Lexicon analysis

                                      Health  Finance  Dating  Overall
Total HITs                               141       52     141      334
Average Words per HIT                    105      102     116      108
Total Annotations - Crowd Workers        739      309     868    1,916
Total Annotations - Privacy Experts      456      198     508    1,162
Total Unique Information Types           197      112     262      490
Annotation Time (hours)                 34.7     13.1      36       84
missed 55 information types that the crowd workers annotated, and the crowd workers missed 34
information types that the privacy experts annotated. In finance, all annotators agreed on 69/112
information types, crowd workers annotated an additional 20 types, whereas privacy experts an-
notated an additional 23 types. In the dating domain, all annotators agreed on 135/262 information
types; crowd workers annotated an additional 76 information types, and privacy experts annotated
an additional 51 unique information types. Overall, the crowd workers generally identified 18-29% more
information types, and privacy experts generally identified 17-20% more types. The consensus was
52-62% of the extracted types.
In addition to comparing annotator performance, we compared the lexicon coverage across
each domain. The health and finance lexicons share 32/278 phrases, health and dating share 45/415
phrases, and finance and dating share 27/347 phrases. This is an overlap of only 8-12% across three
domains, which is due to the differences in policy focus and application features.
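The overlap figures above are set computations over the lexicons' phrases; a minimal sketch, using made-up four-phrase lexicons rather than the actual lexicon contents:

```python
# Sketch of the cross-domain overlap computation: shared phrases measured
# against the union of two lexicons. These tiny lexicons are illustrative only.
health = {"blood type", "email address", "location", "age"}
finance = {"account number", "email address", "location", "income"}

shared = health & finance      # phrases appearing in both lexicons
union = health | finance       # all distinct phrases across the two lexicons
print(f"{len(shared)}/{len(union)} shared "
      f"({100 * len(shared) // len(union)}% overlap)")  # 2/6 shared (33% overlap)
```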
Section 5.3 describes our effort to construct three different ontologies using finance, health,
and dating user-provided information lexicons.
Chapter 5: MANUAL ONTOLOGY CONSTRUCTION
We now describe our bootstrap method for constructing a formal ontology from an information
type lexicon. This includes our choice of formalism, the tools used to express the ontology, and
the construction method.
5.1 Manual Ontology Construction Methodology
The bootstrap begins with an initial ontology, wherein each lexicon phrase is subsumed by the ⊤
concept and no other relationship exists between phrases from a given lexicon. Next, each analyst
follows these four steps: (1) they create two copies of the initial ontology, KB1 and KB2, one
for each analyst; (2) each analyst defines subsumption and equivalence axioms for concept pairs
using an ontology editor by making paired comparisons among the concepts in the ontology based
on the heuristics defined below; (3) the two analysts compare their axioms in KB1 and KB2 to
identify missing axioms and to compute the degree of agreement. Agreement is measured using
the chance-corrected inter-rater reliability statistic Fleiss' Kappa; and (4) finally, the two analysts
meet to investigate the disagreements and reconcile the axioms in KB1 and KB2. The analysts
re-calculate agreement after each reconciliation to measure the improvement due to reconciliation.
Identifying semantic relationships is a heuristic-based procedure, wherein each analyst develops
their own heuristics or rules for identifying relationships. The reconciliation step requires analysts
to explicate and justify their choices, as well as to learn to accept or reject heuristics proposed by
the other analyst. This method is subject to cognitive bias, including the proximity of concepts to
each other in the alphabetical list, and to the recency with which the analysts encountered concepts
for comparison [54].
The bootstrap method was piloted on five privacy policies and resulted in a set of seven heuris-
tics that form a grounded theory. The heuristics explain why two concepts share an axiom in the
ontology. For a pair of concepts C1, C2, the analysts assign an axiom with respect to a TBox T and
one heuristic as follows.
• Hypernym (H): C1 ⊑ C2, when concept C2 is a general category of C1, e.g., “device” is
subsumed by “technology.”

• Meronym (M): C1 ⊑ C2, when concept C1 is a part of C2, e.g., “Internet protocol address”
is subsumed by “Internet protocol.”

• Modifier (D): C1_C2 ⊑ C2 and C1_C2 ⊑ C1_information, when C1 is modifying C2, e.g.,
“unique device identifier” is subsumed by “unique information” and “device identifier.”

• Plural (P): C1 ≡ C2, when C1 is a plural form of C2, e.g., “MAC addresses” is the plural
form of “MAC address.”

• Synonym (S): C1 ≡ C2, when C1 is a synonym of C2, e.g., “geo-location” is equivalent to
“geographic location.”

• Technology (T): C1 ≡ C1_information, when C1 is a technology, e.g., “device” is equivalent to “device information.”

• Event (E): C1 ≡ C1_information, when C1 is an event, e.g., “usage” is equivalent to “usage
information.”
The bootstrap method and the seven heuristics are used to construct the platform information
ontology and the user-provided information ontologies from the platform information lexicon and
user-provided information lexicons, respectively.
5.2 Platform Information Ontology
To construct the platform information ontology, two analysts conducted a 4-step heuristic evalua-
tion of the seven heuristics using the lexicon produced by the 50 mobile app privacy policies (plat-
form information lexicon in Section 4.2) as follows: (1) two analysts separately apply the bootstrap
method on a copy of the lexicon; (2) for each ontology, an algorithm extracts each expressed and
inferred axiom between two concepts using the HermiT1 reasoner; (3) each relationship assigned to
1http://www.hermit-reasoner.com/
Table 5.1: Example Table Comparing Concept Relationships in Platform Information Ontology

LHS Concept      RHS Concept         Heuristics  Analyst1  Analyst2
web pages        web sites           S/M         Equiv     Sub
ads clicked      usage information   H           Sub       Sub
computer         platform            H           Super     Super
log information  system activity     M           None      Super
device type      mobile device type  D           Super     None
tablet           tablet information  T           None      Equiv
a concept pair appears in one column per analyst (in Table 5.1, see Analyst1, where ’Super’ means
the LHS (left-hand side) concept is a superclass of the RHS (right-hand side) concept, ’Sub’ means
subclass of, ’Equiv’ means equivalence, and ’None’ means no relationship); and (4) each analyst
then separately reviews their assigned axiom type, chooses the heuristic to match the assignment,
and decides whether to retain or change their axiom type.
In Table 5.1, the left-hand side (LHS) concept is compared to the right-hand side (RHS) concept
by Analyst1 and Analyst2, whose axiom types appear in their respective column, e.g., Analyst1
assigned “Equiv” to “web pages” and “web sites” and the heuristic “S” to indicate these two con-
cepts are synonyms, whereas Analyst2 assigned “Sub” and heuristic “M” to indicate “web pages”
is a part of “web sites.”
Before and after step 3, we compute the Fleiss’ Kappa statistic, which is a chance-corrected,
inter-rater reliability statistic [25]. Increases in this statistic indicate improvement in agreement
above chance. The result of evaluation of this ontology is presented in Section 5.4.
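Fleiss' kappa can be computed directly from a rating matrix; in the sketch below, the two-rater, four-category matrix is a made-up example rather than the study's actual data:

```python
# Sketch of the Fleiss' kappa computation used to measure agreement above
# chance. Rows are rated items (here, concept pairs); columns count how many
# raters assigned each category (here, Super, Sub, Equiv, None).

def fleiss_kappa(ratings):
    """ratings[i][j]: number of raters assigning item i to category j."""
    N = len(ratings)                     # number of items
    n = sum(ratings[0])                  # raters per item (assumed constant)
    total = N * n
    k = len(ratings[0])                  # number of categories
    p_j = [sum(row[j] for row in ratings) / total for j in range(k)]
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P_i) / N                 # observed agreement
    P_e = sum(p * p for p in p_j)        # agreement expected by chance
    return (P_bar - P_e) / (1 - P_e)

# Two raters, four axiom categories; perfect agreement on 3 of 4 items.
ratings = [
    [2, 0, 0, 0],
    [0, 2, 0, 0],
    [2, 0, 0, 0],
    [1, 1, 0, 0],
]
print(round(fleiss_kappa(ratings), 3))  # 0.467
```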
5.3 User-provided Information Ontology
Following the manual ontology approach, two analysts conducted a 4-step heuristic evaluation
of the seven heuristics on each of the user-provided information lexicons in the finance, health, and
dating domains (see Section 4.3) as follows: (1) two analysts separately apply the bootstrap method
on a copy of each lexicon; (2) for each ontology, an algorithm extracts each expressed and in-
ferred axiom between two concepts using the HermiT2 reasoner; (3) each relationship assigned to
2http://www.hermit-reasoner.com/
Table 5.2: Example Table Comparing Concept Relationships in Finance Domain

LHS Concept  RHS Concept                 Heuristics  Analyst1  Analyst2
id           identification information  T/H         Equiv     Sub
account      payment account             D           Super     Super
account      financial information       H           Sub       Sub
account      account balance             M           Super     Super
a concept pair appears in one column per analyst (see Table 5.2, where ’Super’ means the LHS
(left-hand side) concept is a superclass of the RHS (right-hand side) concept, ’Sub’ means sub-
class of, ’Equiv’ means equivalence, and ’None’ means no relationship); and (4) each analyst then
separately reviews their assigned axiom type, chooses the heuristic to match the assignment, and
decides whether to retain or change their axiom type.
Table 5.2 shows the results of step 3 during construction of an ontology for the finance domain. In
this table, the left-hand side (LHS) concept is compared to the right-hand side (RHS) concept
by Analyst1 and Analyst2, whose axiom types appear in their respective columns, e.g., Analyst1
assigned “Equiv” to “id” and “identification information” and the heuristic “T” to indicate these
two concepts are equivalent, whereas Analyst2 assigned “Sub” and heuristic “H” to indicate
“identification information” is a general category of “id.”
Before and after step 3, we compute the Fleiss’ Kappa statistic, which is a chance-corrected,
inter-rater reliability statistic [25]. Increases in this statistic indicate improvement in agreement
above chance. Section 5.5 provides evaluation results for the three user-provided information
ontologies.
5.4 Platform Information Ontology Evaluation and Results
The ontology was constructed using the bootstrap method and evaluated in two iterations (see
Table 5.3): Round 1 covered 25/50 policies to yield 235 concept names and 573 axioms from
the 4-step heuristic evaluation, and Round 2 began with the result of Round 1 and added the
concepts from the remaining 25 policies to yield a total 368 concept names and 849 axioms. The
resulting ontology produced 13 new concepts that were not found in the lexicon, because the
analysts added tacit concepts to fit lexicon phrases into existing subsumption hierarchies. Table 5.3
presents the results of the number of “Super,” “Sub,” and “Equiv” axioms and “None” identified
after the bootstrap method.
Table 5.3: Number of Ontological Relations Identified by Each Analyst during Each Round

Iteration  Analyst  Super  Sub  Equiv  None
Round 1    1        151    203  77     142
Round 1    2        157    172  78     166
Round 2    1        304    343  142    60
Round 2    2        313    352  151    33
Table 5.4 presents agreements, disagreements, the consensus (ratio of agreements over total
axioms compared), and Kappa for the bootstrap method without reconciliation, called Initial, and
after reconciliation, called Reconciled. The round 1 ontology began with 235 concepts and 573
relations from both analysts. The round 2 ontology extended the reconciled round 1 ontology with
132 new concepts.
Table 5.4: Number of Agreements, Disagreements, and Kappa for Concepts and Axioms per Round
Step 1: Each information type pair is mapped to their respective tag sequence pair, e.g., pair
(mobile device, device name) is mapped to (mt, tp), yielding 974 unique tag sequence pairs, which
we call the strata.
Step 2: Proportional stratified sampling is used to draw at least 2,000 samples from all strata with
stratum sizes ranging from 1 to 490. This wide range implies unbalanced strata; e.g., the
proportional sample size for a stratum containing 1-3 pairs, computed as the stratum size divided
by the total number of information type pairs, rounds to zero. Therefore, we guarantee that all
information type pairs in strata of size one are selected to ensure every stratum is covered.
Next, for strata of size two or three, one random information type pair is
selected. For the remaining strata with sizes greater than three, sample sizes are proportional to the
strata size, which yields one or more pairs per stratum. For each stratum, the first sample is drawn
randomly. To draw the remaining samples, we compute a similarity distance between the already
selected pairs and remaining pairs in each stratum as follows. First, we create a bag-of-lemmas
by obtaining word lemmas in the already selected pairs. Next, in each stratum, the pairs with the
least common lemmas with the bag-of-lemmas are selected. We update the bag-of-lemmas after
each selection by adding the lemmas of the selected information type pairs. This strategy ensures
that information type pairs with lower similarity measures are selected, resulting in a broader
variety of words in the sampled set. Moreover, we ensure that each tag sequence is represented by
at least one sampled item, and that sequences with a larger number of examples are proportionally
represented by a larger portion of the sample. Using the initial sample size of 2,000, we captured
2,283 samples from 1,466,328 phrase pairs. The original sample size differs from the final one due
to our strategy for ensuring at least one sampled pair for each stratum. Our samples contain 1,138
unique information types from Lexicon L2.
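The selection strategy above can be sketched as follows. This is a simplified illustration: lemmatization is approximated by lowercased tokens rather than true lemmas, and the strata, target size, and seed are hypothetical.

```python
import random

def lemmas(pair):
    # crude stand-in for true lemmatization: lowercased word tokens
    return set(w for phrase in pair for w in phrase.lower().split())

def stratified_sample(strata, total_target, seed=0):
    """strata: dict mapping tag-sequence key -> list of information type pairs.
    Size-1 strata are taken whole; size 2-3 strata contribute one random pair;
    larger strata contribute proportionally, preferring pairs whose lemmas
    overlap least with the running bag-of-lemmas."""
    rng = random.Random(seed)
    total_pairs = sum(len(v) for v in strata.values())
    sample, bag = {}, set()
    for key, pairs in strata.items():
        if len(pairs) == 1:
            chosen = list(pairs)
        elif len(pairs) <= 3:
            chosen = [rng.choice(pairs)]
        else:
            k = max(1, round(total_target * len(pairs) / total_pairs))
            chosen = [rng.choice(pairs)]          # first sample drawn randomly
            remaining = [p for p in pairs if p not in chosen]
            while len(chosen) < k and remaining:
                # pick the pair sharing the fewest lemmas with the bag
                nxt = min(remaining, key=lambda p: len(lemmas(p) & bag))
                chosen.append(nxt)
                remaining.remove(nxt)
                bag |= lemmas(nxt)
        for p in chosen:
            bag |= lemmas(p)
        sample[key] = chosen
    return sample
```

The greedy least-overlap choice is what drives the broader variety of words in the sampled set.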
To address RQ5 on method reliability, we require a ground truth for relations between the
information types within the sampled pairs. For this reason, we published a survey following the
method described in Section 6.3 [37]. The survey asks subjects to choose a relation for pair (A,B)
from one of the following six options:
s: A is a kind of B, e.g., “mobile device” is a kind of “device.”
S: A is a general form of B, e.g., “device” is a general form of “mobile device.”
P: A is a part of B, e.g., “device identifier” is a part of “device.”
W: A is a whole of B, e.g., “device” is a whole of “device identifier.”
E: A is equivalent to B, e.g., “IP” is equivalent to “Internet protocol.”
U: A is unrelated to B, e.g., “device identifier” is unrelated to “location.”
For this survey, we recruited 30 qualified Amazon Mechanical Turk participants following
the criteria mentioned in Section 7.2.1. Using the survey results and the approach mentioned in
Section 7.2.1, a multi-viewpoint ground truth (GT) was constructed. We plan to publish the survey
results and the GT publicly. We measure the number of true positives (TP), false positives (FP),
and false negatives (FN) by comparing the semantic relations with the multi-view GT to compute
Precision (Prec.) and Recall (Rec.), which are presented in Table 7.4. Our method yields 21,745
total semantic relations from the sampled information types, which we plan to publish publicly
both in text and OWL formats. Overall, the method correctly identifies 1,686 of the 2,283 semantic
relations in the GT ontology.
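Precision and recall against the multi-view ground truth can be computed as set comparisons between predicted and ground-truth relation triples. The sketch below uses invented toy triples, not the study's data.

```python
def precision_recall(predicted, gold):
    """predicted, gold: sets of (lhs, relation, rhs) triples."""
    tp = len(predicted & gold)   # relations found in both
    fp = len(predicted - gold)   # predicted but not in the ground truth
    fn = len(gold - predicted)   # in the ground truth but missed
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if gold else 0.0
    return precision, recall
```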
7.3 Conclusion and Future Work
In this chapter, we introduced a method to infer semantic relations between information types in
privacy policies and their morphological variants based on a context-free grammar and semantic
attachments. This method is constructed based on grounded analysis of information types in 50
privacy policies and tested on information types from 30 policies. Compared to our previously
proposed methods, formally representing the information types reduces the number of false
negatives as well as the time and effort required to infer semantic relations.
In the next chapter, we discuss our neural network classification model to infer semantic relations
that are independent of syntax and rely purely on tacit knowledge, such as the hypernymy relation
between “phone” and “mobile device.”
Chapter 8: LEARNING ONTOLOGIES FROM NATURAL LANGUAGE
POLICIES
Our prior work to construct a privacy ontology (see Chapter 5) [38] requires comparing each
information type with every other type in a lexicon and assigning a semantic relationship to each
pair using seven heuristics. The required effort for this task is quadratic: m × n × (n − 1)/2, where n
is the number of information types in the lexicon, and m is the amount of time required to assign
a relationship to a pair, estimated at 20 seconds [37]. In addition, a new policy introduces between
11 and 36 new types not found in the existing lexicon [6]. Considering app markets
contain hundreds of thousands of apps that change daily, we need to automate ontology construction.
Thus, we propose to predict candidate relationships between information type pairs. Unlike the
syntax-based methods we proposed in Chapters 6 and 7, our approach relies on tacit knowledge
by learning the semantics of words and phrases using word embeddings and a convolutional neural
network (CNN).
In this chapter, we describe an empirical method to learn and construct a formal ontology from a
naïve set of all pairs of information types contained in a lexicon. The contributions of this chapter
are three-fold: (1) a novel neural network architecture for learning to predict semantic
relationships among information type pairs; (2) a novel method to sample an existing ontology to create
a training and testing set that accounts for dependencies among concepts and formal ontological
relations; and (3) an empirical evaluation of the neural network on a real dataset. The remainder of
this chapter is organized as follows. In Section 8.1, we present the relation classification model to
learn an ontology; Section 8.2 describes the experiment designs; the evaluation and results appear
in Section 8.3; and finally, in Section 8.4 we present discussion, limitations of our work, and future
directions.
8.1 Relation Classification Model
Figure 8.1 shows the learning architecture for the relation classification model with a pair of
information types as input. Throughout the chapter, we present an information type pair as
information-type_LHS and information-type_RHS, e.g., (device information, device ID), where LHS (left-hand
side) and RHS (right-hand side) indicate the two predicates in an asymmetric ontological relationship.
The input information types in a privacy policy lexicon can be from a single statement in a policy,
different sections of a single policy, or completely different policies. Given an information type
pair, the Embedding layer first maps the words in an information type to their corresponding word
embedding vectors. Second, word embeddings are fed into the Phrase Modeling layer, creating
a phrase-level semantic vector for each information type phrase. Third, the Semantic Similarity
Calculation compares the direction and distance of the two phrase-level vectors and generates a
similarity vector. Finally, the similarity vector is input to Softmax, generating three probabilities
corresponding to hypernymy, synonymy, and unrelated. We select the most probable relation for
each information type pair.
We now describe these four steps in further detail.
8.1.1 Embedding Layer
Each word in an information type phrase is represented using a pre-trained 200-dimensional vector
called a word embedding [5]. To create domain-specific word embeddings, we followed the approach
by Harkous et al. [32] and trained the Word2Vec model (see Section 2.6.2) using 77,556
English privacy policies collected from mobile applications on the Google Play Store. Common-purpose
word embeddings trained on the English Wikipedia dump [8, 44, 52] or the Google News
dataset [46] exist; however, previous research has shown improvements in classification accuracy
from domain-specific word embeddings [66].
To obtain the privacy policy corpus to train the Word2Vec model, we crawled the metadata
archive for more than 1,402,894 Android apps provided by the PlayDrone project [68] from which
109,933 contained a valid link to a privacy policy. We used the BeautifulSoup library in Python to
Figure 8.1: Ontology Learning Architecture
extract the text from the HTML files by stripping HTML tags associated with: head, script, URL,
navigation, button, and option information. Next, using the DetectLang library in Python, we
filtered out non-English policy text files, yielding 77,556 privacy policies with the majority of text in
English. In the next step, for each privacy policy, we tokenized the sentences and removed all non-
English sentences. We also expanded the contractions (e.g., “won’t” is transformed to “will not”),
and removed punctuation, numbers, email addresses, URLs, and special characters. Finally, we
transformed the remaining characters to lower-case. The resulting pre-processed text was used to
train the Word2Vec [46] model.
The trained word embeddings for the words in our privacy policy corpus are stacked in a word
embedding matrix, which is used in the mapping process. The Embedding layer maps every word
in an input information type phrase from a privacy policy lexicon to its corresponding embedding
vector read from the embedding matrix. To this end, we first identify the maximum phrase length t
by analyzing the number of words in all information types in the privacy policy lexicon. Next, the
information type phrase is padded automatically if the number of words is less than t to reach the
maximum length. This approach ensures that all the input information types have the same length.
Next, using the word embedding matrix, each word in the padded information type is mapped to
its corresponding word embedding vector. If a word cannot be found in the embedding matrix,
our approach assigns a 200-dimensional vector to the word with its elements randomly generated
using the uniform distribution. In the next section, we illustrate how the word embeddings for each
padded information type phrase are utilized to generate a phrase-level semantic vector.
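The mapping just described can be sketched with NumPy as follows. The function and variable names are illustrative, and the tiny vocabulary, embedding dimension, and pad token in the test are assumptions for the example, not the trained model's artifacts.

```python
import numpy as np

def embed_phrase(phrase, vocab, emb_matrix, t, dim, rng):
    """Map a phrase to a (t, dim) matrix of word embeddings, padding to
    the maximum phrase length t; pads and out-of-vocabulary words receive
    random uniform vectors, as described in Section 8.1.1."""
    words = phrase.split()
    words += ["<pad>"] * (t - len(words))   # pad to length t
    rows = []
    for w in words:
        if w in vocab:
            rows.append(emb_matrix[vocab[w]])        # lookup in embedding matrix
        else:
            rows.append(rng.uniform(-1.0, 1.0, dim)) # random vector for OOV/pad
    return np.stack(rows)
```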
8.1.2 Phrase Modeling Layer
In this section, we describe the Phrase Modeling layer (see Figure 8.2) that transforms the word
embeddings of an input information type into a low-dimensional, fixed-size vector using a CNN with
three different filter widths [66]. Implementing a CNN with multiple filter widths captures the local
semantics of n-grams of various granularities [65]. In our case, convolutional filters with widths 1,
2, and 3 capture the semantics of unigrams, bigrams, and trigrams, respectively.
Figure 8.2: Phrase Modeling Layer
We present an example for a convolution filter of width w = 3 and the padded information
type P: device_1 information_2 pad_3 ... pad_{t-1} pad_t with length t, where t is the maximum phrase
length in the privacy policy lexicon, as discussed in Section 8.1.1. The words/pads in the information
type P are represented as a list of vectors (x_1, x_2, ..., x_{t-1}, x_t), where x_i ∈ R^n corresponds
to the word embedding of word/pad i ∈ P and n = 200 is the dimension of the word embeddings
(see Section 8.1.1). Our approach automatically assigns a 200-dimensional vector
with random uniform values to pads and to words that cannot be found in the embedding
matrix. Using the embedding vectors and filter width w = 3, the phrase is represented as follows:
{[x_1; x_2; x_3], ..., [x_{t-2}; x_{t-1}; x_t]}, where “;” denotes vertical vector concatenation. In general, the
result of this module is a matrix X ∈ R^{n_0×t}, where n_0 = w × n. To convolve all the features in X,
we process X using the linear transformation in Equation (8.1).

Z = W_1 X (8.1)

where W_1 ∈ R^{n_1×n_0} is the linear transformation matrix and n_1 is a hyper-parameter representing
the number of filters, which we set to 128 in our approach. The result of the linear transformation
is Z ∈ R^{n_1×t}, which depends on t, the maximum phrase length in the privacy policy
lexicon.
We further apply the hyperbolic tangent (tanh) as a non-linear activation function, see Equation
(8.2), to the result of the linear transformation. To determine the useful features, we apply max
pooling to h. The result of this process is a feature vector of size n_1, which is independent of
the phrase length.

h = tanh(Z) (8.2)
After applying the Phrase Modeling layer to the input information type with three different
convolution filter widths, we retrieve three high-level feature vectors of size n_1, as shown in
Figure 8.2. Finally, we concatenate these three feature vectors to create a single vector representing
the phrase-level semantics. We follow this approach to generate two phrase-level vectors for both
information-type_LHS and information-type_RHS. These two vectors are compared in the Semantic
Similarity Calculation layer, which we discuss next.
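Under these definitions, the Phrase Modeling layer can be sketched in NumPy. The shapes follow the text (n_0 = w × n windows, n_1 filters), while the random weights and small dimensions in the test are illustrative assumptions rather than the trained parameters.

```python
import numpy as np

def conv_max_pool(X, W1):
    """Eq. (8.1)-(8.2) plus max pooling: X is (n0, positions), W1 is (n1, n0)."""
    Z = W1 @ X                 # linear transformation, Eq. (8.1)
    h = np.tanh(Z)             # non-linear activation, Eq. (8.2)
    return h.max(axis=1)       # max pooling over positions -> vector of size n1

def phrase_vector(embeds, filters):
    """embeds: (t, n) word embeddings; filters: {width w: W1 of shape (n1, w*n)}.
    Returns the concatenation of one pooled feature vector per filter width."""
    t, n = embeds.shape
    feats = []
    for w in sorted(filters):
        # stack each window of w consecutive embeddings into one column of X
        X = np.stack([embeds[i:i + w].reshape(-1) for i in range(t - w + 1)], axis=1)
        feats.append(conv_max_pool(X, filters[w]))
    return np.concatenate(feats)
```

With filter widths 1, 2, and 3, the result is a single vector of length 3 × n_1 regardless of the phrase length t, which is what makes the downstream comparison of two phrases possible.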
8.1.3 Semantic Similarity and Softmax
The Semantic Similarity Calculation layer compares the input phrase-level vectors from the Phrase
Modeling layer. We adopt the structure proposed by Tai et al., where the direction and distance of
the two input vectors are compared using the following equations [64]. The vectors PLV_LHS and
PLV_RHS refer to Phrase-Level-Vector_LHS and Phrase-Level-Vector_RHS in Figure 8.1, which are
shortened for simplicity in the equations.

dir = PLV_LHS ⊙ PLV_RHS (8.3)

dis = |PLV_LHS − PLV_RHS| (8.4)

sim = σ(W · dir + U · dis + b) (8.5)
Equation (8.3) compares the direction of the two semantic vectors PLV_LHS and PLV_RHS in
each dimension using the point-wise multiplication operator. To calculate the distance between
PLV_LHS and PLV_RHS, we use the absolute vector subtraction presented in Equation (8.4).
To integrate the results of Equations (8.3) and (8.4), we use a hidden sigmoid layer presented in
Equation (8.5). The resulting similarity vector is then sent to a Softmax classifier, shown in
Equation (8.6), to predict the probabilities of hypernymy, synonymy, and unrelated. We select the
prediction with the highest probability as the relationship between information-type_LHS and
information-type_RHS. Next, we discuss the loss function used to train the relation classifier.
P_relation = softmax(W_p · sim + b_p) (8.6)
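Equations (8.3)-(8.6) can be combined into a single forward pass. The sketch below uses small random parameters purely for illustration; the real layer sizes and trained weights differ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())    # shift for numerical stability
    return e / e.sum()

def classify_pair(plv_lhs, plv_rhs, W, U, b, Wp, bp):
    """Forward pass through the similarity layer and Softmax."""
    d_dir = plv_lhs * plv_rhs                  # Eq. (8.3): point-wise product
    d_dis = np.abs(plv_lhs - plv_rhs)          # Eq. (8.4): absolute difference
    sim = sigmoid(W @ d_dir + U @ d_dis + b)   # Eq. (8.5): hidden sigmoid layer
    probs = softmax(Wp @ sim + bp)             # Eq. (8.6): class probabilities
    labels = ("hypernymy", "synonymy", "unrelated")
    return labels[int(np.argmax(probs))], probs
```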
8.1.4 Loss function
We use weighted cross-entropy loss to measure the performance of the relation classifier with
respect to the predicted probability P_relation, which has been normalized with Softmax, and the actual
label (hypernymy, synonymy, or unrelated). Cross-entropy loss increases as the predicted probability
P_relation diverges from the actual label. We use weights to account for imbalance in ontological
relations, i.e., the number of unrelated pairs is several orders of magnitude larger than all other
relations combined. The weights are calculated using the frequency of each relation's presence
in the ontology as simple ratios: unrelated/total and (hypernymy + synonymy)/total. The unrelated
ratio is applied to the hypernymy and synonymy classes, as determined by the actual label,
to give more weight when determining the loss, and the hypernymy-plus-synonymy ratio is applied
to the unrelated class to give less weight when determining the loss. The loss function is defined
in Equation (8.7).
loss = −Σ_{i∈T} w_i · y_i · ln(P_relation^i) (8.7)

where T is the set of training pairs, and w_i, y_i, and P_relation^i are the weight, actual label, and predicted
probability, respectively, for the i-th information type pair in the training set.
The training involves sending the derivative of the loss function through back-propagation to
update the network parameters using the stochastic gradient descent method. Each epoch iterates
over all the training data, which are divided into multiple batches. After processing each batch of
training data within an epoch, the parameters are updated based on the gradient of the loss function
and another hyper-parameter called the learning rate. This hyper-parameter determines how fast
or slow the network should move towards the optimal solution. The hyper-parameters, including
learning rate, are defined in Sections 8.2.4 and 8.2.5. The training process stops when the loss
value is sufficiently small or fails to decrease [30].
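The class-weight construction and the loss in Equation (8.7) can be sketched as follows. The counts in the test mirror Table 8.2; the function names and the class index order are ours.

```python
import numpy as np

def class_weights(n_hyper, n_syn, n_unrel):
    """Per-class weights: the unrelated ratio up-weights the two minority
    classes, and the (hypernymy + synonymy) ratio down-weights unrelated."""
    total = n_hyper + n_syn + n_unrel
    minority = (n_hyper + n_syn) / total
    majority = n_unrel / total
    # index order: 0 = hypernymy, 1 = synonymy, 2 = unrelated
    return np.array([majority, majority, minority])

def weighted_ce_loss(probs, labels, weights):
    """Eq. (8.7): probs is (N, 3) softmax output, labels is (N,) class indices."""
    p_true = probs[np.arange(len(labels)), labels]   # predicted prob. of true class
    return float(-(weights[labels] * np.log(p_true)).sum())
```

Because the unrelated class dominates, mistakes on the rare hypernymy and synonymy pairs contribute far more to the loss, steering gradient descent away from the trivial all-unrelated classifier.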
8.2 Experiment Designs
In this section, we first describe our motivation for evaluating the relation classification model,
before describing our experimental designs.
The ability to predict an ontology relationship is a multi-class classification problem in which,
given an ordered information type pair, we are interested in whether the first item in the pair is a
hyponym, a synonym, or in another type of relation, including unrelated. To this end, we consider
two views about how the classification model learns to predict these relationships: (1) each rela-
tionship is independent, and the network has learned direct relationships among concepts ignoring
the transitive closure of hypernymy; or (2) relationships can be dependent on one another, and the
network has learned a partial semantic representation that is some subset of the transitive closure
of hypernymy. For example, we assume that (Android ID, mobile device ID) are related through
a hypernymy relationship. Similarly, hypernymy holds for (mobile device ID, device identifier). We
assume the classification model learns semantic relationships among these words based on how
they are used in policy sentences. Under view one, we train the classification model to only learn
these two relationships from the embeddings trained on the policies. Under view two, however,
we further train the model to learn relationships inferred through transitivity, which includes the
hypernymy relationship for (Android ID, device identifier). We hypothesize that this additional
training generalizes to improve the classification of hypernymy, because more abstract hypernyms
can be used to group semantically similar, but not directly related concepts.
We conducted two experiments to evaluate the relation classification model described in Section
8.1. In experiment 1, we evaluate the model's ability to classify whether an information type
pair is a direct hypernymy, synonymy, or otherwise. Experiment 2 differs from experiment 1 by
considering entailed hypernymy relations, which include direct and indirect hyponyms in a single
class, and thus we evaluate the model’s ability to classify information type pairs into one of three
classes: hypernymy (direct and indirect), synonymy, or otherwise.
In this section, we first discuss the unique challenge of learning with ontologies. Second, we
introduce our ground-truth ontology. Third, we introduce a method to generate an early version
of the ontology which serves as the training ontology for our model. Finally, we discuss the
experiments.
8.2.1 Learning with Ontology
In traditional machine learning, each data point in a data frame is independent and thus data points
can be randomly divided into training and testing data. Models are fitted to training data, and
then classifications or predictions are evaluated on testing data. When predicting extensions to an
ontology, however, the relationships between ontology classes are not independent: hypernymy
relationships are transitive, and thus removing a relationship between a superordinate and
subordinate concept can lead to misclassification between the subordinate concept and its ancestor
concepts. To address this challenge, each ontology is treated as a versioned dataset, wherein later
versions contain new information types dependent (via relations) on types found in earlier versions.
To learn a later version, we train the relation classification model on an earlier ontology version.
The earlier version is constructed by randomly eliminating information types from the later ver-
sion, while repairing the earlier version so that all entailments of the early version are contained
in the entailment of any subsequent version (i.e., for each future version, the entailment is
monotonically increasing). Therefore, we train the relation classification model on an early version of
Table 8.1: Examples of Information Types in Platform Information Lexicon L
Information Type        Frequency
IP address              41
browser type            21
Location                11
Operating system type   5
Geo-location            4
Ads clicked             2
Media access control    2
Advertising ID          1
Device brand            1
the ontology (training ontology), and we validate the model on the later version (testing ontology).
We now introduce the ontology used to evaluate our method followed by the steps to generate an
early version for training purposes.
8.2.2 Platform Ontology
As our ground-truth, we utilize the platform ontology (see Section 5.2) that is manually built from
the platform information lexicon (see Section 4.2), which we call L. This lexicon was extracted
from 50 privacy policies [60]. Lexicon L contains phrases that correspond to platform information
types, defined as “any information that the app or another party accesses through the mobile
platform that is not unique to the app.” Each information type in L has a frequency, or number of
times the type appeared in annotations of the 50 policies. Table 8.1 shows example types and their
frequencies in L.
As mentioned in Chapter 5, we manually constructed the platform ontology from L by applying
seven heuristics that were identified through grounded analysis of five privacy policies [38]. The
platform ontology contains 367 information types, which comprise 1,583 hypernymy
and 310 synonymy relationships between pairs of information types.
Formally, the platform ontology is a knowledge base KB expressed using FL0, a sublanguage
of the Attribute Language (AL) in Description Logic (DL). A DL knowledge base KB comprises
two components, the TBox and the ABox [3]. The TBox consists of terminology, i.e., the
vocabulary (concepts and roles) of an application domain. The ABox contains assertions about
named individuals using this vocabulary. The platform ontology knowledge base KB only contains
terminology, which we call the TBox T .
The semantics of FL0 concepts begins with an interpretation I that consists of a non-empty set
Δ^I (the domain of the interpretation) and an interpretation function, which assigns to every atomic
concept C a set C^I ⊆ Δ^I. The TBox T also contains terminological axioms that relate concepts
to each other in the form of subsumption and equivalence, which we use to formalize hypernymy
and synonymy, respectively. A concept C is subsumed by a concept D, written T ⊨ C ⊑ D, if
C^I ⊆ D^I for all interpretations I that satisfy the TBox T. The concept C is equivalent to a concept
D, written T ⊨ C ≡ D, if C^I = D^I for all interpretations I that satisfy the TBox T. Axioms of
the first kind (C ⊑ D) are called inclusions, while axioms of the second kind (C ≡ D) are called
equalities [3]. Note that an equality C ≡ D can be rewritten as two inclusion axioms C ⊑ D
and D ⊑ C [63]. Using this formal representation, we now describe our method to construct an
early version of the platform ontology.
8.2.3 Training Ontology
The procedure to create an early version of an ontology is analogous to sampling from a graph,
wherein concepts (nodes) are related via axioms (edges). Thus, we first briefly introduce traditional
graph sampling goals, after which we introduce our sampling goal and corresponding method.
Graph sampling is the problem of creating a small sample graph that has similar properties
as the target graph [43]. Scale-down sampling and back-in-time sampling describe two common
goals in graph sampling [43]. In scale-down sampling, the goal is to create a sample S on n′ nodes
from a static graph G containing n nodes, where n′ ≪ n. Sample S has similar properties as
graph G, such as degree distribution, clustering coefficient distribution, and hop-plot. Back-in-time
sampling corresponds to traveling back in time and trying to mimic past versions of graph G
[43]. Let G_{n′} denote the graph G at some point in time, when it had exactly n′ nodes. The goal is
to find a sample S on n′ nodes with similar properties as graph G_{n′}, i.e., when graph G was the size
of S. Sampling methods for scale-down and back-in-time sampling are summarized in three main
groups: (1) methods based on randomly selecting nodes; (2) methods that select edges randomly;
and (3) exploration techniques that simulate random walks on a target graph.
In our approach, we are training a supervised machine learning algorithm to predict future
versions of an ontology, and thus our goal is to create a training ontology T ′ that can be used to
construct a training set for our learning task from a target ontology T with n concepts. We call this
sampling goal version sampling, which has the following two properties:
P1. The training ontology T′ follows the scale-down sampling goal, where the number of
sampled concepts n′ ≪ n.
P2. The training ontology T′ is an early version of the target ontology T with n′ concepts,
similar to back-in-time sampling.
Our sampling goal differs from the traditional graph sampling goals for two reasons: (1)
scale-down sampling can yield disconnected graphs, and node reachability is not the main concern,
whereas transitivity is essential to subsumption inference and to ensuring that the entailment of future
ontology versions is monotonically increasing; (2) back-in-time sampling aims to sample an
early version of the graph, whereas our aim is to use the early version of the graph as the sample.
Version sampling is comprised of two algorithms: (1) a weighted random sampling algorithm
to identify candidate concepts to remove from the target ontology; and (2) the remove-and-repair
algorithm that repairs the modified target ontology after removing concepts from the target T to
yield the early version ontology T ′.
Weighted Random Sampling
Concepts in the platform ontology occurred across 50 privacy policies according to the frequencies
recorded in the lexicon L. When choosing concepts to remove from the target ontology to yield an
earlier version, one must consider the likelihood that the concept would be seen in a policy, before
it was added to the ontology during construction. For this reason, we introduce a probability
proportional to size sampling algorithm that is a weighted random sampling method [51]. This
algorithm iteratively takes an information type from lexicon L with inclusion probability inversely
proportional to the information type frequency in L. The algorithm terminates after sampling n′′
information types from L.
Weighted Random Sampling Algorithm:
Input: Lexicon L with n information types and their frequencies
Output: Sampled lexicon L′ with n′′ sampled information types
1: While |L′| < n′′
2: Randomly select an information type i ∈ L − L′
3: Let Fi be the frequency of information type i in L
4: If Fi ≤ 0 then
5: Select information type i and insert it into L′
6: Else
7: Decrement Fi by one
8: End-While
The weighted random sampling algorithm ensures the selection of information types with lower
frequency. Eliminating the sampled information types from the platform ontology results in an
ontology T′ that is more likely to contain information types with higher frequency, and hence a higher
probability of appearing in privacy policies.
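A direct Python transcription of the algorithm above; the toy lexicon in the test is ours, drawn from the example frequencies in Table 8.1.

```python
import random

def weighted_random_sample(lexicon, n_sample, seed=0):
    """Select n_sample information types from lexicon (a dict mapping
    type -> frequency), favoring low-frequency types: a type is only
    selected once its frequency counter has been decremented to zero."""
    rng = random.Random(seed)
    freq = dict(lexicon)                  # working copy of frequencies
    selected = []
    while len(selected) < n_sample:
        candidates = [t for t in freq if t not in selected]
        i = rng.choice(candidates)
        if freq[i] <= 0:
            selected.append(i)            # low-frequency types reach 0 sooner
        else:
            freq[i] -= 1
    return selected
```

Because a frequent type must be drawn many times before its counter reaches zero, rare types are overwhelmingly more likely to enter the sample, matching the inverse-proportional inclusion probability described above.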
Early Version Repair Algorithm
After the n′′ information types have been selected using weighted random sampling, we apply the
remove-and-repair algorithm to the target ontology T to yield the early version ontology T ′, which
contains n′ = n − n′′ concepts. In this algorithm, let f : L′ → T map each sampled information type
in L′ to its corresponding concept in T.
Remove-and-Repair Algorithm:
Input: Sampled lexicon L′, Platform ontology as TBox T
Output: Training ontology as TBox T ′ with n′ concepts.
1: Create a copy of TBox T , called TBox T ′
2: For each information type i ∈ L′
3: Take all inclusion axioms of the form Cj ⊑ f(i) in T′
4: Take all inclusion axioms of the form f(i) ⊑ Pk in T′
5: Create axioms Cj ⊑ Pk and add them to T′
6: Take all equalities of the form f(i) ≡ Em in T′
7: Create axioms Em ⊑ Pk and add them to T′
8: Omit concept f(i) and all inclusion and equality axioms containing f(i) from T′
9: End-For
For both experiments, we apply the version sampling algorithm with n′′ = 100 to the platform
ontology TBox T, which contains n = 367 concepts. The generated training ontology T′ contains
n′ = 267 concepts; thus, the version sampling method satisfies property P1: n′ ≪ n. In addition, T′ is
a consequence of T: for every axiom t ∈ T′, it holds that T ⊨ t, i.e., t is satisfied by all interpretations
I that satisfy the TBox T [63].
We illustrate the version sampling algorithm using a hypothetical ontology in Figure 8.3 that is
created using Web Ontology Language (OWL) in Protégé1, an open-source ontology editor. Every
concept in OWL is a sub-class of the class owl:Thing, which is the top concept in DL. In Figure 8.3,
concepts C-2 and C-3 are equivalent and this equality is shown using two inclusions. Herein, let
L′ = {C-3, C-5}. Both L′ and the ontology in Figure 8.3 are the input to the version sampling
algorithm. First, we create a copy of the ontology called T ′.
For concept C-3, we list all inclusion axioms of the form Cj ⊑ C-3, representing the subclasses
of C-3. This list contains only C-4, which we show as a singleton set: {C-4}. Next, we
list all inclusion axioms of the form C-3 ⊑ Pk, stating the superclasses of C-3. This list only
contains {C-1} as the superclass of C-3. We now create a new inclusion axiom C-4 ⊑ C-1 and
add it to the TBox T′. Next, we list all equalities for concept C-3, which includes only {C-2}.
In line 7 of the remove-and-repair algorithm, we create a new inclusion axiom C-2 ⊑ C-1 and add it
to T′. Finally, we omit concept C-3 and the axioms C-3 ⊑ C-1 and C-3 ≡ C-2 from the new ontology.
1 https://protege.stanford.edu/
Figure 8.3: An Example Ontology
For the second iteration, we repeat lines 3-8 of the algorithm for concept C-5. Since C-5 has no
sub or equivalent classes in the target ontology, there is no need to repair the ontology by adding
additional inclusion axioms. Therefore, the algorithm omits C-5 from T ′ and terminates.
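A compact sketch of the remove-and-repair step, representing the TBox as a set of (sub, super) inclusion pairs with equalities stored as two inclusions (as noted in Section 8.2.2). The toy TBox approximates Figure 8.3, and we assume C-5 ⊑ C-1; unlike the worked example, this sketch also adds repairs through equivalent concepts (e.g., C-4 ⊑ C-2, which is entailed by the target ontology).

```python
def remove_and_repair(tbox, remove):
    """tbox: set of (sub, super) inclusion axioms; remove: concepts to drop.
    Connects every subclass of a removed concept to every superclass of it,
    then deletes all axioms mentioning the removed concept."""
    t = set(tbox)
    for c in remove:
        subs = {a for (a, b) in t if b == c and a != c}
        supers = {b for (a, b) in t if a == c and b != c}
        for a in subs:
            for b in supers:
                if a != b:
                    t.add((a, b))          # repair the entailed subsumption
        t = {(a, b) for (a, b) in t if c not in (a, b)}
    return t
```

Because every subsumption through a removed concept is re-added directly, the entailment of the early version stays contained in the entailment of the target ontology, as required for version sampling.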
The training ontology generated through our method contains 1,026 hypernymy and 183 synonymy
relationships. We now describe our two experiments. The training ontology and the platform ontology
(i.e., our testing ontology) are static artifacts in both experiments.
8.2.4 Experiment 1
In experiment 1, we aim to classify whether a new information type pair describes a direct
hypernymy relationship, a synonymy relationship, or otherwise. To this end, we define direct hypernymy
for concepts C, D if their relation satisfies the following three criteria: (1) C ⊑ D; (2) there exists no
concept E such that C ⊑ E and E ⊑ D; and (3) C ≢ D. We define a synonymy relationship for
concepts C, D if C ≡ D. If a pair is related in any way other than a direct hypernymy or synonymy
relationship, we classify the relationship as unrelated. Using these definitions, we identify direct
hypernyms, synonyms, and unrelated pairs in the training ontology as training-set1 for this experiment.
Similarly, we identify direct hypernyms, synonyms, and unrelated pairs in the testing ontology as
testing-set1, used to evaluate the model. Both datasets are available online. Table 8.2 presents the
Table 8.2: Experiment 1: Number of Hypernym, Synonym, and Unrelated Pairs in Training and Testing Sets

               Direct Hypernymy   Synonymy   Unrelated
Training-set1  1,026              183        34,302
Testing-set1   1,583              310        65,268
Table 8.3: Experiment 1: Training-set1 Information Type Pairs and Semantic Relations

Information-type LHS   Information-type RHS   Semantic Relation Label
Android ID             Mobile device ID       Direct Hypernymy
Mobile device ID       Device identifier      Direct Hypernymy
DID                    Device identifier      Synonymy
URL                    URLs                   Synonymy
Android ID             Device identifier      Unrelated
Call duration          Advertising ID         Unrelated
number of pairs identified for each class in training-set1 and testing-set1. In addition, Table 8.3
presents examples of information type pairs along with their relationships in training-set1.
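Under the definitions in this experiment, labeling a candidate pair reduces to a lookup in the ontology's subsumption structure. The following sketch illustrates this; the `parents` map (each concept to its direct super-classes) and `synonyms` map (equivalences) are hypothetical representations, not the dissertation's data structures:

```python
def label_pair(c, d, parents, synonyms):
    """Experiment 1 labels: direct hypernymy, synonymy, or unrelated.

    `parents` maps a concept to its *direct* super-classes, so membership
    in parents[c] is exactly the direct hypernymy criterion (no
    intermediate concept E with C below E and E below D).
    """
    if d in synonyms.get(c, set()):
        return "synonymy"
    if d in parents.get(c, set()):
        return "direct hypernymy"
    # Transitive hypernyms and everything else count as unrelated here.
    return "unrelated"
```

With the Table 8.3 examples, (Android ID, Mobile device ID) is labeled direct hypernymy, (DID, Device identifier) synonymy, and the entailed but non-direct pair (Android ID, Device identifier) unrelated.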
In this experiment, we aim to answer the following research questions.
RQ1: What is the precision, recall, and F-1 score for the predicted relations?
RQ2: How well can the relation classification model reduce the manual ontology construction
effort?
RQ3: What is the effect of missing transitive hypernymy on classification performance?
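RQ1 relies on per-class precision, recall, and F-1. As a reminder of how these metrics are computed from gold versus predicted labels, a minimal self-contained sketch (function and variable names are our own):

```python
def per_class_scores(gold, pred, labels):
    """Per-class (precision, recall, F-1) from paired label lists."""
    scores = {}
    for lab in labels:
        # True positives, false positives, false negatives for this class.
        tp = sum(1 for g, p in zip(gold, pred) if g == lab and p == lab)
        fp = sum(1 for g, p in zip(gold, pred) if g != lab and p == lab)
        fn = sum(1 for g, p in zip(gold, pred) if g == lab and p != lab)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[lab] = (prec, rec, f1)
    return scores
```

Averaging the F-1 values over the three classes gives the average F-1 score used for model selection in both experiments.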
For experiment 1, we identify the best model configuration based on classification performance
(i.e., the average F-1 score over the three classes) on testing-set1. To this end, we use grid search over six
hyper-parameters of the relation classification model: the number of epochs, dropout keep rate,
batch size, learning rate, convolution activation function, and prediction function. The parameters,
their different configuration options, and best performing selections are shown in Table 8.4.
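A grid search of this kind enumerates the Cartesian product of the hyper-parameter options and keeps the configuration with the best score. The sketch below uses the option grid from Table 8.4; the `grid_search` function and the `score_fn` callback (which stands in for training the model and computing average F-1 on testing-set1) are hypothetical:

```python
from itertools import product

# Hyper-parameter options from Table 8.4.
grid = {
    "epochs": [10, 15],
    "dropout_keep": [0.7, 0.8, 0.9],
    "batch_size": [30, 128, 200],
    "learning_rate": [0.01, 0.001],
    "conv_activation": ["tanh", "relu", "sigmoid"],
    "prediction_fn": ["sigmoid", "softmax"],
}

def grid_search(score_fn, grid):
    """Return the configuration maximizing `score_fn` and its score."""
    keys = sorted(grid)
    best_cfg, best_score = None, float("-inf")
    for values in product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        score = score_fn(cfg)  # e.g., train model, evaluate avg F-1
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

For this grid the search trains and evaluates 2 × 3 × 3 × 2 × 3 × 2 = 216 candidate configurations.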
8.2.5 Experiment 2
In experiment 2, we aim to classify whether a new information type pair is one of hypernymy,
synonymy, or unrelated. This experiment diverges from experiment 1 by listing both direct and
Table 8.4: Experiment 1: Parameter Configuration Options and Selections

Model Hyper-parameter             Hyper-parameter Options   Best Hyper-parameter Selection
Number of Epochs                  10, 15                    10
Dropout Keep Rate                 0.7, 0.8, 0.9             0.9
Batch Size                        30, 128, 200              128
Learning Rate                     0.01, 0.001               0.001
Convolution Activation Function   tanh, relu, sigmoid       tanh
Prediction Function               sigmoid, softmax          sigmoid
Table 8.5: Experiment 2: Number of Hypernym, Synonym, and Unrelated Pairs in Training and Testing Sets

               Hypernymy   Synonymy   Unrelated
Training-set2  3,827       183        31,501
Testing-set2   7,070       310        59,781
transitive hypernymy relationships from a TBox entailment. Therefore, we define a hypernymy
relationship between two concepts C, D such that: (1) C ⊑ D; and (2) C ≢ D. We define a
synonymy relationship for concepts C, D if C ≡ D. If a pair is related in any way other than a
hypernymy or synonymy relationship, we classify the relationship as unrelated. Using this definition,
we create training-set2 and testing-set2 using the training and testing ontologies (see Table 8.5 for the
resulting counts). Additionally, Table 8.6 presents example pairs in training-set2. In contrast to the
instances listed in Table 8.3, the pair (Android ID, device identifier) is labeled as hypernymy in
experiment 2.
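Experiment 2's hypernymy label thus includes the transitive closure of the direct relations. A small sketch of how the closure turns Table 8.3's unrelated pair into Table 8.6's hypernym (the `parents` map of direct super-classes is our own illustrative representation):

```python
def hypernyms(concept, parents):
    """All super-classes of `concept`, direct or transitive."""
    seen, stack = set(), list(parents.get(concept, ()))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            # Follow the super-classes of the super-class (transitivity).
            stack.extend(parents.get(p, ()))
    return seen

# Direct relations from Table 8.6's first two rows.
parents = {
    "Android ID": {"Mobile device ID"},
    "Mobile device ID": {"Device identifier"},
}
```

Here `hypernyms("Android ID", parents)` yields both "Mobile device ID" and the entailed "Device identifier", so the pair (Android ID, Device identifier) becomes a hypernymy instance in training-set2.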
Experiment 2 raises the following research question based on the role of transitive hypernymy
Table 8.6: Experiment 2: Training-set2 Information Type Pairs and Semantic Relations

Information-type LHS   Information-type RHS   Semantic Relation Label
Android ID             Mobile device ID       Hypernymy
Mobile device ID       Device identifier      Hypernymy
Android ID             Device identifier      Hypernymy
DID                    Device identifier      Synonymy
URL                    URLs                   Synonymy
Call duration          Advertising ID         Unrelated
Table 8.7: Experiment 2: Parameter Configuration Options and Selections

Model Hyper-parameter             Hyper-parameter Options   Best Hyper-parameter Selection
Number of Epochs                  10, 15                    10
Dropout Keep Rate                 0.7, 0.8, 0.9             0.9
Batch Size                        30, 128, 200              200
Learning Rate                     0.01, 0.001               0.001
Convolution Activation Function   tanh, relu, sigmoid       relu
Prediction Function               sigmoid, softmax          softmax
relations:
RQ4: How does entailment in hypernymy affect the performance of the relation classification
model in terms of precision, recall, and F-1 score?
We also identify the best model configuration based on classification performance (i.e., the
average F-1 score over the three classes) using grid search over six hyper-parameters for experiment 2.
The parameters, their different configuration options, and best performing selections are shown in
Table 8.7.
Next, we present results for experiments 1 and 2 and address the research questions.
8.3 Experimental Results
In this section, we report our results and answer the research questions described in Sections 8.2.4
and 8.2.5. Recall that we have two experiments: (1) to evaluate hypernymy prediction assuming
independent relations; and (2) to evaluate hypernymy prediction assuming dependent relations.
8.3.1 Experiment 1 Results
In experiment 1, we compare the labels of testing-set1 with the predicted relations to answer
RQ1 and investigate the number of relations correctly predicted by our model. Testing-set1 contains
1,583 information type pairs labeled as direct hypernymy, 310 information type pairs labeled as
synonymy, and 65,268 pairs as unrelated (see Section 8.2.4 for more details). We use precision,
recall, and F-1 score as measures to evaluate the performance of the relation classification model on
Table 8.8: Confusion Matrix for Experiment 1

                 Actual Direct Hypernymy   Actual Synonymy   Actual Unrelated   Total Predictions/Class