Classifying Web Texts Based on Syntactic and Grammatical Modeling

Miaomiao Zhang Department of Language and Communication Studies

NTNU – Norwegian University of Science and Technology NO-7491 Trondheim, Norway

[email protected]

Qinghua Wang Aalto University, School of Electrical Engineering Department of Communications and Networking

P.O.Box 13000, FI-00076 AALTO, Finland [email protected]

Abstract—The Internet has become an indispensable part of human life. One annoyance on the Internet is the presence of malicious web texts, such as web sites containing viruses and spam emails. This paper proposes a web text classification system and algorithms that can help differentiate malicious web texts from normal ones. Syntactic and grammatical features of web texts provide the inputs for the system, and it is the first time that such features have been exploited for web text classification.

Keywords-Computational linguistics; syntactic modeling; grammatical modeling; Internet security; web text classification

I. INTRODUCTION

The Internet is an anarchy in which different views (either good or bad) can be expressed freely. This is a great human invention, and it protects democracy and freedom of speech. However, there are things people may not want to access on the Internet, such as unhealthy websites, websites containing viruses, and spam emails. We call such content malicious web texts; all other texts are called normal texts. The contribution of this paper is an innovative web text classification system that helps differentiate malicious web texts from normal ones.

Many previous works have focused on detecting malicious web texts, such as spam emails [1-4]. Among them, Xie et al. [4] and Thomas et al. [2] use URL features to distinguish spam messages from legitimate ones. Using personal social networks, Li and Shen [3] adopt spam keywords to enhance a Bayesian filter. In addition, Gao et al. [1] use a combination of features, such as the cluster size, the sender’s social degree, interaction history, average time interval, and average URL number per message, to detect spam on online social networks (e.g., Facebook). This paper innovates by proposing to use syntactic and grammatical features of web texts, which have rarely received attention in earlier research.

In the remainder of this paper, Section II presents the system model for web text classification. Section III presents the algorithm used for profiling and comparing web texts, which is the core of the classification system. Section IV concludes the paper.

II. SYSTEM MODEL

The purpose of this paper is to design a web text classification system which can automatically classify web texts of interest into pre-defined categories, e.g. malicious web texts and normal texts. The core of the system is a web text profiling algorithm which uses a decision tree to profile selected web texts based on syntactic and grammatical analyses. The details of the algorithm are elaborated in Section III. The system works by measuring the similarity between an unknown profile and a pre-learned profile. If the unknown profile is similar to the pre-learned profile, then the web texts described by the unknown profile are assigned the same category as the web texts described by the pre-learned profile.

The system works in three steps, shown in Figure 1. The first step is to learn the profiles for labeled web texts (presumably labeled by humans) which belong to a particular pre-defined category. After this step, each category of web texts has a standard profile. This step is called the training phase. The second step is to learn the profiles for unlabeled web texts of interest in the same way as in the training phase. The web texts profiled in this step are those waiting to be classified (e.g. an incoming email which has not been labeled by other means). This step is called the processing phase, or the detection phase if the purpose is to detect malicious texts or spam emails. The last step is to compare the profile learned for the unlabeled texts with the pre-learned profiles of the labeled texts. If the profiles are similar, then the unlabeled texts are considered to belong to the same category as the texts used for the pre-learned profile.
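The three phases can be illustrated with a minimal Python sketch. It rests on several assumptions that are not part of the paper: a profile is flattened to a table of pattern frequencies rather than a tree, the function names `learn_profile` and `classify` are hypothetical, the toy patterns are invented, and total variation distance is used only as a stand-in for the similarity measure (the measure actually proposed is the K-L divergence of Section III).

```python
from collections import Counter

def learn_profile(patterns):
    """Turn a list of patterns (tuples of tags) into relative frequencies."""
    counts = Counter(patterns)
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()}

def total_variation(p, q):
    """Stand-in distance (assumption): half the L1 distance between two profiles."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def classify(unknown_patterns, trained_profiles, distance=total_variation):
    """Return the category whose pre-learned profile is closest to the unknown one."""
    unknown = learn_profile(unknown_patterns)  # processing (detection) phase
    # Comparison phase: pick the category with the smallest distance.
    return min(trained_profiles, key=lambda c: distance(trained_profiles[c], unknown))

# Training phase: one profile per labeled category (toy data, invented for illustration).
trained = {
    "malicious": learn_profile([("Det", "N", "N"), ("N", "V", "V"), ("Det", "N", "N")]),
    "normal": learn_profile([("N", "V", "Adv"), ("Det", "N", "V"), ("N", "V", "Adv")]),
}

result = classify([("N", "V", "Adv"), ("N", "V", "Adv"), ("Det", "N", "V")], trained)
print(result)
```

Passing the distance as a parameter keeps the three phases separate from the choice of similarity measure, so a different measure can be substituted without touching the rest of the pipeline.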

Figure 1: System Model

III. PROFILING WEB TEXTS USING DECISION TREES

A decision tree is a machine learning and data mining technique which is used to assist in decision making or in profiling sequences of operations. It builds a tree-like graph based on the features of training items and can later be used to classify items into different groups (i.e. to make a decision) or to predict a missing feature of an item given its other features. In linguistics, a message consists of different types of words, and these words appear in sequences according to grammar rules. We can thus model and classify different categories of messages using decision trees. In the remainder of this section, the decision tree technique is utilized in a slightly different way from its traditional use: instead of making decisions, a decision tree is used to build message pattern profiles for different categories of web texts.

A. Building Decision Trees Based on Syntactic Decomposition

Tagging word classes and modeling syntactic rules are two important tasks in computational linguistics, the field which deals with the processing and computation of natural languages. Assume that a computer has acquired the word classes and syntactic rules of a certain language (e.g. English) after training by linguists and computer scientists. A sentence can then be described as a sequence of word types. Take the following sentence from a spam message as an example:

Your bank information is not valid.

The phrase structure of this sentence is shown in Figure 2.

Figure 2: Syntactic decomposition of a spam message

If we only consider the syntactic categories of the words forming this sentence, then the message can be abstracted as:

Det (determiner) - N (noun) - N (noun) - V (verb) - Adv (adverb) - Adj (adjective)

The purpose is to learn whether the appearance of this abstracted pattern is normal or whether it represents a malicious message, i.e. spam in this case. In order to learn all patterns of malicious (or normal) messages and their probability of appearance, we can train a decision tree using labeled training data.
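The abstraction from words to syntactic categories can be sketched in Python as follows. The hand-written `LEXICON` and the function name `abstract_pattern` are assumptions made purely for illustration; a real system would rely on a trained part-of-speech tagger, as the text presumes.

```python
# Toy lexicon (assumption): a real system would use a trained part-of-speech tagger.
LEXICON = {
    "your": "Det", "bank": "N", "information": "N",
    "is": "V", "not": "Adv", "valid": "Adj",
}

def abstract_pattern(sentence):
    """Map each word of a sentence to its syntactic category."""
    words = sentence.lower().rstrip(".!?").split()
    # Words missing from the toy lexicon raise KeyError in this sketch.
    return [LEXICON[w] for w in words]

pattern = abstract_pattern("Your bank information is not valid.")
print(" - ".join(pattern))  # Det - N - N - V - Adv - Adj
```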

Suppose that a sentence has a maximum of N words. An example training set of messages is given in Table 1.

Table 1: An example training set of malicious/normal sentences with syntactic features

No.  1st Word  2nd Word  3rd Word  …  Nth Word  Label
1    N         V         Adv       …  N/A       Normal
2    N         V         V         …  N/A       Malicious
3    V         V         Adv       …  Adv       Malicious
4    Det       N         N         …  Adj       Malicious
5    Det       N         V         …  N/A       Normal
6    N         V         Adv       …  N/A       Malicious
7    Det       N         V         …  N/A       Malicious
8    N         V         Adv       …  N/A       Normal

In Table 1, each word can be one of the word types: {N (noun), V (verb), Adj (adjective), P (preposition), Adv (adverb), Det (determiner), Deg (degree word), Qual (qualifier), Aux (auxiliary), Con (conjunction), N/A (not available)}. The kth word is assigned the value N/A if a sentence has no kth word. In addition, a message is labeled either as a “malicious” message or as a “normal” message.
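The N/A convention amounts to padding every tag sequence to a fixed length N. A minimal Python sketch, in which the helper name `to_row` is hypothetical and the truncation of sequences longer than N is an added assumption (the text assumes no sentence exceeds N words):

```python
def to_row(tags, n):
    """Pad (or truncate) a tag sequence to exactly n attributes using "N/A"."""
    return tuple(tags[:n]) + ("N/A",) * max(0, n - len(tags))

row = to_row(["Det", "N", "N", "V", "Adv", "Adj"], 8)
print(row)
```

With N = 8, the six-word example sentence yields a row whose last two attributes are N/A, so every item in the training set has the same number of attributes.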

When it comes to the identification of a malicious message, we may say that a message is malicious because it exhibits a pattern (e.g. in the form of the items in Table 1) which has been observed in other malicious messages (e.g. in the training set provided by Table 1). But this kind of identification is very inaccurate, because a pattern observed in malicious messages may also be observed in normal messages. For example, item 5 and item 7 in Table 1 have the same set of attribute values, yet item 5 is labeled normal and item 7 is labeled malicious. We thus need a more advanced technique which can describe the difference between malicious messages and normal messages from a holistic point of view.

A decision tree provides a method to build complete pattern profiles either for malicious messages or for normal messages. In Table 1, each item represents a message and has a series of attributes, namely 1st word, 2nd word, 3rd word, …, Nth word. Each attribute can take values from the set: {N, V, Adj, P, Adv, Det, Deg, Qual, Aux, Con, N/A}. From Table 1, we can construct a decision tree for malicious messages by iteratively partitioning the data (i.e. those items which are labeled as malicious) into subsets that share the same attribute values. A possible tree is as follows:

Figure 3: Decision tree model for malicious messages based on syntactic analysis

For brevity’s sake, only the attributes of the first three words are analyzed, and tree branches with probability 0 have been omitted. In Figure 3, each branch from the top down represents a pattern that has been observed for malicious messages. For example, the left-most branch in Figure 3 represents the pattern N-V-V. Each node in the tree has an associated observation probability. For a non-leaf node, this is the marginal probability that a malicious message exhibits the partial pattern from the root to that node. For a leaf node, it is the probability that a malicious message has the full pattern from the root to the leaf. Because of the adoption of the “N/A” attribute value, variable-length patterns can be easily handled within the framework of a fixed-depth decision tree.
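The node probabilities described above can be computed by counting pattern prefixes. The following Python sketch uses the first three attributes of the items labeled Malicious in Table 1; the function name `node_probabilities` and the representation of the tree as a dictionary keyed by prefixes are assumptions made for illustration, not the paper's implementation.

```python
from collections import Counter

# First three attributes of the items labeled "Malicious" in Table 1.
MALICIOUS = [
    ("N", "V", "V"),
    ("V", "V", "Adv"),
    ("Det", "N", "N"),
    ("N", "V", "Adv"),
    ("Det", "N", "V"),
]

def node_probabilities(rows):
    """Probability of every root-to-node prefix, i.e. of every node in the tree."""
    counts = Counter(row[:depth] for row in rows for depth in range(1, len(row) + 1))
    return {prefix: c / len(rows) for prefix, c in counts.items()}

probs = node_probabilities(MALICIOUS)
depth = len(MALICIOUS[0])
leaves = {p: v for p, v in probs.items() if len(p) == depth}

print(probs[("N",)])           # marginal probability of a pattern starting with N
print(probs[("N", "V", "V")])  # probability of the full pattern N-V-V
print(sum(leaves.values()))    # the leaf probabilities together form a distribution
```

Here two of the five malicious items start with N, so the node N carries probability 2/5, while each full three-word pattern occurs once and carries probability 1/5.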

Similarly, we can also build a decision tree model for the normal messages shown in Table 1. The result is not shown due to space limitation.

Because of the different categories of messages, we shall expect that the decision tree built for malicious messages is different from the one built for normal messages. If there is a set of messages (e.g. extracted from a web site or from an email) whose category is not determined, we can also build a decision tree for them. In order to determine the category of this set of messages, a similarity test can be performed between the newly built decision tree and the ones which have been trained for malicious messages and for normal messages. We shall say that the set of messages is malicious if its decision tree is more similar to the one trained for malicious messages than to the one trained for normal messages. Otherwise, we say that the set of messages is normal.

As mentioned earlier, the probability associated with a leaf node in a decision tree is the probability of observing the pattern represented by the branch from the root to that leaf node. If each attribute is considered as one dimension in the value space of the patterns, then a specific pattern is a data point in the N-dimensional value space (supposing there are N attributes). From this perspective, the probability of observing a specific pattern is also the probability that a random pattern variable takes a specific discrete data point (i.e. a value). A complete decision tree therefore represents the joint probability distribution of the random pattern variable associated with the category of messages that was used to build the tree. This interpretation is very useful, as we can now compare the similarity of two decision trees by comparing the similarity of their respective probability distributions.

Let Tm and Tn be respectively the probability distributions represented by the decision tree of malicious messages and by the decision tree of normal messages. Let P be the probability distribution represented by the decision tree trained by an unknown set of messages. The similarities or the distances between Tm and P and between Tn and P can be measured using the Kullback-Leibler (K-L) divergence:

DKL (Tm || P) = ∑i Tm(i) × ln (Tm(i) / P(i))    (1)

DKL (Tn || P) = ∑i Tn(i) × ln (Tn(i) / P(i))    (2)

As can be seen from the definition, the K-L divergence is the average, weighted by the first distribution, of the logarithmic difference between two probability distributions. In (1) and (2), i is a data point in the pattern value space for which P(i) is non-zero. In order to have a meaningful measurement of the K-L divergence, all the distributions involved must sum to 1. That is, each must be a probability distribution represented by a complete decision tree in which all non-zero probability tree branches have been included.
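As a small worked illustration of (1), assume a pattern space with only two points, a and b, with Tm(a) = Tm(b) = 0.5 and P(a) = 0.25, P(b) = 0.75. These values are hypothetical and chosen only to make the arithmetic easy to follow.

```latex
\begin{aligned}
D_{KL}(T_m \,\|\, P)
  &= T_m(a)\ln\frac{T_m(a)}{P(a)} + T_m(b)\ln\frac{T_m(b)}{P(b)} \\
  &= 0.5\ln\frac{0.5}{0.25} + 0.5\ln\frac{0.5}{0.75} \\
  &= 0.5\ln 2 + 0.5\ln\tfrac{2}{3} \\
  &= 0.5\ln\tfrac{4}{3} \approx 0.144
\end{aligned}
```

The second term is negative, because P assigns more probability to b than Tm does, yet the sum remains non-negative.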

In order to determine the category of an unknown set of messages, their K-L divergences with respect to Tm and Tn must be compared. A K-L divergence measures the difference between two distributions, and it is always non-negative according to Gibbs’ inequality. Therefore, the larger a K-L divergence is, the less similar the two compared distributions (or their associated decision trees) are. We say that an unknown set of messages is malicious if

DKL (Tm || P) < DKL (Tn || P), (3)

where P is the pattern probability distribution represented by the decision tree built according to the unknown set of messages. Otherwise, we say that the unknown set of messages is normal.

If only one distribution among Tm and Tn is known, the category of an unknown set of messages can be determined by comparing DKL (Tm || P) or DKL (Tn || P) with an empirical threshold.
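Rule (3) and the threshold variant can be sketched in Python as follows. Several choices here are assumptions rather than part of the paper: the function names, the additive smoothing constant `eps` (used so that patterns missing from one distribution do not produce a division by zero), the default threshold of 1.0, the two-category reading of the threshold test, and the two unknown distributions `p1` and `p2`. The distributions `T_M` and `T_N` are the leaf probabilities obtained from the first three words of the malicious and normal items in Table 1.

```python
import math

def kl_divergence(t, p, eps=1e-6):
    """D_KL(t || p) over the union of patterns, with additive smoothing (assumption)."""
    keys = set(t) | set(p)
    t_sum = sum(t.get(k, 0.0) + eps for k in keys)
    p_sum = sum(p.get(k, 0.0) + eps for k in keys)
    total = 0.0
    for k in keys:
        tk = (t.get(k, 0.0) + eps) / t_sum
        pk = (p.get(k, 0.0) + eps) / p_sum
        total += tk * math.log(tk / pk)
    return total

def classify(p, t_m=None, t_n=None, threshold=1.0):
    """Apply rule (3) when both profiles are known, otherwise a threshold test."""
    if t_m is not None and t_n is not None:
        return "malicious" if kl_divergence(t_m, p) < kl_divergence(t_n, p) else "normal"
    if t_m is not None:
        return "malicious" if kl_divergence(t_m, p) < threshold else "normal"
    if t_n is not None:
        return "normal" if kl_divergence(t_n, p) < threshold else "malicious"
    raise ValueError("at least one trained distribution is required")

# Leaf distributions from the first three words of the items in Table 1.
T_M = {("N", "V", "V"): 0.2, ("V", "V", "Adv"): 0.2, ("Det", "N", "N"): 0.2,
       ("N", "V", "Adv"): 0.2, ("Det", "N", "V"): 0.2}
T_N = {("N", "V", "Adv"): 2 / 3, ("Det", "N", "V"): 1 / 3}

# Two hypothetical unknown sets of messages.
p1 = {("N", "V", "Adv"): 0.5, ("Det", "N", "V"): 0.5}
p2 = {("N", "V", "V"): 0.5, ("Det", "N", "N"): 0.5}

print(classify(p1, T_M, T_N))
print(classify(p2, T_M, T_N))
print(classify(p1, t_n=T_N))
```

The outcome depends on the smoothing constant and on the threshold, so in practice both would need to be tuned empirically on labeled data.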

B. Building Decision Trees Based on Grammatical Decomposition

Grammatical roles specify relations between words in sentences. They are bounded categories and well defined (compared to semantic roles, which are inherently unbounded and generally not clearly defined). In Section III.A, it is assumed that a syntactic decomposition can be performed by computers. If we proceed one step further, we can assume that computers are able to understand grammatical relations in sentences. This is actually a practical assumption. The grammatical role of a word in a sentence can be determined by the syntactic category it belongs to and the position at which it appears in the sentence. For example, the subject in English is the nominal element that the verb agrees with. It comes right before the verb in unmarked, declarative clauses and, when pronominalized, employs subjective pronouns. As can be seen, the definition of a grammatical role is quite clear, does not refer to the meanings of words, and can be easily handled by a computer program. Because grammatical relations are language specific and vary from one language to another, the grammatical roles mentioned in the following apply only to English.

Once again, we take the sentence from a spam message as an example: Your bank information is not valid. The grammatical structure of this message is shown in Figure 4.

Figure 4: Grammatical decomposition of a spam message

From Figure 4, we know that this spam message has the pattern Subject – Verb – Complement in terms of grammatical relationships.

In English, a complete set of grammatical roles is defined as: {Subject, Verb, Indirect Object, (Direct) Object, Complement, Adverbial}. English is quite a structured language in terms of the strict order among grammatical roles. For a declarative sentence consisting of a single clause (the other sentence types being interrogative, imperative and exclamative), the full construction is:

Adverbial – Subject – Adverbial – Verb – Indirect Object – Object – Complement – Adverbial

In the above, the Verb element is the most central element and is normally obligatory in all sentences. The Subject element is the other indispensable element. The remaining elements play less important roles and are mainly optional.
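The ordering template above lends itself to a simple pattern check. The following Python sketch encodes each role as one letter and tests a role sequence against the template with a regular expression. The one-letter codes, the function name `follows_canonical_order`, and the treatment of Subject and Verb as strictly required are assumptions made for illustration.

```python
import re

# One-letter codes (assumption) for the grammatical roles listed in the text.
CODES = {
    "Adverbial": "A", "Subject": "S", "Verb": "V",
    "Indirect Object": "I", "Object": "O", "Complement": "C",
}

# Adverbial - Subject - Adverbial - Verb - Indirect Object - Object - Complement - Adverbial,
# with Subject and Verb required and every other slot optional.
CANONICAL = re.compile(r"A?SA?VI?O?C?A?")

def follows_canonical_order(roles):
    """True if the role sequence fits the single-clause declarative template."""
    encoded = "".join(CODES[r] for r in roles)
    return CANONICAL.fullmatch(encoded) is not None

print(follows_canonical_order(["Subject", "Verb", "Complement"]))       # True
print(follows_canonical_order(["Subject", "Complement", "Adverbial"]))  # False
```

The first sequence is the pattern of the example spam sentence; the second lacks a Verb and therefore does not fit the template.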

As in Section III.A, we hope that different groups of web texts exhibit different patterns, or at least different pattern distributions, in terms of grammatical relations. The grammatical patterns profiling different groups of web texts can also be learned in the form of decision trees. Similar to Table 1 in Section III.A, an example training set is shown in Table 2, where N is assumed to be the maximum number of grammatical parts in a sentence (or a clause) and N/A is again used to represent an empty value for absent grammatical parts.

Table 2: An example training set of malicious/normal sentences with grammatical features

No.  1st Part  2nd Part  3rd Part  …  Nth Part  Label
1    Subject   Verb      N/A       …  N/A       Normal
2    Subject   Verb      Object    …  Compl.    Normal
3    Subject   Verb      Compl.    …  N/A       Normal
4    Subject   Verb      Adv.      …  N/A       Normal
5    Adv.      Subject   Verb      …  N/A       Normal
6    Subject   Adv.      Compl.    …  N/A       Malicious
7    Subject   Verb      Compl.    …  N/A       Malicious
8    Subject   Compl.    Adv.      …  Adv.      Malicious

Due to space consideration, we only show the decision tree learned for normal messages. In Figure 5, the decision tree for normal messages is drawn by iteratively partitioning items in Table 2 into subsets which share the same attribute values (in this case, the attributes are 1st Part, 2nd Part, etc.). Only three levels of attributes are drawn in Figure 5.

Figure 5: Decision tree model for normal messages based on grammatical analysis

As the decision tree in Figure 5 can also be interpreted as a joint probability distribution, as was the case for Figure 3, methods used to compare the similarity of probability distributions can also be used to compare the similarity of decision trees (and hence of web texts, since decision trees are used to profile web texts in this paper). If an unknown group of web texts is also profiled with a decision tree, comparing that tree with a pre-learned one (such as the one in Figure 5) tells whether the unknown group belongs to the same category as the web texts that were used to build the pre-learned decision tree. The details of the classification technique based on the comparison of decision trees are given in Equations (1)-(3) and the surrounding text.

IV. CONCLUSIONS

This paper presents a web text classification system which uses decision trees to profile texts. Syntactic analysis and grammatical analysis of texts are used to help build precise profiles. The technique presented in this paper can be used to detect malicious and spam messages on the Internet. It can also be used to improve the efficiency of language documentation by automating the process of identifying interesting texts among the samples gathered (or crawled) from the Internet.

REFERENCES

[1] H. Gao, Y. Chen, K. Lee, D. Palsetia, and A. Choudhary, “Towards online spam filtering in social networks,” in Proc. of the 19th Network & Distributed System Security Symposium (NDSS), 2012.

[2] K. Thomas, C. Grier, J. Ma, V. Paxson, and D. Song, “Design and evaluation of a real-time URL spam filtering service,” in Proc. of the IEEE Symposium on Security and Privacy, May 2011.

[3] Z. Li and H. Shen, “SOAP: A social network aided personalized and effective spam filter to clean your e-mail box,” in Proc. of IEEE INFOCOM, April 2011.

[4] Y. Xie, F. Yu, K. Achan, R. Panigrahy, G. Hulten, and I. Osipkov, “Spamming botnets: signatures and characteristics,” in Proc. of SIGCOMM, 2008.
