VTU QUESTION PAPER
SOLVED ANSWERS
PART A
1.a) What is the KDD Process?
The term Knowledge Discovery in
Databases, or KDD for short, refers to the broad process of finding
knowledge in data, and emphasizes the "high-level" application of
particular data mining methods. It is of interest to researchers in
machine learning, pattern recognition, databases, statistics,
artificial intelligence, knowledge acquisition for expert systems,
and data visualization.
The unifying goal of the KDD process is to extract knowledge
from data in the context of large databases.
It does this by using data mining methods (algorithms) to
extract (identify) what is deemed knowledge, according to the
specifications of measures and thresholds, using a database along
with any required preprocessing, subsampling, and transformations
of that database.
An Outline of the Steps of the KDD Process
The overall process of finding and interpreting patterns from
data involves the repeated application of the following steps:
1. Developing an understanding of
the application domain
the relevant prior knowledge
the goals of the end-user
2. Creating a target data set: selecting a data set, or focusing
on a subset of variables, or data samples, on which discovery is to
be performed.
3. Data cleaning and preprocessing.
Removal of noise or outliers.
Collecting necessary information to model or account for
noise.
Strategies for handling missing data fields.
Accounting for time sequence information and known changes.
4. Data reduction and projection.
Finding useful features to represent the data depending on the
goal of the task.
Using dimensionality reduction or transformation methods to
reduce the effective number of variables under consideration or to
find invariant representations for the data.
5. Choosing the data mining task.
Deciding whether the goal of the KDD process is classification,
regression, clustering, etc.
6. Choosing the data mining algorithm(s).
Selecting method(s) to be used for searching for patterns in the
data.
Deciding which models and parameters may be appropriate.
Matching a particular data mining method with the overall
criteria of the KDD process.
7. Data mining.
Searching for patterns of interest in a particular
representational form or a set of such representations as
classification rules or trees, regression, clustering, and so
forth.
8. Interpreting mined patterns.
9. Consolidating discovered knowledge.
The terms knowledge discovery and data mining are distinct.
KDD refers to the overall process of discovering useful
knowledge from data. It involves the evaluation and possibly
interpretation of the patterns to make the decision of what
qualifies as knowledge. It also includes the choice of encoding
schemes, preprocessing, sampling, and projections of the data prior
to the data mining step. Data mining refers to the application of
algorithms for extracting patterns from data without the additional
steps of the KDD process.
1.b) Tasks of data mining:
Data mining involves six common classes of tasks:
Anomaly detection (outlier/change/deviation detection): the identification of unusual data records that might be interesting, or data errors that require further investigation.
Association rule learning (dependency modeling): searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
Clustering is the task of discovering groups and structures in
the data that are in some way or another "similar", without using
known structures in the data.
Classification is the task of generalizing known structure to
apply to new data. For example, an e-mail program might attempt to
classify an e-mail as "legitimate" or as "spam".
Regression: attempts to find a function which models the data with the least error.
Summarization: providing a more compact representation of the data set, including visualization and report generation.
Sequential pattern mining: finds sets of data items that occur together frequently in some sequences. Sequential pattern mining, which extracts frequent subsequences from a sequence database, has attracted a great deal of interest in recent data mining research because it is the basis of many applications, such as web user analysis, stock trend prediction, DNA sequence analysis, finding language or linguistic patterns in natural language texts, and using the history of symptoms to predict certain kinds of disease.
2.a) Types of attributes:
There are four different types of attributes in data mining.
1. Nominal. Examples: ID numbers, eye color, zip codes.
2. Ordinal. Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}.
3. Interval. Examples: calendar dates, temperatures in Celsius or Fahrenheit.
4. Ratio. Examples: temperature in Kelvin, length, time, counts.
Discrete and Continuous Attributes
Discrete Attribute [nominal and ordinal]
Has only a finite or countably infinite set of values
Examples: zip codes, counts, or the set of words in a collection
of documents
Often represented as integer variables.
Note: binary attributes are a special case of discrete
attributes
Continuous Attribute [interval and ratio]
Has real numbers as attribute values
Examples: temperature, height, or weight.
Practically, real values can only be measured and represented
using a finite number of digits.
Continuous attributes are typically represented as
floating-point variables.
2.b) Data Preprocessing:
No quality data, no quality mining results!
results!
Quality decisions must be based on quality data
e.g., duplicate or missing data may cause incorrect or even
misleading statistics.
Data warehouse needs consistent integration of quality data
"Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse." (Bill Inmon)
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains reduced representation in volume but produces the same
or similar analytical results
Data discretization
Part of data reduction but with particular importance,
especially for numerical data
Forms of Data Preprocessing:
Data Cleaning
Importance:
"Data cleaning is one of the three biggest problems in data warehousing." (Ralph Kimball)
"Data cleaning is the number one problem in data warehousing." (DCI survey)
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
Missing Data
Data is not always available
E.g., many tuples have no recorded value for several attributes,
such as customer income in sales data
Missing data may be due to
Equipment malfunction
Inconsistent with other recorded data and thus deleted
Data not entered due to misunderstanding
Certain data may not be considered important at the time of
entry
History or changes of the data not registered
Missing data may need to be inferred.
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
Fill in the missing value manually: tedious and often infeasible.
Fill it in automatically with:
a global constant: e.g., "unknown" (which may in effect create a new class!)
the attribute mean
the attribute mean for all samples belonging to the same class: smarter
the most probable value: inference-based, such as a Bayesian formula or a decision tree
(A minimal sketch of these fill-in strategies appears after the lists below.)
Noisy Data
Noise: random error or variance in a measured variable. Incorrect attribute values may be due to:
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
Other data problems which require data cleaning:
duplicate records
incomplete data
inconsistent data
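As a rough illustration of the automatic fill-in strategies above, here is a minimal Python sketch using pandas; the column names and the tiny example frame are assumptions for illustration only:

import numpy as np
import pandas as pd

# Hypothetical data: 'income' has missing values, 'class' is the class label.
df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [30.0, np.nan, 50.0, np.nan, 70.0],
})

# Fill with a global constant (in effect a new "unknown" marker).
const_fill = df["income"].fillna(-1)

# Fill with the overall attribute mean.
mean_fill = df["income"].fillna(df["income"].mean())

# Smarter: fill with the attribute mean of samples in the same class.
class_mean_fill = df["income"].fillna(
    df.groupby("class")["income"].transform("mean"))

print(mean_fill.tolist())        # [30.0, 50.0, 50.0, 50.0, 70.0]
print(class_mean_fill.tolist())  # [30.0, 30.0, 50.0, 60.0, 70.0]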
How to Handle Noisy Data? Binning method: first sort data and
partition into (equi-depth) bins
then one can smooth by bin means, smooth by bin median, smooth
by bin boundaries, etc.
Clustering
detect and remove outliers
Combined computer and human inspection
detect suspicious values and check by human (e.g., deal with
possible outliers)
Regression
smooth by fitting the data into regression functions
Simple Discretization Methods: Binning
Equal-width (distance) partitioning: divides the range into N intervals of equal size (a uniform grid). If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N. This is the most straightforward approach, but outliers may dominate the presentation, and skewed data is not handled well.
Equal-depth (frequency) partitioning: divides the range into N intervals, each containing approximately the same number of samples. This gives good data scaling, but managing categorical attributes can be tricky.
Binning methods
They smooth a sorted data value by consulting its neighborhood, that is, the values around it.
The sorted values are partitioned into a number of buckets or
bins.
Smoothing by bin means: Each value in the bin is replaced by the
mean value of the bin.
Smoothing by bin medians: Each value in the bin is replaced by
the bin median.
Smoothing by boundaries: The min and max values of a bin are
identified as the bin boundaries.
Each bin value is replaced by the closest boundary value.
Example: Binning Methods for Data Smoothing Sorted data for
price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
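The smoothing steps above translate directly into code. Here is a minimal Python sketch of equal-depth binning with smoothing by bin means and by bin boundaries, reproducing the price example:

# Equal-depth binning with smoothing, using the sorted price data above.
data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
depth = len(data) // 3  # 3 bins of 4 values each
bins = [data[i:i + depth] for i in range(0, len(data), depth)]

# Smoothing by bin means: replace each value with its bin's (rounded) mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: replace each value with the closer of the
# bin's min and max.
by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b]
             for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]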
3.a) Best split measures:
Data mining tools have to infer a model from the database, and in the case of supervised learning this requires the user to define one or more classes. The database contains one or more attributes that denote the class of a tuple; these are known as predicted attributes, whereas the remaining attributes are called predicting attributes. A combination of values for the predicted attributes defines a class.
When learning classification rules the system has to find the rules that predict the class from the predicting attributes. First the user has to define conditions for each class; the data mining system then constructs descriptions for the classes. Basically, given a case or tuple with certain known attribute values, the system should be able to predict what class this case belongs to.
Once classes are defined, the system should infer the rules that govern the classification; therefore the system should be able to find the description of each class. The descriptions should refer only to the predicting attributes of the training set, so that the positive examples satisfy the description and none of the negative examples do. A rule is said to be correct if its description covers all the positive examples and none of the negative examples of a class.
A rule is generally presented as: if the left hand side (LHS) then the right hand side (RHS), so that in all instances where the LHS is true, the RHS is also true or very probable. The categories of rules are:
exact rule - permits no exceptions so each object of LHS must be
an element of RHS
strong rule - allows some exceptions, but the exceptions have a
given limit
probabilistic rule - relates the conditional probability
P(RHS|LHS) to the probability P(RHS)
Other types of rules are classification rules where LHS is a
sufficient condition to classify objects as belonging to the
concept referred to in the RHS.
3.b) K-nearest neighbor algorithm :
In pattern recognition, the k-Nearest Neighbors algorithm (or
k-NN for short) is a non-parametric method used for classification
and regression.[1] In both cases, the input consists of the k
closest training examples in the feature space. The output depends
on whether k-NN is used for classification or regression:
In k-NN classification, the output is a class membership. An
object is classified by a majority vote of its neighbors, with the
object being assigned to the class most common among its k nearest
neighbors (k is a positive integer, typically small). If k=1, then
the object is simply assigned to the class of that single nearest
neighbor.
In k-NN regression, the output is the property value for the
object. This value is the average of the values of its k nearest
neighbors.
k-NN is a type of instance-based learning, or lazy learning,
where the function is only approximated locally and all computation
is deferred until classification. The k-NN algorithm is among the
simplest of all machine learning algorithms.
Both for classification and regression, it can be useful to
weight the contributions of the neighbors, so that the nearer
neighbors contribute more to the average than the more distant
ones. For example, a common weighting scheme consists in giving
each neighbor a weight of 1/d, where d is the distance to the
neighbor.[2]The neighbors are taken from a set of objects for which
the class (for k-NN classification) or the object property value
(for k-NN regression) is known. This can be thought of as the
training set for the algorithm, though no explicit training step is
required.
A shortcoming of the k-NN algorithm is that it is sensitive to
the local structure of the data.
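As a concrete rendering of the voting procedure just described, here is a minimal self-contained Python sketch of k-NN classification with the optional 1/d weighting; the toy training points are assumptions for illustration:

import math
from collections import defaultdict

def knn_classify(train, query, k=3, weighted=False):
    # train: list of (point, label) pairs; query: a point (tuple of floats).
    # No explicit training step: just rank all examples by distance.
    dists = sorted((math.dist(p, query), label) for p, label in train)
    votes = defaultdict(float)
    for d, label in dists[:k]:
        # Optional 1/d weighting: nearer neighbors count for more.
        votes[label] += 1.0 / d if (weighted and d > 0) else 1.0
    return max(votes, key=votes.get)

train = [((1, 1), "red"), ((1, 2), "red"),
         ((5, 5), "blue"), ((6, 5), "blue")]
print(knn_classify(train, (2, 2), k=3))  # 'red' (2 of 3 nearest are red)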
CNN for data reduction
Condensed nearest neighbor (CNN, the Hart algorithm) is an
algorithm designed to reduce the data set for k-NN
classification.[17] It selects the set of prototypes U from the
training data, such that 1NN with U can classify the examples
almost as accurately as 1NN does with the whole data set.
Three types of points are distinguished: prototypes, class-outliers, and absorbed points. The calculation of the border ratio is explained further below.
Given a training set X, CNN works iteratively:
1. Scan all elements of X, looking for an element x whose
nearest prototype from U has a different label than x.
2. Remove x from X and add it to U.
3. Repeat the scan until no more prototypes are added to U.
Use U instead of X for classification. The examples that are not
prototypes are called "absorbed" points.
It is efficient to scan the training examples in order of
decreasing border ratio.[18] The border ratio of a training example
x is defined as
a(x) = ||x'-y|| / ||x-y||
where ||x-y|| is the distance to the closest example y having a
different color than x, and ||x'-y|| is the distance from y to its
closest example x' with the same label as x.
The border ratio is in the interval [0,1] because ||x'-y|| never exceeds ||x-y||. This ordering gives preference to the borders of the classes for inclusion in the set of prototypes U. A point of a different label than x is called external to x. The calculation of the border ratio is illustrated by the accompanying figure (not reproduced here): the data points are labeled by colors, the initial point x is red, external points are blue and green, y is the external point closest to x, and x' is the red point closest to y. The border ratio a(x) = ||x'-y||/||x-y|| is the attribute of the initial point x.
Below is an illustration of CNN in a series of figures (not reproduced here). There are three classes (red, green and blue). Fig. 1: initially there
are 60 points in each class. Fig. 2 shows the 1NN classification
map: each pixel is classified by 1NN using all the data. Fig. 3
shows the 5NN classification map. White areas correspond to the
unclassified regions, where 5NN voting is tied (for example, if
there are two green, two red and one blue points among 5 nearest
neighbors). Fig. 4 shows the reduced data set. The crosses are the
class-outliers selected by the (3,2)NN rule (all the three nearest
neighbors of these instances belong to other classes); the squares
are the prototypes, and the empty circles are the absorbed points.
The left bottom corner shows the numbers of the class-outliers,
prototypes and absorbed points for all three classes.
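The iterative scan above is compact enough to sketch directly. The following minimal Python sketch (points and labels are illustrative assumptions) grows the prototype set U until 1NN with U agrees with every training label; for simplicity it leaves absorbed points in X instead of removing them, which does not change the resulting U:

import math

def cnn_reduce(X):
    # X: list of (point, label) pairs. Returns the prototype set U.
    U = [X[0]]                     # seed U with an arbitrary example
    changed = True
    while changed:                 # repeat the scan until U stops growing
        changed = False
        for p, label in X:
            # Label of p's nearest prototype in U.
            _, nearest = min(U, key=lambda u: math.dist(u[0], p))
            if nearest != label:   # p is misclassified by 1NN with U
                U.append((p, label))
                changed = True
    return U

X = [((1, 1), "red"), ((1, 2), "red"), ((2, 1), "red"),
     ((8, 8), "blue"), ((9, 8), "blue"), ((8, 9), "blue")]
print(len(cnn_reduce(X)), "prototypes out of", len(X))  # 2 prototypes out of 6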
4.a) Association analysis:
Association rule learning is a popular and well researched method for discovering interesting relations
between variables in large databases. It is intended to identify
strong rules discovered in databases using different measures of
interestingness.[1] Based on the concept of strong rules, Rakesh
Agrawal et al.[2] introduced association rules for discovering
regularities between products in large-scale transaction data
recorded by point-of-sale (POS) systems in supermarkets. For
example, the rule {onions, potatoes} => {burger meat} found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, he or she is likely to also buy hamburger meat. Such information can
be used as the basis for decisions about marketing activities such
as, e.g., promotional pricing or product placements. In addition to
the above example from market basket analysis association rules are
employed today in many application areas including Web usage
mining, intrusion detection, Continuous production, and
bioinformatics. In contrast with sequence mining, association rule
learning typically does not consider the order of items either
within a transaction or across transactions.
To select interesting rules from the set of all possible rules,
constraints on various measures of significance and interest can be
used. The best-known constraints are minimum thresholds on support
and confidence.
The support of an itemset is defined as the proportion of transactions in the data set which contain the itemset: supp(X) = |{t in T : X ⊆ t}| / |T|. In the example database (not reproduced here), the itemset {butter, bread, milk} has a support of 1/5 = 0.2, since it occurs in 20% of all transactions (1 out of 5 transactions).
The confidence of a rule is defined as conf(X => Y) = supp(X ∪ Y) / supp(X). For example, the rule {butter, bread} => {milk} has a confidence of 0.2/0.2 = 1.0 in the database, which means that for 100% of the transactions containing butter and bread the rule is correct (100% of the times a customer buys butter and bread, milk is bought as well). Be careful when reading the expression: here supp(X ∪ Y) means "support for occurrences of transactions where X and Y both appear", not "support for occurrences of transactions where either X or Y appears", the latter interpretation arising because set union is equivalent to logical disjunction. The argument of supp() is a set of preconditions, and thus becomes more restrictive as it grows (instead of more inclusive).
Confidence can be interpreted as an estimate of the probability P(Y|X), the probability of finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS.[3] The lift of a rule is defined as lift(X => Y) = supp(X ∪ Y) / (supp(X) × supp(Y)), or the ratio of the observed support to that expected if X and Y were independent; a lift greater than 1 indicates a positive association.
The conviction of a rule is defined as conv(X => Y) = (1 − supp(Y)) / (1 − conf(X => Y)). The rule in the original example has a conviction of 1.2, and conviction can be interpreted as the ratio of the expected frequency that X occurs without Y (that is to say, the frequency that the rule makes an incorrect prediction) if X and Y were independent, divided by the observed frequency of incorrect predictions. In this example, the conviction value of 1.2 shows that the rule would be incorrect 20% more often (1.2 times as often) if the association between X and Y was purely random chance.
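These definitions are easy to check mechanically. Below is a minimal Python sketch that computes support, confidence, lift and conviction over a toy transaction list; the five transactions are illustrative assumptions, chosen so the example rule's conviction comes out to the 1.2 discussed above:

def supp(itemset, transactions):
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    return supp(set(X) | set(Y), transactions) / supp(X, transactions)

def lift(X, Y, transactions):
    return confidence(X, Y, transactions) / supp(Y, transactions)

def conviction(X, Y, transactions):
    c = confidence(X, Y, transactions)
    return float("inf") if c == 1 else (1 - supp(Y, transactions)) / (1 - c)

# Toy database of five transactions (illustrative only).
T = [frozenset(t) for t in ({"milk", "bread"}, {"butter"}, {"beer"},
                            {"milk", "bread", "butter"}, {"bread"})]
print(supp({"milk", "bread"}, T))                    # 0.4
print(confidence({"milk", "bread"}, {"butter"}, T))  # 0.5
print(lift({"milk", "bread"}, {"butter"}, T))        # 1.25
print(conviction({"milk", "bread"}, {"butter"}, T))  # 1.2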
Apriori is designed to operate on databases containing
transactions (for example, collections of items bought by
customers, or details of a website frequentation). Other algorithms
are designed for finding association rules in data having no
transactions (Winepi and Minepi), or having no timestamps (DNA
sequencing). Each transaction is seen as a set of items (an
itemset). Given a threshold C, the Apriori algorithm identifies the item sets which are subsets of at least C transactions in the database.
Apriori uses a "bottom up" approach, where frequent subsets are
extended one item at a time (a step known as candidate generation),
and groups of candidates are tested against the data. The algorithm
terminates when no further successful extensions are found.
Apriori uses breadth-first search and a hash tree structure to count candidate item sets efficiently. It generates candidate item sets of length k from item sets of length k - 1. Then it prunes the candidates which have an infrequent sub pattern. According to the downward closure lemma, the candidate set contains all frequent k-length item sets. After that, it scans the transaction database to determine frequent item sets among the candidates.
The pseudo code for the algorithm is given below for a transaction database T, and a support threshold of ε. Usual set theoretic notation is employed, though note that T is a multiset. Ck is the candidate set for level k. At each step, the algorithm is assumed to generate the candidate sets from the large item sets of the preceding level, heeding the downward closure lemma. count[c] accesses a field of the data structure that represents candidate set c, which is initially assumed to be zero. Many details are omitted; usually the most important part of the implementation is the data structure used for storing the candidate sets and counting their frequencies.
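A minimal level-wise Python sketch of this candidate-generation-and-test loop may stand in for the listing; it is an illustrative rendering, not the article's pseudo code, and it counts with plain set scans instead of a hash tree:

from itertools import combinations

def apriori(transactions, min_count):
    # Return every itemset contained in at least min_count transactions.
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items
             if sum(i in t for t in transactions) >= min_count]
    frequent, k = list(level), 2
    while level:
        prev = set(level)
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets,
        # then prune by the downward closure lemma (all subsets frequent).
        cands = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        cands = {c for c in cands
                 if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Test: scan the database and keep the frequent candidates.
        level = [c for c in cands
                 if sum(c <= t for t in transactions) >= min_count]
        frequent += level
        k += 1
    return frequent

T = [{"A","B","D","E"}, {"B","C","E"}, {"A","B","D","E"},
     {"A","B","C","E"}, {"A","B","C","D","E"}, {"B","C","D"}]
print(sorted(map(sorted, apriori(T, 3))))

The loop terminates when a level yields no frequent candidates, matching the "no further successful extensions" condition above.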
4.b) In addition to confidence, other measures of interestingness for rules have been proposed. Some popular measures are: all-confidence[10], collective strength[11], conviction[12], leverage[13], and lift (originally called interest)[14]. Definitions of these measures can be found in the cited references. Several more measures are presented and compared by Tan et al.[15] Looking for techniques that can model what the user has known (and using these models as interestingness measures) is currently an active research trend under the name of "Subjective Interestingness."
Statistically sound associations
One limitation of the standard approach to discovering associations is that by searching massive numbers of possible associations to look for collections of items that appear to be associated, there is a large risk of finding many spurious associations. These are collections of items that co-occur with unexpected frequency in the data, but only do so by chance. For example, suppose we are considering a collection of 10,000 items and looking for rules containing two items in the left-hand-side and 1 item in the right-hand-side. There are approximately 1,000,000,000,000 such rules. If we apply a statistical test for independence with a significance level of 0.05, it means there is only a 5% chance of accepting a rule if there is no association. If we assume there are no associations, we should nonetheless expect to find 50,000,000,000 rules. Statistically sound association discovery[16][17] controls this risk, in most cases reducing the risk of finding any spurious associations to a user-specified significance level.
PART B
5.a) FP tree:
FP-Tree structure
The frequent-pattern tree (FP-tree) is a compact structure that
stores quantitative information about frequent patterns in a
database [4].
Han defines the FP-tree as the tree structure defined below [1]:
1. One root labeled as "null", with a set of item-prefix subtrees as children, and a frequent-item-header table (presented in the left side of Figure 1);
2. Each node in the item-prefix subtree consists of three fields:
(a) item-name: registers which item is represented by the node;
(b) count: the number of transactions represented by the portion of the path reaching the node;
(c) node-link: links to the next node in the FP-tree carrying the same item-name, or null if there is none.
3. Each entry in the frequent-item-header table consists of two fields:
(a) item-name: the same as the node's item-name;
(b) head of node-link: a pointer to the first node in the FP-tree carrying the item-name.
Additionally, the frequent-item-header table can have the count support for an item. Figure 1 below shows an example of an FP-tree.
Figure 1: An example of an FP-tree from [17] (figure not reproduced).
The original algorithm to construct the FP-Tree defined by Han
in [1] is presented below in Algorithm 1.
Algorithm 1: FP-tree construction
Input: A transaction database DB and a minimum support threshold ξ.
Output: FP-tree, the frequent-pattern tree of DB.
Method: The FP-tree is constructed as follows.
1. Scan the transaction database DB once. Collect F, the set of
frequent items, and the support of each frequent item. Sort F in
support-descending order as FList, the list of frequent items.
2. Create the root of an FP-tree, T, and label it as null. For
each transaction Trans in DB do the following:
Select the frequent items in Trans and sort them according to
the order of FList. Let the sorted frequent-item list in Trans be [
p | P], where p is the first element and P is the remaining list.
Call insert tree([ p | P], T ).
The function insert tree([p | P], T) is performed as follows. If T has a child N such that N.item-name = p.item-name, then increment N's count by 1; else create a new node N, with its count initialized to 1, its parent link linked to T, and its node-link linked to the nodes with the same item-name via the node-link structure. If P is nonempty, call insert tree(P, N) recursively.
By using this algorithm, the FP-tree is constructed in two scans of the database. The first scan collects and sorts the set of frequent items, and the second constructs the FP-Tree.
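As a rough companion to Algorithm 1, here is a compact Python sketch of the two-scan construction; the Node class and the header table kept as simple lists of node-links are illustrative simplifications, not Han's exact data structure:

from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}

def build_fptree(db, min_sup):
    # Scan 1: collect frequent items, sort by descending support.
    counts = Counter(i for t in db for i in t)
    freq = {i: c for i, c in counts.items() if c >= min_sup}
    flist = sorted(freq, key=lambda i: (-freq[i], i))
    root, header = Node(None, None), {i: [] for i in flist}
    # Scan 2: insert each transaction's sorted frequent items (insert_tree).
    for t in db:
        node = root
        for item in [i for i in flist if i in t]:
            if item in node.children:
                node.children[item].count += 1   # shared prefix: bump count
            else:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)       # node-link via header table
            node = node.children[item]
    return root, header

db = [{"A","B","D","E"}, {"B","C","E"}, {"A","B","D","E"},
      {"A","B","C","E"}, {"A","B","C","D","E"}, {"B","C","D"}]
root, header = build_fptree(db, 3)
print({i: sum(n.count for n in header[i]) for i in header})
# {'B': 6, 'E': 5, 'A': 4, 'C': 4, 'D': 4}

The transactions here are the ones used in the example later in this answer; summing the counts along each item's node-links recovers exactly the supports collected in the first scan.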
FP-Growth Algorithm
After constructing the FP-Tree it is possible to mine it to find
the complete set of frequent patterns. To accomplish this job, Han
in [1] presents a group of lemmas and properties, and thereafter
describes the FP-Growth Algorithm as presented below in Algorithm
2.
Algorithm 2: FP-Growth
Input: A database DB, represented by the FP-tree constructed according to Algorithm 1, and a minimum support threshold ξ.
Output: The complete set of frequent patterns.
Method: call FP-growth(FP-tree, null).
Procedure FP-growth(Tree, α) {
(01) if Tree contains a single prefix path then { // Mining single prefix-path FP-tree
(02) let P be the single prefix-path part of Tree;
(03) let Q be the multipath part with the top branching node replaced by a null root;
(04) for each combination (denoted as β) of the nodes in the path P do
(05) generate pattern β ∪ α with support = minimum support of nodes in β;
(06) let freq pattern set(P) be the set of patterns so generated;
}
(07) else let Q be Tree;
(08) for each item ai in Q do { // Mining multipath FP-tree
(09) generate pattern β = ai ∪ α with support = ai.support;
(10) construct β's conditional pattern-base and then β's conditional FP-tree Treeβ;
(11) if Treeβ ≠ ∅ then
(12) call FP-growth(Treeβ, β);
(13) let freq pattern set(Q) be the set of patterns so generated;
}
(14) return(freq pattern set(P) ∪ freq pattern set(Q) ∪ (freq pattern set(P) × freq pattern set(Q)))
}
When the FP-tree contains a single prefix-path, the complete set
of frequent patterns can be generated in three parts: the single
prefix-path P, the multipath Q, and their combinations (lines 01 to
03 and 14). The resulting patterns for a single prefix path are the
enumerations of its subpaths that have the minimum support (lines
04 to 06). Thereafter, the multipath Q is defined (line 03 or 07)
and the resulting patterns from it are processed (lines 08 to 13).
Finally, in line 14 the combined results are returned as the
frequent patterns found.
An example
This section presents a simple example to illustrate how the
previous algorithm works. The original example can be viewed in
[18].
Consider the transactions below and a minimum support of 3:

TID  Items
1    A, B, D, E
2    B, C, E
3    A, B, D, E
4    A, B, C, E
5    A, B, C, D, E
6    B, C, D
To build the FP-Tree, the supports of the frequent items are first calculated and sorted in decreasing order, resulting in the following list: { B(6), E(5), A(4), C(4), D(4) }. Thereafter, the FP-Tree is iteratively constructed for each transaction, using the sorted list of items, as shown in Figure 2.
(a) Transaction 1: BEAD
(b) Transaction 2: BEC
(c) Transaction 3: BEAD
(d) Transaction 4: BEAC
(e) Transaction 5: BEACD
(f) Transaction 6: BCD
Figure 2: Constructing the FP-Tree iteratively.
As presented in Figure 3, the initial call to FP-Growth uses the
FP-Tree obtained from the Algorithm 1, presented in Figure 2 (f),
to process the projected trees in recursive calls to get the
frequent patterns in the transactions presented before.
Using a depth-first strategy the projected trees are determined
to items D, C, A, E and B, respectively. First the projected tree
for D is recursively processed, projecting trees for DA, DE and DB.
In a similar manner the remaining items are processed. At the end of the process the set of frequent itemsets is: { DAE, DAEB, DAB, DEB, CE, CEB, CB, AE, AEB, AB, EB }.
Figure 3: Projected trees and frequent patterns found by the recursive calls to the FP-Growth algorithm.
5.b) Support-confidence framework:
Generating rules efficiently
1. Brute-force method (for small item sets):
Generate all possible subsets of an item set, excluding the empty set (2^n - 1 subsets), and use them as rule consequents (the remaining items form the antecedents).
Compute the confidence: divide the support of the item set by the support of the antecedent (obtained from the hash table).
Select rules with high confidence (using a threshold). (A sketch of this computation follows this list.)
2. Better way: iterative rule generation with a minimum accuracy constraint.
Observation: if an n-consequent rule holds then all
corresponding (n-1)-consequent rules hold as well.
Algorithm: generate n-consequent candidate rules from
(n-1)-consequent rules (similar to the algorithm for the item
sets).
3. Weka's approach (default settings for Apriori): generate the best 10 rules. Begin with a minimum support of 100% and decrease this in steps of 5%. Stop when 10 rules have been generated or the support falls below 10%. The minimum confidence is 90%.
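As referenced in point 1 above, here is a brute-force Python sketch of rule generation from a single frequent item set; the support values stand in for hash-table lookups and are illustrative assumptions:

from itertools import combinations

# Illustrative supports, as would be read from the item-set hash table.
support = {
    frozenset("ABC"): 0.30, frozenset("AB"): 0.40, frozenset("AC"): 0.35,
    frozenset("BC"): 0.50, frozenset("A"): 0.60, frozenset("B"): 0.70,
    frozenset("C"): 0.55,
}

def rules_from(itemset, min_conf):
    itemset = frozenset(itemset)
    # Every non-empty proper subset serves as a consequent; the remaining
    # items form the antecedent.
    for r in range(1, len(itemset)):
        for cons in combinations(itemset, r):
            antecedent = itemset - set(cons)
            conf = support[itemset] / support[antecedent]
            if conf >= min_conf:
                yield antecedent, frozenset(cons), conf

for lhs, rhs, conf in rules_from("ABC", 0.7):
    print(set(lhs), "=>", set(rhs), round(conf, 2))
# Two rules pass the threshold: AB => C (0.75) and AC => B (0.86).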
Advanced association rules
1. Multi-level association rules: using concept hierarchies.
Example: there are no frequent item sets at the level of the individual attributes:

A    B    C    D    ...
1    0    1    0    ...
0    1    0    1    ...
...  ...  ...  ...  ...

Assume now that A and B are children of A&B, and C and D are children of C&D in concept hierarchies. Assume also that A&B and C&D aggregate the values for their children. Then {A&B, C&D} will be a frequent item set with support 2.
2. Approaches to mining multi-level association rules:
Using uniform support: same minimum support for all levels in
the hierarchies: top-down strategy.
Using reduced minimum support at lower levels: various
approaches to define the minimum support at lower levels.
3. Interpretation of association rules:
Single-dimensional rules: single predicate. Example: buys(x,
diapers) => buys(x, beers). Create a table with as many columns
as possible values for the predicate. Consider these values as
binary attributes (0,1) and when creating the item sets ignore the
0's.
diapers  beers  milk  bread  ...
1        1      0     1      ...
1        1      1     0      ...
...      ...    ...   ...    ...
Multidimensional association rules: multiple predicates.
Example: age(x, 20) and buys(x, computer) => buys(x,
computer_games). Mixed-type attributes. Problem: the algorithms
discussed so far cannot handle numeric attributes.
age  computer  computer_games  ...
20   1         1               ...
35   1         0               ...
...  ...       ...             ...
4. Handling numeric attributes.
Static discretization: discretization based on predefined
ranges.
Discretization based on the distribution of data: binning.
Problem: grouping together very distant values.
Distance-based association rules:
cluster values by distance to generate clusters (intervals or
groups of nominal values).
search for frequent cluster sets.
Approximate Association Rule Mining. Read the paper by Nayak and
Cook.
Correlation analysis
1. High support and high confidence rules are not necessarily interesting. Example:
Assume that A occurs in 60% of the transactions, B in 75%, and both A and B in 40%.
Then the association A => B has support 40% and confidence 66%.
However, P(B) = 75%, higher than P(B|A) = 66%.
In fact, A and B are negatively correlated: corr(A,B) = 0.4/(0.6*0.75) = 0.89.
2. Confidence measures P(B|A), that is, whether A implies B and to what extent.
3. Correlation between occurrences of A and B:
corr(A,B) = P(A,B)/(P(A)P(B))
corr(A,B) < 1 => A and B are negatively correlated.
corr(A,B) > 1 => A and B are positively correlated.
corr(A,B) = 1 => A and B are independent.
4. Contingency table:

              outlook=sunny  outlook!=sunny  Row total
play=yes            2              7             9
play=no             3              2             5
Column total        5              9            14

if outlook=sunny then play=yes [support=14%, confidence=40%].
corr(outlook=sunny, play=yes) = (2/14)/[(5/14)*(9/14)] = 0.62 < 1 => negative correlation.
if outlook=sunny then play=no [support=21%, confidence=60%].
corr(outlook=sunny, play=no) = (3/14)/[(5/14)*(5/14)] = 1.68 > 1 => positive correlation.
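The contingency-table arithmetic above can be checked with a few lines of Python (a minimal sketch; the helper function is an illustrative assumption):

def corr(n_ab, n_a, n_b, n_total):
    # corr(A,B) = P(A,B) / (P(A) * P(B)), from raw co-occurrence counts.
    return (n_ab / n_total) / ((n_a / n_total) * (n_b / n_total))

# outlook=sunny vs play=yes: 2 joint, 5 sunny, 9 yes, 14 total.
print(round(corr(2, 5, 9, 14), 2))  # 0.62 -> negatively correlated
# outlook=sunny vs play=no: 3 joint, 5 sunny, 5 no, 14 total.
print(round(corr(3, 5, 5, 14), 2))  # 1.68 -> positively correlated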
6.a) Cluster analysis:
In an unsupervised learning environment the system has to
discover its own classes and one way in which it does this is to
cluster the data in the database, as shown in Figure 5 below. The first step is to discover subsets of related objects and then find descriptions, e.g. D1, D2, D3, which describe each of these subsets.
Figure 5: Discovering clusters and descriptions in a database (figure not reproduced).
Clustering and segmentation basically partition the database so
that each partition or group is similar according to some criteria
or metric. Clustering according to similarity is a concept which
appears in many disciplines. If a measure of similarity is
available there are a number of techniques for forming clusters.
Membership of groups can be based on the level of similarity
between members and from this the rules of membership can be
defined. Another approach is to build set functions that measure some property of partitions, i.e. groups or subsets, as functions of some parameter of the partition. This latter approach achieves what is known as optimal partitioning.
Many data mining applications make use of clustering according
to similarity for example to segment a client/customer base.
Clustering according to optimization of set functions is used in
data analysis e.g. when setting insurance tariffs the customers can
be segmented according to a number of parameters and the optimal
tariff segmentation achieved.
Clustering/segmentation in databases are the processes of
separating a data set into components that reflect a consistent
pattern of behaviour. Once the patterns have been established they
can then be used to "deconstruct" data into more understandable
subsets and also they provide sub-groups of a population for
further analysis or action which is important when dealing with
very large databases. For example a database could be used for
profile generation for target marketing where previous response to
mailing campaigns can be used to generate a profile of people who
responded and this can be used to predict response and filter
mailing lists to achieve the best response.
6.b) K-means algorithm:
K-Means Method
It is a major technique among the partitioning methods of cluster analysis.
It is formulated with the Euclidean distance:
D(x, y) = (Σi (xi - yi)²)^(1/2)
It is well known and easy to use.
Every object gets assigned to some cluster.
Steps:
Pick k samples.
Make them the initial centroids.
Find the distance from each object to each centroid using the Euclidean distance formula.
Allocate each object to the nearby (smallest-distance) cluster.
Recompute each centroid as the mean of its cluster, and repeat until the assignments no longer change.
Limitations (and hence suggestions for improvement):
Does not deal with overlapping clusters.
Does not consider the size of clusters.
Cluster formation depends only on the initial guess.
Sensitive to outliers.
Not suitable for categorical data.
(A minimal sketch of the algorithm follows this list.)
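As promised above, a minimal Python sketch of k-means; the toy points, the iteration cap, and the seeded random initialization are illustrative choices:

import math
import random

def kmeans(points, k, iters=100, seed=0):
    # points: list of coordinate tuples.
    centroids = random.Random(seed).sample(points, k)  # pick k samples
    for _ in range(iters):
        # Allocate each point to its nearest centroid (Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        # Recompute centroids as cluster means; stop when nothing moves.
        new = [tuple(sum(v) / len(c) for v in zip(*c)) if c else centroids[j]
               for j, c in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    return centroids, clusters

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(pts, 2)
print(centroids)  # two centroids, near (1.33, 1.33) and (8.33, 8.33)

Note the sensitivity to the initial guess listed above: a different seed can change which local optimum the algorithm settles into.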
7.a) setValue and link value:
dwr.util.setValue(id, value)
dwr.util.setValue(id, value, options) finds the element with the id specified in the first parameter and alters its contents to be the value in the second parameter.
By default DWR protects against XSS in dwr.util.setValue by performing output escaping. The optional options object parameter allows output escaping to be disabled with { escapeHtml:false }.
For example:
dwr.util.setValue('x', "Hi");
dwr.util.setValue('x', "<b>Hi</b>", { escapeHtml:false }); // markup rendered as HTML
LINK VALUE:
Link Value Factors - Intro
It's a known fact that no two links are exactly the same; in this research I attempted to see every two links as equals. While this obviously makes it impossible to create a 100% accurate piece of content, this, together with the dozens of interesting comments, results in a document that might be of value for everybody who is somehow involved in building links. The grain of salt that comes with this research is far outweighed by the value of the data presented.
While some might use this document for entertainment purposes only, these results and factors can be interpreted by everybody as an indication that, while lots and lots of different factors come into play in the process, building links sure isn't rocket science. It's way cooler.
7.b) Content based retrieval:
Content based
The content based approach exploits semantic connections between documents and parts thereof, and semantic connections between queries and documents. Most content based document retrieval systems use an inverted index algorithm.
A signature file is a technique that creates a quick-and-dirty filter, for example a Bloom filter, that will keep all the documents that match the query and hopefully only a few that do not. The way this is done is by creating for each file a signature, typically a hash-coded version. One method is superimposed coding. A post-processing step is done to discard the false alarms. Since in most cases this structure is inferior to inverted files in terms of speed, size and functionality, it is not used widely. However, with proper parameters it can beat inverted files in certain environments.
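Since most content based systems rest on an inverted index, here is a minimal Python sketch of one; the documents and the whitespace tokenization are illustrative assumptions:

from collections import defaultdict

def build_index(docs):
    # Map each term to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    # AND-query: ids of documents containing every query term.
    sets = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*sets) if sets else set()

docs = ["data mining finds patterns",
        "text mining derives information from text",
        "databases store data"]
index = build_index(docs)
print(search(index, "data mining"))  # {0}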
8.a) Data mining has many and varied fields of application, some of which are listed below.
Retail/Marketing
Identify buying patterns from customers
Find associations among customer demographic characteristics
Predict response to mailing campaigns
Market basket analysis
Banking
Detect patterns of fraudulent credit card use
Identify `loyal' customers
Predict customers likely to change their credit card
affiliation
Determine credit card spending by customer groups
Find hidden correlations between different financial
indicators
Identify stock trading rules from historical market data
Insurance and Health Care
Claims analysis - i.e. which medical procedures are claimed together
Predict which customers will buy new policies
Identify behaviour patterns of risky customers
Identify fraudulent behaviour
Transportation
Determine the distribution schedules among outlets
Analyse loading patterns
Medicine
Characterise patient behaviour to predict office visits
Identify successful medical therapies for different
illnesses
8.b) Text retrieval:
Text mining, also referred to as text data mining, roughly
equivalent to text analytics, refers to the process of deriving
high-quality information from text. High-quality information is
typically derived through the devising of patterns and trends
through means such as statistical pattern learning. Text mining
usually involves the process of structuring the input text (usually
parsing, along with the addition of some derived linguistic
features and the removal of others, and subsequent insertion into a
database), deriving patterns within the structured data, and
finally evaluation and interpretation of the output. 'High quality'
in text mining usually refers to some combination of relevance,
novelty, and interestingness. Typical text mining tasks include
text categorization, text clustering, concept/entity extraction,
production of granular taxonomies, sentiment analysis, document
summarization, and entity relation modeling (i.e., learning
relations between named entities).
Text analysis involves information retrieval, lexical analysis
to study word frequency distributions, pattern recognition,
tagging/annotation, information extraction, data mining techniques
including link and association analysis, visualization, and
predictive analytics. The overarching goal is, essentially, to turn
text into data for analysis, via application of natural language
processing (NLP) and analytical methods.
A typical application is to scan a set of documents written in a
natural language and either model the document set for predictive
classification purposes or populate a database or search index with
the information extracted.
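As a tiny illustration of the lexical-analysis step mentioned above (studying word frequency distributions), the following Python sketch turns raw text into term counts; the tokenizer and the sample sentences are illustrative assumptions:

import re
from collections import Counter

def term_frequencies(doc):
    # Lexical analysis: lowercase, tokenize into words, count frequencies.
    return Counter(re.findall(r"[a-z]+", doc.lower()))

docs = ["Text mining derives high-quality information from text.",
        "Typical tasks include text categorization and clustering."]
for d in docs:
    print(term_frequencies(d).most_common(3))

Structured counts like these are the kind of database or search-index input that the pattern-derivation and evaluation steps described above operate on.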