VTU QUESTION PAPER
SOLVED ANSWERS
PART A
1.a) What is the KDD Process?
The term Knowledge Discovery in
Databases, or KDD for short, refers to the broad process of finding
knowledge in data, and emphasizes the "high-level" application of
particular data mining methods. It is of interest to researchers in
machine learning, pattern recognition, databases, statistics,
artificial intelligence, knowledge acquisition for expert systems,
and data visualization.
The unifying goal of the KDD process is to extract knowledge
from data in the context of large databases.
It does this by using data mining methods (algorithms) to
extract (identify) what is deemed knowledge, according to the
specifications of measures and thresholds, using a database along
with any required preprocessing, subsampling, and transformations
of that database.
An Outline of the Steps of the KDD Process
The overall process of finding and interpreting patterns from
data involves the repeated application of the following steps:
1. Developing an understanding of
the application domain
the relevant prior knowledge
the goals of the end-user
2. Creating a target data set: selecting a data set, or focusing
on a subset of variables, or data samples, on which discovery is to
be performed.
3. Data cleaning and preprocessing.
Removal of noise or outliers.
Collecting necessary information to model or account for
noise.
Strategies for handling missing data fields.
Accounting for time sequence information and known changes.
4. Data reduction and projection.
Finding useful features to represent the data depending on the
goal of the task.
Using dimensionality reduction or transformation methods to
reduce the effective number of variables under consideration or to
find invariant representations for the data.
5. Choosing the data mining task.
Deciding whether the goal of the KDD process is classification,
regression, clustering, etc.
6. Choosing the data mining algorithm(s).
Selecting method(s) to be used for searching for patterns in the
data.
Deciding which models and parameters may be appropriate.
Matching a particular data mining method with the overall
criteria of the KDD process.
7. Data mining.
Searching for patterns of interest in a particular
representational form or a set of such representations as
classification rules or trees, regression, clustering, and so
forth.
8. Interpreting mined patterns.
9. Consolidating discovered knowledge.
The terms knowledge discovery and data mining are distinct.
KDD refers to the overall process of discovering useful
knowledge from data. It involves the evaluation and possibly
interpretation of the patterns to make the decision of what
qualifies as knowledge. It also includes the choice of encoding
schemes, preprocessing, sampling, and projections of the data prior
to the data mining step. Data mining refers to the application of
algorithms for extracting patterns from data without the additional
steps of the KDD process.
1.b) Tasks of data mining:
Data mining involves six common classes of tasks:
Anomaly detection (outlier/change/deviation detection): the identification of unusual data records that might be interesting, or data errors that require further investigation.
Association rule learning (dependency modeling): searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
Clustering is the task of discovering groups and structures in
the data that are in some way or another "similar", without using
known structures in the data.
Classification is the task of generalizing known structure to
apply to new data. For example, an e-mail program might attempt to
classify an e-mail as "legitimate" or as "spam".
Regression: attempts to find a function which models the data with the least error.
Summarization: providing a more compact representation of the data set, including visualization and report generation.
Sequential pattern mining: finds sets of data items that occur together frequently in some sequences. Sequential pattern mining, which extracts frequent subsequences from a sequence database, has attracted a great deal of interest in recent data mining research because it is the basis of many applications, such as web user analysis, stock trend prediction, DNA sequence analysis, finding language or linguistic patterns in natural language texts, and using the history of symptoms to predict certain kinds of disease.
2.a) Types of attributes:
There are four different types of attributes in data mining.
1. Nominal. Examples: ID numbers, eye color, zip codes.
2. Ordinal. Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}.
3. Interval. Examples: calendar dates, temperatures in Celsius or Fahrenheit.
4. Ratio. Examples: temperature in Kelvin, length, time, counts.
Discrete and Continuous Attributes
Discrete Attribute [nominal and ordinal]
Has only a finite or countably infinite set of values
Examples: zip codes, counts, or the set of words in a collection
of documents
Often represented as integer variables.
Note: binary attributes are a special case of discrete
attributes
Continuous Attribute [interval and ratio]
Has real numbers as attribute values
Examples: temperature, height, or weight.
Practically, real values can only be measured and represented
using a finite number of digits.
Continuous attributes are typically represented as
floating-point variables.
2.b) Data Preprocessing:
No quality data, no quality mining results!
results!
Quality decisions must be based on quality data
e.g., duplicate or missing data may cause incorrect or even
misleading statistics.
Data warehouse needs consistent integration of quality data
"Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse." (Bill Inmon)
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains reduced representation in volume but produces the same
or similar analytical results
Data discretization
Part of data reduction but with particular importance,
especially for numerical data
Forms of Data Preprocessing:
Data Cleaning
Importance:
"Data cleaning is one of the three biggest problems in data warehousing." (Ralph Kimball)
"Data cleaning is the number one problem in data warehousing." (DCI survey)
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
Missing Data
Data is not always available
E.g., many tuples have no recorded value for several attributes,
such as customer income in sales data
Missing data may be due to
Equipment malfunction
Inconsistent with other recorded data and thus deleted
Data not entered due to misunderstanding
Certain data may not be considered important at the time of
entry
History or changes of the data not registered
Missing data may need to be inferred.
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
Fill in the missing value manually: tedious and often infeasible.
Fill it in automatically with:
a global constant: e.g., "unknown" (which may in effect create a new class!)
the attribute mean
the attribute mean for all samples belonging to the same class: smarter
the most probable value: inference-based, such as a Bayesian formula or a decision tree
(A minimal sketch of these fill-in strategies appears after the lists below.)
Noisy Data
Noise: random error or variance in a measured variable. Incorrect attribute values may be due to:
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
Other data problems which require data cleaning:
duplicate records
incomplete data
inconsistent data
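As a rough illustration of the automatic fill-in strategies above, here is a minimal Python sketch using pandas; the column names and the tiny example frame are assumptions for illustration only:

import numpy as np
import pandas as pd

# Hypothetical data: 'income' has missing values, 'class' is the class label.
df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [30.0, np.nan, 50.0, np.nan, 70.0],
})

# Fill with a global constant (in effect a new "unknown" marker).
const_fill = df["income"].fillna(-1)

# Fill with the overall attribute mean.
mean_fill = df["income"].fillna(df["income"].mean())

# Smarter: fill with the attribute mean of samples in the same class.
class_mean_fill = df["income"].fillna(
    df.groupby("class")["income"].transform("mean"))

print(mean_fill.tolist())        # [30.0, 50.0, 50.0, 50.0, 70.0]
print(class_mean_fill.tolist())  # [30.0, 30.0, 50.0, 60.0, 70.0]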
How to Handle Noisy Data? Binning method: first sort data and
partition into (equi-depth) bins
then one can smooth by bin means, smooth by bin median, smooth
by bin boundaries, etc.
Clustering
detect and remove outliers
Combined computer and human inspection
detect suspicious values and check by human (e.g., deal with
possible outliers)
Regression
smooth by fitting the data into regression functions
Simple Discretization Methods: Binning
Equal-width (distance) partitioning: divides the range into N intervals of equal size (a uniform grid). If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N. This is the most straightforward approach, but outliers may dominate the presentation, and skewed data is not handled well.
Equal-depth (frequency) partitioning: divides the range into N intervals, each containing approximately the same number of samples. This gives good data scaling, but managing categorical attributes can be tricky.
Binning methods
They smooth a sorted data value by consulting its neighborhood, that is, the values around it.
The sorted values are partitioned into a number of buckets or
bins.
Smoothing by bin means: Each value in the bin is replaced by the
mean value of the bin.
Smoothing by bin medians: Each value in the bin is replaced by
the bin median.
Smoothing by boundaries: The min and max values of a bin are
identified as the bin boundaries.
Each bin value is replaced by the closest boundary value.
Example: Binning Methods for Data Smoothing Sorted data for
price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
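The smoothing steps above translate directly into code. Here is a minimal Python sketch of equal-depth binning with smoothing by bin means and by bin boundaries, reproducing the price example:

# Equal-depth binning with smoothing, using the sorted price data above.
data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
depth = len(data) // 3  # 3 bins of 4 values each
bins = [data[i:i + depth] for i in range(0, len(data), depth)]

# Smoothing by bin means: replace each value with its bin's (rounded) mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: replace each value with the closer of the
# bin's min and max.
by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b]
             for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]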
3.a) Best split measures:
Data mining tools have to infer a model from the database, and in the case of supervised learning this requires the user to define one or more classes. The database contains one or more attributes that denote the class of a tuple; these are known as predicted attributes, whereas the remaining attributes are called predicting attributes. A combination of values for the predicted attributes defines a class.
When learning classification rules the system has to find the rules that predict the class from the predicting attributes. First the user has to define conditions for each class; the data mining system then constructs descriptions for the classes. Basically, given a case or tuple with certain known attribute values, the system should be able to predict what class this case belongs to.
Once classes are defined, the system should infer the rules that govern the classification; therefore the system should be able to find the description of each class. The descriptions should refer only to the predicting attributes of the training set, so that the positive examples satisfy the description and none of the negative examples do. A rule is said to be correct if its description covers all the positive examples and none of the negative examples of a class.
A rule is generally presented as: if the left hand side (LHS) then the right hand side (RHS), so that in all instances where the LHS is true, the RHS is also true or very probable. The categories of rules are:
exact rule - permits no exceptions so each object of LHS must be
an element of RHS
strong rule - allows some exceptions, but the exceptions have a
given limit
probabilistic rule - relates the conditional probability
P(RHS|LHS) to the probability P(RHS)
Other types of rules are classification rules where LHS is a
sufficient condition to classify objects as belonging to the
concept referred to in the RHS.
3.b) K-nearest neighbor algorithm :
In pattern recognition, the k-Nearest Neighbors algorithm (or
k-NN for short) is a non-parametric method used for classification
and regression.[1] In both cases, the input consists of the k
closest training examples in the feature space. The output depends
on whether k-NN is used for classification or regression:
In k-NN classification, the output is a class membership. An
object is classified by a majority vote of its neighbors, with the
object being assigned to the class most common among its k nearest
neighbors (k is a positive integer, typically small). If k=1, then
the object is simply assigned to the class of that single nearest
neighbor.
In k-NN regression, the output is the property value for the
object. This value is the average of the values of its k nearest
neighbors.
k-NN is a type of instance-based learning, or lazy learning,
where the function is only approximated locally and all computation
is deferred until classification. The k-NN algorithm is among the
simplest of all machine learning algorithms.
Both for classification and regression, it can be useful to
weight the contributions of the neighbors, so that the nearer
neighbors contribute more to the average than the more distant
ones. For example, a common weighting scheme consists in giving
each neighbor a weight of 1/d, where d is the distance to the
neighbor.[2]The neighbors are taken from a set of objects for which
the class (for k-NN classification) or the object property value
(for k-NN regression) is known. This can be thought of as the
training set for the algorithm, though no explicit training step is
required.
A shortcoming of the k-NN algorithm is that it is sensitive to
the local structure of the data.
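As a concrete rendering of the voting procedure just described, here is a minimal self-contained Python sketch of k-NN classification with the optional 1/d weighting; the toy training points are assumptions for illustration:

import math
from collections import defaultdict

def knn_classify(train, query, k=3, weighted=False):
    # train: list of (point, label) pairs; query: a point (tuple of floats).
    # No explicit training step: just rank all examples by distance.
    dists = sorted((math.dist(p, query), label) for p, label in train)
    votes = defaultdict(float)
    for d, label in dists[:k]:
        # Optional 1/d weighting: nearer neighbors count for more.
        votes[label] += 1.0 / d if (weighted and d > 0) else 1.0
    return max(votes, key=votes.get)

train = [((1, 1), "red"), ((1, 2), "red"),
         ((5, 5), "blue"), ((6, 5), "blue")]
print(knn_classify(train, (2, 2), k=3))  # 'red' (2 of 3 nearest are red)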
CNN for data reduction
Condensed nearest neighbor (CNN, the Hart algorithm) is an
algorithm designed to reduce the data set for k-NN
classification.[17] It selects the set of prototypes U from the
training data, such that 1NN with U can classify the examples
almost as accurately as 1NN does with the whole data set.
Three types of points are distinguished: prototypes, class-outliers, and absorbed points. The calculation of the border ratio is explained further below.
Given a training set X, CNN works iteratively:
1. Scan all elements of X, looking for an element x whose
nearest prototype from U has a different label than x.
2. Remove x from X and add it to U.
3. Repeat the scan until no more prototypes are added to U.
Use U instead of X for classification. The examples that are not
prototypes are called "absorbed" points.
It is efficient to scan the training examples in order of
decreasing border ratio.[18] The border ratio of a training example
x is defined as
a(x) = ||x'-y|| / ||x-y||
where ||x-y|| is the distance to the closest example y having a
different color than x, and ||x'-y|| is the distance from y to its
closest example x' with the same label as x.
The border ratio is in the interval [0,1] because ||x'-y|| never exceeds ||x-y||. This ordering gives preference to the borders of the classes for inclusion in the set of prototypes U. A point of a different label than x is called external to x. The calculation of the border ratio is illustrated by the accompanying figure (not reproduced here): the data points are labeled by colors, the initial point x is red, external points are blue and green, y is the external point closest to x, and x' is the red point closest to y. The border ratio a(x) = ||x'-y||/||x-y|| is the attribute of the initial point x.
Below is an illustration of CNN in a series of figures (not reproduced here). There are three classes (red, green and blue). Fig. 1: initially there
are 60 points in each class. Fig. 2 shows the 1NN classification
map: each pixel is classified by 1NN using all the data. Fig. 3
shows the 5NN classification map. White areas correspond to the
unclassified regions, where 5NN voting is tied (for example, if
there are two green, two red and one blue points among 5 nearest
neighbors). Fig. 4 shows the reduced data set. The crosses are the
class-outliers selected by the (3,2)NN rule (all the three nearest
neighbors of these instances belong to other classes); the squares
are the prototypes, and the empty circles are the absorbed points.
The left bottom corner shows the numbers of the class-outliers,
prototypes and absorbed points for all three classes.
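The iterative scan above is compact enough to sketch directly. The following minimal Python sketch (points and labels are illustrative assumptions) grows the prototype set U until 1NN with U agrees with every training label; for simplicity it leaves absorbed points in X instead of removing them, which does not change the resulting U:

import math

def cnn_reduce(X):
    # X: list of (point, label) pairs. Returns the prototype set U.
    U = [X[0]]                     # seed U with an arbitrary example
    changed = True
    while changed:                 # repeat the scan until U stops growing
        changed = False
        for p, label in X:
            # Label of p's nearest prototype in U.
            _, nearest = min(U, key=lambda u: math.dist(u[0], p))
            if nearest != label:   # p is misclassified by 1NN with U
                U.append((p, label))
                changed = True
    return U

X = [((1, 1), "red"), ((1, 2), "red"), ((2, 1), "red"),
     ((8, 8), "blue"), ((9, 8), "blue"), ((8, 9), "blue")]
print(len(cnn_reduce(X)), "prototypes out of", len(X))  # 2 prototypes out of 6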
4.a) Association analysis:
Association rule learning is a popular and well researched method for discovering interesting relations
between variables in large databases. It is intended to identify
strong rules discovered in databases using different measures of
interestingness.[1] Based on the concept of strong rules, Rakesh
Agrawal et al.[2] introduced association rules for discovering
regularities between products in large-scale transaction data
recorded by point-of-sale (POS) systems in supermarkets. For
example, the rule {onions, potatoes} => {burger meat} found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, he or she is likely to also buy hamburger meat. Such information can
be used as the basis for decisions about marketing activities such
as, e.g., promotional pricing or product placements. In addition to
the above example from market basket analysis association rules are
employed today in many application areas including Web usage
mining, intrusion detection, Continuous production, and
bioinformatics. In contrast with sequence mining, association rule
learning typically does not consider the order of items either
within a transaction or across transactions.
To select interesting rules from the set of all possible rules,
constraints on various measures of significance and interest can be
used. The best-known constraints are minimum thresholds on support
and confidence.
The support of an itemset is defined as the proportion of transactions in the data set which contain the itemset: supp(X) = |{t in T : X ⊆ t}| / |T|. In the example database (not reproduced here), the itemset {butter, bread, milk} has a support of 1/5 = 0.2, since it occurs in 20% of all transactions (1 out of 5 transactions).
The confidence of a rule is defined as conf(X => Y) = supp(X ∪ Y) / supp(X). For example, the rule {butter, bread} => {milk} has a confidence of 0.2/0.2 = 1.0 in the database, which means that for 100% of the transactions containing butter and bread the rule is correct (100% of the times a customer buys butter and bread, milk is bought as well). Be careful when reading the expression: here supp(X ∪ Y) means "support for occurrences of transactions where X and Y both appear", not "support for occurrences of transactions where either X or Y appears", the latter interpretation arising because set union is equivalent to logical disjunction. The argument of supp() is a set of preconditions, and thus becomes more restrictive as it grows (instead of more inclusive).
Confidence can be interpreted as an estimate of the probability P(Y|X), the probability of finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS.[3] The lift of a rule is defined as lift(X => Y) = supp(X ∪ Y) / (supp(X) × supp(Y)), or the ratio of the observed support to that expected if X and Y were independent; a lift greater than 1 indicates a positive association.
The conviction of a rule is defined as conv(X => Y) = (1 − supp(Y)) / (1 − conf(X => Y)). The rule in the original example has a conviction of 1.2, and conviction can be interpreted as the ratio of the expected frequency that X occurs without Y (that is to say, the frequency that the rule makes an incorrect prediction) if X and Y were independent, divided by the observed frequency of incorrect predictions. In this example, the conviction value of 1.2 shows that the rule would be incorrect 20% more often (1.2 times as often) if the association between X and Y was purely random chance.
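These definitions are easy to check mechanically. Below is a minimal Python sketch that computes support, confidence, lift and conviction over a toy transaction list; the five transactions are illustrative assumptions, chosen so the example rule's conviction comes out to the 1.2 discussed above:

def supp(itemset, transactions):
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    return supp(set(X) | set(Y), transactions) / supp(X, transactions)

def lift(X, Y, transactions):
    return confidence(X, Y, transactions) / supp(Y, transactions)

def conviction(X, Y, transactions):
    c = confidence(X, Y, transactions)
    return float("inf") if c == 1 else (1 - supp(Y, transactions)) / (1 - c)

# Toy database of five transactions (illustrative only).
T = [frozenset(t) for t in ({"milk", "bread"}, {"butter"}, {"beer"},
                            {"milk", "bread", "butter"}, {"bread"})]
print(supp({"milk", "bread"}, T))                    # 0.4
print(confidence({"milk", "bread"}, {"butter"}, T))  # 0.5
print(lift({"milk", "bread"}, {"butter"}, T))        # 1.25
print(conviction({"milk", "bread"}, {"butter"}, T))  # 1.2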
Apriori is designed to operate on databases containing
transactions (for example, collections of items bought by
customers, or details of a website frequentation). Other algorithms
are designed for finding association rules in data having no
transactions (Winepi and Minepi), or having no timestamps (DNA
sequencing). Each transaction is seen as a set of items (an
itemset). Given a threshold C, the Apriori algorithm identifies the item sets which are subsets of at least C transactions in the database.
Apriori uses a "bottom up" approach, where frequent subsets are
extended one item at a time (a step known as candidate generation),
and groups of candidates are tested against the data. The algorithm
terminates when no further successful extensions are found.
Apriori uses breadth-first search and a hash tree structure to count candidate item sets efficiently. It generates candidate item sets of length k from item sets of length k - 1. Then it prunes the candidates which have an infrequent sub pattern. According to the downward closure lemma, the candidate set contains all frequent k-length item sets. After that, it scans the transaction database to determine frequent item sets among the candidates.
The pseudo code for the algorithm is given below for a transaction database T, and a support threshold of ε. Usual set theoretic notation is employed, though note that T is a multiset. Ck is the candidate set for level k. At each step, the algorithm is assumed to generate the candidate sets from the large item sets of the preceding level, heeding the downward closure lemma. count[c] accesses a field of the data structure that represents candidate set c, which is initially assumed to be zero. Many details are omitted; usually the most important part of the implementation is the data structure used for storing the candidate sets and counting their frequencies.
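A minimal level-wise Python sketch of this candidate-generation-and-test loop may stand in for the listing; it is an illustrative rendering, not the article's pseudo code, and it counts with plain set scans instead of a hash tree:

from itertools import combinations

def apriori(transactions, min_count):
    # Return every itemset contained in at least min_count transactions.
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items
             if sum(i in t for t in transactions) >= min_count]
    frequent, k = list(level), 2
    while level:
        prev = set(level)
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets,
        # then prune by the downward closure lemma (all subsets frequent).
        cands = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        cands = {c for c in cands
                 if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Test: scan the database and keep the frequent candidates.
        level = [c for c in cands
                 if sum(c <= t for t in transactions) >= min_count]
        frequent += level
        k += 1
    return frequent

T = [{"A","B","D","E"}, {"B","C","E"}, {"A","B","D","E"},
     {"A","B","C","E"}, {"A","B","C","D","E"}, {"B","C","D"}]
print(sorted(map(sorted, apriori(T, 3))))

The loop terminates when a level yields no frequent candidates, matching the "no further successful extensions" condition above.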
4.b) In addition to confidence, other measures of interestingness for rules have been proposed. Some popular measures are: all-confidence[10], collective strength[11], conviction[12], leverage[13], and lift (originally called interest)[14]. Definitions of these measures can be found in the cited references. Several more measures are presented and compared by Tan et al.[15] Looking for techniques that can model what the user has known (and using these models as interestingness measures) is currently an active research trend under the name of "Subjective Interestingness."
Statistically sound associations
One limitation of the standard approach to discovering associations is that by searching massive numbers of possible associations to look for collections of items that appear to be associated, there is a large risk of finding many spurious associations. These are collections of items that co-occur with unexpected frequency in the data, but only do so by chance. For example, suppose we are considering a collection of 10,000 items and looking for rules containing two items in the left-hand-side and 1 item in the right-hand-side. There are approximately 1,000,000,000,000 such rules. If we apply a statistical test for independence with a significance level of 0.05, it means there is only a 5% chance of accepting a rule if there is no association. If we assume there are no associations, we should nonetheless expect to find 50,000,000,000 rules. Statistically sound association discovery[16][17] controls this risk, in most cases reducing the risk of finding any spurious associations to a user-specified significance level.
PART B
5.a) FP tree:
FP-Tree structure
The frequent-pattern tree (FP-tree) is a compact structure that
stores quantitative information about frequent patterns in a
database [4].
Han defines the FP-tree as the tree structure defined below [1]:
1. One root labeled as "null", with a set of item-prefix subtrees as children, and a frequent-item-header table (presented in the left side of Figure 1);
2. Each node in the item-prefix subtree consists of three fields:
(a) item-name: registers which item is represented by the node;
(b) count: the number of transactions represented by the portion of the path reaching the node;
(c) node-link: links to the next node in the FP-tree carrying the same item-name, or null if there is none.
3. Each entry in the frequent-item-header table consists of two fields:
(a) item-name: the same as the node's item-name;
(b) head of node-link: a pointer to the first node in the FP-tree carrying the item-name.
Additionally, the frequent-item-header table can have the count support for an item. Figure 1 below shows an example of an FP-tree.
Figure 1: An example of an FP-tree from [17] (figure not reproduced).
The original algorithm to construct the FP-Tree defined by Han
in [1] is presented below in Algorithm 1.
Algorithm 1: FP-tree construction
Input: A transaction database DB and a minimum support threshold ξ.
Output: FP-tree, the frequent-pattern tree of DB.
Method: The FP-tree is constructed as follows.
1. Scan the transaction database DB once. Collect F, the set of
frequent items, and the support of each frequent item. Sort F in
support-descending order as FList, the list of frequent items.
2. Create the root of an FP-tree, T, and label it as null. For
each transaction Trans in DB do the following:
Select the frequent items in Trans and sort them according to
the order of FList. Let the sorted frequent-item list in Trans be [
p | P], where p is the first element and P is the remaining list.
Call insert tree([ p | P], T ).
The function insert tree([p | P], T) is performed as follows. If T has a child N such that N.item-name = p.item-name, then increment N's count by 1; else create a new node N, with its count initialized to 1, its parent link linked to T, and its node-link linked to the nodes with the same item-name via the node-link structure. If P is nonempty, call insert tree(P, N) recursively.
By using this algorithm, the FP-tree is constructed in two scans of the database. The first scan collects and sorts the set of frequent items, and the second constructs the FP-Tree.
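As a rough companion to Algorithm 1, here is a compact Python sketch of the two-scan construction; the Node class and the header table kept as simple lists of node-links are illustrative simplifications, not Han's exact data structure:

from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}

def build_fptree(db, min_sup):
    # Scan 1: collect frequent items, sort by descending support.
    counts = Counter(i for t in db for i in t)
    freq = {i: c for i, c in counts.items() if c >= min_sup}
    flist = sorted(freq, key=lambda i: (-freq[i], i))
    root, header = Node(None, None), {i: [] for i in flist}
    # Scan 2: insert each transaction's sorted frequent items (insert_tree).
    for t in db:
        node = root
        for item in [i for i in flist if i in t]:
            if item in node.children:
                node.children[item].count += 1   # shared prefix: bump count
            else:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)       # node-link via header table
            node = node.children[item]
    return root, header

db = [{"A","B","D","E"}, {"B","C","E"}, {"A","B","D","E"},
      {"A","B","C","E"}, {"A","B","C","D","E"}, {"B","C","D"}]
root, header = build_fptree(db, 3)
print({i: sum(n.count for n in header[i]) for i in header})
# {'B': 6, 'E': 5, 'A': 4, 'C': 4, 'D': 4}

The transactions here are the ones used in the example later in this answer; summing the counts along each item's node-links recovers exactly the supports collected in the first scan.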
FP-Growth Algorithm
After constructing the FP-Tree it is possible to mine it to find
the complete set of frequent patterns. To accomplish this job, Han
in [1] presents a group of lemmas and properties, and thereafter
describes the FP-Growth Algorithm as presented below in Algorithm
2.
Algorithm 2: FP-Growth
Input: A database DB, represented by the FP-tree constructed according to Algorithm 1, and a minimum support threshold ξ.
Output: The complete set of frequent patterns.
Method: call FP-growth(FP-tree, null).
Procedure FP-growth(Tree, α) {
(01) if Tree contains a single prefix path then { // Mining single prefix-path FP-tree
(02) let P be the single prefix-path part of Tree;
(03) let Q be the multipath part with the top branching node replaced by a null root;
(04) for each combination (denoted as β) of the nodes in the path P do
(05) generate pattern β ∪ α with support = minimum support of nodes in β;
(06) let freq pattern set(P) be the set of patterns so generated;
}
(07) else let Q be Tree;
(08) for each item ai in Q do { // Mining multipath FP-tree
(09) generate pattern β = ai ∪ α with support = ai.support;
(10) construct β's conditional pattern-base and then β's conditional FP-tree Treeβ;
(11) if Treeβ ≠ ∅ then
(12) call FP-growth(Treeβ, β);
(13) let freq pattern set(Q) be the set of patterns so generated;
}
(14) return(freq pattern set(P) ∪ freq pattern set(Q) ∪ (freq pattern set(P) × freq pattern set(Q)))
}
When the FP-tree contains a single prefix-path, the complete set
of frequent patterns can be generated in three parts: the single
prefix-path P, the multipath Q, and their combinations (lines 01 to
03 and 14). The resulting patterns for a single prefix path are the
enumerations of its subpaths that have the minimum support (lines
04 to 06). Thereafter, the multipath Q is defined (line 03 or 07)
and the resulting patterns from it are processed (lines 08 to 13).
Finally, in line 14 the combined results are returned as the
frequent patterns found.
An example
This section presents a simple example to illustrate how the
previous algorithm works. The original example can be viewed in
[18].
Consider the transactions below and a minimum support of 3:

TID  Items
1    A, B, D, E
2    B, C, E
3    A, B, D, E
4    A, B, C, E
5    A, B, C, D, E
6    B, C, D
To build the FP-Tree, the supports of the frequent items are first calculated and sorted in decreasing order, resulting in the following list: { B(6), E(5), A(4), C(4), D(4) }. Thereafter, the FP-Tree is iteratively constructed for each transaction, using the sorted list of items, as shown in Figure 2.
(a) Transaction 1: BEAD
(b) Transaction 2: BEC
(c) Transaction 3: BEAD
(d) Transaction 4: BEAC
(e) Transaction 5: BEACD
(f) Transaction 6: BCD
Figure 2: Constructing the FP-Tree iteratively.
As presented in Figure 3, the initial call to FP-Growth uses the
FP-Tree obtained from the Algorithm 1, presented in Figure 2 (f),
to process the projected trees in recursive calls to get the
frequent patterns in the transactions presented before.
Using a depth-first strategy the projected trees are determined
to items D, C, A, E and B, respectively. First the projected tree
for D is recursively processed, projecting trees for DA, DE and DB.
In a similar manner the remaining items are processed. At the end of the process the set of frequent itemsets is: { DAE, DAEB, DAB, DEB, CE, CEB, CB, AE, AEB, AB, EB }.
Figure 3: Projected trees and frequent patterns found by the recursive calls to the FP-Growth algorithm.
5.b) Support-confidence framework:
Generating rules efficiently
1. Brute-force method (for small item sets):
Generate all possible subsets of an item set, excluding the empty set (2^n - 1 subsets), and use them as rule consequents (the remaining items form the antecedents).
Compute the confidence: divide the support of the item set by the support of the antecedent (obtained from the hash table).
Select rules with high confidence (using a threshold). (A sketch of this computation follows this list.)
2. Better way: iterative rule generation with a minimum accuracy constraint.
Observation: if an n-consequent rule holds then all
corresponding (n-1)-consequent rules hold as well.
Algorithm: generate n-consequent candidate rules from
(n-1)-consequent rules (similar to the algorithm for the item
sets).
3. Weka's approach (default settings for Apriori): generate the best 10 rules. Begin with a minimum support of 100% and decrease this in steps of 5%. Stop when 10 rules have been generated or the support falls below 10%. The minimum confidence is 90%.
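As referenced in point 1 above, here is a brute-force Python sketch of rule generation from a single frequent item set; the support values stand in for hash-table lookups and are illustrative assumptions:

from itertools import combinations

# Illustrative supports, as would be read from the item-set hash table.
support = {
    frozenset("ABC"): 0.30, frozenset("AB"): 0.40, frozenset("AC"): 0.35,
    frozenset("BC"): 0.50, frozenset("A"): 0.60, frozenset("B"): 0.70,
    frozenset("C"): 0.55,
}

def rules_from(itemset, min_conf):
    itemset = frozenset(itemset)
    # Every non-empty proper subset serves as a consequent; the remaining
    # items form the antecedent.
    for r in range(1, len(itemset)):
        for cons in combinations(itemset, r):
            antecedent = itemset - set(cons)
            conf = support[itemset] / support[antecedent]
            if conf >= min_conf:
                yield antecedent, frozenset(cons), conf

for lhs, rhs, conf in rules_from("ABC", 0.7):
    print(set(lhs), "=>", set(rhs), round(conf, 2))
# Two rules pass the threshold: AB => C (0.75) and AC => B (0.86).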
Advanced association rules
1. Multi-level association rules: using concept hierarchies.
Example: there are no frequent item sets at the level of the individual attributes:

A    B    C    D    ...
1    0    1    0    ...
0    1    0    1    ...
...  ...  ...  ...  ...

Assume now that A and B are children of A&B, and C and D are children of C&D in concept hierarchies. Assume also that A&B and C&D aggregate the values for their children. Then {A&B, C&D} will be a frequent item set with support 2.
2. Approaches to mining multi-level association rules:
Using uniform support: same minimum support for all levels in
the hierarchies: top-down strategy.
Using reduced minimum support at lower levels: various
approaches to define the minimum support at lower levels.
3. Interpretation of association rules:
Single-dimensional rules: single predicate. Example: buys(x,
diapers) => buys(x, beers). Create a table with as many columns
as possible values for the predicate. Consider these values as
binary attributes (0,1) and when creating the item sets ignore the
0's.
diapers  beers  milk  bread  ...
1        1      0     1      ...
1        1      1     0      ...
...      ...    ...   ...    ...
Multidimensional association rules: multiple predicates.
Example: age(x, 20) and buys(x, computer) => buys(x,
computer_games). Mixed-type attributes. Problem: the algorithms
discussed so far cannot handle numeric attributes.
age  computer  computer_games  ...
20   1         1               ...
35   1         0               ...
...  ...       ...             ...
4. Handling numeric attributes.
Static discretization: discretization based on predefined
ranges.
Discretization based on the distribution of data: binning.
Problem: grouping together very distant values.
Distance-based association rules:
cluster values by distance to generate clusters (intervals or
groups of nominal values).
search for frequent cluster sets.
Approximate Association Rule Mining. Read the paper by Nayak and
Cook.
Correlation analysis
1. High support and high confidence rules are not necessarily interesting. Example:
Assume that A occurs in 60% of the transactions, B in 75%, and both A and B in 40%.
Then the association A => B has support 40% and confidence 66%.
However, P(B) = 75%, higher than P(B|A) = 66%.
In fact, A and B are negatively correlated: corr(A,B) = 0.4/(0.6*0.75) = 0.89.
2. Confidence measures P(B|A), that is, whether A implies B and to what extent.
3. Correlation between occurrences of A and B:
corr(A,B) = P(A,B)/(P(A)P(B))
corr(A,B) < 1 => A and B are negatively correlated.
corr(A,B) > 1 => A and B are positively correlated.
corr(A,B) = 1 => A and B are independent.
4. Contingency table:

              outlook=sunny  outlook!=sunny  Row total
play=yes            2              7             9
play=no             3              2             5
Column total        5              9            14

if outlook=sunny then play=yes [support=14%, confidence=40%].
corr(outlook=sunny, play=yes) = (2/14)/[(5/14)*(9/14)] = 0.62 < 1 => negative correlation.
if outlook=sunny then play=no [support=21%, confidence=60%].
corr(outlook=sunny, play=no) = (3/14)/[(5/14)*(5/14)] = 1.68 > 1 => positive correlation.
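The contingency-table arithmetic above can be checked with a few lines of Python (a minimal sketch; the helper function is an illustrative assumption):

def corr(n_ab, n_a, n_b, n_total):
    # corr(A,B) = P(A,B) / (P(A) * P(B)), from raw co-occurrence counts.
    return (n_ab / n_total) / ((n_a / n_total) * (n_b / n_total))

# outlook=sunny vs play=yes: 2 joint, 5 sunny, 9 yes, 14 total.
print(round(corr(2, 5, 9, 14), 2))  # 0.62 -> negatively correlated
# outlook=sunny vs play=no: 3 joint, 5 sunny, 5 no, 14 total.
print(round(corr(3, 5, 5, 14), 2))  # 1.68 -> positively correlated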
6.a) Cluster analysis:
In an unsupervised learning environment the system has to
discover its own classes and one way in which it does this is to
cluster the data in the database, as shown in Figure 5 below. The first step is to discover subsets of related objects and then find descriptions, e.g. D1, D2, D3, which describe each of these subsets.
Figure 5: Discovering clusters and descriptions in a database (figure not reproduced).
Clustering and segmentation basically partition the database so
that each partition or group is similar according to some criteria
or metric. Clustering according to similarity is a concept which
appears in many disciplines. If a measure of similarity is
available there are a number of techniques for forming clusters.
Membership of groups can be based on the level of similarity
between members and from this the rules of membership can be
defined. Another approach is to build set functions that measure some property of partitions, i.e. groups or subsets, as functions of some parameter of the partition. This latter approach achieves what is known as optimal partitioning.
Many data mining applications make use of clustering according
to similarity for example to segment a client/customer base.
Clustering according to optimization of set functions is used in
data analysis e.g. when setting insurance tariffs the customers can
be segmented according to a number of parameters and the optimal
tariff segmentation achieved.
Clustering/segmentation in databases are the processes of
separating a data set into components that reflect a consistent
pattern of behaviour. Once the patterns have been established they
can then be used to "deconstruct" data into more understandable
subsets and also they provide sub-groups of a population for
further analysis or action which is important when dealing with
very large databases. For example a database could be used for
profile generation for target marketing where previous response to
mailing campaigns can be used to generate a profile of people who
responded and this can be used to predict response and filter
mailing lists to achieve the best response.
6.b) K-means algorithm:
K-Means Method
It is a major technique among the partitioning methods of cluster analysis.
It is formulated with the Euclidean distance:
D(x, y) = (Σi (xi - yi)²)^(1/2)
It is well known and easy to use.
Every object gets assigned to some cluster.
Steps:
Pick k samples.
Make them the initial centroids.
Find the distance from each object to each centroid using the Euclidean distance formula.
Allocate each object to the nearby (smallest-distance) cluster.
Recompute each centroid as the mean of its cluster, and repeat until the assignments no longer change.
Limitations (and hence suggestions for improvement):
Does not deal with overlapping clusters.
Does not consider the size of clusters.
Cluster formation depends only on the initial guess.
Sensitive to outliers.
Not suitable for categorical data.
(A minimal sketch of the algorithm follows this list.)
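As promised above, a minimal Python sketch of k-means; the toy points, the iteration cap, and the seeded random initialization are illustrative choices:

import math
import random

def kmeans(points, k, iters=100, seed=0):
    # points: list of coordinate tuples.
    centroids = random.Random(seed).sample(points, k)  # pick k samples
    for _ in range(iters):
        # Allocate each point to its nearest centroid (Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        # Recompute centroids as cluster means; stop when nothing moves.
        new = [tuple(sum(v) / len(c) for v in zip(*c)) if c else centroids[j]
               for j, c in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    return centroids, clusters

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(pts, 2)
print(centroids)  # two centroids, near (1.33, 1.33) and (8.33, 8.33)

Note the sensitivity to the initial guess listed above: a different seed can change which local optimum the algorithm settles into.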
7.a) setValue and link value:
dwr.util.setValue(id, value)
dwr.util.setValue(id, value, options) finds the element with the id specified in the first parameter and alters its contents to be the value in the second parameter.
By default DWR protects against XSS in dwr.util.setValue by performing output escaping. The optional options object parameter allows output escaping to be disabled with { escapeHtml:false }.
For example:
dwr.util.setValue('x', "Hi");
dwr.util.setValue('x', "<b>Hi</b>", { escapeHtml:false }); // markup rendered as HTML
LINK VALUE:
Link Value Factors - Intro
It's a known fact that no two links are exactly the same; in this research I attempted to see every two links as equals. While this obviously makes it impossible to create a 100% accurate piece of content, this, together with the dozens of interesting comments, results in a document that might be of value for everybody who is somehow involved in building links. The grain of salt that comes with this research is far outweighed by the value of the data presented.
While some might use this document for entertainment purposes only, these results and factors can be interpreted by everybody as an indication that, while lots and lots of different factors come into play in the process, building links sure isn't rocket science. It's way cooler.
7.b) Content based retrieval:
Content based
The content based approach exploits semantic connections between documents and parts thereof, and semantic connections between queries and documents. Most content based document retrieval systems use an inverted index algorithm.
A signature file is a technique that creates a quick-and-dirty filter, for example a Bloom filter, that will keep all the documents that match the query and hopefully only a few that do not. The way this is done is by creating for each file a signature, typically a hash-coded version. One method is superimposed coding. A post-processing step is done to discard the false alarms. Since in most cases this structure is inferior to inverted files in terms of speed, size and functionality, it is not used widely. However, with proper parameters it can beat inverted files in certain environments.
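Since most content based systems rest on an inverted index, here is a minimal Python sketch of one; the documents and the whitespace tokenization are illustrative assumptions:

from collections import defaultdict

def build_index(docs):
    # Map each term to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    # AND-query: ids of documents containing every query term.
    sets = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*sets) if sets else set()

docs = ["data mining finds patterns",
        "text mining derives information from text",
        "databases store data"]
index = build_index(docs)
print(search(index, "data mining"))  # {0}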
8.a) Data mining has many and varied fields of application, some of which are listed below.
Retail/Marketing
Identify buying patterns from customers
Find associations among customer demographic characteristics
Predict response to mailing campaigns
Market basket analysis
Banking
Detect patterns of fraudulent credit card use
Identify `loyal' customers
Predict customers likely to change their credit card
affiliation
Determine credit card spending by customer groups
Find hidden correlations between different financial
indicators
Identify stock trading rules from historical market data
Insurance and Health Care
Claims analysis - i.e. which medical procedures are claimed together
Predict which customers will buy new policies
Identify behaviour patterns of risky customers
Identify fraudulent behaviour
Transportation
Determine the distribution schedules among outlets
Analyse loading patterns
Medicine
Characterise patient behaviour to predict office visits
Identify successful medical therapies for different
illnesses
8.b) Text retrieval:
Text mining, also referred to as text data mining, roughly
equivalent to text analytics, refers to the process of deriving
high-quality information from text. High-quality information is
typically derived through the devising of patterns and trends
through means such as statistical pattern learning. Text mining
usually involves the process of structuring the input text (usually
parsing, along with the addition of some derived linguistic
features and the removal of others, and subsequent insertion into a
database), deriving patterns within the structured data, and
finally evaluation and interpretation of the output. 'High quality'
in text mining usually refers to some combination of relevance,
novelty, and interestingness. Typical text mining tasks include
text categorization, text clustering, concept/entity extraction,
production of granular taxonomies, sentiment analysis, document
summarization, and entity relation modeling (i.e., learning
relations between named entities).
Text analysis involves information retrieval, lexical analysis
to study word frequency distributions, pattern recognition,
tagging/annotation, information extraction, data mining techniques
including link and association analysis, visualization, and
predictive analytics. The overarching goal is, essentially, to turn
text into data for analysis, via application of natural language
processing (NLP) and analytical methods.
A typical application is to scan a set of documents written in a
natural language and either model the document set for predictive
classification purposes or populate a database or search index with
the information extracted.
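As a tiny illustration of the lexical-analysis step mentioned above (studying word frequency distributions), the following Python sketch turns raw text into term counts; the tokenizer and the sample sentences are illustrative assumptions:

import re
from collections import Counter

def term_frequencies(doc):
    # Lexical analysis: lowercase, tokenize into words, count frequencies.
    return Counter(re.findall(r"[a-z]+", doc.lower()))

docs = ["Text mining derives high-quality information from text.",
        "Typical tasks include text categorization and clustering."]
for d in docs:
    print(term_frequencies(d).most_common(3))

Structured counts like these are the kind of database or search-index input that the pattern-derivation and evaluation steps described above operate on.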