Basics of data mining, Knowledge Discovery in databases, KDD process, data mining tasks primitives, Integration of data mining systems with a database or data warehouse system, Major issues in data mining, Data pre-processing: data cleaning, data integration and transformation, data reduction etc.

Aug 21, 2020


dariahiddleston



Integration of data mining systems with a database or data warehouse system

If a data mining system is not integrated with a database or data warehouse system, it has no such system to communicate with. This is known as the no-coupling scheme. In this scheme, the main focus is on data mining design and on developing efficient and effective algorithms for mining the available data sets.

The integration schemes are as follows −


1. No Coupling − In this scheme, the data mining system does not utilize any of the database or data warehouse functions. It fetches the data from a particular source and processes that data using some data mining algorithms. The data mining result is stored in another file.


2. Loose Coupling − In this scheme, the data mining system may use some of the functions of the database or data warehouse system. It fetches the data from the data repository managed by these systems and performs data mining on that data. It then stores the mining result either in a file or in a designated place in a database or data warehouse.


3. Semi-tight Coupling − In this scheme, the data mining system is linked with a database or data warehouse system and, in addition, efficient implementations of a few data mining primitives can be provided in the database.


4. Tight Coupling − In this scheme, the data mining system is smoothly integrated into the database or data warehouse system. The data mining subsystem is treated as one functional component of an information system.
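The difference between the first three schemes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `purchases` table, the toy records, and the `mine_item_counts` function (a stand-in for a real mining algorithm) are all hypothetical, and an in-memory SQLite database plays the role of the database system.

```python
import csv
import io
import sqlite3
from collections import Counter

# Hypothetical toy data: (customer, item) purchase records.
ROWS = [("ann", "tv"), ("bob", "tv"), ("ann", "radio"), ("cy", "tv")]


def mine_item_counts(records):
    """Stand-in 'mining algorithm': count how often each item is bought."""
    return Counter(item for _customer, item in records)


# No coupling: read from a flat source, mine, write the result to another file.
source = io.StringIO("ann,tv\nbob,tv\nann,radio\ncy,tv\n")  # stands in for a file
no_coupling_result = mine_item_counts(tuple(r) for r in csv.reader(source))
result_file = io.StringIO()  # stands in for the output file
csv.writer(result_file).writerows(sorted(no_coupling_result.items()))

# Loose coupling: the database fetches the data; mining still happens outside it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, item TEXT)")
conn.executemany("INSERT INTO purchases VALUES (?, ?)", ROWS)
fetched = conn.execute("SELECT customer, item FROM purchases").fetchall()
loose_result = mine_item_counts(fetched)
# The result is stored in a designated place in the database.
conn.execute("CREATE TABLE item_counts (item TEXT, n INTEGER)")
conn.executemany("INSERT INTO item_counts VALUES (?, ?)", loose_result.items())

# Semi-tight coupling: one primitive (here, aggregation) runs inside the database.
semi_tight_result = dict(
    conn.execute("SELECT item, COUNT(*) FROM purchases GROUP BY item")
)
```

All three variants produce the same counts; what changes is where the data lives and which system does the work. Tight coupling is not shown, since it would mean the mining function itself is part of the database engine.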


Major issues in data mining

The major issues in data mining concern mining methodology, user interaction, performance, and diverse data types. They fall into three groups:

1. Mining Methodology and User Interaction Issues
2. Performance Issues
3. Diverse Data Types Issues


1. Mining Methodology and User Interaction Issues

This group covers the following kinds of issues −

Mining different kinds of knowledge in databases − Different users may be interested in different kinds of knowledge. Data mining therefore needs to cover a broad range of knowledge discovery tasks.


Interactive mining of knowledge at multiple levels of abstraction − The data mining process needs to be interactive because interaction allows users to focus the search for patterns, providing and refining data mining requests based on the returned results.

Incorporation of background knowledge − Background knowledge can be used to guide the discovery process and to express the discovered patterns, not only in concise terms but also at multiple levels of abstraction.


Data mining query languages and ad hoc data mining − A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.

Presentation and visualization of data mining results − Once patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable.


Handling noisy or incomplete data − Data cleaning methods are required to handle noise and incomplete objects while mining data regularities. Without data cleaning, the accuracy of the discovered patterns will be poor.

Pattern evaluation − Discovered patterns are not automatically interesting: many of them represent common knowledge or lack novelty. Evaluating which of the discovered patterns are actually interesting is therefore an issue in its own right.


2. Performance Issues

Performance-related issues include the following −

Efficiency and scalability of data mining algorithms − To extract information effectively from the huge amounts of data in databases, data mining algorithms must be efficient and scalable.


Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are processed in parallel, and the results from the partitions are then merged. Incremental algorithms incorporate database updates without mining the entire data again from scratch.
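The partition, process, and merge pattern, and the incremental update, can be sketched as follows. This is a minimal illustration rather than a real mining algorithm: item counting stands in for the mining step, the transactions are invented, and a thread pool is used only to show the structure (a CPU-bound job would more likely use processes or several machines).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_items(transactions):
    """Local mining step: item frequencies within one partition."""
    counts = Counter()
    for transaction in transactions:
        counts.update(set(transaction))
    return counts


def partition(data, n_parts):
    """Split the data into n_parts roughly equal slices."""
    return [data[i::n_parts] for i in range(n_parts)]


def parallel_count(data, n_parts=2):
    """Mine each partition in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(count_items, partition(data, n_parts)))
    return sum(partials, Counter())


transactions = [["milk", "bread"], ["milk"], ["bread", "eggs"], ["milk", "eggs"]]
counts = parallel_count(transactions)

# Incremental step: fold in new transactions without rescanning the old ones.
new_transactions = [["milk", "bread"]]
counts.update(count_items(new_transactions))
```

Counting merges trivially because partial counts simply add up. Many mining tasks need a more careful merge step, which is part of what makes this a research issue.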


3. Diverse Data Types Issues

Handling of relational and complex types of data − The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not realistic to expect one system to mine all these kinds of data.

Mining information from heterogeneous databases and global information systems − The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured. Mining knowledge from them therefore adds challenges to data mining.


Data Mining Task Primitives

We can specify a data mining task in the form of a data mining query.

This query is input to the system.

A data mining query is defined in terms of data mining task primitives.

These primitives allow us to communicate in an interactive manner with the data mining system.


List of Data Mining Task Primitives −

Set of task-relevant data to be mined.

Kind of knowledge to be mined.

Background knowledge to be used in the discovery process.

Interestingness measures and thresholds for pattern evaluation.

Representation for visualizing the discovered patterns.
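One way to picture a data mining query is as a record with one field per primitive. The sketch below is purely illustrative: `MiningQuery`, its field names, and the example values (including the thresholds) are hypothetical and do not correspond to any particular query language.

```python
from dataclasses import dataclass, field


@dataclass
class MiningQuery:
    """Hypothetical container holding the five task primitives."""

    task_relevant_data: str  # e.g. a selection over the database
    kind_of_knowledge: str  # e.g. "association", "classification"
    background_knowledge: dict = field(default_factory=dict)
    interestingness: dict = field(default_factory=dict)  # measure -> threshold
    presentation: str = "rules"  # e.g. "rules", "table", "chart"


query = MiningQuery(
    task_relevant_data="purchases by customers in Canada",
    kind_of_knowledge="association",
    background_knowledge={"location": ["city", "province", "country"]},
    interestingness={"support": 0.05, "confidence": 0.7},  # assumed thresholds
    presentation="rules",
)
```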


A data mining query is defined in terms of the following primitives:

1. Task-relevant data: This is the portion of the database to be investigated. For example, suppose that you are a manager of All Electronics in charge of sales in the United States and Canada, and that you would like to study the buying trends of customers in Canada. Rather than mining the entire database, you can specify that only the data relating to customer purchases in Canada be retrieved, together with the attributes of interest. These are referred to as relevant attributes.


2. The kinds of knowledge to be mined: This specifies the data mining functions to be performed, such as characterization, discrimination, association, classification, clustering, or evolution analysis. For instance, if studying the buying habits of customers in Canada, you may choose to mine associations between customer profiles and the items that these customers like to buy.


3. Background knowledge: Users can specify background knowledge, or knowledge about the domain to be mined. This knowledge is useful for guiding the knowledge discovery process and for evaluating the patterns found. There are several kinds of background knowledge; concept hierarchies are one common form.


4. Interestingness measures: These functions are used to separate uninteresting patterns from knowledge. They may be used to guide the mining process or, after discovery, to evaluate the discovered patterns. Different kinds of knowledge may have different interestingness measures. For example, interestingness measures for association rules include support and confidence.

5. Presentation and visualization of discovered patterns: This refers to the form in which discovered patterns are to be displayed. Users can choose from different forms for knowledge presentation, such as rules, tables, charts, graphs, decision trees, and cubes.
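To make the interestingness primitive concrete, the sketch below computes support and confidence, two widely used measures for association rules, and applies thresholds to them. The transactions and the threshold values are invented for illustration, and `is_interesting` is a hypothetical helper name.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent => consequent.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def is_interesting(transactions, antecedent, consequent, min_sup, min_conf):
    """Keep a rule only if it meets both user-specified thresholds."""
    both = set(antecedent) | set(consequent)
    return (
        support(transactions, both) >= min_sup
        and confidence(transactions, antecedent, consequent) >= min_conf
    )


transactions = [{"tv", "cable"}, {"tv"}, {"tv", "cable"}, {"radio"}]
rule_support = support(transactions, {"tv", "cable"})        # 2 of 4 transactions
rule_confidence = confidence(transactions, {"tv"}, {"cable"})  # 2 of the 3 with "tv"
```

With a minimum support of 0.4, the rule tv => cable passes a confidence threshold of 0.6 but fails one of 0.7, which shows how the thresholds steer what is reported.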


Data pre-processing is a data mining technique used to transform raw data into a useful and efficient format.

Steps Involved in Data Pre-processing:

1. Data Cleaning: The data can have many irrelevant and missing parts. Data cleaning deals with these; it involves handling missing data, noisy data, etc.


(a) Missing Data: This situation arises when some values are missing from the data. It can be handled in various ways, including:

Ignore the tuples: This approach is suitable only when the dataset is quite large and multiple values are missing within a tuple.

Fill the missing values: There are various ways to do this. You can fill the missing values manually, with the attribute mean, or with the most probable value.
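Both strategies for missing data can be sketched in plain Python. The rows, the attribute names, and the `max_missing` cut-off are assumptions made for the example; missing values are represented as `None`.

```python
from statistics import mean


def drop_incomplete(rows, max_missing=1):
    """Ignore tuples that have more than max_missing missing (None) values."""
    return [r for r in rows if sum(v is None for v in r.values()) <= max_missing]


def fill_with_mean(rows, attribute):
    """Replace missing values of one numeric attribute by the attribute mean."""
    known = [r[attribute] for r in rows if r[attribute] is not None]
    fill = mean(known)
    return [
        {**r, attribute: fill if r[attribute] is None else r[attribute]}
        for r in rows
    ]


rows = [
    {"age": 30, "income": 50.0},
    {"age": None, "income": 70.0},
    {"age": 40, "income": None},
    {"age": None, "income": None},
]
kept = drop_incomplete(rows)          # drops the last tuple (two missing values)
filled = fill_with_mean(kept, "age")  # the missing age becomes the mean of 30 and 40
```

Filling with the most probable value would replace `mean` with a prediction from the other attributes, for example via regression or a decision tree; that is not shown here.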


(b) Noisy Data: Noisy data is data containing random errors or meaningless values that cannot be interpreted correctly by machines. It can result from faulty data collection, data entry errors, etc. It can be handled in the following ways:

1. Binning Method: This method works on sorted data in order to smooth it. The data is divided into segments (bins) of equal size, and each segment is handled separately. All values in a segment can be replaced by the segment mean, or each value can be replaced by the closest segment boundary value.


2. Regression: Here data is smoothed by fitting it to a regression function. The regression may be linear (having one independent variable) or multiple (having multiple independent variables).

3. Clustering: This approach groups similar data into clusters. Values that fall outside the clusters can be treated as outliers, although some outliers may go undetected.
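Of the three noise-handling methods above, binning is the easiest to show in a few lines. The sketch below assumes equal-size (equal-frequency) bins and uses an invented list of prices; the function names are illustrative only.

```python
def equal_size_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of bin_size."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]


def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace each value by the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_size_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

Here the bins are [4, 8, 15], [21, 21, 24], and [25, 28, 34]. Smoothing by means replaces them with 9, 22, and 29 respectively, while smoothing by boundaries moves 8 to 4 and 28 to 25 and leaves the boundary values unchanged.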


2. Data Transformation: This step transforms the data into forms appropriate for the mining process. It involves the following:

1. Normalization: This scales the data values into a specified range (such as -1.0 to 1.0 or 0.0 to 1.0).

2. Attribute Construction: In this strategy, new attributes are constructed from the given set of attributes to help the mining process.


3. Discretization: This replaces the raw values of a numeric attribute with interval labels or conceptual labels.

4. Concept Hierarchy Generation: Here attributes are converted from a lower level to a higher level in a hierarchy. For example, the attribute “city” can be converted to “country”.
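Three of the transformations above can be sketched briefly. Min-max scaling is used here as one common way to normalize into a range; the cut points, labels, and the city-to-country mapping are invented for the example.

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization into [new_min, new_max].

    Assumes the values are not all identical (otherwise the range is zero).
    """
    lo, hi = min(values), max(values)
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


def discretize(value, cut_points, labels):
    """Return the label of the first interval whose upper bound exceeds value."""
    for upper, label in zip(cut_points, labels):
        if value < upper:
            return label
    return labels[-1]  # value is at or above the last cut point


# Hypothetical concept hierarchy: city -> country.
CITY_TO_COUNTRY = {"Toronto": "Canada", "Vancouver": "Canada", "Chicago": "USA"}


def roll_up(cities):
    """Replace each city by its higher-level concept, the country."""
    return [CITY_TO_COUNTRY[c] for c in cities]


incomes = [20, 40, 60, 100]
scaled_01 = min_max(incomes)             # range 0.0 to 1.0
scaled_pm1 = min_max(incomes, -1.0, 1.0)  # range -1.0 to 1.0
age_label = discretize(35, [30, 60], ["young", "middle-aged", "senior"])
countries = roll_up(["Toronto", "Chicago", "Vancouver"])
```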


3. Data Reduction: Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data. That is, mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results. Strategies for data reduction include the following.


1. Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube.

2. Dimension reduction, where irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed.

3. Data compression, where encoding mechanisms are used to reduce the data set size.


4. Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations such as parametric models (which need to store only the model parameters instead of the actual data), or nonparametric methods such as clustering, sampling, and the use of histograms.

5. Discretization and concept hierarchy generation, where raw data values for attributes are replaced by ranges or higher conceptual levels. Concept hierarchies allow the mining of data at multiple levels of abstraction, and are a powerful tool for data mining.
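A few of these reduction strategies can be illustrated on a toy data set. The sketch below assumes hypothetical quarterly sales records and shows aggregation from quarters to years (in the spirit of data cube aggregation) plus two nonparametric numerosity-reduction methods, an equal-width histogram and simple random sampling without replacement.

```python
import random
from collections import defaultdict

# Hypothetical quarterly sales records: (year, quarter, amount).
sales = [
    (2019, 1, 100), (2019, 2, 150), (2019, 3, 120), (2019, 4, 130),
    (2020, 1, 110), (2020, 2, 160),
]


def aggregate_by_year(records):
    """Aggregation: replace quarterly rows by one total per year."""
    totals = defaultdict(int)
    for year, _quarter, amount in records:
        totals[year] += amount
    return dict(totals)


def equal_width_histogram(values, n_buckets):
    """Numerosity reduction: keep only bucket counts, not the raw values.

    Assumes the values are not all identical.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    counts = [0] * n_buckets
    for v in values:
        idx = min(int((v - lo) / width), n_buckets - 1)  # max value goes in last bucket
        counts[idx] += 1
    return counts


def sample_without_replacement(records, k, seed=0):
    """Numerosity reduction: keep a random subset of k records."""
    return random.Random(seed).sample(records, k)


yearly = aggregate_by_year(sales)
histogram = equal_width_histogram([amount for _y, _q, amount in sales], 3)
subset = sample_without_replacement(sales, 3)
```

Six quarterly rows shrink to two yearly totals, three bucket counts, or three sampled rows. Each reduced form answers some questions about the data nearly as well as the original while being cheaper to mine.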