
A.V.C.COLLEGE OF ENGINEERING

MANNAMPANDAL, MAYILADUTHURAI-609 305

COURSE MATERIAL

FOR THE SUBJECT OF

SUB NAME : CS1011 DATA WAREHOUSING AND DATA MINING

SEM : VII

DEPARTMENT : COMPUTER SCIENCE AND ENGINEERING

ACADEMIC YEAR : 2013-2013

NAME OF THE FACULTY : PARVATHI.M

DESIGNATION : Asst. Professor


A.V.C College of Engineering

Department of Computer Science & Engineering

2013 Odd Semester

Lesson Plan

SYLLABUS

ELECTIVE II

CS1011 – DATA WAREHOUSING AND DATA MINING    L T P : 3 0 0

UNIT I BASICS OF DATA WAREHOUSING    8

Introduction − Data warehouse − Multidimensional data model − Data warehouse architecture − Implementation − Further development − Data warehousing to data mining.

UNIT II DATA PREPROCESSING, LANGUAGE, ARCHITECTURES, CONCEPT DESCRIPTION    8

Why preprocessing − Cleaning − Integration − Transformation − Reduction − Discretization − Concept hierarchy generation − Data mining primitives − Query language − Graphical user interfaces − Architectures − Concept description − Data generalization − Characterizations − Class comparisons − Descriptive statistical measures.

UNIT III ASSOCIATION RULES    9

Association rule mining − Single-dimensional boolean association rules from transactional databases − Multilevel association rules from transaction databases.

UNIT IV CLASSIFICATION AND CLUSTERING    12

Classification and prediction − Issues − Decision tree induction − Bayesian classification − Association rule based − Other classification methods − Prediction − Classifier accuracy − Cluster analysis − Types of data − Categorization of methods − Partitioning methods − Outlier analysis.

UNIT V RECENT TRENDS    8

Multidimensional analysis and descriptive mining of complex data objects − Spatial databases − Multimedia databases − Time series and sequence data − Text databases − World Wide Web − Applications and trends in data mining.

Total: 45


TEXT BOOKS
1. Han, J. and Kamber, M., “Data Mining: Concepts and Techniques”, Harcourt India / Morgan Kaufmann, 2001.
2. Margaret H. Dunham, “Data Mining: Introductory and Advanced Topics”, Pearson Education, 2004.

REFERENCES
1. Sam Anahory and Dennis Murray, “Data Warehousing in the Real World”, Pearson Education, 2003.
2. David Hand, Heikki Mannila and Padhraic Smyth, “Principles of Data Mining”, PHI, 2004.
3. W.H. Inmon, “Building the Data Warehouse”, 3rd Edition, Wiley, 2003.
4. Alex Berson and Stephen J. Smith, “Data Warehousing, Data Mining and OLAP”, McGraw-Hill Edition, 2001.
5. Paulraj Ponniah, “Data Warehousing Fundamentals”, Wiley-Interscience Publication, 2003.


UNIT I BASICS OF DATA WAREHOUSING

Introduction − Data warehouse − Multidimensional data model − Data warehouse architecture − Implementation − Further development − Data warehousing to data mining.

1.1 Introduction to Data Warehousing

A data warehouse is a collection of data marts representing historical data from different operations in the company. This data is stored in a structure optimized for querying and data analysis. Table design, dimensions and organization should be consistent throughout a data warehouse so that reports and queries across the data warehouse are consistent. A data warehouse can also be viewed as a database for historical data from different functions within a company.

Bill Inmon coined the term Data Warehouse in 1990, which he defined in the following way: "A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process".

Subject Oriented: Data that gives information about a particular subject instead of about a company's ongoing operations. Focusing on the modelling and analysis of data for decision makers, not on daily operations or transaction processing. It is used to provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.

Integrated: Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole. Data cleaning and data integration techniques are applied. It is used to ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources. E.g., Hotel price: currency, tax, breakfast covered, etc.

Time-variant: All data in the data warehouse is identified with a particular time period. The time horizon for the data warehouse is significantly longer than that of operational systems.

o Operational database: current value data.
o Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years).

Non-volatile: Data is stable in a data warehouse. More data is added but data is never removed. Operational update of data does not occur in the data warehouse environment. It does not require transaction processing, recovery, and concurrency control mechanisms. It requires only two operations in data accessing:

o Initial loading of data, and
o Access of data.

Data Warehouse is a single, complete and consistent store of data obtained from a variety of different sources made available to end users in what they can understand and use in a business context. It can be:

- Used for decision support
- Used to manage and control business
- Used by managers and end-users to understand the business and make judgments

Data Warehousing is an architectural construct of information systems that provides users with current and historical decision support information that is hard to access or present in traditional operational data stores.

Other important terminology

Enterprise Data warehouse: It collects all information about subjects (customers, products, sales, assets, personnel) that span the entire organization

Data Mart: Departmental subsets that focus on selected subjects. A data mart is a segment of a data warehouse that can provide data for reporting and analysis on a section, unit, department or operation in the company. E.g. sales, payroll, production. Data marts are sometimes complete individual data warehouses which are usually smaller than the corporate data warehouse.

Decision Support System (DSS): Information technology to help the knowledge worker (executive, manager, and analyst) make faster and better decisions.

Drill-down: Traversing the summarization levels from highly summarized data to the underlying current or old detail

Metadata: Data about data. It is used to describe location and description of warehouse system components such as names, definition, structure…

Benefits of data warehousing

Data warehouses are designed to perform well with aggregate queries running on large amounts of data.

The structure of data warehouses is easier for end users to navigate, understand and query against unlike the relational databases primarily designed to handle lots of transactions.


Data warehouses enable queries that cut across different segments of a company's operation. E.g. production data could be compared against inventory data even if they were originally stored in different databases with different structures.

Queries that would be complex in normalized databases could be easier to build and maintain in data warehouses, decreasing the workload on transaction systems.

Data warehousing is an efficient way to manage and report on data that is from a variety of sources, non uniform and scattered throughout a company.

Data warehousing is an efficient way to manage demand for lots of information from lots of users.

Data warehousing provides the capability to analyze large amounts of historical data for nuggets of wisdom that can provide an organization with competitive advantage.

Operational and Informational Data

1. Operational Data are used for focusing on transactional functions such as bank card withdrawals and deposits, and they are:
- Detailed
- Updateable
- Reflect current data

2. Informational Data are used for focusing on providing answers to problems posed by decision makers, and they are:
- Summarized
- Non updateable

1.2 Building a Data Warehouse

The selection of data warehouse technology - both hardware and software - depends on many factors, such as:

- the volume of data to be accommodated,
- the speed with which data is needed,
- the history of the organization,
- which level of data is being built,
- how many users there will be,
- what kind of analysis is to be performed,
- cost of technology, etc.

The hardware is typically mainframe, parallel, or client/server hardware. The software that must be selected is for the basic data base manipulation of the data as it resides on the hardware. Typically the software is either full function DBMS or specialized data base software that has been optimized for the data warehouse.

Other software that needs to be considered is the interface software that provides transformation and metadata capability such as PRISM Solutions Warehouse Manager. A final piece of software that is important is the software needed for changed data capture.


A rough sizing of data needs to be done to determine the fitness of the hardware and software platforms. If the hardware and DBMS software are much too large for the data warehouse, the costs of building and running the data warehouse will be exorbitant. Even though performance will be no problem, development and operational costs and finances will be a problem. Conversely, if the hardware and DBMS software are much too small for the size of the data warehouse, then performance of operations and the ultimate end user satisfaction with the data warehouse will suffer. So, it is important that there be a comfortable fit between the data warehouse and the hardware and DBMS software that will house and manipulate the warehouse.

There are two kinds of factors that drive the building and use of a data warehouse:

Business factors:
- Business users want to make decisions quickly and correctly using all available data.

Technological factors:
- To address the incompatibility of operational data stores.
- IT infrastructure is changing rapidly: its capacity is increasing and its cost is decreasing, so building a data warehouse is easier.

There are several things to be considered while building a successful data warehouse

1.2.1 Business considerations:

Organizations interested in development of a data warehouse can choose one of the following two approaches:

- Top-Down Approach (suggested by Bill Inmon)
- Bottom-Up Approach (suggested by Ralph Kimball)

a. Top - Down Approach

In the top down approach suggested by Bill Inmon, we build a centralized repository to house corporate wide business data. This repository is called Enterprise Data Warehouse (EDW). The data in the EDW is stored in a normalized form in order to avoid redundancy.

The central repository for corporate wide data helps us maintain one version of truth of the data. The data in the EDW is stored at the most detail level. The reason to build the EDW on the most detail level is to leverage the flexibility to be used by multiple departments and to cater for future requirements.

The disadvantages of storing data at the detail level are:
1. The complexity of design increases with increasing level of detail.
2. It takes a large amount of space to store data at detail level, hence increased cost.

Once the EDW is implemented we start building subject area specific data marts which contain data in a de normalized form also called star schema. The data in the marts are usually summarized based on the end users analytical requirements.

The reason to denormalize the data in the mart is to provide faster access for end-user analytics. If we were to query a normalized schema for the same analytics, we would end up with complex multi-level joins that would be much slower than queries on the denormalized schema.

The top-down approach can be used when:
1. The business has complete clarity on all or multiple subject areas' data warehouse requirements.
2. The business is ready to invest considerable time and money.

The advantage of using the Top Down approach is that we build a centralized repository to cater for one version of truth for business data. This is very important for the data to be reliable, consistent across subject areas and for reconciliation in case of data related contention between subject areas.

The disadvantage of using the Top Down approach is that it requires more time and initial investment. The business has to wait for the EDW to be implemented, followed by building the data marts, before they can access their reports.

b. Bottom Up Approach

The bottom up approach suggested by Ralph Kimball is an incremental approach to build a data warehouse. In this approach data marts are built separately at different points of time as and when the specific subject area requirements are clear.  The data marts are integrated or combined together to form a data warehouse. Separate data marts are combined through the use of conformed dimensions and conformed facts. A conformed dimension and a conformed fact is one that can be shared across data marts.

A Conformed dimension has consistent dimension keys, consistent attribute names and consistent values across separate data marts. The conformed dimension means exact same thing with every fact table it is joined.

A Conformed fact has the same definition of measures, same dimensions joined to it and at the same granularity across data marts.

The bottom up approach helps us incrementally build the warehouse by developing and integrating data marts as and when the requirements are clear. We don't have to wait for knowing the overall requirements of the warehouse. We should implement the bottom up approach when:

1. We have initial cost and time constraints.
2. The complete warehouse requirements are not clear; we have clarity on only one data mart.

Merits of the Bottom Up approach:
- It does not require high initial costs and has a faster implementation time; hence the business can start using the marts much earlier as compared to the top-down approach.

Drawbacks of the Bottom Up approach:
- It stores data in the denormalized format, so there would be high space usage for detailed data.
- We have a tendency of not keeping detailed data in this approach, hence losing out on the advantage of having detail data.

1.2.2 Design considerations

A successful data warehouse designer must adopt a holistic approach by considering all data warehouse components as parts of a single complex system, and take into account all possible data sources and all known usage requirements.

Most successful data warehouses have the following common characteristics:
1. Are based on a dimensional model
2. Contain historical and current data
3. Include both detailed and summarized data
4. Consolidate disparate data from multiple sources while retaining consistency

A data warehouse is difficult to build due to the following reasons:
- Heterogeneity of data sources
- Use of historical data
- Growing nature of the database

The data warehouse design approach must be a business-driven, continuous and iterative engineering approach. In addition to the general considerations, the following specific points are relevant to data warehouse design:

1. Data Content
The content and structure of the data warehouse are reflected in its data model. The data model is the template that describes how information will be organized within the integrated warehouse framework. The data in a data warehouse must be detailed data. It must be formatted, cleaned up and transformed to fit the warehouse data model.

2. Meta Data


It defines the location and contents of data in the warehouse. Meta data is searchable by users to find definitions or subject areas. In other words, it must provide decision support oriented pointers to warehouse data and thus provides a logical link between warehouse data and decision support applications.

3. Data Distribution One of the biggest challenges when designing a data warehouse is the data placement and distribution strategy. Data volumes continue to grow in nature. Therefore, it becomes necessary to know how the data should be divided across multiple servers and which users should get access to which types of data. The data can be distributed based on the subject area, location (geographical region), or time (current, month, year).

4. Tools A number of tools are available that are specifically designed to help in the implementation of the data warehouse. All selected tools must be compatible with the given data warehouse environment and with each other. All tools must be able to use a common Meta data repository.

Design steps

The following nine-step method is followed in the design of a data warehouse:
1. Choosing the subject matter
2. Deciding what a fact table represents
3. Identifying and conforming the dimensions
4. Choosing the facts
5. Storing pre-calculations in the fact table
6. Rounding out the dimension table
7. Choosing the duration of the database
8. The need to track slowly changing dimensions
9. Deciding the query priorities and query models

1.2.3 Technical Considerations

A number of technical issues are to be considered when designing a data warehouse environment. These issues include:

- The hardware platform that would house the data warehouse
- The DBMS that supports the warehouse database
- The communication infrastructure that connects data marts, operational systems and end users
- The hardware and software to support the metadata repository
- The systems management framework that enables centralized management and administration of the entire environment

1.2.4 Implementation Considerations


The following logical steps are needed to implement a data warehouse:
- Collect and analyze business requirements
- Create a data model and a physical design
- Define data sources
- Choose the database technology and platform
- Extract the data from the operational database, transform it, clean it up and load it into the warehouse
- Choose database access and reporting tools
- Choose database connectivity software
- Choose data analysis and presentation software
- Update the data warehouse

Access Tools

Data warehouse implementation relies on selecting suitable data access tools. The best way to choose this is based on the type of data and the kind of access it permits for a particular user. The following lists the various types of data that can be accessed:

- Simple tabular form data
- Ranking data
- Multivariable data
- Time series data
- Graphing, charting and pivoting data
- Complex textual search data
- Statistical analysis data
- Data for testing of hypothesis, trends and patterns
- Predefined repeatable queries
- Ad hoc user specified queries
- Reporting and analysis data
- Complex queries with multiple joins, multi-level sub queries and sophisticated search criteria

Data Extraction, Clean Up, Transformation and Migration

Proper attention must be paid to data extraction, which represents a success factor for data warehouse architecture. When implementing a data warehouse the following selection criteria should be considered:

- Timeliness of data delivery to the warehouse
- The tool must have the ability to identify the particular data that can be read by the conversion tool
- The tool must support flat files and indexed files, since corporate data is still in this type
- The tool must have the capability to merge data from multiple data stores
- The tool should have a specification interface to indicate the data to be extracted
- The tool should have the ability to read data from the data dictionary
- The code generated by the tool should be completely maintainable
- The tool should permit the user to extract the required data
- The tool must have the facility to perform data type and character set translation
- The tool must have the capability to create summarization, aggregation and derivation of records
- The data warehouse database system must be able to load data directly from these tools

Data Placement Strategies

As a data warehouse grows, there are at least two options for data placement. One is to put some of the data in the data warehouse into another storage medium. The second option is to distribute the data in the data warehouse across multiple servers. It considers Data Replication and Database gateways.

Metadata

Metadata can define all data elements and their attributes, data sources and timing, and the rules that govern data use and data transformations.

User Sophistication Levels

The users of data warehouse data can be classified on the basis of their skill level in accessing the warehouse. There are three classes of users:

Casual users: are most comfortable in retrieving information from warehouse in pre defined formats and running pre existing queries and reports. These users do not need tools that allow for building standard and ad hoc reports

Power Users: can use pre defined as well as user defined queries to create simple and ad hoc reports. These users can engage in drill down operations. These users may have the experience of using reporting and query tools.

Expert users: These users tend to create their own complex queries and perform standard analysis on the info they retrieve. These users have the knowledge about the use of query and report tools.

1.3 Multi-Tier Architecture

The functions of the data warehouse are based on relational database technology. The relational database technology is implemented in a parallel manner. There are two advantages of having parallel relational database technology for a data warehouse:

- Linear Speed up: refers to the ability to increase the number of processors to reduce response time.
- Linear Scale up: refers to the ability to provide the same performance on the same requests as the database size increases.

1.3.1 Types of parallelism

There are two types of parallelism:

- Inter Query Parallelism: different server threads or processes handle multiple requests at the same time.
- Intra Query Parallelism: this form of parallelism decomposes the serial SQL query into lower level operations such as scan, join, sort etc. Then these lower level operations are executed concurrently in parallel.

Intra query parallelism can be done in either of two ways:

- Horizontal Parallelism: the database is partitioned across multiple disks and parallel processing occurs within a specific task that is performed concurrently on different processors against different sets of data.
- Vertical Parallelism: this occurs among different tasks. All query components such as scan, join, sort etc. are executed in parallel in a pipelined fashion. In other words, an output from one task becomes an input into another task.

1.3.2 Database Architecture

There are three DBMS software architecture styles for parallel processing:
1. Shared memory or shared everything architecture
2. Shared disk architecture
3. Shared nothing architecture

1. Shared Memory Architecture

Tightly coupled shared memory systems have the following characteristics:
- Multiple Processor Units share memory.
- Each Processor Unit has full access to all shared memory through a common bus.
- Communication between nodes occurs via shared memory.
- Performance is limited by the bandwidth of the memory bus.

Fig. 1.3.2.1 Shared Memory Architecture

Symmetric multiprocessor (SMP) machines are often nodes in a cluster. Multiple SMP nodes can be used with Oracle Parallel Server in a tightly coupled system, where memory is shared by the multiple Processor Units, and is accessible by all the Processor Units through a memory bus. Examples of tightly coupled systems include the Pyramid, Sequent, and Sun SparcServer.

Performance is limited in a tightly coupled system by the following factors:


- Memory bandwidth
- Processor Unit to Processor Unit communication bandwidth
- Memory availability
- I/O bandwidth
- Bandwidth of the common bus

Parallel processing advantages of shared memory systems are these:
- Memory access is cheaper than inter-node communication. This means that internal synchronization is faster than using the Lock Manager.
- Shared memory systems are easier to administer than a cluster.

A disadvantage of shared memory systems for parallel processing is as follows:
- Scalability is limited by bus bandwidth and latency, and by available memory.

2. Shared Disk Architecture

Shared disk systems are typically loosely coupled. Such systems, illustrated in the following figure, have the following characteristics:
- Each node consists of one or more Processor Units and associated memory.
- Memory is not shared between nodes.
- Communication occurs over a common high-speed bus.
- Each node has access to the same disks and other resources.
- A node can be an SMP if the hardware supports it.
- Bandwidth of the high-speed bus limits the number of nodes of the system.

Fig. 1.3.2.2 Shared Disk Architecture

Each node has its own data cache as the memory is not shared among the nodes. Cache consistency must be maintained across the nodes and a lock manager is needed to maintain the consistency. Additionally, instance locks using the DLM on the Oracle level must be maintained to ensure that all nodes in the cluster see identical data.

There is additional overhead in maintaining the locks and ensuring that the data caches are consistent. The performance impact is dependent on the hardware and


software components, such as the bandwidth of the high-speed bus through which the nodes communicate, and DLM performance.

Merits of shared disk systems:
- Shared disk systems permit high availability. All data is accessible even if one node dies.
- These systems have the concept of one database, which is an advantage over shared nothing systems.
- Shared disk systems provide for incremental growth.

Drawbacks of shared disk systems:
- Inter-node synchronization is required, involving DLM overhead and greater dependency on the high-speed interconnect.
- If the workload is not partitioned well, there may be high synchronization overhead.
- There is operating system overhead of running shared disk software.

3. Shared Nothing Architecture

Shared nothing systems are typically loosely coupled. In shared nothing systems only one CPU is connected to a given disk. If a table or database is located on that disk, access depends entirely on the Processor Unit which owns it. Shared nothing systems can be represented as follows:

Fig. 1.3.2.3 Distributed Memory Architecture

Shared nothing systems are concerned with access to disks, not access to memory. Nonetheless, adding more PUs and disks can improve scaleup. Oracle Parallel Server can access the disks on a shared nothing system as long as the operating system provides transparent disk access, but this access is expensive in terms of latency.

Advantages of shared nothing systems:
- Shared nothing systems provide for incremental growth.
- System growth is practically unlimited.
- MPPs are good for read-only databases and decision support applications.
- Failure is local: if one node fails, the others stay up.


Drawbacks of shared nothing systems:
- More coordination is required.
- More overhead is required for a process working on a disk belonging to another node.
- If there is a heavy workload of updates or inserts, as in an online transaction processing system, it may be worthwhile to consider data-dependent routing to alleviate contention.

1.4 Data Warehousing Schema

There are three basic schemas that are used in dimensional modeling:
1. Star schema
2. Snowflake schema
3. Fact constellation schema

1.4.1 Star schema

The multidimensional view of data that is expressed using relational database semantics is provided by the database schema design called star schema. The basis of the star schema is that information can be classified into two groups:
- Facts
- Dimensions

Star schema has one large central table (fact table) and a set of smaller tables (dimensions) arranged in a radial pattern around the central table.

- Facts are the core data elements being analyzed.
- Dimensions are attributes about the facts.

The determination of which schema model should be used for a data warehouse should be based upon the analysis of project requirements, accessible tools and project team preferences.

Fig. 1.4.1.1 Star Schema

Star schema has points radiating from a center. The center of the star consists of the fact table and the points of the star are the dimension tables. Usually the fact tables in a star schema are in third normal form (3NF) whereas dimensional tables are de-normalized. Star schema is the simplest architecture and is most commonly used and recommended by Oracle.

Fact Tables

A fact table is a table that contains summarized numerical and historical data (facts) and a multipart index composed of foreign keys from the primary keys of related dimension tables.

A fact table typically has two types of columns: foreign keys to dimension tables, and measures (columns that contain numeric facts). A fact table can contain fact data at a detailed or aggregated level.

Dimension Tables

Dimensions are categories by which summarized data can be viewed. E.g. a profit summary in a fact table can be viewed by a Time dimension (profit by month, quarter, year), Region dimension (profit by country, state, city), Product dimension (profit for product1, product2).

A dimension is a structure usually composed of one or more hierarchies that categorizes data. If a dimension has no hierarchies and levels, it is called a flat dimension or list. The primary keys of each of the dimension tables are part of the composite primary key of the fact table. Dimensional attributes help to describe the dimensional value. They are normally descriptive, textual values. Dimension tables are generally smaller in size than the fact table.

Measures

Measures are numeric data based on columns in a fact table. They are the primary data which end users are interested in. E.g. a sales fact table may contain a profit measure which represents the profit on each sale.

Cubes are data processing units composed of fact tables and dimensions from the data warehouse. They provide multidimensional views of data, querying and analytical capabilities to clients.

The main characteristics of star schema:
- Simple structure and easy to understand.
- Great query effectiveness for a small number of tables to join.
- Relatively long time for loading data into dimension tables; de-normalization and redundant data mean that the size of the table could be large.
- The most commonly used in data warehouse implementations.
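As an illustration of the fact/dimension split described above, the following is a minimal sketch in Python using SQLite. The table and column names (dim_time, dim_product, fact_sales) and the sample rows are invented for the example; they are not part of the syllabus or the notes.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE fact_sales (                  -- fact table: foreign keys plus numeric measures
    time_id INTEGER REFERENCES dim_time(time_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    units_sold INTEGER,
    dollars_sold REAL
);
""")

cur.execute("INSERT INTO dim_time VALUES (1, '2013-07-01', 'July', 2013)")
cur.execute("INSERT INTO dim_product VALUES (1, 'Pen', 'Stationery')")
cur.execute("INSERT INTO fact_sales VALUES (1, 1, 120, 540.0)")

# A typical star-join query: aggregate the measure by a dimension attribute.
cur.execute("""
SELECT p.category, SUM(f.dollars_sold)
FROM fact_sales f JOIN dim_product p ON f.product_id = p.product_id
GROUP BY p.category
""")
print(cur.fetchall())    # [('Stationery', 540.0)]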

1.4.2 Snowflake schema: is the result of decomposing one or more of the dimensions. The many-to-one relationships among sets of attributes of a dimension can be separated into new dimension tables, forming a hierarchy. The decomposed snowflake structure visualizes the hierarchical structure of dimensions very well.


1.4.3 Fact constellation schema: For each star schema it is possible to construct fact constellation schema. The fact constellation architecture contains multiple fact tables that share many dimension tables.

The main shortcoming of the fact constellation schema is a more complicated design because many variants for particular kinds of aggregation must be considered and selected.

1.5 Multidimensional data model

A multidimensional data model views data as a cube. The table at the left contains detailed sales data by product, market and time. The cube on the right associates sales numbers (units sold) with dimensions - product type, market and time - with the unit variables organized as cells in an array.

This cube can be expanded to include another array - price - which can be associated with all or only some dimensions. As the number of dimensions increases, the number of cube cells increases exponentially.

Dimensions are hierarchical in nature, i.e. the time dimension may contain hierarchies for years, quarters, months, weeks and days. A GEOGRAPHY dimension may contain country, state, city, etc.

Fig. 1.5.1 Multidimensional cube

Each side of the cube represents one of the elements of the question. The x-axis represents the time, the y-axis represents the products and the z-axis represents different centers. The cells in the cube represent the number of products sold, or can represent the price of the items.

When the size of the dimension increases, the size of the cube will also increase exponentially. The time response of the cube depends on the size of the cube.

1.5.1 Operations in Multidimensional Data Model:


• Aggregation (roll-up)
  – dimension reduction: e.g., total sales by city
  – summarization over aggregate hierarchy: e.g., total sales by city and year -> total sales by region and by year
• Selection (slice) defines a sub-cube
  – e.g., sales where city = Palo Alto and date = 1/15/96
• Navigation to detailed data (drill-down)
  – e.g., (sales - expense) by city, top 3% of cities by average income
• Visualization Operations (e.g., Pivot or dice)

1.6 OLAP operations

OLAP stands for Online Analytical Processing. It uses database tables (fact and dimension tables) to enable multidimensional viewing, analysis and querying of large amounts of data. OLAP technology can provide management with fast answers to complex queries on their operational data or enable them to analyze their company's historical data for trends and patterns.

Online Analytical Processing (OLAP) applications and tools are those that are designed to ask "complex queries of large multidimensional collections of data."

Operations:

- Roll up (drill-up): summarize data by climbing up the hierarchy or by dimension reduction.
- Drill down (roll down): reverse of roll-up; from higher level summary to lower level summary or detailed data, or introducing new dimensions.
- Slice and dice: project and select.
- Pivot (rotate): reorient the cube, visualization, 3D to a series of 2D planes.
- Other operations:
  - drill across: involving (across) more than one fact table
  - drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
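The operations listed above can be sketched on a toy data set. The following illustrative example (the sample data and column names are invented, not from the notes) uses pandas: roll-up drops the city level of the grouping, slice fixes one dimension, and pivot reorients the result.

import pandas as pd

sales = pd.DataFrame({
    "city":    ["Chennai", "Chennai", "Mumbai", "Mumbai"],
    "product": ["Pen", "Book", "Pen", "Book"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "amount":  [100, 200, 150, 250],
})

# Roll-up: climb the location hierarchy by dropping 'city' from the grouping.
rollup = sales.groupby(["product", "quarter"])["amount"].sum()

# Slice: fix one dimension (quarter = Q1) to obtain a sub-cube.
slice_q1 = sales[sales["quarter"] == "Q1"]

# Pivot: reorient the data, products as rows and quarters as columns.
pivot = sales.pivot_table(values="amount", index="product",
                          columns="quarter", aggfunc="sum")
print(rollup, slice_q1, pivot, sep="\n\n")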

1.6.1 OLAP Guidelines

Dr. E.F. Codd, the "father" of the relational model, created a list of rules to deal with OLAP systems. Users should prioritize these rules according to their needs to match their business requirements. These rules are:

1) Multidimensional conceptual view: The OLAP should provide an appropriate multidimensional Business model that suits the Business problems and Requirements.

2) Transparency: The OLAP tool should provide transparency to the input data for the users.


3) Accessibility: The OLAP tool should access only the data required for the analysis needed.

4) Consistent reporting performance: The size of the database should not affect the performance in any way.

5) Client/server architecture: The OLAP tool should use the client server architecture to ensure better performance and flexibility.

6) Generic dimensionality: Data entered should be equivalent to the structure and operation requirements.

7) Dynamic sparse matrix handling: The OLAP tool should be able to manage the sparse matrix and so maintain the level of performance.

8) Multi-user support: The OLAP tool should allow several users to work together concurrently.

9) Unrestricted cross-dimensional operations: The OLAP tool should be able to perform operations across the dimensions of the cube.

10)Intuitive data manipulation. “Consolidation path re-orientation, drilling down across columns or rows, zooming out, and other manipulation inherent in the consolidation path outlines should be accomplished via direct action upon the cells of the analytical model, and should neither require the use of a menu nor multiple trips across the user interface.”

11)Flexible reporting: It is the ability of the tool to present the rows and column in a manner suitable to be analyzed.

12)Unlimited dimensions and aggregation levels: This depends on the kind of Business, where multiple dimensions and defining hierarchies can be made.

In addition to these guidelines an OLAP system should also support:
- Comprehensive database management tools: this gives the database management the ability to control distributed businesses.
- The ability to drill down to detail source record level: this requires that the OLAP tool allow smooth transitions in the multidimensional database.
- Incremental database refresh: the OLAP tool should provide partial refresh.
- Structured Query Language (SQL) interface: the OLAP system should be able to integrate effectively into the surrounding enterprise environment.

1.7 Data warehouse implementation

1.7.1 Efficient Data Cube Computation

- A data cube can be viewed as a lattice of cuboids.
- The bottom-most cuboid is the base cuboid.
- The top-most cuboid (apex) contains only one cell.
- How many cuboids are there in an n-dimensional cube with L levels?
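For the question above, the count usually quoted (e.g., in Han and Kamber) is the following: if dimension i has Li levels (not counting the virtual top level "all"), an n-dimensional cube can generate

T = (L1 + 1) × (L2 + 1) × ... × (Ln + 1)

cuboids. For example, 10 dimensions with 4 levels each give 5^10 ≈ 9.8 × 10^6 cuboids, which is why full materialization is often impractical.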

Materialization of data cube


Materialize every cuboid (full materialization), none (no materialization), or some (partial materialization).

Selection of which cuboids to materialize

Based on size, sharing, access frequency, etc.

Cube definition and computation in DMQL:

define cube sales[item, city, year]: sum(sales_in_dollars)
compute cube sales

Transform it into a SQL-like language (with a new operator cube by, introduced by Gray et al.'96):

SELECT item, city, year, SUM (amount)
FROM SALES
CUBE BY item, city, year

Need to compute the following Group-Bys:
(date, product, customer), (date, product), (date, customer), (product, customer), (date), (product), (customer), ()
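A rough sketch of computing these group-bys (i.e., full materialization of the cube) is shown below. The toy sales table and the pandas/itertools approach are illustrative assumptions, not part of the notes.

from itertools import combinations
import pandas as pd

sales = pd.DataFrame({
    "date":     ["2013-01", "2013-01", "2013-02"],
    "product":  ["Pen", "Book", "Pen"],
    "customer": ["A", "B", "A"],
    "amount":   [10, 20, 30],
})

dims = ["date", "product", "customer"]
cuboids = {}
for k in range(len(dims), -1, -1):
    for group in combinations(dims, k):
        if group:                                   # e.g. (date, product)
            cuboids[group] = sales.groupby(list(group))["amount"].sum()
        else:                                       # the apex cuboid ()
            cuboids[group] = sales["amount"].sum()

for group, agg in cuboids.items():
    print(group, agg, sep="\n")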


Join index: JI(R-id, S-id) where R(R-id, …) ⋈ S(S-id, …)

- Traditional indices map values to a list of record ids.
- A join index materializes a relational join in the JI file and speeds up the relational join - a rather costly operation.
- In data warehouses, a join index relates the values of the dimensions of a star schema to rows in the fact table.
  - E.g. fact table: Sales, and two dimensions: city and product.
  - A join index on city maintains, for each distinct city, a list of R-IDs of the tuples recording the sales in that city.
- Join indices can span multiple dimensions.
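The idea of a join index on city can be sketched with a plain dictionary; the fact rows below are hypothetical and only illustrate how R-IDs are kept per dimension value.

from collections import defaultdict

# fact table rows: (row_id, city, product, amount) -- made-up sample data
fact_sales = [
    (0, "Chennai", "Pen",  100),
    (1, "Mumbai",  "Book", 200),
    (2, "Chennai", "Book", 150),
]

join_index_city = defaultdict(list)
for row_id, city, product, amount in fact_sales:
    join_index_city[city].append(row_id)      # city value -> list of R-IDs

print(dict(join_index_city))   # {'Chennai': [0, 2], 'Mumbai': [1]}

# The index answers "which fact rows join to city = Chennai?" without a scan:
chennai_rows = [fact_sales[r] for r in join_index_city["Chennai"]]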


1.8 Data Warehouse to Data Mining

1.8.1 Data Warehouse Usage

Three kinds of data warehouse applications:

- Information processing
  - supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs
- Analytical processing
  - multidimensional analysis of data warehouse data
  - supports basic OLAP operations: slice-dice, drilling, pivoting
- Data mining
  - knowledge discovery from hidden patterns
  - supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools

Differences among the three tasks


1.8.2 From On-Line Analytical Processing to On-Line Analytical Mining (OLAM)

Why online analytical mining?

- High quality of data in data warehouses
  - DW contains integrated, consistent, cleaned data
- Available information processing structure surrounding data warehouses
  - ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools
- OLAP-based exploratory data analysis
  - mining with drilling, dicing, pivoting, etc.
- On-line selection of data mining functions
  - integration and swapping of multiple mining functions, algorithms, and tasks

1.8.3 Architecture of OLAM


UNIT II DATA PREPROCESSING, LANGUAGE, ARCHITECTURES, CONCEPT DESCRIPTION

Why preprocessing − Cleaning − Integration − Transformation − Reduction − Discretization − Concept hierarchy generation − Data mining primitives − Query language − Graphical user interfaces − Architectures − Concept description − Data generalization − Characterizations − Class comparisons − Descriptive statistical measures.

2.1 Data preprocessing

Data preprocessing describes any type of processing performed on raw data to prepare it for another processing procedure. Commonly used as a preliminary data mining practice, data preprocessing transforms the data into a format that will be more easily and effectively processed for the purpose of the user.

We need data preprocessing because data in the real world are dirty: they can be incomplete, noisy and inconsistent. These data need to be preprocessed in order to improve the quality of the data, and the quality of the mining results.

- If there is no quality data, then there are no quality mining results. Quality decisions are always based on quality data.
- If there is much irrelevant and redundant information present, or noisy and unreliable data, then knowledge discovery during the training phase is more difficult.

Incomplete data may come from:
o "Not applicable" data values when collected
o Different considerations between the time when the data was collected and when it is analyzed
o Human/hardware/software problems
o e.g., occupation=" "

Noisy data (incorrect values) may come from:
o Faulty data collection instruments
o Human or computer error at data entry
o Errors in data transmission, producing errors or outlier data; e.g., Salary="-10"

Inconsistent data may come from:
o Different data sources
o Functional dependency violation (e.g., modify some linked data)
o Discrepancies in codes or names; e.g., Age="42", Birthday="03/07/1997"

2.5.1 Major Tasks in Data Preprocessing

- Data cleaning
  o Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies


- Data integration
  o Integration of multiple databases, data cubes, or files
- Data transformation
  o Normalization and aggregation
- Data reduction
  o Obtains a reduced representation in volume but produces the same or similar analytical results
- Data discretization
  o Part of data reduction, but with particular importance, especially for numerical data

Fig. 2.5.1.1 Forms of Data Preprocessing

2.2 Data cleaning:

Data cleaning routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.

i. Missing Values:

The various methods for handling the problem of missing values in data tuples include:

(a) Ignoring the tuple: When the class label is missing the tuple can be ignored. This method is not very effective unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably.

(b) Manually filling in the missing value: In general, this approach is time-consuming and may not be a reasonable task for large data sets with many missing values, especially when the value to be filled in is not easily determined.

(c) Using a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like "Unknown", or −∞. If missing values are replaced by, say, "Unknown", then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common, that of "Unknown".

(d) Using the attribute mean for quantitative (numeric) values or attribute mode for categorical (nominal) values, for all samples belonging to the same class as the given tuple: For example, if classifying customers according to credit risk, replace the missing value with the average income value for customers in the same credit risk category as that of the given tuple.

(e) Using the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using Bayesian formalism, or decision tree induction. For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income.
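Options (c) and (d) above could be sketched with pandas as follows; the column names (income, credit_risk) and the values are hypothetical and only serve the illustration.

import pandas as pd

df = pd.DataFrame({
    "income":      [35000, None, 42000, None, 51000],
    "credit_risk": ["low", "low", "high", "high", "low"],
})

# (c) fill with a global constant
filled_const = df["income"].fillna(-1)

# (d) fill with the mean income of tuples in the same class (credit_risk)
filled_class_mean = df.groupby("credit_risk")["income"].transform(
    lambda s: s.fillna(s.mean()))

# (e) would instead use a model (regression, decision tree) trained on the
#     other attributes to predict the missing income values.
print(filled_class_mean)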

ii. Noisy data:

Noise is a random error or variance in a measured variable. Data smoothing techniques are used for removing such noisy data.

Several data smoothing techniques are used:
a. Binning Method
b. Regression Method
c. Cluster Method

1. Binning methods: Binning methods smooth a sorted data value by consulting its "neighborhood", or the values around it. The sorted values are distributed into a number of 'buckets', or bins. Because binning methods consult the neighborhood of values, they perform local smoothing.

In this technique,
1. The data are first sorted.
2. Then the sorted list is partitioned into equi-depth bins.
3. Then one can smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc.
   a. Smoothing by bin means: each value in the bin is replaced by the mean value of the bin.
   b. Smoothing by bin medians: each value in the bin is replaced by the bin median.
   c. Smoothing by bin boundaries: the min and max values of a bin are identified as the bin boundaries. Each bin value is replaced by the closest boundary value.


Example: Binning Methods for Data Smoothing

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

o Partition into (equi-depth) bins (equi-depth of 4, since each bin contains four values):
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34

o Smoothing by bin means:
Bin 1: 9, 9, 9, 9
Bin 2: 23, 23, 23, 23
Bin 3: 29, 29, 29, 29

o Smoothing by bin boundaries:
Bin 1: 4, 4, 4, 15
Bin 2: 21, 21, 25, 25
Bin 3: 26, 26, 26, 34

In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. For example, the mean of the values 4, 8, 9, and 15 in Bin 1 is 9. Therefore, each original value in this bin is replaced by the value 9.

Smoothing by bin medians can be employed, in which each bin value is replaced by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value.

Suppose that the data for analysis include the attribute age. The age values for the data tuples are (in increasing order): 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.

(a) Use smoothing by bin means to smooth the above data, using a bin depth of 3.

The following steps are required to smooth the above data using smoothing by bin means with a bin depth of 3.

Step 1: Sort the data. (This step is not required here as the data are already sorted.)

Step 2: Partition the data into equi-depth bins of depth 3.
Bin 1: 13, 15, 16    Bin 2: 16, 19, 20    Bin 3: 20, 21, 22
Bin 4: 22, 25, 25    Bin 5: 25, 25, 30    Bin 6: 33, 33, 35
Bin 7: 35, 35, 35    Bin 8: 36, 40, 45    Bin 9: 46, 52, 70

Step 3: Calculate the arithmetic mean of each bin.

Step 4: Replace each of the values in each bin by the arithmetic mean calculated for the bin.
Bin 1: 14, 14, 14    Bin 2: 18, 18, 18    Bin 3: 21, 21, 21
Bin 4: 24, 24, 24    Bin 5: 26, 26, 26    Bin 6: 33, 33, 33
Bin 7: 35, 35, 35    Bin 8: 40, 40, 40    Bin 9: 56, 56, 56
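A short sketch of equi-depth binning with smoothing by bin means and by bin boundaries is given below. It assumes the price data from the earlier example (depth 4) and reproduces the same smoothed bins; the helper function name is ours, not from the notes.

def smooth_by_bins(values, depth=3):
    values = sorted(values)
    bins = [values[i:i + depth] for i in range(0, len(values), depth)]
    # replace every value by the (rounded) bin mean
    by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
    # replace every value by the closest bin boundary (min or max of the bin)
    by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
                 for b in bins]
    return by_means, by_bounds

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
means, bounds = smooth_by_bins(prices, depth=4)
print(means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]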

2. Regression: data can be smoothed by fitting it to a regression function. Linear regression involves finding the best line to fit two variables, so that one variable can be used to predict the other.

Fig. 2.5.1.2 Regression

Multiple linear regression is an extension of linear regression, where more than two variables are involved and the data are fit to a multidimensional surface. Using regression to find a mathematical equation to fit the data helps smooth out the noise.

3. Clustering: Outliers in the data may be detected by clustering, where similar values are organized into groups, or 'clusters'. Values that fall outside of the set of clusters may be considered outliers.

Fig. 2.5.1.3 Clustering
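Regression-based smoothing (item 2 above) could be sketched with NumPy as follows; the x and y values are synthetic and the fitted line is only illustrative.

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.3, 11.9])   # roughly y = 2x, with noise

a, b = np.polyfit(x, y, deg=1)   # least-squares fit of a degree-1 polynomial
y_smoothed = a * x + b           # replace noisy values by the fitted ones

print(a, b)
print(y_smoothed)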

iii. Data Cleaning Process:

- Field overloading: a source of errors that typically occurs when developers compress new attribute definitions into unused portions of already defined attributes.
- Unique rule: a rule that says each value of the given attribute must be different from all other values of that attribute.
- Consecutive rule: a rule that says there can be no missing values between the lowest and highest values of the attribute, and that all values must also be unique.
- Null rule: specifies the use of blanks, question marks, special characters or other strings that may indicate the null condition, and how such values should be handled.

2.3 Data Integration

Data integration combines data from multiple sources into a coherent store. There are a number of issues to consider during data integration.

Issues:
- Schema integration: refers to the integration of metadata from different sources.
- Entity identification problem: identifying an entity in one data source that is the same as an entity in another. For example, customer_id in one database and customer_no in another database refer to the same entity.
- Detecting and resolving data value conflicts: attribute values from different sources can be different due to different representations or different scales, e.g. metric vs. British units.
- Redundancy: redundancy can occur due to the following reasons:
  - Object identification: the same attribute may have different names in different databases.
  - Derived data: one attribute may be derived from another attribute.
  Correlation analysis is used to detect the redundancy.
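Correlation analysis for spotting redundant numeric attributes can be sketched as below; the two income attributes and the 0.9 threshold are illustrative assumptions, not prescribed by the notes.

import numpy as np

annual_income  = np.array([30, 45, 50, 62, 80], dtype=float)   # in thousands
monthly_income = annual_income / 12 + np.random.normal(0, 0.1, 5)

r = np.corrcoef(annual_income, monthly_income)[0, 1]
if abs(r) > 0.9:           # a high |r| suggests one attribute is redundant
    print(f"correlation {r:.3f}: attributes look redundant")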

2.4 Data Transformation

In data transformation, the data are transformed or consolidated into forms appropriate for mining.

Data transformation can involve the following:

- Smoothing is used to remove noise from the data. It includes binning, regression, and clustering.
- Aggregation, where summary or aggregation operations are applied to the data.
  o For example, the daily sales data may be aggregated so as to compute monthly and annual total amounts.
- Generalization of the data, where low-level or "primitive" (raw) data are replaced by higher-level concepts through the use of concept hierarchies. For example, categorical attributes, like street, can be generalized to higher-level concepts, like city or country.
- Normalization is used to scale the attribute data to fall within a small specified range, such as -1.0 to 1.0, or 0.0 to 1.0 (see the sketch after this list).
- Attribute construction (or feature construction) is used to construct new attributes, which can be added to the given set of attributes to help the mining process.
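The normalization step mentioned above can be sketched as follows; the salary values are made up, and the z-score variant is included only as another common scaling choice.

import numpy as np

salary = np.array([30000, 45000, 50000, 62000, 98000], dtype=float)

min_max = (salary - salary.min()) / (salary.max() - salary.min())  # maps to 0.0 .. 1.0
z_score = (salary - salary.mean()) / salary.std()                  # centred, unit variance

print(min_max)
print(z_score)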

2.5 Data Reduction

Data reduction is a technique used to obtain a reduced representation of the data set.

Various strategies used for data reduction:
1. Data cube aggregation, which uses aggregation operations applied to the data in the construction of a data cube.
2. Attribute subset selection, where irrelevant, weakly relevant or redundant attributes or dimensions may be detected and removed.
3. Dimensionality reduction, where encoding mechanisms are used to reduce the data set size.
4. Numerosity reduction, where the data are replaced or estimated by smaller data representations such as parametric models, or nonparametric methods such as clustering, sampling, and the use of histograms.


2.6 Data Discretization

Raw data values for attributes are replaced by ranges or higher conceptual levels in data discretization. The various methods used in data discretization are Binning, Histogram Analysis, Entropy-Based Discretization, Interval merging by χ2 (chi-square) analysis, and Clustering.

Three types of attributes:
- Nominal: values from an unordered set
- Ordinal: values from an ordered set
- Continuous: real numbers

Discretization: divide the range of a continuous attribute into intervals.
- Some classification algorithms only accept categorical attributes.
- Reduce data size by discretization.

Concept hierarchies reduce the data by collecting and replacing low level concepts (such as numeric values for the attribute age) by higher level concepts (such as young, middle-aged, or senior).

Prepare for further analysis.

Methods for discretization of numeric data:
- Binning
- Histogram analysis
- Clustering analysis
- Entropy-based discretization
- Segmentation by natural partitioning

Entropy-Based Discretization

Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is

E(S, T) = (|S1| / |S|) Ent(S1) + (|S2| / |S|) Ent(S2)

The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.

The process is recursively applied to the partitions obtained until some stopping criterion is met, e.g., Ent(S) − E(T, S) > δ.

Experiments show that it may reduce data size and improve classification accuracy
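A rough sketch of the entropy-based split described above is given below; the toy ages/labels and the simplified single-split search (no recursion or stopping rule) are assumptions made for illustration.

from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    # try every midpoint between consecutive sorted values as boundary T
    pairs = sorted(zip(values, labels))
    best_t, best_e = None, float("inf")
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left  = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e

ages   = [23, 25, 30, 35, 46, 52, 60]
labels = ["no", "no", "no", "yes", "yes", "yes", "yes"]
print(best_split(ages, labels))   # boundary 32.5 with entropy 0.0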

Concept Hierarchy Generation for Categorical Data

- Specification of a partial ordering of attributes explicitly at the schema level by users or experts
- Specification of a portion of a hierarchy by explicit data grouping
- Specification of a set of attributes, but not of their partial ordering
- Specification of only a partial set of attributes

2.7 Data Mining Primitives

- Finding all the patterns autonomously in a database is unrealistic, because the patterns could be too many but uninteresting.
- Data mining should be an interactive process: the user directs what is to be mined.
- Users must be provided with a set of primitives to be used to communicate with the data mining system.
- Incorporating these primitives in a data mining query language gives:
  - More flexible user interaction
  - A foundation for the design of graphical user interfaces
  - Standardization of data mining industry and practice

Task-relevant data:
- Database or data warehouse name
- Database tables or data warehouse cubes
- Condition for data selection
- Relevant attributes or dimensions
- Data grouping criteria

Types of knowledge to be mined

- Characterization
- Discrimination
- Association
- Classification/prediction
- Clustering
- Outlier analysis
- Other data mining tasks

Background knowledge: Concept Hierarchies

- Schema hierarchy
  E.g., street < city < province_or_state < country
- Set-grouping hierarchy
  E.g., {20-39} = young, {40-59} = middle_aged
- Operation-derived hierarchy
  E.g., email address: login-name < department < university < country
- Rule-based hierarchy
  E.g., low_profit_margin(X) <= price(X, P1) and cost(X, P2) and (P1 - P2) < $50

Measurements of Pattern Interestingness

- Simplicity
  e.g., (association) rule length, (decision) tree size
- Certainty
  e.g., confidence, P(A|B) = n(A and B) / n(B), classification reliability or accuracy, certainty factor, rule strength, rule quality, discriminating weight, etc.
- Utility
  potential usefulness, e.g., support (association), noise threshold (description)
- Novelty
  not previously known, surprising (used to remove redundant rules, e.g., Canada vs. Vancouver rule implication support ratio)

2.8 Data Mining Query Language (DMQL)

Motivation
- A DMQL can provide the ability to support ad-hoc and interactive data mining.
- By providing a standardized language like SQL, we hope to achieve a similar effect to the one SQL has had on relational databases:
  - A foundation for system development and evolution
  - Facilitating information exchange, technology transfer, commercialization and wide acceptance

Design
- DMQL is designed with the primitives described earlier.

Syntax for DMQL

Syntax for specification of:
- task-relevant data
- the kind of knowledge to be mined
- concept hierarchy specification
- interestingness measures
- pattern presentation and visualization

Putting it all together — a DMQL query

Syntax for task-relevant data specification

use database database_name, or use data warehouse data_warehouse_name
from relation(s)/cube(s) [where condition]
in relevance to att_or_dim_list
order by order_list
group by grouping_list
having condition

Syntax for specifying the kind of knowledge to be mined

Characterization
Mine_Knowledge_Specification ::=
  mine characteristics [as pattern_name]
  analyze measure(s)

Discrimination
Mine_Knowledge_Specification ::=
  mine comparison [as pattern_name]
  for target_class where target_condition
  {versus contrast_class_i where contrast_condition_i}
  analyze measure(s)

Association
Mine_Knowledge_Specification ::=
  mine associations [as pattern_name]

Classification
Mine_Knowledge_Specification ::=
  mine classification [as pattern_name]
  analyze classifying_attribute_or_dimension

Prediction
Mine_Knowledge_Specification ::=
  mine prediction [as pattern_name]
  analyze prediction_attribute_or_dimension
  {set {attribute_or_dimension_i = value_i}}

Syntax for concept hierarchy specification To specify what concept hierarchies to use

use hierarchy <hierarchy> for <attribute_or_dimension>
We use different syntax to define different types of hierarchies:

Schema hierarchies: define hierarchy time_hierarchy on date as [date, month, quarter, year]

Set-grouping hierarchies: define hierarchy age_hierarchy for age on customer as
level1: {young, middle_aged, senior} < level0: all
level2: {20, ..., 39} < level1: young
level2: {40, ..., 59} < level1: middle_aged
level2: {60, ..., 89} < level1: senior

Syntax for interestingness measure specification

Interestingness measures and thresholds can be specified by the user with the statement:


with <interest_measure_name> threshold = threshold_value
Example:
with support threshold = 0.05
with confidence threshold = 0.7

2.9 Designing Graphical User Interfaces based on a data mining query language

What tasks should be considered in the design of GUIs based on a data mining query language?

Data collection and data mining query composition
Presentation of discovered patterns
Hierarchy specification and manipulation
Manipulation of data mining primitives
Interactive multilevel mining
Other miscellaneous information

2.10 Data Mining System Architectures

Coupling a data mining system with a DB/DW system:
No coupling: flat file processing, not recommended.
Loose coupling: fetching data from the DB/DW.
Semi-tight coupling: enhanced DM performance; provides efficient implementations of a few data mining primitives in the DB/DW system, e.g., sorting, indexing, aggregation, histogram analysis, multiway join, precomputation of some statistical functions.
Tight coupling: a uniform information processing environment; DM is smoothly integrated into the DB/DW system, and mining queries are optimized based on mining query analysis, indexing, query processing methods, etc.

2.11 Concept Description

Descriptive vs. predictive data mining:
Descriptive mining: describes concepts or task-relevant data sets in concise, summarative, informative, discriminative forms.
Predictive mining: based on data and analysis, constructs models for the database, and predicts the trend and properties of unknown data.
Concept description:
Characterization: provides a concise and succinct summarization of the given collection of data.


Comparison : provides descriptions comparing two or more collections of data

Concept Description vs. OLAP
Concept description:
can handle complex data types of the attributes and their aggregations
a more automated process
OLAP:
restricted to a small number of dimension and measure types
a user-controlled process

2.12 Data Generalization and Summarization-based Characterization

Data generalization: a process which abstracts a large set of task-relevant data in a database from low conceptual levels to higher ones.
Approaches:
Data cube approach (OLAP approach)
Attribute-oriented induction approach

Characterization: Data Cube Approach (without using AO-Induction)

Perform computations and store results in data cubes.
Strengths:
An efficient implementation of data generalization
Computation of various kinds of measures, e.g., count(), sum(), average(), max()
Generalization and specialization can be performed on a data cube by roll-up and drill-down
Limitations:
Handles only dimensions of simple nonnumeric data and measures of simple aggregated numeric values
Lacks intelligent analysis: it cannot tell which dimensions should be used or to what levels the generalization should reach

Attribute-Oriented Induction

Proposed in 1989 (KDD '89 workshop). Not confined to categorical data nor to particular measures. How is it done?


Collect the task-relevant data (initial relation) using a relational database query.

Perform generalization by attribute removal or attribute generalization.

Apply aggregation by merging identical, generalized tuples and accumulating their respective counts.

Interactive presentation with users.

Basic Principles of Attribute-Oriented Induction

Data focusing : task-relevant data, including dimensions, and the result is the initial relation.

Attribute-removal : remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A’s higher level concepts are expressed in terms of other attributes.

Attribute-generalization : If there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize A.

Attribute-threshold control : typical 2-8, specified/default. Generalized relation threshold control : control the final relation/rule size.

Basic Algorithm for Attribute-Oriented Induction

InitialRel : Query processing of task-relevant data, deriving the initial relation.

PreGen: Based on the analysis of the number of distinct values in each attribute, determine generalization plan for each attribute: removal? or how high to generalize?

PrimeGen : Based on the PreGen plan, perform generalization to the right level to derive a “prime generalized relation”, accumulating the counts.

Presentation : User interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross tabs, visualization presentations.
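A minimal Python sketch of the steps above (one level of generalization only; the concept hierarchy, attribute threshold, and student tuples are illustrative assumptions, not the textbook's data):

from collections import Counter

hierarchy = {
    "birth_place": {"Vancouver": "Canada", "Toronto": "Canada", "Seattle": "USA"},
    "gpa": {"3.9": "excellent", "3.7": "excellent", "3.2": "very_good"},
}
ATTR_THRESHOLD = 2   # generalize an attribute whose distinct-value count exceeds this

def generalize(tuples, attrs):
    # decide once per attribute whether it has too many distinct values
    plan = {a: len({t[a] for t in tuples}) > ATTR_THRESHOLD and a in hierarchy for a in attrs}
    # replace values by their higher-level concepts where planned
    rows = [tuple(hierarchy[a].get(t[a], t[a]) if plan[a] else t[a] for a in attrs) for t in tuples]
    return Counter(rows)   # identical generalized tuples are merged, counts accumulated

students = [
    {"birth_place": "Vancouver", "gpa": "3.9"},
    {"birth_place": "Toronto",   "gpa": "3.7"},
    {"birth_place": "Seattle",   "gpa": "3.2"},
    {"birth_place": "Vancouver", "gpa": "3.7"},
]
print(generalize(students, ["birth_place", "gpa"]))   # prime generalized relation with counts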

Example DMQL: Describe general characteristics of graduate students in the Big_University database.
use Big_University_DB
mine characteristics as "Science_Students"
in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
from student
where status in "graduate"

Corresponding SQL statement:
Select name, gender, major, birth_place, birth_date, residence, phone#, gpa


from student
where status in {"Msc", "MBA", "PhD"}

Presentation of Generalized Results
Generalized relation:
Relations where some or all attributes are generalized, with counts or other aggregation values accumulated.
Cross tabulation:
Mapping results into cross-tabulation form (similar to contingency tables).
Visualization techniques:
Pie charts, bar charts, curves, cubes, and other visual forms.
Quantitative characteristic rules:
Mapping the generalized result into characteristic rules with quantitative information (t-weights) associated with them, e.g., rules of the form target_class(X) => condition1(X) [t: w1] or ... or conditionm(X) [t: wm].

2.13 Mining Class Comparisons

Comparison: comparing two or more classes.
Method:
Partition the set of relevant data into the target class and the contrasting class(es)
Generalize both classes to the same high-level concepts
Compare tuples with the same high-level descriptions
Present for every tuple its description and two measures:
support - distribution within a single class
comparison - distribution between classes
Highlight the tuples with strong discriminant features
Relevance analysis:
Find attributes (features) which best distinguish different classes.
Task: compare graduate and undergraduate students using a discriminant rule.
DMQL query:
use Big_University_DB
mine comparison as "grad_vs_undergrad_students"
in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
for "graduate_students" where status in "graduate"
versus "undergraduate_students" where status in "undergraduate"
analyze count%
from student


Example: Analytical comparison (continued)
Given: attributes name, gender, major, birth_place, birth_date, residence, phone# and gpa;
Gen(ai) = concept hierarchies on attributes ai;
Ui = attribute analytical thresholds for attributes ai;
Ti = attribute generalization thresholds for attributes ai;
R = attribute relevance threshold.
Steps:
1. Data collection: target and contrasting classes
2. Attribute relevance analysis: remove attributes name, gender, major, phone#
3. Synchronous generalization: controlled by user-specified dimension thresholds, producing prime target and contrasting class(es) relations/cuboids

Class Description
Quantitative characteristic rule: necessary
Quantitative discriminant rule: sufficient
Quantitative description rule: necessary and sufficient

2.14 Mining Descriptive Statistical Measures in Large Databases

2.14.1 Mining Data Dispersion Characteristics
Motivation: to better understand the data: central tendency, variation and spread.
Data dispersion characteristics: median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals:
Data dispersion analyzed with multiple granularities of precision
Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures:
Folding measures into numerical dimensions
Boxplot or quantile analysis on the transformed cube

Measuring the Central Tendency

Mean: mean = (x1 + x2 + ... + xn) / n
Weighted arithmetic mean: mean = (w1 x1 + ... + wn xn) / (w1 + ... + wn)
Median: a holistic measure
The middle value if there is an odd number of values, or the average of the middle two values otherwise
Can be estimated by interpolation for grouped data
Mode
The value that occurs most frequently in the data
Unimodal, bimodal, trimodal distributions
Empirical formula: mean - mode = 3 × (mean - median)

Measuring the Dispersion of Data

Quartiles, outliers and boxplots Quartiles: Q1 (25th percentile), Q3 (75th percentile) Inter-quartile range: IQR = Q3 – Q1

Five number summary: min, Q1, M, Q3, max Boxplot: ends of the box are the quartiles, median is marked,

whiskers, and plot outlier individually Outlier: usually, a value higher/lower than 1.5 x IQR

Variance and standard deviation Variance s2: (algebraic, scalable computation) Standard deviation s is the square root of variance s2
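The dispersion measures above can be computed with the Python standard library; the data values below are illustrative only:

import statistics

data = [6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49]

q1, _, q3 = statistics.quantiles(data, n=4)                # Q1 and Q3 cut points
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr                    # 1.5 * IQR outlier fences
outliers = [x for x in data if x < lo or x > hi]

print("five-number summary:", min(data), q1, statistics.median(data), q3, max(data))
print("IQR:", iqr, "outliers:", outliers)
print("variance:", statistics.pvariance(data), "std dev:", statistics.pstdev(data))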

Boxplot Analysis

Five-number summary of a distribution: Minimum, Q1, M, Q3, Maximum
Boxplot:
The data is represented with a box
The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
The median is marked by a line within the box
Whiskers: two lines outside the box extend to the Minimum and Maximum

Graphic Displays of Basic Statistical Descriptions

Histogram: (shown before)
Boxplot: (covered before)
Quantile plot: each value xi is paired with fi, indicating that approximately 100 fi % of the data are ≤ xi
Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
Scatter plot: each pair of values is a pair of coordinates plotted as a point in the plane
Loess (local regression) curve: adds a smooth curve to a scatter plot to provide better perception of the pattern of dependence


UNIT III ASSOCIATION RULESAssociation rule mining − Single-dimensional boolean association rules from transactional databases − Multi level association rules from transaction databases

3.1 Association Rule Mining
Association rule mining is used for finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories. It searches for interesting relationships among items in a given data set.
3.1.1 Market Basket Analysis
Market basket analysis (e.g., in electronics shops) is a motivating example for association rule mining.

Motivation: finding regularities in data.
What products were often purchased together? — Beer and diapers?!
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?

Association rule mining is used for analyzing buying behavior. Frequently purchased items can be placed in close proximity in order to further encourage the sale of such items together.

If customers who purchase computers also tend to buy financial management software at the same time, then placing the hardware display close to the software display may help to increase the sales of both of these items.

Each basket can then be represented by a Boolean vector of values assigned to these variables. The Boolean vectors can be analyzed for buying patterns that reflect items that are frequently associated or purchased together. These patterns can be represented in the form of association rules. For example, the information that customers who purchase computers also tend to buy financial management software at the same time is represented by the association rule: computer => financial management software [support = 2%, confidence = 60%]. A classic example of association rule mining is market basket analysis. This process analyzes customer buying habits by finding associations between the different items that customers place in their "shopping baskets".


Fig. 3.1.1 Market basket analysis

The discovery of such associations can help retailers develop marketing strategies by gaining insight into which items are frequently purchased together by customers. For instance, if customers are buying milk, how likely are they to also buy bread (and what kind of bread) on the same trip to the supermarket? Such information can lead to increased sales by helping retailers to do selective marketing and plan their shelf space.

For instance, placing milk and bread within close proximity may further encourage the sale of these items together within single visits to the store.

3.1.2 Basic Concepts: Frequent Patterns and Association Rules
Itemset X = {x1, ..., xk}
Find all the rules X => Y with min confidence and support:
support, s: the probability that a transaction contains X ∪ Y
confidence, c: the conditional probability that a transaction having X also contains Y.

Rule support and confidence are two measures of rule interestingness, as described earlier.

A support of 2% for the association rule means that 2% of all the transactions under analysis show that computer and financial management software are purchased together.

A confidence of 60% means that 60% of the customers who purchased a computer also bought the software. Typically, association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold. Such thresholds can be set by users or domain experts.
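A minimal sketch of how support and confidence can be counted directly from transactions (the tiny transaction list and item names are illustrative assumptions):

transactions = [
    {"computer", "financial_software"},
    {"computer", "financial_software", "printer"},
    {"computer"},
    {"printer"},
]

def support(itemset):
    # fraction of transactions containing every item in `itemset`
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    # P(rhs | lhs) = support(lhs U rhs) / support(lhs)
    return support(lhs | rhs) / support(lhs)

print(support({"computer", "financial_software"}))       # rule support
print(confidence({"computer"}, {"financial_software"}))  # rule confidence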


Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong. By convention, min_sup and min_conf values are expressed as percentages between 0% and 100%.

A set of items is referred to as an itemset. An itemset that contains k items is a k-itemset. The set {computer, financial management software} is a 2-itemset. The occurrence frequency of an itemset is the number of transactions that

contain the itemset. This is also known as the frequency or support count of the itemset.

The number of transactions required for the itemset to satisfy minimum support is referred to as the minimum support count.

Association rule mining is a two-step process:
Step 1: Find all frequent itemsets. By definition, each of these itemsets will occur at least as frequently as a pre-determined minimum support count.
Step 2: Generate strong association rules from the frequent itemsets. By definition, these rules must satisfy minimum support and minimum confidence.

3.1.3 Classification of Association Rules
Association rules can be classified in various ways, based on the following criteria:

1. Based on the types of values handled in the rule: If a rule concerns associations between the presence or absence of items, it

is a Boolean association rule.

If a rule describes associations between quantitative items or attributes, then it is a quantitative association rule. In these rules, quantitative values for items or attributes are partitioned into intervals.

age(X, "30...39") and income(X, "42K...48K") => buys(X, "high resolution TV")

2. Based on the dimensions of data involved in the rule:If the items or attributes in an association rule each reference only one dimension, then it is a single- dimensional association rule.

The rule computer => financial management software could be rewritten as
buys(X, "computer") => buys(X, "financial management software")
This is a single-dimensional association rule since it refers to only one dimension, i.e., buys. If a rule references two or more dimensions, such as the dimensions buys, time of transaction, and customer category, then it is a multidimensional association rule.


3. Based on the levels of abstraction involved in the rule set:
Some methods for association rule mining can find rules at differing levels of abstraction. For example, suppose that a set of mined association rules included the rules
age(X, "30...39") => buys(X, "laptop computer")
age(X, "30...39") => buys(X, "computer")

In the above examples, the items bought are referenced at different levels of abstraction. We refer to the rule set mined as consisting of multilevel association rules. If, instead, the rules within a given set do not reference items or attributes at different levels of abstraction, then the set contains single-level association rules.

4. Based on the nature of the association involved in the rule: Association mining can be extended to correlation analysis, where the absence or presence of correlated items can be identified.

3.2 Mining single-Dimensional Boolean association rules from Transactional databases

This section presents methods for mining the simplest form of association rules (single-dimensional, single-level, Boolean association rules, such as those discussed for market basket analysis), starting with Apriori, a basic algorithm for finding frequent itemsets, together with a procedure for generating strong association rules from the frequent itemsets.

3.2.1 The Apriori algorithm: Finding frequent itemsetsApriori is an influential algorithm for mining frequent itemsets for Boolean association rules. The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent itemset properties.

Apriori employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets. First, the set of frequent 1-itemsets is found. This set is denoted L1. L1 is used to find L2, the frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. The finding of each Lk requires one full scan of the database. To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property is used to reduce the search space.

The Apriori property. All non-empty subsets of a frequent itemset must also be frequent.

By definition, if an itemset I does not satisfy the minimum support threshold, s, then I is not frequent, i.e., P(I) < s. If an item A is added to the itemset I, then the resulting itemset cannot occur more frequently than I. This property belongs to a special category of properties called anti-monotone in the sense that if a set cannot


pass a test, all of its supersets will fail the same test as well. It is called anti-monotone because the property is monotonic in the context of failing a test.

1. The join step: To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself. This set of candidates is denoted Ck. Let l1 and l2 be itemsets in Lk_1. The notation li[j] refers to the jth item in li. By convention, Apriori assumes that items within a transaction or itemset are sorted in increasing lexicographic order. It also ensures that no duplicates are generated.

2. The prune step: Ck is a superset of Lk; that is, its members may or may not be frequent, but all of the frequent k-itemsets are included in Ck. A scan of the database to determine the count of each candidate in Ck would result in the determination of Lk. Ck can be huge, however, and so this could involve heavy computation.

The Apriori Algorithm

Join Step: Ck is generated by joining Lk-1 with itself.
Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.
Pseudo-code:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
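For concreteness, the following Python sketch implements the pseudo-code above on an assumed toy transaction database (min_support is an absolute count; the candidate join here simply unions pairs of frequent k-itemsets, which yields the same candidate set as the lexicographic join before pruning):

from itertools import combinations

def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    # L1: frequent 1-itemsets
    L = [{frozenset([i]) for i in items
          if sum(i in t for t in transactions) >= min_support}]
    k = 1
    while L[-1]:
        prev = L[-1]
        # join step: unions of pairs of frequent k-itemsets that differ in one item
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # prune step: every k-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k))}
        # count the surviving candidates in one scan of the database
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        L.append({c for c, n in counts.items() if n >= min_support})
        k += 1
    return set().union(*L[:-1]) if len(L) > 1 else L[0]

db = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
      {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"}]
print(sorted(sorted(s) for s in apriori(db, min_support=2)))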


To reduce the size of Ck, the Apriori property is used as follows. Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset. Hence, if any (k-1)-subset of a candidate k-itemset is not in Lk-1, then the candidate cannot be frequent either and so can be removed from Ck. This subset testing can be done quickly by maintaining a hash tree of all frequent itemsets.

Fig. 3.2.1.1 Transactional data for an AllElectronics branch.
Let's look at a concrete example of Apriori, based on the AllElectronics transaction database, D, shown above. There are nine transactions in this database.


1. In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1. The algorithm simply scans all of the transactions in order to count the number of occurrences of each item.

2. Suppose that the minimum transaction support count required is 2 (i.e., min sup = 2). The set of frequent 1-itemsets, L1, can then be determined. It consists of the candidate 1-itemsets having minimum support.

3. To discover the set of frequent 2-itemsets, L2, the algorithm uses L1×L1 to generate a candidate set of 2-itemsets, C2.

4. Next, the transactions in D are scanned and the support count of each candidate itemset in C2 is accumulated.

5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.

6. The generation of the set of candidate 3-itemsets, C3. Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent. We therefore remove them from C3, thereby saving the effort of unnecessarily obtaining their counts during the subsequent scan of D to determine L3.

7. The transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.
8. The algorithm uses L3×L3 to generate a candidate set of 4-itemsets, C4.

Generating association rules from frequent itemsets:

Once the frequent itemsets from transactions in a database D have been found, it is straightforward to generate strong association rules from them (where strong association rules satisfy both minimum support and minimum confidence). This can be done using the following equation for confidence, where the conditional probability is expressed in terms of itemset support count: confidence(A => B) = support_count(A ∪ B) / support_count(A), where

support count(A U B) is the number of transactions containing the itemsets AUB, and

support count(A) is the number of transactions containing the itemset A. Based on this equation, association rules can be generated as follows.

For each frequent itemset l, generate all non-empty proper subsets of l. For every non-empty proper subset s of l, output the rule "s => (l - s)" if support_count(l) / support_count(s) >= min_conf,

where min_conf is the minimum confidence threshold.
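A minimal sketch of this rule-generation step; support_count is assumed to be a dictionary of itemset counts gathered while finding the frequent itemsets (the counts below correspond to the nine-transaction example):

from itertools import combinations

def generate_rules(frequent_itemsets, support_count, min_conf):
    rules = []
    for l in frequent_itemsets:
        if len(l) < 2:
            continue
        for r in range(1, len(l)):
            for s in map(frozenset, combinations(l, r)):   # non-empty proper subsets of l
                conf = support_count[l] / support_count[s]
                if conf >= min_conf:
                    rules.append((set(s), set(l - s), conf))
    return rules

support_count = {frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I5"}): 2,
                 frozenset({"I1", "I2"}): 4, frozenset({"I1", "I5"}): 2,
                 frozenset({"I2", "I5"}): 2, frozenset({"I1", "I2", "I5"}): 2}
for lhs, rhs, conf in generate_rules([frozenset({"I1", "I2", "I5"})], support_count, 0.7):
    print(lhs, "=>", rhs, f"confidence={conf:.0%}")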


Variations of the Apriori algorithm
Many variations of the Apriori algorithm have been proposed. A number of these variations are enumerated below. Most of them focus on improving the efficiency of the original algorithm, while calendric market basket analysis considers transactions over time.
1. A hash-based technique: hashing itemset counts.

A hash-based technique can be used to reduce the size of the candidate k-itemsets, Ck, for k > 1. For example, when scanning each transaction in the database to generate the frequent 1-itemsets, L1, we can also generate all of the 2-itemsets for each transaction, hash them into the buckets of a hash table structure, and increase the corresponding bucket counts.

A 2-itemset whose corresponding bucket count in the hash table is below the support threshold cannot be frequent and thus should be removed from the candidate set. Such a hash-based technique may substantially reduce the number of the candidate k-itemsets examined (especially when k = 2).

2. Transaction reduction: Reducing the number of transactions scanned in future iterations. A transaction which does not contain any frequent k-itemsets cannot contain any frequent (k + 1)-itemsets. Therefore, such a transaction can be marked or removed from further consideration since subsequent scans of the database for j-itemsets, where j > k, will not require it.

3. Partitioning: It is used for partitioning the data to find candidate itemsets. A partitioning technique can be used which requires just two database scans to mine the frequent itemsets. It consists of two phases.

In Phase I, the algorithm subdivides the transactions of D into n non-overlapping partitions. If the minimum support threshold for transactions in D is min_sup, then the minimum itemset support count for a partition is min_sup*the number of transactions in that partition.

For each partition, all frequent itemsets within the partition are found. These are referred to as local frequent itemsets.

The procedure employs a special data structure which, for each itemset, records the TIDs of the transactions containing the items in the itemset. This allows it to find all of the local frequent k-itemsets, for k = 1, 2, ..., in just one scan of the database.

The collection of frequent itemsets from all partitions forms a global candidate itemset with respect to D.

In Phase II, a second scan of D is conducted in which the actual support of each candidate is assessed in order to determine the global frequent itemsets. Partition size and the number of partitions are set so that each partition can fit into main memory and therefore be read only once in each phase.

4. Sampling:


It is used for mining on a subset of the given data. The basic idea of the sampling approach is to pick a random sample S of the given data D, and then search for frequent itemsets in S instead of D.

5. Dynamic itemset counting: It adds candidate itemsets at different points during a scan. A dynamic itemset counting technique was proposed in which the database is partitioned into blocks marked by start points. In this variation, new candidate itemsets can be added at any start point, unlike in Apriori, which determines new candidate itemsets only immediately prior to each complete database scan. The technique is dynamic in that it estimates the support of all of the itemsets that have been counted so far, adding new candidate itemsets if all of their subsets are estimated to be frequent. The resulting algorithm requires two database scans.

6. Calendric market basket analysis: finding itemsets that are frequent in a set of user-defined time intervals. Calendric market basket analysis uses transaction time stamps to define subsets of the given database.

Construct FP-tree from a Transaction DB

Steps:
1. Scan the DB once, find frequent 1-itemsets (single item patterns)
2. Order frequent items in frequency descending order
3. Scan the DB again, construct the FP-tree
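A minimal Python sketch of these three construction steps; the Node class, field names, and sample transactions are assumptions for illustration, not the textbook's data structure:

from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 1, {}

def build_fp_tree(transactions, min_support):
    # Step 1: one scan to find frequent 1-itemsets
    freq = {i: c for i, c in Counter(i for t in transactions for i in t).items()
            if c >= min_support}
    root = Node(None, None)
    header = {}                                    # item -> list of nodes (node-link table)
    # Steps 2-3: order items by descending frequency; second scan inserts each transaction
    for t in transactions:
        ordered = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            if item in node.children:
                node.children[item].count += 1     # shared prefix: just increment the count
            else:
                node.children[item] = Node(item, node)
                header.setdefault(item, []).append(node.children[item])
            node = node.children[item]
    return root, header

db = [{"f", "a", "c", "d", "g", "i", "m", "p"}, {"a", "b", "c", "f", "l", "m", "o"},
      {"b", "f", "h", "j", "o"}, {"b", "c", "k", "s", "p"}, {"a", "f", "c", "e", "l", "p", "m", "n"}]
root, header = build_fp_tree(db, min_support=3)
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})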


Benefits of the FP-tree Structure
Completeness:
never breaks a long pattern of any transaction
preserves complete information for frequent pattern mining
Compactness:
reduces irrelevant information: infrequent items are gone
frequency descending ordering: more frequent items are more likely to be shared
never larger than the original database (if node-links and counts are not counted)
Example: for the Connect-4 DB, the compression ratio could be over 100

Mining Frequent Patterns Using FP-tree

General idea (divide-and-conquer): recursively grow frequent pattern paths using the FP-tree.
Method:
For each item, construct its conditional pattern-base, and then its conditional FP-tree
Repeat the process on each newly created conditional FP-tree
Until the resulting FP-tree is empty, or it contains only one path (a single path will generate all the combinations of its sub-paths, each of which is a frequent pattern)

Major Steps to Mine the FP-tree
1) Construct the conditional pattern base for each node in the FP-tree
2) Construct the conditional FP-tree from each conditional pattern-base


3) Recursively mine conditional FP-trees and grow frequent patterns obtained so far

If the conditional FP-tree contains a single path, simply enumerate all the patterns

3.3 Mining Multilevel Association Rules from Transaction Databases
Multilevel association rules: for many applications, it is difficult to find strong associations among data items at low or primitive levels of abstraction due to the sparsity of data in multidimensional space. Strong associations discovered at very high concept levels may represent common sense knowledge.

Example: Suppose we are given the task-relevant set of transactional data for sales at the computer department of an AllElectronics branch, showing the items purchased for each transaction TID. A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts. Data can be generalized by replacing low-level concepts within the data by their higher-level concepts, or ancestors, from a concept hierarchy.

Fig. 3.3.1 Class Hierarchy

The concept hierarchy of Fig. 3.3.1 has four levels, referred to as levels 0, 1, 2, and 3. By convention, levels within a concept hierarchy are numbered from top to bottom, starting with level 0 at the root node for all (the most general abstraction level).

Level 1 includes computer, software, printer, and computer accessory;
Level 2 includes home computer, laptop computer, education software, financial management software, and so on; and
Level 3 includes IBM home computer, ..., Microsoft educational software, and so on. Level 3 represents the most specific abstraction level of this hierarchy.


Fig. 2.3.2 Multilevel Mining with Reduced Support
Rules generated from association rule mining with concept hierarchies are called multiple-level or multilevel association rules, since they consider more than one concept level.

Approaches to mining multilevel association rules

In general, a top-down strategy is employed, where counts are accumulated for the calculation of frequent itemsets at each concept level, starting at the concept level 1 and working towards the lower, more specific concept levels, until no more frequent itemsets can be found. That is, once all frequent itemsets at concept level 1 are found, then the frequent itemsets at level 2 are found, and so on. For each level, any algorithm for discovering frequent itemsets may be used, such as Apriori or its variations.

1. Using uniform minimum support for all levels (referred to as uniform support): The same minimum support threshold is used when mining at each level of abstraction. For example, a minimum support threshold of 5% is used throughout (e.g., for mining from "computer" down to "laptop computer"). Both "computer" and "laptop computer" are found to be frequent, while "home computer" is not. When a uniform minimum support threshold is used, the search procedure is simplified. The method is also simple in that users are required to specify only one minimum support threshold. An optimization technique can be adopted, based on the knowledge that an ancestor is a superset of its descendants: the search avoids examining itemsets containing any item whose ancestors do not have minimum support.

Fig. 2.7.3.2.1 Multilevel Mining with Uniform Support


With the uniform support approach, it is unlikely that items at lower levels of abstraction will occur as frequently as those at higher levels of abstraction. If the minimum support threshold is set too high, it could miss several meaningful associations occurring at low abstraction levels. If the threshold is set too low, it may generate many uninteresting associations occurring at high abstraction levels. This provides the motivation for the following approach.

2. Using reduced minimum support at lower levels (referred to as reduced support): Each level of abstraction has its own minimum support threshold. The lower the abstraction level is, the smaller the corresponding threshold is. For example, the minimum support thresholds for levels 1 and 2 are 5% and 3%, respectively. In this way, “computer", “laptop computer", and “home computer" are all considered frequent.

Fig. 2.7.3.2.2 Multilevel Mining with Reduced Support

For mining multiple-level associations with reduced support, there are a number of alternative search strategies. These include:
1. Level-by-level independent: This is a full-breadth search, where no background knowledge of frequent itemsets is used for pruning. Each node is examined, regardless of whether or not its parent node is found to be frequent.
2. Level-cross filtering by single item: An item at the i-th level is examined if and only if its parent node at the (i-1)-th level is frequent. If a node is frequent, its children will be examined; otherwise, its descendants are pruned from the search. For example, the descendant nodes of "computer" (i.e., "laptop computer" and "home computer") are not examined if "computer" is not frequent.
3. Level-cross filtering by k-itemset: A k-itemset at the i-th level is examined if and only if its corresponding parent k-itemset at the (i-1)-th level is frequent.


UNIT IV CLASSIFICATION AND CLUSTERING

Classification and prediction − Issues − Decision tree induction − Bayesian classification – Association rule based − Other classification methods − Prediction − Classifier accuracy − Cluster analysis – Types of data − Categorization of methods − Partitioning methods − Outlier analysis.

4.1 Classification vs. Prediction

Classification:
predicts categorical class labels
classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data
Prediction:
models continuous-valued functions, i.e., predicts unknown or missing values
Typical applications:
credit approval, target marketing, medical diagnosis, treatment effectiveness analysis

Classification—A Two-Step Process

Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
The set of tuples used for model construction is the training set
The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: for classifying future or unknown objects
Estimate the accuracy of the model
The known label of each test sample is compared with the classified result from the model
The accuracy rate is the percentage of test set samples that are correctly classified by the model
The test set is independent of the training set, otherwise over-fitting will occur


Supervised learning (classification):
Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering):
The class labels of the training data are unknown
Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

4.2 Issues Regarding Classification and Prediction: Evaluating Classification Methods

Predictive accuracy
Speed and scalability
time to construct the model
time to use the model
Robustness
handling noise and missing values
Scalability
efficiency with disk-resident databases
Interpretability
understanding and insight provided by the model
Goodness of rules
decision tree size


compactness of classification rules

4.3 Classification by Decision Tree Induction
Decision tree

o A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute.

o Each branch represents an outcome of the test, and each leaf node holds a class label.

o The topmost node in a tree is the root node.
o Internal nodes are denoted by rectangles, and leaf nodes are denoted by ovals.
o Some decision tree algorithms produce only binary trees, whereas others can produce non-binary trees.
Decision tree generation consists of two phases:

o Tree construction
Attribute selection measures are used to select the attribute that best partitions the tuples into distinct classes.
o Tree pruning
Tree pruning attempts to identify and remove branches that reflect noise or outliers, with the goal of improving classification accuracy on unseen data.
Use of a decision tree: classifying an unknown sample
o Test the attribute values of the sample against the decision tree

Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm):
o The tree is constructed in a top-down recursive divide-and-conquer manner
o At the start, all the training examples are at the root
o Attributes are categorical (if continuous-valued, they are discretized in advance)
o Examples are partitioned recursively based on selected attributes
o Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Conditions for stopping partitioning:
o All samples for a given node belong to the same class
o There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf
o There are no samples left

Attribute Selection Measure
Information gain:
o All attributes are assumed to be categorical; the measure can be modified for continuous-valued attributes
Gini index:
o All attributes are assumed continuous-valued
o Assume there exist several possible split values for each attribute
o May need other tools, such as clustering, to get the possible split values
o Can be modified for categorical attributes

Information Gain (ID3/C4.5)
Select the attribute with the highest information gain.
Assume there are two classes, P and N:
o Let the set of examples S contain p elements of class P and n elements of class N
o The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined as
I(p, n) = -(p / (p + n)) log2(p / (p + n)) - (n / (p + n)) log2(n / (p + n))

Information Gain in Decision Tree Induction :

Assume that using attribute A a set S will be partitioned into sets {S1, S2, ..., Sv}.
If Si contains pi examples of P and ni examples of N, the entropy, or the expected information needed to classify objects in all subtrees Si, is
E(A) = sum over i = 1..v of ((pi + ni) / (p + n)) × I(pi, ni)
The encoding information that would be gained by branching on A is
Gain(A) = I(p, n) - E(A)
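A small Python sketch of these formulas for a single categorical attribute (the toy data set and names are illustrative assumptions):

import math

def info(p, n):
    # I(p, n): expected information to classify an example as P or N
    total = p + n
    return sum(-(x / total) * math.log2(x / total) for x in (p, n) if x)

def gain(examples, attribute):
    # Gain(A) = I(p, n) - E(A), for examples given as (attribute_dict, class) pairs
    p = sum(1 for x, c in examples if c == "P")
    n = len(examples) - p
    e = 0.0
    for value in {x[attribute] for x, _ in examples}:
        subset = [c for x, c in examples if x[attribute] == value]
        pi = subset.count("P")
        e += (len(subset) / len(examples)) * info(pi, len(subset) - pi)
    return info(p, n) - e

data = [({"student": "yes"}, "P"), ({"student": "yes"}, "P"),
        ({"student": "no"}, "N"), ({"student": "no"}, "P"), ({"student": "no"}, "N")]
print(round(gain(data, "student"), 3))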

4.4 Bayesian Classification
Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems.
Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities.
Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured.

Bayesian Theorem
Given training data D, the posterior probability of a hypothesis h, P(h|D), follows the Bayes theorem: P(h|D) = P(D|h) P(h) / P(D).
The MAP (maximum a posteriori) hypothesis is the hypothesis h that maximizes P(h|D), or equivalently P(D|h) P(h).


Practical difficulty: it requires initial knowledge of many probabilities and has significant computational cost.

Naïve Bayes Classifier The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:

o Let D be a training set of tuples and their associated class labels. As usual, each tuple is represented by an n-dimensional attribute vector, X = (x1, x2, : : : , xn), depicting n measurements made on the tuple from n attributes, respectively, A1, A2, : : : , An.

o Suppose that there are m classes, C1, C2, : : : , Cm. Given a tuple, X, the classifier will predict that X belongs to the class having the highest posterior probability, conditioned on X.

P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i.
o The class Ci for which P(Ci|X) is maximized is called the maximum posteriori hypothesis.

o As P(X) is constant for all classes, only P(X|Ci)P(Ci) need be maximized. If the class prior probabilities are not known, then it is commonly assumed that the classes are equally likely, that is, P(C1) = P(C2) = …= P(Cm), and we would therefore maximize P(X|Ci). Otherwise, we maximize P(X|Ci)P(Ci).

o The attributes are assumed to be conditionally independent of one another, given the class label of the tuple.
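A minimal sketch of such a naive Bayesian classifier; Laplace smoothing is added here to avoid zero probabilities, which is an assumption beyond the description above, and the toy tuples are illustrative only:

from collections import Counter, defaultdict

def train(examples):
    # examples: list of (attribute_dict, class_label)
    class_counts = Counter(c for _, c in examples)
    value_counts = defaultdict(Counter)            # (class, attribute) -> value counts
    for x, c in examples:
        for a, v in x.items():
            value_counts[(c, a)][v] += 1
    return class_counts, value_counts

def predict(x, class_counts, value_counts):
    n = sum(class_counts.values())
    best_class, best_score = None, 0.0
    for c, cc in class_counts.items():
        score = cc / n                             # P(Ci)
        for a, v in x.items():                     # P(X|Ci) as a product of P(xk|Ci)
            counts = value_counts[(c, a)]
            score *= (counts[v] + 1) / (cc + len(counts) + 1)   # Laplace-smoothed estimate
        if score > best_score:
            best_class, best_score = c, score
    return best_class

data = [({"age": "youth", "student": "yes"}, "buys"),
        ({"age": "youth", "student": "no"}, "no"),
        ({"age": "senior", "student": "yes"}, "buys"),
        ({"age": "senior", "student": "no"}, "no")]
model = train(data)
print(predict({"age": "youth", "student": "yes"}, *model))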

Rule-Based Classification
A set of IF-THEN rules is used in rule-based classification.

Using IF-THEN Rules for Classification
Rules are a good way of representing information or bits of knowledge. A rule-based classifier uses a set of IF-THEN rules for classification. An IF-THEN rule is an expression of the form
IF condition THEN conclusion.
An example is rule R1:
R1: IF age = youth AND student = yes THEN buys_computer = yes.
• The "IF"-part (or left-hand side) of a rule is known as the rule antecedent or precondition. The "THEN"-part (or right-hand side) is the rule consequent. R1 can also be written as
R1: (age = youth) and (student = yes) => (buys_computer = yes).


Name | Blood Type | Give Birth | Can Fly | Live in Water | Class
human | warm | yes | no | no | mammals
python | cold | no | no | no | reptiles
salmon | cold | no | no | yes | fishes
whale | warm | yes | no | yes | mammals
frog | cold | no | no | sometimes | amphibians
komodo | cold | no | no | no | reptiles
bat | warm | yes | yes | no | mammals
pigeon | warm | no | yes | no | birds
cat | warm | yes | no | no | mammals
leopard shark | cold | yes | no | yes | fishes
turtle | cold | no | no | sometimes | reptiles
penguin | warm | no | no | sometimes | birds
porcupine | warm | yes | no | no | mammals
eel | cold | no | no | yes | fishes
salamander | cold | no | no | sometimes | amphibians
gila monster | cold | no | no | no | reptiles
platypus | warm | no | no | no | mammals
owl | warm | no | yes | no | birds
dolphin | warm | yes | no | yes | mammals
eagle | warm | no | yes | no | birds

Fig. If Then Example

R1: (Give Birth = no) and (Can Fly = yes) => Birds
R2: (Give Birth = no) and (Live in Water = yes) => Fishes
R3: (Give Birth = yes) and (Blood Type = warm) => Mammals
R4: (Give Birth = no) and (Can Fly = no) => Reptiles
R5: (Live in Water = sometimes) => Amphibians

Application of a Rule-Based Classifier
A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule.

The rule R1 covers a hawk => Bird
The rule R3 covers the grizzly bear => Mammal

Advantages of Rule-Based Classifiers
• As highly expressive as decision trees
• Easy to interpret
• Easy to generate
• Can classify new instances rapidly
• Performance comparable to decision trees

Prediction


• (Numerical) prediction is similar to classification:
o construct a model
o use the model to predict a continuous or ordered value for a given input
• Prediction is different from classification:
o Classification refers to predicting a categorical class label
o Prediction models continuous-valued functions
• The major method for prediction is regression:
o model the relationship between one or more independent or predictor variables and a dependent or response variable
• Regression analysis:
o Linear and multiple regression
o Non-linear regression
o Other regression methods: generalized linear models, Poisson regression, log-linear models, regression trees

4.5 Association-Based Classification
Several methods for association-based classification:
ARCS: quantitative association mining and clustering of association rules (Lent et al., '97)
It beats C4.5 in (mainly) scalability and also accuracy
Associative classification (Liu et al., '98)
It mines high-support and high-confidence rules of the form "cond_set => y", where y is a class label
CAEP (Classification by Aggregating Emerging Patterns) (Dong et al., '99)
Emerging patterns (EPs): itemsets whose support increases significantly from one class to another
Mine EPs based on minimum support and growth rate

4.6 Other Classification Methods

k-nearest neighbor classifier
Case-based reasoning
Genetic algorithms
Rough set approach
Fuzzy set approaches

Instance-Based Methods
Instance-based learning:
Store training examples and delay the processing ("lazy evaluation") until a new instance must be classified
Typical approaches:
k-nearest neighbor approach
Instances are represented as points in a Euclidean space
Locally weighted regression
Constructs a local approximation
Case-based reasoning
Uses symbolic representations and knowledge-based inference

The k-Nearest Neighbor Algorithm

All instances correspond to points in n-dimensional space.
The nearest neighbors are defined in terms of Euclidean distance.
The target function could be discrete- or real-valued.
For discrete-valued target functions, k-NN returns the most common value among the k training examples nearest to xq.
For continuous-valued target functions, the k-NN algorithm calculates the mean value of the k nearest neighbors.
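A minimal sketch of the k-NN classifier for a discrete-valued target function (Euclidean distance and a majority vote among the k nearest training points; the sample points are illustrative assumptions):

import math
from collections import Counter

def knn_classify(training, query, k=3):
    # training: list of (point, label); query: tuple of numbers
    dist = lambda p, q: math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    nearest = sorted(training, key=lambda t: dist(t[0], query))[:k]   # k closest training points
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.0), "B"), ((5.2, 4.8), "B"), ((4.9, 5.1), "B")]
print(knn_classify(train, (1.1, 0.9), k=3))   # majority of the 3 nearest neighbors is "A"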

Discussion on the k-NN Algorithm

Distance-weighted nearest neighbor algorithm:
Weight the contribution of each of the k neighbors according to their distance to the query point xq, giving greater weight to closer neighbors
Similarly for real-valued target functions
Robust to noisy data by averaging the k nearest neighbors
Curse of dimensionality: the distance between neighbors could be dominated by irrelevant attributes. To overcome it, stretch the axes or eliminate the least relevant attributes.

Remarks on Lazy vs. Eager Learning

Instance-based learning: lazy evaluation
Decision-tree and Bayesian classification: eager evaluation
Key differences:
A lazy method may consider the query instance xq when deciding how to generalize beyond the training data D
An eager method cannot, since it has already chosen a global approximation before seeing the query
Efficiency: lazy methods need less time for training but more time for predicting
Accuracy:
Lazy methods effectively use a richer hypothesis space, since they use many local linear functions to form an implicit global approximation to the target function
Eager methods must commit to a single hypothesis that covers the entire instance space


4.7 Prediction

Prediction is similar to classification:
First, construct a model
Second, use the model to predict unknown values
The major method for prediction is regression:
Linear and multiple regression
Non-linear regression
Prediction is different from classification:
Classification refers to predicting a categorical class label
Prediction models continuous-valued functions

Predictive Modeling in Databases
Predictive modeling: predict data values or construct generalized linear models based on the database data.
One can only predict value ranges or category distributions.
Method outline:
Minimal generalization
Attribute relevance analysis
Generalized linear model construction
Prediction
Determine the major factors which influence the prediction:
Data relevance analysis: uncertainty measurement, entropy analysis, expert judgement, etc.
Multi-level prediction: drill-down and roll-up analysis

Regression Analysis and Log-Linear Models in Prediction

Linear regression: Y = a + b X
The two parameters, a and b, specify the line and are estimated using the data at hand, by applying the least squares criterion to the known values Y1, Y2, ..., X1, X2, ....
Multiple regression: Y = b0 + b1 X1 + b2 X2
Many nonlinear functions can be transformed into the above.
Log-linear models:
The multi-way table of joint probabilities is approximated by a product of lower-order tables.
Probability: p(a, b, c, d) = αab βac χad δbcd
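A minimal sketch of estimating a and b for Y = a + b X by the least squares criterion mentioned above (the sample x and y values are illustrative assumptions):

def linear_regression(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # b = sum((xi - mean_x)(yi - mean_y)) / sum((xi - mean_x)^2), a = mean_y - b * mean_x
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

years_experience = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
salary = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]     # illustrative values only
a, b = linear_regression(years_experience, salary)
print(f"Y = {a:.2f} + {b:.2f} X")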

4.8 Classification Accuracy: Estimating Error Rates
Partition: training-and-testing
Use two independent data sets, e.g., a training set (2/3) and a test set (1/3)
Used for data sets with a large number of samples
Cross-validation
Divide the data set into k subsamples
Use k-1 subsamples as training data and one subsample as test data (k-fold cross-validation)
Used for data sets of moderate size
Bootstrapping (leave-one-out)
Used for small data sets
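A minimal sketch of k-fold cross-validation; train_fn and test_fn are assumed caller-supplied callbacks, and the trivial majority-class model below is only there to show the contract:

from collections import Counter

def k_fold_cross_validation(data, k, train_fn, test_fn):
    folds = [data[i::k] for i in range(k)]            # k roughly equal subsamples
    accuracies = []
    for i in range(k):
        test_set = folds[i]
        train_set = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train_fn(train_set)                   # train on k-1 subsamples
        accuracies.append(test_fn(model, test_set))   # test on the remaining one
    return sum(accuracies) / k

majority = lambda train: Counter(label for _, label in train).most_common(1)[0][0]
accuracy = lambda model, test: sum(label == model for _, label in test) / len(test)
data = [((i,), "A" if i % 3 else "B") for i in range(30)]
print(k_fold_cross_validation(data, 10, majority, accuracy))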

Boosting and Bagging
Boosting increases classification accuracy
Applicable to decision trees or Bayesian classifiers
Learn a series of classifiers, where each classifier in the series pays more attention to the examples misclassified by its predecessor
Boosting requires only linear time and constant space

Boosting Technique: Algorithm
Assign every example an equal weight 1/N
For t = 1, 2, ..., T do:
Obtain a hypothesis (classifier) h(t) under the weights w(t)
Calculate the error of h(t) and re-weight the examples based on the error
Normalize w(t+1) to sum to 1
Output a weighted sum of all the hypotheses, with each hypothesis weighted according to its accuracy on the training set

4.9 Cluster Analysis
Finding groups of objects such that the objects in a group will be similar (or

related) to one another and different from (or unrelated to) the objects in other groups is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. A cluster of data objects can be treated collectively as one group and so may be considered as a form of data compression.

Clustering is also called data segmentation in some applications because clustering partitions large data sets into groups according to their similarity. Clustering can also be used for outlier detection.

A good clustering method will produce high quality clusters with high intra-class similarity and low inter-class similarity.

The quality of a clustering result depends on both the similarity measure used by the method and its implementation. The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.


The requirements of clustering in data mining:
Scalability
o Clustering all the data instead of only samples
Ability to deal with different types of attributes
o Numerical, binary, categorical, ordinal, linked, and mixtures of these types
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine input parameters
Ability to deal with noisy data
Incremental clustering and insensitivity to the order of input records
High dimensionality
Constraint-based clustering
o Users may give inputs on constraints
o Use domain knowledge to determine input parameters
Interpretability and usability

4.10 Types of Data in Cluster Analysis
A data set to be clustered contains n objects, which may represent persons,

houses, documents, countries, and so on. Main memory-based clustering algorithms typically operate on the following two data structures.

Data matrix: it is otherwise known as object-by-variable structure. This represents n objects, such as persons, with p variables (also called measurements or attributes), such as age, height, weight, gender, and so on.

The structure is in the form of a relational table, or n-by-p matrix:

Dissimilarity matrix: It is otherwise known as object-by-object structure. This stores a collection of proximities that are available for all pairs of n objects. It is often represented by an n-by-n table:

d (i, j) is the measured difference or dissimilarity between objects i and j. d(i, j) is a nonnegative number that is close to 0 when objects i and j are

highly similar or “near” each other, and becomes larger the more they differ.


3.2.1 Interval-Scaled Variables
Interval-scaled variables are continuous measurements on a roughly linear scale. Typical examples include weight and height, latitude and longitude coordinates, and weather temperature.
The measurement unit used can affect the clustering analysis. For example, changing measurement units from meters to inches for height, or from kilograms to pounds for weight, may lead to a very different clustering structure.
Steps involved in standardizing interval-scaled variables:
1. Calculate the mean absolute deviation: s_f = (1/n)(|x1f - mf| + |x2f - mf| + ... + |xnf - mf|), where mf is the mean of variable f
2. Calculate the standardized measurement (z-score): z_if = (x_if - mf) / s_f
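A minimal sketch of the two standardization steps above (the height values are illustrative only):

def standardize(values):
    n = len(values)
    m = sum(values) / n                               # mean m_f
    s = sum(abs(x - m) for x in values) / n           # mean absolute deviation s_f
    return [(x - m) / s for x in values]              # z-scores z_if = (x_if - m_f) / s_f

heights_cm = [150, 160, 170, 180, 190]
print([round(z, 2) for z in standardize(heights_cm)])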

3.2.2 Binary VariablesA binary variable has only two states: 0 or 1, where 0 means that the variable is absent, and 1 means that it is present.

Given the variable smoker describing a patient, 1 indicates that the patient smokes, 0 indicates that the patient does not.

Methods specific to binary data are necessary for computing dissimilarities.

Using a 2-by-2 contingency table, where q is the number of variables that equal 1 for both objects i and j, r the number that equal 1 for i but 0 for j, s the number that equal 0 for i but 1 for j, and t the number that equal 0 for both:
Distance measure for symmetric binary variables: d(i, j) = (r + s) / (q + r + s + t)
Distance measure for asymmetric binary variables: d(i, j) = (r + s) / (q + r + s)
Jaccard coefficient (similarity measure for asymmetric binary variables): sim(i, j) = q / (q + r + s)
E.g., dissimilarity between binary variables: for a patient record with such binary attributes, the distance measures above can be applied directly.

3.2.3 Categorical, Ordinal, and Ratio-Scaled Variables


A categorical variable is a generalization of the binary variable in that it can take on more than two states. For example, map color is a categorical variable that may have, say, five states: red, yellow, green, pink, and blue.Let the number of states of a categorical variable be M. The states can be denoted by letters, symbols, or a set of integers, such as 1, 2, … , M. These integers are used for data handling and do not represent any specific ordering.

A discrete ordinal variable resembles a categorical variable, except that the M states of the ordinal value are ordered in a meaningful sequence. Subjective assessments of qualities can be measured using ordinal variable. For example, professional ranks are often enumerated in a sequential order, such as assistant, associate, and full for professors.

A ratio-scaled variable makes a positive measurement on a nonlinear scale, such as an exponential scale, approximately following the formula A e^(Bt) or A e^(-Bt), where A and B are positive constants and t typically represents time. Common examples include the growth of a bacteria population or the decay of a radioactive element.

3.2.4 Variables of Mixed Types
A database can contain all six variable types. One approach is to group each kind of variable together, performing a separate cluster analysis for each variable type. This is feasible if these analyses derive compatible results. Alternatively, all variable types can be processed together in a single cluster analysis, by combining the different variables into a single dissimilarity matrix with a common scale of the interval [0.0, 1.0].

4.11 Categorization of Major Clustering Methods
The major categories are:
1. Partitioning methods
2. Hierarchical methods
3. Density-based methods
4. Grid-based methods
5. Model-based clustering methods
6. Clustering high-dimensional data
7. Constraint-based cluster analysis

Partitioning Methods: Given a database of n objects or data tuples, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k ≤ n. It classifies the data into k groups, which together satisfy the following requirements:


(1) Each group must contain at least one object and (2) Each object must belong to exactly one group. Given k, the number of partitions to construct, a partitioning method creates an initial partitioning. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another. The general criterion of a good partitioning is that objects in the same cluster are “close” or related to each other, whereas objects of different clusters are “far apart” or very different.There are various kinds of other criteria for judging the quality of partitions.

Hierarchical Methods: A hierarchical method creates a hierarchical decomposition of the given set of data objects. A hierarchical method can be classified as being either agglomerative or divisive, based on how the hierarchical decomposition is formed.

The agglomerative approach- is also known as bottom-up approach. It starts with each object forming a separate group. It successively merges the objects or groups that are close to one another, until all of the groups are merged into one or until a termination condition holds.

The divisive approach – is known as top-down approach. It starts with all of the objects in the same cluster and the cluster is split up into smaller clusters, until eventually each object is in one cluster or until a termination condition holds.

Density-based methods: A given cluster can be grown as long as the density in the "neighborhood" exceeds some threshold; that is, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points. Such a method can be used to filter out noise and discover clusters of arbitrary shape.

Grid-based methods: Grid-based methods quantize the object space into a finite number of cells that form a grid structure. Fast processing time, which is typically independent of the number of data objects and dependent only on the number of cells in each dimension of the quantized space, is the merit of this method.

Model-based methods: Model-based methods hypothesize a model for each of the clusters and find the best fit of the data to the given model. A model-based algorithm may locate clusters by constructing a density function that reflects the spatial distribution of the data points. It can be used to automatically determine the number of clusters based on standard statistics.

Clustering high-dimensional data: This is used for the analysis of objects containing a large number of features or dimensions. Frequent pattern-based clustering extracts distinct frequent patterns among subsets of dimensions that occur frequently. It uses such patterns to group objects and generate meaningful clusters.

Constraint-based clustering: This is a clustering approach that performs clustering by incorporation of user-specified or application-oriented constraints. Various kinds of constraints can be specified, either by a user or as per application requirements.


4.12 Clustering Algorithms

Partitioning Methods: k-Means and k-Medoids

The most well-known and commonly used partitioning methods are k-means, k-medoids, and their variations.

a. The k-Means Method

The k-means algorithm takes the input parameter k and partitions a set of n objects into k clusters so that the resulting intracluster similarity is high but the intercluster similarity is low. Cluster similarity is measured with respect to the mean value of the objects in a cluster, which can be viewed as the cluster's centroid or center of gravity.

First, it randomly selects k of the objects, each of which initially represents a cluster mean or center.

For each of the remaining objects, an object is assigned to the cluster to which it is the most similar, based on the distance between the object and the cluster mean. It then computes the new mean for each cluster.

This process iterates until the criterion function converges. Typically, the square-error criterion is used, defined as

E = Σ_{i=1}^{k} Σ_{p ∈ Ci} |p − mi|²

where E is the sum of the squared error for all objects in the data set; p is the point in space representing a given object; and mi is the mean of cluster Ci (both p and mi are multidimensional). In other words, for each object in each cluster, the distance from the object to its cluster center is squared, and the distances are summed. This criterion tries to make the resulting k clusters as compact and as separate as possible.

Fig. Clustering of a set of objects based on the k-means method.
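To make the iterative relocation concrete, here is a minimal k-means sketch in plain Python on a toy 2-D data set. The function names and the data are illustrative assumptions, not part of the course text, and a production implementation would handle empty clusters and multiple restarts more carefully.

    import random

    def dist2(a, b):
        # squared Euclidean distance between two 2-D points
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    def mean(cluster):
        n = len(cluster)
        return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)

    def kmeans(points, k, iterations=100):
        centroids = random.sample(points, k)            # step 1: pick k initial means
        clusters = [[] for _ in range(k)]
        for _ in range(iterations):
            # assignment step: put each point into the cluster of its nearest mean
            clusters = [[] for _ in range(k)]
            for p in points:
                i = min(range(k), key=lambda c: dist2(p, centroids[c]))
                clusters[i].append(p)
            # update step: recompute each mean from its current members
            new_centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
            if new_centroids == centroids:              # convergence
                break
            centroids = new_centroids
        return centroids, clusters

    def square_error(centroids, clusters):
        # the criterion E: sum of squared distances to the assigned cluster centers
        return sum(dist2(p, centroids[i]) for i, c in enumerate(clusters) for p in c)

    data = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
    cents, clus = kmeans(data, k=2)
    print(cents, square_error(cents, clus))

Running the sketch prints the final means and the value of the square-error criterion E for the resulting partitioning.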

b. Representative Object-Based Technique: The k-Medoids Method

The k-means algorithm is sensitive to outliers because an object with an extremely large value may substantially distort the distribution of data. Instead of taking the mean value of the objects in a cluster as a reference point, the k-medoids method picks an actual object (the medoid) to represent each cluster.

Steps are:


1. Arbitrarily choose k objects in D as the initial representative objects (seeds).
2. Repeat:
   o Assign each remaining object to the cluster with the nearest representative object.
   o Randomly select a non-representative object.
   o Compute the total cost, S, of swapping a representative object with the selected non-representative object.
   o If S < 0, then swap the objects to form the new set of k representative objects.
3. Until no change.
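A minimal Python sketch of this swap-based procedure is given below. The toy data, the Manhattan distance, and the exhaustive swap search are illustrative assumptions; a full PAM implementation caches pairwise distances instead of recomputing them.

    import random
    from itertools import product

    def manhattan(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))

    def total_cost(data, medoids):
        # cost = sum of distances from each object to its nearest representative
        return sum(min(manhattan(p, m) for m in medoids) for p in data)

    def k_medoids(data, k, max_iter=100):
        medoids = random.sample(data, k)                 # arbitrary initial seeds
        best = total_cost(data, medoids)
        for _ in range(max_iter):
            improved = False
            # consider swapping each representative with each non-representative object
            for m, o in product(list(medoids), data):
                if o in medoids:
                    continue
                candidate = [o if x == m else x for x in medoids]
                cost = total_cost(data, candidate)
                if cost < best:                          # swap cost S < 0: keep the swap
                    medoids, best, improved = candidate, cost, True
            if not improved:                             # until no change
                break
        return medoids

    data = [(1, 2), (2, 2), (8, 8), (9, 7), (2, 3), (8, 9)]
    print(k_medoids(data, k=2))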

c. Partitioning Methods in Large Databases: From k-Medoids to CLARANS

CLARA (Clustering LARge Applications) can be used to deal with larger data sets.

Instead of taking the whole set of data into consideration, a small portion of the actual data is chosen as a representative of the data.

Medoids are then chosen from this sample using PAM. If the sample is selected in a fairly random manner, it should closely represent the original data set, and the representative objects (medoids) chosen will likely be similar to those that would have been chosen from the whole data set.

CLARA draws multiple samples of the data set, applies PAM on each sample, and returns its best clustering as the output.

The effectiveness of CLARA depends on the sample size. PAM searches for the best k medoids among a given data set, whereas CLARA searches for the best k medoids among the selected sample of the data set. CLARA cannot find the best clustering if any of the best sampled medoids is not among the best k medoids.
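A sketch of the CLARA idea in the same style, reusing the k_medoids and total_cost helpers from the k-medoids sketch above; the sample size and the number of samples are illustrative assumptions.

    import random

    def clara(data, k, num_samples=5, sample_size=40):
        # draw several random samples, run PAM-style k_medoids on each sample,
        # and keep the medoid set with the lowest cost over the WHOLE data set
        best_medoids, best_cost = None, float("inf")
        for _ in range(num_samples):
            sample = random.sample(data, min(sample_size, len(data)))
            medoids = k_medoids(sample, k)       # clustering computed on the sample only
            cost = total_cost(data, medoids)     # but evaluated on the full data set
            if cost < best_cost:
                best_medoids, best_cost = medoids, cost
        return best_medoids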

Hierarchical Methods

A hierarchical clustering method works by grouping data objects into a tree of clusters.

a. Agglomerative and Divisive Hierarchical Clustering

Agglomerative hierarchical clustering: This bottom-up strategy starts by placing each object in its own cluster and then merges these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or until certain termination conditions are satisfied.

Divisive hierarchical clustering: This top-down strategy does the reverse of agglomerative hierarchical clustering by starting with all objects in one cluster. It subdivides the cluster into smaller and smaller pieces, until each object forms a cluster on its own or until it satisfies certain termination conditions, such as a desired number of clusters being obtained or the diameter of each cluster being within a certain threshold.

b. BIRCH: Balanced Iterative Reducing and Clustering Using Hierarchies

BIRCH is designed for clustering a large amount of numerical data by integrating hierarchical clustering with iterative partitioning. It was proposed to overcome two difficulties of earlier agglomerative methods: (1) limited scalability and (2) the inability to undo what was done in a previous step.

BIRCH introduces two concepts, clustering feature and clustering feature tree (CF tree), which are used to summarize cluster representations. These structures help the clustering method achieve good speed and scalability in large databases and also make it effective for incremental and dynamic clustering of incoming objects.

Fig. 3.4.2.1 A CF tree structure.

c. ROCK: A Hierarchical Clustering Algorithm for Categorical Attributes

ROCK (RObust Clustering using linKs) is a hierarchical clustering algorithm that explores the concept of links (the number of common neighbors between two objects) for data with categorical attributes.

d. Chameleon: A Hierarchical Clustering Algorithm Using Dynamic Modeling

Chameleon is a hierarchical clustering algorithm that uses dynamic modeling to determine the similarity between pairs of clusters. It was derived based on the observed weaknesses of two hierarchical clustering algorithms: ROCK and CURE. ROCK and related schemes emphasize cluster interconnectivity while ignoring information regarding cluster proximity. CURE and related schemes consider cluster proximity yet ignore cluster interconnectivity.

In Chameleon, cluster similarity is assessed based on how well-connected objects are within a cluster and on the proximity of clusters. That is, two clusters are merged if their interconnectivity is high and they are close together. Thus, Chameleon does not depend on a static, user-supplied model and can automatically adapt to the internal characteristics of the clusters being merged. The merge process facilitates the discovery of natural and homogeneous clusters and applies to all types of data as long as a similarity function can be specified.


Fig. Chameleon: Hierarchical clustering based on k-nearest neighbors and dynamic modeling.

Density-Based Methods

To discover clusters with arbitrary shape, density-based clustering methods have been developed. These typically regard clusters as dense regions of objects in the data space that are separated by regions of low density (representing noise).

DBSCAN grows clusters according to a density-based connectivity analysis. OPTICS extends DBSCAN to produce a cluster ordering obtained from a wide range of parameter settings. DENCLUE clusters objects based on a set of density distribution functions.

a. DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm. The algorithm grows regions with sufficiently high density into clusters and discovers clusters of arbitrary shape in spatial databases with noise. It defines a cluster as a maximal set of density-connected points.

The basic ideas of density-based clustering involve a number of new definitions, which are given below.

The neighborhood within a radius ε of a given object is called the ε-neighborhood of the object.

If the ε-neighborhood of an object contains at least a minimum number, MinPts, of objects, then the object is called a core object.

Given a set of objects, D, we say that an object p is directly density-reachable from object q if p is within the ε-neighborhood of q, and q is a core object.

An object p is density-reachable from object q with respect to ε and MinPts in a set of objects, D, if there is a chain of objects p1, ..., pn, where p1 = q and pn = p, such that pi+1 is directly density-reachable from pi.

An object p is density-connected to object q with respect to ε and MinPts in a set of objects, D, if there is an object o ∈ D such that both p and q are density-reachable from o with respect to ε and MinPts.
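These definitions translate directly into a simple, quadratic-time DBSCAN sketch; the Euclidean distance and the toy parameters are illustrative assumptions, and real implementations answer the neighborhood queries with a spatial index.

    def dist(a, b):
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

    def region_query(data, p, eps):
        # the eps-neighborhood of p (all objects within radius eps, including p)
        return [q for q in data if dist(p, q) <= eps]

    def dbscan(data, eps, min_pts):
        labels = {}                                   # object -> cluster id; -1 means noise
        cluster_id = 0
        for p in data:
            if p in labels:
                continue
            neighbors = region_query(data, p, eps)
            if len(neighbors) < min_pts:              # p is not a core object
                labels[p] = -1                        # provisionally mark it as noise
                continue
            cluster_id += 1                           # p is a core object: start a cluster
            labels[p] = cluster_id
            seeds = list(neighbors)
            while seeds:                              # grow the cluster by density-reachability
                q = seeds.pop()
                if labels.get(q, -1) == -1:           # unvisited, or previously marked noise
                    labels[q] = cluster_id
                    q_neighbors = region_query(data, q, eps)
                    if len(q_neighbors) >= min_pts:   # q is also a core object: keep expanding
                        seeds.extend(q_neighbors)
        return labels

    points = [(1, 1), (1.1, 1.2), (0.9, 1.0), (5, 5), (5.1, 5.2), (9, 9)]
    print(dbscan(points, eps=0.5, min_pts=2))   # two clusters plus one noise point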


Fig. Density reachability and density connectivity in density-based clustering.

b. OPTICS: Ordering Points to Identify the Clustering Structure

OPTICS computes an augmented cluster ordering for automatic and interactive cluster analysis. This ordering represents the density-based clustering structure of the data. It contains information that is equivalent to density-based clustering obtained from a wide range of parameter settings. The cluster ordering can be used to extract basic clustering information (such as cluster centers or arbitrary-shaped clusters) as well as provide the intrinsic clustering structure.

Fig. OPTICS terminology.

c. DENCLUE: Clustering Based on Density Distribution Functions

DENCLUE (DENsity-based CLUstEring) is a clustering method based on a set of density distribution functions. The method is built on the following ideas: (1) The influence of each data point can be formally modeled using a mathematical function, called an influence function, which describes the impact of a data point within its neighborhood;


(2) The overall density of the data space can be modeled analytically as the sum of the influence function applied to all data points; and (3) Clusters can then be determined mathematically by identifying density attractors, where density attractors are local maxima of the overall density function.

Grid-Based Methods

The grid-based clustering approach uses a multiresolution grid data structure. It quantizes the object space into a finite number of cells that form a grid structure on which all of the operations for clustering are performed. The main advantage of the approach is its fast processing time, which is typically independent of the number of data objects and dependent only on the number of cells in each dimension of the quantized space.

a. STING: STatistical INformation Grid

STING is a grid-based multiresolution clustering technique in which the spatial area is divided into rectangular cells. There are usually several levels of such rectangular cells corresponding to different levels of resolution, and these cells form a hierarchical structure. Each cell at a high level is partitioned to form a number of cells at the next lower level. Statistical information regarding the attributes in each grid cell (such as the mean, maximum, and minimum values) is precomputed and stored. These statistical parameters are useful for query processing.

Fig. A hierarchical structure for STING clustering.

b. WaveCluster: Clustering Using Wavelet Transformation

WaveCluster is a multiresolution clustering algorithm that summarizes the data by imposing a multidimensional grid structure onto the data space. It then uses a wavelet transformation to transform the original feature space, finding dense regions in the transformed space. In this approach, each grid cell summarizes the information of a group of points that map into the cell. This summary information typically fits into main memory for use by the multiresolution wavelet transform and the subsequent cluster analysis.


A wavelet transform is a signal processing technique that decomposes a signal into different frequency sub-bands. The wavelet model can be applied to d-dimensional signals by applying a one-dimensional wavelet transform d times. In applying a wavelet transform, data are transformed so as to preserve the relative distance between objects at different levels of resolution. This allows the natural clusters in the data to become more distinguishable.

Model-Based Clustering Methods

Model-based clustering methods attempt to optimize the fit between the given data and some mathematical model. Such methods are often based on the assumption that the data are generated by a mixture of underlying probability distributions.

a. Expectation-Maximization

Expectation-Maximization (EM) is used to cluster the data using a finite mixture density model of k probability distributions, where each distribution represents a cluster. The problem is to estimate the parameters of the probability distributions so as to best fit the data.

Fig. Expectation-Maximization.

Each cluster can be represented by a probability distribution, centered at a mean, and with a standard deviation. Here, we have two clusters, corresponding to the Gaussian distributions g(m1, σ1) and g(m2, σ2), respectively, where the dashed circles represent the first standard deviation of the distributions.

The steps are:
1. Make an initial guess of the parameter vector.
2. Iteratively refine the parameters (or clusters) using the following two steps:
   (a) Expectation step: assign each object xi to cluster Ck with the probability P(xi ∈ Ck) = p(Ck | xi) = p(Ck) p(xi | Ck) / p(xi).
   (b) Maximization step: use the probability estimates from the expectation step to re-estimate (or refine) the model parameters.
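A minimal sketch of these two steps for a one-dimensional mixture of two Gaussians with equal priors; the data, the equal-prior simplification, and the variance floor are illustrative assumptions.

    import math, random

    def gaussian(x, mu, sigma):
        return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

    def em_two_gaussians(xs, iterations=50):
        mu1, mu2 = random.choice(xs), random.choice(xs)   # initial guess of the parameters
        s1 = s2 = 1.0
        for _ in range(iterations):
            # Expectation step: probability that each point belongs to cluster 1
            w = []
            for x in xs:
                p1, p2 = gaussian(x, mu1, s1), gaussian(x, mu2, s2)
                w.append(p1 / (p1 + p2))
            # Maximization step: re-estimate means and standard deviations from the weights
            n1, n2 = sum(w), sum(1 - wi for wi in w)
            mu1 = sum(wi * x for wi, x in zip(w, xs)) / n1
            mu2 = sum((1 - wi) * x for wi, x in zip(w, xs)) / n2
            s1 = max(1e-3, math.sqrt(sum(wi * (x - mu1) ** 2 for wi, x in zip(w, xs)) / n1))
            s2 = max(1e-3, math.sqrt(sum((1 - wi) * (x - mu2) ** 2 for wi, x in zip(w, xs)) / n2))
        return (mu1, s1), (mu2, s2)

    data = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.9, 5.3]
    print(em_two_gaussians(data))   # roughly one Gaussian near 1.0 and one near 5.1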

b. Conceptual Clustering

Conceptual clustering is a form of clustering in machine learning that, given a set of unlabeled objects, produces a classification scheme over the objects. Unlike conventional clustering, which primarily identifies groups of like objects,


conceptual clustering goes one step further by also finding characteristic descriptions for each group, where each group represents a concept or class.

Hence, conceptual clustering is a two-step process: clustering is performed first, followed by characterization. Here, clustering quality is not solely a function of the individual objects. Rather, it incorporates factors such as the generality and simplicity of the derived concept descriptions. Probabilistic descriptions are typically used to represent each derived concept.

COBWEB is a popular and simple method of incremental conceptual clustering. Its input objects are described by categorical attribute-value pairs. COBWEB creates a hierarchical clustering in the form of a classification tree.

c. Neural Network Approach

The neural network approach is motivated by biological neural networks. A neural network is a set of connected input/output units, where each connection has a weight associated with it. Neural networks have several properties that make them popular for clustering.

First, neural networks are inherently parallel and distributed processing architectures.

Second, neural networks learn by adjusting their interconnection weights so as to best fit the data. This allows them to "normalize" or "prototype" the patterns and act as feature (or attribute) extractors for the various clusters.

Third, neural networks process numerical vectors and require object patterns to be represented by quantitative features only; many clustering tasks handle only numerical data or can transform their data into quantitative features if needed.

The neural network approach to clustering tends to represent each cluster as an exemplar. An exemplar acts as a “prototype” of the cluster and does not necessarily have to correspond to a particular data example or object. New objects can be distributed to the cluster whose exemplar is the most similar, based on some distance measure.

The attributes of an object assigned to a cluster can be predicted from the attributes of the cluster's exemplar. Self-organizing feature maps (SOMs) are one of the most popular neural network methods for cluster analysis. They are sometimes referred to as Kohonen self-organizing feature maps, after their creator, Teuvo Kohonen, or as topologically ordered maps.

SOMs’ goal is to represent all points in a high-dimensional source space by points in a low-dimensional (usually 2-D or 3-D) target space, such that the distance and proximity relationships (hence the topology) are preserved as much as possible.

Clustering High-Dimensional Data


Most clustering methods are designed for clustering low-dimensional data and encounter challenges when the dimensionality of the data grows very high. When the dimensionality increases, usually only a small number of dimensions are relevant to certain clusters, but data in the irrelevant dimensions may produce much noise and mask the real clusters to be discovered. When the dimensionality increases, data usually become increasingly sparse because the data points are likely located in different dimensional subspaces. When the data become really sparse, data points located in different dimensions can appear almost equally distant, and the distance measure, which is essential for cluster analysis, becomes meaningless.

a. CLIQUE: A Dimension-Growth Subspace Clustering Method

CLIQUE (CLustering InQUEst) was the first algorithm proposed for dimension-growth subspace clustering in high-dimensional space. In dimension-growth subspace clustering, the clustering process starts at single-dimensional subspaces and grows upward to higher-dimensional ones. Because CLIQUE partitions each dimension like a grid structure and determines whether a cell is dense based on the number of points it contains, it can also be viewed as an integration of density-based and grid-based clustering methods. However, its overall approach is typical of subspace clustering for high-dimensional space. The ideas of the CLIQUE clustering algorithm are outlined as follows.

Given a large set of multidimensional data points, the data space is usually not uniformly occupied by the data points. CLIQUE’s clustering identifies the sparse and the “crowded” areas in space, thereby discovering the overall distribution patterns of the data set.

A unit is dense if the fraction of total data points contained in it exceeds an input model parameter. In CLIQUE, a cluster is defined as a maximal set of connected dense units.
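A rough Python sketch of the dense-unit idea on 2-D points: each dimension is partitioned into equal-width intervals, the 1-D units whose point fraction exceeds the density threshold are kept, and, in Apriori fashion, only pairs of dense 1-D units from different dimensions are generated as candidate 2-D units. The interval count and the threshold are assumed parameters.

    from collections import Counter
    from itertools import combinations

    def dense_units_1d(points, dim, num_intervals, threshold):
        # dense 1-D units in one dimension: (dim, interval_index) cells with enough points
        lo = min(p[dim] for p in points)
        hi = max(p[dim] for p in points)
        width = (hi - lo) / num_intervals or 1.0
        cells = Counter(min(int((p[dim] - lo) / width), num_intervals - 1) for p in points)
        return {(dim, cell) for cell, count in cells.items() if count / len(points) >= threshold}

    def candidate_2d_units(dense_1d):
        # Apriori-style join: a 2-D unit can be dense only if both 1-D projections are dense
        return {frozenset({u, v}) for u, v in combinations(dense_1d, 2) if u[0] != v[0]}

    points = [(0.1, 0.2), (0.15, 0.25), (0.9, 0.8), (0.92, 0.85), (0.5, 0.1)]
    d1 = dense_units_1d(points, 0, 4, 0.3) | dense_units_1d(points, 1, 4, 0.3)
    print(d1, candidate_2d_units(d1))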

b. PROCLUS: A Dimension-Reduction Subspace Clustering Method

PROCLUS (PROjected CLUStering) is a dimension-reduction subspace clustering method. It starts by finding an initial approximation of the clusters in the high-dimensional attribute space. Each dimension is then assigned a weight for each cluster, and the updated weights are used in the next iteration to regenerate the clusters. This leads to the exploration of dense regions in all subspaces of some desired dimensionality and avoids the generation of a large number of overlapped clusters in projected dimensions of lower dimensionality.

PROCLUS finds the best set of medoids by a hill-climbing process similar to that used in CLARANS, but generalized to deal with projected clustering. It adopts a distance measure called Manhattan segmental distance, which is the Manhattan distance on a set of relevant dimensions.

The PROCLUS algorithm consists of three phases:
1. Initialization
2. Iteration
3. Cluster refinement

In the Initialization Phase, it uses a greedy algorithm to select a set of initial medoids that are far apart from each other so as to ensure that each cluster is represented by at least one object in the selected set. It chooses a random sample of data points proportional to the number of clusters to generate, and then applies the greedy algorithm to obtain an even smaller final subset for the next phase.

The Iteration Phase selects a random set of k medoids from this reduced set (of medoids), and replaces "bad" medoids with randomly chosen new medoids if the clustering is improved. For each medoid, a set of dimensions is chosen whose average distances are small compared to statistical expectation. The total number of dimensions associated with the medoids must be k.

The Refinement Phase computes new dimensions for each medoid based on the clusters found, reassigns points to medoids, and removes outliers.

c. Frequent Pattern–Based Clustering Methods

Frequent pattern mining searches for patterns (such as sets of items or objects) that occur frequently in large data sets. Frequent pattern mining can lead to the discovery of interesting associations and correlations among data objects.

Constraint-Based Cluster Analysis

Constraint-based clustering finds clusters that satisfy user-specified preferences or constraints. Depending on the nature of the constraints, constraint-based clustering may adopt rather different approaches. Here are a few categories of constraints.

1. Constraints on individual objects: This constraint confines the set of objects to be clustered. It can easily be handled by preprocessing (e.g., performing selection using an SQL query), after which the problem reduces to an instance of unconstrained clustering.

2. Constraints on the selection of clustering parameters: A user may like to set a desired range for each clustering parameter. Clustering parameters are usually quite specific to the given clustering algorithm. Examples of parameters include k, the desired number of clusters in a k-means algorithm.

3. Constraints on distance or similarity functions: Different distance or similarity functions can be specified for specific attributes of the objects to be clustered or different distance measures for specific pairs of objects.

a. Clustering with Obstacle Objects


During clustering, we must take obstacle objects into consideration. Obstacles introduce constraints on the distance function. The straight-line distance between two points is meaningless if there is an obstacle in the way.

b. Semi-Supervised Cluster Analysis

In comparison with supervised learning, clustering lacks guidance from users or classifiers (such as class label information), and thus may not generate highly desirable clusters. The quality of unsupervised clustering can be significantly improved using some weak form of supervision, for example, in the form of pairwise constraints. A clustering process based on user feedback or guidance constraints is called semi-supervised clustering.

Methods for semi-supervised clustering can be categorized into two classes: Constraint-Based Semi-Supervised Clustering and Distance-Based Semi-Supervised Clustering.

Constraint-based semi-supervised clustering relies on user-provided labels or constraints to guide the algorithm toward a more appropriate data partitioning. This includes modifying the objective function based on constraints, or initializing and constraining the clustering process based on the labeled objects.

Distance-based semi-supervised clustering employs an adaptive distance measure that is trained to satisfy the labels or constraints in the supervised data.

4.13 Outlier Analysis

Data objects that do not comply with the general behavior or model of the data, and that are grossly different from or inconsistent with the remaining set of data, are called outliers.

Outliers can be caused by measurement or execution error. For example, the display of a person’s age as 999 could be caused by a program default setting of an unrecorded age. Outliers may be the result of inherent data variability.

Many data mining algorithms try to minimize the influence of outliers or eliminate them altogether. This, however, could result in the loss of important hidden information, because one person's noise could be another person's signal. Retaining outliers is useful, for example, in fraud detection, where outliers may indicate fraudulent activity. Thus, outlier detection and analysis is an interesting data mining task, referred to as outlier mining. Outlier mining has wide applications. It can be used in fraud detection by detecting unusual usage of credit cards or telecommunication services. In addition, it is useful in customized marketing for identifying the spending behavior of customers with extremely low or extremely high incomes, or in medical analysis for finding unusual responses to various medical treatments. The outlier mining problem can be viewed as two subproblems:


(1) Define what data can be considered as inconsistent in a given data set, and (2) Find an efficient method to mine the outliers so defined.

4.13.1 Statistical Distribution-Based Outlier Detection

The statistical distribution-based approach to outlier detection assumes a distribution or probability model for the given data set (e.g., a normal or Poisson distribution) and then identifies outliers with respect to the model using a discordancy test.

Application of the test requires knowledge of the data set parameters (such as the assumed data distribution), knowledge of distribution parameters (such as the mean and variance), and the expected number of outliers.

There are two basic types of procedures for detecting outliers:

Block procedures: Either all of the suspect objects are treated as outliers or all of them are accepted as consistent.

Consecutive (or sequential) procedures: An example of such a procedure is the inside-out procedure. The object that is least "likely" to be an outlier is tested first. If it is found to be an outlier, then all of the more extreme values are also considered outliers; otherwise, the next most extreme object is tested, and so on. This procedure tends to be more effective than block procedures.

4.13.2 Distance-Based Outlier Detection

The notion of distance-based outliers was introduced to overcome limitations imposed by statistical methods. An object o in a data set D is a distance-based (DB) outlier with parameters pct and dmin, written DB(pct, dmin), if at least a fraction pct of the objects in D lie at a distance greater than dmin from o. In other words, distance-based outliers are those objects that do not have "enough" neighbors, where neighbors are defined based on distance from the given object. In comparison with statistical methods, distance-based outlier detection generalizes the ideas behind discordancy testing for various standard distributions. It avoids the excessive computation that can be associated with fitting the observed distribution into some standard distribution and with selecting discordancy tests. Several efficient algorithms for mining distance-based outliers have been developed. These are outlined as follows.

Index-based algorithm: Given a data set, the index-based algorithm uses multidimensional indexing structures, such as R-trees or k-d trees, to search for neighbors of each object o within radius dmin around that object.

Let M be the maximum number of objects within the dmin-neighborhood of an outlier. Therefore, once M + 1 neighbors of object o are found, it is clear that o is not an outlier. The index-based algorithm scales well as k increases.
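A toy nested-loop version of this test; the Euclidean distance and the sample data are illustrative assumptions. An object is reported as a DB(pct, dmin)-outlier unless more than M = n(1 − pct) other objects lie within distance dmin of it.

    def dist(a, b):
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

    def is_db_outlier(o, data, pct, dmin):
        n = len(data)
        M = int(n * (1 - pct))          # maximum number of dmin-neighbors an outlier may have
        neighbors = 0
        for q in data:
            if q is not o and dist(o, q) <= dmin:
                neighbors += 1
                if neighbors > M:       # M + 1 neighbors found: o cannot be an outlier
                    return False
        return True

    data = [(1, 1), (1.2, 0.9), (0.8, 1.1), (1.1, 1.0), (10, 10)]
    print([o for o in data if is_db_outlier(o, data, pct=0.7, dmin=2.0)])   # -> [(10, 10)]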

Nested-loop algorithm: The nested-loop algorithm has the same computational complexity as the index-based algorithm but avoids index structure construction and tries to minimize the number of I/Os. It divides the memory buffer space into two halves and the data set into several logical blocks. By carefully choosing the order in which blocks are loaded into each half, I/O efficiency can be achieved.

Cell-based algorithm: To avoid O(n²) computational complexity, a cell-based algorithm was developed for memory-resident data sets. Its complexity is O(c^k + n), where c is a constant depending on the number of cells and k is the dimensionality.

4.13.3 Density-Based Local Outlier Detection

Statistical and distance-based outlier detection both depend on the overall or "global" distribution of the given set of data points, D. These methods encounter difficulties when analyzing data with rather different density distributions. To define the local outlier factor of an object, we need to introduce the concepts of k-distance, k-distance neighborhood, reachability distance, and local reachability density.

These are defined as follows. The k-distance of an object p is the maximal distance between p and its k nearest neighbors, denoted k-distance(p). It is defined as the distance, d(p, o), between p and an object o ∈ D such that (1) there are at least k objects in D that are as close as or closer to p than o, and (2) there are at most k − 1 objects in D that are closer to p than o.
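A small sketch of these two definitions, assuming Euclidean distance; the local reachability density and the local outlier factor (LOF) are then built on top of such neighborhoods.

    def dist(a, b):
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

    def k_distance(p, data, k):
        # distance from p to its k-th nearest neighbor (p itself excluded)
        return sorted(dist(p, q) for q in data if q is not p)[k - 1]

    def k_distance_neighborhood(p, data, k):
        # all objects of D whose distance from p is not greater than k-distance(p)
        kd = k_distance(p, data, k)
        return [q for q in data if q is not p and dist(p, q) <= kd]

    D = [(0, 0), (0, 1), (1, 0), (5, 5)]
    p = D[0]
    print(k_distance(p, D, 2), k_distance_neighborhood(p, D, 2))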

4.13.4 Deviation-Based Outlier Detection

Deviation-based outlier detection does not use statistical tests or distance-based measures to identify exceptional objects. It identifies outliers by examining the main characteristics of objects in a group. Objects that "deviate" from this description are considered outliers.

a. Sequential Exception Technique

The sequential exception technique simulates the way in which humans can distinguish unusual objects from among a series of supposedly like objects. It uses the implicit redundancy of the data. Given a data set, D, of n objects, it builds a sequence of subsets, {D1, D2, ..., Dm}, of these objects with 2 ≤ m ≤ n such that Dj−1 ⊂ Dj, where Dj ⊆ D.

b. OLAP Data Cube Technique

An OLAP approach to deviation detection uses data cubes to identify regions of anomalies in large multidimensional data. For added efficiency, the deviation detection process is overlapped with cube computation.


The approach is a form of discovery-driven exploration, in which precomputed measures indicating data exceptions are used to guide the user in data analysis, at all levels of aggregation. A cell value in the cube is considered an exception if it is significantly different from the expected value, based on a statistical model. The method uses visual cues such as background color to reflect the degree of exception of each cell. The user can choose to drill down on cells that are flagged as exceptions. The measure value of a cell may reflect exceptions occurring at more detailed or lower levels of the cube, where these exceptions are not visible from the current level.


UNIT V RECENT TRENDS

Multidimensional analysis and descriptive mining of complex data objects − Spatial databases −Multimedia databases − Time series and sequence data − Text databases − World Wide Web −Applications and trends in data mining.

5.1 MINING COMPLEX TYPES OF DATA

Introduction

Our previous studies of data mining techniques have focused on mining relational databases, transactional databases, and data warehouses formed by the transformation and integration of structured data. Vast amounts of data in various complex forms (e.g., structured and unstructured, hypertext and multimedia) have been growing explosively owing to the rapid progress of data collection tools, advanced database system technologies, and World Wide Web (WWW) technologies. Therefore, an increasingly important task in data mining is to mine complex types of data, including complex objects, spatial data, multimedia data, time-series data, text data, and the World Wide Web.

In this chapter, we examine how to further develop the essential data mining techniques (such as characterization, association, classification, and clustering), and how to develop new ones to cope with complex types of data and perform fruitful knowledge mining in complex information repositories. Since research into mining such complex databases is evolving at a rapid pace, our discussion covers only some preliminary issues. We expect that many books dedicated to the mining of complex kinds of data will become available in the future.

Multidimensional Analysis and Descriptive Mining of Complex Data Objects

A major limitation of many commercial data warehouse and OLAP tools for multidimensional database analysis is their restriction on the allowable data types for dimensions and measures. Most data cube implementations confine dimensions to nonnumeric data and measures to simple aggregated values.

To introduce data mining and multidimensional data analysis for complex objects, this section examines how to perform generalization on complex structured objects and construct object cubes for OLAP and mining in object databases. The storage and access of complex structured data have been studied in object-relational and object-oriented database systems. These systems organize a large set of complex data objects into classes, which are in turn organized into class/subclass hierarchies. Each object in a class is associated with

1) an object identifier;
2) a set of attributes that may contain sophisticated data structures, set- or list-valued data, class composition and hierarchies, multimedia data, and so on; and
3) a set of methods that specify the computational routines or rules associated with the object class.

To facilitate generalization and induction in object-relational and object-oriented databases, it is important to study how the generalized data can be used for multidimensional data analysis and data mining.

5.2 Spatial Data Mining

Spatial data mining refers to the extraction of knowledge, spatial relationships, or other interesting patterns not explicitly stored in spatial databases. Such mining demands an integration of data mining with spatial database technologies.

A spatial database stores a large amount of space-related data, such as maps, preprocessed remote sensing or medical imaging data, and VLSI chip layout data. Spatial databases have many features distinguishing them from relational databases. They carry topological and/or distance information, usually organized by sophisticated, multidimensional spatial indexing structures that are accessed by spatial data access methods and often require spatial reasoning, geometric computation, and spatial knowledge representation techniques.

It can be used for understanding spatial data, discovering spatial relationships and relationships between spatial and non spatial data, constructing spatial knowledge bases, reorganizing spatial databases, and optimizing spatial queries. It is expected to have wide applications in geographic information systems, geomarketing, remote sensing, image database exploration, medical imaging, navigation, traffic control, environmental studies, and many other areas where spatial data are used.

5.2.1 Spatial Data Cube Construction and Spatial OLAP

A spatial data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of both spatial and nonspatial data in support of spatial data mining and spatial-data-related decision-making processes.

5.2.2 Mining Spatial Association and Co-location Patterns

A spatial association rule is of the form A ⇒ B [s%, c%], where A and B are sets of spatial or nonspatial predicates, s% is the support of the rule, and c% is the confidence of the rule.

For example:

is_a(X, "school") ∧ close_to(X, "sports_center") ⇒ close_to(X, "park") [0.5%, 80%]

This rule states that 80% of schools that are close to sports centers are also close to parks, and 0.5% of the data belongs to such a case.


Examples of spatial relationships include distance information, topological relations (like intersect, overlap, and disjoint), and spatial orientations (like left of and west of). Progressive refinement can be adopted in spatial association analysis:

The method first mines large data sets roughly using a fast algorithm.

It then improves the quality of mining in a pruned data set using a more expensive algorithm.

The superset coverage property is used to ensure that the pruned data set covers the complete set of answers when the high-quality data mining algorithms are applied. It preserves all of the potential answers: it may allow a false-positive test, which might include some data sets that do not belong to the answer sets, but it should not allow a false-negative test, which might exclude some potential answers.

For mining spatial associations related to the spatial predicate close_to, we can therefore first collect the candidates that pass the minimum support threshold using a rough, inexpensive spatial computation, and then apply finer, more expensive spatial computation only to the surviving candidates.

1. Spatial Clustering Methods

Spatial data clustering identifies clusters, or densely populated regions, according to some distance measurement in a large, multidimensional data set.

2. Spatial Classification and Spatial Trend Analysis

Spatial classification analyzes spatial objects to derive classification schemes with respect to certain spatial properties, such as the neighborhood of a district, highway, or river.

3. Mining Raster Databases

Spatial database systems usually handle vector data that consist of points, lines, polygons (regions), and their compositions, such as networks or partitions. Typical examples of such data include maps, design graphs, and 3-D representations of the arrangement of the chains of protein molecules.

5.3 Multimedia Data Mining

A multimedia database system stores and manages a large collection of multimedia data, such as audio, video, image, graphics, speech, text, document, and hypertext data, which contain text, text markups, and linkages.

Multimedia database systems are increasingly common owing to the popular use of audio-video equipment, digital cameras, CD-ROMs, and the Internet. Typical multimedia database systems include NASA’s EOS (Earth Observation System), various kinds of image and audio-video databases, and Internet databases.

5.3.1 Similarity Search in Multimedia Data

For similarity searching in multimedia data, we consider two main families of multimedia indexing and retrieval systems:


(1) Description-Based Retrieval Systems, which build indices and perform object retrieval based on image descriptions, such as keywords, captions, size, and time of creation; and (2) Content-Based Retrieval Systems, which support retrieval based on the image content, such as color histogram, texture, pattern, image topology, and the shape of objects and their layouts and locations within the image.

Description-based retrieval is labor-intensive if performed manually. If automated, the results are typically of poor quality. For example, the assignment of keywords to images can be a tricky and arbitrary task.

Content-based retrieval uses visual features to index images and promotes object retrieval based on feature similarity, which is highly desirable in many applications.

In a content-based image retrieval system, there are often two kinds of queries: Image-sample-based queries and Image feature specification queries.

Image-sample-based queries are used to find all of the images that are similar to the given image sample. This search compares the feature vector (or signature) extracted from the sample with the feature vectors of images that have already been extracted and indexed in the image database. Based on this comparison, images that are close to the sample image are returned.

Image feature specification queries specify or sketch image features like color, texture, or shape, which are translated into a feature vector to be matched with the feature vectors of the images in the database.

Content-based retrieval is used for medical diagnosis, weather prediction, TV production, Web search engines for images, and e-commerce.

Similarity-based retrieval in image databases can be based on several kinds of image signatures:

Color histogram–based signature: The signature of an image includes color histograms based on the color composition of the image, regardless of its scale or orientation. This method does not contain any information about shape, image topology, or texture. Thus, two images with similar color composition but very different shapes or textures may be identified as similar, although they could be completely unrelated semantically.

Multifeature composed signature: The signature of an image includes a composition of multiple features: color histogram, shape, image topology, and texture. The extracted image features are stored as metadata, and images are indexed based on such metadata. Multidimensional content-based search can then use these image features to search for similar images.

Wavelet-based signature: This approach uses the dominant wavelet coefficients of an image as its signature. Wavelets capture shape, texture, and image topology information in a single unified framework. This improves efficiency and reduces the need for providing multiple search primitives.

Wavelet-based signature with region-based granularity: In this approach, the computation and comparison of signatures are at the granularity of regions, not the entire image. This is based on the observation that similar images may contain similar regions, but a region in one image could be a translation or scaling of a matching region in the other. Therefore, a similarity measure between the query image Q and a target image T can be defined in terms of the fraction of the area of the two images covered by matching pairs of regions from Q and T. Such a region-based similarity search can find images containing similar objects, where these objects may be translated or scaled.

5.3.2 Multidimensional Analysis of Multimedia Data

A multimedia data cube can contain additional dimensions and measures for multimedia information, such as color, texture, and shape.

5.3.3 Classification and Prediction Analysis of Multimedia Data

Classification and predictive modeling have been used for mining multimedia data, especially in scientific research such as astronomy and seismology.

5.3.4 Mining Associations in Multimedia Data

Association rules involving multimedia objects can be mined in image and video databases.

Three categories can be observed:

Associations between image content and non-image content features: A rule like "If at least 50% of the upper part of the picture is blue, then it is likely to represent sky" belongs to this category, since it links the image content to the keyword sky.

Associations among image contents that are not related to spatial relationships: A rule like "If a picture contains two blue squares, then it is likely to contain one red circle as well" belongs to this category, since the associations all regard image contents.

Associations among image contents related to spatial relationships: A rule like "If a red triangle is between two yellow squares, then it is likely a big oval-shaped object is underneath" belongs to this category, since it associates objects in the image with spatial relationships.

To mine associations among multimedia objects, we can treat each image as a transaction and find frequently occurring patterns among different images.

5.3.5 Audio and Video Data Mining

Besides still images, an enormous amount of audiovisual information is becoming available in digital form: in digital archives, on the World Wide Web, in broadcast data streams, and in personal and professional databases.

Typical examples include searching for and multimedia editing of particular video clips in a TV studio, detecting suspicious persons or scenes in surveillance videos, searching for particular events in a personal multimedia repository such as MyLifeBits, discovering patterns and outliers in weather radar recordings, and finding a particular melody or tune in an MP3 audio album.

5.4 Mining Time-Series Data

A time-series database consists of sequences of values or events changing with time, where the data are typically recorded at regular intervals.

Characteristic time-series components: trend, cycle, seasonal, and irregular.

Applications:
Financial: stock prices, inflation
Industry: power consumption
Scientific: experiment results
Meteorological: precipitation


Categories of Time-Series Movements

Long-term or trend movements (trend curve): the general direction in which a time series is moving over a long interval of time.

Cyclic movements or cyclic variations: long-term oscillations about a trend line or curve, e.g., business cycles; they may or may not be periodic.

Seasonal movements or seasonal variations: almost identical patterns that a time series appears to follow during corresponding months of successive years.

Irregular or random movements.

Time-series analysis is the decomposition of a time series into these four basic movements:
Additive model: TS = T + C + S + I
Multiplicative model: TS = T × C × S × I

Methods for estimating the trend include:

The freehand method: fit the curve by looking at the graph. This is costly and barely reliable for large-scale data mining.

The least-squares method: find the curve minimizing the sum of the squares of the deviations of points on the curve from the corresponding data points.

The moving-average method: a moving average of order n smooths the data and eliminates cyclic, seasonal, and irregular movements, but it loses the data at the beginning and end of a series and is sensitive to outliers (the sensitivity can be reduced by using a weighted moving average).
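A small sketch of simple and weighted moving averages; the sales figures and the weights are invented for illustration.

    def moving_average(series, n):
        # simple moving average of order n; it loses (n - 1) values at the ends
        return [sum(series[i:i + n]) / n for i in range(len(series) - n + 1)]

    def weighted_moving_average(series, weights):
        # weighted version, which reduces the sensitivity to outliers
        n, total = len(weights), sum(weights)
        return [sum(w * x for w, x in zip(weights, series[i:i + n])) / total
                for i in range(len(series) - n + 1)]

    sales = [12, 15, 14, 30, 16, 15, 17, 18]
    print(moving_average(sales, 3))
    print(weighted_moving_average(sales, [1, 2, 1]))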

Seasonal index: a set of numbers showing the relative values of a variable during the months of the year. For example, if the sales during October, November, and December are 80%, 120%, and 140% of the average monthly sales for the whole year, respectively, then 80, 120, and 140 are the seasonal index numbers for these months.

Deseasonalized data: data adjusted for seasonal variations to allow better trend and cyclic analysis; divide the original monthly data by the seasonal index numbers for the corresponding months.

Estimation of cyclic variations: if (approximate) periodicity of cycles occurs, a cyclic index can be constructed in much the same manner as seasonal indexes.

Estimation of irregular variations: done by adjusting the data for the trend, seasonal, and cyclic variations.

With the systematic analysis of the trend, cyclic, seasonal, and irregular components, it is possible to make long- or short-term predictions with reasonable quality.
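A one-line deseasonalization sketch using the 80/120/140 index numbers from the example above; the quarterly sales figures themselves are invented for illustration.

    def deseasonalize(monthly_data, seasonal_index):
        # divide each month's value by its seasonal index (given in percent)
        return [value / (index / 100.0) for value, index in zip(monthly_data, seasonal_index)]

    q4_sales = [80000, 126000, 150500]          # October, November, December
    q4_index = [80, 120, 140]
    print(deseasonalize(q4_sales, q4_index))    # adjusted values are directly comparable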

5.5 Text Databases and IR

Text databases (document databases) are large collections of documents from various sources: news articles, research papers, books, digital libraries, e-mail messages, Web pages, library databases, and so on. The data stored are usually semi-structured, and traditional information retrieval techniques are becoming inadequate for the increasingly vast amounts of text data.

Information retrieval (IR) is a field that developed in parallel with database systems. Information is organized into (a large number of) documents, and the information retrieval problem is that of locating relevant documents based on user input, such as keywords or example documents.

Typical IR systems include online library catalogs and online document management systems.

Information retrieval vs. database systems:
Some DB problems are not present in IR, e.g., update, transaction management, and complex objects.
Some IR problems are not addressed well in DBMS, e.g., unstructured documents and approximate search using keywords and relevance.

Keyword-Based Retrieval

A document is represented by a string, which can be identified by a set of keywords

Queries may use expressions of keywords, e.g., "car and repair shop", "tea or coffee", "DBMS but not Oracle". Queries and retrieval should consider synonyms, e.g., repair and maintenance.

Major difficulties of the model:
Synonymy: a keyword T may not appear anywhere in the document, even though the document is closely related to T (e.g., data mining).
Polysemy: the same keyword may mean different things in different contexts (e.g., mining).


Similarity-Based Retrieval in Text Databases

This approach finds similar documents based on a set of common keywords. The answer should be ranked by the degree of relevance, determined by factors such as the nearness of the keywords and their relative frequency.

Basic techniques include the use of a stop list: a set of words that are deemed "irrelevant" even though they may appear frequently, e.g., a, the, of, for, with, etc. Stop lists may vary when the document set varies.
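A toy sketch of keyword-based retrieval with a stop list; the stop-word set, the documents, and the scoring are illustrative, and the sketch ignores synonymy and polysemy entirely.

    STOP_WORDS = {"a", "the", "of", "for", "with", "and", "or", "but", "not"}

    def keywords(text):
        # lower-case tokens with stop words removed (a very rough tokenizer)
        return {w for w in text.lower().split() if w not in STOP_WORDS}

    def relevance(query, document):
        # crude similarity: fraction of query keywords that appear in the document
        q, d = keywords(query), keywords(document)
        return len(q & d) / len(q) if q else 0.0

    docs = ["Car repair and maintenance shop",
            "Tea or coffee for the office",
            "Oracle DBMS administration guide"]
    print(sorted(docs, key=lambda d: relevance("car repair shop", d), reverse=True))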

5.6 Mining the World-Wide Web

The WWW is a huge, widely distributed, global information service center for:
Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc.
Hyperlink information
Access and usage information

The WWW provides rich sources for data mining, but it also poses challenges:
It is too huge for effective data warehousing and data mining.
It is too complex and heterogeneous: there are no standards and no uniform structure.


Web search engines are index-based: they search the Web, index Web pages, and build and store huge keyword-based indices that help locate sets of Web pages containing certain keywords. Deficiencies:
A topic of any breadth may easily contain hundreds of thousands of documents.
Many documents that are highly relevant to a topic may not contain the keywords defining them (the polysemy problem).


HITS (Hyperlink-Induced Topic Search)

HITS explores the interaction between hubs and authoritative pages:
Use an index-based search engine to form the root set: many of these pages are presumably relevant to the search topic, and some of them should contain links to most of the prominent authorities.
Expand the root set into a base set: include all of the pages that the root-set pages link to, and all of the pages that link to a page in the root set, up to a designated size cutoff.
Apply weight propagation: an iterative process that determines numerical estimates of the hub and authority weights.
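A compact sketch of the weight-propagation step on a tiny hypothetical base set; the page names and the link structure are invented, and a real HITS run would operate on the base set crawled from the root set.

    def hits(links, iterations=20):
        # links: dict mapping each page to the list of pages it links to
        pages = set(links) | {q for targets in links.values() for q in targets}
        auth = {p: 1.0 for p in pages}
        hub = {p: 1.0 for p in pages}
        for _ in range(iterations):
            # authority weight: sum of the hub weights of the pages pointing to it
            auth = {p: sum(hub[q] for q in links if p in links[q]) for p in pages}
            # hub weight: sum of the authority weights of the pages it points to
            hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
            # normalize so that the weights do not grow without bound
            a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
            h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
            auth = {p: v / a_norm for p, v in auth.items()}
            hub = {p: v / h_norm for p, v in hub.items()}
        return hub, auth

    links = {"p1": ["p3"], "p2": ["p3", "p4"], "p3": ["p4"], "p4": []}
    print(hits(links))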

XML can help to extract the correct descriptors, and standardization would greatly facilitate information extraction. A potential problem is that while XML can help solve heterogeneity for vertical applications, the freedom to define tags can make horizontal applications on the Web even more heterogeneous.


5.7 Data Mining Applications

Data mining is a young discipline with wide and diverse applications, and there is still a nontrivial gap between the general principles of data mining and domain-specific, effective data mining tools for particular applications. Some application domains covered here are biomedical and DNA data analysis, financial data analysis, the retail industry, and the telecommunication industry.

Data Mining for Biomedical and DNA Data Analysis

DNA sequences are built from 4 basic building blocks (nucleotides): adenine (A), cytosine (C), guanine (G), and thymine (T). A gene is a sequence of hundreds of individual nucleotides arranged in a particular order. Humans have around 100,000 genes, and there is a tremendous number of ways in which the nucleotides can be ordered and sequenced to form distinct genes.

Data mining tasks in this domain include:
Semantic integration of heterogeneous, distributed genome databases: currently there is highly distributed, uncontrolled generation and use of a wide variety of DNA data.
Similarity search and comparison among DNA sequences.


Compare the frequently occurring patterns of each class (e.g., diseased and healthy) to identify gene sequence patterns that play roles in various diseases.
Association analysis: identification of co-occurring gene sequences. Most diseases are not triggered by a single gene but by a combination of genes acting together; association analysis may help determine the kinds of genes that are likely to co-occur in target samples.
Path analysis: linking genes to different disease development stages. Different genes may become active at different stages of the disease, so pharmaceutical interventions can be developed that target the different stages separately.
Visualization tools and genetic data analysis: data cleaning and data integration methods developed in data mining will help.

Data Mining for Financial Data Analysis

Financial data collected in banks and financial institutions are often relatively complete, reliable, and of high quality

Design and construction of data warehouses for multidimensional data analysis and data mining: view the debt and revenue changes by month, by region, by sector, and by other factors, and access statistical information such as max, min, total, average, trend, etc.

Loan payment prediction and consumer credit policy analysis: feature selection and attribute relevance ranking, loan payment performance, and consumer credit rating.

Financial Data Mining

Classification and clustering of customers for targeted marketing: multidimensional segmentation by nearest-neighbor, classification, decision trees, etc., to identify customer groups or to associate a new customer with an appropriate customer group.

Detection of money laundering and other financial crimes: integration of data from multiple databases (e.g., bank transactions and federal/state crime history databases). Tools include data visualization, linkage analysis, classification, clustering, outlier analysis, and sequential pattern analysis tools (to find unusual access sequences).
