
Dataware Housing and Mining 16-Mar-06

Apr 05, 2018


Ashish Sakpal
  • 7/31/2019 Dataware Housing and Mining 16-Mar-06

    1/30

    Data Mining: An Overview from a Database Perspective


    Why Data Mining? Potential Applications

    Marketing

    Corporate Analysis

    Fraud Detection

    Other Applications


    Marketing

    Sales Analysis

    associations between product sales

    beer and diapers

    Customer Profiling

    data mining can tell you what types of customers buy what products

    Identifying Customer Requirements

    identify the best products for different customers

    use prediction to find what factors will attract new customers


    Corporate Analysis

    Finances

    cash flow analysis and prediction

    Resources

    summarize and compare the resources and spending

    Competition

    compare with other competitors by summarizing data to the same level.


    Fraud Detection

    Auto Insurance Fraud

    Association Rule Mining can detect a group of people who stage accidents to collect on insurance

    Money Laundering

    Since 1993, the US Treasury's Financial Crimes Enforcement Network agency has used a data-mining application to detect suspicious money transactions


    Other Applications

    Sports Teams

    New York Knicks use data mining to gain a competitive advantage

    Astronomy

    California Institute of Technology and the Palomar Observatory discovered 22 quasars with the help of data mining

    Banking

    Security Pacific/Bank of America uses data mining to help with commercial lending decisions and to prevent fraud


    Data Mining: Major Issues

    Diversity of data mining tasks:

    Summarization, characterization, association, classification, clustering, trend and deviation analysis, other pattern analysis.

    Diversity of data:

    Relational, transactional, data warehouse, spatial, text, multimedia, active, object-oriented, Web, etc.

    Efficiency and scalability

    Expression and visualization of data mining results

    Data mining applications, social issues (security and


    Data Mining: Classification

    Different views, different classifications:

    the kinds of knowledge to be mined

    the kinds of database to be mined on

    the kinds of techniques adopted

    Knowledge to be mined:

    Summarization, characterization, association, classification, clustering, trend and deviation analysis, other pattern analysis.

    Database to be mined on:

    Relational, transactional, data warehouse, spatial, text, multimedia, active, object-oriented, Web, etc.

    Techniques adopted:

    Database, statistics, visualization, machine learning


    Data Mining: A KD Process

    Data mining: the core of the knowledge discovery process.

    [Figure: the knowledge discovery process. Labels shown: Databases, Data Cleaning, Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, Pattern Evaluation]


    From OLAP to OLAP Mining

    Construction of data warehouse and computation of data cubes.

    OLAP: On-Line Analytical Processing.

    OLAP operations: drilling/rolling, pivoting, slicing/dicing, filtering, etc.

    OLAP mining (OLAM): Integration of OLAP with data mining.

    On-line interactive mining: Mining intertwined with drilling, slicing and dicing, pivoting, etc.

    Dynamic swapping of mining tasks.
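
    The OLAP operations named on this slide can be sketched on a tiny in-memory fact table. This is a minimal illustration only: the table contents, the dimension names, and the helper functions roll_up, slice_ and pivot are all hypothetical, and a real OLAP engine would work on precomputed data cubes rather than raw rows.

```python
from collections import defaultdict

# Hypothetical fact table: (year, quarter, region, product, sales).
FACTS = [
    (2005, "Q1", "East", "beer", 120),
    (2005, "Q1", "East", "diapers", 80),
    (2005, "Q2", "East", "beer", 100),
    (2005, "Q1", "West", "beer", 90),
    (2005, "Q2", "West", "diapers", 60),
]
DIMS = ("year", "quarter", "region", "product")


def roll_up(facts, keep):
    """Sum sales over every dimension that is not listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


def slice_(facts, dim, value):
    """Keep only the rows where dimension `dim` equals `value`."""
    i = DIMS.index(dim)
    return [row for row in facts if row[i] == value]


def pivot(cells):
    """Swap the axes of a two-dimensional aggregate."""
    return {(b, a): v for (a, b), v in cells.items()}


# Roll up to the region level, then drill down to region by quarter.
by_region = roll_up(FACTS, ["region"])
by_region_quarter = roll_up(FACTS, ["region", "quarter"])

# Slice on region = East, then summarize the slice by product.
east_by_product = roll_up(slice_(FACTS, "region", "East"), ["product"])

print(by_region)
print(pivot(by_region_quarter))
print(east_by_product)
```

    Drilling down is simply a roll-up that keeps more dimensions, which is why one function covers both directions here.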


    Why OLAP Mining?

    Integration of data mining with data warehouse and OLAP technologies.

    Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.

    Interactive characterization, comparison, association, classification, clustering, prediction.

    Integration of different data mining functions, e.g., characterized classification, first clustering and then association, etc.


    Data Mining: OLAM Architecture

    [Figure: OLAM architecture. Components shown: Database, Data Warehouse, Meta Data, Data Cube, OLAM Engine, OLAP Engine, User GUI API, Data Cube API, ODBC/OLEDB]


    Mining Data Dispersion Characteristics

    Data Dispersion Characteristics

    median, max, min, quantiles, outliers, variance, etc.

    Numerical dimensions correspond to sorted intervals:

    Data dispersion: analyzed with multiple granularities of precision.

    Boxplot or quantile analysis on sorted intervals.

    Dispersion analysis on computed measures:

    Folding measures into numerical dimensions.

    Boxplot or quantile analysis on the transformed cube.
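
    The dispersion measures listed above (min, quartiles, median, max, outliers, variance) are the ingredients of a boxplot. The sketch below computes them with the Python standard library. The sample values are made up, and the 1.5 x IQR outlier fence is a common convention assumed here, not something the slides specify.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the numbers a boxplot draws."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]


def outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed convention)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical measure values taken from one cell of a cube.
prices = [10, 12, 13, 15, 16, 18, 20, 22, 95]

print(five_number_summary(prices))
print(outliers(prices))
print(statistics.variance(prices))
```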


    Visualization of Data Dispersion: Boxplot Analysis


    Mining Discriminant Rules

    Discrimination: Comparison of two or more classes

    Strategy:

    Collect the relevant data respectively into the target class and the contrasting class

    Generalize both classes to the same high level concepts

    Compare tuples with the same high level descriptions

    Present for every tuple its description and two numbers:

    support - distribution within single class

    comparison - distribution between classes

    Highlight the tuples with strong discriminant features

    Relevance Analysis:

    Find attributes (features) which best distinguish different classes.
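
    The two numbers attached to each generalized tuple can be sketched as follows. The reading used here is an assumption: "support" is taken as the share of the target class covered by a description, and "comparison" as the share of all tuples with that description that fall in the target class. The function name and the sample data are hypothetical.

```python
from collections import Counter


def discriminant_table(target, contrast):
    """Score each generalized description found in the target class.

    `target` and `contrast` are lists of already-generalized tuples.
    """
    t_counts, c_counts = Counter(target), Counter(contrast)
    rows = {}
    for desc, t in t_counts.items():
        c = c_counts.get(desc, 0)
        rows[desc] = {
            "support": t / len(target),   # distribution within the class
            "comparison": t / (t + c),    # distribution between classes
        }
    return rows


# Hypothetical generalized tuples: (major, age range).
graduates = [("Science", "25-30")] * 3 + [("Business", "25-30")]
undergraduates = [("Science", "25-30")] + [("Science", "20-25")] * 3

table = discriminant_table(graduates, undergraduates)
for desc, scores in table.items():
    print(desc, scores)
```

    A description with a comparison value near 1 occurs almost only in the target class, so it is a candidate for highlighting as a strong discriminant feature.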


    Mining Association Rules

    Association rule mining:

    Finding associations or correlations among a set of items or objects in transaction databases, relational databases, and data warehouses.

    Applications:

    Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, etc.

    Examples.

    Rule form: LHS → RHS [support, confidence].

    buys(x, diapers) → buys(x, beers) [0.5%, 60%]

    major(x, CS) ^ takes(x, DB) → grade(x, A) [1%, 75%]
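
    The two bracketed numbers of a rule can be computed directly from transactions: support is the fraction of transactions containing both sides, and confidence is that support divided by the support of the left-hand side. The sketch below uses made-up baskets, so its numbers do not match the percentages on the slide.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, lhs, rhs):
    """Estimate P(rhs | lhs) as support(lhs and rhs) / support(lhs)."""
    lhs_support = support(transactions, lhs)
    if lhs_support == 0:
        return 0.0  # the rule never fires, so report zero confidence
    return support(transactions, set(lhs) | set(rhs)) / lhs_support


# Hypothetical market baskets.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]

rule_support = support(baskets, {"diapers", "beer"})
rule_confidence = confidence(baskets, {"diapers"}, {"beer"})
print(f"diapers -> beer [{rule_support:.0%}, {rule_confidence:.0%}]")
```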


    Mining Different Kinds of Association Rules

    Boolean vs. quantitative associations

    Association on discrete vs. continuous data

    Single-dimensional vs. multi-dimensional associations

    E.g., association on items bought vs. on multiple predicates.

    Single-level vs. multiple-level analysis

    E.g., what brands of beer are associated with what brands of diapers?

    Simple vs. constraint-based

    E.g., small sales (sum < 100) trigger big buys (sum > 1,000)?

    Association vs. correlation analysis.

    Association does not necessarily imply correlation.


    Classification

    Data categorization based on a set of training objects.

    Applications: credit approval, target marketing, medical diagnosis, treatment effectiveness analysis, etc.

    Example: classify a set of diseases and provide the symptoms which describe each class or subclass.

    The classification task: Based on the features present in the class-labeled training data, develop a description or model for each class. It is used for

    classification of future test data,

    better understanding of each class, and

    prediction of certain properties and behaviors.

    Data classification methods:

    Decision-trees (e.g., ID3, C4.5), statistics, neural networks, rough sets, etc.
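
    Decision-tree learners in the ID3 family choose the attribute to split on by information gain, that is, the drop in class entropy after the split. The sketch below shows only that selection step, not a full tree builder. The credit-approval rows and attribute names are invented for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, attribute, label):
    """Entropy of the labels minus the weighted entropy after splitting."""
    base = entropy([r[label] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[label] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


def best_attribute(rows, attributes, label):
    """Attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, label))


# Hypothetical class-labeled training data for credit approval.
training = [
    {"income": "high", "student": "no", "approved": "yes"},
    {"income": "high", "student": "yes", "approved": "yes"},
    {"income": "low", "student": "no", "approved": "no"},
    {"income": "low", "student": "yes", "approved": "no"},
]

print(best_attribute(training, ["income", "student"], "approved"))
```

    C4.5 refines this criterion with a gain ratio, which this sketch does not implement.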


    Major Classification Methods

    Decision tree-based classification:

    Training set vs. test set or cross-validation

    Overfitting problem and tree pruning

    Boosting techniques.

    Bayesian classification:

    Naive Bayesian classification

    Bayesian belief networks

    Boosting techniques (e.g., AdaBoost).

    Neural network approach:

    Multi-layer networks and back-propagation.

    Genetic algorithms:

    Genetic operators and fitness function selection.
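
    As one concrete instance of the methods above, a naive Bayesian classifier for categorical features fits in a few lines. It is a sketch under stated assumptions: Laplace (add-one) smoothing is an added choice not mentioned in the slides, and the symptom data and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(rows, labels):
    """Count classes and per-class feature values."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (class, feature index) -> counts
    values = defaultdict(set)              # feature index -> values seen
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            feature_counts[(y, i)][v] += 1
            values[i].add(v)
    return class_counts, feature_counts, values


def predict(model, row):
    """Return the class with the highest posterior log-probability."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for y, n in class_counts.items():
        score = math.log(n / total)  # class prior
        for i, v in enumerate(row):
            # Laplace smoothing keeps unseen values from zeroing the product.
            seen = feature_counts[(y, i)][v]
            score += math.log((seen + 1) / (n + len(values[i])))
        if score > best_score:
            best, best_score = y, score
    return best


# Hypothetical training data: (fever, cough) -> diagnosis.
symptoms = [("yes", "yes"), ("yes", "no"), ("no", "yes"), ("no", "yes")]
diagnoses = ["flu", "flu", "cold", "cold"]

model = train(symptoms, diagnoses)
print(predict(model, ("yes", "yes")))
print(predict(model, ("no", "yes")))
```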


    Three Categories of Clustering Techniques

    Partitioning-based:

    Basically enumerate various partitions and then score them by some criterion. K-means, K-medoids, etc.

    Hierarchy-based:

    Create a hierarchical decomposition of the set of data (or objects) using some criterion.

    Model-based:

    A model is hypothesized for each of the clusters.

    Find the best fit of the data to the hypothesized models.

    E.g., Bayesian classification (AutoClass), Cobweb.
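
    For the partitioning-based category, k-means is the usual first example: assign every point to its nearest centroid, recompute the centroids, and repeat until the assignment stops changing. The sketch below is a minimal version with one simplifying assumption, namely that the first k points seed the centroids, where practical implementations use randomized or smarter seeding. The sample points are made up.

```python
import math


def kmeans(points, k, iterations=100):
    """Plain k-means; returns (centroids, cluster index per point)."""
    # Assumption: seed with the first k points to keep the run deterministic.
    centroids = [points[i] for i in range(k)]
    assignment = None
    for _ in range(iterations):
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members)
                                     for x in zip(*members))
    return centroids, assignment


# Hypothetical two-dimensional objects forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

centroids, clusters = kmeans(points, 2)
print(clusters)
print(centroids)
```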


    Database Clustering Methods

    CLARANS (Ng & Han'94): An extension to the k-medoid algorithm based on randomized search.

    BIRCH (Zhang et al.'96): CF tree (a balanced tree structure).

    DBSCAN (EKXS'96): connects regions of sufficiently high density into clusters.

    STING (WYM'97): A hierarchical cell structure that stores statistical information.

    CLIQUE (Agrawal et al.'98): Clusters high-dimensional data.
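
    The density-based idea behind DBSCAN can be sketched without any spatial index. A point is a core point when at least min_pts points (itself included) lie within distance eps, and clusters grow by chaining core points together. This is a simplified quadratic-time illustration, not the published implementation, and the parameter values and points are hypothetical.

```python
import math


def dbscan(points, eps, min_pts):
    """Return one cluster id per point; -1 marks noise."""
    UNSEEN, NOISE = None, -1
    labels = [UNSEEN] * len(points)

    def neighbors(i):
        # Brute-force range query; real systems use a spatial index here.
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            reach = neighbors(j)
            if len(reach) >= min_pts:
                queue.extend(reach)  # j is a core point, keep growing
    return labels


# Hypothetical points: two dense groups and one isolated point.
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (50, 50)]

print(dbscan(pts, eps=1.5, min_pts=3))
```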


    Time-Series Data Mining

    Trend and deviation analysis

    Find trend (data evolution regularity) and deviations.

    Regression analysis, visualization techniques.

    Subsequence analysis: similarity search

    Subsequence matching: normalization + matching

    Template specification: shape and macro specification.

    Sequential pattern analysis

    Sequential association rules

    Periodicity analysis

    full periods vs. partial periods, cyclic association
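
    "Normalization + matching" can be illustrated with a brute-force subsequence search. Each window of the series and the query are z-normalized, which removes differences in offset and amplitude, and the window with the smallest Euclidean distance wins. This is a plain sliding-window sketch, not one of the indexed methods from the literature, and the series is made up.

```python
import math
import statistics


def z_normalize(seq):
    """Shift to zero mean and scale to unit standard deviation."""
    mean = statistics.fmean(seq)
    std = statistics.pstdev(seq)
    if std == 0:
        return [0.0 for _ in seq]  # flat window: no shape to compare
    return [(x - mean) / std for x in seq]


def best_match(series, query):
    """Return (offset, distance) of the window most similar to `query`."""
    q = z_normalize(query)
    m = len(query)
    best_offset, best_dist = None, math.inf
    for start in range(len(series) - m + 1):
        window = z_normalize(series[start:start + m])
        d = math.dist(window, q)
        if d < best_dist:
            best_offset, best_dist = start, d
    return best_offset, best_dist


# Hypothetical series containing a scaled copy of the query shape.
series = [5, 5, 5, 10, 20, 30, 20, 10, 5, 5]
query = [1, 2, 3, 2, 1]

offset, distance = best_match(series, query)
print(offset, distance)
```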


    Similarity Search in Data Mining

    Faloutsos et al. (1994):

    Extract features from each window

    Fourier Transform & R*-tree structure.

    Agrawal et al. (1995):

    Amplitude scaling, offset translation

    Distance is determined from the sequence envelopes

    Agrawal et al. (1995):

    SDL pattern language to encode queries about shapes

    Jagadish et al. (1997):

    domain-independent framework


    Periodic Pattern Search in Time-Related Data Sets

    Full cycle analysis:

    Fourier transformation, other statistical analysis methods

    Fragment-wise cyclic behavior analysis:

    Example: Jack reads the NY Times every day at 9:00am.

    Given (natural) periods vs. arbitrary periods.

    A data cube and OLAP-based technique: (Gong and Han'98)

    Cyclic association rules: Associations which form cycles.

    Cyclic Association Rules (B. Özden, S. Ramaswamy, A.
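
    Fragment-wise (partial) periodicity with a given period can be sketched by counting, for each position inside the period, how often an event recurs across cycles. The code below is a simple counting illustration under that assumption. It is not the data cube technique cited on the slide, and the event log, slot layout, and function name are hypothetical.

```python
from collections import Counter


def partial_periodic_patterns(events, period, min_conf):
    """Find (position, event) pairs that recur in most cycles.

    `events[i]` is the set of events observed in time slot i. The result
    maps (slot position within the period, event) to the fraction of
    complete cycles in which that event occurred at that position.
    """
    cycles = len(events) // period
    counts = Counter()
    for slot in range(cycles * period):
        for e in events[slot]:
            counts[(slot % period, e)] += 1
    return {key: n / cycles
            for key, n in counts.items() if n / cycles >= min_conf}


# Hypothetical log: three slots per day (morning, noon, evening), four days.
log = [
    {"read_paper"}, {"lunch"}, set(),
    {"read_paper"}, set(), {"tv"},
    {"read_paper"}, {"lunch"}, set(),
    {"read_paper"}, {"lunch"}, {"tv"},
]

patterns = partial_periodic_patterns(log, period=3, min_conf=0.75)
print(patterns)
```

    Only part of each cycle is regular here (the morning paper and most lunches), which is the difference between partial and full periodicity.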


    Systems for Data Warehousing

    Arbor Software: Essbase

    Oracle: Express/Data-mart Suite.

    Informix: Meta-Cube.

    Cognos: PowerPlay

    Redbrick Systems: Redbrick Warehouse

    Microstrategy: DSS/Server

    Microsoft: PLATO (SQL-Server 7.0) [OLEDB for OLAP]


    Systems for Data Mining

    IBM: Intelligent Miner.

    SAS Institute: Enterprise Miner.

    Silicon Graphics: MineSet.

    Integral Solutions Ltd.: Clementine.

    Information Discovery Inc.: Data Mining Suite.

    DBMiner Technology Inc.: DBMiner

    Rutgers: DataMine, GMD: Explora, Univ. Munich: VisDB


    Major Approaches in Data Mining Systems

    Database-oriented approach: IBM Intelligent Miner.

    OLAM approach: DBMiner.

    Machine learning: AQ15, ID3/C4.5/C5.0, Cobweb.

    Rough sets, fuzzy sets: Datalogic/R, 49er, etc.

    Statistical approaches, e.g., SAS Enterprise Miner.

    Neural network approach: Cognos 4thoughts.


    Conclusions

    Data Mining: A rich, promising, young field with broad applications and many challenging research issues.

    Data mining tasks: characterization, association, classification, clustering, prediction, sequence and pattern analysis, etc.

    Data mining domains: relational, transactional, text, spatial, time-series, multimedia, active DBs, data warehouses, and WWW.

    Data mining methods: Data-intensive, statistics, visualization, information science, and other disciplines.

    Progress: Scalable methods and multi-task systems.

    OLAM: On-line analytical mining holds high promise for the integration of OLAP and mining.


    Future Work

    Theoretical foundations of data mining.

    Implementation and new data mining methodologies:

    A set of well-tuned, standard mining operators.

    Data and knowledge visualization tools.

    Integration of multiple data mining strategies.

    Data mining in advanced information systems:

    Spatial, multimedia, Web-mining

    Data mining applications:

    content browsing, query optimization, multi-resolution model, etc.

    Social issues: A threat to security and privacy.