Asterio K. Tanaka. DISTRIBUTED DATABASES and DATA WAREHOUSING. http://www.uniriotec.br/~tanaka/tin0036 [email protected] Introduction to Data Mining

Nov 17, 2015

  • Asterio K. Tanaka

    DISTRIBUTED DATABASES and DATA WAREHOUSING

    Asterio K. Tanaka

    http://www.uniriotec.br/~tanaka/tin0036

    [email protected]

    Introduction to Data Mining

  • Asterio K. Tanaka

    Introduction to Data Mining

    Concepts: DM vs. OLAP; DM as part of KDD; general goals of KDD/DM

    Knowledge discovered with DM: association rules; classification hierarchies; sequential patterns; patterns in time series; categorization and segmentation

    DM techniques: techniques for association rules; decision trees; other techniques

    Applications

  • Asterio K. Tanaka

    Business Intelligence

    Increasing potential to support business decisions

    End User

    Business Analyst

    Data Analyst

    DBA

    Making Decisions

    Data Presentation: Visualization Techniques

    Data Mining: Information Discovery

    Data Exploration

    OLAP, MDA

    Statistical Analysis, Querying and Reporting

    Data Warehouses / Data Marts

    Data Sources: Paper, Files, Information Providers, Database Systems, OLTP

  • Asterio K. Tanaka

    Analytical environments

    Query tools: standard queries; firm hypotheses

    OLAP: multidimensional; moderate hypotheses

    Data Mining: modeling/segmentation; few or no hypotheses

  • Asterio K. Tanaka

    Generic Architecture of a Data Warehouse

    Operational databases

    External sources

    DATA SOURCES

    Metadata

    Data Warehouse

    Data Marts

    QUERY TOOLS

    Analysis

    Data Mining

    Reports

    OLAP

    OLAP

    Extraction, Transformation, Load, Refresh

    Chaudhuri & Dayal, SIGMOD Record, 1997

  • Asterio K. Tanaka

    Data Warehousing Process

  • Asterio K. Tanaka

    KDD: Knowledge Discovery in Databases

    Fayyad, Usama; Piatetsky-Shapiro, Gregory; Smyth, Padhraic (1996). The KDD Process for Extracting Useful Knowledge from Volumes of Data. In: Communications of the ACM, pp. 27-34, Nov. 1996.

  • Asterio K. Tanaka

    Definitions of Data Mining

    Discovery of hidden information in a database

    Similar terms: data categorization; exploratory data analysis; data-driven discovery; deductive learning

    It is part of KDD (Knowledge Discovery in Databases)

  • Asterio K. Tanaka

    Data Mining: the process of extracting valid, previously unknown, and maximally comprehensive information from large databases, and using it for decision making.

    It lets users explore the data and infer useful information from it, discovering hidden relationships in the database.

  • Asterio K. Tanaka

    Goals of Data Mining

    Explanatory: explain some observed event or measurement, for example why ice cream sales fell in Rio de Janeiro.

    Confirmatory: confirm a hypothesis. An insurance company, for example, may want to examine its customer records to determine whether two-income families are more likely to buy a health plan than single-income families.

    Exploratory: analyze the data in search of new, unanticipated relationships. A credit card company may analyze its historical records to determine which factors are associated with people who represent a credit risk.

  • Asterio K. Tanaka

    Goals of DM and KDD

    Prediction: Data mining can show how certain attributes within the data will behave in the future. Examples of predictive data mining include the analysis of buying transactions to predict what consumers will buy under certain discounts, how much sales volume a store would generate in a given period, and whether deleting a product line would yield more profits. In such applications, business logic is used coupled with data mining. In a scientific context, certain seismic wave patterns may predict an earthquake with high probability.

    Identification: Data patterns can be used to identify the existence of an item, an event, or an activity. For example, intruders trying to break a system may be identified by the programs executed, files accessed, and CPU time per session. In biological applications, existence of a gene may be identified by certain sequences of nucleotide symbols in the DNA sequence. The area known as authentication is a form of identification. It ascertains whether a user is indeed a specific user or one from an authorized class; it involves a comparison of parameters or images or signals against a database.

    Classification: Data mining can partition the data so that different classes or categories can be identified based on combinations of parameters. For example, customers in a supermarket can be categorized into discount-seeking shoppers, shoppers in a rush, loyal regular shoppers, and infrequent shoppers. This classification may be used in different analyses of customer buying transactions as a post-mining activity. Sometimes classification based on common domain knowledge is used as an input to decompose the mining problem and make it simpler. For instance, health foods, party foods, or school lunch foods are distinct categories in the supermarket business. It makes sense to analyze relationships within and across categories as separate problems. Such categorization may be used to encode the data appropriately before subjecting it to further data mining.

    Optimization: One eventual goal of data mining may be to optimize the use of limited resources such as time, space, money, or materials and to maximize output variables such as sales or profits under a given set of constraints. As such, this goal of data mining resembles the objective function used in operations research problems that deal with optimization under constraints.

  • Asterio K. Tanaka

    Types of knowledge discovered with DM

    1. Association rules: These rules correlate the presence of a set of items with another range of values for another set of variables. Examples: (1) When a female retail shopper buys a handbag, she is likely to buy shoes. (2) An X-ray image containing characteristics a and b is likely to also exhibit characteristic c.

    2. Classification hierarchies: The goal is to work from an existing set of events or transactions to create a hierarchy of classes. Examples: (1) A population may be divided into five ranges of credit worthiness based on a history of previous credit transactions. (2) A model may be developed for the factors that determine the desirability of location of a store on a 1-10 scale. (3) Mutual funds may be classified based on performance data using characteristics such as growth, income, and stability.

    3. Sequential patterns: A sequence of actions or events is sought. Example: If a patient underwent cardiac bypass surgery for blocked arteries and an aneurysm and later developed high blood urea within a year of surgery, he or she is likely to suffer from kidney failure within the next 18 months. Detection of sequential patterns is equivalent to detecting association among events with certain temporal relationships.

    4. Patterns within time series: Similarities can be detected within positions of the time series. Three examples follow with the stock market price data as a time series: (1) Stocks of a utility company ABC Power and a financial company XYZ Securities show the same pattern during 1998 in terms of closing stock price. (2) Two products show the same selling pattern in summer but a different one in winter. (3) A pattern in solar magnetic wind may be used to predict changes in earth atmospheric conditions.

    5. Categorization and segmentation: A given population of events or items can be partitioned (segmented) into sets of "similar" elements. Examples: (1) An entire population of treatment data on a disease may be divided into groups based on the similarity of side effects produced. (2) The adult population in the United States may be categorized into five groups from "most likely to buy" to "least likely to buy" a new product. (3) The web accesses made by a collection of users against a set of documents (say, in a digital library) may be analyzed in terms of the keywords of documents to reveal clusters or categories of users.
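
    To make the first knowledge type concrete, a rule such as "handbag implies shoes" is commonly scored by its support and confidence. These two measures are standard in the association-rule literature but are not defined on this slide, so they, the toy baskets, and the function name `rule_stats` are assumptions of this sketch rather than material from the course.

```python
# Hypothetical shopping baskets; each set is one retail transaction.
BASKETS = [
    {"handbag", "shoes"},
    {"handbag", "shoes", "scarf"},
    {"handbag"},
    {"shoes"},
    {"scarf"},
]


def rule_stats(baskets, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    # Baskets that contain every item of the antecedent.
    n_antecedent = sum(1 for b in baskets if antecedent <= b)
    # Baskets that contain both the antecedent and the consequent.
    n_both = sum(1 for b in baskets if (antecedent | consequent) <= b)
    support = n_both / len(baskets)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


support, confidence = rule_stats(BASKETS, {"handbag"}, {"shoes"})
print(f"support={support:.2f} confidence={confidence:.2f}")
```

    Support is the fraction of all baskets in which the rule holds, and confidence is the fraction of handbag baskets that also contain shoes. A mining algorithm would typically keep only the rules whose values exceed chosen thresholds.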

  • Asterio K. Tanaka

    Data Mining and KDD

    Knowledge Discovery in Databases (KDD): the process of finding useful information in data.

    Data Mining: the use of algorithms to extract this information.

    That is, DM is part of the KDD process.

  • Asterio K. Tanaka

    KDD Process

    Modified from [FPSS96C]

    Selection: obtaining data from various sources.

    Preprocessing: cleaning the data.

    Transformation: converting to a common format.

    Data Mining: obtaining information.

    Interpretation/Evaluation: presenting the results in a useful form.

  • Asterio K. Tanaka

    KDD Example: Web Log

    Selection: select log data (dates and locations)

    Preprocessing: remove logged errors

    Transformation: sort and group

    Data Mining: identify and count patterns

    Interpretation/Evaluation: identify and display frequent access sequences

    Potential uses: cache optimization; personalization
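
    The five steps above can be sketched on a toy web log. This is a minimal illustration, not the method used in the course: the record layout (date, visitor, page, HTTP status), the sample data, and the function name `kdd_web_log` are all assumptions, and the "pattern" counted is simply a pair of consecutive pages requested by the same visitor.

```python
from collections import Counter, defaultdict

# Hypothetical raw log records: (date, visitor, page, HTTP status).
RAW_LOG = [
    ("2015-11-01", "u1", "/home", 200),
    ("2015-11-01", "u1", "/courses", 200),
    ("2015-11-01", "u1", "/missing", 404),
    ("2015-11-01", "u1", "/dm", 200),
    ("2015-11-01", "u2", "/home", 200),
    ("2015-11-01", "u2", "/courses", 200),
    ("2015-11-02", "u3", "/home", 200),
    ("2015-11-02", "u3", "/courses", 200),
    ("2015-11-02", "u3", "/dm", 200),
    ("2015-10-30", "u4", "/home", 200),
]


def kdd_web_log(raw_log, first_day, last_day, min_count=2):
    # Selection: keep only the records inside the chosen date range.
    selected = [r for r in raw_log if first_day <= r[0] <= last_day]
    # Preprocessing: drop logged errors (HTTP status 400 and above).
    clean = [r for r in selected if r[3] < 400]
    # Transformation: sort by date (stable, so same-day order is kept)
    # and group the requested pages per visitor.
    sessions = defaultdict(list)
    for _date, visitor, page, _status in sorted(clean, key=lambda r: r[0]):
        sessions[visitor].append(page)
    # Data mining: count pairs of consecutive pages across all visitors.
    counts = Counter()
    for pages in sessions.values():
        counts.update(zip(pages, pages[1:]))
    # Interpretation/evaluation: keep the frequent pairs, most common first.
    return [(pair, n) for pair, n in counts.most_common() if n >= min_count]


frequent = kdd_web_log(RAW_LOG, "2015-11-01", "2015-11-02")
for (src, dst), n in frequent:
    print(f"{src} -> {dst}: {n}")
```

    A list of frequent page pairs like this is the kind of result that could feed the cache optimization or personalization uses mentioned above.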

  • Asterio K. Tanaka

    Developments in Data Mining

    Similarity measures; imprecise queries; unstructured information; search engines

    Bayes' theorem; K-means clustering; time series analysis

    Neural networks; fuzzy logic; genetic algorithms; rough set theory

    Analysis, design, and synthesis of algorithms; data structures

    Relational model; SQL; data warehousing/OLAP; scalability techniques

  • Asterio K. Tanaka

    Related Concepts

    Databases/OLTP

    Fuzzy sets/logic

    Information science (information retrieval)

    Dimensional modeling/DW/OLAP

    Statistical methods

    Machine learning

    Visualization

    High-performance computing (algorithms/parallelism)

    Other disciplines: neural networks, mathematical modeling, pattern recognition, etc.

  • Asterio K. Tanaka

    DM versus DW and OLAP

    DM provides another, more sophisticated level of analysis than OLAP tools provide.

    DM on DWs benefits from the integration and cleaning already performed on the data.

    However, it does not necessarily have to be done on DWs.

    Data warehousing/OLAP: verification-driven

    Data Mining: driven by unanticipated discoveries

  • Asterio K. Tanaka

    Database vs. Data Mining

    Database

    Queries: well defined; SQL

    Data: operational data

    Output: precise; a subset of the database

    Data Mining

    Queries: loosely defined; no precisely defined query language

    Data: non-operational data

    Output: fuzzy; not a subset of the database

  • Asterio K. Tanaka

    Examples of Queries

    Database

    Find all credit applications with the last name Silva.

    Identify customers who purchased more than R$ 10,000.00 in the last month.

    List the daily sales of milk in the last month.

    Data Mining

    Find all credit applications that represent a risk (classification).

    Identify customers with similar consumption profiles (clustering).

    Find items that are usually purchased together with milk (association rules).
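
    The contrast between the two kinds of query can be sketched with Python's built-in `sqlite3` module. The table `sales`, its columns, and the sample rows are hypothetical. The first query is a well-defined SQL question with an exact answer computed directly from the stored rows. The second asks which items tend to appear in baskets with milk, and the simple co-occurrence count used here is only a crude stand-in for real association-rule mining.

```python
import sqlite3
from collections import Counter

# Hypothetical sales table held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (basket_id INTEGER, day TEXT, item TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        (1, "2015-11-01", "milk"), (1, "2015-11-01", "bread"),
        (2, "2015-11-01", "milk"), (2, "2015-11-01", "bread"),
        (2, "2015-11-01", "eggs"),
        (3, "2015-11-02", "milk"), (3, "2015-11-02", "eggs"),
        (4, "2015-11-02", "bread"),
    ],
)

# Database-style query: daily milk sales, an exact answer.
daily_milk = conn.execute(
    "SELECT day, COUNT(*) FROM sales WHERE item = 'milk' "
    "GROUP BY day ORDER BY day"
).fetchall()

# Mining-style question: which items usually share a basket with milk?
baskets = {}
for basket_id, item in conn.execute("SELECT basket_id, item FROM sales"):
    baskets.setdefault(basket_id, set()).add(item)
with_milk = Counter()
for items in baskets.values():
    if "milk" in items:
        with_milk.update(items - {"milk"})

print(daily_milk)
print(with_milk.most_common())
```

    The SQL result can be checked row by row against the table, whereas the co-occurrence counts only suggest candidate relationships that still need interpretation.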

  • Asterio K. Tanaka

    Goals of Data Mining: Models and Tasks
