Distributed Databases and Data Warehousing
Asterio K. Tanaka
http://www.uniriotec.br/~tanaka/tin0036
[email protected]
Introduction to Data Mining

Introduction to Data Mining
  Concepts
    DM vs. OLAP
    DM as part of KDD
    General goals of KDD/DM
  Knowledge discovered with DM
    Association rules
    Classification hierarchies
    Sequential patterns
    Patterns in time series
    Categorization and segmentation
  DM techniques
    Techniques for association rules
    Decision trees
    Other techniques
  Applications

Business Intelligence
[Figure: layered pyramid; the potential to support business decisions increases toward the top]
  Layers, top to bottom:
    Making Decisions
    Data Presentation: Visualization Techniques
    Data Mining: Information Discovery
    Data Exploration
    OLAP, MDA
    Statistical Analysis, Querying and Reporting
    Data Warehouses / Data Marts
    Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
  Roles, top to bottom: End User, Business Analyst, Data Analyst, DBA

Analytical environments
  Standard queries: firm hypotheses; query tools
  Multidimensional: moderate hypotheses; OLAP
  Modeling/Segmentation: few or no hypotheses; Data Mining

Generic Architecture of a Data Warehouse
[Figure: architecture diagram with the components below]
  Data sources: operational databases, external sources
  Extraction, transformation, load, refresh
  Metadata, Data Warehouse, Data Marts
  OLAP
  Query tools: analysis, data mining, reports
Source: Chaudhuri & Dayal, SIGMOD Record, 1997

Data Warehousing Process

KDD: Knowledge Discovery in Databases
Fayyad, Usama; Piatetsky-Shapiro, Gregory; Smyth, Padhraic (1996). The KDD Process for Extracting Useful Knowledge from Volumes of Data. In: Communications of the ACM, pp. 27-34, Nov. 1996.

Definitions of Data Mining
  Discovery of hidden information in a database
  Similar terms
    Data categorization
    Exploratory data analysis
    Data-driven discovery
    Deductive learning
  It is part of KDD (Knowledge Discovery in Databases)

Data Mining
  The process of extracting valid, previously unknown, and maximally comprehensive information from large databases and using it for decision making.
  It lets users explore the data and infer useful information from it, discovering hidden relationships in the database.

Goals of Data Mining
  Explanatory: explain some observed event or measurement.
    For example, why ice cream sales fell in Rio de Janeiro.
  Confirmatory: confirm a hypothesis.
    An insurance company, for example, may want to examine its customer records to determine whether two-income families are more likely to buy a health plan than single-income families.
  Exploratory: analyze the data in search of new, unanticipated relationships.
    A credit card company can analyze its historical records to determine which factors are associated with people who represent a credit risk.
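The confirmatory goal can be sketched as a statistical test on the insurer example. This is a minimal sketch, assuming a one-sided two-proportion z-test is an acceptable way to check the hypothesis; the slide does not prescribe a test, the function name two_proportion_z is hypothetical, and the counts are made up.

```python
import math

def two_proportion_z(buyers_a, total_a, buyers_b, total_b):
    """Return (z, one_sided_p) for H1: proportion in group A > proportion in group B."""
    p_a = buyers_a / total_a
    p_b = buyers_b / total_b
    # Pooled proportion under the null hypothesis of equal proportions.
    pooled = (buyers_a + buyers_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / se
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Made-up counts: health-plan buyers among two-income (A) and single-income (B) families.
z, p = two_proportion_z(buyers_a=240, total_a=400, buyers_b=180, total_b=400)
print(f"z = {z:.2f}, one-sided p = {p:.6f}")
```

A small p-value would support the hypothesis that two-income families buy the plan more often; an explanatory or exploratory analysis would start without such a fixed hypothesis.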

Goals of DM and KDD
  Prediction: Data mining can show how certain attributes within the data will behave in the future. Examples of predictive data mining include the analysis of buying transactions to predict what consumers will buy under certain discounts, how much sales volume a store would generate in a given period, and whether deleting a product line would yield more profits. In such applications, business logic is used coupled with data mining. In a scientific context, certain seismic wave patterns may predict an earthquake with high probability.
  Identification: Data patterns can be used to identify the existence of an item, an event, or an activity. For example, intruders trying to break a system may be identified by the programs executed, files accessed, and CPU time per session. In biological applications, existence of a gene may be identified by certain sequences of nucleotide symbols in the DNA sequence. The area known as authentication is a form of identification. It ascertains whether a user is indeed a specific user or one from an authorized class; it involves a comparison of parameters or images or signals against a database.
  Classification: Data mining can partition the data so that different classes or categories can be identified based on combinations of parameters. For example, customers in a supermarket can be categorized into discount-seeking shoppers, shoppers in a rush, loyal regular shoppers, and infrequent shoppers. This classification may be used in different analyses of customer buying transactions as a post-mining activity. Sometimes classification based on common domain knowledge is used as an input to decompose the mining problem and make it simpler. For instance, health foods, party foods, or school lunch foods are distinct categories in the supermarket business. It makes sense to analyze relationships within and across categories as separate problems. Such categorization may be used to encode the data appropriately before subjecting it to further data mining.
  Optimization: One eventual goal of data mining may be to optimize the use of limited resources such as time, space, money, or materials and to maximize output variables such as sales or profits under a given set of constraints. As such, this goal of data mining resembles the objective function used in operations research problems that deal with optimization under constraints.

Types of knowledge discovered with DM
  1. Association rules: These rules correlate the presence of a set of items with another range of values for another set of variables. Examples: (1) When a female retail shopper buys a handbag, she is likely to buy shoes. (2) An X-ray image containing characteristics a and b is likely to also exhibit characteristic c.
  2. Classification hierarchies: The goal is to work from an existing set of events or transactions to create a hierarchy of classes. Examples: (1) A population may be divided into five ranges of credit worthiness based on a history of previous credit transactions. (2) A model may be developed for the factors that determine the desirability of location of a store on a 1-10 scale. (3) Mutual funds may be classified based on performance data using characteristics such as growth, income, and stability.
  3. Sequential patterns: A sequence of actions or events is sought. Example: If a patient underwent cardiac bypass surgery for blocked arteries and an aneurysm and later developed high blood urea within a year of surgery, he or she is likely to suffer from kidney failure within the next 18 months. Detection of sequential patterns is equivalent to detecting association among events with certain temporal relationships.
  4. Patterns within time series: Similarities can be detected within positions of the time series. Three examples follow with the stock market price data as a time series: (1) Stocks of a utility company ABC Power and a financial company XYZ Securities show the same pattern during 1998 in terms of closing stock price. (2) Two products show the same selling pattern in summer but a different one in winter. (3) A pattern in solar magnetic wind may be used to predict changes in earth atmospheric conditions.
  5. Categorization and segmentation: A given population of events or items can be partitioned (segmented) into sets of "similar" elements. Examples: (1) An entire population of treatment data on a disease may be divided into groups based on the similarity of side effects produced. (2) The adult population in the United States may be categorized into five groups from "most likely to buy" to "least likely to buy" a new product. (3) The web accesses made by a collection of users against a set of documents (say, in a digital library) may be analyzed in terms of the keywords of documents to reveal clusters or categories of users.
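The handbag and shoes rule from item 1 is usually quantified with two measures, support and confidence. The sketch below computes both; the baskets are made up, and the helper names support and confidence are illustrative, not taken from the slides.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) as support(both) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Made-up shopping baskets for illustration.
baskets = [
    {"handbag", "shoes"},
    {"handbag", "shoes", "scarf"},
    {"handbag"},
    {"shoes"},
    {"scarf"},
]
print(support(baskets, {"handbag", "shoes"}))       # 2 of the 5 baskets
print(confidence(baskets, {"handbag"}, {"shoes"}))  # 2 of the 3 handbag baskets
```

A rule such as "handbag implies shoes" is typically reported only when both measures exceed thresholds chosen by the analyst.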

Data Mining and KDD
  Knowledge Discovery in Databases (KDD): the process of finding useful information in data.
  Data Mining: the use of algorithms to extract that information.
  That is, DM is part of the KDD process.

KDD Process
Modified from [FPSS96C]
  Selection: obtaining data from various sources.
  Preprocessing: cleaning the data.
  Transformation: converting to a common format.
  Data Mining: obtaining information.
  Interpretation/Evaluation: presenting the results in a useful form.
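The five steps can be read as a pipeline in which each stage consumes the output of the previous one. The sketch below is only an illustration under stated assumptions: the two sources and their fields are made up, the function names are hypothetical, and the mining step is a trivial per-state total standing in for a real mining algorithm.

```python
from collections import Counter

# Two hypothetical sources with different formats (made-up data).
source_a = [("ana", "SP", "120.50"), ("bia", "RJ", ""), ("caio", "rj", "80")]
source_b = [{"name": "Davi", "state": "RJ", "amount": 200.0}]

def select(a, b):
    """Selection: gather records from several sources."""
    return [{"name": n, "state": s, "amount": v} for n, s, v in a] + list(b)

def preprocess(records):
    """Preprocessing: drop records with a missing amount."""
    return [r for r in records if r["amount"] not in ("", None)]

def transform(records):
    """Transformation: convert every record to a common format."""
    return [
        {"name": r["name"].title(), "state": r["state"].upper(), "amount": float(r["amount"])}
        for r in records
    ]

def mine(records):
    """Data mining (here only a trivial summary): total amount per state."""
    totals = Counter()
    for r in records:
        totals[r["state"]] += r["amount"]
    return totals

def interpret(totals):
    """Interpretation/evaluation: present the results in a readable form."""
    return [f"{state}: {total:.2f}" for state, total in totals.most_common()]

report = interpret(mine(transform(preprocess(select(source_a, source_b)))))
print("\n".join(report))
```

Keeping each step as a separate function mirrors the slide: any stage can be replaced (for example, a real clustering algorithm in mine) without touching the others.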

KDD Example: Web Log
  Selection: select log data (dates and locations)
  Preprocessing: remove logged errors
  Transformation: sort and group
  Data Mining: identify and count patterns
  Interpretation/Evaluation: identify and display frequent access sequences
  Potential uses: cache optimization, personalization
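The web log steps can be sketched on a toy log. Everything here is an assumption made for illustration: the record layout (timestamp, visitor, page, HTTP status), the data, and the choice of consecutive page pairs as the pattern being counted.

```python
from collections import Counter, defaultdict

# Made-up log records: (timestamp, visitor, page, HTTP status).
log = [
    (1, "u1", "/home", 200), (2, "u1", "/products", 200), (3, "u1", "/cart", 200),
    (1, "u2", "/home", 200), (2, "u2", "/missing", 404), (3, "u2", "/products", 200),
    (4, "u2", "/cart", 200), (1, "u3", "/home", 200), (2, "u3", "/products", 200),
]

# Selection is assumed done: `log` already holds the chosen dates and locations.
# Preprocessing: remove logged errors (status 400 and above).
clean = [r for r in log if r[3] < 400]

# Transformation: sort by time and group the pages by visitor.
sessions = defaultdict(list)
for ts, visitor, page, _ in sorted(clean):
    sessions[visitor].append(page)

# Data mining: count consecutive page pairs, at most once per visitor.
pair_counts = Counter()
for pages in sessions.values():
    pair_counts.update(set(zip(pages, pages[1:])))

# Interpretation/evaluation: show pairs seen for at least `min_visitors` visitors.
min_visitors = 2
frequent = sorted(p for p, n in pair_counts.items() if n >= min_visitors)
for first, second in frequent:
    print(f"{first} -> {second}: {pair_counts[(first, second)]} visitors")
```

Frequent pairs like these are the kind of result that could feed the cache optimization or personalization uses listed on the slide.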

Development of Data Mining
[Figure: areas that contributed to data mining, grouped as below]
  Similarity measures; imprecise queries; unstructured information; search engines
  Bayes' theorem; K-means clustering; time series analysis
  Neural networks; fuzzy logic; genetic algorithms; rough set theory
  Analysis, design, and synthesis of algorithms; data structures
  Relational model; SQL; data warehousing/OLAP; scalability techniques

Related Concepts
  Databases/OLTP
  Fuzzy sets/logic
  Information science (information retrieval)
  Dimensional modeling/DW/OLAP
  Statistical methods
  Machine learning
  Visualization
  High-performance computing (algorithms/parallelism)
  Other disciplines: neural networks, mathematical modeling, pattern recognition, etc.

DM versus DW and OLAP
  DM provides a further, more sophisticated level of analysis than OLAP tools provide.
  DM on DWs benefits from the integration and cleaning already done on the data,
  but it does not necessarily have to be done on DWs.
  Data warehousing/OLAP: verification-driven
  Data Mining: driven by unanticipated discoveries

Database vs. Data Mining
  Database
    Query: well defined; SQL
    Data: operational data
    Output: precise; a subset of the database
  Data Mining
    Query: poorly defined; no precisely defined query language
    Data: non-operational data
    Output: fuzzy; not a subset of the database

Query Examples
  Database
    Find all credit applications with the last name Silva.
    Identify customers who purchased more than R$ 10,000.00 in the last month.
    List the daily sales of milk in the last month.
  Data Mining
    Find all credit applications that represent a risk (classification).
    Identify customers with similar consumption profiles (clustering).
    Find items that are usually bought together with milk (association rules).
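The contrast between the two kinds of query can be sketched with Python's sqlite3 module. The table sale and its rows are made up for illustration. The first query is a database-style lookup whose answer is a subset of the stored rows; the second approximates the milk question with a co-occurrence share, which is a derived ranking, and deciding what share counts as "usually" is left to the analyst. A real association-rule miner would use a dedicated algorithm rather than one SQL statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sale (basket_id INTEGER, item TEXT);
    INSERT INTO sale VALUES
        (1, 'milk'), (1, 'bread'),
        (2, 'milk'), (2, 'bread'), (2, 'coffee'),
        (3, 'milk'), (3, 'coffee'),
        (4, 'bread');
""")

# Database-style query: well defined, and the answer is a subset of the stored rows.
milk_baskets = [row[0] for row in conn.execute(
    "SELECT basket_id FROM sale WHERE item = 'milk' ORDER BY basket_id")]

# Mining-style question: which items tend to appear in baskets that contain milk?
# The answer is a derived ranking, not a subset of the stored rows.
with_milk = conn.execute("""
    SELECT other.item, COUNT(*) * 1.0 / ? AS share
    FROM sale AS milk JOIN sale AS other ON other.basket_id = milk.basket_id
    WHERE milk.item = 'milk' AND other.item <> 'milk'
    GROUP BY other.item
    ORDER BY share DESC, other.item
""", (len(milk_baskets),)).fetchall()

print(milk_baskets)
print(with_milk)
```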

Goals of Data Mining: Models and Tasks