


Nov 24, 2014



1. Definition
2. Overview
3. History
4. Evolution
5. Scope
6. Stages
7. Process
8. Relationships
9. Elements
10. Data Warehousing vs Data Mining
11. Data Mining Tools
12. Knowledge Discovery in Databases
13. Advantages/Disadvantages

Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.

Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The prospective analysis offered by data mining moves beyond the analyses of past events provided by the retrospective tools typical of decision support systems. Data mining tools can answer business questions that were traditionally too time-consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. Data mining techniques can be implemented rapidly on existing software and hardware platforms to enhance the value of existing information resources, and can be integrated with new products and systems as they are brought online.

Data mining is the evolution of a field with a long history, but the term itself was only introduced relatively recently, in the 1990s. Statistics is the foundation of most technologies on which data mining is built. Its roots can be traced back along three family lines: classical statistics, artificial intelligence, and machine learning. Data mining is finding increasing acceptance in science and business areas that need to analyze large amounts of data to discover trends they could not otherwise find.

Classical statistics embraces concepts such as regression analysis, standard distribution, standard deviation, standard variance, and cluster analysis, all of which are used to study data and data relationships. These are the building blocks on which more advanced statistical analyses are built. Classical statistical analysis plays a significant role at the heart of today's data mining tools and techniques.

Artificial intelligence is built upon heuristics (methods that often lead rapidly to a solution that is usually close to the best possible answer) as opposed to statistics, and attempts to apply human-thought-like processing to statistical problems. Since this approach requires vast computer processing power, it was not practical until the early 1980s, when computers began to offer useful power at reasonable prices. Certain AI concepts were adopted by some high-end commercial products, such as query optimization modules for Relational Database Management Systems (RDBMS).

Machine learning is the union of statistics and artificial intelligence. It is an evolution of artificial intelligence because it blends AI heuristics with advanced statistical analysis. Machine learning attempts to let computer programs learn about the data they study, such that the programs make different decisions based on the qualities of the studied data, using statistics for fundamental concepts and adding more advanced AI heuristics and algorithms to achieve their goals.

Data Collection (1960s)
Business question: "What was my total revenue in the last five years?"
Enabling technologies: Computers, tapes, disks
Product providers: IBM, CDC
Purpose: Retrospective, static data delivery

Data Access (1980s)
Business question: "What were unit sales in New England last March?"
Enabling technologies: Relational databases (RDBMS), Structured Query Language (SQL), ODBC
Product providers: Oracle, Sybase, Informix, IBM, Microsoft
Purpose: Retrospective, dynamic data delivery at record level

Data Warehousing & Decision Support (1990s)
Business question: "What were unit sales in New England last March? Drill down to Boston."
Enabling technologies: On-line analytic processing (OLAP), multidimensional databases, data warehouses
Product providers: Pilot, Comshare, Arbor, Cognos, Microstrategy
Purpose: Retrospective, dynamic data delivery at multiple levels

Data Mining (Emerging Today)
Business question: "What's likely to happen to Boston unit sales next month? Why?"
Enabling technologies: Advanced algorithms, multiprocessor computers, massive databases
Product providers: Pilot, Lockheed, IBM, SGI, numerous startups (nascent industry)
Purpose: Prospective, proactive information delivery

Automated prediction of trends and behaviors. A typical example of a predictive problem is targeted marketing: data mining uses data on past promotional mailings to identify the targets most likely to maximize return on investment in future mailings. Other examples include forecasting bankruptcy and identifying segments of a population likely to respond similarly to given events.
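The targeted-marketing idea above can be sketched in a few lines of Python. This is a minimal illustration, not a production approach: the mailing history, the features (months since last purchase, prior purchase count), and the prospects are all made-up data, and the model is a tiny hand-rolled logistic regression.

```python
# Hypothetical past mailings: (recency_months, prior_purchases) -> responded (1/0).
import math

history = [
    ((1, 5), 1), ((2, 4), 1), ((3, 3), 1),
    ((10, 0), 0), ((12, 1), 0), ((8, 0), 0),
]

def predict(w, x):
    """Probability of a response under a simple logistic model."""
    z = w[0] + w[1] * x[0] + w[2] * x[1]
    return 1.0 / (1.0 + math.exp(-z))

# Fit the weights by plain gradient ascent on the log-likelihood.
w = [0.0, 0.0, 0.0]
for _ in range(2000):
    for x, y in history:
        err = y - predict(w, x)
        w[0] += 0.1 * err
        w[1] += 0.1 * err * x[0]
        w[2] += 0.1 * err * x[1]

# Score new prospects; mail only those predicted likely to respond.
prospects = {"alice": (2, 4), "bob": (11, 0)}
targets = [name for name, x in prospects.items() if predict(w, x) > 0.5]
print(targets)  # recent, frequent buyers come out as the likely responders
```

The point is the workflow, not the model: learn from past mailings, then rank future targets by predicted return.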

Automated discovery of previously unknown patterns. Data mining tools sweep through databases and identify previously hidden patterns in one step. Examples include the analysis of retail sales data to identify seemingly unrelated products that are often purchased together (e.g., beer and diapers), detecting fraudulent credit card transactions, and identifying anomalous data that could represent data-entry keying errors.
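The beer-and-diapers pattern is a market-basket (association) discovery, and its simplest form is just counting item pairs that co-occur in transactions. Below is a toy sketch with invented transactions; real tools use algorithms such as Apriori over far larger data.

```python
# Count item pairs that appear together in the same transaction.
from collections import Counter
from itertools import combinations

transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "diapers", "milk"},
    {"bread", "chips"},
]

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Keep only pairs bought together in at least 3 of the 5 transactions.
min_support = 3
frequent = [pair for pair, count in pair_counts.items() if count >= min_support]
print(frequent)
```

With this data only the beer-diapers pair clears the support threshold, which is exactly the kind of "previously hidden" co-occurrence the text describes.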

Stage 1: Exploration. Data preparation, cleaning, and transformation.
Stage 2: Model building and validation. Considering various models and choosing the best one based on their performance.
Stage 3: Deployment. Applying the model selected in Stage 2 to new data in order to generate predictions or estimates of the expected outcome.
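The three stages can be walked through end to end on a toy numeric dataset. Everything here is illustrative: the records are invented, and the "candidate models" are deliberately simple (a constant predictor versus a linear fit) so the validate-and-choose step stays visible.

```python
# Stage 1: Exploration -- prepare and clean raw records (drop missing values).
raw = [(1, 2.1), (2, 3.9), (3, None), (3, 6.2), (4, 7.8), (5, 10.1)]
data = [(x, y) for x, y in raw if y is not None]
train, holdout = data[:3], data[3:]

def mean_model(rows):
    """Candidate A: always predict the training mean."""
    m = sum(y for _, y in rows) / len(rows)
    return lambda x: m

def slope_model(rows):
    """Candidate B: a line through the origin with the average slope."""
    s = sum(y / x for x, y in rows) / len(rows)
    return lambda x: s * x

def error(model, rows):
    """Mean squared error on held-out rows."""
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)

# Stage 2: Model building and validation -- pick the model with the
# lowest error on data it was not trained on.
candidates = [mean_model(train), slope_model(train)]
best = min(candidates, key=lambda m: error(m, holdout))

# Stage 3: Deployment -- apply the chosen model to new data.
print(round(best(6), 1))
```

The linear candidate wins on the holdout rows, so it is the one deployed against new inputs.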

Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.
Clusters: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.
Associations: Data can be mined to identify associations. The beer-diaper example is an example of associative mining.
Sequential patterns: Data is mined to anticipate behavior patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.
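Of the four elements, clustering is the easiest to show concretely. The sketch below groups hypothetical customer spend figures with a tiny two-center k-means loop; the values and the number of clusters are assumptions for illustration.

```python
# Group 1-D data points (hypothetical customer spend) into two clusters.
spend = [5, 6, 7, 50, 52, 55]
centers = [spend[0], spend[-1]]  # crude initialization: the two extremes

for _ in range(10):
    # Assignment step: each point joins its nearest center.
    groups = {0: [], 1: []}
    for value in spend:
        nearest = min((0, 1), key=lambda i: abs(value - centers[i]))
        groups[nearest].append(value)
    # Update step: each center moves to its group's mean.
    centers = [sum(g) / len(g) for g in groups.values()]

print(sorted(groups[0]), sorted(groups[1]))  # low spenders vs. high spenders
```

The loop converges immediately here because the two segments are well separated; real data usually needs more features, more clusters, and a better initialization.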

1. Extract, transform, and load transaction data onto the data warehouse system.
2. Store and manage the data in a multidimensional database system.
3. Provide data access to business analysts and information technology professionals.
4. Analyze the data with application software.
5. Present the data in a useful format, such as a graph or table.
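The steps above can be sketched at miniature scale with Python's built-in SQLite module standing in for the warehouse. The table name, columns, and records are all illustrative assumptions, and a real warehouse would be multidimensional rather than a single relational table.

```python
import sqlite3

# 1. Extract and transform: parse raw records, normalizing the amount field.
raw = [("2014-01-03", "north", "1,200"),
       ("2014-01-09", "south", "800"),
       ("2014-02-02", "north", "950")]
rows = [(day, region, int(amount.replace(",", ""))) for day, region, amount in raw]

# 2. Load, store, and manage the data (here: an in-memory database).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (day TEXT, region TEXT, amount INTEGER)")
db.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# 3-4. Provide access and analyze: total revenue per region.
totals = dict(db.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"))

# 5. Present the result in a useful format (a simple text table).
for region, total in totals.items():
    print(f"{region:>6} {total:>6}")
```

Each numbered step in the list maps to one stage of the script, from raw strings through a managed store to an analyst-friendly summary.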

Data Warehouse: a repository (or archive) of information gathered from multiple sources, stored under a unified schema, at a single site (Silberschatz). Data is collected and stored in a single repository, which allows for easier query development because only that one repository needs to be queried.

Data Mining: analyzing databases or data warehouses to discover patterns in the data and so gain knowledge. Knowledge is power.

Data mining tools are software components and theories that allow users to extract information from data. The tools provide individuals and companies with the ability to gather large amounts of data and use it to make determinations about a particular user or groups of users.


Data mining tools can be classified into one of three categories: traditional data mining tools, dashboards, and text-mining tools.


1. Traditional Data Mining Tools.

These tools help companies establish data patterns and trends by using a number of complex algorithms and techniques. Some of them are installed on the desktop to monitor the data and highlight trends, while others capture information residing outside a database. The majority are available in both Windows and UNIX versions, although some specialize in one operating system only. While some may concentrate on one database type, most can handle any data using online analytical processing or a similar technology.

2. Dashboards.

Dashboards are installed on computers to monitor information in a database. They reflect data changes and updates onscreen, often in the form of a chart or table, enabling the user to see how the business is performing. Historical data can also be referenced, enabling the user to see where things have changed (e.g., an increase in sales over the same period last year). This functionality makes dashboards easy to use and particularly appealing to managers who wish to have an overview of the company's performance.

3. Text-mining Tools. These tools can mine data from many kinds of text, from Microsoft Word and Acrobat PDF documents to simple text files. They scan content and convert the selected data into a format that is compatible with the tool's database, thus providing users with an easy and convenient way of accessing data without the need to open different applications. Scanned content can be unstructured (i.e., information is scattered almost randomly across the document, including emails, Internet pages, and audio and video data) or structured (i.e., the data's form and purpose is known, such as content found in a database).
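A minimal sketch of that scan-and-convert step, assuming plain text as input: each unstructured document is turned into a structured record of term frequencies that can then be queried in one place. The documents and file names below are invented for illustration.

```python
import re
from collections import Counter

# Hypothetical unstructured documents (real tools would also parse
# Word files, PDFs, emails, and so on).
documents = {
    "memo.txt": "Quarterly sales rose. Sales in Boston rose fastest.",
    "mail.txt": "Customer complaints about shipping fell this quarter.",
}

# Convert each document into a structured record: term -> frequency.
index = {name: Counter(re.findall(r"[a-z]+", text.lower()))
         for name, text in documents.items()}

# The user can now search across all documents from one place,
# without opening each source application.
docs_mentioning_sales = [name for name, terms in index.items() if terms["sales"]]
print(docs_mentioning_sales)
```

The structured index is what makes the "convenient access" claim concrete: one query replaces opening and reading every source document.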

Knowledge Discovery in Databases (KDD), the most prevalent term for the data mining process, was introduced in 1989 by Gregory Piatetsky-Shapiro. Users are able to process raw data, mine the data for information, and interpret the various results as managed information. The data can include financials, client lists, policy and procedure documents, shareholder registers, and even electronic copies of contractual agreements with customers and vendors. With a data mining tool, it is possible to conduct a focused search for the data that is needed, rather than having to pore through all the stored data manually.



In the first KDD step, selection, we select data relevant to some criteria.