
A predictive analysis for real time data through data mining techniques Introduction

Chapter 1

Introduction

The convergence of computing and communication has produced a society that feeds on information.

Yet most of the information is in its raw form: data. If data is characterized as recorded facts, then

information is the set of patterns, or expectations, that underlie the data. There is a huge amount of

information locked up in databases—information that is potentially important but has not yet been

discovered or articulated. Our mission is to bring it forth. Data mining is the extraction of implicit,

previously unknown, and potentially useful information from data. The idea is to build computer

programs that sift through databases automatically, seeking regularities or patterns. Strong patterns, if

found, will likely generalize to make accurate predictions on future data.

Forecasting demand is both a science and an art, and it is an important element of decision-making support. The common theme running through decision-making is “selection and decision”, and forecasting is indispensable for realizing this theme well. This is evident from the fact that forecasting is positioned as a core method of the DSS (Decision Support System) approaches developed to date.

Today, high industrial activity and heavy dependence on electric power in daily life demand an ever larger and more stable supply of electric power. Highly accurate prediction of electric power demand plays a decisive role in meeting these requirements. This first chapter of the report gives a brief overview of the terms that are used repeatedly in data mining.



1.1 OVERVIEW

Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data

from different perspectives and summarizing it into useful information - information that can be used to

increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for

analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it,

and summarize the relationships identified. Technically, data mining is the process of finding

correlations or patterns among dozens of fields in large relational databases.

1.2 Data

Data are any facts, numbers, or text that can be processed by a computer. Today, organizations are

accumulating vast and growing amounts of data in different formats and different databases. This

includes:

Operational or transactional data, such as sales, cost, inventory, payroll, and accounting data.

Non-operational data, such as industry sales, forecast data, and macroeconomic data.

Metadata, that is, data about the data itself, such as logical database design or data dictionary definitions.

1.3 Information

The patterns, associations, or relationships among all this data can provide information. For example,

analysis of retail point of sale transaction data can yield information on which products are selling and

when.



1.4 Knowledge

Information can be converted into knowledge about historical patterns and future trends. For example,

summary information on retail supermarket sales can be analyzed in light of promotional efforts to

provide knowledge of consumer buying behaviour. Thus, a manufacturer or retailer could determine

which items are most susceptible to promotional efforts. There are two main kinds of models in data

mining: predictive and descriptive. Predictive models can be used to forecast explicit values, based on

patterns determined from known results. For example, from a database of customers who have already

responded to a particular offer, a model can be built that predicts which prospects are likeliest to respond

to the same offer. Descriptive models describe patterns in existing data, and are generally used to create

meaningful subgroups such as demographic clusters. Descriptive models, that is, unsupervised learning methods, do not predict a target value but focus on the intrinsic structure, relations, and interconnectedness of the data.

1.5 Predictive modeling

Predictive modeling is used when the goal is to estimate the value of a particular target attribute and

there exist sample training data for which values of that attribute are known. An example is

classification, which takes a set of data already divided into predefined groups and searches for patterns

in the data that differentiate those groups. These discovered patterns then can be used to classify other

data where the right group designation for the target attribute is unknown (though other attributes may

be known). For instance, a manufacturer could develop a predictive model that distinguishes parts that

fail under extreme heat, extreme cold, or other conditions based on their manufacturing environment,

and this model may then be used to determine appropriate applications for each part.



1.6 Descriptive modeling

Descriptive modelling, or clustering, also divides data into groups. With clustering, however, the proper

groups are not known in advance; the patterns discovered by analyzing the data are used to determine

the groups. For example, an advertiser could analyze a general population in order to classify potential

customers into different clusters and then develop separate advertising campaigns targeted to each

group. Fraud detection also makes use of clustering to identify groups of individuals with similar

purchasing patterns.

1.7 How does data mining work?

While large-scale information technology has been evolving separate transaction and analytical

systems, data mining provides the link between the two. Data mining software analyzes relationships

and patterns in stored transaction data based on open-ended user queries. Several types of analytical

software are available: statistical, machine learning, and neural networks. Generally, any of four types of relationships are sought:

Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant

chain could mine customer purchase data to determine when customers visit and what they

typically order. This information could be used to increase traffic by having daily specials.

Clusters: Data items are grouped according to logical relationships or consumer preferences. For

example, data can be mined to identify market segments or consumer affinities.

Associations: Data can be mined to identify associations. The beer-diaper example is a well-known case of association mining.



Sequential patterns: Data is mined to anticipate behaviour patterns and trends. For example, an

outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a

consumer's purchase of sleeping bags and hiking shoes.

Data mining consists of five major elements:

Extract, transform, and load transaction data onto the data warehouse system.

Store and manage the data in a multidimensional database system.

Provide data access to business analysts and information technology professionals.

Analyze the data by application software.

Present the data in a useful format, such as a graph or table.

1.8 Different levels of analysis are available

Artificial neural networks: Non-linear predictive models that learn through training and

resemble biological neural networks in structure.

Genetic algorithms: Optimization techniques that use processes such as genetic combination,

mutation, and natural selection in a design based on the concepts of natural evolution.

Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate

rules for the classification of a dataset.



Nearest neighbor method: A technique that classifies each record in a dataset based on a

combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1).

Sometimes called the k-nearest neighbor technique.

Rule induction: The extraction of useful if-then rules from data based on statistical significance.
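As an illustration of the nearest neighbor method listed above, the following is a minimal Python sketch. It assumes numeric attributes, Euclidean distance as the similarity measure, and a simple majority vote as the way of combining the classes of the k nearest records. The function name and the sample records are hypothetical.

```python
from collections import Counter
from math import dist  # Euclidean distance, available from Python 3.8


def knn_classify(query, history, k=3):
    """Classify `query` by majority vote among its k nearest historical records.

    `history` is a list of (features, label) pairs; `features` and `query`
    are equal-length tuples of numbers.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # Sort the historical records by distance to the query and keep the k closest.
    nearest = sorted(history, key=lambda record: dist(record[0], query))[:k]
    # Combine the classes of the k neighbours by simple majority vote (an assumption).
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical historical dataset: two numeric attributes and a class label.
history = [
    ((1.0, 1.0), "a"),
    ((1.2, 0.9), "a"),
    ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"),
    ((5.1, 4.8), "b"),
]

print(knn_classify((1.1, 1.0), history, k=3))
print(knn_classify((5.0, 4.9), history, k=3))
```

With k = 1 the record simply takes the class of its single most similar neighbour; larger values of k make the result less sensitive to individual noisy records.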

1.9 Predictive models.

Predictive models analyze past performance to assess how likely a customer is to exhibit a specific

behavior in the future in order to improve marketing effectiveness. This category also encompasses

models that seek out subtle data patterns to answer questions about customer performance, such as fraud

detection models.

1.9.1 Predictive Models Methodology.

1. Project Definition: Define the business objectives and desired outcomes for the project and translate

them into predictive analytic objectives and tasks.

2. Exploration: Analyze source data to determine the most appropriate data and model building

approach, and scope the effort.

3. Data Preparation: Select, extract, and transform data upon which to create models.

4. Model Building: Create, test, and validate models, and evaluate whether they will meet project

metrics and goals.



5. Deployment: Apply model results to business decisions or processes. This ranges from sharing

insights with business users to embedding models into applications to automate decisions and

business processes.

1.10 Barriers to Usage.

A host of barriers can prevent organizations from venturing into the domain of predictive analytics or

impede their growth. This “analytics bottleneck” arises from:

1. Complexity. Developing sophisticated models has traditionally been a slow, iterative, and labour-intensive process.

2. Data. Most corporate data is full of errors and inconsistencies, but most predictive models require clean, scrubbed, expertly formatted data to work.

3. Processing Expense. Complex analytical queries and scoring processes can clog networks and bog

down database performance, especially when performed on the desktop.

4. Expertise. Qualified business analysts who can create sophisticated models are hard to find,

expensive to pay, and difficult to retain.

5. Interoperability. The process of creating and deploying predictive models traditionally involves

accessing or moving data and models among multiple machines, operating platforms, and

applications, which requires interoperable software.

6. Pricing. The price of most predictive analytic software and the hardware to run it on is beyond

the reach of most midsize organizations or departments in large organizations.



1.11 Applications of predictive analysis.

Although predictive analytics can be put to use in many applications, we outline a few examples

where predictive analytics has shown positive impact in recent years.

i) Analytical customer relationship management (CRM)

ii) Clinical decision support systems

Clinical Decision Support systems link health observations with health knowledge to influence health

choices by clinicians for improved health care.

iii) Cross-sell

For an organization that offers multiple products, an analysis of existing customer behaviour can lead

to efficient cross sell of products. Predictive analytics can help analyze customers’ spending, usage

and other behaviour, and help cross-sell the right product at the right time.

iv) Customer retention

By a frequent examination of a customer’s past service usage, service performance, spending and

other behaviour patterns, predictive models can determine the likelihood of a customer wanting to

terminate service sometime in the near future.

v) Direct marketing

Apart from identifying prospects, predictive analytics can also help to identify the most effective

combination of product versions, marketing material, communication channels and timing that should



be used to target a given consumer. The goal of predictive analytics is typically to lower the cost per

order or cost per action.

vi) Fraud detection

Fraud is a big problem for many businesses and can be of various types. Inaccurate credit

applications, fraudulent transactions, identity thefts and false insurance claims are some examples of

this problem. These problems plague firms all across the spectrum and some examples of likely

victims are credit card issuers, insurance companies, retail merchants, manufacturers, business to

business suppliers and even services providers. This is an area where a predictive model is often used

to help weed out the “bads” and reduce a business's exposure to fraud.


Chapter 2

Objectives

1. Upgrade an e-science infrastructure to support collaborative, data mining enabled

experimental research.

2. Develop a knowledge-driven data mining assistant to support researchers in data-

intensive, knowledge-rich domains.

3. Design and implement mechanisms for meta-mining the knowledge discovery process.

4. Demonstrate e-LICO on a systems biology approach to disease studies


Chapter 3

Company Profile

3.1 About Veena Industries Ltd.

Veena Industries Ltd., the flagship company of the Pune-based Agarwal Group, was established in 1980 by two technocrat entrepreneurs, Mr. Shailendra N. Agarwal and Mr. Brijendra N. Agarwal. For over 30 years, Veena Industries has pioneered the manufacture of silent gensets and engineered fabricated components such as canopies for gensets and industrial equipment, under-carriages, and structural elements for off-highway vehicles. Veena Industries Limited (VIL) is a closely held, family-owned public limited company, and has traditionally been an owner-driven, professionally managed organisation. Veena Industries commissioned its first plant at MIDC, Bhosari, Pune, India to manufacture trailers for Mahindra Owel and base-frames & under-carriages for Atlas Copco India Ltd.

3.2 Introduction

Under Veena Industries Ltd., the Agarwal Group started a new software solutions company on 12th July 2012, headquartered in Chakan, Pune. They deliver software solutions that best meet the requirements of a global clientele, and their software supports clients in achieving their business goals in the most cost-effective way. They can do this because they invest quality time in understanding, analyzing, and sometimes even discovering the client requirements at an early stage. Their ability to choose suitable in-house resources for each project, their capacity to scale their resources and domain knowledge to match the project needs, and above all their knowledge of the industry translate a customer requirement into a successfully delivered software solution.


Veena Industries Ltd. offers the following services:

1. Offshore software development

2. Platform/data migration services

3. Digitization and conversion of documents

4. Product development

3.3 Why Veena Industries Ltd. Consulting?

The company solves critical business problems using analytics, maximizing predictive power while ensuring actionable use by addressing operational, legal, and data issues. It has deep expertise in working with data from multiple sources (e.g. demographic, application, master file, credit bureau) and at multiple levels of data aggregation (e.g. transaction, account, customer, portfolio, enterprise).

3.4 MISSION & VISION

Their mission is to provide customers with innovative solutions and progressive technology that help to improve customers’ working methods and profitability.


3.5 COMPANY HISTORY

2012 Started Solutions Company on 12th July 2012.

2011 Integrated Heavy Fabrication plant set up at Chakan.

2010 Consolidation of manufacturing facilities from Silvasa and Santoshnagar to Chakan.

2009 Mecc Alte plant was set up at Sanaswadi, Pune and has been operational since 25th January 2009.

2007 Flair Technologies Limited, UK was incorporated as a 100% subsidiary of Veena Industries Ltd. Flair Technologies is the sales, marketing & distribution arm operating on the business principle of JIT deliveries to our global customers.

Veena Industries Private Limited became a public limited company.

2006 Veena Industries Private Limited formed a Joint Venture with Mecc Alte, UK for manufacture of Alternator assemblies.

2002 Branch I and Branch II were ISO 9001:2000 certified by DNV in June 2002.

1980 Veena Industries commissioned its first plant at MIDC, Bhosari, Pune, India to manufacture trailers for Mahindra Owel and base-frames & under-carriages for Atlas Copco India Ltd.



3.6 Branches of Veena Industries Ltd.

Branch 1 - Bhosari, Pune

Branch 2 - Chakan, Pune

Branch 3 - Waki, Pune

Branch 4 - Samba, Jammu

Branch 5 - (EOU)-Chakan, Pune

Branch 6 - Pithampur, Indore

Mecc Alte India Limited. Sanaswadi, Pune (JV)

3.7 Organisational Chart of the company

Board of directors

Shailendra N. Agarwal, Managing Director

Brijendra N. Agarwal, Chairman

Atin Agarwal, Director

Avinash Agarwal, Director

3.8 Management team & Organization of the company

Mr. Raman Tandon - GM Operations Division

Mr. Milind Vyas - Head Procurement and IT

Mr. S. Srinivas - GM Business Development



Various Products

1. Proclaim

2. Makros

3. Universal Desk

4. Acuity

Veena Industries Ltd. provides powerful custom analytics and analytic consulting that are considered the “Gold Standard” in the industries they serve.

What Veena Industries Ltd. aspires to be!

1) To have the leading edge in the use of Software technology.


Chapter 4

PROJECT WORK UNDERTAKEN

4.1 WHY THIS PROJECT?

The study that I conducted gave me an excellent opportunity to apply in practice all that I have learnt in my classroom sessions. I chose this topic because it helps me to learn more about data mining.

This project will help the organization determine to what extent the activities of its IT division have been successful in delivering software solutions. Data mining is the process of finding correlations or patterns among dozens of fields in large relational databases, and it is very helpful for uncovering information that is hidden in the data stored in the system. This project will also help the company to understand what customers look for in the software service portal.

My project will help the company know more about its strengths, prospects, etc.

4.2 SCOPE OF THE PROJECT

1. The project covers the techniques of data mining that are most important for collecting useful data.

2. Predict cross-sell opportunities and make recommendations.

3. Predict what each individual accessing a Web site is most likely interested in seeing.

4.3 HOW IT IS USEFUL FOR THE COMPANY?

Data mining is useful for identifying the need for software solutions.

Identify your best prospects and then retain them as customers.

Learn parameters influencing trends in sales and margins.

4.4 IMPORTANT ASPECTS OF THE PROJECT

As a student of MBA, I found that the project provided me an excellent opportunity to apply in practice all that I have learnt in my classroom sessions. It helped me to learn more about data mining, data mailing, software, and websites, and also about software development.

The headquarters of the company was established in Pune some months earlier. At the beginning of this year the management realized the importance of software solutions and the need for software companies, and the company recruited a new team and interns for this purpose. I was recruited as an intern to work on the data mining process.

4.5 PLACE OF THE PROJECT

Gat No 309 Plot No C/3, Nanekarwadi, Chakan, Pune – 410501.


Chapter 5

RESEARCH METHODOLOGY

This chapter presents a methodology known as association analysis, which is useful for discovering interesting relationships hidden in large datasets. The uncovered relationships can be presented in the form of association rules or sets of frequent item sets.

5.1 INTRODUCTION

In data mining, association rule learning is a popular and well researched method for discovering

interesting relations between variables in large databases. Piatetsky-Shapiro describes analyzing and

presenting strong rules discovered in databases using different measures of interestingness.

Association rule mining finds interesting associations and/or correlation relationships among large sets of

data items. Association rules show attribute value conditions that occur frequently together in a given

dataset. A typical and widely-used example of association rule mining is Market Basket Analysis.

For example, data are collected using bar-code scanners in supermarkets. Such market basket databases

consist of a large number of transaction records. Each record lists all items bought by a customer on a

single purchase transaction. Managers would be interested to know if certain groups of items are

consistently purchased together. They could use this data for adjusting store layouts (placing items

optimally with respect to each other), for cross-selling, for promotions, for catalog design and to identify

customer segments based on buying patterns. Association rules provide information of this type in the

form of "if-then" statements. These rules are computed from the data and, unlike the if-then rules of

logic, association rules are probabilistic in nature.



5.2 Statistical terms

In addition to the antecedent (the "if" part) and the consequent (the "then" part), an association rule has

two numbers that express the degree of uncertainty about the rule. In association analysis the antecedent

and consequent are sets of items (called item sets) that are disjoint (do not have any items in common).

The first number is called the support for the rule. The support is simply the number of transactions that

include all items in the antecedent and consequent parts of the rule. The support is sometimes expressed

as a percentage of the total number of records in the database.

The other number is known as the confidence of the rule. Confidence is the ratio of the number of

transactions that include all items in the consequent as well as the antecedent (namely, the support) to

the number of transactions that include all items in the antecedent.

Lift is one more parameter of interest in association analysis. Lift is the ratio of confidence to expected confidence. For a rule of the form "if A and B then C", expected confidence is the confidence the rule would have if buying A and B did not enhance the probability of buying C. It is the number of transactions that include the consequent divided by the total number of transactions.
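The three measures defined above can be computed directly from transaction counts. The following is a minimal Python sketch under stated assumptions: transactions are represented as sets of item names, support is expressed as a fraction of all transactions, and the function names and the five sample baskets are hypothetical.

```python
def support_count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def rule_metrics(antecedent, consequent, transactions):
    """Support, confidence and lift of the rule antecedent ==> consequent."""
    n = len(transactions)
    both = support_count(antecedent | consequent, transactions)
    ante = support_count(antecedent, transactions)
    cons = support_count(consequent, transactions)
    if ante == 0 or cons == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    confidence = both / ante
    expected_confidence = cons / n  # share of all transactions containing the consequent
    return {
        "support": both / n,
        "confidence": confidence,
        "lift": confidence / expected_confidence,
    }


# Hypothetical market baskets, one set of items per transaction.
baskets = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "biscuits", "milk"}),
    frozenset({"biscuits", "milk"}),
    frozenset({"bread", "biscuits"}),
    frozenset({"bread", "biscuits", "milk"}),
]

metrics = rule_metrics(frozenset({"bread", "biscuits"}), frozenset({"milk"}), baskets)
print(metrics)
```

In this toy data the rule holds in 2 of the 3 baskets that contain the antecedent, while milk appears in 4 of 5 baskets overall, so the lift is below 1: buying bread and biscuits does not raise the chance of buying milk.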

Association rules are usually required to satisfy a user-specified minimum support and a user-specified

minimum confidence at the same time. Association rule generation is usually split up into two separate

steps:

1. First, minimum support is applied to find all frequent item sets in a database.

2. Second, these frequent item sets and the minimum confidence constraint are used to form rules.

While the second step is straightforward, the first step needs more attention.



Finding all frequent item sets in a database is difficult since it involves searching all possible item sets (item combinations). The set of possible item sets is the power set over I, the set of all items, and has size 2^n − 1 (excluding the empty set, which is not a valid item set). Although the size of the power set grows exponentially in the number of items n in I, efficient search is possible using the downward-closure property of support (also called anti-monotonicity), which guarantees that all subsets of a frequent item set are also frequent, and thus that all supersets of an infrequent item set must be infrequent. Exploiting this property, efficient algorithms (e.g., Apriori and Eclat) can find all frequent item sets.

5.3 ASSOCIATION RULE

An association rule is an implication expression of the form X ⇒ Y, where X and Y are disjoint item sets, i.e., X ∩ Y = Ø.

Support(X ⇒ Y) = Support(X ∪ Y)

Confidence(X ⇒ Y) = Support(X ∪ Y) / Support(X)

A common strategy adopted by many association rule mining algorithms is to decompose the problem

into two major subtasks:

1. Frequent item set generation, whose objective is to find all the item sets that satisfy the minimum support threshold. These item sets are called frequent item sets.

2. Rule Generation, whose objective is to extract all the high-confidence rules from the frequent

item sets found in the previous step. These rules are called strong rules.

The computational requirements for frequent item set generation are generally more expensive than

those of rule generation.



5.4 APRIORI PRINCIPLE

Apriori is Christian Borgelt’s implementation of the well-known Apriori association rule algorithm. Apriori takes transactional data in the form of one row for each pair of transaction and item identifiers. It first generates frequent item sets and then creates association rules from these item sets. It can generate both association rules and frequent item sets.

If an item set is frequent, then all of its subsets must also be frequent. To illustrate the idea behind the Apriori principle, suppose {c, d, e} is a frequent item set. Clearly, any transaction that contains {c, d, e} must also contain its subsets, {c, d}, {c, e}, {d, e}, {c}, {d}, and {e}. As a result, if {c, d, e} is frequent, then all subsets of {c, d, e} must also be frequent.

Apriori pseudo code:-
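The level-wise idea can be sketched in Python as follows. This is a minimal illustration and not Borgelt’s implementation: it assumes transactions are given as sets of item names and that minimum support is a fraction of all transactions, it recounts support by scanning the whole transaction list for every candidate, and the function name and sample baskets are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {item set: support count} for every frequent item set."""
    min_count = min_support * len(transactions)

    # Level 1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)

    k = 2
    while current:
        previous = set(current)
        # Join step: merge frequent (k-1)-item sets into candidate k-item sets.
        candidates = {a | b for a in previous for b in previous if len(a | b) == k}
        # Prune step (Apriori principle): drop any candidate that has an
        # infrequent (k-1)-subset, since it cannot be frequent itself.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in previous for s in combinations(c, k - 1))
        }
        # Count the surviving candidates with one pass over the data each.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: n for s, n in counts.items() if n >= min_count}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical market baskets.
baskets = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "biscuits", "milk"}),
    frozenset({"biscuits", "milk"}),
    frozenset({"bread", "biscuits"}),
    frozenset({"bread", "biscuits", "milk"}),
]

for itemset, count in sorted(apriori(baskets, 0.5).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With a minimum support of 50% all three single items and all three pairs are frequent, while the three-item set occurs in only 2 of 5 baskets and is rejected at the third level.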



The Apriori heuristic achieves good performance by (possibly significantly) reducing the size of candidate sets. However, in situations with a large number of frequent patterns, long patterns, or quite low minimum support thresholds, an Apriori-like algorithm may suffer from the following two nontrivial costs:

– It is costly to handle a huge number of candidate sets. For example, if there are 10^4 frequent 1-item sets, the Apriori algorithm will need to generate more than 10^7 length-2 candidates and accumulate and test their occurrence frequencies. Moreover, to discover a frequent pattern of size 100, such as {a1, . . . , a100}, it must generate 2^100 − 2 ≈ 10^30 candidates in total. This is the inherent cost of candidate generation, no matter what implementation technique is applied.

– It is tedious to repeatedly scan the database and check a large set of candidates by pattern matching, which is especially true for mining long patterns.
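The orders of magnitude quoted above can be checked with a few lines of Python. This is only an arithmetic sketch; the function and variable names are hypothetical, and the length-2 figure assumes that every pair of frequent 1-item sets becomes a candidate.

```python
from math import comb


def candidate_counts(frequent_items=10**4, pattern_length=100):
    """Candidate counts for the two cases discussed in the text."""
    # Every unordered pair of frequent 1-item sets is a length-2 candidate.
    length2_candidates = comb(frequent_items, 2)
    # Non-empty proper subsets of a pattern with `pattern_length` items.
    subsets_of_long_pattern = 2**pattern_length - 2
    return length2_candidates, subsets_of_long_pattern


pairs, subsets = candidate_counts()
print(pairs)              # 49995000, i.e. more than 10^7
print(f"{subsets:.3e}")   # about 1.268e+30
```

The pair count is roughly 5 × 10^7, and the subset count is roughly 1.27 × 10^30, which is consistent with the figures given in the text.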

5.4.1 FP-GROWTH.

FP-growth is an algorithm for generating frequent item sets for association rules from Jiawei Han’s

research group at Simon Fraser University. It generates all frequent item sets satisfying a given

minimum support by growing a frequent pattern tree structure that stores compressed information about

the frequent patterns. In this way, FP-growth can avoid repeated database scans and also avoid the

generation of a large number of candidate item sets.



There are several advantages of FP-growth over other approaches:

i) It constructs a highly compact FP-tree, which is usually substantially smaller than the original

database and thus saves the costly database scans in the subsequent mining processes.

ii) It applies a pattern growth method which avoids costly candidate generation and testing by successively concatenating the frequent 1-item sets found in the (conditional) FP-trees. This ensures that it never generates any combinations of new candidate sets which are not in the database, because the item set in any transaction is always encoded in the corresponding path of the FP-trees.
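To make the compactness of the FP-tree concrete, the following Python sketch builds the tree only; the mining of conditional FP-trees and the header table of node links are left out. It assumes transactions are sets of item names and that the minimum support is given as an absolute count. Ties in item frequency are broken alphabetically, which is a choice made here for reproducibility, and the class, function, and sample data names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class FPNode:
    item: Optional[str]            # None for the root
    count: int = 0
    children: Dict[str, "FPNode"] = field(default_factory=dict)


def build_fp_tree(transactions, min_count):
    """Build an FP-tree using two passes over the transactions."""
    # Scan 1: count each item and keep only the frequent ones.
    freq = Counter(item for t in transactions for item in set(t))
    keep = {item: c for item, c in freq.items() if c >= min_count}

    root = FPNode(item=None)
    # Scan 2: insert each transaction with its items ordered by descending
    # frequency, so that transactions sharing frequent items share a prefix path.
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in keep), key=lambda i: (-keep[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item=item)
                node.children[item] = child
            child.count += 1
            node = child
    return root


def count_nodes(node):
    """Number of item nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.children.values())


# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "biscuits", "milk"},
    {"biscuits", "milk"},
    {"bread", "biscuits"},
    {"bread", "biscuits", "milk"},
]

tree = build_fp_tree(baskets, min_count=3)
print(count_nodes(tree))  # 6 nodes encode the 12 item occurrences in the baskets
```

Because four of the five baskets share the prefix "biscuits", their items are stored along common paths with counters instead of being repeated once per transaction.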



5.4.2 OBJECTIVE

Develop a recommender system with the help of supermarket transactional data.

5.4.3 Data Set Information

Data is created in a supermarket by recording daily transactions of the sold items. It was taken from a supermarket located in New Zealand about 10 years ago.

5.4.4 Attribute Information

The dataset contains 217 attributes and 4627 instances. The attributes are the names of the items in the store. For each transaction, a record of 217 attributes is generated in which the sold items are recorded as ‘True’. Names are not given to all attributes.

5.4.5 Data Pre-processing

The dataset is in ARFF file format, which suits most mining tools, so there is very little need for pre-processing the dataset. To convert it into CSV file format we have to remove the header declarations (@relation, @attribute) and the @data tag present in the ARFF format.
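The conversion can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the ARFF file is in the dense format with one comma-separated row per transaction, values contain no quoted commas, and the attribute names from the header are kept as the CSV header row. The function name is hypothetical and the two-attribute sample merely imitates the structure of the supermarket file.

```python
import csv
import io
import re

# Attribute name: single-quoted, double-quoted, or a bare token.
ATTRIBUTE_RE = re.compile(r"""@attribute\s+('[^']*'|"[^"]*"|\S+)""", re.IGNORECASE)


def arff_to_csv(arff_text):
    """Convert dense ARFF text to CSV text with a header row."""
    names, rows, in_data = [], [], False
    for raw in arff_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):   # skip blank lines and comments
            continue
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
        elif line.lower().startswith("@attribute"):
            match = ATTRIBUTE_RE.match(line)    # assumes a well-formed declaration
            names.append(match.group(1).strip("'\""))
        elif line.lower().startswith("@data"):
            in_data = True                      # everything after this is data
        # @relation and any other header line is dropped
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(names)
    writer.writerows(rows)
    return out.getvalue()


sample = """@relation supermarket
@attribute 'bread and cake' { t}
@attribute biscuits { t}
@data
t,?
?,t
"""

print(arff_to_csv(sample))
```

Missing values, written as ? in ARFF, are passed through unchanged; they correspond to items that were not bought in that transaction.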


Fig. 4.1 ARFF file of Supermarket Dataset

Fig 4.2 Supermarket Data view in Table Format

As we can see, the data is highly sparse and ‘t’ represents the binary value true. For the mining process, besides the input data, a minimum support threshold is needed. Deciding which value the support threshold should be set to is one of the key issues. The right answer can be found only through user interaction and many iterations, until appropriate values have been found. Because user interaction is needed in this phase of the mining process, it is advisable to execute the frequent pattern discovery algorithm iteratively on a relatively small part of the whole dataset only. If the size of the sample is chosen well, the response time of the application remains small while the sample still represents the whole data accurately.

Setting the minimum support threshold parameter is not a trivial task, and it requires a lot of practice and attention on the part of the user. The frequent web access patterns are written in a text file along with the sessions in which they are accessed and the day on which they are accessed, in the correct order.

Here are some of the rules generated by Apriori with a minimum support of 10% and a minimum confidence of 90%.

Output

The output of the Weka association analysis on the supermarket dataset is given below.



Fig. 4.3 Weka output for Apriori algorithm on supermarket dataset

As we can see in Fig. 4.3, the algorithm has generated the 10 best rules found according to the given constraints. Here are the rules.

1. biscuits=t frozen foods=t pet foods=t milk-cream=t vegetables=t 516 ==> bread and cake=t 475 conf:(0.92)

2. baking needs=t biscuits=t milk-cream=t margarine=t fruit=t vegetables=t 505 ==> bread and cake=t 464 conf:(0.92)

3. biscuits=t frozen foods=t milk-cream=t margarine=t vegetables=t 585 ==> bread and cake=t 537 conf:(0.92)

4. biscuits=t canned vegetables=t frozen foods=t fruit=t vegetables=t 536 ==> bread and cake=t 492 conf:(0.92)

5. baking needs=t frozen foods=t milk-cream=t margarine=t fruit=t vegetables=t 517 ==> bread and cake=t 474 conf:(0.92)

6. biscuits=t frozen foods=t pet foods=t milk-cream=t fruit=t 511 ==> bread and cake=t 468 conf:(0.92)

7. biscuits=t frozen foods=t tissues-paper prd=t milk-cream=t vegetables=t 575 ==> bread and cake=t 526 conf:(0.91)

8. biscuits=t frozen foods=t beef=t fruit=t vegetables=t 536 ==> bread and cake=t 490 conf:(0.91)

9. baking needs=t biscuits=t frozen foods=t cheese=t fruit=t 538 ==> bread and cake=t 491 conf:(0.91)

10. biscuits=t frozen foods=t milk-cream=t margarine=t fruit=t 592 ==> bread and cake=t 540 conf:(0.91)

5.4.6 Output Interpretation

Interpreting the rules is also important, as it may reveal a new opportunity for a store owner to
increase sales. The first rule, for example, shows that 92% of the customers who bought biscuits,
frozen foods, pet foods, milk-cream and vegetables also bought bread and cake.

The store owner can use this information to boost sales by placing these items together on a shelf, or
by placing them at opposite ends of the shelf so that customers have to walk through the aisle past the
other products. Other options include discounting a slow-selling product when it is bought with the
above group, thereby increasing turnover and reducing the time the product spends on the shelf.
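The confidence values reported by Weka can be checked by hand: the confidence of a rule is the number of transactions that contain both sides of the rule divided by the number that contain its left-hand side. The short Python sketch below recomputes it for rules 1, 7 and 10 from the counts listed above. The function name `confidence` is an illustrative choice and not part of Weka.

```python
def confidence(antecedent_count: int, rule_count: int) -> float:
    """Confidence of a rule A ==> B: count(A and B) / count(A)."""
    if antecedent_count <= 0:
        raise ValueError("antecedent_count must be positive")
    return rule_count / antecedent_count

# (left-hand side count, whole-rule count) copied from rules 1, 7 and 10
rules = {1: (516, 475), 7: (575, 526), 10: (592, 540)}
rounded = {n: round(confidence(a, r), 2) for n, (a, r) in rules.items()}
print(rounded)
```

Rounded to two decimals, the results agree with the `conf` values printed next to the rules.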

5.5 FREQUENT ITEM SET GENERATION USING FP-GROWTH

The FP-Growth operator in Rapid Miner calculates all frequent item sets from a data set by building an
FP-tree data structure on the transaction database. The FP-tree is a highly compressed copy of the data
which in many cases fits into main memory even for large databases. All frequent item sets are derived
from this FP-tree.

A major advantage of FP-Growth compared to Apriori is that it uses only two data scans and is therefore
often applicable even to large data sets. The given data set may contain only binominal attributes,
i.e. nominal attributes with only two different values.
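The two data scans can be illustrated with a small Python sketch. The first scan counts item frequencies. The second scan inserts each transaction into a prefix tree, with infrequent items dropped and the rest sorted by descending frequency, so that shared prefixes are stored only once. The sketch covers only the tree-building half of FP-growth, on made-up toy data, and is written from the general description of the algorithm. It is not Rapid Miner's implementation, and the names `FPNode`, `build_fp_tree` and `count_nodes` are hypothetical.

```python
from collections import Counter

class FPNode:
    """One node of the FP-tree: an item, its count, and child nodes."""
    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_count):
    # Scan 1: count how many transactions contain each item.
    freq = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in freq.items() if c >= min_count}

    # Scan 2: insert each transaction as a path of frequent items,
    # ordered by descending frequency (ties broken alphabetically).
    root = FPNode()
    for t in transactions:
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            node = node.children.setdefault(item, FPNode(item))
            node.count += 1
    return root, frequent

def count_nodes(node):
    """Number of item nodes in the tree (the root is not counted)."""
    return sum(1 + count_nodes(child) for child in node.children.values())

# Toy transactions (made up for illustration)
transactions = [
    ["bread", "milk", "fruit"],
    ["bread", "milk"],
    ["bread", "fruit"],
    ["milk", "eggs"],
]
root, frequent = build_fp_tree(transactions, min_count=2)
print(frequent)
print(count_nodes(root))
```

In this toy example the eight occurrences of frequent items are stored in five tree nodes, because transactions that start with the same items share a path. This sharing is what keeps the tree compact.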

5.5.1 Process

The process for frequent item set generation in Rapid Miner is given below. The process reads its input
using the ‘Read ARFF’ operator. The input must be in binominal form; if it is not, it has to be
converted using the ‘Nominal2Binominal’ operator.


Fig.4.4 Frequent item set generation process in Rapid Miner

5.5.2 Output

The output of FP-Growth is given below.

Fig.4.5 Frequent item set given by FPGrowth.

5.5.3 Output Interpretation

The most frequent item sets are listed in the output together with their support. In the first line of
the output, item1, item2 and item3 are bread and cake, fruit, and vegetables; the support for this item
set is 38.7%, i.e. it occurs in 38.7% of all transactions.


Using this output, the store owner can arrange these items together on a shelf or at different places in
the store, or offer discounts on these item sets so that overall sales increase. Items with low support
and a long shelf life can be bundled with a high-support item at an attractive discount, which should
increase sales and reduce the time the products spend on the shelf.

Many rules can be generated from this output and applied in real scenarios, thereby boosting the overall
profit of the business.


Chapter 6

ANALYSIS AND INTERPRETATION OF DATA

This chapter describes classification trees and how they are used to generate prediction rules for
real-world scenarios. Classification tree analysis is one of the main techniques used in data mining.

6.1 INTRODUCTION

Classification trees are used to predict the membership of cases or objects in the classes of a
categorical dependent variable from their measurements on one or more predictor variables. The goal of
classification trees is to predict or explain responses on a categorical dependent variable, and as such
the available techniques have much in common with those used in the more traditional methods.

The flexibility of classification trees makes them a very attractive analysis option. Classification
trees lend themselves readily to graphical display, which makes them easier to interpret. Classification
trees can be, and sometimes are, quite complex; however, graphical procedures can be developed to
simplify interpretation even for complex trees.

6.2 OBJECTIVE

Develop a classification system using the census bureau dataset taken from www.census.gov. The
prediction task is to determine whether a person makes over or below 50K a year.



6.3 Data Set Information

This data was extracted from the census bureau database found at www.census.gov. It consists of various
items of information about people.

Fig. 5.1 Adult Dataset information

6.4 Attribute Information

The dataset contains 15 attributes describing the characteristics of a person. It contains 16000
instances; each instance represents a person, identified by a number, with values for the 15 attributes.
Of the 15 features in the original Adult data set, six are continuous and nine are categorical.

Fig 5.2 Adult Dataset in table format.



6.4 Data Pre-processing

The data set obtained was in simple text format, with the values of each instance separated by commas.
To run classification on this data, it first has to be converted into the required form: the first step
is to clean the dataset by removing unnecessary information, and the second is to convert the data into
the format needed by the mining algorithms.
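The two steps can be sketched in Python as follows. The report does not state which cleaning rules were applied, so the sketch assumes that missing values are marked with `?` and that cleaning means trimming whitespace and dropping incomplete rows. It then writes the cleaned rows as ARFF, the format read by the ‘Read ARFF’ operator. The function names, the three attributes and the sample rows are made up to keep the example short.

```python
import csv
import io

def clean_rows(text):
    """Parse comma-separated text, trim whitespace, drop incomplete rows."""
    rows = []
    for raw in csv.reader(io.StringIO(text)):
        values = [v.strip() for v in raw]
        if not values or any(v in ("", "?") for v in values):
            continue  # assumption: '?' or an empty field marks a missing value
        rows.append(values)
    return rows

def to_arff(relation, attributes, rows):
    """Render rows as ARFF text; attributes is a list of (name, type) pairs."""
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    lines += [",".join(r) for r in rows]
    return "\n".join(lines) + "\n"

# Made-up sample in the comma-separated layout described above
raw = """39, Never-married, <=50K
50, ?, <=50K
52, Married-civ-spouse, >50K
"""
attributes = [
    ("age", "numeric"),
    ("marital-status", "{Never-married,Married-civ-spouse}"),
    ("class", "{<=50K,>50K}"),
]
rows = clean_rows(raw)
arff_text = to_arff("adult", attributes, rows)
print(arff_text)
```

The second sample row is dropped because it contains a missing value; the remaining rows appear unchanged under the `@data` line.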

6.5 Data Analysis

The ‘Decision tree’ operator in Rapid Miner learns decision trees from both nominal and numerical data.
Decision trees are powerful classification methods which are often also easy to understand. To classify
an example, the tree is traversed top-down. Every non-leaf node in a decision tree is labeled with an
attribute. The example's value for this attribute determines which of the outgoing edges is taken. For
nominal attributes there is one outgoing edge per possible attribute value, and for numerical attributes
the outgoing edges are labeled with disjoint ranges. This decision tree learner works similarly to
Quinlan's C4.5 or CART.

C4.5 Algorithm.

C4.5 builds decision trees from a set of training data using the concept of information entropy. At each

node of the tree, C4.5 chooses one attribute of the data that most effectively splits its set of samples into

subsets enriched in one class or the other. Its criterion is the normalized information gain (difference in

entropy) that results from choosing an attribute for splitting the data. The attribute with the highest

normalized information gain is chosen to make the decision. The C4.5 algorithm then recurs on the

smaller sub lists.



In pseudo-code, the general algorithm for building decision trees is:

1. Check for base cases

2. For each attribute a

   2.1 Find the normalized information gain from splitting on a

3. Let a_best be the attribute with the highest normalized information gain

4. Create a decision node that splits on a_best

5. Recur on the sublists obtained by splitting on a_best, and add those nodes as children of node
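Steps 2 and 3 can be made concrete with a short Python sketch. It assumes the usual definitions: entropy is the negative sum of p * log2(p) over the class proportions, information gain is the entropy before the split minus the size-weighted entropy after it, and the normalized gain (gain ratio) is the information gain divided by the split information, i.e. the entropy of the partition sizes. The toy records and the function names are illustrative and are not taken from the Adult dataset or from Rapid Miner.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total)
                for c in Counter(labels).values())

def gain_ratio(rows, attribute, target):
    """Information gain of splitting on `attribute`, divided by split info."""
    labels = [r[target] for r in rows]
    # Group the class labels by the value of the splitting attribute.
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    gain = entropy(labels) - weighted
    # Split information: entropy of the attribute values themselves.
    split_info = entropy([r[attribute] for r in rows])
    return gain / split_info if split_info > 0 else 0.0

# Toy records (made up for illustration)
rows = [
    {"marital": "married", "sex": "m", "income": ">50K"},
    {"marital": "married", "sex": "f", "income": ">50K"},
    {"marital": "single",  "sex": "m", "income": "<=50K"},
    {"marital": "single",  "sex": "f", "income": "<=50K"},
]
scores = {a: gain_ratio(rows, a, "income") for a in ("marital", "sex")}
a_best = max(scores, key=scores.get)
print(scores, a_best)
```

In the toy data `marital` separates the two classes perfectly while `sex` carries no information about the class, so `marital` is selected as a_best.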

6.6 Process

Fig.5.3 Classification process using decision tree in Rapid Miner

Output

The output contains two things to analyze:

1. Decision tree

2. Confusion matrix

6.7 Decision tree

Fig. 5.4 Decision tree for adult dataset

From the tree in Fig 5.4, obtained through Rapid Miner, we can derive rules that can be applied to the
given problem. The rules can be written in text form, and the number of rules equals the number of
leaves in the tree. As the output tree has nine leaves, we can derive nine rules. The pair of numbers in
parentheses after each rule gives the number of instances of class <=50K and of class >50K,
respectively, that reach the leaf. The rules are as follows:

1. if capital-gain > 7,073.500 then >50K (13 / 673)

2. if capital-gain ≤ 7,073.500 and marital-status = Divorced then <=50K (1958 / 168)

3. if capital-gain ≤ 7,073.500 and marital-status = Married-AF-spouse then <=50K (8 / 4)

4. if capital-gain ≤ 7,073.500 and marital-status = Married-civ-spouse then <=50K (4077 / 2743)

5. if capital-gain ≤ 7,073.500 and marital-status = Married-spouse-absent then <=50K (193 / 10)

6. if capital-gain ≤ 7,073.500 and marital-status = Never-married then <=50K (5007 / 182)

7. if capital-gain ≤ 7,073.500 and marital-status = Separated and education-num > 13.500 then >50K (9 / 12)

8. if capital-gain ≤ 7,073.500 and marital-status = Separated and education-num ≤ 13.500 then <=50K (452 / 18)

9. if capital-gain ≤ 7,073.500 and marital-status = Widowed then <=50K (448 / 25)
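Since seven of the nine leaves predict <=50K, the rule set reduces to two conditions that lead to >50K. The Python sketch below encodes the nine rules listed above as a function. The function name and the dictionary layout of a record are illustrative choices; the thresholds are copied from the rules.

```python
def predict_income(person):
    """Apply the nine leaf rules derived from the decision tree."""
    if person["capital-gain"] > 7073.5:
        return ">50K"                          # rule 1
    if (person["marital-status"] == "Separated"
            and person["education-num"] > 13.5):
        return ">50K"                          # rule 7
    return "<=50K"                             # rules 2-6, 8 and 9

# Hypothetical record, not taken from the dataset
example = {"capital-gain": 0, "marital-status": "Never-married",
           "education-num": 13}
print(predict_income(example))
```

For the hypothetical record above, rule 6 applies and the function returns <=50K.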

6.8 Confusion matrix

A confusion matrix gives the number or proportion of examples from one class that are classified into
another (or the same) class. It shows which types of examples were misclassified and in what way. One
benefit of a confusion matrix is that it makes it easy to see whether the system is confusing two
classes (i.e. commonly mislabeling one as another).

Fig 5.5 Confusion Matrix for Adult Dataset

As the confusion matrix in Fig 5.5 shows, 12143 instances are predicted as class ‘<=50K’ and do belong
to that class, while 3150 instances are predicted as class ‘<=50K’ but actually belong to class ‘>50K’.

In the second row, 22 instances that belong to class ‘<=50K’ are wrongly predicted as class ‘>50K’, and
685 instances are correctly classified as class ‘>50K’.

6.9 Accuracy of the model

Using the above model, 12828 of the 16000 instances are predicted correctly, so the overall accuracy of
the classification is 80.18%. Per class, 99.82% of the instances that actually belong to class ‘<=50K’
are classified correctly, but only 17.86% of those that belong to class ‘>50K’.
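These figures follow directly from the four cells of the confusion matrix. The Python sketch below recomputes them. The variable and function names are illustrative, and the per-class accuracy is computed as the share of each actual class that was predicted correctly (often called recall).

```python
# Cells of the confusion matrix, keyed by (predicted class, actual class)
matrix = {
    ("<=50K", "<=50K"): 12143,
    ("<=50K", ">50K"): 3150,
    (">50K", "<=50K"): 22,
    (">50K", ">50K"): 685,
}

total = sum(matrix.values())
correct = sum(n for (pred, actual), n in matrix.items() if pred == actual)
accuracy = correct / total

def class_accuracy(label):
    """Share of instances whose actual class is `label` predicted correctly."""
    actual_total = sum(n for (_, actual), n in matrix.items() if actual == label)
    return matrix[(label, label)] / actual_total

print(f"correct: {correct} of {total}")
print(f"overall: {accuracy:.3%}")
print(f"<=50K:   {class_accuracy('<=50K'):.3%}")
print(f">50K:    {class_accuracy('>50K'):.3%}")
```

The large gap between the two per-class values shows that the overall accuracy is driven mainly by the majority class ‘<=50K’.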

The confusion matrix can be visualized as a scatter plot of actual class against predicted class.


Fig. 5.6 Scatter plot for classification result

6.9.1 Output interpretation

Carrying out the classification on the Adult dataset using the decision tree algorithm leads to the
following conclusions.

Decision trees are simple to understand and interpret: people are able to understand decision tree
models after a brief explanation.

The rules can be applied to design new marketing campaigns and promotional schemes and to find valuable
customers, which helps a business categorize its customers and give each class of customers the type of
service it needs.


Chapter 7

FINDINGS & SUGGESTIONS

During this project I carried out different data mining techniques such as classification, association
and time series analysis.

The classification technique used for prediction is simple to understand and interpret: people are able
to understand decision tree models after a brief explanation. The rules generated by classification can
be applied to design new marketing campaigns and promotional schemes and to find valuable customers,
which helps a business categorize its customers and give each class of customers the type of service it
needs.

Association analysis is very good at finding hidden relationships in transactional data and can be used
in different sectors, for example in market basket analysis in a retail store. It gives new insights
into the data which can be used for the benefit of the business.

The time series analysis predicted the short-term load on the electric feeder using the simple moving
average and the weighted moving average. Both techniques gave promising results despite the
inconsistencies in the data, and both predict the short-term load effectively with nearly the same
accuracy.

In summary, we believe that the proposed techniques are adaptable to short-term forecasting of feeder
load and will give good results if used in managing the generation of electricity and the load on the
feeders.

In future, more advanced forecasting methods can be applied to predict load more accurately. With the
onset of inflation and rapidly rising energy prices, the emergence of alternative fuels and technologies
(in energy supply and end-use), changes in lifestyles, institutional changes etc., it has become
imperative to use modeling techniques which capture the effect of factors such as prices, income,
population, technology and other economic, demographic, policy and technological variables.


