Credit Card Analysis of Czech Bank
Goals
• Our goal for this project is to analyze customer and credit-card information, from the Berka dataset, to extrapolate the type of customer who makes a good candidate for a credit-card, and what level of credit to extend to that customer.
• The Berka dataset is from the 1999 PKDD Discovery Challenge.
• The Berka dataset is a collection of financial information from a Czech bank.
Domain Description
Entity-Relationship Description
• Each account has both static characteristics (e.g. date of creation, address of the branch) given in relation "account" and dynamic characteristics (e.g. payments debited or credited, balances) given in relations "permanent order" and "transaction".
• Relation "client" describes characteristics of persons who can operate the accounts.
• Relations "loan" and "credit card" describe some services which the bank offers to its clients.
• Relation "demographic data" gives some publicly available information about the districts (e.g. the unemployment rate); additional information about the clients can be deduced from this.
The dataset contains the following tables:
• Accounts
o Each record describes static characteristics of an account
o Size: 4500 records
• Clients
o Each record describes characteristics of a client
o Size: 5369 records
• Disposition (Disp)
o Each record relates a client with an account and describes the client’s right to operate that account
o Size: 5369 records
Data Preprocessing Activities
1. Converted the ASCII files to:
a. MS Excel and/or MS Word files for cleaning data
b. MS Access database for use in
i. data mining,
ii. de-normalizing or ‘flattening’ files, and
iii. basic querying to learn more about the data.
c. Put all modified files into file types recognized by Weka.
i. These files are comma delimited with a ‘heading’ of attribute definition information.
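The comma-delimited layout with a heading of attribute definitions matches Weka's ARFF format. A minimal sketch of producing such a file is shown here; the relation name, the attribute list, the card-type values, and the two sample rows are illustrative assumptions, not values taken from the project's files.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text: a header of attribute definitions, then comma-delimited data."""
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        # A list of strings declares a nominal attribute; any other value
        # (for example "numeric") is written through unchanged.
        if isinstance(kind, (list, tuple)):
            spec = "{" + ",".join(kind) + "}"
        else:
            spec = kind
        lines.append(f"@attribute {name} {spec}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical attributes and rows, for illustration only.
attributes = [
    ("Client_Sex", ["M", "F"]),
    ("Trans_Avg_Balance", "numeric"),
    ("Card_Type", ["none", "junior", "classic", "gold"]),
]
rows = [("F", 41250.5, "classic"), ("M", 18300.0, "none")]

arff_text = to_arff("czech_bank_clients", attributes, rows)
print(arff_text)
```

Values that contain spaces or commas would need quoting, which this sketch does not handle.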
2. Verified all table relationships:
a. Every account has an Owner via Disp and Account tables
b. Order and Loan records are duplicated in transaction records. That is, the transactions include Order records and Loan payments.
i. Loan records in Trans are identified by k_symbol="LP"
3. De-normalize, or ‘flatten’, files for mining. Our database is relational. In order to mine or cluster attributes, those attributes must be in a single table. We have created a de-normalized table based on our goals.
a. Goal: Analyze credit-card information to extrapolate the type of customer who makes a good candidate for a credit-card.
i. Account-Client-Disp-Card-District-Loan-Transaction Table
• Using information we discovered about accounts from previous clustering, cluster customer information
o Using card type as clustering attribute
o Added "N" (None) as a possible value to the Loan Status attribute
• In order to better understand customers, we looked at this table in two ways:
o To identify, from all customers, which were credit card holders and which were not.
o To examine the variances that exist between all credit card holding customers.
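The flattening step can be sketched as a set of joins that produce one row per client-account pairing. The example below uses an in-memory SQLite database rather than MS Access; the table and column names are simplified assumptions, the sample rows are made up, and the 'none' label for clients without a card is our own placeholder. Only the "N" loan status comes from the text above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account (account_id INTEGER PRIMARY KEY, district_id INTEGER);
CREATE TABLE client  (client_id INTEGER PRIMARY KEY, sex TEXT);
CREATE TABLE disp    (disp_id INTEGER PRIMARY KEY, client_id INTEGER,
                      account_id INTEGER, type TEXT);
CREATE TABLE card    (card_id INTEGER PRIMARY KEY, disp_id INTEGER, type TEXT);
CREATE TABLE loan    (loan_id INTEGER PRIMARY KEY, account_id INTEGER, status TEXT);

-- Made-up sample rows: one client with a card and no loan,
-- one client with a loan and no card.
INSERT INTO account VALUES (1, 60), (2, 1);
INSERT INTO client  VALUES (10, 'F'), (20, 'M');
INSERT INTO disp    VALUES (100, 10, 1, 'OWNER'), (200, 20, 2, 'OWNER');
INSERT INTO card    VALUES (1000, 100, 'classic');
INSERT INTO loan    VALUES (5000, 2, 'A');
""")

# LEFT JOINs keep clients without a card or loan; COALESCE fills the gaps,
# using "N" for a missing loan status as described in the text.
flat = conn.execute("""
SELECT c.client_id, c.sex, a.account_id, a.district_id,
       COALESCE(cd.type, 'none') AS card_type,
       COALESCE(l.status, 'N')   AS loan_status
FROM disp d
JOIN client  c    ON c.client_id  = d.client_id
JOIN account a    ON a.account_id = d.account_id
LEFT JOIN card cd ON cd.disp_id   = d.disp_id
LEFT JOIN loan l  ON l.account_id = a.account_id
ORDER BY c.client_id
""").fetchall()

for row in flat:
    print(row)
```

Transaction and district attributes would be joined in the same way, typically after aggregating transactions per account (for example into an average balance).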
4. Add, change, remove, and discretize attributes as necessary. The tables shown below describe the changes made:
Preprocessing Activities – Account Table: Changes Made With Excel
Column: Frequency
• Description: Frequency of statement issuance
• Changes: Translated values as follows:
o POPLATEK MESICNE changed to MONTHLY ISSUANCE (MI)
o POPLATEK TYDNE changed to WEEKLY ISSUANCE (WI)
o POPLATEK PO OBRATU changed to ISSUANCE AFTER TRANSACTION (TI)
• Missing or Invalid Values: N/A
• Notes: Translated for ease of use
Column: Date
• Description: Date of account creation
• Changes: Removed
• Missing or Invalid Values: Ignore
• Notes: This attribute is not used in our mining effort
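The value translation for the Frequency column amounts to a lookup table. A minimal sketch follows; the function name and the choice to raise an error on unknown values are our assumptions, since the original work was done by hand in Excel.

```python
# Czech source values mapped to the short codes listed in the table.
FREQUENCY_CODES = {
    "POPLATEK MESICNE": "MI",    # monthly issuance
    "POPLATEK TYDNE": "WI",      # weekly issuance
    "POPLATEK PO OBRATU": "TI",  # issuance after transaction
}


def recode_frequency(raw):
    """Translate a raw Frequency value; fail loudly on anything unexpected."""
    key = raw.strip().upper()
    if key not in FREQUENCY_CODES:
        raise ValueError(f"unexpected frequency value: {raw!r}")
    return FREQUENCY_CODES[key]


print(recode_frequency("POPLATEK MESICNE"))
```

Failing on unknown values, rather than passing them through, makes missing or invalid entries visible during cleaning.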
Preprocessing Activities – Client Table: Changes Made With Excel
Column: BirthNumber
• Description: Birthday and gender
• Changes: Removed. This is a 6-digit number.
o The documentation says that its format is as follows: YYMMDD (Men), YYMM50+DD (Women)
o Analysis suggests that the format is as follows: YYMMDD (Men), YY50+MMDD (Women)
o Format changed to MM/DD/YYYY
o Created a new field for Client_Sex
• Missing or Invalid Values: N/A
• Notes: Attribute removed from dataset and replaced by attributes Client_Sex and Client_Age
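Decoding the birth number under the format that the analysis suggests (50 added to the month for women) could look like the sketch below. The function name, the assumption that all birth years fall in the 1900s, and the reference date used to compute an age are ours; the source does not state how Client_Age was derived or where its age-group boundaries lie.

```python
from datetime import date


def decode_birth_number(birth_number, reference=date(1999, 1, 1)):
    """Split a 6-digit birth number into sex, birth date, and age in years.

    Assumes YYMMDD for men and YY(MM+50)DD for women, with years in the 1900s.
    The reference date for the age is a hypothetical choice.
    """
    digits = str(birth_number).zfill(6)
    yy, mm, dd = int(digits[0:2]), int(digits[2:4]), int(digits[4:6])
    sex = "F" if mm > 50 else "M"
    if sex == "F":
        mm -= 50
    born = date(1900 + yy, mm, dd)
    # Subtract one year if the birthday has not yet occurred by the reference date.
    had_birthday = (reference.month, reference.day) >= (born.month, born.day)
    age = reference.year - born.year - (0 if had_birthday else 1)
    return sex, born, age


sex, born, age = decode_birth_number("706213")
print(sex, born.strftime("%m/%d/%Y"), age)
```

Because `date` rejects impossible month or day values, a record that does not fit the assumed format raises an error instead of silently producing a wrong birthday.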
Column: Operation
• Description: Mode of transaction
• Changes: Translated values:
o VYBER KARTOU = Credit Card Withdrawal (CCW)
o VKLAD = Credit in Cash (CRC)
o PREVOD Z UCTU = Collection from Another Bank (CAB)
o VYBER = Withdrawal in Cash (WC)
o PREVOD NA UCET = Remittance to Another Bank (RAB)
Preprocessing Activities – Credit Card Table: Changes Made With Excel
Column: Issued
• Description: Date card issued
• Changes:
o Removed
o Eliminated null time-stamp and changed date format from YYMMDD to MM/DD/YYYY
• Missing or Invalid Values: Ignore
• Notes: For our purposes, date card issued is not material
Methodology
Methodology Overview
1. Attribute Ranking
2. Classification Analysis
3. Clustering Analysis
• Classification is a process where a model is built describing a predetermined set of data classes.
• The model is constructed by analyzing all the records in the database.
• Each tuple is assumed to belong to a predefined class, as determined by one of the attributes called the class label.
• The tuples analyzed to build the model form the training data set.
• Typically, the learned model is expressed in terms of decision trees or classification rules.
• These rules can then be used to predict the class labels of the test data set.
Methodology – Classification Analysis
• We used See5 as the classification tool for our project. A brief description of the algorithm is as follows:
• The tree starts as a single node representing the training samples.
• If the samples are all of the same class, then the node becomes a leaf and is labeled with that class.
• Otherwise, the algorithm uses an entropy-based measure known as Information Gain as a heuristic for selecting the attribute that will best separate the samples into individual classes. This attribute becomes the "test" or "decision" attribute at the node.
• A branch is created for each known value of the test attribute and the samples are partitioned accordingly.
• The algorithm uses the same process recursively to form a decision tree for the samples at each partition.
• The recursive partition stops when all the samples for a given node belong to the same class or if there are no remaining attributes on which samples may be further partitioned.
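The steps above can be sketched as a small recursive tree builder driven by information gain. This is an ID3-style illustration of the description, not See5 itself, which is a commercial tool with additional features such as boosting and pruning. The toy rows and attribute names are made up, and labeling a leaf with the majority class when no attributes remain is our assumption.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Entropy reduction obtained by partitioning the samples on attr."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


def build_tree(rows, labels, attrs):
    # All samples share one class: the node becomes a leaf with that class.
    if len(set(labels)) == 1:
        return labels[0]
    # No attributes left: fall back to the majority class (our assumption).
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Pick the attribute with the highest information gain as the test attribute.
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attrs if a != best]
    node = {"test": best, "branches": {}}
    # One branch per known value of the test attribute; recurse on each partition.
    for value in sorted({row[best] for row in rows}):
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        node["branches"][value] = build_tree(
            [r for r, _ in subset], [l for _, l in subset], remaining
        )
    return node


# Hypothetical toy data: class 1 = card holder, class 0 = non-card holder.
rows = [
    {"balance": "high", "age": "senior"},
    {"balance": "high", "age": "middle"},
    {"balance": "low", "age": "senior"},
    {"balance": "low", "age": "middle"},
]
labels = [1, 1, 0, 0]
tree = build_tree(rows, labels, ["balance", "age"])
print(tree)
```

In this toy set, balance separates the classes perfectly while age carries no information, so the tree tests balance once and stops.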
[Figure: simplified See5 decision tree. The branch Trans_Avg_Balance <= 37097.18 ends in Class 0 (1828.3/93.4). The other nodes test Client_District_ID (1, 10, 13, 15, 38), Trans_Avg_Balance (> 46123.76, > 66899.77), Account_Opened = 1995, and Client_Age = Senior, and end in Class 1 leaves annotated (6.5), (8.5/0.9), (8.3), and (5.3).]
We ran See5 using the RuleSets option and the Boost option with 3 trials. The tree shown is a simplified version of the decision tree generated by See5. To build it, we used the rules obtained that had high confidence (>85%).
See5 can also express the classifiers as rule sets, which are easier to understand. 96 rules were generated, of which we have listed 3. All of these have a confidence of more than 90%.
If (Client_District_ID = 60) and (Trans_Avg_Balance > 46123.76) and (Loan_Status = none), Then Class Prediction is that of a Card Holder.
Training Set Evaluation
• The estimated predictive error = (64 + 390)/3600 = 12.6%
• The percentage of non-card holders that were correctly classified = (2816/2880) = 97.8%
• The percentage of non-card holders that were incorrectly classified = (64/2880) = 2.2%
• The percentage of card holders that were incorrectly classified = (390/720) = 54.2%
• The percentage of card holders that were correctly classified = (330/720) = 45.8%
Test Set Evaluation
• The estimated predictive error on the test set = (59 + 119)/900 = 19.8%
• The percentage of non-card holders that were correctly classified = (669/728) = 91.9%
• The percentage of non-card holders that were incorrectly classified = (59/728) = 8.1%
• The percentage of card holders that were incorrectly classified = (119/172) = 69.2%
• The percentage of card holders that were correctly classified = (53/172) = 30.8%
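The evaluation figures can be recomputed directly from the four counts quoted for each set. The sketch below assumes that the denominators (2880 and 720 for training, 728 and 172 for test) are the actual numbers of non-card holders and card holders; the function and key names are ours.

```python
def evaluate(non_card_ok, non_card_wrong, card_wrong, card_ok):
    """Per-class rates and overall error from the four confusion-matrix counts."""
    non_card_total = non_card_ok + non_card_wrong
    card_total = card_ok + card_wrong
    total = non_card_total + card_total
    return {
        "error": (non_card_wrong + card_wrong) / total,
        "non_card_correct": non_card_ok / non_card_total,
        "non_card_wrong": non_card_wrong / non_card_total,
        "card_wrong": card_wrong / card_total,
        "card_correct": card_ok / card_total,
    }


train = evaluate(2816, 64, 390, 330)
test = evaluate(669, 59, 119, 53)

for name, result in (("train", train), ("test", test)):
    print(name, {key: f"{value:.1%}" for key, value in result.items()})
```

Within each class the correct and incorrect rates sum to 100%, which gives a quick consistency check on any hand-computed percentage.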
• We used cluster analysis to partition the data into a set of classes, grouping together customers or attributes with similar characteristics.
• The purpose of cluster analysis is to place observations into groups or clusters suggested by the data such that observations in a given cluster tend to be similar to each other in some sense, and objects in different clusters tend to be dissimilar.
• The COBWEB Algorithm was chosen for this task. The COBWEB algorithm used was implemented in the Weka Toolkit.
• The distinction of each cluster was not obvious beyond the coupling of age group and card type.
• To further investigate the characteristics of customers, we ran another clustering analysis using the SAS Enterprise Miner product. Its clustering tool implements Ward's minimum-variance method.
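To illustrate the minimum-variance idea behind Ward's method, the sketch below repeatedly merges the two clusters whose union increases the within-cluster sum of squares the least. It is a naive illustration, not the SAS Enterprise Miner implementation; the five sample points (age, average balance in thousands) are made up.

```python
def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b are merged."""
    dist2 = sum((x - y) ** 2 for x, y in zip(a["centroid"], b["centroid"]))
    return a["n"] * b["n"] / (a["n"] + b["n"]) * dist2


def ward_clusters(points, k):
    """Agglomerate points into k clusters; returns sorted lists of point indices."""
    clusters = [{"n": 1, "centroid": tuple(p), "members": [i]}
                for i, p in enumerate(points)]
    while len(clusters) > k:
        # Find the pair of clusters that is cheapest to merge.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: ward_cost(clusters[pair[0]], clusters[pair[1]]),
        )
        a, b = clusters[i], clusters[j]
        n = a["n"] + b["n"]
        merged = {
            "n": n,
            # Size-weighted mean of the two centroids.
            "centroid": tuple((a["n"] * x + b["n"] * y) / n
                              for x, y in zip(a["centroid"], b["centroid"])),
            "members": a["members"] + b["members"],
        }
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return sorted(sorted(c["members"]) for c in clusters)


# Hypothetical customers: (age in years, average balance in thousands).
points = [(22, 30), (24, 35), (45, 70), (47, 72), (66, 40)]
groups = ward_clusters(points, 3)
print(groups)
```

In practice the attributes would be standardized first, since Ward's criterion is based on squared Euclidean distance and is dominated by whichever attribute has the largest scale.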
The following rules represent our findings in analyzing whether or not a customer will/will not have a credit card:
1. If the average account balance is less than or equal to 37097.18, then the client will not be a card holder.
2. If the client is middle-aged and lives in Prague, and his/her average account balance is greater than 66899.77, then that client will be a card holder.
3. If the client lives in Prostejov, and has an average account balance in excess of 46123.76, then that client will be a card holder.
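The three rules can be written as a small rule-based predictor. In the sketch below the dictionary keys and the "middle" age label are hypothetical names, and returning None when no rule fires is our choice, since the three rules do not cover every client.

```python
def predict_card_holder(client):
    """Apply the three findings; True = card holder, False = not, None = no rule fires."""
    balance = client["avg_balance"]
    # Rule 1: a low average balance predicts a non-card holder.
    if balance <= 37097.18:
        return False
    # Rule 2: middle-aged Prague clients with a high average balance.
    if (client["age_group"] == "middle" and client["district"] == "Prague"
            and balance > 66899.77):
        return True
    # Rule 3: Prostejov clients with an average balance above 46123.76.
    if client["district"] == "Prostejov" and balance > 46123.76:
        return True
    return None


# Hypothetical clients, for illustration only.
print(predict_card_holder({"avg_balance": 25000.0, "age_group": "senior", "district": "Brno"}))
print(predict_card_holder({"avg_balance": 70000.0, "age_group": "middle", "district": "Prague"}))
print(predict_card_holder({"avg_balance": 50000.0, "age_group": "young", "district": "Prostejov"}))
```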
The following are characterizations of cardholders within the current population of bank clients:
1. There is no apparent difference between men and women in the issuance of credit cards.
2. Middle-aged single females have a higher percentage of classic and gold cards than the general population.
3. All credit card customers that have taken a loan have good status in repayment.
4. Customers who are in the age bracket less than 24 and whose average balance is under 50000.00 are likely candidates for a JUNIOR Card.
5. Otherwise, it is not easily determined from customer information which card type has been selected. Perhaps further examination of customer behavior is warranted.
Results
• Future continuation of this analysis might include
o identifying customers with a junior or classic credit card to whom the bank can offer a higher-limit card.
• Other Opportunities
o Which accounts are likely to default on loan payments? Why?
o What are the characteristics of a good bank client ("good" is defined here, in the most general of terms, as one who will make payments on their outstanding balances in a timely manner)?
o What are the identifying characteristics of a good bank branch (again, we define "good" only in the most general terms of which branches are most successful in the collection of loan payments)?