Forecasting a Student's Education Fulfillment using Regression Analysis

FORECASTING A STUDENT’S EDUCATION

FULFILLMENT USING REGRESSION ANALYSIS

Submitted by

RAM G ATHREYA

Roll No.: 1202FOSS0019 Reg. No.: 75812200021

A PROJECT REPORT

Submitted to the

FACULTY OF SCIENCE AND HUMANITIES

in partial fulfillment for the requirement of award of the degree of

MASTER OF SCIENCE IN

FREE / OPEN SOURCE SOFTWARE (CS-FOSS)

CENTRE FOR DISTANCE EDUCATION ANNA UNIVERSITY CHENNAI 600 025

AUGUST 2014

CENTRE FOR DISTANCE EDUCATION

ANNA UNIVERSITY

CHENNAI 600 025

BONA FIDE CERTIFICATE

Certified that this Project report titled “FORECASTING A STUDENT’S

EDUCATION USING REGRESSION ANALYSIS” is the bona fide work of Mr. RAM G

ATHREYA, who carried out the research under my supervision. I certify further, that

to the best of my knowledge the work reported herein does not form part of any

other Project report or dissertation on the basis of which a degree or award was

conferred on an earlier occasion on this or any other candidate.

RAM G ATHREYA Dr. SRINIVASAN SUNDARARAJAN

Student at Anna University Professor

CERTIFICATE OF VIVA-VOCE-EXAMINATION

This is to certify that Thiru/Mr. RAM G ATHREYA

(Roll No. 1202FOSS0019; Register No. 75812200021) has been subjected to Viva-voce-Examination on 14 September 2014 at 9:30 AM at the Study centre The AU-KBC research Centre, Madras Institute of Technology, Anna University, Chrompet, Chennai 600044.

Internal Examiner External Examiner

Name : Name :

(in capital letters) (in capital letters)

Designation : Designation :

Address : Address :

Coordinator centre

Name :

(in capital letters)

Designation :

Address :

Date :

ACKNOWLEDGEMENT

I am highly indebted to my guide Dr. SRINIVASAN SUNDARARAJAN for his

guidance, monitoring, constant supervision, kind co-operation and encouragement

that helped me in completion of this project.

I would also like to express my special gratitude to the AU-KBC faculty involved in the M.Sc. (CS-FOSS) course for their cordial support and guidance, for providing the necessary information regarding the project, and for their support in completing it.

Finally, I thank the Centre for Distance Education, Anna University, for giving me the opportunity to do this project.

ABSTRACT

Our government spends a substantial amount of resources on educating our children. In addition, several welfare schemes aimed especially at underprivileged children have been introduced to ensure that all of them complete a basic level of education. In spite of these measures, many students do not complete their basic education.

The aim of this project is to formulate a supervised learning algorithm that helps identify students who have a higher likelihood of not completing their education.

To do this, the algorithm performs logistic regression analysis on historical data of students from a given school. The historical data includes basic background information (features) such as gender, community and number of siblings. The historical data also records whether each student completed his/her education, which is the outcome we are interested in. Typically a student who finished education is denoted by a value of 1 and a student who did not finish by a value of 0.

Based on the training (historical) data, a logistic classifier can be built. After learning from the training set, such a classifier develops a specific weightage for each of the features. These weightages can then be combined into an equation that can be used for prediction.

That is, we can apply the equation to a current student (whose background we already know) to calculate the probability that he/she will complete his/her education.

Such an algorithm will be beneficial to government agencies, since it can serve as an early warning system that lets them take more proactive action to prevent a student from dropping out. Policy makers can also use it as a tool to identify schools that are more vulnerable and direct their resources and energies to help them.

ABSTRACT IN TAMIL

[Tamil translation of the abstract above; the Tamil script is not recoverable from the source text.]

TABLE OF CONTENTS

CHAPTER NO TITLE

ACKNOWLEDGEMENT
ABSTRACT
ABSTRACT IN TAMIL
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS

1 INTRODUCTION
    1.1 OVERVIEW OF THE PROJECT
    1.2 LITERATURE SURVEY
    1.3 PROPOSED SYSTEM
    1.4 SCOPE

2 REQUIREMENT SPECIFICATION
    2.1 INTRODUCTION
    2.2 OVERALL DESCRIPTION
        2.2.1 PRODUCT PERSPECTIVE
        2.2.2 PRODUCT FUNCTIONS

3 PROJECT REQUIREMENTS
    3.1 SOFTWARE REQUIREMENTS
    3.2 HARDWARE REQUIREMENTS

4 SYSTEM DESIGN
    4.1 METHODOLOGY
    4.2 ALGORITHM
        4.2.1 SUPERVISED LEARNING
        4.2.2 CLASSIFICATION
        4.2.3 LOGISTIC REGRESSION
    4.3 DATA COLLECTION
        4.3.1 FEATURE DETECTION
            4.3.1.1 PERSONAL
            4.3.1.2 ENVIRONMENTAL
            4.3.1.3 SCHOOL
        4.3.2 DATASET GENERATION
    4.4 MODELING
        4.4.1 HYPOTHESIS DEVELOPMENT
        4.4.2 GENERALIZATION ERROR
    4.5 VALIDATION
        4.5.1 DATASET PARTITIONING
            4.5.1.1 TRAINING DATASET
            4.5.1.2 CV DATASET
        4.5.2 COST FUNCTION
        4.5.3 ERROR METRICS
            4.5.3.1 TRAINING AND CV ERROR
            4.5.3.2 F1 SCORE
            4.5.3.3 W – SCORE
        4.5.4 LEARNING CURVES
    4.6 PREDICTION

5 IMPLEMENTATION
    5.1 R
        5.1.1 COST FUNCTION.R
        5.1.2 F1SCORE.R
        5.1.3 GENERATEDATASET.R
        5.1.4 GENERATEVECTOR.R
        5.1.5 INIT.R
        5.1.6 LEARNINGCURVE.R
        5.1.7 MYSQL.R
        5.1.8 PERCRANK.R
        5.1.9 PLOTLEARNINGCURVE.R
        5.1.10 PREDICTION.R
        5.1.11 PREDICTOR.R
        5.1.12 RANDOMIZEDATASET.R
    5.2 NODE.JS
        5.2.1 APP.JS
        5.2.2 PACKAGE.JSON
        5.2.3 ROUTES.JS
        5.2.4 INDEX.JADE
        5.2.5 PREDICT.JADE
        5.2.6 UPLOAD.JADE

6 RESULTS
    6.1 DATASET UPLOAD
    6.2 UPLOAD RESULT
    6.3 PREDICTION

7 CONCLUSIONS

8 REFERENCES

LIST OF FIGURES

FIGURE NO TITLE

4.1 Logistic Regression Curve

4.2 Dataset Generation

4.3 Modeling

4.4 Dataset Partitioning

4.5 Developing Multiple Models

4.6 Calculating Cross-Validation Errors

4.7 Single Subject Learning

4.8 Learning from Experience

4.9 Score & Learning Time vs Experience

4.10 Training & Cross – Validation Error Convergence

4.11 Choosing the Best Model

4.12 Prediction

6.1 Upload Result

6.2 Prediction Screen

6.3 Predicting Student will not Dropout

6.4 Predicting Student will Dropout

LIST OF TABLES

TABLE NO TITLE

4.1 Sample Dataset

LIST OF ABBREVIATIONS

FOSS Free and Open Source Software

IDE Integrated Development Environment

OS Operating System

PTR Pupil Teacher Ratio

SCR Student Classroom Ratio

CHAPTER 1

INTRODUCTION

1.1 OVERVIEW OF THE PROJECT

Dropout is a universal phenomenon of the education system in India. It is spread across all levels of education, all parts of the country and all socio-economic groups, and dropout rates are much higher in educationally backward states and districts. Girls in India tend to have higher dropout rates than boys. Similarly, children belonging to socially disadvantaged groups such as Scheduled Castes and Scheduled Tribes have higher dropout rates than the general population.

There are also regional and location-wise differences: children living in rural areas are more likely to drop out of school. In order to reduce wastage and improve the efficiency of the education system, educational planners need to understand and identify the social groups that are more susceptible to dropout and the reasons for their dropping out.

Keeping the above context in perspective, it would be helpful to develop a

system or an algorithm that can systematically identify such vulnerable students

who have a higher likelihood of dropping out from school. The goal of this project

is to develop such an algorithm or system.

Hopefully such an algorithm or system could assist educational planners and

administrative staff of educational institutions to better allocate resources and

make better decisions, which could curb this growing dropout problem.

1.2 LITERATURE SURVEY

The literature survey covers existing research and studies with respect to the

dropout problem. They are grouped into three broad categories:

1 Research Papers

2 Surveys

3 Government Reports

The detailed list of resources researched during the literature survey is

provided in the references section.

1.3 PROPOSED SYSTEM

The proposed system will implement an algorithm that takes student data as input and learns from it. The learned function, also called the hypothesis, serves as an approximate explanation of the data. Error metrics and validation techniques are used to determine the accuracy of the hypothesis.

The hypothesis that best fits the data is then used for prediction. The final goal of the algorithm is to make reasonably accurate predictions on new, unlabeled data, that is, data for which the outcome is unknown.

This system will be implemented in such a way that it can be operated from a

web interface where the user can upload datasets as well as make predictions

based on learned data.

1.4 SCOPE

The algorithm developed is an exploratory proof-of-concept system that uses machine learning and statistical techniques to make predictions based on student data. The validity of the results depends entirely on the accuracy of the data and on how the algorithm processes it.

Since comprehensive student data was not available for tuning the algorithm, this iteration of the system can only serve as a proof of concept of what is possible. In its present form it cannot be used directly in the real world as a decision-making or policy-making tool.

CHAPTER 2

REQUIREMENT SPECIFICATION

2.1 INTRODUCTION

A software requirements specification (SRS) defines the requirements of a

software system. It is a description of the behavior of a system to be developed

and may include a set of use cases. In addition it also contains non-functional

requirements. Non-functional requirements impose constraints on the design or

implementation (such as performance requirements, quality standards, or design

constraints).

This project requires storage and processing of medium to large volumes of data (datasets). The datasets are first passed through the algorithm during a training phase, in which the algorithm learns from the training data. After training is completed, the algorithm is required to make predictions for new unlabeled data based on what it learned from the training data.

Additionally, it would be helpful if the algorithm could be operated from a web user interface, which is more user friendly than issuing commands from the command line.

2.2 OVERALL DESCRIPTION

This section gives an overall description of the project, including its different perspectives, constraints, and functional and non-functional requirements.

2.2.1 PRODUCT PERSPECTIVE

The system has four main tasks:

Data Collection
Modeling
Validation
Prediction

In the data collection phase, the data required for the algorithm is gathered, converted into a suitable form and supplied to the system for learning.

In the modeling phase, the algorithm generates models that try to explain the data that has been gathered. Machine learning techniques are used in this phase to generate multiple models, of which the best is chosen in later stages.

In the validation phase, the different models are evaluated based on performance, and the best among them is chosen as the candidate model to be used for prediction.

Finally, in the prediction phase, the chosen model is used for making actual real-world predictions.

2.2.2 PRODUCT FUNCTIONS

The system has two main functions:

Training
Prediction

In the training phase, the dataset is supplied to the algorithm, which uses it to develop the best model for prediction.

In the prediction phase, the learned model is put to use, that is, it makes predictions for unlabeled data.

How these processes are implemented is explained in detail in subsequent sections.

CHAPTER 3

PROJECT REQUIREMENTS

The project requirement is to develop an algorithm that can classify students according to whether they will complete their education or not (drop out). To achieve this, a system needs to be created that can be operated from a web user interface, through which the user can supply data for training or make predictions based on an already trained model.

3.1 SOFTWARE REQUIREMENTS

The software requirements for this project are:

R – R is a free software programming language and software environment for statistical computing and graphics.

Node.js – Node.js is a cross-platform runtime environment and a library for running applications written in JavaScript outside the browser (for example, on the server).

NetBeans – NetBeans is an integrated development environment (IDE) for developing primarily with Java, but also with other languages, in particular PHP, C++, Node.js and HTML5.

RStudio – RStudio is a free and open source (FOSS) integrated development environment for R, a programming language for statistical computing and graphics.

LINUX – LINUX is a POSIX-compliant computer operating system (OS) assembled under the model of free and open source software.

3.2 HARDWARE REQUIREMENTS

The hardware requirements define a set of (minimum) hardware that must

be available to run the system.

Hardware System that can support LINUX Operating System

2 – 4 GB of RAM

Internet Connectivity

CHAPTER 4

SYSTEM DESIGN

System design is the process of defining the architecture, components,

modules, interfaces and data for a system to satisfy specified requirements. System

design encompasses activities such as systems analysis, systems architecture and

systems engineering.

4.1 METHODOLOGY

A software development methodology or system development methodology

in software engineering is a framework that is used to structure, plan and control

the process of developing a software system.

This project consists of four distinct phases that are

Data Collection

Modeling

Validation

Prediction

4.2 ALGORITHM

The system will use a logistic regression classifier, which is a supervised machine learning algorithm. This algorithm takes student data as input and predicts an outcome. The outcome is binary, that is, either TRUE or FALSE. A TRUE value indicates that a student will drop out, while FALSE means the student will not drop out.

Since the algorithm returns only one of two possible outcomes, it can also be called a binary (binomial) classifier.

4.2.1 SUPERVISED LEARNING

Supervised learning is the machine-learning task of inferring a

function from labeled training data. The training data consist of a set of

training examples. Typically the training data for this project will consist of

data about students based on features that will be defined later in this

document.

In supervised learning, each example is a pair consisting of an input

object (typically a vector) and a desired output value (also called the

supervisory signal). A supervised learning algorithm analyzes the training

data and produces an inferred function, which can be used for mapping new

examples. New examples are usually unlabeled data that we need to predict.

An optimal scenario will allow for the algorithm to correctly determine the

class labels for unseen instances. This requires the learning algorithm to

generalize from the training data to unseen situations in a "reasonable" way.

In order to solve a given problem of supervised learning, the system

has to perform the following steps:

1. Determine the type of training examples : The kind of data that is

to be used as the training set needs to be determined first. In the case

of handwriting analysis, for example, this might be a single

handwritten character, an entire handwritten word, or an entire line

of handwriting

2. Gather a training set : The training set needs to be representative

of the real-world use of the function. Thus, a set of input objects is

gathered and corresponding outputs are also gathered, either from

human experts or from measurements

3. Determine the input feature representation of the learned

function: The accuracy of the learned function depends strongly on

how the input object is represented. Typically, the input object is

transformed into a feature vector, which contains a number of

features that are descriptive of the object. The number of features should not be too large, but should contain enough information to accurately predict the output.

4. Determine the learning algorithm : The correct learning algorithm

that models the available data should be identified and applied. For

example the learning algorithm may be support vector machines or

decision trees

5. Complete the design : Run the learning algorithm on the gathered

training set. Some supervised learning algorithms require certain

control parameters. These parameters may be adjusted by optimizing

performance on a subset (called a validation set) of the training set,

or via cross-validation.

6. Evaluate the accuracy of the learned function : After parameter

adjustment and learning, the performance of the resulting function

should be measured on a test set that is separate from the training

set.

4.2.2 CLASSIFICATION

In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. The individual observations are analyzed into a set of quantifiable properties, known variously as explanatory variables or features. These properties may be categorical (e.g. "A", "B", "AB" or "O", for blood type), ordinal (e.g. "large", "medium" or "small"), integer-valued (e.g. the number of occurrences of a particular word in an email) or real-valued (e.g. a measurement of blood pressure). Some algorithms work only in terms of discrete data and require that real-valued or integer-valued data be discretized into groups (e.g. less than 5, between 5 and 10, or greater than 10). An example of classification would be assigning a given email to the "spam" or "non-spam" class, or assigning a diagnosis to a given patient based on observed characteristics of the patient (gender, blood pressure, presence or absence of certain symptoms, etc.).

An algorithm that implements classification, especially in a concrete

implementation, is known as a classifier. The term "classifier" sometimes

also refers to the mathematical function, implemented by a classification

algorithm, that maps input data to a category.

In the terminology of machine learning, classification is considered

an instance of supervised learning, i.e. learning where a training set of

correctly identified observations is available. The corresponding

unsupervised procedure is known as clustering or cluster analysis, and

involves grouping data into categories based on some measure of inherent

similarity (e.g. the distance between instances, considered as vectors in a

multi-dimensional vector space).

In statistics, where classification is often done with logistic

regression or a similar procedure, the properties of observations are termed

explanatory variables (or independent variables, regressors, etc.), and the

categories to be predicted are known as outcomes, which are considered to

be possible values of the dependent variable. In machine learning, the

observations are often known as instances, the explanatory variables are

termed features (grouped into a feature vector), and the possible categories

to be predicted are classes. There is also some argument over whether

classification methods that do not involve a statistical model can be

considered "statistical".

4.2.3 LOGISTIC REGRESSION

In statistics, logistic regression, or logit regression, is a type of probabilistic statistical classification model. It is used to predict a binary response, that is, the outcome of a categorical dependent variable (a class label), based on one or more predictor variables (features). In other words, it is used in estimating the parameters of a qualitative response model. The probabilities describing the possible outcomes of a single trial are modeled, as a function of the explanatory (predictor) variables, using a logistic function. The term logistic regression usually refers specifically to the problem in which the dependent variable is binary, that is, the number of available categories is two, while problems with more than two categories are referred to as multinomial logistic regression.

Logistic regression measures the relationship between a categorical

dependent variable and one or more independent variables, which are

usually (but not necessarily) continuous, by using probability scores as the

predicted values of the dependent variable.

Fig 4.1 : Logistic Regression Curve

The formula for Logistic Regression can be expressed as :

$$F(x) = \frac{1}{1 + e^{-x}}$$

Eq 4.1 : Logistic Regression Formula

where :

F(x) is the output

x is the input

e is Euler’s number

Note that F(x) takes values only between 0 and 1 for any x in (−∞, ∞). Using the above equation we can choose a threshold k ∈ (0, 1) such that every input with F(x) ≥ k is classified as TRUE and every input with F(x) < k is classified as FALSE (or vice versa), thereby dividing the data into two distinct classes.
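The logistic function and the threshold rule described above can be sketched in a few lines of Python (the project's own implementation is in R; this is an illustration only). The function names and the default threshold k = 0.5 are assumptions made for the example and do not come from the source.

```python
import math


def logistic(x: float) -> float:
    """Logistic function F(x) = 1 / (1 + e^(-x)) from Eq 4.1."""
    # Branch on the sign of x so that math.exp never receives a large
    # positive argument, which would overflow.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def classify(x: float, k: float = 0.5) -> bool:
    """Return True when F(x) >= k and False otherwise.

    The default k = 0.5 is an assumed threshold, not a value from the source.
    """
    if not 0.0 < k < 1.0:
        raise ValueError("threshold k must lie strictly between 0 and 1")
    return logistic(x) >= k
```

With k = 0.5 the rule reduces to checking the sign of x, since F(0) = 0.5. Raising k makes the classifier more reluctant to output TRUE.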

4.3 DATA COLLECTION

4.3.1 FEATURE DETECTION

Based on the literature survey six features have been identified as

major observable factors that can affect the final outcome regarding the

education fulfillment of a student.

The six features can be grouped into three categories that are:

1. Personal Features

2. Environmental Features

3. School Features

4.3.1.1 PERSONAL

Personal features are those features that are based on the

characteristics of the student or his/her parents, family background

etc. The personal features that are being considered by the algorithm

are:

1. Gender: Values can be Male or Female

2. Poverty: Values can be Yes or No

3. Community: Values can be General, OBC, SC, ST

4.3.1.2 ENVIRONMENTAL

Environmental features are those features that are based on

the student’s environment, locality, geography etc. The

environmental features that are being considered by the algorithm

are:

1. Rural: Values can be Yes or No

4.3.1.3 SCHOOL

School features are those features that are based on the

characteristics of the school where the student studies. The school

features that are being considered by the algorithm are:

Pupil Teacher Ratio: Pupil–teacher ratio is the number of students

who attend a school or university divided by the number of teachers

in the institution. For example, a pupil–teacher ratio of 10:1

indicates that there are 10 students for every one teacher. The term

can also be reversed to create a teacher–pupil ratio.

Student Classroom Ratio: Student – classroom ratio is the number

of students per classroom in an education institution. For example, a

student – classroom ratio of 40:1 indicates that there are 40 students

for every classroom.

1. Pupil Teacher Ratio: Values can be Low (1 Teacher : <30 Students), Medium (1 Teacher : 30 – 40 Students) and High (1 Teacher : 40+ Students)

2. Student Classroom Ratio: Values can be Low (1 Classroom : <30 Students), Medium (1 Classroom : 30 – 40 Students) and High (1 Classroom : 40+ Students)
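Since all six features are categorical, a student record has to be turned into numbers before a classifier can use it. The Python sketch below shows one common way of doing so, indicator (one-hot) encoding. It illustrates the idea only and is not the project's own encoding: the dictionary keys, the column order and the leading intercept term are assumptions.

```python
# Feature values follow Section 4.3.1; the keys and column order are assumptions.
FEATURE_VALUES = {
    "gender": ["Male", "Female"],
    "poverty": ["Yes", "No"],
    "community": ["General", "OBC", "SC", "ST"],
    "rural": ["Yes", "No"],
    "ptr": ["Low", "Medium", "High"],
    "scr": ["Low", "Medium", "High"],
}


def encode_student(student: dict) -> list:
    """Turn one student record into a numeric feature vector.

    The first entry is a constant 1 (an assumed intercept term). Each remaining
    entry is 1 when the student has that feature value and 0 otherwise.
    """
    vector = [1]
    for feature, values in FEATURE_VALUES.items():
        if student[feature] not in values:
            raise ValueError(f"unknown value {student[feature]!r} for {feature}")
        vector.extend(1 if student[feature] == v else 0 for v in values)
    return vector
```

A record such as `{"gender": "Female", "poverty": "Yes", "community": "SC", "rural": "Yes", "ptr": "High", "scr": "Medium"}` becomes a vector of 17 entries, exactly seven of which are 1 (the intercept plus one indicator per feature). In practice one indicator per feature is often dropped to avoid redundant columns.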

4.3.2 DATASET GENERATION

Based on statistics derived from the literature survey and the features mentioned above, the dataset for modeling is generated. The table below summarizes the statistical findings compiled from the literature survey:

Feature      Value     Distribution   Dropout Chance
Gender       Male      52%            39%
Gender       Female    48%            41%
Poverty      Yes       22%            80%
Poverty      No        78%            27%
Rural        Yes       75%            45%
Rural        No        25%            20%
Community    General   30%            10%
Community    OBC       40%            48%
Community    SC        20%            64%
Community    ST        10%            69%
PTR          Low       20%            15%
PTR          Medium    30%            35%
PTR          High      50%            55%
SCR          Low       18%            22%
SCR          Medium    33%            25%
SCR          High      49%            60%

Table 4.1 : Sample Dataset

The above table shows the share of the student population that has each feature value and the dropout chance for students with that value. For example, out of every 100 students, 52 are male and 48 are female, and the chance that a female student drops out is 41%.

The overall dropout percentage was found to be 40%. That is, 40% of the student population drops out of school. Using the above statistics, a dataset can be generated for further analysis.

Fig 4.2 : Dataset Generation
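One possible way to generate such a dataset from Table 4.1 is sketched below in Python. The text does not state how the per-feature dropout chances are combined into a single label, so two loud assumptions are made here: each feature is sampled independently of the others, and a student's dropout probability is the plain average of the six per-feature dropout chances. The function names are hypothetical and this is not the project's own generator.

```python
import random

# (share of population, dropout chance) per feature value, copied from Table 4.1.
TABLE_4_1 = {
    "gender": {"Male": (0.52, 0.39), "Female": (0.48, 0.41)},
    "poverty": {"Yes": (0.22, 0.80), "No": (0.78, 0.27)},
    "rural": {"Yes": (0.75, 0.45), "No": (0.25, 0.20)},
    "community": {"General": (0.30, 0.10), "OBC": (0.40, 0.48),
                  "SC": (0.20, 0.64), "ST": (0.10, 0.69)},
    "ptr": {"Low": (0.20, 0.15), "Medium": (0.30, 0.35), "High": (0.50, 0.55)},
    "scr": {"Low": (0.18, 0.22), "Medium": (0.33, 0.25), "High": (0.49, 0.60)},
}


def generate_student(rng: random.Random) -> dict:
    """Draw one synthetic student record with a 0/1 dropout label (1 = dropout)."""
    student = {}
    chances = []
    for feature, values in TABLE_4_1.items():
        names = list(values)
        shares = [values[name][0] for name in names]
        # Assumption: every feature is sampled independently of the others.
        choice = rng.choices(names, weights=shares, k=1)[0]
        student[feature] = choice
        chances.append(values[choice][1])
    # Assumption: the dropout probability is the plain average of the
    # per-feature dropout chances of the sampled values.
    p_dropout = sum(chances) / len(chances)
    student["dropout"] = 1 if rng.random() < p_dropout else 0
    return student


def generate_dataset(n: int, seed: int = 0) -> list:
    """Generate n synthetic student records; the seed makes runs repeatable."""
    rng = random.Random(seed)
    return [generate_student(rng) for _ in range(n)]
```

Under these assumptions the expected dropout rate of the generated data works out to about 40%, in line with the overall figure quoted above. A different combination rule would change how strongly each feature drives the label.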

4.4 MODELING

Data modeling in software engineering is the process of creating a data

model for an information system by applying formal data modeling techniques.

Fig 4.3 : Modeling

4.4.1 HYPOTHESIS DEVELOPMENT

A hypothesis (plural: hypotheses) is a proposed explanation for a phenomenon. A working hypothesis is a provisionally accepted hypothesis proposed for further research. In the context of machine learning, the hypothesis is also called the learned function.

In this project, the learned function is a working hypothesis that tries to explain the training dataset of students. Based on the observations and outcomes in the training dataset, the learning algorithm develops weightages for each of the selected features. These weightages are then used for predicting outcomes in a future dataset.
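The section does not write the learned function out. Assuming the standard logistic regression form, it can be expressed by feeding a weighted sum of the features into the logistic function of Eq 4.1. Here the symbols θ_j (the weightages), x_j (the numeric feature values), n (the number of features) and the intercept θ_0 are notation introduced for this sketch:

```latex
h_\theta(x) = F\!\left(\theta^{T} x\right) = \frac{1}{1 + e^{-\theta^{T} x}},
\qquad
\theta^{T} x = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n
```

Read this way, a feature with a large positive weightage pushes the predicted probability towards 1, while a large negative weightage pushes it towards 0.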

4.4.2 GENERALIZATION ERROR

The generalization error of a machine-learning model is a function that measures how well a learning machine generalizes to unseen data. It is measured as the distance between the error on the training set and the error on the test set, averaged over the entire set of possible training data that can be generated after each iteration of the learning process. It has this name because the function indicates the capacity of a machine that learns with the specified algorithm to infer a rule (or generalize).

The theoretical model assumes a probability distribution of the examples and a function giving the exact target. The model can also include noise in the examples (in the input and/or target output). The generalization error is usually defined as the expected value of the square of the difference between the learned function and the exact target (mean-square error).

The performance of a machine learning algorithm is visualized by plotting the generalization error values over the course of the learning process; these plots are called learning curves.
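The mean-square definition given above can be written compactly as follows. The symbols are notation assumed for this sketch rather than taken from the source: h_θ is the learned function, f is the exact target function, and the expectation is taken over the probability distribution of the examples x.

```latex
E_{\mathrm{gen}} = \mathbb{E}_{x}\!\left[ \left( h_\theta(x) - f(x) \right)^{2} \right]
```

Because the true distribution of x is not known in practice, this quantity has to be estimated from data that was held out of training.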

4.5 VALIDATION

In statistics, model validation is the process of deciding whether the

numerical results quantifying hypothesized relationships between variables,

obtained from machine learning analysis, are in fact acceptable as descriptions of

the data.

The validation process can involve analyzing the goodness of fit of the

model, analyzing whether the model residuals are random, and checking whether

the model's predictive performance deteriorates substantially when applied to data

that were not used in model estimation.

4.5.1 DATASET PARTITIONING

In model validation, the dataset is generally partitioned into two separate datasets:

1. Training Dataset

2. Cross-Validation (CV) Dataset

The model is trained on the training dataset and then tested on the cross-validation dataset, which contains examples that are independent of the training data. The actual training/cross-validation split is up to the person doing the analysis. Usually a split of 80-20% (training-CV) or 70-30% is preferred so that the model has enough examples for training.

Fig 4.4 : Dataset Partitioning
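The partitioning step can be sketched in Python as follows. The function name, the default 80-20 split and the fixed random seed are assumptions made for illustration. Shuffling before the cut is also an assumption, added so that the two parts do not depend on the order in which records were stored.

```python
import random


def partition(dataset: list, train_fraction: float = 0.8, seed: int = 0):
    """Shuffle a copy of the dataset and split it into (training, cv) lists."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must lie strictly between 0 and 1")
    shuffled = list(dataset)  # copy, so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]
```

Calling `partition(records)` on 100 records returns 80 training examples and 20 cross-validation examples with no record appearing in both.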

4.5.1.1 TRAINING DATASET

A training set is a set of data used in various areas of

information science to discover potentially predictive relationships.

Training sets are used in artificial intelligence, machine learning,

genetic programming, intelligent systems, and statistics. In all these

fields, a training set has much the same role and is often used in

conjunction with a test set.

Fig 4.5 : Developing Multiple Models

4.5.1.2 CV DATASET

Cross-validation, sometimes called rotation estimation, is a

model validation technique for assessing how the results of a

statistical analysis will generalize to an independent data set. It is

mainly used in settings where the goal is prediction, and one wants

to estimate how accurately a predictive model will perform in

practice. In a prediction problem, a model is usually given a dataset

of known data on which training is run (training dataset), and a

dataset of unknown data (or first seen data) against which the model

is tested (testing dataset). The goal of cross validation is to define a

dataset to "test" the model in the training phase (i.e., the validation

dataset), in order to limit problems like overfitting, give an insight

on how the model will generalize to an independent data set (i.e., an

unknown dataset, for instance from a real problem), etc.

One round of cross-validation involves partitioning a sample

of data into complementary subsets, performing the analysis on one

subset (called the training set), and validating the analysis on the

other subset (called the validation set or testing set). To reduce

variability, multiple rounds of cross-validation are performed using

different partitions, and the validation results are averaged over the

rounds.

Fig 4.6 : Calculating Cross-Validation Errors
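The multiple rounds of cross-validation described above are often organised as k folds, in which each example is used for validation exactly once. The Python sketch below illustrates that scheme. The source does not say which partitioning scheme the project uses, so the k-fold layout, the function names and the caller-supplied `error_for_round` function are all assumptions. The data is assumed to have been shuffled beforehand.

```python
def k_fold_rounds(n_examples: int, k: int = 5):
    """Yield (training_indices, validation_indices) for each of k rounds."""
    if not 2 <= k <= n_examples:
        raise ValueError("k must be between 2 and the number of examples")
    indices = list(range(n_examples))
    fold_size, remainder = divmod(n_examples, k)
    start = 0
    for fold in range(k):
        # The first `remainder` folds take one extra example each.
        stop = start + fold_size + (1 if fold < remainder else 0)
        validation = indices[start:stop]
        training = indices[:start] + indices[stop:]
        yield training, validation
        start = stop


def cross_validation_error(n_examples: int, k: int, error_for_round) -> float:
    """Average the per-round validation error over k rounds.

    `error_for_round(training, validation)` is a hypothetical caller-supplied
    function that trains on one subset and returns the error on the other.
    """
    errors = [error_for_round(tr, va) for tr, va in k_fold_rounds(n_examples, k)]
    return sum(errors) / len(errors)
```

Averaging over the rounds reduces the dependence of the estimate on any single choice of validation subset.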

4.5.2 COST FUNCTION

In mathematical optimization, statistics, decision theory and machine

learning, a cost function or loss function is a function that maps an event or

values of one or more variables onto a real number intuitively representing

some "cost" associated with the event. An optimization problem seeks to

minimize a loss function. An objective function is either a loss function or

its negative (sometimes called a reward function or a utility function), in

which case it is to be maximized.

In statistics, typically a loss function is used for parameter

estimation, and the event in question is some function of the difference

between estimated and true values for an instance of data.

The cost function is expressed as :

\[ J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \]

Eq 4.2 : Cost Function or Error Function

where :

J is the Cost

m is the number of training examples

h(x) is the hypothesis

y is the actual value or the result vector
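As a small worked example of the cost function, the following R lines evaluate it for four students. The outcome and hypothesis vectors are made-up values chosen for the illustration.

```r
# Hypothetical outcomes (1 = dropout) and hypothesis outputs for m = 4 students
y <- c(1, 0, 1, 0)
h <- c(1, 0, 0, 0)
m <- length(y)

# Cost: half the mean of the squared differences between hypothesis and outcome
J <- 1 / (2 * m) * sum((h - y)^2)
```

Only the third student is misclassified, so the sum of squared differences is 1 and the cost is 1 / (2 * 4) = 0.125.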

4.5.3 ERROR METRICS

Error metrics are systematic benchmarking measures used to calculate the
accuracy or effectiveness of the system. The cost function described above
is a good example of an error metric. The following error metrics are used
to validate the generated models and to choose the best among them:

Training and CV Error

F1 Score

W – Score

4.5.3.1 TRAINING AND CV ERROR

Training error is the cost function error of the trained model on the
training set. That is, after the model has been trained, the training
dataset is supplied to it again as input to make predictions. These
predictions are compared against the actual outcomes in the dataset, and
the error between the two is calculated using the cost function formula
(Eq 4.2). The resulting value is the training error.

The cross-validation error is calculated in the same way, except that it
uses the cross-validation set. The benefit is that the cross-validation
set is new data that contains none of the training examples, so it gives
a better estimate of the accuracy of the system on unseen data. Ideally
the system's cross-validation error should be close to the training
error, in which case the model is a good estimate of the underlying data.

4.5.3.2 F1 Score

In statistical analysis of binary classification, the F1 score (also
F-score or F-measure) is a measure of a test's accuracy. It considers
both the precision p and the recall r of the test to compute the score:
p is the number of correct positive results divided by the number of all
positive results returned, and r is the number of correct positive
results divided by the number of positive results that should have been
returned. The F1 score is the harmonic mean of the precision and recall;
it reaches its best value at 1 and its worst at 0.

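\[ F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]

Eq 4.3 : F1 – Score

\[ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \]

Eq 4.4 : Precision

\[ \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \]

Eq 4.5 : Recall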
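The three formulas can be checked by hand on a small example. The R lines below use two made-up label vectors for eight students; the values are assumptions chosen so the counts are easy to verify.

```r
# Hypothetical labels: TRUE means the student dropped out
actual    <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE)
predicted <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)

true_positives  <- sum(actual & predicted)     # predicted dropout, did drop out
false_positives <- sum(!actual & predicted)    # predicted dropout, did not
false_negatives <- sum(actual & !predicted)    # missed dropouts

precision <- true_positives / (true_positives + false_positives)
recall    <- true_positives / (true_positives + false_negatives)
f1        <- 2 * precision * recall / (precision + recall)
```

Here there are 2 true positives, 1 false positive and 2 false negatives, giving a precision of 2/3, a recall of 1/2 and an F1 score of 4/7.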

4.5.3.3 W – Score

The W-Score combines the F1 score with the training and cross-validation
errors into a single number that is used to choose the best model. The
model with the lowest W-Score is chosen. The W-Score is expressed as:

\[ W = (1 - f_1) \cdot \frac{\sum \text{Train Error}}{N_T} \cdot \frac{\sum \text{CV Error}}{N_{CV}} \]

Eq 4.6 : W - Score

where:

W – W-Score

f1 – F1 Score

NT – Number of Training Examples

NCV – Number of Cross – Validation Examples
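To illustrate how the W-Score ranks models, the R sketch below compares three candidate models. The averaged errors and F1 scores are invented numbers, and the vector names are assumptions made for the example.

```r
# Hypothetical averages for three candidate models (one per threshold z)
z     <- c(0.3, 0.5, 0.7)
train <- c(0.040, 0.030, 0.050)   # mean training error
cv    <- c(0.050, 0.035, 0.060)   # mean cross-validation error
f1    <- c(0.80, 0.90, 0.70)      # mean F1 score

# Low errors and a high F1 score both push W down
w <- (1 - f1) * train * cv

# The model with the smallest W-Score is selected
best <- z[which.min(w)]
```

The second model has both the lowest errors and the highest F1 score, so its W-Score (0.000105) is the smallest and `best` is 0.5.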

4.5.4 LEARNING CURVES

Fig 4.7 : Single Subject Learning Fig 4.8 : Learning from Experience

Fig 4.9 : Score & Learning Time vs Experience

A learning curve is a graphical representation of the increase of learning

(vertical axis) with experience (horizontal axis). Although the curve for a single

subject may be erratic (Fig 4.7), when a large number of trials are averaged, a

smooth curve results, which can be described with a mathematical function (Fig

4.8). Depending on the metric used for learning (or proficiency) the curve can

either rise or fall with experience (Fig 4.9).

Within the context of this project, the horizontal axis is the number of
training examples, which stands in for experience, and the vertical axis is
the cost function error.

Two errors are plotted: the training error and the cross-validation error.
As the number of training examples increases, the training error is expected
to rise gradually, because a model that does not overfit can no longer fit
every example exactly once the training dataset covers a diverse spectrum of
examples. It should not, however, rise sharply. If the model is effective, it
should perform about as well on new data as it does on the training dataset,
so the cross-validation error should decrease as the number of training
examples increases.

Thus the ideal model shows a small increase in training error and a decrease
in cross-validation error as training examples are added, with the two errors
converging as shown in Fig 4.10.
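A learning curve of this kind can be computed with a few lines of R. The sketch below is a simplified stand-in for the project's code: the simulated data, the single feature `x`, the fixed 0.5 threshold and the 70-30 split are assumptions made for the illustration.

```r
set.seed(7)
# Hypothetical data: a binary outcome driven by one noisy feature
n <- 1000
x <- rnorm(n)
y <- as.numeric(x + rnorm(n) > 0)
data <- data.frame(x, y)

sizes <- seq(100, n, by = 300)          # 100, 400, 700, 1000 examples
train_error <- numeric(length(sizes))
cv_error    <- numeric(length(sizes))

for (k in seq_along(sizes)) {
  m <- floor(0.7 * sizes[k])            # 70% of the rows train the model
  training <- data[1:m, ]
  cv <- data[(m + 1):sizes[k], ]        # the remaining 30% validate it
  model <- glm(y ~ x, family = binomial, data = training)
  # Cost function error of the thresholded predictions on a dataset
  cost <- function(d) {
    p <- predict(model, newdata = d, type = "response") >= 0.5
    1 / (2 * nrow(d)) * sum((d$y - p)^2)
  }
  train_error[k] <- cost(training)
  cv_error[k]    <- cost(cv)
}

curve <- data.frame(examples = sizes, train_error, cv_error)
```

Plotting `train_error` and `cv_error` against `examples` gives the two curves whose convergence is discussed above.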

Fig 4.10 : Training & Cross – Validation Error Convergence

Fig 4.11 : Choosing the Best Model

4.6 PREDICTION

Prediction is the final step in the process. After the model that best fits
the given dataset has been selected, it can be put to use on real-world
unlabeled data, that is, data for which the outcomes are not known. The
prediction process begins with the algorithm being supplied unlabeled student
data, from which it predicts an outcome: whether or not the student will drop
out.
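The prediction step can be sketched in R as follows. This is a minimal illustration rather than the project's script: the simulated labeled data, the single `Poverty` predictor, the assumed dropout probabilities and the threshold `z` of 0.5 are all hypothetical.

```r
set.seed(3)
# Hypothetical labeled data with a single categorical predictor
labeled <- data.frame(
  Poverty = factor(sample(c("Yes", "No"), 300, replace = TRUE))
)
# Assumed dropout probabilities: 0.7 under poverty, 0.2 otherwise
labeled$Dropout <- as.numeric(runif(300) < ifelse(labeled$Poverty == "Yes", 0.7, 0.2))

# Train a logistic regression model on the labeled data
model <- glm(Dropout ~ Poverty, family = binomial, data = labeled)

# Unlabeled students: the outcome column is absent
unlabeled <- data.frame(
  Poverty = factor(c("Yes", "No"), levels = levels(labeled$Poverty))
)

z <- 0.5                                   # assumed decision threshold
probability <- predict(model, newdata = unlabeled, type = "response")
will_drop_out <- probability >= z          # TRUE means a predicted dropout
```

The model returns a probability for each unlabeled student, and comparing it with the threshold turns it into the yes/no outcome.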

Fig 4.12 : Prediction

CHAPTER 5

IMPLEMENTATION

5.1 R

5.1.1 COST FUNCTION

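costFunction <- function(dataset, prediction){
  # Coerce the actual outcomes and the predictions to numeric vectors
  dataset <- as.numeric(dataset);
  prediction <- as.numeric(prediction);
  # m is the number of examples
  m <- length(dataset);
  # Eq 4.2: half the mean squared difference
  J <- 1 / (2 * m) * sum((dataset - prediction) ^ 2);
  return(J);
}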

5.1.2 F1SCORE.R

f1Score <- function(data, prediction){
  data <- as.numeric(data);
  prediction <- as.numeric(prediction);
  # Count the three outcomes needed for Eq 4.4 and Eq 4.5
  true_positives <- sum(data & prediction);
  false_positives <- sum(!data & prediction);
  false_negatives <- sum(data & !prediction);
  precision <- true_positives / (true_positives + false_positives);
  recall <- true_positives / (true_positives + false_negatives);
  # Eq 4.3: harmonic mean of precision and recall
  return(as.numeric(2 * precision * recall / (precision + recall)));
}

5.1.3 GENERATEDATASET.R

generateDataset <- function(n, dropout_percentage){

source('generateVector.R');

source('percRank.R')

#Gender List

gender_list <- list(data = factor(c("Male", "Female")),

dist = list(Male = 0.52, Female = 0.48),

w = list(Male = 0.39, Female = 0.41));

#Poverty List

poverty_list <- list(data = factor(c("Yes", "No")),

dist = list(Yes = 0.22, No = 0.78),

w = list(Yes = 0.80, No = 0.27));

#Community List

community_list <- list(data = factor(c("General", "OBC", "SC", "ST")),

dist = list(General = 0.30, OBC = 0.40, SC = 0.20, ST = 0.10),

w = list(General = 0.10, OBC = 0.48, SC = 0.64, ST = 0.69));

#Rural List

rural_list <- list(data = factor(c("Yes", "No")),

dist = list(Yes = 0.75, No = 0.25),

w = list(Yes = 0.45, No = 0.20));

#Pupil Teacher Ratio List

ptr_list <- list(data = factor(c("Low", "Medium", "High"), order = TRUE),

dist = list(Low = 0.20, Medium = 0.30, High = 0.50),

w = list(Low = 0.15, Medium = 0.35, High = 0.55));

#Student Classroom Ratio List

scr_list <- list(data = factor(c("Low", "Medium", "High"), order = TRUE),

dist = list(Low = 0.18, Medium = 0.33, High = 0.49),

w = list(Low = 0.22, Medium = 0.25, High = 0.60));

Gender <- generateVector(n, gender_list);

Poverty <- generateVector(n, poverty_list);

Community <- generateVector(n, community_list);

Rural <- generateVector(n, rural_list);

PTR <- generateVector(n, ptr_list);

SCR <- generateVector(n, scr_list);

getW <- function(list, vector, index){

value <- as.character(vector[index]);

return(as.numeric(list$w[value]));

}

weightage_vector <- vector('numeric');
for(i in 1:n){
  # Sum the dropout weightage of every attribute value of student i
  weightage_vector[i] <-
    getW(gender_list, Gender, i) +
    getW(poverty_list, Poverty, i) +
    getW(community_list, Community, i) +
    getW(rural_list, Rural, i) +
    getW(ptr_list, PTR, i) +
    getW(scr_list, SCR, i);
}

w_rank <- percRank(weightage_vector);

Dropout <- w_rank >= (1 - dropout_percentage);

data <- data.frame(Gender, Poverty, Community, Rural, PTR, SCR,

Dropout);

write.csv(file="data.csv", x=data)

}

5.1.4 GENERATEVECTOR.R

generateVector <- function(n, list){
  # Cumulative probabilities of the values, e.g. 0.52 and 1.00 for gender
  p <- cumsum(unlist(list$dist));
  # Get index of the first value whose cumulative probability covers r
  getIndex <- function(p, r){
    k = 1;
    for(i in p){
      if(r <= i){
        break;
      }
      k = k + 1;
    }
    return(k);
  }
  # Generate vector: one uniform random number per generated value
  result <- factor(list$data);
  for(i in 1:n) {
    index <- getIndex(p, runif(1));
    value <- list$data[index];
    result[i] = value;
  }
  return(result);
}
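For comparison, base R's `sample()` with its `prob` argument draws from the same kind of discrete distribution in a single call. The lines below are a sketch using the gender distribution from generateDataset.R; the sample size and seed are arbitrary choices for the illustration.

```r
set.seed(10)
# Draw 10000 genders with probabilities 0.52 (Male) and 0.48 (Female)
gender <- sample(c("Male", "Female"), size = 10000, replace = TRUE,
                 prob = c(0.52, 0.48))

# The observed share of "Male" should be close to 0.52
share_male <- mean(gender == "Male")
```

The explicit cumulative-probability loop makes the sampling mechanism visible, while `sample()` is shorter and faster for large n.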

5.1.5 INIT.R

setwd('/Users/ramathreya/Sites/foss-project/r');

source('generateDataset.R');

source('randomizeDataset.R');

source('predictor.R');

source('costFunction.R');

source('f1Score.R');

source('learningCurve.R');

source('plotLearningCurve.R');

partition <- 0.7;

start <- 100;

interval <- 500;

dataset <- read.csv(file="input.csv");

n <- nrow(dataset);

png('../public/plot.png');

opar <- par(no.readonly=TRUE)

par(mfrow=c(3, 3));

z <- c();
train <- c();
cv <- c();
f1 <- c();
seq_range <- seq(0.1, 0.9, 0.1);
# Build one learning curve per candidate threshold z
for(i in seq_range){
  curves <- learningCurve(dataset, start, n, interval, partition, "Dropout", predictor, i);
  plotLearningCurve(curves$m, curves$train, curves$test, c("Plot when Z is ", i), "Training Examples", "Error");
  z <- c(z, i);
  # Average each metric over all points of the learning curve
  cv <- c(cv, sum(abs(curves$test)) / length(curves$test));
  train <- c(train, sum(abs(curves$train)) / length(curves$train));
  f1 <- c(f1, sum(abs(curves$f1)) / length(curves$f1));
}
# W-Score (Eq 4.6): the threshold with the lowest W is written to out.z
w <- (1-f1) * train * cv;
analysis <- data.frame(z, train, cv, f1, w);
# which.min returns a single index even when several thresholds tie
min_index <- which.min(w);
write.csv(seq_range[min_index], file="out.z")

dev.off();

5.1.6 LEARNINGCURVE.R

learningCurve <- function(dataset, start, end, interval, partition, column, predictor, z){
  train_plot <- c();
  test_plot <- c();
  x <- c();
  f1 <- c();
  for(i in seq(start, end, interval)){
    # Of the first i rows, the first m train the model and the rest validate it
    m <- floor(i * partition);
    training_dataset <- dataset[1:m, ];
    test_dataset <- dataset[(m+1):i, ];
    train_actual <- unlist(training_dataset[column]);
    test_actual <- unlist(test_dataset[column]);
    predictor_formula <- predictor(training_dataset);
    # Predicted probabilities at or above the threshold z count as dropouts
    train_pred <- predict(predictor_formula, newdata=training_dataset, type="response") >= z;
    test_pred <- predict(predictor_formula, newdata=test_dataset, type="response") >= z;
    train_cost <- costFunction(train_actual, train_pred);
    test_cost <- costFunction(test_actual, test_pred);
    f1 <- c(f1, f1Score(test_actual, test_pred));
    x <- c(x, i);
    train_plot <- c(train_plot, train_cost);
    test_plot <- c(test_plot, test_cost);
  }
  return(list(train=train_plot, test=test_plot, m=x, f1=f1));
}

5.1.7 MYSQL.R

library(RMySQL)

db = dbConnect(MySQL(), user='root', password='', dbname='mobile_crm',

host='localhost')

5.1.8 PERCRANK.R

percRank <- function(x) trunc(rank(x)) / length(x)
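The way generateDataset.R uses `percRank` to label dropouts can be traced on a small example. The weightage values and the dropout percentage of 0.25 below are made up for the illustration.

```r
percRank <- function(x) trunc(rank(x)) / length(x)

# Hypothetical summed weightages for five students
weightage <- c(1.2, 2.5, 0.8, 3.1, 1.9)

# Percentile rank of each student: rank divided by the number of students
w_rank <- percRank(weightage)

# Students ranked in the top 25% by weightage are labeled as dropouts
dropout <- w_rank >= (1 - 0.25)
```

The ranks are 2, 4, 1, 5 and 3, so `w_rank` is 0.4, 0.8, 0.2, 1.0 and 0.6, and only the second and fourth students reach the 0.75 cut-off.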

5.1.9 PLOTLEARNINGCURVE.R

plotLearningCurve <- function(m, train_plot, test_plot, title, xlab, ylab, rnge=range(0, 0.15)){
  # Training error in red; the axis labels are added by title() below
  plot(m, train_plot, type="l", col="red", xlab=NA, ylab=NA, ylim=rnge);
  # Overlay the cross-validation error in green on the same axes
  lines(m, test_plot, col="green");
  legend('topright', c("Training", "C.V"),
         bty="n", lty=1, lwd=0.5, cex=0.5,
         col=c('red', 'green'));
  title(title,
        xlab=xlab,
        ylab=ylab);
}

5.1.10 PREDICTION.R

setwd('/Users/ramathreya/Sites/foss-project/r');

source('predictor.R');

dataset <- read.csv(file="input.csv");

z <- read.csv(file="out.z")

z <- z[1, 'x'];

predictor_formula <- predictor(dataset);

input <- read.csv('predict-input.csv');

dataset <- rbind(dataset, input)

l <- nrow(dataset);

prediction <- predict(predictor_formula, newdata=dataset,

type="response");

prediction <- (prediction[l] >= z);

fileConn<-file("output")

writeLines(c(toString(prediction)), fileConn)

close(fileConn)

5.1.11 PREDICTOR.R

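predictor <- function(dataset){
  # Logistic regression of Dropout on the six student attributes.
  # When the columns are factors, cbind() replaces them by their integer
  # codes, so each attribute enters the model as one numeric column.
  formula <- glm(
    formula = Dropout ~ cbind(Gender, Poverty, Community, Rural, PTR, SCR),
    family = binomial,
    data = dataset);
  return(formula);
}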

5.1.12 RANDOMIZEDATASET.R

randomizeDataset <- function(dataset){
  l <- nrow(dataset);
  # Random permutation of the row indices
  s <- sample(seq(1, l), l);
  # Reorder all rows at once and renumber them from 1
  result <- dataset[s, ];
  rownames(result) <- NULL;
  return(result);
}

5.2 NODE.JS

5.2.1 APP.JS

;

var express = require('express');

var http = require('http');

var path = require('path');

var bodyParser = require('body-parser');

app = express();

app.configure(function() {

app.set('views', __dirname + '/app/views');

app.set('view engine', 'jade');

app.use(express.static(path.join(__dirname, 'public')));

app.use(express.cookieParser());

app.use(express.methodOverride());

app.use(express.session({secret: 'keyboard cat'}));

app.use(bodyParser.json());

app.use(express.json()); // to support JSON-encoded bodies

app.use(express.urlencoded()); // to support URL-encoded bodies

app.locals.basedir = path.join(__dirname, '/app/views');

app.use(app.router);

app.basepath = __dirname;

require('./routes')();

http.createServer(app).listen(3000, function() {

console.log('Server Started');

});

});

5.2.2 PACKAGE.JSON

{
  "name": "foss-project",
  "scripts": {
    "start": "node app"
  },
  "dependencies": {
    "body-parser": "^1.5.2",
    "connect": "*",
    "express": "3.4.0",
    "formidable": "1.0.15",
    "jade": "*",
    "request": "2.x"
  },
  "engines": {
    "node": "0.10.x",
    "npm": "1.2.x"
  }
}

5.2.3 ROUTES.JS

;

var formidable = require('formidable'),

util = require('util'),

fs = require('fs'),

sys = require('sys'),

exec = require('child_process').exec;

module.exports = function() {

app.get('/', function(req, res) {

res.render('index');

});

app.post('/upload', function(req, res) {

// parse a file upload

var form = new formidable.IncomingForm();

form.parse(req, function(err, fields, files) {

//Write to CSV file within r folder

fs.readFile(files.upload.path, function(err, data) {

var newPath = __dirname + "/r/input.csv";

fs.writeFile(newPath, data, function(err) {

function puts(error, stdout, stderr) {

res.render('upload');

}

exec("Rscript r/init.R", puts);

});

});

});

return;

});

app.get('/predict', function(req, res) {

res.render('predict');

});

app.post('/predict', function(req, res) {

var json = JSON.parse(req.body.json);

var key_string = '"",', value_string = '"",';

for(var i in json){

key_string += json[i].name + ',';

value_string += json[i].value + ',';

}

key_string += 'Dropout'

value_string += '""';

var string = key_string + '\n' + value_string + '\n';

fs.writeFile('r/predict-input.csv', string, function(err) {

function puts(error, stdout, stderr) {

fs.readFile('r/output', 'utf-8', function(err, data) {

res.end(data);

});

}

exec("Rscript r/prediction.R", puts);

});

});

};

5.2.4 INDEX.JADE

doctype html
html
  head
    title Dashboard
    meta(charset="UTF-8")
    meta(content='width=device-width, initial-scale=1, maximum-scale=1, user-scalable=no' name='viewport')
    link(rel="stylesheet",href="css/bootstrap.min.css",type="text/css")
    link(rel="stylesheet",href="css/font-awesome.min.css",type="text/css")
    link(rel="stylesheet",href="css/ionicons.min.css",type="text/css")
    link(rel="stylesheet",href="css/morris/morris.css",type="text/css")
    link(rel="stylesheet",href="css/jvectormap/jquery-jvectormap-1.2.2.css",type="text/css")
    link(rel="stylesheet",href="css/bootstrap-wysihtml5/bootstrap3-wysihtml5.min.css",type="text/css")
    link(rel="stylesheet",href="css/AdminLTE.css",type="text/css")
  body(class="skin-blue")
    header(class="header")
      a(href="/",class="logo") FOSS Project
      nav(class="navbar navbar-static-top",role="navigation")
        a(href="#",class="navbar-btn sidebar-toggle",data-toggle="offcanvas",role="button")
          span(class="sr-only") Toggle Navigation
          span(class="icon-bar")
    div(class="wrapper row-offcanvas row-offcanvas-left")
      aside(class="left-side sidebar-offcanvas")
        section(class="sidebar")
          ul(class="sidebar-menu")
            li
              a(href="/")
                i(class="fa fa-upload")
                span Upload
            li
              a(href="/predict")
                i(class="fa fa-search")
                span Predict
      aside(class="right-side")
        section(class="content-header")
          h1 Upload
        section
          div(class="box box-primary")
            form(action="/upload",enctype="multipart/form-data",method="post",role="form")
              div(class="box-body")
                div(class="form-group")
                  input(type="file",name="upload",multiple="multiple")
              div(class="box-footer")
                button(type="submit",class="btn btn-primary") Upload
    script(src="js/jquery.js")
    script(src="js/bootstrap.min.js")
    script(src="js/plugins/jvectormap/jquery-jvectormap-1.2.2.min.js")
    script(src="js/plugins/jvectormap/jquery-jvectormap-world-mill-en.js")
    script(src="js/AdminLTE/app.js")

5.2.5 PREDICT.JADE

doctype html

html

head

title Dashboard





























li

a(href="/")


span Upload

li

a(href="/predict")


span Predict



h1 Predict

section


form(action="#",enctype="multipart/form-data",method="post",role="form",id="form")

div(class="box-body")

div(class="form-group col-md-2")

label Gender

select(class="form-control",name="Gender")

option(value="Male") Male

option(value="Female") Female


label Poverty

select(class="form-control",name="Poverty")

option(value="Yes") Yes

option(value="No") No



label Community

select(class="form-control",name="Community")

option(value="General") General

option(value="OBC") OBC

option(value="SC") SC

option(value="ST") ST


label Rural

select(class="form-control",name="Rural")

option(value="Yes") Yes

option(value="No") No


label PTR

select(class="form-control",name="PTR")

option(value="Low") Low

option(value="Medium") Medium

option(value="High") High


label SCR

select(class="form-control",name="SCR")

option(value="Low") Low

option(value="Medium") Medium

option(value="High") High

div(class="box-footer",style="margin-left: 5px;")

button(type="button",class="btn btn-primary",id="submit") Predict

label(id="outcome",style="margin-left: 10px;")






en.js")


script(type="text/javascript").

$(document).ready(function(){

$('#submit').click(function(){

var json = JSON.stringify($('#form').serializeArray());

$.ajax({

url: '/predict',

method: 'post',

data: {

json: json

},

success: function(response){

var label = $('#outcome');

if(response.indexOf("TRUE") >= 0){

label.css('color', 'red');

label.html('Student will Dropout');

}

else{

label.css('color', 'green');

label.html('Student will Not Dropout');

}

}

});

});

});

5.2.6 UPLOAD.JADE

doctype html

html

head

title Dashboard





























li

a(href="/")


span Upload

li

a(href="/predict")


span Predict



h1 Learning Curves

section


iframe(src="plot.png",style="width: 600px; height: 500px;",frameborder="0")





en.js")


CHAPTER 6

RESULTS

6.1 DATASET UPLOAD

6.2 UPLOAD RESULT

Fig 6.1 : Upload Result

6.3 PREDICTION

Fig 6.2 : Prediction Screen

Fig 6.3 : Predicting Student will not Dropout

Fig 6.4 : Predicting Student will Dropout

CHAPTER 7

CONCLUSIONS

The advent of Information Technology and the Internet has led to vast
amounts of data being gathered and stored in multiple formats by multiple
sources. Both big corporations and Government Agencies are therefore
attempting to tap into these vast troves of data to make better decisions and
create efficient processes. Techniques such as Machine Learning and Neural
Networks, which are commonly grouped under the term Big Data, are changing
the way we analyze information and are adding real value.

This project was inspired by such technologies. The aim was to create an

objective mechanism for solving the dropout problem that could be used for policy

making. This algorithm could provide an objective solution by identifying

vulnerable students who truly need help and thereby improve retention and

completion rates in schools.

Personally, it was a great opportunity for me to explore an area of
programming that I had wanted to learn for some time. At the same time,
getting a chance to work on a real-world problem that is vital to our society
made it all the more worthwhile. I humbly admit that the algorithm developed
is in no way perfect, but it was a determined attempt on my part to show what
is possible. Hopefully those who come after me will take this up and extend
it to the point where it can be of use to Government Agencies and provide
real value to students, who are the final beneficiaries of this system and
the future of our nation.

