FORECASTING A STUDENT’S EDUCATION
FULFILLMENT USING REGRESSION ANALYSIS
Submitted by
RAM G ATHREYA
Roll No.: 1202FOSS0019 Reg. No.: 75812200021
A PROJECT REPORT
Submitted to the
FACULTY OF SCIENCE AND HUMANITIES
in partial fulfillment for the requirement of award of the degree of
MASTER OF SCIENCE IN
FREE / OPEN SOURCE SOFTWARE (CS-FOSS)
CENTRE FOR DISTANCE EDUCATION ANNA UNIVERSITY CHENNAI 600 025
AUGUST 2014
CENTRE FOR DISTANCE EDUCATION
ANNA UNIVERSITY
CHENNAI 600 025
BONA FIDE CERTIFICATE
Certified that this Project report titled “FORECASTING A STUDENT’S EDUCATION
FULFILLMENT USING REGRESSION ANALYSIS” is the bona fide work of Mr. RAM G
ATHREYA, who carried out the research under my supervision. I certify further, that
to the best of my knowledge the work reported herein does not form part of any
other Project report or dissertation on the basis of which a degree or award was
conferred on an earlier occasion on this or any other candidate.
RAM G ATHREYA, Student at Anna University
Dr. SRINIVASAN SUNDARARAJAN, Professor
CERTIFICATE OF VIVA-VOCE-EXAMINATION
This is to certify that Thiru/Mr. RAM G ATHREYA
(Roll No. 1202FOSS0019; Register No. 75812200021) has been subjected to
Viva-voce-Examination on 14 September 2014 at 9:30 AM at the Study centre, The
AU-KBC Research Centre, Madras Institute of Technology, Anna University, Chrompet,
Chennai 600044.
Internal Examiner External Examiner
Name : Name :
(in capital letters) (in capital letters)
Designation : Designation :
Address : Address :
Coordinator centre
Name :
(in capital letters)
Designation :
Address :
Date :
ACKNOWLEDGEMENT
I am highly indebted to my guide Dr. SRINIVASAN SUNDARARAJAN for his
guidance, monitoring, constant supervision, kind co-operation and encouragement,
which helped me complete this project.
I would also like to express my special gratitude to the AU-KBC faculty members
involved in the M.Sc. (CS-FOSS) course for their cordial support and guidance, for
providing the necessary information regarding the project and for their support in
completing it.
Finally, I thank the Centre for Distance Education, Anna University for giving me
an opportunity to do this project.
ABSTRACT
Our government spends a substantial amount of resources on educating our
children. In addition, several welfare schemes aimed especially at
underprivileged children have been introduced to ensure that all of them
complete a basic level of education. In spite of these measures, many students
do not complete their basic education.
The aim of this project is to formulate a supervised learning algorithm
that will help identify students who have a higher likelihood of not
completing their education.
To do this, the algorithm performs logistic regression analysis on
historical data of students from a given school. The historical data
includes basic background information (features) such as gender, community and
number of siblings. The historical data also records whether each student
completed his/her education, which is the outcome we are interested in.
Typically, a student who finished education is denoted by a value of 1 and a
student who did not finish by a value of 0.
A logistic classifier can be built from this training (historical) data. After
learning from the training set, the classifier assigns a specific weight to
each of the features. These weights are then combined into an equation
that can be used for prediction.
That is, we can apply the equation to a current student (whose background
we already know) to calculate the probability that he/she will complete his/her
education.
Such an algorithm will be beneficial to government agencies, since it can
serve as an early warning system that lets them take proactive action
to prevent a student from dropping out. Policy makers can also use it as a tool to
identify schools that are more vulnerable and direct their resources and energies
towards helping them.
ABSTRACT IN TAMIL
[Tamil translation of the abstract above. The Tamil text did not survive character encoding and cannot be recovered here.]
TABLE OF CONTENTS
CHAPTER NO TITLE
ACKNOWLEDGEMENT
ABSTRACT
ABSTRACT IN TAMIL
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
1 INTRODUCTION
1.1 OVERVIEW OF THE PROJECT
1.2 LITERATURE SURVEY
1.3 PROPOSED SYSTEM
1.4 SCOPE
2 REQUIREMENT SPECIFICATION
2.1 INTRODUCTION
2.2 OVERALL DESCRIPTION
2.2.1 PRODUCT PERSPECTIVE
2.2.2 PRODUCT FUNCTIONS
3 PROJECT REQUIREMENTS
3.1 SOFTWARE REQUIREMENTS
3.2 HARDWARE REQUIREMENTS
4 SYSTEM DESIGN
4.1 METHODOLOGY
4.2 ALGORITHM
4.2.1 SUPERVISED LEARNING
4.2.2 CLASSIFICATION
4.2.3 LOGISTIC REGRESSION
4.3 DATA COLLECTION
4.3.1 FEATURE DETECTION
4.3.1.1 PERSONAL
4.3.1.2 ENVIRONMENTAL
4.3.1.3 SCHOOL
4.3.2 DATASET GENERATION
4.4 MODELING
4.4.1 HYPOTHESIS DEVELOPMENT
4.4.2 GENERALIZATION ERROR
4.5 VALIDATION
4.5.1 DATASET PARTITIONING
4.5.1.1 TRAINING DATASET
4.5.1.2 CV DATASET
4.5.2 COST FUNCTION
4.5.3 ERROR METRICS
4.5.3.1 TRAINING AND CV ERROR
4.5.3.2 F1 SCORE
4.5.3.3 W – SCORE
4.5.4 LEARNING CURVES
4.6 PREDICTION
5 IMPLEMENTATION
5.1 R
5.1.1 COST FUNCTION.R
5.1.2 F1SCORE.R
5.1.3 GENERATEDATASET.R
5.1.4 GENERATEVECTOR.R
5.1.5 INIT.R
5.1.6 LEARNINGCURVE.R
5.1.7 MYSQL.R
5.1.8 PERCRANK.R
5.1.9 PLOTLEARNINGCURVE.R
5.1.10 PREDICTION.R
5.1.11 PREDICTOR.R
5.1.12 RANDOMIZEDATASET.R
5.2 NODE.JS
5.2.1 APP.JS
5.2.2 PACKAGE.JSON
5.2.3 ROUTES.JS
5.2.4 INDEX.JADE
5.2.5 PREDICT.JADE
5.2.6 UPLOAD.JADE
6 RESULTS
6.1 DATASET UPLOAD
6.2 UPLOAD RESULT
6.3 PREDICTION
7 CONCLUSIONS
8 REFERENCES
LIST OF FIGURES
FIGURE NO TITLE
4.1 Logistic Regression Curve
4.2 Dataset Generation
4.3 Modeling
4.4 Dataset Partitioning
4.5 Developing Multiple Models
4.6 Calculating Cross-Validation Errors
4.7 Single Subject Learning
4.8 Learning from Experience
4.9 Score & Learning Time vs Experience
4.10 Training & Cross-Validation Error Convergence
4.11 Choosing the Best Model
4.12 Prediction
6.1 Upload Result
6.2 Prediction Screen
6.3 Predicting Student will not Dropout
6.4 Predicting Student will Dropout
LIST OF TABLES
TABLE NO TITLE
4.1 Sample Dataset
LIST OF ABBREVIATIONS
FOSS Free and Open Source Software
IDE Integrated Development Environment
OS Operating System
PTR Pupil Teacher Ratio
SCR Student Classroom Ratio
CHAPTER 1
INTRODUCTION
1.1 OVERVIEW OF THE PROJECT
Dropout is a universal phenomenon in the education system of India. It is
spread across all levels of education, all parts of the country and all
socio-economic groups, and dropout rates are much higher in educationally
backward states and districts. Girls in India tend to have higher dropout rates
than boys. Similarly, children belonging to socially disadvantaged groups such
as the Scheduled Castes and Scheduled Tribes have higher dropout rates than the
general population.
There are also regional and location-wise differences: children living in
rural areas are more likely to drop out of school. In order to reduce wastage and
improve the efficiency of the education system, educational planners need to
understand and identify the social groups that are more susceptible to dropout and
the reasons for their dropping out.
Keeping the above context in perspective, it would be helpful to develop a
system or an algorithm that can systematically identify such vulnerable students
who have a higher likelihood of dropping out from school. The goal of this project
is to develop such an algorithm or system.
Hopefully such an algorithm or system could assist educational planners and
administrative staff of educational institutions to better allocate resources and
make better decisions, which could curb this growing dropout problem.
1.2 LITERATURE SURVEY
The literature survey covers existing research and studies with respect to the
dropout problem. They are grouped into three broad categories:
1 Research Papers
2 Surveys
3 Government Reports
The detailed list of resources researched during the literature survey is
provided in the references section.
1.3 PROPOSED SYSTEM
The proposed system will implement an algorithm that takes student
data as input and learns from it. The learned function, also called the
hypothesis, serves as an approximate explanation of the data. Error metrics and
validation techniques will be used to determine the accuracy of the hypothesis.
The hypothesis that best fits the data will then be used for prediction. The final
goal of the algorithm is to make reasonably accurate predictions for new unlabeled
data, that is, data for which the outcome is unknown.
This system will be implemented in such a way that it can be operated from a
web interface where the user can upload datasets as well as make predictions
based on learned data.
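As a rough illustration of what "learning a hypothesis" means here, the sketch below fits a small logistic regression model by batch gradient descent and then scores an unlabeled student. It is written in JavaScript for Node.js rather than R, and it is not the report's implementation: the function names (sigmoid, hypothesis, train), the learning rate, the iteration count and the six-row toy dataset are all assumptions made up for this example.

```javascript
// Illustrative sketch only: names, settings and data are hypothetical.

// The sigmoid squashes any real-valued score into a probability in (0, 1).
function sigmoid(z) {
  return 1 / (1 + Math.exp(-z));
}

// Hypothesis: estimated probability that a student completes education.
// weights[0] is the bias term; weights[j + 1] is the weight of feature j.
function hypothesis(weights, x) {
  let z = weights[0];
  for (let j = 0; j < x.length; j++) {
    z += weights[j + 1] * x[j];
  }
  return sigmoid(z);
}

// Learn the weights by batch gradient descent on the logistic (log-loss) cost.
function train(X, y, learningRate = 0.1, iterations = 5000) {
  const m = X.length;    // number of training examples
  const n = X[0].length; // number of features
  let weights = new Array(n + 1).fill(0);
  for (let it = 0; it < iterations; it++) {
    const gradient = new Array(n + 1).fill(0);
    for (let i = 0; i < m; i++) {
      const error = hypothesis(weights, X[i]) - y[i];
      gradient[0] += error;
      for (let j = 0; j < n; j++) {
        gradient[j + 1] += error * X[i][j];
      }
    }
    weights = weights.map((w, j) => w - (learningRate / m) * gradient[j]);
  }
  return weights;
}

// Made-up training data. Features: [isFemale, numberOfSiblings].
// Label: 1 = completed education, 0 = dropped out.
const X = [[0, 1], [1, 0], [0, 2], [1, 5], [0, 6], [1, 4]];
const y = [1, 1, 1, 0, 0, 0];

const weights = train(X, y);

// Prediction for students whose outcome is unknown.
const pFew = hypothesis(weights, [0, 1]);  // one sibling
const pMany = hypothesis(weights, [1, 6]); // six siblings
console.log("P(complete), few siblings: ", pFew.toFixed(2));
console.log("P(complete), many siblings:", pMany.toFixed(2));
```

In this toy data the outcome depends only on the number of siblings, so the learned weight for that feature comes out negative and the second student receives the lower probability. Real data would need many more rows and features, and the validation step described in the text, before any prediction could be trusted.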
1.4 SCOPE
The algorithm developed is an exploratory proof-of-concept system that
uses machine learning and statistical techniques to make predictions based on
student data. The validity of the results depends entirely on the accuracy of
the data and on how the algorithm processes it.
Since comprehensive student data was not available to tune the algorithm,
this iteration of the system can only serve as a proof-of-concept
of what is possible. In its present form it cannot be used directly in the real
world as a decision-making or policy-making tool.
CHAPTER 2
REQUIREMENT SPECIFICATION
2.1 INTRODUCTION
A software requirements specification (SRS) defines the requirements of a
software system. It is a description of the behavior of a system to be developed
and may include a set of use cases. In addition it also contains non-functional
requirements. Non-functional requirements impose constraints on the design or
implementation (such as performance requirements, quality standards, or design
constraints).
This project requires storage and processing of medium to large volumes of
data. The datasets are first passed through the algorithm during a
training phase, in which the algorithm learns from the training data.
After training is completed, the algorithm is required to make
predictions for new unlabeled data based on what it learned from the training data.
Additionally, it would be helpful if the algorithm could be operated from a
web user interface, which is more user-friendly than issuing commands from
the command line.
2.2 OVERALL DESCRIPTION
This section gives an overall description of the project, including its
different perspectives, constraints, and functional and non-functional
requirements.
2.2.1 PRODUCT PERSPECTIVE
The system has four main tasks:
Data Collection
Modeling
Validation
Prediction
In the data collection phase, the data required for the
algorithm is gathered, converted into a suitable form and supplied to
the system for learning.
In the modeling phase, the algorithm generates models
that try to explain the gathered data. Machine learning
techniques are used in this phase to generate multiple models, of
which the best is chosen in a later stage.
In the validation phase, the different models are evaluated
on their performance and the best among them is chosen as the
candidate that will be used for prediction.
Finally, in the prediction phase, the chosen model is used for
making actual real-world predictions.
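The validation and prediction phases described above can be sketched as follows, again in JavaScript for Node.js. This is only an assumed outline: the helper names (partition, errorRate, chooseBest, thresholdModel), the 75/25 split and the twelve data rows are hypothetical. The candidate models are simple threshold rules standing in for the models that the learning algorithm would produce in the modeling phase.

```javascript
// Illustrative sketch only: names, split ratio and data are hypothetical.

// Split labeled rows into a training set and a cross-validation (CV) set.
// The rows are assumed to have been shuffled beforehand.
function partition(rows, trainFraction) {
  const cut = Math.floor(rows.length * trainFraction);
  return { training: rows.slice(0, cut), cv: rows.slice(cut) };
}

// Fraction of rows that a model classifies incorrectly.
function errorRate(model, rows) {
  const wrong = rows.filter((row) => model.predict(row.siblings) !== row.completed);
  return wrong.length / rows.length;
}

// Validation phase: keep the candidate with the lowest CV error.
function chooseBest(candidates, cvRows) {
  return candidates.reduce((best, model) =>
    errorRate(model, cvRows) < errorRate(best, cvRows) ? model : best
  );
}

// Stand-in for the modeling phase: a rule that predicts completion (1)
// when the student has at most t siblings, and dropout (0) otherwise.
function thresholdModel(t) {
  return {
    name: `siblings <= ${t}`,
    predict: (siblings) => (siblings <= t ? 1 : 0),
  };
}

// Made-up labeled data: completed is 1 (finished) or 0 (dropped out).
const rows = [
  { siblings: 1, completed: 1 }, { siblings: 5, completed: 0 },
  { siblings: 2, completed: 1 }, { siblings: 4, completed: 0 },
  { siblings: 0, completed: 1 }, { siblings: 6, completed: 0 },
  { siblings: 3, completed: 1 }, { siblings: 5, completed: 0 },
  { siblings: 1, completed: 1 }, { siblings: 3, completed: 1 },
  { siblings: 4, completed: 0 }, { siblings: 2, completed: 1 },
];

const { training, cv } = partition(rows, 0.75);
const candidates = [1, 3, 5].map(thresholdModel);

// Report both errors for every candidate, then select on the CV error only.
for (const model of candidates) {
  console.log(model.name, "training error:", errorRate(model, training),
    "CV error:", errorRate(model, cv));
}
const best = chooseBest(candidates, cv);
console.log("chosen model:", best.name);

// Prediction phase: apply the chosen model to a student with unknown outcome.
const prediction = best.predict(2);
console.log("prediction for a student with 2 siblings:", prediction);
```

Selecting on the cross-validation error rather than the training error matters because a model can fit the rows it was trained on well and still generalize poorly to students it has not seen.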
2.2.2 PRODUCT FUNCTIONS
The system has two main functions:
Training
Prediction
In the training phase, the dataset is supplied to the algorithm, which
uses it to develop the best model for prediction.
In the prediction phase, the trained model is put to
use, that is, it is used to make predictions for unlabeled data.
How these processes are implemented is explained in detail in
subsequent sections.
CHAPTER 3
PROJECT REQUIREMENTS
The project requirement is to develop an algorithm that can classify
students according to whether they will complete their education or drop out.
To achieve this, a system needs to be created that can be operated from a web
user interface, through which the user can supply data for training or make
predictions based on an already trained model.
3.1 SOFTWARE REQUIREMENTS
The software requirements for this project are:
R – R is a free software programming language and software
environment for statistical computing and graphics.
Node.js – Node.js is a cross-platform runtime environment and
library for running applications written in JavaScript outside the
browser (for example, on the server).
NetBeans – NetBeans is an integrated development
environment (IDE) intended primarily for Java development, which also
supports other languages and platforms, in particular PHP, C++, Node.js and HTML5.
RStudio – RStudio is a free and open source (FOSS) integrated
development environment for R.
Linux – Linux is a Unix-like, mostly POSIX-compliant computer operating
system (OS) assembled under the model of free and open source software.
The advent of Information Technology and the Internet has led to vast
amounts of data being gathered and stored in multiple formats by multiple sources.
Both big corporations and government agencies are therefore attempting to tap
into these vast troves of data to make better decisions and create more efficient
processes. Techniques such as machine learning and neural networks,
which are often discussed under the banner of Big Data, are changing the way we
analyze information and are adding real value.
This project was inspired by such technologies. The aim was to create an
objective mechanism for addressing the dropout problem that could be used for
policy making. The algorithm could support this by identifying
vulnerable students who truly need help, and thereby improve retention and
completion rates in schools.
Personally, it was a great opportunity for me to explore an area of
programming that I had wanted to learn for some time. Getting a chance to work
on a real-world problem that is vital to our society made it
all the more worthwhile. I humbly admit that the algorithm developed is in no way
perfect, but it was a determined attempt on my part to show what is possible.
I hope that others will take this work up and extend it to the point where it
can be of use to government agencies and provide real value to students, who are
the final beneficiaries of this system and the future of our nation.
CHAPTER 8
REFERENCES
RESEARCH PAPERS
Data Mining: A prediction for Student's Performance Using Classification Method (World Journal of Computer Application and Technology)
A comparative study for predicting student’s academic performance using Bayesian Network Classifiers (IOSR Journal of Engineering)
School Dropout across Indian States and UTs: An Econometric Study (International Research Journal of Social Sciences)
Mining Educational Data to Analyze Students’ Performance (International Journal of Advanced Computer Science and Applications)
Gender Issues and Dropout Rates in India: Major Barrier in Providing Education for All (Amirtham, N. S. & Kundupuzhakkal, S. / Educationia Confab)
Mining Educational Data Using Classification to Decrease Dropout Rate of Students (International Journal of Multidisciplinary Sciences and Engineering)
Predicting Students Academic Performance Using Education Data Mining (International Journal of Computer Science and Mobile Computing)
Prediction of student academic performance by an application of data mining techniques (2011 International Conference on Management and Artificial Intelligence)
Educational Data Mining: A Review of the State-of-the-Art (Transactions on Systems, Man, and Cybernetics)
SURVEYS
School Drop out: Patterns, Causes, Changes and Policies (UNESCO)
The Criticality of Pupil Teacher Ratio (Azim Premji Foundation)
Survey for Assessment of Dropout Rates at Elementary Level in 21 States (EdCIL)
Right to Education Report Card (Annual Status of Education Report 2011)
How High Are Dropout Rates in India? (Economic and Political Weekly, March 17, 2007)
GOVERNMENT REPORTS
Review, Examination and Validation of Data on Dropout in Karnataka (Department of Education, Government of Karnataka)
Drop-out rate at primary level: A note based on DISE 2003-04 & 2004-05 data (National Institute of Educational Planning and Administration)
Dropout in Secondary Education: A Study of Children Living in Slums of Delhi (National University of Educational Planning and Administration)