A Project Report
On
Data Analysis and Simulation Modeling
Submitted to
Amity University Uttar Pradesh
In partial fulfillment of the requirements for the award of the degree
Of
Bachelor of Technology
In
Computer Science and Engineering
FACULTY GUIDE:
Ms. SHRUTI GUPTA
ASSISTANT PROFESSOR
SUBMITTED BY:
Mr. VARUN SHARMA
A2305213337
7CSE-5X
DECLARATION
I, Varun Sharma, student of B.Tech (Amity School of Engineering and Technology), hereby declare that the project titled “Data Analysis and Simulation Modeling”, which is submitted by me to the Department of Computer Science and Engineering, Amity School of Engineering and Technology, Amity University Uttar Pradesh, Noida, in partial fulfilment of the requirements for the award of the degree of Bachelor of Technology in Computer Science and Engineering, has not previously formed the basis for the award of any degree, diploma or other similar title or recognition.
The Author attests that permission has been obtained for the use of any copyrighted material appearing in the Dissertation / Project report other than brief excerpts requiring only proper acknowledgement in scholarly writing and all such use is acknowledged.
Date:- Signature of Student
Varun Sharma
CERTIFICATE
This is to certify that Mr Varun Sharma, student of B.Tech. in Computer Science and Engineering, has carried out the work presented in the project entitled "Data Analysis and Simulation Modeling" as a part of the Third year Programme of Bachelor of Technology in Computer Science and Engineering from Amity School of Engineering and Technology, Amity University, Noida, Uttar Pradesh, under my supervision.
Signature of the faculty Guide
Ms. Shruti Gupta
Assistant Professor
Department of Computer Science & Engineering
Amity School of Engineering and Technology, AUUP
ACKNOWLEDGEMENTS
The satisfaction that accompanies the successful completion of my task would be incomplete without the mention of the people whose ceaseless cooperation made it possible, and whose constant guidance and encouragement crowned all the efforts with success. I would like to thank Prof (Dr) Abhay Bansal, Joint Acting Head, Head of Department, CSE, Amity School of Engineering and Technology, for giving me the opportunity to undertake this project.
I would like to thank my NTCC guide Ms Shruti Gupta for her guidance and support throughout the duration of my project and for helping me choose this interesting topic.
ABSTRACT
The first half of this report deals with simulation modeling, i.e. generating data via computer simulation when none is available. Simulation models are also created to test a hypothesis when conducting the experiment in real life is not possible or is dangerous. I will be working with stochastic Monte Carlo simulations for this part. The main idea is that we draw the input values of a scenario from a probability distribution, with each iteration giving rise to a distinct case. If we obtain sufficiently many samples, then by the law of large numbers the average result will be close to the true value.
In the second half, I discuss data analysis and making predictions from learning examples. I deal mostly with supervised machine learning and use different models to analyze a given dataset. I apply the gradient descent algorithm for linear regression to a given training set and predict values for unknown cases.
CONTENTS
Declaration
Certificate
Acknowledgements
Abstract
Monte Carlo Simulations
HIV Virus
Gradient Descent for Linear Regression [3]
MONTE CARLO SIMULATIONS
Monte Carlo simulation [1] is a computerized mathematical technique that allows people to account for risk in quantitative analysis and decision making. The technique is used by professionals in such widely disparate fields as finance, project management, energy, manufacturing, engineering, research and development, insurance, oil & gas, transportation, and the environment.
My first two simulation models are basic probability problems.
First Model:
You are given 6 balls in a bag: three are white and the other three are black. You pick three balls with your eyes closed. Find the probability that all three are of the same color.
I will be supplementing this report with a copy of source code for each simulation.
Running this simulation 500,000 times, we get:
Output: 0.099574
This is very close to the exact value given by probability theory, i.e. 0.1.
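The exact value can be checked by counting. There are two ways to choose the color, and all three picked balls must then come from the three balls of that color. A short derivation, using only the counts stated in the problem:

```latex
P(\text{all three same color})
  = \frac{2\binom{3}{3}}{\binom{6}{3}}
  = \frac{2 \cdot 1}{20}
  = 0.1
```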
Modification of the model:
Everything is the same, but this time you are given 8 balls in total, 4 of each color. You still pick 3 balls at random. Running this simulation the same 500,000 times, we get:
Output: 0.143306
This is very close to the exact value of 1/7 ≈ 0.1429.
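The report's own source code is not reproduced here, so the following is only a minimal sketch of how both ball-picking models could be simulated. The function names `simulate_same_color` and `exact_same_color`, and the fixed seed, are illustrative assumptions.

```python
import random
from math import comb


def simulate_same_color(balls_per_color, picks, trials, seed=0):
    """Estimate the probability that all picked balls share one color."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    bag = ["W"] * balls_per_color + ["B"] * balls_per_color
    hits = 0
    for _ in range(trials):
        drawn = rng.sample(bag, picks)  # draw without replacement
        if len(set(drawn)) == 1:        # only one color present
            hits += 1
    return hits / trials


def exact_same_color(balls_per_color, picks):
    """Exact probability by counting: 2 * C(n, k) / C(2n, k)."""
    return 2 * comb(balls_per_color, picks) / comb(2 * balls_per_color, picks)
```

Calling `simulate_same_color(3, 3, 500_000)` corresponds to the first model and `simulate_same_color(4, 3, 500_000)` to the modified one. By the law of large numbers, the estimates approach `exact_same_color(3, 3)` and `exact_same_color(4, 3)` as the number of trials grows.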
HIV VIRUS
This simulation [2] has two models. In the first one, we do not give any drugs to the patient and let the virus propagate freely. We get the expected result: the virus propagates without any barrier and grows exponentially.
Simulation without any drugs:
Number of initial Viruses: 100
Max Population: 1000
Maximum Birth Probability: 0.1
Clear probability of virus: 0.05
Number of Trials: 90
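The report does not list the update rule of its simulation, so the following is only a rough sketch of one way the drug-free model could be written with the parameters above. The reproduction rule (birth probability scaled down as the population approaches the maximum), the number of time steps `steps`, and the function name are all assumptions made for illustration.

```python
import numpy as np


def simulate_without_drugs(initial=100, max_pop=1000, max_birth_prob=0.1,
                           clear_prob=0.05, trials=90, steps=300, seed=0):
    """Return the average virus population at each time step over all trials."""
    rng = np.random.default_rng(seed)
    totals = np.zeros(steps)
    for _ in range(trials):
        pop = initial
        for t in range(steps):
            # Each virus is cleared independently with probability clear_prob.
            survivors = pop - rng.binomial(pop, clear_prob)
            # Assumed rule: reproduction slows as the population nears max_pop.
            density = min(survivors / max_pop, 1.0)
            births = rng.binomial(survivors, max_birth_prob * (1.0 - density))
            pop = survivors + births
            totals[t] += pop
    return totals / trials
```

Drawing the number of cleared and newborn viruses from a binomial distribution is equivalent to flipping one biased coin per virus, but much faster. Plotting the returned array against the time step gives the average population curve for this assumed rule.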
Simulation with Drugs:
This time we run the simulation with drugs and change the drug after 150 cycles. We get the following graph:
Initially, the viruses grow slowly, picking up resistances on the way. When we change the drug given to the patient, the virus population drops significantly. In the meantime, the average population resistant to the given drugs starts to rise. After a few life cycles, the average virus population equals the average resistant population, which means that only the viruses that developed resistance survived and, in the end, every virus was resistant.
This shows the importance of switching drugs for patients with diseases such as HIV, where the virus can develop resistance to drugs fairly quickly.
Machine Learning
GRADIENT DESCENT FOR LINEAR REGRESSION [3]
At a theoretical level, gradient descent is an algorithm that minimizes functions. Given a function defined by a set of parameters, gradient descent starts with an initial set of parameter values and iteratively moves toward a set of parameter values that minimize the function. This iterative minimization is achieved using calculus, taking steps in the negative direction of the function gradient.
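For linear regression with one variable, this idea can be written out as follows. This is the standard textbook formulation rather than something taken from the report's source code. The symbols $m$ (number of training examples), $\alpha$ (learning rate) and $h_\theta$ (hypothesis function) are introduced here for illustration.

```latex
\begin{aligned}
h_\theta(x) &= \theta_0 + \theta_1 x \\
J(\theta_0, \theta_1) &= \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \\
\theta_0 &:= \theta_0 - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \\
\theta_1 &:= \theta_1 - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}
\end{aligned}
```

Both parameters are updated simultaneously in each iteration, and the two sums are the partial derivatives of $J$ with respect to $\theta_0$ and $\theta_1$.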
It’s sometimes difficult to see how this mathematical explanation translates into a practical setting, so it’s helpful to look at an example. The canonical example when explaining gradient descent is linear regression.
Here, I will implement linear regression with one variable to predict profits for a food truck. In this scenario, a restaurant franchise is considering different cities for opening a new outlet. The chain already has trucks in various cities, and we have data on profits and populations for those cities.
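A minimal sketch of batch gradient descent for this kind of one-variable problem is given below. It is not the report's own code. The function names and the default learning rate and iteration count are illustrative assumptions, and both usually need tuning for a particular dataset.

```python
import numpy as np


def compute_cost(X, y, theta):
    """Squared-error cost J(theta) = (1 / 2m) * sum((X @ theta - y) ** 2)."""
    residuals = X @ theta - y
    return float(residuals @ residuals) / (2 * len(y))


def gradient_descent(x, y, alpha=0.01, iterations=1500):
    """Fit y ~ theta[0] + theta[1] * x and return (theta, cost history)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m = len(y)
    X = np.column_stack([np.ones(m), x])  # column of ones for the intercept
    theta = np.zeros(2)
    history = []
    for _ in range(iterations):
        gradient = X.T @ (X @ theta - y) / m  # partial derivatives of J
        theta = theta - alpha * gradient      # simultaneous update
        history.append(compute_cost(X, y, theta))
    return theta, history
```

The cost history is returned so that it can be plotted against the iteration number. A curve that keeps decreasing and then flattens is a simple sign that the learning rate is reasonable.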
Data Plot:-
Running Gradient Descent:-
Visualization of Cost Function with Respect to Θ0 and Θ1:-
Contour Map:-
Theta found by gradient descent: -3.630291, 1.166362
For the city with a population of 35,000, we predict a profit of 4519.767868
For the city with a population of 70,000, we predict a profit of 45342.450129
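The two predictions can be re-derived from the reported theta values. The report does not state the units of the data, so the scaling used below (population and profit both in units of 10,000) is an assumption. It is chosen because it reproduces the reported figures up to the rounding of theta. The function name is hypothetical.

```python
# Theta values as reported in the text (rounded to six decimals).
THETA0, THETA1 = -3.630291, 1.166362


def predict_profit(population):
    """Evaluate h(x) = theta0 + theta1 * x under the assumed scaling."""
    x = population / 10_000               # assumed: population in 10,000s
    return (THETA0 + THETA1 * x) * 10_000  # assumed: profit in 10,000s
```

For example, `predict_profit(35_000)` evaluates (-3.630291 + 1.166362 × 3.5) × 10,000, which agrees with the first reported prediction to within a few hundredths.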
The prediction for a 3-bedroom house with an area of 1650 sq. ft: $289314.620338
Machine Learning for Indian Railways
INTRODUCTION
With advanced computers and storage techniques available, Indian Railways has the capability to generate and store data like never before. The problem arises when this data becomes so large that it cannot be analyzed by conventional methods. But the possibilities remain enormous. CRIS is currently working on models to predict train arrival delays, possible component breakdowns, and more.
TRAIN ARRIVAL TIMES – FUTURE ASPECTS
Whenever a train arrives late, it inconveniences passengers, delays schedules and raises questions about the reliability of Indian Railways’ services. Events like these tend to follow patterns, and train arrival times are no exception. When we analyze weather, season, date and time, we see a pattern in how these factors affect arrival times. Beyond that, we learn the ‘hotspots’ of delays in train arrivals. With this data, we can predict the chance of any train being late (and by how much) at any particular time by feeding these factors into the system. This helps us plan ahead and provide a better service.
ELECTRIFICATION OF RAILWAY TRACK
Data:-
Linear Regression (Using Gradient Descent):-
For year = 2017, we predict: 1822.712573 Kms
For year = 2018, we predict: 1937.783785 Kms
3D Visualization of Cost Function:-
Contour Map of Cost Function:-
INDIAN RAILWAY EARNINGS
Data:-
Linear Regression:-
For year = 2017, we predict Earnings (in 1000 crores) of: 185.212075
For year = 2018, we predict Earnings (in 1000 crores) of: 197.036802
Polynomial Regression:
Theta found by gradient descent: 102.183719 26.968488 7.048750 1.294778
Z =
74.2633
80.6250
88.0149
96.9615
107.9933
121.6391
138.4273
158.8866
For year = 2014, our model predicts Earnings of: 158.886576
Against actual Earnings of: 157.8810
For year = 2015, we predict Earnings of: 183.545500
For year = 2016, we predict Earnings of: 212.932667
For year = 2017, we predict Earnings of: 247.576667
For year = 2018, we predict Earnings of: 288.006091
Linear Vs. Polynomial:
As you can see, there is a drastic difference between the predictions of my linear model and my polynomial (cubic) model. While the linear model predicts earnings of Rs. 185.21 thousand crores in 2017, the polynomial model predicts Rs. 247.58 thousand crores.
These two models illustrate the concept of under-fitting: the linear model underfits the data, whereas the cubic one is ‘just right’. If we go further than this, we risk over-fitting the data, so that the model fails to ‘generalize’ and its future predictions become useless. Following is an example of a ‘just right’ and an over-fit model:-
Why use Gradient Descent?
In regression problems, our main aim is to find the ‘Thetas’ of our hypothesis function that fit the data, so that we get an equation into which we can put new feature values and obtain a predicted ‘Y’ value. Another way to fit the data is the Normal Equation. It solves for the Thetas directly, but it is computationally heavy and may take a long time to complete for large data sets, particularly when the number of features is large. So we use Gradient Descent, a greedy, iterative algorithm that gets us close to the optimal values in a reasonable amount of time. However, Gradient Descent is susceptible to local minima, so the cost function should be ‘convex’ if we are using it. The squared-error cost used for linear regression is convex.
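The two approaches can be compared on a small example. The sketch below is illustrative only: the synthetic data produced by `make_example` is hypothetical and not from the report, and the function names and default settings are assumptions.

```python
import numpy as np


def normal_equation(X, y):
    """Closed-form least-squares fit: solve (X^T X) theta = X^T y."""
    # Solving the linear system avoids forming an explicit matrix inverse.
    return np.linalg.solve(X.T @ X, X.T @ y)


def gradient_descent(X, y, alpha=0.1, iterations=2000):
    """Iterative fit that minimizes the same squared-error cost."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        theta -= alpha * (X.T @ (X @ theta - y)) / m
    return theta


def make_example(m=200, seed=0):
    """Synthetic data (hypothetical): y is roughly 1.5 + 0.8 * x plus noise."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=m)
    y = 1.5 + 0.8 * x + rng.normal(scale=0.1, size=m)
    X = np.column_stack([np.ones(m), x])  # intercept column plus one feature
    return X, y
```

Because the squared-error cost is convex, both methods arrive at the same Thetas on this data. The difference is cost: the Normal Equation solves a system whose size grows with the number of features, while each Gradient Descent step needs only matrix-vector products.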
FITTING THE CLOSEST MODEL
Indian Railways, Traction and Non-Traction Power Consumption
Dataset:
Model 1 (N4):
Result:
For year = 2014, our model predicts Non-Traction Power Consumption of: 17.350000 (100 Million Watts)
Against actual consumption of: 17.35 (100 Million Watts)
For year = 2015, we predict Non-Traction Power Consumption of: 35.470000
For year = 2016, we predict Non-Traction Power Consumption of: 89.270000
For year = 2017, we predict Non-Traction Power Consumption of: 208.150000
For year = 2018, we predict Non-Traction Power Consumption of: 429.670000
Observation:
This model tries ‘too hard’ to fit the training set and fails to ‘generalize’, so it becomes useless for future predictions. Hence we get these inflated values.
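The effect can be illustrated with a polynomial that passes exactly through every training point. The sketch below is not the report's Model 1 and does not reproduce its numbers. It uses the four non-traction values quoted in the report (370, 1135, 1367 and 1735 million watts) in units of 100 million watts. The year labels, with the irregular value placed first, and the function name are assumptions.

```python
import numpy as np

# Assumed ordering: four consecutive years ending at 2014.
YEARS = np.array([2011, 2012, 2013, 2014])
VALUES = np.array([3.70, 11.35, 13.67, 17.35])  # units of 100 million watts


def fit_and_extrapolate(degree, future_years):
    """Fit a polynomial of the given degree; return (max training error, forecasts)."""
    x = YEARS - 2014  # shift years to keep the fit well conditioned
    coeffs = np.polyfit(x, VALUES, degree)
    train_error = float(np.max(np.abs(np.polyval(coeffs, x) - VALUES)))
    forecasts = np.polyval(coeffs, np.asarray(future_years) - 2014)
    return train_error, forecasts
```

With `degree=3` the training error is essentially zero, because a cubic can pass through any four points, yet the forecasts for later years climb far faster than those of a straight line (`degree=1`). A perfect fit on the training set says little about how well a model generalizes.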
Model 2 (Linear):
Result:
For year = 2014, our model predicts Non-Traction Power Consumption of: 18.008000 (100 Million Watts)
Against actual consumption of: 17.35 (100 Million Watts)
For year = 2015, we predict Non-Traction Power Consumption of: 22.335000
For year = 2016, we predict Non-Traction Power Consumption of: 26.662000
For year = 2017, we predict Non-Traction Power Consumption of: 30.989000
For year = 2018, we predict Non-Traction Power Consumption of: 35.316000
Observation:
This model ‘fits better’ and generalizes well enough to make a somewhat ‘worthy’ prediction. However, our training example from 2013 gives a value of 370 Million Watts, which is out of line with the values from the consecutive years: 1135, 1367, 1735. This introduces an irregularity in the data.
Model 3 (Linear with removed Irregularities):
Result:
For year = 2014, our model predicts Non-Traction Power Consumption of: 17.123333 (100 Million Watts)
Against actual consumption of: 17.35 (100 Million Watts)
For year = 2015, we predict Non-Traction Power Consumption of: 20.123333
For year = 2016, we predict Non-Traction Power Consumption of: 23.123333
For year = 2017, we predict Non-Traction Power Consumption of: 26.123333
For year = 2018, we predict Non-Traction Power Consumption of: 29.123333
--------------------------
For year = 2014, our model predicts Traction Power Consumption of: 13.390667 (1000 Million Watts)
Against actual consumption of: 13.76 (1000 Million Watts)
For year = 2015, we predict Traction Power Consumption of: 16.697667
For year = 2016, we predict Traction Power Consumption of: 20.004667
For year = 2017, we predict Traction Power Consumption of: 23.311667
For year = 2018, we predict Traction Power Consumption of: 26.618667
Observation:
In this model, we try not to ‘overfit’ the data and have removed the irregularities. A few irregularities would not have been a problem if our training set were large, but since we are working with only 4 training examples, an irregularity in one of them leads to large deviations, as it makes up 25% of the training set.
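The effect of dropping the irregular example can be checked with an ordinary least-squares line. The sketch below uses the four non-traction values quoted in the report, in units of 100 million watts. The assignment of years is an assumption: the values are taken to be one year apart and to end at 2014, with the irregular value first, because that ordering reproduces the reported non-traction outputs of Model 2 and Model 3. The function name is hypothetical.

```python
import numpy as np

# Assumed ordering of the quoted values (units of 100 million watts).
YEARS = np.array([2011, 2012, 2013, 2014])
VALUES = np.array([3.70, 11.35, 13.67, 17.35])


def linear_forecast(years, values, target_year):
    """Least-squares straight line through (years, values), evaluated at target_year."""
    x = np.asarray(years, dtype=float) - 2014  # shift for numerical stability
    slope, intercept = np.polyfit(x, values, 1)
    return intercept + slope * (target_year - 2014)
```

Using all four points, `linear_forecast(YEARS, VALUES, 2014)` gives about 18.008 with a slope of about 4.327 per year. Dropping the first point, `linear_forecast(YEARS[1:], VALUES[1:], 2014)` gives about 17.1233 with a slope of 3.0 per year. The single irregular value therefore changes the fitted yearly growth by more than 40%.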
Adding more training examples, we get more refined results:
For year = 2014, our model predicts Non-Traction Power Consumption of: 16.754000 (100 Million Watts)
Against actual consumption of: 17.35 (100 Million Watts)
For year = 2015, we predict Non-Traction Power Consumption of: 19.477000
For year = 2016, we predict Non-Traction Power Consumption of: 22.200000
For year = 2017, we predict Non-Traction Power Consumption of: 24.923000
For year = 2018, we predict Non-Traction Power Consumption of: 27.646000
--------------------------
For year = 2014, our model predicts Traction Power Consumption of: 12.944800 (1000 Million Watts)
Against actual consumption of: 13.76 (1000 Million Watts)
For year = 2015, we predict Traction Power Consumption of: 15.917400
For year = 2016, we predict Traction Power Consumption of: 18.890000
For year = 2017, we predict Traction Power Consumption of: 21.862600
For year = 2018, we predict Traction Power Consumption of: 24.835200
REFERENCES
[1] "Monte Carlo Simulation," [Online]. Available: http://www.palisade.com/risk/monte_carlo_simulation.asp.
[2] X. L. Ronald H. Gray, "Stochastic simulation of the impact of antiretroviral," [Online]. Available: http://www.who.int/hiv/events/artprevention/gray_stochastic.pdf.
[3] M. Nedrich, "An Introduction to Gradient Descent and Linear Regression," [Online]. Available: https://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression/.