Report

A Project Report

On

Data Analysis and Simulation Modeling

Submitted to

Amity University Uttar Pradesh

In partial fulfillment of the requirements for the award of the degree

Of

Bachelor of Technology

In

Computer Science and Engineering

FACULTY GUIDE: Ms. SHRUTI GUPTA, ASSISTANT PROFESSOR

SUBMITTED BY: Mr. VARUN SHARMA, A2305213337, 7CSE-5X

DECLARATION

I, Varun Sharma, student of B.Tech (Amity School of Engineering and Technology), hereby declare that the project titled “Data Analysis and Simulation Modeling”, which is submitted by me to the Department of Computer Science and Engineering, Amity School of Engineering and Technology, Amity University Uttar Pradesh, Noida, in partial fulfilment of the requirements for the award of the degree of Bachelor of Technology in Computer Science and Engineering, has not previously formed the basis for the award of any degree, diploma or other similar title or recognition.

The Author attests that permission has been obtained for the use of any copyrighted material appearing in the Dissertation / Project report, other than brief excerpts requiring only proper acknowledgement in scholarly writing, and that all such use is acknowledged.

Date:- Signature of Student

Varun Sharma


CERTIFICATE

This is to certify that Mr. Varun Sharma, student of B.Tech. in Computer Science and Engineering, has carried out the work presented in the project entitled "Data Analysis and Simulation Modeling" as a part of the Third Year Programme of Bachelor of Technology in Computer Science and Engineering from Amity School of Engineering and Technology, Amity University, Noida, Uttar Pradesh, under my supervision.

Signature of the faculty Guide

Ms. Shruti Gupta

Assistant Professor

Department of Computer Science & Engineering

Amity School of Engineering and Technology, AUUP


ACKNOWLEDGEMENTS

The satisfaction that accompanies the successful completion of my task would be incomplete without mention of the people whose ceaseless cooperation made it possible, and whose constant guidance and encouragement crowned all my efforts with success. I would like to thank Prof. (Dr.) Abhay Bansal, Joint Acting Head, Head of Department, CSE, Amity School of Engineering and Technology, for giving me the opportunity to undertake this project.

I would like to thank my NTCC guide Ms Shruti Gupta for her guidance and support throughout the duration of my project and for helping me choose this interesting topic.


ABSTRACT

The first half of this report deals with simulation modeling, i.e. generating data via computer simulation when none is available. Simulation models are also created to test a hypothesis when conducting the experiment in real life is not possible or is dangerous. I will be working with stochastic Monte Carlo simulations for this part. The main idea is that we sample the inputs of a scenario from a probability distribution, with each iteration giving rise to a distinct case. If we obtain sufficiently many samples, then by the law of large numbers the average result will be close to the true value.

In the second half, I discuss data analysis and making predictions based on training examples. I mostly deal with supervised machine learning and use different models to analyze a given dataset. I apply the gradient descent algorithm for linear regression to a given training set and predict values for unseen cases.


CONTENTS

Declaration
Certificate
Acknowledgements
Abstract
Simulation Modeling
  Monte Carlo Simulations
HIV Virus
Gradient Descent for Linear Regression [3]
Multivariate Gradient Descent
Introduction
Train Arrival Times
Electrification of Railway Track
  Data
  Linear Regression (Using Gradient Descent)
  3D Visualization of Cost Function
  Contour Map of Cost Function
Indian Railway Earnings
  Linear Regression
  Polynomial Regression
  Linear Vs. Polynomial
  Why use Gradient Descent?
Fitting the closest model
  Indian Railways, Traction and Non-Traction Power Consumption
    Dataset
    Model 1 (N4)
    Model 2 (Linear)
    Model 3 (Linear with removed Irregularities)
    Adding more Training Examples we get more refined results
References

Data Analysis and Simulation Modeling

SIMULATION MODELING

Monte Carlo Simulations:

Monte Carlo simulation [1] is a computerized mathematical technique that allows people to account for risk in quantitative analysis and decision making. The technique is used by professionals in such widely disparate fields as finance, project management, energy, manufacturing, engineering, research and development, insurance, oil & gas, transportation, and the environment.

My first two simulation models are basic probability problems.

First Model:

You are given 6 balls in a bag: three are white and the other three are black. You pick three balls with your eyes closed. Find the probability that all three are of the same color.

I will be supplementing this report with a copy of source code for each simulation.

Running this simulation 500k times, we get:

Output: 0.099574

This is very close to the exact value from probability theory, 2/C(6,3) = 2/20 = 0.1.

Modification of the model:

Everything is the same, but this time you are given 8 balls in total, 4 of each color. You still have to pick 3 balls at random. Running this simulation the same 500k times, we get:

Output: 0.143306

This is very close to the exact value of 2·C(4,3)/C(8,3) = 8/56 ≈ 0.1429.
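The two experiments above can be sketched in a few lines of Python. This is a minimal illustration and not the source code supplied with the report: the function names `same_color_probability` and `exact_probability` are introduced here, and the demo uses 100,000 trials per case to keep the run short (the report used 500k).

```python
import random
from math import comb


def same_color_probability(per_color, picks, trials, seed=0):
    """Estimate P(all picked balls share one color) by repeated random draws."""
    rng = random.Random(seed)
    bag = ["white"] * per_color + ["black"] * per_color
    hits = 0
    for _ in range(trials):
        drawn = rng.sample(bag, picks)  # draw without replacement
        if len(set(drawn)) == 1:        # every drawn ball has the same color
            hits += 1
    return hits / trials


def exact_probability(per_color, picks):
    """Closed form: 2 * C(per_color, picks) / C(2 * per_color, picks)."""
    return 2 * comb(per_color, picks) / comb(2 * per_color, picks)


# 3 balls of each color, then 4 balls of each color; 3 balls picked each time.
for per_color in (3, 4):
    estimate = same_color_probability(per_color, 3, 100_000)
    print(per_color, estimate, exact_probability(per_color, 3))
```

With enough trials the estimates settle near the exact values 2/20 = 0.1 and 8/56 ≈ 0.143, which is the law of large numbers at work.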


HIV VIRUS

In this simulation [2], we have two models. In the first one, we do not give any drugs to the patient and let the virus propagate freely. We get the expected result: the virus propagates without any barrier and grows exponentially. Simulation without any drugs:

Number of initial Viruses: 100

Max Population: 1000

Maximum Birth Probability: 0.1

Clear probability of virus: 0.05

Number of Trials: 90
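A drug-free run with these parameters might look like the sketch below. The report does not spell out its update rule, so the rule here is an assumption: in every time step each virus is cleared with probability `clear_prob`, and each survivor reproduces with probability `max_birth_prob * (1 - population / max_pop)`. The number of time steps (300) and all function names are likewise illustrative choices.

```python
import random


def simulate_without_drugs(num_viruses=100, max_pop=1000, max_birth_prob=0.1,
                           clear_prob=0.05, steps=300, rng=None):
    """Return the virus population after each time step of one trial."""
    rng = rng or random.Random()
    population = num_viruses
    history = []
    for _ in range(steps):
        # Each virus is cleared independently with probability clear_prob.
        survivors = sum(1 for _ in range(population) if rng.random() >= clear_prob)
        # Assumed rule: the birth chance shrinks as the population nears max_pop.
        birth_prob = max_birth_prob * max(0.0, 1.0 - survivors / max_pop)
        births = sum(1 for _ in range(survivors) if rng.random() < birth_prob)
        population = survivors + births
        history.append(population)
    return history


def average_over_trials(trials=90, steps=300, seed=0):
    """Average the population curve over several independent trials."""
    rng = random.Random(seed)
    totals = [0] * steps
    for _ in range(trials):
        for t, pop in enumerate(simulate_without_drugs(steps=steps, rng=rng)):
            totals[t] += pop
    return [total / trials for total in totals]


# Small demo run; raise trials to 90 to mirror the settings listed above.
curve = average_over_trials(trials=5, steps=200)
print(round(curve[0]), round(curve[-1]))
```

Under this assumed rule, growth is close to exponential while the population is small and then levels off at roughly half of `max_pop`, where clearances and births balance.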


Simulation with Drugs:

This time we run the simulation with drugs and change the drug after 150 cycles. We get the following graph:

Initially, the viruses grow slowly, picking up resistances on the way. As we change the drug given to the patient, the virus population drops significantly. In the meantime, the average population of viruses resistant to the given drugs starts to rise. After a few life cycles, the average virus population equals the average resistant population, which means that only those viruses that developed a resistance survived, and every virus was resistant in the end.

This shows the importance of switching drugs for patients with diseases such as HIV, which can develop resistance to drugs fairly quickly.


Machine Learning

GRADIENT DESCENT FOR LINEAR REGRESSION [3]

At a theoretical level, gradient descent is an algorithm that minimizes functions. Given a function defined by a set of parameters, gradient descent starts with an initial set of parameter values and iteratively moves toward a set of parameter values that minimize the function. This iterative minimization is achieved using calculus, taking steps in the negative direction of the function gradient.

It’s sometimes difficult to see how this mathematical explanation translates into a practical setting, so it’s helpful to look at an example. The canonical example when explaining gradient descent is linear regression.
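For reference, the quantities involved can be written out explicitly. The report does not state its cost function, so the block below assumes the usual squared-error cost; the hypothesis h, the number of training examples m and the learning rate α are notation introduced here for illustration.

```latex
\begin{aligned}
h_\theta(x) &= \theta_0 + \theta_1 x \\
J(\theta_0, \theta_1) &= \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \\
\theta_0 &:= \theta_0 - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \\
\theta_1 &:= \theta_1 - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}
\end{aligned}
```

Both parameters are updated simultaneously in each iteration, and the two sums are the partial derivatives of J with respect to θ0 and θ1.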

Here, I implement linear regression with one variable to predict profits for a food truck. In this scenario, a restaurant franchise is considering different cities for opening a new outlet. The chain already has trucks in various cities, and we have data on profits and populations from those cities.
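A minimal sketch of batch gradient descent for this one-variable problem follows. The training set is made up for illustration (the report's dataset is not reproduced here), and the learning rate, iteration count and function names are assumptions rather than the settings used for the results in this report.

```python
import random


def compute_cost(xs, ys, theta0, theta1):
    """Half the mean squared error of the line theta0 + theta1 * x."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient_descent(xs, ys, alpha, iterations):
    """Batch gradient descent for one-variable linear regression."""
    m = len(xs)
    theta0 = theta1 = 0.0
    costs = []
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Update both parameters simultaneously from the same error vector.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
        costs.append(compute_cost(xs, ys, theta0, theta1))
    return theta0, theta1, costs


# Made-up training set: city population and truck profit, both in units of 10,000.
rng = random.Random(0)
populations = [5.0 + 0.5 * i for i in range(30)]
profits = [-4.0 + 1.2 * x + rng.gauss(0.0, 1.0) for x in populations]

theta0, theta1, costs = gradient_descent(populations, profits, alpha=0.01, iterations=15000)
print(f"theta0={theta0:.4f}, theta1={theta1:.4f}, final cost={costs[-1]:.4f}")
```

Plotting `costs` against the iteration number gives the kind of convergence curve used to check that the learning rate is not too large.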

Data Plot:-

Running Gradient Descent:-

Visualization of Cost Function with Respect to Θ0 and Θ1:-

Contour Map:-

Theta found by gradient descent: -3.630291, 1.166362

For the city with a population of 35,000, we predict a profit of 4519.767868

For the city with a population of 70,000, we predict a profit of 45342.450129
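These predictions can be checked by hand from the reported theta. The sketch below assumes that both population and profit are expressed in units of 10,000 in the training data; the report does not say so explicitly, but the printed numbers are consistent with that scaling.

```python
theta0, theta1 = -3.630291, 1.166362  # theta as reported above


def predict_profit(population):
    """Assumes population and profit are both stored in units of 10,000."""
    x = population / 10_000
    return (theta0 + theta1 * x) * 10_000


for population in (35_000, 70_000):
    print(population, round(predict_profit(population), 2))
```

The results agree with the reported figures to within the rounding of the printed theta values.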


MULTIVARIATE GRADIENT DESCENT

Estimating Cost of House:

Dataset: [Area (Sq. Feet), Bedrooms] [Price]

Normalizing the Features...

Running gradient descent for Normalized Dataset...

[Figure: cost J (×10^10) versus number of iterations, 0 to 400]

Theta computed from gradient descent:

334302.063993

100087.116006

3673.548451

The prediction for a 3 bedroom house with area of 1650 sq. Feet:

$289314.620338
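The normalize-then-descend workflow above can be sketched as follows. The eight houses are made-up values, not the report's dataset, and z-score scaling with the sample standard deviation is an assumption, since the report does not name its normalization method. The learning rate and iteration count are also illustrative.

```python
from statistics import mean, stdev


def normalize(columns):
    """Z-score each feature column; also return the (mean, std) pairs used."""
    stats = [(mean(col), stdev(col)) for col in columns]
    scaled = [[(v - mu) / sigma for v in col] for col, (mu, sigma) in zip(columns, stats)]
    return scaled, stats


def gradient_descent(rows, targets, alpha, iterations):
    """Batch gradient descent; every row already starts with the bias term 1.0."""
    m, n = len(rows), len(rows[0])
    theta = [0.0] * n
    costs = []
    for _ in range(iterations):
        errors = [sum(t * x for t, x in zip(theta, row)) - y for row, y in zip(rows, targets)]
        costs.append(sum(e * e for e in errors) / (2 * m))  # cost before this update
        gradient = [sum(e * row[j] for e, row in zip(errors, rows)) / m for j in range(n)]
        theta = [t - alpha * g for t, g in zip(theta, gradient)]
    return theta, costs


# Hypothetical houses: area (sq. feet), bedrooms, price.
areas = [1000, 1500, 2000, 2500, 3000, 1200, 1800, 2200]
bedrooms = [2, 3, 3, 4, 4, 2, 3, 4]
prices = [200000, 270000, 330000, 410000, 460000, 215000, 310000, 365000]

(scaled_area, scaled_beds), stats = normalize([areas, bedrooms])
rows = [[1.0, a, b] for a, b in zip(scaled_area, scaled_beds)]
theta, costs = gradient_descent(rows, prices, alpha=0.3, iterations=1000)


def predict(area, beds):
    """Scale a new house with the training statistics before applying theta."""
    (mu_a, sd_a), (mu_b, sd_b) = stats
    return theta[0] + theta[1] * (area - mu_a) / sd_a + theta[2] * (beds - mu_b) / sd_b


print(round(predict(1650, 3)))
```

Because the scaled features have zero mean, the first theta value converges to the mean training price. A new input must be scaled with the same means and standard deviations as the training set, which is why `normalize` returns them.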


Machine Learning for Indian Railways

INTRODUCTION

With advanced computers and storage techniques available, Indian Railways has the capability to generate and store data like never before. The problem arises when this data becomes so large that it cannot be analyzed by conventional methods, yet the possibilities it offers remain enormous. CRIS is currently working on models to predict train arrival delays, possible component breakdowns, and more.

TRAIN ARRIVAL TIMES – FUTURE ASPECTS

Whenever a train arrives late, it inconveniences passengers, delays schedules and calls into question the reliability of Indian Railways' services. Most events follow a pattern, and train arrival times are no exception. When we analyze weather, season, date and time, we see a pattern in how these factors affect arrival times. Beyond that, we learn the ‘hotspots’ of delays in train arrivals. With this data, we can predict the chance of any train being late (and by how much) at a particular time by feeding these factors into the system. This helps us plan ahead and provide a better service.

ELECTRIFICATION OF RAILWAY TRACK

Data:-

Linear Regression (Using Gradient Descent):-

For year = 2017, we predict: 1822.712573 Kms

For year = 2018, we predict: 1937.783785 Kms
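Since a straight-line model changes by the same amount every year, the two predictions above already determine the slope of the fitted line. The short sketch below only rearranges the reported numbers; the helper name `predict` is introduced here for illustration.

```python
pred_2017 = 1822.712573  # km, as reported above
pred_2018 = 1937.783785  # km, as reported above

# For a straight-line model the year-on-year change equals the slope.
slope_km_per_year = pred_2018 - pred_2017


def predict(year):
    """Line through the two reported predictions (illustrative helper)."""
    return pred_2017 + slope_km_per_year * (year - 2017)


print(f"implied slope: {slope_km_per_year:.6f} km per year")
```

In other words, the fitted model adds about 115 km to its prediction for each additional year.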


3D Visualization of Cost Function:-

Contour Map of Cost Function:-


INDIAN RAILWAY EARNINGS

Data:-

Linear Regression:-

For year = 2017, we predict Earnings (in 1000 crores) of: 185.212075

For year = 2018, we predict Earnings (in 1000 crores) of: 197.036802


Polynomial Regression:

Theta found by gradient descent: 102.183719, 26.968488, 7.048750, 1.294778

Z = [74.2633, 80.6250, 88.0149, 96.9615, 107.9933, 121.6391, 138.4273, 158.8866]

For year = 2014, our model predicts Earnings of: 158.886576

Against actual Earnings of: 157.8810

For year = 2015, we predict Earnings of: 183.545500




Linear Vs. Polynomial:

As you can see, there is a drastic difference between the predictions of my linear model and my polynomial (cubic) model. While the linear model predicts earnings of Rs. 185.21 thousand crores in 2017, the polynomial model predicts Rs. 247.57 thousand crores.

These two models illustrate the concept of under-fitting well. It is evident that the linear model underfits the data, whereas the cubic one is ‘just right’. If we go further than this, we risk over-fitting the data: the model would fail to ‘generalize’, and its future predictions could become useless. Following is an example of a ‘just right’ and an over-fit model:-

Why use Gradient Descent?

In regression problems, our main goal is to find the ‘Thetas’ of our hypothesis function that fit the data, so that we get an equation into which we can put new feature values and obtain a predicted ‘Y’ value. Another way to fit the data is the Normal Equation. However, it involves inverting a matrix whose size grows with the number of features, which is computationally heavy and may take a long time for large data sets. So we use gradient descent, a greedy, iterative algorithm that gets us close to the optimal values in a reasonable amount of time. However, gradient descent is susceptible to local minima, so it is only guaranteed to reach the global minimum if the cost function is ‘convex’, as the squared-error cost of linear regression is.
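The two approaches can be compared on a small problem. In the sketch below, `normal_equation` solves (XᵀX)θ = Xᵀy directly by Gaussian elimination, while `gradient_descent` iterates toward the same answer. The data, learning rate and iteration count are made up for illustration.

```python
def normal_equation(rows, targets):
    """Solve (X^T X) theta = X^T y by Gauss-Jordan elimination with pivoting."""
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * y for r, y in zip(rows, targets))] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


def gradient_descent(rows, targets, alpha, iterations):
    """Batch gradient descent on the squared-error cost."""
    m, n = len(rows), len(rows[0])
    theta = [0.0] * n
    for _ in range(iterations):
        errors = [sum(t * x for t, x in zip(theta, row)) - y for row, y in zip(rows, targets)]
        theta = [t - alpha * sum(e * row[j] for e, row in zip(errors, rows)) / m
                 for j, t in enumerate(theta)]
    return theta


# Made-up data scattered around y = 2 + 3x; each row starts with the bias term.
xs = [i / 10 for i in range(20)]
ys = [2.0 + 3.0 * x + 0.1 * (-1) ** i for i, x in enumerate(xs)]
rows = [[1.0, x] for x in xs]

theta_ne = normal_equation(rows, ys)
theta_gd = gradient_descent(rows, ys, alpha=0.5, iterations=2000)
print(theta_ne, theta_gd)
```

Both methods minimize the same convex cost, so they agree on the result. The difference is in how the work scales: the elimination step grows quickly with the number of features, while each gradient descent iteration stays cheap.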


FITTING THE CLOSEST MODEL

Indian Railways, Traction and Non-Traction Power Consumption

Dataset:

Model 1 (N4):

Result:

For year = 2014, our model predicts Non-Traction Power Consumption of: 17.350000 (100 Million Watts)

Against actual consumption of: 17.35 (100 Million Watts)

For year = 2015, we predict Non-Traction Power Consumption of: 35.470000




Observation:

This model tries ‘too hard’ to fit the training set and becomes useless for future predictions, as it fails to ‘generalize’. Hence we get these inflated values.
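The effect can be reproduced on a toy example. The sketch below does not use the report's model or data: it swaps in an exact interpolating polynomial (via Lagrange's formula) for the high-order fit and compares it with a straight-line least-squares fit on four made-up points.

```python
def lagrange_predict(xs, ys, x):
    """Evaluate the unique degree-(n-1) polynomial through all n points at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def linear_fit(xs, ys):
    """Closed-form least-squares line; returns (intercept, slope)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope


# Hypothetical yearly values; x is the number of years since the first sample.
xs = [0, 1, 2, 3]
ys = [10.0, 12.0, 11.0, 15.0]

intercept, slope = linear_fit(xs, ys)
print("polynomial at x=3:", lagrange_predict(xs, ys, 3))  # reproduces the last training value
print("polynomial at x=4:", lagrange_predict(xs, ys, 4))  # extrapolation one year ahead
print("line at x=4:", intercept + slope * 4)
```

The polynomial matches every training point exactly, yet one step outside the data it predicts about twice what the straight line does, which is the same pattern as the inflated value above.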

Model 2 (Linear):

Result:








Observation:

This model ‘fits better’ and generalizes well enough to make somewhat ‘worthy’ predictions. However, our training example from 2013 gives a value of 370 Million Watts, which is out of line with the values from consecutive years: 1135, 1367, 1735. This leads to irregularities in the data.

Model 3 (Linear with removed Irregularities):

Result:








--------------------------

For year = 2014, our model predicts Traction Power Consumption of: 13.390667 (1000 Million Watts)


For year = 2015, we predict Traction Power Consumption of: 16.697667




Observation:

In this model, we try not to ‘overfit’ the data and have removed the irregularities. Some irregularities would not have been a problem if our training set were large in comparison, but since we are working with only 4 training examples, an irregularity in one of them leads to large deviations, because that one example makes up 25% of the training data.
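How much one irregular example can move a four-point fit is easy to see on a toy case. The values below are hypothetical and are not the report's dataset; the third point plays the role of the irregular reading.

```python
def linear_fit(xs, ys):
    """Closed-form least-squares line; returns (intercept, slope)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope


# Hypothetical readings for four consecutive years (x = years since the first).
xs = [0, 1, 2, 3]
ys = [11.0, 13.0, 4.0, 17.0]  # the value at x=2 is the irregular one

b_all, w_all = linear_fit(xs, ys)

# Drop the irregular example and refit on the remaining three points.
clean = [(x, y) for x, y in zip(xs, ys) if x != 2]
b_clean, w_clean = linear_fit([x for x, _ in clean], [y for _, y in clean])

pred_all = b_all + w_all * 4
pred_clean = b_clean + w_clean * 4
print(pred_all, pred_clean)
```

With the irregular point included, the fitted slope is less than half of what the other three points suggest, so the prediction for the next year drops sharply. In a training set of hundreds of examples the same single reading would barely shift the line.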

Adding more Training Examples we get more refined results:







--------------------------

For year = 2014, our model predicts Traction Power Consumption of: 12.944800 (1000 Million Watts)








REFERENCES

[1] "Monte Carlo Simulation," [Online]. Available: http://www.palisade.com/risk/monte_carlo_simulation.asp.

[2] X. L. Ronald H. Gray, "Stochastic simulation of the impact of antiretroviral," [Online]. Available: http://www.who.int/hiv/events/artprevention/gray_stochastic.pdf.

[3] M. Nedrich, "An Introduction to Gradient Descent and Linear Regression," [Online]. Available: https://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression/.

