ANALYZING THE TREND OF STUDENTS STUDYING ABROAD AS A RESULT OF VARIOUS PARAMETERS OF HOME COUNTRY Arjun Sehgal – N15529324 [email protected]

Apr 13, 2017

TABLE OF CONTENTS

1. Abstract
2. Introduction
3. Data Sources
4. SQL Pre-Processing
5. Hive Processing
6. Tableau Visualizations
7. Predictive Analytics in H2O
8. Conclusions
9. Future Scope
10. References

1. ABSTRACT

As we all know, the number of students studying abroad is increasing every year on a global scale. This flow of students from different cultures is instrumental to the growth of the world economy. For countries like the USA, which are considered hotspots of foreign education, the number of foreign students also brings a major financial advantage. As a project for my course CS-GY 9223 Big Data Analytics, I identified a few factors which might affect the number of students studying abroad. Then, using various technologies taught throughout this course, I tried to gain insights into the datasets obtained.

2. INTRODUCTION

In this project I have obtained data on the number of students studying abroad, the gross domestic product (GDP) of various countries, government expenditure on education, the rate of unemployment among the youth of each country, and the number of internet users in each country as a percentage of the total population.

I considered the first three factors to be key determinants of the number of students going abroad for education. GDP is an economic indicator that shows the total monetary value of all the goods and services produced within a country in a given time frame, so it can be used to gauge the economic health of a country.

The second factor I chose is the country's expenditure on education. This can be used to determine whether the government is devoting enough resources to education and its development. Naturally, if the quality of education is poor, we should expect a greater number of students to study abroad. The values are expressed as expenditure on education as a percentage of total government expenditure.

The third and final indicator I chose is the unemployment rate among the youth of the country, i.e. the population aged 15-24. I felt this factor was also important, as it describes whether the youth, who are primarily students, are able to obtain jobs in their own country or have to search for better opportunities abroad, which can be the motivation for studying abroad.

I have also used the dataset of internet users as a share of the population, because I feel that the greater the percentage of the population with access to the internet, the more informed the population will be, and hence the greater the likelihood of studying abroad. I obtained the datasets for all four factors from the United Nations Dataset (data.un.org).

3. DATA SOURCES

The datasets were obtained from the following links:

o Dataset for students studying abroad from a country: http://data.un.org/Data.aspx?q=student&d=UNESCO&f=series%3aED_FSOABS

o Dataset for GDP of a country: http://data.un.org/Data.aspx?q=GDP&d=WDI&f=Indicator_Code%3aNY.GDP.MKTP.CD

o Dataset for Youth unemployment rates (ages 15-24): http://data.un.org/Data.aspx?q=unemployment&d=MDG&f=seriesRowID%3a630

o Dataset for Expenditure by Government on Education in home country: http://data.un.org/Data.aspx?d=UNESCO&f=series%3aXGDP_FSGOV

o Dataset for Percentage of Internet Users in home country: http://data.un.org/Data.aspx?d=ITU&f=ind1Code%3aI99H

4. SQL PRE-PROCESSING

All the datasets that I downloaded were in .csv format, stored with a comma delimiter. First, the data was loaded into SQL, where I processed it to ensure that data integrity was maintained. During pre-processing I used SQL and Excel to clean the data and transform it into a suitable format. To do this, I created tables in SQL with the appropriate data type for each column of the .csv files.

While pre-processing the data, I observed that the country name column was causing problems: some countries have commas in their names, and since the comma was also the delimiter, this caused confusion when loading the data. Whenever SQL split a row incorrectly, it raised an error, because a value of the wrong data type would be placed in the next column. Also, some files had comments and footnote values at the end, which created rows with unequal numbers of columns. Based on the errors reported by SQL, I corrected the data and then loaded it again. Once the data was successfully loaded into SQL, it was ready to be loaded into other applications such as Pig and Hive.

For the GDP table, I noticed that loading the data into Hadoop technologies such as Pig and Hive caused problems, as they were not always able to detect the values correctly because of their extremely large magnitudes. As a workaround, I first loaded the data into SQL and created the ID column explained below. Once the new column was created and populated, I exported the new table and used it in Hive along with the other datasets.

Once the datasets were pre-processed and cleaned as described above, the data was loaded into HDFS using the Hue UI and then processed in Hive. Here I had to create a key in every table so that individual records could be matched and identified uniquely. To achieve this, I created a new column called ID, derived from two existing columns, Country Name and Year. Concatenating the two fields gives a value that is unique for each record. The benefit is that when we need to perform joins, we have a unique column to reference.
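The two pre-processing steps described above, reading comma-delimited rows whose country names themselves contain commas and deriving the ID key from country name and year, can be sketched in Python. This is only an illustration: the project itself used SQL and Excel, and the sample rows, the column names, the underscore separator and the `make_id` helper are assumptions rather than the original schema.

```python
import csv
import io

# Hypothetical sample shaped like a comma-delimited export. The country name
# that contains a comma is quoted, so the csv module keeps it in one field.
RAW = '''Country,Year,Value
"Korea, Republic of",2012,121023
India,2012,188791
'''


def make_id(country, year):
    """Concatenate country name and year into a key unique per record.

    The underscore separator is an assumption made for readability.
    """
    return f"{country.strip()}_{year}"


def load_rows(text):
    """Parse CSV text and attach the derived ID column to every row."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["ID"] = make_id(row["Country"], row["Year"])
        rows.append(row)
    return rows


rows = load_rows(RAW)

# For comparison: a naive split on "," breaks the first data row into four
# fields instead of three, which is the loading problem described above.
naive_field_count = len(RAW.splitlines()[1].split(","))
```

A quote-aware parser avoids the shifted-column errors, and the derived ID gives every table the same join key.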

5. HIVE PROCESSING

There are two ways to load the data into Hive. The first is to go to Metastore Tables in the Hue UI and create a new table from a file loaded on HDFS. Once the file is selected, you can specify the delimiter used and give the column names along with their respective data types. Once all this has been specified, the table can be operated on by executing queries in Hive. The second method is to create the table in Hive itself, by writing one query to create the table and another to populate it. I used Hive commands to create the tables and then populate them with the relevant data.

In Hive, I created the ID column mentioned previously for the other tables, so that all records can be uniquely accessed and identified. Once a query has been executed and the results obtained, the results can be exported in .csv or another format. Another option is to load the query results directly into a new table in Hive. This was the more convenient way to save the results of executed queries so that they could be used further.

Once I had created the ID column for all the tables loaded in Hive and saved the results in a new table for each of them, I performed a join on all the separate tables so that all the relevant data for a particular country and year sits together in a single table. Using this method, I was also able to filter out all records that had null values, or for which data was not available for that particular year. As the United Nations Dataset has not been updated recently and does not contain complete information for all countries, the incomplete information could cause problems later when analyzing the data. For this reason I joined the tables using an inner join, which ensures that only records sharing the common unique identifier column ID are joined.

These operations could have been performed in SQL as well. However, I chose Hive because with large datasets, which typically involve millions of rows, joins across 3-4 tables can put significant pressure on a single system and degrade its performance, whereas in Hive the same job runs on top of Hadoop and can be distributed across the cluster.

Once I had obtained the results of the query joining all the tables in Hive, I saved them using the option in the Hue UI, as a .txt file on my local computer. This final .txt file is a dataset containing Country Name, Year of Observation, Number of Students Abroad, GDP, Expenditure on Education, Youth Unemployment Rate and Internet Users for each country. In this dataset, we notice that for some countries data is available for only 1 or 2 years, which may not be representative and can distort the averages when we perform predictive analytics. To address this, I used Pig to transform the data again.
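The effect of the inner join on ID can be illustrated with a small Python sketch. It is not the Hive query used in the project: the `inner_join` helper and the table names are hypothetical, the GDP and internet figures are placeholders rather than real data, and the explicit check for missing values is an addition made here to mirror the null filtering described above.

```python
def inner_join(tables):
    """Join several {ID: value} tables on their shared IDs.

    Only IDs present in every table are kept, as in an inner join.
    Records with a missing (None) value are dropped as an extra step.
    """
    common_ids = set.intersection(*(set(table) for table in tables.values()))
    joined = {}
    for record_id in sorted(common_ids):
        values = {name: table[record_id] for name, table in tables.items()}
        if all(value is not None for value in values.values()):
            joined[record_id] = values
    return joined


# Placeholder tables keyed by the concatenated country/year ID.
tables = {
    "students": {"India_2012": 188791, "Brazil_2011": 29218, "Albania_2013": 24147},
    "gdp": {"India_2012": 1.0, "Brazil_2011": 2.0},  # placeholder values
    "internet_users": {"India_2012": 10.0, "Brazil_2011": None, "Albania_2013": 50.0},
}

result = inner_join(tables)
```

In this sketch only `India_2012` survives: `Albania_2013` has no GDP record, and `Brazil_2011` has a missing internet value.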

I loaded the .txt file that I had saved from Hive, again using the Hue UI, and then used a Pig script to keep only those countries that have a significant number of records in the output. I have assumed here that any country with three or more observations can be considered significant; the rest of the observations were ignored. This step was performed for all the datasets created in Hive, since different combinations of factors give different lists of countries: not all countries have made the information for all fields public, which led to inconsistencies within the data.

Once processed and transformed in Pig, the data was stored and then downloaded to the local desktop using the Hue UI. After all the data processing was completed, Tableau was used to create visualizations from the resulting data and H2O was used to perform predictive analytics on it.
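The filtering rule applied in Pig, keeping only countries with at least three observations, can be sketched in Python as follows. The Pig script itself is not reproduced in this report, so the `filter_significant` helper and the sample records (with placeholder student counts) are illustrative assumptions.

```python
from collections import Counter

MIN_OBSERVATIONS = 3  # threshold stated in the text


def filter_significant(records, min_observations=MIN_OBSERVATIONS):
    """Keep records only for countries with enough yearly observations."""
    counts = Counter(country for country, _year, _students in records)
    return [record for record in records if counts[record[0]] >= min_observations]


# Placeholder (country, year, students) records, not the real dataset.
records = [
    ("Brazil", 2010, 100),
    ("Brazil", 2011, 110),
    ("Brazil", 2012, 120),
    ("Albania", 2013, 50),
    ("Denmark", 2010, 30),
    ("Denmark", 2011, 35),
]

kept = filter_significant(records)
kept_countries = sorted({country for country, _year, _students in kept})
```

Here only Brazil has three observations, so the Albania and Denmark records are dropped.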

6. TABLEAU VISUALIZATIONS

From the final dataset obtained from Pig, the following visualizations were created in Tableau. The first figure represents the number of students going to study abroad from each country in the final dataset. From this data we can infer that, among all countries, China has the highest number of students studying abroad, with India in second place.

The data for the other countries falls within a single range, indicating that these two countries contribute a large share of the students studying abroad worldwide. We also see that the countries for which data is not available have been greyed out. The total number of students studying abroad across all years has been color coded, and can be decoded using the key given in the figure.

The next visualization presents the number of students studying abroad against the GDP of that country for that particular year. The visualization is color coded according to the number of students studying abroad in the year for which the GDP has been plotted. Color coding this figure is especially important because there are many plot points near the start of the figure, which can cause confusion; with color coding, we can distinguish those points by the change in color.

The above visualization of GDP and students shows that, for a majority of the countries and plot points, the graph can be summarized using a polynomial trend line of order three. However, countries like India and China, which have an abnormally high number of students studying abroad, do not fall on this trend line and appear as anomalies.

The next visualization represents the relationship between the number of internet users in the population of a country and the number of students going abroad from that country. Here again the number of students has been color coded, so that we can identify the different levels of students studying abroad for countries with closely located levels of internet usage.

Again, a polynomial trend line of degree three has been used to approximate the data. The points that fall off the trend line belong to the countries in which the number of students studying abroad is abnormally high, such as India and China. These countries can be identified as the high orange peaks in the above figure.

The following visualization, which shows the number of students studying abroad against the rate of unemployment among youth aged 15-24 in a country, follows a pattern similar to the last graph of internet users vs. students. In this plot too, the data points have been approximated using a polynomial trend line, and the same exceptions appear for countries with an extremely high number of students abroad.
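To make the idea of a degree-three trend line and its outliers concrete, here is a minimal pure-Python sketch of a least-squares cubic fit. It is not how Tableau computes its trend lines internally; the `fit_polynomial` and `evaluate` helpers and all data points are synthetic assumptions chosen so that the result is easy to check.

```python
def fit_polynomial(xs, ys, degree=3):
    """Least-squares polynomial fit via the normal equations.

    Returns coefficients c such that y ~ c[0] + c[1]*x + ... + c[degree]*x**degree.
    """
    n = degree + 1
    # Build the normal equations A c = b from sums of powers of x.
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Forward elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        known = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - known) / a[i][i]
    return coeffs


def evaluate(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** i for i, c in enumerate(coeffs))


# Synthetic "typical" points lying exactly on y = 1 - 2x + x^3.
typical_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
typical_y = [1 - 2 * x + x ** 3 for x in typical_x]
coeffs = fit_polynomial(typical_x, typical_y)

# A synthetic outlier, standing in for a country far above the trend.
outlier_x, outlier_y = 2.0, 50.0
residual = outlier_y - evaluate(coeffs, outlier_x)
```

A large residual relative to the other points is what marks a point as lying off the trend line, as described for India and China.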

7. PREDICTIVE ANALYTICS IN H2O

Once the data had been explored using the visualizations created previously, predictive analytics was performed on it, so that we can predict and simulate various scenarios which might affect the number of students studying abroad. For this purpose the software H2O was used, which can perform predictive analytics on the local machine, or on top of R, Tableau or Hadoop.

Two different models were used in the analysis, and the one whose predictions had the minimum error was selected. Two different datasets were also used, in order to improve the models and identify the most relevant factors. The first dataset had all four factors discussed previously, that is unemployment, educational expenditure, internet users and GDP. In the second dataset, the column for unemployment was omitted. This was done because the unemployment percentage was not available for a large number of countries, and running the same models on this dataset might give a better result due to the greater coverage of the dataset. At the same time, there is a trade-off between a greater number of records and a greater number of factors that can affect the result.

The two models used are the Gradient Boosting Model and the Deep Learning Model. Both can be used for regression, and both report the importance of the variables that we specify as predictors of the target variable.

A Gradient Boosting Machine (GBM) is an ensemble of tree models (either regression or classification). Both are forward-learning ensemble methods that obtain predictive results through gradually improved estimates. Boosting is a flexible nonlinear regression procedure that helps improve the accuracy of trees. By sequentially applying weak classification algorithms to incrementally changing data, a series of decision trees is created that produces an ensemble of weak prediction models. GBM is often among the most accurate general-purpose algorithms: it can be applied to many types of problems and usually gives relatively accurate results. Additionally, Gradient Boosting Machines are robust, meaning that the user does not have to impute values or scale data (they can disregard distribution). This makes GBM the go-to choice for many users, as little tweaking is required in order to get accurate results.

In the figures below, the Gradient Boosting Model has been applied to the dataset that contains all four fields. First, the data from the file all_fields.csv was loaded into H2O as a data frame. This frame was then split in a 25:75 ratio in order to create a validation frame, which is used to check that the model has converged. While specifying the model parameters, the value of n-folds, which determines the number of folds for cross-validation, was set to 8.
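The frame split and the fold assignment can be pictured with the following Python sketch. It is not H2O's own splitting code: it assumes that 75% of the rows go to training and 25% to validation, assigns folds by simple round-robin, and uses hypothetical helper names and integer stand-ins for the rows.

```python
import random


def split_frame(rows, validation_ratio=0.25, seed=42):
    """Shuffle the rows and split them into (training, validation) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * validation_ratio)
    return shuffled[cut:], shuffled[:cut]


def assign_folds(rows, nfolds=8):
    """Distribute rows over nfolds folds in round-robin order."""
    folds = [[] for _ in range(nfolds)]
    for index, row in enumerate(rows):
        folds[index % nfolds].append(row)
    return folds


rows = list(range(100))  # stand-ins for the rows of the data frame
train, valid = split_frame(rows)
folds = assign_folds(train, nfolds=8)

# In cross-validation each fold is held out once while a model is trained
# on the remaining seven folds.
held_out = folds[0]
training_part = [row for fold in folds[1:] for row in fold]
```

The validation frame stays untouched during cross-validation, which is why it can serve as an independent check on convergence.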

The response column was then specified to be students, and the columns to be ignored were marked. We also specify the number of trees to be created and the maximum depth of the trees. We can also change a parameter called the learning rate, which varies from 0 to 1.0. This rate was set to 0.12; its default value is 0.1.
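The role of the number of trees and the learning rate can be seen in a toy gradient boosting regressor. The sketch below is not H2O's implementation: it uses one feature, squared-error loss and depth-one trees (stumps), and the helper names and data are made up. Each round fits a stump to the current residuals and adds a fraction of its output, the learning rate, to the running prediction.

```python
def fit_stump(xs, residuals):
    """Find the single split on x that best fits the residuals (squared error)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_trees=50, learn_rate=0.12):
    """Boost stumps on the residuals; returns a prediction function."""
    base = sum(ys) / len(ys)  # initial prediction: the mean response
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learn_rate * stump(x) for x, p in zip(xs, predictions)]
    return lambda x: base + learn_rate * sum(stump(x) for stump in stumps)


def mse(predict, xs, ys):
    """Mean squared error of a prediction function on the given data."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data with a step-like shape that a single mean cannot capture.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.0, 2.5, 3.0, 8.0, 8.5, 9.0, 15.0, 15.5]

baseline_mse = mse(lambda x: sum(ys) / len(ys), xs, ys)
boosted_mse = mse(gradient_boost(xs, ys), xs, ys)
few_trees_mse = mse(gradient_boost(xs, ys, n_trees=5), xs, ys)
```

A smaller learning rate makes each tree contribute less, so more trees are needed to reach the same training error; that is the trade-off behind tuning these two parameters together.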

Once the model was created, its parameters were noted. As we can see from the figure above, the blue line represents the scoring history for the training frame and the orange one that for the validation frame. The model also gives the relative importance of the variables that we specified.

From the following two figures we can note that GDP has the greatest importance at 54%, followed by Unemployment at 16%, Expenditure at 10.12%, Internet users at 10% and the year of observation at 8.31%. Thus, according to this model, the year of observation is the least relevant of these variables for predicting the number of students studying abroad.

After creating the model, another data frame was created with sample values for which the number of students studying abroad was already known. These values were then fed into all the models so that they could be judged on their deviation from a common data source. For this model the results, shown below, are impressive: the percentage error between the real and predicted values is mostly low, with the exception of a few entries.

Country       Year   Real Value   Predicted Value       Error   % Error
Brazil        2011        29218          35566.00     6348.00     21.73
Brazil        2012        30235          39091.39     8856.39     29.29
Albania       2013        24147          14641.55    -9505.45     39.36
Denmark       2011         6064           6824.04      760.04     12.53
Denmark       2010         5328           6669.83     1341.83     25.18
South Korea   2012       121023         124774.17     3751.17      3.10
India         2012       188791         178980.69    -9810.31      5.20
Malaysia      2011        59855          52936.33    -6918.67     11.56

The real values of the number of students studying abroad in the above table and the values predicted by the model have been compared using a 3-D bar graph. This graph also shows that there is not much error between the real and predicted values of the model.
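The Error and % Error columns of the table follow from simple arithmetic: the error is the predicted value minus the real value, and the percentage error is its absolute value relative to the real value. A short sketch using three rows of the table (the `prediction_error` helper is a hypothetical name):

```python
def prediction_error(real, predicted):
    """Return (signed error, absolute percentage error), rounded to 2 decimals."""
    error = predicted - real
    pct_error = abs(error) / real * 100
    return round(error, 2), round(pct_error, 2)


# (country, year, real value, predicted value) taken from the table above.
sample_rows = [
    ("Brazil", 2011, 29218, 35566.00),
    ("Albania", 2013, 24147, 14641.55),
    ("India", 2012, 188791, 178980.69),
]

errors = {
    (country, year): prediction_error(real, predicted)
    for country, year, real, predicted in sample_rows
}
```

A negative error means the model predicted fewer students than were actually observed, as for Albania and India.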

The second model created for the dataset containing all the fields was Deep Learning. Deep Learning is another popular family of models that is under active development. Its algorithms are based on distributed representations, the underlying assumption being that observed data are generated by the interactions of factors organized in layers.

Deep Learning with H2O features automatic adaptive weight initialization, automatic data standardization, expansion of categorical data, automatic handling of missing values, automatic adaptive learning rates, various regularization techniques, automatic performance tuning, load balancing, grid-search, N-fold cross-validation, checkpointing and different distributed training modes on clusters for large datasets. The technology does not require complicated configuration files, and H2O Deep Learning is highly optimized for maximum performance.

As with the last model, the same training frame and validation frame were used, and the n-folds value was kept the same. The response column was again students, and the columns to be ignored were selected. The option to report the importance of the specified variables was also enabled, in order to see how this model and the previous one assign importances to the variables differently.

[Figure: Comparison of Real vs Predicted Values for Gradient Boosted Learning Model. Series: Real Value, Predicted Value. Categories: Brazil, Brazil, Albania, Denmark, Denmark, South Korea, India, Malaysia.]

As we can see from the following two figures, the importances that the Deep Learning Model assigns to the variables we specified are different from those of the Gradient Boosting Model.

The Deep Learning Model favors the variables of expenditure and unemployment more heavily than the Gradient Boosting Model: there is a 6% increase in the importance of Unemployment, an 11% increase for Expenditure and a 4% increase for Internet Users.

However, when we predict values with the Deep Learning Model using the same prediction frame as for the Gradient Boosting Model, we notice that the predicted values are extremely far from the real values that we already know. The average error in this case is far greater than that of the previous model. This can also be seen from the difference in heights between the real and predicted values in the graph generated below the table.

Country       Year   Real Value   Predicted Value       Error   % Error
Brazil        2011        29218          34999.26     5781.26     19.79
Brazil        2012        30235          33222.91     2987.91      9.88
Albania       2013        24147          13657.50   -10489.50     43.44
Denmark       2011         6064          16227.10    10163.10    167.60
Denmark       2010         5328          13790.95     8462.95    158.84
South Korea   2012       121023          23365.10   -97657.90     80.69
India         2012       188791         108642.89   -80148.11     42.45
Malaysia      2011        59855          27459.23   -32395.77     54.12

[Figure: Comparison of Real vs Predicted Values for Deep Learning Model. Series: Real Value, Predicted Value. Categories: Brazil, Brazil, Albania, Denmark, Denmark, South Korea, India, Malaysia.]

In the last model that was tested, the dataset used is different. In this dataset the column for youth unemployment is not used, as it was not available for a large number of countries in the UN Dataset that was our source. In this case only the Gradient Boosting Model was used, as the Deep Learning Model had a greater error than can be accepted in a predictive model.

In this model, the basic parameters of the model are still the same. GDP is still the most important variable, with 61.77% importance, followed by the number of internet users with 19.65%, the year of observation with 11% and educational expenditure with 7.5%.

After creating the model, the same prediction frame is used to predict the responses for a predefined set of values. Although the error and error percentages of this model are low, the model in which all four variables were considered and Gradient Boosting was used gave better results, with a lower error.

Country       Year   Real Value   Predicted Value       Error   % Error
Brazil        2011     35566.00          44358.25     8792.24     24.72
Brazil        2012     39091.39          43721.19     4629.80     11.84
Albania       2013     14641.55          18386.80     3745.25     25.58
Denmark       2011      6824.04           8166.10     1342.06     19.67
Denmark       2010      6669.83          11110.61     4440.78     66.58
South Korea   2012    124774.17         107499.65   -17274.52     13.84
India         2012    178980.69         167062.88   -11917.81      6.66
Malaysia      2011     52936.33          31271.67   -21664.67     40.93

[Figure: Comparison of Real vs Predicted Values for Gradient Boosted Model (Excluding Youth Unemployment). Series: Real Value, Predicted Value. Categories: Brazil, Brazil, Albania, Denmark, Denmark, South Korea, India, Malaysia.]

We can now infer that the best-fitting model for the dataset we obtained is the Gradient Boosting Model, and that in order to get the best results we should use the dataset in which all four variables are present.
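One simple way to compare the three models on the common prediction frame is to average the % Error column of each table. The sketch below does this with the percentages exactly as reported in the three tables of this section; the model labels and variable names are my own, and averaging the reported percentages is just one possible summary.

```python
from statistics import mean

# % Error columns as reported in the three tables of this section.
reported_pct_errors = {
    "GBM, all four factors": [21.73, 29.29, 39.36, 12.53, 25.18, 3.10, 5.20, 11.56],
    "Deep Learning, all four factors": [19.79, 9.88, 43.44, 167.60, 158.84, 80.69, 42.45, 54.12],
    "GBM, without youth unemployment": [24.72, 11.84, 25.58, 19.67, 66.58, 13.84, 6.66, 40.93],
}

mean_pct_error = {
    model: mean(values) for model, values in reported_pct_errors.items()
}
best_model = min(mean_pct_error, key=mean_pct_error.get)
```

The averages come out at roughly 18.5%, 72.1% and 26.2% respectively, so by this measure the Gradient Boosting Model with all four factors has the lowest mean error.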

8. CONCLUSION

Hence, we can infer that, of the factors considered, the GDP of a country plays the most dominant role in the decision of a student to study abroad. Unemployment among the youth and the number of internet users, although not as significant in percentage terms, are also factors that should be kept in mind while predicting the values for future years for various countries.

9. FUTURE SCOPE

The scope of this project can be extended by adding further variables that are relevant to the matter. Adding a greater number of variables will no doubt decrease the percentage importance of factors like GDP, which currently have a high share. However, adding more diverse factors should improve the chance of predicting the value accurately. It will also help dampen the effect of rogue values, such as a spike in any one variable, which might otherwise create an anomaly and give an incorrect prediction.

10. REFERENCES

• discuss.analyticsvidhya.com
• en.wikipedia.org
• pig.apache.org
• cwiki.apache.org
• hortonworks.com/hadoop-tutorial
• www.stackoverflow.com/
• www.h2o.ai/verticals/algos