© 2013 IBM Corporation Comparing R and Python for PCA PyData Boston 2013 Vipin Sachdeva Senior Engineer, IBM Research

Comparing R and Python for PCA PyData Boston 2013files.meetup.com/1676436/PyData-2013-PCA.pdf · 2014-03-15 · Dataframes is an essential part of languages supporting data analysis


© 2013 IBM Corporation

Comparing R and Python for PCA PyData Boston 2013

Vipin Sachdeva

Senior Engineer, IBM Research


Comparison of R and Python for Principal Component Analysis

§  R and Python are popular choices for data analysis.

§  How do they compare in terms of programmer productivity and performance?

§  Use a common task for both R and Python: Principal Component Analysis (PCA).
–  PCA is a very commonly used technique for dimension reduction.

§  Dataframes are an essential part of languages supporting data analysis.
–  R provides the data frame, along with numerous statistical packages.
–  Python has NumPy (arrays) and Pandas (dataframes) for data handling, which we use.

§  Both languages have rich development environments.
–  RStudio for R
–  IPython for Python

§  Both languages have many features that help in data analysis.
–  In this talk we compare those features with some code examples to solve our problem.

§  This talk is not so much about principal component analysis as about programming and performance in Python and R.

§  Let's get started.


PCA – Short Introduction

§  PCA is a standard tool in modern data analysis.

§  Simple method to extract information from confusing datasets
–  Reduce a complex dataset to a lower dimension.
•  PCA projects the data along the directions where the data varies the most.
•  These directions are given by the eigenvectors corresponding to the largest eigenvalues.
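The projection idea above can be sketched on toy data. This is a minimal illustration, not code from the talk: the function name `principal_directions` and the synthetic points are assumptions, and the eigenvectors are taken from the sample covariance matrix.

```python
import numpy as np

def principal_directions(data):
    """Eigenvalues (descending) and matching eigenvectors of the covariance matrix."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

# Hypothetical toy data: points scattered closely around the line y = x
rng = np.random.default_rng(0)
t = rng.normal(size=200)
points = np.column_stack([t, t + 0.05 * rng.normal(size=200)])

eigvals, eigvecs = principal_directions(points)
first_direction = eigvecs[:, 0]                 # direction of largest variance
# Project the centered data onto that direction: a 1-D representation
projected = (points - points.mean(axis=0)) @ first_direction
```

For data lying near y = x, the first direction comes out close to (1, 1)/sqrt(2), up to sign, and nearly all of the variance sits in the first eigenvalue.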


PCA – Mathematical approaches

§  Find the eigenvalues of the covariance matrix of the standardized data.

§  Choose the largest eigenvalues whose sum exceeds a threshold.

§  Reduction in dimension from N to K:
–  Create data with the subset of K eigenvalues (whose sum exceeds that threshold).

•  How to choose the principal components?

-  To choose $K$, use the following criterion, where $\lambda_i$ are the eigenvalues in decreasing order:

$$\frac{\sum_{i=1}^{K} \lambda_i}{\sum_{i=1}^{N} \lambda_i} > \text{Threshold} \quad (\text{e.g., } 0.9 \text{ or } 0.95)$$

•  What is the error due to dimensionality reduction?

-  We saw above that an original vector $x$ can be reconstructed using its principal components:

$$\hat{x} - \bar{x} = \sum_{i=1}^{K} b_i u_i \quad \text{or} \quad \hat{x} = \sum_{i=1}^{K} b_i u_i + \bar{x}$$

-  It can be shown that the low-dimensional basis based on principal components minimizes the reconstruction error:

$$e = \lVert x - \hat{x} \rVert$$

-  It can be shown that the error is equal to:

$$e = \frac{1}{2} \sum_{i=K+1}^{N} \lambda_i$$

•  Standardization

-  The principal components are dependent on the units used to measure the original variables as well as on the range of values they assume.

-  We should always standardize the data prior to using PCA.

-  A common standardization method is to transform all the data to have zero mean and unit standard deviation:

$$\frac{x_i - \mu}{\sigma} \quad (\mu \text{ and } \sigma \text{ are the mean and standard deviation of the } x_i\text{'s})$$
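The threshold criterion, the quoted error expression and the standardization step can be written out in a few lines. This is a sketch in plain Python; the function names are hypothetical, the strict `>` follows the criterion as stated here, and the population standard deviation is an assumption (the notes do not say which variant is meant).

```python
from itertools import accumulate
from math import sqrt

def choose_k(eigenvalues, threshold=0.9):
    """Smallest K whose leading eigenvalues exceed `threshold` of the total sum."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    for k, partial in enumerate(accumulate(ordered), start=1):
        if partial / total > threshold:
            return k
    return len(ordered)

def residual_error(eigenvalues, k):
    """Error expression quoted above: half the sum of the discarded eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    return 0.5 * sum(ordered[k:])

def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

# Illustrative eigenvalues: cumulative shares are 0.4, 0.7, 0.9, 1.0
k = choose_k([4, 3, 2, 1], threshold=0.8)
```

With these illustrative eigenvalues the third cumulative share (0.9) is the first to exceed 0.8, so three components are kept and the discarded eigenvalue contributes 0.5 to the error expression.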


PCA using Singular Value Decomposition (SVD)

§  A more general approach for performing PCA.

§  Decompose X = U D V^T.

§  D*D gives the eigenvalues of the covariance matrix (up to a factor of 1/(n-1), for column-centered X with n rows).

§  Reconstruct the data by zeroing out regions as shown below.

§  Choose q (as before).

Figure 5:

We can embed x into an orthogonal space via rotation. D scales, V rotates, and U is a perfect circle.

PCA cuts off SVD at q dimensions. In Figure 6, U is a low dimensional representation. Examples 3 and 1.3 use q = 2 and N = 130. D reflects the variance so we cut off dimensions with low variance (remember d11 ≥ d22 ≥ ...). Lastly, V are the principal components.

Figure 6:

2 Factor Analysis

Figure 7: The hidden variable is the point on the hyperplane (line). The observed value is x, which is dependent on the hidden variable.

Factor analysis is another dimension-reduction technique. The low-dimension representation of higher-dimensional space is a hyperplane drawn through the high dimensional space. For each datapoint, we select a point on the hyperplane and choose data from the Gaussian around that point. These chosen points are observable whereas the point on the hyperplane is latent.
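The link between the SVD and the covariance eigenvalues on this slide can be checked numerically. The sketch below uses a small random matrix as a stand-in for real data (an assumption) and compares the squared singular values of the centered data, divided by n-1, with the eigenvalues of the sample covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))          # hypothetical data: 50 observations, 4 variables
Xc = X - X.mean(axis=0)               # center each column

# Singular values of the centered data
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
eig_from_svd = d ** 2 / (Xc.shape[0] - 1)

# Eigenvalues of the sample covariance matrix, sorted in descending order
cov_eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]

same = np.allclose(eig_from_svd, cov_eigvals)
```

Because the factor 1/(n-1) is common to all eigenvalues, it cancels in any ratio of partial sum to total sum, which is why D*D can be used directly when choosing q.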


PCA: What data to use ?

§  How about PCA on data for the current S&P 500 stocks over a "period of time"?

§  Download the symbols from the S&P 500 website and create a vector.

§  Use this vector to download each symbol's data from 1970 to 2012 into a dataframe (if possible).

§  R and Python have various packages for financial data download.
–  quantmod (R)
–  pandas.io.data.DataReader (Python)

§  Need a package that provides a single dataframe as output from a single call.

Dates        MMM      ABT      …
01-01-1970   109.62   NA       …
01-02-1970   107.12   NA       NA
…            …        …        …
12-31-2012   NA       108.66   104.32


Data Download – R only

§  I am a C/C++/Fortran HPC programmer, and I do use for loops in R and Python.
–  for loops are slow in R.

§  Can any package return data for the S&P stocks as a single dataframe?
–  Use the fImport package of R to download daily data.
–  stocksData<-yahooSeries(symbols_nospaces,from="1970-01-01",to="2012-12-31") #symbols_nospaces is S&P stock symbols
–  Extract the columns with closing prices.

§  Write to a csv file for repeated runs (takes a long time to download).

§  Read the file in R/Python to get the data.
–  read.table in R creates an R dataframe.
–  Pandas read_table creates a Pandas dataframe.

§  Many symbols have NAs for dates where data is not available.

§  Work with a subset of the data.
–  How about 200 stocks for a quarter of a century (1988-2012)?

#Snippet of code to get closing data
colname<-paste(symbols_nospaces[i],".Close",sep="")
print(colname)
stockData_df[,i+1]<-get(colname)
colnames(stockData_df)[i+1]<-symbols_nospaces[i]


Combined Data Preparation – R and Python

§  Read the file in R/Python to get the data.
–  read.table in R creates an R dataframe.
–  Pandas read_table creates a Pandas dataframe.

§  Many symbols have NAs for dates where data is not available.

§  Work with a subset of the data.
–  How about 200 stocks for a quarter of a century (1988-2012)?

§  Find the first occurrence of "1988" in the dataframe's Dates column.
–  str.contains("1988") in Python
–  agrep(…,fixed=TRUE) in R

§  Extract the stock columns which do not have NA on the first trading day of 1988.
–  !is.na in R / not math.isnan in Python
–  Get 200 stocks which satisfy the above requirement.

§  Result: combined data for 200 stocks from 1988-2012 in R/Python dataframes.
–  Drop rows with NA for any stock.
•  na.omit() in R / dropna() in Pandas
–  6162 entries for 200 stocks in total.

§  Both R and Python are remarkably similar for this step.
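The preparation steps listed on this slide can be sketched with pandas on a toy table. This is an illustration under stated assumptions, not the talk's code: the function name `prepare`, the plain `Dates` column name and the `YYYY-MM-DD` date strings are all hypothetical.

```python
import numpy as np
import pandas as pd

def prepare(prices, first_year, numstocks):
    """Keep rows from the first date of `first_year` onward, keep up to
    `numstocks` symbols that have a value on that first date, then drop
    any remaining rows that contain a missing value."""
    mask = prices["Dates"].str.startswith(str(first_year))
    if not mask.any():
        raise ValueError("no rows found for the requested year")
    start = mask.to_numpy().argmax()             # position of the first match
    subset = prices.iloc[start:]
    first_row = subset.iloc[0]
    keep = [c for c in subset.columns
            if c != "Dates" and pd.notna(first_row[c])][:numstocks]
    return subset[["Dates"] + keep].dropna().reset_index(drop=True)

# Hypothetical toy data with made-up symbols
toy = pd.DataFrame({
    "Dates": ["1987-12-31", "1988-01-04", "1988-01-05", "1988-01-06"],
    "AAA": [1.0, 1.1, 1.2, 1.3],
    "BBB": [np.nan, np.nan, 2.0, 2.1],   # no value on the first day of 1988
    "CCC": [3.0, 3.1, np.nan, 3.3],      # one missing day inside the range
})
result = prepare(toy, 1988, 2)
```

Here `BBB` is rejected because it has no value on the first 1988 date, and the row for 1988-01-05 is dropped because `CCC` is missing on that day.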


Code in R and Python for data preparation

R code

extractData<-function(filename,yearrange,numstocks) {
  # filename is the saved download, e.g. "./stocksdata_dataframe.txt"
  stocksreturns<-read.table(filename,header=TRUE)
  yrpattern<-paste(yearrange[1],"-*",sep="")
  stockdates<-stocksreturns$Dates
  sindex<-agrep(yrpattern,stockdates,fixed=TRUE)[1]  # first row of the start year
  eindex<-length(stockdates)
  colnames<-colnames(stocksreturns)
  stocksreturns_short<-data.frame(stockdates[sindex:eindex])
  colnames(stocksreturns_short)[1]<-"Dates"
  j<-2
  stockindex<-0
  for(i in 2:501) {
    if(stockindex<numstocks) {
      # keep only stocks that have a value on the first day of the range
      if(!is.na(stocksreturns[,i][sindex])) {
        stocksreturns_short[,j]<-stocksreturns[,i][sindex:eindex]
        colnames(stocksreturns_short)[j]<-colnames[i]
        j<-j+1
        stockindex<-stockindex+1
      }
    }
  }
  stocksreturns_short  # return the reduced dataframe
}

Python code

import math
import pandas as pd

def extractData(filename,lowyear,numstocks):
    # filename is the saved download, e.g. 'stocksdata_dataframe.txt'
    stocksreturns=pd.read_table(filename,sep=r'\s+')
    # Column name as used on the slide (it includes the embedded quotes)
    x=stocksreturns['"Dates"'].str.contains(str(lowyear))
    for i in range(len(x)):
        if x[i]:
            break
    startindex=i  # first row whose date contains the start year
    colnames=stocksreturns.columns
    # The slide used .ix, which current pandas no longer provides; .iloc selects by position
    stocksreturns_short=pd.DataFrame(stocksreturns.iloc[startindex:]['"Dates"'])
    stockindex=0
    for i in range(1,501):
        if stockindex<numstocks:
            # keep only stocks that have a value on the first day of the range
            if not math.isnan(stocksreturns[colnames[i]][startindex]):
                stocksreturns_short[colnames[i]]=stocksreturns[colnames[i]][startindex:]
                stockindex=stockindex+1
    return stocksreturns_short


R/Python packages for PCA

§  Size of the data is only about 23 MB in txt format.
–  No memory-bound issues running on my laptop.

§  R has many choices (one too many) for PCA:
–  prcomp/princomp/PCA/dudi.pca/acp
–  prcomp can scale and center the data (very convenient).
•  prcomp(stocksreturns_short,scale=TRUE,center=TRUE,retx=TRUE)
•  Reconstruct data with the predict function.
•  prcomp uses svd beneath the covers.

§  Python seems to have several choices for PCA as well.
–  matplotlib.mlab.PCA
–  MDP (Modular toolkit for Data Processing) PCA
–  numpy.linalg.eig/scipy.linalg.eig etc.

§  Both languages seem to have adequate support for PCA in multiple ways.

§  Our approach: Use SVD in both R/Python:
–  Do the same operations and compare runtimes.
–  svd in numpy returns transpose(V), while R returns V.
–  Both R and Python return d as a vector; trivial to make a diagonal matrix for reconstruction of data.
–  Indexing starts from 0 in Python; in R from 1 :)

covariance matrix/eigenvalues/eigenvectors approach.
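The NumPy conventions mentioned on this slide (a transposed V, d returned as a vector, zero-based indexing) look like this in practice. The small matrix is made up for illustration.

```python
import numpy as np

# Hypothetical 4x3 matrix standing in for the data
X = np.array([[2.0, 0.0, 1.0],
              [1.0, 3.0, 0.0],
              [0.0, 1.0, 4.0],
              [1.0, 1.0, 1.0]])

# numpy returns U, the singular values as a 1-D vector, and V already transposed
U, d, Vt = np.linalg.svd(X, full_matrices=False)

D = np.diag(d)                 # build the diagonal matrix from the vector
X_back = U @ D @ Vt            # no extra transpose needed on the numpy side

V = Vt.T                       # the layout R's svd()$v uses (signs of vectors may differ)
first_component = Vt[0]        # index 0 here; the same vector is column 1 in R
```

In R the equivalent reconstruction needs the explicit transpose, `u %*% diag(d) %*% t(v)`, because `svd()` returns V rather than its transpose.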


PCA on combined data using SVD

§  PCA is an SVD operation: X = U D V^T

§  X is the stocks data (6162x200).

§  D*D gives the eigenvalues (p=200).

§  Reconstruct the data by zeroing out regions as shown below.

§  Choose q (explained ahead).


PCA on combined data – Approach

§  Perform an SVD of the stock returns data.

§  Find the number of eigenvalues q comprising 50%, 75%, 90% and 100% of the sum of all the eigenvalues.
–  Eigenvalues = d*d from the SVD

§  Zero out the remaining eigenvectors/eigenvalues.
–  In Python, use copy.copy for copying eigenvalues/vectors from the SVD (assignment is done using references).

§  Reconstruct the data with matrix-multiply operations.
–  X_reconstructed = U*D*t(V)
–  Measure std_dev(data_reconstructed - original_data).
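The approach above can be condensed into one function. This is a sketch with assumed names (`reconstruct`, `fraction`) on a random stand-in matrix. It zeroes only the trailing singular values, which gives the same product as also zeroing the matching columns of U and rows of V^T.

```python
import numpy as np

def reconstruct(X, fraction):
    """Keep the q leading components whose eigenvalues (d*d) reach `fraction`
    of the total, rebuild X from them, and report the std of the difference."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    eigvals = d * d
    share = np.cumsum(eigvals) / eigvals.sum()
    # First position where the cumulative share reaches the fraction;
    # min() guards against rounding when fraction is 1.0
    q = min(int(np.searchsorted(share, fraction)) + 1, len(d))
    d_kept = d.copy()                  # explicit copy: plain assignment would alias d
    d_kept[q:] = 0.0                   # zero out the remaining singular values
    X_hat = U @ np.diag(d_kept) @ Vt
    return q, X_hat, float(np.std(X_hat - X))

# Hypothetical stand-in for the matrix of price differences
rng = np.random.default_rng(3)
X = rng.normal(size=(30, 5))
q_all, X_all, err_all = reconstruct(X, 1.0)
q_half, X_half, err_half = reconstruct(X, 0.5)
```

Keeping every component reproduces the input up to floating-point rounding, while the 50% setting keeps fewer components and leaves a visible residual.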


Code in Python for PCA on combined data

import copy
import numpy as np

eigvalspercent=[0.5,0.75,0.90,1.0]  # proportions from the approach slide
colnames=stocksreturns_short.columns
# Day-to-day differences per stock; column 0 of stocksreturns_short holds the dates
diff_data_combined=np.zeros(shape=(stocksreturns_short.shape[0]-1,numstocks))
for i in range(0,numstocks):
    diff_data_combined[:,i]=np.diff(stocksreturns_short[colnames[i+1]].values)
[u_original,d_original,v_original]=np.linalg.svd(diff_data_combined,full_matrices=False)
d_diag_original=np.diag(d_original)
eigvals_combined=d_original*d_original
totalsumeigvals=sum(eigvals_combined)
for percent in eigvalspercent:
    # Count the eigenvalues needed to reach this share of the total
    sumeigvals=0
    for i in range(0,numstocks):
        sumeigvals=sumeigvals+eigvals_combined[i]
        if sumeigvals>=(percent*totalsumeigvals):
            neigvals=i+1
            break
    u=copy.copy(u_original)
    d_diag=copy.copy(d_diag_original)
    v=copy.copy(v_original)  # numpy returns transpose(V)
    nvals=np.shape(diff_data_combined)[1]
    # Zero everything beyond the first neigvals components.
    # u[:][a:b] would slice rows, so both axes are indexed explicitly.
    u[:,neigvals:nvals]=0
    d_diag[neigvals:nvals,neigvals:nvals]=0
    v[neigvals:nvals,:]=0
    dproduct=np.dot(u,np.dot(d_diag,v))


Code in R for PCA on combined data

data.combined<-diff_stockyr
svd.combined<-svd(data.combined)  #find SVD
eigvalues.combined<-svd.combined$d * svd.combined$d
totalsum<-sum(eigvalues.combined)
proportionrange<-c(0.5,0.75,0.90,1)
for(proportion in proportionrange) {
  sum<-0
  neigvalues<-0
  for(i in 1:numstocks) {
    sum<-sum+eigvalues.combined[i]
    neigvalues<-neigvalues+1
    if((sum/totalsum)>=proportion) {
      cat(sprintf("Number of eigenvalues for combined data for proportion %f = %d\n",proportion,neigvalues))
      break
    }
  }
  nvals<-dim(data.combined)[2]
  u<-svd.combined$u  #Copy SVD matrices
  d<-diag(svd.combined$d)
  v<-svd.combined$v
  #Zero the components after the first neigvalues (nothing to zero when all are kept)
  if(neigvalues<nvals) {
    drop<-(neigvalues+1):nvals
    u[,drop]<-0
    d[drop,drop]<-0
    v[,drop]<-0
  }
  stock.data<-u %*% d %*% t(v)  #Do a matrix multiply to get data
}


PCA on combined data results

•  138 principal components out of 200 account for 90% of the sum of all the eigenvalues.
•  Reconstructing the data with 138 components has negligible error (10^-5).


Yearly PCA

§  Instead of doing a PCA on the combined data from 1988-2012, how about a yearly PCA?
–  PCA on yearly data

§  Separate the combined dataframe into yearly dataframes (1 for each year).

§  The number of observations varies for each year.

§  Calculate the number of eigenvalues accounting for 50%, 75%, 90% and 100% of the sum of all eigenvalues (same operation as PCA on combined data).

§  Do a reconstruction for each proportion/each year (same operation as PCA on combined data).
–  25 separate PCAs
–  100 reconstructions in total

Dataframe 1

Dates        MMM      ABT   …
01-01-1988   109.62   NA    …
01-02-1988   107.12   NA    NA
01-03-1988   …        NA    …

…

Dataframe 25

Dates        MMM      ABT   …
01-01-2012   109.62   NA    …
01-02-2012   107.12   NA    NA
01-03-2012   …        NA    …
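Splitting the combined data into one block per year can be sketched without a dataframe at all. The helper below is hypothetical and assumes date strings that start with a four-digit year (`YYYY-MM-DD`) and prices already free of missing values.

```python
import numpy as np
from collections import defaultdict

def yearly_differences(dates, prices):
    """Group row positions by year, then return {year: day-to-day differences
    computed only within that year}."""
    rows_by_year = defaultdict(list)
    for row, date in enumerate(dates):
        rows_by_year[int(date[:4])].append(row)
    return {year: np.diff(prices[rows], axis=0)
            for year, rows in sorted(rows_by_year.items())}

# Hypothetical toy data: 3 trading days in 1988, 2 in 1989, two stocks
dates = ["1988-01-04", "1988-01-05", "1988-01-06", "1989-01-03", "1989-01-04"]
prices = np.array([[1.0, 10.0],
                   [2.0, 12.0],
                   [4.0, 15.0],
                   [5.0, 20.0],
                   [7.0, 26.0]])
yearly = yearly_differences(dates, prices)
```

Each yearly block has one row fewer than the number of trading days in that year, which is why the number of observations differs from year to year.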


Code in Python for extracting yearly dataframes

Python code

import numpy as np

colnames=stocksreturns_short.columns
x=stocksreturns_short['"Dates"'].str.contains(str(years))
# The slide used .ix; .loc with the boolean mask selects the rows of this year
stocksreturns_yr=stocksreturns_short.loc[x]
shape0,shape1=np.shape(stocksreturns_yr)
data_yr=np.zeros((shape0,shape1))  # not initialised on the slide; allocated here
diff_data_yr=np.zeros((shape0-1,shape1))
for i in range(0,numstocks):
    data_yr[:,i]=stocksreturns_yr[colnames[i+1]].values
    diff_data_yr[:,i]=np.diff(data_yr[:,i])

R code

yrpattern<-paste(years,"-*",sep="")
yrindices<-agrep(yrpattern,stocksreturns_short$Dates,fixed=TRUE)
val_stockyr<-data.frame(stocksreturns_short$Dates[yrindices[1]:tail(yrindices,n=1)])
colnames(val_stockyr)[1]<-"Dates"
for(i in 2:(numstocks+1)) {
  val_stockyr[,i]<-stocksreturns_short[,i][yrindices[1]:tail(yrindices,n=1)]
  colnames(val_stockyr)[i]<-colnames(stocksreturns_short)[i]
}
#log_stockyr is not defined on the slide; it is assumed to be built elsewhere in the script
diff_stockyr<-data.frame(matrix(NA,nrow=(dim(log_stockyr)[1]-1),ncol=numstocks))
for(j in 1:numstocks) diff_stockyr[,j]<-diff(log_stockyr[,j])


Analysis of yearly PCA data

§  Number of eigenvalues with 50% of total sum drops to 1 in 2008

§  Stock movement is highly correlated due to macro-economic trends.
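Why strong correlation pushes the count down to a single eigenvalue can be seen on synthetic data. The sketch below is an illustration only: the "market" factor, the noise level and the matrix sizes are invented, not taken from the stock data.

```python
import numpy as np

def components_for_share(X, share):
    """Number of leading eigenvalues (d*d) needed to reach `share` of their sum."""
    d = np.linalg.svd(X, compute_uv=False)
    cumulative = np.cumsum(d * d) / np.sum(d * d)
    return min(int(np.searchsorted(cumulative, share)) + 1, len(d))

rng = np.random.default_rng(42)
days, stocks = 250, 20

# Hypothetical case 1: every stock moves independently
independent = rng.normal(size=(days, stocks))

# Hypothetical case 2: every stock follows one common factor plus small noise
market = rng.normal(size=(days, 1))
correlated = market + 0.1 * rng.normal(size=(days, stocks))

n_independent = components_for_share(independent, 0.5)
n_correlated = components_for_share(correlated, 0.5)
```

When all series follow one common factor, a single component already carries more than half of the total, whereas independent series need several.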


PCA on yearly data

§  Being a C/C++/Fortran HPC programmer, I use for loops in R/Python.

§  Not efficient for R (for loop is an object; assignment in R is a copy operation).

§  Python's assignment is done with references, so it works better with for loops and has less function overhead.

§  Development environment:
–  ipython-2.7 with pandas and numpy installed through the ports package
–  RStudio 0.97 with the R binary downloaded for Mac
–  No attempt to optimize the build for either R or Python.

§  The total code takes more than 20 seconds in R versus about 11.9 seconds in Python on my MacBook Pro.

§  Timings may change with less reliance on for loops in the code.
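As one example of relying less on for loops, the per-column differencing used earlier can be replaced by a single array call. The sketch below uses a random matrix with the same shape as the combined data (an assumption) and records timings without claiming any particular speedup.

```python
import time
import numpy as np

def diff_with_loop(prices):
    """Column-by-column differencing, in the style of the slide code."""
    out = np.zeros((prices.shape[0] - 1, prices.shape[1]))
    for i in range(prices.shape[1]):
        out[:, i] = np.diff(prices[:, i])
    return out

def diff_vectorized(prices):
    """Same result with one call over the whole array."""
    return np.diff(prices, axis=0)

# Hypothetical stand-in with the dimensions quoted for the combined data
rng = np.random.default_rng(5)
prices = rng.random((6162, 200))

start = time.perf_counter()
looped = diff_with_loop(prices)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
vectorized = diff_vectorized(prices)
vectorized_seconds = time.perf_counter() - start
```

Both versions return the same matrix; how much time the vectorized form saves depends on the machine and the build, so the two timing variables are meant to be inspected rather than trusted as a benchmark.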


Parallelizing yearly PCA

§  Can we use parallelism in R and Python productively?

§  Both R and Python provide several ways to parallelize:
–  Multiple cores
–  Distributed parallelism using MPI or sockets

§  Use coarse-grained parallelism to speed up our computations.
–  Look into how both languages allow use of multiple cores on modern-day processors.

§  Very easy to apply coarse-grained parallelism to yearly PCA.
–  Divide the years amongst threads/processes.

§  For R, use the doMC/foreach packages, which work on multiple cores.

§  Python threads do not work well due to the global interpreter lock (GIL).
–  Use the IPython ipcluster parallelization framework.

§  Further evaluation using MPI on distributed clusters is needed.


Parallelizing yearly PCA in R

§  foreach depends on a backend for execution.
§  We register doMC (multiple cores) as the backend for the yearly PCA in this case.
§  MPI can also be used as a backend for distributed clusters.
§  The snow package is another option (higher level, for distributed clusters).
§  Not just limited to for loops:
–  Use mclapply for a multi-core lapply etc.

#Sequential code
for(years in 1988:2012) {
  for(proportion in c(0.5,0.75,0.90,1)) {
    #Code to extract yearly data, do PCA and then reconstruct
  }
}

#Parallel code
library(doMC)  #also attaches foreach
registerDoMC(4)  #Register multicore as backend with 4 cores
foreach(years=1988:2012) %dopar% {
  for(proportion in c(0.5,0.75,0.90,1)) {
    #Code to extract yearly data, do PCA and then reconstruct
  }
}


Timing Results

Intel Core i7 Macbook Pro (4 cores, 8 hyper-threading threads)

[Figure: timing results chart; axis label: Threads]


Parallelizing yearly PCA in Python

§  Using IPython's direct interface:
–  Start the IPython backend.
–  % ipcluster-2.7 start -n 4 (4 is the number of processes)

§  Rewrite the pcaData function so that it can be used with the map API of Python.
–  pcaData(stocksreturns_short,year)
–  pcaData extracts the data for year from stocksreturns_short, performs an SVD and then does the reconstructions with eigenvalue percentages as before.

§  Processes (unlike threads in R) make us re-import all the modules inside the function.
–  Higher memory footprint
–  More heavyweight compared to threads

§  Create a list for each process's function arguments.

§  Parallelize across years as in R.
–  Each process computes a subset of the SVDs and reuses a single SVD for 4 reconstructions.

#code for map_sync
from datetime import datetime
# Assumption: the slide does not show how the view is created.
# IPython.parallel was later released separately as ipyparallel.
from IPython.parallel import Client
dview=Client()[:]  # direct view over all engines
x=[]
for i in range(0,25):
    x.append(stocksreturns_short)
starttime=datetime.now()
dview.map_sync(pcaData,x,range(1988,2013))
print(datetime.now()-starttime)
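The shape of a map-friendly worker can be sketched without a running cluster. In this hypothetical version the worker takes one self-contained task, re-imports what it needs, and the mapping function is a parameter: the builtin `map` runs it serially here, and a parallel map such as a process pool's `map` or an IPython view's `map_sync` could be passed instead. The names `pca_task` and `run_years` are assumptions, and the worker only counts components rather than doing the full reconstruction.

```python
import numpy as np

def pca_task(task):
    """task = (year, matrix of differences). Returns (year, {fraction: q})."""
    # Imported inside the function, as a worker in a separate process would need
    import numpy as np
    fractions = (0.5, 0.75, 0.90, 1.0)
    year, X = task
    d = np.linalg.svd(X, compute_uv=False)       # one SVD, reused for all fractions
    share = np.cumsum(d * d) / np.sum(d * d)
    counts = {f: min(int(np.searchsorted(share, f)) + 1, len(d)) for f in fractions}
    return year, counts

def run_years(tasks, mapper=map):
    """Apply pca_task to every task with the given map function."""
    return dict(mapper(pca_task, tasks))

# Hypothetical yearly matrices standing in for the real data
rng = np.random.default_rng(7)
tasks = [(year, rng.normal(size=(30, 5))) for year in (1988, 1989, 1990)]
results = run_years(tasks)
```

Bundling the year and its data into a single argument keeps the worker independent of any global state, which is what makes it usable from separate processes.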


Timing results in Python

§  Starting ipcluster with 8 processes leads to processes hanging.
§  IPython with multiple processes led to some memory issues.
§  The scalability of Python shows a similar trend to R.

[Figure: timing results for Python; axis label: Threads]


Summary

§  Both R and Python offer good choices for PCA.

§  R has many packages for tasks such as downloading financial data, PCA etc.
–  Python has good support as well.

§  R offers a cohesive framework.
–  Installing packages is pain-free.
–  Parallelization in R is very simple.

§  R seems to be slower, as its assignment operator requires copy operations, which add a lot of overhead (as does my use of for loops).

§  Python is more forgiving of for loops, and seems to require fewer statements to do the same work.
–  Pandas/NumPy add dataframe capabilities to Python's native string handling capabilities to provide a strong platform for data analysis.


Future Work

§  Profiling of the code at statement level etc.

§  How do R/Python work for memory-bound/compute-bound problems?

§  Work with distributed matrices (DistNumPy for Python, pbdR for R).

§  Use MPI as a backend for parallelization on a cluster.

§  Make interpreted code faster for both R/Python through compilers (cmpfun for R, Cython for Python).