Reading this document, you can learn how to use RHive's basic functions. This document was last updated on 5 March 2012.
RHive tutorial - basic functions This tutorial explains how to load the RHive library and use RHive's basic functions.
Loading RHive Load RHive the same way you load any other R package:
library(RHive)
But before loading RHive, do not forget to configure the HADOOP_HOME and HIVE_HOME environment variables. If they are not set, you can set them temporarily before loading the library, as follows. HADOOP_HOME is the home directory where Hadoop is installed and HIVE_HOME is the home directory where Hive is installed. Consult RHive tutorial - RHive installation and setting for details on the environment variables.
Sys.setenv(HIVE_HOME="/service/hive0.7.1")
Sys.setenv(HADOOP_HOME="/service/hadoop0.20.203.0")
library(RHive)
rhive.init The rhive.init function performs RHive's internal initialization. If the environment variables were set correctly before loading RHive, it runs automatically. But if they were not configured when RHive was loaded via library(RHive), the following error message will appear.
rhive.connect()
Error in .jcall("java/lang/Class", "Ljava/lang/Class;", "forName", cl, :
No running JVM detected. Maybe .jinit() would help.
Error in .jfindClass(as.character(class)) :
No running JVM detected. Maybe .jinit() would help.
In that case, set HADOOP_HOME and HIVE_HOME as shown below and call rhive.init, or exit R, configure the environment variables, and restart R.
Sys.setenv(HIVE_HOME="/service/hive0.7.1")
Sys.setenv(HADOOP_HOME="/service/hadoop0.20.203.0")
rhive.init()
rhive.connect All RHive functions work only after a connection to the Hive server has been established. If you call other RHive functions without first connecting via the rhive.connect function, they will fail with errors like the following.
Error in .jcast(hiveclient[[1]], new.class = "org/apache/hadoop/hive/service/HiveClient", :
cannot cast anything but Java objects
Establishing a connection with Hive server to use RHive is simple with the following:
rhive.connect()
rhive.connect can also take additional arguments, such as the Hive server's address, and return a connection object.
rhiveConnection <- rhive.connect("10.1.1.1")
If the Hive server is installed on a machine other than the one running RHive, you can connect remotely by passing the server's address as an argument to the rhive.connect function.
If you have multiple Hadoop and Hive clusters configured for RHive and want to switch between them, then, much as with a DB client such as MySQL, you open connections and hand a connection object to the functions via an argument to explicitly select which connection to use.
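As a rough sketch, switching between two clusters might look like the following. The second server address and the hiveclient argument name are assumptions, not confirmed by this tutorial; check your RHive version's documentation for the exact parameter.

```r
# a minimal sketch, assuming two reachable Hive servers
conn1 <- rhive.connect("10.1.1.1")   # first cluster
conn2 <- rhive.connect("10.1.1.2")   # second cluster (hypothetical address)
# pass a connection explicitly to select a cluster
# (the argument name 'hiveclient' is an assumption)
rhive.query("SELECT * FROM usarrests", hiveclient = conn1)
rhive.close(conn1)
rhive.close(conn2)
```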
rhive.query If you have experience with Hive, you probably know that Hive supports SQL-like syntax for handling data on HDFS via Map/Reduce. rhive.query sends SQL to Hive and receives the results back. Users who know SQL will find examples like the following familiar.
rhive.query("SELECT * FROM usarrests")
If you run the example above, the contents of the table named 'usarrests' are printed on screen. Instead of just printing the returned result, you can also assign it to a data.frame object.
resultDF <- rhive.query("SELECT * FROM usarrests")
One thing to beware of: if the data returned by rhive.query is larger than the memory available on the machine running R, fetching it will exhaust memory and produce an error. Do not fetch result sets of that size into an R object. Instead, first create a temporary table and insert the SQL results into it. You can do that as follows.
rhive.query("
CREATE TABLE new_usarrests (
rowname string,
murder double,
assault int,
urbanpop int,
rape double
)")
rhive.query("INSERT OVERWRITE TABLE new_usarrests SELECT * FROM usarrests")
Consult a Hive document for a detailed account of how to use Hive SQL.
rhive.close If you are finished with Hive and no longer need RHive functions, use the rhive.close function to terminate the connection.
rhive.close()
Alternatively, you can assign a specific connection to close it.
conn <- rhive.connect()
rhive.close(conn)
rhive.list.tables The rhive.list.tables function returns the list of tables in Hive.
rhive.list.tables()
tab_name
1 aids2
2 new_usarrests
3 usarrests
This is effectively identical to this:
rhive.query("SHOW TABLES")
rhive.desc.table The rhive.desc.table Function shows the description of the chosen table.
rhive.desc.table("usarrests")
col_name data_type comment
1 rowname string
2 murder double
3 assault int
4 urbanpop int
5 rape double
This is effectively identical to this:
rhive.query("DESC usarrests")
rhive.load.table The rhive.load.table function loads a Hive table's contents into an R data.frame object.
df1 <- rhive.load.table("usarrests")
df1
This is effectively identical to this:
df1 <- rhive.query("SELECT * FROM usarrests")
df1
rhive.write.table The rhive.write.table function is the inverse of rhive.load.table, and is often more useful. Normally, if you wish to add data to Hive, you must first create a table. rhive.write.table requires no such preparation: it creates a Hive table from an R data.frame and inserts all of its data.
library(MASS) # the UScrime data set comes with the MASS package
head(UScrime)
M So Ed Po1 Po2 LF M.F Pop NW U1 U2 GDP Ineq Prob Time y
The rhive.write.table function fails with an error if the table to be saved already exists in Hive. Hence, before saving a data.frame whose name matches an existing Hive table, you must delete that table, as below.
if (rhive.exist.table("uscrime")) {
rhive.query("DROP TABLE uscrime")
}
rhive.write.table(UScrime)
RHive - alias functions RHive's function names look like they follow S3 generic naming rules, but many are not actually generic. This leaves room for S3 generics that RHive may or may not support in the future. For users who dislike the confusion caused by functions that contain "." yet are not generic, there are functions with different names that serve the same roles. The alias functions are listed below.
hiveConnect This is the same as rhive.connect.
hiveQuery This is the same as rhive.query.
hiveClose This is the same as rhive.close.
hiveListTables This is the same as rhive.list.tables.
hiveDescTable This is the same as rhive.desc.table.
hiveLoadTable This is the same as rhive.load.table.
rhive.basic.cut rhive.basic.cut converts one numerical column of a table into a factorized column. The range of the numerical column is divided into intervals, and the values are factorized according to which interval they fall into. rhive.basic.cut receives six arguments: tablename (a table name), col (a numerical column name), breaks, right, summary, and forcedRef. breaks gives the numerical cut points for the column. right indicates whether the ends of the intervals are open or closed: if TRUE, the intervals are closed on the right and open on the left; if FALSE, vice versa. summary = TRUE returns the total counts of values falling into each interval; if FALSE, the name of a new table containing the factorized column is returned. forcedRef = TRUE forces rhive.basic.cut to return a table name, while forcedRef = FALSE returns a data frame. The defaults of right, summary, and forcedRef are TRUE, FALSE, and TRUE respectively.
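A minimal sketch of a call, assuming the iris data has been written to Hive as a table named 'iris' with a 'sepallength' column; the break points here are arbitrary illustrations:

```r
# a sketch, assuming an 'iris' table exists in Hive
tname <- rhive.basic.cut("iris", "sepallength",
                         breaks = seq(4.0, 8.0, 0.5),  # illustrative cut points
                         right = FALSE, summary = FALSE, forcedRef = TRUE)
# tname holds the name of the new Hive table containing the factorized column
```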
rhive.basic.cut2 rhive.basic.cut2 converts two numerical columns of a table into two factorized columns. That is, the range of each numerical column is divided into intervals, and the values in each column are factorized according to which interval they fall into. rhive.basic.cut2 receives eight arguments: tablename (a table name), col1 and col2 (two column names), breaks1, breaks2, right, keepCol, and forcedRef. breaks1 and breaks2 are the numerical cut points for the two columns. right indicates whether the ends of the intervals are open or closed: if TRUE, the intervals are closed on the right and open on the left; if FALSE, vice versa. keepCol = TRUE keeps the two numerical columns even after the conversion; otherwise, the factorized columns replace the original numerical columns. forcedRef = TRUE forces rhive.basic.cut2 to return a table name, while forcedRef = FALSE returns a data frame. The defaults of right, keepCol, and forcedRef are TRUE, FALSE, and TRUE respectively.
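The sample output below was produced by a call of roughly this shape. The break points shown are assumptions; the actual values used to generate that output are not recorded in this tutorial.

```r
# a sketch; break points are illustrative only
tname <- rhive.basic.cut2("iris", "sepallength", "petallength",
                          breaks1 = seq(4.0, 8.0, 0.5),
                          breaks2 = seq(1.0, 7.0, 0.5),
                          right = FALSE, keepCol = TRUE, forcedRef = TRUE)
# tname holds the name of the result table, e.g. "rhive_result_..."
```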
> results = rhive.query("select * from rhive_result_1330315663")
> head(results)
rowname sepalwidth petalwidth species sepallength sepallength_cut petallength petallength_cut rep
1 1 3.5 0.2 setosa 5.1 NULL 1.4 [1.0,1.5) 1
2 2 3.0 0.2 setosa 4.9 [4.5,5.0) 1.4 [1.0,1.5) 1
3 3 3.2 0.2 setosa 4.7 [4.5,5.0) 1.3 [1.0,1.5) 1
4 4 3.1 0.2 setosa 4.6 [4.5,5.0) 1.5 [1.5,2.0) 1
5 5 3.6 0.2 setosa 5.0 NULL 1.4 [1.0,1.5) 1
rhive.basic.xtabs rhive.basic.xtabs makes a contingency table by cross-classifying factors. A formula object and a table name are used as input arguments, and a contingency table in matrix format is returned based on the given formula. For instance, in the formula "ncontrols ~ agegp + alcgp", the two column names agegp and alcgp are the cross-classifying factors, and the observations for each combination of those factors are summed over the column ncontrols.
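As a sketch, using R's esoph data, which has agegp, alcgp, and ncontrols columns. Assumes the data has been written to Hive as a table named 'esoph', e.g. via rhive.write.table:

```r
# a sketch, assuming an 'esoph' table exists in Hive
rhive.write.table(esoph)  # esoph ships with R's datasets package
xt <- rhive.basic.xtabs(ncontrols ~ agegp + alcgp, "esoph")
# xt is a contingency matrix of summed ncontrols by agegp and alcgp
```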
rhive.basic.t.test The rhive.basic.t.test function runs Welch's t-test on two samples. The difference between the two samples' means is tested against the alternative hypothesis that the difference is not 0; that is, a two-sided test is performed.
The following is an example testing the mean difference between the irises' sepal lengths and petal lengths. Pay attention to how the function is called with the "sepallength" and "petallength" columns.
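The output below appears to come from a call of roughly this shape. The argument order (table1, col1, table2, col2) is an assumption inferred from the output; consult RHive's documentation for the exact signature.

```r
# a sketch; the argument order is an assumption, not confirmed by this tutorial
rhive.basic.t.test("iris", "sepallength", "iris", "petallength")
```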
[1] "t = 13.1422338118038, df = 211.542688378717, p-value = 0, mean of x : 5.84333333333333, mean of y : 3.758"
$statistic
t
13.14223
$parameter
df
211.5427
$p.value
[1] 0
$estimate
$estimate[[1]]
mean of x
5.843333
$estimate[[2]]
mean of y
3.758
Interpreting the results: the p-value of 0 reveals a difference between the means of sepal length and petal length. The resulting statistics are returned as an R list object, and a string assembled from those statistics is printed to the console.
The iris data comprises 150 observations provided with R. Running R's t.test on the same data gives a slightly different t-statistic of 13.0984. This is because the t.test function computes its t-statistic from the sample variance, while the rhive.basic.t.test function uses the population variance. With little data, as in this example, the t-statistics may deviate like this, but the deviation dwindles as the data grows. Since rhive.basic.t.test is designed with massive data analysis in mind, it uses the population variance for speedy calculation.
rhive.block.sample The rhive.block.sample function takes samples from a Hive table by the block. Its percent argument optionally sets the percentage of data to extract from the total; the default value of 0.01 means 0.01% of the total data. Note that percent is not the ratio of sampled rows to total rows, but rather the ratio of sampled blocks to total blocks.
Thus the entire data set may be returned when rhive.block.sample is used on a Hive table with little data. This occurs when the data is smaller than the block size configured in Hive.
The seed argument specifies the random seed used when Hive executes block sampling. Identical seeds make Hive's block sampling return identical results, so to guarantee a fresh random sample on every call, it is best to assign the seed a value generated with R's sample function.
The subset argument optionally specifies a condition restricting which data is extracted from the target Hive table. It takes a character value corresponding to the 'where' clause in Hive's HQL, and must therefore use syntax valid in an HQL where clause.
rhive.block.sample returns a character value: the name of the Hive table containing the sampled blocks. That is, rhive.block.sample automatically creates a temporary Hive table holding the sample and returns that table's name. The following example samples 0.01% of the data in a Hive table called listvirtualmachines, using R's sample function to generate the random seed for Hive's block sampling.
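A sketch of the call described above. The argument names percent and seed follow this section's description; the exact signature may differ by RHive version.

```r
# a sketch of the example described in the text
seed <- sample(1:2^15, 1)  # random seed generated with R's sample function
tname <- rhive.block.sample("listvirtualmachines", percent = 0.01, seed = seed)
# tname holds the name of the temporary table, e.g. "rhive_sblk_1330404552"
```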
As in this example, a Hive table named "rhive_sblk_1330404552", holding 0.01% of the data from the Hive table "listvirtualmachines", has been created.
rhive.basic.scale The rhive.basic.scale function standardizes a numerical column to mean 0 and standard deviation 1. Pass the table name as the first argument and the column name as the second.
The returned list includes, as a string, the name of a new table to which a "scaled_<column name>" column has been added. This table can be accessed and manipulated through RHive just like any other Hive table.
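A minimal sketch, assuming an 'iris' table exists in Hive:

```r
# a sketch; standardizes the column to mean 0, standard deviation 1
result <- rhive.basic.scale("iris", "sepallength")
# result includes the name of a new table with an added
# "scaled_sepallength" column, usable like any other Hive table
```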
rhive.basic.by The rhive.basic.by function runs a group-by on a specified column. For example, grouping the iris data by the "species" column and applying the sum function to "sepallength" returns the sum of sepallength for each species.
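Such a call might look like the following. The argument order (table name, grouping column, aggregate function, target column) is an assumption based on the description above, not a confirmed signature.

```r
# a sketch; argument order is an assumption
result <- rhive.basic.by("iris", "species", "sum", "sepallength")
# result holds the sum of sepallength for each species
```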
Merge is similar to a 'join' in SQL. The following is equivalent.
# Use join to extract and print the names of all rows not found to be common after merging. # Should row names overlap, only print out the name of the former row.
rhive.big.query('select a.sepallength,a.sepalwidth,a.petallength,a.petalwidth,a.species,b.assault,b.urbanpop,b.rape,a.rowname from iris a join usarrests b on a.sepallength = b.murder')
sepallength sepalwidth petallength petalwidth species assault urbanpop rape rowname
1 4.3 3.0 1.1 0.1 setosa 102 62 16.5 14
2 4.4 2.9 1.4 0.2 setosa 149 85 16.3 9
3 4.4 3.0 1.3 0.2 setosa 149 85 16.3 39
4 4.4 3.2 1.3 0.2 setosa 149 85 16.3 43
5 4.9 3.1 1.5 0.1 setosa 159 67 29.3 10
rhive.basic.mode rhive.basic.mode returns the mode and its frequency for a specified column of a Hive table.
rhive.basic.mode('iris', 'sepallength')
sepallength freq
1 5 10
rhive.basic.range rhive.basic.range returns the greatest and least values in a specified numerical column of a Hive table.
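A minimal sketch, assuming an 'iris' table exists in Hive with a numerical 'sepallength' column:

```r
# a sketch; returns the minimum and maximum of the column
rhive.basic.range('iris', 'sepallength')
```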