HBASE CONTINUES dwivedishashwat@gmail.com
Page 1: Hbase

HBASE CONTINUES

Page 2: Hbase

QUERIES

Flow of Hbase read/write

Page 3: Hbase
Page 4: Hbase
Page 5: Hbase

QUERY

About Meter Logs: HBase is best suited for persistent data that grows day by day. As your data grows you can keep adding nodes, and you have many tools for processing the data, such as Pig, Hive and MapReduce.

You can run MapReduce jobs and output datasets that can be indexed and used for faster searches over that huge data.

Page 6: Hbase

QUERY

Usability

HBase isn't suitable for every problem. First, make sure you have enough data. If you have hundreds of millions or billions of rows, then HBase is a good candidate. If you only have a few thousand/million rows, then using a traditional RDBMS might be a better choice due to the fact that all of your data might wind up on a single node (or two) and the rest of the cluster may be sitting idle.

Second, make sure you can live without all the extra features that an RDBMS provides (e.g., typed columns, secondary indexes, transactions, advanced query languages, etc.). An application built against an RDBMS cannot be "ported" to HBase by simply changing a JDBC driver, for example. Consider moving from an RDBMS to HBase as a complete redesign as opposed to a port.

Third, make sure you have enough hardware. Even HDFS doesn't do well with anything less than 5 DataNodes (due to things such as HDFS block replication which has a default of 3), plus a NameNode.

HBase can run quite well stand-alone on a laptop - but this should be considered a development configuration only.

Page 7: Hbase

QUERY

Difference Between NoSQL DB and RDBMS

NoSQL is a kind of database that doesn't have a fixed schema like a traditional RDBMS does. With NoSQL databases the schema is defined by the developer at run time. Developers don't write normal SQL statements against the database, but instead use an API to get the data that they need. NoSQL databases can usually scale across different physical servers easily, without the client needing to know which server holds the data it is looking for.
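A minimal sketch of that idea, assuming a toy in-memory store. The class `ToyTable`, its `put`/`get` methods and the sample values are hypothetical and are not the HBase API. Each row carries whatever columns the developer chooses at run time, and data is read through method calls instead of SQL.

```python
from collections import defaultdict


class ToyTable:
    """Toy schemaless table (hypothetical): row key -> {column name -> value}."""

    def __init__(self):
        # No column list is declared up front; rows gain columns as they are written.
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Store one cell; the column does not have to exist beforehand."""
        self._rows[row_key][column] = value

    def get(self, row_key, column=None):
        """Return one cell, or a copy of the whole row when no column is given."""
        row = self._rows.get(row_key, {})
        if column is None:
            return dict(row)
        return row.get(column)


table = ToyTable()
table.put("user1", "name", "Bhavin")
table.put("user1", "age", 29)
table.put("user2", "name", "Roger")
table.put("user2", "city", "Pune")  # made-up column that user1 never defined
print(table.get("user1"))
print(table.get("user2", "city"))
```

Real NoSQL stores add persistence and distribution on top of this access pattern; the sketch only shows the run-time schema and the API-style access.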

Why we should think of using it: durability, scalability on the fly, distributed data, persistence, etc.

Page 8: Hbase

QUERY

Row-Oriented and Column-Oriented DBs

A column-oriented DBMS is a database management system (DBMS) that stores data tables as sections of columns of data rather than as rows of data, as most relational DBMSs do.

Row-oriented storage: 1,Smith,Joe,40000; 2,Jones,Mary,50000; 3,Johnson,Cathy,44000;

Column-oriented storage: 1,2,3; Smith,Jones,Johnson; Joe,Mary,Cathy; 40000,50000,44000;

Page 9: Hbase

A column-oriented database differs from a traditional row-oriented database in how it stores data. By storing a whole column together instead of a row, you can minimize disk access when selecting a few columns from a row containing many columns. In row-oriented databases there's no difference if you select just one or all fields from a row.

ID  Name    Age

1   Bhavin  29

2   Roger   30

This would be persisted in a conventional RDBMS as follows: 1,Bhavin,29|2,Roger,30

In a column-oriented DBMS this would be persisted as: 1,2|Bhavin,Roger|29,30
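The two layouts above can be reproduced with a short sketch. The function names are illustrative, and real engines use binary on-disk formats rather than delimited strings; the only point here is the grouping of values.

```python
def serialize_rows(records):
    """Row-oriented layout: all fields of one record are stored together."""
    return "|".join(",".join(str(v) for v in rec) for rec in records)


def serialize_columns(records):
    """Column-oriented layout: all values of one column are stored together."""
    columns = zip(*records)  # transpose rows into columns
    return "|".join(",".join(str(v) for v in col) for col in columns)


records = [(1, "Bhavin", 29), (2, "Roger", 30)]
print(serialize_rows(records))     # 1,Bhavin,29|2,Roger,30
print(serialize_columns(records))  # 1,2|Bhavin,Roger|29,30
```

Reading only the Age column touches one contiguous segment in the second layout, whereas the first layout forces a scan across every record.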

Page 10: Hbase

MORE DEPTH OF HBASE CONCEPTS

Page 11: Hbase

MODES OF HBASE OPERATION

Standalone: In standalone mode, HBase does not use a distributed file system; it uses the local filesystem instead, and all HBase daemons plus a local ZooKeeper run inside a single Java VM.

This mode is best suited for testing and experimenting with HBase.

Page 12: Hbase

MODES OF HBASE OPERATION…

Pseudo-Distributed: In pseudo-distributed mode, HBase processing is distributed over all of the cores/processors on a single machine. HBase writes all files to the Hadoop Distributed File System (HDFS), and all services and daemons communicate with each other over local TCP sockets.

A pseudo-distributed mode is simply a distributed mode run on a single host. Use this configuration for testing and prototyping on HBase. Do not use this configuration for production or for evaluating HBase performance.

Page 13: Hbase

MODES OF HBASE OPERATION…

Distributed: In distributed mode the daemons are spread across all nodes in the cluster.

Distributed mode requires an instance of the Hadoop Distributed File System (HDFS).

Page 14: Hbase

BASIC PREREQUISITES

Java

SSH

DNS

NTP

ulimit and nproc

Hadoop (for distributed mode)

Page 15: Hbase

BASIC PREREQUISITES IN DETAIL

Just like Hadoop, HBase requires at least Java 6 from Oracle.

ssh must be installed and sshd must be running to use Hadoop's scripts to manage remote Hadoop and HBase daemons. You must be able to ssh to all nodes, including your local node, using passwordless login.

HBase uses the local hostname to self-report its IP address. Both forward and reverse DNS resolution must work.
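One way to check the DNS requirement on a node is sketched below, using only Python's standard `socket` module. The helper name `dns_round_trip_ok` and its lenient short-name comparison are assumptions made for illustration, not an HBase tool.

```python
import socket


def dns_round_trip_ok(hostname, forward=socket.gethostbyname,
                      reverse=socket.gethostbyaddr):
    """Return True if hostname -> IP -> hostname resolves consistently.

    The resolvers are parameters so the check can be exercised without a network.
    """
    try:
        ip = forward(hostname)   # forward lookup: name to address
        names = reverse(ip)      # reverse lookup: (primary_name, aliases, addresses)
    except OSError:              # socket.gaierror and socket.herror are OSError subclasses
        return False
    resolved = {names[0], *names[1]}
    # Assumed leniency: accept a match on the short (unqualified) host name too.
    short = hostname.split(".")[0]
    return hostname in resolved or short in {n.split(".")[0] for n in resolved}


if __name__ == "__main__":
    host = socket.gethostname()
    print(host, "resolves both ways:", dns_round_trip_ok(host))
```

Running it on every node is a quick sanity check before starting the daemons.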

Page 16: Hbase

The clocks on cluster members should be in basic alignment. Some skew is tolerable, but wild skew could generate odd behaviors. Run NTP on your cluster, or an equivalent.

HBase uses a lot of files all at the same time. The default ulimit -n (i.e., the user file limit) of 1024 on most *nix systems is insufficient. Any significant amount of loading will lead you to "java.io.IOException...(Too many open files)".
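A small sketch for inspecting the open-file limit of the current user, using Python's standard `resource` module (Unix only). The target of 10240 is only an assumed example value, not a figure from this deck; choose a limit that fits your cluster.

```python
import resource

ASSUMED_TARGET = 10240  # assumed example target, well above the 1024 default


def open_file_limit():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def limit_is_sufficient(soft, target=ASSUMED_TARGET):
    """True if the soft limit is unlimited or at least the target."""
    return soft == resource.RLIM_INFINITY or soft >= target


if __name__ == "__main__":
    soft, hard = open_file_limit()
    print("soft limit:", soft, "hard limit:", hard)
    print("sufficient for the assumed target:", limit_is_sufficient(soft))
```

Run it as the user that starts the HBase daemons, since limits are applied per user.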

Hadoop is needed for the distributed file system and for MapReduce processing.


Page 17: Hbase

HBASE CONFIGURATION FILES

hbase-site.xml

hbase-default.xml

hbase-env.sh

log4j.properties

regionservers

Page 18: Hbase

HBASE-DEFAULT.XML

Not all configuration options make it out to hbase-default.xml. Options that are thought unlikely to be changed may exist only in code; the only way to discover them is by reading the source code itself.

Page 19: Hbase

hbase.rootdir The directory shared by region servers and into which HBase persists. Default: file:///tmp/hbase-${user.name}/hbase. For distributed mode: hdfs://namenode:9000/hbase

hbase.master.port The port the HBase Master should bind to. Default: 60000

hbase.cluster.distributed The mode the cluster will be in. Possible values are false for standalone mode and true for distributed mode.

hbase.tmp.dir Temporary directory on the local filesystem. Default: /tmp/hbase-${user.name}

hbase.regionserver.port The port the HBase RegionServer binds to. Default: 60020

There are many more parameters that can be changed for a more customized and optimized HBase cluster.

Page 20: Hbase

HBASE-SITE.XML

Just as in Hadoop, where you add site-specific HDFS configuration to the hdfs-site.xml file, for HBase site-specific customizations go into the file conf/hbase-site.xml. For the list of configurable properties, see hbase-default.xml.

Page 21: Hbase

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<property>

<name>hbase.zookeeper.quorum</name>

<value>node1,node2,node3</value>

<description>Comma-separated list of servers in the ZooKeeper quorum.

</description>

</property>

<property>

<name>hbase.zookeeper.property.dataDir</name>

<value>/export/zookeeper</value>

<description>Property from ZooKeeper's config zoo.cfg.

The directory where the snapshot is stored.

</description>

Page 22: Hbase

</property>

<property>

<name>hbase.rootdir</name>

<value>hdfs://node0:8020/hbase</value>

<description>The directory shared by RegionServers.

</description>

</property>

<property>

<name>hbase.cluster.distributed</name>

<value>true</value>

<description>The mode the cluster will be in. Possible values are

false: standalone and pseudo-distributed setups with managed Zookeeper

true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh)

</description>

</property>

</configuration>
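To check which values a site file actually sets, the name/value pairs can be read with Python's standard XML parser. This is a minimal sketch: `SAMPLE` is a trimmed, well-formed sample modeled on the file above, and `load_properties` is a hypothetical helper, not part of HBase.

```python
import xml.etree.ElementTree as ET

# Trimmed sample modeled on the hbase-site.xml shown above.
SAMPLE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://node0:8020/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
</configuration>
"""


def load_properties(xml_text):
    """Return {property name: value} for every <property> under <configuration>."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name:
            # A missing or empty <value> is stored as an empty string.
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


for key, value in load_properties(SAMPLE).items():
    print(key, "=", value)
```

A malformed file makes `ET.fromstring` raise a `ParseError`, so the same helper doubles as a quick well-formedness check before restarting the cluster.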

Page 23: Hbase

HBASE-ENV.SH

Set HBase environment variables in this file. Examples include options to pass to the JVM on start of an HBase daemon, such as heap size and garbage collector configs. You can also set the HBase configuration directory, log directories, niceness, ssh options, where to locate process pid files, etc. Open the file at conf/hbase-env.sh and peruse its content. Each option is fairly well documented. Add your own environment variables here if you want them read by HBase daemons on startup.

Page 24: Hbase

export JAVA_HOME=/usr/lib/jvm/java-6-sun/

export HBASE_CLASSPATH=

export HBASE_HEAPSIZE=1000

Page 25: Hbase

LOG4J.PROPERTIES

Edit this file to change the rate at which HBase log files are rolled and to change the level at which HBase logs messages.

Changes here require an HBase cluster restart to take effect.

Page 26: Hbase

REGIONSERVERS

In this file you list the nodes that will run RegionServers, one host per line.

E.g.:

regionservernode1

regionservernode2

regionservernode3

Page 27: Hbase

CONFIGURATIONS

Required Configurations: Java, SSH, DNS, NTP, ulimit and nproc, Hadoop (for distributed mode)

Recommended Configurations:

zookeeper.session.timeout

The default timeout is three minutes (specified in milliseconds). This means that if a server crashes, it will be three minutes before the Master notices the crash and starts recovery.

Number of ZooKeeper Instances

Compression

Bigger Regions

Balancer

The balancer is a periodic operation which is run on the Master to redistribute regions on the cluster. It is configured via hbase.balancer.period and defaults to 300000 milliseconds (5 minutes).

There are more settings than these; read further in the documentation to get a better-optimized cluster.
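Both settings discussed above are expressed in milliseconds, which is easy to misread. A tiny sketch of the arithmetic (the helper name is illustrative):

```python
def ms_to_minutes(milliseconds):
    """Convert a millisecond configuration value to minutes."""
    return milliseconds / 60_000


THREE_MINUTES_MS = 3 * 60 * 1000  # the three-minute session timeout in milliseconds
print(THREE_MINUTES_MS)           # 180000
print(ms_to_minutes(300000))      # 5.0, the balancer period in minutes
```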

Page 28: Hbase

GETTING STARTED WITH HBASE

Start HBase:

./bin/start-hbase.sh

starting Master, logging to logs/hbase-user-master-example.org.out

Connect to your running HBase via the shell:

./bin/hbase shell

HBase Shell

hbase(main):001:0>

In this shell you can type the shell commands that HBase provides to perform various operations.

Page 29: Hbase

I am here :

[email protected]

Twitter: shashwat_2010

Facebook: [email protected]

Skype: shriparv

Search, read, research and share to gain a better understanding.

Right ?

http://db.csail.mit.edu/projects/cstore/abadi-sigmod08.pdf

https://cs.uwaterloo.ca/~aelhelw/papers/dolap11.pdf

https://ccp.cloudera.com/download/attachments/14549380/CDH2_Installation_Guide.pdf?version=1&modificationDate=1348871340000

http://hbase.apache.org/

https://ccp.cloudera.com/display/CDHDOC/HBase+Installation

Last and the best one

www.google.com :)

NEED MORE CLARIFICATION ON QUERIES??