Sunday, April 20, 2014

HBase Architecture:
------------------------------------------------------------------------------------------------------------------
Available as
1. Standalone mode
2. Pseudo-distributed mode
3. Fully distributed mode

2.2.2. Distributed
Distributed mode can be subdivided into pseudo-distributed, where all daemons run on a single node, and fully-distributed, where the daemons are spread across all nodes in the cluster.
Distributed modes require an instance of the Hadoop Distributed File System (HDFS). 

2.2.2.1. Pseudo-distributed
A pseudo-distributed mode is simply a distributed mode run on a single host.
First, setup your HDFS in pseudo-distributed mode.
Next, configure HBase. Below is an example conf/hbase-site.xml; this is the file into which you add local customizations.
Note that the hbase.rootdir property points to the local HDFS instance.

Note
Let HBase create the hbase.rootdir directory. If you don't, you'll get a warning saying HBase needs a migration run because the directory is missing files expected by HBase (it'll create them if you let it).
Pseudo-distributed Configuration File
Below is a sample pseudo-distributed hbase-site.xml for the node h-24-30.sfo.stumble.net.
<configuration>
  ...
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://h-24-30.sfo.stumble.net:8020/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>h-24-30.sfo.stumble.net</value>
  </property>
  ...
</configuration>
2.2.2.1.2. Pseudo-distributed Extras
2.2.2.1.2.1. Startup
To start up the initial HBase cluster...
% bin/start-hbase.sh

To start up an extra backup master on the same server, run...
% bin/local-master-backup.sh start 1
... the '1' means use ports 60001 & 60011, and this backup master's logfile will be at logs/hbase-${USER}-1-master-${HOSTNAME}.log.
To start up multiple backup masters, run...
% bin/local-master-backup.sh start 2 3
You can start up to 9 backup masters (10 total).
To start up more regionservers...
% bin/local-regionservers.sh start 1
where '1' means use ports 60201 & 60301 and its logfile will be at logs/hbase-${USER}-1-regionserver-${HOSTNAME}.log.
To add 4 more regionservers in addition to the one you just started, run...
% bin/local-regionservers.sh start 2 3 4 5
This supports up to 99 extra regionservers (100 total).
2.2.2.1.2.2. Stop
Assuming you want to stop master backup # 1, run...
% cat /tmp/hbase-${USER}-1-master.pid |xargs kill -9
Note that bin/local-master-backup.sh stop 1 will try to stop the cluster along with the master.
To stop an individual regionserver, run...
% bin/local-regionservers.sh stop 1
                

2.2.2.2. Fully-distributed
For running a fully-distributed operation on more than one host, make the following configurations. In hbase-site.xml, add the property hbase.cluster.distributed, set it to true, and point hbase.rootdir at the appropriate HDFS NameNode and the location in HDFS where you would like HBase to write data. For example, if your namenode were running at namenode.example.org on port 8020 and you wanted to home your HBase in HDFS at /hbase, make the following configuration.
<configuration>
  ...
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://namenode.example.org:8020/hbase</value>
    <description>The directory shared by RegionServers.
    </description>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
    <description>The mode the cluster will be in. Possible values are
      false: standalone and pseudo-distributed setups with managed Zookeeper
      true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh)
    </description>
  </property>
  ...
</configuration>
2.2.2.2.1. regionservers
In addition, fully-distributed mode requires that you modify conf/regionservers. The regionservers file lists all hosts on which you would have HRegionServers running, one host per line (this file in HBase is like the Hadoop slaves file). All servers listed in this file will be started and stopped when HBase cluster start or stop is run.
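For example, a three-node conf/regionservers file would simply list the region server hosts one per line (these hostnames are placeholders; use your own):
rs1.example.org
rs2.example.org
rs3.example.org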

----------------------------------------------------------------------------------------------------------------
HBase has its own web UI (the master web UI is on port 60010 by default)
--------------------------------------------------------------------------------------------------------------------
Distributed column-oriented database - on top of HDFS
Row-oriented                                      Column-oriented
1. OLTP                                           1. OLAP (all values of a column stored together)
2. Single-row inserts                             2. Aggregation over a column's values
3. Small numbers of rows and columns              3. High compression
----------------------------------------------------------------------------------------------------------------
HBase                                             RDBMS
1. Schema-less                                    1. Fixed schema
2. Wide tables                                    2. Thin tables
3. Denormalized                                   3. Normalized
-----------------------------------------------------------------------------------------------------------
HBase                                             HDFS
1. Low-latency access                             1. High-latency access
2. Random access (e.g. Facebook)                  2. No concept of random access

--------------------------------------------------------------------------------------------------------------
                                                                HBase Architecture
1. Master Server - assigns regions, handles load balancing, finds out where the data is, handles sharding
a) when a table gets bigger, it is split at the middle key and the resulting regions are distributed uniformly across the region servers
b) when the cluster is slow, just add more region servers

ZooKeeper - the component the master talks to for coordination

2. Region Server
Table
  region
    store
      MemStore (writes land here first and are then flushed to an HFile)
      HFile (store file)

Column family - for grouping similar data
Pseudo mode:
ZooKeeper is for quorum management.

Download Apache HBase
1. Download the tarball from the Apache website
2. Unpack the tar
3. Move it to /usr/local/hbase (owned, with appropriate permissions, by the user that runs HBase)
export HBASE_HOME=/usr/local/hbase
export PATH=$PATH:$HBASE_HOME/bin
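
A minimal shell sketch of the steps above (the version placeholder, download mirror, and hduser:hadoop owner are assumptions; substitute your own):
wget https://archive.apache.org/dist/hbase/<version>/hbase-<version>.tar.gz
tar xzf hbase-<version>.tar.gz
sudo mv hbase-<version> /usr/local/hbase
sudo chown -R hduser:hadoop /usr/local/hbase        # assumed user:group
echo 'export HBASE_HOME=/usr/local/hbase' >> ~/.bashrc
echo 'export PATH=$PATH:$HBASE_HOME/bin' >> ~/.bashrc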
Configuration:
vim /usr/local/hbase/conf/hbase-env.sh
edit JAVA_HOME
vim /usr/local/hbase/conf/hbase-site.xml

hbase.rootdir=hdfs://hw1:10001/hbase - points HBase at HDFS; this directory is shared by all region servers

#hbase.zookeeper.quorum=zoo1,zoo2 - used to point to the ZooKeeper nodes
#hbase.cluster.distributed=false - for pseudo-distributed and standalone modes
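
As a sketch, the uncommented setting above goes into conf/hbase-site.xml like this (the hw1:10001 NameNode address is the one used in these notes; the commented-out ZooKeeper and cluster.distributed properties are left at their defaults for pseudo mode):
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://hw1:10001/hbase</value>
  </property>
</configuration>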

Region Server
vim /usr/local/hbase/conf/regionservers
localhost for pseudo mode, hostnames/IPs for distributed mode
start command: start-hbase.sh
web UI: hw1:60010
starting extra region servers - command is
local-regionservers.sh start 1 2 3


Creating a TABLE
hbase shell
create 'htest','cf'
put 'htest','r1','cf:c1','v1'
put 'htest','r1','cf:c2','v2'
put 'htest','r1','cf:c3','v3'
scan 'htest'

Cells are versioned in an HBase table:
get 'htest','r1'
put 'htest','r1','cf:c2','v2updated'    # writes a new version of cf:c2
get 'htest','r1'                        # returns the latest version of each cell
delete 'htest','r1','cf:c3'             # removes the cf:c3 cell
scan 'htest'
disable 'htest'                         # a table must be disabled before it can be dropped
drop 'htest'

HBASE DATA-ACCESS
1. JAVA
2. HBASE SHELL
both 1 & 2 use the Client API (see the Java sketch after this list)
3. REST - for text
4. AVRO - for Binary
5. THRIFT - for both Text and Binary
6. HIVE
7. PIG
8. MapReduce

4, 5 are interactive clients
6, 7, 8 are batch-processing clients
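
A minimal Java sketch of the Client API used by 1 & 2, written against the 0.9x-era HTable/Put/Get classes; the htest table and cf:c1 column mirror the shell example above, and the class name is illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HTestClient {
    public static void main(String[] args) throws Exception {
        // Picks up hbase-site.xml (ZooKeeper quorum etc.) from the classpath
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "htest");
        try {
            // equivalent of: put 'htest','r1','cf:c1','v1'
            Put put = new Put(Bytes.toBytes("r1"));
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("c1"), Bytes.toBytes("v1"));
            table.put(put);

            // equivalent of: get 'htest','r1'
            Get get = new Get(Bytes.toBytes("r1"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("c1"));
            System.out.println("cf:c1 = " + Bytes.toString(value));
        } finally {
            table.close();
        }
    }
}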
Loading HBASE
1. ImportTsv - for tab-separated values (see the example commands after this list)
2. Complete bulk load (completebulkload)
3. MapReduce
4. Pig and Hive

Configuring Fully Distributed HBASE
1. vim /usr/local/hbase/conf/hbase-site.xml
Properties:
hbase.rootdir - must be the same value on the master and every region server
hbase.cluster.distributed - true
hbase.zookeeper.quorum - HNHBMaster
hbase.zookeeper.property.clientPort - 2181
hbase.zookeeper.property.dataDir - path to the directory where ZooKeeper stores its data
this configuration has to be scp'd to all region servers too (see the hbase-site.xml sketch below)
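
A sketch of the resulting conf/hbase-site.xml (HNHBMaster as the ZooKeeper quorum host and port 2181 come from the notes above; the NameNode address is carried over from the pseudo-distributed setup and the dataDir path is an assumption):
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://hw1:10001/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>HNHBMaster</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/usr/local/zookeeper/data</value> <!-- assumed path -->
  </property>
</configuration>
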
start-hbase.sh

fires up all the services (ZooKeeper if HBase manages it, the master, and the region servers)