Sunday, April 20, 2014

HBase Architecture:
------------------------------------------------------------------------------------------------------------------
Available as
1. Standalone mode
2. Pseudo-distributed mode
3. Fully distributed mode

2.2.2. Distributed
Distributed mode can be subdivided into pseudo-distributed, where all daemons run on a single node, and fully-distributed, where the daemons are spread across all nodes in the cluster.
Distributed modes require an instance of the Hadoop Distributed File System (HDFS). 

2.2.2.1. Pseudo-distributed
A pseudo-distributed mode is simply a distributed mode run on a single host.
First, setup your HDFS in pseudo-distributed mode.
Next, configure HBase. Below is an example conf/hbase-site.xml; this is the file into which you add local customizations.
Note that the hbase.rootdir property points to the local HDFS instance.

Note
Let HBase create the hbase.rootdir directory. If you don't, you'll get a warning saying HBase needs a migration run because the directory is missing files expected by HBase (it'll create them if you let it).
Pseudo-distributed Configuration File
Below is a sample pseudo-distributed hbase-site.xml for the node h-24-30.sfo.stumble.net.
<configuration>
  ...
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://h-24-30.sfo.stumble.net:8020/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>h-24-30.sfo.stumble.net</value>
  </property>
  ...
</configuration>
2.2.2.1.2. Pseudo-distributed Extras
2.2.2.1.2.1. Startup
To start up the initial HBase cluster...
% bin/start-hbase.sh

To start up an extra backup master on the same server, run...
% bin/local-master-backup.sh start 1
... the '1' means use ports 60001 & 60011, and this backup master's logfile will be at logs/hbase-${USER}-1-master-${HOSTNAME}.log.
To start up multiple backup masters, run...
% bin/local-master-backup.sh start 2 3
You can start up to 9 backup masters (10 total).
To start up more regionservers...
% bin/local-regionservers.sh start 1
where '1' means use ports 60201 & 60301 and its logfile will be at logs/hbase-${USER}-1-regionserver-${HOSTNAME}.log.
To add 4 more regionservers in addition to the one you just started, run...
% bin/local-regionservers.sh start 2 3 4 5
This supports up to 99 extra regionservers (100 total).
2.2.2.1.2.2. Stop
Assuming you want to stop master backup # 1, run...
% cat /tmp/hbase-${USER}-1-master.pid |xargs kill -9
Note that bin/local-master-backup.sh stop 1 will try to stop the cluster along with the master.
To stop an individual regionserver, run...
% bin/local-regionservers.sh stop 1
                

2.2.2.2. Fully-distributed
For running a fully-distributed operation on more than one host, make the following configurations. In hbase-site.xml, add the property hbase.cluster.distributed, set it to true, and point hbase.rootdir at the appropriate HDFS NameNode and the location in HDFS where you would like HBase to write data. For example, if your namenode were running at namenode.example.org on port 8020 and you wanted to home your HBase in HDFS at /hbase, make the following configuration.
<configuration>
  ...
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://namenode.example.org:8020/hbase</value>
    <description>The directory shared by RegionServers.
    </description>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
    <description>The mode the cluster will be in. Possible values are
      false: standalone and pseudo-distributed setups with managed Zookeeper
      true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh)
    </description>
  </property>
  ...
</configuration>
2.2.2.2.1. regionservers
In addition, fully-distributed mode requires that you modify conf/regionservers. The regionservers file lists all hosts on which you would have HRegionServers running, one host per line (this file in HBase is like the Hadoop slaves file). All servers listed in this file will be started and stopped when HBase cluster start or stop is run.
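For example, a three-node conf/regionservers file would simply list the region server hosts one per line (these hostnames are placeholders; use your own):
rs1.example.org
rs2.example.org
rs3.example.org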

----------------------------------------------------------------------------------------------------------------
HBase has its own web UI (the master web UI is on port 60010 by default)
--------------------------------------------------------------------------------------------------------------------
Distributed column-oriented database - on top of HDFS
Row-oriented                                      Column-oriented
1. OLTP                                           1. OLAP (all values of a column stored together)
2. Single-row inserts                             2. Aggregation over a column's values
3. Small numbers of rows and columns              3. High compression
----------------------------------------------------------------------------------------------------------------
HBase                                             RDBMS
1. Schema-less                                    1. Fixed schema
2. Wide tables                                    2. Thin tables
3. Denormalized                                   3. Normalized
-----------------------------------------------------------------------------------------------------------
HBase                                             HDFS
1. Low-latency access                             1. High-latency access
2. Random access (e.g. Facebook)                  2. No concept of random access

--------------------------------------------------------------------------------------------------------------
                                                                HBase Architecture
1. Master Server - assigns regions, handles load balancing, finds out where the data is, handles sharding
a) when a table gets bigger, it is split at the middle key and the resulting regions are distributed uniformly across the region servers
b) when the cluster is slow, just add more region servers

ZooKeeper - the component the master talks to for coordination

2. Region Server
Table
  region
    store
      MemStore (writes land here first and are then flushed to an HFile)
      HFile (store file)

Column family - for grouping similar data
Pseudo mode:
ZooKeeper is for quorum management.

Download Apache HBase
1. Download the tarball from the Apache website
2. Unpack the tar
3. Move it to /usr/local/hbase (owned, with appropriate permissions, by the user that runs HBase)
export HBASE_HOME=/usr/local/hbase
export PATH=$PATH:$HBASE_HOME/bin
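
A minimal shell sketch of the steps above (the version placeholder, download mirror, and hduser:hadoop owner are assumptions; substitute your own):
wget https://archive.apache.org/dist/hbase/<version>/hbase-<version>.tar.gz
tar xzf hbase-<version>.tar.gz
sudo mv hbase-<version> /usr/local/hbase
sudo chown -R hduser:hadoop /usr/local/hbase        # assumed user:group
echo 'export HBASE_HOME=/usr/local/hbase' >> ~/.bashrc
echo 'export PATH=$PATH:$HBASE_HOME/bin' >> ~/.bashrc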
Configuration:
vim /usr/local/hbase/conf/hbase-env.sh
edit JAVA_HOME
vim /usr/local/hbase/conf/hbase-site.xml

hbase.rootdir=hdfs://hw1:10001/hbase - points HBase at HDFS; this directory is shared by all region servers

#hbase.zookeeper.quorum=zoo1,zoo2 - used to point to the ZooKeeper nodes
#hbase.cluster.distributed=false - for pseudo-distributed and standalone modes
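
As a sketch, the uncommented setting above goes into conf/hbase-site.xml like this (the hw1:10001 NameNode address is the one used in these notes; the commented-out ZooKeeper and cluster.distributed properties are left at their defaults for pseudo mode):
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://hw1:10001/hbase</value>
  </property>
</configuration>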

Region Server
vim /usr/local/hbase/conf/regionservers
localhost for pseudo mode, hostnames/IPs for distributed mode
start command: start-hbase.sh
web UI: hw1:60010
starting extra region servers - command is
local-regionservers.sh start 1 2 3


Creating a TABLE
hbase shell
create 'htest','cf'
put 'htest','r1','cf:c1','v1'
put 'htest','r1','cf:c2','v2'
put 'htest','r1','cf:c3','v3'
scan 'htest'

Cells are versioned in an HBase table:
get 'htest','r1'
put 'htest','r1','cf:c2','v2updated'    # writes a new version of cf:c2
get 'htest','r1'                        # returns the latest version of each cell
delete 'htest','r1','cf:c3'             # removes the cf:c3 cell
scan 'htest'
disable 'htest'                         # a table must be disabled before it can be dropped
drop 'htest'

HBASE DATA-ACCESS
1. JAVA
2. HBASE SHELL
both 1 & 2 use the Client API (see the Java sketch after this list)
3. REST - for text
4. AVRO - for Binary
5. THRIFT - for both Text and Binary
6. HIVE
7. PIG
8. MapReduce

4, 5 are interactive clients
6, 7, 8 are batch-processing clients
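
A minimal Java sketch of the Client API used by 1 & 2, written against the 0.9x-era HTable/Put/Get classes; the htest table and cf:c1 column mirror the shell example above, and the class name is illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HTestClient {
    public static void main(String[] args) throws Exception {
        // Picks up hbase-site.xml (ZooKeeper quorum etc.) from the classpath
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "htest");
        try {
            // equivalent of: put 'htest','r1','cf:c1','v1'
            Put put = new Put(Bytes.toBytes("r1"));
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("c1"), Bytes.toBytes("v1"));
            table.put(put);

            // equivalent of: get 'htest','r1'
            Get get = new Get(Bytes.toBytes("r1"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("c1"));
            System.out.println("cf:c1 = " + Bytes.toString(value));
        } finally {
            table.close();
        }
    }
}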
Loading HBASE
1. ImportTsv - for tab-separated values (see the example commands after this list)
2. Complete bulk load (completebulkload)
3. MapReduce
4. Pig and Hive

Configuring Fully Distributed HBASE
1. vim /usr/local/hbase/conf/hbase-site.xml
Properties:
hbase.rootdir - must be the same value on the master and every region server
hbase.cluster.distributed - true
hbase.zookeeper.quorum - HNHBMaster
hbase.zookeeper.property.clientPort - 2181
hbase.zookeeper.property.dataDir - path to the directory where ZooKeeper stores its data
this configuration has to be scp'd to all region servers too (see the hbase-site.xml sketch below)
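
A sketch of the resulting conf/hbase-site.xml (HNHBMaster as the ZooKeeper quorum host and port 2181 come from the notes above; the NameNode address is carried over from the pseudo-distributed setup and the dataDir path is an assumption):
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://hw1:10001/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>HNHBMaster</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/usr/local/zookeeper/data</value> <!-- assumed path -->
  </property>
</configuration>
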
start-hbase.sh

fires up all the services (ZooKeeper if HBase manages it, the master, and the region servers)