HBase tutorial for beginners

From ole-martin.net » HBase tutorial for beginners – a blog by Ole-Martin Mørk.

First of all, HBase is a column-oriented database. However, you have to forget everything you have learned about tables, columns and rows in the RDBMS world. The data in an HBase instance is laid out more like a hashtable, and the data is immutable. Whenever you update the data, you are actually just creating a new version of it.
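To make the "hashtable with versions" idea a bit more concrete, here is a minimal sketch in plain Java. It is only a toy model of the layout described above, not how HBase actually stores anything; the class name VersionedStore and its methods are made up for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Toy model of the layout described above. It is not how HBase stores data on disk.
class VersionedStore {
    // row key -> column name -> (timestamp -> value), newest timestamp first
    private final Map<String, Map<String, TreeMap<Long, String>>> rows = new HashMap<>();

    // An "update" never overwrites the old value; it adds a new timestamped version.
    void put(String row, String column, long timestamp, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(column, c -> new TreeMap<Long, String>(Collections.reverseOrder()))
            .put(timestamp, value);
    }

    // Returns the newest version of a cell, or null if the cell does not exist.
    String get(String row, String column) {
        TreeMap<Long, String> versions = versionsOf(row, column);
        return versions == null ? null : versions.firstEntry().getValue();
    }

    // Number of versions currently stored for a cell.
    int versionCount(String row, String column) {
        TreeMap<Long, String> versions = versionsOf(row, column);
        return versions == null ? 0 : versions.size();
    }

    private TreeMap<Long, String> versionsOf(String row, String column) {
        Map<String, TreeMap<Long, String>> columns = rows.get(row);
        return columns == null ? null : columns.get(column);
    }

    public static void main(String[] args) {
        VersionedStore store = new VersionedStore();
        store.put("post1", "post:title", 1L, "Hello World");
        store.put("post1", "post:title", 2L, "Hello again"); // an "update"
        System.out.println(store.get("post1", "post:title"));          // prints "Hello again"
        System.out.println(store.versionCount("post1", "post:title")); // prints 2
    }
}
```

Note that the second put does not replace the first one: both versions are still there, and a read simply returns the newest.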

This tutorial will be very hands-on, with not too much explanation. There are a number of articles that describe column-oriented databases in detail. Check out my delicious tag for some good ones, for instance jimbojw.com’s excellent introduction.

I used Apple OS X 10.5.6 in this tutorial; I am not sure whether this will work on Windows and Linux.

The goal of this tutorial is to create a model for a blog, with integration from a Java program.

Get started

  • Download the latest stable release from Apache. I went with the hbase-0.18.1 release.
  • Unpack it, for instance to ~/hbase
  • Edit ~/hbase/conf/hbase-env.sh and set the correct JAVA_HOME variable.
  • Start hbase by running ~/hbase/bin/start-hbase.sh

Create a table

  • Start the hbase shell by running ~/hbase/bin/hbase shell
  • Run create 'blogposts', 'post', 'image' in the shell

Now you have a table called blogposts, with a post and an image family. These families are “static”, like the columns in the RDBMS world.
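If it helps, the "static families" rule can be sketched like this in plain Java: the family part of a column name has to be one of the families the table was created with, while the part after the colon can be anything. The class ColumnNames and its parse method are hypothetical helpers for illustration, not part of the HBase API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative only: mimics the rule that families are fixed while qualifiers are free-form.
class ColumnNames {
    // The two families from the create command in this tutorial.
    static final Set<String> FAMILIES = new HashSet<>(Arrays.asList("post", "image"));

    // Splits "family:qualifier" and rejects families the table was not created with.
    static String[] parse(String column) {
        int colon = column.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("expected family:qualifier, got " + column);
        }
        String family = column.substring(0, colon);
        if (!FAMILIES.contains(family)) {
            throw new IllegalArgumentException("unknown column family: " + family);
        }
        return new String[] { family, column.substring(colon + 1) };
    }

    public static void main(String[] args) {
        String[] parts = parse("post:title");
        System.out.println(parts[0] + " / " + parts[1]); // prints "post / title"
        try {
            parse("comment:text"); // "comment" was never declared as a family
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // prints "unknown column family: comment"
        }
    }
}
```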

Add some data to the table

Run the following commands in the shell:

  • put 'blogposts', 'post1', 'post:title', 'Hello World'
  • put 'blogposts', 'post1', 'post:author', 'The Author'
  • put 'blogposts', 'post1', 'post:body', 'This is a blog post'
  • put 'blogposts', 'post1', 'image:header', 'image1.jpg'
  • put 'blogposts', 'post1', 'image:bodyimage', 'image2.jpg'

Look at the data

Run get 'blogposts', 'post1' in the shell. This should output something like this:

COLUMN            CELL
image:bodyimage   timestamp=1229953133260, value=image2.jpg
image:header      timestamp=1229953110419, value=image1.jpg
post:author       timestamp=1229953071910, value=The Author
post:body         timestamp=1229953072029, value=This is a blog post
post:title        timestamp=1229953071791, value=Hello World

Summary part1

So, what have we accomplished so far? We have created a table and added one ‘record’ to it. This record consists of the blog post itself and the images attached to it. So, how do we retrieve that data from a Java application?

Integrate with HBase from Java

In order to integrate with HBase you will need the following jar files in your classpath:

  • commons-logging-1.0.4.jar
  • hadoop-0.18.1-core.jar
  • hbase-0.18.1.jar
  • log4j-1.2.13.jar

All of these are found in ~/hbase/lib and ~/hbase.

OK. Here’s the Java code:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.io.RowResult;

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    public class HBaseConnector {

        // Reads one row from the blogposts table and returns it as column name -> value.
        public static Map<String, String> retrievePost(String postId) throws IOException {
            HTable table = new HTable(new HBaseConfiguration(), "blogposts");
            Map<String, String> post = new HashMap<String, String>();

            RowResult result = table.getRow(postId);

            // The keys are column names as bytes, e.g. "post:title"; each value is a cell.
            for (byte[] column : result.keySet()) {
                post.put(new String(column), new String(result.get(column).getValue()));
            }
            return post;
        }

        public static void main(String[] args) throws IOException {
            Map<String, String> blogpost = HBaseConnector.retrievePost("post1");
            System.out.println(blogpost.get("post:title"));
            System.out.println(blogpost.get("post:author"));
        }
    }

This code should print out ‘Hello World’ and ‘The Author’.

Understanding HBase column-family performance options

From Understanding HBase column-family performance options – Jimbojw.com.

In the comments to Understanding HBase and BigTable, I received some insightful questions. Here I attempt to answer them, in no particular order.

Picking the correct HBase performance options is akin to deciding which engine to use, or whether to use CHAR vs VARCHAR vs TEXT in a relational database. These decisions can make a big impact on the amount of data stored and the speed with which it is created, updated, read, and deleted.

This article assumes the reader is familiar with HBase concepts, particularly its column-oriented nature and the relationship between rows, column families, columns and cells. If this seems foreign to you, I recommend revisiting the aforementioned Understanding HBase article before reading further.

For information on the syntax used to create a table using the options mentioned here, see HbaseShell (Hadoop wiki).

Are there any performance implications that are implied with column families?

Definitely. All the columns within a column family will share the same characteristics such as versioning and compression. By default, HBase does not employ any kind of compression on cell data, but two alternative compressions may be specified: BLOCK and RECORD.

Block compression

Say you have a single column which will contain large blobs of text data, and you only want to keep one version for any given row.

In that case, you’d probably want that column to belong to a column family which supports BLOCK compression, since this compression type will span across multiple rows in order to achieve the best compression ratio.

Record compression

On the other hand, say you had a variable number of rows containing text data, of which you’d want to keep multiple versions. Then you might want those columns to belong to a family which uses RECORD compression, since this compression type will be localized within each row.

Although compression ratios generally would be better with BLOCK compression rather than RECORD compression, access times for RECORD compression would theoretically be faster since only a single row would need to be pulled in order to decompress a given cell.
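To get a feel for the ratio side of that tradeoff, here is a rough sketch that compresses a handful of similar values one at a time versus all together. It uses java.util.zip.Deflater purely as a stand-in; I am not claiming this is the codec or the on-disk format HBase uses, and the class and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Compares per-value ("RECORD"-like) and grouped ("BLOCK"-like) compression sizes.
class CompressionGranularity {
    // Returns the DEFLATE-compressed size of the input, in bytes.
    static int compressedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[1024];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    // RECORD-like: every value is compressed on its own.
    static int perRecordSize(String[] values) {
        int total = 0;
        for (String value : values) {
            total += compressedSize(value.getBytes(StandardCharsets.UTF_8));
        }
        return total;
    }

    // BLOCK-like: the values are concatenated and compressed as one unit.
    static int perBlockSize(String[] values) {
        StringBuilder block = new StringBuilder();
        for (String value : values) {
            block.append(value);
        }
        return compressedSize(block.toString().getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String[] values = new String[50];
        for (int i = 0; i < values.length; i++) {
            values[i] = "This is the body of blog post number " + i;
        }
        System.out.println("per record: " + perRecordSize(values) + " bytes");
        System.out.println("per block:  " + perBlockSize(values) + " bytes");
    }
}
```

The grouped version comes out smaller because the compressor can reuse what it has seen in earlier values, but reading a single value back would mean decompressing the whole group first.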

Should columns be grouped into families based on how they’re used in a particular application?

Yes, absolutely. Consider these other column family options: BLOOMFILTER, IN_MEMORY, MAX_LENGTH and MAX_VERSIONS.

Bloom Filters

If a column family supports bloom filters, that means that an extra index is kept which helps cut down on the time necessary to determine if a given column exists in a given row. This has nothing to do with the cell values, just the row/column identifiers.

In the case where you have a very large number of variably named columns, each cell having a small amount of data, you may want to specify them in a column family utilizing a bloom filter, so that lookup times are reduced.

Like any index, bloom filters incur an additional storage cost (memory) and an update cost (time). The sole purpose of a bloom filter is to quickly determine whether a given input has ever been seen before, using a minimum of storage space. Inserting new items and checking for existing items are both fast. The one slow operation is deletion, which requires rebuilding the entire index from scratch, but skipping the deleted item. Other bloom filter variants support a certain deletion tolerance, but given enough deletions the index would still need to be rebuilt.
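For readers who have not met bloom filters before, here is a minimal sketch of the basic data structure in plain Java. It is a generic textbook-style filter keyed on a row/column string, not HBase's implementation; the class name, the hashing scheme and the sizes are all choices I made for the example.

```java
import java.util.BitSet;

// Minimal bloom filter: can answer "definitely not present" or "possibly present".
class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    SimpleBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Derives the i-th bit position from two hashes of the key (double hashing).
    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B9, 15) | 1;
        return Math.floorMod(h1 + i * h2, size);
    }

    // Inserting is cheap: just set a few bits.
    void add(String key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(key, i));
        }
    }

    // False means the key was never added; true means it may have been added.
    boolean mightContain(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1024, 3);
        filter.add("post1/post:title");
        System.out.println(filter.mightContain("post1/post:title")); // prints true
        // For a key that was never added the answer is usually false,
        // but a false positive is possible by design.
        System.out.println(filter.mightContain("post1/post:subtitle"));
    }
}
```

There is no remove method here, which is exactly the deletion problem mentioned above: clearing bits for one key could also clear bits that other keys depend on.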

IN_MEMORY

Another example characteristic is the IN_MEMORY option, which directs HBase to keep cell values loaded in memory more aggressively than it would normally do. The upside is that this should really speed up certain kinds of read/write patterns.

The downside is, of course, that this eats up RAM, and secondarily that it may interfere with making HDFS backups since the data would be written to disk less frequently (again, this is speculation as I haven’t seen any benchmarks on the issue).

MAX_LENGTH and MAX_VERSIONS

Deciding which characteristics to apply to a column family is important from a performance perspective, but rarely has an impact on actual functionality. About the only settings which make a functional difference are MAX_VERSIONS and MAX_LENGTH, which specify how many versions of a cell to keep (default is 3), and how many bytes of data can be stored in each cell version (default is the max size of a 32-bit signed integer).
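Here is a small sketch of what those two limits mean for a single cell, again in plain Java rather than against the HBase API. The BoundedCell class is hypothetical, and rejecting an oversized value with an exception is just a choice made for this sketch; I am not asserting that this is how HBase itself reacts.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.TreeMap;

// Toy cell that keeps at most maxVersions versions of at most maxLength bytes each.
class BoundedCell {
    private final int maxVersions;
    private final int maxLength;
    // timestamp -> value, newest timestamp first
    private final TreeMap<Long, byte[]> versions = new TreeMap<>(Collections.reverseOrder());

    BoundedCell(int maxVersions, int maxLength) {
        this.maxVersions = maxVersions;
        this.maxLength = maxLength;
    }

    void put(long timestamp, byte[] value) {
        // Assumption for this sketch: oversized values are rejected outright.
        if (value.length > maxLength) {
            throw new IllegalArgumentException("value exceeds " + maxLength + " bytes");
        }
        versions.put(timestamp, value);
        // The map is sorted newest first, so the last entry is the oldest version.
        while (versions.size() > maxVersions) {
            versions.pollLastEntry();
        }
    }

    // Newest value as a string, or null if nothing has been stored.
    String latest() {
        return versions.isEmpty()
            ? null
            : new String(versions.firstEntry().getValue(), StandardCharsets.UTF_8);
    }

    int versionCount() {
        return versions.size();
    }

    boolean hasVersion(long timestamp) {
        return versions.containsKey(timestamp);
    }

    public static void main(String[] args) {
        BoundedCell cell = new BoundedCell(3, Integer.MAX_VALUE); // the defaults mentioned above
        for (long ts = 1; ts <= 5; ts++) {
            cell.put(ts, ("v" + ts).getBytes(StandardCharsets.UTF_8));
        }
        System.out.println(cell.versionCount()); // prints 3
        System.out.println(cell.latest());       // prints "v5"
    }
}
```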

Summary

In order to pick the appropriate performance options for your column families, you’ll have to consider the forms of data likely to be stored, as well as the manner in which it is inserted, updated and retrieved.

Hope this helps, and, as always, I’ll be happy to try and answer any questions you may have.