HBaseBook: Chapter 9. Data Model

From The Apache Book.

Chapter 9. Data Model

In short, applications store data into an HBase table. Tables are made of rows and columns. All columns in HBase belong to a particular column family. Table cells — the intersection of row and column coordinates — are versioned. A cell’s content is an uninterpreted array of bytes.

Table row keys are also byte arrays, so almost anything can serve as a row key, from strings to binary representations of longs or even serialized data structures. Rows in HBase tables are sorted by row key. The sort is byte-ordered. All table accesses are via the table row key, which acts as the table's primary key.
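To make the byte-ordered sort concrete, here is a minimal sketch that uses only the JDK, not the HBase client. The class and method names are hypothetical, and it assumes keys are compared byte by byte with each byte treated as unsigned.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of byte-ordered row keys; not part of the HBase API.
final class RowKeyOrderSketch {

    // Compare two keys byte by byte, treating each byte as unsigned (0..255).
    static int compareKeys(byte[] a, byte[] b) {
        return Arrays.compareUnsigned(a, b);
    }

    // Encode a long as 8 big-endian bytes (ByteBuffer's default byte order).
    static byte[] longKey(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }

    // Sort string keys by the byte order of their UTF-8 encoding.
    static List<String> sortStringKeys(List<String> keys) {
        List<byte[]> encoded = new ArrayList<>();
        for (String key : keys) {
            encoded.add(key.getBytes(StandardCharsets.UTF_8));
        }
        encoded.sort(RowKeyOrderSketch::compareKeys);
        List<String> sorted = new ArrayList<>();
        for (byte[] key : encoded) {
            sorted.add(new String(key, StandardCharsets.UTF_8));
        }
        return sorted;
    }

    public static void main(String[] args) {
        // String keys sort character by character, not numerically.
        System.out.println(sortStringKeys(List.of("row2", "row10", "row1"))); // [row1, row10, row2]
        // Big-endian longs sort numerically as long as they are non-negative.
        System.out.println(compareKeys(longKey(2), longKey(10)) < 0);  // true
        // A negative long has its high bit set, so it sorts after positive ones.
        System.out.println(compareKeys(longKey(-1), longKey(1)) > 0);  // true
    }
}
```

The practical consequence is that numeric identifiers stored as strings usually need zero-padding, or a fixed-width binary encoding, if rows are expected to come back in numeric order.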

9.1. Conceptual View

The following example is a slightly modified form of the one on page 2 of the BigTable paper. There is a table called webtable that contains two column families named contents and anchor. In this example, anchor contains two columns (anchor:cnnsi.com, anchor:my.look.ca) and contents contains one column (contents:html).

Column Names

By convention, a column name is made of its column family prefix and a qualifier. For example, the column contents:html is of the column family contents. The colon character (:) delimits the column family from the column family qualifier.

 

Table 9.1. Table webtable

Row Key       | Time Stamp | ColumnFamily contents      | ColumnFamily anchor
“com.cnn.www” | t9         |                            | anchor:cnnsi.com = “CNN”
“com.cnn.www” | t8         |                            | anchor:my.look.ca = “CNN.com”
“com.cnn.www” | t6         | contents:html = “<html>…”  |
“com.cnn.www” | t5         | contents:html = “<html>…”  |
“com.cnn.www” | t3         | contents:html = “<html>…”  |

 

9.2. Physical View

Although at a conceptual level tables may be viewed as a sparse set of rows, physically they are stored on a per-column-family basis. New columns (i.e., columnfamily:column) can be added to any column family without pre-announcing them.

Table 9.2. ColumnFamily anchor

Row Key       | Time Stamp | ColumnFamily anchor
“com.cnn.www” | t9         | anchor:cnnsi.com = “CNN”
“com.cnn.www” | t8         | anchor:my.look.ca = “CNN.com”

 

Table 9.3. ColumnFamily contents

Row Key       | Time Stamp | ColumnFamily contents
“com.cnn.www” | t6         | contents:html = “<html>…”
“com.cnn.www” | t5         | contents:html = “<html>…”
“com.cnn.www” | t3         | contents:html = “<html>…”


It is important to note in the tables above that the empty cells shown in the conceptual view are not stored, since a column-oriented storage format has no need to store them. Thus a request for the value of the contents:html column at time stamp t8 would return no value. Similarly, a request for an anchor:my.look.ca value at time stamp t9 would return no value. However, if no timestamp is supplied, the most recent value for a particular column is returned. It is also the first one found, since timestamps are stored in descending order. Thus a request for the values of all columns in the row com.cnn.www with no timestamp specified would return: the value of contents:html from time stamp t6, the value of anchor:cnnsi.com from time stamp t9, and the value of anchor:my.look.ca from time stamp t8.
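The read behavior described above can be modeled in a few lines of plain Java. This is only an illustrative sketch, not how HBase stores data: the class name is hypothetical, the numbers 3 to 9 stand in for the symbolic time stamps t3 to t9, and the string "<html>..." is a placeholder for the elided page content.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of one sparse row; not part of the HBase API.
final class SparseRowSketch {

    // column name -> (timestamp -> value), with the newest timestamp first.
    private final Map<String, TreeMap<Long, String>> columns = new HashMap<>();

    void put(String column, long timestamp, String value) {
        columns
            .computeIfAbsent(column, c -> new TreeMap<Long, String>(Comparator.<Long>reverseOrder()))
            .put(timestamp, value);
    }

    // Exact-timestamp read: returns null when no cell was stored at that coordinate.
    String getAt(String column, long timestamp) {
        TreeMap<Long, String> versions = columns.get(column);
        return versions == null ? null : versions.get(timestamp);
    }

    // Read without a timestamp: the first entry is the most recent version.
    Map.Entry<Long, String> getLatest(String column) {
        TreeMap<Long, String> versions = columns.get(column);
        return versions == null ? null : versions.firstEntry();
    }

    // Builds the com.cnn.www row from the tables above.
    static SparseRowSketch webtableRow() {
        SparseRowSketch row = new SparseRowSketch();
        row.put("anchor:cnnsi.com", 9, "CNN");
        row.put("anchor:my.look.ca", 8, "CNN.com");
        row.put("contents:html", 6, "<html>...");
        row.put("contents:html", 5, "<html>...");
        row.put("contents:html", 3, "<html>...");
        return row;
    }

    public static void main(String[] args) {
        SparseRowSketch row = webtableRow();
        System.out.println(row.getAt("contents:html", 8));               // null
        System.out.println(row.getAt("anchor:my.look.ca", 9));           // null
        System.out.println(row.getLatest("contents:html").getKey());     // 6
        System.out.println(row.getLatest("anchor:cnnsi.com").getKey());  // 9
        System.out.println(row.getLatest("anchor:my.look.ca").getKey()); // 8
    }
}
```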

9.3. Table

Tables are declared up front at schema definition time.

9.4. Row

Row keys are uninterpreted bytes. Rows are lexicographically sorted with the lowest order appearing first in a table. The empty byte array is used to denote both the start and end of a table's namespace.

9.5. Column Family

Columns in HBase are grouped into column families. All column members of a column family have the same prefix. For example, the columns courses:history and courses:math are both members of the courses column family. The colon character (:) delimits the column family from the column family qualifier. The column family prefix must be composed of printable characters. The qualifying tail, the column family qualifier, can be made of any arbitrary bytes. Column families must be declared up front at schema definition time, whereas columns do not need to be defined at schema time and can be conjured on the fly while the table is up and running.
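As an illustration of the naming convention, the sketch below splits a full column name into its family and qualifier. It is a hypothetical helper, not an HBase API, and it assumes that the family name itself contains no colon, so the first colon is always the delimiter while the qualifier may contain any bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for the family:qualifier convention; not part of the HBase API.
final class ColumnNameSketch {

    // Split at the first ':' byte. Element 0 is the family, element 1 the qualifier.
    static byte[][] split(byte[] column) {
        for (int i = 0; i < column.length; i++) {
            if (column[i] == ':') {
                return new byte[][] {
                    Arrays.copyOfRange(column, 0, i),
                    Arrays.copyOfRange(column, i + 1, column.length)
                };
            }
        }
        throw new IllegalArgumentException("no ':' delimiter in column name");
    }

    static String family(String column) {
        byte[][] parts = split(column.getBytes(StandardCharsets.UTF_8));
        return new String(parts[0], StandardCharsets.UTF_8);
    }

    static String qualifier(String column) {
        byte[][] parts = split(column.getBytes(StandardCharsets.UTF_8));
        return new String(parts[1], StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(family("courses:history"));      // courses
        System.out.println(qualifier("courses:history"));   // history
        System.out.println(qualifier("anchor:my.look.ca")); // my.look.ca
    }
}
```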

Physically, all column family members are stored together on the filesystem. Because tunings and storage specifications are done at the column family level, it is advised that all column family members have the same general access pattern and size characteristics.

 

9.6. Cells

A {row, column, version} tuple exactly specifies a cell in HBase. Cell content is uninterpreted bytes.

9.7. Versions

A {row, column, version} tuple exactly specifies a cell in HBase. It's possible to have an unbounded number of cells where the row and column are the same but the cell address differs only in its version dimension.

While row and column keys are expressed as bytes, the version is specified using a long integer. Typically this long contains time instances such as those returned by java.util.Date.getTime() or System.currentTimeMillis(), that is: the difference, measured in milliseconds, between the current time and midnight, January 1, 1970 UTC.

The HBase version dimension is stored in decreasing order, so that when reading from a store file, the most recent values are found first.

There is a lot of confusion over the semantics of cell versions in HBase. In particular, a couple of questions that often come up are:

  • If multiple writes to a cell have the same version, are all versions maintained or just the last?[13]
  • Is it OK to write cells in a non-increasing version order?[14]

Below we describe how the version dimension in HBase currently works[15].

9.7.1. Versions and HBase Operations

In this section we look at the behavior of the version dimension for each of the core HBase operations.

9.7.1.1. Get/Scan

Gets are implemented on top of Scans. The discussion of Get below applies equally to Scans.

By default, i.e. if you specify no explicit version, when doing a get, the cell whose version has the largest value is returned (which may or may not be the latest one written, see later). The default behavior can be modified in the following ways:

  • to return more than one version, see Get.setMaxVersions()
  • to return versions other than the latest, see Get.setTimeRange()

    To retrieve the latest version that is less than or equal to a given value, thus giving the ‘latest’ state of the record at a certain point in time, just use a range from 0 to the desired version and set the max versions to 1. Note that the upper bound of the time range is exclusive, so pass the desired version plus one as the maximum.
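The combination of a time range and a version limit can be sketched without the HBase client as follows. The class and method names are hypothetical, and the sketch assumes the half-open [min, max) range convention, which is why the upper bound is the desired version plus one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical model of time-range selection; not part of the HBase API.
final class TimeRangeSketch {

    // Return up to maxVersions versions within [minInclusive, maxExclusive), newest first.
    // The input set must use natural (ascending) ordering.
    static List<Long> select(NavigableSet<Long> versions,
                             long minInclusive, long maxExclusive, int maxVersions) {
        List<Long> selected = new ArrayList<>();
        for (Long version : versions.subSet(minInclusive, true, maxExclusive, false).descendingSet()) {
            if (selected.size() == maxVersions) {
                break;
            }
            selected.add(version);
        }
        return selected;
    }

    public static void main(String[] args) {
        NavigableSet<Long> versions = new TreeSet<>(List.of(3L, 5L, 6L));
        System.out.println(select(versions, 0, 5 + 1, 1)); // [5]  state as of version 5
        System.out.println(select(versions, 0, 4 + 1, 1)); // [3]  state as of version 4
        System.out.println(select(versions, 0, 5, 1));     // [3]  upper bound 5 excludes version 5
    }
}
```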

9.7.1.2. Default Get Example

The following Get will only retrieve the current version of the row:

        Get get = new Get(Bytes.toBytes("row1"));
        Result r = htable.get(get);
        byte[] b = r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr"));  // returns current version of value

 

9.7.1.3. Versioned Get Example

The following Get will return the last 3 versions of the row.

        Get get = new Get(Bytes.toBytes("row1"));
        get.setMaxVersions(3);  // will return last 3 versions of row
        Result r = htable.get(get);
        byte[] b = r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr"));  // returns current version of value
        List<KeyValue> kv = r.getColumn(Bytes.toBytes("cf"), Bytes.toBytes("attr"));  // returns all versions of this column

 

9.7.1.4. Put

Doing a put always creates a new version of a cell, at a certain timestamp. By default the system uses the server’s currentTimeMillis, but you can specify the version (= the long integer) yourself, on a per-column level. This means you could assign a time in the past or the future, or use the long value for non-time purposes.

To overwrite an existing value, do a put at exactly the same row, column, and version as that of the cell you would overshadow.

9.7.1.4.1. Implicit Version Example

The following Put will be implicitly versioned by HBase with the current time.

          Put put = new Put(Bytes.toBytes(row));
          put.add(Bytes.toBytes("cf"), Bytes.toBytes("attr1"), Bytes.toBytes(data));
          htable.put(put);

 

9.7.1.4.2. Explicit Version Example

The following Put has the version timestamp explicitly set.

          Put put = new Put(Bytes.toBytes(row));
          long explicitTimeInMs = 555;  // just an example
          put.add(Bytes.toBytes("cf"), Bytes.toBytes("attr1"), explicitTimeInMs, Bytes.toBytes(data));
          htable.put(put);

 

9.7.1.5. Delete

When performing a delete operation in HBase, there are two ways to specify the versions to be deleted:

  • Delete all versions older than or equal to a certain timestamp
  • Delete the version at a specific timestamp

A delete can apply to a complete row, a complete column family, or to just one column. It is only in the last case that you can delete explicit versions. The deletion of a row, or of all the columns within a family, always works by deleting all cells older than or equal to a certain version.

Deletes work by creating tombstone markers. For example, let’s suppose we want to delete a row. For this you can specify a version, or else by default the currentTimeMillis is used. This means: delete all cells where the version is less than or equal to this version. HBase never modifies data in place, so for example a delete will not immediately delete (or mark as deleted) the entries in the storage file that correspond to the delete condition. Rather, a so-called tombstone is written, which will mask the deleted values[16]. If the version you specified when deleting a row is larger than the version of any value in the row, then you can consider the complete row to be deleted.

9.7.2. Current Limitations

There are still some bugs (or at least ‘undecided behavior’) with the version dimension that will be addressed by later HBase releases.

9.7.2.1. Deletes mask Puts

Deletes mask puts, even puts that happened after the delete was entered[17]. Remember that a delete writes a tombstone, which only disappears after the next major compaction has run. Suppose you do a delete of everything <= T. After this you do a new put with a timestamp <= T. This put, even if it happened after the delete, will be masked by the delete tombstone. Performing the put will not fail, but when you do a get you will notice the put had no effect. Puts at those timestamps will start working again after the major compaction has run. These issues should not be a problem if you use always-increasing versions for new puts to a row. But they can occur even if you do not care about time: just do a delete and a put immediately after each other, and there is some chance they happen within the same millisecond.
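The masking behavior can be reproduced with a small in-memory model of a single column. This is a sketch under simplifying assumptions rather than HBase's implementation: the class and method names are hypothetical, only one "delete everything <= T" tombstone is tracked, and majorCompact() stands in for a real major compaction.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of one column with a delete tombstone; not part of the HBase API.
final class TombstoneSketch {

    // timestamp -> value, in ascending timestamp order.
    private final TreeMap<Long, String> cells = new TreeMap<>();

    // Masks every cell whose timestamp is <= this value; null means no tombstone.
    private Long tombstone = null;

    void put(long timestamp, String value) {
        cells.put(timestamp, value);
    }

    // Write a tombstone covering everything <= timestamp. No cell is removed yet.
    void deleteUpTo(long timestamp) {
        tombstone = (tombstone == null) ? timestamp : Math.max(tombstone, timestamp);
    }

    // Return the newest value that is not masked by the tombstone, or null.
    String get() {
        Map.Entry<Long, String> newest = cells.lastEntry();
        if (newest == null) {
            return null;
        }
        if (tombstone != null && newest.getKey() <= tombstone) {
            return null; // the newest cell is masked, so every older cell is masked too
        }
        return newest.getValue();
    }

    // Physically drop the masked cells together with the tombstone.
    void majorCompact() {
        if (tombstone != null) {
            cells.headMap(tombstone, true).clear();
            tombstone = null;
        }
    }

    public static void main(String[] args) {
        TombstoneSketch column = new TombstoneSketch();
        column.put(10, "a");
        column.deleteUpTo(20);
        column.put(15, "b");              // written after the delete, but 15 <= 20
        System.out.println(column.get()); // null: the put is masked
        column.majorCompact();
        column.put(15, "c");
        System.out.println(column.get()); // c: the tombstone is gone
    }
}
```

In this model, a put at timestamp 25 issued before the compaction would be visible immediately, which matches the advice to use always-increasing versions.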

9.7.2.2. Major compactions change query results

…create three cell versions at t1, t2 and t3, with a maximum-versions setting of 2. So when getting all versions, only the values at t2 and t3 will be returned. But if you delete the version at t2 or t3, the one at t1 will appear again. Obviously, once a major compaction has run, such behavior will not be the case anymore…[18]
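The scenario quoted above can be traced with a small model. It is a sketch under stated assumptions, not HBase's implementation: the class and method names are hypothetical, the numbers 1 to 3 stand in for t1 to t3, the values are placeholders, and majorCompact() is assumed to keep only the newest non-deleted versions up to the configured maximum.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical model of max-versions and single-version deletes; not part of the HBase API.
final class MaxVersionsSketch {

    private final int maxVersions;

    // timestamp -> value, with the newest timestamp first.
    private final TreeMap<Long, String> cells =
        new TreeMap<Long, String>(Comparator.<Long>reverseOrder());

    // Tombstones for individual versions.
    private final Set<Long> deletedVersions = new HashSet<>();

    MaxVersionsSketch(int maxVersions) {
        this.maxVersions = maxVersions;
    }

    void put(long timestamp, String value) {
        cells.put(timestamp, value);
    }

    void deleteVersion(long timestamp) {
        deletedVersions.add(timestamp);
    }

    // Newest-first timestamps: skip deleted versions, stop after maxVersions.
    List<Long> getVersions() {
        List<Long> visible = new ArrayList<>();
        for (Long timestamp : cells.keySet()) {
            if (deletedVersions.contains(timestamp)) {
                continue;
            }
            if (visible.size() == maxVersions) {
                break;
            }
            visible.add(timestamp);
        }
        return visible;
    }

    // Keep only the currently visible versions and drop all tombstones.
    void majorCompact() {
        cells.keySet().retainAll(getVersions());
        deletedVersions.clear();
    }

    public static void main(String[] args) {
        MaxVersionsSketch before = new MaxVersionsSketch(2);
        before.put(1, "v1");
        before.put(2, "v2");
        before.put(3, "v3");
        System.out.println(before.getVersions()); // [3, 2]
        before.deleteVersion(2);
        System.out.println(before.getVersions()); // [3, 1]: the version at 1 reappears

        MaxVersionsSketch after = new MaxVersionsSketch(2);
        after.put(1, "v1");
        after.put(2, "v2");
        after.put(3, "v3");
        after.majorCompact();                     // the excess version at 1 is dropped
        after.deleteVersion(2);
        System.out.println(after.getVersions());  // [3]
    }
}
```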


[13] Currently, only the last written is fetchable.

[14] Yes.

[15] See HBASE-2406 for discussion of HBase versions. Bending time in HBase makes for a good read on the version, or time, dimension in HBase. It has more detail on versioning than is provided here. As of this writing, the limitation Overwriting values at existing timestamps mentioned in the article no longer holds in HBase. This section is basically a synopsis of this article by Bruno Dumon.

[16] When HBase does a major compaction, the tombstones are processed to actually remove the dead values, together with the tombstones themselves.

[18] See Garbage Collection in Bending time in HBase.
