Database Access with Hadoop

Database Access with Hadoop | Apache Hadoop for the Enterprise | Cloudera.

Hadoop’s strength is that it enables ad-hoc analysis of unstructured or semi-structured data. Relational databases, by contrast, allow for fast queries of highly structured data sources. A longstanding point of frustration has been the inability to easily query both of these sources at the same time. The DBInputFormat component provided in Hadoop 0.19 finally allows easy import and export of data between Hadoop and many relational databases, allowing relational data to be more easily incorporated into your data processing pipeline.

This blog post explains how the DBInputFormat works and provides an example of using DBInputFormat to import data into HDFS.

DBInputFormat and JDBC

First we’ll cover how DBInputFormat interacts with databases. DBInputFormat uses JDBC to connect to data sources. Because JDBC is widely implemented, DBInputFormat can work with MySQL, PostgreSQL, and several other database systems. Individual database vendors provide JDBC drivers to allow third-party applications (like Hadoop) to connect to their databases. Links to popular drivers are listed in the resources section at the end of this post.

To start using DBInputFormat to connect to your database, you’ll need to download the appropriate database driver from the list in the resources section (see the end of this post) and drop it into the $HADOOP_HOME/lib/ directory on your Hadoop TaskTracker machines, as well as on the machine from which you launch your jobs.

Reading Tables with DBInputFormat

The DBInputFormat is an InputFormat class that allows you to read data from a database. An InputFormat is Hadoop’s formalization of a data source; it can mean files formatted in a particular way, data read from a database, etc. DBInputFormat provides a simple method of scanning entire tables from a database, as well as the means to read from arbitrary SQL queries performed against the database. Most queries are supported, subject to a few limitations discussed at the end of this article.

Configuring the job

To use the DBInputFormat, you’ll need to configure your job. The following example shows how to connect to a MySQL database and load from a table:

CREATE TABLE employees (
  employee_id INTEGER NOT NULL PRIMARY KEY,
  name VARCHAR(32) NOT NULL
);

Listing 1: Example table schema

JobConf conf = new JobConf(getConf(), MyDriver.class);
conf.setInputFormat(DBInputFormat.class);
DBConfiguration.configureDB(conf,
    "com.mysql.jdbc.Driver",
    "jdbc:mysql://localhost/mydatabase");
String [] fields = { "employee_id", "name" };
DBInputFormat.setInput(conf, MyRecord.class, "employees",
    null /* conditions */, "employee_id", fields);
// set Mapper, etc., and call JobClient.runJob(conf);

Listing 2: Java code to set up a MapReduce job with DBInputFormat

This example code will connect to mydatabase on localhost and read the two fields from the employees table.

The configureDB() and setInput() calls configure the DBInputFormat. The first call specifies the JDBC driver implementation to use and which database to connect to. The second call specifies what data to load from the database. The MyRecord class is the Java class that will hold the data for a single row, and "employees" is the name of the table to read. The "employee_id" parameter specifies the table’s primary key, used for ordering results; the section “Limitations of the InputFormat” below explains why this is necessary. Finally, the fields array lists which columns of the table to read. An overloaded definition of setInput() allows you to specify an arbitrary SQL query to read from instead.

After calling configureDB() and setInput(), you should configure the rest of your job as usual, setting the Mapper and Reducer classes, specifying any other data sources to read from (e.g., datasets in HDFS) and other job-specific parameters.

Retrieving the data

The DBInputFormat will read from the database, but how does data get to your mapper? The setInput() method used in the example above took, as a parameter, the name of a class that will hold the contents of one row. You’ll need to write an implementation of the DBWritable interface to allow DBInputFormat to populate your class with fields from the table. DBWritable is an adaptor interface that allows data to be read and written using both Hadoop’s internal serialization mechanism and JDBC calls. Once the data has been read into your custom class, you can then read its fields in the mapper.

The following example provides a DBWritable implementation that holds one record from the employees table, as described above:

class MyRecord implements Writable, DBWritable {
  long id;
  String name;

  public void readFields(DataInput in) throws IOException {
    this.id = in.readLong();
    this.name = Text.readString(in);
  }

  public void readFields(ResultSet resultSet) throws SQLException {
    this.id = resultSet.getLong(1);
    this.name = resultSet.getString(2);
  }

  public void write(DataOutput out) throws IOException {
    out.writeLong(this.id);
    Text.writeString(out, this.name);
  }

  public void write(PreparedStatement stmt) throws SQLException {
    stmt.setLong(1, this.id);
    stmt.setString(2, this.name);
  }
}

Listing 3: DBWritable implementation for records from the employees table

A java.sql.ResultSet object represents the data returned from a SQL statement. It contains a cursor representing a single row of the results; this row will contain the fields specified in the setInput() call. In the readFields(ResultSet) method of MyRecord, we read the two fields from the ResultSet. The readFields() and write() methods that operate on java.io.DataInput and DataOutput objects are part of the Writable interface used by Hadoop to marshal data between mappers and reducers, or to pack results into SequenceFiles.

Using the data in a mapper

The mapper then receives an instance of your DBWritable implementation as its input value. The input key is a row id provided by the database; you’ll most likely discard this value.

public class MyMapper extends MapReduceBase
    implements Mapper<LongWritable, MyRecord, LongWritable, Text> {
  public void map(LongWritable key, MyRecord val,
      OutputCollector<LongWritable, Text> output, Reporter reporter)
      throws IOException {
    // Use val.id, val.name here
    output.collect(new LongWritable(val.id), new Text(val.name));
  }
}

Listing 4: Example mapper using a custom DBWritable

Writing results back to the database

A companion class, DBOutputFormat, will allow you to write results back to a database. When setting up the job, call conf.setOutputFormat(DBOutputFormat.class); and then call DBConfiguration.configureDB() as before.

The DBOutputFormat.setOutput() method then defines how the results will be written back to the database. Its arguments are the JobConf object for the job, the name of the table to write to, and the names of the fields of the table to populate, e.g., DBOutputFormat.setOutput(job, "employees", "employee_id", "name");.

The same DBWritable implementation that you created earlier will suffice to inject records back into the database. The write(PreparedStatement stmt) method will be invoked on each instance of the DBWritable that you pass to the OutputCollector from the reducer. At the end of reducing, those PreparedStatement objects will be turned into INSERT statements to run against the SQL database.
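Continuing Listing 2, the output side of the job can be wired up as follows. This is a hedged sketch that reuses the same database, driver, and MyRecord class as the earlier listings; adapt the connection string to your own setup:

```java
// Sketch: configuring DBOutputFormat for the job from Listing 2.
conf.setOutputFormat(DBOutputFormat.class);
DBConfiguration.configureDB(conf,
    "com.mysql.jdbc.Driver",
    "jdbc:mysql://localhost/mydatabase");
// Populate the two columns of the employees table; MyRecord's
// write(PreparedStatement) method fills in the two parameters per record.
DBOutputFormat.setOutput(conf, "employees", "employee_id", "name");
```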

Limitations of the InputFormat

JDBC allows applications to generate SQL queries which are executed against the database; the results are then returned to the calling application. Keep in mind that you will be interacting with your database via repeated SQL queries. Therefore:

  • Hadoop may need to execute the same query multiple times, and each execution must return the same results. Therefore, any concurrent updates to your database should not affect the query being run by your MapReduce job. This can be accomplished by disallowing writes to the table while your MapReduce job runs, by restricting your query with a clause such as “insert_date < yesterday,” or by dumping the data to a temporary table in the database before launching your MapReduce process.
  • In order to parallelize the processing of records from the database, Hadoop will execute SQL queries that use ORDER BY, LIMIT, and OFFSET clauses to select ranges out of tables. Your results, therefore, need to be orderable by one or more keys (either PRIMARY, like the one in the example, or UNIQUE).
  • In order to set the number of map tasks, DBInputFormat needs to know how many records it will read. So if you’re writing an arbitrary SQL query against the database, you will need to provide a second query that returns the number of rows the first query will return (e.g., by using COUNT and GROUP BY).
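
For the arbitrary-query case in the last point, setInput() has an overload that takes the main query and the row-count query as strings. A hedged sketch, reusing the MyRecord class from Listing 3 (the WHERE clause is purely illustrative):

```java
// Sketch: reading from an arbitrary query instead of a whole table.
// The second query must return the number of rows the first will produce,
// so DBInputFormat can split the input across map tasks.
DBInputFormat.setInput(conf, MyRecord.class,
    "SELECT employee_id, name FROM employees "
        + "WHERE employee_id > 100 ORDER BY employee_id",
    "SELECT COUNT(employee_id) FROM employees WHERE employee_id > 100");
```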

With these restrictions in mind, there’s still a great deal of flexibility available to you. You can bulk load entire tables into HDFS, or select large ranges of data. For example, if you want to read records from a table that is also being populated by another source concurrently, you might set up that table to attach a timestamp field to each record. Before doing the bulk read, pick the current timestamp, then select all records with timestamps earlier than that one. New records being fed in by the other writer will have later timestamps and will not affect the MapReduce job.
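The timestamp approach can be expressed through the conditions parameter of setInput(). A hedged sketch, assuming the table has a created_at timestamp column (not part of the earlier example schema):

```java
// Sketch: snapshot the current time before launching the job, then
// only read rows written before that instant. Later writes by the
// concurrent producer will not affect the job's queries.
String snapshot = new java.sql.Timestamp(System.currentTimeMillis()).toString();
String conditions = "created_at < '" + snapshot + "'";
DBInputFormat.setInput(conf, MyRecord.class, "employees",
    conditions, "employee_id", fields);
```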

Finally, be careful to understand the bottlenecks in your data processing pipeline. Launching a MapReduce job with 100 mappers performing queries against a database server may overload the server or its network connection. In this case, you’ll achieve less parallelism than theoretically possible, due to starvation, disk seeks, and other performance penalties.

Limitations of the OutputFormat

The DBOutputFormat writes to the database by generating a set of INSERT statements in each reducer. The reducer’s close() method then executes them in a bulk transaction. Performing a large number of these from several reduce tasks concurrently can swamp a database. If you want to export a very large volume of data, you may be better off generating the INSERT statements into a text file, and then using a bulk data import tool provided by your database to do the database import.
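Generating the INSERT statements yourself is straightforward string formatting; in a real job you would emit these lines through TextOutputFormat and feed the resulting file to your database’s bulk loader. A minimal sketch (the helper name is illustrative, and values are assumed to be already escaped):

```java
// Sketch: render one employees row as an INSERT statement for bulk loading.
public class InsertStatements {
  static String toInsert(long id, String name) {
    return "INSERT INTO employees (employee_id, name) VALUES ("
        + id + ", '" + name + "');";
  }

  public static void main(String[] args) {
    System.out.println(toInsert(1, "alice"));
    // prints: INSERT INTO employees (employee_id, name) VALUES (1, 'alice');
  }
}
```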

Conclusions

DBInputFormat provides a straightforward interface to read data from a database into your MapReduce applications. You can read database tables into HDFS, import them into Hive, or use them to perform joins in MapReduce jobs. By supporting JDBC, it provides a common interface to a variety of different database sources.

This is probably best not used as a primary data access mechanism; queries against database-driven data are most efficiently executed within the database itself, and large-scale data migration is better done using the bulk data export/import tools associated with your database. But when analysis of ad hoc data in HDFS can be improved by the addition of some additional relational data, DBInputFormat allows you to quickly perform the join without a large amount of setup overhead. DBOutputFormat then allows you to export results back to the same database for combining with other database-driven tables.

DBInputFormat is available in Hadoop 0.19 and is provided by HADOOP-2536, a patch started by Fredrik Hedberg and further developed by Enis Soztutar. A backport of this patch that can be applied to Hadoop 0.18.3 is available at the above link.

This article is based on a talk I gave at the SF Bay Hadoop User Group meetup on February 18th; the slides from that talk are available as a PDF.

Resources

WordPress Utility: Post by Email

Post by Email « Support — WordPress.com.

Post by Email is a way of publishing posts on your blog by email. Any email client can be used to send the email, allowing you to publish quickly and easily from devices such as cell phones.

Generating a Post by Email Address

Before you can publish by email, you must generate a special email address. This address is unique to you and must be kept secret.

1) First go to Dashboard > My Blogs.

2) Locate the blog that you wish to post to and click Enable.

(screenshot: enabling Post by Email generates your address)

You now have a special email address. If you want to add the address to your address book, you can download it as a vCard:

(screenshot: vCard download link)

Sending Emails

Once you have your Post by Email address, sending an email is simple:

(screenshot: an example email)

The email subject is used as your post’s title. The body is the post’s contents. A few minutes after receiving your email, you should receive a notification email informing you of the published post’s details.

Please remember to send the email to your secret email address, not to the example given in the image above.

Mail Formatting

Your email can be plain text or formatted. As much formatting as possible will be retained, although the Post by Email system will strip unnecessary HTML tags so that your email is displayed correctly. Note that you will need an email client that supports rich text or HTML formatting in order to use this feature. Most web-based clients (Hotmail, Gmail) support this, as do most desktop clients (Outlook, Mail). You may need to switch your client into rich text or formatted mode.

Attachments

Image attachments will be included in your published post as follows:

  • Single images will be displayed inline (a single image is defined as an image without an image immediately following it).
  • Multiple images will be displayed as a gallery.

Multiple galleries and single images are allowed in the same post. Note that using the [nogallery] shortcode will disable all galleries.

If you have purchased the Space Upgrade, then the following additional attachment types will be supported:

  • Supported audio files (mp3) will be displayed using the WordPress Audio player.
  • All other files (doc, PDF, etc) will be displayed as links to the attachment.

Additionally, if you’ve purchased the VideoPress Upgrade, you’ll be able to send mp4, mov, wmv, avi, mpg, and m4v files as attachments, which will then be displayed using the WordPress video player.

Note that you can send your attachments to:

media+YOUR_SECRET_EMAIL@post.wordpress.com

Your attachments will just be stored in your account without creating a new post – the attachments will appear alongside your other media.

Shortcodes

Special shortcodes can be embedded in your email to configure various aspects of the published post:

  • [category x,y,z]
  • [excerpt]some excerpt[/excerpt]
  • [tags x,y,z]
  • [delay +1 hour]
  • [comments on | off]
  • [status publish | pending | draft | private]
  • [password secret-password]
  • [slug some-url-name]
  • [title Your post title]
  • [end] – everything after this shortcode is ignored (i.e. signatures)
  • [slideshow] – replaces the auto-gallery with a slideshow
  • [nogallery] – disables the auto-gallery and displays all images inline
  • [more] – more tag
  • [nextpage] – pagination
  • [geotag on | off] – override your geotagging privacy defaults to enable or disable the showing of geo information
  • [publicize off|yahoo|twitter|facebook] – change Publicize options (see below)
  • [poll]question and answers[/poll] – insert a Polldaddy poll into your post (see below)

Shortcodes can be included anywhere in the body of your email and must be in lowercase.

Specifying the Category

The category shortcode will match the start of category titles, as well as category IDs. For example:

[category Hol,Fo]

This will match “Holiday” and “Food.” Note that categories must already exist on your blog, and spaces between the commas are not important.

Specifying Tags

Any number of tags can be added to your post, each separated by a comma:

[tags one potato, two potato, three potato, more]

This will add four tags: “one potato”, “two potato”, “three potato”, and “more”. Note that your tags do not need to exist elsewhere in your blog and new tags will be created automatically.

Changing your Publicize settings

WordPress Publicize lets you notify other web services about your posts. With the [publicize] shortcode you can control this from emails.

[publicize off] – disable all Publicize notifications
[publicize twitter] – only send a notification to Twitter
[publicize twitter]my new post[/publicize] – only send a notification to Twitter and set the Twitter status to my new post

Note that your Publicize settings must have been previously configured.

Inserting a Polldaddy poll

You must first have created or imported a Polldaddy account into WordPress.com before using this shortcode. Once set up, you can insert a poll in an email as follows:

[poll]
What is the worst movie of the decade?
* The Love Guru
* Fool's Gold
[/poll]

Note how the poll question is added after the [poll] shortcode, and each answer is on a new line and starts with an asterisk. The poll must be finished with [/poll].

You can configure your poll by adding extra details to the [poll] shortcode:

  • type="single | multi | 2 | 3" – how many times a vote may be registered (single by default)
  • style="manga medium" – the style of the poll, as taken from the Polldaddy polls page.
  • other="yes | no" – allow an ‘other’ response (no by default)

For example, to create a poll with the wide ‘manga’ style and allow up to three responses (including an ‘other’ response):

[poll style="manga wide" other="yes" type="3"]
What is the worst movie of the decade?
* The Love Guru
* Fool's Gold
[/poll]

Delaying Your Post

The delay shortcode will accept any time allowed by PHP’s strtotime. For example, you can:

[delay +1 hour]
[delay +2 days]

Providing a post title

The title of your published post is usually taken from the subject line of your email. In some instances, such as when sending an email from some cell phones or via an MMS-Email gateway, you may not be able to provide a subject. In this instance, you can set your post title directly inside the email:

[title My Fancy Post]

Changing your post status

Sometimes you may want your post to be private, or to be reviewed by yourself or someone else before being published. To do this, you can use the [status] shortcode to set the post status.
[status private]

Geotagging

If your email includes an image that contains appropriate GPS information (for example, as sent from an iPhone), then this will be used to geotag your post. The information will only be shown to the public if you configure your blog to do so. You can override this on a per-email basis using the [geotag on] and [geotag off] shortcodes.

Emails sent from a SPOT GPS device will be automatically geotagged.

Signatures

Post by email will automatically remove any email signatures that match the standard signature block pattern:

--

(that is dash dash space)

It will also remove anything after an <hr/> HTML tag, and it attempts to clean up cellphone network signatures.

If your email system attaches a signature that does not match any of these patterns then you can manually tell Post by Email to stop including text by adding the special [end] shortcode. Anything after this will be removed from your post. If your cellphone network is adding a signature and you want us to remove it, then let us know the details and we’ll look into it.

Example Email with Shortcodes

The following email will be published in two days’ time to the “WordPress” category, with tags “announcement” and “WordPress”:

Welcome to Post by Email, the easiest way to blog!

[tags announcement, WordPress]
[category WordPress]
[delay +2 days]
