Friday, July 22, 2016

It’s all about Apache Sqoop…

Sqoop provides a mechanism to connect external systems such as enterprise data warehouses (EDWs like Amazon Redshift) and relational database management systems (MySQL, Oracle, MS SQL Server, etc.) to Hadoop, and to transfer data between Hadoop and those external systems. It efficiently transfers bulk data between Hadoop systems and databases.

Sqoop helps us bring structured data from an RDBMS into HDFS, where other Hadoop ecosystem components such as MapReduce, HBase, and Hive can analyze it alongside unstructured big data. The Sqoop paradigm is divided into two parts: Sqoop Import and Sqoop Export.

Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.


The following diagram clearly illustrates how Sqoop connects and works.


Connectors follow the JDBC model and are required to use common JDBC vocabulary (URL, database, table, etc.), regardless of whether it is strictly applicable.

Sqoop-Import – Purpose:
The import tool imports an individual table from an RDBMS to HDFS. Each row from a table is represented as a separate record in HDFS. Records can be stored as text files (one record per line), or in binary representation as Avro or SequenceFiles.

Connecting to database server:
Sqoop is designed to import tables from a database into HDFS. To do so, you must specify a connect string that describes how to connect to the database. The connect string is similar to a URL, and is communicated to Sqoop with the --connect argument. This describes the server and database to connect to; it may also specify the port. For example:

$ sqoop import --connect jdbc:mysql://<db_url>/<db_name>

It’s important that you do not use the URL localhost if you intend to use Sqoop with a distributed Hadoop cluster. The connect string you supply will be used on TaskTracker nodes throughout your MapReduce cluster; if you specify the literal name localhost, each node will connect to a different database (or more likely, no database at all). Instead, you should use the full hostname or IP address of the database host that can be seen by all your remote nodes.
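
For example, assuming the MySQL server runs on a (hypothetical) host named dbserver.example.com and contains a database named retail_db, the connect string would use that hostname rather than localhost:

$ sqoop import --connect jdbc:mysql://dbserver.example.com:3306/retail_db \
  --table <table_name>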

We can also authenticate to the database by supplying credentials; for example:

$ sqoop import --connect jdbc:mysql://<db_url>/<db_name> \
  --username <username> --password <password>
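
Passing --password on the command line exposes the password in the process list and shell history. As safer alternatives, Sqoop can prompt for the password at runtime with -P, or read it from a protected file with --password-file; the file path below is only an assumption:

$ sqoop import --connect jdbc:mysql://<db_url>/<db_name> \
  --username <username> -P

$ sqoop import --connect jdbc:mysql://<db_url>/<db_name> \
  --username <username> --password-file /user/<username>/.password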

You can use Sqoop with any other JDBC-compliant database. First, download the appropriate JDBC driver for the type of database you want to import, and install the .jar file in the $SQOOP_HOME/lib directory on your client machine. For example,

$ sqoop import --driver com.microsoft.jdbc.sqlserver.SQLServerDriver \
  --connect <connect-string> ...
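
As a sketch of what this looks like for SQL Server with the newer Microsoft JDBC driver (the hostname, database, and table names here are assumptions; that driver uses the class com.microsoft.sqlserver.jdbc.SQLServerDriver and a jdbc:sqlserver:// URL):

$ sqoop import --driver com.microsoft.sqlserver.jdbc.SQLServerDriver \
  --connect "jdbc:sqlserver://dbserver.example.com:1433;databaseName=retail_db" \
  --username <user> --password <password> \
  --table customers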

Importing Table Data into HDFS:
The syntax used to import tables from database systems into the Hadoop file system is as follows:

$ sqoop import (generic-args) (import-args)
$ sqoop-import (generic-args) (import-args)

Below are some of the commonly used import control arguments; a combined example follows the list.

--append                       Append data to an existing dataset in HDFS
--as-textfile                  Imports data as plain text (default)
--columns <col,col,col…>       Columns to import from the table
--fetch-size <n>               Number of entries to read from the database at once
-m,--num-mappers <n>           Use n map tasks to import in parallel
--table <table-name>           Table to read
--target-dir <dir>             HDFS destination directory
--warehouse-dir <dir>          HDFS parent directory for the table destination
--where <where clause>         WHERE clause to use during import
--null-string <null-string>    The string to be written for a null value in string columns
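
As an illustration of combining several of these arguments, the following sketch (the table, column, and directory names are assumptions) imports two columns of a customers table into a specific HDFS directory using four parallel map tasks:

$ sqoop import --connect jdbc:mysql://<db_host>/<db_name> \
  --username <user> --password <password> \
  --table customers \
  --columns "id,name" \
  --num-mappers 4 \
  --target-dir /user/hadoop/customers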

The following is the command to import data from the database to the Hadoop file system; in HDFS, the table data is stored as delimited text by default, which corresponds to the --as-textfile control argument.

$ sqoop import --connect jdbc:mysql://<db_host>/<db_name> \
  --username <user> --password <password> \
  --table <table_name> -m 1
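
Besides plain text, the same import can be stored in a binary format; for example, the sketch below writes Avro data files using --as-avrodatafile (--as-sequencefile works the same way for SequenceFiles):

$ sqoop import --connect jdbc:mysql://<db_host>/<db_name> \
  --username <user> --password <password> \
  --table <table_name> --as-avrodatafile -m 1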

If we want the table data placed in a specific directory in HDFS, we specify the --target-dir control argument as follows:

$ sqoop import --connect jdbc:mysql://<db_host>/<db_name> \
  --username <user> --password <password> \
  --table <table_name> \
  --target-dir /<target_dir_name>

Filtering records: Sqoop allows us to fetch a subset of table data using the --where argument. It works just like the WHERE clause in a typical SQL query. The syntax is as follows:

$ sqoop import --connect jdbc:mysql://<db_host>/<db_name> \
  --username <user> --password <password> \
  --table <table_name> \
  --where "<condition>" \
  --target-dir /<target_dir_name>
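
For instance, assuming a hypothetical orders table with an order_date column, only this year's rows could be imported like this:

$ sqoop import --connect jdbc:mysql://<db_host>/<db_name> \
  --username <user> --password <password> \
  --table orders \
  --where "order_date >= '2016-01-01'" \
  --target-dir /user/hadoop/orders_2016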

Incremental Imports: Sqoop provides a mechanism to retrieve only rows newer than some previously-imported set of rows. The following arguments control incremental imports:

--check-column (col)    Specifies the column to be examined when determining which rows to import.
--incremental (mode)    Specifies how Sqoop determines which rows are new. Legal values for mode include append and lastmodified.
--last-value (value)    Specifies the maximum value of the check column from the previous import.

You should specify append mode when importing a table where new rows are continually being added with increasing row id values. You specify the column containing the row’s id with --check-column. Sqoop imports rows where the check column has a value greater than the one specified with --last-value.

$ sqoop import --connect jdbc:mysql://<db_host>/<db_name> \
  --username <user> --password <password> \
  --table <table_name> \
  --incremental append \
  --check-column <col_name> \
  --last-value <col_value>
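
The lastmodified mode works the same way but compares a timestamp column; for example, assuming a hypothetical updated_at column and the timestamp recorded from the previous run:

$ sqoop import --connect jdbc:mysql://<db_host>/<db_name> \
  --username <user> --password <password> \
  --table <table_name> \
  --incremental lastmodified \
  --check-column updated_at \
  --last-value "2016-07-01 00:00:00"

Note that, depending on the Sqoop version, re-importing into an existing target directory in lastmodified mode may require either --append or --merge-key <col> to tell Sqoop how to combine the new rows with the old ones.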

All Tables Import: Sqoop also provides an option to import all of the tables from a database into the Hadoop file system; the syntax is as follows:

$ sqoop import-all-tables --connect jdbc:mysql://<db_host>/<db_name> \
  --username <user> --password <password>
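
Because import-all-tables writes one directory per table, it takes --warehouse-dir (a parent directory) instead of --target-dir, and individual tables can be skipped with --exclude-tables. A sketch with assumed directory and table names:

$ sqoop import-all-tables --connect jdbc:mysql://<db_host>/<db_name> \
  --username <user> --password <password> \
  --warehouse-dir /user/hadoop/warehouse \
  --exclude-tables audit_log,tmp_staging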

Sqoop-Export – Purpose:
The export tool exports a set of files from HDFS back to an RDBMS. The target table must already exist in the database. The input files are read and parsed into a set of records according to the user-specified delimiters.
The default operation is to transform these into a set of INSERT statements that inject the records into the database. In "update mode," Sqoop will generate UPDATE statements that replace existing records in the database, and in "call mode" Sqoop will make a stored procedure call for each record.

Exporting HDFS Data into Database Tables:
The syntax used to export data from the Hadoop file system into database tables is as follows:

$ sqoop export (generic-args) (export-args)
$ sqoop-export (generic-args) (export-args)

Below are some of the commonly used export control arguments.

--export-dir <dir>                      HDFS source path for the export
-m,--num-mappers <n>                    Use n map tasks to export in parallel
--table <table-name>                    Table to populate
--call <stored-proc-name>               Stored procedure to call
--update-key <col-name>                 Anchor column to use for updates. Use a comma-separated list of columns if there is more than one column.
--update-mode <mode>                    Specify how updates are performed when new rows are found with non-matching keys in the database.
--staging-table <staging-table-name>    The table in which data will be staged before being inserted into the destination table.
--clear-staging-table                   Indicates that any data present in the staging table can be deleted.

The following is the command to export data to the database:

$ sqoop export --connect jdbc:mysql://<db_host>/<db_name> \
  --username <user> --password <password> \
  --table <table_name> \
  --export-dir <hdfs_dir>
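
To run the export in the update mode mentioned above, --update-key names the column used to match existing rows; with --update-mode allowinsert, rows that find no match are inserted rather than skipped. A sketch assuming a hypothetical id key column:

$ sqoop export --connect jdbc:mysql://<db_host>/<db_name> \
  --username <user> --password <password> \
  --table <table_name> \
  --export-dir <hdfs_dir> \
  --update-key id \
  --update-mode allowinsert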

Failed Exports
Exports may fail for a number of reasons:
  • Loss of connectivity from the Hadoop cluster to the database (either due to hardware fault, or server software crashes)
  • Attempting to INSERT a row which violates a consistency constraint (for example, inserting a duplicate primary key value)
  • Attempting to parse an incomplete or malformed record from the HDFS source data
  • Attempting to parse records using incorrect delimiters
  • Capacity issues (such as insufficient RAM or disk space)

If an export map task fails due to these or other reasons, it will cause the export job to fail. The results of a failed export are undefined. Each export map task operates in a separate transaction. Furthermore, individual map tasks commit their current transaction periodically. If a task fails, the current transaction will be rolled back. Any previously-committed transactions will remain durable in the database, leading to a partially-complete export.
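
One way to avoid leaving a partially-complete export visible is the --staging-table option listed above: data is first written to a staging table and only moved into the destination table once all map tasks have finished successfully. A sketch, assuming an empty staging table named <table_name>_stage with the same schema as the target already exists:

$ sqoop export --connect jdbc:mysql://<db_host>/<db_name> \
  --username <user> --password <password> \
  --table <table_name> \
  --export-dir <hdfs_dir> \
  --staging-table <table_name>_stage \
  --clear-staging-table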

Sqoop Job – Purpose:
The job tool allows you to create and work with saved jobs. Saved jobs remember the parameters used to specify a job, so they can be re-executed by invoking the job by its handle.
If a saved job is configured to perform an incremental import, state regarding the most recently imported rows is updated in the saved job to allow the job to continually import only the newest rows.

Syntax: The following is the syntax used to create a Sqoop job:

$ sqoop job (generic-args) (job-args) [-- [subtool-name] (subtool-args)]
$ sqoop-job (generic-args) (job-args) [-- [subtool-name] (subtool-args)]

The following are the job arguments for the Sqoop job tool.

Argument              Description
--create <job-id>     Define a new saved job with the specified job-id (name). A second Sqoop command line, separated by a --, should be specified; this defines the saved job.
--delete <job-id>     Delete a saved job.
--exec <job-id>       Given a job defined with --create, run the saved job.
--show <job-id>       Show the parameters for a saved job.
--list                List all saved jobs.

We will look at each of these one by one.

Create Job: This is the syntax to create a Sqoop job:

$ sqoop job --create <job_name> -- import --connect jdbc:mysql://<db_host>/<db_name> \
  --username <user> --password <password> \
  --table <table_name>

This creates a job named <job_name> which can be executed later. The job is not run. This job is now available in the list of saved jobs:

List Job Syntax: $ sqoop job --list

We can inspect the configuration of a job with the show action:

Description of a Job: $ sqoop job --show <job_name>

Execution of a Job: $ sqoop job --exec <job_name>
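
Saved jobs are especially useful for the incremental imports described earlier, because Sqoop updates the stored --last-value after each successful run. A sketch of such a job (the job, table, and column names are assumptions; by default Sqoop does not save the password and will prompt for it when the job is executed):

$ sqoop job --create daily_orders_import -- import \
  --connect jdbc:mysql://<db_host>/<db_name> \
  --username <user> --table orders \
  --incremental append --check-column id --last-value 0

$ sqoop job --exec daily_orders_import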

The above covers some of the core concepts of Sqoop: Sqoop Import, Sqoop Export, and Sqoop Jobs. In our next posts, we will cover more Sqoop concepts such as CodeGen, SQL Eval, etc., and walk through these commands with detailed examples, since the above are only the syntaxes.

We are data integration experts at Pacific Big Data, providing solutions to fit your needs.


 Contact us at: