It’s all about Apache Sqoop…
Sqoop provides a mechanism to connect Hadoop to external systems such as enterprise data warehouses (EDWs, for example Amazon Redshift) and relational database management systems such as MySQL, Oracle, and MS SQL Server, and to transfer data between the Hadoop system and these external systems. It transfers bulk data efficiently between Hadoop and the databases.
Sqoop helps us bring structured data from an RDBMS into HDFS, while the other Hadoop ecosystem components such as MapReduce, HBase, and Hive deal with analyzing big and unstructured data. The Sqoop paradigm is categorized into two parts: Sqoop Import and Sqoop Export.
Sqoop
automates most of this process, relying on the database to describe the schema
for the data to be imported. Sqoop uses MapReduce to import and export the
data, which provides parallel operation as well as fault tolerance.
The
following diagram clearly illustrates how Sqoop connects and works.
Connectors are required to follow the JDBC model and to use common JDBC vocabulary (URL, database, table, etc.), regardless of whether it is applicable.
Sqoop-Import – Purpose:
The import tool imports an individual
table from an RDBMS to HDFS. Each row from a table is represented as a separate
record in HDFS. Records can be stored as text files (one record per line), or
in binary representation as Avro or SequenceFiles.
Connecting to database server:
Sqoop
is designed to import tables from a database into HDFS. To do so, you must
specify a connect string that describes how to connect to the
database. The connect string is similar to a URL, and is
communicated to Sqoop with the --connect argument. This describes the
server and database to connect to; it may also specify the port. For example
$ sqoop import --connect jdbc:mysql://<db_url>/<db_name>
It’s
important that you do not use the URL localhost if
you intend to use Sqoop with a distributed Hadoop cluster. The connect string
you supply will be used on TaskTracker nodes throughout your MapReduce cluster;
if you specify the literal name localhost, each node will connect to a
different database (or more likely, no database at all). Instead, you should
use the full hostname or IP address of the database host that can be seen by
all your remote nodes.
Alternatively, we can supply credentials so that Sqoop authenticates to the database before accessing it; for example:
$ sqoop import --connect jdbc:mysql://<db_url>/<db_name> \
    --username <username> --password <password>
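Supplying --password on the command line exposes the password in the shell history and process list; one alternative is Sqoop's -P flag, which prompts for the password at runtime instead (shown here with the same placeholder connect string):
$ sqoop import --connect jdbc:mysql://<db_url>/<db_name> \
    --username <username> -P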
You
can use Sqoop with any other JDBC-compliant database. First, download the
appropriate JDBC driver for the type of database you want to import, and
install the .jar file in the $SQOOP_HOME/lib directory on your client
machine. For example,
$ sqoop import --driver com.microsoft.jdbc.sqlserver.SQLServerDriver \
    --connect <connect-string> ...
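As an illustration, once the driver jar is in place the same import pattern works for other JDBC-compliant databases; the sketch below assumes a hypothetical PostgreSQL host and database, with placeholders for the table and credentials:
$ sqoop import --connect jdbc:postgresql://<db_host>/<db_name> \
    --username <username> -P \
    --table <table_name>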
Importing Table Data into HDFS:
The syntax used to import tables from database systems into the Hadoop file system is as follows:
$ sqoop import (generic-args) (import-args)
$ sqoop-import (generic-args) (import-args)
Below are some of the commonly used import control arguments.
Argument | Description
--append | Append data to an existing dataset in HDFS
--as-textfile | Imports data as plain text (default)
--columns <col,col,col…> | Columns to import from the table
--fetch-size <n> | Number of entries to read from the database at once
-m,--num-mappers <n> | Use n map tasks to import in parallel
--table <table-name> | Table to read
--target-dir <dir> | HDFS destination directory
--warehouse-dir <dir> | HDFS parent directory for the table destination
--where <where clause> | WHERE clause to use during import
--null-string <null-string> | The string to be written for a null value in string columns
The following is the command to import data from a database into the Hadoop file system; in HDFS, the table data will be stored in delimited text format by default. We can specify the --as-textfile control argument explicitly if it has to be stored as plain text files.
$ sqoop import --connect jdbc:mysql://<db_host>/<db_name> \
    --username <user> --password <password> \
    --table <table_name> -m 1
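The binary formats mentioned earlier can be selected in the same way; for instance, the following sketch (same placeholders as above) uses --as-avrodatafile to store the imported records as Avro data files:
$ sqoop import --connect jdbc:mysql://<db_host>/<db_name> \
    --username <user> -P \
    --table <table_name> -m 1 \
    --as-avrodatafile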
If we want to place the table data in a specific directory in HDFS, we specify the control arguments as follows:
$ sqoop import --connect jdbc:mysql://<db_host>/<db_name> \
    --username <user> --password <password> \
    --table <table_name> \
    --target-dir /<target_dir_name>
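Several of the control arguments from the table above can be combined in one command; as a sketch, assuming a hypothetical table with id, name, and salary columns, --columns and -m restrict the import to selected columns and run it with four parallel map tasks:
$ sqoop import --connect jdbc:mysql://<db_host>/<db_name> \
    --username <user> -P \
    --table <table_name> \
    --columns "id,name,salary" \
    -m 4 \
    --target-dir /<target_dir_name>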
Filtering records:
Sqoop allows us to fetch a subset of the table data using the --where conditional argument. It is the same as the WHERE clause in a typical SQL query. The syntax is as follows:
$ sqoop import --connect jdbc:mysql://<db_host>/<db_name> \
    --username <user> --password <password> \
    --table <table_name> \
    --where "<condition>" \
    --target-dir /<target_dir_name>
Incremental Imports:
Sqoop provides an incremental import mode that can be used to retrieve only rows newer than some previously-imported set of rows. The following arguments control incremental imports:
Argument | Description
--check-column (col) | Specifies the column to be examined when determining which rows to import.
--incremental (mode) | Specifies how Sqoop determines which rows are new. Legal values for mode include append and lastmodified.
--last-value (value) | Specifies the maximum value of the check column from the previous import.
You
should specify append mode when importing a table where
new rows are continually being added with increasing row id values. You specify
the column containing the row’s id with --check-column.
Sqoop imports rows where the check column has a value greater than the one
specified with --last-value.
$ sqoop import --connect jdbc:mysql://<db_host>/<db_name> \
    --username <user> --password <password> \
    --table <table_name> \
    --incremental append \
    --check-column <col_name> \
    --last-value <col_value>
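For tables whose existing rows are also updated, the lastmodified mode is the alternative; the sketch below assumes a hypothetical timestamp column named last_updated and an id primary key, and uses --merge-key so that updated rows are reconciled with the previously imported data:
$ sqoop import --connect jdbc:mysql://<db_host>/<db_name> \
    --username <user> -P \
    --table <table_name> \
    --incremental lastmodified \
    --check-column last_updated \
    --last-value "2016-01-01 00:00:00" \
    --merge-key id \
    --target-dir /<target_dir_name>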
Importing All Tables:
Sqoop provides an option to import all of the tables in a database into the Hadoop file system; the syntax to import all tables is as follows:
$ sqoop import-all-tables --connect jdbc:mysql://<db_host>/<db_name> \
    --username <user> --password <password>
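If a few tables should be skipped, import-all-tables also accepts an --exclude-tables argument with a comma-separated list of table names; the names below are hypothetical:
$ sqoop import-all-tables --connect jdbc:mysql://<db_host>/<db_name> \
    --username <user> -P \
    --exclude-tables audit_log,temp_staging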
Sqoop-Export – Purpose:
The export tool exports a set of files
from HDFS back to an RDBMS. The target table must already exist in the
database. The input files are read and parsed into a set of records according
to the user-specified delimiters.
The default operation is to
transform these into a set of INSERT statements that inject the
records into the database. In "update mode," Sqoop will generate UPDATE statements that replace
existing records in the database, and in "call mode" Sqoop will make
a stored procedure call for each record.
Exporting HDFS Data into Database Tables:
The syntax used to export data from the Hadoop file system into database tables is as follows:
$ sqoop export (generic-args) (export-args)
$ sqoop-export (generic-args) (export-args)
Below are some of the commonly used export control arguments.
Argument | Description
--export-dir <dir> | HDFS source path for the export
-m,--num-mappers <n> | Use n map tasks to export in parallel
--table <table-name> | Table to populate
--call <stored-proc-name> | Stored procedure to call
--update-key <col-name> | Anchor column to use for updates. Use a comma-separated list of columns if there is more than one column.
--update-mode <mode> | Specifies how updates are performed when new rows are found with non-matching keys in the database.
--staging-table <staging-table-name> | The table in which data will be staged before being inserted into the destination table.
--clear-staging-table | Indicates that any data present in the staging table can be deleted.
The following is the command to export data to the database:
$ sqoop export --connect jdbc:mysql://<db_host>/<db_name> \
    --username <user> --password <password> \
    --table <table_name> \
    --export-dir <hdfs_dir>
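Exports can also run in the update mode described above; the following sketch (assuming a hypothetical anchor column named id) uses --update-key together with --update-mode allowinsert so that existing rows are updated and rows with new keys are inserted:
$ sqoop export --connect jdbc:mysql://<db_host>/<db_name> \
    --username <user> -P \
    --table <table_name> \
    --export-dir <hdfs_dir> \
    --update-key id \
    --update-mode allowinsert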
Failed Exports:
Exports
may fail for a number of reasons:
- Loss of connectivity from the Hadoop
cluster to the database (either due to hardware fault, or server software
crashes)
- Attempting to INSERT a
row which violates a consistency constraint (for example, inserting a duplicate
primary key value)
- Attempting to parse an incomplete or
malformed record from the HDFS source data
- Attempting to parse records using
incorrect delimiters
- Capacity issues (such as insufficient
RAM or disk space)
If
an export map task fails due to these or other reasons, it will cause the export
job to fail. The results of a failed export are undefined. Each export map task
operates in a separate transaction. Furthermore, individual map tasks commit their
current transaction periodically. If a task fails, the current transaction will
be rolled back. Any previously-committed transactions will remain durable in
the database, leading to a partially-complete export.
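One way to reduce the impact of a partially-complete export is the --staging-table argument from the table above: rows are first written to a staging table and moved to the destination table only after the job succeeds. The sketch below assumes a hypothetical, already-created staging table with the same structure as the target table:
$ sqoop export --connect jdbc:mysql://<db_host>/<db_name> \
    --username <user> -P \
    --table <table_name> \
    --staging-table <table_name>_staging \
    --clear-staging-table \
    --export-dir <hdfs_dir>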
Sqoop Job – Purpose:
The job tool allows you to create and work with saved jobs. Saved
jobs remember the parameters used to specify a job, so they can be re-executed
by invoking the job by its handle.
If a saved job is configured to perform an incremental import,
state regarding the most recently imported rows is updated in the saved job to
allow the job to continually import only the newest rows.
Syntax:
The following is the syntax used to create a Sqoop job:
$ sqoop job (generic-args) (job-args) [-- [subtool-name] (subtool-args)]
$ sqoop-job (generic-args) (job-args) [-- [subtool-name] (subtool-args)]
The following are the job arguments for Sqoop Job.
Argument | Description
--create <job-id> | Define a new saved job with the specified job-id (name). A second Sqoop command line, separated by a --, should be specified; this defines the saved job.
--delete <job-id> | Delete a saved job.
--exec <job-id> | Given a job defined with --create, run the saved job.
--show <job-id> | Show the parameters for a saved job.
--list | List all saved jobs.
We will look at each of these one by one.
Create Job: This is the syntax to create a Sqoop job:
$ sqoop job --create <job_name> -- import --connect jdbc:mysql://<db_host>/<db_name> \
    --username <user> --password <password> \
    --table <table_name>
This creates a job named <job_name> which can be executed later. The job is not run at creation time; it is now available in the list of saved jobs.
List Job Syntax: $ sqoop job --list
We can inspect the configuration of a job with the show action.
Description of a Job: $ sqoop job --show <job_name>
Execution of a Job: $ sqoop job --exec <job_name>
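Saved jobs pair naturally with the incremental imports covered earlier, because Sqoop records the last imported value in the job's saved state and updates it after every run. A sketch, assuming a hypothetical id check column:
$ sqoop job --create <job_name> -- import --connect jdbc:mysql://<db_host>/<db_name> \
    --username <user> -P \
    --table <table_name> \
    --incremental append \
    --check-column id \
    --last-value 0
$ sqoop job --exec <job_name>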
The above are some of the core concepts of Sqoop; so far we have covered Sqoop Import, Sqoop Export, and Sqoop Jobs. In our next posts, we will cover more Sqoop concepts such as CodeGen, SQL Eval, etc. The above is mostly syntax; we will provide details on these Sqoop concepts with worked examples in upcoming posts.
We are data integration experts at Pacific Big Data, providing solutions to fit your needs.