overwrite hdfs directory Sqoop import - sqoop

Is it possible to overwrite HDFS directory automatically instead of overwriting it every time manually while Sqoop import?
(Do we have any option like "--overwrite" like we have for hive import "--hive-overwrite")

Use --delete-target-dir
​It will delete <HDFS-target-dir> provided in command before writing data to this directory.

Use this: --delete-target-dir
This will work for overwriting the hdfs directory using sqoop syntax:
$ sqoop import --connect jdbc:mysql://localhost/dbname --username username -P --table tablename --delete-target-dir --target-dir '/targetdirectorypath' -m 1
E.g:
$ sqoop import --connect jdbc:mysql://localhost/abc --username root -P --table empsqooptargetdel --delete-target-dir --target-dir '/tmp/sqooptargetdirdelete' -m 1
This command will refresh the corresponding hdfs directory or hive table data with updated/fresh data, every time this command is run.

Related

Oozie iterative workflow

I am building an application to ingest data from MYSQL DB to hive tables. App will be scheduled to execute every day.
The very first action is to read a Hive table to load import table info e.g name, type etc., and create a list of tables in a file to import. Next a Sqoop action to transfer data for each table in sequence.
Is it possible to create a shell script Oozie action which will iterate through the table list and launch oozie sub-workflow Sqoop action for each table in sequence? Could you provide some reference? Also any suggestion of a better approach!
I have come up with following shell script containing Sqoop action. It works fine with some environment variable tweaking.
hdfs_path='hdfs://quickstart.cloudera:8020/user/cloudera/workflow/table_metadata' table_temp_path='hdfs://quickstart.cloudera:8020/user/cloudera/workflow/hive_temp
if $(hadoop fs -test -e $hdfs_path)
then
for file in $(hadoop fs -ls $hdfs_path | grep -o -e "$hdfs_path/*.*");
do
echo ${file}
TABLENAME=$(hadoop fs -cat ${file});
echo $TABLENAME
HDFSPATH=$table_temp_path
sqoop import --connect jdbc:mysql://quickstart.cloudera:3306/retail_db --table departments --username=retail_dba --password=cloudera --direct -m 1 --delete-target-dir --target-dir $table_temp_path/$TABLENAME
done
fi

SQOOP export in shell script fails

I am exporting a table from hive to mysql with the help of shell script.The below is the sqoop export command
sqoop export --connect jdbc:mysql://192.168.154.129:3306/ey -username root --table call_detail_records --export-dir /apps/hive/warehouse/xademo.db/call_detail_records --fields-terminated-by '|' --lines-terminated-by '\n' --m 4 --batch
The above command works fine from the CLI. but it doesnt work from the shell script and it generates the below Warning and error.
Warning :
15/05/05 13:30:06 WARN sqoop.SqoopOptions: Character argument '|' has multiple characters; only the first will be used.
15/05/05 13:30:06 WARN sqoop.SqoopOptions: Character argument '\n' has multiple characters; only the first will be used.
Error:
15/05/05 13:30:50 INFO mapreduce.Job: map 0% reduce 0%
15/05/05 13:31:56 INFO mapreduce.Job: Task Id : attempt_1430805361424_0046_m_000001_0, Status : FAILED
Error: java.io.IOException: Can't export data, please check failed map task logs
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:112)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:784)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.RuntimeException: Can't parse input data: 'PHONE_NUM|PLAN|DATE|STAUS|BALANCE|IMEI|REGION'
at customer_details.__loadFromFields(customer_details.java:464)
at customer_details.parse(customer_details.java:382)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:83)
... 10 more
Caused by: java.util.NoSuchElementException
at java.util.ArrayList$Itr.next(ArrayList.java:834)
at customer_details.__loadFromFields(customer_details.java:434)
... 12 more
My Sqoop command in shell script will have variables which will be expanded.
nohup sqoop export --connect jdbc:mysql://192.168.154.129:3306/ey -username root --table $TBL_NAME --export-dir $HIVE_DIR --fields-terminated-by "$FIELD_SEP" --lines-terminated-by "'"'\'"$LINE_SEP""'" --m $NUM_MAPPERS --batch > $sqoop_outs/$TBL_NAME.out 2>&1 &
Any help is highly appreciated.
I am struggling with this for long time...
Atlast i found the reason, it is the disparate treatment of " and ' in the SQOOP command when i run from the CLI and Shell script.
Solution :
I had to change in my shell script as follows
nohup sqoop export --connect jdbc:mysql://192.168.154.129:3306/ey -username root --table $TBL_NAME --export-dir $HIVE_DIR --fields-terminated-by "$FIELD_SEP" --lines-terminated-by '\'"$LINE_SEP" --m $NUM_MAPPERS --batch > $sqoop_outs/$TBL_NAME.out 2>&1 &
which will issue the SQOOP command as follows, but it worked fine
sqoop export --connect jdbc:mysql://192.168.154.129:3306/ey -username root --table call_detail_records --export-dir /apps/hive/warehouse/xademo.db/call_detail_records --fields-terminated-by | --lines-terminated-by \n --m 4 --batch
This is for import
When you run the sqoop command from cli, the arguments to the option should have ', on the other hand when you run from oozie it should not be enclosed within the single qoute '.
I was using sqoop fro, oozie with the following arguments:
<arg>--fields-terminated-by</arg>
<arg>'\001'</arg>
<arg>--null-string</arg>
<arg>'\\N'</arg>
<arg>--null-non-string</arg>
<arg>'\\N'</arg>
The above code didn't work as expected, but the below code did
<arg>--fields-terminated-by</arg>
<arg>\001</arg>
<arg>--null-string</arg>
<arg>\\N</arg>
<arg>--null-non-string</arg>
<arg>\\N</arg>

How to import/export hbase data via hdfs (hadoop commands)

I have saved my crawled data by nutch in Hbase whose file system is hdfs. Then I copied my data (One table of hbase) from hdfs directly to some local directory by command
hadoop fs -CopyToLocal /hbase/input ~/Documents/output
After that, I copied that data back to another hbase (other system) by following command
hadoop fs -CopyFromLocal ~/Documents/input /hbase/mydata
It is saved in hdfs and when I use list command in hbase shell, it shows it as another table i.e 'mydata' but when I run scan command, it says there is no table with 'mydata' name.
What is problem with above procedure?
In simple words:
I want to copy hbase table to my local file system by using a hadoop command
Then, I want to save it directly in hdfs in another system by hadoop command
Finally, I want the table to be appeared in hbase and display its data as the original table
If you want to export the table from one hbase cluster and import it to another, use any one of the following method:
Using Hadoop
Export
$ bin/hadoop jar <path/to/hbase-{version}.jar> export \
<tablename> <outputdir> [<versions> [<starttime> [<endtime>]]
NOTE: Copy the output directory in hdfs from the source to destination cluster
Import
$ bin/hadoop jar <path/to/hbase-{version}.jar> import <tablename> <inputdir>
Note: Both outputdir and inputdir are in hdfs.
Using Hbase
Export
$ bin/hbase org.apache.hadoop.hbase.mapreduce.Export \
<tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]
Copy the output directory in hdfs from the source to destination cluster
Import
$ bin/hbase org.apache.hadoop.hbase.mapreduce.Import <tablename> <inputdir>
Reference: Hbase tool to export and import
If you can use the Hbase command instead to backup hbase tables you can use the Hbase ExportSnapshot Tool which copies the hfiles,logs and snapshot metadata to other filesystem(local/hdfs/s3) using a map reduce job.
Take snapshot of the table
$ ./bin/hbase shell
hbase> snapshot 'myTable', 'myTableSnapshot-122112'
Export to the required file system
$ ./bin/hbase class org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot MySnapshot -copy-to fs://path_to_your_directory
You can export it back from the local file system to hdfs:///srv2:8082/hbase and run the restore command from hbase shell to recover the table from the snapshot.
$ ./bin/hbase shell
hbase> disable 'myTable'
hbase> restore_snapshot 'myTableSnapshot-122112'
Reference:Hbase Snapshots

Sqoop Import Create File Name with Date

I am working on a Sqoop script which I want to create the target directory with the current date. Do we have some options in Sqoop like --target-dir /dir1/$DATE. If so, what is the exact syntax?
You can't directly add $DATE to sqoop but
You can use shell script and pass the parameters in the shell script For e.g.
# -----------myscript.sh------------------
DATE=`date`
echo
sqoop import --connect jdbc:db2://localhost:<PORT_NUMBER>/<DB> --table TABLE_NAME --username user -password pass -m 1 --target-dir /user/$DATE
#------------end script----------------------
Now
add permission to script file
chmod 777 myscript.sh
Run the script file
./myscript.sh

Execute multiple sqoop commands from a file

I have multiple sqoop commands, and I want to execute them sequentially. How can I do this.
Currently, --options-file allows us to execute one command at a time.
Use shell script. Write commands one by one and execute the script.It will definitely work.
#!/bin/bash
echo "*************SQOOP IMPORT JOB UTILITY*******************"
# First Sqoop command
echo
sqoop import --connect jdbc:db2://localhost:<PORT_NUMBER>/<DB> --table TABLE_NAME_1 --username user -password pass -m 1 2> log1.txt
# Second Sqoop command
echo
sqoop import --connect jdbc:db2://localhost:<PORT_NUMBER>/<DB> --table TABLE_NAME_2 --username user -password pass -m 1 2> log2.txt
echo "Check log file for sqoop jobs status"
Run shell script
./myscript.sh
I am not sure if that is possible only with Sqoop but for my case i have used Oozie to execute multiple Sqoop commands.

Resources