Oozie iterative workflow - hadoop

I am building an application to ingest data from a MySQL DB into Hive tables. The app will be scheduled to execute every day.
The very first action reads a Hive table to load the import table info (e.g. name, type) and writes the list of tables to import into a file. Next, a Sqoop action transfers the data for each table in sequence.
Is it possible to create a shell-script Oozie action that iterates through the table list and launches an Oozie sub-workflow Sqoop action for each table in sequence? Could you provide some reference? Any suggestion of a better approach is also welcome.

I have come up with the following shell script containing the Sqoop command. It works fine with some environment-variable tweaking.
hdfs_path='hdfs://quickstart.cloudera:8020/user/cloudera/workflow/table_metadata'
table_temp_path='hdfs://quickstart.cloudera:8020/user/cloudera/workflow/hive_temp'
if hadoop fs -test -e "$hdfs_path"
then
for file in $(hadoop fs -ls "$hdfs_path" | grep -o -e "$hdfs_path/*.*")
do
echo "$file"
# each metadata file holds one table name
TABLENAME=$(hadoop fs -cat "$file")
echo "$TABLENAME"
sqoop import --connect jdbc:mysql://quickstart.cloudera:3306/retail_db --table "$TABLENAME" --username=retail_dba --password=cloudera --direct -m 1 --delete-target-dir --target-dir "$table_temp_path/$TABLENAME"
done
fi
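
If the shell action should launch an Oozie sub-workflow per table instead of calling sqoop directly, a minimal sketch of that loop could look like the following. It assumes a separate Sqoop workflow deployed at a hypothetical application path (SUBWF_PATH) that reads a tableName property; the Oozie URL, jobTracker address, and property names are placeholders.
OOZIE_URL='http://quickstart.cloudera:11000/oozie'
SUBWF_PATH='hdfs://quickstart.cloudera:8020/user/cloudera/workflow/sqoop_subwf'
for file in $(hadoop fs -ls "$hdfs_path" | grep -o -e "$hdfs_path/*.*")
do
TABLENAME=$(hadoop fs -cat "$file")
# write a per-table properties file for the sub-workflow run
cat > /tmp/job_"$TABLENAME".properties <<EOF
nameNode=hdfs://quickstart.cloudera:8020
jobTracker=quickstart.cloudera:8032
oozie.wf.application.path=$SUBWF_PATH
tableName=$TABLENAME
EOF
# -run submits and starts the sub-workflow; poll 'oozie job -info <job-id>' if each table must finish before the next one is submitted
oozie job -oozie "$OOZIE_URL" -config /tmp/job_"$TABLENAME".properties -run
done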

Related

Hive move & restore some partitions

I have a 50 TB managed Hive table, partitioned by date, from which I want to move some old partitions to an external HDD so that I can restore them later if required.
The scripts are as follows:
Move out:
$ hdfs dfs -get ${HIVE_WAREHOUSE_TABLE_PATH}/ingest_date=2016-01-01 ${LOCAL_TABLE_PATH}/2016-01-01
$ hdfs dfs -rm -r -skipTrash ${HIVE_WAREHOUSE_TABLE_PATH}/ingest_date=2016-01-01
$ hive -e "ALTER TABLE ${TABLE} DROP IF EXISTS PARTITION (ingestion_date='2016-01-01') PURGE;"
Restore:
$ hdfs dfs -put ${LOCAL_TABLE_PATH}/2016-01-01 ${HIVE_WAREHOUSE_TABLE_PATH}/ingest_date=2016-01-01
$ hive -e "ALTER TABLE ${TABLE} ADD PARTITION (ingest_date='2016-01-01') LOCATION ${HIVE_WAREHOUSE_TABLE_PATH}/ingest_date=2016-01-01;"
Am I missing something in the above strategy?
I have tried:
$ hive --hivevar local_path=${LOCAL_TABLE_PATH} -e "EXPORT TABLE myDatabase.theTable PARTITION (ingest_date='2016-01-01') to '${local_path}/2016-01-01';"
but this takes too long to copy a year's worth of partitions, and I am trying to avoid that.
Thank you,
Gee
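
For reference, here is a minimal sketch that wraps the move-out steps for a list of dates, under the same assumptions as the scripts above (HIVE_WAREHOUSE_TABLE_PATH, LOCAL_TABLE_PATH, and TABLE come from the question; the date list is hypothetical):
#!/bin/bash
# archive each listed partition to the local/external disk, then drop it from Hive
for d in 2016-01-01 2016-01-02 2016-01-03
do
hdfs dfs -get "${HIVE_WAREHOUSE_TABLE_PATH}/ingest_date=${d}" "${LOCAL_TABLE_PATH}/${d}" || exit 1
hdfs dfs -rm -r -skipTrash "${HIVE_WAREHOUSE_TABLE_PATH}/ingest_date=${d}"
hive -e "ALTER TABLE ${TABLE} DROP IF EXISTS PARTITION (ingest_date='${d}') PURGE;"
done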

Overwrite HDFS directory on Sqoop import

Is it possible to overwrite the HDFS directory automatically during a Sqoop import, instead of deleting it manually every time?
(Do we have any option like "--overwrite", as we have "--hive-overwrite" for Hive imports?)
Use --delete-target-dir
It will delete the <HDFS-target-dir> provided in the command before writing data to that directory.
This will work for overwriting the HDFS directory using Sqoop syntax:
$ sqoop import --connect jdbc:mysql://localhost/dbname --username username -P --table tablename --delete-target-dir --target-dir '/targetdirectorypath' -m 1
E.g.:
$ sqoop import --connect jdbc:mysql://localhost/abc --username root -P --table empsqooptargetdel --delete-target-dir --target-dir '/tmp/sqooptargetdirdelete' -m 1
Every time it is run, this command will refresh the corresponding HDFS directory (or Hive table) with fresh data.

How to import/export hbase data via hdfs (hadoop commands)

I have saved my data crawled by Nutch in HBase, whose file system is HDFS. Then I copied my data (one HBase table) from HDFS directly to a local directory with the command
hadoop fs -copyToLocal /hbase/input ~/Documents/output
After that, I copied the data back to another HBase instance (on another system) with the following command
hadoop fs -copyFromLocal ~/Documents/input /hbase/mydata
It is saved in HDFS, and when I use the list command in the HBase shell it shows up as another table, i.e. 'mydata', but when I run the scan command it says there is no table named 'mydata'.
What is the problem with the above procedure?
In simple words:
I want to copy an HBase table to my local file system using a Hadoop command.
Then, I want to save it directly into HDFS on another system using a Hadoop command.
Finally, I want the table to appear in HBase and display its data like the original table.
Copying the raw files under /hbase does not register the table in HBase's metadata, which is why the copied data is not usable as a table. If you want to export a table from one HBase cluster and import it into another, use any one of the following methods:
Using Hadoop
Export
$ bin/hadoop jar <path/to/hbase-{version}.jar> export \
<tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]
NOTE: Copy the output directory in HDFS from the source to the destination cluster.
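A sketch of that copy step using DistCp (the NameNode addresses and the export directory path are placeholders):
$ hadoop distcp hdfs://source-namenode:8020/user/cloudera/export_dir hdfs://dest-namenode:8020/user/cloudera/export_dir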
Import
$ bin/hadoop jar <path/to/hbase-{version}.jar> import <tablename> <inputdir>
Note: Both outputdir and inputdir are in HDFS.
Using HBase
Export
$ bin/hbase org.apache.hadoop.hbase.mapreduce.Export \
<tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]
Copy the output directory in HDFS from the source to the destination cluster.
Import
$ bin/hbase org.apache.hadoop.hbase.mapreduce.Import <tablename> <inputdir>
Reference: HBase tool to export and import
If you can use HBase commands instead to back up HBase tables, you can use the HBase ExportSnapshot tool, which copies the HFiles, logs, and snapshot metadata to another filesystem (local/HDFS/S3) using a MapReduce job.
Take snapshot of the table
$ ./bin/hbase shell
hbase> snapshot 'myTable', 'myTableSnapshot-122112'
Export to the required file system
$ ./bin/hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot 'myTableSnapshot-122112' -copy-to fs://path_to_your_directory
You can export it back from the local file system to hdfs://srv2:8082/hbase and then run the restore command from the HBase shell to recover the table from the snapshot (a sketch of the copy-back step follows the shell commands below).
$ ./bin/hbase shell
hbase> disable 'myTable'
hbase> restore_snapshot 'myTableSnapshot-122112'
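For the copy back mentioned above, a hedged sketch, assuming your HBase version's ExportSnapshot supports -copy-from (the local path and cluster address are placeholders); run it before the restore_snapshot above:
$ ./bin/hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot 'myTableSnapshot-122112' -copy-from file:///path_to_your_directory -copy-to hdfs://srv2:8082/hbase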
Reference: HBase Snapshots

Sqoop Import Create File Name with Date

I am working on a Sqoop script in which I want to create the target directory with the current date. Do we have an option in Sqoop like --target-dir /dir1/$DATE? If so, what is the exact syntax?
You can't directly add $DATE to Sqoop, but
you can use a shell script and set the date there. For example:
# -----------myscript.sh------------------
# format the date so the directory name contains no spaces
DATE=$(date +%Y-%m-%d)
sqoop import --connect jdbc:db2://localhost:<PORT_NUMBER>/<DB> --table TABLE_NAME --username user -password pass -m 1 --target-dir /user/$DATE
#------------end script----------------------
Now add execute permission to the script file:
chmod 777 myscript.sh
Then run the script:
./myscript.sh

Execute multiple sqoop commands from a file

I have multiple Sqoop commands, and I want to execute them sequentially. How can I do this?
Currently, --options-file allows us to execute only one command at a time.
Use a shell script. Write the commands one by one and execute the script. It will definitely work.
#!/bin/bash
echo "*************SQOOP IMPORT JOB UTILITY*******************"
# First Sqoop command
echo
sqoop import --connect jdbc:db2://localhost:<PORT_NUMBER>/<DB> --table TABLE_NAME_1 --username user -password pass -m 1 2> log1.txt
# Second Sqoop command
echo
sqoop import --connect jdbc:db2://localhost:<PORT_NUMBER>/<DB> --table TABLE_NAME_2 --username user -password pass -m 1 2> log2.txt
echo "Check log file for sqoop jobs status"
Run the shell script:
./myscript.sh
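If the commands already live in a text file (one full sqoop command per line), a small loop sketch like the following would also run them in sequence; sqoop_commands.txt is a hypothetical file name:
#!/bin/bash
# run each line of sqoop_commands.txt as a command, stopping on the first failure
while IFS= read -r cmd
do
echo "Running: $cmd"
eval "$cmd" || { echo "Failed: $cmd"; exit 1; }
done < sqoop_commands.txt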
I am not sure if that is possible with Sqoop alone, but in my case I have used Oozie to execute multiple Sqoop commands.
