My use case:
Hourly tables get created in a MySQL database every day. I need to move them to HDFS every day using Sqoop and process the HDFS data with Impala.
How do I write a shell script or job that periodically moves only the newly created tables' data to HDFS (into the existing file system)?
Say today is 3 Jan 2016: when I run my job today, the data for 2 Jan 2016 should be moved from MySQL to HDFS. Likewise, every day it should move the previous day's data.
Daily I need to run my Impala queries on this HDFS cluster and generate a report.
How to process this whole data using Impala and generate a report?
Sqoop supports incremental import: http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_incremental_imports
Sqoop import can use either a last-modified timestamp or an always-increasing row ID to decide which rows to import. You need to provide a --last-value parameter; you could store the last value between jobs, or perhaps retrieve it from your Impala database before running the job.
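A minimal sketch of the "store the last value between jobs" idea, assuming a hypothetical orders table with an auto-increment id column; the state-file path, table name, and JDBC URL are all placeholders, and the sqoop command is printed rather than executed so you can review it first:

```shell
#!/bin/bash
# Sketch: persist --last-value between daily runs in a local state file.
# Table name, paths, and the JDBC URL below are illustrative placeholders.
STATE_FILE=${STATE_FILE:-/tmp/orders.last_value}
LAST_VALUE=$(cat "$STATE_FILE" 2>/dev/null || echo 0)   # defaults to 0 on the first run

# Print the incremental import command (run it directly once the
# connection details are filled in for your environment).
echo "sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --username myuser --password-file /user/me/.password \
  --table orders \
  --incremental append --check-column id \
  --last-value $LAST_VALUE \
  --target-dir /data/orders"

# After a successful import, record the new high-water mark that Sqoop
# reports in its log output, e.g.: echo "$NEW_VALUE" > "$STATE_FILE"
```

Sqoop itself prints the next --last-value at the end of an incremental run, which is what you would write back into the state file.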
The best way would be a shell script that takes two arguments: the table name and the HDFS path, since those are the only two factors that change in your scenario. Below is a sample script that you can save as a .sh file and run in bash.
#!/bin/bash
TABLENAME=$1
HDFSPATH=$2
NOW=$(date +"%m-%d-%Y-%H-%M-%S")
sqoop import --connect jdbc:db2://mystsrem:60000/SCHEMA \
--username username \
--password-file password \
--query "select * from ${TABLENAME} where \$CONDITIONS" \
-m 1 \
--delete-target-dir \
--target-dir ${HDFSPATH} \
--fetch-size 30000 \
--class-name ${TABLENAME} \
--fields-terminated-by '\01' \
--lines-terminated-by '\n' \
--escaped-by '\' \
--verbose &> logonly/${TABLENAME}_import_${NOW}.log
Optional: if you need to import into a Hive table, add:
--hive-drop-import-delims \
--hive-import \
--hive-overwrite \
--hive-table HiveSchema.${TABLENAME}
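Since the original question asks for the previous day's data, the script could also derive yesterday's date and use it as a partition suffix on the HDFS path, so each daily run lands in its own directory. A sketch (GNU date syntax; the path layout is a made-up example):

```shell
#!/bin/bash
# Sketch: compute yesterday's date and append it to the HDFS target path
# so each daily run writes to its own directory.
# GNU date syntax; on BSD/macOS use: date -v-1d +%Y-%m-%d
YESTERDAY=$(date -d "yesterday" +%Y-%m-%d)
TABLENAME=$1
HDFSPATH=$2/dt=${YESTERDAY}
echo "Importing ${TABLENAME} into ${HDFSPATH}"
```

The query in the main script could then also filter on that date, keeping each day's load restricted to the previous day's rows.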
Related
So I'm trying to run import-all-tables into a Hive DB, i.e. /user/hive/warehouse/... on HDFS, using the command below:
sqoop import-all-tables --connect "jdbc:sqlserver://<servername>;database=<dbname>" \
--username "<username>" \
--password "<password>" \
--warehouse-dir "/user/hive/warehouse/" \
--hive-import \
-m 1
In the test database I have 3 tables. When the MapReduce job runs, the output reports success,
i.e. the job is 100% complete, but the files are not found in the Hive DB.
It's basically getting overwritten by the last table; try removing the forward slash at the end of the directory path. For testing I would suggest not using the warehouse directory; use something like '/tmp/sqoop/allTables' instead.
There is another way:
1. Create a Hive database pointing to a location, say "targetLocation".
2. Create an HCatalog table in your Sqoop import using the previously created database.
3. Use the target-dir import option to point to that targetLocation.
You don't need to define a warehouse directory; just define the Hive database and it will automatically find the working directory.
sqoop import-all-tables --connect "jdbc:sqlserver://xxx.xxx.x.xxx:xxxx;databaseName=master" --username xxxxxx --password xxxxxxx --hive-import --create-hive-table --hive-database test -m 1
It will just run like a rocket.
Hope it works for you.
How can I edit/change an existing sqoop job?
I can't find any documentation related to editing an existing sqoop job.
Please assist.
Sqoop 1 has no way to edit an existing job, but Sqoop 2 jobs can be modified.
With Sqoop 1, you should first show the job:
bin/sqoop job --show your-sync-job
Note down its configuration items, then delete it:
bin/sqoop job --delete your-sync-job
and recreate it with the changed settings:
sqoop job --create sqooptest -- import --connect jdbc:mysql://10.10.209.224:3306/sqoop --table userinfo --username sqoop --password "1234" --incremental append --check-column id --last-value 1 --fields-terminated-by '$' --target-dir '/sqoop/userinfo/import2hdfs1'
You have to first delete the job and then create the job again.
Caution: sqoop job --delete jobname will also delete the metastore information of the job, so be careful while doing that.
Once you have created a sqoop job, you can always override its parameters in the next execution of that job
For example you can create a generic job like so
sqoop job --create hiveLoad --meta-connect jdbc:hsqldb:hsql://10.113.57.47:16000/sqoop -- import --connect jdbc:oracle:thin:prasads@10.113.59.5:1521/ora11g --username user -P --table DATABASE.TABLE --incremental append --check-column COL_VALUE --last-value '300' -m 1 --target-dir '/user/somewhere' --append
And while executing hiveLoad, you can execute it like so
sqoop job --exec hiveLoad --meta-connect jdbc:hsqldb:hsql://10.113.57.47:16000/sqoop -- --username <username> --password <password> --table <database>.<table> --incremental append --check-column <column> --target-dir '/path/to/hdfs'
where you replace values in <> with your intended values.
However, changing the definition of an existing sqoop job is not possible; the overriding facility mitigates this, I believe.
The job will take the configuration for the rest of the unspecified values from the original definition.
Check out the Apache Sqoop Cookbook. It's a great resource and covers almost all possible use cases.
Sqoop does not provide an edit option; you have to delete the job and create it again to match your requirements.
I am new to Hadoop and have recently started working with Sqoop. While trying to export a table from Hadoop to SQL Server, I am getting the following error:
input path does not exist hdfs://sandbox:8020/user/root/
The command I am using is:
sqoop export --connect "jdbc:sqlserver://;username=;password=xxxxx;database=" --table --export-dir /user/root/ --input-fields-terminated-by " "
Could you please guide me on what I am missing here?
Also, could you please let me know the command to navigate to the Hadoop directory where the tables are stored?
For a proper Sqoop export, Sqoop requires the complete data file location; you can't just specify the root folder.
Try specifying the complete source path:
sqoop export --connect jdbc:oracle:thin:<>:1521/<> --username <> --password <> --table <> --export-dir hdfs://<>/user/<>/<> -m 1 --input-fields-terminated-by '|' --input-null-string '\\N' --input-null-non-string '\\N'
Hope this helps
I run a command which exports data from my HDFS to MySQL.
But I want to insert data into particular columns at run time, while running the export command.
Is this possible?
Or if not, is there any work-around to achieve this?
My command would be like this:
bin/sqoop export --connect jdbc:mysql://<my-ip>/test --username uname --password pwd --table <table-name> --export-dir /MR/part-r-00000 --input-fields-terminated-by ',' --verbose -m 1
(Here I want to supply data for certain columns).
I believe you can take advantage of the --columns parameter to specify the subset of the table's columns that are present in the input files.
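A hedged sketch of what that export could look like; the column names (name, city), table, and connection details are made up for illustration, and the command is printed as a dry run rather than executed:

```shell
#!/bin/bash
# Sketch: export only selected columns with --columns. The column names,
# table, and connection details are illustrative placeholders.
COLUMNS="name,city"
echo "sqoop export \
  --connect jdbc:mysql://dbhost/test \
  --username uname --password pwd \
  --table mytable \
  --columns $COLUMNS \
  --export-dir /MR/part-r-00000 \
  --input-fields-terminated-by ',' -m 1"
```

The fields in each input file record must then match the listed columns in the same order; columns not listed must be nullable or have defaults in MySQL.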
We are using Cloudera CDH 4 and we are able to import tables from our Oracle databases into our HDFS warehouse as expected. The problem is we have tens of thousands of tables inside our databases, and Sqoop only supports importing one table at a time.
What options are available for importing multiple tables into HDFS or Hive? For example, what would be the best way to import 200 tables from Oracle into HDFS or Hive at a time?
The only solution I have seen so far is to create a sqoop job for each table import and then run them all individually. Since Hadoop is designed to work with large datasets, it seems like there should be a better way, though.
You can use the import-all-tables option to load all tables into HDFS at one time.
sqoop import-all-tables --connect jdbc:mysql://localhost/sqoop --username root --password hadoop --target-dir '/Sqoop21/AllTables'
If you want to exclude some tables from being loaded into HDFS, you can use the --exclude-tables option.
Ex:
sqoop import-all-tables --connect jdbc:mysql://localhost/sqoop --username root --password hadoop --target-dir '/Sqoop21/AllTables' --exclude-tables <table1>,<tables2>
If you want to store the output in a specific directory, you can use the --warehouse-dir option.
Ex:
sqoop import-all-tables --connect jdbc:mysql://localhost/sqoop --username root --password hadoop --warehouse-dir '/Sqoop'
Assuming that the sqoop configuration for each table is the same, you can list all the tables you need to import and then iterate over them launching sqoop jobs (ideally launch them asynchronously). You can run the following to fetch the list of tables from Oracle:
SELECT owner, table_name FROM dba_tables;
Sqoop does offer an option to import all tables, though there are some limitations.
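The iteration could be sketched like this, reading one OWNER.TABLE pair per line from a file saved off that query. The file contents, JDBC URL, and credentials below are placeholders, and the command is echoed so it can be reviewed before running (append & to the real command to launch imports in parallel):

```shell
#!/bin/bash
# Hypothetical table list, one OWNER.TABLE per line (e.g. saved from the
# dba_tables query); in practice this would come from the database.
cat > /tmp/tables.txt <<'EOF'
SCOTT.EMP
SCOTT.DEPT
EOF

# Sketch: one sqoop import per table; connection details are placeholders
# and the command is echoed rather than executed.
while IFS=. read -r OWNER TABLE; do
  echo "sqoop import --connect jdbc:oracle:thin:@dbhost:1521/orcl \
    --username myuser --password-file /user/me/.password \
    --table ${OWNER}.${TABLE} --target-dir /data/${TABLE} -m 1"
done < /tmp/tables.txt
```

Splitting on the dot with IFS=. gives the schema owner and table name separately, which Sqoop needs combined as OWNER.TABLE for Oracle.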
Alternatively, modify the Sqoop source code and recompile it to your needs; the codebase is well documented and nicely arranged.
--target-dir is not a valid option when using import-all-tables.
To import all tables into a particular directory, use --warehouse-dir instead of --target-dir.
Example:
$ sqoop import-all-tables --connect jdbc:mysql://localhost/movies --username root --password xxxxx --warehouse-dir '/user/cloudera/sqoop/allMoviesTables' -m 1
The best option is a shell script:
1. Prepare an input file which has a list of DBNAME.TABLENAME entries.
2. The shell script takes this file as input, iterates line by line, and executes a sqoop statement for each line.
while read -r line;
do
DBNAME=`echo $line | cut -d'.' -f1`
tableName=`echo $line | cut -d'.' -f2`
sqoop import -Dmapreduce.job.queuename=$QUEUE_NAME --connect "$JDBC_URL;databaseName=$DBNAME;username=$USERNAME;password=$PASSWORD" --table $tableName --target-dir $DATA_COLLECTOR/$tableName --fields-terminated-by '\001' -m 1
done < inputFile
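The input file for the loop above would hold one DBNAME.TABLENAME per line, and the cut-based split can be checked in isolation; the file contents and the sales.orders sample line here are made-up examples:

```shell
#!/bin/bash
# Hypothetical input file for the loop above: one DBNAME.TABLENAME per line.
cat > /tmp/inputFile <<'EOF'
sales.orders
sales.customers
hr.employees
EOF

# The same cut-based split the script uses, shown on one sample line:
line="sales.orders"
DBNAME=$(echo "$line" | cut -d'.' -f1)      # -> sales
tableName=$(echo "$line" | cut -d'.' -f2)   # -> orders
echo "database=$DBNAME table=$tableName"
```

Variables like $JDBC_URL, $USERNAME, $PASSWORD, $QUEUE_NAME, and $DATA_COLLECTOR would be exported (or defined at the top of the script) before the loop runs.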
You can probably import multiple tables : http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_literal_sqoop_import_all_tables_literal
You can use Sqoop's import-all-tables feature to import all the tables in the database. It also takes an --exclude-tables parameter, with which you can exclude some of the tables that you don't want to import from the database.
Note: --exclude-tables only works with the import-all-tables command.
If the number of tables is very small, you can import multiple tables with Sqoop by creating a separate import for each table, as below.
sqoop import --connect jdbc:mysql://localhost/XXXX --username XXXX --password XXXX --table XXTABLE_1XX
sqoop import --connect jdbc:mysql://localhost/XXXX --username XXXX --password XXXX --table XXTABLE_2XX
and so on.
But what if the number of tables is 100, 1000, or even more? The ideal solution in such a scenario is to prepare a shell script which takes its input from a text file containing the list of table names to be imported, iterates over it, and runs a sqoop import job for each table.
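A sketch of that approach, assuming a plain text file with one table name per line; the list contents and connection details are placeholders, and the sqoop command is echoed so the loop can be tested without a cluster:

```shell
#!/bin/bash
# Sketch: read table names from a list and emit one sqoop import per
# table; drop the echo and fill in real connection details to run.
TABLE_LIST=${1:-/tmp/table_list.txt}
printf 'customers\norders\npayments\n' > "$TABLE_LIST"   # hypothetical list

while read -r TABLE; do
  [ -z "$TABLE" ] && continue    # skip blank lines
  echo "sqoop import --connect jdbc:mysql://localhost/mydb \
    --username root --password-file /user/me/.password \
    --table $TABLE --target-dir /data/$TABLE -m 1"
done < "$TABLE_LIST"
```

Each table's data lands in its own target directory named after the table, which keeps the downstream layout predictable.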