Adding column in sqoop import - hadoop

While importing data from a SQL database through Sqoop, is it possible to add a new column and insert a timestamp into that column?
Is it possible in any other way before the data gets into HDFS?

You can use the --query parameter of the sqoop command and add a SQL function to the query that returns the current timestamp.
Example: importing the stud table from MySQL, which has rollnum and name columns.
sqoop import --connect jdbc:mysql://localhost:3306/test --driver com.mysql.jdbc.Driver --username root --query 'select name, rollnum, current_timestamp from stud where $CONDITIONS' --target-dir '/tmp/stud1' --split-by rollnum
Note the MySQL current_timestamp function used in the query.

Related

Is it possible to import a table with sqoop and add an extra timestamp column?

Is it possible to use the sqoop command "import table" to import a table from an Oracle database into a Hadoop cluster and add an extra column with the current timestamp (for troubleshooting purposes)? So far, I have the following command:
sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true --connect jdbc:oracle:thin:@//MY_ORACLE_SERVER --username USERNAME --password PASSWORD --target-dir /MyDIR --fields-terminated-by '\b' --table SOURCE_TABLE --hive-table DESTINATION_TABLE --hive-import --hive-overwrite --hive-delims-replacement '<newline>'
I would like to add a timestamp column to the table so that I know when that data was loaded. Is it possible?
Thanks in advance
You can use a free-form query import instead of a table import and call the timestamp function in the query. A free-form query must contain WHERE $CONDITIONS and needs either a --split-by column or a single mapper (-m 1):
sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true --connect jdbc:oracle:thin:@//MY_ORACLE_SERVER --username USERNAME --password PASSWORD --target-dir /MyDIR --fields-terminated-by '\b' --query 'SELECT a.*, systimestamp FROM SOURCE_TABLE a WHERE $CONDITIONS' -m 1 --hive-table DESTINATION_TABLE --hive-import --hive-overwrite --hive-delims-replacement '<newline>'
You could also use sysdate instead of systimestamp (smaller datatype, but less precision).
Alternatively, create a temporary Hive table with Sqoop, then create a new Hive table from it with the extra columns you need.
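A rough sketch of that second step, assuming the Sqoop import has already landed in a staging Hive table. The staging table name and the load_ts column name are made up for illustration, and current_timestamp assumes Hive 1.2 or later. The script only prints the hive call unless HIVE is set.

```shell
#!/bin/sh
# Dry run by default: with HIVE unset the statement is printed, not executed.
# Set HIVE=hive in the environment to run it for real.
HIVE=${HIVE:-echo hive}

STAGING_TABLE=staging_source_table   # hypothetical: table filled by sqoop --hive-import
FINAL_TABLE=DESTINATION_TABLE        # table that gets the extra column

# CREATE TABLE ... AS SELECT copies every staged column and appends the load time.
HQL="CREATE TABLE ${FINAL_TABLE} AS SELECT s.*, current_timestamp AS load_ts FROM ${STAGING_TABLE} s"

$HIVE -e "$HQL"
```

Note that the timestamp here is the time the Hive statement runs, not the time Sqoop read the rows from the source database.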

incremental "lastmodified" not working in sqoop

I'm trying to use Sqoop to perform an incremental import from a Teradata DB into Hive. Below is the command:
sqoop import --connect jdbc:teradata://xxx.xxx.x.xx/DATABASE=DBN --driver com.teradata.jdbc.TeraDriver --username userN --password pass --query "SELECT alias.colA, alias.call_date, alias.colB, alias.colC FROM tableName alias where \$CONDITIONS" --target-dir /apps/hive/warehouse/staging.db/tableName -m 26 --check-column call_date --incremental append --split-by alias.colA --last-value '2016-02-01'
The column call_date is of DATE type, values in the format 'YYYY-MM-DD'.
When I use 'append' for --incremental, everything works fine. But when I put 'lastmodified', the following error is thrown:
ERROR util.SqlTypeMap: It seems like you are looking up a column that does not
ERROR util.SqlTypeMap: exist in the table. Please ensure that you've specified
ERROR util.SqlTypeMap: correct column names in Sqoop options.
ERROR tool.ImportTool: Imported Failed: column not found: call_date
I'm using Sqoop 1.4.4.2.1 on HDP 2.1; the Teradata DB is 14.10.
Any pointers will be helpful.
I think that with --query you can perform the last-value check in the query itself, something like this:
"SELECT alias.colA, alias.call_date, alias.colB, alias.colC FROM tableName alias where alias.call_date > '2016-02-01' and \$CONDITIONS"
Reference (see the section Incrementally Updating Data in Hive > 1. Ingest the data):
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/content/incrementally-updating-hive-table-with-sqoop-and-ext-table.html
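Here is a minimal sketch of that idea as a script, so the lower bound is not hard-coded. It reuses the connection details from the question; taking the last value from the first script argument, using -P instead of --password, and adding --append are my assumptions, not something the question requires. The script only prints the command unless SQOOP is set.

```shell
#!/bin/sh
# Dry run by default: with SQOOP unset the command is printed, not executed.
# Set SQOOP=sqoop in the environment to run it for real.
SQOOP=${SQOOP:-echo sqoop}

# Lower bound for this run: first script argument, else the value from the question.
LAST_VALUE=${1:-2016-02-01}

# \$CONDITIONS stays literal inside double quotes so that Sqoop can substitute it.
QUERY="SELECT alias.colA, alias.call_date, alias.colB, alias.colC FROM tableName alias WHERE alias.call_date > '${LAST_VALUE}' AND \$CONDITIONS"

# -P prompts for the password; --append adds files to an existing target directory.
$SQOOP import \
  --connect jdbc:teradata://xxx.xxx.x.xx/DATABASE=DBN \
  --driver com.teradata.jdbc.TeraDriver \
  --username userN -P \
  --query "$QUERY" \
  --target-dir /apps/hive/warehouse/staging.db/tableName \
  --append \
  --split-by alias.colA -m 26
```

With this approach Sqoop does no incremental bookkeeping of its own, so you have to keep track of the value to pass on the next run yourself.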

Sqoop incremental lastmodified

I have an accounts table in a MySQL DB with around 19654 records. I used Sqoop to import the table data into HDFS. It created four files in HDFS with the data evenly distributed.
Then I executed the SQL statement below on the DB:
update accounts set modified = now() where acct_num in (1,2,3,4) ;
Then I ran the Sqoop command below:
sqoop import --table accounts --connect jdbc:mysql://localhost/loudacre \
--username training --password training \
--incremental lastmodified \
--check-column modified --last-value '2014-03-18 13:29:47.0' \
--merge-key acct_num --target-dir /accounts/
After it completed, it had created only one file with only around 10 entries. It does not even include the new timestamp value.
I was just trying to update the rows that have a new timestamp. Can anyone help?

Sqoop job incremental import using free form query

I am trying to create a Sqoop job for an incremental import using a free-form query. Here's the command being used:
sqoop job --create importjobinl -- import --connect jdbc:mysql://localhost/test --username training --password training --query 'select id,name,unix_timestamp(time_updated) from intest where $CONDITIONS' --target-dir /user/new/lll/`date +%d%T|sed 's/://g'` -m 1 --check-column time_updated --incremental append --last-value '1441526438'
The job is not getting created. It shows:
Incremental imports require a table.
Try --help for usage instructions.
It works when I use --table intest instead of --query, but I want to use --query to convert the date to epoch time with unix_timestamp, since the value in the MySQL table intest is in yyyy-mm-dd format.
Version used: Sqoop 1.2.0-cdh3u0
Incremental imports for free-form queries were added in Sqoop 1.4.2.
JIRA link: Sqoop Incremental import Support for free form queries
Since you are using Sqoop 1.2.0, this feature might not be available to you.
Do an initial pull using Sqoop.
Make sure the date format of your column is YYYY-MM-DD HH:MM:SS if you are using the last-modified column as a date.
Run the statement below for an incremental load into your Hive table using a free-form query. Inside double quotes, $CONDITIONS has to be escaped as \$CONDITIONS.
sqoop import --connect jdbc:mysql://localhost/test --username training --password training --query "select * from intest where \$CONDITIONS" --hive-import --hive-table db_name_x.table_name_x --incremental lastmodified --check-column date_x --target-dir /user/xyz -m 1

Configuring Sqoop with Mysql?

I have successfully installed Sqoop. Now the problem is how to use it with an RDBMS and how to load data from the RDBMS into HDFS using Sqoop.
Using Sqoop, you can load data directly into Hive tables or store the data in a target directory in HDFS.
If you need to copy data from an RDBMS into a directory:
sqoop import
--connect ConnectionString
--username username
--password Your_Database_Password {In case no password Do not Specify it}
--table tableName
--columns column_name(s) {In case you need only specific columns, comma-separated}
--target-dir '/tmp/myfolder'
--boundary-query 'SELECT MIN(id), MAX(id) FROM tableName' {id stands for your split column}
-m 5 {set number of mappers to 5}
--fields-terminated-by ',' {how do you want your data to look in target file}
Boundary query: This is optional. To work out the splits, Sqoop needs the minimum and maximum values of the split column. If you do not specify a boundary query, Sqoop generates one itself; for a free-form query it wraps your whole query as an inner query, which can be expensive.
If you specify the boundary query explicitly, it runs as a plain query, which can improve performance.
You may also want to restrict the rows that are imported, say based on the ID column. Suppose you need the data for IDs 1 to 1000. Using a boundary query together with --split-by, you can restrict the imported data:
--boundary-query 'select 0,1000 from employee'
--split-by ID
Split-By: Use --split-by on a Sqoop import to specify the column on which the data is split among the mappers. By default, if you do not specify it, Sqoop picks the table's primary key as the split-by column.
The split decides which rows each mapper reads, and each mapper writes its rows to its own file in the target directory. By default the number of mappers is 4.
Specifying it may seem unnecessary, but if the table has a composite primary key or no primary key at all, Sqoop cannot choose a split column on its own and may error out.
Note: You may not face any issue if you set the number of mappers to 1. In this case no split-by condition is used, since there is only one mapper, so the query runs fine. This can be done using
-m 1
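To make the boundary and split-by explanation concrete, here is a small sketch that divides an integer range among mappers. It only approximates the idea; the exact boundaries Sqoop generates may differ, and the function name print_splits is my own.

```shell
#!/bin/sh
# Print approximate per-mapper ranges for an integer --split-by column.
# Usage: print_splits MIN MAX MAPPERS  (MIN and MAX as returned by the boundary query)
print_splits() (
  min=$1; max=$2; mappers=$3
  span=$(( max - min + 1 ))     # how many integer values the range covers
  base=$(( span / mappers ))    # values every mapper gets
  extra=$(( span % mappers ))   # leftover values, one each for the first mappers
  lo=$min
  i=0
  while [ "$i" -lt "$mappers" ]; do
    size=$base
    if [ "$i" -lt "$extra" ]; then size=$(( base + 1 )); fi
    hi=$(( lo + size - 1 ))
    echo "mapper $i: ID >= $lo AND ID <= $hi"
    lo=$(( hi + 1 ))
    i=$(( i + 1 ))
  done
)

# Boundary values 0 and 1000 with 4 mappers, as in the example above.
print_splits 0 1000 4
```

Rows whose ID falls outside the boundary values belong to no mapper's range, which is why the boundary query can be used to restrict the import.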
If you need to copy data from an RDBMS into a Hive table:
sqoop import
--connect ConnectionString
--username username
--password Your_Database_Password {In case no password Do not Specify it}
--table tableName
--boundary-query 'SELECT MIN(id), MAX(id) FROM tableName' {id stands for your split column}
-m 5 {set number of mappers to 5}
--hive-import
--hive-table serviceorderdb.productinfo
Running a query instead of importing the entire table:
sqoop import
--connect ConnectionString
--username username
--password Your_Database_Password
--query "select name from employees where name like '%s' and \$CONDITIONS"
-m 1 {with --query, more than one mapper also requires --split-by}
--target-dir '/tmp/myfolder'
--fields-terminated-by ',' {how do you want your data to look in target file}
Note the extra $CONDITIONS token in the query. This time you specified no table and gave a query explicitly, so Sqoop has no table or primary key from which to derive the split conditions. $CONDITIONS is the placeholder that Sqoop replaces in each mapper's copy of the query with that mapper's split condition, so it must appear in the WHERE clause of every free-form query, even when only one mapper is used.
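A short sketch of the two points that usually trip people up here: how to quote $CONDITIONS in the shell, and what the substitution looks like. The sed call is only a stand-in for what Sqoop does internally, and the range in it is made up.

```shell
#!/bin/sh
# Single quotes: the shell leaves $CONDITIONS alone, so Sqoop sees the token.
QUERY_SINGLE='SELECT name FROM employees WHERE $CONDITIONS'

# Double quotes: escape the dollar sign so the shell does not expand it.
QUERY_DOUBLE="SELECT name FROM employees WHERE \$CONDITIONS"

# Both spellings hand the same text to Sqoop.
[ "$QUERY_SINGLE" = "$QUERY_DOUBLE" ] && echo "same query text"

# Stand-in for the per-mapper substitution: swap the token for one mapper's range.
printf '%s\n' "$QUERY_SINGLE" | sed 's/\$CONDITIONS/(id >= 0 AND id < 250)/'
```

If the dollar sign is left unescaped inside double quotes, the shell expands $CONDITIONS (normally to an empty string) before Sqoop ever sees the query, and the import fails.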
Checking that your connection is set up properly: for this you can call list-databases; if you see your databases listed, your connection is fine.
$ sqoop list-databases \
--connect jdbc:mysql://localhost/ \
--username root \
--password pwd
Connection strings for different databases:
MySQL: jdbc:mysql://<hostname>:<port>/<dbname>
jdbc:mysql://127.0.0.1:3306/test_database
Oracle: jdbc:oracle:thin:@//host_name:port_number/service_name
jdbc:oracle:thin:scott/tiger@//myhost:1521/myservicename
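If you switch between databases often, a tiny helper along these lines can build the --connect value from the patterns above. The function name jdbc_url and its argument order are my own invention, and it covers only the two formats listed here.

```shell
#!/bin/sh
# Build a JDBC URL for --connect. Usage: jdbc_url DBTYPE HOST PORT NAME
jdbc_url() {
  case "$1" in
    mysql)  echo "jdbc:mysql://$2:$3/$4" ;;
    oracle) echo "jdbc:oracle:thin:@//$2:$3/$4" ;;   # NAME is the service name
    *)      echo "unsupported database type: $1" >&2; return 1 ;;
  esac
}

jdbc_url mysql 127.0.0.1 3306 test_database
jdbc_url oracle myhost 1521 myservicename
```

The result can then be passed along as, for example, sqoop list-databases --connect "$(jdbc_url mysql localhost 3306 test)".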
You may learn more about sqoop imports from : https://sqoop.apache.org/docs/1.4.1-incubating/SqoopUserGuide.html
By using the sqoop import command you can import data from an RDBMS into HDFS, Hive and HBase.
sqoop import --connect jdbc:mysql://localhost:portnumber/DBName --username root --table emp --password root -m 1
With this command the data is stored in HDFS.
Sample commands to run sqoop import (load data from RDBMS to HDFS):
Postgres
sqoop import --connect jdbc:postgresql://postgresHost/databaseName --username username --password 123 --table tableName
MySQL
sqoop import --connect jdbc:mysql://mysqlHost/databaseName --username username --password 123 --table tableName
Oracle*
sqoop import --connect jdbc:oracle:thin:@oracleHost:1521/databaseName --username USERNAME --password 123 --table TABLENAME
SQL Server
sqoop import --connect 'jdbc:sqlserver://sqlserverhost:1433;database=dbname;username=<username>;password=<password>' --table tableName
*Sqoop won't find any columns from a table if you don't specify both the username and the table in correct case. Usually, specifying both in uppercase will resolve the issue.
Read the Sqoop User's Guide: https://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html
I also recommend the Apache Sqoop Cookbook. You will learn how to use import and export tools, do incremental import jobs, save jobs, solve problems with jdbc drivers and much more. http://shop.oreilly.com/product/0636920029519.do
