I'm doing a Sqoop import from Oracle to HDFS and creating a Hive table in Parquet format.
I have a Date field in the Oracle table (mm/dd/yyyy format) which I need to bring into Hive as a Timestamp (yyyy-mm-dd hh24:mi:ss).
I used cast(xyz_date as Timestamp) in the Sqoop select query, but it is saved as a Long type in the Parquet file.
I checked the Hive table, and NULLs are stored in the xyz_date field.
I do not want to store it as a string. Please help.
Sqoop command:
sqoop import -D mapDateToTimestamp=true \
--connect \
--username abc \
--password-file file:location \
--query "select X,Y,TO_DATE(to_char(XYZ,'MM/DD/YYYY'),'MM/DD/YYYY') from TABLE1 where $CONDITIONS" \
--split-by Y \
--target-dir /location \
--delete-target-dir \
--as-parquetfile \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec \
--map-column-java Y=Long
Here is my Hive table definition:
CREATE EXTERNAL TABLE IF NOT EXISTS abc (
  X STRING,
  Y BIGINT,
  XYZ TIMESTAMP
)
STORED AS PARQUET
LOCATION '/location'
TBLPROPERTIES ("parquet.compression"="SNAPPY");
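If parquet-tools is available, the stored type can be confirmed directly against one of the generated files; this is only a sketch, and the part file name below is a placeholder:
parquet-tools schema /location/part-m-00000.parquet
# XYZ is reported here as a 64-bit integer (epoch value) rather than a timestamp type,
# which matches the Long type and the NULLs described above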
I created a Sqoop process which imports data from MS SQL to Hive, but I have a problem with 'char' type fields. Sqoop import code:
sqoop import \
--create-hcatalog-table \
--connect "connection_parameters" \
--username USER \
--driver net.sourceforge.jtds.jdbc.Driver \
--null-string '' \
--null-non-string '' \
--class-name TABLE_X \
--hcatalog-table TABLE_X_TEST \
--hcatalog-database default \
--hcatalog-storage-stanza "stored as orc tblproperties ('orc.compress'='SNAPPY')" \
--map-column-hive "column_1=char(10),column_2=char(35)" \
--num-mappers 1 \
--query "select top 10 "column_1", "column_2" from TABLE_X where \$CONDITIONS" \
--outdir "/tmp"
column_1, which is of type char(10), should be NULL if there is no data, but Hive fills the field with 10 spaces.
column_2, which is of type char(35), should be NULL too, but instead there are 35 spaces.
It is a huge problem because I cannot run a query like this:
select count(*) from TABLE_X_TEST where column_1 is NULL and column_2 is NULL;
But I have to use this one:
select count(*) from TABLE_X_TEST where column_1 = ' ' and column_2 = ' ';
I tried changing the query parameter and using the trim functions:
--query "select top 10 rtrim(ltrim("column_1")), rtrim(ltrim("column_2")) from TABLE_X where \$CONDITIONS"
but it does not work, so I suppose the problem is not with the source but with Hive.
How can I prevent Hive from inserting spaces into empty fields?
You need to change these parameters:
--null-string '\\N' \
--null-non-string '\\N' \
Hive, by default, expects the NULL value to be encoded as the string constant \N. Sqoop, by default, encodes it as the string constant null. To fix the mismatch, you need to override Sqoop's default behavior to match Hive's using the parameters --null-string and --null-non-string (which is what you are doing, but with incorrect values). For details, see the docs.
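For example, the import from the question would look like this with the corrected values; this is only a sketch that keeps every other option from the original command unchanged:
sqoop import \
--create-hcatalog-table \
--connect "connection_parameters" \
--username USER \
--driver net.sourceforge.jtds.jdbc.Driver \
--null-string '\\N' \
--null-non-string '\\N' \
--class-name TABLE_X \
--hcatalog-table TABLE_X_TEST \
--hcatalog-database default \
--hcatalog-storage-stanza "stored as orc tblproperties ('orc.compress'='SNAPPY')" \
--map-column-hive "column_1=char(10),column_2=char(35)" \
--num-mappers 1 \
--query "select top 10 "column_1", "column_2" from TABLE_X where \$CONDITIONS" \
--outdir "/tmp"
After re-importing, the IS NULL count query from the question should behave as expected.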
I tried without the --null-string and --null-non-string options when creating ORC tables using Sqoop hcatalog; all the NULLs in the source come through as NULL, and I am able to query them with the IS NULL condition.
Let me know if you find any other solution for handling NULLs.
I have a requirement.
I need to get data from Teradata database a into Hadoop.
I have access to only a view in that database and need to pull data from that view. As I don't have read/write access to that database, I cannot use the --split-by column option in my sqoop import query.
So is there an option where I can tell Sqoop to use database b for storing the split data and then move the data into Hadoop?
Query:
sqoop import \
--connect "jdbc:sqlserver://xx.aa.dd.aa;databaseName=a" \
--connection-manager org.apache.sqoop.manager.SQLServerManager \
--username XXXX \
--password XXXX \
--num-mappers 20 \
--query "select * from (select ID,name,x,y,z from TABLE1 where DT between '2018/01/01' and '2018/01/31') as temp_table where updt_date <'2018/01/31' AND \$CONDITIONS" \
--split-by id \
--target-dir /user/XXXX/sqoop_import/XYZ/2018/TABLE1
I want to import a table with date fields from an Oracle database into Hive using Sqoop. I don't understand why the imported fields in Hive are string and not timestamp (as in Oracle). I prefer not to use --map-column-hive because I want this to happen automatically for every date field in every table.
This is my Sqoop command:
sqoop import -D mapDateToTimestamp=true --direct \
--connect jdbc:oracle:thin:@//$connection \
--username $username --password $password \
-m 7 --table $table1 \
--hive-import --hive-overwrite \
--hive-database $database1 --hive-table $table2 \
--where "$date>to_date('$dd/mm/yyyy hh:mm:ss','dd/mm/yyyy hh24:mi:ss')" \
--null-string '\N' --null-non-string '\N' \
--target-dir $targetdirectory \
-- --schema $schema
I tried it with both mapDateToTimestamp=true and oracle.jdbc.mapDateToTimestamp=true; the data is imported, but the date fields end up as string.
Do you have a solution? Suggestions? Advice?
Thanks a lot!
I'm working on a Sqoop import with the following command:
#!/bin/bash
while IFS=":" read -r server dbname table; do
  sqoop import --connect jdbc:mysql://$server/$dbname --username root --password cloudera \
    --table mydata --hive-import --hive-table dynpart \
    --check-column id --last-value $(hive -e "select max(id) from dynpart") \
    --hive-partition-key 'thisday' --hive-partition-value '01-01-2016'
done<tables.txt
I'm creating the partition for each day.
Hive table:
create table dynpart(id int, name char(30), city char(30))
partitioned by(thisday char(10))
row format delimited
fields terminated by ','
stored as textfile
location '/hive/mytables'
tblproperties("comment"="partition column: thisday structure is dd-mm-yyyy");
But I don't want to give the partition value directly, as I want to create a Sqoop job and run it every day. In the script, how can I pass the date value to the Sqoop command dynamically (format: dd/mm/yyyy) instead of giving it directly?
Any help is appreciated.
You can use the shell command date to get it (Ubuntu 14.04):
$ date +%d/%m/%Y
22/03/2017
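For example (a sketch assuming the dd/mm/yyyy format asked for in the question), the output can be substituted straight into the partition option:
--hive-partition-value "$(date +%d/%m/%Y)"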
You can try the code below:
#!/bin/bash
DATE=$(date +"%d-%m-%Y")
while IFS=":" read -r server dbname table; do
  sqoop import --connect jdbc:mysql://$server/$dbname --username root --password cloudera \
    --table mydata --hive-import --hive-table dynpart \
    --check-column id --last-value $(hive -e "select max(id) from dynpart") \
    --hive-partition-key 'thisday' --hive-partition-value "$DATE"
done<tables.txt
Hope this helps.
I have a table called 'employee' in SQL Server :
ID NAME ADDRESS DESIGNATION
1 Jack XXX Clerk
2 John YYY Engineer
I have created an external table (emp) in Hive, and through a Sqoop import I imported data from employee to the Hive table using the --query argument of Sqoop. If I set --query to 'select * from employee', the data gets inserted into the Hive table correctly. But if I set --query to 'select ID,NAME,DESIGNATION from employee', the data in the DESIGNATION column of the 'employee' table (RDBMS) gets inserted into the address column of the 'emp' table instead of into the designation column. When I run the below Hive query:
select designation from emp;
I get values as:
NULL
NULL
instead of:
Clerk
Engineer
But if I run the Hive query as:
select address from emp;
I get values as:
Clerk
Engineer
instead of:
NULL
NULL
Any ideas for fixing this incorrect data would be of great help. I am currently using Hive 0.11, so I can't use the Hive INSERT queries that are available from Hive 0.14.
OK, I'll show you a sample.
sqoop import --connect jdbc:mysql://host:port/db'?useUnicode=true&characterEncoding=utf-8' \
--username 'xxxx' \
--password 'xxxx' \
--table employee \
--columns 'ID,NAME,DESIGNATION' \
--where 'aaa=bbb' \
-m 1 \
--target-dir hdfs://nameservice1/dir \
--fields-terminated-by '\t' \
--hive-import \
--hive-overwrite \
--hive-drop-import-delims \
--null-non-string '\\N' \
--null-string '\\N' \
--hive-table 'hive_db.hive_tb' \
--hive-partition-key 'pt' \
--hive-partition-value '2016-01-20'
Some of the parameters are optional.
Sqoop syntax details:
http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_literal_sqoop_import_literal
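After such an import, the result can be checked with the same kind of query the question runs, using the example names from the command above (hive_db.hive_tb and the pt value are just the sample values):
select designation from hive_db.hive_tb where pt = '2016-01-20' limit 10;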
The Sqoop statement will import the data into the HDFS directory as (assuming , as the field separator):
1,Jack,Clerk
2,John,Engineer
So the ADDRESS column will have the DESIGNATION data, and the DESIGNATION column will be null.
You can try --query "select ID,NAME,'',DESIGNATION from employee"; this should work.
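With that placeholder, the exported rows line up with the four Hive columns, leaving ADDRESS empty (again assuming , as the field separator):
1,Jack,,Clerk
2,John,,Engineer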