I am stuck at one point with Sqoop.
My source has one column that contains a special character, but when I pull the data with Sqoop, the special character is changed to something else.
In my source Oracle table I have:
jan 2005 �DSX�
but when Sqoop loads the data into the Hive table, the special character is changed to something else:
jan 2005 �DSXÙ
Please suggest a solution so that I get exactly the same special character as in the source (Oracle) table.
sqoop import \
--connect "jdbc:oracle:thin:#source connection details" \
--connection-manager org.apache.sqoop.manager.OracleManager \
--username abc \
--password xyz \
--fields-terminated-by '\001' \
--null-string '' \
--null-non-string '' \
--query "select column_name from wxy.ztable where \$CONDITIONS " \
--target-dir "db/dump/dir" \
--split-by "col1" \
-m 1
If you are seeing jan 2005 �DSX� in your Oracle table, the encoding for the Oracle table is probably not set correctly either. I don't have much experience with Oracle, so I can't tell you how to check that, but your Oracle DBA can.
One thing I can tell you is that Hadoop uses UTF-8 encoding, so you first need to convert your Oracle data to UTF-8 and then import it.
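As a rough sketch of where to start (not a guaranteed fix): first confirm the database character set on the Oracle side, and if the stored data itself is fine, try forcing the Sqoop mapper JVMs to run with UTF-8. The -D option below is a standard Hadoop/JVM property; everything else is copied from the command in the question.

# Ask your DBA to run this on the Oracle side to see the database character set:
#   SELECT value FROM nls_database_parameters WHERE parameter = 'NLS_CHARACTERSET';

# Then force UTF-8 in the mapper JVMs (generic -D options must come before the Sqoop arguments):
sqoop import \
-Dmapreduce.map.java.opts="-Dfile.encoding=UTF-8" \
--connect "jdbc:oracle:thin:#source connection details" \
--username abc \
--password xyz \
--fields-terminated-by '\001' \
--query "select column_name from wxy.ztable where \$CONDITIONS " \
--target-dir "db/dump/dir" \
-m 1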
My requirement is to load data from an RDBMS into HDFS (backed by CDH 5.9.x) via Sqoop (1.4.6) in Avro format and then use an external Hive (1.1) table to query the data.
Unfortunately, the data in the RDBMS has some newline characters.
We all know that Hive can't parse newline characters in the data, and the data mapping fails when the whole data set is selected via Hive. However, Hive's select count(*) works fine.
I used the options below during the Sqoop import and checked, but they didn't work:
--hive-drop-import-delims
--hive-delims-replacement
The above options work for text format, but storing the data in text format is not a viable option for me.
The above options are applied properly in the Sqoop-generated (codegen) POJO class's toString method (obviously, since text format works as expected), so I feel this method is not used at all during the Avro import, probably because Avro has no problem dealing with newline characters, whereas Hive does.
I am surprised that no one else has faced such a common scenario; any table with a remark or comment field is prone to this problem.
Can anyone suggest a solution, please?
My command:
sqoop import \
-Dmapred.job.queue.name=XXXX \
--connect jdbc:oracle:thin:#Masked:61901/AgainMasked \
--table masked.masked \
--username masked \
--P \
--target-dir /user/masked/ \
--as-avrodatafile \
--map-column-java CREATED=String,LAST_UPD=String,END_DT=String,INFO_RECORD_DT=String,START_DT=String,DB_LAST_UPD=String,ADDR_LINE_3=String \
--hive-delims-replacement ' ' \
--null-string '\\N' \
--null-non-string '\\N' \
--fields-terminated-by '\001' \
-m 1
This looks like an issue with the Avro SerDe. It is an open bug:
https://issues.apache.org/jira/browse/HIVE-14044
Can you try the same in Hive 2.0?
As mentioned by VJ, there is an open issue for newline characters in Avro.
An alternative approach that you can try is:
Sqoop the data into a Hive staging table stored as textfile.
Create an Avro table.
Insert the data from the staging table into the main Avro table in Hive.
Newline characters are handled very well in the text file format, as sketched below.
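A minimal sketch of that workflow, assuming hypothetical table names stage_db.my_table_txt and main_db.my_table_avro (the delimiter replacement only works here because the staging table is plain text):

# Step 1: Sqoop into a plain-text Hive staging table, replacing newlines and delimiters
sqoop import \
--connect jdbc:oracle:thin:#Masked:61901/AgainMasked \
--username masked \
-P \
--table masked.masked \
--hive-import \
--hive-table stage_db.my_table_txt \
--hive-delims-replacement ' ' \
--fields-terminated-by '\001' \
-m 1

# Step 2: create the Avro table and copy the cleaned rows into it
hive -e "
-- one-time creation of an empty Avro table with the same columns
CREATE TABLE IF NOT EXISTS main_db.my_table_avro STORED AS AVRO
AS SELECT * FROM stage_db.my_table_txt WHERE 1 = 0;
INSERT INTO TABLE main_db.my_table_avro
SELECT * FROM stage_db.my_table_txt;
"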
I am trying to import a table into Hive as Parquet, and the --map-column-hive column_name=timestamp option is being ignored. The column 'column_name' is originally of type datetime in SQL Server, and Sqoop converts it to bigint in Parquet. I want to convert it to timestamp format through Sqoop, but it is not working.
sqoop import \
--table table_name \
--driver com.microsoft.sqlserver.jdbc.SQLServerDriver \
--connect jdbc:sqlserver://servername \
--username user --password pw \
--map-column-hive column_name=timestamp \
--as-parquetfile \
--hive-import \
--hive-table table_name -m 1
When I view the table in Hive, it still shows the column with its original data type.
I tried column_name=string, and that did not work either.
I think this may be an issue with converting files to Parquet, but I am not sure. Does anyone have a solution to fix this?
I get no errors when running the command; it just completes the import as if the option did not exist.
Before Hive 1.2, timestamp support in the Parquet SerDe is not available; only the binary data type is supported in 1.1.0.
Please upgrade your version to 1.2 or later, and it should work.
Please check the issue log and release notes below:
https://issues.apache.org/jira/browse/HIVE-6384
https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12329345&styleName=Text&projectId=12310843
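If upgrading isn't an option right away, a common workaround (only a sketch; the view name and the placeholder other_column are hypothetical, and it assumes Sqoop stored the datetime as epoch milliseconds in the bigint column) is to leave the Parquet column as bigint and expose a converted view in Hive:

hive -e "
CREATE VIEW table_name_ts AS
SELECT
  -- epoch milliseconds -> seconds -> TIMESTAMP (from_unixtime uses the cluster's local time zone)
  CAST(from_unixtime(CAST(column_name / 1000 AS BIGINT)) AS TIMESTAMP) AS column_name,
  other_column
FROM table_name;
"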
I have imported data from Sqoop to Hive successfully. I then added a column in Oracle and imported that particular column to Hive again using sqoop import. But it is being appended to the first column's data, the remaining columns come out as null, and no new column appeared in Hive. Can anyone resolve this issue?
Without looking at your import statements, I am assuming that in your second import you are trying to append to the existing import while importing only the new column using the --columns and --append arguments. It will not work this way, because the new data is appended at the end of the file, not at the end of each line.
You will need to overwrite the existing data in HDFS using --hive-overwrite and alter the Hive table to add the additional column, OR just drop the Hive table and use --create-hive-table in the Sqoop command.
So your import command should look like this:
sqoop import \
--connect $CONNECTION_STR \
--username $USER \
--password $PASS \
--table $ORACLE_TABLE \
--hive-import \
--hive-overwrite \
--hive-home $HIVE_HOME \
--hive-table $HIVE_TABLE
Change the values to the actual values for your environment.
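If you go the other route and keep the existing table, a minimal sketch of the extra step (the column name and type below are placeholders) is to add the column in Hive before re-running the import with --hive-overwrite:

# add the new column to the Hive table so it matches the new Oracle schema
hive -e "ALTER TABLE $HIVE_TABLE ADD COLUMNS (new_column_name STRING);"

# then re-run the sqoop import with --hive-overwrite as shown above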
I'm very new to Hive and Sqoop, as my company has just adopted them. As such, I am trying to import data from a SQL Server database into HDFS/Hive. However, we still only have a few clusters, so I am worried about importing all the data at once (19 million records in total). I have searched furiously for a solution, but the only thing close to what I am looking for is incremental import. However, that is not a solution, as it imports everything newer than the first import, and I have historical data going back two years.
Therefore, is there a way to append to a table that I am missing (so I can import, for example, a month at a time into the same table)?
Here is the initial command I am using to insert the first chunk of data into the table.
sqoop import --driver com.microsoft.sqlserver.jdbc.SQLServerDriver \
--connect jdbc:sqlserver://******omitted******* \
--username **** \
--password ******* \
--hive-table <tablename> \
--m 1 \
--delete-target-dir \
--target-dir /apps/hive/warehouse/<dir to table> \
--hive-drop-import-delims \
--hive-import --query "select * from <old sql table> where record_id
<='000000001433106' and \$CONDITIONS"
Thanks for any help.
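One thing you can try (a sketch only; the new target directory and the upper bound on record_id are placeholders) is to keep the chunked --query approach but drop --delete-target-dir for the follow-up chunks. When the Hive table already exists and --hive-overwrite is not given, Sqoop's Hive load appends the new files to the table, so each run adds the next range:

sqoop import --driver com.microsoft.sqlserver.jdbc.SQLServerDriver \
--connect jdbc:sqlserver://******omitted******* \
--username **** \
--password ******* \
--hive-table <tablename> \
-m 1 \
--target-dir /tmp/sqoop_staging/<tablename>_chunk2 \
--hive-drop-import-delims \
--hive-import --query "select * from <old sql table> where record_id > '000000001433106' and record_id <= '<next upper bound>' and \$CONDITIONS"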
I'm importing a table from MySQL to Hive using Sqoop. Some columns are latin1 encoded. Is there any way to either:
Set the encoding for those columns as latin1 in Hive, OR
Convert the columns to utf-8 while importing with Sqoop?
The --default-character-set option (passed through to MySQL in direct mode, as shown below) sets the character set for the whole connection, not for specific columns. I was not able to find a Sqoop parameter that converts table columns to utf-8 on the fly; rather, the column character sets are expected to be fixed up front.
$ sqoop import --connect jdbc:mysql://server.foo.com/db --table bar \
--direct -- --default-character-set=latin1
I believe you would need to convert the latin1 columns to utf-8 first in your MySQL, and then you can import with Sqoop. You can use the following script to convert all the columns to utf-8, which I found here.
mysql --database=dbname -B -N -e "SHOW TABLES" | \
awk '{print "ALTER TABLE", $1, "CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;"}' | \
mysql --database=dbname &
It turned out the problem was unrelated. The column works fine regardless of encoding, but the table's schema had changed in MySQL. I assumed that since I was passing in the overwrite flag, Sqoop would remake the table in Hive every time. Not so! The schema changes in MySQL didn't get transferred to Hive, so the data in the md5 column was actually data from a different column.
The "fix" we settled on was: before every Sqoop import, check for schema changes, and if there was a change, drop the table and re-import. This forces a schema update in Hive.
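A minimal sketch of that check (the temp file paths, host, and database name are placeholders; it simply compares MySQL's current column list against the one saved from the previous run and drops the Hive table if they differ, so the next --hive-import recreates it):

# dump the current MySQL column list for the table
mysql -h HOST -P PORT -u USERNAME -pPASSWORD -N \
-e "SELECT column_name, column_type FROM information_schema.columns WHERE table_schema = 'DB' AND table_name = 'uploads' ORDER BY ordinal_position" \
> /tmp/uploads_schema.new

# if the schema changed since the last run, drop the Hive table so it gets recreated
if ! diff -q /tmp/uploads_schema.new /tmp/uploads_schema.old > /dev/null 2>&1; then
    hive -e "DROP TABLE IF EXISTS uploads;"
fi
mv /tmp/uploads_schema.new /tmp/uploads_schema.old

# then run the usual sqoop import (the original command is below)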
Edit: my original sqoop command was something like:
sqoop import --connect jdbc:mysql://HOST:PORT/DB \
--username USERNAME --password PASSWORD \
--table uploads --hive-table uploads \
--hive-import --hive-overwrite \
--split-by id --num-mappers 8 \
--hive-drop-import-delims \
--null-string '\\N' --null-non-string '\\N'
But now, if the schema changes, I manually issue a drop table uploads in Hive first.