Encoding columns in Hive - hadoop

I'm importing a table from MySQL to Hive using Sqoop. Some columns are latin1 encoded. Is there any way to either:
set the encoding for those columns to latin1 in Hive, or
convert the columns to UTF-8 while importing with Sqoop?

With Sqoop's --direct mode, --default-character-set is passed through to MySQL and sets the character set for the whole dump, not for specific columns. I was not able to find a Sqoop parameter that converts table columns to UTF-8 on the fly; the columns are expected to already have the right character set.
$ sqoop import --connect jdbc:mysql://server.foo.com/db --table bar \
--direct -- --default-character-set=latin1
I believe you would need to convert the latin1 columns to UTF-8 in MySQL first, and then you can import with Sqoop. You can use the following script to convert all the columns to UTF-8, which I found here.
mysql --database=dbname -B -N -e "SHOW TABLES" | \
awk '{print "ALTER TABLE", $1, "CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;"}' | \
mysql --database=dbname &

It turned out the problem was unrelated. The column works fine regardless of encoding... but the table's schema had changed in MySQL. I assumed that since I'm passing the overwrite flag, Sqoop would remake the table in Hive every time. Not so! The schema changes in MySQL didn't get transferred to Hive, so the data in the md5 column was actually data from a different column.
The "fix" we settled on was: before every sqoop import, check for schema changes, and if there was a change, drop the table and re-import. This forces a schema update in Hive.
Edit: my original sqoop command was something like:
sqoop import --connect jdbc:mysql://HOST:PORT/DB --username USERNAME --password PASSWORD --table uploads --hive-table uploads --hive-import --hive-overwrite --split-by id --num-mappers 8 --hive-drop-import-delims --null-string '\\N' --null-non-string '\\N'
But now I manually issue a drop table uploads to hive first if the schema changes.
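For reference, a minimal sketch of that manual step, reusing the uploads table from the command above:
# drop the Hive table so the next --hive-import recreates it with the current MySQL schema
hive -e "DROP TABLE IF EXISTS uploads;"
# ...then rerun the sqoop import command shown above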

Related

Sqoop Import of Data having new line character in avro format and then query using hive

My requirement is to load data from an RDBMS into HDFS (backed by CDH 5.9.x) via Sqoop (1.4.6) in Avro format, and then use an external Hive (1.1) table to query the data.
Unfortunately, the data in the RDBMS contains some newline characters.
We all know that Hive can't parse newline characters within the data, so the column mapping breaks when the full data is selected via Hive. However, Hive's select count(*) works fine.
I used the options below during the sqoop import, but they didn't help:
--hive-drop-import-delims
--hive-delims-replacement
The above options work for text format, but storing the data in text format is not a viable option for me.
The above options are applied correctly in the toString method of the Sqoop-generated (codegen) POJO class (obviously, since text format works as expected), so I suspect this method is not used at all during the Avro import, probably because Avro has no problem dealing with newline characters whereas Hive does.
I am surprised no one else seems to face such a common scenario; any table with a remark or comment field is prone to this problem.
Can anyone suggest a solution?
My command:
sqoop import \
-Dmapred.job.queue.name=XXXX \
--connect jdbc:oracle:thin:#Masked:61901/AgainMasked \
--table masked.masked \
--username masked \
--P \
--target-dir /user/masked/ \
--as-avrodatafile \
--map-column-java CREATED=String,LAST_UPD=String,END_DT=String,INFO_RECORD_DT=String,START_DT=String,DB_LAST_UPD=String,ADDR_LINE_3=String \
--hive-delims-replacement ' ' \
--null-string '\\N' \
--null-non-string '\\N' \
--fields-terminated-by '\001' \
-m 1
This looks like an issue with the Avro SerDe. It is an open bug:
https://issues.apache.org/jira/browse/HIVE-14044
Can you try the same in Hive 2.0?
As mentioned by VJ, there is an open issue for newline characters in Avro.
An alternate approach that you can try (sketched below) is:
1. Sqoop the data into a Hive staging table stored as textfile.
2. Create an Avro table.
3. Insert the data from the staging table into the main Avro table in Hive.
Newline characters are handled very well in the textfile format.
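A rough sketch of steps 2 and 3, using hypothetical table and column names (stg_masked as the textfile staging table, masked_avro as the Avro target; STORED AS AVRO needs Hive 0.14+):
-- create the Avro-backed target table (illustrative columns only)
create table masked_avro (id bigint, remark string)
stored as avro;
-- copy from the textfile staging table; embedded newlines were already dropped or
-- replaced by --hive-drop-import-delims / --hive-delims-replacement at import time
insert overwrite table masked_avro
select id, remark from stg_masked;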

Special characters are not proper after sqooping data into Hive from teradata

I'm trying to sqoop a Teradata table into Hive using the "sqoop-import" command below.
sqoop tdimport \
-Dtdch.output.hdfs.avro.schema.file=/tmp/data/country.avsc \
--connect jdbc:teradata://tdserver/database=SALES --username tduser \
--password tdpw --as-avrodatafile --target-dir /tmp/data/country_avro --table COUNTRY \
--split-by SALESCOUNTRYCODE --num-mappers 1
The Teradata table contains special characters in some columns. After sqooping into Hive, the special characters do not come through properly.
Is there any way to preserve the special characters when running the sqoop import command?
Do we need to use UTF-8 to resolve this issue?
Can anyone please advise on this?

sqoop-hive Import adding an extra column

I imported data from Oracle into Hive with Sqoop successfully. I then added a column in Oracle and imported just that column to Hive using sqoop-import. But it is appended to the first column's data, the remaining columns are null, and no new column appeared in Hive. Can anyone resolve the issue?
Without looking at your import statements, I am assuming that in your second import you are trying to append to the existing import while importing only the new column, using the --columns and --append arguments. It will not work this way, because the data is appended at the end of the file, not at the end of each line.
You will need to overwrite the existing data in HDFS using --hive-overwrite and alter the Hive table to add the additional column, or just drop the Hive table and use --create-hive-table in the Sqoop command.
So your import command should look like this:
sqoop import \
--connect $CONNECTION_STR \
--username $USER \
--password $PASS \
--table $ORACLE_TABLE \
--hive-import \
--hive-overwrite \
--hive-home $HIVE_HOME \
--hive-table $HIVE_TABLE
Change the values to the actual values for your environment.
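For the first option (keep the Hive table and add the column), a sketch of the ALTER step that goes with the overwrite import above, using a hypothetical column name new_col:
hive -e "ALTER TABLE $HIVE_TABLE ADD COLUMNS (new_col STRING);"
# then rerun the --hive-overwrite import above so every row is reloaded with the new column populated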

how to overwrite the data in hive using sqoop

I am trying to load data into an already existing table in Hive via Sqoop from a MySQL database. I am referring to the guide below:
http://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html#_importing_data_into_hive
--hive-import has been tried and tested successfully.
I created a Hive table as below:
create table sqoophive (id int, name string, location string)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
Loaded the data as required.
I want to use the --hive-overwrite option to overwrite the content of the above table. As per the guide mentioned above: "--hive-overwrite: Overwrite existing data in the Hive table."
"If the Hive table already exists, you can specify the --hive-overwrite option to indicate that the existing table in Hive must be replaced."
So I tried the two commands below separately:
sqoop import --connect jdbc:mysql://localhost/test --username root --password 'hr' --table sample --hive-import --hive-overwrite --hive-table sqoophive -m 1 --fields-terminated-by '\t' --lines-terminated-by '\n'
sqoop import --connect jdbc:mysql://localhost/test --username root --password 'hr' --table sample --hive-overwrite --hive-table sqoophive -m 1 --fields-terminated-by '\t' --lines-terminated-by '\n'
But rather than replacing the content of the existing table, it just created files under /user/<username>/<mysqltablename>.
Can somebody please explain where I am going wrong?
The first command should work fine. I didn't give --fields-terminated-by and --lines-terminated-by, as the schema already exists.
Both --hive-import and --hive-overwrite need to be there.
If only --hive-overwrite is there, the data isn't loaded into the table; it is just copied to HDFS.
It's putting the _SUCCESS file in
/user/<username>/<mysqltablename>
You can change where that goes with --warehouse-dir
ex: --warehouse-dir /tmp
One would think that --hive-overwrite would handle this, meaning remove that directory first. But for good reason Hive doesn't want to start removing directories in HDFS. What if something else was put in there?
--hive-overwrite is saying, "I'm going to overwrite the rows in Hive, not just add to the table." Thus you will not have duplicates.
You have to remove that directory and the _SUCCESS file first, or better yet, right after the import is successful:
hadoop fs -rm -R /user/<username>/<mysqltablename>
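Putting the two together, a small sketch of the "clean up right after a successful import" idea (the && chaining and the $(whoami) path are illustrative, not Sqoop features):
sqoop import --connect jdbc:mysql://localhost/test --username root --password 'hr' \
  --table sample --hive-import --hive-overwrite --hive-table sqoophive -m 1 \
  && hadoop fs -rm -R /user/$(whoami)/sample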
A sqoop import without --target-dir or --warehouse-dir (for --hive-import) will import into /user/<username>/<mysqltablename>:
By default, Sqoop will import a table named foo to a directory named
foo inside your home directory in HDFS. For example, if your username
is someuser, then the import tool will write to
/user/someuser/foo/(files). You can adjust the parent directory of the
import with the --warehouse-dir argument.
You can also explicitly choose the target directory with the --target-dir param.
But as hrobertv said, --hive-overwrite does not delete the existing directory; it overwrites the HDFS data location of the Hive table. If you want to store the new data at the same location as before, you have to delete the existing table directory first and then run the sqoop import, specifying --target-dir or --warehouse-dir so that --hive-overwrite stores the data at the location you require.
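A sketch of that pattern, reusing the someuser example from the quoted docs (the -f flag just avoids an error when the directory does not exist yet):
hadoop fs -rm -r -f /user/someuser/sample
sqoop import --connect jdbc:mysql://localhost/test --username root --password 'hr' \
  --table sample --hive-import --hive-overwrite --hive-table sqoophive \
  --target-dir /user/someuser/sample -m 1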

How to use sqoop to export the default hive delimited output?

I have a hive query:
insert overwrite directory '/x'
select ...
Then I try to export the data with Sqoop:
sqoop export --connect jdbc:mysql://mysqlm/site --username site --password site --table x_data --export-dir /x --input-fields-terminated-by 0x01 --lines-terminated-by '\n'
But this seems to fail to parse the fields according to the delimiter.
What am I missing?
I think the --input-fields-terminated-by 0x01 part doesn't work as expected.
I do not want to create additional tables in Hive that contain the query results.
stack trace:
2013-09-24 05:39:21,705 ERROR org.apache.sqoop.mapreduce.TextExportMapper: Exception:
java.lang.NumberFormatException: For input string: "9-2"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
at java.lang.Integer.parseInt(Integer.java:458)
...
The vi view of output
16-09-2013 23^A1182^A-1^APub_X^A21782^AIT^A1^A0^A0^A0^A0^A0.0^A0.0^A0.0
16-09-2013 23^A1182^A6975^ASoMo Audience Corp^A2336143^AUS^A1^A1^A0^A0^A0^A0.2^A0.0^A0.0
16-09-2013 23^A1183^A-1^APub_UK, Inc.^A1564001^AGB^A1^A0^A0^A0^A0^A0.0^A0.0^A0.0
17-09-2013 00^A1120^A-1^APub_US^A911^A--^A181^A0^A0^A0^A0^A0.0^A0.0^A0.0
I've found the correct solution for that special character in bash:
#!/bin/bash
# ... your script
hive_char=$( printf "\x01" )
sqoop export --connect jdbc:mysql://mysqlm/site --username site --password site --table x_data --export-dir /x --input-fields-terminated-by ${hive_char} --lines-terminated-by '\n'
The problem was correct separator recognition (nothing to do with types or schema), and that is what hive_char achieves.
Another way to enter this special character on the Linux command line is to type Ctrl+V then Ctrl+A.
Using
--input-fields-terminated-by '\001' --lines-terminated-by '\n'
as flags in the sqoop export command seems to do the trick for me.
So, in your example, the full command would be:
sqoop export --connect jdbc:mysql://mysqlm/site --username site --password site --table x_data --export-dir /x --input-fields-terminated-by '\001' --lines-terminated-by '\n'
I think it's a datatype mismatch with your RDBMS schema.
Try to find the column that holds the "9-2" value and check its datatype in the RDBMS schema.
If it is int or numeric, Sqoop will try to parse the value and insert it, and "9-2" is clearly not a numeric value.
Let me know if this doesn't work.
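One way to do that check on the MySQL side (a sketch; the database name site is taken from the connect string above, and the column names are unknown, so this just lists them all):
mysql -e "SELECT column_name, data_type FROM information_schema.columns \
  WHERE table_schema='site' AND table_name='x_data';"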
It seems like Sqoop is taking '0' as the delimiter.
You are getting an error because the first column in your MySQL table could be varchar and the second column is a number.
As per the string below:
16- 0 9-2 0 13 23^A1182^A-1^APub_X^A21782^AIT^A1^A0^A0^A0^A0^A0.0^A0.0^A0.0
the first column parsed by Sqoop is 16-
and the second column is 9-2.
So it is better to specify the delimiter in quotes ('0x01'),
or
(it's always easier and gives better control) use the Hive create table command:
create table tablename row format delimited fields terminated by '\t' as select ..., and specify '\t' as the delimiter in your sqoop command.
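A minimal sketch of that route, with a hypothetical Hive table name x_export, a hypothetical stand-in for your original query, and the default Hive warehouse location assumed for the new table's files:
-- in Hive: materialize the query results with an explicit tab delimiter
create table x_export
row format delimited fields terminated by '\t'
as select * from x_source;  -- x_source stands in for your original query
and then on the command line:
# export, telling Sqoop the files are tab-delimited
sqoop export --connect jdbc:mysql://mysqlm/site --username site --password site \
  --table x_data --export-dir /user/hive/warehouse/x_export \
  --input-fields-terminated-by '\t' --lines-terminated-by '\n'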
