How to handle Junk characters in Sqoop - sqoop

When importing data from an RDBMS to Hadoop using Sqoop, if my source system contains junk characters, how can we replace them?
Eg: 1,punâ€,travel,

The definition of junk characters can vary based on the data being stored and how it is used. Sqoop import allows dropping Hive delimiters (via the --hive-drop-import-delims option) or replacing Hive delimiters (via the --hive-delims-replacement option). Any other data cleansing would need to be done after the import job has landed the data on Hadoop.
Per the Sqoop documentation:
--hive-drop-import-delims: Drops \n, \r, and \01 from string fields when importing to Hive.
--hive-delims-replacement: Replace \n, \r, and \01 from string fields with user defined string when importing to Hive.
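For illustration, a minimal sketch of how these options fit into an import command; the connection string, credentials, and table names are placeholders rather than anything from the question:

sqoop import \
  --connect jdbc:mysql://dbhost/sourcedb \
  --username etl_user -P \
  --table customers \
  --hive-import \
  --hive-table customers \
  --hive-delims-replacement ' '

Replacing the delimiters with a single space keeps the field content readable while removing the characters Hive would otherwise treat as record or column boundaries; use --hive-drop-import-delims instead to strip them entirely.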

Related

How to handle newline characters in Hive?

I am exporting a table from Teradata to Hive. The table in Teradata has an address field that contains newline characters (\n). Initially I export the table from Teradata to a mounted filesystem path, and then I load the table into Hive. Record counts are mismatching between the Teradata table and the Hive table, since the newline characters are present in the data.
NOTE: I don't want to handle this through Sqoop while bringing in the data; I want to handle the newline characters while loading into Hive from the local path.
I got this to work by creating an external table with the following options:
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
ESCAPED BY '\\'
STORED AS TEXTFILE;
Then I created a partition pointing to the directory that contains the data files (my table is partitioned), i.e.
ALTER TABLE STG_HOLD_CR_LINE_FEED ADD PARTITION (part_key='part_week53') LOCATION '/ifs/test/schema.table/staging/';
NOTE: Be sure that when creating your data file you use '\' as the escape character.
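Putting the answer together, a sketch of the full external-table DDL; the column list is hypothetical (the question only mentions an address field), and only the delimiter, escape, and partition clauses come from the answer above:

CREATE EXTERNAL TABLE STG_HOLD_CR_LINE_FEED (
  cust_id STRING,
  address STRING
)
PARTITIONED BY (part_key STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'
  ESCAPED BY '\\'
STORED AS TEXTFILE;

The ALTER TABLE ... ADD PARTITION statement shown above then points the partition at the staging directory containing the escaped data files.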
The LOAD DATA command in Hive only copies the data files directly into the HDFS table location.
The only reason Hive would split on a newline is that the table is defined as STORED AS TEXTFILE, which by default uses newlines as record separators, not field separators.
To redefine the table you need something like
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' ESCAPED BY 'x'
LINES TERMINATED BY 'y'
where x is the escape character used for fields that contain newlines and y is the record delimiter.

Table count is more than File record count in Hive

I'm using a SQL Server exported file as the input for my Hive table (which has 40 columns). There are around 6 million rows in the data file, but when I load that file into the Hive table, I find that the record count is higher than the row count in the file: the table has 15 more records than the input text file.
I suspect the presence of newline characters (\n) in the data, but due to the huge volume of data I'm unable to manually check for and remove these characters from the data file.
Is there any way to make my table count exactly equal to the file count? Can I make my load query treat those newline characters as data instead of record delimiters, or is there some other issue?
If you are sqooping the input into HDFS/Hive, then you may use the --hive-drop-import-delims or --hive-delims-replacement options of Sqoop.
Hive will have problems using Sqoop-imported data if your database's rows contain string fields that have Hive's default row delimiters (\n and \r characters) or column delimiters (\01 characters) present in them.
You can use the --hive-drop-import-delims option to drop those characters on import to give Hive-compatible text data. Alternatively, you can use the --hive-delims-replacement option to replace those characters with a user-defined string on import to give Hive-compatible text data.
These options should only be used if you use Hive's default delimiters and should not be used if different delimiters are specified.
Sqoop User Guide
Alternatively, if you are copying files onto HDFS using some other method, then just run a replace script/command over the files.
It turned out to be as simple as running a Unix command to clean the source data.
sed -i 's/\r//g'
After applying this command to the dataset to remove carriage returns, I was able to load the Hive table with the expected record count.
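As written, the sed command still needs the target files as arguments; a sketch with a placeholder path:

sed -i 's/\r//g' /path/to/staging/*.txt

Note that this only strips carriage returns; newlines embedded inside field values still need one of the Sqoop options or table-format changes discussed elsewhere on this page.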

Sqoop not loading CLOB type data into hive table properly

I am trying to use a Sqoop job to import data from Oracle, and one of the columns in the Oracle table is of data type CLOB, which contains newline characters.
In this case, the option --hive-drop-import-delims is not working: the Hive table doesn't read the \n characters properly.
Please suggest how I can import the CLOB data into the target directory with all the characters parsed properly.
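One commonly suggested workaround, not taken from this thread and worth verifying against your Sqoop version, is to map the CLOB column to a Java String with --map-column-java; the delimiter options only apply to string fields, so forcing the CLOB to be treated as a string lets --hive-drop-import-delims act on it. The connection details and column name below are placeholders:

sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username etl_user -P \
  --table DOCUMENTS \
  --map-column-java NOTES_CLOB=String \
  --hive-import \
  --hive-table documents \
  --hive-drop-import-delims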

Handling newline characters in Hive

I have created a table in Hive as
Create table(id int, Description String)
My data looks something like the following:
1|This will return corrupt data since there is a ',' in the first string.
some text
Change the data
2|There is prob in reading data
sometext
After the data is loaded into Hive, since the default line terminator is \n, the description column cannot be read by Hive, and hence it displays a NULL value. Can anyone suggest how to handle the newlines before loading into Hive?
I know this question is old, but you have a couple of options. You can't control this with FIELDS TERMINATED BY, because that only controls what terminates the fields, not the records. Records in Hive are hard-coded to be terminated by the newline character (even though there is a LINES TERMINATED BY clause, it is not implemented).
1. Write a custom InputFormat that uses a RecordReader that understands non-newline-delimited records. Look at the code for LineReader/LineRecordReader and TextInputFormat.
2. Use a format other than text/ASCII, like Parquet. I would recommend this regardless, as text is probably the worst format you can store data in anyway (see the sketch below).
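If the data can be brought in with Sqoop rather than as a delimited text file, the second suggestion is straightforward to follow with a Parquet import, since Parquet does not use newlines as record separators. A sketch under that assumption, with placeholder connection details:

sqoop import \
  --connect jdbc:mysql://dbhost/sourcedb \
  --username etl_user -P \
  --table descriptions \
  --as-parquetfile \
  --target-dir /data/descriptions_parquet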
Try adding the property below in hive-site.xml, or set it just for the current Hive session:
hive.query.result.fileformat=SequenceFile
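For a session-level test, that corresponds to running the following in the Hive shell before the query:

SET hive.query.result.fileformat=SequenceFile;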
By default Hive uses the newline character ('\n') as the record delimiter.
You can change the field delimiter using:
ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";

sqoop imports a lot of NULL rows

I'm importing a table from MySQL to Hive. The table has 2115584 rows. During the import I see
13/03/20 18:34:31 INFO mapreduce.ImportJobBase: Retrieved 2115584 records.
But when I do a count(*) on the imported table I see that it has 49262250 rows. What is going on?
Update: the import works correctly when --direct is specified.
Figured it out. From the sqoop user manual:
Hive will have problems using Sqoop-imported data if your database’s rows contain string fields that have Hive’s default row delimiters (\n and \r characters) or column delimiters (\01 characters) present in them. You can use the --hive-drop-import-delims option to drop those characters on import to give Hive-compatible text data.
I just specified --hive-drop-import-delims and it works now.
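For reference, a sketch of the corrected import; the connection details are placeholders, not from the question:

sqoop import \
  --connect jdbc:mysql://dbhost/sourcedb \
  --username etl_user -P \
  --table big_table \
  --hive-import \
  --hive-table big_table \
  --hive-drop-import-delims

Afterwards, SELECT COUNT(*) on the Hive table should match the "Retrieved ... records" count that Sqoop reports at the end of the import.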
