Data (Single Quotes and Double Quotes) Mismatch in Hive - Hadoop

While loading a file from the mainframe into Hadoop in ORC format, some of the data loaded with single quotes (') and the rest with double quotes ("), even though the complete source file uses single quotes (').
The Hive COBOL SerDe was used to specify custom delimiters.
Example:
Source data:
First_Name Last_name Address
Rev 'Har' O'Amy 4031 'B' Ave
It loaded into Hadoop with some data in the correct format (') and some with double quotes ("), as below:
First_Name Last_name Address
Rev "Har" O"Amy 4031 "B" Ave
What could be the issue, and how can it be solved?

One possible issue might be the delimiter given at table creation,
so try
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ('serialization.encoding'='UTF-8'); while creating the Hive table, and then load the data.
Also try the CSV SerDe given in this link to handle the stray quote characters if you want your data clean: https://github.com/ogrodnek/csv-serde
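For reference, a minimal sketch of what such a table definition might look like (the table and column names here are only illustrative):
CREATE EXTERNAL TABLE customer_records (
  first_name STRING,
  last_name  STRING,
  address    STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('serialization.encoding' = 'UTF-8')
STORED AS TEXTFILE;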

Related

How to ignore a row delimiter present in the column data in an Informatica source

I have a CSV file as source with ; as the column delimiter, LF as the row delimiter, and data enclosed within double quotes (""). If an LF (row delimiter) appears inside the data, it should not be treated as a row delimiter. My target is an Oracle database.
How do I get the required output below using Informatica?
Input:
"Ram";"Hyderabad"LF
"Sita";"Hyderabad,LF
INDIA-500084."LF
The required output should be 2 rows only:
Name Address
Ram Hyderabad
Sita Hyderabad, INDIA-500084.
The wrong output I am getting is 3 rows:
Name Address
Ram Hyderabad
Sita Hyderabad,
INDIA-500084.
It looks to me like you need to do a find & replace on your source before processing to get rid of those LF characters within double quotes.
Unfortunately, Informatica probably splits on LF into rows first, so you need to preprocess the source before Informatica reads it. Try using Command as Source and use sed, as sketched below.
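For example, a sed sketch (untested against your exact data) that keeps appending the next physical line to the current record until the double quotes balance out, replacing the embedded LF with a space:
sed -e ':a' -e '/^\([^"]*"[^"]*"\)*[^"]*$/!{N' -e 's/\n/ /' -e 'ba' -e '}' source.csv > fixed.csv
The pattern matches lines containing an even number of double quotes; any line that fails it is still inside an open quote, so the next line is pulled in and the join is retried.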
In the 'Config Object' tab of the session, set the override for 'Custom Properties' to MatchQuotesPastEndOfLine=Yes;
This makes the reader continue past the LF until it sees the closing quote.

LINES TERMINATED BY only supports newline '\n' right now

I have files where the columns are delimited by char(30) and the lines are delimited by char(31). I'm using these delimiters mainly because the columns may contain newlines (\n), so the default Hive line delimiter is not useful for us.
I have tried to change the line delimiter in Hive but get the error below:
LINES TERMINATED BY only supports newline '\n' right now.
Any suggestions?
Would writing a custom SerDe work?
Is there any plan to enhance this functionality in Hive in a new release?
Thanks
Not sure if this helps, or is the best answer, but when faced with this issue, what we ended up doing was setting the 'textinputformat.record.delimiter' MapReduce Java property to the value being used. In our case it was the string "{EOL}", but it could be any unique string for all practical purposes.
We set this in our Beeline shell, which allowed us to pull back the fields correctly. It should be noted that once we did this, we converted the data to Avro as fast as possible so we didn't need to explain to every user, and the user's baby brother, to set the {EOL} line delimiter.
set textinputformat.record.delimiter={EOL};
Here is the full example.
#example CSV data (fields broken by '^' and ends of lines broken by the string '{EOL}')
ID^TEXT
11111^Some THings WIth
New Lines in THem{EOL}11112^Some Other THings..,?{EOL}
111113^Some crazy thin
gs
just crazy{EOL}11114^And Some Normal THings.
#here is the CSV table we laid on top of the data ('\136' is the octal escape for '^')
CREATE EXTERNAL TABLE CRAZY_DATA_CSV
(
ID STRING,
TEXT STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\136'
STORED AS TEXTFILE
LOCATION '/archive/CRAZY_DATA_CSV'
TBLPROPERTIES('skip.header.line.count'='1');
#here is the Avro table which we'll migrate into below.
CREATE EXTERNAL TABLE CRAZY_DATA_AVRO
(
ID STRING,
TEXT STRING
)
STORED AS AVRO
LOCATION '/archive/CRAZY_DATA_AVRO'
TBLPROPERTIES ('avro.schema.url'='hdfs://nameservice/archive/avro_schemas/CRAZY_DATA.avsc');
#And finally, the magic is here. We set the custom delimiter and import into our Avro table.
set textinputformat.record.delimiter={EOL};
INSERT INTO TABLE CRAZY_DATA_AVRO SELECT * from CRAZY_DATA_CSV;
I worked it out by using the --hive-delims-replacement ' ' option during the Sqoop extract, so the characters \n, \001, and \r are replaced with a space in the columns.
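A sketch of what that Sqoop invocation might look like (the connection string, username, and table names are placeholders):
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username loader \
  --table SOURCE_TABLE \
  --hive-import \
  --hive-table target_db.target_table \
  --hive-delims-replacement ' '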

Hive doesn't separate TSV file properly

I have a TSV file and am trying to load it into Hive with:
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
The problematic part is that the file contains strings like "02\t\t\t".
Hive does not recognize those as tab-separated and doesn't split them.
I would like to know if there is a way to make Hive treat "\t" inside a string as a field separator as well. I have read a book about it and saw that there are no free SerDes for TSV either.
Example Input Line:
8 fp\t\t\t dj\t\t\t 5 amz ep 02\t\t\t ar\t
Cheers,

Creating a Hive table for handling a fixed-length file

I have a fixed-length file in HDFS on top of which I have to create an external table using regex.
My file is something like this:
12piyush34stack10
13pankaj21abcde41
I want to convert it into a table like:
key_column Value_column
---------- -----------------
1234stack 12piyush34stack10
1321stack 13pankaj21abcde41
I even tried substr with an insert, but I am unable to point to the key columns.
Please help with solving this problem.
I'm not sure how you used the regex external table, but that route cannot work out, since you would still need another substring operation.
If it were me, I would make a RegexSerDe table, create the two columns (key_column, value_column), and just specify the SerDe options as follows:
SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "((\\d\\d)\\w{6}(\\d\\d).*)",
  "output.format.string" = "%2$s%3$sstack %1$s"
)
The output format option writes the space-separated groups to the corresponding columns in order (Java format indices start at %1$s, so the whole line is wrapped in an outer group).
I haven't tested it yet; mind that single backslashes may not be interpreted correctly in Java, which is why the \\d and \\w above are doubled.
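One caveat: as far as I know, output.format.string is only honored by the contrib SerDe (org.apache.hadoop.hive.contrib.serde2.RegexSerDe), while the built-in serde2.RegexSerDe only deserializes via input.regex. A sketch that avoids output formatting altogether is to expose the capture groups as plain STRING columns and build the key in a query (the table name, column names, and LOCATION below are illustrative):
CREATE EXTERNAL TABLE fixed_width_raw (
  full_line STRING,  -- group 1: the whole record
  key_part1 STRING,  -- group 2: first two digits
  key_part2 STRING   -- group 3: two digits after the six word characters
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "((\\d\\d)\\w{6}(\\d\\d).*)"
)
STORED AS TEXTFILE
LOCATION '/data/fixed_width_raw';

SELECT concat(key_part1, key_part2, 'stack') AS key_column,
       full_line AS value_column
FROM fixed_width_raw;
For the sample row 12piyush34stack10, this yields key_column = 1234stack and value_column = 12piyush34stack10, as required.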

Delimiter file issues

I have a flat file without a fixed structure, like:
name,phone_num,Address
bob,8888,2nd main,5th floor,avenue road
Here the last column, Address, has the value 2nd main,5th floor,avenue road, but since the same delimiter (,) is also used for separating columns, I have no clue how to handle it.
The structure of the flat file may change from file to file.
How can such flat files be handled while importing using Informatica, SQL*Loader, or UTL files?
I will not have any access to the flat file; I should just read the data from it, and I can't edit the data in the flat file.
Using SQL*Loader:
load data
append
into table schema.table
fields terminated by '~' -- any character that never appears in the data, so the whole line lands in one field
trailing nullcols
(
line BOUNDFILLER, -- holds the entire record without loading it directly
name "regexp_substr(:line, '^[^,]+')", -- everything before the first comma
phone_num "regexp_substr(:line, '[^,]+', 1, 2)", -- the second comma-separated token
Address "regexp_replace(:line, '^.*?,.*?,')" -- strip the first two fields; the remainder is the address
)
You need to change your source file to enclose the fields in an enclosure character, e.g.:
name,phone_num,Address
bob,8888,^2nd main,5th floor,avenue road^
then in SQL*Loader you'd put:
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '^'
Just pick an enclosure character that doesn't normally appear in your data; a fuller control-file sketch follows below.
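For context, a minimal control-file sketch around that clause (the file name and column list are placeholders):
LOAD DATA
INFILE 'contacts.dat'
APPEND INTO TABLE schema.table
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '^'
TRAILING NULLCOLS
(name, phone_num, address)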
If you can get the source data enclosed within double quotes (or any quotes, for that matter), you can make use of the 'Optional Quotes' option in Informatica while reading from the flat file.
