HIVE: apply delimiter until a specified column - hadoop

I am trying to load data from a file into a Hive table. The data in the file looks something like this:
StringA StringB StringC StringD StringE
where each string is separated by a space. I want separate columns for StringA, StringB and StringC, and a single column for everything from StringD onwards, i.e. StringD and StringE should be part of the same column. If I use
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ', Hive produces separate columns for StringD and StringE. (StringD and StringE contain spaces themselves, whereas the other strings do not.)
Is there any special syntax in Hive to achieve this, or do I need to pre-process my data file in some way?

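Use a regular expression, as in the Apache weblog example in the Hive Getting Started guide:
https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-ApacheWeblogData
With a regex-based SerDe you can define where a space acts as a delimiter and where it is part of the data.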
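To illustrate the idea, here is a minimal Python sketch of a pattern that treats the first three spaces as delimiters and leaves the rest of the line as one field. The names ROW_PATTERN and split_row are hypothetical, and the pattern is only an assumption about what a suitable "input.regex" for a regex-based SerDe might look like; it has not been tested against Hive itself.

```python
import re

# Three space-free tokens, each followed by one space, then everything
# else on the line captured as a single fourth field.
ROW_PATTERN = re.compile(r"^(\S+) (\S+) (\S+) (.*)$")


def split_row(line):
    """Return the four columns, or None when the line does not match."""
    match = ROW_PATTERN.match(line.rstrip("\n"))
    return list(match.groups()) if match else None


if __name__ == "__main__":
    print(split_row("StringA StringB StringC String D part String E part"))
```

Lines with fewer than four space-separated tokens do not match and yield None here; how such rows should be handled in the real table is a separate decision.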

Related

NiFi SplitText processor escape character

I use the SplitText processor to split a multi-statement HQL file into single HQL statements on semicolons before sending them to the PutHiveQL processor.
My problem is that I need to concatenate some fields and separate them with ;, so I want SplitText to ignore that particular semicolon.
I tried to escape it with a backslash, for example:
drop table if exists mytable;
create external table mytable as
select CONCAT_WS('\;\',field1,field2.field3) as concatfields
from oldtable;
This results in the following flow files:
flowfile1
drop table if exists mytable;
flowfile2
create external table mytable as
select CONCAT_WS('\;
flowfile3
\',field1,field2.field3) as concatfields
from oldtable;
What I want is for the semicolon in CONCAT_WS('\;\',field1,field2.field3) as concatfields to be treated as escaped and not split on.
Is that possible?
Are you using SplitContent? I thought SplitText only split on line boundaries. If you use SplitContent you should be able to split on ;\n (use Shift+Enter to input a newline character) and choose Keep Byte Sequence. This should split when a semicolon ends a line but keep your escaped semicolon intact.
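The effect of splitting on ";\n" and not on every ";" can be sketched outside NiFi. The Python below is only a local illustration of the idea, not NiFi's implementation; split_statements is a hypothetical helper, and it keeps the semicolon while dropping the newline, which is an assumption about the desired output.

```python
def split_statements(script):
    """Split on ';' followed by a newline, keeping the ';' on each statement."""
    delimiter = ";\n"
    statements = []
    start = 0
    while True:
        end = script.find(delimiter, start)
        if end == -1:
            break
        statements.append(script[start:end + 1])  # keep ';', drop the newline
        start = end + len(delimiter)
    tail = script[start:].strip()
    if tail:
        statements.append(tail)  # last statement without a trailing ';\n'
    return statements


if __name__ == "__main__":
    script = "\n".join([
        "drop table if exists mytable;",
        "create table mytable as",
        "select CONCAT_WS('\\;\\',field1,field2) as concatfields",
        "from oldtable;",
    ]) + "\n"
    for statement in split_statements(script):
        print(repr(statement))
```

The semicolon inside the CONCAT_WS separator is followed by a backslash, not a newline, so it stays inside the second statement.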

LINES TERMINATED BY only supports newline '\n' right now

I have files where the column is delimited by char(30) and the lines are delimited by char(31). I'm using these delimiters mainly because the columns may contain newline (\n), so the default line delimiter for hive is not useful for us.
I have tried to change the line delimiter in hive but get the error below:
LINES TERMINATED BY only supports newline '\n' right now.
Any suggestions?
Would writing a custom SerDe work?
Is there any plan to enhance this functionality in future Hive releases?
Thanks.
Not sure if this helps, or is the best answer, but when faced with this issue, we ended up setting the 'textinputformat.record.delimiter' MapReduce Java property to the delimiter being used. In our case it was the string "{EOL}", but it could be any string that never occurs in the data.
We set this in our beeline shell which allowed us to pull back the fields correctly. It should be noted that once we did this, we converted the data to Avro as fast as possible so we didn't need to explain to every user, and the user's baby brother, to set the {EOL} line delimiter.
set textinputformat.record.delimiter={EOL};
Here is the full example.
#example CSV data (fields separated by '^', records terminated by the string '{EOL}')
ID^TEXT
11111^Some THings WIth
New Lines in THem{EOL}11112^Some Other THings..,?{EOL}
111113^Some crazy thin
gs
just crazy{EOL}11114^And Some Normal THings.
#here is the CSV table we laid on top of the data
CREATE EXTERNAL TABLE CRAZY_DATA_CSV
(
ID STRING,
TEXT STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\136'
STORED AS TEXTFILE
LOCATION '/archive/CRAZY_DATA_CSV'
TBLPROPERTIES('skip.header.line.count'='1');
#here is the Avro table which we'll migrate into below.
CREATE EXTERNAL TABLE CRAZY_DATA_AVRO
(
ID STRING,
TEXT STRING
)
STORED AS AVRO
LOCATION '/archive/CRAZY_DATA_AVRO'
TBLPROPERTIES ('avro.schema.url'='hdfs://nameservice/archive/avro_schemas/CRAZY_DATA.avsc');
#And finally, the magic is here. We set the custom delimiter and import into our Avro table.
set textinputformat.record.delimiter={EOL};
INSERT INTO TABLE CRAZY_DATA_AVRO SELECT * from CRAZY_DATA_CSV;
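To make the effect of the custom record delimiter concrete, here is a small Python sketch that parses the same kind of data locally. It is not how Hive or MapReduce reads the file; parse_records is a hypothetical helper, and the sample input is a simplified assumption with no header row and no newline after "{EOL}".

```python
RECORD_DELIMITER = "{EOL}"  # same string as textinputformat.record.delimiter
FIELD_DELIMITER = "^"       # same character as FIELDS TERMINATED BY '\136'


def parse_records(raw):
    """Split raw text into records on {EOL}, then each record into fields."""
    rows = []
    for record in raw.split(RECORD_DELIMITER):
        if record == "":
            continue  # a trailing delimiter leaves an empty chunk
        rows.append(record.split(FIELD_DELIMITER))
    return rows


if __name__ == "__main__":
    raw = "11111^Some Things With\nNew Lines in Them{EOL}11112^Some Other Things{EOL}"
    for row in parse_records(raw):
        print(row)
```

Because records are cut on "{EOL}" and not on "\n", the newline inside the first TEXT value survives as part of the field.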
I worked around it by passing the option --hive-delims-replacement ' ' to Sqoop during the extract, so the characters \n, \r and \001 inside the columns are replaced with a space.

Hadoop Remove unnecessary \n in the input files

I have a large input file whose values are pipe-delimited, with 20 values per row. A newline character that comes after the 19th pipe marks the end of a record.
However, my input file has \n not only after the 19th pipe but also inside other values. A sample line looks like this:
101101|this\nis my sample|12547|sample\nxyz|......(19th pipe)|end of record\n
I am new to Hadoop and I don't know how to divide lines to create key-value pairs based on this condition.
Another related question: input splits are computed on the client side, so if I have to split the input file conditionally on the client (one machine), won't that be very slow for a large file? Please help.
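The rule described above (a record ends at the first newline after the 19th pipe) can be sketched as a local pre-processing step. This is a plain single-process Python illustration, not a Hadoop InputFormat or RecordReader; reassemble_records is a hypothetical helper, and it assumes that pipes never occur inside values.

```python
def reassemble_records(lines, pipes_per_record=19):
    """Join physical lines until a record holds the expected number of pipes.

    Yields each logical record as a list of fields. Newlines embedded in
    values are kept inside the field they belong to.
    """
    buffer = ""
    for line in lines:
        buffer += line
        if buffer.count("|") >= pipes_per_record and buffer.endswith("\n"):
            yield buffer.rstrip("\n").split("|")
            buffer = ""
    if buffer:
        # Last record without a trailing newline (or an incomplete one).
        yield buffer.rstrip("\n").split("|")


if __name__ == "__main__":
    # Shortened example with 4 values (3 pipes) per record.
    physical_lines = ["101101|this\n", "is my sample|12547|end of record\n"]
    for record in reassemble_records(physical_lines, pipes_per_record=3):
        print(record)
```

Passing an open file object as lines works the same way, since iterating a text file yields lines that still end in "\n".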
In Hive, NULL column values are represented as "\N"; that is Hive's default behaviour. This is done to differentiate NULL from "NULL" (the string NULL).
If you don't want \N to appear in your export, you can use the COALESCE UDF.
Roughly, your query may look like this:
SELECT
COALESCE (my_column, '') AS my_column
FROM
my_table

Hive doesn't separate tsv file properly

I have a TSV file and am trying to load it into Hive with:
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
The problematic part is that the file contains strings like "02\t\t\t".
Hive does not recognize these as tab-separated and doesn't split them.
I would like to know if there is a way to make Hive treat "\t" inside a string as a field separator as well. I have read a book about it and saw that there are no free SerDes for TSV either.
Example Input Line:
8 fp\t\t\t dj\t\t\t 5 amz ep 02\t\t\t ar\t
Cheers,

Delimiter file issues

I have a flat file without a fixed structure, like:
name,phone_num,Address
bob,8888,2nd main,5th floor,avenue road
Here the last column, Address, has the value 2nd main,5th floor,avenue road, but since the same delimiter (,) is also used to separate columns, I have no idea how to handle it.
The structure of the flat file may change from file to file.
How can I handle such flat files while importing with Informatica, SQL*Loader or UTL_FILE?
I only have read access to the flat file; I cannot edit its data.
Using SQL*Loader:
load data
append
into table schema.table
fields terminated by '~'
trailing nullcols
(
line BOUNDFILLER,
name "regexp_substr(:line, '^[^,]+')",
phone_num "regexp_substr(:line, '[^,]+', 1, 2)",
Address "regexp_replace(:line, '^.*?,.*?,')"
)
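The control file above reads each whole line into a filler and then cuts it on the first two commas only. The same logic can be sketched in Python; split_name_phone_address is a hypothetical helper, and it assumes that name and phone_num never contain a comma themselves.

```python
def split_name_phone_address(line):
    """Split on the first two commas only; the remainder is the address."""
    name, phone_num, address = line.rstrip("\n").split(",", 2)
    return name, phone_num, address


if __name__ == "__main__":
    print(split_name_phone_address("bob,8888,2nd main,5th floor,avenue road"))
```

A line with fewer than three comma-separated parts raises ValueError in this sketch, so malformed rows are not silently accepted.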
You need to change your source file to enclose the fields in an enclosure character, e.g.:
name,phone_num,Address
bob,8888,^2nd main,5th floor,avenue road^
Then in SQL*Loader you'd put:
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '^'
Just pick an enclosure character that doesn't normally appear in your data.
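For comparison, the same "optionally enclosed by" behaviour can be tried out with Python's standard csv module. This is only a sketch of the parsing rule, not of SQL*Loader; read_enclosed is a hypothetical helper and '^' is the enclosure character from the example above.

```python
import csv
import io


def read_enclosed(text, enclosure="^"):
    """Parse comma-separated text whose fields may be enclosed in `enclosure`."""
    reader = csv.reader(io.StringIO(text), delimiter=",", quotechar=enclosure)
    return [row for row in reader]


if __name__ == "__main__":
    sample = "name,phone_num,Address\nbob,8888,^2nd main,5th floor,avenue road^\n"
    for row in read_enclosed(sample):
        print(row)
```

Commas inside the enclosed Address value are kept as data, while unenclosed fields are split as usual.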
If you can get the source data enclosed in double quotes (or any quotes, for that matter), you can use the 'Optional Quotes' option in Informatica when reading from the flat file.
