Hadoop Remove unnecessary \n in the input files - hadoop

I have a large input file whose values are pipe-delimited, with 20 values per row. A newline character that comes after the 19th pipe marks the end of a record.
However, my input file contains \n not only after 19 pipes but also inside other values. A sample line looks like this:
101101|this\nis my sample|12547|sample\nxyz|......(19th pipe)|end of record\n
I am new to Hadoop and I don't know how to divide the lines to create key-value pairs based on this condition.
A related question: input splits are computed on the client side, and if I have to split the input file conditionally on the client side (one machine), won't that be very slow for such a large file? Please help.

In Hive, NULL column values are represented as "\N"; that is Hive's default behaviour. This is done to differentiate NULL from "NULL" (the string NULL).
If you don't want \N to appear in your export, you can use the COALESCE UDF.
Roughly, your query may look like this:
SELECT
COALESCE (my_column, '') AS my_column
FROM
my_table
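For the embedded newline problem in the question itself, one possible pre-processing step is to glue physical lines back together until a record holds 19 pipes. Below is a minimal Python sketch, assuming no field contains a literal pipe and the last (20th) field never contains a newline. merge_records and its parameters are names made up for this illustration, and embedded newlines are replaced with a space.

```python
def merge_records(lines, pipes_per_record=19, replacement=" "):
    """Yield logical records from physical lines.

    Assumption: a record is complete once it contains `pipes_per_record`
    pipe characters; any newline seen before that is part of a value.
    """
    buffer = ""
    for line in lines:
        buffer += line.rstrip("\n")
        if buffer.count("|") >= pipes_per_record:
            # Enough delimiters seen: this newline ends the record.
            yield buffer
            buffer = ""
        else:
            # Newline fell inside a value: swap it for the replacement text.
            buffer += replacement
    if buffer.strip():
        # Trailing incomplete record, emitted as-is so nothing is lost.
        yield buffer


if __name__ == "__main__":
    # Tiny demo with 3 pipes per record instead of 19.
    sample = ["1|a\n", "b|c|end\n", "2|x|y|z\n"]
    for record in merge_records(sample, pipes_per_record=3):
        print(record)
```

It streams over the input one line at a time, so the whole file never has to fit in memory.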

Related

how to speed up sort in hive

I would like to speed up a Hive job, but I do not know how to do it.
The data is about 200 GB of text, about 300,000,000 lines. I split it into 50 files in advance, so each file is about 4 GB.
I want a single file as the result of the sort, so I set the number of reducers to 1 and the number of mappers to 50.
Each line of the data consists of a word and its frequency.
Rows with the same word should be grouped and their frequencies summed.
All of the files are gzip files.
The process takes a few days to complete, and I would like to bring it down to a few hours if I can.
Which parameters should I change to speed up the process?
Thank you for your reply.
Yes, I defined an external Hive table pointing to an HDFS location.
Here is my pseudocode:
create external table A (`count` int, word string)
row format delimited fields terminated by '\t'
location 'HDFS path';
select sum(`count`) as total, word from A group by word sort by total desc;
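To make the intended result concrete (group by word, sum the frequencies, largest total first), here is a small Python sketch of the same logic on a few in-memory lines. It assumes each line is "count<TAB>word", matching the column order of the table above. sum_frequencies is a hypothetical helper, and this only illustrates the semantics, not a way to process 200 GB.

```python
from collections import Counter


def sum_frequencies(lines):
    """Sum the count per word and return (word, total) pairs, largest first."""
    totals = Counter()
    for line in lines:
        count, word = line.rstrip("\n").split("\t")
        totals[word] += int(count)
    # Sort by total descending; ties are broken alphabetically for stability.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    demo = ["3\tfoo\n", "2\tbar\n", "4\tfoo\n"]
    for word, total in sum_frequencies(demo):
        print(word, total)
```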

How to insert data into table in following scenario?

I am a newbie in Hadoop and I have to add data into a table in Hive.
I have data from the FIX 4.4 protocol, something like this:
8=FIX.4.4<SHO>9=85<SHO>35=A<SHO>34=524<SHO>49=SSGMdemo<SHO>52=20150410-15:25:55.795<SHO>56=Trumid<SHO>98=0<SHO>108=30<SHO>554=TruMid456<SHO>10=154<SHO>
8=FIX.4.4<SHO>9=69<SHO>35=A<SHO>34=1<SHO>49=Trumid<SHO>52=20150410-15:25:58.148<SHO>56=SSGMdemo<SHO>98=0<SHO>108=30<SHO>10=093<SHO>
8=FIX.4.4<SHO>9=66<SHO>35=2<SHO>34=2<SHO>49=Trumid<SHO>52=20150410-15:25:58.148<SHO>56=SSGMdemo<SHO>7=1<SHO>16=0<SHO>10=174<SHO>
8=FIX.4.4<SHO>9=110<SHO>35=5<SHO>34=525<SHO>49=SSGMdemo<SHO>52=20150410-15:25:58.164<SHO>56=Trumid<SHO>58=MsgSeqNum too low, expecting 361 but received 1<SHO>10=195<SHO>
First, for 8=FIX.4.4 I want 8 as the column name and FIX.4.4 as the value of that column; for 9=66, 9 should be the column name and 66 the value, and so on. The raw file contains many rows like this.
Second, each further line should be handled the same way, with its data appended as the next row of the Hive table.
I cannot figure out how to do this.
Any help would be appreciated.
I would first create a tab-separated file containing this data. I suggested using a regex in the comments, but if that is not your strong suit you can just split on the <SHO> tag and on =. Since you did not specify the language you want to use, I will suggest a 'solution' in Python.
The code below shows how to write one of your input lines to a tab-separated file.
This can easily be extended to support multiple lines, or to append lines to the file once it has been created.
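import csv
# One raw FIX line; <SHO> is the field separator as it appears in the question
line = "8=FIX.4.4<SHO>9=85<SHO>35=A<SHO>34=524<SHO>49=SSGMdemo<SHO>52=20150410-15:25:55.795<SHO>56=Trumid<SHO>98=0<SHO>108=30<SHO>554=TruMid456<SHO>10=154<SHO>"
fields = line.split('<SHO>')[:-1]  # Don't include last element since it's empty
# Split each "col=val" string only once, so values containing '=' stay intact
list_of_pairs = [tuple(f.split('=', 1)) for f in fields]
d = dict(list_of_pairs)
# Python 3: open in text mode with newline='' as the csv module expects
with open('test.tsv', 'w', newline='') as c:
    cw = csv.writer(c, delimiter='\t')
    cw.writerow(d.keys())  # Comment this if you don't want to have a header
    cw.writerow(d.values())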
What this code does is first split the input line on <SHO>, which creates a list of col=val strings. Next it creates a list of tuple pairs where each tuple is (col, val).
Then it creates a dictionary from this, which is not strictly necessary but might help if you want to extend the code to more lines.
Finally it creates a tab-separated-value file, test.tsv, containing a header and the values on the next line.
You now have a file which Hive can understand.
I am sure you can find a lot of articles on importing CSV or tab-separated-value files, but I will give you an example of a generic Hive query you can use to import this file once it is in HDFS.
CREATE TABLE if not exists [database].[table]
([Col1] Integer, [Col2] Integer, [Col3] String,...)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
TBLPROPERTIES('skip.header.line.count'='1');
LOAD DATA inpath '[HDFS path]'
overwrite INTO TABLE [database].[table];
Hope this gives you a better idea on how to proceed.
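As a rough idea of the multi-line extension mentioned above: different FIX messages carry different tags, so the header has to be the union of all tags seen. The sketch below does this with csv.DictWriter and leaves missing tags empty. parse_fix_line and write_tsv are hypothetical names, and <SHO> is taken literally from the question's sample.

```python
import csv
import io


def parse_fix_line(line, sep="<SHO>"):
    """Turn one raw line into a {tag: value} dict, skipping malformed fields."""
    pairs = (f.split("=", 1) for f in line.strip().split(sep) if "=" in f)
    return {tag: value for tag, value in pairs}


def write_tsv(lines, out):
    """Write all lines as TSV to the file-like `out`; return the column order."""
    records = [parse_fix_line(l) for l in lines if l.strip()]
    columns = []
    for record in records:
        for tag in record:
            if tag not in columns:
                columns.append(tag)  # keep first-seen order of tags
    writer = csv.DictWriter(out, fieldnames=columns, delimiter="\t",
                            restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return columns


if __name__ == "__main__":
    raw = [
        "8=FIX.4.4<SHO>9=85<SHO>35=A<SHO>",
        "8=FIX.4.4<SHO>9=66<SHO>7=1<SHO>",
    ]
    buf = io.StringIO()
    write_tsv(raw, buf)
    print(buf.getvalue())
```

To write to disk instead of a string buffer, pass a file opened with open('test.tsv', 'w', newline='').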
Copy the file to HDFS and create an external table with a single column (c8), then use the select statement below to extract each column:
create external table tablename(
c8 string )
STORED AS TEXTFILE
location 'HDFS path';
select regexp_extract(c8,'8=(.*?)<SHO>',1) as c8,
regexp_extract(c8,'9=(.*?)<SHO>',1) as c9,
regexp_extract(c8,'35=(.*?)<SHO>',1) as c35,
regexp_extract(c8,'34=(.*?)<SHO>',1) as c34,
regexp_extract(c8,'49=(.*?)<SHO>',1) as c49,
regexp_extract(c8,'52=(.*?)<SHO>',1) as c52,
regexp_extract(c8,'56=(.*?)<SHO>',1) as c56,
regexp_extract(c8,'98=(.*?)<SHO>',1) as c98,
regexp_extract(c8,'108=(.*?)<SHO>',1) as c108,
regexp_extract(c8,'554=(.*?)<SHO>',1) as c554,
regexp_extract(c8,'10=(.*?)<SHO>',1) as c10
from tablename
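One thing to watch with unanchored patterns like 'N=(.*?)<SHO>': a short tag number can also match the tail of a longer one (for example 6= inside 56=). The Python sketch below shows the effect on one of the sample lines and a possible fix that requires the tag to sit at the start of the line or right after a separator. extract_naive and extract_anchored are names invented for this illustration; the same anchoring idea should carry over to the regex passed to regexp_extract, but that is not tested here.

```python
import re

LINE = ("8=FIX.4.4<SHO>9=66<SHO>35=2<SHO>34=2<SHO>49=Trumid<SHO>"
        "52=20150410-15:25:58.148<SHO>56=SSGMdemo<SHO>7=1<SHO>16=0<SHO>10=174<SHO>")


def extract_naive(line, tag, sep="<SHO>"):
    """Unanchored match: finds the first 'tag=' anywhere in the line."""
    m = re.search(re.escape(tag) + "=(.*?)" + re.escape(sep), line)
    return m.group(1) if m else ""


def extract_anchored(line, tag, sep="<SHO>"):
    """Match 'tag=' only at line start or directly after a separator."""
    pattern = "(?:^|" + re.escape(sep) + ")" + re.escape(tag) + "=(.*?)" + re.escape(sep)
    m = re.search(pattern, line)
    return m.group(1) if m else ""


if __name__ == "__main__":
    # Tag 6 is not present in LINE, yet the naive pattern hits "56=SSGMdemo".
    print(repr(extract_naive(LINE, "6")))
    print(repr(extract_anchored(LINE, "6")))
    print(repr(extract_anchored(LINE, "10")))
```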

HIVE: apply delimiter until a specified column

I am trying to move data from a file into a hive table. The data in the file looks something like this:-
StringA StringB StringC StringD StringE
where the strings are separated by spaces. The problem is that I want separate columns for StringA, StringB and StringC, and one column for StringD onwards, i.e. StringD and StringE should be part of the same column. If I use
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ', Hive would produce separate columns for StringD and StringE. (StringD and StringE contain spaces within themselves, whereas the other strings do not.)
Is there any special syntax in Hive to achieve this, or do I need to pre-process my data file in some way?
Use a regular expression:
https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-ApacheWeblogData
With a regex you can define where a space acts as a delimiter and where it is part of the data.
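The linked example uses Hive's RegexSerDe; the pattern itself can be tried out in Python first. Here is a sketch, assuming exactly one space between the first three fields and that everything after the third space belongs to the last column. split_line is a hypothetical helper.

```python
import re

# Three space-free fields, then "the rest of the line" as a fourth group.
PATTERN = re.compile(r"^(\S+) (\S+) (\S+) (.*)$")


def split_line(line):
    """Return the four captured columns, or None if the line does not match."""
    m = PATTERN.match(line.rstrip("\n"))
    return m.groups() if m else None


if __name__ == "__main__":
    print(split_line("StringA StringB StringC StringD StringE"))
```

If a regex feels heavy for pre-processing, line.split(' ', 3) gives the same four pieces for well-formed lines.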

LINES TERMINATED BY only supports newline '\n' right now

I have files where the columns are delimited by char(30) and the lines are delimited by char(31). I'm using these delimiters mainly because the columns may contain newlines (\n), so Hive's default line delimiter is not useful for us.
I have tried to change the line delimiter in Hive but get the error below:
LINES TERMINATED BY only supports newline '\n' right now.
Any suggestions?
Would writing a custom SerDe work?
Is there any plan to enhance this functionality in Hive in new releases?
Thanks
Not sure if this helps, or is the best answer, but when faced with this issue, what we ended up doing is setting the 'textinputformat.record.delimiter' Map/Reduce java property to the value being used. In our case it was a string "{EOL}", but could be any unique string for all practical purposes.
We set this in our beeline shell which allowed us to pull back the fields correctly. It should be noted that once we did this, we converted the data to Avro as fast as possible so we didn't need to explain to every user, and the user's baby brother, to set the {EOL} line delimiter.
set textinputformat.record.delimiter={EOL};
Here is the full example.
#example CSV data (fields broken by '^' and end of lines broken by the String '{EOL}')
ID^TEXT
11111^Some THings WIth
New Lines in THem{EOL}11112^Some Other THings..,?{EOL}
111113^Some crazy thin
gs
just crazy{EOL}11114^And Some Normal THings.
#here is the CSV table we laid on top of the data
CREATE EXTERNAL TABLE CRAZY_DATA_CSV
(
ID STRING,
TEXT STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\136'
STORED AS TEXTFILE
LOCATION '/archive/CRAZY_DATA_CSV'
TBLPROPERTIES('skip.header.line.count'='1');
#here is the Avro table which we'll migrate into below.
CREATE EXTERNAL TABLE CRAZY_DATA_AVRO
(
ID STRING,
TEXT STRING
)
STORED AS AVRO
LOCATION '/archive/CRAZY_DATA_AVRO'
TBLPROPERTIES ('avro.schema.url'='hdfs://nameservice/archive/avro_schemas/CRAZY_DATA.avsc');
#And finally, the magic is here. We set the custom delimiter and import into our Avro table.
set textinputformat.record.delimiter={EOL};
INSERT INTO TABLE CRAZY_DATA_AVRO SELECT * from CRAZY_DATA_CSV;
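To see what a custom record delimiter changes, here is a rough Python sketch of the same splitting logic outside Hive. It only imitates the idea (records end at {EOL}, fields split on ^, newlines inside a record are kept); it is not how Hive implements it, and split_records is a name invented here.

```python
def split_records(text, record_delim="{EOL}", field_delim="^"):
    """Split raw text into records on `record_delim`, then fields on `field_delim`.

    Newlines inside a record are left untouched, which is the whole point
    of using a record delimiter other than '\n'.
    """
    records = []
    for chunk in text.split(record_delim):
        if not chunk.strip():
            continue  # skip the empty piece after a trailing delimiter
        records.append(chunk.split(field_delim))
    return records


if __name__ == "__main__":
    raw = "11111^Some THings WIth\nNew Lines in THem{EOL}11112^Some Other THings..,?{EOL}"
    for record in split_records(raw):
        print(record)
```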
I worked it out by passing the option --hive-delims-replacement ' ' to Sqoop during the extract, so that the characters \n, \001 and \r in the columns are replaced with a space.

Showing only actual column data in SQL*Plus

I'm spooling out delimited text files from SQL*Plus, but every column is printed as the full size per its definition, rather than the data actually in that row.
For instance, a column defined as 10 characters, with a row value of "test", is printing out as "test " instead of "test". I can confirm this by selecting the column along with the value of its LENGTH function. It prints "test |4".
It kind of defeats the purpose of a delimiter if it forces me into fixed-width. Is there a SET option that will fix this, or some other way to make it print only the actual column data?
I don't want to add TRIM to every column, because if a value is actually stored with spaces I want to be able to keep them.
Thanks
I have seen many SQL*Plus scripts that create text files like this:
select A || ';' || B || ';' || C || ';' || D
from T
where ...
It's a strong indication to me that you can't just switch to variable length output with a SET command.
Instead of ';' you can of course use any other delimiter. And it's up to your query to properly escape any characters that could be confused with a delimiter or a line feed.
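To illustrate the escaping point, here is a small Python sketch of what "properly escaping" a delimited row can look like once the data is outside the database. It relies on the standard csv module, which quotes only values that contain the delimiter, a quote, or a line break, and leaves stored trailing spaces alone. format_row and the sample values are made up for the illustration.

```python
import csv
import io


def format_row(values, delimiter=";"):
    """Return one delimited line; values containing the delimiter get quoted."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator="\n")
    writer.writerow(values)
    return buf.getvalue()


if __name__ == "__main__":
    # "a;b" contains the delimiter; "x  " has trailing spaces that must survive.
    print(format_row(["test", "a;b", "x  "]), end="")
```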
Generally, I'd forget SQL*Plus as a method for getting CSV out of Oracle.
Tom Kyte has written a nice little Pro*C unloader.
Personally, I've written a utility which does something similar, but in Perl.
