How to insert data into table in following scenario? - hadoop

I am a newbie in Hadoop and I have to add data to a table in Hive.
I have data from the FIX 4.4 protocol, something like this:
8=FIX.4.4<SHO>9=85<SHO>35=A<SHO>34=524<SHO>49=SSGMdemo<SHO>52=20150410-15:25:55.795<SHO>56=Trumid<SHO>98=0<SHO>108=30<SHO>554=TruMid456<SHO>10=154<SHO>
8=FIX.4.4<SHO>9=69<SHO>35=A<SHO>34=1<SHO>49=Trumid<SHO>52=20150410-15:25:58.148<SHO>56=SSGMdemo<SHO>98=0<SHO>108=30<SHO>10=093<SHO>
8=FIX.4.4<SHO>9=66<SHO>35=2<SHO>34=2<SHO>49=Trumid<SHO>52=20150410-15:25:58.148<SHO>56=SSGMdemo<SHO>7=1<SHO>16=0<SHO>10=174<SHO>
8=FIX.4.4<SHO>9=110<SHO>35=5<SHO>34=525<SHO>49=SSGMdemo<SHO>52=20150410-15:25:58.164<SHO>56=Trumid<SHO>58=MsgSeqNum too low, expecting 361 but received 1<SHO>10=195<SHO>
Firstly, for each tag=value pair I want the tag to be the column name and the value to be that column's value. In 8=FIX.4.4, 8 should be the column name and FIX.4.4 its value; in 9=66, 9 should be the column name and 66 its value, and so on. There are many rows like this in the raw file.
Secondly, each subsequent line of the raw file should be handled the same way and appended as the next row of the Hive table.
I cannot work out how to do this.
Any help would be appreciated.

I would first create a tab-separated file containing this data. I suggested using a regex in the comments, but if that is not your strong suit you can just split on the <SHO> tag and on =. Since you did not specify the language you want to use, I will suggest a 'solution' in Python.
The code below shows how to write one of your input lines to a tab-separated file.
This can easily be extended to support multiple lines, or to append lines to the file once it has been created.
import csv
line = "8=FIX.4.4<SHO>9=85<SHO>35=A<SHO>34=524<SHO>49=SSGMdemo<SHO>52=20150410-15:25:55.795<SHO>56=Trumid<SHO>98=0<SHO>108=30<SHO>554=TruMid456<SHO>10=154<SHO>"
fields = line.split('<SHO>')[:-1]  # Don't include the last element since it's empty
# Split each "col=val" string on the first '=' only, giving (col, val) tuples
list_of_pairs = [tuple(f.split('=', 1)) for f in fields]
d = dict(list_of_pairs)
# Python 3: open in text mode with newline='' (Python 2 would use 'wb' instead)
with open('test.tsv', 'w', newline='') as c:
    cw = csv.writer(c, delimiter='\t')
    cw.writerow(d.keys())  # Comment this out if you don't want a header
    cw.writerow(d.values())
This code first splits the input line on <SHO>, which gives a list of col=val strings. Next it creates a list of tuples, where each tuple is (col, val).
Then it creates a dictionary from this, which is not strictly necessary but might help you if you want to extend the code for more lines.
Next I create a tab-separated-value file test.tsv containing a header and the values in the next line.
This means now you have a file which Hive can understand.
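Extending the single-line example to a whole file could look roughly like the sketch below. It assumes Python 3 and that <SHO> appears literally in the raw file, as in the question; the names parse_fix_line and write_tsv are made up for illustration. Because different FIX messages carry different tags, the columns are the union of all tags seen, and a tag missing from a message is written as an empty field.

```python
import csv

DELIM = "<SHO>"  # assumption: the separator appears literally, as in the question


def parse_fix_line(line):
    """Turn one raw FIX line into a {tag: value} dict."""
    fields = [f for f in line.strip().split(DELIM) if f]
    # Split on the first '=' only, so values containing '=' stay intact.
    return dict(f.split("=", 1) for f in fields)


def write_tsv(lines, path):
    """Write all messages to a TSV whose columns are the union of tags seen."""
    rows = [parse_fix_line(l) for l in lines if l.strip()]
    columns = []
    for row in rows:
        for tag in row:
            if tag not in columns:
                columns.append(tag)  # keep first-seen order
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=columns, delimiter="\t", restval="")
        writer.writeheader()
        writer.writerows(rows)
    return columns
```

Each row then has the same number of fields, which is what a delimited Hive table expects.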
I am sure you can find a lot of articles on importing CSV or tab-separated-value files, but I will give you an example of a generic Hive query you can use to import this file once it is in HDFS.
CREATE TABLE if not exists [database].[table]
([Col1] Integer, [Col2] Integer, [Col3] String,...)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
TBLPROPERTIES('skip.header.line.count'='1');
LOAD DATA inpath '[HDFS path]'
overwrite INTO TABLE [database].[table];
Hope this gives you a better idea on how to proceed.

Copy the file to HDFS and create an external table with a single column (c8), then use the select statement below to extract each column:
create external table tablename(
c8 string )
STORED AS TEXTFILE
location 'HDFS path';
select regexp_extract(c8,'8=(.*?)<SHO>',1) as c8,
regexp_extract(c8,'9=(.*?)<SHO>',1) as c9,
regexp_extract(c8,'35=(.*?)<SHO>',1) as c35,
regexp_extract(c8,'34=(.*?)<SHO>',1) as c34,
regexp_extract(c8,'49=(.*?)<SHO>',1) as c49,
regexp_extract(c8,'52=(.*?)<SHO>',1) as c52,
regexp_extract(c8,'56=(.*?)<SHO>',1) as c56,
regexp_extract(c8,'98=(.*?)<SHO>',1) as c98,
regexp_extract(c8,'108=(.*?)<SHO>',1) as c108,
regexp_extract(c8,'554=(.*?)<SHO>',1) as c554,
regexp_extract(c8,'10=(.*?)<SHO>',1) as c10
from tablename
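To see what each regexp_extract call is doing, here is a small Python sketch of the same lazy-match idea. It is not Hive code: extract_tag is a hypothetical helper, and the (?:^|<SHO>) prefix is an addition that is not in the query above, included so that a short tag such as 8 cannot match inside 108=. Returning None for a missing tag is a choice made for this sketch only.

```python
import re


def extract_tag(message, tag):
    """Return the value of a FIX tag, or None if the tag is absent."""
    # Anchor the tag at the start of the line or right after a delimiter,
    # then lazily capture everything up to the next <SHO>.
    pattern = r"(?:^|<SHO>)" + re.escape(tag) + r"=(.*?)<SHO>"
    match = re.search(pattern, message)
    return match.group(1) if match else None
```

Missing tags matter here because not every message carries every tag; for example, 554 only appears in the first sample line.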

Related

How to perform an add functionality in sql loader file

I have a fixed-length data file a.dat with the following data in it:
1234544550002200011000330006600000
My focus is on two specific positions:
POSITION(1:4)
POSITION(5:8)
I want to add the values in these two positions and insert the sum into a field named Qty in XYZ_Table.
I am trying the following in my CTL file, but it fails, and I don't know how to pursue it further.
LOAD DATA
INFILE '$SOME_DATA/a.dat'
APPEND
PRESERVE BLANKS
INTO TABLE XYZ_Table
(QTY POSITION(1:4)+POSITION(5:8) "to_number(:QTY)")
I need to achieve this addition functionality in SQL Loader only.
If the above methodology is not possible, it would be great if you can help me with a different approach.
P.S: What I am trying to achieve is just one part of the bigger CTL file.
You need to declare the positions you want to add together, but not load into their own columns, as BOUNDFILLER. That means "don't load this field, but remember its value for use in a later expression". Then use them like this:
LOAD DATA
infile test.dat
append
preserve blanks
INTO TABLE X_test
TRAILING NULLCOLS
(val_1 BOUNDFILLER position(1:4)
,val_2 BOUNDFILLER position(5:8)
,qty ":val_1 + :val_2"
)
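As a sanity check on what the expression should produce, the same arithmetic can be worked through in Python on the sample record from the question. This only illustrates the slicing and the sum; it is not part of the control file.

```python
record = "1234544550002200011000330006600000"  # sample line from a.dat

# SQL*Loader POSITION(start:end) is 1-based and inclusive,
# so POSITION(1:4) corresponds to the Python slice [0:4].
val_1 = int(record[0:4])  # POSITION(1:4) -> 1234
val_2 = int(record[4:8])  # POSITION(5:8) -> 5445
qty = val_1 + val_2       # 1234 + 5445 = 6679
print(val_1, val_2, qty)
```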

LINES TERMINATED BY only supports newline '\n' right now

I have files where the columns are delimited by char(30) and the lines are delimited by char(31). I'm using these delimiters mainly because the columns may contain newlines (\n), so the default line delimiter for Hive is not useful for us.
I have tried to change the line delimiter in Hive but get the error below:
LINES TERMINATED BY only supports newline '\n' right now.
Any suggestions?
Would writing a custom SerDe work?
Is there any plan to add this functionality to Hive in a future release?
Thanks.
Not sure if this helps, or is the best answer, but when faced with this issue, what we ended up doing was setting the 'textinputformat.record.delimiter' MapReduce Java property to the delimiter being used. In our case it was the string "{EOL}", but it could be any string that is unique in the data.
We set this in our beeline shell which allowed us to pull back the fields correctly. It should be noted that once we did this, we converted the data to Avro as fast as possible so we didn't need to explain to every user, and the user's baby brother, to set the {EOL} line delimiter.
set textinputformat.record.delimiter={EOL};
Here is the full example.
#example CSV data (fields separated by '^' and lines terminated by the string '{EOL}')
ID^TEXT
11111^Some THings WIth
New Lines in THem{EOL}11112^Some Other THings..,?{EOL}
111113^Some crazy thin
gs
just crazy{EOL}11114^And Some Normal THings.
#here is the CSV table we laid on top of the data
CREATE EXTERNAL TABLE CRAZY_DATA_CSV
(
ID STRING,
TEXT STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\136'
STORED AS TEXTFILE
LOCATION '/archive/CRAZY_DATA_CSV'
TBLPROPERTIES('skip.header.line.count'='1');
#here is the Avro table which we'll migrate into below.
CREATE EXTERNAL TABLE CRAZY_DATA_AVRO
(
ID STRING,
TEXT STRING
)
STORED AS AVRO
LOCATION '/archive/CRAZY_DATA_AVRO'
TBLPROPERTIES ('avro.schema.url'='hdfs://nameservice/archive/avro_schemas/CRAZY_DATA.avsc');
#And finally, the magic is here. We set the custom delimiter and import into our Avro table.
set textinputformat.record.delimiter={EOL};
INSERT INTO TABLE CRAZY_DATA_AVRO SELECT * from CRAZY_DATA_CSV;
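The effect of the custom record delimiter can be pictured with a short Python sketch: split on {EOL} first, then on the field separator, so newlines inside a field survive. This only models the idea on two of the sample records above; split_records is an illustrative name, and none of this is how Hive implements it internally.

```python
RECORD_DELIM = "{EOL}"
FIELD_DELIM = "^"

raw = (
    "11111^Some THings WIth\nNew Lines in THem{EOL}"
    "11112^Some Other THings..,?{EOL}"
)


def split_records(text):
    """Split on the record delimiter first, then on the field delimiter."""
    records = []
    for chunk in text.split(RECORD_DELIM):
        if not chunk.strip():
            continue  # skip the empty piece after the final delimiter
        record_id, body = chunk.split(FIELD_DELIM, 1)
        records.append((record_id.strip(), body))
    return records
```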
I worked around it by using the Sqoop option --hive-delims-replacement ' ' during the extract, so that the characters \n, \001 and \r inside the columns are replaced with a space.

insert timestamp of INFILE into a column from SQLLOADER

I have the following requirement.
I am calling a sqlldr script from a shell script for the CSV files present in a folder. Each file name has a timestamp attached to it.
I need to insert that timestamp into a column of the table. Please suggest how I can achieve this.
eg:
table:
t1(c1 varchar,c2 varchar,c3 timestamp);
control file :
load data
infile 'file.csv'
append
into table t1
fields terminated by "|" TRAILING NULLCOLS
( c1, c2)
csv_file : cat file_csv_101010112233.csv
1111|1
2222|2
OUTPUT :
select * from t1;
c1 c2 c3
1111 1 101010112233
2222 2 101010112233
Note: I don't want the system timestamp.
I think you will need a shell script wrapper around calling sqlldr. First alter the control file so the timestamp column has a placeholder like:
...
C3 CONSTANT REPLACE_ME,
...
And save it as a template.
The wrapper should back up the original CSV file, get the timestamp from the file name, then use something like sed to replace the "REPLACE_ME" text in the template control file with that timestamp and save the result as a working copy. Finally, it calls sqlldr using the working copy.
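The templating part of such a wrapper might look like the sketch below, written in Python rather than shell and sed. The function names are hypothetical, the file-name pattern is assumed from the example file_csv_101010112233.csv, and the call to sqlldr itself is left out.

```python
import re
from pathlib import Path


def timestamp_from_filename(filename):
    """Pull the trailing digits out of a name like file_csv_101010112233.csv."""
    match = re.search(r"_(\d+)\.csv$", filename)
    if match is None:
        raise ValueError(f"no timestamp found in {filename!r}")
    return match.group(1)


def render_control_file(template_text, csv_name):
    """Fill the REPLACE_ME placeholder with the timestamp from the file name."""
    return template_text.replace("REPLACE_ME", timestamp_from_filename(csv_name))


def write_working_copy(template_path, csv_name, working_path):
    """Write a per-file control file; sqlldr would then be run against it."""
    text = render_control_file(Path(template_path).read_text(), csv_name)
    Path(working_path).write_text(text)
    return working_path
```

Keeping the template untouched and writing a fresh working copy per file means a failed load never leaves a half-edited control file behind.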
I was thinking of other ways to do this and came up with one. May not be feasible for your environment but something to keep in mind anyway.
If you can get the data file name into a column (maybe a load_log table for example that would get populated at the start of the load), you could assign it like this by calling a function that returns the name:
C3 "package.function"
More info: SQL*Loader Field List Reference

Hadoop Remove unnecessary \n in the input files

I have a large input file whose values are pipe-delimited, with 20 values per row. A newline character that comes after the 19th pipe marks the end of a record.
But my input file has \n not only after 19 pipes but also inside the other values. A sample line looks like this:
101101|this\nis my sample|12547|sample\nxyz|......(19th pipe)|end of record\n
I am new to Hadoop and I don't know how to divide the lines to create key-value pairs based on this condition.
Another related question: input splitting happens on the client side, so if I have to split the input file conditionally on the client side (one machine), won't it be very slow for such a large file? Please help.
In Hive, NULL column values are represented as "\N"; that is Hive's default behaviour. This is done to differentiate NULL from "NULL" (the string NULL).
If you don't want \N to appear in your export, you can use the COALESCE UDF.
Roughly, your query may look like this:
SELECT
COALESCE (my_column, '') AS my_column
FROM
my_table

bulk load UDT columns in Oracle

I have a table with the following structure:
create table my_table (
id integer,
point Point -- UDT made of two integers (x, y)
)
and I have a CSV file with the following data:
#id, point
1|(3, 5)
2|(7, 2)
3|(6, 2)
Now I want to bulk load this CSV into my table, but I can't find any information about how to handle the UDT in the Oracle sqlldr utility. Is it possible to use the bulk load utility with UDT columns?
I don't know if sqlldr can do this, but personally I would use an external table.
Attach the file as an external table (the file must be on the database server), and then insert the contents of the external table into the destination table, splitting the UDT text into two values as you go. The following select from dual should help you with the translation:
select
regexp_substr('(5, 678)', '[[:digit:]]+', 1, 1) x_point,
regexp_substr('(5, 678)', '[[:digit:]]+', 1, 2) y_point
from dual;
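For comparison, the same "first run of digits, second run of digits" extraction can be sketched in Python. parse_point is an illustrative name, and like the regexp_substr calls above it assumes non-negative integers.

```python
import re


def parse_point(text):
    """Return (x, y) from a string such as '(5, 678)'."""
    digits = re.findall(r"[0-9]+", text)  # same idea as '[[:digit:]]+'
    if len(digits) != 2:
        raise ValueError(f"expected two integers in {text!r}")
    return int(digits[0]), int(digits[1])
```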
UPDATE
In sqlldr, you can transform fields using standard SQL expressions:
LOAD DATA
INFILE 'data.dat'
BADFILE 'bad_orders.txt'
APPEND
INTO TABLE test_tab
FIELDS TERMINATED BY "|"
( info,
x_cord "regexp_substr(:x_cord, '[[:digit:]]+', 1, 1)"
)
The control file above will extract the first number in fields like (3, 4), but I cannot find a way to extract the second number; that is, I am not sure whether it is possible to have the same field in the input file inserted into two columns.
If external tables are not an option for you, I would suggest either (1) transforming the file before loading, using sed, awk, Perl etc., or (2) loading the file with sqlldr into a temporary table and then having a second process transform the data and insert it into your final table. Another option is to look at how the file is generated: could you generate it so that the field you need to transform is repeated in two fields in the file, e.g.:
data|(1, 2)|(1, 2)
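Option (1), transforming the file before loading, could be sketched in Python as follows. It assumes the input looks exactly like the sample in the question (id|(x, y), with a # comment header) and flattens each line to id,x,y; flatten_line and flatten_lines are hypothetical names.

```python
import re

# One record: integer id, a pipe, then "(x, y)" with optional spaces.
LINE_RE = re.compile(r"^\s*(\d+)\s*\|\s*\(\s*(\d+)\s*,\s*(\d+)\s*\)\s*$")


def flatten_line(line):
    """Convert '1|(3, 5)' to '1,3,5'; return None for comments or bad lines."""
    match = LINE_RE.match(line)
    return ",".join(match.groups()) if match else None


def flatten_lines(lines):
    """Flatten every parseable line and silently skip the rest."""
    return [out for out in map(flatten_line, lines) if out is not None]
```

In a real load you would probably want to log the skipped lines rather than drop them silently.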
Maybe someone else will chip in with a way to get sqlldr to do what you want.
I solved the problem after more research. Oracle SQL*Loader does support this: you declare the field as a column object and list its attributes. The following was the solution:
LOAD DATA
INFILE *
INTO TABLE my_table
FIELDS TERMINATED BY "," OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
(
id,
point column object
(
x,
y
)
)
BEGINDATA
1,3,5
2,7,2
3,6,2
