Creating Hive table for handling fixed length file - hadoop

I have a fixed-length file in HDFS on top of which I have to create an external table using a regex.
My file is something like this:
12piyush34stack10
13pankaj21abcde41
I want to convert it into a table like:
key_column Value_column
---------- -----------------
1234stack 12piyush34stack10
1321stack 13pankaj21abcde41
I even tried using substr with an insert, but I am unable to point to the key_column values.
Please help me solve this problem.

I'm not sure why your regex external table did not work out, but that approach on its own would still need another substring operation on top of it.
If it were me, I would create a RegexSerDe table with two columns (key_column, Value_column) and specify the SerDe options as follows:
SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "((\\d{2})\\w{6}(\\d{2}).*)",
  "output.format.string" = "%2$s%3$sstack %1$s"
)
The output format option writes the captured groups, space separated, to the corresponding columns in order.
I haven't tested it yet; be aware that the backslashes have to be doubled so that Java interprets the regex correctly.
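For completeness, here is a minimal end-to-end sketch of that idea (the table name, column names and HDFS location are mine, so adjust them); it captures the whole line plus the two digit pairs and builds key_column in the query rather than relying on output.format.string:
CREATE EXTERNAL TABLE fixed_len_raw (
  value_column STRING,
  k1           STRING,
  k2           STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "((\\d{2})\\w{6}(\\d{2}).*)"
)
STORED AS TEXTFILE
LOCATION '/path/to/fixed_length_files';

SELECT concat(k1, k2, 'stack') AS key_column,
       value_column
FROM fixed_len_raw;
Group 1 is the whole line and groups 2 and 3 are the digit pairs at positions 1-2 and 9-10, so for 12piyush34stack10 the concat produces 1234stack next to the original line.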

Related

sum() function gives wrong answer in hiveql

I was playing around with a simple dataset that you can find here.
No matter what I do, calling the SUM() aggregate function on the 4th column of the given data set returns the wrong answer.
Here is the exact code that I have used:
create database beep_boop;
use beep_boop;
create table cause (year INT, sex STRING, cause STRING, value INT)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile
tblproperties("skip.header.line.count" = "1");
load data inpath '/user/verterse/CauseofDeath.csv' into table cause;
select sum(value) from cause;
The answer that I get is 11478567 as shown in the screenshot here.
But using the SUM() in MS Excel gives an answer of 12745563.
I tried deleting the table/database and recreating them from scratch. I tried uploading the CSV file again. I tried different data types like INT and BIGINT for the value column. I tried skipping and not skipping the header line. Nothing works. I also know that the file is being read completely, because select count(*) from cause; returns the correct count of 1016.
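One more check that might be worth adding: in Hive, values that cannot be parsed into the declared column type are loaded as NULL, and SUM() silently skips NULLs, so counting them would show whether part of the column is not being read as a number at all:
select count(*) from cause where value is null;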
P.S.: I am new to Hadoop, Hive and big data in general.

How to insert data into table in following scenario?

I am a newbie in Hadoop and I have to add data into a table in Hive.
I have data from the FIX 4.4 protocol, something like this...
8=FIX.4.4<SHO>9=85<SHO>35=A<SHO>34=524<SHO>49=SSGMdemo<SHO>52=20150410-15:25:55.795<SHO>56=Trumid<SHO>98=0<SHO>108=30<SHO>554=TruMid456<SHO>10=154<SHO>
8=FIX.4.4<SHO>9=69<SHO>35=A<SHO>34=1<SHO>49=Trumid<SHO>52=20150410-15:25:58.148<SHO>56=SSGMdemo<SHO>98=0<SHO>108=30<SHO>10=093<SHO>
8=FIX.4.4<SHO>9=66<SHO>35=2<SHO>34=2<SHO>49=Trumid<SHO>52=20150410-15:25:58.148<SHO>56=SSGMdemo<SHO>7=1<SHO>16=0<SHO>10=174<SHO>
8=FIX.4.4<SHO>9=110<SHO>35=5<SHO>34=525<SHO>49=SSGMdemo<SHO>52=20150410-15:25:58.164<SHO>56=Trumid<SHO>58=MsgSeqNum too low, expecting 361 but received 1<SHO>10=195<SHO>
Firstly, what I want is: in 8=FIX.4.4, 8 should become a column name and FIX.4.4 the value of that column; in 9=66, 9 should be the column name and 66 its value, and so on. There are many rows like this in the raw file.
Secondly, the same should happen for the next row, and that data should be appended as the next row of the table in Hive.
I cannot figure out what I should do now.
Any help would be appreciated.
I would first create a tab-separated file containing this data. I suggested using a regex in the comments, but if that is not your strong suit you can just split on the <SHO> tag and on =. Since you did not specify the language you want to use, I will suggest a 'solution' in Python.
The code below shows you how to write one of your input lines to a tab-separated file.
This can easily be extended to handle multiple lines, or to append lines to the file once it has been created.
import csv

line = "8=FIX.4.4<SHO>9=85<SHO>35=A<SHO>34=524<SHO>49=SSGMdemo<SHO>52=20150410-15:25:55.795<SHO>56=Trumid<SHO>98=0<SHO>108=30<SHO>554=TruMid456<SHO>10=154<SHO>"
# Don't include the last element since it is empty
fields = line.split('<SHO>')[:-1]
# Build (col, val) tuples from the col=val strings
list_of_pairs = [tuple(f.split('=')) for f in fields]
d = dict(list_of_pairs)
with open('test.tsv', 'w', newline='') as c:
    cw = csv.writer(c, delimiter='\t')
    cw.writerow(d.keys())  # Comment this out if you don't want a header
    cw.writerow(d.values())
What this code does is first split the input line on <SHO>, creating a list of col=val strings. It then creates a list of tuples where each tuple is (col, val).
Then it creates a dictionary from this, which is not strictly necessary but might help you if you want to extend the code for more lines.
Next I create a tab-separated-value file test.tsv containing a header and the values in the next line.
This means now you have a file which Hive can understand.
I am sure you can find a lot of articles on importing CSV or tab-separated-value files, but I will give you an example of a generic Hive query you can use to import this file once it is in HDFS.
CREATE TABLE IF NOT EXISTS [database].[table]
([Col1] INT, [Col2] INT, [Col3] STRING, ...)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
TBLPROPERTIES('skip.header.line.count'='1');
LOAD DATA inpath '[HDFS path]'
overwrite INTO TABLE [database].[table];
Hope this gives you a better idea on how to proceed.
Copy the file to HDFS and create an external table with a single column (C8), then use the select statement below to extract each column:
create external table tablename(
c8 string )
STORED AS TEXTFILE
location 'HDFS path';
select regexp_extract(c8,'8=(.*?)<SHO>',1) as c8,
regexp_extract(c8,'9=(.*?)<SHO>',1) as c9,
regexp_extract(c8,'35=(.*?)<SHO>',1) as c35,
regexp_extract(c8,'34=(.*?)<SHO>',1) as c34,
regexp_extract(c8,'49=(.*?)<SHO>',1) as c49,
regexp_extract(c8,'52=(.*?)<SHO>',1) as c52,
regexp_extract(c8,'56=(.*?)<SHO>',1) as c56,
regexp_extract(c8,'98=(.*?)<SHO>',1) as c98,
regexp_extract(c8,'108=(.*?)<SHO>',1) as c108,
regexp_extract(c8,'554=(.*?)<SHO>',1) as c554,
regexp_extract(c8,'10=(.*?)<SHO>',1) as c10
from tablename;
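If you want the parsed fields persisted instead of being recomputed on every query, the same pattern can feed a CREATE TABLE AS SELECT; here is a sketch (the target table name is mine, and the remaining tags follow the same regexp_extract pattern):
create table fix_messages as
select regexp_extract(c8,'8=(.*?)<SHO>',1)  as c8,
       regexp_extract(c8,'9=(.*?)<SHO>',1)  as c9,
       regexp_extract(c8,'35=(.*?)<SHO>',1) as c35,
       regexp_extract(c8,'10=(.*?)<SHO>',1) as c10
from tablename;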

How to perform an add functionality in sql loader file

I have a fixed-length data file a.dat with the data below in it:
1234544550002200011000330006600000
My focus is on two specific positions:
POSITION(1:4)
POSITION(5:8)
I want to add the values at these two positions and insert the result into a field named QTY in XYZ_Table.
I am trying the following in my CTL file, but it fails and I don't know how to pursue it further.
LOAD DATA
INFILE '$SOME_DATA/a.dat'
APPEND
PRESERVE BLANKS
INTO TABLE XYZ_Table
(QTY POSITION(1:4)+POSITION(5:8) "to_number(:QTY)")
I need to achieve this addition in SQL*Loader only.
If the above approach is not possible, it would be great if you could help me with a different one.
P.S.: What I am trying to achieve is just one part of a bigger CTL file.
You need to define the positions you want to add together, but not load into columns of their own, as BOUNDFILLER fields, which means they are not loaded but are remembered for use in an expression later. Then use them like this:
LOAD DATA
infile test.dat
append
preserve blanks
INTO TABLE X_test
TRAILING NULLCOLS
(val_1 BOUNDFILLER position(1:4)
,val_2 BOUNDFILLER position(5:8)
,qty ":val_1 + :val_2"
)

LINES TERMINATED BY only supports newline '\n' right now

I have files where the columns are delimited by char(30) and the lines are delimited by char(31). I'm using these delimiters mainly because the columns may contain newlines (\n), so the default line delimiter for Hive is not useful for us.
I have tried to change the line delimiter in Hive but get the error below:
LINES TERMINATED BY only supports newline '\n' right now.
Any suggestions?
Would writing a custom SerDe work?
Is there any plan to enhance this functionality in Hive in new releases?
Thanks
Not sure if this helps or is the best answer, but when faced with this issue, what we ended up doing was setting the 'textinputformat.record.delimiter' MapReduce Java property to the value being used. In our case it was the string "{EOL}", but it could be any unique string for all practical purposes.
We set this in our beeline shell, which allowed us to pull back the fields correctly. It should be noted that once we did this, we converted the data to Avro as fast as possible so we didn't have to explain to every user (and the user's baby brother) that they needed to set the {EOL} line delimiter.
set textinputformat.record.delimiter={EOL};
Here is the full example.
#example CSV data (fields broken by '^' and ends of lines broken by the string '{EOL}')
ID^TEXT
11111^Some THings WIth
New Lines in THem{EOL}11112^Some Other THings..,?{EOL}
111113^Some crazy thin
gs
just crazy{EOL}11114^And Some Normal THings.
#here is the CSV table we laid on top of the data
CREATE EXTERNAL TABLE CRAZY_DATA_CSV
(
ID STRING,
TEXT STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\136'
STORED AS TEXTFILE
LOCATION '/archive/CRAZY_DATA_CSV'
TBLPROPERTIES('skip.header.line.count'='1');
#here is the Avro table which we'll migrate into below.
CREATE EXTERNAL TABLE CRAZY_DATA_AVRO
(
ID STRING,
TEXT STRING
)
STORED AS AVRO
LOCATION '/archive/CRAZY_DATA_AVRO'
TBLPROPERTIES ('avro.schema.url'='hdfs://nameservice/archive/avro_schemas/CRAZY_DATA.avsc');
#And finally, the magic is here. We set the custom delimiter and import into our Avro table.
set textinputformat.record.delimiter={EOL};
INSERT INTO TABLE CRAZY_DATA_AVRO SELECT * from CRAZY_DATA_CSV;
I worked it out by using the --hive-delims-replacement ' ' option during the Sqoop extract, so that the characters \n, \001 and \r are replaced in the columns.
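For reference, here is a rough sketch of what that Sqoop invocation could look like (the connection string, credentials and table names are placeholders, not from the original setup):
sqoop import \
  --connect jdbc:mysql://dbhost/sourcedb \
  --username myuser -P \
  --table SOURCE_TABLE \
  --hive-import \
  --hive-table crazy_data \
  --hive-delims-replacement ' '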

bulk load UDT columns in Oracle

I have a table with the following structure:
create table my_table (
id integer,
point Point -- UDT made of two integers (x, y)
)
and I have a CSV file with the following data:
#id, point
1|(3, 5)
2|(7, 2)
3|(6, 2)
Now I want to bulk load this CSV into my table, but I can't find any information about how to handle the UDT in the Oracle sqlldr utility. Is it possible to use the bulk load utility with UDT columns?
I don't know if sqlldr can do this, but personally I would use an external table.
Attach the file as an external table (the file must be on the database server), and then insert the contents of the external table into the destination table, transforming the text into the UDT's two values as you go. The following select from dual should help you with the translation:
select
regexp_substr('(5, 678)', '[[:digit:]]+', 1, 1) x_point,
regexp_substr('(5, 678)', '[[:digit:]]+', 1, 2) y_point
from dual;
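To make that concrete, here is a rough sketch of the external-table route (the directory object, file name and staging table name are assumptions on my part, and it assumes Point has the default attribute-value constructor Point(x, y)):
-- external table laid over the CSV file; DATA_DIR must be an existing Oracle directory object
CREATE TABLE points_ext (
  id        NUMBER,
  point_raw VARCHAR2(100)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY DATA_DIR
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    SKIP 1
    FIELDS TERMINATED BY '|'
  )
  LOCATION ('points.csv')
);
-- split the "(x, y)" text and build the UDT while inserting
INSERT INTO my_table (id, point)
SELECT id,
       Point(to_number(regexp_substr(point_raw, '[[:digit:]]+', 1, 1)),
             to_number(regexp_substr(point_raw, '[[:digit:]]+', 1, 2)))
FROM points_ext;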
UPDATE
In sqlldr, you can transform fields using standard SQL expressions:
LOAD DATA
INFILE 'data.dat'
BADFILE 'bad_orders.txt'
APPEND
INTO TABLE test_tab
FIELDS TERMINATED BY "|"
( info,
  x_cord "regexp_substr(:x_cord, '[[:digit:]]+', 1, 1)"
)
The control file above will extract the first number from fields like (3, 4), but I cannot find a way to extract the second one - i.e. I am not sure whether it is possible to have the same field from the input file inserted into two columns.
If external tables are not an option for you, I would suggest either (1) transforming the file before loading, using sed, awk, Perl etc., or (2) loading the file with SQL*Loader into a temporary table and then having a second process transform the data and insert it into your final table. Another option is to look at how the file is generated - could you generate it so that the field you need to transform is repeated in two fields in the file, e.g.:
data|(1, 2)|(1, 2)
Maybe someone else will chip in with a way to get sqlldr to do what you want.
Solved the problem after more research: Oracle SQL*Loader does have this feature, and it is used by specifying a column object. The following was the solution:
LOAD DATA
INFILE *
INTO TABLE my_table
FIELDS TERMINATED BY "," OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
(
id,
point column object
(
x,
y
)
)
BEGINDATA
1,3,5
2,7,2
3,6,2
