String to Map Conversion Hive - hadoop

I have a table with four columns.
C1 C2 C3 C4
--------------------
x1 y1 z1 d1
x2 y2 z2 d2
Now I want to convert it into a map data type with key and value pairs and load it into a separate table.
create table test
(
level map<string,string>
)
row format delimited
COLLECTION ITEMS TERMINATED BY '&'
map keys terminated by '=';
I am using the SQL below to load the data.
insert overwrite table test
select str_to_map(concat('level1=',c1,'&','level2=',c2,'&','level3=',c3,'&','level4=',c4) from input;
Running a select query on the table gives:
select * from test;
{"level1":"x1","level2":"y1","level3":"z1","level4":"d1=\\"}
{"level1":"x2","level2":"y2","level3":"z2","level4":"d2=\\"}
I don't understand why I am getting the extra "=\\" in the last value.
I double-checked the data, but the issue persists.
Can you please help?

str_to_map(text, delimiter1, delimiter2) - Creates a map by parsing text
Split text into key-value pairs using two delimiters. The first delimiter seperates pairs, and the second delimiter sperates key and value. If only one parameter is given, default delimiters are used: ',' as delimiter1 and '=' as delimiter2.
You can get this info by running this command:
describe function extended str_to_map
In your syntax there are two errors:
insert overwrite table test
select str_to_map(concat('level1=',c1,'&','level2=',c2,'&','level3=',c3,'&','level4=',c4) from input;
First, one closing bracket ) is missing.
Second, and this is not really an error: you have not passed the delimiters, so the function uses its default delimiters. That is why you are getting the unexpected characters in your result.
To get the output in the correct format, try this query:
insert overwrite table test
select str_to_map(concat('level1=',c1,'&','level2=',c2,'&','level3=',c3,'&','level4=',c4),'&','=') from input;
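The effect of the delimiter arguments can be sketched with a rough Python model of the splitting described above. This is only an illustration, not Hive's implementation: the function below is a hypothetical stand-in, it treats the delimiters as plain strings, and its defaults are copied from the description quoted in this answer (the defaults your Hive version reports may differ, so check `describe function extended str_to_map`).

```python
def str_to_map(text, delimiter1=",", delimiter2="="):
    """Rough model of the described splitting (plain-string delimiters)."""
    result = {}
    for pair in text.split(delimiter1):
        # Split each pair only once, so a value may itself contain delimiter2.
        key, sep, value = pair.partition(delimiter2)
        result[key] = value if sep else None
    return result


row = "level1=x1&level2=y1&level3=z1&level4=d1"

# Explicit delimiters that match the concatenated string: four entries.
explicit = str_to_map(row, "&", "=")

# Modeled defaults: ',' never occurs in the text, so it stays a single pair.
default = str_to_map(row)
```

With the explicit delimiters the model yields one entry per level, while with the defaults the whole remainder of the string ends up inside a single entry, which is the kind of mismatch the corrected query avoids.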

Related

SQL Loader skips blank fields surrounded by \t if OPTIONALLY ENCLOSED is on

I am trying to load data from a file to Oracle DB using
SQL*Loader: Release 12.1.0.2.0
Table :
CREATE TABLE TEST_TABLE (
ID NUMBER(38) DEFAULT NULL,
X VARCHAR2(4000) DEFAULT NULL,
NUM NUMBER(38) DEFAULT NULL,
Y VARCHAR2(4000) DEFAULT NULL
);
Data testdata.txt:
ID X NUM Y
1 x1 0 y1
2 x2 0 "y2
."
3 0 y3
4 0 "y4
."
5 x5 0 y5
The same data, written with tabs shown as \t:
ID\tX\tNUM\tY
1\tx1\t0\ty1
2\tx2\t0\t"y2
."
3\t \t0\ty3
4\t \t0\t"y4
."
5\tx5\t0\ty5
Note that lines 3 and 4 (counting the header as line 0) contain a field that is just a single SPACE, and lines 2 and 4 contain a quoted field with an embedded line separator \n.
Control file:
OPTIONS (SKIP=1)
LOAD DATA
CHARACTERSET we8iso8859p1
INFILE 'testdata.txt' "STR '\n'"
PRESERVE BLANKS
INTO TABLE TEST_TABLE
FIELDS CSV WITH EMBEDDED TERMINATED BY "\t" OPTIONALLY ENCLOSED BY '"'
(
ID,
X,
NUM,
Y
)
Result:
Record 3: Rejected - Error on table TEST_TABLE, column Y.
Column not found before end of logical record (use TRAILING NULLCOLS)
Record 4: Rejected - Error on table TEST_TABLE, column Y.
Column not found before end of logical record (use TRAILING NULLCOLS)
What I tried:
Using | as field separator makes the data load correctly - unfortunately I don't control the data.
Removing the NUM column makes the problem go away - this is not an option.
Using TRAILING NULLCOLS hides the error and loads faulty data
Not using PRESERVE BLANKS does not solve the problem and also ruins data
Replacing \t with X'09' in the control file changes nothing
Not using CSV WITH EMBEDDED or OPTIONALLY ENCLOSED moves the problem to failing on the quoted fields
Using data types in controlfile does nothing
Using NULLIF X=BLANKS does not solve the problem and would ruin data if it did.
Question:
How do I make SQL*Loader read fields containing only BLANKS in a data file with fields separated by TAB and optionally enclosed by '"'?

Hive parse and edit array to struct field

I have a requirement involving Hive complex data structures, which I am new to. I have tried a few things that did not work. I would like to know whether there is a solution or whether I am at a dead end.
Requirement:
Table1 and Table2 have the same create syntax. I want to select all columns from table1 and insert them into table2, with a few column values modified. For a struct field, I can make this work using named_struct.
But if table1 has an array<struct> type, I am not sure how to make it work.
eg.,
CREATE TABLE IF NOT EXISTS table1 (
ID INT,
XYZ array<STRUCT<X:DOUBLE, Y:DOUBLE, Z:DOUBLE>>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '$'
MAP KEYS TERMINATED BY '#' ;
CREATE TABLE IF NOT EXISTS table2 (
ID INT,
XYZ array<STRUCT<X:DOUBLE, Y:DOUBLE, Z:DOUBLE>>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '$'
MAP KEYS TERMINATED BY '#' ;
hive> select * from table1 ;
OK
1 [{"x":1,"y":2,"z":3},{"x":4,"y":5,"z":6},{"x":7,"y":8,"z":9}]
2 [{"x":4,"y":5,"z":6},{"x":7,"y":8,"z":9}]
How can I update a struct field in an array while inserting? Let's say that if struct field y is 5, I want it to be inserted as 0.
For the complex struct type you can use the Brickhouse UDFs. Download the jar and add it in your script.
add jar hdfs://path_where_jars_are_downloaded/brickhouse-0.6.0.jar
Create a collect function.
create temporary function collect_arrayofstructs as 'brickhouse.udf.collect.CollectUDAF';
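Query: replace the y value with 0 where it is 5
-- explode the array so each struct is available as s, then rebuild the array per ID
select ID, collect_arrayofstructs(
named_struct(
"x", s.x,
"y", if(s.y = 5, cast(0 as double), s.y),
"z", s.z
)) as XYZ
from table1
lateral view explode(XYZ) t as s
group by ID;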

vertica copy command with null value for Integer

Is there an empty character I can put into the CSV in order to load a null value
into an integer column
without using the ",X" pattern?
(i.e. X is a value and the field before it is null)
Suppose you have a file /tmp/file.csv like this:
2016-01-10,100,abc
2016-02-21,,def
2017-01-01,300,ghi
and a target table defined as follows:
create table t1 ( dt date, id integer, txt char(10));
Then, the following command will insert NULL into "id" for the second row (the one having dt='2016-02-21'):
copy t1 from '/tmp/file.csv' delimiter ',' direct abort on error;
Now, if you want to use a special string to identify NULL values in your input file, let's say 'MYNULL':
2016-01-10,100,abc
2016-02-21,MYNULL,def
2017-01-01,300,ghi
Then you have to run COPY this way:
copy t1 from '/tmp/file.csv' delimiter ',' null 'MYNULL' direct abort on error;
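The two cases above can be modeled in a few lines of Python. This is only a sketch of the idea (a field equal to the null string becomes NULL), not Vertica's parser; `parse_copy_rows` and its parameters are hypothetical names introduced for illustration.

```python
import csv
import io


def parse_copy_rows(text, delimiter=",", null=""):
    """Parse delimited text, mapping any field equal to `null` to None."""
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter=delimiter):
        rows.append([None if field == null else field for field in fields])
    return rows


# Empty field treated as NULL, as in the first COPY command.
default_rows = parse_copy_rows(
    "2016-01-10,100,abc\n2016-02-21,,def\n2017-01-01,300,ghi\n"
)

# Custom marker treated as NULL, as in the COPY command with null 'MYNULL'.
custom_rows = parse_copy_rows(
    "2016-01-10,100,abc\n2016-02-21,MYNULL,def\n2017-01-01,300,ghi\n",
    null="MYNULL",
)
```

In both cases the second row carries None in the id position, while the other rows keep their values as strings.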

Import CSV which every cell terminated by newline

I have a CSV file. The data looks like this:
PRICE_a
123
PRICE_b
500
PRICE_c
1000
PRICE_d
506
My XYZ table is:
CREATE TABLE XYZ (
DESCRIPTION_1 VARCHAR2(25),
VALUE NUMBER
)
Can a CSV like the one above be imported into Oracle?
How do I create a control.ctl file?
Here's how to do it without any pre-processing. Use the CONCATENATE 2 clause to tell SQL*Loader to join every 2 lines together. This builds logical records, but there is no separator between the 2 fields. No problem, but first understand how the data file is read and processed. SQL*Loader reads the data file a record at a time and tries to map each field, in order from left to right, to the fields listed in the control file (see the control file below). The concatenated record is matched with TEMP from the control file. TEMP does not match a column in the table, so SQL*Loader will not try to insert it; because it is defined as a BOUNDFILLER, it is only saved for later use. There are no more data file fields to match, but the control file next lists a field name that matches a column name, DESCRIPTION_1, so SQL*Loader applies the expression and inserts the result.
The expression applies the regexp_substr function to the saved string :TEMP (which we know is the entire record from the file) and returns the substring of that record consisting of zero or more non-numeric characters from the start of the string, where they are followed by zero or more numeric characters until the end of the string. That substring is inserted into the DESCRIPTION_1 column.
The same is then done for the VALUE column, only this time returning the numeric part at the end of the string and skipping the non-numeric part at the beginning.
load data
infile 'xyz.dat'
CONCATENATE 2
into table XYZ
truncate
TRAILING NULLCOLS
(
TEMP BOUNDFILLER CHAR(30),
DESCRIPTION_1 EXPRESSION "REGEXP_SUBSTR(:TEMP, '^([^0-9]*)[0-9]*$', 1, 1, NULL, 1)",
VALUE EXPRESSION "REGEXP_SUBSTR(:TEMP, '^[^0-9]*([0-9]*)$', 1, 1, NULL, 1)"
)
Bada-boom, bada-bing:
SQL> select *
from XYZ
/
DESCRIPTION_1 VALUE
------------------------- ----------
PRICE_a 123
PRICE_b 500
PRICE_c 1000
PRICE_d 506
SQL>
Note that this is pretty dependent on the data following your example, and you should do some analysis of the data to make sure the regular expressions will work before putting this into production. Some tweaking will be required if the descriptions could contain numbers. If you can get the data to be properly formatted with a separator in a true CSV format, that would be much better.
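To check the regular expressions against sample data before loading, the logic can be modeled in Python. This is a sketch under assumptions: `load_pairs` is a hypothetical helper, joining two lines stands in for CONCATENATE 2, and Python's `re` is used in place of Oracle's REGEXP_SUBSTR with the same two patterns.

```python
import re


def load_pairs(lines):
    """Join every two lines, then split the result with the two patterns."""
    rows = []
    for i in range(0, len(lines) - 1, 2):
        temp = lines[i] + lines[i + 1]  # stands in for CONCATENATE 2
        description = re.match(r"^([^0-9]*)[0-9]*$", temp)
        value = re.match(r"^[^0-9]*([0-9]*)$", temp)
        if description and value:
            digits = value.group(1)
            rows.append((description.group(1), int(digits) if digits else None))
    return rows


sample = ["PRICE_a", "123", "PRICE_b", "500", "PRICE_c", "1000", "PRICE_d", "506"]
loaded = load_pairs(sample)

# A description that ends in a digit is silently split in the wrong place.
misparsed = load_pairs(["PRICE_1", "123"])
```

The second call shows why the descriptions must not contain numbers: the trailing digit of the description is absorbed into the value instead of raising an error.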

How to specify custom string for NULL values in Hive table stored as text?

Consider a Hive table stored in text format, for example:
CREATE EXTERNAL TABLE clustered_item_info
(
country_id int,
item_id string,
productgroup int,
category string
)
PARTITIONED BY (cluster_id int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '${hivevar:table_location}';
Fields with null values are represented as '\N' strings, and for numeric columns NaNs are represented as 'NaN' strings.
Does Hive provide a way to specify a custom string to represent these special values?
I would like to use empty strings instead of '\N' and 0 instead of 'NaN'. I know this substitution can be done with streaming, but is there any way to do it cleanly in Hive instead of writing extra code?
Other info:
I'm using Hive 0.8, if that matters.
Use this property when creating the table:
CREATE TABLE IF NOT EXISTS abc
(
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
TBLPROPERTIES ("serialization.null.format"="")
Oh, sorry, I did not read your question carefully.
If you want an empty string instead of '\N', you can use the COALESCE function:
INSERT OVERWRITE DIRECTORY 's3://bucket/result/'
SELECT NULL, COALESCE(NULL,"")
FROM data_table;
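What the null format setting changes in the written text can be pictured with a small Python model. It is an illustration only, not Hive's serializer: `serialize_row` and its parameters are hypothetical, and the '\N' default is taken from the question above. It does not cover the NaN part of the question.

```python
def serialize_row(row, field_delim="|", null_format="\\N"):
    """Render one row as delimited text, writing None as null_format."""
    return field_delim.join(null_format if v is None else str(v) for v in row)


row = (1, None, "books")

# Default behaviour described in the question: NULL is written as \N.
default_line = serialize_row(row)

# With an empty null format, NULL is written as an empty field.
custom_line = serialize_row(row, null_format="")
```

The trade-off of an empty null format is that an empty string and a NULL become indistinguishable in the stored text.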
