How to load nested collections in hive with more than 3 levels - hadoop

I'm struggling to load data into a Hive table defined like this:
CREATE TABLE complexstructure (
  id STRING,
  date DATE,
  day_data ARRAY<STRUCT<offset:INT,data:MAP<STRING,FLOAT>>>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  COLLECTION ITEMS TERMINATED BY '|'
  MAP KEYS TERMINATED BY ':';
The day_data field contains a complex structure that is difficult to load with LOAD DATA INPATH.
I've tried '\004', ^D, and many other separator options, but the data inside the map never gets loaded.
Here is my last try:
id_3054,2012-09-22,3600000:TOT'\005'0.716'\004'PI'\005'0.093'\004'PII'\005'0.0'\004'PIII'\005'0.0'\004'PIV'\005'0.0'\004'RES'\005'0.0|7200000:TOT'\005'0.367'\004'PI'\005'0.066'\004'PII'\005'0.0'\004'PIII'\005'0.0'\004'PIV'\005'0.0'\004'RES'\005'0.0|10800000:TOT'\005'0.268'\004'PI'\005'0.02'\004'PII'\005'0.0'\004'PIII'\005'0.0'\004'PIV'\005'0.159'\004'RES'\005'0.0|14400000:TOT'\005'0.417'\004'PI'\005'0.002'\004'PII'\005'0.0'\004'PIII'\005'0.0'\004'PIV'\005'0.165'\004'RES'\005'0.0
Before posting here I tried many, many options, including the approach from this question, which didn't work either:
HIVE nested ARRAY in MAP data type
I'm using the HDP 2.2 image.
Any help would be much appreciated
Thanks
Carlos

So I finally found a nice way to generate the file from Java. The trick is that Hive uses the first eight ASCII control characters as separators, but you can only override the first three in the table DDL; from the fourth nesting level on, you have to emit the actual ASCII characters.
After many tests I ended up editing my file with a hex editor, and inserting the right byte values worked. But how do you do that in Java? It couldn't be simpler: just cast an int to char, and you get the corresponding ASCII character:
char level4 = (char) 4; // ASCII 4: separator for the 4th nesting level
char level5 = (char) 5; // ASCII 5: separator for the 5th nesting level
// ...and so on for deeper levels.
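To sanity-check that the map data actually loaded, here is a quick query sketch (names taken from the table definition above; the LIMIT is arbitrary):
-- Pull the first struct of the array and one map entry out of day_data
SELECT id, day_data[0].offset, day_data[0].data['TOT']
FROM complexstructure
LIMIT 5;
If the separators were written correctly, the last column should show the TOT values (0.716 and so on) rather than NULL.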
Hope this helps!!
Carlos

You could store the Hive table in Parquet or ORC format instead; both support nested structures natively and handle them more efficiently.
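For instance, a minimal sketch of that approach, assuming the delimited-text table from the question already exists and serves as a staging table:
-- Same schema, but stored as ORC (nested types are handled natively)
CREATE TABLE complexstructure_orc (
  id STRING,
  date DATE,
  day_data ARRAY<STRUCT<offset:INT,data:MAP<STRING,FLOAT>>>
) STORED AS ORC;

-- Copy the rows across; Hive rewrites them into the columnar format
INSERT INTO TABLE complexstructure_orc
SELECT id, date, day_data FROM complexstructure;
You still have to get the text file loaded once, but every table after that avoids the multi-level delimiter problem entirely.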

Related

retrieve data from Oracle XMLType with multiple or matrix type text node values

I'm very new to XML and I don't know how to extract XMLType data from Oracle. My task is to loop through a table column of XMLType datatype (please see the attached image file) and check the length of the values of the text nodes. However, one type of text node is different from the other: one has multiple values separated by ';', and the other is a matrix with multiple items and multiple records. I can't even get past this simple SQL statement, which returns a null value:
SELECT extract( xs.xs_xml, '/arg0/item/text()' )
FROM XML_Shield xs
WHERE xs_id = 521521;
I've searched the internet for solutions similar to my task, but none fits my requirement. Most of the examples I've found deal with single-value text nodes, not a matrix. I must admit this is out of my depth.
Can you please help me? I'd appreciate any assistance with this matter.
Thanks...

How to replace characters in Hive?

I have a string column description in a Hive table which may contain tab characters ('\t'); these characters are messing up some views when connecting Hive to an external application.
Is there a simple way to get rid of all tab characters in that column? I could run a simple Python program to do it, but I want to find a better solution.
The regexp_replace UDF performs this task. Below are the definition and usage from the Apache wiki:
regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT):
This returns the string resulting from replacing all substrings in INITIAL_STRING that match the Java regular expression syntax defined in PATTERN with instances of REPLACEMENT.
e.g.: regexp_replace("foobar", "oo|ar", "") returns 'fb'
A custom SerDe might be a way to do it. Or you could use some kind of mediation process with regexp_replace:
create table tableB as
select
  columnA,
  regexp_replace(description, '\\t', '') as description
from tableA;
select translate(description, '\t', '') from myTable;
Translates the input string by replacing the characters present in the from string with the corresponding characters in the to string. This is similar to the translate function in PostgreSQL. If any of the parameters to this UDF are NULL, the result is NULL as well. (Available as of Hive 0.10.0, for string types)
Char/varchar support added as of Hive 0.14.0
You can also use translate(). If the third argument is too short, the corresponding characters from the second argument are deleted. Unlike regexp_replace() you don't need to worry about special characters.
See the string functions documentation: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions
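To illustrate the deletion behavior, a small sketch (a hypothetical cleanup, assuming you also want to drop carriage returns):
-- '\t' maps to a space; '\r' has no counterpart in the third
-- argument, so it is deleted outright.
select translate(description, '\t\r', ' ') from myTable;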
There is no OOTB feature at the moment that allows this. One way to achieve it would be to write a custom InputFormat and/or SerDe that does it for you. You might find this JIRA useful: https://issues.apache.org/jira/browse/HIVE-3751 (not directly related to your problem, though).

Oracle SQL*Loader, referencing calculated values

Hope you're having a nice day. I'm learning how to use functions with SQL*Loader and I have a question about it. Let's say I have this table:
table a
--------------
code
name
dept
birthdate
secret
The data.csv file contains this data:
name
dept
birthdate
and I'm using this control file to load the data with SQL*Loader:
LOAD DATA
INFILE "data.csv"
APPEND INTO TABLE a
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
(code "getCode(:name,:dept)", name, dept, birthdate, secret "getSecret(getCode(:name,:dept),:birthdate)")
This works like a charm: it gets the values from my getCode and getSecret functions. However, I would like to reference the previously calculated value (from getCode) so I don't have to nest function calls inside getSecret, like this:
getSecret(getCode(:name,:dept), :birthdate)
I've tried to do it like this:
getSecret(:code, :birthdate)
but that picks up the original value from the file (meaning NULL), not the value calculated by the function (I guess because it is evaluated on the fly). So my question is whether there is a way to avoid these nested calls for previously calculated values, so I don't lose performance recalculating the same values over and over again. (The real table I'm using is about 10 times bigger and nests a lot of functions for these previously calculated values, so I suspect that's hurting performance.)
Any help would be appreciated. Thanks!!
Update:
Sorry, but I haven't used external tables before (I'm kinda new here). How could I implement this using external tables, considering all the calculated values I need to get from the functions I developed? (I tried a trigger, see "SQL Loader, Trigger saturation?", and it killed the database...)
I'm not aware of a way of doing this.
If you switched to using external tables you'd have a lot more freedom for this sort of thing -- common table expressions, leveraging subquery caching, that sort of stuff.
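A rough sketch of that approach (the directory object data_dir is an assumption, the ACCESS PARAMETERS are abbreviated, and getCode/getSecret are the functions from the question):
-- 1. Expose data.csv as an external table instead of loading it
CREATE TABLE a_ext (
  name      VARCHAR2(100),
  dept      VARCHAR2(100),
  birthdate DATE
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY data_dir          -- hypothetical directory object
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
  )
  LOCATION ('data.csv')
);

-- 2. Compute getCode once per row in an inline view, then reuse it
INSERT INTO a (code, name, dept, birthdate, secret)
SELECT t.code, t.name, t.dept, t.birthdate,
       getSecret(t.code, t.birthdate)
FROM (
  SELECT getCode(name, dept) AS code, name, dept, birthdate
  FROM a_ext
) t;
Note that the optimizer may still merge the inline view and evaluate getCode twice; rewriting it as a WITH clause (optionally with the undocumented MATERIALIZE hint) is a common way to force a single evaluation.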

How does Hive store a SequenceFile?

There is a Hive internal table stored as a SequenceFile; the first column's type is string and the field separator is '\1'. I want to process it with MapReduce directly, and I found that the input key is BytesWritable. My question is: how does Hive store data in a SequenceFile? Is the reason I get a BytesWritable key that the first column's type is string? I didn't configure the map's key separator as '\1', so I am puzzled.
Hive does not treat the first column as the key of a SequenceFile. Rather, the key gets ignored completely [1] [2]. So when you write your Mapper to operate on a Hive SequenceFile, you should also disregard the key; all of your columns will be part of the value.
Just in case your value is also a BytesWritable and you want it as Text, try SequenceFileAsTextInputFormat (docs). The answer to this similar question may help you get set up. You should be able to get a String from the Text with a simple toString(). Your separator '\1' comes in here: split your String on '\1' to break it into your Hive columns.

How can I do a double delimiter in Hive?

Let's say I have some sample rows of data:
site1^http://article1.com?datacoll=5|4|3|2|1&test=yes
site1^http://article1.com?test=yes
site1^http://article1.com?datacoll=5|4|3|2|1&test=yes
I want to create a table like so
create table clicklogs (sitename string, url string)
ROW format delimited fields terminated by '^';
As you can see I have some data in the url parameter I'd like to extract, namely
datacoll=5|4|3|2|1
I also want to work with the individual elements separated by pipes, so I can do GROUP BYs on them to show, for example, how many URLs had a second position of "4" (which would be 2 rows in this case). So the "url" field contains additional data I'd like to parse out and use in my queries.
The question is: what is the best way to do that in Hive?
thanks!
First, use parse_url(string urlString, string partToExtract [, string keyToExtract]) to grab the data in question:
parse_url('http://article1.com?datacoll=5|4|3|2|1&test=yes', 'QUERY', 'datacoll')
This returns '5|4|3|2|1', which gets us halfway there. Now use split(string str, string pat) to break that on the sub-delimiter into an array:
split(parse_url(url, 'QUERY', 'datacoll'), '\\|')
With the result of this, you should be able to grab the columns that you want.
See the UDF documentation for more built-in functions.
Note: I wasn't able to verify this works in Hive from where I am, sorry if there are some minor issues.
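Putting the two together, a sketch of the counting query (split returns a 0-indexed array, so [1] is the second pipe-delimited position; clicklogs and url come from the question):
select split(parse_url(url, 'QUERY', 'datacoll'), '\\|')[1] as pos2,
       count(*) as cnt
from clicklogs
where parse_url(url, 'QUERY', 'datacoll') is not null
group by split(parse_url(url, 'QUERY', 'datacoll'), '\\|')[1];
On the three sample rows this should return pos2 = '4' with cnt = 2, since the second row has no datacoll parameter.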
This looks very similar to something I did a couple of weeks ago. I think the best approach in your case would be to apply a pre-processing step (possibly with Hadoop streaming) and change the table definition to:
create table clicklogs(sitename string, datacol Array<int>) row format delimited fields terminated by '^' collection items terminated by '|'
Once you have that, you can easily manipulate your data in Hive using lateral views and the built-in explode(). The following query should get you the counts of URLs per value:
select col, count(1) from clicklogs lateral view explode(datacol) dataTable as col group by col
