How can I do a double delimiter in Hive?

Let's say I have some sample rows of data:
site1^http://article1.com?datacoll=5|4|3|2|1&test=yes
site1^http://article1.com?test=yes
site1^http://article1.com?datacoll=5|4|3|2|1&test=yes
I want to create a table like so
create table clicklogs (sitename string, url string)
ROW format delimited fields terminated by '^';
As you can see I have some data in the url parameter I'd like to extract, namely
datacoll=5|4|3|2|1
I also want to work with the individual elements separated by pipes, so I can do group bys on them to show, for example, how many URLs had a second position of "4" (which would be 2 rows in this case). So the "url" field has additional data I'd like to parse out and use in my queries.
The question is, what is the best way to do that in hive?
thanks!

First, use parse_url(string urlString, string partToExtract [, string keyToExtract]) to grab the data in question:
parse_url('http://article1.com?datacoll=5|4|3|2|1&test=yes', 'QUERY', 'datacoll')
This returns '5|4|3|2|1', which gets us halfway there. Now, use split(string str, string pat) to break those out of each sub-delimiter into an array:
split(parse_url(url, 'QUERY', 'datacoll'), '\\|')
With the result of this, you should be able to grab the columns that you want.
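For example, a minimal sketch putting the two together against the clicklogs table above (untested here, per the note below) to answer the second-position question from the post:
SELECT split(parse_url(url, 'QUERY', 'datacoll'), '\\|')[1] AS second_pos,
       count(*) AS cnt
FROM clicklogs
WHERE parse_url(url, 'QUERY', 'datacoll') IS NOT NULL
GROUP BY split(parse_url(url, 'QUERY', 'datacoll'), '\\|')[1];
For the three sample rows this returns second_pos = '4' with cnt = 2, since the elements of split() come back as strings.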
See the UDF documentation for more built-in functions.
Note: I wasn't able to verify this works in Hive from where I am, sorry if there are some minor issues.

This looks very similar to something I did a couple of weeks ago. I think the best approach in your case would be to apply a pre-processing step (possibly with Hadoop streaming) and change your table definition to:
create table clicklogs(sitename string, datacol Array<int>) row format delimited fields terminated by '^' collection items terminated by '|'
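With that pre-processing in place, each input line would be reduced to something like the following (derived from the first sample row above, keeping only the site name and the datacoll values):
site1^5|4|3|2|1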
Once you have that, you can easily manipulate your data in Hive using lateral views and the built-in explode. The following query should get you the count of URLs per value:
select col, count(1) from clicklogs lateral view explode(datacol) dataTable as col group by col

Related

Informatica - Concatenate Max value from each column present in multiple rows for same Primary Key

I have tried the traditional approach of using an Aggregator (Group By: ID, Store Name) and Max(each object) columns separately.
Then, in the next Expression, concatenating the values (Val1 || Val2 || Val3 || Val4).
However, I'm getting the output as '0100'.
But the REQUIRED OUTPUT is: 1100
Please let me know how this can be done in IICS.
IICS is similar to the PowerCenter on-prem.
First use an Aggregator:
in the Group By tab, add ID and Store Name;
in the Aggregate tab, add max(object1), etc. Please note to set the data type and length correctly.
Then use an Expression transformation:
link ID and Store Name first;
then concat the max_* columns using pipes:
out_max = max_col1 || max_col2 || ... Again, set the data type and length correctly.
This should generate the correct output. I think you are getting the wrong output because of the data length or data type of the object fields. Make sure you trim spaces from the object data before the Aggregator.
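For example, a sketch of such a trimming expression placed before the Aggregator (the port names are assumptions):
o_object1 = LTRIM(RTRIM(object1))
o_object2 = LTRIM(RTRIM(object2))
and so on for the remaining object ports, feeding the o_* ports into the Aggregator.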

Spreadsheet - query-importrange sort by date and keep text in the same column

I am using 3 different spreadsheets which I have linked to a final spreadsheet, where specific columns show up sorted by date ascending (Col2). The problem is that in the source spreadsheets (where I import the data from), Col30 (which I am trying to sort as Col2 in the final spreadsheet) contains both dates and text. What I need is for the final spreadsheet to have the dates sorted and also to show the text (in Col2 of the final spreadsheet, which imports data from Col30 of the 3 source spreadsheets).
The dates are sorted, but neither the text nor the rest of the data in the same row as the date (in the source spreadsheets) appears. The rows selected by "Col6 CONTAINS '"&$B$1&"'" only appear when I put a date in Col30 of the source spreadsheets; when there is only text in Col30, it doesn't return anything.
Any suggestions? Thank you in advance.
What I have tried so far, which works except for showing the text I need:
=QUERY(QUERY({IMPORTRANGE("url1 ";"sheet1!A2:AJ1000");IMPORTRANGE("url2 ";"sheet2!A2:AJ1000");IMPORTRANGE("url3 ";"sheet3!A2:AJ1000")};"Select Col5,Col30,Col31,Col21,Col22,Col23,Col24,Col34,Col35,Col36 where Col6 CONTAINS '"&$B$1&"'");"Select * where Col2 is not null order by Col2")
Here is what I believe you are trying to achieve:
=QUERY(
{
IMPORTRANGE("1usAXftvFrpCHz7LN43avWrWqSIO14iKM-pgwuG9jMeE";"Sheet1!A2:AJ")\
ARRAYFORMULA(
TO_TEXT(
IMPORTRANGE("1usAXftvFrpCHz7LN43avWrWqSIO14iKM-pgwuG9jMeE";"Sheet1!AD2:AE")
)
)
};
"SELECT Col5,Col37,Col38,Col21,Col22,Col23,Col24,Col34,Col35,Col36 WHERE Col6='"&$B$1&"' AND Col37 is not null ORDER BY Col30, Col31"
)
Let's unpack the changes:
Remove the outer query. You don't need it; instead, add the condition and the ORDER BY to the first query.
Change the range to be an open-ended one.
Add columns with the text version of the dates/times.
The last point is important, as QUERY only supports a single type per column at a time. This means that when you were querying over the date and time columns, you were losing the text values (because they are of another type); for example, if Col30 holds mostly dates plus some plain text, the text cells come back empty. Adding 2 more columns and forcing them to text lets you include the values in the result without losing information, and keeping the originals lets you ORDER BY them.
References
QUERY (Docs editors help)
TO_TEXT (Docs editors help)
ARRAYFORMULA (Docs editors help)

How to select a substring from Oracle blob field

I need to get part of a BLOB field which has some JSON data. One part of the BLOB is like this: CustomData:{HDFC;1;0;sent}. I need the separate values after CustomData, i.e. HDFC, 1, 0 and sent.
This is what I have tried, in two separate queries, which works.
This gives me the index of CustomData within the payment_data BLOB field; for example it returns 11000:
select dbms_lob.instr(payment_data, utl_raw.cast_to_raw('CustomData'))
from table_x;
Then I specify the 3rd parameter (the offset) as what the first query returned plus the length of the text 'CustomData:' to get {HDFC;1;0;sent}:
select UTL_RAW.CAST_TO_VARCHAR2(dbms_lob.substr(payment_data,1000,11011))
from table_x;
The problem is that I need a dynamic offset in the 2nd query rather than running the 1st query separately, and specifying a dynamic offset is not working with the dbms_lob.substr() function. Any suggestions on how I can combine these two queries into one?
Once I get {HDFC;1;0;sent}, I also need to get the delimited values separately, so combining all three steps into one would be even better if someone can help with it. I can use regexp_substr to split the delimited text once I get the first two combined.
If you want to extract text data from a BLOB, first convert it to a CLOB using dbms_lob.converttoclob.
If you have Oracle 12c or higher you may use the JSON SQL functions, for example JSON_TABLE.
If your Oracle version is 10 or 11 you may use the regex functions, or instr + substr if your version is less than 10.
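For the instr + substr route, the two queries from the question can also be combined by inlining the instr call as the offset. A sketch (the + 11 skips past the 11 characters of 'CustomData:', matching the 11011 offset in the question):
select utl_raw.cast_to_varchar2(
         dbms_lob.substr(
           payment_data,
           1000,
           dbms_lob.instr(payment_data, utl_raw.cast_to_raw('CustomData')) + 11
         )
       ) as custom_data
from table_x;
From there, regexp_substr can pull out the individual values, as noted in the question.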

How to load nested collections in hive with more than 3 levels

I'm struggling to load data into a Hive table defined like this:
CREATE TABLE complexstructure (
id STRING,
date DATE,
day_data ARRAY<STRUCT<offset:INT,data:MAP<STRING,FLOAT>>>
) row format delimited
fields terminated by ','
collection items terminated by '|'
map keys terminated by ':';
The day_data field contains a complex structure that is difficult to load with load data inpath...
I've tried with '\004', ^D... a lot of options, but the data inside the map doesn't get loaded.
Here is my last try:
id_3054,2012-09-22,3600000:TOT'\005'0.716'\004'PI'\005'0.093'\004'PII'\005'0.0'\004'PIII'\005'0.0'\004'PIV'\005'0.0'\004'RES'\005'0.0|7200000:TOT'\005'0.367'\004'PI'\005'0.066'\004'PII'\005'0.0'\004'PIII'\005'0.0'\004'PIV'\005'0.0'\004'RES'\005'0.0|10800000:TOT'\005'0.268'\004'PI'\005'0.02'\004'PII'\005'0.0'\004'PIII'\005'0.0'\004'PIV'\005'0.159'\004'RES'\005'0.0|14400000:TOT'\005'0.417'\004'PI'\005'0.002'\004'PII'\005'0.0'\004'PIII'\005'0.0'\004'PIV'\005'0.165'\004'RES'\005'0.0
Before posting here, I've tried (many many) options, and this example doesn't work:
HIVE nested ARRAY in MAP data type
I'm using the image from HDP 2.2
Any help would be much appreciated
Thanks
Carlos
So finally I found a nice way to generate the file from Java. The trick is that Hive uses the first 8 ASCII characters as separators, but you can only override the first three; from the fourth on, you need to generate the actual ASCII characters.
After many tests, I ended up editing my file with a hex editor, and inserting the right values worked. But how can I do that in Java? It couldn't be simpler: just cast an int to char, and that will generate the corresponding ASCII character:
ASCII 4 -> ((char)4)
ASCII 5 -> ((char)5)
...
And so on.
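As an illustration, a minimal sketch of building one entry of the sample line this way in Java (the id and values come from the example above; writing to stdout is an assumption):
public class SeparatorDemo {
    public static void main(String[] args) {
        char sep4 = (char) 4;  // separates the entries of the inner map
        char sep5 = (char) 5;  // separates a map key from its value
        String line = "id_3054" + "," + "2012-09-22" + ","
                + "3600000" + ":"                      // struct offset, then its map
                + "TOT" + sep5 + "0.716" + sep4
                + "PI" + sep5 + "0.093";
        System.out.println(line);
    }
}
Further array entries would be appended with the '|' collection separator, exactly as in the sample line.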
Hope this helps!!
Carlos
You could store the Hive table in the Parquet or ORC format, which support nested structures natively and more efficiently.
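For example, a sketch of the same structure stored as ORC (the new table name is an assumption, and `date` is backquoted since it is a keyword in newer Hive versions):
CREATE TABLE complexstructure_orc (
  id STRING,
  `date` DATE,
  day_data ARRAY<STRUCT<offset:INT,data:MAP<STRING,FLOAT>>>
) STORED AS ORC;

INSERT INTO TABLE complexstructure_orc
SELECT * FROM complexstructure;
Since ORC stores the structure itself, no delimiter tricks are needed once the data is in.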

ORACLE SQLLOADER, referencing calculated values

Hope you're having a nice day. I'm learning how to use functions with SQL*Loader and I have a question about it. Let's say I have this table:
table a
--------------
code
name
dept
birthdate
secret
The data.csv file contains this data:
name
dept
birthdate
and I'm using this control file to load data into it with SQL*Loader:
LOAD DATA
INFILE "data.csv"
APPEND INTO TABLE a
FIELDS TERMINATED BY ',' optionally enclosed by '"'
TRAILING NULLCOLS
(code "getCode(:name,:dept)",name,dept,birthdate,secret "getSecret(getCode(:name,:dept),birthdate)")
This works like a charm; it gets the values from my getCode and getSecret functions. However, I would like to reference the value previously calculated by getCode so I don't have to nest functions in the getSecret call, i.e. instead of:
getSecret(getCode(:name,:dept), birthdate)
I've tried to do it like this:
getSecret(:code, birthdate)
but that gets the original value from the file (meaning null) and not the one calculated by the function (I guess because it is done on the fly). So my question is whether there is a way to avoid these nested calls for previously calculated values, so I don't lose performance recalculating the same values over and over again (the real table I'm using is about 10 times bigger and nests a lot of functions for these previously calculated values, so I guess that's hurting performance).
Any help would be appreciated. Thanks!
Complement:
Sorry, but I haven't used external tables before (kinda new here). How could I implement this using such tables, considering all the calculated values I need to get from the functions I developed? (I tried a trigger (SQL Loader, Trigger saturation?) and killed the database...)
I'm not aware of a way of doing this.
If you switched to using external tables, you'd have a lot more freedom for this sort of thing: common table expressions, leveraging subquery caching, that sort of stuff.
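For illustration, a sketch of that approach (the directory name, column sizes, and date format are assumptions):
CREATE TABLE a_ext (
  name      VARCHAR2(100),
  dept      VARCHAR2(100),
  birthdate VARCHAR2(20)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY data_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    MISSING FIELD VALUES ARE NULL
  )
  LOCATION ('data.csv')
);

-- getCode is evaluated once per row in the inline view, then reused for getSecret:
INSERT INTO a (code, name, dept, birthdate, secret)
SELECT t.code, t.name, t.dept, t.birthdate, getSecret(t.code, t.birthdate)
FROM (
  SELECT getCode(name, dept) AS code,
         name, dept,
         TO_DATE(birthdate, 'YYYY-MM-DD') AS birthdate
  FROM a_ext
) t;
Note that the optimizer may merge the inline view and evaluate getCode twice anyway; a NO_MERGE hint on the subquery is one way to discourage that.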
