Filter Array in Hive - hadoop

An Apache Hive table has the following column definition:
myvars:array<struct<index:bigint,value:string>>
An example for the corresponding data is:
"myvars":[
{"index":2,"value":"value1"}
, {"index":1,"value":"value2"}
, {"index":2,"value":"value3"}
]
How can this array be filtered to keep only the elements where "index" == 2?
In JavaScript I would do something like the following:
myvars.filter(function(d){return d.index==2;})
How can the same result be achieved with Apache Hive QL, preferably without lateral views?

Hive provides a set of collection functions:
boolean array_contains(Array<T> a, value)
array<K> map_keys(Map<K,V> a)
array<V> map_values(Map<K,V> a)
int size(Map<K,V>|Array<T> a)
array<T> sort_array(Array<T> a)
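In your query, use:
...
WHERE
array_contains(myvars,2)
Note that array_contains only tests membership and compares whole elements. It does not return the matching elements, and for an array of structs the second argument would have to be a complete struct rather than the bare index 2.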

I think if you are trying to extract all the values where index is 2, you want something like this:
SELECT DISTINCT value
FROM mytable
LATERAL VIEW INLINE(myvars) exploded_myvars AS idx, value
WHERE idx = 2;
If instead the data type were array<map<string,string>>, it would be:
SELECT DISTINCT mv["value"]
FROM mytable
LATERAL VIEW EXPLODE(myvars) exploded_myvars AS mv
WHERE mv["index"] = '2';
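To make the expected result concrete, here is a minimal plain-Python model of what the question asks for and what the lateral-view query computes. It is not Hive code; the names rows, filter_myvars and distinct_values are made up for this sketch, and the sample data is the one from the question.

```python
# Plain-Python model of the semantics; this is not Hive code.
rows = [
    {"myvars": [
        {"index": 2, "value": "value1"},
        {"index": 1, "value": "value2"},
        {"index": 2, "value": "value3"},
    ]},
]


def filter_myvars(myvars, wanted_index):
    """Per-row filter: the equivalent of the JavaScript myvars.filter(...)."""
    return [element for element in myvars if element["index"] == wanted_index]


def distinct_values(table_rows, wanted_index):
    """Flatten every array, keep the matching elements, de-duplicate the values."""
    exploded = (element for row in table_rows for element in row["myvars"])
    return sorted({e["value"] for e in exploded if e["index"] == wanted_index})


filtered = filter_myvars(rows[0]["myvars"], 2)
values = distinct_values(rows, 2)
print(filtered)
print(values)
```

The second function shows the trade-off of the lateral-view approach: the matching values from all rows end up in one flat result, so the per-row array structure is lost unless the rows are grouped again afterwards.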

Related

Hive: use a variable from a query that returns only one value

I would like to use something like:
set hivevar:average = select avg(nvl(val,0)) as average from table;
and then use it on another table like:
select * from table2 where value>=${average}
Is this possible in Hive? If not, what is the correct way of doing this?
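As far as I know, Hive's set command stores the text after the = sign verbatim and substitutes it textually, so it does not run the SELECT. One possible workaround, sketched here under assumptions, is to compute the average in a one-row subquery and cross-join it to the second table. The sketch runs on SQLite through Python's sqlite3 module only to show the shape of the query; it has not been run on Hive. The table name table1, the alias a and the sample values are made up, and coalesce stands in for nvl.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (val REAL);    -- hypothetical stand-in for "table"
    CREATE TABLE table2 (value REAL);
    INSERT INTO table1 VALUES (10), (NULL), (20);
    INSERT INTO table2 VALUES (5), (10), (15);
""")

# avg(coalesce(val, 0)) over (10, NULL, 20) is 30 / 3 = 10.
query = """
    SELECT t2.value
    FROM table2 t2
    CROSS JOIN (SELECT avg(coalesce(val, 0)) AS average FROM table1) a
    WHERE t2.value >= a.average
    ORDER BY t2.value
"""
result = [row[0] for row in conn.execute(query)]
print(result)
```

Another common route is to compute the value outside Hive (for example in a shell script) and pass it in as a variable when the second query is started.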

PIG: How to remove '::' in the column name

I have a pig relation like below:
FINAL= {input_md5::type: chararray,input_md5::name: chararray,input_md5::id: long,input_md5::age: chararray,test_1:: type: chararray,test_2::name:chararray}
I am trying to store all columns of the input_md5 relation in a Hive table,
that is, input_md5::type: chararray,input_md5::name: chararray,input_md5::id: long,input_md5::age: chararray, but not test_1:: type: chararray,test_2::name:chararray.
Is there any command in Pig that selects only the columns of input_md5? Something like the following:
STORE= FOREACH FINAL GENERATE all input_md5::type .
I know that Pig has the syntax:
FOREACH FINAL GENERATE all input_md5::type as type
but I have many columns, so I cannot write an AS clause for each of them in my code.
When I try:
STORE= FOREACH FINAL GENERATE input_md5::type .. bus_input_md5::name;
Pig throws an error:
org.apache.hive.hcatalog.common.HCatException : 2007 : Invalid column position in partition schema : Expected column <type> at position 1, found column <input_md5::type>
Thanks in advance,
I resolved this issue; the fix is below.
Create a relation with some filter condition as below:
DUMMY_RELATION= FILTER SOURCE_TABLE BY type== '';
(I took a column named type. The filter can use any column in the table; all that matters is that we get its schema.)
FINAL_DATASET= UNION DUMMY_RELATION,SCHEMA_1,SCHEMA_2;
(The new DUMMY_RELATION should be placed first in the union.)
Now the :: operator is gone and the column names match the Hive table's column names, provided the source table (of DUMMY_RELATION) and the target table have the same column order.
Thanks to myself :)
I implemented Neethu's example this way. May have typos, but it shows how to implement this idea.
tableA = LOAD 'default.tableA' USING org.apache.hive.hcatalog.pig.HCatLoader();
tableB = LOAD 'default.tableB' USING org.apache.hive.hcatalog.pig.HCatLoader();
--load empty table
finalTable = LOAD 'default.finalTable' USING org.apache.hive.hcatalog.pig.HCatLoader();
--example operations that end up with '::' in column names
g = group tableB by (id);
j = JOIN tableA by id LEFT, g by group;
result = foreach j generate tableA::id, tableA::col2, g::tableB;
--union empty finalTable and result
result2 = union finalTable, result;
--bob's your uncle
STORE result2 INTO 'finalTable' USING org.apache.hive.hcatalog.pig.HCatStorer();
Thanks to Neethu!

Hive Joins query

I have two tables in hive:
Table 1:
1,Nail,maher,24,6.2
2,finn,egan,23,5.9
3,Hadm,Sha,28,6.0
4,bob,hope,55,7.2
Table 2 :
1,Nail,maher,24,6.2
2,finn,egan,23,5.9
3,Hadm,Sha,28,6.0
4,bob,hope,55,7.2
5,john,hill,22,5.5
6,todger,hommy,11,2.2
7,jim,cnt,99,9.9
8,will,hats,43,11.2
Is there any way in Hive to retrieve the new data in table 2 that doesn't exist in table 1?
In other database tools, you would use an inner left/right join. But inner left/right doesn't exist in Hive. Any suggestions on how this could be achieved?
If you are using Hive version >= 0.13 you can use this query (with A as table 2 and B as table 1):
SELECT * FROM A WHERE A.firstname, A.lastname ... NOT IN (SELECT B.firstname, B.lastname ... FROM B);
But I'm not sure if Hive supports multiple columns in the IN clause.
If not, something like this could work:
SELECT * FROM A WHERE A.firstname NOT IN (SELECT B.firstname FROM B) AND A.lastname NOT IN (SELECT B.lastname FROM B) ...;
It might be wiser to concatenate the fields together before testing for NOT IN:
SELECT *
FROM t2
WHERE CONCAT(t2.firstname, t2.lastname, CAST(t2.val1 as STRING), CAST(t2.val2 as STRING)) NOT IN
(SELECT CONCAT(t1.firstname, t1.lastname, CAST(t1.val1 as STRING), CAST(t1.val2 as STRING))
FROM t1)
Performing sequential NOT IN sub-queries may give you erroneous results.
From the above example, a new record with the values ('nail','egan',28, 7.2) would not show up as new with sequential NOT IN statements.
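The pitfall with sequential NOT IN can be reproduced with a small sketch. It runs on SQLite through Python's sqlite3 module, not on Hive, and it compares the sequential NOT IN form with a LEFT OUTER JOIN anti-join, a technique that is not taken from the answers above. The column names follow the answers; the table layout, the extra row with id 9 and the variable names are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER, firstname TEXT, lastname TEXT, val1 INTEGER, val2 REAL);
    CREATE TABLE t2 (id INTEGER, firstname TEXT, lastname TEXT, val1 INTEGER, val2 REAL);
    INSERT INTO t1 VALUES
        (1, 'Nail', 'maher', 24, 6.2), (2, 'finn', 'egan', 23, 5.9),
        (3, 'Hadm', 'Sha', 28, 6.0),   (4, 'bob', 'hope', 55, 7.2);
    INSERT INTO t2 SELECT * FROM t1;
    -- Two new rows: id 9 reuses values that each occur somewhere in t1.
    INSERT INTO t2 VALUES (5, 'john', 'hill', 22, 5.5), (9, 'Nail', 'egan', 28, 7.2);
""")

# Sequential NOT IN: each column is checked on its own.
sequential_not_in = [row[0] for row in conn.execute("""
    SELECT id FROM t2
    WHERE firstname NOT IN (SELECT firstname FROM t1)
      AND lastname  NOT IN (SELECT lastname  FROM t1)
      AND val1      NOT IN (SELECT val1      FROM t1)
      AND val2      NOT IN (SELECT val2      FROM t1)
    ORDER BY id
""")]

# Anti-join: a t2 row is new when no single t1 row matches on all columns.
anti_join = [row[0] for row in conn.execute("""
    SELECT t2.id FROM t2
    LEFT OUTER JOIN t1
      ON  t1.firstname = t2.firstname AND t1.lastname = t2.lastname
      AND t1.val1 = t2.val1 AND t1.val2 = t2.val2
    WHERE t1.firstname IS NULL
    ORDER BY t2.id
""")]

print(sequential_not_in)
print(anti_join)
```

The sequential form drops the row with id 9 because each of its values appears in some row of t1, while the anti-join keeps it because no single row of t1 matches all four columns at once.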

Assignment in Hive query

I have the query below, in which I need to assign one table's column value to another table's column.
Query:
SELECT A.aval,B.bval,B.bval1 FROM A JOIN B ON (A.aval = B.bval)
How do I assign one table column value to another table column in Hive?
I have tried:
SELECT A.aval,B.bval,B.bval1, A.aval = B.bval1 FROM A JOIN B ON (A.aval = B.bval)
In the results, A.aval = B.bval1 returns false, since it is evaluated as a comparison and does not assign anything to A.aval.
I guess you want to write to a table?
If so, you have to create a table (for example C) which contains all the fields you need.
Then you run the following (use INSERT OVERWRITE TABLE C instead to replace the existing rows):
INSERT INTO TABLE C
SELECT A.aval,B.bval,B.bval1, A.aval
FROM A
JOIN B ON (A.aval = B.bval)
The result of the SELECT will be inserted into table C.
insert overwrite table c SELECT A.aval,B.bval,B.bval1 FROM A JOIN B ON (A.aval = B.bval)
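The difference between comparing and writing can be sketched as follows. This runs on SQLite through Python's sqlite3 module rather than on Hive, so the INSERT syntax is SQLite's; the column types and the sample values are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (aval TEXT);
    CREATE TABLE B (bval TEXT, bval1 TEXT);
    CREATE TABLE C (aval TEXT, bval TEXT, bval1 TEXT);
    INSERT INTO A VALUES ('k1'), ('k2');
    INSERT INTO B VALUES ('k1', 'x'), ('k3', 'y');
""")

# Inside a SELECT list, "=" is a comparison: it yields a boolean (0 or 1 in SQLite).
comparison = conn.execute(
    "SELECT A.aval = B.bval1 FROM A JOIN B ON (A.aval = B.bval)"
).fetchall()

# Writing values requires an INSERT ... SELECT into a target table.
conn.execute("""
    INSERT INTO C
    SELECT A.aval, B.bval, B.bval1
    FROM A JOIN B ON (A.aval = B.bval)
""")
rows_in_c = conn.execute("SELECT aval, bval, bval1 FROM C").fetchall()

print(comparison)
print(rows_in_c)
```

Only the joined row reaches C; the tables A and B themselves are left unchanged.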

Oracle: function based index using dynamic values

I have a complex SQL query. One of its simpler parts looks like this:
Query 1:
SELECT *
FROM table1 t1, table2 t2
WHERE t1.number = t2.number
AND UPPER(t1.name) = UPPER(t2.name)
AND t1.prefix = p_in_prefix;
Query 2:
SELECT *
FROM table1 t1, table2 t2
WHERE t1.number = t2.number
AND UPPER(t1.name) = UPPER(p_in_prefix || t2.name)
AND t1.prefix = p_in_prefix;
I have a function-based index on table1 on (number, UPPER(name)) and a function-based index on table2 on (number, UPPER(NAME)). p_in_prefix is an input parameter (basically a number).
Because of these indexes, Query 1 runs efficiently. But Query 2 has a performance issue, because in Query 2 't2.name' is prefixed with p_in_prefix.
I cannot create a function-based index for Query 2 because p_in_prefix is an input parameter, and when creating the index I don't know what values it might hold. How can I resolve the performance issue in this scenario? Any hint or idea would be appreciated. If you require more information, please let me know.
Thanks.
Use AND UPPER(t1.name) = UPPER(p_in_prefix) || UPPER(t2.name).
Because table2 has a function-based index on UPPER(NAME), the query needs an operand with exactly the same expression in order to use that index.
Using UPPER(p_in_prefix || t2.name) will not use the function-based index, as it does not match the indexed expression UPPER(NAME). Note that writing UPPER(t2.name) does not cause any problems, as t2 is just a table alias.
Along with this, you can also pass an optimizer hint in your query in order to instruct the optimizer to use the index.
For more information read "Oracle Database 11g SQL" by Jason Price.
Also read the Oracle docs, including the section on optimizer hints.
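The rewrite relies on upper-casing a concatenation giving the same string as concatenating the upper-cased parts. A small Python sketch of that equivalence, using made-up sample values, is shown below. It only checks the string identity; whether Oracle then picks the index still depends on the optimizer.

```python
def upper_of_concat(prefix, name):
    """Models UPPER(p_in_prefix || t2.name)."""
    return (prefix + name).upper()


def concat_of_uppers(prefix, name):
    """Models UPPER(p_in_prefix) || UPPER(t2.name)."""
    return prefix.upper() + name.upper()


# Hypothetical sample values; the numeric prefix is written as a string here.
samples = [("12", "smith"), ("7", "O'Brien"), ("ab", "Cd")]
all_equal = all(upper_of_concat(p, n) == concat_of_uppers(p, n) for p, n in samples)
print(all_equal)
```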
