Dynamic aggregation in SQL (Hive) - hadoop

I have two tables. Table A with 3 columns: userid, a start date, and end date. Table B with events and datetimestamps. I would like to aggregate Table B up to the datetimes between the start and end date based on Table A. So something like...
select a.userid, count(distinct b.eventid) as events
from table a
inner join table b
on a.userid=b.userid
and b.datetime between a.starttime and b.endtime
group by a.userid
But Hive doesn't like that... I'm using Hadoop HortonWorks. Would appreciate any guidance!

Move the between condition to where as only equality conditions in joins are supported prior to version 2.2.0.
From Hive documentation
Complex expressions in ON clause are supported, starting with Hive 2.2.0 (see HIVE-15211, HIVE-15251). Prior to that, Hive did not support join conditions that are not equality conditions.

Related

Will HIVE do a full table query with both partition conditions and not partition conditions?

I have a hive table partitioned by one date column name datetime
If I do a query like
select *
from table
where datetime = "2021-05-01"
and id in (1,2)
with extra and id in (1,2) condition, will hive do a full table search?
Is it possible to determine it by explainresult?
Partition pruning should work fine. To verify use EXPLAIN DEPENDENCY command, it will print input partitions in JSON array {"input_partitions":[...]}
See EXPLAIN DEPENDENCY docs.
EXPLAIN EXTENDED also prints used partitions.

Combine Multiple Hive Tables as single table in Hadoop

Hi I have multiple Hive tables around 15-20 tables. All the tables will be common schema . I Need to combine all the tables as single table.The single table should be queried from reporting tool, So performance is also needs to be care..
I tried like this..
create table new as
select * from table_a
union all
select * from table_b
Is there any other way to combine all the tables more efficient. Any help will be appreciated.
Hive would be processing in parallel if you set "hive.exec.parallel" as true. With "hive.exec.parallel.thread.number" you can specify the number of parallel threads. This would increase the overall efficiency.
If you are trying to merge table_A and table_b into a single one, the easiest way is to use the UNION ALL operator. You can find the syntax and use cases here - https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union

Oracle Partition pruning not happening

I have a fact table with millions of records. The table is range partitioned on a date column.
FACT_AUM (ACCOUNT_ID VARCHAR2(30),MARKET_VALUE NUMBER(20,6), POSTING_DATE DATE);
I have another temp table
ACCOUNT_TMP (ACCOUNT_ID VARCHAR2(30), POSTING_DATE DATE);
When I run this query by hard coding the date I see partition pruning happens and the results come back quickly
SELECT A.ACCOUNT_ID, SUM(A.MARKET_VALUE) FROM
FACT_AUM A JOIN ACCOUNT_TMP B ON A.ACCOUNT_ID = B.ACCOUNT_ID
AND A.POSTING_DATE=TO_DATE('30-DEC-2016',DD-MON-YYYY') GROUP BY
A.ACCOUNT_ID;
when I run the following, I don't see partition pruning and the query keeps spinning
SELECT A.ACCOUNT_ID, SUM(A.MARKET_VALUE) FROM
FACT_AUM A JOIN ACCOUNT_TMP B ON A.ACCOUNT_ID = B.ACCOUNT_ID
AND A.POSTING_DATE = B.POSTING_DATE GROUP BY
A.ACCOUNT_ID;
Any insights on this would be helpful.
Oracle used partition pruning while you hard coded the value, because Oracle felt it would get benefit of doing the partition pruning there.
When you joined the fact table with your temporary ( i would reword it to staging) table, Oracle wouldn't be able to guess which all partitions would it have to hit for computing the answer. Please note Oracle will assess what would be the range of values available in the staging table.
But unless you provide stats of the tables involved, i couldn't dwell into more important topics of the table ordering and tables joins. For quick fix use an Order hint or nested loop hint.

How to populate columns of a new hive table from multiple existing tables?

I have created a new table in hive (T1) with columns c1,c2,c3,c4. I want to populate data into this table by querying from other existing tables(T2,T3).
E.g c1 and c2 come from a query run on T2 & the other columns c3 and c4 come from a query run on T3.
Is this possible in hive ? I have done immense research but still am unable to find a solution to this
Didn't something like this work?
create table T1 as
select t2.c1, t2.c2, t3.c3, t3.c4 from (some query against T2) t2 JOIN (some query against T3) t3
Obviously replace JOIN with whatever is needed. I assume some join between T2 and T3 is possible or else you wouldn't be putting their columns alongside each other in T1.
According to the hive documentation, you can use the following syntax to insert data:
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
Be careful that:
Values must be provided for every column in the table. The standard SQL syntax that allows the user to insert values into only some columns is not yet supported. To mimic the standard SQL, nulls can be provided for columns the user does not wish to assign a value to.
So, I would make a JOIN between the two existing table, and then insert only the needed values in the target table playing around with SELECT. Or maybe creating a temporary table would allow you to have more control over the data. Just remember to handle the problem with NULL, as stated in the official documentation. This is just an idea, I guess there are other ways to achieve what you need, but could be a good place to start from.

How to select row data as column in Oracle

I have two tables like bellow shows figures
I need to select records as bellow shown figure. with AH_ID need to join in second table and ATT_ID will be the column header and ATT_DTL_STR_VALUE need to get as that column relevant value
Required output
Sounds like you have an Entity-Attribute-Value data model which relational DBs aren't the best at modeling. You may want to look into a key-value store.
However, as Justin suggested, if you're using 11g you can use th pivot clause as follows:
SELECT *
FROM (
SELECT T1.AH_ID, T1.AH_DESCRIPTION, T2.ATT_ID, T2.ATT_DTL_STR_VALUE
FROM T1
LEFT OUTER JOIN T2 ON T1.AH_ID = T2.AH_ID
)
PIVOT (MAX(ATT_DTL_STR_VALUE) FOR (ATT_ID) IN (1));
This statement requires you to hard-code in ATT_ID however there are ways to do it dynamically. More info can be found here.

Resources