Performance improvement for Hive query - hadoop

I am using multiple UNION ALL operations and then summing each column, but the query seems to run forever. My cluster has 96 GB of memory. What can I do to improve performance? My Hive query follows.
with total as
(
select * from
(
select * from table1
union all
select * from table2
union all
select * from table3
union all
select * from table4
union all
select * from table5
union all
select * from table6
union all
select * from table7
union all
select * from table8
union all
select * from table9
)p
)
Select * from
(
select
sum(col_1),
sum(col_2),
sum(col_3),
sum(col_4),
sum(col_5),
sum(col_6),
sum(col_7),
sum(col_8),
sum(col_9),
sum(col_10)
from total
)q;

Related

equivalent of distinct On in Oracle

How to translate the following query to Oracle SQL, as Oracle doesn't support distinct on()?
select distinct on (t.transaction_id) t.transaction_id as transactionId ,
t.transaction_status as transactionStatus ,
c.customer_id as customerId ,
c.customer_name as customerName,
You can use ANY_VALUE with group by for this:
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/any_value.html
Example: https://dbfiddle.uk/WUxvjv5J
with t (a,b,c) as (
select 1,10,1 from dual union all
select 1,10,2 from dual union all
select 1,10,3 from dual union all
select 1,20,4 from dual union all
select 1,20,5 from dual union all
select 1,30,7 from dual
)
select a,b,any_value(c)
from t
group by a,b;
Yes, Oracle has a full set of windowing functions you can use for this. The simplest is ROW_NUMBER:
SELECT *
FROM (SELECT x.col1,
x.col2,
x.col3,
ROW_NUMBER() OVER (PARTITION BY x.col1 ORDER BY x.col2 DESC) seq
FROM table x)
WHERE seq = 1
For each distinct col1, it will number the row with the highest col2 value as seq=1, the next highest as seq=2, and so on, so you can filter on 1 to get the desired row. You can use ORDER BY logic as complex as you need to pick the row you want. The key thing is that the ORDER BY goes inside the ROW_NUMBER OVER clause along with the distinct (PARTITION BY) definition, not outside in the main query block.
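The ROW_NUMBER pattern described above can be tried out on a small data set. The following is only a sketch: it runs the same query shape through Python's built-in sqlite3 module rather than Oracle (assuming an SQLite build of 3.25 or later, which added window functions), and the table contents are made up for illustration.

```python
import sqlite3

# In-memory stand-in for the table; the rows are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE x (col1 INTEGER, col2 INTEGER, col3 TEXT)")
conn.executemany(
    "INSERT INTO x VALUES (?, ?, ?)",
    [(1, 10, "a"), (1, 30, "b"), (1, 20, "c"), (2, 5, "d"), (2, 7, "e")],
)

# Number the rows inside each col1 group, highest col2 first,
# then keep only the first row of every group.
rows = conn.execute(
    """
    SELECT col1, col2, col3
    FROM (SELECT col1, col2, col3,
                 ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY col2 DESC) AS seq
          FROM x)
    WHERE seq = 1
    ORDER BY col1
    """
).fetchall()

print(rows)  # one row per col1: [(1, 30, 'b'), (2, 7, 'e')]
```

Changing the ORDER BY inside the OVER clause changes which row of each group survives the seq = 1 filter.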

How to use the query builder of Symfony to make a date range counter and fill the gaps with zeros

I have a query that counts users, grouped by sign-up date.
return $this->createQueryBuilder('s')
->select(' date(s.created_at) as x, count(1) as y')
->where("s.created_at between datesub(now(), :months, 'Month') and now()")
->setParameter('months', $months)
->groupBy('x')
->orderBy('x')
->getQuery()
->getResult();
But there are currently gaps in my dataset.
I have the SQL query to fill the gaps, but I don't know how to build a query this complicated with Symfony's query builder.
SELECT ranger.ranger_date AS x, COALESCE(counter.counter_value, 0) as y
FROM (
SELECT DATE(s.created_at) AS counter_date, count(*) AS counter_value
FROM statistic AS s
WHERE s.created_at between DATE_SUB(NOW(), INTERVAL 3 MONTH) and now()
GROUP BY counter_date
) AS counter
RIGHT JOIN (
SELECT DATE(DATE_SUB(NOW(), INTERVAL units.i + tens.i * 10 + hundreds.i * 100 DAY)) AS ranger_date
FROM (SELECT 0 i UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9)units
CROSS JOIN (SELECT 0 i UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9)tens
CROSS JOIN (SELECT 0 i UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9)hundreds
WHERE DATE_SUB(NOW(), INTERVAL units.i + tens.i * 10 + hundreds.i * 100 DAY) BETWEEN DATE_SUB(NOW(), INTERVAL 3 MONTH) AND NOW()
) AS ranger
ON ranger.ranger_date = counter.counter_date
ORDER BY ranger.ranger_date
I have already tried with the createQuery method, but it did not work...
If your complex native sql query is successfully returning the result set you want:
You can simply prepare and execute the query as documented by Symfony.
If you need to hydrate entities then you can use the NativeQuery class.
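The gap filling that the SQL above does with a generated date range can also be sketched in application code, once the grouped counts have been fetched. The snippet below is an illustration only, not part of Symfony or Doctrine: fill_gaps and the sample dates are hypothetical names and values.

```python
from datetime import date, timedelta


def fill_gaps(counts, start, end):
    """Return one (day, count) pair per day from start to end inclusive.

    Days that are missing from counts get a count of 0.
    """
    total_days = (end - start).days + 1
    series = []
    for offset in range(total_days):
        day = start + timedelta(days=offset)
        series.append((day, counts.get(day, 0)))
    return series


# Hypothetical result of the grouped query: only days with sign-ups appear.
counts = {date(2024, 1, 1): 3, date(2024, 1, 4): 1}

series = fill_gaps(counts, date(2024, 1, 1), date(2024, 1, 5))
for day, value in series:
    print(day.isoformat(), value)
```

This keeps the database query simple (the grouped count only) and moves the zero filling to the caller, at the cost of doing that work outside SQL.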

Hive create table not insert data

I am running the Hive query below. After the MapReduce job completes, I see that no data has been inserted.
create table t_123 as
select * from
(
select * from t1 union all
select * from t2 union all
select * from t3
) X
But if I run just the select query (shown after the log line below), I get results. The data types of t1, t2 and t3 are the same. Towards the end of the run I get this statement:
"numFiles = 27 , numRows = 0 and totalSize = 34567...."
select * from t1 union all
select * from t2 union all
select * from t3
Any thoughts on what the issue could be? I'm running this on Tez.

How to get count by using UNION operator

I'm trying to get a total count by using the UNION operator, but it gives the wrong count.
select count(*) as companyRatings from (
select count(*) hrs from (
select distinct hrs from companyA
)
union
select count(*) financehrs from (
select distinct finance_hrs from companyB
)
union
select count(*) hrids from (
select regexp_substr(hr_id,'[^/]+',1,3) hrid from companyZ
)
union
select count(*) cities from (
select regexp_substr(city,'[^/]+',1,3) city from companyY
)
);
The individual queries work fine, but the total count does not match.
Individual results: 12 19 3 6
Current total count: 31
Expected total count: 40
So is there an alternative solution without the UNION operator?
To add values you'd use +. UNION is to add data sets.
select
(select count(distinct hrs) from companyA)
+
(select count(distinct finance_hrs) from companyB)
+
(select count(regexp_substr(hr_id,'[^/]+',1,3)) from companyZ)
+
(select count(regexp_substr(city,'[^/]+',1,3)) from companyY)
as total
from dual;
But I agree with juergen d; you should not have separate tables per company in the first place.
Edit: updated query using SUM
select sum(cnt) as companyRatings from
(
select count(*) as cnt from (select distinct hrs from companyA)
union all
select count(*) as cnt from (select distinct finance_hrs from companyB)
union all
select count(*) as cnt from (select regexp_substr(hr_id,'[^/]+',1,3) hrid from companyZ)
union all
select count(*) as cnt from (select regexp_substr(city,'[^/]+',1,3) city from companyY)
)
Previous answer:
Try this
SELECT (
SELECT count(*) hrs
FROM (
SELECT DISTINCT hrs
FROM companyA
)
)
+
(
SELECT count(*) financehrs
FROM (
SELECT DISTINCT finance_hrs
FROM companyB
)
)
+
(
SELECT count(*) hrids
FROM (
SELECT regexp_substr(hr_id, '[^/]+', 1, 3) hrid
FROM companyZ
)
)
+
(
SELECT count(*) cities
FROM (
SELECT regexp_substr(city, '[^/]+', 1, 3) city
FROM companyY
)
)
AS total_count
FROM dual
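The difference between the approaches above can be checked on a toy example. This is a sketch that uses Python's built-in sqlite3 module instead of Oracle, with made-up per-company counts (12, 19, 12, 6) chosen so that two of them are equal: UNION drops the duplicate value before the sum, while UNION ALL and plain addition keep it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# UNION removes duplicate rows, so the second 12 disappears before SUM runs.
union_total = conn.execute(
    "SELECT SUM(cnt) FROM ("
    " SELECT 12 AS cnt UNION SELECT 19 UNION SELECT 12 UNION SELECT 6)"
).fetchone()[0]

# UNION ALL keeps every row, so SUM sees all four counts.
union_all_total = conn.execute(
    "SELECT SUM(cnt) FROM ("
    " SELECT 12 AS cnt UNION ALL SELECT 19 UNION ALL SELECT 12 UNION ALL SELECT 6)"
).fetchone()[0]

# COUNT(*) over the combined rows counts rows, not the values inside them.
row_count = conn.execute(
    "SELECT COUNT(*) FROM ("
    " SELECT 12 AS cnt UNION ALL SELECT 19 UNION ALL SELECT 12 UNION ALL SELECT 6)"
).fetchone()[0]

# Adding scalar subqueries gives the same total as SUM over UNION ALL.
plus_total = conn.execute(
    "SELECT (SELECT 12) + (SELECT 19) + (SELECT 12) + (SELECT 6)"
).fetchone()[0]

print(union_total, union_all_total, row_count, plus_total)
```

With these sample values the UNION version sums only three distinct numbers, which is the kind of under-count the question describes.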

Consolidating data from different views into a single sql statement

I have 4 different views that return data in the same format. I need to write a single query that combines the data from all four views with data from another table, 'Table1', in such a way that a row from 'Table1' is not added to the end result if it is already present in any of the four views (matched by some id).
For example: View1, View2, View3, View4, Table1
My end result should be
(View1+View2+View3+View4+(Table1-(View1+View2+View3+View4)))
The query I have written looks like this:
select * from view1 union
select * from view2 union
select * from view3 union
select * from view4 union
select * from Table1 where Table1.Id Not in
(select Id from view1 union
select Id from view2 union
select Id from view3 union
select Id from view4)
Is there a better way to frame this query that improves performance, especially when there is a large amount of data?
Have you tried using distinct?
select distinct * from (
select * from view1 union
select * from view2 union
select * from view3 union
select * from view4 union
select * from Table1);
Actually the not in could be implemented as a minus. On a logical level you can achieve something like:
-- step 1
create or replace view v_basic
as
select *
from view1
union
select *
from view2
union
select *
from view3
union
select *
from view4;
-- step 2
create or replace view v_extension
as
select id
from table1
minus
select id
from v_basic;
-- step 3
select *
from v_basic
union
(select *
from table1 t1
where exists (select *
from v_extension e1
where e1.id = t1.id));
Since the union operator returns distinct complete records, you may not need to worry about whether an id appears twice. If the id attribute is what determines whether a record should be retrieved from table1, you can approach the problem as suggested in this answer. If the whole record holds the distinguishing data, you could merge all the queries with the union operator. In that case:
select *
from v_basic
union
select *
from table1
... should be enough
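The two cases discussed above (deduplicating on id versus on the whole record) can be compared on a small example. This is a sketch under several assumptions: it runs in Python's built-in sqlite3 module rather than Oracle, plain tables stand in for the views, only two of the four views are modelled, NOT EXISTS is used in place of the minus-based v_extension view, and all rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Tables standing in for the views; the data is hypothetical.
    CREATE TABLE view1 (id INTEGER, name TEXT);
    CREATE TABLE view2 (id INTEGER, name TEXT);
    CREATE TABLE table1 (id INTEGER, name TEXT);
    INSERT INTO view1 VALUES (1, 'a'), (2, 'b');
    INSERT INTO view2 VALUES (2, 'b'), (3, 'c');
    INSERT INTO table1 VALUES (3, 'c-from-table1'), (4, 'd');

    CREATE VIEW v_basic AS
        SELECT * FROM view1
        UNION
        SELECT * FROM view2;
    """
)

# Case 1: the id decides. table1 rows whose id already exists in v_basic are skipped.
by_id = conn.execute(
    """
    SELECT * FROM v_basic
    UNION
    SELECT * FROM table1 t1
    WHERE NOT EXISTS (SELECT 1 FROM v_basic b WHERE b.id = t1.id)
    ORDER BY id
    """
).fetchall()

# Case 2: the whole record decides. A plain union only removes identical rows.
by_record = conn.execute(
    """
    SELECT * FROM v_basic
    UNION
    SELECT * FROM table1
    ORDER BY id, name
    """
).fetchall()

print(by_id)
print(by_record)
```

In this sample, id 3 exists in both v_basic and table1 with different names, so the plain union keeps both rows while the id-based version keeps only the one from the views.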
