Cube Operators on Impala - performance

In doing benchmarks between Impala and PrestoDB, we noticed that building pivot tables is quite difficult in Imapala because it does not have Cube operators, like Presto does. Here are two examples in Presto:
The CUBE operator generates all possible grouping sets (i.e. a power set) for a given set of columns. For example, the query:`
SELECT origin_state, destination_state, sum(package_weight)
FROM shipping
GROUP BY CUBE (origin_state, destination_state);
is equivalent to:
SELECT origin_state, destination_state, sum(package_weight)
FROM shipping
GROUP BY GROUPING SETS (
(origin_state, destination_state),
(origin_state),
(destination_state),
());
Another example is the ROLLUP operator. Full documentation is here: https://prestodb.io/docs/current/sql/select.html.
It's not syntatic sugar because PRESTO perform one table scan for whole query - so using this operators you can build pivot table in one request Impala need to run 2-3 queries.
Is there a way in which we can do this with one query / table-scan in Impala instaead of 3? Otherwise the performance becomes terrible on creating any type of pivot table.

We can use impala windo functions but instead of single column output, you will get 3 columns.
SELECT origin_state,
destination_state,
SUM(package_weight) OVER (PARTITION BY origin_state, destination_state) AS pkgwgrbyorganddest,
SUM(package_weight) OVER (PARTITION BY origin_state) AS pkgwgrbyorg,
SUM(package_weight) OVER (PARTITION BY destination_state) AS pkgwgrbydest
FROM shipping;

Related

Query in DB2 vs Oracle

There is a query having multiple inner joins. It involves two views, of which one view is based on four tables, and total there are four tables(including two views).
The same query with the same amount of data in the source tables runs in both, Oracle and DB2. In DB2, surprisingly, it takes 2 minutes to load 3 million records. While in Oracle, it is taking two hours. Same indexes are on all source tables in both the environments. Is the behavior of views (when used in joins) different in both environments (Oracle vs DB2)?
a dummy query I am sharing :-
INSERT INTO TABLE_A
SELECT
adf.column1,
adf.column2,
dd.column3,
SUM(otl.column4) column4,
SUM(otl.column5) column5,
(Case when SUM(otl.column5) = 0 then 0
else round(CAST(SUM(otl.column4) AS DECIMAL(19,2)) /abs(CAST(SUM(otl.column4) AS DECIMAL(18,2))),4)
end) taxl_unrlz_cgl_pct
FROM
view_a adf
INNER JOIN table_b hr on hr.hh_ref_id = adf.hh_ref_id
AND hr.col_typ_cd = 'FIRM'
AND hr.col_end_dt = TO_DATE('1/1/2900','MM/DD/YYYY')
INNER JOIN dw.table_c ar on ar.colb_id = adf.colb_id
AND ar.col_cd = '#'
AND ar.col_num BETWEEN 10000000 AND 89999999
AND ar.col_dt IS NULL
INNER JOIN table_d dd on dd.col_id = adf.col_id
INNER JOIN view2 otl ON otl.cola_id = ar.cola_id
GROUP BY adf.column1, adf.column2, dd.column3;
Technically, both DB2 and Oracle will try to rewrite the query in most efficient way possible using the base query that you have coded. But one of the common (but not frequent) issues that I have seen when using multi-table view is DBMS not being able to rewrite the query using underlying tables. So depending on complexity of the view itself and sometime the additional joins, DBMS may not be able to rewrite the query to use the underlying tables properly and hence resulting in not being able to use the indexes on the underlying tables used in the view. When this happens, the view itself acts like a materialized table (work table) and query goes for table scan on the materialized table.
There is no consistent pattern on when such issue can happen. So you will need to check on a case by case basis.
Since you are mentioning about 2 hrs vs 2 minutes, in most probability that might be the case. So you will need to check the access path on both Oracle and DB2. But you will also need to make sure that stats are updated and access path is based on latest stats on DBMS. Else it won't be apples to apples compare.

Why are subqueries in selects slow compared to joins?

I have two queries. The first retrieves some aggregate from another table as a column, using a subquery in the select (returns a string concatenation of a column for all rows).
The second query does the same by having a subselect in the from and then join the results. This second query however is doing the aggregate on the complete table before joining, but it is much faster (286ms vs 7645ms).
I don't understand why the subquery is so much slower, while the second query does the aggregate on a table with 175k rows (on postgresql 9.5). Using a subselect is much easier to integrate in a query builder, so I would like to use that, and the second query will slow down when the number of records increase. Is there a way to increase the speed of a subselect?
Query 1:
select kp_No,
(select string_agg(description,E'\n') from (select nt_Text as description from fgeNote where nt_kp_No=fgeContact.kp_No order by nt_No DESC limit 3) as subquery) as description
from fgeContact
where kp_k_No=729;
Explain: https://explain.depesz.com/s/8sL
Query 2:
select kp_No, NoteSummary
from fgeContact
LEFT JOIN
(select nt_kp_No, string_agg(nt_Text,E'\n') as NoteSummary
from
(select nt_kp_No, nt_Text from fgeNote ORDER BY nt_No DESC) as sortquery
group by nt_kp_No) as joinquery
ON joinquery.nt_kp_No=kp_No
where kp_k_No=729;
Explain: https://explain.depesz.com/s/yk9W
This is because in the second query you are retrieving all registers in a single scan while in the first one, the subquery is executed one time per each selected register of the master table so, each time, the table should be scanned again.
Even with indexed scans, it is usually more expensive than scanning the whole table, even sequentially (in fact, sequential scan is much faster than indexed scan when when selecting more than a few registers because indexing implies some overhead) and picking only for interesting registers.
But that depends to the actual data distribution too. Its perfecly possible that, for a different value of kp_k_No, the second query would become faster if the table contains only one or a few rows with that value for this parameter.
It's a matter of test and guess the different situations that will happen...

Oracle SQL sub query vs inner join

At first, I seen the select statement on Oracle Docs.
I have some question about oracle select behaviour, when my query contain select,join,where.
see this below for information:
My sample table:
[ P_IMAGE_ID ]
IMAGE_ID (PK)
FILE_NAME
FILE_TYPE
...
...
[ P_IMG_TAG ]
IMG_TAG_ID (PK)
IMAGE_ID (FK)
TAG
...
...
My requirement are: get distinct of image when it's tag is "70702".
Method 1: Select -> Join -> Where -> Distinct
SELECT DISTINCT PID.IMAGE_ID
, PID.FILE_NAME
FROM P_IMAGE_ID PID
INNER JOIN P_IMG_TAG PTAG
ON PTAG.IMAGE_ID = PID.IMAGE_ID
WHERE PTAG.TAG = '70702';
I think the query behaviour should be like:
join table -> hint where cause -> distinct select
I use Oracle SQL developer to get the explain plan:
Method 1 cost 76.
Method 2: Select -> Where -> Where -> Distinct
SELECT DISTINCT PID.IMAGE_ID
, PID.FILE_NAME
FROM P_IMAGE_ID PID
WHERE PID.IMAGE_ID IN
(
SELECT PTAG.IMAGE_ID
FROM P_IMG_TAG PTAG
WHERE PTAG.TAG = '70702'
);
I think the second query behaviour should be like:
hint where cause -> hint where cause -> distinct select
I use Oracle SQL developer to get the explain plan too:
Method 2 cost 76 too. Why?
I believe when I try where cause first for reduce the database process and avoid join table that query performance should be better than the table join query, but now when I test it, I am confused, why 2 method cost are equal ?
Or am I misunderstood something ?
List of my question here:
Why 2 method above cost are equal ?
If the result of sub select Tag = '70702' more than thousand or million or more, use join table should be better alright ?
If the result of sub select Tag = '70702' are least, use sub select for reduce data query process is better alright ?
When I use method 1 Select -> Join -> Where -> Distinct mean the database process table joining before hint where cause alright ?
Someone told me when i move hint cause Tag = '70702' into join cause
(ie. INNER JOIN P_IMG_TAG PTAG ON PAT.IMAGE_ID = PID.IMAGE_ID AND PTAG.TAG = '70702' ) it's performance may be better that's alright ?
I read topic subselect vs outer join and subquery or inner join but both are for SQL Server, I don't sure that may be like Oracle database.
The DBMS takes your query and executes something. But it doesn't execute steps that correspond to SQL statement parts in the order they appear in an SQL statement.
Read about "relational query optimization", which could just as well be called "relational query implementation". Eg for Oracle.
Any language processor takes declarations and calls as input and implements the described behaviour in terms of internal data structures and operations, maybe through one or more levels of "intermediate code" running on a "virtual machine", eventually down to physical machines. But even just staying in the input language, SQL queries can be rearranged into other SQL queries that return the same value but perform significantly better under simple and general implementation assumptions. Just as you know that your question's queries always return the same thing for a given database, the DBMS can know. Part of how it knows is that there are many rules for taking a relational algebra expression and generating a different but same-valued expression. Certain rewrite rules apply under certain limited circumstances. There are rules that take into consideration SQL-level relational things like primary keys, unique columns, foreign keys and other constraints. Other rules use implementation-oriented SQL-level things like indexes and statistics. This is the "relational query rewriting" part of relational query optimization.
Even when two different but equivalent queries generate different plans, the cost can be similar because the plans are so similar. Here, both a HASH and SORT index are UNIQUE. (It would be interesting to know what the few top plans were for each of your queries. It is quite likely that those few are the same for both, but that the plan that is more directly derived from the particular input expression is the one that is offered when there's little difference.)
The way to get the DBMS to find good query plans is to write the most natural expression of a query that you can find.

Write a nested select statement with a where clause in Hive

I have a requirement to do a nested select within a where clause in a Hive query. A sample code snippet would be as follows;
select *
from TableA
where TA_timestamp > (select timestmp from TableB where id="hourDim")
Is this possible or am I doing something wrong here, because I am getting an error while running the above script ?!
To further elaborate on what I am trying to do, there is a cassandra keyspace that I publish statistics with a timestamp. Periodically (hourly for example) this stats will be summarized using hive, once summarized that data will be stored separately with the corresponding hour. So when the query runs for the second time (and consecutive runs) the query should only run on the new data (i.e. - timestamp > previous_execution_timestamp). I am trying to do that by storing the latest executed timestamp in a separate hive table, and then use that value to filter out the raw stats.
Can this be achieved this using hive ?!
Subqueries inside a WHERE clause are not supported in Hive:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries
However, often you can use a JOIN statement instead to get to the same result:
https://karmasphere.com/hive-queries-on-table-data#join_syntax
For example, this query:
SELECT a.KEY, a.value
FROM a
WHERE a.KEY IN
(SELECT b.KEY FROM B);
can be rewritten to:
SELECT a.KEY, a.val
FROM a LEFT SEMI JOIN b ON (a.KEY = b.KEY)
Looking at the business requirements underlying your question, it occurs that you might get more efficient results by partitioning your Hive table using hour. If the data can be written to use this factor as the partition key, then your query to update the summary will be much faster and require fewer resources.
Partitions can get out of hand when they reach the scale of millions, but this seems like a case that will not tease that limitation.
It will work if you put in :
select *
from TableA
where TA_timestamp in (select timestmp from TableB where id="hourDim")
EXPLANATION : As > , < , = need one exact figure in the right side, while here we are getting multiple values which can be taken only with 'IN' clause.

oracle-inline view

Why inline views are used..??
There are many different reasons for using inline views. Some things can't be done without inline views, for example:
1) Filtering on the results of an analytic function:
select ename from
( select ename, rank() over (order by sal desc) rnk
from emp
)
where rnk < 4;
2) Using ROWNUM on ordered results:
select ename, ROWNUM from
( select ename
from emp
order by ename
);
Other times they just make it easier to write the SQL you want to write.
The inline view is a construct in Oracle SQL where you can place a query in the SQL FROM, clause, just as if the query was a table name.
Inline views provide
Bind variables can be introduced inside the statement to limit the data
Better control over the tuning
Visibility into the code
To get top N ordered rows.
SELECT name, salary,
FROM (SELECT name, salary
FROM emp
ORDER BY salary DESC)
WHERE rownum <= 10;
An inline view can be regarded as an intermediate result set that contributes to the required data set in some way. Sometimes it is entirely a matter of improving maintainability of the code, and sometimes it is logically neccessary.
From the Oracle Database Concepts document there are the inline view concept definition:
An inline view is not a schema object.
It is a subquery with an alias
(correlation name) that you can use
like a view within a SQL statement.
About the subqueries look in Using Subqueries from the Oracle SQL Reference manual. It have a very nice pedagogic information.
Anyway, today is preferred to use the Subquery Factoring Clause that is a more powerfull way of use inline views.
As an example of all together:
WITH
dept_costs AS (
SELECT department_name, SUM(salary) dept_total
FROM employees e, departments d
WHERE e.department_id = d.department_id
GROUP BY department_name),
avg_cost AS
SELECT * FROM dept_costs
WHERE dept_total >
(SELECT avg FROM (SELECT SUM(dept_total)/COUNT(*) avg
FROM dept_costs)
)
ORDER BY department_name;
There you can see one of all:
An inline view query: SELECT SUM...
A correlated subquery: SELECT avg FROM...
A subquery factoring: dept_costs AS (...
What are they used for?:
To avoid creating an intermediate view object: CREATE VIEW ...
To simplify some queries that a view cannot be helpfull. For instance, when the view need to filter from the main query.
You will often use inline views to break your query up into logical parts which helps both readability and makes writing more complex queries a bit easier.
Jva and Tony Andrews provided some good examples of simple cases where this is useful such as Top-N or Pagination queries where you may want to perform a query and order its results before using that as a part of a larger query which in turn might feed a query doing some other processing, where the logic for these individual queries would be difficult to achieve in a single query.
Another case they can be very useful is if you are writing a query that joins various tables together and want to perform aggregation on some of the tables, separating group functions and the processing into different inline views before performing the joins makes managing cardinality a lot easier. If you want some examples, I would be happy to provide them to make it more clear.
Factored subqueries (where you list your queries in the WITH clause at the start of the query) and inline views also often bring performance benefits. If you need to access the results of the subquery multiple times, you only need to run it once and it can be materialized as a Global Temporary Table (how the optimizer acts isn't totally black and white so I won't go into it here but you can do your own research - for example, see http://jonathanlewis.wordpress.com/2007/07/26/subquery-factoring-2/)

Resources