How to reduce execution time on query with right join? - h2

My query with right join works fine if I don't have too much data. But when I have 500 rows in my table or more, the query can take 5 minutes or even more.
How can I reduce or improve the execution time of this query:
SELECT dlocation.USER_NAME,dtransaction.USER_NAME
FROM dlocation RIGHT JOIN dtransaction
ON (dtransaction.locationid= dlocation.id) and (dtransaction.isinternational =
dlocation.isinternational) and (dtransaction.USER_NAME= dlocation.USER_NAME)
WHERE dtransaction.typeId = 'Charge' and (dtransaction.USER_NAME is null or
dlocation.USER_NAME is null)

The performance of your query can be improved by adding the following indexes:
create index ix1 on dtransaction (typeId, USER_NAME);
create index ix2 on dlocation (id, isinternational, USER_NAME);

You will need to request H2 to show the execution plan for such query.
For this you need the EXPLAIN command [http://h2database.com/html/commands.html#explain]. You just need to add it in front of the query under performance investigation.
As this is a join, you need to make sure you've no tablescan left in your execution plan. In your case, you likely need to add indexes.

Related

Oracle11g select query with pagination

I am facing a big performance problem when trying to get a list of objects with pagination from an oracle11g database.
As far as I know and as much as I have checked online, the only way to achieve pagination in oracle11g is the following :
Example : [page=1, size=100]
SELECT * FROM
(
SELECT pagination.*, rownum r__ FROM
(
select * from "TABLE_NAME" t
inner join X on X.id = t.id
inner join .....
where ......
order
) pagination
WHERE rownum <= 200
)
WHERE r__ > 100
The problem in this query, is that the most inner query fetching data from the table "TABLE_NAME" is returning a huge amount of data and causing the overall query to take 8 seconds (there are around 2 Million records returned after applying the where clause, and it contains 9 or 10 join clause).
The reason of this is that the most inner query is fetching all the data that respects the where clause and then the second query is getting the 200 rows, and the third to exclude the first 100 to get the second pages' data we need.
Isn't there a way to do that in one query, in a way to fetch the second pages' data that we need without having to do all these steps and cause performance issues?
Thank you!!
It depends on your sorting options (order by ...): database needs to sort whole dataset before applying outer where rownum<200 because of your order by clause.
It will fetch only 200 rows if you remove your order by clause. In some cases oracle can avoid sort operations (for example, if oracle can use some indexes to get requested data in the required order). Btw, Oracle uses optimized sorting operations in case of rownum<N predicates: it doesn't sort full dataset, it just gets top N records instead.
You can investigate sort operations deeper using sort trace event: alter session set events '10032 trace name context forever, level 10';
Furthermore, sometimes it's better to use analytic functions like
select *
from (
select
t1.*
,t2.*
,row_number()over([partition by ...] order by ...) rn
from t1
,t2
where ...
)
where rn <=200
and rn>=100
because in some specific cases Oracle can transform your query to push sorting and sort filter predicates to the earliest possible steps.

Why are subqueries in selects slow compared to joins?

I have two queries. The first retrieves some aggregate from another table as a column, using a subquery in the select (returns a string concatenation of a column for all rows).
The second query does the same by having a subselect in the from and then join the results. This second query however is doing the aggregate on the complete table before joining, but it is much faster (286ms vs 7645ms).
I don't understand why the subquery is so much slower, while the second query does the aggregate on a table with 175k rows (on postgresql 9.5). Using a subselect is much easier to integrate in a query builder, so I would like to use that, and the second query will slow down when the number of records increase. Is there a way to increase the speed of a subselect?
Query 1:
select kp_No,
(select string_agg(description,E'\n') from (select nt_Text as description from fgeNote where nt_kp_No=fgeContact.kp_No order by nt_No DESC limit 3) as subquery) as description
from fgeContact
where kp_k_No=729;
Explain: https://explain.depesz.com/s/8sL
Query 2:
select kp_No, NoteSummary
from fgeContact
LEFT JOIN
(select nt_kp_No, string_agg(nt_Text,E'\n') as NoteSummary
from
(select nt_kp_No, nt_Text from fgeNote ORDER BY nt_No DESC) as sortquery
group by nt_kp_No) as joinquery
ON joinquery.nt_kp_No=kp_No
where kp_k_No=729;
Explain: https://explain.depesz.com/s/yk9W
This is because in the second query you are retrieving all registers in a single scan while in the first one, the subquery is executed one time per each selected register of the master table so, each time, the table should be scanned again.
Even with indexed scans, it is usually more expensive than scanning the whole table, even sequentially (in fact, sequential scan is much faster than indexed scan when when selecting more than a few registers because indexing implies some overhead) and picking only for interesting registers.
But that depends to the actual data distribution too. Its perfecly possible that, for a different value of kp_k_No, the second query would become faster if the table contains only one or a few rows with that value for this parameter.
It's a matter of test and guess the different situations that will happen...

Write a nested select statement with a where clause in Hive

I have a requirement to do a nested select within a where clause in a Hive query. A sample code snippet would be as follows;
select *
from TableA
where TA_timestamp > (select timestmp from TableB where id="hourDim")
Is this possible or am I doing something wrong here, because I am getting an error while running the above script ?!
To further elaborate on what I am trying to do, there is a cassandra keyspace that I publish statistics with a timestamp. Periodically (hourly for example) this stats will be summarized using hive, once summarized that data will be stored separately with the corresponding hour. So when the query runs for the second time (and consecutive runs) the query should only run on the new data (i.e. - timestamp > previous_execution_timestamp). I am trying to do that by storing the latest executed timestamp in a separate hive table, and then use that value to filter out the raw stats.
Can this be achieved this using hive ?!
Subqueries inside a WHERE clause are not supported in Hive:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries
However, often you can use a JOIN statement instead to get to the same result:
https://karmasphere.com/hive-queries-on-table-data#join_syntax
For example, this query:
SELECT a.KEY, a.value
FROM a
WHERE a.KEY IN
(SELECT b.KEY FROM B);
can be rewritten to:
SELECT a.KEY, a.val
FROM a LEFT SEMI JOIN b ON (a.KEY = b.KEY)
Looking at the business requirements underlying your question, it occurs that you might get more efficient results by partitioning your Hive table using hour. If the data can be written to use this factor as the partition key, then your query to update the summary will be much faster and require fewer resources.
Partitions can get out of hand when they reach the scale of millions, but this seems like a case that will not tease that limitation.
It will work if you put in :
select *
from TableA
where TA_timestamp in (select timestmp from TableB where id="hourDim")
EXPLANATION : As > , < , = need one exact figure in the right side, while here we are getting multiple values which can be taken only with 'IN' clause.

Data Densification Oracle query not efficient and slowly

I'm trying to cope with data densification for reporting purposes. I created two dimension tables (time & skills) and one data table (calls). Now since during certain time there are no calls in the data table, I will not get a time series including all the days. I now have studied many samples in the Internet how to cope with data densification and came up the the solution below.
Query works as intended, just it takes quite long and I have the feeling it is quite inefficient. Could you please advice me how to speed up query execution time?
Thank you and best regards,
Alex
SELECT DISTINCT
DAY_ID,
DAY_SHORT,
WEEK_ID,
MONTH_ID,
QUARTER_ID,
YEAR_ID,
AREA,
FIRMA,
PRODUCT,
PRODUCT_FAMILY,
PRODUCT_WFM,
LANGUAGE,
NVL(NCO,0) NCO,
NVL(NCH,0) NCH,
NVL(NCH60,0) NCH60,
NVL(LOST,0) LOST
FROM (
SELECT
DS.AREA,
DS.FIRMA,
DS.PRODUCT,
DS.PRODUCT_FAMILY,
DS.PRODUCT_WFM,
DS.LANGUAGE,
SUM(NVL(CH.HANDLED,0)+NVL(CH.LOST,0)) AS NCO,
SUM(CH.HANDLED) AS NCH,
SUM(CH.HANDLED_IN_SL) AS NCH60,
SUM(CH.LOST) AS LOST,
CH.DELIVER_DATE,
CH.SKILL_NAME
FROM
WFM.WFM_TBL_DIMENSION_SKILL DS
LEFT JOIN
OPS.VW_CALL_HISTORY CH
ON
DS.SPLIT_NAME=CH.SKILL_NAME
GROUP BY
DS.AREA,
DS.FIRMA,
DS.PRODUCT,
DS.PRODUCT_FAMILY,
DS.PRODUCT_WFM,
DS.LANGUAGE,
CH.DELIVER_DATE,
CH.SKILL_NAME
) temp_values
PARTITION BY
(
temp_values.AREA,
temp_values.FIRMA,
temp_values.PRODUCT,
temp_values.PRODUCT_FAMILY,
temp_values.PRODUCT_WFM,
temp_values.LANGUAGE,
temp_values.DELIVER_DATE,
temp_values.SKILL_NAME
)
RIGHT OUTER JOIN (
SELECT
DAY_ID,
DAY_SHORT,
WEEK_ID,
MONTH_ID,
QUARTER_ID,
YEAR_ID
FROM
WFM.WFM_TBL_DIMENSION_TIME
WHERE
DAY_ID BETWEEN(SELECT MIN(DELIVER_DATE) FROM OPS.VW_CALL_HISTORY) and TRUNC(sysdate-1)
) temp_time
ON
temp_values.DELIVER_DATE=temp_time.DAY_ID
Have a look at the execution plan and check which steps take very long. Use EXPLAIN PLAN to get it. Look for full table scans, see if indexes could help. Make sure you have up-to-date stats on the tables.
Since you are talking about dimension tables, this code is assumed to be from a data warehousing database. If it is, do you use partitions? Parallel DML? Are you using EE?
I reduced the arguments in PARTITION BY () to a single primary key (temp_values.SKILL_NAME) and joined the missing information from the skill dimension with a LEFT OUTER JOIN at the end of the above described query. In that way no more equal duplications are produced which leds me reduce SELECT DISTINCT to SELECT.
Additionally I added foreign & primary keys and let the query run in parallel mode.
It helps me to reduce execution time by over 80%, which is sufficient. Thanks guys!

Getting count from large tables

I was trying to get the count from a table with millions of entries. My query looks somewhat like this:
Select count(*)
from Users
where status = 'A' and office_id = '000111' and user_type = 'C'
Status can be A or C, User Type can be C or R.
Status, Office_id and User_type are Strings
The result has around 10 million rows, and its taking a lot of time. I just want the total count.
Would appreciate if anyone could tell me why its taking this much time, and workaround if any.
Do let me know in case of any more details required.
The database engine is Oracle 11g
Edit: I Added index for all three columnns. Still theres no improvement. Also tried the below query, but it always returns the total count in the table without checking the conditions.
SELECT COUNT(office_id_key)
FROM Users
WHERE EXISTS (SELECT * FROM Users WHERE status = 'A' AND office_id = '000111' AND user_type = 'C')
Why not just simply create indexes on the table on age and place this way your search will be faster then simply scanning the entire table for these values.
CREATE INDEX age_index ON Employee(age);
CREATE INDEX place_index ON Employee(place);
This should speed up the process.
AMENDED BASED ON QUERY CHANGE
CREATE INDEX status_index ON Users(status);
CREATE INDEX office_id_index ON Users(office_id);
CREATE INDEX user_type_index ON Users(user_type);
You'll want to create the following multi-column index on the Users table to improve the query:
(office_id, status, user_type)
The database can use a "covering" index with COUNT(*). Create the index with the columns in that order, due to cardinality.
After adding the indexes, I think changing where to where exists and a subquery may help as well.
Edit2: removed exists as it was returning all valid, usually the subquery has multiple joins, but I guess the case with one table returns all true. I read that count is optimized to act similar to exists when it has only one table and no where clause, so I treat the results as a table. Hopefully, this will have the same quick results.
select count(1) from
(select 1 from Employee where age = '25' and place = 'bricksgate')
Edit: When you use 'where exists' the db server doesn't load your data into memory and also takes advantage of the indexes because you will be reading values from the indexes not doing costly table lookups. You may also want to change count(*) to count(place) - that way it will limit the fields to an indexed field as well.
In your original query, your data was doing table lookups and then loading them into memory just to be counted.
count(1) works faster than count(*)

Resources