Query on SNOWFLAKE.ACCOUNT_USAGE schema taking a long time to run - performance

I am trying to run a simple query against any of the tables in the SNOWFLAKE.ACCOUNT_USAGE schema, but for some reason it takes a long time to run, even if I limit it to a single row, as in the following example:
SELECT * FROM "SNOWFLAKE"."ACCOUNT_USAGE"."ACCESS_HISTORY" limit 1;
Is that normal behavior? If not, can someone help me figure out why this is happening?

It's always good practice to add a WHERE condition so the optimizer can make use of query pruning.
If you know your objects were accessed within the past, say, 24 hours, can you add a date filter and see if that helps?
SELECT * FROM "SNOWFLAKE"."ACCOUNT_USAGE"."ACCESS_HISTORY"
WHERE QUERY_START_TIME > CURRENT_DATE() - 1
limit 1;
More info on micro-partitions and query pruning: https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions.html#query-pruning

Related

Performance tuning tips - PL/SQL/SQL - Database

We are facing a performance issue in production: the MV refresh program runs for a long time, almost 13 to 14 hours.
The refresh program refreshes 5 MVs, and one of those MVs accounts for most of the runtime.
Below is the MV script that is running long.
SELECT rcvt.transaction_id,
       rsh.shipment_num,
       rsh.shipped_date,
       rsh.expected_receipt_date,
       (SELECT rcvt1.transaction_date
          FROM rcv_transactions rcvt1
         WHERE rcvt1.po_line_id = rcvt.po_line_id
           AND rcvt1.transaction_type = 'RETURN TO VENDOR'
           AND rcvt1.parent_transaction_id = rcvt.transaction_id
       ) transaction_date
FROM rcv_transactions rcvt,
     rcv_shipment_headers rsh,
     rcv_shipment_lines rsl
WHERE 1 = 1
  AND rcvt.shipment_header_id = rsl.shipment_header_id
  AND rcvt.shipment_line_id = rsl.shipment_line_id
  AND rsl.shipment_header_id = rsh.shipment_header_id
  AND rcvt.transaction_type = 'RECEIVE';
The shipment table contains millions of records, and the query above extracts almost 60 to 70% of the data, so we suspect the data volume is the reason.
To improve the performance of the script, we added a date filter to restrict the data.
SELECT rcvt.transaction_id,
       rsh.shipment_num,
       rsh.shipped_date,
       rsh.expected_receipt_date,
       (SELECT rcvt1.transaction_date
          FROM rcv_transactions rcvt1
         WHERE rcvt1.po_line_id = rcvt.po_line_id
           AND rcvt1.transaction_type = 'RETURN TO VENDOR'
           AND rcvt1.parent_transaction_id = rcvt.transaction_id
       ) transaction_date
FROM rcv_transactions rcvt,
     rcv_shipment_headers rsh,
     rcv_shipment_lines rsl
WHERE 1 = 1
  AND rcvt.shipment_header_id = rsl.shipment_header_id
  AND rcvt.shipment_line_id = rsl.shipment_line_id
  AND rsl.shipment_header_id = rsh.shipment_header_id
  AND rcvt.transaction_type = 'RECEIVE'
  AND TRUNC(rsh.creation_date) >= NVL(TRUNC((sysdate - profile_value), 'MM'), TRUNC(rsh.creation_date));
For a 1-year profile it shows some improvement, but with a 2-year range it is even worse than the previous query.
Any suggestions to improve the performance?
Please help.
I'd pull out that scalar subquery into a regular outer join.
Costing for scalar subqueries can be poor, and you are forcing the database to do a lot of single-record lookups (presumably via an index) rather than giving the optimizer other options.
"The main query then has a scalar subquery in the select list.
Oracle therefore shows two independent plans in the plan table. One for the driving query – which has a cost of two, and one for the scalar subquery, which has a cost of 2083 each time it executes.
But Oracle does not “know” how many times the scalar subquery will run (even though in many cases it could predict a worst-case scenario), and does not make any cost allowance whatsoever for its execution in the total cost of the query."
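A sketch of that rewrite, assuming po_line_id plus parent_transaction_id identify at most one RETURN TO VENDOR row per RECEIVE transaction (the scalar subquery would have raised ORA-01427 otherwise; if that assumption doesn't hold, the join will multiply rows instead):
SELECT rcvt.transaction_id,
       rsh.shipment_num,
       rsh.shipped_date,
       rsh.expected_receipt_date,
       rtv.transaction_date            -- from the outer join instead of the scalar subquery
FROM rcv_transactions rcvt
JOIN rcv_shipment_lines rsl
  ON rcvt.shipment_header_id = rsl.shipment_header_id
 AND rcvt.shipment_line_id = rsl.shipment_line_id
JOIN rcv_shipment_headers rsh
  ON rsl.shipment_header_id = rsh.shipment_header_id
LEFT JOIN rcv_transactions rtv         -- rtv is a hypothetical alias for the RETURN TO VENDOR lookup
  ON rtv.po_line_id = rcvt.po_line_id
 AND rtv.parent_transaction_id = rcvt.transaction_id
 AND rtv.transaction_type = 'RETURN TO VENDOR'
WHERE rcvt.transaction_type = 'RECEIVE';
This gives the optimizer the option of a hash or merge join against the second copy of rcv_transactions instead of one indexed lookup per row.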

Sum on count in HQL - error "Not yet supported place for UDAF 'count'"

I'm new here, so please be gentle; this is my first question after using this website for a long time.
I'm trying to compute a sum of counts of events over the past 30 days:
select key, sum(coalesce(count(*), 0))
from table
where date >= '2016-08-13'
  and date <= '2016-09-11'
group by key;
But the sum doesn't seem to work. I'm looking at the last 30 days, and I would like to count every row that exists for each key and then sum the counts (I need to count on a daily basis and then sum all the days' counts).
If you can offer any other way to deal with this issue, I'm open to suggestions!
Many thanks,
Shira
You can't nest aggregate functions in HQL (or SQL). However, if you just want a count of the records falling within the range for each key, you can simply use COUNT(*):
select key, count(*)
from table
where date >= '2016-08-13' and
date <= '2016-09-11'
group by key;
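If you really do need the per-day counts as an intermediate step (say, to reuse the daily breakdown elsewhere), you can aggregate in a subquery and sum over it. A sketch, assuming the column holding the day is literally named date:
select t.key, sum(t.daily_cnt)
from (
    -- one count per key and day
    select key, date, count(*) as daily_cnt
    from table
    where date >= '2016-08-13'
      and date <= '2016-09-11'
    group by key, date
) t
group by t.key;
Note that the per-key total is the same number COUNT(*) gives you directly; the subquery form is only worth it if you need the daily counts for something else.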
It looks like there were a couple of things wrong with your code.
I've written this for you; I haven't tested it, but it passes the syntax check.
SELECT COUNT(key) AS Counting
FROM tblname
WHERE date >= '2016-08-13'
  AND date <= '2016-09-11'
GROUP BY key;
And this might help you; you should definitely be using COUNT for this query.
I'm not sure if it's related, but there might be an issue with naming a field 'key'; I kept receiving syntax errors for it.
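If key really is being parsed as a reserved word, Hive accepts backtick-quoted identifiers, so something like this should get around it:
SELECT `key`, COUNT(`key`) AS Counting
FROM tblname
WHERE date >= '2016-08-13'
  AND date <= '2016-09-11'
GROUP BY `key`;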
Hope I was able to help!
-Krypton

Optimizing DB2 query - why does bash "time" output stay the same?

I have a DB2 query on the TPC-H schema which I'm trying to optimize with indexes. I can tell from the db2expln output that the estimated costs are significantly lower (by about a factor of 3) when indexes are available.
However, I'm also trying to measure execution time, and it doesn't change significantly when the query is executed using the indexes.
I'm working in an SSH terminal, executing my queries and writing the output to a file, like this:
(time db2 "SELECT Q.name FROM
  (SELECT custkey, name FROM customer WHERE nationkey = 22) Q
WHERE Q.custkey IN
  (SELECT custkey FROM orders B WHERE B.orderkey IN
    (SELECT orderkey FROM lineitem WHERE receiptdate BETWEEN '1992-06-11' AND '1992-07-11'))") &> output.txt
I did 10 measurements each: 1) without indexes, 2) with an index on lineitem.receiptdate, 3) with indexes on lineitem.receiptdate and customer.nationkey,
then calculated the average time and standard deviation; all are within the same range.
I executed RUNSTATS ON TABLE schemaname.tablename AND DETAILED INDEXES ALL after index creation.
I read this post about the output of the time command; from what I understand, sys + user time should be the relevant measurement for me. There is no change in sys + user time, nor in real time.
sys + user is around 44 ms, real is around 1 s.
Any hints as to why I cannot see a change in time? Am I interpreting the time output wrong? Are the optimizer estimates in db2expln misleading?
Disclaimer: I'm supposed to give a presentation about this at university, so it's technically homework, but as it's more of a comprehension question and not "please make my code work", I hope it's appropriate to post it here. Also, I know the query could be simplified, but my question is not about that.
The optimizer estimates timerons (a measurement of internal cost), and timerons cannot be translated one-to-one into query execution time. So a difference of 300% in timerons does not mean you will see a 300% difference in runtime.
For measuring the time of one or more SQL statements, I recommend using db2batch with the option
-i complete
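A minimal invocation might look like this (tpch as the database name and query.sql as the file holding the statement are placeholders):
db2batch -d tpch -f query.sql -i complete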
Apart from that, the query can be written more compactly with EXISTS:
SELECT f1.name FROM customer f1
WHERE f1.nationkey = 22 AND EXISTS (
    SELECT * FROM orders f2 INNER JOIN lineitem f3 ON f2.orderkey = f3.orderkey
    WHERE f1.custkey = f2.custkey AND f3.receiptdate BETWEEN '1992-06-11' AND '1992-07-11'
)

percentile_approx in hive returning zero

I have been trying to compute percentile_approx for a set of users. The intention is to get the top 25% of customers in the data set. So, in order to check that, I ran the following Hive query:
select percentile_approx(amount, 0.75)
from sales
However, the value returned by this query is 0.0. I am not sure what the problem is. When I run this query over a sample of a few records, the result is as expected.
Can anyone please shed some light on this?
Note - I am trying to find the percentile in a data set containing more than 3.3M records.
Try this instead:
select percentile_approx(cast(amount as double), ARRAY(0.75))
from sales
Generally, percentile_approx() only works on numeric data. Make sure the column you apply it to is numeric; if amount is stored as a string, cast it to double first (as above), otherwise the function can return 0.0.
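If you are unsure what type the column actually has, describe the table first:
describe sales;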

date difference in terms of days using mongotemplate

I have 3 fields in my MongoDB documents: days (long), startDate (java.util.Date) and endDate (java.util.Date). What I want is to fetch the records between startDate and (endDate - days), in other words those where (endDate - startDate) <= days.
Can you please let me know how I could achieve this using Spring's MongoTemplate?
I don't want to fetch all the records from the collection and then resolve this on the Java side, since in the future my collection may have millions of records.
Thanks,
Jitender
There is no way to do this in a query on the DB side (the end minus start part). If this is an important feature for your application, I recommend altering the schema to maintain in each document the delta between the two fields, in the format you need it. You can update that field whenever you update endDate (or, if you populate both dates at the same time, just compute the field then).
If you receive this data in bulk from another source, or if you do multi-updates of endDate, then you will probably need a job that runs periodically and computes the delta for the documents where it has not been computed yet (you can start by always setting the delta to a sentinel like 99999 and have the job update it to the accurate value once endDate is set).
While you could use the $where clause, it performs a very slow full collection scan, so I would not suggest using it; it's probably better to come up with a more performant alternative, even if it requires altering the schema.
http://docs.mongodb.org/manual/reference/operator/where/
