PostgreSQL - How to decrease select statement execution time - performance

My Postgres version: "PostgreSQL 9.4.1, compiled by Visual C++ build
1800, 32-bit"
The table I am dealing with contains the columns:
eventtime - timestamp without time zone
serialnumber - character varying(32)
sourceid - integer
and 4 other columns.
Here is my select statement:
SELECT eventtime, serialnumber
FROM t_el_eventlog
WHERE
eventtime at time zone 'CET' > CURRENT_DATE and
sourceid = '14';
The execution time for the above query is 59647 ms.
And in my R script I have 5 queries of this kind (execution time = 59647 ms * 5).
Without using time zone 'CET', the execution time is much lower - but in my case I must use time zone 'CET', and if I am right the high execution time is because of that time zone conversion.
My query plan, the text of the query, and EXPLAIN ANALYZE of the query without the time zone were posted as links.
Is there any way I can decrease the query execution time for my select statement?

Since the distribution of the values is unknown to me, there is no clear way of solving the problem.
But one problem is obvious: there is an index on the eventtime column, but since the query applies a function to that column, the index can't be used:
eventtime AT TIME ZONE 'UTC' > CURRENT_DATE
Either the index has to be dropped and recreated with that function or the query has to be rewritten.
Recreate the index (example):
CREATE INDEX ON t_el_eventlog (timezone('UTC'::text, eventtime));
(this is the same as eventtime AT TIME ZONE 'UTC')
This makes the indexed expression match the filter, so the index can be used.
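For illustration, a query shaped to match that example index could look like this (just a sketch; the question actually filters on 'CET', so the zone used in the index expression and in the filter would have to be the same):
-- the filter expression is identical to the indexed expression, so the planner can use the index
SELECT eventtime, serialnumber
FROM t_el_eventlog
WHERE timezone('UTC'::text, eventtime) > CURRENT_DATE
  AND sourceid = 14;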
I suspect that sourceid does not have a great distribution, i.e. not very many distinct values. In that case, dropping the index on sourceid AND dropping the index on eventtime, and creating a new index over eventtime and sourceid, could be an idea:
CREATE INDEX ON t_el_eventlog (timezone('UTC'::text, eventtime), sourceid);
This is what the theory tells us. I made a few tests around that, with a table of around 10 million rows, eventtime distributed within 36 hours and only 20 different sourceids (1..20), very randomly distributed. The best results came from an index over eventtime, sourceid (no function index) and adjusting the query.
CREATE INDEX ON t_el_eventlog (eventtime, sourceid);
-- make sure there is no separate index on sourceid; we need to force Postgres onto this index
-- make sure Postgres learns about our index
ANALYZE; VACUUM;
-- use the timezone function on the current date (guessing the time zone is CET)
SELECT * FROM t_el_eventlog
WHERE eventtime > timezone('CET',CURRENT_DATE) AND sourceid = 14;
With the table having 10,000,000 rows, this query returns about 500,000 rows in only 400 ms (instead of roughly 1400 to 1700 ms with all other combinations).
Finding the best match between the indexes and the query is the real quest here. I suggest some research; a good recommendation is http://use-the-index-luke.com
This is what the query plan looks like with the last approach:
Index Only Scan using evlog_eventtime_sourceid_idx on evlog (cost=0.45..218195.13 rows=424534 width=0)
Index Cond: ((eventtime > timezone('CET'::text, (('now'::cstring)::date)::timestamp with time zone)) AND (sourceid = 14))
As you can see, this is a perfect match.

Related

Frequency Histogram in ClickHouse with unique and non-unique data

I have an event table with created_at (DateTime), userid (String) and eventid (String) columns. Here userid can repeat, while eventid is always a unique UUID.
I am looking to build both unique and non-unique frequency histograms.
This is for both eventid and userid, on the basis of three given inputs:
start_datetime,
end_datetime and
interval (1 min, 1 hr, 1 day, 7 days, 1 month).
Here, the number of buckets is decided by (end_datetime - start_datetime) / interval.
The output comes as start_datetime, end_datetime and frequency.
For any interval where no data is available, start_datetime and end_datetime are still returned, but with a frequency of 0.
How can I build a generic query for this?
I looked into the histogram function but could not find any documentation for it. While trying it, I could not understand the relation between the input and the output.
count(distinct XXX) is deprecated.
uniq(XXX) or uniqExact(XXX) are more useful.
I got it to work using the following. Here, toStartOfMonth can be changed to other similar functions in ClickHouse.
select toStartOfMonth(`timestamp`) interval_data , count(distinct uid) count_data
from g94157d29.event1
where `timestamp` >= toDateTime('2018-11-01 00:00:00') and `timestamp` <= toDateTime('2018-12-31 00:00:00')
GROUP BY interval_data;
and
select toStartOfMonth(`timestamp`) interval_data , count(*) count_data
from g94157d29.event1
where `timestamp` >= toDateTime('2018-11-01 00:00:00') and `timestamp` <= toDateTime('2018-12-31 00:00:00')
GROUP BY interval_data;
But performance is very low for >2 billion records per month in the event table, where toYYYYMM(timestamp) is the partition key and toYYYYMMDD(timestamp) is the ORDER BY key.
The distinct count query takes >30 GB of memory and 30 seconds, and still didn't complete.
The general count query takes 10-20 seconds to complete.
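Following the suggestion above, the distinct count query could be rewritten with uniqExact (just a sketch, using the same table and columns as above; uniq() would be cheaper but approximate):
-- uniqExact gives an exact distinct count; uniq is approximate but uses far less memory
select toStartOfMonth(`timestamp`) interval_data , uniqExact(uid) count_data
from g94157d29.event1
where `timestamp` >= toDateTime('2018-11-01 00:00:00') and `timestamp` <= toDateTime('2018-12-31 00:00:00')
GROUP BY interval_data;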

ClickHouse - how to get counts per 1 minute or 1 day

I have a table in ClickHouse for keeping statistics and metrics.
Its structure is:
datetime | metric_name | metric_value
I want to keep statistics and limit the number of accesses per 1 minute, 1 hour, 1 day and so on. So I need event counts in the last minute, hour or day for every metric_name, and I want to prepare the statistics for a chart.
I do not know how to write a query that gets the count of metrics grouped by an exact interval, for example 1 minute, 1 hour, 1 day and so on.
I used to work with InfluxDB:
SELECT SUM(value) FROM `TABLE` WHERE `metric_name`=`metric_value` AND time >= now() - 1h GROUP BY time(5m) fill(0)
In fact, I want to get the number of each metric per 5 minutes in the previous 1 hour.
I do not know how to use aggregations for this problem.
ClickHouse has functions for generating Date/DateTime group buckets, such as toStartOfWeek, toStartOfHour, toStartOfFiveMinute. You can also use the intDiv function to manually divide value ranges. However, the fill feature is still on the roadmap.
For example, you can rewrite the InfluxDB SQL without the fill in ClickHouse like this:
SELECT SUM(value) FROM `TABLE` WHERE `metric_name`=`metric_value` AND
time >= now() - 1h GROUP BY toStartOfFiveMinute(time)
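For illustration, the intDiv function mentioned above can build the same five-minute buckets by hand (a sketch using the table and column names from the query above; 300 is the bucket width in seconds):
-- round each timestamp down to its 5-minute boundary and aggregate per bucket
SELECT toDateTime(intDiv(toUInt32(time), 300) * 300) AS slot, SUM(value) AS value_sum
FROM `TABLE`
WHERE time >= now() - toIntervalHour(1)
GROUP BY slot
ORDER BY slot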
You can also refer to this discussion https://github.com/yandex/ClickHouse/issues/379
Update
There is a timeSlots function that can help generate the empty buckets. Here is a working example:
SELECT
slot,
metric_value_sum
FROM
(
SELECT
toStartOfFiveMinute(datetime) AS slot,
SUM(metric_value) AS metric_value_sum
FROM metrics
WHERE (metric_name = 'k1') AND (datetime >= (now() - toIntervalHour(1)))
GROUP BY slot
)
ANY RIGHT JOIN
(
SELECT arrayJoin(timeSlots(now() - toIntervalHour(1), toUInt32(3600), 300)) AS slot
) USING (slot)

SQLite SELECT with max() performance

I have a table with about 1.5 million rows and three columns. Column 'timestamp' is of type REAL and indexed. I am accessing the SQLite database via PHP PDO.
The following three selects run in less than a millisecond:
select timestamp from trades
select timestamp + 1 from trades
select max(timestamp) from trades
The following select needs almost half a second:
select max(timestamp) + 1 from trades
Why is that?
EDIT:
Lasse has asked for an "explain query plan". I have run this within a PHP PDO query since I have no direct SQLite3 command-line tool access at the moment. I guess it does not matter; here is the result:
explain query plan select max(timestamp) + 1 from trades:
[selectid] => 0
[order] => 0
[from] => 0
[detail] => SCAN TABLE trades (~1000000 rows)
explain query plan select max(timestamp) from trades:
[selectid] => 0
[order] => 0
[from] => 0
[detail] => SEARCH TABLE trades USING COVERING INDEX tradesTimestampIdx (~1 rows)
The reason this query
select max(timestamp) + 1 from trades
takes so long is that the query engine must, for each record, compute the MAX value and then add one to it. Computing the MAX value involves doing a full table scan, and this must be repeated for each record because you are adding one to the value.
In the query
select timestamp + 1 from trades
you are doing a calculation for each record, but the engine only needs to scan the entire table once. And in this query
select max(timestamp) from trades
the engine does have to scan the entire table; however, it also does so only once.
From the SQLite documentation:
Queries that contain a single MIN() or MAX() aggregate function whose argument is the left-most column of an index might be satisfied by doing a single index lookup rather than by scanning the entire table.
I emphasized might from the documentation because it appears that a full table scan may be necessary for a query of the form SELECT MAX(x)+1 FROM table if column x is not the left-most column of an index.
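One workaround, as a sketch, is to compute the bare MAX in a scalar subquery, so the index optimization still applies, and add 1 outside of it:
-- the inner bare max() can be satisfied by the covering index; +1 is applied to the single resulting value
select (select max(timestamp) from trades) + 1;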

How to avoid expensive Cartesian product using row generator

I'm working on a query (Oracle 11g) that does a lot of date manipulation. Using a row generator, I'm examining each date within a range of dates for each record in another table. Through another query, I know that my row generator needs to generate 8500 dates, and this amount will grow by 365 days each year. Also, the table that I'm examining has about 18000 records, and this table is expected to grow by several thousand records a year.
The problem comes when joining the row generator to the other table to get the range of dates for each record. SQL Tuning Advisor says that there's an expensive Cartesian product, which makes sense given that the query could currently generate up to 8500 x 18000 records. Here's the query in its stripped-down form, without all the date logic etc.:
with n as (
select level n
from dual
connect by level <= 8500
)
select t.id, t.origdate + n origdate
from (
select id, origdate, closeddate
from my_table
) t
join n on origdate + n - 1 <= closeddate -- here's the problem join
order by t.id, t.origdate;
Is there an alternate way to join these two tables without the Cartesian product?
I need to calculate the elapsed time for each of these records, disallowing weekends and federal holidays, so that I can sort on the elapsed time. Also, the pagination for the table is done server-side, so we can't just load into the table and sort client-side.
The maximum age of a record in the system right now is 3656 days, and the average is 560, so it's not quite as bad as 8500 x 18000; but it's still bad.
I've just about resigned myself to adding a field to store the opendays, computing it once and storing the elapsed time, and creating a scheduled task to update all open records every night.
I think that you would get better performance if you rewrite the join condition slightly:
with n as (
select level n
from dual
connect by level <= 8500
)
select t.id, t.origdate + n origdate
from (
select id, origdate, closeddate
from my_table
) t
join n on closeddate - origdate + 1 >= n -- you could even create a function-based index on this expression (sketch below)
order by t.id, t.origdate;
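The function-based index hinted at in the comment could look roughly like this (a sketch; the index name is made up):
-- index the expression used in the rewritten join condition
create index my_table_open_days_idx on my_table (closeddate - origdate + 1);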

Oracle optimizing query involving date calculation

Database
Table1
  Id
  Table2Id
  ...
Table2
  Id
  StartTime
  Duration (in hours)
Query
select * from Table1 join Table2 on Table2Id = Table2.Id
where starttime < :starttime and starttime + Duration/24 > :endtime
This query is currently taking about 2 seconds to run, which is too long. There is an index on the id columns and a function-based index on starttime + Duration/24. In SQL Developer the query plan shows no indexes being used. The query returns 475 rows for my test start and end times. Table2 has ~800k rows and Table1 has ~200k rows.
If the Duration/24 calculation is removed from the query and replaced with a static value, the query time is reduced by half. This does not retrieve exactly the same data, but it leads me to believe that the division is expensive.
I have also tested adding an endtime column to Table2 that is populated with (starttime + Duration/24). The column was prepopulated via a single update; if it were used in production I would populate it via an update trigger, along the lines sketched below.
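For illustration, such a trigger might look roughly like this (just a sketch; the trigger name and the endtime column are assumed from the description above):
-- keep the redundant endtime column in sync with starttime and duration
CREATE OR REPLACE TRIGGER table2_endtime_trg
BEFORE INSERT OR UPDATE OF starttime, duration ON Table2
FOR EACH ROW
BEGIN
  :NEW.endtime := :NEW.starttime + :NEW.duration / 24;
END;
/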
select * from Table1 join Table2 on Table2Id = Table2.Id
where starttime < :starttime and endtime > :endtime
This query will run in about 600 ms, and it uses an index for the join. It is less than ideal because of the additional column with redundant data.
Are there any methods of making this query faster?
Create a function-based index on both starttime and the expression starttime + Duration/24:
create index myindex on table2(starttime, starttime + Duration / 24);
A compound index covering the entire predicate of your query should then be selected, whereas with the columns indexed individually the optimizer is likely deciding that repeated table accesses by ROWID, based on a scan of one of those indexes, are actually slower than a full table scan.
Also make sure that you're not doing an implicit conversion from varchar to date, by ensuring that you're passing DATEs in your bind variables.
Try lowering the optimizer_index_cost_adj system parameter. I believe the default is 100. Try setting that to 10 and see if your index is selected.
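For example, at session level (a sketch, so the experiment does not affect the whole instance):
ALTER SESSION SET optimizer_index_cost_adj = 10;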
Consider partitioning the table by starttime.
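A sketch of that, assuming a version that supports interval partitioning (all names here are illustrative):
-- monthly interval partitions on starttime
CREATE TABLE table2_part (
  id        NUMBER,
  starttime DATE,
  duration  NUMBER
)
PARTITION BY RANGE (starttime)
INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
(PARTITION p0 VALUES LESS THAN (DATE '2009-01-01'));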
You have two criteria with range predicates (greater than/less than). An index range scan can start at one point in the index and end at another.
For a compound index on starttime and "Starttime+duration/24", since the leading column is starttime and the predicate is "less than bind value", it will start at the leftmost edge of the index (earliest starttime) and range-scan all rows up to the point where starttime reaches the limit. For each of those matches, it can evaluate the calculated value for "Starttime+duration/24" on the index against the bind value and pass or reject the row. I'd suspect most of the data in the table is old, so most entries have an old starttime and you'd end up scanning most of the index.
For a compound index on "Starttime+duration/24" and starttime, since the leading column is the function and the predicate is "greater than bind value", it will start partway through the index and work its way to the end. For each of those matches, it can evaluate the starttime on the index against the bind value and pass or reject the row. If the end date passed in is recent, I suspect this would actually involve a much smaller amount of the index being scanned.
Even without starttime as a second column on the index, the existing function-based index on "Starttime+duration/24" should still be useful and used. Check the explain plan to make sure the bind value is either a date or converted to a date. If it is converted, make sure the appropriate format mask is used (e.g. an entered value of '1/Jun/09' may be converted to year 0009, so Oracle will see the condition as very relaxed and would tend not to use the index - plus the result could be wrong).
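As a sketch of that point, an explicit conversion with a full format mask avoids the implicit-conversion trap (the bind names here are made up):
-- pass explicit DATEs rather than strings; an RRRR mask keeps '09' from becoming year 0009
select * from Table1 join Table2 on Table2Id = Table2.Id
where starttime < TO_DATE(:starttime_str, 'DD/MON/RRRR')
  and starttime + Duration/24 > TO_DATE(:endtime_str, 'DD/MON/RRRR');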
"In Sql Developer the query plan shows no indexes being used. " If the index wasn't being used to find the table2 rows, I suspect the optimizer thought most/all of table2 would be returned [which it obviously isn't, by your numbers]. I'd guess that it though most of table1 would be returned, and thus neither of your predicates did a lot of filtering. As I said above, I think the "less than" predicate isn't selective, but the "greater than" should be. Look at the explain plan, especially the ROWS value, to see what Oracle thinks
PS.
Adjusting the value means the optimizer changes the basis for its estimates. If a journey planner says a trip will take six hours because it assumes an average speed of 50, and you tell it to assume an average of 100, it will come out with three hours. It won't actually affect the speed you travel at, or how long it takes to actually make the journey.
So you only want to change that value to make it more accurately reflect the actual value for your database (or session).
Oracle will not use indexes if the selectivity of the where clause is not very good. An index would be used if the number of rows returned is a small enough percentage of the total number of rows in the table (the percentage varies, since Oracle will count the cost of reading the index as well as reading the table).
Also, when the index columns are modified in the where clause, the index gets disabled. For example, UPPER(some_index_column) would disable the use of the index on some_index_column. This is why starttime + Duration/24 > :endtime does not use the index.
Can you try this:
select * from Table1 join Table2 on Table1.Table2Id = Table2.Id
where starttime < :starttime and starttime > :endtime - Duration/24
This should allow the use of the index, and there is no need for an additional column.
