PostgreSQL pagination along with row number indexing - Oracle

Okay, so I am porting queries from an Oracle DB to Postgres. My query needs to give me numbered records along with pagination.
Consider the following oracle code:
select * from (
select RS.*, ROWNUM as RN from (
select * from STUDENTS order by GRADES
) RS where ROWNUM <= (#{startIndex} + #{pageSize})
) where RN > #{startIndex}
Notice that there are two uses of ROWNUM here:
- to provide a row number for each row in the query result
- for pagination
I need to port such a query to postgres.
I know how to paginate using LIMIT and OFFSET, but I am not able to provide a global row number (where each row in the query result gets a unique row number).
On the other hand, I found the ROW_NUMBER() function, which can provide the global row numbers, but it is not recommended for pagination purposes, since the number of tuples in my DB is very large.
How do I write a similar code in postgres?

The solution looks much simpler in PostgreSQL:
SELECT *,
row_number() OVER (ORDER BY grades, id) AS rn
FROM students
ORDER BY grades, id
OFFSET $1 LIMIT $2;
Here, id stands for the primary key and is used to disambiguate between equal grades.
That query is efficient if there is an index on grades and the offset is not too high:
EXPLAIN (ANALYZE)
SELECT *,
row_number() OVER (ORDER BY grades, id) AS rn
FROM students
ORDER BY grades, id
OFFSET 10 LIMIT 20;
QUERY PLAN
-------------------------------------------------------------------
Limit (cost=1.01..2.49 rows=20 width=20)
(actual time=0.204..0.365 rows=20 loops=1)
-> WindowAgg (cost=0.28..74.25 rows=1000 width=20)
(actual time=0.109..0.334 rows=30 loops=1)
-> Index Scan using students_grades_idx on students
(cost=0.28..59.25 rows=1000 width=12)
(actual time=0.085..0.204 rows=30 loops=1)
Planning time: 0.515 ms
Execution time: 0.627 ms
(5 rows)
Observe the actual values in the plan.
Pagination with OFFSET is always inefficient with large offsets; consider keyset pagination.
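For very large offsets, a keyset (seek) approach avoids scanning and discarding the skipped rows. A minimal sketch, assuming $1 and $2 hold the grades and id of the last row of the previous page, and that the global row number is carried along by the application:
SELECT *
FROM students
WHERE (grades, id) > ($1, $2)   -- seek past the last row of the previous page
ORDER BY grades, id
LIMIT 20;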

Related

postgresql hot updates table not working

I have a table:
CREATE TABLE my_table
(
id bigint NOT NULL,
data1 character varying(255),
data2 character varying(100000),
double1 double precision,
double2 double precision,
id2 bigint
);
There is an index on id2 (id2 is a foreign key), and I have a query:
update my_table set double2 = :param where id2 = :id2;
This query uses the index on id2, but it runs very, very slowly.
I expected my query to use HOT updates, but that is not happening.
I checked for HOT updates with this query:
SELECT pg_stat_get_xact_tuples_hot_updated('my_table'::regclass::oid);
and it always returns zero.
What am I doing wrong? How can I speed up my update query?
Version of postgres is 9.4.11.
UPD:
execution plan for update:
Update on my_table (cost=0.56..97681.01 rows=34633 width=90) (actual time=42082.915..42082.915 rows=0 loops=1)
-> Index Scan using my_index on my_table (cost=0.56..97681.01 rows=34633 width=90) (actual time=0.110..330.563 rows=97128 loops=1)
Output: id, data1, data2, 0.5::double precision, double1, id2, ctid
Index Cond: (my_table.id2 = 379262689897216::bigint)
Planning time: 1.246 ms
Execution time: 42082.986 ms
The requirements for HOT updates are:
- that you're updating only fields that aren't used in any indexes, and
- that the page containing the row you're updating has extra space in it (fillfactor should be less than 100),
which, based on your comments, you seem to be satisfying.
But one thing I noticed is that you said you're using pg_stat_get_xact_tuples_hot_updated to check whether HOT updates are happening. Be aware that this function returns only the number of HOT-updated rows in the current transaction, not the total over all time. My guess is that HOT updates are happening, but you used the wrong function to detect them. If you instead use
SELECT pg_stat_get_tuples_hot_updated('my_table'::regclass::oid);
you can get the total number of HOT-updated rows for all time.
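If it turns out HOT updates really are not happening, leaving free space in each page helps. A minimal sketch, with 90 as an arbitrary example value (only pages written after the change benefit, so a table rewrite may be needed for existing data):
ALTER TABLE my_table SET (fillfactor = 90);  -- keep roughly 10% of each page free for HOT updates
VACUUM FULL my_table;                        -- optional: rewrite existing pages with the new fillfactor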

Oracle tuning for query with nested query

I am trying to improve a query. I have a dataset of opened tickets. Every ticket has several rows, and every row represents an update of the ticket. There is a field (dt_update) that differs on every row.
I have these indexes on st_remedy_full_light:
IDX_ASSIGNMENT (ASSIGNMENT)
IDX_REMEDY_INC_ID (REMEDY_INC_ID)
IDX_REMDULL_LIGHT_DTUPD (DT_UPDATE)
Right now the query takes 8 seconds, which is too slow for me.
WITH last_ticket AS
( SELECT *
FROM st_remedy_full_light a
WHERE a.dt_update IN
( SELECT MAX(dt_update)
FROM st_remedy_full_light
WHERE remedy_inc_id = a.remedy_inc_id
)
)
SELECT remedy_inc_id, ASSIGNMENT FROM last_ticket
This is the plan
How could I improve this query?
P.S. This is just part of a bigger query.
Additional information:
- The table st_remedy_full_light contains 529,507 rows
You could try:
WITH last_ticket AS
( SELECT remedy_inc_id, ASSIGNMENT,
rank() over (partition by remedy_inc_id order by dt_update desc) rn
FROM st_remedy_full_light a
)
SELECT remedy_inc_id, ASSIGNMENT FROM last_ticket
where rn = 1;
The best alternative query, which is also much easier to execute, is this:
select remedy_inc_id
, max(assignment) keep (dense_rank last order by dt_update)
from st_remedy_full_light
group by remedy_inc_id
This will use only one full table scan and a (hash/sort) group by, no self joins.
Don't bother with indexed access, as you'll probably find a full table scan is most appropriate here, unless the table is really wide and a composite index on all the columns used (remedy_inc_id, dt_update, assignment) would be significantly quicker to read than the table.
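If the table really is that wide, such a covering index might be sketched as follows (the index name is just illustrative):
CREATE INDEX idx_remdull_light_cover
    ON st_remedy_full_light (remedy_inc_id, dt_update, assignment);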

Why are indexed ORDER BY queries matching many rows a LOT faster than queries matching only a few?

Okay so I have the following query:
explain analyze SELECT seller_region FROM "products"
WHERE "products"."seller_region" = 'Bremen'
AND "products"."state" = 'active'
ORDER BY products.rank DESC,
products.score ASC NULLS LAST,
GREATEST(products.created_at, products.price_last_updated_at) DESC
LIMIT 14 OFFSET 0
The query's filter matches around 11,000 rows. If we look at the query plan, we can see that the query uses the index index_products_active_for_default_order and is very fast:
Limit (cost=0.43..9767.16 rows=14 width=36) (actual time=1.576..6.711 rows=14 loops=1)
-> Index Scan using index_products_active_for_default_order on products (cost=0.43..4951034.14 rows=7097 width=36) (actual time=1.576..6.709 rows=14 loops=1)
Filter: ((seller_region)::text = 'Bremen'::text)
Rows Removed by Filter: 3525
Total runtime: 6.724 ms
Now if I replace 'Bremen' with 'Sachsen' like so in the query:
explain analyze SELECT seller_region FROM "products"
WHERE "products"."seller_region" = 'Sachsen'
AND "products"."state" = 'active'
ORDER BY products.rank DESC,
products.score ASC NULLS LAST,
GREATEST(products.created_at, products.price_last_updated_at) DESC
LIMIT 14 OFFSET 0
The same query matches only around 70 rows and is now consistently very, very slow, even though it uses the same index in the exact same way:
Limit (cost=0.43..1755.00 rows=14 width=36) (actual time=2.498..1831.737 rows=14 loops=1)
-> Index Scan using index_products_active_for_default_order on products (cost=0.43..4951034.14 rows=39505 width=36) (actual time=2.496..1831.727 rows=14 loops=1)
Filter: ((seller_region)::text = 'Sachsen'::text)
Rows Removed by Filter: 963360
Total runtime: 1831.760 ms
I don't understand why this happens. Intuitively, I would think that the query matching more rows would be slower, but it's the other way around. I have tested this with other queries on other columns of my tables as well, and the phenomenon is the same: two similar queries with the same ordering as the ones above run hundreds of times faster when they match many rows than when the filter matches only a few. Why is this, and how can I avoid this behavior?
PS: I'm using postgres 9.3 and the index is defined as follows:
CREATE INDEX index_products_active_for_default_order
ON products
USING btree
(rank DESC, score COLLATE pg_catalog."default", (GREATEST(created_at, price_last_updated_at)) DESC)
WHERE state::text = 'active'::text;
That is because the first 14 matching rows for Bremen are found in the first 3539 index rows, while for Sachsen 963374 rows have to be scanned.
I recommend an index on (seller_region, rank).
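A minimal sketch of such an index; the index name, the DESC modifier on rank, and the partial WHERE clause mirror the existing index and are assumptions:
CREATE INDEX index_products_region_rank
    ON products (seller_region, rank DESC)
    WHERE state::text = 'active'::text;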

Slow query time in postgresql when UTC milliseconds is stored as bigint

We are migrating from a time series database (ECHO historian) to an open source database, basically due to the price factor. Our choice was PostgreSQL, as there are no open source time series databases. What we used to store in ECHO was just time and value pairs.
Now here is the problem. The table that I created in Postgres consists of 2 columns. The first is of "bigint" type, to store the time in UTC milliseconds (a 13-digit number), and the second is the value, whose data type is "real". I filled it with around 3.6 million rows of data (spread across a time range of 30 days), and when I query a small time range (say 1 day) the query takes 4 seconds, while for the same time range ECHO responds in 150 milliseconds!
This is a huge difference. Having a bigint for time seems to be the reason for the slowness, but I am not sure. Could you please suggest how the query time can be improved?
I also read about the data types "timestamp" and "timestamptz"; it looks like we would need to store the date and time in a regular format rather than UTC milliseconds. Can this help speed up my query time?
Here is my table definition :
Table "public. MFC2 Flow_LCL "
Column | Type | Modifiers | Storage | Stats target | Description
----------+--------+-----------+---------+--------------+-------------
the_time | bigint | | plain | |
value | real | | plain | |
Indexes:
"MFC2 Flow_LCL _time_idx" btree (the_time)
Has OIDs: no
Currently I am storing the time in UTC milliseconds (using bigint). The challenge here is that there could be duplicate time-value pairs.
This is the query I am using (called through a simple API which passes the table name, start time and end time):
PGresult *res;
int rec_count;
std::string sSQL;
sSQL.append("SELECT * FROM ");
sSQL.append(" \" ");
sSQL.append(table);
sSQL.append(" \" ");
sSQL.append(" WHERE");
sSQL.append(" time >= ");
CString sTime;
sTime.Format("%I64d",startTime);
sSQL.append(sTime);
sSQL.append(" AND time <= ");
CString eTime;
eTime.Format("%I64d",endTime);
sSQL.append(eTime);
sSQL.append(" ORDER BY time ");
res = PQexec(conn, sSQL.c_str());
Your time series database, if it works like a competitor I examined once, automatically stores data in the order of the "time" column in a heap-like structure. Postgres does not. As a result, you are doing an O(n) search [n = number of rows in the table]: the entire table must be read to look for rows matching your time filter. A primary key on the timestamp (which creates a unique index) or, if timestamps are not unique, a regular index will give you binary-search-like O(log n) lookups for single records and improved performance for all queries retrieving less than about 5% of the table. Postgres will estimate the crossover point where an index scan becomes cheaper than a full table scan.
You probably also want to CLUSTER (PG Docs) the table on that index.
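A minimal sketch with illustrative names (substitute your real, quoted table and index names); note that CLUSTER takes an exclusive lock and rewrites the table:
CLUSTER my_timeseries_table USING my_timeseries_table_time_idx;  -- physically reorder the table by the index
ANALYZE my_timeseries_table;                                     -- refresh planner statistics afterwards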
Also, follow the advice above not to use time or other SQL reserved words as column names. Even when it is legal, it's asking for trouble.
[This would be better as a comment, but it is too long for that.]
Are you really planning for the year 2038 problem already? Why not just use an int for time as in standard UNIX?
SET search_path=tmp;
-- -------------------------------------------
-- create table and populate it with 10M rows
-- -------------------------------------------
DROP SCHEMA tmp CASCADE;
CREATE SCHEMA tmp ;
SET search_path=tmp;
CREATE TABLE old_echo
( the_time timestamp NOT NULL PRIMARY KEY
, payload DOUBLE PRECISION NOT NULL
);
INSERT INTO old_echo (the_time, payload)
SELECT now() - (gs * interval '1 msec')
, random()
FROM generate_series(1,10000000) gs
;
-- DELETE FROM old_echo WHERE random() < 0.8;
VACUUM ANALYZE old_echo;
SELECT MIN(the_time) AS first
, MAX(the_time) AS last
, (MAX(the_time) - MIN(the_time))::interval AS width
FROM old_echo
;
EXPLAIN ANALYZE
SELECT *
FROM old_echo oe
JOIN (
SELECT MIN(the_time) AS first
, MAX(the_time) AS last
, (MAX(the_time) - MIN(the_time))::interval AS width
, ((MAX(the_time) - MIN(the_time))/2)::interval AS half
FROM old_echo
) mima ON 1=1
WHERE oe.the_time >= mima.first + mima.half
AND oe.the_time < mima.first + mima.half + '1 sec':: interval
;
RESULT:
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.06..59433.67 rows=1111124 width=64) (actual time=0.101..1.307 rows=1000 loops=1)
-> Result (cost=0.06..0.07 rows=1 width=0) (actual time=0.049..0.050 rows=1 loops=1)
InitPlan 1 (returns $0)
-> Limit (cost=0.00..0.03 rows=1 width=8) (actual time=0.022..0.022 rows=1 loops=1)
-> Index Scan using old_echo_pkey on old_echo (cost=0.00..284873.62 rows=10000115 width=8) (actual time=0.021..0.021 rows=1 loops=1)
Index Cond: (the_time IS NOT NULL)
InitPlan 2 (returns $1)
-> Limit (cost=0.00..0.03 rows=1 width=8) (actual time=0.009..0.010 rows=1 loops=1)
-> Index Scan Backward using old_echo_pkey on old_echo (cost=0.00..284873.62 rows=10000115 width=8) (actual time=0.009..0.009 rows=1 loops=1)
Index Cond: (the_time IS NOT NULL)
-> Index Scan using old_echo_pkey on old_echo oe (cost=0.01..34433.30 rows=1111124 width=16) (actual time=0.042..0.764 rows=1000 loops=1)
Index Cond: ((the_time >= (($0) + ((($1 - $0) / 2::double precision)))) AND (the_time < ((($0) + ((($1 - $0) / 2::double precision))) + '00:00:01'::interval)))
Total runtime: 1.504 ms
(13 rows)
UPDATE: since the timestamp appears to be non-unique (by the way: what do duplicates mean in that case?), I added an extra key column. An ugly hack, but it works here. Query time: 11 ms for 10M -80% rows (number of rows hit: 210/222067):
CREATE TABLE old_echo
( the_time timestamp NOT NULL
, the_seq SERIAL NOT NULL -- to catch the duplicate keys
, payload DOUBLE PRECISION NOT NULL
, PRIMARY KEY(the_time, the_seq)
);
-- Adding the random will cause some timestamps to be non-unique.
-- (and others to be non-existent)
INSERT INTO old_echo (the_time, payload)
SELECT now() - ((gs+random()*1000::integer) * interval '1 msec')
, random()
FROM generate_series(1,10000000) gs
;
DELETE FROM old_echo WHERE random() < 0.8;

Postgresql - using index to filter rows in table with above 100mln rows

I have a table with over 100 million rows. I have to count and extract rows as in the following query. The query runs for a very long time. EXPLAIN shows that the query doesn't use the b-tree index created on the "created_date" column.
I found some explanation on Stack Overflow that b-tree indexes are useless for filtering when a table has many rows.
There is advice to cluster on an index. Should I cluster the table on the "created_date" index if I also often run queries where I ORDER BY id?
What would you advise to make the queries faster? Maybe I should read more about sharding?
explain SELECT count(r.id) FROM results_new r
WHERE r.searches_id = 4351940 AND (created_date between '2008-01-01' and '2012-12-13')
Limit (cost=1045863.78..1045863.79 rows=1 width=4)
-> Aggregate (cost=1045863.78..1045863.79 rows=1 width=4)
-> Index Scan using results_new_searches_id_idx on results_new r (cost=0.00..1045012.38 rows=340560 width=4)
Index Cond: (searches_id = 4351940)
Filter: ((created_date >= '2008-01-01 00:00:00'::timestamp without time zone) AND (created_date <= '2012-12-13 00:00:00'::timestamp without time zone))
From the look of it, the database has decided that a lookup for one searches_id will produce fewer rows to go through than a lookup for the created_date range. (and that combining the result of two index scans with a bitmap isn't worthwhile...)
If you need this query often, then consider creating an index on (searches_id, created_date); both conditions should then go into the index condition.
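A minimal sketch of such an index (the name is just illustrative):
CREATE INDEX results_new_searches_id_created_date_idx
    ON results_new (searches_id, created_date);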
