PostgreSQL: updating a big table (16 million rows) is slow - performance

I have a big table containing around 16 million rows:
POST table:
+-------------+------------------------+-----------+
| Column      | Type                   | Modifiers |
+-------------+------------------------+-----------+
| key         | character varying(36)  | not null  |
| category_id | integer                |           |
| owner_id    | integer                |           |
+-------------+------------------------+-----------+
key's value contains 36 random characters.
INDEXES: btree(key), btree(owner_id), btree(category_id)
My GOAL is to add an is_deleted attribute to this table and to update owner_id (previously NULL) from another table (around 2 million rows affected).
I tried updating multiple rows at a time in this format:
update post_tbl as p
set owner_id = c.owner_id,
    is_deleted = True
from (values
    (92485, 137736),
    (948130, 250745),
    (423832, 1164883),
    (685966, 521767),
    ...
) as c(owner_id, id)
where p.key = c.id;
I tried batch sizes of 100, 200, 500, 1000 and 2000 rows. However, it is really slow: around 10 rows per second.
At that rate, updating 2,000,000 rows would take roughly 50 hours.
I also tried changing some PostgreSQL settings:
synchronous_commit=off;
wal_buffers=32;
shared_buffers=1GB;
random_page_cost=0.005;
seq_page_cost=0.005;
But that didn't change much. (I also tried dropping the btree index on owner_id, but nothing changed.)
Could you please tell me how I can speed up this update? Thanks.
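One variant that is sometimes suggested for bulk updates like this (a rough sketch under assumptions, not a tested answer; post_map is a hypothetical staging table, the other names come from the question) is to load the id/owner pairs into an indexed temporary table once and then update with a single join, instead of sending many VALUES batches:

-- Hypothetical staging table for the ~2 million (key, owner_id) pairs.
create temporary table post_map (
    id       varchar(36) primary key,   -- matches post_tbl.key
    owner_id integer
);
-- load the pairs here, e.g. with COPY or batched INSERTs

update post_tbl as p
set owner_id   = m.owner_id,
    is_deleted = True
from post_map as m
where p.key = m.id;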

Related

Deleting duplicate database records by date in Laravel

I'm currently working on a Laravel 8 application, backed by a PostgreSQL database, in which I'm generating a Cost model for various different items. My intention was to record a maximum of one Cost->value per day, per item; however, due to some issues with overlapping jobs and the way in which I was using the updateOrCreate() method, I've ended up with multiple Cost records per day for each item.
I've since fixed the logic so that I'm no longer getting multiple records per day, but I'd like to now go back and clean up all of the duplicate records.
Is there an efficient way to delete all of the duplicate records per item, leaving only the newest record for each day, i.e. no more than one record per item, per day? While I'm sure this seems pretty straightforward, I can't seem to land on the correct logic, either directly in SQL or through Laravel and PHP.
Maybe relevant info: there are currently ~50k records in the table.
Example table
// Example database table migration
Schema::create('costs', function (Blueprint $table) {
    $table->id();
    $table->string('item');
    $table->decimal('value');
    $table->date('created_at');
    $table->timestamp('updated_at');
});
Rough Example (Before)
id,item,value,created_at,updated_at
510,item1,12,2021-07-02,2021-07-02 16:45:17
500,item1,13,2021-07-02,2021-07-02 16:45:05
490,item1,13,2021-07-02,2021-07-02 16:45:01
480,item2,12,2021-07-02,2021-07-02 16:44:59
470,item2,14,2021-07-02,2021-07-02 16:44:55
460,item2,12,2021-07-02,2021-07-02 16:44:54
450,item2,11,2021-07-02,2021-07-02 16:44:53
Rough Example (Desired End-State)
id,item,value,created_at,updated_at
510,item1,12,2021-07-02,2021-07-02 16:45:17
480,item2,12,2021-07-02,2021-07-02 16:44:59
You could use EXISTS():
select * from meuk;
DELETE FROM meuk d
WHERE EXISTS (
    SELECT * FROM meuk x
    WHERE x.item = d.item                          -- same item
      AND x.updated_at::date = d.updated_at::date  -- same date
      AND x.updated_at > d.updated_at              -- but: more recent
);
select * from meuk;
Results:
DROP TABLE
CREATE TABLE
COPY 7
VACUUM
id | item | value | created_at | updated_at
-----+-------+-------+------------+---------------------
510 | item1 | 12 | 2021-07-02 | 2021-07-02 16:45:17
500 | item1 | 13 | 2021-07-02 | 2021-07-02 16:45:05
490 | item1 | 13 | 2021-07-02 | 2021-07-02 16:45:01
480 | item2 | 12 | 2021-07-02 | 2021-07-02 16:44:59
470 | item2 | 14 | 2021-07-02 | 2021-07-02 16:44:55
460 | item2 | 12 | 2021-07-02 | 2021-07-02 16:44:54
450 | item2 | 11 | 2021-07-02 | 2021-07-02 16:44:53
(7 rows)
DELETE 5
id | item | value | created_at | updated_at
-----+-------+-------+------------+---------------------
510 | item1 | 12 | 2021-07-02 | 2021-07-02 16:45:17
480 | item2 | 12 | 2021-07-02 | 2021-07-02 16:44:59
(2 rows)
A different approach, using window functions. The idea is to number all records within the same {item, day}, newest first, and preserve only the first:
DELETE FROM meuk d
USING (
    SELECT item, updated_at
         , row_number() OVER (PARTITION BY item, updated_at::date
                              ORDER BY item, updated_at DESC) rn
    FROM meuk x
) xx
WHERE xx.item = d.item
  AND xx.updated_at = d.updated_at
  AND xx.rn > 1;
Do note that this procedure always involves a self-join: the fate of a record depends on the existence of other records in the same table.
There's a hairy SQL query here: https://stackoverflow.com/a/1313293/1346367 ; the simpler one is based on joining the table with itself on costs1.id < costs2.id. The < or > determines whether you keep the oldest or the newest value. Sadly there is no easy way around this (you cannot trust an ORDER BY inside a GROUP BY statement, if I recall correctly).
Since I cannot explain to you in detail how that query works, I'll give you a Laravel/PHP solution instead, which is inefficient but comprehensible:
$keepIds = [];
// Loop over the table (without Eloquent, for a performance benefit).
foreach (DB::table('costs')->orderBy('id', 'ASC')->get() as $cost) {
    // Key by item and day, and keep overwriting: the last overwrite (the newest
    // row per item per day) is the one that ends up being kept.
    $keepIds[$cost->item . '|' . $cost->created_at] = $cost->id;
}
// Remove every record that we did not decide to keep.
DB::table('costs')->whereNotIn('id', array_values($keepIds))->delete();
I'm not sure if that last query will work properly though with a very big array; it might throw an SQL error.
Note that you can play with the orderBy to choose whether you want to keep the newest or the oldest records.
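For reference, here is a rough sketch of the self-join delete mentioned at the start of this answer (an illustration only, based on the costs table from the question; it keeps the row with the highest id per item per day, matching the "newest wins" variant):

-- Delete any row for which a later row (higher id) exists
-- for the same item on the same day.
DELETE FROM costs c1
USING costs c2
WHERE c2.item = c1.item
  AND c2.created_at = c1.created_at   -- created_at is a date, so this is "same day"
  AND c2.id > c1.id;                  -- use < instead to keep the oldest row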

How to select nth row in CockroachDB?

If I use something like a SERIAL (which is a random number) for my table's primary key, how can I select a numbered row from my table? In MySQL, I just use the auto incremented ID to select a specific row, but I'm not sure how to approach the problem with an arbitrary numbering sequence.
For reference, here is the table I'm working with:
+--------------------+------+-------+
| id                 | name | score |
+--------------------+------+-------+
| 235451721728983041 | ABC  |  1000 |
| 235451721729015809 | EDF  |  1100 |
| 235451721729048577 | GHI  |  1200 |
| 235451721729081345 | JKL  |   900 |
+--------------------+------+-------+
Using the LIMIT and OFFSET clauses will return the nth row. For example SELECT * FROM tbl ORDER BY col1 LIMIT 1 OFFSET 9 returns the 10th row.
Note that it’s important to include the ORDER BY clause here because you care about the order of the results (if you don’t include ORDER BY, it’s possible that the results are arbitrarily ordered).
If you care about the order in which things were inserted, you could ORDER BY the SERIAL column (id in your case), though that's not guaranteed: transaction contention and other factors can cause the generated SERIAL values to not be strictly ordered by insertion time.
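Applied to the table in the question, a quick sketch (the table name tbl and the choice of ORDER BY column are assumptions; order by whatever defines "nth" for you):

-- Fetch the 3rd row when ordered by score ascending.
SELECT id, name, score
FROM tbl
ORDER BY score
LIMIT 1 OFFSET 2;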

How to get rid of FULL TABLE SCAN in oracle

I have one query and its explain plan shows a full table scan, so will you tell me how to get rid of it?
output:
|* 9 | INDEX UNIQUE SCAN | GL_PERIODS_U1 | 1 | | | 1 (0)|
|* 10 | TABLE ACCESS FULL | GL_PERIODS | 12 | 372 | | 6 (0)|
|* 11 | TABLE ACCESS BY INDEX ROWID | GL_JE_HEADERS | 1 | 37 | | 670 (0)|
|* 12 | INDEX RANGE SCAN | GL_JE_HEADERS_N2 | 3096 | | | 11 (0)|
|* 13 | TABLE ACCESS BY INDEX ROWID | GL_JE_BATCHES | 1 | 8 | | 2 (0)|
|* 14 | INDEX UNIQUE SCAN | GL_JE_BATCHES_U1 | 1 | | | 1 (0)|
|* 15 | INDEX RANGE SCAN | GL_JE_LINES_U1 | 746 | | | 4 (0)|
| 16 | TABLE ACCESS FULL | GL_CODE_COMBINATIONS | 1851K| 30M| | 13023 (1)|
My query :
explain plan for
select cc.segment1,
cc.segment2,
h.currency_code,
SUM(NVL(l.accounted_dr,0) - NVL(l.accounted_cr,0))
from gl_code_combinations cc
,gl_je_lines l
,gl_je_headers h
,gl_je_batches b
,gl_periods p1
,gl_periods p2
where cc.code_combination_id = l.code_combination_id
AND b.je_batch_id = h.je_batch_id
AND b.status = 'P'
AND l.je_header_id = h.je_header_id
AND h.je_category = 'Revaluation'
AND h.period_name = p1.period_name
AND p1.period_set_name = 'Equant Master'
AND p2.period_name = 'SEP-16'
AND p2.period_set_name = 'Equant Master'
AND p1.start_date <= p2.end_date
AND h.set_of_books_id = '1429'
GROUP BY cc.segment1,
cc.segment2,
h.currency_code
please suggest
I see you are using the Oracle e-Business Suite data model. In that model, GL_PERIODS, being the table of accounting periods (usually weeks or months), is usually fairly small. Further, you are telling it you want every period prior to September 2016, which is likely to be almost all the periods in your "Equant Master" period set. Depending on how many other period sets you have defined, your full table scan may very well be the optimal (fastest running) plan.
As others have correctly pointed out, full table scans aren't necessarily worse or slower than other access paths.
To determine if your FTS really is a problem, you can use DBMS_XPLAN to get timings of how long each step in your plan is taking. Like this:
First, tell Oracle to keep track of plan-step-level statistics for your session
alter session set statistics_level = ALL;
Make sure you turn off DBMS_OUTPUT / server output
Run your query to completion (i.e., scroll to the bottom of the result set)
Finally, run this query:
SELECT *
FROM TABLE (DBMS_XPLAN.display_cursor (null, null,
'ALLSTATS LAST'));
The output will tell you exactly why your query is taking so long (if it is taking long). It is much more accurate than just picking out all the full table scans in your explain plan.
First things first: why do you want to avoid the full table scan? Not all full table scans are bad.
You are joining on cc.code_combination_id = l.code_combination_id. I don't think there is a way to avoid a full table scan with this type of join.
To understand this, I created test tables and data.
create table I1(n number primary key, v varchar2(10));
create table I2(n number primary key, v varchar2(10));
and a map table
create table MAP(n number primary key, i1 number referencing I1(n),
i2 number referencing I2(n));
I created indexes on the map table.
create index map_index_i1 on map(i1);
create index map_index_i2 on map(i2);
Here is the sample data that I inserted.
SQL> select * from i1;
N V
1 ONE
2 TWO
5 FIVE
SQL> select * from i2;
N V
3 THREE
4 FOUR
5 FIVE
SQL> select * from map;
N I1 I2
1 1 3
2 1 4
5 5 5
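(For anyone who wants to reproduce this, the sample rows above could be loaded roughly like so; the values are taken straight from the output, nothing else is assumed:)

insert into I1 values (1, 'ONE');
insert into I1 values (2, 'TWO');
insert into I1 values (5, 'FIVE');
insert into I2 values (3, 'THREE');
insert into I2 values (4, 'FOUR');
insert into I2 values (5, 'FIVE');
insert into MAP values (1, 1, 3);
insert into MAP values (2, 1, 4);
insert into MAP values (5, 5, 5);
commit;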
I gathered the statistics. Then I executed a query that compares the I1 and I2 columns of the map table.
explain plan for
select map.n,i1.v
from i1,map
where map.i2 = map.i1
and i1.n=5
Remember, we have indexes on the I1 and I2 columns of the map table. I thought the optimizer might use them, but unfortunately it didn't.
Full table scan
That's because the condition map.i2 = map.i1 means comparing every record's I2 column with its I1 column, so the whole table has to be read.
Next, I used one of the indexed columns in the where condition, and now it picked the index.
explain plan for
select map.n,i1.v
from i1,map
where map.i2 = map.i1
and i1.n=5
and map.i1=5
Index scan
Have a look at Ask Tom's pages on full table scans. Unfortunately, I couldn't paste the source as a link since I have less than 10 reputation!!

Adding an Index degraded execution time

I have a table like this:
myTable (id, group_id, run_date, table2_id, description)
I also have a index like this:
index myTable_grp_i on myTable (group_id)
I used to run a query like this:
select * from myTable t where t.group_id=3 and t.run_date='20120512';
and it worked fine and everyone was happy.
Until I added another index:
index myTable_tab2_i on myTable (table2_id)
My life became miserable... it's taking almost 5 times longer to run!!!
execution plan looks the same (with or without the new index):
--------------------------------------------------------------------------------
| Id | Operation                   | Name          | Rows  | Bytes | Cost
--------------------------------------------------------------------------------
|  0 | SELECT STATEMENT            |               |     1 |   220 | 17019
|* 1 |  TABLE ACCESS BY INDEX ROWID| MYTABLE       |     1 |   220 | 17019
|* 2 |   INDEX RANGE SCAN          | MYTABLE_GRP_I | 17056 |       |    61
--------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("T"."RUN_DATE"='20120512')
2 - access("T"."GROUP_ID"=3)
I have almost no hair left on my head. Why should another index, which is not used, on a column which is not in the where clause, make a difference...?
Update with the things I checked:
a. I removed the new index and it ran faster
b. I added the new index in 2 more different environments and the same thing happened
c. I changed MYTABLE_GRP_I to be on the columns run_date and group_id - this made it run as fast as lightning!! (see the sketch below)
But still, why does it happen?
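(For reference, a rough sketch of the composite index change described in item c, assuming Oracle syntax and reusing the original index name; this is an illustration, not part of the post:)

DROP INDEX myTable_grp_i;
CREATE INDEX myTable_grp_i ON myTable (run_date, group_id);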

Will this type of pagination scale?

I need to paginate on a set of models that can/will become large. The results have to be sorted so that the latest entries are the ones that appear on the first page (and then, we can go all the way to the start using 'next' links).
The query to retrieve the first page is the following, 4 is the number of entries I need per page:
SELECT "relationships".* FROM "relationships" WHERE ("relationships".followed_id = 1) ORDER BY created_at DESC LIMIT 4 OFFSET 0;
Since this needs to be sorted and since the number of entries is likely to become large, am I going to run into serious performance issues?
What are my options to make it faster?
My understanding is that an index on 'followed_id' will simply help the where clause. My concern is with the 'order by'.
Create an index that contains these two fields in this order (followed_id, created_at)
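A sketch of that index (the index name is an assumption; the table and column names come from the question):

CREATE INDEX index_relationships_on_followed_id_and_created_at
    ON relationships (followed_id, created_at);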
Now, how large is the "large" we are talking about here? If it will be of the order of millions, how about something like what follows?
Create an index on the keys followed_id, created_at, id. (This might change depending upon the fields in the select, where and order by clauses; I have tailor-made this to your question.)
SELECT relationships.*
FROM relationships
JOIN (SELECT id
FROM relationships
WHERE followed_id = 1
ORDER BY created_at
LIMIT 10 OFFSET 10) itable
ON relationships.id = itable.id
ORDER BY relationships.created_at
An explain would yield this:
+----+-------------+---------------+------+---------------+-------------+---------+------+------+-----------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+------+---------------+-------------+---------+------+------+-----------------------------------------------------+
| 1 | PRIMARY | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Impossible WHERE noticed after reading const tables |
| 2 | DERIVED | relationships | ref | sample_rel2 | sample_rel2 | 5 | | 1 | Using where; Using index |
+----+-------------+---------------+------+---------------+-------------+---------+------+------+-----------------------------------------------------+
If you examine it carefully, the sub-query containing the order, limit and offset clauses operates on the index directly instead of the table, and only joins back to the table at the end to fetch the 10 records.
It makes a difference once your query reaches something like limit 10 offset 10000: without this trick, the database has to walk through all of those preceding rows in the table before it can hand back the 10 you want, whereas this trick restricts that traversal to just the index.
An important note: I tested this in MySQL. Other database might have subtle differences in behavior, but the concept holds good no matter what.
You can index these fields, but it depends:
You can assume (mostly) that created_at is already ordered, so an index on it might be unnecessary; but that depends more on your app.
Anyway, you should index followed_id (unless it's the primary key).
