Deleting duplicate database records by date in Laravel - laravel

I'm currently working on a Laravel 8 application, backed by a PostgreSQL database, in which I'm generating a Cost model for various different items. My intention was to record a maximum of one Cost->value per day, per item; however, due to some issues with overlapping jobs and the way in which I was using the updateOrCreate() method, I've ended up with multiple Cost records per day for each item.
I've since fixed the logic so that I'm no longer getting multiple records per day, but I'd like to now go back and clean up all of the duplicate records.
Is there an efficient way to delete all of the duplicate records per item, leaving the newest record for each day, i.e.: leaving no more than one record per item, per day? While I'm sure this seems pretty straight-forward, I can't seem to land on the correct logic either directly in SQL, or through Laravel and PHP.
Maybe relevant info: Currently, there's ~50k records in the table.
Example table
// Example database table migration
Schema::create('costs', function (Blueprint $table) {
$table->id();
$table->string('item');
$table->decimal('value');
$table->date('created_at');
$table->timestamp('updated_at');
});
Rough Example (Before)
id,item,value,created_at,updated_at
510,item1,12,2021-07-02,2021-07-02 16:45:17 126.5010838402907751
500,item1,13,2021-07-02,2021-07-02 16:45:05 126.5010838402907751
490,item1,13,2021-07-02,2021-07-02 16:45:01 126.5010838402907751
480,item2,12,2021-07-02,2021-07-02 16:44:59 126.5010838402907751
470,item2,14,2021-07-02,2021-07-02 16:44:55 126.5010838402907751
460,item2,12,2021-07-02,2021-07-02 16:44:54 126.5010838402907751
450,item2,11,2021-07-02,2021-07-02 16:44:53 126.5010838402907751
Rough Example (Desired End-State)
id,item,value,created_at,updated_at
510,item1,12,2021-07-02,2021-07-02 16:45:17 126.5010838402907751
480,item2,12,2021-07-02,2021-07-02 16:44:59 126.5010838402907751

You could use EXISTS():
select * from meuk;
DELETE FROM meuk d
WHERE EXISTS (
SELECT * FROM meuk x
WHERE x.item = d.item -- same item
AND x.updated_at::date = d.updated_at::date -- same date
AND x.updated_at > d.updated_at -- but: more recent
);
select * from meuk;
Results:
DROP TABLE
CREATE TABLE
COPY 7
VACUUM
id | item | value | created_at | updated_at
-----+-------+-------+------------+---------------------
510 | item1 | 12 | 2021-07-02 | 2021-07-02 16:45:17
500 | item1 | 13 | 2021-07-02 | 2021-07-02 16:45:05
490 | item1 | 13 | 2021-07-02 | 2021-07-02 16:45:01
480 | item2 | 12 | 2021-07-02 | 2021-07-02 16:44:59
470 | item2 | 14 | 2021-07-02 | 2021-07-02 16:44:55
460 | item2 | 12 | 2021-07-02 | 2021-07-02 16:44:54
450 | item2 | 11 | 2021-07-02 | 2021-07-02 16:44:53
(7 rows)
DELETE 5
id | item | value | created_at | updated_at
-----+-------+-------+------------+---------------------
510 | item1 | 12 | 2021-07-02 | 2021-07-02 16:45:17
480 | item2 | 12 | 2021-07-02 | 2021-07-02 16:44:59
(2 rows)
A different approach, using window functions. The idea is to number all records on the same {item,day} downward, and preserve only the first:
DELETE FROM meuk d
USING (
SELECT item,updated_at
, row_number() OVER (PARTITION BY item,updated_at::date
ORDER BY item,updated_at DESC
) rn
FROM meuk x
) xx
WHERE xx.item = d.item
AND xx.updated_at = d.updated_at
AND xx.rn > 1
;
Do note that this procedure always involves a self-join: the fate of a record depends on the existence of other records in the same table.

There's a hairy SQL query here https://stackoverflow.com/a/1313293/1346367 ; the simpler one is based on joining the table with itself on costs1.id < costs2.id. The < or > indicates whether you like to keep the oldest or the newest value. Sadly there is not an easy way (you cannot trust an ORDER BY on a GROUP BY statement If i recall correctly).
Since I cannot explain to you in detail how this query works, I give you a Laravel/PHP solution which is inefficient but comprehensible:
$keepIds = [];
// Loop the table (without Eloquent for performance benefit).
foreach(DB::table('costs')->orderBy('id', 'ASC')->get() as $cost) {
// Keep overwriting the index such that the last overwrite will contain the end result.
$keepIds[$cost->item] = $cost->id;
}
// Remove elements that you do not want to keep.
DB::table('costs')->whereNotIn('id', array_values($keepIds))->delete();
I'm not sure if that last query will work properly though with a very big array; it might throw an SQL error.
Note that you can play with the orderBy to chose whether you want to keep the newest or the oldest records.

Related

How to get rid of FULL TABLE SCAN in oracle

I have one query and it is giving me full table scan while doing explain plan , so will you tell me how to get rid of it.
output:
|* 9 | INDEX UNIQUE SCAN | GL_PERIODS_U1 | 1 | | | 1 (0)|
|* 10 | TABLE ACCESS FULL | GL_PERIODS | 12 | 372 | | 6 (0)|
|* 11 | TABLE ACCESS BY INDEX ROWID | GL_JE_HEADERS | 1 | 37 | | 670 (0)|
|* 12 | INDEX RANGE SCAN | GL_JE_HEADERS_N2 | 3096 | | | 11 (0)|
|* 13 | TABLE ACCESS BY INDEX ROWID | GL_JE_BATCHES | 1 | 8 | | 2 (0)|
|* 14 | INDEX UNIQUE SCAN | GL_JE_BATCHES_U1 | 1 | | | 1 (0)|
|* 15 | INDEX RANGE SCAN | GL_JE_LINES_U1 | 746 | | | 4 (0)|
| 16 | TABLE ACCESS FULL | GL_CODE_COMBINATIONS | 1851K| 30M| | 13023 (1)|
My query :
explain plan for
select cc.segment1,
cc.segment2,
h.currency_code,
SUM(NVL(l.accounted_dr,0) - NVL(l.accounted_cr,0))
from gl_code_combinations cc
,gl_je_lines l
,gl_je_headers h
,gl_je_batches b
,gl_periods p1
,gl_periods p2
where cc.code_combination_id = l.code_combination_id
AND b.je_batch_id = h.je_batch_id
AND b.status = 'P'
AND l.je_header_id = h.je_header_id
AND h.je_category = 'Revaluation'
AND h.period_name = p1.period_name
AND p1.period_set_name = 'Equant Master'
AND p2.period_name = 'SEP-16'
AND p2.period_set_name = 'Equant Master'
AND p1.start_date <= p2.end_date
AND h.set_of_books_id = '1429'
GROUP BY cc.segment1,
cc.segment2,
h.currency_code
please suggest
I see you are using the Oracle e-Business Suite data model. In that model, GL_PERIODS, being the table of accounting periods (usually weeks or months), is usually fairly small. Further, you are telling it you want every period prior to September 2016, which is likely to be almost all the periods in your "Equant Master" period set. Depending on how many other period sets you have defined, your full table scan may very well be the optimal (fastest running) plan.
As others have correctly pointed out, full table scans aren't necessarily worse or slower than other access paths.
To determine if your FTS really is a problem, you can use DBMS_XPLAN to get timings of how long each step in your plan is taking. Like this:
First, tell Oracle to keep track of plan-step-level statistics for your session
alter session set statistics_level = ALL;
Make sure you turn of DBMS_OUTPUT / server output
Run your query to completion (i.e., scroll to the bottom of the result set)
Finally, run this query:
SELECT *
FROM TABLE (DBMS_XPLAN.display_cursor (null, null,
'ALLSTATS LAST'));
The output will tell you exactly why your query is taking so long (if it is taking long). It is much more accurate than just picking out all the full table scans in your explain plan.
First thing, why do you want to avoid full table scan? All full table scans are not bad.
You are joining on the same table cc.code_combination_id = l.code_combination_id. I don't think there is a away to avoid full table scan on these type of joins.
To understand this, I created test tables and data.
create table I1(n number primary key, v varchar2(10));
create table I2(n number primary key, v varchar2(10));
and a map table
create table MAP(n number primary key, i1 number referencing I1(n),
i2 number referencing I2(n));
I created index on map table.
create index map_index_i1 on map(i1);
create index map_index_i2 on map(i2);
Here is the sample data that I inserted.
SQL> select * from i1;
N V
1 ONE
2 TWO
5 FIVE
SQL> select * from i2;
N V
3 THREE
4 FOUR
5 FIVE
SQL> select * from map;
N I1 I2
1 1 3
2 1 4
5 5 5
I do gathered the statistics. Then, I executed the query which uses I1 and I2 from map table.
explain plan for
select map.n,i1.v
from i1,map
where map.i2 = map.i1
and i1.n=5
Remember, we have index on I1 and I2 of map table. I thought the optimizer might use the index, but unfortunately it didn't.
Full table scan
Because the condition map.i2 = map.i1 means compare every record of map table's I2 column with I1.
Next, I used one of the indexed columns in the where condition and now it picked the index.
explain plan for
select map.n,i1.v
from i1,map
where map.i2 = map.i1
and i1.n=5
and map.i1=5
Index scan
Have a look at ASK Tom's pages for full table scans. Unfortunately, I couldn't paste the source an link since I have less than 10 reputation !!

ORDER BY subquery and ROWNUM goes against relational philosophy?

Oracle 's ROWNUM is applied before ORDER BY. In order to put ROWNUM according to a sorted column, the following subquery is proposed in all documentations and texts.
select *
from (
select *
from table
order by price
)
where rownum <= 7
That bugs me. As I understand, table input into FROM is relational, hence no order is stored, meaning the order in the subquery is not respected when seen by FROM.
I cannot remember the exact scenarios but this fact of "ORDER BY has no effect in the outer query" I have read more than once. Examples are in-line subqueries, subquery for INSERT, ORDER BY of PARTITION clause, etc. For example in
OVER (PARTITION BY name ORDER BY salary)
the salary order will not be respected in outer query, and if we want salary to be sorted at outer query output, another ORDER BY need to be added in the outer query.
Some insights from everyone on why the relational property is not respected here and order is stored in the subquery ?
The ORDER BY in this context is in effect Oracle's proprietary syntax for generating an "ordered" row number on a (logically) unordered set of rows. This is a poorly designed feature in my opinion but the equivalent ISO standard SQL ROW_NUMBER() function (also valid in Oracle) may make it clearer what is happening:
select *
from (
select ROW_NUMBER() OVER (ORDER BY price) rn, *
from table
) t
where rn <= 7;
In this example the ORDER BY goes where it more logically belongs: as part of the specification of a derived row number attribute. This is more powerful than Oracle's version because you can specify several different orderings defining different row numbers in the same result. The actual ordering of rows returned by this query is undefined. I believe that's also true in your Oracle-specific version of the query because no guarantee of ordering is made when you use ORDER BY in that way.
It's worth remembering that Oracle is not a Relational DBMS. In common with other SQL DBMSs Oracle departs from the relational model in some fundamental ways. Features like implicit ordering and DISTINCT exist in the product precisely because of the non-relational nature of the SQL model of data and the consequent need to work around keyless tables with duplicate rows.
Not surprisingly really, Oracle treats this as a bit of a special case. You can see that from the execution plan. With the naive (incorrect/indeterminate) version of the limit that crops up sometimes, you get SORT ORDER BY and COUNT STOPKEY operations:
select *
from my_table
where rownum <= 7
order by price;
--------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 13 | 3 (34)| 00:00:01 |
| 1 | SORT ORDER BY | | 1 | 13 | 3 (34)| 00:00:01 |
|* 2 | COUNT STOPKEY | | | | | |
| 3 | TABLE ACCESS FULL| MY_TABLE | 1 | 13 | 2 (0)| 00:00:01 |
--------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter(ROWNUM<=7)
If you just use an ordered subquery, with no limit, you only get the SORT ORDER BY operation:
select *
from (
select *
from my_table
order by price
);
-------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 13 | 3 (34)| 00:00:01 |
| 1 | SORT ORDER BY | | 1 | 13 | 3 (34)| 00:00:01 |
| 2 | TABLE ACCESS FULL| MY_TABLE | 1 | 13 | 2 (0)| 00:00:01 |
-------------------------------------------------------------------------------
With the usual subquery/ROWNUM construct you get something different,
select *
from (
select *
from my_table
order by price
)
where rownum <= 7;
------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 13 | 3 (34)| 00:00:01 |
|* 1 | COUNT STOPKEY | | | | | |
| 2 | VIEW | | 1 | 13 | 3 (34)| 00:00:01 |
|* 3 | SORT ORDER BY STOPKEY| | 1 | 13 | 3 (34)| 00:00:01 |
| 4 | TABLE ACCESS FULL | MY_TABLE | 1 | 13 | 2 (0)| 00:00:01 |
------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(ROWNUM<=7)
3 - filter(ROWNUM<=7)
The COUNT STOPKEY operation is still there for the outer query, but the inner query (inline view, or derived table) now has a SORT ORDER BY STOPKEY instead of the simple SORT ORDER BY. This is all hidden away in the internals so I'm speculating, but it looks like the stop key - i.e. the row number limit - is being pushed into the subquery processing, so in effect the subquery may only end up with seven rows anyway - though the plan's ROWS value doesn't reflect that (but then you get the same plan with a different limit), and it still feels the need to apply the COUNT STOPKEY operation separately.
Tom Kyte covered similar ground in an Oracle Magazine article, when talking about "Top- N Query Processing with ROWNUM" (emphasis added):
There are two ways to approach this:
- Have the client application run that query and fetch just the first N rows.
- Use that query as an inline view, and use ROWNUM to limit the results, as in SELECT * FROM ( your_query_here ) WHERE ROWNUM <= N.
The second approach is by far superior to the first, for two reasons. The lesser of the two reasons is that it requires less work by the client, because the database takes care of limiting the result set. The more important reason is the special processing the database can do to give you just the top N rows. Using the top- N query means that you have given the database extra information. You have told it, "I'm interested only in getting N rows; I'll never consider the rest." Now, that doesn't sound too earth-shattering until you think about sorting—how sorts work and what the server would need to do.
... and then goes on to outline what it's actually doing, rather more authoritatively than I can.
Interestingly I don't think the order of the final result set is actually guaranteed; it always seems to work, but arguably you should still have an ORDER BY on the outer query too to make it complete. It looks like the order isn't really stored in the subquery, it just happens to be produced like that. (I very much doubt that will ever change as it would break too many things; this ends up looking similar to a table collection expression which also always seems to retain its ordering - breaking that would stop dbms_xplan working though. I'm sure there are other examples.)
Just for comparison, this is what the ROW_NUMBER() equivalent does:
select *
from (
select ROW_NUMBER() OVER (ORDER BY price) rn, my_table.*
from my_table
) t
where rn <= 7;
-------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2 | 52 | 4 (25)| 00:00:01 |
|* 1 | VIEW | | 2 | 52 | 4 (25)| 00:00:01 |
|* 2 | WINDOW SORT PUSHED RANK| | 2 | 26 | 4 (25)| 00:00:01 |
| 3 | TABLE ACCESS FULL | MY_TABLE | 2 | 26 | 3 (0)| 00:00:01 |
-------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("RN"<=7)
2 - filter(ROW_NUMBER() OVER ( ORDER BY "PRICE")<=7)
Adding to sqlvogel's good answer :
"As I understand, table input into FROM is relational"
No, table input into FROM is not relational. It is not relational because "table input" are tables and tables are not relations. The myriads of quirks and oddities in SQL eventually all boil down to that simple fact : the core building brick in SQL is the table, and a table is not a relation. To sum up the differences :
Tables can contain duplicate rows, relations cannot. (As a consequence, SQL offers bag algebra, not relational algebra. As another consequence, it is as good as impossible for SQL to even define equality comparison for its most basic building brick !!! How would you compare tables for equality given that you might have to deal with duplicate rows ?)
Tables can contain unnamed columns, relations cannot. SELECT X+Y FROM ... As a consequence, SQL is forced into "column identity by ordinal position", and as a consequence of that, you get all sorts of quirks, e.g. in SELECT A,B FROM ... UNION SELECT B,A FROM ...
Tables can contain duplicate column names, relations cannot. A.ID and B.ID in a table are not distinct column names. The part before the dot is not part of the name, it is a "scope identifier", and that scope identifier "disappears" once you're "outside the SELECT" it appears/is introduced in. You can verify this with a nested SELECT : SELECT A.ID FROM (SELECT A.ID, B.ID FROM ...). It won't work (unless your particular implementation departs from the standard in order to make it work).
Various SQL constructs leave people with the impression that tables do have an ordering to rows. The ORDER BY clause, obviously, but also the GROUP BY clause (which can be made to work only by introducing rather dodgy concepts of "intermediate tables with rows grouped together"). Relations simply are not like that.
Tables can contain NULLs, relations cannot. This one has been beaten to death.
There should be some more, but I don't remember them off the tip of the hat.

Postgresql Update big table slowly (16 million rows)

I have a big table including around 16 million rows:
POST table:
+--------------------------------+-----------------------------+-----------+
| Column | Type | Modifiers |
+--------------------------------+-----------------------------+-----------+
| key | character varying(36) | not null |
| category_id | integer | |
| owner_id | integer | |
+--------------------------------+-----------------------------+-----------+
key's value contains 36 random characters.
INDEXES: btree(key), btree(owner_id), btree(category_id)
My GOAL is adding is_deleted attribute to this table, updating owner_id(previously is None) from another table. (around 2 million rows affected)
I tried updating multiple rows with format:
update post_tbl as p
set owner_id = c.owner_id,
is_deleted = True
from (values
(92485, 137736),
(948130, 250745),
(423832, 1164883),
(685966, 521767),
...
) as c(owner_id, id)
where p.key in c.id;
WITH number_of_rows are 100, 200, 500, 1000, 2000. However, it seems really slow ~ around 10 rows per second.
So that with 2000000 rows, it might take roughly 50 hours.
I also tried to change postgresql variables:
synchronous_commit=off;
wal_buffers=32;
shared_buffers=1GB;
random_page_cost=0.005;
seq_page_cost=0.005;
But not much change. (plus I tried to delete btree index in owner_id but nothing changes)
Could you please tell me how can I speed up this update? Thanks.

Simple Oracle UPDATE Statement unusually bad performance

every month I do a simple update statement on my oracle database. But, since monday it takes very long. The table grows every month by 5 percent. Now there are 8 million records stored.
The Statement:
update /*+ parallel(destination_tab, 4) */ destination_tab dest
set (full_name, state) =
(select /*+ parallel(source_tab, 4) */ dest.name, src.state
from source_tab src
where src.city = dest.city);
In real there are 20 fields to update, not only two... but so it looks easier to descripe the problem.
explain plan:
-----------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------------------------
| 0 | update statement | | 8517K| 3167M| 579M (50)|999:59:59 |
| 1 | update | destination_tab | | | | |
| 2 | PX COORDINATOR | | | | | |
| 3 | PX SEND QC (RANDOM) | :TQ10000 | 8517K| 3167M| 6198 (1)| 00:01:27 |
| 4 | px block iterator | | 8517K| 3167M| 6198 (1)| 00:01:27 |
| 5 | table access full | DESTINATION_TAB | 8517K| 3167M| 6198 (1)| 00:01:27 |
| 6 | table access by index rowid| SOURCE_TAB | 1 | 56 | 1 (0)| 00:00:01 |
|* 7 | index unique scan | CITY_PK | 1 | | 1 (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------------
Could anyone descripe to me, how this can be? The plan looks very bad! Thank you very very much.
You didn't say how long is too long. You are joining an 8 million row table. Not sure how many rows are in source_tab.
I noticed the execution plan indicates a full table scan of destination_tab. Is the city column on the destination_tab table indexed? If not, try adding an index. If it is, Oracle may be ignoring it because it knows it needs to return every value anyway and destination_tab is the driving table.
No matter how you optimize it, this will always degrade in performance as the tables grow because you are updating every row by fetching a value from the same table joined to another. That is, you are always doing N operations where N is the number of rows in destination_tab.
High-level questions/suggestions:
Do you need to update every row every time? Are only certain rows likely to have changed values? If so, can you somehow predict which rows you need to update and limit your updates to it.
Why are the hints there? If performance changes, I would experiment with dropping hints. It's the optimizer's job to find the best plan for you. By using hints, you are telling the optimizer how to do its job. You'd better be right.
You are updating the full_name column on destination_tab to the name column of the same row. But you are obtaining the name column through a join to the table. It may be quicker to take that out of your select and use something like below. This is a guess. It may not matter.
update destination_tab dest
set full_name = name,
state =
(select src.state
from source_tab src
where src.city = dest.city);
Try the following.
merge
into destination_tab d
using source_tab s
on (d.city = d.city)
when matched then
update
set d.state = s.state
where decode(d.state, s.state, 1, 0) = 0;
If this is a data warehouse, I wouldn't do updates, especially not every row in a large table. I'd probably create a materialized view combining the pieces from various base tables, and do a full refresh when needed (non-atomic: truncate + insert append).
Edit:
As for WHY the current update approach is taking much longer than usual, my guess is that in previous runs Oracle found a good number of blocks needed for the update in buffer cache, and lately Oracle has had to pull a lot from disk into buffer first. You can look into consistent gets and db block gets (logical io) vs physical io (disk).
I understand the comments about the sense of a data warehouse and so on. However, I have to do this update in this kind. The update is part of an ETL workflow. I have to copy every month the complete 8 million records of the table "destination". After this step I have to do the UPDATE which makes problems.
I do not understand the problem, that the performance is so bad day-to-day. Usually, the update runs 45 minutes. Now, it runs about 4 hours. But why? There is no sorting necessary, so the famous reason "sorting on disc instead on main memory" is not possible. What is the problem in my case?
Could there be an difference about the performance between normal update (how I do it) and the merge-update?

Adding an Index degraded execution time

I have a table like this:
myTable (id, group_id, run_date, table2_id, description)
I also have a index like this:
index myTable_grp_i on myTable (group_id)
I used to run a query like this:
select * from myTable t where t.group_id=3 and t.run_date='20120512';
and it worked fine and everyone was happy.
Until I added another index:
index myTable_tab2_i on myTable (table2_id)
My life became miserable... it's taking almost as 5 times longer to run !!!
execution plan looks the same (with or without the new index):
--------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost
--------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 220 | 17019
|* 1 | TABLE ACCESS BY INDEX ROWID| MYTABLE | 1 | 220 | 17019
|* 2 | INDEX RANGE SCAN | MYTABLE_GRP_I | 17056 | | 61
--------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("T"."RUN_DATE"='20120512')
2 - access("T"."GROUP_ID"=3)
I have almost no hair left on my head, why should another index which is not used, on a column which is not in the where clause make a difference ...
I will update the things I checked:
a. I removed the new index and it run faster
b. I added the new index in 2 more different environments and the same thing happen
c. I changed MYTABLE_GRP_I to be on columns run_date and group_id - this made it run fast as a lightning !!
But still why does it happen ?

Resources