OPTIMIZE TABLE does not run right after a table mutation - ClickHouse

I'm updating a table with table mutations like this:
ALTER TABLE T1
UPDATE column1 = replaceAll(column1, 'X', 'Y') WHERE 1
After that, I'm sending an OPTIMIZE ... FINAL command with clickhouse-client, like this:
OPTIMIZE TABLE T1 FINAL
Ok.
0 rows in set. Elapsed: 0.002 sec.
But it returns instantly (0.002 sec.) and I can see that the rows are not updated yet.
After a couple of seconds (10-50) I run the OPTIMIZE ... FINAL command again, but this time it hangs until the table is optimized.
Is this the expected behavior of OPTIMIZE ... FINAL?

I can see the rows are not updated yet.
ALTER TABLE T1 UPDATE -- asynchronous
You should check that your mutation is done: select count() from system.mutations where not is_done;
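For example, to see what is still pending for this table (these are standard columns of system.mutations; adjust the table filter to your setup):
SELECT mutation_id, command, parts_to_do, is_done, latest_fail_reason
FROM system.mutations
WHERE table = 'T1' AND NOT is_done;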
In newer versions you can run mutations synchronously:
ALTER TABLE T1 UPDATE column1 = replaceAll('X', 'Y') SETTINGS
mutations_sync = 2
mutations_sync (default 0): "Wait for synchronous execution of ALTER TABLE UPDATE/DELETE queries (mutations). 0 - execute asynchronously. 1 - wait current server. 2 - wait all replicas if they exist."
OPTIMIZE TABLE T1 FINAL
OPTIMIZE -- starts a merge; merges have no relation to mutations.
0 rows in set. Elapsed: 0.002 sec.
In some cases OPTIMIZE cannot start and returns immediately.
Use optimize_throw_if_noop to find out the reason:
set optimize_throw_if_noop = 1;
OPTIMIZE TABLE T1 FINAL;

Related

Sequential vs parallel solution

I will try to present my problem as simply as possible.
Assume that we have 3 tables in Oracle 11g.
Persons (person_id, name, surname, status, etc.)
Actions (action_id, person_id, action_value, action_date, calculated_flag)
Calculations (calculation_id, person_id,computed_value,computed_date)
What I want is this: for each person that meets certain criteria (let's say status=3),
I should get the sum of action_values from the Actions table where calculated_flag=0 (something like: select sum(action_value) from Actions where calculated_flag=0 and person_id=current_id).
Then I use that sum in some kind of formula and update the Calculations table for that specific person_id.
update Calculations set computed_value=newvalue, computed_date=sysdate
where person_id=current_id
After that, calculated_flag for the participating rows will be set to 1.
update Actions set calculated_flag=1
where calculated_flag=0 and person_id=current_id
Now this can easily be done sequentially, by creating a cursor that runs through the Persons table and then executes each step needed for the specific person.
(I don't provide the code for the sequential solution as the above is just an example that resembles my real-world setup.)
The problem is that we are talking about quite a large amount of data, and the sequential approach seems like a waste of computational time.
It seems to me that this task could be performed in parallel for a number of person_ids.
So the question is:
Can this kind of task be performed using parallelization in PL/SQL?
What would the solution look like? That is, what special packages (e.g. DBMS_PARALLEL_EXECUTE), keywords (e.g. bulk collect), methods should be used and in what manner?
Also, should I have any concerns about partial failure of parallel updates?
Note that I am not quite familiar with parallel programming with PL/SQL.
Thanks.
Edit 1.
Here is my pseudo code for the sequential solution:
procedure sequential_solution is
cursor persons_of_interest is
select person_id from persons
where status = 3;
tempvalue number;
newvalue number;
begin
for person in persons_of_interest
loop
begin
savepoint personsp;
--step 1
select sum(action_value) into tempvalue
from actions
where calculated_flag = 0
and person_id = person.person_id;
newvalue := dosomemorecalculations(tempvalue);
--step 2
update calculations set computed_value = newvalue, computed_date = sysdate
where person_id = person.person_id;
--step 3
update actions set calculated_flag = 1
where calculated_flag = 0 and person_id = person.person_id;
--step 4 (didn't mention this step before - sorry)
insert into actions
( person_id, action_value, action_date, calculated_flag )
values
( person.person_id, 100, sysdate, 0 );
exception
when others then
rollback to personsp;
-- this call is defined with pragma AUTONOMOUS_TRANSACTION:
log_failure(person.person_id);
end;
end loop;
end;
Now, how would I speed up the above, either with forall and bulk collect or with parallel programming, under the following constraints:
proper memory management (taking into consideration the large amount of data)
for a single person, if one part of the step sequence fails, all steps should be rolled back and the failure logged
I can propose the following. Let's say you have 1,000,000 rows in the persons table and you want to process 10,000 persons per iteration. You can do it this way:
declare
id_from persons.person_id%type;
id_to persons.person_id%type;
calc_date date := sysdate;
begin
for i in 1 .. 100 loop
id_from := (i - 1) * 10000 + 1;
id_to := i * 10000;
-- Updating Calculations table, errors are logged into err$_calculations table
merge into Calculations c
using (select p.person_id, sum(action_value) newvalue
from Actions a join persons p on p.person_id = a.person_id
where a.calculated_flag = 0
and p.status = 3
and p.person_id between id_from and id_to
group by p.person_id) s
on (s.person_id = c.person_id)
when matched then update
set c.computed_value = s.newvalue,
c.computed_date = calc_date
log errors into err$_calculations reject limit unlimited;
-- updating the actions table only for those person_ids which had no errors:
merge into actions a
using (select distinct p.person_id
from persons p join Calculations c on p.person_id = c.person_id
where c.computed_date = calc_date
and p.person_id between id_from and id_to) s
on (s.person_id = a.person_id)
when matched then update
set a.calculated_flag = 1;
-- inserting a new action row for each person whose calculations were successful
insert into actions (person_id, action_value, action_date, calculated_flag)
select distinct p.person_id, 100, calc_date, 0
from persons p join Calculations c on p.person_id = c.person_id
where c.computed_date = calc_date
and p.person_id between id_from and id_to;
commit;
end loop;
end;
How it works:
You split the data in the persons table into chunks of about 10,000 rows (this depends on gaps in the IDs; the maximum value of i * 10000 should be known to be greater than the maximal person_id).
You make a calculation in the MERGE statement and update the Calculations table
The LOG ERRORS clause prevents exceptions from aborting the statement. If an error occurs, the offending row is not updated but is inserted into an error logging table instead, and execution continues. To create this table, execute:
begin
DBMS_ERRLOG.CREATE_ERROR_LOG('CALCULATIONS');
end;
The table err$_calculations will be created. See the documentation for more information about the DBMS_ERRLOG package.
The second MERGE statement sets calculated_flag = 1 only for rows where no errors occurred. The INSERT statement then adds a new row to the actions table for these persons; they can be found with a simple select from the Calculations table.
Also, I added the variables id_from and id_to to compute the ID range to update, and the variable calc_date to make sure that all rows updated in the first MERGE statement can be found later by date.
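As for the DBMS_PARALLEL_EXECUTE package mentioned in the question, a rough, untested sketch of that route could look like the following. The task name, chunk size, parallel level, and process_persons (a hypothetical procedure wrapping steps 1-4 for a single ID range, including the savepoint/rollback and logging logic) are all assumptions for illustration:
begin
  dbms_parallel_execute.create_task('CALC_TASK');
  -- split persons into chunks of roughly 10000 person_id values
  dbms_parallel_execute.create_chunks_by_number_col(
    task_name    => 'CALC_TASK',
    table_owner  => user,
    table_name   => 'PERSONS',
    table_column => 'PERSON_ID',
    chunk_size   => 10000);
  -- each chunk runs the same statement; :start_id and :end_id are bound by the framework
  dbms_parallel_execute.run_task(
    task_name      => 'CALC_TASK',
    sql_stmt       => 'begin process_persons(:start_id, :end_id); end;',
    language_flag  => dbms_sql.native,
    parallel_level => 4);
  dbms_parallel_execute.drop_task('CALC_TASK');
end;
Each chunk is committed independently, so a failed chunk can be retried with dbms_parallel_execute.resume_task without redoing the chunks that already succeeded.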

Performance issue in hive version 0.13.1

I use AWS EMR to run my Hive queries, and I have a performance issue with Hive version 0.13.1.
The newer version of Hive took around 5 minutes to run over 10 rows of data, but the same script over 230,804 rows has been running for 2 days and is still not finished. What should I do to analyze and fix the problem?
Sample Data:
Table 1:
hive> describe foo;
OK
orderno string
Time taken: 0.101 seconds, Fetched: 1 row(s)
Sample data for table1:
hive>select * from foo;
OK
1826203307
1826207803
1826179498
1826179657
Table 2:
hive> describe de_geo_ip_logs;
OK
id bigint
startorderno bigint
endorderno bigint
itemcode int
Time taken: 0.047 seconds, Fetched: 4 row(s)
Sample data for Table 2:
hive> select * from bar;
127698025 417880320 417880575 306
127698025 3038626048 3038626303 584
127698025 3038626304 3038626431 269
127698025 3038626560 3038626815 163
My Query:
SELECT b.itemcode
FROM foo a, bar b
WHERE a.orderno BETWEEN b.startorderno AND b.endorderno;
At the very top of your Hive log output, it states "Warning: Shuffle Join JOIN[4][Tables a, b] in Stage 'Stage-1 Mapred' is a cross product."
EDIT:
A 'cross product' or Cartesian product is a join without conditions, which returns every row of the 'b' table combined with every row of the 'a' table. So, if 'a' is 5 rows and 'b' is 10 rows, you get the product, 5 multiplied by 10 = 50 rows returned, most of which are meaningless for your query.
Now, if you have a table 'a' of 20,000 rows and join it to another table 'b' of 500,000 rows, you are asking the SQL engine to return a data set 'a, b' of 10,000,000,000 rows, and then perform the BETWEEN operation on those 10 billion rows.
So, if you reduce the number of 'b' rows, you will get more benefit than reducing 'a'. In your example, if you can filter table 2 (the ip_logs table, bar in your query), which I am guessing has more rows than your order-number table, that will cut down on the execution time.
END EDIT
You're forcing the execution engine to work through a Cartesian product by not specifying a condition for the join. It's having to scan all of table a over and over. With 10 rows, you will not have a problem. With 20k, you are running into dozens of map/reduce waves.
Try this query:
SELECT b.itemcode
FROM foo a JOIN bar b on <SomeKey>
WHERE a.orderno BETWEEN b.startorderno AND b.endorderno;
But I'm having trouble figuring out what column your model will allow joining on. Maybe the data model for this expression could be improved? It may just be me not reading the sample clearly.
Either way, you need to reduce the number of comparisons BEFORE the WHERE clause is applied. Another way I have done this in Hive is to make a view with a smaller set of data and join/match the view instead of the original table, as sketched below.
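A hedged sketch of that view idea, assuming you know an order-number range that covers the rows you care about (the bounds below are made up):
CREATE VIEW bar_small AS
SELECT startorderno, endorderno, itemcode
FROM bar
WHERE endorderno >= 1826000000      -- assumption: range of orderno values present in foo
  AND startorderno <= 1827000000;

SELECT b.itemcode
FROM foo a, bar_small b
WHERE a.orderno BETWEEN b.startorderno AND b.endorderno;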

optimizing a dup delete statement Oracle

I have 2 delete statements that are taking a long time to complete. There are several indexes on the columns in the where clause.
What is a duplicate?
If 2 or more records have the same values in the columns id, cid, type, trefid, ordrefid, amount and paydt, then they are duplicates.
The DELETEs remove about 1 million records.
Can they be rewritten in any way to make them quicker?
DELETE FROM TABLE1 A WHERE loaddt < (
SELECT max(loaddt) FROM TABLE1 B
WHERE
a.id=b.id and
a.cid=b.cid and
NVL(a.type,'-99999') = NVL(b.type,'-99999') and
NVL(a.trefid,'-99999')=NVL(b.trefid,'-99999') and
NVL(a.ordrefid,'-99999')= NVL(b.ordrefid,'-99999') and
NVL(a.amount,'-99999')=NVL(b.amount,'-99999') and
NVL(a.paydt,TO_DATE('9999-12-31','YYYY-MM-DD'))=NVL(b.paydt,TO_DATE('9999-12-31','YYYY-MM-DD'))
);
COMMIT;
DELETE FROM TABLE1 a where rowid > (
Select min(rowid) from TABLE1 b
WHERE
a.id=b.id and
a.cid=b.cid and
NVL(a.type,'-99999') = NVL(b.type,'-99999') and
NVL(a.trefid,'-99999')=NVL(b.trefid,'-99999') and
NVL(a.ordrefid,'-99999')= NVL(b.ordrefid,'-99999') and
NVL(a.amount,'-99999')=NVL(b.amount,'-99999') and
NVL(a.paydt,TO_DATE('9999-12-31','YYYY-MM-DD'))=NVL(b.paydt,TO_DATE('9999-12-31','YYYY-MM-DD'))
);
commit;
Explain Plan:
DELETE TABLE1
HASH JOIN 1296491
Access Predicates
AND
A.ID=ITEM_1
A.CID=ITEM_2
ITEM_3=NVL(TYPE,'-99999')
ITEM_4=NVL(TREFID,'-99999')
ITEM_5=NVL(ORDREFID,'-99999')
ITEM_6=NVL(AMOUNT,(-99999))
ITEM_7=NVL(PAYDT,TO_DATE(' 9999-12-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
Filter Predicates
LOADDT<MAX(LOADDT)
TABLE ACCESS TABLE1 FULL 267904
VIEW VW_SQ_1 690385
SORT GROUP BY 690385
TABLE ACCESS TABLE1 FULL 267904
How large is the table? If the count of deleted rows is up to about 12% of the table, then you may think about an index.
Could you somehow partition your table - like week by week - and then scan only the current week?
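If partitioning is an option, a rough, untested sketch of that idea (11g interval partitioning by week on loaddt; the initial boundary date is arbitrary, and you would have to rebuild or redefine the table):
CREATE TABLE table1_part
PARTITION BY RANGE (loaddt) INTERVAL (NUMTODSINTERVAL(7, 'DAY'))
(PARTITION p_initial VALUES LESS THAN (DATE '2014-01-01'))
AS SELECT * FROM table1;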
Maybe this could be more efficient. When you use an aggregate function, Oracle must walk through all relevant rows (in your case a full scan), but when you use EXISTS it stops as soon as the first occurrence is found. (And of course the query would be much faster if there were one function-based index, because of the NVLs, on all columns in the where clause.)
DELETE FROM TABLE1 A
WHERE exists (
SELECT 1
FROM TABLE1 B
WHERE
a.loaddt < b.loaddt and
a.id=b.id and
a.cid=b.cid and
NVL(a.type,'-99999') = NVL(b.type,'-99999') and
NVL(a.trefid,'-99999')=NVL(b.trefid,'-99999') and
NVL(a.ordrefid,'-99999')= NVL(b.ordrefid,'-99999') and
NVL(a.amount,'-99999')=NVL(b.amount,'-99999') and
NVL(a.paydt,TO_DATE('9999-12-31','YYYY-MM-DD'))=NVL(b.paydt,TO_DATE('9999-12-31','YYYY-MM-DD'))
);
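The function-based index mentioned above might look something like this (a sketch; it assumes amount is numeric and that the NVL defaults cannot occur in real data):
CREATE INDEX table1_dedup_fbi ON table1 (
  id,
  cid,
  NVL(type, '-99999'),
  NVL(trefid, '-99999'),
  NVL(ordrefid, '-99999'),
  NVL(amount, -99999),
  NVL(paydt, TO_DATE('9999-12-31', 'YYYY-MM-DD'))
);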
Although some may disagree, I am a proponent of running large, long-running deletes procedurally. In my view it is much easier to control and track progress (and your DBA will like you better ;-)). Also, I'm not sure why you need to join table1 to itself to identify duplicates (and I'd be curious whether you ever run into snapshot-too-old issues with your current approach). You also shouldn't need multiple delete statements; all duplicates should be handled in one process. Finally, you should check WHY you're constantly re-introducing duplicates each week, and perhaps change the load process (maybe doing a merge/upsert rather than all inserts).
That said, you might try something like:
-- first create mat view to find all duplicates
create materialized view my_dups_mv
tablespace my_tablespace
build immediate
refresh complete on demand
as
select id,cid,type,trefid,ordrefid,amount,paydt, count(1) as cnt
from table1
group by id,cid,type,trefid,ordrefid,amount,paydt
having count(1) > 1;
-- dedup data (or put into procedure and schedule along with mat view refresh above)
declare
-- make sure my_dups_mv is refreshed first
cursor dup_cur is
select * from my_dups_mv;
type duprec_t is record(row_id rowid);
duprec duprec_t;
type duptab_t is table of duprec_t index by pls_integer;
duptab duptab_t;
l_ctr pls_integer := 0;
l_dupcnt pls_integer := 0;
begin
for rec in dup_cur
loop
l_ctr := l_ctr + 1;
-- assuming needed indexes exist
select rowid
bulk collect into duptab
from table1
where id = rec.id
and cid = rec.cid
and type = rec.type
and trefid = rec.trefid
and ordrefid = rec.ordrefid
and amount = rec.amount
and paydt = rec.paydt
-- order by whatever makes sense to make the "keeper" float to top
order by loaddt desc
;
for i in 2 .. duptab.count
loop
l_dupcnt := l_dupcnt + 1;
delete from table1 where rowid = duptab(i).row_id;
end loop;
if (mod(l_ctr, 10000) = 0) then
-- log to log table here (calling autonomous procedure you'll need to implement)
insert_logtable('Table1 deletes', 'Commit reached, deleted ' || l_dupcnt || ' rows');
commit;
end if;
end loop;
commit;
end;
Check your log table for progress status.
1. Parallel
alter session enable parallel dml;
DELETE /*+ PARALLEL */ FROM TABLE1 A WHERE loaddt < (
...
Assuming you have Enterprise Edition, a sane server configuration, and you are on 11g. If you're not on 11g, the parallel syntax is slightly different.
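For example, on 10g you would typically name the target and a degree in the hint (the degree of 4 here is purely illustrative):
DELETE /*+ PARALLEL(A 4) */ FROM TABLE1 A WHERE loaddt < (
...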
2. Reduce memory requirements
The plan shows a hash join, which is probably a good thing. But without any useful filters, Oracle has to hash the entire table. (Tbone's query, which only uses a GROUP BY, looks nicer and may run faster. But it will also probably run into the same problem trying to sort or hash the entire table.)
If the hash can't fit in memory it must be written to disk, which can be very slow. Since you run this query every week, only one side of the join needs to look at all the rows. Depending on exactly when it runs, you can add a condition like and b.loaddt >= sysdate - 14 to the subquery. This may significantly reduce the amount of writing to the temporary tablespace. And it may also reduce read IO if you use some partitioning strategy like jakub.petr suggested.
3. Active Report
If you want to know exactly what your query is doing, run the Active Report:
select dbms_sqltune.report_sql_monitor(sql_id => 'YOUR_SQL_ID_HERE', type => 'active')
from dual;
(Save the output to an .html file and open it with a browser.)
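(If you don't already know the SQL_ID, one way to look it up is a query like the following; the LIKE pattern is illustrative.)
select sql_id, sql_text
from v$sql
where sql_text like 'DELETE FROM TABLE1%';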

SQL insert slow on 1 million rows

With TOP 100000 (100k) this query finishes in about 3 seconds.
With TOP 1000000 (1mil) this query finishes in about 2 minutes.
SELECT TOP 1000000
db_id = IDENTITY(int, 1, 1), *
INTO dbo.tablename
FROM dbname.dbo.tablename
Actual execution plan is always:
clustered index scan 4% cost
top
top
compute scalar
insert (96% cost)
select into
The table has 1.3 million rows and an int primary key on the first column.
Can I speed it up somehow? I'm using SQL Server 2008 R2.
The results showed that 100,000 records takes 159 ms, and 1,000,000 records takes 1,435 ms. On a Raid 1 OS, Raid 1 Data, Raid 1 Log, Raid 1 TempDb, all separate drives. Our Dev environment.
The results showed that 100,000 records takes 113 ms, and 1,000,000 records takes 996 ms. On my laptop with a single SSD (Samsung 840 250GB). SSD's rock!!!
The results showed that 100,000 records takes 188 ms, and 1,000,000 records takes 1,880 ms. On a Raid 1 OS, Raid 10 Data, Raid 10 Log, Raid 1 TempDb all separate drives under a production load.
Here is a complete script that shows that 1 million rows take less than ten times as long as 100,000. Your situation is likely slightly different, but this shows that the fundamentals are not the issue.
The results show that 100,000 records takes 146 ms, and 1,000,000 records takes 1,315 ms.
These results are from my desktop. If someone else could run the script and post their results, that would be very useful.
Rob
USE master;
GO
-- Drop database SourceDB
IF EXISTS (SELECT * FROM sys.databases WHERE name = 'SourceDB') ALTER DATABASE SourceDB SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
IF EXISTS (SELECT * FROM sys.databases WHERE name = 'SourceDB') DROP DATABASE SourceDB;
GO
-- Create database SourceDB
CREATE DATABASE SourceDB;
ALTER DATABASE SourceDB SET RECOVERY SIMPLE;
GO
USE SourceDB;
GO
-- Create table SourceDB.dbo.SourceTable
CREATE TABLE dbo.SourceTable (
ColID int PRIMARY KEY
);
GO
-- Populate table SourceDB.dbo.SourceTable
DECLARE @i int = 0;
WHILE @i < 1300000
BEGIN
SET @i += 1;
INSERT INTO dbo.SourceTable (ColID) VALUES (@i);
END;
GO
-- Drop database Test1
IF EXISTS (SELECT * FROM sys.databases WHERE name = 'Test1') ALTER DATABASE Test1 SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
IF EXISTS (SELECT * FROM sys.databases WHERE name = 'Test1') DROP DATABASE Test1;
GO
-- Create database Test1
CREATE DATABASE Test1;
ALTER DATABASE Test1 SET RECOVERY SIMPLE;
ALTER DATABASE Test1 MODIFY FILE (NAME = Test1, SIZE = 3000MB, MAXSIZE = 8TB);
ALTER DATABASE Test1 MODIFY FILE (NAME = Test1_log, SIZE = 3000MB, MAXSIZE = 2TB);
GO
USE Test1;
GO
IF EXISTS (SELECT * FROM sys.tables WHERE [OBJECT_ID] = OBJECT_ID('dbo.DestinationTable1')) DROP TABLE dbo.DestinationTable1;
IF EXISTS (SELECT * FROM sys.tables WHERE [OBJECT_ID] = OBJECT_ID('dbo.DestinationTable2')) DROP TABLE dbo.DestinationTable2;
GO
DECLARE @n int = 100000;
DECLARE @t1 datetime2 = SYSDATETIME();
SELECT TOP (@n) db_id = IDENTITY(int, 1, 1), *
INTO dbo.DestinationTable1
FROM SourceDB.dbo.SourceTable;
SELECT DATEDIFF(ms, @t1, SYSDATETIME()) AS ElapsedMs;
GO
DECLARE @n int = 1000000;
DECLARE @t1 datetime2 = SYSDATETIME();
SELECT TOP (@n) db_id = IDENTITY(int, 1, 1), *
INTO dbo.DestinationTable2
FROM SourceDB.dbo.SourceTable;
SELECT DATEDIFF(ms, @t1, SYSDATETIME()) AS ElapsedMs;
GO

How do I get the number of inserts/updates occuring in an Oracle database?

How do I get the total number of inserts/updates that have occurred in an Oracle database over a period of time?
Assuming that you've configured AWR to retain data for all SQL statements (the default is to only retain the top 30 by CPU, elapsed time, etc. if the STATISTICS_LEVEL is 'TYPICAL' and the top 100 if the STATISTICS_LEVEL is 'ALL') via something like
BEGIN
dbms_workload_repository.modify_snapshot_settings (
topnsql => 'MAXIMUM'
);
END;
and assuming that SQL statements don't age out of the cache before a snapshot captures them, you can use the AWR tables for some of this.
You can gather the number of times that an INSERT statement was executed and the number of times that an UPDATE statement was executed
SELECT sum( stat.executions_delta ) insert_executions
FROM dba_hist_sqlstat stat
JOIN dba_hist_sqltext txt ON (stat.sql_id = txt.sql_id )
JOIN dba_hist_snapshot snap ON (stat.snap_id = snap.snap_id)
WHERE snap.begin_interval_time BETWEEN <<start time>> AND <<end time>>
AND txt.command_type = 2;
SELECT sum( stat.executions_delta ) update_executions
FROM dba_hist_sqlstat stat
JOIN dba_hist_sqltext txt ON (stat.sql_id = txt.sql_id )
JOIN dba_hist_snapshot snap ON (stat.snap_id = snap.snap_id)
WHERE snap.begin_interval_time BETWEEN <<start time>> AND <<end time>>
AND txt.command_type = 6;
Note that these queries include both statements that your application issues and statements that Oracle issues in the background. You could add additional criteria if you want to filter out certain SQL statements.
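For example, to limit the counts to statements issued by your application, you could add conditions like these (the schema and module values are placeholders; parsing_schema_name is available in recent versions of dba_hist_sqlstat):
AND stat.parsing_schema_name = 'MYAPP_SCHEMA'
-- or, if your application sets module information:
-- AND stat.module = 'MYAPP_MODULE'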
Similarly, you could get the total number of distinct INSERT and UPDATE statements
SELECT count( distinct stat.sql_id ) distinct_insert_stmts
FROM dba_hist_sqlstat stat
JOIN dba_hist_sqltext txt ON (stat.sql_id = txt.sql_id )
JOIN dba_hist_snapshot snap ON (stat.snap_id = snap.snap_id)
WHERE snap.begin_interval_time BETWEEN <<start time>> AND <<end time>>
AND txt.command_type = 2;
SELECT count( distinct stat.sql_id ) distinct_update_stmts
FROM dba_hist_sqlstat stat
JOIN dba_hist_sqltext txt ON (stat.sql_id = txt.sql_id )
JOIN dba_hist_snapshot snap ON (stat.snap_id = snap.snap_id)
WHERE snap.begin_interval_time BETWEEN <<start time>> AND <<end time>>
AND txt.command_type = 6;
Oracle does not, however, track the number of rows that were inserted or updated in a given interval. So you won't be able to get that information from AWR. The closest you could get would be to try to leverage the monitoring Oracle does to determine if statistics are stale. Assuming MONITORING is enabled for each table (it is by default in 11g and I believe it is by default in 10g), i.e.
ALTER TABLE table_name
MONITORING;
Oracle will periodically flush the approximate number of rows that are inserted, updated, and deleted for each table to the SYS.DBA_TAB_MODIFICATIONS table. But this will only show the activity since statistics were last gathered on the table, not the activity in a particular interval. You could, however, write a process that periodically captures this data to your own table and reports off that.
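A rough sketch of such a capture process (tab_mod_history is a made-up name; the insert would be scheduled, e.g. with DBMS_SCHEDULER):
-- one-time setup: an empty copy of the view plus a capture timestamp
CREATE TABLE tab_mod_history AS
SELECT SYSDATE AS capture_time, m.*
FROM sys.dba_tab_modifications m
WHERE 1 = 0;
-- run periodically (see the note about flushing monitoring info just below)
INSERT INTO tab_mod_history
SELECT SYSDATE, m.*
FROM sys.dba_tab_modifications m;
COMMIT;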
If you instruct Oracle to flush the monitoring information from memory to disk (otherwise there is a lag of up to several hours)
BEGIN
dbms_stats.flush_database_monitoring_info;
END;
you can get an approximate count of the number of rows that have changed in each table since statistics were last gathered
SELECT table_owner,
table_name,
inserts,
updates,
deletes
FROM sys.dba_tab_modifications
