In my limited experience with a ClickHouse cluster: I currently have two nodes using ReplicatedMergeTree, 1 shard with 2 replicas. I'm running into a problem when synchronizing data from MySQL.
When updating the table, I first delete the data from the last few days, count the remaining records where data_date > :days_ago, and then load the data from MySQL. The code looks roughly like this:
alter table ods.my_table delete where data_date > :days_ago;  -- this runs as a mutation
-- here check whether the record count is zero
select count(*) from ods.my_table where data_date > :days_ago;
-- if count(*) = 0, load the data; else wait
insert into ods.my_table select * from mysql('xxx'......) where data_date > :days_ago;
But then I get zero records in the ClickHouse table ods.my_table where data_date > :days_ago.
If I run it again, it has data; run it again and it's zero again... The pattern is: when the count is zero, a rerun works; when it's not zero, a rerun does not.
I analyzed the log and found that the insert statement was executed before the mutation was done, so data went missing.
I tried to check whether the mutation on the table had finished, but I could not find any solution. Can anybody help me? Thank you in advance.
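What I was hoping for is something like the following (just a sketch; I don't know whether polling system.mutations like this is the right approach):

select count()
from system.mutations
where database = 'ods'
  and table = 'my_table'
  and is_done = 0;
-- when this returns 0, all mutations on ods.my_table have completed and the insert could run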
Just add a table TTL definition on the ClickHouse side and forget about the manual delete:
https://clickhouse.tech/docs/en/engines/table-engines/mergetree-family/mergetree/#mergetree-table-ttl
You can add a TTL to an existing ClickHouse MergeTree table:
https://clickhouse.tech/docs/en/sql-reference/statements/alter/ttl/
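For example, something like this (the 30-day window is only a placeholder, and it assumes data_date is a Date column):

ALTER TABLE ods.my_table MODIFY TTL data_date + INTERVAL 30 DAY;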
I'm trying to see whether I can use database locks to deal with race conditions. For example:
CREATE TABLE T1
(
  T1_ID      NUMBER PRIMARY KEY,
  AMT        NUMBER,
  STATUS1    CHAR(1),
  STATUS2    CHAR(1),
  UPDATED_BY VARCHAR2(25)
);
insert into t1 values (order_seq.nextval, 1, 'N', 'N', 'U0');
Later, two users may update the order record at the same time. The requirement is that only one can proceed while the other should NOT. We could certainly use a distributed lock manager (DLM) to do this, but I figure a database lock may be more efficient.
User 1:
update T1 set status1='Y', updated_by='U1' where status1='N';
User 2:
update T1 set status2='Y', updated_by='U2' where status1='N';
Two users are doing these at the same time. Ideally only one should be allowed to proceed. I played with this in SQL*Plus and also wrote a little Java test program letting two threads do these simultaneously, with the same result. Let's say User 1 gets the DB row lock first; that statement returns "1 row updated". The second session is blocked waiting for the row lock until the 1st session commits or rolls back. The question is REALLY this:
An update with a WHERE clause seems like two operations: first it does an implicit select based on the WHERE clause to pick the row that will be updated. Since Oracle only supports the READ COMMITTED isolation level, I expected both UPDATE statements to pick the single record in the DB. As a result, I expected both UPDATE statements to eventually return "1 row updated", although one would wait until the other transaction commits. HOWEVER, that's not what I saw. The second UPDATE returns "0 rows updated" after the first commits. It seems that Oracle actually runs the WHERE clause AGAIN after the first session commits, which results in the "0 rows updated" outcome.
This is strange to me. I thought I would run into the classical "lost update" phenomenon.
Can somebody please explain what's going on here? Thanks very much!
I am trying to insert data into Elasticsearch from a Hive table.
CREATE EXTERNAL TABLE IF NOT EXISTS es_temp_table (
dt STRING,
text STRING
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource'='aggr_2014-10-01/metric','es.index.auto.create'='true')
;
INSERT OVERWRITE TABLE es_temp_table
SELECT dt, description
FROM other_table
However, the counts are off. When I do a count(*) on other_table I get 6,000 rows, but when I search the aggr_2014-10-01 index, I see 10,000 records! Somehow the records are being duplicated (rows are being copied over multiple times). Maybe I could remove the duplicate records inside Elasticsearch? I'm not sure how I would do that, though.
I believe it might be a result of Hive/Qubole spawning two attempts for every map task. If one attempt succeeds, it tries to kill the other. However, the other attempt has already done its damage (i.e. inserted into Elasticsearch). This is my best guess, but I would prefer to know the exact reason and whether there is a way for me to fix it.
set mapred.map.tasks.speculative.execution=false;
One thing I found was to set speculative execution to false, so that only one attempt is spawned per map task (see the setting above). However, now I am seeing undercounting. I believe this may be due to records being skipped, but I am unable to diagnose why those records would be skipped in the first place.
In this version, it also means that if even one task/mapper fails, the entire job fails; I then need to delete the index (partial data was uploaded) and rerun the entire job (which takes ~4 hours).
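One more thing worth trying (an assumption on my part, not something I have verified yet) is to also disable speculative execution on the reduce side, since reduce tasks can be speculatively re-run as well:

set mapred.reduce.tasks.speculative.execution=false;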
[PROGRESS UPDATE]
I attempted to solve this by putting all of the work in the reducers (it was the only way I found to spawn only one attempt per task and ensure no duplicate record insertions).
INSERT OVERWRITE TABLE es_temp_table
SELECT dt, description
FROM other_table
DISTRIBUTE BY cast(rand()*250 as int);
However, I now see a huge undercount: only 2,000 records. Elasticsearch does estimate some counts, but not to this extent; there are simply fewer records in Elasticsearch. This may be due to failed tasks (which are no longer retried), or it may be from Qubole/Hive passing over malformed entries. But I set:
set mapreduce.map.skip.maxrecords=1000;
Here are some other settings for my query:
set es.nodes=node-names;
set es.port=9200;
set es.bulk.size.bytes=1000mb;
set es.http.timeout=20m;
set mapred.tasktracker.expiry.interval=3600000;
set mapred.task.timeout=3600000;
I determined the problem. As I suspected, the insertion was skipping over some records that were considered "bad." I was never able to find exactly which records were being skipped, but I tried replacing all non-alphanumeric characters with a space, and that solved the problem! The records are no longer being skipped, and all data is uploaded to Elasticsearch.
INSERT OVERWRITE TABLE es_temp_table
SELECT dt, REGEXP_REPLACE(description, '[^0-9a-zA-Z]+', ' ')
FROM other_table
I am inserting millions of records into a table.
From another session I want to know the status of this load, i.e. how many records have been processed.
No commit has been issued in the first session.
Query V$SESSION_LONGOPS. It's not 100% accurate, but it gives a good sense of the session's state.
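For example, something along these lines (a sketch; V$SESSION_LONGOPS only tracks operations that run longer than a few seconds, and :sid is the SID of the loading session):

SELECT sid, opname, target, sofar, totalwork,
       ROUND(sofar / NULLIF(totalwork, 0) * 100, 1) AS pct_done,
       time_remaining
FROM   v$session_longops
WHERE  sid = :sid
AND    sofar <> totalwork;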
If you have Oracle 11g, I propose using the DBMS_PARALLEL_EXECUTE package, which allows you to create task chunks and monitor the process flow (in your case, the number of processed rows). I found a tutorial covering a case similar to yours: http://www.oraclefrontovik.com/2013/05/getting-started-with-dbms_parallel_execute/
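Once a task is running, you should be able to watch the chunk statuses with something like this (a sketch; the task name is a placeholder):

SELECT status, COUNT(*) AS chunks
FROM   user_parallel_execute_chunks
WHERE  task_name = 'MY_LOAD_TASK'   -- hypothetical task name
GROUP  BY status;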
Create a procedure with PRAGMA AUTONOMOUS_TRANSACTION which updates your counters in another table. Other sessions will see the counters even though your main session has not committed yet.
You might update the counter every 1,000 records, or however frequently you need it. Don't forget to reset the counters when you roll back in the main session.
Example of such a procedure:
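A sketch, assuming a hypothetical counter table LOAD_PROGRESS(job_name, rows_done, updated_at) that you create yourself:

CREATE OR REPLACE PROCEDURE update_progress (p_job_name  IN VARCHAR2,
                                             p_rows_done IN NUMBER) IS
  PRAGMA AUTONOMOUS_TRANSACTION;
BEGIN
  UPDATE load_progress
  SET    rows_done  = p_rows_done,
         updated_at = SYSDATE
  WHERE  job_name = p_job_name;
  COMMIT;  -- commits only this autonomous transaction, not the caller's main transaction
END update_progress;

Call it from the loading session every N rows; other sessions can then simply query LOAD_PROGRESS.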
If you're doing the inserts in a loop, try this construct:
FOR cur IN (SELECT rownum rn, something ... FROM something_else ...)
LOOP
  INSERT INTO somewhere ...;
  dbms_application_info.set_module('Doing inserts', cur.rn || ' rows done');
END LOOP;
Then check the MODULE and ACTION columns in the V$SESSION view.
First, find the SQL_ID of your session in GV$SQL_MONITOR. Then use it to find your query's progress and details:
SELECT SQL_ID, status, PROCESS_NAME, FIRST_REFRESH_TIME, LAST_CHANGE_TIME
,PLAN_LINE_ID,PLAN_DEPTH,LPAD(' ',PLAN_DEPTH)||PLAN_OPERATION AS OPER
,PLAN_OPTIONS, PLAN_OBJECT_NAME, PLAN_OBJECT_TYPE
,OUTPUT_ROWS
FROM GV$SQL_PLAN_MONITOR
WHERE SQL_ID IN ( '0u063q75nbjt7' ) -- Your SQL_ID
order by sql_id,process_name,plan_line_id;
Once you have found the SQL_ID of your query, you can monitor its progress in (G)V$SQL_PLAN_MONITOR and the number of processed rows in the OUTPUT_ROWS column.
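If you don't already know the SQL_ID, one possible way to look it up (a sketch; it assumes you can filter by the loading user's name):

SELECT sql_id, status, username, sql_exec_start
FROM   gv$sql_monitor
WHERE  username = 'MY_USER'   -- placeholder user name
ORDER  BY sql_exec_start DESC;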
The status can sometimes be derived from the table's segment size.
This is usually not the best approach. But often times it is too late to add logging, and tools like v$session_longops and SQL Monitoring frequently fail to provide accurate estimates, or any estimates at all.
Estimating by the segment size can be very tricky. You must be able to estimate the final size, perhaps based on another environment. Progress may not be linear; some operations may take an hour, and then writing to the table takes the last hour. Direct-path writes will not write to the table's segment initially, instead there will be a temporary segment with an unusual name.
Start with a query like this:
select sum(bytes)/1024/1024 mb
from dba_segments
where segment_name = '[table name]';
I have a table with a VARCHAR2 column as its primary key.
It gets about 1,000,000 transactions per day.
My app wakes up every 5 minutes to generate a text file by querying only the new records.
It remembers the last point it reached and processes only the new records.
Do you have any ideas on how to query this with good performance?
I am able to add a new column if necessary.
What should this process be implemented in?
PL/SQL?
Java?
Everyone here is really really close. However:
Scott Bailey's wrong about using a bitmap index if the table's under any sort of continuous DML load. That's exactly the wrong time to use a bitmap index.
Everyone else's answer about the PROCESSED CHAR(1) CHECK IN ('Y','N') column is right, but missing how to index it; you should use a function-based index like this:
CREATE INDEX MY_UNPROCESSED_ROWS_IDX ON MY_TABLE
(CASE WHEN PROCESSED_FLAG = 'N' THEN 'N' ELSE NULL END);
You'd then query it using the same expression:
SELECT * FROM MY_TABLE
WHERE (CASE WHEN PROCESSED_FLAG = 'N' THEN 'N' ELSE NULL END) = 'N';
The reason to use the function-based index is that Oracle doesn't write index entries for entirely NULL values being indexed, so the function-based index above will only contain the rows with PROCESSED_FLAG = 'N'. As you update your rows to PROCESSED_FLAG = 'Y', they'll "fall out" of the index.
Well, if you can add a new column, you could create a PROCESSED column to indicate processed records, and create an index on this column for performance.
Then the query should select only those rows that have been newly added and not yet processed.
This should be easy to do with SQL queries.
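A minimal sketch of that idea (table and column names are placeholders):

ALTER TABLE my_table ADD (processed CHAR(1) DEFAULT 'N');
CREATE INDEX my_table_processed_idx ON my_table (processed);

SELECT * FROM my_table WHERE processed = 'N';
-- ... generate the file from these rows, then ...
UPDATE my_table SET processed = 'Y' WHERE processed = 'N';
COMMIT;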
Ah, I really hate to add another answer when the others have come so close to nailing it. But...
As Ponies points out, Oracle does have a hidden column (ORA_ROWSCN - System Change Number) that can pinpoint when each row was modified. Unfortunately, the default is that it gets the information from the block instead of storing it with each row and changing that behavior will require you to rebuild a really large table. So while this answer is good for quieting the SQL Server fella, I'd not recommend it.
Astander is right there but needs a few caveats. Add a new column needs_processed CHAR(1) DEFAULT 'Y' and add a BITMAP index. For low-cardinality columns ('Y'/'N') the bitmap index will be faster. Once you have that, the rest is pretty easy. But you've got to be careful not to select the new rows, process them, and then mark them as processed in one blanket step. Otherwise, rows could be inserted while you are processing that would get marked processed even though they have not been.
The easiest way would be to use PL/SQL to open a cursor that selects the unprocessed rows, processes them, and then updates each row as processed. If you have an aversion to walking cursors, you could collect the PKs or ROWIDs into a nested table, process them, and then update using the nested table.
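A rough sketch of the cursor approach (table and column names are placeholders, reusing the NEEDS_PROCESSED flag suggested above):

DECLARE
  CURSOR c_unprocessed IS
    SELECT pk_col
    FROM   my_table
    WHERE  needs_processed = 'Y'
    FOR UPDATE;
BEGIN
  FOR r IN c_unprocessed LOOP
    -- ... write r.pk_col (or the full row) to the file here ...
    UPDATE my_table
    SET    needs_processed = 'N'
    WHERE  CURRENT OF c_unprocessed;
  END LOOP;
  COMMIT;
END;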
In MS SQL Server world where I work, we have a 'version' column of type 'timestamp' on our tables.
So, to answer #1, I would add a new column.
To answer #2, I would do it in plsql for performance.
Mark
"astander" pretty much did the work for you. You need to ALTER your table to add one more column (lets say PROCESSED)..
You can also consider creating an INDEX on the PROCESSED ( a bitmap index may be of some advantage, as the possible value can be only 'y' and 'n', but test it out ) so that when you query it will use INDEX.
Also if sure, you query only for every 5 mins, check whether you can add another column with TIMESTAMP type and partition the table with it. ( not sure, check out again ).
I would also think about writing job or some thing and write using UTL_FILE and show it front end if it can be.
If performance is really a problem and you want to create your file asynchronously, you might want to use Oracle Streams, which will actually get modification data from your redo log without affecting the performance of the main database. You may not even need a separate job, as you can configure Oracle Streams to do asynchronous replication of the changes, through which you can trigger the file creation.
Why not create an extra table that holds two columns: the ID column and a processed-flag column. Have an insert trigger on the original table place its ID in this new table. Your logging process can then select records from this new table and mark them as processed. Finally, delete the processed records from this table.
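A rough sketch of that idea (all names are placeholders, and it assumes the original table has a numeric ID column):

CREATE TABLE my_table_queue (
  id        NUMBER PRIMARY KEY,
  processed CHAR(1) DEFAULT 'N'
);

CREATE OR REPLACE TRIGGER my_table_ai_trg
AFTER INSERT ON my_table
FOR EACH ROW
BEGIN
  INSERT INTO my_table_queue (id) VALUES (:NEW.id);
END;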
I'm pretty much in agreement with Adam's answer. But I'd want to do some serious testing compared to an alternative.
The issue I see is that you need to not only select the rows, but also do an update of those rows. While that should be pretty fast, I'd like to avoid the update. And avoid having any large transactions hanging around (see below).
The alternative would be to add CREATE_DATE DATE DEFAULT SYSDATE. Index that. Then select the records where create_date >= (the start date/time of your previous select).
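Something like this (a sketch; the table name is a placeholder and :last_run_time is the start time you saved from the previous run):

ALTER TABLE my_table ADD (create_date DATE DEFAULT SYSDATE NOT NULL);
CREATE INDEX my_table_create_date_idx ON my_table (create_date);

SELECT *
FROM   my_table
WHERE  create_date >= :last_run_time;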
But I don't have enough data on the relative costs of setting a sysdate as default vs. setting a value of Y, updating the function based vs. date index, and doing a range select on the date vs. a specific select on a single value for the Y. You'll probably want to preserve stats or hint the query to use the index on the Y/N column, and definitely want to use a hint on a date column -- the stats on the date column will almost certainly be old.
If data is also being added to the table continuously, including during the period when your query is running, you need to watch out for transaction control. After all, you don't want to read 100,000 records that have the flag = 'Y' and then do your update on 120,000, including the 20,000 that arrived while your query was running.
In the flag case, there are two easy ways: SET TRANSACTION before your select and commit after your update, or start by doing an update from Y to Q, then do your select for those that are Q, and then update to N. Oracle's read consistency is wonderful but needs to be handled with care.
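A minimal sketch of the second (Y/Q/N) variant, reusing the NEEDS_PROCESSED flag from the earlier answers:

UPDATE my_table SET needs_processed = 'Q' WHERE needs_processed = 'Y';
-- ... select and process the 'Q' rows ...
SELECT * FROM my_table WHERE needs_processed = 'Q';
UPDATE my_table SET needs_processed = 'N' WHERE needs_processed = 'Q';
COMMIT;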
For the date column version, if you don't mind a risk of processing a few rows more than once, just update your table that has the last processed date/time immediately before you do your select.
If there's not much information in the table, consider making it Index Organized.
What about using Materialized view logs? You have a lot of options to play with:
SQL> create table test (id_test number primary key, dummy varchar2(1000));
Table created
SQL> create materialized view log on test;
Materialized view log created
SQL> insert into test values (1, 'hello');
1 row inserted
SQL> insert into test values (2, 'bye');
1 row inserted
SQL> select * from mlog$_test;
ID_TEST SNAPTIME$$ DMLTYPE$$ OLD_NEW$$ CHANGE_VECTOR$$
---------- ----------- --------- --------- ---------------------
1 01/01/4000 I N FE
2 01/01/4000 I N FE
SQL> delete from mlog$_test where id_test in (1,2);
2 rows deleted
SQL> insert into test values (3, 'hello');
1 row inserted
SQL> insert into test values (4, 'bye');
1 row inserted
SQL> select * from mlog$_test;
ID_TEST SNAPTIME$$ DMLTYPE$$ OLD_NEW$$ CHANGE_VECTOR$$
---------- ----------- --------- --------- ---------------
3 01/01/4000 I N FE
4 01/01/4000 I N FE
I think this solution should work. Here is what you need to do.
For the first run, you will have to copy all records. On the first run, execute the following query:
insert into new_table(max_rowid) select max(rowid) from yourtable;
The next time you want to get only the newly inserted values, you can do it by executing the following command:
Select * from yourtable where rowid > (select max_rowid from new_table);
Once you are done processing the rows from the above query, simply truncate new_table and insert max(rowid) from yourtable again.
I think this should work and would be the fastest solution.
I have a table that is 5 GB, and I was trying to delete from it like below:
delete from tablename
where to_char(screatetime,'yyyy-mm-dd') <'2009-06-01'
But it has been running for a long time with no response. Meanwhile, I tried to check whether anybody is blocking it with the query below:
select l1.sid, ' IS BLOCKING ', l2.sid
from v$lock l1, v$lock l2
where l1.block =1 and l2.request > 0
and l1.id1=l2.id1
and l1.id2=l2.id2
But I didn't find any blocking either.
How can I delete this large amount of data without any problems?
5GB is not a useful measurement of table size. The total number of rows matters. The number of rows you are going to delete as a proportion of the total matters. The average length of the row matters.
If the proportion of the rows to be deleted is tiny it may be worth your while creating an index on screatetime which you will drop afterwards. This may mean your entire operation takes longer, but crucially, it will reduce the time it takes for you to delete the rows.
On the other hand, if you are deleting a large chunk of rows you might find it better to
Create a copy of the table using:
create table t1_copy as select * from t1
where screatetime >= to_date('2009-06-01','yyyy-mm-dd');
Swap the tables using the RENAME command.
Re-apply constraints and indexes to the new T1.
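The swap step might look something like this (a sketch; the old table is kept under a new name until you are sure everything is fine):

RENAME t1 TO t1_old;
RENAME t1_copy TO t1;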
Another thing to bear in mind is that deletions eat more UNDO than other transactions, because they take more information to rollback. So if your records are long and/or numerous then your DBA may need to check the UNDO tablespace (or rollback segs if you're still using them).
Finally, have you done any investigation to see where the time is actually going? DELETE statements are just another query, and they can be tackled using the normal panoply of tuning tricks.
Use a query condition to export necessary rows
Truncate table
Import rows
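One way to do those three steps entirely inside the database (a CTAS-based variation; the table name is from the question, the scratch table name is a placeholder):

CREATE TABLE tablename_keep AS
  SELECT * FROM tablename
  WHERE  screatetime >= to_date('2009-06-01','yyyy-mm-dd');

TRUNCATE TABLE tablename;

INSERT /*+ APPEND */ INTO tablename
  SELECT * FROM tablename_keep;
COMMIT;

DROP TABLE tablename_keep;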
If there is an index on screatetime your query may not be using it. Change your statement so that your where clause can use the index.
delete from tablename where screatetime < to_date('2009-06-01','yyyy-mm-dd')
It runs MUCH faster when you lock the table first. Also change the where clause, as suggested by Rene.
LOCK TABLE tablename IN EXCLUSIVE MODE;
DELETE FROM tablename
where screatetime < to_date('2009-06-01','yyyy-mm-dd');
EDIT: If the table cannot be locked because it is constantly accessed, you can use the salami tactic and delete those rows in batches:
BEGIN
  LOOP
    DELETE FROM tablename
    WHERE screatetime < to_date('2009-06-01','yyyy-mm-dd')
      AND ROWNUM <= 10000;
    EXIT WHEN SQL%ROWCOUNT = 0;
    COMMIT;
  END LOOP;
END;
Overall, this will be slower, but it won't burst your rollback segment and you can see the progress in another session (i.e. the number of rows in tablename goes down). And if you have to kill it for some reason, the rollback won't take forever and you haven't lost all the work done so far.