clickhouse merge one partition at a time

clickhouse merge one partition at a time - clickhouse

How do I force ClickHouse to only merge one partition at a time when I run optimize table **** final (without specifying partition 201304 and then 201305 and running it sequentilly) ?
I am using a CollapsingMergeTree. Its using a lot of RAM to do multiple merges together for many partitions and killing the service/machine.

The main problem of optimize final (table or partition does not matter) that it re-writes/re-merges a partition fully even if partition have only 1 part which is excessive in 99.9999% occasions!!!! It re-merges old data which was finally merged already!!!
It needed because sometimes one needs to collapse rows (duplicates) inserted with single insert into partition with forever single part. It is a very very rare necessity.
So I recommend to run optimize final against partitions which have more than one part. You can use something like this
select concat('optimize table ',database, '.','\`', table, '\` partition ',partition , ' final;')
from system.parts
where active and (engine like '%ReplacingMergeTree' or engine like '%CollapsingMergeTree')
group by database,table,partition
having count()>1
PS: If you use GraphiteMergeTree it's another story and there are more simple solutions.

Related

Data cleanup in Oracle DB is taking long time for 300 billion records

Problem statement:
There is address table in Oracle which is having relationship with multiple tables like subscriber, member etc.
Currently design is in such a way that when there is any change in associated tables, it increments record version throughout all tables.
So new record is added in address table even if same address is already present, resulting into large number of duplicate copies.
We need to identify and remove duplicate records, and update foreign keys in associated tables while making sure it doesn't impact the running application.
Tried solution:
We have written a script for cleanup logic, where unique hash is generated for every address. If calculated hash is already present then it means address is duplicate, where we merge into single address record and update foreign keys in associated tables.
But the problem is there are around 300 billion records in address table, so this cleanup process is taking lot of time, and it will take several days to complete.
We have tried to have index for hash column, but process is still taking time.
Also we have updated the insertion/query logic to use addresses as per new structure (using hash, and without version), in order to take care of incoming requests in production.
We are planning to do processing in chunks, but it will be very long an on-going activity.
Questions:
Would like to if any further improvement can be made in above approach
Will distributed processing will help here? (may be using Hadoop Spark/hive/MR etc.)
Is there any some sort of tool that can be used here?

Suggestion 1
Use built-in delete parallel
delete /*+ parallel(t 8) */ mytable t where ...
Suggestion 2
Use distributed processing (Hadoop Spark/hive) - watch out for potential contention on indexes or table blocks. It is recommended to have each process to work on a logical isolated subset, e.g.
process 1 - delete mytable t where id between 1000 and 1999
process 2 - delete mytable t where id between 2000 and 2999
...
Suggestion 3
If more than ~30% of the table need to be deleted - the fastest way would be to create an empty table, copy there all required rows, drop original table, rename new, create all indexes+constraints. Of course it requires downtime and it greatly depends on number of indexes - the more you have the longer it will take
P.S. There are no "magic" tools to do it. In the end they all run the same sql commands as you can.

It's possible use oracle merge instruction to insert data if you use clean sql.

Delete rows vs Delete Columns performance

I'm creating the datamodel for a timeseries application on Cassandra 2.1.3. We will be preserving X amount of data for each user of the system and I'm wondering what is the best approach to design for this requirement.
Option1:
Use a 'bucket' in the partition key, so data for X period goes into the same row. Something like this:
((id, bucket), timestamp) -> data
I can delete a single row at once at the expense of maintaining this bucket concept. It also limits the range I can query on timestamp, probably resulting in several queries.
Option2:
Store all the data in the same row. N deletes are per column.
(id, timestamp) -> data
Range queries are easy again. But what about performance after many column deletes?
Given that we plan to use TTL to let the data expire, which of the two models would deliver the best performance? Is the tombstone overhead of Option1 << Option2 or will there be a tombstone per column on both models anyway?
I'm trying to avoid to bury myself in the tombstone graveyard.

I think it will all depend on how much data you plan on having for the given partition key you end up choosing, what your TTL is and what queries you are making.
I typically lean towards option #1, especially if your TTL is the same for all writes. In addition if you are using LeveledCompactionStrategy or DataTieredCompactionStrategy, Cassandra will do a great job keeping data from the same partition in the same SSTable, which will greatly improve read performance.
If you use Option #2, data for the same partition could likely be spread across multiple levels (if using LCS) or just in general multiple sstables, which may cause you to read from a lot of SSTables, depending on the nature of your queries. There is also the issue of hotspotting, where you could overload particular cassandra nodes if you have a really wide partition.
The other benefit of #1 (which you allude to), is that you can easily delete the entire partition, which creates a single tombstone marker which is much cheaper. Also, if you are using the same TTL, data within that partition will expire pretty much at the same time.
I do agree that it is a bit of a pain to have to make multiple queries to read across multiple partitions as it pushes some complexity into the application end. You may also need to maintain a separate table to keep track of the buckets for the given id if they can not be determined implicitly.
As far as performance goes, do you see it as likely that you will need to read cross-partitions when your application makes queries? For example, if you have a query for 'the most recent 1000 records' and a partition typically is wider than that, you may only need to make 1 query for Option #1. However, if you want to have a query like 'give me all records', Option #2 may be better as otherwise you'll need to a make queries for each bucket.

After creating the tables you described above:
CREATE TABLE option1 (
... id bigint,
... bucket bigint,
... timestamp timestamp,
... data text,
... PRIMARY KEY ((id, bucket), timestamp)
... ) WITH default_time_to_live=10;
CREATE TABLE option2 (
... id bigint,
... timestamp timestamp,
... data text,
... PRIMARY KEY (id, timestamp)
... ) WITH default_time_to_live=10;
I inserted a test row:
INSERT INTO option1 (id,bucket,timestamp,data) VALUES (1,2015,'2015-03-16 11:24:00-0500','test1');
INSERT INTO option2 (id,timestamp,data) VALUES (1,'2015-03-16 11:24:00-0500','test2');
...waited 10 seconds, queried with tracing on, and I saw identical tombstone counts for each table. So I either way that shouldn't be too much of a concern for you.
The real issue, is that if you think you'll ever hit the limit of 2 billion columns per partition, then Option #1 is the safe one. If you have a lot of data Option #1 might perform better (because you'll be eliminating the need to look at partitions that don't match your bucket), but really either one should be fine in that respect.
tl;dr;
As the issues of performance and tombstones are going to be similar no matter which option you choose, I'm thinking that Option #2 is the better one, just due to ease of querying.

Is there a way to make selecting query faster?

I want to select multiple rows from multiple tables, one of them having billions of rows. It sometimes take 20 seconds and there are over thousands of users using it so it is pretty bad.
I looked into COLUMNSTORE and tried it in my local machine and the performance is x50 faster than usual! (note that I was clearing the cache to see the difference)
However, the downside is I can't update, insert and delete rows, which is being constantly done for that table with the billion rows.
Is there a way to optimize it? (Besides the (NOLOCK) dirty read, which security is not an issue btw)
There are already indexes in that table, but doesn't help.
Is there a way to perform BATCH EXECUTION (I see it does row execution)? Or any optimization advice?
Using Microsoft SQL Server 2012

When you get to the scale of billions of rows, you often need to take different approaches for handling the data. Separating the content into multiple databases and storing on different machines might be more effective, however the design is considerably more complex.
An alternative is to consider using a combination of partitioned tables with a column-based index. That way at least, you can stage the updated data for the partition and then swap the updated one for the existing one to perform updates. See: http://technet.microsoft.com/en-us/library/gg492088.aspx#Update
An alternative is to consider using three tables: one that is static -- and is perhaps using column-based storage -- the other one dynamic, holding only recent updates and inserts, and the third holding just a list of deleted rows identified by the primary key. You then have to use a view to reconcile the content for queries.

ETL for processing history records

I am in sort of a DWH project (not quite, but still). And there is this issue we constantly run into which I was wondering if there would be a better solution. Follows
We receive some big files with records containing the all states a user have been into, like:
UID | State | Date
1 | Active | 20120518
2 | Inactive | 20120517
1 | Inactive | 20120517
...
And we are usually inly interested in the latest state of each user. So far so good, with just a little sorting and we could get the way we want it. Only problem is, these files are usually big.. like 20-60gb, sorting these guys sometimes is a pain since the logic for sorting isn't usually so straight forward.
What we do generally is load everything into our Oracle and use intermediary tables and materialized views to have it done. Still, sometimes performance bites us.
20-60gb might be big, but not that big. I mean, should be a somewhat more specialised way to deal with these records, shouldn't it?
I imagine two basic ways of seeing tackling the issue:
1) Programming outside the DBMS, scripts and compiled things. But maybe this is not very flexible unless some bigger amount of time is invested developing something. Also, I might have to busy myself administrating the box resources, whereas I wish not to worry with that.
2) Load everything into the DBMS (Oracle in my case) and use whatever tools it provide to sort and clip the data. This would be my case, though, I am not sure we are using all the tools or simply doing it the right way that would be for Oracle 10g.
Question is then:
You have a 60gb file with millions of historical records like the one above and your user want a table in DB with the last state for each user.
how would you guys do?
thanks!

There are two things you can do to speed up the process.
The first thing is to throw compute power at it. If you have Enterprise Edition and lots of cores you will get significant reductions in load time with parallel query.
The other thing is to avoid loading the records you don't want. This is why you mention pre-processing the file. I'm not sure there's much you can do there, unless you have access to a Hadoop cluster to run some map-reduce jobs on your file (well, reduce mainly, the structure you post is about as mapped as can be already).
But there is an alternative: external tables. External tables are tables which have their data in OS files rather then tablespaces. And they can be parallel enabled (providing your file meet certain criteria). Find out more.
So, you might have an external table like this
CREATE TABLE user_status_external (
uid NUMBER(6),
status VARCHAR2(10),
sdate DATE
ORGANIZATION EXTERNAL
(TYPE oracle_loader
DEFAULT DIRECTORY data_dir
ACCESS PARAMETERS
(
RECORDS DELIMITED BY newline
BADFILE 'usrsts.bad'
DISCARDFILE 'usrsts.dis'
LOGFILE 'usrsts.log'
FIELDS TERMINATED BY "," OPTIONALLY ENCLOSED BY '"'
(
uid INTEGER EXTERNAL(6),
status CHAR(10),
sdate date 'yyyymmdd' )
)
LOCATION ('usrsts.dmp')
)
PARALLEL
REJECT LIMIT UNLIMITED;
Note that you need read and write permissions on the DATA_DIR directory object.
Having created the external table you can load the only desired data into your target table with an insert statement:
insert into user_status (uid, status, last_status_date)
select sq.uid
, sq.status
, sq.sdate
from (
select /*+ parallel (et,4) */
et.uid
, et.status
, et.sdate
, row_number() over (partition by et.uid order by et.sdate desc) rn
from user_status_external et
) sq
where sq.rn = 1
Note that as with all performance advice, there are no guarantees. You need to benchmark things in your environment.
Another thing is the use of INSERT: I'm assuming these are all fresh USERIDs, as that is the scenario your post suggests. If you have a more complicated scenario then you probably want to look at MERGE or a different approach altogether.
One last thing: you seem to be assuming this is a common situation, which has some standard approaches. But most data warehouses load all the data they get. They may then filter it for various different uses, data marts, etc. But they almost always maintain a history in the actual warehouse of all the distinct records. So that's why you might not get an industry standard solution.

I'd go with something along the lines of what APC said as a first go. However, I think parallel tables can only load data in parallel if the data is in multiple files, so you might have to cut the files into several. How are the files generated? A 20 - 60GB file is a real pain to deal with - can you get the generation of the files changed so you get X 2GB files for example?
After getting all the records into the database, you might run into problems attempting to sort 60GB of data - it would be worth having a look at the sort stage of the query you are using to extract the latest status. In the past I helped large sorts by hash partitioning the data on one of the fields to be sorted, in this case user_id. Then Oracle only has to do X smaller sorts, each of which can proceed in parallel.
So, my thoughts would be:
Try and get many smaller files generated instead of 1 big one
Using External tables, see if it is feasible to extract the data you want directly from the external tables
If not, load the entire contents of the files into a hash partition table - at this stage make sure you do insert /*+ append nologging */ to avoid undo generation and redo generation. If your database has force_logging set to true, the nologging hint will have no effect.
Run the select on the staged data to extract only the rows you care about and then trash the staged data.
The nologging option is probably critical to you getting good performance - to load 60GB of data, you are going to generate at least 60GB of redo logs, so if that can be avoided, all the better. You would probably need to have a chat with your DBA about that!
Assuming you have lots of CPU available, it may also make sense to compress the data as you bulk load it into the staging table. Compression may well half the size of your data on disk if it has repeating fields - the disk IO saved when writing it usually more than beats any extra CPU consumed when loading it.

I may be oversimplifying the problem, but why not something like:
create materialized view my_view
tablespace my_tablespace
nologging
build immediate
refresh complete on demand
with primary key
as
select uid,state,date from
(
select /*+ parallel (t,4) */ uid, state, date, row_number() over (partition by uid order by date desc) rnum
from my_table t;
)
where rnum = 1;
Then refresh fully when you need to.
Edit: Any don't forget to rebuild stats and probably throw a unique index on uid.

I would write a program to iterate over each record and retain only those which are more recent than record previously seen. At the end, insert the data into the database.
How practical that is would depend on how many users we're talking about - you could end up having to think carefully about your intermediate storage.
In general, this becomes (in pseudo-code):
foreach row in file
if savedrow is null
save row
else
if row is more desirable than savedrow
save row
end
end
end
send saved rows to database
The point it, you need to define how one row is considered to be more desirable than another. In the simple case, for a given user, the current row's date is later than the last row we saved. At the end, you'd have a list of rows, one-per-user, each of which has the most recent date you saw.
You could general the script or program so that the framework is separate from the code that understands each data file.
It'll still take a while, mind :-)

overcoming 'log file sync' by design?

Advice/suggestions needed for a bit of application design.
I have an application which uses 2 tables, one is a staging table, which many separate processes write to, once a 'group' of processes has finished, another job comes along a aggregates the results together into a final table, then deletes that 'group' from the staging table.
The problem that I'm having is that when the staging table is being cleared out, lots of redo is generated and I'm seeing a lot of 'log file sync' waits in the database. This is a shared database with many other applications and this is causing some issues.
When applying the aggregate, the rows are reduced to about 1 row in the final table for every 20 rows in the staging table.
I'm thinking of getting around this by rather than having a single 'staging' table, I will create a table for each 'group'. Once done, this table can just be dropped, which should result in much less redo.
I only have SE, so partitioned tables isn't an option. Also faster disks for the redo probably isn't an option in the short term either.
Is this a bad idea? Any better solutions to be offered?
Thanks.

Would it be possible to solve the problem by having your process do a logical delete (i.e. set a DELETE_FLAG column in the table to 'Y') and then having a nightly process that truncates the table (potentially writing any non-deleted rows to a separate table before the truncate and then copy them back after the table is truncated)?
Are you certain that the source of the log file sync waits is that your disks can't keep up with the I/O? That's certainly possible, of course, but there are other possible causes of excessive log file sync waits including excessive commits. There is an excellent article on tuning log file sync events on the Pythian blog.

The most common cause of excessive log file syncs is too frequent commits, which are often deliberately coded in a mistaken attempt to reduce system load due to locking. You should commit only when your business transaction is complete.

Loading each group into a separate table sounds like a fine plan to reduce redo. You can truncate individual group table following each aggregation.
Another (but I think probably worse) option is to create a new staging table with the groups that haven't been aggregated then drop the original and rename the new table to replace the staging table.

I prefer Justin's suggestion ("logical delete"), but another option to consider might be a partitioned table, if you have the EE licence. The aggregation process could drop a partition instead of deleting the rows.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio