Multiple small inserts in ClickHouse

I have an event table (MergeTree) in ClickHouse and want to run a lot of small inserts at the same time. However, the server becomes overloaded and unresponsive. Moreover, some of the inserts are lost. There are a lot of records like this in the ClickHouse error log:
01:43:01.668 [ 16 ] <Error> events (Merger): Part 20161109_20161109_240760_266738_51 intersects previous part
Is there a way to optimize such inserts? I know I can use bulk inserts for some types of events - basically, running one insert with many records, which ClickHouse handles pretty well. However, some of the events, such as clicks or opens, cannot be handled in this way.
The other question: why does ClickHouse decide that similar records already exist when they don't? There are similar records at the time of insert, which have the same index fields, but their other fields are different.
From time to time I also receive the following error:
Caused by: ru.yandex.clickhouse.except.ClickHouseUnknownException: ClickHouse exception, message: Connect to localhost:8123 [ip6-localhost/0:0:0:0:0:0:0:1] timed out, host: localhost, port: 8123; Connect to ip6-localhost:8123 [ip6-localhost/0:0:0:0:0:0:0:1] timed out
... 36 more
Mostly during the project build, when tests against the ClickHouse database are run.

ClickHouse has a special table engine for this: Buffer. It is stored in memory and allows many small inserts without problems. We have nearly 200 different inserts per second and it works fine.
Buffer table:
CREATE TABLE logs.log_buffer (rid String, created DateTime, some String, d Date MATERIALIZED toDate(created))
ENGINE = Buffer('logs', 'log_main', 16, 5, 30, 1000, 10000, 1000000, 10000000);
Main table:
CREATE TABLE logs.log_main (rid String, created DateTime, some String, d Date)
ENGINE = MergeTree(d, sipHash128(rid), (created, sipHash128(rid)), 8192);
Details in manual: https://clickhouse.yandex/docs/en/operations/table_engines/buffer/
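For example (a minimal sketch against the tables above), clients write to and read from the buffer table; reads return rows still in memory together with rows already flushed to log_main:
INSERT INTO logs.log_buffer (rid, created, some) VALUES ('r1', now(), 'payload');

SELECT count()
FROM logs.log_buffer
WHERE created >= now() - 3600;
-- the SELECT combines in-memory rows with rows already flushed to logs.log_main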

This is a known issue when processing a large number of small inserts into a (non-replicated) MergeTree.
This is a bug, we need to investigate and fix.
As a workaround, you should send inserts in larger batches, as recommended: about one batch per second: https://clickhouse.tech/docs/en/introduction/performance/#performance-when-inserting-data.
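For example (a hypothetical table and columns), a single INSERT carrying many rows, sent roughly once per second, replaces hundreds of single-row inserts:
INSERT INTO events (event_date, event_type, user_id) VALUES
    ('2016-11-09', 'click', 101),
    ('2016-11-09', 'open',  102),
    ('2016-11-09', 'click', 103);
-- accumulate rows on the client and send one such statement about once per second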

I've had a similar problem, although not as bad - making ~20 inserts per second caused the server to reach a high loadavg, memory consumption and CPU use. I created a Buffer table which buffers the inserts in memory, and then they are flushed periodically to the "real" on-disk table. And just like magic, everything went quiet: loadavg, memory and CPU usage came down to normal levels. The nice thing is that you can run queries against the buffer table and get back matching rows from both memory and disk - so clients are unaffected by the buffering. See https://clickhouse.tech/docs/en/engines/table-engines/special/buffer/

Alternatively, you can use something like https://github.com/nikepan/clickhouse-bulk: it will buffer multiple inserts and flush them all together according to user policy.

The design of the ClickHouse MergeTree engines is not meant to take small writes concurrently. As far as I understand, MergeTree merges the parts of data written to a table based on partitions and then reorganizes the parts for better aggregated reads. If we do small writes often, you would encounter another exception:
Error: 500: Code: 252, e.displayText() = DB::Exception: Too many parts (300). Merges are processing significantly slow
When you try to understand why the above exception is thrown, the idea becomes a lot clearer: CH needs to merge data, and there is an upper limit on how many parts can exist! Every write in a batch is added as a new part and then eventually merged with the partitioned table.
SELECT `table`, count() AS cnt
FROM system.parts
WHERE database = 'dbname'
GROUP BY `table`
ORDER BY cnt DESC
The above query can help you monitor parts: observe, while writing, how the parts increase and eventually merge down.
My best bet for the above would be buffering the data set and periodically flushing it to DB, but then that means no real-time analytics.
Using a Buffer table is good; however, please consider these points:
If the server is restarted abnormally, the data in the buffer is lost.
FINAL and SAMPLE do not work correctly for Buffer tables. These conditions are passed to the destination table, but are not used for processing data in the buffer
When adding data to a Buffer, one of the buffers is locked. (So no reads)
If the destination table is replicated, some expected characteristics of replicated tables are lost when writing to a Buffer table. (no deduplication)
Please read it thoroughly; it's a special-case engine: https://clickhouse.tech/docs/en/engines/table-engines/special/buffer/
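For reference, a sketch annotating the Buffer engine parameters used above (the AS clause simply copies the destination table's structure; the thresholds are the same as in the earlier example):
CREATE TABLE logs.log_buffer AS logs.log_main
ENGINE = Buffer('logs', 'log_main',
    16,                  -- num_layers: independent in-memory buffers
    5, 30,               -- min_time, max_time in seconds
    1000, 10000,         -- min_rows, max_rows
    1000000, 10000000);  -- min_bytes, max_bytes
-- a buffer is flushed when all of the min_* thresholds are met, or when any max_* threshold is reached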

Related

Is Clickhouse Buffer Table appropriate for realtime ingestion of many small inserts?

I am writing an application that plots financial data and interacts with a realtime feed of such data. Due to the nature of the task, live market data may be received very frequently in one-trade-at-a-time fashion. I am using the database locally and I am the only user. There is only one program (my middleware) that will be inserting data to the db. My primary concern is latency - I want to minimize it as much as possible. For that reason, I would like to avoid having a queue (in a sense, I want the Buffer Table to fulfill that role). A lot of the analytics Clickhouse calculates for me are expected to be realtime (as much as possible) as well. I have three questions:
Clarify some limitations/caveats from the Buffer Table documentation
Clarify how querying works (regular queries + materialized views)
What happens when I query the db when data is being flushed
Question 1) Clarify some limitations/caveats from the Buffer Table documentation
Based on Clickhouse documentation, I understand that many small INSERTs are sub-optimal to say the least. While researching the topic I found that the Buffer Engine [1] could be used as a solution. It made sense to me, however when I read Buffer's documentation I found some caveats:
Note that it does not make sense to insert data one row at a time, even for Buffer tables. This will only produce a speed of a few thousand rows per second, while inserting larger blocks of data can produce over a million rows per second (see the section “Performance”).
A few thousand rows per second is absolutely fine for me, however I am concerned about other performance considerations - if I do commit data to the buffer table one row at a time, should I expect spikes in CPU/memory? If I understand correctly, committing one row at a time to a MergeTree table would cause a lot of additional work for the merging job, but it should not be a problem if Buffer Table is used, correct?
If the server is restarted abnormally, the data in the buffer is lost.
I understand that this refers to things like power outage or computer crashing. If I shutdown the computer normally or stop the clickhouse server normally, can I expect the buffer to flush data to the target table?
Question 2) Clarify how querying works (regular queries + materialized views)
When reading from a Buffer table, data is processed both from the buffer and from the destination table (if there is one).
Note that the Buffer table does not support an index. In other words, data in the buffer is fully scanned, which might be slow for large buffers. (For data in a subordinate table, the index that it supports will be used.)
Does that mean I can use queries against the target table and expect Buffer Table data to be included automatically? Or is it the other way around - I query the buffer table and the target table is included in the background? If either is true (and I don't need to aggregate both tables manually), does that also mean Materialized Views would be populated? Which table should trigger the materialized view - the on-disk table or the buffer table? Or both, in some way?
I rely on Materialized Views a lot and need them updated in realtime (or as close as possible). What would be the best strategy to accomplish that goal?
Question 3) What happens when I query the db when data is being flushed?
My two main concerns here are with regards to:
Running a query at the exact time flushing occurs - is there a risk of duplicated records or omitted records?
At which point are Materialized Views of the target table populated (I suppose it depends on whether it's the target table or the buffer table that triggers the MV)? Is flushing the buffer important in how I structure the MV?
Thank you for your time.
[1] https://clickhouse.tech/docs/en/engines/table-engines/special/buffer/
A few thousand rows per second is absolutely fine for me, however I am concerned about other performance considerations - if I do commit data to the buffer table one row at a time, should I expect spikes in CPU/memory?
No, the Buffer table engine doesn't produce CPU/memory spikes.
If I understand correctly, committing one row at a time to a MergeTree table would cause a lot of additional work for the merging job, but it should not be a problem if Buffer Table is used, correct?
The Buffer table engine works as a memory buffer that periodically flushes batches of rows to the underlying *MergeTree table; the parameters of the Buffer table set the size and frequency of the flushes.
If I shutdown the computer normally or stop the clickhouse server normally, can I expect the buffer to flush data to the target table?
Yes, when the server stops normally, Buffer tables will flush their data.
I query the buffer table and the target table is included in the background?
Yes, this is the right behavior: when you SELECT from the Buffer table, the SELECT is also passed to the underlying *MergeTree table, and data that has already been flushed is read from the *MergeTree table.
does that also mean Materialized Views would be populated?
It is not clear: do you CREATE MATERIALIZED VIEW as a trigger FROM the *MergeTree table or as a trigger FROM the Buffer table, and which table engine do you use for the TO table clause?
I would suggest creating the MATERIALIZED VIEW as a trigger FROM the underlying MergeTree table.
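A minimal sketch of that layout (all names and types are hypothetical, and the aggregation is just an example): clients insert into the Buffer table, and the materialized view is attached to the underlying MergeTree table, so it fires whenever the buffer flushes:
CREATE TABLE db.events (created DateTime, user_id UInt64, value Float64)
ENGINE = MergeTree ORDER BY created;

CREATE TABLE db.events_buffer AS db.events
ENGINE = Buffer('db', 'events', 16, 5, 30, 1000, 10000, 1000000, 10000000);

CREATE MATERIALIZED VIEW db.events_per_minute
ENGINE = SummingMergeTree() ORDER BY minute AS
SELECT toStartOfMinute(created) AS minute, count() AS cnt, sum(value) AS total
FROM db.events  -- trigger FROM the MergeTree table, not from the Buffer table
GROUP BY minute;
Clients insert into db.events_buffer; the view is populated when the buffer flushes into db.events, so its freshness is bounded by the buffer's max_time setting.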

How does BigQuery caching on time partitioned tables work?

In contrast with the BigQuery documentation, we see that it DOES cache the results when selecting data from a streaming, date-partitioned table (Standard SQL).
Example:
When we perform a deterministic date scan on the streaming, date-partitioned table using:
where (_PARTITIONTIME > '2017-11-12' or _PARTITIONTIME is null)
...BigQuery caches the data for 5 to 20 minutes if we fire the same exact query within that time frame.
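The full test query looked roughly like this (the table name is made up; the partition filter is the one shown above):
SELECT COUNT(*) AS heartbeats
FROM `project.dataset.heartbeat_events`
WHERE (_PARTITIONTIME > '2017-11-12' OR _PARTITIONTIME IS NULL)
-- the IS NULL branch also picks up rows that are still in the streaming buffer and have no partition assigned yet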
While my interpretation of the documentation is that it SHOULD NOT cache the data:
'When any of the tables referenced by the query have recently received streaming inserts (a streaming buffer is attached to the table) even if no new rows have arrived'
Important notes:
Our test query queries heartbeat events that really arrive at us continuously
We actually want this caching behavior, because we do not always need the data to be current to the last second. We just want to know if we really can depend on this behavior.
Our Questions:
What is going on here / Why does the BQ caching happen at all?
The time this data stays in the BQ cache is 'random' (between 5-20 minutes). What does this mean?
Thanks for clarifying the question. I think it's an oversight that we didn't disable caching for partitioned tables with streaming data. It should be disabled, as otherwise the query might return outdated results.
We invalidate the cache when the table is changed. Streaming into the table will cause the table to be changed. I guess that's why the cache is invalidated after somewhere between 5 and 20 minutes.

What will happen when inserting a row during a long running query

I am writing some data loading code that pulls data from a large, slow table in an oracle database. I have read-only access to the data, and do not have the ability to change indexes or affect the speed of the query in any way.
My select statement takes 5 minutes to execute and returns around 300,000 rows. The system is inserting large batches of new records constantly, and I need to make sure I get every last one, so I need to save a timestamp for the last time I downloaded the data.
My question is: If my select statement is running for 5 minutes, and new rows get inserted while the select is running, will I receive the new rows or not in the query result?
My gut tells me that the answer is 'no', especially since a large portion of those 5 minutes is just the time spent on the data transfer from the database to the local environment, but I can't find any direct documentation on the scenario.
"If my select statement is running for 5 minutes, and new rows get inserted while the select is running, will I receive the new rows or not in the query result?"
No. Oracle enforces strict isolation levels and does not permit dirty reads.
The default isolation level is Read Committed. This means the result set you get after five minutes will be identical to the one you would have got if Oracle could have delivered you all the records in 0.0000001 seconds. Anything committed after your query started running will not be included in the results. That includes updates to the records as well as inserts.
Oracle does this by tracking changes to the table in the UNDO tablespace. Provided it can reconstruct the original image from that data, your query will run to completion; if for any reason the undo information is overwritten, your query will fail with the dreaded ORA-1555: Snapshot too old. That's right: Oracle would rather hurl an exception than provide us with an inconsistent result set.
Note that this consistency applies at the statement level. If we run the same query twice within the one transaction we may see two different result sets. If that is a problem (I think not in your case) we need to switch from Read Committed to Serializable isolation.
The Concepts Manual covers Concurrency and Consistency in great depth. Find out more.
So to answer your question, take the timestamp from the time you start the select. Specifically, take the max(created_ts) from the table before you kick off the query. This should protect you from the gap Alex mentions (if records are not committed the moment they are inserted, there is the potential to lose records if you base the select on comparing with the system timestamp). Although doing this means you're issuing two queries in the same transaction, which means you do need Serializable isolation after all!
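A sketch of that pattern (table and column names are hypothetical; the bind variable stands in for the value stored by the previous run):
-- start a serializable transaction so both statements see the same snapshot
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

-- high-water mark to store for the next run, taken before the main extract
SELECT MAX(created_ts) AS next_high_water_mark FROM event_feed;

-- the extract itself, bounded by the high-water mark stored by the previous run
SELECT * FROM event_feed WHERE created_ts > :previous_high_water_mark;

COMMIT;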

Delphi: ClientDataSet is not working with big tables in Oracle

We have a TDBGrid that is connected to a TClientDataSet via a TDataSetProvider in Delphi 7 with an Oracle database.
It works fine for showing the content of small tables, but the program hangs when you try to open a table with many rows (for example 2 million rows), because TClientDataSet tries to load the whole table into memory.
I tried to set "FetchOnDemand" to True for our TClientDataSet and "poFetchDetailsOnDemand" to True in Options for the TDataSetProvider, but it does not help to solve the problem. Any ideas?
Update:
My solution is:
TClientDataSet.FetchOnDemand = True
TDataSetProvider.Options.poFetchDetailsOnDemand = True
TClientDataSet.PacketRecords = 500
I succeeded in solving the problem by setting the "PacketRecords" property for TCustomClientDataSet. This property indicates the number or type of records in a single data packet. PacketRecords is automatically set to -1, meaning that a single packet should contain all records in the dataset, but I changed it to 500 rows.
When working with an RDBMS, and especially with large datasets, trying to access a whole table is exactly what you shouldn't do. That's a typical newbie mistake, or a borrowing from old file-based small database engines.
When working with an RDBMS, you should load only the rows you're interested in, display/modify/update/insert them, and send the changes back to the database. That means a SELECT with a proper WHERE clause and also an ORDER BY - remember row ordering is never assured when you issue a SELECT without an ORDER BY; a database engine is free to retrieve rows in the order it sees fit for a given query.
If you have to perform bulk changes, you need to do them in SQL and have them processed on the server, not load a whole table client side, modify it, and send changes row by row to the database.
Loading large datasets client side may fail for several reasons: lack of memory (especially in 32-bit applications), memory fragmentation, and so on. You will probably flood the network with data you don't need, force the database to perform a full scan, and maybe flood the database cache as well.
That's why client datasets are not designed to handle millions or billions of rows. They are designed to cache the rows you need client side, and then apply changes back to the remote data. You need to change your application logic.
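For example (a hypothetical Oracle query; table and column names are made up), fetch only a bounded, ordered slice instead of opening the whole table:
SELECT *
FROM (
    SELECT order_id, customer_id, created_at, amount
    FROM   orders
    WHERE  created_at >= DATE '2017-01-01'
    ORDER  BY created_at, order_id
)
WHERE ROWNUM <= 500;  -- classic Oracle pagination, which works on the Oracle versions Delphi 7 typically targets
The TClientDataSet then only ever caches that slice, and PacketRecords controls how many of those rows arrive per data packet.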

How does oracle db writer decide whether or not to do multiblock / sequential writes

We have a test system which matches our production system like for like. 6 months ago we did some testing on new hardware, and found the performance limit of our system.
However, now we are re-doing the testing with a view to adding further hardware, and we have found the system doesn't perform as it used to.
The reason for this is that on one specific volume we are now doing random I/O which used to be sequential. Further to this, it has turned out that the activity on this volume by Oracle, which is 100% writes, is actually in 8k blocks, whereas before it was up to 128k.
So something has caused the Oracle DB writer to stop batching up its writes.
We've extensively checked our config and cannot see any difference between our test and production systems. We've also opened a call with Oracle, but at this stage information is slow in coming.
So ultimately these are two related questions:
Can you rely on Oracle multiblock writes? Is that a safe thing to engineer/tune your system for?
Why would Oracle change its behaviour?
We're not at this stage necessarily blaming Oracle - it may well be reacting to something in the environment - but what?
The OS/arch is solaris/sparc.
Oh; I forgot to mention, the insert table has no indexes, and only a couple of foreign keys - it's designed as a bucket for as fast an insert as possible. It's also partitioned on the key field.
Thanks for any tips!
More description of the workload would allow some hypotheses.
If you are updating random blocks, then the DBWR process(es) are going to have little choice but to do single-block writes. Indexes especially are likely to have writes all over the place. If you have an index of character values and need to insert a new 'M' record where there isn't room, it will get a new block for the index and split the current block. You'll have some of those 'M' records in the original block, and some in the new block (which will be the last [used] block in the last extent).
I suspect you are most likely to get multi-block writes when bulk inserting into tables, as new blocks will be allocated and written to. Potentially, you initially had (say) 1 GB of extents allocated and were writing into that space. Now you might have reached the limit of that and be creating new extents (say 50 MB), which it may be getting from scattered file locations (e.g. space freed by other tables that have been dropped).
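One way to test that hypothesis (hypothetical owner and segment names; requires access to the DBA views) is to watch how extents accumulate for the insert table while the load runs:
SELECT segment_name,
       COUNT(*)                 AS extent_count,
       SUM(bytes) / 1024 / 1024 AS size_mb
FROM   dba_extents
WHERE  owner = 'APP_OWNER'
  AND  segment_name = 'FAST_INSERT_BUCKET'
GROUP  BY segment_name;
If the extent count climbs quickly while write sizes shrink, the new space is probably being allocated in small, scattered pieces as described above.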
