BigQuery streaming insert with deduplication only writes one row - go

I'm using the Go BigQuery library and the ValueSaver interface to perform a streaming insert into a table. I'm using a single field as the insert ID to deduplicate, since it's possible that two processes will try to write the same data (Cloud Run with 3 instances triggered by a Pub/Sub subscription, where a second instance might pick up the message and start processing before the first can write).
When the job(s) finish, I can see the expected rows in the "Preview" tab of the BigQuery web UI for the table. However, under "Streaming buffer statistics" I can see there is still buffered data, with "Estimated rows" at about what I expect, but "Number of rows" shows 0, and if I query, I only see the first row. All rows in "Preview" have a unique value for the field I'm using as the InsertId. Eventually (after about 90 minutes, as indicated by other questions here), the first row is written to the table and the buffer goes away, taking the rest of the expected data with it. If I use bigquery.NoDedupeID I get "instant" writes, but duplicate data.
This leads me to two questions, though it's possible I'm missing the point entirely:
Am I misunderstanding how the "InsertId" is being used to dedupe data?
How do I "close" the insert buffer faster?

Related

Is Clickhouse Buffer Table appropriate for realtime ingestion of many small inserts?

I am writing an application that plots financial data and interacts with a realtime feed of such data. Due to the nature of the task, live market data may be received very frequently in one-trade-at-a-time fashion. I am using the database locally and I am the only user. There is only one program (my middleware) that will be inserting data to the db. My primary concern is latency - I want to minimize it as much as possible. For that reason, I would like to avoid having a queue (in a sense, I want the Buffer Table to fulfill that role). A lot of the analytics Clickhouse calculates for me are expected to be realtime (as much as possible) as well. I have three questions:
Clarify some limitations/caveats from the Buffer Table documentation
Clarify how querying works (regular queries + materialized views)
What happens when I query the db when data is being flushed
Question 1) Clarify some limitations/caveats from the Buffer Table documentation
Based on Clickhouse documentation, I understand that many small INSERTs are sub-optimal to say the least. While researching the topic I found that the Buffer Engine [1] could be used as a solution. It made sense to me, however when I read Buffer's documentation I found some caveats:
Note that it does not make sense to insert data one row at a time, even for Buffer tables. This will only produce a speed of a few thousand rows per second, while inserting larger blocks of data can produce over a million rows per second (see the section “Performance”).
A few thousand rows per second is absolutely fine for me, however I am concerned about other performance considerations - if I do commit data to the buffer table one row at a time, should I expect spikes in CPU/memory? If I understand correctly, committing one row at a time to a MergeTree table would cause a lot of additional work for the merging job, but it should not be a problem if Buffer Table is used, correct?
If the server is restarted abnormally, the data in the buffer is lost.
I understand that this refers to things like a power outage or the computer crashing. If I shut down the computer normally or stop the ClickHouse server normally, can I expect the buffer to flush its data to the target table?
Question 2) Clarify how querying works (regular queries + materialized views)
When reading from a Buffer table, data is processed both from the buffer and from the destination table (if there is one).
Note that the Buffer tables does not support an index. In other words, data in the buffer is fully scanned, which might be slow for large buffers. (For data in a subordinate table, the index that it supports will be used.)
Does that mean I can use queries against the target table and expect Buffer Table data to be included automatically? Or is it the other way around - I query the buffer table and the target table is included in the background? If either is true (and I don't need to aggregate both tables manually), does that also mean Materialized Views would be populated? Which table should trigger the materialized view - the on-disk table or the buffer table? Or both, in some way?
I rely on Materialized Views a lot and need them updated in realtime (or as close as possible). What would be the best strategy to accomplish that goal?
Question 3) What happens when I query the db when data is being flushed?
My two main concerns here are with regards to:
Running a query at the exact time flushing occurs - is there a risk of duplicated records or omitted records?
At which point are Materialized Views of the target table populated (I suppose it depends on whether it's the target table or the buffer table that triggers the MV)? Is flushing the buffer important in how I structure the MV?
Thank you for your time.
[1] https://clickhouse.tech/docs/en/engines/table-engines/special/buffer/
A few thousand rows per second is absolutely fine for me, however I am concerned about other performance considerations - if I do commit data to the buffer table one row at a time, should I expect spikes in CPU/memory?
No, the Buffer table engine does not produce CPU/memory spikes.
If I understand correctly, committing one row at a time to a MergeTree table would cause a lot of additional work for the merging job, but it should not be a problem if a Buffer table is used, correct?
The Buffer table engine works as an in-memory buffer that periodically flushes batches of rows to the underlying *MergeTree table; the Buffer table's parameters control the size and frequency of those flushes.
If I shutdown the computer normally or stop the clickhouse server normally, can I expect the buffer to flush data to the target table?
Yes, when the server stops normally, Buffer tables flush their data.
I query the buffer table and the target table is included in the background?
Yes, that is the correct behavior: when you SELECT from a Buffer table, the SELECT is also passed to the underlying *MergeTree table, and data that has already been flushed is read from the *MergeTree table.
does that also mean Materialized Views would be populated?
That is not clear from the question: do you CREATE MATERIALIZED VIEW as a trigger FROM the *MergeTree table or as a trigger FROM the Buffer table, and which table engine do you use in the TO clause?
I would suggest creating the MATERIALIZED VIEW as a trigger FROM the underlying MergeTree table.

NiFi Create Indexes after Inserting Records into table

I've got my first Process Group that drops the indexes on a table.
That then routes to another Process Group that does inserts into the table.
After successfully inserting the half million rows, I want to create the indexes on the table and analyze it. This is typical Data Warehouse methodology. Can anyone please give advice on how to do this?
I've tried setting counters, but cannot reference counters in Expression Language. I've tried RouteOnAttribute but am getting nowhere. Now I'm digging into the Wait & Notify processors - maybe there's a solution there?
I have gotten Counters to count the flow file SQL insert statements, but cannot reference the Counter values via Expression Language. I.e. this always returns null: "${InsertCounter}", where InsertCounter appears to be set properly via my UpdateCounter processor in my flow.
So maybe this code can be used?
In the wait processor set the Target Signal Count to ${fragment.count}.
Set the Release Signal Identifier in both the notify and wait processor to ${fragment.identifier}
nothing works
You can use Wait/Notify processors to do that.
I assume you're using ExecuteSQL, SplitAvro? If so, the flow will look like:
Split approach
At the 2nd ProcessGroup
ExecuteSQL: e.g. 1 output FlowFile containing 5,000 records
SplitAvro: creates 5,000 FlowFiles; this processor adds the fragment.identifier and fragment.count (= 5,000) attributes.
split:
XXXX: Do some conversion per record
PutSQL: Insert records individually
Notify: Increase count for the fragment.identifier (Release Signal Identifier) by 1. Executed 5,000 times.
original - to the next ProcessGroup
At the 3rd ProcessGroup
Wait: waiting for fragment.identifier (Release Signal Identifier) to reach fragment.count (Target Signal Count). This route processes the original FlowFile, so executed only once.
PutSQL: Execute a query to create indices and analyze tables
Alternatively, if possible, using Record aware processors would make the flow simpler and more efficient.
Record approach
ExecuteSQL: e.g. 1 output FlowFile containing 5,000 records
Perform record level conversion: With UpdateRecord or LookupRecord, you can do data processing without splitting records into multiple FlowFiles.
PutSQL: Execute a query to create indices and analyze tables. Since the single FlowFile contains all records, no Wait/Notify is required; the output FlowFile can be connected to the downstream flow.
I think my suggestion to this question will fit your scenario as well:
How to execute a processor only when another processor is not executing?
Check it out

Multiple small inserts in clickhouse

I have an event table (MergeTree) in clickhouse and want to run a lot of small inserts at the same time. However the server becomes overloaded and unresponsive. Moreover, some of the inserts are lost. There are a lot of records in clickhouse error log:
01:43:01.668 [ 16 ] <Error> events (Merger): Part 20161109_20161109_240760_266738_51 intersects previous part
Is there a way to optimize such queries? I know I can use bulk insert for some types of events. Basically, running one insert with many records, which clickhouse handles pretty well. However, some of the events, such as clicks or opens could not be handled in this way.
The other question: why does ClickHouse decide that similar records exist when they don't? There are similar records at the time of insert that have the same values for the indexed fields, but other fields differ.
From time to time I also receive the following error:
Caused by: ru.yandex.clickhouse.except.ClickHouseUnknownException: ClickHouse exception, message: Connect to localhost:8123 [ip6-localhost/0:0:0:0:0:0:0:1] timed out, host: localhost, port: 8123; Connect to ip6-localhost:8123 [ip6-localhost/0:0:0:0:0:0:0:1] timed out
... 36 more
Mostly during project build when test against clickhouse database are run.
ClickHouse has a special type of table for this: Buffer. It's stored in memory and allows many small inserts without problems. We have nearly 200 different inserts per second and it works fine.
Buffer table:
CREATE TABLE logs.log_buffer (rid String, created DateTime, some String, d Date MATERIALIZED toDate(created))
ENGINE = Buffer('logs', 'log_main', 16, 5, 30, 1000, 10000, 1000000, 10000000);
Main table:
CREATE TABLE logs.log_main (rid String, created DateTime, some String, d Date)
ENGINE = MergeTree(d, sipHash128(rid), (created, sipHash128(rid)), 8192);
Details in manual: https://clickhouse.yandex/docs/en/operations/table_engines/buffer/
This is a known issue when processing a large number of small inserts into a (non-replicated) MergeTree.
This is a bug, we need to investigate and fix.
For workaround, you should send inserts in larger batches, as recommended: about one batch per second: https://clickhouse.tech/docs/en/introduction/performance/#performance-when-inserting-data.
I've had a similar problem, although not as bad - making ~20 inserts per second caused the server to reach a high loadavg, memory consumption and CPU use. I created a Buffer table which buffers the inserts in memory, and then they are flushed periodically to the "real" on-disk table. And just like magic, everything went quiet: loadavg, memory and CPU usage came down to normal levels. The nice thing is that you can run queries against the buffer table and get back matching rows from both memory and disk - so clients are unaffected by the buffering. See https://clickhouse.tech/docs/en/engines/table-engines/special/buffer/
Alternatively, you can use something like https://github.com/nikepan/clickhouse-bulk: it will buffer multiple inserts and flush them all together according to user policy.
The design of ClickHouse merge engines is not meant to take small writes concurrently. As far as I understand, MergeTree merges the parts of data written to a table based on partitions and then reorganizes the parts for better aggregated reads. If we do small writes often, you will encounter another exception:
Error: 500: Code: 252, e.displayText() = DB::Exception: Too many parts (300). Merges are processing significantly slow
When you try to understand why the above exception is thrown, the idea becomes a lot clearer: CH needs to merge data, and there is an upper limit on how many parts can exist. Every write in a batch is added as a new part and then eventually merged with the partitioned table.
SELECT table, count() AS cnt
FROM system.parts
WHERE database = 'dbname'
GROUP BY table
ORDER BY cnt DESC
The above query can help you monitor parts, observe while writing how the parts would increase and eventually merge down.
My best bet for the above would be buffering the data set and periodically flushing it to DB, but then that means no real-time analytics.
Using buffer is good, however please consider these points:
If the server is restarted abnormally, the data in the buffer is lost.
FINAL and SAMPLE do not work correctly for Buffer tables. These conditions are passed to the destination table, but are not used for processing data in the buffer
When adding data to a Buffer, one of the buffers is locked. (So no reads)
If the destination table is replicated, some expected characteristics of replicated tables are lost when writing to a Buffer table. (no deduplication)
Please read it thoroughly; it's a special-case engine: https://clickhouse.tech/docs/en/engines/table-engines/special/buffer/

Handling Duplicates in Data Warehouse

I was going through the below link for handling Data Quality issues in a data warehouse.
http://www.kimballgroup.com/2007/10/an-architecture-for-data-quality/
"
Responding to Quality Events
I have already remarked that each quality screen has to decide what happens when an error is thrown. The choices are: 1) halting the process, 2) sending the offending record(s) to a suspense file for later processing, and 3) merely tagging the data and passing it through to the next step in the pipeline. The third choice is by far the best choice.
"
In some dimensional feeds (like Client list), sometimes we get a same Client twice (the two records having difference in certain attributes). What is the best solution in this scenario?
I don't want to reject both records (as that would mean incomplete client data).
The source systems are very slow in fixing the issue, so we get the same issues every day. That means a manual fix to the problem also is tough as it has to be done every day (we receive the client list everyday).
Selecting a single record is not possible as we don't know what the correct value is.
Having both the records in our warehouse means our joins are disrupted. Because of two rows for the same ID, the fact table rows are doubled (in a join).
Any thoughts?
What is the best solution in this scenario?
There are a lot of permutations and combinations with your scenario. The big question is: are the differing details valid or invalid? This will change how you deal with them.
Valid Data Example: Record 1 has John Smith living at 12 Main St, Record 2 has John Smith living at 45 Main St. This is valid because John Smith moved address between the first and second record. This is an example of Valid Data. If the data is valid you have options such as create a slowly changing dimension and track the changes (end date old record, start date new record).
Invalid Data Example: However, if the data is INVALID (e.g. your system somehow creates duplicate keys incorrectly) then your options are different. I doubt you want to surface this data as it's currently invalid and, as you pointed out, you don't have a way to identify which duplicate record is "correct". But you don't want your whole load to fail/halt.
In this instance you would usually:
Push these duplicate rows to a "Quarantine" area
Push an alert to the people who have the power to fix this operationally
Optionally select one of the records randomly as the "golden detail" record (so your system will still tally with totals) and mark an attribute on the record saying that it's "Invalid" and under investigation.
The point that Kimball is trying to make is that Option 1 is not desirable because it halts your entire system for errors that will happen, Option 2 isn't ideal because it means your aggregations will appear out of sync with your source systems, so Option 3 is the most desirable as it still leads to a data fix, but doesn't halt the process or the use of the data (but it does alert the users that this data is suspect).
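As a rough sketch of that third option (the types and the hash-based tie-break are my own invention, not from Kimball): deterministically pick one duplicate as the golden record and tag it as under investigation:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"sort"
)

// ClientRecord is a simplified duplicate row from the daily client feed.
type ClientRecord struct {
	ClientID string
	Address  string
	Invalid  bool // tagged, per option 3, rather than rejected
}

// pickGolden deterministically selects one record per ClientID (the one
// whose hashed attributes sort lowest, so reruns pick the same record)
// and tags it Invalid so downstream users know it is under investigation.
func pickGolden(dups []ClientRecord) (golden ClientRecord, rest []ClientRecord) {
	sort.Slice(dups, func(i, j int) bool {
		return hash(dups[i]) < hash(dups[j])
	})
	golden = dups[0]
	golden.Invalid = true
	return golden, dups[1:]
}

// hash gives a stable ordering key over a record's attributes.
func hash(r ClientRecord) string {
	h := sha256.Sum256([]byte(r.ClientID + "|" + r.Address))
	return hex.EncodeToString(h[:])
}
```

The remaining duplicates would go to the quarantine area, with the alert raised separately.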

Data processing and updating of selected records

Basically, the needed job is for large amount of records on a data base, and more records can be inserted all the time:
Select <1000> records with status "NEW" -> process the records -> update the records to status "DONE".
This sounds to me like "Map Reduce".
I think that the job described above may be done in parallel, even by different machines, but then my concern is:
When I select <1000> records with status "NEW" - how can I know that none of these records are already being processed by some other job ?
The same records should not be selected and processed more than once of course.
Performance is critical.
The naive solution is to do the mentioned basic job in a loop.
It seems related to big data processing / NoSQL / MapReduce, etc.
Thanks
Considering the performance requirement, we can achieve this. The main goal is to distribute records to clients in such a way that no two clients get the same record.
Irrespective of the database:
If you can add one more column to be used for locking records, then on fetching those records you can set the lock, preventing them from being fetched a second time.
If you do not have that capability, my best bet would be to create another table or an in-memory key-value store, keyed by the record's primary key and holding the lock, and on fetching records check that the record does not already exist in that other table.
If you have HBase, the first approach can be achieved easily and with good performance.
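A minimal in-memory sketch of the lock-column idea (names are mine; in a real database this would be an atomic UPDATE, or SELECT ... FOR UPDATE SKIP LOCKED, rather than a mutex):

```go
package main

import "sync"

// Store is an in-memory stand-in for the records table. Claim moves a
// record NEW -> PROCESSING atomically, so two workers can never pick up
// the same record.
type Store struct {
	mu     sync.Mutex
	status map[int]string // record ID -> NEW / PROCESSING / DONE
}

func NewStore(n int) *Store {
	s := &Store{status: make(map[int]string)}
	for i := 0; i < n; i++ {
		s.status[i] = "NEW"
	}
	return s
}

// Claim atomically selects up to `limit` NEW records and marks them
// PROCESSING, returning their IDs. An empty result means nothing is left.
func (s *Store) Claim(limit int) []int {
	s.mu.Lock()
	defer s.mu.Unlock()
	var ids []int
	for id, st := range s.status {
		if st == "NEW" {
			s.status[id] = "PROCESSING"
			ids = append(ids, id)
			if len(ids) == limit {
				break
			}
		}
	}
	return ids
}

// Done marks processed records so they are never claimed again.
func (s *Store) Done(ids []int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, id := range ids {
		s.status[id] = "DONE"
	}
}
```

Each worker loops on Claim / process / Done; because the claim itself flips the status, a batch is invisible to every other worker the moment it is selected.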
