Data gets discarded while I am using the Kafka Streams aggregate method - apache-kafka-streams

I recently encountered a thorny problem while using Kafka Streams' TimeWindowedKStream aggregation method. The symptom: I stopped my program for 5 minutes and then restarted it, and found that a small part of my data was lost, with the following message: "Skipping record for expired window". All of the data are normal records that should be saved; there is no large delay. What can I do to prevent data from being discarded? It seems that Kafka Streams picked up a later timestamp when it determined the observed stream time.

The error message means that the window was already closed, so you would need to increase the grace period, as pointed out by #groo. Data expiration is based on event time, so stopping your program and resuming it later should not change much.
However, if there is a repartition topic before the aggregation and you stop your program for some time, there might be more out-of-order data inside the repartition topic, because the input topic is read much faster than in the "live run". This increased disorder during catch-up could be the issue.
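For illustration, the grace period is configured on the window definition. A minimal sketch of a topology fragment with recent Kafka versions follows; the window size, grace period, topic name, and use of count() are placeholders rather than values from the question (older versions use TimeWindows.of(...).grace(...) instead):

import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.TimeWindows;

StreamsBuilder builder = new StreamsBuilder();
builder.<String, Long>stream("input-topic")
        .groupByKey()
        // Keep each 5-minute window open to late/out-of-order records for an
        // extra 30 minutes before the window is closed and records are skipped.
        .windowedBy(TimeWindows.ofSizeAndGrace(Duration.ofMinutes(5), Duration.ofMinutes(30)))
        .count();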

Related

Solana transaction never gets picked up by the cluster

For the last few weeks or so, we have been having the following issue:
Some of our transactions, when sent via sendRawTransaction(), never get picked up by the network (if we look up the txid in the explorer, it's never there), and yet web3.js doesn't error out.
We use "#solana/web3.js": "^1.44.1"
This started happening to us about 2-3 weeks ago:
This issue affects some sets of transactions that all share the same instructions + amount of signers and accounts.
It repros 100% of the time for all transactions in those sets. No matter the state of the network or how many times we retry, they never get picked up.
We don't get any error back from web3.js, such as a transaction limit being hit.
They all work in devnet, but not in mainnet!
For one of these transactions, I removed one instruction+signer and it started working, so I imagine there's some limit we're hitting, but I can't tell which one or how to determine it.
When network congestion is high, validators can drop transactions without any error. To fix your issue, you could re-send the same transaction at some interval while you're waiting for confirmation and while your transaction's blockhash is still valid. This way you raise the chances of your transaction being processed by a validator.
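A rough, pseudocode-style sketch of that rebroadcast loop (written in Java for brevity; sendRawTransaction, isConfirmed, and isBlockhashValid are hypothetical stand-ins for the corresponding web3.js calls, and the 2-second interval is only illustrative):

byte[] rawTx = signedTransactionBytes();      // the already-signed transaction
String signature = sendRawTransaction(rawTx); // initial broadcast
while (!isConfirmed(signature) && isBlockhashValid(rawTx)) {
    // Re-sending the identical transaction is safe: the network de-duplicates
    // by signature, so retries only increase the chance a validator picks it up.
    sendRawTransaction(rawTx);
    Thread.sleep(2_000);
}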

ElasticSearch document refresh=true does not appear to work

In order to speed up searches on our website, I have created a small elastic search instance which keeps a copy of all of the "searchable" fields from our database. It holds only a couple million documents with an average size of about 1KB per document. Currently (in development) we have just 2 nodes, but will probably want more in production.
Our application is a "primarily read" application - maybe 1,000 documents/day get updated, but they get read and searched tens of thousands of times/day.
Each document represents a case in a ticketing system, and the case may change status during the day as users research and close cases. If a researcher closes a case and then immediately refreshes his queue of open work, we expect the case to disappear from their queue, which is driven by a query to our Elastic Search instance, filtering by status. The status is a field in the case index.
The complaint we're getting is that when a researcher closes a case, upon immediate refresh of his queue, the case still comes back when filtering on "in progress" cases. If he refreshes the view a second or two later, it's gone.
In an effort to work around this, I added refresh=true when updating the document, e.g.
curl -XPUT 'https://my-dev-es-instance.com/cases/_doc/11?refresh=true' -d '{"status":"closed", ... }'
But still the problem persists.
Here's the response I got from the above request:
{"_index":"cases","_type":"_doc","_id":"11","_version":2,"result":"updated","forced_refresh":true,"_shards":{"total":2,"successful":1,"failed":0},"_seq_no":70757,"_primary_term":1}
The response seems to verify that the forced_refresh request was received, although it says that of 2 total shards, 1 was successful and 0 failed. Not sure about the other one, but since I have only 2 nodes, does this mean it updated the secondary?
According to the doc:
To refresh the shard (not the whole index) immediately after the operation occurs, so that the document appears in search results immediately, the refresh parameter can be set to true. Setting this option to true should ONLY be done after careful thought and verification that it does not lead to poor performance, both from an indexing and a search standpoint. Note, getting a document using the get API is completely realtime and doesn’t require a refresh.
Are my expectations reasonable? Is there a better way to do this?
After more testing, I have concluded that my issue was due to application logic error, and not a problem with ElasticSearch. The refresh flag is behaving as expected. Apologies for the misinformation.
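As an aside, if forcing an immediate refresh ever proves too expensive, Elasticsearch also accepts refresh=wait_for, which blocks the indexing request until the next scheduled refresh makes the change visible to search. Reusing the (hypothetical) endpoint from the question:

curl -XPUT 'https://my-dev-es-instance.com/cases/_doc/11?refresh=wait_for' -d '{"status":"closed", ... }'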

Redefine database "transactional" boundary on a spring batch job

Is there a way to redefine the database "transactional" boundary on a spring batch job?
Context:
We have a simple payment processing job that reads x number of payment records, processes them, and marks the records in the database as processed. Currently, the writer makes a REST API call (to the payment gateway), processes the API response, and marks the records as processed. We're using a chunk-oriented approach, so the updates aren't flushed to the database until the whole chunk has completed. Since basically the whole read/write happens within a transaction, we are starting to see excessive database locks and contention. For example, if the API takes a long time to respond (say 30 seconds), the whole application starts to suffer.
We can obviously reduce the timeout for the API call to a smaller value, but that still doesn't solve the issue of the tables potentially getting locked for longer than the desirable duration. Ideally, we want to keep the database transaction as short-lived as possible. Our thought is that if the "meat" of what the job does can be done outside of the database transaction, we could get around this issue. So, if the API call happens outside of a database transaction, we can afford it to take a few more seconds to accept the response without causing/adding to the long lock duration.
Is this the right approach? If not, what would be the recommended way to approach this "simple" job in spring-batch fashion? Are there other batch tools better suited for the task? (if spring-batch is not the right choice).
Open to providing more context if needed.
I don't have a precise answer to all your questions but I will try to give some guidelines.
Since basically the whole read/write happens within a transaction, we are starting to see excessive database locks and contention. For example, if the API takes a long time to respond (say 30 seconds), the whole application starts to suffer.
Since its inception, batch processing, or processing data in "batches", has been based on the idea that a batch of records is treated as a unit: either all records are processed (whatever the term "process" means here) or none of them are. This "all or nothing" semantic is exactly what Spring Batch implements in its chunk-oriented processing model. Achieving such a (powerful) property comes with trade-offs. In your case, you need to make a trade-off between consistency and responsiveness.
We can obviously reduce the timeout for the API call to a smaller value, but that still doesn't solve the issue of the tables potentially getting locked for longer than the desirable duration.
The chunk size is the parameter with the biggest impact on transaction behaviour. What you can do is reduce the number of records processed within a single transaction and observe the result. There is no single best value; this is an empirical process. It will also depend on the responsiveness of the API you are calling during the processing of a chunk.
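For illustration, the chunk size is set on the step definition. A minimal sketch follows (Spring Batch 4.x style; the step name, chunk size of 20, and the Payment item type are hypothetical, not taken from the question):

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;

@Bean
public Step paymentStep(StepBuilderFactory steps,
                        ItemReader<Payment> reader,
                        ItemProcessor<Payment, Payment> processor,
                        ItemWriter<Payment> writer) {
    // A smaller chunk size means shorter transactions and shorter lock
    // durations, at the cost of more frequent commits.
    return steps.get("paymentStep")
            .<Payment, Payment>chunk(20)
            .reader(reader)
            .processor(processor)
            .writer(writer)
            .build();
}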
Our thought is that if the "meat" of what the job does can be done outside of the database transaction, we could get around this issue. So, if the API call happens outside of a database transaction, we can afford it to take a few more seconds to accept the response without causing/adding to the long lock duration.
A common technique to avoid doing such updates on a live system is to offload the processing to another datastore and then replicate the updates in a single transaction. The idea is to mark records with a given batch id and copy those records to a different datastore (or even a temporary table within the same datastore) that the batch process can use without impacting the live datastore. Once the processing is done (which could be done in parallel to improve performance), records can be marked as processed in the live system in a single transaction (this is usually very fast and can use the batch id to identify which records to update).
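A rough sketch of the final "mark as processed" step under this approach, assuming a payments table with a batch_id column (both the table and the column are hypothetical):

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.transaction.annotation.Transactional;

@Transactional
public void markBatchProcessed(JdbcTemplate jdbcTemplate, String batchId) {
    // A single short-lived transaction on the live table, decoupled from the
    // slow API calls made earlier against the offloaded copy of the records.
    jdbcTemplate.update(
            "UPDATE payments SET status = 'PROCESSED' WHERE batch_id = ?",
            batchId);
}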

How can I reset Kafka state to "start of universe"?

I'm still working on a Kafka Streams application that I described in
Why isn't Kafka consumer producing results?. In that posting, I asked why setting
kstreams_props.put( ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
doesn't appear to reset the state of Kafka to "start of the universe" before any data are pushed to any topic. I am now encountering a variant of that issue:
My application consists of a producer program that pushes data to a Kafka stream and a consumer program that groups the data, aggregates the groups, and then converts the resulting KTable back into a stream, which I print out.
The aggregation step is essentially adding up all the values, then putting those sums into the output stream as new data. What I observe, though, is that every time I run the program, the resulting aggregated values get bigger and bigger, almost as if Kafka is somehow retaining the previous results and including those in the aggregation.
In order to try fixing this, I deleted all my topics (except for __consumer_offsets, which Kafka would not allow), then re-ran my application, but the aggregated values continue to grow, as if Kafka were retaining the result of previous computations even though I thought that deleting the intermediate topics would fix things. I even tried stopping and restarting the Kafka server, to no avail.
What's going on here and, more to the point, how can I fix this? I've tried various suggestions about setting AUTO_OFFSET_RESET_CONFIG, also with no effect. I should mention that one aspect of my application is that my original producer creates its own Kafka timestamps in the Producer.send call, although disabling that also seemed to have no effect.
Thanks in advance, -- Mark
AUTO_OFFSET_RESET_CONFIG only triggers if there are no committed offsets: when an application starts, it first looks for committed offsets and applies the reset policy only if no valid offsets are found.
Furthermore, for a Kafka Streams application, resetting offsets would not be sufficient, and you should use the reset tool bin/kafka-streams-application-reset.sh -- this blog post explains the tool in detail: https://www.confluent.io/blog/data-reprocessing-with-kafka-streams-resetting-a-streams-application/
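For reference, a typical invocation looks roughly like the following (the application id, topic, and broker address are placeholders, and exact flag names vary by Kafka version); each instance's local state directory also needs to be wiped, for example by calling KafkaStreams#cleanUp() before start():

bin/kafka-streams-application-reset.sh \
  --application-id my-streams-app \
  --input-topics my-input-topic \
  --bootstrap-servers localhost:9092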

mongodb many inserts\updates performance

I am using MongoDB to store users' events; there is a document for every user, containing an array of events. The system processes thousands of events a minute and inserts each one of them into Mongo.
The problem is that I get poor performance for the update operation. Using a profiler, I noticed that WriteResult.getError is what incurs the performance impact.
That makes sense: the update is async, but if one wants to retrieve the operation result, one needs to wait until the operation is completed.
My question: is there a way to keep the update async, but only get an exception if an error occurs (99.999% of the time there is no error, so the system waits for nothing)? I understand that this means the exception will be raised somewhere further down the process flow, but I can live with that.
Any other suggestions?
The application is written in Java so we're using the Java driver, but I am not sure it's related.
Have you created indexes on your collection?
If not, that may be hurting your performance. You can index the queried field like this:
db.collectionName.ensureIndex({"event.type":1})
For more help, see http://www.mongodb.org/display/DOCS/Indexes
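For example, with the MongoDB Java driver the equivalent index could be created once at startup (the connection string, database, collection, and field names below are illustrative):

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

MongoCollection<Document> events = MongoClients.create("mongodb://localhost:27017")
        .getDatabase("mydb")
        .getCollection("userEvents");
// Index the field used in the update's query criteria.
events.createIndex(Indexes.ascending("events.type"));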
