Restart Hazelcast Jet (v0.4) when there is an exception - spring-boot

We are using Hazelcast Jet 0.4 to read messages from a Kafka source, process them, and write the results back to Kafka.
Since Kafka is managed by an external team, we cannot control the various exceptions it throws.
For example, we receive the following exception: Commit cannot be completed since the group has already rebalanced and assigned the partitions
When we receive this error, the Hazelcast Jet instance shuts down, so our application becomes unusable and we have to restart it.
We are looking for a way to restart the Jet instance automatically when these errors occur.
Thanks for your help!

It looks like you are suffering from this issue: https://github.com/hazelcast/hazelcast-jet/issues/428. It is fixed in the current Jet version, which uses manual partition assignment, so even very slow processing of events won't cause a heartbeat timeout.
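If upgrading is not an option right away, one workaround is to supervise the embedded Jet instance from your Spring Boot application and recreate it whenever the job fails. The sketch below only relies on the Jet.newJetInstance()/shutdown() lifecycle; buildAndRunJob() is a hypothetical placeholder for your own DAG submission code, and the fixed 5-second back-off is just an illustration.

import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;

public class JetSupervisor {

    // Placeholder for your own code that builds the DAG, submits the job,
    // and blocks until it completes or fails (throwing on the Kafka-related error).
    private void buildAndRunJob(JetInstance jet) throws Exception {
        // jet.newJob(dag) ... submit and wait here
    }

    public void runForever() throws InterruptedException {
        while (true) {
            JetInstance jet = Jet.newJetInstance();   // start a fresh embedded instance
            try {
                buildAndRunJob(jet);                  // blocks until the job completes or fails
            } catch (Exception e) {
                // e.g. the commit/rebalance exception bubbling up from the Kafka source
                e.printStackTrace();
            } finally {
                jet.shutdown();                       // make sure the old instance is gone
            }
            Thread.sleep(5_000);                      // back off before restarting
        }
    }
}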

Related

Debezium / JDBC and Kafka topic retention

I have Debezium in a container, capturing all changes to PostgreSQL database records. In addition I have a Kafka container to store the topic messages, and finally a JDBC sink container to write all changes to another database.
These three containers work as expected, performing snapshots of the existing data in specific tables and streaming new changes, which are reflected in the destination database.
I have figured out that during this streaming the PostgreSQL WAL keeps growing. To overcome this, I enabled the following property on the source connector so that the retrieved log positions are flushed:
"heartbeat.interval.ms": 1000
Now the PostgreSQL WAL is cleared on every heartbeat as the retrieved changes are flushed. However, even though the changes are committed into the secondary database, the Kafka topics stay exactly the same size.
Is there any way, or any property on the sink connector, to force Kafka to delete committed messages?
Consumers have no control over topic retention.
You may edit the topic config directly to reduce the retention time, but then your consumer must read the data within that time.
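For example, retention can be lowered with the Kafka AdminClient; the broker address, topic name, and one-day retention value below are placeholders, and this is only a sketch of changing the topic config yourself, not something the sink connector will do for you.

import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class ReduceRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            // Placeholder topic name: the Debezium topic whose size you want to cap.
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "dbserver1.public.my_table");

            // Keep records for one day; the sink must consume them within that window.
            AlterConfigOp op = new AlterConfigOp(new ConfigEntry("retention.ms", "86400000"),
                                                 AlterConfigOp.OpType.SET);

            Map<ConfigResource, Collection<AlterConfigOp>> updates =
                    Collections.singletonMap(topic, Collections.singletonList(op));
            admin.incrementalAlterConfigs(updates).all().get();
        }
    }
}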

Elasticsearch connector task stuck in failed and unknown status

In my Elasticsearch connector, tasks.max was set to 10, which I reduced to 7. The connector now runs with 7 tasks, but the other three tasks are stuck in "unknown" and "failed" status. I have restarted the connector, but the tasks still did not get removed.
How do I remove these unassigned/failed tasks?
This is a known issue in Kafka Connect. I am not sure whether there is a JIRA for it, or whether it is fixed in versions later than the last one I used (approximately Kafka 2.6).
When you start with a higher tasks.max, that information is stored in the internal status topic of the Connect API, and this is what the status HTTP endpoint returns. When you reduce the value, the metadata of those previous tasks is still there, but it is no longer updated.
The only real fix I've found is to delete the connector and re-create it, potentially with a new name, since the status topic is keyed by connector name.
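A sketch of that delete-and-recreate step against the Connect REST API, using Java 11's built-in HTTP client; the worker URL, connector names, and the abbreviated config JSON are all placeholders for your own values.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RecreateConnector {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String connectUrl = "http://localhost:8083";   // placeholder Connect worker URL
        String oldName = "es-sink";                    // placeholder connector name

        // 1) Delete the existing connector.
        HttpRequest delete = HttpRequest.newBuilder(URI.create(connectUrl + "/connectors/" + oldName))
                .DELETE()
                .build();
        System.out.println(client.send(delete, HttpResponse.BodyHandlers.ofString()).statusCode());

        // 2) Re-create it, ideally under a new name so the stale status-topic entries don't reappear.
        //    The config below is abbreviated; include your real topics, connection.url, etc.
        String newConfig = "{ \"name\": \"es-sink-v2\", \"config\": { \"connector.class\": "
                + "\"io.confluent.connect.elasticsearch.ElasticsearchSinkConnector\", \"tasks.max\": \"7\" } }";
        HttpRequest create = HttpRequest.newBuilder(URI.create(connectUrl + "/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(newConfig))
                .build();
        System.out.println(client.send(create, HttpResponse.BodyHandlers.ofString()).statusCode());
    }
}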

Does the Cassandra session get re-created when WAS disconnects from the Cassandra cluster? (e.g. a network issue)

I have been testing with CircleCI and Docker (a Cassandra image), and when I run the tests, logs like the following appear:
"Tried to execute unknown prepared query. You may have used a PreparedStatement that was created with another Cluster instance."
But there is only a single Cassandra cluster, so I can't understand what causes this error.
Could it happen because of a Cassandra connection issue?
Tests have sometimes failed because WAS can't connect to the Cassandra cluster
(I think CircleCI causes this issue),
so my guess is:
WAS can't connect to the Cassandra cluster during testing,
the session is re-created,
and the PreparedStatement error appears in the logs.
Is that possible?
If not, how does this error happen when only one Cassandra cluster is running?
The "Cluster instance" being referred to in this message is the Cluster object in your app code:
Tried to execute unknown prepared query. You may have used a PreparedStatement \
that was created with another Cluster instance.
That error implies that you have multiple Cluster objects in your app. You should have only one instance, shared throughout your app code, rather than creating multiple Cluster objects. Cheers!
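A minimal sketch of that pattern, assuming the DataStax Java driver 3.x; the contact point and keyspace are placeholders. All prepared statements go through the one shared session, so they always belong to the same Cluster.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

// Holds a single shared Cluster/Session for the whole application.
public final class CassandraHolder {

    private static final Cluster CLUSTER = Cluster.builder()
            .addContactPoint("127.0.0.1")          // placeholder contact point
            .build();

    private static final Session SESSION = CLUSTER.connect("my_keyspace"); // placeholder keyspace

    private CassandraHolder() {
    }

    public static Session session() {
        return SESSION;
    }

    // Prepare statements against the shared session, never against a second Cluster.
    public static PreparedStatement prepare(String cql) {
        return SESSION.prepare(cql);
    }
}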

Out-of-memory issue, suspected to be due to the suppress feature

We are currently using the Kafka Streams DSL (2.1.1) suppress feature to store intermediate aggregation results.
The application receives a continuous stream and is responsible for one-day window aggregations.
The application runs on 9 servers in total; each server has enough memory (64 GB) and disk space (500 GB), and 21 GB of memory is explicitly assigned to the aggregation service alone, yet it still crashes with an OOM error.
Suppress topic definition: application-KTABLE-SUPPRESS-STATE-STORE-0000000004-changelog PartitionCount:100 ReplicationFactor:5 Configs:cleanup.policy=compact
My understanding of the suppress operator is as follows:
1) Suppress does not have a state store; it relies on an in-memory buffer which is backed by a changelog topic.
2) When the suppress operator emits final results, according to many forums it sends a tombstone to the corresponding changelog topic, so the buffered record gets deleted there.
On the other hand, the cleanup policy of this changelog topic is only compact, so I am not sure exactly how that works.
The application was rolled out to production a few days ago, and we are observing the OOM issue very frequently.
Below are the observations:
1) Disk space is growing very fast, as older window records are not deleted from application-KTABLE-SUPPRESS-STATE-STORE-0000000004-changelog.
2) Once a node hits OOM and is restarted, the cached memory is filled up very quickly by the aggregation service (18-20 GB), which is not anticipated given the low volume.
3) The changelog topic underneath the suppress feature (application-KTABLE-SUPPRESS-STATE-STORE-0000000004-changelog) has no retention period by default, and it emits older records even though the window has already advanced. This was observed when a node crashed due to the memory issue and was restarted. I am wondering why the changelog still keeps older window records even though the window closed a day ago. Probably because cleanup.policy is only compact?
I am using Kafka Streams 2.1.1 and found a bug registered against Kafka Streams which is fixed in 2.2.1 and later releases:
OutOfMemoryError when restarting my Kafka Streams application
Kafka Streams State Store Unrecoverable from Change Log Topic
In order to remediate the issue I am planning the following:
1) Reset the Kafka Streams application with the application reset tool, which deletes the internal topics.
2) Clean up the Kafka Streams state store.
3) Upgrade Kafka Streams to 2.4.0, hoping it is stable.
Please let me know if you have other views on the OOM issue.
Pseudo code:
KTable<Windowed<String>, JsonNode> aggregateTable =
    transactions
        .groupByKey()
        .windowedBy(TimeWindows.of(Duration.ofSeconds(windowDuration))
            .grace(Duration.ofSeconds(windowGraceDuration)))
        .aggregate(
            () -> new AggregationService().initialize(),
            (key, transaction, previousStats) ->
                new AggregationService().buildAggregation(key, transaction, previousStats, runByUnit),
            Materialized.<String, JsonNode, WindowStore<Bytes, byte[]>>as(statStoreName)
                .withRetention(Duration.ofSeconds(windowDuration + windowGraceDuration + windowRetentionDuration))
                .withKeySerde(Serdes.String())
                .withValueSerde(jsonSerde))
        .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()));
Thank you for your help.
I am using Kafka Streams 2.1.1 and found a bug related to suppress which was resolved in 2.2.1 and later releases:
https://issues.apache.org/jira/plugins/servlet/mobile#issue/KAFKA-7895
Please let me know whether 2.3.0 solves this issue.
I observed a similar issue: suppress emits records belonging to older windows multiple times, mainly after a node restart, and the out-of-memory issue appears to be related, considering the high volume (older plus current window records).
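If the upgrade takes a while, one way to at least fail in a controlled manner is to cap the suppression buffer instead of using BufferConfig.unbounded(). This is only a sketch of the alternative configuration: the 64 MB limit is an arbitrary placeholder to tune per instance, and shutDownWhenFull() stops the application cleanly rather than letting the heap blow up.

// Hypothetical drop-in replacement for the last line of the snippet above:
.suppress(Suppressed.untilWindowCloses(
        Suppressed.BufferConfig.maxBytes(64L * 1024 * 1024)   // placeholder cap
                .shutDownWhenFull()));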

Spring XD stream using HDFS-Dataset to save Avro data is unable to renew its Kerberos ticket

I have created a Spring XD stream ====> source - JMS queue -> transform - custom Java processor (XML to Avro) -> sink - HDFS-Dataset.
The stream works perfectly fine, but after 24 hours, since it holds a continuous connection, it is unable to renew the Kerberos authentication ticket and stops writing to HDFS. We restart the container where this stream is deployed, but we still face problems and lose messages, as they are not even sent to the Redis error queue.
I need help with:
Whether we can renew the Kerberos ticket for the stream. Do I need to update the sink code and create a custom sink?
I can't find any sink in the Spring XD documentation that is similar to HDFS-Dataset but writes to the local file system, where I wouldn't need to go through Kerberos authentication.
Appreciate your help here.
Thanks,
This is a well-known problem in Spring XD which is not documented :). Something pretty similar happens to batch jobs which stay deployed for a long time and run later. Why? Because the hadoopConfiguration object forces the scope to singleton, and it is instantiated once you deploy your stream/job in Spring XD. In our case we created a listener for the Spring Batch jobs to renew the ticket before the job executions. You could do something similar in your streams; take this as a guide:
https://github.com/spring-projects/spring-hadoop/blob/master/spring-hadoop-core/src/main/java/org/springframework/data/hadoop/configuration/ConfigurationFactoryBean.java
I hope it helps.
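For example, a small Spring Batch JobExecutionListener (or an equivalent hook in the stream module) can re-login from the keytab before each execution. This is only a sketch of the idea described above, relying on Hadoop's UserGroupInformation API; the principal and keytab path are placeholders.

import org.apache.hadoop.security.UserGroupInformation;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobExecutionListener;

// Renews the Kerberos ticket before each execution so long-lived deployments
// don't end up with an expired TGT when they finally write to HDFS.
public class KerberosReloginListener implements JobExecutionListener {

    private static final String PRINCIPAL = "xd-user@EXAMPLE.COM";      // placeholder principal
    private static final String KEYTAB    = "/etc/security/xd.keytab";  // placeholder keytab path

    @Override
    public void beforeJob(JobExecution jobExecution) {
        try {
            if (UserGroupInformation.isSecurityEnabled()) {
                // Log in (or re-login) from the keytab; subsequent HDFS calls use the fresh ticket.
                UserGroupInformation.loginUserFromKeytab(PRINCIPAL, KEYTAB);
            }
        } catch (java.io.IOException e) {
            throw new IllegalStateException("Kerberos re-login failed", e);
        }
    }

    @Override
    public void afterJob(JobExecution jobExecution) {
        // nothing to do
    }
}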
