Out of memory issue suspected to be due to the suppress feature - apache-kafka-streams

We are currently using the Kafka Streams DSL (2.1.1) suppress feature to hold intermediate aggregation results.
The application receives a continuous stream and is responsible for one-day window aggregations.
The application runs on a total of 9 servers. Each server has sufficient memory (64 GB) and disk space (500 GB), and 21 GB of memory is explicitly assigned to the aggregation service alone, yet it still crashes with an OOM error.
Suppress topic definition: application-KTABLE-SUPPRESS-STATE-STORE-0000000004-changelog PartitionCount:100 ReplicationFactor:5 Configs:cleanup.policy=compact
My understanding of the suppress operator is as follows:
1) Suppress does not have a state store; it relies on an in-memory buffer which is backed by a changelog topic.
2) When the suppress operator emits final results, according to many forums it sends a tombstone to the corresponding changelog topic, and the record is deleted there.
On the other hand, the cleanup policy for this changelog topic is only compact, so I am not sure how that actually works.
The application was rolled out to production a few days ago, and we have been observing OOM errors very frequently.
Below are our observations:
1) Disk space is growing very fast because older window records are not being deleted from application-KTABLE-SUPPRESS-STATE-STORE-0000000004-changelog.
2) Once a node hits OOM and is restarted, cached memory fills up very quickly with the aggregation service (18-20 GB), which is not anticipated given the low volume.
3) The underlying changelog topic for the suppress feature (application-KTABLE-SUPPRESS-STATE-STORE-0000000004-changelog) has no retention period by default, and suppress emits older records even though the window has already advanced. This was observed when a node crashed due to the memory issue and was restarted. Why does the changelog still keep older window records even though the window closed after a day? Probably because cleanup.policy is only compact?
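For reference, a quick way to confirm what cleanup.policy and retention the suppress changelog actually carries is to describe it with the Kafka admin client. A minimal sketch (the bootstrap address is a placeholder):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

static void describeSuppressChangelog() throws Exception {
    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
    try (AdminClient admin = AdminClient.create(props)) {
        ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC,
            "application-KTABLE-SUPPRESS-STATE-STORE-0000000004-changelog");
        Config config = admin.describeConfigs(Collections.singleton(topic)).all().get().get(topic);
        System.out.println("cleanup.policy = " + config.get("cleanup.policy").value());
        System.out.println("retention.ms   = " + config.get("retention.ms").value());
    }
}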
I am using Kafka Streams version 2.1.1 and found a bug reported against Kafka Streams which is fixed in 2.2.1 and later releases:
OutOfMemoryError when restarting my Kafka Streams application
Kafka Streams State Store Unrecoverable from Change Log Topic
In order to remediate the issue, I am planning the following:
1) Reset the Kafka Streams application with the application reset tool, which deletes the internal topics.
2) Clean the local Kafka Streams state store (see the sketch below).
3) Upgrade the Kafka Streams version to 2.4.0, hoping it is stable.
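For step 2, the local state can be wiped from the application itself; a minimal sketch, where topology and streamsConfig stand for the application's existing topology and configuration:

// Sketch: wipe the local state directory (RocksDB files under state.dir) before restarting.
// cleanUp() must be called before start() (or after close()); the reset tool handles the internal topics.
KafkaStreams streams = new KafkaStreams(topology, streamsConfig); // existing topology/config (assumed)
streams.cleanUp();  // deletes this application instance's local state directory
streams.start();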
Please let me know if you have other views on the OOM issue.
Pseudo code:
KTable<Windowed<String>, JsonNode> aggregateTable = transactions
    .groupByKey()
    // one-day tumbling windows with a grace period for late-arriving records
    .windowedBy(TimeWindows.of(Duration.ofSeconds(windowDuration))
        .grace(Duration.ofSeconds(windowGraceDuration)))
    .aggregate(
        () -> new AggregationService().initialize(),
        (key, transaction, previousStats) ->
            new AggregationService().buildAggregation(key, transaction, previousStats, runByUnit),
        // windowed store with retention = window size + grace + extra retention
        Materialized.<String, JsonNode, WindowStore<Bytes, byte[]>>as(statStoreName)
            .withRetention(Duration.ofSeconds(windowDuration + windowGraceDuration + windowRetentionDuration))
            .withKeySerde(Serdes.String())
            .withValueSerde(jsonSerde))
    // buffer results in memory and emit only the final result per window
    .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()));
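One idea worth evaluating (an assumption on my side, not a confirmed fix) is to bound the suppress buffer instead of using unbounded(), so the application fails predictably before exhausting the heap; only the last line above would change, for example:

// Sketch: cap the in-memory suppress buffer at ~100 MB (the figure is a placeholder, tune per heap).
// untilWindowCloses() only accepts a strict config, so the behavior on overflow is to shut down.
.suppress(Suppressed.untilWindowCloses(
        Suppressed.BufferConfig.maxBytes(100_000_000L).shutDownWhenFull()));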
Thank you for your help.

I am using Kafka Streams version 2.1.1 and found a bug related to suppress which was later resolved in 2.2.1 and later releases.
https://issues.apache.org/jira/plugins/servlet/mobile#issue/KAFKA-7895
Please let me know whether 2.3.0 can solve this issue.
I observed a similar issue: suppress emitting records belonging to an older window multiple times, mainly after a node restart. The out-of-memory issue appears to be related, considering the high volume (older plus current window records).

Related

Elasticsearch connector task stuck in failed and unknown status

In my Elasticsearch connector, tasks.max was set to 10, which I reduced to 7. Now the connector is running with 7 tasks, and the other three tasks are stuck in "unknown" and "failed" status. I have restarted the connector, but the tasks still did not get removed.
How do I remove these tasks which are unassigned/failed?
This is a known issue in Kafka Connect. I am unsure whether there is a JIRA for it, or whether it is even fixed in versions later than the last one I have used (approximately Kafka 2.6).
When you start with a higher tasks.max, that information is stored in the internal status topic of the Connect API, and this is what the status HTTP endpoint returns. When you reduce the value, the previous tasks' metadata is still there, but it is no longer being updated.
The only real fix I've found is to delete the connector and re-create it, potentially with a new name, since the status topic is keyed by the connector name.
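A rough sketch of that delete-and-recreate step against the Connect REST API (the Connect host, connector names, topic, and config JSON below are placeholders for your own values):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

static void recreateConnector() throws Exception {
    HttpClient http = HttpClient.newHttpClient();
    String connectUrl = "http://connect-host:8083"; // placeholder

    // 1) Delete the old connector; its stale task statuses disappear with it.
    http.send(HttpRequest.newBuilder()
            .uri(URI.create(connectUrl + "/connectors/es-sink-connector")) // placeholder name
            .DELETE().build(),
        HttpResponse.BodyHandlers.ofString());

    // 2) Re-create it, ideally under a new name, with tasks.max already set to 7.
    String body = "{\"name\":\"es-sink-connector-v2\",\"config\":{"
        + "\"connector.class\":\"io.confluent.connect.elasticsearch.ElasticsearchSinkConnector\","
        + "\"tasks.max\":\"7\",\"topics\":\"my-topic\"}}";
    http.send(HttpRequest.newBuilder()
            .uri(URI.create(connectUrl + "/connectors"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build(),
        HttpResponse.BodyHandlers.ofString());
}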

Azure Databricks stream fails with StorageException: Could not verify copy source

We have a Databricks job that has suddenly started to consistently fail. Sometimes it runs for an hour, other times it fails after a few minutes.
The inner exception is
ERROR MicroBatchExecution: Query [id = xyz, runId = abc] terminated with error
shaded.databricks.org.apache.hadoop.fs.azure.AzureException: hadoop_azure_shaded.com.microsoft.azure.storage.StorageException: Could not verify copy source.
The job targets a notebook which consumes from event-hub with PySpark structured streaming, calculates some values based on the data, and streams data back to another event-hub topic.
The cluster is a pool with 2 workers and 1 driver running on standard Databricks 9.1 ML.
We've tried restarting the job many times, also with clean input data and a clean checkpoint location.
We struggle to determine what is causing this error.
We cannot see any 403 Forbidden errors in the logs, which are sometimes mentioned on forums as a cause.
Any assistance is greatly appreciated.
The issue was resolved by moving the checkpoint location (used internally by Spark) from standard storage to premium storage. I don't know why it suddenly started failing after months of running with hardly a hiccup.
Premium storage might be a better place for checkpointing anyway since I/O is cheaper.
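For reference, the only thing that changed is where checkpointLocation points; in Java terms (our job is PySpark, and the abfss path and sink format below are placeholders standing in for the existing ones), it amounts to:

// Sketch: point Structured Streaming checkpointing at the premium storage account.
// `events` stands for the already-built streaming Dataset<Row>; the sink itself is unchanged.
StreamingQuery query = events
    .writeStream()
    .format("eventhubs") // existing Event Hubs sink (assumed)
    .option("checkpointLocation",
        "abfss://checkpoints@premiumaccount.dfs.core.windows.net/job") // placeholder premium path
    .start();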

Restart Hazelcast Jet (v0.4) when there is an exception

We are using Hazelcast Jet version 0.4 to read messages from a Kafka source, process them, and write back to Kafka.
Since Kafka is managed by an external team, we cannot control the various exceptions thrown by Kafka.
For example, we receive the following exception: Commit cannot be completed since the group has already rebalanced and assigned the partitions
When we receive this error, the Hazelcast Jet instance is shut down, so our application becomes unusable and we have to restart it.
We are looking at possibilities for restarting the Jet instance automatically on these errors.
Thanks for your help!
It looks like you are suffering from this issue: https://github.com/hazelcast/hazelcast-jet/issues/428. It is fixed in the current Jet version, where manual partition assignment is used, so even very slow processing of events won't cause a heartbeat timeout.
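Until an upgrade is possible, one stop-gap (purely a sketch; runJetPipeline() is a hypothetical stand-in for however you build and execute your DAG on the JetInstance) is to supervise the job from the application side and resubmit it when it fails:

// Hypothetical supervision loop: resubmit the Jet job whenever it fails,
// instead of letting a Kafka rebalance exception take the application down.
while (true) {
    try {
        runJetPipeline();          // hypothetical: builds the DAG and blocks until the job completes
        break;                     // finished normally
    } catch (Exception e) {
        System.err.println("Jet job failed, restarting: " + e.getMessage());
        try {
            Thread.sleep(10_000);  // back off before resubmitting
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            break;
        }
    }
}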

Elasticsearch Out of Memory Crash -- How to Delete Data?

Well, I started piping data into ES until it ran itself out of memory and crashed. I run free and I see that all memory is entirely used up.
I want to delete some data from it (old data), but I can't query against localhost:9200; it rejects the connection.
How do I fix the fact that I can't delete the old data?
If you want to go hardcore about it, you can always delete anything in your data folder:
> rm $ES_HOME/data/<clustername>
Note: replace <clustername> with your real cluster name (the default is elasticsearch)
Stop indexing. If the cluster stabilizes itself after a few minutes, try deleting the data again. If it's still stuck, stop indexing and restart the cluster.
In any case, if the nodes went OOM they need to be restarted, as the state the JVM is in is unknown.
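Once the node is back up and accepting connections again, a gentler alternative to wiping the data folder is to drop just the old indices through the delete-index API; a sketch, where the index name is a placeholder for whichever old index you want gone:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

static void deleteOldIndex() throws Exception {
    // Sketch: DELETE /<index> removes an entire index and frees its disk footprint.
    HttpClient http = HttpClient.newHttpClient();
    HttpRequest req = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:9200/old-logs-2015.01")) // placeholder index name
        .DELETE()
        .build();
    System.out.println(http.send(req, HttpResponse.BodyHandlers.ofString()).body());
}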

dncp_block_verification log file increases size in HDFS

We are using Cloudera CDH 5.3. I am facing a problem wherein the size of "/dfs/dn/current/Bp-12345-IpAddress-123456789/dncp-block-verification.log.curr" and "dncp-block-verification.log.prev" keeps increasing to TBs within hours. I have read in some blogs that this is an HDFS bug. A temporary solution to this problem is to stop the datanode services and delete these files. But we have observed that the log file grows again on either of the datanodes (even on the same node after deleting it), so it requires continuous monitoring.
Does anyone have a permanent solution to this problem?
One solution, although slightly drastic, is to disable the block scanner entirely by setting the HDFS DataNode configuration key dfs.datanode.scan.period.hours to 0 (the default is 504 hours). The negative effect of this is that your DNs may not auto-detect corrupted block files (and would need to wait for a future block-reading client to detect them instead); this isn't a big deal if your average replication is 3-ish, but you can consider the change a short-term one until you upgrade to a release that fixes the issue.
Note that this problem will not happen if you upgrade to the latest CDH 5.4.x or higher release versions, which includes the HDFS-7430 rewrite changes and associated bug fixes. These changes have done away with the use of such a local file, thereby removing the problem.
