Elasticsearch connector tasks stuck in failed and unknown status

In my Elasticsearch connector, tasks.max was set to 10, which I reduced to 7. The connector is now running with 7 tasks, but the other three tasks are stuck in "unknown" and "failed" status. I have restarted the connector, but the stale tasks still did not get removed.
How do I remove these unassigned/failed tasks?

This is a known issue in Kafka Connect. I am not sure whether there is a JIRA for it, or whether it has been fixed in versions later than the last one I used (approximately Kafka 2.6).
When you start with a higher tasks.max, that information is stored in the internal status topic of the Connect API, and it is what the status HTTP endpoint returns. When you reduce the value, the metadata for the previous tasks is still there, but it is no longer being updated.
The only real fix I have found is to delete the connector and re-create it, potentially under a new name, since the status topic is keyed by the connector name.
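For what it's worth, the delete-and-recreate step can be scripted against the Connect REST API. The sketch below assumes a Connect worker on localhost:8083 and a connector named es-sink; the connector name, topic, and connection URL are hypothetical placeholders, not taken from the question.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RecreateConnector {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String connectUrl = "http://localhost:8083"; // hypothetical Connect worker URL
        String oldName = "es-sink";                  // hypothetical connector name

        // Delete the old connector; its stale task entries stop being served.
        HttpRequest delete = HttpRequest.newBuilder()
            .uri(URI.create(connectUrl + "/connectors/" + oldName))
            .DELETE()
            .build();
        System.out.println("DELETE -> " + client.send(delete, HttpResponse.BodyHandlers.ofString()).statusCode());

        // Re-create under a new name so the old keys in the status topic are not reused.
        String body = "{ \"name\": \"es-sink-v2\", \"config\": {"
            + " \"connector.class\": \"io.confluent.connect.elasticsearch.ElasticsearchSinkConnector\","
            + " \"tasks.max\": \"7\","
            + " \"topics\": \"my-topic\","
            + " \"connection.url\": \"http://elasticsearch:9200\" } }";
        HttpRequest create = HttpRequest.newBuilder()
            .uri(URI.create(connectUrl + "/connectors"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
        System.out.println("POST -> " + client.send(create, HttpResponse.BodyHandlers.ofString()).statusCode());
    }
}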

Related

Cannot find datadog agent connected to elasticsearch

I have an issue where I have multiple host dashboards for the same Elasticsearch server. Each dashboard has its own name and its own way of collecting data. One is connected to the installed datadog-agent and the other is somehow connected to the Elasticsearch service directly.
The weird thing is that I cannot seem to find a way to turn off the agent connected directly to the ES service, other than turning off the Elasticsearch service completely.
I have tried to delete the datadog-agent completely. This stops the dashboard connected to it from receiving data (of course), but the other dashboard keeps receiving data somehow. I cannot find what is sending this data and therefore I am not able to stop it. We have multiple master and data nodes, and this is an issue for all of them. The ES version is 7.17.
Another of our clusters is running ES 6.8; we have not made the final monitoring configuration for that cluster yet, but for now it does not have this issue.
Just as extra information: the dashboard connected to the agent has the same name as the host server, while the other one only has the internal IP as its host name.
Does anyone have any idea what is running and how to stop it? I have tried almost everything I could think of.
I finally found the reason: all datadog-agents on the master and data nodes were configured not to use the node name as the host name, and cluster stats were turned on in the Elasticsearch plugin for Datadog. As a result, as long as even one datadog-agent in the cluster was running, data kept coming in to the dashboard that was not named correctly. Leaving the answer here in case anyone hits the same situation in the future.
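For reference, the cluster-stats behaviour is controlled per agent in the Elasticsearch integration configuration. This is only a sketch; the path and URL below are illustrative, and option names can differ between agent versions.

# /etc/datadog-agent/conf.d/elastic.d/conf.yaml (path varies by install)
instances:
  - url: http://localhost:9200   # illustrative local ES endpoint
    # When true, a single agent reports stats for the entire cluster,
    # which can surface metrics under an unexpected host name.
    cluster_stats: false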

Out of memory issue, suspected to be due to the suppress feature

Currently using the Kafka Streams (2.1.1) DSL suppress feature to store intermediate aggregation results.
The application receives a continuous stream and is responsible for day-window aggregation.
The application runs on a total of 9 servers; each server has enough memory (64 GB) and disk space (500 GB), and 21 GB of memory is explicitly assigned to the aggregation service alone, yet it still crashes with an OOM issue.
Suppress topic definition: application-KTABLE-SUPPRESS-STATE-STORE-0000000004-changelog PartitionCount: 100 ReplicationFactor: 5 Configs: cleanup.policy=compact
My understanding of the suppress operator is as follows:
1) Suppress does not have a state store; it relies on an in-memory buffer that is backed by a changelog topic.
2) When the suppress operator emits the final result, per many forums it sends a tombstone to the corresponding changelog topic, and the buffered record is then deleted.
On the other hand, the cleanup policy for this changelog topic is only compact, so I am not very sure how that works.
The application was rolled out to production a few days back and we are observing the OOM issue very frequently.
Below are my observations:
1) Disk space is growing very fast, as older window records are not getting deleted from application-KTABLE-SUPPRESS-STATE-STORE-0000000004-changelog.
2) Once a node hits OOM and is restarted, cached memory fills up very quickly (18-20 GB) from the aggregation service, which is not anticipated given the low volume.
3) The underlying changelog topic (application-KTABLE-SUPPRESS-STATE-STORE-0000000004-changelog) for the suppress feature has no retention period by default, and it re-emits older records even though the window has already advanced. I observed this when a node crashed due to the memory issue and was restarted. I am wondering why the changelog still keeps records for older windows even though the window closed after a day; probably because cleanup.policy is only compact?
I am using Kafka Streams 2.1.1 and found a bug registered against Kafka Streams which is fixed in 2.2.1 and later releases:
OutOfMemoryError when restarting my Kafka Streams application
Kafka Streams State Store Unrecoverable from Change Log Topic
In order to remediate the issue I am planning the following:
1) Reset the Kafka Streams application with the application reset tool, which deletes the internal topics.
2) Clean the Kafka Streams state store.
3) Upgrade the Kafka Streams version to 2.4.0, hoping it is stable.
Please let me know if you have other views on the OOM issue.
Pseudocode:
KTable<Windowed<String>, JsonNode> aggregateTable = transactions
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofSeconds(windowDuration))
        .grace(Duration.ofSeconds(windowGraceDuration)))
    .aggregate(
        () -> new AggregationService().initialize(),
        (key, transaction, previousStats) ->
            new AggregationService().buildAggregation(key, transaction, previousStats, runByUnit),
        Materialized.<String, JsonNode, WindowStore<Bytes, byte[]>>as(statStoreName)
            .withRetention(Duration.ofSeconds(windowDuration + windowGraceDuration + windowRetentionDuration))
            .withKeySerde(Serdes.String())
            .withValueSerde(jsonSerde))
    // hold all results in an unbounded in-memory buffer until the window closes
    .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()));
Thank you for your help.
I am using Kafka Streams 2.1.1 and found a bug related to suppress which was resolved in 2.2.1 and later releases:
https://issues.apache.org/jira/plugins/servlet/mobile#issue/KAFKA-7895
Please let me know whether 2.3.0 can solve this issue.
I observed a similar issue: suppress emitting records that belonged to an older window, multiple times, mainly after a node restart. The out-of-memory issue appears related, given the high volume of older plus current window records.
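One mitigation worth noting (my own sketch, not part of the original post) is to bound the suppress buffer instead of using BufferConfig.unbounded(). Because untilWindowCloses() only accepts a strict buffer config, the bounded variant shuts the application down when the cap is reached rather than emitting early, but it prevents the buffer from silently growing until the JVM runs out of memory:

// Replace BufferConfig.unbounded() in the snippet above with a capped buffer.
// untilWindowCloses() requires a StrictBufferConfig, so the only allowed
// reaction to a full buffer is to shut down (no early emission).
.suppress(Suppressed.untilWindowCloses(
    Suppressed.BufferConfig.maxBytes(100L * 1024 * 1024).shutDownWhenFull()))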

Restart Hazelcast Jet (v0.4) when there is an exception

We are using Hazelcast Jet version 0.4 to read messages from a Kafka source, process the messages, and write them back to Kafka.
Since Kafka is managed by an external team, we cannot control the various exceptions thrown by Kafka.
For example, we receive the following exception: Commit cannot be completed since the group has already rebalanced and assigned the partitions
When we receive this error, the Hazelcast Jet instance is shut down, so our application becomes unusable and we have to restart it.
We are looking at possibilities to restart the Jet instance automatically on these errors.
Thanks for your help!
It looks like you are suffering from this issue: https://github.com/hazelcast/hazelcast-jet/issues/428. It is fixed in the current Jet version, which uses manual partition assignment, so even very slow processing of the events won't cause a heartbeat timeout.
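If upgrading is not immediately possible, a simple supervisor loop around the job submission can at least restart it after a failure. This is a generic sketch rather than Jet-specific code; submitJob() is a hypothetical placeholder for whatever builds the Jet instance, submits the DAG, and blocks until it finishes.

import java.util.concurrent.TimeUnit;

public class JobSupervisor {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            try {
                submitJob();   // hypothetical: create the Jet instance, run the DAG, wait for completion
                break;         // the job completed normally, stop supervising
            } catch (Exception e) {
                System.err.println("Job failed, restarting in 10s: " + e.getMessage());
                TimeUnit.SECONDS.sleep(10);   // back off before resubmitting
            }
        }
    }

    // Placeholder for the actual submission code; wire in your Jet setup here.
    static void submitJob() throws Exception {
        throw new UnsupportedOperationException("not implemented in this sketch");
    }
}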

Understanding the logstash retry policy

I have Kibana and an Elasticsearch instance running on one machine. Logstash and Filebeat are running on another machine.
The flow is working perfectly fine. I have one doubt that I need to understand. I brought Elasticsearch down and had Logstash pump some logs towards Elasticsearch. Since Elasticsearch was down, I was expecting the data to be lost. But when I brought the Elasticsearch service back up, Kibana was able to show the logs that were sent while Elasticsearch was down.
When I googled, I learned that Logstash retries the connection while Elasticsearch is down.
May I please know how to set this parameter?
The reason is that the elasticsearch output implements exponential backoff using two parameters:
retry_initial_interval
retry_max_interval
If a bulk call fails, Logstash will wait retry_initial_interval seconds and try again. If it still fails, it will wait 2 * retry_initial_interval and try again, and so on, until the wait time reaches retry_max_interval, at which point it will keep trying every retry_max_interval seconds indefinitely.
Note that this retry policy only applies when ES is unreachable. If there is another error, such as a mapping error (HTTP 400) or a conflict (HTTP 409), the bulk call will not be retried.
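For reference, both settings go on the elasticsearch output block in the pipeline configuration. The host and values below are illustrative only; check the plugin documentation for your version's defaults.

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]   # illustrative ES endpoint
    retry_initial_interval => 2          # seconds to wait after the first failed bulk call, doubled each retry
    retry_max_interval => 64             # upper bound on the wait between retries
  }
}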

Novell eDirectory: Error while adding replica on new server

I want to add a replica of our whole eDirectory tree to a new server (OES 11.2, SLES 11.3).
I wanted to do this via iManager (Partitions and Replicas / Replica View / Add Replica).
Everything looks normal. I can see our other servers with their replicas and, of course, the server holding the master replica.
For additional information: I have done this many times without problems until now.
When I want to add a replica to the new server, I get the following error: (Error -636) The server is unreachable.
I checked the /etc/hosts file and the network settings on both servers.
Ndsrepair looks normal too. All servers are in sync and there are no connection errors. The replica depth of the new server is -1, which I understand, because there is no replica on it yet.
But if I can connect from one server to another and there are no error messages, why does adding a replica not work?
I also tried to make a LAN trace, but did not get any information that would help me here. In the trace the communication seems normal!
Am I forgetting something here?
Every server in our environment runs OES 11.2 except the master server, which runs OES 11.1.
Thanks for your help!
Daniel
Nothing is wrong.
Error -636 means that the replica is not yet available on the new server. Once the synchronization finishes, the replica will be ready and available. Depending on the size of the tree and the communication channel, this can take up to several hours.
