Kafka Streams - volume-mounted state stores cause "Failed to delete the state directory" errors

Short overview:
I have a service written in Scala using Kafka Streams, running inside a dedicated Docker container.
To keep the state store directories where I want them, I created a volume mapping the state store directory inside the container to a directory outside it. Once I did that, I started seeing exceptions like this in the container logs:
Failed to delete the state directory.
java.nio.file.DirectoryNotEmptyException: /usr/src/app/stateStores/MyServiceStreams___20/0_0
at sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:242)
at sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103)
at java.nio.file.Files.delete(Files.java:1126)
at org.apache.kafka.common.utils.Utils$2.postVisitDirectory(Utils.java:763)
at org.apache.kafka.common.utils.Utils$2.postVisitDirectory(Utils.java:746)
at java.nio.file.Files.walkFileTree(Files.java:2688)
at java.nio.file.Files.walkFileTree(Files.java:2742)
at org.apache.kafka.common.utils.Utils.delete(Utils.java:746)
at org.apache.kafka.streams.processor.internals.StateDirectory.cleanRemovedTasks(StateDirectory.java:290)
at org.apache.kafka.streams.processor.internals.StateDirectory.cleanRemovedTasks(StateDirectory.java:253)
at org.apache.kafka.streams.KafkaStreams$2.run(KafkaStreams.java:795)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
It doesn't seem to affect any functionality, it just spams the logs.
Analysing the logs shows the following line just before the error message:
Deleting obsolete state directory 0_0 for task 0_0 as 79998936ms has elapsed (cleanup delay is 6000000ms)
Increasing the state.cleanup.delay.ms parameter didn't help.
[Edit]
Some technical details:
The /usr/src/app/stateStores/ directories are owned by root.
From inside the container (running as root) the directories can be removed; from outside they cannot (not root).
kafka-streams version: 2.1.0
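For reference, this is roughly how the relevant Streams settings are supplied in the service (a trimmed sketch, not the full code; the bootstrap servers value is a placeholder and the application id is inferred from the state directory path in the logs):
import java.util.Properties
import org.apache.kafka.streams.StreamsConfig

val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "MyServiceStreams___20") // inferred from the state dir path in the logs
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092")         // placeholder
props.put(StreamsConfig.STATE_DIR_CONFIG, "/usr/src/app/stateStores")   // the volume-mounted directory
props.put(StreamsConfig.STATE_CLEANUP_DELAY_MS_CONFIG, "6000000")       // delay before idle task directories are cleaned up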
Please assist

Related

Apache MiNiFi - PutElasticsearch

I built a flow that processes real-time data from a local server and sends the relevant data to Elasticsearch. I use MiNiFi, but when I run MiNiFi it returns the following error.
Does anyone know where the issue is?
Thanks
ERROR [Timer-Driven Process Thread-10] o.a.n.p.elasticsearch.PutElasticsearch5 PutElasticsearch5[id=4ed70cbe-9838-35cd-0000-000000000000] PutElasticsearch5[id=4ed70cbe-9838-35cd-0000-000000000000] failed to process due to java.lang.NoClassDefFoundError: Could not initialize class org.elasticsearch.Version; rolling back session: {}
java.lang.NoClassDefFoundError: Could not initialize class org.elasticsearch.Version
at org.elasticsearch.common.io.stream.StreamOutput.<init>(StreamOutput.java:73)
at org.elasticsearch.common.io.stream.BytesStreamOutput.<init>(BytesStreamOutput.java:60)
at org.elasticsearch.common.io.stream.BytesStreamOutput.<init>(BytesStreamOutput.java:57)
at org.elasticsearch.common.io.stream.BytesStreamOutput.<init>(BytesStreamOutput.java:47)
at org.elasticsearch.common.xcontent.XContentBuilder.builder(XContentBuilder.java:67)
at org.elasticsearch.common.settings.Setting.arrayToParsableString(Setting.java:698)
at org.elasticsearch.common.settings.Setting.lambda$listSetting$26(Setting.java:656)
at org.elasticsearch.common.settings.Setting$2.getRaw(Setting.java:660)
at org.elasticsearch.common.settings.Setting.get(Setting.java:300)
at org.elasticsearch.plugins.PluginsService.<init>(PluginsService.java:164)
at org.elasticsearch.client.transport.TransportClient.newPluginService(TransportClient.java:81)
at org.elasticsearch.client.transport.TransportClient.buildTemplate(TransportClient.java:106)
at org.elasticsearch.client.transport.TransportClient.<init>(TransportClient.java:228)
at org.elasticsearch.transport.client.PreBuiltTransportClient.<init>(PreBuiltTransportClient.java:69)
at org.elasticsearch.transport.client.PreBuiltTransportClient.<init>(PreBuiltTransportClient.java:65)
at org.apache.nifi.processors.elasticsearch.AbstractElasticsearch5TransportClientProcessor.getTransportClient(AbstractElasticsearch5TransportClientProcessor.java:230)
at org.apache.nifi.processors.elasticsearch.AbstractElasticsearch5TransportClientProcessor.createElasticsearchClient(AbstractElasticsearch5TransportClientProcessor.java:170)
at org.apache.nifi.processors.elasticsearch.AbstractElasticsearch5Processor.setup(AbstractElasticsearch5Processor.java:94)
at org.apache.nifi.processors.elasticsearch.PutElasticsearch5.onTrigger(PutElasticsearch5.java:177)
at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1122)
at org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:147)
at org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:47)
at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:128)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
To keep its footprint small, MiNiFi Java ships with only the standard bundle of processors. To use other processors that are present in a standard NiFi deployment, you need to put the appropriate NAR file into the "lib" directory of the MiNiFi deployment.
For "PutElasticSearch" you need "nifi-elasticsearch-nar-<version>.nar", where "<version>" is the version of NiFi that your version of MiNiFi is built from. Version 0.4.0 of MiNiFi Java uses NiFi 1.5.0.
For more information, and a list of the processors that do come bundled with MiNiFi out of the box, see the "MiNiFi Java Agent Quick Start" documentation, section "Using Processors Not Packaged with MiNiFi" [1]. For more information on how the different versions of MiNiFi correspond to versions of the NiFi framework, see [2].
[1] https://nifi.apache.org/minifi/minifi-java-agent-quick-start.html
[2] https://cwiki.apache.org/confluence/display/MINIFI/MiNiFi+Versioning+and+Toolkit+Compatibility
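For example, assuming a default MiNiFi layout (the NAR filename and install path below are illustrative; use the NAR matching the NiFi version your MiNiFi is built from, and restart MiNiFi afterwards so it is picked up):
cp nifi-elasticsearch-nar-1.5.0.nar <minifi-home>/lib/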

Spark jobs failing because HDFS is caching jars

I upload Scala / Spark jars to HDFS to test them on our cluster. After running, I frequently realize there are changes that need to be made. So I make the changes locally and then push the new jar back up to HDFS. However, often (not always) when I do this, Hadoop throws an error essentially saying that this jar is not the same as the old jar (duh).
I tried clearing my Trash, .staging, and .sparkstaging directories, but that doesn't do anything. I tried renaming the jar, which works sometimes and other times doesn't (it's still ridiculous that I have to do this in the first place).
Does anyone know why this is occurring and how I can prevent it? Thanks for any help. Here are some logs, if that helps (some paths edited out):
Application application_1475165877428_124781 failed 2 times due to AM Container for appattempt_1475165877428_124781_000002 exited with exitCode: -1000
For more detailed output, check application tracking page: http://examplelogsite/ Then, click on links to logs of each attempt.
Diagnostics: Resource MYJARPATH/EXAMPLE.jar changed on src filesystem (expected 1475433291946, was 1475433292850)
java.io.IOException: Resource MYJARPATH/EXAMPLE.jar changed on src filesystem (expected 1475433291946, was 1475433292850)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Failing this attempt. Failing the application.
I haven't seen that exit code before, so on its own it doesn't tell me anything. I would suggest you check the logs, like this:
yarn logs -applicationId <your_application_ID>
According to your log, I'm sure it comes from the YARN side.
As a workaround, you can modify YARN yourself to skip this exception.
I ran into this thread because of the "changed on src filesystem" error; I hit the same issue and skipped it by modifying the YARN source code.
For more details, you can refer to how-to-fix-resource-changed-on-src-filesystem-issue

Getting ClusterBlockException while running queries using node client

My Elasticsearch cluster (version 2.0) is started and the node client is built successfully, but for some reason I'm getting the following error while running queries using the node client.
20:15:15.479 [Pool:entitytaskscheduler: Thread#1] DEBUG c.b.o.e.t.c.DataCollectorStatusUpdateTask - collectors updated due to agent reconnected:{}
ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];]
at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:154)
at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedRaiseException(ClusterBlocks.java:144)
at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.<init>(TransportSearchTypeAction.java:116)
at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.<init>(TransportSearchQueryThenFetchAction.java:73)
at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.<init>(TransportSearchQueryThenFetchAction.java:67)
at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction.doExecute(TransportSearchQueryThenFetchAction.java:64)
at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction.doExecute(TransportSearchQueryThenFetchAction.java:53)
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:70)
at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:99)
at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:44)
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:70)
at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:58)
at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:347)
at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:85)
at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:59)
at com.hidden.ppp.management.dc.DataCollectorPollStatusDAOESImpl.findDCIdsUpdatedInTime(DataCollectorPollStatusDAOESImpl.java:151)
at com.hidden.ppp.engine.taskexecutor.cptaskexecs.DataCollectorStatusUpdateTask.execute(DataCollectorStatusUpdateTask.java:199)
at com.hidden.ppp.engine.taskexecutor.cptaskexecs.DataCollectorStatusUpdateTaskRunner.run(DataCollectorStatusUpdateTaskRunner.java:27)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
20:15:15.558 [Pool:entitytaskscheduler: Thread#1] WARN c.b.o.m.d.DataCollectorPollStatusDAOESImpl - blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
20:15:15.558 [Pool:entitytaskscheduler: Thread#1] DEBUG c.b.o.e.t.c.DataCollectorStatusUpdateTask - collectors for which polls updated after epoc time:1453128243336 - dcids: []
ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];]
at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:154)
at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedRaiseException(ClusterBlocks.java:144)
at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.<init>(TransportSearchTypeAction.java:116)
at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.<init>(TransportSearchQueryThenFetchAction.java:73)
at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.<init>(TransportSearchQueryThenFetchAction.java:67)
at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction.doExecute(TransportSearchQueryThenFetchAction.java:64)
at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction.doExecute(TransportSearchQueryThenFetchAction.java:53)
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:70)
at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:99)
at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:44)
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:70)
at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:58)
at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:347)
at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:85)
at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:59)
at com.hidden.ppp.management.dc.DataCollectorPollStatusDAOESImpl.findDCIdsNotUpdatedInTime(DataCollectorPollStatusDAOESImpl.java:182)
at com.hidden.ppp.engine.taskexecutor.cptaskexecs.DataCollectorStatusUpdateTask.execute(DataCollectorStatusUpdateTask.java:204)
at com.hidden.ppp.engine.taskexecutor.cptaskexecs.DataCollectorStatusUpdateTaskRunner.run(DataCollectorStatusUpdateTaskRunner.java:27)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
I've even disabled the "multicast" as per this post - still no luck. Surprisingly, I could access the elasticsearch from sense. Any clues on what is going wrong ?
I faced the same error message and was not able to understand the problem at first. I was developing a node client Java application on my laptop, using an Elasticsearch data node on a remote server. For production use, I needed to deploy the Java application on this remote server.
I configured the Java application to talk to the local host only (being on the same host now):
elasticsearch.discovery.zen.ping.unicast.hosts=127.0.0.1
And got the same exception
ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];]
Looking at the logs I also found this entry:
[WARN] [TP-Processor2] DiscoveryService.waitForInitialState -> [cerbera] waited for 30s and no initial state was set by the discovery
So basically, the question was: why doesn't it find the Elasticsearch data node? I changed port ranges and also played with the multicast setting, without success.
Finally, I checked elasticsearch.yml and found that the data node was not listening on localhost (127.0.0.1), but on the ethernet interface 192.168.1.2.
network.host: 192.168.1.2
http.port: 9200
The final change was simple: I just needed to reconfigure the node client to talk to the correct interface.
elasticsearch.discovery.zen.ping.unicast.hosts=192.168.1.2
Now my node client is talking to Elasticsearch via the correct interface. Job done.
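For context, this is roughly how those settings end up in an Elasticsearch 2.x node client (a sketch assuming a Scala application; the cluster name is a placeholder, and the unicast hosts value corresponds to the property shown above):
import org.elasticsearch.common.settings.Settings
import org.elasticsearch.node.NodeBuilder

val settings = Settings.settingsBuilder()
  .put("cluster.name", "my-cluster") // placeholder: must match the data node's cluster.name
  .put("discovery.zen.ping.unicast.hosts", "192.168.1.2") // the interface the data node actually listens on
  .build()

val node = NodeBuilder.nodeBuilder()
  .settings(settings)
  .client(true) // node client: joins the cluster but holds no data
  .node()

val client = node.client() // use this Client for queries; close the node on shutdown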
I had the same problem (using k8s). I finally replaced my Elastic image and the issue was solved.
I moved from 6.5.4-debian-9-r41 to 6.8.16-debian-10-r5 (using Bitnami images).
I know it is not the best answer, but I really tried the suggested answers and nothing worked for me, so my recommendation is to update to a newer, better version (Docker makes that easy :)).

Apache Spark EC2 job not running. No space left on device

I have been running my program multiple times on a 20-node cluster. All of a sudden, every time I run the program I get the following error:
15/04/19 16:52:35 WARN scheduler.TaskSetManager: Lost task 35.0 in stage 9.0 (TID 384, ip-XXX.XXX.compute.internal): java.io.FileNotFoundException: /mnt/spark/spark-local-XXX-ebd3/18/shuffle_2_35_64 (No space left on device)
java.io.FileOutputStream.open(Native Method)
java.io.FileOutputStream.<init>(FileOutputStream.java:221)
org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123)
org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:192)
org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:67)
org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:65)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:65)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Checking the UI, it says there's absolutely nothing on the nodes. I have run the program maybe 15 times, and this has only just started happening. Why has this occurred out of the blue, and how do I fix it?
"No space left on device" is a quite clear exception: That node has no space left on the mount where the spark local files are being written: /mnt/spark/
Solution: go to the node (or nodes) and clean that up. rm -rf FTW.
If jobs are breaking before they terminate, due to manual intervention or failure, they will often leave temp data behind.
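For reference, the location of those local files is governed by spark.local.dir / SPARK_LOCAL_DIRS. A minimal sketch with placeholder paths (note that on standalone/EC2 and YARN clusters the environment variables set by the cluster manager override the per-application setting):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("my-job") // placeholder
  .set("spark.local.dir", "/mnt/spark,/mnt2/spark") // comma-separated list of scratch directories for shuffle and spill files

val sc = new SparkContext(conf)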

Storm-YARN : Application container fails to launch

I am running a Storm (Trident) topology that reads Avro from Kafka and writes the records to HBase.
The topology runs as expected in LocalCluster mode, but when using StormSubmitter I'm facing the issues below.
In distributed Hadoop mode I get the error [1] below while launching the YARN application.
In Hadoop local mode (with 1 box only), YARN spawns the Nimbus server and the Storm UI, but there are no supervisors running to execute the spouts/bolts in the topology. I guess the reason might be insufficient memory (4 GB to run the topology plus HBase, HDFS, Kafka, ZooKeeper, etc.).
Can you help me understand the reason for this container failure? There are no errors or info in the application logs.
[1] The YARN container fails to launch with the below error when running:
storm-yarn launch /homext/storm-yarn.yml --queue default -appname storm-yarn-demo --stormZip /tmp/storm-0.9.zip
Application application_1415038356032_0304 failed 2 times due to AM Container for appattempt_1415038356032_0304_000002 exited with exitCode: 127 due to: Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException:
org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:505)
at org.apache.hadoop.util.Shell.run(Shell.java:418)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 127
Failing this attempt. Failing the application.
This log is insufficient to diagnose the problem; all it says is that the container failed to launch. You should look at the container output. Check ${yarn.nodemanager.log-dirs} on the nodes: there will be an application folder (application_1415038356032_0304), and inside it a container folder for each attempt (...1415038356032_0304_000002) containing the stderr, stdout and syslog of that attempt. Read those and you'll likely identify the problem.
If these don't exist, look in ${yarn.nodemanager.local-dirs}; there you'll find the container launch script (I think it's called container-launch.sh) for this app/container attempt. It contains the actual command used to launch the container. Try running that from a shell prompt and see what you get.
If it fails at an early stage then the logs can be found in HDFS under:
/tmp/logs/<user>/logs/
This should give enough information to diagnose the problem.
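If log aggregation is enabled on the cluster, the aggregated logs in that HDFS directory can also be pulled with the YARN CLI (application id taken from the error above):
yarn logs -applicationId application_1415038356032_0304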
In my case I found a log file:
/tmp/logs/hdfs/logs/application_1426618997634_0004/vagrant-cdh-node4_8041
With some errors like:
/bin/bash: /usr/lib/jvm/java-7-oracle/bin/java: No such file or directory
And fixing the JAVA_HOME environment variable did the trick.
