Spring Cloud Data Flow stream cannot be deployed in Kubernetes

I followed the installation guide for Spring Cloud Data Flow to install it on an Azure Kubernetes Service cluster with kubectl. I use Kafka as the message broker and I created a simple stream, time | log.
The stream cannot be deployed; I am attaching the logs, which I can't fully understand.
PS> kubectl get pods
NAME                                   READY   STATUS    RESTARTS   AGE
grafana-7d7d77d54-m59dx                1/1     Running   0          5h36m
kafka-broker-64bfd5d6b5-9c7ld          1/1     Running   0          25m
kafka-zk-768b548468-mhrrn              1/1     Running   0          145m
mysql-9dbdc88c6-xz4hh                  1/1     Running   0          21h
prometheus-64b45b746-zs7z4             1/1     Running   0          5h37m
prometheus-proxy-6764bf4968-4xjz5      1/1     Running   0          28m
scdf-server-7f864c96b7-s8cmm           1/1     Running   0          62m
skipper-7fbd7f47cd-b92v4               1/1     Running   0          6h13m
test-stream-log-v9-ffcd9d55f-8p96j     0/1     Running   13         68m
test-stream-time-v9-6c47699d94-pfzkr   0/1     Running   13         68m
Time app log: https://pastebin.com/JyS8azVk
Log app log: https://pastebin.com/pCe1NqSn
Kafka log: https://pastebin.com/Dj5KfVsQ

From the attached time-source log, this is the key part:
2019-12-19 21:15:23.963 ERROR 1 --- [ main] o.s.cloud.stream.binding.BindingService : Failed to create producer binding; retrying in 30 seconds
org.springframework.cloud.stream.provisioning.ProvisioningException: Provisioning exception; nested exception is java.util.concurrent.TimeoutException
at org.springframework.cloud.stream.binder.kafka.provisioning.KafkaTopicProvisioner.createTopic(KafkaTopicProvisioner.java:290) ~[spring-cloud-stream-binder-kafka-core-2.1.4.RELEASE
This indicates that the Spring Cloud Stream Kafka binder's provisioner is unable to create the topic required by the producer (i.e., the time source).
Based on your kubectl get pods output, however, it appears that Kafka and ZooKeeper were deployed only recently, whereas Skipper has been up for more than 6 hours.
It is likely that you deployed the components in the wrong order, or that you reprovisioned Kafka and the resulting IP/host/port changes are not yet reflected in Skipper's deployment. Skipper keeps track of the Kafka credentials in its ConfigMap, and all the stream applications it deploys (via SCDF) automatically receive those credentials at deployment time.
My guess is that the credentials the applications received changed when you reprovisioned Kafka/ZooKeeper; you could compare the values to confirm. I'd suggest bouncing the Skipper deployment so it picks up the latest values in its ConfigMap, or wiping everything and following the deployment order described in the docs from the beginning.
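For example, a minimal way to compare and bounce things with kubectl (the ConfigMap, Service, and Deployment names below are assumptions based on your pod names and the kubectl-based install guide; check kubectl get configmaps,svc,deployments for the real ones):
# see which Kafka/ZooKeeper addresses Skipper hands to the apps it deploys
kubectl describe configmap skipper
# compare them with the current Kafka and ZooKeeper service endpoints
kubectl get svc kafka-broker kafka-zk
# bounce Skipper so it picks up the current values, then undeploy/redeploy the stream from SCDF
kubectl rollout restart deployment/skipper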

Related

AccessDeniedException when deleting a topic on Windows Kafka

I just installed Kafka (from the Confluent Platform) on my Windows machine. I started up ZooKeeper and Kafka, and creating topics, producing to them, and consuming from them all work. However, as soon as I delete a topic, Kafka crashes like this:
PS C:\confluent-4.1.1> .\bin\windows\kafka-topics.bat --zookeeper 127.0.0.1:2181 --topic foo --create --partitions 1 --replication-factor 1
Created topic "foo".
PS C:\confluent-4.1.1> .\bin\windows\kafka-topics.bat --zookeeper 127.0.0.1:2181 --topic foo --delete
Topic foo is marked for deletion.
Note: This will have no impact if delete.topic.enable is not set to true.
This is the crash output:
[2018-06-08 09:44:54,185] ERROR Error while renaming dir for foo-0 in log dir C:\confluent-4.1.1\data\kafka (kafka.server.LogDirFailureChannel)
java.nio.file.AccessDeniedException: C:\confluent-4.1.1\data\kafka\foo-0 -> C:\confluent-4.1.1\data\kafka\foo-0.cf697a92ed5246c0977bf9a279f15de8-delete
at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:83)
at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:387)
at sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:287)
at java.nio.file.Files.move(Files.java:1395)
at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:697)
at kafka.log.Log$$anonfun$renameDir$1.apply$mcV$sp(Log.scala:579)
at kafka.log.Log$$anonfun$renameDir$1.apply(Log.scala:577)
at kafka.log.Log$$anonfun$renameDir$1.apply(Log.scala:577)
at kafka.log.Log.maybeHandleIOException(Log.scala:1678)
at kafka.log.Log.renameDir(Log.scala:577)
at kafka.log.LogManager.asyncDelete(LogManager.scala:828)
at kafka.cluster.Partition$$anonfun$delete$1.apply(Partition.scala:240)
at kafka.cluster.Partition$$anonfun$delete$1.apply(Partition.scala:235)
at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:250)
at kafka.utils.CoreUtils$.inWriteLock(CoreUtils.scala:258)
at kafka.cluster.Partition.delete(Partition.scala:235)
at kafka.server.ReplicaManager.stopReplica(ReplicaManager.scala:347)
at kafka.server.ReplicaManager$$anonfun$stopReplicas$2.apply(ReplicaManager.scala:377)
at kafka.server.ReplicaManager$$anonfun$stopReplicas$2.apply(ReplicaManager.scala:375)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at kafka.server.ReplicaManager.stopReplicas(ReplicaManager.scala:375)
at kafka.server.KafkaApis.handleStopReplicaRequest(KafkaApis.scala:205)
at kafka.server.KafkaApis.handle(KafkaApis.scala:116)
at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:69)
at java.lang.Thread.run(Thread.java:748)
Suppressed: java.nio.file.AccessDeniedException: C:\confluent-4.1.1\data\kafka\foo-0 -> C:\confluent-4.1.1\data\kafka\foo-0.cf697a92ed5246c0977bf9a279f15de8-delete
at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:83)
at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:301)
at sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:287)
at java.nio.file.Files.move(Files.java:1395)
at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:694)
... 23 more
[2018-06-08 09:44:54,187] INFO [ReplicaManager broker=0] Stopping serving replicas in dir C:\confluent-4.1.1\data\kafka (kafka.server.ReplicaManager)
[2018-06-08 09:44:54,192] INFO [ReplicaFetcherManager on broker 0] Removed fetcher for partitions (kafka.server.ReplicaFetcherManager)
[2018-06-08 09:44:54,193] INFO [ReplicaAlterLogDirsManager on broker 0] Removed fetcher for partitions (kafka.server.ReplicaAlterLogDirsManager)
[2018-06-08 09:44:54,195] INFO [ReplicaManager broker=0] Broker 0 stopped fetcher for partitions and stopped moving logs for partitions because they are in the failed log directory C:\confluent-4.1.1\data\kafka. (kafka.server.ReplicaManager)
[2018-06-08 09:44:54,195] INFO Stopping serving logs in dir C:\confluent-4.1.1\data\kafka (kafka.log.LogManager)
[2018-06-08 09:44:54,197] ERROR Shutdown broker because all log dirs in C:\confluent-4.1.1\data\kafka have failed (kafka.log.LogManager)
[2018-06-08 09:44:54,198] INFO [ReplicaFetcherManager on broker 0] Removed fetcher for partitions (kafka.server.ReplicaFetcherManager)
The user running Zookeeper and Kafka has full access rights to C:\confluent-4.1.1\data\kafka.
What am I missing?
I know I'm late to the party, but keep in mind that even if you delete your topic manually or via some Kafka UI and you delete all of the Kafka logs, Kafka may still not start because of the state it syncs with ZooKeeper.
So make sure you also clean up the ZooKeeper state by deleting ZooKeeper's data directory.
Please be aware that these actions are irreversible. Also run the shell as Administrator.
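A minimal sketch of that cleanup on the layout from the question, run in PowerShell after stopping Kafka and ZooKeeper (the data\kafka path comes from the error above; the ZooKeeper data path is a guess, so confirm dataDir in your zookeeper.properties before deleting anything):
PS C:\confluent-4.1.1> Remove-Item -Recurse -Force .\data\kafka
PS C:\confluent-4.1.1> Remove-Item -Recurse -Force .\data\zookeeper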
I had a similar problem, and it happens only under Windows; see KAFKA-1194, which still applies to Kafka 1.1.0.
The only workaround available is to disable the log cleaner: log.cleaner.enable = false
For local development under Windows you can ignore this issue, since it does not occur on other operating systems.
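That is, in your broker's server.properties (etc\kafka\server.properties in the Confluent layout, config\server.properties in a plain Kafka install), then restart the broker:
log.cleaner.enable=false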
I had a similar problem after deleting a topic. I had to go to the topic's location and delete it manually, and that worked.
/tmp/kafka-logs/[yourTopicName]
I am not sure if the same will work for you, as I am also new to Kafka.
1- Stop ZooKeeper and the Kafka server.
2- Go to the 'kafka-logs' folder; there you will see the list of Kafka topic folders. Delete the folder with the topic name.
3- Go to the 'zookeeper-data' folder and delete the data inside it.
4- Start ZooKeeper and the Kafka server again.
Note: if you get a "The Cluster ID xxxxxxxxxx doesn't match stored clusterId" error, you have to delete all files in Kafka's log dir. A sketch of these steps is shown below.
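A sketch of those steps for a Windows install, assuming the kafka-logs and zookeeper-data folders used above sit next to the Kafka install (check log.dirs in config\server.properties and dataDir in config\zookeeper.properties for your actual paths):
.\bin\windows\kafka-server-stop.bat
.\bin\windows\zookeeper-server-stop.bat
Remove-Item -Recurse -Force .\kafka-logs
Remove-Item -Recurse -Force .\zookeeper-data
.\bin\windows\zookeeper-server-start.bat .\config\zookeeper.properties
.\bin\windows\kafka-server-start.bat .\config\server.properties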
Problem:
I had a similar problem after deleting a topic. ZooKeeper started successfully, but while running Kafka I got the issue mentioned above.
Analysis:
In my case I had redirected the Kafka logs to a new folder location, C:\Tools\kafka_2.13-2.6.0\kafka-test-logs, but I forgot to create the kafka-test-logs folder. In that case Kafka auto-creates a default folder from the provided path name, e.g. Toolskafka_2.13-2.6.0kafka-test-logs. So even deleting that logs folder did not work in my case.
Solution:
First I stopped ZooKeeper. Then I created the kafka-test-logs folder I had forgotten earlier, deleted the auto-created default Kafka logs folder, and restarted ZooKeeper and the Kafka server. That's all it took for me.
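A minimal sketch of that fix, using the folder path from this answer:
mkdir C:\Tools\kafka_2.13-2.6.0\kafka-test-logs
and in config\server.properties point log.dirs at it (forward slashes avoid backslash-escaping surprises in .properties files, which is likely how the mangled Toolskafka_2.13-2.6.0kafka-test-logs folder appeared):
log.dirs=C:/Tools/kafka_2.13-2.6.0/kafka-test-logs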
Thank you!! Cheers and Happy Coding.
I was also facing the same issue and resolved it by downloading the following version of Kafka:
Version 2.8.1
Then I changed the zookeeper.properties file in the config folder to
dataDir=C:/kafka/zookeeper
and the server.properties file in the config folder to
log.dirs=C:/kafka/kafka-logs
Make sure your Kafka folder is extracted and stored on the C:/ drive, or else amend the paths in the config properties accordingly.

Spark Streaming: how to cleanly restart a Spark Streaming job that uses an HDFS checkpoint directory

We have a Spark Streaming job that reads data from Kafka, running on a 4-node cluster, and it uses a checkpoint directory on HDFS. We hit an I/O error when we ran out of space, and we had to go in and delete a few HDFS folders to free some up; we now have bigger disks mounted. We want to restart cleanly, with no need to preserve checkpoint data or Kafka offsets, but we are getting this error:
Application application_1482342493553_0077 failed 2 times due to AM Container for appattempt_1482342493553_0077_000002 exited with exitCode: -1000
For more detailed output, check application tracking page:http://hdfs-name-node:8088/cluster/app/application_1482342493553_0077Then, click on links to logs of each attempt.
Diagnostics: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1266542908-96.118.179.119-1479844615420:blk_1073795938_55173 file=/user/hadoopuser/streaming_2.10-1.0.0-SNAPSHOT.jar
Failing this attempt. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1484420770001
final status: FAILED
tracking URL: http://hdfs-name-node:8088/cluster/app/application_1482342493553_0077
user: hadoopuser
From the error, what I can make out is that it is still looking for the old HDFS blocks which we deleted.
From research I found that changing the checkpoint directory should help. I tried changing it and pointing it to a new directory, but that still doesn't let Spark restart from a clean slate; it still throws the same block exception. Are we missing anything while making the configuration changes? And how can we make sure that Spark starts from a clean slate?
Also, this is how we are setting the checkpoint directory:
val ssc = new StreamingContext(sparkConf, Seconds(props.getProperty("spark.streaming.window.seconds").toInt))
ssc.checkpoint(props.getProperty("spark.checkpointdir"))
val sc = ssc.sparkContext
The current checkpoint directory in the property file is:
spark.checkpointdir:hdfs://hadoopuser#hdfs-name-node:8020/user/hadoopuser/.checkpointDir1
Previously it was:
spark.checkpointdir:hdfs://hadoopuser#hdfs-name-node:8020/user/hadoopuser/.checkpointDir
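For reference, a sketch of wiping the old checkpoint state, using the paths from the property file above (this discards all checkpointed state, which is the stated intent here):
hdfs dfs -rm -r /user/hadoopuser/.checkpointDir
hdfs dfs -rm -r /user/hadoopuser/.checkpointDir1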

Can't get kafka console producer or consumer to work

I was able to get Kafka to work fine when I spun it up on my local machine, but when I try to get it to work on an AWS instance nothing seems to work right. I tried spinning up my own server and doing just what I did locally, starting ZooKeeper and Kafka like so:
curl http://apache.spinellicreations.com/kafka/0.10.0.0/kafka_2.11-0.10.0.0.tgz | tar -xz
cd kafka_2.11-0.10.0.0
bin/zookeeper-server-start.sh config/zookeeper.properties &
bin/kafka-server-start.sh config/server.properties &
I also tried using the AMI from Bitnami, which seems to be an all-in-one AMI. Creating the topic seems to work fine, but when I try to run the console producer I get an error:
SEASPAULSON-MAC:kafka_2.11-0.10.0.0 spaulson$ bin/kafka-console-producer.sh --broker-list ec2-54-186-31-109.us-west-2.compute.amazonaws.com:9092 --topic test
blah
[2016-10-20 12:13:23,395] ERROR Error when sending message to topic test with key: null, value: 4 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Batch containing 1 record(s) expired due to timeout while requesting metadata from brokers for test-0
I also get an error, repeated over and over, when I try to start up a console consumer:
bin/kafka-console-consumer.sh --zookeeper ec2-54-186-31-109.us-west-2.compute.amazonaws.com:2181 --topic test --from-beginning
[2016-10-19 18:26:47,175] WARN Fetching topic metadata with correlation id 152 for topics [Set(test)] from broker [BrokerEndPoint(0,ip-172-31-52-58.ec2.internal,9092)] failed (kafka.client.ClientUtils$)
java.nio.channels.ClosedChannelException
at kafka.network.BlockingChannel.send(BlockingChannel.scala:110)
at kafka.producer.SyncProducer.liftedTree1$1(SyncProducer.scala:80)
at kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:79)
at kafka.producer.SyncProducer.send(SyncProducer.scala:124)
at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:59)
at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:94)
at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:66)
I feel like these kinds of operations should be trivial, but it's proving very challenging. I'm having trouble finding documentation on how to diagnose issues and figure out what's going wrong. The best I found is this command:
$KAFKA_HOME/bin/kafka-topics.sh --describe --topic test --zookeeper ec2-54-186-31-109.us-west-2.compute.amazonaws.com:2181
Topic:test PartitionCount:1 ReplicationFactor:1 Configs:
Topic: test Partition: 0 Leader: 0 Replicas: 0 Isr: 0
Does the Leader: 0 indicate something went wrong? But what?
For AWS or any other IaaS machines, you should set "advertised.listeners" for the clients. Here is what this option means in the Kafka docs:
Listeners to publish to ZooKeeper for clients to use, if different than the listeners above. In IaaS environments, this may need to be different from the interface to which the broker binds. If this is not set, the value for listeners will be used.
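A sketch of the relevant broker settings in config/server.properties, using the EC2 hostname from the question (adjust the security protocol and port to match your setup):
listeners=PLAINTEXT://0.0.0.0:9092
advertised.listeners=PLAINTEXT://ec2-54-186-31-109.us-west-2.compute.amazonaws.com:9092
Restart the broker after the change so clients get the public hostname back in the metadata they fetch.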

Yarn app timeout and no error

I am running a map-reduce job triggered via YARN's REST API. The YARN app starts and triggers another map-reduce job, but the original YARN app times out after almost exactly 12 minutes.
This is the final log where it ends:
2016-09-01 13:22:53 DEBUG ProtobufRpcEngine:221 - Call: getJobReport took 0ms
2016-09-01 13:22:54 DEBUG Client:97 - stopping client from cache: org.apache.hadoop.ipc.Client#6bbe2511
There are literally no errors or exceptions. I don't know which setting in Hadoop is causing this.
The Diagnostics says Application *application_whatever* failed 1 times due to ApplicationMaster for attempt *appattempt_application_whatever* timed out. Failing the application.

Zookeeper & Kafka error KeeperErrorCode=NodeExists

I have written a Kafka consumer and producer that worked fine until today.
This morning, when I started ZooKeeper and Kafka, my consumer was not able to read messages, and I found this in the ZooKeeper logs:
INFO Got user-level KeeperException when processing sessionid:0x151c41e62e10000
type:create cxid:0x2a zxid:0x1e txntype:-1 reqpath:n/a
Error Path:/brokers/ids
Error:KeeperErrorCode = NodeExists for /brokers/ids
(org.apache.zookeeper.server.PrepRequestProcessor)
Look for log.dirs in your server.properties file, delete all of the Kafka and ZooKeeper logs from there, and try restarting ZooKeeper and Kafka respectively. I was facing the same issue and doing this resolved it.
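For a default standalone install that would be something like the following (a sketch; confirm log.dirs in config/server.properties and dataDir in config/zookeeper.properties first, and stop both processes before deleting):
rm -rf /tmp/kafka-logs
rm -rf /tmp/zookeeper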
According to Confluent at https://groups.google.com/forum/#!topic/confluent-platform/h0gEik_Ii1E on 2016/10/08
Those are not errors, you can see the log level is INFO. It is simply logging that Kafka tried to create a node that already exists. Totally normal behavior for Kafka and nothing to worry about.
Is there an actual problem related to the message or is everything working correctly?
Go to the Kafka root directory, look for the logs folder there, and clear all the logs. For instance, say your Kafka is installed in the Downloads folder:
cd ~/Downloads/kafka_2.13-2.6.0
rm -rf logs
It will resolve the issue.
