Invalid state: The Flow Controller is initializing the Data Flow - apache-nifi

I'm trying out a test scenario: adding a new node to an already existing (for now single-node) cluster that uses an external ZooKeeper.
I keep getting the repeated log lines below, and the UI shows "Invalid state: The Flow Controller is initializing the Data Flow."
2022-02-28 17:51:29,668 INFO [main] o.a.n.c.c.n.LeaderElectionNodeProtocolSender Determined that Cluster Coordinator is located at nifi-02:9489; will use this address for sending heartbeat messages
2022-02-28 17:51:29,668 INFO [main] o.a.n.c.p.AbstractNodeProtocolSender Cluster Coordinator is located at nifi-02:9489. Will send Cluster Connection Request to this address
2022-02-28 17:51:37,572 INFO [Cleanup Archive for default] o.a.n.c.repository.FileSystemRepository Successfully deleted 0 files (0 bytes) from archive
2022-02-28 17:52:36,914 INFO [Write-Ahead Local State Provider Maintenance] org.wali.MinimalLockingWriteAheadLog org.wali.MinimalLockingWriteAheadLog#13c90c06 checkpointed with 1 Records and 0 Swap Files in 4 milliseconds (Stop-the-world time = 1 milliseconds, Clear Edit Logs time = 1 millis), max Transaction ID 1
2022-02-28 17:52:37,581 INFO [Cleanup Archive for default] o.a.n.c.repository.FileSystemRepository Successfully deleted 0 files (0 bytes) from archive
NiFi 1.15.3 is being used (unsecured setup).
It seems that the cluster coordinator is not actually listening on the advertised port on the node that is already in the cluster; that is what the timeouts suggest. Yet the new node is able to determine that a cluster coordinator is present at that address. How can this be solved?
nc (netcat) also times out against the same port.
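For what it's worth, here is a minimal Java sketch (not from the original post) of the same reachability check that nc performs, using the coordinator host and port from the log lines above and an assumed 5-second timeout:

import java.net.InetSocketAddress;
import java.net.Socket;

public class PortCheck {
    public static void main(String[] args) {
        String host = "nifi-02"; // cluster coordinator host reported in the log
        int port = 9489;         // cluster protocol port reported in the log
        try (Socket socket = new Socket()) {
            // Same check nc performs: open a plain TCP connection with a timeout
            socket.connect(new InetSocketAddress(host, port), 5_000);
            System.out.println("Port is reachable");
        } catch (Exception e) {
            System.out.println("Port is NOT reachable: " + e.getMessage());
        }
    }
}

If this fails from the new node but succeeds locally on nifi-02, a firewall or interface-binding issue between the nodes is the likely culprit.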

Related

kafka streams logging disable INFO

Is there any way to disable the Kafka Streams processing summary info? It takes up a lot of disk space.
e.g INFO 21284 --- [-StreamThread-6] o.a.k.s.p.internals.StreamThread : stream-thread [test-20-37836474-d182-4066-a5f5-25b211e2fbdb-StreamThread-1] Processed 0 total records, ran 0 punctuators, and committed 0 total tasks since the last update
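One common way to silence just this logger (an assumption on my part, since the question doesn't say which logging backend is in use) is to raise its level to WARN; the sketch below assumes SLF4J backed by Logback. The same effect can also be achieved declaratively in a logback/log4j configuration file.

import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;
import org.slf4j.LoggerFactory;

public class QuietStreamsLogging {
    public static void silenceProcessingSummary() {
        // The "Processed N total records ..." line is emitted at INFO by StreamThread
        Logger streamThreadLogger = (Logger) LoggerFactory.getLogger(
                "org.apache.kafka.streams.processor.internals.StreamThread");
        streamThreadLogger.setLevel(Level.WARN); // keep WARN/ERROR, drop the summary INFO lines
    }
}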

How to stop Preparing to rebalance group with old generation in Kafka?

I use Kafka for my web application and I found the messages below in kafka.log:
[2021-07-06 08:49:03,658] INFO [GroupCoordinator 0]: Preparing to rebalance group qpcengine-group in state PreparingRebalance with old generation 105 (__consumer_offsets-28) (reason: removing member consumer-1-7eafeb56-e6fe-4161-9c88-e69c06a0ab37 on heartbeat expiration) (kafka.coordinator.group.GroupCoordinator)
[2021-07-06 08:49:03,658] INFO [GroupCoordinator 0]: Group qpcengine-group with generation 106 is now empty (__consumer_offsets-28) (kafka.coordinator.group.GroupCoordinator)
But Kafka seems to keep looping like this forever for one consumer.
How can I stop it?
(Screenshot of the Kafka log omitted.)
If you only have one partition, you don't need to use a consumer group; just use assign() instead of subscribe(), as in the sketch below.
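A minimal sketch of that suggestion, using the plain Java client (topic name and broker address are illustrative):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class AssignInsteadOfSubscribe {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        // No group.id needed: assign() bypasses the group coordinator entirely,
        // so no rebalances (and no "Preparing to rebalance group" log lines) occur.
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(new TopicPartition("my-topic", 0)));
            consumer.seekToBeginning(consumer.assignment()); // or track offsets yourself
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("%d: %s%n", record.offset(), record.value());
            }
        }
    }
}

The trade-off is that you now manage offsets and partition ownership yourself instead of relying on group management.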

Multiple ProducerIds are created when there are multiple instances of Producers

In the .yaml file, we have set spring.cloud.stream.kafka.binder.configuration.enable.idempotence to true.
Now when the application starts up, we can see a log like
[kafka-producer-network-thread | test_clientId] org.apache.kafka.clients.producer.internals.TransactionManager - [Producer clientId=test_clientId] ProducerId set to 0 with epoch 0
When the first message is being produced to the topic, we can see that another ProducerId is being used as shown in the below log
[Ljava.lang.String;#720a86ef.container-0-C-1] org.apache.kafka.clients.producer.KafkaProducer - [Producer clientId=test_clientId] Instantiated an idempotent producer.
[Ljava.lang.String;#720a86ef.container-0-C-1] org.apache.kafka.common.utils.AppInfoParser - Kafka version : 2.0.1
[Ljava.lang.String;#720a86ef.container-0-C-1] org.apache.kafka.common.utils.AppInfoParser - Kafka commitId : fa14705e51bd2ce5
kafka-producer-network-thread | test_clientId] org.apache.kafka.clients.Metadata - Cluster ID: -9nblycHSsiksLIUbVH6Vw
1512361 INFO [kafka-producer-network-thread | test_clientId] org.apache.kafka.clients.producer.internals.TransactionManager - [Producer clientId=test_clientId] ProducerId set to 1 with epoch 0
Once the ProducerId is set to 1, when any new messages are sent from this application, no new ProducerIds are created.
But if we have multiple application instances running (all connecting to the same Kafka cluster), new ProducerIds are created in each of those instances as well, both at startup and when the first message is sent.
Please suggest whether we can restrict the creation of new ProducerIds and keep using the one that was created when the application started.
Also, since a lot of ProducerIds are created, is there some way to re-use the ones that already exist? (Assume the application has multiple producers and each one creates multiple ProducerIds.)
The first producer is temporary - it is created to find the existing partitions for the topic during initialization. It is immediately closed.
The second producer is a single producer used for subsequent record sends.
The producerId and epoch are allocated by the broker; they have to be unique.
With a new broker you will get 0 and 1 for the first instance, 2 and 3 for the second instance, 4 and 5 for the third, and so on.
Even if you stop all instances, the next one will simply get the next pair (6 and 7).
Why do you worry about this?
On the other hand, if you set the client.id to, say, foo, you will always get the client ids foo-1 and foo-2 on all instances; see the sketch below.
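A hedged illustration with the plain Java producer (not the Spring Cloud Stream binder from the question; broker address and topic name are illustrative): the client.id is yours to fix, whereas the producerId/epoch pair in the TransactionManager log line is always allocated by the broker for each producer instance.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class IdempotentProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true); // same setting as in the yaml
        props.put(ProducerConfig.CLIENT_ID_CONFIG, "foo");         // stable, user-chosen client id

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The broker still assigns a fresh producerId to this producer instance;
            // only the client.id ("foo") is under the application's control.
            producer.send(new ProducerRecord<>("my-topic", "key", "value"));
            producer.flush();
        }
    }
}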

kafka-streams instance on startup continuously logs "Found no committed offset for partition traces-1"

I have a kafka-streams app with 2 instances. This is a brand new Kafka cluster with all topics created but no messages written to them yet.
I start the first instance and see that it has transitioned from REBALANCING to RUNNING state
Now I start the next instance and notice that it continuously logs the following:
2020-01-14 18:03:57.896 [streaming-app-f2457059-c9ec-4c21-a177-be54f8d59cb2-StreamThread-2] INFO o.a.k.c.c.i.ConsumerCoordinator - [Consumer clientId=streaming-app-f2457059-c9ec-4c21-a177-be54f8d59cb2-StreamThread-2-consumer, groupId=streaming-app] Found no committed offset for partition traces-1

What is Apache Spark doing before a job starts

I have an Apache Spark batch job running continuously on AWS EMR. It pulls from AWS S3, runs a couple of jobs with that data, and then stores the data in an RDS instance.
However, there seems to be a long period of inactivity between jobs.
The CPU usage and network graphs from Ganglia (not reproduced here) both show the same pattern.
Notice the gap between each column; it is almost the same size as the activity column!
At first I thought these two columns were shifted (when it was pulling from S3, it wasn't using a lot of CPU and vice-versa) but then I noticed that these two graphs actually follow each other. This makes sense since the RDDs are lazy and will thus pull as the job is running.
Which leads to my question, what is Spark doing during that time? All of the Ganglia graphs seem zeroed during that time. It is as if the cluster decided to take a break before each job.
Thanks.
EDIT: Looking at the logs, this is the part where it seems to take an hour of...doing nothing?
15/04/27 01:13:13 INFO storage.DiskBlockManager: Created local directory at /mnt1/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1429892010439_0020/spark-c570e510-934c-4510-a1e5-aa85d407b748
15/04/27 01:13:13 INFO storage.MemoryStore: MemoryStore started with capacity 4.9 GB
15/04/27 01:13:13 INFO netty.NettyBlockTransferService: Server created on 37151
15/04/27 01:13:13 INFO storage.BlockManagerMaster: Trying to register BlockManager
15/04/27 01:13:13 INFO storage.BlockManagerMaster: Registered BlockManager
15/04/27 01:13:13 INFO util.AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver#ip-10-0-3-12.ec2.internal:41461/user/HeartbeatReceiver
15/04/27 02:30:45 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 0
15/04/27 02:30:45 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 7
15/04/27 02:30:45 INFO executor.Executor: Running task 77251.0 in stage 0.0 (TID 0)
15/04/27 02:30:45 INFO executor.Executor: Running task 77258.0 in stage 0.0 (TID 7)
15/04/27 02:30:45 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 8
15/04/27 02:30:45 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 8)
15/04/27 02:30:45 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 15
15/04/27 02:30:45 INFO executor.Executor: Running task 7.0 in stage 0.0 (TID 15)
15/04/27 02:30:45 INFO broadcast.TorrentBroadcast: Started reading broadcast variable
Notice that at 01:13:13 it just hangs there until 02:30:45.
I found the issue. The problem was in the way I was pulling from S3.
Our data in S3 is laid out by a date pattern, e.g. s3n://bucket/2015/01/03/10/40/actualData.txt for the data from 2015-01-03 at 10:40.
So when we want to run the batch process on the whole set, we call sc.textFile("s3n://bucket/*/*/*/*/*/*").
BUT that is bad. In retrospect, this makes sense: for each star (*), Spark needs to list all of the entries in that "directory", and then list everything in the directories under that. A single month has about 30 day "directories", each day has 24 hour "directories", and each of those has 60 minute "directories". So the above pattern issues a "list files" call for each star AND another list call for every prefix returned, all the way down to the minutes, just so it can eventually find all of the actualData.txt files and union their RDDs.
This, of course, is really slow. So the answer was to build these paths in code (a list of strings covering all the dates; in our case, all possible dates can be determined) and reduce them into a comma-separated string that can be passed to textFile, as in the sketch below.
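A sketch of that approach in Java (bucket name, date range, and layout are illustrative and mirror the pattern above):

import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class BuildS3Paths {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("batch-from-s3");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Build one fully qualified path per minute; one day is shown here (1,440 paths),
            // extend the range as needed instead of globbing with stars.
            List<String> paths = new ArrayList<>();
            LocalDateTime t = LocalDateTime.of(2015, 1, 3, 0, 0);
            LocalDateTime end = t.plusDays(1);
            while (t.isBefore(end)) {
                paths.add(String.format("s3n://bucket/%04d/%02d/%02d/%02d/%02d/actualData.txt",
                        t.getYear(), t.getMonthValue(), t.getDayOfMonth(),
                        t.getHour(), t.getMinute()));
                t = t.plusMinutes(1);
            }
            // textFile accepts a comma-separated list of paths, so no wildcard listing is needed
            JavaRDD<String> data = sc.textFile(String.join(",", paths));
            System.out.println("Record count: " + data.count());
        }
    }
}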
If in your case you can't determine all of the possible paths, consider either restructuring your data, building as much of each path as possible and only using * towards the end, or using the AmazonS3Client to fetch all the keys via the list-objects API (which lets you retrieve ALL keys under a bucket prefix very quickly) and then passing them as a comma-separated string into textFile. Spark will still make a "list status" call for each file, and the calls will still be serial, but there will be far fewer of them.
However, all of these solutions just postpone the inevitable: as more and more data accumulates, more and more list-status calls will be made serially. The root of the issue seems to be that sc.textFile("s3n://...") pretends that S3 is a file system, which it is not; it is a key-value store. Spark (and Hadoop) need a different way of dealing with S3 (and possibly other key-value stores) that doesn't assume a file system.
