Schedule downtime for a NiFi flow at the end of the day - apache-nifi

I have a NiFi flow which reads data from a Kafka queue, splits each message into 2 different components, and then writes them to 2 different locations in HDFS.
I want to schedule a 15-minute downtime at the end of the day (11:45pm to 12:00am) which would allow all the messages already split to be drained from the queues and landed in their respective HDFS locations on the same day.
Is there a way to get this done?
I have tried looking at the Wait processor. I can schedule a processor to start at a certain time, but I'm unable to work out how to stop it after 12:00am.

There are a couple of implementation options I can think of:
NiFi REST API call to stop/start the required processor
Routing - check whether the current timestamp is between 11:45pm and 12:00am and route such FlowFiles to a LogAttribute processor, with a Run Schedule of every 15 mins.
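For the REST API option, here is a rough Java sketch of what a stop call could look like. The base URL, the processor UUID, and the assumption of an unsecured NiFi instance are placeholders, and the run-status endpoint should be verified against your NiFi version; sending the same request with state "RUNNING" would start the processor again after midnight.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: stop a processor through the NiFi REST API at 11:45pm,
// then send the same request with "RUNNING" at 12:00am (e.g. from cron).
public class StopNiFiProcessor {

    private static final String NIFI_API = "http://localhost:8080/nifi-api"; // placeholder
    private static final String PROCESSOR_ID = "<processor-uuid>";           // placeholder

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // 1. Fetch the processor entity to obtain its current revision version.
        HttpResponse<String> get = client.send(
                HttpRequest.newBuilder(URI.create(NIFI_API + "/processors/" + PROCESSOR_ID))
                        .GET().build(),
                HttpResponse.BodyHandlers.ofString());
        long version = extractRevisionVersion(get.body());

        // 2. PUT the new run status ("STOPPED" here, "RUNNING" to start it again).
        String body = "{\"revision\":{\"version\":" + version + "},\"state\":\"STOPPED\"}";
        HttpResponse<String> put = client.send(
                HttpRequest.newBuilder(URI.create(NIFI_API + "/processors/" + PROCESSOR_ID + "/run-status"))
                        .header("Content-Type", "application/json")
                        .PUT(HttpRequest.BodyPublishers.ofString(body))
                        .build(),
                HttpResponse.BodyHandlers.ofString());
        System.out.println("Stop request returned HTTP " + put.statusCode());
    }

    // Naive extraction of the revision version; a real client should use a JSON parser.
    private static long extractRevisionVersion(String json) {
        java.util.regex.Matcher m =
                java.util.regex.Pattern.compile("\"version\"\\s*:\\s*(\\d+)").matcher(json);
        return m.find() ? Long.parseLong(m.group(1)) : 0L;
    }
}
```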

Related

Apache NiFi - Can it scale at the processor level?

Newbie Alert to Apache NiFi!
Curious to understand (and read relevant material about) the scalability aspects of an Apache NiFi pipeline in a clustered setup.
Imagine there is a 2-node cluster: Node 1 & Node 2.
A simple use case as an example:
Query a database table in batches of 100 (let's say there are 10 batches).
For each batch, call a REST API (InvokeHTTP).
If the pipeline is triggered on Node 1 in the cluster, does this mean all 10 batches are run only on Node 1?
Is there any work distribution available out of the box in NiFi at the processor level? Along the lines of 5 batches of the REST API calls being executed per node.
Is the built-in queue of NiFi distributed in nature?
Or is the recommended way to scale at the processor level to publish the output of the previous processor to a messaging middleware (like Kafka) and then have the subsequent NiFi processor consume from it?
What's the recommended way to scale at every processor level in NiFi?
Every queue has a load-balancing strategy parameter with the following options:
Do not load balance: Do not load balance FlowFiles between nodes in the cluster. This is the default.
Partition by attribute: Determines which node to send a given FlowFile to based on the value of a user-specified FlowFile Attribute.
Round robin: FlowFiles will be distributed to nodes in the cluster in a round-robin fashion.
Single node: All FlowFiles will be sent to a single node in the cluster.
Details in documentation:
https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#Load_Balancing

Processor gets stuck and shows slowness in NiFi

I use NiFi standalone to extract data from MySQL and write it to S3. When I extract multiple tables in batches, 2 processors slow down after processing 15-20 FlowFiles of around 20-30 MB each, and the remaining FlowFiles stay in the queue.
When I restart the processor, all the queued FlowFiles are processed in a second, and to process everything I have to do this again and again.
The processors on which I am facing this are ConvertAvroToParquet and AttributesToJson.

Kafka Streams to divide a job & join the results

I have a requirement where I get a record into a topic. From that single record I create n different jobs (which should get distributed). Once I have successfully processed the n jobs, I need to push a "successfully processed" record. Does this qualify for Kafka Streams? Basically, what I am looking at is: I have a video (let's say 20 min duration) which needs to be transcoded. I will create 4 tasks (each 5 min), and each worker will process these tasks individually. Once all 4 tasks are completed, I need to stitch the video back together. I am trying to see if Kafka Streams is a possible fit to distribute the jobs and then join them.
I probably wouldn't recommend using Kafka Streams for the data itself, as it is not meant to deal with messages as large as videos. But you can use it as a messaging system, saying:
The first event comes in: hey, I got a new video to process
You trigger the software from within your stream
After completion, the software writes another event into a Kafka topic to say it has finished its job
The Kafka Streams application could then process this event to trigger the other software to do its job
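A minimal sketch of that coordination flow as a Kafka Streams application; the topic names, the fixed count of 4 chunks, and the startTranscodingTasks/stitchChunks hooks are assumptions for illustration, not something taken from the question.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

// Sketch: Kafka Streams coordinates the transcoding work via small events;
// the video bytes themselves never travel through Kafka.
public class TranscodeCoordinator {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "transcode-coordinator");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // "I got a new video to process" -> trigger the external transcoding software.
        KStream<String, String> submitted = builder.stream("video-submitted");
        submitted.foreach((videoId, location) -> startTranscodingTasks(videoId, location));

        // The transcoder writes a completion event per chunk; count them per video
        // and trigger stitching once all 4 chunks are done.
        KStream<String, String> completed = builder.stream("task-completed");
        completed.groupByKey()
                 .count()
                 .toStream()
                 .filter((videoId, finishedTasks) -> finishedTasks != null && finishedTasks == 4)
                 .foreach((videoId, finishedTasks) -> stitchChunks(videoId));

        new KafkaStreams(builder.build(), props).start();
    }

    // Placeholders for the integration with the actual transcoding/stitching software.
    private static void startTranscodingTasks(String videoId, String location) { /* ... */ }
    private static void stitchChunks(String videoId) { /* ... */ }
}
```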
Hope it helps.

How do I add a custom monitoring feature in my Spark application?

I am developing a Spark application. The application takes data from a Kafka queue and processes it. After processing, it stores the data in an HBase table.
Now I want to monitor some performance attributes, such as:
Total count of input and output records (not all records will be persisted to HBase; some of the data may be filtered out during processing)
Average processing time per message
Average time taken to persist the messages.
I need to collect this information and send it to a different Kafka queue for monitoring.
The monitoring should not incur a significant delay in the processing.
Please suggest some ideas for this.
Thanks.
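One possible approach, sketched under assumptions not stated in the question (Spark 2.x streaming, a metrics topic named "app-metrics", and a simple JSON payload): keep the counters and timings in Spark accumulators updated inside the existing transformations, then publish one small metrics record per micro-batch from the driver, which keeps the overhead to a single produce call.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.util.LongAccumulator;

import java.util.Properties;

// Sketch: accumulator-based counters published to a separate Kafka topic once per batch.
public class MetricsSketch {

    public static void attachMetrics(JavaStreamingContext jssc, JavaDStream<String> processed) {
        LongAccumulator inputCount   = jssc.sparkContext().sc().longAccumulator("inputRecords");
        LongAccumulator outputCount  = jssc.sparkContext().sc().longAccumulator("outputRecords");
        LongAccumulator processNanos = jssc.sparkContext().sc().longAccumulator("processNanos");

        // ... inside the existing map/filter/persist logic, the executors would call
        // inputCount.add(1), outputCount.add(1) and processNanos.add(elapsedNanos) ...

        processed.foreachRDD(rdd -> {
            // Runs on the driver after each micro-batch.
            long in = inputCount.value();
            long out = outputCount.value();
            double avgMs = in == 0 ? 0 : (processNanos.value() / 1e6) / in;
            String metrics = String.format(
                    "{\"input\":%d,\"output\":%d,\"avgProcessingMs\":%.2f}", in, out, avgMs);

            // For brevity a producer is created per batch; a real job should reuse one.
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("app-metrics", metrics));
            }
        });
    }
}
```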

Storm bolt doesn't guarantee to process the records in the order they are received?

I have a Storm topology that reads records from Kafka, extracts the timestamp present in the record, does a lookup on an HBase table, applies business logic, and then updates the HBase table with the latest values from the current record.
I have written a custom HBase bolt extending BaseRichBolt, where the code does a lookup on the HBase table, applies some business logic to the message that was read from Kafka, and then updates the HBase table with the latest data.
The problem I am seeing is that sometimes the bolt receives/processes the records in a jumbled order, because of which my application thinks that a particular record has already been processed and ignores it. The application fails to process a significant number of records due to this.
For Example:
Suppose there are two records read from Kafka: one record belongs to the 10th hour and the second belongs to the 11th hour.
My custom HBase bolt processes the 11th-hour record first, then reads/processes the 10th-hour record later. Because the 11th-hour record is processed first, the application assumes the 10th-hour record has already been processed and ignores it.
Can someone please help me understand why my custom HBase bolt is not processing the records in the order it receives them?
Do I have to set any additional properties to ensure the bolt processes the records in the order it receives them? What possible alternatives can I try to fix this?
FYI, I am using fields grouping for the HBase bolt, through which I want to ensure that all the records of a particular user go to the same task. Nevertheless, thinking fields grouping might be causing the issue, I reduced the number of tasks for my custom HBase bolt to 1, and I still see the same issue.
I'm wondering why the HBase bolt is not reading/processing records in the order it receives them. Please help me with your thoughts.
Thanks a lot.
Kafka doesn't guarantee the order of messages across multiple partitions.
So there is no ordering when you read messages. To avoid that, you need to create the Kafka topic with a single partition, but you will lose the parallelism advantage.
Kafka guarantees ordering per partition, not per topic. Partitioning really serves two purposes in Kafka:
It balances data and request load over brokers
It serves as a way to divvy up processing among consumer processes while allowing local state and preserving order within the partition.
For your use case you may care about only #2. Please consider using a Partitioner as part of your producer via ProducerConfig.PARTITIONER_CLASS_CONFIG. The default Java producer in 0.9 will try to level messages across all available partitions. https://github.com/apache/kafka/blob/6eacc0de303e4d29e083b89c1f53615c1dfa291e/clients/src/main/java/org/apache/kafka/clients/producer/internals/DefaultPartitioner.java
You can create your own with something like this:
return hash(key)%num_partitions
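For illustration, a minimal partitioner along those lines; it mirrors what the linked DefaultPartitioner does for keyed records, with null-key handling reduced to a simple fallback.

```java
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

import java.util.Map;

// hash(key) % num_partitions: records with the same key (e.g. a user id)
// always land in the same partition, which is what preserves their order.
public class KeyHashPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null) {
            return 0; // records without a key need a different strategy (e.g. round-robin)
        }
        // murmur2 is the hash the default partitioner uses for keyed records.
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    @Override
    public void close() { }

    @Override
    public void configure(Map<String, ?> configs) { }
}
```

It would be registered on the producer with props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, KeyHashPartitioner.class.getName()).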

Resources