Distributing tasks with equal load in a multinode cluster - caching

A Spring scheduler is triggered every hour, and it fires on every deployed node.
The scheduler reads a list of data from the DB and processes each item in the list.
The processing of the data should not be duplicated across multiple nodes.
Each node should be able to uniquely identify the data in the list that it can process.
Is there any microservice pattern/distributed architecture pattern to achieve this using a distributed cache?
Please note:
We are not able to acquire a lock on the DB.
Each item in the list has a unique id.
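
One possible way to picture the requirement is sketched below. It is only an illustration under stated assumptions, not a confirmed answer to the question: a ConcurrentHashMap stands in for the distributed cache, the class name, the claim method, the runId prefix and the maxClaims limit are all hypothetical, and the sketch assumes the real cache offers an atomic put-if-absent operation. Each node tries to write its own name under the item's unique id, and only the node whose write succeeds processes that item.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class CacheClaimSketch {
    // Assumption: stands in for a distributed cache shared by all nodes.
    static final ConcurrentMap<String, String> SHARED_CACHE = new ConcurrentHashMap<>();

    // Tries to claim up to maxClaims ids for this node.
    // putIfAbsent is atomic, so exactly one node wins each id.
    static List<String> claim(String nodeId, String runId, List<String> ids, int maxClaims) {
        List<String> mine = new ArrayList<>();
        for (String id : ids) {
            if (mine.size() >= maxClaims) {
                break; // cap the batch so other nodes get a share of the load
            }
            // runId (hypothetical) separates one hourly run from the next.
            if (SHARED_CACHE.putIfAbsent(runId + ":" + id, nodeId) == null) {
                mine.add(id);
            }
        }
        return mine;
    }

    public static void main(String[] args) {
        List<String> ids = List.of("1", "2", "3", "4", "5", "6");
        System.out.println("node-a: " + claim("node-a", "run-1", ids, 3));
        System.out.println("node-b: " + claim("node-b", "run-1", ids, 3));
    }
}
```

Capping each claim at a small batch and repeating until nothing is left is one way to keep the load roughly even; old keys would also need to expire, which is not shown here.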

Related

Apache NiFi - Can it scale at the processor level?

Newbie Alert to Apache NiFi!
Curious to understand (and read relevant material) on the scalability aspects of Apache NiFi pipeline in a clustered set up.
Imagine a 2-node cluster: Node 1 and Node 2.
A simple use case as an example:
Query a database table in batches of 100 (let's say there are 10 batches).
For each batch, call a REST API (InvokeHTTP).
If a pipeline is triggered on Node 1 of the cluster, does this mean all 10 batches run only on Node 1?
Is there any work distribution available "out-of-the-box" in NiFi at every processor level? For example, 5 batches per node executed for the REST API calls.
Is the built-in queue of NiFi distributed in nature?
Or is the recommended way to scale at the processor level to publish the output of the previous processors to messaging middleware (like Kafka) and then have the subsequent NiFi processor consume from it?
What's the recommended way to scale at every processor level in NiFi?
Every queue has a load balancing strategy parameter with the following options:
Do not load balance: Do not load balance FlowFiles between nodes in the cluster. This is the default.
Partition by attribute: Determines which node to send a given FlowFile to based on the value of a user-specified FlowFile Attribute.
Round robin: FlowFiles will be distributed to nodes in the cluster in a round-robin fashion.
Single node: All FlowFiles will be sent to a single node in the cluster.
Details in documentation:
https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#Load_Balancing
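
To make the difference between two of these strategies concrete, here is a small illustrative sketch. It is not NiFi code and does not claim to reproduce NiFi's internal hashing: the class and method names are hypothetical, and String.hashCode is only a stand-in. It shows that round robin spreads the 10 batches from the question evenly over 2 nodes, while partitioning by attribute always sends the same attribute value to the same node.

```java
class LoadBalanceSketch {
    // Round robin: the n-th FlowFile goes to node n mod nodeCount.
    static int roundRobin(long flowFileIndex, int nodeCount) {
        return (int) (flowFileIndex % nodeCount);
    }

    // Partition by attribute: equal attribute values map to the same node.
    // Assumption: String.hashCode stands in for whatever hash NiFi uses.
    static int partitionByAttribute(String attributeValue, int nodeCount) {
        return Math.floorMod(attributeValue.hashCode(), nodeCount);
    }

    public static void main(String[] args) {
        int nodeCount = 2;
        int[] perNode = new int[nodeCount];
        for (int batch = 0; batch < 10; batch++) {
            perNode[roundRobin(batch, nodeCount)]++;
        }
        System.out.println("Round robin, batches per node: " + perNode[0] + " and " + perNode[1]);
        System.out.println("Node for attribute 'customer-7': " + partitionByAttribute("customer-7", nodeCount));
    }
}
```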

How do we know when a flow is completed when multiple flowfiles are running in parallel?

I have a requirement where a template uses SQL as both source and destination, and each table holds more than 100GB of data. The template is instantiated multiple times, once per table to be migrated, and each table is partitioned into multiple flowfiles. How do we know when the process is completed? Because there are multiple flowfiles, we cannot conclude that the flow is done just because one of them hits the end processor.
I have tried using SiteToSiteStatusReportingTask to check the queue count, but it reports counts per connection, and it is difficult to fetch the connection id for each connection and then combine them, as we have a large number of templates. The reporting task has another problem: it provides data on all process groups on the NiFi canvas, which is a huge amount of data if all templates are running and may impact performance, even though I used an Avro schema to fetch only the queue count and connection id.
Can you please suggest some ideas to help me achieve this?
You have multiple solutions:
1 - Use the Wait/Notify processor pair.
If you don't want multiple flowfiles running in parallel:
2 - Set backpressure on the queue.
3 - Specify process-group-level FlowFile concurrency (recommended, but requires NiFi 1.12).

How do I get two topics that have the same partition key and the same number of partitions to land on the same consumer within a Kafka Streams application?

I am trying to create a Kafka Streams service where:
1. I initialize a cache in a processor, which is then updated by consuming messages from a topic, say "nodeStateChanged", with a partition key, let's say locationId.
2. I need to check the node state when I consume another topic, let's say "Report", again keyed by the same locationId. Effectively I am joining with the table created by nodeStateChanged.
How do I ensure that all the updates for nodeStateChanged land on the same instance as the Report topic, so that the lookup for a location is possible when a new report is received? Do 1 and 2 need to be created by the same topology, or is it okay to create two separate topologies that share the same APPLICATION_ID_CONFIG?
You don't need to do anything. Given that both topics have the same number of partitions and use the same key, Kafka Streams co-partitions them: if a sub-topology reads from multiple topics with N partitions each, you get N tasks, and each task processes the corresponding partitions, i.e., task 0 processes partition 0 of both input topics, task 1 processes partition 1 of both input topics, and so on.
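
The reasoning behind this answer can be sketched as follows. This is an illustration only: the class and method names are hypothetical, and String.hashCode stands in for the hash used by Kafka's default partitioner (which applies murmur2 to the serialized key bytes). The point is that when both topics have the same partition count and the same partitioning function, a given locationId gets the same partition number in "nodeStateChanged" and in "Report", and that partition number is the task that sees both.

```java
class CoPartitionSketch {
    // Assumption: stand-in for the producer's partitioner.
    // The same key and partition count always give the same partition.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 4; // both topics are assumed to have 4 partitions
        String locationId = "loc-42";

        int statePartition = partitionFor(locationId, numPartitions);  // "nodeStateChanged"
        int reportPartition = partitionFor(locationId, numPartitions); // "Report"

        // Task i reads partition i of both topics, so both records meet in one task.
        System.out.println("nodeStateChanged partition: " + statePartition);
        System.out.println("Report partition:           " + reportPartition);
        System.out.println("Same task: " + (statePartition == reportPartition));
    }
}
```

If the two topics had different partition counts, the modulo step would no longer line up, which is why the same number of partitions matters.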

Distribute processing of records of scheduler job

I am working on a use case where I have a cron job scheduled (via Quartz) which reads certain entries from the DB and processes them.
In each run, I can get thousands of records which need to be processed. Processing each record takes time (seconds to minutes). Currently all those records are processed on a single node (the node elected by Quartz). My challenge is to parallelize the processing of these records. Please help me with the concerns below:
How can I distribute these records/tasks to a cluster of machines?
If a machine fails after processing a few records, the remaining records should be processed by the healthy nodes in the cluster.
How do I get a signal that all record processing is finished?
Create cron jobs to run separately on each host at the desired frequency. You will need some form of lock on each record, or some form of range lock on the record set, to ensure that servers process mutually exclusive sets of records.
E.g., you can add the following new fields to all records:
Locked By Server:
Locked for Duration (or lock expiration time):
On each run, each cron job picks a set of records that have expired or empty locks, and then it acquires the lock on a small set of records by filling in these two fields. Then it proceeds to process them. If it crashes or gets stuck, the lock expires; otherwise it is released on completion.
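
The locking scheme described above can be sketched roughly as follows. This is a minimal in-memory illustration, not a production implementation: the class, field and method names are hypothetical, the list stands in for the database table, the synchronized methods stand in for an atomic conditional update in the database, and the done flag is an added assumption so that finished records are not picked again.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

class LeaseLockSketch {
    static class Record {
        final long id;
        String lockedBy;       // "Locked By Server"
        Instant lockExpiresAt; // lock expiration time
        boolean done;          // assumption: marks completed records
        Record(long id) { this.id = id; }
    }

    // Claims up to batchSize unfinished records whose lock is empty or expired.
    static synchronized List<Record> claim(List<Record> table, String server,
                                           int batchSize, Duration lease, Instant now) {
        List<Record> claimed = new ArrayList<>();
        for (Record r : table) {
            if (claimed.size() == batchSize) {
                break;
            }
            boolean free = r.lockedBy == null || !r.lockExpiresAt.isAfter(now);
            if (!r.done && free) {
                r.lockedBy = server;
                r.lockExpiresAt = now.plus(lease);
                claimed.add(r);
            }
        }
        return claimed;
    }

    // Releases the lock on completion.
    static synchronized void complete(Record r) {
        r.done = true;
        r.lockedBy = null;
        r.lockExpiresAt = null;
    }

    public static void main(String[] args) {
        List<Record> table = new ArrayList<>();
        for (long i = 1; i <= 4; i++) {
            table.add(new Record(i));
        }
        Instant t0 = Instant.now();
        Duration lease = Duration.ofMinutes(5);

        List<Record> a = claim(table, "server-a", 2, lease, t0); // server-a then "crashes"
        List<Record> b = claim(table, "server-b", 2, lease, t0);
        b.forEach(LeaseLockSketch::complete);

        // After the lease runs out, server-b can pick up server-a's abandoned records.
        List<Record> retry = claim(table, "server-b", 2, lease, t0.plus(lease));
        System.out.println("server-a claimed " + a.size() + ", server-b claimed " + b.size()
                + ", then re-claimed " + retry.size() + " expired records");
    }
}
```

The lease duration should comfortably exceed the time needed to process one batch, otherwise a slow but healthy server could lose its records to another server mid-run.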

Is there any way to control in Hadoop MapReduce framework on which node reducer will be started?

In short, I need a way to give the Hadoop MapReduce API a hint about which host I'd like a certain reducer to run on, based on its partition. Is there any way?
Somewhat longer story:
I have a few mapper tasks which generate (or import from another source) records for a certain HBase table. Emitted records have ImmutableBytesWritable as keys. The number of reducers for this job exactly matches the number of table regions, and a custom partitioner is used to distribute records so that the records of every region go to the appropriate reducer.
Reducers are intended to generate HFile images, one image per region, so that bulk load can later be used on them. The only serious problem is that I'd like the reducers to at least 'try to run' on the same hosts where the corresponding region servers are running. This is to get a good probability of locality (in terms of HDFS) of the generated HFiles for the corresponding HBase region servers.
Any idea how to get this behavior?
An alternative could be a way to 'request' that an HDFS file 'become local'. With that, I could start another MR job with mappers bound to region servers (through splits) and request that the corresponding HFile become local.
There is no out-of-the-box way to do this yet, short of writing a custom scheduler, which would be overkill.
An upstream ticket does track this feature request at https://issues.apache.org/jira/browse/MAPREDUCE-199.

Resources