We have a spark-streaming micro batch process which consumes data from kafka topic with 20 partitions. The data in the partitions are independent and can be processed independently. The current problem is the micro batch waits for processing to be complete in all 20 partitions before starting next micro batch. So if one partition completes processing in 10 seconds and other partition takes 2 mins then the first partition will have to wait for 110 seconds before consuming next offset.
I am looking for a streaming solution where we can process the 20 partitions independently without having to wait for other partition to complete a process. The steaming solution should consume data from each partition and progress offsets at its own rate independent of other partitions.
Anyone have suggestion on which streaming architecture would allow to achieve my goal?
Any of Flink (AFAIK), KStreams, and Akka Streams will be able to progress through the partitions independently: none of them does Spark-style batching unless you explicitly opt in.
Flink is similar to Spark in that it has a job server model; KStreams and Akka are both libraries that you just integrate into your project and deploy like any other JVM application (e.g. you can build a container and run on a scheduler like kubernetes). I personally prefer the latter approach: it generally means less infrastructure to worry about and less of an impedance mismatch to integrate with observability tooling used elsewhere.
Flink is an especially good choice when it comes to time-window based processing and joins.
KStreams fundamentally models everything as a transformation from one kafka topic to another: the topic topology is managed by KStreams, but there can be some gotchas there (especially if you're dealing with anything time-seriesy).
Akka is the most general and (in some senses) the least opinionated of the toolkits: you will have to make more decisions with less handholding (I'm saying this as someone who could probably fairly be called an Akka cheerleader); as a pure stream processing library, it may not be the ideal choice (though in terms of resource consumption, being able to more explicitly manage backpressure (basically, what happens when data comes in faster than it can be processed) may make it more efficient than the alternatives). I'd probably tend to only choose it if you were going to also take advantage of cluster sharded (and almost certainly event-sourced) actors: the benefit of doing that is that you can completely decouple your processing parallelism from the number of input Kafka partitions (e.g. you may be able to deploy 40 instances of processing and have each working on half of the data from Kafka).
One of the benefits of Microservice architecture is one can scale heavily used parts of the application without scaling the other parts. This supposedly provides benefits around cost.
However, my question is, if a heavily used microservice is dependent on other microservice to do it's work wouldn't you have to scale the other services as well seemingly defeating the purpose. If a microservice is calling other micro service at real time to do it's job, does it mean that Micro service boundaries are not established correctly.
There's no rule of thumb for that.
Scaling usually depends on some metrics and when some thresholds are reached then new instances are created. Same goes for the case when they are not needed anymore.
Some services are doing simple, fast tasks, like taking an input and writing it to the database and others may be longer running task which can take any amount of time.
If a service that needs scale is calling a service that can easily handle heavy loads in a reliable way then there is no need to scale that service.
That idea behind scaling is to scale up when needed in order to support the loads and then scale down whenever loads get in the regular metrics ranges in order to reduce the costs.
There are two topics to discuss here.
First is that usually, it is not a good practice to communicate synchronously two microservices because you are coupling them in time, I mean, one service has to wait for the other to finish its task. So normally it is a better approach to use some message queue to decouple the producer and consumer, this way the load of one service doesn't affect the other.
However, there are situations in which it is necessary to do synchronous communication between two services, but it doesn't mean necessarily that both have to scale the same way, for example: if a service has to make several calls to other services, queries to database, or other kind of heavy computational tasks, and one of the service called only do an array sorting, probably the first service has to scale much more than the second in order to process the same number of request because the threads in the first service will be occupied longer time than the second
According to the MPI implementation of Storm the workers manage connections to other workers and maintain a mapping from task to task. Also, transferring takes in a task id and a tuple, and it serializes the tuple and puts it onto a "transfer queue”.
The question is, if there is a way to organise scheduling, such that certain tasks of an operator communicate to only certain tasks of the following operator at a given time according to the application’s topology (could ZeroMQ possibly do something like this?).
Q : "If there is a way to organise scheduling, such that certain tasks of an operator communicate to only certain tasks of the following operator at a given time according to the application’s topology ( could ZeroMQ possibly do something like this? )."
Obviously could,it does allow smart & flexible creation of signalling/messaging meta-plane(s) infrastructure(s) for the distributed-computing, improving itself in doing this for about the last 12+ years.
The #HristoIlliev attached comment's URL details that Apache-Storm itself reports to already use the ZeroMQ-layer for its own services *[in ver.0.8.0, almost all implementation (source-code) links unfortunately already dead there]:
The implementation for distributed mode uses ZeroMQ code
The implementation for local mode uses in-memory Java queues (so that it's easy to use Storm locally without needing to get ZeroMQ installed) code
...
Tasks listen on an in-memory ZeroMQ port for messages from the virtual port code
So the topology-related part of your question is related to the decision already made on this subject in the "outer" Apache-Storm architecture, that was done.
Tasks are responsible for message routing. A tuple is emitted either to a direct stream (where the task id is specified) or a regular stream. In direct streams, the message is only sent if that bolt subscribes to that direct stream. In regular streams, the stream grouping functions are used to determine the task ids to send the tuple to.
The MPI does the same for the HPC-focused computing ecosphere, since FORTRAN jobs started to run on first HPC distributed infrastructures. Due to the most of the HPC-computing problems were "simply" scaled onto larger footprints of the computing hardware, the MPI focus was more on efficiency of such uniform scaling, not visiting thus the opposite corner of adaptive, almost ad-hoc setup of message-passing infrastructure, layered topologies of specialised ZeroMQ Scalable Formal Communication Archetypes Patterns, so each of the tools focus on other factors.
If you feel you want to read a bit more on ZeroMQ, this answer might help to fast understand the core underlying concepts.
I have a cluster application, which is divided into a controller and a bunch of workers. The controller runs on a dedicated host, the workers phone in over the network and get handed jobs, so far so normal. (Basically the "divide-and-conquer pipeline" from the zeromq manual, with job-specific wrinkles. That's not important right now.)
The controller's core data structure is unordered_map<string, queue<string>> in pseudo-C++ (the controller is actually implemented in Python, but I am open to the possibility of rewriting it in something else). The strings in the queues define jobs, and the keys of the map are a categorization of the jobs. The controller is seeded with a set of jobs; when a worker starts up, the controller removes one string from one of the queues and hands it out as the worker's first job. The worker may crash during the run, in which case the job gets put back on the appropriate queue (there is an ancillary table of outstanding jobs). If it completes the job successfully, it will send back a list of new job-strings, which the controller will sort into the appropriate queues. Then it will pull another string off some queue and send it to the worker as its next job; usually, but not always, it will pick the same queue as the previous job for that worker.
Now, the question. This data structure currently sits entirely in main memory, which was fine for small-scale test runs, but at full scale is eating all available RAM on the controller, all by itself. And the controller has several other tasks to accomplish, so that's no good.
What approach should I take? So far, I have considered:
a) to convert this to a primarily-on-disk data structure. It could be cached in RAM to some extent for efficiency, but jobs take tens of seconds to complete, so it's okay if it's not that efficient,
b) using a relational database - e.g. SQLite, (but SQL schemas are a very poor fit AFAICT),
c) using a NoSQL database with persistency support, e.g. Redis (data structure maps over trivially, but this still appears very RAM-centric to make me feel confident that the memory-hog problem will actually go away)
Concrete numbers: For a full-scale run, there will be between one and ten million keys in the hash, and less than 100 entries in each queue. String length varies wildly but is unlikely to be more than 250-ish bytes. So, a hypothetical (impossible) zero-overhead data structure would require 234 – 237 bytes of storage.
Ultimately, it all boils down on how you define efficiency needed on part of the controller -- e.g. response times, throughput, memory consumption, disk consumption, scalability... These properties are directly or indirectly related to:
number of requests the controller needs to handle per second (throughput)
acceptable response times
future growth expectations
From your options, here's how I'd evaluate each option:
a) to convert this to a primarily-on-disk data structure. It could be
cached in RAM to some extent for efficiency, but jobs take tens of
seconds to complete, so it's okay if it's not that efficient,
Given the current memory hog requirement, some form of persistent storage seems a reaonsable choice. Caching comes into play if there is a repeatable access pattern, say the same queue is accessed over and over again -- otherwise, caching is likely not to help.
This option makes sense if 1) you cannot find a database that maps trivially to your data structure (unlikely), 2) for some other reason you want to have your own on-disk format, e.g. you find that converting to a database is too much overhead (again, unlikely).
One alternative to databases is to look at persistent queues (e.g. using a RabbitMQ backing store), but I'm not sure what the per-queue or overall size limits are.
b) using a relational database - e.g. SQLite, (but SQL schemas are a
very poor fit AFAICT),
As you mention, SQL is probably not a good fit for your requirements, even though you could surely map your data structure to a relational model somehow.
However, NoSQL databases like MongoDB or CouchDB seem much more appropriate. Either way, a database of some sort seems viable as long as they can meet your throughput requirement. Many if not most NoSQL databases are also a good choice from a scalability perspective, as they include support for sharding data across multiple machines.
c) using a NoSQL database with persistency support, e.g. Redis (data
structure maps over trivially, but this still appears very RAM-centric
to make me feel confident that the memory-hog problem will actually go
away)
An in-memory database like Redis doesn't solve the memory hog problem, unless you set up a cluster of machines that each holds a part of the overall data. This makes sense only if keeping all data in-memory is needed due to low response times requirements. Yet, given the nature of your jobs, taking tens of seconds to complete, response times, respective to workers, hardly matter.
If you find, however, that response times do matter, Redis would be a good choice, as it handles partitioning trivially using either client-side consistent-hashing or at the cluster level, thus also supporting scalability scenarios.
In any case
Before you choose a solution, be sure to clarify your requirements. You mention you want an efficient solution. Since efficiency can only be gauged against some set of requirements, here's the list of questions I would try to answer first:
*Requirements
how many jobs are expected to complete, say per minute or per hour?
how many workers are needed to do so?
concluding from that:
what is the expected load in requestes/per second, and
what response times are expected on part of the controller (handing out jobs, receiving results)?
And looking into the future:
will the workload increase, i.e. does your solution need to scale up (more jobs per time unit, more more data per job?)
will there be a need for persistency of jobs and results, e.g. for auditing purposes?
Again, concluding from that,
how will this influence the number of workers?
what effect will it have on the number of requests/second on part of the controller?
With these answers, you will find yourself in a better position to choose a solution.
I would look into a message queue like RabbitMQ. This way it will first fill up the RAM and then use the disk. I have up to 500,000,000 objects in queues on a single server and it's just plugging away.
RabbitMQ works on Windows and Linux and has simple connectors/SDKs to about any kind of language.
https://www.rabbitmq.com/
We have a existing setup where upstream systems send messages to us on a Message Queue and we process these messages.The content is xml and we simply unmarshal.This unmarshalling step is followed by a write to db (to put relevant values onto relevant columns).
The system is set to interface with many more upstream systems and our volumes are going to increase to a peak size of 40mm per day.
Our current way of processing is have listeners on the queues and then have a multiple threads of producers and consumers which do the unmarshalling and subsequent db write.
My question : Can this process fit into the Storm use case scenario?
I mean can MQ be my spout and I have 2 bolts one to unmarshal and this then becomes the spout for the next bolt which does the write to db?
If yes,what is the benefit that I can derive? Is it a goodbye to cumbersome multi threaded producer/worker pattern of code.
If its as simple as the above then where/why would one want to resort to the conventional multi threaded approach to producer/consumer scenario
My point being is there a data volume/frequency at which Storm starts to shine when compared to the conventional approach.
PS : I'm very new to this and trying to get a hang of this and want to ascertain if the line of thinking is right
Regards,
CVM
Definitely this scenario can fit into a storm topology. The spouts can pull from MQ and the bolts can handle the unmarshalling and subsequent processing.
The major benefit over conventional multi threaded pattern is the ability to add more worker nodes as the load increases. This is not so easy with traditional producer consumer patterns.
Specific data volume number is a very broad question since it depends on a large number of factors like hardware etc.