Best practice to control the flow on a ZMQ Streamer device? - zeromq

Imagine I have a streaming device like the one in the bellow template.
image from here
def worker(backend_port):
context = zmq.Context()
socket = context.socket(zmq.PULL)
socket.connect("tcp://127.0.0.1:%d" % backend_port)
while True:
task = socket.recv()
# do something
def streamer(frontend_port):# it's called streamer because streams files from hdfs
context = zmq.Context()
socket = context.socket(zmq.PUSH)
socket.connect("tcp://127.0.0.1:%d" % frontend_port)
while True:
# get the file
# prepare it
socket.send(msg, copy=False)
number_of_workers = 16
number_streamers = 10
frontend_port = 7559
backend_port = 7560
streamerdevice = ProcessDevice(zmq.STREAMER, zmq.PULL, zmq.PUSH)
streamerdevice.bind_in("tcp://127.0.0.1:%d" % frontend_port )
streamerdevice.bind_out("tcp://127.0.0.1:%d" % backend_port)
streamerdevice.setsockopt_in(zmq.IDENTITY, b'PULL')
streamerdevice.setsockopt_out(zmq.IDENTITY, b'PUSH')
streamerdevice.start()
for i in range(number_of_workers):
p = Process(target=yolo, args=(backend_port,))
p.start()
for i in range(number_streamers):
s = Process(target=streamer, args=(frontend_port,))
s.start()
To give more information messages are images so are on the bigger size. I am using zero-copy. Performance is the most important point in my case since I am running on a very big scale.
Is there a way to multi-thread that streamerdevice?
How can I control the flow? is there a way to know how many messages are waiting to get processed?
one of the main goals is to make the receiving time as fast as possible on the worker side. Any suggestions for that?
Does it help to start the zmq.Context(nIoTHREADs) for nIoTHREADs>1 in worker side?

How to run a multi-threaded ZMQ device?
The simplest way to run multi-threaded ZeroMQ services is to instantiate the respective zmq.Context( nIoTHREADs ) for nIoTHREADs > 1. There are more performance tweaking configuration options for scaling performance. Review the pyzmq wrapper, how does this ZeroMQ tooling apply to actual the pyzmq wrapper code and ask maintainers, whether their setup and/or configuration remained fixed or just unpublished with this regard. ZeroMQ is flexible enough to provide means for doing this.
...device (which is a one way queue) is not scalable as you increase the number of input connections
Well, a device is neither a one way queue, nor is not scalable. Right the very opposite.
Is it possible to run it in a multi-threaded way?
Sure it is.
what is the best way to scale it?
That depends a lot on exact messaging and signaling requirements that your system design ought meet. High volume, ultra-low latency, almost linear scaling, adaptive connection handing, app-domain user-defined protocols are just a few of such properties to take into account.
Is it because the queue gets so big and that is why it gets slow?
The post contained zero pieces of information about neither the sizes, nor about any speed observations and/or expectations. In other, documented, test setups, ZeroMQ is known to be very cute in buffer-management, if zero-copy and other performance motivated approaches are used, so no one can help on a more detailed level.
is there a way to know how big the queue is at any moment?
No, there is no such "built-in" feature, that would spend time and resources on counting this. If such add-on feature is needed in your system, you are free to add such add-on, without making other use-cases wear a single bit of such additional code for use-cases, where a message counting and queue-size reporting is meaningless.
any load balancing strategy for the device without a REQ-REP pattern?
Yes, there are many ready-made Scalable Formal Communication Archetype Patterns plus there are scenarios one may compose for going beyond the ready-made bilateral distributed-behaviour Archetype primitives, that permit to build a custom-specific, controlled as required, load-balancer of one's own wish and taste.

Related

What are down sides of using ZeroMQ for sending large messages (up to gigabytes)?

I found that people don't recommend sending large messages with ZeroMQ. But it is a real headache for me to split the data (it is somewhat twisted). Why this is not recommended is there some specific reason? Can it be overcome?
Why this is not recommended?
Resources ...
Even the best Zero-Copy implementation has to have spare resources to store the payloads in several principally independent, separate locations:
|<fatMessageNo1>|
|...............|__________________________________________________________ RAM
|...............|<fatMessageNo1>|
|...............|...............|__________________Context().Queue[peerNo1] RAM
|...............|...............|<fatMessageNo1>|
|...............|...............|...............|________O/S.Buffers[L3/L2] RAM
Can it be overcome?
Sure, do not send Mastodon-sized-GB+ messages. May use any kind of an off-RAM representation thereof and send just a lightweight reference to allow a remote peer to access such an immense beast.
Many new questions added via comment:
I was concern more about something like transmission failure: what will zeromq do (will it try to retransmit automatically, will it be transparent for me etc). RAM is not so crucial - servers can have it more than enough and service that we write is not intended to have huge amount of clients at the same time. The data that I talk about is very interrelated (we have molecules/atoms info and bonds between them) so it is impossible to send a chunk of it and use it - we need it all)) – Paul 25 mins ago
You may be already aware that ZeroMQ is working under a Zen-of-Zero, where also a zero-warranty got its place.
So, a ZeroMQ dispatched message will either be delivered "through" error-free, or not delivered at all. This is a great pain-saver, as your code will receive only a fully-protected content atomically, so no tortured trash will ever reach your target post-processing. Higher level soft-protocol handshaking allows one to remain in control, enabling mitigations of non-delivered cases from higher levels of abstractions, so if your design apetite and deployment conditions permit, one can harness a brute force and send whatever-[TB]-BLOBs, at one's own risk of blocked both local and infrastructure resources, if others permit and don't mind ( ... but never on my advice :o) )
Error-recovery self-healing - from lost-connection(s) and similar real-life issues - is handled if configuration, resources and timeouts permit, so a lot of troubles with keeping L1/L2/L3-ISO-OSI layers issues are efficiently hidden from user-apps programmers.

ZeroMQ - Handling slow receivers without dropping

I have an architecture where I have a ROUTER socket and multiple DEALER sockets.
Some DEALER sockets will only send data, some will only receive data and others can do a mixture of both.
I have a scenario, where I have one DEALER socket, that is sending data at an extremely fast rate. This data is received by another DEALER, that will process this as fast as it can. The send rate is always going to be higher than the receive.
In my current setup the ZMQ_SNDHWM on my ROUTER socket gets hit for the receive client and will silently drop messages. I do not want this to be the case.
What is the best way to do so as to deal with this scenario?
I have looked at DEALER->DEALER on a different port, but this could be hard to maintain, depending on the number of sessions that are created I could potentially have to have one port per session.
The other way I can think of solving this is to do some pipe-lining in which the receiving DEALER socket will tell the sender when it is ready to receive but this seems to add a lot of complication to the overall protocol and a lot more state management. It also seems to defeat the ability to be able to naturally block on DEALER sockets which is really what I need in this case; the DEALER sockets will never have to communicate with any other socket.
Do not rely on blocking, the less on uncontrolled use of resources
In distributed-system there is not much space for optimistic beliefs and expectations. In many posts I advocate, where possible, not to rely on blocking states, as your code gets out-of-control and you cannot do anything about leaving such a state, but pray for a message to come any time soon, if ever.
Rather bear the responsibility, end-to-end, which in distributed-system means that you need to also design strategies how to survive a "remote" death and similar situations, that are outside of the range of your-domain-of-control, but which your code design has no right to abstract from.
Even if not willing to, an explicit flow-management is the way to go
Late 90-ies have demonstrated many flow-control strategies for distributed systems, so this is definitely not a new field.
Increasing the box-of-worm size does not help to manage the un-controlled / un-managed flow of events. While ZMQ_???HMW + ZMQ_???BUF may help somehow tweak non-brutal cases, where having a bit more space may temporarily postpone the principal weakness of un-controlled message-flows, yet the problem is like remaining stand still with closed eyes right in the middle of the cross-fire shooting. Such agent may survive, but it's survival is not a result of it's design cleverness, but just an accidental luck.
Token-passing seems to be the cheapest way how to throttle the flow so as to remain acceptable / process-able on the slowest node. Increasing such a node-processing performance may be done with a use of a static, an incrementally expanded or a fully adaptive pooling-proxy, so even this known bottleneck remains manageable and under your design-control.
The highest layer of robustness is in making your distributed-system's design resilient to spurious bursts of events ( be it messages or .connect() attempts ). So, independently of selecting the building blocks, designer has also the responsibility to design in all the survival-strategies. Not doing this leaves your system vulnerable to capacity-directed vector of attack or other sort of unhandled exploits of these kinds of known weakness vulnerabilites.

Splitting work between multiple pollers?

My current setup for work on server side is like this -- I have a manager (with poller) which waits for incoming requests for work to do. Once something is received it creates worker (with separate poller, and separate ports/sockets) for the job, and further on worker communicates directly with client.
What I observe that when there is some intense traffic with any of the worker it disables manager somewhat -- ReceiveReady events are fired with significant delays.
NetMQ documentation states "Receiving messages with poller is slower than directly calling Receive method on the socket. When handling thousands of messages a second, or more, poller can be a bottleneck." I am so far below this limit (say 100 messages in a row) but I wonder whether having multiple pollers in single program does not clip performance even further.
I prefer having separate instances because the code is cleaner (separation of concerns), but maybe I am going against the principles of ZeroMQ? The question is -- is using multiple pollers in single program performance wise? Or in reverse -- do multiple pollers starve each other by design?
Professional system analysis may even require you to run multiple Poller() instances:
Design system based on facts and requirements, rather than to listen to some popularised opinions.
Implement performance benchmarks and measure details about actual implementation. Comparing facts against thresholds is a.k.a. a Fact-Based-Decision.
If not hunting for the last few hundreds of [ns], a typical scenario may look this way:
your core logic inside an event-responding loop is to handle several classes of ZeroMQ integrated signallin / messaging inputs/outputs, all in a principally non-blocking mode plus your design has to spend specific amount of relative-attention to each such class.
One may accept some higher inter-process latencies for a remote-keyboard ( running a CLI-interface "across" a network, while your event-loop has to meet a strict requirement not to miss any "fresh" update from a QUOTE-stream. So one has to create a light-weight Real-Time-SCHEDULER logic, that will introduce one high-priority Poller() for non-blocking ( zero-wait ), another one with ~ 5 ms test on reading "slow"-channels and another one with a 15 ms test on reading the main data-flow pipe. If you have profiled your event-handling routines not to last more than 5 ms worst case, you still can handle TAT of 25 ms and your event-loop may handle systems with a requirement to have a stable control-loop cycle of 40 Hz.
Not using a set of several "specialised" pollers will not allow one to get this level of scheduling determinism with an easily expressed core-logic to integrate in such principally stable control-loops.
Q.E.D.
I use similar design so as to drive heterogeneous distributed systems for FOREX trading, based on external AI/ML-predictors, where transaction times are kept under ~ 70 ms ( end-to-end TAT, from a QUOTE arrival to an AI/ML advised XTO order-instruction being submitted ) right due to a need to match the real-time constraints of the control-loop scheduling requirements.
Epilogue:
If the documentation says something about a poller performance, in the ranges above 1 kHz signal delivery, but does not mention anything about a duration of a signal/message handling-process, it does a poor service for the public.
The first step to take is to measure the process latencies, next, analyse the performance envelopes. All ZeroMQ tools are designed to scale, so has the application infrastructure -- so forget about any SLOC-sized examples, the bottleneck is not the poller instance, but a poor application use of the available ZeroMQ components ( given a known performance envelope was taken into account ) -- one can always increase the overall processing capacity available, with ZeroMQ we are in a distributed-systems realm from a Day 0, aren't we?
So in concisely designed + monitored + adaptively scaled systems no choking will appear.

Can I prevent ZeroMQ from occupying file descriptors?

I have a setup where a master process is setting up a ZMQ_ROUTER and then forks many child processes, which then connect to that router.
Whenever a child zmq_connect()'s to the master, one file descriptor is occupied.
This however limits the number of interacting processes to the number of allowed file descriptors ( per process ). For me ( linux ), this currently is just 1024.
That is way too small for my intended use ( a multi-agent / swarm simulation ).
Answer:
You can't, except when using an inter-thread socket type ( using an inproc:// transport-class ). All other protocols use one file descriptor per connection.
One new approach to reduce the number of necessary file descriptors per application, if that application has several services ( e.g. several tcp://<address:port> connections can be made to ), seems to be to use the resource property resource property, which allows one to combine several services to one endpoint.
The Swarm first:
First of all, a smart solution of the massive herd of agents requires both flexibility ( in swarm framework design for features' additions ) and efficiency ( for both scalability and speed ) so as to achieve fastest possible simulation run-times, in spite of PTIME & PSPACE obstacles, with possible risk of wandering into EXPTIME zone in more complex inter-agent communication schemes.
Efficiency next:
At the first moment, my guess was to rather use and customise a bit light-weight-er POSIX-based signalling/messaging framework nanomsg -- a younger sister of ZeroMQ from Martin SUSTRIK, co-father of ZeroMQ -- where a Context()-less design plus additional features alike SURVEY and BUS messaging archetype are of particular attractivity for swarms with your own software-designed problem-domain-specific messaging/signalling protocols:
100k+ with file_descriptors
Well, you need a courage. Doable, but sleeves up, it will require hands on efforts in kernel, tuning in system settings and you will pay for having such scale by increased overheads.
Andrew Hacking has explained both the PROS and CONS of the "just" increasing fd count ( not only on the kernel side of the system tuning and configuration ).
Other factors to consider are that while some software may use sysconf(OPEN_MAX) to dynamically determine the number of files that may be open by a process, a lot of software still uses the C library's default FD_SETSIZE, which is typically 1024 descriptors and as such can never have more than that many files open regardless of any administratively defined higher limit.
Andrew has also directed your kind attention to this, that may serve as an ultimate report on how to setup a system for 100k-200k connections.
Do static scales above 100k per host make any real sense for swarm simulations?
While still "technically" doable, there are further limits -- even nanomsg will not be able to push more than about 1.000.000 [MSGs/s] which is fairly well enough for most applications, that cannot keep the pace with this native speed of message-dispatch. Citations state some ~6 [us] for CPU-core to CPU-core transfer latencies and if the user-designed swarm-processing application cannot make the sending loop under some 3-4 [us] the performance ceiling is not anywhere close to cause an issue.
How to scale above that?
A distributed multi-host processing is the first dimension to attack the static scale of the swarm. Next would be a need to introduce an RDMA-injection so as to escape from the performance bottleneck of any stack-processing in the implementation of the distributed messaging / signalling. Yes, this can move your Swarm system into nanosecond-scale latencies zone, but at the cost of building an HPC / high-tech computing infrastructure ( which would be a great Project, if your Project sponsor can adjust financing of such undertaking -- + pls. pls. do let me know if yes, would be more than keen to join such swarm intelligence HPC-lab ), but worth to know about this implication before deciding on architecture and knowing the ultimate limits is the key to do it well from the very beginning.

How to do performance and scalability testing without clear requirements?

Any idea how to do performance and scalability testing if no clear performance requirements have been defined?
More information about my application.
The application has 3 components. One component can only run on Linux, the other two components are Java programs so they can run on Linux/Windows/Mac... The 3 components can be deployed to one box or each component can be deployed to one box. Deployment is very flexible. The Linux-only component will capture raw TCP/IP packages over the network, then one Java component will get those raw data from it and assemble them into the data end users will need and output them to hard disk as data files. The last Java component will upload data from data files to my database in batch.
In the absence of 'must be able to perform X iterations within Y seconds...' type requirements, how about these kinds of things:
Does it take twice as long for twice the size of dataset? (yes = good)
Does it take 10x as long for twice the size of dataset? (yes = bad)
Is it CPU bound?
Is it RAM bound (eg lots of swapping to virtual memory)?
Is it IO / Disk bound?
Is there a certain data-set size at which performance suddenly falls off a cliff?
Surprisingly this is how most perf and scalability tests start.
You can clearly do the testing without criteria, you just define the tests and measure the results. I think your question is more in the lines 'how can I establish test passing criteria without performance requirements'. Actually this is not at all uncommon. Many new projects have no clear criteria established. Informally it would be something like 'if it cannot do X per second we failed'. But once you passed X per second (and you better do!) is X the 'pass' criteria? Usually not, what happens is that you establish a new baseline and your performance tests guard against regression: you compare your current numbers with the best you got, and decide if the new build is 'acceptable' as build validation pass (usually orgs will settle here at something like 70-80% as acceptable, open perf bugs, and make sure that by ship time you get back to 90-95% or 100%+. So basically the performance test themselves become their own requirement.
Scalability is a bit more complicated, because there there is no limit. The scope of your test should be to find out where does the product break. Throw enough load at anything and eventually it will break. You need to know where that limit is and, very importantly, find out how does your product break. Does it give a nice error message and revert or does it spills its guts on the floor?
Define your own. Take the initiative and describe the performance goals yourself.
To answer any better, we'd have to know more about your project.
If there has been 'no performance requirement defined', then why are you even testing this?
If there is a performance requirement defined, but it is 'vague', can you indicate in what way it is vague, so that we can better help you?
Short of that, start from the 'vague' requirement, and pick a reasonable target that at least in your opinion meets or exceeds the vague requirement, then go back to the customer and get them to confirm that your clarification meets their requirements and ideally get formal sign-off on that.
Some definitions / assumptions:
Performance = how quickly the application responds to user input, e.g. web page load times
Scalability = how many peak concurrent users the applicaiton can handle.
Firstly perfomance. Performance testing can be quite simple, such as measuring and recording page load times in a development environment and using techniques like applicaiton profiling to identify and fix bottlenecks.
Load. To execute a load test there are four key factors, you will need to get all of these in place to be successfull.
1. Good usage models of how users will use your site and/or application. This can be easy of the application is already in use, but it can be extermely difficult if you are launching a something new, e.g. a Facebook application.
If you can't get targets as requirements, do some research and make some educated assumptions, document and circulate them for feedback.
2. Tools. You need to have performance testing scripts and tools that can excute the scenarios defined in step 1, with the number of expected users in step 1. (This can be quite expensive)
3. Environment. You will need a production like environment that is isolated so your tests can produce repoducible results. (This can also be very expensive.)
4. Technical experts. Once the applicaiton and environment starts breaking you will need to be able to identify the faults and re-configure the environment and or re-code the application once faults are found.
Generally most projects have a "performance testing" box that they need to tick because of some past failure, however they never plan or budget to do it properley. I normally recommend to do budget for and do scalability testing properley or save your money and don't do it at all. Trying to half do it on the cheap is a waste of time.
However any good developer should be able to do performance testing on their local machine and get some good benefits.
rely on tools (fxcop comes to mind)
rely on common sense
If you want to test performance and scalability with no requirements then you should create your own requirements / specs that can be done in the timeline / deadline given to you. After defining the said requirements, you should then tell your supervisor about it if he/she agrees.
To test scalability (assuming you're testing a program/website):
Create lots of users and data and check if your system and database can handle it. MyISAM table type in MySQL can get the job done.
To test performance:
Optimize codes, check it in a slow internet connection, etc.
Short answer: Don't do it!
In order to get a (better) definition write a performance test concept you can discuss with the experts that should define the requirements.
Make assumptions for everything you don't know and document these assumptions explicitly. Assumptions comprise everything that may be relevant to your system's behaviour under load. Correct assumptions will be approved by the experts, incorrect ones will provoke reactions.
For all of those who have read Tom DeMarcos latest book (Adrenaline Junkies ...): This is the strawman pattern. Most people who are not willing to write some specification from scratch will not hesitate to give feedback to your document. Because you need to guess several times when writing your version you need to prepare for being laughed at when being reviewed. But at least you will have better information.
The way I usually approach problems like this is just to get a real or simulated realistic workload and make the program go as fast as possible, within reason. Then if it can't handle the load I need to think about faster hardware, doing parts of the job in parallel, etc.
The performance tuning is in two parts.
Part 1 is the synchronous part, where I tune each "thread", under realistic workload, until it really has little room for improvement.
Part 2 is the asynchronous part, and it is hard work, but needs to be done. For each "thread" I extract a time-stamped log file of when each message sent, each message received, and when each received message is acted upon. I merge these logs into a common timeline of events. Then I go through all of it, or randomly selected parts, and trace the flow of messages between processes. I want to identify, for each message-sequence, what its purpose is (i.e. is it truly necessary), and are there delays between the time of receipt and time of processing, and if so, why.
I've found in this way I can "cut out the fat", and asynchronous processes can run very quickly.
Then if they don't meet requirements, whatever they are, it's not like the software can do any better. It will either take hardware or a fundamental redesign.
Although no clear performance and scalability goals are defined, we can use the high level description of the three components you mention to drive general performance/scalability goals.
Component 1: It seems like a network I/O bound component, so you can use any available network load simulators to generate various work load to saturate the link. Scalability can be measure by varying the workload (10MB, 100MB, 1000MB link ), and measuring the response time , or in a more precise way, the delay associated with receiving the raw data. You can also measure the working set of the links box to drive a realistic idea about your sever requirement ( how much extra memory needed to receive X more workload of packets, ..etc )
Component 2: This component has 2 parts, an I/O bound part ( receiving data from Component 1 ), and a CPU bound part ( assembling the packets ), you can look at the problem as a whole, make sure to saturate your link when you want to measure the CPU bound part, if is is a multi threaded component, you can look for ways to improve look if you don't get 100% CPU utilization, and you can measure time required to assembly X messages, from this you can calculate average wait time to process a message, this can be used later to drive the general performance characteristic of your system and provide and SLA for your users ( you are going to guarantee a response time within X millisecond for example ).
Component 3: Completely I/O bound, and depends on both your hard disk bandwidth, and the back-end database server you use, however you can measure how much do you saturate disk I/O to optimize throughput, how much I/O counts do you require to read X MB of data, and improve around these parameters.
Hope that helps.
Thanks

Resources