My current server-side setup is like this -- I have a manager (with a poller) which waits for incoming requests for work to do. Once something is received, it creates a worker (with a separate poller and separate ports/sockets) for the job, and from then on the worker communicates directly with the client.
What I observe is that when there is intense traffic on any of the workers, it somewhat disables the manager -- ReceiveReady events fire with significant delays.
NetMQ documentation states: "Receiving messages with poller is slower than directly calling Receive method on the socket. When handling thousands of messages a second, or more, poller can be a bottleneck." I am far below this limit (say, 100 messages in a row), but I wonder whether having multiple pollers in a single program clips performance even further.
I prefer having separate instances because the code is cleaner (separation of concerns), but maybe I am going against the principles of ZeroMQ? The question is -- is using multiple pollers in a single program sound performance-wise? Or, in reverse -- do multiple pollers starve each other by design?
Professional system analysis may even require you to run multiple Poller() instances:
Design the system based on facts and requirements, rather than listening to popularised opinions.
Implement performance benchmarks and measure details of the actual implementation. Comparing facts against thresholds is what is known as a Fact-Based Decision.
If not hunting for the last few hundreds of [ns], a typical scenario may look this way:
your core logic inside an event-responding loop has to handle several classes of ZeroMQ-integrated signalling / messaging inputs/outputs, all in a principally non-blocking mode, plus your design has to spend a specific amount of relative attention on each such class.
One may accept some higher inter-process latencies for a remote keyboard ( running a CLI-interface "across" a network ), while the event-loop has to meet a strict requirement not to miss any "fresh" update from a QUOTE-stream. So one has to create a light-weight Real-Time-SCHEDULER logic, which will introduce one high-priority Poller() for non-blocking ( zero-wait ) reads, another one with a ~ 5 ms test on reading the "slow" channels, and another one with a 15 ms test on reading the main data-flow pipe. If you have profiled your event-handling routines not to last more than 5 ms worst case, you can still meet a TAT of 25 ms, so your event-loop can serve systems with a requirement for a stable 40 Hz control-loop cycle.
Without a set of several "specialised" pollers, one cannot get this level of scheduling determinism with an easily expressed core logic to integrate into such principally stable control-loops.
Q.E.D.
I use a similar design to drive heterogeneous distributed systems for FOREX trading, based on external AI/ML predictors, where transaction times are kept under ~ 70 ms ( end-to-end TAT, from a QUOTE arrival to an AI/ML-advised XTO order instruction being submitted ), precisely due to the need to match the real-time constraints of the control-loop scheduling requirements.
Epilogue:
If the documentation says something about poller performance in the ranges above 1 kHz signal delivery, but does not mention anything about the duration of the signal/message handling process, it does the public a poor service.
The first step is to measure the process latencies; next, analyse the performance envelopes. All ZeroMQ tools are designed to scale, and so should the application infrastructure -- so forget about any SLOC-sized examples. The bottleneck is not the poller instance, but poor application use of the available ZeroMQ components ( given that a known performance envelope was taken into account ). One can always increase the overall processing capacity available -- with ZeroMQ we are in the distributed-systems realm from Day 0, aren't we?
So in concisely designed + monitored + adaptively scaled systems no choking will appear.
Related
I am not sure if my ZeroMQ Majordomo implementation is correct. I believe the guide suggests that this pattern can handle tens of thousands of messages per second using a synchronous round-trip pattern. My present solution seems to struggle to send more than one thousand messages per second. My goal is to get as close as I can to running at least ten thousand messages per second.
I am running all of the components of the Majordomo pattern on separate Windows 2012 servers, each with 12 processors and 32 GB of RAM, so I am sure there cannot be resource constraints. All of these servers are running within the same network as well, meaning they are not traversing a firewall. My code runs slower due to the business logic incorporated into it, so for my speed testing I went back to the simple test code that is provided in the ZeroMQ guide. These are supposedly the clients that are used in the guide to show the messages per second. I have tried a few different languages as well, including Delphi and C#, neither of which seems to be able to reach the promised speeds.
Code can be found for those here:
http://zguide.zeromq.org/page:all
I am wondering if I am expecting too much from the pattern. Round-trip message times in the pattern seem to sit around 25 ms when sending from 100 clients to a single broker, then to 100 workers and back through the pattern. This seems slow, and sending 10000 messages from these clients takes about 4 seconds, which isn't anywhere near the promise of tens of thousands per second. Am I expecting too much of this pattern, or is there something I'm missing here?
By the way, I've seen posts about HWM (high water mark) being hit, but given that we're in a synchronous pattern, I don't believe that could be an issue, since we can only queue a maximum number of messages equivalent to our client count.
Q : ...struggles to send more than one thousand messages per second ... goal is to get as close as I can to running at least ten thousand messages per second.
The ZeroMQ eco-system has evolved beyond the state it was in when the performance testing was published.
The best first step :
is to explicitly publish all details of the System-under-Test, a.k.a. the SuT: for each of the hosts, its hardware details ( best by an hwloc-alike "fingerprint" ), O/S configuration details ( priorities, background workloads, costs of a virtualised machine if applicable -- i.e. how many vCPU-clocks were actually stolen from "inside" the VM by the external hypervisor ), interface details ( O/S buffer settings, free I/F capacity, ToS / VLAN or other LAN-interconnect performance-modifying details ), and the LAN-interconnect L2-switching silicon's theoretical performance ceilings plus the background workloads present prior to, during and after the SuT episodes.
Next comes the code :
a full copy of all the code, a.k.a. an MCVE ( inspect-able and repeat-able ) code. Given that no code is present here so far, no one can judge your actual observations as to the root cause of the observed ~ 1k [msgs/s].
Next comes a definition of "reference point" to be used in test :
at which defined / given "reference" point(s) the SuT ought to get measured, and what the expected ranges for the expected TAT in [ms] are.
I have worked with ZeroMQ DLLs for rather a long time ( since 2.11, so it is indeed quite a long time ) and I believe one can easily either overload an AccessNode with ingress traffic or, just as easily, suffocate the ZeroMQ infrastructure, if an improper configuration was let in. Service performance will decrease, if not degrade, if no due measures were taken to configure all the resources that sit on the service's critical path.
Without hard facts ( as declared in the previous few steps ), there remain but guesses or ( as yet ) unsupported opinions.
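In that spirit, a first hard fact is easy to collect: a self-contained round-trip meter ( a pyzmq sketch, not the asker's Majordomo code; the endpoint name and message size are arbitrary assumptions ) that reports how many synchronous round-trips per second the bare REQ/REP path delivers on the given host, before any broker logic is added:

```python
import threading
import time
import zmq

def echo_server(ctx, n):
    # minimal REP peer: echo each of the n requests straight back
    rep = ctx.socket(zmq.REP)
    rep.bind("inproc://bench")
    for _ in range(n):
        rep.send(rep.recv())
    rep.close()

def measure_rtt(n=1000):
    """Time n synchronous round-trips and return the [round-trips/s] rate."""
    ctx = zmq.Context.instance()
    t = threading.Thread(target=echo_server, args=(ctx, n))
    t.start()
    req = ctx.socket(zmq.REQ)
    req.connect("inproc://bench")
    t0 = time.perf_counter()
    for _ in range(n):
        req.send(b"x")                 # strictly synchronous: send, then wait
        req.recv()
    dt = time.perf_counter() - t0
    t.join()
    req.close()
    return n / dt
```

Running the same meter over `tcp://` between the real SuT hosts then separates the network's share of the observed ~ 25 ms from the pattern's share.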
I found that people don't recommend sending large messages with ZeroMQ. But it is a real headache for me to split the data ( it is somewhat interrelated ). Why is this not recommended -- is there some specific reason? Can it be overcome?
Why is this not recommended?
Resources ...
Even the best Zero-Copy implementation has to have spare resources to store the payloads in several principally independent, separate locations:
|<fatMessageNo1>|
|...............|__________________________________________________________ RAM
|...............|<fatMessageNo1>|
|...............|...............|__________________Context().Queue[peerNo1] RAM
|...............|...............|<fatMessageNo1>|
|...............|...............|...............|________O/S.Buffers[L3/L2] RAM
Can it be overcome?
Sure -- do not send Mastodon-sized GB+ messages. One may use any kind of off-RAM representation thereof and send just a lightweight reference that allows a remote peer to access such an immense beast.
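A minimal sketch of that reference-passing idea ( the locator format, the file-based store and the helper names are all invented for illustration; a production system might point into shared memory, a database or an object store instead ):

```python
import json
import tempfile

def stash_payload(big_bytes):
    """Persist the fat payload outside the messaging path; return a small
    locator message that is cheap to send through ZeroMQ."""
    f = tempfile.NamedTemporaryFile(delete=False)
    f.write(big_bytes)
    f.close()
    return json.dumps({"kind": "file-ref",
                       "path": f.name,
                       "size": len(big_bytes)}).encode()

def fetch_payload(ref_msg):
    """Receiver side: resolve the lightweight locator back into the payload."""
    ref = json.loads(ref_msg)
    with open(ref["path"], "rb") as f:
        return f.read()
```

Only the few-hundred-byte locator travels through the Context() queues and O/S buffers; the multi-GB beast never does.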
Many new questions added via comment:
I was concerned more about something like transmission failure: what will ZeroMQ do ( will it try to retransmit automatically, will it be transparent for me, etc. )? RAM is not so crucial -- servers can have more than enough of it, and the service that we write is not intended to have a huge number of clients at the same time. The data that I talk about is very interrelated ( we have molecules/atoms info and bonds between them ), so it is impossible to send a chunk of it and use it -- we need it all )) – Paul
You may already be aware that ZeroMQ works under a Zen-of-Zero, where a zero-warranty also has its place.
So, a ZeroMQ-dispatched message will either be delivered "through" error-free, or not delivered at all. This is a great pain-saver, as your code will receive only fully protected content, atomically, so no tortured trash will ever reach your target post-processing. Higher-level soft-protocol handshaking allows one to remain in control, enabling mitigation of non-delivered cases from higher levels of abstraction. So, if your design appetite and deployment conditions permit, one can harness brute force and send whatever-[TB] BLOBs, at one's own risk of blocking both local and infrastructure resources, if others permit and don't mind ( ... but never on my advice :o) ).
Error-recovery self-healing -- from lost connection(s) and similar real-life issues -- is handled if configuration, resources and timeouts permit, so a lot of trouble with L1/L2/L3 ISO-OSI layer issues is efficiently hidden from user-app programmers.
I have a setup where a master process is setting up a ZMQ_ROUTER and then forks many child processes, which then connect to that router.
Whenever a child zmq_connect()'s to the master, one file descriptor is occupied.
This however limits the number of interacting processes to the number of allowed file descriptors ( per process ). For me ( linux ), this currently is just 1024.
That is way too small for my intended use ( a multi-agent / swarm simulation ).
Answer:
You can't, except when using an inter-thread socket type ( the inproc:// transport-class ). All other protocols use one file descriptor per connection.
One new approach to reducing the number of necessary file descriptors per application, if that application exposes several services ( e.g. several tcp://<address:port> connections can be made to it ), seems to be to use the resource property, which allows one to combine several services on one endpoint.
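The inproc:// escape hatch from the first paragraph can be sketched like this ( pyzmq, with an invented endpoint name and agent count ): every agent living inside the master process talks over inproc:// and spends no file descriptor per connection, so only genuinely remote peers count against the 1024 limit:

```python
import threading
import zmq

ctx = zmq.Context.instance()
router = ctx.socket(zmq.ROUTER)
router.bind("inproc://master")                 # the master's single endpoint

def agent(agent_id):
    s = ctx.socket(zmq.DEALER)
    s.setsockopt(zmq.IDENTITY, agent_id)
    s.connect("inproc://master")               # no file descriptor spent here
    s.send(b"hello from " + agent_id)
    s.close()

threads = [threading.Thread(target=agent, args=(b"agent-%d" % i,))
           for i in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()

greetings = []
while router.poll(100):                        # drain whatever the agents sent
    ident, msg = router.recv_multipart()       # ROUTER prepends the identity
    greetings.append(msg)
```

The same ROUTER can additionally bind a `tcp://` endpoint for off-host agents, so the fd budget is spent only where the O/S actually requires it.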
The Swarm first:
First of all, a smart solution for a massive herd of agents requires both flexibility ( in the swarm framework design, for feature additions ) and efficiency ( for both scalability and speed ), so as to achieve the fastest possible simulation run-times, in spite of PTIME & PSPACE obstacles, with the possible risk of wandering into the EXPTIME zone in more complex inter-agent communication schemes.
Efficiency next:
At first, my guess was to rather use and customise a bit the lighter-weight POSIX-based signalling/messaging framework nanomsg -- a younger sister of ZeroMQ from Martin SUSTRIK, co-father of ZeroMQ -- where the Context()-less design plus additional features like the SURVEY and BUS messaging archetypes are particularly attractive for swarms with your own software-designed, problem-domain-specific messaging/signalling protocols:
100k+ with file_descriptors
Well, you need courage. Doable, but sleeves up: it will require hands-on efforts in the kernel and tuning of system settings, and you will pay for having such scale with increased overheads.
Andrew Hacking has explained both the PROS and CONS of "just" increasing the fd count ( not only on the kernel side of the system tuning and configuration ).
Other factors to consider are that, while some software may use sysconf(OPEN_MAX) to dynamically determine the number of files that may be opened by a process, a lot of software still uses the C library's default FD_SETSIZE, which is typically 1024 descriptors, and as such can never have more than that many files open, regardless of any administratively defined higher limit.
Andrew has also directed your kind attention to this, which may serve as an ultimate report on how to set up a system for 100k-200k connections.
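On the POSIX side, the per-process ceiling itself can be probed and lifted ( only up to the hard limit; going beyond that is the root/sysctl work the report covers ) with a few lines -- a sketch using Python's stdlib resource module:

```python
import resource

def fd_limits():
    """Return the current (soft, hard) RLIMIT_NOFILE pair for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)

def raise_soft_fd_limit(target):
    """Lift the soft fd limit towards target, capped at the hard ceiling;
    an unprivileged process cannot exceed the hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return new_soft
```

The same knobs correspond to `ulimit -n` / `/etc/security/limits.conf` on Linux; the snippet only moves the soft limit, which is what the 1024 default actually is.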
Do static scales above 100k per host make any real sense for swarm simulations?
While still "technically" doable, there are further limits -- even nanomsg will not be able to push more than about 1,000,000 [MSGs/s], which is fairly well enough for most applications, which cannot keep pace with this native speed of message dispatch. Citations state some ~ 6 [us] for CPU-core-to-CPU-core transfer latencies, and if the user-designed swarm-processing application cannot make the sending loop shorter than some 3-4 [us], the performance ceiling is not anywhere close to causing an issue.
How to scale above that?
Distributed multi-host processing is the first dimension along which to attack the static scale of the swarm. Next would be the need to introduce RDMA-injection so as to escape the performance bottleneck of any stack-processing in the implementation of the distributed messaging / signalling. Yes, this can move your Swarm system into the nanosecond-scale latency zone, but at the cost of building an HPC infrastructure ( which would be a great Project, if your Project sponsor can finance such an undertaking -- and pls. do let me know if yes, I would be more than keen to join such a swarm-intelligence HPC-lab ). It is worth knowing about this implication before deciding on an architecture, as knowing the ultimate limits is the key to doing it well from the very beginning.
IOCP is great for many connections, but what I'm wondering is: is there a significant benefit to allowing multiple pending receives or multiple pending writes per individual TCP socket, or am I not really going to lose performance if I just allow one pending receive and one pending send per socket ( which really simplifies things, as I don't have to deal with out-of-order completion notifications )?
My general use case is 2 worker threads servicing the IOCP port, handling several connections (more than 2 but fewer than 10), where the transmitted data is of either of two forms: one is frequent very small messages (which I combine manually if possible, but generally need to send often enough that the per-send data is still pretty small), and the other is transferring large files.
Multiple pending recvs tend to be of limited use unless you plan to turn off the network stack's recv buffering, in which case they're essential. Bear in mind that if you DO decide to issue multiple pending recvs, then you must do some work to make sure you process them in the correct sequence. Whilst the recvs will complete from the IOCP in the order in which they were issued, thread-scheduling issues may mean that they are processed by different I/O threads in a different order, unless you actively work to ensure that this is not the case; see here for details.
Multiple pending sends are more useful for fully utilising the TCP connection's available TCP window ( and sending at the maximum rate possible ), but only if you have lots of data to send, only if you want to send it as efficiently as you can, and only if you take care to ensure that you don't have too many pending writes. See here for details of issues you can run up against if you don't actively manage the number of pending writes.
For less than 10 connections and TCP, you probably won't feel any difference even at high rates. You may see better performance by simply growing your buffer sizes.
Queuing up I/Os is going to help if your application is bursty and expensive to process. Basically, it lets you perform the costly work up front, so that when the burst comes in, you're using a little of the CPU on I/O and as much of it as possible on processing.
We have to make our system highly scalable; it has been developed for the Windows platform using VC++. Say that initially we would like to process 100 requests ( from MSMQ ) simultaneously. What would be the best approach: a single process with 100 threads, or 2 processes with 50 threads each? What is the gain, apart from process memory, in the second approach? In Windows, is CPU time first allocated to the process and then split between that process's threads, or does the OS count the number of threads for each process and allocate CPU on the basis of threads rather than processes? We notice that in the first case CPU utilization is 15-25 % and we want to consume more CPU. Remember that we would like to get optimal performance, so the 100 requests are just an example. We have also noticed that if we increase the number of threads in the process above 120, performance degrades due to context switches.
One more point: our product already supports clustering, but we want to utilize more CPU on a single node.
Any suggestions will be highly appreciated.
You can't process more requests than you have CPU cores. "Fast" scalable solutions involve setting up thread pools, where the number of active ( not blocked on I/O ) threads == the number of CPU cores. So creating 100 threads because you want to service 100 MSMQ requests is not good design.
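That sizing rule -- queue the 100 requests, but keep the number of runnable workers equal to the core count -- can be sketched like this ( the original context is VC++/MSMQ; Python is used here only to show the pool shape, and for CPU-bound pure-Python work a process pool would be the real choice because of the GIL ):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def handle_request(req):
    # placeholder for the real per-message processing logic
    return req * 2

def serve(requests):
    """100 queued requests, but never more than cpu_count() running at once;
    the pool's internal queue holds the rest until a worker frees up."""
    with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
        return list(pool.map(handle_request, requests))
```

The point is that the request count (100) sizes the queue, while the core count sizes the pool -- the two numbers are deliberately decoupled.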
Windows has a thread pooling mechanism called IO Completion Ports.
Using IO Completion Ports does push the design towards a single process, as in a multi-process design each process would have its own IO Completion Port thread pool that it would manage independently, and hence you could get a lot more threads contending for the CPU cores.
The "core" idea of an IO Completion Port is that it's a kernel-mode queue -- you can manually post events to the queue, or get asynchronous IO completions posted to it automatically by associating file ( file, socket, pipe ) handles with the port.
On the other side, the IO Completion Port mechanism automatically dequeues events onto waiting worker threads -- but it does NOT dequeue jobs if it detects that the number of currently "active" threads in the thread pool >= the number of CPU cores.
Using IO Completion Ports can potentially increase the scalability of a service a lot; usually, however, the gain is a lot smaller than expected, as other factors quickly come into play when all the CPU cores are contending for the service's other resources.
If your services are developed in C++, you might find that serialized access to the heap is a big performance minus -- although Windows 6.1 seems to have implemented a low-contention heap, so this might be less of an issue.
To summarize: theoretically, your biggest performance gains would come from a design using thread pools managed in a single process. But you are heavily dependent on the libraries you are using not serializing access to critical resources, which can quickly lose you all the theoretical performance gains.
If you do have library code serializing your nicely thread-pooled service ( as in the case of C++ object creation & destruction being serialized because of heap contention ), then you need to change your use of the library, switch to a low-contention version of the library, or just scale out to multiple processes.
The only way to know is to write test cases that stress the server in various ways and measure the results.
The standard approach on Windows is multiple threads. I'm not saying that is always your best solution, but there is a price to be paid for each thread or process, and on Windows a process is more expensive. As for the scheduler, I'm not sure, but you can set the priority of the process and its threads. The real benefit of threads is their shared address space and the ability to communicate without IPC; however, synchronization must be carefully maintained.
If your system is already developed, which it appears to be, it is likely to be easier to implement a multiple-process solution, especially if there is a chance that later more than one machine may be utilized, as IPC between 2 processes on one machine can scale to multiple machines in the general case. Most attempts at massive parallelization fail because the entire system is not evaluated for bottlenecks. For example, if you implement 100 threads that all write to the same database, you may gain little in actual performance and just wait on your database.
Just my .02