Can I prevent ZeroMQ from occupying file descriptors? - zeromq

I have a setup where a master process is setting up a ZMQ_ROUTER and then forks many child processes, which then connect to that router.
Whenever a child zmq_connect()'s to the master, one file descriptor is occupied.
This however limits the number of interacting processes to the number of allowed file descriptors ( per process ). For me ( linux ), this currently is just 1024.
That is way too small for my intended use ( a multi-agent / swarm simulation ).
Answer:
You can't, except when using an inter-thread socket type ( using an inproc:// transport-class ). All other protocols use one file descriptor per connection.
One new approach to reduce the number of necessary file descriptors per application, if that application exposes several services ( e.g. several tcp://<address:port> endpoints that connections can be made to ), seems to be to use the resource property, which allows one to combine several services into one endpoint.
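For illustration, a minimal sketch ( the DEALER socket type, the inproc address name and the thread count are illustrative assumptions, not part of the original setup ) of moving the agents from forked processes to threads that share one Context over the inproc:// transport, where connections occupy no file descriptors:

import threading
import zmq

ctx = zmq.Context()                      # one shared Context is mandatory for inproc://

def agent(agent_id):
    # each agent is a thread, not a forked process; inproc:// peers occupy no fds
    sock = ctx.socket(zmq.DEALER)
    sock.setsockopt(zmq.IDENTITY, b"agent-%04d" % agent_id)
    sock.connect("inproc://swarm")
    sock.send(b"hello from agent")
    sock.close()

router = ctx.socket(zmq.ROUTER)
router.bind("inproc://swarm")            # bind the ROUTER before any agent connects

agents = [threading.Thread(target=agent, args=(i,)) for i in range(1000)]
for t in agents:
    t.start()

for _ in agents:
    ident, payload = router.recv_multipart()   # [ agent identity, message ]

The scaling limit then becomes the number of threads one process can carry, not the fd limit, which is a different trade-off to weigh for a swarm.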

The Swarm first:
First of all, a smart solution for a massive herd of agents requires both flexibility ( in the swarm framework design, for adding features ) and efficiency ( for both scalability and speed ), so as to achieve the fastest possible simulation run-times in spite of PTIME & PSPACE obstacles, with the possible risk of wandering into the EXPTIME zone in more complex inter-agent communication schemes.
Efficiency next:
My first guess was rather to use and customise the somewhat lighter-weight POSIX-based signalling/messaging framework nanomsg -- a younger sister of ZeroMQ from Martin SUSTRIK, co-father of ZeroMQ -- where a Context()-less design plus additional features like the SURVEY and BUS messaging archetypes are of particular attraction for swarms with your own software-designed, problem-domain-specific messaging/signalling protocols.
100k+ with file_descriptors
Well, you need courage. It is doable, but roll your sleeves up: it will require hands-on effort in the kernel, tuning of system settings, and you will pay for such scale with increased overheads.
Andrew Hacking has explained both the PROS and CONS of "just" increasing the fd count ( not only on the kernel side of system tuning and configuration ).
Other factors to consider are that while some software may use sysconf(OPEN_MAX) to dynamically determine the number of files that may be open by a process, a lot of software still uses the C library's default FD_SETSIZE, which is typically 1024 descriptors and as such can never have more than that many files open regardless of any administratively defined higher limit.
Andrew has also directed your kind attention to this, which may serve as an ultimate report on how to set up a system for 100k-200k connections.
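For completeness, a minimal sketch of the in-process side of such tuning ( the hard limit itself still has to be raised administratively, e.g. via limits.conf or systemd settings, which is outside this snippet ):

import resource

# raise this process' soft RLIMIT_NOFILE up to whatever hard limit the admin granted
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("before: soft=%d hard=%d" % (soft, hard))

resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
print("after : soft=%d hard=%d" % resource.getrlimit(resource.RLIMIT_NOFILE))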
Do static scales above 100k per host make any real sense for swarm simulations?
While still "technically" doable, there are further limits -- even nanomsg will not be able to push more than about 1.000.000 [MSGs/s], which is more than enough for most applications that cannot keep pace with this native speed of message-dispatch. Citations state some ~6 [us] for CPU-core to CPU-core transfer latencies, and if the user-designed swarm-processing application cannot run its sending loop under some 3-4 [us], this performance ceiling is nowhere close to causing an issue.
How to scale above that?
Distributed multi-host processing is the first dimension along which to attack the static scale of the swarm. Next comes the need to introduce RDMA-injection so as to escape the performance bottleneck of any stack-processing in the implementation of the distributed messaging / signalling. Yes, this can move your Swarm system into the nanosecond-scale latency zone, but at the cost of building an HPC-grade computing infrastructure ( which would be a great Project, if your Project sponsor can arrange financing for such an undertaking -- and please do let me know if yes, I would be more than keen to join such a swarm-intelligence HPC-lab ). It is worth knowing about this implication before deciding on an architecture; knowing the ultimate limits is the key to doing it well from the very beginning.

Related

Is there a standard for roundtrip message time in the majordomo pattern using ZeroMQ and how is that measured or found?

I am not sure if my ZeroMQ Majordomo implementation is correct. I believe the guide suggests that this pattern can handle tens of thousands of messages per second using a synchronous-roundtrip pattern. My present solution seems to struggle to send more than one thousand messages per second. My goal is to get as close as I can to running at least ten thousand messages per second.
I am running all of the components of the Majordomo pattern on separate Windows 2012 servers, each with 12 processors and 32 GB of RAM, so I am sure there cannot be resource constraints. All of these servers are running within the same network as well, meaning they are not traversing a firewall. My code runs slower due to the business logic incorporated into it, so for my speed testing I went back to the simple test code that is provided in the ZeroMQ guide. These are supposedly the clients that are used in the guide to show the messages per second. I have tried a few different languages as well, including Delphi and C#, neither of which seems to be able to reach the promised speeds.
Code can be found for those here:
http://zguide.zeromq.org/page:all
I am wondering if I am expecting too much from the pattern. Roundtrip message times in the pattern seem to sit around 25ms when sending from 100 clients to a single broker, and then to 100 workers and back through the pattern. This seems slow, and sending 10000 messages from these clients takes about 4 seconds, which isn't anywhere near the promise of tens of thousands per second. Am I expecting too much of this pattern, or is there something I'm missing here?
By the way, I've seen posts about HWM (high water mark) being hit, but given that we're in a synchronous pattern I don't believe that could be an issue, since we are only able to queue a maximum number of messages equivalent to our client count.
Q : ...struggle to send more than one thousand messages per second ... goal is to get as close as I can to running at least ten thousand messages per second.
The ZeroMQ eco-system has evolved beyond the state it was in when that performance testing was published.
The best first step :
is to explicitly publish all details of the System-under-Test - a.k.a. SuT: for each of the hosts, its hardware details ( best by an hwloc-alike "fingerprint" ), O/S configuration details ( priorities, background workloads, costs of a virtualised machine if applicable - i.e. how many vCPU-clocks were actually stolen from "inside" the VM by the external hypervisor ), interfaces - O/S buffer settings, free I/F capacity ( workload background ), ToS / VLAN or other LAN-interconnect performance-modifying details, LAN-interconnect L2-switching silicon theoretical performance ceilings, and the background workloads present prior to, during and after the SuT episodes.
Next comes the code :
a full copy of all the code - a.k.a. the MCVE-( inspect-able and repeat-able )-code. Given that no code is present here so far, no one can judge your actual observations, nor the root-cause of the observed ~ 1k [msgs/s].
Next comes a definition of "reference point" to be used in test :
at which defined / given "reference"-point(s) the SuT ought to get measured, and what the expected ranges for the TaT in [ms] are.
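As an illustration only -- a minimal sketch of one such "reference"-point measurement, assuming a plain REQ/REP echo service is already listening on a hypothetical tcp://127.0.0.1:5555 ( neither the port nor the pattern comes from the original post ):

import time
import zmq

ctx = zmq.Context()
req = ctx.socket(zmq.REQ)
req.connect("tcp://127.0.0.1:5555")

samples_ms = []
for _ in range(10000):
    t0 = time.perf_counter()
    req.send(b"ping")                    # round-trip: send ...
    req.recv()                           # ... and wait for the echo
    samples_ms.append((time.perf_counter() - t0) * 1e3)

samples_ms.sort()
print("median TaT = %.3f [ms]" % samples_ms[len(samples_ms) // 2])
print("p99    TaT = %.3f [ms]" % samples_ms[int(len(samples_ms) * 0.99)])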
I have worked with the ZeroMQ DLLs for rather a long time ( since 2.11 - so indeed quite a long time ) and I believe one can easily either overload an AccessNode with ingress traffic or, just as easily, suffocate the ZeroMQ infrastructure if an improper configuration was let in; service performance will decrease, if not degrade outright, if no due measures were taken to configure all the resources that sit on the service's critical path.
Without hard facts ( as requested in the previous few steps ) there remain only guesses or ( as yet ) unsupported opinions.

Best practice to control the flow on a ZMQ Streamer device?

Imagine I have a streaming device like the one in the below template.
( diagram of the STREAMER device, linked as an image in the original post )
import zmq
from zmq.devices import ProcessDevice
from multiprocessing import Process

def worker(backend_port):
    context = zmq.Context()
    socket = context.socket(zmq.PULL)
    socket.connect("tcp://127.0.0.1:%d" % backend_port)
    while True:
        task = socket.recv()
        # do something

def streamer(frontend_port):
    # it's called streamer because it streams files from hdfs
    context = zmq.Context()
    socket = context.socket(zmq.PUSH)
    socket.connect("tcp://127.0.0.1:%d" % frontend_port)
    while True:
        # get the file
        # prepare it
        msg = b""                                       # placeholder for the prepared image payload
        socket.send(msg, copy=False)

number_of_workers = 16
number_streamers = 10
frontend_port = 7559
backend_port = 7560

streamerdevice = ProcessDevice(zmq.STREAMER, zmq.PULL, zmq.PUSH)
streamerdevice.bind_in("tcp://127.0.0.1:%d" % frontend_port)
streamerdevice.bind_out("tcp://127.0.0.1:%d" % backend_port)
streamerdevice.setsockopt_in(zmq.IDENTITY, b'PULL')
streamerdevice.setsockopt_out(zmq.IDENTITY, b'PUSH')
streamerdevice.start()

for i in range(number_of_workers):
    p = Process(target=worker, args=(backend_port,))    # the original post used target=yolo here
    p.start()

for i in range(number_streamers):
    s = Process(target=streamer, args=(frontend_port,))
    s.start()
To give more information: messages are images, so they are on the bigger side. I am using zero-copy. Performance is the most important point in my case, since I am running at a very big scale.
Is there a way to multi-thread that streamerdevice?
How can I control the flow? Is there a way to know how many messages are waiting to get processed?
One of the main goals is to make the receiving time as fast as possible on the worker side. Any suggestions for that?
Does it help to start the zmq.Context( nIoTHREADs ) for nIoTHREADs > 1 on the worker side?
How to run a multi-threaded ZMQ device?
The simplest way to run multi-threaded ZeroMQ services is to instantiate the respective zmq.Context( nIoTHREADs ) for nIoTHREADs > 1. There are further performance-tweaking configuration options for scaling performance. Review the pyzmq wrapper to see how this ZeroMQ tooling applies to the actual pyzmq wrapper code, and ask the maintainers whether their setup and/or configuration remained fixed or is just unpublished in this regard. ZeroMQ is flexible enough to provide the means for doing this.
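A minimal sketch of that first step ( the thread count, HWM value and affinity mask below are illustrative guesses, not recommendations ):

import zmq

nIoTHREADs = 4
ctx = zmq.Context(io_threads=nIoTHREADs)   # more than one libzmq I/O thread

sock = ctx.socket(zmq.PULL)
sock.setsockopt(zmq.RCVHWM, 100000)        # one of the further tweaking options
sock.setsockopt(zmq.AFFINITY, 0b0011)      # let this socket use I/O threads 0 and 1
sock.bind("tcp://127.0.0.1:7560")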
...device (which is a one way queue) is not scalable as you increase the number of input connections
Well, a device is neither a one-way queue, nor is it non-scalable. Quite the opposite.
Is it possible to run it in a multi-threaded way?
Sure it is.
what is the best way to scale it?
That depends a lot on the exact messaging and signalling requirements that your system design ought to meet. High volume, ultra-low latency, almost linear scaling, adaptive connection handling, and app-domain user-defined protocols are just a few such properties to take into account.
Is it because the queue gets so big and that is why it gets slow?
The post contained zero information about either the sizes or any speed observations and/or expectations. In other, documented, test setups ZeroMQ is known to be very efficient in buffer-management, if zero-copy and other performance-motivated approaches are used, so no one can help on a more detailed level.
is there a way to know how big the queue is at any moment?
No, there is no such "built-in" feature that would spend time and resources on counting this. If such an add-on feature is needed in your system, you are free to add it, without making other use-cases, where message counting and queue-size reporting are meaningless, carry a single bit of such additional code.
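A minimal sketch of such an opt-in add-on, assuming one is willing to replace the built-in STREAMER device with a hand-rolled relay that owns its queue and can therefore report its depth ( all names and the polling policy are illustrative ):

import collections
import zmq

def monitored_streamer(frontend_port, backend_port):
    ctx = zmq.Context()
    frontend = ctx.socket(zmq.PULL)
    frontend.bind("tcp://127.0.0.1:%d" % frontend_port)
    backend = ctx.socket(zmq.PUSH)
    backend.bind("tcp://127.0.0.1:%d" % backend_port)

    backlog = collections.deque()          # the relay's own, inspectable queue
    poller = zmq.Poller()
    poller.register(frontend, zmq.POLLIN)
    backend_registered = False

    while True:
        # only watch the backend for writability while something is waiting,
        # otherwise an always-writable PUSH socket would make poll() spin
        if backlog and not backend_registered:
            poller.register(backend, zmq.POLLOUT)
            backend_registered = True
        elif not backlog and backend_registered:
            poller.unregister(backend)
            backend_registered = False

        events = dict(poller.poll())
        if frontend in events:
            backlog.append(frontend.recv(copy=False))
        if backlog and backend in events:
            backend.send(backlog.popleft(), copy=False)
        # len( backlog ) is the number of messages currently waiting inside this relay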
any load balancing strategy for the device without a REQ-REP pattern?
Yes, there are many ready-made Scalable Formal Communication Archetype Patterns, plus scenarios one may compose to go beyond the ready-made bilateral distributed-behaviour Archetype primitives, which permit one to build a custom, controlled-as-required load-balancer of one's own wish and taste.

Splitting work between multiple pollers?

My current setup for work on the server side is like this -- I have a manager (with poller) which waits for incoming requests for work to do. Once something is received it creates a worker (with a separate poller, and separate ports/sockets) for the job, and from then on the worker communicates directly with the client.
What I observe is that when there is some intense traffic with any of the workers, it disables the manager somewhat -- ReceiveReady events are fired with significant delays.
NetMQ documentation states "Receiving messages with poller is slower than directly calling Receive method on the socket. When handling thousands of messages a second, or more, poller can be a bottleneck." I am so far below this limit (say 100 messages in a row) but I wonder whether having multiple pollers in a single program does not clip performance even further.
I prefer having separate instances because the code is cleaner (separation of concerns), but maybe I am going against the principles of ZeroMQ? The question is -- is using multiple pollers in a single program sound performance-wise? Or, in reverse -- do multiple pollers starve each other by design?
Professional system analysis may even require you to run multiple Poller() instances:
Design the system based on facts and requirements, rather than listening to some popularised opinions.
Implement performance benchmarks and measure the details of the actual implementation. Comparing facts against thresholds is a.k.a. Fact-Based Decision-making.
If not hunting for the last few hundreds of [ns], a typical scenario may look this way:
your core logic inside an event-responding loop has to handle several classes of ZeroMQ-integrated signalling / messaging inputs/outputs, all in a principally non-blocking mode, plus your design has to spend a specific amount of relative attention on each such class.
One may accept some higher inter-process latencies for a remote keyboard ( running a CLI-interface "across" a network ), while your event-loop has to meet a strict requirement not to miss any "fresh" update from a QUOTE-stream. So one has to create a light-weight Real-Time-SCHEDULER logic that introduces one high-priority Poller() for non-blocking ( zero-wait ) reads, another one with a ~ 5 ms test for reading the "slow" channels, and another one with a 15 ms test for reading the main data-flow pipe. If you have profiled your event-handling routines not to last more than 5 ms worst case, you can still keep a TAT of 25 ms, and your event-loop may handle systems with a requirement for a stable control-loop cycle of 40 Hz.
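A minimal sketch of such a scheduler skeleton, assuming three hypothetical sockets for the QUOTE-stream, the "slow" channels and the main data-flow pipe ( addresses, socket types and handler stubs are illustrative, not taken from any real system ):

import zmq

ctx = zmq.Context()

quotes = ctx.socket(zmq.SUB)             # the QUOTE-stream that must never be missed
quotes.connect("tcp://127.0.0.1:5556")
quotes.setsockopt(zmq.SUBSCRIBE, b"")

slow = ctx.socket(zmq.PULL)              # the "slow" channels, e.g. the remote CLI
slow.connect("tcp://127.0.0.1:5557")

main_pipe = ctx.socket(zmq.PULL)         # the main data-flow pipe
main_pipe.connect("tcp://127.0.0.1:5558")

def handle_quote(msg): pass              # placeholder handler stubs,
def handle_slow(msg): pass               # each profiled to stay under ~5 ms
def handle_main(msg): pass

hi_prio = zmq.Poller();  hi_prio.register(quotes, zmq.POLLIN)
mid_prio = zmq.Poller(); mid_prio.register(slow, zmq.POLLIN)
lo_prio = zmq.Poller();  lo_prio.register(main_pipe, zmq.POLLIN)

while True:
    if hi_prio.poll(0):                  # zero-wait test of the QUOTE stream
        handle_quote(quotes.recv())
    if mid_prio.poll(5):                 # spend at most ~5 ms waiting on "slow" channels
        handle_slow(slow.recv())
    if lo_prio.poll(15):                 # and at most ~15 ms on the main data-flow pipe
        handle_main(main_pipe.recv())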
Without a set of several "specialised" pollers one will not get this level of scheduling determinism with an easily expressed core logic to integrate into such principally stable control-loops.
Q.E.D.
I use a similar design to drive heterogeneous distributed systems for FOREX trading, based on external AI/ML-predictors, where transaction times are kept under ~ 70 ms ( end-to-end TAT, from a QUOTE arrival to an AI/ML-advised XTO order-instruction being submitted ), precisely due to the need to match the real-time constraints of the control-loop scheduling requirements.
Epilogue:
If the documentation says something about poller performance in the ranges above 1 kHz signal delivery, but does not mention anything about the duration of the signal/message handling process, it does the public a poor service.
The first step to take is to measure the process latencies; next, analyse the performance envelopes. All ZeroMQ tools are designed to scale, and so must the application infrastructure -- so forget about any SLOC-sized examples; the bottleneck is not the poller instance, but poor application use of the available ZeroMQ components ( given that a known performance envelope was taken into account ). One can always increase the overall processing capacity available; with ZeroMQ we are in a distributed-systems realm from Day 0, aren't we?
So in concisely designed + monitored + adaptively scaled systems no choking will appear.

Are there practical limits to the number of cores accessing the same memory?

Will the current trend of adding cores to computers continue? Or is there some theoretical or practical limit to the number of cores that can be served by one set of memory?
Put another way: is the high powered desktop computer of the future apt to have 1024 cores using one set of memory, or is it apt to have 32 sets of memory, each accessed by 32 cores?
Or still another way: I have a multi-threaded program that runs well on a 4-core machine, using a significant amount of the total CPU. As this program grows in size and does more work, can I be reasonably confident more powerful machines will be available to run it? Or should I be thinking seriously about running multiple sessions on multiple machines (or at any rate multiple sets of memory) to get the work done?
In other words, is a purely multithreaded approach to design going to leave me in a dead end? (As using a single threaded approach and depending on continued improvements in CPU speed years back would have done?) The program is unlikely to be run on a machine costing more than, say $3,000. If that machine cannot do the work, the work won't get done. But if that $3,000 machine is actually a network of 32 independent computers (though they may share the same cooling fan) and I've continued my massively multithreaded approach, the machine will be able to do the work, but the program won't, and I'm going to be in an awkward spot.
Distributed processing looks like a bigger pain than multithreading was, but if that might be in my future, I'd like some warning.
Will the current trend of adding cores to computers continue?
Yes, the GHz race is over. It's not practical to ramp the speed any more on the current technology. Physics has gotten in the way. There may be a dramatic shift in the technology of fabricating chips that allows us to get round this, but it's not obviously 'just around the corner'.
If we can't have faster cores, the only way to get more power is to have more cores.
Or is there some theoretical or practical limit to the number of cores that can be served by one set of memory?
Absolutely there's a limit. In a shared memory system the memory is a shared resource and has a limited amount of bandwidth.
Max processes = (Memory Bandwidth) / (Bandwidth required per process)
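For a purely illustrative, hypothetical example: a host delivering ~50 GB/s of memory bandwidth to cores that each need to stream ~2 GB/s saturates at roughly 50 / 2 = 25 cores; adding more cores past that point buys no extra throughput.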
Now - that 'Bandwidth per process' figure will be reduced by caches, but caches become less efficient if they have to be coherent with one another because everyone is accessing the same area of memory. (You can't cache a memory write if another CPU may need what you've written)
When you start talking about huge systems, shared resources like this become the main problem. It might be memory bandwidth, CPU cycles, hard drive access, network bandwidth. It comes down to how the system as a whole is structured.
You seem to be really asking for a vision of the future so you can prepare. Here's my take.
I think we're going to see a change in the way software developers see parallelism in their programs. At the moment, I would say that a lot of software developers see the only way of using multiple threads as having lots of them doing the same thing. The trouble is they're all contending for the same resources. This then means lots of locking needs to be introduced, which causes performance issues, and subtle bugs which are infuriating and time consuming to solve.
This isn't sustainable.
Manufacturing worked out, at the beginning of the 20th Century, that the fastest way to build lots of cars wasn't to have lots of people working on one car, and then, when that one's done, move them all on to the next car. It was to split the process of building the car down into lots of small jobs, with the output of one job feeding the next. They called it assembly lines. In hardware design it's called pipe-lining, and I think we'll see software designs move to it more and more, as it minimizes the problem of shared resources.
Sure - There's still a shared resource on the output of one stage and the input of the next, but this is only between two threads/processes and is much easier to handle. Standard methods can also be adopted on how these interfaces are made, and message queueing libraries seem to be making big strides here.
There's not one solution for all problems though. This type of pipe-line works great for high throughput applications that can absorb some latency. If you can't live with the latency, you have no option but to go the 'many workers on a single task' route. Those are the ones you ideally want to be throwing at SIMD machines/Array processors like GPUs, but it only really excels with a certain type of problem. Those problems are ones where there's lots of data to process in the same way, and there's very little or no dependency between data items.
Having a good grasp of message queuing techniques and similar for pipelined systems, and utilising fine grained parallelism on GPUs through libraries such as OpenCL, will give you insight at both ends of the spectrum.
Update: Multi-threaded code may run on clustered machines, so this issue may not be as critical as I thought.
I was carefully checking out the Java Memory Model in the JLS, chapter 17, and found it does not mirror the typical register-cache-main memory model of most computers. There were opportunities there for a multi-memory machine to cleanly shift data from one memory to another (and from one thread running on one machine to another running on a different one). So I started searching for JVMs that would run across multiple machines. I found several old references--the idea has been out there, but not followed through. However, one company, Terracotta, seems to have something, if I'm reading their PR right.
At any rate, it rather seems that when PC's typically contain several clustered machines, there's likely to be a multi-machine JVM for them.
I could find nothing outside the Java world, but Microsoft's CLR ought to provide the same opportunities. C and C++ and all the other .exe languages might be more difficult. However, Terracotta's websites talk more about linking JVMs rather than one JVM on multiple machines, so their tricks might work for executable languages also (and maybe the CLR, if needed).

How to do performance and scalability testing without clear requirements?

Any idea how to do performance and scalability testing if no clear performance requirements have been defined?
More information about my application.
The application has 3 components. One component can only run on Linux, the other two components are Java programs so they can run on Linux/Windows/Mac... The 3 components can be deployed to one box or each component can be deployed to one box. Deployment is very flexible. The Linux-only component will capture raw TCP/IP packets over the network, then one Java component will get those raw data from it and assemble them into the data end users will need and output them to hard disk as data files. The last Java component will upload data from data files to my database in batch.
In the absence of 'must be able to perform X iterations within Y seconds...' type requirements, how about these kinds of things:
Does it take twice as long for twice the size of dataset? (yes = good)
Does it take 10x as long for twice the size of dataset? (yes = bad)
Is it CPU bound?
Is it RAM bound (e.g. lots of swapping to virtual memory)?
Is it IO / Disk bound?
Is there a certain data-set size at which performance suddenly falls off a cliff?
Surprisingly this is how most perf and scalability tests start.
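One way to probe the scaling questions listed above, sketched in Python under the assumption that your workload can be wrapped in a run(data) callable and that make_dataset(n) builds a dataset of a chosen size ( both callables are hypothetical placeholders ):

import time

def scaling_probe(run, make_dataset, sizes=(1000, 2000, 4000, 8000)):
    timings = []
    for n in sizes:
        data = make_dataset(n)
        t0 = time.perf_counter()
        run(data)                         # the workload under test
        timings.append(time.perf_counter() - t0)
    for i in range(1, len(sizes)):
        ratio = timings[i] / timings[i - 1]
        # ~2x per doubling suggests roughly linear scaling; much more points at a cliff
        print("%d -> %d items: %.2fx slower" % (sizes[i - 1], sizes[i], ratio))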
You can clearly do the testing without criteria; you just define the tests and measure the results. I think your question is more along the lines of 'how can I establish test passing criteria without performance requirements'. Actually this is not at all uncommon. Many new projects have no clear criteria established. Informally it would be something like 'if it cannot do X per second we failed'. But once you have passed X per second (and you had better!), is X the 'pass' criterion? Usually not; what happens is that you establish a new baseline and your performance tests guard against regression: you compare your current numbers with the best you got, and decide if the new build is 'acceptable' as a build validation pass (usually orgs will settle here on something like 70-80% as acceptable, open perf bugs, and make sure that by ship time you get back to 90-95% or 100%+). So basically the performance tests themselves become their own requirement.
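In other words, something as small as the following hypothetical gate, run against the best result recorded so far:

def build_validation_pass(current, baseline, floor=0.80):
    # "acceptable" here means keeping at least 80 % of the best measured throughput;
    # the 0.80 floor is an illustrative policy choice, not a universal constant
    return current >= floor * baseline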
Scalability is a bit more complicated, because there is no limit. The scope of your test should be to find out where the product breaks. Throw enough load at anything and eventually it will break. You need to know where that limit is and, very importantly, find out how your product breaks. Does it give a nice error message and revert, or does it spill its guts on the floor?
Define your own. Take the initiative and describe the performance goals yourself.
To answer any better, we'd have to know more about your project.
If there has been 'no performance requirement defined', then why are you even testing this?
If there is a performance requirement defined, but it is 'vague', can you indicate in what way it is vague, so that we can better help you?
Short of that, start from the 'vague' requirement, and pick a reasonable target that at least in your opinion meets or exceeds the vague requirement, then go back to the customer and get them to confirm that your clarification meets their requirements and ideally get formal sign-off on that.
Some definitions / assumptions:
Performance = how quickly the application responds to user input, e.g. web page load times
Scalability = how many peak concurrent users the application can handle.
Firstly, performance. Performance testing can be quite simple, such as measuring and recording page load times in a development environment and using techniques like application profiling to identify and fix bottlenecks.
Load. To execute a load test there are four key factors; you will need to get all of these in place to be successful.
1. Good usage models of how users will use your site and/or application. This can be easy if the application is already in use, but it can be extremely difficult if you are launching something new, e.g. a Facebook application.
If you can't get targets as requirements, do some research and make some educated assumptions, document and circulate them for feedback.
2. Tools. You need to have performance testing scripts and tools that can execute the scenarios defined in step 1, with the number of expected users from step 1. (This can be quite expensive.)
3. Environment. You will need a production-like environment that is isolated so your tests can produce reproducible results. (This can also be very expensive.)
4. Technical experts. Once the application and environment start breaking, you will need to be able to identify the faults and re-configure the environment and/or re-code the application once faults are found.
Generally, most projects have a "performance testing" box that they need to tick because of some past failure; however, they never plan or budget to do it properly. I normally recommend either budgeting for and doing scalability testing properly, or saving your money and not doing it at all. Trying to half-do it on the cheap is a waste of time.
However any good developer should be able to do performance testing on their local machine and get some good benefits.
rely on tools (fxcop comes to mind)
rely on common sense
If you want to test performance and scalability with no requirements, then you should create your own requirements / specs that can be met within the timeline / deadline given to you. After defining said requirements, you should tell your supervisor about them and check whether he/she agrees.
To test scalability (assuming you're testing a program/website):
Create lots of users and data and check if your system and database can handle it. MyISAM table type in MySQL can get the job done.
To test performance:
Optimize the code, check it over a slow internet connection, etc.
Short answer: Don't do it!
In order to get a (better) definition, write a performance test concept that you can discuss with the experts who should define the requirements.
Make assumptions for everything you don't know and document these assumptions explicitly. Assumptions comprise everything that may be relevant to your system's behaviour under load. Correct assumptions will be approved by the experts, incorrect ones will provoke reactions.
For all of those who have read Tom DeMarco's latest book (Adrenaline Junkies ...): this is the strawman pattern. Most people who are not willing to write a specification from scratch will not hesitate to give feedback on your document. Because you need to guess several times when writing your version, be prepared to be laughed at when it is reviewed. But at least you will have better information.
The way I usually approach problems like this is just to get a real or simulated realistic workload and make the program go as fast as possible, within reason. Then if it can't handle the load I need to think about faster hardware, doing parts of the job in parallel, etc.
The performance tuning is in two parts.
Part 1 is the synchronous part, where I tune each "thread", under realistic workload, until it really has little room for improvement.
Part 2 is the asynchronous part, and it is hard work, but needs to be done. For each "thread" I extract a time-stamped log file of when each message is sent, when each message is received, and when each received message is acted upon. I merge these logs into a common timeline of events. Then I go through all of it, or randomly selected parts, and trace the flow of messages between processes. I want to identify, for each message-sequence, what its purpose is (i.e. is it truly necessary), and whether there are delays between the time of receipt and the time of processing, and if so, why.
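A minimal sketch of merging such per-"thread" logs into one timeline, assuming every log line starts with a sortable numeric timestamp and the files are already in chronological order ( the line format is an assumption for illustration ):

import heapq

def merge_timelines(*log_paths):
    def events(path):
        with open(path) as f:
            for line in f:
                # e.g. "1696412345.123456 SENT msg-42"
                yield float(line.split()[0]), path, line.rstrip()
    # heapq.merge preserves global chronological order, given each file is already sorted
    for ts, source, line in heapq.merge(*(events(p) for p in log_paths)):
        print("%.6f  %-24s  %s" % (ts, source, line))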
I've found in this way I can "cut out the fat", and asynchronous processes can run very quickly.
Then if they don't meet requirements, whatever they are, it's not like the software can do any better. It will either take hardware or a fundamental redesign.
Although no clear performance and scalability goals are defined, we can use the high level description of the three components you mention to drive general performance/scalability goals.
Component 1: It seems like a network-I/O-bound component, so you can use any available network load simulator to generate various workloads to saturate the link. Scalability can be measured by varying the workload (10 MB, 100 MB, 1000 MB links) and measuring the response time, or more precisely the delay associated with receiving the raw data. You can also measure the working set of the box on that link to get a realistic idea about your server requirements (how much extra memory is needed to receive X more packets of workload, etc.).
Component 2: This component has two parts, an I/O-bound part (receiving data from Component 1) and a CPU-bound part (assembling the packets). You can look at the problem as a whole: make sure to saturate your link when you want to measure the CPU-bound part; if it is a multi-threaded component, look for ways to improve it if you don't get 100% CPU utilization; and measure the time required to assemble X messages. From this you can calculate the average time to process a message, which can later be used to derive the general performance characteristics of your system and provide an SLA for your users (you are going to guarantee a response time within X milliseconds, for example).
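For the CPU-bound part, a small hypothetical probe of the "time required to assemble X messages" could look like this (assemble and packets are placeholders for your own code and captured data):

import time

def average_assembly_time(assemble, packets, n=10000):
    t0 = time.perf_counter()
    for p in packets[:n]:
        assemble(p)
    avg_s = (time.perf_counter() - t0) / n
    # e.g. feed this number into the SLA you promise ("within X milliseconds")
    return avg_s * 1e3                    # average per-message time in [ms]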
Component 3: Completely I/O-bound, and it depends on both your hard-disk bandwidth and the back-end database server you use. However, you can measure how much you saturate disk I/O to optimize throughput, and how many I/O operations you require to read X MB of data, and improve around these parameters.
Hope that helps.
Thanks
