How to profile set of process in freeBSD? - performance

I am trying to debug a service with respect to its performance. The service I am trying to debug, internally spawns instances of the same binary. To improve the through-put, I am planning to increase number of instances of the binary. After a point in number of processes of the binary, through-put is not increasing. Now I am trying to reason-out why this is happening.
I need some help on where to start, tools available for process level profiling. I am using freeBSD platform.

If using more processes doesn't improve output, then your service isn't CPU bound. It might be constrained by e.g. disk or network throughput instead.
Start with systat. Especially systat -vmstat. See man systat.
This will show you several aspects (like memory usage, interrupts, processot usage and disk activity) of how busy your system is.
If your program does a lot of network activity, using systat -tcp might give insight as well.
If your service is a HTTP server, you might want to look at varnish.

Related

One large machine or cluster of small machines for Jmeter load generator in load testing?

I want to simulate up to 100,000 requests per second and I know that tools like Jmeter and Locust can run in distributed mode to generate load.
But since there are cloud VMs with up to 64 vCPUs and 240GB of RAM on a single VM, is it necessary to run in a cluster of smaller machines, or can I just use 1 large VM?
Will I be able to achieve more "concurrency" with more machines due to a network bottleneck coming from the 1 large machine?
If I just use one big machine, would I be limited by the number of ports there are?
In the load generator, does every simulated "user" that sends a request also require a port on the machine to receive a 200 response? (Sorry, my understanding of how TCP ports work is a bit weak.)
Also, we use Kubernetes pretty heavily, but with Jmeter or Locust, I feel like it'd be easier to run it on bare VM, without containerizing (even in distributed mode) while still maintaining reproducibility. Should I be trying to containerize Jmeter or Locust and running in Kubernetes instead?
According to KISS principle it is better to go for a single machine assuming it is capable of conducting the required load.
Make sure you're following JMeter Best Practices
Make sure you have monitoring of baseline OS health metrics (CPU, RAM, swap, network and disk IO, JVM statistics, etc.)
Start with low number of users and gradually increase the load until you reach the desired throughput or limit of any of the monitored metrics, whatever comes the first. If there will be a lack of CPU or RAM or something - see what could be done to overcome the limitation.
More information: What’s the Max Number of Users You Can Test on JMeter?

How do I find a marathon runaway process

I have a mesos / marathon system, and it is working well for the most part. There are upwards of 20 processes running, most of them using only part of a CPU. However, sometimes (especially during development), a process will spin up and start using as much CPU as is available. I can see on my system monitor that there is a pegged CPU, but I can't tell what marathon process is causing it.
Is there a monitor app showing CPU usage for marathon jobs? Something that shows it over time. This would also help with understanding scaling and CPU requirements. Tracking memory usage would be good, but secondary to CPU.
It seems that you haven't configured any isolation mechanism on your agent (slave) nodes. mesos-slave comes with an --isolation flag that defaults to posix/cpu,posix/mem. Which means isolation at process level (pretty much no isolation at all). Using cgroups/cpu,cgroups/mem isolation will ensure that given task will be killed by kernel if exceeds given memory limit. Memory is a hard constraint that can be easily enforced.
Restricting CPU is more complicated. If you have machine that offers 8 CPU cores to Mesos and each of your tasks is set to require cpu=2.0, you'll be able run there at most 4 tasks. That's easy, but at given moment any of your 4 tasks might be able to utilize all idle cores. In case some of your jobs is misbehaving, it might affect other jobs running on the same machine. For restricting CPU utilization see Completely Fair Scheduler (or related question How to understand CPU allocation in Mesos? for more details).
Regarding monitoring there are many possibilities available, choose an option that suits your requirements. You can combine many of the solutions, some are open-source other enterprise level solutions (in random order):
collectd for gathering stats, Graphite for storing, Grafana for visualization
Telegraf for gathering stats, InfluxDB for storing, Grafana for visualization
Prometheus for storing and gathering data, Grafana for visualization
Datadog for a cloud based monitoring solution
Sysdig platform for monitoring and deep insights

monitoring application (CPU and cache usage) on single Linux box with 80 cores

I am looking for a performance monitoring tool for my application which will collect/visualize in realtime the CPU and cache usage on single Linux box like IBM System or HP ProLiant with typical configuration 8 processors / 80 cores.
Application is the home-grown multithreaded C+ code which uses OpenMP.
This monitoring tool should not run 24 hours per day; it should not do e-mail notification.
I will run this tool just before sending commands to my apps, the apps will execute the command (it may take as a maximum few minutes only). During this time interval I need to analyze:
- usage of cores
- data movement between processors
- usage of L1, L2, L3 caches
- some other metrics (help me here) which can help to find bottleneck in application
performance and resource utilization
I guess that tools like Nagios / Zabbix are too heavy for this task.
From another side using the command-line tools like "top" and "sar" for 80 cores not very convenient and plotting (not necessary real-time) would be nice to have...
While getting the per core usage is rather easy - the other values might prove to be not practical, not at least without running that application within a profiler of some sorts.
Measuring QPI utilization is something highly non-trivial if at all possible. Intel's vTune might be able to acquire such things but only when running instrumented version of your binaries.
Also on x86 there is no way to figure out L1,L2,L3 usage of any kind - you can grab the low level CPU counters to measure cache misses though (but would probably need to use instrumented/profiled binaries and always withan something like vTune or PAPI).
You could "easily" setup something to pull all the lower level performance counters into SNMP and grab the SNMP values via standard SNMP capable monitoring tools but be aware that SNMP pulling is something that you don't want to occur more than 1-2/s max. Or pull that info into something like collectd.
I'm also having the impression that you don't understand the problem domain of monitoring tools. They are not ment to be used as low level analysis probes for finding application level/system bottlenecks - at best you could get some hints which resource (from a 10K feet view) is running under full utilization. Monitoring and alterting tools are something that operations staff needs to use to understand which part of their IT system is currently used and how, to gather historical data and predict future resource utilization and to be alerted when something breaks.
SiteScope, Hyperic or any combination of shell scripts, native OS utilities and a DB to store the results may do the job.

Diagnosing pathological behavior of a piece of cluster software

I'm using a kind of load balancer over a small cluster that is able to achieve >2000rps on zero-duration requests (t.i. ones that are immediately satisfied by the worker nodes).
But as soon as the requests stop being zero-duration and start taking even 1ms, performance immediately drops >10x. The data being transfered in both directions is identical and is about 2kb in size.
This is for sure not related to saturation of the cluster or network throughput, because 200rps of 1ms requests is a very tiny load and the network is 10Gbit. Besides, the CPU load is just some 2-5% both on the load balancer and on the worker nodes.
I wonder whether that might be related to some pathological behavior of the OS scheduler, or the OS network stack (t.i. there is some special case behavior for very short interactions).
How might I diagnose the reason? Which perfcounters to watch? What tools or methodologies to use?
(Just in case someone simply knows the answer to my particular problem, I'm talking about the MS HPC Server 2008 R2's "WCF Broker", running on Windows Server 2008 R2 over Hyper-V)
One thing you can do is use ETW tracing to try and understand what the nodes are doing while your WCF job is running. On HPC server, I sometimes clusrun xperf to collect traces on all or specific nodes. There are a number of tools that you can use for analyzing ETW traces, including xperf itself. I haven't done any serious work using HPC SOA (WCF), but I did write a simple WCF raytracer app and then used xperf to profile it on several of the nodes.
Turned out it was a completely network-unrelated issue having to do with peculiarities of the scheduling mechanism of HPC Server. I resolved the issue by tweaking a configuration option "serviceRequestPrefetchCount" to 0 in the loadBalancing section of the WCF service config file.
I'm assuming that there are some shared resources with some kind of locking system in place? Is locking a bottleneck? It's hard to guess without seeing the system.
Do you have a way to profile the workers? What are they spending most of their time on, especially in the fast vs slow scenarios?

CPU Utiliztion per process in Win32 API

I am doing a project on a centralized LAN management system. I need to know how many CPU cycles is each process of a remote PC consuming(as in a Task Manager )so that the network admin can close few processes,in case the CPU utilization of a system in network goes beyond acceptable rates..
I would like to know if there is a Win32 API for this requirement of mine n if so ,i request you to give me information about it..
thank you in advance..
Win32 API has lots of functions to find all kinds of information about currently running processes and threads, here's a link to the full list of them: http://msdn.microsoft.com/en-us/library/ms683223(VS.85).aspx
Explore the list and you should be able to find the function(s) there that meet your requirements, for example GetProcessTimes() returns structures that contain the amounts of time the process has executed in kernel mode, in user mode, etc.
You need to look at the performance monitor system. You can get the stats from there (in the Process counter).
Here's a (delphi) explanation of it, that's pretty good and simple to understand.
When you understand how it all works, you then need the Performance Counters API to read the data counters.

Resources