Turbo-charge Ansible playbooks - performance

I am using Ansible to run Puppet and other scripts on a large number of VMs (5000).
I tried using these options:
strategy: free
async: 1000
poll: 0
serial: 20, 50, or 100
forks: 20
I tried many combinations, but the playbooks keep failing for some VMs. The only combination that works is forks: 10 with serial: 2.
Any idea why? I need to reduce the execution time.
Thanks
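For reference, these settings live at different levels, which is easy to trip over: forks belongs in ansible.cfg (or -f on the command line), strategy and serial are play-level keywords, and async/poll are set per task. A minimal sketch of how they combine (the host pattern, task, and values are placeholders):

# ansible.cfg
[defaults]
forks = 50

# playbook.yml
- hosts: all
  strategy: free      # each host proceeds as fast as it can within the batch
  serial: 100         # process hosts in batches of 100
  tasks:
    - name: Run puppet agent
      command: puppet agent -t
      async: 1000     # allow up to 1000 seconds
      poll: 0         # fire and forget; check status in a later task if needed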

Mitogen could help: https://mitogen.readthedocs.io/en/latest/ansible.html
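Per its documentation, enabling Mitogen is a two-line change to ansible.cfg (the extraction path below is a placeholder for wherever you unpack the release):

[defaults]
strategy_plugins = /path/to/mitogen/ansible_mitogen/plugins/strategy
strategy = mitogen_linear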
Expect a 1.25x - 7x speedup and a CPU usage reduction of at least 2x, depending on network conditions, modules executed, and time already spent by targets on useful work. Mitogen cannot improve a module once it is executing, it can only ensure the module executes as quickly as possible.
One connection is used per target, in addition to one sudo invocation per user account. This is much better than SSH multiplexing combined with pipelining, as significant state can be maintained in RAM between steps, and system logs aren’t spammed with repeat authentication events.
A single network roundtrip is used to execute a step whose code already exists in RAM on the target. Eliminating multiplexed SSH channel creation saves 4 ms runtime per 1 ms of network latency for every playbook step.
Processes are aggressively reused, avoiding the cost of invoking Python and recompiling imports, saving 300-800 ms for every playbook step.
Code is ephemerally cached in RAM, reducing bandwidth usage by an order of magnitude compared to SSH pipelining, with around 5x fewer frames traversing the network in a typical run.
Fewer writes to the target filesystem occur. In typical configurations, Ansible repeatedly rewrites and extracts ZIP files to multiple temporary directories on the target. Security issues relating to temporary files in cross-account scenarios are entirely avoided.
The effect is most potent on playbooks that execute many short-lived actions, where Ansible’s overhead dominates the cost of the operation, for example when executing large with_items loops to run simple commands or write files.

Related

Fewer records are inserted in the database when we increase the thread group count from 100 to 200 in JMeter

Initially I ran a load test with 100 users for 10 minutes, and 1000 records were inserted in the database for the scenarios below.
Employee Creation -- Test script design took 1 minute
Employee Update -- Test script design took 2 minutes
Then I ran the same load test with 200 users for 10 minutes, and 1100 records were inserted, without any error logs or deadlocks.
My question: when we double the thread group count from 100 to 200, the number of inserted records should also double, or approximately double. Why is that not happening? The same goes for the number of requests/samples.
You reached a maximum in your test throughput at about 110 records per minute (1100 records in 10 minutes). In other words, you have a bottleneck on the client or server which doesn't allow 200 users to process requests concurrently and/or within the same amount of time (either some users wait until they can start processing a request, or each request takes longer, so the total number of requests is lower).
Some bottlenecks can be resolved by you (if they are related to the script, JMeter configuration, or the JMeter machine), others have to be resolved on the server side (by whoever has access to it), and some cannot be resolved at all (they are true bottlenecks of your app).
Without knowing your application, it's hard to suggest anything beyond general "checklist" items:
Verify the JMeter script and check whether it has any places where it may wait, take a long time, and so on. For example, if your ramp-up period is too high, the "first" user may finish execution before the "last" user has even started. Scriptable samplers and pre- and post-processors may cause delays as well.
Make sure JMeter is configured properly to handle 200 concurrent threads. For example, if the JMeter heap is set too low, JMeter may be very slow because it constantly needs to run GC. See this question for how to inspect and configure memory (it discusses an out-of-memory error, but even without that error, inadequate memory can cause slowness). A sketch of this check follows the list.
Make sure the JMeter machine is configured to allow creation of 200+ concurrent HTTP connections. A common issue on both Windows and Linux machines is that people assume they can have 65535 connections (the maximal number of ports), but in reality both Windows and Linux limit the number of ports they allow to be used by default. Also, after use, a port may remain in TIME_WAIT or CLOSE_WAIT state for several minutes, which makes it unusable, so running out of ports is quite common. Here's how to monitor and resolve this issue on Windows and Linux; see also the sketch after this list.
Check the JMeter machine's performance as a whole: does it have enough CPU and memory; is it swapping; etc.
If none of the above is a problem, you need to look at how requests arrive at the server. If the client is capable of sending 200 concurrent requests (which you should have established in the previous steps), but the server receives them at a slower rate, then maybe something in the network slows things down, for example slow DNS resolution or slow routing between JMeter and the server.
Item #3 above is also applicable to the server.
If requests do arrive at the server at the same speed as they are sent from the client, then their processing probably slows down as the number of parallel requests goes up. This is dev and DevOps territory, and you probably need to work with them to identify the bottlenecks on the server side. It could be the configuration of the web or application server, the application itself, pretty much anything along the request path.
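A quick sketch of the checks from items #2 and #3 (values and paths are illustrative; adjust to your install):

# Item #2: raise the JMeter heap in the jmeter startup script (jmeter.bat on Windows uses "set HEAP=...")
HEAP="-Xms1g -Xmx4g"

# Item #3, Linux: inspect ephemeral port usage
ss -s                                        # socket summary, including timewait counts
cat /proc/sys/net/ipv4/ip_local_port_range   # available ephemeral port range

# Item #3, Windows: count connections stuck in TIME_WAIT
netstat -an | find /c "TIME_WAIT"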
Performance testing is 10% execution, and 90% analysis and identification of bottlenecks, so here you go.

With JMeter on one machine, how many concurrent users can it handle? [duplicate]

Can anybody explain how many concurrent users one JMeter instance can handle?
I want to run 2000 concurrent users for my project.
No one can "explain" this to you; you can only measure it.
The number of virtual users which can be simulated by JMeter depends on several factors:
machine hardware specifications (CPU, RAM, NIC, etc)
software specifications and versions (OS, JVM and JMeter version and architecture)
the nature of your test (number of requests, size of request/response, number of pre/post processors, assertions, etc)
So your actions should look like:
Make sure you're following JMeter Best Practices
Set up monitoring of baseline OS health metrics (CPU, RAM, disk and network usage). This can be done using e.g. the JMeter PerfMon Plugin.
Start with 1 virtual user and gradually increase the load until resource consumption reaches a reasonable threshold (e.g. 90% of maximum capacity).
Once you start running out of resources, note the number of virtual users active at that moment; this is the maximum you can simulate on this particular machine for this particular test. This can be done using e.g. the Active Threads Over Time listener.
If the number is 2000 or more, you're good to go; if it's less, you will have to go for Distributed Testing.
See the What's the Max Number of Users You Can Test on JMeter? article for a more detailed explanation of the above points and a few more hints.
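As a reminder of the first item, load tests should run in non-GUI mode; a typical invocation (file and folder names are placeholders):

jmeter -n -t test_plan.jmx -l results.jtl -e -o report/

Here -n is non-GUI mode, -t is the test plan, -l is the results log, and -e/-o generate an HTML report into the given folder at the end of the run.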
Well, it is the same as with any other software: one CPU core can handle exactly one operation (step) at a time. What JMeter does is ramp up x threads and then start them; no "magic" there.
So, to get good coverage with respect to collisions, you will want to dedicate a machine with a decent number of CPU cores (a server, not your local machine, which will occasionally be distracted by other tasks) and make sure your processes take a decent amount of runtime themselves. Additionally, run the same test several times to fade out warm-up effects.
How many concurrent users you will get (to answer the question) depends on your environment, and "it all depends". Basically, it is not limited by JMeter but by the system you execute it on. You will need to try it out and fine-tune your test.

How to speed up Nagios to monitor hosts over the cloud

While using Nagios with multiple hosts spread over the network, host status shows a noticeable lag and takes a long time to be reflected in the Nagios server CGI. What is the optimal NRPE/Nagios configuration to speed up status processing for a distributed host environment?
In my case I use Nagios Core 4.1
NRPE 1.5
server/clients: Amazon EC2
The GUI is usually only updated once each minute (automatically), though clicking refresh can provide you with 'nearly' the latest information. I say nearly because there is a distinct processing loop inside the Nagios core that means it is never real time. NRPE runs at the speed of your network connection; it does little else besides sending and receiving tiny amounts of data. About the only delay here is the time it takes to actually perform the check and send back the response, which, of course, has way too many factors to mention. Try looking at the output of
[nagioshome]/bin/nagiostats
There are several entries that tell you:
'Latency' - the time between when the check was scheduled to start, and the actual start time.
'Execution Time' - the amount of time checks are actually taking to run.
These entries will have three numbers, which are; Min / Max / Avg
High latency numbers (in my book that means an Avg greater than 1 second) usually mean your Nagios server is overworked. There are a few things you can do to improve latency times, and these are outlined in the 'nagios.cfg' file (see the sketch below). This latency has nothing to do with network speed or the speed of NRPE; it is primarily hardware speed. If you're already using the optimal values specified in nagios.cfg, then it's time to find some faster hardware.
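The tuning knobs in question live in nagios.cfg; a hedged starting point (the values are illustrative, not prescriptive, so measure with nagiostats before and after):

# nagios.cfg
use_large_installation_tweaks=1   # cheaper check forking on large installs
max_concurrent_checks=0           # 0 = no limit on parallel active checks
check_result_reaper_frequency=5   # seconds between processing check results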
High execution times (for me, an Avg greater than 5 seconds) can be blamed on just about everything except your Nagios system. They can be caused by faulty networks (improper packet routing), overloaded networks, faulty and/or poorly designed checks, slow target systems... the list is endless. Nothing you do with the Nagios and/or NRPE configs will help lower these values. Well, you could disable NRPE's encryption to improve wire time; but if you have encryption enabled in the first place, it's not likely you'd want it disabled.

Individual Spark tasks consume more computation time when more cores are assigned

I am running a Spark job with an input file of size 6.6 GB (HDFS) with master as local. My Spark job with 53 partitions completes more quickly when I assign local[6] than local[2]; however, individual tasks take more computation time when the number of cores is higher. Say, if I assign 1 core (local[1]) then each task takes 3 seconds, whereas the same goes up to 12 seconds if I assign 6 cores (local[6]). Where does the time get wasted? The Spark UI shows the increase in computation time for each task in the local[6] case; I couldn't understand why the same code takes different computation time when more cores are assigned.
Update:
I can see more %iowait in the iostat output with local[6] than with local[1]. Please let me know whether this is the only reason or whether there are other possible reasons. I also wonder why this iowait is not reported in the Spark UI; the increase in computing time is larger than the iowait time.
I am assuming you are referring to spark.task.cpus and not spark.cores.max
With spark.task.cpus each task gets assigned more cores, but it doesn't necessarily use them. If your process is single-threaded, it really can't use them. You wind up with additional overhead without additional benefit, and those cores are taken away from other single-threaded tasks that could use them.
With spark.cores.max it is simply an overhead issue from transferring more data around at the same time.
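For reference, both settings are ordinary Spark configuration passed at submit time; a minimal sketch (the class and jar names are placeholders, and note that spark.cores.max only applies on a standalone or Mesos cluster, not in local mode):

spark-submit \
  --conf spark.task.cpus=1 \
  --conf spark.cores.max=6 \
  --class com.example.MyJob \
  my-job.jar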

Does PPL take the load of the system into account when creating threads or not?

I am starting to use PPL to create tasks and dispatch them [possibly] to other threads, like this:
#include <ppl.h>  // Concurrency::task_group, Concurrency::make_task

Concurrency::task_group tasks;
auto simpleTask = Concurrency::make_task(&simpleFunction);
tasks.run(simpleTask);  // may execute on another worker thread
I experimented with a small application that creates a task every second. Each task performs heavy calculations for 5 seconds and then stops.
I wanted to know how many threads PPL creates on my machine and whether the load of the machine influences the number of threads or the tasks assigned to the threads. When I run one or more instances of my application on my 12-core machine, I notice this:
When running 1 application, it creates 6 threads. Total CPU usage is 50%.
When running 2 applications, both of them create 6 threads. Total CPU usage is 100% but my machine stays rather responsive.
When running 3 applications, all of them create 6 threads (already 18 threads in total). Total CPU usage is 100%.
When running 4 applications, I already have 24 threads in total.
I investigated the running applications with Process Explorer and with 4 applications I can clearly see that they all have 6 (sometimes even 12) threads that are all trying to consume as much CPU as possible.
PPL allows you to limit the number of threads by configuring the default scheduler, like this:
// One key/value pair follows: cap the scheduler at 2 concurrent threads.
Concurrency::SchedulerPolicy policy(1, Concurrency::MaxConcurrency, 2);
Concurrency::Scheduler::SetDefaultSchedulerPolicy(policy);
With this you statically limit the number of threads (2 in this case). It can be handy if you know beforehand that on a server with 24 cores there are 10 simultaneous users (so you can limit every application to 2 threads), but if one of the 10 users is working late, he still only uses 2 threads, while the rest of the machine is idling.
My question: is there a way to configure PPL so that it dynamically decides how many threads to create (or keep alive, or keep active) based on the load of the machine? Or does PPL already do this by default and my observations are incorrect?
EDIT: I tried starting more instances of my test application, and although my machine remains quite responsive (I was wrong in the original question) I can't see the applications reducing their number of simultaneous actions.
The short answer to your question is "No." The default PPL scheduler and resource manager will only use process-local information to decide when to create/destroy threads. As stated in the Patterns and Practices article on MSDN:
The resource manager is a singleton that works across one process. It does not coordinate processor resources across multiple operating-system processes. If your application uses multiple, concurrent processes, you may need to reduce the level of concurrency in each process for optimum efficiency.
If you're willing to accept the complexity, you may be able to implement a custom scheduler/resource manager to take simple system-level performance readings (e.g. using the PDH functions) to achieve what you want.
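If you go that route, the system-wide reading itself is straightforward. Below is a minimal sketch: the PDH counter path is standard, but the heuristic of capping this process at the currently idle cores is purely an illustrative assumption, not anything PPL prescribes:

#include <windows.h>
#include <pdh.h>
#include <concrt.h>
#pragma comment(lib, "pdh.lib")

// Sample total CPU usage over one second; returns percent busy.
double SampleCpuPercent()
{
    PDH_HQUERY query = nullptr;
    PDH_HCOUNTER counter = nullptr;
    PdhOpenQuery(nullptr, 0, &query);
    PdhAddEnglishCounterW(query, L"\\Processor(_Total)\\% Processor Time", 0, &counter);
    PdhCollectQueryData(query);   // baseline sample
    Sleep(1000);                  // this counter needs two samples
    PdhCollectQueryData(query);
    PDH_FMT_COUNTERVALUE value = {};
    PdhGetFormattedCounterValue(counter, PDH_FMT_DOUBLE, nullptr, &value);
    PdhCloseQuery(query);
    return value.doubleValue;
}

// Illustrative heuristic (an assumption, not part of PPL): cap this process
// at roughly the number of currently idle cores, with a floor of one thread.
void LimitConcurrencyToIdleCapacity()
{
    SYSTEM_INFO si = {};
    GetSystemInfo(&si);
    double idleFraction = 1.0 - SampleCpuPercent() / 100.0;
    unsigned int threads =
        max(1u, static_cast<unsigned int>(si.dwNumberOfProcessors * idleFraction));
    Concurrency::SchedulerPolicy policy(1, Concurrency::MaxConcurrency, threads);
    Concurrency::Scheduler::SetDefaultSchedulerPolicy(policy);
}

Note that SetDefaultSchedulerPolicy only takes effect if it is called before the default scheduler is created, i.e. before the process runs any PPL work; re-evaluating the load later would require creating and attaching a new scheduler instead.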
