Chaos engineering practices are becoming widely adopted; one common example is Netflix's own Chaos Monkey. However, Chaos Monkey is often run ad hoc against random targets. I'm curious how chaos experiments might work in a typical CI/CD pipeline to enhance a specific service's resiliency.
Since chaos experiments (usually) require a fully functional environment, when would they run? Would they run in parallel with testing, or downstream of it?
Would you run a chaos experiment with every commit, or just some?
How long would you allow the chaos experiments to run? A 60-minute CPU spike might interfere with a "fail fast" approach, for example.
Would a chaos experiment ever fail the pipeline? What would constitute a 'failure'?
We are just getting started with our chaos engineering efforts, but I'll offer some thoughts regarding your questions.
There are at least three distinct classes of experiment:
Instance/container kills that we expect the underlying infrastructure to handle automatically.
Higher-level but fairly localized failures like slow or unavailable dependencies.
Large-scale failures like a data center or region going down.
For a build pipeline the sweet spot would be the middle class (i.e. higher-level but localized failures), because usually the software itself plays a role in responding to the failure. For example, the software might include a circuit breaker that trips, throttling, automated failover, etc. If those are software functions, then they either work or they don't, and the build should uncover that.
To the extent that resiliency to failure is a system requirement, then yeah, a failed experiment would fail the pipeline. Suppose for instance that build 392 has a correctly working circuit breaker and build 393 doesn't. That would be a failure, since the build goes from meeting the requirement to not meeting it.
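To make that concrete, a pipeline chaos stage can gate on an ordinary test. Here is a minimal sketch in Go, with a toy breaker standing in for whatever resilience library you actually use (all names here are illustrative); the point is only that "the breaker opens under dependency failure" is a pass/fail assertion a build can enforce.

```go
package chaos_test

import (
	"errors"
	"testing"
)

// breaker is a toy circuit breaker used only for this sketch; substitute
// whatever resilience library your service really uses.
type breaker struct {
	failures  int
	threshold int
	open      bool
}

func (b *breaker) Call(dep func() error) error {
	if b.open {
		return errors.New("circuit open: fast-failing")
	}
	if err := dep(); err != nil {
		b.failures++
		if b.failures >= b.threshold {
			b.open = true
		}
		return err
	}
	b.failures = 0
	return nil
}

func TestBreakerOpensUnderDependencyFailure(t *testing.T) {
	b := &breaker{threshold: 3}
	failingDep := func() error { return errors.New("dependency unavailable") }

	// Simulate the chaos experiment: the dependency fails repeatedly.
	for i := 0; i < 3; i++ {
		_ = b.Call(failingDep)
	}

	// A build with a working breaker passes here; a regression (the build 393
	// scenario above) fails the stage.
	if !b.open {
		t.Fatal("circuit breaker never opened; failing the pipeline stage")
	}
}
```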
We keep some chaos experiments, like large-scale failures, outside the pipeline.
During the build pipeline, we usually combine chaos experiments with a short performance test to simulate activity, then kill some instances/containers to check the resilience of the system, and fail the build if the system is not able to recover.
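For what it's worth, such a stage doesn't need much tooling. Below is a rough sketch of the idea in Go; the container name, health endpoint, load shape, recovery window, and the use of the docker CLI are all assumptions you would adapt to your own environment.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"os/exec"
	"time"
)

func healthy(url string) bool {
	resp, err := http.Get(url)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	healthURL := "http://localhost:8080/health" // assumed health endpoint

	// Simulate some activity so the failure is injected under load.
	for i := 0; i < 50; i++ {
		go http.Get(healthURL) // best-effort background traffic for the sketch
	}

	// Inject the failure: kill one instance (assumed container name).
	if err := exec.Command("docker", "kill", "orders-service-1").Run(); err != nil {
		fmt.Println("could not inject failure:", err)
		os.Exit(1)
	}

	// Give the system a bounded window to recover, then pass or fail the stage.
	deadline := time.Now().Add(2 * time.Minute)
	for time.Now().Before(deadline) {
		if healthy(healthURL) {
			fmt.Println("system recovered; chaos stage passed")
			return
		}
		time.Sleep(5 * time.Second)
	}
	fmt.Println("system did not recover in time; failing the pipeline")
	os.Exit(1)
}
```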
I'm fairly new to the pprof tool and am wondering whether it's OK to keep it running in production. From the articles I have seen it seems to be standard practice, but I'm confused as to how it can sample N times every second without affecting performance or causing degradation.
Jaana Dogan does say in her article "Continuous Profiling of Go programs"
Profiling in production
pprof is safe to use in production.
We target an additional 5% overhead for CPU and heap allocation profiling.
The collection is happening for 10 seconds for every minute from a single instance. If you have multiple replicas of a Kubernetes pod, we make sure we do amortized collection.
For example, if you have 10 replicas of a pod, the overhead will be 0.5%. This makes it possible for users to keep the profiling always on.
We currently support CPU, heap, mutex and thread profiles for Go programs.
Why?
Before explaining how you can use the profiler in production, it would be helpful to explain why you would ever want to profile in production. Some very common cases are:
Debug performance problems only visible in production.
Understand the CPU usage to reduce billing.
Understand where the contention cumulates and optimize.
Understand the impact of new releases, e.g. seeing the difference between canary and production.
Enrich your distributed traces by correlating them with profiling samples to understand the root cause of latency.
So if you are using pprof for the right reason, yes, you can leave it in production.
But for basic monitoring, as noted in the comments, standard system metrics are enough.
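For reference, exposing the profiling endpoints in a Go service is essentially one import. A common pattern (sketched below; the port number is just an example) is to serve them on a separate loopback-only listener so they are reachable by operators but not by the outside world:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Profiling endpoints on a separate, loopback-only port: operators can
	// reach them (e.g. via SSH port forwarding), the public cannot.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the rest of the application ...
	select {}
}
```

A profile can then be pulled on demand with something like `go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=10"`. This also speaks to the overhead concern: CPU samples are only collected while such a profile request is being served, not continuously.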
As noted in "Continuous Profiling and Go" by Vladimir Varankin
Depending on the state of the infrastructure in the company, an “unexpected” HTTP server inside the application’s process can raise questions from your systems operations department ;)
At the same time, depending on the peculiar nature of a company, the very ability to access something inside a production application, that doesn't directly relate to application's business logic, can raise questions from the security department ;))
So overhead is not the only criterion to consider before leaving such a feature active.
Mirth Connect is software designed to handle message flows, with built-in support for HL7 messages in particular, so it is widely used for interfacing in healthcare applications. Over the years I have seen Mirth experience performance issues, primarily due to messages building up over time and in scenarios where it receives a heavy message load in quick succession.
Mirth has a channel-based architecture, so it would be ideal if there were some way to performance-test a Mirth channel and gather JMeter statistics for it, so that we can collect the information needed to optimize the channel transformers and set the purge routines accordingly.
However, there is little to no information on the Internet about how to use JMeter to test a Mirth channel. A team in Sri Lanka did some research in this area back in 2013, and I found their findings and achievements below:
http://pragmatictestlabs.com/2016/10/09/performance-testing-healthcare-application-hl7-jmeter/
However, this is very specific: the output there was a JSON object which they extracted, whereas in Mirth outputs can take various forms, so there needs to be a more general approach. An important takeaway is that the input side is general: we can use JMeter to generate HL7 messages and pass them to Mirth, which is great. The open question is how to capture the response in a general way. It would be ideal if there were a way to read the Mirth Dashboard through JMeter; all the output statistics are there, it's just a matter of reading them.
I have an application where Mirth reads HL7 messages (both ADT and RDE), creates a text file with the appropriate content, and drops it in a shared location. The application then reads the files and shows the information to the user.
I wish to do two performance tests here:
Measure how much time the complete system takes, from the arrival of a message to its information being available to the user, and how that varies with load
Measure how much time the channel alone takes and how that changes as the load increases
I can do the first one because I can generate HL7 messages using JMeter and get JMeter to read the output in the application or the database. The problem is the second: can I do this in a general way?
You asked for suggestions, so I'm going to share my general strategy for performance testing Mirth channels. I suspect that this won't be a complete answer to your question, and I might not be telling you anything you don't already know, but I'm hoping this will help you find an answer that you are comfortable with.
For several reasons, try not to spend too much time "testing the complete system":
Firstly, testing the entire system necessarily includes testing low-level configuration like the number of CPU cores, the NICs being used in the box, and kernel-level software like the TCP/IP stack. You don't usually have any control over these things, so you can't optimize them in any way.
Secondly, the performance of the entire system is going to be heavily dependent on whatever ancillary code is running on the box. If a sysadmin decides to 'nice' my Mirth process down, or to use that box to also host a SQL server, that will have an impact on the system that I (again) have no control over.
Thirdly and most frankly, I find that the "performance of an entire system" is something that management asks about during system setup so they can get a cost estimate; but they know that they're only getting an estimate. You do your best to use test metrics to give a good guess for the initial hardware provisioning, but everyone knows that it's really the production performance metrics that will drive later provisioning costs.
Make sure that you build your channels for testability. I find that it's much easier to test a channel when the source and destination can be changed to "Channel Reader" and "Channel Writer" without changing message handling. One way to look at this is that you're not going to overhaul Mirth's MLLP stack or Java's TCP stack, so just eliminate these things from your testing.
I keep a source of useful test messages. I have a couple of files on a network drive that have around a hundred messages that test for nasty edge cases that I've run into over the years on my HL7 interfaces. I wrote a small Mirth channel that reads these in from a file and spews out copies as fast as it can. By turning on "Queueing" on the destination side of that channel, I can queue up a bajillion test messages that are ready to send to the channel I want to test. In the past I took the time to build a test interface that acted like a fake EMR to spew out randomly constructed messages, but there didn't seem to be any advantage over just spewing copies of the same messages from my test files.
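If you'd rather drive the channel from outside Mirth entirely, a tiny MLLP sender doesn't take much code either. The sketch below (in Go, purely illustrative: the host/port, file name, and blank-line-separated message format are assumptions) sends each test message, waits for the ACK, and prints a rough messages-per-second figure.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"os"
	"strings"
	"time"
)

func main() {
	// Assumed: one HL7 message per blank-line-separated block in this file.
	raw, err := os.ReadFile("test-messages.hl7")
	if err != nil {
		panic(err)
	}
	msgs := strings.Split(strings.TrimSpace(string(raw)), "\n\n")

	// Assumed: the channel under test has an MLLP listener at this address.
	conn, err := net.Dial("tcp", "mirth-host:6661")
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	acks := bufio.NewReader(conn)

	start := time.Now()
	for _, m := range msgs {
		// HL7 segments must be CR-separated on the wire; MLLP framing is
		// <VT> message <FS><CR>.
		body := strings.ReplaceAll(strings.TrimSpace(m), "\n", "\r")
		if _, err := conn.Write([]byte("\x0b" + body + "\x1c\x0d")); err != nil {
			panic(err)
		}
		// Read the ACK frame: up to the <FS>, then the trailing <CR>.
		if _, err := acks.ReadString('\x1c'); err != nil {
			panic(err)
		}
		if _, err := acks.ReadByte(); err != nil {
			panic(err)
		}
	}
	elapsed := time.Since(start)
	fmt.Printf("%d messages acked in %s (%.1f msg/s)\n",
		len(msgs), elapsed, float64(len(msgs))/elapsed.Seconds())
}
```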
Finally, and most importantly, it's critical that you measure the performance of your test instance using the same metrics that you'll use to measure the performance of your production instance. If the sole production metric you care about is 'messages per second', then that's what you need to measure on your test box. If memory footprint is a concern in production, then you need to measure memory usage in your test environment as well. When you make a change to your test instance that decreases an important metric by 10%, you'll need to make sure your management is aware before you push that change to production.
Note that getting some of these metrics can be tricky, since Mirth doesn't include good tools to monitor its own performance. The Mirth dashboard is a good place to keep an eye on errors or crashes, but it's not a great place to find performance data. During my testing I make sure that I use whatever resource monitoring tool the sysadmins will be using to monitor the performance of the production instance. Beyond that, I use a manual process to test performance: if I want to count messages per second, I send through a batch of messages and look at the timestamps of the first and last messages. If I want to get an idea of the CPU load of a Mirth channel, I use the Windows Performance Monitor or the POSIX 'top' command.
I'm using JMeter in a development environment, and I'm thinking of executing sanity tests on production servers.
Sanity checks of website login and other actions.
Is it reasonable to use JMeter on production servers? How can I limit JMeter so it won't impact real users? The only tutorial I found advises against it:
Do not run these tests against your production servers unless you know they can handle the load, or you may negatively impact your server's performance.
From JMeter's point of view it doesn't really matter where you run your tests. Running load tests against a production environment is very useful, as this way you can discover "real" limitations, bottlenecks, integration and interoperability problems, as opposed to load testing in scaled-down environments where you can only guess or calculate the anticipated production metrics.
Ideally you should have some form of "staging environment" which is an exact replica of production environment in terms of hardware, software and data.
If you cannot afford a "staging" environment to play with, you can run your tests on production; however, you need to keep in mind several important constraints to avoid "surprises":
Run your tests in "dead" time when your application's real-life usage is minimal, i.e. overnight or during weekends.
Make sure the JMeter test leaves the system in the same state it was in before the test, i.e. if you create users, content, data, etc., clean it up afterwards so your system is not filled with "junk" data used for load testing. Consider using a setUp Thread Group to set up all the necessary test data and a tearDown Thread Group to clean up after yourself.
Make sure you monitor your servers' health so you will be notified when (if) your system is overloaded. You can use the JMeter PerfMon Plugin for this.
It would also be good to have the AutoStop Listener enabled so the JMeter test stops automatically once error rates or response times cross the thresholds you define.
Consider adding an SMTP Sampler to your test plan so you are informed in case of unexpected errors.
As an engineering manager I would say: not in my lifetime ;-)
So what do you want to hear: that it is not a problem?
Only you can tell whether it would be an issue if something behaves differently from what you expect.
My advice would be the same as what you are quoting: don't do it. Unless you know what you are doing, and even then...
I'm currently in an environment where we are parsing data off of the client's website. I want to use my tests to ensure that when the client changes their site, I know when we are no longer receiving the information.
My first approach was to do pure integration tests where my tests hit the client's site and assert that the data was found. However, halfway through and 500 tests in, the test run became unbearable and in some cases started timing out. So I cleared out as many tests as I could without losing the core protection they provide, and I'm down to 350 or so. I'm now afraid that adding more tests will only break the whole suite. I also find that I no longer run the 5+ minute suite (some clients take longer, since this depends on the speed of communication with their site) when I make changes. I consider this a complete failure.
I've been putting a lot of thought into this and asking around the office. My plan for my next attempt is to pull down the client's pages and write tests against these embedded resources in my projects. This will give me higher test coverage and allow me to go back to testing in isolation. However, I would need to be notified when they make changes and then re-pull the pages to test against. I don't think the clients will adhere to this.
A suggestion was made to augment this with a suite of 'random' integration tests that serve the same function as my failed tests (hitting the client's site) but in far smaller numbers than before. I really don't like the idea of random testing, where the same code can sometimes give red lights and sometimes green. But so far this sounds like the best idea I've heard for still becoming aware that the client's site has changed and my code no longer finds the data.
Has anyone found themselves testing an environment like this? Any suggestions from the testing community for me?
When you say the big test has become unbearable, it suggests that you are running this test suite manually. You shouldn't have to. It should just be running constantly in the background, at whatever speed it takes to complete the suite - and then start over again (perhaps after a delay if there are associated costs). Only when something goes wrong should you get an alert.
If there is something about your tests that causes them to get slower as their number grows - find it and fix it. Tests should be independent of one another, so simply having more of them shouldn't cause individual tests to time out.
My recommendation would be to isolate, as much as possible, the part of the code that deals with the uncertainty. This part should sit behind an API that acts as a service used by all the other code. That way you protect most of your code against changes.
The stable parts of the code should be unit-tested. With that part independent of the connection to the client's site, running the tests should be much quicker, and it also makes those tests more reliable.
The part that has to deal with changes on the client's website can be kept small. You are not solving the problem, but at least you're minimising it and centralising it in a single module of your code.
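To make the isolation concrete, here is a small sketch of the shape this takes (in Go, with hypothetical names; the same structure works in any language): one narrow interface owns all contact with the client's site, everything else depends on the interface, and the fast unit-test suite swaps in a fake.

```go
package scrape

// Record stands in for whatever data you pull off the client's site.
type Record struct {
	ID, Name string
}

// SiteClient is the single seam that touches the client's website. Only its
// real implementation needs the slow, against-the-live-site integration tests.
type SiteClient interface {
	FetchRecords(accountID string) ([]Record, error)
}

// fakeSite is what the fast unit-test suite injects instead of the real site.
type fakeSite struct{ records []Record }

func (f fakeSite) FetchRecords(string) ([]Record, error) { return f.records, nil }

// Report is an example of "all the other code": it never knows whether it is
// talking to the live site or the fake.
func Report(c SiteClient, accountID string) (int, error) {
	recs, err := c.FetchRecords(accountID)
	if err != nil {
		return 0, err
	}
	return len(recs), nil
}
```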
Suggesting to the clients to expose the data as a web service would be the best for you. But I guess that doesn't depend on you :P.
You should look at dividing your tests up, maybe into separate assemblies that can be run independently. I typically have a unit tests assembly and a slower running integration tests assembly.
My unit tests assembly is very fast (because the code is tested in isolation using mocks) and gets run very frequently as I develop. The integration tests are slower and I only run them when I finish a feature / check in or if I have a bad feeling about breaking something.
Maybe you could do something similar or even take the idea further and have 3 test suites with the third containing even slower client UI polling tests.
If you don't have a continuous integration server / process you should look at setting one up. This would continuously build your software and execute the tests. It could be set up to monitor check-ins and work in the background, sending out a notification if anything fails. With this in place you wouldn't care how long your client UI polling tests take, because you wouldn't ever have to run them yourself.
Definitely split the tests out - separate unit tests from integration tests as a minimum.
As Martyn said, get a Continuous Integration system in place. I use Teamcity, which is excellent, easy to use, free for the first 20 builds, and you can happily run it on your own machine if you don't have a server at your disposal - http://www.jetbrains.com/teamcity/
Set up one build to run on every check in, and make that build run your unit tests, or fast-running tests if you will.
Set up a second build to run at midnight every night (or some other convenient time), and include in this the longer running client-calling integration tests. With this in place, it won't matter how long the tests take, and you'll get a big red flag first thing in the morning if your client has broken your stuff. You can also run these manually on demand, if you suspect there might be a problem.
In SOA practice, what strategies work best (or work at all) for updating long-running processes, in particular for Oracle BPEL? For example, a process may involve several human steps, which by their nature are time-consuming. SOA suites support starting new instances on a new version of the process while letting already-running instances continue executing. But what do you do if the orchestration logic needs to be updated and applied to already-running instances? Let's assume we no longer want purchase orders to go through management approval, and we would like this change applied to all orders, even those currently executing.
You cannot change the business process for anything which is in flight. Changes can only be applied to new processes. This is not a technical limitation, it is just common sense. Apart from anything else, it would confuse audit trails and regulatory compliance.
If you have so catastrophically mis-designed a process - "we forgot to include management approval for orders!" *facepalm* - all you can do is shut off the server and clean up any half-completed processes. But that would be a really drastic step to take.
So the only strategy which is going to work is review and acceptance testing.