Conducting Chaos Engineering - spring-boot

I have 28 micro-services, some of which communicates with each other. All of them are built with SpringBoot 2x and they use their own resources (database, rabbitmq etc.). They are deployed in PCF.
I need to identify weakness of the overall system. That's when I resorted to Chaos Engineering. As this is my first time, I can use some help regarding how to design the effort, what metrics should I collect, tools that I can use, how long should I run such tests etc.
TIA

Done with my research. Posting it here so that someone else might find it useful.
Very good introductions including how to start your first chaos experiment: https://www.gremlin.com/community/tutorials/chaos-engineering-the-history-principles-and-practice/
Good summary of people, tools, companies doing chaos experiments:
https://coggle.it/diagram/WiKceGDAwgABrmyv/t/chaos-engineeringcompanies%2C-people%2C-tools-practices/0a2d4968c94723e48e1256e67df51d0f4217027143924b23517832f53c536e62
Tools:
Spinnaker: https://www.spinnaker.io/. Netflix Chaos Monkey does not support deployments that are managed by anything other than Spinnaker. That makes it pretty hard to use Chaos Monkey from Netflix.
ChaosMonkey for SpringBoot: https://docs.chaostoolkit.org/drivers/cloudfoundry/. Very easy to follow instructions. Easy to turn on/off using Spring profile.
Chaos Toolkit - https://docs.chaostoolkit.org/drivers/cloudfoundry/. This tool is particularly helpful to my situation since my applications are deployed in Cloud Foundry and this tool has a CloudFoundry extension. Pretty elaborate, but easy to follow instructions. My preferred tool so far.
Chaos Lemur - https://content.pivotal.io/blog/chaos-lemur-testing-high-availability-on-pivotal-cloud-foundry. This tool has promise but network admin won't share AWS credentials for me to muck with Pivotal cells.

You already found Chaos Toolkit, which is what I would recommend for the actual experimenting, especially when starting out.
The basis of Chaos is that you need to plan it out and analyse properly before starting out. And then you need to find out what kind of information you will need to monitor the results of the experiment. What happens to service X if we implement latency? Monitor requests, errors etc. Is there a functioning degraded state or is it just dead?
Your question is very broad and as such it's difficult to answer.
Make sure to try to prevent cascading failures whenever you start an experiment and always have stop that terminates the experiment and rolls everything back should something break really bad.

Related

Micro Layers: Do not add functionality on top but simplify overall Dependencies

I was going through design principles but could not understand this principle(avoid Micro layers), what would the significance be. I tried to google it but could not find any examples or explanation for this design principle. Could it possible for someone to explain this with example,what advantages it has in which scenarios? Does layering not localizes changes and reduces ripple effect of changes in software?
You’ve misinterpreted the way the principle is written. The author wasn’t trying to say “avoid micro services”. They were trying to say “When dealing with a micro service, don’t keep adding features or functionality to it. Instead, add an additional micro service to deliver the new functionality.”
The intent is to help you keep each micro service focused on a single task. This simplifies any system that depends on your service. And, it means you can more easily update your service — possibly by quickly rewriting it if you come up with a better performing design, for example. It’s hard to say “we’re going to rewrite our server” if that’s a six month task. It’s much easier when it’s only a one- or two-sprint task.
This thread needs a reboot coz of 2 reasons.
The Clean Code book that I have doesn't have a mention of Micro Layers. So not sure where the now "omni"-downloaded Clean Code cheat sheet got this from.
It would help if someone can guide me to where in the Clean Code book I can read on this one.
Am not fully satisfied that we re discussing Micro Layers in the scope of Micro Services. Bringing in an arch pattern Micro Services is not helping discuss a topic in a book that was written at a base level of Code and OOAD and a bit of design.
Instead for practical illustration purpose a code level example of the above statement is needed.

Jaeger integration Effort in Microservices in nodejs

I am analysing at a very hight level how much effort would it be for jaeger integration in nodejs microservices.
Does it require code changes or only deployment. and if code changes is required, is code changes needed in first service (i.e. api-gateway) or all the services need to have code changes.
I would really appreciate if someone can give a rough idea of tasks and effort.
Great question and a very popular one too. In short, yes, code changes are required. Not just in one service but in all the services that a request will go through. You need to instrument all services to get continuous traces that will be able to tell you the story of a request as it travels through the system.

Looking for help on how to manage microservices in Golang

Currently, I deal with microservices on a daily basis at my 9-5. Most everything that I touch is written in PHP, and as only a software engineer, SysOps manages everything that has to do with apps running, etc. I have a little familiarity in how the infrastructure and build pipeline is setup, but I still am not a SysOps or DevOps guy.
With that said, I love Golang and for a side project, I am creating a fairly large web application with a lot of moving parts. Writing and designing the code is easy as I have learned a lot from my day job, but deploying and managing Golang web apps (as they are executables) is quite different than updating files for apache to serve.
I have researched a lot on how I would build and deploy my microservice apps, but I keep on thinking of more problems that will need to be solved along the way. I have tinkered with the idea of using Docker for all of this, but I would rather not have the added complexity of learning that and managing storage for all of the images as that could be large.
Is there a best practice or a good way to manage Golang applications after they have been deployed? I would need a way to keep track of all the microservice processes to be able to see if they are still up and to be able to stop them when a new build is going to be deployed.
As for the setup, just assume that all the microservices will be run on the server, not in a container or in a VM. They will all need to be managed, but also able to act upon independently. Jenkins will be used for building and deploying. I will be using Consul for service discovery and possibly configuration, and most likely health checks on the services. I'm thinking of having each microservice register itself to consul when started and deregister when stopping.
Again, I am looking for a solution that is hopefully not just "Docker". I also had thoughts into creating a deploy service that manages the services (add and remove), as well as registering them in Consul. So if I cannot find a better solution, I might go that path. Any help is appreciated.
** Sorry if my question was confusing, but since a couple people answered on the wrong topic at hand, I will try to clarify. I don't need any help making the microservices, or even know anything about them. I brought that point up as to why I need to ask my question. Basically what I need is just the ability to manage the running go processes of all my microservices so I can do deployments and be able to stop and start processes to update the code. It is easier when you have to worry about one app, but when you can have up to 10-15 difference microservices they become harder to keep track of. After my own research, it seems that Supervisord is what I am looking for, but I'm not sure. That is the direction I am going in with this question. Thanks.
Golang is great to use for microservices, but I would say there is not so big difference of managing golang or other languages microservices.
What I would say is golang specific:
you don't need to install anything on servers since golang is compiled to single library
you can take advantage of std lib golang rpc package and gob binary decoding, instead of usin 3rd party solution (gorpc, protocol buffer etc)
Other than that you need to use your own judgement. There is plenty of ways of doing one things in microservices world; one day you will implement solution A but when after 3 month you will see that its better to do B, do that.
In internet, there is so much reading about microservices. I will recommend you 2 good resurces: https://books.google.co.uk/books/about/Building_Microservices.html?id=RDl4BgAAQBAJ&source=kp_cover&redir_esc=y&hl=en
And article: http://highscalability.com/blog/2014/4/8/microservices-not-a-free-lunch.html
Remember, microservices are not a golden bullet, they often can help making application easier to maintain and grow, but from the other side require lot of additional work, consequence in specifying API contracts and strong devops culture.

How to start performance testing

I have taken as an example for learning and gathered some information about tools, objectives,scenarios, but I need your inputs. Please assist me.
I am new to Performance testing and would like to test the following website www.volkswagen.co.nz
Can you tell me, what are need to be tested? What are the scenarios and activities for each scenario? What metrics do I need to add? Which is the best and free tool for testing it? How to test if it is deployed in cloud like AWS?
Please let me know, Thanks in advance.
Performance testing needs,
Identify critical/heavy/important scenario in your webapp (irrespective of deployment cloud/standalone)
Identify service level agreements in terms of response times, throughput, latency etc.
Identify workload model i.e. how much user load application is expecting. this should be as fine grained as possible (avg users per transaction/workflow at a point of time)
Identify tools (JMeter is freeware and best but if you can afford paid then look at loadrunner, neoload etc.)
Record the script for workflows and parameterise and correlate.
generate test setup for load test and execute the load test.
monitor system utilization, collect metrics like response time, throughput, error rate, latency etc.
This all comes in load testing. For more you can read http://www.guru99.com/performance-testing.html
I am new to Performance testing and would like to test the following website www.volkswagen.co.nz
That is a recipe for disaster. No one new should be allowed to work on their own without a full period of training and internship with a master in the field. This is true of stone masons, electricians, plumbers, barbers, accountants, engineers and physicians. And it is most certainly true of performance testers/engineers.
There are dozens of foundation skills you need to master before you touch any tool, open source or otherwise. Until you show mastery of those items along with tool mechanics for your tool you should not be allowed to test any website, particularly a production website. And, if you don't work for this company what you are engaging in is a denial of service attack and could leave you with exposed legal liability.
I strongly agree with James on this one.
Do not touched the site if:
it's not yours
not sure what you are doing
the owner gave you explicit (and sounds like irresponsible) permission
don't know or don't have the support to restore the environment into a working state
If you do work for the company then you need to have a test environment first, a playground where you can mess around and nobody would mind if you take it down.
Firstly get information from the business on which use cases needs to
be tested.
Get response times target for user actions and for environments utilisation.
Get response time targets for environments utilisation: define environment monitoring tactics.
Found a tool that can fit for purpose: Jmeter, Gattling,etc, lot's of free ones available.
Get a test environment, preferably similar scale to production
Create scripts to cover critical use cases
Comply scripts into scenarios
Create a reporting framework
Kick off monitoring
Kick off scenario
Collect and analyse results
Be mindful of the free editions of load testing tools: they tend to be easy to use at first but soon as you start to outgrow it it can cost a fortune and more often then not it's hard to port scripts/scenarios to another tool.

Marathon vs Aurora and their purposes

Both Marathon and Aurora are built on Mesos and supposedly are engineered for running long running services. My questions are:
What are their differences? I have struggled in finding any good explanations regarding their key differences
Do these frameworks run anything that runs on Linux? For Marathon they state that it can run anything that "is executable in a shell" but this is sort of vague :)
Thanks!
Disclaimer: I am the VP of Apache Aurora, and have been the tech lead of the Aurora team at Twitter for ~5 years. My likely-biased opinions are my own and do not necessarily represent those of Twitter or the ASF.
Do these frameworks run anything that runs on Linux? For Marathon they
state that it can run anything that "is executable in a shell" but
this is sort of vague :)
Essentially, yes. Ultimately these systems are sophisticated machinery to execute shell code somewhere in a cluster :-)
What are their differences? I have struggled in finding any good
explanations regarding their key differences
Aurora and Marathon do indeed offer similar feature sets, both being classified as "service schedulers". In other words, you hand us instructions for how to run your application servers, and we do our best to keep them up.
I'll offer some differences in broad strokes. When it comes to shortcomings mentioned in each, I think it's safe to say that the communities are aware and intend to fix them.
Ease of use
Aurora is not easy to install. It will likely feel like you are trailblazing while setting it up. It exposes a thrift API, which means you'll need a thrift client to interact with it programmatically (a REST-like API is coming, but is vaporware at the moment), or use our command line client. Aurora has a DSL for configuration which can be daunting, but allows you to easily share templates and common patterns as you use the system more.
Marathon, on the other hand, helps you to run 'Hello World' as quickly as possible. It has great docs to do this in many environments and there's little overhead to get going. It has a REST API, making it easier to adapt to custom tools. It uses JSON for configuration, which is easy to start with but more prone to cargo culting.
Targeted use cases
Aurora has always been designed to handle a large engineering organization. The clusters at Twitter have tens of thousands of machines and hundreds of engineers using them. It is critical to Twitter's business. As a result, we take our requirements of scale, stability, and security very seriously. We make sure to only condone features that we believe are trustworthy at scale in production (for example, we have our Docker support labeled as beta because of known issues with Docker itself and the Mesos-Docker integration). We also have features like preemption that make our clusters suitable for mixing business-critical services with prototypes and experiments.
I can't make any claim for or against Marathon's scalability. On the feature front, Marathon has build out features quickly, but this can feel bleeding edge in practice (Docker support is a good example). This is not always due to Marathon itself, but also layers down the stack. Marathon does not provide preemption.
Ownership
To some, ownership and governance of a project is important. It feel that in practice it does not define the openness of a project, but for some people/companies the legal fine print can be a deal-breaker.
Marathon is owned by a company (Mesosphere)
To some, this is beneficial, to others is is not. It means that you can pay for support and features. It also means that there is something to be sold, and the project direction is ultimately decided by Mesosphere's interests.
Aurora is owned by the Apache Software Foundation
This means it is subject to the governance model of the ASF, driven by the community. Aurora does not have paying customers, and there is not currently a software shop that you can pay for development.
tl;dr If you are just getting your feet wet with running services on Mesos, I would suggest Marathon as your first port of call. It will be easier for you to get running and poke around the ecosystem. If you are forming the 'private cloud strategy' for a company, I suggest seriously considering Aurora, as it is proven and specifically designed for that.
So I've been evaluating both and this is my summary.
Aurora
[+] also handles recurring jobs
[+] finer grained, extensive file-based configuration
[+] has namespaces so multiple environments can co-exist
[-] read-only UI, no official API
[~] file based configuration and cli based execution brings overhead (which can be justified with more extensive feature set)
Marathon
[+] very easy to setup and use
[+] UI that provides control and extensive API (even with features missing from UI at the moment)
[+] event bus to listen in on api calls
[-] handles only long-running jobs
[-] does not have separate deployment-run-cleanup steps, these if necessary need to be combined in a script of one-liner
Even though Aurora has better capabilities, I prefer Marathon due to Auroras complexity/overhead and lack of UI (for control) & API
I have more experience with Marathon.
Ideological:
Marathon is a relatively tested product that is used in production at AirBnB. Aurora is an early Apache project (so YMMV).
Both are open source and active. Feel free to contribute pull requests or file issues!
Technical:
Marathon doesn't schedule batch tasks or cron jobs
Marathon has a friendly UI and better health indicators (in 0.8.x)
In regards to your second question, you can run any command or docker container, and Mesos will do the resource isolation for you. If you have 50% CentOS nodes and 50% Ubuntu nodes and you run a task that executes apt-get, the task will have a 50% chance of failure. Mesos and Marathon have no awareness of the actual machines.
Disclaimer: I don't have hands-on experience with Aurora, only with Marathon.
ad Q1: In a nutshell Apache Aurora is capable of doing what Marathon + Chronos can provide, that is, schedule both long-running services and recurring (batch) jobs; see also Aurora user guide.
ad Q2: Yes, anything. Currently based on cgroups and Docker but hey, you can roll your own.

Resources