Icinga2 checks over multiple hosts - cluster-computing

I have an HPC cluster and I would like to monitor its health with Icinga2. I have a number of checks defined for each node in the cluster, but what I would really like is to get a notification if more than a certain percentage of the nodes are sick.
I notice that is possible to define a dummy host which represents the cluster and use the Icinga domain specific language to achieve something like I'm interested (http://docs.icinga.org/icinga2/latest/doc/module/icinga2/chapter/advanced-topics?highlight-search=up_count#access-object-attributes-at-runtime). However this seems like an inelegant and awkward solution.
Is it possible to define this kind of "aggregate" or "meta check" over a hostgroup?

There wasn't any solution, and such a thing put inside the docs helped quite a few users, even if it isn't that elegant. External addons such as business process can do the same but require additional configuration. The Vagrant box integrates the Icinga Web 2 module for instance.
Other users tend to use check_multi or check_cluster for that. Isn't that elegant either.
There are no immediate plans to implement such a feature although the idea is good and lasts long.

Related

Openshift master slave

I have Openshift project with 3 pods: FE, BE1, BE2.
FE communicates with BE1 via REST API, BE1 with BE2 via REST API too.
I need to implement replication of pods. I have idea to make copy of pods, and if one of pod in set will not work, traffic will be redirected to another set.
It will be like this:
Set_1 : FEr1 -> BE1r1 -> BE2r1,
Set_2 : FEr2 -> BE1r2 -> BE2r2
FE is React react in container
BE1 and BE2 is Java apps in separate containers.
I don't know how to configure it. Every container contains pipeline configration and application.template files.
Somebody knows how is it possible to do, or maybe some another way to create it?
Thanks!
If I'm understanding you correctly, your question essentially boils down to "How do I run an active-passive K8S Service"? Because if I could give you answer on how to run an "active-passive service" for FEr1 / FEr2 then you could use the same technique for each pod in your "sets". So, to simplify my answer, I'm going to focus on how to have a single "active-passive" service. You can then you can extrapolate on your own how to create a chain of "active-passive" services.
I will begin with the fact there is no such native "active-passive" service object in Kubernetes or Openshift. It's kind of antithetical to most K8S design patterns. So you are going to have either change your architecture or you are going to have build something fairly customized.
When trying to find a link I could share to demonstrate some of your options, I found this blog post from Paul Dally which details most of the the options I was going to outline. It is a great exploration of active-passive services in Kubernetes. For convenience, I'm going to summarize here and add some commentary. But he goes into some great detail and I'd recommend reading the original blog post from Paul.
His option #1, and his recommended approach, is essentially "don't do that". He talks about the disadvantages of an active-passive approach and why K8S patterns generally don't take an active-passive approach. I concur: your best option is just to rearchitect your services so that they are not active-passive.
His option #2 is essentially another recommendation of "don't do that". I will paraphrase his second option as "if you are in a situation where you are forced to only have one active pod the more Kubernetes native approach would be to only run one pod". In this option you use only a single pod, but use Kubernetes native Deployments/Statefulsets and liveness probes to keep the single pod available. Obviously if your pod has slow startup, this has some challenges.
His option #3 is basically his option of last resort. To quote his article, "Make sure that you have fully considered and thoughtfully ruled out the preceding options before continuing with an active/passive load balancing approach." But then he details an approach where you could use a normal K8S Deployment/StatefulSet to create your pods and a normal K8S Service to route traffic between them. But, so that they don't have active-active traffic balancing you add an additional selector to the service e.g. "role=active". Since none of the pods will have this label, the selector will prevent either of the pods from being routed to.
But this leads to the trick: you create an additional Deployment (and Pod) whose sole job is to maintain that "role=active" label. It's perfectly possible to patch the labels of a running pod. So he provides some pseudo-code for a script that you could run in that "failover manager" pod. Essentially the "failover manager" is just checking for availability, by whatever rules you define, and then controls the failover from the active to passive pod by deleting and adding the label.
He does talk about the challenges of this. Including making sure it's hardened enough and has the proper permissions. I'd suggest that if you take this approach that you make it a full-fledged operator. Because essentially that's what this kind of approach is: writing a custom operator.
I will also, however, mention another similar approach that I'll call option #4. Essentially what you are doing with option #3 is create custom routing logic by patching the service. You could just embrace that customer routing approach and deploy something like your own HAProxy. I don't have a sample config for you. But active-passive failover is a fairly well explored area for an HAProxy. You are adding an additional layer of routing, but you are using more off the shelf functionality rather than patching services on-the-fly.

How do I change an attribute for a policy group without having to re-compile all of the policyfiles?

I am trying to shift from environment files in Chef to using Policyfiles. https://docs.chef.io/policy.html. I really like the concept, especially since you can include a policy from policy into another, but I am trying to understand how do a simple attribute change.
For instance, if I want to change a globally-used attribute that may be an error message for a problem that is happening now. ("The system will be down for 10 minutes. Thanks for your patience"). Or perhaps I want to turn off some AB testing with an attribute working as a flag. From what I can tell, the only way I can do this is to change an attribute in the policyfile, and then I need to create a new version of the policy file.
And if the policyfile is an included in many other policyfiles, like in the case of a base policyfile, then I have a lot of work to do for a simple change.
default['production']['maintenance_message'] = 'We will be down for the next 15 minutes!'
default['production']['start_new_feature'] = true
How do I make a simple change to an attribute that affects an entire policy group? Is there a simple way to change an attribute, or do I have to move all my environment properties to a data bag??
OK, I used Chef Support and got an answer: Nope.
This is their response:
"You've called out one of the main reasons we recommend that people use something like Jenkins pipelines to deliver cookbooks to their infrastructure. All that work can be kicked off by a build system recognizing a change in a dependency and initiating new builds for all the downstream consumer jobs.
For what it's worth, I don't really like putting application configurations like that maintenance message example you used in configuration management, I think something like Habitat is a better system for that kind of rapid-change configuration delivery, although you could also go down the route of storing application configuration like that in a system like Hashicorp Vault, Consul, or etcd, and ensuring that whatever apps need to ingest those changes are able to do so without configuration management fighting with the key-value config store.
If that was just an example to illustrate things, ignore the previous comment and refer only to my recommendation to use pipelines to deliver cookbooks, attributes, etc to your infrastructure (and I would generally recommend against data bags these days, but that's mostly a preference thing)."

Puppet vs Ansible - why would organisation use both?

I have worked in an organisation where we used both puppet and ansible for configuration management... but I always wondered why would they use both tools ... what can puppet do that Ansible cannot do?
The only thought that came to my mind was:
- Puppet was used to check if the system is in the desired state at regular intervals; while Ansible was used to deploy one time things (code, scripts, packages etc)
Can someone please explain why would an organisation use both the tools? Can regular config check be done by Ansible?
Cheers
In the interest of full disclosure, I'm an upstream community contributing developer to Ansible but I will do my best to keep my response neutral.
I think this is largely opinionated and you'll get varied results depending on who you talk to but I think about it effectively like this:
Ansible is an automation tool and Puppet is a configuration management tool. I don't consider them to be direct competitors they way they seem to get compared by tech journalists except for the fact that there's some overlap in their abilities to perform the functions you would want out of a configuration management tool: service/system state, configuration file templating, application lifecycle management, etc.
The main place where I see these tools in completely different light is that Ansible performs automation of tasks, those tasks can be one of many "type" of things that you don't really expect from a configuration management tool, such as IaaS provisioning (AWS, GCE, Azure, RAX, Linode, etc), physical network configuration (Cisco IOS/ASA, JunOS, Arista, VyOS, Netscaler, etc), virtual machine creation/management, physical load balancer configuration (F5 BigIP) and the list goes on. Effectively, Ansible is your "automation glue" to create and automate a process that you and your team might have otherwise had to do by hand. It as a tool gets compared to things like Puppet, Chef, and SaltStack because one of the many "types" of task you would automate more or less add up to configuration management.
On the flip side though Configuration Management tools such as Puppet generally have a daemon running on the nodes, which needs to be provisioned/bootstrapped (maybe with Ansible), which has it's advantages and disadvantages (which I won't debate here, it's largely out of scope). One thing that daemon provides you is continuous eventual consistency. You can set configuration management authoritatively on the Puppet Master and then the agent will maintain that state on the systems and will provide reporting when it has to change something which can be wired up to alert monitoring to notify you when something's wrong. While Ansible will also report when something needed changing, it only does this when you run the Ansible Playbook. It's a push-model and not pull-model (nor is it a continuously running daemon that will enforce system state). This has it's advantages for reporting and the like. I will note that something like Ansible Tower/AWX can more or less emulate this functionality, but it's not a "baked in" feature. Just something to keep in mind.
Ultimately, I think it boils down to a matter of familiarity of technologies, desired feature set, and if you have a pre-existing investment (both time and money) into a toolchain. If you have been using Puppet for 5 years, there's no real motivation to fork-lift replace it with something else when you can use Ansible to augment it (there's even a puppet module in Ansible) and allow each to play nicely with each other, getting the features you want from both. However, if you're starting from scratch, then I think you may consider actually doing a Pros/Cons or feature comparison for what you really want out of the tool(s) to find out if it's worth the investment of picking up two tools from scratch or finding one that can fulfill all your needs and, while I'm biased towards Ansible in this regard, the choice ultimately lies on the person who's going to have to use the utility to maintain the infrastructure.
I think a good example of the hybrid approach is I know of a few companies that use Puppet for configuration management, and Ansible for software lifecycle release process where one of the tasks in their playbooks is literally calling the puppet module to bring all the systems into configuration consistency. The Ansible component in this is to automate/orchestrate between various systems, the basic outline of the process is this: start with removing a group of hosts from the load balancer, ensure database connections have stopped, perform upgrades/migrations, run puppet for configuration/state consistency, and then bring things back online in whatever order they've deemed appropriate. This all happens from a single command (or a click of a button in Tower/AWX).
Anyhoo, I know that was kind of long winded but hopefully it was helpful.

Monitoring solution for EC2 based deployment

We have some 20 or so servers in EC2, most are dynamically spawned (scaling groups).
We're looking for a solution to monitor the uptime of our application.
As an added bonus this solution could also extend to actually monitoring the servers involved so its easy to go back in time and see what happened just before a downtime or whatnot.
We're looking for a hosted solution ideally, and it should be easy to scale with it (it needs to somehow dynamically deal with servers being added/removed with no interaction from us).
Anyways, hoping for some recommendations from you guys.
A bit of background ...
We're currently using a custom Nagios setup, its been reduced to basically doing a simple http check now that the servers have become fully dynamic. We've already been using PagerDuty to deliver the pages. It does ok, but for the maintenance cost we could well be using a http check # Server Density of Pingdom.
I've looked briefly at ServerDensity, and it does look promising, I especially like their install mechanism of just dumping their files into your AMI and it takes care of the rest.
I'd like to know what options there are tho before diving deeper into any particular solution.
We use a combination of Server Density for monitoring and PagerDuty for alerting. The two work quite well together.

website - redundancy and failure

After researching various hosts, I still get the feeling that it is somewhat impossible to get a host that would never go down.
Maybe these hosts employ redundancy, maybe they do not. Either case, how would one display a friendly message to the user along the lines of "BRB". What if your host goes down completely for an hour? You would need a way to tell users you would be back. How do you accomplish that?
I doubt any ISP or hosting provider would do that for you. To archieve that you need very expensive and complicated infrastructure like redundant fail-safe routers and backbones in addition to servers of course - and you need multiple. The concepts like Simple Failover requires DNS updates which take minutes to hours to propagate normally, so it's not a 100% solution either. See a good Joel's article for a related discussion.
If the host is down and you're on a single server, then you are definitely down. This is a limitation of shared hosting... there's not much you can do about it. You can ask your host if you are hosted on multiple servers for redundancy... if so, then you wouldn't have to worry about it.
If you host your own server, then you could maybe get your hands on Simple Failover and maybe have a cheap Virtual Dedicated server that goes UP when your primary goes down.
Ok, every host will have downtime at some point. Your best bet would be to go with someone who has the great customer service that can help get your box back up. 99% of the time when your box goes down its your fault (if you have access to the OS/Apache etc).
The people at Rackspace are awesome for hosting + customer service. The rackspace cloud is great allowing you to create and take down servers instantly. (slicehost is good for persistent boxes charged by month, also owned by rackspace)
As for a way to communicate to your users, i would employ twitter, tumblr, or a hosted blog service. This way if your box goes down you can communicate your message via these services which are most likely on a different host/network.

Resources