We have some 20 or so servers in EC2; most are dynamically spawned (scaling groups).
We're looking for a solution to monitor the uptime of our application.
As an added bonus, this solution could also extend to actually monitoring the servers involved, so it's easy to go back in time and see what happened just before a downtime or whatnot.
We're looking for a hosted solution ideally, and it should be easy to scale with it (it needs to somehow dynamically deal with servers being added/removed with no interaction from us).
Anyways, hoping for some recommendations from you guys.
A bit of background ...
We're currently using a custom Nagios setup; it's been reduced to basically doing a simple HTTP check now that the servers have become fully dynamic. We've already been using PagerDuty to deliver the pages. It does OK, but for the maintenance cost we could just as well be using an HTTP check from Server Density or Pingdom.
I've looked briefly at Server Density, and it does look promising; I especially like their install mechanism of just dumping their files into your AMI, after which it takes care of the rest.
I'd like to know what options there are, though, before diving deeper into any particular solution.
We use a combination of Server Density for monitoring and PagerDuty for alerting. The two work quite well together.
I have an HPC cluster and I would like to monitor its health with Icinga2. I have a number of checks defined for each node in the cluster, but what I would really like is to get a notification if more than a certain percentage of the nodes are sick.
I notice that it is possible to define a dummy host which represents the cluster and use the Icinga domain-specific language to achieve something like what I'm interested in (http://docs.icinga.org/icinga2/latest/doc/module/icinga2/chapter/advanced-topics?highlight-search=up_count#access-object-attributes-at-runtime). However, this seems like an inelegant and awkward solution.
Is it possible to define this kind of "aggregate" or "meta check" over a hostgroup?
There wasn't any better solution, and putting that example in the docs has helped quite a few users, even if it isn't that elegant. External addons such as the Business Process module can do the same but require additional configuration; the Vagrant box integrates the Icinga Web 2 Business Process module, for instance.
Other users tend to use check_multi or check_cluster for that, which isn't that elegant either.
There are no immediate plans to implement such a feature, although the idea is good and has been around for a long time.
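For what it's worth, the external "meta check" approach mentioned above can be sketched as an ordinary plugin that asks the Icinga2 REST API how many hosts in a group are down. This is only a rough illustration: the API endpoint and state codes are as I understand them, and the hostgroup name, credentials, and thresholds are invented.

```python
#!/usr/bin/env python
# Hypothetical "meta check": alert if more than N% of hosts in a hostgroup are DOWN.
# Assumes the Icinga2 REST API is enabled on port 5665 and an API user exists;
# the hostgroup name, credentials, and thresholds below are made up for illustration.
import sys
import requests

API = "https://localhost:5665/v1/objects/hosts"
HOSTGROUP = "hpc-nodes"       # hypothetical hostgroup
WARN_PCT, CRIT_PCT = 10, 25   # example thresholds

resp = requests.get(
    API,
    params={"filter": '"{}" in host.groups'.format(HOSTGROUP)},
    auth=("root", "icinga"),           # replace with a real API user
    headers={"Accept": "application/json"},
    verify=False,                      # self-signed cert in a default install
)
hosts = resp.json().get("results", [])
down = [h for h in hosts if h["attrs"]["state"] != 0]  # 0 = UP for host objects
pct = 100.0 * len(down) / len(hosts) if hosts else 0.0

status, code = "OK", 0
if pct >= CRIT_PCT:
    status, code = "CRITICAL", 2
elif pct >= WARN_PCT:
    status, code = "WARNING", 1
print("{} - {:.1f}% of {} hosts down".format(status, pct, len(hosts)))
sys.exit(code)
```

Attached to a dummy host as a regular check command, something like this behaves as the aggregate check the question asks about.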
I am unsure about which monitoring framework to use. Currently I am looking at either Nagios or Sensu.
Can anybody give me a good reference which shows a comparison of these two (or any other monitoring tool which may be a good solution)? My main intention is to scale-out on EC2. I am using Opscode Chef for system integration.
One important difference between Nagios and Sensu -
Nagios requires all the configuration for 1) checks, 2) handlers, and most importantly 3) hosts to be written in configuration files on the Nagios server. This means that each time one of the three above changes (for example, new hosts are added or old hosts removed), you need to rewrite the configuration files and restart Nagios.
Sensu is almost the same as the above, with one important difference -- when hosts are added or removed from your architecture (as is the case in most auto-scaling cloud deployments) -- the hosts themselves run a sensu-client that "subscribes" to different available checks. So when a new server comes into existence and says "I'm a webserver", the sensu-client running on it will ask the sensu-server "what checks should a webserver run on itself?" and run those.
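Under the hood, both systems end up executing check commands that follow the Nagios plugin convention (one status line on stdout, exit code 0/1/2/3), which is also why existing Nagios plugins carry over to Sensu. A minimal sketch of such a check; the URL and timeout are made up for illustration:

```python
#!/usr/bin/env python
# Minimal HTTP check following the Nagios plugin convention:
# print one status line and exit 0 (OK), 1 (WARNING), 2 (CRITICAL), 3 (UNKNOWN).
import sys
import requests  # assumed available on the monitored host

URL = "http://localhost/healthcheck"  # hypothetical endpoint

try:
    r = requests.get(URL, timeout=5)
except requests.RequestException as exc:
    print("CRITICAL - {} unreachable: {}".format(URL, exc))
    sys.exit(2)

if r.status_code == 200:
    print("OK - {} answered in {:.0f} ms".format(URL, r.elapsed.total_seconds() * 1000))
    sys.exit(0)
print("WARNING - {} returned HTTP {}".format(URL, r.status_code))
sys.exit(1)
```

Point a Nagios command definition, an NRPE entry, or a Sensu check at a script like this and both stacks will treat it the same way.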
Other than this, operations wise both Nagios (also Icinga) and Sensu are great and have a lot of facilities for checks, handlers, and visibility through a dashboard (YMMV).
From a little recent experience with Sensu and quite a bit of experience with Nagios I'd say both are excellent choices.
Sensu is definitely the new kid. It has a nice UI and a nice API. It does, however, require Redis and RabbitMQ in your setup to work, so consider whether you'll want something to monitor those dependencies outside the Sensu monitoring stack. Sonian provides Chef recipes for trying it out, too.
https://github.com/sensu/sensu-chef
Nagios has been around for an awfully long time. It's generally packaged for most distros, which makes installation simple, and it has few dependencies. Its track record also means that finding people who know it, or who have used it and can offer advice, is easy. On the other hand, the UI is ugly and programmatic access is often hacky or via third-party add-ons. Chef recipes also exist for Nagios:
https://github.com/bryanwb/chef-nagios
If you have time I'd try both; there is little harm in having two monitoring systems running as a trial. The main thing to focus on, especially in a dynamic EC2 setup, is how easily the monitoring configuration files can be generated by your configuration management tool.
In terms of other tools I'd personally include something to record time series data, for instance requests per second or load over time. Graphs are a great help with monitoring, and can be used to drive alerting via Nagios or similar. Personally I'm a fan of both Ganglia and Graphite while Librato Metrics (https://metrics.librato.com/) is a very nice non-free option.
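As a concrete, hedged illustration of feeding such a system: Graphite's plaintext Carbon listener just takes "metric value timestamp" lines over TCP, so an app or cron job can push samples directly. The host and metric names below are invented; 2003 is the default Carbon port.

```python
#!/usr/bin/env python
# Push a single time-series sample to Graphite's plaintext (Carbon) listener.
# Host, port, and metric name are assumptions; 2003 is the default Carbon port.
import socket
import time

CARBON_HOST, CARBON_PORT = "graphite.example.com", 2003
metric, value = "web.frontend.requests_per_second", 412  # made-up metric

line = "{} {} {}\n".format(metric, value, int(time.time()))
sock = socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5)
sock.sendall(line.encode("ascii"))
sock.close()
```

Graphite then handles storage and graphing, and the resulting graphs can drive alerting via Nagios or similar.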
I tried using Nagios for a while: I got the feeling that the only reason it's common is that 'everyone else uses it', because it's absolutely hideous to work with. Massively overcomplicated, difficult and long-winded to make it do anything new: if you find something it doesn't do, you know you're in for a week of swearing at crummy documentation of an archaic design. And at the end of all your efforts, when it's all working, it still looks hideous. Scrapping it made me sleep better.
Cacti looks nice, but again it's unnecessarily complex when creating new plugins.
For graphing I'd recommend Munin: it's completely trivial to write new plugins in any language, there are hundreds available, and it looks reasonable. It's incredibly easy to install (one command to install and one access rule to set), so it works well for automated deployments and is easy to wrap into a Chef recipe. 2.0 is out soon and addresses most of its shortcomings (in particular adding variable update intervals, zoomable graphs, and SSH transport). Munin can talk to Nagios for notifications, or it can handle them itself, and it provides a basic dashboard.
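To show how trivial a Munin plugin is, here is a minimal sketch: Munin runs the plugin with the argument config to learn about the graph and with no argument to fetch values. The graph title and field names here are arbitrary.

```python
#!/usr/bin/env python
# Minimal Munin plugin: reports the 1-minute load average.
# Munin calls "plugin config" for metadata and "plugin" (no args) for values.
import os
import sys

if len(sys.argv) > 1 and sys.argv[1] == "config":
    print("graph_title Load average (example)")
    print("graph_vlabel load")
    print("graph_category system")
    print("load.label 1-minute load")
else:
    print("load.value {:.2f}".format(os.getloadavg()[0]))
```

Dropped into the plugins directory, made executable, and picked up after a munin-node restart, that is the whole plugin.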
For local process/file/service monitoring, monit is simpler and works better than god. I've not tried it with m/monit.
Comparing Sensu and Nagios, my pick would be Sensu.
The main reasons:
1. Easy setup: there is far less restarting of clients, which is a major pain in large enterprises.
2. Nagios plugins can be used within the Sensu ecosystem.
3. It is scalable and well suited to cloud environments.
Has anyone heard about Zabbix? It has a lot of features and comes as a single package. I have doubts about its scalability, though.
As long as the enterprise consists of databases, SAP, network devices, webservers, filers, backup libraries and the like, there is barely an alternative to Nagios (or its cousins Icinga and Shinken).
Maybe one day everything will come out of the cloud automagically, but for a few more years there will still be static servers (physical or virtual, it doesn't matter) with a defined purpose, staying in place for at least a few months. We will still have to monitor interface bandwidth, tablespaces, business processes, database sessions, logfiles and JMX metrics, all things where the plugin concept of the Nagios world has an advantage.
After researching various hosts, I still get the feeling that it is somewhat impossible to get a host that would never go down.
Maybe these hosts employ redundancy, maybe they do not. In either case, how would one display a friendly message to the user along the lines of "BRB"? What if your host goes down completely for an hour? You would need a way to tell users you would be back. How do you accomplish that?
I doubt any ISP or hosting provider would do that for you. To achieve that you need very expensive and complicated infrastructure, like redundant fail-safe routers and backbones, in addition to servers of course, and you need multiples of everything. Concepts like Simple Failover require DNS updates, which normally take minutes to hours to propagate, so it's not a 100% solution either. See a good article by Joel for a related discussion.
If the host is down and you're on a single server, then you are definitely down. This is a limitation of shared hosting... there's not much you can do about it. You can ask your host if you are hosted on multiple servers for redundancy... if so, then you wouldn't have to worry about it.
If you host your own server, then you could maybe get your hands on Simple Failover and keep a cheap virtual dedicated server that comes up when your primary goes down.
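If you do go down that road, the usual mechanics are a health-check loop that flips a low-TTL DNS record to the standby box. A rough sketch of the same idea using Route 53 via boto3 (a modern stand-in, not what Simple Failover itself does); the zone ID, record name and IPs are placeholders, and as noted above, DNS caching means the cutover is never instant:

```python
#!/usr/bin/env python
# Rough failover sketch: if the primary stops answering, point a low-TTL
# DNS record at a standby box. Zone ID, record name, and IPs are placeholders.
import boto3
import requests

PRIMARY_IP, STANDBY_IP = "203.0.113.10", "203.0.113.20"
ZONE_ID, RECORD = "ZEXAMPLE123", "www.example.com."

def healthy(ip):
    try:
        return requests.get("http://{}/".format(ip), timeout=5).status_code < 500
    except requests.RequestException:
        return False

target = PRIMARY_IP if healthy(PRIMARY_IP) else STANDBY_IP
boto3.client("route53").change_resource_record_sets(
    HostedZoneId=ZONE_ID,
    ChangeBatch={"Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": RECORD, "Type": "A", "TTL": 60,
            "ResourceRecords": [{"Value": target}],
        },
    }]},
)
```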
OK, every host will have downtime at some point. Your best bet would be to go with someone who has great customer service and can help get your box back up. 99% of the time when your box goes down it's your fault (if you have access to the OS/Apache etc.).
The people at Rackspace are awesome for hosting + customer service. The Rackspace Cloud is great, allowing you to create and take down servers instantly. (Slicehost, also owned by Rackspace, is good for persistent boxes charged by the month.)
As for a way to communicate with your users, I would employ Twitter, Tumblr, or a hosted blog service. That way, if your box goes down you can communicate your message via these services, which are most likely on a different host/network.
What makes a site good for high traffic?
Does it have more to do with the hardware/infrastructure, or with how one writes the software, using Java as the example, if it matters?
I'm wondering how the software changes just because it is expected that billions of users will be on the site, if at all.
My understanding up to this point is that the code doesn't change, but that it is deployed on multiple servers, in a cluster, and a load balancer distributes the load, so really, on any one server/deployment, the application is just as any other standard application/website.
I highly recommend reading Jeff Atwood's blog on micro-optimization. In earlier posts he talks about how this site was created and the hardware upgrades it has had (quickly summarized: better hardware helps only to the extent that it is faster/better), but the real speed of a site comes from good programming, and this article seems like it should sum up some of your site-programming questions quite well.
Hardware is cheap. Programming is expensive.
There are some programming techniques to make sure your code can handle multiple simultaneous views/updates. If you're using an existing framework, much of that work is (hopefully) done for you, but otherwise you're going to find that stuff that worked for a few hundred hits an hour on one server isn't going to work when you're getting hundreds of thousands of hits and you have to deploy multiple load-balancing machines.
Well, it is primarily an issue of hardware scaling, but there are a few things to keep in mind with respect to the software involved in scaling. For example, if you are on a server farm, you'll need to work with a session management server (either via SQL Server or via a state server), which has implications in that your session variables need to be serializable.
But, in the bigger picture, there are a variety of things that you would want to do to scale to an enterprise level. For example, it becomes particularly important that you abstract out your database calls to a DAL because you may well need to adopt the use of a middleware package for high volume environments.
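To make the DAL point concrete, here is a hedged sketch: callers depend on a small interface, so the concrete backend (direct SQL today, a middleware package or distributed cache later) can be swapped without touching application code. All the names here are invented.

```python
# Hypothetical sketch of a thin data access layer (DAL): callers depend on the
# interface, so the backend can be swapped (direct SQL, middleware, cache) as load grows.
from abc import ABC, abstractmethod
import sqlite3

class UserStore(ABC):
    @abstractmethod
    def get_user(self, user_id: int) -> dict: ...

class SqliteUserStore(UserStore):
    def __init__(self, path: str):
        self._conn = sqlite3.connect(path)

    def get_user(self, user_id: int) -> dict:
        row = self._conn.execute(
            "SELECT id, name FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        return {"id": row[0], "name": row[1]} if row else {}

# Application code only ever sees UserStore; swapping in a middleware-backed
# implementation later does not require changing callers.
def greeting(store: UserStore, user_id: int) -> str:
    user = store.get_user(user_id)
    return "Hello, {}".format(user.get("name", "stranger"))
```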
I was looking into memcached as a way to coordinate a group of servers, but came across Apache's ZooKeeper along the way. It looks interesting, and Yahoo uses it, so it shouldn't be bad, but I'd never heard of it before, so I'm kind of skeptical. Has anyone else given it a try? Any comments or ideas?
ZooKeeper and Memcached have different purposes. You can use memcached to do server coordination, but you'll have to do most of this work yourself. Memcached only allows coordination in that it caches common data lookups to be used by multiple clients. From reading ZooKeeper's documentation, it has a much broader focus than this. ZooKeeper seems to provide support for server clustering, which isn't the same as the cache clustering memcached provides.
Have a look at Brad Fitzpatrick's Linux Journal article on memcached to get a better idea what I mean.
To get an overview of what ZooKeeper is capable of, watch the following presentation by its creators. It's capable of so much more (creating queues, electing master processes among a group of peers, distributed high-performance runtime configuration, rendezvous points for disjoint processes, determining whether processes are still running, etc.).
http://zookeeper.sourceforge.net/index.sf.shtml
To answer your question, if "coordination" is what you are looking for Zookeeper is much better targeted at that than memcached.
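To make the coordination-versus-caching distinction concrete, here is a hedged sketch of group membership with ZooKeeper using the kazoo client library; the connection string and znode paths are made up. Each live server registers an ephemeral znode, which ZooKeeper deletes automatically if the server dies:

```python
# Sketch of group membership with ZooKeeper via the kazoo client library.
# The connection string and znode paths are made up for illustration.
from kazoo.client import KazooClient
import socket

zk = KazooClient(hosts="zk1.example.com:2181,zk2.example.com:2181")
zk.start()

# Ephemeral node: removed by ZooKeeper automatically if this process dies,
# which is what makes it useful for "who is alive right now" coordination.
me = socket.gethostname()
zk.ensure_path("/webservers")
zk.create("/webservers/{}".format(me), b"up", ephemeral=True)

print("current members:", zk.get_children("/webservers"))
zk.stop()
```

Memcached gives you none of this for free; you would be building the membership and liveness logic yourself on top of a cache.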
Zookeeper is great for coordinating data across servers. It does a good job of ordering every transaction and making guarantees that transactions happen in order. However when first breaking into it the documentation sucks; it's very 'high-level' without enough concrete examples or explanations as how to properly handle certain events. One of the included examples (as of version 3.3.3) had its own bugs in it.
Your code will also need to be cognizant of event-driven interactions and polling interactions. With a massively distributed architecture, when acting upon 'events' you can inadvertently create a stampede that may not be desirable for your environment (the herd effect).
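A small illustration of that event-driven side, again with kazoo (the paths are hypothetical): a plain watch fires once and must be re-registered, and if thousands of clients all watch the same znode, every change wakes all of them at once, which is exactly the stampede described above.

```python
# Watches in ZooKeeper are one-shot: the callback fires once, and you must
# re-register to keep watching. If many clients watch the same znode, every
# change wakes all of them at once (the herd effect mentioned above).
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1.example.com:2181")
zk.start()
zk.ensure_path("/app/config")  # hypothetical config znode

def on_config_change(event):
    # Re-read (and re-watch) only what we need; doing heavy work here from
    # thousands of clients simultaneously is what causes a stampede.
    data, _stat = zk.get("/app/config", watch=on_config_change)
    print("config changed:", data)

# The initial read sets the first watch.
data, _stat = zk.get("/app/config", watch=on_config_change)
```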