Monitoring & Alerting on production applications [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 7 years ago.
I've been looking for a discussion of ways to monitor and alert on production applications for a little while now, but haven't found much useful information.
I'm in the process of converting a behemoth of an application into smaller microservices and thought now would be a great time to implement some better monitoring of this application. What are some ways, ideally without using paid applications, to monitor the health of the overall application, and individual microservices?
Some possibilities I've considered:
- Building a small application that periodically checks or receives heartbeats.
- Setting up Logstash with Kibana on OpenStack to monitor the various logs that the services spit out.
Aaaannnddd that's about all I got.
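For the heartbeat idea, a minimal sketch in Python might look like the following. The class name `HeartbeatMonitor`, its methods and the timeout value are all made up for illustration; a real version would also need some transport (HTTP, a queue, etc.) for services to report in, which is left out here.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Tracks the last heartbeat per service and reports the ones that went quiet.

    Hypothetical helper for illustration only, not part of any monitoring tool.
    """

    def __init__(self, timeout_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_seconds = timeout_seconds
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._last_seen: Dict[str, float] = {}

    def beat(self, service: str) -> None:
        """Record that `service` just reported in."""
        self._last_seen[service] = self._clock()

    def stale_services(self) -> List[str]:
        """Return (sorted) the services whose last heartbeat is older than the timeout."""
        now = self._clock()
        return sorted(
            name for name, seen in self._last_seen.items()
            if now - seen > self.timeout_seconds
        )
```

Each microservice would call something like `beat("orders")` on a schedule, and a periodic job would alert on whatever `stale_services()` returns.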

We're running a fairly large environment (hundreds of servers) which is microservices/docker based, multi-tier, highly available and completely elastic.
When it comes to monitoring and alerting, we're using two different tools:
- Nagios for availability monitoring: it basically sends us an email if a service is down, lacks resources or suffers from any other problem which prevents it from operating.
- ELK: we use it to find the root cause of a problem and to alert on issues and trends before they actually impact the application/business.
So when there is a significant issue, Nagios will alert and we will jump into the log analytics console to try to find the problem. In some cases, ELK will alert when issues start to build up, before they are seen on Nagios. That way we can prevent the issue from deteriorating. You can read more about setting up your own ELK stack on AWS here: http://logz.io/blog/deploy-elk-production/
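To make the "alert before it shows up on Nagios" idea concrete, here is a rough sketch in plain Python. It is not how ELK implements alerting; it only assumes you already have per-minute error counts pulled from your logs, and the function name, the `recent` window and the `factor` threshold are arbitrary choices for illustration.

```python
from typing import Sequence


def errors_are_building_up(counts_per_minute: Sequence[int],
                           recent: int = 5, factor: float = 2.0) -> bool:
    """True when the mean of the last `recent` minutes exceeds `factor` times the earlier mean."""
    if len(counts_per_minute) <= recent:
        return False  # not enough history to form a baseline
    baseline = counts_per_minute[:-recent]
    window = counts_per_minute[-recent:]
    baseline_mean = sum(baseline) / len(baseline)
    window_mean = sum(window) / len(window)
    # Flooring the baseline at 1.0 keeps a near-silent service from
    # triggering on a single stray error.
    return window_mean > factor * max(baseline_mean, 1.0)
```

The point is only that a trend check compares recent behaviour against a baseline, whereas an availability check asks whether the service is up right now.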
There are obviously many commercial tools for monitoring, alerting and log analytics, but since you're looking for free/open-source tools I've recommended these.
Disclaimer: I'm the CEO and co-founder of Logz.io, which amongst other things offers Enterprise-ELK as a service.

There are two elements to monitoring:
- Availability: will it work?
- Performance: is it working properly?
Availability is easy; there are hundreds of tools which do synthetic transactions. You can use a service (I can provide a specific list, but there are so many out there, from Pingdom to Site24x7 to various other point solutions).
If you want to understand performance, have a look at the APM technologies. They range from simpler tracing products, which look at end-user and component-level performance, to more sophisticated tools which actually stitch the whole transaction path together, including the browser data.
Gartner has research on both of these markets (I wrote a lot of it before I left). I work for a company, AppDynamics, which does all of the above in a single product, including application availability and performance (mobile or web). We offer the solution as SaaS, or you can install it internally. Finally, we also pull the data together, including logs, into a backend.
You can build availability monitoring and log collection yourself, and you can also collect client-side data and other telemetry you emit, but there is no good open source APM tooling out there for true transaction tracing. Also, how much time do you want to spend managing ELK, OpenTSDB, Graphite, StatsD, collectd, Nagios, etc. to get this done?

There are multiple ways to monitor your production servers. You can go with one of the free, more limited server monitors like Nagios, which is hard to configure and not as simple to work with, or you can look at some of the players in this market like Stackify, LogicMonitor or several others. If you want additional capabilities like code-level monitoring, then you'll need to look at vendors that provide APM (application performance management) such as Stackify, New Relic or AppDynamics. You'll find vast differences in price and features, so it is really about what your requirements are.

Related

application insights vs elastic (ELK)

Either I am really bad at searching, or there is no detailed comparison between App Insights and the ELK stack.
The monitoring is going to be used for a simple Web API. There are going to be tons of endpoints, but user traffic should not be too high.
So my question: are there any general points/differences to consider when choosing between ELK and App Insights? I've personally never had a chance to set up either of them, but before setting up a test environment it would be nice to know in advance what to expect and look for.

I'm from the App Insights team. I think the link provided by #rickvdbosch in a comment gives quite a good perspective. It is over a year old at this point, so some items regarding App Insights have evolved since then.
I think App Insights and ELK are quite different offerings. The former is a managed offering (you can set it up within a couple of minutes), focused on a very broad range of out-of-the-box experiences (collecting incoming/outgoing requests, exceptions, smart alerts, availability monitoring, analytics, live metrics, application map, end-to-end transactions across apps).
My understanding of ELK is that it has very powerful UI visualization and powerful dashboards (though there are adapters for Kibana to work with Azure Monitor). For scenarios where there is a need to store a lot of data (with App Insights, highly loaded apps using adaptive sampling still store only a limited amount of data), an ELK solution might be cheaper to run.

The final decision was to use ELK, as the servers already have all the configuration because another team uses it, and mainly because logging will need a lot of customization.

Passively Logging React App Performance in Production

I'm wondering if there are any utilities/patterns/paradigms/standards for monitoring React applications in production.
I've seen a lot of documentation about React performance debugging that recommends the Chrome Dev Tools (which are great, but aren't a passive way to monitor end-user performance).
How could I log data to know how long users are waiting for components to mount or render?
The only thing I've thought of so far is creating a Loggable[Pure]Component that extends React.[Pure]Component whose constructor, componentWillMount/Update, and componentDidMount/Update methods log render/mount times to a server. Then, components I want to monitor can extend these components and, if need be, call super() in the lifecycle methods before doing their own work. To specifically know which components these metrics go to, I'd have to expose a method in the Loggable[Pure]Component class that does something silly like setUniqueId and then each derived class would have to call it in the constructor.
This all seems terrible and I'm very much hoping there are some things people out there have implemented, but I haven't found anything thus far.

I would have a look at some APM tools; they handle frontend monitoring as well as backend monitoring. They all support React, and folks use them all the time for that use case. It really depends on your goals for the monitoring: are you doing this for fun? Do you have a startup? Are you working for a large enterprise? There are 3 major players in this market.
AppDynamics - Enterprise APM, handles the most complex apps. Unified product offering delivered SaaS or on-premises. Has deep database, server, and other monitoring.
Dynatrace - Enterprise APM, handles complex apps well. Fragmented portfolio, but the SaaS product is good. The SaaS product has limited depth in some ways. Handles server and cloud infrastructure monitoring well.
New Relic - Easy and cheap(er than others), not as in-depth as some other options. Tends to be popular with small companies. Does a good job monitoring cloud infrastructure services.
These products all do what you are looking for, but it depends on your goals with the data and how you plan to analyze it.
If you want something free and less functional there are ways to do this with open source, but you'll have to stand up and manage a pretty complex stack. Here is one option.
Check out boomerang, which can log/extract the metrics you are looking for. It doesn't "understand" React, but it should work. This data can be posted to many different systems. The best suited is likely the ELK stack (open source log analytics, and more). Here is one of several examples which marries the two together to provide analysis of browser performance: https://github.com/naukri-engineering/NewMonk

Prometheus vs ElasticSearch. Which is better for container and server monitoring? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 4 years ago.
ElasticSearch is a document store and more of a search engine. I think ElasticSearch is not a good choice for monitoring high-dimensional data, as it consumes a lot of resources. On the other hand, Prometheus is a TSDB which is designed for capturing high-dimensional data.
If anyone has experience with this, please let me know what's the best tool to go with for container and server monitoring.

ELK is a general-purpose NoSQL stack that can be used for monitoring. We've successfully deployed one in production and used it for some aspects of our monitoring system. You can ship metrics into it (if you wish) and use it to monitor them, but it's not specifically designed to do that. Nor does the non-commercial version (version 7.9) come with an alerting system: you'll need to set up another component for that (like Sensu) or pay for the ES commercial license.
Prometheus, on the other hand, is designed to be used for monitoring. Along with its metric-gathering clients (or other 3rd-party clients like Telegraf), its service discovery options (like Consul) and its Alertmanager, it is just the right tool for this job.
Ultimately, both solutions can work, but in my opinion Elasticsearch will require more work and more upkeep (we found that ES clusters are a pain to maintain - but that depends on the amount of data you'll have).

I am using OpenShift and we are running both tools; each has a different job. We aggregate all the logging and ship it to Elasticsearch for ease of browsing the logs and similar things.
Our Prometheus use is mainly for metrics, either for the nodes or the pods, and Grafana definitely makes a great interface for viewing all of the Prometheus metrics.

Agree that it depends on what you mean by "high dimensional" and by container and server monitoring. You could use an open source monitoring solution. I've tried Pandora FMS, and they offer several options for large environments and distributed architectures. Server monitoring is mostly agent-based though, but I feel it has a lot of potential.

understanding about stackoverflow underlying software infrastructure

I wonder what databases, or combination of databases, Stack Overflow uses underneath to manage extensive user profile information across various verticals.
In the case of social networking sites like Twitter and Facebook, big data management is done on Hadoop. Does Stack Overflow also handle such high volumes of data?
How about indexing the information? Is Redis part of Stack Overflow's solution?
It would be really interesting to understand the solution deployed at the world's most popular technical forum.

This article provides a glimpse at what Stack Overflow's architecture looked like circa March 2011: http://highscalability.com/blog/2011/3/3/stack-overflow-architecture-update-now-at-95-million-page-vi.html
At a high level, it's a .NET application which uses MS SQL Server for a database, Redis for caching, HAProxy for load balancing, and a whole host of other tools, hosted on both Windows servers and Linux servers (Ubuntu + CentOS).
It doesn't look like they had any Hadoop usage at the time of that article, but that could have changed. They might also be doing something different/custom for map/reduce type jobs, or might not need anything like that at all yet. With care, SQL servers can be scaled pretty far without needing to lean on "big data" toys. This is especially true if you can get most of your data out of your caching layer.

Which is better, Nagios or Sensu? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 7 years ago.
I am unsure about which monitoring framework to use. Currently I am looking at either Nagios or Sensu.
Can anybody give me a good reference which shows a comparison of these two (or any other monitoring tool which may be a good solution)? My main intention is to scale-out on EC2. I am using Opscode Chef for system integration.

One important difference between Nagios and Sensu:
Nagios requires all the configuration for 1) checks, 2) handlers and, most importantly, 3) hosts to be written in configuration files on the Nagios server. This means that each time one of the three is changed (for example new hosts added, old hosts removed) you need to rewrite the configuration files and restart Nagios.
Sensu is almost the same as the above, with one important difference -- when hosts are added or removed from your architecture (as is the case in most auto-scaling cloud deployments) -- the hosts themselves run a sensu-client that "subscribes" to different available checks. So when a new server comes into existence and says "I'm a webserver", the sensu-client running on it will ask the sensu-server "what checks should a webserver run on itself?" and run those.
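As a toy illustration of that subscription idea (this is not Sensu's actual API or wire protocol; the role names, check names and the `checks_for` function are all hypothetical), the server side only needs a mapping from roles to checks, and no per-host configuration at all:

```python
from typing import Dict, List

# Hypothetical check definitions held by the server, keyed by subscription (role).
CHECKS_BY_SUBSCRIPTION: Dict[str, List[str]] = {
    "base": ["check_disk", "check_load"],
    "webserver": ["check_http", "check_nginx_process"],
    "database": ["check_db_connections"],
}


def checks_for(subscriptions: List[str]) -> List[str]:
    """Resolve the checks a client should run from the roles it announces."""
    resolved: List[str] = []
    for role in subscriptions:
        # Unknown roles simply contribute no checks.
        for check in CHECKS_BY_SUBSCRIPTION.get(role, []):
            if check not in resolved:  # keep order, drop duplicates
                resolved.append(check)
    return resolved
```

A freshly launched web server would announce `["base", "webserver"]` and get its checks back, so nothing has to be edited or restarted on the server when hosts come and go.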
Other than this, operations-wise both Nagios (also Icinga) and Sensu are great and have a lot of facilities for checks, handlers, and visibility through a dashboard (YMMV).

From a little recent experience with Sensu and quite a bit of experience with Nagios I'd say both are excellent choices.
Sensu is definitely the new kid. It has a nice UI and nice API. It does however require Redis and RabbitMQ in your setup to work. So consider if you'll therefore want something to monitor those dependencies outside the sensu monitoring stack. Sonian provide Chef recipes for trying it out too.
https://github.com/sensu/sensu-chef
Nagios has been around for an awfully long time. It's generally packaged for most distros, which makes installation simple, and it has few dependencies. Its track record also means that finding people who know it, or who have used it and can offer advice, is easy. On the other hand, the UI is ugly and programmatic access is often hacky or via third-party add-ons. Chef recipes also exist for Nagios:
https://github.com/bryanwb/chef-nagios
If you have time I'd try both; there is little harm in having two monitoring systems running as a trial. The main thing to focus on, especially in a dynamic EC2 setup, is how easily the monitoring configuration files can be generated by your configuration management tool.
In terms of other tools I'd personally include something to record time series data, for instance requests per second or load over time. Graphs are a great help with monitoring, and can be used to drive alerting via Nagios or similar. Personally I'm a fan of both Ganglia and Graphite while Librato Metrics (https://metrics.librato.com/) is a very nice non-free option.
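To show what "requests per second over time" means in practice, here is a small self-contained sketch. It is only an illustration of the idea, not how Ganglia, Graphite or Librato compute rates; `RateWindow` and its methods are invented names, and it assumes timestamps are recorded in increasing order.

```python
from collections import deque
from typing import Deque


class RateWindow:
    """Counts events in a sliding time window to derive a per-second rate."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._timestamps: Deque[float] = deque()

    def record(self, timestamp: float) -> None:
        """Note one event (e.g. one request) at `timestamp`, in seconds."""
        self._timestamps.append(timestamp)

    def rate(self, now: float) -> float:
        """Events per second over the window ending at `now`."""
        cutoff = now - self.window_seconds
        # Drop events that have fallen out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) / self.window_seconds
```

Sampling `rate()` at a fixed interval gives the kind of series you would graph, and comparing it against a threshold is the sort of thing that can drive an alert.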

I tried using Nagios for a while: I got the feeling that the only reason it's common is that 'everyone else uses it', because it's absolutely hideous to work with. It's massively overcomplicated, and making it do anything new is difficult and long-winded: if you find something it doesn't do, you know you're in for a week of swearing at crummy documentation of an archaic design. At the end of all your efforts, when it's all working, it still looks hideous. Scrapping it made me sleep better.
Cacti looks nice, but again it's unnecessarily complex when creating new plugins.
For graphing I'd recommend Munin: it's completely trivial to write new plugins in any language, there are hundreds available, and it looks reasonable. It's incredibly easy to install - one command to install and set one access rule, so works well for automated deployments, easy to wrap into a chef recipe. 2.0 is out soon and addresses most of its shortcomings (in particular adding variable update intervals, zoomable graphs, ssh transport). Munin can talk to Nagios for notifications, or it can do that itself, and it provides a basic dashboard.
For local process/file/service monitoring, monit is simpler and works better than god. I've not tried it with m/monit.

Comparing Sensu and Nagios, my pick would be Sensu.
Below are the main reasons:
1. Easy setup. Clients need to be restarted far less often, which is a major source of trouble in a large enterprise.
2. Nagios plugins can be used with the Sensu ecosystem.
3. Scalable, and easy to use in a cloud environment.
Has anyone heard about Zabbix? It has a lot of features and comes as a single package, but I doubt its scalability.

As long as enterprise IT consists of databases, SAP, network devices, webservers, filers, backup libraries and so on, there is barely an alternative to Nagios (or its cousins Icinga and Shinken).
Maybe one day everything will come out of clouds automagically, but for a few more years there will still be static servers (physical or virtual, it doesn't matter) with a defined purpose, staying around for at least a few months. We will still have to monitor interface bandwidth, tablespaces, business processes, database sessions, logfiles and JMX metrics. These are all things where the plugin concept of the Nagios world has an advantage.
