I'm designing an infrastructure for application tracing with the following constructs:
The applications are written in .NET 6
The logging protocol is OTLP
The logging target is Dynatrace
We are looking at multi-organizational microservices with potentially heavy tracing.
Keeping in mind the expected heavy concurrent workload, which would make more sense to use:
the dedicated Dynatrace exporter, or
a logging framework (e.g. Serilog with a Dynatrace sink)?
Dynatrace does not offer a custom exporter for OpenTelemetry Logs.
Maybe you got confused by the Dynatrace metrics exporter? Dynatrace also does not offer a Serilog sink.
Dynatrace offers OTLP ingest over HTTP/protobuf for the stable signals: traces and metrics.
The Logs API/SDK is not yet stable. I'd expect Dynatrace to follow suit and offer the same ingest channel for OTel logs once they reach a mature state.
Currently, you can still ingest logs following one of the approaches listed here: https://www.dynatrace.com/support/help/extend-dynatrace/extend-logs
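For the trace side, the wiring amounts to pointing a standard OTLP/HTTP exporter at your environment's OTLP ingest endpoint and authenticating with an API token. Here is a minimal sketch in Go (the same pattern applies to the .NET OTLP exporter); the environment ID, token, and ingest path are placeholders you should double-check against the Dynatrace docs:

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// OTLP over HTTP/protobuf, pointed at the Dynatrace trace ingest.
	// The environment ID, path, and token are placeholders.
	exporter, err := otlptracehttp.New(ctx,
		otlptracehttp.WithEndpoint("abc12345.live.dynatrace.com"),
		otlptracehttp.WithURLPath("/api/v2/otlp/v1/traces"),
		otlptracehttp.WithHeaders(map[string]string{
			"Authorization": "Api-Token <YOUR-TOKEN>",
		}),
	)
	if err != nil {
		log.Fatal(err)
	}

	// Register a tracer provider that batches spans to the exporter.
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
	defer tp.Shutdown(ctx)
	otel.SetTracerProvider(tp)
}
```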
I use Prometheus to gather Kubernetes resource metrics.
The resource data pipeline is as follows:
k8s -> Prometheus -> Java app -> Elasticsearch -> Java app
Here I have a question.
Why use Prometheus?
Wouldn't Prometheus be unnecessary if the data were stored in a DB like mine?
Whether I use Elasticsearch or MongoDB, would I still need Prometheus?
It definitely depends on what exactly you are trying to achieve by using these tools. In general, the scope of usage is quite different.
Prometheus is specifically designed for metrics collection, system monitoring, and alerting based on those metrics. That makes it the better choice if your primary requirement is to pull metrics from services and run alerts on them.
Elasticsearch, in turn, has a wider scope: it stores and searches many kinds of data and supports different types of analytics on that data - most commonly it is used as a log-analysis system. It can also be configured for monitoring, though unlike Prometheus it was not built specifically for that purpose.
Both tools are good choices, but Prometheus is simpler to set up for monitoring Kubernetes.
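To make the pull model concrete: a service only has to expose a /metrics endpoint, and Prometheus scrapes it on a schedule, stores the time series, and evaluates alerting rules against them. A minimal sketch with the Go client library (the metric name and port are arbitrary examples):

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// A counter that Prometheus will scrape; the name is arbitrary here.
var requestsTotal = promauto.NewCounter(prometheus.CounterOpts{
	Name: "myapp_requests_total",
	Help: "Total number of handled requests.",
})

func main() {
	http.HandleFunc("/work", func(w http.ResponseWriter, r *http.Request) {
		requestsTotal.Inc()
		w.Write([]byte("ok"))
	})

	// Prometheus scrapes this endpoint; alerting rules are then defined
	// on the stored time series, not inside the application.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```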
I have come to know about OpenTracing and am working on a POC with Jaeger and Spring. We have around 25+ microservices in production. I have read about it, but I am a bit confused about how it can really be used.
I'm thinking to use it as a troubleshooting tool to identify the root cause of a failure in the application. For this, we can search for httpStatus codes, custom tags, traceIds and application logs in JaegerUI. Also, we can find areas of bottlenecks/slowness by monitoring the traces.
What are the other usages?
Jaeger has a request sampler and I think we should not sample every request in Prod as it may have adverse impact. Is this true?
If yes, then why and what can be the impact on the application? I guess it can't be really used for troubleshooting in this case as we won't have data on every request.
What sampling configuration is recommended for Prod?
Also, how is a tool like Jaeger different from APM tools, and where does it fit in? I mean, you can do something similar with APM tools as well. For example, in AppDynamics one can drill through a service's transaction and jump to the corresponding request to another service. Alerts can be put on slow transactions. One can also capture request headers/body so that they can be searched upon, etc.
There's a lot of different questions here, and some of them don't have answers without more information about your specific setup, but I'll try to give you a good overview.
Why Tracing?
You've already intuited that there are a lot of similarities between "APM" and "tracing" - the differences are fairly minimal. Distributed Tracing is a superset of capabilities marketed as APM (application performance monitoring) and RUM (real user monitoring), as it allows you to capture performance information about the work being done in your services to handle a single, logical request both at a per-service level, and at the level of an entire request (or transaction) from client to DB and back.
Trace data, like other forms of telemetry, can be aggregated and analyzed in different ways - for example, unsampled trace data can be used to generate RED (rate, error, duration) metrics for a given API endpoint or function call. Conventionally, trace data is annotated (tagged) with properties about a request or the underlying infrastructure handling a request (things like a customer identifier, or the host name of the server handling a request, or the DB partition being accessed for a given query) that allows for powerful exploratory queries in a tool like Jaeger or a commercial tracing tool.
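To make that tagging concrete, here is roughly what annotating a span looks like with the OpenTelemetry Go SDK (Go shown for brevity; the Java API is analogous, and the attribute names below are just examples, not a required schema):

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// handleCheckout starts a span for one unit of work and tags it with the
// kind of properties you would later query on in Jaeger.
func handleCheckout(ctx context.Context, customerID, dbPartition string) {
	tracer := otel.Tracer("checkout-service")

	ctx, span := tracer.Start(ctx, "checkout")
	defer span.End()

	span.SetAttributes(
		attribute.String("customer.id", customerID),
		attribute.String("db.partition", dbPartition),
		attribute.Int("http.status_code", 200),
	)

	// Pass ctx to downstream calls so trace context propagates.
	_ = ctx
}

func main() {
	// Without a registered exporter the spans are no-ops, but the
	// tagging API is the same either way.
	handleCheckout(context.Background(), "customer-42", "partition-7")
}
```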
Sampling
The overall performance impact of generating traces varies. In general, tracing libraries are designed to be fairly lightweight - although there are a lot of factors that influence this overhead, such as the amount of attributes on a span, the log events attached to it, and the request rate of a service. Companies like Google will aggressively sample due to their scale, but to be honest, sampling is more beneficial to consider from a long-term storage perspective rather than an up-front overhead perspective.
While the additional overhead per-request to create a span and transmit it to your tracing backend might be small, the cost to store trace data over time can quickly become prohibitive. In addition, most traces from most systems aren't terribly interesting. This is why dynamic and tail-based sampling approaches have become more popular. These systems move the sampling decision from an individual service layer to some external process, such as the OpenTelemetry Collector, which can analyze an entire trace and determine if it should be sampled in or out based on user-defined criteria. You could, for example, ensure that any trace where an error occurred is sampled in, while 'baseline' traces are sampled at a rate of 1%, in order to preserve important error information while giving you an idea of steady-state performance.
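For reference, plain head-based sampling at the SDK level looks roughly like the Go sketch below (the Java setup is analogous). The error-preserving policy described above would instead be configured in a Collector tail-sampling processor, since it needs to see whole traces before deciding:

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	// Any exporter works here; stdout keeps the sketch self-contained.
	exporter, err := stdouttrace.New()
	if err != nil {
		log.Fatal(err)
	}

	// Head-based sampling: keep roughly 1% of new traces, and respect the
	// caller's sampling decision so traces stay intact end to end.
	sampler := sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.01))

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithSampler(sampler),
		sdktrace.WithBatcher(exporter),
	)
	defer tp.Shutdown(context.Background())
	otel.SetTracerProvider(tp)
}
```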
Proprietary APM vs. OSS
One important distinction between something like AppDynamics or New Relic and tools like Jaeger is that Jaeger does not rely on proprietary instrumentation agents in order to generate trace data. Jaeger supports OpenTelemetry, allowing you to use open source tools like the OpenTelemetry Java Automatic Instrumentation libraries, which will automatically generate spans for many popular Java frameworks and libraries, such as Spring. In addition, since OpenTelemetry is available in multiple languages with a shared data format and trace context format, you can guarantee that your traces will work properly in a polyglot environment (so, if you have Node.JS or Golang services in addition to your Java services, you could use OpenTelemetry for each language, and trace context propagation would work seamlessly between all of them).
Even more advantageous, though, is that your instrumentation is decoupled from a specific vendor or tool. You can instrument your service with OpenTelemetry and then send data to one - or more - analysis tools, both commercial and open source. This frees you from vendor lock-in, and allows you to select the best tool for the job.
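As a rough sketch of that decoupling, the instrumented code only ever talks to the OpenTelemetry API; the exporters registered on the tracer provider decide where the data goes. Here two backends receive the same spans (the endpoint is a placeholder for whatever Collector or vendor you actually use):

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	"go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// Exporter 1: OTLP over HTTP to whatever backend you choose
	// (Jaeger, a vendor, or an OpenTelemetry Collector in between).
	otlpExp, err := otlptracehttp.New(ctx,
		otlptracehttp.WithEndpoint("collector.example.com:4318"),
		otlptracehttp.WithInsecure(),
	)
	if err != nil {
		log.Fatal(err)
	}

	// Exporter 2: spans printed to stdout, handy for local debugging.
	stdoutExp, err := stdouttrace.New(stdouttrace.WithPrettyPrint())
	if err != nil {
		log.Fatal(err)
	}

	// Swapping or adding backends is a change here, not in the
	// services' instrumentation code.
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(otlpExp),
		sdktrace.WithBatcher(stdoutExp),
	)
	defer tp.Shutdown(ctx)
	otel.SetTracerProvider(tp)
}
```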
If you'd like to learn more about OpenTelemetry, observability, and other topics, I wrote a longer series that you can find here (look for the other 'OpenTelemetry 101' posts).
I am new to Google Pub/Sub and am using the Go client library.
How can I see the OpenCensus metrics recorded by the google-cloud-go library?
I have already successfully published a message to Google Pub/Sub. Now I want to see these metrics, but I cannot find them in Google Stackdriver.
PublishLatency = stats.Float64(statsPrefix+"publish_roundtrip_latency", "The latency in milliseconds per publish batch", stats.UnitMilliseconds)
https://github.com/googleapis/google-cloud-go/blob/25803d86c6f5d3a315388d369bf6ddecfadfbfb5/pubsub/trace.go#L59
This is curious; I'm surprised to see these (machine-generated) APIs sprinkled with OpenCensus (Stats) integration.
I've not tried this but I'm familiar with OpenCensus.
One of OpenCensus' benefits is that it loosely couples the generation of (e.g.) metrics from their consumption. So, while the code defines the metrics (and views), I expect (!?) the API leaves it to you to choose which exporter(s) you'd like to use and to configure them.
In your code, you'll need to import the Stackdriver exporter (and any other exporters you wish to use) and then follow these instructions:
https://opencensus.io/exporters/supported-exporters/go/stackdriver/#creating-the-exporter
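Following those instructions, the wiring would look roughly like this in Go (the project ID is a placeholder, and you may also need to register the relevant views before they record data):

```go
package main

import (
	"log"

	"contrib.go.opencensus.io/exporter/stackdriver"
	"go.opencensus.io/stats/view"
)

func main() {
	// Create the Stackdriver exporter; the project ID is a placeholder
	// and must point at a project with Stackdriver Monitoring enabled.
	sd, err := stackdriver.NewExporter(stackdriver.Options{
		ProjectID: "your-gcp-project",
	})
	if err != nil {
		log.Fatal(err)
	}
	defer sd.Flush()

	// Register the exporter so that views recorded by the pubsub package
	// (e.g. publish_roundtrip_latency) are exported to Stackdriver.
	view.RegisterExporter(sd)

	// ... create your pubsub client and publish as before ...
}
```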
NOTE I encourage you to look at the OpenCensus Agent too as this further decouples your code; you reference the generic Opencensus Agent in your code and configure the agent to route e.g. metrics to e.g. Stackdriver.
For Stackdriver, you will need to configure the exporter with a GCP Project ID, and that project will need to have Stackdriver Monitoring enabled (and configured). I've not used Stackdriver in some months, but this used to require a manual step too. The easiest way to check is to visit:
https://console.cloud.google.com/monitoring/?project=[[YOUR-PROJECT]]
If I understand the intent (!) correctly, I expect API calls will then record stats against the metrics in the views defined in the code you referenced.
Once you're confident that metrics are being shipped to Stackdriver, the easiest way to confirm this is to query a metric using Stackdriver's metrics explorer:
https://console.cloud.google.com/monitoring/metrics-explorer?project=[[YOUR-PROJECT]]
You may wish to test this approach using the Prometheus exporter because it's simpler. After configuring the Prometheus exporter, when you run your code, it will create an HTTP server and you can curl the metrics being generated at:
http://localhost:8888/metrics
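A sketch of that test setup (the namespace is arbitrary; the port matches the URL above):

```go
package main

import (
	"log"
	"net/http"

	"contrib.go.opencensus.io/exporter/prometheus"
	"go.opencensus.io/stats/view"
)

func main() {
	// Create a Prometheus exporter and register it with OpenCensus.
	pe, err := prometheus.NewExporter(prometheus.Options{Namespace: "demo"})
	if err != nil {
		log.Fatal(err)
	}
	view.RegisterExporter(pe)

	// The exporter is an http.Handler; serve it so you can
	// `curl http://localhost:8888/metrics` while your code runs.
	go func() {
		mux := http.NewServeMux()
		mux.Handle("/metrics", pe)
		log.Fatal(http.ListenAndServe(":8888", mux))
	}()

	// ... run your pubsub publishing code here ...
	select {} // keep the process alive for this sketch
}
```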
NOTE OpenCensus is being (!?) deprecated in favor of a replacement solution called OpenTelemetry.
I was looking at APM tools, essentially Dynatrace, and I could see that it also provides tracing capabilities that seem to be language-agnostic and require no code modifications.
Where would Jaeger/OpenTracing be a better option than a tool like Dynatrace?
Yes, Dynatrace (or others like Elastic APM) is capable of providing a lot more insight into an application beyond tracing.
But just from the tracing perspective...
What advantages or capabilities does Jaeger have that are better than APM tooling, or that are not available in APMs - ONLY from the tracing perspective?
As you said, Dynatrace can provide more insights, but obviously that comes with a price tag.
A Jaeger/OpenTracing solution:
can be up and running very quickly
provides quality insights into performance/bottlenecks in your execution paths
gives you access to the source code, which is very useful if you want to customize any part of the process (for example, I added some code to use a different message queue)
I would add that Dynatrace is a great tool, but it is a full APM tool, so it provides a wide variety of insights, and it's expensive. Jaeger focuses on the tracing aspect and, for an open-source, free tool, it does a very good job.
I am using the GKE platform to implement a Kubernetes scheduler. I am using Prometheus and Grafana to monitor the applications.
For implementing a scheduler in golang, I need to get the metrics as an input to the scheduler.
Please suggest some methods to do so.
Please also point me to proper documentation so that I can easily understand it.
I am a newbie, so I don't know much about this yet.
Your help will be appreciated.
First, I would encourage you to read the relevant documentation about the Kubernetes monitoring architecture, which explains the main concepts behind Kubernetes metrics. Since you use Prometheus as the main monitoring agent for the cluster, you are probably already working with specific metrics exposed by the applications in your Kubernetes infrastructure; when you plan a custom scheduler, the main task is to adapt these metrics so they can drive the scheduler's behavior. A good example of a tool that helps here is Sysdig, as it can automatically collect Prometheus metrics and propagate them across applications in the cluster.
You can also look at the custom scheduler project on GitHub that is based on Sysdig monitoring metrics and driven by open-source community enthusiasts.
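If your scheduler should consume metrics that Prometheus has already collected, one straightforward option is to query the Prometheus HTTP API directly from Go. A minimal sketch (the Prometheus address and the PromQL query are placeholders you would adapt to your cluster):

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Address of the Prometheus server inside the cluster (placeholder).
	client, err := api.NewClient(api.Config{
		Address: "http://prometheus.monitoring.svc:9090",
	})
	if err != nil {
		log.Fatal(err)
	}

	promAPI := v1.NewAPI(client)
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Example PromQL: CPU usage grouped by node; adapt the query to
	// whatever signal should drive your scheduling decisions.
	result, warnings, err := promAPI.Query(ctx,
		`sum(rate(container_cpu_usage_seconds_total[5m])) by (node)`,
		time.Now())
	if err != nil {
		log.Fatal(err)
	}
	if len(warnings) > 0 {
		log.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```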