Usages of Opentracing tools like Jaeger - spring

I have come to know about opentracing and is even working on a POC with Jaeger and Spring. We have around 25+ micro services in production. I have read about it but is a bit confused as how it can be really used.
I'm thinking to use it as a troubleshooting tool to identify the root cause of a failure in the application. For this, we can search for httpStatus codes, custom tags, traceIds and application logs in JaegerUI. Also, we can find areas of bottlenecks/slowness by monitoring the traces.
What are the other usages?
Jaeger has a request sampler and I think we should not sample every request in Prod as it may have adverse impact. Is this true?
If yes, then why and what can be the impact on the application? I guess it can't be really used for troubleshooting in this case as we won't have data on every request.
What sampling configuration is recommended for Prod?
Also, how a tool like Jaeger is different from APM tools and where does it fit in? I mean you can do something similar with APM tools as well. For e.g., one can drill through a service's transaction and jump to corresponding request to other service in AppDynamics. Alerts can be put on slow transactions. One can also capture request headers/body so that they can be searched upon, etc.

There's a lot of different questions here, and some of them don't have answers without more information about your specific setup, but I'll try to give you a good overview.
Why Tracing?
You've already intuited that there are a lot of similarities between "APM" and "tracing" - the differences are fairly minimal. Distributed Tracing is a superset of capabilities marketed as APM (application performance monitoring) and RUM (real user monitoring), as it allows you to capture performance information about the work being done in your services to handle a single, logical request both at a per-service level, and at the level of an entire request (or transaction) from client to DB and back.
Trace data, like other forms of telemetry, can be aggregated and analyzed in different ways - for example, unsampled trace data can be used to generate RED (rate, error, duration) metrics for a given API endpoint or function call. Conventionally, trace data is annotated (tagged) with properties about a request or the underlying infrastructure handling a request (things like a customer identifier, or the host name of the server handling a request, or the DB partition being accessed for a given query) that allows for powerful exploratory queries in a tool like Jaeger or a commercial tracing tool.
Sampling
The overall performance impact of generating traces varies. In general, tracing libraries are designed to be fairly lightweight - although there are a lot of factors that influence this overhead, such as the amount of attributes on a span, the log events attached to it, and the request rate of a service. Companies like Google will aggressively sample due to their scale, but to be honest, sampling is more beneficial to consider from a long-term storage perspective rather than an up-front overhead perspective.
While the additional overhead per-request to create a span and transmit it to your tracing backend might be small, the cost to store trace data over time can quickly become prohibitive. In addition, most traces from most systems aren't terribly interesting. This is why dynamic and tail-based sampling approaches have become more popular. These systems move the sampling decision from an individual service layer to some external process, such as the OpenTelemetry Collector, which can analyze an entire trace and determine if it should be sampled in or out based on user-defined criteria. You could, for example, ensure that any trace where an error occurred is sampled in, while 'baseline' traces are sampled at a rate of 1%, in order to preserve important error information while giving you an idea of steady-state performance.
Proprietary APM vs. OSS
One important distinction between something like AppDynamics or New Relic and tools like Jaeger is that Jaeger does not rely on proprietary instrumentation agents in order to generate trace data. Jaeger supports OpenTelemetry, allowing you to use open source tools like the OpenTelemetry Java Automatic Instrumentation libraries, which will automatically generate spans for many popular Java frameworks and libraries, such as Spring. In addition, since OpenTelemetry is available in multiple languages with a shared data format and trace context format, you can guarantee that your traces will work properly in a polyglot environment (so, if you have Node.JS or Golang services in addition to your Java services, you could use OpenTelemetry for each language, and trace context propagation would work seamlessly between all of them).
Even more advantageous, though, is that your instrumentation is decoupled from a specific vendor or tool. You can instrument your service with OpenTelemetry and then send data to one - or more - analysis tools, both commercial and open source. This frees you from vendor lock-in, and allows you to select the best tool for the job.
If you'd like to learn more about OpenTelemetry, observability, and other topics I wrote a longer series that you can find here (look for the other 'OpenTelemetry 101' posts).

Related

Should event driven architecture be targeted for all data & analytics platforms?

For example,
You have an IT estate where a mix of batch and real-time data sources exists from multiple systems, e.g. ERP, Project management, asset, website, monitoring etc.
The aim is to integrate the datasources into a cloud environment (agnostic).
There is a need for reporting and analytics on combinations of all data sources.
Inevitably, some source systems are not capable of streaming, hence batch loading is required.
Potential use-cases for performing functionality/changes/updates based on the ingested data.
Given a steer for creating a future-proofed platform, architecturally, how would you look to design it?
It's a very open-end question, but there are some good principles you can adopt to help direct you in the right direction:
Avoid point-to-point integration, and get everything going through a few common points - ideally one. Using an API Gateway can be a good place to start, the big players (Azure, AWS, GCP) all have their own options, plus there's lots of decent independent ones like Tyk or Kong.
Batches and event-streams are totally different, but even then you can still potentially route them all through the gateway so that you get the centralised observability (reporting, analytics, alerting, etc).
Use standards-based API specifications where possible. A good REST based API, based off a proper resource model is a non-trivial undertaking, not sure if it fits with what you are doing if you are dealing with lots of disparate legacy integration. If you are going to adopt REST, use OpenAPI to specify the API's. Using this standard not only makes it easier for consumers, but also helps you with better tooling as many design, build and test tools support OpenAPI. There's also AsyncAPI for event/async API's
Do some architecture. Moving sh*t to cloud doesn't remove the sh*t - it just moves it to the cloud. Don't recreate old problems in a new place.
Work out the logical components in your new solution: what does each of them do (what's it's reason to exist)? Don't forget ancillary components like API catalogues, etc.
Think about layering the integration (usually depending on how they will be consumed and what role they need to play, e.g. system interface, orchestration, experience APIs, etc).
Want to handle data in a consistent way regardless of source (your 'agnostic' comment)? You'll need to think through how data is ingested and processed. This might lead you into more data / ETL centric considerations rather than integration ones.
Co-design. Is the integration mainly data coming in or going out? Is the integration with 3rd parties or strictly internal?
If you are designing for external / 3rd party consumers then a co-design process is advised, since you're essentially designing the API for them.
If the API's are for internal use, consider designing them for external use so that when/if you decide to do that later it's not so hard.
Taker a step back:
Continually ask yourselves "what problem are we trying to solve?". Usually, a technology initiate is successful if there's a well understood reason for doing it, which has solid buy-in from the business (non-IT).
Who wants the reporting, and why - what problem are they trying to solve?
As you mentioned its an IT estate aka enterprise level solution mix of batch and real time so first you have to identify what is end goal of this migration. You can think of refactoring applications. If you are trying to make it event driven then assess the refactoring efforts and cost. Separation of responsibility is the key factor for refactoring and migration.
If you are thinking about future proofing your solution then consider Cloud for storing and processing your data. Not necessary it will be cheap but mix of Cloud and on-prem could be a way. There are services available by cloud providers to move your data in minimal cost. Cloud native solutions are there for performing analysis on your data. Database migration service in AWS or Azure can move data and then capture on-going changes. So you can keep using on-prem db & apps and perform analysis for reporting on cloud. It will ease out load on your transactional DB. Most data sync from on-prem to cloud is near real time.

KDB+/Q: GRPC implementation?

gRPC is a modern open source high performance RPC framework that can
run in any environment. It can efficiently connect services in and
across data centers with pluggable support for load balancing,
tracing, health checking and authentication. It is also applicable in
last mile of distributed computing to connect devices, mobile
applications and browsers to backend services.
I'm finding GRPC is becoming increasingly more pertinent in backend infrastructure, and would've liked to have it in my favorite language/tsdb kdb+/q.
I was surprised to find that kdb+ does not have a grpc implementation. Obviously, the (https://code.kx.com/q/interfaces/protobuf/)
package doesn't support the parsing of rpc's, is there anything quantitatively preventing there being a KDB+ implementation of the rpc requests/services etc. found in grpc?
Why would one not want to implement rpc's (grpc) in kdb+ and would it be a good idea to wrap a c++/c implemetation therin inorder to achieve this functionality.
Thanks for your advice.
Interesting post:
https://zimarev.com/blog/event-sourcing/myth-busting/2020-07-09-overselling-event-sourcing/
outlines event sourcing, which I think might be a better fit for kdb?
What is the main issue with services using RPC calls to exchange information? Well, it’s the high degree of coupling introduced by RPC by its nature. The whole group of services or even the whole system can go down if only one of the services stops working. This approach diminishes the whole idea of independent components.
In my practice I hardly encounter any need to use RPC for inter-service communication. Partially because I often use Event Sourcing, more about it later. But we always use asynchronous communication and exchange information between services using events, even without Event Sourcing.
For example, an order microservice in an e-commerce system needs customer data from the customer microservice. These dependencies between microservices are not ideal. Other microservices can go down and synchronous RESTful requests over https do not scale well due to their blocking nature. If there was a way to completely eliminate dependencies between microservices completely the result would be a more robust architecture with less bottlenecks.
You don’t need Event Sourcing to fix this issue. Event-driven systems are perfectly capable of doing that. Event Sourcing can eliminate some of the associated issues like two-phase commits, but again, not a requirement to remove the temporal coupling from your system.

Passively Logging React App Performance in Production

I'm wondering if there are any utilities/patterns/paradigms/standards for monitoring React applications in production.
I've seen a lot of documentation about React performance debugging that recommends the Chrome Dev Tools (which are great, but aren't a passive way to monitor end user performance)
How could I log data to know how long users are waiting for components to mount or render?
The only thing I've thought of so far is creating a Loggable[Pure]Component that extends React.[Pure]Component whose constructor, componentWillMount/Update, and componentDidMount/Update methods log render/mount times to a server. Then, components I want to monitor can extend these components and, if need be, call super() in the lifecycle methods before doing their own work. To specifically know which components these metrics go to, I'd have to expose a method in the Loggable[Pure]Component class that does something silly like setUniqueId and then each derived class would have to call it in the constructor.
This all seems terrible and I'm very much hoping there are some things people out there have implemented, but I haven't found anything thus far.
I would have a look at some APM tools, they handle the frontend monitoring, and the backend monitoring as well. They all support react, and folks use these all the time for that use case. It really depends on your goals in the monitoring, are you doing this for fun? Do you have a startup? Are you working for a large enterprise? There are 3 major players in this market.
AppDynamics - Enterprise APM, handles the most complex apps. Unified product offering delivered SaaS or on-premises. Has deep database, server, and other monitoring.
Dynatrace - Enterprise APM, handles complex apps well. Fragmented portfolio, but the SaaS product is good. The SaaS product has limited depth in some ways. Handles server and cloud infrastructure monitoring well.
New Relic - Easy and cheap(er than others), not as in-depth as some other options. Tends to be popular with small companies. Does a good job monitoring cloud infrastructure services.
These products all do what you are looking for, but it depends on your goals with the data and how you plan to analyze it.
If you want something free and less functional there are ways to do this with open source, but you'll have to stand up and manage a pretty complex stack. Here is one option.
Check out boomerang, which can log/extract the metrics you are looking for, it doesn't "understand" react, but it should work. This data can be posted to many different systems. The best suited is likely the ELK stack (open source log analytics, and more). Here is one of several examples which marries these two together to provide analysis of the browser performance https://github.com/naukri-engineering/NewMonk

Performance monitoring all layers of a system

I use several loadtesting tools (Loadrunner, JMeter, NeoLoad) to performance test different applications. Im wondering if it is possible to monitor all layers of an application stack so for example. Say i have the following data chain.
Loadbalancer <-x-> Application Server <-x-> RMI <-x-> Java Application <-x-> MQ <-x-> Legacy application <-x-> Database
Where i have marked the x in the chain i am interested in monitoring, for example avg responsetimes.
Obviously we could simply create a wrapper on all endpoints which would gather the statistics for us and maybe we could import it into loadrunner or other loadtesting tools and sideline hem with the tools inbuilt performance statistics, but maybe there is tools/applications which already does this?
If not, how should we proceed, in order to gather this kind of statistics?
The standard for this was supposed to be Application Response Measurement (ARM). It was a cross language set of APIs that did just what you were looking for. The issue is that the products that implement this spec all tend to be big, expensive "enterprise" level monitoring tools. Think multi-week installs, consultants, more infrastructure and lots of buzzwords.
Still, if this is a mission critical app with a mission critical budget, this may be what you need. But you may be able to build your own that does just enough without too much effort. A quick search turns up at least one open source ARM implementation if you still want to use that API.
Another option is to simply to have transactions you can run against each tier of the system to check general responsiveness. For example you can have a static web page on the LB, a no-op tx on the app server, a "hello" servlet on the Java app, put a message directly on the queue, etc. During a performance / load test, these could be hit directly by the load testing tool or you could write a wrapper servlet / application call that does this as a single HTTP (RMI?) call. Running these a few times a minute won't add too much load to the system, but it should help you pinpoint which tier is slower. The nice thing about this approach is that it also works in production, just watch out for security issues.
For single user kind of test, where you know you have problem (e.g. this tx is "slow"), I have also had pretty good luck with network tracing. It's very tedious, but when you aren't sure what tier is slow, starting up a network trace on a few machines and running a single tx usually gives a good idea of what the system is doing.
I have handled this decomposition a number of ways in the past. The first is at a very low level using protocol analyzer dumped data to find the time points where a conversation leaves tier X and enters tier Y. The second method is through the use of log examination for the various tiers. Something that can make your examination quite usefule in this case is a common log server for all of your components (syslog, Rsyslog, etc....) and a nice log parsing tool, such as the freely available Microsoft Logparser. The third method utilization of the audit trail for an application stored in the database. You may find this when working on enterprise services bus style applications which have a consumer/producer model and a bus to pass information rather than a direct connection. The audit trails I have seen are typically stored in a database and allow the tracking of an individual transaction through the entire application infrastructure. Your Load balancer, as a network device, may be out of the hunt on this one.
Note, if you go the protocol analyzer or log route, then be sure and synchronize all of your source information devices to a common time server. Having one of your collectors (analyzer, app log) off on a time stamp basis can really be a hair pulling experience when you get into the analysis phase.
As to how you move from your collected data into LoadRunner, that part is very mechanical. The Analysis program supports an interface to import external datapoints. The format is very specific and is documented in both help and the online docs. This import process works very well, as I often have to use it for collection of statistics from hosts which I do not have direct monitoring access to, but which need to be included as a part of the monitored test infrastructure.
James Pulley
Moderator (YahooGroups LoadRunner, Advanced-Loadrunner; GoogleGroups lr-LoadRunner; Linkedin LoadRunner, LoadRunnerByTheHour; SQAForums LoadRunner, WinRunner)

Measuring Application Performance

I was wondering if there is a tool to keep track of application performance. What I have in mind is a tool that will listen for updates and register performance metrics published by an application. i.e. time to serve a request, time a certain operation took to finish. And this tool would then aggregate the data and measure performance trends.
If you want to measure your application from outside, then you can use RRDtool to collect the data.
You can use slamd for webapp written in Java.
For Django use hotshot.
Search for profiler + your language, framework
Take a look at HP SiteScope. It's ability to drive the system with a Web User Script, to monitor the metrics on the backend, even to the extent of creation of custom shell scripts and database queries, plus the ability to add logic for report/alert against these combined data sets appears to be what you need.
Other mechanisms that you might consider would be a roll your own service using CURL to push information in, queries to the systems involved to pull metrics or database information and then your own interface for alerting and reporting.
Then it becomes a cost question, can you roll the level of functionality for less money than you can purchase an already existing solution on the open market.
Ref:
HP SiteScope Wiki Page

Resources