WLDF (WebLogic Diagnostic Framework) allows many performance-related analyses, in particular resource-demand tracking and tracing across classes and methods. In that sense, it is similar to a profiler; however, it works on the server side and is tied to a particular product/vendor.
Are there any other products (maybe even open/free) which offer a similar level of detail? I'm not interested in "conventional" monitoring products such as JMX, VisualVM, Hyperic, etc., but in low-level, detailed tracing and request tracking.
Many thanks,
Michael
The free version of http://appdynamics.com/ is probably your best bet. It does low-level, detailed tracing without the traditional overhead of a profiler (so yes, you can run it in prod).
I have come to know about OpenTracing and am working on a POC with Jaeger and Spring. We have 25+ microservices in production. I have read about it, but I am a bit confused about how it can really be used.
I'm thinking of using it as a troubleshooting tool to identify the root cause of a failure in the application. For this, we can search for httpStatus codes, custom tags, traceIds, and application logs in the Jaeger UI. Also, we can find areas of bottlenecks/slowness by monitoring the traces.
What are the other usages?
Jaeger has a request sampler, and I think we should not sample every request in prod as it may have an adverse impact. Is this true?
If yes, then why, and what can the impact on the application be? I guess it can't really be used for troubleshooting in this case, as we won't have data on every request.
What sampling configuration is recommended for Prod?
Also, how is a tool like Jaeger different from APM tools, and where does it fit in? I mean, you can do something similar with APM tools as well. For example, one can drill through a service's transaction and jump to the corresponding request to another service in AppDynamics. Alerts can be put on slow transactions. One can also capture request headers/bodies so that they can be searched upon, etc.
There are a lot of different questions here, and some of them don't have answers without more information about your specific setup, but I'll try to give you a good overview.
Why Tracing?
You've already intuited that there are a lot of similarities between "APM" and "tracing" - the differences are fairly minimal. Distributed Tracing is a superset of capabilities marketed as APM (application performance monitoring) and RUM (real user monitoring), as it allows you to capture performance information about the work being done in your services to handle a single, logical request both at a per-service level, and at the level of an entire request (or transaction) from client to DB and back.
Trace data, like other forms of telemetry, can be aggregated and analyzed in different ways; for example, unsampled trace data can be used to generate RED (rate, error, duration) metrics for a given API endpoint or function call. Conventionally, trace data is annotated (tagged) with properties about a request or the underlying infrastructure handling it (things like a customer identifier, the host name of the server handling the request, or the DB partition being accessed for a given query), which enables powerful exploratory queries in a tool like Jaeger or a commercial tracing tool.
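To make that concrete, here is a minimal sketch of span tagging using the OpenTracing Java API (which Jaeger implements); the span name and tag keys are invented for the example:

```java
import io.opentracing.Span;
import io.opentracing.Tracer;
import io.opentracing.util.GlobalTracer;

public class TaggedSpanExample {
    public void lookupOrder(String customerId) {
        Tracer tracer = GlobalTracer.get();
        // Span name and tag keys are hypothetical; use whatever properties
        // you want to search on later in the Jaeger UI.
        Span span = tracer.buildSpan("order-lookup").start();
        try {
            span.setTag("customer.id", customerId);
            span.setTag("db.partition", "orders-eu-1");
            span.setTag("http.status_code", 200);
            // ... perform the actual work here ...
        } finally {
            span.finish();
        }
    }
}
```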
Sampling
The overall performance impact of generating traces varies. In general, tracing libraries are designed to be fairly lightweight, although there are a lot of factors that influence this overhead, such as the number of attributes on a span, the log events attached to it, and the request rate of the service. Companies like Google will aggressively sample due to their scale, but to be honest, sampling is more beneficial to consider from a long-term storage perspective than from an up-front overhead perspective.
While the additional overhead per-request to create a span and transmit it to your tracing backend might be small, the cost to store trace data over time can quickly become prohibitive. In addition, most traces from most systems aren't terribly interesting. This is why dynamic and tail-based sampling approaches have become more popular. These systems move the sampling decision from an individual service layer to some external process, such as the OpenTelemetry Collector, which can analyze an entire trace and determine if it should be sampled in or out based on user-defined criteria. You could, for example, ensure that any trace where an error occurred is sampled in, while 'baseline' traces are sampled at a rate of 1%, in order to preserve important error information while giving you an idea of steady-state performance.
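If you stay with simple head-based sampling at the service level instead, the Jaeger Java client lets you set the rate directly. A minimal sketch, assuming the io.jaegertracing client library; the service name and the 1% rate are placeholders:

```java
import io.jaegertracing.Configuration;
import io.jaegertracing.Configuration.ReporterConfiguration;
import io.jaegertracing.Configuration.SamplerConfiguration;
import io.opentracing.Tracer;

public class JaegerTracerFactory {
    public static Tracer initTracer() {
        // Head-based probabilistic sampling: keep roughly 1% of traces.
        SamplerConfiguration sampler = SamplerConfiguration.fromEnv()
                .withType("probabilistic")
                .withParam(0.01);
        ReporterConfiguration reporter = ReporterConfiguration.fromEnv()
                .withLogSpans(false);
        // "my-service" is a placeholder service name.
        return Configuration.fromEnv("my-service")
                .withSampler(sampler)
                .withReporter(reporter)
                .getTracer();
    }
}
```

Jaeger also supports a "remote" sampler type, where the sampling strategy is served from the backend, so rates can be adjusted without redeploying services.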
Proprietary APM vs. OSS
One important distinction between something like AppDynamics or New Relic and tools like Jaeger is that Jaeger does not rely on proprietary instrumentation agents in order to generate trace data. Jaeger supports OpenTelemetry, allowing you to use open source tools like the OpenTelemetry Java Automatic Instrumentation libraries, which will automatically generate spans for many popular Java frameworks and libraries, such as Spring. In addition, since OpenTelemetry is available in multiple languages with a shared data format and trace context format, you can guarantee that your traces will work properly in a polyglot environment (so, if you have Node.js or Go services in addition to your Java services, you could use OpenTelemetry for each language, and trace context propagation would work seamlessly between all of them).
Even more advantageous, though, is that your instrumentation is decoupled from a specific vendor or tool. You can instrument your service with OpenTelemetry and then send data to one - or more - analysis tools, both commercial and open source. This frees you from vendor lock-in, and allows you to select the best tool for the job.
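As a rough sketch of what that vendor-neutral instrumentation looks like with the OpenTelemetry Java API (the instrumentation scope, span name, and attribute key here are invented for the example):

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class CheckoutService {
    // The instrumentation scope name is arbitrary.
    private final Tracer tracer = GlobalOpenTelemetry.getTracer("checkout-service");

    public void processOrder(String orderId) {
        Span span = tracer.spanBuilder("process-order").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            span.setAttribute("order.id", orderId);
            // ... business logic; spans created here become children automatically ...
        } catch (RuntimeException e) {
            span.recordException(e);
            span.setStatus(StatusCode.ERROR);
            throw e;
        } finally {
            span.end();
        }
    }
}
```

The same instrumented code can export to Jaeger today and to a commercial backend later; only the exporter configuration changes.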
If you'd like to learn more about OpenTelemetry, observability, and other topics, I wrote a longer series that you can find here (look for the other 'OpenTelemetry 101' posts).
I was looking at APM tools, essentially Dynatrace, and I could see that it also provides tracing capabilities that seem to be language-agnostic and require no code modifications.
Where would Jaeger/OpenTracing be a better option than a tool like Dynatrace?
Yes, Dynatrace (or others like Elastic APM) is capable of providing a lot more insight into an application beyond tracing.
But just from the tracing perspective...
What advantages or capabilities does Jaeger have that are better than APM tooling, or not available in APMs, only from the tracing perspective?
As you said, Dynatrace can provide more insights, but obviously that comes with a price tag.
A Jaeger/OpenTracing solution:
- can be up and running very quickly
- provides quality insights into performance and bottlenecks in your execution paths
- has the source code available, which is very useful if you want to customize any part of the process (for example, I added some code to use a different message queue)
I would add that Dynatrace is a great tool, but it is a full APM tool, so it provides a wide variety of insights, and it's expensive. Jaeger focuses on the tracing aspect, and for an open source, free tool, it does a very good job.
I'm wondering if there are any utilities/patterns/paradigms/standards for monitoring React applications in production.
I've seen a lot of documentation about React performance debugging that recommends the Chrome Dev Tools (which are great, but aren't a passive way to monitor end-user performance).
How could I log data to know how long users are waiting for components to mount or render?
The only thing I've thought of so far is creating a Loggable[Pure]Component that extends React.[Pure]Component whose constructor, componentWillMount/Update, and componentDidMount/Update methods log render/mount times to a server. Then, components I want to monitor can extend these components and, if need be, call super() in the lifecycle methods before doing their own work. To specifically know which components these metrics go to, I'd have to expose a method in the Loggable[Pure]Component class that does something silly like setUniqueId and then each derived class would have to call it in the constructor.
This all seems terrible and I'm very much hoping there are some things people out there have implemented, but I haven't found anything thus far.
I would have a look at some APM tools; they handle frontend monitoring as well as backend monitoring. They all support React, and folks use them all the time for this use case. It really depends on your goals for the monitoring: are you doing this for fun? Do you have a startup? Are you working for a large enterprise? There are three major players in this market:
- AppDynamics - Enterprise APM, handles the most complex apps. Unified product offering delivered SaaS or on-premises. Has deep database, server, and other monitoring.
- Dynatrace - Enterprise APM, handles complex apps well. Fragmented portfolio, but the SaaS product is good; it has limited depth in some ways. Handles server and cloud infrastructure monitoring well.
- New Relic - Easy and cheap(er than others), not as in-depth as some other options. Tends to be popular with small companies. Does a good job monitoring cloud infrastructure services.
These products all do what you are looking for, but it depends on your goals with the data and how you plan to analyze it.
If you want something free and less functional, there are ways to do this with open source, but you'll have to stand up and manage a pretty complex stack. Here is one option.
Check out boomerang, which can log/extract the metrics you are looking for. It doesn't "understand" React, but it should work. This data can be posted to many different systems; the best suited is likely the ELK stack (open-source log analytics, and more). Here is one of several examples which marries these two together to provide analysis of browser performance: https://github.com/naukri-engineering/NewMonk
I use several load-testing tools (LoadRunner, JMeter, NeoLoad) to performance-test different applications. I'm wondering if it is possible to monitor all layers of an application stack. For example, say I have the following data chain:
Loadbalancer <-x-> Application Server <-x-> RMI <-x-> Java Application <-x-> MQ <-x-> Legacy application <-x-> Database
I am interested in monitoring at the points where I have marked an x in the chain, for example average response times.
Obviously we could simply create a wrapper on all endpoints which would gather the statistics for us, and maybe we could import them into LoadRunner or other load-testing tools and set them alongside the tools' built-in performance statistics, but maybe there are tools/applications which already do this?
If not, how should we proceed in order to gather this kind of statistics?
The standard for this was supposed to be Application Response Measurement (ARM). It was a cross-language set of APIs that did just what you are looking for. The issue is that the products that implement this spec all tend to be big, expensive, "enterprise"-level monitoring tools. Think multi-week installs, consultants, more infrastructure, and lots of buzzwords.
Still, if this is a mission-critical app with a mission-critical budget, this may be what you need. But you may be able to build your own that does just enough without too much effort. A quick search turns up at least one open-source ARM implementation if you still want to use that API.
Another option is simply to have transactions you can run against each tier of the system to check general responsiveness. For example, you can have a static web page on the LB, a no-op tx on the app server, a "hello" servlet on the Java app, put a message directly on the queue, etc. During a performance/load test, these could be hit directly by the load-testing tool, or you could write a wrapper servlet/application call that does this as a single HTTP (RMI?) call. Running these a few times a minute won't add too much load to the system, but it should help you pinpoint which tier is slower. The nice thing about this approach is that it also works in production; just watch out for security issues.
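Such a per-tier check can be as small as a servlet that does no real work, so its response time approximates the tier's plumbing overhead. A minimal sketch using the standard javax.servlet API; the class name and URL mapping are arbitrary:

```java
import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Map this to e.g. /ping on the Java application tier. Hitting it from the
// load-testing tool during a run gives a baseline response time for this
// tier, independent of any business logic.
public class PingServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        resp.setContentType("text/plain");
        resp.getWriter().print("OK " + System.currentTimeMillis());
    }
}
```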
For a single-user kind of test, where you know you have a problem (e.g. this tx is "slow"), I have also had pretty good luck with network tracing. It's very tedious, but when you aren't sure which tier is slow, starting up a network trace on a few machines and running a single tx usually gives a good idea of what the system is doing.
I have handled this decomposition a number of ways in the past. The first is at a very low level, using protocol analyzer dump data to find the time points where a conversation leaves tier X and enters tier Y. The second method is log examination across the various tiers. Something that can make your examination quite useful in this case is a common log server for all of your components (syslog, Rsyslog, etc.) and a nice log parsing tool, such as the freely available Microsoft Log Parser. The third method is use of the audit trail for an application stored in the database. You may find this when working on enterprise-service-bus-style applications, which have a consumer/producer model and a bus to pass information rather than a direct connection. The audit trails I have seen are typically stored in a database and allow the tracking of an individual transaction through the entire application infrastructure. Your load balancer, as a network device, may be out of the hunt on this one.
Note: if you go the protocol analyzer or log route, be sure to synchronize all of your source information devices to a common time server. Having one of your collectors (analyzer, app log) off on its timestamps can be a real hair-pulling experience when you get into the analysis phase.
As to how you move from your collected data into LoadRunner, that part is very mechanical. The Analysis program supports an interface to import external datapoints. The format is very specific and is documented in both help and the online docs. This import process works very well, as I often have to use it for collection of statistics from hosts which I do not have direct monitoring access to, but which need to be included as a part of the monitored test infrastructure.
James Pulley
Moderator (YahooGroups LoadRunner, Advanced-Loadrunner; GoogleGroups lr-LoadRunner; Linkedin LoadRunner, LoadRunnerByTheHour; SQAForums LoadRunner, WinRunner)
I am currently looking at some JProfiler traces from our WebSphere-based application, and am noticing that a significant amount of CPU time is being spent in the class com.ibm.io.async.AsyncLibrary.getCompletionData2.
I am guessing, but I am wondering whether this is PMI-related (and we do have this enabled).
My knowledge of PMI is limited, as this is managed by another team.
Is it expected that PMI can have this sort of impact?
(If so) Is the only option to turn it off completely? Or are there some types of data capture that have a particularly high overhead?
PMI has multiple levels that can be instrumented. The basic one should have minimal impact.
This particular class that you are referring to should not be related to PMI. I am only guessing here as these classes are not exposed publicly and they are used by the WAS runtime internally.
What version of WAS are you on? There were some known issues on WAS in this space.
For example, refer to this link:
PK41617 (Performance of the AIO library suffers under certain types of traffic flow): http://www-01.ibm.com/support/docview.wss?uid=swg1PK41617
Please check if this is applicable to your environment.
Also try to skim through this link:
Disabling AIO (Asynchronous Input/Output) in WebSphere Application Server: http://www-01.ibm.com/support/docview.wss?uid=swg21366862
You can probably give this a try if this is not a production environment and see if disabling AIO eliminates the spikes for you. Even if it works well, I would go through the formal IBM support process to find out the exact reasons before doing the same in a production environment.
HTH
Manglu