I'm seeing some strange behaviour when I enable Datadog Continuous Profiling in my Node.js application.
Here is my dd-trace.init call:
require('dd-trace').init({
    enabled: true,
    env: process.env.NODE_ENV,
    profiling: true,
    logInjection: true
});
When profiling is false I see this kind of trace list:
When profiling is true I see this one:
Aside from the different colors of the traces, the main issue is that with profiling: true I see all the requests grouped under POST /graphql, whereas with profiling: false there is a separate GraphQL-specific service that lists all the transaction names, which is obviously very useful. I need profiling enabled to access a flame graph so I can analyze a memory leak.
Related
Since importing the net/http/pprof package is enough to enable the golang profiler in a program, my question is: is the profiler enabled by default for all golang programs, and does importing the package just expose the metrics endpoint? If that is the case, how could I disable the profiler to eliminate the performance overhead?
Is profiling enabled by default for all Go programs?
No.
is the profiler enabled by default for all golang programs, and does importing the package just expose the metrics endpoint?
No.
If that is the case, how could I disable the profiler to eliminate the performance overhead?
No, that's not the case.
If you look at the source code of pprof.go, you can see from its handlers that profiling, tracing, etc. run only when you issue a GET against one of its endpoints.
If you specify a duration, for example http://localhost:6060/debug/pprof/trace?seconds=10, the server will take 10 seconds to respond with the trace data. So profiling happens only when you call one of the endpoints.
You can find some examples in the first comment block of the pprof.go file.
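To make this concrete, here is a minimal sketch of the usual setup (the address is arbitrary): the blank import only registers the /debug/pprof/* handlers, and no profiling work happens until one of them receives a request.

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // side effect: registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
    // The import alone adds no steady-state overhead; a profile or trace is
    // collected only while a handler such as /debug/pprof/trace?seconds=10
    // is actively serving a GET.
    log.Println(http.ListenAndServe("localhost:6060", nil))
}

So to "disable" the profiler, simply don't expose the endpoints, or serve them on a separate, firewalled port as above.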
I have a number of Lambdas that are running with provisioned concurrency. They scale up properly when users are hitting our site, and everything works great there. Unfortunately, at night when nobody is hitting our site, it's not scaling back down. Our AWS bill is now a good $1-2k higher than it should be.
Here's what I see in the CloudWatch alarms when there are no users on the site. Specifically, it says Insufficient Data beside the alarm that should be scaling the provisioned concurrency down.
How do I configure my CloudWatch Auto Scaling alarms so that they scale down properly? There are no requests on the site at all.
I also recorded this Loom video, in case that helps.
Here's my serverless.yml configuration that I'm using to create the Lambda (using the serverless-provisioned-concurrency-autoscaling plugin):
home_page:
  handler: homePage/home_page_handler.get
  memorySize: 3072 # This is high just to speed up page load times
  events:
    - http:
        path: homePage
        method: get
        cors: true
        authorizer: ${file(system/restAPI/function_authorizer.yml)}
  layers:
    - ${cf:layers-${opt:stage}.CommonLibsLambdaLayerExport}
  package:
    patterns:
      - reports/**
  provisionedConcurrency: 1
  concurrencyAutoscaling: true
Halp!
I think I've figured this out. Provisioned concurrency alarms have a setting for what to do when there is "Insufficient data" (i.e. no data at all).
The setting is way down in the metric configuration (open the Advanced configuration section).
I had this set to Treat missing data as missing, and it needs to be set to Treat missing data as bad (breaching threshold) for the Lambda provisioned concurrency to scale down when there are no users.
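For reference, here is a rough sketch of what the corrected scale-in alarm would look like if you created it by hand with the aws-sdk-go-v2 CloudWatch client. The alarm name, threshold, periods, and dimension value are made-up placeholders; the one setting that matters here is TreatMissingData set to breaching.

package main

import (
    "context"
    "log"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/cloudwatch"
    "github.com/aws/aws-sdk-go-v2/service/cloudwatch/types"
)

func main() {
    cfg, err := config.LoadDefaultConfig(context.TODO())
    if err != nil {
        log.Fatal(err)
    }
    client := cloudwatch.NewFromConfig(cfg)

    // Alarm name, threshold, periods, and dimension value are illustrative only.
    _, err = client.PutMetricAlarm(context.TODO(), &cloudwatch.PutMetricAlarmInput{
        AlarmName:  aws.String("home-page-provisioned-concurrency-scale-in"), // hypothetical
        Namespace:  aws.String("AWS/Lambda"),
        MetricName: aws.String("ProvisionedConcurrencyUtilization"),
        Dimensions: []types.Dimension{
            {Name: aws.String("FunctionName"), Value: aws.String("home-page")}, // hypothetical
        },
        Statistic:          types.StatisticMaximum,
        Period:             aws.Int32(60),
        EvaluationPeriods:  aws.Int32(15),
        Threshold:          aws.Float64(0.5),
        ComparisonOperator: types.ComparisonOperatorLessThanThreshold,
        // The key line: with zero traffic the metric emits no datapoints, so
        // missing data must count as breaching for the scale-in alarm to fire.
        TreatMissingData: aws.String("breaching"),
    })
    if err != nil {
        log.Fatal(err)
    }
}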
In my case, I've opened up a ticket with the serverless-provisioned-concurrency-autoscaling plugin to get this fixed, since I don't control it myself.
I conducted performance testing on an e-commerce website and I have the test results with some metrics. I have already found problems with some components, for example high response times and errors on checkout or post-login. But I would also like to find the issues that are limiting the application's ability to scale. I only tested the application server, and I observed that CPU and I/O rates are very stable, yet the application still gives high response times. Is there any other way I can determine from the test results why it is not scaling well? Thanks!
From the JMeter test results alone: unlikely. JMeter just sends requests, waits for the responses, and measures the time in between, plus collects some extra metrics like connect time and latency; see the JMeter Glossary for the full list with explanations.
An integrated system runs at the speed of its slowest component, so possible causes include:
Network issues (e.g. lack of bandwidth, a faulty router, long DNS resolution times, etc.)
Your application is not properly configured for high loads. Inspect the current setup of the application in terms of thread pools, maximum number of open connections, any limitations on resource usage, etc. (see the sketch after this list), and look for documentation on performance tuning of the individual middleware components as well.
Repeat your test run with profiler tool telemetry enabled, or look at the APM tool output for the test time frame if such a tool is in place. It will allow you to perform a deep dive into what's going on under the hood of this or that function call, as the culprit might be an inefficient algorithm or a slow database query.
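As a sketch of the second point, assuming (purely for illustration) a Go backend, these are the kinds of limits worth auditing when CPU and I/O look healthy but response times still climb under load:

package main

import (
    "database/sql"
    "log"
    "net/http"
    "time"
)

// tune shows the kinds of knobs to audit; all values are illustrative only.
func tune(db *sql.DB) *http.Server {
    // Database connection pool: an undersized pool serializes queries and
    // shows up as latency rather than CPU usage.
    db.SetMaxOpenConns(100)
    db.SetMaxIdleConns(50)
    db.SetConnMaxLifetime(5 * time.Minute)

    // HTTP server limits: slow clients holding connections can exhaust
    // workers long before the CPU is busy.
    return &http.Server{
        Addr:              ":8080",
        ReadHeaderTimeout: 5 * time.Second,
        WriteTimeout:      30 * time.Second,
        IdleTimeout:       60 * time.Second,
    }
}

func main() {
    // "mysql" and the DSN are placeholders; a real program would also import
    // the matching driver package to register it.
    db, err := sql.Open("mysql", "user:pass@tcp(localhost:3306)/shop")
    if err != nil {
        log.Fatal(err)
    }
    log.Fatal(tune(db).ListenAndServe())
}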
I'm kind of new to the pprof tool and am wondering whether it's OK to keep it running in production. From the articles I have seen, it seems to be OK and standard practice, but I'm confused as to how this does not affect performance, given that it samples N times every second, and why that doesn't lead to a degradation in performance.
Jaana Dogan does say in her article "Continuous Profiling of Go programs":
Profiling in production
pprof is safe to use in production.
We target an additional 5% overhead for CPU and heap allocation profiling.
The collection is happening for 10 seconds for every minute from a single instance. If you have multiple replicas of a Kubernetes pod, we make sure we do amortized collection.
For example, if you have 10 replicas of a pod, the overhead will be 0.5%. This makes it possible for users to keep the profiling always on.
We currently support CPU, heap, mutex and thread profiles for Go programs.
Why?
Before explaining how you can use the profiler in production, it would be helpful to explain why you would ever want to profile in production. Some very common cases are:
Debug performance problems only visible in production.
Understand the CPU usage to reduce billing.
Understand where the contention cumulates and optimize.
Understand the impact of new releases, e.g. seeing the difference between canary and production.
Enrich your distributed traces by correlating them with profiling samples to understand the root cause of latency.
So if you are using pprof for the right reason, yes, you can leave it in production.
But for basic monitoring, as commented, standard system metrics are enough.
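For intuition about why the overhead stays low, here is a minimal sketch of the "10 seconds out of every minute" collection pattern the quote describes; upload is a hypothetical stand-in for shipping the profile to wherever you store it.

package main

import (
    "bytes"
    "log"
    "runtime/pprof"
    "time"
)

// collectLoop sketches the amortized collection pattern: profile for 10
// seconds, then stay idle for the remaining 50 seconds of each minute.
func collectLoop(upload func([]byte)) {
    for {
        var buf bytes.Buffer
        if err := pprof.StartCPUProfile(&buf); err != nil {
            log.Println("profiler busy:", err) // e.g. another profile in flight
        } else {
            time.Sleep(10 * time.Second)
            pprof.StopCPUProfile()
            upload(buf.Bytes())
        }
        time.Sleep(50 * time.Second)
    }
}

func main() {
    collectLoop(func(p []byte) {
        log.Printf("collected %d bytes of CPU profile", len(p))
    })
}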
As noted in "Continuous Profiling and Go" by Vladimir Varankin
Depending on the state of the infrastructure in the company, an “unexpected” HTTP server inside the application’s process can raise questions from your systems operations department ;)
At the same time, depending on the peculiar nature of a company, the very ability to access something inside a production application, that doesn't directly relate to application's business logic, can raise questions from the security department ;))
So overhead is not the only criterion to consider before leaving such a feature active.
Being new to systematic debugging, I asked myself what these three terms mean:
Debugging
Profiling
Tracing
Could anyone provide definitions?
Well... as I was typing the tags for my question, it turned out that Stack Overflow had already defined the terms in the tag descriptions. Here are their definitions, which I found very good:
Remote debugging is the process of running a debug session in a local development environment attached to a remotely deployed application.
Profiling is the process of measuring an application or system by running an analysis tool called a profiler. Profiling tools can focus on many aspects: functions call times and count, memory usage, cpu load, and resource usage.
Tracing is a specialized use of logging to record information about a program's execution.
In addition to the answer from Samuel:
Debugging is the process of looking for bugs and their causes in applications. A bug can be an error or just some unexpected behaviour (e.g. a user complains that they receive an error when they use an invalid date format). Typically a debugger is used that can pause the execution of the application, examine variables, and manipulate them.
Profiling is a dynamic analysis process that collects information about the execution of an application. The type of information collected depends on your use case, e.g. the number of requests. The result of profiling is a profile with the collected information. The source for a profile can be exact events (see tracing below) or a sample of the events that occurred.
Because the data is aggregated in a profile, it is irrelevant when and in which order the events happened.
Tracing "trace is a log of events within to the program"(Whitham). those events can be ordered chronologically. thats why they often contain a timestamp. Tracing is the process of generating and collecting those events. the use case is typically flow analysis.
example tracing vs profiling:
Trace:
[2021-06-12T11:22:09.815479Z] [INFO] [Thread-1] Request started
[2021-06-12T11:22:09.935612Z] [INFO] [Thread-1] Request finished
[2021-06-12T11:22:59.344566Z] [INFO] [Thread-1] Request started
[2021-06-12T11:22:59.425697Z] [INFO] [Thread-1] Request finished
Profile:
2 "Request finished" Events
2 "Request started" Events
So if tracing and profiling measure the same events, you can construct a profile from a trace, but not the other way around (see the sketch below).
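A small sketch makes the asymmetry obvious: aggregating the trace above into a profile discards the timestamps and ordering, so the reverse reconstruction is impossible.

package main

import "fmt"

func main() {
    // The trace from the example above, reduced to its event names.
    trace := []string{
        "Request started", "Request finished",
        "Request started", "Request finished",
    }

    // Aggregating into a profile keeps only the counts; timestamps, thread
    // IDs, and ordering are gone, so the trace cannot be rebuilt from it.
    profile := map[string]int{}
    for _, event := range trace {
        profile[event]++
    }
    for event, count := range profile {
        fmt.Printf("%d %q events\n", count, event)
    }
}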
Sources:
Whitham: https://www.jwhitham.org/2016/02/profiling-versus-tracing.html
IPM: http://ipm-hpc.sourceforge.net/profilingvstracing.html