Identify Surge of API Processing Time in Datadog - spring

Hello, this is not code-related. I just want to ask how to debug the root cause of heavy processing time in our APIs.
Where do I look in Datadog?
Is it the Logs dashboard?
Or the Events dashboard?
I want to be able to identify why, between 8:00 AM and 8:30 AM, Datadog showed a spike in API processing time. Should I check our MongoDB logs?
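In case something concrete helps while digging: a minimal, hypothetical sketch of pulling an APM latency metric for that 8:00-8:30 window through Datadog's metrics query API, so you can correlate the spike with downstream calls such as MongoDB. The metric name, service tag, timestamps, and keys are all assumptions; substitute whatever your Spring app actually reports.

```
import requests

# Hypothetical sketch: query the Datadog metrics API for an APM latency
# metric over the 8:00-8:30 window. Metric name and service tag are
# assumptions; check APM > Services for the ones your app emits.
resp = requests.get(
    "https://api.datadoghq.com/api/v1/query",
    headers={
        "DD-API-KEY": "<api-key>",          # placeholder
        "DD-APPLICATION-KEY": "<app-key>",  # placeholder
    },
    params={
        "from": 1700035200,  # epoch seconds for 8:00 AM (example value)
        "to": 1700037000,    # epoch seconds for 8:30 AM (example value)
        "query": "avg:trace.servlet.request.duration{service:my-spring-app}",
    },
)
print(resp.json())
```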

Related

When NewRelic starts to collect metrics

This might be an unusual question, but I need to check whether my suspicions are correct. In our company we use NewRelic to monitor our applications. From time to time I check what NewRelic says about an app developed by me, and I always wonder why the average response time is much lower than in tests made manually by me or by external tools, e.g.:
the average response time of one endpoint is always somewhere around 130 ms in NewRelic metrics
when I test it manually it is about 230-250 ms
a tool used in our company that can make lots of requests over some period of time also reports an average response time of about 200 ms
(a similar difference, ~100 ms, is visible in other endpoints)
Those tests are made from a location in eastern Europe and our app is hosted in the UK, so we can assume a request needs about 40 ms to reach the servers and the same amount to come back. Another thing is that we have some infrastructure overhead, like load balancing and URL resolving, so that adds another "few" milliseconds. As you can see, when we add everything up we get the difference.
The question is: am I right? These are only my speculations, and I wasn't able to find a clear answer on where and when NewRelic starts to collect data when we look at the whole request path:
client ---request---> web-app ---response---> client
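As a rough sanity check of the question's own numbers (all values are the post's estimates, nothing measured):

```
# Rough sanity check using only the estimates from the question.
newrelic_server_side_ms = 130  # what NewRelic reports (application time only)
network_rtt_ms = 40 * 2        # ~40 ms each way, eastern Europe <-> UK
infra_overhead_ms = 20         # assumed load balancing + URL resolution

expected_client_side_ms = (newrelic_server_side_ms
                           + network_rtt_ms
                           + infra_overhead_ms)
print(expected_client_side_ms)  # 230 -- within the 230-250 ms measured manually
```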

Kibana - How to count the number of error logs and the types of errors

I monitor our team project's error logs in Kibana and report them, like: from yesterday to today there have been 50 errors, 20 of them IP authentication and 30 Host errors... or something like that.
I wanted to automate this process: counting the errors and their types and posting them to Slack (or something similar, like Microsoft Teams). I was looking at web scraping with Python to extract those error logs, but it doesn't quite look like what I'm looking for.
How would you go about this?
Build a Watcher for that.
Query your data by timeframe, do the aggregations by "error category" and count your numbers, schedule the Watcher to fire at whatever frequency you're comfortable with, and send the results directly to Slack (a connector is provided out of the box).
How to do it:
https://www.elastic.co/guide/en/elasticsearch/reference/current/watcher-api-put-watch.html
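For reference, a minimal sketch of such a watch submitted through that put-watch API, here via Python's requests. The index pattern, field names ("log.level", "error.category"), schedule, Slack channel, and credentials are all assumptions; the Slack action also needs an account configured in Elasticsearch beforehand.

```
import requests

# Sketch of a daily watch: count yesterday's errors per category and
# post the result to Slack. Field names and index pattern are assumed.
watch = {
    "trigger": {"schedule": {"interval": "24h"}},  # fire once a day
    "input": {
        "search": {
            "request": {
                "indices": ["logs-*"],
                "body": {
                    "size": 0,
                    "query": {
                        "bool": {
                            "filter": [
                                {"term": {"log.level": "error"}},
                                {"range": {"@timestamp": {"gte": "now-24h"}}},
                            ]
                        }
                    },
                    # Per-category counts end up in
                    # ctx.payload.aggregations.by_category.buckets
                    "aggs": {
                        "by_category": {"terms": {"field": "error.category"}}
                    },
                },
            }
        }
    },
    "actions": {
        "notify_slack": {
            "slack": {
                "message": {
                    "to": ["#monitoring"],
                    "text": "Errors in the last 24h: {{ctx.payload.hits.total}}",
                }
            }
        }
    },
}

resp = requests.put(
    "http://localhost:9200/_watcher/watch/daily-error-report",
    json=watch,
    auth=("elastic", "changeme"),  # placeholder credentials
)
print(resp.json())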

MlflowException: API request to ...URL... failed to return code 200 after 3 tries

I am currently trying to track my machine learning model metrics using the MLflow API in Azure Databricks.
I registered the experiment under my team's machine learning workspace and had tried a few metric-logging commands that worked, but they were only a test.
My notebook ran a for loop, logging metrics per calculation within the loop.
It took a while (3-5 seconds) before the error appeared.
I looked at the experiment, and it seems to have logged some of the for loop's metrics before crashing.
I'm not sure why this happens, and now the exception is thrown even for my earlier test calls that log metrics.
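For context, a minimal sketch of the pattern described above, but with the loop's metrics buffered and sent in one batch via MlflowClient.log_batch instead of one REST call per iteration; batching reduces pressure on the tracking API, though whether that is the root cause here is only an assumption. The experiment path and values are made up.

```
import time
import mlflow
from mlflow.tracking import MlflowClient
from mlflow.entities import Metric

# Hypothetical experiment path and values, for illustration only.
mlflow.set_experiment("/Shared/team-workspace/my-experiment")

with mlflow.start_run() as run:
    metrics = []
    for step, value in enumerate([0.9, 0.7, 0.5, 0.4]):
        # Buffer instead of calling mlflow.log_metric() per iteration;
        # each log_metric call is a separate request to the tracking server.
        metrics.append(Metric(key="loss", value=value,
                              timestamp=int(time.time() * 1000), step=step))
    # One REST call for the whole buffer (up to 1000 metrics per batch).
    MlflowClient().log_batch(run.info.run_id, metrics=metrics)
```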

Linkerd not displaying count of requests, success and failure rates

I am new to Linkerd and am trying to proxy all the requests to my microservices via Linkerd with file-based service discovery. I was able to do it successfully, and the requests got registered with the admin dashboard running on port 9990.
But my problem is that the dashboard always shows N/A for the success rate and failure rate. It becomes 100% for just a second after a request is received and then goes back to N/A. But I want to keep track of all my requests via Linkerd, i.e. I want Linkerd to remember the number of requests and the success and failure rates.
Here is a screenshot of my problem:
This question was answered on the Linkerd community forum. Adding the answer here as well for the sake of completeness:
The dashboard gives a current snapshot of what's going on: it polls /admin/metrics.json every second and displays the metrics at that moment (so, at that instant, how many requests, retries, and pending requests there are). If nothing is going through at that moment, those stats will be 0. For a longer-term view of the metrics, you'll need something else (see https://linkerd.io/getting-started/admin/index.html#metrics for more info on collecting metrics).
If you're on Kubernetes or DC/OS, you can also check out linkerd-viz. Hope that helps!
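If it's useful, a minimal sketch of that longer-term view: poll /admin/metrics.json yourself and keep the counters around, instead of relying on the dashboard's per-second snapshot. The host/port and the substring filter are assumptions; the exact metric names depend on your router configuration.

```
import time
import requests

# Poll the linkerd 1.x admin endpoint once per second and keep a running
# view of request/success/failure counters (cumulative since startup).
ADMIN_METRICS_URL = "http://localhost:9990/admin/metrics.json"

totals = {}
while True:
    for name, value in requests.get(ADMIN_METRICS_URL).json().items():
        if any(s in name for s in ("requests", "success", "failures")):
            totals[name] = value
    print(totals)
    time.sleep(1)
```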

JMeter - How to measure exact response time when the distance between client and server is large

I'm a beginner with JMeter and I have an issue with it: I run JMeter in Vietnam, test a server in the US, and use "View Results in Table" to view the results. In this report, I want to know how "Sample Time" is calculated. Is it the time the server took to respond, or the time at which the client received the response? And what is the effect when the distance between client and server is very large?
The sample's timestamp will be in Vietnam time.
But you can configure your JMeter instance to use a US timezone through a system property:
-Duser.timezone=
See:
Force Java timezone as GMT/UTC
Regarding response time, it will include the latency due to you being far from the US, but it will reflect what users in Vietnam will experience.
So if your requirement is to measure the US experience, you will need to load test from a US server; if your requirement is to measure the Vietnamese experience of a US-hosted application, then it's fine as it is.
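To make that concrete, an illustrative breakdown with made-up numbers (assumptions, not measurements):

```
# Illustrative numbers only, to show how distance inflates Sample Time.
server_processing_ms = 50  # time the US server actually spends on the request
network_rtt_ms = 200       # assumed round trip Vietnam <-> US

sample_time_ms = server_processing_ms + network_rtt_ms
print(sample_time_ms)  # 250 -- what "View Results in Table" would show from
                       # Vietnam; a US-based load generator would see ~50
```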
