Linkerd not displaying count of requests, success and failure rates - microservices

I am new to linkerd and trying to proxy all requests to my microservices via linkerd, using file-based service discovery. I was able to set it up successfully, and the requests got registered with the admin dashboard running on port 9990.
But my problem is that the dashboard always shows N/A for the success rate and failure rate. It becomes 100% for just a second after a request is received and then goes back to N/A. But I want to keep track of all my requests via linkerd, i.e. I want linkerd to remember the number of requests and the success and failure rates.
Here is a screenshot of my problem:

This question was answered on the Linkerd community forum. Adding the answer here as well for the sake of completeness:
The dashboard gives a current snapshot of what's going on: it polls /admin/metrics.json every second and displays the metrics at that instant (how many requests, retries, and pending requests there are right then), so if nothing is going through at that moment, those stats will be 0. For a longer-term view of metrics, you'll need something else (see https://linkerd.io/getting-started/admin/index.html#metrics for more info on collecting metrics).
If you're on Kubernetes or DC/OS, you can also check out linkerd-viz. Hope that helps!
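For a rough idea of what a longer-term collector could look like, here is a minimal sketch that polls the admin endpoint and keeps a running history. It assumes the default admin port 9990 and that the endpoint returns a flat JSON object of metric names to values, as linkerd 1.x does; adapt the names to whatever counters your router actually exposes.

    import json
    import time
    import urllib.request

    # Default linkerd admin port, as in the question.
    ADMIN_METRICS_URL = "http://localhost:9990/admin/metrics.json"

    def poll_metrics(interval_secs=1, samples=60):
        """Poll the admin metrics endpoint and accumulate snapshots."""
        history = []
        for _ in range(samples):
            with urllib.request.urlopen(ADMIN_METRICS_URL) as resp:
                history.append(json.loads(resp.read()))
            time.sleep(interval_secs)
        return history

    if __name__ == "__main__":
        snapshots = poll_metrics()
        # Each snapshot maps metric names to values; sum or diff whichever
        # request/success counters your router exposes.
        print(f"collected {len(snapshots)} snapshots")

In practice you would ship these snapshots to a time-series store (as the linked docs suggest) rather than keep them in memory.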

Related

When NewRelic starts to collect metrics

This might be an unusual question, but I have to be sure whether my suspicions are correct. In our company we use NewRelic to monitor our applications. From time to time I check what NewRelic says about an app developed by me, and I'm always wondering why the average response time is much lower than in tests made manually by me or by some external tools, e.g.:
the average response time of some endpoint is always somewhere around 130 ms in NewRelic metrics
when I test it manually, it is somewhere around 230-250 ms
some tools used in our company, which can make lots of requests over some period of time, also claim that the average response time is around 200 ms
(a similar difference of ~100 ms is visible on other endpoints)
Those tests are made from a location in Eastern Europe, and our app is hosted in the UK, so we can assume a request needs about 40 ms to reach the servers and the same amount to come back. Another thing is that we have some infrastructure overhead like load balancing and URL resolution, which can add another few milliseconds. As you can see, when we add everything up, we get the difference.
The question is: am I right? These are only my speculations, and I wasn't able to find a clear answer on where and when NewRelic starts to collect data when we look at the whole request path:
client ---request---> web-app ---response---> client
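One way to test that speculation is to measure the client-observed time yourself and subtract the server-side figure the APM reports; the remainder is roughly network round trip plus infrastructure overhead. A minimal sketch, with a hypothetical URL and the 130 ms NewRelic figure hard-coded for illustration:

    import time
    import urllib.request

    URL = "https://example.com/api/endpoint"  # hypothetical endpoint

    def client_observed_ms(url, runs=10):
        """Average wall-clock latency as seen from the client."""
        total = 0.0
        for _ in range(runs):
            start = time.perf_counter()
            urllib.request.urlopen(url).read()
            total += time.perf_counter() - start
        return total / runs * 1000

    if __name__ == "__main__":
        observed = client_observed_ms(URL)
        server_side_ms = 130  # e.g. the figure NewRelic reports
        print(f"client-observed: {observed:.0f} ms")
        print(f"network + infra overhead: ~{observed - server_side_ms:.0f} ms")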

Google API Key giving Query over limit error

We have a web application which was working fine until yesterday. But since yesterday afternoon, all the keys in one of our projects in the Google API Console started giving the OVER_QUERY_LIMIT error.
We cross-checked that the quotas for that project and API are not yet exhausted. Can anybody help me understand what may have caused this?
Even after a day's use, the API keys are still giving the same error.
To give more information: we are using the Geocoding API and the Distance Matrix API in our application.
If you exceed the usage limits you will get an OVER_QUERY_LIMIT status code as a response. This means that the web service will stop providing normal responses and switch to returning only status code OVER_QUERY_LIMIT until more usage is allowed again. This can happen:
Within a few seconds, if the error was received because your application sent too many requests per second.
Within the next 24 hours, if the error was received because your application sent too many requests per day. The daily quotas are reset at midnight, Pacific Time.
This screencast provides a step-by-step explanation of proper request throttling and error handling, which is applicable to all web services.
Upon receiving a response with status code OVER_QUERY_LIMIT, your application should determine which usage limit has been exceeded. This can be done by pausing for 2 seconds and resending the same request. If the status code is still OVER_QUERY_LIMIT, your application is sending too many requests per day. Otherwise, your application is sending too many requests per second.
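A minimal sketch of that pause-and-retry diagnostic, assuming a Geocoding request with a hypothetical address and placeholder API key (real code would also need error handling and backoff):

    import json
    import time
    import urllib.request

    # Hypothetical Geocoding request; substitute your own key and parameters.
    URL = ("https://maps.googleapis.com/maps/api/geocode/json"
           "?address=London&key=YOUR_API_KEY")

    def request_status(url):
        """Return the 'status' field from the web service's JSON response."""
        with urllib.request.urlopen(url) as resp:
            return json.loads(resp.read())["status"]

    def diagnose_over_query_limit(url):
        """Distinguish the per-second limit from the daily limit."""
        if request_status(url) != "OVER_QUERY_LIMIT":
            return "ok"
        time.sleep(2)  # pause for 2 seconds, then resend the same request
        if request_status(url) == "OVER_QUERY_LIMIT":
            return "too many requests per day"
        return "too many requests per second"

    if __name__ == "__main__":
        print(diagnose_over_query_limit(URL))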
Note: It is also possible to get the OVER_QUERY_LIMIT error:
From the Google Maps Elevation API when more than 512 points per request are provided.
From the Google Maps Distance Matrix API when more than 625 elements per request are provided.
Applications should ensure these limits are not reached before sending requests.
See the documentation on usage limits.

Azure Performance - Ping Test - Inconsistent values between Availability and Performance

In Azure you can create a simple ping test of your app. It's called a ping, but it is actually a GET request to a URL.
By default, the URL is your root URL.
The thing is, the response times in these results are in the range of 2 to 10 ms. However, I can never reach these response times, not even with Fiddler or Postman; my range is more like 100 to 400 ms. And I'm closer to the datacenter than the computers running the ping tests in Azure.
It is a bit as if the ping tests were not downloading the page content.
Does anyone know?
UPDATE
I have set up my ping test in the Availability section. The response times I mention above appear in the Performance section. Back in the Availability section, the average response time is 1.6 sec. These two sections show inconsistent values.
UPDATED ANSWER:
The Performance section lists how long it took from your server receiving the request to sending something back to the client; it doesn't count network latency at all.
I believe they only check the response status without downloading the content if you don't require a content match.
Below is an example of the configuration for my blog.
If you wish, you can make sure the test downloads the content by ticking the Content match checkbox and specifying text that the content must contain, taken from somewhere near the end of your index page (like in a footer).
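To get a feel for the difference, you can compare a status-only check against a full download from your own machine. A rough sketch with a hypothetical URL; exact timings depend heavily on page size and network conditions:

    import time
    import urllib.request

    URL = "https://example.com/"  # hypothetical site

    def time_status_only(url):
        """Read only the status line and headers, then close the connection."""
        start = time.perf_counter()
        resp = urllib.request.urlopen(url)
        status = resp.status
        resp.close()  # close without reading the body
        return status, (time.perf_counter() - start) * 1000

    def time_full_download(url):
        """Read the entire response body, like a real browser fetch."""
        start = time.perf_counter()
        with urllib.request.urlopen(url) as resp:
            body = resp.read()
        return len(body), (time.perf_counter() - start) * 1000

    if __name__ == "__main__":
        status, t1 = time_status_only(URL)
        size, t2 = time_full_download(URL)
        print(f"status-only:   HTTP {status} in {t1:.0f} ms")
        print(f"full download: {size} bytes in {t2:.0f} ms")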

Time reported in WILY is much lower than the LoadRunner time. Why?

I am trying to monitor the time spent on the server using WILY Introscope, but I observe that the time reported in WILY for each of the servers is in the range of 100 to 1000 ms, while the time taken for a page to load in the browser is almost 5 seconds.
Why is the tool reporting an incorrect value? How do I get the complete time in WILY?
the time reported in WILY for each of the servers is in the range of 100 to 1000 ms, while the time taken for a page to load in the browser is almost 5 seconds.
The reason is: in the browser, you see all the outgoing traffic from the browser. Typically, a web page involves one POST request followed by multiple GET requests. The POST could be your text/html data, while the GETs could be images, CSS, JavaScript, etc.
Mostly these GET requests would be answered by the web server, while the POST request would be served by involving the app server.
The time reported in WILY is only the time spent on the server serving the POST request; your GET requests will not be captured by WILY.
Why is the tool reporting an incorrect value? How do I get the complete time in WILY?
The tool is not reporting an incorrect value. The tool typically sits on a JVM, so it monitors the activity of the JVM and provides those metrics. That is the expected behavior.
A page is a complex item, requiring parsing of the page contents and then requests to multiple servers/sources. So your page load time will be made up of the request time for each individual component, processing time for page parsing and JavaScript (depending upon virtual user type), requests for the page components, where they are served from, etc. Compare this to your Wily monitoring, which may only cover one of the tiers involved.
For instance, you may have static components being served from a CDN, which has zero visibility in your Wily model. You might also be looking at your app server when the majority of the time is spent serving static components off a web server, which is often ignored from a monitoring perspective. Your page could have third-party components whose loading gets counted in the LoadRunner time but not in the Wily time.
It all comes down to a question of sampling. It is very common for what you see in your deep-diagnostics tool to be one piece of the total page load, or an individual request that makes up a page with many more components to be loaded. If you want an even more interesting look, enable the W3C time-taken field in your web HTTP request logs and look at the cost of every individual request; you can do this at the web layer of your app servers as well. Wily will then provide an internal breakdown for those items which are "slow."
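As an illustration of that last suggestion, once time-taken is enabled (for example in IIS W3C extended logs), a small script can rank the slowest individual requests. A rough sketch, assuming a log whose #Fields: header includes cs-uri-stem and time-taken (in milliseconds); the file name is hypothetical:

    from pathlib import Path

    def slowest_requests(log_path, top_n=10):
        """Rank requests in a W3C extended log by time-taken (ms)."""
        fields, rows = [], []
        for line in Path(log_path).read_text().splitlines():
            if line.startswith("#Fields:"):
                fields = line.split()[1:]  # field names follow "#Fields:"
            elif line and not line.startswith("#") and fields:
                row = dict(zip(fields, line.split()))
                rows.append((int(row["time-taken"]), row["cs-uri-stem"]))
        return sorted(rows, reverse=True)[:top_n]

    if __name__ == "__main__":
        for ms, uri in slowest_requests("u_ex230101.log"):  # hypothetical file
            print(f"{ms:>6} ms  {uri}")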

Possible explanation of sudden spike in memcached get

Newbie in NewRelic here. I have an API service hosted on Heroku and monitored with NewRelic.
While I was studying how to use NewRelic, I found out my 2 workers were being underutilised, with very low RPM and low transaction time, so I decided to cut down to one worker, which saves me $36 a month. =]
Shortly after that I received tonnes of LogEntries emails reporting request timeouts on one of my web dynos. Looking into NewRelic, I found that one of my actions was being called a suspiciously high number of times for 2-3 minutes.
The action is V1::CarsController#Index, which basically shows a collection of cars.
While I am not sure whether removing one worker dyno has caused memcached to do something, I also suspect that someone may be trying to scrape the data from the database. I am not too sure how to investigate the issue further. I wonder if I can track down the request IPs and see whether they are the same (see the sketch below)? Or how else can I investigate?
If further information is needed, I am happy to provide it in edits!
Thanks
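One low-effort way to check the same-IP suspicion is to count client IPs per endpoint in the Heroku router logs, where the fwd= field carries the client IP. A rough sketch; the log file name is hypothetical, and the path pattern assumes the cars index is served under /v1/cars:

    import re
    from collections import Counter

    # fwd= carries the client IP in Heroku router log lines;
    # the path pattern assumes the cars index lives under /v1/cars.
    FWD_RE = re.compile(r'fwd="?([\d.]+)')
    PATH_RE = re.compile(r'path="?/v1/cars')

    def top_ips(log_path, top_n=10):
        """Count client IPs hitting the suspect endpoint."""
        counts = Counter()
        with open(log_path) as log:
            for line in log:
                if PATH_RE.search(line):
                    match = FWD_RE.search(line)
                    if match:
                        counts[match.group(1)] += 1
        return counts.most_common(top_n)

    if __name__ == "__main__":
        for ip, hits in top_ips("router.log"):  # hypothetical file
            print(f"{hits:>6}  {ip}")

If one or a few IPs dominate the counts during the spike window, that points towards scraping rather than a memcached side effect.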
