How to monitor HTTP status codes with serverless datadog plugin - aws-lambda

I am using serverless-plugin-datadog, which uses datadog-lambda-layer under the hood.
The docs state that, by using this plugin, it is no longer necessary to wrap a handler. That is, by the way, the main reason I decided to go for it.
The lambda itself is a REST API, which responds with dedicated status codes.
My question now is: how can I monitor the number of 4xx and 5xx HTTP status codes? Do I have to define custom metrics in Datadog for this to work? I was under the assumption that the plugin comes with that data out of the box, but it looks like I'm missing an important part here.

but it looks like I'm missing an important part here.
That was the point. The lambda itself doesn't have much to do with particular status codes. So I could either log each status code and let Datadog parse it accordingly,
or, and that is the solution I went for, leverage API Gateway to monitor status codes per lambda.
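If you do go the custom-metric route instead, here is a minimal sketch of what that could look like, assuming a Python handler and the datadog_lambda package that the Lambda layer provides; the metric name, tags and do_work helper are purely illustrative:

from datadog_lambda.metric import lambda_metric

def handler(event, context):
    status_code = 500
    try:
        body = do_work(event)  # do_work is a placeholder for your business logic
        status_code = 200
        return {"statusCode": status_code, "body": body}
    finally:
        # One count per invocation, tagged with the status code, so you can
        # graph and alert on 4xx/5xx in Datadog.
        lambda_metric("myapp.http.response", 1, tags=[f"status_code:{status_code}"])

Note that metrics submitted this way are custom metrics on the Datadog side, which is presumably part of why the API Gateway route above avoids them.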

Related

Disallow queuing of requests in gRPC microservices

Setup:
We have gRPC pods running in a k8s cluster. The service mesh we use is linkerd. Our gRPC microservices are written in Python (asyncio gRPC as the concurrency mechanism), with the exception of the entry point. That microservice is written in golang (using the gin framework). We have an AWS API GW that talks to an NLB in front of the golang service. The golang service communicates with the backend via NodePort services.
Requests to our gRPC Python microservices can take a while to complete. The average is 8s, up to 25s at the 99th percentile. In order to handle the load from clients, we've scaled horizontally and spawned more pods to handle concurrent requests.
Problem:
When we send multiple requests to the system, even sequentially, we sometimes notice that requests go to the same pod as an ongoing request. What can happen is that this new request ends up getting "queued" in the server-side (not fully "queued", some progress gets made when context switches happen). The issue with queueing like this is that:
The earlier requests can start getting starved and eventually time out (we have a hard 30s cap from API GW).
The newer requests may also not get handled on time, and as a result get starved.
The symptom we're noticing is 504s which are expected from our hard 30s cap.
What's strange is that we have other pods available, but for some reason the load balancer isn't routing requests to those pods smartly. It's possible that linkerd's smarter load balancing doesn't work well for our high-latency situation (we need to look into this further, however that will require a big overhaul to our system).
One thing I wanted to try is to stop this queuing up of requests. I want the service to immediately reject the request if one is already in progress, and have the client (meaning the golang service) retry. The client retry will hopefully hit a different pod (do let me know if that won't happen). In order to do this, I set "maximum_concurrent_rpcs" to 1 on the server side (Python server). When I sent multiple requests in parallel to the system, I didn't see any RESOURCE_EXHAUSTED exceptions (even under the condition where there is only 1 server pod). What I do notice is that the requests no longer happen in parallel on the server; they happen sequentially (I think that's a step in the right direction, since the first request doesn't get starved). That said, I'm not seeing the RESOURCE_EXHAUSTED error in golang, and I do see a delay between the entry time in the golang client and the entry time in the Python service. My guess is that the queuing is now happening client-side (or potentially still server-side, but it's not visible to me)?
I then saw online that it may be possible for requests to get queued up on the client side as a default behavior in HTTP/2. I tried to test this out in a custom Python client that mimics the golang one with:
import grpc

# Attempt to cap the channel at a single concurrent HTTP/2 stream so that
# extra requests fail fast instead of queueing on one connection.
channel = grpc.insecure_channel(
    "<some address>",
    options=[("grpc.max_concurrent_streams", 1)],
)
# create stub to server with channel…
However, I'm not seeing any change here either. (Note: this is a test dummy client; eventually I'll need to make this run in golang. Any help there would be appreciated as well.)
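For reference, a minimal sketch of the intent described above: cap the Python server at one in-flight RPC so extra requests should be rejected with RESOURCE_EXHAUSTED, and have the client retry on that code. The servicer/stub names, method name and address are placeholders, not the actual services from this setup:

import asyncio
import grpc

# Server side: reject rather than queue.
async def serve():
    # Per the grpc docs, requests beyond maximum_concurrent_rpcs are answered
    # with RESOURCE_EXHAUSTED instead of being serviced.
    server = grpc.aio.server(maximum_concurrent_rpcs=1)
    # add_EchoServicer_to_server(EchoServicer(), server)  # generated from your proto
    server.add_insecure_port("[::]:50051")
    await server.start()
    await server.wait_for_termination()

# Client side: retry when the server reports it is busy.
def call_with_retry(stub, request, attempts=3):
    for attempt in range(attempts):
        try:
            return stub.Handle(request, timeout=30)  # Handle is a placeholder method
        except grpc.RpcError as err:
            if err.code() != grpc.StatusCode.RESOURCE_EXHAUSTED or attempt == attempts - 1:
                raise
            # A retry over the same channel may well reuse the same HTTP/2
            # connection, i.e. the same pod; whether it lands elsewhere
            # depends on the load balancer (linkerd) in between.

Whether the retry reaches a different pod is decided by the balancing layer, not by gRPC itself, which is why the server-side rejection alone may not be enough here.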
Questions:
How can I get the desired effect here? Meaning: the server returns RESOURCE_EXHAUSTED if it is already handling a request, the golang client retries, and the retry hits a different pod?
Any other advice on how to fix this issue? I'm grasping at straws here.
Thank you!

How does the serverless Datadog forwarder encrypt/encode its logs?

I am having trouble figuring out how the Datadog forwarder encodes/encrypts its messages. We are using the forwarder as described in the following documentation: https://docs.datadoghq.com/serverless/forwarder/ . On that page, Datadog has an option to send the same event to another lambda that it invokes via the AdditionalTargetLambdaARNs flag. We are doing this and the other lambda is invoked, but the event input we get is a long string that looks base64 encoded; when I put it into a base64 decoder, I get gibberish back. I was wondering if anyone knew how Datadog is compressing/encoding/encrypting the data/logs it sends, so that I can read the logs in my lambda and perform actions on the forwarded data? I have been searching Google and the Datadog site for documentation on this but I can't find any.
It looks like Datadog uses zstd compression in order to compress its data before sending it: https://github.com/DataDog/datadog-agent/blob/972c4caf3e6bc7fa877c4a761122aef88e748b48/pkg/util/compression/zlib.go
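In case it helps, a small sketch of decoding such a payload, assuming it turns out to be base64-encoded zlib/gzip data (the linked file is the agent's zlib compression code, so that is a reasonable first try); if the forwarder really uses zstd, you would need the third-party zstandard package instead of the standard library:

import base64
import json
import zlib

def decode_forwarded_payload(payload: str):
    raw = base64.b64decode(payload)
    # wbits=MAX_WBITS | 32 lets zlib auto-detect zlib and gzip headers.
    decompressed = zlib.decompress(raw, zlib.MAX_WBITS | 32)
    return json.loads(decompressed)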

How to trace a request through a chain of microservices end-to-end?

I am using OpenCensus in Go to push tracing data to Stackdriver for calls involving a chain of 2 or more microservices, and I noticed that I get many traces which contain spans only for certain services, not for the entire end-to-end call.
At the moment I attribute this to the fact that not all calls are traced (only a certain sample) and each service decides whether to trace its current span or not.
Is this the way it is intended to work? Is there any way to make sure when a trace is sampled, it is done so by all services in the call chain?
Architecturally, I would say: when you are developing your microservices, make sure your API gateway creates a unique ID (such as a GUID) that gets propagated through all the microservices. Similarly, make sure you have a log aggregator collecting the logs from all the services. That way you get nice traceability for each request.
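As a rough sketch of that idea (the header name, URL, and handle function are just examples, not part of the setup in the question), each service reuses an incoming correlation ID if present, mints one at the edge otherwise, forwards it on every downstream call, and includes it in every log line:

import logging
import uuid
import requests

REQUEST_ID_HEADER = "X-Request-ID"  # example header name

def handle(incoming_headers: dict):
    # Reuse the caller's ID so every hop in the chain shares one value.
    request_id = incoming_headers.get(REQUEST_ID_HEADER, str(uuid.uuid4()))
    logging.info("handling request", extra={"request_id": request_id})
    # Forward the same header to the next service in the chain.
    return requests.get("http://next-service/api",
                        headers={REQUEST_ID_HEADER: request_id})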

Handling remote api validation errors in service layer

Imagine that there is some manager class that talks to a remote service, for example a user microservice that can create new and update existing user profiles. This manager class is used everywhere in the code: in controllers and other classes. Before talking to the remote service, our manager class doesn't know whether the submitted DTO is valid. The question is: if the remote service returns validation errors, what to do next? How should these errors be handled? I've thought about it and have some options:
Throw an Exception when validation fails
Pass an Errors object that collects validation errors to the manager
Make a getLastErrors() method in the manager class
Maybe other, better solutions exist?
P.S. Suppose that the remote service returns errors in JSON format; it doesn't matter whether it's a JSON-RPC, SOAP, or REST microservice.
Unless you want to translate service errors into something different, or even handle them to make certain decisions in the client tier, service errors are usually formatted in a human-readable way to be shown in the UI, so the user knows what went wrong.
On the other hand, if there's no UI, there should be a logger. As you would do in a UI layer, you would format those errors and log them to a file or any other storage approach.
Also, you might want to learn more about the fail-fast concept:
In systems design, a fail-fast system is one which immediately reports
at its interface any condition that is likely to indicate a failure.
Fail-fast systems are usually designed to stop normal operation rather
than attempt to continue a possibly flawed process. Such designs often
check the system's state at several points in an operation, so any
failures can be detected early. A fail-fast module passes the
responsibility for handling errors, but not detecting them, to the
next-highest level of the system.
The OP commented:
If validation errors are returned from the microservice, what manager
class should do then? Throw an Exception or put these errors in some
field in it's class?
Regarding this concern, I've arrived at the conclusion that the entire flow should pass through a specialized DTO that I've called an accumulated result (check the full description):
Represents a multi-purpose, orthogonal entity which transports both
results of a called operation and also useful information for the
callers like a status, status description and details about the actual
result of the whole operation.
That way, even in multi-tier architectures, each tier/layer can either add more info to the accumulated result or take decisions.
Probably some may argue that you should throw exceptions, but I don't consider a broken rule an exception; it's an expected use case.
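A minimal Python sketch of such an accumulated result, with purely illustrative field and method names:

from dataclasses import dataclass, field

@dataclass
class AccumulatedResult:
    status: str = "ok"                            # e.g. "ok", "validation_error", "failure"
    messages: list = field(default_factory=list)  # human-readable details for the caller/UI
    payload: object = None                        # the actual result, when there is one

    def add_error(self, message: str):
        self.status = "validation_error"
        self.messages.append(message)

# The manager fills it from the remote response and hands it back; each
# tier can add context or make decisions without relying on exceptions.
result = AccumulatedResult()
result.add_error("email: must not be empty")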
how to handle validation errors from remote service?
Return the relevant HTTP status code, along with as much information as is necessary (sometimes none) in the response body.
It is not important if it's SOAP or RESTful, imagine that JSON
response is returned
The type of service will determine your failure-handling approach. For SOAP services, you should return a SOAP fault.

How can I implement Pre- and Post-Commit Hooks in Riak?

There is but scant information on the web as to how to actually implement these features of Riak besides this blog post and a few others. Are any client libraries (ripple etc.) capable of receiving messages via the hook so that working with the changed data in the app (i.e. outside of Riak) becomes possible? Thanks.
It's not possible to have Riak call back into your application; however, if you use the "returnbody" option when storing, you'll get back the value that was actually stored, as modified by pre-commit hooks.
Post-commit hooks are run asynchronously after the object is stored and so should not be used to modify the stored object. One way you might get "messages via the hook" would be to have your post-commit hook post messages to RabbitMQ (or some other queue), which your application could then consume and do its own processing.
I hope that gives you an idea of where to start. In the meantime, we'll add some examples to that wiki page.
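For the application side of that queue idea, here is a minimal consumer sketch using the pika client, assuming the post-commit hook publishes to a queue named riak_updates on a local RabbitMQ (queue name and host are just examples):

import pika

def on_update(channel, method, properties, body):
    # body is whatever the post-commit hook published for the changed object
    print("changed object:", body)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="riak_updates")
channel.basic_consume(queue="riak_updates", on_message_callback=on_update, auto_ack=True)
channel.start_consuming()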

Resources