PromQL: joining multiple metrics

I'm using Sysdig monitoring in IBM Cloud.
I have these two metrics.
first:
sum by(container_image_repo,container_image_tag) (sysdig_container_cpu_cores_used)
which returns the total CPU used, grouped by repo and tag (Value_A)
second:
count by(container_image_repo,container_image_tag)(sysdig_container_info)
which returns the total number of containers, grouped by repo and tag (Value_B)
My problem is that I would like a single query that returns both metrics at the same time, grouped by repo and tag, i.e.:
Repo Tag Value_A Value_B
Any hints?
I tried joining the two queries:
sum by(container_image_repo,container_image_tag) (sysdig_container_cpu_cores_used)
  * on(container_image_repo,container_image_tag)
count by(container_image_repo,container_image_tag) (sysdig_container_info)
but I still get a single value, which is the product of the two values (A*B), grouped by repo and tag. No surprise indeed...
Thank you

This is not related to Sysdig, but to the way PromQL is designed. In PromQL, when you apply a function over a vector, the resulting vector or scalar does not carry the metric name (since it is no longer the same metric, but a value derived from it).
In your example, these two metrics that you are using denote two different things:
sysdig_container_cpu_cores_used: the number of CPU cores a particular container occupies
sysdig_container_info: a set of additional labels for each container. To avoid attaching all container information (agent id, container id, the id of the image in the container, image digest value, etc.) to every container metric, you join a container metric with sysdig_container_info whenever you need those extra labels, enriching it with them.
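For example, an enrichment join usually looks like the sketch below (illustrative only: the shared label used in on() and the label pulled in with group_left() must match the labels your Sysdig metrics actually expose; the multiplication leaves the CPU value unchanged only because an info-style metric like sysdig_container_info conventionally has the value 1):
sysdig_container_cpu_cores_used
  * on(container_id) group_left(image_digest)
sysdig_container_info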
In my opinion, your query gives you all the relevant info you stated you need:
e.g.
Repo Tag Value (CPU used)
k8s.gcr.io/kops/kops-controller 1.22.3 0.00148
Disclosure:
I work as an engineer at Sysdig, but my answers/comments are strictly my own.

Related

Grafana/Prometheus: visualizing multiple IPs as a query

I want a graph where every recent IP that requested my webserver is shown with its total request count. Is something like this doable? Can I add a query and remove it afterwards via Prometheus?
Technically, yes. You will need to:
Expose a metric (probably a counter) in your server, say requests_count, with a label, say ip (see the sketch after these steps)
Whenever you receive a request, increment the metric with the label set to the requester's IP
In Grafana, graph the metric, likely summing it by the IP address to handle the case where several horizontally scaled servers handle requests: sum(your_prometheus_namespace_requests_count) by (ip)
Set the Legend of the graph in Grafana to {{ ip }} to 'name' each line after the IP address it represents
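A minimal sketch of the instrumentation side, assuming a Python service and the official prometheus_client library (the metric name, label name, and port are illustrative):
from prometheus_client import Counter, start_http_server
import time

# One counter, labelled by requester IP; recent client versions expose this
# on /metrics as requests_count_total.
REQUESTS = Counter("requests_count", "Requests received, by client IP", ["ip"])

def handle_request(client_ip: str) -> None:
    # Each distinct IP creates (and then increments) its own time series.
    REQUESTS.labels(ip=client_ip).inc()
    # ... actual request handling goes here ...

if __name__ == "__main__":
    start_http_server(8000)        # serves /metrics for Prometheus to scrape
    handle_request("192.168.0.1")
    time.sleep(300)                # keep the process alive for the scrape
Grafana then graphs the sum ... by (ip) query from the steps above against whatever name the counter ends up exposed under.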
However, every distinct label value a metric has causes a whole new time series to exist in the Prometheus time-series database; you can think of a series like requests_count{ip="192.168.0.1"}=1 as being somewhat similar to requests_count_ip_192_168_0_1{}=1 in terms of how it consumes memory. Each series currently held in the Prometheus TSDB head takes something on the order of 3kB to exist. That means that if you're handling requests from millions of distinct IPs, you're going to swamp Prometheus' memory with gigabytes of data from this one metric alone. A more detailed explanation of this issue is in this other answer: https://stackoverflow.com/a/69167162/511258
With that in mind, this approach would make sense if you know for a fact you expect a small volume of IP addresses to connect (maybe on an internal intranet, or a client you distribute to a small number of known clients), but if you are planning to deploy to the web this would allow a very easy way for people to (unknowingly, most likely) crash your monitoring systems.
You may want to investigate an alternative -- for example, Grafana is capable of ingesting data from some common log aggregation platforms, so perhaps you can do some structured (e.g. JSON) logging, hold that in e.g. Elasticsearch, and then create a graph from the data held within that.

Query for a cache hit rate graph with Prometheus

I'm using a Caffeine cache with a Spring Boot application. All metrics are enabled, so I have them in Prometheus and Grafana.
Based on the cache_gets_total metric I want to build a hit rate graph.
I've tried to get the cache hits:
delta(cache_gets_total{result="hit",name="myCache"}[1m])
and all gets from cache:
sum(delta(cache_gets_total{name="myCache"}[1m]))
Both queries work correctly and return values. But when I try to get a hit ratio, I have no data points. The query I've tried:
delta(cache_gets_total{result="hit",name="myCache"}[1m]) / sum(delta(cache_gets_total{name="myCache"}[1m]))
Why doesn't this query work, and how can I get a hit rate graph based on the information I have from Spring Boot and Caffeine?
Run both ("cache hits" and "all gets") queries individually in prometheus and compare label sets you get with results.
For "/" operation to work both sides have to have exactly the same labels (and values). Usually some aggregation is required to "drop" unwanted dimensions/labels (like: if you already have one value from both queries then just wrap them both in sum() - before dividing).
First of all, it is recommended to use increase() instead of delta for calculating the increase of the counter over the specified lookbehind window. The increase() function properly handles counter resets to zero, which may happen on service restart, while delta() would return incorrect results if the given lookbehind window covers counter resets.
Next, when performing the / operation, Prometheus searches for pairs of time series with identical sets of labels, and then applies the operation individually to each pair. Time series returned from increase(cache_gets_total{result="hit",name="myCache"}[1m]) have at least two labels, result="hit" and name="myCache", while time series returned from sum(increase(cache_gets_total{name="myCache"}[1m])) have zero labels, because sum removes all the labels after aggregation.
Prometheus provides the solution to this issue - on() and group_left() modifiers. The on() modifier allows limiting the set of labels, which should be used when searching for time series pairs with identical labelsets, while the group_left() modifier allows matching multiple time series on the left side of / with a single time series on the right side of / operator. See these docs. So the following query should return cache hit rate:
increase(cache_gets_total{result="hit",name="myCache"}[1m])
/ on() group_left()
sum(increase(cache_gets_total{name="myCache"}[1m]))
There are alternative solutions:
To remove all the labels from increase(cache_gets_total{result="hit",name="myCache"}[1m]) with the sum() function:
sum(increase(cache_gets_total{result="hit",name="myCache"}[1m]))
/
sum(increase(cache_gets_total{name="myCache"}[1m]))
To wrap the right part of the query in the scalar() function. This enables the vector op scalar matching rules described here:
increase(cache_gets_total{result="hit",name="myCache"}[1m])
/
scalar(sum(increase(cache_gets_total{name="myCache"}[1m])))
It is also possible to get cache hit rate for all the caches with a single query via sum(...) by (name) template:
sum(increase(cache_gets_total{result="hit"}[1m])) by (name)
/
sum(increase(cache_gets_total[1m])) by (name)

Advantages of using timeSeries over container resource

The timeSeries resource represents a container for data instances, and the timeSeriesInstance resource represents a data instance within it.
The main difference from container and contentInstance is that time information is kept with the data and missing data can be detected.
Is there any other advantage to using timeSeries and timeSeriesInstance resources instead of container and contentInstance resources?
Does it also help in reducing data redundancy? E.g., if one application instance sends data every 30 seconds, then 24*120 = 2880 contentInstance resources will be created per day.
If timeSeries and timeSeriesInstance resources are used instead, will the same number of timeSeriesInstance resources (i.e. 24*120) be created per day for the above case?
Also, is there any specific purpose for keeping the contentInfo attribute in timeSeries instead of timeSeriesInstance (like the contentInfo in the contentInstance resource)?
There are a couple of differences between the <container> and <timeSeries> resource types.
A <container> resource may contain an arbitrary number of <contentInstance> resources as well as <flexContainer> and (sub) <container> resources as child resources. The advantage of this is that a <container> can be further structured to represent more complex data types.
This is also the reason why the contentInfo attribute cannot be part of the <container> resource: the content types may be mixed, or the <container> resource may not have direct <contentInstance> resources at all.
A <timeSeries> resource can only have <timeSeriesInstance> resources as child resources (apart from <subscription>, <oldest>, <latest>, etc.). It is assumed that all the child <timeSeriesInstance> resources are of the same type, therefore the contentInfo is located in the <timeSeries> resource.
<timeSeriesInstance> resources may also have a sequenceNr attribute which allows the CSE to check for missing or out-of-sequence data. See, for example, the missingDataDetect attribute in the <timeSeries> resource.
For your application (sending and storing data every 30 seconds): it depends on the requirements. If it is important that measurements are transmitted all the time, or if you need to know when data is missing, then use <timeSeries> and <timeSeriesInstance>. If your application only sends data when the measurement changes and it is only important to retrieve the latest value, then use <container> and <contentInstance>.
Two use cases where <timeSeries> seems better to me than using a <container>:
The first use case involves the dataGenerationTime attribute. This allows a sensor to record exactly when a value was captured, whereas with a <contentInstance> you only have the creation time (you could put the capture time into the content attribute, but that requires additional processing to extract it from the content). If you use the creationTime attribute of the <contentInstance>, there will be variations in the time based on when the CSE receives the primitive. When using the <timeSeriesInstance> the variations go away, because the CREATE request includes the dataGenerationTime attribute. That makes the data more accurate.
The second use case involves the missingDataDetect attribute. In short, using this, along with the expected periodicInterval you can implement a "heartbeat" type functionality for your sensor. If the sensor does not send a measurement indicating that the door is closed/open every 30 seconds, a notification can be sent indicating that the sensor is malfunctioning or tampered with.
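As a purely illustrative sketch (attribute names are spelled out in full; actual oneM2M serializations use short names, and the exact attribute set depends on your CSE's release), a <timeSeries> configured for missing-data detection and one of its <timeSeriesInstance> children could look roughly like this:
timeSeries = {
  missingDataDetect: true,
  periodicInterval: 30000,   // one value expected every 30 seconds (in ms)
  missingDataMaxNr: 10       // remember up to 10 missing data points
}
timeSeriesInstance = {
  dataGenerationTime: '20240101T120000',  // when the sensor captured the value
  sequenceNr: 42,                         // lets the CSE detect gaps
  content: 'closed'
}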

How to define/send derived metrics in addition to built-in Aerospike metrics

I'm trying to ship Aerospike metrics to another node using some available methods, e.g., collectd.
For example, among the Aerospike monitoring metrics, given two fields: say X and Y, how can I define and send a derived metric like Z = X+Y or X/Y?
We could calculate it on the receiver side, but that degrades the overall performance of our application. I will appreciate your guidance.
Thanks.
It can't be done within the Aerospike collectd plugin, as the metrics are more or less shipped immediately once they are read. There's no variable that saves the metrics that have been shipped.
If you can use the Graphite plugin, it keeps track of all gathered metrics and then sends them once at the very end. You can add another stanza for your calculated metrics right before the nmsg line. You'll have to search through the msg[] array for your source metrics.
The Nagios plugin is a very different method. It's a single metric pull, so a wrapper script would be needed to run the plugin for each operand, and run the calculation in the wrapper.
Or you can supplement the existing plugins with your own script(s) just for derived metrics. All of our monitoring plugins utilize the Aerospike Info Protocol, and you can use asinfo to gather the metrics for your operands, similarly to the previous Nagios method.
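As an illustration of that last approach, a small wrapper script could pull the operands with asinfo and emit the derived value itself. This is only a sketch: it assumes asinfo is on the PATH, that the "statistics" info command returns semicolon-separated key=value pairs, and that the operand names used here are placeholders you would replace with the real statistics you care about.
import subprocess

def get_stats(host="127.0.0.1", port=3000):
    # Return Aerospike node statistics as a dict of name -> value.
    out = subprocess.check_output(
        ["asinfo", "-h", host, "-p", str(port), "-v", "statistics"], text=True
    )
    stats = {}
    for pair in out.strip().split(";"):
        if "=" in pair:
            name, value = pair.split("=", 1)
            stats[name] = value
    return stats

stats = get_stats()
x = float(stats["client_read_success"])   # placeholder operand X
y = float(stats["client_read_error"])     # placeholder operand Y
z = x / (x + y) if (x + y) else 0.0       # derived metric Z, e.g. a success ratio
print(f"derived_read_success_ratio {z}")  # ship this however your pipeline expects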

Microservices and NoSQL - best practice for enriching data in a microservice architecture

I want to plan a solution that manages enriched data in my architecture.
To be clearer: I have dozens of microservices,
let's say Country, Building, Floor, and Worker,
each running over its own NoSQL data store.
When I get data from the Worker service I also want to present the floor name (the floor the worker works on), the building name, and the country name.
Solution 1.
The client will query all microservices.
Problem - multiple requests, and the client has to be aware of the structure.
I know multiple requests shouldn't bother me, but I believe that returning a JSON describing the entity in one single call is better.
Solution 2.
Create an orchestration service that retrieves the data from multiple services.
Problem - if the data (entity names, for example) is not stored in the same document in the DB, it is very hard to sort and filter by these fields.
Solution 3.
Before saving the entity, e.g. a worker, call all the other services and fill in the related data (building name, country name).
Problem - when the building name is changed, the change is not reflected in the Worker service.
Solution 4.
(This is the best one I can come up with.)
Create a process that subscribes to a broker and receives all entity changes.
For each entity it updates all the relevant entities.
When an entity changes, let's say the building name changes, it updates all the documents that hold the building name.
Problems:
Each service has to know what can be updated.
When a trailing update happens it shouldn't update the broker again (recursive update), so this adds complexity to the microservices.
Solution 5.
Keep everything normalized. Filter and sort in Elasticsearch.
Problem: keeping normalized data in ES is too expensive performance-wise.
One thing I saw Netflix do (which I like) is create intermediary services for cases like this. So maybe a new intermediary service that can call the other services to gather all the data and then create the unified output with the Country, Building, Floor, and Worker.
You can even go one step further and try to come up with a scheme for specifying, as input, which resources you want included in the output.
So I guess this closely matches your Solution 2. I notice that you mention for Solution 2 that there are concerns with sorting/filtering in the DBs. I think that if you are using NoSQL then it has to be for a reason, and more often than not the reason is performance. If this is done wrong then yes, you will have problems, but if all the appropriate searchable fields are properly keyed and indexed (as #Roman Susi mentioned in his bullet points 1 and 2) then I don't see this as being a problem. This service will only be as fast as the culmination of your other services and data stores, so they have to be fast.
Now you keep your individual microservices as they are, keep the client calling one service, and encapsulate the complexity of merging the data into this new service.
This is the video where I saw this (https://www.youtube.com/watch?v=StCrm572aEs); it's a long video but very informative.
It is hard to advise at the level of Solution N, but certain problems can be avoided by following this advice:
Use globally unique identifiers for entities, for example by using some kind of URI as key values.
The global ids also simplify updates, because you can track what has actually changed: the name or the entity (an entity has a one-to-one relation with its global URI).
CAP theorem says you can choose only two from CAP. Do you want a CA architecture? Or CP? Or maybe AP? This will strongly affect the way you distribute data.
For "sort and filter" there is MapReduce approach, which can distribute the load of figuring out those things.
Think carefully about the balance of normalization / denormalization. If your services operate on URIs, you can have a service which turns URIs into labels (names, descriptions, etc.), but you do not need to keep the redundant information everywhere and update it. Do not do premature optimization; try to keep data normalized as long as possible. This way, a worker may not even need the building name, only its global id, and the microservice looks up the metadata from another microservice.
In other words, minimize the number of keys, shared between services, as part of separation of concerns.
Focus on the underlying model, not on the JSON going back and forth. Modelling the data in your system(s) correctly gains you more than saving JSON calls.
As for NoSQL, take a look at the Riak database: it has adjustable CAP properties, IIRC. Even if you do not use it as such, reading its documentation may help you come up with a suitable architecture for your distributed microservices system. (Of course, this applies if you have an essentially parallel system.)
First of all, thanks for your question. It is similar to the main problem of document DBs: how do you sort a collection by a field from another collection? I have my own answer for that, so I'll try to comment on all your solutions:
Solution 1: It is good if the client wants to work with Countries/Buildings/Floors independently. But it does not solve the problem you mentioned in Solution 2 - sorting 10k workers by building is going to be slow.
Solution 2: Similar to Solution 1 if all the client wants is a list of enriched workers without knowing how to combine it from multiple pieces.
Solution 3: As you said, unacceptable because of inconsistent data.
Solution 4: It is going to work most of the time. But:
Huge data duplication. If you have 20 entities, you are going to have 20x the data.
Large complexity. 20 entities -> 20 different procedures to update related data.
Tight coupling. All your services must know about each other. A data model change will propagate to every service because of the update procedures.
Questionable eventual consistency. It can be done so that data will be consistent after failures, but it is not going to be easy.
Solution 5: This is kind of the answer :-)
But you do not want everything in it. Keep separate services that serve separate entities, and build other services on top of them.
If the client wants enriched data, build a service that returns enriched data, as in Solution 2.
If the client wants to display a list of enriched data with filtering and sorting, build a service that provides enriched data with filtering and sorting capability! Likely, the implementation of such a service will contain an ES instance holding cached and indexed data from the lower-level services. The point here is that ES does not have to contain everything or be shared between all services - it is up to you to find the right balance between performance and infrastructure resources.
This is a case where Linked Data can help you.
Basically, the floor attribute of the worker would be a URI (a link) to the floor itself, and any other linked data should be expressed as URIs as well.
Modeled with some JSON-LD it would look like this:
worker = {
  '#id': '/workers/87373',
  name: 'John',
  floor: {
    '#id': '/floors/123'
  }
}
floor = {
  '#id': '/floors/123',
  level: 12,
  building: { '#id': '/buildings/87' }
}
building = {
  '#id': '/buildings/87',
  name: "John's home",
  city: { '#id': '/cities/908' }
}
This way all the client has to do is append the BASE URL (like api.example.com) to the #id and make a simple GET call.
To remove the burden of the extra calls from the client (in case it's a slow mobile device), we use the gateway pattern with microservices. The gateway can expand those links with very little effort and augment the returned object. It can also make multiple calls in parallel.
So the gateway will make a GET /floors/123 call and replace the floor object on the worker with the reply.
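A rough sketch of that expansion step (the URLs, resource shapes, and the use of the requests library are assumptions for illustration; a real gateway would parallelize the calls and handle errors and timeouts):
import requests

BASE_URL = "https://api.example.com"  # the base the client would otherwise prepend

def expand(resource: dict, link_fields: list) -> dict:
    # Replace {'#id': '/floors/123'}-style links with the fetched resources.
    expanded = dict(resource)
    for field in link_fields:
        link = resource.get(field, {}).get("#id")
        if link:
            expanded[field] = requests.get(BASE_URL + link, timeout=5).json()
    return expanded

worker = requests.get(BASE_URL + "/workers/87373", timeout=5).json()
worker = expand(worker, ["floor"])                       # pull in the floor
worker["floor"] = expand(worker["floor"], ["building"])  # and its building
print(worker)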
