Spring Cloud Stream - How to calculate average of metrics' values - apache-kafka-streams

A question from a Spring Cloud Stream rookie with little Kafka Streams knowledge.
I need to produce a stream of time-series regarding compute resources metrics.
I receive a stream of json messages containing metrics like cpu usage, memory usage and so on.
I'm stuck on how to sum the metric values and count the number of processed values in order to calculate the average over a time window (e.g. 300 s).
Incoming messages are like this:
[
  {
    "value": 27,
    "unit": "%",
    "type": "gauge",
    "metric": "cpu_usage",
    "time": 1519314305.896,
    "host": "vm01-partition01"
  }
]
I'd like to obtain a json message like the following:
[
  {
    "value": 32,
    "aggregation": "mean",
    "metric": "cpu_usage",
    "time": 1519314305.896, // timestamp of the first processed value
    "host": "vm01-partition01"
  }
]
Can anybody shed some light on this for me?
Thanks
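A minimal sketch of the kind of windowed aggregation being asked about, using the plain Kafka Streams DSL that the Spring Cloud Stream Kafka Streams binder builds on. The topic names, the keying scheme, and the use of Spring's JsonSerde for the intermediate aggregate are assumptions rather than anything from the question:

import java.time.Duration;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.springframework.kafka.support.serializer.JsonSerde;

public class MetricAverager {

    // Running aggregate for one window: the sum and count needed for the mean.
    public static class Agg {
        public double sum;
        public long count;

        public Agg add(double value) {
            sum += value;
            count++;
            return this;
        }

        public double mean() {
            return count == 0 ? 0.0 : sum / count;
        }
    }

    public void buildTopology(StreamsBuilder builder) {
        // Assumption: records are keyed by "host/metric" (e.g. "vm01-partition01/cpu_usage")
        // and the value is the numeric reading. In the real application you would first
        // deserialize the incoming JSON and re-key the stream accordingly; the aggregate
        // could also carry the timestamp of the first value if you need it in the output.
        builder.<String, Double>stream("metrics-in",
                        Consumed.with(Serdes.String(), Serdes.Double()))
                .groupByKey(Grouped.with(Serdes.String(), Serdes.Double()))
                // 300-second tumbling windows, as asked in the question
                .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofSeconds(300)))
                // keep a running sum and count per key and window
                .aggregate(Agg::new,
                        (key, value, agg) -> agg.add(value),
                        Materialized.with(Serdes.String(), new JsonSerde<Agg>(Agg.class)))
                .toStream()
                // drop the window wrapper from the key and forward the current mean
                .map((windowedKey, agg) -> KeyValue.pair(windowedKey.key(), agg.mean()))
                .to("metrics-mean-out", Produced.with(Serdes.String(), Serdes.Double()));
    }
}

The idea is simply that the per-window aggregate carries the running sum and count, and the mean is computed only when results are forwarded downstream. Note that without a suppress() step the mean is re-emitted every time a window is updated; a suppress(Suppressed.untilWindowCloses(...)) stage would restrict output to one final value per window.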

Related

How do I SUM the averages from different sources in Elasticsearch?

Good morning.
First of all, I want to say that I am new to Elastic, so maybe this question is too easy, but I don't know how to do it.
I have 3 sources (3 Kafka brokers) generating the same metric:
$curl http://brokerX:7771/jolokia/read/kafka.server:type=Produce,user=*/byte-rate
{
"request": {
"mbean": "kafka.server:type=Produce,user=*",
"attribute": "byte-rate",
"type": "read"
},
"value": {
"kafka.server:type=Produce,user=USWNBIB01": {
"byte-rate": 55956.404059932334
},
"kafka.server:type=Produce,user=ngbi": {
"byte-rate": 19778.793941126038
},
"kafka.server:type=Produce,user=admin": {
"byte-rate": 2338235.8307990045
}
},
"timestamp": 1654588517,
"status": 200
}
and ingested into Elastic via Jolokia.
Example of a record (I have only included some fields):
Field Value
_id giMtPYEB2QGR_VpVmCz4
_index idx-ls-confluent-metrics-ro-2022.06.07-000424
...
agent.type metricbeat
...
event.module jolokia
host.name broker1
index_template confluent-metrics
...
jolokia.jolokia_metrics.mbean kafka.server:type=Produce,user=sena
jolokia.jolokia_metrics.UserByteRate 885,160.3
logstash.pipeline bi-confluent
metricset.name jmx
...
I need a dashboard (stacked vertical bar) showing the sum of the per-broker averages.
When I create the dashboard, if I put average(jolokia.jolokia_metrics.UserByteRate) on the vertical axis, I get the average across all nodes (but not the sum of the averages), whereas if I put sum(jolokia.jolokia_metrics.UserByteRate), I get a higher value than I should:
For example, the actual value should be the sum of:
"byte-rate": 2935617.4496644298
"byte-rate": 3328181.9137749737
"byte-rate": 2874583.589457018
Almost 9 MB, not 23 MB.
I think the problem is that I need sum(average(jolokia.jolokia_metrics.UserByteRate)), but that formula is not accepted by Elastic:
The Formula sum(average(jolokia.jolokia_metrics.UserByteRate)) cannot be parsed.
If I use the formula average(jolokia.jolokia_metrics.UserByteRate), the average across all brokers appears, but I want the sum of that.
I do not know if I have been able to explain myself well.
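One way to express the sum of the per-broker averages is directly in the Elasticsearch query DSL rather than in a Lens formula: a terms aggregation per broker, an avg sub-aggregation on the byte rate, and a sum_bucket pipeline aggregation that adds the per-broker averages together. Below is a rough sketch using the low-level Java REST client; the index pattern and the use of host.name as the per-broker field are assumptions based on the record shown above.

import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class SumOfBrokerAverages {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            Request request = new Request("POST", "/idx-ls-confluent-metrics-ro-*/_search");
            // Average byte rate per broker, then a sum_bucket pipeline aggregation
            // that adds those per-broker averages together.
            request.setJsonEntity("""
                {
                  "size": 0,
                  "aggs": {
                    "per_broker": {
                      "terms": { "field": "host.name" },
                      "aggs": {
                        "avg_rate": { "avg": { "field": "jolokia.jolokia_metrics.UserByteRate" } }
                      }
                    },
                    "sum_of_averages": {
                      "sum_bucket": { "buckets_path": "per_broker>avg_rate" }
                    }
                  }
                }
                """);
            Response response = client.performRequest(request);
            System.out.println(EntityUtils.toString(response.getEntity()));
        }
    }
}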

Elasticsearch - query based on event frequency

I have multiple indexes storing user tracking logs, one of which is index-pageview. How can I query the list of users who viewed the page 10 times between 2021-12-11 and 2021-12-13 using the iOS operating system?
Log example:
index: index-pageview
[
{
"user_id": 1,
"session_id": "xxx",
"timestamp": "2021-12-11 hh:mm:ss",
"platform": "IOS"
},
{
"user_id": 1,
"session_id": "yyy",
"timestamp": "2021-12-13 hh:mm:ss",
"platform": "Android"
}
]
You can try building a normal bool query on timestamp and platform, and then either a terms aggregation (possibly with min_doc_count: 10) or a collapse on user_id (the aggregation approach is sketched after this list). Both ways have some limitations though:
aggregation might be slower (needs benchmarking)
aggregation bucket number is limited (at 10k by default)
collapse will work on at most size docs at a time (capped at 10k as well) so you might need scrolling and app-side processing
Performance of either might be pretty poor, though. If you need to run queries like these very often, I would consider using another storage (SQL? Something fancier?).
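Below is a rough sketch of the terms-aggregation approach using the low-level Java REST client. The field names come from the log example, and depending on the mapping the platform and timestamp clauses may need adjusting (e.g. a platform.keyword sub-field or a different date format).

import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class FrequentPageViewers {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            Request request = new Request("POST", "/index-pageview/_search");
            // Filter to iOS page views in the date range, then bucket by user_id,
            // keeping only users with at least 10 matching documents.
            request.setJsonEntity("""
                {
                  "size": 0,
                  "query": {
                    "bool": {
                      "filter": [
                        { "term": { "platform": "IOS" } },
                        { "range": { "timestamp": { "gte": "2021-12-11 00:00:00", "lte": "2021-12-13 23:59:59" } } }
                      ]
                    }
                  },
                  "aggs": {
                    "frequent_users": {
                      "terms": { "field": "user_id", "min_doc_count": 10, "size": 10000 }
                    }
                  }
                }
                """);
            Response response = client.performRequest(request);
            System.out.println(EntityUtils.toString(response.getEntity()));
        }
    }
}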

How can I show a table with the sum of value x of all children within Kibana

I have an Elasticsearch database with documents stored the following way (a comma separates the documents):
{
  "path": "path/to/data",
  "kind": "type1"
},
{
  "path": "path/to/data/values1",
  "kind": "type2",
  "x": 2
},
{
  "path": "path/to/data/values2",
  "kind": "type2",
  "x": 2
},
{
  "path": "path/to/data/datasub",
  "kind": "type1"
},
{
  "path": "path/to/data/datasub/values1",
  "kind": "type2",
  "x": 1
}
Now I want to create a table view/chart that shows all type2's with the sum of x of all their children.
So I expect the total of path/to/data to be 5 and the total of path/to/data/datasub 1.
To consider: the depth of this structure could theoretically be unlimited
I'm running Elasticsearch 7 and Kibana 7, and I want to use the table visualisation to start with, but I would like to be able to use this kind of aggregation throughout multiple visualisations. I have Googled a lot and found all kinds of Elasticsearch queries, but nothing on how to achieve this in Kibana.
All help is much appreciated
For those who run into the same question:
The solution I ended up using is to split the path into tokens prior to importing it into Elasticsearch. So consider a document with a path like "/this/is/a/path". This becomes the following array in the document:
[
"/this",
"/this/is",
"/this/is/a",
"/this/is/a/path"
]
You can then use a terms aggregation on it with various metrics to calculate your desired measurements.
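As a rough illustration of that last step (the path_tokens field name is made up; any keyword field holding the prefix array will do), a terms aggregation on the token field with a sum sub-aggregation on x gives each path prefix the total over the document itself and everything below it, which is what the table visualisation can then display.

import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class PathTotals {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            Request request = new Request("POST", "/my-index/_search");
            // Each document carries all of its path prefixes in "path_tokens",
            // so every prefix bucket sums x over the document and its descendants.
            request.setJsonEntity("""
                {
                  "size": 0,
                  "aggs": {
                    "per_path": {
                      "terms": { "field": "path_tokens", "size": 1000 },
                      "aggs": {
                        "total_x": { "sum": { "field": "x" } }
                      }
                    }
                  }
                }
                """);
            Response response = client.performRequest(request);
            System.out.println(EntityUtils.toString(response.getEntity()));
        }
    }
}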

Message order with Kafka Connect Elasticsearch Connector

We are having problems enforcing the order in which messages from a Kafka topic are sent to Elasticsearch using the Kafka Connect Elasticsearch Connector. In the topic the messages are in the right order with the correct offsets, but if there are two messages with the same ID created in quick succession, they are intermittently sent to Elasticsearch in the wrong order. This causes Elasticsearch to have the data from the second last message, not from the last message. If we add some artificial delay of a second or two between the two messages in the topic, the problem disappears.
The documentation here states:
Document-level update ordering is ensured by using the partition-level
Kafka offset as the document version, and using version_mode=external.
However, I can't find any documentation anywhere about this version_mode setting, or whether it's something we need to set ourselves.
In the log files from the Kafka Connect system we can see the two messages (for the same ID) being processed in the wrong order, a few milliseconds apart. It might be significant that it looks like these are processed in different threads. Also note that there is only one partition for this topic, so all messages are in the same partition.
Below is the log snippet, slightly edited for clarity. The messages in the Kafka topic are populated by Debezium, which I don't think is relevant to the problem, but handily happens to include a timestamp value. This shows that the messages are processed in the wrong order (though they're in the correct order in the Kafka topic, populated by Debezium):
[2019-01-17 09:10:05,671] DEBUG http-outgoing-1 >> "
{
"op": "u",
"before": {
"id": "ac025cb2-1a37-11e9-9c89-7945a1bd7dd1",
... << DATA FROM BEFORE SECOND UPDATE >> ...
},
"after": {
"id": "ac025cb2-1a37-11e9-9c89-7945a1bd7dd1",
... << DATA FROM AFTER SECOND UPDATE >> ...
},
"source": { ... },
"ts_ms": 1547716205205
}
" (org.apache.http.wire)
...
[2019-01-17 09:10:05,696] DEBUG http-outgoing-2 >> "
{
"op": "u",
"before": {
"id": "ac025cb2-1a37-11e9-9c89-7945a1bd7dd1",
... << DATA FROM BEFORE FIRST UPDATE >> ...
},
"after": {
"id": "ac025cb2-1a37-11e9-9c89-7945a1bd7dd1",
... << DATA FROM AFTER FIRST UPDATE >> ...
},
"source": { ... },
"ts_ms": 1547716204190
}
" (org.apache.http.wire)
Does anyone know how to force this connector to maintain message order for a given document ID when sending the messages to Elasticsearch?
The problem was that our Elasticsearch connector had the key.ignore configuration set to true.
We spotted this line in the Github source for the connector (in DataConverter.java):
final Long version = ignoreKey ? null : record.kafkaOffset();
This meant that, with key.ignore=true, the indexing operations that were being generated and sent to Elasticsearch were effectively "versionless" ... basically, the last set of data that Elasticsearch received for a document would replace any previous data, even if it was "old data".
From looking at the log files, the connector seems to have several consumer threads reading the source topic, then passing the transformed messages to Elasticsearch, but the order that they are passed to Elasticsearch is not necessarily the same as the topic order.
Using key.ignore=false, each Elasticsearch message now contains a version value equal to the Kafka record offset, and Elasticsearch refuses to update the index data for a document if it has already received data for a later "version".
That wasn't the only thing that fixed this. We still had to apply a transform to the Debezium message from the Kafka topic to get the key into a plain text format that Elasticsearch was happy with:
"transforms": "ExtractKey",
"transforms.ExtractKey.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.ExtractKey.field": "id"

CouchDB view/reduce: sometimes you can return values, sometimes you can't?

This is on a recent version of couchbase server.
The end goal is for the reduce/group by to aggregate the values of the duplicate keys into a single row with an array value.
view result with no reduce/grouping (in reality there are maybe 50 rows like this emitted):
{
"total_rows": 3,
"offset": 0,
"rows": [
{
"id": "1806a62a75b82aa6071a8a7a95d1741d",
"key": "064b6b4b-8e08-4806-b095-9e59495ac050",
"value": "1806a62a75b82aa6071a8a7a95d1741d"
},
{
"id": "47abb54bf31d39946117f6bfd1b088af",
"key": "064b6b4b-8e08-4806-b095-9e59495ac050",
"value": "47abb54bf31d39946117f6bfd1b088af"
},
{
"id": "ed6a3dd3-27f9-4845-ac21-f8a5767ae90f",
"key": "064b6b4b-8e08-4806-b095-9e59495ac050",
"value": "ed6a3dd3-27f9-4845-ac21-f8a5767ae90f"
}
]
}
with reduce + group_level=1:
function (keys, values, rereduce) {
  return values;
}
yields an error from Couch with the actual 50 or so rows from the real view (it even fails with fewer view rows). Couch says something about the data not shrinking rapidly enough. However, this same type of thing works just fine when the view keys are integers and there is a small amount of data.
Can someone please explain the difference to me?
Reduce values need to remain as small as possible, due to the nature of how they are stored in the internal b-tree data format. There's a little bit of information in the wiki about why this is.
If you want to identify unique values, this needs to be done in your map function. This section on the same wiki page shows you one method you can use to do so. (I'm sure there are others)
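As a rough illustration of the map-only approach (the doc_key field below is made up; use whatever field the view currently emits as its key), you can emit a compound [key, value] key and skip reduce entirely:
// map: one small row per (key, value) pair; no reduce function at all
function (doc) {
  if (doc.doc_key) {
    emit([doc.doc_key, doc._id], null);
  }
}
Querying with startkey=["<key>"] and endkey=["<key>", {}] then returns the values for a single key, which the application can collect into an array itself.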
I am almost always going to be querying this view with a "key" parameter, so there really is no need to aggregate values via Couch; it can be done easily and efficiently in the app.
