How to define/send derived metrics in addition to built-in Aerospike metrics

I'm trying to ship Aerospike metrics to another node using some available methods, e.g., collectd.
For example, among the Aerospike monitoring metrics, given two fields: say X and Y, how can I define and send a derived metric like Z = X+Y or X/Y?
We could calculate it on the receiver side, but that degrades our application's overall performance. I'd appreciate your guidance.
Thanks.

It can't be done within the Aerospike collectd plugin, as the metrics are shipped more or less immediately once they are read. There's no variable that retains the metrics after they have been shipped.
If you can use the Graphite plugin, it keeps track of all gathered metrics and then sends them once at the very end. You can add another stanza for your calculated metrics right before the nmsg line. You'll have to search through the msg[] array for your source metrics.
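As a minimal sketch of what such a stanza could do (the metric names and the "name value" line format here are assumptions for illustration, not the plugin's exact internals):

```python
# Hypothetical helper for the Graphite plugin: search the gathered msg[]
# list for two source metrics and append a derived one before sending.
# Assumes each entry is a "name value" string; metric names are made up.

def add_derived_metric(msg, x_name, y_name, derived_name):
    values = {}
    for line in msg:
        name, _, value = line.partition(" ")
        values[name] = value
    try:
        x = float(values[x_name])
        y = float(values[y_name])
    except (KeyError, ValueError):
        return msg  # a source metric is missing or non-numeric: skip
    msg.append("%s %g" % (derived_name, x + y))
    return msg
```

The same pattern works for any arithmetic over the gathered values (X+Y, X/Y, percentages), as long as both operands appear in msg[] before the send.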
The Nagios plugin works very differently: it pulls a single metric at a time, so you would need a wrapper script that runs the plugin once for each operand and performs the calculation in the wrapper.
Or you can supplement the existing plugins with your own script(s) just for the derived metrics. All of our monitoring plugins utilize the Aerospike Info Protocol, and you can use asinfo to gather the metrics for your operands, similar to the Nagios method above.
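A small Python sketch of that approach, assuming the semicolon-separated name=value format that asinfo's statistics command returns (the specific statistic names you would divide are your choice; the ones in the test below are only examples):

```python
import subprocess

def parse_stats(raw):
    """Parse asinfo 'statistics' output: 'name=value;name=value;...'."""
    stats = {}
    for pair in raw.strip().rstrip(";").split(";"):
        name, _, value = pair.partition("=")
        stats[name] = value
    return stats

def fetch_stats():
    """Run asinfo against the local node (the command must be on PATH)."""
    raw = subprocess.check_output(["asinfo", "-v", "statistics"]).decode()
    return parse_stats(raw)

def derived(stats, x_name, y_name):
    """Compute Z = X / Y from two gathered statistics."""
    return float(stats[x_name]) / float(stats[y_name])
```

Your script would then format the derived value however your shipping method expects (Graphite line protocol, collectd exec plugin output, etc.).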

Related

How to count metrics from executions of AWS lambdas?

I have all sorts of metrics I would like to count and later query. For example I have a lambda that processes stuff from a queue, and for each batch I would like to save a count like this:
{
  "processes_count": 6,
  "timestamp": 1695422215,
  "count_by_type": {
    "type_a": 4,
    "type_b": 2
  }
}
I would like to dump these pieces somewhere and later have the ability to query how many were processed within a time range.
So these are the options I considered:
write the JSON to the logs, and later have a component (Beats?) that processes these logs and sends them to a time-series DB.
at the end of each execution, send it directly to a time-series DB (like Elasticsearch).
What is better in terms of cost / scalability? Are there more options I should consider?
I think CloudWatch Embedded Metric Format (EMF) would be a good fit here. There are client libraries for Node.js, Python, Java, and C#.
CW EMF allows you to push metrics out of Lambda into CloudWatch in a managed async way. So it's a cost-effective and low-effort way of producing metrics.
The client library writes a particular JSON format to stdout; when CloudWatch sees a message of this type, it automatically creates the metrics for you from it.
You can also include key-value pairs in the EMF format which allows you to go back and query the data with these keys in the future.
High-level clients are available with Lambda Powertools in Python and Java.
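To make the format concrete, here is a hand-rolled minimal sketch of an EMF record (in practice you would use the client libraries above rather than building it yourself; the namespace and metric/dimension names are illustrative):

```python
import json
import time

def emit_emf(namespace, metrics, dimensions):
    """Build and print one EMF log line; CloudWatch extracts metrics from it.
    metrics: dict of name -> numeric value; dimensions: dict of name -> str."""
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [sorted(dimensions)],
                "Metrics": [{"Name": n, "Unit": "Count"} for n in metrics],
            }],
        },
    }
    record.update(dimensions)  # dimension values live at the root...
    record.update(metrics)     # ...and so do the metric values
    print(json.dumps(record))
    return record
```

For the queue-processing example, something like emit_emf("QueueProcessor", {"processes_count": 6}, {"type": "type_a"}) would produce one log line per batch, and the "type" dimension lets you query counts per type later.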

Publishing high-volume metrics from Lambdas?

I have a bunch of Lambdas written in Go that produce certain events that are pushed out to various systems. I would like to publish metrics to CloudWatch that slice these by the event type. The volume is currently about 20000 events per second with peaks about twice that much.
Due to the load, I can't publish these metrics one by one on each Lambda invocation (each invocation produces a single event). What approaches are available that are cheap and don't hit any limits?
You can try utilizing the shutdown phase of the Lambda lifecycle to publish your metrics.
https://docs.aws.amazon.com/lambda/latest/dg/runtimes-context.html#runtimes-lifecycle-shutdown
To publish the metrics, I would suggest utilizing EMF (Embedded Metric Format), or combining multiple data points into a single PutMetricData API call, which accepts an array and thus acts like a batch.
https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_PutMetricData.html
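A minimal sketch of that batching idea (the CloudWatch client is injected so the snippet stays self-contained; in a real Lambda it would be boto3.client("cloudwatch"), the batch size should match the current PutMetricData per-request limit, and the metric names are illustrative):

```python
# Buffer MetricDatum dicts during invocations and flush them in batches,
# e.g. from the shutdown phase of the Lambda lifecycle.

def chunk(data, size):
    """Split the buffered data into PutMetricData-sized batches."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def flush(client, namespace, data, batch_size=20):
    """One API call per batch instead of one per event."""
    for batch in chunk(data, batch_size):
        client.put_metric_data(Namespace=namespace, MetricData=batch)
```

At ~20k events/second you would still want aggregation before flushing (counting per event type in memory and publishing one datum per type), since even batched raw events would be a large call volume.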

Configuring connectors for multiple topics on Kafka Connect Distributed Mode

We have producers that are sending the following to Kafka:
topic=syslog, ~25,000 events per day
topic=nginx, ~5,000 events per day
topic=zeek.xxx.log, ~100,000 events per day (total). In this last case there are 20 distinct zeek topics, such as zeek.conn.log and zeek.http.log
kafka-connect-elasticsearch instances function as consumers to ship data from Kafka to Elasticsearch. The hello-world Sink configuration for kafka-connect-elasticsearch might look like this:
# elasticsearch.properties
name=elasticsearch-sink
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
tasks.max=24
topics=syslog,nginx,zeek.broker.log,zeek.capture_loss.log,zeek.conn.log,zeek.dhcp.log,zeek.dns.log,zeek.files.log,zeek.http.log,zeek.known_services.log,zeek.loaded_scripts.log,zeek.notice.log,zeek.ntp.log,zeek.packet_filtering.log,zeek.software.log,zeek.ssh.log,zeek.ssl.log,zeek.status.log,zeek.stderr.log,zeek.stdout.log,zeek.weird.log,zeek.x509.log
topic.creation.enable=true
key.ignore=true
schema.ignore=true
...
It can be invoked with bin/connect-standalone.sh. I realized that running, or attempting to run, tasks.max=24 when work is performed in a single process is not ideal. I know that using distributed mode would be a better alternative, but I am unclear on the performance-optimal way to submit connectors in distributed mode. Namely,
In distributed mode, would I still want to submit just a single elasticsearch.properties through a single API call? Or would it be best to break up multiple .properties configs + connectors (e.g. one for syslog, one for nginx, one for zeek.**) and submit them separately?
I understand that tasks should equal the number of topics × the number of partitions, but what dictates the number of workers?
Is there anywhere in the documentation that walks through best practices for a situation such as this where there is a noticeable imbalance of throughput for different topics?
In distributed mode, would I still want to submit just a single elasticsearch.properties through a single API call?
It'd be a JSON file, but yes.
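For illustration, the standalone properties above translate into a JSON body POSTed to a worker's REST endpoint, e.g. http://localhost:8083/connectors on the default port (topic list truncated here for brevity):

```json
{
  "name": "elasticsearch-sink",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "tasks.max": "24",
    "topics": "syslog,nginx,zeek.conn.log,zeek.dns.log,zeek.http.log",
    "key.ignore": "true",
    "schema.ignore": "true"
  }
}
```

Note that in distributed mode the top-level `name` moves outside the `config` object, and all config values are strings.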
what dictates the number of workers?
Up to you. JVM usage is one factor that you can monitor and scale on.
Not really; there is no such documentation that I am aware of.

Flink web UI: Monitor Metrics doesn't work

I run Flink 1.9.0 on YARN (2.6.0-cdh5.11.1), but the Flink web UI metrics don't work, as shown below:
I guess you are looking at the wrong metrics. Because no data flows from one task to another (you can see only one box in the UI), there is nothing to show. The metrics you are looking at only show data that flows from one Flink task to another. In your example, everything happens within that single task.
Look at this example:
You can see two tasks sending data to the map-task which emits this data to another task. Therefore you see incoming and outgoing data.
But on the other hand, a source task never has incoming data (I must admit that this is confusing at first glance):
The number of records received is 0, but it sends a couple of records to the downstream task.
Back to your problem: what you can do is have a look at the operator metrics. In the metrics tab (the one at the very right), you can select, besides the task metrics, some operator metrics. These metrics have names like 0.Map.numRecordsIn.
The name is assembled as <slot>.<operatorName>.<metricsName>. But be aware that these metrics are not recorded: you don't have any historic data, and once you leave this tab or remove a metric, the data collected until that point is gone. I would recommend using a proper metrics backend like InfluxDB, Prometheus, or Graphite. You can find a description in the Flink docs.
Hope that helped.
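For example, wiring up the Prometheus reporter is a couple of lines in flink-conf.yaml (the port range is an assumption to avoid clashes between TaskManagers on one host, and the reporter JAR must be on the classpath; see the Flink metrics documentation for the details of your version):

```yaml
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
metrics.reporter.prom.port: 9250-9260
```

With a backend in place, the numRecordsIn/numRecordsOut series are retained and can be graphed historically instead of being lost when you leave the tab.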

Is there a feature for setting Min/Max/Fixed function/action replica in Openwhisk?

I have an Openwhisk setup on Kubernetes using [1]. For some study purpose, I want to have a fixed number of replicas/pods for each action that I deploy, essentially disabling the auto-scaling feature.
Similar facility exists for OpenFaas [2], where during deployment of a function, we can configure the system to have N function replicas at all times. These N function replicas (or pods) for the given function will always be present.
I assume this can be configured somewhere while deploying an action, but being a beginner in OpenWhisk, I could not find a way to do this. Is there a specific configuration that I need to change?
What can I do to achieve this in Openwhisk? Thanks :)
https://github.com/apache/openwhisk-deploy-kube
https://docs.openfaas.com/architecture/autoscaling/#minmax-replicas
OpenWhisk serverless functions follow a model closer to AWS Lambda's: you don't set the number of replicas. OpenWhisk uses various heuristics and can specialize a container in milliseconds, so on-demand elasticity is more practical than in Kubernetes-based solutions. There is no mechanism in the system today to set minimums or maximums. A function gets to scale proportionally to the resources available in the system, and when that capacity is maxed out, requests will queue.
Note that while AWS allows one to set the max concurrency, this isn’t the same as what you’re asking for, which is a fixed number of pre-provisioned resources.
Update to answer your two questions specifically:
Is there a specific configuration that I need to change?
There isn’t. This feature isn’t available at user level or deployment time.
What can I do to achieve this in Openwhisk?
You can modify the implementation in several ways to achieve what you’re after. For example, one model is to extend the stem-cell pool for specific users or functions. If you were interested in doing something like this, the project Apache dev list is a great place to discuss this idea.
