OpenTSDB default metrics - metrics

I'm trying out OpenTSDB. Does it have default metrics, for example CPU or something else? I can create a custom metric and put data into it, but does OpenTSDB provide any metrics out of the box? If it does, how can I turn this feature on?

Check out TCollector. It has a number of built-in metrics.
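As a rough illustration of what such a collector produces: TCollector collectors are small programs that print one data point per line to standard output, in the form "metric timestamp value tag=value". The sketch below only formats such a line; the metric name, value and tag are made up for the example and are not taken from TCollector's built-in collectors.

```python
import time


def format_data_point(metric, timestamp, value, tags):
    """Format one data point as 'metric timestamp value tag1=v1 tag2=v2'."""
    tag_text = " ".join(f"{key}={val}" for key, val in sorted(tags.items()))
    return f"{metric} {int(timestamp)} {value} {tag_text}".rstrip()


# Hypothetical metric name and tag, for illustration only.
line = format_data_point("custom.cpu.load", time.time(), 0.5, {"host": "web01"})
print(line)
```

A real collector would read the value from the system (for example from /proc) and print such lines in a loop.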

Configuring connectors for multiple topics on Kafka Connect Distributed Mode

We have producers that are sending the following to Kafka:
topic=syslog, ~25,000 events per day
topic=nginx, ~5,000 events per day
topic=zeek.xxx.log, ~100,000 events per day (total). In this last case there are 20 distinct zeek topics, such as zeek.conn.log and zeek.http.log
kafka-connect-elasticsearch instances function as consumers to ship data from Kafka to Elasticsearch. The hello-world Sink configuration for kafka-connect-elasticsearch might look like this:
# elasticsearch.properties
name=elasticsearch-sink
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
tasks.max=24
topics=syslog,nginx,zeek.broker.log,zeek.capture_loss.log,zeek.conn.log,zeek.dhcp.log,zeek.dns.log,zeek.files.log,zeek.http.log,zeek.known_services.log,zeek.loaded_scripts.log,zeek.notice.log,zeek.ntp.log,zeek.packet_filtering.log,zeek.software.log,zeek.ssh.log,zeek.ssl.log,zeek.status.log,zeek.stderr.log,zeek.stdout.log,zeek.weird.log,zeek.x509.log
topic.creation.enable=true
key.ignore=true
schema.ignore=true
...
This can be invoked with bin/connect-standalone.sh. I realize that running, or attempting to run, tasks.max=24 when all the work is performed in a single process is not ideal. I know that distributed mode would be a better alternative, but I am unclear on the performance-optimal way to submit connectors to distributed mode. Namely:
In distributed mode, would I still want to submit just a single elasticsearch.properties through a single API call? Or would it be best to break up multiple .properties configs + connectors (e.g. one for syslog, one for nginx, one for zeek.**) and submit them separately?
I understand that the number of tasks should equal the number of topics x the number of partitions, but what dictates the number of workers?
Is there anywhere in the documentation that walks through best practices for a situation such as this where there is a noticeable imbalance of throughput for different topics?
In distributed mode, would I still want to submit just a single elasticsearch.properties through a single API call?
It would be a JSON payload sent to the Connect REST API rather than a .properties file, but yes.
what dictates the number of workers?
That is up to you. JVM usage is one factor that you can monitor and scale on.
As for best practices, there isn't really any documentation that I am aware of.
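To make the JSON point concrete, here is a minimal sketch that converts the .properties text from the question into the body accepted by the Kafka Connect REST API (POST /connectors, which takes a "name" and a "config" map). The helper name is made up for this example, and the topic list is shortened.

```python
import json


def properties_to_connector_payload(properties_text):
    """Turn key=value lines into the {"name": ..., "config": {...}} request body."""
    config = {}
    for raw_line in properties_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    name = config.pop("name")  # the connector name moves to the top level
    return {"name": name, "config": config}


example = """\
# elasticsearch.properties
name=elasticsearch-sink
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
tasks.max=24
topics=syslog,nginx
key.ignore=true
schema.ignore=true
"""

payload = properties_to_connector_payload(example)
print(json.dumps(payload, indent=2))
```

The printed JSON could then be sent to a worker, for example with curl -X POST -H "Content-Type: application/json" against http://localhost:8083/connectors, assuming the worker listens on the default REST port on localhost.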

Is there a feature for setting Min/Max/Fixed function/action replicas in OpenWhisk?

I have an OpenWhisk setup on Kubernetes using [1]. For a study, I want to have a fixed number of replicas/pods for each action that I deploy, essentially disabling the auto-scaling feature.
A similar facility exists for OpenFaaS [2], where during deployment of a function we can configure the system to have N function replicas at all times. These N function replicas (or pods) for the given function will always be present.
I assume this can be configured somewhere while deploying an action, but being a beginner in OpenWhisk, I could not find a way to do this. Is there a specific configuration that I need to change?
What can I do to achieve this in OpenWhisk? Thanks :)
[1] https://github.com/apache/openwhisk-deploy-kube
[2] https://docs.openfaas.com/architecture/autoscaling/#minmax-replicas
OpenWhisk serverless functions follow a model closer to AWS Lambda: you don’t set the number of replicas. OpenWhisk uses various heuristics and can specialize a container in milliseconds, so elasticity on demand is more practical than with Kubernetes-based solutions. There is no mechanism in the system today to set minimums or maximums. A function scales in proportion to the resources available in the system, and when that capacity is maxed out, requests will queue.
Note that while AWS allows one to set the max concurrency, this isn’t the same as what you’re asking for, which is a fixed number of pre-provisioned resources.
Update to answer your two questions specifically:
Is there a specific configuration that I need to change?
There isn’t. This feature isn’t available at user level or deployment time.
What can I do to achieve this in OpenWhisk?
You can modify the implementation in several ways to achieve what you’re after. For example, one model is to extend the stem-cell pool for specific users or functions. If you are interested in doing something like this, the project’s Apache dev list is a great place to discuss the idea.

How to define/send derived metrics in addition to built-in Aerospike metrics

I'm trying to ship Aerospike metrics to another node using some available methods, e.g., collectd.
For example, given two of the Aerospike monitoring metrics, say X and Y, how can I define and send a derived metric like Z = X+Y or X/Y?
We could calculate it on the receiver side, but that degrades the overall performance of our application. I would appreciate your guidance.
Thanks.
It can't be done within the Aerospike collectd plugin, as the metrics are more or less shipped immediately once they are read. There's no variable that saves the metrics that have been shipped.
If you can use the Graphite plugin, it keeps track of all gathered metrics and then sends them once at the very end. You can add another stanza for your calculated metrics right before the nmsg line. You'll have to search through the msg[] array for your source metrics.
The Nagios plugin is a very different method. It's a single metric pull, so a wrapper script would be needed to run the plugin for each operand, and run the calculation in the wrapper.
Or you can supplement existing plugins with your own script(s) just for derived metrics. All of our monitoring plugins utilize the Aerospike Info Protocol and you can use asinfo to gather metrics for your operands similar to the previous Nagios method.
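A minimal sketch of such a supplementary script, assuming asinfo is on the PATH and returns its statistics as semicolon-separated key=value pairs. The metric names metric_x and metric_y are placeholders for your two operands, not real Aerospike statistics, and shipping the result to the receiver is left out.

```python
import subprocess


def parse_info(response):
    """Parse 'k1=v1;k2=v2' info text into a dict of strings."""
    stats = {}
    for pair in response.strip().split(";"):
        key, sep, value = pair.partition("=")
        if sep:  # ignore fragments that are not key=value
            stats[key] = value
    return stats


def derived_metrics(stats, x_name, y_name):
    """Compute Z = X + Y and Z = X / Y from two gathered metrics."""
    x = float(stats[x_name])
    y = float(stats[y_name])
    return {"sum": x + y, "ratio": x / y if y != 0 else None}


def fetch_statistics(host="127.0.0.1"):
    """Run asinfo against one node (assumption: default port, no auth)."""
    result = subprocess.run(
        ["asinfo", "-h", host, "-v", "statistics"],
        capture_output=True, text=True, check=True,
    )
    return parse_info(result.stdout)


# Offline example with a made-up response instead of a live node.
sample = "metric_x=30;metric_y=10;cluster_size=3"
print(derived_metrics(parse_info(sample), "metric_x", "metric_y"))
```

Against a live node, derived_metrics(fetch_statistics(), ...) would replace the sample string.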

Connecting NiFi to ElasticSearch

I'm trying to solve one task and will appreciate any help - links to documentation, or links to forums, or other FAQs besides https://cwiki.apache.org/confluence/display/NIFI/FAQs, or any meaningful answer in this post =) .
So, I have the following task:
The initial part of my system collects data every 5-15 min from different DB sources. Then I remove duplicates, remove junk, combine data from different sources according to some logic, and redirect it to the second part of the system as several streams.
As far as I know, "NiFi" can do this task in the best way =).
Currently I can successfully get information from InfluxDB with the "GetHTTP" processor. However, I can't configure the same kind of processor to get information from Elasticsearch with all the necessary options. I'd like to receive data every 5-15 minutes (depending on the scheduler period) for the time period from "now-minus-<5-15 minutes>" to "now", with several additional filters. If I understand it right, this can be achieved either by subscribing to "_index" or by regular requests to the DB at the desired interval.
I know that NiFi has several specific processors designed for Elasticsearch (FetchElasticsearch5, FetchElasticsearchHttp, QueryElasticsearchHttp, ScrollElasticsearchHttp) as well as the GetHTTP and PostHTTP processors. Unfortunately, I lack information, or even better, examples, on how to configure their "Properties" for my purposes =(.
What's the difference between FetchElasticsearchHttp, QueryElasticsearchHttp? Which one fits better for my task? What's the difference between GetHTTP and QueryElasticsearchHttp besides several specific fields? Will GetHTTP perform the same way if I tune it as I need?
Any advice?
I will be grateful for any help.
The ElasticsearchHttp processors try to make it easier to interact with ES by generating the appropriate REST API call based on the properties you set. If you know the full URL you need, you could use GetHTTP or InvokeHTTP. However, the ElasticsearchHttp processors let you put in just the parts you're looking for, and they will generate the URL and return the results.
FetchElasticsearch (and its variants) is used to get a particular document when you know the identifier. This is sometimes used after a search/query, to return documents one at a time after you know which ones you want.
QueryElasticsearchHttp is for when you want to do a Lucene-style query of the documents, when you don't necessarily know which documents you want. It will only return up to the value of index.max_result_window for that index. To get more records, you can use ScrollElasticsearchHttp afterwards. NOTE: QueryElasticsearchHttp expects a query that will work as the "q" parameter of the URL. This "mini-language" does not support all fields/operators (see here for more details).
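For reference, a query passed as the "q" parameter corresponds to an Elasticsearch URI search. The sketch below builds such a URL by hand; the host, index name and field names are hypothetical, and the exact URL that the processor generates internally may differ.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_uri_search_url(base_url, index, query, size=10):
    """Build an Elasticsearch URI search URL with a Lucene-style 'q' parameter."""
    params = urlencode({"q": query, "size": size})
    return f"{base_url}/{index}/_search?{params}"


# Hypothetical host, index and fields, for illustration only.
url = build_uri_search_url(
    "http://localhost:9200", "my-index", "status:error AND host:web01", size=100
)
print(url)
```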
For your use case, you likely need InvokeHTTP in order to issue the kind of query you describe. This article describes how to issue a query for the last 15 minutes. Once your results are returned, you might need some combination of EvaluateJsonPath and/or SplitJson to work with the individual documents; see the Elasticsearch REST API documentation (and NiFi processor documentation) for more details.
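As a sketch of the request body such a call could carry, the function below builds an Elasticsearch query DSL document that selects the last N minutes using date math ("now-15m") plus optional exact-match filters. The timestamp field name "@timestamp" and the filter field "level" are assumptions; substitute the fields of your own index.

```python
import json


def last_minutes_query(minutes, timestamp_field="@timestamp", filters=None):
    """Build a bool/filter query for documents from the last `minutes` minutes."""
    clauses = [
        {"range": {timestamp_field: {"gte": f"now-{minutes}m", "lte": "now"}}}
    ]
    # Each additional filter becomes an exact-match term clause.
    for field, value in (filters or {}).items():
        clauses.append({"term": {field: value}})
    return {"query": {"bool": {"filter": clauses}}}


# Hypothetical filter field, for illustration only.
body = last_minutes_query(15, filters={"level": "error"})
print(json.dumps(body, indent=2))
```

The printed JSON would be sent as the body of a request to the index's _search endpoint; matching the minutes value to the processor's scheduling period avoids gaps between runs.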

Logstash/not logstash for kafka-elasticsearch integration?

I read that Elasticsearch rivers/river plugins are deprecated, so we cannot have a direct Elasticsearch-Kafka integration. If we want one, we need some Java (or other language) layer in between that puts the data from Kafka into Elasticsearch using its APIs.
On the other hand, if we use Kafka-Logstash-Elasticsearch, then we get rid of that middle layer and achieve the same thing through Logstash with just configuration. But I am not sure whether having Logstash in between is an overhead or not.
And is my understanding right?
Thanks in advance for the inputs.
Regards,
Priya
Your question is quite general. It would be good to understand your architecture, its purpose and assumptions you made.
Kafka, as stated in its documentation, is a massively scalable publish-subscribe messaging system. My assumption would be that you use it as a data broker in your architecture.
Elasticsearch on the other hand, is a search engine, hence I assume that you use it as a data access/searching/aggregation layer.
These two separate systems require connectors to create a proper data-pipeline. That's where Logstash comes in. It allows you to create data streaming connection between, in your case, Kafka and Elasticsearch. It also allows you to mutate the data on the fly, depending on your needs.
Ideally, Kafka carries raw data events. Elasticsearch stores documents which are useful to your data consumers (web or mobile application, other systems etc.), so they can be quite different from the raw data format. If you need to modify the data between its raw form and the ES document, that's where Logstash might be handy (see the filters stage).
Another approach could be to use Kafka Connect connectors, or to build custom tools, e.g. based on Kafka Streams or plain consumers, but it really depends on the concepts of your architecture: purpose, stack, data requirements and more.
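To give a feel for what a custom middle layer would have to do, here is a minimal sketch of its transformation step only: it reshapes raw JSON events into documents and formats them as a newline-delimited body for the Elasticsearch bulk API. Reading from Kafka and sending over HTTP are left out, and the event fields (msg, host) and the index name are hypothetical.

```python
import json


def to_bulk_body(raw_events, index_name):
    """Convert raw JSON event strings into an Elasticsearch bulk (NDJSON) body."""
    lines = []
    for raw in raw_events:
        event = json.loads(raw)
        # Reshape the raw event into the document the consumers need.
        doc = {"message": event["msg"], "host": event.get("host", "unknown")}
        lines.append(json.dumps({"index": {"_index": index_name}}))  # action line
        lines.append(json.dumps(doc))  # source line
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


# Hypothetical raw events, as they might arrive from a Kafka topic.
events = ['{"msg": "login ok", "host": "web01"}', '{"msg": "disk full"}']
print(to_bulk_body(events, "app-logs"))
```

With Logstash, this reshaping would live in the filter stage of the pipeline configuration instead of in code.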
